Encroachment Detection System Based on Network Anomalies¶
Credit: AITS Cainvas Community
Photo by Evgenia Eiter on Dribbble
- Network Encroachment Detection Systems (NEDS) are set up at a planned point within the network to examine traffic from all devices on it. A NEDS observes the traffic passing on the entire subnet and matches it against a collection of known attack signatures; when a known attack or abnormal behavior is detected, an alert is sent to the administrator. A minimal sketch of this detect-and-alert flow appears below.
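The sketch below is purely illustrative: the signature set, the score_anomaly helper, and the 0.5 threshold are assumptions made for this example, not part of the dataset or the model built later in this notebook.

# Hypothetical sketch of a NEDS detect-and-alert loop (illustrative only)
def inspect_packet(packet, known_signatures, score_anomaly, threshold=0.5):
    # signature check: does the packet match a known attack pattern?
    if packet["signature"] in known_signatures:
        return "ALERT: known attack"
    # anomaly check: does a learned model flag the traffic as abnormal?
    if score_anomaly(packet["features"]) > threshold:
        return "ALERT: abnormal behavior"
    return "ok"

# example usage with dummy values
known_signatures = {"syn-flood", "port-scan"}
packet = {"signature": "syn-flood", "features": [0.1, 0.9]}
print(inspect_packet(packet, known_signatures, lambda f: 0.0))  # -> ALERT: known attack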
Setup: Importing necessary libraries¶
In [1]:
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn
import imblearn
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
# Importing the Keras libraries and packages
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
Downloading and Unzipping Dataset¶
In [2]:
!wget -N "https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/Encroachment.zip"
!unzip -o "Encroachment.zip"
!rm "Encroachment.zip"
Understanding and visualizing the data¶
In [3]:
train = pd.read_csv("Train_data.csv")
test = pd.read_csv("Test_data.csv")
print(train.head(4))
print("Training data has {} rows & {} columns".format(train.shape[0],train.shape[1]))
print(test.head(4))
print("Testing data has {} rows & {} columns".format(test.shape[0],test.shape[1]))
In [4]:
# Exploratory Analysis
# Descriptive statistics (wrapped in print so it displays mid-cell)
print(train.describe())
print(train['num_outbound_cmds'].value_counts())
print(test['num_outbound_cmds'].value_counts())
# 'num_outbound_cmds' holds a single constant value in both sets, so it
# carries no information; drop it from both train & test datasets
train.drop(['num_outbound_cmds'], axis=1, inplace=True)
test.drop(['num_outbound_cmds'], axis=1, inplace=True)
# Attack class distribution
train['class'].value_counts()
In [5]:
train[train['class'] == 'anomaly']
In [6]:
plt.figure(figsize = (6,5))
sns.countplot(x='class', data=train, color="orange")
plt.show()
In [7]:
train.hist(figsize=(30,30))
plt.show()
Data Pre-Processing¶
In [8]:
# Scaling numerical attributes
scaler = StandardScaler()
# extract numerical attributes and scale them to zero mean and unit variance;
# fit the scaler on the training data only, then reuse it on the test data
# so no information leaks from the test set into the preprocessing
cols = train.select_dtypes(include=['float64','int64']).columns
sc_train = scaler.fit_transform(train.select_dtypes(include=['float64','int64']))
sc_test = scaler.transform(test.select_dtypes(include=['float64','int64']))
# turn the results back into dataframes
sc_traindf = pd.DataFrame(sc_train, columns = cols)
sc_testdf = pd.DataFrame(sc_test, columns = cols)
In [9]:
# Encoding categorical attributes
# extract categorical attributes from both training and test sets
cattrain = train.select_dtypes(include=['object']).copy()
cattest = test.select_dtypes(include=['object']).copy()
# fit one LabelEncoder per column on the union of train and test values so
# the same category maps to the same integer in both sets (fitting separate
# encoders on each set can produce inconsistent codes)
for col in cattest.columns:
    encoder = LabelEncoder()
    encoder.fit(pd.concat([cattrain[col], cattest[col]]))
    cattrain[col] = encoder.transform(cattrain[col])
    cattest[col] = encoder.transform(cattest[col])
# the target column 'class' exists only in the training set
cattrain['class'] = LabelEncoder().fit_transform(cattrain['class'])
traincat = cattrain
testcat = cattest
# separate target column from encoded data
enctrain = traincat.drop(['class'], axis=1)
cat_Ytrain = traincat[['class']].copy()
In [10]:
# Union of processed numerical and categorical data
train_x = pd.concat([sc_traindf, enctrain], axis=1)
train_y = cat_Ytrain
print(train_x.shape)
test_df = pd.concat([sc_testdf, testcat], axis=1)
test_df.shape
In [11]:
train_y
Feature Selection¶
In [12]:
# Feature Selection
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
# fit a random forest classifier on the training set
# (ravel the single-column target to a 1-D array)
rfc.fit(train_x, train_y.values.ravel())
# extract feature importances
score = np.round(rfc.feature_importances_, 3)
importances = pd.DataFrame({'feature': train_x.columns, 'importance': score})
importances = importances.sort_values('importance', ascending=False).set_index('feature')
# plot importances
plt.rcParams['figure.figsize'] = (11, 4)
importances.plot.bar()
plt.show()
In [13]:
# Recursive feature elimination
from sklearn.feature_selection import RFE
import itertools
rfc = RandomForestClassifier()
# create the RFE model and select the 15 most relevant attributes
rfe = RFE(rfc, n_features_to_select=15)
rfe = rfe.fit(train_x, train_y.values.ravel())
# summarize the selection of the attributes
feature_map = [(i, v) for i, v in itertools.zip_longest(rfe.get_support(), train_x.columns)]
selected_features = [v for i, v in feature_map if i]
selected_features
In [14]:
# boolean mask of the features selected by RFE
a = [i[0] for i in feature_map]
# keep only the selected columns in both datasets
train_x = train_x.iloc[:, a]
test_df = test_df.iloc[:, a]
Network Building¶
In [15]:
#Dataset Partition
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(train_x,train_y,train_size=0.70, random_state=2)
# Initialising the ANN
classifier = Sequential()
# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 8, kernel_initializer = 'uniform', activation = 'relu', input_dim = 15))
# Adding the second hidden layer
classifier.add(Dense(units = 8, kernel_initializer = 'uniform', activation = 'relu'))
# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
classifier.summary()
In [16]:
x_val = X_train[-5000:]
y_val = Y_train[-5000:]
X_train = X_train[:-5000]
Y_train = Y_train[:-5000]
In [17]:
x_val
Training¶
In [18]:
# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
# Fitting the ANN to the Training set
history = classifier.fit(X_train, Y_train, epochs = 20, validation_data = (x_val, y_val), verbose=1)
In [19]:
fig, ax1 = plt.subplots(figsize= (10, 5))
plt.plot(history.history["accuracy"])
plt.plot(history.history["val_accuracy"])
plt.title("Model accuracy")
plt.ylabel("Accuracy")
plt.xlabel("Epoch")
plt.legend(["Train", "Validation"], loc = "upper left")
plt.show()
In [20]:
fig, ax1 = plt.subplots(figsize= (10, 5))
plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])
plt.title("Model loss")
plt.ylabel("Loss")
plt.xlabel("Epoch")
plt.legend(["Train", "Validation"], loc = "upper left")
plt.show()
In [21]:
Y_train
Evaluation¶
In [22]:
from sklearn.metrics import confusion_matrix
# predict_classes was removed in recent TensorFlow versions;
# threshold the sigmoid output at 0.5 instead
ann_predictions = (classifier.predict(X_test) > 0.5).astype("int32")
cm = confusion_matrix(Y_test, ann_predictions)
sns.heatmap(cm, annot=True, fmt="d", cbar=False)
plt.title("ANN Confusion Matrix")
plt.show()
In [23]:
yhat_train = (classifier.predict(X_train) > 0.5)
yhat_test = (classifier.predict(X_test) > 0.5)
In [25]:
# Validate the model on the held-out split
# (the confusion matrix was already computed and plotted above)
from sklearn import metrics
accuracy = metrics.accuracy_score(Y_test, yhat_test)
classification = metrics.classification_report(Y_test, yhat_test)
print()
print('============================== ANN Model Test Results ==============================')
print()
print("Model Accuracy:" "\n", accuracy)
print()
print("Classification report:" "\n", classification)
print()
In [26]:
# Predicting on the unlabeled test data (test_df)
pred_ann = classifier.predict(test_df)
In [27]:
# threshold the sigmoid outputs at 0.5 to obtain hard 0/1 labels
# (the original in-place masking left values of exactly 0.5 unchanged)
pred_ann = (pred_ann > 0.5).astype(int)
In [28]:
pred_ann
In [29]:
# LabelEncoder assigned codes alphabetically: 'anomaly' -> 0, 'normal' -> 1
for x in pred_ann[:100]:
    if x == 0:
        print("Anomaly")
    else:
        print("Normal")
In [30]:
classifier.save('EDS.h5')
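As a quick sanity check, the saved model can be reloaded and used for inference. This is an added sketch, not part of the original workflow; it assumes the 'EDS.h5' file written above and reuses test_df from the earlier cells.

# minimal sketch: reload the saved model and predict on a few test rows
from tensorflow.keras.models import load_model

reloaded = load_model('EDS.h5')
sample_preds = (reloaded.predict(test_df[:5]) > 0.5).astype(int)
print(sample_preds)  # 0 = anomaly, 1 = normal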
DeepCC¶
deepCC, Cainvas's deep learning compiler, converts the saved Keras model into optimized code that can be deployed on edge devices and microcontrollers.
In [31]:
!deepCC EDS.h5