Cainvas

Malicious URL Detection

Credit: AITS Cainvas Community

Photo by MOWE on Dribbble

A malicious website is a site that attempts to install malware (a general term for anything that disrupts computer operation, gathers your personal information, or, in the worst case, gains total access to your machine) onto your device. It is therefore important to detect malicious websites and URLs automatically; this notebook trains a neural network to do so from lexical features of the URL.

Importing the necessary libraries

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint

Importing the Dataset

In [2]:
urldata = pd.read_csv("https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/urldata.csv")

Data Analysis

In [3]:
urldata.head()
Out[3]:
url label result url_length hostname_length path_length fd_length count- count@ count? ... count. count= count-http count-https count-www count-digits count-letters count_dir use_of_ip short_url
0 https://www.google.com benign 0 22 14 0 0 0 0 0 ... 2 0 1 1 1 0 17 0 1 1
1 https://www.youtube.com benign 0 23 15 0 0 0 0 0 ... 2 0 1 1 1 0 18 0 1 1
2 https://www.facebook.com benign 0 24 16 0 0 0 0 0 ... 2 0 1 1 1 0 19 0 1 1
3 https://www.baidu.com benign 0 21 13 0 0 0 0 0 ... 2 0 1 1 1 0 16 0 1 1
4 https://www.wikipedia.org benign 0 25 17 0 0 0 0 0 ... 2 0 1 1 1 0 20 0 1 1

5 rows × 21 columns

In [4]:
urldata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450176 entries, 0 to 450175
Data columns (total 21 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   url              450176 non-null  object
 1   label            450176 non-null  object
 2   result           450176 non-null  int64 
 3   url_length       450176 non-null  int64 
 4   hostname_length  450176 non-null  int64 
 5   path_length      450176 non-null  int64 
 6   fd_length        450176 non-null  int64 
 7   count-           450176 non-null  int64 
 8   count@           450176 non-null  int64 
 9   count?           450176 non-null  int64 
 10  count%           450176 non-null  int64 
 11  count.           450176 non-null  int64 
 12  count=           450176 non-null  int64 
 13  count-http       450176 non-null  int64 
 14  count-https      450176 non-null  int64 
 15  count-www        450176 non-null  int64 
 16  count-digits     450176 non-null  int64 
 17  count-letters    450176 non-null  int64 
 18  count_dir        450176 non-null  int64 
 19  use_of_ip        450176 non-null  int64 
 20  short_url        450176 non-null  int64 
dtypes: int64(19), object(2)
memory usage: 72.1+ MB

Checking Missing Values

In [5]:
urldata.isnull().sum()
Out[5]:
url                0
label              0
result             0
url_length         0
hostname_length    0
path_length        0
fd_length          0
count-             0
count@             0
count?             0
count%             0
count.             0
count=             0
count-http         0
count-https        0
count-www          0
count-digits       0
count-letters      0
count_dir          0
use_of_ip          0
short_url          0
dtype: int64

There are no missing values, so no imputation is needed. The following features have already been extracted from each URL and stored in the CSV file (a sketch of how such features can be computed from a raw URL follows the list).

  1. Length Features
    • Length Of URL
    • Length Of Hostname
    • Length Of Path
    • Length Of First Directory

  2. Count Features
    • Count Of '-'
    • Count Of '@'
    • Count Of '?'
    • Count Of '%'
    • Count Of '.'
    • Count Of '='
    • Count Of 'http'
    • Count Of 'https'
    • Count Of 'www'
    • Count Of Digits
    • Count Of Letters
    • Count Of Directories

  3. Binary Features
    • Use of IP Address or not
    • Use of URL Shortening Service or not
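
For reference, here is a minimal sketch of how such features can be computed from a raw URL with Python's standard library. It is an illustration, not the exact script that produced the CSV; the two binary flags are omitted because they need extra regex checks for IP-address hostnames and a list of known shortener domains.

from urllib.parse import urlparse

def extract_features(url):
    parsed = urlparse(url)
    # first directory = first non-empty path segment, if any
    segments = [s for s in parsed.path.split('/') if s]
    return {
        'url_length': len(url),
        'hostname_length': len(parsed.netloc),
        'path_length': len(parsed.path),
        'fd_length': len(segments[0]) if segments else 0,
        'count-': url.count('-'),
        'count@': url.count('@'),
        'count?': url.count('?'),
        'count%': url.count('%'),
        'count.': url.count('.'),
        'count=': url.count('='),
        'count-http': url.count('http'),
        'count-https': url.count('https'),
        'count-www': url.count('www'),
        'count-digits': sum(c.isdigit() for c in url),
        'count-letters': sum(c.isalpha() for c in url),
        'count_dir': parsed.path.count('/'),
    }

extract_features('https://www.google.com')

For https://www.google.com this reproduces the corresponding values in the first row of the table above (url_length 22, hostname_length 14, count. 2, and so on).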

Data Visualization

In [6]:
plt.figure(figsize=(15,5))
sns.countplot(x='label',data=urldata)
plt.title("Count Of URLs",fontsize=20)
plt.xlabel("Type Of URLs",fontsize=18)
plt.ylabel("Number Of URLs",fontsize=18)
Out[6]:
Text(0, 0.5, 'Number Of URLs')
In [7]:
print("Percent Of Malicious URLs:{:.2f} %".format(len(urldata[urldata['label']=='malicious'])/len(urldata['label'])*100))
print("Percent Of Benign URLs:{:.2f} %".format(len(urldata[urldata['label']=='benign'])/len(urldata['label'])*100))
Percent Of Malicious URLs:23.20 %
Percent Of Benign URLs:76.80 %

The dataset is imbalanced: benign URLs outnumber malicious ones by roughly 3 to 1. This is addressed below by oversampling the minority class with SMOTE.
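
The same breakdown can be read directly from pandas; a one-line equivalent of the two prints above:

# class proportions as percentages
urldata['label'].value_counts(normalize=True) * 100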

Creating X (features) and y (label) for model training

Note that url_length and short_url are left out of the feature set below, so the model trains on 16 of the 18 extracted features.

In [8]:
#Independent Variables
x = urldata[['hostname_length',
       'path_length', 'fd_length', 'count-', 'count@', 'count?',
       'count%', 'count.', 'count=', 'count-http','count-https', 'count-www', 'count-digits',
       'count-letters', 'count_dir', 'use_of_ip']]

#Dependent Variable
y = urldata['result']

Oversampling and train-test split

In [9]:
# Oversampling with SMOTE, which synthesizes new minority-class
# samples by interpolating between existing ones
from imblearn.over_sampling import SMOTE

# fit_resample is the current imbalanced-learn API (fit_sample has been removed)
x_sample, y_sample = SMOTE().fit_resample(x, y.values.ravel())

x_sample = pd.DataFrame(x_sample)
y_sample = pd.DataFrame(y_sample)

# checking the sizes of the sample data
print("Size of x-sample :", x_sample.shape)
print("Size of y-sample :", y_sample.shape)
Size of x-sample : (691476, 16)
Size of y-sample : (691476, 1)
In [10]:
#Train test split
x_train, x_test, y_train, y_test = train_test_split(x_sample, y_sample, test_size = 0.2)
print("Shape of x_train: ", x_train.shape)
print("Shape of x_valid: ", x_test.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of y_valid: ", y_test.shape)
Shape of x_train:  (553180, 16)
Shape of x_valid:  (138296, 16)
Shape of y_train:  (553180, 1)
Shape of y_valid:  (138296, 1)
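
Note that the split above is neither seeded nor stratified, so the exact counts and scores vary from run to run. An optional variant that fixes both, if reproducibility matters (random_state=42 is an arbitrary choice, not part of the original run):

# reproducible split that preserves the (now balanced) class ratio
x_train, x_test, y_train, y_test = train_test_split(
    x_sample, y_sample, test_size=0.2, stratify=y_sample, random_state=42)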

Model Architecture

In [11]:
model = Sequential()
model.add(Dense(32, activation = 'relu', input_shape = (16, )))

model.add(Dense(16, activation='relu'))

model.add(Dense(8, activation='relu')) 

model.add(Dense(1, activation='sigmoid')) 
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 32)                544       
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 8)                 136       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 9         
=================================================================
Total params: 1,217
Trainable params: 1,217
Non-trainable params: 0
_________________________________________________________________
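
As a sanity check on the summary, each Dense layer contributes (inputs + 1) × units parameters, the +1 being the bias: (16+1)×32 = 544, (32+1)×16 = 528, (16+1)×8 = 136, and (8+1)×1 = 9, which sums to the 1,217 total reported above.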

Model Training

In [12]:
opt = keras.optimizers.Adam(learning_rate=0.0001)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['acc'])
In [13]:
checkpointer = ModelCheckpoint('url.h5', monitor='val_acc', mode='max', verbose=2, save_best_only=True)
history=model.fit(x_train, y_train, batch_size=256, epochs=5, validation_data=(x_test, y_test), callbacks=[checkpointer])
Epoch 1/5
2129/2161 [============================>.] - ETA: 0s - loss: 0.5004 - acc: 0.8289
Epoch 00001: val_acc improved from -inf to 0.96910, saving model to url.h5
2161/2161 [==============================] - 4s 2ms/step - loss: 0.4950 - acc: 0.8310 - val_loss: 0.1405 - val_acc: 0.9691
Epoch 2/5
2137/2161 [============================>.] - ETA: 0s - loss: 0.0710 - acc: 0.9858
Epoch 00002: val_acc improved from 0.96910 to 0.99266, saving model to url.h5
2161/2161 [==============================] - 4s 2ms/step - loss: 0.0707 - acc: 0.9859 - val_loss: 0.0384 - val_acc: 0.9927
Epoch 3/5
2136/2161 [============================>.] - ETA: 0s - loss: 0.0287 - acc: 0.9947
Epoch 00003: val_acc improved from 0.99266 to 0.99603, saving model to url.h5
2161/2161 [==============================] - 4s 2ms/step - loss: 0.0286 - acc: 0.9948 - val_loss: 0.0236 - val_acc: 0.9960
Epoch 4/5
2139/2161 [============================>.] - ETA: 0s - loss: 0.0205 - acc: 0.9962
Epoch 00004: val_acc did not improve from 0.99603
2161/2161 [==============================] - 4s 2ms/step - loss: 0.0205 - acc: 0.9961 - val_loss: 0.0200 - val_acc: 0.9959
Epoch 5/5
2160/2161 [============================>.] - ETA: 0s - loss: 0.0182 - acc: 0.9963
Epoch 00005: val_acc improved from 0.99603 to 0.99627, saving model to url.h5
2161/2161 [==============================] - 4s 2ms/step - loss: 0.0182 - acc: 0.9963 - val_loss: 0.0182 - val_acc: 0.9963

Training Plots

In [14]:
# plot the training artifacts
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train_acc','val_acc'], loc = 'upper right')
plt.show()
In [15]:
# plot the training artifacts

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train_loss','val_loss'], loc = 'upper right')
plt.show()

Assessing the Model's Performance

In [16]:
# predicting on test data.
pred_test = model.predict(x_test)
for i in range (len(pred_test)):
    if (pred_test[i] < 0.5):
        pred_test[i] = 0
    else:
        pred_test[i] = 1
pred_test = pred_test.astype(int)
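
The loop above works, but the same 0.5 thresholding can be written as one vectorized expression:

# equivalent one-liner: threshold the sigmoid outputs at 0.5
pred_test = (model.predict(x_test) > 0.5).astype(int)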
In [17]:
def view_result(array):
    # map 0/1 labels to human-readable names
    array = np.array(array)
    for i in range(len(array)):
        if array[i] == 0:
            print("Non Malicious")
        else:
            print("Malicious")
In [18]:
view_result(pred_test[:10])
Malicious
Non Malicious
Non Malicious
Non Malicious
Non Malicious
Non Malicious
Non Malicious
Malicious
Malicious
Non Malicious
In [19]:
view_result(y_test[:10])
Malicious
Non Malicious
Non Malicious
Non Malicious
Non Malicious
Non Malicious
Non Malicious
Malicious
Malicious
Non Malicious
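
Spot-checking ten predictions is reassuring but not a measurement. The sklearn metrics imported at the top of the notebook quantify performance over the whole test set; a minimal sketch:

# overall accuracy, confusion matrix, and per-class precision/recall
print("Accuracy:", accuracy_score(y_test, pred_test))
print(confusion_matrix(y_test, pred_test))
print(classification_report(y_test, pred_test))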

Compiling the model with deepC

deepC converts the trained Keras model to ONNX, generates C++ from the ONNX graph, and compiles it into a self-contained executable, so the classifier can run on edge devices without a Python runtime.

In [20]:
!deepCC url.h5
[INFO]
Reading [keras model] 'url.h5'
[SUCCESS]
Saved 'url.onnx'
[INFO]
Reading [onnx model] 'url.onnx'
[INFO]
Model info:
  ir_vesion : 4
  doc       : 
[WARNING]
[ONNX]: terminal (input/output) dense_input's shape is less than 1. Changing it to 1.
[WARNING]
[ONNX]: terminal (input/output) dense_3's shape is less than 1. Changing it to 1.
WARN (GRAPH): found operator node with the same name (dense_3) as io node.
[INFO]
Running DNNC graph sanity check ...
[SUCCESS]
Passed sanity check.
[INFO]
Writing C++ file 'url_deepC/url.cpp'
[INFO]
deepSea model files are ready in 'url_deepC/' 
[RUNNING COMMAND]
g++ -std=c++11 -O3 -fno-rtti -fno-exceptions -I. -I/opt/tljh/user/lib/python3.7/site-packages/deepC-0.13-py3.7-linux-x86_64.egg/deepC/include -isystem /opt/tljh/user/lib/python3.7/site-packages/deepC-0.13-py3.7-linux-x86_64.egg/deepC/packages/eigen-eigen-323c052e1731 url_deepC/url.cpp -o url_deepC/url.exe
[RUNNING COMMAND]
size "url_deepC/url.exe"
   text	   data	    bss	    dec	    hex	filename
 118834	   7904	    760	 127498	  1f20a	url_deepC/url.exe
[SUCCESS]
Saved model as executable "url_deepC/url.exe"