A malicious website is a site that attempts to install malware (a general term for anything that disrupts computer operation, gathers your personal information or, in a worst-case scenario, gains total access to your machine) onto your device. Detecting malicious websites and URLs automatically is therefore an important security task.
Importing the necessary libraries¶
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
Importing The dataset¶
In [2]:
urldata = pd.read_csv("https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/urldata.csv")
Data Analysis¶
In [3]:
urldata.head()
Out[3]:
In [4]:
urldata.info()
Checking Missing Values¶
In [5]:
urldata.isnull().sum()
Out[5]:
The following features are already extracted from the URL for classification and stored in the csv file.
- Length Features
  - Length Of URL
  - Length Of Hostname
  - Length Of Path
  - Length Of First Directory
  - Length Of Top Level Domain
- Count Features
  - Count Of '-'
  - Count Of '@'
  - Count Of '?'
  - Count Of '%'
  - Count Of '.'
  - Count Of '='
  - Count Of 'http'
  - Count Of 'www'
  - Count Of Digits
  - Count Of Letters
  - Count Of Number Of Directories
- Binary Features
  - Use Of IP Or Not
  - Use Of Shortening URL Or Not
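These features arrive precomputed in the csv file, but a minimal sketch of how a few of them could be derived from a raw URL, using Python's standard `urllib.parse` (the function name `extract_features` is illustrative, not part of the dataset's pipeline):

```python
from urllib.parse import urlparse

def extract_features(url):
    """Derive a few of the listed features from a raw URL (illustrative)."""
    parsed = urlparse(url)
    path = parsed.path
    # First-directory length: the length of the first path segment, if any
    segments = [s for s in path.split('/') if s]
    fd_length = len(segments[0]) if segments else 0
    return {
        'url_length': len(url),
        'hostname_length': len(parsed.netloc),
        'path_length': len(path),
        'fd_length': fd_length,
        'count-': url.count('-'),
        'count@': url.count('@'),
        'count.': url.count('.'),
        'count_dir': path.count('/'),
        'count-digits': sum(c.isdigit() for c in url),
        'count-letters': sum(c.isalpha() for c in url),
    }

feats = extract_features('http://example-site.com/login/verify?id=123')
print(feats['hostname_length'], feats['fd_length'], feats['count-digits'])  # 16 5 3
```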
Data Visualization¶
In [6]:
plt.figure(figsize=(15,5))
sns.countplot(x='label',data=urldata)
plt.title("Count Of URLs",fontsize=20)
plt.xlabel("Type Of URLs",fontsize=18)
plt.ylabel("Number Of URLs",fontsize=18)
Out[6]:
In [7]:
print("Percent Of Malicious URLs: {:.2f} %".format(len(urldata[urldata['label']=='malicious'])/len(urldata['label'])*100))
print("Percent Of Benign URLs: {:.2f} %".format(len(urldata[urldata['label']=='benign'])/len(urldata['label'])*100))
The data shows some degree of class imbalance.¶
Creating X:features and Y:label for model training¶
In [8]:
# Independent variables: the 16 extracted features
x = urldata[['hostname_length', 'path_length', 'fd_length', 'count-', 'count@',
             'count?', 'count%', 'count.', 'count=', 'count-http', 'count-https',
             'count-www', 'count-digits', 'count-letters', 'count_dir', 'use_of_ip']]
#Dependent Variable
y = urldata['result']
Train test split¶
In [9]:
#Oversampling using SMOTE
from imblearn.over_sampling import SMOTE
x_sample, y_sample = SMOTE().fit_resample(x, y.values.ravel())
x_sample = pd.DataFrame(x_sample)
y_sample = pd.DataFrame(y_sample)
# checking the sizes of the sample data
print("Size of x-sample :", x_sample.shape)
print("Size of y-sample :", y_sample.shape)
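SMOTE balances the classes by synthesizing new minority-class samples rather than duplicating existing ones: each synthetic point is an interpolation between a minority sample and one of its nearest minority-class neighbours. A minimal numpy sketch of that interpolation step (an illustration of the idea, not the library's implementation):

```python
import numpy as np

def smote_like_sample(minority, rng):
    """Create one synthetic sample by interpolating between a random
    minority point and its nearest minority-class neighbour (illustrative)."""
    i = rng.integers(len(minority))
    base = minority[i]
    # Nearest neighbour among the remaining minority points
    others = np.delete(minority, i, axis=0)
    nn = others[np.argmin(np.linalg.norm(others - base, axis=1))]
    gap = rng.random()  # interpolation factor in [0, 1)
    return base + gap * (nn - base)

rng = np.random.default_rng(0)
minority = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1]])
synthetic = smote_like_sample(minority, rng)
print(synthetic)  # lies on the segment between a point and its neighbour
```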
In [10]:
#Train test split
x_train, x_test, y_train, y_test = train_test_split(x_sample, y_sample, test_size = 0.2)
print("Shape of x_train: ", x_train.shape)
print("Shape of x_test: ", x_test.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of y_test: ", y_test.shape)
Model Architecture¶
In [11]:
model = Sequential()
model.add(Dense(32, activation = 'relu', input_shape = (16, )))
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
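As a sanity check on `model.summary()`, each `Dense` layer contributes `inputs * units + units` parameters (weights plus biases); for this 16-input architecture that gives 1,217 trainable parameters in total:

```python
# Dense layer parameters = inputs * units + units (weights + biases)
layer_sizes = [16, 32, 16, 8, 1]  # input features followed by the four Dense layers
params = [n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(params)       # [544, 528, 136, 9]
print(sum(params))  # 1217
```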
Model Training¶
In [12]:
opt = keras.optimizers.Adam(learning_rate=0.0001)
model.compile(optimizer= opt ,loss='binary_crossentropy',metrics=['acc'])
In [13]:
checkpointer = ModelCheckpoint('url.h5', monitor='val_acc', mode='max', verbose=2, save_best_only=True)
history=model.fit(x_train, y_train, batch_size=256, epochs=5, validation_data=(x_test, y_test), callbacks=[checkpointer])
Training Plots¶
In [14]:
# plot the training artifacts
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train_acc','val_acc'], loc = 'upper right')
plt.show()
In [15]:
# plot the training artifacts
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train_loss','val_loss'], loc = 'upper right')
plt.show()
Assessing the Model's Performance¶
In [16]:
# Predict on the test data and threshold the sigmoid outputs at 0.5
pred_test = model.predict(x_test)
for i in range(len(pred_test)):
    if pred_test[i] < 0.5:
        pred_test[i] = 0
    else:
        pred_test[i] = 1
pred_test = pred_test.astype(int)
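The element-wise loop works, but the same 0.5 thresholding can be done in one vectorized numpy step, which is faster and less error-prone (the sample array here stands in for the model's sigmoid outputs):

```python
import numpy as np

# Example sigmoid outputs standing in for model.predict(x_test)
pred_test = np.array([[0.91], [0.12], [0.50], [0.49]])

# Probabilities >= 0.5 map to class 1 (malicious), else class 0 (benign)
pred_labels = (pred_test >= 0.5).astype(int)
print(pred_labels.ravel())  # [1 0 1 0]
```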
In [17]:
def view_result(array):
    array = np.array(array)
    for i in range(len(array)):
        if array[i] == 0:
            print("Non Malicious")
        else:
            print("Malicious")
In [18]:
view_result(pred_test[:10])
In [19]:
view_result(y_test[:10])
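`confusion_matrix`, `classification_report`, and `accuracy_score` were imported earlier but never used; they give a fuller picture of performance than spot-checking ten predictions. A sketch with small stand-in label arrays (in the notebook, substitute `y_test` and `pred_test`):

```python
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Stand-in labels; in the notebook use y_test and pred_test instead
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class
print(accuracy_score(y_true, y_pred))    # fraction of correct predictions
print(classification_report(y_true, y_pred))  # per-class precision, recall, F1
```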
Compiling the model with deepC¶
In [20]:
!deepCC url.h5