Cainvas

Detecting Cervical Cancer

Credit: AITS Cainvas Community

Photo by Sharon Lee for LottieFiles on Dribbble

In [1]:
# Import all the necessary libraries

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt

import seaborn as sns
import numpy as np
import os
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

Loading the Dataset

In [2]:
!wget 'https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/cervical_cancer.zip'

!unzip -qo cervical_cancer.zip 
!rm cervical_cancer.zip
--2021-12-09 17:00:45--  https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/cervical_cancer.zip
Resolving cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)... 52.219.158.95
Connecting to cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)|52.219.158.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9052 (8.8K) [application/x-zip-compressed]
Saving to: ‘cervical_cancer.zip’

cervical_cancer.zip 100%[===================>]   8.84K  --.-KB/s    in 0s      

2021-12-09 17:00:45 (147 MB/s) - ‘cervical_cancer.zip’ saved [9052/9052]

In [3]:
# Load the data file with pandas

data = pd.read_csv('kag_risk_factors_cervical_cancer.csv', sep = ",")
data.head(3)
Out[3]:
Age Number of sexual partners First sexual intercourse Num of pregnancies Smokes Smokes (years) Smokes (packs/year) Hormonal Contraceptives Hormonal Contraceptives (years) IUD ... STDs: Time since first diagnosis STDs: Time since last diagnosis Dx:Cancer Dx:CIN Dx:HPV Dx Hinselmann Schiller Citology Biopsy
0 18 4.0 15.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... ? ? 0 0 0 0 0 0 0 0
1 15 1.0 14.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... ? ? 0 0 0 0 0 0 0 0
2 34 1.0 ? 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... ? ? 0 0 0 0 0 0 0 0

3 rows × 36 columns

Dropping Redundant Data and Checking for NULL Values

In [4]:
data = data.drop(columns = ['STDs: Time since first diagnosis','STDs: Time since last diagnosis'])
data = data.replace('?', np.nan)
print(data.isna().sum())
Age                                     0
Number of sexual partners              26
First sexual intercourse                7
Num of pregnancies                     56
Smokes                                 13
Smokes (years)                         13
Smokes (packs/year)                    13
Hormonal Contraceptives               108
Hormonal Contraceptives (years)       108
IUD                                   117
IUD (years)                           117
STDs                                  105
STDs (number)                         105
STDs:condylomatosis                   105
STDs:cervical condylomatosis          105
STDs:vaginal condylomatosis           105
STDs:vulvo-perineal condylomatosis    105
STDs:syphilis                         105
STDs:pelvic inflammatory disease      105
STDs:genital herpes                   105
STDs:molluscum contagiosum            105
STDs:AIDS                             105
STDs:HIV                              105
STDs:Hepatitis B                      105
STDs:HPV                              105
STDs: Number of diagnosis               0
Dx:Cancer                               0
Dx:CIN                                  0
Dx:HPV                                  0
Dx                                      0
Hinselmann                              0
Schiller                                0
Citology                                0
Biopsy                                  0
dtype: int64

Filling NULL Values with Column Means

In [5]:
data = data.fillna(data.mean())
data = data.apply(pd.to_numeric)
In [6]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 858 entries, 0 to 857
Data columns (total 34 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Age                                 858 non-null    int64  
 1   Number of sexual partners           832 non-null    float64
 2   First sexual intercourse            851 non-null    float64
 3   Num of pregnancies                  802 non-null    float64
 4   Smokes                              845 non-null    float64
 5   Smokes (years)                      845 non-null    float64
 6   Smokes (packs/year)                 845 non-null    float64
 7   Hormonal Contraceptives             750 non-null    float64
 8   Hormonal Contraceptives (years)     750 non-null    float64
 9   IUD                                 741 non-null    float64
 10  IUD (years)                         741 non-null    float64
 11  STDs                                753 non-null    float64
 12  STDs (number)                       753 non-null    float64
 13  STDs:condylomatosis                 753 non-null    float64
 14  STDs:cervical condylomatosis        753 non-null    float64
 15  STDs:vaginal condylomatosis         753 non-null    float64
 16  STDs:vulvo-perineal condylomatosis  753 non-null    float64
 17  STDs:syphilis                       753 non-null    float64
 18  STDs:pelvic inflammatory disease    753 non-null    float64
 19  STDs:genital herpes                 753 non-null    float64
 20  STDs:molluscum contagiosum          753 non-null    float64
 21  STDs:AIDS                           753 non-null    float64
 22  STDs:HIV                            753 non-null    float64
 23  STDs:Hepatitis B                    753 non-null    float64
 24  STDs:HPV                            753 non-null    float64
 25  STDs: Number of diagnosis           858 non-null    int64  
 26  Dx:Cancer                           858 non-null    int64  
 27  Dx:CIN                              858 non-null    int64  
 28  Dx:HPV                              858 non-null    int64  
 29  Dx                                  858 non-null    int64  
 30  Hinselmann                          858 non-null    int64  
 31  Schiller                            858 non-null    int64  
 32  Citology                            858 non-null    int64  
 33  Biopsy                              858 non-null    int64  
dtypes: float64(24), int64(10)
memory usage: 228.0 KB
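
The non-null counts reported by data.info() are still below 858 for the columns that contained '?'. This suggests the mean fill in In [5] did not reach columns that were still stored as strings when data.mean() was computed. A minimal sketch of one way to avoid this is to convert to numeric before imputing. It is shown here on a small made-up frame; the `toy` DataFrame and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the raw file: numbers stored as strings, '?' for missing.
# (Hypothetical values, for illustration only.)
toy = pd.DataFrame({
    "Age": [18, 15, 34, 52],
    "First sexual intercourse": ["15.0", "14.0", "?", "16.0"],
    "Smokes": ["0.0", "?", "1.0", "0.0"],
})

toy = toy.replace("?", np.nan)

# Convert first, so every column is numeric before the means are computed.
toy = toy.apply(pd.to_numeric)

# Now the column means exist for every column and the fill reaches all of them.
toy = toy.fillna(toy.mean())

print(toy.isna().sum().sum())   # number of missing values left
print(toy.dtypes)
```

With the conversion done first, every column has a numeric dtype when the means are computed, so no missing values should remain afterwards.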

Converting Data Types to Float for Preprocessing

In [7]:
# Convert the remaining int64 columns to float64
data['Age'] = data['Age'].astype(float)
data['STDs: Number of diagnosis'] = data['STDs: Number of diagnosis'].astype(float)
data['Dx:Cancer'] = data['Dx:Cancer'].astype(float)
data['Dx:CIN'] = data['Dx:CIN'].astype(float)
data['Dx:HPV'] = data['Dx:HPV'].astype(float)
data['Dx'] = data['Dx'].astype(float)
data['Hinselmann'] = data['Hinselmann'].astype(float)
data['Schiller'] = data['Schiller'].astype(float)
data['Citology'] = data['Citology'].astype(float)
data['Biopsy'] = data['Biopsy'].astype(float)

Creating a Combined Target Column from the Four Test Results

In [8]:
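# Number of positive results across the four tests for each patient
data['count'] = data['Hinselmann'] + data['Schiller'] + data['Citology'] + data['Biopsy']

# Binary target: 1 if any of the four tests is positive, otherwise 0
data['result'] = np.where(data['count'] > 0, 1, data['count'])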
In [9]:
data['result'].unique()
Out[9]:
array([0., 1.])
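
The label logic above can be checked on a few made-up rows. The sketch below builds the same `count` and `result` values on a hypothetical `toy` frame and compares them with an equivalent formulation based on `any(axis=1)`. The frame and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

test_cols = ["Hinselmann", "Schiller", "Citology", "Biopsy"]

# Hypothetical test results for four patients (1.0 = positive, 0.0 = negative).
toy = pd.DataFrame(
    [[0.0, 0.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 1.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]],
    columns=test_cols,
)

# Same construction as in the notebook: sum the tests, then clip to 0/1.
count = toy[test_cols].sum(axis=1)
result_where = np.where(count > 0, 1, count)

# Equivalent formulation: a patient is positive if any test is positive.
result_any = (toy[test_cols] > 0).any(axis=1).astype(float)

print(result_where)
print(result_any.tolist())
```

Both versions mark a patient as positive as soon as one of the four tests is positive, regardless of how many are.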

Visualising the Relationship between Age & No. of Sexual Partners

In [10]:
plt.figure(figsize = (8, 5))
plt.xticks(rotation = 60)
sns.barplot(y=data['Age'],x=data['Number of sexual partners'],hue=data['Schiller'])
Out[10]:
<AxesSubplot:xlabel='Number of sexual partners', ylabel='Age'>
In [11]:
plt.figure(figsize = (8, 5))
plt.xticks(rotation = 60)
sns.barplot(y=data['Age'],x=data['Number of sexual partners'],hue=data['Biopsy'])
Out[11]:
<AxesSubplot:xlabel='Number of sexual partners', ylabel='Age'>
In [12]:
data_final = data.drop(columns = ['Hinselmann', 
                                  'Schiller', 
                                  'Citology', 
                                  'Biopsy', 
                                  'count', 
                                  'STDs:condylomatosis',
                                  'STDs:cervical condylomatosis',
                                  'STDs:vulvo-perineal condylomatosis',
                                  'STDs:syphilis',
                                  'STDs:pelvic inflammatory disease', 
                                  'STDs:genital herpes',
                                  'STDs:molluscum contagiosum',
                                  'STDs:AIDS', 'STDs:HIV',
                                  'STDs:Hepatitis B', 'STDs:HPV', 
                                  'STDs: Number of diagnosis',
                                  'Dx:Cancer', 'Dx:CIN', 'Dx:HPV'                                  
                                 ])

y = data_final['result']
X = data_final.drop(columns = ['result'])
In [13]:
# Plotting a heatmap/correlation plot to see how different values are related to each other
plt.figure(figsize=(15,15))
sns.heatmap(data_final.corr(),annot=True,linewidths=2)
plt.xticks(rotation = 60)
plt.show()
In [14]:
print(X.shape)
print(y.shape)
(858, 15)
(858,)
In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.40, stratify = y)
In [16]:
# Feature scaling: fit the scaler on the training data only, then apply it to the test data

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
In [17]:
# Getting the Final Data Shapes


print("Shape of Training Data")
print ("X = ",X_train.shape)
print ("Y = ",y_train.shape, "\n")


print("Shape of Testing Data")
print ("X = ",X_test.shape)
print ("Y = ",y_test.shape)
Shape of Training Data
X =  (514, 15)
Y =  (514,) 

Shape of Testing Data
X =  (344, 15)
Y =  (344,)
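
train_test_split in In [15] is called without a random_state, so the rows that land in each split change from run to run. The sketch below shows, on synthetic data, how fixing the seed makes a stratified split repeatable. The arrays `X_toy` and `y_toy`, the helper `split`, and the seed value 42 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 100 rows, 15 features, 12 positive labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 15))
y_toy = np.array([1] * 12 + [0] * 88)

def split(seed):
    # Hypothetical helper: a 60/40 stratified split with a fixed seed.
    return train_test_split(
        X_toy, y_toy, test_size=0.40, stratify=y_toy, random_state=seed
    )

X_tr_a, X_te_a, y_tr_a, y_te_a = split(42)
X_tr_b, X_te_b, y_tr_b, y_te_b = split(42)

# The same seed gives the same rows in each split.
print(np.array_equal(X_te_a, X_te_b))

# Stratification keeps the positive rate similar in both splits.
print(y_tr_a.mean(), y_te_a.mean())
```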

Training the Model

In [18]:
# Defining the architecture of our deep learning model

model = Sequential()

model.add(Dense(100, activation = "softmax", input_dim = 15))
model.add(Dense(1, activation = "softmax"))

model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 100)               1600      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
=================================================================
Total params: 1,701
Trainable params: 1,701
Non-trainable params: 0
_________________________________________________________________
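
Softmax normalises across the units of a layer, so on a layer with a single unit it returns 1.0 for every input. For a single-output binary classifier, the usual pairing is a sigmoid output with a binary cross-entropy loss. The NumPy sketch below illustrates the difference on made-up logits. The `softmax` and `sigmoid` functions are hand-written for illustration and are not the Keras implementations.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    # Logistic function, applied element-wise.
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activation values of a 1-unit output layer for 3 samples.
logits = np.array([[-3.0], [0.0], [2.5]])   # shape (3, 1)

soft = softmax(logits, axis=1)   # normalises over a single unit
sig = sigmoid(logits)            # maps each logit to a probability

print(soft.ravel())
print(sig.ravel())
```

Because the softmax output here does not depend on the input, a single-unit softmax layer cannot separate the two classes, whereas the sigmoid output varies with the logit.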
In [19]:
# Compiling the model
model.compile(optimizer = Adam(learning_rate = 0.0000001), loss = 'categorical_crossentropy', metrics = ['accuracy'])
In [20]:
es = EarlyStopping(monitor = 'val_accuracy', patience = 5)
In [21]:
# Train with a batch size of 5 for up to 100 epochs; early stopping may end training sooner
history = model.fit(X_train, 
                    y_train, 
                    validation_data = (X_test, y_test),
                    batch_size = 5,
                    epochs = 100,
                    callbacks = [es]
                   )
Epoch 1/100
103/103 [==============================] - 0s 3ms/step - loss: nan - accuracy: 0.8794 - val_loss: nan - val_accuracy: 0.8808
Epoch 2/100
103/103 [==============================] - 0s 2ms/step - loss: nan - accuracy: 0.8813 - val_loss: nan - val_accuracy: 0.8808
Epoch 3/100
103/103 [==============================] - 0s 1ms/step - loss: nan - accuracy: 0.8813 - val_loss: nan - val_accuracy: 0.8808
Epoch 4/100
103/103 [==============================] - 0s 2ms/step - loss: nan - accuracy: 0.8813 - val_loss: nan - val_accuracy: 0.8808
Epoch 5/100
103/103 [==============================] - 0s 2ms/step - loss: nan - accuracy: 0.8813 - val_loss: nan - val_accuracy: 0.8808
Epoch 6/100
103/103 [==============================] - 0s 2ms/step - loss: nan - accuracy: 0.8813 - val_loss: nan - val_accuracy: 0.8808
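
A loss of nan from the first epoch is commonly a sign that the inputs contain NaN values or that the loss does not match the output layer. The sketch below uses made-up arrays to show how a single NaN feature propagates through the affine step of a dense layer, and how np.isnan can be used to check the inputs before training. The names `X_demo`, `W`, and `b` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch of 4 samples with 3 features; one value is missing.
X_demo = rng.normal(size=(4, 3))
X_demo[1, 2] = np.nan

# Hypothetical weights and bias of a dense layer with 2 units.
W = rng.normal(size=(3, 2))
b = np.zeros(2)

# Affine step of a dense layer: any NaN input makes the whole output row NaN.
out = X_demo @ W + b

print(np.isnan(X_demo).any())     # is there any NaN in the inputs?
print(np.isnan(out).any(axis=1))  # which output rows are affected?
print(np.isnan(out.mean()))       # a mean over the batch is then NaN too
```

A check such as `np.isnan(X_train).any()` before calling fit would flag this situation early.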

Checking Training Progress across Epochs

In [22]:
# Plot a metric ('accuracy' or 'loss') against the epoch for the training and validation data
def plot_metrics(history, metric = 'accuracy'):
    # `history` is the History object returned by model.fit()
    key = 'loss' if metric == 'loss' else 'accuracy'
    plt.title("Loss Values" if key == 'loss' else "Accuracy Values")
    plt.plot(history.history[key], label = 'train')
    plt.plot(history.history['val_' + key], label = 'test')
    plt.legend()
    plt.show()
In [23]:
plot_metrics(history, 'accuracy')
plot_metrics(history, 'loss')
In [27]:
# Predicting on the testing data
Y_pred = np.argmax(model.predict(X_test), axis = 1)

# Save the trained model unless a saved copy already exists
if not os.path.isfile('best_model.h5'):
    model.save('best_model.h5')
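
np.argmax(..., axis = 1) applied to predictions of shape (n, 1) returns 0 for every row, because each row has only one entry to choose from. When a model emits one probability per sample, class labels are usually obtained by thresholding instead. A minimal sketch, assuming sigmoid-style probabilities (the `probs` values and the 0.5 threshold are made up for illustration):

```python
import numpy as np

# Hypothetical model outputs: one probability per sample, shape (4, 1).
probs = np.array([[0.10], [0.80], [0.55], [0.30]])

# argmax over an axis of length 1 can only ever return index 0.
argmax_pred = np.argmax(probs, axis=1)

# Thresholding the single probability gives a usable 0/1 label.
threshold_pred = (probs[:, 0] >= 0.5).astype(int)

print(argmax_pred)
print(threshold_pred)
```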

Getting the Classification Report

In [25]:
# Getting a Classification Report for checking the performance of our model
print(classification_report(y_test, Y_pred, target_names = ['No Cancer', 'Cancer']))
              precision    recall  f1-score   support

   No Cancer       0.88      1.00      0.94       303
      Cancer       0.00      0.00      0.00        41

    accuracy                           0.88       344
   macro avg       0.44      0.50      0.47       344
weighted avg       0.78      0.88      0.82       344

/opt/tljh/user/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1221: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
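
The report above has the shape expected from a model that predicts 'No Cancer' for every test sample: recall is 1.00 for that class and 0.00 for 'Cancer', and accuracy equals the share of negative samples. The sketch below rebuilds that situation from the support counts in the report (303 and 41) with constant predictions, and passes `zero_division=0` to classification_report so the undefined precision is set to 0 without a warning. The arrays `y_true_demo` and `y_pred_demo` are assumptions constructed for illustration, not the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import classification_report

# Labels with the same class counts as the test set in the report above.
y_true_demo = np.array([0] * 303 + [1] * 41)

# A constant predictor that always outputs class 0 ('No Cancer').
y_pred_demo = np.zeros_like(y_true_demo)

report = classification_report(
    y_true_demo,
    y_pred_demo,
    target_names=["No Cancer", "Cancer"],
    zero_division=0,     # report undefined precision as 0 without warning
    output_dict=True,    # return a dict so individual values can be read
)

print(round(report["accuracy"], 2))
print(report["Cancer"]["recall"], report["No Cancer"]["recall"])
```

An accuracy near 0.88 is therefore reachable here without detecting a single positive case, which is why per-class recall is the more informative number for this dataset.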