
# Toxic Comment Detection

### Credit: AITS Cainvas Community

Photo by Daniel Montero on Dribbble

• Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.

# Setup: Importing necessary libraries

In [1]:
!pip install matplotlib-venn

Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: matplotlib-venn in /home/jupyter-dark/.local/lib/python3.7/site-packages (0.11.6)
Requirement already satisfied: matplotlib in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib-venn) (3.3.3)
Requirement already satisfied: scipy in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib-venn) (1.4.1)
Requirement already satisfied: numpy in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib-venn) (1.18.5)
Requirement already satisfied: python-dateutil>=2.1 in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib->matplotlib-venn) (2.8.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib->matplotlib-venn) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib->matplotlib-venn) (0.10.0)
Requirement already satisfied: numpy in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib-venn) (1.18.5)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib->matplotlib-venn) (1.3.1)
Requirement already satisfied: pillow>=6.2.0 in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib->matplotlib-venn) (8.0.1)
Requirement already satisfied: six in /opt/tljh/user/lib/python3.7/site-packages (from cycler>=0.10->matplotlib->matplotlib-venn) (1.15.0)
Requirement already satisfied: six in /opt/tljh/user/lib/python3.7/site-packages (from cycler>=0.10->matplotlib->matplotlib-venn) (1.15.0)
Requirement already satisfied: numpy in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib-venn) (1.18.5)
WARNING: You are using pip version 20.3.1; however, version 21.1.3 is available.
You should consider upgrading via the '/opt/tljh/user/bin/python -m pip install --upgrade pip' command.


## Importing Datasets

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
from tensorflow import keras

#visualization
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from wordcloud import WordCloud ,STOPWORDS
from PIL import Image
import matplotlib_venn as venn

#settings
color = sns.color_palette()
sns.set_style("dark")
%matplotlib inline


# Unzipping Dataset

In [4]:
!wget -N "https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/toxic_comment.zip"
!unzip -oq toxic_comment.zip
!rm toxic_comment.zip

--2021-06-28 10:11:17--  https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/toxic_comment.zip
Resolving cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)... 52.219.160.55
Connecting to cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)|52.219.160.55|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 55201987 (53M) [application/zip]
Saving to: ‘toxic_comment.zip’

toxic_comment.zip   100%[===================>]  52.64M   104MB/s    in 0.5s

2021-06-28 10:11:17 (104 MB/s) - ‘toxic_comment.zip’ saved [55201987/55201987]



# Data Pre-Processing and Visualization:

In [5]:
train_data = pd.read_csv("train.csv.zip")
train_data.head()

Out[5]:
id comment_text toxic severe_toxic obscene threat insult identity_hate
0 0000997932d777bf Explanation\nWhy the edits made under my usern... 0 0 0 0 0 0
1 000103f0d9cfb60f D'aww! He matches this background colour I'm s... 0 0 0 0 0 0
2 000113f07ec002fd Hey man, I'm really not trying to edit war. It... 0 0 0 0 0 0
3 0001b41b1c6bb37e "\nMore\nI can't make any real suggestions on ... 0 0 0 0 0 0
4 0001d958c54c6e35 You, sir, are my hero. Any chance you remember... 0 0 0 0 0 0
In [6]:
X_train = train_data["comment_text"]

In [7]:
X_train

Out[7]:
0         Explanation\nWhy the edits made under my usern...
1         D'aww! He matches this background colour I'm s...
2         Hey man, I'm really not trying to edit war. It...
3         "\nMore\nI can't make any real suggestions on ...
4         You, sir, are my hero. Any chance you remember...
...
159566    ":::::And for the second time of asking, when ...
159567    You should be ashamed of yourself \n\nThat is ...
159568    Spitzer \n\nUmm, theres no actual article for ...
159569    And it looks like it was actually you who put ...
159570    "\nAnd ... I really don't think you understand...
Name: comment_text, Length: 159571, dtype: object
In [8]:
y_train = train_data.iloc[:, 2:]
y_train

Out[8]:
toxic severe_toxic obscene threat insult identity_hate
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
... ... ... ... ... ... ...
159566 0 0 0 0 0 0
159567 0 0 0 0 0 0
159568 0 0 0 0 0 0
159569 0 0 0 0 0 0
159570 0 0 0 0 0 0

159571 rows × 6 columns

In [9]:
y_train[y_train['toxic'] == 1]

Out[9]:
toxic severe_toxic obscene threat insult identity_hate
6 1 1 1 0 1 0
12 1 0 0 0 0 0
16 1 0 0 0 0 0
42 1 0 1 0 1 1
43 1 0 1 0 1 0
... ... ... ... ... ... ...
159494 1 0 1 0 1 1
159514 1 0 0 0 1 0
159541 1 0 1 0 1 0
159546 1 0 0 0 1 0
159554 1 0 1 0 1 0

15294 rows × 6 columns

## Checking the count of comments for each label

In [10]:
cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

val_counts = y_train[cols].sum()

plt.figure(figsize=(8,5))
# Keyword arguments: seaborn 0.12+ no longer accepts x and y positionally
ax = sns.barplot(x=val_counts.index, y=val_counts.values, alpha=0.8)

plt.ylabel("Counts of the Comments")

# Annotate each bar with its count
rects = ax.patches
labels = val_counts.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height+5, label, ha="center", va="bottom")

plt.show()

In [11]:
#from wordcloud import WordCloud
words = ' '.join([text for text in X_train])

word_cloud = WordCloud(
    width=1600,
    height=800,
    #colormap='PuRd',
    margin=0,
    max_words=500, # Maximum number of words we want to see
    min_word_length=3, # Minimum number of letters a word needs to be part of the cloud
    max_font_size=150, min_font_size=30,  # Font size range
    background_color="white").generate(words)

plt.figure(figsize=(10, 16))
plt.imshow(word_cloud, interpolation="gaussian")
plt.title('Comments and their Nature', fontsize = 40)
plt.axis("off")
plt.show()


## Tokenization

In [12]:
tokenizer = keras.preprocessing.text.Tokenizer()

In [13]:
tokenizer.fit_on_texts(X_train)

In [14]:
X_train = tokenizer.texts_to_sequences(X_train)

In [15]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
X_train = pad_sequences(X_train, maxlen=100)

In [16]:
X_train.shape

Out[16]:
(159571, 100)
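
To make the tokenization and padding steps above concrete, here is a minimal pure-Python sketch of what they amount to. It is a simplification, not the Keras implementation: the helper names `fit_word_index`, `texts_to_ids`, and `pad_pre` are hypothetical, splitting is on whitespace only, ties are broken alphabetically, and it assumes the Keras defaults of frequency-ranked ids starting at 1 with padding and truncation applied at the front.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer id; more frequent words get smaller ids, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

def texts_to_ids(texts, word_index):
    """Replace words by their ids; words missing from the index are dropped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]

def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen ids if the sequence is too long."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

texts = ["you are my hero", "you are not my hero at all"]
word_index = fit_word_index(texts)
padded = [pad_pre(s, maxlen=5) for s in texts_to_ids(texts, word_index)]
print(padded)  # [[0, 4, 1, 3, 2], [7, 3, 2, 6, 5]]
```

The short comment gains a leading zero and the long one loses its first two words, which mirrors why `X_test[0]` later in this notebook begins with a run of zeros.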

# Model creation and Training

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2)

In [18]:
model = keras.Sequential([
    keras.layers.Dense(20, activation="tanh"),
    keras.layers.Dense(6, activation="softmax")
])

In [19]:
model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
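
Out[9] shows rows with several labels set at once (row 6 is toxic, severe_toxic, obscene and insult), so the task is multi-label. A softmax output trained with categorical cross-entropy treats the six labels as mutually exclusive. The sketch below, in plain Python and independent of the model above, contrasts softmax with a per-label sigmoid on the same made-up logits. Pairing sigmoid outputs with binary cross-entropy is a common alternative for multi-label targets; it is offered here as an assumption to try, not as this notebook's method.

```python
import math

def softmax(logits):
    """Normalize logits into probabilities that sum to 1 across labels."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Score each label independently in the open interval (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

# Made-up logits for one comment: strong evidence for labels 0, 2 and 4
logits = [4.0, -3.0, 4.0, -3.0, 4.0, -3.0]

soft = softmax(logits)
sig = sigmoid(logits)

print("softmax:", [round(p, 3) for p in soft])
print("sigmoid:", [round(p, 3) for p in sig])
```

With softmax the three strongly supported labels have to share the probability mass, so none of them can exceed about one third here, while the sigmoid scores for the same three labels can all approach 1.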

In [20]:
model_history = model.fit(X_train, y_train, epochs=5, validation_data=(X_val, y_val))

Epoch 1/5
3990/3990 [==============================] - 7s 2ms/step - loss: 0.3699 - accuracy: 0.7902 - val_loss: 0.3554 - val_accuracy: 0.9741
Epoch 2/5
3990/3990 [==============================] - 7s 2ms/step - loss: 0.3624 - accuracy: 0.9792 - val_loss: 0.3565 - val_accuracy: 0.9842
Epoch 3/5
3990/3990 [==============================] - 7s 2ms/step - loss: 0.3625 - accuracy: 0.9856 - val_loss: 0.3545 - val_accuracy: 0.9856
Epoch 4/5
3990/3990 [==============================] - 7s 2ms/step - loss: 0.3625 - accuracy: 0.9862 - val_loss: 0.3499 - val_accuracy: 0.9865
Epoch 5/5
3990/3990 [==============================] - 7s 2ms/step - loss: 0.3624 - accuracy: 0.9878 - val_loss: 0.3546 - val_accuracy: 0.9856
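
The jump to roughly 98% accuracy while the loss barely moves deserves a sanity check. Under a categorical cross-entropy loss, Keras resolves `metrics=["accuracy"]` to categorical accuracy, which compares the argmax of the target row with the argmax of the prediction. For an all-zero target row the argmax is index 0, so a model that always ranks `toxic` first is counted as correct on every clean comment. The plain-Python sketch below reproduces that effect with made-up rows; `argmax` and `categorical_accuracy` are simplified stand-ins written for this illustration, not the Keras functions.

```python
def argmax(row):
    """Index of the first maximum in the row."""
    return max(range(len(row)), key=lambda i: row[i])

def categorical_accuracy(y_true, y_pred):
    """Fraction of rows where the target argmax equals the prediction argmax."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Made-up targets: 8 clean comments, 1 toxic+obscene, 1 obscene only
y_true = [[0, 0, 0, 0, 0, 0]] * 8 + [[1, 0, 1, 0, 0, 0], [0, 0, 1, 0, 0, 0]]

# A constant "model" that always puts the most mass on label 0 (toxic)
y_pred = [[0.9, 0.02, 0.02, 0.02, 0.02, 0.02]] * len(y_true)

acc = categorical_accuracy(y_true, y_pred)
print(acc)  # 0.9
```

The constant predictor scores 90% here without detecting anything, so per-label measures such as precision, recall, or ROC AUC are likely to say more about this model than the accuracy curve does.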

In [21]:
## Plotting training & Validation accuracy values

In [22]:
plt.plot(model_history.history['accuracy'])
plt.plot(model_history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()


### Save model

In [23]:
model.save('final_model.h5')


# Evaluation

In [24]:
test_data = pd.read_csv('test.csv.zip')

In [25]:
test_data.head()

Out[25]:
id comment_text
0 00001cee341fdb12 Yo bitch Ja Rule is more succesful then you'll...
1 0000247867823ef7 == From RfC == \n\n The title is fine as it is...
2 00013b17ad220c46 " \n\n == Sources == \n\n * Zawe Ashton on Lap...
3 00017563c3f7919a :If you have a look back at the source, the in...
4 00017695ad8997eb I don't anonymously edit articles at all.
In [26]:
X_test = test_data['comment_text']

In [27]:
X_test

Out[27]:
0         Yo bitch Ja Rule is more succesful then you'll...
1         == From RfC == \n\n The title is fine as it is...
2         " \n\n == Sources == \n\n * Zawe Ashton on Lap...
3         :If you have a look back at the source, the in...
4                 I don't anonymously edit articles at all.
...
153159    . \n i totally agree, this stuff is nothing bu...
153160    == Throw from out field to home plate. == \n\n...
153161    " \n\n == Okinotorishima categories == \n\n I ...
153162    " \n\n == ""One of the founding nations of the...
153163    " \n :::Stop already. Your bullshit is not wel...
Name: comment_text, Length: 153164, dtype: object
In [28]:
type(test_data['comment_text'])

Out[28]:
pandas.core.series.Series
In [29]:
tokenizer2 = keras.preprocessing.text.Tokenizer()
tokenizer2.fit_on_texts(X_test)
X_test = tokenizer2.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen=100)
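
One caveat about the cell above: `tokenizer2` is fit on the test comments, so its word-to-index mapping is built from a different corpus than the one the model was trained on, and the same word will generally map to a different integer. The plain-Python sketch below illustrates the mismatch with a hypothetical helper, `fit_word_index`, which is a simplified stand-in for a frequency-ranked vocabulary and not the Keras implementation. Reusing the training tokenizer from In [12], for example `tokenizer.texts_to_sequences(...)`, is the usual way to keep indices consistent.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency (ties alphabetical); ids start at 1."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ordered, start=1)}

train_texts = ["stop the edit war", "the edit is fine", "the title is fine"]
test_texts = ["stop already", "stop the nonsense"]

train_index = fit_word_index(train_texts)
test_index = fit_word_index(test_texts)

# Same word, different id depending on which corpus built the vocabulary
print("train id for 'stop':", train_index["stop"])  # 5
print("test id for 'stop': ", test_index["stop"])   # 1

# Encoding test text with the training vocabulary keeps ids consistent;
# unseen words ("already", "nonsense") are simply dropped.
encoded = [[train_index[w] for w in t.split() if w in train_index] for t in test_texts]
print(encoded)  # [[5], [5, 1]]
```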

In [30]:
X_test[0]

Out[30]:
array([     0,      0,      0,      0,      0,      0,      0,      0,
0,      0,      0,      0,      0,      0,      0,      0,
0,      0,      0,      0,      0,      0,      0,      0,
0,      0,   1614,    227,   3383,    812,      8,     56,
32384,     82,    884,    337,     16,   3782,     69,     20,
6,      5,   5515,      6,   1585, 106851,      7,     54,
227,   6234,   1190, 106852,    500,   5001,      5,     93,
6,      2,   2999,     32,    279,      6,    762,  29767,
42,   3383,    812,      8,     35,   4342,     10,    737,
636,    348,    507,  15299,      9,    171,     15,    158,
5,  15732,      8,    253,  19272,     44,   2607,     52,
24,      3,   2225,    154,   1973,    500,   2110,     93,
219,    144,    486,     84], dtype=int32)
In [31]:
y_test = pd.read_csv('test_labels.csv.zip')
y_test = y_test.iloc[:, 1:]
y_test

Out[31]:
toxic severe_toxic obscene threat insult identity_hate
0 -1 -1 -1 -1 -1 -1
1 -1 -1 -1 -1 -1 -1
2 -1 -1 -1 -1 -1 -1
3 -1 -1 -1 -1 -1 -1
4 -1 -1 -1 -1 -1 -1
... ... ... ... ... ... ...
153159 -1 -1 -1 -1 -1 -1
153160 -1 -1 -1 -1 -1 -1
153161 -1 -1 -1 -1 -1 -1
153162 -1 -1 -1 -1 -1 -1
153163 -1 -1 -1 -1 -1 -1

153164 rows × 6 columns

In [32]:
# Evaluate the model on the test data using evaluate
print("Evaluate on test data")
results = model.evaluate(X_test, y_test, batch_size=128)
print("test accuracy:", results[1])

Evaluate on test data
1197/1197 [==============================] - 2s 1ms/step - loss: -6.1557 - accuracy: 0.9889
test accuracy: 0.988926887512207
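
A negative cross-entropy loss is a sign that the targets are not valid labels. Out[31] shows rows filled with -1; in the original Kaggle release of this dataset, -1 appears to mark test rows that were not used for scoring, which should be treated as an assumption to verify against the data description. The sketch below, in plain Python with made-up numbers, shows how a -1 target flips the sign of categorical cross-entropy and how such rows could be masked out before evaluation. The function is a simplified per-row version written for this illustration.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """-sum(y * log(p)) for one row; assumes every p is in (0, 1]."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred))

pred = [0.9, 0.02, 0.02, 0.02, 0.02, 0.02]

valid_loss = categorical_crossentropy([1, 0, 0, 0, 0, 0], pred)
unscored_loss = categorical_crossentropy([-1] * 6, pred)
print(valid_loss, unscored_loss)  # positive, then negative

# Keep only rows whose labels are all real 0/1 values
labels = [[-1] * 6, [0, 0, 0, 0, 0, 0], [1, 0, 1, 0, 1, 0], [-1] * 6]
keep = [i for i, row in enumerate(labels) if all(v >= 0 for v in row)]
print(keep)  # [1, 2]
```

In the notebook itself, an equivalent mask could be built with pandas as `(y_test != -1).all(axis=1)` and applied to both `X_test` and `y_test` before calling `model.evaluate`. The accuracy reported above is also affected, since a row of -1 values has its argmax at index 0.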