Toxic Comment Detection¶

Credit: AITS Cainvas Community ¶

Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.

Setup: Importing necessary libraries¶

!pip install matplotlib-venn

Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: matplotlib-venn in /home/jupyter-dark/.local/lib/python3.7/site-packages (0.11.6)
Requirement already satisfied: matplotlib in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib-venn) (3.3.3)
Requirement already satisfied: scipy in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib-venn) (1.4.1)
Requirement already satisfied: numpy in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib-venn) (1.18.5)
Requirement already satisfied: python-dateutil>=2.1 in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib->matplotlib-venn) (2.8.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib->matplotlib-venn) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib->matplotlib-venn) (0.10.0)
Requirement already satisfied: numpy in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib-venn) (1.18.5)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib->matplotlib-venn) (1.3.1)
Requirement already satisfied: pillow>=6.2.0 in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib->matplotlib-venn) (8.0.1)
Requirement already satisfied: six in /opt/tljh/user/lib/python3.7/site-packages (from cycler>=0.10->matplotlib->matplotlib-venn) (1.15.0)
Requirement already satisfied: six in /opt/tljh/user/lib/python3.7/site-packages (from cycler>=0.10->matplotlib->matplotlib-venn) (1.15.0)
Requirement already satisfied: numpy in /opt/tljh/user/lib/python3.7/site-packages (from matplotlib-venn) (1.18.5)
WARNING: You are using pip version 20.3.1; however, version 21.1.3 is available.
You should consider upgrading via the '/opt/tljh/user/bin/python -m pip install --upgrade pip' command.

Importing Datasets¶

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
from tensorflow import keras

#visualization
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 
import seaborn as sns
from wordcloud import WordCloud ,STOPWORDS
from PIL import Image
import matplotlib_venn as venn


#settings
color = sns.color_palette()
sns.set_style("dark")
%matplotlib inline

Unzipping Dataset¶

!wget -N "https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/toxic_comment.zip"
!unzip -oq toxic_comment.zip 
!rm toxic_comment.zip

--2021-06-28 10:11:17--  https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/toxic_comment.zip
Resolving cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)... 52.219.160.55
Connecting to cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)|52.219.160.55|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 55201987 (53M) [application/zip]
Saving to: ‘toxic_comment.zip’

toxic_comment.zip   100%[===================>]  52.64M   104MB/s    in 0.5s    

2021-06-28 10:11:17 (104 MB/s) - ‘toxic_comment.zip’ saved [55201987/55201987]

Data Pre-Processing and Visualization:¶

train_data = pd.read_csv("train.csv.zip")
train_data.head()

X_train = train_data["comment_text"]

X_train

0         Explanation\nWhy the edits made under my usern...
1         D'aww! He matches this background colour I'm s...
2         Hey man, I'm really not trying to edit war. It...
3         "\nMore\nI can't make any real suggestions on ...
4         You, sir, are my hero. Any chance you remember...
                                ...                        
159566    ":::::And for the second time of asking, when ...
159567    You should be ashamed of yourself \n\nThat is ...
159568    Spitzer \n\nUmm, theres no actual article for ...
159569    And it looks like it was actually you who put ...
159570    "\nAnd ... I really don't think you understand...
Name: comment_text, Length: 159571, dtype: object

y_train = train_data.iloc[:, 2:]
y_train

y_train[y_train['toxic'] == 1]

Checking the count of each comment type¶

cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


val_counts = y_train[cols].sum()

plt.figure(figsize=(8,5))
ax = sns.barplot(x=val_counts.index, y=val_counts.values, alpha=0.8)

plt.title("Comments per Class")
plt.xlabel("Comment Type")
plt.ylabel("Number of Comments")

# Annotate each bar with its count
rects = ax.patches
labels = val_counts.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height+5, label, ha="center", va="bottom")


plt.show()

# Join all comments into one string for the word cloud
words = ' '.join(X_train)


word_cloud = WordCloud(
    width=1600,
    height=800,
    margin=0,
    max_words=500,       # maximum number of words to display
    min_word_length=3,   # minimum number of letters for a word to be included
    max_font_size=150, min_font_size=30,  # font size range
    background_color="white").generate(words)

plt.figure(figsize=(10, 16))
plt.imshow(word_cloud, interpolation="gaussian")
plt.title('Comments and their Nature', fontsize = 40)
plt.axis("off")
plt.show()

Tokenization¶

tokenizer = keras.preprocessing.text.Tokenizer()

tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)

from tensorflow.keras.preprocessing.sequence import pad_sequences
X_train = pad_sequences(X_train, maxlen=100)

X_train.shape

(159571, 100)
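
To make the two steps above concrete, here is a minimal pure-Python sketch of what fitting a word index and padding to a fixed length do. It is a simplified illustration, not the Keras implementation: the helper names `build_word_index`, `to_sequence` and `pad_pre` are hypothetical, punctuation filtering is omitted, ties are broken alphabetically, and the toy comments are made up. It assumes the Keras defaults of padding and truncating at the front of the sequence, with index 0 reserved for padding.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer id, most frequent word first (ids start at 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

def to_sequence(text, word_index):
    """Replace known words by their ids; unknown words are dropped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

def pad_pre(seq, maxlen):
    """Keep the last `maxlen` ids and left-pad with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

toy_comments = ["you are my hero", "you are wrong", "hero"]
word_index = build_word_index(toy_comments)
padded = [pad_pre(to_sequence(t, word_index), maxlen=3) for t in toy_comments]

print(word_index)
print(padded)
```

Short comments end up mostly zeros on the left, while long comments lose their beginning, which is why the choice of `maxlen` matters.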

Model creation and Training¶

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2)

model = keras.Sequential([
    keras.layers.Dense(20, activation="tanh"),
    keras.layers.Dense(6, activation="softmax")
])

model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])

model_history = model.fit(X_train, y_train, epochs=5, validation_data=(X_val, y_val))

Epoch 1/5
3990/3990 [==============================] - 7s 2ms/step - loss: 0.3699 - accuracy: 0.7902 - val_loss: 0.3554 - val_accuracy: 0.9741
Epoch 2/5
3990/3990 [==============================] - 7s 2ms/step - loss: 0.3624 - accuracy: 0.9792 - val_loss: 0.3565 - val_accuracy: 0.9842
Epoch 3/5
3990/3990 [==============================] - 7s 2ms/step - loss: 0.3625 - accuracy: 0.9856 - val_loss: 0.3545 - val_accuracy: 0.9856
Epoch 4/5
3990/3990 [==============================] - 7s 2ms/step - loss: 0.3625 - accuracy: 0.9862 - val_loss: 0.3499 - val_accuracy: 0.9865
Epoch 5/5
3990/3990 [==============================] - 7s 2ms/step - loss: 0.3624 - accuracy: 0.9878 - val_loss: 0.3546 - val_accuracy: 0.9856
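
The six labels are not mutually exclusive (a comment can be both `toxic` and `insult`), so this is a multi-label problem. A softmax output forces the six scores to sum to 1, which cannot represent a comment that carries several labels or none. A common alternative is one independent sigmoid per label trained with binary cross-entropy (in Keras, `activation="sigmoid"` with `loss="binary_crossentropy"`). The sketch below is a pure-Python illustration with made-up logits, not the notebook's model.

```python
import math

def softmax(logits):
    """Normalise scores so they sum to 1 (one winner among the classes)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Squash each score independently into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of the per-label binary cross-entropy terms."""
    terms = []
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        terms.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return sum(terms) / len(terms)

# Order: toxic, severe_toxic, obscene, threat, insult, identity_hate
y_true = [1, 0, 1, 0, 1, 0]                  # three labels apply at once
logits = [4.0, -4.0, 4.0, -4.0, 4.0, -4.0]   # hypothetical model scores

p_softmax = softmax(logits)
p_sigmoid = sigmoid(logits)

print("softmax sum:", round(sum(p_softmax), 6))   # constrained to 1
print("sigmoid sum:", round(sum(p_sigmoid), 6))   # not constrained
print("BCE with sigmoid:", binary_cross_entropy(y_true, p_sigmoid))
print("BCE with softmax:", binary_cross_entropy(y_true, p_softmax))
```

Even with scores that strongly favour the three correct labels, softmax can give each of them at most about one third of the probability mass, whereas the sigmoid outputs can all be close to 1.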

## Plotting training & Validation accuracy values

plt.plot(model_history.history['accuracy'])
plt.plot(model_history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()
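
The accuracy values above should be read with care. With a categorical cross-entropy loss, Keras typically resolves `"accuracy"` to categorical accuracy, which only checks whether the arg-max of the prediction matches the arg-max of the label row. The `y_train` rows displayed in this notebook are all zeros, and the arg-max of an all-zero row is index 0, so if most rows look like that, a model that always scores the first column highest already appears very accurate. The sketch below reproduces that effect with a handful of made-up label rows; `categorical_accuracy` and `per_label_recall` are illustrative helpers written here, not library calls.

```python
def argmax(row):
    """Index of the first largest value (ties resolve to the lowest index)."""
    return max(range(len(row)), key=lambda i: row[i])

def categorical_accuracy(y_true, y_pred):
    """Fraction of rows whose arg-max label matches the arg-max prediction."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

def per_label_recall(y_true, y_pred, threshold=0.5):
    """Per label column: detected positives / actual positives (None if no positives)."""
    recalls = []
    for j in range(len(y_true[0])):
        positives = [i for i, row in enumerate(y_true) if row[j] == 1]
        if not positives:
            recalls.append(None)
            continue
        found = sum(y_pred[i][j] >= threshold for i in positives)
        recalls.append(found / len(positives))
    return recalls

# Made-up labels: 8 clean comments, 1 toxic, 1 obscene + insult
y_true = [[0, 0, 0, 0, 0, 0]] * 8 + [[1, 0, 0, 0, 0, 0], [0, 0, 1, 0, 1, 0]]
# A "model" that ignores its input and always puts the highest score on column 0
constant = [0.4, 0.1, 0.2, 0.1, 0.1, 0.1]
y_pred = [constant] * len(y_true)

print("categorical accuracy:", categorical_accuracy(y_true, y_pred))
print("per-label recall:", per_label_recall(y_true, y_pred))
```

The constant predictor scores well on accuracy while detecting none of the labelled comments, so per-label metrics such as recall or AUC are a more informative check for this task.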

Save model¶

model.save('final_model.h5')

Evaluation¶

test_data = pd.read_csv('test.csv.zip')

test_data.head()

X_test = test_data['comment_text']

X_test

0         Yo bitch Ja Rule is more succesful then you'll...
1         == From RfC == \n\n The title is fine as it is...
2         " \n\n == Sources == \n\n * Zawe Ashton on Lap...
3         :If you have a look back at the source, the in...
4                 I don't anonymously edit articles at all.
                                ...                        
153159    . \n i totally agree, this stuff is nothing bu...
153160    == Throw from out field to home plate. == \n\n...
153161    " \n\n == Okinotorishima categories == \n\n I ...
153162    " \n\n == ""One of the founding nations of the...
153163    " \n :::Stop already. Your bullshit is not wel...
Name: comment_text, Length: 153164, dtype: object

type(test_data['comment_text'])

pandas.core.series.Series

tokenizer2 = keras.preprocessing.text.Tokenizer()
tokenizer2.fit_on_texts(X_test)
X_test = tokenizer2.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen=100)

X_test[0]

array([     0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,   1614,    227,   3383,    812,      8,     56,
        32384,     82,    884,    337,     16,   3782,     69,     20,
            6,      5,   5515,      6,   1585, 106851,      7,     54,
          227,   6234,   1190, 106852,    500,   5001,      5,     93,
            6,      2,   2999,     32,    279,      6,    762,  29767,
           42,   3383,    812,      8,     35,   4342,     10,    737,
          636,    348,    507,  15299,      9,    171,     15,    158,
            5,  15732,      8,    253,  19272,     44,   2607,     52,
           24,      3,   2225,    154,   1973,    500,   2110,     93,
          219,    144,    486,     84], dtype=int32)
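
Note that the cell above fits a second tokenizer on the test comments. Word ids are assigned from the corpus a tokenizer was fitted on, so the same word can receive a different id in `tokenizer2` than in `tokenizer`, and the model then sees ids whose meaning differs from training. The usual approach is to reuse the fitted training tokenizer, i.e. `tokenizer.texts_to_sequences(X_test)`. The toy sketch below shows the mismatch with a simplified frequency-ranked vocabulary (`fit_vocab` and `encode` are hypothetical helpers and the sentences are made up); it is not the Keras `Tokenizer` itself.

```python
from collections import Counter

def fit_vocab(texts):
    """Assign ids by descending word frequency, starting at 1 (ties alphabetical)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ordered, start=1)}

def encode(text, vocab):
    """Map words to ids, skipping words the vocabulary has never seen."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

train_texts = ["thank you for the edit", "the edit was fine", "thank you"]
test_texts = ["stop the edit war", "stop it"]

train_vocab = fit_vocab(train_texts)
test_vocab = fit_vocab(test_texts)   # what a second tokenizer would do

sample = "stop the edit"
print("ids from train vocab:", encode(sample, train_vocab))
print("ids from test vocab: ", encode(sample, test_vocab))
```

The same sentence maps to two different id sequences, and only the first one uses the ids the model was trained with.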

y_test = pd.read_csv('test_labels.csv.zip')
y_test = y_test.iloc[:, 1:]
y_test

# Evaluate the model on the test data using `evaluate`
print("Evaluate on test data")
results = model.evaluate(X_test, y_test, batch_size=128)
print("test accuracy:", results[1])

Evaluate on test data
1197/1197 [==============================] - 2s 1ms/step - loss: -6.1557 - accuracy: 0.9889
test accuracy: 0.988926887512207
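
The negative loss reported above is a sign that the labels are not valid targets, since cross-entropy is non-negative for targets in [0, 1]. In this dataset's `test_labels` file, rows marked `-1` were not used for scoring, so they are usually dropped before evaluation. A minimal NumPy sketch with made-up arrays, assuming features and labels are row-aligned (for the notebook's DataFrame, `y_test.to_numpy()` would give the label array):

```python
import numpy as np

# Made-up stand-ins for the padded test sequences and the label table
X = np.array([[0, 0, 5], [0, 3, 9], [7, 1, 2], [0, 0, 4]])
y = np.array([
    [-1, -1, -1, -1, -1, -1],   # unlabelled row
    [ 0,  0,  0,  0,  0,  0],
    [ 1,  0,  1,  0,  1,  0],
    [-1, -1, -1, -1, -1, -1],   # unlabelled row
])

# Keep only rows where every label is a real 0/1 value
keep = (y != -1).all(axis=1)
X_scored, y_scored = X[keep], y[keep]

print("rows kept:", int(keep.sum()), "of", len(y))
print(y_scored)
```

Evaluating on the filtered arrays keeps the loss and metrics tied to rows that actually carry labels.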

	toxic	severe_toxic	obscene	threat	insult	identity_hate
0	-1	-1	-1	-1	-1	-1
1	-1	-1	-1	-1	-1	-1
2	-1	-1	-1	-1	-1	-1
3	-1	-1	-1	-1	-1	-1
4	-1	-1	-1	-1	-1	-1
...	...	...	...	...	...	...
153159	-1	-1	-1	-1	-1	-1
153160	-1	-1	-1	-1	-1	-1
153161	-1	-1	-1	-1	-1	-1
153162	-1	-1	-1	-1	-1	-1
153163	-1	-1	-1	-1	-1	-1

	id	comment_text
0	0000997932d777bf	Explanation\nWhy the edits made under my usern...
1	000103f0d9cfb60f	D'aww! He matches this background colour I'm s...
2	000113f07ec002fd	Hey man, I'm really not trying to edit war. It...
3	0001b41b1c6bb37e	"\nMore\nI can't make any real suggestions on ...
4	0001d958c54c6e35	You, sir, are my hero. Any chance you remember...

	toxic	severe_toxic	obscene	threat	insult	identity_hate
0	0	0	0	0	0	0
1	0	0	0	0	0	0
2	0	0	0	0	0	0
3	0	0	0	0	0	0
4	0	0	0	0	0	0
...	...	...	...	...	...	...
159566	0	0	0	0	0	0
159567	0	0	0	0	0	0
159568	0	0	0	0	0	0
159569	0	0	0	0	0	0
159570	0	0	0	0	0	0

	toxic	severe_toxic	obscene	threat	insult	identity_hate
6	1	1	1	0	1	0
12	1	0	0	0	0	0
16	1	0	0	0	0	0
42	1	0	1	0	1	1
43	1	0	1	0	1	0
...	...	...	...	...	...	...
159494	1	0	1	0	1	1
159514	1	0	0	0	1	0
159541	1	0	1	0	1	0
159546	1	0	0	0	1	0
159554	1	0	1	0	1	0

	id	comment_text
0	00001cee341fdb12	Yo bitch Ja Rule is more succesful then you'll...
1	0000247867823ef7	== From RfC == \n\n The title is fine as it is...
2	00013b17ad220c46	" \n\n == Sources == \n\n * Zawe Ashton on Lap...
3	00017563c3f7919a	:If you have a look back at the source, the in...
4	00017695ad8997eb	I don't anonymously edit articles at all.
