Cainvas

Toxic Comment Detection

Credit: AITS Cainvas Community

Photo by Daniel Montero on Dribbble

  • Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.

Setup: Installing necessary libraries

In [1]:
!pip install matplotlib-venn

Importing Libraries

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
from tensorflow import keras

# Visualization
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
import matplotlib_venn as venn

# Settings
color = sns.color_palette()
sns.set_style("dark")
%matplotlib inline

Downloading and Unzipping the Dataset

In [4]:
!wget -N "https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/toxic_comment.zip"
!unzip -oq toxic_comment.zip 
!rm toxic_comment.zip
--2021-06-28 10:11:17--  https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/toxic_comment.zip
Resolving cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)... 52.219.160.55
Connecting to cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)|52.219.160.55|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 55201987 (53M) [application/zip]
Saving to: ‘toxic_comment.zip’

toxic_comment.zip   100%[===================>]  52.64M   104MB/s    in 0.5s    

2021-06-28 10:11:17 (104 MB/s) - ‘toxic_comment.zip’ saved [55201987/55201987]

Data Pre-Processing and Visualization:

In [5]:
train_data = pd.read_csv("train.csv.zip")
train_data.head()
Out[5]:
id comment_text toxic severe_toxic obscene threat insult identity_hate
0 0000997932d777bf Explanation\nWhy the edits made under my usern... 0 0 0 0 0 0
1 000103f0d9cfb60f D'aww! He matches this background colour I'm s... 0 0 0 0 0 0
2 000113f07ec002fd Hey man, I'm really not trying to edit war. It... 0 0 0 0 0 0
3 0001b41b1c6bb37e "\nMore\nI can't make any real suggestions on ... 0 0 0 0 0 0
4 0001d958c54c6e35 You, sir, are my hero. Any chance you remember... 0 0 0 0 0 0
In [6]:
X_train = train_data["comment_text"]
In [7]:
X_train
Out[7]:
0         Explanation\nWhy the edits made under my usern...
1         D'aww! He matches this background colour I'm s...
2         Hey man, I'm really not trying to edit war. It...
3         "\nMore\nI can't make any real suggestions on ...
4         You, sir, are my hero. Any chance you remember...
                                ...                        
159566    ":::::And for the second time of asking, when ...
159567    You should be ashamed of yourself \n\nThat is ...
159568    Spitzer \n\nUmm, theres no actual article for ...
159569    And it looks like it was actually you who put ...
159570    "\nAnd ... I really don't think you understand...
Name: comment_text, Length: 159571, dtype: object
In [8]:
y_train = train_data.iloc[:, 2:]
y_train
Out[8]:
toxic severe_toxic obscene threat insult identity_hate
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
... ... ... ... ... ... ...
159566 0 0 0 0 0 0
159567 0 0 0 0 0 0
159568 0 0 0 0 0 0
159569 0 0 0 0 0 0
159570 0 0 0 0 0 0

159571 rows × 6 columns

In [9]:
y_train[y_train['toxic'] == 1]
Out[9]:
toxic severe_toxic obscene threat insult identity_hate
6 1 1 1 0 1 0
12 1 0 0 0 0 0
16 1 0 0 0 0 0
42 1 0 1 0 1 1
43 1 0 1 0 1 0
... ... ... ... ... ... ...
159494 1 0 1 0 1 1
159514 1 0 0 0 1 0
159541 1 0 1 0 1 0
159546 1 0 0 0 1 0
159554 1 0 1 0 1 0

15294 rows × 6 columns

Checking the number of comments per label

In [10]:
cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Number of comments carrying each label (a comment can carry several)
val_counts = y_train[cols].sum()

plt.figure(figsize=(8,5))
# Pass x and y as keyword arguments; from seaborn 0.12 only `data` may be positional
ax = sns.barplot(x=val_counts.index, y=val_counts.values, alpha=0.8)

plt.title("Comments per Class")
plt.xlabel("Comment Type")
plt.ylabel("Number of Comments")

# Annotate each bar with its count
for rect, label in zip(ax.patches, val_counts.values):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height+5, label, ha="center", va="bottom")

plt.show()
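
The per-label totals in the bar chart come from summing each label column. Because one comment can carry several labels, those totals do not add up to the number of flagged comments. Below is a minimal plain-Python sketch of the same counting, using the ten label rows printed in the outputs above (rows 0 to 4 and rows 6, 12, 16, 42, 43). The names `rows`, `per_label`, `clean`, and `multi` are illustrative and not part of the notebook.

```python
cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Label rows copied from the outputs above (row index -> six 0/1 flags).
rows = {
    0: [0, 0, 0, 0, 0, 0],
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 0, 0, 0],
    3: [0, 0, 0, 0, 0, 0],
    4: [0, 0, 0, 0, 0, 0],
    6: [1, 1, 1, 0, 1, 0],
    12: [1, 0, 0, 0, 0, 0],
    16: [1, 0, 0, 0, 0, 0],
    42: [1, 0, 1, 0, 1, 1],
    43: [1, 0, 1, 0, 1, 0],
}

# Column sums: how many comments carry each label.
per_label = {c: sum(r[i] for r in rows.values()) for i, c in enumerate(cols)}

# Row sums: how many labels each comment carries.
labels_per_row = {idx: sum(r) for idx, r in rows.items()}
clean = [idx for idx, n in labels_per_row.items() if n == 0]
multi = [idx for idx, n in labels_per_row.items() if n > 1]

print(per_label)
print("no label:", clean)
print("more than one label:", multi)
```

In the full training set, 15294 of 159571 comments (about 9.6%) are marked toxic, so the classes are heavily imbalanced and a model that flags nothing can still score a high accuracy.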
In [11]:
# Join all comments into one string for the word cloud
words = ' '.join(X_train)


word_cloud = WordCloud(
                       width=1600,
                       height=800,
                       #colormap='PuRd', 
                       margin=0,
                       max_words=500, # Maximum number of words to show
                       min_word_length=3, # Minimum number of letters a word needs to be included
                       max_font_size=150, min_font_size=30,  # Font size range
                       background_color="white").generate(words)

plt.figure(figsize=(10, 16))
plt.imshow(word_cloud, interpolation="gaussian")
plt.title('Comments and their Nature', fontsize = 40)
plt.axis("off")
plt.show()

Tokenization

In [12]:
tokenizer = keras.preprocessing.text.Tokenizer()
In [13]:
tokenizer.fit_on_texts(X_train)
In [14]:
X_train = tokenizer.texts_to_sequences(X_train)
In [15]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
X_train = pad_sequences(X_train, maxlen=100)
In [16]:
X_train.shape
Out[16]:
(159571, 100)
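
To make the tokenization step less opaque, here is a rough plain-Python sketch of what the three calls above do: build a vocabulary ranked by word frequency (index 1 is the most frequent word), replace each word by its index, then left-pad with zeros or keep only the last `maxlen` tokens. It is a simplification under stated assumptions: it only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also strips punctuation. The helper names `fit_vocab`, `to_sequence`, and `pad` are hypothetical.

```python
from collections import Counter

def fit_vocab(texts):
    """Map each word to an integer rank: 1 = most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, vocab):
    """Replace known words by their index; words outside the vocabulary are dropped."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

def pad(seq, maxlen):
    """Keep the last `maxlen` items and left-pad with 0 (the pad_sequences defaults)."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

toy_texts = ["you are my hero", "you are not my hero at all"]
vocab = fit_vocab(toy_texts)
padded = [pad(to_sequence(t, vocab), maxlen=5) for t in toy_texts]
print(padded)
```

The short sentence gains a leading zero and the long one loses its first two tokens, which is why every row of `X_train` ends up with exactly 100 entries.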

Model creation and Training

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2)
In [18]:
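model = keras.Sequential([
    # Dense layer applied directly to the 100 padded token IDs (no embedding layer)
    keras.layers.Dense(20, activation="tanh"),
    # One output per label; softmax makes the six outputs sum to 1
    keras.layers.Dense(6, activation="softmax")
])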
In [19]:
model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
In [20]:
model_history = model.fit(X_train, y_train, epochs=5, validation_data=(X_val, y_val))
Epoch 1/5
3990/3990 [==============================] - 7s 2ms/step - loss: 0.3699 - accuracy: 0.7902 - val_loss: 0.3554 - val_accuracy: 0.9741
Epoch 2/5
3990/3990 [==============================] - 7s 2ms/step - loss: 0.3624 - accuracy: 0.9792 - val_loss: 0.3565 - val_accuracy: 0.9842
Epoch 3/5
3990/3990 [==============================] - 7s 2ms/step - loss: 0.3625 - accuracy: 0.9856 - val_loss: 0.3545 - val_accuracy: 0.9856
Epoch 4/5
3990/3990 [==============================] - 7s 2ms/step - loss: 0.3625 - accuracy: 0.9862 - val_loss: 0.3499 - val_accuracy: 0.9865
Epoch 5/5
3990/3990 [==============================] - 7s 2ms/step - loss: 0.3624 - accuracy: 0.9878 - val_loss: 0.3546 - val_accuracy: 0.9856
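
A comment in this dataset can carry several labels at once (row 6 in the training output is toxic, severe_toxic, obscene, and insult), so the task is multi-label. Softmax spreads a single unit of probability across the six outputs and therefore cannot give several labels a high score together. A common alternative for multi-label problems is an independent sigmoid per label combined with binary cross-entropy. The sketch below illustrates the difference in plain Python; the `logits` values are made up for illustration and are not outputs of the model above.

```python
import math

def softmax(logits):
    """Probabilities that compete with each other and sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Independent probability for a single label."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of per-label log losses; each label is scored on its own."""
    losses = []
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

# Labels of row 6 in the training data: toxic, severe_toxic, obscene, insult.
y_true = [1, 1, 1, 0, 1, 0]
logits = [4.0, 3.0, 4.0, -4.0, 3.5, -3.0]  # hypothetical model outputs

soft = softmax(logits)
sig = [sigmoid(z) for z in logits]

print("softmax:", [round(p, 2) for p in soft], "sum =", round(sum(soft), 6))
print("sigmoid:", [round(p, 2) for p in sig])
print("binary cross-entropy:", binary_cross_entropy(y_true, sig))
```

In Keras terms this would correspond to `activation="sigmoid"` on the last layer and `loss="binary_crossentropy"`. That is a suggested variation, not the configuration that produced the training log above.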
In [21]:
# Plot training and validation accuracy values
In [22]:
plt.plot(model_history.history['accuracy'])
plt.plot(model_history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

Save model

In [23]:
model.save('final_model.h5')

Evaluation

In [24]:
test_data = pd.read_csv('test.csv.zip')
In [25]:
test_data.head()
Out[25]:
id comment_text
0 00001cee341fdb12 Yo bitch Ja Rule is more succesful then you'll...
1 0000247867823ef7 == From RfC == \n\n The title is fine as it is...
2 00013b17ad220c46 " \n\n == Sources == \n\n * Zawe Ashton on Lap...
3 00017563c3f7919a :If you have a look back at the source, the in...
4 00017695ad8997eb I don't anonymously edit articles at all.
In [26]:
X_test = test_data['comment_text']
In [27]:
X_test
Out[27]:
0         Yo bitch Ja Rule is more succesful then you'll...
1         == From RfC == \n\n The title is fine as it is...
2         " \n\n == Sources == \n\n * Zawe Ashton on Lap...
3         :If you have a look back at the source, the in...
4                 I don't anonymously edit articles at all.
                                ...                        
153159    . \n i totally agree, this stuff is nothing bu...
153160    == Throw from out field to home plate. == \n\n...
153161    " \n\n == Okinotorishima categories == \n\n I ...
153162    " \n\n == ""One of the founding nations of the...
153163    " \n :::Stop already. Your bullshit is not wel...
Name: comment_text, Length: 153164, dtype: object
In [28]:
type(test_data['comment_text'])
Out[28]:
pandas.core.series.Series
In [29]:
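# Caveat: this fits a second tokenizer on the test text, so its word indices
# differ from the ones the model was trained on. Reusing `tokenizer` from the
# training step would keep the indices consistent.
tokenizer2 = keras.preprocessing.text.Tokenizer()
tokenizer2.fit_on_texts(X_test)
X_test = tokenizer2.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen=100)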
In [30]:
X_test[0]
Out[30]:
array([     0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,      0,      0,      0,      0,      0,      0,
            0,      0,   1614,    227,   3383,    812,      8,     56,
        32384,     82,    884,    337,     16,   3782,     69,     20,
            6,      5,   5515,      6,   1585, 106851,      7,     54,
          227,   6234,   1190, 106852,    500,   5001,      5,     93,
            6,      2,   2999,     32,    279,      6,    762,  29767,
           42,   3383,    812,      8,     35,   4342,     10,    737,
          636,    348,    507,  15299,      9,    171,     15,    158,
            5,  15732,      8,    253,  19272,     44,   2607,     52,
           24,      3,   2225,    154,   1973,    500,   2110,     93,
          219,    144,    486,     84], dtype=int32)
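
A tokenizer assigns indices by word frequency within the text it was fitted on, so two tokenizers fitted on different text generally give the same word different indices. The model only knows the training indices, which is why test text is normally encoded with the tokenizer fitted on the training data. Here is a small plain-Python illustration; `fit_vocab` is a hypothetical stand-in for `Tokenizer.fit_on_texts`, and the toy sentences are invented.

```python
from collections import Counter

def fit_vocab(texts):
    """Rank words by frequency: index 1 is the most frequent word."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

train_texts = ["the edit was fine", "the title is fine"]
test_texts = ["stop the war", "stop the edit already"]

train_vocab = fit_vocab(train_texts)
test_vocab = fit_vocab(test_texts)

# The same words land on different indices in the two vocabularies.
print("the :", train_vocab["the"], "vs", test_vocab["the"])
print("edit:", train_vocab["edit"], "vs", test_vocab["edit"])

# Encoding test text with the training vocabulary keeps indices consistent;
# words never seen in training ("stop") are dropped.
encoded = [train_vocab[w] for w in "stop the edit".split() if w in train_vocab]
print(encoded)
```

In the notebook, the equivalent change would be calling `tokenizer.texts_to_sequences(X_test)` with the tokenizer fitted during training instead of fitting `tokenizer2`.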
In [31]:
y_test = pd.read_csv('test_labels.csv.zip')
y_test = y_test.iloc[:, 1:]
y_test
Out[31]:
toxic severe_toxic obscene threat insult identity_hate
0 -1 -1 -1 -1 -1 -1
1 -1 -1 -1 -1 -1 -1
2 -1 -1 -1 -1 -1 -1
3 -1 -1 -1 -1 -1 -1
4 -1 -1 -1 -1 -1 -1
... ... ... ... ... ... ...
153159 -1 -1 -1 -1 -1 -1
153160 -1 -1 -1 -1 -1 -1
153161 -1 -1 -1 -1 -1 -1
153162 -1 -1 -1 -1 -1 -1
153163 -1 -1 -1 -1 -1 -1

153164 rows × 6 columns

In [32]:
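# Evaluate the model on the test data using `evaluate`
# Note: y_test still contains the -1 placeholder rows shown above, which is
# why the reported loss is negative
print("Evaluate on test data")
results = model.evaluate(X_test, y_test, batch_size=128)
print("test accuracy:", results[1])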
Evaluate on test data
1197/1197 [==============================] - 2s 1ms/step - loss: -6.1557 - accuracy: 0.9889
test accuracy: 0.988926887512207
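
The negative loss is a sign that the numbers above should not be read as real test performance. The label table contains -1 values, which are not valid 0/1 targets; to my understanding of the original Kaggle release, -1 marks rows that were not used for scoring. Cross-entropy multiplies each target by a log-probability, so a target of -1 flips the sign of the loss. The sketch below shows that effect and one way to keep only the rows with real labels. The helper names and the toy label table are illustrative.

```python
import math

def categorical_cross_entropy(y_true, y_prob):
    """Loss for one row: -sum(y * log(p)). Assumes every p is > 0."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_prob))

def scored_rows(labels):
    """Indices of rows whose six labels are all valid 0/1 values."""
    return [i for i, row in enumerate(labels) if all(v in (0, 1) for v in row)]

uniform = [1 / 6] * 6  # hypothetical prediction with equal weight per label

# A placeholder row of -1 targets produces a negative loss.
print(categorical_cross_entropy([-1] * 6, uniform))

# Toy label table: rows 0 and 3 are placeholders, rows 1 and 2 carry real labels.
toy_labels = [[-1] * 6, [0, 0, 0, 0, 0, 0], [1, 0, 1, 0, 1, 0], [-1] * 6]
keep = scored_rows(toy_labels)
print(keep)
```

With pandas, the same filter could be written as `mask = (y_test != -1).all(axis=1)`, followed by evaluating on `X_test[mask.values]` and `y_test[mask]`. This is a suggestion and was not run as part of the notebook.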