
Hate Speech And Offensive Language Detection

Credit: AITS Cainvas Community

Photo by Lucien Leyh on Dribbble

Nowadays we are well aware that social media platforms, if not handled carefully, can create chaos. One of the problems these platforms face is the use of hate speech and offensive language, which often leads to fights, crimes or, at worst, riots. Detecting such language is therefore essential, and since humans cannot monitor such large volumes of data, we can use AI to detect it and prevent users from posting it.

Importing Libraries

In [ ]:
# Essential tools
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# data preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

#NLP tools
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# train-test split and candidate classifiers
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# evaluation metrics
from sklearn.metrics import confusion_matrix, accuracy_score

import os

Importing the Dataset

In [2]:
tweets_df = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/twitter_labeled_data.csv')

Data Visualization and Preprocessing

In [3]:
tweets_df.head()
Out[3]:
Unnamed: 0 count hate_speech offensive_language neither class tweet
0 0 3 0 0 3 2 !!! RT @mayasolovely: As a woman you shouldn't...
1 1 3 0 3 0 1 !!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2 2 3 0 3 0 1 !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3 3 3 0 2 1 1 !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4 4 6 0 6 0 1 !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
In [4]:
tweets_df = tweets_df.drop(['neither','Unnamed: 0','count','hate_speech','offensive_language'], axis= 1)
In [5]:
tweets_df.head()
Out[5]:
class tweet
0 2 !!! RT @mayasolovely: As a woman you shouldn't...
1 1 !!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2 1 !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3 1 !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4 1 !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...

Adding a length column to inspect the length of each tweet

In [6]:
tweets_df['length'] = tweets_df['tweet'].apply(len)
In [7]:
tweets_df.head()
Out[7]:
class tweet length
0 2 !!! RT @mayasolovely: As a woman you shouldn't... 140
1 1 !!!!! RT @mleew17: boy dats cold...tyga dwn ba... 85
2 1 !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... 120
3 1 !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... 62
4 1 !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... 137
In [8]:
tweets_df.describe()
Out[8]:
class length
count 24783.000000 24783.000000
mean 1.110277 85.436065
std 0.462089 41.548238
min 0.000000 5.000000
25% 1.000000 52.000000
50% 1.000000 81.000000
75% 1.000000 119.000000
max 2.000000 754.000000

Segregating the data by class

In [9]:
hatespeech = tweets_df[tweets_df['class']==0]
In [10]:
hatespeech
Out[10]:
class tweet length
85 0 "@Blackman38Tide: @WhaleLookyHere @HowdyDowdy1... 61
89 0 "@CB_Baby24: @white_thunduh alsarabsss" hes a ... 83
110 0 "@DevilGrimz: @VigxRArts you're fucking gay, b... 119
184 0 "@MarkRoundtreeJr: LMFAOOOO I HATE BLACK PEOPL... 117
202 0 "@NoChillPaz: "At least I'm not a nigger" http... 72
... ... ... ...
24576 0 this guy is the biggest faggot omfg 35
24685 0 which one of these names is more offensive kik... 106
24751 0 you a pussy ass nigga and I know it nigga. 42
24776 0 you're all niggers 18
24777 0 you're such a retard i hope you get type 2 dia... 106

1430 rows × 3 columns

In [11]:
offensive = tweets_df[tweets_df['class']==1]
In [12]:
offensive
Out[12]:
class tweet length
1 1 !!!!! RT @mleew17: boy dats cold...tyga dwn ba... 85
2 1 !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... 120
3 1 !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... 62
4 1 !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... 137
5 1 !!!!!!!!!!!!!!!!!!"@T_Madison_x: The shit just... 158
... ... ... ...
24774 1 you really care bout dis bitch. my dick all in... 58
24775 1 you worried bout other bitches, you need me for? 48
24778 1 you's a muthaf***in lie “@LifeAsKing: @2... 146
24780 1 young buck wanna eat!!.. dat nigguh like I ain... 67
24781 1 youu got wild bitches tellin you lies 37

19190 rows × 3 columns

In [13]:
neutral = tweets_df[tweets_df['class']==2]
In [14]:
neutral
Out[14]:
class tweet length
0 2 !!! RT @mayasolovely: As a woman you shouldn't... 140
40 2 " momma said no pussy cats inside my doghouse " 47
63 2 "@Addicted2Guys: -SimplyAddictedToGuys http://... 87
66 2 "@AllAboutManFeet: http://t.co/3gzUpfuMev" woo... 66
67 2 "@Allyhaaaaa: Lemmie eat a Oreo & do these... 69
... ... ... ...
24736 2 yaya ho.. cute avi tho RT @ViVaLa_Ari I had no... 75
24737 2 yea so about @N_tel 's new friend.. all my fri... 115
24767 2 you know what they say, the early bird gets th... 95
24779 2 you've gone and broke the wrong heart baby, an... 70
24782 2 ~~Ruffled | Ntac Eileen Dahlia - Beautiful col... 127

4163 rows × 3 columns
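
The three classes are heavily imbalanced: 1,430 hate-speech tweets, 19,190 offensive tweets and 4,163 neutral tweets, so the offensive class dominates. A quick way to confirm the distribution directly from the loaded frame (a small sketch, not one of the original cells):

# class 0 = hate speech, 1 = offensive, 2 = neither
print(tweets_df['class'].value_counts())
# a normalized view makes the imbalance explicit (roughly 6% / 77% / 17%)
print(tweets_df['class'].value_counts(normalize=True))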

Visualizing each class

In [15]:
sentences = hatespeech['tweet'].tolist()
len(sentences)
Out[15]:
1430
In [16]:
sentences_as_one_string = " ".join(sentences)
In [17]:
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(sentences_as_one_string))
Out[17]:
<matplotlib.image.AxesImage at 0x7f9650512080>
In [18]:
sentences = offensive['tweet'].tolist()
len(sentences)
Out[18]:
19190
In [19]:
sentences_as_one_string = " ".join(sentences)
In [20]:
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(sentences_as_one_string))
Out[20]:
<matplotlib.image.AxesImage at 0x7f96500019b0>
In [21]:
sentences = neutral['tweet'].tolist()
len(sentences)
Out[21]:
4163
In [22]:
sentences_as_one_string = " ".join(sentences)
In [23]:
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(sentences_as_one_string))
Out[23]:
<matplotlib.image.AxesImage at 0x7f9646a95320>
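
As an optional, purely cosmetic tweak (not part of the original cells), the word-cloud figures render a little more cleanly with the axes hidden and a title added, e.g. for the neutral class:

# hypothetical display tweak for the last word cloud generated above
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(sentences_as_one_string))
plt.axis('off')
plt.title('Most frequent words in neutral tweets')
plt.show()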

Preprocessing the tweets

In [24]:
import string
string.punctuation
Out[24]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [25]:
# Let's define a pipeline to clean up all the messages 
# The pipeline performs the following: (1) remove punctuation, (2) remove stopwords

def message_cleaning(message):
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
    Test_punc_removed_join_clean_join = ' '.join(Test_punc_removed_join_clean)
    return Test_punc_removed_join_clean_join
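
Before applying the cleaner to the whole dataset, it can help to sanity-check it on a single made-up sentence; the example string below is purely illustrative:

# quick sanity check on an invented sentence
sample = "You shouldn't be saying things like that!!!"
print(message_cleaning(sample))
# punctuation is dropped and stopwords such as "you", "be" and "that" are removed,
# leaving something like: "shouldnt saying things like"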
In [26]:
# Create a new Dataframe for cleaned text
tweets_df_clean = pd.DataFrame(columns=['class', 'tweet'])
tweets_df_clean['tweet'] = tweets_df['tweet'].apply(message_cleaning)
tweets_df_clean['class'] = tweets_df['class']
In [27]:
tweets_df_clean.head()
Out[27]:
class tweet
0 2 RT mayasolovely woman shouldnt complain cleani...
1 1 RT mleew17 boy dats coldtyga dwn bad cuffin da...
2 1 RT UrKindOfBrand Dawg RT 80sbaby4life ever fuc...
3 1 RT CGAnderson vivabased look like tranny
4 1 RT ShenikaRoberts shit hear might true might f...
In [28]:
print(tweets_df_clean['tweet'][5]) # show the cleaned up version
print(tweets_df['tweet'][5]) # show the original version
TMadisonx shit blows meclaim faithful somebody still fucking hoes 128514128514128514
!!!!!!!!!!!!!!!!!!"@T_Madison_x: The shit just blows me..claim you so faithful and down for somebody but still fucking with hoes! &#128514;&#128514;&#128514;"

Vectorizing the cleaned text for model training

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer = message_cleaning)
tweets_countvectorizer = CountVectorizer(analyzer = message_cleaning, dtype = 'uint8').fit_transform(tweets_df_clean['tweet']).toarray()
In [30]:
tweets_countvectorizer.shape
Out[30]:
(24783, 63)
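
Note that message_cleaning returns a joined string, and when CountVectorizer is given a callable analyzer it counts the items of whatever that callable returns; iterating a string yields single characters, which is why only 63 feature columns appear here instead of a word vocabulary. The results recorded below use this character-level representation. If a word-level bag-of-words were wanted instead, one option (a sketch, not what the notebook ran) is to have the analyzer return a list of tokens:

# hypothetical word-level alternative: return tokens instead of a joined string
def message_cleaning_tokens(message):
    no_punc = ''.join(ch for ch in message if ch not in string.punctuation)
    return [w for w in no_punc.split() if w.lower() not in stopwords.words('english')]

# word_vectorizer = CountVectorizer(analyzer=message_cleaning_tokens, dtype='uint8')
# X_words = word_vectorizer.fit_transform(tweets_df_clean['tweet']).toarray()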
In [31]:
X = tweets_countvectorizer
X
Out[31]:
array([[11,  0,  0, ...,  0,  3,  0],
       [11,  0,  2, ...,  0,  2,  0],
       [11,  1,  0, ...,  0,  2,  0],
       ...,
       [ 9,  0,  0, ...,  0,  1,  0],
       [ 5,  0,  0, ...,  0,  1,  0],
       [13,  1,  0, ...,  0,  1,  0]], dtype=uint8)
In [32]:
y = tweets_df_clean['class']
y = pd.get_dummies(y)
y = np.array(y)
y
Out[32]:
array([[0, 0, 1],
       [0, 1, 0],
       [0, 1, 0],
       ...,
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 1]], dtype=uint8)
In [33]:
X.shape
Out[33]:
(24783, 63)
In [34]:
y.shape
Out[34]:
(24783, 3)

Train-Test Split

In [35]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
In [36]:
X_train.shape
Out[36]:
(22304, 63)
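
The split above is re-randomised on every run, so the reported shapes and metrics will vary slightly between executions. A minor, hypothetical tweak for reproducibility is to fix the random seed:

# reproducible variant of the split used above (seed value is arbitrary)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)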
In [37]:
total_words = 200  # assumed vocabulary size: upper bound on the integer count values fed to the Embedding layer
total_words
Out[37]:
200

Model architecture

In [38]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten,RepeatVector, Embedding, Input, LSTM, Conv1D, MaxPool1D, Bidirectional
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, Dense, GlobalAveragePooling1D
In [39]:
# Sequential Model
model1 = Sequential()

# embedding layer
model1.add(Embedding(total_words, output_dim = 32))
model1.add(LSTM(32))
model1.add(RepeatVector(200))
model1.add(GlobalAveragePooling1D())
model1.add(Dense(32, activation='relu'))
model1.add(Dense(16, activation='relu'))

model1.add(Dense(3,activation= 'softmax'))
model1.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model1.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 32)          6400      
_________________________________________________________________
lstm (LSTM)                  (None, 32)                8320      
_________________________________________________________________
repeat_vector (RepeatVector) (None, 200, 32)           0         
_________________________________________________________________
global_average_pooling1d (Gl (None, 32)                0         
_________________________________________________________________
dense (Dense)                (None, 32)                1056      
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 51        
=================================================================
Total params: 16,355
Trainable params: 16,355
Non-trainable params: 0
_________________________________________________________________
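
The parameter counts in the summary can be checked by hand: the Embedding layer stores total_words × 32 weights, the LSTM stores 4(n·m + m² + m) weights for input size n = 32 and m = 32 units, and each Dense layer stores in × out + out. A quick arithmetic check (no new model involved):

embedding = 200 * 32                 # 6400
lstm = 4 * (32 * 32 + 32 * 32 + 32)  # 8320
dense = 32 * 32 + 32                 # 1056
dense_1 = 32 * 16 + 16               # 528
dense_2 = 16 * 3 + 3                 # 51
assert embedding + lstm + dense + dense_1 + dense_2 == 16355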

Model Training

In [40]:
# train the model
history = model1.fit(X_train, y_train, batch_size = 256, validation_split = 0.1, epochs = 50)
Epoch 1/50
79/79 [==============================] - 1s 11ms/step - loss: 0.7696 - acc: 0.7588 - val_loss: 0.6580 - val_acc: 0.7763
Epoch 2/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6622 - acc: 0.7740 - val_loss: 0.6578 - val_acc: 0.7763
Epoch 3/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6593 - acc: 0.7740 - val_loss: 0.6493 - val_acc: 0.7763
Epoch 4/50
79/79 [==============================] - 0s 5ms/step - loss: 0.6511 - acc: 0.7740 - val_loss: 0.6464 - val_acc: 0.7763
Epoch 5/50
79/79 [==============================] - 0s 5ms/step - loss: 0.6481 - acc: 0.7741 - val_loss: 0.6472 - val_acc: 0.7759
Epoch 6/50
79/79 [==============================] - 0s 5ms/step - loss: 0.6472 - acc: 0.7745 - val_loss: 0.6478 - val_acc: 0.7759
Epoch 7/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6470 - acc: 0.7756 - val_loss: 0.6462 - val_acc: 0.7741
Epoch 8/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6472 - acc: 0.7745 - val_loss: 0.6468 - val_acc: 0.7723
Epoch 9/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6466 - acc: 0.7762 - val_loss: 0.6499 - val_acc: 0.7763
Epoch 10/50
79/79 [==============================] - 0s 5ms/step - loss: 0.6460 - acc: 0.7759 - val_loss: 0.6452 - val_acc: 0.7754
Epoch 11/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6446 - acc: 0.7772 - val_loss: 0.6447 - val_acc: 0.7745
Epoch 12/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6441 - acc: 0.7772 - val_loss: 0.6440 - val_acc: 0.7768
Epoch 13/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6440 - acc: 0.7767 - val_loss: 0.6431 - val_acc: 0.7750
Epoch 14/50
79/79 [==============================] - 0s 5ms/step - loss: 0.6419 - acc: 0.7785 - val_loss: 0.6411 - val_acc: 0.7786
Epoch 15/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6412 - acc: 0.7778 - val_loss: 0.6458 - val_acc: 0.7759
Epoch 16/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6393 - acc: 0.7790 - val_loss: 0.6389 - val_acc: 0.7777
Epoch 17/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6348 - acc: 0.7803 - val_loss: 0.6382 - val_acc: 0.7732
Epoch 18/50
79/79 [==============================] - 0s 5ms/step - loss: 0.6339 - acc: 0.7813 - val_loss: 0.6381 - val_acc: 0.7777
Epoch 19/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6300 - acc: 0.7799 - val_loss: 0.6314 - val_acc: 0.7772
Epoch 20/50
79/79 [==============================] - 0s 5ms/step - loss: 0.6263 - acc: 0.7810 - val_loss: 0.6310 - val_acc: 0.7759
Epoch 21/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6195 - acc: 0.7824 - val_loss: 0.6274 - val_acc: 0.7790
Epoch 22/50
79/79 [==============================] - 0s 5ms/step - loss: 0.6175 - acc: 0.7819 - val_loss: 0.6235 - val_acc: 0.7786
Epoch 23/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6127 - acc: 0.7834 - val_loss: 0.6172 - val_acc: 0.7786
Epoch 24/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6108 - acc: 0.7831 - val_loss: 0.6146 - val_acc: 0.7813
Epoch 25/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6098 - acc: 0.7823 - val_loss: 0.6121 - val_acc: 0.7835
Epoch 26/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6076 - acc: 0.7838 - val_loss: 0.6119 - val_acc: 0.7813
Epoch 27/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6051 - acc: 0.7848 - val_loss: 0.6111 - val_acc: 0.7826
Epoch 28/50
79/79 [==============================] - 0s 5ms/step - loss: 0.6039 - acc: 0.7840 - val_loss: 0.6089 - val_acc: 0.7835
Epoch 29/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6064 - acc: 0.7826 - val_loss: 0.6061 - val_acc: 0.7826
Epoch 30/50
79/79 [==============================] - 0s 6ms/step - loss: 0.6047 - acc: 0.7848 - val_loss: 0.6094 - val_acc: 0.7799
Epoch 31/50
79/79 [==============================] - 0s 5ms/step - loss: 0.6017 - acc: 0.7859 - val_loss: 0.6043 - val_acc: 0.7817
Epoch 32/50
79/79 [==============================] - 0s 5ms/step - loss: 0.6005 - acc: 0.7840 - val_loss: 0.6083 - val_acc: 0.7822
Epoch 33/50
79/79 [==============================] - 0s 6ms/step - loss: 0.5992 - acc: 0.7851 - val_loss: 0.6038 - val_acc: 0.7759
Epoch 34/50
79/79 [==============================] - 0s 6ms/step - loss: 0.5979 - acc: 0.7856 - val_loss: 0.6074 - val_acc: 0.7826
Epoch 35/50
79/79 [==============================] - 0s 6ms/step - loss: 0.5986 - acc: 0.7850 - val_loss: 0.6031 - val_acc: 0.7822
Epoch 36/50
79/79 [==============================] - 0s 6ms/step - loss: 0.5958 - acc: 0.7864 - val_loss: 0.6016 - val_acc: 0.7848
Epoch 37/50
79/79 [==============================] - 0s 5ms/step - loss: 0.5953 - acc: 0.7857 - val_loss: 0.6019 - val_acc: 0.7781
Epoch 38/50
79/79 [==============================] - 0s 5ms/step - loss: 0.5937 - acc: 0.7874 - val_loss: 0.5994 - val_acc: 0.7795
Epoch 39/50
79/79 [==============================] - 0s 6ms/step - loss: 0.5936 - acc: 0.7866 - val_loss: 0.5970 - val_acc: 0.7857
Epoch 40/50
79/79 [==============================] - 0s 5ms/step - loss: 0.5924 - acc: 0.7866 - val_loss: 0.6001 - val_acc: 0.7826
Epoch 41/50
79/79 [==============================] - 0s 5ms/step - loss: 0.5919 - acc: 0.7866 - val_loss: 0.5954 - val_acc: 0.7853
Epoch 42/50
79/79 [==============================] - 0s 5ms/step - loss: 0.5920 - acc: 0.7882 - val_loss: 0.5963 - val_acc: 0.7848
Epoch 43/50
79/79 [==============================] - 0s 6ms/step - loss: 0.5893 - acc: 0.7873 - val_loss: 0.6008 - val_acc: 0.7813
Epoch 44/50
79/79 [==============================] - 0s 6ms/step - loss: 0.5881 - acc: 0.7892 - val_loss: 0.5944 - val_acc: 0.7804
Epoch 45/50
79/79 [==============================] - 0s 5ms/step - loss: 0.5877 - acc: 0.7873 - val_loss: 0.5971 - val_acc: 0.7862
Epoch 46/50
79/79 [==============================] - 0s 6ms/step - loss: 0.5856 - acc: 0.7893 - val_loss: 0.6001 - val_acc: 0.7840
Epoch 47/50
79/79 [==============================] - 0s 5ms/step - loss: 0.5853 - acc: 0.7881 - val_loss: 0.5913 - val_acc: 0.7866
Epoch 48/50
79/79 [==============================] - 0s 6ms/step - loss: 0.5828 - acc: 0.7895 - val_loss: 0.5931 - val_acc: 0.7853
Epoch 49/50
79/79 [==============================] - 0s 6ms/step - loss: 0.5831 - acc: 0.7884 - val_loss: 0.5931 - val_acc: 0.7853
Epoch 50/50
79/79 [==============================] - 0s 6ms/step - loss: 0.5810 - acc: 0.7905 - val_loss: 0.5974 - val_acc: 0.7898

Model accuracy can be improved; roughly 80% accuracy can be reached by letting the model train a little longer.
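
One way to train longer without having to pick the epoch count by hand is Keras' EarlyStopping callback. The sketch below is an optional addition, not part of the recorded run, and the epoch limit and patience values are arbitrary:

from tensorflow.keras.callbacks import EarlyStopping

# stop once validation loss stops improving and keep the best weights seen
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
history = model1.fit(X_train, y_train, batch_size=256, validation_split=0.1,
                     epochs=200, callbacks=[early_stop])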

Training Plots

In [41]:
# plot the training artifacts

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train_loss','val_loss'], loc = 'upper right')
plt.show()
In [42]:
# plot the training artifacts

plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train_acc','val_acc'], loc = 'upper right')
plt.show()

Save the trained model

In [43]:
model1.save("hate_speech.h5")
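
The saved HDF5 file can be restored later for inference; a minimal sketch, assuming the same TensorFlow/Keras environment:

from tensorflow.keras.models import load_model

reloaded = load_model("hate_speech.h5")
reloaded.summary()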

Assessing the Model's Performance

In [44]:
model1.evaluate(X_test,y_test)
78/78 [==============================] - 0s 2ms/step - loss: 0.6026 - acc: 0.7838
Out[44]:
[0.6025590896606445, 0.7837837934494019]
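
Overall accuracy hides per-class behaviour, which matters given the class imbalance seen earlier. A sketch of a per-class check on the held-out set, reusing the confusion_matrix and accuracy_score imports from the top of the notebook (not one of the original cells):

# convert one-hot targets and softmax outputs back to class indices
y_pred = np.argmax(model1.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)
print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))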
In [45]:
print(tweets_df['tweet'][0])
print(tweets_df['tweet'][1])
print(tweets_df['tweet'][2])
print(tweets_df['tweet'][3])
print(tweets_df['tweet'][4])
!!! RT @mayasolovely: As a woman you shouldn't complain about cleaning up your house. &amp; as a man you should always take the trash out...
!!!!! RT @mleew17: boy dats cold...tyga dwn bad for cuffin dat hoe in the 1st place!!
!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby4life: You ever fuck a bitch and she start to cry? You be confused as shit
!!!!!!!!! RT @C_G_Anderson: @viva_based she look like a tranny
!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you hear about me might be true or it might be faker than the bitch who told it to ya &#57361;
In [46]:
tweets_countvectorizer = CountVectorizer(analyzer = message_cleaning, dtype = 'uint8').fit_transform(tweets_df['tweet'][:5]).toarray()
In [47]:
preds = model1.predict(tweets_countvectorizer)
In [48]:
preds_class = []
for i in range(len(preds)):
    preds_class.append(np.argmax(preds[i]))
preds_class = np.array(preds_class) 
In [49]:
df = pd.DataFrame(columns=['Predicted Labels', 'Actual Labels'])
df['Predicted Labels'] = preds_class
df['Actual Labels'] = tweets_df['class'][:5]
df.head()
Out[49]:
Predicted Labels Actual Labels
0 1 2
1 1 1
2 1 1
3 1 1
4 1 1
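
Note that cell In [46] fits a fresh CountVectorizer on just the five sample tweets, so its feature space is not guaranteed to match the one the model was trained on. A safer pattern (a sketch, not what the notebook ran) is to fit the vectorizer once on the full corpus and reuse it for any new text:

# hypothetical consistent-vectorizer pattern
vec = CountVectorizer(analyzer=message_cleaning, dtype='uint8')
vec.fit(tweets_df_clean['tweet'])            # fit once, on the training corpus

new_features = vec.transform(tweets_df['tweet'][:5]).toarray()
new_preds = np.argmax(model1.predict(new_features), axis=1)
print(new_preds)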

Compiling the model with DeepC

In [50]:
!deepCC hate_speech.h5