NOTE: This use case is not intended for resource-constrained devices.
Hate Speech And Offensive Language Detection¶
Credit: AITS Cainvas Community
Photo by Lucien Leyh on Dribbble
Nowadays we are well aware that social media platforms, if not handled carefully, can create chaos in the world. One of the problems these platforms face is the use of hate speech and offensive language, which often results in fights, crimes, or, at worst, riots. Detecting such language is therefore essential, and since humans cannot monitor such large volumes of data, we can use AI to detect this kind of language and prevent users from posting it.
Importing Libraries¶
In [ ]:
# Essential tools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# data preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
#NLP tools
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
# train/test split and model fitting
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
# model evaluation
from sklearn.metrics import confusion_matrix, accuracy_score
import os
Importing the Dataset¶
In [2]:
tweets_df = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/twitter_labeled_data.csv')
Data Visualization and Preprocessing¶
In [3]:
tweets_df.head()
Out[3]:
In [4]:
tweets_df = tweets_df.drop(['neither','Unnamed: 0','count','hate_speech','offensive_language'], axis= 1)
In [5]:
tweets_df.head()
Out[5]:
Adding a length column to see the length of each tweet¶
In [6]:
tweets_df['length'] = tweets_df['tweet'].apply(len)
In [7]:
tweets_df.head()
Out[7]:
In [8]:
tweets_df.describe()
Out[8]:
Segregating data on the basis of class¶
In [9]:
hatespeech = tweets_df[tweets_df['class']==0]
In [10]:
hatespeech
Out[10]:
In [11]:
offensive = tweets_df[tweets_df['class']==1]
In [12]:
offensive
Out[12]:
In [13]:
neutral = tweets_df[tweets_df['class']==2]
In [14]:
neutral
Out[14]:
Visualizing each class¶
In [15]:
sentences = hatespeech['tweet'].tolist()
len(sentences)
Out[15]:
In [16]:
sentences_as_one_string = " ".join(sentences)
In [17]:
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(sentences_as_one_string))
Out[17]:
In [18]:
sentences = offensive['tweet'].tolist()
len(sentences)
Out[18]:
In [19]:
sentences_as_one_string = " ".join(sentences)
In [20]:
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(sentences_as_one_string))
Out[20]:
In [21]:
sentences = neutral['tweet'].tolist()
len(sentences)
Out[21]:
In [22]:
sentences_as_one_string = " ".join(sentences)
In [23]:
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(sentences_as_one_string))
Out[23]:
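The three class-wise cells above repeat the same join-then-plot steps. They could be factored into one small helper (a sketch, reusing the same `tweets_df` columns as above):

```python
import pandas as pd

def class_text(df, label):
    # concatenate every tweet of one class into a single string,
    # ready to feed to WordCloud().generate(...)
    return " ".join(df.loc[df['class'] == label, 'tweet'])
```

For example, `plt.imshow(WordCloud().generate(class_text(tweets_df, 0)))` reproduces the hate-speech cloud, and the same call with `1` or `2` covers the other two classes.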
Preprocessing the tweets¶
In [24]:
import string
string.punctuation
Out[24]:
In [25]:
# Let's define a pipeline to clean up all the messages
# The pipeline performs the following: (1) remove punctuation, (2) remove stopwords
def message_cleaning(message):
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
    Test_punc_removed_join_clean_join = ' '.join(Test_punc_removed_join_clean)
    return Test_punc_removed_join_clean_join
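As a quick sanity check, the same punctuation-then-stopword cleaning can be exercised on a sample sentence. The sketch below inlines a tiny hard-coded stopword list (an assumption, standing in for NLTK's `stopwords.words('english')`) so it runs without the NLTK download:

```python
import string

# tiny stand-in for NLTK's English stopword list (assumption, for illustration only)
STOPWORDS = {'is', 'a', 'the', 'this', 'and'}

def clean(message):
    # (1) strip punctuation, (2) drop stopwords, mirroring message_cleaning above
    no_punc = ''.join(ch for ch in message if ch not in string.punctuation)
    kept = [w for w in no_punc.split() if w.lower() not in STOPWORDS]
    return ' '.join(kept)

print(clean("This is, without a doubt, THE worst tweet!"))  # → without doubt worst tweet
```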
In [26]:
# Create a new Dataframe for cleaned text
tweets_df_clean = pd.DataFrame(columns=['class', 'tweet'])
tweets_df_clean['tweet'] = tweets_df['tweet'].apply(message_cleaning)
tweets_df_clean['class'] = tweets_df['class']
In [27]:
tweets_df_clean.head()
Out[27]:
In [28]:
print(tweets_df_clean['tweet'][5]) # show the cleaned up version
print(tweets_df['tweet'][5]) # show the original version
Vectorizing the cleaned text for model training¶
In [29]:
from sklearn.feature_extraction.text import CountVectorizer
# the tweets in tweets_df_clean are already cleaned, so the default word
# analyzer is sufficient; passing message_cleaning as the analyzer would
# re-clean the text and, since it returns a string rather than a token
# list, make CountVectorizer count individual characters
vectorizer = CountVectorizer(dtype = 'uint8')
tweets_countvectorizer = vectorizer.fit_transform(tweets_df_clean['tweet']).toarray()
In [30]:
tweets_countvectorizer.shape
Out[30]:
In [31]:
X = tweets_countvectorizer
X
Out[31]:
In [32]:
y = tweets_df_clean['class']
y = pd.get_dummies(y)
y = np.array(y)
y
Out[32]:
In [33]:
X.shape
Out[33]:
In [34]:
y.shape
Out[34]:
Test-Train Split¶
In [35]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
In [36]:
X_train.shape
Out[36]:
In [37]:
total_words = 200 # input_dim for the Embedding layer below
total_words
Out[37]:
Model architecture¶
In [38]:
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import (Dense, Flatten, RepeatVector, Embedding, Input,
                                     LSTM, Conv1D, MaxPool1D, Bidirectional,
                                     GlobalAveragePooling1D)
In [39]:
# Sequential Model
model1 = Sequential()
# embedding layer
model1.add(Embedding(total_words, output_dim = 32))
model1.add(LSTM(32))
model1.add(RepeatVector(200))
model1.add(GlobalAveragePooling1D())
model1.add(Dense(32, activation='relu'))
model1.add(Dense(16, activation='relu'))
model1.add(Dense(3,activation= 'softmax'))
model1.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model1.summary()
Model Training¶
In [40]:
# train the model
history = model1.fit(X_train, y_train, batch_size = 256, validation_split = 0.1, epochs = 50)
Model accuracy can be improved; around 80% accuracy can be achieved by letting the model train a little longer.
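If training is extended as suggested, an EarlyStopping callback avoids hand-tuning the epoch count (a sketch; `patience=10` and `epochs=200` are illustrative assumptions, not values from this notebook):

```python
from tensorflow.keras.callbacks import EarlyStopping

# stop once val_loss has not improved for 10 consecutive epochs,
# and roll back to the best weights seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)

# then pass it to fit, e.g.:
# history = model1.fit(X_train, y_train, batch_size=256,
#                      validation_split=0.1, epochs=200,
#                      callbacks=[early_stop])
```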
Training Plots¶
In [41]:
# plot the training artifacts
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train_loss','val_loss'], loc = 'upper right')
plt.show()
In [42]:
# plot the training artifacts
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train_acc','val_acc'], loc = 'upper right')
plt.show()
Save the trained model¶
In [43]:
model1.save("hate_speech.h5")
Assessing the Model's Performance¶
In [44]:
model1.evaluate(X_test,y_test)
Out[44]:
In [45]:
print(tweets_df['tweet'][0])
print(tweets_df['tweet'][1])
print(tweets_df['tweet'][2])
print(tweets_df['tweet'][3])
print(tweets_df['tweet'][4])
In [46]:
# reuse the vectorizer fitted on the full corpus so the feature dimension
# matches what the model was trained on; fitting a fresh CountVectorizer on
# five tweets would yield a different vocabulary and shape
tweets_countvectorizer = vectorizer.transform(tweets_df['tweet'][:5].apply(message_cleaning)).toarray()
In [47]:
preds = model1.predict(tweets_countvectorizer)
In [48]:
preds_class = []
for i in range(len(preds)):
    preds_class.append(np.argmax(preds[i]))
preds_class = np.array(preds_class)
In [49]:
df = pd.DataFrame(columns=['Predicted Labels', 'Actual Labels'])
df['Predicted Labels'] = preds_class
df['Actual Labels'] = tweets_df['class'][:5]
df.head()
Out[49]:
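For readability, the integer predictions above can be mapped back to class names (a sketch; the 0/1/2 ordering follows the class column used throughout this notebook, with 2 being the neutral class):

```python
import numpy as np

# class ids as used in the dataset's class column
LABELS = {0: 'hate speech', 1: 'offensive language', 2: 'neutral'}

def to_names(class_ids):
    # translate an array of integer class ids into readable labels
    return [LABELS[i] for i in np.asarray(class_ids)]

print(to_names([0, 2, 1]))  # → ['hate speech', 'neutral', 'offensive language']
```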
Compiling the model with DeepC¶
In [50]:
!deepCC hate_speech.h5