Cainvas

Hate Speech And Offensive Language Detection

Credit: AITS Cainvas Community

Photo by Lucien Leyh on Dribbble

Social media platforms can cause real harm when they are not moderated carefully. One recurring problem on these platforms is the use of hate speech and offensive language, which can lead to fights, crimes, or, at worst, riots. Detecting such language is therefore essential. Because humans cannot monitor such large volumes of data, we can use AI to detect this language and help prevent users from posting it.

Importing Libraries

In [ ]:
# Essential tools
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Data preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

#NLP tools
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# Train/test split and models
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Model evaluation
from sklearn.metrics import confusion_matrix, accuracy_score

import os

Importing the Dataset

In [2]:
tweets_df = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/twitter_labeled_data.csv')

Data Visualization and Preprocessing

In [3]:
tweets_df.head()
Out[3]:
   Unnamed: 0  count  hate_speech  offensive_language  neither  class  tweet
0           0      3            0                   0        3      2  !!! RT @mayasolovely: As a woman you shouldn't...
1           1      3            0                   3        0      1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2           2      3            0                   3        0      1  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3           3      3            0                   2        1      1  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4           4      6            0                   6        0      1  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
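
The `hate_speech`, `offensive_language` and `neither` columns look like per-annotator vote counts, with `class` recording the label that received the most votes. That reading is an assumption based on the five preview rows above, not something the notebook states. A minimal sketch that checks it on those rows only (the `column_to_class` mapping is inferred from how the classes are split later in the notebook):

```python
import pandas as pd

# The five preview rows from tweets_df.head(): vote columns and label only.
preview = pd.DataFrame({
    "hate_speech":        [0, 0, 0, 0, 0],
    "offensive_language": [0, 3, 3, 2, 6],
    "neither":            [3, 0, 0, 1, 0],
    "class":              [2, 1, 1, 1, 1],
})

# Assumed mapping from vote column to class id (inferred, not documented here).
column_to_class = {"hate_speech": 0, "offensive_language": 1, "neither": 2}

# For each row, pick the column with the most votes and translate it to a class id.
vote_columns = list(column_to_class)
majority_class = preview[vote_columns].idxmax(axis=1).map(column_to_class)

# True only if the majority vote reproduces the stored label on every row.
all_match = bool((majority_class == preview["class"]).all())
print(all_match)
```

If the same check holds on the full dataset, the vote columns carry no information beyond `class` for a single-label classifier, which would be consistent with dropping them.
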
In [4]:
tweets_df = tweets_df.drop(['neither', 'Unnamed: 0', 'count', 'hate_speech', 'offensive_language'], axis=1)
In [5]:
tweets_df.head()
Out[5]:
   class  tweet
0      2  !!! RT @mayasolovely: As a woman you shouldn't...
1      1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2      1  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3      1  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4      1  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...

Adding a length column to see the length of each tweet

In [6]:
tweets_df['length'] = tweets_df['tweet'].apply(len)
In [7]:
tweets_df.head()
Out[7]:
   class  tweet                                              length
0      2  !!! RT @mayasolovely: As a woman you shouldn't...     140
1      1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...      85
2      1  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...     120
3      1  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...      62
4      1  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...     137
In [8]:
tweets_df.describe()
Out[8]:
              class        length
count  24783.000000  24783.000000
mean       1.110277     85.436065
std        0.462089     41.548238
min        0.000000      5.000000
25%        1.000000     52.000000
50%        1.000000     81.000000
75%        1.000000    119.000000
max        2.000000    754.000000
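
The maximum length of 754 is far above the 119 characters at the 75th percentile. One possible contributor, offered here as an assumption rather than a finding, is that some tweets store characters as HTML entities such as `&amp;`, and `len` counts every character of the entity. A small sketch with a made-up sample string (not taken from the dataset) shows the effect:

```python
import html


def raw_and_unescaped_length(text):
    """Return (length as stored, length after decoding HTML entities)."""
    return len(text), len(html.unescape(text))


# Hypothetical sample string, for illustration only.
sample = "salt &amp; pepper &#8220;quoted&#8221;"
raw_length, clean_length = raw_and_unescaped_length(sample)
print(raw_length, clean_length)
```

Whether to decode entities before measuring length or building features is a design choice; the notebook measures the raw text as stored.
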

Segregating data on the basis of class

In [9]:
hatespeech = tweets_df[tweets_df['class']==0]
In [10]:
hatespeech
Out[10]:
class tweet length
85 0 "@Blackman38Tide: @WhaleLookyHere @HowdyDowdy1... 61
89 0 "@CB_Baby24: @white_thunduh alsarabsss" hes a ... 83
110 0 "@DevilGrimz: @VigxRArts you're fucking gay, b... 119
184 0 "@MarkRoundtreeJr: LMFAOOOO I HATE BLACK PEOPL... 117
202 0 "@NoChillPaz: "At least I'm not a nigger" http... 72
... ... ... ...
24576 0 this guy is the biggest faggot omfg 35
24685 0 which one of these names is more offensive kik... 106
24751 0 you a pussy ass nigga and I know it nigga. 42
24776 0 you're all niggers 18
24777 0 you're such a retard i hope you get type 2 dia... 106

1430 rows × 3 columns

In [11]:
offensive = tweets_df[tweets_df['class']==1]
In [12]:
offensive
Out[12]:
class tweet length
1 1 !!!!! RT @mleew17: boy dats cold...tyga dwn ba... 85
2 1 !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... 120
3 1 !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... 62
4 1 !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... 137
5 1 !!!!!!!!!!!!!!!!!!"@T_Madison_x: The shit just... 158
... ... ... ...
24774 1 you really care bout dis bitch. my dick all in... 58
24775 1 you worried bout other bitches, you need me for? 48
24778 1 you's a muthaf***in lie “@LifeAsKing: @2... 146
24780 1 young buck wanna eat!!.. dat nigguh like I ain... 67
24781 1 youu got wild bitches tellin you lies 37

19190 rows × 3 columns

In [13]:
neutral = tweets_df[tweets_df['class']==2]
In [14]:
neutral
Out[14]:
class tweet length
0 2 !!! RT @mayasolovely: As a woman you shouldn't... 140
40 2 " momma said no pussy cats inside my doghouse " 47
63 2 "@Addicted2Guys: -SimplyAddictedToGuys http://... 87
66 2 "@AllAboutManFeet: http://t.co/3gzUpfuMev" woo... 66
67 2 "@Allyhaaaaa: Lemmie eat a Oreo & do these... 69
... ... ... ...
24736 2 yaya ho.. cute avi tho RT @ViVaLa_Ari I had no... 75
24737 2 yea so about @N_tel 's new friend.. all my fri... 115
24767 2 you know what they say, the early bird gets th... 95
24779 2 you've gone and broke the wrong heart baby, an... 70
24782 2 ~~Ruffled | Ntac Eileen Dahlia - Beautiful col... 127

4163 rows × 3 columns
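
The three row counts above show that the classes are far from balanced. A minimal sketch that turns those counts into proportions, using only the numbers printed in this notebook (the label names are shorthand for the `hatespeech`, `offensive` and `neutral` frames):

```python
# Row counts reported above for each class.
class_counts = {
    "hate speech (0)": 1430,
    "offensive (1)": 19190,
    "neither (2)": 4163,
}

total = sum(class_counts.values())

# Share of the dataset taken up by each class.
shares = {name: count / total for name, count in class_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# Accuracy of a trivial model that always predicts the largest class.
majority_baseline = max(shares.values())
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Since roughly three quarters of the tweets fall in class 1, any accuracy score reported later is best read against this majority-class baseline.
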

Visualizing each class

In [15]:
sentences = hatespeech['tweet'].tolist()
len(sentences)
Out[15]:
1430
In [16]:
sentences_as_one_string = " ".join(sentences)
In [17]:
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(sentences_as_one_string))
Out[17]:
<matplotlib.image.AxesImage at 0x7f9650512080>
In [18]:
sentences = offensive['tweet'].tolist()
len(sentences)
Out[18]:
19190
In [19]:
sentences_as_one_string = " ".join(sentences)
In [20]:
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(sentences_as_one_string))
Out[20]:
<matplotlib.image.AxesImage at 0x7f96500019b0>
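
A word cloud sizes each word by how often it occurs, so the same information can be inspected as plain counts. The sketch below is a rough text-only counterpart, not what `WordCloud` does internally: `WordCloud` applies its own stopword filtering and tokenization by default, while this hypothetical `top_words` helper uses a simple regular expression and keeps every word. The toy sentences are made up for illustration.

```python
import re
from collections import Counter


def top_words(texts, n=3):
    """Return the n most frequent lowercase word tokens across all texts."""
    tokens = []
    for text in texts:
        # Keep runs of letters and apostrophes; drops punctuation, digits, URL symbols.
        tokens.extend(re.findall(r"[a-z']+", text.lower()))
    return Counter(tokens).most_common(n)


# Toy input standing in for a list such as hatespeech['tweet'].tolist().
toy_tweets = [
    "the early bird gets the worm",
    "early bird special",
    "the worm turns",
]
top = top_words(toy_tweets, n=2)
print(top)
```

Applied to the `sentences` list of each class, a helper like this gives a quick numeric check on what the corresponding cloud emphasizes.
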