NOTE: This use case is not intended for resource-constrained devices.
Hate Speech And Offensive Language Detection¶
Credit: AITS Cainvas Community
Photo by Lucien Leyh on Dribbble
Nowadays we are well aware that social media platforms, if not handled carefully, can create chaos. One of the problems faced on these platforms is the use of hate speech and offensive language. Such language often leads to fights, crimes, or, at worst, riots. Detecting it is therefore essential, and since humans cannot monitor such large volumes of data, we can use AI to detect this kind of language and prevent users from posting it.
Importing Libraries¶
In [ ]:
# Essential tools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# data preprocessing tools
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# NLP tools
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
# train/test split and models to fit
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
# model evaluation metrics
from sklearn.metrics import confusion_matrix, accuracy_score
import os
Importing the Dataset¶
In [2]:
tweets_df = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/twitter_labeled_data.csv')
Data Visualization and Preprocessing¶
In [3]:
tweets_df.head()
Out[3]:
In [4]:
tweets_df = tweets_df.drop(['neither','Unnamed: 0','count','hate_speech','offensive_language'], axis= 1)
In [5]:
tweets_df.head()
Out[5]:
Adding a length column to inspect tweet lengths¶
In [6]:
tweets_df['length'] = tweets_df['tweet'].apply(len)
In [7]:
tweets_df.head()
Out[7]:
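The new length column can be inspected directly before moving on; a minimal sketch (assuming matplotlib is already imported as above) that plots a histogram of tweet lengths:
In [ ]:
# Sketch: distribution of tweet lengths using the column added above
tweets_df['length'].plot(kind='hist', bins=50)
plt.xlabel('Tweet length (characters)')
plt.show()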
In [8]:
tweets_df.describe()
Out[8]:
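Before splitting the data by class, it helps to check how many tweets fall into each category; a minimal sketch using the class column from the dataset above:
In [ ]:
# Sketch: number of tweets per class
# class 0 = hate speech, 1 = offensive language, 2 = neutral
tweets_df['class'].value_counts()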
Segregating the data by class¶
In [9]:
hatespeech = tweets_df[tweets_df['class']==0]
In [10]:
hatespeech
Out[10]:
In [11]:
offensive = tweets_df[tweets_df['class']==1]
In [12]:
offensive
Out[12]:
In [13]:
neutral = tweets_df[tweets_df['class']==2]
In [14]:
neutral
Out[14]:
Visualizing each class¶
In [15]:
sentences = hatespeech['tweet'].tolist()
len(sentences)
Out[15]:
In [16]:
sentences_as_one_string = " ".join(sentences)
In [17]:
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(sentences_as_one_string))
Out[17]:
In [18]:
sentences = offensive['tweet'].tolist()
len(sentences)
Out[18]:
In [19]:
sentences_as_one_string = " ".join(sentences)
In [20]:
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(sentences_as_one_string))
Out[20]:
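The same visualization can be repeated for the neutral class; a sketch following the pattern above:
In [ ]:
# Sketch: word cloud for the neutral (class 2) tweets
sentences = neutral['tweet'].tolist()
sentences_as_one_string = " ".join(sentences)
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(sentences_as_one_string))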