Spam text classification¶
Credit: AITS Cainvas Community
Identifying whether a given text is spam or not spam (ham). This helps filter out unnecessary text content and keeps us focused on the important information.
Importing necessary libraries¶
In [1]:
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import re
from tensorflow.keras import layers, optimizers, losses, callbacks, models
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
import random
from wordcloud import WordCloud
# stopwords
nltk.download('stopwords')
Out[1]:
The dataset¶
Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011. Website | UCI
The dataset is a CSV file with messages falling into one of two categories - ham and spam.
In [2]:
df = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/SPAM_text_message_20170820_-_Data.csv')
df
Out[2]:
Preprocessing¶
Dropping repeated rows¶
In [3]:
# Distribution of category values
df['Category'].value_counts()
Out[3]:
In [4]:
df = df.drop_duplicates()
df['Category'].value_counts()
Out[4]:
The dataset is imbalanced (far more ham than spam messages), but we will go forward with it.
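One common way to compensate for such an imbalance is to weight the minority class more heavily during training. Below is a minimal sketch of inverse-frequency class weights; the counts used here are hypothetical stand-ins, not the exact counts of this dataset.

```python
from collections import Counter

# Hypothetical ham/spam counts mirroring the skewed split seen above
counts = Counter({'ham': 4516, 'spam': 641})
total = sum(counts.values())

# Inverse-frequency weights: the rarer class gets a proportionally larger weight
weights = {label: total / (len(counts) * n) for label, n in counts.items()}
print(weights)
```

With the labels encoded as 0/1 (as done below), a dict of this shape keyed by the integer labels can be passed as the `class_weight` argument of Keras `model.fit` so that spam errors contribute more to the loss.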
Encoding the category values¶
In [5]:
# Encode labels: 1 for spam, 0 for ham
df['Category'] = df['Category'].apply(lambda x : 1 if x == 'spam' else 0)
df
Out[5]:
Data cleaning¶
In [6]:
# Remove HTML tags
def removeHTML(sentence):
    regex = re.compile('<.*?>')
    return re.sub(regex, ' ', sentence)

# Remove URLs
def removeURL(sentence):
    regex = re.compile(r'http[s]?://\S+')
    return re.sub(regex, ' ', sentence)

# Remove numbers, punctuation and any special characters (keep only alphabets)
def onlyAlphabets(sentence):
    regex = re.compile('[^a-zA-Z]')
    return re.sub(regex, ' ', sentence)

# Collapse characters repeated 3 or more times into a single character
def removeRecurring(sentence):
    return re.sub(r'(.)\1{2,}', r'\1', sentence)

# Defining stopwords
stop = nltk.corpus.stopwords.words('english')
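To see what these cleaning functions do in sequence, here is a quick self-contained demo on a made-up message (the sample string is purely illustrative):

```python
import re

def removeHTML(sentence):
    return re.sub(re.compile('<.*?>'), ' ', sentence)

def removeURL(sentence):
    return re.sub(re.compile(r'http[s]?://\S+'), ' ', sentence)

def onlyAlphabets(sentence):
    return re.sub(re.compile('[^a-zA-Z]'), ' ', sentence)

def removeRecurring(sentence):
    # Collapse any character repeated 3+ times into one occurrence
    return re.sub(r'(.)\1{2,}', r'\1', sentence)

sample = 'WINNER!!! Claim your prize at https://example.com <b>now</b> sooooon'

# Same order as the loop below: URL -> HTML -> alphabets -> lowercase -> recurring
cleaned = removeRecurring(onlyAlphabets(removeHTML(removeURL(sample))).lower())
print(' '.join(cleaned.split()))  # -> winner claim your prize at now son
```

Note that `removeRecurring` also shortens legitimate long vowel runs ("sooooon" becomes "son"), which is an accepted trade-off for noisy SMS text.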
In [7]:
sno = nltk.stem.SnowballStemmer('english')  # Initializing the stemmer

spam = []           # All words in spam messages
ham = []            # All words in ham messages
all_sentences = []  # All cleaned sentences

for x in range(len(df['Message'].values)):
    review = df['Message'].values[x]
    rating = df['Category'].values[x]
    cleaned_sentence = []

    sentence = removeURL(review)
    sentence = removeHTML(sentence)
    sentence = onlyAlphabets(sentence)
    sentence = sentence.lower()
    sentence = removeRecurring(sentence)

    for word in sentence.split():
        #if word not in stop:
        stemmed = sno.stem(word)
        cleaned_sentence.append(stemmed)
        if rating == 1:
            spam.append(stemmed)
        else:
            ham.append(stemmed)

    all_sentences.append(' '.join(cleaned_sentence))

# Add the cleaned sentences as a column in the dataframe
df['Cleaned'] = all_sentences
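Before plotting word clouds, a quick frequency check on each word list can catch cleaning mistakes early. The sketch below uses `collections.Counter` on hypothetical token lists standing in for the `spam` and `ham` lists built above:

```python
from collections import Counter

# Hypothetical stand-ins for the stemmed token lists built above
spam = ['free', 'win', 'free', 'call', 'prize', 'free', 'call']
ham = ['ok', 'home', 'ok', 'love', 'go']

print(Counter(spam).most_common(2))  # top spam tokens with counts
print(Counter(ham).most_common(1))   # top ham token with count
```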
Visualization¶
In [8]:
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(' '.join(spam)))
Out[8]:
In [9]:
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(' '.join(ham)))
Out[9]: