
Spam text classification

Credit: AITS Cainvas Community

Photo by Emanuele Colombo on Dribbble

The task is to identify whether a given text message is spam or not (ham). Filtering out spam cuts through unnecessary content and keeps us focused on the important information.

Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import re
from keras import layers, optimizers, losses, callbacks, models
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer 
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
import random
from wordcloud import WordCloud
# stopwords
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /home/jupyter-
[nltk_data]     gunjan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[1]:
True

The dataset

On Kaggle by Team AI

Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011. Website | UCI

The dataset is a CSV file with messages falling into one of two categories - ham and spam.

In [2]:
df = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/SPAM_text_message_20170820_-_Data.csv')
df
Out[2]:
      Category  Message
0     ham       Go until jurong point, crazy.. Available only ...
1     ham       Ok lar... Joking wif u oni...
2     spam      Free entry in 2 a wkly comp to win FA Cup fina...
3     ham       U dun say so early hor... U c already then say...
4     ham       Nah I don't think he goes to usf, he lives aro...
...   ...       ...
5567  spam      This is the 2nd time we have tried 2 contact u...
5568  ham       Will ü b going to esplanade fr home?
5569  ham       Pity, * was in mood for that. So...any other s...
5570  ham       The guy did some bitching but I acted like i'd...
5571  ham       Rofl. Its true to its name

5572 rows × 2 columns

Preprocessing

Dropping repeated rows

In [3]:
# Distribution of category values before dropping duplicates
df['Category'].value_counts()
Out[3]:
ham     4825
spam     747
Name: Category, dtype: int64
In [4]:
df = df.drop_duplicates()
df['Category'].value_counts()
Out[4]:
ham     4516
spam     641
Name: Category, dtype: int64

After dropping duplicates, the dataset is imbalanced: 4516 ham messages against 641 spam (about 12% spam). We will go forward with it as is.
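To put a number on the imbalance, the sketch below computes each class's share and an inverse-frequency weight from the counts in Out[4]. The weighting formula (total divided by the number of classes times the class count) is a common heuristic assumed here for illustration; this notebook does not apply it, and `class_stats` is a hypothetical helper.

```python
# Class counts after dropping duplicates (taken from Out[4])
counts = {"ham": 4516, "spam": 641}

def class_stats(counts):
    """Return {label: (share, weight)} using inverse-frequency weighting (an assumed heuristic)."""
    total = sum(counts.values())
    n_classes = len(counts)
    stats = {}
    for label, count in counts.items():
        share = count / total                   # fraction of all messages in this class
        weight = total / (n_classes * count)    # rarer classes get larger weights
        stats[label] = (share, weight)
    return stats

stats = class_stats(counts)
for label, (share, weight) in stats.items():
    print(f"{label}: share={share:.3f}, weight={weight:.2f}")
```

A dictionary of such weights could, in principle, be passed to a training routine that accepts per-class weights, but whether that helps here is untested.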

Encoding the category values

In [5]:
# Labels as 1 - spam or 0 - ham
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df = df.assign(Category=(df['Category'] == 'spam').astype(int))

df
Out[5]:
      Category  Message
0     0         Go until jurong point, crazy.. Available only ...
1     0         Ok lar... Joking wif u oni...
2     1         Free entry in 2 a wkly comp to win FA Cup fina...
3     0         U dun say so early hor... U c already then say...
4     0         Nah I don't think he goes to usf, he lives aro...
...   ...       ...
5567  1         This is the 2nd time we have tried 2 contact u...
5568  0         Will ü b going to esplanade fr home?
5569  0         Pity, * was in mood for that. So...any other s...
5570  0         The guy did some bitching but I acted like i'd...
5571  0         Rofl. Its true to its name

5157 rows × 2 columns

Data cleaning

In [6]:
# Remove HTML tags
def removeHTML(sentence):
    regex = re.compile(r'<.*?>')
    return re.sub(regex, ' ', sentence)

# Remove URLs
def removeURL(sentence):
    # Raw string so that \S reaches the regex engine instead of being read as a string escape
    regex = re.compile(r'https?://\S+')
    return re.sub(regex, ' ', sentence)

# Remove numbers, punctuation and any special characters (keep only alphabets)
def onlyAlphabets(sentence):
    regex = re.compile(r'[^a-zA-Z]')
    return re.sub(regex, ' ', sentence)

# Collapse any character repeated three or more times in a row into a single occurrence
def removeRecurring(sentence):
    return re.sub(r'(.)\1{2,}', r'\1', sentence)

# Defining stopwords
stop = nltk.corpus.stopwords.words('english')
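To see what these helpers do to a single message, here is a minimal sketch that chains them in the order the notebook's cleaning loop applies them (URL removal, HTML removal, letters only, lowercasing, collapsing repeats). Stemming is left out so the block needs only the standard library; the `clean` wrapper and the sample message are hypothetical.

```python
import re

def removeHTML(sentence):
    return re.sub(r'<.*?>', ' ', sentence)

def removeURL(sentence):
    return re.sub(r'https?://\S+', ' ', sentence)

def onlyAlphabets(sentence):
    return re.sub(r'[^a-zA-Z]', ' ', sentence)

def removeRecurring(sentence):
    # A character repeated 3+ times in a row is reduced to one occurrence
    return re.sub(r'(.)\1{2,}', r'\1', sentence)

def clean(sentence):
    """Hypothetical wrapper: apply the cleaning steps in sequence, then normalize whitespace."""
    sentence = removeURL(sentence)
    sentence = removeHTML(sentence)
    sentence = onlyAlphabets(sentence)
    sentence = sentence.lower()
    sentence = removeRecurring(sentence)
    return ' '.join(sentence.split())

sample = "FREEEE entry!!! Visit http://example.com <b>now</b> 2 win"
cleaned = clean(sample)
print(cleaned)  # fre entry visit now win
```

Note that `removeRecurring` reduces "freeee" to "fre", not "free": the whole run of four e's collapses to a single "e".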
In [7]:
sno = nltk.stem.SnowballStemmer('english')    # Initializing stemmer
spam = []    # All stemmed words in spam messages
ham = []    # All stemmed words in ham messages
all_sentences = []    # All cleaned sentences


for message, label in zip(df['Message'].values, df['Category'].values):
    cleaned_sentence = []
    sentence = removeURL(message)
    sentence = removeHTML(sentence)
    sentence = onlyAlphabets(sentence)
    sentence = sentence.lower()

    sentence = removeRecurring(sentence)

    for word in sentence.split():
        # Stopword filtering is left disabled here: `if word not in stop:`
        stemmed = sno.stem(word)
        cleaned_sentence.append(stemmed)

        if label == 1:
            spam.append(stemmed)
        else:
            ham.append(stemmed)

    all_sentences.append(' '.join(cleaned_sentence))

# Add as column in dataframe; assign() returns a new DataFrame and avoids SettingWithCopyWarning
df = df.assign(Cleaned=all_sentences)

Visualization

In [8]:
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(' '.join(spam)))
Out[8]:
<matplotlib.image.AxesImage at 0x7f4cce6872b0>
In [9]:
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(' '.join(ham)))
Out[9]:
<matplotlib.image.AxesImage at 0x7f4cce604080>
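A word cloud sizes words by how often they occur, so much the same information can be read off a frequency table. The sketch below counts words with `collections.Counter`; `sample_spam` is a made-up stand-in for the notebook's `spam` list, not real data from the dataset.

```python
from collections import Counter

# Hypothetical stand-in for the notebook's `spam` list of stemmed words
sample_spam = ["free", "call", "win", "free", "prize", "call", "free"]

# most_common(n) returns the n most frequent words with their counts
top_words = Counter(sample_spam).most_common(2)
print(top_words)  # [('free', 3), ('call', 2)]
```

Running the same call on the real `spam` and `ham` lists would give a textual counterpart to the two plots.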