Spam text classification¶

Identifying whether the given text is spam or not (ham). This helps in filtering through unnecessary text content and keep us focussed on the important information.

Importing necessary libraries¶

import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import re
import matplotlib.pyplot as plt
from tensorflow.keras import layers, optimizers, losses, callbacks, models
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer 
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
import random
from wordcloud import WordCloud
# stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jupyter-
[nltk_data]     Rodio346/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

True

The dataset¶

On Kaggle by Team AI

Almeida, T.A., GÃ³mez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011. Website | UCI

The dataset is a CSV file with messages falling into one of two categories - ham and spam.

df = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/SPAM_text_message_20170820_-_Data.csv')
df

Preprocessing¶

Dropping repeated rows¶

# Distribution of score values
df['Category'].value_counts()

ham     4825
spam     747
Name: Category, dtype: int64

df = df.drop_duplicates()
df['Category'].value_counts()

ham     4516
spam     641
Name: Category, dtype: int64

It is not a balanced dataset but we will go forward with it.

Encoding the category values¶

# Labels as 1 - spam or 0 - ham
df['Category'] = df['Category'].apply(lambda x : 1 if x == 'spam' else 0)

df

/opt/tljh/user/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Data cleaning¶

# Remove html tags
def removeHTML(sentence):
    regex = re.compile('<.*?>')
    return re.sub(regex, ' ', sentence)

# Remove URLs
def removeURL(sentence):
    regex = re.compile('http[s]?://\S+')
    return re.sub(regex, ' ', sentence)

# remove numbers, punctuation and any special characters (keep only alphabets)
def onlyAlphabets(sentence):
    regex = re.compile('[^a-zA-Z]')
    return re.sub(regex, ' ', sentence)

def removeRecurring(sentence):
    return re.sub(r'(.)\1{2,}', r'\1', sentence)

# Defining stopwords
stop = nltk.corpus.stopwords.words('english')

sno = nltk.stem.SnowballStemmer('english')    # Initializing stemmer
spam = []    # All words in positive reviews
ham = []    # All words in negative reviews
all_sentences = []    # All cleaned sentences


for x in range(len(df['Message'].values)):
    review = df['Message'].values[x]
    rating = df['Category'].values[x]

    cleaned_sentence = []
    sentence = removeURL(review) 
    sentence = removeHTML(sentence)
    sentence = onlyAlphabets(sentence)
    sentence = sentence.lower()   

    sentence = removeRecurring(sentence)  

    for word in sentence.split():
        #if word not in stop:
            stemmed = sno.stem(word)
            cleaned_sentence.append(stemmed)
            
            if rating == 1 :
                spam.append(stemmed)
            else:
                ham.append(stemmed)

    all_sentences.append(' '.join(cleaned_sentence))

# add as column in dataframe
df['Cleaned'] = all_sentences

/opt/tljh/user/lib/python3.7/site-packages/ipykernel_launcher.py:32: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Visualization¶

plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(' '.join(spam)))

<matplotlib.image.AxesImage at 0x7f7bd24c3128>

plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(' '.join(ham)))

<matplotlib.image.AxesImage at 0x7f7bd244de10>

Preprocessing (continued...)¶

Train test split¶

# Splitting into train, val and test set -- 80-10-10 split

# First, an 80-20 split
train_df, val_test_df = train_test_split(df, test_size = 0.2, random_state = 113)

# Then split the 20% into half
val_df, test_df = train_test_split(val_test_df, test_size = 0.5, random_state = 113)

print("Number of samples in...")
print("Training set: ", len(train_df))
print("Validation set: ", len(val_df))
print("Testing set: ", len(test_df))

Number of samples in...
Training set:  4125
Validation set:  516
Testing set:  516

Tokenization¶

cv = CountVectorizer(ngram_range = (1,1), max_features=20000)

train_bow = cv.fit_transform(train_df['Cleaned'])
val_bow = cv.transform(val_df['Cleaned'])
test_bow = cv.transform(test_df['Cleaned'])

tfidf = TfidfTransformer().fit(train_bow)

train_tf = tfidf.transform(train_bow)
val_tf = tfidf.transform(val_bow)
test_tf = tfidf.transform(test_bow)

Defining the input and output¶

Xtrain = train_tf.toarray()
ytrain = train_df['Category']

Xval = val_tf.toarray()
yval = val_df['Category']

ytest = test_df['Category']
Xtest = test_tf.toarray()

The model¶

Here we implement a model based on the frequency of different words in the sentence.

model = models.Sequential([
    layers.Dense(16, activation = 'relu', input_shape = Xtrain[0].shape),   
    layers.Dense(4, activation = 'relu'),
    layers.Dense(1, activation = 'sigmoid')
])

cb = [callbacks.EarlyStopping(patience = 5, restore_best_weights = True)]
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 16)                90768     
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 68        
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 5         
=================================================================
Total params: 90,841
Trainable params: 90,841
Non-trainable params: 0
_________________________________________________________________

model.compile(optimizer = optimizers.Adam(0.0001), loss = losses.BinaryCrossentropy(), metrics = ['accuracy'])

history = model.fit(Xtrain, ytrain, validation_data = (Xval, yval), epochs = 128, callbacks = cb)

Epoch 1/128
129/129 [==============================] - 0s 3ms/step - loss: 0.6916 - accuracy: 0.7282 - val_loss: 0.6887 - val_accuracy: 0.8605
Epoch 2/128
129/129 [==============================] - 0s 2ms/step - loss: 0.6859 - accuracy: 0.8841 - val_loss: 0.6837 - val_accuracy: 0.8740
Epoch 3/128
129/129 [==============================] - 0s 2ms/step - loss: 0.6803 - accuracy: 0.9093 - val_loss: 0.6775 - val_accuracy: 0.9186
Epoch 4/128
129/129 [==============================] - 0s 2ms/step - loss: 0.6733 - accuracy: 0.9554 - val_loss: 0.6693 - val_accuracy: 0.9671
Epoch 5/128
129/129 [==============================] - 0s 2ms/step - loss: 0.6640 - accuracy: 0.9787 - val_loss: 0.6585 - val_accuracy: 0.9748
Epoch 6/128
129/129 [==============================] - 0s 2ms/step - loss: 0.6516 - accuracy: 0.9857 - val_loss: 0.6445 - val_accuracy: 0.9748
Epoch 7/128
129/129 [==============================] - 0s 2ms/step - loss: 0.6357 - accuracy: 0.9886 - val_loss: 0.6267 - val_accuracy: 0.9748
Epoch 8/128
129/129 [==============================] - 0s 2ms/step - loss: 0.6155 - accuracy: 0.9913 - val_loss: 0.6048 - val_accuracy: 0.9767
Epoch 9/128
129/129 [==============================] - 0s 2ms/step - loss: 0.5912 - accuracy: 0.9918 - val_loss: 0.5793 - val_accuracy: 0.9767
Epoch 10/128
129/129 [==============================] - 0s 2ms/step - loss: 0.5632 - accuracy: 0.9927 - val_loss: 0.5500 - val_accuracy: 0.9767
Epoch 11/128
129/129 [==============================] - 0s 2ms/step - loss: 0.5316 - accuracy: 0.9937 - val_loss: 0.5176 - val_accuracy: 0.9806
Epoch 12/128
129/129 [==============================] - 0s 2ms/step - loss: 0.4971 - accuracy: 0.9932 - val_loss: 0.4829 - val_accuracy: 0.9826
Epoch 13/128
129/129 [==============================] - 0s 2ms/step - loss: 0.4603 - accuracy: 0.9935 - val_loss: 0.4465 - val_accuracy: 0.9826
Epoch 14/128
129/129 [==============================] - 0s 2ms/step - loss: 0.4221 - accuracy: 0.9930 - val_loss: 0.4093 - val_accuracy: 0.9864
Epoch 15/128
129/129 [==============================] - 0s 2ms/step - loss: 0.3838 - accuracy: 0.9925 - val_loss: 0.3726 - val_accuracy: 0.9864
Epoch 16/128
129/129 [==============================] - 0s 2ms/step - loss: 0.3464 - accuracy: 0.9925 - val_loss: 0.3373 - val_accuracy: 0.9864
Epoch 17/128
129/129 [==============================] - 0s 2ms/step - loss: 0.3108 - accuracy: 0.9925 - val_loss: 0.3045 - val_accuracy: 0.9864
Epoch 18/128
129/129 [==============================] - 0s 2ms/step - loss: 0.2777 - accuracy: 0.9930 - val_loss: 0.2739 - val_accuracy: 0.9845
Epoch 19/128
129/129 [==============================] - 0s 2ms/step - loss: 0.2475 - accuracy: 0.9922 - val_loss: 0.2467 - val_accuracy: 0.9845
Epoch 20/128
129/129 [==============================] - 0s 2ms/step - loss: 0.2203 - accuracy: 0.9930 - val_loss: 0.2221 - val_accuracy: 0.9845
Epoch 21/128
129/129 [==============================] - 0s 2ms/step - loss: 0.1961 - accuracy: 0.9927 - val_loss: 0.2007 - val_accuracy: 0.9845
Epoch 22/128
129/129 [==============================] - 0s 2ms/step - loss: 0.1748 - accuracy: 0.9935 - val_loss: 0.1818 - val_accuracy: 0.9845
Epoch 23/128
129/129 [==============================] - 0s 2ms/step - loss: 0.1562 - accuracy: 0.9935 - val_loss: 0.1655 - val_accuracy: 0.9845
Epoch 24/128
129/129 [==============================] - 0s 2ms/step - loss: 0.1399 - accuracy: 0.9937 - val_loss: 0.1512 - val_accuracy: 0.9845
Epoch 25/128
129/129 [==============================] - 0s 2ms/step - loss: 0.1257 - accuracy: 0.9935 - val_loss: 0.1389 - val_accuracy: 0.9845
Epoch 26/128
129/129 [==============================] - 0s 2ms/step - loss: 0.1133 - accuracy: 0.9939 - val_loss: 0.1282 - val_accuracy: 0.9845
Epoch 27/128
129/129 [==============================] - 0s 2ms/step - loss: 0.1025 - accuracy: 0.9939 - val_loss: 0.1189 - val_accuracy: 0.9864
Epoch 28/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0931 - accuracy: 0.9942 - val_loss: 0.1108 - val_accuracy: 0.9845
Epoch 29/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0848 - accuracy: 0.9944 - val_loss: 0.1037 - val_accuracy: 0.9845
Epoch 30/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0775 - accuracy: 0.9952 - val_loss: 0.0975 - val_accuracy: 0.9845
Epoch 31/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0710 - accuracy: 0.9952 - val_loss: 0.0921 - val_accuracy: 0.9845
Epoch 32/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0653 - accuracy: 0.9954 - val_loss: 0.0873 - val_accuracy: 0.9845
Epoch 33/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0603 - accuracy: 0.9959 - val_loss: 0.0831 - val_accuracy: 0.9845
Epoch 34/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0557 - accuracy: 0.9956 - val_loss: 0.0793 - val_accuracy: 0.9845
Epoch 35/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0517 - accuracy: 0.9959 - val_loss: 0.0759 - val_accuracy: 0.9845
Epoch 36/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0481 - accuracy: 0.9961 - val_loss: 0.0730 - val_accuracy: 0.9845
Epoch 37/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0448 - accuracy: 0.9961 - val_loss: 0.0704 - val_accuracy: 0.9845
Epoch 38/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0418 - accuracy: 0.9964 - val_loss: 0.0681 - val_accuracy: 0.9845
Epoch 39/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0392 - accuracy: 0.9964 - val_loss: 0.0659 - val_accuracy: 0.9864
Epoch 40/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0367 - accuracy: 0.9964 - val_loss: 0.0641 - val_accuracy: 0.9864
Epoch 41/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0345 - accuracy: 0.9964 - val_loss: 0.0624 - val_accuracy: 0.9864
Epoch 42/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0325 - accuracy: 0.9966 - val_loss: 0.0610 - val_accuracy: 0.9864
Epoch 43/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0307 - accuracy: 0.9964 - val_loss: 0.0594 - val_accuracy: 0.9864
Epoch 44/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0290 - accuracy: 0.9968 - val_loss: 0.0583 - val_accuracy: 0.9864
Epoch 45/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0275 - accuracy: 0.9971 - val_loss: 0.0572 - val_accuracy: 0.9845
Epoch 46/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0260 - accuracy: 0.9973 - val_loss: 0.0560 - val_accuracy: 0.9845
Epoch 47/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0247 - accuracy: 0.9973 - val_loss: 0.0553 - val_accuracy: 0.9826
Epoch 48/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0235 - accuracy: 0.9976 - val_loss: 0.0547 - val_accuracy: 0.9826
Epoch 49/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0224 - accuracy: 0.9976 - val_loss: 0.0538 - val_accuracy: 0.9826
Epoch 50/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0214 - accuracy: 0.9976 - val_loss: 0.0532 - val_accuracy: 0.9826
Epoch 51/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0204 - accuracy: 0.9978 - val_loss: 0.0526 - val_accuracy: 0.9826
Epoch 52/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0195 - accuracy: 0.9981 - val_loss: 0.0520 - val_accuracy: 0.9826
Epoch 53/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0187 - accuracy: 0.9981 - val_loss: 0.0516 - val_accuracy: 0.9826
Epoch 54/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0179 - accuracy: 0.9981 - val_loss: 0.0511 - val_accuracy: 0.9826
Epoch 55/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0172 - accuracy: 0.9981 - val_loss: 0.0509 - val_accuracy: 0.9826
Epoch 56/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0165 - accuracy: 0.9983 - val_loss: 0.0505 - val_accuracy: 0.9826
Epoch 57/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0159 - accuracy: 0.9983 - val_loss: 0.0505 - val_accuracy: 0.9826
Epoch 58/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0153 - accuracy: 0.9983 - val_loss: 0.0502 - val_accuracy: 0.9826
Epoch 59/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0148 - accuracy: 0.9983 - val_loss: 0.0499 - val_accuracy: 0.9826
Epoch 60/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0143 - accuracy: 0.9985 - val_loss: 0.0497 - val_accuracy: 0.9826
Epoch 61/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0138 - accuracy: 0.9985 - val_loss: 0.0498 - val_accuracy: 0.9826
Epoch 62/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0133 - accuracy: 0.9985 - val_loss: 0.0496 - val_accuracy: 0.9826
Epoch 63/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0129 - accuracy: 0.9985 - val_loss: 0.0495 - val_accuracy: 0.9826
Epoch 64/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0124 - accuracy: 0.9985 - val_loss: 0.0494 - val_accuracy: 0.9826
Epoch 65/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0120 - accuracy: 0.9985 - val_loss: 0.0492 - val_accuracy: 0.9826
Epoch 66/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0117 - accuracy: 0.9985 - val_loss: 0.0494 - val_accuracy: 0.9826
Epoch 67/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0113 - accuracy: 0.9985 - val_loss: 0.0496 - val_accuracy: 0.9806
Epoch 68/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0110 - accuracy: 0.9985 - val_loss: 0.0495 - val_accuracy: 0.9806
Epoch 69/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0107 - accuracy: 0.9985 - val_loss: 0.0496 - val_accuracy: 0.9806
Epoch 70/128
129/129 [==============================] - 0s 2ms/step - loss: 0.0104 - accuracy: 0.9988 - val_loss: 0.0496 - val_accuracy: 0.9806

model.evaluate(Xtest, ytest)

print("F1 score - ", f1_score(ytest, (model.predict(Xtest)>0.5).astype('int')))

17/17 [==============================] - 0s 1ms/step - loss: 0.0556 - accuracy: 0.9884
F1 score -  0.9433962264150944

ytest_val = ['spam' if i == 1 else 'ham' for i in ytest]
ypred = (model.predict(Xtest)>0.5).astype('int')
ypred_val = ['spam' if i == 1 else 'ham' for i in ypred]

cm = confusion_matrix(ytest_val, ypred_val, labels=['spam', 'ham'])
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

fig = plt.figure(figsize = (4, 4))
ax = fig.add_subplot(111)

for i in range(cm.shape[1]):
    for j in range(cm.shape[0]):   
        ax.text(j, i, format(cm[i, j], '.2f'), horizontalalignment="center", color="black")

columns = ['ham', 'spam']

_ = ax.imshow(cm, cmap=plt.cm.Blues)
ax.set_xticks(range(2))
ax.set_yticks(range(2))
ax.set_xticklabels(columns, rotation = 90)
ax.set_yticklabels(columns)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

A significant percentage of the ham messages are classified as spam. This can be improved with a larger dataset that includes more spam samples.

Plotting the metrics¶

def plot(history, variable, variable2):
    plt.plot(range(len(history[variable])), history[variable])
    plt.plot(range(len(history[variable2])), history[variable2])
    plt.title(variable)

plot(history.history, "accuracy", 'val_accuracy')

plot(history.history, "loss", "val_loss")

Prediction¶

x = np.random.randint(0, Xtest.shape[0] - 1)

sentence = test_df['Message'].values[x]
print("Sentence: ", sentence)

cleaned_sentence = []
sentence = removeURL(sentence) 
sentence = removeHTML(sentence)
sentence = onlyAlphabets(sentence) 
sentence = sentence.lower() 
sentence = removeRecurring(sentence)

for word in sentence.split():
    #if word not in stop:
        stemmed = sno.stem(word)
        cleaned_sentence.append(stemmed)

sentence = [' '.join(cleaned_sentence)]
print("\nCleaned sentence: ", sentence[0])

sentence = cv.transform(sentence)
sentence = tfidf.transform(sentence)

print("\nTrue value: ", columns[test_df['Category'].values[x]])

pred = model.predict(sentence.toarray())[0][0]
print("\nPredicted value: ", columns[int(pred>0.5)], "(", pred, "-->", (pred>0.5).astype('int'), ")")

Sentence:  We not watching movie already. Xy wants 2 shop so i'm shopping w her now.

Cleaned sentence:  we not watch movi alreadi xy want shop so i m shop w her now

True value:  ham

Predicted value:  ham ( 0.0043061357 --> 0 )

deepC¶

model.save('spam_text.h5')

!deepCC spam_text.h5

[INFO]
Reading [keras model] 'spam_text.h5'
[SUCCESS]
Saved 'spam_text_deepC/spam_text.onnx'
[INFO]
Reading [onnx model] 'spam_text_deepC/spam_text.onnx'
[INFO]
Model info:
  ir_vesion : 4
  doc       : 
[WARNING]
[ONNX]: terminal (input/output) dense_input's shape is less than 1. Changing it to 1.
[WARNING]
[ONNX]: terminal (input/output) dense_2's shape is less than 1. Changing it to 1.
WARN (GRAPH): found operator node with the same name (dense_2) as io node.
[INFO]
Running DNNC graph sanity check ...
[SUCCESS]
Passed sanity check.
[INFO]
Writing C++ file 'spam_text_deepC/spam_text.cpp'
[INFO]
deepSea model files are ready in 'spam_text_deepC/' 
[RUNNING COMMAND]
g++ -std=c++11 -O3 -fno-rtti -fno-exceptions -I. -I/opt/tljh/user/lib/python3.7/site-packages/deepC-0.13-py3.7-linux-x86_64.egg/deepC/include -isystem /opt/tljh/user/lib/python3.7/site-packages/deepC-0.13-py3.7-linux-x86_64.egg/deepC/packages/eigen-eigen-323c052e1731 "spam_text_deepC/spam_text.cpp" -D_AITS_MAIN -o "spam_text_deepC/spam_text.exe"
[RUNNING COMMAND]
size "spam_text_deepC/spam_text.exe"
   text	   data	    bss	    dec	    hex	filename
 481891	   2968	    760	 485619	  768f3	spam_text_deepC/spam_text.exe
[SUCCESS]
Saved model as executable "spam_text_deepC/spam_text.exe"

x = np.random.randint(0, Xtest.shape[0] - 1)

sentence = test_df['Message'].values[x]
print("Sentence: ", sentence)

cleaned_sentence = []
sentence = removeURL(sentence) 
sentence = removeHTML(sentence)
sentence = onlyAlphabets(sentence) 
sentence = sentence.lower() 
sentence = removeRecurring(sentence)

for word in sentence.split():
    if word not in stop:
        stemmed = sno.stem(word)
        cleaned_sentence.append(stemmed)

sentence = [' '.join(cleaned_sentence)]
print("\nCleaned sentence: ", sentence[0])

sentence = cv.transform(sentence)
sentence = tfidf.transform(sentence)

print() 

np.savetxt('sample.data', sentence.toarray())    # xth sample into text file

# run exe with input
!spam_text_deepC/spam_text.exe sample.data

# show predicted output
nn_out = np.loadtxt('deepSea_result_1.out')

pred = (nn_out>0.5).astype('int')
print("\nPredicted value: ", columns[int(pred>0.5)], "(", pred, "-->", (pred>0.5).astype('int'), ")")

print("\nTrue value: ", columns[test_df['Category'].values[x]])

Sentence:  Call FREEPHONE 0800 542 0578 now!

Cleaned sentence:  call freephon

writing file deepSea_result_1.out.

Predicted value:  spam ( 1 --> 1 )

True value:  spam

	Category	Message
0	0	Go until jurong point, crazy.. Available only ...
1	0	Ok lar... Joking wif u oni...
2	1	Free entry in 2 a wkly comp to win FA Cup fina...
3	0	U dun say so early hor... U c already then say...
4	0	Nah I don't think he goes to usf, he lives aro...
...	...	...
5567	1	This is the 2nd time we have tried 2 contact u...
5568	0	Will ü b going to esplanade fr home?
5569	0	Pity, * was in mood for that. So...any other s...
5570	0	The guy did some bitching but I acted like i'd...
5571	0	Rofl. Its true to its name

	Category	Message
0	ham	Go until jurong point, crazy.. Available only ...
1	ham	Ok lar... Joking wif u oni...
2	spam	Free entry in 2 a wkly comp to win FA Cup fina...
3	ham	U dun say so early hor... U c already then say...
4	ham	Nah I don't think he goes to usf, he lives aro...
...	...	...
5567	spam	This is the 2nd time we have tried 2 contact u...
5568	ham	Will ü b going to esplanade fr home?
5569	ham	Pity, * was in mood for that. So...any other s...
5570	ham	The guy did some bitching but I acted like i'd...
5571	ham	Rofl. Its true to its name

Model Files
spam_text.h5 keras Model
deepSea Compiled Models
spam_text.exe deepSea Ubuntu