Tweet Authenticity Prediction
Our objective is to predict whether a tweet is describing a disaster or not.
Important Libraries
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import model_selection,feature_extraction
from collections import Counter
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import re
!pip install contractions
import contractions
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
!pip install pyspellchecker
from spellchecker import SpellChecker
import tensorflow as tf
from tensorflow import keras
!pip install wget
!wget -N 'https://cainvas-static.s3.amazonaws.com/media/user_data/devanshchowd/tweets.csv'
data = pd.read_csv('tweets.csv',index_col='id')
data.reset_index(drop=True,inplace=True)
data = data[['text', 'target']]
data_train,data_test = model_selection.train_test_split(data,test_size=.2,random_state = 43,stratify=data.target)
data_train.reset_index(drop=True,inplace=True)
data_test.reset_index(drop=True,inplace=True)
A target of 0 indicates a non-disastrous tweet and 1 indicates a disastrous tweet, meaning the tweet is describing or reporting a disaster.
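To get a concrete feel for the two classes, a small inspection cell (a minimal sketch, simply indexing data_train) prints one example tweet per label:
# print one raw example tweet for each target value
for label in (0, 1):
    print(label, '->', data_train.loc[data_train['target'] == label, 'text'].iloc[0])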
display(data_train)
BASIC EDA
The roughly 9000 training samples are divided into roughly 7400 with target 0 (non-disaster) and 1700 with target 1 (disaster).
sns.set_style('darkgrid')
print(data_train['target'].value_counts())
sns.countplot(data = data_train,x= 'target')
Text Length Distribution
sns.histplot(data_train['text'].apply(lambda x:len(x)))
print('Shortest message in our data :',data_train.loc[data_train['text'].apply(lambda x:len(x)).idxmin()].values,sep='\n')
print('========')
print('A long message in our data :',data_train.loc[data_train['text'].apply(lambda x:len(x))>140].values[0],sep='\n')
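The histogram can also be summarised numerically; a quick check with pandas' string accessor gives the character-length statistics referred to later on:
# character-length statistics of the raw tweets
print(data_train['text'].str.len().describe())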
Transforming Data
stop_words = nltk.corpus.stopwords.words('english')
wnl = WordNetLemmatizer()
stemmer = SnowballStemmer('english')

def clean_text(doc):
    # expand contractions first, before the apostrophes are stripped out below
    doc = contractions.fix(doc)
    # remove URLs and HTML tags
    doc = re.sub(r'https?://\S+|www\.\S+', '', doc)
    doc = re.sub(r'<.*?>', '', doc)
    # keep only letters and whitespace (flags must be passed as a keyword,
    # since the fourth positional argument of re.sub is the substitution count)
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    # lowercase and lemmatize each word
    doc = ' '.join(wnl.lemmatize(word) for word in doc.lower().split())
    # tokenize and drop stopwords
    tokens = nltk.word_tokenize(doc)
    filtered = [token for token in tokens if token not in stop_words]
    return ' '.join(filtered)

data_train['text'] = data_train['text'].apply(clean_text)
data_test['text'] = data_test['text'].apply(clean_text)
print('Shortest message in our data :',data_train.loc[data_train['text'].apply(lambda x:len(x)).idxmin()].values,sep='\n')
print('========')
print('A long message in our data :',data_train.loc[data_train['text'].apply(lambda x:len(x))>100].values[0],sep='\n')
You can see the effect of the cleaning. For example, the tweet
'British diver Neil Anthony Fears found dead by the wreck of a steamship - Daily Mail http://t.co/QP3GVvfoFq' got transformed into
'british diver neil anthony fear found dead wreck steamship daily mail'. As you can see, we were successful in doing two things:
1) Making the data a bit shorter
2) Not losing the meaning of the data in the conversion
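The same cleaning can be run on any raw string; for instance, feeding an invented example tweet (purely illustrative, not from the dataset) through clean_text defined above:
# illustrative input only
raw = "There's a huge wildfire near our town!! http://t.co/example <b>scary</b>"
print(clean_text(raw))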
X = data_train['text']
y = data_train['target']
X_test = data_test['text']
y_test = data_test['target']
The vocabulary size of 17293 is too big: it means that to represent a single sentence I would have to store a vector of length 17293, which seems very unnecessary given that most of the tweets are only about 120 - 140 characters long.
count_vectorizer = feature_extraction.text.CountVectorizer()
count_vectorizer.fit_transform(X)
print(len(count_vectorizer.vocabulary_))
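To see why a vector of this width is wasteful, a short sketch transforms a handful of tweets and compares the vector length with the number of entries that are actually non-zero:
bow = count_vectorizer.transform(X[:5])          # sparse matrix of shape (5, vocabulary size)
print('vector length per tweet   :', bow.shape[1])
print('non-zero entries per tweet:', bow.getnnz(axis=1))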
Now here comes the difficult part: how many words should be kept in our vocabulary? Each choice has its pros and cons:
1) If the number is high, we keep track of more words and store more information, but we use much more space and computation.
2) If the number is low, our model will train faster and use less space, but by eliminating words we lose some information that might have been useful to the model.
I will choose to keep the words that occur more than once; the quick coverage check after the snippet below shows how much of the corpus this retains.
freq_of_word = Counter()
for doc in X:
    freq_of_word.update(doc.split())
vocabulary = [word for word, freq in freq_of_word.items() if freq > 1]   # keep words that occur more than once
print(len(vocabulary))
vocab_size = len(vocabulary)
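As a rough check that dropping the words that occur only once does not throw away much of the corpus, the following estimates what fraction of all token occurrences the pruned vocabulary still covers:
kept = set(vocabulary)
total_tokens = sum(freq_of_word.values())
covered = sum(freq for word, freq in freq_of_word.items() if word in kept)
print('coverage of pruned vocabulary : {:.1%}'.format(covered / total_tokens))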
def create_tokenizer(post):
    # map each word to an integer index, keeping only the vocab_size most frequent words
    tokenizer = keras.preprocessing.text.Tokenizer(num_words=vocab_size)
    tokenizer.fit_on_texts(post)
    return tokenizer
vocab_set = set(vocabulary)   # set membership checks are far faster than scanning a list
def change(post):
    # drop any word that is not in the pruned vocabulary
    kept = [word for word in post.split() if word in vocab_set]
    return ' '.join(kept)
X_transformed = X.apply(change)
X_test_transformed = X_test.apply(change)
tokenizer = create_tokenizer(X_transformed)
print('Our new dictionary has',len(tokenizer.word_index),'unique words')
sequence = tokenizer.texts_to_sequences(X_transformed)
X_transformed_seq= keras.preprocessing.sequence.pad_sequences(sequence, maxlen=150)
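To make the tokenization and padding step concrete, it helps to look at a single tweet before and after the conversion (index 0 is an arbitrary choice):
print('cleaned text     :', X_transformed.iloc[0])
print('integer sequence :', sequence[0])
print('padded length    :', len(X_transformed_seq[0]))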
X_train, X_val, y_train, y_val = model_selection.train_test_split(X_transformed_seq,y ,test_size = .15,stratify = y)
print(X_train.shape,X_val.shape)
keras.backend.clear_session()
model = keras.models.Sequential()
model.add(keras.layers.Embedding(vocab_size, 32))
model.add(keras.layers.LSTM(3,recurrent_dropout=.5,dropout=.5,return_sequences=False,kernel_regularizer=keras.regularizers.l2()))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.summary()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
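As a sanity check on model.summary(), the parameter counts can be reproduced by hand: the embedding layer holds vocab_size * 32 weights, a 3-unit Keras LSTM has 4 * 3 * (32 + 3 + 1) weights (four gates, each with input, recurrent and bias terms), and the final dense layer adds 3 + 1. A small sketch compares this arithmetic with Keras' own total:
expected = vocab_size * 32 + 4 * 3 * (32 + 3 + 1) + (3 + 1)
print(expected, model.count_params())   # the two numbers should agree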
all_callbacks = [keras.callbacks.EarlyStopping(patience = 5,min_delta=.02,restore_best_weights=True),
keras.callbacks.ModelCheckpoint('LSTM.h5',save_best_only=True,monitor='val_accuracy')]
history = model.fit(X_train,y_train,validation_data= (X_val,y_val),batch_size = 64,epochs=10,callbacks=all_callbacks)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.savefig('Accuracy vs Epochs.png')   # save before show(), otherwise an empty figure is written
plt.show()
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.savefig('Loss vs Epochs.png')   # save before show(), otherwise an empty figure is written
plt.show()
model = keras.models.load_model('LSTM.h5')
model.evaluate(X_train,y_train)
model.evaluate(X_val,y_val)
sequence = tokenizer.texts_to_sequences(X_test_transformed)
X_test_transformed_seq= keras.preprocessing.sequence.pad_sequences(sequence, maxlen=150)
model.evaluate(X_test_transformed_seq,y_test)
We achieve about 88% accuracy on the held-out test set.
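Accuracy alone can be misleading on an imbalanced dataset like this one, so a precision/recall breakdown is worth a look; a small sketch using sklearn.metrics (not imported above):
from sklearn import metrics
y_pred = (model.predict(X_test_transformed_seq) > .5).astype(int).flatten()
print(metrics.classification_report(y_test, y_pred))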
sample_pred = (model.predict(X_test_transformed_seq[:5])>.5).astype(int).flatten()
sample_text = X_test[:5]
truth = y_test[:5]
display(pd.DataFrame(zip(sample_pred,sample_text,truth),columns=['Model Predictions','TEXT','Truth Label']))
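Finally, scoring a brand-new tweet means replaying the same preprocessing chain: clean the text, drop out-of-vocabulary words, tokenize, pad, and predict. A minimal end-to-end sketch (the input string is invented for illustration):
new_tweet = 'Forest fire spreading fast near the highway, evacuations underway'
processed = change(clean_text(new_tweet))
seq = tokenizer.texts_to_sequences([processed])
padded = keras.preprocessing.sequence.pad_sequences(seq, maxlen=150)
print('P(disaster) =', float(model.predict(padded)[0][0]))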