
Tweet Authenticity Prediction

Our objective is to predict whether a tweet describes a disaster or not.

Important Libraries

import warnings
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import model_selection,feature_extraction
from collections import Counter
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import re
!pip install contractions
import contractions
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
!pip install pyspellchecker
from spellchecker import SpellChecker
import tensorflow as tf
from tensorflow import keras
!pip install wget
!wget -N ''
data = pd.read_csv('tweets.csv',index_col='id')
data = data[['text', 'target']]
data_train,data_test = model_selection.train_test_split(data,test_size=.2,random_state = 43,
A target of 0 indicates a non-disaster tweet and 1 indicates a disaster tweet, meaning the tweet is describing or reporting a disaster.

text target
0 🤣 Are there horses on M’Baku St? If so, I may ... 0
1 Police officers like Davinder Singh who swear ... 0
2 Weverse Official BTS | JIMIN 200113 ↘ [ARMY ZI... 0
3 A war between uhuru and ruto will crash the co... 0
4 As I opened my mouth to speak, everything went... 0
... ... ...
9091 Hi Lisa, I'm very sorry about this, a freight ... 1
9092 AMEN! Set the whole system ablaze, man. https:... 0
9093 Phivolcs also noted that another indication of... 0
9094 "Let their habitation be desolate; [and] let n... 0
9095 🤔was wondering why i never seen the word c*nt ... 0

9096 rows × 2 columns


The roughly 9,100 training samples are divided into about 7,400 non-disaster (target 0) and 1,700 disaster (target 1) tweets, so the classes are clearly imbalanced.
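With classes this uneven, raw accuracy needs a reference point. Below is a minimal sketch that uses only the two counts reported for the training split (7405 and 1691) to compute the class shares and the accuracy a hypothetical classifier would get by always answering 0. The baseline is an illustration, not something the notebook trains.

```python
# Class counts reported for the training split
counts = {0: 7405, 1: 1691}
total = sum(counts.values())

# Share of each class in the training split
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a hypothetical baseline that always predicts the majority class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(total)                        # 9096
print(round(shares[1], 3))          # 0.186
print(round(baseline_accuracy, 3))  # 0.814
```

Any model trained on this data should therefore be judged against roughly 81% accuracy rather than against 50%.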

sns.countplot(data = data_train,x= 'target')
0    7405
1    1691
Name: target, dtype: int64
<AxesSubplot:xlabel='target', ylabel='count'>

Text Length Distribution

sns.histplot(data_train['text'].apply(lambda x:len(x)))
<AxesSubplot:xlabel='text', ylabel='Count'>
print('Short Message in our data :',data_train.loc[data['text'].apply(lambda x:len(x)).idxmin()].values,sep='\n')
print('Long Message in our data  :',data_train.loc[data_train['text'].apply(lambda x:len(x))>140].values[0],sep='\n')
Short Message in our data :
['HAHAHAHAHAHA THIS IS SO BAD aw i miss screaming in courtney’s ears' 0]
Long Message in our data  :
["Tributes for British tourist who died in cliff fall at Sydney's Diamond Bay - so sad &amp; so…"

Transforming Data

stop_words = set(nltk.corpus.stopwords.words('english'))
wnl = WordNetLemmatizer()

def clean_text(doc):
  # Remove URLs and HTML tags
  doc = re.sub(r'https?://\S+|www\.\S+','',doc)
  doc = re.sub(r'<.*?>','',doc)
  # Keep letters and whitespace only; flags must be passed by keyword,
  # because the fourth positional argument of re.sub is count
  doc = re.sub(r'[^a-zA-Z\s]','',doc,flags=re.I|re.A)
  # Lowercase and lemmatize each word, then expand contractions
  doc = ' '.join([wnl.lemmatize(word) for word in doc.lower().split()])
  doc = contractions.fix(doc)
  # Tokenize and drop stop words
  tokens = nltk.word_tokenize(doc)
  filtered = [token for token in tokens if token not in stop_words]
  return ' '.join(filtered)

# Apply the same cleaning to both splits, row by row
data_train = data_train.assign(text=data_train['text'].apply(clean_text))
data_test = data_test.assign(text=data_test['text'].apply(clean_text))
print('Short Message in our data :',data_train.loc[data['text'].apply(lambda x:len(x)).idxmin()].values,sep='\n')
print('Long Message in our data  :',data_train.loc[data_train['text'].apply(lambda x:len(x))>100].values[0],sep='\n')
Short Message in our data :
['hahahahahaha bad aw miss screaming courtneys ear' 0]
Long Message in our data  :
['showed activity wa recently founded bioterrorism manufactured disease tin man android global organism would'
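One detail of the cleaning step is easy to get wrong: in `re.sub(pattern, repl, string, count=0, flags=0)` the fourth positional argument is `count`, not `flags`. A flag passed in that position is silently read as a cap on the number of substitutions (`re.I|re.A` equals 258). The short sketch below shows the difference on a made-up string. Recent Python versions may also emit a deprecation warning for the positional form.

```python
import re

text = "a1b2c3d4"

# Fourth positional argument is `count`: re.I has the integer value 2,
# so only the first two non-letters are removed
positional = re.sub(r"[^a-zA-Z\s]", "", text, re.I)

# Passing the flag by keyword removes every non-letter
keyword = re.sub(r"[^a-zA-Z\s]", "", text, flags=re.I)

print(int(re.I | re.A))  # 258
print(positional)        # abc3d4
print(keyword)           # abcd
```

For tweets the cap of 258 is rarely reached, which is why the problem is easy to miss.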

You can see the effect of the conversion. Previously the shortest message was

'British diver Neil Anthony Fears found dead by the wreck of a steamship - Daily Mail', which was transformed into

'british diver neil anthony fear found dead wreck steamship daily mail'. As you can see, we succeeded in doing two things:

1) Making the text a bit shorter

2) Keeping the meaning of the text intact through the conversion
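To make the individual steps easier to follow, here is a rough standard-library sketch of the same idea applied to that sentence. It assumes a tiny hand-written stop-word set (`STOP_WORDS`, hypothetical) instead of NLTK's English list, and it skips lemmatization and contraction handling, so "fears" stays plural here.

```python
import re

# Hypothetical, tiny stop-word list; the notebook uses NLTK's full English list
STOP_WORDS = {"a", "by", "of", "the"}

def simple_clean(text):
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop URLs
    text = re.sub(r"<.*?>", "", text)                   # drop HTML tags
    text = re.sub(r"[^a-zA-Z\s]", "", text)             # keep letters and whitespace
    words = text.lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)

raw = "British diver Neil Anthony Fears found dead by the wreck of a steamship - Daily Mail"
cleaned = simple_clean(raw)
print(cleaned)
# british diver neil anthony fears found dead wreck steamship daily mail
```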

X = data_train['text']
y = data_train['target']
X_test = data_test['text']
y_test = data_test['target']

A vocabulary size of 17293 is too big: representing a single sentence as a bag of words would require a vector of length 17293. That seems unnecessary, since most of the messages have a length between 120 and 140 characters.

count_vectorizer = feature_extraction.text.CountVectorizer()
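A back-of-the-envelope comparison shows why the vocabulary size matters. The sketch below assumes a dense bag-of-words matrix with 8-byte integers and padded sequences of length 150 with 4-byte integers. Both dtypes are assumptions, and `CountVectorizer` returns a sparse matrix, so the first figure is an upper bound rather than what is actually stored.

```python
n_tweets = 9096      # rows in the training split
vocab_size = 17293   # vocabulary size quoted above
max_len = 150        # padded sequence length used later in the notebook

# Dense bag-of-words matrix: one 8-byte integer per (tweet, word) cell (assumed dtype)
dense_bow_bytes = n_tweets * vocab_size * 8

# Padded integer sequences: one 4-byte integer per position (assumed dtype)
padded_seq_bytes = n_tweets * max_len * 4

print(dense_bow_bytes)   # 1258377024, about 1.26 GB
print(padded_seq_bytes)  # 5457600, about 5.5 MB
```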

Now comes the difficult part: how many words should be kept in our vocabulary? Each choice has its pros and cons:

1) If the number is high, we keep track of more words and store more information, but we use more memory and need far more computation.

2) If the number is low, our model trains faster and uses less memory, but by eliminating words we lose information that might have been useful for the model.

I will choose to keep the words that occur more than once.
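The rule "keep words that occur more than once" can be sketched on a toy corpus as follows. The sentences and the helper name `keep_known_words` are made up for illustration.

```python
from collections import Counter

# Toy corpus standing in for the cleaned tweets
corpus = ["fire in the forest", "forest fire spreads fast", "nice weather today"]

# Count every word across the whole corpus
word_counts = Counter()
for sentence in corpus:
    word_counts.update(sentence.split())

# Keep only the words seen more than once
vocabulary = {word for word, freq in word_counts.items() if freq > 1}

def keep_known_words(sentence, vocabulary):
    """Drop every word that is not part of the vocabulary."""
    return " ".join(w for w in sentence.split() if w in vocabulary)

filtered = [keep_known_words(s, vocabulary) for s in corpus]
print(sorted(vocabulary))  # ['fire', 'forest']
print(filtered)            # ['fire forest', 'forest fire', '']
```

The third sentence ends up empty, which is the information loss mentioned in point 2. Using a set for the vocabulary also keeps each membership test cheap.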

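# Count how often each word occurs in the cleaned training text
freq_of_word = Counter()
for i in X:
    freq_of_word.update(i.split())
# Keep only the words that occur more than once
vocablury = []
for word,freq in freq_of_word.items():
    if freq>1:
        vocablury.append(word)
vocab_size = len(vocablury)
def create_tokenizer(post):
    # Fit a Keras tokenizer on the given texts
    tokenizer = keras.preprocessing.text.Tokenizer(num_words=vocab_size)
    tokenizer.fit_on_texts(post)
    return tokenizer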
def change(post):
    sentence = ''
    for i in post.split():
        if i in vocablury:
            sentence = sentence + i + ' '
    return sentence
X_transformed = X.apply(change)
X_test_transformed = X_test.apply(change)
tokenizer = create_tokenizer(X_transformed)
print('Our new dictionary has',len(tokenizer.word_index),'unique words')
Our new dictionary has 7582 unique words
sequence = tokenizer.texts_to_sequences(X_transformed)
X_transformed_seq= keras.preprocessing.sequence.pad_sequences(sequence, maxlen=150)
X_train, X_val, y_train, y_val = model_selection.train_test_split(X_transformed_seq,y ,test_size = .15,stratify = y)
(7731, 150) (1365, 150)
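The two steps above, mapping words to integer ids and padding to a fixed length, can be illustrated without Keras. This is only a sketch: `pad_pre` is a hypothetical helper that mimics what I understand to be the default behaviour of `pad_sequences` (zeros added at the front, overlong sequences cut at the front), and the word-to-id map is made up.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad (or left-truncate) a list of ints to exactly maxlen items."""
    trimmed = sequence[-maxlen:]  # keep the last maxlen items
    return [value] * (maxlen - len(trimmed)) + trimmed

word_index = {"forest": 1, "fire": 2, "flood": 3}  # hypothetical word-to-id map
sentence = "forest fire near town"

# Words missing from the map are skipped
ids = [word_index[w] for w in sentence.split() if w in word_index]

print(ids)                             # [1, 2]
print(pad_pre(ids, 5))                 # [0, 0, 0, 1, 2]
print(pad_pre([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
```

With `maxlen=150`, every tweet becomes a row of 150 integers, which matches the `(7731, 150)` and `(1365, 150)` shapes printed above.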
model = keras.models.Sequential()
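model.add(keras.layers.Embedding(vocab_size, 32))
model.add(keras.layers.LSTM(3))  # 3 units, as in the model summary; any further arguments are not shown
model.add(keras.layers.Dense(1, activation='sigmoid'))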
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
all_callbacks = [keras.callbacks.EarlyStopping(patience = 5,min_delta=.02,restore_best_weights=True),
history = model.fit(X_train,y_train,validation_data=(X_val,y_val),batch_size=64,epochs=10,callbacks=all_callbacks)
WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
Model: "sequential"
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 32)          242624    
lstm (LSTM)                  (None, 3)                 432       
dense (Dense)                (None, 1)                 4         
Total params: 243,060
Trainable params: 243,060
Non-trainable params: 0
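The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer, with the sizes read off the summary.

```python
vocab_size = 7582  # rows of the embedding table
embed_dim = 32     # embedding width
lstm_units = 3     # LSTM output size shown in the summary

# One embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim

# 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# One weight per LSTM output plus a bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 242624 432 4 243060
```

Almost all of the parameters sit in the embedding table, so the vocabulary size is what drives the model size.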
Epoch 1/10
121/121 [==============================] - 53s 438ms/step - loss: 0.6874 - accuracy: 0.7978 - val_loss: 0.5387 - val_accuracy: 0.8139
Epoch 2/10
121/121 [==============================] - 53s 435ms/step - loss: 0.4969 - accuracy: 0.8143 - val_loss: 0.4542 - val_accuracy: 0.8139
Epoch 3/10
121/121 [==============================] - 53s 436ms/step - loss: 0.4114 - accuracy: 0.8326 - val_loss: 0.3797 - val_accuracy: 0.8505
Epoch 4/10
121/121 [==============================] - 53s 437ms/step - loss: 0.3259 - accuracy: 0.8935 - val_loss: 0.3267 - val_accuracy: 0.8930
Epoch 5/10
121/121 [==============================] - 52s 434ms/step - loss: 0.2721 - accuracy: 0.9220 - val_loss: 0.3064 - val_accuracy: 0.8974
Epoch 6/10
121/121 [==============================] - 53s 434ms/step - loss: 0.2320 - accuracy: 0.9397 - val_loss: 0.3019 - val_accuracy: 0.8952
Epoch 7/10
121/121 [==============================] - 53s 437ms/step - loss: 0.2052 - accuracy: 0.9488 - val_loss: 0.2978 - val_accuracy: 0.9004
Epoch 8/10
121/121 [==============================] - 52s 433ms/step - loss: 0.1832 - accuracy: 0.9547 - val_loss: 0.3040 - val_accuracy: 0.8930
Epoch 9/10
121/121 [==============================] - 53s 437ms/step - loss: 0.1638 - accuracy: 0.9596 - val_loss: 0.3176 - val_accuracy: 0.8945
Epoch 10/10
121/121 [==============================] - 53s 435ms/step - loss: 0.1464 - accuracy: 0.9642 - val_loss: 0.3166 - val_accuracy: 0.8894
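The interplay of `patience = 5` and `min_delta=.02` can be traced on the validation losses logged above. The loop below is a simplified re-implementation of the early-stopping idea (an epoch only counts as an improvement if it beats the best loss by more than `min_delta`). It is not the Keras code itself, and details may differ between versions.

```python
# Validation losses copied from the training log
val_losses = [0.5387, 0.4542, 0.3797, 0.3267, 0.3064,
              0.3019, 0.2978, 0.3040, 0.3176, 0.3166]
patience, min_delta = 5, 0.02

best = float("inf")
best_epoch = None
wait = 0
stopped_epoch = None
for epoch, loss in enumerate(val_losses, start=1):
    if loss < best - min_delta:  # counts as an improvement
        best, best_epoch, wait = loss, epoch, 0
    else:
        wait += 1                # one more epoch without improvement
        if wait >= patience:
            stopped_epoch = epoch
            break

print(best_epoch)     # 5
print(stopped_epoch)  # 10
```

Under this rule the small gains at epochs 6 and 7 do not count as improvements, so the patience budget runs out exactly at epoch 10, even though the lowest validation loss in the log is at epoch 7.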
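# Plot training and validation accuracy per epoch
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.legend(['train', 'validation'], loc='upper left')
plt.savefig('Accuracy vs Epochs.png')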
<Figure size 432x288 with 0 Axes>
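# Plot training and validation loss per epoch
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.legend(['train', 'validation'], loc='upper left')
plt.savefig('Loss vs Epochs.png')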
<Figure size 432x288 with 0 Axes>
model = keras.models.load_model('LSTM.h5')
WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
242/242 [==============================] - 7s 28ms/step - loss: 0.1644 - accuracy: 0.9646
43/43 [==============================] - 1s 27ms/step - loss: 0.2978 - accuracy: 0.9004
[0.29777950048446655, 0.9003663063049316]
sequence = tokenizer.texts_to_sequences(X_test_transformed)
X_test_transformed_seq= keras.preprocessing.sequence.pad_sequences(sequence, maxlen=150)
72/72 [==============================] - 2s 28ms/step - loss: 0.2916 - accuracy: 0.8901
[0.2916449010372162, 0.8900615572929382]

We are achieving about 89% accuracy on the test set.
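Because only about one tweet in five is a disaster tweet, accuracy alone can hide how well the positive class is handled. As a rough sketch, precision, recall and F1 for the disaster class could be computed like this. The labels below are invented for illustration and are not the notebook's predictions.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented labels: 8 non-disaster tweets, 2 disaster tweets
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, precision, recall, f1)  # 0.8 0.5 0.5 0.5
```

In this toy case accuracy is 80% while only half of the disaster tweets are found, which is why it helps to check these numbers alongside accuracy.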

sample_pred = (model.predict(X_test_transformed_seq[:5])>.5).astype(int)
sample_text = X_test[:5]
truth = y_test[:5]
display(pd.DataFrame(zip(sample_pred,sample_text,truth),columns=['Model Predictions','TEXT','Truth Label']))
Model Predictions TEXT Truth Label
0 [0] even news site loud meant wearesorry 0
1 [0] yeah yer well prepared hit building amp knocke... 0
2 [0] desolate valley wa transformed thriving hub hi... 0
3 [1] although late extend condolence family victim ... 1
4 [0] violent storm quicker pass paulo coelho 0
