Cainvas

Disaster Tweet Prediction

Credit: AITS Cainvas Community

Image

Photo by Piotr Wojtczak on Dribbble

Our Objective is to predict if a Tweet is describing about a disaster or not.

Important Libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import model_selection,feature_extraction
from collections import Counter
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import re
!pip install contractions
import contractions
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
!pip install pyspellchecker
from spellchecker import SpellChecker
import tensorflow as tf
from tensorflow import keras
!pip install wget
!wget -N 'https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/tweets.csv'
[nltk_data] Downloading package punkt to /home/jupyter-
[nltk_data]     dark/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jupyter-
[nltk_data]     dark/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: contractions in /home/jupyter-dark/.local/lib/python3.7/site-packages (0.0.52)
Requirement already satisfied: textsearch>=0.0.21 in /home/jupyter-dark/.local/lib/python3.7/site-packages (from contractions) (0.0.21)
Requirement already satisfied: pyahocorasick in /home/jupyter-dark/.local/lib/python3.7/site-packages (from textsearch>=0.0.21->contractions) (1.4.2)
Requirement already satisfied: anyascii in /home/jupyter-dark/.local/lib/python3.7/site-packages (from textsearch>=0.0.21->contractions) (0.2.0)
WARNING: You are using pip version 20.3.1; however, version 21.1.3 is available.
You should consider upgrading via the '/opt/tljh/user/bin/python -m pip install --upgrade pip' command.
[nltk_data] Downloading package wordnet to /home/jupyter-
[nltk_data]     dark/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pyspellchecker in /home/jupyter-dark/.local/lib/python3.7/site-packages (0.6.2)
WARNING: You are using pip version 20.3.1; however, version 21.1.3 is available.
You should consider upgrading via the '/opt/tljh/user/bin/python -m pip install --upgrade pip' command.
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: wget in /home/jupyter-dark/.local/lib/python3.7/site-packages (3.2)
WARNING: You are using pip version 20.3.1; however, version 21.1.3 is available.
You should consider upgrading via the '/opt/tljh/user/bin/python -m pip install --upgrade pip' command.
--2021-06-29 11:23:00--  https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/tweets.csv
Resolving cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)... 52.219.64.0
Connecting to cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)|52.219.64.0|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1615005 (1.5M) [text/csv]
Saving to: ‘tweets.csv’

tweets.csv          100%[===================>]   1.54M  --.-KB/s    in 0.01s   

2021-06-29 11:23:00 (162 MB/s) - ‘tweets.csv’ saved [1615005/1615005]

In [2]:
data = pd.read_csv('tweets.csv',index_col='id')
data.reset_index(drop=True,inplace=True)
data = data[['text', 'target']]
data_train,data_test = model_selection.train_test_split(data,test_size=.2,random_state = 43,stratify=data.target)
In [3]:
data_train.reset_index(drop=True,inplace=True)
data_test.reset_index(drop=True,inplace=True)

The target of 0 indicate a Non Disastrous Tweet and 1 indicate a Disastrous Tweet ,meaning tweet is describing / report

ing a disastor .

In [4]:
display(data_train)
text target
0 🤣 Are there horses on M’Baku St? If so, I may ... 0
1 Police officers like Davinder Singh who swear ... 0
2 Weverse Official BTS | JIMIN 200113 ↘ [ARMY ZI... 0
3 A war between uhuru and ruto will crash the co... 0
4 As I opened my mouth to speak, everything went... 0
... ... ...
9091 Hi Lisa, I'm very sorry about this, a freight ... 1
9092 AMEN! Set the whole system ablaze, man. https:... 0
9093 Phivolcs also noted that another indication of... 0
9094 "Let their habitation be desolate; [and] let n... 0
9095 🤔was wondering why i never seen the word c*nt ... 0

9096 rows × 2 columns

BASIC EDA

THE 9000 sample is divided in roughly 7400 False and 1700 True target points.

In [5]:
sns.set_style('darkgrid')
print(data_train['target'].value_counts())
sns.countplot(data = data_train,x= 'target')
0    7405
1    1691
Name: target, dtype: int64
Out[5]:
<AxesSubplot:xlabel='target', ylabel='count'>

Text Length Distribution

In [6]:
sns.histplot(data_train['text'].apply(lambda x:len(x)))
Out[6]:
<AxesSubplot:xlabel='text', ylabel='Count'>
In [7]:
print('Short Message in our data :',data_train.loc[data['text'].apply(lambda x:len(x)).idxmin()].values,sep='\n')
print('========')
print('Long Message in our data  :',data_train.loc[data_train['text'].apply(lambda x:len(x))>140].values[0],sep='\n')
Short Message in our data :
['HAHAHAHAHAHA THIS IS SO BAD aw i miss screaming in courtney’s ears' 0]
========
Long Message in our data  :
["Tributes for British tourist who died in cliff fall at Sydney's Diamond Bay https://t.co/yMsX6DXWqI - so sad &amp; so… https://t.co/G9U4lDSpWU"
 1]

Transforming Data

In [8]:
stop_words = nltk.corpus.stopwords.words('english')
i = 0
wnl = WordNetLemmatizer()
stemmer=SnowballStemmer('english')
for doc in data_train.text:
  doc = re.sub(r'https?://\S+|www\.\S+','',doc)
  doc = re.sub(r'<.*?>','',doc)
  doc = re.sub(r'[^a-zA-Z\s]','',doc,re.I|re.A)
  doc = ' '.join([wnl.lemmatize(i) for i in doc.lower().split()])
  doc = contractions.fix(doc)
  tokens = nltk.word_tokenize(doc)
  filtered = [token for token in tokens if token not in stop_words]
  doc = ' '.join(filtered)
  data_train.text[i] = doc
  i+=1
i=0
for doc in data_test.text:
  doc = re.sub(r'https?://\S+|www\.\S+','',doc)
  doc = re.sub(r'<.*?>','',doc)
  doc = re.sub(r'[^a-zA-Z\s]','',doc,re.I|re.A)
  doc = ' '.join([wnl.lemmatize(i) for i in doc.lower().split()])
  doc = contractions.fix(doc)
  tokens = nltk.word_tokenize(doc)
  filtered = [token for token in tokens if token not in stop_words]
  doc = ' '.join(filtered)
  data_test.text[i] = doc
  i+=1
In [9]:
print('Short Message in our data :',data_train.loc[data['text'].apply(lambda x:len(x)).idxmin()].values,sep='\n')
print('========')
print('Long Message in our data  :',data_train.loc[data_train['text'].apply(lambda x:len(x))>100].values[0],sep='\n')
Short Message in our data :
['hahahahahaha bad aw miss screaming courtneys ear' 0]
========
Long Message in our data  :
['showed activity wa recently founded bioterrorism manufactured disease tin man android global organism would'
 0]

You can see the conversions being done ,Previously the shortest statement was

'British diver Neil Anthony Fears found dead by the wreck of a steamship - Daily Mail http://t.co/QP3GVvfoFq' which got Transformed into

'british diver neil anthony fear found dead wreck steamship daily mail' as you can see we were successfull in doing 2 things-

1)Making the data a bit shorter

2)Not loosing the data meaning by making conversion

In [10]:
X = data_train['text']
y = data_train['target']
X_test = data_test['text']
y_test = data_test['target']

The Vocablury size of 17293 is too big , this means that to represent a sentence I will have to store a matrix of size of 17293 length ,this seems very much unneccesary as most of the text message have length between 120 - 140 .

In [11]:
count_vectorizer = feature_extraction.text.CountVectorizer()
count_vectorizer.fit_transform(X)
print(len(count_vectorizer.vocabulary_))
17293

Now here comes the difficult part , how many word to be kept in our Vocablury , each has its pro and cons like -

1) If the number is high , this allows us to keep track of more words and store more information but the con being we use too much space and huge computations .

2) If the number is low , Our model will train fast and less space will be used ,but by eliminating other words we are losing some potential information that might have been useful for model.

I will chose to keep words that come more than once

In [12]:
freq_of_word = Counter()
for i in X:
    freq_of_word.update(i.split())
    
vocablury = []
for word,freq in freq_of_word.items():
    if freq>1:
        vocablury.append(word)
print(len(vocablury))
vocab_size = len(vocablury)
7582
In [13]:
def create_tokenizer(post):
    tokenizer = keras.preprocessing.text.Tokenizer(num_words=vocab_size)
    tokenizer.fit_on_texts(post) 
    return tokenizer
def change(post):
    sentence = ''
    for i in post.split():
        if i in vocablury:
            sentence = sentence + i + ' '
    return sentence
X_transformed = X.apply(change)
X_test_transformed = X_test.apply(change)
In [14]:
tokenizer = create_tokenizer(X_transformed)
print('Our New Dicioinary has',len(tokenizer.word_index),'Unique Words')
Our New Dicioinary has 7582 Unique Words
In [15]:
sequence = tokenizer.texts_to_sequences(X_transformed)
X_transformed_seq= keras.preprocessing.sequence.pad_sequences(sequence, maxlen=150)
X_train, X_val, y_train, y_val = model_selection.train_test_split(X_transformed_seq,y ,test_size = .15,stratify = y)
print(X_train.shape,X_val.shape)
(7731, 150) (1365, 150)
In [16]:
keras.backend.clear_session()
model = keras.models.Sequential()
model.add(keras.layers.Embedding(vocab_size, 32))
model.add(keras.layers.LSTM(3,recurrent_dropout=.5,dropout=.5,return_sequences=False,kernel_regularizer=keras.regularizers.l2()))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.summary()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
all_callbacks = [keras.callbacks.EarlyStopping(patience = 5,min_delta=.02,restore_best_weights=True),
                 keras.callbacks.ModelCheckpoint('LSTM.h5',save_best_only=True,monitor='val_accuracy')]
history  = model.fit(X_train,y_train,validation_data= (X_val,y_val),batch_size = 64,epochs=10,callbacks=all_callbacks)
WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 32)          242624    
_________________________________________________________________
lstm (LSTM)                  (None, 3)                 432       
_________________________________________________________________
dense (Dense)                (None, 1)                 4         
=================================================================
Total params: 243,060
Trainable params: 243,060
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
121/121 [==============================] - 52s 426ms/step - loss: 0.7210 - accuracy: 0.7905 - val_loss: 0.5761 - val_accuracy: 0.8139
Epoch 2/10
121/121 [==============================] - 52s 426ms/step - loss: 0.5219 - accuracy: 0.8141 - val_loss: 0.4897 - val_accuracy: 0.8139
Epoch 3/10
121/121 [==============================] - 51s 425ms/step - loss: 0.4494 - accuracy: 0.8179 - val_loss: 0.4183 - val_accuracy: 0.8242
Epoch 4/10
121/121 [==============================] - 52s 430ms/step - loss: 0.3587 - accuracy: 0.8721 - val_loss: 0.3658 - val_accuracy: 0.8674
Epoch 5/10
121/121 [==============================] - 56s 459ms/step - loss: 0.2928 - accuracy: 0.9131 - val_loss: 0.3287 - val_accuracy: 0.8886
Epoch 6/10
121/121 [==============================] - 52s 432ms/step - loss: 0.2539 - accuracy: 0.9294 - val_loss: 0.3193 - val_accuracy: 0.8894
Epoch 7/10
121/121 [==============================] - 51s 424ms/step - loss: 0.2256 - accuracy: 0.9366 - val_loss: 0.3183 - val_accuracy: 0.8908
Epoch 8/10
121/121 [==============================] - 49s 403ms/step - loss: 0.2016 - accuracy: 0.9476 - val_loss: 0.3208 - val_accuracy: 0.8908
Epoch 9/10
121/121 [==============================] - 49s 407ms/step - loss: 0.1764 - accuracy: 0.9552 - val_loss: 0.3186 - val_accuracy: 0.8864
Epoch 10/10
121/121 [==============================] - 50s 413ms/step - loss: 0.1606 - accuracy: 0.9599 - val_loss: 0.3257 - val_accuracy: 0.8784
In [17]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
plt.savefig('Accuracy vs Epochs.png')
<Figure size 432x288 with 0 Axes>
In [18]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
plt.savefig('Loss vs Epochs.png')
<Figure size 432x288 with 0 Axes>
In [19]:
model = keras.models.load_model('LSTM.h5')
model.evaluate(X_train,y_train)
model.evaluate(X_val,y_val)
WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
242/242 [==============================] - 7s 29ms/step - loss: 0.1866 - accuracy: 0.9499
43/43 [==============================] - 1s 28ms/step - loss: 0.3183 - accuracy: 0.8908
Out[19]:
[0.31832030415534973, 0.8908424973487854]
In [20]:
sequence = tokenizer.texts_to_sequences(X_test_transformed)
X_test_transformed_seq= keras.preprocessing.sequence.pad_sequences(sequence, maxlen=150)
model.evaluate(X_test_transformed_seq,y_test)
72/72 [==============================] - 2s 28ms/step - loss: 0.3003 - accuracy: 0.8936
Out[20]:
[0.3002898395061493, 0.8935796022415161]

We are Achieving about 89% Accuracy .

In [21]:
sample_pred = (model.predict(X_test_transformed_seq[:5])>.5).astype(int)
sample_text = X_test[:5]
truth = y_test[:5]
display(pd.DataFrame(zip(sample_pred,sample_text,truth),columns=['Model Predictions','TEXT','Truth Label']))
Model Predictions TEXT Truth Label
0 [0] even news site loud meant wearesorry 0
1 [0] yeah yer well prepared hit building amp knocke... 0
2 [0] desolate valley wa transformed thriving hub hi... 0
3 [1] although late extend condolence family victim ... 1
4 [0] violent storm quicker pass paulo coelho 0