Sarcasm detection in news headlines

Credit: AITS Cainvas Community

Photo by Su for RaDesign on Dribbble

Sarcasm can flip the sentiment of a sentence, which makes sarcasm detection an important part of sentiment analysis.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import nltk
import re
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.metrics import confusion_matrix, f1_score
from tensorflow.keras import models, layers, optimizers, losses, callbacks

Dataset

On Kaggle by Rishabh Misra

This dataset was collected from two news websites. The Onion publishes sarcastic versions of current events; its headlines from the News in Brief and News in Photos categories (which are sarcastic) were collected. Real, non-sarcastic news headlines were collected from HuffPost.

In [2]:
df = pd.read_json('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/Sarcasm_Headlines_Dataset_v2.json',
                  lines = True)
df
Out[2]:
       is_sarcastic  headline                                           article_link
0      1             thirtysomething scientists unveil doomsday clo...  https://www.theonion.com/thirtysomething-scien...
1      0             dem rep. totally nails why congress is falling...  https://www.huffingtonpost.com/entry/donna-edw...
2      0             eat your veggies: 9 deliciously different recipes  https://www.huffingtonpost.com/entry/eat-your-...
3      1             inclement weather prevents liar from getting t...  https://local.theonion.com/inclement-weather-p...
4      1             mother comes pretty close to using word 'strea...  https://www.theonion.com/mother-comes-pretty-c...
...    ...           ...                                                ...
28614  1             jews to celebrate rosh hashasha or something       https://www.theonion.com/jews-to-celebrate-ros...
28615  1             internal affairs investigator disappointed con...  https://local.theonion.com/internal-affairs-in...
28616  0             the most beautiful acceptance speech this week...  https://www.huffingtonpost.com/entry/andrew-ah...
28617  1             mars probe destroyed by orbiting spielberg-gat...  https://www.theonion.com/mars-probe-destroyed-...
28618  1             dad clarifies this not a food stop                 https://www.theonion.com/dad-clarifies-this-no...

28619 rows × 3 columns
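The `lines = True` argument tells pandas that the file is in JSON Lines form: one JSON object per line rather than a single JSON array. A minimal sketch of what that layout looks like, using only the standard library. The two headlines come from the table above; the links are placeholders, not the real article URLs.

```python
import json

# Two records in JSON Lines form: one JSON object per line, no enclosing array.
# The article_link values are placeholders for illustration only.
raw = (
    '{"is_sarcastic": 1, "headline": "dad clarifies this not a food stop", '
    '"article_link": "https://example.com/a"}\n'
    '{"is_sarcastic": 0, "headline": "eat your veggies: 9 deliciously different recipes", '
    '"article_link": "https://example.com/b"}\n'
)

# Parse each non-empty line independently
records = [json.loads(line) for line in raw.splitlines() if line.strip()]
labels = [r["is_sarcastic"] for r in records]

print(records[0]["headline"], labels)
```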

Distribution of the class labels:

In [3]:
df['is_sarcastic'].value_counts()
Out[3]:
0    14985
1    13634
Name: is_sarcastic, dtype: int64

The dataset is almost balanced: 14,985 non-sarcastic headlines against 13,634 sarcastic ones.
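To put a number on the balance, the class shares can be worked out from the counts in Out[3]. The sketch below also derives the accuracy a classifier would reach by always predicting the majority class, which is a useful floor to compare the trained model against. The variable names are illustrative only.

```python
# Class counts copied from Out[3]
counts = {0: 14985, 1: 13634}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the larger class
majority_baseline = max(shares.values())

print("total headlines:", total)
for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

The majority class holds a little over half of the rows, so plain accuracy is a reasonable metric here.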

Data preprocessing

In [4]:
# Remove HTML tags
def removeHTML(sentence):
    regex = re.compile(r'<.*?>')
    return re.sub(regex, ' ', sentence)

# Remove URLs
def removeURL(sentence):
    # Raw string so that \S reaches the regex engine unescaped
    regex = re.compile(r'https?://\S+')
    return re.sub(regex, ' ', sentence)

# Remove numbers, punctuation and any special characters (keep only letters)
def onlyAlphabets(sentence):
    regex = re.compile(r'[^a-zA-Z]')
    return re.sub(regex, ' ', sentence)
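A quick way to see what the three helpers do is to run them on a single string in the order the notebook applies them: URLs first, then HTML tags, then everything that is not a letter. The sketch below repeats the same regular expressions so that it runs on its own. The sample string and the `clean` helper are made up for illustration.

```python
import re

def removeHTML(sentence):
    return re.sub(r'<.*?>', ' ', sentence)

def removeURL(sentence):
    return re.sub(r'https?://\S+', ' ', sentence)

def onlyAlphabets(sentence):
    return re.sub(r'[^a-zA-Z]', ' ', sentence)

def clean(sentence):
    # Hypothetical helper: same order as in the notebook (URL, HTML, letters only, lowercase)
    sentence = removeURL(sentence)
    sentence = removeHTML(sentence)
    sentence = onlyAlphabets(sentence)
    return sentence.lower().split()

# Made-up input containing a tag, a digit, a URL and punctuation
sample = 'Breaking: <b>9 tips</b> at https://example.com/x?id=1 !'
tokens = clean(sample)
print(tokens)
```

The digit, the tags, the URL and the punctuation are all replaced by spaces, so only the lowercase words survive the final `split()`.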
In [5]:
sno = nltk.stem.SnowballStemmer('english')    # Initializing stemmer
wordcloud = [[], []]    # Unstemmed words per class: index 0 = not sarcastic, 1 = sarcastic
all_sentences = []    # All cleaned sentences

for headline, sarcasm in zip(df['headline'].values, df['is_sarcastic'].values):
    sentence = removeURL(headline)
    sentence = removeHTML(sentence)
    sentence = onlyAlphabets(sentence)
    sentence = sentence.lower()

    cleaned_sentence = []
    for word in sentence.split():
        # Stop-word filtering is disabled; every word is stemmed and kept
        cleaned_sentence.append(sno.stem(word))
        wordcloud[sarcasm].append(word)

    all_sentences.append(' '.join(cleaned_sentence))

# Model inputs (cleaned headlines) and labels
X = all_sentences
y = df['is_sarcastic']
In [6]:
class_names = ['Not sarcastic', 'Sarcastic']

Visualization

In [7]:
plt.figure(figsize=(10,10))

for i in range(len(class_names)):
    ax = plt.subplot(len(class_names), 1, i + 1)
    plt.imshow(WordCloud().generate(' '.join(wordcloud[i])))
    plt.title(class_names[i])
    plt.axis("off")

Train - val split

In [8]:
# Splitting into train and val set -- 80-20 split

Xtrain, Xval, ytrain, yval = train_test_split(X, y, test_size = 0.2)
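The split above is unseeded, so the rows that land in the validation set change on every run. scikit-learn's `train_test_split` accepts `random_state` for a repeatable shuffle and `stratify` to keep the class ratio the same in both parts. As a rough illustration of what those two options mean, here is a standard-library sketch of a seeded, stratified 80-20 split. The `stratified_split` function and the toy labels are assumptions made for this example, not scikit-learn's implementation.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, val_indices), splitting each class separately."""
    rng = random.Random(seed)          # fixed seed gives the same split every run
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_val = round(len(idxs) * test_fraction)
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels: 60 of class 0 and 40 of class 1
labels = [0] * 60 + [1] * 40
train_idx, val_idx = stratified_split(labels)

print(len(train_idx), len(val_idx))
print("sarcastic share in val:", sum(labels[i] for i in val_idx) / len(val_idx))
```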
In [9]:
# Tokenization
vocab = 1500
mlen = 200
 
tokenizer = Tokenizer(num_words = vocab, oov_token = '<UNK>')
tokenizer.fit_on_texts(Xtrain)
 
Xtrain = tokenizer.texts_to_sequences(Xtrain)
Xtrain = pad_sequences(Xtrain, maxlen=mlen)

Xval = tokenizer.texts_to_sequences(Xval)
Xval = pad_sequences(Xval, maxlen=mlen)
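The tokenizer maps each word to an integer ranked by frequency, sends anything outside the `vocab` most frequent indices to the `<UNK>` token, and `pad_sequences` then left-pads every sequence with zeros up to `mlen`. The sketch below imitates that behaviour in plain Python to make the steps visible. It is a simplified stand-in written for this explanation, not the Keras code, and the function names are hypothetical.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<UNK>"):
    # Index 0 is reserved for padding and index 1 for the out-of-vocabulary token
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, num_words, oov_token="<UNK>"):
    oov = word_index[oov_token]
    seq = []
    for word in text.split():
        idx = word_index.get(word, oov)
        # Words ranked at or beyond num_words are treated as unknown
        seq.append(idx if idx < num_words else oov)
    return seq

def pad_pre(seq, maxlen):
    seq = seq[-maxlen:]                       # keep the last maxlen tokens
    return [0] * (maxlen - len(seq)) + seq    # left-pad with zeros

# Toy corpus of already-stemmed text
train = ["dad clarifi this not a food stop", "dad stop"]
word_index = fit_word_index(train)

seq = to_sequence("dad never stop", word_index, num_words=4)
padded = pad_pre(seq, maxlen=5)
print(seq, padded)
```

Fitting the word index on the training split only, as the notebook does, keeps validation vocabulary from leaking into the model's inputs.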

The model

In [10]:
# Build and train neural network
embedding_dim = 128
 
model = models.Sequential([
    layers.Embedding(vocab, embedding_dim, input_length = mlen),
    layers.LSTM(128, activation='tanh'),
    layers.Dense(32, activation = 'relu'),
    layers.Dense(16, activation = 'relu'),
    layers.Dense(1, activation = 'sigmoid')
])
 
cb = [callbacks.EarlyStopping(patience = 5, restore_best_weights = True)]
In [11]:
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 200, 128)          192000    
_________________________________________________________________
lstm (LSTM)                  (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 32)                4128      
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
=================================================================
Total params: 328,257
Trainable params: 328,257
Non-trainable params: 0
_________________________________________________________________
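The parameter counts in the summary can be reproduced by hand, which is a handy check that the layer sizes are what was intended. An LSTM has four gates, each with input weights, recurrent weights and a bias. The short calculation below uses the sizes from the model definition; the helper name `dense` is only for this sketch.

```python
vocab, embedding_dim, lstm_units = 1500, 128, 128

# Embedding: one vector of size embedding_dim per vocabulary index
embedding = vocab * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias vector
lstm = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

def dense(n_in, n_out):
    # Weight matrix plus one bias per output unit
    return n_in * n_out + n_out

dense_params = [dense(128, 32), dense(32, 16), dense(16, 1)]
total = embedding + lstm + sum(dense_params)

print(embedding, lstm, dense_params, total)
```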
In [12]:
model.compile(optimizer = optimizers.Adam(0.01), loss = losses.BinaryCrossentropy(), metrics = ['accuracy'])
 
history = model.fit(Xtrain, ytrain, batch_size=64, epochs = 256, validation_data=(Xval, yval), callbacks = cb)
Epoch 1/256
358/358 [==============================] - 10s 29ms/step - loss: 0.4424 - accuracy: 0.7910 - val_loss: 0.3857 - val_accuracy: 0.8262
Epoch 2/256
358/358 [==============================] - 10s 29ms/step - loss: 0.3381 - accuracy: 0.8525 - val_loss: 0.3598 - val_accuracy: 0.8367
Epoch 3/256
358/358 [==============================] - 10s 28ms/step - loss: 0.2905 - accuracy: 0.8750 - val_loss: 0.3695 - val_accuracy: 0.8374
Epoch 4/256
358/358 [==============================] - 10s 27ms/step - loss: 0.2551 - accuracy: 0.8891 - val_loss: 0.3912 - val_accuracy: 0.8368
Epoch 5/256
358/358 [==============================] - 10s 27ms/step - loss: 0.2296 - accuracy: 0.9029 - val_loss: 0.4042 - val_accuracy: 0.8387
Epoch 6/256
358/358 [==============================] - 10s 27ms/step - loss: 0.2141 - accuracy: 0.9109 - val_loss: 0.4138 - val_accuracy: 0.8314
Epoch 7/256
358/358 [==============================] - 10s 27ms/step - loss: 0.1969 - accuracy: 0.9161 - val_loss: 0.4651 - val_accuracy: 0.8297
In [13]:
model.evaluate(Xval, yval)
179/179 [==============================] - 2s 9ms/step - loss: 0.3598 - accuracy: 0.8367
Out[13]:
[0.35975325107574463, 0.8366526961326599]
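The evaluation loss matches the validation loss of epoch 2 because `EarlyStopping` was created with `restore_best_weights = True`: training stops once the monitored value (validation loss by default) has failed to improve for `patience` epochs, and the weights from the best epoch are restored. A simplified sketch of that rule, applied to the validation losses printed in the log above. It mirrors the idea rather than the Keras implementation, and `early_stop` is a hypothetical name.

```python
# Validation losses copied from the training log, epochs 1 to 7
val_loss = [0.3857, 0.3598, 0.3695, 0.3912, 0.4042, 0.4138, 0.4651]

def early_stop(losses, patience):
    """Return (best_epoch, stopped_epoch), both 1-based."""
    best_epoch, best, wait = None, float("inf"), 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best_epoch, best, wait = epoch, loss, 0   # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(losses)

best_epoch, stopped_epoch = early_stop(val_loss, patience=5)
print("best epoch:", best_epoch, "stopped after epoch:", stopped_epoch)
```

The training accuracy keeps rising after epoch 2 while the validation loss grows, which is the usual sign of overfitting.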
In [14]:
cm = confusion_matrix(yval, (model.predict(Xval)>0.5).astype('int64'))
cm = cm.astype('int') / cm.sum(axis=1)[:, np.newaxis]

fig = plt.figure(figsize = (5, 5))
ax = fig.add_subplot(111)

for i in range(cm.shape[0]):        # rows: true class
    for j in range(cm.shape[1]):    # columns: predicted class
        if cm[i,j] > 0.8:
            clr = "white"
        else:
            clr = "black"
        ax.text(j, i, format(cm[i, j], '.2f'), horizontalalignment="center", color=clr)

_ = ax.imshow(cm, cmap=plt.cm.Blues)
ax.set_xticks(range(len(class_names)))
ax.set_yticks(range(len(class_names)))
ax.set_xticklabels(class_names, rotation = 90)
ax.set_yticklabels(class_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
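The matrix above is normalized by row, so each row sums to 1 and the diagonal holds the per-class recall. `f1_score` is imported in the first cell but never called; the sketch below shows how the same kind of matrix yields precision, recall and F1 for the sarcastic class. The labels are a small made-up example, not the model's predictions.

```python
# Toy labels for illustration only (0 = not sarcastic, 1 = sarcastic)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    cm[t][p] += 1

# Row normalization: each entry becomes a fraction of its true class
row_norm = [[c / sum(row) for c in row] for row in cm]

tn, fp = cm[0]
fn, tp = cm[1]
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(cm, row_norm)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```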

Plotting the metrics

In [15]:
def plot(history, variable, variable2):
    plt.plot(range(len(history[variable])), history[variable])
    plt.plot(range(len(history[variable2])), history[variable2])
    plt.legend([variable, variable2])
    plt.title(variable)
In [16]:
plot(history.history, "accuracy", 'val_accuracy')
In [17]:
plot(history.history, "loss", 'val_loss')

Prediction

In [18]:
# Pick a random row from the validation split (yval keeps the original df index)
x = np.random.choice(yval.index)

headline = df['headline'].values[x]

print("Headline: ", headline)

cleaned_text = []

sentence = removeURL(headline) 
sentence = removeHTML(sentence)
sentence = onlyAlphabets(sentence)
sentence = sentence.lower()   

for word in sentence.split():
    # Stop-word filtering is disabled; every word is stemmed and kept
    cleaned_text.append(sno.stem(word))

cleaned_text = [' '.join(cleaned_text)]

print("Cleaned text: ", cleaned_text[0])

cleaned_text = tokenizer.texts_to_sequences(cleaned_text)
cleaned_text = pad_sequences(cleaned_text, maxlen=mlen)

category = df['is_sarcastic'].values[x]  

print("\nTrue category: ", class_names[category])

output = model.predict(cleaned_text)[0][0]

pred = (output>0.5).astype('int64')

print("\nPredicted category: ", class_names[pred], "(", output, "-->", pred, ")")
Headline:  barack obama just cracked down on wall street
Cleaned text:  barack obama just crack down on wall street

True category:  Not sarcastic

Predicted category:  Not sarcastic ( 0.38051096 --> 0 )

deepC

In [20]:
model.save('sarcasm.h5')
!deepCC sarcasm.h5
[INFO]
Reading [keras model] 'sarcasm.h5'
[SUCCESS]
Saved 'sarcasm_deepC/sarcasm.onnx'
[INFO]
Reading [onnx model] 'sarcasm_deepC/sarcasm.onnx'
[INFO]
Model info:
  ir_vesion : 4
  doc       : 
[WARNING]
[ONNX]: lstm (LSTM) has 4 inputs, that aren't connected.
[WARNING]
[ONNX]: terminal (input/output) embedding_input's shape is less than 1. Changing it to 1.
[WARNING]
[ONNX]: terminal (input/output) dense_2's shape is less than 1. Changing it to 1.
WARN (GRAPH): found operator node with the same name (dense_2) as io node.
[INFO]
Running DNNC graph sanity check ...
ERROR (GRAPH): some of graph sequential's node lstm's
               outputs are not connected to other nodes in the graph.
[ERROR]
Failed. Please check your model. graph sequential
operator Cast {
	input embedding_input
	output casted
}
operator embedding {
	input embedding_embeddings_0
	input casted
	output embedding_embedding_lookup_Identity_1_0
}
operator Transpose {
	input embedding_embedding_lookup_Identity_1_0
	output lstm_X
}
operator lstm {
	input lstm_X
	input lstm_W
	input lstm_R
	input lstm_B
	output lstm_Y
	output lstm_Y_h
	output lstm_Y_c
}
operator Squeeze {
	input lstm_Y_h
	output lstm_PartitionedCall_0
}
operator dense {
	input lstm_PartitionedCall_0
	input dense_kernel_0
	output dense0
}
operator Add2 {
	input dense0
	input dense_bias_0
	output biased_tensor_name2
}
operator Relu1 {
	input biased_tensor_name2
	output dense_Relu_0
}
operator dense_1 {
	input dense_Relu_0
	input dense_1_kernel_0
	output dense_10
}
operator Add1 {
	input dense_10
	input dense_1_bias_0
	output biased_tensor_name1
}
operator Relu {
	input biased_tensor_name1
	output dense_1_Relu_0
}
operator dense_2 {
	input dense_1_Relu_0
	input dense_2_kernel_0
	output dense_20
}
operator Add {
	input dense_20
	input dense_2_bias_0
	output biased_tensor_name
}
operator Sigmoid {
	input biased_tensor_name
	output dense_2
}
weight { float dense_2_kernel_0 [16,1] }
weight { float dense_2_bias_0 [1] }
weight { float dense_1_kernel_0 [32,16] }
weight { float dense_1_bias_0 [16] }
weight { float dense_kernel_0 [128,32] }
weight { float dense_bias_0 [32] }
weight { float lstm_W [1,512,128] }
weight { float lstm_R [1,512,128] }
weight { float lstm_B [1,1024] }
weight { float embedding_embeddings_0 [1500,128] }
input {float embedding_input[1,200]}
output {float dense_2[1,1]}


[INFO]
Writing C++ file 'sarcasm_deepC/sarcasm.cpp'
ERROR (TYPE INFER): cound not find all nodes for lstm,
WARN (CODEGEN): cound not find all nodes for lstm,
                an instance of LSTM.
                Please check model's sanity and try again.
[INFO]
deepSea model files are ready in 'sarcasm_deepC/' 
[RUNNING COMMAND]
g++ -std=c++11 -O3 -fno-rtti -fno-exceptions -I. -I/opt/tljh/user/lib/python3.7/site-packages/deepC-0.13-py3.7-linux-x86_64.egg/deepC/include -isystem /opt/tljh/user/lib/python3.7/site-packages/deepC-0.13-py3.7-linux-x86_64.egg/deepC/packages/eigen-eigen-323c052e1731 "sarcasm_deepC/sarcasm.cpp" -D_AITS_MAIN -o "sarcasm_deepC/sarcasm.exe"
[ERROR]
sarcasm_deepC/sarcasm.cpp: In function ‘std::vector<deepSea::ndarray<float> > deepSea_model(deepSea::ndarray<float>)’:
sarcasm_deepC/sarcasm.cpp:71:26: error: wrong number of template arguments (2, should be 1)
   71 |   dnnc::Cast<float, float> Cast("Cast");
      |                          ^
In file included from sarcasm_deepC/sarcasm.cpp:21:
/opt/tljh/user/lib/python3.7/site-packages/deepC-0.13-py3.7-linux-x86_64.egg/deepC/include/operators/Cast.h:31:29: note: provided for ‘template<class T> class dnnc::Cast’
   31 | template <typename T> class Cast : public baseOperator<T, T, T> {
      |                             ^~~~
sarcasm_deepC/sarcasm.cpp:71:33: error: invalid conversion from ‘const char*’ to ‘int’ [-fpermissive]
   71 |   dnnc::Cast<float, float> Cast("Cast");
      |                                 ^~~~~~
      |                                 |
      |                                 const char*
sarcasm_deepC/sarcasm.cpp:73:8: error: request for member ‘setAttribute’ in ‘Cast’, which is of non-class type ‘int’
   73 |   Cast.setAttribute ( attr_to, Cast_to );
      |        ^~~~~~~~~~~~
sarcasm_deepC/sarcasm.cpp:74:41: error: request for member ‘compute’ in ‘Cast’, which is of non-class type ‘int’
   74 |   tensor<float> dnnc_Cast_casted = Cast.compute ( dnnc_embedding_input);
      |                                         ^~~~~~~
sarcasm_deepC/sarcasm.cpp:77:35: error: wrong number of template arguments (3, should be 2)
   77 |   dnnc::Gather<float, float, float> embedding("embedding");
      |                                   ^
In file included from sarcasm_deepC/sarcasm.cpp:22:
/opt/tljh/user/lib/python3.7/site-packages/deepC-0.13-py3.7-linux-x86_64.egg/deepC/include/operators/Gather.h:32:7: note: provided for ‘template<class To, class Ti> class dnnc::Gather’
   32 | class Gather : public baseOperator<To, To, Ti> {
      |       ^~~~~~
sarcasm_deepC/sarcasm.cpp:77:47: error: invalid conversion from ‘const char*’ to ‘int’ [-fpermissive]
   77 |   dnnc::Gather<float, float, float> embedding("embedding");
      |                                               ^~~~~~~~~~~
      |                                               |
      |                                               const char*
sarcasm_deepC/sarcasm.cpp:78:84: error: request for member ‘compute’ in ‘embedding’, which is of non-class type ‘int’
   78 |   tensor<float> dnnc_embedding_embedding_embedding_lookup_Identity_1_0 = embedding.compute ( dnnc_embedding_embeddings_0, dnnc_Cast_casted);
      |                                                                                    ^~~~~~~

[ERROR]
Couldn't create executable.

usage: deepCC [-h] [--output] [--format] [--verbose] [--profile ]
              [--app_tensors FILE] [--archive] [--bundle] [--debug]
              [--mem_override] [--optimize_peak_mem] [--init_net_model]
              [--input_data_type] [--input_shape] [--cc] [--cc_flags  [...]]
              [--board]
              input