Cainvas

Question classification

Credit: AITS Cainvas Community

Photo by Mike Mirandi on Dribbble

Question classification means finding the intent of a question, i.e., the type of answer it expects.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import nltk
import re
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.metrics import confusion_matrix, f1_score
from tensorflow.keras import models, layers, optimizers, losses, callbacks

The dataset

On Kaggle by ARES

The dataset is a CSV file with questions and their corresponding categories and sub-categories.

In [2]:
df = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/Question_Classification_Dataset.csv')
df
Out[2]:
Unnamed: 0 Questions Category0 Category1 Category2
0 0 How did serfdom develop in and then leave Russ... DESCRIPTION DESC manner
1 1 What films featured the character Popeye Doyle ? ENTITY ENTY cremat
2 2 How can I find a list of celebrities ' real na... DESCRIPTION DESC manner
3 3 What fowl grabs the spotlight after the Chines... ENTITY ENTY animal
4 4 What is the full form of .com ? ABBREVIATION ABBR exp
... ... ... ... ... ...
5447 5447 What 's the shape of a camel 's spine ? ENTITY ENTY other
5448 5448 What type of currency is used in China ? ENTITY ENTY currency
5449 5449 What is the temperature today ? NUMERIC NUM temp
5450 5450 What is the temperature for cooking ? NUMERIC NUM temp
5451 5451 What currency is used in Australia ? ENTITY ENTY currency

5452 rows × 5 columns

Preprocessing

Dropping the redundant index column and the fine-grained sub-category labels; only the coarse category (Category0) is used as the target.

In [3]:
df = df.drop(columns = ['Unnamed: 0', 'Category1', 'Category2'])
df
Out[3]:
Questions Category0
0 How did serfdom develop in and then leave Russ... DESCRIPTION
1 What films featured the character Popeye Doyle ? ENTITY
2 How can I find a list of celebrities ' real na... DESCRIPTION
3 What fowl grabs the spotlight after the Chines... ENTITY
4 What is the full form of .com ? ABBREVIATION
... ... ...
5447 What 's the shape of a camel 's spine ? ENTITY
5448 What type of currency is used in China ? ENTITY
5449 What is the temperature today ? NUMERIC
5450 What is the temperature for cooking ? NUMERIC
5451 What currency is used in Australia ? ENTITY

5452 rows × 2 columns

Target labels

In [4]:
df['Category0'].value_counts()
Out[4]:
ENTITY          1250
HUMAN           1223
DESCRIPTION     1162
NUMERIC          896
LOCATION         835
ABBREVIATION      86
Name: Category0, dtype: int64

The dataset is imbalanced: ABBREVIATION has far fewer samples than the other classes. We will nevertheless go ahead with it as-is.
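Should the imbalance matter later, one common remedy is to weight the loss by inverse class frequency. A minimal sketch (not used in this notebook) with scikit-learn's `compute_class_weight`, using the counts from the `value_counts` output above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the value_counts output above
counts = {'ENTITY': 1250, 'HUMAN': 1223, 'DESCRIPTION': 1162,
          'NUMERIC': 896, 'LOCATION': 835, 'ABBREVIATION': 86}
labels = np.array([c for c, n in counts.items() for _ in range(n)])

classes = np.unique(labels)    # sorted alphabetically
weights = compute_class_weight(class_weight='balanced', classes=classes, y=labels)

# Keras accepts a dict like this as the class_weight argument to model.fit
class_weight = dict(enumerate(weights))
```

Rare classes get proportionally larger weights, so ABBREVIATION contributes the most per sample to the loss.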

One hot encoding

The labels are categorical with no ordinal relationship among them, and are therefore one-hot encoded.

In [5]:
y = pd.get_dummies(df['Category0'])
In [6]:
class_names = list(y.columns)

class_names
Out[6]:
['ABBREVIATION', 'DESCRIPTION', 'ENTITY', 'HUMAN', 'LOCATION', 'NUMERIC']
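For illustration, `pd.get_dummies` turns each label into a binary indicator column; a toy series (made up for this sketch) shows the shape of the result:

```python
import pandas as pd

sample = pd.Series(['ENTITY', 'NUMERIC', 'ENTITY'])
onehot = pd.get_dummies(sample)

# Columns are the sorted unique labels; each row has exactly one 1,
# in the column matching that row's label
```

This is why `list(y.columns)` above yields the class names in sorted order.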

Text cleaning

In [7]:
# Remove HTML tags
def removeHTML(sentence):
    regex = re.compile(r'<.*?>')
    return re.sub(regex, ' ', sentence)

# Remove URLs
def removeURL(sentence):
    regex = re.compile(r'http[s]?://\S+')
    return re.sub(regex, ' ', sentence)

# Remove numbers, punctuation and any special characters (keep only letters)
def onlyAlphabets(sentence):
    regex = re.compile(r'[^a-zA-Z]')
    return re.sub(regex, ' ', sentence)
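As a quick sanity check, the cleaners can be chained on a sample sentence (the sample string and this standalone sketch are made up for illustration; raw strings avoid escape-sequence warnings in the patterns):

```python
import re

# Same cleaners as above
def removeHTML(sentence):
    return re.sub(r'<.*?>', ' ', sentence)

def removeURL(sentence):
    return re.sub(r'http[s]?://\S+', ' ', sentence)

def onlyAlphabets(sentence):
    return re.sub(r'[^a-zA-Z]', ' ', sentence)

# A made-up question with a URL and an HTML tag
sample = "What is the full form of <b>.com</b>? See https://example.com"
cleaned = onlyAlphabets(removeHTML(removeURL(sample))).lower()
# Only lowercase letters and spaces remain
```

The order matters: URLs must be stripped before `onlyAlphabets`, which would otherwise break them into stray word fragments.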
In [8]:
sno = nltk.stem.SnowballStemmer('english')    # Initializing stemmer
wordcloud = [[] for _ in class_names]    # One word list per class, for the word clouds below
all_sentences = []    # All cleaned sentences


for question, classname in zip(df['Questions'].values, df['Category0'].values):
    cleaned_sentence = []
    sentence = removeURL(question)
    sentence = removeHTML(sentence)
    sentence = onlyAlphabets(sentence)
    sentence = sentence.lower()

    for word in sentence.split():
        stemmed = sno.stem(word)
        cleaned_sentence.append(stemmed)

        wordcloud[class_names.index(classname)].append(word)

    all_sentences.append(' '.join(cleaned_sentence))

# The cleaned, stemmed sentences form the model input
X = all_sentences
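The Snowball stemmer used above collapses inflected forms to a shared stem, which shrinks the vocabulary the model has to learn. A few examples (word list made up for illustration):

```python
import nltk

sno = nltk.stem.SnowballStemmer('english')

# Plurals and verb forms reduce to common stems
stems = [sno.stem(w) for w in ['films', 'questions', 'running', 'featured']]
```

Note that stems need not be dictionary words; they only need to be consistent across inflections of the same word.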

Visualization

In [9]:
plt.figure(figsize=(40,40))

for i in range(len(class_names)):
    ax = plt.subplot(len(class_names), 1, i + 1)
    plt.imshow(WordCloud().generate(' '.join(wordcloud[i])))
    plt.title(class_names[i])
    plt.axis("off")