Cainvas

Neural Machine Translation (English To French)

Credit: AITS Cainvas Community

Photo by Olivia G. Sutanto for Google on Dribbble

Language translation is a service needed by people across the globe. This notebook builds a neural machine translator that translates English to French using a seq2seq NLP model built on a bidirectional LSTM neural network.

The dataset used to train the model was taken from Kaggle.

Import Libraries

In [1]:
import nltk

# download nltk packages
nltk.download('punkt')

# download stopwords
nltk.download("stopwords")
[nltk_data] Downloading package punkt to /home/jupyter-
[nltk_data]     gunjan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Out[1]:
True
In [2]:
from collections import Counter
import operator
import plotly.express as px
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
import nltk
import re
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import gensim
from gensim.utils import simple_preprocess
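# note: this rebinds the name STOPWORDS to gensim's stop-word set,
# shadowing the STOPWORDS imported from wordcloud above
from gensim.parsing.preprocessing import STOPWORDS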
from tensorflow.keras.preprocessing.text import one_hot, Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, TimeDistributed, RepeatVector, Embedding, Input, LSTM, Conv1D, MaxPool1D, Bidirectional
from tensorflow.keras.models import Model

Import Datasets

In [3]:
# load the data: each line is one sentence that contains commas, so use a
# tab separator to keep the whole line in a single column
df_english = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/small_vocab_en.csv',
                         sep = '\t', names = ['english'], engine='python')
df_french = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/small_vocab_fr.csv',
                         sep = '\t', names = ['french'], engine='python')

Visualizing both the data frames

In [4]:
df_english
Out[4]:
english
0 new jersey is sometimes quiet during autumn , ...
1 the united states is usually chilly during jul...
2 california is usually quiet during march , and...
3 the united states is sometimes mild during jun...
4 your least liked fruit is the grape , but my l...
... ...
137855 france is never busy during march , and it is ...
137856 india is sometimes beautiful during spring , a...
137857 india is never wet during summer , but it is s...
137858 france is never chilly during january , but it...
137859 the orange is her favorite fruit , but the ban...

137860 rows × 1 columns

In [5]:
df_french
Out[5]:
french
0 new jersey est parfois calme pendant l' automn...
1 les états-unis est généralement froid en juill...
2 california est généralement calme en mars , et...
3 les états-unis est parfois légère en juin , et...
4 votre moins aimé fruit est le raisin , mais mo...
... ...
137855 la france est jamais occupée en mars , et il e...
137856 l' inde est parfois belle au printemps , et il...
137857 l' inde est jamais mouillé pendant l' été , ma...
137858 la france est jamais froid en janvier , mais i...
137859 l'orange est son fruit préféré , mais la banan...

137860 rows × 1 columns

In [6]:
df_english.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137860 entries, 0 to 137859
Data columns (total 1 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   english  137860 non-null  object
dtypes: object(1)
memory usage: 1.1+ MB
In [7]:
df_french.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137860 entries, 0 to 137859
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   french  137860 non-null  object
dtypes: object(1)
memory usage: 1.1+ MB

Concatenating the English and French DataFrames into a single DataFrame

In [8]:
df = pd.concat([df_english, df_french], axis = 1)
In [9]:
df
Out[9]:
english french
0 new jersey is sometimes quiet during autumn , ... new jersey est parfois calme pendant l' automn...
1 the united states is usually chilly during jul... les états-unis est généralement froid en juill...
2 california is usually quiet during march , and... california est généralement calme en mars , et...
3 the united states is sometimes mild during jun... les états-unis est parfois légère en juin , et...
4 your least liked fruit is the grape , but my l... votre moins aimé fruit est le raisin , mais mo...
... ... ...
137855 france is never busy during march , and it is ... la france est jamais occupée en mars , et il e...
137856 india is sometimes beautiful during spring , a... l' inde est parfois belle au printemps , et il...
137857 india is never wet during summer , but it is s... l' inde est jamais mouillé pendant l' été , ma...
137858 france is never chilly during january , but it... la france est jamais froid en janvier , mais i...
137859 the orange is her favorite fruit , but the ban... l'orange est son fruit préféré , mais la banan...

137860 rows × 2 columns

In [10]:
#printing total records
print('Total English records = {}'.format(len(df['english'])))
print('Total French records = {}'.format(len(df['french'])))
Total English records = 137860
Total French records = 137860

Performing Data Cleaning

In [12]:
# function to remove punctuations
def remove_punc(x):
    return re.sub('[!#?,.:";]', '', x)
In [13]:
df['french'] = df['french'].apply(remove_punc)
df['english'] = df['english'].apply(remove_punc)
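
To see what the character class in remove_punc does and does not strip, here is a minimal sketch on a single sentence. The sample sentence is made up in the style of the dataset rows and is not read from the CSV files.

```python
import re

def remove_punc(x):
    # same pattern as the cleaning cell above
    return re.sub('[!#?,.:";]', '', x)

# hypothetical sentence, written to resemble a dataset row
sample = "paris is never busy during march , but it's sometimes rainy ."

cleaned = remove_punc(sample)
tokens = cleaned.split()

print(repr(cleaned))  # the comma and full stop are gone, the apostrophe stays
print(tokens)
```

Because the punctuation marks in the data are surrounded by spaces, removing them leaves doubled or trailing spaces in the string. That is harmless here, since the later cells tokenize with split(), which collapses runs of whitespace. Apostrophes and hyphens are not in the character class, so tokens such as "it's" are kept intact.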
In [14]:
english_words = []
french_words  = []

Counting the unique words in the English and French vocabularies

In [15]:
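def get_unique_words(x, word_list):
    # add every word of sentence x that is not yet in word_list (modified in place)
    for word in x.split():
        if word not in word_list:
            word_list.append(word)

# number of unique words in english
df['english'].apply(lambda x: get_unique_words(x, english_words))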
total_english_words = len(english_words)
total_english_words
Out[15]:
199
In [16]:
# number of unique words in french
df['french'].apply(lambda x: get_unique_words(x, french_words))  
total_french_words = len(french_words)
total_french_words
Out[16]:
350
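
The helper above tests `word not in word_list` against a growing list, which is a linear scan for every word of every sentence. If only the number of unique words is needed, a set gives the same count with constant-time membership checks. The following is a minimal sketch under that assumption; `count_unique_words` and the sample sentences are hypothetical and not part of the notebook.

```python
def count_unique_words(sentences):
    """Return the number of distinct whitespace-separated tokens."""
    vocab = set()
    for sentence in sentences:
        # update() adds every token of the sentence; duplicates are ignored
        vocab.update(sentence.split())
    return len(vocab)

# hypothetical mini-corpus, not taken from the CSV files
sample = ["india is never wet", "france is never busy", "india is busy"]
unique_count = count_unique_words(sample)
print(unique_count)
```

The function accepts any iterable of strings, so a call such as count_unique_words(df['english']) should work on the cleaned column as well. Note that a set does not keep the first-seen order of the words, unlike the list built by get_unique_words.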

Visualize Cleaned-Up Dataset

In [17]:
# Obtain a list of all words in the English sentences
words = []
for i in df['english']:
    for word in i.split():
        words.append(word)
In [18]:
# Obtain the total count of words
english_words_counts = Counter(words)
In [19]:
# sort the (word, count) pairs by count, highest first; the result is a list of tuples
english_words_counts = sorted(english_words_counts.items(), key = operator.itemgetter(1), reverse = True)
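
The three steps above (collect the words, count them, sort by count) can also be written with a comprehension and Counter.most_common(), which returns the (word, count) pairs in descending order of count. This is a minimal sketch on a hypothetical mini-corpus rather than the notebook's DataFrame.

```python
from collections import Counter
import operator

# hypothetical mini-corpus, not taken from the CSV files
sentences = ["india is never wet", "france is never busy", "india is busy"]

# flatten the sentences into one list of words
words = [word for sentence in sentences for word in sentence.split()]

counts = Counter(words)

# most_common() with no argument returns all pairs, highest count first
ranked = counts.most_common()

# the explicit sort used in the notebook gives the same list
ranked_sorted = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)

print(ranked)
```

Both forms keep words with equal counts in the order they were first seen, so either can feed the plotting cells that follow.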
In [20]:
#finding out each word count in our data
english_words_counts
Out[20]:
[('is', 205858),
 ('in', 75525),
 ('it', 75137),
 ('during', 74933),
 ('the', 67628),
 ('but', 63987),
 ('and', 59850),
 ('sometimes', 37746),
 ('usually', 37507),
 ('never', 37500),
 ('favorite', 28332),
 ('least', 27564),
 ('fruit', 27192),
 ('most', 14934),
 ('loved', 14166),
 ('liked', 14046),
 ('new', 12197),
 ('paris', 11334),
 ('india', 11277),
 ('united', 11270),
 ('states', 11270),
 ('california', 11250),
 ('jersey', 11225),
 ('france', 11170),
 ('china', 10953),
 ('he', 10786),
 ('she', 10786),
 ('grapefruit', 10692),
 ('your', 9734),
 ('my', 9700),
 ('his', 9700),
 ('her', 9700),
 ('fall', 9134),
 ('june', 9133),
 ('spring', 9102),
 ('january', 9090),
 ('winter', 9038),
 ('march', 9023),
 ('autumn', 9004),
 ('may', 8995),
 ('nice', 8984),
 ('september', 8958),
 ('july', 8956),
 ('april', 8954),
 ('november', 8951),
 ('summer', 8948),
 ('december', 8945),
 ('february', 8942),
 ('our', 8932),
 ('their', 8932),
 ('freezing', 8928),
 ('pleasant', 8916),
 ('beautiful', 8915),
 ('october', 8910),
 ('snowy', 8898),
 ('warm', 8890),
 ('cold', 8878),
 ('wonderful', 8808),
 ('dry', 8794),
 ('busy', 8791),
 ('august', 8789),
 ('chilly', 8770),
 ('rainy', 8761),
 ('mild', 8743),
 ('wet', 8726),
 ('relaxing', 8696),
 ('quiet', 8693),
 ('hot', 8639),
 ('dislikes', 7314),
 ('likes', 7314),
 ('limes', 5844),
 ('lemons', 5844),
 ('grapes', 5844),
 ('mangoes', 5844),
 ('apples', 5844),
 ('peaches', 5844),
 ('oranges', 5844),
 ('pears', 5844),
 ('strawberries', 5844),
 ('bananas', 5844),
 ('to', 5166),
 ('grape', 4848),
 ('apple', 4848),
 ('orange', 4848),
 ('lemon', 4848),
 ('lime', 4848),
 ('banana', 4848),
 ('mango', 4848),
 ('pear', 4848),
 ('strawberry', 4848),
 ('peach', 4848),
 ('like', 4588),
 ('dislike', 4444),
 ('they', 3222),
 ('that', 2712),
 ('i', 2664),
 ('we', 2532),
 ('you', 2414),
 ('animal', 2304),
 ('a', 1944),
 ('truck', 1944),
 ('car', 1944),
 ('automobile', 1944),
 ('was', 1867),
 ('next', 1666),
 ('go', 1386),
 ('driving', 1296),
 ('visit', 1224),
 ('little', 1016),
 ('big', 1016),
 ('old', 972),
 ('yellow', 972),
 ('red', 972),
 ('rusty', 972),
 ('blue', 972),
 ('white', 972),
 ('black', 972),
 ('green', 972),
 ('shiny', 972),
 ('are', 870),
 ('last', 781),
 ('feared', 768),
 ('animals', 768),
 ('this', 768),
 ('plan', 714),
 ('going', 666),
 ('saw', 648),
 ('disliked', 648),
 ('drives', 648),
 ('drove', 648),
 ('between', 540),
 ('translate', 480),
 ('plans', 476),
 ('were', 384),
 ('went', 378),
 ('might', 378),
 ('wanted', 378),
 ('thinks', 360),
 ('spanish', 312),
 ('portuguese', 312),
 ('chinese', 312),
 ('english', 312),
 ('french', 312),
 ('translating', 300),
 ('difficult', 260),
 ('fun', 260),
 ('easy', 260),
 ('wants', 252),
 ('think', 240),
 ('why', 240),
 ("it's", 240),
 ('did', 204),
 ('cat', 192),
 ('shark', 192),
 ('bird', 192),
 ('mouse', 192),
 ('horse', 192),
 ('elephant', 192),
 ('dog', 192),
 ('monkey', 192),
 ('lion', 192),
 ('bear', 192),
 ('rabbit', 192),
 ('snake', 192),
 ('when', 144),
 ('want', 126),
 ('do', 84),
 ('how', 67),
 ('elephants', 64),
 ('horses', 64),
 ('dogs', 64),
 ('sharks', 64),
 ('snakes', 64),
 ('cats', 64),
 ('rabbits', 64),
 ('monkeys', 64),
 ('bears', 64),
 ('birds', 64),
 ('lions', 64),
 ('mice', 64),
 ("didn't", 60),
 ('eiffel', 57),
 ('tower', 57),
 ('grocery', 57),
 ('store', 57),
 ('football', 57),
 ('field', 57),
 ('lake', 57),
 ('school', 57),
 ('would', 48),
 ("aren't", 36),
 ('been', 36),
 ('weather', 33),
 ('does', 24),
 ('has', 24),
 ("isn't", 24),
 ('am', 24),
 ('where', 12),
 ('have', 12)]
In [21]:
# append the values to a list for visualization purposes
english_words = []
english_counts = []
for i in range(len(english_words_counts)):
    english_words.append(english_words_counts[i][0])
    english_counts.append(english_words_counts[i][1])
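
Splitting a list of (word, count) pairs into two parallel lists, as the loop above does, can also be done with zip(*pairs). A minimal sketch, using the first three entries of the output above as sample data; the names `top_words` and `top_counts` are illustrative.

```python
# first three (word, count) pairs from the sorted output above
pairs = [('is', 205858), ('in', 75525), ('it', 75137)]

# zip(*pairs) transposes the pairs into one tuple of words and one of counts
top_words, top_counts = zip(*pairs)

# convert to lists to match what the loop-based cell produces
top_words = list(top_words)
top_counts = list(top_counts)

print(top_words)
print(top_counts)
```

The unpacking on the left-hand side fails on an empty list of pairs, so the explicit loop is the safer choice when the input may be empty.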
In [22]:
# Plot barplot using plotly 
fig = px.bar(x = english_words, y = english_counts)
fig.show()