Cainvas

Neural Machine Translation (English To French)

Credit: AITS Cainvas Community

Photo by Olivia G. Sutanto for Google on Dribbble

Language translation is a key service needed by people across the globe. This notebook builds a Neural Machine Translator that translates English to French using a seq2seq (encoder-decoder) NLP model built with LSTM layers.

The dataset used to train the model was taken from Kaggle.

Import Libraries

In [1]:
import nltk

# download nltk packages
nltk.download('punkt')

# download stopwords
nltk.download("stopwords")
[nltk_data] Downloading package punkt to /home/jupyter-
[nltk_data]     gunjan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Out[1]:
True
In [2]:
from collections import Counter
import operator
import plotly.express as px
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
import nltk
import re
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from tensorflow.keras.preprocessing.text import one_hot, Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, TimeDistributed, RepeatVector, Embedding, Input, LSTM, Conv1D, MaxPool1D, Bidirectional
from tensorflow.keras.models import Model

Import Datasets

In [3]:
# load the data (plain-text files, one sentence per line; the tab separator never
# occurs in them, so each full line is read into a single column)
df_english = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/small_vocab_en.csv',
                         sep = '\t', names = ['english'], engine='python')
df_french = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/small_vocab_fr.csv',
                         sep = '\t', names = ['french'], engine='python')

Visualizing both dataframes

In [4]:
df_english
Out[4]:
english
0 new jersey is sometimes quiet during autumn , ...
1 the united states is usually chilly during jul...
2 california is usually quiet during march , and...
3 the united states is sometimes mild during jun...
4 your least liked fruit is the grape , but my l...
... ...
137855 france is never busy during march , and it is ...
137856 india is sometimes beautiful during spring , a...
137857 india is never wet during summer , but it is s...
137858 france is never chilly during january , but it...
137859 the orange is her favorite fruit , but the ban...

137860 rows × 1 columns

In [5]:
df_french
Out[5]:
french
0 new jersey est parfois calme pendant l' automn...
1 les états-unis est généralement froid en juill...
2 california est généralement calme en mars , et...
3 les états-unis est parfois légère en juin , et...
4 votre moins aimé fruit est le raisin , mais mo...
... ...
137855 la france est jamais occupée en mars , et il e...
137856 l' inde est parfois belle au printemps , et il...
137857 l' inde est jamais mouillé pendant l' été , ma...
137858 la france est jamais froid en janvier , mais i...
137859 l'orange est son fruit préféré , mais la banan...

137860 rows × 1 columns

In [6]:
df_english.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137860 entries, 0 to 137859
Data columns (total 1 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   english  137860 non-null  object
dtypes: object(1)
memory usage: 1.1+ MB
In [7]:
df_french.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137860 entries, 0 to 137859
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   french  137860 non-null  object
dtypes: object(1)
memory usage: 1.1+ MB

Concatenating the English and French dataframes into a single DataFrame

In [8]:
df = pd.concat([df_english, df_french], axis = 1)
In [9]:
df
Out[9]:
english french
0 new jersey is sometimes quiet during autumn , ... new jersey est parfois calme pendant l' automn...
1 the united states is usually chilly during jul... les états-unis est généralement froid en juill...
2 california is usually quiet during march , and... california est généralement calme en mars , et...
3 the united states is sometimes mild during jun... les états-unis est parfois légère en juin , et...
4 your least liked fruit is the grape , but my l... votre moins aimé fruit est le raisin , mais mo...
... ... ...
137855 france is never busy during march , and it is ... la france est jamais occupée en mars , et il e...
137856 india is sometimes beautiful during spring , a... l' inde est parfois belle au printemps , et il...
137857 india is never wet during summer , but it is s... l' inde est jamais mouillé pendant l' été , ma...
137858 france is never chilly during january , but it... la france est jamais froid en janvier , mais i...
137859 the orange is her favorite fruit , but the ban... l'orange est son fruit préféré , mais la banan...

137860 rows × 2 columns

In [10]:
#printing total records
print('Total English records = {}'.format(len(df['english'])))
print('Total French records = {}'.format(len(df['french'])))
Total English records = 137860
Total French records = 137860

Performing Data Cleaning

In [12]:
# function to remove punctuations
def remove_punc(x):
    return re.sub('[!#?,.:";]', '', x)
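A quick sanity check (an illustrative call, not from the original run) shows which characters the helper strips:

# punctuation characters in the regex class are removed; surrounding spaces remain
remove_punc("new jersey is sometimes quiet during autumn , and it is snowy in april .")
# -> 'new jersey is sometimes quiet during autumn  and it is snowy in april '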
In [13]:
df['french'] = df['french'].apply(remove_punc)
df['english'] = df['english'].apply(remove_punc)
In [14]:
english_words = []
french_words  = []

Finding the total number of unique words in our English and French vocabularies

In [15]:
def get_unique_words(x, word_list):
    for word in x.split():
        if word not in word_list:
            word_list.append(word)
            
df['english'].apply(lambda x: get_unique_words(x, english_words))    
total_english_words = len(english_words)
total_english_words
Out[15]:
199
In [16]:
# number of unique words in french
df['french'].apply(lambda x: get_unique_words(x, french_words))  
total_french_words = len(french_words)
total_french_words
Out[16]:
350
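The list-based membership test above is O(n) per word; a set-based version (an optional alternative sketch, not part of the original run) computes the same counts in roughly linear time:

# set membership is O(1), so this scales much better on large corpora
unique_english = {word for sent in df['english'] for word in sent.split()}
unique_french  = {word for sent in df['french'] for word in sent.split()}
print(len(unique_english), len(unique_french))   # 199 350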

VISUALIZE CLEANED UP DATASET

In [17]:
# Obtain list of all words in the dataset
words = []
for i in df['english']:
    for word in i.split():
        words.append(word)
    
In [18]:
# Obtain the total count of words
english_words_counts = Counter(words)
In [19]:
# sort the dictionary by values
english_words_counts = sorted(english_words_counts.items(), key = operator.itemgetter(1), reverse = True)
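For reference, collections.Counter can produce the same descending-order list directly, without operator.itemgetter:

# equivalent one-liner: most_common() returns (word, count) pairs sorted by count
english_words_counts = Counter(words).most_common()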
In [20]:
# print each word's count in our data
english_words_counts
Out[20]:
[('is', 205858),
 ('in', 75525),
 ('it', 75137),
 ('during', 74933),
 ('the', 67628),
 ('but', 63987),
 ('and', 59850),
 ('sometimes', 37746),
 ('usually', 37507),
 ('never', 37500),
 ('favorite', 28332),
 ('least', 27564),
 ('fruit', 27192),
 ('most', 14934),
 ('loved', 14166),
 ('liked', 14046),
 ('new', 12197),
 ('paris', 11334),
 ('india', 11277),
 ('united', 11270),
 ('states', 11270),
 ('california', 11250),
 ('jersey', 11225),
 ('france', 11170),
 ('china', 10953),
 ('he', 10786),
 ('she', 10786),
 ('grapefruit', 10692),
 ('your', 9734),
 ('my', 9700),
 ('his', 9700),
 ('her', 9700),
 ('fall', 9134),
 ('june', 9133),
 ('spring', 9102),
 ('january', 9090),
 ('winter', 9038),
 ('march', 9023),
 ('autumn', 9004),
 ('may', 8995),
 ('nice', 8984),
 ('september', 8958),
 ('july', 8956),
 ('april', 8954),
 ('november', 8951),
 ('summer', 8948),
 ('december', 8945),
 ('february', 8942),
 ('our', 8932),
 ('their', 8932),
 ('freezing', 8928),
 ('pleasant', 8916),
 ('beautiful', 8915),
 ('october', 8910),
 ('snowy', 8898),
 ('warm', 8890),
 ('cold', 8878),
 ('wonderful', 8808),
 ('dry', 8794),
 ('busy', 8791),
 ('august', 8789),
 ('chilly', 8770),
 ('rainy', 8761),
 ('mild', 8743),
 ('wet', 8726),
 ('relaxing', 8696),
 ('quiet', 8693),
 ('hot', 8639),
 ('dislikes', 7314),
 ('likes', 7314),
 ('limes', 5844),
 ('lemons', 5844),
 ('grapes', 5844),
 ('mangoes', 5844),
 ('apples', 5844),
 ('peaches', 5844),
 ('oranges', 5844),
 ('pears', 5844),
 ('strawberries', 5844),
 ('bananas', 5844),
 ('to', 5166),
 ('grape', 4848),
 ('apple', 4848),
 ('orange', 4848),
 ('lemon', 4848),
 ('lime', 4848),
 ('banana', 4848),
 ('mango', 4848),
 ('pear', 4848),
 ('strawberry', 4848),
 ('peach', 4848),
 ('like', 4588),
 ('dislike', 4444),
 ('they', 3222),
 ('that', 2712),
 ('i', 2664),
 ('we', 2532),
 ('you', 2414),
 ('animal', 2304),
 ('a', 1944),
 ('truck', 1944),
 ('car', 1944),
 ('automobile', 1944),
 ('was', 1867),
 ('next', 1666),
 ('go', 1386),
 ('driving', 1296),
 ('visit', 1224),
 ('little', 1016),
 ('big', 1016),
 ('old', 972),
 ('yellow', 972),
 ('red', 972),
 ('rusty', 972),
 ('blue', 972),
 ('white', 972),
 ('black', 972),
 ('green', 972),
 ('shiny', 972),
 ('are', 870),
 ('last', 781),
 ('feared', 768),
 ('animals', 768),
 ('this', 768),
 ('plan', 714),
 ('going', 666),
 ('saw', 648),
 ('disliked', 648),
 ('drives', 648),
 ('drove', 648),
 ('between', 540),
 ('translate', 480),
 ('plans', 476),
 ('were', 384),
 ('went', 378),
 ('might', 378),
 ('wanted', 378),
 ('thinks', 360),
 ('spanish', 312),
 ('portuguese', 312),
 ('chinese', 312),
 ('english', 312),
 ('french', 312),
 ('translating', 300),
 ('difficult', 260),
 ('fun', 260),
 ('easy', 260),
 ('wants', 252),
 ('think', 240),
 ('why', 240),
 ("it's", 240),
 ('did', 204),
 ('cat', 192),
 ('shark', 192),
 ('bird', 192),
 ('mouse', 192),
 ('horse', 192),
 ('elephant', 192),
 ('dog', 192),
 ('monkey', 192),
 ('lion', 192),
 ('bear', 192),
 ('rabbit', 192),
 ('snake', 192),
 ('when', 144),
 ('want', 126),
 ('do', 84),
 ('how', 67),
 ('elephants', 64),
 ('horses', 64),
 ('dogs', 64),
 ('sharks', 64),
 ('snakes', 64),
 ('cats', 64),
 ('rabbits', 64),
 ('monkeys', 64),
 ('bears', 64),
 ('birds', 64),
 ('lions', 64),
 ('mice', 64),
 ("didn't", 60),
 ('eiffel', 57),
 ('tower', 57),
 ('grocery', 57),
 ('store', 57),
 ('football', 57),
 ('field', 57),
 ('lake', 57),
 ('school', 57),
 ('would', 48),
 ("aren't", 36),
 ('been', 36),
 ('weather', 33),
 ('does', 24),
 ('has', 24),
 ("isn't", 24),
 ('am', 24),
 ('where', 12),
 ('have', 12)]
In [21]:
# append the values to a list for visualization purposes
english_words = []
english_counts = []
for i in range(len(english_words_counts)):
    english_words.append(english_words_counts[i][0])
    english_counts.append(english_words_counts[i][1])
In [22]:
# Plot barplot using plotly 
fig = px.bar(x = english_words, y = english_counts)
fig.show()
In [23]:
# plot the word cloud for the English text
plt.figure(figsize = (20,20)) 
wc = WordCloud(max_words = 2000, width = 1600, height = 800 ).generate(" ".join(df.english))
plt.imshow(wc, interpolation = 'bilinear')
Out[23]:
<matplotlib.image.AxesImage at 0x7f6658e342e8>
In [24]:
#Tokenized form of first record
df.english[0]
nltk.word_tokenize(df.english[0])
Out[24]:
['new',
 'jersey',
 'is',
 'sometimes',
 'quiet',
 'during',
 'autumn',
 'and',
 'it',
 'is',
 'snowy',
 'in',
 'april']
In [25]:
# Maximum length (number of words) per record. We will need it later for embeddings
maxlen_english = -1
for doc in df.english:
    tokens = nltk.word_tokenize(doc)
    if(maxlen_english < len(tokens)):
        maxlen_english = len(tokens)
print("The maximum number of words in any record = ", maxlen_english)
The maximum number of words in any record =  15
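The same maximum can be computed as a one-liner (an equivalent alternative, shown for reference):

maxlen_english = max(len(nltk.word_tokenize(doc)) for doc in df.english)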

Doing similar operations on French data

In [26]:
words = []
for i in df['french']:
    for word in i.split():
        words.append(word)
In [27]:
french_words_counts = Counter(words)
In [28]:
# sort the dictionary by values and print it
french_words_counts = sorted(french_words_counts.items(), key = operator.itemgetter(1), reverse = True)

french_words_counts
Out[28]:
[('est', 196809),
 ('en', 105768),
 ('il', 84079),
 ('les', 65255),
 ('mais', 63987),
 ('et', 59851),
 ('la', 49861),
 ('parfois', 37746),
 ('jamais', 37215),
 ('le', 35306),
 ("l'", 32917),
 ('généralement', 31292),
 ('moins', 27557),
 ('aimé', 25852),
 ('au', 25738),
 ('fruit', 23626),
 ('préféré', 23305),
 ('agréable', 17751),
 ('froid', 16794),
 ('son', 16496),
 ('chaud', 16405),
 ('de', 15070),
 ('plus', 14934),
 ('automne', 14727),
 ('mois', 14350),
 ('à', 13870),
 ('elle', 12056),
 ('citrons', 11679),
 ('paris', 11334),
 ('inde', 11277),
 ('états-unis', 11210),
 ('france', 11170),
 ('jersey', 11052),
 ('new', 11047),
 ('chine', 10936),
 ('pendant', 10741),
 ('pamplemousse', 10140),
 ('mon', 9403),
 ('votre', 9368),
 ('juin', 9133),
 ('printemps', 9100),
 ('janvier', 9090),
 ('hiver', 9038),
 ('mars', 9023),
 ('été', 8999),
 ('mai', 8995),
 ('septembre', 8958),
 ('juillet', 8956),
 ('avril', 8954),
 ('novembre', 8951),
 ('décembre', 8945),
 ('février', 8942),
 ('octobre', 8911),
 ('aime', 8870),
 ('août', 8789),
 ('merveilleux', 8704),
 ('relaxant', 8458),
 ('doux', 8458),
 ('humide', 8446),
 ('notre', 8319),
 ('californie', 8189),
 ('sec', 7957),
 ('leur', 7855),
 ('occupé', 7782),
 ('pluvieux', 7658),
 ('calme', 7256),
 ('beau', 6387),
 ('habituellement', 6215),
 ('pommes', 5844),
 ('pêches', 5844),
 ('oranges', 5844),
 ('poires', 5844),
 ('fraises', 5844),
 ('bananes', 5844),
 ('verts', 5835),
 ('raisins', 5780),
 ('mangues', 5774),
 ("d'", 5100),
 ('mangue', 4899),
 ('gel', 4886),
 ('raisin', 4852),
 ('pomme', 4848),
 ("l'orange", 4848),
 ('citron', 4848),
 ('chaux', 4848),
 ('banane', 4848),
 ('poire', 4848),
 ('fraise', 4848),
 ('pêche', 4848),
 ('pas', 4495),
 ('enneigée', 4008),
 ('favori', 3857),
 ('déteste', 3743),
 ('gèle', 3622),
 ('fruits', 3566),
 ('voiture', 3510),
 ("l'automne", 3411),
 ('ils', 3185),
 ("n'aime", 3131),
 ('california', 3061),
 ('neige', 3016),
 ('fait', 2916),
 ('belle', 2726),
 ('ne', 2715),
 ('nous', 2520),
 ('vous', 2517),
 ('des', 2435),
 ('animal', 2248),
 ('camion', 1944),
 ('cours', 1927),
 ('neigeux', 1867),
 ('conduit', 1706),
 ('prochain', 1666),
 ('je', 1548),
 ('ce', 1465),
 ('tranquille', 1437),
 ('a', 1356),
 ('cher', 1308),
 ('une', 1278),
 ('cette', 1239),
 ('était', 1198),
 ('aller', 1180),
 ('chaude', 1124),
 ('aiment', 1116),
 ('aimons', 1111),
 ("n'aiment", 1111),
 ("n'aimez", 1094),
 ('leurs', 1072),
 ('aimez', 1053),
 ('sont', 1018),
 ('détestons', 1001),
 ('jaune', 972),
 ('rouge', 972),
 ("j'aime", 966),
 ('visiter', 908),
 ('sèche', 837),
 ('occupée', 836),
 ('frisquet', 834),
 ('préférée', 770),
 ('animaux', 768),
 ('dernier', 757),
 ('aimait', 707),
 ('un', 698),
 ('conduisait', 673),
 ('que', 667),
 ('nouvelle', 648),
 ('vieille', 647),
 ('vu', 645),
 ('verte', 628),
 ('petite', 615),
 ('nos', 613),
 ('noire', 602),
 ('brillant', 587),
 ('blanche', 579),
 ('redouté', 576),
 ('pleut', 562),
 ("n'aimait", 561),
 ('pamplemousses', 552),
 ('pense', 540),
 ('entre', 540),
 ('bleue', 504),
 ('nouveau', 502),
 ('traduire', 501),
 ('rouillée', 486),
 ('bleu', 468),
 ('se', 461),
 ('grande', 459),
 ('rouillé', 454),
 ('ses', 402),
 ("qu'il", 393),
 ('blanc', 393),
 ('aux', 392),
 ('brillante', 385),
 ('préférés', 383),
 ('noir', 370),
 ('pluies', 367),
 ('envisage', 360),
 ('étaient', 357),
 ('va', 355),
 ('rendre', 350),
 ('vert', 344),
 ('-', 328),
 ('vieux', 325),
 ('petit', 324),
 ('espagnol', 312),
 ('portugais', 312),
 ('chinois', 312),
 ('anglais', 312),
 ('français', 312),
 ('glaciales', 307),
 ('mes', 297),
 ('cet', 286),
 ('automobile', 278),
 ('traduction', 277),
 ('mouillé', 273),
 ('difficile', 260),
 ('amusant', 260),
 ('facile', 260),
 ('comme', 259),
 ('gros', 258),
 ('souris', 256),
 ('pourrait', 252),
 ('voulait', 252),
 ('veut', 252),
 ('pourquoi', 240),
 ('aimés', 237),
 ('prévois', 233),
 ('prévoyons', 232),
 ('vos', 225),
 ('intention', 206),
 ('clémentes', 200),
 ('ont', 194),
 ('chat', 192),
 ('requin', 192),
 ('cheval', 192),
 ('chien', 192),
 ('singe', 192),
 ('lion', 192),
 ('ours', 192),
 ('lapin', 192),
 ('serpent', 192),
 ('redoutés', 190),
 ('allé', 187),
 ('grosse', 185),
 ('pluie', 174),
 ('trop', 173),
 ('monde', 173),
 ('maillot', 173),
 ('vont', 168),
 ('volant', 165),
 ('avez', 162),
 ('i', 150),
 ('allés', 150),
 ('allée', 150),
 ('quand', 144),
 ('oiseau', 128),
 ('éléphant', 128),
 ('pourraient', 126),
 ('voulaient', 126),
 ('veulent', 126),
 ('détendre', 111),
 ('aimée', 105),
 ('magnifique', 104),
 ("l'automobile", 100),
 ("n'aimons", 97),
 ('-ce', 95),
 ('gelé', 94),
 ('détestait', 87),
 ('grand', 81),
 ('bien', 77),
 ('vers', 76),
 ('prévoient', 75),
 ('prévoit', 75),
 ('lui', 70),
 ('visite', 68),
 ('comment', 67),
 ('éléphants', 64),
 ('chevaux', 64),
 ('chiens', 64),
 ("l'éléphant", 64),
 ("l'oiseau", 64),
 ('requins', 64),
 ("l'ours", 64),
 ('serpents', 64),
 ('chats', 64),
 ('lapins', 64),
 ('singes', 64),
 ('oiseaux', 64),
 ('lions', 64),
 ('légère', 63),
 ('cépage', 60),
 ('pensez', 60),
 ('États-unis', 57),
 ('tour', 57),
 ('eiffel', 57),
 ("l'épicerie", 57),
 ('terrain', 57),
 ('football', 57),
 ('lac', 57),
 ("l'école", 57),
 ("l'animal", 56),
 ("n'est", 47),
 ('allons', 45),
 ('allez', 45),
 ('peu', 41),
 ('pousse', 41),
 ('du', 39),
 ('-il', 36),
 ('temps', 33),
 ('at', 32),
 ('rouille', 32),
 ('sur', 28),
 ("qu'elle", 26),
 ('-ils', 26),
 ('petites', 26),
 ('-elle', 24),
 ('dernière', 24),
 ('êtes-vous', 24),
 ('vais', 24),
 ('voudrait', 24),
 ('proches', 20),
 ('frais', 20),
 ('manguiers', 19),
 ('avons', 19),
 ('t', 18),
 ('porcelaine', 17),
 ('détestez', 17),
 ("c'est", 17),
 ('grandes', 16),
 ('préférées', 16),
 ('douce', 14),
 ('durant', 14),
 ('congélation', 14),
 ('plaît', 13),
 ('où', 12),
 ('dans', 12),
 ('est-ce', 12),
 ('voulez', 12),
 ('aimeraient', 12),
 ("n'a", 12),
 ('petits', 10),
 ('aiment-ils', 10),
 ('grands', 9),
 ('limes', 9),
 ('envisagent', 9),
 ('grosses', 8),
 ('bénigne', 8),
 ('mouillée', 7),
 ('enneigé', 7),
 ('moindres', 7),
 ('conduite', 6),
 ('gelés', 5),
 ('tout', 4),
 ('etats-unis', 3),
 ("n'êtes", 3),
 ('vit', 3),
 ('ressort', 2),
 ('détend', 2),
 ('redoutée', 2),
 ('qui', 2),
 ('traduis', 2),
 ('apprécié', 2),
 ('allions', 1),
 ('trouvé', 1),
 ('as-tu', 1),
 ('faire', 1),
 ('favoris', 1),
 ('souvent', 1),
 ('es-tu', 1),
 ('moteur', 1)]
In [29]:
# append the values to a list for visualization purposes
french_words = []
french_counts = []
for i in range(len(french_words_counts)):
    french_words.append(french_words_counts[i][0])
    french_counts.append(french_words_counts[i][1])

fig = px.bar(x = french_words, y = french_counts)
fig.show()
In [30]:
# plot the word cloud for French
plt.figure(figsize = (20,20)) 
wc = WordCloud(max_words = 2000 , width = 1600 , height = 800).generate(" ".join(df.french))
plt.imshow(wc, interpolation = 'bilinear')
Out[30]:
<matplotlib.image.AxesImage at 0x7f6643bb8898>
In [31]:
# Maximum length (number of words) per record. We will need it later for embeddings
maxlen_french = -1
for doc in df.french:
    tokens = nltk.word_tokenize(doc)
    if(maxlen_french < len(tokens)):
        maxlen_french = len(tokens)
print("The maximum number of words in any record = ", maxlen_french)
The maximum number of words in any record =  23

Preparing the Data by Performing Tokenization and Padding

In [32]:
def tokenize_and_pad(x, maxlen):
    # build a tokenizer, convert the sentences to integer sequences, then pad to maxlen
    tokenizer = Tokenizer(char_level = False)
    tokenizer.fit_on_texts(x)
    sequences = tokenizer.texts_to_sequences(x)
    padded = pad_sequences(sequences, maxlen = maxlen, padding = 'post')
    return tokenizer, sequences, padded
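To see what the helper returns, here is a toy call on a made-up two-sentence corpus (an assumed illustration, not from the original run):

toy_tok, toy_seq, toy_pad = tokenize_and_pad(["the cat sat", "the cat"], maxlen = 4)
print(toy_tok.word_index)   # e.g. {'the': 1, 'cat': 2, 'sat': 3}
print(toy_pad)              # [[1 2 3 0]
                            #  [1 2 0 0]]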
In [33]:
# tokenize and pad the data
x_tokenizer, x_sequences, x_padded = tokenize_and_pad(df.english, maxlen_english)
y_tokenizer, y_sequences, y_padded = tokenize_and_pad(df.french,  maxlen_french)
In [34]:
# Total vocab size; add 1 for the padding token (index 0)
english_vocab_size = total_english_words + 1
print("Complete English Vocab Size:", english_vocab_size)
Complete English Vocab Size: 200
In [35]:
# Total vocab size; add 1 for the padding token (index 0)
french_vocab_size = total_french_words + 1
print("Complete French Vocab Size:", french_vocab_size)
Complete French Vocab Size: 351
In [36]:
print("The tokenized version for document\n", df.english[-1:].item(),"\n is : ", x_padded[-1:])
The tokenized version for document
 the orange is her favorite fruit  but the banana is your favorite  
 is :  [[ 5 84  1 32 11 13  6  5 87  1 29 11  0  0  0]]
In [37]:
print("The tokenized version for document\n", df.french[-1:].item(),"\n is : ", y_padded[-1:])
The tokenized version for document
 l'orange est son fruit préféré  mais la banane est votre favori  
 is :  [[84  1 20 16 17  5  7 87  1 40 93  0  0  0  0  0  0  0  0  0  0  0  0]]
In [38]:
# function to obtain the text from padded variables
def pad_to_text(padded, tokenizer):

    id_to_word = {id: word for word, id in tokenizer.word_index.items()}
    id_to_word[0] = ''

    return ' '.join([id_to_word[j] for j in padded])
In [39]:
# Obtaining the actual text back in its original form
pad_to_text(y_padded[0], y_tokenizer)
Out[39]:
"new jersey est parfois calme pendant l' automne et il est neigeux en avril         "

Defining the model

In [40]:
# Train test split
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_padded, y_padded, test_size = 0.1)
In [41]:
# Sequential Model
model = Sequential()
# embedding layer
model.add(Embedding(english_vocab_size, 256, input_length = maxlen_english, mask_zero = True))
# encoder
model.add(LSTM(256))
# decoder
# repeatvector repeats the input for the desired number of times to change
# 2D-array to 3D array. For example: (1,256) to (1,23,256)
model.add(RepeatVector(maxlen_french))
model.add(LSTM(256, return_sequences= True ))
model.add(TimeDistributed(Dense(french_vocab_size, activation ='softmax')))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 15, 256)           51200     
_________________________________________________________________
lstm (LSTM)                  (None, 256)               525312    
_________________________________________________________________
repeat_vector (RepeatVector) (None, 23, 256)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 23, 256)           525312    
_________________________________________________________________
time_distributed (TimeDistri (None, 23, 351)           90207     
=================================================================
Total params: 1,192,031
Trainable params: 1,192,031
Non-trainable params: 0
_________________________________________________________________
In [42]:
# expand targets from (samples, timesteps) to (samples, timesteps, 1)
# to match the per-timestep softmax output
y_train = np.expand_dims(y_train, axis = 2)
y_train.shape
Out[42]:
(124074, 23, 1)
In [43]:
# train the model
history = model.fit(x_train, y_train, batch_size=1024, validation_split= 0.1, epochs=25)
Epoch 1/25
110/110 [==============================] - 10s 94ms/step - loss: 2.7059 - accuracy: 0.4973 - val_loss: 2.1038 - val_accuracy: 0.5381
Epoch 2/25
110/110 [==============================] - 9s 83ms/step - loss: 1.8692 - accuracy: 0.5757 - val_loss: 1.5988 - val_accuracy: 0.6029
Epoch 3/25
110/110 [==============================] - 9s 83ms/step - loss: 1.4723 - accuracy: 0.6228 - val_loss: 1.3504 - val_accuracy: 0.6453
Epoch 4/25
110/110 [==============================] - 9s 84ms/step - loss: 1.2523 - accuracy: 0.6597 - val_loss: 1.1983 - val_accuracy: 0.6677
Epoch 5/25
110/110 [==============================] - 9s 86ms/step - loss: 1.0952 - accuracy: 0.6908 - val_loss: 1.0297 - val_accuracy: 0.7072
Epoch 6/25
110/110 [==============================] - 9s 84ms/step - loss: 0.9589 - accuracy: 0.7259 - val_loss: 0.9238 - val_accuracy: 0.7331
Epoch 7/25
110/110 [==============================] - 9s 85ms/step - loss: 0.8685 - accuracy: 0.7486 - val_loss: 0.8167 - val_accuracy: 0.7641
Epoch 8/25
110/110 [==============================] - 9s 85ms/step - loss: 0.7712 - accuracy: 0.7762 - val_loss: 0.7629 - val_accuracy: 0.7775
Epoch 9/25
110/110 [==============================] - 9s 86ms/step - loss: 0.7004 - accuracy: 0.7966 - val_loss: 0.7070 - val_accuracy: 0.7928
Epoch 10/25
110/110 [==============================] - 10s 87ms/step - loss: 0.6434 - accuracy: 0.8130 - val_loss: 0.6110 - val_accuracy: 0.8238
Epoch 11/25
110/110 [==============================] - 10s 89ms/step - loss: 0.5683 - accuracy: 0.8373 - val_loss: 0.5463 - val_accuracy: 0.8430
Epoch 12/25
110/110 [==============================] - 10s 89ms/step - loss: 0.5107 - accuracy: 0.8538 - val_loss: 0.5201 - val_accuracy: 0.8494
Epoch 13/25
110/110 [==============================] - 10s 90ms/step - loss: 0.4632 - accuracy: 0.8671 - val_loss: 0.4746 - val_accuracy: 0.8606
Epoch 14/25
110/110 [==============================] - 10s 91ms/step - loss: 0.4207 - accuracy: 0.8795 - val_loss: 0.4399 - val_accuracy: 0.8725
Epoch 15/25
110/110 [==============================] - 10s 91ms/step - loss: 0.3721 - accuracy: 0.8948 - val_loss: 0.3738 - val_accuracy: 0.8945
Epoch 16/25
110/110 [==============================] - 10s 91ms/step - loss: 0.3371 - accuracy: 0.9055 - val_loss: 0.3264 - val_accuracy: 0.9095
Epoch 17/25
110/110 [==============================] - 10s 92ms/step - loss: 0.2945 - accuracy: 0.9195 - val_loss: 0.3386 - val_accuracy: 0.9069
Epoch 18/25
110/110 [==============================] - 10s 92ms/step - loss: 0.2636 - accuracy: 0.9293 - val_loss: 0.2985 - val_accuracy: 0.9171
Epoch 19/25
110/110 [==============================] - 10s 92ms/step - loss: 0.2376 - accuracy: 0.9365 - val_loss: 0.2775 - val_accuracy: 0.9211
Epoch 20/25
110/110 [==============================] - 10s 91ms/step - loss: 0.2111 - accuracy: 0.9450 - val_loss: 0.2172 - val_accuracy: 0.9427
Epoch 21/25
110/110 [==============================] - 10s 91ms/step - loss: 0.1887 - accuracy: 0.9514 - val_loss: 0.2720 - val_accuracy: 0.9207
Epoch 22/25
110/110 [==============================] - 10s 90ms/step - loss: 0.1786 - accuracy: 0.9531 - val_loss: 0.1717 - val_accuracy: 0.9557
Epoch 23/25
110/110 [==============================] - 10s 91ms/step - loss: 0.1559 - accuracy: 0.9603 - val_loss: 0.1566 - val_accuracy: 0.9592
Epoch 24/25
110/110 [==============================] - 10s 91ms/step - loss: 0.1438 - accuracy: 0.9632 - val_loss: 0.1545 - val_accuracy: 0.9591
Epoch 25/25
110/110 [==============================] - 10s 90ms/step - loss: 0.1346 - accuracy: 0.9650 - val_loss: 0.1347 - val_accuracy: 0.9644
In [44]:
print(history.history.keys())
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])

Visualizing the train and test metrics

In [45]:
import matplotlib.pyplot as plt
%matplotlib inline
In [46]:
# summarize history for Accuracy
fig_acc = plt.figure(figsize=(10, 10))
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
In [47]:
# summarize history for Loss
fig_acc = plt.figure(figsize=(10, 10))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
In [48]:
# save the model
model.save("eng2french.h5")
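The saved model can be restored later with the standard Keras loader (a minimal sketch):

from tensorflow.keras.models import load_model
model = load_model("eng2french.h5")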

ASSESS TRAINED MODEL PERFORMANCE

In [49]:
# function to make prediction
def prediction(x, x_tokenizer = x_tokenizer, y_tokenizer = y_tokenizer):
    predictions = model.predict(x)[0]
    id_to_word = {id: word for word, id in y_tokenizer.word_index.items()}
    id_to_word[0] = ''
    return ' '.join([id_to_word[j] for j in np.argmax(predictions,1)])
In [50]:
# Printing the English text with the correct French translation and the predicted French translation
for i in range(5):

    print('Original English word - {}\n'.format(pad_to_text(x_test[i], x_tokenizer)))
    print('Original French word - {}\n'.format(pad_to_text(y_test[i], y_tokenizer)))
    print('Predicted French word - {}\n\n\n\n'.format(prediction(x_test[i:i+1])))
Original English word - new jersey is sometimes relaxing during may and it is never busy in april 

Original French word - new jersey est relaxant parfois au mois de mai et il est jamais occupé en avril       

Predicted French word - new jersey est relaxant parfois au mois de mai et il est jamais en en avril       




Original English word - our least favorite fruit is the grapefruit but his least favorite is the banana 

Original French word - notre fruit préféré moins est le pamplemousse mais son moins préféré est la banane         

Predicted French word - notre fruit préféré moins est le pamplemousse mais son moins préféré est la banane         




Original English word - you like grapefruit bananas and mangoes         

Original French word - vous aimez le pamplemousse les bananes et les mangues              

Predicted French word - vous aimez le pamplemousse les bananes et les mangues              




Original English word - the united states is sometimes wet during winter but it is busy in october 

Original French word - les états unis est parfois humide pendant l' hiver mais il est occupé en octobre        

Predicted French word - les états unis est parfois humide pendant l' hiver mais il est occupé en octobre        




Original English word - the united states is freezing during autumn and it is never warm in february 

Original French word - les états unis est le gel pendant l' automne et il est jamais chaud en février       

Predicted French word - les états unis est le gel pendant l' automne et il est jamais chaud en février       
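The same pipeline can translate sentences outside the test set, as long as every word appears in the training vocabulary (the tokenizer was built without an oov_token, so unseen words are silently dropped). A minimal sketch, using an assumed example sentence:

sentence = "california is usually hot during summer"
seq = x_tokenizer.texts_to_sequences([sentence])
padded = pad_sequences(seq, maxlen = maxlen_english, padding = 'post')
print(prediction(padded))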




Compiling the model with DeepC for production

In [ ]:
!deepCC eng2french.h5