Cainvas

Auto Image Captioning

Credit: AITS Cainvas Community

Photo by Alexei Evdokimov on Dribbble

Automatic image captioning is the task of training a deep learning model to automatically assign metadata, in the form of captions or keywords, to a digital image.

Applications:

Image captioning has various applications, such as annotating images, understanding the type of content shared on social media, and, especially when combined with NLP, helping blind people understand their surroundings and environment.

The dataset (Flickr8k) is sourced from Kaggle.

To download a pretrained model, run the following command:

In [1]:
!wget -N "https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/image_captioning.h5"
--2020-12-15 11:36:57--  https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/image_captioning.h5
Resolving cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)... 52.219.64.80
Connecting to cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)|52.219.64.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60985892 (58M) [application/x-hdf]
Saving to: ‘image_captioning.h5’

image_captioning.h5 100%[===================>]  58.16M  95.3MB/s    in 0.6s    

2020-12-15 11:36:58 (95.3 MB/s) - ‘image_captioning.h5’ saved [60985892/60985892]

Importing the necessary libraries

In [43]:
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
# use the preprocessing helper that belongs to the ResNet50 feature extractor
from keras.applications.resnet50 import preprocess_input
from keras.models import Sequential, Model
from tensorflow import keras
import os
import matplotlib.pyplot as plt
import string
from keras.applications.resnet50 import ResNet50
from pickle import dump
from pickle import load
from IPython.display import Image
from keras.layers import Dense, Flatten,Input, Convolution2D, Dropout, LSTM, TimeDistributed, Embedding, Bidirectional, Activation, RepeatVector,Concatenate
import numpy as np
from keras.models import load_model
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical, plot_model
from keras.layers import add, concatenate
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
In [2]:
# Defining the Dataset path
image_dataset_path = '../input/flickr8k-imageswithcaptions/Flickr8k_Dataset/Flicker8k_Dataset'
caption_dataset_path = '../input/flickr8k-imageswithcaptions/Flickr8k_text/Flickr8k.token.txt'
In [3]:
# Visualizing the image data
Image('../input/flickr8k-imageswithcaptions/Flickr8k_Dataset/Flicker8k_Dataset/1000268201_693b08cb0e.jpg')
Out[3]:

Caption Processor

In [4]:
# load the caption file & read it
def load_caption_file(path):
    
    # dictionary to store captions
    captions_dict = {}
    
    # iterate through the file, closing it automatically when done
    with open(path) as caption_file:
        for caption in caption_file:
            
            # Splitting the name of image file and image caption
            tokens = caption.split()
            
            # skip blank lines, which would otherwise raise an IndexError
            if not tokens:
                continue
            caption_id, caption_text = tokens[0].split('.')[0], tokens[1:]
            caption_text = ' '.join(caption_text)
            
            # keep only the first caption seen for each image
            if caption_id not in captions_dict:
                captions_dict[caption_id] = caption_text
        
    return captions_dict

# call the function
captions_dict = load_caption_file(caption_dataset_path)
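The loader above keeps only the first caption it sees for each image. As a minimal sketch of an alternative, the snippet below groups every caption under its image id. It assumes each line of the token file has the form `name.jpg#index<TAB>caption`; the function name `parse_caption_lines` and the sample lines are illustrative and not part of the notebook.

```python
from collections import defaultdict

def parse_caption_lines(lines):
    """Group captions by image id, keeping every caption for an image."""
    captions = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # "1000268201_693b08cb0e.jpg#0" -> "1000268201_693b08cb0e"
        image_id = parts[0].split('.')[0]
        captions[image_id].append(' '.join(parts[1:]))
    return dict(captions)

# Illustrative sample lines (assumed format, not read from the dataset)
sample_lines = [
    "1000268201_693b08cb0e.jpg#0\tA child in a pink dress is climbing up a set of stairs in an entry way .",
    "1000268201_693b08cb0e.jpg#1\tA girl going into a wooden building .",
    "",
]

all_captions = parse_caption_lines(sample_lines)
print(len(all_captions["1000268201_693b08cb0e"]))  # 2
```

Keeping all captions gives the model more training pairs per image, at the cost of a larger training set; the rest of this notebook continues with one caption per image.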

Preprocessing the captions

In [5]:
# dictionary to store the cleaned captions
new_captions_dict = {}

# prepare translation table for removing punctuation.
table = str.maketrans('', '', string.punctuation)

# loop through the dictionary
for caption_id, caption_text in captions_dict.items():
    # tokenize the caption_text
    caption_text = caption_text.split()
    # convert it into lower case
    caption_text = [token.lower() for token in caption_text]
    # remove punctuation from each token
    caption_text = [token.translate(table) for token in caption_text]
    # remove all the single letter tokens like 'a', 's'
    caption_text = [token for token in caption_text if len(token)>1]
    # store the cleaned captions
    new_captions_dict[caption_id] = 'startseq ' + ' '.join(caption_text) + ' endseq'
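
The cleaning steps in the loop above can also be sketched as a small standalone function, which makes them easy to try on a single caption. The function name `clean_caption` and the raw caption string are assumptions for illustration; the raw text is a plausible reconstruction of the first caption in the dataset.

```python
import string

# translation table that deletes every punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_caption(text):
    """Lowercase, strip punctuation, drop tokens of length <= 1, add markers."""
    tokens = [token.lower().translate(_PUNCT_TABLE) for token in text.split()]
    # a token such as "." becomes "" after translation and is dropped here too
    tokens = [token for token in tokens if len(token) > 1]
    return 'startseq ' + ' '.join(tokens) + ' endseq'

raw = "A child in a pink dress is climbing up a set of stairs in an entry way ."
print(clean_caption(raw))
# startseq child in pink dress is climbing up set of stairs in an entry way endseq
```

The `startseq` and `endseq` markers tell the decoder where a caption begins and ends, so the same markers must be used at both training and inference time.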
    
In [6]:
# free the raw captions dictionary; only the cleaned captions are used from here on
del captions_dict
In [7]:
first_id = next(iter(new_captions_dict))
print(f'"{first_id}" : {new_captions_dict[first_id]}')
"1000268201_693b08cb0e" : startseq child in pink dress is climbing up set of stairs in an entry way endseq

Make a list of only those images that have a caption

In [8]:
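# ids of all images that have a cleaned caption (a set gives fast membership tests)
image_index = set(new_captions_dict.keys())

# keep only the image files whose id has a caption
caption_images_list = [ image.split('.')[0] for image in os.listdir(image_dataset_path) if image.split('.')[0] in image_index ]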
In [9]:
caption_images_list[0]
Out[9]:
'3317145805_071b15debb'
In [10]:
# Total number of images that have captions
len(caption_images_list)
Out[10]:
8091
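
Because `os.listdir` returns entries in arbitrary order, the contents of any split made by position can differ between machines. The sketch below shows one way to make the filtered list deterministic by sorting it; the helper name `images_with_captions` and the sample inputs are hypothetical.

```python
import os

def images_with_captions(filenames, captioned_ids):
    """Return the sorted ids of files whose stem appears in captioned_ids."""
    captioned = set(captioned_ids)  # set lookup is O(1) per file
    stems = (os.path.splitext(name)[0] for name in filenames)
    return sorted(stem for stem in stems if stem in captioned)

# Illustrative inputs; in the notebook these would come from
# os.listdir(image_dataset_path) and new_captions_dict.
files = ["3317145805_071b15debb.jpg", "no_caption_here.jpg", "1000268201_693b08cb0e.jpg"]
ids = {"1000268201_693b08cb0e": "...", "3317145805_071b15debb": "..."}

print(images_with_captions(files, ids))
# ['1000268201_693b08cb0e', '3317145805_071b15debb']
```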

Make the training, validation, and test splits

In [11]:
train_validate_images = caption_images_list[0:8081]  
In [12]:
test_images = caption_images_list[8081:8091]
test_images
Out[12]:
['1124448967_2221af8dc5',
 '2711075591_f3ee53cfaa',
 '3670907052_c827593564',
 '1663751778_90501966f0',
 '918886676_3323fb2a01',
 '1131932671_c8d17751b3',
 '3697153626_90fb177731',
 '3371279606_c0d0cddab2',
 '3134586018_ae03ba20a0',
 '1562478333_43d13e5427']
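
The split above takes the last 10 entries of the list as the test set. As a hedged alternative, the sketch below shuffles the ids with a fixed seed before slicing, so the held-out images are random yet reproducible. The function name `split_ids`, the seed value, and the synthetic ids are assumptions for illustration.

```python
import random

def split_ids(image_ids, n_test=10, seed=42):
    """Shuffle ids reproducibly and hold out the last n_test as a test set."""
    ids = sorted(image_ids)             # start from a deterministic order
    random.Random(seed).shuffle(ids)    # private RNG, global state untouched
    cut = len(ids) - n_test
    return ids[:cut], ids[cut:]

# Synthetic ids standing in for caption_images_list
ids = [f"img_{i:04d}" for i in range(8091)]
train_validate, test = split_ids(ids)
print(len(train_validate), len(test))  # 8081 10
```

With the same seed the function returns the same split on every run, which keeps evaluation results comparable across experiments.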

Image Feature Extractor

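The extract_features1 function extracts a feature vector from each image using a ResNet50 model pretrained on ImageNet. With include_top=False and pooling='avg', the model returns one 2048-dimensional vector per image.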

In [13]:
# extract features from each photo in the directory
def extract_features1(directory, image_keys):
    # load the model
    model = ResNet50(include_top=False,weights='imagenet',input_shape=(224,224,3),pooling='avg')
    
    # model summary (summary() prints the table itself and returns None)
    model.summary()
    
    # extract features from each photo
    features = dict()
    
    for name in image_keys:
        
        # load an image from file
        filename = directory + '/' + name + '.jpg'
        
        # load the image and convert it into size accepted by ResNet model
        image = load_img(filename, target_size=(224, 224))
        
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        
        # reshape data for the model.
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        
        # prepare the image for the ResNet model
        image = preprocess_input(image)
        
        # get features
        feature = model.predict(image, verbose=0)
        
        # get image id
        image_id = name.split('.')[0]
        
        # store feature
        features[image_id] = feature
         

    return features
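
For intuition about what `preprocess_input` does to each image before it reaches the network, the sketch below reproduces the transformation in plain NumPy. It assumes the Keras 'caffe' preprocessing mode, which the VGG16 and ResNet50 helpers use: channels are reordered from RGB to BGR and the ImageNet per-channel mean is subtracted, with no scaling. The names `caffe_style_preprocess` and `IMAGENET_BGR_MEAN` are hypothetical, and the notebook itself should keep calling the Keras helper.

```python
import numpy as np

# ImageNet channel means in BGR order, as used by 'caffe'-style preprocessing
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def caffe_style_preprocess(batch_rgb):
    """Convert an RGB batch of shape (N, H, W, 3) to zero-centered BGR."""
    batch = np.asarray(batch_rgb, dtype=np.float32)
    bgr = batch[..., ::-1]              # reverse the channel axis: RGB -> BGR
    return bgr - IMAGENET_BGR_MEAN      # broadcast over N, H, W

# A dummy grey image with the shape the feature extractor expects
dummy = np.full((1, 224, 224, 3), 128.0, dtype=np.float32)
out = caffe_style_preprocess(dummy)
print(out.shape)  # (1, 224, 224, 3)
```

Pixel values stay on the 0 to 255 scale before the mean is subtracted, so images should not be divided by 255 before this step.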

Feature Extraction

In [14]:
# Note: This section takes time as it is processing the entire dataset.
train_validate_features1 = extract_features1(image_dataset_path, train_validate_images)
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
94773248/94765736 [==============================] - 1s 0us/step
Model: "resnet50"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 224, 224, 3) 0                                            
__________________________________________________________________________________________________
conv1_pad (ZeroPadding2D)       (None, 230, 230, 3)  0           input_1[0][0]                    
__________________________________________________________________________________________________
conv1_conv (Conv2D)             (None, 112, 112, 64) 9472        conv1_pad[0][0]                  
__________________________________________________________________________________________________
conv1_bn (BatchNormalization)   (None, 112, 112, 64) 256         conv1_conv[0][0]                 
__________________________________________________________________________________________________
conv1_relu (Activation)         (None, 112, 112, 64) 0           conv1_bn[0][0]                   
__________________________________________________________________________________________________
pool1_pad (ZeroPadding2D)       (None, 114, 114, 64) 0           conv1_relu[0][0]                 
__________________________________________________________________________________________________
pool1_pool (MaxPooling2D)       (None, 56, 56, 64)   0           pool1_pad[0][0]                  
__________________________________________________________________________________________________
conv2_block1_1_conv (Conv2D)    (None, 56, 56, 64)   4160        pool1_pool[0][0]                 
__________________________________________________________________________________________________
conv2_block1_1_bn (BatchNormali (None, 56, 56, 64)   256         conv2_block1_1_conv[0][0]        
__________________________________________________________________________________________________
conv2_block1_1_relu (Activation (None, 56, 56, 64)   0           conv2_block1_1_bn[0][0]          
__________________________________________________________________________________________________
conv2_block1_2_conv (Conv2D)    (None, 56, 56, 64)   36928       conv2_block1_1_relu[0][0]        
__________________________________________________________________________________________________
conv2_block1_2_bn (BatchNormali (None, 56, 56, 64)   256         conv2_block1_2_conv[0][0]        
__________________________________________________________________________________________________
conv2_block1_2_relu (Activation (None, 56, 56, 64)   0           conv2_block1_2_bn[0][0]          
__________________________________________________________________________________________________
conv2_block1_0_conv (Conv2D)    (None, 56, 56, 256)  16640       pool1_pool[0][0]                 
__________________________________________________________________________________________________
conv2_block1_3_conv (Conv2D)    (None, 56, 56, 256)  16640       conv2_block1_2_relu[0][0]        
__________________________________________________________________________________________________
conv2_block1_0_bn (BatchNormali (None, 56, 56, 256)  1024        conv2_block1_0_conv[0][0]        
__________________________________________________________________________________________________
conv2_block1_3_bn (BatchNormali (None, 56, 56, 256)  1024        conv2_block1_3_conv[0][0]        
__________________________________________________________________________________________________
conv2_block1_add (Add)          (None, 56, 56, 256)  0           conv2_block1_0_bn[0][0]          
                                                                 conv2_block1_3_bn[0][0]          
__________________________________________________________________________________________________
conv2_block1_out (Activation)   (None, 56, 56, 256)  0           conv2_block1_add[0][0]           
__________________________________________________________________________________________________
conv2_block2_1_conv (Conv2D)    (None, 56, 56, 64)   16448       conv2_block1_out[0][0]           
__________________________________________________________________________________________________
conv2_block2_1_bn (BatchNormali (None, 56, 56, 64)   256         conv2_block2_1_conv[0][0]        
__________________________________________________________________________________________________
conv2_block2_1_relu (Activation (None, 56, 56, 64)   0           conv2_block2_1_bn[0][0]          
__________________________________________________________________________________________________
conv2_block2_2_conv (Conv2D)    (None, 56, 56, 64)   36928       conv2_block2_1_relu[0][0]        
__________________________________________________________________________________________________
conv2_block2_2_bn (BatchNormali (None, 56, 56, 64)   256         conv2_block2_2_conv[0][0]        
__________________________________________________________________________________________________
conv2_block2_2_relu (Activation (None, 56, 56, 64)   0           conv2_block2_2_bn[0][0]          
__________________________________________________________________________________________________
conv2_block2_3_conv (Conv2D)    (None, 56, 56, 256)  16640       conv2_block2_2_relu[0][0]        
__________________________________________________________________________________________________
conv2_block2_3_bn (BatchNormali (None, 56, 56, 256)  1024        conv2_block2_3_conv[0][0]        
__________________________________________________________________________________________________
conv2_block2_add (Add)          (None, 56, 56, 256)  0           conv2_block1_out[0][0]           
                                                                 conv2_block2_3_bn[0][0]          
__________________________________________________________________________________________________
conv2_block2_out (Activation)   (None, 56, 56, 256)  0           conv2_block2_add[0][0]           
__________________________________________________________________________________________________
conv2_block3_1_conv (Conv2D)    (None, 56, 56, 64)   16448       conv2_block2_out[0][0]           
__________________________________________________________________________________________________
conv2_block3_1_bn (BatchNormali (None, 56, 56, 64)   256         conv2_block3_1_conv[0][0]        
__________________________________________________________________________________________________
conv2_block3_1_relu (Activation (None, 56, 56, 64)   0           conv2_block3_1_bn[0][0]          
__________________________________________________________________________________________________
conv2_block3_2_conv (Conv2D)    (None, 56, 56, 64)   36928       conv2_block3_1_relu[0][0]        
__________________________________________________________________________________________________
conv2_block3_2_bn (BatchNormali (None, 56, 56, 64)   256         conv2_block3_2_conv[0][0]        
__________________________________________________________________________________________________
conv2_block3_2_relu (Activation (None, 56, 56, 64)   0           conv2_block3_2_bn[0][0]          
__________________________________________________________________________________________________
conv2_block3_3_conv (Conv2D)    (None, 56, 56, 256)  16640       conv2_block3_2_relu[0][0]        
__________________________________________________________________________________________________
conv2_block3_3_bn (BatchNormali (None, 56, 56, 256)  1024        conv2_block3_3_conv[0][0]        
__________________________________________________________________________________________________
conv2_block3_add (Add)          (None, 56, 56, 256)  0           conv2_block2_out[0][0]           
                                                                 conv2_block3_3_bn[0][0]          
__________________________________________________________________________________________________
conv2_block3_out (Activation)   (None, 56, 56, 256)  0           conv2_block3_add[0][0]           
__________________________________________________________________________________________________
conv3_block1_1_conv (Conv2D)    (None, 28, 28, 128)  32896       conv2_block3_out[0][0]           
__________________________________________________________________________________________________
conv3_block1_1_bn (BatchNormali (None, 28, 28, 128)  512         conv3_block1_1_conv[0][0]        
__________________________________________________________________________________________________
conv3_block1_1_relu (Activation (None, 28, 28, 128)  0           conv3_block1_1_bn[0][0]          
__________________________________________________________________________________________________
conv3_block1_2_conv (Conv2D)    (None, 28, 28, 128)  147584      conv3_block1_1_relu[0][0]        
__________________________________________________________________________________________________
conv3_block1_2_bn (BatchNormali (None, 28, 28, 128)  512         conv3_block1_2_conv[0][0]        
__________________________________________________________________________________________________
conv3_block1_2_relu (Activation (None, 28, 28, 128)  0           conv3_block1_2_bn[0][0]          
__________________________________________________________________________________________________
conv3_block1_0_conv (Conv2D)    (None, 28, 28, 512)  131584      conv2_block3_out[0][0]           
__________________________________________________________________________________________________
conv3_block1_3_conv (Conv2D)    (None, 28, 28, 512)  66048       conv3_block1_2_relu[0][0]        
__________________________________________________________________________________________________
conv3_block1_0_bn (BatchNormali (None, 28, 28, 512)  2048        conv3_block1_0_conv[0][0]        
__________________________________________________________________________________________________
conv3_block1_3_bn (BatchNormali (None, 28, 28, 512)  2048        conv3_block1_3_conv[0][0]        
__________________________________________________________________________________________________
conv3_block1_add (Add)          (None, 28, 28, 512)  0           conv3_block1_0_bn[0][0]          
                                                                 conv3_block1_3_bn[0][0]          
__________________________________________________________________________________________________
conv3_block1_out (Activation)   (None, 28, 28, 512)  0           conv3_block1_add[0][0]           
__________________________________________________________________________________________________
conv3_block2_1_conv (Conv2D)    (None, 28, 28, 128)  65664       conv3_block1_out[0][0]           
__________________________________________________________________________________________________
conv3_block2_1_bn (BatchNormali (None, 28, 28, 128)  512         conv3_block2_1_conv[0][0]        
__________________________________________________________________________________________________
conv3_block2_1_relu (Activation (None, 28, 28, 128)  0           conv3_block2_1_bn[0][0]          
__________________________________________________________________________________________________
conv3_block2_2_conv (Conv2D)    (None, 28, 28, 128)  147584      conv3_block2_1_relu[0][0]        
__________________________________________________________________________________________________
conv3_block2_2_bn (BatchNormali (None, 28, 28, 128)  512         conv3_block2_2_conv[0][0]        
__________________________________________________________________________________________________
conv3_block2_2_relu (Activation (None, 28, 28, 128)  0           conv3_block2_2_bn[0][0]          
__________________________________________________________________________________________________
conv3_block2_3_conv (Conv2D)    (None, 28, 28, 512)  66048       conv3_block2_2_relu[0][0]        
__________________________________________________________________________________________________
conv3_block2_3_bn (BatchNormali (None, 28, 28, 512)  2048        conv3_block2_3_conv[0][0]        
__________________________________________________________________________________________________
conv3_block2_add (Add)          (None, 28, 28, 512)  0           conv3_block1_out[0][0]           
                                                                 conv3_block2_3_bn[0][0]          
__________________________________________________________________________________________________
conv3_block2_out (Activation)   (None, 28, 28, 512)  0           conv3_block2_add[0][0]           
__________________________________________________________________________________________________
conv3_block3_1_conv (Conv2D)    (None, 28, 28, 128)  65664       conv3_block2_out[0][0]           
__________________________________________________________________________________________________
conv3_block3_1_bn (BatchNormali (None, 28, 28, 128)  512         conv3_block3_1_conv[0][0]        
__________________________________________________________________________________________________
conv3_block3_1_relu (Activation (None, 28, 28, 128)  0           conv3_block3_1_bn[0][0]          
__________________________________________________________________________________________________
conv3_block3_2_conv (Conv2D)    (None, 28, 28, 128)  147584      conv3_block3_1_relu[0][0]        
__________________________________________________________________________________________________
conv3_block3_2_bn (BatchNormali (None, 28, 28, 128)  512         conv3_block3_2_conv[0][0]        
__________________________________________________________________________________________________
conv3_block3_2_relu (Activation (None, 28, 28, 128)  0           conv3_block3_2_bn[0][0]          
__________________________________________________________________________________________________
conv3_block3_3_conv (Conv2D)    (None, 28, 28, 512)  66048       conv3_block3_2_relu[0][0]        
__________________________________________________________________________________________________
conv3_block3_3_bn (BatchNormali (None, 28, 28, 512)  2048        conv3_block3_3_conv[0][0]        
__________________________________________________________________________________________________
conv3_block3_add (Add)          (None, 28, 28, 512)  0           conv3_block2_out[0][0]           
                                                                 conv3_block3_3_bn[0][0]          
__________________________________________________________________________________________________
conv3_block3_out (Activation)   (None, 28, 28, 512)  0           conv3_block3_add[0][0]           
__________________________________________________________________________________________________
conv3_block4_1_conv (Conv2D)    (None, 28, 28, 128)  65664       conv3_block3_out[0][0]           
__________________________________________________________________________________________________
conv3_block4_1_bn (BatchNormali (None, 28, 28, 128)  512         conv3_block4_1_conv[0][0]        
__________________________________________________________________________________________________
conv3_block4_1_relu (Activation (None, 28, 28, 128)  0           conv3_block4_1_bn[0][0]          
__________________________________________________________________________________________________
conv3_block4_2_conv (Conv2D)    (None, 28, 28, 128)  147584      conv3_block4_1_relu[0][0]        
__________________________________________________________________________________________________
conv3_block4_2_bn (BatchNormali (None, 28, 28, 128)  512         conv3_block4_2_conv[0][0]        
__________________________________________________________________________________________________
conv3_block4_2_relu (Activation (None, 28, 28, 128)  0           conv3_block4_2_bn[0][0]          
__________________________________________________________________________________________________
conv3_block4_3_conv (Conv2D)    (None, 28, 28, 512)  66048       conv3_block4_2_relu[0][0]        
__________________________________________________________________________________________________
conv3_block4_3_bn (BatchNormali (None, 28, 28, 512)  2048        conv3_block4_3_conv[0][0]        
__________________________________________________________________________________________________
conv3_block4_add (Add)          (None, 28, 28, 512)  0           conv3_block3_out[0][0]           
                                                                 conv3_block4_3_bn[0][0]          
__________________________________________________________________________________________________
conv3_block4_out (Activation)   (None, 28, 28, 512)  0           conv3_block4_add[0][0]           
__________________________________________________________________________________________________
conv4_block1_1_conv (Conv2D)    (None, 14, 14, 256)  131328      conv3_block4_out[0][0]           
__________________________________________________________________________________________________
conv4_block1_1_bn (BatchNormali (None, 14, 14, 256)  1024        conv4_block1_1_conv[0][0]        
__________________________________________________________________________________________________
conv4_block1_1_relu (Activation (None, 14, 14, 256)  0           conv4_block1_1_bn[0][0]          
__________________________________________________________________________________________________
conv4_block1_2_conv (Conv2D)    (None, 14, 14, 256)  590080      conv4_block1_1_relu[0][0]        
__________________________________________________________________________________________________
conv4_block1_2_bn (BatchNormali (None, 14, 14, 256)  1024        conv4_block1_2_conv[0][0]        
__________________________________________________________________________________________________
conv4_block1_2_relu (Activation (None, 14, 14, 256)  0           conv4_block1_2_bn[0][0]          
__________________________________________________________________________________________________
conv4_block1_0_conv (Conv2D)    (None, 14, 14, 1024) 525312      conv3_block4_out[0][0]           
__________________________________________________________________________________________________
conv4_block1_3_conv (Conv2D)    (None, 14, 14, 1024) 263168      conv4_block1_2_relu[0][0]        
__________________________________________________________________________________________________
conv4_block1_0_bn (BatchNormali (None, 14, 14, 1024) 4096        conv4_block1_0_conv[0][0]        
__________________________________________________________________________________________________
conv4_block1_3_bn (BatchNormali (None, 14, 14, 1024) 4096        conv4_block1_3_conv[0][0]        
__________________________________________________________________________________________________
conv4_block1_add (Add)          (None, 14, 14, 1024) 0           conv4_block1_0_bn[0][0]          
                                                                 conv4_block1_3_bn[0][0]          
__________________________________________________________________________________________________
conv4_block1_out (Activation)   (None, 14, 14, 1024) 0           conv4_block1_add[0][0]           
__________________________________________________________________________________________________
conv4_block2_1_conv (Conv2D)    (None, 14, 14, 256)  262400      conv4_block1_out[0][0]           
__________________________________________________________________________________________________
conv4_block2_1_bn (BatchNormali (None, 14, 14, 256)  1024        conv4_block2_1_conv[0][0]        
__________________________________________________________________________________________________
conv4_block2_1_relu (Activation (None, 14, 14, 256)  0           conv4_block2_1_bn[0][0]          
__________________________________________________________________________________________________
conv4_block2_2_conv (Conv2D)    (None, 14, 14, 256)  590080      conv4_block2_1_relu[0][0]        
__________________________________________________________________________________________________
conv4_block2_2_bn (BatchNormali (None, 14, 14, 256)  1024        conv4_block2_2_conv[0][0]        
__________________________________________________________________________________________________
conv4_block2_2_relu (Activation (None, 14, 14, 256)  0           conv4_block2_2_bn[0][0]          
__________________________________________________________________________________________________
conv4_block2_3_conv (Conv2D)    (None, 14, 14, 1024) 263168      conv4_block2_2_relu[0][0]        
__________________________________________________________________________________________________
conv4_block2_3_bn (BatchNormali (None, 14, 14, 1024) 4096        conv4_block2_3_conv[0][0]        
__________________________________________________________________________________________________
conv4_block2_add (Add)          (None, 14, 14, 1024) 0           conv4_block1_out[0][0]           
                                                                 conv4_block2_3_bn[0][0]          
__________________________________________________________________________________________________
conv4_block2_out (Activation)   (None, 14, 14, 1024) 0           conv4_block2_add[0][0]           
__________________________________________________________________________________________________
conv4_block3_1_conv (Conv2D)    (None, 14, 14, 256)  262400      conv4_block2_out[0][0]           
__________________________________________________________________________________________________
conv4_block3_1_bn (BatchNormali (None, 14, 14, 256)  1024        conv4_block3_1_conv[0][0]        
__________________________________________________________________________________________________
conv4_block3_1_relu (Activation (None, 14, 14, 256)  0           conv4_block3_1_bn[0][0]          
__________________________________________________________________________________________________
conv4_block3_2_conv (Conv2D)    (None, 14, 14, 256)  590080      conv4_block3_1_relu[0][0]        
__________________________________________________________________________________________________
conv4_block3_2_bn (BatchNormali (None, 14, 14, 256)  1024        conv4_block3_2_conv[0][0]        
__________________________________________________________________________________________________
conv4_block3_2_relu (Activation (None, 14, 14, 256)  0           conv4_block3_2_bn[0][0]          
__________________________________________________________________________________________________
conv4_block3_3_conv (Conv2D)    (None, 14, 14, 1024) 263168      conv4_block3_2_relu[0][0]        
__________________________________________________________________________________________________
conv4_block3_3_bn (BatchNormali (None, 14, 14, 1024) 4096        conv4_block3_3_conv[0][0]        
__________________________________________________________________________________________________
conv4_block3_add (Add)          (None, 14, 14, 1024) 0           conv4_block2_out[0][0]           
                                                                 conv4_block3_3_bn[0][0]          
__________________________________________________________________________________________________
conv4_block3_out (Activation)   (None, 14, 14, 1024) 0           conv4_block3_add[0][0]           
__________________________________________________________________________________________________
conv4_block4_1_conv (Conv2D)    (None, 14, 14, 256)  262400      conv4_block3_out[0][0]           
__________________________________________________________________________________________________
conv4_block4_1_bn (BatchNormali (None, 14, 14, 256)  1024        conv4_block4_1_conv[0][0]        
__________________________________________________________________________________________________
conv4_block4_1_relu (Activation (None, 14, 14, 256)  0           conv4_block4_1_bn[0][0]          
__________________________________________________________________________________________________
conv4_block4_2_conv (Conv2D)    (None, 14, 14, 256)  590080      conv4_block4_1_relu[0][0]        
__________________________________________________________________________________________________
conv4_block4_2_bn (BatchNormali (None, 14, 14, 256)  1024        conv4_block4_2_conv[0][0]        
__________________________________________________________________________________________________
conv4_block4_2_relu (Activation (None, 14, 14, 256)  0           conv4_block4_2_bn[0][0]          
__________________________________________________________________________________________________
conv4_block4_3_conv (Conv2D)    (None, 14, 14, 1024) 263168      conv4_block4_2_relu[0][0]        
__________________________________________________________________________________________________
conv4_block4_3_bn (BatchNormali (None, 14, 14, 1024) 4096        conv4_block4_3_conv[0][0]        
__________________________________________________________________________________________________
conv4_block4_add (Add)          (None, 14, 14, 1024) 0           conv4_block3_out[0][0]           
                                                                 conv4_block4_3_bn[0][0]          
__________________________________________________________________________________________________
conv4_block4_out (Activation)   (None, 14, 14, 1024) 0           conv4_block4_add[0][0]           
__________________________________________________________________________________________________
conv4_block5_1_conv (Conv2D)    (None, 14, 14, 256)  262400      conv4_block4_out[0][0]           
__________________________________________________________________________________________________
conv4_block5_1_bn (BatchNormali (None, 14, 14, 256)  1024        conv4_block5_1_conv[0][0]        
__________________________________________________________________________________________________
conv4_block5_1_relu (Activation (None, 14, 14, 256)  0           conv4_block5_1_bn[0][0]          
__________________________________________________________________________________________________
conv4_block5_2_conv (Conv2D)    (None, 14, 14, 256)  590080      conv4_block5_1_relu[0][0]        
__________________________________________________________________________________________________
conv4_block5_2_bn (BatchNormali (None, 14, 14, 256)  1024        conv4_block5_2_conv[0][0]        
__________________________________________________________________________________________________
conv4_block5_2_relu (Activation (None, 14, 14, 256)  0           conv4_block5_2_bn[0][0]          
__________________________________________________________________________________________________
conv4_block5_3_conv (Conv2D)    (None, 14, 14, 1024) 263168      conv4_block5_2_relu[0][0]        
__________________________________________________________________________________________________
conv4_block5_3_bn (BatchNormali (None, 14, 14, 1024) 4096        conv4_block5_3_conv[0][0]        
__________________________________________________________________________________________________
conv4_block5_add (Add)          (None, 14, 14, 1024) 0           conv4_block4_out[0][0]           
                                                                 conv4_block5_3_bn[0][0]          
__________________________________________________________________________________________________
conv4_block5_out (Activation)   (None, 14, 14, 1024) 0           conv4_block5_add[0][0]           
__________________________________________________________________________________________________
conv4_block6_1_conv (Conv2D)    (None, 14, 14, 256)  262400      conv4_block5_out[0][0]           
__________________________________________________________________________________________________
conv4_block6_1_bn (BatchNormali (None, 14, 14, 256)  1024        conv4_block6_1_conv[0][0]        
__________________________________________________________________________________________________
conv4_block6_1_relu (Activation (None, 14, 14, 256)  0           conv4_block6_1_bn[0][0]          
__________________________________________________________________________________________________
conv4_block6_2_conv (Conv2D)    (None, 14, 14, 256)  590080      conv4_block6_1_relu[0][0]        
__________________________________________________________________________________________________
conv4_block6_2_bn (BatchNormali (None, 14, 14, 256)  1024        conv4_block6_2_conv[0][0]        
__________________________________________________________________________________________________
conv4_block6_2_relu (Activation (None, 14, 14, 256)  0           conv4_block6_2_bn[0][0]          
__________________________________________________________________________________________________
conv4_block6_3_conv (Conv2D)    (None, 14, 14, 1024) 263168      conv4_block6_2_relu[0][0]        
__________________________________________________________________________________________________
conv4_block6_3_bn (BatchNormali (None, 14, 14, 1024) 4096        conv4_block6_3_conv[0][0]        
__________________________________________________________________________________________________
conv4_block6_add (Add)          (None, 14, 14, 1024) 0           conv4_block5_out[0][0]           
                                                                 conv4_block6_3_bn[0][0]          
__________________________________________________________________________________________________
conv4_block6_out (Activation)   (None, 14, 14, 1024) 0           conv4_block6_add[0][0]           
__________________________________________________________________________________________________
conv5_block1_1_conv (Conv2D)    (None, 7, 7, 512)    524800      conv4_block6_out[0][0]           
__________________________________________________________________________________________________
conv5_block1_1_bn (BatchNormali (None, 7, 7, 512)    2048        conv5_block1_1_conv[0][0]        
__________________________________________________________________________________________________
conv5_block1_1_relu (Activation (None, 7, 7, 512)    0           conv5_block1_1_bn[0][0]          
__________________________________________________________________________________________________
conv5_block1_2_conv (Conv2D)    (None, 7, 7, 512)    2359808     conv5_block1_1_relu[0][0]        
__________________________________________________________________________________________________
conv5_block1_2_bn (BatchNormali (None, 7, 7, 512)    2048        conv5_block1_2_conv[0][0]        
__________________________________________________________________________________________________
conv5_block1_2_relu (Activation (None, 7, 7, 512)    0           conv5_block1_2_bn[0][0]          
__________________________________________________________________________________________________
conv5_block1_0_conv (Conv2D)    (None, 7, 7, 2048)   2099200     conv4_block6_out[0][0]           
__________________________________________________________________________________________________
conv5_block1_3_conv (Conv2D)    (None, 7, 7, 2048)   1050624     conv5_block1_2_relu[0][0]        
__________________________________________________________________________________________________
conv5_block1_0_bn (BatchNormali (None, 7, 7, 2048)   8192        conv5_block1_0_conv[0][0]        
__________________________________________________________________________________________________
conv5_block1_3_bn (BatchNormali (None, 7, 7, 2048)   8192        conv5_block1_3_conv[0][0]        
__________________________________________________________________________________________________
conv5_block1_add (Add)          (None, 7, 7, 2048)   0           conv5_block1_0_bn[0][0]          
                                                                 conv5_block1_3_bn[0][0]          
__________________________________________________________________________________________________
conv5_block1_out (Activation)   (None, 7, 7, 2048)   0           conv5_block1_add[0][0]           
__________________________________________________________________________________________________
conv5_block2_1_conv (Conv2D)    (None, 7, 7, 512)    1049088     conv5_block1_out[0][0]           
__________________________________________________________________________________________________
conv5_block2_1_bn (BatchNormali (None, 7, 7, 512)    2048        conv5_block2_1_conv[0][0]        
__________________________________________________________________________________________________
conv5_block2_1_relu (Activation (None, 7, 7, 512)    0           conv5_block2_1_bn[0][0]          
__________________________________________________________________________________________________
conv5_block2_2_conv (Conv2D)    (None, 7, 7, 512)    2359808     conv5_block2_1_relu[0][0]        
__________________________________________________________________________________________________
conv5_block2_2_bn (BatchNormali (None, 7, 7, 512)    2048        conv5_block2_2_conv[0][0]        
__________________________________________________________________________________________________
conv5_block2_2_relu (Activation (None, 7, 7, 512)    0           conv5_block2_2_bn[0][0]          
__________________________________________________________________________________________________
conv5_block2_3_conv (Conv2D)    (None, 7, 7, 2048)   1050624     conv5_block2_2_relu[0][0]        
__________________________________________________________________________________________________
conv5_block2_3_bn (BatchNormali (None, 7, 7, 2048)   8192        conv5_block2_3_conv[0][0]        
__________________________________________________________________________________________________
conv5_block2_add (Add)          (None, 7, 7, 2048)   0           conv5_block1_out[0][0]           
                                                                 conv5_block2_3_bn[0][0]          
__________________________________________________________________________________________________
conv5_block2_out (Activation)   (None, 7, 7, 2048)   0           conv5_block2_add[0][0]           
__________________________________________________________________________________________________
conv5_block3_1_conv (Conv2D)    (None, 7, 7, 512)    1049088     conv5_block2_out[0][0]           
__________________________________________________________________________________________________
conv5_block3_1_bn (BatchNormali (None, 7, 7, 512)    2048        conv5_block3_1_conv[0][0]        
__________________________________________________________________________________________________
conv5_block3_1_relu (Activation (None, 7, 7, 512)    0           conv5_block3_1_bn[0][0]          
__________________________________________________________________________________________________
conv5_block3_2_conv (Conv2D)    (None, 7, 7, 512)    2359808     conv5_block3_1_relu[0][0]        
__________________________________________________________________________________________________
conv5_block3_2_bn (BatchNormali (None, 7, 7, 512)    2048        conv5_block3_2_conv[0][0]        
__________________________________________________________________________________________________
conv5_block3_2_relu (Activation (None, 7, 7, 512)    0           conv5_block3_2_bn[0][0]          
__________________________________________________________________________________________________
conv5_block3_3_conv (Conv2D)    (None, 7, 7, 2048)   1050624     conv5_block3_2_relu[0][0]        
__________________________________________________________________________________________________
conv5_block3_3_bn (BatchNormali (None, 7, 7, 2048)   8192        conv5_block3_3_conv[0][0]        
__________________________________________________________________________________________________
conv5_block3_add (Add)          (None, 7, 7, 2048)   0           conv5_block2_out[0][0]           
                                                                 conv5_block3_3_bn[0][0]          
__________________________________________________________________________________________________
conv5_block3_out (Activation)   (None, 7, 7, 2048)   0           conv5_block3_add[0][0]           
__________________________________________________________________________________________________
avg_pool (GlobalAveragePooling2 (None, 2048)         0           conv5_block3_out[0][0]           
==================================================================================================
Total params: 23,587,712
Trainable params: 23,534,592
Non-trainable params: 53,120
__________________________________________________________________________________________________
None
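
The parameter counts in the summary above can be checked by hand. The following is a minimal sketch, assuming the usual ResNet50 bottleneck layout (1x1, 3x3 and 1x1 convolutions, each with a bias) and that every BatchNormalization layer stores four values per channel: gamma and beta (trainable) plus the moving mean and moving variance (non-trainable). The helper names `conv2d_params` and `batchnorm_params` are hypothetical and are not part of Keras.

```python
def conv2d_params(kernel, in_ch, out_ch):
    """Weights plus one bias per output channel for a square-kernel Conv2D."""
    return kernel * kernel * in_ch * out_ch + out_ch


def batchnorm_params(channels):
    """gamma, beta, moving mean and moving variance: four values per channel."""
    return 4 * channels


# conv4_block4 maps 1024 -> 256 -> 256 -> 1024 channels.
block4_1_conv = conv2d_params(1, 1024, 256)   # 1x1 reduction
block4_2_conv = conv2d_params(3, 256, 256)    # 3x3 convolution
block4_3_conv = conv2d_params(1, 256, 1024)   # 1x1 expansion
block4_1_bn = batchnorm_params(256)
block4_3_bn = batchnorm_params(1024)

print(block4_1_conv, block4_2_conv, block4_3_conv)  # 262400 590080 263168
print(block4_1_bn, block4_3_bn)                     # 1024 4096
```

These values match the `Param #` column for `conv4_block4_*`. Half of every BatchNormalization count is non-trainable, which is where the non-trainable total in the summary comes from.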
In [15]:
# Print the features extracted from one sample image.
sample_key = next(iter(train_validate_features1))
print("{} : {}".format(sample_key, train_validate_features1[sample_key]))
3317145805_071b15debb : [[0.09465548 0.42188954 0.00809849 ... 0.01401249 1.6295764  0.06330808]]
In [16]:
len(train_validate_features1)
Out[16]:
8081
In [17]:
# Save the features in pickle format for later use.
with open('./train_validate_features1.pkl', 'wb') as f:
    dump(train_validate_features1, f)
In [18]:
# Load the features back from the pickle file.
with open('./train_validate_features1.pkl', 'rb') as f:
    train_validate_features1 = load(f)

Preparing the input dataset

In [19]:
# Build a dictionary that maps each train/validate image to its caption
train_validate_image_caption = {}

for image, caption in new_captions_dict.items():

    # keep the image only if it appears in both the train_validate_images list and the train_validate_features1 dictionary
    if image in train_validate_images and image in train_validate_features1:

        train_validate_image_caption[image] = caption

len(train_validate_image_caption)
Out[19]:
8081

Checking whether each caption is mapped to the correct image

In [20]:
list(train_validate_image_caption.values())[2]
Out[20]:
'startseq little girl covered in paint sits in front of painted rainbow with her hands in bowl endseq'
In [21]:
Image(image_dataset_path+'/'+list(train_validate_image_caption.keys())[2]+'.jpg')
Out[21]:
In [22]:
# initialise tokenizer
tokenizer = Tokenizer()

# create word count dictionary on the captions list
tokenizer.fit_on_texts(list(train_validate_image_caption.values()))

# Vocabulary size (+1 because index 0 is reserved for padding)
vocab_len = len(tokenizer.word_index) + 1

# Store the length (in words) of the longest caption
max_len = max(len(caption.split()) for caption in train_validate_image_caption.values())

print("vocab_len ", vocab_len)
print("max_len ", max_len)

def prepare_data(image_keys):
    
    # x1 stores the image features, x2 stores the padded caption prefix and y stores the one-hot encoded next word
    x1, x2, y = [], [], []

    # iterate through all the images 
    for image in image_keys:

        # store the caption of that image
        caption = train_validate_image_caption[image]

        # split the caption into tokens
        caption = caption.split()

        # convert the tokens into a sequence of integer word indices
        seq = tokenizer.texts_to_sequences([caption])[0]

        length = len(seq)

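        # every prefix seq[:i] is an input and the word that follows it, seq[i], is the target
        for i in range(1, length):

            x2_seq, y_seq = seq[:i], seq[i]

            # left-pad the prefix with zeros up to max_len
            x2_seq = pad_sequences([x2_seq], maxlen = max_len)[0]

            # one-hot encode the next word
            y_seq = to_categorical([y_seq], num_classes = vocab_len)[0]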

            x1.append( train_validate_features1[image][0] )

            x2.append(x2_seq)

            y.append(y_seq)
               
    return np.array(x1), np.array(x2), np.array(y)
vocab_len  4487
max_len  30
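
To make the inner loop of `prepare_data` concrete, here is a minimal sketch of how one caption expands into several (prefix, next word) training pairs. It is written in plain Python so it runs without Keras. `pad_pre` is a hypothetical stand-in that imitates the default left-padding behaviour of `pad_sequences`, and the word indices in `toy_seq` are made up for illustration.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad a sequence to maxlen, keeping only the last maxlen items (maxlen >= 1)."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


def expand_caption(seq, maxlen):
    """Return one (padded prefix, next word index) pair per position after the first."""
    pairs = []
    for i in range(1, len(seq)):
        pairs.append((pad_pre(seq[:i], maxlen), seq[i]))
    return pairs


# Hypothetical indices standing for "startseq dog runs endseq"
toy_seq = [1, 5, 9, 2]

for prefix, target in expand_caption(toy_seq, maxlen=6):
    print(prefix, "->", target)
# [0, 0, 0, 0, 0, 1] -> 5
# [0, 0, 0, 0, 1, 5] -> 9
# [0, 0, 0, 1, 5, 9] -> 2
```

A caption with n tokens therefore contributes n - 1 samples, which is why the number of training rows is much larger than the number of images.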
In [23]:
train_x1, train_x2, train_y = prepare_data( train_validate_images[0:7081] )
validate_x1, validate_x2, validate_y = prepare_data( train_validate_images[7081:8081] )
In [24]:
len(train_x1)
Out[24]:
72489

Preparing the Final Model

In [25]:
embedding_size = 128
image_model = Sequential()

image_model.add(Dense(embedding_size, input_shape=(2048,), activation='relu'))
image_model.add(Dropout(0.5))
image_model.add(RepeatVector(max_len))

image_model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 128)               262272    
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
repeat_vector (RepeatVector) (None, 30, 128)           0         
=================================================================
Total params: 262,272
Trainable params: 262,272
Non-trainable params: 0
_________________________________________________________________
In [26]:
language_model = Sequential()

language_model.add(Embedding(input_dim=vocab_len, output_dim=embedding_size, input_length=max_len))
language_model.add(LSTM(256,return_sequences=True))
language_model.add(Dropout(0.5))
language_model.add(TimeDistributed(Dense(embedding_size)))

language_model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 30, 128)           574336    
_________________________________________________________________
lstm (LSTM)                  (None, 30, 256)           394240    
_________________________________________________________________
dropout_1 (Dropout)          (None, 30, 256)           0         
_________________________________________________________________
time_distributed (TimeDistri (None, 30, 128)           32896     
=================================================================
Total params: 1,001,472
Trainable params: 1,001,472
Non-trainable params: 0
_________________________________________________________________
In [38]:
# Final Model
conca = Concatenate()([image_model.output, language_model.output])
x = LSTM(128, dropout=0.5, recurrent_dropout=0.5,return_sequences=True)(conca)
x = LSTM(512, dropout=0.5, recurrent_dropout=0.5,return_sequences=False)(x)
x = Dense(vocab_len)(x)
out = Activation('softmax')(x)
model = Model(inputs=[image_model.input, language_model.input], outputs = out)

optimizer = keras.optimizers.Adam(learning_rate=0.0001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()
Model: "functional_7"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
embedding_input (InputLayer)    [(None, 30)]         0                                            
__________________________________________________________________________________________________
dense_input (InputLayer)        [(None, 2048)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 30, 128)      574336      embedding_input[0][0]            
__________________________________________________________________________________________________
dense (Dense)                   (None, 128)          262272      dense_input[0][0]                
__________________________________________________________________________________________________
lstm (LSTM)                     (None, 30, 256)      394240      embedding[0][0]                  
__________________________________________________________________________________________________
dropout (Dropout)               (None, 128)          0           dense[0][0]                      
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 30, 256)      0           lstm[0][0]                       
__________________________________________________________________________________________________
repeat_vector (RepeatVector)    (None, 30, 128)      0           dropout[0][0]                    
__________________________________________________________________________________________________
time_distributed (TimeDistribut (None, 30, 128)      32896       dropout_1[0][0]                  
__________________________________________________________________________________________________
concatenate_3 (Concatenate)     (None, 30, 256)      0           repeat_vector[0][0]              
                                                                 time_distributed[0][0]           
__________________________________________________________________________________________________
lstm_7 (LSTM)                   (None, 30, 128)      197120      concatenate_3[0][0]              
__________________________________________________________________________________________________
lstm_8 (LSTM)                   (None, 512)          1312768     lstm_7[0][0]                     
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 4487)         2301831     lstm_8[0][0]                     
__________________________________________________________________________________________________
activation_3 (Activation)       (None, 4487)         0           dense_5[0][0]                    
==================================================================================================
Total params: 5,075,463
Trainable params: 5,075,463
Non-trainable params: 0
__________________________________________________________________________________________________
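
The `Param #` column of the merged model can be reproduced with a few formulas. This is a minimal sketch, assuming the standard counts for these layer types: an Embedding has vocab x dim weights, a Dense layer has in x out weights plus out biases, and an LSTM has four gates that each see the input, the previous hidden state and a bias. The helper names are hypothetical.

```python
def embedding_params(vocab, dim):
    return vocab * dim


def dense_params(in_dim, out_dim):
    return in_dim * out_dim + out_dim


def lstm_params(in_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (in_dim + units + 1) * units


vocab_len, embedding_size = 4487, 128

counts = {
    "embedding": embedding_params(vocab_len, embedding_size),
    "dense": dense_params(2048, embedding_size),
    "lstm": lstm_params(embedding_size, 256),
    "time_distributed": dense_params(256, embedding_size),
    "lstm_7": lstm_params(2 * embedding_size, 128),  # input is the 256-wide concatenation
    "lstm_8": lstm_params(128, 512),
    "dense_5": dense_params(512, vocab_len),
}

for name, value in counts.items():
    print(name, value)
print("total", sum(counts.values()))  # total 5075463
```

Almost half of the weights sit in the final Dense layer, because its size scales with the vocabulary.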

Defining callbacks for Model Training

In [39]:
# define a checkpoint callback that saves only the model with the lowest validation loss
filepath = './image_captioning.h5'

callbacks = [ModelCheckpoint(filepath=filepath, verbose=2, save_best_only=True, monitor='val_loss', mode='min')]
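
The effect of `save_best_only=True` with `mode='min'` can be illustrated without Keras. The sketch below is a simplified model of the behaviour, not the library's implementation: the file is rewritten only when the monitored value is strictly lower than the best value seen so far. The function name `epochs_saved` and the loss values are made up for illustration.

```python
def epochs_saved(val_losses):
    """Return the 1-based epochs at which a 'save best only' checkpoint would be written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # strict improvement only
            best = loss
            saved.append(epoch)
    return saved


toy_val_losses = [5.0, 4.6, 4.7, 4.5, 4.5]
print(epochs_saved(toy_val_losses))  # [1, 2, 4]
```

Epoch 3 is skipped because the loss went up, and epoch 5 is skipped because a tie is not an improvement. The file on disk always holds the weights from the last saved epoch.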

Make sure feature data and target data share the same first dimension

In [40]:
print("shape of train_x1 ", train_x1.shape)
print("shape of train_x2 ", train_x2.shape)
print("shape of train_y ", train_y.shape)
print()
print("shape of validate_x1 ", validate_x1.shape)
print("shape of validate_x2 ", validate_x2.shape)
print("shape of validate_y ", validate_y.shape)
shape of train_x1  (72489, 2048)
shape of train_x2  (72489, 30)
shape of train_y  (72489, 4487)

shape of validate_x1  (10059, 2048)
shape of validate_x2  (10059, 30)
shape of validate_y  (10059, 4487)
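
The printed shapes allow a few quick sanity checks. The sketch below is a rough estimate under stated assumptions: the one-hot targets are stored as 4-byte floats (the default dtype of `to_categorical`), and the batch size of 512 is the one used in the training cell. The helper `same_first_dim` is hypothetical.

```python
import math


def same_first_dim(*shapes):
    """True when every array shape has the same number of rows."""
    return len({shape[0] for shape in shapes}) == 1


train_shapes = [(72489, 2048), (72489, 30), (72489, 4487)]
validate_shapes = [(10059, 2048), (10059, 30), (10059, 4487)]

print(same_first_dim(*train_shapes), same_first_dim(*validate_shapes))  # True True

# Memory taken by the one-hot training targets, assuming 4 bytes per value.
rows, classes = train_shapes[2]
train_y_gib = rows * classes * 4 / 2**30
print(round(train_y_gib, 2))  # 1.21

# Number of batches per epoch for a batch size of 512.
steps_per_epoch = math.ceil(rows / 512)
print(steps_per_epoch)  # 142
```

The one-hot target matrix is the largest array in memory. If that becomes a problem, one common alternative is to keep integer targets and use a sparse categorical cross-entropy loss; the notebook does not do this.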

Training Model

In [41]:
BATCH_SIZE = 512

# Define training epochs
EPOCHS = 100

history = model.fit([train_x1, train_x2],  
                    train_y,              
                    verbose = 1,            
                    epochs = EPOCHS,
                    batch_size = BATCH_SIZE,
                    callbacks = callbacks, 
                    validation_data=([validate_x1, validate_x2], validate_y)) 
Epoch 1/100
142/142 [==============================] - ETA: 0s - loss: 6.4257
Epoch 00001: val_loss improved from inf to 5.71079, saving model to ./model1.h5
142/142 [==============================] - 42s 296ms/step - loss: 6.4257 - val_loss: 5.7108
Epoch 2/100
142/142 [==============================] - ETA: 0s - loss: 5.6116
Epoch 00002: val_loss improved from 5.71079 to 5.54960, saving model to ./model1.h5
142/142 [==============================] - 41s 288ms/step - loss: 5.6116 - val_loss: 5.5496
Epoch 3/100
142/142 [==============================] - ETA: 0s - loss: 5.3725
Epoch 00003: val_loss improved from 5.54960 to 5.29885, saving model to ./model1.h5
142/142 [==============================] - 41s 291ms/step - loss: 5.3725 - val_loss: 5.2989
Epoch 4/100
142/142 [==============================] - ETA: 0s - loss: 5.1091
Epoch 00004: val_loss improved from 5.29885 to 5.10842, saving model to ./model1.h5
142/142 [==============================] - 40s 282ms/step - loss: 5.1091 - val_loss: 5.1084
Epoch 5/100
142/142 [==============================] - ETA: 0s - loss: 4.9061
Epoch 00005: val_loss improved from 5.10842 to 4.96152, saving model to ./model1.h5
142/142 [==============================] - 41s 289ms/step - loss: 4.9061 - val_loss: 4.9615
Epoch 6/100
142/142 [==============================] - ETA: 0s - loss: 4.7445
Epoch 00006: val_loss improved from 4.96152 to 4.84895, saving model to ./model1.h5
142/142 [==============================] - 41s 290ms/step - loss: 4.7445 - val_loss: 4.8490
Epoch 7/100
142/142 [==============================] - ETA: 0s - loss: 4.6188
Epoch 00007: val_loss improved from 4.84895 to 4.77124, saving model to ./model1.h5
142/142 [==============================] - 40s 282ms/step - loss: 4.6188 - val_loss: 4.7712
Epoch 8/100
142/142 [==============================] - ETA: 0s - loss: 4.5176
Epoch 00008: val_loss improved from 4.77124 to 4.71032, saving model to ./model1.h5
142/142 [==============================] - 41s 290ms/step - loss: 4.5176 - val_loss: 4.7103
Epoch 9/100
142/142 [==============================] - ETA: 0s - loss: 4.4358
Epoch 00009: val_loss improved from 4.71032 to 4.66173, saving model to ./model1.h5
142/142 [==============================] - 41s 287ms/step - loss: 4.4358 - val_loss: 4.6617
Epoch 10/100
142/142 [==============================] - ETA: 0s - loss: 4.3666
Epoch 00010: val_loss improved from 4.66173 to 4.62412, saving model to ./model1.h5
142/142 [==============================] - 40s 281ms/step - loss: 4.3666 - val_loss: 4.6241
Epoch 11/100
142/142 [==============================] - ETA: 0s - loss: 4.3084
Epoch 00011: val_loss improved from 4.62412 to 4.59351, saving model to ./model1.h5
142/142 [==============================] - 41s 285ms/step - loss: 4.3084 - val_loss: 4.5935
Epoch 12/100
142/142 [==============================] - ETA: 0s - loss: 4.2578
Epoch 00012: val_loss improved from 4.59351 to 4.57024, saving model to ./model1.h5
142/142 [==============================] - 41s 286ms/step - loss: 4.2578 - val_loss: 4.5702
Epoch 13/100
142/142 [==============================] - ETA: 0s - loss: 4.2110
Epoch 00013: val_loss improved from 4.57024 to 4.54921, saving model to ./model1.h5
142/142 [==============================] - 39s 276ms/step - loss: 4.2110 - val_loss: 4.5492
Epoch 14/100
142/142 [==============================] - ETA: 0s - loss: 4.1737
Epoch 00014: val_loss improved from 4.54921 to 4.53371, saving model to ./model1.h5
142/142 [==============================] - 40s 282ms/step - loss: 4.1737 - val_loss: 4.5337
Epoch 15/100
142/142 [==============================] - ETA: 0s - loss: 4.1332
Epoch 00015: val_loss improved from 4.53371 to 4.51695, saving model to ./model1.h5
142/142 [==============================] - 40s 282ms/step - loss: 4.1332 - val_loss: 4.5170
Epoch 16/100
142/142 [==============================] - ETA: 0s - loss: 4.1016
Epoch 00016: val_loss improved from 4.51695 to 4.50345, saving model to ./model1.h5
142/142 [==============================] - 40s 281ms/step - loss: 4.1016 - val_loss: 4.5034
Epoch 17/100
142/142 [==============================] - ETA: 0s - loss: 4.0689
Epoch 00017: val_loss improved from 4.50345 to 4.49452, saving model to ./model1.h5
142/142 [==============================] - 40s 284ms/step - loss: 4.0689 - val_loss: 4.4945
Epoch 18/100
142/142 [==============================] - ETA: 0s - loss: 4.0407
Epoch 00018: val_loss improved from 4.49452 to 4.48104, saving model to ./model1.h5
142/142 [==============================] - 40s 280ms/step - loss: 4.0407 - val_loss: 4.4810
Epoch 19/100
142/142 [==============================] - ETA: 0s - loss: 4.0119
Epoch 00019: val_loss improved from 4.48104 to 4.47403, saving model to ./model1.h5
142/142 [==============================] - 40s 279ms/step - loss: 4.0119 - val_loss: 4.4740
Epoch 20/100
142/142 [==============================] - ETA: 0s - loss: 3.9852
Epoch 00020: val_loss improved from 4.47403 to 4.46802, saving model to ./model1.h5
142/142 [==============================] - 41s 286ms/step - loss: 3.9852 - val_loss: 4.4680
Epoch 21/100
142/142 [==============================] - ETA: 0s - loss: 3.9630
Epoch 00021: val_loss improved from 4.46802 to 4.46417, saving model to ./model1.h5
142/142 [==============================] - 40s 282ms/step - loss: 3.9630 - val_loss: 4.4642
Epoch 22/100
142/142 [==============================] - ETA: 0s - loss: 3.9390
Epoch 00022: val_loss improved from 4.46417 to 4.45409, saving model to ./model1.h5
142/142 [==============================] - 39s 276ms/step - loss: 3.9390 - val_loss: 4.4541
Epoch 23/100
142/142 [==============================] - ETA: 0s - loss: 3.9173
Epoch 00023: val_loss improved from 4.45409 to 4.45044, saving model to ./model1.h5
142/142 [==============================] - 40s 278ms/step - loss: 3.9173 - val_loss: 4.4504
Epoch 24/100
142/142 [==============================] - ETA: 0s - loss: 3.8944
Epoch 00024: val_loss improved from 4.45044 to 4.45039, saving model to ./model1.h5
142/142 [==============================] - 40s 285ms/step - loss: 3.8944 - val_loss: 4.4504
Epoch 25/100
142/142 [==============================] - ETA: 0s - loss: 3.8735
Epoch 00025: val_loss improved from 4.45039 to 4.44258, saving model to ./model1.h5
142/142 [==============================] - 39s 273ms/step - loss: 3.8735 - val_loss: 4.4426
Epoch 26/100
142/142 [==============================] - ETA: 0s - loss: 3.8533
Epoch 00026: val_loss improved from 4.44258 to 4.43676, saving model to ./model1.h5
142/142 [==============================] - 40s 281ms/step - loss: 3.8533 - val_loss: 4.4368
Epoch 27/100
142/142 [==============================] - ETA: 0s - loss: 3.8367
Epoch 00027: val_loss improved from 4.43676 to 4.43461, saving model to ./model1.h5
142/142 [==============================] - 41s 286ms/step - loss: 3.8367 - val_loss: 4.4346
Epoch 28/100
142/142 [==============================] - ETA: 0s - loss: 3.8188
Epoch 00028: val_loss improved from 4.43461 to 4.43210, saving model to ./model1.h5
142/142 [==============================] - 39s 277ms/step - loss: 3.8188 - val_loss: 4.4321
Epoch 29/100
142/142 [==============================] - ETA: 0s - loss: 3.8002
Epoch 00029: val_loss improved from 4.43210 to 4.43048, saving model to ./model1.h5
142/142 [==============================] - 40s 281ms/step - loss: 3.8002 - val_loss: 4.4305
Epoch 30/100
142/142 [==============================] - ETA: 0s - loss: 3.7808
Epoch 00030: val_loss improved from 4.43048 to 4.42760, saving model to ./model1.h5
142/142 [==============================] - 40s 283ms/step - loss: 3.7808 - val_loss: 4.4276
Epoch 31/100
142/142 [==============================] - ETA: 0s - loss: 3.7671
Epoch 00031: val_loss improved from 4.42760 to 4.42525, saving model to ./model1.h5
142/142 [==============================] - 39s 272ms/step - loss: 3.7671 - val_loss: 4.4253
Epoch 32/100
142/142 [==============================] - ETA: 0s - loss: 3.7475
Epoch 00032: val_loss did not improve from 4.42525
142/142 [==============================] - 41s 285ms/step - loss: 3.7475 - val_loss: 4.4290
Epoch 33/100
142/142 [==============================] - ETA: 0s - loss: 3.7332
Epoch 00033: val_loss improved from 4.42525 to 4.42322, saving model to ./model1.h5
142/142 [==============================] - 40s 282ms/step - loss: 3.7332 - val_loss: 4.4232
Epoch 34/100
142/142 [==============================] - ETA: 0s - loss: 3.7168
Epoch 00034: val_loss did not improve from 4.42322
142/142 [==============================] - 38s 270ms/step - loss: 3.7168 - val_loss: 4.4246
Epoch 35/100
142/142 [==============================] - ETA: 0s - loss: 3.7023
Epoch 00035: val_loss improved from 4.42322 to 4.42074, saving model to ./model1.h5
142/142 [==============================] - 40s 283ms/step - loss: 3.7023 - val_loss: 4.4207
Epoch 36/100
142/142 [==============================] - ETA: 0s - loss: 3.6890
Epoch 00036: val_loss did not improve from 4.42074
142/142 [==============================] - 40s 284ms/step - loss: 3.6890 - val_loss: 4.4227
Epoch 37/100
142/142 [==============================] - ETA: 0s - loss: 3.6735
Epoch 00037: val_loss improved from 4.42074 to 4.41951, saving model to ./model1.h5
142/142 [==============================] - 39s 275ms/step - loss: 3.6735 - val_loss: 4.4195
Epoch 38/100
142/142 [==============================] - ETA: 0s - loss: 3.6590
Epoch 00038: val_loss improved from 4.41951 to 4.41807, saving model to ./model1.h5
142/142 [==============================] - 40s 283ms/step - loss: 3.6590 - val_loss: 4.4181
Epoch 39/100
142/142 [==============================] - ETA: 0s - loss: 3.6424
Epoch 00039: val_loss did not improve from 4.41807
142/142 [==============================] - 40s 282ms/step - loss: 3.6424 - val_loss: 4.4190
Epoch 40/100
142/142 [==============================] - ETA: 0s - loss: 3.6297
Epoch 00040: val_loss improved from 4.41807 to 4.41731, saving model to ./model1.h5
142/142 [==============================] - 39s 277ms/step - loss: 3.6297 - val_loss: 4.4173
Epoch 41/100
142/142 [==============================] - ETA: 0s - loss: 3.6149
Epoch 00041: val_loss did not improve from 4.41731
142/142 [==============================] - 40s 281ms/step - loss: 3.6149 - val_loss: 4.4175
Epoch 42/100
142/142 [==============================] - ETA: 0s - loss: 3.5974
Epoch 00042: val_loss improved from 4.41731 to 4.41264, saving model to ./model1.h5
142/142 [==============================] - 40s 283ms/step - loss: 3.5974 - val_loss: 4.4126
Epoch 43/100
142/142 [==============================] - ETA: 0s - loss: 3.5827
Epoch 00043: val_loss improved from 4.41264 to 4.41105, saving model to ./model1.h5
142/142 [==============================] - 39s 272ms/step - loss: 3.5827 - val_loss: 4.4111
Epoch 44/100
142/142 [==============================] - ETA: 0s - loss: 3.5669
Epoch 00044: val_loss improved from 4.41105 to 4.40693, saving model to ./model1.h5
142/142 [==============================] - 41s 285ms/step - loss: 3.5669 - val_loss: 4.4069
Epoch 45/100
142/142 [==============================] - ETA: 0s - loss: 3.5545
Epoch 00045: val_loss improved from 4.40693 to 4.40636, saving model to ./model1.h5
142/142 [==============================] - 40s 279ms/step - loss: 3.5545 - val_loss: 4.4064
Epoch 46/100
142/142 [==============================] - ETA: 0s - loss: 3.5407
Epoch 00046: val_loss improved from 4.40636 to 4.40469, saving model to ./model1.h5
142/142 [==============================] - 39s 276ms/step - loss: 3.5407 - val_loss: 4.4047
Epoch 47/100
142/142 [==============================] - ETA: 0s - loss: 3.5265
Epoch 00047: val_loss improved from 4.40469 to 4.40257, saving model to ./model1.h5
142/142 [==============================] - 40s 285ms/step - loss: 3.5265 - val_loss: 4.4026
Epoch 48/100
142/142 [==============================] - ETA: 0s - loss: 3.5139
Epoch 00048: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 284ms/step - loss: 3.5139 - val_loss: 4.4051
Epoch 49/100
142/142 [==============================] - ETA: 0s - loss: 3.4992
Epoch 00049: val_loss did not improve from 4.40257
142/142 [==============================] - 39s 275ms/step - loss: 3.4992 - val_loss: 4.4043
Epoch 50/100
142/142 [==============================] - ETA: 0s - loss: 3.4836
Epoch 00050: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 282ms/step - loss: 3.4836 - val_loss: 4.4034
Epoch 51/100
142/142 [==============================] - ETA: 0s - loss: 3.4715
Epoch 00051: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 278ms/step - loss: 3.4715 - val_loss: 4.4049
Epoch 52/100
142/142 [==============================] - ETA: 0s - loss: 3.4568
Epoch 00052: val_loss did not improve from 4.40257
142/142 [==============================] - 39s 278ms/step - loss: 3.4568 - val_loss: 4.4065
Epoch 53/100
142/142 [==============================] - ETA: 0s - loss: 3.4482
Epoch 00053: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 279ms/step - loss: 3.4482 - val_loss: 4.4078
Epoch 54/100
142/142 [==============================] - ETA: 0s - loss: 3.4346
Epoch 00054: val_loss did not improve from 4.40257
142/142 [==============================] - 39s 278ms/step - loss: 3.4346 - val_loss: 4.4111
Epoch 55/100
142/142 [==============================] - ETA: 0s - loss: 3.4264
Epoch 00055: val_loss did not improve from 4.40257
142/142 [==============================] - 39s 273ms/step - loss: 3.4264 - val_loss: 4.4098
Epoch 56/100
142/142 [==============================] - ETA: 0s - loss: 3.4123
Epoch 00056: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 285ms/step - loss: 3.4123 - val_loss: 4.4135
Epoch 57/100
142/142 [==============================] - ETA: 0s - loss: 3.4016
Epoch 00057: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 283ms/step - loss: 3.4016 - val_loss: 4.4151
Epoch 58/100
142/142 [==============================] - ETA: 0s - loss: 3.3901
Epoch 00058: val_loss did not improve from 4.40257
142/142 [==============================] - 39s 276ms/step - loss: 3.3901 - val_loss: 4.4169
Epoch 59/100
142/142 [==============================] - ETA: 0s - loss: 3.3769
Epoch 00059: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 278ms/step - loss: 3.3769 - val_loss: 4.4218
Epoch 60/100
142/142 [==============================] - ETA: 0s - loss: 3.3656
Epoch 00060: val_loss did not improve from 4.40257
142/142 [==============================] - 41s 285ms/step - loss: 3.3656 - val_loss: 4.4236
Epoch 61/100
142/142 [==============================] - ETA: 0s - loss: 3.3555
Epoch 00061: val_loss did not improve from 4.40257
142/142 [==============================] - 39s 275ms/step - loss: 3.3555 - val_loss: 4.4282
Epoch 62/100
142/142 [==============================] - ETA: 0s - loss: 3.3491
Epoch 00062: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 279ms/step - loss: 3.3491 - val_loss: 4.4281
Epoch 63/100
142/142 [==============================] - ETA: 0s - loss: 3.3369
Epoch 00063: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 281ms/step - loss: 3.3369 - val_loss: 4.4300
Epoch 64/100
142/142 [==============================] - ETA: 0s - loss: 3.3243
Epoch 00064: val_loss did not improve from 4.40257
142/142 [==============================] - 39s 277ms/step - loss: 3.3243 - val_loss: 4.4361
Epoch 65/100
142/142 [==============================] - ETA: 0s - loss: 3.3181
Epoch 00065: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 282ms/step - loss: 3.3181 - val_loss: 4.4374
Epoch 66/100
142/142 [==============================] - ETA: 0s - loss: 3.3055
Epoch 00066: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 282ms/step - loss: 3.3055 - val_loss: 4.4426
Epoch 67/100
142/142 [==============================] - ETA: 0s - loss: 3.2976
Epoch 00067: val_loss did not improve from 4.40257
142/142 [==============================] - 39s 273ms/step - loss: 3.2976 - val_loss: 4.4431
Epoch 68/100
142/142 [==============================] - ETA: 0s - loss: 3.2886
Epoch 00068: val_loss did not improve from 4.40257
142/142 [==============================] - 41s 285ms/step - loss: 3.2886 - val_loss: 4.4495
Epoch 69/100
142/142 [==============================] - ETA: 0s - loss: 3.2797
Epoch 00069: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 282ms/step - loss: 3.2797 - val_loss: 4.4515
Epoch 70/100
142/142 [==============================] - ETA: 0s - loss: 3.2724
Epoch 00070: val_loss did not improve from 4.40257
142/142 [==============================] - 39s 271ms/step - loss: 3.2724 - val_loss: 4.4547
Epoch 71/100
142/142 [==============================] - ETA: 0s - loss: 3.2614
Epoch 00071: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 282ms/step - loss: 3.2614 - val_loss: 4.4566
Epoch 72/100
142/142 [==============================] - ETA: 0s - loss: 3.2529
Epoch 00072: val_loss did not improve from 4.40257
142/142 [==============================] - 41s 287ms/step - loss: 3.2529 - val_loss: 4.4610
Epoch 73/100
142/142 [==============================] - ETA: 0s - loss: 3.2422
Epoch 00073: val_loss did not improve from 4.40257
142/142 [==============================] - 39s 273ms/step - loss: 3.2422 - val_loss: 4.4675
Epoch 74/100
142/142 [==============================] - ETA: 0s - loss: 3.2334
Epoch 00074: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 283ms/step - loss: 3.2334 - val_loss: 4.4704
Epoch 75/100
142/142 [==============================] - ETA: 0s - loss: 3.2290
Epoch 00075: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 282ms/step - loss: 3.2290 - val_loss: 4.4739
Epoch 76/100
142/142 [==============================] - ETA: 0s - loss: 3.2188
Epoch 00076: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 279ms/step - loss: 3.2188 - val_loss: 4.4744
Epoch 77/100
142/142 [==============================] - ETA: 0s - loss: 3.2123
Epoch 00077: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 282ms/step - loss: 3.2123 - val_loss: 4.4812
Epoch 78/100
142/142 [==============================] - ETA: 0s - loss: 3.2027
Epoch 00078: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 280ms/step - loss: 3.2027 - val_loss: 4.4880
Epoch 79/100
142/142 [==============================] - ETA: 0s - loss: 3.1942
Epoch 00079: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 281ms/step - loss: 3.1942 - val_loss: 4.4872
Epoch 80/100
142/142 [==============================] - ETA: 0s - loss: 3.1872
Epoch 00080: val_loss did not improve from 4.40257
142/142 [==============================] - 41s 290ms/step - loss: 3.1872 - val_loss: 4.4891
Epoch 81/100
142/142 [==============================] - ETA: 0s - loss: 3.1779
Epoch 00081: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 284ms/step - loss: 3.1779 - val_loss: 4.4947
Epoch 82/100
142/142 [==============================] - ETA: 0s - loss: 3.1695
Epoch 00082: val_loss did not improve from 4.40257
142/142 [==============================] - 39s 276ms/step - loss: 3.1695 - val_loss: 4.4959
Epoch 83/100
142/142 [==============================] - ETA: 0s - loss: 3.1645
Epoch 00083: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 282ms/step - loss: 3.1645 - val_loss: 4.5021
Epoch 84/100
142/142 [==============================] - ETA: 0s - loss: 3.1544
Epoch 00084: val_loss did not improve from 4.40257
142/142 [==============================] - 41s 289ms/step - loss: 3.1544 - val_loss: 4.5069
Epoch 85/100
142/142 [==============================] - ETA: 0s - loss: 3.1463
Epoch 00085: val_loss did not improve from 4.40257
142/142 [==============================] - 39s 276ms/step - loss: 3.1463 - val_loss: 4.5163
Epoch 86/100
142/142 [==============================] - ETA: 0s - loss: 3.1411
Epoch 00086: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 280ms/step - loss: 3.1411 - val_loss: 4.5153
Epoch 87/100
142/142 [==============================] - ETA: 0s - loss: 3.1329
Epoch 00087: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 280ms/step - loss: 3.1329 - val_loss: 4.5178
Epoch 88/100
142/142 [==============================] - ETA: 0s - loss: 3.1244
Epoch 00088: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 280ms/step - loss: 3.1244 - val_loss: 4.5214
Epoch 89/100
142/142 [==============================] - ETA: 0s - loss: 3.1144
Epoch 00089: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 280ms/step - loss: 3.1144 - val_loss: 4.5261
Epoch 90/100
142/142 [==============================] - ETA: 0s - loss: 3.1100
Epoch 00090: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 279ms/step - loss: 3.1100 - val_loss: 4.5262
Epoch 91/100
142/142 [==============================] - ETA: 0s - loss: 3.1014
Epoch 00091: val_loss did not improve from 4.40257
142/142 [==============================] - 39s 277ms/step - loss: 3.1014 - val_loss: 4.5334
Epoch 92/100
142/142 [==============================] - ETA: 0s - loss: 3.0952
Epoch 00092: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 284ms/step - loss: 3.0952 - val_loss: 4.5356
Epoch 93/100
142/142 [==============================] - ETA: 0s - loss: 3.0895
Epoch 00093: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 282ms/step - loss: 3.0895 - val_loss: 4.5363
Epoch 94/100
142/142 [==============================] - ETA: 0s - loss: 3.0802
Epoch 00094: val_loss did not improve from 4.40257
142/142 [==============================] - 39s 277ms/step - loss: 3.0802 - val_loss: 4.5421
Epoch 95/100
142/142 [==============================] - ETA: 0s - loss: 3.0760
Epoch 00095: val_loss did not improve from 4.40257
142/142 [==============================] - 39s 277ms/step - loss: 3.0760 - val_loss: 4.5437
Epoch 96/100
142/142 [==============================] - ETA: 0s - loss: 3.0651
Epoch 00096: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 285ms/step - loss: 3.0651 - val_loss: 4.5490
Epoch 97/100
142/142 [==============================] - ETA: 0s - loss: 3.0599
Epoch 00097: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 283ms/step - loss: 3.0599 - val_loss: 4.5514
Epoch 98/100
142/142 [==============================] - ETA: 0s - loss: 3.0526
Epoch 00098: val_loss did not improve from 4.40257
142/142 [==============================] - 39s 275ms/step - loss: 3.0526 - val_loss: 4.5503
Epoch 99/100
142/142 [==============================] - ETA: 0s - loss: 3.0474
Epoch 00099: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 280ms/step - loss: 3.0474 - val_loss: 4.5536
Epoch 100/100
142/142 [==============================] - ETA: 0s - loss: 3.0411
Epoch 00100: val_loss did not improve from 4.40257
142/142 [==============================] - 40s 281ms/step - loss: 3.0411 - val_loss: 4.5625
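
The log above shows val_loss last improving at epoch 47 (4.40257), while the training loss keeps falling through epoch 100. This suggests the model starts to overfit from that point. The snippet below is a minimal sketch of how the best epoch could be read off a list of validation losses. The helper name `best_epoch` is hypothetical, and the short list of values is copied from epochs 45 to 50 of the log purely for illustration. In the notebook the list would presumably come from `history.history['val_loss']`.

```python
def best_epoch(val_losses, first_epoch=1):
    """Return (epoch number, loss) for the lowest validation loss in the list."""
    # index of the smallest loss; ties resolve to the earliest epoch
    best_idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    return first_epoch + best_idx, val_losses[best_idx]

# val_loss values for epochs 45..50, copied from the training log above
sample_val_losses = [4.4064, 4.4047, 4.4026, 4.4051, 4.4043, 4.4034]
epoch, loss = best_epoch(sample_val_losses, first_epoch=45)
print(epoch, loss)
```

Because the checkpoint is only overwritten when val_loss improves, the saved model1.h5 should correspond to this best epoch rather than to epoch 100. Keras also provides an `EarlyStopping` callback that can halt training once val_loss stalls. Whether it suits this notebook is a judgment call.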

Model Evaluation

Plotting the training and validation loss

In [44]:
# plot the training and validation loss curves

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train_loss','val_loss'], loc = 'upper right')
plt.show()

Functions for extracting features and mapping ids to words

In [45]:
# extract ResNet50 features from a single photo
def extract_feat(filename):
    # load the model
    model = ResNet50(include_top=False,weights='imagenet',input_shape=(224,224,3),pooling='avg')

    # load the photo
    image = load_img(filename, target_size=(224, 224))
    # convert the image pixels to a numpy array
    image = img_to_array(image)
    # reshape data for the model
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # prepare the image for the ResNet50 model (the imported VGG16 preprocess_input applies the same 'caffe'-style preprocessing)
    image = preprocess_input(image)
    # get features
    feature = model.predict(image, verbose=0)
    return feature

# map an integer to a word
def word_for_id(integer, tokenizr):
    for word, index in tokenizr.word_index.items():
        if index == integer:
            return word
    return None
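
`word_for_id` scans the whole vocabulary on every call, and it is called once per generated word. As a hedged alternative, the sketch below builds the inverse mapping once and looks ids up directly. The name `word_for_id_fast` and the tiny `word_index` dictionary are hypothetical stand-ins for `tokenizer.word_index`. The Keras `Tokenizer` also exposes an `index_word` attribute that holds this inverse mapping.

```python
# hypothetical word -> index mapping, shaped like Tokenizer.word_index
word_index = {'startseq': 1, 'endseq': 2, 'dog': 3, 'grass': 4}

# build the inverse mapping once instead of scanning on every lookup
index_word = {index: word for word, index in word_index.items()}

def word_for_id_fast(integer, index_word):
    # dict.get returns None for unknown ids, matching word_for_id's behavior
    return index_word.get(integer)

print(word_for_id_fast(3, index_word))
```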
 

Function for generating captions

In [46]:
# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'startseq'
    # iterate over the whole length of the sequence
    for i in range(max_length):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo,sequence], verbose=0)
        # convert probability to integer
        yhat = np.argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # stop if we predict the end of the sequence
        if word == 'endseq':
            break
    return in_text
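
`generate_desc` performs greedy decoding: at each step it keeps only the single most probable next word and stops at 'endseq' or after max_length steps. The toy sketch below reproduces that control flow without Keras. `VOCAB`, `toy_next_word_probs` and `greedy_decode` are made-up stand-ins for the tokenizer vocabulary, `model.predict` and `generate_desc`, and the probabilities are invented for illustration.

```python
# tiny made-up vocabulary (hypothetical, for illustration only)
VOCAB = ['startseq', 'dog', 'runs', 'endseq']

def toy_next_word_probs(words):
    # stand-in for model.predict: favours the word after the last one in VOCAB
    last = VOCAB.index(words[-1])
    probs = [0.1] * len(VOCAB)
    probs[min(last + 1, len(VOCAB) - 1)] = 0.7
    return probs

def greedy_decode(next_word_probs, max_length):
    words = ['startseq']  # seed the generation process
    for _ in range(max_length):
        probs = next_word_probs(words)
        # argmax: keep only the most probable next word
        best = max(range(len(probs)), key=probs.__getitem__)
        word = VOCAB[best]
        words.append(word)
        # stop once the end marker is predicted
        if word == 'endseq':
            break
    return ' '.join(words)

caption = greedy_decode(toy_next_word_probs, max_length=10)
print(caption)
```

Greedy decoding is simple and fast, but it can miss captions whose first words are individually less likely. Beam search is a common alternative that the notebook does not use.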

Visualizing the results and assessing Model Performance

In [61]:
# generate a description for an image
model = load_model('./image_captioning.h5')
tokenizr = Tokenizer()
tokenizr.fit_on_texts([caption for image, caption in new_captions_dict.items() if image in train_validate_images])
max_length = max_len

photo = extract_feat('../input/flickr8k-imageswithcaptions/Flickr8k_Dataset/Flicker8k_Dataset/3561433412_3985208d53.jpg')  

# generate the caption with generate_desc (defined above), which runs the same greedy loop
# note: as in the original cell, this uses `tokenizer` from earlier in the notebook,
# not the `tokenizr` fitted above
in_text = generate_desc(model, tokenizer, photo, max_length)
in_text = in_text.replace('startseq','') 
in_text = in_text.replace('endseq','') 
print("Predicted caption -> ", in_text)
print()
print('*********************************************************************')
Image('../input/flickr8k-imageswithcaptions/Flickr8k_Dataset/Flicker8k_Dataset/3561433412_3985208d53.jpg')
Predicted caption ->   black and white dog is running in the grass 

*********************************************************************
Out[61]:
In [62]:
# generate a description for an image
model = load_model('./image_captioning.h5')
tokenizr = Tokenizer()
tokenizr.fit_on_texts([caption for image, caption in new_captions_dict.items() if image in train_validate_images])
max_length = max_len

photo = extract_feat('../input/flickr8k-imageswithcaptions/Flickr8k_Dataset/Flicker8k_Dataset/3229821595_77ace81c6b.jpg')  

# generate the caption with generate_desc (defined above), which runs the same greedy loop
# note: as in the original cell, this uses `tokenizer` from earlier in the notebook,
# not the `tokenizr` fitted above
in_text = generate_desc(model, tokenizer, photo, max_length)
in_text = in_text.replace('startseq','') 
in_text = in_text.replace('endseq','')
print("Predicted caption -> ", in_text)
print()
print('*********************************************************************')
Image('../input/flickr8k-imageswithcaptions/Flickr8k_Dataset/Flicker8k_Dataset/3229821595_77ace81c6b.jpg')
Predicted caption ->   group of people are standing on the road 

*********************************************************************
Out[62]:
In [63]:
# generate a description for an image
model = load_model('./image_captioning.h5')
tokenizr = Tokenizer()
tokenizr.fit_on_texts([caption for image, caption in new_captions_dict.items() if image in train_validate_images])
max_length = max_len

photo = extract_feat('../input/flickr8k-imageswithcaptions/Flickr8k_Dataset/Flicker8k_Dataset/554526471_a31f8b74ef.jpg')  

# generate the caption with generate_desc (defined above), which runs the same greedy loop
# note: as in the original cell, this uses `tokenizer` from earlier in the notebook,
# not the `tokenizr` fitted above
in_text = generate_desc(model, tokenizer, photo, max_length)
in_text = in_text.replace('startseq','') 
in_text = in_text.replace('endseq','')
print("Predicted caption -> ", in_text)
print()
print('*********************************************************************')
Image('../input/flickr8k-imageswithcaptions/Flickr8k_Dataset/Flicker8k_Dataset/554526471_a31f8b74ef.jpg')
Predicted caption ->   group of people are playing in the water 

*********************************************************************
Out[63]:
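
The cells above strip the 'startseq' and 'endseq' markers with `str.replace`, which leaves stray spaces around the caption, as the printed outputs show. A small helper along the following lines could tidy that up. The name `clean_caption` is hypothetical and the sample string is built from the first predicted caption above.

```python
def clean_caption(text, start='startseq', end='endseq'):
    """Drop the start/end marker tokens and normalise whitespace."""
    # split on whitespace so that only whole marker tokens are removed
    words = [w for w in text.split() if w not in (start, end)]
    return ' '.join(words)

raw = 'startseq black and white dog is running in the grass endseq'
print("Predicted caption ->", clean_caption(raw))
```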