Image Captioning using Jupyter
Image captioning is a process in which a textual description is generated based on an image. To better understand image captioning, we first need to differentiate it from image classification.
Difference between image classification and image captioning
Image classification is a relatively simple process that only tells us what is in an image. For example, if there is a boy on a bike, image classification will not give us a description; it will just provide the result as boy or bike. Image classification can tell us whether there is a woman or a dog in the image, or an action, such as snowboarding. This is not a desirable result as there is no description of what exactly is going on in the image.
The following is the result we get using image classification:
On the other hand, image captioning will provide a result with a description. For the preceding example, the result of image captioning would be a boy riding a bike or a man snowboarding. This could be useful for generating content for a book or for helping the visually impaired.
The following is the result we get using image captioning:
However, image captioning is considerably more challenging: convolutional neural networks are powerful, but they are not well suited to sequential data. Sequential data is data that arrives in an order, and that order actually matters. In speech or text, for example, words come in a particular sequence; jumbling the words might change the meaning of the sentence or turn it into complete gibberish.
Recurrent neural networks with long short-term memory
As powerful as convolutional neural networks (CNNs) are, they don't handle sequential data so well; however, they are great for non-sequential tasks, such as image classification.
How CNNs work is shown in the following diagram:
Recurrent neural networks (RNNs), which are the state of the art for sequential tasks, can handle this kind of data. An RNN receives its input as a sequence and feeds part of its output back into itself at each step, so earlier inputs can influence later outputs.
How RNNs work is shown in the following diagram:
Data coming in a sequence (xi) goes through the neural network and we get an output (yi). The output is then fed through to another iteration and forms a loop. This helps us remember the data coming from before and is helpful for sequential data tasks such as audio and speech recognition, language translation, video identification, and text generation.
Another concept that has been around for a while and is very helpful is long short-term memory (LSTM) used with RNNs. LSTM is a way to handle long-term memory, rather than simply passing data from one iteration to the next: gated memory cells decide what to keep and what to forget at each step. This handles information across iterations in a robust way and allows us to train RNNs effectively.
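To make the idea concrete, here is a minimal sketch (not part of the im2txt code) that runs an LSTM cell over a batch of dummy sequences using the TensorFlow 1.x API; the batch size, sequence length, and feature size are arbitrary values chosen purely for illustration:
# minimal LSTM sketch: run an LSTM cell over a batch of sequences
# (illustrative only -- sizes are arbitrary, not taken from im2txt)
import tensorflow as tf

batch_size, seq_len, feature_dim, hidden_units = 2, 10, 8, 32
inputs = tf.placeholder(tf.float32, [batch_size, seq_len, feature_dim])
cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_units)
# outputs holds the hidden state at every time step;
# state is the final memory of the sequence
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
The output at each time step depends on both the current input and the LSTM's internal state, which is exactly the looping behavior described above.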
Google Brain im2txt captioning model
Google Brain's im2txt model was used by Google in the 2015 MSCOCO Image Captioning Challenge, and it will form the foundation of the image captioning code that we will implement in this article.
Google's TensorFlow models repository on GitHub can be found at https://github.com/tensorflow/models/tree/master/research/im2txt.
In the research directory, we will find the im2txt directory, which contains the code Google used for the 2015 MSCOCO Image Captioning Challenge. The accompanying paper is freely available at https://arxiv.org/abs/1609.06647 and covers RNNs, LSTMs, and the underlying algorithms in detail.
From it, we can see how CNNs are used for the image-classification side of the problem and how LSTM RNNs are used to actually generate the sequential caption output.
We can download the code from the GitHub link; however, it has not been set up to run easily, as it does not include a pre-trained model, so we may face some challenges. To avoid training the model from scratch, which is a time-consuming process, we have provided you with a pre-trained model.
Some modifications have been made to the code to make it easy to run in a Jupyter Notebook or to incorporate into your own projects. With the pre-trained model, generating captions is quick even on just a CPU; without it, training the same code might take weeks, even on a good GPU.
Running the captioning code on Jupyter
Let's now run our own version of the code in a Jupyter Notebook. We can start up our own Jupyter Notebook and load the Section_1-Tensorflow_Image_Captioning.ipynb file from the GitHub repository (https://github.com/PacktPublishing/Computer-Vision-Projects-with-OpenCV-and-Python-3/blob/master/Chapter01/Section_1-Tensorflow_Image_Captioning.ipynb).
Once we load the file on a Jupyter Notebook, it will look something like this:
In the first part, we are going to load some essential libraries, including math, os, and tensorflow. We will also use the handy %pylab inline magic to easily read and display images within the Notebook.
Select the first code block:
# load essential libraries
import math
import os
import tensorflow as tf
%pylab inline
When we hit Ctrl + Enter to execute the code in the cell, we will get the following output:
We need to now load the TensorFlow/Google Brain base code, which we can get from https://github.com/PacktPublishing/Computer-Vision-Projects-with-OpenCV-and-Python-3.
There are multiple utility functions, but we will be using and executing only a few of them in our example:
# load Tensorflow/Google Brain base code
# https://github.com/tensorflow/models/tree/master/research/im2txt
from im2txt import configuration
from im2txt import inference_wrapper
from im2txt.inference_utils import caption_generator
from im2txt.inference_utils import vocabulary
We need to tell our function where to find the trained model and vocabulary:
# tell our function where to find the trained model and vocabulary
checkpoint_path = './model'
vocab_file = './model/word_counts.txt'
The code for the trained model and vocabulary has been added in the GitHub repository, and you can access it from this link:
https://github.com/PacktPublishing/Computer-Vision-Projects-with-OpenCV-and-Python-3
The folder contains checkpoint, word_counts.txt, and the pre-trained model. We need to make sure that we use these files and avoid using other outdated files that might not be compatible with the latest version of TensorFlow. The word_counts.txt file contains a vocabulary list with the number of counts from our trained model, which our image caption generator is going to need.
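If you are curious about what the vocabulary looks like, the short, optional sketch below prints the first few entries; it assumes each line of word_counts.txt holds a word followed by its count:
# optional: peek at the first few vocabulary entries
# (assumes each line holds a word followed by its count)
with open(vocab_file) as f:
    for _ in range(5):
        print(f.readline().strip())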
Once these steps have been completed, we can look at our main function, which will generate the captions for us. The function takes its input as a string of file names (comma separated if there is more than one) or just a single file that we want to process. The verbosity is set to tf.logging.FATAL, the most restrictive of the available logging levels, so we are only told when something has gone seriously wrong:
In the initial part of the main code, we perform the following steps:
- Set the verbosity level to tf.logging.FATAL.
- Load our pre-trained model.
- Load the inference wrapper from our utility file provided by Google.
- Load our pre-trained model from the checkpoint path that we established in the previous cell.
- Run the finalize function:
# this is the function we'll call to produce our captions
# given input file name(s) -- separate file names by a comma
# if more than one
def gen_caption(input_files):
    # only print serious log messages
    tf.logging.set_verbosity(tf.logging.FATAL)
    # load our pretrained model
    g = tf.Graph()
    with g.as_default():
        model = inference_wrapper.InferenceWrapper()
        restore_fn = model.build_graph_from_config(configuration.ModelConfig(),
                                                   checkpoint_path)
    g.finalize()
- Load the vocabulary from the file path that we set in the previous cell:
# Create the vocabulary.
vocab = vocabulary.Vocabulary(vocab_file)
- Pre-process the filenames:
filenames = []
for file_pattern in input_files.split(","):
- Perform the Glob action:
filenames.extend(tf.gfile.Glob(file_pattern))
- Log the list of matching filenames, so you know which files the image caption generator is running on:
tf.logging.info("Running caption generation on %d files matching %s",
len(filenames), input_files)
- Create a session. We need to use the restore function since we are using a pre-trained model:
with tf.Session(graph=g) as sess:
    # Load the model from checkpoint.
    restore_fn(sess)
The code for these steps is included here:
# this is the function we'll call to produce our captions
# given input file name(s) -- separate file names by a comma
# if more than one
def gen_caption(input_files):
    # only print serious log messages
    tf.logging.set_verbosity(tf.logging.FATAL)
    # load our pretrained model
    g = tf.Graph()
    with g.as_default():
        model = inference_wrapper.InferenceWrapper()
        restore_fn = model.build_graph_from_config(configuration.ModelConfig(),
                                                   checkpoint_path)
    g.finalize()
    # Create the vocabulary.
    vocab = vocabulary.Vocabulary(vocab_file)
    filenames = []
    for file_pattern in input_files.split(","):
        filenames.extend(tf.gfile.Glob(file_pattern))
    tf.logging.info("Running caption generation on %d files matching %s",
                    len(filenames), input_files)
    with tf.Session(graph=g) as sess:
        # Load the model from checkpoint.
        restore_fn(sess)
We now move to the second half of the main code. Once the session has been restored, we perform the following steps:
- Create a CaptionGenerator from our model and vocabulary, and store it in an object called generator:
generator = caption_generator.CaptionGenerator(model, vocab)
- Make a caption list:
captionlist = []
- Iterate over the files, read each image, and pass it to the generator's beam_search method to analyze it:
for filename in filenames:
    with tf.gfile.GFile(filename, "rb") as f:
        image = f.read()
    captions = generator.beam_search(sess, image)
- Print the captions:
print("Captions for image %s:" % os.path.basename(filename))
- Iterate over the captions returned by the beam search, converting each one from word IDs back into a sentence:
for i, caption in enumerate(captions):
    # Ignore begin and end words.
    sentence = [vocab.id_to_word(w) for w in caption.sentence[1:-1]]
    sentence = " ".join(sentence)
    print(" %d) %s (p=%f)" % (i, sentence, math.exp(caption.logprob)))
    captionlist.append(sentence)
- Return captionlist:
return captionlist
Run the cell to define the function.
See the following code block for the complete code:
        # Prepare the caption generator. Here we are implicitly using the default
        # beam search parameters. See caption_generator.py for a description of the
        # available beam search parameters.
        generator = caption_generator.CaptionGenerator(model, vocab)
        captionlist = []
        for filename in filenames:
            with tf.gfile.GFile(filename, "rb") as f:
                image = f.read()
            captions = generator.beam_search(sess, image)
            print("Captions for image %s:" % os.path.basename(filename))
            for i, caption in enumerate(captions):
                # Ignore begin and end words.
                sentence = [vocab.id_to_word(w) for w in caption.sentence[1:-1]]
                sentence = " ".join(sentence)
                print(" %d) %s (p=%f)" % (i, sentence, math.exp(caption.logprob)))
                captionlist.append(sentence)
    return captionlist
In the next code block, we will execute the code on sample stock photos from a test folder. The code will create a figure, show it, and then run the caption generator. We can then display the output using the print statement.
The following is the code we use to select the image for computation:
testfile = 'test_images/dog.jpeg'
figure()
imshow(imread(testfile))
capts = gen_caption(testfile)
When we run our first test image, dog.jpeg, we get the following output:
The result, a woman and a dog are standing on the grass, is a good caption for the image. Since all three results are pretty similar, we can say that our model is working pretty well.
Analyzing the result captions
Let's take a few examples to check our model. When we execute football.jpeg, we get the following output:
Here we clearly have American football going on in the image, and a couple of men playing a game of football is a very good result. However, the first result, a couple of men playing a game of frisbee, is not the desired output, nor is a couple of men playing a game of soccer. So, in this case, the second caption is actually the best one, even though the model assigned it a lower log probability; the top-ranked caption is not always going to be perfect.
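The p value printed next to each caption is simply the exponential of the caption's log probability (that is exactly what the print statement in gen_caption computes), so captions can be compared on a familiar 0-to-1 scale. A quick illustration, using an arbitrary value rather than one from the output above:
# convert a log probability into a plain probability
# (-7.0 is an arbitrary illustrative value, not taken from the output)
import math
print(math.exp(-7.0))  # roughly 0.0009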
Let's try one more example, giraffes.jpeg:
Clearly, we have an image of giraffes, and the first caption, a group of giraffe standing next to each other, seems to be correct, except for the grammar issue. The other two results are a group of giraffes are standing in a field and a group of giraffe standing next to each other on a field.
Let's take a look at one more example, headphones.jpeg:
Here we selected headphones.jpeg, but the results did not include headphones as an output. The result was a woman holding a cell phone in her hand, which is a good result. The second result, a woman holding a cell phone up to her ear, is technically incorrect, but these are some good captions overall.
Let's take one last example, ballons.jpeg. When we run the image, we get the following output:
The results we get for this image are a woman standing on a beach flying a kite, a woman is flying a kite on the beach, and a young girl flying a kite on a beach. The model got the woman or young girl right, but it saw a kite instead of the balloons, even though "balloon" is in the vocabulary. We can infer that the model is not perfect, but it is impressive and could be incorporated into your own application.
Running the captioning code on Jupyter for multiple images
Multiple images can also be passed in as a single input string by separating the image paths with commas. Processing a string of images will naturally take longer than the single-image runs we've seen so far.
The following is an example of multiple input files:
input_files = 'test_images/ballons.jpeg,test_images/bike.jpeg,test_images/dog.jpeg,test_images/fireworks.jpeg,test_images/football.jpeg,test_images/giraffes.jpeg,test_images/headphones.jpeg,test_images/laughing.jpeg,test_images/objects.jpeg,test_images/snowboard.jpeg,test_images/surfing.jpeg'
capts = gen_caption(input_files)
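If you would also like to see each image alongside its best caption, the optional sketch below does so; it assumes the default beam size of three, so gen_caption returns the captions in groups of three per image, with the highest-probability caption first:
# optional: show each image with its top-ranked caption
# (assumes the default beam size of 3, so captions come back in groups of three)
files = input_files.split(",")
for i, fname in enumerate(files):
    figure()
    imshow(imread(fname))
    title(capts[i * 3])  # first caption returned for this image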
We will not be displaying the images, so the output will include only the results. We can see that some of the results are better than others:
This wraps up running the pre-trained image captioning model.
Hope you enjoyed reading this article and found it insightful. If you’d like to explore more about computer vision, Computer Vision Projects with OpenCV and Python 3 is highly recommended. With a detailed, project-based approach, the book demonstrates techniques for leveraging the power of Python, OpenCV, and TensorFlow to solve problems in computer vision, and it also shows how to build an application that can estimate human poses within images.