Multimodal entailment
Author: Sayak Paul
Date created: 2021/08/08
Last modified: 2025/01/03
Description: Training a multimodal model for predicting entailment.
Introduction
In this example, we will build and train a model for predicting multimodal entailment. We will be using the multimodal entailment dataset recently introduced by Google Research.
What is multimodal entailment?
On social media platforms, to audit and moderate content, we may want to find answers to the following questions in near real time:
Does a given piece of information contradict the other?
Does a given piece of information imply the other?
In NLP, this task is called analyzing textual entailment. However, that's only when the information comes from text content. In practice, it's often the case that the available information comes not just from text content, but from a multimodal combination of text, images, audio, video, etc. Multimodal entailment is simply the extension of textual entailment to a variety of new input modalities.
Requirements
This example requires TensorFlow 2.5 or higher. In addition, TensorFlow Hub and TensorFlow Text are required for the BERT model (Devlin et al.). These libraries can be installed using the following command:
Imports
Define a label map
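A label map of this kind assigns an integer index to each of the three class names used in this example. The snippet below is only a sketch: the particular index given to each class is an assumption made for illustration.

```python
# Map each class name to an integer index.
# NOTE: the specific index assigned to each class is an assumption.
label_map = {"Contradictory": 0, "Implies": 1, "NoEntailment": 2}

# Inverse mapping, handy for turning predicted indices back into names.
index_to_label = {index: name for name, index in label_map.items()}
```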
Collect the dataset
The original dataset is available here. It comes with URLs of images which are hosted on Twitter's photo storage system called the Photo Blob Storage (PBS for short). We will be working with the downloaded images along with additional data that comes with the original dataset. Thanks to Nilabhra Roy Chowdhury who worked on preparing the image data.
Read the dataset and apply basic preprocessing
The columns we are interested in are the following:
text_1
image_1
text_2
image_2
label
The entailment task is formulated as the following:
Given the pairs of (text_1, image_1) and (text_2, image_2), do they entail (or not entail or contradict) each other?
We have the images already downloaded. image_1 is downloaded with id1 as its filename, and image_2 is downloaded with id2 as its filename. In the next step, we will add two more columns to df: the filepaths of the image_1 and image_2 images.
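One way to derive such filepaths is a small helper that joins a base directory, the image id, and a file extension. This is a minimal sketch under stated assumptions: the helper name `build_image_path`, the directory name, the sample id, and the sample URL are all hypothetical, and taking the extension from the original image URL is an assumption as well.

```python
import os


def build_image_path(base_dir, image_id, image_url):
    """Return the local path of an image that was saved under its id.

    Assumption: the file extension matches the one in the original URL.
    """
    extension = image_url.split(".")[-1]
    return os.path.join(base_dir, f"{image_id}.{extension}")


# Hypothetical values, for illustration only.
base_dir = "tweet_images"
path = build_image_path(base_dir, 12345, "https://example.com/media/abc.jpg")
print(path)
```

With a pandas dataframe, a helper like this could be applied row-wise via `DataFrame.apply(..., axis=1)` to fill each of the two new columns.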
Dataset visualization
Train/test split
The dataset suffers from a class imbalance problem. We can confirm that in the following cell.
To account for that we will go for a stratified split.
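In practice, scikit-learn's `train_test_split` with its `stratify` argument is the usual tool for this. To show what stratification actually does, here is a dependency-free sketch; the function name `stratified_split` and the toy label counts are made up for illustration and are not the real dataset statistics.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.1, seed=42):
    """Split indices so every class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one test sample per class.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with a deliberate imbalance (NOT the real dataset counts).
labels = ["NoEntailment"] * 80 + ["Implies"] * 15 + ["Contradictory"] * 5
print(Counter(labels))  # inspect the class distribution

train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(Counter(labels[i] for i in test_idx))
```

Because the split is made per class, the minority classes are guaranteed to appear in both partitions, which a purely random split cannot promise.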
Data input pipeline
KerasHub provides a variety of models from the BERT family. Each of those models comes with a corresponding preprocessing layer. You can learn more about these models and their preprocessing layers from this resource.
To keep the runtime of this example relatively short, we will use a base_uncased variant of the original BERT model.
Text preprocessing using KerasHub
Run the preprocessor on a sample input
We will now create tf.data.Dataset objects from the dataframes.
Note that the text inputs will be preprocessed as a part of the data input pipeline. But the preprocessing modules can also be a part of their corresponding BERT models. This helps reduce the training/serving skew and lets our models operate with raw text inputs. Follow this tutorial to learn more about how to incorporate the preprocessing modules directly inside the models.
Preprocessing utilities
Create the final datasets (method adapted from the PyDataset docstring)
Create train, validation and test datasets
Model building utilities
Our final model will accept two images along with their text counterparts. The images are fed directly to the model, while the text inputs are preprocessed first and then passed to the model. Below is a visual illustration of this approach:
The model consists of the following elements:
A standalone encoder for the images. We will use a ResNet50V2 pre-trained on the ImageNet-1k dataset for this.
A standalone encoder for the texts. A pre-trained BERT will be used for this.
After extracting the individual embeddings, they will be projected into an identical space. Finally, their projections will be concatenated and fed to the final classification layer.
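The data flow just described can be sketched with plain NumPy to make the shapes concrete. This is not the model built in this example: random arrays stand in for the encoder outputs, a single random linear map stands in for each learned projection, and the batch size and projection dimension are assumed values. Only one embedding per modality is shown for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, projection_dim, num_classes = 4, 256, 3  # assumed values

# Stand-ins for encoder outputs: pooled ResNet50V2 features are 2048-d,
# pooled BERT-base features are 768-d.
vision_embeddings = rng.normal(size=(batch_size, 2048))
text_embeddings = rng.normal(size=(batch_size, 768))

# Random matrices stand in for the learned projection weights.
vision_weights = rng.normal(size=(2048, projection_dim))
text_weights = rng.normal(size=(768, projection_dim))

# Project both modalities into a space of the same dimensionality.
vision_projections = vision_embeddings @ vision_weights
text_projections = text_embeddings @ text_weights

# Concatenate along the feature axis and apply a linear classification head.
concatenated = np.concatenate([vision_projections, text_projections], axis=-1)
classifier_weights = rng.normal(size=(2 * projection_dim, num_classes))
logits = concatenated @ classifier_weights

print(concatenated.shape, logits.shape)
```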
This is a multi-class classification problem involving the following classes:
NoEntailment
Implies
Contradictory
The project_embeddings(), create_vision_encoder(), and create_text_encoder() utilities are adapted from this example.
Projection utilities
Vision encoder utilities
Text encoder utilities
Multimodal model utilities
You can inspect the structure of the individual encoders as well by setting the expand_nested argument of plot_model() to True. You are encouraged to play with the different hyperparameters involved in building this model and observe how the final performance is affected.
Compile and train the model
Evaluate the model
Additional notes regarding training
Incorporating regularization:
The training logs suggest that the model is starting to overfit and may have benefitted from regularization. Dropout (Srivastava et al.) is a simple yet powerful regularization technique that we can use in our model. But how should we apply it here?
We could always introduce Dropout (keras.layers.Dropout) in between different layers of the model. But here is another recipe. Our model expects inputs from two different data modalities. What if either of the modalities is not present during inference? To account for this, we can introduce Dropout to the individual projections just before they get concatenated:
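As a framework-agnostic illustration of this recipe, the NumPy sketch below applies inverted dropout to each projection separately and then concatenates them. The arrays are placeholders and the rate of 0.2 is an assumed value; inside the Keras model, the same effect would come from applying a keras.layers.Dropout layer to each projection before the concatenation.

```python
import numpy as np


def dropout(x, rate, rng):
    """Inverted dropout: zero a random fraction `rate` of units, rescale the rest."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)

# Placeholder projections (all ones so the effect of dropout is easy to see).
vision_projections = np.ones((2, 8))
text_projections = np.ones((2, 8))

rate = 0.2  # assumed value
vision_dropped = dropout(vision_projections, rate, rng)
text_dropped = dropout(text_projections, rate, rng)

# Concatenate only after each modality has been regularized on its own.
concatenated = np.concatenate([vision_dropped, text_dropped], axis=-1)
print(concatenated)
```

Dropout is active only during training; at inference time the projections pass through unchanged.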
Attending to what matters:
Do all parts of the images correspond equally to their textual counterparts? It's likely not the case. To make our model focus only on the most important bits of the images that relate well to their corresponding textual parts, we can use "cross-attention":
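The core computation behind cross-attention can be sketched in NumPy as scaled dot-product attention. This is the generic form, not necessarily the variant used in the linked notebook; treating text tokens as queries and image regions as keys and values, as well as all tensor sizes, are assumptions for illustration. In Keras, layers such as keras.layers.Attention and keras.layers.MultiHeadAttention provide this kind of operation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values."""
    scale = np.sqrt(queries.shape[-1])
    scores = queries @ np.swapaxes(keys, -1, -2) / scale
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(1, 5, 16))    # (batch, tokens, dim), toy sizes
image_regions = rng.normal(size=(1, 9, 16))  # (batch, regions, dim), toy sizes

# Text tokens query the image regions (an assumed arrangement).
attended, weights = cross_attention(text_tokens, image_regions, image_regions)
print(attended.shape, weights.shape)
```

Each row of `weights` sums to one and indicates how strongly a given text token draws on each image region.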
To see this in action, refer to this notebook.
Handling class imbalance:
The dataset suffers from class imbalance. Investigating the confusion matrix of the above model reveals that it performs poorly on the minority classes. If we had used a weighted loss then the training would have been more guided. You can check out this notebook that takes class-imbalance into account during model training.
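One common way to obtain weights for a weighted loss is the "balanced" heuristic, which weights each class by n_samples / (n_classes * class_count). The sketch below uses it on toy labels; the heuristic is not necessarily what the linked notebook uses, and the label counts are made up rather than taken from the real dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {
        label: n_samples / (n_classes * count) for label, count in counts.items()
    }


# Toy labels with a deliberate imbalance (NOT the real dataset counts).
labels = ["NoEntailment"] * 80 + ["Implies"] * 15 + ["Contradictory"] * 5
class_weights = balanced_class_weights(labels)
print(class_weights)
```

Rarer classes receive larger weights, so mistakes on them contribute more to the loss. Keras's `Model.fit` accepts such weights through its `class_weight` argument, keyed by integer class index rather than by class name.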
Using only text inputs:
Also, what if we had only incorporated text inputs for the entailment task? Because of the nature of the text inputs encountered on social media platforms, text inputs alone would have hurt the final performance. Under a similar training setup, by only using text inputs we get to 67.14% top-1 accuracy on the same test set. Refer to this notebook for details.
Finally, here is a table comparing different approaches taken for the entailment task:
Type | Standard Cross-entropy | Loss-weighted Cross-entropy | Focal Loss |
---|---|---|---|
Multimodal | 77.86% | 67.86% | 86.43% |
Only text | 67.14% | 11.43% | 37.86% |
You can check out this repository to learn more about how the experiments were conducted to obtain these numbers.
Final remarks
The architecture we used in this example is too large for the number of data points available for training. It's going to benefit from more data.
We used a smaller variant of the original BERT model. Chances are high that with a larger variant, this performance will be improved. TensorFlow Hub provides a number of different BERT models that you can experiment with.
We kept the pre-trained models frozen. Fine-tuning them on the multimodal entailment task could have resulted in better performance.
We built a simple baseline model for the multimodal entailment task. There are various approaches that have been proposed to tackle the entailment problem. This presentation deck from the Recognizing Multimodal Entailment tutorial provides a comprehensive overview.
You can use the trained model hosted on Hugging Face Hub and try the demo on Hugging Face Spaces.