MultipleChoice Task with Transfer Learning
Author: Md Awsafur Rahman
Date created: 2023/09/14
Last modified: 2025/06/16
Description: Use pre-trained NLP models for the multiple-choice task.
Introduction
In this example, we will demonstrate how to perform the MultipleChoice task by fine-tuning a pre-trained DebertaV3 model. In this task, unlike question answering, several candidate answers are provided along with a context, and the model is trained to select the correct answer. We will use the SWAG dataset to demonstrate this example.
Setup
Dataset
In this example we'll use the SWAG dataset for the multiple-choice task.
Configuration
Reproducibility
Sets the random seed to produce the same results on each run.
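For instance, a minimal sketch (assuming a SEED value of 42) using Keras 3's keras.utils.set_random_seed helper:

```python
import keras

SEED = 42  # assumed seed value
keras.utils.set_random_seed(SEED)  # seeds Python, NumPy and the backend RNGs in one call
```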
Meta Data
train.csv - will be used for training.
* sent1 and sent2: these fields show how a sentence starts, and if you put the two together, you get the startphrase field.
* ending_<i>: suggests a possible ending for how a sentence can end, but only one of them is correct.
* label: identifies the correct sentence ending.

val.csv - similar to train.csv but will be used for validation.
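To make these files concrete, a small hypothetical loading snippet (the local file paths are assumptions):

```python
import pandas as pd

# Assumed local paths to the SWAG metadata files.
train_df = pd.read_csv("swag/train.csv")
valid_df = pd.read_csv("swag/val.csv")
print(train_df.shape, valid_df.shape)  # quick sanity check of the row counts
```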
Contextualize Options
Our approach entails furnishing the model with question and answer pairs, as opposed to employing a single question for all five options. In practice, this means that for the five options, we will supply the model with the same question combined with each respective answer choice (e.g. (Q + A), (Q + B), and so on). This analogy draws parallels to the practice of revisiting a question multiple times during an exam to promote a deeper understanding of the problem at hand.
Notably, in the context of the SWAG dataset, the question is the start of a sentence and the options are possible endings of that sentence.
Apply the make_options function to each row of the dataframe.
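A hypothetical make_options sketch is shown below; it assumes the SWAG column names described above and simply concatenates the start phrase with each candidate ending (the notebook's actual implementation may differ):

```python
import pandas as pd

def make_options(row):
    # sent1/sent2 together form the start phrase; the ending<i> columns are candidates.
    ending_cols = [c for c in row.index if c.startswith("ending")]
    row["options"] = [f"{row.sent1} {row.sent2} {row[c]}" for c in ending_cols]
    return row

# Applied row-wise, e.g.: train_df = train_df.apply(make_options, axis=1)
```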
Preprocessing
What it does: The preprocessor takes input strings and transforms them into a dictionary (token_ids, padding_mask) containing preprocessed tensors. This process starts with tokenization, where input strings are converted into sequences of token IDs.
Why it's important: Initially, raw text data is complex and challenging for modeling due to its high dimensionality. By converting text into a compact set of tokens, such as transforming "The quick brown fox" into ["the", "qu", "##ick", "br", "##own", "fox"], we simplify the data. Many models rely on special tokens and additional tensors to understand input. These tokens help divide input and identify padding, among other tasks. Making all sequences the same length through padding boosts computational efficiency, making subsequent steps smoother.
Explore the following pages to access the available preprocessing and tokenizer layers in KerasHub:
Now, let's examine what the output shape of the preprocessing layer looks like.
We'll use the preprocessing_fn function to transform each text option using the dataset.map(preprocessing_fn) method.
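As a rough illustration (not the notebook's exact code), a preprocessing_fn could wrap a KerasHub DebertaV3 preprocessor; the class and preset names below are assumptions and may differ between KerasHub releases:

```python
import keras_hub

# Assumed class and preset names; check your KerasHub version for the exact spelling.
preprocessor = keras_hub.models.DebertaV3TextClassifierPreprocessor.from_preset(
    "deberta_v3_extra_small_en",
    sequence_length=128,  # assumed maximum sequence length
)

def preprocessing_fn(text, label):
    # `text` holds all (question + option) strings of one sample; the preprocessor
    # tokenizes each string, producing token_ids and padding_mask tensors.
    return preprocessor(text), label

# Applied lazily inside the pipeline, e.g. dataset.map(preprocessing_fn)
```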
Augmentation
In this notebook, we'll experiment with an interesting augmentation technique, option_shuffle. Since we're providing the model with one option at a time, we can introduce a shuffle to the order of options. For instance, options [A, C, E, D, B] would be rearranged as [D, B, A, E, C]. This practice will help the model focus on the content of the options themselves, rather than being influenced by their positions.
Note: Even though the option_shuffle function is written in pure TensorFlow, it can be used with any backend (e.g. JAX, PyTorch) as it is only used in the tf.data.Dataset pipeline, which is compatible with Keras 3 routines.
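A minimal pure-TensorFlow sketch of the option_shuffle idea, shuffling one sample's options and remapping its label accordingly:

```python
import tensorflow as tf

def option_shuffle(options, labels, prob=0.50, seed=None):
    # Keep the original order for a fraction of the samples.
    if tf.random.uniform([], seed=seed) > prob:
        return options, labels
    # Permute the option indices and reorder the option strings accordingly.
    indices = tf.random.shuffle(tf.range(tf.shape(options)[0]), seed=seed)
    options = tf.gather(options, indices)
    # The new label is the position that the originally correct option moved to.
    new_label = tf.argmax(tf.cast(indices == tf.cast(labels, indices.dtype), tf.int32))
    return options, tf.cast(new_label, labels.dtype)
```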
In the following function, we'll merge all augmentation functions to apply to the text. These augmentations will be applied to the data using the dataset.map(augment_fn) approach.
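A tiny augment_fn sketch that currently chains only option_shuffle (assuming the sketch above); further text augmentations could be added in the same place:

```python
def augment_fn(text, label):
    # For now the only augmentation is shuffling the option order.
    text, label = option_shuffle(text, label, prob=0.5)
    return text, label
```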
DataLoader
The code below sets up a robust data flow pipeline using tf.data.Dataset for data processing. Notable aspects of tf.data include its ability to simplify pipeline construction and represent components in sequences. To learn more about tf.data, refer to this documentation.
Now let's create the train and validation dataloaders using the above function.
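A simplified build_dataset sketch (the function and variable names are assumptions) showing how shuffling, augmentation, preprocessing, batching and prefetching could be chained with tf.data:

```python
import tensorflow as tf

def build_dataset(texts, labels, batch_size=32, shuffle=False, augment=False):
    AUTO = tf.data.AUTOTUNE
    ds = tf.data.Dataset.from_tensor_slices((texts, labels))
    if shuffle:
        ds = ds.shuffle(1024)                                # shuffle samples
    if augment:
        ds = ds.map(augment_fn, num_parallel_calls=AUTO)     # option shuffling
    ds = ds.map(preprocessing_fn, num_parallel_calls=AUTO)   # tokenization
    return ds.batch(batch_size, drop_remainder=True).prefetch(AUTO)

# train_ds = build_dataset(train_texts, train_labels, shuffle=True, augment=True)
# valid_ds = build_dataset(valid_texts, valid_labels)
```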
LR Schedule
Implementing a learning rate scheduler is crucial for transfer learning. The learning rate starts at lr_start and gradually tapers down to lr_min along a cosine curve.
Importance: A well-structured learning rate schedule is essential for efficient model training, ensuring optimal convergence and avoiding issues such as overshooting or stagnation.
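For illustration, a minimal cosine schedule sketch (the lr_start, lr_min and epoch values are placeholder assumptions):

```python
import math
import keras

def get_lr_callback(lr_start=1e-5, lr_min=1e-6, epochs=5):
    def lr_fn(epoch, lr=None):
        # Cosine taper from lr_start down to lr_min over the training epochs.
        progress = epoch / max(epochs - 1, 1)
        return lr_min + 0.5 * (lr_start - lr_min) * (1 + math.cos(math.pi * progress))
    return keras.callbacks.LearningRateScheduler(lr_fn, verbose=1)
```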
Callbacks
The function below will gather all the training callbacks, such as lr_scheduler and model_checkpoint.
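A small sketch of such a callback collector; the checkpoint path and monitored metric are assumptions:

```python
import keras

def get_callbacks():
    lr_scheduler = get_lr_callback()  # cosine schedule from the previous section
    model_checkpoint = keras.callbacks.ModelCheckpoint(
        "best_model.keras",      # assumed checkpoint path
        monitor="val_accuracy",  # assumed metric to track
        save_best_only=True,
    )
    return [lr_scheduler, model_checkpoint]
```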
MultipleChoice Model
Pre-trained Models
The KerasHub library provides comprehensive, ready-to-use implementations of popular NLP model architectures. It features a variety of pre-trained models including Bert, Roberta, DebertaV3, and more. In this notebook, we'll showcase the usage of DebertaV3. However, feel free to explore all available models in the KerasHub documentation. Also, for a deeper understanding of KerasHub, refer to the informative getting started guide.
Our approach involves using keras_hub.models.XXClassifier to process each question and option pair (e.g. (Q+A), (Q+B), etc.), generating logits. These logits are then combined and passed through a softmax function to produce the final output.
Classifier for Multiple-Choice Tasks
When dealing with multiple-choice questions, instead of giving the model the question and all options together (Q + A + B + C ...), we provide the model with one option at a time along with the question. For instance, (Q + A), (Q + B), and so on. Once we have the prediction scores (logits) for all options, we combine them using the Softmax function to get the ultimate result. If we had given all options at once to the model, the text's length would increase, making it harder for the model to handle. The picture below illustrates this idea:
From a coding perspective, remember that we use the same model for all five options, with shared weights. Despite the figure suggesting five separate models, they are, in fact, one model with shared weights. Another point to consider is the input shapes of the Classifier and the MultipleChoice model.
Input shape for Multiple Choice: (batch_size, num_choices, seq_length)
Input shape for Classifier: (batch_size, seq_length)
Certainly, it's clear that we can't directly give the data for the multiple-choice task to the model because the input shapes don't match. To handle this, we'll use slicing. This means we'll separate the features of each option, like token_ids and padding_mask, and give them one by one to the NLP classifier. After we get the prediction scores (logits) for all the options, we'll use the Softmax function to combine them. This final step helps us make the ultimate decision or choice.
Note that in the classifier, we set num_classes=1 instead of 5. This is because the classifier produces a single output for each option. When dealing with five options, these individual outputs are joined together and then processed through a softmax function to generate the final result, which has a dimension of 5.
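Putting the pieces together, here is a condensed sketch of the multiple-choice wrapper described above: slice out each option, score it with one shared classifier (num_classes=1), and apply softmax across the per-option scores. Class, preset, and hyperparameter choices below are assumptions following the keras_hub.models.XXClassifier pattern, not the notebook's exact code:

```python
import keras
import keras_hub

NUM_CHOICES = 4   # candidate endings per question in SWAG (ending0..ending3)
SEQ_LENGTH = 128  # assumed maximum sequence length

# One shared classifier scoring each (question + option) pair with a single logit.
classifier = keras_hub.models.DebertaV3Classifier.from_preset(
    "deberta_v3_extra_small_en",  # assumed preset name
    preprocessor=None,            # inputs arrive already tokenized from tf.data
    num_classes=1,
)

inputs = {
    "token_ids": keras.Input(shape=(NUM_CHOICES, SEQ_LENGTH), dtype="int32"),
    "padding_mask": keras.Input(shape=(NUM_CHOICES, SEQ_LENGTH), dtype="int32"),
}

logits = []
for i in range(NUM_CHOICES):
    # Slice option i: (batch, num_choices, seq_len) -> (batch, seq_len).
    option = {
        "token_ids": inputs["token_ids"][:, i, :],
        "padding_mask": inputs["padding_mask"][:, i, :],
    }
    logits.append(classifier(option))  # the same weights are reused for every option

# Join the per-option scores and turn them into a probability distribution.
logits = keras.layers.Concatenate(axis=-1)(logits)
outputs = keras.layers.Softmax(axis=-1)(logits)
model = keras.Model(inputs, outputs)

model.compile(
    optimizer=keras.optimizers.AdamW(5e-6),
    # Assuming integer labels (0..NUM_CHOICES-1) straight from the CSV.
    loss=keras.losses.SparseCategoricalCrossentropy(),
    metrics=["accuracy"],
)
```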
Let's check out the model summary to get a better insight into the model.
Finally, let's visually check the model structure to see if everything is in place.