Assignment 3: Question Answering
Welcome to this week's assignment of Course 4. In this assignment you will explore question answering. You will implement the "Text-to-Text Transfer Transformer" (better known as T5). Since you implemented transformers from scratch last week, you will now be able to use them.
Outline
Overview
This assignment will be different from the two previous ones. Due to memory and time constraints of this environment, you will not be able to train a model and use it for inference. Instead, you will create the necessary building blocks for the transformer encoder model and will use a pretrained version of the same model in two ungraded labs after this assignment.
After completing these 3 (1 graded and 2 ungraded) labs you will:
Implement the code necessary for Bidirectional Encoder Representations from Transformers (BERT).
Understand how the C4 dataset is structured.
Use a pretrained model for inference.
Understand how the "Text to Text Transfer from Transformers" or T5 model works.
Part 1: C4 Dataset
C4 is a huge dataset. For the purpose of this assignment you will use a few examples out of it, which are present in `data.txt`. C4 is based on the Common Crawl project. Feel free to read more on their website.
Run the cell below to see what the examples look like.
Notice the `b` before each string? This means that the data comes as bytes rather than strings. Strings are actually lists of bytes, so for the rest of the assignment the name `strings` will be used to describe the data.
To check this run the following cell:
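For instance, here is a minimal, self-contained sketch of such a check in plain Python; the byte string below is illustrative, not an actual entry from `data.txt`:

```python
# Illustrative byte string; the real examples come from data.txt.
example = b"Thank you for inviting me to your party last week."

print(type(example))            # <class 'bytes'>
print(example.decode("utf-8"))  # decode the bytes into a regular Python string
```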
1.1 Pre-Training Objective
Note: The word "mask" will be used throughout this assignment in the context of hiding/removing word(s).
You will be implementing the BERT loss as shown in the following image.
Assume you have the following text: Thank you for inviting me to your party last week
Now as input you will mask the words in red in the text:
Input: Thank you X me to your party Y week.
Output: The model should predict the word(s) for X and Y.
Z is used to represent the end.
1.2.1 Decode to natural language
The following functions will help you `detokenize` and `tokenize` the text data.

The `sentencepiece` vocabulary was used to convert from text to ids. This vocabulary file is loaded and used in these helper functions.

`natural_language_texts` has the text from the examples we gave you.
Run the cells below to see what is going on.
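If you are curious how such helpers can be built, here is a hedged sketch using the `sentencepiece` library directly. The file name `sentencepiece.model` and the helper names are assumptions for illustration; the notebook loads its own vocabulary file.

```python
import sentencepiece as spm

# Hypothetical vocab file name; the notebook provides its own sentencepiece model.
sp = spm.SentencePieceProcessor(model_file="sentencepiece.model")

def tokenize(text):
    # Text (str or bytes) -> list of integer ids.
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    return sp.encode(text, out_type=int)

def detokenize(token_ids):
    # List of integer ids -> text.
    return sp.decode(token_ids)

print(tokenize("Thank you for inviting me to your party last week."))
```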
As you can see above, you were able to take a piece of text and tokenize it.
Now you will create `input` and `target` pairs that will allow you to train your model. T5 uses the ids at the end of the vocab file as sentinels. For example, it will replace:

- `vocab_size - 1` by `<Z>`
- `vocab_size - 2` by `<Y>`
- and so forth.

Each sentinel is assigned a character (`<Z>`, `<Y>`, and so on).
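As a small illustration of that mapping, here is a sketch that pairs the last ids of the vocabulary with sentinel markers; the `vocab_size` value is made up, since the notebook reads it from the actual vocab file.

```python
import string

vocab_size = 32000  # illustrative value only

# Map the last ids of the vocabulary to sentinel markers <Z>, <Y>, <X>, ...
sentinels = {}
for i, char in enumerate(reversed(string.ascii_letters), 1):
    sentinels[vocab_size - i] = f"<{char}>"

print(sentinels[vocab_size - 1])  # <Z>
print(sentinels[vocab_size - 2])  # <Y>
```

Note that `string.ascii_letters` has 52 characters, which is where the 52 sentinels mentioned below come from.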
The `pretty_decode` function below, which you will use in a bit, helps in handling the type when decoding. Take a look and try to understand what the function is doing.
Notice that:
NOTE: Targets may have more than the 52 sentinels we replace, but this is just to give you an idea of things.
The functions above make your `inputs` and `targets` more readable. For example, you might see something like this once you implement the masking function below.
Input sentence: Younes and Lukasz were working together in the lab yesterday after lunch.
Input: Younes and Lukasz Z together in the Y yesterday after lunch.
Target: Z were working Y lab.
Expected Output:
You will now use the inputs and the targets from the `tokenize_and_mask` function you implemented above. Take a look at the masked sentence using your `inps` and `targs` from the sentence above.
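If you want a feel for the masking step without the tokenizer, here is a toy, word-level sketch of the same idea. It is not the graded `tokenize_and_mask` function, which operates on sentencepiece token ids rather than whole words.

```python
import random

def toy_mask(sentence, noise=0.3, seed=0):
    # Word-level illustration of span masking with sentinels; NOT the graded function.
    random.seed(seed)
    sentinel_letters = "ZYXWVUTSRQPONMLKJIHGFEDCBA"
    inputs, targets = [], []
    cur = -1             # index of the current sentinel
    prev_masked = False
    for word in sentence.split():
        if random.random() < noise:
            if not prev_masked:          # start a new masked span
                cur += 1
                sentinel = f"<{sentinel_letters[cur]}>"
                inputs.append(sentinel)
                targets.append(sentinel)
            targets.append(word)         # the target keeps the hidden word
            prev_masked = True
        else:
            inputs.append(word)          # the input keeps the visible word
            prev_masked = False
    return " ".join(inputs), " ".join(targets)

inp, targ = toy_mask("Younes and Lukasz were working together in the lab yesterday after lunch.")
print("Input: ", inp)
print("Target:", targ)
```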
Part 2: Transformer
We now load a Transformer model checkpoint that has been pre-trained using the C4 dataset above and decode from it. This will save you a lot of time rather than having to train your model yourself. Later in this notebook, we will show you how to fine-tune your model.
Start by loading in the model. We copy the checkpoint to a local directory for speed; otherwise initialization takes a very long time. Last week you implemented the decoder part of the transformer. Now you will implement the encoder part. Concretely, you will implement the following.
2.1 Transformer Encoder
You will now implement the transformer encoder. Concretely, you will implement two functions. The first function is `FeedForwardBlock`.
2.1.1 The Feedforward Block
The `FeedForwardBlock` function is an important one, so you will start by implementing it. To do so, you need to return a list of the following:

- `tl.LayerNorm()` = layer normalization.
- `tl.Dense(d_ff)` = fully connected layer.
- `activation` = activation relu, tanh, sigmoid etc.
- `dropout_middle` = we gave you this function (don't worry about its implementation).
- `tl.Dense(d_model)` = fully connected layer with same dimension as the model.
- `dropout_final` = we gave you this function (don't worry about its implementation).
You can always take a look at the trax documentation if needed.
Instructions: Implement the feedforward part of the transformer. You will be returning a list.
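For reference, here is a minimal sketch of what such a block can look like in trax. The argument names mirror the list above, and the two dropout layers are stand-ins for the `dropout_middle` and `dropout_final` layers the notebook gives you.

```python
import trax.layers as tl

def FeedForwardBlock(d_model, d_ff, dropout, dropout_shared_axes, mode, activation):
    # Stand-ins for the dropout layers provided in the notebook.
    dropout_middle = tl.Dropout(rate=dropout, shared_axes=dropout_shared_axes, mode=mode)
    dropout_final = tl.Dropout(rate=dropout, shared_axes=dropout_shared_axes, mode=mode)

    # Return the sublayers as a list, in the order described above.
    return [
        tl.LayerNorm(),     # layer normalization
        tl.Dense(d_ff),     # fully connected layer with the feed-forward dimension
        activation(),       # e.g. tl.Relu
        dropout_middle,
        tl.Dense(d_model),  # fully connected layer back to the model dimension
        dropout_final,
    ]
```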
Expected Output:
2.1.2 The Encoder Block
The encoder block will use the `FeedForwardBlock`.

You will have to build two residual connections. Inside the first residual connection you will have the `tl.LayerNorm()`, `attention`, and `dropout_` layers. The second residual connection will have the `feed_forward`.

You will also need to implement the `feed_forward`, `attention`, and `dropout_` blocks.
So far you haven't seen the `tl.Attention()` and `tl.Residual()` layers, so you can check the docs by clicking on them.
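Here is a sketch of how those pieces can fit together, assuming the `FeedForwardBlock` from the previous section; treat it as an outline rather than the exact graded solution.

```python
import trax.layers as tl

def EncoderBlock(d_model, d_ff, n_heads, dropout, dropout_shared_axes, mode, ff_activation):
    # Self-attention over the input tokens.
    attention = tl.Attention(d_feature=d_model, n_heads=n_heads, dropout=dropout, mode=mode)

    # Feed-forward sub-block from the previous section.
    feed_forward = FeedForwardBlock(d_model, d_ff, dropout, dropout_shared_axes, mode, ff_activation)

    dropout_ = tl.Dropout(rate=dropout, shared_axes=dropout_shared_axes, mode=mode)

    return [
        tl.Residual(          # first residual connection
            tl.LayerNorm(),
            attention,
            dropout_,
        ),
        tl.Residual(          # second residual connection
            feed_forward,
        ),
    ]
```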
Expected Output:
2.1.3 The Transformer Encoder
Now that you have implemented the `EncoderBlock`, it is time to build the full encoder. BERT, or Bidirectional Encoder Representations from Transformers, is one such encoder.
You will implement its core code in the function below by using the functions you have coded so far.
The model takes in many hyperparameters, such as the `vocab_size`, the number of classes, the dimension of your model, etc. You want to build a generic function that will take in many parameters, so you can use it later. At the end of the day, anyone can just load in an API and call a transformer, but we think it is important to make sure you understand how it is built. Let's get started.
Instructions: For this encoder you will need a `positional_encoder` first (which is already provided), followed by `n_layers` encoder blocks, which are the same encoder blocks you previously built. Once you store the `n_layers` `EncoderBlock`s in a list, you are going to wrap them in a `tl.Serial` layer with the following sublayers:
- `tl.Branch`: helps with the branching and has the following sublayers:
  - `positional_encoder`.
  - `tl.PaddingMask()`: layer that maps integer sequences to padding masks.
- Your list of `EncoderBlock`s.
- `tl.Select([0], n_in=2)`: copies, reorders, or deletes stack elements according to indices.
- `tl.Mean()`: mean along the first axis.
- `tl.Dense()` with `n_units` set to `n_classes`.
- `tl.LogSoftmax()`
Please refer to the trax documentation for further information.
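Putting the pieces together, here is a sketch of the full encoder along the lines described above. The hyperparameter defaults are illustrative, and the `positional_encoder` shown is a stand-in for the one the notebook provides.

```python
import trax.layers as tl

def TransformerEncoder(vocab_size=32000, n_classes=10, d_model=512, d_ff=2048,
                       n_layers=6, n_heads=8, dropout=0.1,
                       dropout_shared_axes=None, max_len=2048,
                       mode='train', ff_activation=tl.Relu):
    # Stand-in for the positional encoder provided in the notebook.
    positional_encoder = [
        tl.Embedding(vocab_size, d_model),
        tl.Dropout(rate=dropout, shared_axes=dropout_shared_axes, mode=mode),
        tl.PositionalEncoding(max_len=max_len),
    ]

    # n_layers copies of the encoder block built earlier.
    encoder_blocks = [
        EncoderBlock(d_model, d_ff, n_heads, dropout,
                     dropout_shared_axes, mode, ff_activation)
        for _ in range(n_layers)
    ]

    return tl.Serial(
        tl.Branch(positional_encoder, tl.PaddingMask()),  # (embeddings, padding mask)
        encoder_blocks,
        tl.Select([0], n_in=2),  # keep the activations, drop the mask
        tl.Mean(axis=1),         # average over the sequence positions
        tl.Dense(n_classes),     # n_units set to n_classes
        tl.LogSoftmax(),
    )
```

Calling `TransformerEncoder()` returns a `tl.Serial` model whose structure you can print and compare against the expected output.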
Expected Output:
NOTE: Congratulations! You have completed all of the graded functions of this assignment. Since the rest of the assignment takes a lot of time and memory to run, we are providing some extra ungraded labs for you to see this model in action.
Keep it up!
To see this model in action, continue to the next 2 ungraded labs. We strongly recommend you try the Colab versions of them, as they will yield a much smoother experience. The links to the Colabs can be found within the ungraded labs, or if you already know how to open files within Colab, here are some shortcuts (if not, head to the ungraded labs, which contain some extra instructions):