Assignment 2: Transformer Summarizer
Welcome to the second assignment of course 4. In this assignment you will explore summarization using the transformer model. Yes, you will implement the transformer decoder from scratch, but we will slowly walk you through it. There are many hints in this notebook so feel free to use them as needed.
Outline
Introduction
Summarization is an important task in natural language processing and can be useful in both consumer and enterprise settings. For example, bots can scrape articles and summarize them, and you can then run sentiment analysis on those summaries to gauge sentiment about certain stocks. Besides, who wants to read a whole article or a long email when a transformer can summarize it for you? Let's get started. By completing this assignment you will learn to:
Use built-in functions to preprocess your data
Implement DotProductAttention
Implement Causal Attention
Understand how attention works
Build the transformer model
Evaluate your model
Summarize an article
As you can tell, this model is slightly different from the ones you have already implemented. It relies heavily on attention and does not process tokens sequentially, which allows for parallel computation.
Trax makes it easy to work with TensorFlow datasets:
1.1 Tokenize & Detokenize helper functions
Just like in the previous assignment, the cell above loads in the encoder for you. Given any data set, you have to be able to map words to their indices, and indices to their words. The inputs and outputs to your Trax models are usually tensors of numbers where each number corresponds to a word. If you were to process your data manually, you would have to make use of the following:
word2Ind: a dictionary mapping the word to its index.
ind2Word: a dictionary mapping the index to its word.
word2Count: a dictionary mapping the word to the number of times it appears.
num_words: total number of words that have appeared.
Since you have already implemented these in previous assignments of the specialization, we will provide you with helper functions that will do this for you. Run the cell below to get the following functions:
tokenize: converts a text sentence to its corresponding token list (i.e. list of indices). Also converts words to subwords.
detokenize: converts a token list to its corresponding sentence (i.e. string).
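For example, a quick round trip with these helpers (assuming, as described above, that tokenize takes a string and detokenize takes a token list) might look like this; the output shown in the comments is hypothetical and depends on the subword vocabulary the notebook loads:
```python
sentence = 'Transformers are great for summarization.'
tokens = tokenize(sentence)       # a list of subword indices, e.g. [1234, 567, ...]
print(tokens)
print(detokenize(tokens))         # should print back the original sentence
```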
1.2 Preprocessing for Language Models: Concatenate It!
This week you will use a language model -- Transformer Decoder -- to solve an input-output problem. As you know, language models only predict the next word, they have no notion of inputs. To create a single input suitable for a language model, we concatenate inputs with targets putting a separator in between. We also need to create a mask -- with 0s at inputs and 1s at targets -- so that the model is not penalized for mis-predicting the article and only focuses on the summary. See the preprocess function below for how this is done.
Things to notice:
First we see the corresponding values of the words.
The first 1 represents the <EOS> tag of the article.
It is followed by a 0, which represents a <pad> tag.
After the first 0 (<pad> tag), the values correspond to the words used for the summary of the article.
The second 1 represents the <EOS> tag for the summary.
All the trailing 0s represent <pad> tags, which are appended to maintain a consistent length (if you don't see them, the example is already at max length).
You can see that the data has the following structure:
[Article] -> <EOS> -> <pad> -> [Article Summary] -> <EOS> -> (possibly) multiple <pad>
The loss is taken only on the summary, using cross_entropy as the loss function.
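To make that structure concrete, here is a hypothetical, simplified illustration of the concatenation and mask described above. The token ids are made up and this is not the notebook's actual preprocess function:
```python
EOS, PAD = 1, 0                          # separator conventions described above
article_tokens = [2134, 87, 541]         # made-up token ids for the article
summary_tokens = [310, 78]               # made-up token ids for the summary

# [Article] -> <EOS> -> <pad> -> [Article Summary] -> <EOS>
joint = article_tokens + [EOS, PAD] + summary_tokens + [EOS]
# 0s over the article part, 1s over the summary part, so the loss
# only counts the summary tokens.
mask = [0] * (len(article_tokens) + 2) + [1] * (len(summary_tokens) + 1)

print(joint)   # [2134, 87, 541, 1, 0, 310, 78, 1]
print(mask)    # [0, 0, 0, 0, 0, 1, 1, 1]
```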
Part 2: Summarization with transformer
Now that we have given you the data generator and have handled the preprocessing for you, it is time for you to build your own model. We saved you some time because we know you have already preprocessed data before in this specialization, so we would rather you spend your time doing the next steps.
You will implement attention from scratch and then use it in your transformer model. Concretely, you will understand how attention works and how to use it to connect the encoder and the decoder.
2.1 Dot product attention
Now you will implement dot product attention which takes in a query, key, value, and a mask. It returns the output.
Here are some helper functions that will help you create tensors and display useful information:
create_tensor: creates a jax numpy array from a list of lists.
display_tensor: prints out the shape and the actual tensor.
Before implementing it yourself, you can play around with a toy example of dot product attention without the softmax operation. Technically it would not be dot product attention without the softmax, but this is done to avoid giving away too much of the answer; the idea is to display these tensors so you get a sense of what they look like.
The formula for attention is this one:
$$\text{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}+M\right) V\tag{1}$$

where $d_{k}$ stands for the dimension of the queries and keys.
The query $Q$, key $K$, value $V$, and mask $M$ vectors are provided for this example.
Notice that the masking is done using very negative values, which yields an effect similar to using $-\infty$.
Expected Output:
Expected Output:
Expected Output:
Expected Output:
In order to use the previous dummy tensors to test some of the graded functions, a batch dimension should be added to them so they mimic the shape of real-life examples. The mask is also replaced by a version of it that resembles the one that is used by trax:
Expected Output:
Exercise 01
Instructions: Implement the dot product attention. Concretely, implement the following equation
$$\text{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}+M\right) V\tag{1}$$

where $Q$ is the query, $K$ the key, $V$ the values, $M$ the mask, and $d_{k}$ the depth/dimension of the queries and keys (used for scaling down).
You can implement this formula with either trax numpy (trax.math.numpy) or regular numpy, but it is recommended to use jnp.
Something to take into consideration is that within trax the masks are tensors of True/False values, not 0s and very negative values as in the previous example. Within the graded function, don't apply the mask by summing matrices; instead use jnp.where() and treat the mask as a tensor of boolean values, with False for the values that need to be masked and True for the ones that don't.
Also take into account that the real tensors are far more complex than the toy ones you just played with. Because of this, avoid shorthand operations such as @ for the dot product or .T for transposing. Use jnp.matmul() and jnp.swapaxes() instead.
This is the self-attention block for the transformer decoder. Good luck!
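For reference, one possible sketch of equation (1) is shown below. This is not the graded solution: it assumes jnp is a numpy-compatible module (the notebook uses trax's numpy; plain jax.numpy behaves the same here) and follows the advice above of using jnp.matmul(), jnp.swapaxes(), and jnp.where() with a boolean mask:
```python
import jax.numpy as jnp   # stand-in for the notebook's jnp (trax numpy)

def DotProductAttention(query, key, value, mask):
    """Scaled dot-product attention (sketch).
    query, key, value: arrays of shape (..., seqlen, depth)
    mask: boolean array broadcastable to (..., seqlen, seqlen); True = keep."""
    depth = query.shape[-1]
    # Q K^T / sqrt(d_k), using matmul/swapaxes instead of @ and .T
    dots = jnp.matmul(query, jnp.swapaxes(key, -1, -2)) / jnp.sqrt(depth)
    # Boolean mask: masked positions get a very negative score
    dots = jnp.where(mask, dots, jnp.full_like(dots, -1e9))
    # Numerically stable softmax over the last axis
    dots = jnp.exp(dots - jnp.max(dots, axis=-1, keepdims=True))
    attention = dots / jnp.sum(dots, axis=-1, keepdims=True)
    return jnp.matmul(attention, value)
```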
Expected Output:
2.2 Causal Attention
Now you are going to implement causal attention: multi-headed attention with a mask to attend only to words that occurred before.
In the image above, a word can see everything that is before it, but not what is after it. To implement causal attention, you will have to transform vectors and do many reshapes. You will need to implement the functions below.
Exercise 02
Implement the following functions that will be needed for Causal Attention:
compute_attention_heads: gets an input of dimension (batch_size, seqlen, n_heads × d_head), splits the last (depth) dimension, and stacks it into the zeroth dimension to allow matrix multiplication: (batch_size × n_heads, seqlen, d_head).
dot_product_self_attention: creates a mask matrix with False values above the diagonal and True values below, and calls DotProductAttention, which implements dot product self attention.
compute_attention_output: undoes compute_attention_heads by splitting the first (vertical) dimension and stacking it into the last (depth) dimension: (batch_size, seqlen, n_heads × d_head). These operations concatenate (stack/merge) the heads.
Next there are some toy tensors which may serve to give you an idea of the data shapes and operations involved in Causal Attention. They are also useful for testing out your functions!
It is important to know that the following three functions would normally be defined within the CausalAttention function further below. However, this makes them harder to test. Because of this, these functions are shown individually, using a closure (when necessary) that simulates them being inside the CausalAttention function. This is done because they rely on some variables that can be accessed from within CausalAttention.
Support Functions
compute_attention_heads: gets an input of dimension (batch_size, seqlen, n_heads × d_head), splits the last (depth) dimension, and stacks it into the zeroth dimension to allow matrix multiplication: (batch_size × n_heads, seqlen, d_head).
For the closures you only have to fill the inner function.
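As an illustration, the closure could look like the sketch below (assuming jnp is in scope as in the earlier cells); the reshape/transpose sequence shown is one standard way to split the heads, not necessarily the graded solution:
```python
import jax.numpy as jnp   # stand-in for the notebook's jnp

def compute_attention_heads_closure(n_heads, d_head):
    """Returns a function mapping (batch_size, seqlen, n_heads * d_head)
    to (batch_size * n_heads, seqlen, d_head)."""
    def compute_attention_heads(x):
        batch_size, seqlen = x.shape[0], x.shape[1]
        # Split the last dimension into (n_heads, d_head)
        x = jnp.reshape(x, (batch_size, seqlen, n_heads, d_head))
        # Move heads next to the batch dimension, then merge them into it
        x = jnp.transpose(x, (0, 2, 1, 3))
        return jnp.reshape(x, (batch_size * n_heads, seqlen, d_head))
    return compute_attention_heads
```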
Expected Output:
dot_product_self_attention: creates a mask matrix with False values above the diagonal and True values below, and calls DotProductAttention, which implements dot product self attention.
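A minimal sketch, assuming the DotProductAttention sketch from earlier (and jnp) is in scope, could build the causal mask with jnp.tril:
```python
def dot_product_self_attention(q, k, v):
    """Masked (causal) self-attention: each position may attend only to
    itself and earlier positions (sketch)."""
    mask_size = q.shape[-2]
    # Lower-triangular boolean mask of shape (1, mask_size, mask_size):
    # True on and below the diagonal, False above it.
    mask = jnp.tril(jnp.ones((1, mask_size, mask_size), dtype=jnp.bool_), k=0)
    return DotProductAttention(q, k, v, mask)
```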
Expected Output:
compute_attention_output: undoes compute_attention_heads by splitting the first (vertical) dimension and stacking it into the last (depth) dimension: (batch_size, seqlen, n_heads × d_head). These operations concatenate (stack/merge) the heads.
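A matching sketch that reverses the head split (again assuming jnp is in scope) could be:
```python
def compute_attention_output_closure(n_heads, d_head):
    """Returns a function mapping (batch_size * n_heads, seqlen, d_head)
    back to (batch_size, seqlen, n_heads * d_head)."""
    def compute_attention_output(x):
        seqlen = x.shape[1]
        # Un-merge the batch and head dimensions
        x = jnp.reshape(x, (-1, n_heads, seqlen, d_head))
        # Move the head dimension back next to d_head, then merge them
        x = jnp.transpose(x, (0, 2, 1, 3))
        return jnp.reshape(x, (-1, seqlen, n_heads * d_head))
    return compute_attention_output
```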
Expected Output:
Causal Attention Function
Now it is time for you to put everything together within the CausalAttention, or Masked multi-head attention, function:
Instructions: Implement the causal attention. Your model returns the causal attention through a tl.Serial with the following:
tl.Branch : consisting of 3 [tl.Dense(d_feature), ComputeAttentionHeads] to account for the queries, keys, and values.
tl.Fn: Takes in the dot_product_self_attention function and uses it to compute the dot product using Q, K, and V.
tl.Fn: Takes in compute_attention_output_closure to allow for parallel computing.
tl.Dense: Final Dense layer, with dimension d_feature.
Remember that in order for trax to properly handle the functions you just defined, they need to be added as layers using the tl.Fn() function.
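Putting the pieces together, a sketch of CausalAttention could look as follows. It assumes the three support functions above are in scope; the layer names passed to tl.Fn() are illustrative:
```python
from trax import layers as tl

def CausalAttention(d_feature, n_heads, mode='train'):
    """Masked multi-head attention layer (sketch)."""
    assert d_feature % n_heads == 0
    d_head = d_feature // n_heads

    ComputeAttentionHeads = tl.Fn(
        'AttnHeads', compute_attention_heads_closure(n_heads, d_head), n_out=1)

    return tl.Serial(
        tl.Branch(                                   # queries, keys, values
            [tl.Dense(d_feature), ComputeAttentionHeads],
            [tl.Dense(d_feature), ComputeAttentionHeads],
            [tl.Dense(d_feature), ComputeAttentionHeads],
        ),
        tl.Fn('DotProductAttn', dot_product_self_attention, n_out=1),
        tl.Fn('AttnOutput',
              compute_attention_output_closure(n_heads, d_head), n_out=1),
        tl.Dense(d_feature),                         # final dense layer
    )
```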
Expected Output:
2.3 Transformer decoder block
Now that you have implemented the causal part of the transformer, you will implement the transformer decoder block. Concretely, you will now be implementing the block shown in this image.
To implement this function, you will have to call the CausalAttention, or Masked multi-head attention, function you implemented above. You will have to add a feedforward block which consists of:
tl.LayerNorm : used to layer normalize
tl.Dense : the dense layer
ff_activation : feed forward activation (we use ReLU here)
tl.Dropout : dropout layer
tl.Dense : dense layer
tl.Dropout : dropout layer
Finally once you implement the feedforward, you can go ahead and implement the entire block using:
tl.Residual : takes in the tl.LayerNorm(), the causal attention block, and tl.Dropout.
tl.Residual : takes in the feedforward block you will implement.
Exercise 03
Instructions: Implement the transformer decoder block. Good luck!
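As a rough guide, the block described above could be assembled as in the sketch below (assuming CausalAttention from the previous exercise is in scope and ff_activation is a layer constructor such as tl.Relu):
```python
from trax import layers as tl

def DecoderBlock(d_model, d_ff, n_heads, dropout, mode, ff_activation):
    """One transformer decoder block: causal attention + feed-forward,
    each wrapped in a residual connection (sketch)."""
    feed_forward = [
        tl.LayerNorm(),
        tl.Dense(d_ff),
        ff_activation(),                      # e.g. tl.Relu
        tl.Dropout(rate=dropout, mode=mode),
        tl.Dense(d_model),
        tl.Dropout(rate=dropout, mode=mode),
    ]
    return [
        tl.Residual(
            tl.LayerNorm(),
            CausalAttention(d_model, n_heads=n_heads, mode=mode),
            tl.Dropout(rate=dropout, mode=mode),
        ),
        tl.Residual(feed_forward),
    ]
```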
Expected Output:
2.4 Transformer Language Model
You will now bring it all together. In this part you will use all the subcomponents you previously built to make the final model. Concretely, here is the image you will be implementing.
Exercise 04
Instructions: Previously you coded the decoder block. Now you will code the transformer language model. Here is what you will need.
positional_encoder - a list containing the layers that embed the input and encode its positions.
A list of n_layers decoder blocks.
tl.Serial: takes in the following layers or lists of layers:
tl.ShiftRight: shifts the tensor to the right by padding on axis 1.
positional_encoder : encodes the text positions.
decoder_blocks : the ones you created.
tl.LayerNorm : a layer norm.
tl.Dense : takes in the vocab_size.
tl.LogSoftmax : to predict.
Go go go!! You can do it 😃
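For orientation, here is one way the pieces listed above could be wired together. The layers inside positional_encoder and the default hyperparameter values are illustrative assumptions, not necessarily the notebook's exact settings; it assumes DecoderBlock from the previous exercise is in scope:
```python
from trax import layers as tl

def TransformerLM(vocab_size=33300, d_model=512, d_ff=2048, n_layers=6,
                  n_heads=8, dropout=0.1, max_len=4096, mode='train',
                  ff_activation=tl.Relu):
    """Decoder-only transformer language model (sketch)."""
    positional_encoder = [
        tl.Embedding(vocab_size, d_model),
        tl.Dropout(rate=dropout, mode=mode),
        tl.PositionalEncoding(max_len=max_len, mode=mode),
    ]
    decoder_blocks = [
        DecoderBlock(d_model, d_ff, n_heads, dropout, mode, ff_activation)
        for _ in range(n_layers)
    ]
    return tl.Serial(
        tl.ShiftRight(mode=mode),   # shift right so position i only sees tokens before i
        positional_encoder,
        decoder_blocks,
        tl.LayerNorm(),
        tl.Dense(vocab_size),
        tl.LogSoftmax(),            # predict log-probabilities over the vocabulary
    )
```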
Expected Output:
Part 3: Training
Now you are going to train your model. As usual, you have to define the cost function and the optimizer, and decide whether you will be training on a GPU or a CPU. In this case, you will train your model on a CPU for a few steps, and we will load in a pre-trained model that you can use to predict with your own words.
3.1 Training the model
You will now write a function that takes in your model and trains it. To train your model you have to decide how many times you want to iterate over the entire dataset; each such iteration is defined as an epoch. For each epoch, you have to go over all the data using your training iterator.
Exercise 05
Instructions: Implement the train_model function below to train the neural network above. Here is a list of things you should do:
Create the train task by calling trax.supervised.training.TrainTask and pass in the following:
labeled_data = train_gen
loss_fn = tl.CrossEntropyLoss()
optimizer = trax.optimizers.Adam(0.01)
lr_schedule = lr_schedule
Create the eval task by calling trax.supervised.training.EvalTask and pass in the following:
labeled_data = eval_gen
metrics = tl.CrossEntropyLoss() and tl.Accuracy()
Create the training loop by calling trax.supervised.training.Loop and pass in the following:
TransformerLM
train_task
eval_task = [eval_task]
output_dir = output_dir
You will be using a cross entropy loss, with Adam optimizer. Please read the Trax documentation to get a full understanding.
The training loop that this function returns can be run using the run() method by passing in the desired number of steps.
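A sketch of that setup might look like the following, assuming train_gen, eval_gen, lr_schedule, and output_dir are defined earlier in the notebook. Note that in the trax version assumed here the loss is passed to TrainTask as loss_layer and the evaluation tasks to Loop as eval_tasks; keyword names can differ slightly across trax releases:
```python
import trax
from trax import layers as tl
from trax.supervised import training

def train_model(TransformerLM, train_gen, eval_gen, lr_schedule, output_dir='./model'):
    """Builds the trax training Loop described above (sketch)."""
    train_task = training.TrainTask(
        labeled_data=train_gen,
        loss_layer=tl.CrossEntropyLoss(),
        optimizer=trax.optimizers.Adam(0.01),
        lr_schedule=lr_schedule,
    )
    eval_task = training.EvalTask(
        labeled_data=eval_gen,
        metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
    )
    return training.Loop(
        TransformerLM(d_model=4, d_ff=16, n_layers=1, n_heads=2),  # small model so a few steps finish quickly
        train_task,
        eval_tasks=[eval_task],
        output_dir=output_dir,
    )

# loop = train_model(TransformerLM, train_gen, eval_gen, lr_schedule)
# loop.run(10)   # train for only 10 steps
```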
Notice that the model will be trained for only 10 steps.
Even with this constraint, training the model with the original default arguments took a very long time to finish. Because of this, some parameters are changed when defining the model that is fed into the training loop in the function above.
Part 4: Evaluation
4.1 Loading in a trained model
In this part you will evaluate by loading in an almost exact version of the model you coded, but we trained it for you to save you time. Please run the cell below to load in the model.
As you may have already noticed the model that you trained and the pretrained model share the same overall architecture but they have different values for some of the parameters:
Original (pretrained) model:
Your model:
Only the parameters shown for your model were changed. The others stayed the same.
Part 5: Testing with your own input
You will now test the model with your own input. You are going to implement greedy decoding. This consists of two functions. The first one allows you to identify the next symbol: it takes the argmax of the output of your model and returns that index.
Exercise 06
Instructions: Implement the next symbol function that takes in the cur_output_tokens and the trained model to return the index of the next word.
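One possible sketch is shown below. It assumes the model is a TransformerLM in 'eval' mode with weights already loaded, so that calling it on a batch of token ids returns log-probabilities of shape (batch, seqlen, vocab_size); padding the current output to a power-of-two length is also an assumption here, used as a simple bucketing convention:
```python
import numpy as np

def next_symbol(cur_output_tokens, model):
    """Greedy decoding step: index of the most likely next token (sketch)."""
    token_length = len(cur_output_tokens)
    # Pad the sequence to the next power of two
    padded_length = 2 ** int(np.ceil(np.log2(token_length + 1)))
    padded = cur_output_tokens + [0] * (padded_length - token_length)
    padded_with_batch = np.array(padded)[None, :]      # add a batch dimension
    log_probs = model(padded_with_batch)               # (1, padded_length, vocab_size)
    # Thanks to the ShiftRight in the model, position token_length holds the
    # prediction for the next token; take its argmax over the vocabulary.
    return int(np.argmax(log_probs[0, token_length, :]))
```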
Expected Output:
Expected Output:
Expected Output:
Congratulations on finishing this week's assignment! You did a lot of work and now you should have a better understanding of the decoder part of Transformers and how Transformers can be used for text summarization.
Keep it up!