Path: blob/master/Natural Language Processing with Sequence Models/Week 2 - Recureent Neural Networks for Language Modelling/C3_W2_Assignment.ipynb
Assignment 2: Deep N-grams
Welcome to the second assignment of course 3. In this assignment you will explore Recurrent Neural Networks (RNNs).
You will use the fundamentals of Google's trax package to implement deep learning models.
By completing this assignment, you will learn how to implement models from scratch:
How to convert a line of text into a tensor
Create an iterator to feed data to the model
Define a GRU model using trax
Train the model using trax
Evaluate the quality of your model using perplexity
Predict using your own model
Overview
Your task will be to predict the next set of characters using the previous characters.
Although this task sounds simple, it is pretty useful.
You will start by converting a line of text into a tensor
Then you will create a generator to feed data into the model
You will train a neural network in order to predict the new set of characters of defined length.
You will use embeddings for each character and feed them as inputs to your model.
Many natural language tasks rely on using embeddings for predictions.
Your model will convert each character to its embedding, run the embeddings through a Gated Recurrent Unit (GRU), and run the result through a linear layer to predict the next set of characters.
The figure above gives you a summary of what you are about to implement.
You will get the embeddings;
Stack the embeddings on top of each other;
Run them through two layers with a relu activation in the middle;
Finally, you will compute the softmax.
To predict the next character:
Use the softmax output and identify the character with the highest probability.
The character with the highest probability is the prediction for the next character.
Part 1: Importing the Data
1.1 Loading in the data
Now import the dataset and do some processing.
The dataset has one sentence per line.
You will be doing character generation, so you have to process each sentence by converting each character (and not word) to a number.
You will use the ord function to convert a unique character to a unique integer ID.
Store each line in a list.
Create a data generator that takes in the batch_size and the max_length. The max_length corresponds to the maximum length of the sentence.
Notice that the letters are both uppercase and lowercase. In order to reduce the complexity of the task, we will convert all characters to lowercase. This way, the model only needs to predict the likelihood that a letter is 'a' and not decide between uppercase 'A' and lowercase 'a'.
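As a minimal sketch of the preprocessing described above, the snippet below strips whitespace, drops empty lines, and lowercases what remains. The sample strings in raw_lines are made up for illustration and stand in for lines read from the dataset files.

```python
# Hypothetical raw lines standing in for lines read from the dataset.
raw_lines = ["  A Midsummer Night's Dream ", "", "Now, fair Hippolyta"]

# Strip surrounding whitespace, skip empty lines, and lowercase the rest so
# that the model never has to distinguish 'A' from 'a'.
lines = [line.strip().lower() for line in raw_lines if line.strip()]

print(lines)
```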
1.2 Convert a line to tensor
Now that you have your list of lines, you will convert each character in that list to a number. You can use Python's ord function to do it.
Given a string representing one Unicode character, the ord function returns an integer representing the Unicode code point of that character.
Exercise 01
Instructions: Write a function that takes in a single line and transforms each character into its unicode integer. This returns a list of integers, which we'll refer to as a tensor.
Use a special integer to represent the end of the sentence (the end of the line).
This will be the EOS_int (end of sentence integer) parameter of the function.
Include the EOS_int as the last integer of the tensor.
For this exercise, you will use the number 1 to represent the end of a sentence.
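One possible way to write such a function is sketched below. The name line_to_tensor and the EOS_int parameter come from the text; the body is just one straightforward reading of the instructions, not necessarily the reference solution.

```python
def line_to_tensor(line, EOS_int=1):
    """Turn a line of text into a list of integers ending with EOS_int."""
    # ord maps each character to its Unicode code point.
    tensor = [ord(c) for c in line]
    # Mark the end of the sentence with the special integer.
    tensor.append(EOS_int)
    return tensor

print(line_to_tensor('abc xyz'))
```

Note that the result is always one element longer than the input line, which matters when padding to a fixed length later on.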
Expected Output
1.3 Batch generator
Most of the time in Natural Language Processing, and AI in general, we use batches when training our models. Here, you will build a data generator that takes in a text and returns a batch of text lines (lines are sentences).
The generator converts text lines (sentences) into numpy arrays of integers padded by zeros so that all arrays have the same length, which is the length of the longest sentence in the entire data set.
Once you create the generator, you can iterate on it like this:
This generator returns the data in a format that you can use directly in your model when computing the forward pass of your algorithm. Each iteration returns a batch of lines and a per-token mask. The batch is a tuple of three parts: inputs, targets, and mask. The inputs and targets are identical. The targets (the second element) will be used to evaluate your predictions. The mask is 1 for non-padding tokens.
Exercise 02
Instructions: Implement the data generator below. Here are some things you will need.
While True loop: this will yield one batch at a time.
if index >= num_lines, set index to 0.
The generator should return shuffled batches of data. To achieve this without modifying the actual lines, a list containing the indexes of data_lines is created. This list can be shuffled and used to get random batches every time the index is reset.
if len(line) < max_length append line to cur_batch.
Note that a line that has length equal to max_length should not be appended to the batch.
This is because when converting the characters into a tensor of integers, an additional end of sentence token id will be added.
So if max_length is 5, and a line has 4 characters, the tensor representing those 4 characters plus the end of sentence character will be of length 5, which is the max length.
if len(cur_batch) == batch_size, go over every line, convert it to a tensor of integers and store it.
Remember that when calling np you are really calling trax.fastmath.numpy which is trax’s version of numpy that is compatible with JAX. As a result of this, where you used to encounter the type numpy.ndarray now you will find the type jax.interpreters.xla.DeviceArray.
Hints
- Use the line_to_tensor function above inside a list comprehension in order to pad lines with zeros.
- Keep in mind that the length of the tensor is always 1 + the length of the original line of characters. Keep this in mind when setting the padding of zeros.
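The steps listed above can be sketched as follows. This is an illustrative version that uses plain NumPy rather than trax.fastmath.numpy, so it yields numpy.ndarray objects; the shuffle argument and the inlined line_to_tensor helper are assumptions made to keep the example self-contained.

```python
import random
import numpy as np

def line_to_tensor(line, EOS_int=1):
    # Code points of the characters followed by the end-of-sentence id.
    return [ord(c) for c in line] + [EOS_int]

def data_generator(batch_size, max_length, data_lines, shuffle=True):
    """Yield (inputs, targets, mask) batches forever."""
    index = 0
    cur_batch = []
    num_lines = len(data_lines)
    # Shuffle a list of indexes instead of the lines themselves.
    lines_index = list(range(num_lines))
    if shuffle:
        random.shuffle(lines_index)
    while True:
        if index >= num_lines:
            # Start a new pass over the data, with a fresh order if requested.
            index = 0
            if shuffle:
                random.shuffle(lines_index)
        line = data_lines[lines_index[index]]
        # Strictly shorter, because the EOS id adds one more element.
        if len(line) < max_length:
            cur_batch.append(line)
        index += 1
        if len(cur_batch) == batch_size:
            batch, mask = [], []
            for text in cur_batch:
                tensor = line_to_tensor(text)
                pad = [0] * (max_length - len(tensor))
                batch.append(tensor + pad)
                # 1 for real tokens (including EOS), 0 for padding.
                mask.append([1] * len(tensor) + pad)
            batch_np = np.array(batch)
            yield batch_np, batch_np, np.array(mask)
            cur_batch = []

tmp_lines = ['12345678901', '123456789', '234567890', '345678901']
inputs, targets, mask = next(data_generator(2, 10, tmp_lines, shuffle=False))
print(inputs)
```

In this sketch the generator never terminates, and it would loop without yielding if no line were shorter than max_length, so it assumes the data contains enough usable lines.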
Expected output
Now that you have your generator, you can call it and it will return tensors which correspond to your lines in Shakespeare. The first column and the second column are identical. Now you can go ahead and start building your neural network.
1.4 Repeating Batch generator
The way the iterator is currently defined, it will keep providing batches forever.
Although it is not needed, we want to show you the itertools.cycle function, which is really useful when the generator eventually stops.
Note that you are expected to use this function within the training function further below.
Usually we want to cycle over the dataset multiple times during training (i.e. train for multiple epochs).
For small datasets we can use itertools.cycle to achieve this easily.
You can see that we can get more than the 5 lines in tmp_lines using this.
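A small illustration of this behaviour, assuming a hypothetical five-line tmp_lines list and a generator that stops after one pass:

```python
import itertools

# Hypothetical stand-in for the five temporary lines mentioned in the text.
tmp_lines = ['12345678901', '123456789', '234567890', '345678901', '123456789012345']

def finite_generator(lines):
    # Stops after a single pass over the lines.
    for line in lines:
        yield line

# itertools.cycle replays the items once the underlying generator is exhausted.
infinite = itertools.cycle(finite_generator(tmp_lines))
first_ten = [next(infinite) for _ in range(10)]

print(len(first_ten))
```

itertools.cycle keeps a copy of every item it has seen in order to replay them, which is why it suits small datasets; the replayed passes also repeat the same order rather than reshuffling.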
Part 2: Defining the GRU model
Now that you have the input and output tensors, you will go ahead and initialize your model. You will be implementing the GRULM, a gated recurrent unit model. To implement this model, you will be using Google's trax package. Instead of making you implement the GRU from scratch, we will give you the necessary methods from a built-in package. You can use the following layers when constructing the model:
tl.Serial: Combinator that applies layers serially (by function composition). You can pass in the layers as arguments to Serial, separated by commas. For example: tl.Serial(tl.Embedding(...), tl.Mean(...), tl.Dense(...), tl.LogSoftmax(...))
tl.ShiftRight: Allows the model to go right in the feed forward. ShiftRight(n_shifts=1, mode='train') is a layer that shifts the tensor to the right n_shifts times. In this exercise you only need to specify the mode and need not worry about n_shifts.
tl.Embedding: Initializes the embedding. In this case its size is the size of the vocabulary by the dimension of the model. tl.Embedding(vocab_size, d_feature): vocab_size is the number of unique words in the given vocabulary; d_feature is the number of elements in the word embedding (some choices for a word embedding size range from 150 to 300, for example).
tl.GRU: Trax GRU layer. GRU(n_units) builds a traditional GRU of n_cells with dense internal transformations. GRU paper: https://arxiv.org/abs/1412.3555
tl.Dense: A dense layer. tl.Dense(n_units): the parameter n_units is the number of units chosen for this dense layer.
tl.LogSoftmax: Log of the output probabilities. Here, you don't need to set any parameters for LogSoftmax().
Exercise 03
Instructions: Implement the GRULM class below. You should be using all the methods explained above.
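To make the data flow through such a model concrete, here is a plain-NumPy sketch of the computation: shift right, embed, run a GRU over time, apply a dense layer, and take the log-softmax. It is not Trax code and not the exercise solution. The parameter names, the random initialization, the single GRU layer, and the update convention h = (1 - z) * h + z * candidate are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def init_params(vocab_size, d_model, rng):
    # Hypothetical parameter layout: gates act on the concatenation [h, x].
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "emb": w(vocab_size, d_model),
        "Wz": w(2 * d_model, d_model), "bz": np.zeros(d_model),
        "Wr": w(2 * d_model, d_model), "br": np.zeros(d_model),
        "Wh": w(2 * d_model, d_model), "bh": np.zeros(d_model),
        "Wo": w(d_model, vocab_size), "bo": np.zeros(vocab_size),
    }

def gru_lm_forward(tokens, p):
    """tokens: int array (batch, length) -> log-probs (batch, length, vocab)."""
    # Shift right: position t only sees tokens before t (0 is prepended).
    shifted = np.pad(tokens, ((0, 0), (1, 0)))[:, :-1]
    x = p["emb"][shifted]                      # (batch, length, d_model)
    h = np.zeros((tokens.shape[0], p["emb"].shape[1]))
    hidden = []
    for t in range(tokens.shape[1]):
        xt = x[:, t]
        hx = np.concatenate([h, xt], axis=-1)
        z = sigmoid(hx @ p["Wz"] + p["bz"])    # update gate
        r = sigmoid(hx @ p["Wr"] + p["br"])    # reset gate
        cand = np.tanh(np.concatenate([r * h, xt], axis=-1) @ p["Wh"] + p["bh"])
        h = (1.0 - z) * h + z * cand           # blend old state and candidate
        hidden.append(h)
    logits = np.stack(hidden, axis=1) @ p["Wo"] + p["bo"]
    return log_softmax(logits)

rng = np.random.default_rng(0)
params = init_params(vocab_size=256, d_model=8, rng=rng)
tokens = np.array([[104, 105, 1, 0, 0], [97, 98, 99, 100, 1]])
log_probs = gru_lm_forward(tokens, params)
print(log_probs.shape)
```

Because of the right shift, the prediction at each position depends only on earlier characters, which is what lets the same tensor serve as both input and target.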
Expected output
Part 3: Training
Now you are going to train your model. As usual, you have to define the cost function and the optimizer, and decide whether you will be training it on a gpu or cpu. You also have to feed in a built model. Before going into the training, we re-introduce the TrainTask and EvalTask abstractions from last week's assignment.
To train a model on a task, Trax defines an abstraction trax.supervised.training.TrainTask which packages the train data, loss and optimizer (among other things) together into an object.
Similarly, to evaluate a model, Trax defines an abstraction trax.supervised.training.EvalTask which packages the eval data and metrics (among other things) into another object.
The final piece tying things together is the trax.supervised.training.Loop abstraction, which is a very simple and flexible way to put everything together and train the model, all the while evaluating it and saving checkpoints. Using training.Loop will save you a lot of code compared to always writing the training loop by hand, like you did in courses 1 and 2. More importantly, you are less likely to have a bug in that code that would ruin your training.
An epoch is traditionally defined as one pass through the dataset.
Since the dataset was divided into batches, you need several steps (gradient evaluations) in order to complete an epoch. So, the number of examples covered in one epoch corresponds to the number of examples in a batch times the number of steps. In short, in each epoch you go over the whole dataset.
The max_length variable defines the maximum length of lines to be used in training; lines longer than that length are discarded.
Below is a function and results that indicate how many lines in the entire dataset conform to our maximum-length criterion, and how many steps are required to cover the entire dataset, which in turn corresponds to an epoch.
Expected output:
Number of used lines from the dataset: 25881
Batch size (a power of 2): 32
Number of steps to cover one epoch: 808
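The arithmetic behind these numbers can be checked with a short sketch. The function names below are illustrative, and the length test follows the sentence above ("lines longer than that length are discarded"); the count of 25881 is taken from the expected output rather than recomputed from the data.

```python
def n_used_lines(lines, max_length):
    # Count the lines that are not longer than max_length.
    return sum(1 for line in lines if len(line) <= max_length)

def steps_per_epoch(num_used_lines, batch_size):
    # Only full batches are counted, so the division is rounded down.
    return num_used_lines // batch_size

# 25881 usable lines in batches of 32 give 808 full batches (25 lines left over).
print(steps_per_epoch(25881, 32))
```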
3.1 Training the model
You will now write a function that takes in your model and trains it. To train your model you have to decide how many times you want to iterate over the entire data set.
Exercise 04
Instructions: Implement the train_model function below to train the neural network above. Here is a list of things you should do:
Create a trax.supervised.training.TrainTask object. This encapsulates the aspects of the dataset and the problem at hand:
- labeled_data = the labeled data that we want to train on.
- loss_fn = tl.CrossEntropyLoss()
- optimizer = trax.optimizers.Adam() with learning rate = 0.0005
Create a trax.supervised.training.EvalTask object. This encapsulates aspects of evaluating the model:
- labeled_data = the labeled data that we want to evaluate on.
- metrics = tl.CrossEntropyLoss() and tl.Accuracy()
- How frequently we want to evaluate and checkpoint the model.
Create a trax.supervised.training.Loop object. This encapsulates the following:
- The previously created TrainTask and EvalTask objects.
- The training model = GRULM
- Optionally the evaluation model, if different from the training model. NOTE: in the presence of Dropout etc. we usually want the evaluation model to behave slightly differently than the training model.
You will be using a cross entropy loss, with Adam optimizer. Please read the trax documentation to get a full understanding. Make sure you use the number of steps provided as a parameter to train for the desired number of steps.
NOTE: Don't forget to wrap the data generator in itertools.cycle to iterate on it for multiple epochs.
The model was only trained for 1 step due to the constraints of this environment. Even on a GPU accelerated environment it will take many hours for it to achieve a good level of accuracy. For the rest of the assignment you will be using a pretrained model but now you should understand how the training can be done using Trax.
Part 4: Evaluation
4.1 Evaluating using the deep nets
Now that you have learned how to train a model, you will learn how to evaluate it. To evaluate language models, we usually use perplexity, which is a measure of how well a probability model predicts a sample. Note that perplexity is defined as:
$$P(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})}}$$
As an implementation hack, you would usually take the log of that formula (to enable us to use the log probabilities we get as output of our RNN, convert exponents to products, and products into sums, which makes computations less complicated and computationally more efficient). You should also take care of the padding, since you do not want to include the padding when calculating the perplexity (because we do not want an artificially good perplexity measure).
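For reference, the log of the standard perplexity of a sequence W = w_1 ... w_N can be worked out step by step. The N-th root becomes a factor of 1/N, the product becomes a sum, and the inverse probability becomes a minus sign:

```latex
\begin{aligned}
\log P(W)
  &= \log \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})}} \\
  &= \frac{1}{N} \log \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})} \\
  &= \frac{1}{N} \sum_{i=1}^{N} \log \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})} \\
  &= -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})
\end{aligned}
```

The last line is the negative average log probability assigned to the correct characters, with N counting only the non-padding positions.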
Exercise 05
Instructions: Write a program that will help evaluate your model. Implementation hack: your program takes in preds and target. Preds is a tensor of log probabilities. You can use tl.one_hot to transform the target into the same dimension. You then multiply them and sum.
You also have to create a mask to only get the non-padded probabilities. Good luck!
Hints
- To convert the target into the same dimension as the predictions tensor use tl.one_hot with target and preds.shape[-1].
- You will also need the np.equal function in order to unpad the data and properly compute perplexity.
- Keep in mind while implementing the formula above that w_i represents a letter from our 256 letter alphabet.
Expected Output: The log perplexity and perplexity of your model are respectively around 1.9 and 7.2.
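The masked computation described in this exercise can be sketched in plain NumPy as follows. The function name log_perplexity and the pad_id argument are illustrative, np.eye stands in for tl.one_hot, and the uniform toy predictions are made up, so this does not reproduce the numbers quoted above.

```python
import numpy as np

def log_perplexity(preds, target, pad_id=0):
    """preds: log-probs (batch, length, vocab); target: ints (batch, length)."""
    vocab_size = preds.shape[-1]
    # One-hot encode the targets so they match the shape of preds.
    one_hot = np.eye(vocab_size)[target]
    # Keep only the log probability assigned to the correct character.
    log_p = np.sum(preds * one_hot, axis=-1)
    # 1.0 for real tokens, 0.0 for padding.
    non_pad = 1.0 - np.equal(target, pad_id)
    log_p = log_p * non_pad
    # Average over the non-padding positions of each sequence
    # (assumes every sequence has at least one such position).
    per_sequence = np.sum(log_p, axis=-1) / np.sum(non_pad, axis=-1)
    return -np.mean(per_sequence)

# Toy check: a uniform model over 4 symbols has perplexity exactly 4.
target = np.array([[1, 2, 3, 0, 0]])
preds = np.full((1, 5, 4), np.log(0.25))
log_ppx = log_perplexity(preds, target)
print(log_ppx, np.exp(log_ppx))
```

Exponentiating the log perplexity recovers the perplexity itself, which is how a pair of values such as a log perplexity and its corresponding perplexity relate.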
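The Gumbel Probability Density Function (PDF) is defined as:
$$f(z) = \frac{1}{\beta} e^{-(z + e^{-z})}$$
where:
$$z = \frac{x - \mu}{\beta}$$
with location parameter $\mu$ and scale parameter $\beta$.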
The maximum value, which is what we choose as the prediction in the last step of the Recurrent Neural Network (RNN) we are using for text generation, in a sample of a random variable following an exponential distribution approaches the Gumbel distribution when the sample size increases asymptotically. For that reason, the Gumbel distribution is used to sample from a categorical distribution.
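One common way to use this in practice is the Gumbel-max trick: add independent standard Gumbel noise to the log probabilities and take the argmax. The sketch below is a NumPy illustration under that assumption; the function name, the temperature argument, and the clipping bounds on the uniform samples are choices made here, not details taken from the assignment.

```python
import numpy as np

def gumbel_sample(log_probs, temperature=1.0, rng=None):
    """Sample indices from categorical distributions given as log-probs."""
    rng = np.random.default_rng() if rng is None else rng
    # Uniform samples kept away from 0 and 1 so that both logs are finite.
    u = rng.uniform(low=1e-6, high=1.0 - 1e-6, size=log_probs.shape)
    # Standard Gumbel noise (location 0, scale 1).
    g = -np.log(-np.log(u))
    # temperature=0 reduces to a plain argmax (greedy decoding).
    return np.argmax(log_probs + g * temperature, axis=-1)

rng = np.random.default_rng(0)
log_probs = np.log(np.array([0.7, 0.2, 0.1]))
print(gumbel_sample(log_probs, rng=rng))
```

With temperature 1.0 the sampled indices follow the categorical distribution itself, so over many draws index 0 should appear in roughly 70% of the samples in this toy example.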
In the generated text above, you can see that the model generates text that makes sense, capturing dependencies between words, and without any input. A simple n-gram model would not have been able to capture all of that in one sentence.
On statistical methods
Using a statistical method like the one you implemented in course 2 will not give you results that are as good. Such a model will not be able to encode information seen previously in the data set and, as a result, the perplexity will increase. Remember from course 2 that the higher the perplexity, the worse your model is. Furthermore, statistical n-gram models take up too much space and memory. As a result, they will be inefficient and too slow. Conversely, with deep nets, you can get a better perplexity. Note that learning about n-gram language models is still important and allows you to better understand deep nets.