Path: blob/main/C3/W4/assignment/C3W4_Assignment.ipynb
Week 4: Predicting the next word
Welcome to this assignment! During this week you saw how to create a model that predicts the next word in a text sequence. Now you will implement such a model and train it using a corpus of Shakespeare's sonnets, while also creating some helper functions to pre-process the data.
TIPS FOR SUCCESSFUL GRADING OF YOUR ASSIGNMENT:
All cells are frozen except for the ones where you need to submit your solutions or those explicitly marked as ones you can interact with.
You can add new cells to experiment, but these will be omitted by the grader, so don't rely on newly created cells to host your solution code; use the provided places for it.
You can add the comment # grade-up-to-here in any graded cell to signal the grader that it must only evaluate up to that point. This is helpful if you want to check if you are on the right track even if you are not done with the whole assignment. Just remember to delete the comment afterwards!
Avoid using global variables unless you absolutely have to. The grader tests your code in an isolated environment without running all cells from the top. As a result, global variables may be unavailable when scoring your submission. Global variables that are meant to be used will be defined in UPPERCASE.
To submit your notebook, save it and then click on the blue submit button at the beginning of the page.
Let's get started!
Defining some useful global variables
Next you will define some global variables that will be used throughout the assignment. Feel free to reference them in the upcoming exercises (a sketch of their definitions is shown right after this list):
FILE_PATH: The file path where the sonnets file is located.
NUM_BATCHES: Number of batches. Defaults to 16.
LSTM_UNITS: Number of units in the LSTM layer.
EMBEDDING_DIM: Number of dimensions in the embedding layer.
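For reference, a minimal sketch of what such a definitions cell might look like. Only NUM_BATCHES = 16 and the embedding dimension of 100 are stated in this assignment; the other values are illustrative assumptions:

```python
FILE_PATH = "./data/sonnets.txt"  # assumed location of the sonnets file
NUM_BATCHES = 16                  # stated default
LSTM_UNITS = 128                  # assumed value, tweak as needed
EMBEDDING_DIM = 100               # matches the output_dim mentioned in Exercise 5
```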
A note about grading:
When you submit this assignment for grading, these same values for the globals will be used, so make sure that all your code works well with them. After submitting and passing this assignment, you are encouraged to come back here and play with these parameters to see the impact they have on training and on the generated text. Since this next cell is frozen, you will need to copy its contents into a new cell and run it to overwrite the values of these globals.
Reading the dataset
For this assignment you will be using the Shakespeare Sonnets Dataset, which contains more than 2000 lines of text extracted from Shakespeare's sonnets.
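Loading the file into a list of lines (the corpus) might look like the following sketch, assuming FILE_PATH points to a plain-text file with one verse per line; the exact loading cell is provided in the notebook:

```python
# Read the sonnets file and split it into non-empty lines
with open(FILE_PATH) as f:
    corpus = [line for line in f.read().split("\n") if line]

print(f"There are {len(corpus)} lines of sonnets")
```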
Exercise 1: fit_vectorizer
In this exercise, you will use the tf.keras.layers.TextVectorization layer to tokenize and transform the text into numeric values.
Note that in this case you will not pad the sentences right away as you've done before, because you need to build the n-grams before padding, so pay attention to the arguments you pass to the TextVectorization layer!
Note:
You should remove the punctuation and use only lowercase words, so make sure to pass the correct standardization argument to the TextVectorization layer.
In this case you will not pad the sentences with the TextVectorization layer as you've done before, because you need to build the n-grams before padding. Remember that by default the TextVectorization layer returns a Tensor, so every element in it must have the same size; if you pass two sentences of different lengths, they will be padded. If you do not want that, you need to either pass ragged=True or pass only a single sentence at a time. Later on in the assignment you will build the n-grams, and depending on how you iterate over the sentences this may matter: if you first pass the entire corpus to the TextVectorization layer and then iterate over the result, you should pass ragged=True; if you apply the TextVectorization layer to each sentence separately, you don't need to worry about it.
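A minimal sketch of what fit_vectorizer could look like under these constraints (adapting the layer directly on the full corpus and returning ragged outputs); the exact signature in your notebook may differ:

```python
import tensorflow as tf

def fit_vectorizer(corpus):
    """Create a TextVectorization layer and adapt it to the corpus.

    Assumes `corpus` is a list of strings (one sonnet line per element).
    """
    vectorizer = tf.keras.layers.TextVectorization(
        standardize="lower_and_strip_punctuation",  # lowercase + remove punctuation
        ragged=True,  # don't pad when vectorizing multiple sentences at once
    )
    vectorizer.adapt(corpus)
    return vectorizer
```

With this configuration, calling the vectorizer on a single string returns a regular tensor, while calling it on a list of strings returns a ragged tensor, which is exactly the behaviour described below.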
Expected output:
One thing to note is that you can either pass a string or a list of strings to the vectorizer. If you pass the former, it will return a regular tensor, whereas if you pass the latter, it will return a ragged tensor, provided you've correctly configured the TextVectorization layer to do so.
Expected output:
Generating n-grams
As you saw in the lecture, the idea now is to generate the n-grams for each sentence in the corpus. So, for instance, if a vectorized sentence is given by [45, 75, 195, 879], you must generate the following vectors:

[45, 75]
[45, 75, 195]
[45, 75, 195, 879]
Exercise 2: n_gram_seqs
Now complete the n_gram_seqs function below. This function receives the fitted vectorizer and the corpus (which is a list of strings) and should return a list containing the n_gram sequences for each line in the corpus.
NOTE:
If you pass vectorizer(sentence), the result is not padded, whereas if you pass vectorizer(list_of_sentences), the result is unpadded only if you passed the argument ragged=True in the TextVectorization setup.
This exercise directly depends on the previous one, because you need to pass the fitted vectorizer as a parameter, so any error thrown in the previous exercise may propagate here.
A sketch of one possible implementation is shown after this note.
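A minimal sketch of one way to implement n_gram_seqs, assuming the vectorizer is applied to one sentence at a time so padding is not a concern:

```python
def n_gram_seqs(corpus, vectorizer):
    """Generate the n-gram sequences for each line in the corpus.

    Assumes `corpus` is a list of strings and `vectorizer` is the fitted
    TextVectorization layer from the previous exercise.
    """
    input_sequences = []
    for sentence in corpus:
        # Vectorizing a single sentence yields an unpadded 1-D tensor
        token_list = vectorizer(sentence).numpy()
        # Build every prefix of length 2 or more
        for i in range(2, len(token_list) + 1):
            input_sequences.append(list(token_list[:i]))
    return input_sequences
```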
Expected Output:
Expected Output:
Apply the n_gram_seqs transformation to the whole corpus and save the maximum sequence length to use it later:
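Assuming corpus and vectorizer are already defined, this step might look like:

```python
input_sequences = n_gram_seqs(corpus, vectorizer)

# Longest n-gram sequence; used as the target length when padding
max_sequence_len = max(len(seq) for seq in input_sequences)
```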
Expected Output:
Exercise 3: pad_seqs
Now code the pad_seqs function, which will pad any given sequences to the desired maximum length. Notice that this function receives a list of sequences and should return a numpy array with the padded sequences. You may have a look at the documentation of tf.keras.utils.pad_sequences.
NOTE:
Remember to pass the correct padding method as discussed in the lecture.
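A minimal sketch, assuming 'pre' padding as discussed in the lecture:

```python
import tensorflow as tf

def pad_seqs(input_sequences, max_sequence_len):
    """Pad the n-gram sequences to a uniform length.

    Returns a numpy array of shape (num_sequences, max_sequence_len).
    """
    # pad_sequences already returns a numpy array
    return tf.keras.utils.pad_sequences(
        input_sequences,
        maxlen=max_sequence_len,
        padding="pre",  # pad at the front so the label stays at the end
    )
```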
Expected Output:
Expected Output:
Expected Output:
Exercise 4: features_and_labels_dataset
Before feeding the data into the neural network you should split it into features and labels. In this case the features will be the padded n_gram sequences with the last element removed from them and the labels will be the removed words.
Complete the features_and_labels_dataset function below. This function expects the padded n_gram sequences as input and should return a batched TensorFlow dataset containing elements in the form (sentence, label).
NOTE:
Notice that the function also receives the total number of words in the corpus. This parameter is very important when one-hot encoding the labels, since every word in the corpus will be a label at least once. The function you should use is tf.keras.utils.to_categorical.
To generate a dataset you may use the function tf.data.Dataset.from_tensor_slices after obtaining the sentences and their respective labels.
To batch a dataset, you may call the method .batch. A good number is 16, but feel free to choose any value up to 64; larger batches may make the model take too many epochs to achieve a good accuracy. Remember this value is defined as a global variable.
A sketch combining these steps is shown below.
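A minimal sketch, assuming total_words is the vocabulary size and using a hardcoded batch size of 16 for illustration (the notebook defines this value as a global):

```python
def features_and_labels_dataset(padded_sequences, total_words):
    """Split padded n-gram sequences into (features, one-hot label) pairs
    and return a batched tf.data.Dataset."""
    features = padded_sequences[:, :-1]  # everything except the last token
    labels = padded_sequences[:, -1]     # the last token is the word to predict

    # One-hot encode the labels over the full vocabulary
    one_hot_labels = tf.keras.utils.to_categorical(labels, num_classes=total_words)

    dataset = tf.data.Dataset.from_tensor_slices((features, one_hot_labels))
    return dataset.batch(16)  # assumed batch size; keep it <= 64
```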
Expected Output:
Now let's generate the whole dataset that will be used for training. In this case, let's use the .prefetch method to speed up training. Since the dataset is not that big, you should not have memory problems doing this.
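Putting the pieces together, and assuming the helpers above, the full training dataset could be built like this (vocabulary_size() is part of the TextVectorization API):

```python
total_words = vectorizer.vocabulary_size()

padded_sequences = pad_seqs(input_sequences, max_sequence_len)
dataset = features_and_labels_dataset(padded_sequences, total_words)

# Overlap data preparation with model execution during training
dataset = dataset.prefetch(tf.data.AUTOTUNE)
```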
Expected Output:
Exercise 5: create_model
Now you should define a model architecture capable of achieving an accuracy of at least 80%.
Some hints to help you in this task:
The first layer in your model must be an Input layer with the appropriate parameters; remember that your inputs are vectors with a fixed length. Be careful with the size value you pass, since the last element of every input was removed to be used as the label.
An appropriate output_dim for the Embedding layer is 100; this is already provided for you.
A Bidirectional LSTM is helpful for this particular problem.
The last layer should have the same number of units as the total number of words in the corpus and a softmax activation function.
This problem can be solved with only two layers (excluding the Embedding and Input), so try out small architectures first, such as the sketch shown after these hints.
30 epochs should be enough to get an accuracy higher than 80%; if this is not the case, try changing the architecture of your model.
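A minimal sketch of an architecture along these lines, assuming max_sequence_len and total_words are defined as above and reusing the LSTM_UNITS and EMBEDDING_DIM globals; this is not necessarily the reference solution:

```python
def create_model(total_words, max_sequence_len):
    """Build and compile a small next-word prediction model."""
    model = tf.keras.Sequential([
        # Inputs are the padded sequences minus the final token used as label
        tf.keras.layers.Input(shape=(max_sequence_len - 1,)),
        tf.keras.layers.Embedding(input_dim=total_words, output_dim=EMBEDDING_DIM),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(LSTM_UNITS)),
        # One unit per word in the vocabulary, softmax for a probability distribution
        tf.keras.layers.Dense(total_words, activation="softmax"),
    ])
    model.compile(
        loss="categorical_crossentropy",
        optimizer="adam",
        metrics=["accuracy"],
    )
    return model
```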
The next cell allows you to check the number of total and trainable parameters of your model and prompts a warning in case these exceed those of a reference solution. This serves the following 3 purposes, listed in order of priority:
Helps you prevent crashing the kernel during training.
Helps you avoid longer-than-necessary training times.
Provides a reasonable estimate of the size of your model. In general you will prefer smaller models, provided they accomplish their goal successfully.
Notice that this check is just informative and the reference may well be below the actual model size needed to crash the kernel, so even if you exceed it you are probably fine. However, if the kernel crashes during training, or training is taking a very long time and your model is larger than the reference, come back here and try to bring the number of parameters closer to the reference.
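If you want to inspect the counts yourself, a quick way to do so (assuming the model has already been built) is:

```python
total_params = model.count_params()
trainable_params = sum(
    tf.keras.backend.count_params(w) for w in model.trainable_weights
)
print(f"Total params: {total_params:,} | Trainable params: {trainable_params:,}")
```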
Expected output:
Where NUM_BATCHES is the number of batches you have set for your dataset.
To pass this assignment, your model should achieve a training accuracy of at least 80%. If your model didn't achieve this threshold, try training again with a different model architecture. Consider increasing the number of units in your LSTM layer.
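Training (or retraining) the model, assuming the batched and prefetched dataset built above, could be as simple as:

```python
# 30 epochs is usually enough to cross the 80% training-accuracy threshold
history = model.fit(dataset, epochs=30)
```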
If the accuracy meets the requirement of being greater than 80%, then save the history.pkl file, which contains the training history of your model and will be used to compute your grade. You can do this by running the following code:
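The saving step likely looks something like the following sketch; the exact cell is provided in the notebook:

```python
import pickle

# Persist the training history so the grader can inspect the accuracy curve
with open("history.pkl", "wb") as f:
    pickle.dump(history.history, f)
```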
See your model in action
After all your work it is finally time to see your model generating text.
Run the cell below to generate the next 100 words of a seed text.
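For reference, a sketch of what such a generation loop might look like, assuming greedy decoding, that get_vocabulary() maps token ids back to words (it is part of the TextVectorization API), and that padding mirrors the training setup:

```python
import numpy as np

seed_text = "Help me Obi Wan Kenobi, you're my only hope"  # any seed phrase works
next_words = 100
vocabulary = vectorizer.get_vocabulary()

for _ in range(next_words):
    # Vectorize the current text and pad it exactly like the training features
    token_list = vectorizer(seed_text).numpy()
    token_list = tf.keras.utils.pad_sequences(
        [token_list], maxlen=max_sequence_len - 1, padding="pre"
    )
    # Greedily pick the most probable next word
    probabilities = model.predict(token_list, verbose=0)
    predicted_index = np.argmax(probabilities, axis=-1)[0]
    seed_text += " " + vocabulary[predicted_index]

print(seed_text)
```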
After submitting your assignment you are encouraged to try out training for different amounts of epochs and seeing how this affects the coherency of the generated text. Also try changing the seed text to see what you get!
Congratulations on finishing this week's assignment!
You have successfully implemented a neural network capable of predicting the next word in a sequence of text!
We hope to see you in the next course of the specialization! Keep it up!