Week 3: Exploring Overfitting in NLP
Welcome to this assignment! During this week you saw different ways to handle sequence-like data. You saw how some Keras layers such as GRU, Conv and LSTM can be used to tackle problems in this space. Now you will put this knowledge into practice by creating a model architecture that does not overfit.
For this assignment you will be using a variation of the Sentiment140 dataset, which contains 1.6 million tweets alongside their respective sentiment (0 for negative and 4 for positive). This variation contains only 160 thousand tweets.
You will also need to create helper functions, very similar to the ones you coded in previous assignments, to pre-process the data and to tokenize sentences. However, the objective of the assignment is to find a model architecture that will not overfit.
Let's get started!
TIPS FOR SUCCESSFUL GRADING OF YOUR ASSIGNMENT:
All cells are frozen except for the ones where you need to submit your solutions or when it is explicitly mentioned that you can interact with them.
You can add new cells to experiment, but these will be omitted by the grader, so don't rely on newly created cells to host your solution code; use the provided places for this.
You can add the comment # grade-up-to-here in any graded cell to signal the grader that it must only evaluate up to that point. This is helpful if you want to check if you are on the right track even if you are not done with the whole assignment. Remember to delete the comment afterwards!
Avoid using global variables unless you absolutely have to. The grader tests your code in an isolated environment without running all cells from the top. As a result, global variables may be unavailable when scoring your submission. Global variables that are meant to be used will be defined in UPPERCASE.
To submit your notebook, save it and then click on the blue submit button at the beginning of the page.
Let's get started!
Defining some useful global variables
Next you will define some global variables that will be used throughout the assignment. Feel free to reference them in the upcoming exercises:
EMBEDDING_DIM: Dimension of the dense embedding, used in the embedding layer of the model. Defaults to 100.
MAX_LENGTH: Maximum length of all sequences. Defaults to 32.
TRAINING_SPLIT: Proportion of data used for training. Defaults to 0.9.
NUM_BATCHES: Number of batches. Defaults to 128.
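For reference, a cell defining these globals with the default values listed above might look like the following sketch (the actual frozen cell in the notebook is authoritative):

```python
# Values mirror the defaults described above; the graded notebook defines
# these in a frozen cell, so this is just an illustrative copy.
EMBEDDING_DIM = 100   # dimension of the dense embedding
MAX_LENGTH = 32       # maximum length of all sequences
TRAINING_SPLIT = 0.9  # proportion of data used for training
NUM_BATCHES = 128     # number of batches
```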
A note about grading:
When you submit this assignment for grading, these same values will be used for these globals, so make sure that all your code works well with them. After submitting and passing this assignment, you are encouraged to come back here and play with these parameters to see the impact they have on the classification process. Since this next cell is frozen, you will need to copy its contents into a new cell and run it to overwrite the values of these globals.
Explore the dataset
The dataset is provided in a csv file.
Each row of this file contains the following values separated by commas:
target: the polarity of the tweet (0 = negative, 4 = positive)
ids: the id of the tweet
date: the date of the tweet
flag: the query. If there is no query, then this value is NO_QUERY.
user: the user that tweeted
text: the text of the tweet
Take a look at the first five rows of this dataset.
Looking at the contents of the csv file using pandas is a great way of checking what the data looks like. Now you need to create a tf.data.Dataset with the corresponding text and sentiment for each tweet:
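As a rough illustration of this step (the file path here is hypothetical and the column names simply follow the field list above), the relevant columns could be loaded with pandas and wrapped into a tf.data.Dataset like this:

```python
import pandas as pd
import tensorflow as tf

# Hypothetical path and column names based on the field description above.
df = pd.read_csv("./data/training_cleaned.csv",
                 header=None,
                 names=["target", "ids", "date", "flag", "user", "text"])

# Map the original polarity labels (0 = negative, 4 = positive) to 0/1.
labels = (df["target"] == 4).astype(int)

# Build a dataset of (text, label) pairs.
dataset = tf.data.Dataset.from_tensor_slices((df["text"].values, labels.values))
```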
Exercise 1: train_val_datasets
Now you will code the train_val_datasets function which, given the full tensorflow dataset, shuffles it and splits it into two datasets, one for training and the other one for validation, taking into account the TRAINING_SPLIT defined earlier. It should also batch the dataset so that it is arranged into NUM_BATCHES batches.
In the previous week you created this split between training and validation by manipulating numpy arrays, but this time the data already comes as a tf.data.Dataset. This is so you get comfortable manipulating this kind of data regardless of the format.
Hints:
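As an additional, unofficial hint, here is a minimal sketch of one possible implementation. It is not necessarily the graded solution; in particular, whether NUM_BATCHES should be passed directly to .batch() or used to derive a batch size is an assumption you should confirm against the expected output below.

```python
import tensorflow as tf

def train_val_datasets(dataset, train_split=TRAINING_SPLIT, num_batches=NUM_BATCHES):
    """Shuffle, split and batch a tf.data.Dataset of (text, label) pairs."""
    # Assumes the dataset's cardinality is known (true for from_tensor_slices).
    n_examples = int(dataset.cardinality().numpy())
    train_size = int(n_examples * train_split)

    # Shuffle with a fixed seed so the split is reproducible.
    dataset = dataset.shuffle(buffer_size=n_examples, seed=42)

    train_dataset = dataset.take(train_size)
    validation_dataset = dataset.skip(train_size)

    # Assumption: NUM_BATCHES is passed straight to .batch(); check the
    # expected output to confirm how it should actually be used.
    train_dataset = train_dataset.batch(num_batches)
    validation_dataset = validation_dataset.batch(num_batches)

    return train_dataset, validation_dataset
```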
Expected Output:
Exercise 2: fit_vectorizer
Now that you have batched datasets for training and validation it is time for you to begin the tokenization process.
Begin by completing the fit_vectorizer function below. This function should return a TextVectorization layer that has been fitted to the training sentences.
Hints:
This time you didn't define a custom standardize_func, but you should convert the texts to lower-case and strip out punctuation. For this, check the different options for the standardize argument of the TextVectorization layer.
The texts should be truncated so that the maximum length is equal to the MAX_LENGTH defined earlier. Once again, check the docs for an argument that can help you with this.
You should NOT predefine a vocabulary size but let the layer learn it from the sentences.
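A minimal sketch of one configuration that is consistent with these hints, assuming the MAX_LENGTH global defined earlier (not necessarily the graded solution):

```python
import tensorflow as tf

def fit_vectorizer(train_sentences):
    """Create and adapt a TextVectorization layer on the training sentences."""
    vectorizer = tf.keras.layers.TextVectorization(
        # Lower-casing plus punctuation stripping is one of the built-in options.
        standardize="lower_and_strip_punctuation",
        # Truncate (or pad) every sequence to MAX_LENGTH tokens.
        output_sequence_length=MAX_LENGTH,
        # No max_tokens argument: the vocabulary size is learned from the data.
    )
    # Learn the vocabulary from the training sentences only.
    vectorizer.adapt(train_sentences)
    return vectorizer
```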
Expected Output:
This time you don't need to encode the labels since these are already encoded as 0 for negative and 1 for positive. But you still need to apply the vectorization to the texts of the dataset using the adapted vectorizer you've just built. You can do so by running the following cell:
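Conceptually, that provided cell does something along these lines (the variable names here are illustrative):

```python
# Illustrative only: the notebook already provides this cell.
# Apply the adapted vectorizer to the text component of each (text, label) pair.
train_dataset_vectorized = train_dataset.map(
    lambda text, label: (vectorizer(text), label)
)
validation_dataset_vectorized = validation_dataset.map(
    lambda text, label: (vectorizer(text), label)
)
```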
Using pre-defined Embeddings
This time you will not be learning embeddings from your data; instead, you will be using pre-trained word vectors. In particular, you will be using the 100-dimensional version of GloVe from Stanford.
Now you have access to GloVe's pre-trained word vectors. Isn't that cool?
Let's take a look at the vector for the word dog:
Feel free to change the test_word to see the vector representation of any word you can think of.
Also, notice that the dimension of each vector is 100. You can easily double check this by running the following cell:
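For context, a typical way of loading the GloVe file into a dictionary and inspecting a word vector looks roughly like this (the file path is hypothetical; the notebook builds GLOVE_EMBEDDINGS for you):

```python
import numpy as np

# Hypothetical path: the notebook provides the actual file and loading cell.
GLOVE_EMBEDDINGS = {}
with open("./data/glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        GLOVE_EMBEDDINGS[word] = np.asarray(values[1:], dtype="float32")

test_word = "dog"
test_vector = GLOVE_EMBEDDINGS[test_word]
print(test_vector)        # the 100 GloVe coefficients for "dog"
print(test_vector.shape)  # (100,) -> matches EMBEDDING_DIM
```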
Now you can represent the words in your vocabulary using the embeddings. To do this, save the vector representation of each word in the vocabulary in a numpy array.
A couple of things to notice:
You need to build a word_index dictionary that stores the encoding for each word in the adapted vectorizer.
If a word in your vocabulary is not present in GLOVE_EMBEDDINGS, the representation for that word is left as a column of zeros.
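One way this matrix could be assembled, assuming the adapted vectorizer and the GLOVE_EMBEDDINGS dictionary from the previous steps (the notebook provides the actual cell):

```python
import numpy as np

# word_index maps each token in the adapted vectorizer to its integer id.
vocabulary = vectorizer.get_vocabulary()
word_index = {word: idx for idx, word in enumerate(vocabulary)}

# One entry per token; entries for words missing from GLOVE_EMBEDDINGS stay at zero.
EMBEDDINGS_MATRIX = np.zeros((len(vocabulary), EMBEDDING_DIM))
for word, idx in word_index.items():
    vector = GLOVE_EMBEDDINGS.get(word)
    if vector is not None:
        EMBEDDINGS_MATRIX[idx] = vector
```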
As a sanity check, make sure that the vector representation for the word dog matches the column at its index in the EMBEDDINGS_MATRIX:
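Using the names from the sketch above, that check could be expressed as:

```python
import numpy as np

# The embedding stored for "dog" should equal its GloVe vector.
assert np.allclose(EMBEDDINGS_MATRIX[word_index["dog"]], GLOVE_EMBEDDINGS["dog"])
```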
Now you have the pre-trained embeddings ready to use!
Exercise 3: create_model
Now you need to define a model that will handle the problem at hand while not overfitting.
Hints:
The layer immediately after tf.keras.Input should be a tf.keras.layers.Embedding. The parameter that configures the usage of the pre-trained embeddings is already provided, but you still need to fill in the other ones.
There are multiple ways of solving this problem, so try an architecture that you think will not overfit.
You can try different combinations of layers covered in previous ungraded labs, such as:
Conv1D
Dropout
GlobalMaxPooling1D
MaxPooling1D
LSTM
Bidirectional(LSTM)
Include at least one Dropout layer to mitigate overfitting.
The last two layers should be Dense layers.
Try simpler architectures first to avoid long training times. Architectures that are able to solve this problem usually have around 3-4 layers (excluding the last two Dense ones). A sketch of one possible architecture is shown right after this list.
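As promised above, here is a sketch of one architecture that fits these hints. It is only an example, not the required solution; the function signature and the way the pre-trained vectors are wired into the Embedding layer are assumptions, since the graded cell already provides part of that wiring.

```python
import tensorflow as tf

def create_model(vocab_size, pretrained_embeddings):
    """One possible architecture; many others can meet the overfitting criteria."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(MAX_LENGTH,)),
        # Load the pre-trained GloVe vectors and keep them frozen.
        tf.keras.layers.Embedding(
            input_dim=vocab_size,
            output_dim=EMBEDDING_DIM,
            embeddings_initializer=tf.keras.initializers.Constant(pretrained_embeddings),
            trainable=False,
        ),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Conv1D(64, 5, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        loss="binary_crossentropy",
        optimizer="adam",
        metrics=["accuracy"],
    )
    return model
```

The Dropout layer plus the frozen pre-trained embeddings keep the number of trainable parameters small, which helps keep the validation loss from climbing.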
The next cell allows you to check the number of total and trainable parameters of your model, and prompts a warning in case these exceed those of a reference solution. This serves the following 3 purposes, listed in order of priority:
Helps you prevent crashing the kernel during training.
Helps you avoid longer-than-necessary training times.
Provides a reasonable estimate of the size of your model. In general you will prefer smaller models, provided they accomplish their goal successfully.
Notice that this is just informative and may well be below the actual model size needed to crash the kernel, so even if you exceed this reference you are probably fine. However, if the kernel crashes during training, or training is taking a very long time and your model is larger than the reference, come back here and try to get the number of parameters closer to the reference.
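Conceptually, that informative check boils down to something like this (the reference limit shown here is hypothetical; the notebook's frozen cell defines the real one):

```python
# Hypothetical reference value; the notebook's frozen cell defines the real one.
REFERENCE_PARAM_LIMIT = 5_000_000

total_params = model.count_params()
if total_params > REFERENCE_PARAM_LIMIT:
    print(f"Warning: your model has {total_params:,} parameters, "
          f"which exceeds the reference of {REFERENCE_PARAM_LIMIT:,}.")
else:
    print(f"Your model has {total_params:,} parameters. Looks reasonable.")
```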
Expected Output:
Where NUM_BATCHES is the globally defined variable and n_units is the number of units of the last layer of your model.
To pass this assignment your val_loss (validation loss) should either be flat or decreasing.
Although a flat val_loss and a lowering train_loss (or just loss) also indicate some overfitting, what you really want to avoid is having a lowering train_loss and an increasing val_loss.
With this in mind, the following three curves will be acceptable solutions:
[Figure: three examples of acceptable training curves, each with a flat or decreasing val_loss]
While the following would not be able to pass the grading:
[Figure: an unacceptable training curve, with train_loss decreasing while val_loss increases]
Run the next block of code to plot the metrics.
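If you want to reproduce the plot yourself, a minimal version (assuming history is the object returned by model.fit) looks like this:

```python
import matplotlib.pyplot as plt

# 'history' is assumed to be the object returned by model.fit(...).
epochs = range(len(history.history["loss"]))

plt.plot(epochs, history.history["loss"], label="train loss")
plt.plot(epochs, history.history["val_loss"], label="val loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```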
A more rigorous way of setting the passing threshold of this assignment is to use the slope of your val_loss curve.
To pass this assignment, the slope of your val_loss curve should be 0.0005 at maximum. You can test this by running the next cell:
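The provided cell performs this check for you; conceptually, the slope can be estimated with a degree-1 polynomial fit, roughly as follows:

```python
import numpy as np

# Fit a straight line to the validation-loss curve; with deg=1 the first
# coefficient returned by np.polyfit is the slope.
val_loss = history.history["val_loss"]
slope, _ = np.polyfit(np.arange(len(val_loss)), val_loss, deg=1)

print(f"val_loss slope: {slope:.5f}")
print("Pass" if slope <= 0.0005 else "Try a different architecture")
```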
If your model generated a validation loss curve that meets the criteria above, run the following cell and then submit your assignment for grading. Otherwise, try with a different architecture.
Congratulations on finishing this week's assignment!
You have successfully implemented a neural network capable of classifying sentiment in text data while doing a fairly good job of not overfitting! Nice job!
Keep it up!