Assignment 1: Sentiment with Deep Neural Networks
Welcome to the first assignment of course 3. In this assignment, you will explore sentiment analysis using deep neural networks.
Outline
In course 1, you implemented logistic regression and Naive Bayes for sentiment analysis. However, if you were to give your old models an example like:
Your model would have predicted a positive sentiment for that review. However, that sentence has a negative sentiment and indicates that the movie was not good. To solve those kinds of misclassifications, you will write a program that uses deep neural networks to identify sentiment in text. By completing this assignment, you will:
Understand how you can build/design a model using layers
Train a model using a training loop
Use a binary cross-entropy loss function
Compute the accuracy of your model
Predict using your own input
As you can tell, this model follows a similar structure to the one you previously implemented in the second course of this specialization.
Indeed, most of the deep nets you will be implementing will have a similar structure. The only things that change are the model architecture, the inputs, and the outputs. Before starting the assignment, we will introduce you to trax, the Google library that we use for building and training models.
Now we will show you how to compute the gradient of a certain function f by just using .grad(f).
Notice that trax.fastmath.numpy returns a DeviceArray from the jax library.
The gradient (derivative) of function f with respect to its input x is the derivative of $x^2$.
The derivative of $x^2$ is $2x$.
When x is 5, then $2x = 10$.
You can calculate the gradient of a function by using trax.fastmath.grad(fun=) and passing in the name of the function.
In this case the function you want to take the gradient of is f.
The object returned (saved in grad_f in this example) is a function that can calculate the gradient of f for a given trax.fastmath.numpy array.
The function returned by trax.fastmath.grad takes in x=5 and calculates the gradient of f, which is 2*x, which is 10. The value is also stored as a DeviceArray from the jax library.
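As a library-free sanity check of the value above, the sketch below estimates the same derivative with a central finite difference. It assumes f(x) = x**2, which is consistent with the gradient 2*x stated in the text; numerical_grad is a hypothetical helper and is not how trax.fastmath.grad works internally (Trax differentiates the function automatically rather than approximating it).

```python
def f(x):
    """Assumed definition, consistent with the gradient 2*x in the text."""
    return x ** 2


def numerical_grad(fun, x, h=1e-4):
    """Central-difference estimate of d(fun)/dx at x (hypothetical helper)."""
    return (fun(x + h) - fun(x - h)) / (2 * h)


approx = numerical_grad(f, 5.0)   # numerical estimate at x = 5
analytic = 2 * 5.0                # 2*x evaluated at x = 5
print(approx, analytic)
```

The two numbers should agree to several decimal places, which is a quick way to check any gradient function on a scalar input.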
Part 2: Importing the data
2.1 Loading in the data
Import the data set.
You may recognize this from earlier assignments in the specialization.
Now import a function that processes tweets (we've provided this in the utils.py file, where you can also find the details of process_tweet).
process_tweet removes unwanted elements from a tweet, e.g. the hash symbol, hyperlinks, and stock tickers.
It also returns a list of words (it tokenizes the original string).
Notice that the function process_tweet keeps key words, removes the hash # symbol, and ignores usernames (words that begin with '@').
2.2 Building the vocabulary
Now build the vocabulary.
Map each word in each tweet to an integer (an "index").
The following code does this for you, but please read it and understand what it's doing.
Note that you will build the vocabulary based on the training data.
To do so, you will assign an index to every word by iterating over your training set.
The vocabulary will also include some special tokens:
- __PAD__: padding
- __</e>__: end of line
- __UNK__: a token representing any word that is not in the vocabulary.
The dictionary Vocab will look like this:
Each unique word has a unique integer associated with it.
The total number of words in Vocab: 9088
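The vocabulary construction described above can be sketched as follows. This is a minimal illustration, not the notebook's own cell: build_vocab and the toy training set are hypothetical, and the exact spelling of the end-of-line token is an assumption.

```python
def build_vocab(tokenized_tweets):
    """Assign an integer index to every word seen in the training tweets."""
    # Special tokens come first so their IDs are fixed (spelling assumed).
    vocab = {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2}
    for tweet in tokenized_tweets:
        for word in tweet:
            if word not in vocab:
                # The next free index is the current size of the dictionary.
                vocab[word] = len(vocab)
    return vocab


# Hypothetical, already tokenized training tweets.
toy_train = [['happy', 'day'], ['sad', 'day']]
toy_vocab = build_vocab(toy_train)
print(toy_vocab)
```

Because only training tweets are passed in, any word that appears only in validation data has no entry and must later fall back to __UNK__.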
2.3 Converting a tweet to a tensor
Write a function that will convert each tweet to a tensor (a list of unique integer IDs representing the processed tweet).
Note, the returned data type will be a regular Python list().
You won't use TensorFlow in this function.
You also won't use a numpy array.
You also won't use a trax.fastmath.numpy array.
For words in the tweet that are not in the vocabulary, set them to the unique ID for the token __UNK__.
Example
Input a tweet:
The tweet_to_tensor function will first convert the tweet into a list of tokens (including only relevant words).
Then it will convert each word into its unique integer
Notice that the word "maria" is not in the vocabulary, so it is assigned the unique integer associated with the __UNK__ token, because it is considered "unknown."
Exercise 01
Instructions: Write a function tweet_to_tensor that takes in a tweet and converts it to a list of numbers. You can use the Vocab dictionary you just built to help create the tensor.
Use the vocab_dict parameter and not a global variable.
Do not hard code the integer value for the __UNK__ token.
Hints
- Map each word in the tweet to its corresponding integer in Vocab.
- Use Python's dict.get(key, default) so that the lookup returns a default value if the key is not found in the dictionary.
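The second hint can be illustrated with a small sketch. It operates on an already tokenized tweet, because process_tweet lives in utils.py and is not reproduced here; the function name tokens_to_ids and the toy vocabulary are hypothetical.

```python
def tokens_to_ids(tokens, vocab_dict, unk_token='__UNK__'):
    """Map each token to its integer ID, falling back to the unknown ID."""
    # Look the unknown ID up in the dictionary instead of hard coding it.
    unk_id = vocab_dict[unk_token]
    # dict.get returns unk_id whenever the word is missing from vocab_dict.
    return [vocab_dict.get(word, unk_id) for word in tokens]


# Hypothetical vocabulary; "maria" is deliberately absent.
toy_vocab = {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2, 'happy': 3, 'day': 4}
ids = tokens_to_ids(['happy', 'maria', 'day'], toy_vocab)
print(ids)  # [3, 2, 4]
```

The result is a plain Python list, which matches the return type required by the exercise.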
Expected output
All tests passed
2.4 Creating a batch generator
Most of the time in Natural Language Processing, and AI in general, we train models on batches of examples.
If instead of training with batches of examples, you were to train a model with one example at a time, it would take a very long time to train the model.
You will now build a data generator that takes in the positive/negative tweets and returns a batch of training examples. It returns the model inputs, the targets (positive or negative labels) and the weight for each target (e.g. this allows us to treat some examples as more important to get right than others, but commonly these will all be 1.0).
Once you create the generator, you could include it in a for loop
You can also get a single batch like this:
The generator returns the next batch each time it's called.
This generator returns the data in a format (tensors) that you could directly use in your model.
It returns a triple: the inputs, targets, and loss weights:
- Inputs is a tensor that contains the batch of tweets we put into the model.
- Targets is the corresponding batch of labels that we train to generate.
- Loss weights here are just 1s with the same shape as targets. Next week, you will use them to mask input padding.
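To make the shape of that triple concrete, here is a simplified sketch of a batch generator. It is an assumption-laden stand-in for the assignment's generator: toy_batch_generator is a hypothetical name, it makes a single pass without shuffling or looping, it takes tweets that are already lists of integer IDs, and it uses NumPy rather than trax.fastmath.numpy.

```python
import numpy as np


def toy_batch_generator(pos, neg, batch_size, pad_id=0):
    """Yield (inputs, targets, weights); half of each batch is positive."""
    half = batch_size // 2
    n_batches = min(len(pos), len(neg)) // half
    for b in range(n_batches):
        batch = pos[b * half:(b + 1) * half] + neg[b * half:(b + 1) * half]
        # Pad every tweet in the batch to the length of the longest one.
        max_len = max(len(t) for t in batch)
        inputs = np.array([t + [pad_id] * (max_len - len(t)) for t in batch])
        targets = np.array([1] * half + [0] * half)
        weights = np.ones_like(targets)  # every example counts equally
        yield inputs, targets, weights


# Hypothetical tweets, already converted to integer IDs.
pos = [[3, 4], [5]]
neg = [[6, 7, 8], [9]]
inputs, targets, weights = next(toy_batch_generator(pos, neg, batch_size=2))
print(inputs, targets, weights)
```

Padding with the ID reserved for __PAD__ is what lets tweets of different lengths share one rectangular array.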
Now you can use your data generator to create a data generator for the training data, and another data generator for the validation data.
We will create a third data generator that does not loop, for testing the final accuracy of the model.
Expected output
Now that you have your train/val generators, you can just call them and they will return tensors which correspond to your tweets in the first column and their corresponding labels in the second column. Now you can go ahead and start building your neural network.
Part 3: Defining classes
In this part, you will write your own library of layers. It will be very similar to the one used in Trax and also in Keras and PyTorch. Writing your own small framework will help you understand how they all work and use them effectively in the future.
Your framework will be based on the following Layer class from utils.py.
Hints
- Please use numpy.maximum(A,k) to find the maximum between each element in A and a scalar k
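A short illustration of the hint, using plain NumPy on a made-up array: numpy.maximum compares every element of A with the scalar and keeps the larger value, so with k = 0 all negative entries become zero.

```python
import numpy as np

# Hypothetical input with negative, zero, and positive entries.
A = np.array([[-2.0, -1.0, 0.0],
              [0.0, 1.0, 2.0]])

# Element-wise maximum between each entry of A and the scalar 0.
clipped = np.maximum(A, 0)
print(clipped)
```

The output has the same shape as A, which is what a layer's forward function needs when it is applied to a whole batch at once.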
Expected Output
3.2 Dense class
Exercise
Implement the forward function of the Dense class.
The forward function multiplies the input to the layer (x) by the weight matrix (W).
You can use numpy.dot to perform the matrix multiplication.
Note that for more efficient code execution, you will use the trax version of math, which includes a trax version of numpy and also random.
Implement the weight initializer new_weights function.
Weights are initialized with a random key.
The second parameter is a tuple for the desired shape of the weights (num_rows, num_cols)
The number of rows for weights should equal the number of columns in x, because for forward propagation, you will multiply x times weights.
Please use trax.fastmath.random.normal(key, shape, dtype=tf.float32) to generate random values for the weight matrix. The key difference between this function and the standard numpy randomness is the explicit use of random keys, which need to be passed. While it can look tedious at first sight to pass the random key everywhere, you will learn in Course 4 why this is very helpful when implementing some advanced models.
key can be generated by calling random.get_prng(seed=) and passing in a number for the seed.
shape is a tuple with the desired shape of the weight matrix.
The number of rows in the weight matrix should equal the number of columns in the variable x. Since x may have 2 dimensions if it represents a single training example (row, col), or three dimensions (batch_size, row, col), get the last dimension from the tuple that holds the dimensions of x.
The number of columns in the weight matrix is the number of units chosen for that dense layer. Look at the __init__ function to see which variable stores the number of units.
dtype is the data type of the values in the generated matrix; keep the default of tf.float32. In this case, don't explicitly set the dtype (just let it use the default value).
Set the standard deviation of the random values to 0.1.
The values generated have a mean of 0 and standard deviation of 1.
Set the default standard deviation stdev to 0.1 by multiplying each of the values in the weight matrix by the standard deviation.
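The shape and scaling rules above can be sketched with plain NumPy. This is not the assignment's Dense class: DenseSketch and its method names are hypothetical, and numpy.random.default_rng stands in for the explicit random keys of trax.fastmath.random.

```python
import numpy as np


class DenseSketch:
    """Hypothetical dense layer: output = x . W, with no bias term."""

    def __init__(self, n_units, init_stdev=0.1):
        self.n_units = n_units
        self.init_stdev = init_stdev
        self.weights = None

    def init_weights(self, input_shape, seed=0):
        rng = np.random.default_rng(seed)  # stand-in for a Trax random key
        # Rows match the last dimension of x, columns match the unit count.
        shape = (input_shape[-1], self.n_units)
        # Standard normal samples scaled down to a standard deviation of 0.1.
        self.weights = rng.normal(size=shape) * self.init_stdev
        return self.weights

    def forward(self, x):
        return np.dot(x, self.weights)


x = np.array([[2.0, 7.0, 25.0]])   # one example with three features
layer = DenseSketch(n_units=10)
layer.init_weights(x.shape)
out = layer.forward(x)
print(layer.weights.shape, out.shape)
```

Taking input_shape[-1] rather than input_shape[1] is what makes the same code work for both 2-dimensional and 3-dimensional inputs.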
Expected Output
3.3 Model
Now you will implement a classifier using neural networks. Here is the model architecture you will be implementing.
For the model implementation, you will use the Trax layers library tl. Note that the second character of tl is the lowercase of letter L, not the number 1. Trax layers are very similar to the ones you implemented above, but in addition to trainable weights they also have a non-trainable state. State is used in layers like batch normalization and for inference; you will learn more about it in course 4.
First, look at the code of the Trax Dense layer and compare to your implementation above.
tl.Dense: Trax Dense layer implementation
One other important layer that you will use a lot is one that lets you execute one layer after another in sequence.
tl.Serial: Combinator that applies layers serially.
You can pass in the layers as arguments to Serial, separated by commas.
For example: tl.Serial(tl.Embedding(...), tl.Mean(...), tl.Dense(...), tl.LogSoftmax(...))
Please use the help function to view documentation for each layer.
tl.Embedding: Layer constructor function for an embedding layer.
tl.Embedding(vocab_size, d_feature).
vocab_size is the number of unique words in the given vocabulary.
d_feature is the number of elements in the word embedding (some choices for a word embedding size range from 150 to 300, for example).
tl.Mean: Calculates means across an axis. In this case, please choose axis = 1 to get an average embedding vector (an embedding vector that is the average of the embeddings of all words in the tweet).
For example, if each word embedding has 300 elements and a padded tweet has, say, 20 words, the embedded batch has shape (batch_size, 20, 300); taking the mean along axis=1 averages over the words and yields one vector of 300 elements per tweet.
tl.LogSoftmax: Implements log softmax function
Here, you don't need to set any parameters for LogSoftmax().
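To show how the four layers fit together, the sketch below traces the shapes of one forward pass in plain NumPy. It is an illustration under stated assumptions, not the Trax model: the sizes are made up, the weights are random, the dense step omits the bias term, and log_softmax is a hand-written helper.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_feature, n_classes = 20, 4, 2   # made-up toy sizes

embedding = rng.normal(size=(vocab_size, d_feature))
dense_w = rng.normal(size=(d_feature, n_classes)) * 0.1


def log_softmax(z):
    """Numerically stable log softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


batch = np.array([[3, 4, 0], [6, 7, 8]])      # (batch_size, seq_len) word IDs
embedded = embedding[batch]                   # (batch_size, seq_len, d_feature)
averaged = embedded.mean(axis=1)              # (batch_size, d_feature)
log_probs = log_softmax(averaged @ dense_w)   # (batch_size, n_classes)
print(embedded.shape, averaged.shape, log_probs.shape)
```

Averaging over axis 1 removes the sequence dimension, so tweets of any padded length end up as a fixed-size vector before the dense layer.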
Online documentation
Expected Output
Part 4: Training
To train a model on a task, Trax defines an abstraction trax.supervised.training.TrainTask which packages the train data, loss and optimizer (among other things) together into an object.
Similarly, to evaluate a model, Trax defines an abstraction trax.supervised.training.EvalTask which packages the eval data and metrics (among other things) into another object.
The final piece tying things together is the trax.supervised.training.Loop abstraction, which is a very simple and flexible way to put everything together and train the model, all the while evaluating it and saving checkpoints. Using Loop will save you a lot of code compared to always writing the training loop by hand, like you did in courses 1 and 2. More importantly, you are less likely to have a bug in that code that would ruin your training.
Notice some available optimizers include:
This defines a model trained using tl.CrossEntropyLoss, optimized with the trax.optimizers.Adam optimizer, all the while tracking the accuracy using the tl.Accuracy metric. We also track tl.CrossEntropyLoss on the validation set.
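For intuition about what a cross-entropy loss measures on log-softmax outputs, here is a small NumPy sketch. It is a generic weighted cross-entropy written for illustration, not the source of the Trax layer; the function name and the example numbers are hypothetical.

```python
import numpy as np


def weighted_cross_entropy(log_probs, targets, weights):
    """Average negative log-probability assigned to the correct class."""
    # Pick, for each row, the log-probability of its target class.
    picked = log_probs[np.arange(len(targets)), targets]
    return -(picked * weights).sum() / weights.sum()


# Made-up model outputs for two tweets, expressed as log-probabilities.
log_probs = np.log(np.array([[0.9, 0.1],
                             [0.2, 0.8]]))
targets = np.array([0, 1])
weights = np.ones(2)

loss = weighted_cross_entropy(log_probs, targets, weights)
print(loss)
```

The loss approaches 0 as the probability given to the correct class approaches 1, and grows without bound as that probability approaches 0.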
Now let's make an output directory and train the model.
Expected output (Approximately)
4.2 Practice Making a prediction
Now that you have trained a model, you can access it as the training_loop.model object. We will actually use training_loop.eval_model, and in the next weeks you will learn why we sometimes use a different model for evaluation, e.g., one without dropout. For now, make predictions with your model.
Use the training data just to see how the prediction process works.
Later, you will use validation data to evaluate your model's performance.
To turn these probabilities into categories (negative or positive sentiment prediction), for each row:
Compare the probabilities in each column.
If column 1 has a value greater than column 0, classify that as a positive tweet.
Otherwise if column 1 is less than or equal to column 0, classify that example as a negative tweet.
Notice that since you are making a prediction using a training batch, it's more likely that the model's predictions match the actual targets (labels).
Every prediction that the tweet is positive also matches the actual target of 1 (positive sentiment).
Similarly, all predictions that the sentiment is not positive match the actual target of 0 (negative sentiment).
One more useful thing to know is how to check whether the prediction matches the actual target (label).
The result of calculating is_positive is a boolean.
The target has the type trax.fastmath.numpy.int32.
If you expect to be doing division, you may prefer to work with decimal numbers with the data type trax.fastmath.numpy.float32.
Note that Python usually does type conversion for you when you compare a boolean to an integer:
True compared to 1 is True; compared to any other integer it is False.
False compared to 0 is True; compared to any other integer it is False.
However, we recommend that you keep track of the data type of your variables to avoid unexpected outcomes. So it helps to convert the booleans into integers
Compare 1 to 1 rather than comparing True to 1.
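The comparison and the type conversions described above can be sketched with NumPy arrays standing in for trax.fastmath.numpy arrays. The prediction values and targets are made up for illustration.

```python
import numpy as np

# Made-up model outputs: column 0 is negative, column 1 is positive.
preds = np.array([[-0.1, -2.3],
                  [-1.6, -0.2],
                  [-0.7, -0.7]])
targets = np.array([0, 1, 1], dtype=np.int32)

# Boolean array: True where column 1 is strictly greater than column 0.
is_positive = preds[:, 1] > preds[:, 0]

# Convert explicitly so integers are compared with integers.
is_positive_int = is_positive.astype(np.int32)
correct = is_positive_int == targets

# Convert to float before averaging, since this involves division.
accuracy = correct.astype(np.float32).mean()
print(is_positive, is_positive_int, correct, accuracy)
```

The third row is a tie, so it is classified as negative, which is why it counts as a miss against its target of 1.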
Hopefully you are now familiar with what kinds of inputs and outputs the model uses when making a prediction.
This will help you implement a function that estimates the accuracy of the model's predictions.
Part 5: Evaluation
5.1 Computing the accuracy on a batch
You will now write a function that evaluates your model on the validation set and returns the accuracy.
preds contains the predictions.
Its dimensions are (batch_size, output_dim). output_dim is two in this case. Column 0 contains the probability that the tweet belongs to class 0 (negative sentiment). Column 1 contains the probability that it belongs to class 1 (positive sentiment).
If the probability in column 1 is greater than the probability in column 0, then interpret this as the model's prediction that the example has label 1 (positive sentiment).
Otherwise, if the probabilities are equal or the probability in column 0 is higher, the model's prediction is 0 (negative sentiment).
y contains the actual labels.
y_weights contains the weights to give to predictions.
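One way to combine these three inputs into a weighted accuracy is sketched below. The function name, its return values, and the example data are assumptions made for illustration, and NumPy stands in for trax.fastmath.numpy; the graded function in the assignment may differ in its details.

```python
import numpy as np


def compute_accuracy_sketch(preds, y, y_weights):
    """Return (accuracy, weighted number correct, sum of weights)."""
    # Predict class 1 only when column 1 is strictly greater than column 0.
    is_pos = (preds[:, 1] > preds[:, 0]).astype(np.int32)
    correct = (is_pos == y).astype(np.float32)
    weighted_correct = float((correct * y_weights).sum())
    sum_weights = float(y_weights.sum())
    return weighted_correct / sum_weights, weighted_correct, sum_weights


# Made-up batch of four predictions.
preds = np.array([[0.9, 0.1], [0.3, 0.7], [0.5, 0.5], [0.2, 0.8]])
y = np.array([0, 1, 1, 1])
y_weights = np.ones(4)

acc, n_correct, total = compute_accuracy_sketch(preds, y, y_weights)
print(acc, n_correct, total)
```

Returning the weighted count and the weight sum alongside the ratio makes it possible to aggregate over several batches without averaging averages.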
Expected output (Approximately)
5.2 Testing your model on Validation Data
Now you will test your model's prediction accuracy on validation data.
This program will take in a data generator and your model.
The generator allows you to get batches of data. You can use it with a for loop:
Each batch is a tuple (X, Y, weights).
Element 0 corresponds to the tweets as a tensor (input).
Element 1 corresponds to the targets (actual labels, positive or negative sentiment).
Element 2 corresponds to the associated example weights.
You can feed the tweet into model and it will return the predictions for the batch.
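The evaluation loop described above can be sketched as follows. Everything here is a hypothetical stand-in: evaluate_sketch is not the assignment's function, toy_model is a fake model that looks only at the first token ID, and the batches are hand-written NumPy arrays rather than the output of the real validation generator.

```python
import numpy as np


def evaluate_sketch(generator, model):
    """Accumulate weighted correct predictions over every batch."""
    total_correct, total_weight = 0.0, 0.0
    for inputs, targets, weights in generator:
        preds = model(inputs)  # shape (batch_size, 2)
        is_pos = (preds[:, 1] > preds[:, 0]).astype(np.int32)
        total_correct += float(((is_pos == targets) * weights).sum())
        total_weight += float(weights.sum())
    return total_correct / total_weight


def toy_model(inputs):
    """Fake model: predicts positive when the first token ID is even."""
    score = (inputs[:, 0] % 2 == 0).astype(np.float32)
    return np.stack([1.0 - score, score], axis=1)


# Two hand-written batches of (inputs, targets, weights).
batches = [
    (np.array([[2, 5], [3, 1]]), np.array([1, 0]), np.ones(2)),
    (np.array([[4, 0], [6, 7]]), np.array([1, 0]), np.ones(2)),
]
accuracy = evaluate_sketch(iter(batches), toy_model)
print(accuracy)
```

Summing the counts across batches before dividing gives the correct overall accuracy even when the last batch is smaller than the others.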
Expected Output (Approximately)
Notice that the model works well even for complex sentences.
On Deep Nets
Deep nets allow you to capture dependencies that you would not have been able to capture with a simple linear regression or logistic regression.
They also allow you to make better use of pre-trained embeddings for classification and tend to generalize better.