Assignment 4: Word Embeddings
Welcome to the fourth (and last) programming assignment of Course 2!
In this assignment, you will practice how to compute word embeddings and use them for sentiment analysis.
To implement sentiment analysis, you can go beyond counting the number of positive words and negative words.
You can find a way to represent each word numerically, by a vector.
The vector could then represent syntactic (i.e. parts of speech) and semantic (i.e. meaning) structures.
In this assignment, you will explore a classic way of generating word embeddings or representations.
You will implement a famous model called the continuous bag of words (CBOW) model.
By completing this assignment you will:
Train word vectors from scratch.
Learn how to create batches of data.
Understand how backpropagation works.
Plot and visualize your learned word vectors.
Knowing how to train these models will give you a better understanding of word vectors, which are building blocks to many applications in natural language processing.
1. The Continuous bag of words model
Let's take a look at the following sentence:
'I am happy because I am learning'.
In continuous bag of words (CBOW) modeling, we try to predict the center word given a few context words (the words around the center word).
For example, if you were to choose a context half-size of, say, $C = 2$, then you would try to predict the word happy given the context that includes 2 words before and 2 words after the center word:
words before: [I, am]
words after: [because, I]
In other words, the context is [I, am, because, I] and the center word to be predicted is happy.
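Purely as an illustration (the function name get_context_target is hypothetical and not part of the assignment's provided code), a small sliding-window sketch can generate such (context, center word) pairs for a half-size C:

```python
def get_context_target(words, C=2):
    """Yield (context_words, center_word) pairs using a half-window of size C."""
    for i in range(C, len(words) - C):
        center = words[i]
        context = words[i - C:i] + words[i + 1:i + C + 1]
        yield context, center

sentence = "I am happy because I am learning".split()
for context, center in get_context_target(sentence):
    print(context, "->", center)
# ['I', 'am', 'because', 'I'] -> happy
# ['am', 'happy', 'I', 'am'] -> because
# ['happy', 'because', 'am', 'learning'] -> I
```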
The structure of your model will look like this:

Where $\bar x$ is the average of all the one-hot vectors of the context words.

Once you have encoded all the context words, you can use this average vector $\bar x$ as the input to your model.
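As a sketch of this encoding step (the helper names word_to_one_hot_vector and context_words_to_vector are assumptions, not necessarily the assignment's own helpers), the averaging can be done with numpy as follows:

```python
import numpy as np

def word_to_one_hot_vector(word, word2Ind, V):
    """Return a one-hot vector of length V for the given word."""
    one_hot = np.zeros(V)
    one_hot[word2Ind[word]] = 1
    return one_hot

def context_words_to_vector(context_words, word2Ind, V):
    """Average the one-hot vectors of the context words to get x_bar."""
    vectors = [word_to_one_hot_vector(w, word2Ind, V) for w in context_words]
    return np.mean(vectors, axis=0)

word2Ind = {'am': 0, 'because': 1, 'happy': 2, 'i': 3, 'learning': 4}
x_bar = context_words_to_vector(['i', 'am', 'because', 'i'], word2Ind, V=5)
print(x_bar)  # [0.25 0.25 0.   0.5  0.  ]
```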
The architecture you will be implementing is as follows:
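In equation form, a sketch of the forward pass (the ReLU hidden activation is assumed here, consistent with the numpy.maximum hint given later in this notebook):

$$
\begin{align}
h &= W_1 \, \bar{x} + b_1 \tag{1} \\
a &= \mathrm{ReLU}(h) \tag{2} \\
z &= W_2 \, a + b_2 \tag{3} \\
\hat{y} &= \mathrm{softmax}(z) \tag{4}
\end{align}
$$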
Mapping words to indices and indices to words
We provide a helper function to create a dictionary that maps words to indices and indices to words.
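For intuition only, a minimal sketch of what such a mapping could look like (the name get_dict and the sorted-vocabulary convention are assumptions; in the assignment you should use the provided helper):

```python
def get_dict(data):
    """Map each distinct word to an index and each index back to its word (sketch)."""
    words = sorted(set(data))
    word2Ind = {word: i for i, word in enumerate(words)}
    Ind2word = {i: word for word, i in word2Ind.items()}
    return word2Ind, Ind2word

word2Ind, Ind2word = get_dict("i am happy because i am learning".split())
print(word2Ind['happy'], Ind2word[0])  # 2 am
```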
2. Training the Model
Initializing the model
You will now initialize two matrices and two vectors.
The first matrix ($W_1$) is of dimension $N \times V$, where $V$ is the number of words in your vocabulary and $N$ is the dimension of your word vector.
The second matrix ($W_2$) is of dimension $V \times N$.
Vector $b_1$ has dimensions $N \times 1$.
Vector $b_2$ has dimensions $V \times 1$.
$b_1$ and $b_2$ are the bias vectors of the linear layers from matrices $W_1$ and $W_2$.
The overall structure of the model will look as in Figure 1, but at this stage we are just initializing the parameters.
Exercise 01
Please use numpy.random.rand to generate matrices that are initialized with random values from a uniform distribution, ranging between 0 and 1.
Note: In the next cell you will encounter a random seed. Please DO NOT modify this seed so your solution can be tested correctly.
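A minimal sketch of such an initialization, assuming the function is called initialize_model and takes an optional seed (the exact signature and seed value used in the graded cell may differ):

```python
import numpy as np

def initialize_model(N, V, random_seed=None):
    """Initialize W1 (N x V), W2 (V x N), b1 (N x 1), b2 (V x 1) from Uniform[0, 1)."""
    if random_seed is not None:
        np.random.seed(random_seed)
    W1 = np.random.rand(N, V)
    W2 = np.random.rand(V, N)
    b1 = np.random.rand(N, 1)
    b2 = np.random.rand(V, 1)
    return W1, W2, b1, b2

# illustrative seed only, not necessarily the one fixed in the assignment cell
W1, W2, b1, b2 = initialize_model(N=4, V=10, random_seed=1)
print(W1.shape, W2.shape, b1.shape, b2.shape)  # (4, 10) (10, 4) (4, 1) (10, 1)
```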
Expected Output
2.1 Softmax
Before we can start training the model, we need to implement the softmax function as defined in equation 5:

$$ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=0}^{V-1} e^{z_j}} \tag{5} $$

Array indexing in code starts at 0.
$V$ is the number of words in the vocabulary (which is also the number of rows of $z$).
The index $i$ goes from 0 to $V - 1$.
Exercise 02
Instructions: Implement the softmax function below.
- Assume that the input to softmax is a 2D array. Each training example is represented by a column of shape (V, 1) in this 2D array.
- There may be more than one column in the 2D array, because you can put in a batch of examples to increase efficiency. Let's call the batch size lowercase m, so the array z has shape (V, m).
- When taking the sum over the V entries of each column (from index 0 to V - 1), take the sum for each column (each example) separately.
- Please use numpy.exp and numpy.sum (set the axis so that you take the sum of each column in z).
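A minimal sketch of such a column-wise softmax, following the numpy.exp / numpy.sum hints above (no extra numerical-stability tricks, which the instructions do not ask for):

```python
import numpy as np

def softmax(z):
    """Column-wise softmax for a 2D array z of shape (V, m)."""
    e_z = np.exp(z)
    return e_z / np.sum(e_z, axis=0, keepdims=True)

z = np.array([[9.0, 8.0, 11.0, 10.0, 8.5]]).T  # one example, shape (5, 1)
print(softmax(z).sum(axis=0))  # each column sums to 1.0
```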
Expected Output
Hints
- You can use numpy.maximum(x1,x2) to get the maximum of two values
- Use numpy.dot(A,B) to matrix multiply A and B
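These hints refer to the forward pass of the model (equations 1-3 above). A minimal sketch, assuming the function is named forward_prop and takes the averaged input x of shape (V, m) together with the parameters from initialize_model:

```python
import numpy as np

def forward_prop(x, W1, W2, b1, b2):
    """Forward pass of the CBOW model for input x of shape (V, m)."""
    h = np.dot(W1, x) + b1   # equation 1: hidden pre-activation, shape (N, m)
    a = np.maximum(0, h)     # equation 2: ReLU, using the numpy.maximum hint
    z = np.dot(W2, a) + b2   # equation 3: logits, shape (V, m)
    return z, a
```

Applying the softmax above to z then gives ŷ (equation 4); the graded function's exact return values may differ.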
Expected output
Expected output
Expected output
Gradient Descent
Exercise 05
Now that you have implemented a function to compute the gradients, you will implement batch gradient descent over your training set.
Hint: For that, you will use initialize_model and the back_prop functions which you just created (and the compute_cost function). You can also use the provided get_batches helper function:
for x, y in get_batches(data, word2Ind, V, C, batch_size):
...
Also: print the cost after each batch is processed (use batch size = 128)
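A minimal sketch of such a training loop, assuming a function named gradient_descent, a learning rate alpha, and that back_prop returns the gradients in the same order as the parameters (the provided back_prop and compute_cost signatures may differ from what is shown here):

```python
def gradient_descent(data, word2Ind, N, V, num_iters, C=2, batch_size=128, alpha=0.03):
    """Batch gradient descent over the CBOW parameters (sketch; helper signatures are assumed)."""
    W1, W2, b1, b2 = initialize_model(N, V)
    iters = 0
    for x, y in get_batches(data, word2Ind, V, C, batch_size):
        z, h = forward_prop(x, W1, W2, b1, b2)
        yhat = softmax(z)
        cost = compute_cost(y, yhat, batch_size)
        print(f"iters: {iters + 1} cost: {cost:.6f}")  # print the cost after each batch
        grad_W1, grad_W2, grad_b1, grad_b2 = back_prop(x, yhat, y, h, W1, W2, b1, b2, batch_size)
        # gradient-descent update of each parameter
        W1 -= alpha * grad_W1
        W2 -= alpha * grad_W2
        b1 -= alpha * grad_b1
        b2 -= alpha * grad_b2
        iters += 1
        if iters == num_iters:
            break
    return W1, W2, b1, b2
```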
Expected Output
Your numbers may differ a bit depending on which version of Python you're using.
You can see that man and king are next to each other. However, we have to be careful with the interpretation of these projected word vectors, since the plot depends on the PCA projection -- as shown in the following illustration.
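As a rough illustration of how such a 2D view can be produced (a generic SVD-based PCA with numpy and matplotlib; the assignment's own PCA helper and the way the embeddings are extracted from W1 and W2 may differ):

```python
import numpy as np
import matplotlib.pyplot as plt

def pca_2d(X):
    """Project the rows of X onto their first two principal components (generic sketch)."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

words = ['king', 'queen', 'man', 'woman', 'happy', 'sad']
# stand-in for the learned word vectors; in practice these come from the trained W1 and/or W2
embeddings = np.random.rand(len(words), 50)
points = pca_2d(embeddings)

plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(words, points):
    plt.annotate(word, (x, y))
plt.show()
```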