Assignment 4 - Naive Machine Translation and LSH
You will now implement your first machine translation system and then you will see how locality sensitive hashing works. Let's get started by importing the required functions!
If you are running this notebook on your local computer, don't forget to download the twitter samples and stopwords from nltk.
NOTE: The Exercise xx numbers in this assignment are inconsistent with the UNQ_Cx numbers.
This assignment covers the following topics:
1. The word embeddings data for English and French words
Write a program that translates English to French.
The data
The full dataset for English embeddings is about 3.64 gigabytes, and the French embeddings are about 629 megabytes. To prevent the Coursera workspace from crashing, we've extracted a subset of the embeddings for the words that you'll use in this assignment.
If you want to run this on your local computer and use the full dataset, you can download:
the English embeddings from the Google code archive word2vec project; look for GoogleNews-vectors-negative300.bin.gz
(You'll need to unzip the file first.)
and the French embeddings from cross_lingual_text_classification.
In the terminal, type (on one line):
curl -o ./wiki.multi.fr.vec https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.fr.vec
Then copy-paste the code below and run it.
The subset of data
To do the assignment on the Coursera workspace, we'll use the subset of word embeddings.
Look at the data
en_embeddings_subset: the key is an English word, and the value is a 300-dimensional array, which is the embedding for that word.
fr_embeddings_subset: the key is a French word, and the value is a 300-dimensional array, which is the embedding for that word.
Load two dictionaries mapping English to French words:
a training dictionary,
and a testing dictionary.
Looking at the English French dictionary
en_fr_train is a dictionary where the key is the English word and the value is the French translation of that English word.
en_fr_test is similar to en_fr_train, but is a test set. We won't look at it until we get to testing.
1.1 Generate embedding and transform matrices
Exercise 01: Translating English dictionary to French by using embeddings
You will now implement a function get_matrices, which takes the loaded data and returns matrices X and Y.
Inputs:
en_fr: English to French dictionary
en_embeddings: English to embeddings dictionary
fr_embeddings: French to embeddings dictionary
Returns:
Matrix X and matrix Y, where each row in X is the word embedding for an English word, and the same row in Y is the word embedding for the French version of that English word.

Use the en_fr dictionary to ensure that the ith row in the X matrix corresponds to the ith row in the Y matrix.
Instructions: Complete the function get_matrices():
Iterate over the English words in the en_fr dictionary.
Check whether each word has both an English and a French embedding.
Hints
- Sets are useful data structures that can be used to check if an item is a member of a group.
- You can get the words that have embeddings by using the dictionary's keys method.
- Keep the vectors for `X` and `Y` in lists, in matching order. You can use np.vstack() to merge them into a numpy matrix.
- numpy.vstack stacks the items in a list as rows in a matrix.
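For reference, here is a minimal sketch of how `get_matrices()` might look, assuming `en_fr`, `en_embeddings`, and `fr_embeddings` are plain Python dictionaries as described above (the exact signature in the graded cell may differ):

```python
import numpy as np

def get_matrices(en_fr, en_embeddings, fr_embeddings):
    """Build matrices X and Y whose i-th rows hold the embedding of an
    English word and of its French translation, respectively."""
    X_l, Y_l = [], []                          # lists of row vectors
    english_set = set(en_embeddings.keys())    # words with English embeddings
    french_set = set(fr_embeddings.keys())     # words with French embeddings

    for en_word, fr_word in en_fr.items():
        # keep the pair only if both words have an embedding
        if en_word in english_set and fr_word in french_set:
            X_l.append(en_embeddings[en_word])
            Y_l.append(fr_embeddings[fr_word])

    # stack the row vectors into (m, 300) matrices
    X = np.vstack(X_l)
    Y = np.vstack(Y_l)
    return X, Y
```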
Now we will use the function get_matrices() to obtain the matrices X_train and Y_train of English and French word embeddings in the corresponding vector spaces.
2. Translations

Write a program that translates English words to French words using word embeddings and vector space models.
2.1 Translation as linear transformation of embeddings
Given dictionaries of English and French word embeddings, you will create a transformation matrix R.
Given an English word embedding, $\mathbf{e}$, you can multiply $\mathbf{eR}$ to get a new word embedding $\mathbf{f}$.
Both $\mathbf{e}$ and $\mathbf{f}$ are row vectors.
You can then compute the nearest neighbors to $\mathbf{f}$ in the French embeddings and recommend the word that is most similar to the transformed word embedding.
Describing translation as the minimization problem
Find a matrix $\mathbf{R}$ that minimizes the following equation:
$$\arg \min_{\mathbf{R}} \| \mathbf{X R} - \mathbf{Y} \|_{F}$$
Frobenius norm
The Frobenius norm of a matrix $\mathbf{A}$ (assuming it is of dimension $m \times n$) is defined as the square root of the sum of the absolute squares of its elements:
$$\|\mathbf{A}\|_{F} \equiv \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^{2}}$$
Actual loss function
In real-world applications, the Frobenius norm loss
$$\| \mathbf{XR} - \mathbf{Y}\|_{F}$$
is often replaced by its squared value divided by $m$:
$$\frac{1}{m} \| \mathbf{X R} - \mathbf{Y} \|_{F}^{2}$$
where $m$ is the number of examples (rows in $\mathbf{X}$).
The same R is found when using this loss function versus the original Frobenius norm.
The reason for taking the square is that it's easier to compute the gradient of the squared Frobenius norm.
The reason for dividing by $m$ is that we're more interested in the average loss per embedding than in the loss for the entire training set.
The loss for the entire training set increases with more words (training examples), so taking the average helps us track the average loss regardless of the size of the training set.
[Optional] Detailed explanation why we use norm squared instead of the norm:
Click for optional details
- The norm is always nonnegative (we're summing up absolute values), and so is the square.
- When we take the square of all non-negative (positive or zero) numbers, the order of the data is preserved.
- For example, if 3 > 2, 3^2 > 2^2
- Using the norm or squared norm in gradient descent results in the same location of the minimum.
- Squaring cancels the square root in the Frobenius norm formula. Because of the chain rule, we would have to do more calculations if we had a square root in our expression for summation.
- Dividing the function value by a positive number doesn't change the optimum of the function, for the same reason as described above.
- We're interested in transforming English embeddings into French ones. Thus, it is more important to measure the average loss per embedding than the loss for the entire dictionary (which increases as the number of words in the dictionary increases).
Exercise 02: Implementing the translation mechanism described in this section.
Step 1: Computing the loss
The loss function will be the squared Frobenius norm of the difference between the matrix and its approximation, divided by the number of training examples $m$.
Its formula is:
$$L(X, Y, R) = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( a_{ij} \right)^{2}$$
where $a_{ij}$ is the value in the $i$th row and $j$th column of the matrix $\mathbf{XR} - \mathbf{Y}$.
Instructions: complete the compute_loss() function.
Compute the approximation of Y by matrix multiplying X and R.
Compute the difference XR - Y.
Compute the squared Frobenius norm of the difference and divide it by $m$.
Hints
- Useful functions: Numpy dot , Numpy sum, Numpy square, Numpy norm
- Be careful about which operation is elementwise and which operation is a matrix multiplication.
- Try to use matrix operations instead of the numpy norm function. If you choose to use the norm function, take care of its extra arguments, and make sure your function returns the squared loss, not the loss itself.
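A minimal sketch of `compute_loss()` under these instructions, assuming `X`, `Y`, and `R` are numpy arrays of compatible shapes:

```python
import numpy as np

def compute_loss(X, Y, R):
    """Squared Frobenius norm of (XR - Y), averaged over the m examples."""
    m = X.shape[0]               # number of training examples
    diff = np.dot(X, R) - Y      # approximation error, shape (m, n)
    loss = np.sum(np.square(diff)) / m
    return loss
```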
Exercise 03
Step 2: Computing the gradient of the loss with respect to the transform matrix R
Calculate the gradient of the loss with respect to the transform matrix R.
The gradient is a matrix that encodes how much a small change in R affects the change in the loss function.
The gradient gives us the direction in which we should change R to decrease the loss.
$m$ is the number of training examples (number of rows in $\mathbf{X}$).
The formula for the gradient of the loss function $L(X, Y, R)$ is:
$$\frac{d}{dR} L(X, Y, R) = \frac{2}{m} \left( \mathbf{X}^{T} (\mathbf{XR} - \mathbf{Y}) \right)$$
Instructions: Complete the compute_gradient function below.
Hints
- Transposing in numpy
- Finding out the dimensions of matrices in numpy
- Remember to use numpy.dot for matrix multiplication
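A minimal sketch of `compute_gradient()` implementing the formula above, again assuming numpy arrays of compatible shapes:

```python
import numpy as np

def compute_gradient(X, Y, R):
    """Gradient of the loss with respect to R: (2/m) * X^T (XR - Y)."""
    m = X.shape[0]                                        # number of training examples
    gradient = np.dot(X.T, np.dot(X, R) - Y) * (2.0 / m)
    return gradient
```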
Step 3: Finding the optimal R with gradient descent algorithm
Gradient descent
Gradient descent is an iterative algorithm which is used to search for the minimum of a function.
Earlier, we mentioned that the gradient of the loss with respect to the matrix encodes how much a tiny change in some coordinate of that matrix affects the loss function.
Gradient descent uses that information to iteratively change the matrix R until we reach a point where the loss is minimized.
Training with a fixed number of iterations
Most of the time we iterate for a fixed number of training steps rather than iterating until the loss falls below a threshold.
OPTIONAL: explanation for fixed number of iterations
click here for detailed discussion
- You cannot rely on training loss getting low -- what you really want is the validation loss to go down, or validation accuracy to go up. And indeed - in some cases people train until validation accuracy reaches a threshold, or -- commonly known as "early stopping" -- until the validation accuracy starts to go down, which is a sign of over-fitting.
- Why not always do "early stopping"? Well, mostly because well-regularized models on larger data-sets never stop improving. Especially in NLP, you can often continue training for months and the model will keep getting slightly better. This is also the reason why it's hard to just stop at a threshold -- unless there's an external customer setting the threshold, why stop, and where do you put the threshold?
- Stopping after a certain number of steps has the advantage that you know how long your training will take - so you can keep some sanity and not train for months. You can then try to get the best performance within this time budget. Another advantage is that you can fix your learning rate schedule -- e.g., lower the learning rate at 10% before finish, and then again more at 1% before finishing. Such learning rate schedules help a lot, but are harder to do if you don't know how long you're training.
Pseudocode:
Calculate the gradient $g$ of the loss with respect to the matrix $R$.
Update $R$ with the formula:
$$R_{\text{new}} = R_{\text{old}} - \alpha g$$
where $\alpha$ is the learning rate, which is a scalar.
Learning rate
The learning rate or "step size" $\alpha$ is a coefficient which decides how much we want to change $R$ in each step.
If we change $R$ too much, we could skip the optimum by taking too large of a step.
If we make only small changes to $R$, we will need many steps to reach the optimum.
The learning rate $\alpha$ is used to control those changes.
Values of $\alpha$ are chosen depending on the problem, and we'll use the learning_rate provided in the code as the default value for our algorithm.
Hints
- Use the 'compute_gradient()' function to get the gradient in each step
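A minimal sketch of the gradient-descent loop, here written as `align_embeddings()`; it reuses the `compute_loss()` and `compute_gradient()` sketches above, and the default `train_steps` and `learning_rate` values are illustrative assumptions, not necessarily the grader's settings:

```python
import numpy as np

def align_embeddings(X, Y, train_steps=400, learning_rate=0.0003):
    """Plain gradient descent on R. The step count and learning rate here
    are illustrative defaults only."""
    np.random.seed(129)
    R = np.random.rand(X.shape[1], X.shape[1])   # random (300, 300) starting point

    for i in range(train_steps):
        if i % 25 == 0:
            print(f"loss at iteration {i} is: {compute_loss(X, Y, R):.4f}")
        gradient = compute_gradient(X, Y, R)     # direction of steepest ascent
        R -= learning_rate * gradient            # take a small step downhill
    return R
```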
Expected Output:
Calculate the transformation matrix R
Using the training set, find the transformation matrix by calling the function align_embeddings().
NOTE: The code cell below will take a few minutes to fully execute (~3 mins)
Expected Output
2.2 Testing the translation
k-Nearest neighbors algorithm
k-NN is a method which takes a vector as input and finds the other vectors in the dataset that are closest to it.
The 'k' is the number of "nearest neighbors" to find (e.g. k=2 finds the closest two neighbors).
Searching for the translation embedding
Since we're approximating the translation function from English to French embeddings with a linear transformation matrix $\mathbf{R}$, most of the time we won't get the exact embedding of a French word when we transform the embedding of some particular English word into the French embedding space.
This is where $k$-NN becomes really useful! By using $k$-NN with $\mathbf{eR}$ as input, we can search for an embedding $\mathbf{f}$ (as a row) in the matrix $\mathbf{Y}$ which is the closest to the transformed vector $\mathbf{eR}$.
Cosine similarity
The cosine similarity between vectors $u$ and $v$ is calculated as the cosine of the angle between them. The formula is
$$\cos(u,v) = \frac{u \cdot v}{\left\| u \right\| \left\| v \right\|}$$
$\cos(u,v) = 1$ when $u$ and $v$ lie on the same line and have the same direction.
$\cos(u,v)$ is $-1$ when they have exactly opposite directions.
$\cos(u,v)$ is $0$ when the vectors are orthogonal (perpendicular) to each other.
Note: Distance and similarity are pretty much opposite things.
We can obtain a distance metric from cosine similarity, but the cosine similarity can't be used directly as a distance metric.
When the cosine similarity increases (towards $1$), the "distance" between the two vectors decreases (towards $0$).
We can define the cosine distance between $u$ and $v$ as
$$d_{\text{cos}}(u,v) = 1 - \cos(u,v)$$
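As a quick illustration (not the implementation imported by the assignment), cosine similarity and cosine distance can be computed directly with numpy:

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)"""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
print(cosine_similarity(u, v))      # 0.0 -> orthogonal vectors
print(1 - cosine_similarity(u, v))  # 1.0 -> their cosine distance
```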
Exercise 05: Complete the function nearest_neighbor()
Inputs:
Vector v,
A set of possible nearest neighbors candidates,
k nearest neighbors to find.
The distance metric should be based on cosine similarity.
The cosine_similarity function is already implemented and imported for you. Its arguments are two vectors and it returns the cosine of the angle between them.
Iterate over the rows in candidates, and save the similarity between the current row and vector v in a python list. Take care that the similarities are in the same order as the row vectors of candidates.
Now you can use numpy argsort to sort the indices for the rows of candidates.
Hints
- numpy.argsort sorts values from most negative to most positive (smallest to largest)
- The candidates that are nearest to 'v' should have the highest cosine similarity
- To get the last element of a list 'tmp', the notation is tmp[-1:]
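A minimal sketch of `nearest_neighbor()` following these hints; it assumes a `cosine_similarity(u, v)` function like the one imported in the assignment:

```python
import numpy as np

def nearest_neighbor(v, candidates, k=1):
    """Return the indices of the k rows of `candidates` most similar to v."""
    similarity_l = []
    for row in candidates:                       # one similarity per candidate row
        similarity_l.append(cosine_similarity(v, row))
    sorted_ids = np.argsort(similarity_l)        # indices, ascending by similarity
    k_idx = sorted_ids[-k:]                      # last k indices = most similar
    return k_idx
```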
Expected Output:
[[9 9 9] [1 0 5] [2 0 1]]
Test your translation and compute its accuracy
Exercise 06: Complete the function test_vocabulary,
which takes in the English embedding matrix $X$, the French embedding matrix $Y$ and the matrix $R$, and returns the accuracy of translations from $X$ to $Y$ by $R$.
Iterate over the transformed English word embeddings and check if the closest French word vector belongs to the French word that is the actual translation.
Obtain the index of the closest French embedding by using nearest_neighbor (with argument k=1), and compare it to the index of the English embedding you have just transformed.
Keep track of the number of times you get the correct translation.
Calculate accuracy as
$$\text{accuracy} = \frac{\#(\text{correct predictions})}{\#(\text{total predictions})}$$
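A minimal sketch of `test_vocabulary()` along these lines, reusing the `nearest_neighbor()` sketch above:

```python
import numpy as np

def test_vocabulary(X, Y, R):
    """Fraction of rows of X whose transformed embedding XR is nearest to
    the correct French embedding (the row with the same index in Y)."""
    pred = np.dot(X, R)                                  # transformed English embeddings
    num_correct = 0
    for i in range(len(pred)):
        pred_idx = nearest_neighbor(pred[i], Y, k=1)[0]  # index of the closest row of Y
        if pred_idx == i:                                # correct translation found
            num_correct += 1
    accuracy = num_correct / len(pred)
    return accuracy
```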
Let's see how your translation mechanism works on the unseen data:
Expected Output:
You managed to translate words from one language to another without ever seeing them, with almost 56% accuracy, by using some basic linear algebra and learning a mapping of words from one language to another!
3. LSH and document search
In this part of the assignment, you will implement a more efficient version of k-nearest neighbors using locality sensitive hashing. You will then apply this to document search.
Process the tweets and represent each tweet as a vector (represent a document with a vector embedding).
Use locality sensitive hashing and k nearest neighbors to find tweets that are similar to a given tweet.
3.1 Getting the document embeddings
Bag-of-words (BOW) document models
Text documents are sequences of words.
The ordering of words makes a difference. For example, sentences "Apple pie is better than pepperoni pizza." and "Pepperoni pizza is better than apple pie" have opposite meanings due to the word ordering.
However, for some applications, ignoring the order of words can allow us to train an efficient and still effective model.
This approach is called the bag-of-words document model.
Document embeddings
Document embedding is created by summing up the embeddings of all words in the document.
If we don't know the embedding of some word, we can ignore that word.
Exercise 07: Complete the get_document_embedding() function.
The function get_document_embedding() encodes an entire document as a "document" embedding.
It takes in a document (as a string) and a dictionary, en_embeddings.
It processes the document, and looks up the corresponding embedding of each word.
It then sums them up and returns the sum of all word vectors of that processed tweet.
Hints
- You can handle missing words easier by using the `get()` method of the python dictionary instead of the bracket notation (i.e. "[ ]"). See more about it here
- The default value for missing word should be the zero vector. Numpy will broadcast simple 0 scalar into a vector of zeros during the summation.
- Alternatively, skip the addition if a word is not in the dictionary.
- You can use your `process_tweet()` function which allows you to process the tweet. The function just takes in a tweet and returns a list of words.
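A minimal sketch of `get_document_embedding()` following these hints; it assumes `process_tweet()` is available and that the embeddings are 300-dimensional numpy arrays:

```python
import numpy as np

def get_document_embedding(tweet, en_embeddings):
    """Sum the embeddings of all processed words in the tweet.
    Words without an embedding contribute the zero vector."""
    doc_embedding = np.zeros(300)            # 300-dimensional accumulator
    processed_doc = process_tweet(tweet)     # clean and tokenize the tweet
    for word in processed_doc:
        # get() returns 0 for missing words; 0 broadcasts during the addition
        doc_embedding = doc_embedding + en_embeddings.get(word, 0)
    return doc_embedding
```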
Expected output:
Expected Output
Expected Output
3.3 Finding the most similar tweets with LSH
You will now implement locality sensitive hashing (LSH) to identify the most similar tweet.
Instead of looking at all 10,000 vectors, you can just search a subset to find its nearest neighbors.
Let's say your data points are plotted like this:

You can divide the vector space into regions and search within one region for nearest neighbors of a given vector.

Choosing the number of planes
Each plane divides the space into $2$ parts.
So $n$ planes divide the space into $2^{n}$ hash buckets.
We want to organize the 10,000 document vectors into buckets so that every bucket has about $16$ vectors.
For that we need $\frac{10000}{16} = 625$ buckets.
We're interested in $n$, the number of planes, so that $2^{n} = 625$. Now, we can calculate $n = \log_{2} 625 = 9.29 \approx 10$.
3.4 Getting the hash number for a vector
For each vector, we need to get a unique number associated to that vector in order to assign it to a "hash bucket".
Hyperplanes in vector spaces
In a $3$-dimensional vector space, the hyperplane is a regular plane. In a $2$-dimensional vector space, the hyperplane is a line.
Generally, the hyperplane is a subspace which has dimension $1$ lower than the original vector space.
A hyperplane is uniquely defined by its normal vector.
The normal vector of the plane is the vector to which all vectors in the plane are orthogonal (perpendicular, in the $3$-dimensional case).
Using Hyperplanes to split the vector space
We can use a hyperplane to split the vector space into $2$ parts.
All vectors whose dot product with a plane's normal vector is positive are on one side of the plane.
All vectors whose dot product with the plane's normal vector is negative are on the other side of the plane.
Encoding hash buckets
For a vector, we can take its dot product with all the planes, then encode this information to assign the vector to a single hash bucket.
When the vector is pointing to the opposite side of the hyperplane from the normal vector, encode it by 0.
Otherwise, if the vector is on the same side as the normal vector, encode it by 1.
If you calculate the dot product with each plane in the same order for every vector, you've encoded each vector's unique hash ID as a binary number, like [0, 1, 1, ... 0].
Exercise 09: Implementing hash buckets
We've initialized the hash table hashes for you. It is a list of N_UNIVERSES matrices, each of which describes its own hash table. Each matrix has N_DIMS rows and N_PLANES columns. Every column of that matrix is an N_DIMS-dimensional normal vector for one of the N_PLANES hyperplanes which are used for creating the buckets of that particular hash table.
Exercise: Your task is to complete the function hash_value_of_vector, which places vector v in the correct hash bucket.
First multiply your vector v with the corresponding planes matrix. This will give you a vector of dimension $(1, \text{N\_planes})$.
You will then convert every element in that vector to 0 or 1.
You create a hash vector by doing the following: if the element is negative, it becomes a 0, otherwise you change it to a 1.
You then compute the unique number for the vector by iterating over N_PLANES.
Then you multiply $2^i$ times the corresponding bit (0 or 1).
You will then store that sum in the variable hash_value.
Instructions: Create a hash for the vector in the function below. Use this formula:
$$hash = \sum_{i=0}^{N-1} \left( 2^{i} \times h_{i} \right)$$
Create the sets of planes
Create multiple (25) sets of planes (the planes that divide up the region).
You can think of these as 25 separate ways of dividing up the vector space with a different set of planes.
Each element of this list contains a matrix with 300 rows (the word vectors have 300 dimensions), and 10 columns (there are 10 planes in each "universe").
Hints
- numpy.squeeze() removes unused dimensions from an array; for instance, it converts a (10,1) 2D array into a (10,) 1D array
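A minimal sketch of `hash_value_of_vector()` implementing the formula above, assuming `v` has shape (1, 300) and `planes` has shape (300, N_PLANES):

```python
import numpy as np

def hash_value_of_vector(v, planes):
    """Map a (1, 300) vector to a hash bucket id, given a (300, N_PLANES)
    matrix whose columns are the plane normal vectors."""
    dot_product = np.dot(v, planes)          # shape (1, N_PLANES)
    sign_of_dot_product = np.sign(dot_product)
    h = sign_of_dot_product >= 0             # 1 on the normal's side, else 0
    h = np.squeeze(h)                        # (1, N_PLANES) -> (N_PLANES,)

    hash_value = 0
    n_planes = planes.shape[1]
    for i in range(n_planes):
        hash_value += 2**i * int(h[i])       # hash = sum_i 2^i * h_i
    return int(hash_value)
```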
Expected Output
3.5 Creating a hash table
Exercise 10
Given that you have a unique number for each vector (or tweet), you now want to create a hash table. You need a hash table so that, given a hash_id, you can quickly look up the corresponding vectors. This reduces your search time by a significant amount.

We have given you the make_hash_table function, which maps the tweet vectors to a bucket and stores the vector there. It returns the hash_table and the id_table. The id_table allows you to know which vector in a certain bucket corresponds to which tweet.
Hints
- a dictionary comprehension, similar to a list comprehension, looks like this: `{i:0 for i in range(10)}`, where the key is 'i' and the value is zero for all key-value pairs.
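A minimal sketch of what `make_hash_table()` could look like along these lines, reusing the `hash_value_of_vector()` sketch above:

```python
def make_hash_table(vecs, planes):
    """Place each document vector into its bucket for one hash universe.
    Returns hash_table (bucket -> vectors) and id_table (bucket -> indices)."""
    num_buckets = 2 ** planes.shape[1]                   # 2^N_PLANES buckets
    hash_table = {i: [] for i in range(num_buckets)}     # dictionary comprehension
    id_table = {i: [] for i in range(num_buckets)}

    for i, v in enumerate(vecs):
        h = hash_value_of_vector(v, planes)   # bucket id for this vector
        hash_table[h].append(v)               # store the vector itself
        id_table[h].append(i)                 # remember which tweet it came from
    return hash_table, id_table
```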
Expected output
3.6 Creating all hash tables
You can now hash your vectors and store them in a hash table that would allow you to quickly look up and search for similar vectors. Run the cell below to create the hashes. By doing so, you end up having several tables which have all the vectors. Given a vector, you then identify the buckets in all the tables. You can then iterate over the buckets and consider much fewer vectors. The more buckets you use, the more accurate your lookup will be, but also the longer it will take.
Approximate K-NN
Exercise 11
Implement approximate K nearest neighbors using locality sensitive hashing, to search for documents that are similar to a given document at the index doc_id.
Inputs:
doc_id is the index into the document list all_tweets.
v is the document vector for the tweet in all_tweets at index doc_id.
planes_l is the list of planes (the global variable created earlier).
k is the number of nearest neighbors to search for.
num_universes_to_use: to save time, we can use fewer than the total number of available universes. By default, it's set to N_UNIVERSES, which is 25 for this assignment.
The approximate_knn function finds a subset of candidate vectors that are in the same "hash bucket" as the input vector 'v'. Then it performs the usual k-nearest neighbors search on this subset (instead of searching through all 10,000 tweets).
Hints
- There are many dictionaries used in this function. Try to print out planes_l, hash_tables, id_tables to understand how they are structured, what the keys represent, and what the values contain.
- To remove an item from a list, use `.remove()`
- To append to a list, use `.append()`
- To add to a set, use `.add()`
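A minimal sketch of `approximate_knn()` following these hints; it assumes the global `hash_tables` and `id_tables` lists created in section 3.6, and reuses the `nearest_neighbor()` and `hash_value_of_vector()` sketches above:

```python
import numpy as np

def approximate_knn(doc_id, v, planes_l, k=1, num_universes_to_use=25):
    """Gather candidates only from the bucket that v falls into in each
    universe, then run ordinary k-NN on that (much smaller) candidate set."""
    vecs_to_consider, ids_to_consider = [], []
    ids_seen = set()                                   # avoid duplicate candidates

    for universe_id in range(num_universes_to_use):
        planes = planes_l[universe_id]
        hash_value = hash_value_of_vector(v, planes)   # bucket for v in this universe
        id_list = id_tables[universe_id][hash_value]   # tweet ids in that bucket
        vec_list = hash_tables[universe_id][hash_value]

        for i, new_id in enumerate(id_list):
            if new_id == doc_id:                       # skip the query document itself
                continue
            if new_id not in ids_seen:
                vecs_to_consider.append(vec_list[i])
                ids_to_consider.append(new_id)
                ids_seen.add(new_id)

    # ordinary k-NN on the candidate subset
    nearest_idx = nearest_neighbor(v, np.array(vecs_to_consider), k=k)
    return [ids_to_consider[idx] for idx in nearest_idx]
```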
4 Conclusion
Congratulations! Now you can look up vectors that are similar to the encoding of your tweet using LSH!