Language Models: Auto-Complete
In this assignment, you will build an auto-complete system. An auto-complete system is something you may see every day.
When you google something, you often have suggestions to help you complete your search.
When you are writing an email, you get suggestions telling you possible endings to your sentence.
By the end of this assignment, you will develop a prototype of such a system.
A key building block for an auto-complete system is a language model. A language model assigns a probability to a sequence of words, in such a way that more "likely" sequences receive higher scores. For example,
"I have a pen" is expected to have a higher probability than "I am a pen" since the first one seems to be a more natural sentence in the real world.
You can take advantage of this probability calculation to develop an auto-complete system. Suppose the user typed "I eat scrambled". Then you can find a word x such that "I eat scrambled x" receives the highest probability. If x = "eggs", the sentence would be "I eat scrambled eggs".
While a variety of language models have been developed, this assignment uses N-grams, a simple but powerful method for language modeling.
N-grams are also used in machine translation and speech recognition.
Here are the steps of this assignment:
Load and preprocess data
Load and tokenize data.
Split the sentences into train and test sets.
Replace words with a low frequency by an unknown marker <unk>.
Develop N-gram based language models
Compute the count of n-grams from a given data set.
Estimate the conditional probability of a next word with k-smoothing.
Evaluate the N-gram models by computing the perplexity score.
Use your own model to suggest an upcoming word given your sentence.
Part 1.2 Pre-process the data
Preprocess this data with the following steps:
Split data into sentences using "\n" as the delimiter.
Split each sentence into tokens. Note that in this assignment we use "token" and "word" interchangeably.
Assign sentences into train or test sets.
Find tokens that appear at least N times in the training data.
Replace tokens that appear fewer than N times with <unk>.
Note: we omit validation data in this exercise.
In real applications, we should hold a part of data as a validation set and use it to tune our training.
We skip this process for simplicity.
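As a rough illustration of the first step (the notebook supplies its own starter code, so the name split_to_sentences here is only a placeholder), splitting the raw text into sentences might look like this:

```python
def split_to_sentences(data):
    """Split raw text into a list of sentences, using "\n" as the delimiter."""
    sentences = data.split("\n")
    # Strip surrounding whitespace and drop empty lines
    sentences = [s.strip() for s in sentences]
    sentences = [s for s in sentences if len(s) > 0]
    return sentences

# Example
print(split_to_sentences("I have a pen.\nI have an apple. \nAh\nApple pen.\n"))
# -> ['I have a pen.', 'I have an apple.', 'Ah', 'Apple pen.']
```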
Expected answer:
Exercise 02
The next step is to tokenize sentences (split a sentence into a list of words).
Convert all tokens into lower case so that words which are capitalized (for example, at the start of a sentence) in the original text are treated the same as the lowercase versions of the words.
Append each tokenized list of words into a list of tokenized sentences.
Hints
- Use str.lower to convert strings to lowercase.
- Please use nltk.word_tokenize to split sentences into tokens.
- If you used str.split instead of nltk.word_tokenize, there would be additional edge cases to handle, such as the punctuation (comma, period) that follows a word.
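A minimal sketch of this tokenization step, using str.lower and nltk.word_tokenize as the hints suggest (the name tokenize_sentences is a placeholder, and the sketch assumes NLTK's punkt models are available):

```python
import nltk
nltk.download('punkt', quiet=True)  # tokenizer models, needed once

def tokenize_sentences(sentences):
    """Lowercase each sentence and split it into tokens with nltk.word_tokenize."""
    tokenized_sentences = []
    for sentence in sentences:
        sentence = sentence.lower()           # treat "Sky" and "sky" the same
        tokens = nltk.word_tokenize(sentence)
        tokenized_sentences.append(tokens)
    return tokenized_sentences

# Example
print(tokenize_sentences(["Sky is blue.", "Leaves are green."]))
# -> [['sky', 'is', 'blue', '.'], ['leaves', 'are', 'green', '.']]
```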
Expected output
Expected outcome
Split into train and test sets
Now run the cell below to split data into training and test sets.
Expected output
Exercise 04
You won't use all the tokens (words) appearing in the data for training. Instead, you will use only the more frequently occurring words.
You will focus on the words that appear at least N times in the data.
First count how many times each word appears in the data.
You will need a double for-loop, one for sentences and the other for tokens within a sentence.
Hints
- If you decide to import and use defaultdict, remember to cast the dictionary back to a regular 'dict' before returning it.
Expected output
Note that the order may differ.
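A minimal sketch of this word-counting step with the double for-loop described above (the name count_words is a placeholder):

```python
def count_words(tokenized_sentences):
    """Count how often each token appears across all tokenized sentences."""
    word_counts = {}
    for sentence in tokenized_sentences:       # loop over sentences
        for token in sentence:                 # loop over tokens within a sentence
            word_counts[token] = word_counts.get(token, 0) + 1
    return word_counts

# Example
sentences = [['sky', 'is', 'blue', '.'], ['leaves', 'are', 'green', '.']]
print(count_words(sentences))
# -> {'sky': 1, 'is': 1, 'blue': 1, '.': 2, 'leaves': 1, 'are': 1, 'green': 1}
```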
Handling 'Out of Vocabulary' words
If your model is performing autocomplete, but encounters a word that it never saw during training, it won't have an input word to help it determine the next word to suggest. The model will not be able to predict the next word because there are no counts for the current word.
This 'new' word is called an 'unknown word', or an out of vocabulary (OOV) word.
The percentage of unknown words in the test set is called the OOV rate.
To handle unknown words during prediction, use a special token '<unk>' to represent all unknown words.
Modify the training data so that it has some 'unknown' words to train on.
Words to convert into "unknown" words are those that do not occur very frequently in the training set.
Create a list of the most frequent words in the training set, called the closed vocabulary.
Convert all the other words that are not part of the closed vocabulary to the token '<unk>'.
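A minimal sketch of these two steps, using collections.Counter; the function names are placeholders and the notebook's starter code may differ:

```python
from collections import Counter

def get_words_with_nplus_frequency(tokenized_sentences, count_threshold):
    """Return the closed vocabulary: tokens appearing at least count_threshold times."""
    word_counts = Counter(token for sentence in tokenized_sentences for token in sentence)
    return [word for word, cnt in word_counts.items() if cnt >= count_threshold]

def replace_oov_words_by_unk(tokenized_sentences, vocabulary, unknown_token="<unk>"):
    """Replace every token that is not in the vocabulary with the unknown marker."""
    vocabulary = set(vocabulary)      # O(1) membership tests
    return [[token if token in vocabulary else unknown_token for token in sentence]
            for sentence in tokenized_sentences]

# Example
sents = [['sky', 'is', 'blue', '.'], ['sky', 'is', 'high', '.']]
vocab = get_words_with_nplus_frequency(sents, count_threshold=2)
print(replace_oov_words_by_unk(sents, vocab))
# -> [['sky', 'is', '<unk>', '.'], ['sky', 'is', '<unk>', '.']]
```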
Expected output
Expected answer
Expected outcome
Preprocess the train and test data
Run the cell below to complete the preprocessing for both the training and test sets.
Expected output
You are done with the preprocessing section of the assignment. The objects train_data_processed, test_data_processed, and vocabulary will be used in the rest of the exercises.
Part 2: Develop n-gram based language models
In this section, you will develop the n-gram language model.
Assume the probability of the next word depends only on the previous n-gram.
The previous n-gram is the series of the previous 'n' words.
The conditional probability for the word at position 't' in the sentence, given that the words preceding it are $w_{t-1}, w_{t-2} \cdots w_{t-n}$, is:

$$ P(w_t | w_{t-1}\dots w_{t-n}) \tag{1} $$

You can estimate this probability by counting the occurrences of these series of words in the training data.
The probability can be estimated as a ratio:

$$ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t)}{C(w_{t-1}\dots w_{t-n})} \tag{2} $$

The numerator is the number of times word $w_t$ appears after the words $w_{t-1}$ through $w_{t-n}$ in the training data.
The denominator is the number of times the words $w_{t-1}$ through $w_{t-n}$ appear in the training data.
The function $C(\cdots)$ denotes the number of occurrences of the given sequence.
$\hat{P}$ means the estimation of $P$.
Notice that the denominator of equation (2) is the number of occurrences of the previous words, and the numerator is the same sequence followed by the word $w_t$.
Later, you will modify the equation (2) by adding k-smoothing, which avoids errors when any counts are zero.
The equation (2) tells us that to estimate probabilities based on n-grams, you need the counts of n-grams (for denominator) and (n+1)-grams (for numerator).
Exercise 08
Next, you will implement a function that computes the counts of n-grams for an arbitrary number $n$.
When computing the counts for n-grams, prepare the sentence beforehand by prepending starting markers "<s>" to indicate the beginning of the sentence.
For example, in the bi-gram model (N=2), a sequence with two start tokens "<s><s>" should predict the first word of a sentence.
So, if the sentence is "I like food", modify it to be "<s><s> I like food".
Also prepare the sentence for counting by appending an end token "<e>" so that the model can predict when to finish a sentence.
Technical note: In this implementation, you will store the counts as a dictionary.
The key of each key-value pair in the dictionary is a tuple of n words (and not a list)
The value in the key-value pair is the number of occurrences.
The reason for using a tuple as a key instead of a list is because a list in Python is a mutable object (it can be changed after it is first created). A tuple is "immutable", so it cannot be altered after it is first created. This makes a tuple suitable as a data type for the key in a dictionary.
Hints
- To prepend or append, you can create lists and concatenate them using the + operator.
- To create a list of a repeated value, you can follow this syntax: ['a'] * 3 to get ['a', 'a', 'a'].
- To set the range for index 'i', think of this example: an n-gram where n=2 (bigram), and the sentence has length N=5 (including two start tokens and one end token), so the index positions are [0, 1, 2, 3, 4]. The largest index 'i' where a bigram can start is at position i=3, because the word tokens at positions 3 and 4 will form the bigram.
- Remember that the range() function excludes the value that is used for the maximum of the range: range(3) produces (0, 1, 2) but excludes 3.
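A minimal sketch of such a counting function, following the padding and tuple-key conventions described above (the name count_n_grams is a placeholder, not necessarily the notebook's exact signature):

```python
def count_n_grams(data, n, start_token="<s>", end_token="<e>"):
    """Count all n-grams in the data, after padding each sentence with
    n start tokens and one end token."""
    n_grams = {}
    for sentence in data:
        # pad: n start tokens in front, one end token at the back
        sentence = [start_token] * n + sentence + [end_token]
        sentence = tuple(sentence)              # tuples can be dictionary keys
        # the last n-gram starts at index len(sentence) - n
        for i in range(len(sentence) - n + 1):
            n_gram = sentence[i:i + n]
            n_grams[n_gram] = n_grams.get(n_gram, 0) + 1
    return n_grams

# Example
data = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
print(count_n_grams(data, 2)[('like', 'a')])   # -> 2
```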
Expected outcome:
Exercise 09
Next, estimate the probability of a word given the prior 'n' words using the n-gram counts.
This formula doesn't work when a count of an n-gram is zero.
Suppose we encounter an n-gram that did not occur in the training data.
Then, the equation (2) cannot be evaluated (it becomes zero divided by zero).
A way to handle zero counts is to add k-smoothing.
K-smoothing adds a positive constant $k$ to each numerator and $k \times |V|$ to the denominator, where $|V|$ is the number of words in the vocabulary.

$$ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t) + k}{C(w_{t-1}\dots w_{t-n}) + k|V|} \tag{3} $$

For n-grams that have a zero count, equation (3) becomes $\frac{k}{k|V|}$.
This means that any n-gram with zero count has the same probability of $\frac{1}{|V|}$.
Define a function that computes the probability estimate (3) from n-gram counts and a constant $k$.
The function takes in a dictionary 'n_gram_counts', where the key is the n-gram and the value is the count of that n-gram.
The function also takes another dictionary n_plus1_gram_counts, which you'll use to find the count for the previous n-gram plus the current word.
Hints
- To define a tuple containing a single value, add a comma after that value. For example: ('apple',) is a tuple containing a single string 'apple'.
- To concatenate two tuples, use the '+' operator.
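A minimal sketch of this smoothed estimate, following equation (3); the function name and argument order are assumptions, not the notebook's exact signature:

```python
def estimate_probability(word, previous_n_gram, n_gram_counts,
                         n_plus1_gram_counts, vocabulary_size, k=1.0):
    """Estimate P(word | previous_n_gram) with add-k smoothing, as in equation (3)."""
    previous_n_gram = tuple(previous_n_gram)            # dictionary keys are tuples
    previous_n_gram_count = n_gram_counts.get(previous_n_gram, 0)
    n_plus1_gram = previous_n_gram + (word,)             # append the candidate word
    n_plus1_gram_count = n_plus1_gram_counts.get(n_plus1_gram, 0)

    numerator = n_plus1_gram_count + k
    denominator = previous_n_gram_count + k * vocabulary_size
    return numerator / denominator

# Example: P('cat' | 'a') with toy bigram/unigram counts and |V| = 7
unigram_counts = {('a',): 2, ('cat',): 2}
bigram_counts = {('a', 'cat'): 2}
print(estimate_probability('cat', ['a'], unigram_counts, bigram_counts,
                           vocabulary_size=7, k=1.0))   # -> (2+1)/(2+7) = 0.333...
```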
Expected output
Estimate probabilities for all words
The function defined below loops over all words in the vocabulary to calculate probabilities for all possible words.
This function is provided for you.
Expected output
Expected output
Count and probability matrices
As we have seen so far, the n-gram counts computed above are sufficient for computing the probabilities of the next word.
It can be more intuitive to present them as count or probability matrices.
The functions defined in the next cells return count or probability matrices.
This function is provided for you.
Expected output
Expected output
The following function calculates the probabilities of each word given the previous n-gram, and stores this in matrix form.
This function is provided for you.
Confirm that you obtain the same results as for the estimate_probabilities function that you implemented.
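The notebook provides these helpers, so the following is only a rough equivalent of the matrix-building step, built with pandas; details such as adding '<e>' and '<unk>' columns are assumptions about the notebook's own implementation:

```python
import numpy as np
import pandas as pd

def make_count_matrix(n_plus1_gram_counts, vocabulary):
    """Arrange (n+1)-gram counts as a matrix: one row per previous n-gram,
    one column per candidate next word."""
    vocabulary = vocabulary + ["<e>", "<unk>"]          # assumed extra columns
    n_grams = sorted(set(key[:-1] for key in n_plus1_gram_counts.keys()))

    row_index = {n_gram: i for i, n_gram in enumerate(n_grams)}
    col_index = {word: j for j, word in enumerate(vocabulary)}

    count_matrix = np.zeros((len(n_grams), len(vocabulary)))
    for n_plus1_gram, count in n_plus1_gram_counts.items():
        n_gram, word = n_plus1_gram[:-1], n_plus1_gram[-1]
        if word in col_index:
            count_matrix[row_index[n_gram], col_index[word]] = count
    return pd.DataFrame(count_matrix, index=n_grams, columns=vocabulary)

def make_probability_matrix(n_plus1_gram_counts, vocabulary, k=1.0):
    """Add k to every count and normalize each row so it sums to one."""
    count_matrix = make_count_matrix(n_plus1_gram_counts, vocabulary) + k
    return count_matrix.div(count_matrix.sum(axis=1), axis=0)
```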
Part 3: Perplexity
In this section, you will generate the perplexity score to evaluate your model on the test set.
You will also use back-off when needed.
Perplexity is used as an evaluation metric of your language model.
To calculate the perplexity score of the test set on an n-gram model, use:

$$ PP(W) = \sqrt[N]{\prod_{t=n+1}^{N} \frac{1}{P(w_t | w_{t-n} \cdots w_{t-1})}} \tag{4} $$

where $N$ is the length of the sentence.
$n$ is the number of words in the n-gram (e.g. 2 for a bigram).
In math, the numbering starts at one and not zero.
In code, array indexing starts at zero, so the code will use ranges for $t$ according to this formula:

$$ PP(W) = \sqrt[N]{\prod_{t=n}^{N-1} \frac{1}{P(w_t | w_{t-n} \cdots w_{t-1})}} \tag{4.1} $$
The higher the probabilities are, the lower the perplexity will be.
The more the n-grams tell us about the sentence, the lower the perplexity score will be.
Hints
- Remember that range(2,4) produces the integers [2, 3] (and excludes 4).
Expected Output
Note: If your sentence is really long, there will be underflow when multiplying many fractions.
To handle longer sentences, modify your implementation to take the sum of the log of the probabilities.
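A minimal sketch of the perplexity computation in log space, following equation (4.1) and the note above; the function name and signature are assumptions:

```python
import math

def calculate_perplexity(sentence, n_gram_counts, n_plus1_gram_counts,
                         vocabulary_size, k=1.0):
    """Perplexity of one sentence under an n-gram model with add-k smoothing,
    computed with a sum of logs to avoid underflow (equation 4.1)."""
    n = len(list(n_gram_counts.keys())[0])           # n-gram length from the keys
    sentence = ["<s>"] * n + list(sentence) + ["<e>"]
    N = len(sentence)

    log_prob_sum = 0.0
    for t in range(n, N):                            # code uses indices n .. N-1
        previous_n_gram = tuple(sentence[t - n:t])
        word = sentence[t]
        # add-k smoothed probability, as in equation (3)
        numerator = n_plus1_gram_counts.get(previous_n_gram + (word,), 0) + k
        denominator = n_gram_counts.get(previous_n_gram, 0) + k * vocabulary_size
        log_prob_sum += math.log(numerator / denominator)

    # perplexity = (product of 1/p)^(1/N) = exp(-(1/N) * sum(log p))
    return math.exp(-log_prob_sum / N)

# Example (reuses count_n_grams from the earlier sketch)
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
print(calculate_perplexity(['i', 'like', 'a', 'cat'],
                           unigram_counts, bigram_counts, vocabulary_size=7))
```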
Hints
- estimate_probabilities returns a dictionary where the key is a word and the value is the word's probability.
- Use str1.startswith(str2) to determine if a string starts with the letters of another string. For example, 'learning'.startswith('lea') returns True, whereas 'learning'.startswith('ear') returns False. There are two additional parameters in str.startswith(), but you can use the default values for those parameters in this case.
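A minimal sketch of the single-word suggestion step, combining the smoothed estimate from equation (3) with the startswith filter described in the hints; the name suggest_a_word, its signature, and the candidate set are assumptions:

```python
def suggest_a_word(previous_tokens, n_gram_counts, n_plus1_gram_counts,
                   vocabulary, k=1.0, start_with=None):
    """Pick the word with the highest add-k smoothed probability of following
    the last n tokens; optionally require it to start with a given prefix."""
    n = len(list(n_gram_counts.keys())[0])           # n-gram length from the keys
    previous_n_gram = tuple(previous_tokens[-n:])    # only the last n tokens matter
    previous_count = n_gram_counts.get(previous_n_gram, 0)

    candidates = vocabulary + ["<e>", "<unk>"]       # assumed candidate set
    suggestion, max_prob = None, 0.0
    for word in candidates:
        if start_with is not None and not word.startswith(start_with):
            continue                                  # skip words not matching the prefix
        numerator = n_plus1_gram_counts.get(previous_n_gram + (word,), 0) + k
        prob = numerator / (previous_count + k * len(candidates))
        if prob > max_prob:
            suggestion, max_prob = word, prob
    return suggestion, max_prob

# Example (reuses count_n_grams from the earlier sketch)
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
vocab = ['cat', 'i', 'this', 'a', 'is', 'like', 'dog']
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
print(suggest_a_word(['i', 'like'], unigram_counts, bigram_counts, vocab, k=1.0))
# -> ('a', 0.27...) since "like a" is the most frequent continuation of "like"
```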
Expected output
Get multiple suggestions
The function defined below loops over various n-gram models to get multiple suggestions.
Suggest multiple words using n-grams of varying length
Congratulations! You have developed all building blocks for implementing your own auto-complete systems.
Let's see this with n-grams of varying lengths (unigrams, bigrams, trigrams, 4-grams...6-grams).
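A minimal sketch of such a loop, assuming the suggest_a_word helper sketched earlier and a list of n-gram count dictionaries, one per n (the name get_suggestions is a placeholder):

```python
def get_suggestions(previous_tokens, n_gram_counts_list, vocabulary,
                    k=1.0, start_with=None):
    """Ask each consecutive pair of n-gram / (n+1)-gram count dictionaries
    for its best next-word suggestion."""
    suggestions = []
    for i in range(len(n_gram_counts_list) - 1):
        n_gram_counts = n_gram_counts_list[i]            # counts for n-grams
        n_plus1_gram_counts = n_gram_counts_list[i + 1]  # counts for (n+1)-grams
        suggestions.append(suggest_a_word(previous_tokens, n_gram_counts,
                                          n_plus1_gram_counts, vocabulary,
                                          k=k, start_with=start_with))
    return suggestions

# Example (reuses count_n_grams and suggest_a_word from the sketches above)
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
vocab = ['cat', 'i', 'this', 'a', 'is', 'like', 'dog']
n_gram_counts_list = [count_n_grams(sentences, n) for n in range(1, 4)]
print(get_suggestions(['i', 'like'], n_gram_counts_list, vocab, k=1.0))
```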
Congratulations!
You've completed this assignment by building an autocomplete model using an n-gram language model!
Please continue onto the fourth and final week of this course!