Assignment 2: Naive Bayes
Welcome to week two of this specialization. You will learn about Naive Bayes. Concretely, you will be using Naive Bayes for sentiment analysis on tweets. Given a tweet, you will decide if it has a positive sentiment or a negative one. Specifically you will:
Train a naive bayes model on a sentiment analysis task
Test using your model
Compute ratios of positive words to negative words
Do some error analysis
Predict on your own tweet
You may already be familiar with Naive Bayes and its justification in terms of conditional probabilities and independence.
In this week's lectures and assignments we used the ratio of probabilities between positive and negative sentiments.
This approach gives us simpler formulas for these 2-way classification tasks.
Run the cell below to import some packages. You may want to browse the documentation of unfamiliar libraries and functions.
If you are running this notebook on your local computer, don't forget to download the twitter samples and stopwords from nltk.
Part 1: Process the Data
For any machine learning project, once you've gathered the data, the first step is to process it to make useful inputs to your model.
Remove noise: You will first want to remove noise from your data, that is, words that don't tell you much about the content. These include common words such as 'I', 'you', 'are' and 'is', which carry little information about the sentiment.
We'll also remove stock market tickers, retweet symbols, hyperlinks, and hashtags because they tell you little about the sentiment.
You also want to remove all the punctuation from a tweet, so that a word is treated the same with or without punctuation, instead of treating "happy", "happy?", "happy!", "happy," and "happy." as different words.
Finally, you want to use stemming to keep track of only one variation of each word. In other words, we'll treat "motivation", "motivated", and "motivate" similarly by grouping them within the same stem of "motiv-".
We have given you the function `process_tweet()` that does this for you.
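The cleaning steps above can be sketched roughly as follows. This is not the provided `process_tweet()`: it is a simplified stand-in that uses only the standard library, skips stemming entirely, and relies on a tiny hypothetical stopword set (`STOPWORDS`) rather than the nltk stopword list. The name `simple_process_tweet` is an assumption made for illustration.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set (the assignment uses nltk's list).
STOPWORDS = {"i", "you", "am", "are", "is", "the", "a", "an", "to", "and"}

def simple_process_tweet(tweet):
    """Clean a tweet and return its remaining lowercase tokens (no stemming)."""
    tweet = re.sub(r"\$\w*", "", tweet)         # stock market tickers such as $GE
    tweet = re.sub(r"^RT\s+", "", tweet)        # retweet marker at the start
    tweet = re.sub(r"https?://\S+", "", tweet)  # hyperlinks
    tweet = tweet.replace("#", "")              # drop the '#' sign, keep the word
    cleaned = []
    for token in tweet.lower().split():
        token = token.strip(string.punctuation)  # "happy!" and "happy," become "happy"
        if token and token not in STOPWORDS:
            cleaned.append(token)
    return cleaned

print(simple_process_tweet("RT I am happy! #goodday https://example.com $GE"))
# prints ['happy', 'goodday']
```

Because this sketch strips punctuation from every token, it also discards emoticons such as ":(", so it should not be used in place of the provided function.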
Part 1.1 Implementing your helper functions
To help train your naive bayes model, you will need to build a dictionary where the keys are a (word, label) tuple and the values are the corresponding frequency. Note that the labels we'll use here are 1 for positive and 0 for negative.
You will also implement a `lookup()` helper function that takes in the `freqs` dictionary, a word, and a label (1 or 0) and returns the number of times that word and label tuple appears in the collection of tweets.
For example: given the list of tweets `["i am rather excited", "you are rather happy"]` and the label 1, the function will return a dictionary that contains the following key-value pairs:
`{("rather", 1): 2, ("happi", 1): 1, ("excit", 1): 1}`
Notice how for each word in the given string, the same label 1 is assigned to each word.
Notice how the words "i" and "am" are not saved, since they were removed by `process_tweet` because they are stopwords.
Notice how the word "rather" appears twice in the list of tweets, and so its count value is 2.
Instructions
Create a function `count_tweets()` that takes a list of tweets as input, cleans all of them, and returns a dictionary.
The key in the dictionary is a tuple containing the stemmed word and its class label, e.g. ("happi",1).
The value is the number of times this word appears in the given collection of tweets (an integer).
Hints
- Please use the `process_tweet` function that was imported above, and then store the words in their respective dictionaries and sets.
- You may find it useful to use the `zip` function to match each element in `tweets` with each element in `ys`.
- Remember to check if the key in the dictionary exists before adding that key to the dictionary, or incrementing its value.
- Assume that the `result` dictionary that is input will contain clean key-value pairs (you can assume that the values will be integers that can be incremented). It is good practice to check the datatype before incrementing the value, but it's not required here.
Expected Output: {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
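One possible shape for these two helpers is sketched below. To keep the block self-contained, it takes a hypothetical `tokenize` argument that stands in for `process_tweet` and defaults to plain whitespace splitting, so no stemming or stopword removal happens here; that parameter is an assumption of this sketch, not part of the assignment's signature.

```python
def lookup(freqs, word, label):
    """Return how often (word, label) was counted, or 0 if it never was."""
    return freqs.get((word, label), 0)

def count_tweets(result, tweets, ys, tokenize=str.split):
    """Add (word, label) counts for every tweet to `result` and return it.

    `tokenize` is a hypothetical stand-in for process_tweet.
    """
    for y, tweet in zip(ys, tweets):
        for word in tokenize(tweet):
            pair = (word, y)
            # dict.get covers both cases: key already present or not yet seen
            result[pair] = result.get(pair, 0) + 1
    return result

freqs = count_tweets({}, ["rather excited", "rather happy"], [1, 1])
print(freqs)
# prints {('rather', 1): 2, ('excited', 1): 1, ('happy', 1): 1}
```

With the real `process_tweet` the keys would be stems such as "excit" and "happi" rather than the raw words shown here.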
Part 2: Train your model using Naive Bayes
Naive Bayes is an algorithm that can be used for sentiment analysis. It is fast to train and fast at prediction time.
So how do you train a Naive Bayes classifier?
The first part of training a Naive Bayes classifier is to identify the number of classes that you have.
You will create a probability for each class. $P(D_{pos})$ is the probability that the document is positive. $P(D_{neg})$ is the probability that the document is negative. Use the formulas as follows and store the values in a dictionary:
$$P(D_{pos}) = \frac{D_{pos}}{D}\tag{1}$$
$$P(D_{neg}) = \frac{D_{neg}}{D}\tag{2}$$
Where $D$ is the total number of documents, or tweets in this case, $D_{pos}$ is the total number of positive tweets and $D_{neg}$ is the total number of negative tweets.
Prior and Logprior
The prior probability represents the underlying probability in the target population that a tweet is positive versus negative. In other words, if we had no specific information and blindly picked a tweet out of the population set, what is the probability that it will be positive versus that it will be negative? That is the "prior".
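The prior is the ratio of the probabilities $\frac{P(D_{pos})}{P(D_{neg})}$. We can take the log of the prior to rescale it, and we'll call this the logprior:
$$\text{logprior} = \log\left(\frac{P(D_{pos})}{P(D_{neg})}\right) = \log\left(\frac{D_{pos}}{D_{neg}}\right)$$
Note that $\log\left(\frac{A}{B}\right)$ is the same as $\log(A) - \log(B)$. So the logprior can also be calculated as the difference between two logs:
$$\text{logprior} = \log(P(D_{pos})) - \log(P(D_{neg})) = \log(D_{pos}) - \log(D_{neg})\tag{3}$$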
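As a small illustration of the logprior, the sketch below computes it from a list of labels as a difference of two logs. The helper name `compute_logprior` and the toy label lists are assumptions for this example only.

```python
import math

def compute_logprior(train_y):
    """Return log(D_pos) - log(D_neg) for a list of 0/1 labels."""
    d = len(train_y)                           # total number of documents
    d_pos = sum(1 for y in train_y if y == 1)  # positive documents
    d_neg = d - d_pos                          # negative documents
    return math.log(d_pos) - math.log(d_neg)

# Balanced data: the ratio is 1, so the logprior is 0.
print(compute_logprior([1] * 4000 + [0] * 4000))  # prints 0.0

# Hypothetical imbalanced data, three positives per negative: log(3), about 1.10.
print(round(compute_logprior([1] * 6000 + [0] * 2000), 2))
```

The sketch assumes both classes appear at least once; otherwise `math.log(0)` raises a `ValueError`.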
Positive and Negative Probability of a Word
To compute the positive probability and the negative probability for a specific word in the vocabulary, we'll use the following inputs:
$freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
$N_{pos}$ and $N_{neg}$ are the total number of positive and negative words for all documents (for all tweets), respectively.
$V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.
We'll use these to compute the positive and negative probability for a specific word using these formulas:
$$P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}\tag{4}$$
$$P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}\tag{5}$$
Notice that we add the "+1" in the numerator for additive smoothing. The Wikipedia article on additive smoothing explains this in more detail.
Log likelihood
To compute the loglikelihood of that very same word, we can implement the following equation:
$$\text{loglikelihood} = \log\left(\frac{P(W_{pos})}{P(W_{neg})}\right)\tag{6}$$
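For a single word, the smoothed probabilities and the resulting log likelihood can be sketched as below. The function name `word_loglikelihood` and the totals used in the example (`n_pos = n_neg = 1000`, `v = 500`) are made-up values for illustration, not numbers from the assignment's data.

```python
import math

def word_loglikelihood(freq_pos, freq_neg, n_pos, n_neg, v):
    """Log of P(W_pos) / P(W_neg) with add-one (additive) smoothing."""
    p_w_pos = (freq_pos + 1) / (n_pos + v)  # smoothed positive probability
    p_w_neg = (freq_neg + 1) / (n_neg + v)  # smoothed negative probability
    return math.log(p_w_pos / p_w_neg)

# A word seen 41 times in positive tweets and 2 times in negative tweets.
print(round(word_loglikelihood(41, 2, 1000, 1000, 500), 3))  # about 2.639

# A word never seen in positive tweets still gets a finite (negative) value,
# because the "+1" keeps the numerator above zero.
print(word_loglikelihood(0, 10, 1000, 1000, 500) < 0)  # prints True
```

Without the smoothing term, the second call would try to take the log of zero.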
Create `freqs` dictionary
Given your `count_tweets()` function, you can compute a dictionary called `freqs` that contains all the frequencies.
In this `freqs` dictionary, the key is the tuple (word, label).
The value is the number of times it has appeared.
We will use this dictionary in several parts of this assignment.
Instructions
Given a `freqs` dictionary, `train_x` (a list of tweets) and `train_y` (a list of labels for each tweet), implement a Naive Bayes classifier.
Calculate $V$
You can then compute the number of unique words that appear in the `freqs` dictionary to get your $V$ (you can use the `set` function).
Calculate $freq_{pos}$ and $freq_{neg}$
Using your `freqs` dictionary, you can compute the positive and negative frequency of each word, $freq_{pos}$ and $freq_{neg}$.
Calculate $N_{pos}$ and $N_{neg}$
Using the `freqs` dictionary, you can also compute the total number of positive words and total number of negative words, $N_{pos}$ and $N_{neg}$.
Calculate $D$, $D_{pos}$, $D_{neg}$
Using the `train_y` input list of labels, calculate the number of documents (tweets) $D$, as well as the number of positive documents (tweets) $D_{pos}$ and number of negative documents (tweets) $D_{neg}$.
Calculate the probability that a document (tweet) is positive, $P(D_{pos})$, and the probability that a document (tweet) is negative, $P(D_{neg})$.
Calculate the logprior
The logprior is $\log(D_{pos}) - \log(D_{neg})$.
Calculate log likelihood
Finally, you can iterate over each word in the vocabulary and use your `lookup` function to get the positive frequencies, $freq_{pos}$, and the negative frequencies, $freq_{neg}$, for that specific word.
Compute the positive probability of each word, $P(W_{pos})$, and the negative probability of each word, $P(W_{neg})$, using equations 4 & 5.
Note: We'll use a dictionary to store the log likelihoods for each word. The key is the word, and the value is the log likelihood of that word.
You can then compute the loglikelihood: $\log\left(\frac{P(W_{pos})}{P(W_{neg})}\right)$.
Expected Output:
0.0
9089
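Putting the training steps together, one way the function might look is sketched below. It keeps the `(freqs, train_x, train_y)` signature described in the instructions, although `train_x` is not needed once `freqs` exists; the toy `freqs` dictionary in the example is invented and far smaller than the assignment's data.

```python
import math

def train_naive_bayes(freqs, train_x, train_y):
    """Return (logprior, loglikelihood) computed from word counts and labels."""
    vocab = {word for word, _ in freqs}  # unique words across both classes
    v = len(vocab)

    # Total number of positive and negative word occurrences.
    n_pos = sum(count for (_, label), count in freqs.items() if label == 1)
    n_neg = sum(count for (_, label), count in freqs.items() if label == 0)

    # Document counts and the logprior.
    d = len(train_y)
    d_pos = sum(1 for y in train_y if y == 1)
    d_neg = d - d_pos
    logprior = math.log(d_pos) - math.log(d_neg)

    loglikelihood = {}
    for word in vocab:
        freq_pos = freqs.get((word, 1), 0)
        freq_neg = freqs.get((word, 0), 0)
        p_w_pos = (freq_pos + 1) / (n_pos + v)  # equation 4
        p_w_neg = (freq_neg + 1) / (n_neg + v)  # equation 5
        loglikelihood[word] = math.log(p_w_pos / p_w_neg)
    return logprior, loglikelihood

toy_freqs = {("happi", 1): 2, ("sad", 0): 2, ("day", 1): 1, ("day", 0): 1}
logprior, loglikelihood = train_naive_bayes(toy_freqs, None, [1, 1, 0, 0])
print(logprior)            # prints 0.0
print(len(loglikelihood))  # prints 3
```

In this toy example "day" appears equally often in both classes, so its log likelihood is 0, while "happi" and "sad" get values of equal size and opposite sign.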
Part 3: Test your naive bayes
Now that we have the `logprior` and `loglikelihood`, we can test the Naive Bayes function by making predictions on some tweets!
Implement naive_bayes_predict
Instructions: Implement the `naive_bayes_predict` function to make predictions on tweets.
The function takes in the `tweet`, `logprior`, and `loglikelihood`.
It returns a score for the tweet: a value above 0 points to the positive class and a value below 0 points to the negative class.
For each tweet, sum up loglikelihoods of each word in the tweet.
Also add the logprior to this sum to get the predicted sentiment of that tweet.
Note
Note that we calculate the prior from the training data, and that the training data is evenly split between positive and negative labels (4000 positive and 4000 negative tweets). This means that the ratio of positive to negative is 1, and the logprior is 0.
The value of 0.0 means that when we add the logprior to the log likelihood, we're just adding zero to the log likelihood. However, please remember to include the logprior, because whenever the data is not perfectly balanced, the logprior will be a non-zero value.
Expected Output:
The expected output is around 1.57
The sentiment is positive.
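The prediction step described above can be sketched as follows. As assumptions of this sketch, the hypothetical `tokenize` parameter stands in for `process_tweet` (defaulting to whitespace splitting), words missing from `loglikelihood` contribute nothing to the sum, and the toy dictionary values are invented.

```python
def naive_bayes_predict(tweet, logprior, loglikelihood, tokenize=str.split):
    """Return logprior plus the summed log likelihoods of the tweet's words."""
    score = logprior
    for word in tokenize(tweet):
        # Words that were never seen in training are skipped (they add 0).
        score += loglikelihood.get(word, 0.0)
    return score

toy_loglikelihood = {"great": 2.0, "bad": -1.5}  # invented values
print(naive_bayes_predict("great movie", 0.0, toy_loglikelihood))  # prints 2.0
print(naive_bayes_predict("bad movie", 0.0, toy_loglikelihood))    # prints -1.5
print(naive_bayes_predict("great great", 0.0, toy_loglikelihood))  # prints 4.0
```

Since the score is a plain sum, repeating a word adds its log likelihood once per occurrence.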
Implement test_naive_bayes
Instructions:
Implement `test_naive_bayes` to check the accuracy of your predictions.
The function takes in your `test_x`, `test_y`, `logprior`, and `loglikelihood`.
It returns the accuracy of your model.
First, use the `naive_bayes_predict` function to make predictions for each tweet in `test_x`.
Expected Accuracy:
0.9940
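A minimal sketch of the accuracy computation is shown below. It is named `naive_bayes_accuracy` here (a hypothetical name) to keep it apart from the graded `test_naive_bayes`, and it bundles a simplified whitespace-based predictor so that the block runs on its own; the toy tweets, labels and log likelihoods are invented.

```python
def naive_bayes_predict(tweet, logprior, loglikelihood):
    """Simplified stand-in predictor: whitespace tokens, unseen words add 0."""
    return logprior + sum(loglikelihood.get(w, 0.0) for w in tweet.lower().split())

def naive_bayes_accuracy(test_x, test_y, logprior, loglikelihood):
    """Fraction of tweets whose predicted label matches the true label."""
    y_hats = []
    for tweet in test_x:
        score = naive_bayes_predict(tweet, logprior, loglikelihood)
        y_hats.append(1 if score > 0 else 0)  # positive score -> label 1
    # Average absolute difference between predictions and labels is the error rate.
    error = sum(abs(y_hat - y) for y_hat, y in zip(y_hats, test_y)) / len(test_y)
    return 1 - error

toy_loglikelihood = {"great": 2.0, "bad": -1.5}
tweets = ["great day", "bad day", "great bad"]
labels = [1, 0, 0]
print(round(naive_bayes_accuracy(tweets, labels, 0.0, toy_loglikelihood), 2))  # prints 0.67
```

The third toy tweet scores 0.5 and is predicted positive although its label is 0, which is why the accuracy is two out of three.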
Expected Output:
I am happy -> 2.15
I am bad -> -1.29
this movie should have been great. -> 2.14
great -> 2.14
great great -> 4.28
great great great -> 6.41
great great great great -> 8.55
Part 4: Filter words by Ratio of positive to negative counts
Some words have more positive counts than others, and can be considered "more positive". Likewise, some words can be considered more negative than others.
One way for us to define the level of positiveness or negativeness, without calculating the log likelihood, is to compare the positive to negative frequency of the word.
Note that we can also use the log likelihood calculations to compare relative positivity or negativity of words.
We can calculate the ratio of positive to negative frequencies of a word.
Once we're able to calculate these ratios, we can also filter a subset of words that have a minimum ratio of positivity / negativity or higher.
Similarly, we can also filter a subset of words that have a maximum ratio of positivity / negativity or lower (words that are at least as negative, or even more negative than a given threshold).
Implement get_ratio()
Given the `freqs` dictionary of words and a particular word, use `lookup(freqs, word, 1)` to get the positive count of the word.
Similarly, use the `lookup()` function to get the negative count of that word.
Calculate the ratio of positive divided by negative counts:
$$ratio = \frac{\text{pos\_words} + 1}{\text{neg\_words} + 1}$$
Where pos_words and neg_words correspond to the frequency of the words in their respective classes.
| Words | Positive word count | Negative word count |
| --- | --- | --- |
| glad | 41 | 2 |
| arriv | 57 | 4 |
| :( | 1 | 3663 |
| :-( | 0 | 378 |
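Using the counts from the table, a sketch of `get_ratio()` could look like this. Two things are assumptions of the sketch: the ratio adds 1 to both counts so that a word with no negative occurrences does not cause a division by zero, and the returned dictionary uses the key names `positive`, `negative` and `ratio`.

```python
def lookup(freqs, word, label):
    """Return the count stored for (word, label), or 0 if absent."""
    return freqs.get((word, label), 0)

def get_ratio(freqs, word):
    """Return the positive count, negative count and smoothed ratio of a word."""
    pos_neg_ratio = {"positive": 0, "negative": 0, "ratio": 0.0}
    pos_neg_ratio["positive"] = lookup(freqs, word, 1)
    pos_neg_ratio["negative"] = lookup(freqs, word, 0)
    # The +1 terms are an assumption here; they keep the division defined.
    pos_neg_ratio["ratio"] = (pos_neg_ratio["positive"] + 1) / (pos_neg_ratio["negative"] + 1)
    return pos_neg_ratio

table_freqs = {
    ("glad", 1): 41, ("glad", 0): 2,
    ("arriv", 1): 57, ("arriv", 0): 4,
    (":(", 1): 1, (":(", 0): 3663,
    (":-(", 0): 378,
}
print(get_ratio(table_freqs, "glad"))
# prints {'positive': 41, 'negative': 2, 'ratio': 14.0}
```

Under the same assumption, ":-(" gets a ratio of 1/379, which is close to zero and marks it as strongly negative.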
Implement get_words_by_threshold(freqs,label,threshold)
If we set the label to 1, then we'll look for all words whose positive/negative ratio is at least as high as the given threshold.
If we set the label to 0, then we'll look for all words whose positive/negative ratio is at most as low as the given threshold.
Use the `get_ratio()` function to get a dictionary containing the positive count, negative count, and the ratio of positive to negative counts.
Add each qualifying word to the result, where the key is the word and the value is the dictionary `pos_neg_ratio` that is returned by the `get_ratio()` function. Each key-value pair therefore maps a word to its positive count, negative count, and ratio.
Notice the difference between the positive and negative ratios. Emojis like :( and words like 'me' tend to have a negative connotation. Other words like 'glad', 'community', and 'arrives' tend to be found in the positive tweets.
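The filtering logic might be sketched as below. The block carries its own compact `get_ratio()` so it can run alone; as before, the +1 smoothing in the ratio and the key names `positive`, `negative` and `ratio` are assumptions, and the thresholds in the example are arbitrary.

```python
def get_ratio(freqs, word):
    """Compact helper: counts and smoothed positive/negative ratio of a word."""
    pos = freqs.get((word, 1), 0)
    neg = freqs.get((word, 0), 0)
    return {"positive": pos, "negative": neg, "ratio": (pos + 1) / (neg + 1)}

def get_words_by_threshold(freqs, label, threshold):
    """Return {word: pos_neg_ratio} for words on the requested side of threshold."""
    word_list = {}
    for word, _ in freqs:
        pos_neg_ratio = get_ratio(freqs, word)
        if label == 1 and pos_neg_ratio["ratio"] >= threshold:
            word_list[word] = pos_neg_ratio  # at least as positive as threshold
        elif label == 0 and pos_neg_ratio["ratio"] <= threshold:
            word_list[word] = pos_neg_ratio  # at least as negative as threshold
    return word_list

table_freqs = {
    ("glad", 1): 41, ("glad", 0): 2,
    ("arriv", 1): 57, ("arriv", 0): 4,
    (":(", 1): 1, (":(", 0): 3663,
    (":-(", 0): 378,
}
print(sorted(get_words_by_threshold(table_freqs, label=1, threshold=10)))
# prints ['arriv', 'glad']
print(sorted(get_words_by_threshold(table_freqs, label=0, threshold=0.05)))
# prints [':(', ':-(']
```

A word that appears under both labels is visited twice in the loop, but it is stored under the same key, so the result is unaffected.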
Part 5: Error Analysis
In this part you will see some tweets that your model misclassified. Why do you think the misclassifications happened? Were there any assumptions made by the Naive Bayes model?
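One property worth keeping in mind while inspecting errors is that the score is a sum over individual words, so word order and word combinations play no role. The sketch below illustrates this with invented log likelihood values and a hypothetical `score` helper; it is an illustration of the scoring rule, not an analysis of the assignment's actual errors.

```python
def score(tokens, logprior, loglikelihood):
    """Sum of per-word log likelihoods plus the logprior; order is irrelevant."""
    return logprior + sum(loglikelihood.get(t, 0.0) for t in tokens)

# Invented values: "good" is strongly positive, "not" is mildly negative.
toy_loglikelihood = {"good": 1.5, "not": -0.5}

a = score(["not", "good"], 0.0, toy_loglikelihood)
b = score(["good", "not"], 0.0, toy_loglikelihood)
print(a, b)  # prints 1.0 1.0
```

With these toy values the phrase "not good" still receives a positive score, because each word is scored independently of its neighbours.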
Part 6: Predict with your own tweet
In this part you can predict the sentiment of your own tweet.
Congratulations on completing this assignment. See you next week!