Assignment 1: Logistic Regression
Welcome to week one of this specialization. You will learn about logistic regression. Concretely, you will be implementing logistic regression for sentiment analysis on tweets. Given a tweet, you will decide if it has a positive sentiment or a negative one. Specifically you will:
Learn how to extract features for logistic regression given some text
Implement logistic regression from scratch
Apply logistic regression on a natural language processing task
Test using your logistic regression
Perform error analysis
We will be using a data set of tweets. Hopefully you will get more than 99% accuracy. Run the cell below to load in the packages.
Import functions and data
Imported functions
Download the data needed for this assignment. Check out the documentation for the twitter_samples dataset.
twitter_samples: if you're running this notebook on your local computer, you will need to download it using nltk.download('twitter_samples').
stopwords: if you're running this notebook on your local computer, you will need to download it using nltk.download('stopwords').
Import some helper functions that we provided in the utils.py file:
process_tweet(): cleans the text, tokenizes it into separate words, removes stopwords, and converts words to stems.
build_freqs(): counts how often a word in the 'corpus' (the entire set of tweets) was associated with a positive label '1' or a negative label '0', then builds the freqs dictionary, where each key is a (word, label) tuple and the value is the number of times that pair occurs in the corpus of tweets.
Prepare the data
The twitter_samples corpus contains subsets of 5,000 positive tweets, 5,000 negative tweets, and the full set of 10,000 tweets.
If you used all three datasets, you would introduce duplicates of the positive and negative tweets.
You will select just the five thousand positive tweets and five thousand negative tweets.
Train test split: 20% will be in the test set, and 80% in the training set.
Create the numpy array of positive labels and negative labels.
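The split and the label arrays described above can be sketched as follows. This is a minimal illustration, not the notebook's own cell: the placeholder strings stand in for the real tweets, and the variable names (`all_positive_tweets`, `train_x`, `train_y`, and so on) are assumptions.

```python
import numpy as np

# Placeholder data standing in for the 5,000 positive and 5,000 negative tweets.
all_positive_tweets = [f"positive tweet {i}" for i in range(5000)]
all_negative_tweets = [f"negative tweet {i}" for i in range(5000)]

# 80/20 split: the first 80% of each class goes to training, the rest to test.
split = len(all_positive_tweets) * 4 // 5
train_pos, test_pos = all_positive_tweets[:split], all_positive_tweets[split:]
train_neg, test_neg = all_negative_tweets[:split], all_negative_tweets[split:]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

# Labels as (m, 1) column vectors: 1 for positive tweets, 0 for negative tweets.
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

print(train_y.shape, test_y.shape)
```

Keeping the labels as column vectors means they line up with the (m, 1) predictions used later in the cost function.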
Create the frequency dictionary using the imported build_freqs() function.
We highly recommend that you open utils.py and read the build_freqs() function to understand what it is doing.
To view the file directory, go to the menu and click File->Open.
Notice how the outer for loop goes through each tweet, and the inner for loop steps through each word in a tweet.
The freqs dictionary is the frequency dictionary that's being built.
The key is the tuple (word, label), such as ("happy",1) or ("happy",0). The value stored for each key is the count of how many times the word "happy" was associated with a positive label, or how many times "happy" was associated with a negative label.
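The nested loop described above can be sketched roughly as follows. This is not the code in utils.py: the function name `build_freqs_sketch` is hypothetical, and it assumes the tweets have already been tokenized, whereas the real function processes each raw tweet first.

```python
def build_freqs_sketch(tokenized_tweets, labels):
    """Count (word, label) pairs over a list of already-tokenized tweets."""
    freqs = {}
    # Outer loop: one tweet (and its label) at a time.
    for tokens, label in zip(tokenized_tweets, labels):
        # Inner loop: each word in that tweet.
        for word in tokens:
            pair = (word, label)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs


# Toy example with stemmed tokens; 1 marks a positive tweet, 0 a negative one.
tweets = [["happi", "day"], ["sad", "day"], ["happi", "happi"]]
labels = [1, 0, 1]
freqs = build_freqs_sketch(tweets, labels)
print(freqs)
```

Here "happi" appears three times in positive tweets, so the key ("happi", 1) maps to 3, while "day" is counted once under each label.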
Expected output
Process tweet
The given function process_tweet() tokenizes the tweet into individual words, removes stop words, and applies stemming.
Expected output
Part 1: Logistic regression
Part 1.1: Sigmoid
You will learn to use logistic regression for text classification.
The sigmoid function is defined as:
$$h(z) = \frac{1}{1 + e^{-z}}$$
It maps the input 'z' to a value that ranges between 0 and 1, and so it can be treated as a probability.

Instructions: Implement the sigmoid function
You will want this function to work if z is a scalar as well as if it is an array.
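One possible way to meet the scalar-and-array requirement is to lean on NumPy's element-wise `np.exp`. The following is a sketch of that approach rather than the graded solution.

```python
import numpy as np


def sigmoid(z):
    """Sigmoid of z, where z may be a scalar or a NumPy array."""
    # np.exp works element-wise, so the same expression covers both cases.
    return 1 / (1 + np.exp(-z))


print(sigmoid(0))                       # scalar input
print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # array input
```

An input of 0 sits exactly at the midpoint of the curve, so it maps to 0.5; large negative and large positive inputs approach 0 and 1 respectively.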
Logistic regression: regression and a sigmoid
Logistic regression takes a regular linear regression, and applies a sigmoid to the output of the linear regression.
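Regression:
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$
Note that the $\theta$ values are "weights". If you took the Deep Learning Specialization, we referred to the weights with the w vector. In this course, we're using a different variable, $\theta$, to refer to the weights.
Logistic regression:
$$h(z) = \frac{1}{1 + e^{-z}}$$
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$
We will refer to 'z' as the 'logits'.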
Part 1.2 Cost function and Gradient
The cost function used for logistic regression is the average of the log loss across all training examples:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(h(z(\theta)^{(i)})\right) + (1 - y^{(i)}) \log\left(1 - h(z(\theta)^{(i)})\right) \right]$$
$m$ is the number of training examples.
$y^{(i)}$ is the actual label of the i-th training example.
$h(z(\theta)^{(i)})$ is the model's prediction for the i-th training example.
The loss function for a single training example is
$$Loss = -1 \times \left( y^{(i)} \log\left(h(z(\theta)^{(i)})\right) + (1 - y^{(i)}) \log\left(1 - h(z(\theta)^{(i)})\right) \right)$$
All the $h$ values are between 0 and 1, so the logs will be negative. That is the reason for the factor of -1 applied to the sum of the two loss terms.
Note that when the model predicts 1 ($h(z(\theta)) = 1$) and the label $y$ is also 1, the loss for that training example is 0.
Similarly, when the model predicts 0 ($h(z(\theta)) = 0$) and the actual label is also 0, the loss for that training example is 0.
However, when the model prediction is close to 1 ($h(z(\theta)) \approx 1$) and the label is 0, the second term of the log loss becomes a large negative number, which is then multiplied by the overall factor of -1 to convert it to a positive loss value. The closer the model prediction gets to 1, the larger the loss.
Likewise, if the model predicts close to 0 ($h(z(\theta)) \approx 0$) but the actual label is 1, the first term in the loss function becomes a large number: $-1 \times \log\left(h(z(\theta))\right)$. The closer the prediction is to zero, the larger the loss.
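To make the four cases above concrete, here is a small numeric sketch of the single-example log loss. The helper name `log_loss` and the probabilities 0.9999 and 0.0001 are chosen for illustration only; they stay just short of 0 and 1 so that the logarithm is defined.

```python
import numpy as np


def log_loss(y, h):
    """Log loss for one example with label y (0 or 1) and prediction h in (0, 1)."""
    return -1 * (y * np.log(h) + (1 - y) * np.log(1 - h))


# Confident and correct: the loss is close to zero.
good_pos = log_loss(1, 0.9999)
good_neg = log_loss(0, 0.0001)

# Confident and wrong: the loss is large.
bad_pos = log_loss(1, 0.0001)
bad_neg = log_loss(0, 0.9999)

print(good_pos, good_neg, bad_pos, bad_neg)
```

The two confident-and-wrong cases each cost roughly 9.2, about five orders of magnitude more than the confident-and-correct cases.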
Update the weights
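To update your weight vector $\theta$, you will apply gradient descent to iteratively improve your model's predictions. The gradient of the cost function $J$ with respect to one of the weights $\theta_j$ is:
$$\nabla_{\theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(h^{(i)} - y^{(i)}\right) x_j^{(i)}$$
'i' is the index across all 'm' training examples, and $h^{(i)}$ is the model's prediction for the i-th example.
'j' is the index of the weight $\theta_j$, so $x_j$ is the feature associated with weight $\theta_j$.
To update the weight $\theta_j$, we adjust it by subtracting a fraction of the gradient determined by $\alpha$:
$$\theta_j = \theta_j - \alpha \times \nabla_{\theta_j} J(\theta)$$
The learning rate $\alpha$ is a value that we choose to control how big a single update will be.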
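The per-weight update can be traced by hand on a tiny dataset. The sketch below uses plain Python loops to compute the gradient for a single weight and apply one update; the function name `gradient_for_weight`, the toy data, and the learning rate 0.1 are all illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def gradient_for_weight(x, y, theta, j):
    """Average of (h_i - y_i) * x_ij over all m training examples."""
    m = len(x)
    total = 0.0
    for i in range(m):
        z_i = sum(t * x_ik for t, x_ik in zip(theta, x[i]))
        h_i = sigmoid(z_i)
        total += (h_i - y[i]) * x[i][j]
    return total / m


# Two examples; the first column is the bias feature (always 1).
x = [[1.0, 2.0], [1.0, -1.0]]
y = [1.0, 0.0]
theta = [0.0, 0.0]
alpha = 0.1
j = 1

grad_j = gradient_for_weight(x, y, theta, j)
theta_j_new = theta[j] - alpha * grad_j
print(grad_j, theta_j_new)
```

With all weights at zero every prediction is 0.5, so the gradient for weight 1 is ((0.5 - 1) * 2 + (0.5 - 0) * (-1)) / 2 = -0.75, and the update moves the weight to 0.075.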
Instructions: Implement gradient descent function
The number of iterations num_iters is the number of times that you'll use the entire training set.
For each iteration, you'll calculate the cost function using all training examples (there are m training examples), and for all features.
Instead of updating a single weight $\theta_i$ at a time, we can update all the weights in the column vector:
$$\mathbf{\theta} = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{pmatrix}$$
$\mathbf{\theta}$ has dimensions (n+1, 1), where 'n' is the number of features, and there is one more element for the bias term $\theta_0$ (note that the corresponding feature value $\mathbf{x_0}$ is 1).
The 'logits', 'z', are calculated by multiplying the feature matrix 'x' with the weight vector 'theta': $\mathbf{z} = \mathbf{x}\mathbf{\theta}$
$\mathbf{x}$ has dimensions (m, n+1)
$\mathbf{\theta}$ has dimensions (n+1, 1)
$\mathbf{z}$ has dimensions (m, 1)
The prediction 'h' is calculated by applying the sigmoid to each element in 'z': $h(z) = sigmoid(z)$, and has dimensions (m,1).
The cost function $J$ is calculated by taking the dot product of 'y' with 'log(h)' and of '(1-y)' with 'log(1-h)'. Since all of these are column vectors (m,1), transpose the vector on the left, so that matrix multiplication of a row vector with a column vector performs the dot product:
$$J = \frac{-1}{m} \times \left( \mathbf{y}^T \cdot \log(\mathbf{h}) + (\mathbf{1} - \mathbf{y})^T \cdot \log(\mathbf{1} - \mathbf{h}) \right)$$
The update of theta is also vectorized. Because the dimensions of $\mathbf{x}$ are (m, n+1), and both $\mathbf{h}$ and $\mathbf{y}$ are (m, 1), we need to transpose $\mathbf{x}$ and place it on the left in order to perform matrix multiplication, which then yields the (n+1, 1) answer we need:
$$\mathbf{\theta} = \mathbf{\theta} - \frac{\alpha}{m} \times \left( \mathbf{x}^T \cdot (\mathbf{h} - \mathbf{y}) \right)$$
Hints
- use np.dot for matrix multiplication.
- In Python 3, `-1/m` is already a decimal value because `/` performs true division. Under Python 2, cast either the numerator or denominator (or both), like `float(1)`, or write `1.` for the float version of 1.
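Putting the vectorized steps together, a gradient descent loop might look like the sketch below. It follows the shapes given in the instructions, but the function name `gradient_descent`, the toy data, and the hyperparameters are illustrative assumptions and not the graded solution.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def gradient_descent(x, y, theta, alpha, num_iters):
    """x: (m, n+1) features, y: (m, 1) labels, theta: (n+1, 1) weights."""
    m = x.shape[0]
    J = None
    for _ in range(num_iters):
        z = np.dot(x, theta)   # logits, shape (m, 1)
        h = sigmoid(z)         # predictions, shape (m, 1)
        # Cost for the current weights; each dot product yields a (1, 1) array.
        J = -1 / m * (np.dot(y.T, np.log(h)) + np.dot((1 - y).T, np.log(1 - h)))
        # Vectorized update of all weights at once.
        theta = theta - alpha / m * np.dot(x.T, h - y)
    # J is the cost measured just before the final update.
    return J.item(), theta


# Toy data: bias column plus two features.
x = np.array([[1, 3, 0], [1, 2, 1], [1, 0, 2], [1, 1, 3]], dtype=float)
y = np.array([[1], [1], [0], [0]], dtype=float)

J_first, _ = gradient_descent(x, y, np.zeros((3, 1)), 0.1, 1)
J_last, theta = gradient_descent(x, y, np.zeros((3, 1)), 0.1, 200)
print(J_first, J_last, theta.ravel())
```

With all weights at zero every prediction is 0.5, so the first cost equals log(2), about 0.693; after 200 iterations on this toy data the cost is lower.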
Expected output
Part 2: Extracting the features
Given a list of tweets, extract the features and store them in a matrix. You will extract two features.
The first feature is the number of positive words in a tweet.
The second feature is the number of negative words in a tweet.
Then train your logistic regression classifier on these features.
Test the classifier on a validation set.
Instructions: Implement the extract_features function.
This function takes in a single tweet.
Process the tweet using the imported process_tweet() function and save the list of tweet words.
Loop through each word in the list of processed words.
For each word, check the freqs dictionary for the count when that word has a positive '1' label. (Check for the key (word, 1.0).)
Do the same for the count for when the word is associated with the negative label '0'. (Check for the key (word, 0.0).)
Hints
- Make sure you handle cases when the (word, label) key is not found in the dictionary.
- Use the `.get()` method of a Python dictionary, which returns a default value when the key is not found.
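The feature extraction steps can be sketched as follows. To keep the example self-contained it takes an already-processed list of words instead of calling process_tweet(), and it assumes a (1, 3) output layout with the bias term first; the function name `extract_features_sketch` and the toy `freqs` dictionary are hypothetical.

```python
import numpy as np


def extract_features_sketch(tokens, freqs):
    """Return a (1, 3) row: [bias, sum of positive counts, sum of negative counts]."""
    x = np.zeros((1, 3))
    x[0, 0] = 1  # bias term
    for word in tokens:
        # .get() returns 0 when the (word, label) key is missing.
        x[0, 1] += freqs.get((word, 1.0), 0)
        x[0, 2] += freqs.get((word, 0.0), 0)
    return x


freqs = {("happi", 1.0): 3, ("day", 1.0): 1, ("day", 0.0): 1, ("sad", 0.0): 2}
features = extract_features_sketch(["happi", "day", "unseen"], freqs)
print(features)
```

The word "unseen" has no entry under either label, so it contributes 0 to both counts instead of raising a KeyError.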
Expected output
Expected output
Part 3: Training Your Model
To train the model:
Stack the features for all training examples into a matrix X.
Call gradientDescent, which you've implemented above.
This section is given to you. Please read it for understanding and run the cell.
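The stacking step can be pictured with the sketch below, which fills one row of X per training example. The compact `features_sketch` helper, the toy data, and the variable names are assumptions for illustration; the provided cell works on the real tweets and then passes X and the labels to gradientDescent.

```python
import numpy as np


def features_sketch(tokens, freqs):
    """Hypothetical stand-in for extract_features: returns a (1, 3) row."""
    pos = sum(freqs.get((w, 1.0), 0) for w in tokens)
    neg = sum(freqs.get((w, 0.0), 0) for w in tokens)
    return np.array([[1.0, pos, neg]])


freqs = {("happi", 1.0): 3, ("day", 1.0): 1, ("day", 0.0): 1, ("sad", 0.0): 2}
train_tokens = [["happi", "day"], ["sad", "day"], ["happi"]]
train_y = np.array([[1.0], [0.0], [1.0]])

# One row per training example, one column per feature (bias, positive, negative).
X = np.zeros((len(train_tokens), 3))
for i, tokens in enumerate(train_tokens):
    X[i, :] = features_sketch(tokens, freqs)
Y = train_y

print(X)
print(X.shape, Y.shape)
```

X ends up with shape (m, 3) and Y with shape (m, 1), which are the shapes the vectorized cost and update formulas expect.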
Expected Output:
Part 4: Test your logistic regression
It is time for you to test your logistic regression function on some new input that your model has not seen before.
Instructions: Write predict_tweet
Predict whether a tweet is positive or negative.
Given a tweet, process it, then extract the features.
Apply the model's learned weights on the features to get the logits.
Apply the sigmoid to the logits to get the prediction (a value between 0 and 1).
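The last two steps amount to one matrix product followed by the sigmoid. The sketch below starts from an already-extracted feature row, since processing the raw tweet needs the course helpers; the function name `predict_from_features` and the weights shown are made up for illustration and are not trained values.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def predict_from_features(x, theta):
    """x: (1, n+1) feature row, theta: (n+1, 1) weights. Returns a (1, 1) probability."""
    z = np.dot(x, theta)  # logits
    return sigmoid(z)


# Illustrative weights: bias, positive-count weight, negative-count weight.
theta = np.array([[0.0], [0.5], [-0.5]])
x = np.array([[1.0, 4.0, 1.0]])

y_pred = predict_from_features(x, theta)
print(y_pred)
```

Here the logit is 0.5 * 4 - 0.5 * 1 = 1.5, so the prediction is above 0.5 and the tweet would be classified as positive.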
Expected Output:
Check performance using the test set
After training your model using the training set above, check how your model might perform on real, unseen data, by testing it against the test set.
Instructions: Implement test_logistic_regression
Given the test data and the weights of your trained model, calculate the accuracy of your logistic regression model.
Use your predict_tweet() function to make predictions on each tweet in the test set.
If the prediction is > 0.5, set the model's classification y_hat to 1, otherwise set the model's classification y_hat to 0.
A prediction is accurate when y_hat equals test_y. Sum up all the instances when they are equal and divide by m.
Hints
- Use np.asarray() to convert a list to a numpy array
- Use np.squeeze() to make an (m,1) dimensional array into an (m,) array
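The thresholding and accuracy computation can be sketched as below. The function takes a list of predicted probabilities directly, so it is independent of predict_tweet(); the name `accuracy_sketch` and the toy values are assumptions.

```python
import numpy as np


def accuracy_sketch(predictions, test_y):
    """predictions: list of probabilities; test_y: (m, 1) array of 0/1 labels."""
    # Classify as 1 when the probability is above 0.5, otherwise 0.
    y_hat = (np.asarray(predictions) > 0.5).astype(float)
    # Squeeze (m, 1) labels to (m,) so the comparison is element-wise.
    matches = y_hat == np.squeeze(test_y)
    return float(np.sum(matches) / len(y_hat))


predictions = [0.9, 0.2, 0.6, 0.4]
test_y = np.array([[1.0], [0.0], [0.0], [0.0]])
acc = accuracy_sketch(predictions, test_y)
print(acc)
```

Without the squeeze, comparing an (m,) array with an (m, 1) array would broadcast to an (m, m) matrix and give a misleading count. In this toy case three of the four classifications match the labels, so the accuracy is 0.75.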
Expected Output:
0.9950
Pretty good!
Part 5: Error Analysis
In this part you will see some tweets that your model misclassified. Why do you think the misclassifications happened? Specifically what kind of tweets does your model misclassify?
Later in this specialization, we will see how we can use deep learning to improve the prediction performance.