Visualizing tweets and the Logistic Regression model
Objectives: Visualize and interpret the logistic regression model
Steps:
Plot tweets in a scatter plot using their positive and negative sums.
Plot the output of the logistic regression model in the same plot as a solid line.
Import the required libraries
We will be using NLTK, an open-source NLP library, for collecting, handling, and processing Twitter data. In this lab, we will use the example dataset that comes with NLTK. This dataset has been manually annotated and serves to establish baselines for models quickly.
So, to start, let's import the required libraries.
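A minimal version of that import cell might look like the sketch below (it assumes nltk, numpy, pandas, and matplotlib are installed in the environment):

```python
import nltk                              # NLP toolbox
from nltk.corpus import twitter_samples  # the example Twitter dataset
import pandas as pd                      # DataFrame handling for the CSV features
import numpy as np                       # numerical arrays
import matplotlib.pyplot as plt          # plotting
```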
Load the NLTK sample dataset
To complete this lab, you need the sample dataset from the previous lab. Here, we assume the files are already available, and we only need to load them into Python lists.
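A sketch of that loading step, assuming the twitter_samples corpus has already been downloaded (for example via nltk.download('twitter_samples')):

```python
# select the sets of positive and negative tweets from the sample dataset
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# concatenate the lists: positives first, then negatives
tweets = all_positive_tweets + all_negative_tweets

# build the matching labels: 1 for positive tweets, 0 for negative tweets
labels = np.append(np.ones(len(all_positive_tweets)),
                   np.zeros(len(all_negative_tweets)))
```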
Load the extracted features
Part of this week's assignment is the creation of the numerical features needed for the Logistic regression model. To avoid spoiling that part of the assignment, we have previously calculated these features for the entire training set and stored them in a CSV file.
So, please load these features created for the tweets sample.
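For illustration, assuming the precomputed features live in a file named logistic_features.csv (the file name is an assumption in this sketch), the loading cell could be:

```python
# load the precomputed features for the whole training set into a DataFrame
data = pd.read_csv('logistic_features.csv')  # file name assumed for this sketch
data.head(10)                                # inspect the first ten rows
```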
Now let us get rid of the data frame and keep only NumPy arrays.
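A sketch of that conversion, assuming the CSV has the columns bias, positive, negative, and sentiment:

```python
# keep only NumPy arrays; the column names below are assumptions
X = data[['bias', 'positive', 'negative']].values  # feature matrix, shape (m, 3)
Y = data['sentiment'].values                       # labels, shape (m,)

print(X.shape)  # sanity check on the dimensions
```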
Load a pretrained Logistic Regression model
In the same way, as part of this week's assignment, a Logistic regression model must be trained. The next cell contains the resulting model from that training. Notice that the whole model is represented by a list of 3 numeric values, which we have called theta.
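Such a cell could look like the sketch below; the numeric values are placeholders for illustration, not the actual trained weights:

```python
# theta = [bias weight, positive-feature weight, negative-feature weight]
# placeholder values; the lab provides the weights from the actual training
theta = [7e-08, 0.0005239, -0.00055517]
```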
Plot the samples in a scatter plot
The vector theta represents a plane that splits our feature space into two parts. Samples located above that plane are considered positive, and samples located below that plane are considered negative. Remember that we have a 3D feature space, i.e., each tweet is represented as a vector comprised of three values: [bias, positive_sum, negative_sum], always with bias = 1.
If we ignore the bias term, we can plot each tweet in a Cartesian plane, using positive_sum and negative_sum. In the cell below, we do precisely this. Additionally, we color each tweet depending on its class: positive tweets will be green and negative tweets will be red.
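A minimal sketch of that plot, reusing the X and Y arrays assumed above:

```python
fig, ax = plt.subplots(figsize=(8, 8))

# index 0 -> negative -> red, index 1 -> positive -> green
colors = ['red', 'green']

# X[:, 1] is positive_sum and X[:, 2] is negative_sum (column order assumed)
ax.scatter(X[:, 1], X[:, 2], c=[colors[int(k)] for k in Y], s=0.1)

plt.xlabel('Positive')
plt.ylabel('Negative')
plt.show()
```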
From the plot, it is evident that the features we have chosen to represent tweets as numerical vectors allow an almost perfect separation between positive and negative tweets. So you can expect very high accuracy from this model!
Plot the model alongside the data
We will draw a gray line to show the cutoff between the positive and negative regions. In other words, the gray line marks the set of points where

$$ z = \theta \cdot x = 0. $$

To draw this line, we have to solve the above equation in terms of one of the independent variables:

$$ \theta_0 + \theta_1 \cdot pos + \theta_2 \cdot neg = 0 $$

$$ neg = \frac{-\theta_0 - \theta_1 \cdot pos}{\theta_2} $$
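In code, that solution becomes a small helper; a sketch assuming the theta list from above:

```python
# separation line: given pos, return the neg value where z = theta . x = 0
def neg(theta, pos):
    return (-theta[0] - theta[1] * pos) / theta[2]
```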
The red and green lines that point in the direction of the corresponding sentiment are calculated using a line perpendicular to the separation line computed in the previous equations (the neg function). Each must point in the same direction as the derivative of the Logit function, but the magnitude may differ; they serve only as a visual representation of the model.

The green line in the chart points in the direction where z > 0 and the red line points in the direction where z < 0. The direction of these lines is given by the weights $\theta_1$ and $\theta_2$.
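A sketch of the direction helper and the final chart, reusing X, Y, colors, theta, and neg from the cells above:

```python
# perpendicular offset used to draw the sentiment arrows,
# scaled by the weights theta[1] and theta[2]
def direction(theta, pos):
    return pos * theta[2] / theta[1]

fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(X[:, 1], X[:, 2], c=[colors[int(k)] for k in Y], s=0.1)
plt.xlabel('Positive')
plt.ylabel('Negative')

max_pos = X[:, 1].max()  # extent of the x-axis
offset = 5000            # arbitrary starting point so the arrows are visible

# gray line: the cutoff where z = theta . x = 0
ax.plot([0, max_pos], [neg(theta, 0), neg(theta, max_pos)], color='gray')

# green arrow: direction where z > 0 (positive sentiment)
ax.arrow(offset, neg(theta, offset), offset, direction(theta, offset),
         head_width=500, head_length=500, fc='g', ec='g')

# red arrow: direction where z < 0 (negative sentiment)
ax.arrow(offset, neg(theta, offset), -offset, -direction(theta, offset),
         head_width=500, head_length=500, fc='r', ec='r')

plt.show()
```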
Note that, more critical than the Logistic regression model itself, it is the features extracted from the tweets that make it possible to get such good results in this exercise.
That is all, folks. Hopefully, you now understand better what the Logistic regression model represents, and why it works so well for this specific problem.