Path: blob/master/Natural Language Processing with Classification and Vector Spaces/Week 1 - Sentiment Analysis with Logistic Regression/NLP_C1_W1_lecture_nb_02.ipynb
Building and Visualizing word frequencies
In this lab, we will focus on the build_freqs() helper function and on visualizing a dataset fed into it. For our goal of tweet sentiment analysis, this function builds a dictionary in which we can look up how many times a word appears in the lists of positive or negative tweets. This will be very helpful when extracting the features of the dataset in this week's programming assignment. Let's see how this function is implemented under the hood in this notebook.
Setup
Let's import the required libraries for this lab:
Import some helper functions that we provided in the utils.py file:
process_tweet(): Cleans the text, tokenizes it into separate words, removes stopwords, and converts words to stems.
build_freqs(): Counts how often a word in the 'corpus' (the entire set of tweets) was associated with a positive label 1 or a negative label 0. It then builds the freqs dictionary, where each key is a (word, label) tuple, and the value is the count of its frequency within the corpus of tweets.
Load the NLTK sample dataset
As in the previous lab, we will be using the Twitter dataset from NLTK.
Next, we will build a labels array that matches the sentiments of our tweets. A NumPy array works much like a regular list but is optimized for computation and manipulation. The labels array will be composed of 10000 elements. The first 5000 will be filled with 1 labels denoting positive sentiments, and the next 5000 will be 0 labels denoting the opposite. We can do this easily with a series of operations provided by the numpy library:
np.ones() - create an array of 1's
np.zeros() - create an array of 0's
np.append() - concatenate arrays
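A minimal sketch of these three operations, assuming numpy is installed. The sizes are written as the literal 5000s from the description above rather than computed from the tweet lists, so that the snippet runs on its own.

```python
import numpy as np

# 5000 positive labels (1) followed by 5000 negative labels (0).
# In the lab the sizes would come from the lengths of the tweet lists;
# they are hard-coded here as an assumption.
positive_labels = np.ones(5000)
negative_labels = np.zeros(5000)

# np.append returns a new, flattened array with the second argument
# placed after the first.
labels = np.append(positive_labels, negative_labels)

print(labels.shape)             # (10000,)
print(labels[:3], labels[-3:])  # [1. 1. 1.] [0. 0. 0.]
```

Note that np.ones() and np.zeros() produce floating-point values by default, so the labels are 1.0 and 0.0 rather than Python integers.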
Dictionaries
In Python, a dictionary is a mutable and indexed collection. It stores items as key-value pairs and uses hash tables underneath to allow practically constant time lookups. In NLP, dictionaries are essential because they enable fast retrieval of items and containment checks even with thousands of entries in the collection.
Definition
A dictionary in Python is declared using curly brackets. Look at the next example:
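For instance, a two-entry dictionary might be declared as sketched here. The variable name dictionary matches the traceback shown later in this lab; the key names and values are illustrative assumptions.

```python
# Keys are strings here; the key names and values are illustrative.
dictionary = {'key1': 1, 'key2': 2}

print(dictionary)       # {'key1': 1, 'key2': 2}
print(len(dictionary))  # 2
```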
The preceding line defines a dictionary with two entries. Keys and values can be almost any type (with a few restrictions on keys), and in this case, the keys are strings. We can also use floats, integers, tuples, etc.
Adding or editing entries
New entries can be inserted into dictionaries using square brackets. If the dictionary already contains the specified key, its value is overwritten.
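A short sketch of both cases, using an illustrative starting dictionary (the keys and values are assumptions):

```python
dictionary = {'key1': 1, 'key2': 2}   # illustrative starting point

# 'key3' is not present yet, so this inserts a new entry.
dictionary['key3'] = -5

# 'key1' already exists, so its value is overwritten.
dictionary['key1'] = 0

print(dictionary)  # {'key1': 0, 'key2': 2, 'key3': -5}
```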
Accessing values and lookup keys
Performing dictionary lookups and retrieval are common tasks in NLP. There are two ways to do this:
Using square bracket notation: This form is allowed if the lookup key is in the dictionary. It produces an error otherwise.
Using the get() method: This allows us to set a default value if the dictionary key does not exist.
Let us see these in action:
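A sketch of the square bracket form with a key that exists (the dictionary contents are illustrative):

```python
dictionary = {'key1': 0, 'key2': 2, 'key3': -5}  # illustrative values

# Square bracket lookup: works because 'key2' is in the dictionary.
print(dictionary['key2'])  # 2
```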
However, if the key is missing, the operation produces an error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-8-8d63520997fb> in <module>
1 # The output of this line is intended to produce a KeyError
----> 2 print(dictionary['key8'])
KeyError: 'key8'
When using a square bracket lookup, it is common to use an if-else block to check for containment first (with the keyword in) before getting the item. On the other hand, you can use the .get() method if you want to set a default value when the key is not found. Let's compare these in the cells below:
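The two approaches can be sketched side by side. The dictionary contents, the missing key 'key7', and the fallback value -1 are all illustrative choices.

```python
dictionary = {'key1': 0, 'key2': 2, 'key3': -5}  # illustrative values

# Option 1: check for containment with `in` before indexing.
if 'key7' in dictionary:
    value_if_else = dictionary['key7']
else:
    value_if_else = -1   # fallback chosen for this example

# Option 2: .get() returns the second argument when the key is missing.
value_get = dictionary.get('key7', -1)

print(value_if_else, value_get)  # -1 -1
```

Both forms give the same result; .get() simply folds the containment check and the fallback into one call.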
Word frequency dictionary
Now that we know the building blocks, let's finally take a look at the build_freqs() function in utils.py. This is the function that creates the dictionary containing the word counts from each corpus.
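The source of utils.py is not reproduced in this text, so the following is only a sketch of how such a function could be written. It uses a simplified stand-in tokenizer, simple_tokenize(), in place of process_tweet() (which needs NLTK), and the function is named build_freqs_sketch() to make clear that it is not the course implementation.

```python
def simple_tokenize(tweet):
    """Hypothetical stand-in for process_tweet(): lowercase and split on whitespace."""
    return tweet.lower().split()


def build_freqs_sketch(tweets, ys):
    """Count how often each (word, label) pair occurs.

    tweets: list of tweet strings
    ys: sequence of labels (1 = positive, 0 = negative), one per tweet
    """
    freqs = {}
    for y, tweet in zip(ys, tweets):
        for word in simple_tokenize(tweet):
            pair = (word, int(y))       # int() so float labels give integer keys
            if pair in freqs:
                freqs[pair] += 1        # seen before: increment
            else:
                freqs[pair] = 1         # first occurrence
    return freqs


# Toy data, not taken from the NLTK dataset.
toy_tweets = ["happy happy day :)", "sad day :("]
toy_labels = [1, 0]
freqs = build_freqs_sketch(toy_tweets, toy_labels)
print(freqs[("happy", 1)])  # 2
print(freqs[("day", 0)])    # 1
```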
You can also do the for loop like this to make it a bit more compact:
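One way to write the more compact loop is with dict.get() and a default of 0, which removes the explicit if-else. This is a sketch: the function name build_freqs_compact() and the tokenize parameter (a stand-in for process_tweet()) are assumptions made so the snippet runs without NLTK.

```python
def build_freqs_compact(tweets, ys, tokenize=str.split):
    """Count (word, label) pairs using dict.get() with a default of 0.

    `tokenize` is a hypothetical parameter standing in for process_tweet().
    """
    freqs = {}
    for y, tweet in zip(ys, tweets):
        for word in tokenize(tweet):
            pair = (word, int(y))
            # Missing pairs start from 0, so no containment check is needed.
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs


# Toy data, not taken from the NLTK dataset.
freqs = build_freqs_compact(["good good morning", "bad morning"], [1, 0])
print(freqs)
# {('good', 1): 2, ('morning', 1): 1, ('bad', 0): 1, ('morning', 0): 1}
```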
As shown above, each key is a 2-element tuple containing a (word, y) pair. The word is an element in a processed tweet while y is an integer representing the corpus: 1 for the positive tweets and 0 for the negative tweets. The value associated with this key is the number of times that word appears in the specified corpus. For example:
Now, it is time to use the dictionary returned by the build_freqs() function. First, let us feed in our tweets and labels lists, then print a basic report:
Now print the frequency of each word depending on its class.
Unfortunately, this does not help much to understand the data. It would be better to visualize this output to gain better insights.
Table of word counts
We will select a set of words that we would like to visualize. It is better to store this temporary information in a table that is very easy to use later.
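Such a table can be sketched as a list of [word, positive_count, negative_count] rows. The freqs dictionary below is hypothetical: only the counts for :) come from the text of this lab, and the remaining words and counts are made up for illustration.

```python
# Hypothetical frequency dictionary. Only the counts for ':)' come from
# the text of this lab; the others are made up for illustration.
freqs = {
    (':)', 1): 3568, (':)', 0): 2,
    ('happi', 1): 10, ('happi', 0): 1,
    ('sad', 0): 7,
}

keys = [':)', 'happi', 'sad', 'song']   # words we want to inspect

# Each row is [word, positive_count, negative_count].
data = []
for word in keys:
    pos = freqs.get((word, 1), 0)   # 0 if the word never appears as positive
    neg = freqs.get((word, 0), 0)   # 0 if the word never appears as negative
    data.append([word, pos, neg])

print(data)
# [[':)', 3568, 2], ['happi', 10, 1], ['sad', 0, 7], ['song', 0, 0]]
```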
We can then use a scatter plot to inspect this table visually. Instead of plotting the raw counts, we will plot them on a logarithmic scale to account for the wide discrepancies between the raw counts (e.g. :) has 3568 counts in the positive tweets but only 2 in the negative ones). The red line marks the boundary between the positive and negative areas. Words close to the red line can be classified as neutral.
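The plotting cell is not shown in this text, so here is one possible sketch. It assumes rows of the form [word, positive_count, negative_count], adds 1 to each count before taking the log so that a count of 0 does not produce log(0) (an assumption, not something stated above), and keeps the matplotlib import inside plot_counts() so the coordinate computation runs without it. The names log_coords() and plot_counts() are hypothetical.

```python
import math


def log_coords(data):
    """Map [word, pos, neg] rows to (word, log(pos + 1), log(neg + 1)).

    Adding 1 before the log is an assumption made here so that a count
    of 0 does not produce log(0).
    """
    return [(word, math.log(pos + 1), math.log(neg + 1)) for word, pos, neg in data]


def plot_counts(data):
    """Scatter plot of log counts with a red diagonal (requires matplotlib)."""
    import matplotlib.pyplot as plt

    coords = log_coords(data)
    xs = [x for _, x, _ in coords]
    ys = [y for _, _, y in coords]

    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(xs, ys)
    for word, x, y in coords:
        ax.annotate(word, (x, y), fontsize=12)

    limit = max(xs + ys) + 1
    ax.plot([0, limit], [0, limit], color='red')   # equal positive and negative counts
    ax.set_xlabel("Log positive count")
    ax.set_ylabel("Log negative count")
    plt.show()


# The ':)' counts are from the text; 'sad' is a made-up row.
rows = [[':)', 3568, 2], ['sad', 0, 7]]
for word, x, y in log_coords(rows):
    side = "positive" if x > y else "negative" if y > x else "neutral"
    print(word, round(x, 2), round(y, 2), side)
```

Calling plot_counts(rows) draws the chart. With positive counts on the horizontal axis, points below the red diagonal lean positive and points above it lean negative.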
This chart is straightforward to interpret. It shows that the emoticons :) and :( are very important for sentiment analysis. Thus, we should not let preprocessing steps get rid of these symbols!
Furthermore, what is the meaning of the crown symbol? It seems to be very negative!