# -*- coding: utf-8 -*-
r"""
Deep Learning with PyTorch
**************************

Deep Learning Building Blocks: Affine maps, non-linearities and objectives
==========================================================================

Deep learning consists of composing linearities with non-linearities in
clever ways. The introduction of non-linearities allows for powerful
models. In this section, we will play with these core components, make
up an objective function, and see how the model is trained.


Affine Maps
~~~~~~~~~~~

One of the core workhorses of deep learning is the affine map, which is
a function :math:`f(x)` where

.. math:: f(x) = Ax + b

for a matrix :math:`A` and vectors :math:`x, b`. The parameters to be
learned here are :math:`A` and :math:`b`. Often, :math:`b` is referred to
as the *bias* term.


PyTorch and most other deep learning frameworks do things a little
differently than traditional linear algebra. They map the rows of the
input instead of the columns. That is, the :math:`i`'th row of the
output below is the mapping of the :math:`i`'th row of the input under
:math:`A`, plus the bias term. Look at the example below.

"""

# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)


######################################################################

lin = nn.Linear(5, 3)  # maps from R^5 to R^3, parameters A, b
# data is 2x5.  A maps from 5 to 3... can we map "data" under A?
data = torch.randn(2, 5)
print(lin(data))  # yes


######################################################################
# Non-Linearities
# ~~~~~~~~~~~~~~~
#
# First, note the following fact, which will explain why we need
# non-linearities in the first place. Suppose we have two affine maps
# :math:`f(x) = Ax + b` and :math:`g(x) = Cx + d`. What is
# :math:`f(g(x))`?
#
# .. math:: f(g(x)) = A(Cx + d) + b = ACx + (Ad + b)
#
# :math:`AC` is a matrix and :math:`Ad + b` is a vector, so we see that
# composing affine maps gives you an affine map.
#
# From this, you can see that if you wanted your neural network to be a long
# chain of affine compositions, this would add no new power to your model
# over just doing a single affine map.
#
# If we introduce non-linearities in between the affine layers, this is no
# longer the case, and we can build much more powerful models.
#
# There are a few core non-linearities.
# :math:`\tanh(x), \sigma(x), \text{ReLU}(x)` are the most common. You are
# probably wondering: "Why these functions? I can think of plenty of other
# non-linearities." The reason is that they have gradients that
# are easy to compute, and computing gradients is essential for learning.
# For example
#
# .. math:: \frac{d\sigma}{dx} = \sigma(x)(1 - \sigma(x))
#
# A quick note: although you may have learned some neural networks in your
# intro to AI class where :math:`\sigma(x)` was the default non-linearity,
# people typically shy away from it in practice. This is because the
# gradient *vanishes* very quickly as the absolute value of the argument
# grows. Small gradients mean it is hard to learn. Most people default to
# tanh or ReLU.
#

# In PyTorch, most non-linearities are in torch.nn.functional (we have it imported as F)
# Note that non-linearities typically don't have parameters like affine maps do.
# That is, they don't have weights that are updated during training.
data = torch.randn(2, 2)
print(data)
print(F.relu(data))

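
######################################################################
# To see the claim about composing affine maps in action, here is a quick
# numerical sanity check (a small sketch; ``f``, ``g``, and ``x`` are made
# up just for this illustration). Two stacked ``nn.Linear`` layers with no
# non-linearity in between compute the same function as a single affine
# map with matrix :math:`AC` and bias :math:`Ad + b`.

f = nn.Linear(3, 2)  # f(x) = Ax + b
g = nn.Linear(5, 3)  # g(x) = Cx + d
x = torch.randn(4, 5)

composed = f(g(x))  # f(g(x)) for a batch of row vectors
A, b = f.weight, f.bias
C, d = g.weight, g.bias
single = x @ (A @ C).t() + (A @ d + b)  # ACx + (Ad + b), applied row-wise
print(torch.allclose(composed, single, atol=1e-6))  # prints True
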

######################################################################
# Softmax and Probabilities
# ~~~~~~~~~~~~~~~~~~~~~~~~~
#
# The function :math:`\text{Softmax}(x)` is also just a non-linearity, but
# it is special in that it usually is the last operation done in a
# network. This is because it takes in a vector of real numbers and
# returns a probability distribution. Its definition is as follows. Let
# :math:`x` be a vector of real numbers (positive, negative, whatever,
# there are no constraints). Then the :math:`i`'th component of
# :math:`\text{Softmax}(x)` is
#
# .. math:: \frac{\exp(x_i)}{\sum_j \exp(x_j)}
#
# It should be clear that the output is a probability distribution: each
# element is non-negative and the sum over all components is 1.
#
# You could also think of it as just applying an element-wise
# exponentiation operator to the input to make everything non-negative and
# then dividing by the normalization constant.
#

# Softmax is also in torch.nn.functional
data = torch.randn(5)
print(data)
print(F.softmax(data, dim=0))
print(F.softmax(data, dim=0).sum())  # Sums to 1 because it is a distribution!
print(F.log_softmax(data, dim=0))  # there's also log_softmax


######################################################################
# Objective Functions
# ~~~~~~~~~~~~~~~~~~~
#
# The objective function is the function that your network is being
# trained to minimize (in which case it is often called a *loss function*
# or *cost function*). Training proceeds by first choosing a training
# instance, running it through your neural network, and then computing the
# loss of the output. The parameters of the model are then updated by
# taking the derivative of the loss function. Intuitively, if your model
# is completely confident in its answer, and its answer is wrong, your
# loss will be high. If it is very confident in its answer, and its answer
# is correct, the loss will be low.
#
# The idea behind minimizing the loss function on your training examples
# is that your network will hopefully generalize well and have small loss
# on unseen examples in your dev set, test set, or in production. An
# example loss function is the *negative log likelihood loss*, which is a
# very common objective for multi-class classification. For supervised
# multi-class classification, this means training the network to minimize
# the negative log probability of the correct output (or equivalently,
# maximize the log probability of the correct output).
#

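
######################################################################
# To make that concrete, here is a tiny worked example (a sketch with
# made-up scores; the real classifier comes later in this tutorial). The
# negative log likelihood loss for one instance is just the negative of
# the log probability the model assigned to the correct label, which is
# what ``F.nll_loss`` computes.

fake_log_probs = F.log_softmax(torch.randn(1, 3), dim=1)  # log probs for 3 labels
fake_target = torch.tensor([2])                           # the "correct" label index
print(F.nll_loss(fake_log_probs, fake_target))            # the loss ...
print(-fake_log_probs[0, 2])                              # ... equals -log P(label 2)
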

######################################################################
# Optimization and Training
# =========================
#
# So, we can compute a loss function for an instance. What do we do
# with that? We saw earlier that Tensors know how to compute gradients
# with respect to the things that were used to compute them. Well,
# since our loss is a Tensor, we can compute gradients with
# respect to all of the parameters used to compute it! Then we can perform
# standard gradient updates. Let :math:`\theta` be our parameters,
# :math:`L(\theta)` the loss function, and :math:`\eta` a positive
# learning rate. Then:
#
# .. math:: \theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta L(\theta)
#
# There is a huge collection of algorithms and active research in
# attempting to do something more than just this vanilla gradient update.
# Many attempt to vary the learning rate based on what is happening at
# train time. You don't need to worry about what specifically these
# algorithms are doing unless you are really interested. Torch provides
# many in the torch.optim package, and they are all completely
# transparent: using the simplest gradient update looks exactly the same
# in code as using the more complicated algorithms. Trying different
# update algorithms and different parameters for the update algorithms
# (like different initial learning rates) is important in optimizing your
# network's performance. Often, just replacing vanilla SGD with an
# optimizer like Adam or RMSProp will boost performance noticeably.
#

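
######################################################################
# The update rule above is easy to spell out by hand. Here is a minimal
# sketch with a made-up parameter tensor and a toy loss, just to show the
# mechanics that torch.optim packages up for you.

theta = torch.randn(3, requires_grad=True)  # our parameters
eta = 0.1                                   # learning rate
toy_loss = (theta ** 2).sum()               # a toy loss L(theta)
toy_loss.backward()                         # populates theta.grad
with torch.no_grad():
    theta -= eta * theta.grad               # theta <- theta - eta * grad L(theta)
print(theta)
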

######################################################################
# Creating Network Components in PyTorch
# ======================================
#
# Before we move on to our focus on NLP, let's do an annotated example of
# building a network in PyTorch using only affine maps and
# non-linearities. We will also see how to compute a loss function, using
# PyTorch's built-in negative log likelihood, and update parameters by
# backpropagation.
#
# All network components should inherit from nn.Module and override the
# forward() method. That is about it, as far as the boilerplate is
# concerned. Inheriting from nn.Module provides functionality to your
# component. For example, it keeps track of its trainable parameters, and
# you can swap it between CPU and GPU with the ``.to(device)``
# method, where device can be a CPU device ``torch.device("cpu")`` or CUDA
# device ``torch.device("cuda:0")``.
#
# Let's write an annotated example of a network that takes in a sparse
# bag-of-words representation and outputs a probability distribution over
# two labels: "English" and "Spanish". This model is just logistic
# regression.
#


######################################################################
# Example: Logistic Regression Bag-of-Words classifier
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Our model will map a sparse BoW representation to log probabilities over
# labels. We assign each word in the vocab an index. For example, say our
# entire vocab is two words "hello" and "world", with indices 0 and 1
# respectively. The BoW vector for the sentence "hello hello hello hello"
# is
#
# .. math:: \left[ 4, 0 \right]
#
# For "hello world world hello", it is
#
# .. math:: \left[ 2, 2 \right]
#
# etc. In general, it is
#
# .. math:: \left[ \text{Count}(\text{hello}), \text{Count}(\text{world}) \right]
#
# Denote this BoW vector as :math:`x`. The output of our network is:
#
# .. math:: \log \text{Softmax}(Ax + b)
#
# That is, we pass the input through an affine map and then do log
# softmax.
#

data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
        ("Give it to me".split(), "ENGLISH"),
        ("No creo que sea una buena idea".split(), "SPANISH"),
        ("No it is not a good idea to get lost at sea".split(), "ENGLISH")]

test_data = [("Yo creo que si".split(), "SPANISH"),
             ("it is lost on me".split(), "ENGLISH")]

# word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the Bag-of-Words vector
word_to_ix = {}
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2


class BoWClassifier(nn.Module):  # inheriting from nn.Module!

    def __init__(self, num_labels, vocab_size):
        # calls the init function of nn.Module. Don't get confused by syntax,
        # just always do it in an nn.Module
        super(BoWClassifier, self).__init__()

        # Define the parameters that you will need. In this case, we need A and b,
        # the parameters of the affine mapping.
        # Torch defines nn.Linear(), which provides the affine map.
        # Make sure you understand why the input dimension is vocab_size
        # and the output is num_labels!
        self.linear = nn.Linear(vocab_size, num_labels)

        # NOTE! The non-linearity log softmax does not have parameters! So we don't need
        # to worry about that here

    def forward(self, bow_vec):
        # Pass the input through the linear layer,
        # then pass that through log_softmax.
        # Many non-linearities and other functions are in torch.nn.functional
        return F.log_softmax(self.linear(bow_vec), dim=1)


def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)


def make_target(label, label_to_ix):
    return torch.LongTensor([label_to_ix[label]])


model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)

# The model knows its parameters. The first output below is A, the second is b.
# Whenever you assign a component to a class variable in the __init__ function
# of a module, as was done with the line
# self.linear = nn.Linear(...)
# then, through some Python magic from the PyTorch devs, your module
# (in this case, BoWClassifier) will store knowledge of the nn.Linear's parameters
for param in model.parameters():
    print(param)

# To run the model, pass in a BoW vector
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    sample = data[0]
    bow_vector = make_bow_vector(sample[0], word_to_ix)
    log_probs = model(bow_vector)
    print(log_probs)


######################################################################
# Which of the above values corresponds to the log probability of ENGLISH,
# and which to SPANISH? We never defined it, but we need to if we want to
# train the thing.
#

label_to_ix = {"SPANISH": 0, "ENGLISH": 1}


######################################################################
# So let's train! To do this, we pass instances through to get log
# probabilities, compute a loss function, compute the gradient of the loss
# function, and then update the parameters with a gradient step. Loss
# functions are provided by Torch in the nn package. nn.NLLLoss() is the
# negative log likelihood loss we want. Torch also provides optimization
# functions in torch.optim. Here, we will just use SGD.
#
# Note that the *input* to NLLLoss is a vector of log probabilities and a
# target label. It doesn't compute the log probabilities for us. This is
# why the last layer of our network is log softmax. The loss function
# nn.CrossEntropyLoss() is the same as NLLLoss(), except it does the log
# softmax for you.
#

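
######################################################################
# A quick check of that last claim (a small sketch with made-up scores):
# running raw scores through log softmax and then NLLLoss gives the same
# value as handing the raw scores directly to CrossEntropyLoss.

raw_scores = torch.randn(1, 2)   # unnormalized scores for our 2 labels
toy_target = torch.tensor([1])
print(nn.NLLLoss()(F.log_softmax(raw_scores, dim=1), toy_target))
print(nn.CrossEntropyLoss()(raw_scores, toy_target))  # same value
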

# Run on test data before we train, just to see a before-and-after
with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

# Print the matrix column corresponding to "creo"
print(next(model.parameters())[:, word_to_ix["creo"]])

loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Usually you want to pass over the training data several times.
# 100 epochs is much bigger than you would use on a real data set, but real
# datasets have more than two instances. Usually, somewhere between 5 and 30
# epochs is reasonable.
for epoch in range(100):
    for instance, label in data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Make our BoW vector and also wrap the target in a
        # Tensor as an integer. For example, if the target is SPANISH, then
        # we wrap the integer 0. The loss function then knows that the 0th
        # element of the log probabilities is the log probability
        # corresponding to SPANISH
        bow_vec = make_bow_vector(instance, word_to_ix)
        target = make_target(label, label_to_ix)

        # Step 3. Run our forward pass.
        log_probs = model(bow_vec)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()

with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

# Index corresponding to Spanish goes up, English goes down!
print(next(model.parameters())[:, word_to_ix["creo"]])


######################################################################
# We got the right answer! You can see that the log probability for
# Spanish is much higher for the first test example, and the log probability
# for English is much higher for the second, as it should be.
#
# Now you have seen how to make a PyTorch component, pass some data through it,
# and do gradient updates. We are ready to dig deeper into what deep NLP
# has to offer.
#
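
######################################################################
# One last convenience before moving on: if you want actual label
# predictions rather than log probabilities, take the argmax of the output
# and map that index back through ``label_to_ix``. A minimal sketch
# (``ix_to_label`` is a helper built here just for this illustration):

ix_to_label = {ix: label for label, ix in label_to_ix.items()}
with torch.no_grad():
    for instance, label in test_data:
        log_probs = model(make_bow_vector(instance, word_to_ix))
        predicted = ix_to_label[log_probs.argmax(dim=1).item()]
        print(" ".join(instance), "->", predicted, "(gold:", label + ")")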