"""1`Introduction <introyt1_tutorial.html>`_ ||2`Tensors <tensors_deeper_tutorial.html>`_ ||3`Autograd <autogradyt_tutorial.html>`_ ||4**Building Models** ||5`TensorBoard Support <tensorboardyt_tutorial.html>`_ ||6`Training Models <trainingyt.html>`_ ||7`Model Understanding <captumyt.html>`_89Building Models with PyTorch10============================1112Follow along with the video below or on `youtube <https://www.youtube.com/watch?v=OSqIP-mOWOI>`__.1314.. raw:: html1516<div style="margin-top:10px; margin-bottom:10px;">17<iframe width="560" height="315" src="https://www.youtube.com/embed/OSqIP-mOWOI" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>18</div>1920``torch.nn.Module`` and ``torch.nn.Parameter``21----------------------------------------------2223In this video, we’ll be discussing some of the tools PyTorch makes24available for building deep learning networks.2526Except for ``Parameter``, the classes we discuss in this video are all27subclasses of ``torch.nn.Module``. This is the PyTorch base class meant28to encapsulate behaviors specific to PyTorch Models and their29components.3031One important behavior of ``torch.nn.Module`` is registering parameters.32If a particular ``Module`` subclass has learning weights, these weights33are expressed as instances of ``torch.nn.Parameter``. The ``Parameter``34class is a subclass of ``torch.Tensor``, with the special behavior that35when they are assigned as attributes of a ``Module``, they are added to36the list of that modules parameters. These parameters may be accessed37through the ``parameters()`` method on the ``Module`` class.3839As a simple example, here’s a very simple model with two linear layers40and an activation function. We’ll create an instance of it and ask it to41report on its parameters:4243"""4445import torch4647class TinyModel(torch.nn.Module):4849def __init__(self):50super(TinyModel, self).__init__()5152self.linear1 = torch.nn.Linear(100, 200)53self.activation = torch.nn.ReLU()54self.linear2 = torch.nn.Linear(200, 10)55self.softmax = torch.nn.Softmax()5657def forward(self, x):58x = self.linear1(x)59x = self.activation(x)60x = self.linear2(x)61x = self.softmax(x)62return x6364tinymodel = TinyModel()6566print('The model:')67print(tinymodel)6869print('\n\nJust one layer:')70print(tinymodel.linear2)7172print('\n\nModel params:')73for param in tinymodel.parameters():74print(param)7576print('\n\nLayer params:')77for param in tinymodel.linear2.parameters():78print(param)798081#########################################################################82# This shows the fundamental structure of a PyTorch model: there is an83# ``__init__()`` method that defines the layers and other components of a84# model, and a ``forward()`` method where the computation gets done. Note85# that we can print the model, or any of its submodules, to learn about86# its structure.87#88# Common Layer Types89# ------------------90#91# Linear Layers92# ~~~~~~~~~~~~~93#94# The most basic type of neural network layer is a *linear* or *fully95# connected* layer. This is a layer where every input influences every96# output of the layer to a degree specified by the layer’s weights. If a97# model has *m* inputs and *n* outputs, the weights will be an *m* x *n*98# matrix. 

#########################################################################
# Common Layer Types
# ------------------
#
# Linear Layers
# ~~~~~~~~~~~~~
#
# The most basic type of neural network layer is a *linear* or *fully
# connected* layer. This is a layer where every input influences every
# output of the layer to a degree specified by the layer’s weights. If a
# model has *m* inputs and *n* outputs, the weights will be an *m* x *n*
# matrix. For example:
#

lin = torch.nn.Linear(3, 2)
x = torch.rand(1, 3)
print('Input:')
print(x)

print('\n\nWeight and Bias parameters:')
for param in lin.parameters():
    print(param)

y = lin(x)
print('\n\nOutput:')
print(y)


#########################################################################
# If you do the matrix multiplication of ``x`` by the linear layer’s
# weights, and add the biases, you’ll find that you get the output vector
# ``y``.
#
# One other important feature to note: when we printed the weights of our
# layer above, they reported themselves as ``Parameter`` (which is a
# subclass of ``Tensor``), and let us know that they’re tracking
# gradients with autograd. This is a default behavior for ``Parameter``
# that differs from ``Tensor``.
#
# Linear layers are used widely in deep learning models. One of the most
# common places you’ll see them is in classifier models, which will
# usually have one or more linear layers at the end, where the last layer
# will have *n* outputs, where *n* is the number of classes the classifier
# addresses.
#
# Convolutional Layers
# ~~~~~~~~~~~~~~~~~~~~
#
# *Convolutional* layers are built to handle data with a high degree of
# spatial correlation. They are very commonly used in computer vision,
# where they detect close groupings of features which they compose into
# higher-level features. They pop up in other contexts too - for example,
# in NLP applications, where a word’s immediate context (that is, the
# other words nearby in the sequence) can affect the meaning of a
# sentence.
#
# We saw convolutional layers in action in LeNet5 in an earlier video:
#

import torch.nn.functional as F


class LeNet(torch.nn.Module):

    def __init__(self):
        super(LeNet, self).__init__()
        # 1 input image channel (black & white), 6 output channels,
        # 5x5 square convolution kernel
        self.conv1 = torch.nn.Conv2d(1, 6, 5)
        self.conv2 = torch.nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = torch.nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = torch.nn.Linear(120, 84)
        self.fc3 = torch.nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features
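

##########################################################################
# Before digging into the layers, here is a small sanity check (an
# addition to the original walkthrough, using a made-up random input):
# passing a single dummy 1x32x32 "image" through the network should
# produce one row of 10 class scores.
#

lenet = LeNet()
dummy_image = torch.rand(1, 1, 32, 32)  # batch of 1, 1 channel, 32x32 pixels
print(lenet(dummy_image).shape)         # expected: torch.Size([1, 10])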


##########################################################################
# Let’s break down what’s happening in the convolutional layers of this
# model. Starting with ``conv1``:
#
# - LeNet5 is meant to take in a 1x32x32 black & white image. **The first
#   argument to a convolutional layer’s constructor is the number of
#   input channels.** Here, it is 1. If we were building this model to
#   look at 3-color channels, it would be 3.
# - A convolutional layer is like a window that scans over the image,
#   looking for a pattern it recognizes. These patterns are called
#   *features,* and one of the parameters of a convolutional layer is the
#   number of features we would like it to learn. **The second argument
#   to the constructor is the number of output features.** Here, we’re
#   asking our layer to learn 6 features.
# - Just above, I likened the convolutional layer to a window - but how
#   big is the window? **The third argument is the window or kernel
#   size.** Here, the “5” means we’ve chosen a 5x5 kernel. (If you want a
#   kernel with height different from width, you can specify a tuple for
#   this argument - e.g., ``(3, 5)`` to get a 3x5 convolution kernel.)
#
# The output of a convolutional layer is an *activation map* - a spatial
# representation of the presence of features in the input tensor.
# ``conv1`` will give us an output tensor of 6x28x28; 6 is the number of
# features, and 28 is the height and width of our map. (The 28 comes from
# the fact that when scanning a 5-pixel window over a 32-pixel row, there
# are only 28 valid positions.)
#
# We then pass the output of the convolution through a ReLU activation
# function (more on activation functions later), then through a max
# pooling layer. The max pooling layer takes features near each other in
# the activation map and groups them together. It does this by reducing
# the tensor, merging every 2x2 group of cells in the output into a single
# cell, and assigning that cell the maximum value of the 4 cells that went
# into it. This gives us a lower-resolution version of the activation map,
# with dimensions 6x14x14.
#
# Our next convolutional layer, ``conv2``, expects 6 input channels
# (corresponding to the 6 features sought by the first layer), has 16
# output channels, and a 3x3 kernel. It puts out a 16x12x12 activation
# map, which is again reduced by a max pooling layer to 16x6x6. Prior to
# passing this output to the linear layers, it is reshaped to a
# 16 \* 6 \* 6 = 576-element vector for consumption by the next layer.
#
# There are convolutional layers for addressing 1D, 2D, and 3D tensors.
# There are also many more optional arguments for a conv layer
# constructor, including stride length (e.g., only scanning every second
# or every third position in the input), padding (so you can scan out to
# the edges of the input), and more. See the
# `documentation <https://pytorch.org/docs/stable/nn.html#convolution-layers>`__
# for more information.
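
##########################################################################
# As an illustration of those optional arguments (the settings below are
# our own, not from the video), here is a convolutional layer with a
# non-square kernel, a stride of 2, and one pixel of padding. With
# dilation of 1, each output dimension follows
# floor((size + 2 * padding - kernel) / stride) + 1:
#

conv = torch.nn.Conv2d(1, 6, kernel_size=(3, 5), stride=2, padding=1)
conv_input = torch.rand(1, 1, 32, 32)
print(conv(conv_input).shape)  # torch.Size([1, 6, 16, 15])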

##########################################################################
# Recurrent Layers
# ~~~~~~~~~~~~~~~~
#
# *Recurrent neural networks* (or *RNNs*) are used for sequential data -
# anything from time-series measurements from a scientific instrument to
# natural language sentences to DNA nucleotides. An RNN handles such data
# by maintaining a *hidden state* that acts as a sort of memory for what
# it has seen in the sequence so far.
#
# The internal structure of an RNN layer - or its variants, the LSTM (long
# short-term memory) and GRU (gated recurrent unit) - is moderately
# complex and beyond the scope of this video, but we’ll show you what one
# looks like in action with an LSTM-based part-of-speech tagger (a type of
# classifier that tells you if a word is a noun, verb, etc.):
#

class LSTMTagger(torch.nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = torch.nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = torch.nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = torch.nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores


########################################################################
# The constructor has four arguments:
#
# - ``vocab_size`` is the number of words in the input vocabulary. Each
#   word is a one-hot vector (or unit vector) in a
#   ``vocab_size``-dimensional space.
# - ``tagset_size`` is the number of tags in the output set.
# - ``embedding_dim`` is the size of the *embedding* space for the
#   vocabulary. An embedding maps a vocabulary onto a low-dimensional
#   space, where words with similar meanings are close together in the
#   space.
# - ``hidden_dim`` is the size of the LSTM’s memory.
#
# The input will be a sentence with the words represented as indices of
# one-hot vectors. The embedding layer will then map these down to an
# ``embedding_dim``-dimensional space. The LSTM takes this sequence of
# embeddings and iterates over it, yielding an output vector of length
# ``hidden_dim`` for each word. The final linear layer acts as a
# classifier; applying ``log_softmax()`` to the output of the final layer
# converts the output into a normalized set of estimated log-probabilities
# that a given word maps to a given tag.
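
########################################################################
# As a quick usage sketch (the dimensions and word indices below are made
# up purely to show the shapes involved), a 4-word sentence run through a
# tagger with a 6-word vocabulary and 3 tags produces one row of tag
# scores per word:
#

tagger = LSTMTagger(embedding_dim=8, hidden_dim=16, vocab_size=6, tagset_size=3)
toy_sentence = torch.tensor([0, 2, 4, 5])  # word indices into the vocabulary
print(tagger(toy_sentence).shape)          # torch.Size([4, 3])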

########################################################################
# If you’d like to see this network in action, check out the `Sequence
# Models and LSTM
# Networks <https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html>`__
# tutorial on pytorch.org.
#
# Transformers
# ~~~~~~~~~~~~
#
# *Transformers* are multi-purpose networks that have taken over the state
# of the art in NLP with models like BERT. A discussion of transformer
# architecture is beyond the scope of this video, but PyTorch has a
# ``Transformer`` class that allows you to define the overall parameters
# of a transformer model - the number of attention heads, the number of
# encoder & decoder layers, dropout and activation functions, etc. (You
# can even build the BERT model from this single class, with the right
# parameters!) PyTorch also has classes to encapsulate the individual
# components (``TransformerEncoder``, ``TransformerDecoder``) and
# subcomponents (``TransformerEncoderLayer``,
# ``TransformerDecoderLayer``). For details, check out the
# `documentation <https://pytorch.org/docs/stable/nn.html#transformer-layers>`__
# on transformer classes.
#
# Other Layers and Functions
# --------------------------
#
# Data Manipulation Layers
# ~~~~~~~~~~~~~~~~~~~~~~~~
#
# There are other layer types that perform important functions in models,
# but don’t participate in the learning process themselves.
#
# **Max pooling** (and its twin, min pooling) reduce a tensor by combining
# cells, and assigning the maximum value of the input cells to the output
# cell (we saw this). For example:
#

my_tensor = torch.rand(1, 6, 6)
print(my_tensor)

maxpool_layer = torch.nn.MaxPool2d(3)
print(maxpool_layer(my_tensor))


#########################################################################
# If you look closely at the values above, you’ll see that each of the
# values in the maxpooled output is the maximum value of each quadrant of
# the 6x6 input.
#
# **Normalization layers** re-center and normalize the output of one layer
# before feeding it to another. Centering and scaling the intermediate
# tensors has a number of beneficial effects, such as letting you use
# higher learning rates without exploding/vanishing gradients.
#

my_tensor = torch.rand(1, 4, 4) * 20 + 5
print(my_tensor)

print(my_tensor.mean())

norm_layer = torch.nn.BatchNorm1d(4)
normed_tensor = norm_layer(my_tensor)
print(normed_tensor)

print(normed_tensor.mean())


##########################################################################
# Running the cell above, we’ve added a large scaling factor and offset to
# an input tensor; you should see the input tensor’s ``mean()`` somewhere
# in the neighborhood of 15. After running it through the normalization
# layer, you can see that the values are smaller, and grouped around zero
# - in fact, the mean should be very small (on the order of 1e-8).
#
# This is beneficial because many activation functions (discussed below)
# have their strongest gradients near 0, but sometimes suffer from
# vanishing or exploding gradients for inputs that drive them far away
# from zero. Keeping the data centered around the area of steepest
# gradient will tend to mean faster, better learning and higher feasible
# learning rates.
#
# **Dropout layers** are a tool for encouraging *sparse representations*
# in your model - that is, pushing it to do inference with less data.
#
# Dropout layers work by randomly setting parts of the input tensor to
# zero *during training* - dropout layers are always turned off for
# inference. This forces the model to learn against this masked or
# reduced dataset. For example:
#

my_tensor = torch.rand(1, 4, 4)

dropout = torch.nn.Dropout(p=0.4)
print(dropout(my_tensor))
print(dropout(my_tensor))


##########################################################################
# Above, you can see the effect of dropout on a sample tensor. You can use
# the optional ``p`` argument to set the probability of an individual
# element dropping out; if you don’t, it defaults to 0.5.
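
##########################################################################
# One extra point worth demonstrating (this cell is an addition to the
# original walkthrough): dropout only behaves this way in training mode.
# Putting the module in inference mode with ``eval()`` passes the tensor
# through unchanged.
#

dropout.eval()             # inference mode - dropout becomes a no-op
print(dropout(my_tensor))  # identical to the input tensor
dropout.train()            # back to training mode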

##########################################################################
# Activation Functions
# ~~~~~~~~~~~~~~~~~~~~
#
# Activation functions make deep learning possible. A neural network is
# really a program - with many parameters - that *simulates a mathematical
# function*. If all we did was multiply tensors by layer weights
# repeatedly, we could only simulate *linear functions;* further, there
# would be no point to having many layers, as the whole network could be
# reduced to a single matrix multiplication. Inserting *non-linear*
# activation functions between layers is what allows a deep learning
# model to simulate any function, rather than just linear ones.
#
# ``torch.nn`` has classes encapsulating all of the major activation
# functions, including ReLU and its many variants, Tanh, Hardtanh,
# sigmoid, and more. It also includes other functions, such as Softmax,
# that are most useful at the output stage of a model.
#
# Loss Functions
# ~~~~~~~~~~~~~~
#
# Loss functions tell us how far a model’s prediction is from the correct
# answer. PyTorch contains a variety of loss functions, including common
# MSE (mean squared error = L2 norm), Cross Entropy Loss and Negative
# Log Likelihood Loss (useful for classifiers), and others.
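
##########################################################################
# As a final (added) illustration - the tensors here are random
# placeholders, not real data - here is a loss function in use: Cross
# Entropy Loss compares a batch of unnormalized class scores against the
# correct class indices and returns a single scalar.
#

loss_fn = torch.nn.CrossEntropyLoss()
dummy_scores = torch.rand(4, 10)           # raw scores for 4 samples, 10 classes
dummy_labels = torch.randint(0, 10, (4,))  # the correct class index for each sample
print(loss_fn(dummy_scores, dummy_labels))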