CoCalc -- modelsyt

GitHub Repository: pytorch/tutorials
Path: blob/main/beginner_source/introyt/modelsyt_tutorial.py
²¹³⁹ views
1
"""
2
`Introduction <introyt1_tutorial.html>`_ ||
3
`Tensors <tensors_deeper_tutorial.html>`_ ||
4
`Autograd <autogradyt_tutorial.html>`_ ||
5
**Building Models** ||
6
`TensorBoard Support <tensorboardyt_tutorial.html>`_ ||
7
`Training Models <trainingyt.html>`_ ||
8
`Model Understanding <captumyt.html>`_
9

10
Building Models with PyTorch
11
============================
12

13
Follow along with the video below or on `youtube <https://www.youtube.com/watch?v=OSqIP-mOWOI>`__.
14

15
.. raw:: html
16

17
   <div style="margin-top:10px; margin-bottom:10px;">
18
     <iframe width="560" height="315" src="https://www.youtube.com/embed/OSqIP-mOWOI" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
19
   </div>
20

21
``torch.nn.Module`` and ``torch.nn.Parameter``
22
----------------------------------------------
23

24
In this video, we’ll be discussing some of the tools PyTorch makes
25
available for building deep learning networks.
26

27
Except for ``Parameter``, the classes we discuss in this video are all
28
subclasses of ``torch.nn.Module``. This is the PyTorch base class meant
29
to encapsulate behaviors specific to PyTorch Models and their
30
components.
31

32
One important behavior of ``torch.nn.Module`` is registering parameters.
33
If a particular ``Module`` subclass has learning weights, these weights
34
are expressed as instances of ``torch.nn.Parameter``. The ``Parameter``
35
class is a subclass of ``torch.Tensor``, with the special behavior that
36
when they are assigned as attributes of a ``Module``, they are added to
37
the list of that modules parameters. These parameters may be accessed
38
through the ``parameters()`` method on the ``Module`` class.
39

40
As a simple example, here’s a very simple model with two linear layers
41
and an activation function. We’ll create an instance of it and ask it to
42
report on its parameters:
43

44
"""
45

46
import torch
47

48
class TinyModel(torch.nn.Module):
49
    
50
    def __init__(self):
51
        super(TinyModel, self).__init__()
52
        
53
        self.linear1 = torch.nn.Linear(100, 200)
54
        self.activation = torch.nn.ReLU()
55
        self.linear2 = torch.nn.Linear(200, 10)
56
        self.softmax = torch.nn.Softmax()
57
    
58
    def forward(self, x):
59
        x = self.linear1(x)
60
        x = self.activation(x)
61
        x = self.linear2(x)
62
        x = self.softmax(x)
63
        return x
64

65
tinymodel = TinyModel()
66

67
print('The model:')
68
print(tinymodel)
69

70
print('\n\nJust one layer:')
71
print(tinymodel.linear2)
72

73
print('\n\nModel params:')
74
for param in tinymodel.parameters():
75
    print(param)
76

77
print('\n\nLayer params:')
78
for param in tinymodel.linear2.parameters():
79
    print(param)
80

81

82
#########################################################################
83
# This shows the fundamental structure of a PyTorch model: there is an
84
# ``__init__()`` method that defines the layers and other components of a
85
# model, and a ``forward()`` method where the computation gets done. Note
86
# that we can print the model, or any of its submodules, to learn about
87
# its structure.
88
# 
89
# Common Layer Types
90
# ------------------
91
# 
92
# Linear Layers
93
# ~~~~~~~~~~~~~
94
# 
95
# The most basic type of neural network layer is a *linear* or *fully
96
# connected* layer. This is a layer where every input influences every
97
# output of the layer to a degree specified by the layer’s weights. If a
98
# model has *m* inputs and *n* outputs, the weights will be an *m* x *n*
99
# matrix. For example:
100
# 
101

102
lin = torch.nn.Linear(3, 2)
103
x = torch.rand(1, 3)
104
print('Input:')
105
print(x)
106

107
print('\n\nWeight and Bias parameters:')
108
for param in lin.parameters():
109
    print(param)
110

111
y = lin(x)
112
print('\n\nOutput:')
113
print(y)
114

115

116
#########################################################################
117
# If you do the matrix multiplication of ``x`` by the linear layer’s
118
# weights, and add the biases, you’ll find that you get the output vector
119
# ``y``.
120
# 
121
# One other important feature to note: When we checked the weights of our
122
# layer with ``lin.weight``, it reported itself as a ``Parameter`` (which
123
# is a subclass of ``Tensor``), and let us know that it’s tracking
124
# gradients with autograd. This is a default behavior for ``Parameter``
125
# that differs from ``Tensor``.
126
# 
127
# Linear layers are used widely in deep learning models. One of the most
128
# common places you’ll see them is in classifier models, which will
129
# usually have one or more linear layers at the end, where the last layer
130
# will have *n* outputs, where *n* is the number of classes the classifier
131
# addresses.
132
# 
133
# Convolutional Layers
134
# ~~~~~~~~~~~~~~~~~~~~
135
# 
136
# *Convolutional* layers are built to handle data with a high degree of
137
# spatial correlation. They are very commonly used in computer vision,
138
# where they detect close groupings of features which the compose into
139
# higher-level features. They pop up in other contexts too - for example,
140
# in NLP applications, where a word’s immediate context (that is, the
141
# other words nearby in the sequence) can affect the meaning of a
142
# sentence.
143
# 
144
# We saw convolutional layers in action in LeNet5 in an earlier video:
145
# 
146

147
import torch.functional as F
148

149

150
class LeNet(torch.nn.Module):
151

152
    def __init__(self):
153
        super(LeNet, self).__init__()
154
        # 1 input image channel (black & white), 6 output channels, 5x5 square convolution
155
        # kernel
156
        self.conv1 = torch.nn.Conv2d(1, 6, 5)
157
        self.conv2 = torch.nn.Conv2d(6, 16, 3)
158
        # an affine operation: y = Wx + b
159
        self.fc1 = torch.nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
160
        self.fc2 = torch.nn.Linear(120, 84)
161
        self.fc3 = torch.nn.Linear(84, 10)
162

163
    def forward(self, x):
164
        # Max pooling over a (2, 2) window
165
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
166
        # If the size is a square you can only specify a single number
167
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
168
        x = x.view(-1, self.num_flat_features(x))
169
        x = F.relu(self.fc1(x))
170
        x = F.relu(self.fc2(x))
171
        x = self.fc3(x)
172
        return x
173

174
    def num_flat_features(self, x):
175
        size = x.size()[1:]  # all dimensions except the batch dimension
176
        num_features = 1
177
        for s in size:
178
            num_features *= s
179
        return num_features
180

181

182
##########################################################################
183
# Let’s break down what’s happening in the convolutional layers of this
184
# model. Starting with ``conv1``:
185
# 
186
# -  LeNet5 is meant to take in a 1x32x32 black & white image. **The first
187
#    argument to a convolutional layer’s constructor is the number of
188
#    input channels.** Here, it is 1. If we were building this model to
189
#    look at 3-color channels, it would be 3.
190
# -  A convolutional layer is like a window that scans over the image,
191
#    looking for a pattern it recognizes. These patterns are called
192
#    *features,* and one of the parameters of a convolutional layer is the
193
#    number of features we would like it to learn. **This is the second
194
#    argument to the constructor is the number of output features.** Here,
195
#    we’re asking our layer to learn 6 features.
196
# -  Just above, I likened the convolutional layer to a window - but how
197
#    big is the window? **The third argument is the window or kernel
198
#    size.** Here, the “5” means we’ve chosen a 5x5 kernel. (If you want a
199
#    kernel with height different from width, you can specify a tuple for
200
#    this argument - e.g., ``(3, 5)`` to get a 3x5 convolution kernel.)
201
# 
202
# The output of a convolutional layer is an *activation map* - a spatial
203
# representation of the presence of features in the input tensor.
204
# ``conv1`` will give us an output tensor of 6x28x28; 6 is the number of
205
# features, and 28 is the height and width of our map. (The 28 comes from
206
# the fact that when scanning a 5-pixel window over a 32-pixel row, there
207
# are only 28 valid positions.)
208
# 
209
# We then pass the output of the convolution through a ReLU activation
210
# function (more on activation functions later), then through a max
211
# pooling layer. The max pooling layer takes features near each other in
212
# the activation map and groups them together. It does this by reducing
213
# the tensor, merging every 2x2 group of cells in the output into a single
214
# cell, and assigning that cell the maximum value of the 4 cells that went
215
# into it. This gives us a lower-resolution version of the activation map,
216
# with dimensions 6x14x14.
217
# 
218
# Our next convolutional layer, ``conv2``, expects 6 input channels
219
# (corresponding to the 6 features sought by the first layer), has 16
220
# output channels, and a 3x3 kernel. It puts out a 16x12x12 activation
221
# map, which is again reduced by a max pooling layer to 16x6x6. Prior to
222
# passing this output to the linear layers, it is reshaped to a 16 \* 6 \*
223
# 6 = 576-element vector for consumption by the next layer.
224
# 
225
# There are convolutional layers for addressing 1D, 2D, and 3D tensors.
226
# There are also many more optional arguments for a conv layer
227
# constructor, including stride length(e.g., only scanning every second or
228
# every third position) in the input, padding (so you can scan out to the
229
# edges of the input), and more. See the
230
# `documentation <https://pytorch.org/docs/stable/nn.html#convolution-layers>`__
231
# for more information.
232
# 
233
# Recurrent Layers
234
# ~~~~~~~~~~~~~~~~
235
# 
236
# *Recurrent neural networks* (or *RNNs)* are used for sequential data -
237
# anything from time-series measurements from a scientific instrument to
238
# natural language sentences to DNA nucleotides. An RNN does this by
239
# maintaining a *hidden state* that acts as a sort of memory for what it
240
# has seen in the sequence so far.
241
# 
242
# The internal structure of an RNN layer - or its variants, the LSTM (long
243
# short-term memory) and GRU (gated recurrent unit) - is moderately
244
# complex and beyond the scope of this video, but we’ll show you what one
245
# looks like in action with an LSTM-based part-of-speech tagger (a type of
246
# classifier that tells you if a word is a noun, verb, etc.):
247
# 
248

249
class LSTMTagger(torch.nn.Module):
250

251
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
252
        super(LSTMTagger, self).__init__()
253
        self.hidden_dim = hidden_dim
254

255
        self.word_embeddings = torch.nn.Embedding(vocab_size, embedding_dim)
256

257
        # The LSTM takes word embeddings as inputs, and outputs hidden states
258
        # with dimensionality hidden_dim.
259
        self.lstm = torch.nn.LSTM(embedding_dim, hidden_dim)
260

261
        # The linear layer that maps from hidden state space to tag space
262
        self.hidden2tag = torch.nn.Linear(hidden_dim, tagset_size)
263

264
    def forward(self, sentence):
265
        embeds = self.word_embeddings(sentence)
266
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
267
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
268
        tag_scores = F.log_softmax(tag_space, dim=1)
269
        return tag_scores
270

271

272
########################################################################
273
# The constructor has four arguments:
274
# 
275
# -  ``vocab_size`` is the number of words in the input vocabulary. Each
276
#    word is a one-hot vector (or unit vector) in a
277
#    ``vocab_size``-dimensional space.
278
# -  ``tagset_size`` is the number of tags in the output set.
279
# -  ``embedding_dim`` is the size of the *embedding* space for the
280
#    vocabulary. An embedding maps a vocabulary onto a low-dimensional
281
#    space, where words with similar meanings are close together in the
282
#    space.
283
# -  ``hidden_dim`` is the size of the LSTM’s memory.
284
# 
285
# The input will be a sentence with the words represented as indices of
286
# one-hot vectors. The embedding layer will then map these down to an
287
# ``embedding_dim``-dimensional space. The LSTM takes this sequence of
288
# embeddings and iterates over it, fielding an output vector of length
289
# ``hidden_dim``. The final linear layer acts as a classifier; applying
290
# ``log_softmax()`` to the output of the final layer converts the output
291
# into a normalized set of estimated probabilities that a given word maps
292
# to a given tag.
293
# 
294
# If you’d like to see this network in action, check out the `Sequence
295
# Models and LSTM
296
# Networks <https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html>`__
297
# tutorial on pytorch.org.
298
# 
299
# Transformers
300
# ~~~~~~~~~~~~
301
# 
302
# *Transformers* are multi-purpose networks that have taken over the state
303
# of the art in NLP with models like BERT. A discussion of transformer
304
# architecture is beyond the scope of this video, but PyTorch has a
305
# ``Transformer`` class that allows you to define the overall parameters
306
# of a transformer model - the number of attention heads, the number of
307
# encoder & decoder layers, dropout and activation functions, etc. (You
308
# can even build the BERT model from this single class, with the right
309
# parameters!) The ``torch.nn.Transformer`` class also has classes to
310
# encapsulate the individual components (``TransformerEncoder``,
311
# ``TransformerDecoder``) and subcomponents (``TransformerEncoderLayer``,
312
# ``TransformerDecoderLayer``). For details, check out the
313
# `documentation <https://pytorch.org/docs/stable/nn.html#transformer-layers>`__
314
# on transformer classes.
315
# 
316
# Other Layers and Functions
317
# --------------------------
318
# 
319
# Data Manipulation Layers
320
# ~~~~~~~~~~~~~~~~~~~~~~~~
321
# 
322
# There are other layer types that perform important functions in models,
323
# but don’t participate in the learning process themselves.
324
# 
325
# **Max pooling** (and its twin, min pooling) reduce a tensor by combining
326
# cells, and assigning the maximum value of the input cells to the output
327
# cell (we saw this). For example:
328
# 
329

330
my_tensor = torch.rand(1, 6, 6)
331
print(my_tensor)
332

333
maxpool_layer = torch.nn.MaxPool2d(3)
334
print(maxpool_layer(my_tensor))
335

336

337
#########################################################################
338
# If you look closely at the values above, you’ll see that each of the
339
# values in the maxpooled output is the maximum value of each quadrant of
340
# the 6x6 input.
341
# 
342
# **Normalization layers** re-center and normalize the output of one layer
343
# before feeding it to another. Centering and scaling the intermediate
344
# tensors has a number of beneficial effects, such as letting you use
345
# higher learning rates without exploding/vanishing gradients.
346
# 
347

348
my_tensor = torch.rand(1, 4, 4) * 20 + 5
349
print(my_tensor)
350

351
print(my_tensor.mean())
352

353
norm_layer = torch.nn.BatchNorm1d(4)
354
normed_tensor = norm_layer(my_tensor)
355
print(normed_tensor)
356

357
print(normed_tensor.mean())
358

359

360

361
##########################################################################
362
# Running the cell above, we’ve added a large scaling factor and offset to
363
# an input tensor; you should see the input tensor’s ``mean()`` somewhere
364
# in the neighborhood of 15. After running it through the normalization
365
# layer, you can see that the values are smaller, and grouped around zero
366
# - in fact, the mean should be very small (> 1e-8).
367
# 
368
# This is beneficial because many activation functions (discussed below)
369
# have their strongest gradients near 0, but sometimes suffer from
370
# vanishing or exploding gradients for inputs that drive them far away
371
# from zero. Keeping the data centered around the area of steepest
372
# gradient will tend to mean faster, better learning and higher feasible
373
# learning rates.
374
# 
375
# **Dropout layers** are a tool for encouraging *sparse representations*
376
# in your model - that is, pushing it to do inference with less data.
377
# 
378
# Dropout layers work by randomly setting parts of the input tensor
379
# *during training* - dropout layers are always turned off for inference.
380
# This forces the model to learn against this masked or reduced dataset.
381
# For example:
382
# 
383

384
my_tensor = torch.rand(1, 4, 4)
385

386
dropout = torch.nn.Dropout(p=0.4)
387
print(dropout(my_tensor))
388
print(dropout(my_tensor))
389

390

391
##########################################################################
392
# Above, you can see the effect of dropout on a sample tensor. You can use
393
# the optional ``p`` argument to set the probability of an individual
394
# weight dropping out; if you don’t it defaults to 0.5.
395
# 
396
# Activation Functions
397
# ~~~~~~~~~~~~~~~~~~~~
398
# 
399
# Activation functions make deep learning possible. A neural network is
400
# really a program - with many parameters - that *simulates a mathematical
401
# function*. If all we did was multiple tensors by layer weights
402
# repeatedly, we could only simulate *linear functions;* further, there
403
# would be no point to having many layers, as the whole network would
404
# reduce could be reduced to a single matrix multiplication. Inserting
405
# *non-linear* activation functions between layers is what allows a deep
406
# learning model to simulate any function, rather than just linear ones.
407
# 
408
# ``torch.nn.Module`` has objects encapsulating all of the major
409
# activation functions including ReLU and its many variants, Tanh,
410
# Hardtanh, sigmoid, and more. It also includes other functions, such as
411
# Softmax, that are most useful at the output stage of a model.
412
# 
413
# Loss Functions
414
# ~~~~~~~~~~~~~~
415
# 
416
# Loss functions tell us how far a model’s prediction is from the correct
417
# answer. PyTorch contains a variety of loss functions, including common
418
# MSE (mean squared error = L2 norm), Cross Entropy Loss and Negative
419
# Likelihood Loss (useful for classifiers), and others.
420
# 
421

422
Product

Resources

Company