# -*- coding: utf-8 -*-
"""
A Gentle Introduction to ``torch.autograd``
===========================================

``torch.autograd`` is PyTorch’s automatic differentiation engine that powers
neural network training. In this section, you will get a conceptual
understanding of how autograd helps a neural network train.

Background
~~~~~~~~~~
Neural networks (NNs) are a collection of nested functions that are
executed on some input data. These functions are defined by *parameters*
(consisting of weights and biases), which in PyTorch are stored in
tensors.

Training a NN happens in two steps:

**Forward Propagation**: In forward prop, the NN makes its best guess
about the correct output. It runs the input data through each of its
functions to make this guess.

**Backward Propagation**: In backprop, the NN adjusts its parameters
in proportion to the error in its guess. It does this by traversing
backwards from the output, collecting the derivatives of the error with
respect to the parameters of the functions (*gradients*), and optimizing
the parameters using gradient descent. For a more detailed walkthrough
of backprop, check out this `video from
3Blue1Brown <https://www.youtube.com/watch?v=tIeHLnjs5U8>`__.


Usage in PyTorch
~~~~~~~~~~~~~~~~
Let's take a look at a single training step.
For this example, we load a pretrained resnet18 model from ``torchvision``.
We create a random data tensor to represent a single image with 3 channels and a height & width of 64,
and a corresponding ``label`` initialized to some random values. For this pretrained model,
the label has shape ``(1, 1000)``.

.. note::
   This tutorial works only on the CPU and will not work on GPU devices (even if tensors are moved to CUDA).

"""
import torch
from torchvision.models import resnet18, ResNet18_Weights
model = resnet18(weights=ResNet18_Weights.DEFAULT)
data = torch.rand(1, 3, 64, 64)
labels = torch.rand(1, 1000)

############################################################
# Next, we run the input data through each of the model's layers to make a prediction.
# This is the **forward pass**.
#

prediction = model(data)  # forward pass

############################################################
# We use the model's prediction and the corresponding label to calculate the error (``loss``).
# The next step is to backpropagate this error through the network.
# Backward propagation is kicked off when we call ``.backward()`` on the error tensor.
# Autograd then calculates and stores the gradients for each model parameter in the parameter's ``.grad`` attribute.
#

loss = (prediction - labels).sum()
loss.backward()  # backward pass

############################################################
# Next, we load an optimizer, in this case SGD with a learning rate of 0.01 and `momentum <https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d>`__ of 0.9.
# We register all the parameters of the model in the optimizer.
#

optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

######################################################################
# Finally, we call ``.step()`` to initiate gradient descent. The optimizer adjusts each parameter by its gradient stored in ``.grad``.
#

optim.step()  # gradient descent

######################################################################
# At this point, you have everything you need to train your neural network.
# The sections below detail the workings of autograd - feel free to skip them.
#
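
######################################################################
# Put together, one pass of training looks like the sketch below. This is only
# an illustrative recap using the random ``data`` and ``labels`` from above; a
# real loop would iterate over a ``DataLoader`` and use a proper loss function
# such as ``nn.CrossEntropyLoss``.

for _ in range(2):  # a couple of illustrative iterations
    optim.zero_grad()                   # clear gradients accumulated in .grad
    prediction = model(data)            # forward pass
    loss = (prediction - labels).sum()  # compute the error
    loss.backward()                     # backward pass populates .grad
    optim.step()                        # gradient descent step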


######################################################################
# --------------
#


######################################################################
# Differentiation in Autograd
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Let's take a look at how ``autograd`` collects gradients. We create two tensors ``a`` and ``b`` with
# ``requires_grad=True``. This signals to ``autograd`` that every operation on them should be tracked.
#

import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

######################################################################
# We create another tensor ``Q`` from ``a`` and ``b``.
#
# .. math::
#    Q = 3a^3 - b^2

Q = 3*a**3 - b**2

######################################################################
# Let's assume ``a`` and ``b`` to be parameters of an NN, and ``Q``
# to be the error. In NN training, we want gradients of the error
# w.r.t. parameters, i.e.
#
# .. math::
#    \frac{\partial Q}{\partial a} = 9a^2
#
# .. math::
#    \frac{\partial Q}{\partial b} = -2b
#
#
# When we call ``.backward()`` on ``Q``, autograd calculates these gradients
# and stores them in the respective tensors' ``.grad`` attribute.
#
# We need to explicitly pass a ``gradient`` argument in ``Q.backward()`` because it is a vector.
# ``gradient`` is a tensor of the same shape as ``Q``, and it represents the
# gradient of ``Q`` w.r.t. itself, i.e.
#
# .. math::
#    \frac{dQ}{dQ} = 1
#
# Equivalently, we can also aggregate ``Q`` into a scalar and call backward implicitly, like ``Q.sum().backward()``.
#
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)


#######################################################################
# Gradients are now deposited in ``a.grad`` and ``b.grad``.

# check if collected gradients are correct
print(9*a**2 == a.grad)
print(-2*b == b.grad)
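
######################################################################
# As a quick check of the scalar form mentioned above, here is a small sketch
# using fresh tensors (``a2``/``b2`` are illustrative names), so the gradients
# are not accumulated on top of ``a.grad`` and ``b.grad``:

a2 = torch.tensor([2., 3.], requires_grad=True)
b2 = torch.tensor([6., 4.], requires_grad=True)
Q2 = 3*a2**3 - b2**2
Q2.sum().backward()   # scalar output, so no explicit ``gradient`` argument is needed
print(a2.grad)        # same values as 9*a2**2
print(b2.grad)        # same values as -2*b2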


######################################################################
# Optional Reading - Vector Calculus using ``autograd``
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Mathematically, if you have a vector-valued function
# :math:`\vec{y}=f(\vec{x})`, then the gradient of :math:`\vec{y}` with
# respect to :math:`\vec{x}` is a Jacobian matrix :math:`J`:
#
# .. math::
#
#    J
#    =
#    \left(\begin{array}{cc}
#    \frac{\partial \bf{y}}{\partial x_{1}} &
#    ... &
#    \frac{\partial \bf{y}}{\partial x_{n}}
#    \end{array}\right)
#    =
#    \left(\begin{array}{ccc}
#    \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
#    \vdots & \ddots & \vdots\\
#    \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
#    \end{array}\right)
#
# Generally speaking, ``torch.autograd`` is an engine for computing
# vector-Jacobian products. That is, given any vector :math:`\vec{v}`, it computes the product
# :math:`J^{T}\cdot \vec{v}`.
#
# If :math:`\vec{v}` happens to be the gradient of a scalar function :math:`l=g\left(\vec{y}\right)`:
#
# .. math::
#
#    \vec{v}
#    =
#    \left(\begin{array}{ccc}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)^{T}
#
# then by the chain rule, the vector-Jacobian product would be the
# gradient of :math:`l` with respect to :math:`\vec{x}`:
#
# .. math::
#
#    J^{T}\cdot \vec{v}=\left(\begin{array}{ccc}
#    \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\
#    \vdots & \ddots & \vdots\\
#    \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
#    \end{array}\right)\left(\begin{array}{c}
#    \frac{\partial l}{\partial y_{1}}\\
#    \vdots\\
#    \frac{\partial l}{\partial y_{m}}
#    \end{array}\right)=\left(\begin{array}{c}
#    \frac{\partial l}{\partial x_{1}}\\
#    \vdots\\
#    \frac{\partial l}{\partial x_{n}}
#    \end{array}\right)
#
# This characteristic of the vector-Jacobian product is what we use in the example above;
# ``external_grad`` represents :math:`\vec{v}`.
#
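
######################################################################
# As a small sketch of this idea (the names ``inp``, ``out`` and ``v`` are
# illustrative), we can compute a vector-Jacobian product directly with
# ``torch.autograd.grad``, where ``v`` plays the role of :math:`\vec{v}`:

inp = torch.tensor([1., 2., 3.], requires_grad=True)
out = inp ** 2                    # vector-valued output; Jacobian is diag(2 * inp)
v = torch.tensor([1., 1., 1.])    # the vector in the vector-Jacobian product
(vjp,) = torch.autograd.grad(out, inp, grad_outputs=v)
print(vjp)                        # J^T v = 2 * inp when v is all ones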



######################################################################
# Computational Graph
# ~~~~~~~~~~~~~~~~~~~
#
# Conceptually, autograd keeps a record of data (tensors) & all executed
# operations (along with the resulting new tensors) in a directed acyclic
# graph (DAG) consisting of
# `Function <https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function>`__
# objects. In this DAG, leaves are the input tensors and roots are the output
# tensors. By tracing this graph from roots to leaves, you can
# automatically compute the gradients using the chain rule.
#
# In a forward pass, autograd does two things simultaneously:
#
# - run the requested operation to compute a resulting tensor, and
# - maintain the operation’s *gradient function* in the DAG.
#
# The backward pass kicks off when ``.backward()`` is called on the DAG
# root. ``autograd`` then:
#
# - computes the gradients from each ``.grad_fn``,
# - accumulates them in the respective tensor’s ``.grad`` attribute, and
# - using the chain rule, propagates all the way to the leaf tensors.
#
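
######################################################################
# For a quick look at these ``Function`` objects, we can inspect the
# ``.grad_fn`` attribute of the tensors from the earlier example; the exact
# printed class names may vary between PyTorch versions:

print(Q.grad_fn)   # the backward function of the last op that produced Q (a subtraction)
print(a.grad_fn)   # None, because ``a`` is a leaf tensor created by the user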

######################################################################
# Below is a visual representation of the DAG in our example. In the graph,
# the arrows point in the direction of the forward pass. The nodes represent the backward functions
# of each operation in the forward pass. The leaf nodes in blue represent our leaf tensors ``a`` and ``b``.
#
# .. figure:: /_static/img/dag_autograd.png
#
# .. note::
#    **DAGs are dynamic in PyTorch**
#    An important thing to note is that the graph is recreated from scratch; after each
#    ``.backward()`` call, autograd starts populating a new graph. This is
#    exactly what allows you to use control flow statements in your model;
#    you can change the shape, size and operations at every iteration if
#    needed.
#
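
######################################################################
# As a small sketch of this dynamic behaviour (the function and names below
# are illustrative), Python control flow can decide at run time how many
# operations are recorded, and gradients still flow through whatever graph
# was actually built:

def double_until_large(t):
    # data-dependent loop: the number of recorded operations varies per input
    while t.norm() < 100:
        t = t * 2
    return t.sum()

start = torch.tensor([1., 2.], requires_grad=True)
total = double_until_large(start)
total.backward()
print(start.grad)   # reflects however many doublings actually ran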

######################################################################
# Exclusion from the DAG
# ^^^^^^^^^^^^^^^^^^^^^^
#
# ``torch.autograd`` tracks operations on all tensors which have their
# ``requires_grad`` flag set to ``True``. For tensors that don’t require
# gradients, setting this attribute to ``False`` excludes them from the
# gradient computation DAG.
#
# The output tensor of an operation will require gradients even if only a
# single input tensor has ``requires_grad=True``.
#

x = torch.rand(5, 5)
y = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True)

a = x + y
print(f"Does `a` require gradients?: {a.requires_grad}")
b = x + z
print(f"Does `b` require gradients?: {b.requires_grad}")


######################################################################
# In a NN, parameters that don't compute gradients are usually called **frozen parameters**.
# It is useful to "freeze" part of your model if you know in advance that you won't need the gradients of those parameters
# (this offers some performance benefits by reducing autograd computations).
#
# In finetuning, we freeze most of the model and typically only modify the classifier layers to make predictions on new labels.
# Let's walk through a small example to demonstrate this. As before, we load a pretrained resnet18 model and freeze all the parameters.

from torch import nn, optim

model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Freeze all the parameters in the network
for param in model.parameters():
    param.requires_grad = False

######################################################################
# Let's say we want to finetune the model on a new dataset with 10 labels.
# In resnet, the classifier is the last linear layer ``model.fc``.
# We can simply replace it with a new linear layer (unfrozen by default)
# that acts as our classifier.

model.fc = nn.Linear(512, 10)

######################################################################
# Now all parameters in the model, except the parameters of ``model.fc``, are frozen.
# The only parameters that compute gradients are the weights and bias of ``model.fc``.

# Optimize only the classifier
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

##########################################################################
# Notice that although we register all the parameters in the optimizer,
# the only parameters that compute gradients (and hence are updated in gradient descent)
# are the weights and bias of the classifier.
#
# The same exclusionary functionality is available as a context manager in
# `torch.no_grad() <https://pytorch.org/docs/stable/generated/torch.no_grad.html>`__.
#
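
######################################################################
# A brief sketch of the context-manager form, reusing the ``data`` tensor from
# above (``pred`` is an illustrative name):

with torch.no_grad():
    pred = model(data)        # no graph is recorded inside this block
print(pred.requires_grad)     # False: this result is detached from autograd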

######################################################################
# --------------
#

######################################################################
# Further readings:
# ~~~~~~~~~~~~~~~~~~~
#
# - `In-place operations & Multithreaded Autograd <https://pytorch.org/docs/stable/notes/autograd.html>`__
# - `Example implementation of reverse-mode autodiff <https://colab.research.google.com/drive/1VpeE6UvEPRz9HmsHh1KS0XxXjYu533EC>`__
# - `Video: PyTorch Autograd Explained - In-depth Tutorial <https://www.youtube.com/watch?v=MswxJw-8PvE>`__