GitHub Repository: pytorch/tutorials
Path: blob/main/beginner_source/basics/autogradqs_tutorial.py
"""
`Learn the Basics <intro.html>`_ ||
`Quickstart <quickstart_tutorial.html>`_ ||
`Tensors <tensorqs_tutorial.html>`_ ||
`Datasets & DataLoaders <data_tutorial.html>`_ ||
`Transforms <transforms_tutorial.html>`_ ||
`Build Model <buildmodel_tutorial.html>`_ ||
**Autograd** ||
`Optimization <optimization_tutorial.html>`_ ||
`Save & Load Model <saveloadrun_tutorial.html>`_

Automatic Differentiation with ``torch.autograd``
=================================================

When training neural networks, the most frequently used algorithm is
**backpropagation**. In this algorithm, parameters (model weights) are
adjusted according to the **gradient** of the loss function with respect
to each parameter.

To compute those gradients, PyTorch has a built-in differentiation engine
called ``torch.autograd``. It supports automatic computation of gradients
for any computational graph.

Consider the simplest one-layer neural network, with input ``x``,
parameters ``w`` and ``b``, and some loss function. It can be defined in
PyTorch in the following manner:
"""

import torch

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)  # weights, tracked by autograd
b = torch.randn(3, requires_grad=True)  # bias, tracked by autograd
z = torch.matmul(x, w) + b  # logits
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)


######################################################################
# Tensors, Functions and Computational Graph
# ------------------------------------------
#
# This code defines the following **computational graph**:
#
# .. figure:: /_static/img/basics/comp-graph.png
#    :alt:
#
# In this network, ``w`` and ``b`` are **parameters**, which we need to
# optimize. Thus, we need to be able to compute the gradients of the loss
# function with respect to those variables. In order to do that, we set
# the ``requires_grad`` property of those tensors.

#######################################################################
# .. note:: You can set the value of ``requires_grad`` when creating a
#           tensor, or later by using the ``x.requires_grad_(True)`` method.
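
#######################################################################
# As a minimal sketch of the two options in the note above, the snippet
# below enables tracking after creation. The names ``t`` and ``s`` are
# illustrative only and are not used elsewhere in this tutorial.
#

```python
import torch

# Created without gradient tracking (the default).
t = torch.ones(3)
print(t.requires_grad)  # False

# The trailing underscore marks an in-place method: the same tensor
# object tracks gradients from here on.
t.requires_grad_(True)
print(t.requires_grad)  # True

# Results of operations on a tracked tensor are tracked as well.
s = (t * 2).sum()
print(s.requires_grad)  # True
```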

#######################################################################
# A function that we apply to tensors to construct a computational graph is
# in fact an object of class ``Function``. This object knows how to
# compute the function in the *forward* direction, and also how to compute
# its derivative during the *backward propagation* step. A reference to
# the backward propagation function is stored in the ``grad_fn`` property of a
# tensor. You can find more information about ``Function`` `in the
# documentation <https://pytorch.org/docs/stable/autograd.html#function>`__.
#

print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")
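
######################################################################
# To illustrate what "knows how to compute the function forward and its
# derivative backward" can look like, here is a hedged sketch of a
# user-defined ``torch.autograd.Function``. The class name ``Square`` and
# the variable names are hypothetical; the built-in operations used above
# already ship with their own backward rules, so this is for illustration
# only.
#

```python
import torch


class Square(torch.autograd.Function):
    """Illustrative custom Function computing values ** 2 elementwise."""

    @staticmethod
    def forward(ctx, values):
        # Save the input, because the backward rule needs it.
        ctx.save_for_backward(values)
        return values * values

    @staticmethod
    def backward(ctx, grad_output):
        (values,) = ctx.saved_tensors
        # d(values^2)/d(values) = 2 * values, chained with the incoming gradient.
        return 2 * values * grad_output


t = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
squared = Square.apply(t)
print(squared.grad_fn)  # the backward node recorded for Square

squared.sum().backward()
print(t.grad)  # tensor([2., 4., 6.])
```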

######################################################################
# Computing Gradients
# -------------------
#
# To optimize the parameters of the neural network, we need to
# compute the derivatives of our loss function with respect to the parameters,
# namely, we need :math:`\frac{\partial loss}{\partial w}` and
# :math:`\frac{\partial loss}{\partial b}` under some fixed values of
# ``x`` and ``y``. To compute those derivatives, we call
# ``loss.backward()``, and then retrieve the values from ``w.grad`` and
# ``b.grad``:
#

loss.backward()
print(w.grad)
print(b.grad)
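
######################################################################
# As a sanity check, the gradients can be compared against a hand-derived
# formula. The sketch below assumes the default ``reduction='mean'`` of
# ``binary_cross_entropy_with_logits``, for which the derivative with
# respect to the logits is ``(sigmoid(z) - y) / N``. It rebuilds the same
# network under separate, illustrative names (``inputs``, ``targets``,
# ``weights``, ``bias``) so that the tutorial's own tensors are untouched.
#

```python
import torch

torch.manual_seed(0)
inputs = torch.ones(5)
targets = torch.zeros(3)
weights = torch.randn(5, 3, requires_grad=True)
bias = torch.randn(3, requires_grad=True)

logits = torch.matmul(inputs, weights) + bias
check_loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
check_loss.backward()

with torch.no_grad():
    # d(loss)/d(logits) for mean-reduced BCE with logits.
    dloss_dlogits = (torch.sigmoid(logits) - targets) / logits.numel()
    # The bias is added elementwise, so it receives the same gradient.
    expected_bias_grad = dloss_dlogits
    # logits[j] = sum_i inputs[i] * weights[i, j], hence an outer product.
    expected_weights_grad = torch.outer(inputs, dloss_dlogits)

print(torch.allclose(bias.grad, expected_bias_grad, atol=1e-6))  # True
print(torch.allclose(weights.grad, expected_weights_grad, atol=1e-6))  # True
```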


######################################################################
# .. note::
#   - We can only obtain the ``grad`` properties for the leaf
#     nodes of the computational graph, which have the ``requires_grad`` property
#     set to ``True``. For all other nodes in our graph, gradients will not be
#     available by default.
#   - We can only perform gradient calculations using
#     ``backward`` once on a given graph, for performance reasons. If we need
#     to do several ``backward`` calls on the same graph, we need to pass
#     ``retain_graph=True`` to the ``backward`` call.
#
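
######################################################################
# Both points of the note can be observed directly. The following is a
# small sketch with illustrative names (``weights``, ``hidden``,
# ``total``). It also uses ``retain_grad()``, which is not covered in this
# tutorial, to request the gradient of a non-leaf tensor.
#

```python
import torch

weights = torch.randn(5, 3, requires_grad=True)  # leaf tensor
hidden = torch.matmul(torch.ones(5), weights)  # non-leaf (intermediate) tensor
hidden.retain_grad()  # ask autograd to keep the gradient of this non-leaf
total = hidden.pow(2).sum()

print(weights.is_leaf, hidden.is_leaf)  # True False

# The first call keeps the graph's saved tensors alive for another pass.
total.backward(retain_graph=True)
first_grad = weights.grad.clone()

# The second call succeeds because the graph was retained; gradients accumulate.
total.backward()
print(torch.allclose(weights.grad, 2 * first_grad))  # True

# The second call freed the graph, so a third call raises a RuntimeError.
try:
    total.backward()
    third_call_failed = False
except RuntimeError:
    third_call_failed = True
print(third_call_failed)  # True
print(hidden.grad is not None)  # True, thanks to retain_grad()
```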


######################################################################
# Disabling Gradient Tracking
# ---------------------------
#
# By default, all tensors with ``requires_grad=True`` track their
# computational history and support gradient computation. However, there
# are some cases when we do not need to do that, for example, when we have
# trained the model and just want to apply it to some input data, i.e. we
# only want to do *forward* computations through the network. We can stop
# tracking computations by surrounding our computation code with a
# ``torch.no_grad()`` block:
#

z = torch.matmul(x, w) + b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w) + b
print(z.requires_grad)


######################################################################
# Another way to achieve the same result is to use the ``detach()`` method
# on the tensor:
#

z = torch.matmul(x, w) + b
z_det = z.detach()
print(z_det.requires_grad)

######################################################################
# There are reasons you might want to disable gradient tracking:
#
# - To mark some parameters in your neural network as **frozen parameters**.
# - To **speed up computations** when you are only doing a forward pass, because
#   computations on tensors that do not track gradients are more efficient.
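
######################################################################
# A minimal sketch of the frozen-parameters case, assuming a hypothetical
# two-layer ``torch.nn.Sequential`` model that is not part of this
# tutorial: the first layer is frozen with ``requires_grad_(False)``, so
# only the second layer receives gradients.
#

```python
import torch
from torch import nn

# Hypothetical model used only for this illustration.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer: autograd stops tracking its parameters.
for param in model[0].parameters():
    param.requires_grad_(False)

output = model(torch.randn(3, 4))
output.sum().backward()

print(model[0].weight.grad)  # None: the frozen layer gets no gradient
print(model[2].weight.grad is not None)  # True: the last layer is still trained
```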


######################################################################
# More on Computational Graphs
# ----------------------------
# Conceptually, autograd keeps a record of data (tensors) and all executed
# operations (along with the resulting new tensors) in a directed acyclic
# graph (DAG) consisting of
# `Function <https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function>`__
# objects. In this DAG, leaves are the input tensors and roots are the output
# tensors. By tracing this graph from roots to leaves, you can
# automatically compute the gradients using the chain rule.
#
# In a forward pass, autograd does two things simultaneously:
#
# - runs the requested operation to compute a resulting tensor,
# - maintains the operation’s *gradient function* in the DAG.
#
# The backward pass kicks off when ``.backward()`` is called on the DAG
# root. ``autograd`` then:
#
# - computes the gradients from each ``.grad_fn``,
# - accumulates them in the respective tensor’s ``.grad`` attribute,
# - using the chain rule, propagates all the way to the leaf tensors.
#
# .. note::
#   **DAGs are dynamic in PyTorch**
#   An important thing to note is that the graph is recreated from scratch; after each
#   ``.backward()`` call, autograd starts populating a new graph. This is
#   exactly what allows you to use control flow statements in your model;
#   you can change the shape, size and operations at every iteration if
#   needed.
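
######################################################################
# The effect of a dynamic graph can be sketched with data-dependent
# control flow. The function ``branchy`` and the tensors ``pos`` and
# ``neg`` are hypothetical names chosen for this illustration. Each call
# records only the operations that actually ran, so the two inputs are
# differentiated through different graphs.
#

```python
import torch


def branchy(values):
    # Ordinary Python control flow decides which operations get recorded.
    if values.sum() > 0:
        return (values * 2).sum()
    return (values ** 3).sum()


pos = torch.tensor([1.0, 2.0], requires_grad=True)
branchy(pos).backward()
print(pos.grad)  # tensor([2., 2.]): gradient of sum(2 * values)

neg = torch.tensor([-1.0, -2.0], requires_grad=True)
branchy(neg).backward()
print(neg.grad)  # tensor([ 3., 12.]): gradient of sum(values ** 3)
```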

######################################################################
# Optional Reading: Tensor Gradients and Jacobian Products
# --------------------------------------------------------
#
# In many cases, we have a scalar loss function, and we need to compute
# the gradient with respect to some parameters. However, there are cases
# when the output function is an arbitrary tensor. In this case, PyTorch
# allows you to compute the so-called **Jacobian product**, and not the actual
# gradient.
#
# For a vector function :math:`\vec{y}=f(\vec{x})`, where
# :math:`\vec{x}=\langle x_1,\dots,x_n\rangle` and
# :math:`\vec{y}=\langle y_1,\dots,y_m\rangle`, the gradient of
# :math:`\vec{y}` with respect to :math:`\vec{x}` is given by the **Jacobian
# matrix**:
#
# .. math::
#
#
#    J=\left(\begin{array}{ccc}
#       \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
#       \vdots & \ddots & \vdots\\
#       \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
#       \end{array}\right)
#
# Instead of computing the Jacobian matrix itself, PyTorch allows you to
# compute the **Jacobian product** :math:`v^T\cdot J` for a given input vector
# :math:`v=(v_1 \dots v_m)`. This is achieved by calling ``backward`` with
# :math:`v` as an argument. The size of :math:`v` should be the same as
# the size of the output tensor on which ``backward`` is called:
#

inp = torch.eye(4, 5, requires_grad=True)
out = (inp + 1).pow(2).t()
# Each call adds its result to ``inp.grad``; ``retain_graph=True`` keeps
# the graph alive so that ``backward`` can be called again.
out.backward(torch.ones_like(out), retain_graph=True)
print(f"First call\n{inp.grad}")
out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nSecond call\n{inp.grad}")
inp.grad.zero_()  # reset the accumulated gradient in place
out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nCall after zeroing gradients\n{inp.grad}")


######################################################################
# Notice that when we call ``backward`` for the second time with the same
# argument, the value of the gradient is different. This happens because
# when doing ``backward`` propagation, PyTorch **accumulates the
# gradients**, i.e. the value of computed gradients is added to the
# ``grad`` property of all leaf nodes of the computational graph. If you want
# to compute the proper gradients, you need to zero out the ``grad``
# property beforehand. In real-life training an *optimizer* helps us to do
# this.
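
######################################################################
# A hedged sketch of how an optimizer is commonly used for this: calling
# ``optimizer.zero_grad()`` at the start of every step clears the
# gradients left over from the previous one. The objective
# ``sum(params * features)`` and all names below are illustrative; the
# gradient of this objective with respect to ``params`` is always
# ``features``, which makes accumulation easy to spot.
#

```python
import torch

params = torch.ones(3, requires_grad=True)
features = torch.tensor([1.0, 2.0, 3.0])
optimizer = torch.optim.SGD([params], lr=0.1)

recorded = []
for step in range(3):
    optimizer.zero_grad()  # clear gradients from the previous step
    objective = (params * features).sum()
    objective.backward()
    recorded.append(params.grad.clone())
    optimizer.step()  # update params using the fresh gradient

for grad in recorded:
    print(grad)  # tensor([1., 2., 3.]) on every step
```

######################################################################
# Without the ``zero_grad()`` call, the recorded gradients would grow to
# two and then three times ``features``.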

######################################################################
# .. note:: Previously we were calling the ``backward()`` function without
#           parameters. This is essentially equivalent to calling
#           ``backward(torch.tensor(1.0))``, which is a useful way to compute the
#           gradients in case of a scalar-valued function, such as loss during
#           neural network training.
#
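
######################################################################
# To make the Jacobian product described in this section concrete, the
# sketch below compares ``backward(v)`` with an explicitly computed
# Jacobian. It uses ``torch.autograd.functional.jacobian``, which this
# tutorial does not otherwise cover, and a small hypothetical function
# ``func`` from 3 inputs to 2 outputs.
#

```python
import torch


def func(t):
    # Maps 3 inputs to 2 outputs, so the Jacobian has shape (2, 3).
    return torch.stack([t[0] * t[1], t[0] + t[2]])


point = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
vec = torch.tensor([1.0, 10.0])  # same shape as the output of func

# backward(vec) stores the product vec^T . J in point.grad.
func(point).backward(vec)
print(point.grad)  # tensor([12.,  1., 10.])

# Full Jacobian at the same point: rows [2, 1, 0] and [1, 0, 1].
jac = torch.autograd.functional.jacobian(func, point.detach())
print(jac)
print(vec @ jac)  # tensor([12.,  1., 10.])
```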

######################################################################
# --------------
#

#################################################################
# Further Reading
# ~~~~~~~~~~~~~~~~~
# - `Autograd Mechanics <https://pytorch.org/docs/stable/notes/autograd.html>`_