GitHub Repository: pytorch/tutorials
Path: blob/main/intermediate_source/dqn_with_rnn_tutorial.py
# -*- coding: utf-8 -*-

"""
Recurrent DQN: Training recurrent policies
==========================================

**Author**: `Vincent Moens <https://github.com/vmoens>`_

.. grid:: 2

    .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
       :class-card: card-prerequisites

       * How to incorporate an RNN in an actor in TorchRL
       * How to use that memory-based policy with a replay buffer and a loss module

    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
       :class-card: card-prerequisites

       * PyTorch v2.0.0
       * gym[mujoco]
       * tqdm
"""

#########################################################################
# Overview
# --------
#
# Memory-based policies are crucial not only when the observations are partially
# observable but also when the time dimension must be taken into account to
# make informed decisions.
#
# Recurrent neural networks have long been a popular tool for memory-based
# policies. The idea is to keep a recurrent state in memory between two
# consecutive steps, and to use it as an input to the policy along with the
# current observation.
#
# This tutorial shows how to incorporate an RNN in a policy using TorchRL.
#
# Key learnings:
#
# - Incorporating an RNN in an actor in TorchRL;
# - Using that memory-based policy with a replay buffer and a loss module.
#
# The core idea of using RNNs in TorchRL is to use TensorDict as a data carrier
# for the hidden states from one step to the next. We'll build a policy that
# reads the previous recurrent state from the current TensorDict and writes the
# current recurrent states into the TensorDict of the next state:
#
# .. figure:: /_static/img/rollout_recurrent.png
#    :alt: Data collection with a recurrent policy
#
# As this figure shows, our environment populates the TensorDict with zeroed recurrent
# states, which are read by the policy together with the observation to produce an
# action, and with the recurrent states that will be used for the next step.
# When the :func:`~torchrl.envs.utils.step_mdp` function is called, the recurrent states
# from the next state are brought to the current TensorDict. Let's see how this
# is implemented in practice.

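######################################################################
# Before building anything, here is a minimal sketch of this bookkeeping with
# plain TensorDict operations. The key name ``"recurrent_state"`` is purely
# illustrative (the real key names are set by the LSTM module we build below),
# and the snippet is not executed as part of the tutorial:
#
# .. code-block:: python
#
#    import torch
#    from tensordict import TensorDict
#    from torchrl.envs.utils import step_mdp
#
#    td = TensorDict(
#        {
#            "observation": torch.zeros(3),
#            "recurrent_state": torch.zeros(128),  # zeroed state read at step t
#            "next": {
#                "observation": torch.ones(3),
#                "recurrent_state": torch.ones(128),  # state written for step t+1
#                "reward": torch.zeros(1),
#                "done": torch.zeros(1, dtype=torch.bool),
#            },
#        },
#        [],
#    )
#    td = step_mdp(td)  # ("next", "recurrent_state") becomes the root "recurrent_state"
#    assert (td["recurrent_state"] == 1).all()
#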
######################################################################
# If you are running this in Google Colab, make sure you install the following dependencies:
#
# .. code-block:: bash
#
#    !pip3 install torchrl
#    !pip3 install gym[mujoco]
#    !pip3 install tqdm
#
# Setup
# -----
#

# sphinx_gallery_start_ignore
import warnings

warnings.filterwarnings("ignore")
from torch import multiprocessing

# TorchRL prefers the spawn start method, which restricts the creation of
# ``~torchrl.envs.ParallelEnv`` to the ``if __name__ == "__main__"`` block.
# For ease of reading, we switch to fork here, which is also the default start
# method in Google Colab.
try:
    multiprocessing.set_start_method("fork")
except RuntimeError:
    pass

# sphinx_gallery_end_ignore

import torch
import tqdm
from tensordict.nn import TensorDictModule as Mod, TensorDictSequential as Seq
from torch import nn
from torchrl.collectors import SyncDataCollector
from torchrl.data import LazyMemmapStorage, TensorDictReplayBuffer
from torchrl.envs import (
    Compose,
    ExplorationType,
    GrayScale,
    InitTracker,
    ObservationNorm,
    Resize,
    RewardScaling,
    set_exploration_type,
    StepCounter,
    ToTensorImage,
    TransformedEnv,
)
from torchrl.envs.libs.gym import GymEnv
from torchrl.modules import ConvNet, EGreedyModule, LSTMModule, MLP, QValueModule
from torchrl.objectives import DQNLoss, SoftUpdate

is_fork = multiprocessing.get_start_method() == "fork"
device = (
    torch.device(0)
    if torch.cuda.is_available() and not is_fork
    else torch.device("cpu")
)

######################################################################
# Environment
# -----------
#
# As usual, the first step is to build our environment: it helps us
# define the problem and build the policy network accordingly. For this tutorial,
# we'll be running a single pixel-based instance of the CartPole gym
# environment with some custom transforms: turning to grayscale, resizing to
# 84x84, scaling down the rewards and normalizing the observations.
#
# .. note::
#   The :class:`~torchrl.envs.transforms.StepCounter` transform is optional. Since the CartPole
#   task goal is to make trajectories as long as possible, counting the steps
#   can help us track the performance of our policy.
#
# Two transforms are important for the purpose of this tutorial:
#
# - :class:`~torchrl.envs.transforms.InitTracker` will stamp the
#   calls to :meth:`~torchrl.envs.EnvBase.reset` by adding a ``"is_init"``
#   boolean mask in the TensorDict that will track which steps require a reset
#   of the RNN hidden states.
# - The :class:`~torchrl.envs.transforms.TensorDictPrimer` transform is a bit more
#   technical. It is not required to use RNN policies. However, it
#   instructs the environment (and subsequently the collector) that some extra
#   keys are to be expected. Once added, a call to ``env.reset()`` will populate
#   the entries indicated in the primer with zeroed tensors. Knowing that
#   these tensors are expected by the policy, the collector will pass them on
#   during collection. Eventually, we'll be storing our hidden states in the
#   replay buffer, which will help us bootstrap the computation of the
#   RNN operations in the loss module (which would otherwise be initiated
#   with 0s). In summary: not including this transform will not hugely impact
#   the training of our policy, but it will make the recurrent keys disappear
#   from the collected data and the replay buffer, which will in turn lead to
#   slightly suboptimal training.
#   Fortunately, the :class:`~torchrl.modules.LSTMModule` we propose is
#   equipped with a helper method to build just that transform for us, so
#   we can wait until we build it!
#

env = TransformedEnv(
    GymEnv("CartPole-v1", from_pixels=True, device=device),
    Compose(
        ToTensorImage(),
        GrayScale(),
        Resize(84, 84),
        StepCounter(),
        InitTracker(),
        RewardScaling(loc=0.0, scale=0.1),
        ObservationNorm(standard_normal=True, in_keys=["pixels"]),
    ),
)

######################################################################
# As always, we need to manually initialize our normalization constants:
#
env.transform[-1].init_stats(1000, reduce_dim=[0, 1, 2], cat_dim=0, keep_dims=[0])
td = env.reset()

######################################################################
# Policy
# ------
#
# Our policy will have 3 components: a :class:`~torchrl.modules.ConvNet`
# backbone, an :class:`~torchrl.modules.LSTMModule` memory layer and a shallow
# :class:`~torchrl.modules.MLP` block that will map the LSTM output onto the
# action values.
#
# Convolutional network
# ~~~~~~~~~~~~~~~~~~~~~
#
# We build a convolutional network followed by a :class:`torch.nn.AdaptiveAvgPool2d`
# that will squash the output into a vector of size 64. The :class:`~torchrl.modules.ConvNet`
# can assist us with this:
#

feature = Mod(
    ConvNet(
        num_cells=[32, 32, 64],
        squeeze_output=True,
        aggregator_class=nn.AdaptiveAvgPool2d,
        aggregator_kwargs={"output_size": (1, 1)},
        device=device,
    ),
    in_keys=["pixels"],
    out_keys=["embed"],
)
######################################################################
# We execute the first module on a batch of data to gather the size of the
# output vector:
#
n_cells = feature(env.reset())["embed"].shape[-1]

######################################################################
# LSTM Module
# ~~~~~~~~~~~
#
# TorchRL provides a specialized :class:`~torchrl.modules.LSTMModule` class
# to incorporate LSTMs in your code base. It is a :class:`~tensordict.nn.TensorDictModuleBase`
# subclass: as such, it has a set of ``in_keys`` and ``out_keys`` that indicate
# what values should be expected to be read and written/updated during the
# execution of the module. The class comes with customizable predefined
# values for these attributes to facilitate its construction.
#
# .. note::
#   *Usage limitations*: The class supports almost all LSTM features such as
#   dropout or multi-layered LSTMs.
#   However, to respect TorchRL's conventions, this LSTM must have the ``batch_first``
#   attribute set to ``True``, which is **not** the default in PyTorch. Our
#   :class:`~torchrl.modules.LSTMModule` changes this default
#   behavior for us, so a plain construction works out of the box.
#
#   Also, the LSTM cannot have a ``bidirectional`` attribute set to ``True`` as
#   this wouldn't be usable in online settings. In this case, the default value
#   is the correct one.
#

lstm = LSTMModule(
    input_size=n_cells,
    hidden_size=128,
    device=device,
    in_key="embed",
    out_key="embed",
)

######################################################################
# Let us look at the LSTM module class, specifically its ``in_keys`` and ``out_keys``:
print("in_keys", lstm.in_keys)
print("out_keys", lstm.out_keys)

######################################################################
# We can see that these values contain the key we indicated as the in_key (and out_key)
# as well as the recurrent key names. The recurrent out_keys are preceded by a ``"next"``
# prefix that indicates that they will need to be written to the ``"next"`` TensorDict.
# We use this convention (which can be overridden by passing the in_keys/out_keys
# arguments) to make sure that a call to :func:`~torchrl.envs.utils.step_mdp` will
# move the recurrent state to the root TensorDict, making it available to the
# RNN during the following call (see figure in the intro).
#
# As mentioned earlier, we have one more optional transform to add to our
# environment to make sure that the recurrent states are passed to the buffer.
# The :meth:`~torchrl.modules.LSTMModule.make_tensordict_primer` method does
# exactly that:
#
env.append_transform(lstm.make_tensordict_primer())

######################################################################
# and that's it! We can print the environment to check that everything looks good now
# that we have added the primer:
print(env)

######################################################################
# MLP
# ~~~
#
# We use a single-layer MLP to represent the action values we'll be using for
# our policy.
#
mlp = MLP(
    out_features=2,
    num_cells=[
        64,
    ],
    device=device,
)
######################################################################
# and fill the bias with zeros:

mlp[-1].bias.data.fill_(0.0)
mlp = Mod(mlp, in_keys=["embed"], out_keys=["action_value"])

######################################################################
# Using the Q-Values to select an action
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# The last part of our policy is the Q-Value Module.
# The Q-Value module :class:`~torchrl.modules.tensordict_module.QValueModule`
# will read the ``"action_value"`` key that is produced by our MLP and,
# from it, gather the action that has the maximum value.
# The only thing we need to do is to specify the action space, which can be done
# either by passing a string or an action-spec. This allows us to use
# Categorical (sometimes called "sparse") encoding or the one-hot version of it.
#
qval = QValueModule(spec=env.action_spec)

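######################################################################
# As a quick illustration of what this module does, here is a minimal sketch
# on a hand-crafted input (``TensorDict`` is imported here only for the
# example; the snippet is not executed as part of the tutorial):
#
# .. code-block:: python
#
#    from tensordict import TensorDict
#
#    td_demo = TensorDict({"action_value": torch.tensor([[1.0, 3.0]])}, [1])
#    qval(td_demo)
#    print(td_demo["action"])               # one-hot encoding of the argmax
#    print(td_demo["chosen_action_value"])  # the maximum action value
#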
######################################################################
# .. note::
#   TorchRL also provides a wrapper class :class:`torchrl.modules.QValueActor` that
#   wraps a module in a Sequential together with a :class:`~torchrl.modules.tensordict_module.QValueModule`
#   like we are doing explicitly here. There is little advantage in doing so here
#   and the process is less transparent, but the end result will be similar to
#   what we do here (a sketch of that construction is shown below).
#
# We can now put things together in a :class:`~tensordict.nn.TensorDictSequential`
#
stoch_policy = Seq(feature, lstm, mlp, qval)

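######################################################################
# For completeness, here is a minimal sketch (not executed) of the wrapper-based
# construction mentioned in the note above. It assumes that :class:`~torchrl.modules.QValueActor`
# is given the value network we already built, which writes the ``"action_value"``
# entry it expects:
#
# .. code-block:: python
#
#    from torchrl.modules import QValueActor
#
#    # Appends a QValueModule to the wrapped value network for us.
#    wrapped_policy = QValueActor(
#        Seq(feature, lstm, mlp),
#        spec=env.action_spec,
#    )
#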
######################################################################
# Since DQN is a deterministic algorithm, exploration is a crucial part of it.
# We'll be using an :math:`\epsilon`-greedy policy with an epsilon of 0.2 decaying
# progressively to 0.
# This decay is achieved via a call to :meth:`~torchrl.modules.EGreedyModule.step`
# (see training loop below).
#
exploration_module = EGreedyModule(
    annealing_num_steps=1_000_000, spec=env.action_spec, eps_init=0.2
)
stoch_policy = Seq(
    stoch_policy,
    exploration_module,
)

######################################################################
# Using the model for the loss
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# The model as we've built it is well equipped to be used in sequential settings.
# However, the class :class:`torch.nn.LSTM` can use a cuDNN-optimized backend
# to run the RNN sequence faster on a GPU device. We would not want to miss
# such an opportunity to speed up our training loop!
# To use it, we just need to tell the LSTM module to run in "recurrent mode"
# when used by the loss.
# As we'll usually want to have two copies of the LSTM module, we do this by
# calling a :meth:`~torchrl.modules.LSTMModule.set_recurrent_mode` method that
# will return a new instance of the LSTM (with shared weights) that will
# assume that the input data is sequential in nature.
#
policy = Seq(feature, lstm.set_recurrent_mode(True), mlp, qval)

######################################################################
# Because we still have a couple of uninitialized parameters, we should
# initialize them before creating an optimizer and such.
#
policy(env.reset())

######################################################################
# DQN Loss
# --------
#
# Our DQN loss requires us to pass the policy and, again, the action space.
# While this may seem redundant, it is important as we want to make sure that
# the :class:`~torchrl.objectives.DQNLoss` and the :class:`~torchrl.modules.tensordict_module.QValueModule`
# classes are compatible, but aren't strongly dependent on each other.
#
# To use Double DQN, we pass a ``delay_value`` argument that will
# create a non-differentiable copy of the network parameters to be used
# as a target network.
loss_fn = DQNLoss(policy, action_space=env.action_spec, delay_value=True)

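######################################################################
# .. note::
#   As a reminder (this is standard DQN with a target network, not something
#   TorchRL-specific), the bootstrapped target for a transition
#   :math:`(s, a, r, s', d)` takes the form
#
#   .. math::
#
#      y = r + \gamma \, (1 - d) \, \max_{a'} Q_{\theta^-}(s', a'),
#
#   where :math:`\theta^-` are the delayed (target) parameters created by
#   ``delay_value=True``. The Double-DQN variant additionally uses the online
#   network to select :math:`a'` while the target network evaluates it.
#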
######################################################################
# Since we are using a double DQN, we need to update the target parameters.
# We'll use a :class:`~torchrl.objectives.SoftUpdate` instance to carry out
# this work.
#
updater = SoftUpdate(loss_fn, eps=0.95)

optim = torch.optim.Adam(policy.parameters(), lr=3e-4)

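######################################################################
# .. note::
#   Each call to ``updater.step()`` performs a Polyak (soft) update of the
#   target parameters. With ``eps=0.95``, the update is
#
#   .. math::
#
#      \theta^- \leftarrow 0.95 \, \theta^- + 0.05 \, \theta,
#
#   so the target network slowly tracks the online network.
#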
######################################################################
# Collector and replay buffer
# ---------------------------
#
# We build the simplest data collector there is. We'll try to train our algorithm
# with a million frames, extending the buffer with 50 frames at a time. The buffer
# will be designed to store 20 thousand trajectories of 50 steps each.
# At each optimization step (16 per data collection), we'll collect 4 items
# from our buffer, for a total of 200 transitions.
# We'll use a :class:`~torchrl.data.replay_buffers.LazyMemmapStorage` storage to keep the data
# on disk.
#
# .. note::
#   For the sake of efficiency, we're only running a few thousand iterations
#   here. In a real setting, the total number of frames should be set to 1M.
#
collector = SyncDataCollector(
    env, stoch_policy, frames_per_batch=50, total_frames=200, device=device
)
rb = TensorDictReplayBuffer(
    storage=LazyMemmapStorage(20_000), batch_size=4, prefetch=10
)

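######################################################################
# .. note::
#   Because we will extend the buffer with ``data.unsqueeze(0)`` (see the
#   training loop below), each stored element is a whole 50-step trajectory,
#   so a sampled batch has shape ``[4, 50]``. A minimal sketch (only valid once
#   the buffer has been filled):
#
#   .. code-block:: python
#
#      sample = rb.sample()
#      print(sample.batch_size)  # torch.Size([4, 50])
#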
######################################################################
# Training loop
# -------------
#
# To keep track of the progress, we will run the policy in the environment after
# every data collection (50 frames), and plot the results after training.
#

utd = 16
pbar = tqdm.tqdm(total=1_000_000)
longest = 0

traj_lens = []
for i, data in enumerate(collector):
    if i == 0:
        print(
            "Let us print the first batch of data.\nPay attention to the key names "
            "which will reflect what can be found in this data structure, in particular: "
            "the output of the QValueModule (action_value, action and chosen_action_value), "
            "the 'is_init' key that will tell us if a step is initial or not, and the "
            "recurrent_state keys.\n",
            data,
        )
    pbar.update(data.numel())
    # it is important to pass data that is not flattened: ``unsqueeze(0)`` adds a
    # trajectory dimension, so the buffer stores (and later samples) whole
    # 50-step trajectories rather than individual transitions
    rb.extend(data.unsqueeze(0).to_tensordict().cpu())
    for _ in range(utd):
        s = rb.sample().to(device, non_blocking=True)
        loss_vals = loss_fn(s)
        loss_vals["loss"].backward()
        optim.step()
        optim.zero_grad()
    longest = max(longest, data["step_count"].max().item())
    pbar.set_description(
        f"steps: {longest}, loss_val: {loss_vals['loss'].item(): 4.4f}, action_spread: {data['action'].sum(0)}"
    )
    exploration_module.step(data.numel())
    updater.step()

    with set_exploration_type(ExplorationType.MODE), torch.no_grad():
        rollout = env.rollout(10000, stoch_policy)
        traj_lens.append(rollout.get(("next", "step_count")).max().item())

######################################################################
# Let's plot our results:
#
if traj_lens:
    from matplotlib import pyplot as plt

    plt.plot(traj_lens)
    plt.xlabel("Test collection")
    plt.title("Test trajectory lengths")

######################################################################
# Conclusion
# ----------
#
# We have seen how an RNN can be incorporated in a policy in TorchRL.
# You should now be able to:
#
# - Create an LSTM module that acts as a :class:`~tensordict.nn.TensorDictModule`
# - Indicate to the LSTM module that a reset is needed via an :class:`~torchrl.envs.transforms.InitTracker`
#   transform
# - Incorporate this module in a policy and in a loss module
# - Make sure that the collector is made aware of the recurrent state entries
#   such that they can be stored in the replay buffer along with the rest of
#   the data
#
# Further Reading
# ---------------
#
# - The TorchRL documentation can be found `here <https://pytorch.org/rl/>`_.