# -*- coding: utf-8 -*-
"""
A guide on good usage of ``non_blocking`` and ``pin_memory()`` in PyTorch
=========================================================================

**Author**: `Vincent Moens <https://github.com/vmoens>`_

Introduction
------------

Transferring data from the CPU to the GPU is fundamental in many PyTorch applications.
It's crucial for users to understand the most effective tools and options available for moving data between devices.
This tutorial examines two key methods for device-to-device data transfer in PyTorch:
:meth:`~torch.Tensor.pin_memory` and :meth:`~torch.Tensor.to` with the ``non_blocking=True`` option.

What you will learn
~~~~~~~~~~~~~~~~~~~

Optimizing the transfer of tensors from the CPU to the GPU can be achieved through asynchronous transfers and memory
pinning. However, there are important considerations:

- Using ``tensor.pin_memory().to(device, non_blocking=True)`` can be up to twice as slow as a straightforward ``tensor.to(device)``.
- Generally, ``tensor.to(device, non_blocking=True)`` is an effective choice for enhancing transfer speed.
- While ``cpu_tensor.to("cuda", non_blocking=True).mean()`` executes correctly, attempting
  ``cuda_tensor.to("cpu", non_blocking=True).mean()`` can result in erroneous outputs.

Preamble
~~~~~~~~

The performance figures reported in this tutorial depend on the system used to build it.
Although the conclusions are applicable across different systems, the specific observations may vary slightly
depending on the hardware available, especially on older hardware.
The primary objective of this tutorial is to offer a theoretical framework for understanding CPU to GPU data transfers.
However, any design decisions should be tailored to individual cases and guided by benchmarked throughput measurements,
as well as the specific requirements of the task at hand.

"""

import torch

assert torch.cuda.is_available(), "A cuda device is required to run this tutorial"


######################################################################
#
# This tutorial requires tensordict to be installed. If you don't have tensordict in your environment yet, install it
# by running the following command in a separate cell:
#
# .. code-block:: bash
#
#    # Install tensordict with the following command
#    !pip3 install tensordict
#
# We start by outlining the theory surrounding these concepts, and then move to concrete test examples of the features.
#
#
# Background
# ----------
#
# .. _pinned_memory_background:
#
# Memory management basics
# ~~~~~~~~~~~~~~~~~~~~~~~~
#
# .. _pinned_memory_memory:
#
# When one creates a CPU tensor in PyTorch, the content of this tensor needs to be placed
# in memory. The memory we talk about here is a rather complex concept worth looking at carefully.
# We distinguish two types of memory that are handled by the Memory Management Unit: the RAM (for simplicity)
# and the swap space on disk (which may or may not be the hard drive). Together, the available space on disk and in RAM (physical memory)
# makes up the virtual memory, which is an abstraction of the total resources available.
# In short, virtual memory makes the available space larger than what can be found in RAM in isolation
# and creates the illusion that the main memory is larger than it actually is.
#
# In normal circumstances, a regular CPU tensor is pageable, which means that it is divided into blocks called pages that
# can live anywhere in the virtual memory (both in RAM or on disk). As mentioned earlier, this has the advantage that
# the memory seems larger than what the main memory actually is.
#
# Typically, when a program accesses a page that is not in RAM, a "page fault" occurs and the operating system (OS) then brings
# back this page into RAM ("swap in" or "page in").
# In turn, the OS may have to swap out (or "page out") another page to make room for the new page.
#
# In contrast to pageable memory, pinned (page-locked, or non-pageable) memory is a type of memory that cannot
# be swapped out to disk.
# It allows for faster and more predictable access times, but has the downside that it is more limited than
# pageable memory (aka the main memory).
#
# .. figure:: /_static/img/pinmem/pinmem.png
#    :alt:
#
# CUDA and (non-)pageable memory
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# .. _pinned_memory_cuda_pageable_memory:
#
# To understand how CUDA copies a tensor from CPU to CUDA, let's consider the following two scenarios:
#
# - If the memory is page-locked, the device can access the memory directly in the main memory. The memory addresses are well
#   defined and functions that need to read this data can be significantly accelerated.
# - If the memory is pageable, all the pages will have to be brought to the main memory before being sent to the GPU.
#   This operation may take time and is less predictable than when executed on page-locked tensors.
#
# More precisely, when CUDA sends pageable data from CPU to GPU, it must first create a page-locked copy of that data
# before making the transfer.
#
# Asynchronous vs. Synchronous Operations with ``non_blocking=True`` (CUDA ``cudaMemcpyAsync``)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# .. _pinned_memory_async_sync:
#
# When executing a copy from a host (e.g., CPU) to a device (e.g., GPU), the CUDA toolkit offers modalities to do these
# operations synchronously or asynchronously with respect to the host.
#
# In practice, when calling :meth:`~torch.Tensor.to`, PyTorch always makes a call to
# `cudaMemcpyAsync <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79>`_.
# If ``non_blocking=False`` (default), a ``cudaStreamSynchronize`` will be called after each and every ``cudaMemcpyAsync``, making
# the call to :meth:`~torch.Tensor.to` blocking in the main thread.
# If ``non_blocking=True``, no synchronization is triggered, and the main thread on the host is not blocked.
# Therefore, from the host perspective, multiple tensors can be sent to the device simultaneously,
# as the thread does not need to wait for one transfer to be completed to initiate the other.
#
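# As a minimal sketch of what this means in practice (assuming a CUDA device is available and ``torch`` is imported),
# a non-blocking copy simply defers the synchronization to the user, who can trigger it explicitly when needed:
#
# .. code-block:: python
#
#    x = torch.randn(1024)
#    y_blocking = x.to("cuda:0")                   # the host waits for the copy to complete
#    y_async = x.to("cuda:0", non_blocking=True)   # the host thread continues immediately
#    torch.cuda.synchronize()                      # wait explicitly, e.g., before timing or reusing buffers
#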
# .. note:: In general, the transfer is blocking on the device side (even if it isn't on the host side):
#   the copy on the device cannot occur while another operation is being executed.
#   However, in some advanced scenarios, a copy and a kernel execution can be done simultaneously on the GPU side.
#   As the following example will show, three requirements must be met to enable this:
#
#   1. The device must have at least one free DMA (Direct Memory Access) engine. Modern GPU architectures such as Volta,
#      Ampere, or Hopper (for example, V100, A100, or H100 devices) have more than one DMA engine.
#
#   2. The transfer must be done on a separate, non-default cuda stream. In PyTorch, cuda streams can be handled using
#      :class:`~torch.cuda.Stream`.
#
#   3. The source data must be in pinned memory.
#
# We demonstrate this by running profiles on the following script.
#

import contextlib

from torch.cuda import Stream


s = Stream()

torch.manual_seed(42)
t1_cpu_pinned = torch.randn(1024**2 * 5, pin_memory=True)
t2_cpu_paged = torch.randn(1024**2 * 5, pin_memory=False)
t3_cuda = torch.randn(1024**2 * 5, device="cuda:0")

assert torch.cuda.is_available()
device = torch.device("cuda", torch.cuda.current_device())


# The function we want to profile
def inner(pinned: bool, streamed: bool):
    with torch.cuda.stream(s) if streamed else contextlib.nullcontext():
        if pinned:
            t1_cuda = t1_cpu_pinned.to(device, non_blocking=True)
        else:
            t2_cuda = t2_cpu_paged.to(device, non_blocking=True)
        t_star_cuda_h2d_event = s.record_event()
    # This operation can be executed during the CPU to GPU copy if and only if the tensor is pinned and the copy is
    # done in the other stream
    t3_cuda_mul = t3_cuda * t3_cuda * t3_cuda
    t3_cuda_h2d_event = torch.cuda.current_stream().record_event()
    t_star_cuda_h2d_event.synchronize()
    t3_cuda_h2d_event.synchronize()


# Our profiler: profiles the `inner` function and stores the results in a .json file
def benchmark_with_profiler(
    pinned,
    streamed,
) -> None:
    torch._C._profiler._set_cuda_sync_enabled_val(True)
    wait, warmup, active = 1, 1, 2
    num_steps = wait + warmup + active
    rank = 0
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        schedule=torch.profiler.schedule(
            wait=wait, warmup=warmup, active=active, repeat=1, skip_first=1
        ),
    ) as prof:
        for step_idx in range(1, num_steps + 1):
            inner(streamed=streamed, pinned=pinned)
            if rank is None or rank == 0:
                prof.step()
    prof.export_chrome_trace(f"trace_streamed{int(streamed)}_pinned{int(pinned)}.json")


######################################################################
# Loading these profile traces in chrome (``chrome://tracing``) shows the following results: first, let's see
# what happens if the arithmetic operation on ``t3_cuda`` is executed after the pageable tensor is sent to the GPU
# in the main stream:
#

benchmark_with_profiler(streamed=False, pinned=False)

######################################################################
# .. figure:: /_static/img/pinmem/trace_streamed0_pinned0.png
#    :alt:
#
# Using a pinned tensor doesn't change the trace much; both operations are still executed consecutively:

benchmark_with_profiler(streamed=False, pinned=True)

######################################################################
#
# .. figure:: /_static/img/pinmem/trace_streamed0_pinned1.png
#    :alt:
#
# Sending a pageable tensor to the GPU on a separate stream is also a blocking operation:

benchmark_with_profiler(streamed=True, pinned=False)

######################################################################
#
# .. figure:: /_static/img/pinmem/trace_streamed1_pinned0.png
#    :alt:
#
# Only copies of pinned tensors to the GPU on a separate stream overlap with another CUDA kernel executed on
# the main stream:

benchmark_with_profiler(streamed=True, pinned=True)

######################################################################
#
# .. figure:: /_static/img/pinmem/trace_streamed1_pinned1.png
#    :alt:
#
# A PyTorch perspective
# ---------------------
#
# .. _pinned_memory_pt_perspective:
#
# ``pin_memory()``
# ~~~~~~~~~~~~~~~~
#
# .. _pinned_memory_pinned:
#
# PyTorch offers the possibility to create and send tensors to page-locked memory through the
# :meth:`~torch.Tensor.pin_memory` method, as well as through constructor arguments.
# CPU tensors on a machine where CUDA is initialized can be cast to pinned memory through the :meth:`~torch.Tensor.pin_memory`
# method. Importantly, ``pin_memory`` is blocking on the main thread of the host: it will wait for the tensor to be copied to
# page-locked memory before executing the next operation.
# New tensors can be directly created in pinned memory with functions like :func:`~torch.zeros`, :func:`~torch.ones` and other
# constructors.
#
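# As a minimal illustration (a sketch, assuming CUDA has been initialized on the machine), both routes produce a
# page-locked tensor, which can be checked with :meth:`~torch.Tensor.is_pinned`:
#
# .. code-block:: python
#
#    buffer = torch.zeros(1024, pin_memory=True)     # allocated directly in page-locked memory
#    assert buffer.is_pinned()
#    pinned_copy = torch.randn(1024).pin_memory()    # blocking copy of a pageable tensor into pinned memory
#    assert pinned_copy.is_pinned()
#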
# Let us check the speed of pinning memory and sending tensors to CUDA:


import torch
import gc
from torch.utils.benchmark import Timer
import matplotlib.pyplot as plt


def timer(cmd):
    median = (
        Timer(cmd, globals=globals())
        .adaptive_autorange(min_run_time=1.0, max_run_time=20.0)
        .median
        * 1000
    )
    print(f"{cmd}: {median: 4.4f} ms")
    return median


# A tensor in pageable memory
pageable_tensor = torch.randn(1_000_000)

# A tensor in page-locked (pinned) memory
pinned_tensor = torch.randn(1_000_000, pin_memory=True)

# Runtimes:
pageable_to_device = timer("pageable_tensor.to('cuda:0')")
pinned_to_device = timer("pinned_tensor.to('cuda:0')")
pin_mem = timer("pageable_tensor.pin_memory()")
pin_mem_to_device = timer("pageable_tensor.pin_memory().to('cuda:0')")

# Ratios:
r1 = pinned_to_device / pageable_to_device
r2 = pin_mem_to_device / pageable_to_device

# Create a figure with the results
fig, ax = plt.subplots()

xlabels = [0, 1, 2]
bar_labels = [
    "pageable_tensor.to(device) (1x)",
    f"pinned_tensor.to(device) ({r1:4.2f}x)",
    f"pageable_tensor.pin_memory().to(device) ({r2:4.2f}x)"
    f"\npin_memory()={100*pin_mem/pin_mem_to_device:.2f}% of runtime.",
]
values = [pageable_to_device, pinned_to_device, pin_mem_to_device]
colors = ["tab:blue", "tab:red", "tab:orange"]
ax.bar(xlabels, values, label=bar_labels, color=colors)

ax.set_ylabel("Runtime (ms)")
ax.set_title("Device casting runtime (pin-memory)")
ax.set_xticks([])
ax.legend()

plt.show()

# Clear tensors
del pageable_tensor, pinned_tensor
_ = gc.collect()

######################################################################
#
# We can observe that casting a pinned-memory tensor to GPU is indeed much faster than a pageable tensor, because under
# the hood, a pageable tensor must be copied to pinned memory before being sent to GPU.
#
# However, contrary to a somewhat common belief, calling :meth:`~torch.Tensor.pin_memory()` on a pageable tensor before
# casting it to GPU should not bring any significant speed-up; on the contrary, this call is usually slower than just
# executing the transfer. This makes sense, since we're actually asking Python to execute an operation that CUDA will
# perform anyway before copying the data from host to device.
#
# .. note:: The PyTorch implementation of
#   `pin_memory <https://github.com/pytorch/pytorch/blob/5298acb5c76855bc5a99ae10016efc86b27949bd/aten/src/ATen/native/Memory.cpp#L58>`_
#   which relies on creating a brand new storage in pinned memory through `cudaHostAlloc <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gb65da58f444e7230d3322b6126bb4902>`_
#   could be, in rare cases, faster than transitioning data in chunks as ``cudaMemcpy`` does.
#   Here too, the observation may vary depending on the available hardware, the size of the tensors being sent or
#   the amount of available RAM.
#
# ``non_blocking=True``
# ~~~~~~~~~~~~~~~~~~~~~
#
# .. _pinned_memory_non_blocking:
#
# As mentioned earlier, many PyTorch operations have the option of being executed asynchronously with respect to the host
# through the ``non_blocking`` argument.
#
# Here, to accurately account for the benefits of using ``non_blocking``, we will design a slightly more complex
# experiment, since we want to assess how fast it is to send multiple tensors to GPU with and without calling
# ``non_blocking``.
#


# A simple loop that copies all tensors to cuda
def copy_to_device(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.to("cuda:0"))
    return result


# A loop that copies all tensors to cuda asynchronously
def copy_to_device_nonblocking(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.to("cuda:0", non_blocking=True))
    # We need to synchronize
    torch.cuda.synchronize()
    return result


# Create a list of tensors
tensors = [torch.randn(1000) for _ in range(1000)]
to_device = timer("copy_to_device(*tensors)")
to_device_nonblocking = timer("copy_to_device_nonblocking(*tensors)")

# Ratio
r1 = to_device_nonblocking / to_device

# Plot the results
fig, ax = plt.subplots()

xlabels = [0, 1]
bar_labels = [f"to(device) (1x)", f"to(device, non_blocking=True) ({r1:4.2f}x)"]
colors = ["tab:blue", "tab:red"]
values = [to_device, to_device_nonblocking]

ax.bar(xlabels, values, label=bar_labels, color=colors)

ax.set_ylabel("Runtime (ms)")
ax.set_title("Device casting runtime (non-blocking)")
ax.set_xticks([])
ax.legend()

plt.show()


######################################################################
# To get a better sense of what is happening here, let us profile these two functions:


from torch.profiler import profile, ProfilerActivity


def profile_mem(cmd):
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        exec(cmd)
    print(cmd)
    print(prof.key_averages().table(row_limit=10))


######################################################################
# Let's see the call stack with a regular ``to(device)`` first:
#

print("Call to `to(device)`", profile_mem("copy_to_device(*tensors)"))

######################################################################
# and now the ``non_blocking`` version:
#

print(
    "Call to `to(device, non_blocking=True)`",
    profile_mem("copy_to_device_nonblocking(*tensors)"),
)

######################################################################
# The results are without any doubt better when using ``non_blocking=True``, as all transfers are initiated simultaneously
# on the host side and only one synchronization is done.
#
# The benefit will vary depending on the number and the size of the tensors as well as depending on the hardware being
# used.
#
# .. note:: Interestingly, the blocking ``to("cuda")`` actually performs the same asynchronous device casting operation
#   (``cudaMemcpyAsync``) as the one with ``non_blocking=True``, with a synchronization point after each copy.
#
# Synergies
# ~~~~~~~~~
#
# .. _pinned_memory_synergies:
#
# Now that we have made the point that data transfer of tensors already in pinned memory to GPU is faster than from
# pageable memory, and that we know that doing these transfers asynchronously is also faster than synchronously, we can
# benchmark combinations of these approaches. First, let's write a couple of new functions that will call ``pin_memory``
# and ``to(device)`` on each tensor:
#


def pin_copy_to_device(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.pin_memory().to("cuda:0"))
    return result


def pin_copy_to_device_nonblocking(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.pin_memory().to("cuda:0", non_blocking=True))
    # We need to synchronize
    torch.cuda.synchronize()
    return result


######################################################################
# The benefits of using :meth:`~torch.Tensor.pin_memory` are more pronounced for
# somewhat large batches of large tensors:
#

tensors = [torch.randn(1_000_000) for _ in range(1000)]
page_copy = timer("copy_to_device(*tensors)")
page_copy_nb = timer("copy_to_device_nonblocking(*tensors)")

tensors_pinned = [torch.randn(1_000_000, pin_memory=True) for _ in range(1000)]
pinned_copy = timer("copy_to_device(*tensors_pinned)")
pinned_copy_nb = timer("copy_to_device_nonblocking(*tensors_pinned)")

pin_and_copy = timer("pin_copy_to_device(*tensors)")
pin_and_copy_nb = timer("pin_copy_to_device_nonblocking(*tensors)")

# Plot
strategies = ("pageable copy", "pinned copy", "pin and copy")
blocking = {
    "blocking": [page_copy, pinned_copy, pin_and_copy],
    "non-blocking": [page_copy_nb, pinned_copy_nb, pin_and_copy_nb],
}

x = torch.arange(3)
width = 0.25
multiplier = 0


fig, ax = plt.subplots(layout="constrained")

for attribute, runtimes in blocking.items():
    offset = width * multiplier
    rects = ax.bar(x + offset, runtimes, width, label=attribute)
    ax.bar_label(rects, padding=3, fmt="%.2f")
    multiplier += 1

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel("Runtime (ms)")
ax.set_title("Runtime (pin-mem and non-blocking)")
ax.set_xticks([0, 1, 2])
ax.set_xticklabels(strategies)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
ax.legend(loc="upper left", ncols=3)

plt.show()

del tensors, tensors_pinned
_ = gc.collect()


######################################################################
# Other copy directions (GPU -> CPU, CPU -> MPS)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# .. _pinned_memory_other_direction:
#
# Until now, we have operated under the assumption that asynchronous copies from the CPU to the GPU are safe.
# This is generally true because CUDA automatically handles synchronization to ensure that the data being accessed is
# valid at read time.
# However, this guarantee does not extend to transfers in the opposite direction, from GPU to CPU.
# Without explicit synchronization, these transfers offer no assurance that the copy will be complete at the time of
# data access. Consequently, the data on the host might be incomplete or incorrect, effectively rendering it garbage:
#


tensor = (
    torch.arange(1, 1_000_000, dtype=torch.double, device="cuda")
    .expand(100, 999999)
    .clone()
)
torch.testing.assert_close(
    tensor.mean(), torch.tensor(500_000, dtype=torch.double, device="cuda")
), tensor.mean()
try:
    i = -1
    for i in range(100):
        cpu_tensor = tensor.to("cpu", non_blocking=True)
        torch.testing.assert_close(
            cpu_tensor.mean(), torch.tensor(500_000, dtype=torch.double)
        )
    print("No test failed with non_blocking")
except AssertionError:
    print(f"{i}th test failed with non_blocking. Skipping remaining tests")
try:
    i = -1
    for i in range(100):
        cpu_tensor = tensor.to("cpu", non_blocking=True)
        torch.cuda.synchronize()
        torch.testing.assert_close(
            cpu_tensor.mean(), torch.tensor(500_000, dtype=torch.double)
        )
    print("No test failed with synchronize")
except AssertionError:
    print(f"One test failed with synchronize: {i}th assertion!")


######################################################################
# The same considerations apply to copies from the CPU to non-CUDA devices, such as MPS.
# Generally, asynchronous copies to a device are safe without explicit synchronization only when the target is a
# CUDA-enabled device.
#
# In summary, copying data from CPU to GPU is safe when using ``non_blocking=True``, but for any other direction,
# ``non_blocking=True`` can still be used but the user must make sure that a device synchronization is executed before
# the data is accessed.
#
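# As a minimal sketch of the safe pattern for the device-to-host direction (``gpu_tensor`` is a placeholder name for a
# tensor that already lives on a CUDA device):
#
# .. code-block:: python
#
#    cpu_copy = gpu_tensor.to("cpu", non_blocking=True)
#    torch.cuda.synchronize()   # make sure the copy has landed before reading it on the host
#    print(cpu_copy.mean())
#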
# Practical recommendations
# -------------------------
#
# .. _pinned_memory_recommendations:
#
# We can now wrap up some early recommendations based on our observations:
#
# In general, ``non_blocking=True`` will provide good throughput, regardless of whether the original tensor is or
# isn't in pinned memory.
# If the tensor is already in pinned memory, the transfer can be accelerated, but sending it to
# pinned memory manually from the Python main thread is a blocking operation on the host, and hence will cancel much of
# the benefit of using ``non_blocking=True`` (as CUDA does the `pin_memory` transfer anyway).
#
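# These recommendations can be condensed into a two-line sketch (``cpu_tensor`` is a placeholder for a pageable CPU
# tensor):
#
# .. code-block:: python
#
#    gpu_tensor = cpu_tensor.to("cuda:0", non_blocking=True)                # usually the fastest simple option
#    gpu_tensor = cpu_tensor.pin_memory().to("cuda:0", non_blocking=True)   # pinning on the main thread blocks and is often slower
#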
# One might now legitimately ask what use there is for the :meth:`~torch.Tensor.pin_memory` method.
# In the following section, we will explore further how this can be used to accelerate the data transfer even more.
#
# Additional considerations
# -------------------------
#
# .. _pinned_memory_considerations:
#
# PyTorch notably provides a :class:`~torch.utils.data.DataLoader` class whose constructor accepts a
# ``pin_memory`` argument.
# Considering our previous discussion on ``pin_memory``, you might wonder how the ``DataLoader`` manages to
# accelerate data transfers if memory pinning is inherently blocking.
#
# The key lies in the DataLoader's use of a separate thread to handle the transfer of data from pageable to pinned
# memory, thus preventing any blockage in the main thread.
#
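# As a minimal sketch of how this is typically combined with non-blocking transfers in a training loop (the dataset,
# batch size, and number of workers below are placeholder values):
#
# .. code-block:: python
#
#    from torch.utils.data import DataLoader, TensorDataset
#
#    dataset = TensorDataset(torch.randn(10_000, 128))
#    loader = DataLoader(dataset, batch_size=256, num_workers=2, pin_memory=True)
#    for (batch,) in loader:
#        # batches arrive in pinned memory, so this copy can be truly asynchronous
#        batch = batch.to("cuda:0", non_blocking=True)
#        ...  # forward/backward pass on the GPU
#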
# To illustrate this, we will use the TensorDict primitive from the homonymous library.
# When invoking :meth:`~tensordict.TensorDict.to`, the default behavior is to send tensors to the device asynchronously,
# followed by a single device synchronization afterwards (for example, ``torch.cuda.synchronize()`` when the target is a
# CUDA device).
#
# Additionally, ``TensorDict.to()`` includes a ``non_blocking_pin`` option which initiates multiple threads to execute
# ``pin_memory()`` before proceeding with the call to ``to(device)``.
# This approach can further accelerate data transfers, as demonstrated in the following example.
#
#

from tensordict import TensorDict
import torch
from torch.utils.benchmark import Timer
import matplotlib.pyplot as plt

# Create the dataset
td = TensorDict({str(i): torch.randn(1_000_000) for i in range(1000)})

# Runtimes
copy_blocking = timer("td.to('cuda:0', non_blocking=False)")
copy_non_blocking = timer("td.to('cuda:0')")
copy_pin_nb = timer("td.to('cuda:0', non_blocking_pin=True, num_threads=0)")
copy_pin_multithread_nb = timer("td.to('cuda:0', non_blocking_pin=True, num_threads=4)")

# Ratios
r1 = copy_non_blocking / copy_blocking
r2 = copy_pin_nb / copy_blocking
r3 = copy_pin_multithread_nb / copy_blocking

# Figure
fig, ax = plt.subplots()

xlabels = [0, 1, 2, 3]
bar_labels = [
    "Blocking copy (1x)",
    f"Non-blocking copy ({r1:4.2f}x)",
    f"Blocking pin, non-blocking copy ({r2:4.2f}x)",
    f"Non-blocking pin, non-blocking copy ({r3:4.2f}x)",
]
values = [copy_blocking, copy_non_blocking, copy_pin_nb, copy_pin_multithread_nb]
colors = ["tab:blue", "tab:red", "tab:orange", "tab:green"]

ax.bar(xlabels, values, label=bar_labels, color=colors)

ax.set_ylabel("Runtime (ms)")
ax.set_title("Device casting runtime")
ax.set_xticks([])
ax.legend()

plt.show()

######################################################################
# In this example, we are transferring many large tensors from the CPU to the GPU.
# This scenario is ideal for utilizing multithreaded ``pin_memory()``, which can significantly enhance performance.
# However, if the tensors are small, the overhead associated with multithreading may outweigh the benefits.
# Similarly, if there are only a few tensors, the advantages of pinning tensors on separate threads become limited.
#
# As an additional note, while it might seem advantageous to create permanent buffers in pinned memory to shuttle
# tensors from pageable memory before transferring them to the GPU, this strategy does not necessarily expedite
# computation. The inherent bottleneck caused by copying data into pinned memory remains a limiting factor.
#
# Moreover, transferring data that resides on disk (whether in shared memory or files) to the GPU typically requires an
# intermediate step of copying the data into pinned memory (located in RAM).
# Utilizing ``non_blocking`` for large data transfers in this context can significantly increase RAM consumption,
# potentially leading to adverse effects.
#
# In practice, there is no one-size-fits-all solution.
# The effectiveness of using multithreaded ``pin_memory`` combined with ``non_blocking`` transfers depends on a
# variety of factors, including the specific system, operating system, hardware, and the nature of the tasks
# being executed.
# Here is a list of factors to check when trying to speed up data transfers between CPU and GPU, or comparing
# throughputs across scenarios:
#
# - **Number of available cores**
#
#   How many CPU cores are available? Is the system shared with other users or processes that might compete for
#   resources?
#
# - **Core utilization**
#
#   Are the CPU cores heavily utilized by other processes? Does the application perform other CPU-intensive tasks
#   concurrently with data transfers?
#
# - **Memory utilization**
#
#   How much pageable and page-locked memory is currently being used? Is there sufficient free memory to allocate
#   additional pinned memory without affecting system performance? Remember that nothing comes for free; for instance,
#   ``pin_memory`` will consume RAM and may impact other tasks.
#
# - **CUDA Device Capabilities**
#
#   Does the GPU support multiple DMA engines for concurrent data transfers? What are the specific capabilities and
#   limitations of the CUDA device being used?
#
# - **Number of tensors to be sent**
#
#   How many tensors are transferred in a typical operation?
#
# - **Size of the tensors to be sent**
#
#   What is the size of the tensors being transferred? A few large tensors or many small tensors may not benefit from
#   the same transfer strategy.
#
# - **System Architecture**
#
#   How is the system's architecture influencing data transfer speeds (for example, bus speeds, network latency)?
#
# Additionally, allocating a large number of tensors or sizable tensors in pinned memory can monopolize a substantial
# portion of RAM.
# This reduces the available memory for other critical operations, such as paging, which can negatively impact the
# overall performance of an algorithm.
#
# Conclusion
# ----------
#
# .. _pinned_memory_conclusion:
#
# Throughout this tutorial, we have explored several critical factors that influence transfer speeds and memory
# management when sending tensors from the host to the device. We've learned that using ``non_blocking=True`` generally
# accelerates data transfers, and that :meth:`~torch.Tensor.pin_memory` can also enhance performance if implemented
# correctly. However, these techniques require careful design and calibration to be effective.
#
# Remember that profiling your code and keeping an eye on the memory consumption are essential to optimize resource
# usage and achieve the best possible performance.
#
# Additional resources
# --------------------
#
# .. _pinned_memory_resources:
#
# If you are dealing with issues with memory copies when using CUDA devices or want to learn more about
# what was discussed in this tutorial, check the following references:
#
# - `CUDA toolkit memory management doc <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html>`_;
# - `CUDA pin-memory note <https://forums.developer.nvidia.com/t/pinned-memory/268474>`_;
# - `How to Optimize Data Transfers in CUDA C/C++ <https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/>`_;
# - `tensordict doc <https://pytorch.org/tensordict/stable/index.html>`_ and `repo <https://github.com/pytorch/tensordict>`_.
#