# -*- coding: utf-8 -*-
"""
A guide on good usage of ``non_blocking`` and ``pin_memory()`` in PyTorch
=========================================================================

**Author**: `Vincent Moens <https://github.com/vmoens>`_

Introduction
------------

Transferring data from the CPU to the GPU is fundamental in many PyTorch applications.
It's crucial for users to understand the most effective tools and options available for moving data between devices.
This tutorial examines two key methods for device-to-device data transfer in PyTorch:
:meth:`~torch.Tensor.pin_memory` and :meth:`~torch.Tensor.to` with the ``non_blocking=True`` option.

What you will learn
~~~~~~~~~~~~~~~~~~~

Optimizing the transfer of tensors from the CPU to the GPU can be achieved through asynchronous transfers and memory
pinning. However, there are important considerations:

- Using ``tensor.pin_memory().to(device, non_blocking=True)`` can be up to twice as slow as a straightforward ``tensor.to(device)``.
- Generally, ``tensor.to(device, non_blocking=True)`` is an effective choice for enhancing transfer speed.
- While ``cpu_tensor.to("cuda", non_blocking=True).mean()`` executes correctly, attempting
  ``cuda_tensor.to("cpu", non_blocking=True).mean()`` will result in erroneous outputs.

Preamble
~~~~~~~~

The performance figures reported in this tutorial depend on the system used to build it.
Although the conclusions are applicable across different systems, the specific observations may vary slightly
depending on the hardware available, especially on older hardware.
The primary objective of this tutorial is to offer a theoretical framework for understanding CPU-to-GPU data transfers.
However, any design decisions should be tailored to individual cases and guided by benchmarked throughput measurements,
as well as the specific requirements of the task at hand.

"""

import torch

assert torch.cuda.is_available(), "A cuda device is required to run this tutorial"


######################################################################
#
# This tutorial requires tensordict to be installed. If you don't have tensordict in your environment yet, install it
# by running the following command in a separate cell:
#
# .. code-block:: bash
#
#    # Install tensordict with the following command
#    !pip3 install tensordict
#
# We start by outlining the theory surrounding these concepts, and then move to concrete test examples of the features.
#
#
# Background
# ----------
#
# .. _pinned_memory_background:
#
# Memory management basics
# ~~~~~~~~~~~~~~~~~~~~~~~~
#
# .. _pinned_memory_memory:
#
# When one creates a CPU tensor in PyTorch, the content of this tensor needs to be placed
# in memory. The memory we talk about here is a rather complex concept worth looking at carefully.
# We distinguish two types of memory that are handled by the Memory Management Unit: the main memory (RAM, for simplicity)
# and the swap space on disk (which may or may not be the hard drive). Together, the available space on disk and in RAM (physical memory)
# makes up the virtual memory, which is an abstraction of the total resources available.
# In short, virtual memory makes the addressable space larger than what can be found in RAM alone and creates the
# illusion that the main memory is larger than it actually is.
#
# In normal circumstances, a regular CPU tensor is pageable, which means that it is divided into blocks called pages that
# can live anywhere in the virtual memory (both in RAM and on disk). As mentioned earlier, this has the advantage that
# the memory seems larger than what the main memory actually is.
#
# Typically, when a program accesses a page that is not in RAM, a "page fault" occurs and the operating system (OS) then brings
# back this page into RAM ("swap in" or "page in").
# In turn, the OS may have to swap out (or "page out") another page to make room for the new page.
#
# In contrast to pageable memory, pinned (or page-locked, or non-pageable) memory is a type of memory that cannot
# be swapped out to disk.
# It allows for faster and more predictable access times, but has the downside that it is more limited in size than
# pageable memory (that is, than the main memory).
#
# .. figure:: /_static/img/pinmem/pinmem.png
#    :alt:
#
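# In PyTorch terms, the distinction is visible directly on the tensor. As a minimal sketch (assuming a
# CUDA-capable machine, since pinned memory is only meaningful when a CUDA context can be created):
#
# .. code-block:: python
#
#    import torch
#
#    pageable = torch.randn(1024)                  # regular, pageable CPU tensor
#    pinned = torch.randn(1024, pin_memory=True)   # allocated directly in page-locked memory
#    also_pinned = pageable.pin_memory()           # pinned copy of an existing pageable tensor
#
#    print(pageable.is_pinned(), pinned.is_pinned(), also_pinned.is_pinned())  # False True True
#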
# CUDA and (non-)pageable memory
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# .. _pinned_memory_cuda_pageable_memory:
#
# To understand how CUDA copies a tensor from CPU to CUDA, let's consider the following two scenarios:
#
# - If the memory is page-locked, the device can access the memory directly in the main memory. The memory addresses are well
#   defined and functions that need to read this data can be significantly accelerated.
# - If the memory is pageable, all the pages will have to be brought to the main memory before being sent to the GPU.
#   This operation may take time and is less predictable than when executed on page-locked tensors.
#
# More precisely, when CUDA sends pageable data from CPU to GPU, it must first create a page-locked copy of that data
# before making the transfer.
#
# Asynchronous vs. Synchronous Operations with ``non_blocking=True`` (CUDA ``cudaMemcpyAsync``)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# .. _pinned_memory_async_sync:
#
# When executing a copy from a host (e.g., CPU) to a device (e.g., GPU), the CUDA toolkit offers modalities to do these
# operations synchronously or asynchronously with respect to the host.
#
# In practice, when calling :meth:`~torch.Tensor.to`, PyTorch always makes a call to
# `cudaMemcpyAsync <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79>`_.
# If ``non_blocking=False`` (default), a ``cudaStreamSynchronize`` will be called after each and every ``cudaMemcpyAsync``, making
# the call to :meth:`~torch.Tensor.to` blocking in the main thread.
# If ``non_blocking=True``, no synchronization is triggered, and the main thread on the host is not blocked.
# Therefore, from the host perspective, multiple tensors can be sent to the device simultaneously,
# as the thread does not need to wait for one transfer to be completed to initiate the next.
#
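# As a minimal sketch of what this buys on the host side (the tensor sizes and count here are
# illustrative only), several copies can be queued back-to-back and waited on once:
#
# .. code-block:: python
#
#    tensors = [torch.randn(1024) for _ in range(8)]
#    # All eight copies are enqueued without the host waiting in between
#    cuda_tensors = [t.to("cuda:0", non_blocking=True) for t in tensors]
#    # A single synchronization guarantees the copies are finished before the data is used
#    torch.cuda.synchronize()
#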
# .. note:: In general, the transfer is blocking on the device side (even if it isn't on the host side):
#   the copy on the device cannot occur while another operation is being executed.
#   However, in some advanced scenarios, a copy and a kernel execution can be done simultaneously on the GPU side.
#   As the following example will show, three requirements must be met to enable this:
#
#   1. The device must have at least one free DMA (Direct Memory Access) engine. Modern GPU architectures such as Volta,
#      Ampere, or Hopper (for example, H100 devices) have more than one DMA engine.
#
#   2. The transfer must be done on a separate, non-default CUDA stream. In PyTorch, CUDA streams can be handled using
#      :class:`~torch.cuda.Stream`.
#
#   3. The source data must be in pinned memory.
#
# We demonstrate this by running profiles on the following script.
#

import contextlib

from torch.cuda import Stream


s = Stream()

torch.manual_seed(42)
t1_cpu_pinned = torch.randn(1024**2 * 5, pin_memory=True)
t2_cpu_paged = torch.randn(1024**2 * 5, pin_memory=False)
t3_cuda = torch.randn(1024**2 * 5, device="cuda:0")

assert torch.cuda.is_available()
device = torch.device("cuda", torch.cuda.current_device())


# The function we want to profile
def inner(pinned: bool, streamed: bool):
    with torch.cuda.stream(s) if streamed else contextlib.nullcontext():
        if pinned:
            t1_cuda = t1_cpu_pinned.to(device, non_blocking=True)
        else:
            t2_cuda = t2_cpu_paged.to(device, non_blocking=True)
        t_star_cuda_h2d_event = s.record_event()
    # This operation can be executed during the CPU to GPU copy if and only if the tensor is pinned and the copy is
    # done in the other stream
    t3_cuda_mul = t3_cuda * t3_cuda * t3_cuda
    t3_cuda_h2d_event = torch.cuda.current_stream().record_event()
    t_star_cuda_h2d_event.synchronize()
    t3_cuda_h2d_event.synchronize()


# Our profiler: profiles the `inner` function and stores the results in a .json file
def benchmark_with_profiler(
    pinned,
    streamed,
) -> None:
    torch._C._profiler._set_cuda_sync_enabled_val(True)
    wait, warmup, active = 1, 1, 2
    num_steps = wait + warmup + active
    rank = 0
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        schedule=torch.profiler.schedule(
            wait=wait, warmup=warmup, active=active, repeat=1, skip_first=1
        ),
    ) as prof:
        for step_idx in range(1, num_steps + 1):
            inner(streamed=streamed, pinned=pinned)
            if rank is None or rank == 0:
                prof.step()
    prof.export_chrome_trace(f"trace_streamed{int(streamed)}_pinned{int(pinned)}.json")


######################################################################
# Loading these profile traces in chrome (``chrome://tracing``) shows the following results: first, let's see
# what happens when the arithmetic operation on ``t3_cuda`` is executed after the pageable tensor is sent to GPU
# in the main stream:
#

benchmark_with_profiler(streamed=False, pinned=False)

######################################################################
# .. figure:: /_static/img/pinmem/trace_streamed0_pinned0.png
#    :alt:
#
# Using a pinned tensor doesn't change the trace much; both operations are still executed consecutively:

benchmark_with_profiler(streamed=False, pinned=True)

######################################################################
#
# .. figure:: /_static/img/pinmem/trace_streamed0_pinned1.png
#    :alt:
#
# Sending a pageable tensor to GPU on a separate stream is also a blocking operation:

benchmark_with_profiler(streamed=True, pinned=False)

######################################################################
#
# .. figure:: /_static/img/pinmem/trace_streamed1_pinned0.png
#    :alt:
#
# Only copies of pinned tensors to GPU on a separate stream overlap with another CUDA kernel executed on
# the main stream:

benchmark_with_profiler(streamed=True, pinned=True)

######################################################################
#
# .. figure:: /_static/img/pinmem/trace_streamed1_pinned1.png
#    :alt:
#
# A PyTorch perspective
# ---------------------
#
# .. _pinned_memory_pt_perspective:
#
# ``pin_memory()``
# ~~~~~~~~~~~~~~~~
#
# .. _pinned_memory_pinned:
#
# PyTorch offers the possibility to create and send tensors to page-locked memory through the
# :meth:`~torch.Tensor.pin_memory` method and constructor arguments.
# CPU tensors on a machine where CUDA is initialized can be cast to pinned memory through the :meth:`~torch.Tensor.pin_memory`
# method. Importantly, ``pin_memory`` is blocking on the main thread of the host: it will wait for the tensor to be copied to
# page-locked memory before executing the next operation.
# New tensors can be directly created in pinned memory with functions like :func:`~torch.zeros`, :func:`~torch.ones` and other
# constructors.
#
# Let us check the speed of pinning memory and sending tensors to CUDA:


import torch
import gc
from torch.utils.benchmark import Timer
import matplotlib.pyplot as plt


def timer(cmd):
    median = (
        Timer(cmd, globals=globals())
        .adaptive_autorange(min_run_time=1.0, max_run_time=20.0)
        .median
        * 1000
    )
    print(f"{cmd}: {median: 4.4f} ms")
    return median


# A tensor in pageable memory
pageable_tensor = torch.randn(1_000_000)

# A tensor in page-locked (pinned) memory
pinned_tensor = torch.randn(1_000_000, pin_memory=True)

# Runtimes:
pageable_to_device = timer("pageable_tensor.to('cuda:0')")
pinned_to_device = timer("pinned_tensor.to('cuda:0')")
pin_mem = timer("pageable_tensor.pin_memory()")
pin_mem_to_device = timer("pageable_tensor.pin_memory().to('cuda:0')")

# Ratios:
r1 = pinned_to_device / pageable_to_device
r2 = pin_mem_to_device / pageable_to_device

# Create a figure with the results
fig, ax = plt.subplots()

xlabels = [0, 1, 2]
bar_labels = [
    "pageable_tensor.to(device) (1x)",
    f"pinned_tensor.to(device) ({r1:4.2f}x)",
    f"pageable_tensor.pin_memory().to(device) ({r2:4.2f}x)"
    f"\npin_memory()={100*pin_mem/pin_mem_to_device:.2f}% of runtime.",
]
values = [pageable_to_device, pinned_to_device, pin_mem_to_device]
colors = ["tab:blue", "tab:red", "tab:orange"]
ax.bar(xlabels, values, label=bar_labels, color=colors)

ax.set_ylabel("Runtime (ms)")
ax.set_title("Device casting runtime (pin-memory)")
ax.set_xticks([])
ax.legend()

plt.show()

# Clear tensors
del pageable_tensor, pinned_tensor
_ = gc.collect()

######################################################################
#
# We can observe that casting a pinned-memory tensor to GPU is indeed much faster than a pageable tensor, because under
# the hood, a pageable tensor must be copied to pinned memory before being sent to GPU.
#
# However, contrary to a somewhat common belief, calling :meth:`~torch.Tensor.pin_memory()` on a pageable tensor before
# casting it to GPU should not bring any significant speed-up; on the contrary, this call is usually slower than just
# executing the transfer. This makes sense, since we're actually asking Python to execute an operation that CUDA will
# perform anyway before copying the data from host to device.
#
# .. note:: The PyTorch implementation of
#   `pin_memory <https://github.com/pytorch/pytorch/blob/5298acb5c76855bc5a99ae10016efc86b27949bd/aten/src/ATen/native/Memory.cpp#L58>`_,
#   which relies on creating a brand new storage in pinned memory through `cudaHostAlloc <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gb65da58f444e7230d3322b6126bb4902>`_,
#   could be, in rare cases, faster than transferring data in chunks as ``cudaMemcpy`` does.
#   Here too, the observation may vary depending on the available hardware, the size of the tensors being sent, or
#   the amount of available RAM.
#
# ``non_blocking=True``
# ~~~~~~~~~~~~~~~~~~~~~
#
# .. _pinned_memory_non_blocking:
#
# As mentioned earlier, many PyTorch operations have the option of being executed asynchronously with respect to the host
# through the ``non_blocking`` argument.
#
# Here, to accurately account for the benefits of using ``non_blocking``, we will design a slightly more complex
# experiment, since we want to assess how fast it is to send multiple tensors to GPU with and without calling
# ``non_blocking``.
#


# A simple loop that copies all tensors to cuda
def copy_to_device(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.to("cuda:0"))
    return result


# A loop that copies all tensors to cuda asynchronously
def copy_to_device_nonblocking(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.to("cuda:0", non_blocking=True))
    # We need to synchronize
    torch.cuda.synchronize()
    return result


# Create a list of tensors
tensors = [torch.randn(1000) for _ in range(1000)]
to_device = timer("copy_to_device(*tensors)")
to_device_nonblocking = timer("copy_to_device_nonblocking(*tensors)")

# Ratio
r1 = to_device_nonblocking / to_device

# Plot the results
fig, ax = plt.subplots()

xlabels = [0, 1]
bar_labels = [f"to(device) (1x)", f"to(device, non_blocking=True) ({r1:4.2f}x)"]
colors = ["tab:blue", "tab:red"]
values = [to_device, to_device_nonblocking]

ax.bar(xlabels, values, label=bar_labels, color=colors)

ax.set_ylabel("Runtime (ms)")
ax.set_title("Device casting runtime (non-blocking)")
ax.set_xticks([])
ax.legend()

plt.show()


######################################################################
# To get a better sense of what is happening here, let us profile these two functions:


from torch.profiler import profile, ProfilerActivity


def profile_mem(cmd):
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        exec(cmd)
    print(cmd)
    print(prof.key_averages().table(row_limit=10))


######################################################################
# Let's see the call stack with a regular ``to(device)`` first:
#

print("Call to `to(device)`", profile_mem("copy_to_device(*tensors)"))

######################################################################
# and now the ``non_blocking`` version:
#

print(
    "Call to `to(device, non_blocking=True)`",
    profile_mem("copy_to_device_nonblocking(*tensors)"),
)


######################################################################
# The results are without any doubt better when using ``non_blocking=True``, as all transfers are initiated simultaneously
# on the host side and only one synchronization is done.
#
# The benefit will vary depending on the number and the size of the tensors as well as depending on the hardware being
# used.
#
# .. note:: Interestingly, the blocking ``to("cuda")`` actually performs the same asynchronous device casting operation
#   (``cudaMemcpyAsync``) as the one with ``non_blocking=True`` with a synchronization point after each copy.
#
# Synergies
# ~~~~~~~~~
#
# .. _pinned_memory_synergies:
#
# Now that we have made the point that data transfer of tensors already in pinned memory to GPU is faster than from
# pageable memory, and that we know that doing these transfers asynchronously is also faster than synchronously, we can
# benchmark combinations of these approaches. First, let's write a couple of new functions that will call ``pin_memory``
# and ``to(device)`` on each tensor:
#


def pin_copy_to_device(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.pin_memory().to("cuda:0"))
    return result


def pin_copy_to_device_nonblocking(*tensors):
    result = []
    for tensor in tensors:
        result.append(tensor.pin_memory().to("cuda:0", non_blocking=True))
    # We need to synchronize
    torch.cuda.synchronize()
    return result


######################################################################
# The benefits of using :meth:`~torch.Tensor.pin_memory` are more pronounced for
# somewhat large batches of large tensors:
#

tensors = [torch.randn(1_000_000) for _ in range(1000)]
page_copy = timer("copy_to_device(*tensors)")
page_copy_nb = timer("copy_to_device_nonblocking(*tensors)")

tensors_pinned = [torch.randn(1_000_000, pin_memory=True) for _ in range(1000)]
pinned_copy = timer("copy_to_device(*tensors_pinned)")
pinned_copy_nb = timer("copy_to_device_nonblocking(*tensors_pinned)")

pin_and_copy = timer("pin_copy_to_device(*tensors)")
pin_and_copy_nb = timer("pin_copy_to_device_nonblocking(*tensors)")

# Plot
strategies = ("pageable copy", "pinned copy", "pin and copy")
blocking = {
    "blocking": [page_copy, pinned_copy, pin_and_copy],
    "non-blocking": [page_copy_nb, pinned_copy_nb, pin_and_copy_nb],
}

x = torch.arange(3)
width = 0.25
multiplier = 0


fig, ax = plt.subplots(layout="constrained")

for attribute, runtimes in blocking.items():
    offset = width * multiplier
    rects = ax.bar(x + offset, runtimes, width, label=attribute)
    ax.bar_label(rects, padding=3, fmt="%.2f")
    multiplier += 1

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel("Runtime (ms)")
ax.set_title("Runtime (pin-mem and non-blocking)")
ax.set_xticks([0, 1, 2])
ax.set_xticklabels(strategies)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
ax.legend(loc="upper left", ncols=3)

plt.show()

del tensors, tensors_pinned
_ = gc.collect()


######################################################################
# Other copy directions (GPU -> CPU, CPU -> MPS)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# .. _pinned_memory_other_direction:
#
# Until now, we have operated under the assumption that asynchronous copies from the CPU to the GPU are safe.
# This is generally true because CUDA automatically handles synchronization to ensure that the data being accessed is
# valid at read time.
# However, this guarantee does not extend to transfers in the opposite direction, from GPU to CPU.
# Without explicit synchronization, these transfers offer no assurance that the copy will be complete at the time of
# data access. Consequently, the data on the host might be incomplete or incorrect, effectively rendering it garbage:
#


tensor = (
    torch.arange(1, 1_000_000, dtype=torch.double, device="cuda")
    .expand(100, 999999)
    .clone()
)
torch.testing.assert_close(
    tensor.mean(), torch.tensor(500_000, dtype=torch.double, device="cuda")
), tensor.mean()
try:
    i = -1
    for i in range(100):
        cpu_tensor = tensor.to("cpu", non_blocking=True)
        torch.testing.assert_close(
            cpu_tensor.mean(), torch.tensor(500_000, dtype=torch.double)
        )
    print("No test failed with non_blocking")
except AssertionError:
    print(f"{i}th test failed with non_blocking. Skipping remaining tests")
try:
    i = -1
    for i in range(100):
        cpu_tensor = tensor.to("cpu", non_blocking=True)
        torch.cuda.synchronize()
        torch.testing.assert_close(
            cpu_tensor.mean(), torch.tensor(500_000, dtype=torch.double)
        )
    print("No test failed with synchronize")
except AssertionError:
    print(f"One test failed with synchronize: {i}th assertion!")


######################################################################
# The same considerations apply to copies from the CPU to non-CUDA devices, such as MPS.
# Generally, asynchronous copies to a device are safe without explicit synchronization only when the target is a
# CUDA-enabled device.
#
# In summary, copying data from CPU to GPU is safe when using ``non_blocking=True``, but for any other direction,
# ``non_blocking=True`` can still be used provided that the user makes sure a device synchronization is executed before
# the data is accessed.
#
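# As a minimal recap of the safe device-to-host pattern (the explicit ``torch.cuda.synchronize()`` call
# below could equally be replaced by synchronizing on an event or a stream):
#
# .. code-block:: python
#
#    gpu_tensor = torch.randn(1_000, device="cuda:0")
#    host_tensor = gpu_tensor.to("cpu", non_blocking=True)  # the copy is only enqueued here
#    torch.cuda.synchronize()                                # wait for the copy to complete
#    print(host_tensor.mean())                               # now safe to read
#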
# Practical recommendations
# -------------------------
#
# .. _pinned_memory_recommendations:
#
# We can now wrap up some early recommendations based on our observations:
#
# In general, ``non_blocking=True`` will provide good throughput, regardless of whether the original tensor is or
# isn't in pinned memory.
# If the tensor is already in pinned memory, the transfer can be accelerated, but sending it to
# pinned memory manually from the Python main thread is a blocking operation on the host, and hence will annihilate much of
# the benefit of using ``non_blocking=True`` (as CUDA does the ``pin_memory`` transfer anyway).
#
# One might now legitimately ask what use there is for the :meth:`~torch.Tensor.pin_memory` method.
# In the following section, we will explore further how this can be used to accelerate the data transfer even more.
#
# Additional considerations
# -------------------------
#
# .. _pinned_memory_considerations:
#
# PyTorch notably provides a :class:`~torch.utils.data.DataLoader` class whose constructor accepts a
# ``pin_memory`` argument.
# Considering our previous discussion on ``pin_memory``, you might wonder how the ``DataLoader`` manages to
# accelerate data transfers if memory pinning is inherently blocking.
#
# The key lies in the DataLoader's use of a separate thread to handle the transfer of data from pageable to pinned
# memory, thus preventing any blockage in the main thread.
#
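# A minimal sketch of this usage (the dataset and loader parameters below are illustrative, not
# prescriptive): with ``pin_memory=True``, batches produced by the loader land in page-locked memory, so the
# subsequent ``to("cuda:0", non_blocking=True)`` call in the training loop is cheap and asynchronous:
#
# .. code-block:: python
#
#    from torch.utils.data import DataLoader, TensorDataset
#
#    dataset = TensorDataset(torch.randn(10_000, 128))
#    loader = DataLoader(dataset, batch_size=256, num_workers=2, pin_memory=True)
#    for (batch,) in loader:
#        batch = batch.to("cuda:0", non_blocking=True)
#        # ... do some work on the GPU ...
#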
# To illustrate this, we will use the TensorDict primitive from the homonymous library.
# When invoking :meth:`~tensordict.TensorDict.to`, the default behavior is to send tensors to the device asynchronously,
# followed by a single device synchronization afterwards (``torch.cuda.synchronize()`` in the CUDA case).
#
# Additionally, ``TensorDict.to()`` includes a ``non_blocking_pin`` option which initiates multiple threads to execute
# ``pin_memory()`` before proceeding with the call to ``to(device)``.
# This approach can further accelerate data transfers, as demonstrated in the following example.
#
#

from tensordict import TensorDict
import torch
from torch.utils.benchmark import Timer
import matplotlib.pyplot as plt

# Create the dataset
td = TensorDict({str(i): torch.randn(1_000_000) for i in range(1000)})

# Runtimes
copy_blocking = timer("td.to('cuda:0', non_blocking=False)")
copy_non_blocking = timer("td.to('cuda:0')")
copy_pin_nb = timer("td.to('cuda:0', non_blocking_pin=True, num_threads=0)")
copy_pin_multithread_nb = timer("td.to('cuda:0', non_blocking_pin=True, num_threads=4)")

# Ratios
r1 = copy_non_blocking / copy_blocking
r2 = copy_pin_nb / copy_blocking
r3 = copy_pin_multithread_nb / copy_blocking

# Figure
fig, ax = plt.subplots()

xlabels = [0, 1, 2, 3]
bar_labels = [
    "Blocking copy (1x)",
    f"Non-blocking copy ({r1:4.2f}x)",
    f"Blocking pin, non-blocking copy ({r2:4.2f}x)",
    f"Non-blocking pin, non-blocking copy ({r3:4.2f}x)",
]
values = [copy_blocking, copy_non_blocking, copy_pin_nb, copy_pin_multithread_nb]
colors = ["tab:blue", "tab:red", "tab:orange", "tab:green"]

ax.bar(xlabels, values, label=bar_labels, color=colors)

ax.set_ylabel("Runtime (ms)")
ax.set_title("Device casting runtime")
ax.set_xticks([])
ax.legend()

plt.show()

######################################################################
# In this example, we are transferring many large tensors from the CPU to the GPU.
# This scenario is ideal for utilizing multithreaded ``pin_memory()``, which can significantly enhance performance.
# However, if the tensors are small, the overhead associated with multithreading may outweigh the benefits.
# Similarly, if there are only a few tensors, the advantages of pinning tensors on separate threads become limited.
#
# As an additional note, while it might seem advantageous to create permanent buffers in pinned memory to shuttle
# tensors from pageable memory before transferring them to the GPU, this strategy does not necessarily expedite
# computation. The inherent bottleneck caused by copying data into pinned memory remains a limiting factor.
#
# Moreover, transferring data that resides on disk (whether in shared memory or files) to the GPU typically requires an
# intermediate step of copying the data into pinned memory (located in RAM).
# Utilizing ``non_blocking`` for large data transfers in this context can significantly increase RAM consumption,
# potentially leading to adverse effects.
#
# In practice, there is no one-size-fits-all solution.
# The effectiveness of using multithreaded ``pin_memory`` combined with ``non_blocking`` transfers depends on a
# variety of factors, including the specific system, operating system, hardware, and the nature of the tasks
# being executed.
# Here is a list of factors to check when trying to speed up data transfers between CPU and GPU, or when comparing
# throughputs across scenarios:
#
# - **Number of available cores**
#
#   How many CPU cores are available? Is the system shared with other users or processes that might compete for
#   resources?
#
# - **Core utilization**
#
#   Are the CPU cores heavily utilized by other processes? Does the application perform other CPU-intensive tasks
#   concurrently with data transfers?
#
# - **Memory utilization**
#
#   How much pageable and page-locked memory is currently being used? Is there sufficient free memory to allocate
#   additional pinned memory without affecting system performance? Remember that nothing comes for free, for instance
#   ``pin_memory`` will consume RAM and may impact other tasks.
#
# - **CUDA Device Capabilities**
#
#   Does the GPU support multiple DMA engines for concurrent data transfers? What are the specific capabilities and
#   limitations of the CUDA device being used?
#
# - **Number of tensors to be sent**
#
#   How many tensors are transferred in a typical operation?
#
# - **Size of the tensors to be sent**
#
#   What is the size of the tensors being transferred? A few large tensors or many small tensors may not benefit from
#   the same transfer program.
#
# - **System Architecture**
#
#   How is the system's architecture influencing data transfer speeds (for example, bus speeds, network latency)?
#
# Additionally, allocating a large number of tensors or sizable tensors in pinned memory can monopolize a substantial
# portion of RAM.
# This reduces the available memory for other critical operations, such as paging, which can negatively impact the
# overall performance of an algorithm.
#
# Conclusion
# ----------
#
# .. _pinned_memory_conclusion:
#
# Throughout this tutorial, we have explored several critical factors that influence transfer speeds and memory
# management when sending tensors from the host to the device. We've learned that using ``non_blocking=True`` generally
# accelerates data transfers, and that :meth:`~torch.Tensor.pin_memory` can also enhance performance if used
# correctly. However, these techniques require careful design and calibration to be effective.
#
# Remember that profiling your code and keeping an eye on the memory consumption are essential for optimizing resource
# usage and achieving the best possible performance.
#
# Additional resources
# --------------------
#
# .. _pinned_memory_resources:
#
# If you are dealing with issues with memory copies when using CUDA devices or want to learn more about
# what was discussed in this tutorial, check the following references:
#
# - `CUDA toolkit memory management doc <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html>`_;
# - `CUDA pin-memory note <https://forums.developer.nvidia.com/t/pinned-memory/268474>`_;
# - `How to Optimize Data Transfers in CUDA C/C++ <https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/>`_;
# - `tensordict doc <https://pytorch.org/tensordict/stable/index.html>`_ and `repo <https://github.com/pytorch/tensordict>`_.
#
730