CoCalc -- intermediate_data_loading

GitHub Repository: pytorch/tutorials
Path: blob/main/intermediate_source/intermediate_data_loading_tutorial.py
⁶⁴⁶³ views
1
# -*- coding: utf-8 -*-
2
"""
3
Data Loading Optimization in PyTorch
4
==============================================
5

6
**Authors**: `Divyansh Khanna <https://github.com/divyanshk>`_, `Ramanish Singh <https://github.com/ramanishsingh>`_
7

8
.. grid:: 2
9

10
    .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
11
       :class-card: card-prerequisites
12

13
       * How to optimize DataLoader configuration for maximum throughput
14
       * Best practices for ``batch_size``, ``num_workers``, and ``pin_memory``
15
       * Advanced techniques for overlapping data transfers with GPU compute
16
       * Configuring shared memory strategies and handling ``/dev/shm`` issues
17

18
    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
19
       :class-card: card-prerequisites
20

21
       * PyTorch v2.0+
22
       * Basic understanding of PyTorch DataLoader
23
       * (Optional) A CUDA-capable GPU for GPU-specific optimizations
24

25
Introduction
26
------------
27

28
Data loading is often a critical bottleneck in deep learning pipelines. While
29
GPUs can process batches extremely quickly, inefficient data loading can leave
30
expensive hardware idle, waiting for the next batch of data. This tutorial
31
covers best practices and some techniques for optimizing your data loading configuration to
32
maximize training throughput.
33

34
We'll explore the key parameters of PyTorch's DataLoader and provide practical
35
guidance on tuning them for your specific workload. Rather than showing each
36
optimization in isolation, we'll build up from a baseline training loop and
37
progressively apply optimizations, measuring the cumulative speedup at each
38
step.
39
"""
40

41
import time
42

43
import torch
44
import torch.nn as nn
45
from torch.utils.data import DataLoader, Dataset
46

47
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
48
print(f"Using device: {device}")
49

50
# Set a fixed seed for reproducibility
51
torch.manual_seed(42)
52

53
######################################################################
54
# Creating a Sample Dataset
55
# -------------------------
56
#
57
# First, let's create a simple dataset that simulates expensive
58
# transformations. This will help us demonstrate the impact of
59
# various DataLoader configurations.
60

61

62
class SyntheticDataset(Dataset):
63
    """A synthetic dataset that simulates expensive data transformations."""
64

65
    def __init__(self, size=10000, feature_dim=224, transform_delay=0.001):
66
        self.size = size
67
        self.feature_dim = feature_dim
68
        self.transform_delay = transform_delay
69

70
    def __len__(self):
71
        return self.size
72

73
    def __getitem__(self, idx):
74
        # Generate data lazily to avoid pre-allocating large tensors
75
        data = torch.randn(3, self.feature_dim, self.feature_dim)
76
        label = torch.randint(0, 10, (1,)).item()
77
        if self.transform_delay > 0:
78
            time.sleep(self.transform_delay)
79
        return data, label
80

81

82
class SyntheticDatasetBatched(Dataset):
83
    """Same as SyntheticDataset but with __getitems__ for batched fetching."""
84

85
    def __init__(self, size=10000, feature_dim=224, transform_delay=0.001):
86
        self.size = size
87
        self.feature_dim = feature_dim
88
        self.transform_delay = transform_delay
89

90
    def __len__(self):
91
        return self.size
92

93
    def __getitem__(self, idx):
94
        data = torch.randn(3, self.feature_dim, self.feature_dim)
95
        label = torch.randint(0, 10, (1,)).item()
96
        if self.transform_delay > 0:
97
            time.sleep(self.transform_delay)
98
        return data, label
99

100
    def __getitems__(self, indices):
101
        """Fetch multiple items at once — enables vectorized generation.
102

103
        Instead of N individual __getitem__ calls (each with its own
104
        overhead), this generates the entire batch in one shot using
105
        vectorized tensor operations.
106
        """
107
        n = len(indices)
108
        # Vectorized generation: one call instead of N individual ones
109
        data = torch.randn(n, 3, self.feature_dim, self.feature_dim)
110
        labels = torch.randint(0, 10, (n,))
111
        # Simulate batch-level I/O: one sleep for the whole batch,
112
        # not one per sample (e.g., one DB query for N rows)
113
        if self.transform_delay > 0:
114
            time.sleep(self.transform_delay)
115
        return [(data[i], labels[i].item()) for i in range(n)]
116

117

118
######################################################################
119
# Shared Training Infrastructure
120
# ------------------------------
121
#
122
# To measure the real-world impact of each optimization, we define a
123
# reusable training loop that accepts a DataLoader and returns timing
124
# and loss. This avoids duplicating the training loop for every
125
# benchmark.
126
#
127
# We use a **small dataset (500 samples)** with a **high transform
128
# delay (5ms)** to ensure the pipeline remains data-bound throughout.
129
# The small dataset means short epochs (16 batches each), so we run
130
# many epochs — making persistent_workers' benefit visible across
131
# epoch boundaries.
132

133
benchmark_dataset = SyntheticDataset(size=512, feature_dim=224, transform_delay=0.005)
134

135

136
class SmallTransformerModel(nn.Module):
137

138
    def __init__(self):
139
        super().__init__()
140
        self.features = nn.Sequential(
141
            nn.Conv2d(3, 32, kernel_size=7, stride=4, padding=3),
142
            nn.ReLU(),
143
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
144
            nn.ReLU(),
145
            nn.AdaptiveAvgPool2d((7, 7)),
146
        )
147
        encoder_layer = nn.TransformerEncoderLayer(
148
            d_model=64, nhead=4, dim_feedforward=128, batch_first=True
149
        )
150
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
151
        self.classifier = nn.Linear(64, 10)
152

153
    def forward(self, x):
154
        x = self.features(x)  # (B, 64, 7, 7)
155
        B, C, H, W = x.shape
156
        x = x.view(B, C, H * W).permute(0, 2, 1)  # (B, 49, 64)
157
        x = self.transformer(x)  # (B, 49, 64)
158
        x = x.mean(dim=1)  # (B, 64)
159
        return self.classifier(x)
160

161

162
def create_model():
163
    """Create a conv+transformer model for benchmarking."""
164
    return SmallTransformerModel().to(device)
165

166

167
def train_and_benchmark(loader, max_batches=160, epochs=10, prefetch_device=None):
168
    """Train a model over multiple epochs and return elapsed time and average loss.
169

170
    Running multiple epochs (10) with a small dataset ensures many epoch
171
    boundaries, making persistent_workers' startup savings visible.
172

173
    Args:
174
        loader: A DataLoader to iterate over.
175
        max_batches: Maximum total number of batches to process across all epochs.
176
        epochs: Number of epochs to iterate (re-iterates the loader each epoch).
177
        prefetch_device: If set, wraps the loader in a DataPrefetcher each epoch
178
            for overlapping H2D transfers. Data arrives already on device.
179

180
    Returns:
181
        Tuple of (elapsed_seconds, average_loss).
182
    """
183
    model = create_model()
184
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
185
    criterion = nn.CrossEntropyLoss()
186

187
    start_time = time.perf_counter()
188
    total_loss = 0.0
189
    num_batches = 0
190

191
    for epoch in range(epochs):
192
        if prefetch_device is not None:
193
            data_iter = DataPrefetcher(loader, prefetch_device)
194
        else:
195
            data_iter = loader
196

197
        for data, labels in data_iter:
198
            if prefetch_device is None:
199
                data = data.to(device, non_blocking=True)
200
                labels = labels.to(device, non_blocking=True)
201

202
            output = model(data)
203
            loss = criterion(output, labels)
204

205
            optimizer.zero_grad()
206
            loss.backward()
207
            optimizer.step()
208

209
            total_loss += loss.item()
210
            num_batches += 1
211

212
            if num_batches >= max_batches:
213
                break
214
        if num_batches >= max_batches:
215
            break
216

217
    if torch.cuda.is_available():
218
        torch.cuda.synchronize()
219
    elapsed = time.perf_counter() - start_time
220

221
    return elapsed, total_loss / num_batches
222

223

224
######################################################################
225
# Baseline Training Loop
226
# ----------------------
227
#
228
# Our starting point: a simple DataLoader with no multiprocessing,
229
# no pinned memory, and default settings. This establishes the
230
# performance floor we'll improve upon.
231

232
baseline_loader = DataLoader(
233
    benchmark_dataset,
234
    batch_size=32,
235
    shuffle=True,
236
    num_workers=0,
237
    pin_memory=False,
238
)
239

240
print("\n=== Progressive Optimization Results ===")
241
print("\nBaseline (num_workers=0, pin_memory=False):")
242
baseline_time, baseline_loss = train_and_benchmark(baseline_loader)
243
print(f"  Time: {baseline_time:.4f}s | Loss: {baseline_loss:.4f}")
244
prev_time = baseline_time
245

246
######################################################################
247
# Batch Size Optimization
248
# -----------------------
249
#
250
# The ``batch_size`` parameter controls how many samples are processed
251
# together. Choosing the right batch size involves balancing several factors:
252
#
253
# **Memory Considerations:**
254
#
255
# - Larger batch sizes require more GPU memory for storing inputs,
256
#   activations, and gradients
257
# - Out-of-memory (OOM) errors are common with large batch sizes
258
# - Moderate batch sizes (32-128) often provide the best balance
259
#
260
# **Training Dynamics:**
261
#
262
# - Batch size changes affect the effective learning rate, typically requiring tuning
263
# - Larger batches provide more stable gradient estimates but may
264
#   generalize differently
265
#
266
# .. note::
267
#    When changing batch size, remember to tune your optimizer parameters,
268
#    especially the learning rate schedule, unless you're doing inference
269
#
270
# Since batch size is model-dependent (not a "just add it" optimization),
271
# we benchmark it in isolation rather than folding it into the progressive
272
# optimization chain.
273

274
# Example: Testing different batch sizes
275
batch_dataset = SyntheticDataset(size=1000, transform_delay=0)
276

277

278
def benchmark_batch_size(batch_size, num_batches=10):
279
    """Benchmark data loading with a specific batch size."""
280
    loader = DataLoader(batch_dataset, batch_size=batch_size, shuffle=True)
281
    start = time.perf_counter()
282
    for i, (data, labels) in enumerate(loader):
283
        if i >= num_batches:
284
            break
285
        data = data.to(device, non_blocking=True)
286
        _ = data.sum()
287
    if torch.cuda.is_available():
288
        torch.cuda.synchronize()
289
    elapsed = time.perf_counter() - start
290
    return elapsed
291

292

293
# Benchmark different batch sizes
294
print("\nBatch size comparison (isolated benchmark):")
295
for bs in [16, 32, 64, 128]:
296
    elapsed = benchmark_batch_size(bs)
297
    print(f"  Batch size {bs:3d}: {elapsed:.4f}s for 10 batches")
298

299
######################################################################
300
# Number of Workers (``num_workers``)
301
# -----------------------------------
302
#
303
# The ``num_workers`` parameter controls how many subprocesses are used
304
# for data loading. This is crucial for parallelizing expensive data
305
# transformations.
306
#
307
# **How it works:**
308
#
309
# - Each worker maintains a queue of batches (controlled by ``prefetch_factor``)
310
# - Workers prepare batches in parallel and transfer them to the main process
311
# - If ``in_order=True`` (default), batches are returned in order
312
#
313
# **When to increase ``num_workers``:**
314
#
315
# - When transforms are computationally expensive (augmentations, decoding)
316
# - When data is loaded from slow storage (network drives, HDD)
317
# - When you observe GPU idle time due to data loading
318
#
319
# **When ``num_workers=0`` might be faster:**
320
#
321
# - When transforms are cheap (simple tensor operations)
322
# - When data is already in memory
323
# - The overhead of inter-process communication (IPC) exceeds the
324
#   parallelization benefits
325
#
326
# .. note::
327
#    Finding the optimal ``num_workers`` requires tuning: increase workers
328
#    until throughput plateaus. Too many workers waste CPU
329
#    memory (each worker holds its own copy of the dataset object and
330
#    prefetched batches) and can cause ``/dev/shm`` exhaustion. A good
331
#    starting point is 2-4 workers per GPU; profile with different values
332
#    to find the sweet spot for your workload.
333
#
334
# Let's add ``num_workers=4`` and ``prefetch_factor=2`` to our training
335
# loop and measure the improvement:
336

337
workers_loader = DataLoader(
338
    benchmark_dataset,
339
    batch_size=32,
340
    shuffle=True,
341
    num_workers=4,
342
    prefetch_factor=2,
343
    pin_memory=False,
344
)
345

346
print("\n+ num_workers=4, prefetch_factor=2:")
347
workers_time, workers_loss = train_and_benchmark(workers_loader)
348
print(f"  Time: {workers_time:.4f}s | Loss: {workers_loss:.4f}")
349
print(
350
    f"  Speedup vs baseline: {baseline_time / workers_time:.2f}x | vs previous: {prev_time / workers_time:.2f}x"
351
)
352
prev_time = workers_time
353

354
######################################################################
355
# Understanding ``pin_memory``
356
# ----------------------------
357
#
358
# The ``pin_memory`` parameter enables faster CPU-to-GPU data transfers
359
# by using page-locked (pinned) memory.
360
#
361
# **How pinned memory works:**
362
#
363
# - Pinned memory cannot be swapped to disk by the OS
364
# - This enables faster DMA (Direct Memory Access) transfers to GPU
365
# - The CPU-to-GPU transfer can happen asynchronously
366
#
367
# **Best practices:**
368
#
369
# 1. Use ``pin_memory=True`` in the DataLoader (recommended approach)
370
# 2. Combine with ``non_blocking=True`` when moving data to GPU
371
# 3. Avoid manually calling ``tensor.pin_memory()`` followed by
372
#    ``.to(device, non_blocking=True)`` - this is slower because
373
#    ``pin_memory()`` is blocking
374
#
375
# **The safe pattern:**
376
#
377
# .. code-block:: python
378
#
379
#     # Recommended: Let DataLoader handle pinning
380
#     loader = DataLoader(dataset, pin_memory=True)
381
#     for data, labels in loader:
382
#         data = data.to(device, non_blocking=True)
383
#         labels = labels.to(device, non_blocking=True)
384
#
385
# .. seealso::
386
#    For more details, see the
387
#    `pin_memory tutorial <https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html>`_
388
#
389
# Let's add ``pin_memory=True`` to our configuration:
390

391
pinmem_loader = DataLoader(
392
    benchmark_dataset,
393
    batch_size=32,
394
    shuffle=True,
395
    num_workers=4,
396
    prefetch_factor=2,
397
    pin_memory=torch.cuda.is_available(),
398
)
399

400
if torch.cuda.is_available():
401
    print("\n+ pin_memory=True:")
402
    pinmem_time, pinmem_loss = train_and_benchmark(pinmem_loader)
403
    print(f"  Time: {pinmem_time:.4f}s | Loss: {pinmem_loss:.4f}")
404
    print(
405
        f"  Speedup vs baseline: {baseline_time / pinmem_time:.2f}x | vs previous: {prev_time / pinmem_time:.2f}x"
406
    )
407
    print(
408
        "  (pin_memory benefit is modest here because CPU transform time dominates H2D transfer)"
409
    )
410
    prev_time = pinmem_time
411
else:
412
    print("\n+ pin_memory: skipped (CUDA not available)")
413
    pinmem_time = workers_time
414

415
######################################################################
416
# Persistent Workers
417
# ------------------
418
#
419
# By default, worker processes are shut down and restarted between
420
# epochs. This incurs startup overhead (importing modules, forking
421
# processes, re-initializing datasets) on every epoch boundary.
422
#
423
# Setting ``persistent_workers=True`` keeps the workers alive across
424
# epochs, eliminating this repeated startup cost.
425
#
426
# **When it helps most:**
427
#
428
# - Training for many epochs on smaller datasets
429
# - When dataset ``__init__`` is expensive (e.g., loading metadata)
430
# - When combined with high ``num_workers``
431
#
432
# Let's compare with and without persistent workers over multiple epochs:
433

434
non_persistent_loader = DataLoader(
435
    benchmark_dataset,
436
    batch_size=32,
437
    shuffle=True,
438
    num_workers=4,
439
    prefetch_factor=2,
440
    pin_memory=torch.cuda.is_available(),
441
    persistent_workers=False,
442
)
443

444
persistent_loader = DataLoader(
445
    benchmark_dataset,
446
    batch_size=32,
447
    shuffle=True,
448
    num_workers=4,
449
    prefetch_factor=2,
450
    pin_memory=torch.cuda.is_available(),
451
    persistent_workers=True,
452
)
453

454
print("\n+ persistent_workers=True (10 epochs):")
455
non_persistent_time, _ = train_and_benchmark(non_persistent_loader)
456
persistent_time, persistent_loss = train_and_benchmark(persistent_loader)
457
print(f"  Without persistent_workers: {non_persistent_time:.4f}s")
458
print(f"  With persistent_workers:    {persistent_time:.4f}s")
459
print(
460
    f"  Speedup vs baseline: {baseline_time / persistent_time:.2f}x | vs previous: {prev_time / persistent_time:.2f}x"
461
)
462
prev_time = persistent_time
463

464
######################################################################
465
# Overlapping H2D Transfer with GPU Compute
466
# ---------------------------------------------------
467
#
468
# For maximum throughput, you can overlap Host-to-Device (H2D) data
469
# transfers with GPU computation. This ensures the GPU is never idle
470
# waiting for data.
471
#
472
# The idea is to prefetch the next batch to GPU while the current batch
473
# is being processed.
474
#
475
# .. note::
476
#    The DataPrefetcher shows its greatest benefit when H2D transfer
477
#    time overlaps meaningfully with GPU compute. If data loading is
478
#    already fast, the stream synchronization overhead may exceed the benefit.
479

480

481
class DataPrefetcher:
482
    """Prefetches data to GPU while previous batch is being processed."""
483

484
    def __init__(self, loader, device):
485
        self.loader = iter(loader)
486
        self.device = device
487
        self.stream = torch.cuda.Stream() if torch.cuda.is_available() else None
488
        self.next_data = None
489
        self.next_labels = None
490
        self.preload()
491

492
    def preload(self):
493
        try:
494
            self.next_data, self.next_labels = next(self.loader)
495
        except StopIteration:
496
            self.next_data = None
497
            self.next_labels = None
498
            return
499

500
        if self.stream is not None:
501
            with torch.cuda.stream(self.stream):
502
                self.next_data = self.next_data.to(self.device, non_blocking=True)
503
                self.next_labels = self.next_labels.to(self.device, non_blocking=True)
504

505
    def __iter__(self):
506
        return self
507

508
    def __next__(self):
509
        if self.stream is not None:
510
            torch.cuda.current_stream().wait_stream(self.stream)
511

512
        data = self.next_data
513
        labels = self.next_labels
514

515
        if data is None:
516
            raise StopIteration
517

518
        # Ensure tensors are ready
519
        if self.stream is not None:
520
            data.record_stream(torch.cuda.current_stream())
521
            labels.record_stream(torch.cuda.current_stream())
522

523
        self.preload()
524
        return data, labels
525

526

527
# Integrate prefetcher into the training loop.
528
if torch.cuda.is_available():
529
    print("\n+ DataPrefetcher (overlapping H2D transfer):")
530
    prefetch_time, prefetch_loss = train_and_benchmark(
531
        persistent_loader, prefetch_device=device
532
    )
533
    print(f"  Time: {prefetch_time:.4f}s | Loss: {prefetch_loss:.4f}")
534
    print(
535
        f"  Speedup vs baseline: {baseline_time / prefetch_time:.2f}x | vs previous: {prev_time / prefetch_time:.2f}x"
536
    )
537
    prev_time = prefetch_time
538
else:
539
    print("\n+ DataPrefetcher: skipped (CUDA not available)")
540
    prefetch_time = persistent_time
541

542
######################################################################
543
# Dataset-Level Optimization: ``__getitems__``
544
# --------------------------------------------
545
#
546
# Beyond tuning DataLoader parameters, you can optimize the dataset
547
# itself. PyTorch's DataLoader supports a batched fetching protocol via
548
# ``__getitems__``: if your dataset defines this method, the fetcher
549
# calls it once with a list of indices instead of calling ``__getitem__``
550
# repeatedly for each sample.
551
#
552
# **How it works:**
553
#
554
# - The default fetcher does: ``[dataset[idx] for idx in batch_indices]``
555
# - With ``__getitems__``: ``dataset.__getitems__(batch_indices)``
556
#
557
# **When this helps:**
558
#
559
# - When per-sample overhead is significant (e.g., opening connections,
560
#   parsing headers, acquiring locks)
561
# - When data can be fetched in bulk more efficiently (e.g., one SQL query
562
#   for N rows instead of N queries, or vectorized tensor generation)
563
# - When the transform has a fixed setup cost that can be amortized
564
#   across the batch
565
#
566
# **Expected signature:**
567
#
568
# .. code-block:: python
569
#
570
#     def __getitems__(self, indices: list[int]) -> list:
571
#         # Fetch all items at once and return as a list
572
#         ...
573
#
574
# Our ``SyntheticDatasetBatched`` implements ``__getitems__`` to generate
575
# the entire batch in one vectorized call (with a single amortized delay)
576
# rather than N individual calls each with their own delay.
577
# Let's add this to our cumulative configuration:
578

579
benchmark_dataset_batched = SyntheticDatasetBatched(
580
    size=512, feature_dim=224, transform_delay=0.005
581
)
582

583
batched_loader = DataLoader(
584
    benchmark_dataset_batched,
585
    batch_size=32,
586
    shuffle=True,
587
    num_workers=4,
588
    prefetch_factor=2,
589
    pin_memory=torch.cuda.is_available(),
590
    persistent_workers=True,
591
)
592

593
print("\n+ __getitems__ (batched dataset fetching):")
594
batched_time, batched_loss = train_and_benchmark(batched_loader)
595
print(f"  Time: {batched_time:.4f}s | Loss: {batched_loss:.4f}")
596
print(
597
    f"  Speedup vs baseline: {baseline_time / batched_time:.2f}x | vs previous: {prev_time / batched_time:.2f}x"
598
)
599
prev_time = batched_time
600

601
######################################################################
602
# ``in_order`` parameter
603
# --------------------------
604
#
605
# By default (``in_order=True``), the DataLoader returns batches in
606
# the same order as the dataset indices. This requires caching batches
607
# that arrive out of order from workers.
608
#
609
# **When to consider ``in_order=False``:**
610
#
611
# - When you don't need deterministic ordering (e.g., not checkpointing)
612
# - When you observe training spikes due to batch caching
613
# - When maximizing throughput is more important than reproducibility
614
#
615
# .. note::
616
#    ``in_order=False`` might not increase average throughput, but it
617
#    can reduce variance and eliminate occasional slow batches caused
618
#    by head-of-line blocking when one worker is slower than others.
619

620
######################################################################
621
# Snapshot Frequency (``snapshot_every_n_steps``)
622
# -----------------------------------------------
623
#
624
# When using torchdata's StatefulDataLoader (for checkpointing), the
625
# ``snapshot_every_n_steps`` parameter controls how often the
626
# DataLoader state is saved.
627
#
628
# **Trade-offs:**
629
#
630
# - **Higher frequency (smaller n):** More overhead, but less data loss
631
#   on job failure
632
# - **Lower frequency (larger n):** Less overhead, but more replayed
633
#   samples on recovery
634
#
635
# Choose based on your fault tolerance requirements and the cost of
636
# reprocessing data.
637

638
######################################################################
639
# Shared Memory and ``set_sharing_strategy``
640
# ------------------------------------------
641
#
642
# When using multiprocessing with ``num_workers > 0``, PyTorch needs to
643
# transfer tensors between worker processes and the main process. The
644
# sharing strategy determines how this is done.
645
#
646
# **Available Strategies:**
647
#
648
# PyTorch provides two sharing strategies via
649
# ``torch.multiprocessing.set_sharing_strategy()``:
650
#
651
# 1. **file_descriptor** (default on most systems)
652
#
653
#    - Uses file descriptors to share memory
654
#    - Limited by system's open file descriptor limit (``ulimit -n``)
655
#    - More efficient for small tensors
656
#
657
# 2. **file_system**
658
#
659
#    - Uses shared memory files in ``/dev/shm``
660
#    - Not limited by file descriptor count
661
#    - Better for large numbers of tensors
662
#    - Low transform costs
663

664
######################################################################
665
# **How to Change the Strategy:**
666
#
667
# .. code-block:: python
668
#
669
#     import torch.multiprocessing as mp
670
#
671
#     # Switch to file_system strategy
672
#     # Must be called before creating any DataLoader workers
673
#     mp.set_sharing_strategy('file_system')
674
#
675
# **Choosing the Right Strategy:**
676
#
677
# +-------------------+---------------------------+---------------------------+
678
# | Scenario          | Recommended Strategy      | Reason                    |
679
# +===================+===========================+===========================+
680
# | Many small tensors| file_descriptor (default) | Lower overhead per tensor |
681
# +-------------------+---------------------------+---------------------------+
682
# | Few large tensors | file_system               | Avoids fd limits          |
683
# +-------------------+---------------------------+---------------------------+
684
# | High num_workers  | file_system               | Avoids fd exhaustion      |
685
# +-------------------+---------------------------+---------------------------+
686
#
687
# .. warning::
688
#    ``set_sharing_strategy()`` must be called **before** creating any
689
#    DataLoader with ``num_workers > 0``. Changing it afterward has no
690
#    effect on existing workers.
691

692
######################################################################
693
# Handling Insufficient Shared Memory (``/dev/shm``)
694
# --------------------------------------------------
695
#
696
# When using ``num_workers > 0``, PyTorch uses shared memory (``/dev/shm``)
697
# to efficiently pass data between worker processes and the main process.
698
# If you encounter errors like:
699
#
700
# .. code-block:: text
701
#
702
#     RuntimeError: unable to open shared memory object </torch_xxx>
703
#     ERROR: Unexpected bus error encountered in worker
704
#
705
# This typically means you've exhausted the shared memory allocation.
706
#
707
# **Solutions:**
708
#
709
# **1. Increase /dev/shm size (if you can)**
710
#
711
# **2. Reduce memory pressure from DataLoader:**
712
#
713
# .. code-block:: python
714
#
715
#     # Reduce number of workers
716
#     DataLoader(dataset, num_workers=2)  # Instead of 8+
717
#
718
#     # Reduce prefetch factor
719
#     DataLoader(dataset, num_workers=4, prefetch_factor=1)  # Instead of 2
720
#
721
#     # Use smaller batch sizes
722
#     DataLoader(dataset, batch_size=16)  # Smaller batches = less shm
723
#
724
# **3. Switch sharing strategy:**
725
#
726
# .. code-block:: python
727
#
728
#     import torch.multiprocessing as mp
729
#     mp.set_sharing_strategy('file_system')
730
#
731
# **4. Clean up leaked shared memory:**
732
#
733
# .. code-block:: bash
734
#
735
#     # List shared memory segments
736
#     ls -la /dev/shm/
737
#
738
#     # Remove orphaned PyTorch segments (be careful!)
739
#     rm /dev/shm/torch_*
740
#
741
# .. note::
742
#    Shared memory leaks can occur if worker processes crash without
743
#    proper cleanup.
744
#
745

746
######################################################################
747
# Final Summary
748
# -------------
749
#
750
# Here's the cumulative effect of each optimization we applied to
751
# our training loop. Each row includes all optimizations from previous
752
# rows:
753
#
754
# .. rst-class:: summary-table
755
#
756
# .. list-table::
757
#    :header-rows: 1
758
#    :widths: 55 20 20
759
#
760
#    * - Configuration
761
#      - vs Baseline
762
#      - vs Previous
763
#    * - Baseline (num_workers=0, no pinning)
764
#      - 1.00x
765
#      - —
766
#    * - \+ num_workers=4, prefetch_factor=2
767
#      - ~2.7x
768
#      - ~2.7x
769
#    * - \+ pin_memory=True
770
#      - ~2.8x
771
#      - ~1.0x
772
#    * - \+ persistent_workers=True
773
#      - ~3.7x
774
#      - ~1.3x
775
#    * - \+ DataPrefetcher (H2D overlap)
776
#      - ~3.6x
777
#      - ~1.0x
778
#    * - \+ __getitems__ (batched fetching)
779
#      - ~10x
780
#      - ~2.9x
781
#
782
# .. note::
783
#    These results are based on our benchmark dataset.
784
#    Actual speedups will vary depending on your specific
785
#    workload, hardware, dataset size, and transform complexity.
786

787
######################################################################
788
# Summary and Best Practices
789
# --------------------------
790
#
791
# 1. **Start with moderate batch sizes** (32-128) and scale up if memory
792
#    allows.
793
#
794
# 2. **Use ``num_workers > 0``** when transforms are expensive. Start with
795
#    2-4 workers and increase based on memory capacity. Higher is not always better.
796
#
797
# 3. **Enable ``pin_memory=True``** when using an accelerator.
798
#
799
# 4. **Use ``persistent_workers=True``** to avoid worker restart overhead
800
#    between epochs.
801
#
802
# 5. **Profile your pipeline** with to identify CPU bottlenecks during
803
#    dataset access, transformations, etc.
804
#
805
# 6. **Implement data prefetching** for GPU workloads to overlap data
806
#    transfer with computation.
807
#
808
# 7. **Use ``file_system`` sharing strategy** when hitting file descriptor limits.
809
#
810

811
######################################################################
812
# Conclusion
813
# ----------
814
#
815
# In this tutorial, we learned how to progressively optimize a PyTorch
816
# data loading pipeline — from a naive single-process baseline to a
817
# fully optimized configuration using multiprocessing workers, pinned
818
# memory, persistent workers, CUDA stream-based prefetching, and batched
819
# dataset fetching with ``__getitems__``. Each optimization targets a
820
# different bottleneck, and together they can yield an order-of-magnitude
821
# improvement in throughput. These should be considered best practices
822
# and performance is dependent on the specific workload.
823

824
######################################################################
825
# Additional Resources
826
# --------------------
827
#
828
# - `PyTorch DataLoader documentation <https://pytorch.org/docs/stable/data.html>`_
829
# - `Pin Memory and Non-blocking Transfer Tutorial <https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html>`_
830
# - `PyTorch Performance Tuning Guide <https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html>`_
831

832
Product

Resources

Company