Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
pytorch
GitHub Repository: pytorch/tutorials
Path: blob/main/intermediate_source/intermediate_data_loading_tutorial.py
6463 views
1
# -*- coding: utf-8 -*-
2
"""
3
Data Loading Optimization in PyTorch
4
==============================================
5
6
**Authors**: `Divyansh Khanna <https://github.com/divyanshk>`_, `Ramanish Singh <https://github.com/ramanishsingh>`_
7
8
.. grid:: 2
9
10
.. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
11
:class-card: card-prerequisites
12
13
* How to optimize DataLoader configuration for maximum throughput
14
* Best practices for ``batch_size``, ``num_workers``, and ``pin_memory``
15
* Advanced techniques for overlapping data transfers with GPU compute
16
* Configuring shared memory strategies and handling ``/dev/shm`` issues
17
18
.. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
19
:class-card: card-prerequisites
20
21
* PyTorch v2.0+
22
* Basic understanding of PyTorch DataLoader
23
* (Optional) A CUDA-capable GPU for GPU-specific optimizations
24
25
Introduction
26
------------
27
28
Data loading is often a critical bottleneck in deep learning pipelines. While
29
GPUs can process batches extremely quickly, inefficient data loading can leave
30
expensive hardware idle, waiting for the next batch of data. This tutorial
31
covers best practices and some techniques for optimizing your data loading configuration to
32
maximize training throughput.
33
34
We'll explore the key parameters of PyTorch's DataLoader and provide practical
35
guidance on tuning them for your specific workload. Rather than showing each
36
optimization in isolation, we'll build up from a baseline training loop and
37
progressively apply optimizations, measuring the cumulative speedup at each
38
step.
39
"""
40
41
import time
42
43
import torch
44
import torch.nn as nn
45
from torch.utils.data import DataLoader, Dataset
46
47
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
48
print(f"Using device: {device}")
49
50
# Set a fixed seed for reproducibility
51
torch.manual_seed(42)
52
53
######################################################################
54
# Creating a Sample Dataset
55
# -------------------------
56
#
57
# First, let's create a simple dataset that simulates expensive
58
# transformations. This will help us demonstrate the impact of
59
# various DataLoader configurations.
60
61
62
class SyntheticDataset(Dataset):
63
"""A synthetic dataset that simulates expensive data transformations."""
64
65
def __init__(self, size=10000, feature_dim=224, transform_delay=0.001):
66
self.size = size
67
self.feature_dim = feature_dim
68
self.transform_delay = transform_delay
69
70
def __len__(self):
71
return self.size
72
73
def __getitem__(self, idx):
74
# Generate data lazily to avoid pre-allocating large tensors
75
data = torch.randn(3, self.feature_dim, self.feature_dim)
76
label = torch.randint(0, 10, (1,)).item()
77
if self.transform_delay > 0:
78
time.sleep(self.transform_delay)
79
return data, label
80
81
82
class SyntheticDatasetBatched(Dataset):
83
"""Same as SyntheticDataset but with __getitems__ for batched fetching."""
84
85
def __init__(self, size=10000, feature_dim=224, transform_delay=0.001):
86
self.size = size
87
self.feature_dim = feature_dim
88
self.transform_delay = transform_delay
89
90
def __len__(self):
91
return self.size
92
93
def __getitem__(self, idx):
94
data = torch.randn(3, self.feature_dim, self.feature_dim)
95
label = torch.randint(0, 10, (1,)).item()
96
if self.transform_delay > 0:
97
time.sleep(self.transform_delay)
98
return data, label
99
100
def __getitems__(self, indices):
101
"""Fetch multiple items at once — enables vectorized generation.
102
103
Instead of N individual __getitem__ calls (each with its own
104
overhead), this generates the entire batch in one shot using
105
vectorized tensor operations.
106
"""
107
n = len(indices)
108
# Vectorized generation: one call instead of N individual ones
109
data = torch.randn(n, 3, self.feature_dim, self.feature_dim)
110
labels = torch.randint(0, 10, (n,))
111
# Simulate batch-level I/O: one sleep for the whole batch,
112
# not one per sample (e.g., one DB query for N rows)
113
if self.transform_delay > 0:
114
time.sleep(self.transform_delay)
115
return [(data[i], labels[i].item()) for i in range(n)]
116
117
118
######################################################################
119
# Shared Training Infrastructure
120
# ------------------------------
121
#
122
# To measure the real-world impact of each optimization, we define a
123
# reusable training loop that accepts a DataLoader and returns timing
124
# and loss. This avoids duplicating the training loop for every
125
# benchmark.
126
#
127
# We use a **small dataset (500 samples)** with a **high transform
128
# delay (5ms)** to ensure the pipeline remains data-bound throughout.
129
# The small dataset means short epochs (16 batches each), so we run
130
# many epochs — making persistent_workers' benefit visible across
131
# epoch boundaries.
132
133
benchmark_dataset = SyntheticDataset(size=512, feature_dim=224, transform_delay=0.005)
134
135
136
class SmallTransformerModel(nn.Module):
137
138
def __init__(self):
139
super().__init__()
140
self.features = nn.Sequential(
141
nn.Conv2d(3, 32, kernel_size=7, stride=4, padding=3),
142
nn.ReLU(),
143
nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
144
nn.ReLU(),
145
nn.AdaptiveAvgPool2d((7, 7)),
146
)
147
encoder_layer = nn.TransformerEncoderLayer(
148
d_model=64, nhead=4, dim_feedforward=128, batch_first=True
149
)
150
self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
151
self.classifier = nn.Linear(64, 10)
152
153
def forward(self, x):
154
x = self.features(x) # (B, 64, 7, 7)
155
B, C, H, W = x.shape
156
x = x.view(B, C, H * W).permute(0, 2, 1) # (B, 49, 64)
157
x = self.transformer(x) # (B, 49, 64)
158
x = x.mean(dim=1) # (B, 64)
159
return self.classifier(x)
160
161
162
def create_model():
163
"""Create a conv+transformer model for benchmarking."""
164
return SmallTransformerModel().to(device)
165
166
167
def train_and_benchmark(loader, max_batches=160, epochs=10, prefetch_device=None):
168
"""Train a model over multiple epochs and return elapsed time and average loss.
169
170
Running multiple epochs (10) with a small dataset ensures many epoch
171
boundaries, making persistent_workers' startup savings visible.
172
173
Args:
174
loader: A DataLoader to iterate over.
175
max_batches: Maximum total number of batches to process across all epochs.
176
epochs: Number of epochs to iterate (re-iterates the loader each epoch).
177
prefetch_device: If set, wraps the loader in a DataPrefetcher each epoch
178
for overlapping H2D transfers. Data arrives already on device.
179
180
Returns:
181
Tuple of (elapsed_seconds, average_loss).
182
"""
183
model = create_model()
184
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
185
criterion = nn.CrossEntropyLoss()
186
187
start_time = time.perf_counter()
188
total_loss = 0.0
189
num_batches = 0
190
191
for epoch in range(epochs):
192
if prefetch_device is not None:
193
data_iter = DataPrefetcher(loader, prefetch_device)
194
else:
195
data_iter = loader
196
197
for data, labels in data_iter:
198
if prefetch_device is None:
199
data = data.to(device, non_blocking=True)
200
labels = labels.to(device, non_blocking=True)
201
202
output = model(data)
203
loss = criterion(output, labels)
204
205
optimizer.zero_grad()
206
loss.backward()
207
optimizer.step()
208
209
total_loss += loss.item()
210
num_batches += 1
211
212
if num_batches >= max_batches:
213
break
214
if num_batches >= max_batches:
215
break
216
217
if torch.cuda.is_available():
218
torch.cuda.synchronize()
219
elapsed = time.perf_counter() - start_time
220
221
return elapsed, total_loss / num_batches
222
223
224
######################################################################
225
# Baseline Training Loop
226
# ----------------------
227
#
228
# Our starting point: a simple DataLoader with no multiprocessing,
229
# no pinned memory, and default settings. This establishes the
230
# performance floor we'll improve upon.
231
232
baseline_loader = DataLoader(
233
benchmark_dataset,
234
batch_size=32,
235
shuffle=True,
236
num_workers=0,
237
pin_memory=False,
238
)
239
240
print("\n=== Progressive Optimization Results ===")
241
print("\nBaseline (num_workers=0, pin_memory=False):")
242
baseline_time, baseline_loss = train_and_benchmark(baseline_loader)
243
print(f" Time: {baseline_time:.4f}s | Loss: {baseline_loss:.4f}")
244
prev_time = baseline_time
245
246
######################################################################
247
# Batch Size Optimization
248
# -----------------------
249
#
250
# The ``batch_size`` parameter controls how many samples are processed
251
# together. Choosing the right batch size involves balancing several factors:
252
#
253
# **Memory Considerations:**
254
#
255
# - Larger batch sizes require more GPU memory for storing inputs,
256
# activations, and gradients
257
# - Out-of-memory (OOM) errors are common with large batch sizes
258
# - Moderate batch sizes (32-128) often provide the best balance
259
#
260
# **Training Dynamics:**
261
#
262
# - Batch size changes affect the effective learning rate, typically requiring tuning
263
# - Larger batches provide more stable gradient estimates but may
264
# generalize differently
265
#
266
# .. note::
267
# When changing batch size, remember to tune your optimizer parameters,
268
# especially the learning rate schedule, unless you're doing inference
269
#
270
# Since batch size is model-dependent (not a "just add it" optimization),
271
# we benchmark it in isolation rather than folding it into the progressive
272
# optimization chain.
273
274
# Example: Testing different batch sizes
275
batch_dataset = SyntheticDataset(size=1000, transform_delay=0)
276
277
278
def benchmark_batch_size(batch_size, num_batches=10):
279
"""Benchmark data loading with a specific batch size."""
280
loader = DataLoader(batch_dataset, batch_size=batch_size, shuffle=True)
281
start = time.perf_counter()
282
for i, (data, labels) in enumerate(loader):
283
if i >= num_batches:
284
break
285
data = data.to(device, non_blocking=True)
286
_ = data.sum()
287
if torch.cuda.is_available():
288
torch.cuda.synchronize()
289
elapsed = time.perf_counter() - start
290
return elapsed
291
292
293
# Benchmark different batch sizes
294
print("\nBatch size comparison (isolated benchmark):")
295
for bs in [16, 32, 64, 128]:
296
elapsed = benchmark_batch_size(bs)
297
print(f" Batch size {bs:3d}: {elapsed:.4f}s for 10 batches")
298
299
######################################################################
300
# Number of Workers (``num_workers``)
301
# -----------------------------------
302
#
303
# The ``num_workers`` parameter controls how many subprocesses are used
304
# for data loading. This is crucial for parallelizing expensive data
305
# transformations.
306
#
307
# **How it works:**
308
#
309
# - Each worker maintains a queue of batches (controlled by ``prefetch_factor``)
310
# - Workers prepare batches in parallel and transfer them to the main process
311
# - If ``in_order=True`` (default), batches are returned in order
312
#
313
# **When to increase ``num_workers``:**
314
#
315
# - When transforms are computationally expensive (augmentations, decoding)
316
# - When data is loaded from slow storage (network drives, HDD)
317
# - When you observe GPU idle time due to data loading
318
#
319
# **When ``num_workers=0`` might be faster:**
320
#
321
# - When transforms are cheap (simple tensor operations)
322
# - When data is already in memory
323
# - The overhead of inter-process communication (IPC) exceeds the
324
# parallelization benefits
325
#
326
# .. note::
327
# Finding the optimal ``num_workers`` requires tuning: increase workers
328
# until throughput plateaus. Too many workers waste CPU
329
# memory (each worker holds its own copy of the dataset object and
330
# prefetched batches) and can cause ``/dev/shm`` exhaustion. A good
331
# starting point is 2-4 workers per GPU; profile with different values
332
# to find the sweet spot for your workload.
333
#
334
# Let's add ``num_workers=4`` and ``prefetch_factor=2`` to our training
335
# loop and measure the improvement:
336
337
workers_loader = DataLoader(
338
benchmark_dataset,
339
batch_size=32,
340
shuffle=True,
341
num_workers=4,
342
prefetch_factor=2,
343
pin_memory=False,
344
)
345
346
print("\n+ num_workers=4, prefetch_factor=2:")
347
workers_time, workers_loss = train_and_benchmark(workers_loader)
348
print(f" Time: {workers_time:.4f}s | Loss: {workers_loss:.4f}")
349
print(
350
f" Speedup vs baseline: {baseline_time / workers_time:.2f}x | vs previous: {prev_time / workers_time:.2f}x"
351
)
352
prev_time = workers_time
353
354
######################################################################
355
# Understanding ``pin_memory``
356
# ----------------------------
357
#
358
# The ``pin_memory`` parameter enables faster CPU-to-GPU data transfers
359
# by using page-locked (pinned) memory.
360
#
361
# **How pinned memory works:**
362
#
363
# - Pinned memory cannot be swapped to disk by the OS
364
# - This enables faster DMA (Direct Memory Access) transfers to GPU
365
# - The CPU-to-GPU transfer can happen asynchronously
366
#
367
# **Best practices:**
368
#
369
# 1. Use ``pin_memory=True`` in the DataLoader (recommended approach)
370
# 2. Combine with ``non_blocking=True`` when moving data to GPU
371
# 3. Avoid manually calling ``tensor.pin_memory()`` followed by
372
# ``.to(device, non_blocking=True)`` - this is slower because
373
# ``pin_memory()`` is blocking
374
#
375
# **The safe pattern:**
376
#
377
# .. code-block:: python
378
#
379
# # Recommended: Let DataLoader handle pinning
380
# loader = DataLoader(dataset, pin_memory=True)
381
# for data, labels in loader:
382
# data = data.to(device, non_blocking=True)
383
# labels = labels.to(device, non_blocking=True)
384
#
385
# .. seealso::
386
# For more details, see the
387
# `pin_memory tutorial <https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html>`_
388
#
389
# Let's add ``pin_memory=True`` to our configuration:
390
391
pinmem_loader = DataLoader(
392
benchmark_dataset,
393
batch_size=32,
394
shuffle=True,
395
num_workers=4,
396
prefetch_factor=2,
397
pin_memory=torch.cuda.is_available(),
398
)
399
400
if torch.cuda.is_available():
401
print("\n+ pin_memory=True:")
402
pinmem_time, pinmem_loss = train_and_benchmark(pinmem_loader)
403
print(f" Time: {pinmem_time:.4f}s | Loss: {pinmem_loss:.4f}")
404
print(
405
f" Speedup vs baseline: {baseline_time / pinmem_time:.2f}x | vs previous: {prev_time / pinmem_time:.2f}x"
406
)
407
print(
408
" (pin_memory benefit is modest here because CPU transform time dominates H2D transfer)"
409
)
410
prev_time = pinmem_time
411
else:
412
print("\n+ pin_memory: skipped (CUDA not available)")
413
pinmem_time = workers_time
414
415
######################################################################
416
# Persistent Workers
417
# ------------------
418
#
419
# By default, worker processes are shut down and restarted between
420
# epochs. This incurs startup overhead (importing modules, forking
421
# processes, re-initializing datasets) on every epoch boundary.
422
#
423
# Setting ``persistent_workers=True`` keeps the workers alive across
424
# epochs, eliminating this repeated startup cost.
425
#
426
# **When it helps most:**
427
#
428
# - Training for many epochs on smaller datasets
429
# - When dataset ``__init__`` is expensive (e.g., loading metadata)
430
# - When combined with high ``num_workers``
431
#
432
# Let's compare with and without persistent workers over multiple epochs:
433
434
non_persistent_loader = DataLoader(
435
benchmark_dataset,
436
batch_size=32,
437
shuffle=True,
438
num_workers=4,
439
prefetch_factor=2,
440
pin_memory=torch.cuda.is_available(),
441
persistent_workers=False,
442
)
443
444
persistent_loader = DataLoader(
445
benchmark_dataset,
446
batch_size=32,
447
shuffle=True,
448
num_workers=4,
449
prefetch_factor=2,
450
pin_memory=torch.cuda.is_available(),
451
persistent_workers=True,
452
)
453
454
print("\n+ persistent_workers=True (10 epochs):")
455
non_persistent_time, _ = train_and_benchmark(non_persistent_loader)
456
persistent_time, persistent_loss = train_and_benchmark(persistent_loader)
457
print(f" Without persistent_workers: {non_persistent_time:.4f}s")
458
print(f" With persistent_workers: {persistent_time:.4f}s")
459
print(
460
f" Speedup vs baseline: {baseline_time / persistent_time:.2f}x | vs previous: {prev_time / persistent_time:.2f}x"
461
)
462
prev_time = persistent_time
463
464
######################################################################
465
# Overlapping H2D Transfer with GPU Compute
466
# ---------------------------------------------------
467
#
468
# For maximum throughput, you can overlap Host-to-Device (H2D) data
469
# transfers with GPU computation. This ensures the GPU is never idle
470
# waiting for data.
471
#
472
# The idea is to prefetch the next batch to GPU while the current batch
473
# is being processed.
474
#
475
# .. note::
476
# The DataPrefetcher shows its greatest benefit when H2D transfer
477
# time overlaps meaningfully with GPU compute. If data loading is
478
# already fast, the stream synchronization overhead may exceed the benefit.
479
480
481
class DataPrefetcher:
482
"""Prefetches data to GPU while previous batch is being processed."""
483
484
def __init__(self, loader, device):
485
self.loader = iter(loader)
486
self.device = device
487
self.stream = torch.cuda.Stream() if torch.cuda.is_available() else None
488
self.next_data = None
489
self.next_labels = None
490
self.preload()
491
492
def preload(self):
493
try:
494
self.next_data, self.next_labels = next(self.loader)
495
except StopIteration:
496
self.next_data = None
497
self.next_labels = None
498
return
499
500
if self.stream is not None:
501
with torch.cuda.stream(self.stream):
502
self.next_data = self.next_data.to(self.device, non_blocking=True)
503
self.next_labels = self.next_labels.to(self.device, non_blocking=True)
504
505
def __iter__(self):
506
return self
507
508
def __next__(self):
509
if self.stream is not None:
510
torch.cuda.current_stream().wait_stream(self.stream)
511
512
data = self.next_data
513
labels = self.next_labels
514
515
if data is None:
516
raise StopIteration
517
518
# Ensure tensors are ready
519
if self.stream is not None:
520
data.record_stream(torch.cuda.current_stream())
521
labels.record_stream(torch.cuda.current_stream())
522
523
self.preload()
524
return data, labels
525
526
527
# Integrate prefetcher into the training loop.
528
if torch.cuda.is_available():
529
print("\n+ DataPrefetcher (overlapping H2D transfer):")
530
prefetch_time, prefetch_loss = train_and_benchmark(
531
persistent_loader, prefetch_device=device
532
)
533
print(f" Time: {prefetch_time:.4f}s | Loss: {prefetch_loss:.4f}")
534
print(
535
f" Speedup vs baseline: {baseline_time / prefetch_time:.2f}x | vs previous: {prev_time / prefetch_time:.2f}x"
536
)
537
prev_time = prefetch_time
538
else:
539
print("\n+ DataPrefetcher: skipped (CUDA not available)")
540
prefetch_time = persistent_time
541
542
######################################################################
543
# Dataset-Level Optimization: ``__getitems__``
544
# --------------------------------------------
545
#
546
# Beyond tuning DataLoader parameters, you can optimize the dataset
547
# itself. PyTorch's DataLoader supports a batched fetching protocol via
548
# ``__getitems__``: if your dataset defines this method, the fetcher
549
# calls it once with a list of indices instead of calling ``__getitem__``
550
# repeatedly for each sample.
551
#
552
# **How it works:**
553
#
554
# - The default fetcher does: ``[dataset[idx] for idx in batch_indices]``
555
# - With ``__getitems__``: ``dataset.__getitems__(batch_indices)``
556
#
557
# **When this helps:**
558
#
559
# - When per-sample overhead is significant (e.g., opening connections,
560
# parsing headers, acquiring locks)
561
# - When data can be fetched in bulk more efficiently (e.g., one SQL query
562
# for N rows instead of N queries, or vectorized tensor generation)
563
# - When the transform has a fixed setup cost that can be amortized
564
# across the batch
565
#
566
# **Expected signature:**
567
#
568
# .. code-block:: python
569
#
570
# def __getitems__(self, indices: list[int]) -> list:
571
# # Fetch all items at once and return as a list
572
# ...
573
#
574
# Our ``SyntheticDatasetBatched`` implements ``__getitems__`` to generate
575
# the entire batch in one vectorized call (with a single amortized delay)
576
# rather than N individual calls each with their own delay.
577
# Let's add this to our cumulative configuration:
578
579
benchmark_dataset_batched = SyntheticDatasetBatched(
580
size=512, feature_dim=224, transform_delay=0.005
581
)
582
583
batched_loader = DataLoader(
584
benchmark_dataset_batched,
585
batch_size=32,
586
shuffle=True,
587
num_workers=4,
588
prefetch_factor=2,
589
pin_memory=torch.cuda.is_available(),
590
persistent_workers=True,
591
)
592
593
print("\n+ __getitems__ (batched dataset fetching):")
594
batched_time, batched_loss = train_and_benchmark(batched_loader)
595
print(f" Time: {batched_time:.4f}s | Loss: {batched_loss:.4f}")
596
print(
597
f" Speedup vs baseline: {baseline_time / batched_time:.2f}x | vs previous: {prev_time / batched_time:.2f}x"
598
)
599
prev_time = batched_time
600
601
######################################################################
602
# ``in_order`` parameter
603
# --------------------------
604
#
605
# By default (``in_order=True``), the DataLoader returns batches in
606
# the same order as the dataset indices. This requires caching batches
607
# that arrive out of order from workers.
608
#
609
# **When to consider ``in_order=False``:**
610
#
611
# - When you don't need deterministic ordering (e.g., not checkpointing)
612
# - When you observe training spikes due to batch caching
613
# - When maximizing throughput is more important than reproducibility
614
#
615
# .. note::
616
# ``in_order=False`` might not increase average throughput, but it
617
# can reduce variance and eliminate occasional slow batches caused
618
# by head-of-line blocking when one worker is slower than others.
619
620
######################################################################
621
# Snapshot Frequency (``snapshot_every_n_steps``)
622
# -----------------------------------------------
623
#
624
# When using torchdata's StatefulDataLoader (for checkpointing), the
625
# ``snapshot_every_n_steps`` parameter controls how often the
626
# DataLoader state is saved.
627
#
628
# **Trade-offs:**
629
#
630
# - **Higher frequency (smaller n):** More overhead, but less data loss
631
# on job failure
632
# - **Lower frequency (larger n):** Less overhead, but more replayed
633
# samples on recovery
634
#
635
# Choose based on your fault tolerance requirements and the cost of
636
# reprocessing data.
637
638
######################################################################
639
# Shared Memory and ``set_sharing_strategy``
640
# ------------------------------------------
641
#
642
# When using multiprocessing with ``num_workers > 0``, PyTorch needs to
643
# transfer tensors between worker processes and the main process. The
644
# sharing strategy determines how this is done.
645
#
646
# **Available Strategies:**
647
#
648
# PyTorch provides two sharing strategies via
649
# ``torch.multiprocessing.set_sharing_strategy()``:
650
#
651
# 1. **file_descriptor** (default on most systems)
652
#
653
# - Uses file descriptors to share memory
654
# - Limited by system's open file descriptor limit (``ulimit -n``)
655
# - More efficient for small tensors
656
#
657
# 2. **file_system**
658
#
659
# - Uses shared memory files in ``/dev/shm``
660
# - Not limited by file descriptor count
661
# - Better for large numbers of tensors
662
# - Low transform costs
663
664
######################################################################
665
# **How to Change the Strategy:**
666
#
667
# .. code-block:: python
668
#
669
# import torch.multiprocessing as mp
670
#
671
# # Switch to file_system strategy
672
# # Must be called before creating any DataLoader workers
673
# mp.set_sharing_strategy('file_system')
674
#
675
# **Choosing the Right Strategy:**
676
#
677
# +-------------------+---------------------------+---------------------------+
678
# | Scenario | Recommended Strategy | Reason |
679
# +===================+===========================+===========================+
680
# | Many small tensors| file_descriptor (default) | Lower overhead per tensor |
681
# +-------------------+---------------------------+---------------------------+
682
# | Few large tensors | file_system | Avoids fd limits |
683
# +-------------------+---------------------------+---------------------------+
684
# | High num_workers | file_system | Avoids fd exhaustion |
685
# +-------------------+---------------------------+---------------------------+
686
#
687
# .. warning::
688
# ``set_sharing_strategy()`` must be called **before** creating any
689
# DataLoader with ``num_workers > 0``. Changing it afterward has no
690
# effect on existing workers.
691
692
######################################################################
693
# Handling Insufficient Shared Memory (``/dev/shm``)
694
# --------------------------------------------------
695
#
696
# When using ``num_workers > 0``, PyTorch uses shared memory (``/dev/shm``)
697
# to efficiently pass data between worker processes and the main process.
698
# If you encounter errors like:
699
#
700
# .. code-block:: text
701
#
702
# RuntimeError: unable to open shared memory object </torch_xxx>
703
# ERROR: Unexpected bus error encountered in worker
704
#
705
# This typically means you've exhausted the shared memory allocation.
706
#
707
# **Solutions:**
708
#
709
# **1. Increase /dev/shm size (if you can)**
710
#
711
# **2. Reduce memory pressure from DataLoader:**
712
#
713
# .. code-block:: python
714
#
715
# # Reduce number of workers
716
# DataLoader(dataset, num_workers=2) # Instead of 8+
717
#
718
# # Reduce prefetch factor
719
# DataLoader(dataset, num_workers=4, prefetch_factor=1) # Instead of 2
720
#
721
# # Use smaller batch sizes
722
# DataLoader(dataset, batch_size=16) # Smaller batches = less shm
723
#
724
# **3. Switch sharing strategy:**
725
#
726
# .. code-block:: python
727
#
728
# import torch.multiprocessing as mp
729
# mp.set_sharing_strategy('file_system')
730
#
731
# **4. Clean up leaked shared memory:**
732
#
733
# .. code-block:: bash
734
#
735
# # List shared memory segments
736
# ls -la /dev/shm/
737
#
738
# # Remove orphaned PyTorch segments (be careful!)
739
# rm /dev/shm/torch_*
740
#
741
# .. note::
742
# Shared memory leaks can occur if worker processes crash without
743
# proper cleanup.
744
#
745
746
######################################################################
747
# Final Summary
748
# -------------
749
#
750
# Here's the cumulative effect of each optimization we applied to
751
# our training loop. Each row includes all optimizations from previous
752
# rows:
753
#
754
# .. rst-class:: summary-table
755
#
756
# .. list-table::
757
# :header-rows: 1
758
# :widths: 55 20 20
759
#
760
# * - Configuration
761
# - vs Baseline
762
# - vs Previous
763
# * - Baseline (num_workers=0, no pinning)
764
# - 1.00x
765
# - —
766
# * - \+ num_workers=4, prefetch_factor=2
767
# - ~2.7x
768
# - ~2.7x
769
# * - \+ pin_memory=True
770
# - ~2.8x
771
# - ~1.0x
772
# * - \+ persistent_workers=True
773
# - ~3.7x
774
# - ~1.3x
775
# * - \+ DataPrefetcher (H2D overlap)
776
# - ~3.6x
777
# - ~1.0x
778
# * - \+ __getitems__ (batched fetching)
779
# - ~10x
780
# - ~2.9x
781
#
782
# .. note::
783
# These results are based on our benchmark dataset.
784
# Actual speedups will vary depending on your specific
785
# workload, hardware, dataset size, and transform complexity.
786
787
######################################################################
788
# Summary and Best Practices
789
# --------------------------
790
#
791
# 1. **Start with moderate batch sizes** (32-128) and scale up if memory
792
# allows.
793
#
794
# 2. **Use ``num_workers > 0``** when transforms are expensive. Start with
795
# 2-4 workers and increase based on memory capacity. Higher is not always better.
796
#
797
# 3. **Enable ``pin_memory=True``** when using an accelerator.
798
#
799
# 4. **Use ``persistent_workers=True``** to avoid worker restart overhead
800
# between epochs.
801
#
802
# 5. **Profile your pipeline** with to identify CPU bottlenecks during
803
# dataset access, transformations, etc.
804
#
805
# 6. **Implement data prefetching** for GPU workloads to overlap data
806
# transfer with computation.
807
#
808
# 7. **Use ``file_system`` sharing strategy** when hitting file descriptor limits.
809
#
810
811
######################################################################
812
# Conclusion
813
# ----------
814
#
815
# In this tutorial, we learned how to progressively optimize a PyTorch
816
# data loading pipeline — from a naive single-process baseline to a
817
# fully optimized configuration using multiprocessing workers, pinned
818
# memory, persistent workers, CUDA stream-based prefetching, and batched
819
# dataset fetching with ``__getitems__``. Each optimization targets a
820
# different bottleneck, and together they can yield an order-of-magnitude
821
# improvement in throughput. These should be considered best practices
822
# and performance is dependent on the specific workload.
823
824
######################################################################
825
# Additional Resources
826
# --------------------
827
#
828
# - `PyTorch DataLoader documentation <https://pytorch.org/docs/stable/data.html>`_
829
# - `Pin Memory and Non-blocking Transfer Tutorial <https://docs.pytorch.org/tutorials/intermediate/pinmem_nonblock.html>`_
830
# - `PyTorch Performance Tuning Guide <https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html>`_
831
832