# -*- coding: utf-8 -*-
"""
Single-Machine Model Parallel Best Practices
============================================
**Author**: `Shen Li <https://mrshenli.github.io/>`_

Model parallelism is widely used in distributed training.
Previous posts have explained how to use
`DataParallel <https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html>`_
to train a neural network on multiple GPUs; that feature replicates the
same model on all GPUs, where each GPU consumes a different partition of the
input data. Although it can significantly accelerate the training process, it
does not work for use cases where the model is too large to fit into a
single GPU. This post shows how to solve that problem with **model parallel**,
which, in contrast to ``DataParallel``, splits a single model across different
GPUs rather than replicating the entire model on each GPU (to be concrete, say
a model ``m`` contains 10 layers: with ``DataParallel``, each GPU holds a
replica of all 10 of these layers, whereas with model parallel on two GPUs,
each GPU could host 5 layers).

The high-level idea of model parallel is to place different sub-networks of a
model onto different devices, and implement the ``forward`` method accordingly
to move intermediate outputs across devices. As only part of the model operates
on any individual device, a set of devices can collectively serve a larger
model. In this post, we will not try to construct huge models and squeeze them
into a limited number of GPUs. Instead, this post focuses on showing the idea
of model parallel. It is up to the reader to apply the ideas to real-world
applications.

.. note::

   For distributed model parallel training where a model spans multiple
   servers, please refer to
   `Getting Started With Distributed RPC Framework <rpc_tutorial.html>`__
   for examples and details.

Basic Usage
-----------
"""

######################################################################
# Let us start with a toy model that contains two linear layers. To run this
# model on two GPUs, simply put each linear layer on a different GPU, and move
# inputs and intermediate outputs to match the layer devices accordingly.

import torch
import torch.nn as nn
import torch.optim as optim


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = torch.nn.Linear(10, 10).to('cuda:0')
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5).to('cuda:1')

    def forward(self, x):
        x = self.relu(self.net1(x.to('cuda:0')))
        return self.net2(x.to('cuda:1'))
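######################################################################
# Running ``ToyModel`` as written assumes at least two visible CUDA devices.
# As a minimal sanity check (not part of the original tutorial), you could
# guard the examples below with an explicit assertion:

assert torch.cuda.device_count() >= 2, \
    "This tutorial requires at least two GPUs"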
######################################################################
# Note that the above ``ToyModel`` looks very similar to how one would
# implement it on a single GPU, except for the four ``to(device)`` calls that
# place the linear layers and tensors on the proper devices. That is the only
# part of the model that requires changes. ``backward()`` and ``torch.optim``
# will automatically take care of gradients as if the model were on one GPU.
# You only need to make sure that the labels are on the same device as the
# outputs when calling the loss function.


model = ToyModel()
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

optimizer.zero_grad()
outputs = model(torch.randn(20, 10))
labels = torch.randn(20, 5).to('cuda:1')
loss_fn(outputs, labels).backward()
optimizer.step()

######################################################################
# Apply Model Parallel to Existing Modules
# ----------------------------------------
#
# It is also possible to run an existing single-GPU module on multiple GPUs
# with just a few lines of changes. The code below shows how to decompose
# ``torchvision.models.resnet50()`` across two GPUs. The idea is to inherit
# from the existing ``ResNet`` module, and split the layers across two GPUs
# during construction. Then, override the ``forward`` method to stitch the
# two sub-networks together by moving the intermediate outputs accordingly.


from torchvision.models.resnet import ResNet, Bottleneck

num_classes = 1000


class ModelParallelResNet50(ResNet):
    def __init__(self, *args, **kwargs):
        super(ModelParallelResNet50, self).__init__(
            Bottleneck, [3, 4, 6, 3], num_classes=num_classes, *args, **kwargs)

        self.seq1 = nn.Sequential(
            self.conv1,
            self.bn1,
            self.relu,
            self.maxpool,

            self.layer1,
            self.layer2
        ).to('cuda:0')

        self.seq2 = nn.Sequential(
            self.layer3,
            self.layer4,
            self.avgpool,
        ).to('cuda:1')

        self.fc.to('cuda:1')

    def forward(self, x):
        x = self.seq2(self.seq1(x).to('cuda:1'))
        return self.fc(x.view(x.size(0), -1))
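######################################################################
# As a quick sanity check (a sketch, not part of the original tutorial; the
# names ``mp_model`` and ``out`` are illustrative), the model-parallel ResNet
# is used like any other module: inputs go to ``cuda:0`` and the outputs come
# back on ``cuda:1``.

mp_model = ModelParallelResNet50()
out = mp_model(torch.randn(2, 3, 128, 128).to('cuda:0'))
print(out.shape, out.device)  # expected: torch.Size([2, 1000]) cuda:1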
######################################################################
# The above implementation solves the problem for cases where the model is
# too large to fit into a single GPU. However, you might have already noticed
# that it will be slower than running on a single GPU if your model fits.
# This is because, at any point in time, only one of the two GPUs is working,
# while the other one sits idle. Performance deteriorates further because the
# intermediate outputs need to be copied from ``cuda:0`` to ``cuda:1``
# between ``layer2`` and ``layer3``.
#
# Let us run an experiment to get a more quantitative view of the execution
# time. In this experiment, we train ``ModelParallelResNet50`` and the
# existing ``torchvision.models.resnet50()`` by running random inputs and
# labels through them. After this training, the models will not produce any
# useful predictions, but we can get a reasonable understanding of the
# execution times.


import torchvision.models as models

num_batches = 3
batch_size = 120
image_w = 128
image_h = 128


def train(model):
    model.train(True)
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.001)

    one_hot_indices = torch.LongTensor(batch_size) \
                           .random_(0, num_classes) \
                           .view(batch_size, 1)

    for _ in range(num_batches):
        # generate random inputs and labels
        inputs = torch.randn(batch_size, 3, image_w, image_h)
        labels = torch.zeros(batch_size, num_classes) \
                      .scatter_(1, one_hot_indices, 1)

        # run forward pass
        optimizer.zero_grad()
        outputs = model(inputs.to('cuda:0'))

        # run backward pass
        labels = labels.to(outputs.device)
        loss_fn(outputs, labels).backward()
        optimizer.step()


######################################################################
# The ``train(model)`` function above uses ``nn.MSELoss`` as the loss
# function and ``optim.SGD`` as the optimizer. It mimics training on
# ``128 x 128`` images organized into 3 batches of 120 images each. Then, we
# use ``timeit`` to run ``train(model)`` 10 times and plot the execution
# times with standard deviations.


import matplotlib.pyplot as plt
plt.switch_backend('Agg')
import numpy as np
import timeit

num_repeat = 10

stmt = "train(model)"

setup = "model = ModelParallelResNet50()"
mp_run_times = timeit.repeat(
    stmt, setup, number=1, repeat=num_repeat, globals=globals())
mp_mean, mp_std = np.mean(mp_run_times), np.std(mp_run_times)

setup = "import torchvision.models as models;" + \
        "model = models.resnet50(num_classes=num_classes).to('cuda:0')"
rn_run_times = timeit.repeat(
    stmt, setup, number=1, repeat=num_repeat, globals=globals())
rn_mean, rn_std = np.mean(rn_run_times), np.std(rn_run_times)


def plot(means, stds, labels, fig_name):
    fig, ax = plt.subplots()
    ax.bar(np.arange(len(means)), means, yerr=stds,
           align='center', alpha=0.5, ecolor='red', capsize=10, width=0.6)
    ax.set_ylabel('ResNet50 Execution Time (Second)')
    ax.set_xticks(np.arange(len(means)))
    ax.set_xticklabels(labels)
    ax.yaxis.grid(True)
    plt.tight_layout()
    plt.savefig(fig_name)
    plt.close(fig)


plot([mp_mean, rn_mean],
     [mp_std, rn_std],
     ['Model Parallel', 'Single GPU'],
     'mp_vs_rn.png')
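######################################################################
# A caveat worth noting (an aside, not in the original tutorial): CUDA
# operations are launched asynchronously, so timings of this kind measure
# when work was enqueued, not necessarily when it finished. A stricter
# variant would synchronize both devices around the timed region; the helper
# below (the name ``timed_train`` is illustrative) sketches that idea.

def timed_train(model):
    # wait for any pending work before starting the clock
    torch.cuda.synchronize('cuda:0')
    torch.cuda.synchronize('cuda:1')
    tic = timeit.default_timer()
    train(model)
    # wait for all queued kernels to finish before stopping the clock
    torch.cuda.synchronize('cuda:0')
    torch.cuda.synchronize('cuda:1')
    return timeit.default_timer() - tic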
######################################################################
#
# .. figure:: /_static/img/model-parallel-images/mp_vs_rn.png
#    :alt:
#
# The result shows that the execution time of the model parallel
# implementation is ``4.02/3.75-1=7%`` longer than the existing single-GPU
# implementation. So we can conclude there is roughly 7% overhead in copying
# tensors back and forth across the GPUs. There is room for improvement, as
# we know one of the two GPUs is sitting idle throughout the execution. One
# option is to further divide each batch into a pipeline of splits, such that
# when one split reaches the second sub-network, the following split can be
# fed into the first sub-network. In this way, two consecutive splits can run
# concurrently on two GPUs.

######################################################################
# Speed Up by Pipelining Inputs
# -----------------------------
#
# In the following experiments, we further divide each 120-image batch into
# 20-image splits. As PyTorch launches CUDA operations asynchronously, the
# implementation does not need to spawn multiple threads to achieve
# concurrency.


class PipelineParallelResNet50(ModelParallelResNet50):
    def __init__(self, split_size=20, *args, **kwargs):
        super(PipelineParallelResNet50, self).__init__(*args, **kwargs)
        self.split_size = split_size

    def forward(self, x):
        splits = iter(x.split(self.split_size, dim=0))
        s_next = next(splits)
        s_prev = self.seq1(s_next).to('cuda:1')
        ret = []

        for s_next in splits:
            # A. ``s_prev`` runs on ``cuda:1``
            s_prev = self.seq2(s_prev)
            ret.append(self.fc(s_prev.view(s_prev.size(0), -1)))

            # B. ``s_next`` runs on ``cuda:0``, which can run concurrently with A
            s_prev = self.seq1(s_next).to('cuda:1')

        s_prev = self.seq2(s_prev)
        ret.append(self.fc(s_prev.view(s_prev.size(0), -1)))

        return torch.cat(ret)
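######################################################################
# A small illustration (not in the original tutorial): ``Tensor.split`` does
# the batch slicing used in ``forward`` above, and the last chunk is simply
# smaller when the batch size is not divisible by ``split_size``.

chunks = torch.randn(120, 3).split(50, dim=0)
print([c.size(0) for c in chunks])  # [50, 50, 20]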
setup = "model = PipelineParallelResNet50()"
pp_run_times = timeit.repeat(
    stmt, setup, number=1, repeat=num_repeat, globals=globals())
pp_mean, pp_std = np.mean(pp_run_times), np.std(pp_run_times)

plot([mp_mean, rn_mean, pp_mean],
     [mp_std, rn_std, pp_std],
     ['Model Parallel', 'Single GPU', 'Pipelining Model Parallel'],
     'mp_vs_rn_vs_pp.png')

######################################################################
# Please note, device-to-device tensor copy operations are synchronized on
# the current streams of the source and the destination devices. If you
# create multiple streams, you have to make sure that copy operations are
# properly synchronized. Writing the source tensor, or reading/writing the
# destination tensor, before the copy operation finishes can lead to
# undefined behavior. The above implementation only uses default streams on
# both source and destination devices, hence it is not necessary to enforce
# additional synchronizations.
#
# .. figure:: /_static/img/model-parallel-images/mp_vs_rn_vs_pp.png
#    :alt:
#
# The experiment result shows that pipelining inputs to the model parallel
# ResNet50 speeds up the training process by roughly ``3.75/2.51-1=49%``. It
# is still quite far from the ideal 100% speedup. As we have introduced a new
# parameter ``split_size`` in our pipeline parallel implementation, it is
# unclear how this new parameter affects the overall training time.
# Intuitively, using a small ``split_size`` leads to many tiny CUDA kernel
# launches, while using a large ``split_size`` results in relatively long
# idle times during the first and last splits. Neither is optimal. There
# might be an optimal ``split_size`` configuration for this specific
# experiment. Let us try to find it by running experiments with several
# different ``split_size`` values.


means = []
stds = []
split_sizes = [1, 3, 5, 8, 10, 12, 20, 40, 60]

for split_size in split_sizes:
    setup = "model = PipelineParallelResNet50(split_size=%d)" % split_size
    pp_run_times = timeit.repeat(
        stmt, setup, number=1, repeat=num_repeat, globals=globals())
    means.append(np.mean(pp_run_times))
    stds.append(np.std(pp_run_times))

fig, ax = plt.subplots()
ax.plot(split_sizes, means)
ax.errorbar(split_sizes, means, yerr=stds, ecolor='red', fmt='ro')
ax.set_ylabel('ResNet50 Execution Time (Second)')
ax.set_xlabel('Pipeline Split Size')
ax.set_xticks(split_sizes)
ax.yaxis.grid(True)
plt.tight_layout()
plt.savefig("split_size_tradeoff.png")
plt.close(fig)

######################################################################
#
# .. figure:: /_static/img/model-parallel-images/split_size_tradeoff.png
#    :alt:
#
# The result shows that setting ``split_size`` to 12 achieves the fastest
# training speed, which leads to a ``3.75/2.43-1=54%`` speedup. There are
# still opportunities to further accelerate the training process. For
# example, all operations on ``cuda:0`` are placed on its default stream.
# This means that computation on the next split cannot overlap with the copy
# operation of the previous split. However, as the previous and next splits
# are different tensors, there is no problem overlapping one's computation
# with the other's copy. The implementation would need to use multiple
# streams on both GPUs, and different sub-network structures require
# different stream management strategies. As no general multi-stream solution
# works for all model parallel use cases, we will not discuss it in this
# tutorial.
#
# **Note:**
#
# This post shows several performance measurements. You might see different
# numbers when running the same code on your own machine, because the result
# depends on the underlying hardware and software. To get the best
# performance for your environment, a proper approach is to first generate
# the curve to figure out the best split size, and then use that split size
# to pipeline inputs.
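######################################################################
# As a sketch of the note above (not part of the original tutorial; the name
# ``best_split_size`` is illustrative), picking the empirically best split
# size from the sweep and using it for subsequent training could look like:

best_split_size = split_sizes[int(np.argmin(means))]
model = PipelineParallelResNet50(split_size=best_split_size)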