"""1Profiling your PyTorch Module2-----------------------------34**Author:** `Suraj Subramanian <https://github.com/suraj813>`_56PyTorch includes a profiler API that is useful to identify the time and7memory costs of various PyTorch operations in your code. Profiler can be8easily integrated in your code, and the results can be printed as a table9or returned in a JSON trace file.1011.. note::12Profiler supports multithreaded models. Profiler runs in the13same thread as the operation but it will also profile child operators14that might run in another thread. Concurrently-running profilers will be15scoped to their own thread to prevent mixing of results.1617.. note::18PyTorch 1.8 introduces the new API that will replace the older profiler API19in the future releases. Check the new API at `this page <https://pytorch.org/docs/master/profiler.html>`__.2021Head on over to `this22recipe <https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html>`__23for a quicker walkthrough of Profiler API usage.242526--------------27"""2829import torch30import numpy as np31from torch import nn32import torch.autograd.profiler as profiler333435######################################################################36# Performance debugging using Profiler37# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~38#39# Profiler can be useful to identify performance bottlenecks in your40# models. In this example, we build a custom module that performs two41# sub-tasks:42#43# - a linear transformation on the input, and44# - use the transformation result to get indices on a mask tensor.45#46# We wrap the code for each sub-task in separate labelled context managers using47# ``profiler.record_function("label")``. In the profiler output, the48# aggregate performance metrics of all operations in the sub-task will49# show up under its corresponding label.50#51#52# Note that using Profiler incurs some overhead, and is best used only for investigating53# code. Remember to remove it if you are benchmarking runtimes.54#5556class MyModule(nn.Module):57def __init__(self, in_features: int, out_features: int, bias: bool = True):58super(MyModule, self).__init__()59self.linear = nn.Linear(in_features, out_features, bias)6061def forward(self, input, mask):62with profiler.record_function("LINEAR PASS"):63out = self.linear(input)6465with profiler.record_function("MASK INDICES"):66threshold = out.sum(axis=1).mean().item()67hi_idx = np.argwhere(mask.cpu().numpy() > threshold)68hi_idx = torch.from_numpy(hi_idx).cuda()6970return out, hi_idx717273######################################################################74# Profile the forward pass75# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~76#77# We initialize random input and mask tensors, and the model.78#79# Before we run the profiler, we warm-up CUDA to ensure accurate80# performance benchmarking. We wrap the forward pass of our module in the81# ``profiler.profile`` context manager. The ``with_stack=True`` parameter appends the82# file and line number of the operation in the trace.83#84# .. 
######################################################################
# Profile the forward pass
# ~~~~~~~~~~~~~~~~~~~~~~~~
#
# We initialize random input and mask tensors, and the model.
#
# Before we run the profiler, we warm up CUDA to ensure accurate
# performance benchmarking. We wrap the forward pass of our module in the
# ``profiler.profile`` context manager. The ``with_stack=True`` parameter appends the
# file and line number of each operation to the trace.
#
# .. warning::
#     ``with_stack=True`` incurs an additional overhead, and is better suited for investigating code.
#     Remember to remove it if you are benchmarking performance.
#

model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.double).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)


######################################################################
# Print profiler results
# ~~~~~~~~~~~~~~~~~~~~~~
#
# Finally, we print the profiler results. ``profiler.key_averages``
# aggregates the results by operator name, and optionally by input
# shapes and/or stack trace events.
# Grouping by input shapes is useful to identify which tensor shapes
# are utilized by the model.
#
# Here, we use ``group_by_stack_n=5``, which aggregates runtimes by the
# operation and its traceback (truncated to the most recent 5 events), and
# displays the events in the order they are registered. The table can also
# be sorted by passing a ``sort_by`` argument (refer to the
# `docs <https://pytorch.org/docs/stable/autograd.html#profiler>`__ for
# valid sorting keys).
#
# .. note::
#     When running profiler in a notebook, you might see entries like
#     ``<ipython-input-18-193a910735e8>(13): forward`` instead of filenames in the
#     stack trace. These correspond to ``<notebook-cell>(line number): calling-function``.

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

-------------  ------------  ------------  ------------  ---------------------------------
Name           Self CPU %    Self CPU      Self CPU Mem  Source Location
-------------  ------------  ------------  ------------  ---------------------------------
MASK INDICES   87.88%        5.212s        -953.67 Mb    /mnt/xarfuse/.../torch/au
                                                         <ipython-input-...>(10): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/

aten::copy_    12.07%        715.848ms     0 b           <ipython-input-...>(12): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/
                                                         /mnt/xarfuse/.../IPython/

LINEAR PASS    0.01%         350.151us     -20 b         /mnt/xarfuse/.../torch/au
                                                         <ipython-input-...>(7): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/

aten::addmm    0.00%         293.342us     0 b           /mnt/xarfuse/.../torch/nn
                                                         /mnt/xarfuse/.../torch/nn
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(8): forward
                                                         /mnt/xarfuse/.../torch/nn

aten::mean     0.00%         235.095us     0 b           <ipython-input-...>(11): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/
                                                         /mnt/xarfuse/.../IPython/

-------------  ------------  ------------  ------------  ---------------------------------
Self CPU time total: 5.931s

"""
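
######################################################################
# A brief aside: ``table()`` accepts other sorting keys too. For example,
# sorting by ``self_cpu_memory_usage``, one of the documented sort keys for
# the autograd profiler, surfaces the memory-hungry operators directly; a
# minimal sketch:

print(prof.key_averages(group_by_stack_n=5).table(
    sort_by='self_cpu_memory_usage', row_limit=5))
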
######################################################################
# Improve memory performance
# ~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Note that the most expensive operations - in terms of memory and time -
# are at ``forward (10)``, representing the operations within MASK INDICES. Let's try to
# tackle the memory consumption first. We can see that the ``.to()``
# operation at line 12 consumes 953.67 Mb. This operation copies ``mask`` to the CPU.
# ``mask`` is initialized with a ``torch.double`` datatype. Can we reduce the memory footprint by casting
# it to ``torch.float`` instead?
#

model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.float).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

-----------------  ------------  ------------  ------------  --------------------------------
Name               Self CPU %    Self CPU      Self CPU Mem  Source Location
-----------------  ------------  ------------  ------------  --------------------------------
MASK INDICES       93.61%        5.006s        -476.84 Mb    /mnt/xarfuse/.../torch/au
                                                             <ipython-input-...>(10): forward
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/

aten::copy_        6.34%         338.759ms     0 b           <ipython-input-...>(12): forward
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

aten::as_strided   0.01%         281.808us     0 b           <ipython-input-...>(11): forward
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

aten::addmm        0.01%         275.721us     0 b           /mnt/xarfuse/.../torch/nn
                                                             /mnt/xarfuse/.../torch/nn
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(8): forward
                                                             /mnt/xarfuse/.../torch/nn

aten::_local       0.01%         268.650us     0 b           <ipython-input-...>(11): forward
_scalar_dense                                                /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

-----------------  ------------  ------------  ------------  --------------------------------
Self CPU time total: 5.347s

"""

######################################################################
#
# The CPU memory footprint for this operation has halved.
#
# Improve time performance
# ~~~~~~~~~~~~~~~~~~~~~~~~
#
# While the time consumed has also been reduced a bit, it's still too high.
# It turns out that copying a matrix from CUDA to CPU is pretty expensive!
# The ``aten::copy_`` operator in ``forward (12)`` copies ``mask`` to the CPU
# so that it can use the NumPy ``argwhere`` function, and ``aten::copy_`` at ``forward (13)``
# copies the array back to CUDA as a tensor. We could eliminate both of these if we used the
# ``torch`` function ``nonzero()`` here instead, as in the rewritten module below.
#
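
######################################################################
# A quick sanity check (an illustrative aside, not part of the original
# tutorial): on a boolean condition, ``torch.nonzero`` returns the same
# row-major index matrix that NumPy's ``argwhere`` produces, so the swap
# preserves the result while avoiding the device round-trip:

t = torch.rand(3, 3).cuda()
np_idx = np.argwhere(t.cpu().numpy() > 0.5)     # CPU round-trip
torch_idx = (t > 0.5).nonzero().cpu().numpy()   # computed on the GPU
assert (np_idx == torch_idx).all()
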

class MyModule(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super(MyModule, self).__init__()
        self.linear = nn.Linear(in_features, out_features, bias)

    def forward(self, input, mask):
        with profiler.record_function("LINEAR PASS"):
            out = self.linear(input)

        with profiler.record_function("MASK INDICES"):
            threshold = out.sum(axis=1).mean()
            hi_idx = (mask > threshold).nonzero(as_tuple=True)

        return out, hi_idx


model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.float).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

----------------  ------------  ------------  ------------  ---------------------------------
Name              Self CPU %    Self CPU      Self CPU Mem  Source Location
----------------  ------------  ------------  ------------  ---------------------------------
aten::gt          57.17%        129.089ms     0 b           <ipython-input-...>(12): forward
                                                            /mnt/xarfuse/.../torch/nn
                                                            <ipython-input-...>(25): <module>
                                                            /mnt/xarfuse/.../IPython/
                                                            /mnt/xarfuse/.../IPython/

aten::nonzero     37.38%        84.402ms      0 b           <ipython-input-...>(12): forward
                                                            /mnt/xarfuse/.../torch/nn
                                                            <ipython-input-...>(25): <module>
                                                            /mnt/xarfuse/.../IPython/
                                                            /mnt/xarfuse/.../IPython/

MASK INDICES      3.32%         7.491ms       -119.21 Mb    /mnt/xarfuse/.../torch/au
                                                            <ipython-input-...>(10): forward
                                                            /mnt/xarfuse/.../torch/nn
                                                            <ipython-input-...>(25): <module>
                                                            /mnt/xarfuse/.../IPython/

aten::as_strided  0.20%         441.587us     0 b           <ipython-input-...>(12): forward
                                                            /mnt/xarfuse/.../torch/nn
                                                            <ipython-input-...>(25): <module>
                                                            /mnt/xarfuse/.../IPython/
                                                            /mnt/xarfuse/.../IPython/

aten::nonzero
_numpy            0.18%         395.602us     0 b           <ipython-input-...>(12): forward
                                                            /mnt/xarfuse/.../torch/nn
                                                            <ipython-input-...>(25): <module>
                                                            /mnt/xarfuse/.../IPython/
                                                            /mnt/xarfuse/.../IPython/
----------------  ------------  ------------  ------------  ---------------------------------
Self CPU time total: 225.801ms

"""


######################################################################
# Further Reading
# ~~~~~~~~~~~~~~~
#
# We have seen how Profiler can be used to investigate time and memory bottlenecks in PyTorch models.
# Read more about Profiler here:
#
# - `Profiler Usage Recipe <https://pytorch.org/tutorials/recipes/recipes/profiler.html>`__
# - `Profiling RPC-Based Workloads <https://pytorch.org/tutorials/recipes/distributed_rpc_profiling.html>`__
# - `Profiler API Docs <https://pytorch.org/docs/stable/autograd.html?highlight=profiler#profiler>`__
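
######################################################################
# A closing sketch (an illustrative addition, not part of the original
# tutorial): the newer ``torch.profiler`` API mentioned in the note at the
# top, available from PyTorch 1.8 onwards, can profile the same forward
# pass. The aliases ``profile_v2`` and ``prof_v2`` are just local names
# chosen to avoid clashing with the imports above:

from torch.profiler import profile as profile_v2, ProfilerActivity

with profile_v2(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                profile_memory=True) as prof_v2:
    model(input, mask)
print(prof_v2.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))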