# -*- coding: utf-8 -*-
"""
(beta) Channels Last Memory Format in PyTorch
*******************************************************
**Author**: `Vitaly Fedyunin <https://github.com/VitalyFedyunin>`_

What is Channels Last
---------------------

Channels last memory format is an alternative way of ordering NCHW tensors in memory while preserving the dimension ordering. Channels last tensors are ordered in such a way that channels become the densest dimension (aka storing images pixel-per-pixel).

For example, classic (contiguous) storage of an NCHW tensor (in our case two 4x4 images with 3 color channels) looks like this:

.. figure:: /_static/img/classic_memory_format.png
   :alt: classic_memory_format

Channels last memory format orders data differently:

.. figure:: /_static/img/channels_last_memory_format.png
   :alt: channels_last_memory_format

PyTorch supports memory formats (and provides backward compatibility with existing models, including eager, JIT, and TorchScript) by utilizing the existing strides structure.
For example, a 10x3x16x16 batch in channels last format will have strides equal to (768, 1, 48, 3).
"""

######################################################################
# Channels last memory format is implemented for 4D NCHW Tensors only.
#

######################################################################
# Memory Format API
# -----------------------
#
# Here is how to convert tensors between contiguous and channels
# last memory formats.

######################################################################
# Classic PyTorch contiguous tensor
import torch

N, C, H, W = 10, 3, 32, 32
x = torch.empty(N, C, H, W)
print(x.stride())  # Outputs: (3072, 1024, 32, 1)

######################################################################
# Conversion operator
x = x.to(memory_format=torch.channels_last)
print(x.shape)  # Outputs: (10, 3, 32, 32) as dimension order is preserved
print(x.stride())  # Outputs: (3072, 1, 96, 3)

######################################################################
# Back to contiguous
x = x.to(memory_format=torch.contiguous_format)
print(x.stride())  # Outputs: (3072, 1024, 32, 1)

######################################################################
# Alternative option
x = x.contiguous(memory_format=torch.channels_last)
print(x.stride())  # Outputs: (3072, 1, 96, 3)

######################################################################
# Format checks
print(x.is_contiguous(memory_format=torch.channels_last))  # Outputs: True

######################################################################
# There are minor differences between the two APIs ``to`` and
# ``contiguous``. We suggest sticking with ``to`` when explicitly
# converting the memory format of a tensor.
#
# For general cases the two APIs behave the same. However, in the
# special cases of a 4D tensor with size ``NCHW`` where either ``C==1``
# or ``H==1 && W==1``, only ``to`` would generate a proper stride to
# represent the channels last memory format.
#
# This is because in either of the two cases above, the memory format
# of a tensor is ambiguous, i.e. a contiguous tensor with size
# ``N1HW`` is both ``contiguous`` and channels last in memory storage.
# Therefore, such tensors are already considered ``is_contiguous``
# for the given memory format, and hence the ``contiguous`` call becomes a
# no-op and does not update the stride. On the contrary, ``to``
# would restride the tensor with a meaningful stride on dimensions whose
# sizes are 1 in order to properly represent the intended memory
# format.
special_x = torch.empty(4, 1, 4, 4)
print(special_x.is_contiguous(memory_format=torch.channels_last))  # Outputs: True
print(special_x.is_contiguous(memory_format=torch.contiguous_format))  # Outputs: True
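######################################################################
# To make the difference concrete, here is a short sketch; the stride
# values in the comments are what we would expect from the rules above
# on an ambiguous ``N1HW`` tensor: ``contiguous`` is a no-op, while
# ``to`` restrides into an unambiguous channels last layout.
ambiguous_x = torch.empty(4, 1, 4, 4)
print(ambiguous_x.contiguous(memory_format=torch.channels_last).stride())  # (16, 16, 4, 1) - no-op, stride unchanged
print(ambiguous_x.to(memory_format=torch.channels_last).stride())  # (16, 1, 4, 1) - restrided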
######################################################################
# The same thing applies to the explicit permutation API ``permute``.
# In the special cases where ambiguity could occur, ``permute`` is not
# guaranteed to produce a stride that properly carries the intended
# memory format. We suggest using ``to`` with an explicit memory format
# to avoid unintended behavior.
#
# As a side note, in the extreme case where the three non-batch
# dimensions are all equal to ``1`` (``C==1 && H==1 && W==1``), the
# current implementation cannot mark a tensor as channels last memory
# format.

######################################################################
# Create as channels last
x = torch.empty(N, C, H, W, memory_format=torch.channels_last)
print(x.stride())  # Outputs: (3072, 1, 96, 3)

######################################################################
# ``clone`` preserves memory format
y = x.clone()
print(y.stride())  # Outputs: (3072, 1, 96, 3)

######################################################################
# ``to``, ``cuda``, ``float`` ... preserve memory format
if torch.cuda.is_available():
    y = x.cuda()
    print(y.stride())  # Outputs: (3072, 1, 96, 3)

######################################################################
# ``empty_like``, ``*_like`` operators preserve memory format
y = torch.empty_like(x)
print(y.stride())  # Outputs: (3072, 1, 96, 3)

######################################################################
# Pointwise operators preserve memory format
z = x + y
print(z.stride())  # Outputs: (3072, 1, 96, 3)

######################################################################
# ``Conv``, ``Batchnorm`` modules using the ``cudnn`` backend support
# channels last (this only works for cuDNN >= 7.6). Convolution modules,
# unlike binary pointwise operators, have channels last as the dominant
# memory format. If all inputs are in contiguous memory format, the
# operator produces output in contiguous memory format. Otherwise, the
# output will be in channels last memory format.

if torch.backends.cudnn.is_available() and torch.backends.cudnn.version() >= 7603:
    model = torch.nn.Conv2d(8, 4, 3).cuda().half()
    model = model.to(memory_format=torch.channels_last)  # Module parameters need to be channels last

    input = torch.randint(1, 10, (2, 8, 4, 4), dtype=torch.float32, requires_grad=True)
    input = input.to(device="cuda", memory_format=torch.channels_last, dtype=torch.float16)

    out = model(input)
    print(out.is_contiguous(memory_format=torch.channels_last))  # Outputs: True

######################################################################
# When an input tensor reaches an operator without channels last
# support, a permutation is automatically applied in the kernel to
# restore contiguity of the input tensor. This introduces overhead and
# stops the channels last memory format propagation. Nevertheless, it
# guarantees correct output.
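######################################################################
# A quick way to see whether a particular operator propagated the format
# is to check its output directly. A minimal sketch (whether this prints
# ``True`` depends on the operator and the PyTorch build):
cl_input = torch.randn(2, 8, 4, 4).to(memory_format=torch.channels_last)
pooled = torch.nn.functional.max_pool2d(cl_input, 2)
print(pooled.is_contiguous(memory_format=torch.channels_last))  # True if the operator propagated channels last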
######################################################################
# Performance Gains
# --------------------------------------------------------------------
# Channels last memory format optimizations are available on both GPU and CPU.
# On GPU, the most significant performance gains are observed on NVIDIA
# hardware with Tensor Cores support, running at reduced precision
# (``torch.float16``).
# We were able to achieve over 22% performance gains with channels last
# compared to the contiguous format, in both cases utilizing
# AMP (Automatic Mixed Precision) training scripts.
# Our scripts use the AMP implementation supplied by NVIDIA:
# https://github.com/NVIDIA/apex.
#
# ``python main_amp.py -a resnet50 --b 200 --workers 16 --opt-level O2 ./data``

# opt_level = O2
# keep_batchnorm_fp32 = None <class 'NoneType'>
# loss_scale = None <class 'NoneType'>
# CUDNN VERSION: 7603
# => creating model 'resnet50'
# Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.
# Defaults for this optimization level are:
# enabled                : True
# opt_level              : O2
# cast_model_type        : torch.float16
# patch_torch_functions  : False
# keep_batchnorm_fp32    : True
# master_weights         : True
# loss_scale             : dynamic
# Processing user overrides (additional kwargs that are not None)...
# After processing overrides, optimization options are:
# enabled                : True
# opt_level              : O2
# cast_model_type        : torch.float16
# patch_torch_functions  : False
# keep_batchnorm_fp32    : True
# master_weights         : True
# loss_scale             : dynamic
# Epoch: [0][10/125] Time 0.866 (0.866) Speed 230.949 (230.949) Loss 0.6735125184 (0.6735) Prec@1 61.000 (61.000) Prec@5 100.000 (100.000)
# Epoch: [0][20/125] Time 0.259 (0.562) Speed 773.481 (355.693) Loss 0.6968704462 (0.6852) Prec@1 55.000 (58.000) Prec@5 100.000 (100.000)
# Epoch: [0][30/125] Time 0.258 (0.461) Speed 775.089 (433.965) Loss 0.7877287269 (0.7194) Prec@1 51.500 (55.833) Prec@5 100.000 (100.000)
# Epoch: [0][40/125] Time 0.259 (0.410) Speed 771.710 (487.281) Loss 0.8285319805 (0.7467) Prec@1 48.500 (54.000) Prec@5 100.000 (100.000)
# Epoch: [0][50/125] Time 0.260 (0.380) Speed 770.090 (525.908) Loss 0.7370464802 (0.7447) Prec@1 56.500 (54.500) Prec@5 100.000 (100.000)
# Epoch: [0][60/125] Time 0.258 (0.360) Speed 775.623 (555.728) Loss 0.7592862844 (0.7472) Prec@1 51.000 (53.917) Prec@5 100.000 (100.000)
# Epoch: [0][70/125] Time 0.258 (0.345) Speed 774.746 (579.115) Loss 1.9698858261 (0.9218) Prec@1 49.500 (53.286) Prec@5 100.000 (100.000)
# Epoch: [0][80/125] Time 0.260 (0.335) Speed 770.324 (597.659) Loss 2.2505953312 (1.0879) Prec@1 50.500 (52.938) Prec@5 100.000 (100.000)

######################################################################
# Passing ``--channels-last true`` allows running a model in the channels
# last format, with an observed 22% performance gain.
#
# ``python main_amp.py -a resnet50 --b 200 --workers 16 --opt-level O2 --channels-last true ./data``

# opt_level = O2
# keep_batchnorm_fp32 = None <class 'NoneType'>
# loss_scale = None <class 'NoneType'>
#
# CUDNN VERSION: 7603
#
# => creating model 'resnet50'
# Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.
#
# Defaults for this optimization level are:
# enabled                : True
# opt_level              : O2
# cast_model_type        : torch.float16
# patch_torch_functions  : False
# keep_batchnorm_fp32    : True
# master_weights         : True
# loss_scale             : dynamic
# Processing user overrides (additional kwargs that are not None)...
# After processing overrides, optimization options are:
# enabled                : True
# opt_level              : O2
# cast_model_type        : torch.float16
# patch_torch_functions  : False
# keep_batchnorm_fp32    : True
# master_weights         : True
# loss_scale             : dynamic
#
# Epoch: [0][10/125] Time 0.767 (0.767) Speed 260.785 (260.785) Loss 0.7579724789 (0.7580) Prec@1 53.500 (53.500) Prec@5 100.000 (100.000)
# Epoch: [0][20/125] Time 0.198 (0.482) Speed 1012.135 (414.716) Loss 0.7007197738 (0.7293) Prec@1 49.000 (51.250) Prec@5 100.000 (100.000)
# Epoch: [0][30/125] Time 0.198 (0.387) Speed 1010.977 (516.198) Loss 0.7113101482 (0.7233) Prec@1 55.500 (52.667) Prec@5 100.000 (100.000)
# Epoch: [0][40/125] Time 0.197 (0.340) Speed 1013.023 (588.333) Loss 0.8943189979 (0.7661) Prec@1 54.000 (53.000) Prec@5 100.000 (100.000)
# Epoch: [0][50/125] Time 0.198 (0.312) Speed 1010.541 (641.977) Loss 1.7113249302 (0.9551) Prec@1 51.000 (52.600) Prec@5 100.000 (100.000)
# Epoch: [0][60/125] Time 0.198 (0.293) Speed 1011.163 (683.574) Loss 5.8537774086 (1.7716) Prec@1 50.500 (52.250) Prec@5 100.000 (100.000)
# Epoch: [0][70/125] Time 0.198 (0.279) Speed 1011.453 (716.767) Loss 5.7595844269 (2.3413) Prec@1 46.500 (51.429) Prec@5 100.000 (100.000)
# Epoch: [0][80/125] Time 0.198 (0.269) Speed 1011.827 (743.883) Loss 2.8196096420 (2.4011) Prec@1 47.500 (50.938) Prec@5 100.000 (100.000)

######################################################################
# The following models fully support channels last and show 8%-35%
# performance gains on Volta devices:
# ``alexnet``, ``mnasnet0_5``, ``mnasnet0_75``, ``mnasnet1_0``, ``mnasnet1_3``, ``mobilenet_v2``, ``resnet101``, ``resnet152``, ``resnet18``, ``resnet34``, ``resnet50``, ``resnext50_32x4d``, ``shufflenet_v2_x0_5``, ``shufflenet_v2_x1_0``, ``shufflenet_v2_x1_5``, ``shufflenet_v2_x2_0``, ``squeezenet1_0``, ``squeezenet1_1``, ``vgg11``, ``vgg11_bn``, ``vgg13``, ``vgg13_bn``, ``vgg16``, ``vgg16_bn``, ``vgg19``, ``vgg19_bn``, ``wide_resnet101_2``, ``wide_resnet50_2``
#

######################################################################
# The following models fully support channels last and show 26%-76%
# performance gains on Intel(R) Xeon(R) Ice Lake (or newer) CPUs:
# ``alexnet``, ``densenet121``, ``densenet161``, ``densenet169``, ``googlenet``, ``inception_v3``, ``mnasnet0_5``, ``mnasnet1_0``, ``resnet101``, ``resnet152``, ``resnet18``, ``resnet34``, ``resnet50``, ``resnext101_32x8d``, ``resnext50_32x4d``, ``shufflenet_v2_x0_5``, ``shufflenet_v2_x1_0``, ``squeezenet1_0``, ``squeezenet1_1``, ``vgg11``, ``vgg11_bn``, ``vgg13``, ``vgg13_bn``, ``vgg16``, ``vgg16_bn``, ``vgg19``, ``vgg19_bn``, ``wide_resnet101_2``, ``wide_resnet50_2``
#

######################################################################
# Converting existing models
# --------------------------
#
# Channels last support is not limited to existing models; any
# model can be converted to channels last, and the format will propagate
# through the graph as soon as the input (or a certain weight) is
# formatted correctly.
#

# Needs to be done once, after model initialization (or load)
model = model.to(memory_format=torch.channels_last)  # Replace with your model

# Needs to be done for every input
input = input.to(memory_format=torch.channels_last)  # Replace with your input
output = model(input)
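######################################################################
# As a more complete sketch of the recipe above (illustrative only:
# torchvision's ``resnet50`` stands in for "your model", random tensors
# stand in for real data, and native ``torch.cuda.amp`` is used here
# instead of the apex scripts shown earlier):
if torch.cuda.is_available():
    import torchvision

    net = torchvision.models.resnet50().cuda().to(memory_format=torch.channels_last)
    optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
    scaler = torch.cuda.amp.GradScaler()

    images = torch.randn(8, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)
    labels = torch.randint(0, 1000, (8,), device="cuda")

    with torch.cuda.amp.autocast():  # mixed precision, as in the AMP runs above
        loss = torch.nn.functional.cross_entropy(net(images), labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()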
######################################################################
# However, not all operators are fully converted to support channels
# last (usually returning contiguous output instead). In the example
# posted above, layers that do not support channels last will stop the
# memory format propagation. In spite of that, since we have converted
# the model to the channels last format, each convolution layer, which
# has its 4-dimensional weight in channels last memory format, will
# restore the channels last memory format and benefit from faster
# kernels.
#
# But operators that do not support channels last do introduce
# overhead via permutation. Optionally, you can investigate and identify
# the operators in your model that do not support channels last, if you
# want to improve the performance of the converted model.
#
# That means you need to verify the list of used operators
# against the supported operators list at https://github.com/pytorch/pytorch/wiki/Operators-with-Channels-Last-support,
# or introduce memory format checks into eager execution mode and run your model.
#
# After running the code below, operators will raise an exception if the output of the
# operator doesn't match the memory format of the input.
#
#
def contains_cl(args):
    # Returns True if any (possibly nested) tensor argument is in the
    # channels last memory format (and not plain contiguous).
    for t in args:
        if isinstance(t, torch.Tensor):
            if t.is_contiguous(memory_format=torch.channels_last) and not t.is_contiguous():
                return True
        elif isinstance(t, list) or isinstance(t, tuple):
            if contains_cl(list(t)):
                return True
    return False


def print_inputs(args, indent=""):
    # Pretty-prints the strides, shapes, devices, and dtypes of
    # (possibly nested) tensor arguments for debugging.
    for t in args:
        if isinstance(t, torch.Tensor):
            print(indent, t.stride(), t.shape, t.device, t.dtype)
        elif isinstance(t, list) or isinstance(t, tuple):
            print(indent, type(t))
            print_inputs(list(t), indent=indent + "    ")
        else:
            print(indent, t)


def check_wrapper(fn):
    name = fn.__name__

    def check_cl(*args, **kwargs):
        # Remember whether any input was channels last, call the original
        # operator, then verify that the 4D output kept the property.
        was_cl = contains_cl(args)
        try:
            result = fn(*args, **kwargs)
        except Exception as e:
            print("`{}` inputs are:".format(name))
            print_inputs(args)
            print("-------------------")
            raise e
        failed = False
        if was_cl:
            if isinstance(result, torch.Tensor):
                if result.dim() == 4 and not result.is_contiguous(memory_format=torch.channels_last):
                    print(
                        "`{}` got channels_last input, but output is not channels_last:".format(name),
                        result.shape,
                        result.stride(),
                        result.device,
                        result.dtype,
                    )
                    failed = True
        if failed:
            print("`{}` inputs are:".format(name))
            print_inputs(args)
            raise Exception("Operator `{}` lost channels_last property".format(name))
        return result

    return check_cl


old_attrs = dict()


def attribute(m):
    # Wraps every public callable of the given namespace with the memory
    # format check, saving the originals so they can be restored later.
    old_attrs[m] = dict()
    for i in dir(m):
        e = getattr(m, i)
        exclude_functions = ["is_cuda", "has_names", "numel", "stride", "Tensor", "is_contiguous", "__class__"]
        if i not in exclude_functions and not i.startswith("_") and "__call__" in dir(e):
            try:
                old_attrs[m][i] = e
                setattr(m, i, check_wrapper(e))
            except Exception as e:
                print(i)
                print(e)


attribute(torch.Tensor)
attribute(torch.nn.functional)
attribute(torch)
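######################################################################
# With the wrappers installed, running any model will surface offending
# operators. A minimal usage sketch (the model and input here are
# illustrative; ``check_cl`` above raises as soon as an operator drops
# the channels last property):
probe_model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).to(memory_format=torch.channels_last)
probe_input = torch.randn(2, 3, 8, 8).to(memory_format=torch.channels_last)
_ = probe_model(probe_input)  # raises if any wrapped operator loses channels last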
######################################################################
# If you found an operator that doesn't support channels last tensors
# and you want to contribute, feel free to use the following developers
# guide: https://github.com/pytorch/pytorch/wiki/Writing-memory-format-aware-operators.
#

######################################################################
# The code below is to recover the attributes of torch.

for (m, attrs) in old_attrs.items():
    for (k, v) in attrs.items():
        setattr(m, k, v)

######################################################################
# Work to do
# ----------
# There are still many things to do, such as:
#
# - Resolving ambiguity of ``N1HW`` and ``NC11`` Tensors;
# - Testing of Distributed Training support;
# - Improving operators coverage.
#
# If you have feedback and/or suggestions for improvement, please let us
# know by creating `an issue <https://github.com/pytorch/pytorch/issues>`_.