# -*- coding: utf-8 -*-

"""
(Prototype) Efficiently writing "sparse" semantics for Adagrad with MaskedTensor
================================================================================
"""

######################################################################
# Before working through this tutorial, please review the MaskedTensor
# `Overview <https://pytorch.org/tutorials/prototype/maskedtensor_overview.html>`__ and
# `Sparsity <https://pytorch.org/tutorials/prototype/maskedtensor_sparsity.html>`__ tutorials.
#
# Introduction and Motivation
# ---------------------------
# `Issue 1369 <https://github.com/pytorch/pytorch/issues/1369>`__ discussed the additional lines of code
# that were introduced while writing "sparse" semantics for Adagrad, but really,
# the code uses sparsity as a proxy for masked semantics rather than the intended use case of sparsity:
# a compression and optimization technique.
# Previously, we worked around the lack of formal masked semantics by introducing one-off semantics and operators
# while forcing users to be aware of storage details such as indices and values.
#
# Now that we have masked semantics, we are better equipped to point out when sparsity is used as a semantic extension.
# We'll also compare and contrast this with equivalent code written using MaskedTensor.
# In the end, the code snippets are repeated without additional comments to show the difference in brevity.
#
# Preparation
# -----------
#

import torch
import warnings

# Disable prototype warnings and such
warnings.filterwarnings(action='ignore', category=UserWarning)

# Some hyperparameters
eps = 1e-10
clr = 0.1

i = torch.tensor([[0, 1, 1], [2, 0, 2]])
v = torch.tensor([3, 4, 5], dtype=torch.float32)
grad = torch.sparse_coo_tensor(i, v, [2, 4])
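######################################################################
# As a small illustration (added here as a sanity check, not part of the
# original Adagrad code), here is the dense view of the sparse gradient and
# the pattern of "specified" entries that the rest of this tutorial builds on.
#

# Densify the COO gradient: only three positions carry values; the zeros
# elsewhere are simply unspecified entries.
print("grad (dense view):\n", grad.to_dense())
print("specified entries:\n", grad.to_dense() != 0)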
######################################################################
# Simpler Code with MaskedTensor
# ------------------------------
#
# Before we get too far into the weeds, let's introduce the problem a bit more concretely. We will be taking a look
# at the `Adagrad (functional) <https://github.com/pytorch/pytorch/blob/6c2f235d368b697072699e5ca9485fd97d0b9bcc/torch/optim/_functional.py#L16-L51>`__
# implementation in PyTorch, with the ultimate goal of simplifying and more faithfully representing the masked approach.
#
# For reference, this is the regular, dense code path without masked gradients or sparsity:
#
# .. code-block:: python
#
#     state_sum.addcmul_(grad, grad, value=1)
#     std = state_sum.sqrt().add_(eps)
#     param.addcdiv_(grad, std, value=-clr)
#
# The vanilla tensor implementation for sparse is:
#
# .. code-block:: python
#
#     def _make_sparse(grad, grad_indices, values):
#         size = grad.size()
#         if grad_indices.numel() == 0 or values.numel() == 0:
#             return torch.empty_like(grad)
#         return torch.sparse_coo_tensor(grad_indices, values, size)
#
#     grad = grad.coalesce()  # the update is non-linear so indices must be unique
#     grad_indices = grad._indices()
#     grad_values = grad._values()
#
#     state_sum.add_(_make_sparse(grad, grad_indices, grad_values.pow(2)))  # a different _make_sparse per layout
#     std = state_sum.sparse_mask(grad)
#     std_values = std._values().sqrt_().add_(eps)
#     param.add_(_make_sparse(grad, grad_indices, grad_values / std_values), alpha=-clr)
#
# while :class:`MaskedTensor` minimizes the code to the snippet:
#
# .. code-block:: python
#
#     state_sum2 = state_sum2 + masked_grad.pow(2).get_data()
#     std2 = masked_tensor(state_sum2.to_sparse(), mask)
#     std2 = std2.sqrt().add(eps)
#     param2 = param2.add((masked_grad / std2).get_data(), alpha=-clr)
#
# In this tutorial, we will go through each implementation line by line, but at first glance, we can notice
# (1) how much shorter the MaskedTensor implementation is, and
# (2) how it avoids conversions between dense and sparse tensors.
#

######################################################################
# Original Sparse Implementation
# ------------------------------
#
# Now, let's break down the code with some inline comments:
#

def _make_sparse(grad, grad_indices, values):
    size = grad.size()
    if grad_indices.numel() == 0 or values.numel() == 0:
        return torch.empty_like(grad)
    return torch.sparse_coo_tensor(grad_indices, values, size)

# We don't support sparse gradients
param = torch.arange(8).reshape(2, 4).float()
state_sum = torch.full_like(param, 0.5)  # initial value for state sum

grad = grad.coalesce()  # the update is non-linear so indices must be unique
grad_indices = grad._indices()
grad_values = grad._values()
# pow(2) has the same semantics for both sparse and dense memory layouts since 0^2 is zero
state_sum.add_(_make_sparse(grad, grad_indices, grad_values.pow(2)))

# We take care to make std sparse, even though state_sum clearly is not.
# This means that we're only applying the gradient to parts of the state_sum
# for which it is specified. This further drives the point home that the passed gradient is not sparse, but masked.
# We currently dodge all these concerns by using the private method `_values`.
std = state_sum.sparse_mask(grad)
std_values = std._values().sqrt_().add_(eps)

# Note here that we currently don't support div for sparse Tensors because zero / zero is not well defined,
# so we're forced to perform `grad_values / std_values` outside the sparse semantics and then convert back to a
# sparse tensor with `_make_sparse`.
# We'll later see that MaskedTensor handles these operations for us, as well as properly denoting
# undefined / undefined = undefined!
param.add_(_make_sparse(grad, grad_indices, grad_values / std_values), alpha=-clr)

######################################################################
# The third-to-last line -- ``std = state_sum.sparse_mask(grad)`` -- is where we have a very important divergence.
#
# The addition of ``eps`` should technically be applied to all values but instead is only applied to specified values.
# Here we're using sparsity as a semantic extension and to enforce a certain pattern of defined and undefined values.
# If parts of the values of the gradient are zero, they are still included if materialized, even though they
# could be compressed by other sparse storage layouts. This is theoretically quite brittle!
# That said, one could argue that ``eps`` is always very small, so it might not matter much in practice.
#
# Moreover, an implementation of ``add_`` for sparsity as a storage layout and compression scheme
# should cause densification, but we force it not to for performance.
# For this one-off case it is fine, until we want to introduce new compression schemes, such as
# `CSC <https://pytorch.org/docs/master/sparse.html#sparse-csc-docs>`__,
# `BSR <https://pytorch.org/docs/master/sparse.html#sparse-bsr-docs>`__,
# or `BSC <https://pytorch.org/docs/master/sparse.html#sparse-bsc-docs>`__.
# We would then need to introduce separate Tensor types for each and write variations for gradients compressed
# using different storage formats, which is inconvenient and neither scalable nor clean.
#
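######################################################################
# To make the divergence above concrete, here is a quick check (added for
# illustration; it is not part of the Adagrad implementation). Densifying
# ``std`` shows that ``eps`` was only added to the three specified entries,
# whereas the dense code path would also move the unspecified entries from
# ``0.5`` to ``sqrt(0.5) + eps``.
#

# std is a sparse COO tensor whose values were updated in-place above.
print("std (densified):\n", std.to_dense())
# What the regular dense code path would have produced instead.
print("dense equivalent:\n", state_sum.sqrt().add(eps))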
######################################################################
# MaskedTensor Sparse Implementation
# ----------------------------------
#
# We've been conflating sparsity as an optimization with sparsity as a semantic extension to PyTorch.
# MaskedTensor proposes to disentangle the sparsity optimization from the semantic extension; for example,
# currently we can't have dense semantics with sparse storage, or masked semantics with dense storage.
# MaskedTensor enables these ideas by purposefully separating the storage from the semantics.
#
# Consider the above example using a masked gradient:
#

# Let's now import MaskedTensor!
from torch.masked import masked_tensor

# Create an entirely new set of parameters to avoid errors
param2 = torch.arange(8).reshape(2, 4).float()
state_sum2 = torch.full_like(param, 0.5)  # initial value for state sum

mask = (grad.to_dense() != 0).to_sparse()
masked_grad = masked_tensor(grad, mask)

state_sum2 = state_sum2 + masked_grad.pow(2).get_data()
std2 = masked_tensor(state_sum2.to_sparse(), mask)

# We can add support for in-place operations later. Notice how this doesn't
# need to access any storage internals and is in general a lot shorter
std2 = std2.sqrt().add(eps)

param2 = param2.add((masked_grad / std2).get_data(), alpha=-clr)
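######################################################################
# One more added aside (not in the original tutorial): the masked division
# above needed no workaround for zero divided by zero. Elements that are
# masked out in the inputs simply stay undefined in the result, which we can
# see by printing the intermediate quotient.
#

# Both operands share the same mask, so undefined / undefined stays undefined.
print("masked_grad / std2:\n", masked_grad / std2)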
######################################################################
# Note that the implementations look quite similar, but the MaskedTensor implementation is shorter and simpler.
# In particular, much of the boilerplate code around ``_make_sparse``
# (and needing to have a separate implementation per layout) is handled for the user by :class:`MaskedTensor`.
#
# At this point, let's print both this version and the original version for easier comparison:
#

print("state_sum:\n", state_sum)
print("state_sum2:\n", state_sum2)

######################################################################
#

print("std:\n", std)
print("std2:\n", std2)

######################################################################
#

print("param:\n", param)
print("param2:\n", param2)

######################################################################
# Conclusion
# ----------
#
# In this tutorial, we've discussed how native masked semantics can enable a cleaner developer experience for
# Adagrad's existing implementation in PyTorch, which used sparsity as a proxy for writing masked semantics.
# But more importantly, allowing masked semantics to be a first-class citizen through MaskedTensor
# removes the reliance on sparsity and on unreliable hacks to mimic masking, letting masked semantics develop
# independently while still enabling sparse use cases such as this one.
#
# Further Reading
# ---------------
#
# To continue learning more, you can find our final review (for now) on
# `MaskedTensor Advanced Semantics <https://pytorch.org/tutorials/prototype/maskedtensor_advanced_semantics.html>`__
# to see some of the differences in design decisions between :class:`MaskedTensor` and NumPy's MaskedArray, as well
# as reduction semantics.
#