CoCalc provides the best real-time collaborative environment for Jupyter Notebooks, LaTeX documents, and SageMath, scalable from individual users to large groups and classes!
Path: blob/main/recipes_source/torch_export_aoti_python.py
# -*- coding: utf-8 -*-

"""
.. meta::
   :description: An end-to-end example of how to use AOTInductor for Python runtime.
   :keywords: torch.export, AOTInductor, torch._inductor.aot_compile, torch._export.aot_load

``torch.export`` AOTInductor Tutorial for Python runtime (Beta)
===============================================================
**Authors:** Ankith Gunapal, Bin Bao, Angela Yi
"""

######################################################################
#
# .. warning::
#
#     ``torch._inductor.aot_compile`` and ``torch._export.aot_load`` are in Beta status and are subject to backwards compatibility
#     breaking changes. This tutorial provides an example of how to use these APIs for model deployment using the Python runtime.
#
# It has been shown `previously <https://pytorch.org/docs/stable/torch.compiler_aot_inductor.html#>`__ how AOTInductor can be used
# to do Ahead-of-Time compilation of PyTorch exported models by creating
# a shared library that can be run in a non-Python environment.
#
# In this tutorial, you will learn an end-to-end example of how to use AOTInductor for Python runtime.
# We will look at how to use :func:`torch._inductor.aot_compile` along with :func:`torch.export.export` to generate a
# shared library. Additionally, we will examine how to execute the shared library in the Python runtime using
# :func:`torch._export.aot_load`. You will learn about the speedup seen in first inference time when using AOTInductor,
# especially with ``max-autotune`` mode, which can take some time to execute.
#
# **Contents**
#
# .. contents::
#     :local:

######################################################################
# Prerequisites
# -------------
# * PyTorch 2.4 or later
# * Basic understanding of ``torch.export`` and AOTInductor
# * Complete the `AOTInductor: Ahead-Of-Time Compilation for Torch.Export-ed Models <https://pytorch.org/docs/stable/torch.compiler_aot_inductor.html#>`_ tutorial

######################################################################
# What you will learn
# -------------------
# * How to use AOTInductor for Python runtime.
# * How to use :func:`torch._inductor.aot_compile` along with :func:`torch.export.export` to generate a shared library.
# * How to run a shared library in the Python runtime using :func:`torch._export.aot_load`.
# * When to use AOTInductor for the Python runtime.

######################################################################
# Model Compilation
# -----------------
#
# We will use the TorchVision pretrained ``ResNet18`` model and run TorchInductor on the
# exported PyTorch program using :func:`torch._inductor.aot_compile`.
#
# .. note::
#
#       This API also supports :func:`torch.compile` options like ``mode``.
#       For example, on a CUDA-enabled device you can set ``"max_autotune": True``,
#       which leverages Triton-based matrix multiplications and convolutions, and enables CUDA graphs by default.
#
# We also specify ``dynamic_shapes`` for the batch dimension. In this example, ``min=2`` is not a bug and is
# explained in `The 0/1 Specialization Problem <https://docs.google.com/document/d/16VPOa3d-Liikf48teAOmxLc92rgvJdfosIy-yoT38Io/edit?fbclid=IwAR3HNwmmexcitV0pbZm_x1a4ykdXZ9th_eJWK-3hBtVgKnrkmemz6Pm5jRQ#heading=h.ez923tomjvyk>`__.


import os
import torch
from torchvision.models import ResNet18_Weights, resnet18

model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.eval()

with torch.inference_mode():

    # Specify the generated shared library path
    aot_compile_options = {
        "aot_inductor.output_path": os.path.join(os.getcwd(), "resnet18_pt2.so"),
    }
    if torch.cuda.is_available():
        device = "cuda"
        aot_compile_options.update({"max_autotune": True})
    else:
        device = "cpu"

    model = model.to(device=device)
    example_inputs = (torch.randn(2, 3, 224, 224, device=device),)

    # min=2 is not a bug and is explained in the 0/1 Specialization Problem
    batch_dim = torch.export.Dim("batch", min=2, max=32)
    exported_program = torch.export.export(
        model,
        example_inputs,
        # Specify the first dimension of the input x as dynamic
        dynamic_shapes={"x": {0: batch_dim}},
    )
    so_path = torch._inductor.aot_compile(
        exported_program.module(),
        example_inputs,
        # Specify the generated shared library path
        options=aot_compile_options,
    )
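
######################################################################
# As a quick, optional sanity check, we can confirm that the shared library
# was written to the path we requested via ``aot_inductor.output_path``.
# This is only a minimal illustrative sketch; the size printed below will
# vary by platform and build.

# Illustrative check: ``so_path`` points at the generated shared library on disk.
print(f"Shared library written to: {so_path}")
print(f"Size on disk: {os.path.getsize(so_path) / 1e6:.1f} MB")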

######################################################################
# Model Inference in Python
# -------------------------
#
# Typically, the shared object generated above is used in a non-Python environment. In PyTorch 2.3,
# we added a new API called :func:`torch._export.aot_load` to load the shared library in the Python runtime.
# The API follows a structure similar to the :func:`torch.jit.load` API. You need to specify the path
# of the shared library and the device where it should be loaded.
#
# .. note::
#      In the example below, we specify ``batch_size=1`` for inference, and it still functions correctly even though we specified ``min=2`` in
#      :func:`torch.export.export`.


import os
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model_so_path = os.path.join(os.getcwd(), "resnet18_pt2.so")

model = torch._export.aot_load(model_so_path, device)
example_inputs = (torch.randn(1, 3, 224, 224, device=device),)

with torch.inference_mode():
    output = model(example_inputs)
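
######################################################################
# Because the batch dimension was exported as dynamic (``min=2``, ``max=32``),
# the same shared library can serve other batch sizes in that range without
# recompiling. The following is only a minimal sketch, reusing the ``model``
# and ``device`` defined above and the same calling convention as the example:

for batch_size in (4, 8):
    batch_inputs = (torch.randn(batch_size, 3, 224, 224, device=device),)
    with torch.inference_mode():
        # The dynamic batch dimension declared at export time covers these sizes.
        model(batch_inputs)
    print(f"Ran inference with batch_size={batch_size}")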

######################################################################
# When to use AOTInductor for Python Runtime
# ------------------------------------------
#
# One of the requirements for using AOTInductor is that the model shouldn't have any graph breaks.
# Once this requirement is met, the primary use case for AOTInductor Python Runtime is
# model deployment using Python.
# There are mainly two reasons why you would use AOTInductor Python Runtime:
#
# -  ``torch._inductor.aot_compile`` generates a shared library. This is useful for model
#    versioning for deployments and tracking model performance over time.
# -  With :func:`torch.compile` being a JIT compiler, there is a warmup
#    cost associated with the first compilation. Your deployment needs to account for the
#    compilation time taken for the first inference. With AOTInductor, the compilation is
#    done offline using ``torch.export.export`` and ``torch._inductor.aot_compile``. The deployment
#    would only load the shared library using ``torch._export.aot_load`` and run inference.
#
#
# The section below shows the speedup achieved with AOTInductor for first inference.
#
# We define a utility function ``timed`` to measure the time taken for inference.
#

import time
def timed(fn):
    # Returns the result of running `fn()` and the time it took for `fn()` to run,
    # in milliseconds. We use CUDA events and synchronization for accurate
    # measurement on CUDA-enabled devices.
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
    else:
        start = time.time()

    result = fn()
    if torch.cuda.is_available():
        end.record()
        torch.cuda.synchronize()
    else:
        end = time.time()

    # Measure time taken to execute the function in milliseconds
    if torch.cuda.is_available():
        duration = start.elapsed_time(end)
    else:
        duration = (end - start) * 1000

    return result, duration


######################################################################
# Let's measure the time for first inference using AOTInductor.

torch._dynamo.reset()

model = torch._export.aot_load(model_so_path, device)
example_inputs = (torch.randn(1, 3, 224, 224, device=device),)

with torch.inference_mode():
    _, time_taken = timed(lambda: model(example_inputs))
    print(f"Time taken for first inference for AOTInductor is {time_taken:.2f} ms")


######################################################################
# Let's measure the time for first inference using ``torch.compile``.

torch._dynamo.reset()

model = resnet18(weights=ResNet18_Weights.DEFAULT).to(device)
model.eval()

model = torch.compile(model)
example_inputs = torch.randn(1, 3, 224, 224, device=device)

with torch.inference_mode():
    _, time_taken = timed(lambda: model(example_inputs))
    print(f"Time taken for first inference for torch.compile is {time_taken:.2f} ms")

######################################################################
# We see that there is a drastic speedup in first inference time using AOTInductor compared
# to ``torch.compile``.

######################################################################
# Conclusion
# ----------
#
# In this recipe, we have learned how to effectively use AOTInductor for Python runtime by
# compiling and loading a pretrained ``ResNet18`` model with the ``torch._inductor.aot_compile``
# and ``torch._export.aot_load`` APIs. This process demonstrates the practical application of
# generating a shared library and running it within a Python environment, even with dynamic shape
# considerations and device-specific optimizations. We also looked at the advantage of using
# AOTInductor in model deployments, with regard to the speedup in first inference time.