# -*- coding: utf-8 -*-
"""
(beta) Accelerating BERT with semi-structured (2:4) sparsity
==============================================================
**Author**: `Jesse Cai <https://github.com/jcaip>`_

"""

####################################################################
# Overview
# --------
#
# Like other forms of sparsity, **semi-structured sparsity** is a model
# optimization technique that seeks to reduce the memory overhead and
# latency of a neural network at the expense of some model accuracy. It is
# also known as **fine-grained structured sparsity** or **2:4 structured
# sparsity**.
#
# Semi-structured sparsity derives its name from its unique sparsity
# pattern, where n out of every 2n elements are pruned. We most often see
# n=2, hence 2:4 sparsity. Semi-structured sparsity is particularly
# interesting because it can be efficiently accelerated on GPUs and
# doesn’t degrade model accuracy as much as other sparsity patterns.
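#
# For example, in a 2:4 sparse tensor at most 2 out of every 4 contiguous
# elements along a row are nonzero. The short sketch below only illustrates
# this pattern on a toy tensor; the group-wise nonzero count is an
# illustrative check and not part of the acceleration workflow.

import torch

# a toy 4x8 tensor that follows the 2:4 pattern: 2 nonzeros per group of 4
toy = torch.tensor([[1., 2., 0., 0., 0., 3., 0., 4.],
                    [0., 0., 5., 6., 7., 0., 8., 0.],
                    [9., 0., 0., 1., 0., 2., 3., 0.],
                    [0., 4., 5., 0., 6., 0., 0., 7.]])
# count the nonzeros in each contiguous group of 4 elements -- all are <= 2
print((toy.reshape(-1, 4) != 0).sum(dim=-1))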


######################################################################
# With the introduction of
# `semi-structured sparsity support <https://pytorch.org/docs/2.1/sparse.html#sparse-semi-structured-tensors>`_,
# it is possible to prune and accelerate a semi-structured sparse model
# without leaving PyTorch. We will explain this process in this tutorial.
#
# .. image:: ../../_static/img/pruning_flow.jpg
#
# By the end of this tutorial, we will have sparsified a BERT
# question-answering model to be 2:4 sparse, fine-tuning it to recover
# nearly all F1 loss (86.92 dense vs 86.48 sparse). Finally, we will
# accelerate this 2:4 sparse model for inference, yielding a 1.3x speedup.
#

######################################################################
# Requirements
# ------------
#
# - PyTorch >= 2.1.
# - An NVIDIA GPU with semi-structured sparsity support (Compute
#   Capability 8.0+).
#
# This tutorial is designed for beginners to semi-structured sparsity and
# sparsity in general. For users with existing 2:4 sparse models,
# accelerating ``nn.Linear`` layers for inference with
# ``to_sparse_semi_structured`` is quite straightforward. Here is an example:
#

import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from torch.utils.benchmark import Timer
SparseSemiStructuredTensor._FORCE_CUTLASS = True

# mask Linear weight to be 2:4 sparse
mask = torch.Tensor([0, 0, 1, 1]).tile((3072, 2560)).cuda().bool()
linear = torch.nn.Linear(10240, 3072).half().cuda().eval()
linear.weight = torch.nn.Parameter(mask * linear.weight)

x = torch.rand(3072, 10240).half().cuda()

with torch.inference_mode():
    dense_output = linear(x)
    dense_t = Timer(stmt="linear(x)",
                    globals={"linear": linear,
                             "x": x}).blocked_autorange().median * 1e3

    # accelerate via SparseSemiStructuredTensor
    linear.weight = torch.nn.Parameter(to_sparse_semi_structured(linear.weight))

    sparse_output = linear(x)
    sparse_t = Timer(stmt="linear(x)",
                     globals={"linear": linear,
                              "x": x}).blocked_autorange().median * 1e3

    # sparse and dense matmul are numerically equivalent
    # On an A100 80GB, we see: `Dense: 0.870ms Sparse: 0.630ms | Speedup: 1.382x`
    assert torch.allclose(sparse_output, dense_output, atol=1e-3)
    print(f"Dense: {dense_t:.3f}ms Sparse: {sparse_t:.3f}ms | Speedup: {(dense_t / sparse_t):.3f}x")


######################################################################
# What problem does semi-structured sparsity solve?
# -------------------------------------------------
#
# The general motivation behind sparsity is simple: if there are zeros in
# your network, you can optimize efficiency by not storing or computing those
# parameters. However, the specifics of sparsity are tricky. Zeroing out
# parameters doesn’t affect the latency / memory overhead of our model out
# of the box.
#
# This is because the dense tensor still contains the pruned (zero)
# elements, which the dense matrix multiplication kernel will still
# operate on. In order to realize performance gains, we need to swap out
# dense kernels for sparse kernels, which skip calculations involving
# pruned elements.
#
# To do this, these kernels work on sparse matrices, which do not store
# the pruned elements and store the specified elements in a compressed
# format.
#
# For semi-structured sparsity, we store exactly half of the original
# parameters along with some compressed metadata about how the elements
# were arranged.
#
# .. image:: https://developer-blogs.nvidia.com/wp-content/uploads/2023/06/2-4-structured-sparsity-pattern.png
#    :align: center
#    :width: 80%
#
# Image sourced from the `NVIDIA blog post <https://developer.nvidia.com/blog/structured-sparsity-in-the-nvidia-ampere-architecture-and-applications-in-search-engines/>`_ on semi-structured sparsity.
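#
# As a small illustration of this compressed format, the sketch below
# compresses a toy 2:4-sparse tensor and recovers it losslessly. It assumes
# the CUDA setup from the snippet above and that a 128x128 fp16 tensor
# satisfies the kernel shape constraints; ``A`` is purely illustrative and
# not part of the BERT workflow.

# build a toy fp16 tensor that already follows the 2:4 pattern
A = torch.rand(128, 128).half().cuda() * torch.Tensor([0, 0, 1, 1]).tile((128, 32)).half().cuda()
A_sparse = to_sparse_semi_structured(A)
# the sparse tensor stores only the kept values plus metadata about their positions
print(A_sparse)
# decompressing recovers the original zeroed-out dense tensor exactly
assert torch.allclose(A_sparse.to_dense(), A)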


######################################################################
# There are many different sparse layouts, each with their own benefits
# and drawbacks. The 2:4 semi-structured sparse layout is particularly
# interesting for two reasons:
#
# * Unlike previous sparse formats,
#   semi-structured sparsity was designed to be efficiently accelerated on
#   GPUs. In 2020, NVIDIA introduced hardware support for semi-structured
#   sparsity with their Ampere architecture, and have also released fast
#   sparse kernels via CUTLASS and
#   `cuSPARSELt <https://docs.nvidia.com/cuda/cusparselt/index.html>`__.
#
# * At the same time, semi-structured sparsity tends to have a milder
#   impact on model accuracy compared to other sparse formats, especially
#   when accounting for more advanced pruning / fine-tuning methods. NVIDIA
#   has shown in their `white paper <https://arxiv.org/abs/2104.08378>`_
#   that a simple paradigm of magnitude pruning once to be 2:4 sparse and
#   then retraining the model yields nearly identical model accuracies.
#
# Semi-structured sparsity exists in a sweet spot, providing a 2x
# (theoretical) speedup at a much lower sparsity level (50%), while still
# being granular enough to preserve model accuracy.
#
# +---------------------+-------------+--------+------------+-------------+
# | Network             | Data Set    | Metric | Dense FP16 | Sparse FP16 |
# +=====================+=============+========+============+=============+
# | ResNet-50           | ImageNet    | Top-1  | 76.1       | 76.2        |
# +---------------------+-------------+--------+------------+-------------+
# | ResNeXt-101_32x8d   | ImageNet    | Top-1  | 79.3       | 79.3        |
# +---------------------+-------------+--------+------------+-------------+
# | Xception            | ImageNet    | Top-1  | 79.2       | 79.2        |
# +---------------------+-------------+--------+------------+-------------+
# | SSD-RN50            | COCO2017    | bbAP   | 24.8       | 24.8        |
# +---------------------+-------------+--------+------------+-------------+
# | MaskRCNN-RN50       | COCO2017    | bbAP   | 37.9       | 37.9        |
# +---------------------+-------------+--------+------------+-------------+
# | FairSeq Transformer | EN-DE WMT14 | BLEU   | 28.2       | 28.5        |
# +---------------------+-------------+--------+------------+-------------+
# | BERT-Large          | SQuAD v1.1  | F1     | 91.9       | 91.9        |
# +---------------------+-------------+--------+------------+-------------+
#
# Semi-structured sparsity has an additional advantage from a workflow
# perspective. Because the sparsity level is fixed at 50%, it is easier to
# decompose the problem of sparsifying a model into two distinct
# subproblems:
#
# - Accuracy - How can we find a set of 2:4 sparse weights that minimize
#   the accuracy degradation of our model?
#
# - Performance - How can we accelerate our 2:4 sparse weights for
#   inference and reduced memory overhead?
#

#####################################################################
# .. math::
#
#    \begin{bmatrix}
#       1 & 1 & 0 & 0 \\
#       0 & 0 & 1 & 1 \\
#       1 & 0 & 0 & 0 \\
#       0 & 0 & 1 & 1 \\
#    \end{bmatrix}
#
# The natural handoff point between these two problems is a zeroed-out
# dense tensor. Our inference solution is designed to compress and
# accelerate tensors in this format. We anticipate many users coming up
# with custom masking solutions, as this is an active area of research.
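#
# As a sketch of what such a custom masking step might look like, the
# hypothetical helper ``magnitude_prune_24`` below (not a PyTorch API)
# keeps the two largest-magnitude elements in every group of four and
# zeroes out the rest, producing exactly this kind of zeroed-out dense
# tensor.

def magnitude_prune_24(dense):
    # view the tensor as contiguous groups of 4 elements along each row
    groups = dense.reshape(-1, 4)
    # indices of the 2 smallest-magnitude elements in each group
    _, drop = groups.abs().topk(2, dim=-1, largest=False)
    # zero them out, keeping the 2 largest-magnitude elements per group
    pruned = groups.scatter(-1, drop, 0.0)
    return pruned.reshape(dense.shape)

w = torch.randn(128, 128)
w_24 = magnitude_prune_24(w)
# every contiguous group of 4 elements now has at most 2 nonzeros
print((w_24.reshape(-1, 4) != 0).sum(dim=-1).max())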


######################################################################
# Now that we’ve learned a little more about semi-structured sparsity,
# let’s apply it to a BERT model trained on a question answering task,
# SQuAD.
#
# Intro & Setup
# -------------
#
# Let’s start by importing all the packages we need.
#

# If you are running this in Google Colab, run:
#
# .. code-block:: python
#
#    !pip install datasets transformers evaluate accelerate pandas
#
import os
os.environ["WANDB_DISABLED"] = "true"

import collections
import datasets
import evaluate
import numpy as np
import torch
import torch.utils.benchmark as benchmark
from torch import nn
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from torch.ao.pruning import WeightNormSparsifier
import transformers

# force CUTLASS use if ``cuSPARSELt`` is not available
SparseSemiStructuredTensor._FORCE_CUTLASS = True
torch.manual_seed(100)


######################################################################
# We’ll also need to define some helper functions that are specific to the
# dataset / task at hand. These were adapted from
# `this Hugging Face course <https://huggingface.co/learn/nlp-course/chapter7/7?fw=pt>`__
# as a reference.
#

def preprocess_validation_function(examples, tokenizer):
    inputs = tokenizer(
        [q.strip() for q in examples["question"]],
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])
        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs


def preprocess_train_function(examples, tokenizer):
    inputs = tokenizer(
        [q.strip() for q in examples["question"]],
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs["offset_mapping"]
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, (offset, answer) in enumerate(zip(offset_mapping, answers)):
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs


def compute_metrics(start_logits, end_logits, features, examples):
    n_best = 20
    max_answer_length = 30
    metric = evaluate.load("squad")

    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    # for example in ``tqdm`` (examples):
    for example in examples:
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0
                    # or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[
                            offsets[start_index][0] : offsets[end_index][1]
                        ],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [
        {"id": ex["id"], "answers": ex["answers"]} for ex in examples
    ]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)


######################################################################
# Now that those are defined, we just need one additional helper function,
# which will help us benchmark our model.
#

def measure_execution_time(model, batch_sizes, dataset):
    dataset_for_model = dataset.remove_columns(["example_id", "offset_mapping"])
    dataset_for_model.set_format("torch")
    batch_size_to_time_sec = {}
    for batch_size in batch_sizes:
        batch = {
            k: dataset_for_model[k][:batch_size].cuda()
            for k in dataset_for_model.column_names
        }

        with torch.no_grad():
            baseline_predictions = model(**batch)
            timer = benchmark.Timer(
                stmt="model(**batch)", globals={"model": model, "batch": batch}
            )
            p50 = timer.blocked_autorange().median * 1000
            batch_size_to_time_sec[batch_size] = p50

            model_c = torch.compile(model, fullgraph=True)
            timer = benchmark.Timer(
                stmt="model(**batch)", globals={"model": model_c, "batch": batch}
            )
            p50 = timer.blocked_autorange().median * 1000
            batch_size_to_time_sec[f"{batch_size}_compile"] = p50
            new_predictions = model_c(**batch)

    return batch_size_to_time_sec


######################################################################
# We will get started by loading our model and tokenizer, and then setting
# up our dataset.
#

# load model
model_name = "bert-base-cased"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForQuestionAnswering.from_pretrained(model_name)
print(f"Loading tokenizer: {model_name}")
print(f"Loading model: {model_name}")

# set up train and val dataset
squad_dataset = datasets.load_dataset("squad")
tokenized_squad_dataset = {}
tokenized_squad_dataset["train"] = squad_dataset["train"].map(
    lambda x: preprocess_train_function(x, tokenizer), batched=True
)
tokenized_squad_dataset["validation"] = squad_dataset["validation"].map(
    lambda x: preprocess_validation_function(x, tokenizer),
    batched=True,
    remove_columns=squad_dataset["train"].column_names,
)
data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)


######################################################################
# Establishing a baseline
# =======================
#
# Next, we’ll train a quick baseline of our model on SQuAD. This task asks
# our model to identify spans, or segments of text, in a given context
# (Wikipedia articles) that answer a given question. Running the following
# code gives me an F1 score of 86.9. This is quite close to the reported
# NVIDIA score and the difference is likely due to BERT-base
# vs. BERT-large or fine-tuning hyperparameters.
#

training_args = transformers.TrainingArguments(
    "trainer",
    num_train_epochs=1,
    lr_scheduler_type="constant",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=256,
    logging_steps=50,
    # Limit max steps for tutorial runners. Delete the below line to see the reported accuracy numbers.
    max_steps=500,
    report_to=None,
)

trainer = transformers.Trainer(
    model,
    training_args,
    train_dataset=tokenized_squad_dataset["train"],
    eval_dataset=tokenized_squad_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()

# batch sizes to compare for eval
batch_sizes = [4, 16, 64, 256]
# 2:4 sparsity requires fp16, so we cast here for a fair comparison
with torch.autocast("cuda"):
    with torch.no_grad():
        predictions = trainer.predict(tokenized_squad_dataset["validation"])
        start_logits, end_logits = predictions.predictions
        fp16_baseline = compute_metrics(
            start_logits,
            end_logits,
            tokenized_squad_dataset["validation"],
            squad_dataset["validation"],
        )
        fp16_time = measure_execution_time(
            model,
            batch_sizes,
            tokenized_squad_dataset["validation"],
        )

print("fp16", fp16_baseline)
print("cuda_fp16 time", fp16_time)

import pandas as pd
df = pd.DataFrame(trainer.state.log_history)
df.plot.line(x='step', y='loss', title="Loss vs. # steps", ylabel="loss")


######################################################################
# Pruning BERT to be 2:4 sparse
# -----------------------------
#
# Now that we have our baseline, it’s time to prune BERT. There are many
# different pruning strategies, but one of the most common is **magnitude
# pruning**, which seeks to remove the weights with the lowest L1 norm.
# Magnitude pruning was used by NVIDIA in all their results and is a
# common baseline.
#
# To do this, we will use the ``torch.ao.pruning`` package, which contains
# a weight-norm (magnitude) sparsifier. These sparsifiers work by applying
# mask parametrizations to the weight tensors in a model. This lets them
# simulate sparsity by masking out the pruned weights.
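#
# To see what such a mask parametrization does, here is a minimal sketch
# using ``torch.nn.utils.parametrize`` on a toy ``nn.Linear``; the
# ``ToyMask`` module and the toy layer are ours for illustration and are
# not part of ``torch.ao.pruning``.

from torch.nn.utils import parametrize

class ToyMask(nn.Module):
    """Multiply the weight by a fixed mask every time it is accessed."""
    def __init__(self, mask):
        super().__init__()
        self.register_buffer("mask", mask)

    def forward(self, weight):
        return self.mask * weight

toy_linear = nn.Linear(8, 8)
toy_mask = (torch.rand(8, 8) > 0.5).float()
parametrize.register_parametrization(toy_linear, "weight", ToyMask(toy_mask))
# accessing ``toy_linear.weight`` now returns ``mask * weight`` on the fly, while
# the unmasked values live on in ``toy_linear.parametrizations.weight.original``
print(toy_linear.weight[:2, :4])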


######################################################################
# We’ll also have to decide what layers of the model to apply sparsity to,
# which in this case is all of the ``nn.Linear`` layers, except for the
# task-specific head outputs. That’s because semi-structured sparsity has
# `shape constraints <https://pytorch.org/docs/2.1/sparse.html#constructing-sparse-semi-structured-tensors>`_,
# and the task-specific ``nn.Linear`` layers do not satisfy them.
#

sparsifier = WeightNormSparsifier(
    # apply sparsity to all blocks
    sparsity_level=1.0,
    # shape of 4 elements is a block
    sparse_block_shape=(1, 4),
    # two zeros for every block of 4
    zeros_per_block=2
)

# add to config if ``nn.Linear`` and in the BERT model.
sparse_config = [
    {"tensor_fqn": f"{fqn}.weight"}
    for fqn, module in model.named_modules()
    if isinstance(module, nn.Linear) and "layer" in fqn
]


######################################################################
# The first step for pruning the model is to insert parametrizations for
# masking the weights of the model. This is done by the prepare step.
# Anytime we try to access the ``.weight`` we will get ``mask * weight``
# instead.
#

# Prepare the model, insert fake-sparsity parametrizations for training
sparsifier.prepare(model, sparse_config)
print(model.bert.encoder.layer[0].output)


######################################################################
# Then, we’ll take a single pruning step. All pruners implement an
# ``update_mask()`` method that updates the mask with logic determined by
# the pruner implementation. The step method calls ``update_mask`` for the
# weights specified in the sparse config.
#
# We will also evaluate the model to show the accuracy degradation of
# zero-shot pruning, or pruning without fine-tuning / retraining.
#

sparsifier.step()
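# a quick, purely illustrative sanity check: after the pruning step, each
# masked weight should have at most 2 nonzero elements in every contiguous
# group of 4
w_pruned = model.bert.encoder.layer[0].intermediate.dense.weight
print("max nonzeros per group of 4:", (w_pruned.reshape(-1, 4) != 0).sum(dim=-1).max().item())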
with torch.autocast("cuda"):
    with torch.no_grad():
        predictions = trainer.predict(tokenized_squad_dataset["validation"])
        pruned = compute_metrics(
            *predictions.predictions,
            tokenized_squad_dataset["validation"],
            squad_dataset["validation"],
        )
print("pruned eval metrics:", pruned)


######################################################################
# In this state, we can start fine-tuning the model, updating the elements
# that are not pruned to better account for the accuracy loss. Once we’ve
# reached a satisfactory state, we can call ``squash_mask`` to fuse the
# mask and the weight together. This will remove the parametrizations and
# we are left with a zeroed-out 2:4 dense model.
#

trainer.train()
sparsifier.squash_mask()
torch.set_printoptions(edgeitems=4)
print(model.bert.encoder.layer[0].intermediate.dense.weight[:8, :8])

df["sparse_loss"] = pd.DataFrame(trainer.state.log_history)["loss"]
df.plot.line(x='step', y=["loss", "sparse_loss"], title="Loss vs. # steps", ylabel="loss")


######################################################################
# Accelerating 2:4 sparse models for inference
# --------------------------------------------
#
# Now that we have a model in this format, we can accelerate it for
# inference just like in the QuickStart Guide.
#

model = model.cuda().half()
# accelerate for sparsity
for fqn, module in model.named_modules():
    if isinstance(module, nn.Linear) and "layer" in fqn:
        module.weight = nn.Parameter(to_sparse_semi_structured(module.weight))

with torch.no_grad():
    predictions = trainer.predict(tokenized_squad_dataset["validation"])
    start_logits, end_logits = predictions.predictions
    metrics_sparse = compute_metrics(
        start_logits,
        end_logits,
        tokenized_squad_dataset["validation"],
        squad_dataset["validation"],
    )
    print("sparse eval metrics: ", metrics_sparse)
    sparse_perf = measure_execution_time(
        model,
        batch_sizes,
        tokenized_squad_dataset["validation"],
    )
    print("sparse perf metrics: ", sparse_perf)


######################################################################
# Retraining our model after magnitude pruning has recovered nearly all of
# the F1 that was lost when the model was pruned. At the same time, we
# have achieved a 1.28x speedup for ``bs=16``. Note that not all shapes are
# amenable to performance improvements. When batch sizes are small and
# limited time is spent in compute, sparse kernels may be slower than their
# dense counterparts.
#
# Because semi-structured sparsity is implemented as a tensor subclass, it
# is compatible with ``torch.compile``. When composed with
# ``to_sparse_semi_structured``, we are able to achieve a total 2x speedup
# on BERT. All times below are reported in milliseconds.
#
# .. table::
#
#     +--------------------+--------+--------------+-----------------+-----------+
#     | Metrics            | fp16   | 2:4 sparse   | delta / speedup | compiled  |
#     +====================+========+==============+=================+===========+
#     | Exact Match (%)    | 78.53  | 78.44        | -0.09           |           |
#     +--------------------+--------+--------------+-----------------+-----------+
#     | F1 (%)             | 86.93  | 86.49        | -0.44           |           |
#     +--------------------+--------+--------------+-----------------+-----------+
#     | Time (bs=4)        | 11.10  | 15.54        | 0.71x           | no        |
#     +--------------------+--------+--------------+-----------------+-----------+
#     | Time (bs=16)       | 19.35  | 15.74        | 1.23x           | no        |
#     +--------------------+--------+--------------+-----------------+-----------+
#     | Time (bs=64)       | 72.71  | 59.41        | 1.22x           | no        |
#     +--------------------+--------+--------------+-----------------+-----------+
#     | Time (bs=256)      | 286.65 | 247.63       | 1.14x           | no        |
#     +--------------------+--------+--------------+-----------------+-----------+
#     | Time (bs=4)        | 7.59   | 7.46         | 1.02x           | yes       |
#     +--------------------+--------+--------------+-----------------+-----------+
#     | Time (bs=16)       | 11.47  | 9.68         | 1.18x           | yes       |
#     +--------------------+--------+--------------+-----------------+-----------+
#     | Time (bs=64)       | 41.57  | 36.92        | 1.13x           | yes       |
#     +--------------------+--------+--------------+-----------------+-----------+
#     | Time (bs=256)      | 159.22 | 142.23       | 1.12x           | yes       |
#     +--------------------+--------+--------------+-----------------+-----------+
#
# Conclusion
# ==========
#
# In this tutorial, we have shown how to prune BERT to be 2:4 sparse and
# how to accelerate a 2:4 sparse model for inference.
By taking advantage646# of our ``SparseSemiStructuredTensor`` subclass, we were able to achieve a647# 1.3x speedup over the fp16 baseline, and up to 2x with648# ``torch.compile``. We also demonstrated the benefits of 2:4 sparsity by649# fine-tuning BERT to recover any lost F1 (86.92 dense vs 86.48 sparse).650#651652653