"""1T5-Base Model for Summarization, Sentiment Classification, and Translation2==========================================================================34**Authors**: `Pendo Abbo <[email protected]>`__, `Joe Cummings <[email protected]>`__56"""78######################################################################9# Overview10# --------11#12# This tutorial demonstrates how to use a pretrained T5 Model for summarization, sentiment classification, and13# translation tasks. We will demonstrate how to use the torchtext library to:14#15# 1. Build a text preprocessing pipeline for a T5 model16# 2. Instantiate a pretrained T5 model with base configuration17# 3. Read in the CNNDM, IMDB, and Multi30k datasets and preprocess their texts in preparation for the model18# 4. Perform text summarization, sentiment classification, and translation19#20# .. note::21# This tutorial requires PyTorch 2.0.0 or later.22#23#######################################################################24# Data Transformation25# -------------------26#27# The T5 model does not work with raw text. Instead, it requires the text to be transformed into numerical form28# in order to perform training and inference. The following transformations are required for the T5 model:29#30# 1. Tokenize text31# 2. Convert tokens into (integer) IDs32# 3. Truncate the sequences to a specified maximum length33# 4. Add end-of-sequence (EOS) and padding token IDs34#35# T5 uses a ``SentencePiece`` model for text tokenization. Below, we use a pretrained ``SentencePiece`` model to build36# the text preprocessing pipeline using torchtext's T5Transform. 
# Note that the transform supports both
# batched and non-batched text input (for example, one can either pass a single sentence or a list of sentences);
# however, the T5 model expects the input to be batched.
#

from torchtext.models import T5Transform

padding_idx = 0
eos_idx = 1
max_seq_len = 512
t5_sp_model_path = "https://download.pytorch.org/models/text/t5_tokenizer_base.model"

transform = T5Transform(
    sp_model_path=t5_sp_model_path,
    max_seq_len=max_seq_len,
    eos_idx=eos_idx,
    padding_idx=padding_idx,
)

#######################################################################
# Alternatively, we can use the transform shipped with the pretrained models, which does all of the above out-of-the-box:
#
# .. code-block::
#
#    from torchtext.models import T5_BASE_GENERATION
#    transform = T5_BASE_GENERATION.transform()
#


######################################################################
# Model Preparation
# -----------------
#
# torchtext provides SOTA pretrained models that can be used directly for NLP tasks or fine-tuned on downstream tasks. Below
# we use the pretrained T5 model with standard base configuration to perform text summarization, sentiment classification, and
# translation. For additional details on available pretrained models, see `the torchtext documentation <https://pytorch.org/text/main/models.html>`__.
#
#
from torchtext.models import T5_BASE_GENERATION


t5_base = T5_BASE_GENERATION
transform = t5_base.transform()
model = t5_base.get_model()
model.eval()


#######################################################################
# Using ``GenerationUtils``
# -------------------------
#
# We can use torchtext's ``GenerationUtils`` to produce an output sequence based on the input sequence provided. This calls on the
# model's encoder and decoder, and iteratively expands the decoded sequences until the end-of-sequence token is generated
# for all sequences in the batch.
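The expand-until-EOS loop described above can be sketched in plain Python. The ``next_token_scores`` callable below is a hypothetical stand-in for a model forward pass; none of the names here are torchtext APIs.

```python
EOS = 1  # assumed end-of-sequence token ID for this sketch


def greedy_decode(next_token_scores, batch_size, max_len=10):
    """Iteratively extend each sequence with its highest-scoring next token
    until every sequence in the batch has emitted EOS (or max_len is hit)."""
    sequences = [[] for _ in range(batch_size)]
    for _ in range(max_len):
        if all(seq and seq[-1] == EOS for seq in sequences):
            break  # every sequence in the batch has ended
        for seq in sequences:
            if not seq or seq[-1] != EOS:
                scores = next_token_scores(seq)          # dict: token ID -> score
                seq.append(max(scores, key=scores.get))  # greedy: take the argmax
    return sequences
```

Beam search follows the same outer loop but keeps the top-``k`` partial sequences per step instead of a single argmax.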
# The ``generate`` method shown below uses greedy search to generate the sequences. Beam search and
# other decoding strategies are also supported.
#
#
from torchtext.prototype.generate import GenerationUtils

sequence_generator = GenerationUtils(model)


#######################################################################
# Datasets
# --------
#
# torchtext provides several standard NLP datasets. For a complete list, refer to the documentation
# at https://pytorch.org/text/stable/datasets.html. These datasets are built using composable torchdata
# datapipes and hence support standard flow control and mapping/transformation using user-defined
# functions and transforms.
#
# Below we demonstrate how to preprocess the CNNDM dataset to include the prefix necessary for the
# model to identify the task it is performing. The CNNDM dataset has a train, validation, and test
# split. Below we demo on the test split.
#
# The T5 model uses the prefix "summarize" for text summarization. For more information on task
# prefixes, please visit Appendix D of the `T5 paper <https://arxiv.org/pdf/1910.10683.pdf>`__.
#
# .. note::
#    Using datapipes is still currently subject to a few caveats.
#    If you wish
#    to extend this example to include shuffling, multi-processing, or
#    distributed learning, please see :ref:`this note <datapipes_warnings>`
#    for further instructions.

from functools import partial

from torch.utils.data import DataLoader
from torchtext.datasets import CNNDM

cnndm_batch_size = 5
cnndm_datapipe = CNNDM(split="test")
task = "summarize"


def apply_prefix(task, x):
    return f"{task}: " + x[0], x[1]


cnndm_datapipe = cnndm_datapipe.map(partial(apply_prefix, task))
cnndm_datapipe = cnndm_datapipe.batch(cnndm_batch_size)
cnndm_datapipe = cnndm_datapipe.rows2columnar(["article", "abstract"])
cnndm_dataloader = DataLoader(cnndm_datapipe, shuffle=True, batch_size=None)

#######################################################################
# Alternatively, we can use the batched API, for example, applying the prefix to the whole batch:
#
# .. code-block::
#
#    def batch_prefix(task, x):
#        return {
#            "article": [f'{task}: ' + y for y in x["article"]],
#            "abstract": x["abstract"]
#        }
#
#    cnndm_batch_size = 5
#    cnndm_datapipe = CNNDM(split="test")
#    task = 'summarize'
#
#    cnndm_datapipe = cnndm_datapipe.batch(cnndm_batch_size).rows2columnar(["article", "abstract"])
#    cnndm_datapipe = cnndm_datapipe.map(partial(batch_prefix, task))
#    cnndm_dataloader = DataLoader(cnndm_datapipe, batch_size=None)
#

#######################################################################
# We can also load the IMDB dataset, which will be used to demonstrate sentiment classification using the T5 model.
# This dataset has a train and test split. Below we demo on the test split.
#
# The T5 model was trained on the SST2 dataset (also available in torchtext) for sentiment classification using the
# prefix ``sst2 sentence``.
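The task-prefix mapping used for all three datasets can be checked on plain tuples, independent of torchdata. The helper and sample rows below are illustrative only, not part of the tutorial's pipeline.

```python
def apply_task_prefix(task, row):
    """Prepend the task prefix to the input text; leave the target untouched."""
    return f"{task}: " + row[0], row[1]


# made-up (article, abstract) rows standing in for real CNNDM examples
sample_rows = [("a short news article", "its abstract")]
prefixed = [apply_task_prefix("summarize", row) for row in sample_rows]
print(prefixed[0][0])  # summarize: a short news article
```

The prefix is plain text prepended to the input, which is how T5 is told which task to perform.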
# Therefore, we will use this prefix to perform sentiment classification on the IMDB dataset.
#

from torchtext.datasets import IMDB

imdb_batch_size = 3
imdb_datapipe = IMDB(split="test")
task = "sst2 sentence"
labels = {"1": "negative", "2": "positive"}


def process_labels(labels, x):
    return x[1], labels[str(x[0])]


imdb_datapipe = imdb_datapipe.map(partial(process_labels, labels))
imdb_datapipe = imdb_datapipe.map(partial(apply_prefix, task))
imdb_datapipe = imdb_datapipe.batch(imdb_batch_size)
imdb_datapipe = imdb_datapipe.rows2columnar(["text", "label"])
imdb_dataloader = DataLoader(imdb_datapipe, batch_size=None)

#######################################################################
# Finally, we can also load the Multi30k dataset to demonstrate English to German translation using the T5 model.
# This dataset has a train, validation, and test split. Below we demo on the test split.
#
# The T5 model uses the prefix "translate English to German" for this task.

from torchtext.datasets import Multi30k

multi_batch_size = 5
language_pair = ("en", "de")
multi_datapipe = Multi30k(split="test", language_pair=language_pair)
task = "translate English to German"

multi_datapipe = multi_datapipe.map(partial(apply_prefix, task))
multi_datapipe = multi_datapipe.batch(multi_batch_size)
multi_datapipe = multi_datapipe.rows2columnar(["english", "german"])
multi_dataloader = DataLoader(multi_datapipe, batch_size=None)

#######################################################################
# Generate Summaries
# ------------------
#
# We can put all of the components together to generate summaries on the first batch of articles in the CNNDM test set
# using a beam size of 1.
#

batch = next(iter(cnndm_dataloader))
input_text = batch["article"]
target = batch["abstract"]
beam_size = 1

model_input = transform(input_text)
model_output = sequence_generator.generate(model_input, eos_idx=eos_idx, num_beams=beam_size)
output_text = transform.decode(model_output.tolist())

for i in range(cnndm_batch_size):
    print(f"Example {i+1}:\n")
    print(f"prediction: {output_text[i]}\n")
    print(f"target: {target[i]}\n\n")


#######################################################################
# Summarization Output
# --------------------
#
# Summarization output might vary since we shuffle the dataloader.
#
# .. code-block::
#
#    Example 1:
#
#    prediction: the 24-year-old has been tattooed for over a decade . he has landed in australia
#    to start work on a new campaign . he says he is 'taking it in your stride' to be honest .
#
#    target: London-based model Stephen James Hendry famed for his full body tattoo . The supermodel
#    is in Sydney for a new modelling campaign . Australian fans understood to have already located
#    him at his hotel . The 24-year-old heartthrob is recently single .
#
#
#    Example 2:
#
#    prediction: a stray pooch has used up at least three of her own after being hit by a
#    car and buried in a field . the dog managed to stagger to a nearby farm, dirt-covered
#    and emaciated, where she was found . she suffered a dislocated jaw, leg injuries and a
#    caved-in sinus cavity -- and still requires surgery to help her breathe .
#
#    target: Theia, a bully breed mix, was apparently hit by a car, whacked with a hammer
#    and buried in a field . "She's a true miracle dog and she deserves a good life," says
#    Sara Mellado, who is looking for a home for Theia .
#
#
#    Example 3:
#
#    prediction: mohammad Javad Zarif arrived in Iran on a sunny friday morning . he has gone
#    a long way to bring Iran in from the cold and allow it to rejoin the international
#    community .
#    but there are some facts about him that are less well-known .
#
#    target: Mohammad Javad Zarif has spent more time with John Kerry than any other
#    foreign minister . He once participated in a takeover of the Iranian Consulate in San
#    Francisco . The Iranian foreign minister tweets in English .
#
#
#    Example 4:
#
#    prediction: five americans were monitored for three weeks after being exposed to Ebola in
#    west africa . one of the five had a heart-related issue and has been discharged but hasn't
#    left the area . they are clinicians for Partners in Health, a Boston-based aid group .
#
#    target: 17 Americans were exposed to the Ebola virus while in Sierra Leone in March .
#    Another person was diagnosed with the disease and taken to hospital in Maryland .
#    National Institutes of Health says the patient is in fair condition after weeks of
#    treatment .
#
#
#    Example 5:
#
#    prediction: the student was identified during an investigation by campus police and
#    the office of student affairs . he admitted to placing the noose on the tree early
#    Wednesday morning . the incident is one of several recent racist events to affect
#    college students .
#
#    target: Student is no longer on Duke University campus and will face disciplinary
#    review . School officials identified student during investigation and the person
#    admitted to hanging the noose, Duke says .
#    The noose, made of rope, was discovered on
#    campus about 2 a.m.
#


#######################################################################
# Generate Sentiment Classifications
# ----------------------------------
#
# Similarly, we can use the model to generate sentiment classifications on the first batch of reviews from the IMDB test set
# using a beam size of 1.
#

batch = next(iter(imdb_dataloader))
input_text = batch["text"]
target = batch["label"]
beam_size = 1

model_input = transform(input_text)
model_output = sequence_generator.generate(model_input, eos_idx=eos_idx, num_beams=beam_size)
output_text = transform.decode(model_output.tolist())

for i in range(imdb_batch_size):
    print(f"Example {i+1}:\n")
    print(f"input_text: {input_text[i]}\n")
    print(f"prediction: {output_text[i]}\n")
    print(f"target: {target[i]}\n\n")


#######################################################################
# Sentiment Output
# ----------------
#
# .. code-block:: bash
#
#    Example 1:
#
#    input_text: sst2 sentence: I love sci-fi and am willing to put up with a lot. Sci-fi
#    movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like
#    this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original).
#    Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the
#    background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi'
#    setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV.
#    It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character
#    development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may
#    treat important issues, yet not as a serious philosophy. It's really difficult to care about
#    the characters here as they are not simply foolish, just missing a spark of life.
#    Their
#    actions and reactions are wooden and predictable, often painful to watch. The makers of Earth
#    KNOW it's rubbish as they have to always say "Gene Roddenberry's Earth..." otherwise people
#    would not continue watching. Roddenberry's ashes must be turning in their orbit as this dull,
#    cheap, poorly edited (watching it without advert breaks really brings this home) trudging
#    Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring
#    him back as another actor. Jeeez. Dallas all over again.
#
#    prediction: negative
#
#    target: negative
#
#
#    Example 2:
#
#    input_text: sst2 sentence: Worth the entertainment value of a rental, especially if you like
#    action movies. This one features the usual car chases, fights with the great Van Damme kick
#    style, shooting battles with the 40 shell load shotgun, and even terrorist style bombs. All
#    of this is entertaining and competently handled but there is nothing that really blows you
#    away if you've seen your share before.<br /><br />The plot is made interesting by the
#    inclusion of a rabbit, which is clever but hardly profound. Many of the characters are
#    heavily stereotyped -- the angry veterans, the terrified illegal aliens, the crooked cops,
#    the indifferent feds, the bitchy tough lady station head, the crooked politician, the fat
#    federale who looks like he was typecast as the Mexican in a Hollywood movie from the 1940s.
#    All passably acted but again nothing special.<br /><br />I thought the main villains were
#    pretty well done and fairly well acted. By the end of the movie you certainly knew who the
#    good guys were and weren't. There was an emotional lift as the really bad ones got their just
#    deserts. Very simplistic, but then you weren't expecting Hamlet, right? The only thing I found
#    really annoying was the constant cuts to VDs daughter during the last fight scene.<br /><br />
#    Not bad. Not good.
#    Passable 4.
#
#    prediction: positive
#
#    target: negative
#
#
#    Example 3:
#
#    input_text: sst2 sentence: its a totally average film with a few semi-alright action sequences
#    that make the plot seem a little better and remind the viewer of the classic van dam films.
#    parts of the plot don't make sense and seem to be added in to use up time. the end plot is that
#    of a very basic type that doesn't leave the viewer guessing and any twists are obvious from the
#    beginning. the end scene with the flask backs don't make sense as they are added in and seem to
#    have little relevance to the history of van dam's character. not really worth watching again,
#    bit disappointed in the end production, even though it is apparent it was shot on a low budget
#    certain shots and sections in the film are of poor directed quality.
#
#    prediction: negative
#
#    target: negative
#


#######################################################################
# Generate Translations
# ---------------------
#
# Finally, we can also use the model to generate English to German translations on the first batch of examples from the Multi30k
# test set.
#

batch = next(iter(multi_dataloader))
input_text = batch["english"]
target = batch["german"]

model_input = transform(input_text)
model_output = sequence_generator.generate(model_input, eos_idx=eos_idx, num_beams=beam_size)
output_text = transform.decode(model_output.tolist())

for i in range(multi_batch_size):
    print(f"Example {i+1}:\n")
    print(f"input_text: {input_text[i]}\n")
    print(f"prediction: {output_text[i]}\n")
    print(f"target: {target[i]}\n\n")


#######################################################################
# Translation Output
# ------------------
#
# .. code-block:: bash
#
#    Example 1:
#
#    input_text: translate English to German: A man in an orange hat starring at something.
#
#    prediction: Ein Mann in einem orangen Hut, der an etwas schaut.
#
#    target: Ein Mann mit einem orangefarbenen Hut, der etwas anstarrt.
#
#
#    Example 2:
#
#    input_text: translate English to German: A Boston Terrier is running on lush green grass in front of a white fence.
#
#    prediction: Ein Boston Terrier läuft auf üppigem grünem Gras vor einem weißen Zaun.
#
#    target: Ein Boston Terrier läuft über saftig-grünes Gras vor einem weißen Zaun.
#
#
#    Example 3:
#
#    input_text: translate English to German: A girl in karate uniform breaking a stick with a front kick.
#
#    prediction: Ein Mädchen in Karate-Uniform bricht einen Stöck mit einem Frontkick.
#
#    target: Ein Mädchen in einem Karateanzug bricht ein Brett mit einem Tritt.
#
#
#    Example 4:
#
#    input_text: translate English to German: Five people wearing winter jackets and helmets stand in the snow, with snowmobiles in the background.
#
#    prediction: Fünf Menschen mit Winterjacken und Helmen stehen im Schnee, mit Schneemobilen im Hintergrund.
#
#    target: Fünf Leute in Winterjacken und mit Helmen stehen im Schnee mit Schneemobilen im Hintergrund.
#
#
#    Example 5:
#
#    input_text: translate English to German: People are fixing the roof of a house.
#
#    prediction: Die Leute fixieren das Dach eines Hauses.
#
#    target: Leute Reparieren das Dach eines Hauses.
#