
Assignment 2: Transformer Summarizer

Welcome to the second assignment of course 4. In this assignment you will explore summarization using the transformer model. Yes, you will implement the transformer decoder from scratch, but we will slowly walk you through it. There are many hints in this notebook so feel free to use them as needed.

Introduction

Summarization is an important task in natural language processing and can be useful in consumer and enterprise applications. For example, bots can scrape articles and summarize them, and you can then run sentiment analysis on the summaries to gauge sentiment about certain stocks. And who wants to read a long article or email today when a transformer can summarize it for you? Let's get started. By completing this assignment you will learn to:

  • Use built-in functions to preprocess your data

  • Implement DotProductAttention

  • Implement Causal Attention

  • Understand how attention works

  • Build the transformer model

  • Evaluate your model

  • Summarize an article

As you can tell, this model is slightly different from the ones you have already implemented. It is heavily based on attention and does not rely on sequential (recurrent) processing, which allows for parallel computation.

import sys
import os

import numpy as np
import textwrap
wrapper = textwrap.TextWrapper(width=70)

import trax
from trax import layers as tl
from trax.fastmath import numpy as jnp

# to print the entire np array
np.set_printoptions(threshold=sys.maxsize)
INFO:tensorflow:tokens_length=568 inputs_length=512 targets_length=114 noise_density=0.15 mean_noise_span_length=3.0

Part 1: Importing the dataset

Trax makes it easy to work with TensorFlow Datasets (TFDS):

# This will download the dataset if no data_dir is specified.
# Downloading and processing can take a bit of time,
# so we have the data already in 'data/' for you

# Importing CNN/DailyMail articles dataset
train_stream_fn = trax.data.TFDS('cnn_dailymail',
                                 data_dir='data/',
                                 keys=('article', 'highlights'),
                                 train=True)

# This should be much faster as the data is downloaded already.
eval_stream_fn = trax.data.TFDS('cnn_dailymail',
                                data_dir='data/',
                                keys=('article', 'highlights'),
                                train=False)

1.1 Tokenize & Detokenize helper functions

Just like in the previous assignment, the tokenizer (subword encoder) is provided for you. Given any dataset, you have to be able to map words to their indices, and indices back to their words. The inputs and outputs of your Trax models are usually tensors of numbers, where each number corresponds to a word. If you were to process your data manually, you would have to make use of the following:

  • word2Ind: a dictionary mapping the word to its index.

  • ind2Word: a dictionary mapping the index to its word.

  • word2Count: a dictionary mapping the word to the number of times it appears.

  • num_words: total number of words that have appeared.

Since you have already implemented these in previous assignments of the specialization, we will provide you with helper functions that will do this for you. Run the cell below to get the following functions:

  • tokenize: converts a text sentence to its corresponding token list (i.e. list of indices). Also converts words to subwords.

  • detokenize: converts a token list to its corresponding sentence (i.e. string).

def tokenize(input_str, EOS=1):
    """Input str to features dict, ready for inference"""

    # Use the trax.data.tokenize method. It takes streams and returns streams,
    # we get around it by making a 1-element stream with `iter`.
    inputs = next(trax.data.tokenize(iter([input_str]),
                                     vocab_dir='vocab_dir/',
                                     vocab_file='summarize32k.subword.subwords'))

    # Mark the end of the sentence with EOS
    return list(inputs) + [EOS]

def detokenize(integers):
    """List of ints to str"""

    s = trax.data.detokenize(integers,
                             vocab_dir='vocab_dir/',
                             vocab_file='summarize32k.subword.subwords')

    return wrapper.fill(s)
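If you want to get a feel for these helpers, here is a quick, ungraded usage sketch. It assumes the vocab files in 'vocab_dir/' are in place (as above); the exact subword ids you see will depend on that vocabulary:

# Quick usage sketch (illustration only; assumes the vocab files above are available).
example_tokens = tokenize('hello world')   # a short list of subword ids ending with EOS = 1
print(example_tokens)
print(detokenize(example_tokens))          # should recover the text (plus an <EOS> marker)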

1.2 Preprocessing for Language Models: Concatenate It!

This week you will use a language model -- a Transformer decoder -- to solve an input-output problem. As you know, a language model only predicts the next token; it has no separate notion of inputs. To create a single input suitable for a language model, we concatenate inputs with targets, putting a separator in between. We also need to create a mask -- with 0s over the inputs and 1s over the targets -- so that the model is not penalized for mis-predicting the article and focuses only on the summary. See the preprocess function below for how this is done.

# Special tokens
SEP = 0 # Padding or separator token
EOS = 1 # End of sentence token

# Concatenate tokenized inputs and targets using 0 as separator.
def preprocess(stream):
    for (article, summary) in stream:
        joint = np.array(list(article) + [EOS, SEP] + list(summary) + [EOS])
        mask = [0] * (len(list(article)) + 2) + [1] * (len(list(summary)) + 1) # Accounting for EOS and SEP
        yield joint, joint, np.array(mask)

# You can combine a few data preprocessing steps into a pipeline like this.
input_pipeline = trax.data.Serial(
    # Tokenizes
    trax.data.Tokenize(vocab_dir='vocab_dir/',
                       vocab_file='summarize32k.subword.subwords'),
    # Uses function defined above
    preprocess,
    # Filters out examples longer than 2048
    trax.data.FilterByLength(2048)
)

# Apply preprocessing to data streams.
train_stream = input_pipeline(train_stream_fn())
eval_stream = input_pipeline(eval_stream_fn())

train_input, train_target, train_mask = next(train_stream)

assert sum((train_input - train_target)**2) == 0  # They are the same in Language Model (LM).
# prints mask, 0s on article, 1s on summary
print(f'Single example mask:\n\n {train_mask}')
Single example mask: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
# prints: [Example][<EOS>][<pad>][Example Summary][<EOS>]
print(f'Single example:\n\n {detokenize(train_input)}')
Single example: By . Deborah Arthurs . UPDATED: . 13:47 EST, 4 January 2012 . As Burberry today unveils its new spring/summer campaign in store windows and across social media networking sites, millions of ardent fans will be watching. Just last month, it was announced that the British company had become the world's most successful  luxury fashion brand on Facebook and Twitter, with a record 10million fans on Facebook, and almost 700,000 following the brand's regular UK feed on Twitter. Meanwhile, they have thousands more global Twitter fans following their international feeds and post exclusive content on their own YouTube channel. Behind the scenes: Eddie Redmayne and Cara Delevingne on set of the latest Burberry campaign . Burberry's social media success has grown exponentially - and it is still growing fast. The secret, say consumer experts, is the fact that Burberry share so much unique content exclusively with their followers on social networking platforms, and post new and different content to each one. What is posted on Facebook will be different from that on Twitter, and dialogues are constantly being carried out across both. Just this morning, Burberry placed . three new videos from their latest campaign, starring My Week With . Marilyn star Eddie Redmayne and Cara Delevingne, onto their YouTube channel. And at the news that the brand had . reached 10million Facebook fans, Burberry Chief Creative Officer Christopher . Bailey uploaded a personal thank you via a video message to the site. Days later, a further 163,000 had joined. At ease: The two chat between shots . All this direct contact gives fans that bit more involvement with the brand - and keeps them connected. 'Christopher is really involved,' Burberry told MailOnline. 'He has answered fans' questions personally on . Facebook, and he's very active on Twitter.' 'It gives people something new. It's a direct and personal route for them.' Their groundbreaking, inclusive . approach goes against the practiced elitism of luxury brands, and ultimately brings . fans closer rather than keeping them at a distance. While other fashion houses operate . strict closed-door policies that isolate all but the inner circle - Tom . Ford famously showed his last collection to a tiny selection of invited . contacts, with no photographers and only a handful of select Press allowed - Burberry have . embraced this public approach. Indeed, Burberry was the world's first . fashion house to 'Tweettalk', posting images from its autumn/winter fashion show on to . Twitter - . before the models had even stepped out onto the catwalk. The innovative move meant fans . were able to see the collection before the celebrity front row and the . fashion Press had even caught a glimpse. For his part, current star of the . campaign Eddie Redmayne reports that Christopher Bailey is, as one would . expect from his democratic approach to business, a 'gem of a man' 'I am a huge fan of Christopher Bailey,' he says. 'He is a brilliant designer and a brilliant man. 'The fashion world can be a tad intimidating but Christopher remains a kind, grounded gem of a man. And judging by the latest figures, there are at least 10 million people out there who agree. Intense: Cara, 18, has been the face of Burberry since January last year, when she appeared in the brand's S/S11 campaign . Check it out: The spring/summer Burberry campaign starring My Week With Marilyn's Eddie Redmayne and Brit socialite and model Cara Delevingne . 
The pair model the Spring Summer 2012 collection for the Burbery Prorsum collection . Social media success story: Burberry has over 10m Facebook fans - and counting . About the videos, Bailey says: 'We wanted to capture a moment in the lives of two exciting and inspiring British actors who have been part of the Burberry family for several years. 'The images reflect the mood of the collection through Eddie's and Cara's energy, playfulness and effortless elegance and I have such huge admiration for them both.' Follow Burberry at www.facebook.com/burberry, . www.twitter.com/burberry, www.youtube.com/burberry, www.burberry.com and www.artofthetrench.com.<EOS><pad>Brandhas 10,163,728 fans on Facebook while 694,495 follow UK Twitter feed . Chief creative officer Christopher Bailey personally thanks followers . New films star Burberry faces Eddie Redmayne and Cara Delevingne .<EOS>

1.3 Batching with bucketing

As in the previous week, we use bucketing to create batches of data.

# Bucketing to create batched generators.

# Buckets are defined in terms of boundaries and batch sizes.
# Batch_sizes[i] determines the batch size for items with length < boundaries[i]
# So below, we'll take a batch of 16 sentences of length < 128 , 8 of length < 256,
# 4 of length < 512. And so on.
boundaries =  [128, 256, 512, 1024]
batch_sizes = [16, 8, 4, 2, 1]

# Create the streams.
train_batch_stream = trax.data.BucketByLength(
    boundaries, batch_sizes)(train_stream)

eval_batch_stream = trax.data.BucketByLength(
    boundaries, batch_sizes)(eval_stream)
# Every execution will result in generation of a different article
# Try running this cell multiple times to see how the length of the examples affects the batch size
input_batch, _, mask_batch = next(train_batch_stream)

# Shape of the input_batch
input_batch.shape
(1, 1311)
# print corresponding integer values
print(input_batch[0])
[ 9 21247 21954 5 1260 5028 5817 3218 21 467 3573 143 1771 117 28 4539 80 1271 17362 107 24752 963 8766 14201 279 14960 4 78 5512 583 186 6365 2 3573 7 5 17263 320 213 257 366 14487 78 2613 3 2328 25464 4 16860 20 127 285 122 213 257 366 186 54 3675 1105 403 196 78 3573 7 5 2279 1716 186 15984 213 117 3865 3188 80 527 28 8409 1028 285 7 5 8103 7217 2 213 646 39 1151 20794 251 1782 56 229 28 1271 10243 132 3573 892 219 169 186 132 3652 3898 22 127 2 8795 320 213 21247 13139 4 260 3588 838 213 2348 303 132 3573 186 3652 8 17355 5305 7 129 7 183 553 103 171 2002 9 1251 638 23 3596 15621 21 18836 1685 67 6 6811 1649 7 5 266 132 26652 1019 13608 26376 140 17355 691 19736 16 213 296 7 5 21247 4945 3 26651 4217 127 78 2102 285 184 10 59 3 726 5087 62 1151 12837 17 78 24070 1338 1838 6811 1649 7 5 2495 28 1295 3 5472 1971 4801 3767 9761 4 7326 4480 2690 6644 379 2328 25464 4 16860 20 2 213 13678 17263 320 213 257 366 2 14487 14607 7 5 8282 4090 16231 12978 285 117 28 4539 80 24752 963 8766 14201 2382 143 10862 1838 398 213 21247 21954 5 13342 320 245 95 26652 379 18427 10044 179 11 13678 2495 2792 24829 18674 279 1435 4114 205 6 655 1098 1221 132 213 1621 214 4060 8378 958 13139 5 1779 18 576 95 329 1766 13678 1397 29 3573 7 5 555 17263 465 15 296 848 561 7 5 399 320 2481 117 28 4539 80 7182 4 14201 2382 1838 4403 213 10846 71 28 8409 4811 7694 11969 7 3386 715 285 51 108 245 320 430 2479 320 13678 1098 1221 23 320 1151 1241 691 28 1839 186 19677 4 1587 691 3573 29725 4 5 1386 320 284 6856 6732 4666 1741 2 320 2406 4881 2 186 1011 1019 213 9445 1870 527 38 527 3573 29725 4 5 1038 3898 4217 127 132 218 11851 1782 577 824 170 1151 28 9160 6 61 922 4172 27439 9275 29 3573 29725 4 5 1386 18 320 4608 28 17400 320 191 800 2132 186 21852 5 78 4725 527 213 13678 101 132 356 320 1215 213 296 500 2002 200 16860 20 2266 145 163 2264 1248 14607 7 5 8282 4090 16231 12978 285 4217 229 12879 4371 21276 17187 113 192 26652 8967 5 1782 326 1435 38 669 27634 4 2951 320 18 27634 391 5157 3898 22 127 2 35 117 2754 51 18 132 3573 169 2 320 76 229 163 3865 3188 10220 200 124 33 19 1006 824 229 163 3865 3188 14065 16231 12978 26237 948 285 14253 605 213 296 5798 13204 7143 11352 5229 98 9 17110 229 24000 117 129 7 165 19 1821 51 7 165 19 2378 320 669 27884 4 18 28 27872 391 1729 3898 22 7801 1782 129 483 320 18 285 1729 3 200 51 7 165 1821 25800 285 1729 229 19 4651 3 10105 869 285 51 38 1670 500 214 28 3188 132 1271 8155 229 213 883 10220 2926 156 616 33 163 419 3 397 33 18 132 5005 2 1248 36 7182 4 14201 279 669 27978 391 33 39 18 28 4539 527 105 10220 321 25696 5 3898 22 14487 1782 11393 2 3538 527 285 3 321 1370 527 5259 35 6365 10220 449 7 5 213 1359 132 3573 2002 4217 793 1608 78 2613 285 22 229 8676 23422 4574 726 313 186 331 71 26652 320 2322 199 17073 3282 186 3472 412 41 1435 995 4347 132 3573 186 320 5288 3 207 39 1151 117 4574 1019 5421 3898 213 1251 638 127 2 192 7881 16 285 213 927 3345 511 7 26 1151 213 60 4252 132 163 5733 527 726 20667 13977 939 527 213 2548 285 4217 18792 17 214 132 472 186 420 3 514 1945 1131 43 127 2613 285 213 1155 229 4689 213 20135 527 28 325 26454 527 2019 7067 320 3573 2 2845 320 399 213 67 6 6811 1649 266 3732 213 3134 527 17355 3 4217 229 43 9868 862 3951 7784 908 10347 320 18921 81 17355 2 35 1945 1466 793 6676 1850 5646 78 3425 285 213 1203 14414 4 320 285 2085 229 351 2 19 20532 3 9 276 3283 5885 2 36 1175 127 2 229 2331 285 11012 17355 236 213 24926 4 169 285 14168 5371 23 1233 42 88 226 527 50 6400 13523 1044 1221 320 24366 4 213 606 143 3528 869 213 138 1019 5371 320 
12378 14 1597 2093 186 54 1948 132 2141 3573 3 16231 12978 8663 179 214 16860 20 2 16509 285 213 44 117 3865 3188 80 320 3573 1353 351 17467 24302 691 2495 9001 214 17110 229 379 17355 5506 320 2525 28 16113 587 76 163 2348 205 285 14610 5 404 8507 76 132 370 527 3652 186 3573 2 186 103 23 3361 809 518 1752 1397 132 213 72 628 11969 7 27 4539 527 105 15570 16860 20 465 213 532 24752 963 8766 14201 279 8 12370 21 48 1779 9022 14048 67 6 17664 7 5 128 37 253 11970 2860 132 213 257 366 2 143 141 1151 213 1182 229 17355 229 1325 320 1701 1195 379 4217 229 11145 111 28 1354 186 28 800 219 2 15112 16 320 923 28 184 10 59 4602 3489 266 132 219 132 26652 192 43 20873 28 3713 18191 186 501 20868 21 5371 809 28 55 7511 285 2348 19275 229 1772 1629 2890 4144 11458 3 16860 20 127 2613 285 647 181 19 213 1251 638 14484 78 28 2358 527 1047 4212 1248 22547 1619 2 3573 848 399 23587 3486 1782 129 18 46 1821 285 51 317 320 8364 89 2115 1248 523 9375 9062 2 13475 5241 19106 17037 5 186 469 17821 9 1945 169 15280 5 285 23587 988 2002 22 127 3 207 18 46 4591 320 476 2035 27634 4 129 1435 4591 320 399 4172 27634 391 397 51 1435 1821 229 51 972 3945 353 11822 3 27 884 23 320 1151 133 3 52 170 18 46 133 7356 1782 642 89 2539 3898 16860 20 127 948 213 23587 988 527 213 927 1435 1657 93 256 2677 186 44 6315 2677 11585 1 0 2328 16056 16860 20 127 2613 285 192 4217 4679 275 2685 3573 7 5 2279 1716 2 17355 229 8103 2273 16346 27439 6774 1628 23066 213 260 229 4194 2 22 127 2 103 39 424 28 1271 11970 3188 107 117 28 4539 80 7182 4 14201 2382 16346 27439 6774 1628 17355 39 245 117 92 25696 5 3898 22 14487 948 1290 2 3538 527 285 3 321 1370 527 5259 35 6365 6053 27439 6774 7583 7 397 51 18 132 3573 169 427 229 163 3865 3188 3898 213 17263 14487 16346 27439 6774 1628 14607 7 5 8282 4090 16231 12978 127 131 742 213 1023 3188 1353 28 21247 4945 285 5798 117 13204 7143 11352 5229 6053 27439 6774 1628 17355 2 213 2348 303 132 3573 186 3652 2 229 28 21247 13139 4 260 285 1353 1499 249 412 117 1909 6 17664 132 3573 7 1]

Things to notice:

  • First we see the token values corresponding to the words of the article.

  • The first 1, which represents the <EOS> tag of the article.

  • Followed by a 0, which represents a <pad> tag.

  • After the first 0 (<pad> tag), the values correspond to the words used in the summary of the article.

  • The second 1 represents the <EOS> tag for the summary.

  • All the trailing 0s represent <pad> tags which are appended to keep every example in the batch the same length (if you don't see any, the example is already at the maximum length in its batch); the short check below confirms this structure.
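Here is a small, ungraded check of that structure, using the input_batch and the special tokens defined above:

# Illustration only: locate the special tokens in the example printed above.
example = input_batch[0]
eos_positions = np.where(example == EOS)[0]              # positions of the <EOS> tokens
print('First <EOS> (end of article) at index:', eos_positions[0])
print('Separator right after it is 0:', example[eos_positions[0] + 1] == SEP)
print('Second <EOS> (end of summary) at index:', eos_positions[1])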

# print the article and its summary
print('Article:\n\n', detokenize(input_batch[0]))
Article: The Sunni extremists running roughshod across Iraq could produce 'a thousand' global terrorists like Osama bin Laden bent on widespread death and destruction, Iraq's ambassador to the United States warned on Monday. Lukman Faily said that if the United States and other nations focus too much on Iraq's internal politics and ignore the 'immediate threat' of a terrorist movement that's gathering steam, the results will be catastrophic. 'This is a global tumor in Iraq taking place now and in Syria,' he said, referring to the Sunni militant group calling itself the Islamic State in Iraq and Syria (ISIS). 'We've seen it before.' The White House has scolded Nouri al-Maliki's government in Baghdad for provoking ISIS by alienating the country's Sunni minority. Barack Obama said on Friday that U.S. military intervention would be conditioned on reconciliation efforts from Maliki's Shia majority. SCROLL DOWN FOR VIDEOS . Lukman Faily, the Iraqi ambassador to the United States, warned CNN's Christiane Amanpour that 'a thousand' Osama bin Ladens could emerge from among the Sunni extremists pushing to take over Baghdad . Pushing back: Iraqi Shiite tribesmen are joining state-run security forces in the fight against Jihadist militants who have taken over several northern Iraqi cities; Iraq's US ambassador says his country needs America's help to prevent 'a thousand' Bin Ladens from turning the globe into a terrorist shooting gallery . 'Any action that we may take to provide assistance to Iraqi security forces has to be joined by a serious and sincere effort by Iraq’s leaders to set aside sectarian differences, to promote stability, and account for the legitimate interests of all of Iraq’s communities,' Obama said in public remarks. 'So this should be a wake-up call.  Iraq’s leaders have to demonstrate a willingness to make hard decisions and compromises on behalf of the Iraqi people in order to bring the country together.' But Faily suggested during an interview with CNN's Christiane Amanpour that Obama is naively fiddling while Baghdad burns. 'These are all "nice to have" discussions,' he said, but 'what we have in Iraq now, to -- is an immediate threat.' 'But do you not feel this is an immediate threat?' Amanpour interrupted, 'that practically half the country feels disenfranchised? The Sunnis?' 'We're not saying we're not happy to [have a] discussion,' he responded. 'We want to have that discussion. But we're saying conditioning that discussion is not wise. Making clear that we all stand together against a threat in global terrorism is the question.' 'Let me give you an example. What you have in Afghanistan, with one Bin Laden – you will have a thousand of them.' 'No POWs,' he warned. 'Nothing, none of that. No rules of engagement but destruction.' 'That's the situation in Iraq.' Obama told Congress on Monday that he is sending 275 armed military men and women into Baghdad to protect American embassy personnel and assets as they are moved elsewhere in Iraq and to Jordan. They will be 'armed for combat,' the White House said, while insisting that the ground troops won't be the first drop in an ocean of military entanglements of the sort that Obama campaigned against in 2008 and 2012. An administration official also said Monday that the president is considering the deployment of a small contingent of Special Forces to Iraq, specifically to help the al-Maliki government slow the advance of ISIS. 
Obama is also mulling unilateral air strikes to hamper ISIS, but administration sources told MailOnline on Tuesday that the primary objection to that strategy is political, not tactical. The National Security Staff, one source said, is concerned that forcing ISIS off the battlefield now that neighboring Iran has sent 2,000 of its elite Quds forces to stabilize the region could effectively clear the way for Iran to seize oil fields and other lands in eastern Iraq. Amanpour pushed back against Faily, arguing that the more 'immediate threat' to Iraq was political inequality enforced by Shiites against Sunnis . ISIS aims to establish a caliphate -- an Islamic state that transcends national borders -- in areas of Syria and Iraq, and it has captured at least nine cities in the two countries . 'A thousand of them': Faily says the late Osama bin Laden (pictured), who masterminded al-Qaeda's 9/11 terror attacks in the United States, could just be the beginning is ISIS is allowed to press forward . Obama is stuck between a rock and a hard place, needing to keep a U.S.-friendly government in place in Baghdad while also avoiding a newly strengthened and further leveraged Iran at a time when that Islamic republic is moving toward nuclear weapons capability. Faily said Monday that whether or not the White House decides on a path of limited cooperation with Tehran, Iraq needs help urgently. 'We have been saying that we need to strengthen our army with having fighter planes, Apache helicopters and others. ... The administration now understands that urgency.' he said. They have been willing to say, "We are willing to help." What we are saying is we cannot wait until tomorrow. A decision has to be made. It should have been made yesterday. 'From our perspective,' Faily said, 'the urgency of the ground are giving us less options and more radical options.'<EOS><pad>LukanFaily said Monday that while Obama frets about Iraq's internal politics, ISIS is gathering strength . Unless the group is stopped, he said, it will become a global terror threat like 'a thousand' Bin Ladens . ISIS will take 'no POWs,' he warned, 'nothing, none of that. No rules of engagement but destruction' 'What we have in Iraq now ... is an immediate threat,' the ambassador warned . CNN's Christiane Amanpour said she thought the true threat was a Sunni minority that feels 'disenfranchised' ISIS, the Islamic State in Iraq and Syria, is a Sunni militant group that was previously known as 'Al-Qaeda in Iraq'<EOS>

You can see that the data has the following structure:

  • [Article] -> <EOS> -> <pad> -> [Article Summary] -> <EOS> -> (possibly) multiple <pad>

The loss is taken only on the summary, using cross entropy as the loss function.
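As a rough, ungraded illustration (not the exact Trax internals), the mask acts as per-token weights on the cross entropy, so article tokens contribute nothing to the loss:

# Rough sketch of how the mask weights the loss (illustration only, not the exact Trax internals).
def masked_cross_entropy(log_probs, targets, mask):
    # log_probs: (seqlen, vocab_size) log-probabilities; targets: (seqlen,) token ids; mask: (seqlen,) 0/1
    token_losses = -log_probs[np.arange(len(targets)), targets]
    return np.sum(token_losses * mask) / np.sum(mask)  # average only over the summary tokens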

Part 2: Summarization with transformer

Now that we have given you the data generator and have handled the preprocessing for you, it is time for you to build your own model. We saved you some time because we know you have already preprocessed data before in this specialization, so we would rather you spend your time doing the next steps.

You will be implementing attention from scratch and then using it in your transformer model. Concretely, you will come to understand how attention works and how it is used inside the transformer decoder.

2.1 Dot product attention

Now you will implement dot product attention, which takes in a query, key, value, and a mask, and returns the output.

Here are some helper functions that will help you create tensors and display useful information:

  • create_tensor creates a jax numpy array from a list of lists.

  • display_tensor prints out the shape and the actual tensor.

def create_tensor(t):
    """Create tensor from list of lists"""
    return jnp.array(t)

def display_tensor(t, name):
    """Display shape and tensor"""
    print(f'{name} shape: {t.shape}\n')
    print(f'{t}\n')

Before implementing it yourself, you can play around with a toy example of dot product attention without the softmax operation. Technically it would not be dot product attention without the softmax, but this is done to avoid giving away too much of the answer; the idea is to display these tensors to give you a sense of what they look like.

The formula for attention is this one:

$$\text{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}} + M\right) V \tag{1}$$

$d_{k}$ stands for the dimension of the queries and keys.

The query, key, value and mask vectors are provided for this example.

Notice that the masking is done using very negative values, which yield an effect similar to using $-\infty$.

q = create_tensor([[1, 0, 0], [0, 1, 0]])
display_tensor(q, 'query')
k = create_tensor([[1, 2, 3], [4, 5, 6]])
display_tensor(k, 'key')
v = create_tensor([[0, 1, 0], [1, 0, 1]])
display_tensor(v, 'value')
m = create_tensor([[0, 0], [-1e9, 0]])
display_tensor(m, 'mask')
query shape: (2, 3) [[1 0 0] [0 1 0]] key shape: (2, 3) [[1 2 3] [4 5 6]] value shape: (2, 3) [[0 1 0] [1 0 1]] mask shape: (2, 2) [[ 0.e+00 0.e+00] [-1.e+09 0.e+00]]

Expected Output:

query shape: (2, 3) [[1 0 0] [0 1 0]] key shape: (2, 3) [[1 2 3] [4 5 6]] value shape: (2, 3) [[0 1 0] [1 0 1]] mask shape: (2, 2) [[ 0.e+00 0.e+00] [-1.e+09 0.e+00]]
q_dot_k = q @ k.T / jnp.sqrt(3)
display_tensor(q_dot_k, 'query dot key')
query dot key shape: (2, 2) [[0.57735026 2.309401 ] [1.1547005 2.8867514 ]]

Expected Output:

query dot key shape: (2, 2) [[0.57735026 2.309401 ] [1.1547005 2.8867514 ]]
masked = q_dot_k + m
display_tensor(masked, 'masked query dot key')
masked query dot key shape: (2, 2) [[ 5.7735026e-01 2.3094010e+00] [-1.0000000e+09 2.8867514e+00]]

Expected Output:

masked query dot key shape: (2, 2) [[ 5.7735026e-01 2.3094010e+00] [-1.0000000e+09 2.8867514e+00]]
display_tensor(masked @ v, 'masked query dot key dot value')
masked query dot key dot value shape: (2, 3) [[ 2.3094010e+00 5.7735026e-01 2.3094010e+00] [ 2.8867514e+00 -1.0000000e+09 2.8867514e+00]]

Expected Output:

masked query dot key dot value shape: (2, 3) [[ 2.3094010e+00 5.7735026e-01 2.3094010e+00] [ 2.8867514e+00 -1.0000000e+09 2.8867514e+00]]

In order to use the previous dummy tensors to test some of the graded functions, a batch dimension should be added to them so they mimic the shape of real-life examples. The mask is also replaced by a version of it that resembles the one that is used by trax:

q_with_batch = q[None,:]
display_tensor(q_with_batch, 'query with batch dim')
k_with_batch = k[None,:]
display_tensor(k_with_batch, 'key with batch dim')
v_with_batch = v[None,:]
display_tensor(v_with_batch, 'value with batch dim')
m_bool = create_tensor([[True, True], [False, True]])
display_tensor(m_bool, 'boolean mask')
query with batch dim shape: (1, 2, 3) [[[1 0 0] [0 1 0]]] key with batch dim shape: (1, 2, 3) [[[1 2 3] [4 5 6]]] value with batch dim shape: (1, 2, 3) [[[0 1 0] [1 0 1]]] boolean mask shape: (2, 2) [[ True True] [False True]]

Expected Output:

query with batch dim shape: (1, 2, 3) [[[1 0 0] [0 1 0]]] key with batch dim shape: (1, 2, 3) [[[1 2 3] [4 5 6]]] value with batch dim shape: (1, 2, 3) [[[0 1 0] [1 0 1]]] boolean mask shape: (2, 2) [[ True True] [False True]]

Exercise 01

Instructions: Implement the dot product attention. Concretely, implement the following equation

$$\text{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}} + M\right) V \tag{1}$$

$Q$ - query, $K$ - key, $V$ - values, $M$ - mask, $d_{k}$ - depth/dimension of the queries and keys (used for scaling down)

You can implement this formula either with Trax's numpy (trax.fastmath.numpy, imported above as jnp) or regular numpy, but it is recommended to use jnp.

Something to take into consideration is that within trax, the masks are tensors of True/False values, not 0's and $-\infty$ as in the previous example. Within the graded function, don't apply the mask by summing matrices; instead use jnp.where() and treat the mask as a tensor of boolean values, with False for positions that need to be masked and True for the ones that don't.

Also take into account that the real tensors are far more complex than the toy ones you just played with. Because of this, avoid using shortened operations such as @ for dot product or .T for transposing. Use jnp.matmul() and jnp.swapaxes() instead.
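For instance, here is a tiny, ungraded illustration with the toy tensors from above; the boolean mask keeps the values where it is True and writes a large negative number where it is False:

# Tiny illustration of jnp.swapaxes() instead of .T and jnp.where() with a boolean mask.
scores = jnp.matmul(q_with_batch, jnp.swapaxes(k_with_batch, -1, -2))    # batch-safe version of q @ k.T
masked_scores = jnp.where(m_bool, scores, jnp.full_like(scores, -1e9))   # keep True positions, mask False ones
print(masked_scores)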

This is the self-attention block for the transformer decoder. Good luck!

# UNQ_C1
# GRADED FUNCTION: DotProductAttention
def DotProductAttention(query, key, value, mask):
    """Dot product self-attention.
    Args:
        query (jax.interpreters.xla.DeviceArray): array of query representations with shape (L_q by d)
        key (jax.interpreters.xla.DeviceArray): array of key representations with shape (L_k by d)
        value (jax.interpreters.xla.DeviceArray): array of value representations with shape (L_k by d) where L_v = L_k
        mask (jax.interpreters.xla.DeviceArray): attention-mask, gates attention with shape (L_q by L_k)

    Returns:
        jax.interpreters.xla.DeviceArray: Self-attention array for q, k, v arrays. (L_q by L_k)
    """

    assert query.shape[-1] == key.shape[-1] == value.shape[-1], "Embedding dimensions of q, k, v aren't all the same"

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # Save depth/dimension of the query embedding for scaling down the dot product
    depth = query.shape[-1]

    # Calculate scaled query key dot product according to formula above
    dots = jnp.matmul(query, jnp.swapaxes(key, -1, -2)) / jnp.sqrt(depth)

    # Apply the mask
    if mask is not None: # The 'None' in this line does not need to be replaced
        dots = jnp.where(mask, dots, jnp.full_like(dots, -1e9))

    # Softmax formula implementation
    # Use trax.fastmath.logsumexp of dots to avoid underflow by division by large numbers
    # Hint: Last axis should be used and keepdims should be True
    # Note: softmax = e^(dots - logsumexp(dots)) = E^dots / sumexp(dots)
    logsumexp = trax.fastmath.logsumexp(dots, axis=-1, keepdims=True)

    # Take exponential of dots minus logsumexp to get softmax
    # Use jnp.exp()
    dots = jnp.exp(dots - logsumexp)

    # Multiply dots by value to get self-attention
    # Use jnp.matmul()
    attention = jnp.matmul(dots, value)

    ## END CODE HERE ###

    return attention
DotProductAttention(q_with_batch, k_with_batch, v_with_batch, m_bool)
DeviceArray([[[0.8496746 , 0.15032545, 0.8496746 ], [1. , 0. , 1. ]]], dtype=float32)

Expected Output:

DeviceArray([[[0.8496746 , 0.15032545, 0.8496746 ], [1. , 0. , 1. ]]], dtype=float32)

2.2 Causal Attention

Now you are going to implement causal attention: multi-headed attention with a mask to attend only to words that occurred before.

In the image above, a word can see everything that is before it, but not what is after it. To implement causal attention, you will have to transform vectors and do many reshapes. You will need to implement the functions below.

Exercise 02

Implement the following functions that will be needed for Causal Attention:

  • compute_attention_heads : Gets an input $x$ of dimension (batch_size, seqlen, n_heads $\times$ d_head) and splits the last (depth) dimension and stacks it to the zeroth dimension to allow matrix multiplication (batch_size $\times$ n_heads, seqlen, d_head).

  • dot_product_self_attention : Creates a mask matrix with False values above the diagonal and True values below and calls DotProductAttention which implements dot product self attention.

  • compute_attention_output : Undoes compute_attention_heads by splitting the first (vertical) dimension and stacking it in the last (depth) dimension (batch_size, seqlen, n_heads $\times$ d_head). These operations concatenate (stack/merge) the heads.

Next there are some toy tensors which may serve to give you an idea of the data shapes and operations involved in Causal Attention. They are also useful to test out your functions!

tensor2d = create_tensor(q)
display_tensor(tensor2d, 'query matrix (2D tensor)')

tensor4d2b = create_tensor([[q, q], [q, q]])
display_tensor(tensor4d2b, 'batch of two (multi-head) collections of query matrices (4D tensor)')

tensor3dc = create_tensor([jnp.concatenate([q, q], axis = -1)])
display_tensor(tensor3dc, 'one batch of concatenated heads of query matrices (3d tensor)')

tensor3dc3b = create_tensor([jnp.concatenate([q, q], axis = -1),
                             jnp.concatenate([q, q], axis = -1),
                             jnp.concatenate([q, q], axis = -1)])
display_tensor(tensor3dc3b, 'three batches of concatenated heads of query matrices (3d tensor)')
query matrix (2D tensor) shape: (2, 3) [[1 0 0] [0 1 0]] batch of two (multi-head) collections of query matrices (4D tensor) shape: (2, 2, 2, 3) [[[[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]]] [[[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]]]] one batch of concatenated heads of query matrices (3d tensor) shape: (1, 2, 6) [[[1 0 0 1 0 0] [0 1 0 0 1 0]]] three batches of concatenated heads of query matrices (3d tensor) shape: (3, 2, 6) [[[1 0 0 1 0 0] [0 1 0 0 1 0]] [[1 0 0 1 0 0] [0 1 0 0 1 0]] [[1 0 0 1 0 0] [0 1 0 0 1 0]]]

It is important to know that the following 3 functions would normally be defined within the CausalAttention function further below.

However, this makes these functions harder to test. Because of this, they are shown individually, using a closure (when necessary) to simulate being inside the CausalAttention function, since they rely on variables that are accessible from within CausalAttention.

Support Functions

compute_attention_heads : Gets an input $x$ of dimension (batch_size, seqlen, n_heads $\times$ d_head) and splits the last (depth) dimension and stacks it to the zeroth dimension to allow matrix multiplication (batch_size $\times$ n_heads, seqlen, d_head).

For the closures you only have to fill in the inner function.

# UNQ_C2
# GRADED FUNCTION: compute_attention_heads_closure
def compute_attention_heads_closure(n_heads, d_head):
    """ Function that simulates environment inside CausalAttention function.
    Args:
        d_head (int): dimensionality of heads.
        n_heads (int): number of attention heads.
    Returns:
        function: compute_attention_heads function
    """

    def compute_attention_heads(x):
        """ Compute the attention heads.
        Args:
            x (jax.interpreters.xla.DeviceArray): tensor with shape (batch_size, seqlen, n_heads X d_head).
        Returns:
            jax.interpreters.xla.DeviceArray: reshaped tensor with shape (batch_size X n_heads, seqlen, d_head).
        """
        ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

        # Size of the x's batch dimension
        batch_size = x.shape[0]
        # Length of the sequence
        # Should be size of x's first dimension without counting the batch dim
        seqlen = x.shape[1]
        # Reshape x using jnp.reshape()
        # batch_size, seqlen, n_heads*d_head -> batch_size, seqlen, n_heads, d_head
        x = jnp.reshape(x, (batch_size, seqlen, n_heads, d_head))
        # Transpose x using jnp.transpose()
        # batch_size, seqlen, n_heads, d_head -> batch_size, n_heads, seqlen, d_head
        # Note that the values within the tuple are the indexes of the dimensions of x and you must rearrange them
        x = jnp.transpose(x, (0, 2, 1, 3))
        # Reshape x using jnp.reshape()
        # batch_size, n_heads, seqlen, d_head -> batch_size*n_heads, seqlen, d_head
        x = jnp.reshape(x, (-1, seqlen, d_head))

        ### END CODE HERE ###

        return x

    return compute_attention_heads
display_tensor(tensor3dc3b, "input tensor")
result_cah = compute_attention_heads_closure(2,3)(tensor3dc3b)
display_tensor(result_cah, "output tensor")
input tensor shape: (3, 2, 6) [[[1 0 0 1 0 0] [0 1 0 0 1 0]] [[1 0 0 1 0 0] [0 1 0 0 1 0]] [[1 0 0 1 0 0] [0 1 0 0 1 0]]] output tensor shape: (6, 2, 3) [[[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]]]

Expected Output:

input tensor shape: (3, 2, 6) [[[1 0 0 1 0 0] [0 1 0 0 1 0]] [[1 0 0 1 0 0] [0 1 0 0 1 0]] [[1 0 0 1 0 0] [0 1 0 0 1 0]]] output tensor shape: (6, 2, 3) [[[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]]]

dot_product_self_attention : Creates a mask matrix with False values above the diagonal and True values below and calls DotProductAttention which implements dot product self attention.
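To visualize the causal mask this function builds, here is a small, ungraded illustration for a sequence of length 4:

# Illustration only: the causal mask for a sequence of length 4.
# True on and below the diagonal (a position may attend to itself and to earlier positions),
# False above it (no attention to future positions).
print(jnp.tril(jnp.ones((1, 4, 4), dtype=jnp.bool_), k=0))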

# UNQ_C3
# GRADED FUNCTION: dot_product_self_attention
def dot_product_self_attention(q, k, v):
    """ Masked dot product self attention.
    Args:
        q (jax.interpreters.xla.DeviceArray): queries.
        k (jax.interpreters.xla.DeviceArray): keys.
        v (jax.interpreters.xla.DeviceArray): values.
    Returns:
        jax.interpreters.xla.DeviceArray: masked dot product self attention tensor.
    """

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # Hint: mask size should be equal to L_q. Remember that q has shape (batch_size, L_q, d)
    mask_size = q.shape[-2]

    # Creates a matrix with ones below the diagonal and 0s above. It should have shape (1, mask_size, mask_size)
    # Notice that 1's and 0's get casted to True/False by setting dtype to jnp.bool_
    # Use jnp.tril() - Lower triangle of an array and jnp.ones()
    mask = jnp.tril(jnp.ones((1, mask_size, mask_size), dtype=jnp.bool_), k=0)

    ### END CODE HERE ###

    return DotProductAttention(q, k, v, mask)
dot_product_self_attention(q_with_batch, k_with_batch, v_with_batch)
DeviceArray([[[0. , 1. , 0. ], [0.8496746 , 0.15032543, 0.8496746 ]]], dtype=float32)

Expected Output:

DeviceArray([[[0. , 1. , 0. ], [0.8496746 , 0.15032543, 0.8496746 ]]], dtype=float32)

compute_attention_output : Undoes compute_attention_heads by splitting the first (vertical) dimension and stacking it in the last (depth) dimension (batch_size, seqlen, n_heads $\times$ d_head). These operations concatenate (stack/merge) the heads.

# UNQ_C4
# GRADED FUNCTION: compute_attention_output_closure
def compute_attention_output_closure(n_heads, d_head):
    """ Function that simulates environment inside CausalAttention function.
    Args:
        d_head (int): dimensionality of heads.
        n_heads (int): number of attention heads.
    Returns:
        function: compute_attention_output function
    """

    def compute_attention_output(x):
        """ Compute the attention output.
        Args:
            x (jax.interpreters.xla.DeviceArray): tensor with shape (batch_size X n_heads, seqlen, d_head).
        Returns:
            jax.interpreters.xla.DeviceArray: reshaped tensor with shape (batch_size, seqlen, n_heads X d_head).
        """
        ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

        # Length of the sequence
        # Should be size of x's first dimension without counting the batch dim
        seqlen = x.shape[1]
        # Reshape x using jnp.reshape() to shape (batch_size, n_heads, seqlen, d_head)
        x = jnp.reshape(x, (-1, n_heads, seqlen, d_head))
        # Transpose x using jnp.transpose() to shape (batch_size, seqlen, n_heads, d_head)
        x = jnp.transpose(x, (0, 2, 1, 3))

        ### END CODE HERE ###

        # Reshape to allow to concatenate the heads
        return jnp.reshape(x, (-1, seqlen, n_heads * d_head))

    return compute_attention_output
display_tensor(result_cah, "input tensor")
result_cao = compute_attention_output_closure(2,3)(result_cah)
display_tensor(result_cao, "output tensor")
input tensor shape: (6, 2, 3) [[[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]]] output tensor shape: (3, 2, 6) [[[1 0 0 1 0 0] [0 1 0 0 1 0]] [[1 0 0 1 0 0] [0 1 0 0 1 0]] [[1 0 0 1 0 0] [0 1 0 0 1 0]]]

Expected Output:

input tensor shape: (6, 2, 3) [[[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]] [[1 0 0] [0 1 0]]] output tensor shape: (3, 2, 6) [[[1 0 0 1 0 0] [0 1 0 0 1 0]] [[1 0 0 1 0 0] [0 1 0 0 1 0]] [[1 0 0 1 0 0] [0 1 0 0 1 0]]]

Causal Attention Function

Now it is time for you to put everything together within the CausalAttention or Masked multi-head attention function:

Instructions: Implement the causal attention. Your model returns the causal attention through a tl.Serial with the following:

  • tl.Branch : consisting of 3 [tl.Dense(d_feature), ComputeAttentionHeads] to account for the queries, keys, and values.

  • tl.Fn: Takes in the dot_product_self_attention function and uses it to compute the dot product using $Q$, $K$, $V$.

  • tl.Fn: Takes in compute_attention_output_closure to allow for parallel computing.

  • tl.Dense: Final Dense layer, with dimension d_feature.

Remember that in order for trax to handle the functions you just defined properly, they need to be added as layers using the tl.Fn() function.
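If tl.Fn() is new to you, here is a minimal, ungraded sketch of what it does: it wraps a plain Python function as a Trax layer (the 'Double' layer below is hypothetical and just for illustration):

# Illustration only: wrapping a plain function as a Trax layer with tl.Fn().
Double = tl.Fn('Double', lambda x: x * 2, n_out=1)
print(Double)                                  # a named, weightless layer
print(Double(jnp.array([1.0, 2.0, 3.0])))      # applying the layer: [2. 4. 6.]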

# UNQ_C5
# GRADED FUNCTION: CausalAttention
def CausalAttention(d_feature,
                    n_heads,
                    compute_attention_heads_closure=compute_attention_heads_closure,
                    dot_product_self_attention=dot_product_self_attention,
                    compute_attention_output_closure=compute_attention_output_closure,
                    mode='train'):
    """Transformer-style multi-headed causal attention.

    Args:
        d_feature (int): dimensionality of feature embedding.
        n_heads (int): number of attention heads.
        compute_attention_heads_closure (function): Closure around compute_attention heads.
        dot_product_self_attention (function): dot_product_self_attention function.
        compute_attention_output_closure (function): Closure around compute_attention_output.
        mode (str): 'train' or 'eval'.

    Returns:
        trax.layers.combinators.Serial: Multi-headed self-attention model.
    """

    assert d_feature % n_heads == 0
    d_head = d_feature // n_heads

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # HINT: The second argument to tl.Fn() is an uncalled function (without the parentheses)
    # Since you are dealing with closures you might need to call the outer
    # function with the correct parameters to get the actual uncalled function.
    ComputeAttentionHeads = tl.Fn('AttnHeads', compute_attention_heads_closure(n_heads, d_head), n_out=1)

    return tl.Serial(
        tl.Branch( # creates three towers for one input, takes activations and creates queries keys and values
            [tl.Dense(d_feature), ComputeAttentionHeads], # queries
            [tl.Dense(d_feature), ComputeAttentionHeads], # keys
            [tl.Dense(d_feature), ComputeAttentionHeads], # values
        ),

        tl.Fn('DotProductAttn', dot_product_self_attention, n_out=1), # takes QKV
        # HINT: The second argument to tl.Fn() is an uncalled function
        # Since you are dealing with closures you might need to call the outer
        # function with the correct parameters to get the actual uncalled function.
        tl.Fn('AttnOutput', compute_attention_output_closure(n_heads, d_head), n_out=1), # to allow for parallel
        tl.Dense(d_feature) # Final dense layer
    )

    ### END CODE HERE ###
# Take a look at the causal attention model
print(CausalAttention(d_feature=512, n_heads=8))
Serial[ Branch_out3[ [Dense_512, AttnHeads] [Dense_512, AttnHeads] [Dense_512, AttnHeads] ] DotProductAttn_in3 AttnOutput Dense_512 ]

Expected Output:

Serial[ Branch_out3[ [Dense_512, AttnHeads] [Dense_512, AttnHeads] [Dense_512, AttnHeads] ] DotProductAttn_in3 AttnOutput Dense_512 ]

2.3 Transformer decoder block

Now that you have implemented the causal attention part of the transformer, you will implement the transformer decoder block. Concretely, you will now be implementing the architecture shown in the figure.

To implement this function, you will have to call the CausalAttention or Masked multi-head attention function you implemented above. You will have to add a feedforward block, which (as in the code below) consists of: a tl.LayerNorm(), a tl.Dense() layer of size d_ff followed by the activation passed in as ff_activation (generally ReLU), a tl.Dropout(), a second tl.Dense() layer projecting back to d_model, and another tl.Dropout().

Finally once you implement the feedforward, you can go ahead and implement the entire block using:

  • tl.Residual : takes in the tl.LayerNorm(), the causal attention block, and tl.Dropout.

  • tl.Residual : takes in the feedforward block you will implement. (A short illustration of what tl.Residual does follows below.)
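As a reminder, tl.Residual adds its input back to the output of the layers it wraps, i.e. it computes x + f(x). Here is a minimal, ungraded sketch with a toy layer standing in for the attention/feed-forward blocks:

# Illustration only: tl.Residual computes x + f(x) for the wrapped layers f.
from trax import shapes

double_layer = tl.Fn('Double', lambda x: x * 2, n_out=1)       # toy stand-in layer
residual_double = tl.Residual(double_layer)
residual_double.init(shapes.signature(jnp.array([1.0, 2.0, 3.0])))
print(residual_double(jnp.array([1.0, 2.0, 3.0])))             # x + 2x = [3. 6. 9.]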

Exercise 03

Instructions: Implement the transformer decoder block. Good luck!

# UNQ_C6
# GRADED FUNCTION: DecoderBlock
def DecoderBlock(d_model, d_ff, n_heads,
                 dropout, mode, ff_activation):
    """Returns a list of layers that implements a Transformer decoder block.

    The input is an activation tensor.

    Args:
        d_model (int): depth of embedding.
        d_ff (int): depth of feed-forward layer.
        n_heads (int): number of attention heads.
        dropout (float): dropout rate (how much to drop out).
        mode (str): 'train' or 'eval'.
        ff_activation (function): the non-linearity in feed-forward layer.

    Returns:
        list: list of trax.layers.combinators.Serial that maps an activation tensor to an activation tensor.
    """

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # Create masked multi-head attention block using CausalAttention function
    causal_attention = CausalAttention(
                        d_model,
                        n_heads=n_heads,
                        mode=mode
                        )

    # Create feed-forward block (list) with two dense layers with dropout and input normalized
    feed_forward = [
        # Normalize layer inputs
        tl.LayerNorm(),
        # Add first feed forward (dense) layer (don't forget to set the correct value for n_units)
        tl.Dense(d_ff),
        # Add activation function passed in as a parameter (you need to call it!)
        ff_activation(), # Generally ReLU
        # Add dropout with rate and mode specified (i.e., don't use dropout during evaluation)
        tl.Dropout(rate=dropout, mode=mode),
        # Add second feed forward layer (don't forget to set the correct value for n_units)
        tl.Dense(d_model),
        # Add dropout with rate and mode specified (i.e., don't use dropout during evaluation)
        tl.Dropout(rate=dropout, mode=mode)
    ]

    # Add list of two Residual blocks: the attention with normalization and dropout and feed-forward blocks
    return [
      tl.Residual(
          # Normalize layer input
          tl.LayerNorm(),
          # Add causal attention block previously defined (without parentheses)
          causal_attention,
          # Add dropout with rate and mode specified
          tl.Dropout(rate=dropout, mode=mode)
        ),
      tl.Residual(
          # Add feed forward block (without parentheses)
          feed_forward
        ),
      ]
    ### END CODE HERE ###
# Take a look at the decoder block
print(DecoderBlock(d_model=512, d_ff=2048, n_heads=8, dropout=0.1, mode='train', ff_activation=tl.Relu))
[Serial[ Branch_out2[ None Serial[ LayerNorm Serial[ Branch_out3[ [Dense_512, AttnHeads] [Dense_512, AttnHeads] [Dense_512, AttnHeads] ] DotProductAttn_in3 AttnOutput Dense_512 ] Dropout ] ] Add_in2 ], Serial[ Branch_out2[ None Serial[ LayerNorm Dense_2048 Relu Dropout Dense_512 Dropout ] ] Add_in2 ]]

Expected Output:

[Serial[ Branch_out2[ None Serial[ LayerNorm Serial[ Branch_out3[ [Dense_512, AttnHeads] [Dense_512, AttnHeads] [Dense_512, AttnHeads] ] DotProductAttn_in3 AttnOutput Dense_512 ] Dropout ] ] Add_in2 ], Serial[ Branch_out2[ None Serial[ LayerNorm Dense_2048 Relu Dropout Dense_512 Dropout ] ] Add_in2 ]]

2.4 Transformer Language Model

You will now bring it all together. In this part you will use all the subcomponents you previously built to make the final model. Concretely, you will be implementing the architecture shown in the figure.

Exercise 04

Instructions: Previously you coded the decoder block. Now you will code the transformer language model. Here is what you will need (see the code below): a tl.ShiftRight layer for teacher forcing, a positional encoder (tl.Embedding, tl.Dropout, and tl.PositionalEncoding), a stack of n_layers decoder blocks, a tl.LayerNorm, a tl.Dense layer with vocab_size units, and a tl.LogSoftmax.

Go go go!! You can do it 😃

# UNQ_C7
# GRADED FUNCTION: TransformerLM
def TransformerLM(vocab_size=33300,
                  d_model=512,
                  d_ff=2048,
                  n_layers=6,
                  n_heads=8,
                  dropout=0.1,
                  max_len=4096,
                  mode='train',
                  ff_activation=tl.Relu):
    """Returns a Transformer language model.

    The input to the model is a tensor of tokens. (This model uses only the
    decoder part of the overall Transformer.)

    Args:
        vocab_size (int): vocab size.
        d_model (int): depth of embedding.
        d_ff (int): depth of feed-forward layer.
        n_layers (int): number of decoder layers.
        n_heads (int): number of attention heads.
        dropout (float): dropout rate (how much to drop out).
        max_len (int): maximum symbol length for positional encoding.
        mode (str): 'train', 'eval' or 'predict', predict mode is for fast inference.
        ff_activation (function): the non-linearity in feed-forward layer.

    Returns:
        trax.layers.combinators.Serial: A Transformer language model as a layer that maps from a tensor of tokens
        to activations over a vocab set.
    """

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # Embedding inputs and positional encoder
    positional_encoder = [
        # Add embedding layer of dimension (vocab_size, d_model)
        tl.Embedding(vocab_size, d_model),
        # Use dropout with rate and mode specified
        tl.Dropout(rate=dropout, mode=mode),
        # Add positional encoding layer with maximum input length and mode specified
        tl.PositionalEncoding(max_len=max_len, mode=mode)]

    # Create stack (list) of decoder blocks with n_layers with necessary parameters
    decoder_blocks = [
        DecoderBlock(d_model, d_ff, n_heads, dropout, mode, ff_activation) for _ in range(n_layers)]

    # Create the complete model as written in the figure
    return tl.Serial(
        # Use teacher forcing (feed output of previous step to current step)
        tl.ShiftRight(mode=mode), # Specify the mode!
        # Add positional encoder
        positional_encoder,
        # Add decoder blocks
        decoder_blocks,
        # Normalize layer
        tl.LayerNorm(),

        # Add dense layer of vocab_size (since need to select a word to translate to)
        # (a.k.a., logits layer. Note: activation already set by ff_activation)
        tl.Dense(vocab_size),
        # Get probabilities with Logsoftmax
        tl.LogSoftmax()
    )

    ### END CODE HERE ###
# Take a look at the Transformer
print(TransformerLM(n_layers=1))
Serial[ ShiftRight(1) Embedding_33300_512 Dropout PositionalEncoding Serial[ Branch_out2[ None Serial[ LayerNorm Serial[ Branch_out3[ [Dense_512, AttnHeads] [Dense_512, AttnHeads] [Dense_512, AttnHeads] ] DotProductAttn_in3 AttnOutput Dense_512 ] Dropout ] ] Add_in2 ] Serial[ Branch_out2[ None Serial[ LayerNorm Dense_2048 Relu Dropout Dense_512 Dropout ] ] Add_in2 ] LayerNorm Dense_33300 LogSoftmax ]

Expected Output:

Serial[ ShiftRight(1) Embedding_33300_512 Dropout PositionalEncoding Serial[ Branch_out2[ None Serial[ LayerNorm Serial[ Branch_out3[ [Dense_512, AttnHeads] [Dense_512, AttnHeads] [Dense_512, AttnHeads] ] DotProductAttn_in3 AttnOutput Dense_512 ] Dropout ] ] Add_in2 ] Serial[ Branch_out2[ None Serial[ LayerNorm Dense_2048 Relu Dropout Dense_512 Dropout ] ] Add_in2 ] LayerNorm Dense_33300 LogSoftmax ]

Part 3: Training

Now you are going to train your model. As usual, you have to define the cost function and the optimizer, and decide whether you will be training on a GPU or a CPU. In this case, you will train your model on a CPU for a few steps, and then we will load in a pre-trained model that you can use to predict with your own words.

3.1 Training the model

You will now write a function that takes in your model and trains it. To train your model you have to decide how many times you want to iterate over the entire dataset; each such iteration is defined as an epoch. For each epoch, you have to go over all the data using your training iterator.

Exercise 05

Instructions: Implement the training_loop function below to train the neural network above. Concretely, you need to create a training.TrainTask (with the training generator, loss, optimizer, and learning-rate schedule) and a training.EvalTask (with the evaluation generator and metrics), as in the code below.

You will be using a cross entropy loss with the Adam optimizer. Please read the Trax documentation to get a full understanding.

The training loop that this function returns can be run using the run() method by passing in the desired number of steps.

from trax.supervised import training

# UNQ_C8
# GRADED FUNCTION: train_model
def training_loop(TransformerLM, train_gen, eval_gen, output_dir = "~/model"):
    '''
    Input:
        TransformerLM (trax.layers.combinators.Serial): The model you are building.
        train_gen (generator): Training stream of data.
        eval_gen (generator): Evaluation stream of data.
        output_dir (str): folder to save your file.

    Returns:
        trax.supervised.training.Loop: Training loop.
    '''
    output_dir = os.path.expanduser(output_dir)  # trainer is an object
    lr_schedule = trax.lr.warmup_and_rsqrt_decay(n_warmup_steps=1000, max_value=0.01)

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    train_task = training.TrainTask(
      labeled_data=train_gen, # The training generator
      loss_layer=tl.CrossEntropyLoss(), # Loss function
      optimizer=trax.optimizers.Adam(0.01), # Optimizer (Don't forget to set LR to 0.01)
      lr_schedule=lr_schedule,
      n_steps_per_checkpoint=10
    )

    eval_task = training.EvalTask(
      labeled_data=eval_gen, # The evaluation generator
      metrics=[tl.CrossEntropyLoss(), tl.Accuracy()] # CrossEntropyLoss and Accuracy
    )

    ### END CODE HERE ###

    loop = training.Loop(TransformerLM(d_model=4,
                                       d_ff=16,
                                       n_layers=1,
                                       n_heads=2,
                                       mode='train'),
                         train_task,
                         eval_tasks=[eval_task],
                         output_dir=output_dir)

    return loop

Notice that the model will be trained for only 10 steps.

Even with this constraint, the model with the original default arguments took a very long time to finish. Because of this, some parameters are changed when defining the model that is fed into the training loop in the function above.

# Should take around 1.5 minutes
!rm -f ~/model/model.pkl.gz
loop = training_loop(TransformerLM, train_batch_stream, eval_batch_stream)
loop.run(10)
Step 1: Ran 1 train steps in 10.03 secs
Step 1: train CrossEntropyLoss | 10.40958691
Step 1: eval CrossEntropyLoss | 10.41131115
Step 1: eval Accuracy | 0.00000000
Step 10: Ran 9 train steps in 56.81 secs
Step 10: train CrossEntropyLoss | 10.41096878
Step 10: eval CrossEntropyLoss | 10.41013813
Step 10: eval Accuracy | 0.00000000

Part 4: Evaluation

4.1 Loading in a trained model

In this part you will evaluate by loading in an almost exact version of the model you coded, but we trained it for you to save you time. Please run the cell below to load in the model.

As you may have already noticed the model that you trained and the pretrained model share the same overall architecture but they have different values for some of the parameters:

Original (pretrained) model:

TransformerLM(vocab_size=33300, d_model=512, d_ff=2048, n_layers=6, n_heads=8, dropout=0.1, max_len=4096, ff_activation=tl.Relu)

Your model:

TransformerLM(d_model=4, d_ff=16, n_layers=1, n_heads=2)

Only the parameters shown for your model were changed. The others stayed the same.

# Get the model architecture
model = TransformerLM(mode='eval')

# Load the pre-trained weights
model.init_from_file('model.pkl.gz', weights_only=True)

Part 5: Testing with your own input

You will now test the model with your own input. You are going to implement greedy decoding, which consists of two functions. The first one allows you to identify the next symbol: it takes the argmax of the output of your model and returns that index.

Exercise 06

Instructions: Implement the next symbol function that takes in the cur_output_tokens and the trained model to return the index of the next word.

# UNQ_C9
def next_symbol(cur_output_tokens, model):
    """Returns the next symbol for a given sentence.

    Args:
        cur_output_tokens (list): tokenized sentence with EOS and PAD tokens at the end.
        model (trax.layers.combinators.Serial): The transformer model.

    Returns:
        int: tokenized symbol.
    """
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # current output tokens length
    token_length = len(cur_output_tokens)
    # calculate the minimum power of 2 big enough to store token_length
    # HINT: use np.ceil() and np.log2()
    # add 1 to token_length so np.log2() doesn't receive 0 when token_length is 0
    padded_length = 2**int(np.ceil(np.log2(token_length + 1)))

    # Fill cur_output_tokens with 0's until it reaches padded_length
    padded = cur_output_tokens + [0] * (padded_length - token_length)
    padded_with_batch = np.array(padded)[None, :]  # Don't replace this 'None'! This is a way of setting the batch dim

    # model expects a tuple containing two padded tensors (with batch)
    output, _ = model((padded_with_batch, padded_with_batch))
    # HINT: output has shape (1, padded_length, vocab_size)
    # To get log_probs you need to index output with 0 in the first dim,
    # token_length in the second dim, and all of the entries in the last dim.
    log_probs = output[0, token_length, :]
    ### END CODE HERE ###

    return int(np.argmax(log_probs))
# Test it out!
sentence_test_nxt_symbl = "I want to fly in the sky."
detokenize([next_symbol(tokenize(sentence_test_nxt_symbl) + [0], model)])
'The'

Expected Output:

'The'

5.1 Greedy decoding

Now you will implement the greedy_decode algorithm, which calls the next_symbol function. It takes in the input sentence and the trained model, and returns the decoded sentence.

Exercise 07

Instructions: Implement the greedy_decode algorithm.

# UNQ_C10
# Decoding functions.
def greedy_decode(input_sentence, model):
    """Greedy decode function.

    Args:
        input_sentence (string): a sentence or article.
        model (trax.layers.combinators.Serial): Transformer model.

    Returns:
        string: summary of the input.
    """
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # Use tokenize()
    cur_output_tokens = tokenize(input_sentence) + [0]
    generated_output = []
    cur_output = 0
    EOS = 1

    while cur_output != EOS:
        # Get next symbol
        cur_output = next_symbol(cur_output_tokens, model)
        # Append next symbol to original sentence
        cur_output_tokens.append(cur_output)
        # Append next symbol to generated sentence
        generated_output.append(cur_output)
        print(detokenize(generated_output))
    ### END CODE HERE ###

    return detokenize(generated_output)
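One practical caveat: as written, the loop only stops when the model emits the EOS token, so a poorly trained model could, in principle, generate forever. A common safeguard (not required for the graded solution) is to also cap the number of generated tokens. The variant below, greedy_decode_with_limit, is a hypothetical sketch of that idea; it reuses the tokenize, detokenize, and next_symbol helpers already defined in this notebook.

def greedy_decode_with_limit(input_sentence, model, max_len=512):
    """Greedy decoding that stops at EOS or after max_len generated tokens."""
    EOS = 1
    cur_output_tokens = tokenize(input_sentence) + [0]
    generated_output = []

    while len(generated_output) < max_len:      # hard cap so decoding always terminates
        cur_output = next_symbol(cur_output_tokens, model)
        cur_output_tokens.append(cur_output)
        generated_output.append(cur_output)
        if cur_output == EOS:                   # stop as soon as the model emits EOS
            break

    return detokenize(generated_output)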
# Test it out on a sentence!
test_sentence = "It was a sunny day when I went to the market to buy some flowers. But I only found roses, not tulips."
print(wrapper.fill(test_sentence), '\n')
print(greedy_decode(test_sentence, model))
It was a sunny day when I went to the market to buy some flowers. But I only found roses, not tulips. : : I : I just : I just found : I just found ros : I just found roses : I just found roses, : I just found roses, not : I just found roses, not tu : I just found roses, not tulips : I just found roses, not tulips : I just found roses, not tulips. : I just found roses, not tulips.<EOS> : I just found roses, not tulips.<EOS>

Expected Output:

: : I : I just : I just found : I just found ros : I just found roses : I just found roses, : I just found roses, not : I just found roses, not tu : I just found roses, not tulips : I just found roses, not tulips : I just found roses, not tulips. : I just found roses, not tulips.<EOS> : I just found roses, not tulips.<EOS>
# Test it out with a whole article! article = "It’s the posing craze sweeping the U.S. after being brought to fame by skier Lindsey Vonn, soccer star Omar Cummings, baseball player Albert Pujols - and even Republican politician Rick Perry. But now four students at Riverhead High School on Long Island, New York, have been suspended for dropping to a knee and taking up a prayer pose to mimic Denver Broncos quarterback Tim Tebow. Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were all suspended for one day because the ‘Tebowing’ craze was blocking the hallway and presenting a safety hazard to students. Scroll down for video. Banned: Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll (all pictured left) were all suspended for one day by Riverhead High School on Long Island, New York, for their tribute to Broncos quarterback Tim Tebow. Issue: Four of the pupils were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze at the school was blocking the hallway and presenting a safety hazard to students." print(wrapper.fill(article), '\n') print(greedy_decode(article, model))
It’s the posing craze sweeping the U.S. after being brought to fame by skier Lindsey Vonn, soccer star Omar Cummings, baseball player Albert Pujols - and even Republican politician Rick Perry. But now four students at Riverhead High School on Long Island, New York, have been suspended for dropping to a knee and taking up a prayer pose to mimic Denver Broncos quarterback Tim Tebow. Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were all suspended for one day because the ‘Tebowing’ craze was blocking the hallway and presenting a safety hazard to students. Scroll down for video. Banned: Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll (all pictured left) were all suspended for one day by Riverhead High School on Long Island, New York, for their tribute to Broncos quarterback Tim Tebow. Issue: Four of the pupils were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze at the school was blocking the hallway and presenting a safety hazard to students. Jordan Jordan Ful Jordan Fulcol Jordan Fulcoly Jordan Fulcoly, Jordan Fulcoly, Wayne Jordan Fulcoly, Wayne Dre Jordan Fulcoly, Wayne Drexe Jordan Fulcoly, Wayne Drexel Jordan Fulcoly, Wayne Drexel, Jordan Fulcoly, Wayne Drexel, Tyler Jordan Fulcoly, Wayne Drexel, Tyler Carroll Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. 
Four students were suspended for one day because they allegedly did not Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not hee Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warn Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the ' Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Te Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebow Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' cra Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was blocki Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was blocking Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. 
Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was blocking the Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was blocking the hall Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was blocking the hallway Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was blocking the hallway and Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was blocking the hallway and presenting Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was blocking the hallway and presenting a Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was blocking the hallway and presenting a safety Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was blocking the hallway and presenting a safety hazard Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was blocking the hallway and presenting a safety hazard to Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was blocking the hallway and presenting a safety hazard to students Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was blocking the hallway and presenting a safety hazard to students. Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was blocking the hallway and presenting a safety hazard to students.<EOS> Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was blocking the hallway and presenting a safety hazard to students.<EOS>

Expected Output:

Jordan Jordan Ful Jordan Fulcol Jordan Fulcoly Jordan Fulcoly, Jordan Fulcoly, Wayne Jordan Fulcoly, Wayne Dre Jordan Fulcoly, Wayne Drexe Jordan Fulcoly, Wayne Drexel Jordan Fulcoly, Wayne Drexel, . . . Final summary: Jordan Fulcoly, Wayne Drexel, Tyler Carroll and Connor Carroll were suspended for one day. Four students were suspended for one day because they allegedly did not heed to warnings that the 'Tebowing' craze was blocking the hallway and presenting a safety hazard to students.<EOS>

Congratulations on finishing this week's assignment! You did a lot of work, and you should now have a better understanding of the decoder part of Transformers and how Transformers can be used for text summarization.

Keep it up!