GitHub Repository: amanchadha/coursera-deep-learning-specialization
Path: blob/master/C5 - Sequence Models/Week 4/C5_W4_A1_Transformer_Subclass_v1.py
#!/usr/bin/env python
# coding: utf-8

# # Transformer Network
#
# Welcome to Week 4's assignment, the last assignment of Course 5 of the Deep Learning Specialization! And congratulations on making it to the last assignment of the entire Deep Learning Specialization - you're almost done!
#
# Earlier in the course, you implemented sequential neural networks such as RNNs, GRUs, and LSTMs. In this notebook you'll explore the Transformer architecture, a neural network that takes advantage of parallel processing and allows you to substantially speed up the training process.
#
# **After this assignment you'll be able to**:
#
# * Create positional encodings to capture sequential relationships in data
# * Calculate scaled dot-product self-attention with word embeddings
# * Implement masked multi-head attention
# * Build and train a Transformer model
#
# For the last time, let's get started!

# ## Table of Contents
#
# - [Packages](#0)
# - [1 - Positional Encoding](#1)
#     - [1.1 - Sine and Cosine Angles](#1-1)
#         - [Exercise 1 - get_angles](#ex-1)
#     - [1.2 - Sine and Cosine Positional Encodings](#1-2)
#         - [Exercise 2 - positional_encoding](#ex-2)
# - [2 - Masking](#2)
#     - [2.1 - Padding Mask](#2-1)
#     - [2.2 - Look-ahead Mask](#2-2)
# - [3 - Self-Attention](#3)
#     - [Exercise 3 - scaled_dot_product_attention](#ex-3)
# - [4 - Encoder](#4)
#     - [4.1 - Encoder Layer](#4-1)
#         - [Exercise 4 - EncoderLayer](#ex-4)
#     - [4.2 - Full Encoder](#4-2)
#         - [Exercise 5 - Encoder](#ex-5)
# - [5 - Decoder](#5)
#     - [5.1 - Decoder Layer](#5-1)
#         - [Exercise 6 - DecoderLayer](#ex-6)
#     - [5.2 - Full Decoder](#5-2)
#         - [Exercise 7 - Decoder](#ex-7)
# - [6 - Transformer](#6)
#     - [Exercise 8 - Transformer](#ex-8)
# - [7 - References](#7)

# <a name='0'></a>
# ## Packages
#
# Run the following cell to load the packages you'll need.

# In[ ]:


import tensorflow as tf
import pandas as pd
import time
import numpy as np
import matplotlib.pyplot as plt

from tensorflow.keras.layers import Embedding, MultiHeadAttention, Dense, Input, Dropout, LayerNormalization
from transformers import DistilBertTokenizerFast #, TFDistilBertModel
from transformers import TFDistilBertForTokenClassification
from tqdm import tqdm_notebook as tqdm

# <a name='1'></a>
# ## 1 - Positional Encoding
#
# In sequence-to-sequence tasks, the relative order of your data is extremely important to its meaning. When you were training sequential neural networks such as RNNs, you fed your inputs into the network in order. Information about the order of your data was automatically fed into your model. However, when you train a Transformer network using multi-head attention, you feed your data into the model all at once. While this dramatically reduces training time, there is no information about the order of your data. This is where positional encoding is useful - you can specifically encode the positions of your inputs and pass them into the network using these sine and cosine formulas:
#
# $$
# PE_{(pos, 2i)}= sin\left(\frac{pos}{{10000}^{\frac{2i}{d}}}\right)
# \tag{1}$$
# <br>
# $$
# PE_{(pos, 2i+1)}= cos\left(\frac{pos}{{10000}^{\frac{2i}{d}}}\right)
# \tag{2}$$
#
# * $d$ is the dimension of the word embedding and positional encoding
# * $pos$ is the position of the word.
# * $i$ refers to each of the different dimensions of the positional encoding.
#
# To develop some intuition about positional encodings, you can think of them broadly as a feature that contains information about the relative positions of words. The sum of the positional encoding and word embedding is ultimately what is fed into the model. If you just hard-code the positions in, say by adding a matrix of 1's or whole numbers to the word embedding, the semantic meaning is distorted. Conversely, the values of the sine and cosine equations are small enough (between -1 and 1) that when you add the positional encoding to a word embedding, the word embedding is not significantly distorted, and is instead enriched with positional information. Using a combination of these two equations helps your Transformer network attend to the relative positions of your input data. This was a short discussion on positional encodings, but to develop further intuition, check out the *Positional Encoding Ungraded Lab*.
#
# **Note:** In the lectures Andrew uses vertical vectors, but in this assignment all vectors are horizontal. All matrix multiplications should be adjusted accordingly.
#
# <a name='1-1'></a>
# ### 1.1 - Sine and Cosine Angles
#
# Notice that even though the sine and cosine positional encoding equations take in different arguments (`2i` versus `2i+1`, or even versus odd numbers), the inner terms for both equations are the same: $$\theta(pos, i, d) = \frac{pos}{10000^{\frac{2i}{d}}} \tag{3}$$
#
# Consider the inner term as you calculate the positional encoding for a word in a sequence.<br>
# $PE_{(pos, 0)}= sin\left(\frac{pos}{{10000}^{\frac{0}{d}}}\right)$, since solving `2i = 0` gives `i = 0` <br>
# $PE_{(pos, 1)}= cos\left(\frac{pos}{{10000}^{\frac{0}{d}}}\right)$, since solving `2i + 1 = 1` gives `i = 0`
#
# The angle is the same for both! The angles for $PE_{(pos, 2)}$ and $PE_{(pos, 3)}$ are the same as well, since for both, `i = 1` and therefore the inner term is $\left(\frac{pos}{{10000}^{\frac{1}{d}}}\right)$. This relationship holds true for all paired sine and cosine curves:
#
# | k | <code> 0 </code>|<code> 1 </code>|<code> 2 </code>|<code> 3 </code>| <code> ... </code> |<code> d - 2 </code>|<code> d - 1 </code>|
# | ---------------- | :------: | ----------------- | ----------------- | ----------------- | ----- | ----------------- | ----------------- |
# | encoding(0) = |[$sin(\theta(0, 0, d))$| $cos(\theta(0, 0, d))$| $sin(\theta(0, 1, d))$| $cos(\theta(0, 1, d))$|... |$sin(\theta(0, d//2, d))$| $cos(\theta(0, d//2, d))$]|
# | encoding(1) = | [$sin(\theta(1, 0, d))$| $cos(\theta(1, 0, d))$| $sin(\theta(1, 1, d))$| $cos(\theta(1, 1, d))$|... |$sin(\theta(1, d//2, d))$| $cos(\theta(1, d//2, d))$]|
# | ... | | | | | | | |
# | encoding(pos) = | [$sin(\theta(pos, 0, d))$| $cos(\theta(pos, 0, d))$| $sin(\theta(pos, 1, d))$| $cos(\theta(pos, 1, d))$|... |$sin(\theta(pos, d//2, d))$| $cos(\theta(pos, d//2, d))]$|
#
#
# <a name='ex-1'></a>
# ### Exercise 1 - get_angles
#
# Implement the function `get_angles()` to calculate the possible angles for the sine and cosine positional encodings
#
# **Hints**
#
# - If `k = [0, 1, 2, 3, 4, 5]`, then, `i` must be `i = [0, 0, 1, 1, 2, 2]`
# - `i = k//2`

# In[ ]:


# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION get_angles
def get_angles(pos, k, d):
    """
    Get the angles for the positional encoding

    Arguments:
        pos -- Column vector containing the positions [[0], [1], ...,[N-1]]
        k -- Row vector containing the dimension span [[0, 1, 2, ..., d-1]]
        d(integer) -- Encoding size

    Returns:
        angles -- (pos, d) numpy array
    """
    # START CODE HERE

    i = k // 2
    # Calculate the angles using pos, i and d
    angles = pos / np.power(10000, 2 * i / d)

    # END CODE HERE

    return angles


# In[ ]:


from public_tests import *

get_angles_test(get_angles)

# Example
position = 4
d_model = 8
pos_m = np.arange(position)[:, np.newaxis]
dims = np.arange(d_model)[np.newaxis, :]
get_angles(pos_m, dims, d_model)

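
# An optional, ungraded sanity check of the paired-angle property described above: since
# `i = k//2`, columns `2i` and `2i+1` of the angle matrix should be identical. (This check is
# an editorial addition, not part of the original assignment.)

# In[ ]:


angles = get_angles(pos_m, dims, d_model)
# Even-indexed and odd-indexed columns should contain exactly the same angles
assert np.allclose(angles[:, 0::2], angles[:, 1::2]), "columns 2i and 2i+1 should share angles"
print("Columns 2i and 2i+1 share the same angle values.")
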
# <a name='1-2'></a>
# ### 1.2 - Sine and Cosine Positional Encodings
#
# Now you can use the angles you computed to calculate the sine and cosine positional encodings.
#
# $$
# PE_{(pos, 2i)}= sin\left(\frac{pos}{{10000}^{\frac{2i}{d}}}\right)
# $$
# <br>
# $$
# PE_{(pos, 2i+1)}= cos\left(\frac{pos}{{10000}^{\frac{2i}{d}}}\right)
# $$
#
# <a name='ex-2'></a>
# ### Exercise 2 - positional_encoding
#
# Implement the function `positional_encoding()` to calculate the sine and cosine positional encodings
#
# **Reminder:** Use the sine equation when $i$ is an even number and the cosine equation when $i$ is an odd number.
#
# #### Additional Hints
# * You may find
# [np.newaxis](https://numpy.org/doc/stable/reference/arrays.indexing.html) useful depending on the implementation you choose.

# In[ ]:


# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION positional_encoding
def positional_encoding(positions, d):
    """
    Precomputes a matrix with all the positional encodings

    Arguments:
        positions (int) -- Maximum number of positions to be encoded
        d (int) -- Encoding size

    Returns:
        pos_encoding -- (1, position, d_model) A matrix with the positional encodings
    """
    # START CODE HERE
    # initialize a matrix angle_rads of all the angles
    angle_rads = get_angles(np.arange(positions)[:, np.newaxis],
                            np.arange(d)[np.newaxis, :],
                            d)

    # apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    # END CODE HERE

    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)


# In[ ]:


# UNIT TEST
positional_encoding_test(positional_encoding, get_angles)


# Nice work calculating the positional encodings! Now you can visualize them.

# In[ ]:


pos_encoding = positional_encoding(50, 512)

print(pos_encoding.shape)

plt.pcolormesh(pos_encoding[0], cmap='RdBu')
plt.xlabel('d')
plt.xlim((0, 512))
plt.ylabel('Position')
plt.colorbar()
plt.show()


# Each row represents a positional encoding - notice how none of the rows are identical! You have created a unique positional encoding for each of the words.

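
# One more ungraded observation (an editorial addition, not part of the original assignment):
# because every sin/cos pair satisfies $sin^2(\theta) + cos^2(\theta) = 1$, every row of the
# encoding has the same L2 norm, $\sqrt{d/2}$, so positions differ only in direction.

# In[ ]:


print(tf.norm(pos_encoding[0], axis=1))  # every entry is ~sqrt(512/2) = 16
print(np.sqrt(512 / 2))
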
# <a name='2'></a>
# ## 2 - Masking
#
# There are two types of masks that are useful when building your Transformer network: the *padding mask* and the *look-ahead mask*. Both help the softmax computation give the appropriate weights to the words in your input sentence.
#
# <a name='2-1'></a>
# ### 2.1 - Padding Mask
#
# Oftentimes your input sequence will exceed the maximum length of a sequence your network can process. Let's say the maximum length of your model is five, and it is fed the following sequences:
#
# [["Do", "you", "know", "when", "Jane", "is", "going", "to", "visit", "Africa"],
# ["Jane", "visits", "Africa", "in", "September" ],
# ["Exciting", "!"]
# ]
#
# which might get vectorized as:
#
# [[ 71, 121, 4, 56, 99, 2344, 345, 1284, 15],
# [ 56, 1285, 15, 181, 545],
# [ 87, 600]
# ]
#
# When passing sequences into a transformer model, it is important that they are of uniform length. You can achieve this by padding the sequence with zeros, and truncating sentences that exceed the maximum length of your model:
#
# [[ 71, 121, 4, 56, 99],
# [ 2344, 345, 1284, 15, 0],
# [ 56, 1285, 15, 181, 545],
# [ 87, 600, 0, 0, 0],
# ]
#
# Sequences longer than the maximum length of five will be truncated, and zeros will be added to the truncated sequence to achieve uniform length. Similarly, for sequences shorter than the maximum length, zeros will be added for padding. However, these zeros will affect the softmax calculation - this is when a padding mask comes in handy! You will need to define a boolean mask that specifies which elements you must attend to (1) and which elements you must ignore (0). Later you will use that mask to set all the zeros in the sequence to a value close to negative infinity (-1e9). We'll implement this for you so you can get to the fun of building the Transformer network! 😇 Just make sure you go through the code so you can correctly implement padding when building your model.
#
# After masking, your input should go from `[87, 600, 0, 0, 0]` to `[87, 600, -1e9, -1e9, -1e9]`, so that when you take the softmax, the zeros don't affect the score.
#
# The [MultiHeadAttention](https://keras.io/api/layers/attention_layers/multi_head_attention/) layer implemented in Keras uses this masking logic.

# In[ ]:


def create_padding_mask(decoder_token_ids):
    """
    Creates a matrix mask for the padding cells

    Arguments:
        decoder_token_ids -- (n, m) matrix

    Returns:
        mask -- (n, 1, m) binary tensor
    """
    seq = 1 - tf.cast(tf.math.equal(decoder_token_ids, 0), tf.float32)

    # add extra dimensions to add the padding
    # to the attention logits.
    return seq[:, tf.newaxis, :]


# In[ ]:


x = tf.constant([[7., 6., 0., 0., 1.], [1., 2., 3., 0., 0.], [0., 0., 0., 4., 5.]])
print(create_padding_mask(x))


# If we multiply (1 - mask) by -1e9 and add it to the sample input sequences, the zeros are essentially set to negative infinity. Notice the difference when taking the softmax of the original sequence and the masked sequence:

# In[ ]:


print(tf.keras.activations.softmax(x))
print(tf.keras.activations.softmax(x + (1 - create_padding_mask(x)) * -1.0e9))


# <a name='2-2'></a>
# ### 2.2 - Look-ahead Mask
#
# The look-ahead mask follows similar intuition. In training, you will have access to the complete correct output of your training example. The look-ahead mask helps your model pretend that it correctly predicted a part of the output and see if, *without looking ahead*, it can correctly predict the next output.
#
# For example, if the expected correct output is `[1, 2, 3]` and you wanted to see if given that the model correctly predicted the first value it could predict the second value, you would mask out the second and third values. So you would input the masked sequence `[1, -1e9, -1e9]` and see if it could generate `[1, 2, -1e9]`.
#
# Just because you've worked so hard, we'll also implement this mask for you 😇😇. Again, take a close look at the code so you can effectively implement it later.

# In[ ]:


def create_look_ahead_mask(sequence_length):
    """
    Returns a lower triangular matrix filled with ones

    Arguments:
        sequence_length -- matrix size

    Returns:
        mask -- (1, sequence_length, sequence_length) tensor
    """
    mask = tf.linalg.band_part(tf.ones((1, sequence_length, sequence_length)), -1, 0)
    return mask


# In[ ]:


x = tf.random.uniform((1, 3))
temp = create_look_ahead_mask(x.shape[1])
temp

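
# A possible (ungraded, editorial) illustration of how the two masks work together: in the
# decoder's first attention block, the look-ahead mask is usually combined with a padding mask
# so that each position can only attend to earlier, non-padding positions. Since 1 means
# "attend" in these masks, the element-wise minimum keeps a position only if both masks allow it.

# In[ ]:


dec_ids = tf.constant([[5, 8, 0]])                        # toy decoder token ids; the last token is padding
combined_mask = tf.minimum(create_padding_mask(dec_ids),  # (1, 1, 3)
                           create_look_ahead_mask(3))     # (1, 3, 3)
print(combined_mask)                                      # (1, 3, 3): lower triangular with the padding column zeroed
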
# <a name='3'></a>
# ## 3 - Self-Attention
#
# As the authors of the Transformers paper state, "Attention is All You Need".
#
# <img src="self-attention.png" alt="Encoder" width="600"/>
# <caption><center><font color='purple'><b>Figure 1: Self-Attention calculation visualization</font></center></caption>
#
# The use of self-attention paired with traditional convolutional networks allows for the parallelization which speeds up training. You will implement **scaled dot product attention**, which takes in a query, key, value, and a mask as inputs and returns rich, attention-based vector representations of the words in your sequence. This type of self-attention can be mathematically expressed as:
# $$
# \text { Attention }(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}+{M}\right) V\tag{4}
# $$
#
# * $Q$ is the matrix of queries
# * $K$ is the matrix of keys
# * $V$ is the matrix of values
# * $M$ is the optional mask you choose to apply
# * ${d_k}$ is the dimension of the keys, which is used to scale everything down so the softmax doesn't explode
#
# <a name='ex-3'></a>
# ### Exercise 3 - scaled_dot_product_attention
#
# Implement the function `scaled_dot_product_attention()` to create attention-based representations
#
# **Reminder**: The boolean mask parameter can be passed in as `None` or as either the padding or look-ahead mask.
#
# Multiply (1. - mask) by -1e9 before applying the softmax.
#
# **Additional Hints**
# * You may find [tf.matmul](https://www.tensorflow.org/api_docs/python/tf/linalg/matmul) useful for matrix multiplication.

# In[ ]:


# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION scaled_dot_product_attention
def scaled_dot_product_attention(q, k, v, mask):
    """
    Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type (padding or look ahead)
    but it must be broadcastable for addition.

    Arguments:
        q -- query shape == (..., seq_len_q, depth)
        k -- key shape == (..., seq_len_k, depth)
        v -- value shape == (..., seq_len_v, depth_v)
        mask -- Float tensor with shape broadcastable
                to (..., seq_len_q, seq_len_k). Defaults to None.

    Returns:
        output -- attention result tensor of shape (..., seq_len_q, depth_v)
        attention_weights -- attention weights tensor of shape (..., seq_len_q, seq_len_k)
    """
    # START CODE HERE

    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # add the mask to the scaled tensor.
    if mask is not None:  # Don't replace this None
        scaled_attention_logits += ((1 - mask) * -1e9)

    # softmax is normalized on the last axis (seq_len_k) so that the scores
    # add up to 1.
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

    # END CODE HERE
    return output, attention_weights


# In[ ]:


# UNIT TEST
scaled_dot_product_attention_test(scaled_dot_product_attention)


# Excellent work! You can now implement self-attention. With that, you can start building the encoder block!

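
# An ungraded, editorial sanity check: with identical keys and no mask, every query should spread
# its attention uniformly, so the output is simply the mean of the value rows.

# In[ ]:


q_toy = tf.constant([[1., 0.], [0., 1.]])   # (seq_len_q=2, depth=2)
k_toy = tf.constant([[1., 1.], [1., 1.]])   # identical keys
v_toy = tf.constant([[10., 0.], [0., 10.]])
out_toy, weights_toy = scaled_dot_product_attention(q_toy, k_toy, v_toy, None)
print(weights_toy)  # each row is ~[0.5, 0.5]
print(out_toy)      # each row is ~[5., 5.], the mean of the value rows
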
# <a name='4'></a>
# ## 4 - Encoder
#
# The Transformer Encoder layer pairs self-attention and convolutional neural network style of processing to improve the speed of training and passes K and V matrices to the Decoder, which you'll build later in the assignment. In this section of the assignment, you will implement the Encoder by pairing multi-head attention and a feed forward neural network (Figure 2a).
# <img src="encoder_layer.png" alt="Encoder" width="250"/>
# <caption><center><font color='purple'><b>Figure 2a: Transformer encoder layer</font></center></caption>
#
# * `MultiHeadAttention` you can think of as computing the self-attention several times to detect different features.
# * The feed forward neural network contains two Dense layers, which we'll implement as the function `FullyConnected`.
#
# Your input sentence first passes through a *multi-head attention layer*, where the encoder looks at other words in the input sentence as it encodes a specific word. The outputs of the multi-head attention layer are then fed to a *feed forward neural network*. The exact same feed forward network is independently applied to each position.
#
# * For the `MultiHeadAttention` layer, you will use the [Keras implementation](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention). If you're curious about how to split the query matrix Q, key matrix K, and value matrix V into different heads, you can look through the implementation.
# * You will also use the [Sequential API](https://keras.io/api/models/sequential/) with two dense layers to build the feed forward neural network layers.

# In[ ]:


def FullyConnected(embedding_dim, fully_connected_dim):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(fully_connected_dim, activation='relu'),  # (batch_size, seq_len, dff)
        tf.keras.layers.Dense(embedding_dim)  # (batch_size, seq_len, d_model)
    ])


# <a name='4-1'></a>
# ### 4.1 - Encoder Layer
#
# Now you can pair multi-head attention and feed forward neural network together in an encoder layer! You will also use residual connections and layer normalization to help speed up training (Figure 2a).
#
# <a name='ex-4'></a>
# ### Exercise 4 - EncoderLayer
#
# Implement `EncoderLayer()` using the `call()` method
#
# In this exercise, you will implement one encoder block (Figure 2) using the `call()` method. The function should perform the following steps:
# 1. You will pass the Q, V, K matrices and a boolean mask to a multi-head attention layer. Remember that to compute *self*-attention, Q, V and K should be the same. Leave the default values for `return_attention_scores` and `training`. You will also perform Dropout in this multi-head attention layer during training.
# 2. Now add a skip connection by adding your original input `x` and the output of your multi-head attention layer.
# 3. After adding the skip connection, pass the output through the first normalization layer.
# 4. Finally, repeat steps 1-3 but with the feed forward neural network with a dropout layer instead of the multi-head attention layer.
#
# <details>
# <summary><font size="2" color="darkgreen"><b>Additional Hints (Click to expand)</b></font></summary>
#
# * The `__init__` method creates all the layers that will be accessed by the `call` method. Wherever you want to use a layer defined inside the `__init__` method you will have to use the syntax `self.[insert layer name]`.
# * You will find the documentation of [MultiHeadAttention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention) helpful. *Note that if query, key and value are the same, then this function performs self-attention.*
# * The call arguments for `self.mha` are (where B is the batch size, T is the target sequence length, and S is the source sequence length):
#     - `query`: Query Tensor of shape (B, T, dim).
#     - `value`: Value Tensor of shape (B, S, dim).
#     - `key`: Optional key Tensor of shape (B, S, dim). If not given, will use value for both key and value, which is the most common case.
#     - `attention_mask`: a boolean mask of shape (B, T, S), that prevents attention to certain positions. The boolean mask specifies which query elements can attend to which key elements, 1 indicates attention and 0 indicates no attention. Broadcasting can happen for the missing batch dimensions and the head dimension.
#     - `return_attention_scores`: A boolean to indicate whether the output should be (attention_output, attention_scores) if True, or attention_output if False. Defaults to False.
#     - `training`: Python boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (no dropout). Defaults to either using the training mode of the parent layer/model, or False (inference) if there is no parent layer.

# In[ ]:


# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION EncoderLayer
class EncoderLayer(tf.keras.layers.Layer):
    """
    The encoder layer is composed of a multi-head self-attention mechanism,
    followed by a simple, positionwise fully connected feed-forward network.
    This architecture includes a residual connection around each of the two
    sub-layers, followed by layer normalization.
    """
    def __init__(self, embedding_dim, num_heads, fully_connected_dim,
                 dropout_rate=0.1, layernorm_eps=1e-6):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(num_heads=num_heads,
                                      key_dim=embedding_dim,
                                      dropout=dropout_rate)

        self.ffn = FullyConnected(embedding_dim=embedding_dim,
                                  fully_connected_dim=fully_connected_dim)

        self.layernorm1 = LayerNormalization(epsilon=layernorm_eps)
        self.layernorm2 = LayerNormalization(epsilon=layernorm_eps)

        self.dropout_ffn = Dropout(dropout_rate)

    def call(self, x, training, mask):
        """
        Forward pass for the Encoder Layer

        Arguments:
            x -- Tensor of shape (batch_size, input_seq_len, fully_connected_dim)
            training -- Boolean, set to true to activate
                        the training mode for dropout layers
            mask -- Boolean mask to ensure that the padding is not
                    treated as part of the input
        Returns:
            encoder_layer_out -- Tensor of shape (batch_size, input_seq_len, fully_connected_dim)
        """
        # START CODE HERE
        # calculate self-attention using mha (~1 line). Dropout will be applied during training
        attn_output = self.mha(x, x, x, mask)  # Self attention (batch_size, input_seq_len, fully_connected_dim)

        # apply layer normalization on the sum of the input and the attention output to get the
        # output of the multi-head attention layer (~1 line)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, fully_connected_dim)

        # pass the output of the multi-head attention layer through a ffn (~1 line)
        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, fully_connected_dim)

        # apply dropout layer to ffn output during training (~1 line)
        ffn_output = self.dropout_ffn(ffn_output, training=training)

        # apply layer normalization on the sum of the output from multi-head attention and ffn output to get the
        # output of the encoder layer (~1 line)
        encoder_layer_out = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, fully_connected_dim)
        # END CODE HERE

        return encoder_layer_out


# In[ ]:


# UNIT TEST
EncoderLayer_test(EncoderLayer)

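
# An ungraded, editorial usage sketch: the hyperparameter values below are illustrative only.
# A single encoder layer maps a (batch_size, seq_len, embedding_dim) tensor to one of the same shape.

# In[ ]:


toy_enc_layer = EncoderLayer(embedding_dim=16, num_heads=2, fully_connected_dim=32)
toy_x = tf.random.uniform((1, 5, 16))            # (batch_size, input_seq_len, embedding_dim)
print(toy_enc_layer(toy_x, False, None).shape)   # (1, 5, 16)
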
# <a name='4-2'></a>
# ### 4.2 - Full Encoder
#
# Awesome job! You have now successfully implemented positional encoding, self-attention, and an encoder layer - give yourself a pat on the back. Now you're ready to build the full Transformer Encoder (Figure 2b), where you will embed your input and add the positional encodings you calculated. You will then feed your encoded embeddings to a stack of Encoder layers.
#
# <img src="encoder.png" alt="Encoder" width="330"/>
# <caption><center><font color='purple'><b>Figure 2b: Transformer Encoder</font></center></caption>
#
#
# <a name='ex-5'></a>
# ### Exercise 5 - Encoder
#
# Complete the `Encoder()` function using the `call()` method to embed your input, add positional encoding, and implement multiple encoder layers
#
# In this exercise, you will initialize your Encoder with an Embedding layer, positional encoding, and multiple EncoderLayers. Your `call()` method will perform the following steps:
# 1. Pass your input through the Embedding layer.
# 2. Scale your embedding by multiplying it by the square root of your embedding dimension. Remember to cast the embedding dimension to data type `tf.float32` before computing the square root.
# 3. Add the position encoding: self.pos_encoding `[:, :seq_len, :]` to your embedding.
# 4. Pass the encoded embedding through a dropout layer, remembering to use the `training` parameter to set the model training mode.
# 5. Pass the output of the dropout layer through the stack of encoding layers using a for loop.

# In[ ]:


# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION
class Encoder(tf.keras.layers.Layer):
    """
    The entire Encoder starts by passing the input to an embedding layer
    and adding positional encoding; the output is then passed through a stack of
    encoder layers

    """
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size,
                 maximum_position_encoding, dropout_rate=0.1, layernorm_eps=1e-6):
        super(Encoder, self).__init__()

        self.embedding_dim = embedding_dim
        self.num_layers = num_layers

        self.embedding = Embedding(input_vocab_size, self.embedding_dim)
        self.pos_encoding = positional_encoding(maximum_position_encoding,
                                                self.embedding_dim)

        self.enc_layers = [EncoderLayer(embedding_dim=self.embedding_dim,
                                        num_heads=num_heads,
                                        fully_connected_dim=fully_connected_dim,
                                        dropout_rate=dropout_rate,
                                        layernorm_eps=layernorm_eps)
                           for _ in range(self.num_layers)]

        self.dropout = Dropout(dropout_rate)

    def call(self, x, training, mask):
        """
        Forward pass for the Encoder

        Arguments:
            x -- Tensor of shape (batch_size, input_seq_len)
            training -- Boolean, set to true to activate
                        the training mode for dropout layers
            mask -- Boolean mask to ensure that the padding is not
                    treated as part of the input
        Returns:
            x -- Tensor of shape (batch_size, input_seq_len, fully_connected_dim)
        """
        #mask = create_padding_mask(x)
        seq_len = tf.shape(x)[1]

        # START CODE HERE
        # Pass input through the Embedding layer
        x = self.embedding(x)  # (batch_size, input_seq_len, fully_connected_dim)
        # Scale embedding by multiplying it by the square root of the embedding dimension
        x *= tf.math.sqrt(tf.cast(self.embedding_dim, tf.float32))
        # Add the position encoding to embedding
        x += self.pos_encoding[:, :seq_len, :]
        # Pass the encoded embedding through a dropout layer
        x = self.dropout(x, training=training)
        # Pass the output through the stack of encoding layers
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)
        # END CODE HERE

        return x  # (batch_size, input_seq_len, fully_connected_dim)


# In[ ]:


# UNIT TEST
Encoder_test(Encoder)

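
# An ungraded, editorial usage sketch: run a tiny Encoder on a toy batch of token ids.
# All hyperparameter values and token ids below are illustrative only.

# In[ ]:


toy_encoder = Encoder(num_layers=2, embedding_dim=16, num_heads=2,
                      fully_connected_dim=32, input_vocab_size=100,
                      maximum_position_encoding=20)
toy_tokens = tf.constant([[12, 7, 33, 0, 0]])              # (batch_size=1, input_seq_len=5), zeros are padding
toy_enc_mask = create_padding_mask(toy_tokens)             # (1, 1, 5)
print(toy_encoder(toy_tokens, False, toy_enc_mask).shape)  # (1, 5, 16)
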
# <a name='5'></a>
# ## 5 - Decoder
#
# The Decoder layer takes the K and V matrices generated by the Encoder and computes the second multi-head attention layer with the Q matrix from the output (Figure 3a).
#
# <img src="decoder_layer.png" alt="Encoder" width="250"/>
# <caption><center><font color='purple'><b>Figure 3a: Transformer Decoder layer</font></center></caption>
#
# <a name='5-1'></a>
# ### 5.1 - Decoder Layer
# Again, you'll pair multi-head attention with a feed forward neural network, but this time you'll implement two multi-head attention layers. You will also use residual connections and layer normalization to help speed up training (Figure 3a).
#
# <a name='ex-6'></a>
# ### Exercise 6 - DecoderLayer
#
# Implement `DecoderLayer()` using the `call()` method
#
# 1. Block 1 is a multi-head attention layer with a residual connection and a look-ahead mask. Like in the `EncoderLayer`, Dropout is defined within the multi-head attention layer.
# 2. Block 2 will take into account the output of the Encoder, so the multi-head attention layer will receive K and V from the encoder, and Q from Block 1. You will then apply a normalization layer and a residual connection, just like you did before with the `EncoderLayer`.
# 3. Finally, Block 3 is a feed forward neural network with dropout and normalization layers and a residual connection.
#
# **Additional Hints:**
# * The first two blocks are fairly similar to the EncoderLayer except you will return `attention_scores` when computing self-attention

# In[ ]:


# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION DecoderLayer
class DecoderLayer(tf.keras.layers.Layer):
    """
    The decoder layer is composed of two multi-head attention blocks,
    one that takes the new input and uses self-attention, and the other
    one that combines it with the output of the encoder, followed by a
    fully connected block.
    """
    def __init__(self, embedding_dim, num_heads, fully_connected_dim, dropout_rate=0.1, layernorm_eps=1e-6):
        super(DecoderLayer, self).__init__()

        self.mha1 = MultiHeadAttention(num_heads=num_heads,
                                       key_dim=embedding_dim,
                                       dropout=dropout_rate)

        self.mha2 = MultiHeadAttention(num_heads=num_heads,
                                       key_dim=embedding_dim,
                                       dropout=dropout_rate)

        self.ffn = FullyConnected(embedding_dim=embedding_dim,
                                  fully_connected_dim=fully_connected_dim)

        self.layernorm1 = LayerNormalization(epsilon=layernorm_eps)
        self.layernorm2 = LayerNormalization(epsilon=layernorm_eps)
        self.layernorm3 = LayerNormalization(epsilon=layernorm_eps)

        self.dropout_ffn = Dropout(dropout_rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        """
        Forward pass for the Decoder Layer

        Arguments:
            x -- Tensor of shape (batch_size, target_seq_len, fully_connected_dim)
            enc_output -- Tensor of shape (batch_size, input_seq_len, fully_connected_dim)
            training -- Boolean, set to true to activate
                        the training mode for dropout layers
            look_ahead_mask -- Boolean mask for the target_input
            padding_mask -- Boolean mask for the second multihead attention layer
        Returns:
            out3 -- Tensor of shape (batch_size, target_seq_len, fully_connected_dim)
            attn_weights_block1 -- Tensor of shape (batch_size, num_heads, target_seq_len, target_seq_len)
            attn_weights_block2 -- Tensor of shape (batch_size, num_heads, target_seq_len, input_seq_len)
        """

        # START CODE HERE
        # enc_output.shape == (batch_size, input_seq_len, fully_connected_dim)

        # BLOCK 1
        # calculate self-attention and return attention scores as attn_weights_block1.
        # Dropout will be applied during training (~1 line).
        mult_attn_out1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask, return_attention_scores=True)  # (batch_size, target_seq_len, d_model)

        # apply layer normalization (layernorm1) to the sum of the attention output and the input (~1 line)
        Q1 = self.layernorm1(mult_attn_out1 + x)

        # BLOCK 2
        # calculate attention using the Q from the first block and K and V from the encoder output.
        # Dropout will be applied during training
        # Return attention scores as attn_weights_block2 (~1 line)
        mult_attn_out2, attn_weights_block2 = self.mha2(Q1, enc_output, enc_output, padding_mask, return_attention_scores=True)  # (batch_size, target_seq_len, d_model)

        # apply layer normalization (layernorm2) to the sum of the attention output and the output of the first block (~1 line)
        mult_attn_out2 = self.layernorm2(mult_attn_out2 + Q1)  # (batch_size, target_seq_len, fully_connected_dim)

        # BLOCK 3
        # pass the output of the second block through a ffn
        ffn_output = self.ffn(mult_attn_out2)  # (batch_size, target_seq_len, fully_connected_dim)

        # apply a dropout layer to the ffn output
        ffn_output = self.dropout_ffn(ffn_output, training=training)

        # apply layer normalization (layernorm3) to the sum of the ffn output and the output of the second block
        out3 = self.layernorm3(ffn_output + mult_attn_out2)  # (batch_size, target_seq_len, fully_connected_dim)
        # END CODE HERE

        return out3, attn_weights_block1, attn_weights_block2


# In[ ]:


# UNIT TEST
DecoderLayer_test(DecoderLayer, create_look_ahead_mask)

# <a name='5-2'></a>
# ### 5.2 - Full Decoder
# You're almost there! Time to use your Decoder layer to build a full Transformer Decoder (Figure 3b). You will embed your output and add positional encodings. You will then feed your encoded embeddings to a stack of Decoder layers.
#
#
# <img src="decoder.png" alt="Encoder" width="300"/>
# <caption><center><font color='purple'><b>Figure 3b: Transformer Decoder</font></center></caption>
#
# <a name='ex-7'></a>
# ### Exercise 7 - Decoder
#
# Implement `Decoder()` using the `call()` method to embed your output, add positional encoding, and implement multiple decoder layers
#
# In this exercise, you will initialize your Decoder with an Embedding layer, positional encoding, and multiple DecoderLayers. Your `call()` method will perform the following steps:
# 1. Pass your generated output through the Embedding layer.
# 2. Scale your embedding by multiplying it by the square root of your embedding dimension. Remember to cast the embedding dimension to data type `tf.float32` before computing the square root.
# 3. Add the position encoding: self.pos_encoding `[:, :seq_len, :]` to your embedding.
# 4. Pass the encoded embedding through a dropout layer, remembering to use the `training` parameter to set the model training mode.
# 5. Pass the output of the dropout layer through the stack of Decoding layers using a for loop.

# In[ ]:


# UNQ_C7 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION Decoder
class Decoder(tf.keras.layers.Layer):
    """
    The entire Decoder starts by passing the target input to an embedding layer
    and using positional encoding to then pass the output through a stack of
    decoder Layers

    """
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, target_vocab_size,
                 maximum_position_encoding, dropout_rate=0.1, layernorm_eps=1e-6):
        super(Decoder, self).__init__()

        self.embedding_dim = embedding_dim
        self.num_layers = num_layers

        self.embedding = Embedding(target_vocab_size, self.embedding_dim)
        self.pos_encoding = positional_encoding(maximum_position_encoding, self.embedding_dim)

        self.dec_layers = [DecoderLayer(embedding_dim=self.embedding_dim,
                                        num_heads=num_heads,
                                        fully_connected_dim=fully_connected_dim,
                                        dropout_rate=dropout_rate,
                                        layernorm_eps=layernorm_eps)
                           for _ in range(self.num_layers)]
        self.dropout = Dropout(dropout_rate)

    def call(self, x, enc_output, training,
             look_ahead_mask, padding_mask):
        """
        Forward pass for the Decoder

        Arguments:
            x -- Tensor of shape (batch_size, target_seq_len)
            enc_output -- Tensor of shape (batch_size, input_seq_len, fully_connected_dim)
            training -- Boolean, set to true to activate
                        the training mode for dropout layers
            look_ahead_mask -- Boolean mask for the target_input
            padding_mask -- Boolean mask for the second multihead attention layer
        Returns:
            x -- Tensor of shape (batch_size, target_seq_len, fully_connected_dim)
            attention_weights -- Dictionary of tensors containing all the attention weights,
                                 each of shape (batch_size, num_heads, target_seq_len, input_seq_len)
        """

        seq_len = tf.shape(x)[1]
        attention_weights = {}

        # START CODE HERE
        # create word embeddings
        x = self.embedding(x)  # (batch_size, target_seq_len, fully_connected_dim)

        # scale embeddings by multiplying by the square root of their dimension
        x *= tf.math.sqrt(tf.cast(self.embedding_dim, tf.float32))

        # calculate positional encodings and add to word embedding
        x += self.pos_encoding[:, :seq_len, :]

        # apply a dropout layer to x
        x = self.dropout(x, training=training)

        # use a for loop to pass x through a stack of decoder layers and update attention_weights (~4 lines total)
        for i in range(self.num_layers):
            # pass x and the encoder output through a stack of decoder layers and save the attention weights
            # of block 1 and 2 (~1 line)
            x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                                                   look_ahead_mask, padding_mask)

            # update attention_weights dictionary with the attention weights of block 1 and block 2
            attention_weights['decoder_layer{}_block1_self_att'.format(i+1)] = block1
            attention_weights['decoder_layer{}_block2_decenc_att'.format(i+1)] = block2
        # END CODE HERE

        # x.shape == (batch_size, target_seq_len, fully_connected_dim)
        return x, attention_weights


# In[ ]:


# UNIT TEST
Decoder_test(Decoder, create_look_ahead_mask, create_padding_mask)

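
# An ungraded, editorial usage sketch: run a tiny Decoder against a random stand-in for the
# encoder output. All hyperparameter values, token ids, and mask choices below are illustrative only.

# In[ ]:


toy_decoder = Decoder(num_layers=2, embedding_dim=16, num_heads=2,
                      fully_connected_dim=32, target_vocab_size=120,
                      maximum_position_encoding=20)
toy_tar = tf.constant([[3, 45, 9, 0]])                    # (batch_size=1, target_seq_len=4)
toy_enc_out = tf.random.uniform((1, 5, 16))               # stand-in for a real encoder output
toy_dec_out, toy_dec_attn = toy_decoder(toy_tar, toy_enc_out, False,
                                        create_look_ahead_mask(4), None)
print(toy_dec_out.shape)          # (1, 4, 16)
print(list(toy_dec_attn.keys()))  # attention weights for blocks 1 and 2 of each decoder layer
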
# <a name='6'></a>
# ## 6 - Transformer
#
# Phew! This has been quite the assignment, and now you've made it to your last exercise of the Deep Learning Specialization. Congratulations! You've done all the hard work, now it's time to put it all together.
#
# <img src="transformer.png" alt="Transformer" width="550"/>
# <caption><center><font color='purple'><b>Figure 4: Transformer</font></center></caption>
#
# The flow of data through the Transformer Architecture is as follows:
# * First your input passes through an Encoder, which is just repeated Encoder layers that you implemented:
#     - embedding and positional encoding of your input
#     - multi-head attention on your input
#     - feed forward neural network to help detect features
# * Then the predicted output passes through a Decoder, consisting of the decoder layers that you implemented:
#     - embedding and positional encoding of the output
#     - multi-head attention on your generated output
#     - multi-head attention with the Q from the first multi-head attention layer and the K and V from the Encoder
#     - a feed forward neural network to help detect features
# * Finally, after the Nth Decoder layer, a dense layer with a softmax activation is applied to generate a prediction for the next output in your sequence.
#
# <a name='ex-8'></a>
# ### Exercise 8 - Transformer
#
# Implement `Transformer()` using the `call()` method
# 1. Pass the input through the Encoder with the appropriate mask.
# 2. Pass the encoder output and the target through the Decoder with the appropriate mask.
# 3. Apply a linear transformation and a softmax to get a prediction.

# In[ ]:


# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION Transformer
class Transformer(tf.keras.Model):
    """
    Complete transformer with an Encoder and a Decoder
    """
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size,
                 target_vocab_size, max_positional_encoding_input,
                 max_positional_encoding_target, dropout_rate=0.1, layernorm_eps=1e-6):
        super(Transformer, self).__init__()

        self.encoder = Encoder(num_layers=num_layers,
                               embedding_dim=embedding_dim,
                               num_heads=num_heads,
                               fully_connected_dim=fully_connected_dim,
                               input_vocab_size=input_vocab_size,
                               maximum_position_encoding=max_positional_encoding_input,
                               dropout_rate=dropout_rate,
                               layernorm_eps=layernorm_eps)

        self.decoder = Decoder(num_layers=num_layers,
                               embedding_dim=embedding_dim,
                               num_heads=num_heads,
                               fully_connected_dim=fully_connected_dim,
                               target_vocab_size=target_vocab_size,
                               maximum_position_encoding=max_positional_encoding_target,
                               dropout_rate=dropout_rate,
                               layernorm_eps=layernorm_eps)

        self.final_layer = Dense(target_vocab_size, activation='softmax')

    def call(self, input_sentence, output_sentence, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):
        """
        Forward pass for the entire Transformer

        Arguments:
            input_sentence -- Tensor of shape (batch_size, input_seq_len)
                              An array of the indexes of the words in the input sentence
            output_sentence -- Tensor of shape (batch_size, target_seq_len)
                               An array of the indexes of the words in the output sentence
            training -- Boolean, set to true to activate
                        the training mode for dropout layers
            enc_padding_mask -- Boolean mask to ensure that the padding is not
                                treated as part of the input
            look_ahead_mask -- Boolean mask for the target_input
            dec_padding_mask -- Boolean mask for the second multihead attention layer
        Returns:
            final_output -- Tensor of shape (batch_size, target_seq_len, target_vocab_size)
                            with the softmax predictions for the next output in the sequence
            attention_weights -- Dictionary of tensors containing all the attention weights for the decoder,
                                 each of shape (batch_size, num_heads, target_seq_len, input_seq_len)

        """
        # START CODE HERE
        # call self.encoder with the appropriate arguments to get the encoder output
        enc_output = self.encoder(input_sentence, training, enc_padding_mask)  # (batch_size, inp_seq_len, fully_connected_dim)

        # call self.decoder with the appropriate arguments to get the decoder output
        # dec_output.shape == (batch_size, tar_seq_len, fully_connected_dim)
        dec_output, attention_weights = self.decoder(output_sentence, enc_output, training, look_ahead_mask, dec_padding_mask)

        # pass decoder output through a linear layer and softmax (~2 lines)
        final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)
        # END CODE HERE

        return final_output, attention_weights


# In[ ]:


# UNIT TEST
Transformer_test(Transformer, create_look_ahead_mask, create_padding_mask)

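
# An ungraded, editorial end-to-end sketch: build a tiny Transformer and run one forward pass.
# All hyperparameter values and token ids below are illustrative only.

# In[ ]:


toy_transformer = Transformer(num_layers=2, embedding_dim=16, num_heads=2,
                              fully_connected_dim=32, input_vocab_size=100,
                              target_vocab_size=120,
                              max_positional_encoding_input=20,
                              max_positional_encoding_target=20)
toy_inp = tf.constant([[12, 7, 33, 0, 0]])           # (1, 5) source token ids, zeros are padding
toy_tar = tf.constant([[3, 45, 9, 0]])               # (1, 4) target token ids
toy_enc_padding_mask = create_padding_mask(toy_inp)  # (1, 1, 5)
toy_dec_padding_mask = create_padding_mask(toy_inp)  # masks encoder padding in the second attention block
toy_look_ahead_mask = create_look_ahead_mask(4)      # (1, 4, 4)
toy_preds, toy_attn = toy_transformer(toy_inp, toy_tar, False, toy_enc_padding_mask,
                                      toy_look_ahead_mask, toy_dec_padding_mask)
print(toy_preds.shape)  # (1, 4, 120): one softmax distribution over the target vocabulary per target position
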
# ## Conclusion
#
# You've come to the end of the graded portion of the assignment. By now, you've:
#
# * Created positional encodings to capture sequential relationships in data
# * Calculated scaled dot-product self-attention with word embeddings
# * Implemented masked multi-head attention
# * Built and trained a Transformer model

# <font color='blue'>
# <b>What you should remember</b>:
#
# - The combination of self-attention and convolutional network layers allows for parallelization and *faster training*.
# - Self-attention is calculated using the generated query Q, key K, and value V matrices.
# - Adding positional encoding to word embeddings is an effective way to include sequence information in self-attention calculations.
# - Multi-head attention can help detect multiple features in your sentence.
# - Masking stops the model from 'looking ahead' during training, or weighting zeroes too much when processing cropped sentences.

# Now that you have completed the Transformer assignment, make sure you check out the ungraded labs to apply the Transformer model to practical use cases such as Named Entity Recognition (NER) and Question Answering (QA).
#
#
# # Congratulations on finishing the Deep Learning Specialization!!!!!! 🎉🎉🎉🎉🎉
#
# This was the last graded assignment of the specialization. It is now time to celebrate all your hard work and dedication!
#
# <a name='7'></a>
# ## 7 - References
#
# The Transformer algorithm is due to Vaswani et al. (2017).
#
# - Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017). [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

# In[ ]:


