#!/usr/bin/env python
# coding: utf-8

# # Transformer Network
#
# Welcome to Week 4's assignment, the last assignment of Course 5 of the Deep Learning Specialization! And congratulations on making it to the last assignment of the entire Deep Learning Specialization - you're almost done!
#
# Earlier in the course, you've implemented sequential neural networks such as RNNs, GRUs, and LSTMs. In this notebook you'll explore the Transformer architecture, a neural network that takes advantage of parallel processing and allows you to substantially speed up the training process.
#
# **After this assignment you'll be able to**:
#
# * Create positional encodings to capture sequential relationships in data
# * Calculate scaled dot-product self-attention with word embeddings
# * Implement masked multi-head attention
# * Build and train a Transformer model
#
# For the last time, let's get started!

# ## Table of Contents
#
# - [Packages](#0)
# - [1 - Positional Encoding](#1)
#     - [1.1 - Sine and Cosine Angles](#1-1)
#         - [Exercise 1 - get_angles](#ex-1)
#     - [1.2 - Sine and Cosine Positional Encodings](#1-2)
#         - [Exercise 2 - positional_encoding](#ex-2)
# - [2 - Masking](#2)
#     - [2.1 - Padding Mask](#2-1)
#     - [2.2 - Look-ahead Mask](#2-2)
# - [3 - Self-Attention](#3)
#     - [Exercise 3 - scaled_dot_product_attention](#ex-3)
# - [4 - Encoder](#4)
#     - [4.1 Encoder Layer](#4-1)
#         - [Exercise 4 - EncoderLayer](#ex-4)
#     - [4.2 - Full Encoder](#4-2)
#         - [Exercise 5 - Encoder](#ex-5)
# - [5 - Decoder](#5)
#     - [5.1 - Decoder Layer](#5-1)
#         - [Exercise 6 - DecoderLayer](#ex-6)
#     - [5.2 - Full Decoder](#5-2)
#         - [Exercise 7 - Decoder](#ex-7)
# - [6 - Transformer](#6)
#     - [Exercise 8 - Transformer](#ex-8)
# - [7 - References](#7)

# <a name='0'></a>
# ## Packages
#
# Run the following cell to load the packages you'll need.

# In[ ]:


import tensorflow as tf
import pandas as pd
import time
import numpy as np
import matplotlib.pyplot as plt

from tensorflow.keras.layers import Embedding, MultiHeadAttention, Dense, Input, Dropout, LayerNormalization
from transformers import DistilBertTokenizerFast  # , TFDistilBertModel
from transformers import TFDistilBertForTokenClassification
from tqdm import tqdm_notebook as tqdm

# <a name='1'></a>
# ## 1 - Positional Encoding
#
# In sequence to sequence tasks, the relative order of your data is extremely important to its meaning. When you were training sequential neural networks such as RNNs, you fed your inputs into the network in order. Information about the order of your data was automatically fed into your model. However, when you train a Transformer network using multi-head attention, you feed your data into the model all at once. While this dramatically reduces training time, there is no information about the order of your data. This is where positional encoding is useful - you can specifically encode the positions of your inputs and pass them into the network using these sine and cosine formulas:
#
# $$
# PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
# \tag{1}$$
# <br>
# $$
# PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
# \tag{2}$$
#
# * $d$ is the dimension of the word embedding and positional encoding
# * $pos$ is the position of the word
# * $i$ refers to each of the different dimensions of the positional encoding
#
# To develop some intuition about positional encodings, you can think of them broadly as a feature that contains information about the relative positions of words. The sum of the positional encoding and word embedding is ultimately what is fed into the model. If you just hard-coded the positions in, say by adding a matrix of 1's or whole numbers to the word embedding, the semantic meaning would be distorted. Conversely, the values of the sine and cosine equations are small enough (between -1 and 1) that when you add the positional encoding to a word embedding, the word embedding is not significantly distorted, and is instead enriched with positional information. Using a combination of these two equations helps your Transformer network attend to the relative positions of your input data. This was a short discussion on positional encodings, but to develop further intuition, check out the *Positional Encoding Ungraded Lab*.
#
# **Note:** In the lectures Andrew uses vertical vectors, but in this assignment all vectors are horizontal. All matrix multiplications should be adjusted accordingly.
#
# <a name='1-1'></a>
# ### 1.1 - Sine and Cosine Angles
#
# Notice that even though the sine and cosine positional encoding equations take in different arguments (`2i` versus `2i+1`, or even versus odd numbers), the inner term of both equations is the same: $$\theta(pos, i, d) = \frac{pos}{10000^{\frac{2i}{d}}} \tag{3}$$
#
# Consider the inner term as you calculate the positional encoding for a word in a sequence.<br>
# $PE_{(pos, 0)} = \sin\left(\frac{pos}{10000^{\frac{0}{d}}}\right)$, since solving `2i = 0` gives `i = 0` <br>
# $PE_{(pos, 1)} = \cos\left(\frac{pos}{10000^{\frac{0}{d}}}\right)$, since solving `2i + 1 = 1` gives `i = 0`
#
# The angle is the same for both! The angles for $PE_{(pos, 2)}$ and $PE_{(pos, 3)}$ are the same as well, since for both `i = 1` and therefore the inner term is $\left(\frac{pos}{10000^{\frac{1}{d}}}\right)$. This relationship holds true for all paired sine and cosine curves:
#
# | k | <code>0</code> | <code>1</code> | <code>2</code> | <code>3</code> | <code>...</code> | <code>d - 2</code> | <code>d - 1</code> |
# | --------------- | :------: | ------- | ------- | ------- | ----- | ------- | ------- |
# | encoding(0) =   | [$\sin(\theta(0, 0, d))$ | $\cos(\theta(0, 0, d))$ | $\sin(\theta(0, 1, d))$ | $\cos(\theta(0, 1, d))$ | ... | $\sin(\theta(0, d//2, d))$ | $\cos(\theta(0, d//2, d))$] |
# | encoding(1) =   | [$\sin(\theta(1, 0, d))$ | $\cos(\theta(1, 0, d))$ | $\sin(\theta(1, 1, d))$ | $\cos(\theta(1, 1, d))$ | ... | $\sin(\theta(1, d//2, d))$ | $\cos(\theta(1, d//2, d))$] |
# | ...             | | | | | | | |
# | encoding(pos) = | [$\sin(\theta(pos, 0, d))$ | $\cos(\theta(pos, 0, d))$ | $\sin(\theta(pos, 1, d))$ | $\cos(\theta(pos, 1, d))$ | ... | $\sin(\theta(pos, d//2, d))$ | $\cos(\theta(pos, d//2, d))$] |
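#
# As a quick, ungraded illustration of this pairing (a minimal sketch that uses only equation (3); the variable names below are just for the example), the next cell evaluates the inner term directly for one position and the adjacent dimensions `k = 2` and `k = 3`. Both reduce to `i = 1`, so they give the same angle.

# In[ ]:


# Evaluate theta(pos, i, d) = pos / 10000**(2*i/d) for k = 2 and k = 3 (both map to i = 1)
d = 8
pos = 3
for k in (2, 3):
    i = k // 2
    print(f"k = {k} -> i = {i}, theta = {pos / np.power(10000, 2 * i / d):.6f}")
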
# <a name='ex-1'></a>
# ### Exercise 1 - get_angles
#
# Implement the function `get_angles()` to calculate the possible angles for the sine and cosine positional encodings.
#
# **Hints**
#
# - If `k = [0, 1, 2, 3, 4, 5]`, then `i` must be `i = [0, 0, 1, 1, 2, 2]`
# - `i = k//2`

# In[ ]:


# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION get_angles
def get_angles(pos, k, d):
    """
    Get the angles for the positional encoding

    Arguments:
        pos -- Column vector containing the positions [[0], [1], ..., [N-1]]
        k -- Row vector containing the dimension span [[0, 1, 2, ..., d-1]]
        d(integer) -- Encoding size

    Returns:
        angles -- (pos, d) numpy array
    """
    # START CODE HERE
    i = k // 2
    # Calculate the angles using pos, i and d
    angles = pos / np.power(10000, 2 * i / d)
    # END CODE HERE

    return angles


# In[ ]:


from public_tests import *

get_angles_test(get_angles)

# Example
position = 4
d_model = 8
pos_m = np.arange(position)[:, np.newaxis]
dims = np.arange(d_model)[np.newaxis, :]
get_angles(pos_m, dims, d_model)


# <a name='1-2'></a>
# ### 1.2 - Sine and Cosine Positional Encodings
#
# Now you can use the angles you computed to calculate the sine and cosine positional encodings.
#
# $$
# PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
# $$
# <br>
# $$
# PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
# $$
#
# <a name='ex-2'></a>
# ### Exercise 2 - positional_encoding
#
# Implement the function `positional_encoding()` to calculate the sine and cosine positional encodings.
#
# **Reminder:** Use the sine equation when the encoding dimension `k` is an even number and the cosine equation when `k` is an odd number.
#
# #### Additional Hints
# * You may find
# [np.newaxis](https://numpy.org/doc/stable/reference/arrays.indexing.html) useful depending on the implementation you choose.

# In[ ]:


# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION positional_encoding
def positional_encoding(positions, d):
    """
    Precomputes a matrix with all the positional encodings

    Arguments:
        positions (int) -- Maximum number of positions to be encoded
        d (int) -- Encoding size

    Returns:
        pos_encoding -- (1, position, d_model) A matrix with the positional encodings
    """
    # START CODE HERE
    # initialize a matrix angle_rads with all the angles
    angle_rads = get_angles(np.arange(positions)[:, np.newaxis],
                            np.arange(d)[np.newaxis, :],
                            d)

    # apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    # END CODE HERE

    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)


# In[ ]:


# UNIT TEST
positional_encoding_test(positional_encoding, get_angles)


# Nice work calculating the positional encodings! Now you can visualize them.

# In[ ]:


pos_encoding = positional_encoding(50, 512)

print(pos_encoding.shape)

plt.pcolormesh(pos_encoding[0], cmap='RdBu')
plt.xlabel('d')
plt.xlim((0, 512))
plt.ylabel('Position')
plt.colorbar()
plt.show()


# Each row represents a positional encoding - notice how none of the rows are identical! You have created a unique positional encoding for each of the words.
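#
# As an optional, ungraded sanity check (a minimal sketch reusing the `pos_encoding` tensor computed above), the next cell confirms that all 50 rows are distinct while every row has the same L2 norm of $\sqrt{d/2}$, since each sine/cosine pair contributes $\sin^2 + \cos^2 = 1$. The constant, modest norm echoes the note above that adding the encoding enriches, rather than distorts, the word embedding.

# In[ ]:


enc = pos_encoding[0].numpy()                        # (50, 512)
print("Distinct rows:", np.unique(enc, axis=0).shape[0])
print("Row norms (first 3):", np.linalg.norm(enc, axis=1)[:3], "- expected:", np.sqrt(512 / 2))
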
# <a name='2'></a>
# ## 2 - Masking
#
# There are two types of masks that are useful when building your Transformer network: the *padding mask* and the *look-ahead mask*. Both help the softmax computation give the appropriate weights to the words in your input sentence.
#
# <a name='2-1'></a>
# ### 2.1 - Padding Mask
#
# Oftentimes your input sequence will exceed the maximum length of a sequence your network can process. Let's say the maximum length of your model is five, and it is fed the following sequences:
#
#     [["Do", "you", "know", "when", "Jane", "is", "going", "to", "visit", "Africa"],
#      ["Jane", "visits", "Africa", "in", "September"],
#      ["Exciting", "!"]
#     ]
#
# which might get vectorized as:
#
#     [[ 71, 121, 4, 56, 99, 2344, 345, 1284, 15],
#      [ 56, 1285, 15, 181, 545],
#      [ 87, 600]
#     ]
#
# When passing sequences into a transformer model, it is important that they are of uniform length. You can achieve this by padding the sequences with zeros and truncating sentences that exceed the maximum length of your model:
#
#     [[ 71, 121, 4, 56, 99],
#      [ 2344, 345, 1284, 15, 0],
#      [ 56, 1285, 15, 181, 545],
#      [ 87, 600, 0, 0, 0],
#     ]
#
# Sequences longer than the maximum length of five will be truncated, and zeros will be added to the truncated sequence to achieve uniform length. Similarly, sequences shorter than the maximum length will also be padded with zeros. However, these zeros will affect the softmax calculation - this is when a padding mask comes in handy! You will need to define a boolean mask that specifies which elements you must attend to (1) and which elements you must ignore (0). Later you will use that mask to set all the zeros in the sequence to a value close to negative infinity (-1e9). We'll implement this for you so you can get to the fun of building the Transformer network! 😇 Just make sure you go through the code so you can correctly implement padding when building your model.
#
# After masking, your input should go from `[87, 600, 0, 0, 0]` to `[87, 600, -1e9, -1e9, -1e9]`, so that when you take the softmax, the zeros don't affect the score.
#
# The [MultiHeadAttention](https://keras.io/api/layers/attention_layers/multi_head_attention/) layer implemented in Keras uses this masking logic.
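#
# As an ungraded aside, the padding/truncation step itself is routine preprocessing. Below is a minimal sketch using Keras's `pad_sequences` utility on the vectorized sentences above; note that, unlike the four-row example above, `pad_sequences` truncates the long first sentence rather than splitting it into two rows.

# In[ ]:


# Pad/truncate the example sequences to a fixed length of 5 (illustration only)
sequences = [[71, 121, 4, 56, 99, 2344, 345, 1284, 15],
             [56, 1285, 15, 181, 545],
             [87, 600]]
padded = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=5, padding='post', truncating='post')
print(padded)
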
# In[ ]:


def create_padding_mask(decoder_token_ids):
    """
    Creates a matrix mask for the padding cells

    Arguments:
        decoder_token_ids -- (n, m) matrix

    Returns:
        mask -- (n, 1, m) binary tensor
    """
    seq = 1 - tf.cast(tf.math.equal(decoder_token_ids, 0), tf.float32)

    # add extra dimensions to add the padding
    # to the attention logits.
    return seq[:, tf.newaxis, :]


# In[ ]:


x = tf.constant([[7., 6., 0., 0., 1.], [1., 2., 3., 0., 0.], [0., 0., 0., 4., 5.]])
print(create_padding_mask(x))


# If we multiply (1 - mask) by -1e9 and add it to the sample input sequences, the zeros are essentially set to negative infinity. Notice the difference when taking the softmax of the original sequence and the masked sequence:

# In[ ]:


print(tf.keras.activations.softmax(x))
print(tf.keras.activations.softmax(x + (1 - create_padding_mask(x)) * -1.0e9))


# <a name='2-2'></a>
# ### 2.2 - Look-ahead Mask
#
# The look-ahead mask follows a similar intuition. In training, you will have access to the complete correct output of your training example. The look-ahead mask helps your model pretend that it correctly predicted a part of the output and see if, *without looking ahead*, it can correctly predict the next output.
#
# For example, if the expected correct output is `[1, 2, 3]` and you wanted to see whether, given that the model correctly predicted the first value, it could predict the second value, you would mask out the second and third values. So you would input the masked sequence `[1, -1e9, -1e9]` and see if it could generate `[1, 2, -1e9]`.
#
# Just because you've worked so hard, we'll also implement this mask for you 😇😇. Again, take a close look at the code so you can effectively implement it later.

# In[ ]:


def create_look_ahead_mask(sequence_length):
    """
    Returns a lower triangular matrix filled with ones
    (1 marks a position that may be attended to)

    Arguments:
        sequence_length -- matrix size

    Returns:
        mask -- (1, sequence_length, sequence_length) tensor
    """
    mask = tf.linalg.band_part(tf.ones((1, sequence_length, sequence_length)), -1, 0)
    return mask


# In[ ]:


x = tf.random.uniform((1, 3))
temp = create_look_ahead_mask(x.shape[1])
temp
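#
# To connect this back to the softmax trick used for the padding mask above, here's a small ungraded sketch (toy logits, illustration only) that applies the look-ahead mask the same way: after adding `(1 - mask) * -1e9`, row $i$ of the softmax only places weight on positions up to $i$.

# In[ ]:


logits = tf.constant([[1.0, 2.0, 3.0]])              # one row of toy attention logits
mask = create_look_ahead_mask(3)                     # (1, 3, 3); 1 means "may attend"
masked_logits = logits + (1 - mask) * -1.0e9         # broadcasts to (1, 3, 3)
print(tf.keras.activations.softmax(masked_logits))   # row i attends only to positions <= i
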
# <a name='3'></a>
# ## 3 - Self-Attention
#
# As the authors of the Transformer paper state, "Attention Is All You Need".
#
# <img src="self-attention.png" alt="Encoder" width="600"/>
# <caption><center><font color='purple'><b>Figure 1: Self-Attention calculation visualization</font></center></caption>
#
# The use of self-attention paired with traditional convolutional networks allows for the parallelization that speeds up training. You will implement **scaled dot product attention**, which takes in a query, key, value, and a mask as inputs and returns rich, attention-based vector representations of the words in your sequence. This type of self-attention can be mathematically expressed as:
# $$
# \text{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}} + M\right) V \tag{4}
# $$
#
# * $Q$ is the matrix of queries
# * $K$ is the matrix of keys
# * $V$ is the matrix of values
# * $M$ is the optional mask you choose to apply
# * $d_k$ is the dimension of the keys, which is used to scale everything down so the softmax doesn't explode
#
# <a name='ex-3'></a>
# ### Exercise 3 - scaled_dot_product_attention
#
# Implement the function `scaled_dot_product_attention()` to create attention-based representations.
#
# **Reminder**: The boolean mask parameter can be passed in as `None` or as either the padding or the look-ahead mask.
#
# Multiply `(1. - mask)` by -1e9 and add it to the scaled attention logits before applying the softmax.
#
# **Additional Hints**
# * You may find [tf.matmul](https://www.tensorflow.org/api_docs/python/tf/linalg/matmul) useful for matrix multiplication.

# In[ ]:


# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION scaled_dot_product_attention
def scaled_dot_product_attention(q, k, v, mask):
    """
    Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type (padding or look ahead)
    but it must be broadcastable for addition.

    Arguments:
        q -- query shape == (..., seq_len_q, depth)
        k -- key shape == (..., seq_len_k, depth)
        v -- value shape == (..., seq_len_v, depth_v)
        mask -- Float tensor with shape broadcastable
                to (..., seq_len_q, seq_len_k). Defaults to None.

    Returns:
        output -- attention output, shape == (..., seq_len_q, depth_v)
        attention_weights -- shape == (..., seq_len_q, seq_len_k)
    """
    # START CODE HERE

    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # add the mask to the scaled tensor.
    if mask is not None:  # Don't replace this None
        scaled_attention_logits += ((1 - mask) * -1e9)

    # softmax is normalized on the last axis (seq_len_k) so that the scores
    # add up to 1.
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

    # END CODE HERE
    return output, attention_weights


# In[ ]:


# UNIT TEST
scaled_dot_product_attention_test(scaled_dot_product_attention)
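#
# Here's a short, ungraded usage sketch (toy tensors, illustration only): self-attention over a sequence of three 2-dimensional vectors with a look-ahead mask, so position $i$ only attends to positions $\le i$. Each row of the returned attention weights sums to 1.

# In[ ]:


q = k = v = tf.constant([[1., 0.], [0., 1.], [1., 1.]])        # (seq_len, depth)
out, weights = scaled_dot_product_attention(q, k, v, create_look_ahead_mask(3)[0])
print("attention weights:\n", weights.numpy())                 # masked positions get ~0 weight
print("output:\n", out.numpy())                                # (seq_len, depth_v)
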
# Excellent work! You can now implement self-attention. With that, you can start building the encoder block!

# <a name='4'></a>
# ## 4 - Encoder
#
# The Transformer Encoder layer pairs self-attention with a convolutional-network style of processing to improve the speed of training, and it passes K and V matrices to the Decoder, which you'll build later in the assignment. In this section of the assignment, you will implement the Encoder by pairing multi-head attention and a feed forward neural network (Figure 2a).
# <img src="encoder_layer.png" alt="Encoder" width="250"/>
# <caption><center><font color='purple'><b>Figure 2a: Transformer encoder layer</font></center></caption>
#
# * You can think of `MultiHeadAttention` as computing self-attention several times to detect different features.
# * The feed forward neural network contains two Dense layers, which we'll implement as the function `FullyConnected`.
#
# Your input sentence first passes through a *multi-head attention layer*, where the encoder looks at other words in the input sentence as it encodes a specific word. The outputs of the multi-head attention layer are then fed to a *feed forward neural network*. The exact same feed forward network is independently applied to each position.
#
# * For the `MultiHeadAttention` layer, you will use the [Keras implementation](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention). If you're curious about how to split the query matrix Q, key matrix K, and value matrix V into different heads, you can look through the implementation.
# * You will also use the [Sequential API](https://keras.io/api/models/sequential/) with two dense layers to build the feed forward neural network layers.

# In[ ]:


def FullyConnected(embedding_dim, fully_connected_dim):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(fully_connected_dim, activation='relu'),  # (batch_size, seq_len, dff)
        tf.keras.layers.Dense(embedding_dim)  # (batch_size, seq_len, d_model)
    ])


# <a name='4-1'></a>
# ### 4.1 Encoder Layer
#
# Now you can pair multi-head attention and a feed forward neural network together in an encoder layer! You will also use residual connections and layer normalization to help speed up training (Figure 2a).
#
# <a name='ex-4'></a>
# ### Exercise 4 - EncoderLayer
#
# Implement `EncoderLayer()` using the `call()` method.
#
# In this exercise, you will implement one encoder block (Figure 2) using the `call()` method. The function should perform the following steps:
# 1. You will pass the Q, V, K matrices and a boolean mask to a multi-head attention layer. Remember that to compute *self*-attention, Q, V and K should be the same. Leave the default values for `return_attention_scores` and `training`. Dropout will also be applied inside this multi-head attention layer during training.
# 2. Now add a skip connection by adding your original input `x` and the output of your multi-head attention layer.
# 3. After adding the skip connection, pass the output through the first normalization layer.
# 4. Finally, repeat steps 1-3 but with the feed forward neural network with a dropout layer instead of the multi-head attention layer.
#
# <details>
#   <summary><font size="2" color="darkgreen"><b>Additional Hints (Click to expand)</b></font></summary>
#
# * The `__init__` method creates all the layers that will be accessed by the `call` method. Wherever you want to use a layer defined inside the `__init__` method, you will have to use the syntax `self.[insert layer name]`.
# * You will find the documentation of [MultiHeadAttention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention) helpful. *Note that if query, key and value are the same, then this function performs self-attention.*
# * The call arguments for `self.mha` are (where B is the batch size, T is the target sequence length, and S is the source sequence length):
#     - `query`: Query Tensor of shape (B, T, dim).
#     - `value`: Value Tensor of shape (B, S, dim).
#     - `key`: Optional key Tensor of shape (B, S, dim). If not given, will use value for both key and value, which is the most common case.
#     - `attention_mask`: a boolean mask of shape (B, T, S), that prevents attention to certain positions. The boolean mask specifies which query elements can attend to which key elements, 1 indicates attention and 0 indicates no attention. Broadcasting can happen for the missing batch dimensions and the head dimension.
#     - `return_attention_scores`: A boolean to indicate whether the output should be (attention_output, attention_scores) if True, or attention_output if False. Defaults to False.
#     - `training`: Python boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (no dropout). Defaults to either using the training mode of the parent layer/model, or False (inference) if there is no parent layer.
#
# </details>
# In[ ]:


# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION EncoderLayer
class EncoderLayer(tf.keras.layers.Layer):
    """
    The encoder layer is composed of a multi-head self-attention mechanism,
    followed by a simple, positionwise fully connected feed-forward network.
    This architecture includes a residual connection around each of the two
    sub-layers, followed by layer normalization.
    """
    def __init__(self, embedding_dim, num_heads, fully_connected_dim,
                 dropout_rate=0.1, layernorm_eps=1e-6):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(num_heads=num_heads,
                                      key_dim=embedding_dim,
                                      dropout=dropout_rate)

        self.ffn = FullyConnected(embedding_dim=embedding_dim,
                                  fully_connected_dim=fully_connected_dim)

        self.layernorm1 = LayerNormalization(epsilon=layernorm_eps)
        self.layernorm2 = LayerNormalization(epsilon=layernorm_eps)

        self.dropout_ffn = Dropout(dropout_rate)

    def call(self, x, training, mask):
        """
        Forward pass for the Encoder Layer

        Arguments:
            x -- Tensor of shape (batch_size, input_seq_len, fully_connected_dim)
            training -- Boolean, set to true to activate
                        the training mode for dropout layers
            mask -- Boolean mask to ensure that the padding is not
                    treated as part of the input

        Returns:
            encoder_layer_out -- Tensor of shape (batch_size, input_seq_len, fully_connected_dim)
        """
        # START CODE HERE
        # calculate self-attention using mha (~1 line). Dropout will be applied during training
        attn_output = self.mha(x, x, x, mask)  # Self attention (batch_size, input_seq_len, fully_connected_dim)

        # apply layer normalization on the sum of the input and the attention output to get the
        # output of the multi-head attention layer (~1 line)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, fully_connected_dim)

        # pass the output of the multi-head attention layer through a ffn (~1 line)
        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, fully_connected_dim)

        # apply dropout layer to ffn output during training (~1 line)
        ffn_output = self.dropout_ffn(ffn_output, training=training)

        # apply layer normalization on the sum of the output from multi-head attention and ffn output to get the
        # output of the encoder layer (~1 line)
        encoder_layer_out = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, fully_connected_dim)
        # END CODE HERE

        return encoder_layer_out


# In[ ]:


# UNIT TEST
EncoderLayer_test(EncoderLayer)
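#
# Before moving on, here's an ungraded usage sketch (toy sizes; names like `sample_encoder_layer` are just for illustration): a single encoder layer maps a batch of embedded sequences to an output of the same shape.

# In[ ]:


sample_encoder_layer = EncoderLayer(embedding_dim=4, num_heads=2, fully_connected_dim=8)
sample_input = tf.random.uniform((1, 3, 4))        # (batch_size, input_seq_len, embedding_dim)
print(sample_encoder_layer(sample_input, training=False, mask=None).shape)   # (1, 3, 4)
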
# <a name='4-2'></a>
# ### 4.2 - Full Encoder
#
# Awesome job! You have now successfully implemented positional encoding, self-attention, and an encoder layer - give yourself a pat on the back. Now you're ready to build the full Transformer Encoder (Figure 2b), where you will embed your input and add the positional encodings you calculated. You will then feed your encoded embeddings to a stack of Encoder layers.
#
# <img src="encoder.png" alt="Encoder" width="330"/>
# <caption><center><font color='purple'><b>Figure 2b: Transformer Encoder</font></center></caption>
#
#
# <a name='ex-5'></a>
# ### Exercise 5 - Encoder
#
# Complete the `Encoder()` function using the `call()` method to embed your input, add positional encoding, and implement multiple encoder layers.
#
# In this exercise, you will initialize your Encoder with an Embedding layer, positional encoding, and multiple EncoderLayers. Your `call()` method will perform the following steps:
# 1. Pass your input through the Embedding layer.
# 2. Scale your embedding by multiplying it by the square root of your embedding dimension. Remember to cast the embedding dimension to data type `tf.float32` before computing the square root.
# 3. Add the position encoding `self.pos_encoding[:, :seq_len, :]` to your embedding.
# 4. Pass the encoded embedding through a dropout layer, remembering to use the `training` parameter to set the model training mode.
# 5. Pass the output of the dropout layer through the stack of encoding layers using a for loop.

# In[ ]:


# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION
class Encoder(tf.keras.layers.Layer):
    """
    The entire Encoder starts by passing the input to an embedding layer
    and using positional encoding to then pass the output through a stack of
    encoder layers

    """
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size,
                 maximum_position_encoding, dropout_rate=0.1, layernorm_eps=1e-6):
        super(Encoder, self).__init__()

        self.embedding_dim = embedding_dim
        self.num_layers = num_layers

        self.embedding = Embedding(input_vocab_size, self.embedding_dim)
        self.pos_encoding = positional_encoding(maximum_position_encoding,
                                                self.embedding_dim)

        self.enc_layers = [EncoderLayer(embedding_dim=self.embedding_dim,
                                        num_heads=num_heads,
                                        fully_connected_dim=fully_connected_dim,
                                        dropout_rate=dropout_rate,
                                        layernorm_eps=layernorm_eps)
                           for _ in range(self.num_layers)]

        self.dropout = Dropout(dropout_rate)

    def call(self, x, training, mask):
        """
        Forward pass for the Encoder

        Arguments:
            x -- Tensor of shape (batch_size, input_seq_len)
            training -- Boolean, set to true to activate
                        the training mode for dropout layers
            mask -- Boolean mask to ensure that the padding is not
                    treated as part of the input

        Returns:
            x -- Tensor of shape (batch_size, input_seq_len, fully_connected_dim)
        """
        # mask = create_padding_mask(x)
        seq_len = tf.shape(x)[1]

        # START CODE HERE
        # Pass input through the Embedding layer
        x = self.embedding(x)  # (batch_size, input_seq_len, fully_connected_dim)
        # Scale embedding by multiplying it by the square root of the embedding dimension
        x *= tf.math.sqrt(tf.cast(self.embedding_dim, tf.float32))
        # Add the position encoding to embedding
        x += self.pos_encoding[:, :seq_len, :]
        # Pass the encoded embedding through a dropout layer
        x = self.dropout(x, training=training)
        # Pass the output through the stack of encoding layers
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)
        # END CODE HERE

        return x  # (batch_size, input_seq_len, fully_connected_dim)


# In[ ]:


# UNIT TEST
Encoder_test(Encoder)
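#
# Here's another ungraded usage sketch (tiny vocabulary and dimensions, purely illustrative): a 2-layer Encoder takes a batch of tokenized sequences and returns one embedding-sized vector per token. The padding mask is built with the helper from Section 2.

# In[ ]:


sample_encoder = Encoder(num_layers=2, embedding_dim=4, num_heads=2, fully_connected_dim=8,
                         input_vocab_size=10, maximum_position_encoding=5)
sample_tokens = tf.constant([[1, 2, 3], [4, 5, 0]])          # 0 is the padding token
sample_mask = create_padding_mask(sample_tokens)             # (2, 1, 3)
print(sample_encoder(sample_tokens, training=False, mask=sample_mask).shape)   # (2, 3, 4)
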
# <a name='5'></a>
# ## 5 - Decoder
#
# The Decoder layer takes the K and V matrices generated by the Encoder and computes the second multi-head attention layer with the Q matrix from the output (Figure 3a).
#
# <img src="decoder_layer.png" alt="Encoder" width="250"/>
# <caption><center><font color='purple'><b>Figure 3a: Transformer Decoder layer</font></center></caption>
#
# <a name='5-1'></a>
# ### 5.1 - Decoder Layer
# Again, you'll pair multi-head attention with a feed forward neural network, but this time you'll implement two multi-head attention layers. You will also use residual connections and layer normalization to help speed up training (Figure 3a).
#
# <a name='ex-6'></a>
# ### Exercise 6 - DecoderLayer
#
# Implement `DecoderLayer()` using the `call()` method.
#
# 1. Block 1 is a multi-head attention layer with a residual connection and a look-ahead mask. Like in the `EncoderLayer`, Dropout is defined within the multi-head attention layer.
# 2. Block 2 will take into account the output of the Encoder, so the multi-head attention layer will receive K and V from the encoder, and Q from Block 1. You will then apply a normalization layer and a residual connection, just like you did before with the `EncoderLayer`.
# 3. Finally, Block 3 is a feed forward neural network with dropout and normalization layers and a residual connection.
#
# **Additional Hints:**
# * The first two blocks are fairly similar to the EncoderLayer except you will return `attention_scores` when computing self-attention

# In[ ]:


# UNQ_C6 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION DecoderLayer
class DecoderLayer(tf.keras.layers.Layer):
    """
    The decoder layer is composed of two multi-head attention blocks,
    one that takes the new input and uses self-attention, and the other
    one that combines it with the output of the encoder, followed by a
    fully connected block.
    """
    def __init__(self, embedding_dim, num_heads, fully_connected_dim, dropout_rate=0.1, layernorm_eps=1e-6):
        super(DecoderLayer, self).__init__()

        self.mha1 = MultiHeadAttention(num_heads=num_heads,
                                       key_dim=embedding_dim,
                                       dropout=dropout_rate)

        self.mha2 = MultiHeadAttention(num_heads=num_heads,
                                       key_dim=embedding_dim,
                                       dropout=dropout_rate)

        self.ffn = FullyConnected(embedding_dim=embedding_dim,
                                  fully_connected_dim=fully_connected_dim)

        self.layernorm1 = LayerNormalization(epsilon=layernorm_eps)
        self.layernorm2 = LayerNormalization(epsilon=layernorm_eps)
        self.layernorm3 = LayerNormalization(epsilon=layernorm_eps)

        self.dropout_ffn = Dropout(dropout_rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        """
        Forward pass for the Decoder Layer

        Arguments:
            x -- Tensor of shape (batch_size, target_seq_len, fully_connected_dim)
            enc_output -- Tensor of shape (batch_size, input_seq_len, fully_connected_dim)
            training -- Boolean, set to true to activate
                        the training mode for dropout layers
            look_ahead_mask -- Boolean mask for the target_input
            padding_mask -- Boolean mask for the second multihead attention layer

        Returns:
            out3 -- Tensor of shape (batch_size, target_seq_len, fully_connected_dim)
            attn_weights_block1 -- Tensor of shape (batch_size, num_heads, target_seq_len, target_seq_len)
            attn_weights_block2 -- Tensor of shape (batch_size, num_heads, target_seq_len, input_seq_len)
        """

        # START CODE HERE
        # enc_output.shape == (batch_size, input_seq_len, fully_connected_dim)
        # BLOCK 1
        # calculate self-attention and return attention scores as attn_weights_block1.
        # Dropout will be applied during training (~1 line).
        mult_attn_out1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask, return_attention_scores=True)  # (batch_size, target_seq_len, d_model)

        # apply layer normalization (layernorm1) to the sum of the attention output and the input (~1 line)
        Q1 = self.layernorm1(mult_attn_out1 + x)

        # BLOCK 2
        # calculate attention using the Q from the first block and K and V from the encoder output.
        # Dropout will be applied during training
        # Return attention scores as attn_weights_block2 (~1 line)
        mult_attn_out2, attn_weights_block2 = self.mha2(Q1, enc_output, enc_output, padding_mask, return_attention_scores=True)  # (batch_size, target_seq_len, d_model)

        # apply layer normalization (layernorm2) to the sum of the attention output and the output of the first block (~1 line)
        mult_attn_out2 = self.layernorm2(mult_attn_out2 + Q1)  # (batch_size, target_seq_len, fully_connected_dim)

        # BLOCK 3
        # pass the output of the second block through a ffn
        ffn_output = self.ffn(mult_attn_out2)  # (batch_size, target_seq_len, fully_connected_dim)

        # apply a dropout layer to the ffn output
        ffn_output = self.dropout_ffn(ffn_output, training=training)

        # apply layer normalization (layernorm3) to the sum of the ffn output and the output of the second block
        out3 = self.layernorm3(ffn_output + mult_attn_out2)  # (batch_size, target_seq_len, fully_connected_dim)
        # END CODE HERE

        return out3, attn_weights_block1, attn_weights_block2


# In[ ]:


# UNIT TEST
DecoderLayer_test(DecoderLayer, create_look_ahead_mask)
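#
# Here's an ungraded usage sketch of a single decoder layer (toy tensors, illustrative names): it attends over its own embedded target sequence with a look-ahead mask and over a sample "encoder output", returning the two sets of attention weights alongside the block output.

# In[ ]:


sample_decoder_layer = DecoderLayer(embedding_dim=4, num_heads=2, fully_connected_dim=8)
target = tf.random.uniform((1, 3, 4))        # (batch_size, target_seq_len, embedding_dim)
enc_out = tf.random.uniform((1, 5, 4))       # (batch_size, input_seq_len, embedding_dim)
out, attn1, attn2 = sample_decoder_layer(target, enc_out, training=False,
                                         look_ahead_mask=create_look_ahead_mask(3),
                                         padding_mask=None)
print(out.shape, attn1.shape, attn2.shape)   # (1, 3, 4) (1, 2, 3, 3) (1, 2, 3, 5)
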
# <a name='5-2'></a>
# ### 5.2 - Full Decoder
# You're almost there! Time to use your Decoder layer to build a full Transformer Decoder (Figure 3b). You will embed your output and add positional encodings. You will then feed your encoded embeddings to a stack of Decoder layers.
#
#
# <img src="decoder.png" alt="Encoder" width="300"/>
# <caption><center><font color='purple'><b>Figure 3b: Transformer Decoder</font></center></caption>
#
# <a name='ex-7'></a>
# ### Exercise 7 - Decoder
#
# Implement `Decoder()` using the `call()` method to embed your output, add positional encoding, and implement multiple decoder layers.
#
# In this exercise, you will initialize your Decoder with an Embedding layer, positional encoding, and multiple DecoderLayers. Your `call()` method will perform the following steps:
# 1. Pass your generated output through the Embedding layer.
# 2. Scale your embedding by multiplying it by the square root of your embedding dimension. Remember to cast the embedding dimension to data type `tf.float32` before computing the square root.
# 3. Add the position encoding `self.pos_encoding[:, :seq_len, :]` to your embedding.
# 4. Pass the encoded embedding through a dropout layer, remembering to use the `training` parameter to set the model training mode.
# 5. Pass the output of the dropout layer through the stack of Decoder layers using a for loop.

# In[ ]:


# UNQ_C7 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION Decoder
class Decoder(tf.keras.layers.Layer):
    """
    The entire Decoder starts by passing the target input to an embedding layer
    and using positional encoding to then pass the output through a stack of
    decoder layers

    """
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, target_vocab_size,
                 maximum_position_encoding, dropout_rate=0.1, layernorm_eps=1e-6):
        super(Decoder, self).__init__()

        self.embedding_dim = embedding_dim
        self.num_layers = num_layers

        self.embedding = Embedding(target_vocab_size, self.embedding_dim)
        self.pos_encoding = positional_encoding(maximum_position_encoding, self.embedding_dim)

        self.dec_layers = [DecoderLayer(embedding_dim=self.embedding_dim,
                                        num_heads=num_heads,
                                        fully_connected_dim=fully_connected_dim,
                                        dropout_rate=dropout_rate,
                                        layernorm_eps=layernorm_eps)
                           for _ in range(self.num_layers)]
        self.dropout = Dropout(dropout_rate)

    def call(self, x, enc_output, training,
             look_ahead_mask, padding_mask):
        """
        Forward pass for the Decoder

        Arguments:
            x -- Tensor of shape (batch_size, target_seq_len)
            enc_output -- Tensor of shape (batch_size, input_seq_len, fully_connected_dim)
            training -- Boolean, set to true to activate
                        the training mode for dropout layers
            look_ahead_mask -- Boolean mask for the target_input
            padding_mask -- Boolean mask for the second multihead attention layer

        Returns:
            x -- Tensor of shape (batch_size, target_seq_len, fully_connected_dim)
            attention_weights -- Dictionary of tensors containing all the attention weights,
                                 each of shape (batch_size, num_heads, target_seq_len, input_seq_len)
        """

        seq_len = tf.shape(x)[1]
        attention_weights = {}

        # START CODE HERE
        # create word embeddings
        x = self.embedding(x)  # (batch_size, target_seq_len, fully_connected_dim)

        # scale embeddings by multiplying by the square root of their dimension
        x *= tf.math.sqrt(tf.cast(self.embedding_dim, tf.float32))

        # calculate positional encodings and add to word embedding
        x += self.pos_encoding[:, :seq_len, :]

        # apply a dropout layer to x
        x = self.dropout(x, training=training)

        # use a for loop to pass x through a stack of decoder layers and update attention_weights (~4 lines total)
        for i in range(self.num_layers):
            # pass x and the encoder output through a stack of decoder layers and save the attention weights
            # of block 1 and 2 (~1 line)
            x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                                                   look_ahead_mask, padding_mask)

            # update attention_weights dictionary with the attention weights of block 1 and block 2
            attention_weights['decoder_layer{}_block1_self_att'.format(i+1)] = block1
            attention_weights['decoder_layer{}_block2_decenc_att'.format(i+1)] = block2
        # END CODE HERE

        # x.shape == (batch_size, target_seq_len, fully_connected_dim)
        return x, attention_weights


# In[ ]:


# UNIT TEST
Decoder_test(Decoder, create_look_ahead_mask, create_padding_mask)
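#
# One more ungraded usage sketch (toy sizes, illustrative names): a 2-layer Decoder maps target token ids plus an encoder output to per-token vectors, and also returns a dictionary with one block-1 and one block-2 attention tensor per layer.

# In[ ]:


sample_decoder = Decoder(num_layers=2, embedding_dim=4, num_heads=2, fully_connected_dim=8,
                         target_vocab_size=10, maximum_position_encoding=6)
target_tokens = tf.constant([[1, 2, 3]])
dec_out, attn = sample_decoder(target_tokens, tf.random.uniform((1, 5, 4)), training=False,
                               look_ahead_mask=create_look_ahead_mask(3), padding_mask=None)
print(dec_out.shape)        # (1, 3, 4)
print(list(attn.keys()))    # two entries per decoder layer
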
# <a name='6'></a>
# ## 6 - Transformer
#
# Phew! This has been quite the assignment, and now you've made it to your last exercise of the Deep Learning Specialization. Congratulations! You've done all the hard work, now it's time to put it all together.
#
# <img src="transformer.png" alt="Transformer" width="550"/>
# <caption><center><font color='purple'><b>Figure 4: Transformer</font></center></caption>
#
# The flow of data through the Transformer architecture is as follows:
# * First your input passes through an Encoder, which is just repeated Encoder layers that you implemented:
#     - embedding and positional encoding of your input
#     - multi-head attention on your input
#     - feed forward neural network to help detect features
# * Then the predicted output passes through a Decoder, consisting of the decoder layers that you implemented:
#     - embedding and positional encoding of the output
#     - multi-head attention on your generated output
#     - multi-head attention with the Q from the first multi-head attention layer and the K and V from the Encoder
#     - a feed forward neural network to help detect features
# * Finally, after the Nth Decoder layer, a dense layer and a softmax are applied to generate a prediction for the next output in your sequence.
#
# <a name='ex-8'></a>
# ### Exercise 8 - Transformer
#
# Implement `Transformer()` using the `call()` method.
# 1. Pass the input through the Encoder with the appropriate mask.
# 2. Pass the encoder output and the target through the Decoder with the appropriate mask.
# 3. Apply a linear transformation and a softmax to get a prediction.

# In[ ]:


# UNQ_C8 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION Transformer
class Transformer(tf.keras.Model):
    """
    Complete transformer with an Encoder and a Decoder
    """
    def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size,
                 target_vocab_size, max_positional_encoding_input,
                 max_positional_encoding_target, dropout_rate=0.1, layernorm_eps=1e-6):
        super(Transformer, self).__init__()

        self.encoder = Encoder(num_layers=num_layers,
                               embedding_dim=embedding_dim,
                               num_heads=num_heads,
                               fully_connected_dim=fully_connected_dim,
                               input_vocab_size=input_vocab_size,
                               maximum_position_encoding=max_positional_encoding_input,
                               dropout_rate=dropout_rate,
                               layernorm_eps=layernorm_eps)

        self.decoder = Decoder(num_layers=num_layers,
                               embedding_dim=embedding_dim,
                               num_heads=num_heads,
                               fully_connected_dim=fully_connected_dim,
                               target_vocab_size=target_vocab_size,
                               maximum_position_encoding=max_positional_encoding_target,
                               dropout_rate=dropout_rate,
                               layernorm_eps=layernorm_eps)

        self.final_layer = Dense(target_vocab_size, activation='softmax')

    def call(self, input_sentence, output_sentence, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):
        """
        Forward pass for the entire Transformer

        Arguments:
            input_sentence -- Tensor of shape (batch_size, input_seq_len)
                              An array of the indexes of the words in the input sentence
            output_sentence -- Tensor of shape (batch_size, target_seq_len)
                               An array of the indexes of the words in the output sentence
            training -- Boolean, set to true to activate
                        the training mode for dropout layers
            enc_padding_mask -- Boolean mask to ensure that the padding is not
                                treated as part of the input
            look_ahead_mask -- Boolean mask for the target_input
            dec_padding_mask -- Boolean mask for the second multihead attention layer

        Returns:
            final_output -- Tensor of shape (batch_size, target_seq_len, target_vocab_size)
            attention_weights -- Dictionary of tensors containing all the attention weights for the decoder,
                                 each of shape (batch_size, num_heads, target_seq_len, input_seq_len)
        """
        # START CODE HERE
        # call self.encoder with the appropriate arguments to get the encoder output
        enc_output = self.encoder(input_sentence, training, enc_padding_mask)  # (batch_size, inp_seq_len, fully_connected_dim)

        # call self.decoder with the appropriate arguments to get the decoder output
        # dec_output.shape == (batch_size, tar_seq_len, fully_connected_dim)
        dec_output, attention_weights = self.decoder(output_sentence, enc_output, training, look_ahead_mask, dec_padding_mask)

        # pass the decoder output through a linear layer with a softmax activation (~1 line)
        final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)
        # END CODE HERE

        return final_output, attention_weights


# In[ ]:


# UNIT TEST
Transformer_test(Transformer, create_look_ahead_mask, create_padding_mask)
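#
# To wrap up, here's an ungraded end-to-end sketch (tiny vocabularies and dimensions, illustrative names): a 2-layer Transformer is called on one padded input sentence and a partial target sentence, with the three masks built from the helper functions above. The output is a distribution over the target vocabulary for every target position.

# In[ ]:


sample_transformer = Transformer(num_layers=2, embedding_dim=4, num_heads=2, fully_connected_dim=8,
                                 input_vocab_size=10, target_vocab_size=12,
                                 max_positional_encoding_input=6, max_positional_encoding_target=6)
inp = tf.constant([[1, 2, 3, 0, 0]])                    # padded input sentence
tar = tf.constant([[5, 6, 7]])                          # target sentence generated so far
enc_padding_mask = create_padding_mask(inp)             # mask padding in the encoder
look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
dec_padding_mask = create_padding_mask(inp)             # the decoder attends to the (padded) encoder output
out, attn = sample_transformer(inp, tar, training=False,
                               enc_padding_mask=enc_padding_mask,
                               look_ahead_mask=look_ahead_mask,
                               dec_padding_mask=dec_padding_mask)
print(out.shape)                                        # (1, 3, 12): probabilities over the target vocabulary
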
# ## Conclusion
#
# You've come to the end of the graded portion of the assignment. By now, you've:
#
# * Created positional encodings to capture sequential relationships in data
# * Calculated scaled dot-product self-attention with word embeddings
# * Implemented masked multi-head attention
# * Built and trained a Transformer model

# <font color='blue'>
# <b>What you should remember</b>:
#
# - The combination of self-attention and convolutional network layers allows for parallelization and *faster training*.
# - Self-attention is calculated using the generated query Q, key K, and value V matrices.
# - Adding positional encoding to word embeddings is an effective way of including sequence information in self-attention calculations.
# - Multi-head attention can help detect multiple features in your sentence.
# - Masking stops the model from 'looking ahead' during training, or weighting zeroes too much when processing cropped sentences.

# Now that you have completed the Transformer assignment, make sure you check out the ungraded labs to apply the Transformer model to practical use cases such as Named Entity Recognition (NER) and Question Answering (QA).
#
#
# # Congratulations on finishing the Deep Learning Specialization!!!!!! 🎉🎉🎉🎉🎉
#
# This was the last graded assignment of the specialization. It is now time to celebrate all your hard work and dedication!
#
# <a name='7'></a>
# ## 7 - References
#
# The Transformer algorithm is due to Vaswani et al. (2017).
#
# - Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017). [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

# In[ ]: