GitHub Repository: keras-team/keras-io
Path: blob/master/examples/nlp/neural_machine_translation_with_keras_hub.py
"""
Title: English-to-Spanish translation with KerasHub
Author: [Abheesht Sharma](https://github.com/abheesht17/)
Date created: 2022/05/26
Last modified: 2024/04/30
Description: Use KerasHub to train a sequence-to-sequence Transformer model on the machine translation task.
Accelerator: GPU
"""
"""
## Introduction

KerasHub provides building blocks for NLP (model layers, tokenizers, metrics, etc.) and
makes it convenient to construct NLP pipelines.

In this example, we'll use KerasHub layers to build an encoder-decoder Transformer
model, and train it on the English-to-Spanish machine translation task.

This example is based on the
[English-to-Spanish NMT
example](https://keras.io/examples/nlp/neural_machine_translation_with_transformer/)
by [fchollet](https://twitter.com/fchollet). The original example is more low-level
and implements layers from scratch, whereas this example uses KerasHub to show
some more advanced approaches, such as subword tokenization and using metrics
to compute the quality of generated translations.

You'll learn how to:

- Tokenize text using `keras_hub.tokenizers.WordPieceTokenizer`.
- Implement a sequence-to-sequence Transformer model using KerasHub's
`keras_hub.layers.TransformerEncoder`, `keras_hub.layers.TransformerDecoder` and
`keras_hub.layers.TokenAndPositionEmbedding` layers, and train it.
- Use `keras_hub.samplers` to generate translations of unseen input sentences
with greedy decoding.

Don't worry if you aren't familiar with KerasHub. This tutorial will start with
the basics. Let's dive right in!
"""
"""
## Setup

Before we start implementing the pipeline, let's import all the libraries we need.
"""

"""shell
pip install -q --upgrade rouge-score
pip install -q --upgrade keras-hub
pip install -q --upgrade keras  # Upgrade to Keras 3.
"""

import keras_hub
import pathlib
import random

import keras
from keras import ops

import tensorflow.data as tf_data
from tensorflow_text.tools.wordpiece_vocab import (
    bert_vocab_from_dataset as bert_vocab,
)

"""
Let's also define our parameters/hyperparameters.
"""

BATCH_SIZE = 64
EPOCHS = 1  # This should be at least 10 for convergence
MAX_SEQUENCE_LENGTH = 40
ENG_VOCAB_SIZE = 15000
SPA_VOCAB_SIZE = 15000

EMBED_DIM = 256
INTERMEDIATE_DIM = 2048
NUM_HEADS = 8
"""
79
## Downloading the data
80
81
We'll be working with an English-to-Spanish translation dataset
82
provided by [Anki](https://www.manythings.org/anki/). Let's download it:
83
"""
84
85
text_file = keras.utils.get_file(
86
fname="spa-eng.zip",
87
origin="http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip",
88
extract=True,
89
)
90
text_file = pathlib.Path(text_file).parent / "spa-eng" / "spa.txt"
91
92
"""
93
## Parsing the data
94
95
Each line contains an English sentence and its corresponding Spanish sentence.
96
The English sentence is the *source sequence* and Spanish one is the *target sequence*.
97
Before adding the text to a list, we convert it to lowercase.
98
"""
with open(text_file) as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    eng, spa = line.split("\t")
    eng = eng.lower()
    spa = spa.lower()
    text_pairs.append((eng, spa))

"""
Here's what our sentence pairs look like:
"""

for _ in range(5):
    print(random.choice(text_pairs))

"""
Now, let's split the sentence pairs into a training set, a validation set,
and a test set.
"""

random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")
"""
135
## Tokenizing the data
136
137
We'll define two tokenizers - one for the source language (English), and the other
138
for the target language (Spanish). We'll be using
139
`keras_hub.tokenizers.WordPieceTokenizer` to tokenize the text.
140
`keras_hub.tokenizers.WordPieceTokenizer` takes a WordPiece vocabulary
141
and has functions for tokenizing the text, and detokenizing sequences of tokens.
142
143
Before we define the two tokenizers, we first need to train them on the dataset
144
we have. The WordPiece tokenization algorithm is a subword tokenization algorithm;
145
training it on a corpus gives us a vocabulary of subwords. A subword tokenizer
146
is a compromise between word tokenizers (word tokenizers need very large
147
vocabularies for good coverage of input words), and character tokenizers
148
(characters don't really encode meaning like words do). Luckily, KerasHub
149
makes it very simple to train WordPiece on a corpus with the
150
`keras_hub.tokenizers.compute_word_piece_vocabulary` utility.
151
"""
152
153
154
def train_word_piece(text_samples, vocab_size, reserved_tokens):
155
word_piece_ds = tf_data.Dataset.from_tensor_slices(text_samples)
156
vocab = keras_hub.tokenizers.compute_word_piece_vocabulary(
157
word_piece_ds.batch(1000).prefetch(2),
158
vocabulary_size=vocab_size,
159
reserved_tokens=reserved_tokens,
160
)
161
return vocab
162
163
164
"""
165
Every vocabulary has a few special, reserved tokens. We have four such tokens:
166
167
- `"[PAD]"` - Padding token. Padding tokens are appended to the input sequence
168
length when the input sequence length is shorter than the maximum sequence length.
169
- `"[UNK]"` - Unknown token.
170
- `"[START]"` - Token that marks the start of the input sequence.
171
- `"[END]"` - Token that marks the end of the input sequence.
172
"""
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

eng_samples = [text_pair[0] for text_pair in train_pairs]
eng_vocab = train_word_piece(eng_samples, ENG_VOCAB_SIZE, reserved_tokens)

spa_samples = [text_pair[1] for text_pair in train_pairs]
spa_vocab = train_word_piece(spa_samples, SPA_VOCAB_SIZE, reserved_tokens)

"""
Let's see some tokens!
"""

print("English Tokens: ", eng_vocab[100:110])
print("Spanish Tokens: ", spa_vocab[100:110])

"""
Now, let's define the tokenizers. We will configure the tokenizers with the
vocabularies trained above.
"""

eng_tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=eng_vocab, lowercase=False
)
spa_tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=spa_vocab, lowercase=False
)
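"""
As a quick, purely illustrative sanity check (not a required step), the reserved
tokens were passed first when computing the vocabularies, so they should map to the
lowest IDs in both tokenizers:
"""

for token in reserved_tokens:
    print(
        token,
        "-> English ID:",
        eng_tokenizer.token_to_id(token),
        "| Spanish ID:",
        spa_tokenizer.token_to_id(token),
    )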
"""
Let's try and tokenize a sample from our dataset! To verify whether the text has
been tokenized correctly, we can also detokenize the list of tokens back to the
original text.
"""

eng_input_ex = text_pairs[0][0]
eng_tokens_ex = eng_tokenizer.tokenize(eng_input_ex)
print("English sentence: ", eng_input_ex)
print("Tokens: ", eng_tokens_ex)
print(
    "Recovered text after detokenizing: ",
    eng_tokenizer.detokenize(eng_tokens_ex),
)

print()

spa_input_ex = text_pairs[0][1]
spa_tokens_ex = spa_tokenizer.tokenize(spa_input_ex)
print("Spanish sentence: ", spa_input_ex)
print("Tokens: ", spa_tokens_ex)
print(
    "Recovered text after detokenizing: ",
    spa_tokenizer.detokenize(spa_tokens_ex),
)
"""
228
## Format datasets
229
230
Next, we'll format our datasets.
231
232
At each training step, the model will seek to predict target words N+1 (and beyond)
233
using the source sentence and the target words 0 to N.
234
235
As such, the training dataset will yield a tuple `(inputs, targets)`, where:
236
237
- `inputs` is a dictionary with the keys `encoder_inputs` and `decoder_inputs`.
238
`encoder_inputs` is the tokenized source sentence and `decoder_inputs` is the target
239
sentence "so far",
240
that is to say, the words 0 to N used to predict word N+1 (and beyond) in the target
241
sentence.
242
- `target` is the target sentence offset by one step:
243
it provides the next words in the target sentence -- what the model will try to predict.
244
245
We will add special tokens, `"[START]"` and `"[END]"`, to the input Spanish
246
sentence after tokenizing the text. We will also pad the input to a fixed length.
247
This can be easily done using `keras_hub.layers.StartEndPacker`.
248
"""
def preprocess_batch(eng, spa):
    batch_size = ops.shape(spa)[0]

    eng = eng_tokenizer(eng)
    spa = spa_tokenizer(spa)

    # Pad `eng` to `MAX_SEQUENCE_LENGTH`.
    eng_start_end_packer = keras_hub.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH,
        pad_value=eng_tokenizer.token_to_id("[PAD]"),
    )
    eng = eng_start_end_packer(eng)

    # Add special tokens (`"[START]"` and `"[END]"`) to `spa` and pad it as well.
    spa_start_end_packer = keras_hub.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH + 1,
        start_value=spa_tokenizer.token_to_id("[START]"),
        end_value=spa_tokenizer.token_to_id("[END]"),
        pad_value=spa_tokenizer.token_to_id("[PAD]"),
    )
    spa = spa_start_end_packer(spa)

    return (
        {
            "encoder_inputs": eng,
            "decoder_inputs": spa[:, :-1],
        },
        spa[:, 1:],
    )


def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf_data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.map(preprocess_batch, num_parallel_calls=tf_data.AUTOTUNE)
    return dataset.shuffle(2048).prefetch(16).cache()


train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

"""
Let's take a quick look at the sequence shapes
(we have batches of 64 pairs, and all sequences are 40 steps long):
"""

for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")
"""
307
## Building the model
308
309
Now, let's move on to the exciting part - defining our model!
310
We first need an embedding layer, i.e., a vector for every token in our input sequence.
311
This embedding layer can be initialised randomly. We also need a positional
312
embedding layer which encodes the word order in the sequence. The convention is
313
to add these two embeddings. KerasHub has a `keras_hub.layers.TokenAndPositionEmbedding `
314
layer which does all of the above steps for us.
315
316
Our sequence-to-sequence Transformer consists of a `keras_hub.layers.TransformerEncoder`
317
layer and a `keras_hub.layers.TransformerDecoder` layer chained together.
318
319
The source sequence will be passed to `keras_hub.layers.TransformerEncoder`, which
320
will produce a new representation of it. This new representation will then be passed
321
to the `keras_hub.layers.TransformerDecoder`, together with the target sequence
322
so far (target words 0 to N). The `keras_hub.layers.TransformerDecoder` will
323
then seek to predict the next words in the target sequence (N+1 and beyond).
324
325
A key detail that makes this possible is causal masking.
326
The `keras_hub.layers.TransformerDecoder` sees the entire sequence at once, and
327
thus we must make sure that it only uses information from target tokens 0 to N
328
when predicting token N+1 (otherwise, it could use information from the future,
329
which would result in a model that cannot be used at inference time). Causal masking
330
is enabled by default in `keras_hub.layers.TransformerDecoder`.
331
332
We also need to mask the padding tokens (`"[PAD]"`). For this, we can set the
333
`mask_zero` argument of the `keras_hub.layers.TokenAndPositionEmbedding` layer
334
to True. This will then be propagated to all subsequent layers.
335
"""
# Encoder
encoder_inputs = keras.Input(shape=(None,), name="encoder_inputs")

x = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=ENG_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
)(encoder_inputs)

encoder_outputs = keras_hub.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
encoder = keras.Model(encoder_inputs, encoder_outputs)


# Decoder
decoder_inputs = keras.Input(shape=(None,), name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, EMBED_DIM), name="decoder_state_inputs")

x = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=SPA_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
)(decoder_inputs)

x = keras_hub.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs)
x = keras.layers.Dropout(0.5)(x)
decoder_outputs = keras.layers.Dense(SPA_VOCAB_SIZE, activation="softmax")(x)
decoder = keras.Model(
    [
        decoder_inputs,
        encoded_seq_inputs,
    ],
    decoder_outputs,
)
decoder_outputs = decoder([decoder_inputs, encoder_outputs])

transformer = keras.Model(
    [encoder_inputs, decoder_inputs],
    decoder_outputs,
    name="transformer",
)
"""
383
## Training our model
384
385
We'll use accuracy as a quick way to monitor training progress on the validation data.
386
Note that machine translation typically uses BLEU scores as well as other metrics,
387
rather than accuracy. However, in order to use metrics like ROUGE, BLEU, etc. we
388
will have decode the probabilities and generate the text. Text generation is
389
computationally expensive, and performing this during training is not recommended.
390
391
Here we only train for 1 epoch, but to get the model to actually converge
392
you should train for at least 10 epochs.
393
"""
transformer.summary()
transformer.compile(
    "rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
transformer.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)
"""
402
## Decoding test sentences (qualitative analysis)
403
404
Finally, let's demonstrate how to translate brand new English sentences.
405
We simply feed into the model the tokenized English sentence
406
as well as the target token `"[START]"`. The model outputs probabilities of the
407
next token. We then we repeatedly generated the next token conditioned on the
408
tokens generated so far, until we hit the token `"[END]"`.
409
410
For decoding, we will use the `keras_hub.samplers` module from
411
KerasHub. Greedy Decoding is a text decoding method which outputs the most
412
likely next token at each time step, i.e., the token with the highest probability.
413
"""
def decode_sequences(input_sentences):
    batch_size = 1

    # Tokenize the encoder input.
    encoder_input_tokens = ops.convert_to_tensor(eng_tokenizer(input_sentences))
    if len(encoder_input_tokens[0]) < MAX_SEQUENCE_LENGTH:
        pads = ops.full((1, MAX_SEQUENCE_LENGTH - len(encoder_input_tokens[0])), 0)
        encoder_input_tokens = ops.concatenate(
            [encoder_input_tokens.to_tensor(), pads], 1
        )

    # Define a function that outputs the next token's probability given the
    # input sequence.
    def next(prompt, cache, index):
        logits = transformer([encoder_input_tokens, prompt])[:, index - 1, :]
        # Ignore hidden states for now; only needed for contrastive search.
        hidden_states = None
        return logits, hidden_states, cache

    # Build a prompt of length 40 with a start token and padding tokens.
    length = 40
    start = ops.full((batch_size, 1), spa_tokenizer.token_to_id("[START]"))
    pad = ops.full((batch_size, length - 1), spa_tokenizer.token_to_id("[PAD]"))
    prompt = ops.concatenate((start, pad), axis=-1)

    generated_tokens = keras_hub.samplers.GreedySampler()(
        next,
        prompt,
        stop_token_ids=[spa_tokenizer.token_to_id("[END]")],
        index=1,  # Start sampling after start token.
    )
    generated_sentences = spa_tokenizer.detokenize(generated_tokens)
    return generated_sentences


test_eng_texts = [pair[0] for pair in test_pairs]
for i in range(2):
    input_sentence = random.choice(test_eng_texts)
    translated = decode_sequences([input_sentence])
    translated = translated.numpy()[0].decode("utf-8")
    translated = (
        translated.replace("[PAD]", "")
        .replace("[START]", "")
        .replace("[END]", "")
        .strip()
    )
    print(f"** Example {i} **")
    print(input_sentence)
    print(translated)
    print()
"""
468
## Evaluating our model (quantitative analysis)
469
470
There are many metrics which are used for text generation tasks. Here, to
471
evaluate translations generated by our model, let's compute the ROUGE-1 and
472
ROUGE-2 scores. Essentially, ROUGE-N is a score based on the number of common
473
n-grams between the reference text and the generated text. ROUGE-1 and ROUGE-2
474
use the number of common unigrams and bigrams, respectively.
475
476
We will calculate the score over 30 test samples (since decoding is an
477
expensive process).
478
"""
rouge_1 = keras_hub.metrics.RougeN(order=1)
rouge_2 = keras_hub.metrics.RougeN(order=2)

for test_pair in test_pairs[:30]:
    input_sentence = test_pair[0]
    reference_sentence = test_pair[1]

    translated_sentence = decode_sequences([input_sentence])
    translated_sentence = translated_sentence.numpy()[0].decode("utf-8")
    translated_sentence = (
        translated_sentence.replace("[PAD]", "")
        .replace("[START]", "")
        .replace("[END]", "")
        .strip()
    )

    rouge_1(reference_sentence, translated_sentence)
    rouge_2(reference_sentence, translated_sentence)

print("ROUGE-1 Score: ", rouge_1.result())
print("ROUGE-2 Score: ", rouge_2.result())

"""
After 10 epochs, the scores are as follows:

|               | **ROUGE-1** | **ROUGE-2** |
|:-------------:|:-----------:|:-----------:|
| **Precision** |    0.568    |    0.374    |
|  **Recall**   |    0.615    |    0.394    |
|  **F1 Score** |    0.579    |    0.381    |
"""