"""1Title: English-to-Spanish translation with KerasHub2Author: [Abheesht Sharma](https://github.com/abheesht17/)3Date created: 2022/05/264Last modified: 2024/04/305Description: Use KerasHub to train a sequence-to-sequence Transformer model on the machine translation task.6Accelerator: GPU7"""89"""10## Introduction1112KerasHub provides building blocks for NLP (model layers, tokenizers, metrics, etc.) and13makes it convenient to construct NLP pipelines.1415In this example, we'll use KerasHub layers to build an encoder-decoder Transformer16model, and train it on the English-to-Spanish machine translation task.1718This example is based on the19[English-to-Spanish NMT20example](https://keras.io/examples/nlp/neural_machine_translation_with_transformer/)21by [fchollet](https://twitter.com/fchollet). The original example is more low-level22and implements layers from scratch, whereas this example uses KerasHub to show23some more advanced approaches, such as subword tokenization and using metrics24to compute the quality of generated translations.2526You'll learn how to:2728- Tokenize text using `keras_hub.tokenizers.WordPieceTokenizer`.29- Implement a sequence-to-sequence Transformer model using KerasHub's30`keras_hub.layers.TransformerEncoder`, `keras_hub.layers.TransformerDecoder` and31`keras_hub.layers.TokenAndPositionEmbedding` layers, and train it.32- Use `keras_hub.samplers` to generate translations of unseen input sentences33using the top-p decoding strategy!3435Don't worry if you aren't familiar with KerasHub. This tutorial will start with36the basics. Let's dive right in!37"""3839"""40## Setup4142Before we start implementing the pipeline, let's import all the libraries we need.43"""4445"""shell46pip install -q --upgrade rouge-score47pip install -q --upgrade keras-hub48pip install -q --upgrade keras # Upgrade to Keras 3.49"""5051import keras_hub52import pathlib53import random5455import keras56from keras import ops5758import tensorflow.data as tf_data59from tensorflow_text.tools.wordpiece_vocab import (60bert_vocab_from_dataset as bert_vocab,61)6263"""64Let's also define our parameters/hyperparameters.65"""6667BATCH_SIZE = 6468EPOCHS = 1 # This should be at least 10 for convergence69MAX_SEQUENCE_LENGTH = 4070ENG_VOCAB_SIZE = 1500071SPA_VOCAB_SIZE = 150007273EMBED_DIM = 25674INTERMEDIATE_DIM = 204875NUM_HEADS = 87677"""78## Downloading the data7980We'll be working with an English-to-Spanish translation dataset81provided by [Anki](https://www.manythings.org/anki/). 
Let's download it:
"""

text_file = keras.utils.get_file(
    fname="spa-eng.zip",
    origin="http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip",
    extract=True,
)
text_file = pathlib.Path(text_file).parent / "spa-eng" / "spa.txt"

"""
## Parsing the data

Each line contains an English sentence and its corresponding Spanish sentence,
separated by a tab character. The English sentence is the *source sequence* and
the Spanish one is the *target sequence*. Before adding the text to a list, we
convert it to lowercase.
"""

with open(text_file) as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    eng, spa = line.split("\t")
    eng = eng.lower()
    spa = spa.lower()
    text_pairs.append((eng, spa))

"""
Here's what our sentence pairs look like:
"""

for _ in range(5):
    print(random.choice(text_pairs))

"""
Now, let's split the sentence pairs into a training set, a validation set,
and a test set.
"""

random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")


"""
## Tokenizing the data

We'll define two tokenizers - one for the source language (English), and the other
for the target language (Spanish). We'll be using
`keras_hub.tokenizers.WordPieceTokenizer` to tokenize the text.
`keras_hub.tokenizers.WordPieceTokenizer` takes a WordPiece vocabulary
and has functions for tokenizing the text, and detokenizing sequences of tokens.

Before we define the two tokenizers, we first need to train them on the dataset
we have. The WordPiece tokenization algorithm is a subword tokenization algorithm;
training it on a corpus gives us a vocabulary of subwords. A subword tokenizer
is a compromise between word tokenizers (word tokenizers need very large
vocabularies for good coverage of input words) and character tokenizers
(characters don't really encode meaning like words do). Luckily, KerasHub
makes it very simple to train a WordPiece vocabulary on a corpus with the
`keras_hub.tokenizers.compute_word_piece_vocabulary` utility.
"""


def train_word_piece(text_samples, vocab_size, reserved_tokens):
    word_piece_ds = tf_data.Dataset.from_tensor_slices(text_samples)
    vocab = keras_hub.tokenizers.compute_word_piece_vocabulary(
        word_piece_ds.batch(1000).prefetch(2),
        vocabulary_size=vocab_size,
        reserved_tokens=reserved_tokens,
    )
    return vocab

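
"""
To get a feel for what this helper returns, here is a tiny, optional toy run on a
handful of made-up sentences. The output is simply a list of vocabulary strings;
the real vocabularies (and the reserved tokens passed in below) are trained on the
full training split right after this.
"""

# Toy corpus and a small vocabulary size, purely for illustration.
toy_vocab = train_word_piece(
    ["the cat sat on the mat", "the dog sat on the log"],
    vocab_size=50,
    reserved_tokens=["[PAD]", "[UNK]", "[START]", "[END]"],
)
print("Toy vocabulary (first 15 entries):", toy_vocab[:15])
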
"""
Every vocabulary has a few special, reserved tokens. We have four such tokens:

- `"[PAD]"` - Padding token. Padding tokens are appended to the input sequence
when it is shorter than the maximum sequence length.
- `"[UNK]"` - Unknown token.
- `"[START]"` - Token that marks the start of the input sequence.
- `"[END]"` - Token that marks the end of the input sequence.
"""

reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

eng_samples = [text_pair[0] for text_pair in train_pairs]
eng_vocab = train_word_piece(eng_samples, ENG_VOCAB_SIZE, reserved_tokens)

spa_samples = [text_pair[1] for text_pair in train_pairs]
spa_vocab = train_word_piece(spa_samples, SPA_VOCAB_SIZE, reserved_tokens)

"""
Let's see some tokens!
"""

print("English Tokens: ", eng_vocab[100:110])
print("Spanish Tokens: ", spa_vocab[100:110])

"""
Now, let's define the tokenizers. We will configure the tokenizers with the
vocabularies trained above.
"""

eng_tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=eng_vocab, lowercase=False
)
spa_tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=spa_vocab, lowercase=False
)

"""
Let's try tokenizing a sample from our dataset! To verify whether the text has
been tokenized correctly, we can also detokenize the list of tokens back to the
original text.
"""

eng_input_ex = text_pairs[0][0]
eng_tokens_ex = eng_tokenizer.tokenize(eng_input_ex)
print("English sentence: ", eng_input_ex)
print("Tokens: ", eng_tokens_ex)
print(
    "Recovered text after detokenizing: ",
    eng_tokenizer.detokenize(eng_tokens_ex),
)

print()

spa_input_ex = text_pairs[0][1]
spa_tokens_ex = spa_tokenizer.tokenize(spa_input_ex)
print("Spanish sentence: ", spa_input_ex)
print("Tokens: ", spa_tokens_ex)
print(
    "Recovered text after detokenizing: ",
    spa_tokenizer.detokenize(spa_tokens_ex),
)

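
"""
Because these are subword tokenizers, rare or long words are broken into smaller
pieces instead of being mapped to `"[UNK]"`. The optional snippet below maps each
token ID back to its subword string so you can see the split; it assumes the
tokenizer's `id_to_token` method (the inverse of `token_to_id`, which we use later
when building the preprocessing pipeline).
"""

word = "unquestionably"
word_token_ids = eng_tokenizer.tokenize(word)
# Continuation pieces are prefixed with "##" in the WordPiece convention.
word_pieces = [eng_tokenizer.id_to_token(int(i)) for i in word_token_ids]
print(f"Subword pieces for '{word}':", word_pieces)
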
"""
## Format datasets

Next, we'll format our datasets.

At each training step, the model will seek to predict target words N+1 (and beyond)
using the source sentence and the target words 0 to N.

As such, the training dataset will yield a tuple `(inputs, targets)`, where:

- `inputs` is a dictionary with the keys `encoder_inputs` and `decoder_inputs`.
`encoder_inputs` is the tokenized source sentence and `decoder_inputs` is the target
sentence "so far",
that is to say, the words 0 to N used to predict word N+1 (and beyond) in the target
sentence.
- `targets` is the target sentence offset by one step:
it provides the next words in the target sentence -- what the model will try to predict.

We will add special tokens, `"[START]"` and `"[END]"`, to the input Spanish
sentence after tokenizing the text. We will also pad the input to a fixed length.
This can be easily done using `keras_hub.layers.StartEndPacker`.
"""


def preprocess_batch(eng, spa):
    batch_size = ops.shape(spa)[0]

    eng = eng_tokenizer(eng)
    spa = spa_tokenizer(spa)

    # Pad `eng` to `MAX_SEQUENCE_LENGTH`.
    eng_start_end_packer = keras_hub.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH,
        pad_value=eng_tokenizer.token_to_id("[PAD]"),
    )
    eng = eng_start_end_packer(eng)

    # Add special tokens (`"[START]"` and `"[END]"`) to `spa` and pad it as well.
    spa_start_end_packer = keras_hub.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH + 1,
        start_value=spa_tokenizer.token_to_id("[START]"),
        end_value=spa_tokenizer.token_to_id("[END]"),
        pad_value=spa_tokenizer.token_to_id("[PAD]"),
    )
    spa = spa_start_end_packer(spa)

    return (
        {
            "encoder_inputs": eng,
            "decoder_inputs": spa[:, :-1],
        },
        spa[:, 1:],
    )


def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf_data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.map(preprocess_batch, num_parallel_calls=tf_data.AUTOTUNE)
    return dataset.shuffle(2048).prefetch(16).cache()


train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

"""
Let's take a quick look at the sequence shapes
(we have batches of 64 pairs, and all sequences are 40 steps long):
"""

for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")

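
"""
To make the one-step offset above concrete, here is a tiny, optional illustration
on a hand-made "packed" target sequence with made-up token IDs. The decoder input
drops the last token and the target drops the first one, so position `i` of the
target is exactly the token the decoder should predict after seeing tokens `0` to
`i` of its input.
"""

# Made-up IDs standing in for: [START], tok_1, tok_2, tok_3, [END], [PAD], [PAD].
toy_spa = ops.convert_to_tensor([[2, 45, 46, 47, 3, 0, 0]])
print("decoder_inputs:", toy_spa[:, :-1])  # what the decoder sees
print("targets:       ", toy_spa[:, 1:])  # what the decoder learns to predict
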
"""
## Building the model

Now, let's move on to the exciting part - defining our model!
We first need an embedding layer, i.e., a vector for every token in our input sequence.
This embedding layer can be initialised randomly. We also need a positional
embedding layer which encodes the word order in the sequence. The convention is
to add these two embeddings. KerasHub has a `keras_hub.layers.TokenAndPositionEmbedding`
layer which does all of the above steps for us.

Our sequence-to-sequence Transformer consists of a `keras_hub.layers.TransformerEncoder`
layer and a `keras_hub.layers.TransformerDecoder` layer chained together.

The source sequence will be passed to `keras_hub.layers.TransformerEncoder`, which
will produce a new representation of it. This new representation will then be passed
to the `keras_hub.layers.TransformerDecoder`, together with the target sequence
so far (target words 0 to N). The `keras_hub.layers.TransformerDecoder` will
then seek to predict the next words in the target sequence (N+1 and beyond).

A key detail that makes this possible is causal masking.
The `keras_hub.layers.TransformerDecoder` sees the entire sequence at once, and
thus we must make sure that it only uses information from target tokens 0 to N
when predicting token N+1 (otherwise, it could use information from the future,
which would result in a model that cannot be used at inference time). Causal masking
is enabled by default in `keras_hub.layers.TransformerDecoder`.

We also need to mask the padding tokens (`"[PAD]"`). For this, we can set the
`mask_zero` argument of the `keras_hub.layers.TokenAndPositionEmbedding` layer
to True. This will then be propagated to all subsequent layers.
"""

# Encoder
encoder_inputs = keras.Input(shape=(None,), name="encoder_inputs")

x = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=ENG_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
)(encoder_inputs)

encoder_outputs = keras_hub.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
encoder = keras.Model(encoder_inputs, encoder_outputs)


# Decoder
decoder_inputs = keras.Input(shape=(None,), name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, EMBED_DIM), name="decoder_state_inputs")

x = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=SPA_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
)(decoder_inputs)

x = keras_hub.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs)
x = keras.layers.Dropout(0.5)(x)
decoder_outputs = keras.layers.Dense(SPA_VOCAB_SIZE, activation="softmax")(x)
decoder = keras.Model(
    [
        decoder_inputs,
        encoded_seq_inputs,
    ],
    decoder_outputs,
)
decoder_outputs = decoder([decoder_inputs, encoder_outputs])

transformer = keras.Model(
    [encoder_inputs, decoder_inputs],
    decoder_outputs,
    name="transformer",
)

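
"""
As a quick, optional sanity check before training, we can call the model on a
dummy batch of token IDs and confirm that it maps inputs of shape
`(batch_size, sequence_length)` to per-token probabilities of shape
`(batch_size, sequence_length, SPA_VOCAB_SIZE)`.
"""

# A batch of one sequence filled with token ID 1 is enough to exercise the
# forward pass; we only care about the output shape here.
dummy_eng = ops.ones((1, MAX_SEQUENCE_LENGTH), dtype="int32")
dummy_spa = ops.ones((1, MAX_SEQUENCE_LENGTH), dtype="int32")
print(transformer([dummy_eng, dummy_spa]).shape)  # -> (1, 40, 15000)
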
"""
## Training our model

We'll use accuracy as a quick way to monitor training progress on the validation data.
Note that machine translation typically uses BLEU scores as well as other metrics,
rather than accuracy. However, in order to use metrics like ROUGE, BLEU, etc. we
will have to decode the probabilities and generate the text. Text generation is
computationally expensive, and performing this during training is not recommended.

Here we only train for 1 epoch, but to get the model to actually converge
you should train for at least 10 epochs.
"""

transformer.summary()
transformer.compile(
    "rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
transformer.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)

"""
## Decoding test sentences (qualitative analysis)

Finally, let's demonstrate how to translate brand new English sentences.
We simply feed into the model the tokenized English sentence
as well as the target token `"[START]"`. The model outputs probabilities of the
next token. We then repeatedly generate the next token, conditioned on the
tokens generated so far, until we hit the token `"[END]"`.

For decoding, we will use the `keras_hub.samplers` module from
KerasHub. Greedy decoding is a text decoding method which outputs the most
likely next token at each time step, i.e., the token with the highest probability.
"""


def decode_sequences(input_sentences):
    batch_size = 1

    # Tokenize the encoder input.
    encoder_input_tokens = ops.convert_to_tensor(eng_tokenizer(input_sentences))
    if len(encoder_input_tokens[0]) < MAX_SEQUENCE_LENGTH:
        pads = ops.full((1, MAX_SEQUENCE_LENGTH - len(encoder_input_tokens[0])), 0)
        encoder_input_tokens = ops.concatenate(
            [encoder_input_tokens.to_tensor(), pads], 1
        )

    # Define a function that outputs the next token's probability given the
    # input sequence.
    def next(prompt, cache, index):
        logits = transformer([encoder_input_tokens, prompt])[:, index - 1, :]
        # Ignore hidden states for now; only needed for contrastive search.
        hidden_states = None
        return logits, hidden_states, cache

    # Build a prompt of length 40 with a start token and padding tokens.
    length = 40
    start = ops.full((batch_size, 1), spa_tokenizer.token_to_id("[START]"))
    pad = ops.full((batch_size, length - 1), spa_tokenizer.token_to_id("[PAD]"))
    prompt = ops.concatenate((start, pad), axis=-1)

    generated_tokens = keras_hub.samplers.GreedySampler()(
        next,
        prompt,
        stop_token_ids=[spa_tokenizer.token_to_id("[END]")],
        index=1,  # Start sampling after start token.
    )
    generated_sentences = spa_tokenizer.detokenize(generated_tokens)
    return generated_sentences


test_eng_texts = [pair[0] for pair in test_pairs]
for i in range(2):
    input_sentence = random.choice(test_eng_texts)
    translated = decode_sequences([input_sentence])
    translated = translated.numpy()[0].decode("utf-8")
    translated = (
        translated.replace("[PAD]", "")
        .replace("[START]", "")
        .replace("[END]", "")
        .strip()
    )
    print(f"** Example {i} **")
    print(input_sentence)
    print(translated)
    print()

"""
## Evaluating our model (quantitative analysis)

There are many metrics which are used for text generation tasks. Here, to
evaluate translations generated by our model, let's compute the ROUGE-1 and
ROUGE-2 scores. Essentially, ROUGE-N is a score based on the number of common
n-grams between the reference text and the generated text. ROUGE-1 and ROUGE-2
use the number of common unigrams and bigrams, respectively.

We will calculate the score over 30 test samples (since decoding is an
expensive process).
"""

rouge_1 = keras_hub.metrics.RougeN(order=1)
rouge_2 = keras_hub.metrics.RougeN(order=2)

for test_pair in test_pairs[:30]:
    input_sentence = test_pair[0]
    reference_sentence = test_pair[1]

    translated_sentence = decode_sequences([input_sentence])
    translated_sentence = translated_sentence.numpy()[0].decode("utf-8")
    translated_sentence = (
        translated_sentence.replace("[PAD]", "")
        .replace("[START]", "")
        .replace("[END]", "")
        .strip()
    )

    rouge_1(reference_sentence, translated_sentence)
    rouge_2(reference_sentence, translated_sentence)

print("ROUGE-1 Score: ", rouge_1.result())
print("ROUGE-2 Score: ", rouge_2.result())

"""
After 10 epochs, the scores are as follows:

|               | **ROUGE-1** | **ROUGE-2** |
|:-------------:|:-----------:|:-----------:|
| **Precision** |    0.568    |    0.374    |
|   **Recall**  |    0.615    |    0.394    |
|  **F1 Score** |    0.579    |    0.381    |
"""
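
"""
## Going further

The same decoding loop works with any sampler from `keras_hub.samplers`. For
example, instead of greedy decoding you could try top-p ("nucleus") sampling,
which samples the next token from the smallest set of tokens whose cumulative
probability exceeds `p`. The optional sketch below only constructs such a
sampler; to use it, swap it in for `keras_hub.samplers.GreedySampler()` inside
`decode_sequences` (the call signature is the same). Note that `p=0.9` is an
arbitrary choice here, and you should double-check the `TopPSampler` arguments
against the KerasHub documentation.
"""

# Construct a top-p sampler. Inside `decode_sequences`, it would be called just
# like the greedy sampler:
#   generated_tokens = top_p_sampler(
#       next,
#       prompt,
#       stop_token_ids=[spa_tokenizer.token_to_id("[END]")],
#       index=1,
#   )
top_p_sampler = keras_hub.samplers.TopPSampler(p=0.9)
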