GPT2 Text Generation with KerasHub
Author: Chen Qian
Date created: 2023/04/17
Last modified: 2024/04/12
Description: Use KerasHub GPT2 model and samplers to do text generation.
In this tutorial, you will learn to use KerasHub to load a pre-trained Large Language Model (LLM) - the GPT-2 model (originally released by OpenAI), finetune it to a specific text style, and generate text based on users' input (also known as a prompt). You will also learn how GPT-2 adapts quickly to non-English languages, such as Chinese.
Before we begin
Colab offers different kinds of runtimes. Make sure to go to Runtime -> Change runtime type and choose the GPU Hardware Accelerator runtime (which should have >12G host RAM and ~15G GPU RAM) since you will finetune the GPT-2 model. Running this tutorial on CPU runtime will take hours.
Install KerasHub, Choose Backend and Import Dependencies
This example uses Keras 3 to work in any of "tensorflow", "jax" or "torch". Support for Keras 3 is baked into KerasHub; simply change the "KERAS_BACKEND" environment variable to select the backend of your choice. We select the JAX backend below.
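A minimal setup sketch is shown below, assuming a Colab-like notebook environment; the exact install command and package versions may differ in your setup.

```python
# Install KerasHub (package names and version pins are assumptions; adjust as needed).
!pip install -q -U keras-hub
!pip install -q -U keras

import os

# Select the backend before importing Keras / KerasHub.
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow", "torch"

import keras_hub
import keras
import tensorflow as tf  # used only for tf.data input pipelines

# Optional: mixed precision speeds up training and generation on recent GPUs.
keras.mixed_precision.set_global_policy("mixed_float16")
```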
Introduction to Generative Large Language Models (LLMs)
Large language models (LLMs) are a type of machine learning model trained on a large corpus of text data to generate outputs for various natural language processing (NLP) tasks, such as text generation, question answering, and machine translation.
Generative LLMs are typically based on deep learning neural networks, such as the Transformer architecture invented by Google researchers in 2017, and are trained on massive amounts of text data, often involving billions of words. These models, such as Google LaMDA and PaLM, are trained with a large dataset from various data sources, which allows them to generate output for many tasks. The core of generative LLMs is predicting the next word in a sentence, often referred to as Causal LM Pretraining. In this way LLMs can generate coherent text based on user prompts. For a more pedagogical discussion on language models, you can refer to the Stanford CS324 LLM class.
Introduction to KerasHub
Large Language Models are complex to build and expensive to train from scratch. Luckily there are pretrained LLMs available for use right away. KerasHub provides a large number of pre-trained checkpoints that allow you to experiment with SOTA models without needing to train them yourself.
KerasHub is a natural language processing library that supports users through their entire development cycle. KerasHub offers both pretrained models and modularized building blocks, so developers can easily reuse pretrained models or stack up their own LLMs.
In a nutshell, for generative LLMs, KerasHub offers:
Pretrained models with a generate() method, e.g., keras_hub.models.GPT2CausalLM and keras_hub.models.OPTCausalLM.
Sampler classes that implement generation algorithms such as Top-K, Beam and contrastive search. These samplers can be used to generate text with custom models.
Load a pre-trained GPT-2 model and generate some text
KerasHub provides a number of pre-trained models, such as Google BERT and GPT-2. You can see the list of models available in the KerasHub repository.
It's very easy to load the GPT-2 model as you can see below:
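A sketch of loading the model, assuming the "gpt2_base_en" preset (the 124M-parameter English checkpoint) and a sequence length of 128:

```python
# Load the preprocessor and the causal LM from a pretrained preset.
preprocessor = keras_hub.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)
gpt2_lm = keras_hub.models.GPT2CausalLM.from_preset(
    "gpt2_base_en",
    preprocessor=preprocessor,
)
```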
Once the model is loaded, you can use it to generate some text right away. Run the cells below to give it a try. It's as simple as calling a single function generate():
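For example (the prompt text is just an illustration):

```python
import time

start = time.time()
output = gpt2_lm.generate("My trip to Yosemite was", max_length=200)
print("\nGPT-2 output:")
print(output)
end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")
```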
Try another one:
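The same call with a different prompt reuses the compiled graph:

```python
start = time.time()
output = gpt2_lm.generate("That Italian restaurant is", max_length=200)
print("\nGPT-2 output:")
print(output)
end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")
```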
Notice how much faster the second call is. This is because the computational graph is XLA compiled on the first run and reused behind the scenes on subsequent runs.
The quality of the generated text looks OK, but we can improve it via fine-tuning.
More on the GPT-2 model from KerasHub
Next up, we will actually fine-tune the model to update its parameters, but before we do, let's take a look at the full set of tools we have for working with GPT2.
The code of GPT2 can be found here. Conceptually the GPT2CausalLM can be hierarchically broken down into several modules in KerasHub, all of which have a from_preset() function that loads a pretrained model:
keras_hub.models.GPT2Tokenizer: the tokenizer used by the GPT2 model, which is a byte-pair encoder.
keras_hub.models.GPT2CausalLMPreprocessor: the preprocessor used for GPT2 causal LM training. It does the tokenization along with other preprocessing work, such as creating the labels and appending the end token.
keras_hub.models.GPT2Backbone: the GPT2 model, which is a stack of keras_hub.layers.TransformerDecoder layers. This is usually just referred to as GPT2.
keras_hub.models.GPT2CausalLM: wraps GPT2Backbone and multiplies its output by the embedding matrix to generate logits over the vocabulary tokens.
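A sketch of loading the individual components, again assuming the "gpt2_base_en" preset:

```python
# Each module can also be loaded on its own via from_preset().
tokenizer = keras_hub.models.GPT2Tokenizer.from_preset("gpt2_base_en")
print(tokenizer("The quick brown fox"))  # byte-pair encoded token ids

preprocessor = keras_hub.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)

backbone = keras_hub.models.GPT2Backbone.from_preset("gpt2_base_en")
causal_lm = keras_hub.models.GPT2CausalLM.from_preset("gpt2_base_en")
```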
Finetune on Reddit dataset
Now that you know how the GPT-2 model works in KerasHub, you can take one step further to finetune the model so that it generates text in a specific style: short or long, strict or casual. In this tutorial, we will use a Reddit dataset as an example.
Let's take a look inside sample data from the reddit TensorFlow Dataset. There are two features:
document: text of the post.
title: the title.
In our case, we are performing next word prediction in a language model, so we only need the 'document' feature.
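A sketch of loading the data, assuming the TFDS "reddit_tifu" dataset; as_supervised=True yields (document, title) pairs:

```python
import tensorflow_datasets as tfds

reddit_ds = tfds.load("reddit_tifu", split="train", as_supervised=True)

# Peek at one example.
for document, title in reddit_ds:
    print(document.numpy())
    print(title.numpy())
    break

# Keep only the document text and batch it for training.
train_ds = (
    reddit_ds.map(lambda document, _: document)
    .batch(32)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)
```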
Now you can finetune the model using the familiar fit() function. Note that the preprocessor will be automatically called inside the fit() method since GPT2CausalLM is a keras_hub.models.Task instance.
Training the model all the way to a fully converged state would take quite a bit of GPU memory and a long time. Here we just use part of the dataset for demo purposes.
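A fine-tuning sketch; the subset size, learning-rate schedule and optimizer settings below are illustrative choices, not the only reasonable ones:

```python
# Only take part of the dataset and train for a single epoch for the demo.
train_ds = train_ds.take(500)
num_epochs = 1

# Linearly decay the learning rate to 0 over the course of training.
learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)
gpt2_lm.fit(train_ds, epochs=num_epochs)
```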
After fine-tuning is finished, you can again generate text using the same generate() function. This time, the text will be closer to Reddit writing style, and the generated length will be close to our preset length in the training set.
Into the Sampling Method
In KerasHub, we offer a few sampling methods, e.g., contrastive search, Top-K and beam sampling. By default, our GPT2CausalLM uses Top-K search, but you can choose your own sampling method.
Much like optimizers and activations, there are two ways to specify your custom sampler, both shown in the sketch below:
Use a string identifier, such as "greedy"; this way you use the default configuration.
Pass a keras_hub.samplers.Sampler instance; this way you can use a custom configuration.
For more details on the KerasHub Sampler class, you can check the code here.
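A sketch of both options (the prompt and the Top-K settings are illustrative):

```python
# Option 1: a string identifier selects a sampler with its default configuration.
gpt2_lm.compile(sampler="greedy")
print(gpt2_lm.generate("I like basketball", max_length=200))

# Option 2: a Sampler instance lets you customize the configuration,
# e.g. Top-K sampling with k=5 and a fixed seed.
top_k_sampler = keras_hub.samplers.TopKSampler(k=5, seed=2)
gpt2_lm.compile(sampler=top_k_sampler)
print(gpt2_lm.generate("I like basketball", max_length=200))
```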
Finetune on Chinese Poem Dataset
We can also finetune GPT2 on non-English datasets. For readers who know Chinese, this part illustrates how to fine-tune GPT2 on a Chinese poem dataset to teach our model to become a poet!
Because GPT2 uses a byte-pair encoder, and the original pretraining dataset contains some Chinese characters, we can use the original vocabulary to finetune on a Chinese dataset.
Load text from the JSON files. We only use 《全唐诗》 (Complete Tang Poems) for demo purposes.
Let's take a look at sample data.
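A sketch of loading the poems, assuming the open-source chinese-poetry GitHub corpus, where each JSON record has a "paragraphs" field with the lines of one poem:

```python
# Clone the corpus (the repository layout is an assumption of this sketch).
!git clone https://github.com/chinese-poetry/chinese-poetry.git

import json
import os

poem_collection = []
for file in os.listdir("chinese-poetry/全唐诗"):
    # Only the poem files (not the author metadata) are needed.
    if ".json" not in file or "poet" not in file:
        continue
    full_filename = os.path.join("chinese-poetry/全唐诗", file)
    with open(full_filename, "r") as f:
        poem_collection.extend(json.load(f))

# Join the lines of each poem into a single training string.
paragraphs = ["".join(data["paragraphs"]) for data in poem_collection]
print(paragraphs[0])  # peek at one poem
```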
Similar to the Reddit example, we convert the text to a TF dataset, and only use partial data to train.
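A sketch of the training pipeline; the hyperparameters mirror the Reddit example and are again illustrative:

```python
train_ds = (
    tf.data.Dataset.from_tensor_slices(paragraphs)
    .batch(16)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

# Only take a subset and train for one epoch for demo purposes.
train_ds = train_ds.take(500)
num_epochs = 1

learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-4,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)
gpt2_lm.fit(train_ds, epochs=num_epochs)
```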
Let's check the result!
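For example, prompting with the opening of a classical line (the prompt is illustrative):

```python
output = gpt2_lm.generate("昨夜雨疏风骤", max_length=200)
print(output)
```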
Not bad 😀