DreamBooth
Author: Sayak Paul, Chansung Park
Date created: 2023/02/01
Last modified: 2023/02/05
Description: Implementing DreamBooth.
Introduction
In this example, we implement DreamBooth, a fine-tuning technique to teach new visual concepts to text-conditioned Diffusion models with just 3 - 5 images. DreamBooth was proposed in DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation by Ruiz et al.
DreamBooth, in a sense, is similar to the traditional way of fine-tuning a text-conditioned Diffusion model except for a few gotchas. This example assumes that you have basic familiarity with Diffusion models and how to fine-tune them. Here are some reference examples that might help you get up to speed quickly:
First, let's install the latest versions of KerasCV and TensorFlow.
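A minimal install cell could look like the following (the packages are run unpinned here; you may want to pin versions that are known to work together):

```python
# Install KerasCV and TensorFlow from a notebook cell.
!pip install -q -U keras_cv
!pip install -q -U tensorflow
```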
If you're running the code, please ensure you're using a GPU with at least 24 GB of VRAM.
Initial imports
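One plausible set of imports for this example (later snippets assume these names are in scope):

```python
import math

import keras_cv
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow import keras
```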
Usage of DreamBooth
... is very versatile. By teaching Stable Diffusion about your favorite visual concepts, you can
Recontextualize objects in interesting ways:
Generate artistic renderings of the underlying visual concept:
And many other applications. We encourage you to check out the original DreamBooth paper for more.
Download the instance and class images
DreamBooth uses a technique called "prior preservation" to meaningfully guide the training procedure such that the fine-tuned models can still preserve some of the prior semantics of the visual concept you're introducing. To learn more about the idea of "prior preservation", refer to this document.
Here, we need to introduce a few key terms specific to DreamBooth:
Unique class: Examples include "dog", "person", etc. In this example, we use "dog".
Unique identifier: A unique identifier that is prepended to the unique class while forming the "instance prompts". In this example, we use "sks" as this unique identifier.
Instance prompt: Denotes a prompt that best describes the "instance images". An example prompt could be f"a photo of {unique_id} {unique_class}". So, for our example, this becomes "a photo of sks dog".
Class prompt: Denotes a prompt without the unique identifier. This prompt is used for generating "class images" for prior preservation. For our example, this prompt is "a photo of dog".
Instance images: Denote the images that represent the visual concept you're trying to teach, i.e., the subject of the "instance prompt". Typically, just 3 - 5 such images are needed, and we usually gather them ourselves.
Class images: Denote the images generated using the "class prompt" for using prior preservation in DreamBooth training. We leverage the pre-trained model before fine-tuning it to generate these class images. Typically, 200 - 300 class images are enough.
In code, this generation process looks quite simple:
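A sketch of that loop, assuming KerasCV's StableDiffusion model (the batch size and image count are illustrative):

```python
# Load the pre-trained Stable Diffusion model we'll later fine-tune.
model = keras_cv.models.StableDiffusion(img_width=512, img_height=512, jit_compile=True)

class_prompt = "a photo of dog"

# Generate class images in small batches until we have enough of them.
num_imgs_to_generate = 200
batch_size = 4
class_images = []
while len(class_images) < num_imgs_to_generate:
    images = model.text_to_image(class_prompt, batch_size=batch_size)
    class_images.extend(images)
```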
To keep the runtime of this example short, the authors of this example have gone ahead and generated some class images using this notebook.
Note that prior preservation is an optional technique used in DreamBooth, but it almost always helps in improving the quality of the generated images.
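Since the class images have already been generated, we only need to download them along with the instance images. A sketch using keras.utils.get_file; the URLs below are placeholders, not the archives actually used by the authors:

```python
# NOTE: placeholder URLs -- substitute the actual archives of instance
# and class images you want to train on.
instance_images_root = tf.keras.utils.get_file(
    origin="https://example.com/instance-images.tar.gz",  # hypothetical URL
    untar=True,
)
class_images_root = tf.keras.utils.get_file(
    origin="https://example.com/class-images.tar.gz",  # hypothetical URL
    untar=True,
)
```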
Visualize images
First, let's load the image paths.
Then we load the images from the paths.
And then we use a utility function to plot the loaded images.
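A sketch of these three steps, assuming the images live directly under the downloaded directories (the helper names load_images() and plot_images() are our own and are reused in later snippets):

```python
import glob
import os

# 1) Load the image paths (assuming JPEG files; adjust the pattern
# to match your data).
instance_image_paths = sorted(glob.glob(os.path.join(instance_images_root, "*.jpg")))
class_image_paths = sorted(glob.glob(os.path.join(class_images_root, "*.jpg")))

# 2) Load the images from the paths.
def load_images(image_paths):
    return [np.array(keras.utils.load_img(path)) for path in image_paths]

# 3) Plot the loaded images in a single row.
def plot_images(images, title=None):
    plt.figure(figsize=(20, 20))
    for i, image in enumerate(images):
        plt.subplot(1, len(images), i + 1)
        if title is not None:
            plt.title(title)
        plt.imshow(image)
        plt.axis("off")
    plt.show()

plot_images(load_images(instance_image_paths[:5]))
```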
Instance images:
Class images:
Prepare datasets
Dataset preparation includes two stages: (1) preparing the captions and (2) processing the images.
Prepare the captions
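Because prior preservation pairs every class image with an instance image during training, one way to prepare the captions is to repeat the instance image paths until their count matches the number of class images, and then attach one caption per image (a sketch, reusing names from the earlier snippets):

```python
# Repeat the instance image paths so that their number matches
# the number of class images.
new_instance_image_paths = []
for index in range(len(class_image_paths)):
    instance_image = instance_image_paths[index % len(instance_image_paths)]
    new_instance_image_paths.append(instance_image)

unique_id = "sks"
class_label = "dog"

# One caption per image.
instance_prompt = f"a photo of {unique_id} {class_label}"
instance_prompts = [instance_prompt] * len(new_instance_image_paths)

class_prompt = f"a photo of {class_label}"
class_prompts = [class_prompt] * len(class_image_paths)
```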
Next, we embed the prompts to save some compute.
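Since the captions never change during training, we can tokenize them once and run them through the text encoder up front, then discard the encoder. A sketch assuming KerasCV exports SimpleTokenizer and TextEncoder under keras_cv.models.stable_diffusion (the padding token and maximum prompt length are specific to Stable Diffusion's text encoder):

```python
import itertools

# These values are specific to the Stable Diffusion text encoder.
padding_token = 49407
max_prompt_length = 77

tokenizer = keras_cv.models.stable_diffusion.SimpleTokenizer()

def process_text(caption):
    # Tokenize and pad every caption to the maximum prompt length.
    tokens = tokenizer.encode(caption)
    tokens = tokens + [padding_token] * (max_prompt_length - len(tokens))
    return np.array(tokens)

tokenized_texts = np.empty(
    (len(instance_prompts) + len(class_prompts), max_prompt_length), dtype=np.int32
)
for i, caption in enumerate(itertools.chain(instance_prompts, class_prompts)):
    tokenized_texts[i] = process_text(caption)

# Pre-compute the text embeddings so the text encoder can be discarded
# before training starts.
POS_IDS = tf.convert_to_tensor([list(range(max_prompt_length))], dtype=tf.int32)
text_encoder = keras_cv.models.stable_diffusion.TextEncoder(max_prompt_length)
embedded_text = text_encoder(
    [tf.convert_to_tensor(tokenized_texts), POS_IDS], training=False
).numpy()
del text_encoder
```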
Prepare the images
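Image processing here is standard: decode, resize, lightly augment, and rescale to the [-1, 1] range the Stable Diffusion image encoder expects. A sketch using tf.data-friendly functions and Keras preprocessing layers:

```python
resolution = 512
auto = tf.data.AUTOTUNE

# Light augmentation plus rescaling to [-1, 1].
augmenter = keras.Sequential(
    layers=[
        keras.layers.CenterCrop(resolution, resolution),
        keras.layers.RandomFlip("horizontal"),
        keras.layers.Rescaling(scale=1.0 / 127.5, offset=-1),
    ]
)

def process_image(image_path, embedded_text):
    image = tf.io.read_file(image_path)
    image = tf.io.decode_image(image, channels=3, expand_animations=False)
    image = tf.image.resize(image, (resolution, resolution))
    return image, embedded_text

def apply_augmentation(image_batch, embedded_texts):
    return augmenter(image_batch), embedded_texts
```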
Assemble dataset
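One way to assemble everything into a tf.data pipeline is to build an instance dataset and a class dataset that emit dictionaries with distinguishable keys, then zip them so every batch carries both streams (a sketch, reusing process_image() and apply_augmentation() from above):

```python
def assemble_dataset(image_paths, embedded_texts, instance_only=True, batch_size=1):
    prefix = "instance" if instance_only else "class"

    def to_dict(image_batch, embedded_text_batch):
        return {
            f"{prefix}_images": image_batch,
            f"{prefix}_embedded_texts": embedded_text_batch,
        }

    dataset = tf.data.Dataset.from_tensor_slices((image_paths, embedded_texts))
    dataset = dataset.map(process_image, num_parallel_calls=auto)
    dataset = dataset.shuffle(5, reshuffle_each_iteration=True)  # tiny dataset, small buffer
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(apply_augmentation, num_parallel_calls=auto)
    return dataset.map(to_dict, num_parallel_calls=auto)

instance_dataset = assemble_dataset(
    new_instance_image_paths, embedded_text[: len(new_instance_image_paths)]
)
class_dataset = assemble_dataset(
    class_image_paths, embedded_text[len(new_instance_image_paths) :], instance_only=False
)
# Zip the two streams so every batch carries both instance and class data.
train_dataset = tf.data.Dataset.zip((instance_dataset, class_dataset))
```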
Check shapes
Now that the dataset has been prepared, let's quickly check what's inside it.
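For example:

```python
sample_batch = next(iter(train_dataset))
print(sample_batch[0].keys(), sample_batch[1].keys())

for k in sample_batch[0]:
    print(k, sample_batch[0][k].shape)
for k in sample_batch[1]:
    print(k, sample_batch[1][k].shape)
```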
During training, we make use of these keys to gather the images and text embeddings and concatenate them accordingly.
DreamBooth training loop
Our DreamBooth training loop is very much inspired by this script provided by the Diffusers team at Hugging Face. However, there is an important difference to note. We only fine-tune the UNet (the model responsible for predicting noise) and don't fine-tune the text encoder in this example. If you're looking for an implementation that also performs the additional fine-tuning of the text encoder, refer to this repository.
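To give a flavor of the core idea, here is a heavily condensed sketch of such a trainer. It is not the full implementation (mixed precision, gradient clipping, and several details are omitted), and specifics like the latent scaling factor follow Stable Diffusion conventions. The key DreamBooth-specific piece is splitting the noise predictions into instance and class halves and weighting the class ("prior") half separately:

```python
class DreamBoothTrainer(tf.keras.Model):
    # Condensed sketch: `diffusion_model` is the UNet, `vae` the image
    # encoder (sans its top sampling layer), `noise_scheduler` a
    # DDPM-style scheduler.
    def __init__(self, diffusion_model, vae, noise_scheduler, prior_loss_weight=1.0, **kwargs):
        super().__init__(**kwargs)
        self.diffusion_model = diffusion_model
        self.vae = vae
        self.noise_scheduler = noise_scheduler
        self.prior_loss_weight = prior_loss_weight
        self.vae.trainable = False  # only the UNet gets fine-tuned

    def train_step(self, inputs):
        instance_batch, class_batch = inputs

        # Fuse the (equally sized) instance and class streams into one batch:
        # instance samples occupy the first half, class samples the second.
        images = tf.concat([instance_batch["instance_images"], class_batch["class_images"]], 0)
        texts = tf.concat(
            [instance_batch["instance_embedded_texts"], class_batch["class_embedded_texts"]], 0
        )
        batch_size = tf.shape(images)[0]

        with tf.GradientTape() as tape:
            # Project the images into latent space and add noise at a
            # randomly sampled timestep.
            latents = self.sample_from_encoder_outputs(self.vae(images, training=False))
            latents = latents * 0.18215  # scaling used by Stable Diffusion
            noise = tf.random.normal(tf.shape(latents))
            timesteps = tf.random.uniform(
                (batch_size,), 0, self.noise_scheduler.train_timesteps, dtype=tf.int64
            )
            noisy_latents = self.noise_scheduler.add_noise(
                tf.cast(latents, noise.dtype), noise, timesteps
            )
            timestep_embeddings = tf.map_fn(
                lambda t: self.get_timestep_embedding(t),
                timesteps,
                fn_output_signature=tf.float32,
            )

            # Ask the UNet to predict the added noise.
            noise_pred = self.diffusion_model(
                [noisy_latents, timestep_embeddings, texts], training=True
            )

            # Split predictions and targets back into instance/class halves;
            # the class half provides the prior-preservation signal.
            noise_pred_instance, noise_pred_class = tf.split(noise_pred, 2, axis=0)
            noise_instance, noise_class = tf.split(noise, 2, axis=0)
            loss = self.compiled_loss(noise_instance, noise_pred_instance)
            loss += self.prior_loss_weight * self.compiled_loss(noise_class, noise_pred_class)

        # Only the UNet's parameters are updated.
        gradients = tape.gradient(loss, self.diffusion_model.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.diffusion_model.trainable_variables))
        return {"loss": loss}

    def get_timestep_embedding(self, timestep, dim=320, max_period=10000):
        # Standard sinusoidal timestep embedding consumed by the UNet.
        half = dim // 2
        freqs = tf.math.exp(
            -math.log(max_period) * tf.range(0, half, dtype=tf.float32) / half
        )
        args = tf.cast(timestep, tf.float32)[None] * freqs
        return tf.concat([tf.cos(args), tf.sin(args)], 0)

    def sample_from_encoder_outputs(self, outputs):
        # The VAE (without its top layer) outputs mean and log-variance;
        # sample a latent from that distribution.
        mean, logvar = tf.split(outputs, 2, axis=-1)
        logvar = tf.clip_by_value(logvar, -30.0, 20.0)
        std = tf.exp(0.5 * logvar)
        return mean + std * tf.random.normal(tf.shape(mean))
```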
Trainer initialization
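One plausible wiring, reusing KerasCV's Stable Diffusion components (a sketch; the hyperparameters follow commonly used DreamBooth settings):

```python
from keras_cv.models.stable_diffusion import DiffusionModel, ImageEncoder, NoiseScheduler

image_encoder = ImageEncoder()
dreambooth_trainer = DreamBoothTrainer(
    diffusion_model=DiffusionModel(resolution, resolution, max_prompt_length),
    # Drop the encoder's top (sampling) layer so it outputs the latent
    # distribution parameters rather than a sample.
    vae=tf.keras.Model(image_encoder.input, image_encoder.layers[-2].output),
    noise_scheduler=NoiseScheduler(),
)

# DreamBooth is quite sensitive to the learning rate; very small values work best.
# On newer TF/Keras versions, AdamW lives at tf.keras.optimizers.AdamW.
optimizer = tf.keras.optimizers.experimental.AdamW(
    learning_rate=5e-6, weight_decay=1e-2, beta_1=0.9, beta_2=0.999, epsilon=1e-08
)
dreambooth_trainer.compile(optimizer=optimizer, loss="mse")
```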
Train!
We first calculate the number of epochs we need to train for.
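For example, with a fixed budget of update steps (the figure of 800 is illustrative):

```python
# Convert a fixed number of update steps into whole epochs.
num_update_steps_per_epoch = int(train_dataset.cardinality())
max_train_steps = 800
epochs = math.ceil(max_train_steps / num_update_steps_per_epoch)
print(f"Training for {epochs} epochs.")
```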
And then we start training!
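A sketch of the training call, checkpointing the weights whenever the loss improves (the checkpoint filename is our own choice):

```python
ckpt_path = "dreambooth-unet.h5"
ckpt_callback = tf.keras.callbacks.ModelCheckpoint(
    ckpt_path, save_weights_only=True, monitor="loss", mode="min"
)
dreambooth_trainer.fit(train_dataset, epochs=epochs, callbacks=[ckpt_callback])
```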
Experiments and inference
We ran various experiments with a slightly modified version of this example. Our experiments are based on this repository and are inspired by this blog post from Hugging Face.
First, let's see how we can use the fine-tuned checkpoint for running inference.
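One way to do this is to load the fine-tuned UNet weights into a fresh Stable Diffusion model (a sketch, reusing names from the earlier snippets):

```python
dreambooth_model = keras_cv.models.StableDiffusion(
    img_width=resolution, img_height=resolution, jit_compile=True
)
dreambooth_model.diffusion_model.load_weights(ckpt_path)

# Don't forget the unique identifier and the class label in the prompt!
prompt = f"A photo of {unique_id} {class_label} in a bucket"
images_dreamboothed = dreambooth_model.text_to_image(prompt, batch_size=3)
plot_images(images_dreamboothed, prompt)
```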
Now, let's load checkpoints from a different experiment we conducted where we also fine-tuned the text encoder along with the UNet:
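Loading both sets of weights could look like this (the file paths below are hypothetical placeholders for the actual checkpoints from that experiment):

```python
# Placeholder paths -- substitute the actual checkpoint files from the
# experiment that fine-tuned both the text encoder and the UNet.
unet_weights_path = "dreambooth-unet-with-text-encoder.h5"  # hypothetical
text_encoder_weights_path = "dreambooth-text-encoder.h5"  # hypothetical

dreambooth_model.diffusion_model.load_weights(unet_weights_path)
dreambooth_model.text_encoder.load_weights(text_encoder_weights_path)
```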
The default number of steps for generating an image in text_to_image() is 50. Let's increase it to 100.
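```python
# More denoising steps typically yield cleaner samples at the cost of latency.
images_dreamboothed = dreambooth_model.text_to_image(
    prompt, batch_size=3, num_steps=100
)
plot_images(images_dreamboothed, prompt)
```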
Feel free to experiment with different prompts (don't forget to add the unique identifier and the class label!) to see how the results change. We welcome you to check out our codebase and more experimental results here. You can also read this blog post to get more ideas.
Acknowledgements
Thanks to the DreamBooth example script provided by Hugging Face, which helped us a lot in getting the initial implementation ready quickly.
Getting DreamBooth to work on human faces can be challenging. We have compiled some general recommendations here. Thanks to Abhishek Thakur for helping with these.