Data-efficient GANs with Adaptive Discriminator Augmentation
Author: András Béres
Date created: 2021/10/28
Last modified: 2025/01/23
Description: Generating images from limited data using the Caltech Birds dataset.
Introduction
GANs
Generative Adversarial Networks (GANs) are a popular class of generative deep learning models, commonly used for image generation. They consist of a pair of dueling neural networks, called the discriminator and the generator. The discriminator's task is to distinguish real images from generated (fake) ones, while the generator network tries to fool the discriminator by generating more and more realistic images. However, if the discriminator is too easy or too hard to fool, it might fail to provide a useful learning signal for the generator, which is why training GANs is usually considered a difficult task.
Data augmentation for GANs
Data augmentation, a popular technique in deep learning, is the process of randomly applying semantics-preserving transformations to the input data to generate multiple realistic versions of it, thereby effectively multiplying the amount of training data available. The simplest example is left-right flipping an image, which preserves its contents while generating a second unique training sample. Data augmentation is commonly used in supervised learning to prevent overfitting and enhance generalization.
The authors of StyleGAN2-ADA show that discriminator overfitting can be an issue in GANs, especially when only a small amount of training data is available. They propose Adaptive Discriminator Augmentation to mitigate this issue.
Applying data augmentation to GANs, however, is not straightforward. Since the generator is updated using the discriminator's gradients, if the generated images are augmented, the augmentation pipeline has to be differentiable, and it also has to be GPU-compatible for computational efficiency. Luckily, the Keras image augmentation layers fulfill both of these requirements, and are therefore very well suited for this task.
Invertible data augmentation
A possible difficulty when using data augmentation in generative models is the issue of "leaky augmentations" (section 2.2), namely when the model generates images that are already augmented. This would mean that it was not able to separate the augmentation from the underlying data distribution, which can be caused by using non-invertible data transformations. For example, if either 0, 90, 180 or 270 degree rotations are performed with equal probability, the original orientation of the images is impossible to infer, and this information is destroyed.
A simple trick to make data augmentations invertible is to only apply them with some probability. That way the original version of the images will be more common, and the data distribution can be inferred. By properly choosing this probability, one can effectively regularize the discriminator without making the augmentations leaky.
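As a rough illustration of applying an augmentation only with some probability, here is a minimal sketch using the Python standard library. The function name `maybe_flip`, the list-of-rows image format, and the probability of 0.2 are assumptions made for this illustration; they are not taken from the example's code, which uses Keras layers.

```python
import random


def maybe_flip(image, probability, rng):
    """Flip an image left-right with the given probability.

    `image` is a list of rows, each row a list of pixel values.
    """
    if rng.random() < probability:
        return [row[::-1] for row in image]
    return image


rng = random.Random(0)
image = [[1, 2, 3], [4, 5, 6]]

# With a low probability, most samples keep their original orientation,
# so the unaugmented distribution stays the dominant one.
batch = [maybe_flip(image, probability=0.2, rng=rng) for _ in range(1000)]
unchanged_fraction = sum(sample == image for sample in batch) / len(batch)
```

With a probability of 0.5 the flipped and unflipped versions would be equally common, and the original orientation could no longer be inferred from the augmented data.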
Setup
Hyperparameters
Data pipeline
In this example, we will use the Caltech Birds (2011) dataset for generating images of birds, which is a diverse natural dataset containing fewer than 6000 images for training. When working with such a small amount of data, one has to take extra care to keep the data quality as high as possible. In this example, we use the provided bounding boxes of the birds to cut them out with square crops while preserving their aspect ratios when possible.
After preprocessing, the training images look like the following:
Kernel inception distance
Kernel Inception Distance (KID) was proposed as a replacement for the popular Fréchet Inception Distance (FID) metric for measuring image generation quality. Both metrics measure the difference between the generated and training distributions in the representation space of an InceptionV3 network pretrained on ImageNet.
According to the paper, KID was proposed because FID has no unbiased estimator: its expected value is higher when it is measured on fewer images. KID is more suitable for small datasets because its expected value does not depend on the number of samples it is measured on. In my experience it is also computationally lighter, numerically more stable, and simpler to implement because it can be estimated in a per-batch manner.
In this example, the images are evaluated at the minimal possible resolution of the Inception network (75x75 instead of 299x299), and the metric is only measured on the validation set for computational efficiency.
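To make the metric more concrete, here is a minimal sketch of the unbiased KID estimate (a squared maximum mean discrepancy with a cubic polynomial kernel) on plain Python lists. The function names and the tiny one-dimensional feature vectors are assumptions for illustration only; in practice the inputs would be InceptionV3 features and the computation would be vectorized.

```python
def polynomial_kernel(features_1, features_2):
    """Cubic polynomial kernel k(x, y) = (x . y / d + 1)^3 used by KID."""
    dimension = len(features_1)
    dot_product = sum(a * b for a, b in zip(features_1, features_2))
    return (dot_product / dimension + 1.0) ** 3


def mean_kernel(set_1, set_2, skip_diagonal):
    """Average kernel value over all pairs, optionally skipping i == j."""
    values = [
        polynomial_kernel(x, y)
        for i, x in enumerate(set_1)
        for j, y in enumerate(set_2)
        if not (skip_diagonal and i == j)
    ]
    return sum(values) / len(values)


def kid_estimate(real_features, generated_features):
    """Unbiased estimate of the squared MMD between two feature sets.

    Each set needs at least two samples, because the within-set terms
    leave out the pairs of a sample with itself.
    """
    return (
        mean_kernel(real_features, real_features, skip_diagonal=True)
        + mean_kernel(generated_features, generated_features, skip_diagonal=True)
        - 2.0 * mean_kernel(real_features, generated_features, skip_diagonal=False)
    )


# Toy features: within-set kernels are 1 and 8, the cross-set kernel is 1.
real_features = [[0.0], [0.0]]
generated_features = [[1.0], [1.0]]
kid = kid_estimate(real_features, generated_features)  # 1 + 8 - 2 * 1 = 7.0
```

Because every term is a plain average over sample pairs, the estimate can be computed batch by batch and the per-batch values averaged.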
Adaptive discriminator augmentation
The authors of StyleGAN2-ADA propose to change the augmentation probability adaptively during training. Though it is explained differently in the paper, they use integral control on the augmentation probability to keep the discriminator's accuracy on real images close to a target value. Note that their controlled variable is actually the average sign of the discriminator logits (r_t in the paper), which corresponds to 2 * accuracy - 1.
This method requires two hyperparameters:
target_accuracy: the target value for the discriminator's accuracy on real images. I recommend selecting its value from the 80-90% range.
integration_steps: the number of update steps required for an accuracy error of 100% to transform into an augmentation probability increase of 100%. To give an intuition, this defines how slowly the augmentation probability is changed. I recommend setting this to a relatively high value (1000 in this case) so that the augmentation strength is only adjusted slowly.
The main motivation for this procedure is that the optimal value of the target accuracy is similar across different dataset sizes (see figures 4 and 5 in the paper), so it does not have to be re-tuned, because the process automatically applies stronger data augmentation when it is needed.
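The control rule described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the example's actual layer: the function names, the default `target_accuracy` of 0.85 (a value from the recommended range), and the made-up logits are all hypothetical.

```python
def real_accuracy(real_logits):
    """Fraction of real images classified as real (logit > 0)."""
    return sum(logit > 0.0 for logit in real_logits) / len(real_logits)


def mean_logit_sign(real_logits):
    """Average sign of the logits, the quantity called r_t in the paper."""
    signs = [1.0 if logit > 0.0 else -1.0 for logit in real_logits]
    return sum(signs) / len(signs)


def update_probability(
    probability, real_logits, target_accuracy=0.85, integration_steps=1000
):
    """One integral-control step on the augmentation probability."""
    accuracy_error = real_accuracy(real_logits) - target_accuracy
    # An accuracy above the target raises the probability, one below lowers it.
    probability += accuracy_error / integration_steps
    return min(max(probability, 0.0), 1.0)  # keep it a valid probability


real_logits = [2.1, 0.3, -0.7, 1.5]  # made-up discriminator outputs
accuracy = real_accuracy(real_logits)  # 3 of 4 logits are positive: 0.75
probability = update_probability(0.5, real_logits)  # slightly below 0.5
```

With `integration_steps=1000`, a constant accuracy error of 100% would need 1000 updates to move the probability from 0 to 1, which is what keeps the augmentation strength changing slowly.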
Network architecture
Here we specify the architecture of the two networks:
generator: maps a random vector to an image, which should be as realistic as possible
discriminator: maps an image to a scalar score, which should be high for real and low for generated images
GANs tend to be sensitive to the network architecture. I implemented a DCGAN architecture in this example, because it is relatively stable during training while being simple to implement. As further simplifications, we use a constant number of filters throughout the network, use a sigmoid instead of tanh in the last layer of the generator, and use default initialization instead of random normal.
As a good practice, we disable the learnable scale parameter in the batch normalization layers, for two reasons. First, the following relu + convolutional layers make it redundant (as noted in the documentation). Second, theory suggests it should be disabled when using spectral normalization (section 4.1), which is not used here but is common in GANs. We also disable the bias in the fully connected and convolutional layers, because the following batch normalization makes it redundant.
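The redundancy of the bias can be checked numerically: batch normalization subtracts the batch mean, so any constant added before it cancels out. The sketch below uses plain Python and a hypothetical `batch_normalize` helper that normalizes a single feature over a batch, without learnable parameters; it is not the Keras layer itself.

```python
import math


def batch_normalize(values, epsilon=1e-5):
    """Normalize one feature over a batch to zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return [(value - mean) / math.sqrt(variance + epsilon) for value in values]


pre_activations = [0.5, -1.0, 2.0, 0.25]  # made-up layer outputs for one feature
bias = 3.0

without_bias = batch_normalize(pre_activations)
with_bias = batch_normalize([value + bias for value in pre_activations])

# The bias shifts the batch mean by the same amount, so it cancels out.
max_difference = max(abs(a - b) for a, b in zip(without_bias, with_bias))
```

Here `max_difference` is zero up to floating point rounding, so the bias only adds parameters without changing what the network can represent.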
GAN model
Training
One can see from the metrics during training that if the real accuracy (the discriminator's accuracy on real images) is above the target accuracy, the augmentation probability is increased, and vice versa: a discriminator that is too accurate on real images is starting to overfit and needs stronger augmentation. In my experience, during a healthy GAN training, the discriminator accuracy should stay in the 80-95% range. Below that, the discriminator is too weak; above that, it is too strong.
Note that we track the exponential moving average of the generator's weights, and use that for image generation and KID evaluation.
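A minimal sketch of such an exponential moving average update, with weights represented as a flat list of floats. The function name and the decay value of 0.99 are assumptions for illustration; a real implementation would loop over the weight tensors of the two generator copies.

```python
def update_ema_weights(ema_weights, weights, ema=0.99):
    """Move the averaged weights a small step towards the current weights."""
    return [
        ema * ema_weight + (1.0 - ema) * weight
        for ema_weight, weight in zip(ema_weights, weights)
    ]


ema_weights = [0.0, 0.0]
# Simulate 500 training steps in which the trained weights stay fixed.
for _ in range(500):
    ema_weights = update_ema_weights(ema_weights, weights=[1.0, -2.0])
```

The averaged copy lags behind the trained weights and converges to them when they stop changing; a decay closer to 1 gives a smoother but slower moving average.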
Inference
Results
By running the training for 400 epochs (which takes 2-3 hours in a Colab notebook), one can get high quality image generations using this code example.
The evolution of a random batch of images over a 400 epoch training (ema=0.999 for animation smoothness):
Latent-space interpolation between a batch of selected images:
I also recommend trying out training on other datasets, such as CelebA. In my experience good results can be achieved without changing any hyperparameters (though discriminator augmentation might not be necessary).
GAN tips and tricks
My goal with this example was to find a good tradeoff between ease of implementation and generation quality for GANs. During preparation, I ran numerous ablations using this repository.
In this section I list the lessons learned and my recommendations in my subjective order of importance.
I recommend checking out the DCGAN paper, this NeurIPS talk, and this large scale GAN study for others' takes on this subject.
Architectural tips
resolution: Training GANs at higher resolutions tends to get more difficult, so I recommend experimenting at 32x32 or 64x64 resolutions initially.
initialization: If you see strong colorful patterns early on in the training, the initialization might be the issue. Set the kernel_initializer parameters of layers to random normal, and decrease the standard deviation (recommended value: 0.02, following DCGAN) until the issue disappears.
upsampling: There are two main methods for upsampling in the generator. Transposed convolution is faster, but can lead to checkerboard artifacts, which can be reduced by using a kernel size that is divisible by the stride (recommended kernel size is 4 for a stride of 2). Upsampling + standard convolution can have slightly lower quality, but checkerboard artifacts are not an issue. I recommend using nearest-neighbor interpolation over bilinear for it.
batch normalization in discriminator: Sometimes has a high impact, so I recommend trying out both ways.
spectral normalization: A popular technique for training GANs, can help with stability. I recommend disabling batch normalization's learnable scale parameters along with it.
residual connections: While residual discriminators behave similarly, residual generators are more difficult to train in my experience. They are however necessary for training large and deep architectures. I recommend starting with non-residual architectures.
dropout: Using dropout before the last layer of the discriminator improves generation quality in my experience. Recommended dropout rate is below 0.5.
leaky ReLU: Use leaky ReLU activations in the discriminator to make its gradients less sparse. Recommended slope/alpha is 0.2 following DCGAN.
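One common explanation for the kernel size recommendation in the upsampling tip is uneven overlap: in a transposed convolution, some output positions receive contributions from more input positions than others. The sketch below counts these contributions for a 1D transposed convolution without padding; the function name is hypothetical and the setting is simplified for illustration.

```python
def overlap_counts(input_length, kernel_size, stride):
    """Count how many kernel taps write to each output position of a
    1D transposed convolution without padding."""
    output_length = (input_length - 1) * stride + kernel_size
    counts = [0] * output_length
    for input_position in range(input_length):
        for tap in range(kernel_size):
            counts[input_position * stride + tap] += 1
    return counts


# Kernel size 3 is not divisible by stride 2: the overlap alternates.
uneven = overlap_counts(input_length=4, kernel_size=3, stride=2)
# -> [1, 1, 2, 1, 2, 1, 2, 1, 1]

# Kernel size 4 is divisible by stride 2: the interior overlap is uniform.
even = overlap_counts(input_length=4, kernel_size=4, stride=2)
# -> [1, 1, 2, 2, 2, 2, 2, 2, 1, 1]
```

The alternating pattern in the first case is what tends to show up as a checkerboard in two dimensions, while the second case only differs at the borders.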
Algorithmic tips
loss functions: Numerous losses have been proposed over the years for training GANs, promising improved performance and stability. I have implemented 5 of them in this repository, and my experience is in line with this GAN study: no loss seems to consistently outperform the default non-saturating GAN loss. I recommend using that as a default.
Adam's beta_1 parameter: The beta_1 parameter in Adam can be interpreted as the momentum of mean gradient estimation. Using 0.5 or even 0.0 instead of the default 0.9 value was proposed in DCGAN and is important. This example would not work using its default value.
separate batch normalization for generated and real images: The forward pass of the discriminator should be separate for the generated and real images. Doing otherwise can lead to artifacts (45 degree stripes in my case) and decreased performance.
exponential moving average of generator's weights: This helps to reduce the variance of the KID measurement, and helps in averaging out the rapid color palette changes during training.
different learning rate for generator and discriminator: If one has the resources, it can help to tune the learning rates of the two networks separately. A similar idea is to update either network's (usually the discriminator's) weights multiple times for each of the other network's updates. I recommend using the same learning rate of 2e-4 (Adam), following DCGAN for both networks, and only updating both of them once as a default.
label noise: One-sided label smoothing (using less than 1.0 for real labels) or adding noise to the labels can regularize the discriminator so that it does not get overconfident; however, in my case they did not improve performance.
adaptive data augmentation: Since it adds another dynamic component to the training process, disable it as a default, and only enable it when the other components already work well.
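For reference, the non-saturating GAN loss and one-sided label smoothing mentioned in the tips above can be sketched in plain Python as follows. The function names and the `real_label` parameter are assumptions for illustration; a Keras implementation would typically use a binary cross-entropy loss computed from logits instead.

```python
import math


def binary_crossentropy(label, logit):
    """Numerically stable binary cross-entropy computed from a logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def non_saturating_losses(real_logits, generated_logits, real_label=1.0):
    """Return (generator_loss, discriminator_loss) for one batch.

    Setting `real_label` below 1.0 applies one-sided label smoothing.
    """
    # Generator: push generated images towards being classified as real.
    generator_terms = [binary_crossentropy(1.0, z) for z in generated_logits]
    # Discriminator: real images towards `real_label`, generated towards 0.
    discriminator_terms = [
        binary_crossentropy(real_label, z) for z in real_logits
    ] + [binary_crossentropy(0.0, z) for z in generated_logits]
    generator_loss = sum(generator_terms) / len(generator_terms)
    discriminator_loss = sum(discriminator_terms) / len(discriminator_terms)
    return generator_loss, discriminator_loss


# An undecided discriminator (all logits 0) gives log(2) for both losses.
generator_loss, discriminator_loss = non_saturating_losses([0.0], [0.0])
```

The generator loss is called non-saturating because it still provides a large gradient when the discriminator confidently rejects the generated images, which is the typical situation early in training.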