StyleGAN2
Please note that this is an optional notebook that is meant to introduce more advanced concepts, if you're up for a challenge. So, don't worry if you don't completely follow every step! We provide external resources for extra base knowledge required to grasp some components of the advanced material.
In this notebook, you're going to learn about StyleGAN2, from the paper Analyzing and Improving the Image Quality of StyleGAN (Karras et al., 2019), and how it builds on StyleGAN. This is the V2 of StyleGAN, so be prepared for even more extraordinary outputs. Here's the quick version:
Demodulation. The instance normalization in AdaIN in the original StyleGAN was actually producing “droplet artifacts” that made the output images clearly fake. AdaIN is modified a bit in StyleGAN2 to prevent this. Below, Figure 1 from the StyleGAN2 paper is reproduced, showing the droplet artifacts in StyleGAN.
Path length regularization. “Perceptual path length” (or PPL, which you can explore in another optional notebook) was introduced in the original StyleGAN paper as a metric for measuring the disentanglement of the intermediate noise space $\mathcal{W}$. PPL measures the change in the output image when interpolating between intermediate noise vectors $w$. You'd expect a good model to have a smooth transition during interpolation, where the same step size in $w$ maps onto the same amount of perceived change in the resulting image.
Using this intuition, you can make the mapping from $\mathcal{W}$ space to images smoother, by encouraging a given change in $w$ to correspond to a constant amount of change in the image. This is known as path length regularization, and as you might expect, it is included as a term in the loss function. This smoothness also made the generator model "significantly easier to invert"! Recall that inversion means going from a real or fake image to finding its $w$, so you can easily adapt the image's styles by controlling $w$.
No progressive growing. While progressive growing was seemingly helpful for training the network more efficiently and with greater stability at lower resolutions before progressing to higher resolutions, there's actually a better way. You can replace it with 1) a better neural network architecture with skip and residual connections (which you also see in Course 3 models, Pix2Pix and CycleGAN), and 2) training with all of the resolutions at once, but gradually shifting the generator's attention from lower-resolution to higher-resolution dimensions. So, in a way, the model is still careful about how it handles different resolutions, from lower to higher scales, to make training easier.
There are also a number of performance optimizations, like calculating the regularization less frequently. We won't focus on those in this notebook, but they are meaningful technical contributions.
But first, some useful imports:
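The original import cell isn't reproduced here, but a minimal set of imports that the illustrative sketches below assume (PyTorch-based, as in the rest of this course) would be:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
```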
Fixing Instance Norm
One issue with instance normalization is that it can lose important information that is typically communicated by relative magnitudes. In StyleGAN2, it was proposed that the droplet artifacts are a way for the network to "sneak" this magnitude information past the normalization with a single large spike. This issue was also highlighted in the paper which introduced GauGAN, Semantic Image Synthesis with Spatially-Adaptive Normalization (Park et al.), earlier in 2019. In that more extreme case, instance normalization could sometimes eliminate all semantic information, as shown in their paper's Figure 3:
While removing normalization is technically possible, it reduces the controllability of the model, a major feature of StyleGAN. Here's one solution from the paper:
Output Demodulation
The first solution notes that scaling the output of a convolutional layer by a style has a consistent and numerically predictable impact on the standard deviation of its output. By scaling the standard deviation of the output back down to 1, the droplet effect can be reduced.
More specifically, a style $s$, when applied as a multiplier to convolutional weights $w$, produces new weights $w'_{ijk} = s_i \cdot w_{ijk}$, and the output of the convolution then has standard deviation $\sigma_j = \sqrt{\sum_{i,k} {w'_{ijk}}^{2}}$ (assuming unit-variance input activations). One can simply divide the output of the convolution by this factor.
However, the authors note that dividing by this factor can also be incorporated directly into the convolutional weights themselves (with an added $\epsilon$ for numerical stability): $w''_{ijk} = \dfrac{w'_{ijk}}{\sqrt{\sum_{i,k} {w'_{ijk}}^{2} + \epsilon}}$
This makes it so that this entire operation can be baked into a single convolutional layer, making it easier to work with, implement, and integrate into the existing architecture of the model.
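As a rough sketch of this idea (not the paper's or the notebook's exact implementation; the class name, shapes, and the grouped-convolution trick are choices made here for illustration), a modulated convolution with demodulation baked into the weights might look like this, using the imports above:

```python
class ModulatedConv2d(nn.Module):
    """Convolution whose weights are scaled by a per-sample style, then demodulated
    so the output's expected standard deviation is brought back to 1.
    Assumes an odd kernel_size and a style s of shape (batch, in_channels)."""
    def __init__(self, in_channels, out_channels, kernel_size, eps=1e-8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_channels, in_channels, kernel_size, kernel_size))
        self.eps = eps
        self.padding = kernel_size // 2

    def forward(self, x, s):
        # x: (batch, in_channels, H, W), s: (batch, in_channels) style multipliers
        b, c, h, w = x.shape
        # Modulation: w'_ijk = s_i * w_ijk, computed per sample in the batch
        w1 = self.weight.unsqueeze(0) * s.view(b, 1, c, 1, 1)
        # Demodulation: divide by sqrt(sum of squared modulated weights + eps)
        demod = torch.rsqrt(w1.pow(2).sum(dim=[2, 3, 4], keepdim=True) + self.eps)
        w2 = w1 * demod
        # Fold the batch into the channel dimension and use a grouped convolution,
        # so each sample is convolved with its own demodulated weights
        x = x.reshape(1, b * c, h, w)
        w2 = w2.reshape(b * w2.shape[1], c, *w2.shape[3:])
        out = F.conv2d(x, w2, padding=self.padding, groups=b)
        return out.reshape(b, -1, h, w)
```

The grouped convolution is just one way to apply a different set of style-modulated weights to every sample in the batch with a single `F.conv2d` call; the same math could also be expressed by scaling the activations instead of the weights.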
Path Length Regularization
Path length regularization was introduced based on the usefulness of PPL, or perceptual path length, a metric for evaluating disentanglement proposed in the original StyleGAN paper -- feel free to check out the optional notebook for a detailed overview! In essence, for a fixed-size step in any direction in $\mathcal{W}$ space, the metric attempts to make the change in image space have a constant magnitude $a$. This is accomplished (in theory) by first taking the Jacobian of the generator with respect to $w$, which is $\mathbf{J}_w = \partial g(w) / \partial w$.
Then, you take the L2 norm of the Jacobian multiplied by a random image $y$ (which you sample from a normal distribution, as you often do): $\lVert \mathbf{J}_w^\top y \rVert_2$. This captures the expected magnitude of the change in pixel space. From this, you get a loss term, $\mathbb{E}_{w, y}\big(\lVert \mathbf{J}_w^\top y \rVert_2 - a\big)^2$, which penalizes the distance between this magnitude and $a$. The paper notes that this has similarities to spectral normalization (discussed in another optional notebook in Course 1), because both work by constraining norms.
An additional optimization is also possible and ultimately used in the StyleGAN2 model: instead of directly computing $\mathbf{J}_w^\top y$, you can more efficiently calculate the equivalent gradient $\nabla_w \big(g(w) \cdot y\big)$.
Finally, a bit of talk on $a$: $a$ is not a fixed constant, but an exponentially decaying average of the magnitudes over various runs -- as with most times you see (decaying) averages being used, this is to smooth out the value of $a$ across multiple iterations, not have it depend on just one. Notationally, with decay rate $\beta$, the value at the next iteration is $a_{t+1} = (1 - \beta)\, a_t + \beta \, \lVert \mathbf{J}_w^\top y \rVert_2$.
However, for your one example iteration you can treat $a$ as a constant for simplicity. There is also an example of an update of $a$ after the calculation of the loss, so you can see what $a$ looks like with exponential decay.
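As a rough sketch (the function name, shapes, and the per-pixel scaling of $y$ are assumptions made here for illustration, not the notebook's exact code), the penalty and the decaying average $a$ could be computed with autograd roughly like this:

```python
def path_length_penalty(fake_images, w, a, decay=0.01):
    """Compute the path length penalty E[(||J_w^T y||_2 - a)^2] and update the
    running average a. fake_images = g(w) must have been generated with
    w.requires_grad_(True) so the graph from w to the images is available."""
    # Random images y ~ N(0, I), scaled so the magnitude is comparable across resolutions
    y = torch.randn_like(fake_images) / (fake_images.shape[2] * fake_images.shape[3]) ** 0.5
    # Gradient trick: grad_w of (g(w) . y) equals J_w^T y
    grad, = torch.autograd.grad(
        outputs=(fake_images * y).sum(), inputs=w, create_graph=True
    )
    path_lengths = grad.pow(2).sum(dim=1).sqrt()     # ||J_w^T y||_2 per sample
    penalty = (path_lengths - a).pow(2).mean()       # the regularization loss term
    # Exponentially decaying average of the observed magnitudes
    new_a = a + decay * (path_lengths.mean().detach() - a)
    return penalty, new_a
```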
No More Progressive Growing
While the concepts behind progressive growing remain, you get to see how that is revamped and beefed up in StyleGAN2. This starts with generating all resolutions of images from the very start of training. You might be wondering why they didn't just do this in the first place: in the past, this has generally been unstable to do. However, by using residual or skip connections (there are two variants that both do better than without them), StyleGAN2 manages to replicate many of the dynamics of progressive growing in a less explicit way. Three architectures were considered for StyleGAN2 to replace the progressive growing.
Note that in the following figure, tRGB and fRGB refer to the convolutions which transform the noise with some number of channels at a given layer into a three-channel image for the generator, and vice versa for the discriminator.
The set of architectures considered for StyleGAN2 (from the paper). Ultimately, the skip generator and residual discriminator (highlighted in green) were chosen.
Option a: MSG-GAN
MSG-GAN (from Karnewar and Wang, 2019) proposed a somewhat natural approach: generate all resolutions of images, but also directly pass each corresponding resolution to the block of the discriminator responsible for dealing with that resolution.
Option b: Skip Connections
In the skip-connection approach, each block takes the previous noise as input and generates the next resolution of noise. For the generator, each noise is converted to an image, upscaled to the maximum size, and then summed together. For the discriminator, the images are downsampled to each block's size and converted to noises.
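As a rough sketch of the skip-generator forward pass (the helper name, block interfaces, and bilinear upsampling are assumptions for illustration; each block is assumed to double the resolution), the per-resolution RGB outputs are upsampled and summed:

```python
def skip_generator_forward(blocks, to_rgbs, x, w):
    """blocks[i] maps the features at one resolution to the next resolution;
    to_rgbs[i] converts those features to a 3-channel image (the 'tRGB' layers).
    The final image is the running sum of every block's RGB contribution,
    upsampled along the way to the highest resolution."""
    rgb = None
    for block, to_rgb in zip(blocks, to_rgbs):
        x = block(x, w)            # next-resolution feature map
        new_rgb = to_rgb(x, w)     # this block's RGB contribution
        if rgb is None:
            rgb = new_rgb
        else:
            # Upsample the running image to the new resolution, then add the skip
            rgb = F.interpolate(rgb, scale_factor=2, mode="bilinear",
                                align_corners=False) + new_rgb
    return rgb
```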
Option c: Residual Nets
In the residual network approach, each block adds residual detail to the noise, and the image conversion happens at the end for the generator and at the start for the discriminator.
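And a minimal sketch of a residual discriminator block under the same caveats (the layer choices and the 1/sqrt(2) scaling are common conventions, not necessarily the notebook's exact code):

```python
class DiscriminatorResBlock(nn.Module):
    """Residual discriminator block: the main path applies two convolutions and
    downsamples; the skip path downsamples and matches channels with a 1x1
    convolution. Their sum is scaled by 1/sqrt(2) to keep variance roughly constant."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.skip = nn.Conv2d(in_channels, out_channels, 1, bias=False)

    def forward(self, x):
        skip = self.skip(F.avg_pool2d(x, 2))   # skip path, downsampled
        x = F.leaky_relu(self.conv1(x), 0.2)
        x = F.leaky_relu(self.conv2(x), 0.2)
        x = F.avg_pool2d(x, 2)                 # downsample the main path
        return (x + skip) / (2 ** 0.5)
```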
StyleGAN2: Skip Generator, Residual Discriminator
By experiment, the skip generator and residual discriminator were chosen. One interesting effect is that, since the images from the skip generator are additive, you can explicitly see the contribution from each block and measure its magnitude. If you're not 100% sure how to implement skip and residual models yet, don't worry - you'll get a lot of practice with that in Course 3!
Figure 8 from StyleGAN2 paper, showing generator contributions by different resolution blocks of the generator over time. The y-axis is the standard deviation of the contributions, and the x-axis is the number of millions of images that the model has been trained on (training progress).
Now you've seen the primary changes, and you understand the current state of the art in image generation, StyleGAN2. Congratulations!
If you're the type of person who reads through the optional notebooks for fun, maybe you'll make the next state-of-the-art! Can't wait to cover your GAN in a new notebook 😃