Speaking Code: LARGE SCALE GAN TRAINING FOR HIGH FIDELITY NATURAL IMAGE SYNTHESIS

Ching (Chingis)
5 min read · Jan 13, 2022

Background

Generative Adversarial Networks (GANs) are capable of generating high-quality images; however, the resolution of the generated images has remained relatively small. There have been many efforts to address this issue. For example, ProGAN trains high-resolution GANs in the single-class setting by iteratively training across a set of increasing resolutions. Nevertheless, training remains unstable despite the large number of studies that have investigated and proposed improvements.

Without auxiliary stabilization techniques, this training procedure is notoriously brittle, requiring finely-tuned hyperparameters and architectural choices to work at all.

Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018.

Most of the improvements have come from changes to the objective function or from constraining the discriminator during training.

More recently, scaling up GAN models has been found to work well for generating images that are both higher quality and higher resolution.

Takeaways from Scaling Up GANs

BigGAN

The authors provide class information to the Generator with class-conditional BatchNorm, as seen in the image above (sub-figures (a) and (b)), and to the Discriminator with projection.
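
On the Discriminator side, "projection" means the class enters through an inner product between a learned class embedding and the Discriminator's pooled features, added to the usual real/fake score (the projection discriminator of Miyato & Koyama). A minimal sketch of such a head, with names of my own choosing (ProjectionHead, feat_dim), could look like this:

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    # Projection-style conditioning: D(x, y) = psi(phi(x)) + <embed(y), phi(x)>
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.linear = nn.Linear(feat_dim, 1)            # unconditional real/fake score
        self.embed = nn.Embedding(n_classes, feat_dim)  # learned class embedding

    def forward(self, h, y):
        # h: pooled Discriminator features of shape (B, feat_dim); y: class labels (B,)
        out = self.linear(h)
        out = out + (self.embed(y) * h).sum(dim=1, keepdim=True)  # projection term
        return out
```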

They also use Orthogonal Initialization instead of the classic Xavier Initialization or N(0, 0.02I). BatchNorm statistics in G are computed across all devices rather than per device, as is typical. They also note that progressive growing, as in ProGAN, is unnecessary.
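
In PyTorch, orthogonal initialization is available out of the box via nn.init.orthogonal_. A rough sketch of applying it to a model's conv, linear and embedding weights (the exact set of layers initialized this way may differ from the authors'):

```python
import torch.nn as nn

def init_weights_orthogonal(model):
    # Apply orthogonal init to conv/linear/embedding weight matrices;
    # biases keep their default initialization.
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear, nn.Embedding)):
            nn.init.orthogonal_(module.weight)
```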

  1. Simply increasing the batch size by a factor of 8 improved their performance, in terms of Inception Score (IS), by 46%. They explain that larger batches cover more modes and therefore provide better gradients for both networks. They also reached a better final performance in fewer iterations.
  2. They then increase the number of channels in each CNN layer by 50% (roughly doubling the number of parameters), which resulted in a further 21% improvement in IS.
  3. Notice from the figure above that class embeddings are shared, with a separate linear layer feeding each BatchNorm layer. This reduces computation and memory costs considerably and improves training speed by 37%.
  4. Notice that the noise vector z is split into one chunk per ResBlock and concatenated with the class embedding c. This gave a modest improvement of about 4%.

Also, if you are wondering what a Non-local block is, here is the diagram:

Non-local block
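
The non-local block is essentially the self-attention block from SAGAN: 1×1 convolutions produce queries, keys and values, an attention map over spatial positions mixes the values, and the result is added back to the input with a learnable weight γ. A condensed sketch following the diagram (the channel ratios and the max-pool downsampling of keys/values follow common BigGAN implementations and may differ in detail):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    # Non-local (self-attention) block over spatial positions.
    def __init__(self, ch):
        super().__init__()
        self.theta = nn.Conv2d(ch, ch // 8, 1, bias=False)  # queries
        self.phi   = nn.Conv2d(ch, ch // 8, 1, bias=False)  # keys
        self.g     = nn.Conv2d(ch, ch // 2, 1, bias=False)  # values
        self.o     = nn.Conv2d(ch // 2, ch, 1, bias=False)  # output projection
        self.gamma = nn.Parameter(torch.zeros(1))           # learnable residual weight

    def forward(self, x):
        B, C, H, W = x.shape
        theta = self.theta(x).flatten(2)                   # (B, C/8, HW)
        phi = F.max_pool2d(self.phi(x), 2).flatten(2)      # (B, C/8, HW/4)
        g = F.max_pool2d(self.g(x), 2).flatten(2)          # (B, C/2, HW/4)
        attn = F.softmax(torch.bmm(theta.transpose(1, 2), phi), dim=-1)  # (B, HW, HW/4)
        o = torch.bmm(g, attn.transpose(1, 2)).view(B, C // 2, H, W)
        return self.gamma * self.o(o) + x                  # residual connection
```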

Truncation Trick

BigGAN

The authors find that taking models trained with z ~ N(0, I) and sampling z from a truncated normal improves both IS and FID. Truncation trick: truncate the z vector by resampling any values whose magnitude exceeds a chosen threshold. This leads to higher-quality images at the cost of overall sample variety: the smaller the threshold, the lower the variety.
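
In code, the truncation trick amounts to rejection-resampling of the latent: any coordinate of z whose magnitude exceeds the threshold is redrawn until it falls inside. A small illustrative sketch (the function name is mine):

```python
import torch

def truncated_z(batch_size, dim_z, threshold=0.5, device='cpu'):
    # Sample z ~ N(0, I), then redraw any coordinate whose magnitude
    # exceeds the threshold until every coordinate falls within it.
    z = torch.randn(batch_size, dim_z, device=device)
    while (z.abs() > threshold).any():
        resampled = torch.randn_like(z)
        z = torch.where(z.abs() > threshold, resampled, z)
    return z
```

Sweeping the threshold traces out the fidelity/variety trade-off described above.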

Orthogonal Regularization

R_β(W) = β‖WᵀW ⊙ (1 − I)‖²_F,

where W is a weight matrix, ⊙ denotes element-wise multiplication, 1 is the all-ones matrix, I is the identity, and β is a hyperparameter set to 1e-4. Removing the diagonal terms means the regularizer penalizes only the pairwise cosine similarity between filters without constraining their norms.

They notice that some of their larger models are not amenable to the truncation trick. Therefore, they introduce Orthogonal Regularization, with which about 60% of the larger models become amenable to truncation.
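
Written as an explicit loss term, the regularizer above can be sketched as follows; treat this as an illustration of the formula rather than the exact implementation (in practice it is often applied directly to the weight gradients for efficiency):

```python
import torch

def ortho_reg(model, beta=1e-4):
    # R_beta(W) = beta * || W^T W ⊙ (1 - I) ||_F^2, summed over all
    # weight tensors with at least two axes (conv kernels are flattened).
    loss = 0.0
    for param in model.parameters():
        if param.ndim < 2:
            continue  # skip biases and BatchNorm parameters
        w = param.view(param.shape[0], -1)
        gram = w.t() @ w                                              # W^T W
        off_diag = gram * (1.0 - torch.eye(gram.shape[0], device=w.device))
        loss = loss + beta * (off_diag ** 2).sum()                    # squared Frobenius norm
    return loss
```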

Architectural Details

There are two versions of the model described: BigGAN and BigGAN-deep. The latter uses deeper ResNet modules and does not require splitting z.

BigGAN

I am showing this figure again to help explain the architecture. As we can see, the ResBlocks are modified, but the individual modules should look familiar. Also, note that the ResBlock for D differs from that of G in that the number of filters in the first convolutional layer of each block equals the number of output filters (rather than the number of input filters, as in G). The noise vector z is split along its channel dimension into chunks of equal size, and each chunk is concatenated with the class embedding.

The BigGAN-deep model differs from BigGAN in several aspects, but I will not go over the details here. Please take your time to read the paper; I believe it is not hard to understand once you understand BigGAN.

Speaking Code

I took the code from the following repo:

All of the code is written in PyTorch.

BigGAN: G forward function

We see that the noise vector z is first split into equal-size chunks. The very first chunk (zs[0]) is used as the input, and the remaining chunks are each concatenated with our class-conditional vector y. After that, we iterate over our ResBlocks (self.blocks), passing in the feature map together with the corresponding concatenated vector. The final output is obtained by passing through BatchNorm-ReLU-conv and a tanh. Looks pretty simple, right?
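
For reference, here is a condensed paraphrase of that forward function, following the description above (attribute names such as self.linear, self.blocks and self.output_layer mirror the repo's structure, and their definitions in __init__ are not shown; details are simplified):

```python
import torch

def forward(self, z, y):
    # Split z along its channel dimension into equal-size chunks.
    zs = torch.split(z, self.z_chunk_size, dim=1)
    z = zs[0]                                               # first chunk feeds the first linear layer
    ys = [torch.cat([y, item], dim=1) for item in zs[1:]]   # remaining chunks joined with class embedding y

    # Project the first chunk and reshape it into a low-resolution feature map.
    h = self.linear(z)
    h = h.view(h.size(0), -1, self.bottom_width, self.bottom_width)

    # Each ResBlock receives the feature map plus its own [y, z_chunk] vector.
    for block, y_i in zip(self.blocks, ys):
        h = block(h, y_i)

    # Final BatchNorm -> ReLU -> conv, squashed to [-1, 1] with tanh.
    return torch.tanh(self.output_layer(h))
```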

ResBlock

This is the ResBlock's forward function for the Generator. It looks pretty clear. Note that we pass our concatenated vector y into the BatchNorm blocks.
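
Roughly, the Generator ResBlock's forward pass looks like this (a simplified paraphrase; self.bn1/self.bn2 are the class-conditional BatchNorm layers discussed next, and self.conv_sc is a 1×1 shortcut convolution):

```python
import torch.nn.functional as F

def forward(self, x, y):
    # Conditional BatchNorm -> ReLU -> (optional upsample) -> conv, twice.
    h = F.relu(self.bn1(x, y))
    if self.upsample:
        h = self.upsample(h)
        x = self.upsample(x)
    h = self.conv1(h)
    h = F.relu(self.bn2(h, y))
    h = self.conv2(h)
    # Learnable 1x1 shortcut so the residual branch matches channels.
    if self.learnable_sc:
        x = self.conv_sc(x)
    return h + x
```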

BatchNorm block

Now let’s see what happens inside our BatchNorm blocks. Our concatenated vector y is passed into self.gain and self.bias, which are just Linear layers. So vector y is linearly projected to produce per-sample gains and biases for the BatchNorm layers of the block. The bias projections are zero-centered, while the gain projections are centered at 1; therefore, we add 1 after applying self.gain. Finally, after normalizing the input x, we multiply it by the computed gain and add the bias.
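
Putting that description into code, a simplified version of the class-conditional BatchNorm forward pass could look like this (buffer names like running_mean/running_var are my choice and may differ from the repo):

```python
import torch.nn.functional as F

def forward(self, x, y):
    # Linear projections of the conditioning vector y:
    # gains are centered at 1, biases at 0.
    gain = (1 + self.gain(y)).view(y.size(0), -1, 1, 1)
    bias = self.bias(y).view(y.size(0), -1, 1, 1)
    # Standard (parameter-free) batch normalization of x ...
    out = F.batch_norm(x, self.running_mean, self.running_var,
                       weight=None, bias=None,
                       training=self.training, momentum=0.1, eps=1e-5)
    # ... then modulate with the per-sample gain and bias.
    return out * gain + bias
```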

Some Last Words

I hope I have helped someone understand the concepts behind BigGAN better. Anyway, my articles are just meant to introduce you to the concepts; you can always read the paper and, of course, get more details from it. I encourage you to study the paper on your own. This article provides a good amount of background, so the paper should seem a little bit easier. Thank you for taking the time to read my work (:
