Paper of Choice: Image Generation From Small Datasets via Batch Statistics Adaptation

Ching (Chingis)
5 min read · Dec 28, 2021

Hello, the year is about to end. There are millions of things I would love to write about, but unfortunately I simply do not have much time. Well, here I am again, writing about GANs totally out of the blue. I found this work fascinating, so I wanted to share and write about it; I will try to review interesting papers I encounter once in a while. The motivation of this work is to take a pretrained GAN and adapt it to very limited data (~25 images, as in the image below). I hope you enjoy this piece.

Taken from: Image Generation From Small Datasets via Batch Statistics Adaptation

Role of Batch Statistics

In this subsection, I want to provide a brief analysis in terms of filter selection before we dive into the actual method proposed by the authors.

Taken from: Image Generation From Small Datasets via Batch Statistics Adaptation

Here, W is the weight of the convolution, W_i is the i-th filter of the convolution, and c_out is the number of output channels. Notice that changing the scale γ corresponds to changing the activation strength of each convolution filter, whereas changing the shift β is equivalent to changing the filter's activation threshold. Thus, the larger γ_i and β_i are, the more active the corresponding neurons are, and vice versa. The authors observed this behavior in a set of prior experiments and concluded that modulating the scale and shift parameters amounts to filter selection and to controlling activations in a Convolutional Neural Network (CNN). I believe this is what inspired this work.
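To make this concrete, here is a tiny PyTorch sketch (my own illustration, not code from the paper) showing that scaling the output channels of a convolution by γ is equivalent to scaling the filters themselves, which is why tuning γ and β can be read as filter selection:

```python
import torch
import torch.nn as nn

# Toy check: scaling the i-th output channel by gamma_i is equivalent to
# scaling the i-th filter W_i, so learning gamma/beta re-weights ("selects")
# filters and shifts their activation threshold.
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
x = torch.randn(1, 3, 16, 16)

gamma = torch.rand(8)    # per-filter scale
beta = torch.randn(8)    # per-filter shift

# (a) scale and shift applied to the activations
out_a = gamma.view(1, -1, 1, 1) * conv(x) + beta.view(1, -1, 1, 1)

# (b) the same scale folded directly into the filter weights
conv_scaled = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
conv_scaled.weight.data = conv.weight.data * gamma.view(-1, 1, 1, 1)
out_b = conv_scaled(x) + beta.view(1, -1, 1, 1)

print(torch.allclose(out_a, out_b, atol=1e-5))  # True
```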

Method

Thus, the authors introduce scale and shift parameters after each hidden layer, except for the final layer, and update only these parameters to perform the adaptation; the convolution kernels themselves are not updated. It is important to mention that the authors fine-tune an ImageNet-pretrained BigGAN.

Taken from: Image Generation From Small Datasets via Batch Statistics Adaptation

where G^(l) is the feature representation at the l-th layer of the generator, and G^(l)_Adapt is the feature at the l-th layer after adaptation. Note that the γ and β parameters are learnable, introduced by the authors, and initialized to 1 and 0, respectively.

For Batch Normalization layers, the authors update the existing scale and shift parameters directly instead of introducing new learnable parameters on top.
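Below is a minimal sketch of how this adaptation could be set up in PyTorch. It is my own rough interpretation: `ScaleShift` stands in for the newly introduced per-channel parameters, and `generator` is a placeholder for a pretrained BigGAN-like network (the real BigGAN code organizes its layers differently):

```python
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    """Learnable per-channel scale (init 1) and shift (init 0)."""
    def __init__(self, num_channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, h):
        return self.gamma * h + self.beta

def make_adaptable(generator):
    # 1) freeze every pretrained weight (the kernels are never updated)
    for p in generator.parameters():
        p.requires_grad = False
    # 2) for BatchNorm layers, re-enable their existing scale/shift parameters
    for m in generator.modules():
        if isinstance(m, nn.BatchNorm2d) and m.affine:
            m.weight.requires_grad = True
            m.bias.requires_grad = True
    # 3) new ScaleShift adapters are then inserted after each hidden layer
    #    except the final one; how to hook them in depends on the generator's
    #    actual implementation, so it is left out of this sketch.
    return generator
```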

Training

Training pipeline. Taken from: Image Generation From Small Datasets via Batch Statistics Adaptation

Notice that there is no discriminator in the pipeline. Instead, there is a learnable latent vector z that is trained so that the generated image becomes very close to one of the images in your dataset. Thus, there are as many trainable vectors z as there are images in the dataset D. Each z is initialized with a zero vector.
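In code, these per-image latent vectors could look something like the snippet below (the shapes are illustrative values I chose, not taken from the paper):

```python
import torch
import torch.nn as nn

# One trainable latent vector per training image, initialized to zeros.
# `num_images` and `z_dim` are placeholders for the dataset size and the
# generator's latent dimensionality.
num_images, z_dim = 25, 120
latents = nn.Parameter(torch.zeros(num_images, z_dim))

# Image i is reconstructed from latents[i]; the optimizer updates `latents`
# jointly with the scale/shift parameters described above.
optimizer = torch.optim.Adam([latents], lr=1e-3)
```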

Taken from: Image Generation From Small Datasets via Batch Statistics Adaptation

where x_i is the i-th image and z_i is the corresponding learnable latent vector. c, h, w, and d are the channel, height, width, and dimension of each feature, respectively.

G_Adapt is the adapted pretrained generator, and C^(l) is the feature map at the l-th layer of the pretrained classifier C (VGG16).

b is the batch size, and λ is the coefficient that balances the terms. r_j is a random vector sampled from the normal distribution, k is a hyperparameter chosen to be sufficiently larger than the amount of training data, and ϵ is a small random noise vector.

The first and second terms make the generated image close to the corresponding training image x_i. The third term is a regularizer that encourages z to resemble a standard normal distribution. Finally, the fourth term is there to address overfitting.
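Putting the pieces together, a rough sketch of the objective might look like the following. This is my own reading of the first three terms (pixel-space and VGG16 feature-space reconstruction plus a latent regularizer); the fourth, anti-overfitting term with the random vectors r_j is omitted, so please consult the paper for the exact formulation and weights:

```python
import torch
import torch.nn.functional as F

def adaptation_loss(g_adapt, vgg_features, x, z, lam=0.1):
    """x: batch of training images, z: their trainable latent vectors."""
    x_hat = g_adapt(z)

    # terms 1-2: keep the generated image close to the training image,
    # at the pixel level and in VGG16 feature space
    pixel_loss = F.l1_loss(x_hat, x)
    feat_loss = sum(F.l1_loss(f_hat, f)
                    for f_hat, f in zip(vgg_features(x_hat), vgg_features(x)))

    # term 3: encourage z to stay close to a standard normal distribution
    z_reg = z.pow(2).mean()

    # term 4 (overfitting regularizer using random vectors r_j) omitted here
    return pixel_loss + feat_loss + lam * z_reg
```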

Inference

The authors found that sampling a random vector from a truncated normal distribution gives better results.
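As a small illustration, here is one way to draw such a truncated sample in PyTorch (the truncation bound and latent size below are my own illustrative choices):

```python
import torch

def sample_truncated(batch_size, z_dim=120, trunc=1.0):
    # Draw latent vectors from a normal distribution truncated to [-trunc, trunc].
    z = torch.empty(batch_size, z_dim)
    torch.nn.init.trunc_normal_(z, mean=0.0, std=1.0, a=-trunc, b=trunc)
    return z

z = sample_truncated(8)
# images = g_adapt(z)  # feed through the adapted generator
```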

Taken from: Image Generation From Small Datasets via Batch Statistics Adaptation

In this image, we see that the proposed architecture, trained on only ~25 face images, produces much smoother results. It also achieves good FID and KMMD scores.

Taken from: Image Generation From Small Datasets via Batch Statistics Adaptation
Taken from: Image Generation From Small Datasets via Batch Statistics Adaptation

Figure 9 shows that the performance on the anime face dataset is relatively good compared to other methods when the data size is smaller than 500. Also, on the second dataset, the method outperforms Transfer GAN provided the data size is smaller than 100. So, although the method performs well on small datasets, there are still limitations as the dataset grows: for the anime dataset, "updating all" performs better when the data size is large, possibly because it has more trainable parameters.

Some Last Words

I tried to highlight all the important parts of the paper. However, I admit that I might have missed some interesting points, so I encourage you to study it on your own. I hope it is now easier for you to follow along with this paper. Personally, I found this paper fascinating and very simple; that is why I really wanted to write and share this piece. Thank you for your time reading my article. If you have anything to suggest or share, I will be happy to read and reply to it. Thank you again (:

