Speaking Code: Vision Transformer

Ching (Chingis)
4 min read · Jan 3, 2022

AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

Hello, I have always wanted to start a series where I explain the concepts behind papers by breaking down their code. So, here I am writing about Vision Transformers. I am not focusing on the mathematical details of the paper; instead, I break down the authors' code to explain the process and the concept. I hope my Computer Science fellas enjoy it, thank you!

Original Paper: https://arxiv.org/pdf/2010.11929.pdf

In vision, attention is usually applied in combination with CNNs, or used to replace certain components of CNNs while keeping their overall structure in place. However, the authors argue that this reliance on CNNs is not necessary and that a pure transformer, applied directly to sequences of image patches, can perform well on image classification tasks.

Inspiration

Inspired by NLP successes, many works incorporate self-attention into CNN-like architectures, whereas some replace the convolutions entirely. The latter architectures, while theoretically efficient, have not yet been scaled effectively on modern hardware accelerators. As a result, in large-scale image recognition, classic ResNet-like architectures were still state of the art.

Motivated by this observation, the authors apply a standard Transformer directly to images with as few modifications as possible. To achieve this, they split an image into fixed-size patches and feed the sequence of embeddings of these patches, treated like tokens in NLP, as input to a Transformer architecture.
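To make the patch-to-token step concrete, here is a minimal PyTorch-style sketch (my own illustration, not the authors' Flax code, which we look at below); the patch size of 16 and embedding dimension of 768 are the ViT-Base settings. A convolution whose kernel size equals its stride is just a convenient way to flatten each patch and apply one shared linear projection.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly embed each one."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # kernel_size == stride == patch_size: each patch is projected independently
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (N, 3, 224, 224)
        x = self.proj(x)                     # (N, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (N, 196, 768) -- a sequence of "tokens"

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```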

Important Observation

Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data (mid-sized datasets such as ImageNet). However, the Vision Transformer approaches or beats state-of-the-art (SOTA) results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints.

Architecture

taken from: https://arxiv.org/pdf/2010.11929.pdf

Speaking Code

taken from: official repo

Although the code above looks like PyTorch, it is written in Flax (a neural network library and ecosystem for JAX designed for flexibility). I am not familiar with it, but that does not stop us from learning, since the code is easy enough to follow.

Although the original paper describes simple fully-connected layers to embed the image patches, what we actually see here is a CNN+Transformer architecture, described as the hybrid architecture in the original paper. We can see that an image goes through multiple ResNet blocks before it is fed into the Transformer encoder. This is useful because the Vision Transformer has much less image-specific inductive bias than CNNs.
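As a rough illustration of the hybrid idea (again my own sketch, not the authors' Flax implementation), the stem below uses a couple of plain convolutional layers as a stand-in for the ResNet blocks; its feature maps are what gets tokenized in the next step.

```python
import torch
import torch.nn as nn

# Stand-in for the ResNet stem of the hybrid model: the real code uses proper
# ResNet blocks, this only shows that the Transformer receives CNN feature maps.
cnn_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(256), nn.ReLU(),
)

feature_maps = cnn_stem(torch.randn(2, 3, 224, 224))
print(feature_maps.shape)  # torch.Size([2, 256, 56, 56]) -- N x C x H x W
```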

After the series of CNN layers (ResNet blocks), the feature maps of shape N x C x H x W are reshaped into N x (H*W) x C, where C is the number of channels. This serves as an alternative to breaking a raw image down into a sequence of patches. The authors also prepend (add in front) a learnable embedding whose state at the output is used for image classification. Next, the whole sequence is fed into the Transformer encoder, which consists of self-attention blocks and MLP layers, with layer normalization applied before every block and a residual connection after every block. Note that position embeddings are added as well.
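Here is a hedged PyTorch-style sketch of that whole flow: reshaping the feature maps, prepending the learnable [class] token, adding position embeddings, and running one pre-norm encoder block. The real model is written in Flax and stacks many such blocks (12 in ViT-Base), so treat this purely as an illustration of the data flow described above.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: LayerNorm -> self-attention -> residual,
    then LayerNorm -> MLP -> residual (pre-norm, as in ViT)."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual after attention
        x = x + self.mlp(self.norm2(x))                    # residual after MLP
        return x

N, C, H, W = 2, 768, 14, 14
feature_maps = torch.randn(N, C, H, W)                   # CNN output (or patch projections)
tokens = feature_maps.flatten(2).transpose(1, 2)         # N x C x H x W -> N x (H*W) x C

cls_token = nn.Parameter(torch.zeros(1, 1, C))           # learnable [class] embedding
pos_embed = nn.Parameter(torch.zeros(1, H * W + 1, C))   # learnable position embeddings

x = torch.cat([cls_token.expand(N, -1, -1), tokens], dim=1) + pos_embed
x = EncoderBlock(dim=C)(x)                               # ViT-Base stacks 12 of these
print(x.shape)                                           # torch.Size([2, 197, 768])
```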

Next, an additional fully-connected layer can be applied, probably to adjust the dimensionality, followed by a tanh activation (I wonder why not ReLU, though). Finally, we perform the classification by applying another fully-connected layer, the prediction head.
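A minimal sketch of such a head, assuming the ViT-Base hidden size of 768 and a made-up number of classes, could look like this; only the output at the [class] token position is used.

```python
import torch
import torch.nn as nn

dim, num_classes = 768, 1000           # num_classes is just an example value

head = nn.Sequential(
    nn.Linear(dim, dim),               # extra fully-connected "pre-logits" layer
    nn.Tanh(),                         # tanh activation mentioned above
    nn.Linear(dim, num_classes),       # final prediction head
)

encoded = torch.randn(2, 197, dim)     # output of the Transformer encoder
logits = head(encoded[:, 0])           # classify from the [class] token only
print(logits.shape)                    # torch.Size([2, 1000])
```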

Experiments

taken from: https://arxiv.org/pdf/2010.11929.pdf
taken from: https://arxiv.org/pdf/2010.11929.pdf

We see that ViT pre-trained on the JFT-300M dataset outperforms the baselines (a ResNet152x4 and an EfficientNet-L2) on all datasets, while taking substantially less compute to pre-train.

taken from: https://arxiv.org/pdf/2010.11929.pdf

We can also have a glance at the attention maps that the authors provide. We see that ViT is able to focus on the parts of the images above that are relevant for classification.

Some Last Words

I know I did not cover the experimental results in detail; however, the goal of my articles is to introduce concepts and make them more approachable for my Computer Science fellas. I hope you enjoyed this piece as much as I did writing it. Thank you for taking the time to read my work. If you have any concerns or suggestions, it's my pleasure to read and respond.

