Recent Advancements on Vision Transformers (ViT) by Facebook AI Research (FAIR) (Part 1)

Ching (Chingis)
8 min read · Mar 23, 2022

Here I am summarizing recent work on Vision Transformers (ViT) by Facebook AI to keep you updated. Vision transformers are becoming very popular these days and are being used in many fields, including object detection, segmentation, and representation learning. Therefore, I think it is important to know what has been going on recently. I personally think that FAIR is doing an amazing job in this field, so I am summarizing some of the works I found on the Internet. However, since I am covering multiple related works together, I cannot include all the details, such as the experiments. I hope you find it useful.

Vision Transformer

Early Convolutions Help Transformers See Better

Paper

The authors state that ViTs are sensitive to the choice of optimizer (AdamW vs. SGD), to hyperparameters, and to the length of the training schedule, unlike modern convolutional neural networks. Why is this the case? They argue that the issue lies with the patchify stem (which produces non-overlapping patches) of ViT models, implemented as a stride-p p×p convolution (p = 16 by default). This large-kernel, large-stride convolution runs counter to the typical design choices of convolutional layers in neural networks. Therefore, they propose replacing the patchify stem with a lightweight convolutional stem built from traditional 3×3 kernels, while dropping one transformer encoder block so that the parameter count stays roughly equivalent (as seen in the figure above).
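To make the patchify stem concrete, here is a minimal PyTorch sketch of it (the 768-dimensional embedding is an assumption matching ViT-B, and the variable names are mine):

```python
import torch
import torch.nn as nn

# The standard "patchify" (P) stem: a single stride-16, 16x16 convolution
# that turns a 224x224 image into 14x14 = 196 non-overlapping patch tokens.
patchify_stem = nn.Conv2d(in_channels=3, out_channels=768,
                          kernel_size=16, stride=16)

x = torch.randn(1, 3, 224, 224)              # dummy input image
tokens = patchify_stem(x)                    # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768) patch tokens
```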

Convolutional stem design.

As mentioned, the convolutional (C) stem (see the figure above) is obtained by stacking 3×3 convolutions, followed by a single 1×1 convolution at the end to match the input dimension of the first transformer encoder block. C stems downsample a 224×224 input image to 14×14 using overlapping strided convolutions (stride 1 or 2), matching the number of inputs created by the standard patchify stem. The design choices are as follows (a rough sketch follows the list):

  1. All 3×3 convolutions either have stride 2 and double the output channels or stride 1 and keep output channels unchanged.
  2. 3×3 convolutions are followed by batch norm (BN) and then ReLU, while the final 1×1 convolution is not.
  3. C stem is introduced as an alternative to the original Patchify stem; however, they also drop 1 transformer block to keep the flops similar.
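To make this concrete, here is a rough PyTorch sketch of such a convolutional stem. The channel widths are illustrative choices that follow the rules above, not the exact configuration from the paper, and the 768-dimensional output again assumes ViT-B:

```python
import torch
import torch.nn as nn

def conv_bn_relu(c_in, c_out, stride):
    # 3x3 convolution followed by BN and ReLU, per the design rules above
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

conv_stem = nn.Sequential(
    conv_bn_relu(3, 64, stride=2),      # 224 -> 112
    conv_bn_relu(64, 64, stride=1),     # stride 1 keeps channels unchanged
    conv_bn_relu(64, 128, stride=2),    # 112 -> 56, double channels
    conv_bn_relu(128, 256, stride=2),   # 56 -> 28, double channels
    conv_bn_relu(256, 512, stride=2),   # 28 -> 14, double channels
    nn.Conv2d(512, 768, kernel_size=1), # final 1x1 conv, no BN/ReLU
)

x = torch.randn(1, 3, 224, 224)
tokens = conv_stem(x).flatten(2).transpose(1, 2)  # (1, 196, 768), same as the P stem
```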

Convolutions in ViT

Recall that there is also a hybrid ViT that uses a ResNet-50, with 40 convolutional layers, as a backbone instead of the P stem. The authors emphasize that their goal is to explore lightweight convolutional stems consisting of only 5 to 7 convolutions in total (a minimal convolutional stem), instead of the 40 used by the hybrid ViT.

Paper

The authors conclude that the trivial change of replacing the patchify stem with a simple convolutional stem leads to a remarkable change in optimization behavior: the model converges faster and is much less sensitive to the choice of optimizer and hyperparameters. It is easier to optimize and achieves remarkable performance.


Training data-efficient image transformers & distillation through attention (DeiT)

In this work, the authors build competitive convolution-free transformers (DeiT). They introduce a teacher-student strategy that relies on a new distillation token, which ensures that the student learns from the teacher through attention; the teacher network is usually a convnet. The proposed distillation procedure is based on this distillation token, which plays essentially the same role as the class token. Both tokens, class and distillation, interact in the transformer through attention.

DeiT

Regarding the architecture, we simply introduce a new token, similar to the CLS token, and pass it through the network together with the CLS and patch tokens. The distillation token interacts with the other embeddings through the self-attention layers in the same way the CLS token does, and it allows the model to learn from the output of the teacher model (a convnet). The authors also observed that the CLS and distillation tokens converge towards different vectors, yet their tasks are somewhat similar provided we have a well-trained teacher network. The distillation token is used to minimize the distillation loss. First, let's introduce the two distillation strategies:

Soft distillation

DeiT

Soft distillation minimizes the KL divergence between the softmax (phi) of the teacher logits (Zt) and that of the student logits (Zs), where tau is the temperature hyperparameter. We also minimize the cross-entropy loss with the ground-truth label y.
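Written out, the soft-distillation objective looks roughly like this (lambda is the coefficient balancing the two terms):

```latex
\mathcal{L}_{\mathrm{soft}} =
(1-\lambda)\,\mathrm{CE}\!\left(\phi(Z_s),\, y\right)
+ \lambda\,\tau^{2}\,
\mathrm{KL}\!\left(\phi(Z_s/\tau) \,\|\, \phi(Z_t/\tau)\right)
```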

Hard distillation

DeiT

Here, we just apply argmax to the teacher logits (Zt) to get the pseudo-label yt. So, we minimize both the cross-entropy with the true label y and the cross-entropy with the pseudo-label yt. Note that we can also apply label smoothing to yt.
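In equation form, the hard-distillation objective is roughly the following, with the two cross-entropy terms weighted equally:

```latex
y_t = \arg\max_{c} Z_t(c), \qquad
\mathcal{L}_{\mathrm{hard}} =
\tfrac{1}{2}\,\mathrm{CE}\!\left(\phi(Z_s),\, y\right)
+ \tfrac{1}{2}\,\mathrm{CE}\!\left(\phi(Z_s),\, y_t\right)
```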

Differences

Our only differences are the training strategies, and the distillation token. Also we do not use a MLP head for the pre-training but only a linear classifier.

So, there are 2 linear classifiers (not MLP) for CLS and distillation tokens, respectively.
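Here is a hedged PyTorch sketch of the token layout and the two linear heads. The class name, the `encoder` argument, and the defaults are mine, not the official DeiT implementation, and positional embeddings are omitted for brevity:

```python
import torch
import torch.nn as nn

class DeiTLikeClassifier(nn.Module):
    """Token layout sketch: [CLS] + [DIST] + patch tokens, with two linear heads."""
    def __init__(self, encoder, embed_dim=768, num_classes=1000):
        super().__init__()
        self.encoder = encoder  # any standard transformer encoder stack
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # two separate linear classifiers (not an MLP head)
        self.head_cls = nn.Linear(embed_dim, num_classes)
        self.head_dist = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens):             # (B, N, D) patch embeddings
        B = patch_tokens.size(0)
        cls = self.cls_token.expand(B, -1, -1)
        dist = self.dist_token.expand(B, -1, -1)
        x = torch.cat([cls, dist, patch_tokens], dim=1)
        x = self.encoder(x)                      # both tokens attend to the patches
        return self.head_cls(x[:, 0]), self.head_dist(x[:, 1])
```

The CLS head is trained against the true label and the distillation head against the teacher's output, following the losses above.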

Teacher Network

DeiT

The authors report that convnets make better teacher networks than transformers. We see that RegNetY-16GF (16 gigaflops) gives the greatest gains. In the student column, DeiT-B (with an alembic sign) refers to DeiT-B trained with the authors' distillation strategy, and DeiT-B 384 refers to a fine-tuning stage at a larger resolution. As stated in the ViT paper, the authors found it useful to fine-tune at a larger resolution. DeiT is fine-tuned using both the true label and the teacher prediction (at the higher resolution), and the teacher network uses the same target resolution. So, similar to ViT, they pretrain DeiT models at resolution 224 and fine-tune at resolution 384.

Distillation

DeiT

Here, DeiT refers to DeiT-B, which has the same architecture as ViT-B, and DeiT-B (with an alembic sign) again refers to DeiT-B trained with the authors' distillation strategy. We see that hard distillation gives better performance than soft distillation. Next, the [embedding] suffix indicates which embedding is used to predict the label at test time, since both embeddings are able to infer the image label. We see that if we add up the softmax outputs of both classifiers (for the CLS and distillation tokens), we get slightly better accuracy.
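The late fusion at test time is just a sum of the two softmax outputs; a tiny self-contained sketch with dummy logits:

```python
import torch
import torch.nn.functional as F

# Dummy logits standing in for the outputs of the two linear classifiers
logits_cls = torch.randn(8, 1000)    # CLS-token head
logits_dist = torch.randn(8, 1000)   # distillation-token head

# Add up the softmax outputs of both classifiers, then take the argmax
probs = F.softmax(logits_cls, dim=-1) + F.softmax(logits_dist, dim=-1)
prediction = probs.argmax(dim=-1)
```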


Going deeper with Image Transformers

Deeper ViTs with LayerScale

The authors’ goal here is to improve the training stability of deeper vision transformers.

paper

Here, eta (η) refers to the layer-normalization module, while FFN and SA are the feed-forward network and self-attention module, respectively. Some works, such as Fixup and ReZero/SkipInit, introduce a learnable scalar alpha that is applied to the output of each residual block (note that the layer normalization η is removed in sub-figure b). ReZero initializes this parameter to 0, while Fixup initializes it to 1 and makes some other small adjustments to the block initialization. I believe these approaches worked well for traditional transformers but not for ViT/DeiT. Recall that ViTs apply η before the FFN or SA (not after, as in the original transformer architecture). So, the authors bring layer normalization and warmup back so that DeiT/ViT can converge. Also, instead of using a single scalar parameter alpha, they introduce a set of learnable parameters forming a diagonal matrix diag(λ1, λ2, …, λd). This allows a per-channel multiplication of the vector produced by each residual branch before it is added back to the skip connection, i.e., x + diag(λ)·SA(η(x)) and x + diag(λ′)·FFN(η(x)). The diagonal values are all initialized to a fixed small value: 0.1 for depth up to 18, 1e-5 for depth 24, and 1e-6 for deeper networks, so the initial contribution of the residual blocks is small. LayerScale allows more diversity than adjusting the whole layer by a single scalar as in Fixup.
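A minimal sketch of LayerScale in PyTorch, assuming a pre-norm block; the class name and defaults are mine:

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Learnable per-channel scaling of a residual branch, initialized small."""
    def __init__(self, dim, init_value=0.1):
        super().__init__()
        # equivalent to multiplying by diag(lambda_1, ..., lambda_dim)
        self.lam = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):
        return self.lam * x  # broadcast over the channel dimension

# Usage inside a pre-norm block (eta = LayerNorm, SA = self-attention, FFN = MLP):
#   x = x + ls_sa(SA(eta(x)))
#   x = x + ls_ffn(FFN(eta(x)))
```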

Overall, I believe it is a trick to help deeper vision transformers to converge.

paper

We see that DeiT-S with a larger depth (36 layers) achieves better performance with LayerScale than with the previous approaches.

Class attention (CaiT)

paper

The left-most architecture is a vanilla ViT. The problem is that ViT is effectively trying to learn two different objectives: learning patch representations and summarizing the overall content of an image (through the CLS token). Therefore, the authors propose to explicitly separate these two tasks. The middle architecture shows that we can insert the CLS token later, at deeper layers. They also propose CaiT, which consists of self-attention layers to learn representations and class-attention layers (usually 2 of them) to learn the content itself. Specifically, after the self-attention stage, we freeze the patch embeddings and prepend a learnable CLS embedding. Then we apply the class-attention module, which works much like the SA module but only updates the CLS token. Finally, the FFN is applied only to the CLS token, since the patch embeddings are frozen.
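A rough PyTorch sketch of a single class-attention layer (LayerScale omitted; the layer and argument names are mine, not the official CaiT code):

```python
import torch
import torch.nn as nn

class ClassAttentionLayer(nn.Module):
    """Only the CLS token is updated; the frozen patch tokens act as keys/values."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, cls, patches):
        # query = CLS token only; keys/values = CLS token + (frozen) patch tokens
        z = self.norm1(torch.cat([cls, patches], dim=1))
        attn_out, _ = self.attn(query=z[:, :1], key=z, value=z)
        cls = cls + attn_out                    # residual update of the CLS token only
        cls = cls + self.ffn(self.norm2(cls))   # FFN applied only to the CLS token
        return cls
```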

paper

Note that LayerScale is not applied here. Also, a + b refers to the number of self-attention layers plus the number of class-attention layers. We see that if we simply insert the CLS token at deeper layers, we achieve slightly better performance. We also see that the proposed CaiT consistently achieves strong performance.


Some Last Words

I am sorry I could not include all the details and experiments. I tried to keep this compact and cover more works on ViTs, so I encourage you to read the papers yourselves for the experiments and technical details. I hope you find this piece useful; I am just trying to save you time and get you familiar with these works. Thank you for taking the time to read this article. Have a great day (:
