Self-Supervised Learning: Representation Learning via Invariant Causal Mechanisms

Ching (Chingis)
4 min read · Mar 3, 2022

Hello! I haven’t been active lately, but I’ve found some free time to put together this piece. I’ve come across more interesting works in Self-Supervised Learning, so you can expect more articles on that from me. This work comes from DeepMind, and I found it very insightful. If you’ve just started studying Contrastive Learning, this paper should give you some interesting insights. Anyway, I hope you find it useful.

REPRESENTATION LEARNING VIA INVARIANT CAUSAL MECHANISMS (ReLIC)

ReLIC was proposed by Mitrovic et al. in 2020.

Contrastive learning has been justified as maximizing a lower bound on the mutual information (MI) between representations. In this work, the authors instead hypothesize that the goal of self-supervised learning (SSL) is to learn representations that are invariant to style (i.e., to augmentations). Let’s understand their motivation using the following figure:

[Figure from the ReLIC paper: the assumed data-generating process]

The sub-figure on the left-hand side shows the generation process of an image X: an image X is generated by combining content information C and style information S. The target Y_i depends only on the content C (here i denotes some task i, not a category), whereas the style information S is independent of the target. Thus, the goal is to learn style-invariant representations, which capture C, in order to successfully predict our target Y_i. The formulation is below:

p^do(S = s_l)(Y_i | f(X)) = p^do(S = s_m)(Y_i | f(X)) for all styles s_l, s_m in S
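To make this data-generating picture a bit more concrete, here is a tiny toy sketch of my own (not from the paper), in which the target depends only on the content while the style is sampled independently and only affects the observed image:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_toy_data(num_samples=4):
    """Toy version of the assumed causal graph: C -> Y and (C, S) -> X, with S independent of Y."""
    content = rng.normal(size=num_samples)             # content C
    target = (content > 0).astype(int)                 # target Y depends only on C
    style = rng.uniform(-1.0, 1.0, size=num_samples)   # style S (e.g., color, crop), independent of Y
    image = np.stack([content, style], axis=1)         # the observed "image" X mixes content and style
    return image, target

# A representation f(X) that keeps only the content coordinate is style-invariant and
# still sufficient to predict Y; one that also uses the style coordinate would change
# under augmentations, i.e., under interventions on S.
images, targets = generate_toy_data()
```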

However, the authors state that simply relying on contrastive learning is not enough to obtain invariant representations, so we should explicitly enforce invariance under augmentations. Thus, they propose to use the Kullback-Leibler (KL) divergence alongside the contrastive learning loss. The total loss is given as follows:

L = − Σ_i log [ exp(φ(f(x_i^a_lk), h(x_i^a_qt))) / Σ_m exp(φ(f(x_i^a_lk), h(x_m^a_qt))) ] + α · KL( p^do(a_lk)(Y | f(x_i)) || p^do(a_qt)(Y | f(x_i)) )

where f and h are the online encoder and the momentum encoder (whose parameters are updated as an exponential moving average of f’s), respectively. Next, φ(f(x_i), h(x_j)) = <g(f(x_i)), g(h(x_j))> is an inner product, where g is a fully connected network often called the critic (or projector). x^a means that an image x has been augmented using augmentation policy a. So, the first term is our contrastive learning loss, whereas the latter is the KL divergence.

where p^do(a_lk) is defined as:

p^do(a_lk)(Y = j | f(x_i)) = exp(φ(f(x_i^a_l), h(x_j^a_k))) / Σ_m exp(φ(f(x_i^a_l), h(x_m^a_k)))

P.S. This is essentially the same logit computation as in the contrastive loss (see the equation above), just normalized into a probability distribution over instances.
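To make the objective concrete, below is a minimal PyTorch sketch of how both terms could be computed for a batch of two augmented views. This is my own illustrative code, not the authors’ implementation: the function names, the L2 normalization, the temperature value, and the symmetric form of the KL term are assumptions on my part.

```python
import torch
import torch.nn.functional as F

def relic_loss(z_online_a, z_target_a, z_online_b, z_target_b,
               temperature=0.1, alpha=1.0):
    """Sketch of a ReLIC-style objective for one batch.

    z_online_*: projections g(f(x^a)) from the online encoder, shape (N, D)
    z_target_*: projections g(h(x^a)) from the momentum encoder, shape (N, D),
                typically detached so no gradients flow into h.
    The positive for sample i is the same image under the other augmentation.
    """
    # phi(f(x_i), h(x_j)) = <g(f(x_i)), g(h(x_j))>, here on L2-normalized features
    z_online_a = F.normalize(z_online_a, dim=1)
    z_online_b = F.normalize(z_online_b, dim=1)
    z_target_a = F.normalize(z_target_a, dim=1)
    z_target_b = F.normalize(z_target_b, dim=1)

    # logits[i, j] = similarity between online view i and target view j
    logits_ab = z_online_a @ z_target_b.t() / temperature
    logits_ba = z_online_b @ z_target_a.t() / temperature

    # contrastive term: the other view of the same image is the positive
    labels = torch.arange(logits_ab.size(0), device=logits_ab.device)
    contrastive = F.cross_entropy(logits_ab, labels) + F.cross_entropy(logits_ba, labels)

    # p^do(.): softmax over the same similarity logits, one distribution per augmentation pair
    log_p_ab = F.log_softmax(logits_ab, dim=1)
    log_p_ba = F.log_softmax(logits_ba, dim=1)

    # invariance penalty: (symmetrized) KL divergence between the two distributions
    invariance = F.kl_div(log_p_ab, log_p_ba, log_target=True, reduction="batchmean") \
               + F.kl_div(log_p_ba, log_p_ab, log_target=True, reduction="batchmean")

    return contrastive + alpha * invariance


@torch.no_grad()
def ema_update(online_encoder, target_encoder, momentum=0.99):
    """Momentum-encoder update mentioned above: h's weights follow an EMA of f's weights."""
    for p_online, p_target in zip(online_encoder.parameters(), target_encoder.parameters()):
        p_target.mul_(momentum).add_(p_online, alpha=1.0 - momentum)
```

During training, ema_update would be called after each optimizer step so that h slowly tracks f.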

To better understand contrastive learning and to motivate this proxy task, the authors rely on the causal concept of refinements: a task that is a more fine-grained version of another task. For example, instead of classifying cats vs. dogs, you classify the individual breeds of these animals. The most fine-grained task is, of course, discriminating every single instance in a dataset, which is exactly the goal of contrastive learning. Hence, they treat contrastive learning as a refinement task.

Let Y^R be the targets of a proxy task that is a refinement of all tasks in Y (take a look at the first figure again). If f(X) is an invariant representation for Y^R under all styles in S, then f(X) is also an invariant representation for all tasks in Y. Thus, by enforcing invariance (the KL term) on a refinement (the contrastive task), we learn representations that generalize to downstream tasks.

EXPERIMENTS

[Table: top-1 and top-5 accuracy on ImageNet under linear evaluation]

ReLIC was pretrained on the training set of the ImageNet ILSVRC-2012 dataset. In the table above, the authors report top-1 and top-5 accuracy on the ImageNet test set under the linear evaluation protocol, where the encoder is frozen and only a linear classification layer is trained on top. We see that ReLIC achieves performance comparable to the state of the art, such as BYOL.
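As an aside, the linear evaluation protocol mentioned above is simple to sketch. The snippet below is a simplified, hypothetical version: encoder, feature_dim, num_classes, and train_loader are placeholders, and the optimizer settings are illustrative rather than the paper’s.

```python
import torch
import torch.nn as nn

def linear_evaluation(encoder, feature_dim, num_classes, train_loader, epochs=10):
    """Freeze the pretrained encoder and train only a linear classifier on top."""
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad = False                      # the representation stays fixed

    classifier = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                features = encoder(images)           # frozen features
            loss = criterion(classifier(features), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```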

[Table: normalized scores across 57 Atari games]

The authors also test ReLIC on reinforcement learning. The table above shows normalized scores over 57 Atari games. We see that ReLIC achieves superior performance compared to previous state-of-the-art methods. Thus, ReLIC generalizes well across different downstream tasks.


Some Last Words

We see that the ReLIC framework is simple yet achieves great performance. I think it is a technically well-written work. Unfortunately, I did not include all of the math and theory, so I encourage you to check out the original paper. I think it gives some good theoretical insight into why contrastive learning is so popular in SSL. Anyway, I will keep working on SSL, and you can expect more articles from me on this field very soon. Thank you for taking the time to read my article!

