Self-Supervised Learning: What Should Not Be Contrastive in Contrastive Learning

Ching (Chingis)
4 min read · Mar 16, 2022


You will find this work very insightful if you study contrastive learning. Unlike previous works in un-/self-supervised learning that propose learning augmentation-invariant representations, the authors stress the importance of preserving some style information (e.g., distinguishing red vs. yellow cars). They demonstrate that their style-variant framework outperforms some SOTA methods that learn invariant representations by a decent margin. I hope you find this work as useful as I do.

Self-Supervised Learning:

LOOC: LEAVE-ONE-OUT CONTRASTIVE LEARNING

Recent contrastive learning methods try to learn augmentation-invariant representations, where the transformations are generated using classic data augmentation techniques that correspond to common pretext tasks, e.g., randomizing color, texture, orientation and cropping. However, the inductive bias that arises from such frameworks is a double-edged sword: augmentation encourages invariance to a transformation, which can be beneficial in some downstream tasks and harmful in others. For example, a model can perform poorly when a downstream task violates the augmentation-invariance assumption (e.g., distinguishing red vs. yellow cars).

Adding rotation may help with view-independent aerial image recognition, but significantly downgrade the capacity of a network to solve tasks such as detecting which way is up in a photograph for a display application.

Thus, with no prior knowledge of the downstream invariances, the authors propose LOOC, a contrastive learning framework that learns representations capturing individual factors of variation.

The LOOC framework (figure from the paper)

Rather than projecting every view into a single embedding space that is invariant to all augmentations, LOOC projects views into several embedding spaces, each variant to exactly one augmentation and invariant to the rest. The image above shows several embedding spaces: rotation-variant (invariant to color), color-variant (invariant to rotation) and all-invariant (the common framework). Contrastive learning is then performed separately in each embedding space, 3 in the case above. Therefore, the representations in the general space (blue box) capture the information of all the augmentations, whereas the individual projection heads h extract the information relevant to their corresponding space.
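To make the multi-head design concrete, here is a minimal PyTorch-style sketch (not the authors' implementation; the names `backbone`, `feat_dim`, `proj_dim` and `num_spaces` are illustrative). With the 2-augmentation example above, `num_spaces` would be 3.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadProjector(nn.Module):
    """Sketch: a shared encoder followed by one projection head per embedding
    space (n augmentation-variant spaces + 1 all-invariant space)."""

    def __init__(self, backbone: nn.Module, feat_dim: int, proj_dim: int, num_spaces: int):
        super().__init__()
        self.backbone = backbone  # e.g. a ResNet trunk producing feat_dim features
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, feat_dim),
                          nn.ReLU(),
                          nn.Linear(feat_dim, proj_dim))
            for _ in range(num_spaces)
        ])

    def forward(self, x):
        feat = self.backbone(x)  # general representation (the "blue box")
        # One L2-normalized embedding per space; each is used in its own contrastive loss.
        return feat, [F.normalize(h(feat), dim=1) for h in self.heads]
```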

View Generation

I will summarize the process as follows:

1. Get a reference image x and a list of n augmentations.
2. Get the query view q and the first key view k_0 by independently augmenting x twice.
3. For each i-th augmentation in the list, apply it to the query view to obtain an additional key view k_i. Each additional key view is thus one augmentation ahead of the query, differing from it only in that augmentation.

In the image above, q and k_0 are independently augmented; q and k_1 share the same rotation angle but differ in color jittering; and q and k_2 have different rotation angles.

Note that the image uses a list of 2 augmentations, so there are 2 + 1 = 3 embedding spaces and 3 corresponding projection heads h.
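A rough sketch of this view-generation recipe in torchvision-style code (the augmentation choices and parameters here are illustrative, not the paper's exact settings):

```python
import torchvision.transforms as T

# Hypothetical augmentation list; the example above uses color jittering and rotation.
aug_list = [
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.RandomRotation(degrees=180),
]
base_aug = T.Compose([T.RandomResizedCrop(224), *aug_list])  # full augmentation pipeline

def generate_views(x):
    """Sketch of the view-generation recipe summarized above."""
    q = base_aug(x)        # query: one independent sample of all augmentations
    keys = [base_aug(x)]   # k_0: a second, independent sample of all augmentations
    for aug in aug_list:
        # k_i: the i-th augmentation re-applied to the query, so this key
        # differs from q only in that single augmentation factor
        keys.append(aug(q))
    return q, keys
```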

Contrastive Embedding Space

The contrastive loss is computed separately in each embedding space (see the loss equations in the paper), where z denotes the embedding of a view in that space.

The total loss is the average contrastive loss across all (n + 1) embedding spaces: n for the augmentations in the list, plus 1 for the all-invariant space.
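A sketch of the objective, assuming a MoCo-style InfoNCE loss in each space (which views act as positives and negatives per space follows the assignment described in the paper; only the per-space loss and the averaging are shown here):

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, k_neg, tau=0.07):
    """Standard InfoNCE loss: one positive key per query plus a bank of negatives."""
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)  # (B, 1) positive logits
    l_neg = q @ k_neg.t()                         # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive = index 0
    return F.cross_entropy(logits, labels)

def total_loss(per_space_losses):
    """Average the contrastive loss over all (n + 1) embedding spaces."""
    return torch.stack(per_space_losses).mean()
```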

Learnt representations. The representation for downstream tasks can be taken from the general embedding space (blue box), or from the concatenation of all embedding sub-spaces (referred to as LOOC++).
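Continuing the hypothetical multi-head sketch from earlier (where the model returns the general feature plus a list of sub-space embeddings), the two options can be read off like this:

```python
import torch

def downstream_features(feat, space_embs, use_loocpp=False):
    """Pick the downstream representation: the general embedding (LOOC),
    or the concatenation of all embedding sub-spaces (LOOC++)."""
    if use_loocpp:
        return torch.cat(space_embs, dim=1)  # concatenate every sub-space embedding
    return feat                              # general embedding space (blue box)
```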

Some Experiments

Linear-classification results on iNat-1k, CUB-200 and Flowers-102 (table from the paper)

The evaluation follows the linear classification protocol: a supervised linear classifier is trained on frozen features. We see that LOOC outperforms MoCo by a decent margin, even with the same augmentation policy. The comparison shows that LOOC preserves color information better. Rotation augmentation also boosts performance on iNat-1k and Flowers-102, while yielding smaller improvements on CUB-200, which supports the intuition that some categories benefit from rotation-invariant representations while others do not. Performance is further boosted by using both augmentations, demonstrating the effectiveness of simultaneously learning information w.r.t. multiple augmentations.
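For readers unfamiliar with the protocol, a generic linear-probe loop looks roughly like this (a sketch, not the paper's exact training recipe; `encoder` is the frozen pretrained backbone and `loader` iterates over the labelled downstream dataset):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(encoder, loader, feat_dim, num_classes, lr=0.1, epochs=100):
    """Linear evaluation: freeze the pretrained encoder and train only a
    linear classifier on top of its frozen features."""
    for p in encoder.parameters():
        p.requires_grad = False
    encoder.eval()

    classifier = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=lr, momentum=0.9)

    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = encoder(images)  # frozen features, no gradient to the encoder
            loss = F.cross_entropy(classifier(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```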

Top retrievals of query embeddings for MoCo and LOOC (figure from the paper)

The image above shows the top retrievals for query embeddings. Since MoCo is trained to produce augmentation-invariant representations, its top retrievals may not preserve style information, e.g., rotation angle or color, which can be crucial for some tasks (e.g., search engines). Meanwhile, LOOC preserves both color and rotation information.
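Retrieval itself is just nearest-neighbour search in an embedding space; a minimal sketch (names are illustrative), where the search could be run in, say, the rotation-variant space if rotation matters for the application:

```python
import torch
import torch.nn.functional as F

def top_k_retrieval(query_emb, gallery_embs, k=5):
    """Rank gallery images by cosine similarity to a single query embedding."""
    q = F.normalize(query_emb, dim=-1)     # (D,)
    g = F.normalize(gallery_embs, dim=-1)  # (N, D)
    sims = g @ q                           # cosine similarity of each gallery item
    return sims.topk(k).indices            # indices of the k most similar images
```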

Some Last Words

I definitely have not included all the insights described in the experiment section; please take your time to study it yourself. I believe it is an interesting work that gives important insights into popular contrastive learning methods. Unlike ReLIC (described in my previous article), it demonstrates that some style information is crucial in some cases, while invariant representations remain important in others. Thank you for your time reading this piece!


Ching (Chingis)

I am a passionate student. I enjoy studying and sharing my knowledge. Follow me/Connect with me and join my journey.