This article is purely for academic sharing. All images within the text are from the original paper (Cover image: Photo by Pietro Jeng on Unsplash)

Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020, November). A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 1597-1607). PMLR. https://arxiv.org/abs/2002.05709

How to learn image representations that transfer across tasks is currently a very popular research topic. Although this paper is a bit old, published in mid-2020, the contrastive learning ideas it popularized are also at the core of later models such as CLIP (Contrastive Language-Image Pretraining). Therefore, I thought it would be worthwhile to review this article, “A Simple Framework for Contrastive Learning of Visual Representations,” which has already garnered over 2600 citations on Google Scholar. However, the main purpose of this article is still for personal notes, so for more detailed information, readers are encouraged to refer to the original paper (link: https://arxiv.org/abs/2002.05709).

Method

Simply put, the authors learn visual representations without labels by combining self-supervised learning with a contrastive loss. The model learns to recognize which augmented views come from the same image, and the resulting representations can then be reused for different downstream tasks, giving a good pre-trained model. For the details, let’s first look at the framework architecture presented in the paper:

First, x represents an image. Each image passes through two different data augmentation functions (denoted here as t and t'), yielding xi and xj. For ease of explanation, let’s first consider the left branch with xi; the right branch with xj works the same way. xi goes through a ResNet, which serves as the encoder f(.). We take the feature vector hi after average pooling and before the final FC layer, and pass it through a projection head g(.) that maps hi into the latent space where the contrastive loss is applied. This g(.) consists of two weight layers with a ReLU activation in between, effectively a 2-layer MLP, and outputs zi. The right branch produces zj in the same way from the other augmentation. Next comes the main event: we use zi and zj to compute the contrastive loss. The authors call their loss NT-Xent (normalized temperature-scaled cross entropy loss), a contrastive loss of a form that had already appeared in earlier work. We can directly understand it by looking at the formula:
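For reference, the loss for one positive pair (i, j) can be written in LaTeX as follows, using the notation of these notes: sim(u, v) is cosine similarity, τ is the temperature, and the sum runs over all 2N projections in a batch of N images.

```latex
\ell_{i,j} = -\log
  \frac{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)}
       {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]}\,
        \exp\!\left(\mathrm{sim}(z_i, z_k)/\tau\right)},
\qquad
\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert\, \lVert v \rVert}
```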

The goal of NT-Xent is to make zi and zj generated from the same image as similar as possible, while pushing projections of other images as far away as possible. The numerator is the exponentiated cosine similarity between zi and zj from the same image, scaled by the temperature τ. The denominator sums the same quantity between zi and every other zk in the batch, which covers the positive zj as well as all the projections from different source images. Since there’s a negative sign before the log, minimizing the loss increases the numerator relative to the denominator, which has the effect of “making zi and zj generated from the same image closer, and pushing other unrelated images further away.” One small detail is that the bold 1 in the denominator is an indicator function, which equals 1 only when k is not equal to i, so zi is never compared with itself. And τ is the temperature parameter, which rescales the similarities before the softmax; in the paper it is a fixed hyperparameter rather than a learned one.

For more details, we can examine the pseudo-code in the paper. After reviewing it, you should have a complete understanding of the entire model’s architecture:

In summary, the training data is split into batches of size N. For each image k from 1 to N in a batch, we apply two different data augmentations and obtain z2k-1 and z2k, and si,j denotes the cosine similarity between zi and zj. The actual loss is then defined by the large L equation, which averages both l(2k-1, 2k) and l(2k, 2k-1) over the batch. Both orderings are needed because l(i, j) is not symmetric: it treats zi as the anchor, so the denominator of l(2k-1, 2k) only contains similarities measured from z2k-1. Finally, what is returned is the encoder network f(.); g(.) is only used for pre-training and is discarded afterwards. (The authors conducted detailed experiments on this choice, which will be discussed shortly.)
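The loss computation described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. It assumes the projections are already computed and stored as lists of floats, ordered so that consecutive entries come from the same image. The function names and the default `tau=0.5` are choices made for this example only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_loss(z, i, j, tau):
    """l(i, j): loss with z[i] as the anchor and z[j] as its positive."""
    numerator = math.exp(cosine_similarity(z[i], z[j]) / tau)
    # The indicator 1[k != i] removes only the anchor itself from the sum,
    # so the positive z[j] is still part of the denominator.
    denominator = sum(
        math.exp(cosine_similarity(z[i], z[k]) / tau)
        for k in range(len(z))
        if k != i
    )
    return -math.log(numerator / denominator)


def nt_xent(z, tau=0.5):
    """Average of l(a, b) + l(b, a) over all N positive pairs.

    z holds 2N projections; with 0-based indices, z[2k] and z[2k + 1]
    are the two views of image k. tau=0.5 is an arbitrary example value.
    """
    n_pairs = len(z) // 2
    total = 0.0
    for k in range(n_pairs):
        a, b = 2 * k, 2 * k + 1
        total += pair_loss(z, a, b, tau) + pair_loss(z, b, a, tau)
    return total / (2 * n_pairs)


# Toy projections: in `aligned`, each positive pair points the same way;
# in `shuffled`, the same vectors are paired with dissimilar partners.
aligned = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
shuffled = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]
print(nt_xent(aligned) < nt_xent(shuffled))  # True
```

The toy vectors show the intended behaviour: the loss is lower when the two views of each image are close to each other and far from the other images.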

Alright, that largely covers the entire architecture. Next, we’ll look at some experimental results and testing related to the architecture.

Discussion & Results

First, we haven’t yet discussed which data augmentations the authors used. The authors experimented with methods such as Crop, Flip, Cutout, Color distortion, Rotate, Gaussian noise, and Gaussian blur. They then performed a linear probe on the ImageNet dataset using the resulting feature network (i.e., the encoder f(.), without the projection head). The experimental results showed that using only one type of data augmentation did not yield very good performance. This can be seen in the figure below:

The authors provided an example: if data augmentation uses only cropping, serious problems can arise. If you plot pixel intensity histograms, you’ll find that different random crops of the same image have almost identical pixel intensity distributions. The network can then match two crops just by comparing color histograms, a shortcut that solves the contrastive task without learning useful features. This is why cropping needs to be composed with other augmentations, color distortion in particular. Ultimately, the authors selected three augmentation methods: random crop (with flip and resize), color distortion, and Gaussian blur.
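To illustrate what composing augmentations means, here is a toy sketch in plain Python that produces two views of one image. It is a simplification under several assumptions: the image is a small grayscale grid of floats in [0, 1], color distortion is reduced to a single random brightness factor, and the resize and Gaussian blur steps are left out. All function names and parameter values are made up for this example.

```python
import random


def random_crop(image, size, rng):
    """Cut a size x size window at a random position from a 2D list."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


def random_flip(image, rng):
    """Mirror the image horizontally with probability 0.5."""
    return [row[::-1] for row in image] if rng.random() < 0.5 else image


def brightness_jitter(image, strength, rng):
    """Toy stand-in for color distortion: one random scale for all pixels."""
    factor = 1.0 + rng.uniform(-strength, strength)
    return [[min(1.0, max(0.0, p * factor)) for p in row] for row in image]


def augment(image, rng, crop_size=4, strength=0.4):
    """Compose crop, then flip, then brightness jitter, with fresh random draws."""
    view = random_crop(image, crop_size, rng)
    view = random_flip(view, rng)
    return brightness_jitter(view, strength, rng)


rng = random.Random(0)
# 8x8 gradient image with intensities between 0 and 1.
image = [[(r * 8 + c) / 63.0 for c in range(8)] for r in range(8)]
# Two independent calls play the role of t and t' applied to the same x.
view_i, view_j = augment(image, rng), augment(image, rng)
print(len(view_i), len(view_i[0]))  # 4 4
```

Because the brightness factor is drawn separately for each view, two crops of the same image no longer share the same intensity histogram, which is the shortcut the composition is meant to remove.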

Second, the authors experimented with batch sizes ranging from 256 to 8192. They found that a larger batch size improves the model’s top-1 accuracy because it increases the number of negative examples available to each positive pair within a single batch. Supervised learning does not benefit from large batches to the same degree.
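The effect of batch size on the number of negatives is simple arithmetic: a batch of N images yields 2N views, and for a given anchor everything except itself and its positive counts as a negative, which leaves 2(N - 1). A short sketch (the function name is made up for this example):

```python
def negatives_per_anchor(batch_size):
    """Negatives seen by one anchor in a batch of `batch_size` images.

    Each image contributes 2 views, so there are 2N views in total;
    removing the anchor and its positive leaves 2(N - 1) negatives.
    """
    return 2 * (batch_size - 1)


for n in (256, 8192):
    print(n, negatives_per_anchor(n))  # 256 -> 510, 8192 -> 16382
```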

Third, and what I consider a very important point, is the role of the projection head and why the output of g(.) should be discarded. The authors noted that if you compare h = f(x) (the encoder output for an augmented image, before the projection head) with z = g(h) (after the projection head) by performing a linear probe on both, the former yields significantly higher accuracy than the latter. The authors hypothesize that the contrastive loss makes z invariant to the data transformations, so the projection head discards information, especially about which transformation was applied, and this is why z = g(h) should not be used for linear probing. To verify this hypothesis, the authors trained classifiers on top of h and on top of g(h) to predict which transformations each image had undergone. The experimental results indeed showed that g(h) performed worse at this task. However, I feel this is a question worth researching; for instance, perhaps the projection head loses more than just data transformation information. Are there other ways to design this projection head? These all seem like areas for further consideration.
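To make the distinction between h and z = g(h) concrete, here is a minimal sketch of a 2-layer MLP projection head of the form g(h) = W2 · ReLU(W1 · h), written in plain Python. The dimensions, the random initialization, and the hand-written vector h are placeholders chosen for this example; in the real model h comes from the ResNet encoder and the weights are learned.

```python
import random


def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def make_projection_head(in_dim, hidden_dim, out_dim, seed=0):
    """Build g(h) = W2 * relu(W1 * h) with small random weights."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
          for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
          for _ in range(out_dim)]

    def g(h):
        return linear(w2, relu(linear(w1, h)))

    return g


h = [0.5, -1.0, 2.0, 0.0]  # placeholder for the encoder output f(x)
g = make_projection_head(in_dim=4, hidden_dim=4, out_dim=2)
z = g(h)  # used only inside the contrastive loss during pre-training
print(len(h), len(z))  # 4 2
```

After pre-training, downstream tasks such as the linear probe take h as input, and g together with z is thrown away.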

Finally, there are some testing results that weren’t covered. In summary, the authors conducted semi-supervised learning experiments on ILSVRC-12 (fine-tuning with 1% and 10% of the labels), and SimCLR surpassed the previous SOTA accuracy. Furthermore, when the pre-trained ResNet was fine-tuned on the entire labeled ImageNet dataset, not only did it avoid catastrophic forgetting (the authors likely didn’t use regularization), but it also outperformed the same architecture trained directly from scratch. This result is quite interesting.

Additionally, the authors performed transfer learning on 12 other datasets using both linear probing and fine-tuning methods. SimCLR’s performance is shown in the table below, and it generally performed quite well.

Thank you for reading this far. I’ve actually omitted many details and experimental results, primarily hoping to highlight the key points and clarify my own thoughts. I hope those interested will find it enjoyable.

References

  1. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020, November). A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 1597-1607). PMLR.