References: Self-supervised vision

Source material

This lesson follows Stanford CS231n’s treatment of self-supervised learning (Lecture 12), opening Phase 3 of the Track 16 arc.

Course: Stanford CS231n, “Deep Learning for Computer Vision”
Instructors: Fei-Fei Li, Ehsan Adeli, and Justin Johnson (Stanford University)
Course site: cs231n.stanford.edu
This lesson maps to: Lecture 12 (Self-supervised Learning).

Attribution (Clawdemy-authored): Stanford CS231n: Deep Learning for Computer Vision, Fei-Fei Li, Ehsan Adeli, and Justin Johnson, Stanford University (cs231n.stanford.edu). CS231n does not publish a required citation string; this is the attribution Clawdemy uses.

A note on access and license

The current term’s lecture recordings are posted on Canvas for enrolled Stanford students. Recordings from previous years are publicly available on YouTube under YouTube’s standard license; Clawdemy links out rather than embedding or rehosting. The course notes (cs231n.github.io) and site are Stanford’s. No Creative Commons license is published for the lectures, so we treat them as link-only references.

Primary papers (cited by name and venue)

Pretext tasks

Relative position. Doersch, Gupta, Efros, “Unsupervised Visual Representation Learning by Context Prediction” (ICCV 2015).
Rotation prediction. Gidaris, Singh, Komodakis, “Unsupervised Representation Learning by Predicting Image Rotations” (ICLR 2018).
Jigsaw puzzles. Noroozi, Favaro, “Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles” (ECCV 2016).
Colorization. Zhang, Isola, Efros, “Colorful Image Colorization” (ECCV 2016).
Inpainting (Context Encoders). Pathak, Krähenbühl, Donahue, Darrell, Efros, “Context Encoders: Feature Learning by Inpainting” (CVPR 2016).

Contrastive learning

SimCLR. Chen, Kornblith, Norouzi, Hinton, “A Simple Framework for Contrastive Learning of Visual Representations” (ICML 2020). The canonical contrastive instantiation.
MoCo (v1/v2/v3). He, Fan, Wu, Xie, Girshick, “Momentum Contrast for Unsupervised Visual Representation Learning” (CVPR 2020). The follow-up v2 and v3 papers refined the recipe; v3 extended it to ViT backbones.
BYOL. Grill et al., “Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning” (NeurIPS 2020). The surprising “no negatives” result.

Masked image modeling and self-distillation

MAE. He, Chen, Xie, Li, Dollár, Girshick, “Masked Autoencoders Are Scalable Vision Learners” (CVPR 2022). The 75 percent masking ratio combined with an asymmetric encoder-decoder design.
DINO. Caron, Touvron, Misra, Jégou, Mairal, Bojanowski, Joulin, “Emerging Properties in Self-Supervised Vision Transformers” (ICCV 2021). Self-distillation with no labels.
DINOv2. Oquab et al., “DINOv2: Learning Robust Visual Features without Supervision” (TMLR 2024 / arXiv 2023). Strong general-purpose vision encoder commonly used frozen.

Further study

CLIP. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision” (ICML 2021). Joint image-text contrastive training; the canonical bridge from self-supervised vision into vision-language. Track 5 (AI Foundations) covers it in depth, and a later Track 16 lesson (lesson 14, vision-and-language) will return to it.
Production self-supervised pipelines. Facebook AI’s VISSL framework and OpenSelfSup (now MMSelfSup, from OpenMMLab) implement most of the methods above behind consistent APIs; both are reasonable starting points for running these methods rather than reimplementing them.
Survey perspective. Several recent self-supervised-learning surveys give a broader map of the field beyond the methods named here.

How we use this source

Clawdemy follows the ordering of CS231n’s Lecture 12 (pretext-task history → contrastive learning → masked image modeling) and cites the canonical papers by name and venue. The cosine-similarity worked examples are Clawdemy-authored and computed with the standard cosine formula: in the lesson body, cos([1,0], [0.9,0.4]) ≈ 0.914 and cos([1,0], [-0.5,0.8]) ≈ -0.530; in the practice material, cos([2,1], [3,2]) ≈ 0.992 and cos([2,1], [-2,1]) = -0.600. The “pre-train then transfer” workflow framing and the linear-probe-vs-fine-tune distinction reflect the practitioner consensus on how self-supervised models are actually used. We do not reproduce CS231n’s slides, figures, problem sets, or lecture text. Full attribution policy: see Doc/attribution-policy.md.
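
For readers who want to re-derive the four cosine-similarity values quoted above, here is a minimal sketch in plain Python. It assumes the standard definition (dot product divided by the product of the Euclidean norms); the function name cosine_similarity and the pairs list are illustrative choices for this note, not taken from the lesson or from CS231n.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# The four vector pairs quoted in the paragraph above
# (two from the lesson body, two from the practice material).
pairs = [
    ([1, 0], [0.9, 0.4]),
    ([1, 0], [-0.5, 0.8]),
    ([2, 1], [3, 2]),
    ([2, 1], [-2, 1]),
]

# Round to three decimals to match the precision used in the lesson.
results = [round(cosine_similarity(u, v), 3) for u, v in pairs]
print(results)  # [0.914, -0.53, 0.992, -0.6]
```

The last pair is exact rather than approximate because both vectors have norm √5, so the denominator is 5 and the result is -3/5.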