Self-supervised vision: cheatsheet

The core trick

Element	What it is
Pretext task	A task whose labels are constructed from the data itself, no human annotation
What matters	NOT how well the pretext task is solved, but the encoder features it leaves behind, which transfer to real tasks
Workflow	Pre-train on huge unlabeled data (expensive, once), then transfer to downstream tasks (cheap, repeated per task)

Pretext-task history (2014-2018)

Method	Pretext task
Relative position (Doersch 2015)	Predict spatial relationship of two image patches
Rotation (Gidaris 2018)	Predict 0/90/180/270 degree rotation
Jigsaw (Noroozi 2016)	Predict original permutation of shuffled patches
Colorization (Zhang 2016)	Predict color from grayscale
Inpainting (Pathak 2016)	Predict missing pixels of a masked region

Each learned transferable features, but none on its own closed the gap with supervised pre-training.
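As a sketch of how a pretext task builds labels from the data itself, here is the rotation task in pure Python. The image is a small list of rows, and the helper names `rotate90` and `make_rotation_example` are hypothetical, chosen for illustration only.

```python
import random

def rotate90(image, k):
    """Rotate a 2D list-of-rows image counterclockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counterclockwise quarter turn.
        image = [list(row) for row in zip(*image)][::-1]
    return [list(row) for row in image]

def make_rotation_example(image, rng):
    """Return (rotated_image, label). The label is the rotation index 0..3,
    i.e. 0/90/180/270 degrees, so no human annotation is needed."""
    k = rng.randrange(4)
    return rotate90(image, k), k

image = [[1, 2],
         [3, 4]]
rotated, label = make_rotation_example(image, random.Random(0))
```

A classifier trained to predict `label` from `rotated` has to pick up on object orientation, which is the kind of feature that transfers to downstream tasks.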

Contrastive learning (2019 onward)

Method	Headline
SimCLR (Chen 2020)	Two augmentations of the same image = positive pair; other images in the batch = negatives; NT-Xent loss; needs large batches
MoCo (He 2019, v2, v3)	Momentum-updated encoder + queue of negatives; scales beyond single-batch limits
BYOL (Grill 2020)	NO explicit negatives; two networks (online + momentum target); online network is trained to predict the target network's output
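MoCo and BYOL both keep a second network whose weights trail the trained one through a momentum (exponential moving average) update. A minimal sketch, assuming weights stored as flat lists and an illustrative momentum of 0.99. The function name and the values are assumptions, not taken from either method's released code.

```python
def momentum_update(target, online, momentum=0.99):
    """Return new target weights: momentum * target + (1 - momentum) * online.

    The target network receives no gradients; it only drifts slowly
    toward the online network after each training step.
    """
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target, online)]

target = [0.0, 0.0]
online = [1.0, -1.0]          # pretend these came from a gradient step
target = momentum_update(target, online)
```

A momentum close to 1 makes the target change slowly, which keeps the prediction targets stable from step to step.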

Worked cosine similarity example

Pair	Vectors	Cosine
Positive `(a, a⁺)`	`[1,0]` and `[0.9,0.4]`	`0.9 / sqrt(0.97) ≈ 0.914`
Negative `(a, b)`	`[1,0]` and `[-0.5,0.8]`	`-0.5 / sqrt(0.89) ≈ -0.530`

Contrastive loss pulls positive cosine higher, pushes negative cosine lower; repeated across thousands of pairs.
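The two cosines in the table can be checked in a few lines, and the same values can be fed into a temperature-scaled softmax loss in the spirit of NT-Xent. This is a sketch for a single anchor. The temperature of 0.5 and the function names are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(pos_cos, neg_cosines, temperature=0.5):
    """-log( exp(pos/t) / (exp(pos/t) + sum_i exp(neg_i/t)) ) for one anchor."""
    pos_logit = pos_cos / temperature
    logits = [pos_logit] + [c / temperature for c in neg_cosines]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - pos_logit

a, a_pos, b = [1.0, 0.0], [0.9, 0.4], [-0.5, 0.8]
pos = cosine(a, a_pos)   # about 0.914, as in the table
neg = cosine(a, b)       # about -0.530, as in the table
loss = contrastive_loss(pos, [neg])
```

The loss shrinks when the positive cosine rises or a negative cosine falls, which is the pull and push described above.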

Masked image modeling and self-distillation (2021 onward)

Method	Recipe
MAE (He 2022)	Mask 75% of patches; heavy encoder sees only the visible 25%; lightweight decoder reconstructs the masked patches. Compute-efficient pre-training + strong features
DINO (Caron 2021) / DINOv2 (Oquab 2023)	Self-distillation: student predicts the output of a teacher (a momentum copy of the student); strong general-purpose features; DINOv2 is a widely used go-to encoder
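The MAE masking step can be sketched as a random split of patch indices. The 196-patch count (a 14 x 14 grid) and the function name are illustrative assumptions. Only the 75% ratio comes from the table above.

```python
import random

def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into (visible, masked); mask_ratio of them are masked."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(round(num_patches * mask_ratio))
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked

visible, masked = random_patch_mask(196)
# The encoder would process only `visible` (a quarter of the patches);
# the decoder would be asked to reconstruct the patches in `masked`.
```

Because the heavy encoder touches only a quarter of the patches, most of the per-image compute is skipped, which is where the efficiency comes from.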

Pre-train then transfer

Step	What	Cost
Pre-train	Train encoder on huge unlabeled data	Expensive (days/weeks on many GPUs); happens ONCE
Linear probe	Freeze encoder; train one linear classifier on top	Cheap; fair encoder comparison
Fine-tune	Train whole network with small LR on encoder	Costlier than linear probing; usually best task accuracy
Amortization	One pre-trained encoder serves many downstream tasks	Pre-training cost spread across all downstream uses
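A linear probe can be sketched with a toy stand-in for the encoder. Here `frozen_encoder` is a hypothetical fixed function (not a real pre-trained model), the four data points are made up, and the probe is a logistic-regression layer trained by plain SGD.

```python
import math

def frozen_encoder(x):
    """Stand-in for a pre-trained encoder; it is never updated during probing."""
    return [x[0] + x[1], x[0] - x[1]]

def train_linear_probe(inputs, labels, lr=0.1, epochs=200):
    """Train only a linear layer (w, b) on top of the frozen features."""
    feats = [frozen_encoder(x) for x in inputs]  # computed once: encoder is frozen
    w, b = [0.0] * len(feats[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))       # sigmoid
            g = p - y                            # gradient of log-loss w.r.t. z
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    f = frozen_encoder(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0

inputs = [[2, 1], [3, 1], [1, 2], [1, 3]]        # toy data: class 1 when x[0] > x[1]
labels = [1, 1, 0, 0]
w, b = train_linear_probe(inputs, labels)
```

Fine-tuning would differ in one respect: the encoder's own parameters would also receive gradient updates, typically with a smaller learning rate than the new head.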

When to use what

Situation	Method choice
Abundant unlabeled, scarce labeled	Self-supervised pre-train + fine-tune (the canonical case)
Comparing encoder quality fairly	Linear probe (frozen encoder, one linear layer)
Production task accuracy	Fine-tune (whole network, small LR on encoder)
Small dataset, narrow domain	Pre-train on closer-domain unlabeled data; cross-domain transfer often weak

Pitfalls

Pitfall	Reality
Self-supervised = unsupervised	Not the same: self-supervised learning constructs labels from the data, then trains with a supervised objective. Classical unsupervised learning (clustering, density estimation) uses no labels of any kind
Skip pretext tasks because contrastive works	Pretext tasks still useful in small datasets / specific domains where contrastive augmentations don’t apply
Pre-train on any data, transfers to anything	Encoder learns the statistics of its pre-training distribution; cross-domain transfer is often limited (e.g., Internet photos → satellite images)
Linear probe = final task accuracy	Linear probing measures feature quality (no encoder updates); fine-tuning measures real-world task accuracy. They often diverge

One-line takeaway

Self-supervised learning constructs labels from unlabeled data via a pretext task; pre-train an encoder once on huge unlabeled data, then transfer it cheaply (linear probe or fine-tune) to each downstream task; this is the engine behind modern general-purpose vision encoders and label-scarce domain applications.