
Cheatsheet: Self-supervised vision

| Element | What it is |
| --- | --- |
| Pretext task | A task whose labels are constructed from the data itself, with no human annotation |
| What matters | NOT solving the pretext task; the encoder features it leaves behind, which transfer to real tasks |
| Workflow | Pre-train on huge unlabeled data (expensive, once), then transfer to downstream tasks (cheap, repeated per task) |

| Method | Pretext task |
| --- | --- |
| Relative position (Doersch 2015) | Predict the spatial relationship of two image patches |
| Rotation (Gidaris 2018) | Predict 0/90/180/270 degree rotation |
| Jigsaw (Noroozi 2016) | Predict the original permutation of shuffled patches |
| Colorization (Zhang 2016) | Predict color from grayscale |
| Inpainting (Pathak 2016) | Predict the missing pixels of a masked region |

Each of these learned useful features, but none on its own closed the gap with supervised pre-training.
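
To make "labels constructed from the data itself" concrete, here is a minimal sketch of how the rotation task could build its training examples. The helper names (`rotate90`, `make_rotation_examples`) and the tiny 2x2 "image" are illustrative assumptions, not taken from any of the cited papers.

```python
def rotate90(grid):
    """Rotate a 2-D list-of-lists image 90 degrees counter-clockwise."""
    # Transpose the grid, then reverse the row order.
    return [list(row) for row in zip(*grid)][::-1]


def make_rotation_examples(image):
    """Return (rotated_image, label) pairs; label k means k * 90 degrees."""
    examples = []
    current = [list(row) for row in image]
    for k in range(4):
        examples.append((current, k))  # the label comes for free from k
        current = rotate90(current)
    return examples


# Hypothetical 2x2 "image"; a real pipeline would use full-size images.
image = [[1, 2],
         [3, 4]]
examples = make_rotation_examples(image)
for rotated, label in examples:
    print(label * 90, rotated)
```

No human ever annotates these examples: the rotation index `k` is the label, and a classifier trained to predict it has to pick up on object orientation cues along the way.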

| Method | Headline |
| --- | --- |
| SimCLR (Chen 2020) | Two augmentations = positive pair; other images in batch = negatives; NT-Xent loss; needs large batches |
| MoCo (He 2019; later v2, v3) | Momentum-updated encoder + queue of negatives; scales beyond single-batch limits |
| BYOL (Grill 2020) | NO explicit negatives; two networks (online + momentum target); online predicts target |
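
The "momentum-updated" encoder in MoCo and the "momentum target" in BYOL both refer to an exponential moving average (EMA) of the trained network's weights. A minimal sketch, assuming weights are flat lists of floats and using a hypothetical `ema_update` helper:

```python
def ema_update(target, online, momentum=0.99):
    """Return new target weights: momentum * target + (1 - momentum) * online."""
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target, online)]


# Hypothetical weights; real encoders have millions of parameters.
online = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]

# momentum=0.9 is chosen only so the drift is visible after a few steps.
for step in range(3):
    target = ema_update(target, online, momentum=0.9)
print(target)
```

The target network receives no gradients; it only drifts slowly toward the online weights. In practice the momentum is set much closer to 1 than the 0.9 used here, so the target changes very slowly.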

| Pair | Vectors | Cosine |
| --- | --- | --- |
| Positive (a, a⁺) | [1, 0] and [0.9, 0.4] | 0.9 / sqrt(0.97) ≈ 0.914 |
| Negative (a, b) | [1, 0] and [-0.5, 0.8] | -0.5 / sqrt(0.89) ≈ -0.530 |

A contrastive loss pushes the positive pair's cosine higher and the negative pair's cosine lower, repeated across thousands of pairs.
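
The two cosines in the table can be checked in a few lines, and the same vectors can be plugged into a single-anchor version of an NT-Xent-style loss. This is only a sketch: the function names and the temperature of 0.5 are assumptions for illustration, and a real implementation works on whole batches of embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def nt_xent_anchor(anchor, positive, negatives, temperature=0.5):
    """Loss for one anchor: -log of the positive's share of the softmax."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


a = [1.0, 0.0]
a_pos = [0.9, 0.4]
b = [-0.5, 0.8]

print(round(cosine(a, a_pos), 3))  # 0.914
print(round(cosine(a, b), 3))      # -0.53
loss = nt_xent_anchor(a, a_pos, [b])
print(loss)
```

The loss shrinks when the positive's cosine rises or a negative's cosine falls, which is the pull/push behavior described above.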

| Method | Recipe |
| --- | --- |
| MAE (He 2022) | Mask 75% of patches; heavy encoder sees only the visible 25%; lightweight decoder reconstructs the masked patches. Computationally efficient + strong features |
| DINO / DINOv2 (Caron 2021; Oquab 2023) | Self-distillation: student predicts teacher (momentum copy of itself); strong general-purpose features; DINOv2 is a widely used go-to encoder |
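
The MAE masking step can be sketched as a random split of patch indices. The function name `random_mask`, the fixed seed, and the 196-patch count (a 14x14 grid) are assumptions for illustration only.

```python
import random


def random_mask(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into (visible, masked) by random shuffling."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_visible = int(round(num_patches * (1.0 - mask_ratio)))
    visible = sorted(indices[:num_visible])   # fed to the heavy encoder
    masked = sorted(indices[num_visible:])    # reconstructed by the decoder
    return visible, masked


# Assumed patch count: a 14x14 grid of patches.
visible, masked = random_mask(num_patches=196)
print(len(visible), len(masked))  # 49 147
```

Because the encoder only processes the visible quarter of the patches, most of the compute per image is skipped, which is where the efficiency noted in the table comes from.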

| Step | What | Cost |
| --- | --- | --- |
| Pre-train | Train encoder on huge unlabeled data | Expensive (days/weeks on many GPUs); happens ONCE |
| Linear probe | Freeze encoder; train one linear classifier on top | Cheap; fair encoder comparison |
| Fine-tune | Train whole network with small LR on encoder | Costs more than a linear probe; best task accuracy |
| Amortization | One pre-trained encoder serves many downstream tasks | Pre-training cost spread across all downstream uses |
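
A linear probe can be sketched in plain Python: the encoder stays fixed and only a linear head is trained on its features. Everything here is a toy assumption, including the 2x2 `ENCODER_W` standing in for a pre-trained encoder, the four labeled points, and the learning rate.

```python
import math

# Hypothetical stand-in for a pre-trained encoder: a fixed 2x2 linear map + ReLU.
ENCODER_W = [[1.0, 1.0], [1.0, -1.0]]


def encode(x):
    """Frozen encoder: its weights are never updated during probing."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in ENCODER_W]


def mean_loss(head_w, head_b, data):
    """Mean binary cross-entropy of the linear head on frozen features."""
    total = 0.0
    for x, y in data:
        z = sum(w * f for w, f in zip(head_w, encode(x))) + head_b
        total += math.log1p(math.exp(-z if y == 1 else z))
    return total / len(data)


def train_linear_probe(data, lr=0.1, epochs=200):
    """SGD on the head only; the encoder is used purely as a feature extractor."""
    head_w, head_b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            feats = encode(x)
            z = sum(w * f for w, f in zip(head_w, feats)) + head_b
            grad = 1.0 / (1.0 + math.exp(-z)) - y  # d(BCE)/d(logit)
            head_w = [w - lr * grad * f for w, f in zip(head_w, feats)]
            head_b -= lr * grad
    return head_w, head_b


# Toy labeled set: label 1 when the first coordinate exceeds the second.
data = [([2.0, 1.0], 1), ([3.0, 0.0], 1), ([1.0, 2.0], 0), ([0.0, 3.0], 0)]
initial_loss = mean_loss([0.0, 0.0], 0.0, data)
head_w, head_b = train_linear_probe(data)
final_loss = mean_loss(head_w, head_b, data)
print(initial_loss, final_loss)
```

Fine-tuning would differ in one respect: `ENCODER_W` would also receive gradient updates, typically with a smaller learning rate than the head.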

| Situation | Method choice |
| --- | --- |
| Abundant unlabeled, scarce labeled | Self-supervised pre-train + fine-tune (the canonical case) |
| Comparing encoder quality fairly | Linear probe (frozen encoder, one linear layer) |
| Production task accuracy | Fine-tune (whole network, small LR on encoder) |
| Small dataset, narrow domain | Pre-train on closer-domain unlabeled data; cross-domain transfer often weak |

| Pitfall | Reality |
| --- | --- |
| Self-supervised = unsupervised | Different: self-supervised learning constructs labels from the data and then runs supervised training. Classical unsupervised learning = clustering, density estimation, no labels of any kind |
| Skip pretext tasks because contrastive works | Pretext tasks are still useful on small datasets and in specific domains where contrastive augmentations don’t apply |
| Pre-train on any data, transfers to anything | Encoder learns the statistics of its pre-training distribution; cross-domain transfer is often limited (Internet photos → satellite images) |
| Linear probe = final task accuracy | Linear probing measures feature quality (no encoder updates); fine-tuning measures real-world task accuracy. They often diverge |

Self-supervised learning constructs labels from unlabeled data via a pretext task. The encoder is pre-trained once on huge unlabeled data, then fine-tuned cheaply for each downstream task. This is the engine behind modern general-purpose vision encoders and applications in label-scarce domains.