| Element | What it is |
|---|---|
| Pretext task | A task whose labels are constructed from the data itself, no human annotation |
| What matters | NOT solving the pretext task; the encoder features it leaves behind, which transfer to real tasks |
| Workflow | Pre-train on huge unlabeled data (expensive, once), then transfer to downstream tasks (cheap, repeated per task) |

| Method | Pretext task |
|---|---|
| Relative position (Doersch 2015) | Predict spatial relationship of two image patches |
| Rotation (Gidaris 2018) | Predict 0/90/180/270 degree rotation |
| Jigsaw (Noroozi 2016) | Predict original permutation of shuffled patches |
| Colorization (Zhang 2016) | Predict color from grayscale |
| Inpainting (Pathak 2016) | Predict missing pixels of a masked region |

Each of these learned useful features, but none on its own closed the gap with supervised pre-training.
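
To make "labels constructed from the data itself" concrete, here is a minimal sketch of how a rotation pretext example could be built. It uses a tiny list-of-lists image and standard-library Python only; the helper names `rotate90` and `make_rotation_example` are hypothetical and not taken from any paper's code.

```python
import random

def rotate90(img):
    """Rotate a 2D list-of-lists image 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]

def make_rotation_example(img, rng):
    """Return (rotated_image, label); label k in {0, 1, 2, 3} means k * 90 degrees."""
    k = rng.randrange(4)
    out = img
    for _ in range(k):
        out = rotate90(out)
    return out, k

rng = random.Random(0)          # fixed seed, chosen only for reproducibility
image = [[1, 2],
         [3, 4]]                # toy 2x2 "image" (illustrative)
rotated, label = make_rotation_example(image, rng)
```

The label comes for free from the transformation that was applied, so no human annotation is involved; a classifier trained to predict `label` from `rotated` is the pretext task.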

| Method | Headline |
|---|---|
| SimCLR (Chen 2020) | Two augmentations = positive pair; other images in batch = negatives; NT-Xent loss; needs large batches |
| MoCo (He 2019, v2, v3) | Momentum-updated encoder + queue of negatives; scales beyond single-batch limits |
| BYOL (Grill 2020) | NO explicit negatives; two networks (online + momentum target); online predicts target |
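
The "momentum-updated" encoder in MoCo and the "momentum target" in BYOL both refer to an exponential moving average of the trained network's weights. A minimal sketch under simplifying assumptions: weights are a flat list of floats, and the momentum value is illustrative rather than the value used in either paper. The name `ema_update` is hypothetical.

```python
def ema_update(target, online, momentum=0.99):
    """Return new target weights: momentum * target + (1 - momentum) * online."""
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target, online)]

online_weights = [1.0, -2.0, 0.5]   # would be updated by gradient descent (not shown)
target_weights = [0.0, 0.0, 0.0]    # receives no gradients, only EMA updates

for _ in range(3):
    target_weights = ema_update(target_weights, online_weights, momentum=0.9)
```

Because the target moves slowly toward the online weights, it provides a more stable signal than a network that changes at every gradient step.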

| Pair | Vectors | Cosine |
|---|---|---|
| Positive (a, a⁺) | [1,0] and [0.9,0.4] | 0.9 / sqrt(0.97) ≈ 0.914 |
| Negative (a, b) | [1,0] and [-0.5,0.8] | -0.5 / sqrt(0.89) ≈ -0.530 |

A contrastive loss pulls the positive pair's cosine similarity higher and pushes the negative pair's lower, repeated across thousands of pairs.
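
The cosine values in the table can be checked with a few lines of standard-library Python. The sketch also includes a simplified single-anchor form of an NT-Xent-style loss; the function name `nt_xent_single` and the temperature of 0.5 are assumptions made for illustration, and a real implementation would operate on whole batches.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def nt_xent_single(anchor, positive, negatives, temperature=0.5):
    """Loss for one anchor: -log( e^(s_pos/t) / (e^(s_pos/t) + sum_neg e^(s_neg/t)) )."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

a, a_pos, b = [1.0, 0.0], [0.9, 0.4], [-0.5, 0.8]
pos_cos = cosine(a, a_pos)   # about 0.914
neg_cos = cosine(a, b)       # about -0.530
loss = nt_xent_single(a, a_pos, [b])
```

Raising `pos_cos` or lowering `neg_cos` makes the fraction inside the logarithm approach 1, which drives the loss toward 0.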

| Method | Recipe |
|---|---|
| MAE (He 2022) | Mask 75% of patches; heavy encoder sees only visible 25%; lightweight decoder reconstructs masked. Computationally efficient + strong features |
| DINO (Caron 2021) / DINOv2 (Oquab 2023) | Self-distillation: student predicts teacher (a momentum copy of itself); strong general-purpose features; DINOv2 is a widely used off-the-shelf encoder |
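
The MAE masking step can be sketched as a random split of patch indices. The example assumes 196 patches (a 14 x 14 grid, as for a 224-pixel image cut into 16-pixel patches) and uniform random sampling; the helper name `random_mask` is hypothetical.

```python
import random

def random_mask(num_patches, mask_ratio, rng):
    """Split patch indices into (visible, masked) by uniform random sampling."""
    num_masked = int(num_patches * mask_ratio)
    shuffled = rng.sample(range(num_patches), num_patches)
    masked = sorted(shuffled[:num_masked])
    visible = sorted(shuffled[num_masked:])
    return visible, masked

rng = random.Random(0)   # fixed seed, chosen only for reproducibility
visible, masked = random_mask(num_patches=196, mask_ratio=0.75, rng=rng)
# Only the `visible` patches would be fed to the encoder;
# the decoder would be asked to reconstruct the `masked` ones.
```

With a 75% ratio the encoder processes 49 of 196 patches, which is where the compute savings come from.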

| Step | What | Cost |
|---|---|---|
| Pre-train | Train encoder on huge unlabeled data | Expensive (days/weeks on many GPUs); happens ONCE |
| Linear probe | Freeze encoder; train one linear classifier on top | Cheap; fair encoder comparison |
| Fine-tune | Train whole network with small LR on encoder | Costlier than a linear probe; usually the best task accuracy |
| Amortization | One pre-trained encoder serves many downstream tasks | Pre-training cost spread across all downstream uses |
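
A linear probe can be illustrated without any deep learning library. In this toy sketch, `frozen_encoder` is a hypothetical stand-in for a pre-trained encoder (a fixed function that is never updated), the data and labels are synthetic, and the probe is logistic regression trained by plain gradient descent. The learning rate and epoch count are illustrative assumptions.

```python
import math
import random

def frozen_encoder(x):
    """Stand-in for a pre-trained encoder; fixed, never updated (hypothetical)."""
    return [x[0] + x[1], x[0] - x[1]]

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression (weights + bias) on frozen features."""
    w, b = [0.0] * len(features[0]), 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))          # predicted probability of class 1
            for i, fi in enumerate(f):
                grad_w[i] += (p - y) * fi / n
            grad_b += (p - y) / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, f):
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0

rng = random.Random(0)
inputs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(100)]
labels = [1 if x[0] + x[1] > 0 else 0 for x in inputs]   # synthetic downstream labels
features = [frozen_encoder(x) for x in inputs]            # encoder is run, never trained
w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, f) == y for f, y in zip(features, labels)) / len(labels)
```

Only `w` and `b` are learned, so the resulting accuracy reflects how linearly separable the frozen features already are. Fine-tuning would additionally update the encoder's own parameters, typically with a smaller learning rate.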

| Situation | Method choice |
|---|---|
| Abundant unlabeled, scarce labeled | Self-supervised pre-train + fine-tune (the canonical case) |
| Comparing encoder quality fairly | Linear probe (frozen encoder, one linear layer) |
| Production task accuracy | Fine-tune (whole network, small LR on encoder) |
| Small dataset, narrow domain | Pre-train on closer-domain unlabeled data; cross-domain transfer often weak |

| Pitfall | Reality |
|---|---|
| Self-supervised = unsupervised | Different. Self-supervised learning constructs labels from the data and then runs ordinary supervised training on them. Classical unsupervised learning (clustering, density estimation) uses no labels of any kind |
| Skip pretext tasks because contrastive works | Pretext tasks still useful in small datasets / specific domains where contrastive augmentations don’t apply |
| Pre-train on any data, transfers to anything | Encoder learns the statistics of its pre-training distribution; cross-domain transfer often limited (Internet photos → satellite images) |
| Linear probe = final task accuracy | Linear probing measures feature quality (no encoder updates); fine-tuning measures real-world task accuracy. They often diverge |

Self-supervised learning constructs labels from unlabeled data via a pretext task. An encoder is pre-trained once on huge unlabeled data, then fine-tuned cheaply for each downstream task. This is the engine behind modern general-purpose vision encoders and applications in label-scarce domains.