Summary: Self-supervised vision

Phase 2’s classifiers, detectors, and segmenters all required labeled images. Labels are expensive; the world has far more unlabeled images than labeled ones. Self-supervised learning is the family of techniques that lets a model learn useful visual features from unlabeled images alone, by constructing pretext tasks whose labels come from the data itself. The pretext task does not matter in itself; what matters is that solving it forces the network to learn features that transfer well to real downstream tasks. The workflow is pre-train then transfer: pre-train an encoder once on huge unlabeled data, then fine-tune (or linear-probe) on each small labeled downstream task. This is the engine behind modern general-purpose vision encoders and a load-bearing piece of multimodal AI.
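
To make "labels come from the data itself" concrete, here is a minimal sketch of one pretext task, rotation prediction. It assumes a toy image stored as a list of rows; the helper names `rotate90` and `make_rotation_example` are illustrative and not taken from any library. The label is simply the number of 90-degree turns applied, so no human annotation is needed.

```python
import random

def rotate90(image):
    """Rotate a 2D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def make_rotation_example(image, rng):
    """Build one self-labeled training example.

    Returns (rotated_image, label), where label in {0, 1, 2, 3} is the
    number of 90-degree clockwise turns applied to the input image.
    """
    k = rng.randrange(4)
    rotated = [list(row) for row in image]
    for _ in range(k):
        rotated = rotate90(rotated)
    return rotated, k

rng = random.Random(0)       # fixed seed so the sketch is repeatable
image = [[1, 2],
         [3, 4]]             # toy 2x2 "image" (assumption for illustration)
rotated, label = make_rotation_example(image, rng)
print(rotated, label)
```

A real pipeline would feed `rotated` to the encoder and train a 4-way classifier on `label`; after pre-training, the classifier head is discarded and only the encoder is kept.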

  • The trick: construct labels from the data itself (pretext task), train normally, then transfer. The pretext task is a means; the durable artifact is the encoder it leaves behind.
  • Pretext-task wave (2014-2018): relative position of two patches, rotation prediction (0/90/180/270), jigsaw puzzles, colorization (grayscale → color), inpainting (predict masked region). Each learned useful features; none, on its own, closed the gap with supervised pretraining.
  • Contrastive learning (2020 onward). Two augmentations of the same image = positive pair (high cosine similarity wanted); other images = negative pairs (low similarity wanted). SimCLR (Chen et al. 2020) with NT-Xent loss is the canonical instantiation; MoCo (He et al. 2020; arXiv 2019) uses a momentum-updated encoder + queue of negatives for memory efficiency at scale; BYOL (Grill et al. 2020) trains without explicit negatives using a momentum-target setup. Together these closed most of the gap with supervised pretraining.
  • Worked similarity (body): for an anchor a=[1,0], the positive a⁺=[0.9,0.4] gives cos(a, a⁺) ≈ 0.914; the negative b=[-0.5,0.8] gives cos(a, b) ≈ -0.530. The contrastive loss pulls positives higher and pushes negatives lower; scaled across thousands of pairs, the network ends up with a feature space where same-image-different-view is close and different-image is far.
  • Masked image modeling and self-distillation (2021 onward). MAE (He et al. 2022): mask 75% of patches; heavy encoder sees only the visible 25%; lightweight decoder reconstructs masked patches. DINO / DINOv2 (Caron et al. 2021; Oquab et al. 2023): self-distillation; a student network is trained to match the output of a teacher that is a momentum copy of the student. Produces strong general-purpose features; DINOv2 is the current go-to general-purpose encoder for many downstream tasks.
  • Pre-train then transfer workflow. Pre-train encoder on huge unlabeled data (expensive, once). Transfer to downstream task: fine-tuning (train whole network with small LR on encoder; best task accuracy) or linear probing (freeze encoder, train one linear classifier; fair feature comparison). One pre-trained encoder amortizes across many downstream applications.
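
The worked similarity numbers in the list above can be reproduced with a few lines of plain Python. The sketch below also includes a simplified, single-anchor contrastive loss in the spirit of NT-Xent: the negative log-softmax of the positive's similarity among all candidates. This is a simplification, not SimCLR's full loss, which is computed symmetrically over every view in a batch. The function names and the temperature of 0.5 are assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def contrastive_loss(anchor, positive, negatives, temperature=0.5):
    """Single-anchor contrastive loss (simplified NT-Xent-style sketch).

    Computes -log( exp(s_pos / t) / sum_j exp(s_j / t) ), where the sum runs
    over the positive and all negatives. temperature=0.5 is an assumption.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]

a, a_pos, b = [1.0, 0.0], [0.9, 0.4], [-0.5, 0.8]
print(round(cosine(a, a_pos), 3))  # 0.914
print(round(cosine(a, b), 3))      # -0.53

# The loss is small when the true positive is already the closest candidate,
# and large if the roles of positive and negative are swapped.
print(contrastive_loss(a, a_pos, [b]))
print(contrastive_loss(a, b, [a_pos]))
```

Lowering the loss means raising the positive's similarity relative to the negatives, which is the "pull together, push apart" behavior described above.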

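The 75% masking step described for MAE can be sketched as follows. This only illustrates the random split of patch indices into visible and masked sets; it is not the MAE implementation. The 196-patch count assumes a 224x224 image cut into 16x16 patches, and `random_patch_mask` is a hypothetical helper name.

```python
import random

def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Randomly split patch indices into (visible, masked) lists.

    In an MAE-style setup the encoder would receive only the visible
    patches, and the decoder would be asked to reconstruct the masked ones.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked

# 14 x 14 = 196 patches (assumed patch grid for illustration)
visible, masked = random_patch_mask(196)
print(len(visible), len(masked))  # 49 147
```

Because the heavy encoder processes only the 49 visible patches in this example, most of the compute per image is spent on a quarter of the input.
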
Self-supervised learning is the engine behind several things you have seen. When a lab releases a “general-purpose vision encoder” (DINOv2, CLIP’s vision tower, MAE-pretrained ViT) that does well on segmentation, classification, retrieval, and depth estimation without retraining, you are seeing self-supervised pre-training paying off across tasks. Multimodal models (CLIP, larger vision-language families) typically use self-supervised image encoders as one component, jointly trained or fine-tuned with similarly pre-trained text encoders. In label-scarce domains (medical imaging, satellite imagery, microscopy, scientific data) self-supervised pre-training is what makes vision feasible: pre-train on millions of unlabeled scans (which exist), fine-tune on the small set of expert-labeled cases (which are expensive). The economic pattern (expensive pre-training amortized across many cheap downstream fine-tunes) is the same one that drives modern language models; vision has now converged on the same engineering shape.

Labels are expensive; unlabeled images are not. Self-supervised learning is how vision systems learn from the cheap stuff first, then spend the precious labeled data where it actually matters.