Self-supervised vision, in brief

What you’ll learn

This lesson opens Phase 3 (Generating and grounding vision) and is the first in which the model learns without any labels. By the end you will be able to explain why self-supervised learning is the load-bearing technique for label-scarce domains and modern general-purpose vision encoders, walk through the three families of methods (pretext tasks, contrastive learning, masked image modeling) and their canonical instantiations, and compute a cosine similarity by hand to see what contrastive learning actually optimizes. The source curriculum is Stanford CS231n (cs231n.stanford.edu); this lesson maps to Lecture 12 (Self-supervised Learning).

The lesson opens with the economics (labels are expensive, unlabeled data is abundant) and names the core trick: construct labels from the data itself via a pretext task. It then walks the pretext-task wave of 2014-2018 (rotation, jigsaw, colorization, inpainting, relative position), the shift to contrastive learning (SimCLR, MoCo, BYOL) with one cosine similarity worked by hand, and masked image modeling (MAE) alongside self-distillation (DINO / DINOv2). It ends with the pre-train-then-transfer workflow that ties everything to downstream use.
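To make the core trick concrete, here is a minimal sketch of how a rotation pretext task could manufacture labels from unlabeled images. It is an illustration under stated assumptions, not the lesson's or any paper's implementation: the helper names rotate90 and make_rotation_example are hypothetical, and a tiny nested list stands in for a real image.

```python
import random

def rotate90(image):
    """Rotate a 2D list-of-rows image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def make_rotation_example(image, rng):
    """Build one (input, label) pair with no human annotation.

    The label k in {0, 1, 2, 3} records how many 90-degree clockwise
    turns were applied; a network would be trained to predict k.
    """
    k = rng.randrange(4)
    rotated = image
    for _ in range(k):
        rotated = rotate90(rotated)
    return rotated, k

# Toy "image" (assumption: a 2x2 grid of pixel values).
image = [[1, 2],
         [3, 4]]

rng = random.Random(0)  # seeded only to make the sketch repeatable
rotated, label = make_rotation_example(image, rng)
print(rotated, label)
```

The label comes entirely from the transformation applied to the data, which is the sense in which the supervision is "free".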

Where this fits

This is lesson 10 of 16, the first lesson of Phase 3. It depends on lessons 6 (CNN architectures: ResNet for the encoder backbones contrastive methods often use) and 7 (ViT for masked image modeling and DINO). The next lesson, Teaching machines to imagine: GANs and VAEs, opens the generative-modeling stretch with discriminative-vs-generative framing.

Before you start

Prerequisites: lessons 6 and 7 of this track. ResNet (L6) is the canonical encoder backbone used in early contrastive work; the Vision Transformer (L7) is the architecture that masked image modeling and DINO rely on. Lessons 3-4 (loss + gradient descent + backprop) carry over unchanged; what changes is what the loss is computed on.

About the math

Light. The body works cosine similarity by hand using the standard formula cos(u,v) = (u · v) / (||u|| · ||v||): for the positive pair a=[1,0] and a⁺=[0.9,0.4], cos ≈ 0.914; for the negative b=[-0.5,0.8], cos(a,b) ≈ -0.530. Practice repeats with fresh vectors (a=[2,1], a⁺=[3,2], b=[-2,1] → cos ≈ 0.992 vs -0.600). No calculus, only multiplication, addition, square root, and division.
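The numbers above can be checked with a few lines of plain Python. This is a minimal sketch using only the standard library; the function name cosine_similarity and the variable names are illustrative choices, not part of the lesson.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||) for two equal-length vectors."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return dot / (norm_u * norm_v)

# Vectors from the lesson body.
a = [1.0, 0.0]
a_pos = [0.9, 0.4]   # positive: should point nearly the same way as a
b = [-0.5, 0.8]      # negative: should point away from a

pos = cosine_similarity(a, a_pos)
neg = cosine_similarity(a, b)
print(f"{pos:.3f} {neg:.3f}")  # prints: 0.914 -0.530
```

A contrastive objective pushes the first number toward 1 and the second lower, which is why the positive pair scoring well above the negative is the outcome to look for.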

By the end, you’ll be able to

Explain the core trick (constructing labels from the data) and the pre-train-then-transfer workflow
Name three pretext tasks and what each requires the network to understand
Describe SimCLR’s contrastive setup and compute one cosine similarity by hand
Distinguish MoCo’s and BYOL’s contributions
Describe masked image modeling (MAE) and self-distillation (DINO/DINOv2)

Time and difficulty

Read time: about 14 minutes
Practice time: about 15 minutes (a fresh cosine-similarity computation for a positive vs negative pair, a method-matching exercise across pretext / contrastive / masked-modeling, a workflow-planning question for a medical-imaging label-scarce scenario, plus flashcards)
Difficulty: standard (the math is the cosine-similarity formula; the conceptual lift is seeing the through-line from pretext tasks to contrastive learning to masked image modeling as one family of “construct labels from the data” techniques)