Diffusion models: brief

What you’ll learn

This is lesson 12 of Phase 3 (Generating and grounding vision), the third lesson of the generative-modeling stretch. By the end you will be able to explain the diffusion model’s two-direction setup (fixed forward noising plus learned reverse denoising), compute one forward step by hand, place diffusion in a three-way trade-off with the VAEs and GANs from lesson 11, and recognize the architecture behind modern text-to-image systems. The source curriculum is Stanford CS231n (cs231n.stanford.edu); this lesson maps to Lecture 14 (Generative Models 2). Deeper treatment is deferred to sister tracks T19 (variational interpretation, score-based equivalence) and T24 (production text-to-image pipelines), per the Track 16 Phase 0 arc.

The lesson opens with the forward process, which involves no learning: it is defined by a noise schedule β_1, ..., β_T, and the lesson works one numerical step by hand. It then explains the training objective for the learned reverse process (a simple MSE between predicted and true noise, with a U-Net conditioned on the timestep). It walks through inference (iterative denoising starting from pure noise) and the three main speed-up techniques (DDIM, distilled diffusion, latent diffusion). It covers text-to-image conditioning via cross-attention to a text embedding plus classifier-free guidance, and names the modern landmark systems (Stable Diffusion, Imagen, DALL-E 2/3). The closing common pitfall carries forward L11’s “technique vs application” distinction.
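The noise-prediction objective summarized above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the lesson’s implementation: predict_noise is a hypothetical stand-in for the timestep-conditioned U-Net, the "image" is a three-number list, and noise is mixed in with a single forward step of the assumed form x_t = sqrt(1 - β_t) · x_{t-1} + sqrt(β_t) · ε.

```python
import math
import random

def add_noise(x, beta_t, eps):
    """Mix fresh noise eps into x with one forward step (assumed DDPM form)."""
    keep = math.sqrt(1.0 - beta_t)
    scale = math.sqrt(beta_t)
    return [keep * xi + scale * ei for xi, ei in zip(x, eps)]

def predict_noise(x_noisy, t):
    """Hypothetical stand-in for the U-Net: always guesses zero noise."""
    return [0.0 for _ in x_noisy]

def noise_mse(predicted, true):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted, true)) / len(true)

rng = random.Random(0)                   # fixed seed so the run is repeatable
x = [0.8, -0.2, 0.5]                     # toy "image" of three values
eps = [rng.gauss(0.0, 1.0) for _ in x]   # the true noise the model must recover
x_noisy = add_noise(x, beta_t=0.1, eps=eps)

loss = noise_mse(predict_noise(x_noisy, t=1), eps)
print(f"loss for the zero-noise guess: {loss:.4f}")
```

A perfect predictor would return eps itself and score a loss of exactly zero. The training target is always a known noise sample, which is what makes this a plain regression problem.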

Where this fits

This is lesson 12 of 16, the third lesson of Phase 3. It depends on lesson 11 (VAEs and GANs; the trade-off comparison and the VAE-as-first-stage-encoder framing both build on L11). The next lesson, Recovering the third dimension: 3D vision, opens the geometry-focused stretch of Phase 3.

Before you start

Prerequisites: lesson 11 of this track (GANs and VAEs). This lesson’s main framing is “the third generative-image family that solves the trade-off L11 set up,” and it builds explicitly on the VAE concepts (especially the encoder, which reappears as latent diffusion’s first-stage component). Lessons 3-4 (loss + gradient descent + backprop) carry over; what changes is the training objective (simple MSE on noise prediction) and the iterative inference loop.

About the math

Light. The body shows the forward-step formula, x_t = sqrt(1 - β_t) · x_{t-1} + sqrt(β_t) · ε, and works one numerical example by hand (x_{t-1} = 0.8, β_t = 0.1, ε = -0.3 → x_t ≈ 0.664). Practice repeats the computation with fresh numbers (x_{t-1} = 0.5, β_t = 0.04, ε = 1.5 → x_t ≈ 0.790). The full mathematical derivation of the training objective and the equivalence with score-based generative models lives in T19; this lesson takes the simple MSE loss as given.
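Both hand computations can be checked with a few lines of Python. This is a minimal sketch that assumes the forward step has the standard DDPM form x_t = sqrt(1 - β_t) · x_{t-1} + sqrt(β_t) · ε, which reproduces the two quoted results; the function name forward_step is illustrative.

```python
import math

def forward_step(x_prev: float, beta_t: float, eps: float) -> float:
    """One forward (noising) step for a single scalar value."""
    keep = math.sqrt(1.0 - beta_t)   # fraction of the previous value that survives
    scale = math.sqrt(beta_t)        # weight on the fresh noise sample
    return keep * x_prev + scale * eps

body_example = forward_step(0.8, 0.1, -0.3)
practice_example = forward_step(0.5, 0.04, 1.5)

print(f"{body_example:.3f}")      # 0.664
print(f"{practice_example:.3f}")  # 0.790
```

With β_t small, sqrt(1 - β_t) stays close to 1, so a single step only nudges the value. It takes many steps across the schedule to reach pure noise.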

By the end, you’ll be able to

Describe the diffusion two-direction setup and why generation is iterative denoising
Compute one forward step by hand
Explain the MSE training loss and why it gives stable training
Compare diffusion vs VAE vs GAN
Recognize text-to-image diffusion (Stable Diffusion / latent diffusion + cross-attention + classifier-free guidance) and the VAE’s first-stage role
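As a preview of the last objective, classifier-free guidance can be sketched as a single blending step between two noise predictions. This sketch assumes the widely used convention guided = unconditional + scale · (conditional - unconditional); the lesson may parameterize the scale differently. The function name and the toy numbers are illustrative.

```python
def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and text-conditioned noise predictions.

    Assumed convention: scale 0 gives the unconditional prediction,
    scale 1 gives the conditional one, and scale > 1 extrapolates
    past the conditional prediction, away from the unconditional one.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.10, -0.40]   # toy prediction with the text prompt dropped
eps_cond = [0.30, -0.10]     # toy prediction with the text prompt supplied

no_extra_push = guided_noise(eps_uncond, eps_cond, 1.0)
strong_push = guided_noise(eps_uncond, eps_cond, 7.5)
print(no_extra_push)
print(strong_push)
```

In an actual sampler this blend would be applied at every denoising step, so a larger scale pushes the whole trajectory harder toward the prompt.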

Time and difficulty

Read time: about 14 minutes
Practice time: about 15 minutes (a fresh forward-step computation, a method-choice exercise across the three families, a classifier-free-guidance trajectory-reasoning question, plus flashcards)
Difficulty: standard (the math is multiplication, addition, and a square-root lookup or two; the conceptual lift is seeing the two-direction setup as a clean reformulation that buys sample quality and training stability at the cost of iterative inference)