
Generating images by denoising: diffusion

This is lesson 12, part of Phase 3 (Generating and grounding vision) and the third lesson of the generative-modeling stretch. It builds one capability: you will be able to explain the diffusion model’s strange-but-elegant two-direction setup (forward noising plus learned reverse denoising), compute one forward step by hand, place diffusion in its three-way trade-off with the VAEs and GANs of lesson 11, and recognize the architecture behind modern text-to-image systems. The source curriculum is Stanford CS231n (cs231n.stanford.edu); this lesson maps to Lecture 14 (Generative Models 2). Deep mechanical derivations are deferred to sister tracks T19 (variational interpretation, score-based equivalence) and T24 (production text-to-image pipelines), per the Track 16 Phase 0 arc.

The lesson opens with the forward (no-learning) process, defined by a noise schedule β_1, ..., β_T, and works one numerical step by hand. It then explains the training objective of the reverse (learned) process: a simple MSE between predicted and true noise, with a U-Net architecture conditioned on the timestep. It walks through inference (iterative denoising from pure noise) and the three main speed-up techniques (DDIM, distilled diffusion, latent diffusion). It covers text-to-image conditioning via cross-attention to a text embedding plus classifier-free guidance, and names the modern landmark systems (Stable Diffusion, Imagen, DALL-E 2/3). The closing common-pitfall note carries lesson 11’s “technique vs application” distinction forward.
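The training objective described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lesson's own implementation: the linear schedule values and `T = 1000` are assumed, `predict_noise` is a hypothetical stand-in for the timestep-conditioned U-Net, and the one-shot jump from a clean image to step t uses the standard DDPM cumulative-product identity, which the lesson text does not spell out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed noise schedule beta_1..beta_T (linear; the values are illustrative).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
# Cumulative product of (1 - beta): lets us noise x0 to any step t in one shot.
alpha_bars = np.cumprod(1.0 - betas)

def predict_noise(x_t, t):
    """Hypothetical placeholder for the U-Net; a real model is learned."""
    return np.zeros_like(x_t)

def noise_prediction_loss(x0, rng):
    """Simple MSE between the predicted noise and the true noise."""
    t = rng.integers(0, T)                # random timestep index
    eps = rng.standard_normal(x0.shape)   # the true noise
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_hat = predict_noise(x_t, t)
    return np.mean((eps_hat - eps) ** 2)

x0 = rng.standard_normal((8, 8))          # toy stand-in for an image
loss = noise_prediction_loss(x0, rng)
print(float(loss))
```

Because the placeholder always predicts zero, the printed loss is just the mean squared value of the sampled noise; a trained network would drive it lower by actually recovering ε.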

This is lesson 12 of 16, the third lesson of Phase 3. It depends on lesson 11 (VAEs and GANs; the trade-off comparison and the VAE-as-first-stage-encoder framing both build on L11). The next lesson, Recovering the third dimension: 3D vision, opens the geometry-focused stretch of Phase 3.

Prerequisites: lesson 11 of this track (GANs and VAEs). This lesson’s main framing is “the third generative-image family that solves the trade-off L11 set up,” and it builds explicitly on the VAE concepts (especially the encoder, which reappears as latent diffusion’s first-stage component). Lessons 3-4 (loss + gradient descent + backprop) carry over; what changes is the training objective (simple MSE on noise prediction) and the iterative inference loop.

Math load: light. The body shows the forward-step formula and works one numerical example by hand (x_{t-1} = 0.8, β_t = 0.1, ε = -0.3, giving x_t ≈ 0.664). Practice repeats with fresh numbers (x_{t-1} = 0.5, β_t = 0.04, ε = 1.5, giving x_t ≈ 0.790). The full mathematical derivation of the training objective and the equivalence with score-based generative models lives in T19; this lesson takes the simple MSE loss as given.
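As a quick check on the two worked examples quoted above, here is a minimal sketch that assumes the standard DDPM forward step, x_t = sqrt(1 - β_t) · x_{t-1} + sqrt(β_t) · ε. That form reproduces both quoted results; the function name `forward_step` is illustrative only.

```python
import math

def forward_step(x_prev: float, beta_t: float, eps: float) -> float:
    """One forward noising step: shrink the signal, then add scaled noise."""
    return math.sqrt(1.0 - beta_t) * x_prev + math.sqrt(beta_t) * eps

body_example = forward_step(0.8, 0.1, -0.3)      # numbers from the lesson body
practice_example = forward_step(0.5, 0.04, 1.5)  # numbers from the practice set

print(f"{body_example:.3f}")      # 0.664
print(f"{practice_example:.3f}")  # 0.790
```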

  • Describe the diffusion two-direction setup and why generation is iterative denoising
  • Compute one forward step by hand
  • Explain the MSE training loss and why it gives stable training
  • Compare diffusion vs VAE vs GAN
  • Recognize text-to-image diffusion (Stable Diffusion / latent diffusion + cross-attention + classifier-free guidance) and the VAE’s first-stage role
  • Read time: about 14 minutes
  • Practice time: about 15 minutes (a fresh forward-step computation, a method-choice exercise across the three families, a classifier-free-guidance trajectory-reasoning question, plus flashcards)
  • Difficulty: standard (the math is multiplication and one square-root lookup; the conceptual lift is seeing the two-direction setup as a clean reformulation that gives quality + stability at iterative-inference cost)
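The classifier-free guidance named in the objectives and practice items above comes down to one line of arithmetic at each denoising step. The sketch below assumes the common convention in which a guidance scale of 0 gives the unconditional prediction and 1 gives the plain conditional one (some papers shift the scale by one). The toy vectors and the name `guided_noise` are illustrative only.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Push the noise prediction from unconditional toward (and past) conditional."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy noise predictions standing in for two passes through the denoiser:
eps_uncond = np.array([0.0, 1.0])   # prediction without the text prompt
eps_cond = np.array([1.0, 1.0])     # prediction with the text prompt

for w in (0.0, 1.0, 7.5):
    print(w, guided_noise(eps_uncond, eps_cond, w))
```

Scales above 1 extrapolate beyond the conditional prediction, which is the behavior the trajectory-reasoning practice question asks about.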