Summary: Diffusion models II, training and sampling
The previous lesson built DDPM with a thousand-step Markov reverse chain. This lesson covers what made diffusion practical for production: two moves, DDIM sampling and classifier-free guidance, that wrap the trained noise predictor in a different inference loop.
What this lesson did
- Explained why DDPM sampling is slow (a thousand stochastic Markov-chain steps), and identified the slowness as a property of the sampling procedure, not of what the trained network knows.
- Built DDIM (Song et al. 2020), a deterministic non-Markovian sampler that uses the same trained noise predictor and takes tens of steps instead of a thousand. The update predicts the implied clean-sample estimate from the current noisy state, then re-noises that estimate to a cleaner target timestep using the predicted noise, with no fresh randomness injected.
- Built classifier-free guidance (Ho and Salimans 2021), the conditioning trick behind most modern text-to-image systems. One network is trained on both conditional and unconditional generation; at inference, its conditional and unconditional noise predictions are blended with a guidance scale.
- Walked the latency-quality Pareto frontier (a thousand DDPM steps at the asymptote, fifty DDIM steps as the production sweet spot, ten or fewer steps requiring distillation), and the §6 watch-territory carry-over from L12.
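The DDIM update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lesson's own code: the function names (`make_alpha_bar`, `ddim_step`, `ddim_sample`), the `eps_model(x, t)` call signature, and the linear beta schedule from 1e-4 to 0.02 are all assumptions chosen for the example.

```python
import numpy as np

def make_alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction alpha_bar_t for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update from noise level alpha_bar_t to alpha_bar_prev."""
    # Implied clean-sample estimate from the noisy state and the predicted noise.
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise the estimate to the cleaner target level, reusing the predicted noise.
    return np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

def ddim_sample(eps_model, shape, alpha_bar, num_steps=50, seed=0):
    """Run a short DDIM loop; eps_model(x, t) is a hypothetical noise predictor."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    # Evenly spaced subset of the training timesteps, from noisiest to cleanest.
    timesteps = np.linspace(len(alpha_bar) - 1, 0, num_steps).round().astype(int)
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        # After the last timestep, the target is the fully clean level alpha_bar = 1.
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        x = ddim_step(x, eps_model(x, t), a_t, a_prev)
    return x
```

With `num_steps=50` the loop calls the network fifty times instead of a thousand, which is where the speedup comes from. The only randomness is the initial draw of `x`, so a fixed seed yields the same sample every time.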
What to remember in three lines
- DDIM is a deterministic non-Markovian sampler that uses the same trained noise predictor as DDPM. The update extracts the implied clean-sample estimate from the current noisy state and the predicted noise, then re-noises to a cleaner target timestep. A fifty-step DDIM sampler matches a thousand-step DDPM sampler in quality on most benchmarks, with one-twentieth of the network evaluations and so roughly twenty times the speed.
- Classifier-free guidance trains one network on both conditional and unconditional generation, then at inference blends the conditional and unconditional noise predictions: the guided prediction is the unconditional prediction plus the guidance scale times the difference between the conditional and unconditional predictions. Higher guidance amplifies prompt adherence at the cost of sample diversity. Guidance costs two forward passes per sampling step.
- The latency-quality Pareto frontier governs the modern stack. A quoted step count is ambiguous without the sampler name; a quoted sampling time is ambiguous without the network size and batch size.
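The guidance blend from the second point above can be written out directly. The sketch below is illustrative only: `cfg_noise`, `guided_eps`, the `eps_model(x, t, cond)` signature, and the use of `None` as the "no condition" token are assumptions for the example, not an API from the lesson or from any library.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, guidance_scale):
    """Guided prediction: unconditional plus scale times (conditional - unconditional)."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def guided_eps(eps_model, x_t, t, cond, guidance_scale, null_cond=None):
    """Two forward passes through a hypothetical eps_model(x, t, cond), then the blend."""
    eps_uncond = eps_model(x_t, t, null_cond)  # pass 1: condition dropped
    eps_cond = eps_model(x_t, t, cond)         # pass 2: condition supplied
    return cfg_noise(eps_uncond, eps_cond, guidance_scale)
```

Under this convention a scale of 0 returns the unconditional prediction, a scale of 1 returns the conditional prediction, and scales above 1 extrapolate past the conditional prediction, which is the regime that trades diversity for prompt adherence. In practice the two passes are often batched into one call on a doubled batch, but the compute cost is still two network evaluations per step.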
Where this is going
The next lesson (lesson 14) takes the score-based view from L11 and shows the formal equivalence among L11 (denoising score matching), L12 (DDPM Markov chain), and this lesson’s DDIM sampler. The unification is via the continuous-time stochastic differential equation perspective. The capstone at lesson 15 returns to L1’s four-paradigm map with all the math filled in and places modern systems (Stable Diffusion, GAN-based face generators, autoregressive language models, latent diffusion hybrids) on it explicitly.