Cheatsheet: Diffusion models

| Direction | What | Learned? |
| --- | --- | --- |
| Forward | Repeatedly add Gaussian noise: x_t = sqrt(1-β_t)·x_{t-1} + sqrt(β_t)·ε | No; defined by noise schedule β_1..β_T |
| Reverse | Predict noise at step t; iterate from x_T back to x_0 | Yes; the trained network |
| Generation | Start at x_T ~ N(0, I); iterate reverse T times → x_0 | At inference only |

x_t = sqrt(1 - β_t) · x_{t-1} + sqrt(β_t) · ε, with ε ~ N(0, I).

| Term | Effect |
| --- | --- |
| sqrt(1 - β_t) · x_{t-1} | Shrinks previous image slightly toward zero |
| sqrt(β_t) · ε | Adds calibrated bump of fresh noise |
| Both square roots | Keep total variance normalized as t grows |

| Source | x_{t-1} | β_t | ε | x_t |
| --- | --- | --- | --- | --- |
| Body | 0.8 | 0.1 | -0.3 | ≈ 0.664 |
| Practice | 0.5 | 0.04 | 1.5 | ≈ 0.790 |

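
The two worked rows above can be checked with a few lines of NumPy. This is a minimal sketch: `forward_step` is a name chosen here for illustration, and the variance check assumes the input already has unit variance.

```python
import numpy as np

def forward_step(x_prev, beta_t, eps):
    """One forward (noising) step: sqrt(1 - beta_t) * x_prev + sqrt(beta_t) * eps."""
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

# The two scalar examples from the table.
body = forward_step(0.8, 0.1, -0.3)
practice = forward_step(0.5, 0.04, 1.5)
print(f"{body:.3f} {practice:.3f}")  # 0.664 0.790

# Variance check: if x_prev has unit variance, so does x_t,
# because (1 - beta_t) * 1 + beta_t * 1 = 1.
rng = np.random.default_rng(0)
x_prev = rng.standard_normal(100_000)
x_next = forward_step(x_prev, 0.1, rng.standard_normal(100_000))
var_after = float(x_next.var())  # close to 1.0
```

The second half illustrates the "both square roots" row: the shrink factor and the noise factor are chosen so their squared weights sum to one.
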
| Step | Action |
| --- | --- |
| 1 | Sample training image x_0 |
| 2 | Sample random timestep t ~ uniform(1, T) |
| 3 | Sample noise ε ~ N(0, I); compute x_t in one shot via the closed-form |
| 4 | Pass (x_t, t) to network (typically U-Net with time embedding) |
| 5 | Network predicts ε; loss = ‖ε − ε_θ(x_t, t)‖² (MSE between true and predicted noise) |
| 6 | Backprop + gradient descent (same L3-L4 machinery) |
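
Steps 2 to 5 can be sketched in NumPy as follows. The closed form used here, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with ᾱ_t = (1 − β_1)···(1 − β_t), follows from unrolling the one-step rule and merging the independent Gaussian terms. The linear schedule from 1e-4 to 0.02, the names `q_sample` and `training_loss`, and the zero-output `predict_noise` stand-in are all assumptions made for illustration; a real setup would use a trained U-Net and an autograd framework for step 6.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)   # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, eps):
    """Closed-form forward sample at 1-indexed timestep t."""
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

def predict_noise(x_t, t):
    """Hypothetical stand-in for the network eps_theta(x_t, t)."""
    return np.zeros_like(x_t)

def training_loss(x0):
    t = int(rng.integers(1, T + 1))        # step 2: t uniform on {1, ..., T}
    eps = rng.standard_normal(x0.shape)    # step 3: fresh Gaussian noise
    x_t = q_sample(x0, t, eps)             # step 3: jump straight to x_t
    eps_hat = predict_noise(x_t, t)        # step 4: network forward pass
    return float(np.mean((eps - eps_hat) ** 2))  # step 5: MSE on the noise

x0 = rng.standard_normal((8, 8))           # toy "image"
loss = training_loss(x0)
```

Because x_t is computed directly from x_0, each training example costs one network call regardless of which timestep was drawn.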

No adversarial dynamic and no encoder-decoder reconstruction term; just clean regression. This is why diffusion trains so stably compared with GANs.

| Step | Detail |
| --- | --- |
| Start | x_T ~ N(0, I) (pure noise) |
| Iterate | For t = T, T-1, …, 1: use network to step from x_t to x_{t-1} |
| End | x_0 is the generated image |
| Cost | T forward passes (T often 1000 originally; sped up to 25-100 by DDIM, 1-4 by distillation) |

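
The loop above can be sketched as follows. The table does not spell out the per-step update, so this sketch assumes the standard DDPM ancestral rule with step variance β_t. The `oracle` predictor is a hypothetical stand-in for a trained network: it is the exact noise predictor for a toy dataset whose every "image" equals `target`, which makes the result easy to check.

```python
import numpy as np

def ddpm_sample(predict_noise, shape, betas, rng):
    """Ancestral sampling: start at x_T ~ N(0, I), step back to x_0."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                  # x_T: pure noise
    for t in range(len(betas), 0, -1):              # t = T, T-1, ..., 1
        i = t - 1                                   # 0-based index into the schedule
        eps_hat = predict_noise(x, t)
        # Remove the predicted noise, then rescale.
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps_hat) / np.sqrt(alphas[i])
        # Fresh noise is added at every step except the last.
        noise = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(betas[i]) * noise
    return x

betas = np.linspace(1e-4, 0.02, 1000)               # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)
target = 0.7                                        # toy "dataset": every pixel is 0.7

def oracle(x_t, t):
    """Exact noise for the toy dataset; a trained network would go here."""
    a_bar = alpha_bars[t - 1]
    return (x_t - np.sqrt(a_bar) * target) / np.sqrt(1.0 - a_bar)

sample = ddpm_sample(oracle, (4,), betas, np.random.default_rng(0))
```

With the oracle in place, `sample` lands on `target`; with a real network, each loop iteration is one forward pass, which is where the T-pass cost comes from.
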
| Technique | What it does |
| --- | --- |
| DDIM (Song 2020) | Deterministic sampler; good samples in 25-100 steps |
| Distilled diffusion (multiple lines) | Student model produces image in 1-4 steps |
| Latent diffusion (Rombach 2022) | Operate in small VAE-compressed latent space; each step cheaper |
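
To illustrate the DDIM row, here is a minimal sketch of a deterministic sampler that visits only a subsequence of timesteps. It assumes the zero-noise (η = 0) DDIM update: estimate x_0 from the current noise prediction, then re-noise that estimate to the next, earlier timestep. The evenly spaced timestep choice, the linear schedule, and the `oracle` predictor (exact for a toy dataset of constant images) are illustrative assumptions.

```python
import numpy as np

def ddim_sample(predict_noise, shape, alpha_bars, num_steps, rng):
    """Deterministic DDIM-style sampling over num_steps <= T timesteps."""
    T = len(alpha_bars)
    # Evenly spaced 1-indexed timesteps from T down to 1.
    timesteps = np.linspace(T, 1, num_steps).round().astype(int)
    x = rng.standard_normal(shape)                  # x_T: pure noise
    for k, t in enumerate(timesteps):
        a_bar = alpha_bars[t - 1]
        # alpha_bar at the next visited timestep; 1.0 means fully clean.
        a_bar_prev = alpha_bars[timesteps[k + 1] - 1] if k + 1 < num_steps else 1.0
        eps_hat = predict_noise(x, int(t))
        # Current estimate of the clean image.
        x0_hat = (x - np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(a_bar)
        # Jump to the earlier timestep without adding fresh noise.
        x = np.sqrt(a_bar_prev) * x0_hat + np.sqrt(1.0 - a_bar_prev) * eps_hat
    return x

betas = np.linspace(1e-4, 0.02, 1000)               # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)
target = 0.7                                        # toy "dataset": every pixel is 0.7

def oracle(x_t, t):
    """Exact noise for the toy dataset; a trained network would go here."""
    a_bar = alpha_bars[t - 1]
    return (x_t - np.sqrt(a_bar) * target) / np.sqrt(1.0 - a_bar)

sample = ddim_sample(oracle, (4,), alpha_bars, num_steps=50, rng=np.random.default_rng(0))
```

Here 50 network calls replace 1000; the only randomness is the initial draw of x_T.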

Latent diffusion architecture (the modern default)

| Component | Role |
| --- | --- |
| Pre-trained VAE encoder | Image → compact latent code |
| Diffusion model | Runs reverse process in the latent space |
| Pre-trained VAE decoder | Latent → pixels |

L11’s VAE is load-bearing here as the first-stage encoder and decoder, even though diffusion has replaced VAEs for direct generation.
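
A quick size calculation shows why running diffusion in the latent space makes each step cheaper. The numbers below (512×512 RGB input, 8× spatial downsampling, 4 latent channels) are assumptions typical of Stable Diffusion v1 and are not taken from this cheatsheet.

```python
import math

image_shape = (512, 512, 3)   # assumed RGB image: height, width, channels
downsample = 8                # assumed VAE spatial compression factor
latent_channels = 4           # assumed number of latent channels

latent_shape = (
    image_shape[0] // downsample,
    image_shape[1] // downsample,
    latent_channels,
)
# How many times fewer values the diffusion model has to process per step.
ratio = math.prod(image_shape) / math.prod(latent_shape)
print(latent_shape, ratio)  # (64, 64, 4) 48.0
```

Under these assumptions the denoising network sees about 48 times fewer values per step than it would in pixel space.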

| Element | Detail |
| --- | --- |
| Cross-attention in U-Net | Image-feature positions attend to text-embedding positions |
| Text embedding | Typically from CLIP’s text encoder |
| Classifier-free guidance | Train with + without prompt; combine at inference for tunable adherence vs diversity |

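
The "combine at inference" part of classifier-free guidance can be sketched as a single extrapolation between the two noise predictions. The function name `guided_noise` and the `guidance_scale` parameter are hypothetical, and the convention assumed here is that a scale of 1 reproduces the prompt-conditioned prediction (some write-ups offset the scale by one).

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy predictions standing in for two network calls on the same x_t:
eps_uncond = np.array([0.0, 1.0])   # prompt dropped
eps_cond = np.array([1.0, 1.0])     # prompt supplied

no_guidance = guided_noise(eps_uncond, eps_cond, 0.0)    # equals eps_uncond
plain_cond = guided_noise(eps_uncond, eps_cond, 1.0)     # equals eps_cond
strong = guided_noise(eps_uncond, eps_cond, 7.5)         # extrapolates past eps_cond
```

Scales above 1 push the sample harder toward the prompt at the cost of diversity, which is the adherence-versus-diversity dial in the table.
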
| Property | VAE | GAN | Diffusion |
| --- | --- | --- | --- |
| Output quality | Slightly blurry | Sharp | Sharp / high-quality |
| Training stability | Stable, principled | Unstable, art | Stable, simple MSE |
| Mode coverage | Good | Mode-collapse risk | Good (no mode collapse) |
| Likelihood | ELBO bound | None | Approximate / score-based |
| Inference speed | Single pass (fast) | Single pass (fast) | Iterative (T steps; slow) |
| Conditioning quality | Possible | Possible | Excellent (text-to-image dominant) |
| Production use | Often as first-stage encoder | Real-time / on-device | Text-to-image; conditional generation; controlled editing |

| System | Notes |
| --- | --- |
| Stable Diffusion | Latent diffusion; open-source; consumer-grade |
| Imagen (Google) | High-resolution text-to-image |
| DALL-E 2 / DALL-E 3 | OpenAI’s text-to-image |
| Midjourney | Proprietary; widely understood to be diffusion-based |

| Application | Notes |
| --- | --- |
| Text-to-image | The dominant use today |
| Image-to-image translation (text-guided) | img2img modes; instruction-based editing |
| Inpainting / outpainting | Conditional on surrounding region |
| Super-resolution | Condition on low-res input |
| Controlled generation | ControlNet (depth, edges, pose, etc.) |
| Video generation | Active research; mainstream-product-emerging |

| Pitfall | Reality |
| --- | --- |
| “Diffusion is just a fancier VAE” | Structurally different; no bottleneck latent; learns noise prediction, not direct generation |
| β_t = predicted noise | β_t is the fixed schedule (how much noise gets added); ε_θ(x_t, t) is the network’s prediction |
| Iterative cost is fixed | DDIM, distillation, latent diffusion have dropped it dramatically; continues to drop |
| Diffusion = text-to-image only | The architecture is general; inpainting, super-resolution, controlled generation, video, 3D are all active |
| Diffusion = controversy | Technique vs application; the mechanism is general, controversies are about specific deployment choices |

Diffusion reframes generation as iterative noise removal: train a network on simple MSE noise-prediction; sample by iterating from pure noise back to an image; pay iterative inference time for high-quality + stable training. Modern text-to-image systems (Stable Diffusion, Imagen, DALL-E 2/3) are all diffusion-based.