Cheatsheet: Generating by denoising: diffusion
The one idea that matters
diffusion = learn to REMOVE noise
forward (no learning): real image + noise + noise + ... → pure static
reverse (the network): static → denoise → denoise → ... → a NEW image
Learning to undo noise turns out to be learning to create.
The two directions
| Direction | What happens | Needs learning? |
|---|---|---|
| Forward | Add a little noise to a real image, repeatedly, until pure static | No (trivial) |
| Reverse | A network predicts “one step less noisy” and removes a little | Yes (and we know the clean answers, so training is easy) |
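
The two rows above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real training pipeline: the 64-pixel "image", the noise rate `BETA`, and the helper names are all assumptions made for the example. The forward direction needs no learning at all, and because every intermediate image is kept, each noisy image comes paired with the known, slightly cleaner answer for the reverse direction.

```python
import math
import random

BETA = 0.02  # assumed noise rate per forward step (illustrative value)

def add_noise_once(x, rng):
    """One forward step: shrink the image slightly and mix in a little Gaussian noise."""
    keep = math.sqrt(1.0 - BETA)
    add = math.sqrt(BETA)
    return [keep * v + add * rng.gauss(0.0, 1.0) for v in x]

def noising_chain(clean, steps, rng):
    """Repeat the forward step, remembering every intermediate image."""
    chain = [list(clean)]
    for _ in range(steps):
        chain.append(add_noise_once(chain[-1], rng))
    return chain

def training_pair(chain, t):
    """Input and known answer for one reverse step:
    given chain[t], the target is the slightly cleaner chain[t - 1]."""
    return chain[t], chain[t - 1]

def similarity(a, b):
    """Average pixel product; exactly 1.0 when a equals the +/-1 toy image b."""
    return sum(p * q for p, q in zip(a, b)) / len(b)

rng = random.Random(0)
clean = [1.0, 1.0, -1.0, -1.0] * 16          # toy 64-pixel "image"
chain = noising_chain(clean, steps=500, rng=rng)
noisy, target = training_pair(chain, t=10)   # one free training example
```

Calling `similarity(chain[0], clean)` gives 1.0, while `similarity(chain[-1], clean)` hovers near zero: the end of the chain is essentially static. Practical systems often train the network to predict the added noise rather than the cleaner image, which carries the same information.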
Generating
Start from pure random static (never seen before). Apply the trained denoiser many times. Each pass adds a little structure until a sharp, coherent new image emerges. Like a photo developing, or a sculptor removing all that is not the statue.
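
The generation loop can be sketched as follows. No trained network is available here, so the example makes a loud assumption: real "images" are a fixed 8-pixel `PATTERN` plus small Gaussian variation, which lets the best possible noise guess be written in closed form and used as a stand-in for the denoiser. The update rule follows the common noise-prediction formulation; `PATTERN`, `DATA_STD`, `BETA`, and `STEPS` are illustrative values, not anything from a real model.

```python
import math
import random

# Toy assumptions (not from the text): real examples are PATTERN plus a little
# per-pixel variation, and the same noise rate is used at every step.
PATTERN = [1.0, 1.0, -1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
DATA_STD = 0.2   # how much real images vary around PATTERN
BETA = 0.02      # noise added per forward step
STEPS = 200      # number of denoising passes

def ideal_noise_guess(x, t):
    """Stand-in for the trained network: the best guess of the noise in x
    at step t, which has a closed form for this Gaussian toy data."""
    keep = (1.0 - BETA) ** t                     # share of variance that is still signal
    var = keep * DATA_STD ** 2 + (1.0 - keep)    # variance of each noisy pixel
    return [math.sqrt(1.0 - keep) * (v - math.sqrt(keep) * p) / var
            for v, p in zip(x, PATTERN)]

def generate(seed):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in PATTERN]   # start from pure static
    for t in range(STEPS, 0, -1):                # walk the chain backwards
        keep = (1.0 - BETA) ** t
        noise = ideal_noise_guess(x, t)
        scale = BETA / math.sqrt(1.0 - keep)
        # Remove a little of the guessed noise and undo the forward shrink.
        x = [(v - scale * n) / math.sqrt(1.0 - BETA) for v, n in zip(x, noise)]
        if t > 1:                                # fresh randomness on every pass but the last
            x = [v + math.sqrt(BETA) * rng.gauss(0.0, 1.0) for v in x]
    return x

image_a = generate(seed=1)
image_b = generate(seed=2)
```

Both results land close to `PATTERN` but differ from each other pixel by pixel: different starting static yields a different new image from the same denoiser.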
Why many small steps
Denoising static into a finished image in one jump is a brutally hard prediction (a network does it poorly). Removing a little noise is a gentle, well-defined prediction (a network does it well). Diffusion replaces one impossible step with a long chain of easy ones.
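
A little arithmetic makes the gap concrete. Assuming a forward step that scales the image by sqrt(1 - beta) and adds noise scaled by sqrt(beta), with an illustrative beta of 0.02 (an assumption, not a value from the text), the sketch below tracks how much of the original image survives after `t` steps.

```python
import math

def signal_weight(t, beta=0.02):
    """Weight on the original image after t forward steps."""
    return math.sqrt((1.0 - beta) ** t)

def noise_weight(t, beta=0.02):
    """Weight on the accumulated noise after t forward steps."""
    return math.sqrt(1.0 - (1.0 - beta) ** t)

for t in (1, 10, 100, 500):
    print(t, round(signal_weight(t), 4), round(noise_weight(t), 4))
```

After one step the image still carries about 0.99 of its original weight, so undoing that step means accounting for a thin layer of noise. After 500 steps the weight is under 0.01, so a single jump back would have to invent nearly the whole image at once.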
How it differs from VAE / GAN
| | VAE | GAN | Diffusion |
|---|---|---|---|
| How it generates | sample a latent point, decode (one shot) | generator pass (one shot) | denoise from static (many steps) |
| Mechanism | learned smooth space | generator-vs-discriminator contest | repeated denoiser |
| Tends toward | plausible, sometimes blurry | sharp, finicky to train | high quality + variety, but slow |
Text to image
Feed the text prompt alongside the noisy image; the network denoises toward a picture matching the words. “A cat on a skateboard” nudges every step toward catness + skateboardness. Same denoiser, now steered.
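
Steering can be sketched by giving the denoiser one extra argument. Everything here is a toy assumption: a dictionary lookup stands in for real text encoding, the two prompts and their 8-pixel patterns are hypothetical, and a closed-form noise guess replaces the trained network. What the example does show faithfully is the mechanism: the prompt is consulted on every denoising pass, not once at the start.

```python
import math
import random

# Hypothetical two-prompt "vocabulary": each prompt names the 8-pixel pattern
# that real images with that caption cluster around (a toy assumption).
PROMPT_PATTERN = {
    "stripes": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    "halves":  [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0],
}
DATA_STD, BETA, STEPS = 0.2, 0.02, 200  # illustrative values

def guess_noise(x, t, prompt):
    """Stand-in for the trained, prompt-aware network (closed form for this toy)."""
    keep = (1.0 - BETA) ** t
    var = keep * DATA_STD ** 2 + (1.0 - keep)
    target = PROMPT_PATTERN[prompt]
    return [math.sqrt(1.0 - keep) * (v - math.sqrt(keep) * p) / var
            for v, p in zip(x, target)]

def generate(prompt, seed):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(8)]   # pure static
    for t in range(STEPS, 0, -1):
        keep = (1.0 - BETA) ** t
        noise = guess_noise(x, t, prompt)         # the prompt nudges every step
        scale = BETA / math.sqrt(1.0 - keep)
        x = [(v - scale * n) / math.sqrt(1.0 - BETA) for v, n in zip(x, noise)]
        if t > 1:
            x = [v + math.sqrt(BETA) * rng.gauss(0.0, 1.0) for v in x]
    return x

stripes = generate("stripes", seed=7)
halves = generate("halves", seed=7)   # same starting static, different prompt
```

Both calls begin from identical static because they share a seed, yet they end at different images. The only thing that changed is the prompt handed to the denoiser.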
Tradeoffs (honest)
- Strength: high-quality, varied images; the gradual refinement pays off in quality.
- Cost: slow, dozens to hundreds of steps per image (vs a GAN’s single pass).
- Failure mode: confidently renders plausible-but-wrong details (extra fingers, melted shapes); it matches the look of real images, it does not understand them.
Pitfalls to dodge
- “It designs the image from the prompt.” No. It repeatedly removes noise, nudged by the prompt; the picture accumulates.
- “The noise is incidental.” No. Noising is the method; without it there is nothing to learn to reverse.
- “One pass makes the image.” No. Many small denoising steps (why it is slow).
- “It understands what it draws.” No. It learned the statistical look of images, which is why it can render the impossible with confidence.
The one-line version
A diffusion model is a patient sculptor working in static: trained only to chip away noise, it reveals an image that was never there until it made it.