Diffusion models: cheatsheet

The one idea that matters

diffusion = learn to REMOVE noise
forward (no learning):  real image  + noise + noise + ...  →  pure static
reverse (the network):  static  → denoise → denoise → ...  →  a NEW image

Learning to undo noise turns out to be learning to create.

The two directions

Direction	What happens	Needs learning?
Forward	Add a little noise to a real image, repeatedly, until pure static	No (trivial)
Reverse	A network predicts a slightly less noisy image and removes that bit of noise	Yes (we added the noise ourselves, so the correct answer is known for every training example)
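A minimal sketch of the forward direction and the training target. It assumes the linear noise schedule common in DDPM-style models. The schedule values, the function names, and the 8x8 stand-in "image" are illustrative assumptions, not something this cheatsheet specifies. Instead of adding noise t times, it jumps straight to step t with one formula, which gives the same distribution when the noise is Gaussian.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear schedule: how much noise each forward step adds."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)  # fraction of signal variance left at step t
    return betas, alpha_bars

def noisy_at_step(x0, t, alpha_bars, rng):
    """Forward direction, no learning: blend the clean image with Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps  # eps is the known "clean answer" used as the training target

def training_loss(predicted_eps, eps):
    """Mean squared error between the network's noise guess and the true noise."""
    return float(np.mean((predicted_eps - eps) ** 2))

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))                  # stand-in for a real image
x_early, _ = noisy_at_step(x0, 10, alpha_bars, rng)       # still looks like x0
x_late, eps = noisy_at_step(x0, 999, alpha_bars, rng)     # essentially pure static
```

In this sketch a training example is the pair (x_t, t) as input and eps as target. No labels are needed, because the noise was generated by the training code itself.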

Generating

Start from freshly sampled random static, a pattern that has never existed before. Apply the trained denoiser many times. Each pass adds a little structure until a sharp, coherent new image emerges. Like a photo developing, or a sculptor removing all that is not the statue.
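The generation loop can be sketched as below, assuming DDPM-style ancestral sampling with a linear schedule (both are assumptions, as are all names here). toy_denoiser is a hypothetical stand-in for a trained network: it is exact only because the toy "dataset" contains a single image, so the loop always lands on that image. A real trained denoiser would produce varied new images instead.

```python
import numpy as np

NUM_STEPS = 1000
betas = np.linspace(1e-4, 0.02, NUM_STEPS)   # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Toy "dataset": exactly one 4x4 image, so the ideal denoiser has a closed form.
TARGET = np.linspace(-1.0, 1.0, 16).reshape(4, 4)

def toy_denoiser(x_t, t):
    """Hypothetical stand-in for a trained network: guess the noise inside x_t."""
    return (x_t - np.sqrt(alpha_bars[t]) * TARGET) / np.sqrt(1.0 - alpha_bars[t])

def sample(denoiser, shape, rng):
    x = rng.standard_normal(shape)            # start from pure static
    for t in reversed(range(NUM_STEPS)):      # many small denoising steps
        eps_hat = denoiser(x, t)
        # Remove the predicted noise for this step (DDPM posterior mean).
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Re-inject a little fresh noise on every step except the last.
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

image = sample(toy_denoiser, TARGET.shape, np.random.default_rng(0))
```

The loop calls the denoiser NUM_STEPS times for one image, which is where the slowness comes from.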

Why many small steps

Denoising static into a finished image in one jump is a brutally hard prediction (a network does it poorly). Removing a little noise is a gentle, well-defined prediction (a network does it well). Diffusion replaces one near-impossible step with a long chain of easy ones.
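A toy illustration of why the single jump is so hard, under stated assumptions: the "dataset" is two one-pixel images (-1 and +1), and the predictor is the best possible one in the mean-squared-error sense. All names and numbers are made up for this sketch. With light noise the best guess recovers the original; with heavy noise the best guess collapses toward the average of the two images, which is neither of them.

```python
import numpy as np

DATA = np.array([-1.0, 1.0])  # two one-pixel "images", equally likely

def best_guess_of_clean(x_t, alpha_bar):
    """Best mean-squared-error guess of the clean pixel given the noisy one."""
    # Log-likelihood of x_t under each candidate clean image.
    log_w = -(x_t - np.sqrt(alpha_bar) * DATA) ** 2 / (2.0 * (1.0 - alpha_bar))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                      # posterior weight of each candidate
    return float(w @ DATA)            # weighted average of the candidates

x0 = 1.0      # the true clean pixel
noise = 0.5   # one fixed noise draw, kept constant for reproducibility

guesses = {}
for label, alpha_bar in (("light", 0.99), ("medium", 0.5), ("heavy", 1e-4)):
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    guesses[label] = best_guess_of_clean(x_t, alpha_bar)
```

Here guesses["light"] sits almost exactly at +1, while guesses["heavy"] sits near 0. In image terms, a one-jump prediction from near-pure static is a blur of everything plausible, while each small step only has to resolve a small amount of uncertainty.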

How it differs from VAE / GAN

	VAE	GAN	Diffusion
How it generates	sample a latent point, decode (one shot)	generator pass (one shot)	denoise from static (many steps)
Mechanism	learned smooth space	generator-vs-discriminator contest	repeated denoiser
Tends toward	plausible, sometimes blurry	sharp, finicky to train	high quality + variety, but slow

Text to image

Feed the text prompt alongside the noisy image; the network denoises toward a picture matching the words. “A cat on a skateboard” nudges every step toward catness + skateboardness. Same denoiser, now steered.
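A minimal sketch of steering, assuming DDPM-style sampling. The only change from unconditioned generation is that the denoiser receives the prompt as an extra input. PROMPT_TARGETS and toy_conditional_denoiser are hypothetical stand-ins: a real system would pass the prompt through a text encoder into a trained network, not look up a target image in a table.

```python
import numpy as np

NUM_STEPS = 1000
betas = np.linspace(1e-4, 0.02, NUM_STEPS)   # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Hypothetical stand-in for "what the prompt means": one tiny image per prompt.
PROMPT_TARGETS = {
    "cat": np.full((4, 4), 0.8),
    "skateboard": np.full((4, 4), -0.8),
}

def toy_conditional_denoiser(x_t, t, prompt):
    """Same job as before (guess the noise), but the guess depends on the prompt."""
    target = PROMPT_TARGETS[prompt]
    return (x_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

def generate(prompt, rng, shape=(4, 4)):
    x = rng.standard_normal(shape)                       # pure static
    for t in reversed(range(NUM_STEPS)):
        eps_hat = toy_conditional_denoiser(x, t, prompt)  # prompt nudges every step
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Same seed means the same starting static; only the prompt differs.
cat_image = generate("cat", np.random.default_rng(0))
board_image = generate("skateboard", np.random.default_rng(0))
```

Both runs start from identical static and run the identical loop, yet end at different images, because the prompt shifts the denoiser's prediction at every step.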

Tradeoffs (honest)

Strength: high-quality, varied images; the gradual refinement pays off in quality.
Cost: slow. Each image takes dozens to hundreds of denoising steps, and each step is a full network pass (vs a GAN’s single pass).
Failure mode: confidently renders plausible-but-wrong details (extra fingers, melted shapes). It matches the look of real images; it does not understand them.

Pitfalls to dodge

“It designs the image from the prompt.” No. It repeatedly removes noise, nudged by the prompt; the picture accumulates.
“The noise is incidental.” No. Noising is the method; without it there is nothing to learn to reverse.
“One pass makes the image.” No. It takes many small denoising steps, which is why it is slow.
“It understands what it draws.” No. It learned the statistical look of images, which is why it can render the impossible with confidence.

The one-line version

A diffusion model is a patient sculptor working in static: trained only to chip away noise, it reveals an image that was never there until it made it.