Cheatsheet: Generating by denoising: diffusion

diffusion = learn to REMOVE noise
forward (no learning): real image + noise + noise + ... → pure static
reverse (the network): static → denoise → denoise → ... → a NEW image

Learning to undo noise turns out to be learning to create.

Direction | What happens | Needs learning?
--- | --- | ---
Forward | Add a little noise to a real image, repeatedly, until pure static | No (trivial)
Reverse | A network predicts “one step less noisy” and removes a little | Yes (and we know the clean answers, so training is easy)
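
The two directions can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the cheatsheet does not say how the noise is added, so the sketch assumes a common variance-preserving Gaussian step with a fixed strength `beta`, and the names `add_noise_step`, `forward_chain`, and `signal_fraction` are made up here. An "image" is just a short list of numbers.

```python
import math
import random

def add_noise_step(pixels, beta, rng):
    """One forward step: shrink the signal slightly, then mix in Gaussian noise."""
    keep = math.sqrt(1.0 - beta)    # how much of the current image survives
    spread = math.sqrt(beta)        # how much fresh noise is mixed in
    return [keep * p + spread * rng.gauss(0.0, 1.0) for p in pixels]

def forward_chain(pixels, num_steps, beta, rng):
    """Return every version of the image, from clean (index 0) to noisiest."""
    chain = [list(pixels)]
    for _ in range(num_steps):
        chain.append(add_noise_step(chain[-1], beta, rng))
    return chain

def signal_fraction(num_steps, beta):
    """Fraction of the original image still present after num_steps steps."""
    return math.sqrt(1.0 - beta) ** num_steps

rng = random.Random(0)
clean = [1.0, -1.0, 0.5, 0.0]   # stand-in for a tiny "image"
chain = forward_chain(clean, num_steps=200, beta=0.05, rng=rng)

# A training pair for the reverse direction: the input is step t and the
# target is step t - 1, which the forward chain already recorded for free.
t = 120
noisy_input, less_noisy_target = chain[t], chain[t - 1]

print(signal_fraction(1, 0.05))     # close to 1: one step barely changes the image
print(signal_fraction(200, 0.05))   # close to 0: almost nothing of the original is left
```

The forward chain needs no learning at all, and every pair of neighbouring entries in `chain` is a ready-made (input, answer) example for the reverse network, which is why the table calls training easy.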

Start from pure random static (never seen before). Apply the trained denoiser many times. Each pass adds a little structure until a sharp, coherent new image emerges. Like a photo developing, or a sculptor removing all that is not the statue.
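
The generation loop described above might look like the sketch below. The "denoiser" here is not a trained network: `toy_denoise_step` is a hypothetical stand-in that nudges each value a little toward the nearer of -1 and +1, so that the loop has something to converge to. Real samplers typically also re-inject a small amount of noise at each step; that detail is left out to keep the shape of the loop visible.

```python
import random

def toy_denoise_step(pixels, rate=0.2):
    """Hypothetical stand-in for a trained denoiser: nudge each value a
    little toward the nearer of -1 and +1 (this toy's idea of a clean image)."""
    out = []
    for p in pixels:
        target = 1.0 if p >= 0.0 else -1.0
        out.append(p + rate * (target - p))
    return out

def sample(num_pixels, num_steps, rng):
    """Start from pure static and apply the denoiser repeatedly."""
    image = [rng.gauss(0.0, 1.0) for _ in range(num_pixels)]
    for _ in range(num_steps):
        image = toy_denoise_step(image)   # each pass adds a little structure
    return image

image_a = sample(num_pixels=8, num_steps=50, rng=random.Random(1))
image_b = sample(num_pixels=8, num_steps=50, rng=random.Random(2))
print([round(v) for v in image_a])
print([round(v) for v in image_b])
```

No single pass produces the result; each one only moves the values a fifth of the way toward something clean, and different starting static can settle into a different pattern.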

Denoising static into a finished image in one jump is a brutally hard prediction (a network does it poorly). Removing a little noise is a gentle, well-defined prediction (a network does it well). Diffusion replaces one impossible step with a long chain of easy ones.

 | VAE | GAN | Diffusion
--- | --- | --- | ---
How it generates | sample a latent point, decode (one shot) | generator pass (one shot) | denoise from static (many steps)
Mechanism | learned smooth space | generator-vs-discriminator contest | repeated denoiser
Tends toward | plausible, sometimes blurry | sharp, finicky to train | high quality + variety, but slow

Feed the text prompt alongside the noisy image; the network denoises toward a picture matching the words. “A cat on a skateboard” nudges every step toward catness + skateboardness. Same denoiser, now steered.
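
A toy sketch of steering, under heavy assumptions: `PROMPT_TARGETS` is a hypothetical lookup with made-up values that stands in for a text encoder and for everything a real network would have learned about each prompt, and `steered_denoise_step` is an illustrative name, not a real API. The point is only the wiring: the prompt is an extra input to every denoising step.

```python
import random

# Hypothetical lookup standing in for a text encoder plus everything the
# network learned about each prompt; the values are made up for illustration.
PROMPT_TARGETS = {
    "cat": [1.0, 1.0, -1.0, -1.0],
    "skateboard": [-1.0, 1.0, -1.0, 1.0],
}

def steered_denoise_step(pixels, prompt, rate=0.2):
    """A small denoising move whose direction depends on the prompt."""
    target = PROMPT_TARGETS[prompt]
    return [p + rate * (t - p) for p, t in zip(pixels, target)]

def sample(prompt, num_steps, rng):
    """Start from static and denoise, feeding the prompt in at every step."""
    image = [rng.gauss(0.0, 1.0) for _ in range(len(PROMPT_TARGETS[prompt]))]
    for _ in range(num_steps):
        image = steered_denoise_step(image, prompt)
    return image

cat = sample("cat", 50, random.Random(0))
board = sample("skateboard", 50, random.Random(0))   # same static, different prompt
print([round(v) for v in cat])
print([round(v) for v in board])
```

Both calls start from identical static and run the same loop; only the prompt differs, and it pulls each small step in a different direction. Unlike a real model, this toy collapses to one fixed picture per prompt, so it shows the steering but not the variety.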

  • Strength: high-quality, varied images; the gradual refinement pays off in quality.
  • Cost: slow, dozens to hundreds of steps per image (vs a GAN’s single pass).
  • Failure mode: confidently renders plausible-but-wrong details (extra fingers, melted shapes); it matches the look of real images, it does not understand them.

Common misconceptions:

  • “It designs the image from the prompt.” No. It repeatedly removes noise, nudged by the prompt; the picture accumulates.
  • “The noise is incidental.” No. Noising is the method; without it there is nothing to learn to reverse.
  • “One pass makes the image.” No. Many small denoising steps (why it is slow).
  • “It understands what it draws.” No. It learned the statistical look of images, which is why it can render the impossible with confidence.

A diffusion model is a patient sculptor working in static: trained only to chip away noise, it reveals an image that was never there until it made it.