Practice: Generating by denoising: diffusion

Self-check

Six short questions. Try to answer each one in your head (or on paper) before opening the collapsible. Active retrieval is where the learning sticks; rereading feels productive but does much less.

1. What happens in the forward direction of diffusion, and why does it need no learning?

Show answer

You take a real image and add a little random noise, step by step, until nothing of the original remains and you are left with pure static. Adding noise is a fixed, mechanical recipe (draw random numbers and mix them in), so there is nothing to learn and no network is needed for this part. It is just a controlled way of destroying an image in small, even increments.
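If it helps to see the forward direction concretely, here is a minimal Python sketch on a made-up five-pixel "image". The function names, the `keep` fraction, the mixing rule, and the step count are all illustrative assumptions rather than anything from the lesson; real systems use carefully tuned noise schedules. The point to notice is that nothing in it is trained.

```python
import random

def add_noise_step(pixels, keep, rng):
    """One forward step: keep most of each pixel and mix in a little fresh random noise."""
    # Assumed mixing rule: chosen so the final static has roughly unit spread.
    noise_scale = (1.0 - keep ** 2) ** 0.5
    return [keep * p + noise_scale * rng.gauss(0.0, 1.0) for p in pixels]

def forward_process(pixels, steps=50, keep=0.9, seed=0):
    """Return every stage, from the clean image (index 0) to near-pure static (last)."""
    rng = random.Random(seed)
    stages = [list(pixels)]
    for _ in range(steps):
        stages.append(add_noise_step(stages[-1], keep, rng))
    return stages

image = [0.0, 0.5, 1.0, 0.5, 0.0]   # a made-up five-pixel "image"
stages = forward_process(image)
remaining = 0.9 ** 50               # fraction of the original signal left after 50 steps
print(f"{len(stages) - 1} steps; about {remaining:.1%} of the original survives")
```

Each step multiplies what is left of the original by `keep`, so after many steps almost none of it survives, and the last stage is essentially static.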

2. What single task is the network trained to do in the reverse direction, and why do we have perfect answers to train against?

Show answer

Given a slightly noisy image, predict what it looked like one noising step earlier; in other words, remove a little noise. Because we made the noisy images ourselves in the forward direction, we know exactly what the cleaner version was, so we have perfect training targets for every step.
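One way to picture where the perfect answers come from: while noising an image, write down each (noisier, cleaner) pair as you go. The sketch below does only that bookkeeping. The helper name `make_training_pairs` and the mixing rule are invented for illustration, and practical training recipes differ in detail, for example in exactly what the network is asked to predict.

```python
import random

def make_training_pairs(clean, steps=10, keep=0.9, seed=0):
    """Noise an image step by step, recording (input, target) = (noisier, cleaner) each time."""
    rng = random.Random(seed)
    noise_scale = (1.0 - keep ** 2) ** 0.5   # assumed mixing rule, for illustration only
    pairs = []
    current = list(clean)
    for _ in range(steps):
        noisier = [keep * p + noise_scale * rng.gauss(0.0, 1.0) for p in current]
        pairs.append((noisier, current))     # we know the target because we made the input
        current = noisier
    return pairs

clean = [0.0, 0.5, 1.0, 0.5, 0.0]            # a made-up five-pixel "image"
pairs = make_training_pairs(clean)
print(len(pairs), "training pairs from one image")
```

The target of the first pair is the clean image itself, and the target of every later pair is the input of the pair before it, so every step comes with an exact answer.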

3. Once the denoiser is trained, how do you generate a brand-new image?

Show answer

Start with a fresh screen of pure random static the network has never seen, ask it to remove a little noise, feed its slightly-cleaner output back in, and repeat. Step by step more structure appears, until a sharp, coherent, brand-new image emerges from the static.
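The generation loop is short enough to sketch. Below, `toy_denoiser` is a hypothetical stand-in for the trained network: it knows just two made-up three-pixel "plausible images" and moves the input a small fraction toward whichever is nearer. A real denoiser is a learned network that has absorbed vastly more than two patterns; only the loop structure is meant to carry over.

```python
import random

# Two made-up "plausible images"; purely illustrative.
PATTERNS = [[-1.0, 1.0, -1.0], [1.0, -1.0, 1.0]]

def toy_denoiser(image, strength=0.1):
    """Hypothetical stand-in for a trained network: remove a little 'noise'
    by moving each pixel a small fraction toward the nearest plausible pattern."""
    nearest = min(PATTERNS, key=lambda pat: sum((p - q) ** 2 for p, q in zip(image, pat)))
    return [p + strength * (q - p) for p, q in zip(image, nearest)]

def generate(seed, steps=150):
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in range(3)]   # fresh static
    for _ in range(steps):
        image = toy_denoiser(image)                   # feed the slightly cleaner output back in
    return [round(p, 3) for p in image]

for seed in range(4):
    print(seed, generate(seed))
```

No single call does much, but the repeated small nudges always end on one of the patterns the denoiser knows, and in this toy the starting static alone decides which one.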

4. Why creep along in many small steps instead of denoising the static into a finished image in one jump?

Show answer

One jump is a brutally hard prediction (guess an entire detailed image from meaningless static), which networks do poorly. Removing a little noise is a gentle, well-defined prediction networks do well. Diffusion replaces one impossible step with a long chain of easy ones; the patience is what buys the quality.

5. How does a text prompt fit into the process?

Show answer

The prompt is fed alongside the noisy image at every step, and the network is trained to denoise toward an image that matches the words. Ask for “a cat riding a skateboard” and each denoising step nudges the emerging picture toward catness and skateboardness. The same denoising machinery, now steered by a description, turns a sentence into a picture.
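As a rough picture of "fed alongside at every step", here is a toy sketch in which the prompt is a second input to the denoiser on each pass. Everything in it is a made-up assumption: the `LOOKS` table, the word averaging, and the function names stand in for what a real model learns from captioned images. The toy also has exactly one look per prompt, so it always lands in the same place, whereas a real model has many images that match the same words.

```python
import random

# Made-up three-pixel "looks" for two words; purely illustrative.
LOOKS = {"cat": [1.0, 0.0, 0.0], "skateboard": [0.0, 0.0, 1.0]}

def prompt_target(prompt):
    """Average the looks of the known words in the prompt (a crude stand-in for text understanding)."""
    words = [w for w in prompt.lower().split() if w in LOOKS]
    if not words:
        raise ValueError("prompt contains no word this toy knows")
    return [sum(LOOKS[w][i] for w in words) / len(words) for i in range(3)]

def conditioned_denoiser(image, prompt, strength=0.1):
    """Hypothetical stand-in: the prompt sets the direction of each small nudge."""
    target = prompt_target(prompt)
    return [p + strength * (t - p) for p, t in zip(image, target)]

def generate(prompt, seed=0, steps=150):
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in range(3)]      # fresh static
    for _ in range(steps):
        image = conditioned_denoiser(image, prompt)      # the same prompt goes in at every step
    return [round(p, 3) + 0.0 for p in image]            # + 0.0 turns a possible -0.0 into 0.0

print(generate("a cat riding a skateboard"))
print(generate("cat"))
```

The loop is the same one used for unconditioned generation; the only change is that the denoiser receives the words on every pass, so every nudge leans toward them.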

6. Why can a diffusion model confidently render something physically impossible, like an extra finger?

Show answer

It is producing a plausible-looking arrangement of pixels learned from data, not reasoning about how hands actually work. It matches the statistical look of real images rather than understanding the world, so a physically impossible result can still look locally plausible to the network.

Try it yourself: order the process, then predict the behavior

No math here. About 15 minutes of reasoning and writing.

Side effects: none. This is a thinking-and-writing exercise. No tools, no API calls, no costs.

Part A: put the stages in order.

Here are six stages, scrambled. Arrange them into the correct sequence, from “we have a training set” to “a new image appears.”

(a) Feed the denoiser its own slightly-cleaner output and ask it to denoise again.
(b) Take real images and add noise step by step until each is pure static.
(c) Start generation from a fresh screen of pure random static.
(d) Train a network to undo one noising step, using the clean versions we kept as answers.
(e) Repeat the denoising many times until a coherent image emerges.
(f) Ask the trained denoiser to remove a little noise from the static.

Show answer

b → d → c → f → a → e.

First wreck real images to make training pairs (b), then train the network to undo one step (d). To generate: start from pure static (c), denoise once (f), feed the output back and denoise again (a), and repeat many times until an image emerges (e). Stages (f) and (a) are the same operation, a single denoising pass; (f) is its first application, and the repetition in (e) is the engine.

Part B: predict three things you would notice.

You sit down with a text-to-image tool built on diffusion. Based only on how diffusion works, predict three things you would observe, and give the one-line reason for each.

Show a model answer

It is noticeably slow, often taking seconds per image. Each picture is many denoising steps, not one.
The same prompt gives different images each time. Every run starts from different random static and develops differently.
It can render something subtly wrong with full confidence (an extra finger, a melted railing). It produces a plausible-looking arrangement of pixels learned from data, not a model of how hands or railings really work.

If you named “slow because many steps,” “varied because random starting static,” and “confident errors because it matches data rather than understanding,” you have the lesson.

Flashcards

Ten cards.

Q. What happens in the forward direction of diffusion?

A real image has noise added step by step until it dissolves into pure random static. No learning is needed; adding noise is a fixed, mechanical recipe.

Q. What is the network trained to do in the reverse direction?

Take a slightly noisy image and predict the slightly-cleaner version from one step earlier: remove a little noise.

Q. Why do we have perfect training answers for the denoiser?

Because we created the noisy images ourselves in the forward direction, so we know exactly what each cleaner version was.

Q. How does a trained diffusion model generate a brand-new image?

Start from pure random static, apply the denoiser, feed the output back, and repeat many times. Structure accumulates until a coherent new image emerges.

Q. Why many small denoising steps instead of one big jump?

One jump (guess a full image from static) is a prediction networks do poorly. Removing a little noise is easy and well-defined. Diffusion swaps one impossible step for a chain of easy ones.

Q. How does diffusion differ from a VAE and a GAN in how it generates?

It generates gradually over many denoising steps. There is no generator-versus-discriminator contest (unlike a GAN) and no single latent point decoded in one pass (unlike a VAE), just a denoiser applied many times.

Q. Why are diffusion models often slow?

A single image takes dozens or hundreds of denoising steps, whereas a GAN makes an image in one pass. Quality is bought with patience.

Q. How does a text prompt steer a diffusion model?

The prompt is given alongside the noisy image at each step, and the network denoises toward an image that matches the words.

Q. Why can a diffusion model confidently produce something physically impossible?

It renders a plausible-looking arrangement of pixels learned from data, not a model of how the world works, so a locally plausible but impossible result can slip through.

Q. What is the one-line intuition for why denoising can create?

A network trained only to clean up noise, run from pure static, has nowhere to land except on a plausible image, because plausible images are the only thing its training taught it to produce.