Lesson: Generating by denoising: diffusion

Last lesson gave us two ways to make a network generate: the VAE, which samples from a learned space, and the GAN, which learns through a contest. They were the classics. But most of today’s jaw-dropping image generators work by a third idea, one that sounds almost too strange to work. They start from a screen of pure random noise, the visual static of an untuned television, and remove that noise a little at a time until a clear, detailed image rises out of it.

That is a diffusion model, and this lesson is about why “learning to remove noise” turns out to be the same thing as “learning to create.” It is the most counterintuitive idea in the track, and once it clicks, it is also one of the most satisfying.

The trick: learn to undo a mess you made on purpose

The clever move behind diffusion is to set up an easy problem whose solution happens to be generation. It comes in two directions.

The forward direction: wreck an image, slowly. Take a real photo, say of a cat, and add a tiny bit of random noise. It looks almost the same, just slightly grainy. Add a little more. Grainier. Keep going, step after step, and the cat dissolves into speckle, until after enough steps nothing of the cat remains and you are left with pure random static. This part needs no learning at all; adding noise is trivial. It is just a controlled way of destroying an image in small, even increments.
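The forward direction can be sketched in a few lines. This is a minimal illustration, not any particular system's implementation: the 8x8 gradient standing in for a photo, the constant per-step noise amount `beta`, and the function names are all assumptions made for the example. Each step shrinks the image very slightly before adding Gaussian noise, which is one common way to keep pixel values from growing as noise piles up.

```python
import numpy as np

def add_noise_step(image, beta, rng):
    """One forward step: shrink the image slightly and mix in Gaussian noise."""
    noise = rng.standard_normal(image.shape)
    return np.sqrt(1.0 - beta) * image + np.sqrt(beta) * noise

def forward_process(image, num_steps=1000, beta=0.01, seed=0):
    """Return the image at every stage, from untouched to (nearly) pure static."""
    rng = np.random.default_rng(seed)
    trajectory = [image]
    x = image
    for _ in range(num_steps):
        x = add_noise_step(x, beta, rng)
        trajectory.append(x)
    return trajectory

# A stand-in "photo": an 8x8 grayscale gradient with values in [-1, 1].
image = np.linspace(-1.0, 1.0, 64).reshape(8, 8)
trajectory = forward_process(image)

def correlation_with_original(x):
    """How much of the original picture is still visible in x (1.0 = all of it)."""
    return float(np.corrcoef(image.ravel(), x.ravel())[0, 1])

for step in (0, 1, 10, 100, 1000):
    print(step, round(correlation_with_original(trajectory[step]), 3))
```

The printed correlation should start at 1 and drift toward 0 as the picture dissolves. Nothing here is learned; it is arithmetic plus a random number generator.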

The reverse direction: learn to undo one step. Here is where a network comes in. We train it on a simple, well-defined task: given a slightly noisy image, predict what it looked like one noising-step earlier, that is, remove a little bit of noise. Because we made the noisy images ourselves in the forward direction, we know exactly what the cleaner version was, so we have perfect answers to train against. The network just learns, over and over, to take a noisy image and make it slightly less noisy.

Notice how modest that task is. We are not asking the network to dream up a cat from nothing. We are asking it to do a small, almost janitorial job: clean up a little noise. That is a problem a network can learn well.
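To make the training setup concrete, here is a small sketch of how the (noisier, cleaner) pairs come for free from the forward direction. Everything specific in it is an assumption for illustration: the toy two-pixel "images", the constant noise amount `beta`, and above all the denoiser itself, which is a plain least-squares linear fit standing in for a real neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": two pixels that always share the same brightness,
# so every real example lies on the diagonal line pixel_1 == pixel_2.
brightness = rng.uniform(-1.0, 1.0, size=(5000, 1))
clean = np.hstack([brightness, brightness])

beta = 0.05  # noise added per forward step (assumed constant)

def noise_step(x):
    """One forward step, same recipe for every image in the batch."""
    return np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

# Run the forward direction for a few steps, then one more.
cleaner = clean
for _ in range(3):
    cleaner = noise_step(cleaner)
noisier = noise_step(cleaner)   # network input
                                # training target: `cleaner`, which we kept

# Stand-in for training: fit a linear map W so that noisier @ W ~ cleaner.
W, *_ = np.linalg.lstsq(noisier, cleaner, rcond=None)

mse_denoiser = float(np.mean((noisier @ W - cleaner) ** 2))
mse_do_nothing = float(np.mean((noisier - cleaner) ** 2))
print("denoiser error:  ", round(mse_denoiser, 4))
print("do-nothing error:", round(mse_do_nothing, 4))
```

The fitted map should beat the do-nothing baseline, because it has picked up that the two pixels tend to move together and can use that to discount some of the noise. A real denoiser learns far richer regularities, but the source of its training answers is the same: we made the mess, so we know what it looked like before.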

Now the payoff, and it is genuinely surprising. Suppose the network has gotten good at removing one step of noise. To generate a brand-new image, we do this: start with a fresh screen of pure random static, something the network has never seen, and ask it to remove a little noise. Then take its slightly-cleaner output and ask again. And again. Step by step, the network keeps denoising, and with each pass a bit more structure appears, until, after many steps, a sharp, coherent image has emerged from what began as meaningless static.

It is like watching a photograph develop, or a sculptor who swears the statue was always inside the marble and they merely removed what was not the statue. The network was only ever trained to clean up noise, but run that cleaning from pure noise and it has nowhere to land except on a plausible image, because plausible images are the only things its training ever taught it to produce. Learning to denoise, it turns out, is learning the shape of real images so well that you can summon one out of static.

This also answers a natural question: why creep along in many small steps instead of denoising the static into a finished image in one jump? Because the one-jump version is a brutally hard prediction, guessing an entire detailed image from meaningless static, the kind of task a network does poorly. Removing just a little noise is a gentle, well-defined prediction the network does well. Diffusion’s whole strategy is to replace one impossible step with a long chain of easy ones, and the patience is what buys the quality.
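The generation loop described above, many small denoising steps starting from static, can be sketched on a toy problem. The assumptions are significant and worth stating plainly: the "images" are single numbers drawn from a bell curve centered at 3.0, and the denoiser is a closed-form formula rather than a trained network, which is possible only because the data is that simple. The sketch also adds a small amount of fresh noise back after each cleanup, a detail the lesson skips but which keeps the generated samples varied rather than identical. All names and constants are hypothetical.

```python
import numpy as np

# Toy "dataset": single numbers drawn from a bell curve centered at 3.0.
DATA_MEAN, DATA_STD = 3.0, 0.5
BETA = 0.05        # noise added per forward step (assumed constant)
NUM_STEPS = 200    # enough steps that the forward direction ends in static

def marginal(t):
    """Mean and variance of the data after t forward noising steps."""
    keep = (1.0 - BETA) ** t   # fraction of the original signal variance left
    mean = np.sqrt(keep) * DATA_MEAN
    var = keep * DATA_STD**2 + (1.0 - keep)
    return mean, var

def denoise_step(x_t, t, rng):
    """Undo one noising step: best guess for step t-1 plus a little fresh noise."""
    prev_mean, prev_var = marginal(t - 1)
    _, cur_var = marginal(t)
    c = np.sqrt(1.0 - BETA)
    gain = c * prev_var / cur_var
    guess = prev_mean + gain * (x_t - c * prev_mean)
    spread = np.sqrt(prev_var * BETA / cur_var)
    return guess + spread * rng.standard_normal(x_t.shape)

def generate(num_samples, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(num_samples)   # start from pure static
    for t in range(NUM_STEPS, 0, -1):      # walk the steps backwards
        x = denoise_step(x, t, rng)
    return x

samples = generate(10_000)
print("mean of generated samples:", round(float(samples.mean()), 2))
print("std of generated samples: ", round(float(samples.std()), 2))
```

The generated numbers should cluster around 3.0 with a spread near 0.5, matching the toy dataset, even though every one of them began as static centered on zero. No single step does much; the chain as a whole does the generating.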

The VAE, the GAN, and the diffusion model all make new data, but each takes a different route, and the difference explains how they feel to use.

  • A VAE decodes a sampled point in one shot. A GAN generates in one shot from its trained generator. A diffusion model generates gradually, over many small denoising steps.
  • There is no contest here, unlike the GAN, and no single compressed latent point to sample, unlike the VAE. There is just a denoiser, applied many times.
  • That gradual refinement is why diffusion models tend to produce especially high-quality and varied images. It is also why they can be slow: where a GAN makes an image in a single pass, a diffusion model may take dozens or hundreds of denoising steps for one picture. Quality bought with patience.

One more piece explains the tools you have actually seen. If the denoiser runs from pure static with no guidance, it produces some plausible image, but not one you asked for. Modern image generators add a steering signal: alongside the noisy image, the network is also given your text prompt, and it is trained to denoise toward an image that matches the words. Ask for “a cat riding a skateboard” and at every denoising step the network nudges the emerging picture toward catness and skateboardness. The same denoising machinery, now pointed by a description, is what turns a sentence into a picture. (This builds on the idea of representing text as vectors, which the transformer track covers; here the load-bearing point is just that the prompt steers the denoising.)
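A toy sketch of that steering, under heavy assumptions: here a "prompt" is only a dictionary key that names the typical brightness of matching images, standing in for a real text encoder, and the denoiser is a closed-form formula for bell-curve data rather than a trained network. The point it illustrates is narrow: the prompt is handed to the denoiser at every step, and the same starting static develops differently depending on it.

```python
import numpy as np

# Hypothetical stand-in for a text encoder: each "prompt" just names the
# average pixel brightness of the images that match it.
PROMPT_TO_MEAN = {"a dark image": -2.0, "a bright image": 2.0}
DATA_STD = 0.3
BETA, NUM_STEPS = 0.05, 200

def marginal(t, data_mean):
    """Mean and variance of prompt-matching data after t noising steps."""
    keep = (1.0 - BETA) ** t
    return np.sqrt(keep) * data_mean, keep * DATA_STD**2 + (1.0 - keep)

def denoise_step(x_t, t, prompt, rng):
    """One denoising step, nudged toward images that match the prompt."""
    data_mean = PROMPT_TO_MEAN[prompt]   # the prompt enters at every step
    prev_mean, prev_var = marginal(t - 1, data_mean)
    _, cur_var = marginal(t, data_mean)
    c = np.sqrt(1.0 - BETA)
    guess = prev_mean + (c * prev_var / cur_var) * (x_t - c * prev_mean)
    spread = np.sqrt(prev_var * BETA / cur_var)
    return guess + spread * rng.standard_normal(x_t.shape)

def generate(prompt, shape=(4, 4), seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)       # same seed gives the same static
    for t in range(NUM_STEPS, 0, -1):
        x = denoise_step(x, t, prompt, rng)
    return x

dark = generate("a dark image")
bright = generate("a bright image")
print("dark mean:  ", round(float(dark.mean()), 2))
print("bright mean:", round(float(bright.mean()), 2))
```

Both runs begin from identical static, yet one should settle near a brightness of -2 and the other near +2. Real systems condition on a learned vector representation of the text instead of a lookup table, but the prompt plays the same role: an extra input that tilts every denoising step.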

Diffusion is the engine behind most of the image and video generation that has stunned people recently. Knowing how it works demystifies a few things you can feel when you use these tools. They are often slow, taking noticeable seconds per image, because each picture is many denoising steps, not one. They produce striking variety, because every run starts from different random static and develops differently. And they can confidently render something subtly wrong (an extra finger, a melted railing) because the network is producing a plausible-looking arrangement of pixels learned from data, not reasoning about how hands or railings really work. As with every generative model, it is matching the shape of its training data, not understanding the world, a thread that the limitations lesson in the next phase picks up directly.

Thinking the network “designs” the image from the prompt. It does not plan a picture. It repeatedly removes noise, nudged by the prompt, and an image accumulates. The result looks designed; the process is gradual refinement.

Thinking the noise is incidental. The noise is the whole method. Training is learning to undo noise; generating is undoing noise starting from pure static. Without the deliberate noising, there is nothing to learn to reverse.

Thinking one denoising pass makes the image. It is many small steps, each removing a little. The gradualness is why the results are high-quality and also why generation is slow.

Thinking diffusion understands what it draws. It learned the statistical look of real images and can render them convincingly, which is exactly why it can also render something physically impossible with total confidence.

  • Diffusion learns to remove noise. Forward: take a real image and add noise step by step until it is pure static (no learning needed). Reverse: train a network to undo one noising step, which it can learn well because we know the clean answers.
  • Generating means denoising from scratch. Start with pure random static and apply the trained denoiser many times; with each step more structure appears, until a new, coherent image emerges.
  • It generates gradually, not in one shot. Unlike the VAE and GAN, there is no contest and no single latent sample, just many denoising steps. That yields high quality and variety, at the cost of speed.
  • A text prompt steers the denoising. Feeding the description alongside the noisy image nudges each step toward a matching picture, which is how text-to-image works.

A diffusion model is a patient sculptor working in static: trained only to chip away noise, it can start from pure chaos and, step by careful step, reveal an image that was never there until it made it.

Next: we close the tour of what networks can do and turn to a different kind of learning entirely. Every model so far learned from a fixed pile of examples. The next lesson is about agents that learn by acting, trying things, getting rewards or penalties, and improving from the consequences. That is reinforcement learning.