Summary: Generating by denoising: diffusion

The VAE and the GAN were the classic ways to make a network generate. Many of today’s most striking image generators rely on a third idea, one that sounds too strange to work: start from a screen of pure random static and remove the noise a little at a time until a clear image rises out of it. That is a diffusion model, and the heart of this lesson is why “learning to remove noise” turns out to be the same thing as “learning to create.”

Core ideas

The forward direction wrecks an image on purpose. Take a real photo and add a little random noise, step after step, until nothing remains but static. This needs no learning; adding noise is trivial. It is a controlled way to destroy an image in small, even increments.
The reverse direction is a modest learned task. Train a network to take a slightly noisy image and predict the slightly cleaner version from one step earlier. Because we made the noisy images ourselves, we know the exact clean answers, so the network has perfect targets. It is learning a small, almost janitorial job: remove a little noise.
Generating means denoising from scratch. Start with fresh random static the network has never seen, ask it to remove a little noise, feed the output back, and repeat. With each pass more structure appears, until a sharp, brand-new image emerges. A network trained only to clean up noise, run from pure static, has nowhere to land except on a plausible image.
Many small steps, not one jump. Guessing a whole image from static in one shot is a prediction networks do poorly; removing a little noise is one they do well. Diffusion replaces one impossible step with a long chain of easy ones, and the patience buys the quality.
It differs from the VAE and the GAN by being gradual. No contest between two networks, no single latent point decoded in one pass, just a denoiser applied many times. That yields high quality and variety, at the cost of speed: a diffusion image can take dozens or hundreds of steps where a GAN takes one pass.
A text prompt steers the denoising. Feed the description alongside the noisy image and the network denoises toward an image that matches the words. The same machinery, now pointed by a sentence, is what turns a prompt into a picture.
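
The forward and reverse directions above can be sketched in a few lines of Python with NumPy. This is a minimal sketch, not the method of any particular system: it assumes a DDPM-style formulation, which this lesson does not specify, and every name in it (T, betas, add_noise, make_training_pair, sample, oracle_predict_noise) is illustrative. It also trains on "predict the noise that was added" rather than "predict the slightly cleaner image", a common alternative that carries the same information. Since there is no trained network here, a hand-written oracle for a one-image toy dataset stands in for it.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 200                               # number of noising steps (assumed)
betas = np.linspace(1e-4, 0.05, T)    # noise added at each step (assumed schedule)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)       # share of the original signal's variance left at step t


def add_noise(x0, t, eps):
    """Forward direction: jump straight to step t of the noising chain."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps


def make_training_pair(x0):
    """One training example. The target eps is known because we added it ourselves."""
    t = rng.integers(0, T)
    eps = rng.standard_normal(x0.shape)
    return add_noise(x0, t, eps), t, eps


def sample(predict_noise, shape):
    """Reverse direction: start from pure static, remove a little noise per step."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps_hat = predict_noise(x, t)
        # Mean of the slightly cleaner image, given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # Re-inject a little fresh noise on every step except the last.
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x


# Hypothetical stand-in for a trained network: the exact noise predictor
# for a "dataset" that contains only this one tiny image.
target = np.array([1.0, -2.0, 0.5])


def oracle_predict_noise(x, t):
    return (x - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])


generated = sample(oracle_predict_noise, target.shape)
```

With this oracle the loop ends on the single image it knows about, which is the toy version of "nowhere to land except on a plausible image". A real denoiser is trained on many images using pairs like those from make_training_pair, so different starting static lands on different images. A text-conditioned model would take the prompt as an extra input to predict_noise.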

What changes for you

Knowing how diffusion works demystifies what you feel when you use these tools. They are slow because each image is many steps, not one. They are varied because every run starts from different random static. And they can render something subtly wrong (an extra finger, a melted railing) with total confidence, because the network produces a plausible-looking arrangement of pixels learned from data, not a model of how the world really works. That last point, matching the shape of the data rather than understanding it, is the thread the next phase picks up when we turn to where deep learning breaks. But first, the next lesson leaves generation behind for a different kind of learning entirely: agents that learn by acting, trying things and improving from rewards and penalties. That is reinforcement learning.