Practice: Transformers in diffusion models for image generation

Self-check

Seven short questions. Try to answer each one before opening the collapsible.

1. What does the denoiser network actually do, in one sentence?

Show answer

It takes a noisy image (at some timestep) and any conditioning (typically a text prompt) and predicts the noise that was added. Applied iteratively, starting from pure random noise and removing a little of the predicted noise at each step, this becomes image generation.
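To make "applied iteratively" concrete, here is a minimal sketch of a DDPM-style sampling loop. It is an illustration under stated assumptions, not the lesson's reference implementation: the linear noise schedule, the step count, and the names `make_schedule`, `toy_denoiser`, and `sample` are all hypothetical, and the denoiser is an untrained stand-in that always predicts zero noise, so the output is not a meaningful image.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    # Assumed linear beta schedule; real models may use other schedules.
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_denoiser(x_t, t, cond):
    # Hypothetical stand-in for a trained network (U-Net or DiT).
    # A real denoiser would return its estimate of the noise inside x_t.
    return np.zeros_like(x_t)

def sample(denoiser, shape, cond=None, num_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(shape)          # start from pure random noise
    for t in reversed(range(num_steps)):    # walk from the noisiest step to the cleanest
        eps_hat = denoiser(x, t, cond)      # predicted noise at this timestep
        # Remove the predicted noise (DDPM posterior mean).
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # Re-inject a little fresh noise on every step except the last.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x

image = sample(toy_denoiser, shape=(8, 8, 3))
print(image.shape)
```

The only part that changes between a U-Net and a DiT model is what sits behind `denoiser`; the loop around it stays the same.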

2. What was the original backbone of diffusion models, and what kind of architecture is it?

Show answer

U-Net: a convolutional architecture with skip connections between symmetric downsampling and upsampling stages, originally designed for biomedical image segmentation. It worked well as a diffusion denoiser because convolutions preserve spatial structure and the skip connections route fine detail past the bottleneck.
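The claim that skip connections route fine detail past the bottleneck can be seen in a toy example. The sketch below is an assumption-heavy simplification: it has no learned convolutions at all, only average pooling, nearest-neighbour upsampling, and one skip concatenation, and the function names are hypothetical. A checkerboard (the finest detail possible) is erased by the bottleneck path but survives in the skip channel.

```python
import numpy as np

def downsample(x):
    # 2x2 average pooling on an (H, W, C) array.
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample(x):
    # Nearest-neighbour 2x upsampling.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def toy_unet_path(x, use_skip=True):
    skip = x                                   # full-resolution features saved for later
    bottleneck = downsample(downsample(x))     # two downsampling stages
    up = upsample(upsample(bottleneck))        # two symmetric upsampling stages
    # The skip connection concatenates the saved features along the channel axis.
    return np.concatenate([up, skip], axis=-1) if use_skip else up

rows, cols = np.indices((8, 8))
checkerboard = ((rows + cols) % 2).astype(np.float64)[..., None]   # shape (8, 8, 1)

without_skip = toy_unet_path(checkerboard, use_skip=False)
with_skip = toy_unet_path(checkerboard, use_skip=True)
print(np.allclose(without_skip, 0.5))                    # detail averaged away
print(np.array_equal(with_skip[..., -1:], checkerboard)) # detail preserved via the skip
```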

3. What does DiT do differently?

Show answer

It replaces the U-Net with a transformer. The image is broken into patches (in latent space), each patch becomes a token, and a transformer processes the sequence. The denoising step is now “given these noisy patches at this timestep, predict the noise per patch.”
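The patch-to-token step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a channels-last latent of shape (height, width, channels) and non-overlapping square patches; `patchify` and `unpatchify` are hypothetical helper names, and a real DiT would follow this with a learned linear projection and position embeddings, which are omitted here.

```python
import numpy as np

def patchify(latent, patch_size):
    # latent: (H, W, C) array; returns (num_patches, patch_size * patch_size * C).
    h, w, c = latent.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    x = latent.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)             # (gh, gw, p, p, C)
    return x.reshape(gh * gw, patch_size * patch_size * c)

def unpatchify(tokens, height, width, channels, patch_size):
    # Inverse of patchify: per-patch predictions back to an (H, W, C) array.
    gh, gw = height // patch_size, width // patch_size
    x = tokens.reshape(gh, gw, patch_size, patch_size, channels)
    x = x.transpose(0, 2, 1, 3, 4)             # (gh, p, gw, p, C)
    return x.reshape(height, width, channels)

# Illustrative sizes: a 32x32 latent with 4 channels, patch size 2.
latent = np.arange(32 * 32 * 4, dtype=np.float32).reshape(32, 32, 4)
tokens = patchify(latent, patch_size=2)
print(tokens.shape)   # 16 x 16 = 256 tokens, each of dimension 2 * 2 * 4 = 16
restored = unpatchify(tokens, 32, 32, 4, patch_size=2)
```

The transformer sees only the `(num_patches, token_dim)` sequence. Because the per-patch noise prediction has the same layout, `unpatchify` maps it back onto the latent grid.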

4. Name the three things DiT buys over U-Net.

Show answer

(1) Scaling laws transfer: predictable quality improvement with parameters/data/compute. (2) Better global structure: global attention helps compositional coherence. (3) Architectural unification: same training infra, hardware kernels, and engineering investment as the rest of the transformer stack.

5. What is MM-DiT, and what familiar pattern does it recapitulate?

Show answer

Multimodal DiT: text tokens and image patch tokens flow through the same transformer, attending to each other in every block (as in Stable Diffusion 3). It recapitulates the native-multimodal pattern from L3 (tokenize everything, put it through one transformer) on the generative side.
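A rough sketch of what "attending to each other in every block" means mechanically: text tokens and image patch tokens are concatenated into one sequence and a single attention operation runs over all of them. Everything below is an assumption made for illustration, including the single attention head, the choice of separate projection weights per modality, the random weights, and the name `joint_attention`. Timestep conditioning, normalization, and the MLP are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(text_tokens, image_tokens, w_text, w_image):
    # Project each modality with its own (assumed) weights, then join the sequences.
    q = np.concatenate([text_tokens @ w_text["q"], image_tokens @ w_image["q"]])
    k = np.concatenate([text_tokens @ w_text["k"], image_tokens @ w_image["k"]])
    v = np.concatenate([text_tokens @ w_text["v"], image_tokens @ w_image["v"]])
    # One attention over the joint sequence: every token can attend to every other.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = weights @ v
    n_text = text_tokens.shape[0]
    return out[:n_text], out[n_text:], weights

rng = np.random.default_rng(0)
dim = 16
text_tokens = rng.standard_normal((5, dim))     # 5 prompt tokens (illustrative)
image_tokens = rng.standard_normal((64, dim))   # 64 image patch tokens (illustrative)
w_text = {name: rng.standard_normal((dim, dim)) / np.sqrt(dim) for name in "qkv"}
w_image = {name: rng.standard_normal((dim, dim)) / np.sqrt(dim) for name in "qkv"}

text_out, image_out, weights = joint_attention(text_tokens, image_tokens, w_text, w_image)
print(weights.shape)   # (69, 69): rows 5.. are image tokens, columns ..5 are text tokens
```

The block `weights[5:, :5]` holds how much each image patch attends to each prompt token, and `weights[:5, 5:]` holds the reverse direction. Both are nonzero, which is the two-way flow the answer describes.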

6. Name two practical tradeoffs of DiT.

Show answer

Any two: DiT costs more per step than U-Net at small scale (it only pulls ahead as model size and compute grow); attention cost at high resolution is quadratic in the number of patches (latent diffusion mitigates this by operating in a compressed latent space); diffusion still pays an iteration tax across many denoising steps (flow-matching variants reduce the step count and pair well with DiT).
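The quadratic-attention tradeoff is easy to check with arithmetic. The sketch below counts tokens and token pairs for square images; the patch size of 2 and the 8x spatial compression of the latent encoder are illustrative assumptions, and `num_tokens` and `attention_pairs` are hypothetical helpers.

```python
def num_tokens(image_size, patch_size, latent_downsample=1):
    # Side length after (optional) latent compression, then split into patches.
    side = image_size // latent_downsample
    return (side // patch_size) ** 2

def attention_pairs(n_tokens):
    # Self-attention compares every token with every token.
    return n_tokens ** 2

for size in (256, 512, 1024):
    pixel = num_tokens(size, patch_size=2)
    latent = num_tokens(size, patch_size=2, latent_downsample=8)
    print(size, pixel, latent, attention_pairs(pixel) // attention_pairs(latent))
```

Under these assumptions, a 1024x1024 image is 262,144 tokens in pixel space but 4,096 tokens in latent space: 64 times fewer tokens and 4,096 times fewer attention pairs. Doubling the resolution still quadruples the token count either way, which is why the cost remains a tradeoff rather than a solved problem.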

7. Why is this lesson explicit that several adjacent conversations are out of scope?

Show answer

Because image generation sits beside several real conversations (use-case policy, provenance/watermarking, sector-specific standards, training-data licensing, likeness rights) that are evaluated by different methods than the technique itself. Naming them specifically as out of scope keeps the technical content focused and honest about what was deliberately not covered and where it lives.

Try it yourself: U-Net or DiT?

For each described scenario, decide whether U-Net or DiT is the better fit, and say why in one line.

A. A small open-source image model trained on a single GPU's worth of compute,
   meant to run real-time on consumer hardware.
B. A frontier text-to-image model trained on tens of thousands of GPUs, aiming
   to push state-of-the-art composition quality.
C. A team wants to apply the same architecture to image generation that they
   already use for their large text model, sharing distributed-training
   infrastructure.

Show answer

A: U-Net. At small scale, U-Net is efficient, fast per step, and well-understood. DiT’s advantages do not materialize at this scale; the per-step compute hit is a real cost without the scaling payoff.
B: DiT. This is exactly where the scaling laws and global-attention compositional benefits show up. Modern frontier image models have moved here.
C: DiT. The architectural-unification benefit dominates: same training stack, same hardware kernels, same distributed-training investment as the text model. This is one of the strongest practical reasons frontier labs adopted DiT.

The pattern: DiT pays off at scale and where the rest of the transformer stack is already in use. U-Net is still the right choice for small efficient deployments.

Try it yourself: in scope or out of scope for this lesson?

For each statement, label it IN SCOPE (a thing this lesson covers, as architecture/technique/evaluation) or OUT OF SCOPE (a thing this lesson explicitly defers to a different conversation), and identify which out-of-scope category it falls into if applicable.

A. How a transformer block computes attention over image patches in DiT.
B. Whether generative-AI-produced images should be allowed in news photography.
C. The mathematics of the latent-space diffusion noise schedule.
D. How to watermark generated images so downstream pipelines can detect them.
E. The compute-quality scaling curve of DiT at frontier scales.
F. Whether a generative system trained on art scraped from named artist
   portfolios infringes their rights.

Show answer

A: IN SCOPE. Architectural/technique content; this lesson’s primary territory.
B: OUT OF SCOPE (sector-specific standards: journalism). News organizations have their own institutions and standards; the lesson defers there.
C: IN SCOPE. Technical content. The lesson touches it only lightly, and a deeper treatment belongs in a diffusion-specific lesson, but the topic lives in the same technique/evaluation space.
D: OUT OF SCOPE (provenance and watermarking). Its own technical sub-area entangled with its own policy debate; deferred.
E: IN SCOPE. Evaluation; the lesson directly addresses the “scaling laws transfer” point.
F: OUT OF SCOPE (training-data licensing). Active legal and policy area; deferred.

The pattern from the lesson: technique, architecture, and evaluation are in; use-case policy, provenance/watermarking, sector-specific standards, training-data licensing, and likeness rights are out and live in their own forums.

Flashcards

Ten cards.

Q. What does the denoiser in a diffusion model do?

Takes a noisy image and conditioning (e.g. a text prompt) and predicts the noise that was added. Applied iteratively from pure noise, it generates an image.

Q. What was the original diffusion-model backbone?

U-Net: a convolutional architecture with skip connections, originally designed for biomedical image segmentation. Used in Stable Diffusion 1.x/2.x, DALL-E 2, Imagen.

Q. What is DiT (Diffusion Transformer)?

The transformer replacement for U-Net as the diffusion denoiser. The image (in latent space) is patchified into tokens; a transformer processes the sequence to predict the noise per patch.

Q. What three things does DiT buy over U-Net?

Scaling laws transfer (predictable quality from parameters/data/compute), better global structure (global attention vs local conv), and architectural unification with the rest of the transformer software/hardware stack.

Q. When is U-Net still the right choice?

At smaller scales where DiT’s per-step compute hit is unjustified, and where the scaling and global-attention benefits do not materialize. Efficient deployments and constrained settings.

Q. What is MM-DiT and what does it parallel?

Multimodal DiT: text and image tokens flow through one transformer, attending to each other in every block. Recapitulates the native-multimodal pattern from L3 on the generative (output) side.

Q. Name one practical tradeoff of DiT.

Any: more expensive per step at small scale; quadratic attention cost at high resolution (mitigated by latent diffusion); diffusion still requires many denoising steps (flow-matching variants reduce this).
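Why flow-matching variants can get away with fewer steps: when the path from noise to data is close to a straight line, a coarse Euler integration already lands near the target. The toy below is a sketch under strong assumptions. The "data" is a single known point, so the straight-path velocity can be written in closed form instead of being learned, and `toy_velocity` and `euler_sample` are hypothetical names. A trained model only approximates this, so real samplers still need more than one step.

```python
import numpy as np

def toy_velocity(x, t, target):
    # Assumed straight path x_t = (1 - t) * target + t * noise, so the velocity
    # (noise - target) equals (x_t - target) / t. A real model would learn this.
    return (x - target) / t

def euler_sample(velocity, x_start, target, num_steps):
    x = x_start
    ts = np.linspace(1.0, 0.0, num_steps + 1)   # integrate from t=1 (noise) to t=0 (data)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x - (t_cur - t_next) * velocity(x, t_cur, target)
    return x

rng = np.random.default_rng(0)
target = rng.standard_normal((4, 4))   # stand-in for a clean latent
noise = rng.standard_normal((4, 4))    # starting point at t=1

for steps in (1, 4, 50):
    result = euler_sample(toy_velocity, noise, target, steps)
    print(steps, np.allclose(result, target))
```

Because the path is exactly straight in this toy, the step count makes no difference to the result. The practical gain in real models comes from paths that are straighter than in standard diffusion, not perfectly straight.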

Q. What is latent diffusion?

Running diffusion in a compressed latent space rather than raw pixel space, so the number of patches/tokens stays manageable at high resolution. Standard mitigation for the quadratic attention cost.

Q. What is in scope for this lesson?

Technique, architecture, and evaluation: how the DiT backbone works, what it buys, the practical tradeoffs, how text conditioning folds in.

Q. What is explicitly out of scope?

Use-case policy, provenance and watermarking, sector-specific standards (journalism, political content, legal evidence, medical imaging), training-data licensing, and likeness rights. Each is a real and separate conversation evaluated by different methods.