Summary: Transformers in diffusion models for image generation

Modern image generation runs on diffusion, and much of its recent quality jump came from replacing the convolutional U-Net backbone with a transformer (DiT). DiT buys predictable scaling, better global structure, and shared engineering investment with the rest of the transformer stack. MM-DiT conditions generation by passing text and image tokens through a single transformer, which mirrors native multimodal on the generative side. This summary is the scan version of the full lesson, which opens Phase 3.

Core ideas

Phase 3 turns to multimodal output. Phase 2 covered inputs (encode-then-fuse, native multimodal, reasoning). This phase covers generation: images here, video next.
Diffusion denoises. Train by adding noise to images and having the model predict that noise; at inference, iteratively denoise from random noise to a coherent image, conditioned (typically) on a text prompt.
U-Net was the original denoiser (Stable Diffusion 1.x/2.x, DALL-E 2, Imagen): convolutional, with skip connections; local-receptive-field, parameter-efficient, worked well at small scales.
DiT (Diffusion Transformer) is the modern replacement. Patchify the image, treat patches as tokens, process with a transformer that predicts noise per patch.
What DiT buys: (1) scaling laws transfer (quality grows predictably with compute), (2) better global structure (attention is global, helping composition), (3) architectural unification (same training stack and hardware kernels as the rest of the transformer ecosystem).
Production examples: Stable Diffusion 3 (Stability AI), Flux (Black Forest Labs), Sora (OpenAI, video).
MM-DiT puts text and image tokens through the same transformer for conditioning; this is Stable Diffusion 3’s design. It is the same idea as native multimodal from L3, applied to the output side.
Tradeoffs: more expensive per step at small scale; quadratic attention cost at high resolution, mitigated by latent diffusion; diffusion’s many-step inference tax, reduced by flow-matching variants.
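
A minimal sketch of three of the ideas above: patchifying an image into tokens, building a noise-prediction training pair, and counting how the attention cost shrinks when diffusion runs in a smaller latent grid. Everything here is an illustrative assumption rather than any production model's code: the function names are hypothetical, the noising formula is the standard DDPM-style forward step, and the patch size of 2 and 8x-per-side latent downsampling are example values, not figures from the lesson.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W grid (a list of rows) into flattened patch tokens, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    return [
        [image[r][c] for r in range(top, top + patch) for c in range(left, left + patch)]
        for top in range(0, height, patch)
        for left in range(0, width, patch)
    ]


def add_noise(tokens, alpha_bar, rng):
    """DDPM-style forward step (assumed): x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    signal, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    eps = [[rng.gauss(0.0, 1.0) for _ in token] for token in tokens]
    noisy = [
        [signal * x + sigma * e for x, e in zip(token, noise)]
        for token, noise in zip(tokens, eps)
    ]
    return noisy, eps  # eps is the regression target for the denoiser


def attention_pairs(side, patch, downsample=1):
    """Token pairs that full self-attention scores for a square input."""
    tokens = (side // downsample // patch) ** 2
    return tokens * tokens


image = [[4 * r + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"
tokens = patchify(image, patch=2)  # 4 tokens of 4 values each
noisy, target = add_noise(tokens, alpha_bar=0.5, rng=random.Random(0))

# Illustrative sizes only: 1024-pixel side, patch 2, with and without an
# assumed 8x-per-side latent downsampling.
pixel_cost = attention_pairs(1024, patch=2)
latent_cost = attention_pairs(1024, patch=2, downsample=8)
print(len(tokens), pixel_cost // latent_cost)
```

Under these assumed sizes, downsampling each side by 8 cuts the token count by 64x and the attention pair count by 4096x, which is the arithmetic behind latent diffusion easing the quadratic cost. A real DiT would feed `noisy` plus a timestep and text conditioning into the transformer and regress its output against `target`.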

What changes for you

When you use modern image generators (Stable Diffusion 3, Flux, the latest DALL-E generation), you are seeing the U-Net-to-DiT shift in a shipped product. The cleaner composition, the better adherence to long prompts, and the predictable improvement of larger models all reflect this architectural change, among others. When you read a system’s “transformer-based architecture” claim, DiT-family is the territory. The lesson also draws a sharp scope line: technique, architecture, and evaluation are in scope; use-case policy, provenance/watermarking, sector-specific standards, training-data licensing, and likeness rights are each their own conversations, evaluated by different methods and deliberately deferred to the right forums. The next lesson takes the DiT family from images to video and unpacks what changes when a temporal dimension joins the picture.