Cheatsheet: Transformers in diffusion models for image generation

Diffusion in one row

| Phase | What happens |
|---|---|
| Training | add noise to images, denoiser learns to predict the noise |
| Inference | start from pure random noise, iteratively denoise to an image, conditioned (typically) on a text prompt |
| Core component | the denoiser network (architecture choice = this lesson’s topic) |
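
The training row can be illustrated with a minimal sketch, assuming a DDPM-style parameterization in which a noisy sample is `sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps`. The names `add_noise`, `noise_prediction_loss`, and `alpha_bar` are illustrative and not taken from the lesson. Real systems operate on image tensors and use a neural denoiser rather than plain lists.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """Forward process (assumed DDPM form): mix the clean signal with Gaussian noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between the predicted noise and the noise actually added."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]  # a toy "image" of four pixel values
xt, eps = add_noise(x0, alpha_bar=0.7, rng=rng)

# A perfect denoiser would return eps exactly, giving zero loss.
perfect_loss = noise_prediction_loss(eps, eps)
# A denoiser that always predicts "no noise" is penalized.
zero_loss = noise_prediction_loss([0.0] * len(eps), eps)
```

At inference the same denoiser is applied repeatedly, starting from pure noise, with each step removing part of the predicted noise.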

U-Net vs DiT (the central contrast)

| Aspect | U-Net (original) | DiT (modern) |
|---|---|---|
| Architecture | convolutional + skip connections | transformer over patches |
| Receptive field | local (per conv); global needs many layers | global (attention) at every layer |
| Scaling behavior | less well characterized; gains from added scale are harder to predict | follows transformer-style scaling trends (quality improves predictably with compute) |
| Best at | small/efficient deployments | frontier scale + global composition |
| Shared infra with text transformers | no | yes (architectural unification) |
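
"Transformer over patches" means the image (or latent) grid is first cut into small tiles, and each tile becomes one token. A minimal sketch, assuming a single-channel grid stored as a list of rows; the function name `patchify` and the sizes are illustrative only:

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch x patch tokens."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = [image[top + dy][left + dx]
                     for dy in range(patch) for dx in range(patch)]
            tokens.append(token)
    return tokens

# A 4 x 4 grid with values 0..15, cut into 2 x 2 patches.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch=2)
# tokens[0] == [0, 1, 4, 5]  (top-left patch)
# len(tokens) == 4           (one token per patch)
```

Because attention then runs over all tokens at once, every patch can see every other patch in a single layer, which is the "global at every layer" entry in the table.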

What DiT buys

| Benefit | What it enables |
|---|---|
| Scaling laws transfer | predictable quality from more parameters/data/compute |
| Better global structure | composition, coherence, lighting across the image |
| Architectural unification | same training stack, hardware kernels, infra investment as text/MM transformers |

Modern systems on DiT

| System | Org | Notes |
|---|---|---|
| Stable Diffusion 3 | Stability AI | adopted DiT (uses MM-DiT for text+image conditioning) |
| Flux | Black Forest Labs | DiT-family backbone |
| Sora | OpenAI | DiT extended to video (Phase 3 lesson 6) |

Text conditioning evolution

| Approach | How text fuses with image |
|---|---|
| Cross-attention (SD 1.x/2.x) | text vectors “on the side”; image features attend to them |
| MM-DiT (SD 3 era) | text + image patch tokens in one transformer, attending to each other in every block |

MM-DiT recapitulates the native-multimodal pattern (L3) on the generative side.
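
The difference between the two rows can be sketched with plain scaled dot-product attention. This is a simplified illustration, not the SD 3 implementation: it omits learned query/key/value projections, multiple heads, and timestep conditioning, and the toy vectors are made up.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention on lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

text = [[1.0, 0.0], [0.0, 1.0]]               # 2 toy text tokens
image = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]  # 3 toy image patch tokens

# Cross-attention style: image tokens query the text; text tokens are not updated.
cross_out = attend(image, text, text)

# MM-DiT style: one joint sequence, every token attends to every token.
joint = text + image
joint_out = attend(joint, joint, joint)
```

In the cross-attention call only the three image tokens receive new values. In the joint call all five tokens are updated, so the text representation is also shaped by the image at each block.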

Tradeoffs and mitigations

| Tradeoff | Mitigation |
|---|---|
| Expensive per step at small scale | use U-Net there; DiT wins at scale |
| Quadratic attention at high resolution | latent diffusion (operate in compressed latent space) |
| Many denoising steps at inference | flow-matching / rectified-flow variants reduce step count |
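
The effect of latent diffusion on the quadratic attention cost can be checked with quick arithmetic. The numbers below are assumptions chosen for illustration (a 1024 x 1024 image, an autoencoder that downsamples by 8 per side, and 2 x 2 patches); actual systems differ.

```python
def num_tokens(height, width, downsample, patch):
    """Number of patch tokens after optional spatial compression."""
    return (height // downsample // patch) * (width // downsample // patch)

# Patches taken directly from pixels: 512 * 512 tokens.
pixel_tokens = num_tokens(1024, 1024, downsample=1, patch=2)   # 262144
# Patches taken from an 8x-downsampled latent: 64 * 64 tokens.
latent_tokens = num_tokens(1024, 1024, downsample=8, patch=2)  # 4096

# Self-attention compares every token with every token, so cost grows with tokens squared.
attention_ratio = (pixel_tokens ** 2) // (latent_tokens ** 2)  # 4096
```

Under these assumptions, working in latent space cuts the token count by 64x and the number of pairwise attention scores by 4096x.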

Scope of this lesson

| IN scope (this lesson) | OUT of scope (separate conversations) |
|---|---|
| Architecture / technique | Use-case policy (when synthetic images are appropriate) |
| Evaluation (FID, scaling curves, human pref) | Provenance / watermarking (C2PA, SynthID) |
| MM-DiT conditioning, latent diffusion | Sector-specific policies (journalism / political / legal / medical) |
| Tradeoffs, modern systems landscape | Training-data licensing (scraped-image IP claims) |
| | Likeness / consent for real people |