Cheatsheet: Transformers in diffusion models for image generation

| Phase | What happens |
| --- | --- |
| Training | add noise to images, denoiser learns to predict the noise |
| Inference | start from pure random noise, iteratively denoise to an image, conditioned (typically) on a text prompt |
| Core component | the denoiser network (architecture choice = this lesson’s topic) |
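
The training row above can be sketched in a few lines. This is a minimal illustration, assuming a DDPM-style noising rule with a single hypothetical `alpha_bar` value (the fraction of signal kept at a given step); the function names and the toy four-value "image" are made up for the example, and no real denoiser network is involved.

```python
import math
import random


def add_noise(x0, alpha_bar, rng):
    """Forward process (assumed DDPM-style):
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, with eps ~ N(0, 1)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
        for x, e in zip(x0, eps)
    ]
    return x_t, eps


def noise_prediction_loss(predicted_eps, true_eps):
    """Mean squared error between the predicted and the true noise."""
    return sum((p - t) ** 2 for p, t in zip(predicted_eps, true_eps)) / len(true_eps)


rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]  # toy "image" flattened to four values
x_t, eps = add_noise(x0, alpha_bar=0.7, rng=rng)

# A perfect denoiser would output eps exactly, giving zero loss.
perfect_loss = noise_prediction_loss(eps, eps)

# A denoiser that always predicts "no noise" pays the mean of eps squared.
zero_loss = noise_prediction_loss([0.0] * len(eps), eps)
```

In actual training the prediction would come from the denoiser network given `x_t`, the timestep, and the conditioning; the loss shown here is only the target that network is trained toward.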

| Aspect | U-Net (original) | DiT (modern) |
| --- | --- | --- |
| Architecture | convolutional + skip connections | transformer over patches |
| Receptive field | local (per conv); global needs many layers | global (attention) at every layer |
| Scaling behavior | unclear / less predictable | transformer scaling laws apply |
| Best at | small/efficient deployments | frontier scale + global composition |
| Shared infra with text transformers | no | yes (architectural unification) |
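
"Transformer over patches" means the image (or latent) grid is first cut into small square patches, and each patch becomes one token. A rough sketch, assuming a single-channel grid stored as a list of rows and a hypothetical `patchify` helper; real DiT implementations also apply a learned linear projection and positional embeddings, which are omitted here.

```python
def patchify(image, patch_size):
    """Split an H x W grid into flattened patch_size x patch_size patches,
    returned in row-major patch order (one list per token)."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            tokens.append(patch)
    return tokens


# A 4 x 4 toy grid with values 0..15, cut into 2 x 2 patches.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch_size=2)
# tokens holds 4 patches of 4 values each; the first is [0, 1, 4, 5].
```

Once the image is a token sequence, every attention layer can relate any patch to any other patch, which is the "global at every layer" receptive field in the table.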

| Benefit | What it enables |
| --- | --- |
| Scaling laws transfer | predictable quality from more parameters/data/compute |
| Better global structure | composition, coherence, lighting across the image |
| Architectural unification | same training stack, hardware kernels, infra investment as text/MM transformers |

| System | Org | Notes |
| --- | --- | --- |
| Stable Diffusion 3 | Stability AI | adopted DiT (uses MM-DiT for text+image conditioning) |
| Flux | Black Forest Labs | DiT-family backbone |
| Sora | OpenAI | DiT extended to video (Phase 3 lesson 6) |

| Approach | How text fuses with image |
| --- | --- |
| Cross-attention (SD 1.x/2.x) | text vectors “on the side”; image features attend to them |
| MM-DiT (SD 3 era) | text + image patch tokens in one transformer, attending to each other in every block |

MM-DiT recapitulates the native-multimodal pattern (L3) on the generative side.
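
The difference between the two conditioning approaches can be sketched with a toy attention function. This is a deliberately simplified illustration: it assumes a single head with identity query/key/value projections, and the token values are invented. Real blocks use learned projections, multiple heads, and further machinery that is left out here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention with identity projections
    (a simplification: keys and values are the same vectors)."""
    dim = len(keys_values[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, keys_values)) for i in range(dim)]
        )
    return outputs


text_tokens = [[1.0, 0.0], [0.0, 1.0]]                   # toy text embeddings
image_tokens = [[0.5, 0.5], [1.0, -1.0], [-1.0, 0.0]]    # toy patch embeddings

# Cross-attention style: image tokens query the text; the text is not updated.
cross_out = attend(image_tokens, text_tokens)

# MM-DiT style: one joint sequence, every token attends to every token.
joint = text_tokens + image_tokens
joint_out = attend(joint, joint)
text_out = joint_out[: len(text_tokens)]
image_out = joint_out[len(text_tokens):]
```

In the joint version the text representations are themselves updated by what the image tokens contain, which is what "attending to each other in every block" adds over text that sits on the side.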

| Tradeoff | Mitigation |
| --- | --- |
| Expensive per step at small scale | use U-Net there; DiT wins at scale |
| Quadratic attention at high resolution | latent diffusion (operate in compressed latent space) |
| Many denoising steps at inference | flow-matching / rectified-flow variants reduce step count |
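
To make the quadratic-attention row concrete, here is a back-of-the-envelope calculation. The numbers are illustrative assumptions, not figures from this lesson: a 1024 x 1024 image, a patch size of 2, and an autoencoder that shrinks each spatial side by a factor of 8. The helper names are hypothetical.

```python
def num_tokens(height, width, patch_size, downsample=1):
    """Number of patch tokens after optional spatial downsampling."""
    h = height // downsample
    w = width // downsample
    return (h // patch_size) * (w // patch_size)


def attention_pairs(n_tokens):
    """Self-attention compares every token with every token: n^2 pairs."""
    return n_tokens * n_tokens


# Attention directly over pixels (no compression).
pixel_tokens = num_tokens(1024, 1024, patch_size=2)                 # 262144

# Attention over a latent grid, assuming 8x downsampling per side.
latent_tokens = num_tokens(1024, 1024, patch_size=2, downsample=8)  # 4096

# How many times fewer token pairs the latent version has to score.
ratio = attention_pairs(pixel_tokens) // attention_pairs(latent_tokens)  # 4096
```

Under these assumptions the token count drops by 64x and the pairwise attention work by 64 squared, which is why operating in a compressed latent space is the standard mitigation.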

| IN scope (this lesson) | OUT of scope (separate conversations) |
| --- | --- |
| Architecture / technique | Use-case policy (when synthetic images are appropriate) |
| Evaluation (FID, scaling curves, human pref) | Provenance / watermarking (C2PA, SynthID) |
| MM-DiT conditioning, latent diffusion | Sector-specific policies (journalism / political / legal / medical) |
| Tradeoffs, modern systems landscape | Training-data licensing (scraped-image IP claims) |
| | Likeness / consent for real people |