Diffusion image generation: brief

What you’ll learn

This is lesson 5 of Track 24, the opener of Phase 3 (Generative multimodal models). By the end you will be able to explain how transformer backbones (DiT, the Diffusion Transformer) replaced the U-Net as the denoiser in modern image-generation diffusion models, what that shift buys, and how text conditioning has folded back into the same transformer machinery through MM-DiT. The one capability to walk away with: given an image-generation system, identify whether its denoiser is U-Net-style or DiT-style, and predict what that means for scaling, for composition, and for how much engineering investment the system shares with the rest of the transformer ecosystem.

The lesson maps to Sayak Paul’s CS25 V5 guest lecture (May 27, 2025); full attribution is in this lesson’s references.

Where this fits

This lesson opens Phase 3 by turning from input-side multimodality (Phase 2: encode-then-fuse, native, reasoning) to output-side multimodality: generation. Image generation is the natural first step; lesson 6 takes the same DiT-family architecture to video. Once Phase 3 closes, T24 turns to the Phase 4 advanced directions (JEPA, multimodal world models for science, multimodal agents in production), with the Clawdemy-authored closer in L10 synthesizing the whole track.

Before you start

Prerequisite: Lesson 1, What multimodal AI actually is (which named the single-in / multimodal-out operating mode this lesson occupies). Familiarity with transformer fundamentals (attention, tokenization) from prior tracks (T11, T13, T20) is assumed. No prior diffusion-specific background is required; the lesson includes a short recap of diffusion’s denoising loop before turning to the architecture choice.
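For orientation before the recap, the denoising loop can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the method of any real model: `toy_denoiser` and `sample` are hypothetical names, the "image" is a short list of numbers, the conditioning vector stands in for an encoded text prompt, and the update rule is a simplified deterministic step rather than a real noise schedule. The slot filled by `toy_denoiser` is where a U-Net or a DiT would sit.

```python
import random


def toy_denoiser(x, steps_left, cond):
    """Hypothetical stand-in for a learned denoiser (U-Net or DiT).

    A real denoiser is a trained network that takes the noisy sample, the
    timestep, and the text conditioning, and predicts the noise. This toy
    pretends the clean target is the conditioning vector itself, so the
    predicted noise is simply x - cond. steps_left is accepted only to
    mirror the usual (sample, timestep, conditioning) signature.
    """
    return [xi - ci for xi, ci in zip(x, cond)]


def sample(cond, num_steps=10, seed=0):
    """Iterative denoising: start from pure noise, refine step by step."""
    rng = random.Random(seed)
    # Start from Gaussian noise with the same shape as the target.
    x = [rng.gauss(0.0, 1.0) for _ in cond]
    for step in range(num_steps):
        steps_left = num_steps - step
        predicted_noise = toy_denoiser(x, steps_left, cond)
        # Remove a fraction of the predicted noise. The fraction grows as
        # steps run out, and the final step removes all that remains.
        x = [xi - ni / steps_left for xi, ni in zip(x, predicted_noise)]
    return x
```

Calling `sample([0.5, -1.0, 2.0])` starts from random values and ends, up to floating-point error, at the conditioning vector. The point of the sketch is the shape of the loop: the same denoiser is called repeatedly, and the backbone choice the lesson discusses changes what sits inside that call, not the loop around it.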

By the end, you’ll be able to

Describe diffusion image generation at the level of iterative denoising with text conditioning
Contrast U-Net and DiT backbones and name the three things DiT buys
Explain MM-DiT and how it parallels, on the output side, the native multimodal approach from L3
Identify DiT’s practical tradeoffs and their standard mitigations
Distinguish the technical territory this lesson covers from the use-case, provenance, sector-policy, training-data-licensing, and likeness-rights conversations it defers to their own forums

Time and difficulty

Read time: about 13 minutes
Practice time: about 15 minutes (a U-Net-or-DiT architectural-tradeoff exercise, an in-scope-vs-out-of-scope check that reinforces the scope-line discipline, and flashcards)
Difficulty: standard