Transformers for video generation
What you’ll learn
This is lesson 6 of Track 24, the close of Phase 3 (Generative multimodal models). By the end you will be able to explain how spacetime patches extend the image-generation DiT architecture to video and why the resulting design naturally handles temporal coherence, identify the two latent-compression strategies (spatial + temporal) that make the compute tractable, and match modern video-generation failure modes to their structural causes. The one capability to walk away with: given a video-generation system or output, identify what spacetime patches it likely uses, predict which of the current failure modes might bite, and apply the expanded out-of-scope category enumeration to identify which adjacent conversations belong elsewhere.
The lesson maps to Andrew Brown’s CS25 V5 guest lecture on Meta’s Movie Gen (June 3, 2025); full attribution is in this lesson’s references.
Where this fits
This closes Phase 3 (generative multimodal models) by extending L5’s image-generation architecture to video. Together L5 and L6 cover the generative side of multimodal AI: how transformers produce images, then how the same architectural family scales to video with the spacetime-patches design and the additional engineering demands video imposes. Phase 4 then turns to advanced multimodal directions: lesson 7 opens with JEPA and world modeling, which abandon generative pretraining for a fundamentally different objective.
Before you start
Prerequisite: Lesson 5, Transformers in diffusion models for image generation. You need the DiT framing (patchify, treat patches as tokens, transformer denoiser) and the latent-diffusion idea, because this lesson extends both to three dimensions. Familiarity with the L5 §6 scope-line discipline helps for the expanded video-specific enumeration this lesson uses.
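As a quick check on the patchify and latent-diffusion prerequisites, the sketch below counts how many tokens a transformer would see once patches span time as well as space. Everything in it is an illustrative assumption rather than a value from the lecture or from Movie Gen: the function name `spacetime_token_count`, the 128-frame 512×512 clip, the 8× compression strides, and the 1×2×2 patch size are all hypothetical.

```python
def spacetime_token_count(frames, height, width,
                          temporal_stride=1, spatial_stride=1,
                          patch_t=1, patch_h=2, patch_w=2):
    """Count tokens after (assumed) latent compression and spacetime patchify.

    All sizes are hypothetical; inputs are assumed to divide evenly.
    """
    # Latent grid after an autoencoder downsamples time and space.
    latent_t = frames // temporal_stride
    latent_h = height // spatial_stride
    latent_w = width // spatial_stride
    # Each spacetime patch covers patch_t x patch_h x patch_w latent cells
    # and becomes one token.
    return (latent_t // patch_t) * (latent_h // patch_h) * (latent_w // patch_w)


# Hypothetical clip: 128 frames at 512x512.
raw = spacetime_token_count(128, 512, 512)
spatial_only = spacetime_token_count(128, 512, 512, spatial_stride=8)
spatial_and_temporal = spacetime_token_count(
    128, 512, 512, temporal_stride=8, spatial_stride=8
)

print(raw, spatial_only, spatial_and_temporal)
# Full self-attention cost grows with the square of the token count.
print((spatial_only / spatial_and_temporal) ** 2)
```

Under these assumed numbers, patchifying raw pixels gives 8,388,608 tokens, spatial compression alone gives 131,072, and adding temporal compression gives 16,384. Because full self-attention scales quadratically with sequence length, that last 8× reduction in tokens corresponds to roughly a 64× reduction in attention cost, which is the intuition behind needing both kinds of compression.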
By the end, you’ll be able to
- Explain spacetime patches and why shared attention across them produces temporal coherence
- Explain why both spatial and temporal latent compression are required
- Describe captioned video at scale as the binding training constraint
- Match current failure modes to structural causes
- Apply the seven-category out-of-scope enumeration to video-generation contexts
Time and difficulty
- Read time: about 13 minutes
- Practice time: about 15 minutes (a failure-to-cause matching exercise, an in-scope-vs-out-of-scope identification with the expanded category list, and flashcards)
- Difficulty: standard