Transformers for video generation

This is lesson 6 of Track 24, the close of Phase 3 (Generative multimodal models). By the end you will be able to explain how spacetime patches extend the image-generation DiT architecture to video and why the resulting design naturally handles temporal coherence; identify the two latent-compression strategies (spatial and temporal) that make the compute tractable; and match modern video-generation failure modes to their structural causes. The one capability to walk away with: given a video-generation system or output, identify the spacetime-patch design it likely uses, predict which of the current failure modes are likely to appear, and apply the expanded out-of-scope category enumeration to decide which adjacent conversations belong elsewhere.

The lesson maps to Andrew Brown’s CS25 V5 guest lecture on Meta’s Movie Gen (June 3, 2025); full attribution is in this lesson’s references.

This lesson closes Phase 3 by extending L5’s image-generation architecture to video. Together L5 and L6 cover the generative side of multimodal AI: how transformers produce images, then how the same architectural family scales to video through the spacetime-patches design and the additional engineering demands video imposes. Phase 4 then turns to advanced multimodal directions: lesson 7 opens with JEPA and world modeling, which abandon generative pretraining for a fundamentally different objective.

Prerequisite: Lesson 5, Transformers in diffusion models for image generation. You need the DiT framing (patchify, treat patches as tokens, transformer denoiser) and the latent-diffusion idea, because this lesson extends both to three dimensions. Familiarity with the L5 §6 scope-line discipline helps for the expanded video-specific enumeration this lesson uses.
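
As a rough illustration of what extending patchify and latent compression to three dimensions means for sequence length, here is a minimal sketch that counts the tokens a DiT-style denoiser would attend over for one clip. The function name, the clip size, the 8x compression factors, and the 2x2 spatial patch are illustrative assumptions chosen for round numbers, not figures taken from the lecture.

```python
from math import ceil


def spacetime_token_count(frames, height, width, *,
                          t_compress=1, s_compress=1,
                          patch_t=1, patch_hw=2):
    """Count spacetime-patch tokens for one clip (hypothetical helper).

    t_compress / s_compress: assumed autoencoder downsampling in time / space.
    patch_t / patch_hw: assumed patch extent in latent frames / latent pixels.
    """
    # Latent grid after the autoencoder compresses time and space.
    lat_t = ceil(frames / t_compress)
    lat_h = ceil(height / s_compress)
    lat_w = ceil(width / s_compress)
    # One token per spacetime patch of the latent grid.
    return (ceil(lat_t / patch_t)
            * ceil(lat_h / patch_hw)
            * ceil(lat_w / patch_hw))


# Illustrative clip: 128 frames at 256x256 (assumed values).
raw = spacetime_token_count(128, 256, 256)
spatial_only = spacetime_token_count(128, 256, 256, s_compress=8)
both = spacetime_token_count(128, 256, 256, t_compress=8, s_compress=8)

print(raw, spatial_only, both)  # 2097152 32768 4096
# Self-attention cost grows with the square of the token count, so adding
# temporal compression cuts attention work by (32768 / 4096) ** 2 = 64x here.
print((spatial_only / both) ** 2)  # 64.0
```

Under these assumptions, spatial compression alone still leaves tens of thousands of tokens for a short clip, which is why the lesson treats temporal compression as a second, separate requirement.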

  • Explain spacetime patches and why shared attention across them produces temporal coherence
  • Identify why spatial AND temporal latent compression are both required
  • Describe captioned video at scale as the binding training constraint
  • Match current failure modes to structural causes
  • Apply the seven-category out-of-scope enumeration to video-generation contexts
  • Read time: about 13 minutes
  • Practice time: about 15 minutes (a failure-to-cause matching exercise, an in-scope-vs-out-of-scope identification with the expanded category list, and flashcards)
  • Difficulty: standard