Transformers for video generation, in brief

What you’ll learn

This is lesson 6 of Track 24 and the close of Phase 3 (Generative multimodal models). By the end you will be able to explain how spacetime patches extend the image-generation DiT architecture to video and why the resulting design naturally handles temporal coherence; identify the two latent-compression strategies (spatial and temporal) that make the compute tractable; and match modern video-generation failure modes to their structural causes. The one capability to walk away with: given a video-generation system or output, identify the spacetime-patch design it likely uses, predict which current failure modes are likely to bite, and apply the expanded out-of-scope category enumeration to decide which adjacent conversations belong elsewhere.

The lesson maps to Andrew Brown’s CS25 V5 guest lecture on Meta’s Movie Gen (June 3, 2025); full attribution is in this lesson’s references.

Where this fits

This lesson closes Phase 3 (generative multimodal models) by extending L5’s image-generation architecture to video. Together, L5 and L6 cover the generative side of multimodal AI: how transformers produce images, then how the same architectural family scales to video through the spacetime-patches design and the additional engineering demands video imposes. Phase 4 then turns to advanced multimodal directions: lesson 7 opens with JEPA and world modeling, which abandon generative pretraining for a fundamentally different objective.

Before you start

Prerequisite: Lesson 5, Transformers in diffusion models for image generation. You need the DiT framing (patchify, treat patches as tokens, transformer denoiser) and the latent-diffusion idea, because this lesson extends both to three dimensions. Familiarity with the L5 §6 scope-line discipline helps for the expanded video-specific enumeration this lesson uses.
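
As a quick refresher on what "extending patchify to three dimensions" means in practice, here is a minimal NumPy sketch. It is illustrative only: the function name spacetime_patchify, the latent shape, and the patch sizes are assumptions chosen for the example, not values taken from Movie Gen or from the lecture.

```python
import numpy as np


def spacetime_patchify(video, pt, ph, pw):
    """Split a (T, H, W, C) array into flattened spacetime patches.

    Hypothetical helper for illustration. Returns an array of shape
    (num_tokens, pt * ph * pw * C), one row per spacetime patch.
    """
    T, H, W, C = video.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("each video axis must be divisible by its patch size")
    # Split every axis into (number of patches, patch size).
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Move the three patch-grid axes to the front, patch contents to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten the grid into a token sequence and each patch into a vector.
    return x.reshape(-1, pt * ph * pw * C)


# Assumed toy latent: 8 frames, 16x16 spatial grid, 4 channels.
latent = np.zeros((8, 16, 16, 4), dtype=np.float32)
tokens = spacetime_patchify(latent, pt=2, ph=2, pw=2)
print(tokens.shape)  # (256, 32): 4 * 8 * 8 tokens, each of size 2 * 2 * 2 * 4
```

Setting pt=1 makes every token cover a patch from a single frame, which is the image-style patchify from lesson 5 applied frame by frame; pt greater than 1 is what lets a single token span several frames.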

By the end, you’ll be able to

Explain spacetime patches and why shared attention across them produces temporal coherence
Identify why spatial AND temporal latent compression are both required
Describe why captioned video at scale is the binding training constraint
Match current failure modes to structural causes
Apply the seven-category out-of-scope enumeration to video-generation contexts

Time and difficulty

Read time: about 13 minutes
Practice time: about 15 minutes (a failure-to-cause matching exercise, an in-scope-vs-out-of-scope identification with the expanded category list, and flashcards)
Difficulty: standard