Summary: Transformers for video generation

Video generation takes the DiT family from images into three dimensions: video is patchified as spacetime cuboids, one transformer attends across all of them at once, and temporal consistency falls out of the attention mechanism naturally. The cost is a token-count explosion that latent compression in both space and time must manage, and the binding constraint on training quality is well-captioned video at scale. This summary is the scan version of the full lesson, which closes Phase 3.

Core ideas

Frame-by-frame image generation does not work for video. Independent frames flicker; identities shift; motion looks wrong. Temporal consistency is the new central problem.
Spacetime patches treat video as a 3D tensor (space + time) of small cuboids. Each cuboid is one token; a single transformer attends across all of them, and coherence across frames is what shared self-attention naturally produces.
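A minimal sketch of the patchify step, assuming a NumPy array laid out as (frames, height, width, channels). The function name `patchify_spacetime` and the patch sizes are hypothetical, chosen for illustration, and are not taken from any production system; real systems typically patchify a compressed latent rather than raw pixels.

```python
import numpy as np

def patchify_spacetime(video, pt, ph, pw):
    """Split a (T, H, W, C) video into flattened spacetime cuboids.

    Returns an array of shape (num_tokens, pt * ph * pw * C),
    one row per cuboid.
    """
    T, H, W, C = video.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    # Factor each axis into (number of patches, extent of one patch).
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Move the three patch-grid axes to the front, patch contents to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten: each row is one token covering pt frames and a ph x pw region.
    return x.reshape(-1, pt * ph * pw * C)

# Toy input: 8 frames of 32x32 RGB (illustrative sizes only).
video = np.arange(8 * 32 * 32 * 3, dtype=np.float32).reshape(8, 32, 32, 3)
tokens = patchify_spacetime(video, pt=2, ph=8, pw=8)
print(tokens.shape)  # (64, 384)
```

Each token spans two frames here, so a single attention layer over `tokens` mixes information across time and space in one pass.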
Token-count explosion. A 5-second 30fps clip is 150 frames; at, for example, 256 patch tokens per frame, that is 38,400 raw tokens before any compression. Attention cost grows quadratically with token count, so two compressions are required:
- Latent compression in space (image-latent diffusion’s idea, extended).
- Latent compression in time (several adjacent frames bundled into one temporal patch). Sora’s spacetime patches do exactly this.
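The arithmetic behind the two compressions can be sketched as follows. The 256 tokens per frame is an assumption that reproduces the 38,400 figure (150 frames × 256); the compression factors of 4 are hypothetical and do not describe any particular system.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def token_count(seconds, fps, tokens_per_frame, temporal_factor=1, spatial_factor=1):
    """Tokens in a clip after bundling frames in time and shrinking each frame in space.

    temporal_factor: frames merged into one latent frame (hypothetical value).
    spatial_factor: reduction in tokens per frame (hypothetical value).
    """
    latent_frames = ceil_div(seconds * fps, temporal_factor)
    latent_tokens_per_frame = ceil_div(tokens_per_frame, spatial_factor)
    return latent_frames * latent_tokens_per_frame

raw = token_count(5, 30, 256)
compressed = token_count(5, 30, 256, temporal_factor=4, spatial_factor=4)

# Self-attention compares every token with every other token,
# so the pair count scales with the square of the sequence length.
print(raw, raw ** 2)                # 38400 1474560000
print(compressed, compressed ** 2)  # 2432 5914624
```

Under these assumed factors the sequence shrinks about 16x while the attention pair count shrinks about 249x, which is why compressing in both space and time matters more than the token count alone suggests.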
The tokenizer is the floor and ceiling of system quality, same as in native multimodal and image generation.
Captioned video at scale is the binding data constraint. Production systems rely on automatic captioning / recaptioning pipelines; captioner quality cascades into model quality.
Production systems. Sora (OpenAI, popularized spacetime patches), Veo (Google), Movie Gen (Meta, the source lecture’s subject). All DiT-family + spacetime patches; differ in tokenizer, conditioning, post-training.
Current failure modes: physics violations, long-horizon coherence (identity drift), in-frame text, compute walls at longer durations.

What changes for you

When you see synthetic video that looks plausible, this is the architecture under it. The capability arc parallels what image generation walked from 2022 to 2024: rapid quality jumps as architecture, datasets, and compute matured. The technical evaluation instruments here are FVD (Fréchet Video Distance), motion quality metrics, and human preference studies, the same family of measures other generative work uses.

The scope line also extends for video, with two categories beyond what image generation alone raised: real-person reanimation (deepfake video; motion and action attributed to a real person, beyond static likeness) and video provenance (temporal-coherence requirements that make it technically distinct from image watermarking). Both, along with the carried-forward categories (use-case policy, sector-specific policies, training-data licensing), live in their own conversations, evaluated by different methods.

That closes Phase 3 on generative multimodal. Phase 4 turns to advanced multimodal directions, starting with JEPA’s predict-in-embedding-space alternative to generative pretraining.