References: Transformers for video generation

Source material

• Stanford CS25 V5 (June 3, 2025):
  "Transformers for Video Generation"
  Speaker: Andrew Brown (Meta GenAI; Movie Gen)
  YouTube: https://www.youtube.com/watch?v=YGHF8_tf--g
  Course site: https://web.stanford.edu/class/cs25/past/cs25-v5/
  License (lecture video): as published on Stanford's public CS25 YouTube
                           channel (link-out only)

Clawdemy provides original notes, summaries, and quizzes derived from this
material for educational purposes. All rights to the original lecture remain
with Stanford and the speaker.

What this lesson draws from each source

Andrew Brown’s CS25 V5 lecture anchors the lesson’s topic and supplies Meta’s Movie Gen as the concrete production-system example. The lecture covers Meta’s design rationale for the architecture, the training pipeline, and the engineering decisions that made the system possible at scale.
The rest is Clawdemy’s own connective tissue: the spacetime-patches framing, with Sora as the popularizing example; the explicit token-count arithmetic that motivates compression; the pinning of each failure mode to its cause; and the seven-category out-of-scope enumeration, expanded from L5 with two video-specific additions (real-person reanimation and video provenance).

Going deeper

“Video generation models as world simulators” (OpenAI Sora technical report, 2024). OpenAI’s technical writeup on Sora, which introduces the spacetime-patches design at scale. The most public-facing account of a framing that much of the field has since adopted.
“Movie Gen: A Cast of Media Foundation Models” (Meta AI, 2024). Meta’s Movie Gen paper and the more detailed technical companion to Andrew Brown’s lecture; covers the architecture, training, post-training, and evaluation in depth.
Stanford CS25 V5 schedule. The full V5 lineup; useful context for where this lecture sits relative to the image-generation lecture (L5) and the rest of the series.

Adjacent topics

Flow matching and rectified flow. As in image generation, reducing inference cost for video diffusion is increasingly a matter of cutting the number of sampling steps, and flow-matching variants pair well with DiT backbones for this purpose. Worth knowing as the practical frontier.
Video tokenizers and discrete-latent video representations. Active research area; the tokenizer’s quality sets both the floor and the ceiling for the system’s output quality, so improvements here improve the whole stack.
JEPA and predict-in-embedding-space architectures (next lesson). A fundamentally different objective from the generative pretraining used across Phases 2 and 3; opens Phase 4 by leaving the generative paradigm to ask what other objectives are possible.

Community discussion

None selected for this lesson yet. The OpenAI Sora report, the Movie Gen paper, and Andrew Brown’s CS25 lecture together are the strongest public reading. If a canonical secondary discussion surfaces, it will be added at the next review.