References: Transformers for video generation

Source material:
• Stanford CS25 V5 (June 3, 2025):
"Transformers for Video Generation"
Speaker: Andrew Brown (Meta GenAI; Movie Gen)
YouTube: https://www.youtube.com/watch?v=YGHF8_tf--g
Course site: https://web.stanford.edu/class/cs25/past/cs25-v5/
License (lecture video): as published on Stanford's public CS25 YouTube
channel (link-out only)
Clawdemy provides original notes, summaries, and quizzes derived from this
material for educational purposes. All rights to the original lecture remain
with Stanford and the speaker.
  • Andrew Brown’s CS25 V5 lecture anchors the lesson’s topic and supplies Meta’s Movie Gen as the concrete production-system example. The lecture covers Meta’s design rationale for the architecture, the training pipeline, and the engineering decisions that made the system possible at scale.
  • The following are Clawdemy’s own connective tissue: the spacetime-patches framing (centered on Sora as the popularizing example), the explicit token-count arithmetic that motivates compression, the pinning of each failure mode to its cause, and the seven-category out-of-scope enumeration, expanded from L5 with two video-specific additions (real-person reanimation and video provenance).
  • Flow matching and rectified flow. As in image generation, reducing inference cost for video diffusion is increasingly a matter of cutting the number of sampling steps, and flow-matching variants pair well with DiT backbones. Worth knowing as the practical frontier.
  • Video tokenizers and discrete-latent video representations. Active research area; the tokenizer’s quality sets both the floor and the ceiling on the system’s quality, so improvements here improve the whole stack.
  • JEPA and predict-in-embedding-space architectures (next lesson). A fundamentally different objective from the generative pretraining used across Phases 2 and 3; opens Phase 4 by leaving the generative paradigm to ask what other objectives are possible.

None selected for this lesson at the present time. The OpenAI Sora report, the Movie Gen paper, and Andrew Brown’s CS25 lecture together are the strongest public reading. If a canonical secondary discussion surfaces, it will be added at the next review.