
Cheatsheet: Transformers for video generation

| Aspect | What changes from image generation |
| --- | --- |
| Dimensions | 2D (H x W) -> 3D (H x W x T) |
| Patches | 2D image patches -> 3D spacetime cuboids (Sora-style) |
| Attention | over patches in frame -> over patches across space AND time |
| Coherence | structural assumption -> falls out of attention mechanism |
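
The "Patches" row can be made concrete with a small sketch of how a video tensor is cut into spacetime cuboids. This is an illustrative NumPy version, not any particular system's tokenizer: the function name `patchify_spacetime`, the `(T, H, W, C)` layout, and the patch sizes are all assumptions chosen for the example.

```python
import numpy as np

def patchify_spacetime(video, pt, ph, pw):
    """Split a (T, H, W, C) video into flattened spacetime cuboid tokens.

    Returns an array of shape (num_tokens, pt * ph * pw * C).
    The name and layout are hypothetical, chosen for illustration.
    """
    T, H, W, C = video.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    # Split each axis into (number of patches, extent of one patch).
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Move the three patch-grid axes to the front, patch contents to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per cuboid.
    return x.reshape(-1, pt * ph * pw * C)

video = np.arange(8 * 16 * 16 * 3, dtype=np.float32).reshape(8, 16, 16, 3)
tokens = patchify_spacetime(video, pt=2, ph=4, pw=4)
print(tokens.shape)  # (64, 96)
```

Each token here covers 2 frames and a 4 x 4 pixel window, so the 8-frame, 16 x 16 clip yields 4 x 4 x 4 = 64 tokens. Setting `pt=1` would recover ordinary per-frame 2D patches.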

| Problem | Result |
| --- | --- |
| No shared attention across frames | flicker, jitter |
| Identity not preserved | character features shift frame to frame |
| Physics not coherent | motion looks wrong |
| Lighting not consistent | distracting shifts between frames |
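
One way to see why missing cross-frame attention leads to flicker is to compare which token pairs are allowed to interact. The sketch below is a toy in plain Python, with hypothetical helper names: it builds a per-frame attention mask and a full spacetime mask, then counts the pairs that connect different frames.

```python
def frame_ids(num_frames, tokens_per_frame):
    """Frame index of every token when frames are laid out one after another."""
    return [f for f in range(num_frames) for _ in range(tokens_per_frame)]

def per_frame_mask(ids):
    """Token i may attend to token j only if both sit in the same frame."""
    return [[a == b for b in ids] for a in ids]

def spacetime_mask(ids):
    """Every token may attend to every token, across space and time."""
    return [[True for _ in ids] for _ in ids]

def cross_frame_pairs(mask, ids):
    """Count allowed (i, j) pairs whose tokens come from different frames."""
    return sum(
        1
        for i, row in enumerate(mask)
        for j, allowed in enumerate(row)
        if allowed and ids[i] != ids[j]
    )

ids = frame_ids(num_frames=3, tokens_per_frame=4)
print(cross_frame_pairs(per_frame_mask(ids), ids))  # 0
print(cross_frame_pairs(spacetime_mask(ids), ids))  # 96
```

With the per-frame mask, no information flows between frames inside the attention layer, so nothing in that layer can keep frames consistent with each other. The spacetime mask allows all 96 cross-frame pairs in this toy setup.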

| Compression | What it does | Why |
| --- | --- | --- |
| Spatial latent | image-latent-diffusion’s idea, extended | reduces tokens per frame |
| Temporal latent | bundles several adjacent frames into one temporal patch | reduces temporal token count |
| Combined effect | millions of raw tokens -> thousands | makes quadratic attention tractable |
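
The "millions -> thousands" row can be checked with back-of-the-envelope arithmetic. Every number below is an assumption picked for illustration and not taken from any named system: a 5-second, 24 fps, 256 x 256 clip, 8x spatial and 4x temporal downsampling in the latent encoder, and 2 x 2 spatial patches on the latent. Each raw pixel position is counted as one "raw token".

```python
def raw_positions(seconds, fps, height, width):
    """Pixel positions in the raw clip (one per pixel per frame)."""
    return seconds * fps * height * width

def latent_tokens(seconds, fps, height, width,
                  spatial_down=8, temporal_down=4, patch=2):
    """Token count after latent compression and spatial patching.

    All default factors are illustrative assumptions.
    """
    frames = (seconds * fps) // temporal_down
    h = height // spatial_down // patch
    w = width // spatial_down // patch
    return frames * h * w

raw = raw_positions(5, 24, 256, 256)
tok = latent_tokens(5, 24, 256, 256)
print(raw)               # 7864320
print(tok)               # 7680
print(raw // tok)        # 1024
print((raw // tok) ** 2) # 1048576
```

Under these assumptions the sequence shrinks by a factor of 1024. Full self-attention compares every token with every token, so the number of pairwise comparisons shrinks by roughly that factor squared, which is what the "Why" column means by making quadratic attention tractable.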

| Item | Detail |
| --- | --- |
| Most video has bad captions | training signal is therefore weak |
| Captions rarely describe across time | model misses “person sits down at 0:04” style prompts |
| Solution | automatic captioning + recaptioning pipelines |
| Captioner quality | cascades directly into model quality (floor for prompt adherence) |

| System | Org | Note |
| --- | --- | --- |
| Sora | OpenAI | popularized spacetime patches (2024); ~1 minute coherent clips |
| Veo | Google | Google’s video generation family |
| Movie Gen | Meta | Andrew Brown’s team; subject of the source lecture |
| Runway Gen-3 | Runway | DiT-family variant; product-focused |

| Failure | Cause |
| --- | --- |
| Physics violations | learned priors imperfect |
| Long-horizon identity drift | frontier research limit |
| In-frame text | hard for image gen; compounded across frames for video |
| Compute walls past ~10s | spacetime token count grows with duration |
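
The "compute walls" row follows from simple scaling. At a fixed resolution and frame rate the token count is proportional to clip duration, and full self-attention cost grows with the square of the token count. A minimal sketch, assuming a hypothetical 5-second baseline and ignoring every cost other than attention:

```python
def relative_attention_cost(seconds, baseline_seconds=5):
    """Pairwise-attention cost relative to a baseline clip.

    Assumes fixed resolution and frame rate, so tokens scale linearly
    with duration and attention pairs scale with duration squared.
    """
    return (seconds / baseline_seconds) ** 2

for s in (5, 10, 20, 60):
    print(s, relative_attention_cost(s))
# 5 1.0
# 10 4.0
# 20 16.0
# 60 144.0
```

Doubling the clip length quadruples the attention work, so a one-minute clip costs about 144 times the 5-second baseline in this simplified model.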

Scope of this lesson (expanded from L5’s 5 categories to 7)

| IN scope | OUT of scope |
| --- | --- |
| Architecture (spacetime patches, DiT-family) | Use-case policy (synthetic video appropriateness) |
| Latent compression (spatial + temporal) | Provenance / watermarking (general) |
| Evaluation (FVD, motion metrics, human pref) | Sector-specific (journalism, political, legal evidence) |
| Captioning pipeline as training constraint | Training-data licensing (scraped movie/TV) |
| | Likeness / consent (real people) |
| | Real-person reanimation (video-specific deepfake territory) |
| | Video provenance (temporal-coherence watermark requirements) |

Evaluation methods (this lesson’s instruments)

| Metric | What it measures |
| --- | --- |
| Training loss | denoising objective on training data |
| FVD (Fréchet Video Distance) | distance between generated-vs-real video distributions |
| Motion quality metrics | temporal consistency, motion plausibility |
| Human preference studies | end-to-end quality as humans perceive it |

These are quantitative technical instruments. Policy debates use different instruments (legal precedent, professional standards, stakeholder consultation); the scope line is operational, not rhetorical.
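
The FVD row can be illustrated with the Fréchet distance computation at its core: fit a Gaussian to each set of video features and compare means and covariances, in the squared form commonly used for FID-style metrics. The sketch below covers only that distance step. In practice the features come from a pretrained video network, which is not included here; the random arrays are stand-ins and the function name is hypothetical.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_videos, feature_dim).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # covariance matrices; clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))               # stand-in for real-video features
fake_close = rng.normal(size=(500, 8))         # same distribution as "real"
fake_far = rng.normal(loc=1.0, size=(500, 8))  # shifted distribution
print(frechet_distance(real, fake_close) < frechet_distance(real, fake_far))  # True
```

Lower is better: a generator whose feature distribution matches the real one scores near zero, while a shifted or differently spread distribution scores higher. The score depends on the feature extractor and sample size, so values are only comparable under the same evaluation setup.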