| Aspect | What changes from image generation |
|---|---|
| Dimensions | 2D (H x W) -> 3D (H x W x T) |
| Patches | 2D image patches -> 3D spacetime cuboids (Sora-style) |
| Attention | over patches in frame -> over patches across space AND time |
| Coherence | structural assumption -> falls out of attention mechanism |
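
The "Patches" row above can be made concrete with a small sketch. It cuts a video array into non-overlapping spacetime cuboids and flattens each one into a token vector. The function name `spacetime_patchify`, the `(T, H, W, C)` layout, and the patch sizes are assumptions chosen for illustration; the source does not specify how any particular system implements this step, and production models typically patchify a compressed latent rather than raw pixels.

```python
import numpy as np

def spacetime_patchify(video, pt, ph, pw):
    """Split a (T, H, W, C) video into flattened spacetime cuboids.

    Returns an array of shape (num_tokens, pt * ph * pw * C).
    Hypothetical helper for illustration only.
    """
    T, H, W, C = video.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    nt, nh, nw = T // pt, H // ph, W // pw
    # Expose the block structure: (nt, pt, nh, ph, nw, pw, C).
    blocks = video.reshape(nt, pt, nh, ph, nw, pw, C)
    # Put the three block indices first, then the within-block offsets.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per cuboid; each row mixes pixels from pt adjacent frames.
    return blocks.reshape(nt * nh * nw, pt * ph * pw * C)

# Toy input: 8 frames of 16 x 16 RGB.
video = np.arange(8 * 16 * 16 * 3, dtype=np.float32).reshape(8, 16, 16, 3)
tokens = spacetime_patchify(video, pt=2, ph=4, pw=4)
print(tokens.shape)  # (64, 96): 4 * 4 * 4 cuboids, each 2 * 4 * 4 * 3 values
```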

| Problem | Result |
|---|---|
| No shared attention across frames | flicker, jitter |
| Identity not preserved | character features shift frame to frame |
| Physics not coherent | motion looks wrong |
| Lighting not consistent | distracting shifts between frames |

| Compression | What it does | Why |
|---|---|---|
| Spatial latent | image-latent-diffusion’s idea, extended | reduces tokens per frame |
| Temporal latent | bundles several adjacent frames into one temporal patch | reduces temporal token count |
| Combined effect | millions of raw tokens -> thousands | makes quadratic attention tractable |
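
The "Combined effect" row is simple arithmetic, sketched below. Every number here (a 128-frame 512 x 512 clip, 8x spatial and 4x temporal compression, a 1 x 2 x 2 patch over the latent) is an assumed, illustrative value rather than a figure from the source or from any named system. The helper `token_count` is likewise hypothetical. Because self-attention compares every token with every other token, the reduction in attention pairs is the square of the reduction in tokens.

```python
def token_count(frames, height, width,
                spatial_factor=1, temporal_factor=1, patch=(1, 1, 1)):
    """Count tokens after optional latent compression and patchification."""
    t = frames // temporal_factor   # temporal latent compression
    h = height // spatial_factor    # spatial latent compression
    w = width // spatial_factor
    pt, ph, pw = patch              # spacetime patch size over the latent
    return (t // pt) * (h // ph) * (w // pw)

# Assumed clip: 128 frames at 512 x 512, one "token" per raw pixel position.
raw = token_count(128, 512, 512)

# Assumed compression: 8x per spatial axis, 4x in time, then 1 x 2 x 2 patches.
compressed = token_count(128, 512, 512,
                         spatial_factor=8, temporal_factor=4, patch=(1, 2, 2))

token_ratio = raw // compressed
pair_ratio = (raw * raw) // (compressed * compressed)

print(raw)          # 33554432 raw positions
print(compressed)   # 32768 tokens after compression
print(token_ratio)  # 1024x fewer tokens
print(pair_ratio)   # 1048576x fewer attention pairs
```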

| Item | Detail |
|---|---|
| Most video has bad captions | training signal is therefore weak |
| Captions rarely describe across time | model misses “person sits down at 0:04” style prompts |
| Solution | automatic captioning + recaptioning pipelines |
| Captioner quality | cascades directly into model quality (floor for prompt adherence) |

| System | Org | Note |
|---|---|---|
| Sora | OpenAI | popularized spacetime patches (2024); ~1 minute coherent clips |
| Veo | Google | Google’s video generation family |
| Movie Gen | Meta | Andrew Brown’s team; subject of the source lecture |
| Runway Gen-3 | Runway | DiT-family variant; product-focused |

| Failure | Cause |
|---|---|
| Physics violations | learned priors imperfect |
| Long-horizon identity drift | frontier research limit |
| In-frame text | hard for image gen; compounded across frames for video |
| Compute walls past ~10s | spacetime token count grows with duration |

| IN scope | OUT of scope |
|---|---|
| Architecture (spacetime patches, DiT-family) | Use-case policy (synthetic video appropriateness) |
| Latent compression (spatial + temporal) | Provenance / watermarking (general) |
| Evaluation (FVD, motion metrics, human pref) | Sector-specific (journalism, political, legal evidence) |
| Captioning pipeline as training constraint | Training-data licensing (scraped movie/TV) |
| | Likeness / consent (real people) |
| | Real-person reanimation (video-specific deepfake territory) |
| | Video provenance (temporal-coherence watermark requirements) |

| Metric | What it measures |
|---|---|
| Training loss | denoising objective on training data |
| FVD (Fréchet Video Distance) | distance between generated-vs-real video distributions |
| Motion quality metrics | temporal consistency, motion plausibility |
| Human preference studies | end-to-end quality as humans perceive it |

These are quantitative technical instruments. Policy debates use different instruments (legal precedent, professional standards, stakeholder consultation); the scope line is operational, not rhetorical.
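
The FVD row in the metrics table can be illustrated with the Fréchet distance itself: fit a Gaussian to each set of feature vectors and compare means and covariances. The sketch below is a minimal version under stated assumptions. It takes feature arrays as given, whereas real FVD first extracts those features from videos with a pretrained video network, a step omitted here. The function name `frechet_distance` and the random stand-in features are hypothetical, and the trace of the matrix square root is computed from eigenvalues to keep the example NumPy-only.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_videos, feature_dim), with feature_dim >= 2.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g. These are real and non-negative in exact
    # arithmetic; drop tiny imaginary parts and clip numerical negatives.
    eigvals = np.linalg.eigvals(cov_r @ cov_g).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g)
                 - 2.0 * trace_sqrt)

# Stand-in features: random vectors, not outputs of a real video network.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
same = rng.normal(size=(500, 8))               # same distribution as `real`
shifted = rng.normal(loc=1.0, size=(500, 8))   # mean shifted by 1 per dim

d_same = frechet_distance(real, same)
d_shifted = frechet_distance(real, shifted)
print(d_same, d_shifted)
```

A lower value means the two feature distributions are closer, so the shifted set should score noticeably worse than the matched set. The score depends on the feature extractor and the sample size, so values are only comparable under a fixed evaluation setup.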