Cheatsheet: Where multimodal AI is going
The six cross-cutting threads
| Thread | Pattern | Where it appears across the track |
|---|---|---|
| 1 | Tokenize-everything + one transformer | L3, L5, L6, L7, L9 |
| 2 | Fusion gets pushed earlier | L2 (after) -> L3 (from step 1) -> L5/L6 (MM-DiT) -> L9 (RL co-design) |
| 3 | Tokenizer is the floor and ceiling | L3 (image/audio codecs), L5 (latent VAE), L6 (spacetime tokenizer), L8 (biological “tokenization”) |
| 4 | Generative pretraining dominates; JEPA is the articulated alternative | L2-L6 generative; L7 introduces JEPA contrast |
| 5 | Capability stacks (perception + reasoning + tool use + alignment + production engineering) | L4 four-layer stack; L9 adds production engineering layer |
| 6 | Scope-line discipline (operational scope test) | L6, L7, L8, L9 (each adds a §6 boundary) |
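
As a reminder of what Thread 1 means in practice, here is a minimal toy sketch, not the tokenizer of any real model. It assumes byte-level text tokens and a hypothetical 16-entry image codebook; the names `tokenize_text`, `tokenize_image`, and `build_sequence` are illustrative only. The point is that each modality maps into its own slice of one shared ID space, so a single transformer could consume the concatenated sequence.

```python
from typing import List

TEXT_VOCAB = 256      # assumption: byte-level text tokens occupy IDs 0..255
IMAGE_CODEBOOK = 16   # assumption: toy image codebook with 16 entries


def tokenize_text(text: str) -> List[int]:
    """Byte-level text tokens in [0, TEXT_VOCAB)."""
    return list(text.encode("utf-8"))


def tokenize_image(pixels: List[int]) -> List[int]:
    """Toy quantizer: bucket each 0-255 grayscale value into one of
    IMAGE_CODEBOOK codes, then shift past the text vocabulary so the
    two modalities never share an ID."""
    return [TEXT_VOCAB + (p * IMAGE_CODEBOOK) // 256 for p in pixels]


def build_sequence(text: str, pixels: List[int]) -> List[int]:
    """Concatenate both modalities into one flat token sequence."""
    return tokenize_text(text) + tokenize_image(pixels)


sequence = build_sequence("hi", [0, 128, 255])
print(sequence)
```

Real systems replace the toy quantizer with learned codecs (Thread 3), but the shared-ID-space idea is the same.
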
The operational scope test (Thread 6 distilled)
| If the question is settled by… | It is… |
|---|---|
| Engineering instruments: benchmarks, A/B tests, FVD, latency, evaluation harness | TECHNIQUE territory (in scope for technical lessons) |
| Different instruments: legal precedent, clinical trials, sectoral policy, philosophical argument, business judgment | DIFFERENT CONVERSATION (out of scope; evaluated by other methods) |
Portable across any domain where engineering work touches non-engineering conversations.
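
The test can be read as a simple lookup on the instrument that would settle a question. The sketch below is only an illustration of that reading: the instrument lists are copied from the table above, while the function name `scope_of` and the "unclassified" fallback are assumptions added for the example.

```python
# Instruments taken from the table above, lower-cased for matching.
ENGINEERING_INSTRUMENTS = {
    "benchmark", "a/b test", "fvd", "latency", "evaluation harness",
}
OTHER_INSTRUMENTS = {
    "legal precedent", "clinical trial", "sectoral policy",
    "philosophical argument", "business judgment",
}


def scope_of(instrument: str) -> str:
    """Classify a question by the instrument that would settle it."""
    key = instrument.strip().lower()
    if key in ENGINEERING_INSTRUMENTS:
        return "technique"
    if key in OTHER_INSTRUMENTS:
        return "different conversation"
    # Assumption: anything not listed needs a human judgment call.
    return "unclassified"


print(scope_of("Latency"))
print(scope_of("clinical trial"))
```
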
What the track did NOT cover
| Topic | Where to look |
|---|---|
| Embodied AI / robotics with multimodal world models | dedicated robotics tracks |
| 3D / 4D generation (NeRF, Gaussian splatting, dynamic 3D) | sub-field with own architectures |
| Multimodal alignment safety beyond deliberative alignment | active research area; alignment-specific reading |
| Specific frontier-model technical reports (GPT-5, Claude 4, Gemini 2.5, etc.) | read the system technical reports directly |
| The economic and market story | lives in non-technical forums |
| Long-context multimodal | active engineering territory |
Where the field is going (2026 onward)
| Trajectory | Key direction |
|---|---|
| Truly native everything | text + image + audio + video as first-class citizens, as both input and output, in one model |
| More efficient training paradigms | JEPA, mixture-of-experts at extreme scale, sparsity techniques |
| Production and product co-design | RL co-design and evaluation in deployment; the engineering-informs-vs-settles discipline becomes a competitive advantage |
Next directions per interest
| If you want | Read |
|---|---|
| Architectural depth | Stanford CS25 series (continues each year); the track’s per-lesson references |
| Training paradigm depth | JEPA papers; V-JEPA + world-model literature; Yann LeCun’s group |
| Production engineering depth | Karina Nguyen’s talks; broader RL-co-design writing; AI-engineering pattern literature |
| Foundational depth | T11 (Intro to Deep Learning), T13 (Build Neural Networks from Scratch), T20 (AI Agents and Tool Use) |