
Cheatsheet: Where multimodal AI is going

| Thread | Pattern | Where it appears across the track |
| --- | --- | --- |
| 1 | Tokenize-everything + one transformer | L3, L5, L6, L7, L9 |
| 2 | Fusion gets pushed earlier | L2 (after) -> L3 (from step 1) -> L5/L6 (MM-DiT) -> L9 (RL co-design) |
| 3 | Tokenizer is the floor and ceiling | L3 (image/audio codecs), L5 (latent VAE), L6 (spacetime tokenizer), L8 (biological “tokenization”) |
| 4 | Generative pretraining dominates; JEPA is the articulated alternative | L2-L6 generative; L7 introduces JEPA contrast |
| 5 | Capability stacks (perception + reasoning + tool use + alignment + production engineering) | L4 four-layer stack; L9 adds production engineering layer |
| 6 | Scope-line discipline (operational scope test) | L6, L7, L8, L9 (each adds a §6 boundary) |
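As a rough illustration of Thread 1, the sketch below shows one common way to place several modalities in a single token sequence: give each modality its own ID range inside one shared vocabulary, then concatenate. The vocabulary sizes, the `build_offsets` and `to_shared_sequence` helpers, and the example IDs are all hypothetical; the lessons may describe different tokenizers and layouts.

```python
# Minimal sketch of "tokenize everything, then feed one transformer".
# All names and sizes here are illustrative assumptions, not values from the track.

# Hypothetical per-modality codebook sizes (real tokenizers and codecs differ).
VOCAB_SIZES = {"text": 1000, "image": 512, "audio": 256}


def build_offsets(vocab_sizes):
    """Assign each modality a contiguous ID range in one shared vocabulary."""
    offsets = {}
    start = 0
    for name, size in vocab_sizes.items():
        offsets[name] = start
        start += size
    return offsets, start  # start now equals the total shared vocabulary size


def to_shared_sequence(segments, offsets, vocab_sizes):
    """Flatten (modality, local_ids) segments into one list of shared IDs."""
    sequence = []
    for modality, local_ids in segments:
        size = vocab_sizes[modality]
        for local_id in local_ids:
            if not 0 <= local_id < size:
                raise ValueError(f"{modality} token {local_id} is out of range")
            sequence.append(offsets[modality] + local_id)
    return sequence


OFFSETS, TOTAL_VOCAB = build_offsets(VOCAB_SIZES)

# One interleaved sequence that a single transformer could consume.
example = to_shared_sequence(
    [("text", [5, 17]), ("image", [0, 511]), ("audio", [3])],
    OFFSETS,
    VOCAB_SIZES,
)
```

With this layout the model sees a single embedding table of size `TOTAL_VOCAB`, which is what lets one transformer handle every modality without modality-specific branches.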

The operational scope test (Thread 6 distilled)

| If the question is settled by… | It is… |
| --- | --- |
| Engineering instruments: benchmarks, A/B tests, FVD, latency, evaluation harness | TECHNIQUE territory (in scope for technical lessons) |
| Different instruments: legal precedent, clinical trials, sectoral policy, philosophical argument, business judgment | DIFFERENT CONVERSATION (out of scope; evaluated by other methods) |

The test is portable to any domain where engineering work touches non-engineering conversations.
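The scope test can be read as a small lookup: name the instrument that would settle the question, then see which side it falls on. The sketch below encodes only the instruments listed in this section; the `scope_of` function and the `UNCLASSIFIED` fallback are illustrative assumptions, not part of the track.

```python
# Minimal sketch of the operational scope test as a lookup.
# The instrument lists come from this cheatsheet; everything else is hypothetical.

ENGINEERING_INSTRUMENTS = {
    "benchmark",
    "a/b test",
    "fvd",
    "latency",
    "evaluation harness",
}

OTHER_INSTRUMENTS = {
    "legal precedent",
    "clinical trial",
    "sectoral policy",
    "philosophical argument",
    "business judgment",
}


def scope_of(settled_by):
    """Classify a question by the instrument that would settle it."""
    key = settled_by.strip().lower()
    if key in ENGINEERING_INSTRUMENTS:
        return "TECHNIQUE"
    if key in OTHER_INSTRUMENTS:
        return "DIFFERENT CONVERSATION"
    # Assumed fallback: instruments not listed here need a human judgment call.
    return "UNCLASSIFIED"
```

For example, `scope_of("latency")` lands in TECHNIQUE territory, while `scope_of("clinical trial")` belongs to a different conversation.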

| Topic | Where to look |
| --- | --- |
| Embodied AI / robotics with multimodal world models | dedicated robotics tracks |
| 3D / 4D generation (NeRF, Gaussian splatting, dynamic 3D) | sub-field with own architectures |
| Multimodal alignment safety beyond deliberative alignment | active research area; alignment-specific reading |
| Specific frontier-model technical reports (GPT-5, Claude 4, Gemini 2.5, etc.) | read the system technical reports directly |
| The economic and market story | lives in non-technical forums |
| Long-context multimodal | active engineering territory |

| Trajectory | Key direction |
| --- | --- |
| Truly native everything | text + image + audio + video as first-class citizens, in + out, in one model |
| More efficient training paradigms | JEPA, mixture-of-experts at extreme scale, sparsity techniques |
| Production and product co-design | RL co-design, evaluation in deployment, the engineering-informs-vs-settles discipline becomes competitive advantage |

| If you want | Read |
| --- | --- |
| Architectural depth | Stanford CS25 series (continues each year); the track’s per-lesson references |
| Training paradigm depth | JEPA papers; V-JEPA + world-model literature; Yann LeCun’s group |
| Production engineering depth | Karina Nguyen’s talks; broader RL-co-design writing; AI-engineering pattern literature |
| Foundational depth | T11 (Intro to Deep Learning), T13 (Build Neural Networks from Scratch), T20 (AI Agents and Tool Use) |