Summary: World modeling
Every vision system in this track so far has been reactive: process the current input, output an answer. World modeling makes vision predictive: given the past, predict the future. Self-driving trajectory prediction, robotics planning, video generation, model-based reinforcement learning, and climate or weather forecasting from observations are all instances. Three architectural pieces almost always appear: an observation encoder (CNN or ViT), a dynamics model (RNN, LSTM, or transformer running over time), and sometimes a decoder that maps predicted representations back to pixels. The central design trade-off is pixel-space vs latent-space prediction. Pixel-space prediction is directly useful for video generation but expensive: a single frame holds hundreds of thousands of values, and a multi-frame rollout holds millions. Latent-space prediction is roughly two to three orders of magnitude cheaper for the same horizon and sufficient for most planning uses, with decoding to pixels at the end when needed.
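The three pieces can be sketched in a few lines. This is a toy illustration only, not any specific published architecture: the names `encode`, `step`, `decode`, and `rollout` are hypothetical, the weights are random stand-ins for learned parameters, and plain linear maps replace the CNN/ViT encoder and the recurrent or transformer dynamics model.

```python
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

OBS_DIM, LATENT_DIM = 12, 4  # toy sizes; a real frame has far more values


def random_matrix(rows, cols, scale=0.1):
    """Hypothetical stand-in for learned weights."""
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


W_enc = random_matrix(LATENT_DIM, OBS_DIM)
W_dyn = random_matrix(LATENT_DIM, LATENT_DIM)
W_dec = random_matrix(OBS_DIM, LATENT_DIM)


def encode(observation):
    """Encoder: observation -> compact latent."""
    return matvec(W_enc, observation)


def step(latent):
    """Dynamics model: latent at time t -> predicted latent at time t + 1."""
    return matvec(W_dyn, latent)


def decode(latent):
    """Optional decoder: latent -> observation-sized prediction."""
    return matvec(W_dec, latent)


def rollout(observation, horizon, decode_last=False):
    """Encode once, predict `horizon` steps in latent space, decode only if asked."""
    latent = encode(observation)
    latents = []
    for _ in range(horizon):
        latent = step(latent)
        latents.append(latent)
    final_frame = decode(latents[-1]) if decode_last and latents else None
    return latents, final_frame


observation = [random.random() for _ in range(OBS_DIM)]
latents, final_frame = rollout(observation, horizon=5, decode_last=True)
print(len(latents), len(latents[0]), len(final_frame))  # 5 4 12
```

Note that the rollout loop never touches observation-sized vectors: the decoder runs once, at the end, which is where the cost advantage of latent-space prediction comes from.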
Core ideas
- Reactive → predictive. Every CV system so far processed the current input and answered. World modeling adds the predictive question: given the past, predict the future. Self-driving prediction, robotics planning, and video generation are variants.
- Three architectural pieces. Encoder (compact representation from observation), dynamics model (predicts how representation evolves over time, sometimes conditioned on actions), decoder (optional, maps representations back to pixels).
- Pixel vs latent prediction trade-off. Pixel space: directly useful for video generation but expensive. Latent space: dramatically cheaper, sufficient for planning. Body calculation: a 224×224×3 frame is 150,528 numbers; 30 frames is about 4.5M; a latent of dimension 512 over 30 frames is 15,360. Ratio: ~294x cheaper. Practice calculation: a 512×512×3 frame is 786,432 numbers; 96 frames is about 75.5M; a latent of dimension 256 over 96 frames is 24,576. Ratio: ~3,072x cheaper. The ratio depends only on frame size versus latent size (the frame count cancels), so it sharpens at higher resolution; a longer horizon scales the absolute number of values saved.
- Landmark architectures. World Models (Ha & Schmidhuber 2018; VAE + RNN + controller, the early influential pattern). Dreamer family (Hafner et al. 2019-2023; recurrent world model + policy trained from imagined rollouts; DreamerV3 works across a broad range of tasks with the same hyperparameters). MuZero (Schrittwieser et al. 2019; tree search + learned dynamics predicting state, reward, and policy; played Go, chess, shogi, and Atari without access to the rules). JEPA / V-JEPA (LeCun's research direction; predict in latent space, not pixels; V-JEPA extends this to spatio-temporal masked prediction). Video world models (Sora, Genie, 2024; production-scale video generation framed as world modeling).
- What world models actually predict varies: the next observation directly; the next observation given an action (for planning); state plus reward (for RL); long-horizon rollouts that stay stable; or multiple plausible futures.
- Cross-track ties. T18 (planned, RL) covers model-based RL depth. T24 (planned, image gen + multimodal) covers production-scale video. T16’s L15 is the CV-side complement to L13 spatial 3D vision.
- Evaluation pitfall. Prediction error is a proxy. What matters downstream: planning quality, long-horizon coherence, downstream task performance. Models with slightly worse one-step error but better long-horizon coherence are usually preferred for prediction-and-planning applications.
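The pixel-vs-latent arithmetic in the list above can be checked directly. This is a minimal sketch that only counts predicted values; the function names are hypothetical, and the count ignores model size and compute per value.

```python
def pixel_values(height, width, channels, frames):
    """Number of values predicted when forecasting raw frames."""
    return height * width * channels * frames


def latent_values(latent_dim, frames):
    """Number of values predicted when forecasting one latent vector per frame."""
    return latent_dim * frames


def cost_ratio(height, width, channels, latent_dim, frames):
    """How many times more values pixel-space prediction produces."""
    return pixel_values(height, width, channels, frames) / latent_values(latent_dim, frames)


body_ratio = cost_ratio(224, 224, 3, latent_dim=512, frames=30)
practice_ratio = cost_ratio(512, 512, 3, latent_dim=256, frames=96)

print(pixel_values(224, 224, 3, 30), latent_values(512, 30), body_ratio)
# 4515840 15360 294.0
print(pixel_values(512, 512, 3, 96), latent_values(256, 96), practice_ratio)
# 75497472 24576 3072.0
```

Because the frame count appears in both the numerator and the denominator, the ratio is set by frame size and latent dimension alone; the horizon determines how large the absolute gap becomes.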
What changes for you
When you read that a self-driving system “predicts trajectories” of other agents two to five seconds out, that is a world model. When a robot demo shows the robot “imagining” candidate manipulations before committing, the imagining is a world-model rollout. When a video-generation system produces minutes of coherent video from a short prompt, the coherence comes from a (large, expensive) learned model of how visible scenes evolve. The same architecture family, at different scales and on different data, underlies all three. The central engineering choice for any new world-modeling system is the pixel-vs-latent trade-off. For most planning and decision-making uses, latent-space prediction wins decisively on cost. For video generation that needs to be watched, pixel-space output (often after a latent stage) is required.
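The “imagining” step can be illustrated with a toy planner. This is a sketch under strong assumptions, not how any particular robot or driving stack works: the latent state is a single integer, `predict_next` and `predicted_reward` are hand-written stand-ins for a learned dynamics model and reward head, and the planner simply enumerates every candidate plan.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # hypothetical discrete action set
GOAL = 2              # hypothetical target latent state


def predict_next(latent, action):
    """Stand-in for a learned, action-conditioned dynamics model."""
    return latent + action


def predicted_reward(latent):
    """Stand-in for a learned reward head: closer to GOAL is better."""
    return -abs(latent - GOAL)


def imagine(latent, plan):
    """Roll a candidate plan forward inside the model; return total predicted reward."""
    total = 0
    for action in plan:
        latent = predict_next(latent, action)
        total += predicted_reward(latent)
    return total


def choose_plan(latent, horizon):
    """Score every candidate plan by imagined return and keep the best one."""
    candidates = product(ACTIONS, repeat=horizon)
    return max(candidates, key=lambda plan: imagine(latent, plan))


best_plan = choose_plan(latent=0, horizon=3)
first_action = best_plan[0]  # execute only this action, then replan
print(best_plan)  # (1, 1, 0)
```

No pixels are generated at any point: every candidate is evaluated entirely in the model's latent state, which is why latent-space prediction is usually sufficient for planning.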
Reactive vision answered “what is in this image”; predictive vision answers “what is the world this image is a snapshot of, and what will happen next.” Phase 3 has been the story of moving from the first to the second, and this lesson is the most direct expression of that move.