Summary: World modeling

Every vision system in this track so far has been reactive: process the current input, output an answer. World modeling makes vision predictive: given the past, predict the future. Self-driving trajectory prediction, robotics planning, video generation, model-based reinforcement learning, and climate and weather forecasting from observations are all instances. Three architectural pieces recur: an observation encoder (CNN or ViT), a dynamics model (RNN, LSTM, or transformer running over time), and an optional decoder that maps predicted representations back to pixels. The central design trade-off is pixel-space versus latent-space prediction. Pixel-space prediction is directly useful for video generation but expensive: a single frame holds hundreds of thousands of numbers, and a multi-frame rollout holds millions. Latent-space prediction is two to three orders of magnitude cheaper for the same horizon and sufficient for most planning uses, with decoding to pixels at the end when needed.
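As a minimal sketch of how the three pieces fit together, the toy model below uses random linear maps in place of a trained CNN/ViT encoder, recurrent or transformer dynamics, and pixel decoder. The class name `ToyWorldModel`, its method names, and the dimensions are illustrative assumptions, not taken from any particular library or from the lesson.

```python
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyWorldModel:
    """Hypothetical encoder + dynamics + optional decoder, all linear."""

    def __init__(self, obs_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.enc = make_matrix(latent_dim, obs_dim, rng)     # observation -> latent
        self.dyn = make_matrix(latent_dim, latent_dim, rng)  # latent -> next latent
        self.dec = make_matrix(obs_dim, latent_dim, rng)     # latent -> observation

    def encode(self, obs):
        return matvec(self.enc, obs)

    def step(self, z):
        return matvec(self.dyn, z)

    def decode(self, z):
        return matvec(self.dec, z)

    def rollout(self, obs, horizon, decode_last=False):
        """Encode once, predict `horizon` steps in latent space,
        and decode only the final latent if pixels are requested."""
        z = self.encode(obs)
        latents = []
        for _ in range(horizon):
            z = self.step(z)
            latents.append(z)
        frame = self.decode(latents[-1]) if decode_last else None
        return latents, frame


model = ToyWorldModel(obs_dim=48, latent_dim=8)
latents, frame = model.rollout([0.5] * 48, horizon=5, decode_last=True)
print(len(latents), len(latents[0]), len(frame))  # 5 8 48
```

The rollout never touches pixel space until the last step, which is the pattern the latent-space argument relies on: the per-step cost scales with the latent dimension, not with the frame size.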

  • Reactive → predictive. Every CV system so far processed the current input and answered. World modeling adds the predictive question: given the past, predict the future. Self-driving trajectory prediction, robotics planning, and video generation are all variants.
  • Three architectural pieces. Encoder (compact representation from observation), dynamics model (predicts how representation evolves over time, sometimes conditioned on actions), decoder (optional, maps representations back to pixels).
  • Pixel vs latent prediction trade-off. Pixel space: directly useful for video generation but expensive. Latent space: dramatically cheaper, sufficient for planning. Lesson-body calculation: a 224×224×3 frame is 150,528 numbers; 30 frames is about 4.5M; a 512-dimensional latent over 30 frames is 15,360 numbers. Ratio: ~294x cheaper. Practice calculation: a 512×512×3 frame is 786,432 numbers; 96 frames is about 75.5M; a 256-dimensional latent over 96 frames is 24,576 numbers. Ratio: ~3,072x cheaper. The argument sharpens at higher resolution and longer horizon.
  • Landmark architectures. World Models (Ha & Schmidhuber 2018; VAE + RNN + controller, the early influential pattern). Dreamer family (Hafner et al. 2019-2023; recurrent world model + policy trained from imagined rollouts; DreamerV3 generalizes broadly with the same hyperparameters). MuZero (Schrittwieser et al. 2019; tree search over a learned dynamics model that predicts the next hidden state, reward, value, and policy; played Go/chess/shogi/Atari without access to the rules). JEPA / V-JEPA (LeCun research direction; predict in latent space, not pixels; V-JEPA extends this to spatio-temporal masked prediction). Video world models (Sora, Genie, 2024; production-scale video generation framed as world modeling).
  • What world models actually predict varies: the next observation directly; the next observation given an action (for planning); state plus reward (for RL); predictions that stay stable over long horizons; multiple plausible futures.
  • Cross-track ties. T18 (planned, RL) covers model-based RL depth. T24 (planned, image gen + multimodal) covers production-scale video. T16’s L15 is the CV-side complement to L13 spatial 3D vision.
  • Evaluation pitfall. Prediction error is a proxy. What matters downstream: planning quality, long-horizon coherence, downstream task performance. Models with slightly worse one-step error but better long-horizon coherence are usually preferred for prediction-and-planning applications.
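The pixel-versus-latent arithmetic in the list above can be checked with a few lines of code. This is a sketch that counts predicted numbers only, as the summary does; the function names are illustrative, and the count ignores model size, encoder cost, and decoding.

```python
def pixel_count(height, width, channels, frames):
    """Numbers predicted when forecasting every pixel of every frame."""
    return height * width * channels * frames


def latent_count(latent_dim, frames):
    """Numbers predicted when forecasting one latent vector per frame."""
    return latent_dim * frames


def cost_ratio(height, width, channels, frames, latent_dim):
    """How many times more numbers pixel-space prediction needs."""
    return pixel_count(height, width, channels, frames) / latent_count(latent_dim, frames)


# Lesson-body calculation: 224x224x3 frames, 30 frames, latent dim 512.
print(pixel_count(224, 224, 3, 1))       # 150528
print(pixel_count(224, 224, 3, 30))      # 4515840
print(latent_count(512, 30))             # 15360
print(cost_ratio(224, 224, 3, 30, 512))  # 294.0

# Practice calculation: 512x512x3 frames, 96 frames, latent dim 256.
print(pixel_count(512, 512, 3, 96))      # 75497472
print(latent_count(256, 96))             # 24576
print(cost_ratio(512, 512, 3, 96, 256))  # 3072.0
```

Because the frame count appears in both numerator and denominator, the ratio reduces to pixels per frame divided by latent dimension; the total volume in both spaces still grows linearly with horizon.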

When you read that a self-driving system “predicts trajectories” of other agents two to five seconds out, that is a world model. When a robot demo shows the robot “imagining” candidate manipulations before committing, the imagining is a world-model rollout. When a video-generation system produces minutes of coherent video from a short prompt, the coherence comes from a (large, expensive) learned model of how visible scenes evolve. The same architecture family, at different scales and on different data, underlies all three. The central engineering choice for any new world-modeling system is the pixel-vs-latent trade-off. For most planning and decision-making uses, latent-space prediction wins decisively on cost. For video generation that needs to be watched, pixel-space output (often after a latent stage) is required.
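To make “imagining candidate actions before committing” concrete, here is a hedged sketch of planning by random shooting over imagined rollouts. Random shooting is a simple stand-in chosen for illustration, not the planner used by any system named in this lesson. The latent state is a single number, and `dynamics` and `reward` are hand-written placeholders for what would be learned models; all names and values are assumptions.

```python
import random


def dynamics(z, action):
    """Stand-in for a learned latent dynamics model (assumption)."""
    return z + action


def reward(z, goal):
    """Stand-in for a learned reward model: closer to the goal is better."""
    return -abs(z - goal)


def imagine(z0, actions, goal):
    """Roll the model forward in imagination and sum predicted rewards."""
    z, total = z0, 0.0
    for a in actions:
        z = dynamics(z, a)
        total += reward(z, goal)
    return total


def plan(z0, goal, horizon=5, candidates=200, seed=0):
    """Sample random action sequences and keep the best imagined one."""
    rng = random.Random(seed)
    best_actions, best_return = None, float("-inf")
    for _ in range(candidates):
        actions = [rng.choice([-1.0, 0.0, 1.0]) for _ in range(horizon)]
        imagined_return = imagine(z0, actions, goal)
        if imagined_return > best_return:
            best_actions, best_return = actions, imagined_return
    return best_actions, best_return


best_actions, best_return = plan(z0=0.0, goal=3.0)
print(best_actions, best_return)
```

Nothing is executed in the real environment while candidates are scored; only the first action of the chosen sequence would typically be applied before replanning. The quality of the plan depends entirely on how well the dynamics and reward models hold up over the imagined horizon.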

Reactive vision answered “what is in this image”; predictive vision answers “what is the world this image is a snapshot of, and what will happen next.” Phase 3 has been the story of moving from the first to the second, and this lesson is the most direct expression of that move.