| Vision type | Question | Examples in this track |
|---|---|---|
| Reactive | What is in this image? | Classification, detection, segmentation, depth, retrieval (L1-L14 mostly) |
| Predictive (world modeling) | Given the past, what comes next? | Self-driving trajectory prediction, robotics planning, video generation |

| Piece | Role | Typical architecture |
|---|---|---|
| Encoder | Observation → compact representation | CNN, ViT (often self-supervised pre-trained) |
| Dynamics | Predict how representation evolves; may consume actions | RNN, LSTM, transformer over time |
| Decoder (optional) | Map representations back to pixels | Conv decoder; needed if predictions must be rendered |
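
The three pieces in the table above can be sketched in a few lines. This is a minimal illustration only: `TinyWorldModel`, `make_linear`, and `matvec` are hypothetical names, and random linear maps stand in for the CNN/ViT encoder, the recurrent or transformer dynamics, and the conv decoder. The point is the data flow: encode once, roll forward in latent space, decode only if pixels are needed.

```python
import random

def make_linear(n_out, n_in, rng, scale=0.1):
    """Random n_out x n_in weight matrix, stored as a list of rows."""
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

class TinyWorldModel:
    """Hypothetical stand-in: untrained linear maps replace the real networks."""

    def __init__(self, obs_dim, latent_dim, action_dim, seed=0):
        rng = random.Random(seed)
        self.enc = make_linear(latent_dim, obs_dim, rng)                  # encoder
        self.dyn = make_linear(latent_dim, latent_dim + action_dim, rng)  # dynamics
        self.dec = make_linear(obs_dim, latent_dim, rng)                  # optional decoder

    def encode(self, obs):
        return matvec(self.enc, obs)

    def step(self, z, action):
        # The dynamics consume the current latent concatenated with the action.
        return matvec(self.dyn, z + action)

    def decode(self, z):
        return matvec(self.dec, z)

    def rollout(self, obs, actions):
        """Encode once, then predict one latent per action without touching pixels."""
        z = self.encode(obs)
        latents = []
        for a in actions:
            z = self.step(z, a)
            latents.append(z)
        return latents

model = TinyWorldModel(obs_dim=12, latent_dim=4, action_dim=2)
obs = [0.5] * 12                     # a flattened toy "frame"
actions = [[1.0, 0.0]] * 5           # five identical toy actions
latents = model.rollout(obs, actions)
final_frame = model.decode(latents[-1])  # decode only the step we want to inspect
```

A planner would typically stop at `latents`; the single `decode` call at the end is only needed when a prediction has to be rendered.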

| Property | Pixel space | Latent space |
|---|---|---|
| Output size | All future frame pixels | Compact latent per future step |
| Cost | Huge (millions to tens of millions of values per rollout) | Hundreds to a few thousand values per step (tens of thousands per rollout) |
| Use case | Video generation that must be watched | Planning, decision-making, RL |
| Decode at end | Not needed (output is already pixels) | One pass through decoder when needed |

| Source | Frame resolution | Frames forward | Pixel-space output (values) | Latent dim | Latent-space output (values) | Ratio |
|---|---|---|---|---|---|---|
| Body | 224×224×3 | 30 | 4,515,840 | 512 | 15,360 | ~294x |
| Practice | 512×512×3 | 96 | 75,497,472 | 256 | 24,576 | ~3,072x |
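
The argument for latent-space prediction sharpens at higher resolution and longer horizons: resolution drives the ratio up, while a longer horizon multiplies the absolute number of values saved.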
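
The figures in the table can be reproduced with a few lines of arithmetic. The helper name `rollout_sizes` is hypothetical; it assumes one full frame per predicted step in pixel space and one latent vector per predicted step in latent space.

```python
def rollout_sizes(height, width, channels, steps, latent_dim):
    """Return (pixel-space values, latent-space values, ratio) for one rollout."""
    pixel_values = height * width * channels * steps
    latent_values = latent_dim * steps
    return pixel_values, latent_values, pixel_values / latent_values

body = rollout_sizes(224, 224, 3, 30, 512)
practice = rollout_sizes(512, 512, 3, 96, 256)
print(body)      # (4515840, 15360, 294.0)
print(practice)  # (75497472, 24576, 3072.0)
```

Because `steps` appears in both the numerator and the denominator, the ratio reduces to `height * width * channels / latent_dim`. The horizon does not change the ratio, but it scales both totals, so the absolute gap grows with every extra step.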

| Year | Architecture | Headline |
|---|---|---|
| 2018 | World Models (Ha & Schmidhuber) | VAE + RNN + controller; early influential pattern |
| 2019 | Dreamer (Hafner) | Recurrent world model; train policy from *imagined* rollouts |
| 2020 | DreamerV2 (Hafner) | Atari at human level via model-based RL |
| 2023 | DreamerV3 (Hafner) | Broad generalization across domains with same hyperparameters |
| 2019 | MuZero (Schrittwieser) | Tree search + learned dynamics; Go/chess/shogi/Atari without rule access |
| 2023+ | JEPA family (LeCun direction) | Predict in latent space, not pixels (V-JEPA for video, 2024) |
| 2024 | Sora (OpenAI), Genie (DeepMind) | Production-scale video world models |

| Prediction target | Use case |
|---|---|
| Next observation directly | Forecasting, video generation |
| Next observation given action | Planning, model-based RL |
| State + reward | RL (MuZero) |
| Long horizons (30, 60+ steps) | Stable trajectory prediction; video generation coherence |
| Multiple plausible futures | Stochastic environments, generative video |
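
The "multiple plausible futures" row can be illustrated by sampling several rollouts from a stochastic dynamics step. This is a toy sketch: `stochastic_step` and `sample_futures` are hypothetical names, and a fixed decay plus Gaussian noise stands in for a learned stochastic model.

```python
import random

def stochastic_step(z, rng, noise=0.1):
    """Hypothetical stochastic dynamics: deterministic decay plus Gaussian noise."""
    return [0.9 * zi + rng.gauss(0.0, noise) for zi in z]

def sample_futures(z0, steps, n_samples, seed=0):
    """Roll the same starting latent forward n_samples times."""
    rng = random.Random(seed)
    futures = []
    for _ in range(n_samples):
        z, trajectory = list(z0), []
        for _ in range(steps):
            z = stochastic_step(z, rng)
            trajectory.append(z)
        futures.append(trajectory)
    return futures

# Four 30-step futures from one starting latent; each sample diverges from the others.
futures = sample_futures([1.0, -1.0], steps=30, n_samples=4)
```

Each trajectory starts from the same latent but draws different noise, so downstream code can evaluate a plan against a spread of outcomes rather than a single forecast.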

| Application | World-model role |
|---|---|
| Self-driving | Predict other agents’ trajectories 2–5 s ahead |
| Robotics | Forward-simulate candidate actions; pick best |
| Video generation | Predict long-horizon visible-scene evolution |
| Model-based RL | Train policy in learned model (cheaper than real env) |
| Physics / climate / weather | Learned forward models from observations |
| Game AI / procedural content | Imagine how game world evolves |
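
The robotics row ("forward-simulate candidate actions; pick best") can be sketched with a simple random-shooting planner. Everything here is an assumption for illustration: `dynamics` is a hand-written 1-D stand-in for a learned model, and `cost` and `plan` are hypothetical helpers, not part of any specific system named above.

```python
import random

def dynamics(state, action):
    """Hypothetical learned dynamics: a 1-D point that moves by the action."""
    return state + action

def cost(state, goal):
    """Distance from the goal at the end of the rollout."""
    return abs(state - goal)

def plan(state, goal, horizon=5, n_candidates=200, seed=0):
    """Sample random action sequences, simulate each in the model, keep the best."""
    rng = random.Random(seed)
    best_seq, best_cost = None, float("inf")
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s = state
        for a in seq:          # forward-simulate without touching the real environment
            s = dynamics(s, a)
        c = cost(s, goal)
        if c < best_cost:
            best_seq, best_cost = seq, c
    return best_seq, best_cost

best_seq, best_cost = plan(state=0.0, goal=2.0)
```

In practice only the first action of `best_seq` is usually executed before replanning from the new observation, which limits how far model errors can accumulate.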

| Topic | Sister track |
|---|---|
| Model-based RL deep treatment (Dreamer, MuZero) | T18 (planned, reinforcement learning) |
| Production-scale video generation | T24 (planned, image generation + multimodal) |
| 3D vision (spatial complement to L15’s temporal) | L13 of this track |
| Self-supervised pre-training (for encoder) | L10 of this track |

| Pitfall | Reality |
|---|---|
| World modeling = video generation | Video generation is one application; world modeling is broader. Many systems (Dreamer, MuZero) plan entirely in latent space and never need to render pixels to act |
| Latent-space prediction is damagingly lossy | The compression is the point: keep what matters for prediction, discard what doesn’t |
| Prediction error = quality | Proxy, not goal. What matters downstream: planning quality, long-horizon coherence, task performance |
| Reading “world model” too literally | Learned models fit statistical structure; not physics simulators with conservation laws |

World modeling is learned prediction of how the visible world evolves. The central trade-off is pixel-space prediction (directly viewable, expensive) versus latent-space prediction (cheaper, sufficient for most planning uses). The landmark architectures (World Models, the Dreamer family, MuZero, JEPA, video world models) explore different shapes of the encoder + dynamics + decoder triple.