Cheatsheet: World modeling

| Vision type | Question | Examples in this track |
| --- | --- | --- |
| Reactive | What is in this image? | Classification, detection, segmentation, depth, retrieval (L1-L14 mostly) |
| Predictive (world modeling) | Given the past, what comes next? | Self-driving trajectory prediction, robotics planning, video generation |

| Piece | Role | Typical architecture |
| --- | --- | --- |
| Encoder | Observation → compact representation | CNN, ViT (often self-supervised pre-trained) |
| Dynamics | Predict how representation evolves; may consume actions | RNN, LSTM, transformer over time |
| Decoder (optional) | Map representations back to pixels | Conv decoder; needed if predictions must be rendered |

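
The encoder + dynamics + decoder split can be sketched in a few lines. This is a toy under loud assumptions: random, untrained linear maps stand in for the CNN/ViT encoder, the RNN/transformer dynamics, and the conv decoder, and every name here (`ToyWorldModel`, `encode`, `step`, `decode`, `rollout`) is hypothetical rather than taken from any library. The point is only the data flow: encode once, roll forward in latent space, decode only if pixels are needed.

```python
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Random weights; a stand-in for trained parameters (assumption)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyWorldModel:
    """Hypothetical encoder + dynamics + optional decoder, all linear."""

    def __init__(self, obs_dim, action_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.enc = make_matrix(latent_dim, obs_dim, rng)                  # encoder
        self.dyn = make_matrix(latent_dim, latent_dim + action_dim, rng)  # dynamics
        self.dec = make_matrix(obs_dim, latent_dim, rng)                  # decoder

    def encode(self, obs):
        return matvec(self.enc, obs)

    def step(self, latent, action):
        # The dynamics consume the current latent concatenated with the action.
        return matvec(self.dyn, latent + action)

    def decode(self, latent):
        return matvec(self.dec, latent)

    def rollout(self, obs, actions):
        """Encode once, then predict one latent per future step."""
        latent = self.encode(obs)
        latents = []
        for action in actions:
            latent = self.step(latent, action)
            latents.append(latent)
        return latents


model = ToyWorldModel(obs_dim=12, action_dim=2, latent_dim=4)
obs = [0.5] * 12                 # a flattened "image" of 12 numbers
actions = [[1.0, 0.0]] * 5       # the same action for 5 future steps
latents = model.rollout(obs, actions)
final_frame = model.decode(latents[-1])  # decode only the step we want to look at
print(len(latents), len(latents[0]), len(final_frame))  # 5 4 12
```

The rollout never touches the decoder; a planner that only needs latents can skip `decode` entirely.
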
| Property | Pixel space | Latent space |
| --- | --- | --- |
| Output size | All future frame pixels | Compact latent per future step |
| Cost | Huge (millions of numbers per rollout) | Hundreds to a few thousand numbers per step (tens of thousands per rollout) |
| Use case | Video generation that must be watched | Planning, decision-making, RL |
| Decode at end | Native (output is already pixels) | One pass through decoder when needed |

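| Source | Frame resolution | Frames forward | Pixel-space output | Latent dim | Latent-space output | Ratio |
| --- | --- | --- | --- | --- | --- | --- |
| Body | 224×224×3 | 30 | 4,515,840 | 512 | 15,360 | ~294x |
| Practice | 512×512×3 | 96 | 75,497,472 | 256 | 24,576 | ~3,072x |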

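The argument for latent-space prediction sharpens at higher resolution and longer horizons: the ratio grows with frame resolution (and shrinks as the latent dimension grows), while a longer horizon multiplies both outputs by the same factor and so widens the absolute gap.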
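
The output-size figures in the comparison can be reproduced with plain arithmetic. The helper name `rollout_output_sizes` is an assumption made for this sketch; the inputs are the resolutions, horizons, and latent dimensions from the two rows.

```python
def rollout_output_sizes(height, width, channels, frames, latent_dim):
    """Count the numbers a model must emit for one rollout."""
    pixel_numbers = height * width * channels * frames  # every pixel of every future frame
    latent_numbers = latent_dim * frames                # one latent vector per future step
    return pixel_numbers, latent_numbers, pixel_numbers / latent_numbers


rows = {
    "Body": (224, 224, 3, 30, 512),
    "Practice": (512, 512, 3, 96, 256),
}
for name, args in rows.items():
    pixels, latents, ratio = rollout_output_sizes(*args)
    print(f"{name}: pixel={pixels:,} latent={latents:,} ratio={ratio:,.0f}x")
# Body: pixel=4,515,840 latent=15,360 ratio=294x
# Practice: pixel=75,497,472 latent=24,576 ratio=3,072x
```

The number of frames cancels out of the ratio, so it is set by resolution and latent dimension alone.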

| Year | Architecture | Headline |
| --- | --- | --- |
| 2018 | World Models (Ha & Schmidhuber) | VAE + RNN + controller; early influential pattern |
| 2019 | Dreamer (Hafner) | Recurrent world model; train policy from IMAGINED rollouts |
| 2020 | DreamerV2 (Hafner) | Atari at human level via model-based RL |
| 2023 | DreamerV3 (Hafner) | Broad generalization across domains with same hyperparameters |
| 2019 | MuZero (Schrittwieser) | Tree search + learned dynamics; Go/chess/shogi/Atari without rule access |
| 2023+ | JEPA family (LeCun direction) | Predict in latent space, not pixels (V-JEPA for video, 2024) |
| 2024 | Sora (OpenAI), Genie (DeepMind) | Production-scale video world models |

| Prediction target | Use case |
| --- | --- |
| Next observation directly | Forecasting, video generation |
| Next observation given action | Planning, model-based RL |
| State + reward | RL (MuZero) |
| Long horizons (30, 60+ steps) | Stable trajectory prediction; video generation coherence |
| Multiple plausible futures | Stochastic environments, generative video |

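
Several of these prediction targets can be combined in one toy sketch: an action-conditioned step that returns state plus reward, rolled out over a 30-step horizon, and sampled more than once to get multiple plausible futures. Everything here is an assumption for illustration: the 1-D state, the Gaussian noise, the reward, and the names `stochastic_step` and `sample_futures` do not come from any of the architectures listed.

```python
import random


def stochastic_step(state, action, rng, noise=0.1):
    """Hypothetical dynamics: next state and reward given state and action."""
    next_state = state + action + rng.gauss(0.0, noise)  # noise makes the future uncertain
    reward = -abs(next_state)                            # toy reward: stay near the origin
    return next_state, reward


def sample_futures(state, actions, num_samples, seed=0):
    """Roll the same action sequence forward several times; each sample is one future."""
    rng = random.Random(seed)
    futures = []
    for _ in range(num_samples):
        current, trajectory = state, []
        for action in actions:
            current, reward = stochastic_step(current, action, rng)
            trajectory.append((current, reward))
        futures.append(trajectory)
    return futures


futures = sample_futures(state=1.0, actions=[-0.1] * 30, num_samples=3)
for i, trajectory in enumerate(futures):
    final_state, _ = trajectory[-1]
    print(f"future {i}: final state {final_state:+.2f}")
```

A deterministic model would return the same trajectory every time; sampling is what exposes the spread of outcomes in a stochastic environment.
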
| Application | World-model role |
| --- | --- |
| Self-driving | Predict other agents’ trajectories 2-5 sec out |
| Robotics | Forward-simulate candidate actions; pick best |
| Video generation | Predict long-horizon visible-scene evolution |
| Model-based RL | Train policy in learned model (cheaper than real env) |
| Physics / climate / weather | Learned forward models from observations |
| Game AI / procedural content | Imagine how game world evolves |

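
The robotics row ("forward-simulate candidate actions; pick best") corresponds to a simple planning loop. The sketch below uses random shooting, which is one common choice and not necessarily what any system named in this cheatsheet uses. The function `predict_next` is a hand-written stand-in for a learned dynamics model, and all names and parameters are hypothetical.

```python
import random


def predict_next(position, action):
    """Stand-in for a learned dynamics model (assumption: 1-D position, additive action)."""
    return position + action


def plan_by_random_shooting(position, goal, horizon=5, num_candidates=200, seed=0):
    """Sample action sequences, simulate each in the model, keep the best one."""
    rng = random.Random(seed)
    best_cost, best_actions = float("inf"), None
    for _ in range(num_candidates):
        candidate = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        predicted = position
        for action in candidate:
            predicted = predict_next(predicted, action)  # simulate, never touch the real env
        cost = abs(goal - predicted)                     # distance to goal after the rollout
        if cost < best_cost:
            best_cost, best_actions = cost, candidate
    return best_actions, best_cost


actions, cost = plan_by_random_shooting(position=0.0, goal=2.0)
print(f"first action: {actions[0]:+.2f}, predicted distance to goal: {cost:.3f}")
```

In a receding-horizon setup only the first action would be executed before replanning from the new observation.
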
| Topic | Sister track |
| --- | --- |
| Model-based RL deep treatment (Dreamer, MuZero) | T18 (planned, reinforcement learning) |
| Production-scale video generation | T24 (planned, image generation + multimodal) |
| 3D vision (spatial complement to L15’s temporal) | L13 of this track |
| Self-supervised pre-training (for encoder) | L10 of this track |

| Pitfall | Reality |
| --- | --- |
| World modeling = video generation | Video gen is one application; world modeling is broader. Many (Dreamer, MuZero) never render pixels |
| Latent-space prediction is damagingly lossy | The compression is the point: keep what matters for prediction, discard what doesn’t |
| Prediction error = quality | Proxy, not goal. What matters downstream: planning quality, long-horizon coherence, task performance |
| Reading “world model” too literally | Learned models fit statistical structure; not physics simulators with conservation laws |

World modeling = learned prediction of how the visible world evolves; the central trade-off is pixel-space (directly useful, expensive) vs latent-space (cheaper, sufficient for most planning uses); landmark architectures (World Models, Dreamer family, MuZero, JEPA, video world models) explore different shapes of the encoder + dynamics + decoder triple.