World modeling: cheatsheet

Reactive vs predictive vision

Vision type	Question	Examples in this track
Reactive	What is in this image?	Classification, detection, segmentation, depth, retrieval (L1-L14 mostly)
Predictive (world modeling)	Given the past, what comes next?	Self-driving trajectory prediction, robotics planning, video generation

Three-piece world model

Piece	Role	Typical architecture
Encoder	Observation → compact representation	CNN, ViT (often self-supervised pre-trained)
Dynamics	Predict how the representation evolves; may consume actions	RNN (LSTM, GRU), transformer over time
Decoder (optional)	Map representations back to pixels	Conv decoder; needed if predictions must be rendered
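The three pieces can be sketched as a toy model. This is a minimal illustration only: the weights are random and untrained, every piece is a single linear map rather than a CNN/ViT or recurrent network, and the names `ToyWorldModel`, `matvec`, and `random_matrix` are hypothetical.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class ToyWorldModel:
    """Encoder + dynamics + optional decoder, each a single linear map."""

    def __init__(self, obs_dim, action_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.W_enc = random_matrix(latent_dim, obs_dim, rng)
        self.W_dyn = random_matrix(latent_dim, latent_dim + action_dim, rng)
        self.W_dec = random_matrix(obs_dim, latent_dim, rng)

    def encode(self, obs):
        # Observation -> compact representation
        return matvec(self.W_enc, obs)

    def step(self, latent, action):
        # Dynamics: next latent from the current latent concatenated with the action
        return matvec(self.W_dyn, latent + action)

    def decode(self, latent):
        # Optional: map a latent back to observation space
        return matvec(self.W_dec, latent)

    def rollout(self, obs, actions):
        """Encode once, then predict entirely in latent space."""
        z = self.encode(obs)
        latents = []
        for a in actions:
            z = self.step(z, a)
            latents.append(z)
        return latents


model = ToyWorldModel(obs_dim=12, action_dim=2, latent_dim=4)
obs = [0.5] * 12
actions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = model.rollout(obs, actions)
final_frame = model.decode(latents[-1])  # decode only when pixels are needed
print(len(latents), len(latents[0]), len(final_frame))
```

The rollout never touches the decoder; it is called once at the end, and only because this example wants an observation-sized output.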

Pixel-space vs latent-space prediction

Property	Pixel space	Latent space
Output size	All future frame pixels	Compact latent per future step
Cost	Huge (millions of values per rollout)	Hundreds to a few thousand values per step (tens of thousands per rollout)
Use case	Video generation that must be watched	Planning, decision-making, RL
Decode at end	Native	One pass through decoder when needed

Worked efficiency comparisons

Source	Frame resolution	Frames forward	Pixel-space output	Latent dim	Latent-space output	Ratio
Body	224×224×3	30	4,515,840	512	15,360	294x
Practice	512×512×3	96	75,497,472	256	24,576	3,072x

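The argument for latent-space prediction sharpens at higher resolution (the ratio grows with frame size relative to latent size) and at longer horizons (the ratio is unchanged, since the frame count cancels, but the absolute number of values saved grows with every step).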
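The two rows of the table can be reproduced with a few lines of arithmetic. The helper name `rollout_sizes` is hypothetical; it only counts output values and says nothing about compute or memory in a real model.

```python
def rollout_sizes(height, width, channels, frames, latent_dim):
    """Count predicted values for a pixel-space vs latent-space rollout."""
    pixel_values = height * width * channels * frames
    latent_values = latent_dim * frames
    return pixel_values, latent_values, pixel_values / latent_values


body = rollout_sizes(224, 224, 3, frames=30, latent_dim=512)
practice = rollout_sizes(512, 512, 3, frames=96, latent_dim=256)

print(body)      # (4515840, 15360, 294.0)
print(practice)  # (75497472, 24576, 3072.0)
```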

Landmark architectures

Year	Architecture	Headline
2018	World Models (Ha & Schmidhuber)	VAE + RNN + controller; early influential pattern
2019	Dreamer (Hafner)	Recurrent world model; train policy from IMAGINED rollouts
2020	DreamerV2 (Hafner)	Atari at human level via model-based RL
2023	DreamerV3 (Hafner)	Broad generalization across domains with same hyperparameters
2019	MuZero (Schrittwieser)	Tree search + learned dynamics; Go/chess/shogi/Atari without rule access
2023+	JEPA family (LeCun direction)	Predict in latent space, not pixels (V-JEPA for video, 2024)
2024	Sora (OpenAI), Genie (DeepMind)	Production-scale video world models

What world models predict (varies by use)

Prediction target	Use case
Next observation directly	Forecasting, video generation
Next observation given action	Planning, model-based RL
Latent state + reward (MuZero also predicts value and policy)	Model-based RL (MuZero)
Long horizons (30, 60+ steps)	Stable trajectory prediction; video generation coherence
Multiple plausible futures	Stochastic environments, generative video

Use cases

Application	World-model role
Self-driving	Predict other agents’ trajectories 2-5 sec out
Robotics	Forward-simulate candidate actions; pick best
Video generation	Predict long-horizon visible-scene evolution
Model-based RL	Train policy in learned model (cheaper than real env)
Physics / climate / weather	Learned forward models from observations
Game AI / procedural content	Imagine how game world evolves
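The robotics row ("forward-simulate candidate actions; pick best") can be sketched as a tiny planner. This is a toy under stated assumptions: the state is a single number, `toy_dynamics` and `toy_reward` are hypothetical stand-ins for a learned dynamics model and reward predictor, and the candidate set is enumerated exhaustively, which is only feasible for very small action sets and short horizons.

```python
from itertools import product


def plan_by_exhaustive_rollout(state, dynamics, reward,
                               actions=(-1.0, 0.0, 1.0), horizon=5):
    """Roll every candidate action sequence through the model; keep the best."""
    best_return, best_plan = float("-inf"), None
    for plan in product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in plan:
            s = dynamics(s, a)   # imagined next state, no real environment step
            total += reward(s)   # predicted reward for that imagined state
        if total > best_return:
            best_return, best_plan = total, plan
    return best_plan, best_return


def toy_dynamics(s, a):
    """Hypothetical stand-in for a learned model: the action shifts the state."""
    return s + a


def toy_reward(s):
    """Hypothetical reward: be as close to the goal state 3.0 as possible."""
    return -abs(s - 3.0)


best_plan, best_return = plan_by_exhaustive_rollout(0.0, toy_dynamics, toy_reward)
print(best_plan, best_return)  # (1.0, 1.0, 1.0, 0.0, 0.0) -3.0
```

In practice the candidate set is usually sampled or searched (random shooting, cross-entropy method, tree search) rather than enumerated, and only the first action of the best plan is executed before replanning.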

Cross-track ties

Topic	Sister track
Model-based RL deep treatment (Dreamer, MuZero)	T18 (planned, reinforcement learning)
Production-scale video generation	T24 (planned, image generation + multimodal)
3D vision (spatial complement to L15’s temporal)	L13 of this track
Self-supervised pre-training (for encoder)	L10 of this track

Pitfalls

Pitfall	Reality
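World modeling = video generation	Video generation is one application; world modeling is broader. Many models plan entirely in latent space: MuZero has no decoder at all, and Dreamer trains its policy on latent rollouts without rendering frames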
Latent-space prediction loses too much information	The compression is the point: keep what matters for prediction, discard what doesn’t
Low prediction error = good world model	Prediction error is a proxy, not the goal. What matters is downstream: planning quality, long-horizon coherence, task performance
Reading “world model” too literally	Learned models fit statistical structure; they are not physics simulators and do not enforce conservation laws
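One way to see why one-step prediction error is only a proxy: small per-step errors compound over a rollout. The sketch below uses a hypothetical scalar system in which the true state stays constant and the "learned" model overestimates the growth factor by 1%; it illustrates the effect and is not a measurement of any real model.

```python
def rollout(growth, x0, steps):
    """Apply x <- growth * x repeatedly and record the trajectory."""
    xs, x = [], x0
    for _ in range(steps):
        x = growth * x
        xs.append(x)
    return xs


true_traj = rollout(1.00, 1.0, 50)   # true system: state stays at 1.0
model_traj = rollout(1.01, 1.0, 50)  # model: 1% error in the growth factor

one_step_error = abs(model_traj[0] - true_traj[0])
final_error = abs(model_traj[-1] - true_traj[-1])
print(round(one_step_error, 4), round(final_error, 4))
```

A one-step error of about 0.01 grows to roughly 0.64 after 50 steps, so a model that looks accurate on next-step prediction can still be unreliable at the horizons that planning or video coherence depend on.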

One-line takeaway

World modeling = learned prediction of how the visible world evolves; the central trade-off is pixel-space (directly useful, expensive) vs latent-space (cheaper, sufficient for most planning uses); landmark architectures (World Models, Dreamer family, MuZero, JEPA, video world models) explore different shapes of the encoder + dynamics + decoder triple.