World modeling: brief

What you’ll learn

This is lesson 15 of T16, in Phase 3 (Generating and grounding vision). It builds one capability: you will be able to explain what a learned world model is, name the central pixel-vs-latent prediction trade-off and compute its efficiency ratio for a given setup, walk through the landmark architectures, and identify the cross-track ties to model-based RL and large-scale video generation. The source curriculum is Stanford CS231n (cs231n.stanford.edu); this lesson maps to Lecture 17 (World Modeling). Deep mechanics live in sister tracks T18 (planned, RL) and T24 (planned, image generation and multimodal) per the Phase 0 arc.

The lesson opens with the reactive-vs-predictive split: the first 14 lessons of T16 have been reactive, and world modeling adds prediction. It then names the three-piece world-model architecture (encoder + dynamics + optional decoder) and covers the central pixel-vs-latent-space trade-off with a worked efficiency calculation (224×224×3 frame, 30-frame horizon, 512-dim latent → 294x fewer predicted values than in pixel space). Next it surveys landmark architectures (World Models, Dreamer family, MuZero, JEPA / V-JEPA, Sora-style video world models), discusses what world models actually predict (next state, action-conditioned, state + reward, long horizons, multiple plausible futures), and covers use cases (self-driving, robotics, video gen, model-based RL, science). It ends with cross-track routing to T18 and T24.

Where this fits

This is lesson 15 of 16, the sixth lesson of Phase 3. It depends on lesson 9 (video understanding; world models extend the time-dimension treatment from L9 into prediction) and lesson 10 (self-supervised vision; the encoder side of most world models is a self-supervised vision backbone). The next lesson, Computer vision among people: the human-centered view, closes T16 by surveying the real-world strengths, failure modes, and biases of vision systems as engineering concerns.

Before you start

Prerequisites: lessons 9 (video understanding) and 10 (self-supervised vision). Lesson 9 covered the input-side time dimension (process a sequence of frames); this lesson adds the output-side time dimension (predict future frames or states). Lesson 10’s self-supervised encoders are the standard front-end for world models.

About the math

Light. The body works one pixel-vs-latent efficiency calculation, counting the values the model must predict (224 × 224 × 3 per frame; 30 frames forward; 512-dim latent: pixel output 4,515,840 values vs latent output 15,360, a ratio of exactly 294x). Practice extends to a video-generation-scale setup (512 × 512 × 3 per frame; 96 frames forward; 256-dim latent: pixel output 75,497,472 vs latent output 24,576, a ratio of exactly 3,072x). Multiplication and division only.
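
The arithmetic above can be checked in a few lines of Python. This is a minimal sketch, not part of the source lesson: the function name prediction_sizes and its parameter names are made up for illustration, and it counts only the number of predicted values, which is a rough proxy for cost rather than a measure of compute.

```python
def prediction_sizes(height, width, channels, horizon, latent_dim):
    """Count the values predicted over a horizon in pixel space vs latent space.

    Hypothetical helper for illustration; the names are not from the lesson.
    """
    pixel_values = height * width * channels * horizon   # one full frame per step
    latent_values = latent_dim * horizon                  # one latent vector per step
    ratio = pixel_values / latent_values
    return pixel_values, latent_values, ratio


# Body example: 224 x 224 x 3 frames, 30 frames forward, 512-dim latent.
print(prediction_sizes(224, 224, 3, 30, 512))   # (4515840, 15360, 294.0)

# Practice example: 512 x 512 x 3 frames, 96 frames forward, 256-dim latent.
print(prediction_sizes(512, 512, 3, 96, 256))   # (75497472, 24576, 3072.0)
```

Note that the horizon appears in both counts and cancels: the ratio is just the per-frame value count divided by the latent dimension (150,528 / 512 = 294 in the body example).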

By the end, you’ll be able to

Distinguish reactive from predictive vision and name application categories where prediction is required
Identify the three pieces of most world-model architectures
Articulate the pixel-vs-latent trade-off and the structural reason latent wins for planning uses
Compute the efficiency ratio for given resolution, horizon, and latent dimension
Place the landmark architectures and route to cross-track sister lessons for depth
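
To make the three pieces in the objectives above concrete, here is a toy sketch of an encoder, a dynamics model, and an optional decoder, with prediction rolled forward entirely in latent space. Everything in it is an assumption for illustration: the weights are random and untrained, the sizes are deliberately tiny (32 × 32 × 3 frames, 16-dim latent, 4-dim action), the function names are hypothetical, and it does not reproduce any of the landmark architectures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only to keep the sketch small (assumptions, not lesson values).
FRAME_SHAPE = (32, 32, 3)
LATENT_DIM = 16
ACTION_DIM = 4
HORIZON = 30

frame_size = int(np.prod(FRAME_SHAPE))

# Random, untrained weights standing in for learned parameters.
W_enc = rng.standard_normal((LATENT_DIM, frame_size)) * 0.01
W_dyn = rng.standard_normal((LATENT_DIM, LATENT_DIM + ACTION_DIM)) * 0.1
W_dec = rng.standard_normal((frame_size, LATENT_DIM)) * 0.1


def encode(frame):
    """Encoder: compress one frame into a latent state."""
    return np.tanh(W_enc @ frame.ravel())


def step(latent, action):
    """Dynamics: predict the next latent state from the current latent and an action."""
    return np.tanh(W_dyn @ np.concatenate([latent, action]))


def decode(latent):
    """Optional decoder: map a latent state back to frame space."""
    return (W_dec @ latent).reshape(FRAME_SHAPE)


def rollout(frame, actions):
    """Encode once, then predict forward in latent space without touching pixels."""
    latent = encode(frame)
    latents = []
    for action in actions:
        latent = step(latent, action)
        latents.append(latent)
    return np.stack(latents)


frame = rng.random(FRAME_SHAPE)
actions = rng.standard_normal((HORIZON, ACTION_DIM))
predicted = rollout(frame, actions)
print(predicted.shape)              # (30, 16)
print(decode(predicted[-1]).shape)  # (32, 32, 3)
```

The rollout never calls decode, which is the structural point behind the latent side of the trade-off: a planner can work with the small latent states directly, and pixels are reconstructed only when someone needs to look at them.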

Time and difficulty

Read time: about 14 minutes
Practice time: about 15 minutes (a fresh pixel-vs-latent efficiency calculation at video-generation scale, an architecture-matching exercise across the landmarks, an evaluation-reasoning question about long-horizon coherence vs one-step error, plus flashcards)
Difficulty: standard (the math is multiplication and division; the conceptual lift is seeing prediction as the natural extension of reactive vision and seeing the pixel-vs-latent decision as the central design point)