Models that imagine the world: world modeling
What you’ll learn
This is lesson 15 of T16, part of Phase 3 (Generating and grounding vision). The one capability it builds: you will be able to explain what a learned world model is, name the central pixel-vs-latent prediction trade-off and compute its efficiency ratio for a given setup, walk through the landmark architectures, and identify the cross-track ties to model-based RL and large-scale video generation. The source curriculum is Stanford CS231n (cs231n.stanford.edu); this lesson maps to Lecture 17 (World Modeling). Deep mechanics live in sister tracks T18 (planned, RL) and T24 (planned, image generation and multimodal) per the Phase 0 arc.
The lesson opens with the reactive-vs-predictive split: the first 14 lessons of T16 have been reactive, and world modeling adds prediction. It then names the three-piece world-model architecture (encoder, dynamics model, and optional decoder) and covers the central pixel-vs-latent trade-off with a worked efficiency calculation (a 224 × 224 × 3 frame, a 30-frame horizon, and a 512-dim latent give 294x fewer predicted values in latent space than in pixel space). From there it surveys landmark architectures (World Models, Dreamer family, MuZero, JEPA / V-JEPA, Sora-style video world models), discusses what world models actually predict (next state, action-conditioned, state + reward, long horizons, multiple plausible futures), covers use cases (self-driving, robotics, video generation, model-based RL, science), and ends with cross-track routing to T18 and T24.
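To make the three-piece architecture concrete, here is a minimal sketch of an encode-once, step-in-latent-space rollout. Everything in it is an assumption for illustration: the toy sizes, the function names (`encode`, `step`, `decode`, `rollout`), and the untrained random linear maps standing in for learned networks. It shows only the data flow, not any specific published architecture.

```python
import random

random.seed(0)

PIXEL_DIM = 12   # hypothetical toy size standing in for a flattened frame
LATENT_DIM = 4   # hypothetical toy size standing in for the latent state


def random_matrix(rows, cols):
    """Untrained placeholder weights; a real model would learn these."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


ENCODER = random_matrix(LATENT_DIM, PIXEL_DIM)    # frame  -> latent
DYNAMICS = random_matrix(LATENT_DIM, LATENT_DIM)  # latent -> next latent
DECODER = random_matrix(PIXEL_DIM, LATENT_DIM)    # latent -> frame (optional piece)


def encode(frame):
    return matvec(ENCODER, frame)


def step(latent):
    return matvec(DYNAMICS, latent)


def decode(latent):
    return matvec(DECODER, latent)


def rollout(frame, horizon, decode_frames=False):
    """Encode one observed frame, then predict `horizon` future states."""
    latent = encode(frame)
    states = []
    for _ in range(horizon):
        latent = step(latent)  # prediction happens in latent space
        states.append(decode(latent) if decode_frames else latent)
    return states


frame = [random.random() for _ in range(PIXEL_DIM)]
latent_plan = rollout(frame, horizon=30)
imagined_frames = rollout(frame, horizon=30, decode_frames=True)
print(len(latent_plan), len(latent_plan[0]))          # 30 4
print(len(imagined_frames), len(imagined_frames[0]))  # 30 12
```

The decoder is only called when frames are actually needed, which is why the overview lists it as optional: a planner can work from the latent states alone.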
Where this fits
This is lesson 15 of 16, the sixth lesson of Phase 3. It depends on lesson 9 (video understanding; world models extend lesson 9's treatment of the time dimension into prediction) and lesson 10 (self-supervised vision; the encoder side of most world models is a self-supervised vision backbone). The next lesson, Computer vision among people: the human-centered view, closes T16 by surveying the real-world strengths, failure modes, and biases of vision systems as engineering concerns.
Before you start
Prerequisites: lessons 9 (video understanding) and 10 (self-supervised vision). Lesson 9 covered the input-side time dimension (process a sequence of frames); this lesson adds the output-side time dimension (predict future frames or states). Lesson 10’s self-supervised encoders are the standard front-end for world models.
About the math
Light. The body works one pixel-vs-latent efficiency calculation (224 × 224 × 3 values per frame, 30 frames forward, 512-dim latent: 4,515,840 pixel outputs vs 15,360 latent outputs, a ratio of 294x). Practice extends to a video-generation-scale setup (512 × 512 × 3 values per frame, 96 frames forward, 256-dim latent: 75,497,472 pixel outputs vs 24,576 latent outputs, a ratio of 3,072x). Multiplication and division only.
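The two calculations above can be checked with a few lines of Python. This is a small sketch; the helper name `prediction_sizes` and its parameter names are illustrative assumptions, not part of the course material.

```python
def prediction_sizes(height, width, channels, horizon, latent_dim):
    """Count the values a model must output over `horizon` predicted steps.

    Returns (pixel_values, latent_values, ratio).
    """
    pixel_values = height * width * channels * horizon
    latent_values = latent_dim * horizon
    return pixel_values, latent_values, pixel_values / latent_values


# Body example: 224 x 224 x 3 frames, 30 frames forward, 512-dim latent.
body = prediction_sizes(224, 224, 3, horizon=30, latent_dim=512)
# Practice example: 512 x 512 x 3 frames, 96 frames forward, 256-dim latent.
practice = prediction_sizes(512, 512, 3, horizon=96, latent_dim=256)

print(body)      # (4515840, 15360, 294.0)
print(practice)  # (75497472, 24576, 3072.0)
```

The horizon appears in both the pixel and latent counts, so it cancels in the ratio: the ratio is the per-frame pixel count divided by the latent dimension (150,528 / 512 = 294 and 786,432 / 256 = 3,072).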
By the end, you’ll be able to
- Distinguish reactive from predictive vision and name application categories where prediction is required
- Identify the three pieces of most world-model architectures
- Articulate the pixel-vs-latent trade-off and the structural reason latent wins for planning uses
- Compute the efficiency ratio for given resolution, horizon, and latent dimension
- Place the landmark architectures and route to cross-track sister lessons for depth
Time and difficulty
- Read time: about 14 minutes
- Practice time: about 15 minutes (a fresh pixel-vs-latent efficiency calculation at video-generation scale, an architecture-matching exercise across the landmarks, an evaluation-reasoning question about long-horizon coherence vs one-step error, plus flashcards)
- Difficulty: standard (the math is multiplication and division; the conceptual lift is seeing prediction as the natural extension of reactive vision and seeing the pixel-vs-latent decision as the central design point)