References: World modeling

Source material

This lesson follows Stanford CS231n’s treatment of world modeling (Lecture 17).

Course: Stanford CS231n, “Deep Learning for Computer Vision”
Instructors: Fei-Fei Li, Ehsan Adeli, and Justin Johnson (Stanford University)
Course site: cs231n.stanford.edu
This lesson maps to: Lecture 17 (World Modeling).

Attribution (Clawdemy-authored): Stanford CS231n: Deep Learning for Computer Vision, Fei-Fei Li, Ehsan Adeli, and Justin Johnson, Stanford University (cs231n.stanford.edu). CS231n does not publish a required citation string; this is the attribution Clawdemy uses.

A note on access and license

The current term’s lecture recordings are posted on Canvas for enrolled Stanford students. Recordings from previous years are publicly available on YouTube under YouTube’s standard license; Clawdemy links out rather than embedding or rehosting. The course notes (cs231n.github.io) and site are Stanford’s. No Creative Commons license is published for the lectures, so we treat them as link-only references.

Primary papers (cited by name and venue)

World models foundations

World Models. Ha, Schmidhuber, “World Models” (arXiv 2018; the NeurIPS 2018 version is titled “Recurrent World Models Facilitate Policy Evolution”). The influential early VAE + RNN + controller architecture; popularized the “imagine futures, plan from them” framing for model-based RL.
PlaNet. Hafner, Lillicrap, Fischer, Villegas, Ha, Lee, Davidson, “Learning Latent Dynamics for Planning from Pixels” (ICML 2019). A predecessor to Dreamer; latent dynamics for planning.

Dreamer family

Dreamer. Hafner, Lillicrap, Ba, Norouzi, “Dream to Control: Learning Behaviors by Latent Imagination” (ICLR 2020). Train RL agents by imagining trajectories in a learned world model.
DreamerV2. Hafner, Lillicrap, Norouzi, Ba, “Mastering Atari with Discrete World Models” (ICLR 2021). Human-level Atari with discrete latents.
DreamerV3. Hafner, Pasukonis, Ba, Lillicrap, “Mastering Diverse Domains through World Models” (arXiv 2023; Nature 2025). A single set of hyperparameters works across many distinct environments; a notable robustness milestone.

MuZero

MuZero. Schrittwieser et al., “Mastering Atari, Go, chess and shogi by planning with a learned model” (Nature 2020 / arXiv 2019). Tree search with learned dynamics; mastered multiple games without rule access.

JEPA family

I-JEPA. Assran et al., “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture” (CVPR 2023). Image-side JEPA.
V-JEPA. Bardes et al., “Revisiting Feature Prediction for Learning Visual Representations from Video” (TMLR 2024). Video extension; spatio-temporal masked latent prediction.

Predecessors and context

Predictive Coding Networks (PredNet). Lotter, Kreiman, Cox, “Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning” (ICLR 2017). Early video-prediction architecture inspired by neuroscience predictive-coding theories.

Further study (sister tracks for depth)

T18 (planned, reinforcement learning). Will cover model-based RL in depth, including Dreamer, MuZero, and the full algorithm details of training a policy from imagined rollouts. The right destination if you want to actually train an agent in a learned world model.
T24 (planned, image generation and multimodal). Will cover production-scale video generation end-to-end, including the architectures behind Sora-style systems and the training-data and engineering considerations that make them work.
L13 (3D vision). Same track as this lesson. The spatial complement to L15’s temporal prediction; both are about lifting vision beyond reactive image-in, answer-out processing.

Further study (tools and reproduction)

Open-source Dreamer implementations (multiple in PyTorch and JAX) make Dreamer-style training reproducible at moderate scale.
Open-source video-generation systems (Stable Video Diffusion, AnimateDiff and similar) provide consumer-grade reproductions of some of the video-world-model ideas at smaller scale.

How we use this source

Clawdemy follows CS231n’s Lecture 17 ordering (the predictive-vs-reactive split, the pixel-vs-latent trade-off, landmark architectures, evaluation considerations) and cites the canonical papers by name and venue. The pixel-vs-latent efficiency calculations (body: 224×224×3, 30 frames, 512-dim latent → ~294x; practice: 512×512×3, 96 frames, 256-dim latent → ~3,072x) are Clawdemy-authored and use only standard arithmetic. The architectural-pattern descriptions (World Models, Dreamer, MuZero, JEPA, video world models) are pitched at the intuition level appropriate for the Track 16 Phase 0 arc; deep coverage of model-based RL lives in T18, and deep coverage of production-scale video generation lives in T24. We name research artifacts (Sora, Genie) by their publications, the same standard used for AlexNet / ResNet / StyleGAN / CLIP in earlier lessons; we do not market or recommend specific commercial products. We do not reproduce CS231n’s slides, figures, problem sets, or lecture text. Full attribution policy: see Doc/attribution-policy.md.
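
For readers who want to check the pixel-vs-latent ratios quoted above, here is a minimal sketch of the arithmetic. It assumes one latent vector per frame, so the frame count cancels out of the ratio; that assumption is ours and is chosen because it reproduces the stated figures. The function name compression_ratio is hypothetical and is not taken from CS231n or any library.

```python
def compression_ratio(height, width, channels, frames, latent_dim):
    """Raw pixel values divided by latent values for a clip.

    Assumption (ours, for illustration): one latent vector of size
    latent_dim per frame, so the ratio is the same per frame as per clip.
    """
    pixel_values = height * width * channels * frames
    latent_values = latent_dim * frames
    return pixel_values / latent_values


# "body" example: 224x224 RGB, 30 frames, 512-dim latent
body = compression_ratio(224, 224, 3, 30, 512)

# "practice" example: 512x512 RGB, 96 frames, 256-dim latent
practice = compression_ratio(512, 512, 3, 96, 256)

print(f"body: ~{body:,.0f}x")          # body: ~294x
print(f"practice: ~{practice:,.0f}x")  # practice: ~3,072x
```

If a model instead used a single latent for the whole clip, the ratios would be larger by a factor of the frame count, so the per-frame reading is the conservative one.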