References: World modeling

This lesson follows Stanford CS231n’s treatment of world modeling (Lecture 17).

  • Course: Stanford CS231n, “Deep Learning for Computer Vision”
  • Instructors: Fei-Fei Li, Ehsan Adeli, and Justin Johnson (Stanford University)
  • Course site: cs231n.stanford.edu
  • This lesson maps to: Lecture 17 (World Modeling).

Attribution (Clawdemy-authored): Stanford CS231n: Deep Learning for Computer Vision, Fei-Fei Li, Ehsan Adeli, and Justin Johnson, Stanford University (cs231n.stanford.edu). CS231n does not publish a required citation string; this is the attribution Clawdemy uses.

The current term’s lecture recordings are posted on Canvas for enrolled Stanford students. Recordings from previous years are publicly available on YouTube under YouTube’s standard license; Clawdemy links out rather than embedding or rehosting. The course notes (cs231n.github.io) and site are Stanford’s. No Creative Commons license is published for the lectures, so we treat them as link-only references.

  • World Models. Ha, Schmidhuber, “World Models” (NeurIPS 2018). The influential early VAE + RNN + controller architecture; popularized the “imagine futures, plan from them” framing for model-based RL.
  • PlaNet. Hafner, Lillicrap, Fischer, Villegas, Ha, Lee, Davidson, “Learning Latent Dynamics for Planning from Pixels” (ICML 2019). A predecessor to Dreamer; latent dynamics for planning.
  • Dreamer. Hafner, Lillicrap, Ba, Norouzi, “Dream to Control: Learning Behaviors by Latent Imagination” (ICLR 2020). Train RL agents by imagining trajectories in a learned world model.
  • DreamerV2. Hafner, Lillicrap, Norouzi, Ba, “Mastering Atari with Discrete World Models” (ICLR 2021). Human-level Atari with discrete latents.
  • DreamerV3. Hafner, Pasukonis, Ba, Lillicrap, “Mastering Diverse Domains through World Models” (arXiv 2023; later published in Nature, 2025). A single set of hyperparameters works across many distinct environments; a notable robustness milestone.
  • MuZero. Schrittwieser et al., “Mastering Atari, Go, chess and shogi by planning with a learned model” (Nature 2020 / arXiv 2019). Tree search over learned dynamics; mastered multiple games without being given their rules.
  • I-JEPA. Assran et al., “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture” (CVPR 2023). Image-side JEPA.
  • V-JEPA. Bardes et al., “Revisiting Feature Prediction for Learning Visual Representations from Video” (TMLR 2024). Video extension; spatio-temporal masked latent prediction.
  • Sora. Brooks et al., “Video generation models as world simulators” (OpenAI technical report 2024). Production-scale text-to-video; explicitly positions video generation as world modeling.
  • Genie. Bruce et al., “Genie: Generative Interactive Environments” (ICML 2024). Action-controllable video generation that learns to behave like an interactive environment.
  • Predictive Coding Networks (PredNet). Lotter, Kreiman, Cox, “Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning” (ICLR 2017). Early video-prediction architecture inspired by neuroscience predictive-coding theories.
  • T18 (planned, reinforcement learning). Will cover model-based RL in depth, including Dreamer, MuZero, and the full algorithm details of training a policy from imagined rollouts. The right destination if you want to actually train an agent in a learned world model.
  • T24 (planned, image generation and multimodal). Will cover production-scale video generation end-to-end, including the architectures behind Sora-style systems and the training-data and engineering considerations that make them work.
  • L13 (3D vision). Same track. The spatial complement to L15’s temporal prediction; both are about lifting vision beyond reactive image-in-answer-out.
  • Open-source Dreamer implementations (multiple in PyTorch and JAX) make Dreamer-style training reproducible at moderate scale.
  • Open-source video-generation systems (Stable Video Diffusion, AnimateDiff and similar) provide consumer-grade reproductions of some of the video-world-model ideas at smaller scale.

Clawdemy follows the ordering of CS231n’s Lecture 17 (the predictive-vs-reactive split, the pixel-vs-latent trade-off, landmark architectures, evaluation considerations) and cites the canonical papers by name and venue. The pixel-vs-latent efficiency calculations (body: 224×224×3, 30 frames, 512-dim latent → 294x; practice: 512×512×3, 96 frames, 256-dim latent → 3,072x) are Clawdemy-authored and use only standard arithmetic. Both ratios equal pixel values per frame divided by latent dimensions (224×224×3 / 512 = 294; 512×512×3 / 256 = 3,072), so the frame count does not change them. The architectural-pattern descriptions (World Models, Dreamer, MuZero, JEPA, video world models) are at the intuition level appropriate for the Track 16 Phase 0 arc; deep coverage of model-based RL lives in T18, and deep coverage of production-scale video generation lives in T24. We name the research artifacts (Sora, Genie) by their publication papers, per the same standard used for AlexNet / ResNet / StyleGAN / CLIP in earlier lessons; we do not market or recommend specific commercial products. We do not reproduce CS231n’s slides, figures, problem sets, or lecture text. Full attribution policy: see Doc/attribution-policy.md.
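
The pixel-vs-latent figures quoted above can be checked with a few lines of Python. This is a minimal sketch, not the lesson's own code: it assumes one latent vector per frame and counts values (pixel channels vs. latent dimensions) rather than bytes. The function names `compression_ratio` and `clip_sizes` are hypothetical, introduced here only for illustration.

```python
def compression_ratio(height, width, channels, latent_dim):
    """Raw pixel values in one frame divided by values in one latent vector.

    Assumption (not stated in the source): each frame maps to exactly one
    latent vector, so the number of frames cancels out of the ratio.
    """
    return (height * width * channels) / latent_dim


def clip_sizes(height, width, channels, frames, latent_dim):
    """Total values needed to store a clip as pixels vs. as per-frame latents."""
    pixel_values = height * width * channels * frames
    latent_values = latent_dim * frames
    return pixel_values, latent_values


# Lesson-body example: 224x224 RGB frames, 512-dim latent.
body_ratio = compression_ratio(224, 224, 3, 512)

# Practice example: 512x512 RGB frames, 256-dim latent.
practice_ratio = compression_ratio(512, 512, 3, 256)

print(f"body: {body_ratio:.0f}x, practice: {practice_ratio:.0f}x")

# Whole-clip totals for the body example (30 frames); the ratio is unchanged.
pixels, latents = clip_sizes(224, 224, 3, 30, 512)
print(f"30-frame clip: {pixels} pixel values vs {latents} latent values")
```

Under a different assumption, such as a single latent vector summarizing the whole clip, the ratios would also scale with the frame count and would be much larger than the quoted figures.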