References: Model-based RL, learning the dynamics

Primary sources (load-bearing for this lesson)

Foundational

Sutton, R. S. (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4), 160-163. https://dl.acm.org/doi/10.1145/122344.122377 The original Dyna architecture. Sutton & Barto Chapter 8 has the modern treatment.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Free online: http://incompleteideas.net/book/the-book-2nd.html Chapter 8 (Planning and Learning) is the canonical treatment of Dyna and the model-based / model-free distinction.

Linear-Gaussian dynamics and iterative LQR

Todorov, E., & Li, W. (2005). A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. American Control Conference 2005. https://homes.cs.washington.edu/~todorov/papers/TodorovACC05.pdf Iterative LQR / iLQG, the classic local-linearization-based planner for nonlinear dynamics.
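Levine, S., & Koltun, V. (2013). Guided Policy Search. ICML 2013. https://proceedings.mlr.press/v28/levine13.html Uses trajectory optimization (DDP) to guide policy learning on simulated continuous-control tasks; later guided-policy-search work fits local linear-Gaussian dynamics models and runs on real robots.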

Probabilistic ensembles (PETS)

Chua, K., Calandra, R., McAllister, R., & Levine, S. (2018). Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. NeurIPS 2018. https://arxiv.org/abs/1805.12114 PETS: probabilistic ensembles with trajectory sampling. Source of the headline 10× to 100× sample-efficiency claim; the abstract reports 8× and 125× fewer samples than SAC and PPO, respectively, on half-cheetah.

Model-based policy optimization (MBPO)

Janner, M., Fu, J., Zhang, M., & Levine, S. (2019). When to Trust Your Model: Model-Based Policy Optimization. NeurIPS 2019. https://arxiv.org/abs/1906.08253 Short imagined rollouts branched from real states (often a single step, with the horizon scheduled up to 15 to 25 steps on some tasks) feeding a SAC model-free learner; the practical recipe.

World models and Dreamer

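Ha, D., & Schmidhuber, J. (2018). World Models. arXiv:1803.10122. https://arxiv.org/abs/1803.10122 Training policies entirely in a learned latent dream-world. The NeurIPS 2018 conference version is titled “Recurrent World Models Facilitate Policy Evolution.”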
Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). Dream to Control: Learning Behaviors by Latent Imagination. ICLR 2020. https://arxiv.org/abs/1912.01603 DreamerV1.
Hafner, D., Lillicrap, T., Norouzi, M., & Ba, J. (2021). Mastering Atari with Discrete World Models. ICLR 2021. https://arxiv.org/abs/2010.02193 DreamerV2.
Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering Diverse Domains through World Models. arXiv:2301.04104. https://arxiv.org/abs/2301.04104 DreamerV3.

MuZero (learned model + MCTS)

Schrittwieser, J., Antonoglou, I., Hubert, T., et al. (2020). Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588, 604-609. https://www.nature.com/articles/s41586-020-03051-4 MuZero. Learns a latent dynamics model trained only to predict the rewards, values, and policies that MCTS needs, rather than to reconstruct observations.

Berkeley CS285 (course source for this track)

Levine, S. (2023). CS285 lecture on Model-Based Reinforcement Learning. UC Berkeley. https://rail.eecs.berkeley.edu/deeprlcourse/ Lecture covering model fitting and Dyna. CS285 L16 (planning with the model) is the natural companion and the source for L10.

Reference texts

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 5 (Machine Learning Basics) covers least squares and the bias-variance decomposition referenced in the lesson.
Boyd, S., & Vandenberghe, L. (2018). Introduction to Applied Linear Algebra. Cambridge University Press. Chapters 12-13 on least-squares estimation. Free online at https://web.stanford.edu/~boyd/vmls/.

Uncertainty quantification

Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. NeurIPS 2017. https://arxiv.org/abs/1612.01474 The “deep ensembles” paper, now a standard approach to epistemic uncertainty estimation. PETS applies the same ensemble idea to dynamics models.
Kendall, A., & Gal, Y. (2017). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? NeurIPS 2017. https://arxiv.org/abs/1703.04977 The clearest articulation of the aleatoric / epistemic split.

Source material

Source curriculum (structural mirror, cited as further study):
• UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
  Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
  Lecture videos: YouTube (link-out only)
Clawdemy's lessons are original prose that follows the pedagogical arc of this
source. We do not reproduce or transcribe it; we cite it as a recommended
companion. All rights to the original material remain with its authors.