References: Planning with a learned model

Primary sources (load-bearing for this lesson)

MuZero and the AlphaGo lineage

Schrittwieser, J., Antonoglou, I., Hubert, T., et al. (2020). Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588, 604-609. https://www.nature.com/articles/s41586-020-03051-4 MuZero. The end-to-end learned-model MCTS recipe. Hidden-state-space dynamics network; trained for planning quality, not observation reconstruction.
Silver, D., Hubert, T., Schrittwieser, J., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419), 1140-1144. https://www.science.org/doi/10.1126/science.aar6404 AlphaZero. The MuZero precursor with a perfect simulator.
Silver, D., Huang, A., Maddison, C. J., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484-489. https://www.nature.com/articles/nature16961 AlphaGo. The original deep-RL + MCTS combination.

MuZero variants and extensions

Ye, W., Liu, S., Kurutach, T., Abbeel, P., & Gao, Y. (2021). Mastering Atari Games with Limited Data. NeurIPS 2021. https://arxiv.org/abs/2111.00210 EfficientZero. MuZero with self-supervised consistency losses; exceeds median human performance on the Atari 100k benchmark, a data budget of only two hours of real-time game experience. The more accessible MuZero variant for academic-scale compute.
Hubert, T., Schrittwieser, J., Antonoglou, I., et al. (2021). Learning and Planning in Complex Action Spaces. ICML 2021. https://arxiv.org/abs/2104.06303 Sampled MuZero. Handles continuous action spaces via action sampling during MCTS.
Schrittwieser, J., Hubert, T., Mandhane, A., et al. (2021). Online and Offline Reinforcement Learning by Planning with a Learned Model. NeurIPS 2021. https://arxiv.org/abs/2104.06294 MuZero Unplugged. The offline-RL variant of MuZero.

Cross-entropy method (CEM)

de Boer, P.-T., Kroese, D. P., Mannor, S., & Rubinstein, R. Y. (2005). A Tutorial on the Cross-Entropy Method. Annals of Operations Research, 134, 19-67. https://link.springer.com/article/10.1007/s10479-005-5724-z The canonical CEM tutorial. Covers the rare-event-simulation origins, the optimization formulation, and the view of each update as minimizing the KL divergence (cross-entropy) between the sampling distribution and the optimal importance-sampling distribution.

CMA-ES (CEM’s stronger cousin)

Hansen, N. (2016). The CMA Evolution Strategy: A Tutorial. arXiv:1604.00772. https://arxiv.org/abs/1604.00772 CMA-ES with rank-1 and rank-µ covariance updates. Adapts a full covariance matrix and step size, which often makes it more robust than vanilla CEM on correlated or ill-conditioned action-sequence search problems.

MPPI (path-integral MPC)

Williams, G., Wagener, N., Goldfain, B., et al. (2017). Information theoretic MPC for model-based reinforcement learning. ICRA 2017. MPPI. Soft elite weighting via softmax instead of CEM’s hard top-K cutoff. Workhorse in racing robotics.
Williams, G., Drews, P., Goldfain, B., Rehg, J. M., & Theodorou, E. A. (2018). Information-Theoretic Model Predictive Control: Theory and Applications to Autonomous Driving. IEEE Transactions on Robotics, 34(6). https://arxiv.org/abs/1707.02342 The journal-length MPPI paper, applied to high-speed autonomous racing.
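
The contrast between CEM's hard top-K cutoff and MPPI's softmax weighting can be sketched in a few lines. This is a minimal illustration of the weighting step only, not either paper's full algorithm: the function names, the toy samples and costs, and the temperature value are all assumptions made for the example, and the rest of each method (refitting the CEM variance, MPPI's perturbations around a nominal control sequence) is left out.

```python
import math


def cem_weights(costs, num_elites):
    """Hard top-K weighting: the num_elites lowest-cost samples share the weight equally."""
    order = sorted(range(len(costs)), key=lambda i: costs[i])
    elites = set(order[:num_elites])
    return [1.0 / num_elites if i in elites else 0.0 for i in range(len(costs))]


def mppi_weights(costs, temperature):
    """Soft weighting: softmax of negative cost, scaled by a temperature."""
    best = min(costs)  # subtract the minimum cost for numerical stability
    unnorm = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def weighted_mean(samples, weights):
    """Weighted average of sampled action sequences (each a list of floats)."""
    horizon = len(samples[0])
    return [sum(w * s[t] for w, s in zip(weights, samples)) for t in range(horizon)]


# Toy data (hypothetical): four sampled 2-step action sequences and their rollout costs.
samples = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [4.0, 4.0]]
costs = [3.0, 1.0, 2.0, 10.0]

cem_plan = weighted_mean(samples, cem_weights(costs, num_elites=2))
mppi_plan = weighted_mean(samples, mppi_weights(costs, temperature=1.0))
```

With the hard cutoff, only the two cheapest samples contribute to the plan. With the softmax, every sample contributes, in proportion to how cheap it is. Lowering the temperature concentrates the softmax weights on the best samples, moving it toward the hard-cutoff behavior.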

Classical Model Predictive Control

Garcia, C. E., Prett, D. M., & Morari, M. (1989). Model predictive control: theory and practice (a survey). Automatica, 25(3), 335-348. The classical MPC reference, predating the RL adoption by decades.
Mayne, D. Q., Rawlings, J. B., Rao, C. V., & Scokaert, P. O. M. (2000). Constrained model predictive control: Stability and optimality. Automatica, 36(6), 789-814. The canonical stability-and-optimality analysis of receding-horizon control.

Modern model-based RL with planning

Chua, K., Calandra, R., McAllister, R., & Levine, S. (2018). Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. NeurIPS 2018. https://arxiv.org/abs/1805.12114 PETS. Probabilistic ensembles with trajectory sampling to propagate model uncertainty, combined with CEM to optimize action sequences.
Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering Diverse Domains through World Models. arXiv:2301.04104. https://arxiv.org/abs/2301.04104 DreamerV3. The Dyna-style alternative to MPC: imagine rollouts, train a policy on the imagined data, no planning at decision time.

Berkeley CS285 (course source for this track)

Levine, S. (2023). CS285 lecture on Model-Based Reinforcement Learning with Function Approximation. UC Berkeley. https://rail.eecs.berkeley.edu/deeprlcourse/ The course’s deep-learning-era treatment of planning with learned models. CS285 L15 was the source for L9 (learning the model); L16 is the natural pair.

Reference texts

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Free online: http://incompleteideas.net/book/the-book-2nd.html Chapter 8 covers planning in the tabular setting (Dyna, prioritized sweeping, rollout algorithms, and MCTS); value iteration is in Chapter 4. Modern function-approximation versions are in the original papers above.
Browne, C. B., Powley, E., Whitehouse, D., et al. (2012). A Survey of Monte Carlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1-43. The canonical MCTS survey predating the AlphaGo era; covers UCT and its many variants. For the PUCT selection rule as used in AlphaZero/MuZero, see the papers above.

Implementation references

OpenAI Spinning Up. https://spinningup.openai.com/ Reference implementations of model-free algorithms; no native model-based coverage but the model-free baselines are useful for comparison.
MuZero open-source implementations: several community ports exist (e.g., the EfficientZero authors released code at https://github.com/YeWR/EfficientZero). DeepMind published pseudocode alongside the MuZero paper but did not release its full training code; community implementations are research-grade approximations.

Source material

Source curriculum (structural mirror, cited as further study):
• UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
  Course page: https://rail.eecs.berkeley.edu/deeprlcourse/
  Lecture videos: YouTube (link-out only)
Clawdemy's lessons are original prose that follows the pedagogical arc of this
source. We do not reproduce or transcribe it; we cite it as a recommended
companion. All rights to the original material remain with its authors.