References: Control as inference

Primary sources (load-bearing for this lesson)

  • Levine, S. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv:1805.00909. https://arxiv.org/abs/1805.00909 The canonical modern reference. Derives the graphical model, soft Bellman backup, and connections to MaxEnt RL, IRL, and approximate inference. Read this lesson alongside the paper.
  • Toussaint, M., & Storkey, A. (2006). Probabilistic inference for solving discrete and continuous state Markov Decision Processes. ICML 2006. https://dl.acm.org/doi/10.1145/1143844.1143963 The original graphical-model construction. Pre-deep-learning.
  • Kappen, H. J. (2005). Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment. The path-integral formulation of stochastic optimal control, parallel to the inference framing.
  • Todorov, E. (2009). Efficient computation of optimal actions. PNAS, 106(28), 11478-11483. https://www.pnas.org/doi/10.1073/pnas.0710743106 Linearly-solvable MDPs; another lineage that connects to control-as-inference via KL-regularized objectives.
  • Ziebart, B. D., Maas, A., Bagnell, J. A., & Dey, A. K. (2008). Maximum Entropy Inverse Reinforcement Learning. AAAI 2008. The original maximum-entropy formulation, introduced in the inverse-RL setting. Predates SAC by a decade; arguably the earliest deployment of the variational-RL framework.
  • Finn, C., Levine, S., & Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. ICML 2016. https://arxiv.org/abs/1603.00448 Deep MaxEnt-IRL; the bridge from Ziebart 2008 to modern deep RL.
  • Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017). Reinforcement Learning with Deep Energy-Based Policies. ICML 2017. https://arxiv.org/abs/1702.08165 Soft Q-learning. The first deep-learning instantiation of the soft Bellman backup.
  • Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML 2018. https://arxiv.org/abs/1801.01290 SAC. The practical workhorse implementing the soft Bellman backup.
  • Haarnoja, T., Zhou, A., Hartikainen, K., et al. (2018). Soft Actor-Critic Algorithms and Applications. arXiv:1812.05905. https://arxiv.org/abs/1812.05905 The follow-up with automatic temperature tuning (the production-ready SAC).
  • Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022. https://arxiv.org/abs/2203.02155 InstructGPT. The full RLHF objective L = L^CLIP - β · KL(π_θ || π_pretrained) discussed in this lesson and Lesson 8.
  • Bai, Y., Jones, A., Ndousse, K., et al. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862. https://arxiv.org/abs/2204.05862 Anthropic’s RLHF paper. Same variational structure.
  • Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. https://arxiv.org/abs/2305.18290 DPO. The variational shortcut that skips the explicit reward model. The “secret reward model” the paper title alludes to is the policy itself evaluated via the variational identity.
  • Azar, M. G., Rowland, M., Piot, B., et al. (2024). A General Theoretical Paradigm to Understand Learning from Human Preferences. AISTATS 2024. https://arxiv.org/abs/2310.12036 IPO. Generalization of DPO with a different surrogate.
  • Levine, S. (2023). CS285 lecture on Reframing Control as an Inference Problem. UC Berkeley. https://rail.eecs.berkeley.edu/deeprlcourse/ Lecture slides + video covering the graphical model, soft Bellman backup, and the connections to MaxEnt RL and SAC.
  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Free online: http://incompleteideas.net/book/the-book-2nd.html The 2018 edition does not cover control-as-inference in depth (it predates much of the modern unification work); Chapter 6 on TD learning and Chapter 11 on off-policy methods with function approximation are the closest entry points.
  • Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. Free online: https://probml.github.io/pml-book/book2.html Chapter 35 (Reinforcement Learning) covers the variational framing including soft Q-learning and SAC.
  • Schulman, J., Chen, X., & Abbeel, P. (2017). Equivalence Between Policy Gradients and Soft Q-Learning. arXiv:1704.06440. https://arxiv.org/abs/1704.06440 Bridge paper: shows that policy gradient methods and soft Q-learning are equivalent in the entropy-regularized setting. Useful for understanding the SAC vs PPO algorithmic choice.
  • Nachum, O., Norouzi, M., Xu, K., & Schuurmans, D. (2017). Bridging the Gap Between Value and Policy Based Reinforcement Learning. NeurIPS 2017. PCL (path consistency learning). Another bridge between value- and policy-based RL via the soft Bellman backup.
Source curriculum (structural mirror, cited as further study):
• UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
Lecture videos: YouTube (link-out only)
Clawdemy's lessons are original prose that follows the pedagogical arc of this
source. We do not reproduce or transcribe it; we cite it as a recommended
companion. All rights to the original material remain with its authors.