References: Control as inference

Primary sources (load-bearing for this lesson)

The canonical reference

Levine, S. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv:1805.00909. https://arxiv.org/abs/1805.00909 The canonical modern reference. Derives the graphical model, soft Bellman backup, and connections to MaxEnt RL, IRL, and approximate inference. Read this lesson alongside the paper.

Pre-deep-learning precursors

Toussaint, M., & Storkey, A. (2006). Probabilistic inference for solving discrete and continuous state Markov Decision Processes. ICML 2006. https://dl.acm.org/doi/10.1145/1143844.1143963 The original graphical-model construction. Pre-deep-learning.
Kappen, H. J. (2005). Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment. The path-integral formulation of stochastic optimal control, parallel to the inference framing.
Todorov, E. (2009). Efficient computation of optimal actions. PNAS, 106(28), 11478-11483. https://www.pnas.org/doi/10.1073/pnas.0710743106 Linearly-solvable MDPs; another lineage that connects to control-as-inference via KL-regularized objectives.

MaxEnt IRL (the inverse direction)

Ziebart, B. D., Maas, A., Bagnell, J. A., & Dey, A. K. (2008). Maximum Entropy Inverse Reinforcement Learning. AAAI 2008. The original MaxEnt-RL formulation in inverse RL. Predates SAC by a decade; arguably the earliest deployment of the variational-RL framework.
Finn, C., Levine, S., & Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. ICML 2016. https://arxiv.org/abs/1603.00448 Deep MaxEnt-IRL; the bridge from Ziebart 2008 to modern deep RL.

Soft Q-learning and SAC

Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017). Reinforcement Learning with Deep Energy-Based Policies. ICML 2017. https://arxiv.org/abs/1702.08165 Soft Q-learning. The first deep-learning instantiation of the soft Bellman backup.
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML 2018. https://arxiv.org/abs/1801.01290 SAC. The practical workhorse implementing soft Bellman.
Haarnoja, T., Zhou, A., Hartikainen, K., et al. (2018). Soft Actor-Critic Algorithms and Applications. arXiv:1812.05905. https://arxiv.org/abs/1812.05905 The follow-up with automatic temperature tuning (the production-ready SAC).
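
For orientation while reading the three papers above: the soft Bellman backup replaces the hard max over actions with a temperature-scaled log-sum-exp. The following is a minimal sketch for a small discrete action set, not code from any of the papers; the function names, the temperature alpha, and the discount gamma are illustrative assumptions.

```python
import math

def soft_value(q_values, alpha=1.0):
    """Soft state value: alpha * log(sum_a exp(Q(s, a) / alpha))."""
    m = max(q_values)  # subtract the max before exponentiating for stability
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def soft_policy(q_values, alpha=1.0):
    """Boltzmann policy: pi(a|s) = exp((Q(s, a) - V_soft(s)) / alpha)."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]

def soft_backup(reward, next_q_values, gamma=0.99, alpha=1.0):
    """One-sample soft Bellman target: r + gamma * V_soft(s')."""
    return reward + gamma * soft_value(next_q_values, alpha)
```

As alpha shrinks toward zero, soft_value approaches max(q_values) and the policy concentrates on the greedy action, which recovers the ordinary Bellman backup.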

RLHF as special case

Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022. https://arxiv.org/abs/2203.02155 InstructGPT. The full RLHF objective L = L^CLIP - β · KL(π_θ || π_pretrained) discussed in this lesson and Lesson 8. Note that in the paper the KL reference policy is the supervised fine-tuned (SFT) model, not the raw pretrained model.
Bai, Y., Jones, A., Ndousse, K., et al. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862. https://arxiv.org/abs/2204.05862 Anthropic’s RLHF paper. Same variational structure.

DPO and direct preference optimization

Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. https://arxiv.org/abs/2305.18290 DPO. The variational shortcut that skips the explicit reward model. The “secret reward model” the paper title alludes to is the policy itself: the variational identity recovers the reward, up to a term that depends only on the prompt, as β · log(π_θ / π_ref).
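Azar, M. G., Rowland, M., Piot, B., et al. (2024). A General Theoretical Paradigm to Understand Learning from Human Preferences. AISTATS 2024. https://arxiv.org/abs/2310.12036 IPO. Introduces the general ΨPO objective, which contains RLHF and DPO as special cases, and the IPO variant, which replaces DPO's log-sigmoid surrogate with a squared loss.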
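
As a reading aid for the DPO entry above, the implicit reward and the resulting pairwise loss can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, and the inputs are assumed to be sequence-level log-probabilities (summed over tokens) of a chosen response w and a rejected response l under the policy and under a frozen reference model.

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """Reward implied by the policy, up to a prompt-only term: beta * log(pi / pi_ref)."""
    return beta * (logp_policy - logp_ref)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Negative log-sigmoid of the implicit-reward margin (chosen minus rejected)."""
    margin = (implicit_reward(logp_w, ref_logp_w, beta)
              - implicit_reward(logp_l, ref_logp_l, beta))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), evaluated without overflow
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; raising the chosen response's log-probability relative to the reference lowers the loss. The prompt-only term drops out because it is shared by both responses in a pair.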

Berkeley CS285 (course source)

Levine, S. (2023). CS285 lecture on Reframing Control as an Inference Problem. UC Berkeley. https://rail.eecs.berkeley.edu/deeprlcourse/ Lecture slides + video covering the graphical model, soft Bellman backup, and the connections to MaxEnt RL and SAC.

Reference texts

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Free online: http://incompleteideas.net/book/the-book-2nd.html The 2018 edition does not cover control-as-inference in depth (it predates much of the modern unification work); chapter 6 on temporal-difference learning and chapter 11 on off-policy methods with function approximation are the closest entry points.
Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. Free online: https://probml.github.io/pml-book/book2.html Chapter 35 (Reinforcement Learning) covers the variational framing including soft Q-learning and SAC.

Related and extension reading

Schulman, J., Chen, X., & Abbeel, P. (2017). Equivalence Between Policy Gradients and Soft Q-Learning. arXiv:1704.06440. https://arxiv.org/abs/1704.06440 Bridge paper: shows that entropy-regularized policy gradients and soft Q-learning are equivalent under an appropriate parameterization. Useful for understanding the SAC vs PPO algorithmic choice.
Nachum, O., Norouzi, M., Xu, K., & Schuurmans, D. (2017). Bridging the Gap Between Value and Policy Based Reinforcement Learning. NeurIPS 2017. PCL (path consistency learning). Another bridge between value- and policy-based RL via the soft Bellman backup.

Source material

Source curriculum (structural mirror, cited as further study):
• UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
  Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
  Lecture videos: YouTube (link-out only)
Clawdemy's lessons are original prose that follows the pedagogical arc of this
source. We do not reproduce or transcribe it; we cite it as a recommended
companion. All rights to the original material remain with its authors.
