References: Variational inference for RL

Primary sources (load-bearing for this lesson)

The reparameterization trick / VAE

Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR 2014. https://arxiv.org/abs/1312.6114 The VAE paper. Introduces the reparameterization trick in modern deep-learning form. Foundational.
Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML 2014. https://arxiv.org/abs/1401.4082 Parallel paper, same year, that also introduced the reparameterization trick. The two papers are co-foundational.

Variational inference fundamentals

Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518), 859-877. https://arxiv.org/abs/1601.00670 The accessible modern review of variational inference. Covers the ELBO derivation, mean-field VI, stochastic VI, and the connection to information geometry.
Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. Free online: https://probml.github.io/pml-book/book2.html Chapter 10 covers variational inference in depth. The standard modern reference.

Latent-state world models for RL

Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., & Davidson, J. (2019). Learning Latent Dynamics for Planning from Pixels. ICML 2019. https://arxiv.org/abs/1811.04551 PlaNet. The RSSM (recurrent state-space model) trained with the ELBO; planning via CEM in latent space.
Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). Dream to Control: Learning Behaviors by Latent Imagination. ICLR 2020. https://arxiv.org/abs/1912.01603 DreamerV1.
Hafner, D., Lillicrap, T., Norouzi, M., & Ba, J. (2021). Mastering Atari with Discrete World Models. ICLR 2021. https://arxiv.org/abs/2010.02193 DreamerV2.
Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering Diverse Domains through World Models. arXiv:2301.04104. https://arxiv.org/abs/2301.04104 DreamerV3.

MaxEnt RL / SAC

Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML 2018. https://arxiv.org/abs/1801.01290 SAC. The widely-deployed MaxEnt RL workhorse for continuous control.
Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017). Reinforcement Learning with Deep Energy-Based Policies. ICML 2017. https://arxiv.org/abs/1702.08165 Soft Q-learning. The energy-based-policy precursor to SAC; first clean exposition of the variational framing.
Ziebart, B. D., Maas, A., Bagnell, J. A., & Dey, A. K. (2008). Maximum Entropy Inverse Reinforcement Learning. AAAI 2008. The original maximum-entropy framing, introduced for inverse RL a decade before SAC. MaxEnt was first studied there as a way to handle reward ambiguity in IRL.

Control as inference (L12 source)

Levine, S. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv:1805.00909. https://arxiv.org/abs/1805.00909 The canonical reference for the control-as-inference framing. Read together with this lesson; that paper is L12’s primary source.
Toussaint, M., & Storkey, A. (2006). Probabilistic inference for solving discrete and continuous state Markov Decision Processes. ICML 2006. https://dl.acm.org/doi/10.1145/1143844.1143963 Pre-deep-learning predecessor; frames planning in an MDP as probabilistic inference in a graphical model.

RLHF (KL-regularized variant)

Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022. https://arxiv.org/abs/2203.02155 InstructGPT. The source for the full RLHF objective L = L^CLIP - β · KL(π_θ || π_pretrained) discussed in this lesson. Note that the paper anchors the KL penalty to the supervised fine-tuned (SFT) policy.
Bai, Y., Jones, A., Ndousse, K., et al. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862. https://arxiv.org/abs/2204.05862 Anthropic’s RLHF paper.

Berkeley CS285 (course source for this track)

Levine, S. (2023). CS285 lecture on Variational Inference and Generative Models. UC Berkeley. https://rail.eecs.berkeley.edu/deeprlcourse/ Slides + video covering the ELBO derivation, reparameterization, and the canonical examples (VAE, latent-state models). The natural pair for this lesson.

Reference texts

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 10 covers variational inference rigorously. The pre-deep-learning reference; still the cleanest derivation of the ELBO.
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. Chapter 21 covers variational inference. An earlier treatment that complements the 2023 Advanced Topics volume listed above.

Recent applications and extensions

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022. https://arxiv.org/abs/2112.10752 Latent diffusion = variational latent space + diffusion process. The architecture behind Stable Diffusion. Variational inference scaled to high-resolution generation.

Source material

Source curriculum (structural mirror, cited as further study):
• UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
  Course page: https://rail.eecs.berkeley.edu/deeprlcourse/
  Lecture videos: YouTube (link-out only)
Clawdemy's lessons are original prose that follows the pedagogical arc of this
source. We do not reproduce or transcribe it; we cite it as a recommended
companion. All rights to the original material remain with its authors.