References: PPO (clipped surrogate objective)

Primary sources (load-bearing for this lesson)

The PPO paper

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347. https://arxiv.org/abs/1707.06347 The PPO paper. Introduces both the clipped surrogate (the practical workhorse) and the adaptive-KL-penalty variant. Empirical results on MuJoCo, Roboschool, and Atari.
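
For orientation, the clipped surrogate introduced in the paper above can be sketched in a few lines. This is a minimal sketch, not the authors' code: the function name `clipped_surrogate`, the use of plain Python lists instead of tensors, and the default eps = 0.2 (a value reported in the paper's experiments) are illustrative assumptions.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples.

    logp_new / logp_old are log-probabilities of the taken actions under
    the current policy and the data-collecting policy (hypothetical names).
    The result is an objective to maximize; negate it to get a loss.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                 # importance ratio r
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))   # clip(r, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)          # pessimistic bound
    return total / len(advantages)
```

With a positive advantage the objective stops growing once the ratio exceeds 1 + eps; with a negative advantage and a ratio above 1 the unclipped (worse) term is kept, so the bound stays pessimistic.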

TRPO (the precursor)

Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust Region Policy Optimization. ICML 2015. https://arxiv.org/abs/1502.05477 The TRPO paper. Establishes the theoretical foundation: a monotonic-improvement bound, approximated in practice by a KL trust-region constraint; solved with a natural-gradient step computed by conjugate gradients, followed by a line search.
Kakade, S., & Langford, J. (2002). Approximately optimal approximate reinforcement learning. ICML 2002. The conservative policy iteration paper that motivated TRPO. The KL trust region is a relaxation of the conservative-update mixing parameter.

GAE (the standard advantage estimator used with PPO)

Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR 2016. https://arxiv.org/abs/1506.02438 GAE = exponentially weighted average of n-step advantage estimators, equivalently a (γλ)-discounted sum of TD residuals. Many PPO implementations default to GAE with λ = 0.95.
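
The GAE recursion from the paper above is short enough to sketch directly. This is a minimal sketch under stated assumptions: the function name `gae_advantages` is hypothetical, the input is a single trajectory with no terminal state inside it, and `values` carries one extra bootstrap entry for the state after the last step.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), length T + 1 (last entry is the bootstrap value)
    Assumes no episode boundary inside the trajectory.
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lam) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting lam = 0 recovers the one-step TD residual, and lam = 1 recovers the discounted return minus the value baseline.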

RLHF (the killer application)

Christiano, P. F., Leike, J., Brown, T. B., et al. (2017). Deep reinforcement learning from human preferences. NeurIPS 2017. https://arxiv.org/abs/1706.03741 An early end-to-end demonstration of deep RL from human preferences, on Atari and MuJoCo tasks; the methodological foundation for what later became RLHF on language models.
Stiennon, N., Ouyang, L., Wu, J., et al. (2020). Learning to summarize with human feedback. NeurIPS 2020. https://arxiv.org/abs/2009.01325 RLHF for the summarization task; one of the first scaled demonstrations on language models.
Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022. https://arxiv.org/abs/2203.02155 The InstructGPT paper. PPO for language-model fine-tuning at scale. Its three-stage pipeline (supervised fine-tuning, reward model, PPO) became the template that many later instruction-tuned LLMs followed.
Bai, Y., Jones, A., Ndousse, K., et al. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862. https://arxiv.org/abs/2204.05862 Anthropic’s RLHF paper. PPO with a KL penalty against the initial policy, applied to helpfulness and harmlessness training.

PPO variants and successors

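DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948. https://arxiv.org/abs/2501.12948 Trains with GRPO (Group Relative Policy Optimization), a PPO variant introduced in DeepSeekMath (Shao et al., 2024); drops the value network and computes each sampled response's advantage by normalizing its reward against the mean and standard deviation of its group.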
Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. https://arxiv.org/abs/2305.18290 DPO: skip the reward-model + PPO sandwich, optimize the policy directly against preference pairs. Now widely used alongside PPO in production RLHF pipelines.
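
To make the DPO entry above concrete, here is a minimal sketch of its per-pair loss. The function name `dpo_loss`, the scalar inputs, and the default beta = 0.1 are illustrative assumptions; real implementations work on batched sequence log-probabilities.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    logp_* are summed log-probabilities of the chosen / rejected response
    under the policy being trained; ref_* are the same quantities under the
    frozen reference model. All argument names are hypothetical.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # Numerically stable softplus(-margin) = log(1 + exp(-margin))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference model the margin is zero and the loss is log 2; raising the chosen response's log-probability relative to the reference lowers the loss.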

Berkeley CS285 (course source for this track)

Levine, S. (2023). CS285 lectures on Advanced Policy Gradients. UC Berkeley. https://rail.eecs.berkeley.edu/deeprlcourse/ Lecture 9 (Advanced Policy Gradients) covers the policy-improvement bound, natural gradient, and TRPO, the background that PPO builds on.

Implementation references

OpenAI Spinning Up: PPO. https://spinningup.openai.com/en/latest/algorithms/ppo.html Clean, well-documented reference implementation in PyTorch and TensorFlow. The “RL textbook in code” for the deep-RL canon.
Engstrom, L., Ilyas, A., Santurkar, S., et al. (2020). Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO. ICLR 2020. https://arxiv.org/abs/2005.12729 Empirical study showing that several “small details” in PPO’s standard implementation (value-function clipping, reward scaling, learning rate annealing, gradient clipping) account for more of PPO’s performance gain over TRPO than the clipped surrogate itself. Required reading for anyone implementing PPO from scratch.

Sutton & Barto reference chapters

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Free online: http://incompleteideas.net/book/the-book-2nd.html
- Chapter 13 (Policy Gradient Methods). REINFORCE through actor-critic; foundation for understanding what PPO is patching.
- Section 5.5 (Off-policy Prediction via Importance Sampling). The importance-sampling correction that PPO’s surrogate is built on.

Source material

Source curriculum (structural mirror, cited as further study):
• UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)
  Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
  Lecture videos: YouTube (link-out only)
Clawdemy's lessons are original prose that follows the pedagogical arc of this
source. We do not reproduce or transcribe it; we cite it as a recommended
companion. All rights to the original material remain with its authors.