References: Policy gradients (REINFORCE)
Source material
Source curriculum (structural mirror, cited as further study):

- Berkeley CS285 (CS185), Deep Reinforcement Learning, Lecture 5: Policy Gradients
  - Instructor: Sergey Levine
  - Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
  - Lecture videos (Fall 2023 recordings, most recent at time of authoring): https://www.youtube.com/playlist?list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps
  - License: YouTube standard (link-out only, no embed, no transcript republication)

This Clawdemy lesson is an original derivation of REINFORCE from the log-derivative trick, with a worked sigmoid-bandit example and a dual-path verification of the analytic expectation against single-sample variance, following the pedagogical arc of CS285 Lecture 5. We cite the lecture as the recommended full-depth companion; we do not reproduce or transcribe the videos. All rights to the original lectures remain with the creator.

Watch this next
- CS285 Lecture 5, Policy Gradients (Sergey Levine, Berkeley). The lecture this lesson mirrors. Levine derives the policy gradient theorem step by step, works the log-derivative trick visually, and walks through rewards-to-go and baseline subtraction with explicit variance analysis. The “why each refinement helps” intuition is sharper in lecture than text can fully convey.
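As a quick refresher before the lecture, here is a minimal sketch of the log-derivative trick on a two-armed sigmoid bandit. The setup is our own assumption for illustration (a single parameter `theta`, action 1 chosen with probability `sigmoid(theta)`, and made-up rewards `r1` and `r0`); it is not taken from the lecture and need not match the lesson's exact numbers. It compares the exact gradient of the expected reward with the average of single-sample score-function estimates, and reports how noisy one sample is.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def analytic_gradient(theta, r1, r0):
    """Exact d/dtheta of J(theta) = p * r1 + (1 - p) * r0, with p = sigmoid(theta)."""
    p = sigmoid(theta)
    return p * (1.0 - p) * (r1 - r0)


def score_function_sample(theta, r1, r0, rng):
    """One REINFORCE sample: reward times d/dtheta log pi(a)."""
    p = sigmoid(theta)
    if rng.random() < p:
        # Action 1: d/dtheta log(p) = 1 - p
        return r1 * (1.0 - p)
    # Action 0: d/dtheta log(1 - p) = -p
    return r0 * (-p)


def monte_carlo_gradient(theta, r1, r0, n, seed=0):
    """Mean and variance of n independent single-sample estimates."""
    rng = random.Random(seed)
    samples = [score_function_sample(theta, r1, r0, rng) for _ in range(n)]
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, var


if __name__ == "__main__":
    theta, r1, r0 = 0.3, 2.0, -1.0  # hypothetical values
    mean, var = monte_carlo_gradient(theta, r1, r0, n=100_000)
    print("analytic:", analytic_gradient(theta, r1, r0))
    print("sampled mean:", mean, "single-sample variance:", var)
```

The sampled mean should sit close to the analytic value, while the single-sample variance stays well above zero; that gap is what rewards-to-go and baselines are meant to shrink.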
Going deeper (foundational papers)
- Simple statistical gradient-following algorithms for connectionist reinforcement learning (Williams, 1992). The original REINFORCE paper. Williams introduces the algorithm, proves it is an unbiased estimator of the policy gradient, and discusses the baseline. The paper that named the family.
- Policy gradient methods for reinforcement learning with function approximation (Sutton, McAllester, Singh, Mansour, NeurIPS 2000). The policy-gradient theorem in its modern form, with the function-approximation analysis that justifies pairing it with neural networks. The paper that made deep policy gradients formally respectable.
- High-dimensional continuous control using generalized advantage estimation (Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016). The GAE paper. Develops the lambda-weighted advantage estimator that interpolates between one-step bootstrapped advantages (lower variance, more bias) and Monte-Carlo returns (no bootstrap bias, higher variance). The standard baseline-and-advantage estimator used in PPO today.
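To make the lambda-weighted estimator concrete, here is a minimal sketch of the GAE recursion for a single trajectory. The function name `gae_advantages` and the convention that `values` carries one extra bootstrap entry at the end are our own assumptions, not the paper's notation or any library's API.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    Assumes len(values) == len(rewards) + 1; the last entry is the
    bootstrap value of the final state (0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum after it.
    for t in reversed(range(len(rewards))):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lam) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]          # hypothetical rewards
    values = [0.5, 0.5, 0.5, 0.0]      # hypothetical value estimates
    print(gae_advantages(rewards, values, gamma=1.0, lam=0.0))
    print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))
```

With `lam=0.0` the result reduces to the one-step TD residuals, and with `lam=1.0` it reduces to the Monte-Carlo return minus the value estimate; intermediate values trade between the two.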
Going deeper (textbooks and tutorials)
- Reinforcement Learning: An Introduction (Sutton and Barto, 2nd edition), Chapter 13: Policy Gradient Methods. The textbook treatment, with the policy-gradient theorem derived from scratch, the REINFORCE algorithm, and the baseline + advantage analysis. The exposition pairs naturally with the CS285 lecture; the textbook is more thorough on the proofs, the lecture more direct on the deep-RL practicalities.
- Spinning Up in Deep RL: Vanilla Policy Gradient. Achiam’s pedagogical implementation of vanilla policy gradients with pseudocode, mathematical derivation, and working code. Useful as the practical companion when you go to implement REINFORCE for the first time.
Adjacent topics
Where this sits in the wider curriculum.
- Actor-critic methods (next lesson). The natural variance-reduction step from REINFORCE: learn a value function V_φ(s) alongside the policy and use it as the baseline. The bracket G_t - V_φ(s_t) becomes a learned advantage estimate, lower-variance than Monte-Carlo returns. The workhorse template for modern policy-gradient methods.
- Advanced policy gradients: TRPO and PPO (lesson 8). Trust-region and clipped-surrogate refinements on REINFORCE that bound the policy update size, so the updated policy stays close enough to the data-collecting policy for the on-policy gradient estimate to remain valid. PPO is widely used for the RLHF post-training step of modern LLMs.
- RL for large language models (lesson 13, RLHF). The RLHF pipeline uses PPO (and therefore REINFORCE underneath) to fine-tune an LLM against a learned reward model. The log-derivative trick from this lesson is the calculus identity that makes the LM-as-policy gradient computable.
- T17 (RL Foundations) Chapter on Policy Gradient. T17’s lesson on classical policy-gradient methods covers the same derivation at lower scale (tabular policies, not neural-network policies). If T17 is the parallel prerequisite track, that lesson is the tabular twin of this one.
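The bracket G_t - V_φ(s_t) from the actor-critic entry above can be sketched in a few lines. This is an illustration under our own assumptions: the helper names `rewards_to_go` and `baseline_advantages` are hypothetical, and `values` stands in for the outputs of a learned V_φ, which this sketch does not train.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Discounted return G_t from each time step to the end of the episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # G_t = r_t + gamma * G_{t+1}
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def baseline_advantages(rewards, values, gamma=1.0):
    """G_t - V(s_t) for each step; values are placeholder baseline estimates."""
    returns = rewards_to_go(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0]   # hypothetical episode
    values = [1.0, 1.0, 1.0]    # hypothetical baseline outputs
    print(rewards_to_go(rewards, gamma=0.5))
    print(baseline_advantages(rewards, values, gamma=0.5))
```

In REINFORCE with a baseline, each of these differences multiplies the gradient of log π(a_t | s_t); actor-critic methods replace the fixed placeholder values with a value function trained alongside the policy.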