References: Policy gradients (REINFORCE)

Source curriculum (structural mirror, cited as further study):
• Berkeley CS285 (CS185), Deep Reinforcement Learning, Lecture 5: Policy Gradients
Instructor: Sergey Levine
Course page: http://rail.eecs.berkeley.edu/deeprlcourse/
Lecture videos (Fall 2023 recordings, most recent at time of authoring):
https://www.youtube.com/playlist?list=PL_iWQOsE6TfVYGEGiAOMaOzzv41Jfm_Ps
License: YouTube standard (link-out only, no embed, no transcript republication)
This Clawdemy lesson is an original derivation of REINFORCE from the
log-derivative trick, with a worked sigmoid-bandit example and a dual-path
verification of the analytic expectation against single-sample variance,
following the pedagogical arc of CS285 Lecture 5. We cite the lecture as the
recommended full-depth companion; we do not reproduce or transcribe the videos.
All rights to the original lectures remain with the creator.
  • CS285 Lecture 5, Policy Gradients (Sergey Levine, Berkeley). The lecture this lesson mirrors. Levine derives the policy gradient theorem step by step, works the log-derivative trick visually, and walks through rewards-to-go and baseline subtraction with explicit variance analysis. The “why each refinement helps” intuition is sharper in lecture than text can fully convey.
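For orientation, the identities that the lecture builds up can be written compactly. This is a sketch in common notation, not a transcription of the lecture: the symbols J(θ) for expected return, τ for a trajectory, R(τ) for its total reward, T for the horizon, and b for a baseline that does not depend on the action are assumptions introduced here.

```latex
% Log-derivative trick: \nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)
\[
\nabla_\theta J(\theta)
  = \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\bigl[ R(\tau) \bigr]
  = \mathbb{E}_{\tau \sim \pi_\theta}\bigl[ \nabla_\theta \log \pi_\theta(\tau) \, R(\tau) \bigr]
\]

% Rewards-to-go G_t and an action-independent baseline b leave the expectation unchanged
\[
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\Bigl[ \sum_{t=0}^{T-1}
      \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \bigl( G_t - b \bigr) \Bigr],
\qquad
G_t = \sum_{t'=t}^{T-1} r_{t'}
\]
```

Both refinements keep the estimator unbiased; what they change is its variance, which is the part the lecture analyzes explicitly.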

Where this sits in the wider curriculum.

  • Actor-critic methods (next lesson). The natural variance-reduction step from REINFORCE: learn a value function V_φ(s) alongside the policy and use it as the baseline. The bracket G_t - V_φ(s_t) becomes a learned advantage estimate, which gives a lower-variance gradient than weighting by the raw Monte Carlo return G_t. This is the workhorse template for modern policy-gradient methods.

  • Advanced policy gradients: TRPO and PPO (lesson 8). Trust-region and clipped-surrogate refinements of REINFORCE that bound the size of each policy update, so the updated policy stays close enough to the one that collected the data for the on-policy assumption to hold approximately. PPO is the algorithm used in the RLHF post-training step of many modern LLMs.

  • RL for large language models (lesson 13, RLHF). The RLHF pipeline uses PPO (and therefore REINFORCE underneath) to fine-tune an LLM against a learned reward model. The log-derivative trick from this lesson is the calculus identity that makes the LM-as-policy gradient computable.

  • T17 (RL Foundations) Chapter on Policy Gradient. T17’s lesson on classical policy-gradient methods covers the same derivation at a smaller scale (tabular policies, not neural-network policies). If T17 is the parallel prerequisite track, that lesson is the tabular twin of this one.
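The advantage bracket from the actor-critic entry and the clipped surrogate from the TRPO/PPO entry can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in later lessons: the function names, the discount `gamma`, the clip width `eps=0.2`, and the example critic values are all hypothetical choices made for this sketch.

```python
import math


def rewards_to_go(rewards, gamma=1.0):
    """Return G_t = sum over k >= t of gamma^(k - t) * r_k, for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages(rewards, values, gamma=1.0):
    """Advantage estimate G_t - V(s_t); `values` stands in for critic outputs."""
    return [g - v for g, v in zip(rewards_to_go(rewards, gamma), values)]


def clipped_surrogate(logp_new, logp_old, adv, eps=0.2):
    """PPO-style clipped objective (to be maximized), averaged over samples."""
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, adv):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Taking the minimum removes any gain from pushing the ratio
        # outside [1 - eps, 1 + eps] in the direction the advantage favors.
        total += min(ratio * a, clipped * a)
    return total / len(adv)


rewards = [1.0, 0.0, 2.0]
values = [2.5, 1.5, 1.0]  # hypothetical critic outputs V(s_t)
adv = advantages(rewards, values)
print(rewards_to_go(rewards))  # [3.0, 2.0, 2.0]
print(adv)                     # [0.5, 0.5, 1.0]
```

When the new and old log-probabilities are equal, the ratio is 1 and the surrogate reduces to the mean advantage, which is the plain REINFORCE-with-baseline weighting.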