Policy gradients (REINFORCE)

Lesson 3 ended on the deep-RL dispatch table: every algorithm in this track estimates one of π, V, Q, A, or P. This lesson takes the policy-gradient branch and derives the simplest member of the family. The single capability it builds: derive the REINFORCE policy-gradient estimator from the log-derivative trick, run it by hand on a small bandit, and name the high-variance failure mode that the rest of the policy-gradient lessons exist to manage.

You will meet the RL objective J(θ) = E_(τ ~ π_θ) [R(τ)] and its obstacle: the sampling distribution depends on θ, so you cannot simply differentiate the integrand. You will apply the log-derivative trick, ∇_θ log p(x; θ) = (∇_θ p) / p, to convert the gradient of an expectation into an expectation of the integrand times a score, then factor the trajectory distribution and watch the environment dynamics P drop out. The gradient depends only on Σ_t ∇_θ log π_θ(a_t | s_t), which is what “model-free” means and why deep RL is possible without ever knowing P. From there you will write the REINFORCE algorithm in five lines and confirm the estimator is unbiased, then add the two variance-reduction refinements: rewards-to-go (via causality) and baseline subtraction, which stays unbiased and reduces variance when the baseline approximates V^π, giving the advantage A^π = Q^π - V^π from lesson 3. Finally, you will work a sigmoid bandit by hand (σ(0) = 0.5; after one rewarding sample, σ(0.5) ≈ 0.622; after a second, σ(0.878) ≈ 0.706) and run the dual-path check: at θ = 0 the analytic E[g] = σ(θ)(1 - σ(θ)) = 0.25 matches the sample mean, while the variance is 0.0625, so the standard deviation (0.25) equals the signal.
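The bandit numbers quoted above can be reproduced with a short sketch. It assumes a setup that the lesson body may define differently: a two-action bandit with π(a = 1) = σ(θ), reward 1 for action 1 and 0 for action 0, and a step size of 1 (the combination the quoted updates 0 → 0.5 → 0.878 suggest). The names `reinforce_step` and `gradient_moments` are hypothetical helpers, not part of any library.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reinforce_step(theta: float, action: int, reward: float, lr: float = 1.0) -> float:
    """One REINFORCE update for a Bernoulli policy pi(a=1) = sigmoid(theta)."""
    p = sigmoid(theta)
    # Score d/dtheta log pi(a): (1 - p) when a == 1, -p when a == 0.
    score = action - p
    return theta + lr * reward * score


def gradient_moments(theta: float) -> tuple[float, float]:
    """Exact mean and variance of g = R * score under the assumed rewards."""
    p = sigmoid(theta)
    # (probability, gradient sample) for a = 1 (reward 1) and a = 0 (reward 0).
    outcomes = [(p, 1.0 * (1.0 - p)), (1.0 - p, 0.0)]
    mean = sum(prob * g for prob, g in outcomes)
    second_moment = sum(prob * g * g for prob, g in outcomes)
    return mean, second_moment - mean * mean


# Two rewarding samples in a row (assumed: action 1, reward 1, lr 1).
theta = 0.0
trace = [sigmoid(theta)]
for _ in range(2):
    theta = reinforce_step(theta, action=1, reward=1.0)
    trace.append(sigmoid(theta))
print([round(p, 3) for p in trace])  # [0.5, 0.622, 0.706]

mean, var = gradient_moments(0.0)
print(mean, var, math.sqrt(var))  # 0.25 0.0625 0.25
```

Under these assumptions the standard deviation of a single-sample gradient is as large as the gradient itself, which is the high-variance failure mode the lesson names.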

This is lesson 4 of Phase 1 (RL foundations). It is the first algorithmic move in the track, the simplest deep-RL algorithm, and the algorithmic spine of every later policy-gradient method (actor-critic in lesson 5, TRPO and PPO in lesson 8). It also leads directly to RLHF in lesson 13: the “PPO” used to post-train modern LLMs is REINFORCE with a clipped trust-region refinement, layered on this lesson’s ∇_θ J = E[R · Σ ∇log π] estimator. The variance-reduction refinements introduced here (rewards-to-go, baseline subtraction, the advantage) carry forward to every subsequent lesson.

Prerequisite (within this track): lesson 3, RL fundamentals (MDPs, value functions, Bellman), since the policy gradient uses the trajectory distribution, the return, and (in its variance-reduced form) the advantage A^π = Q^π - V^π defined there. Background from earlier tracks: T8 (Calculus) for the chain rule and ∇_θ log of a parameterized distribution, and T11/T12/T13 for the picture of a neural network as a parameterized function whose parameters you train by gradient ascent/descent. The math leans on log, exponential (for the sigmoid worked example), and one trick from calculus (∇p = p · ∇log p); a calculator helps for the bandit numerics.
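As a compact reference for the calculus trick named above, the derivation can be sketched in four lines. The symbols p_θ(τ) for the trajectory density and ρ_0 for the initial-state distribution are notation assumed here; the lesson body may use different names.

```latex
\begin{align*}
\nabla_\theta J(\theta)
  &= \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau
   = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
   = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \right], \\
\log p_\theta(\tau)
  &= \log \rho_0(s_0) + \sum_{t} \log \pi_\theta(a_t \mid s_t) + \sum_{t} \log P(s_{t+1} \mid s_t, a_t), \\
\nabla_\theta \log p_\theta(\tau)
  &= \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t), \\
\nabla_\theta J(\theta)
  &= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right].
\end{align*}
```

The first line uses ∇p = p · ∇log p. In the second line only the policy terms depend on θ, so the initial-state and dynamics terms vanish under ∇_θ, which is why P never has to be known.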

  • State the log-derivative trick and apply it to derive the policy-gradient theorem for the RL objective J(θ) = E[R(τ)]
  • Explain why the environment dynamics P drop out, making deep RL model-free
  • Write the REINFORCE algorithm in five lines and explain why the estimator is unbiased
  • Derive the rewards-to-go and baseline-subtraction refinements and connect the baseline-subtracted return to the advantage A^π = Q^π - V^π
  • Run a REINFORCE update by hand on a sigmoid bandit and verify the dual-path equality between the analytic expected gradient and a sample mean, naming the high-variance failure mode
  • Read time: about 14 minutes
  • Practice time: about 14 minutes (a fresh REINFORCE update from θ_0 = -1 on the sigmoid bandit, the dual-path analytic-vs-sample check, and flashcards)
  • Difficulty: standard
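The dual-path check and the baseline refinement listed in the objectives can be sketched as a small Monte Carlo experiment. This is an illustration under assumptions, not the lesson's reference solution: it assumes the same bandit as the overview (π(a = 1) = σ(θ), reward 1 for action 1 and 0 otherwise), and `sample_gradients` and `mean_and_variance` are hypothetical helper names.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sample_gradients(theta: float, n: int, baseline: float = 0.0, seed: int = 0) -> list[float]:
    """Draw n single-sample estimates g = (R - b) * d/dtheta log pi(a)."""
    rng = random.Random(seed)
    p = sigmoid(theta)
    grads = []
    for _ in range(n):
        a = 1 if rng.random() < p else 0
        r = float(a)  # assumed reward: 1 for action 1, 0 for action 0
        grads.append((r - baseline) * (a - p))
    return grads


def mean_and_variance(xs: list[float]) -> tuple[float, float]:
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v


theta = 0.0
analytic = sigmoid(theta) * (1.0 - sigmoid(theta))  # 0.25 at theta = 0

# Path 1 vs path 2: analytic expectation against a sample mean, no baseline.
m_plain, v_plain = mean_and_variance(sample_gradients(theta, 100_000))

# Same estimator with baseline b = 0.5, the expected reward at theta = 0.
m_base, v_base = mean_and_variance(sample_gradients(theta, 100_000, baseline=0.5))

print(f"analytic={analytic:.4f}")
print(f"no baseline: mean={m_plain:.4f} var={v_plain:.4f}")
print(f"baseline 0.5: mean={m_base:.4f} var={v_base:.4f}")
```

In this one-step bandit the expected reward at θ = 0 is 0.5, so subtracting it makes every sample equal to 0.25: the mean is unchanged and the variance collapses to zero. That is an unusually clean special case of the baseline argument; in general the baseline reduces variance without removing it entirely.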