
Policy gradient and the path to modern RL

This is lesson 10 of Track 17 (Reinforcement Learning Foundations): the capstone, and the close of Phase 4 (Scaling up). The earlier lessons learned a value function (V or Q) and read the policy off greedily. Policy gradient inverts that move: parameterize the policy directly, then take gradient steps that increase the probability of actions that lead to high return. That single conceptual change unlocks continuous action spaces, makes stochastic policies first-class citizens, and underlies the whole family of modern actor-critic methods, PPO, and the RLHF step that fine-tunes large language models. The source curriculum is David Silver’s UCL RL course (CC BY-NC 4.0), freely available and cited per lesson as further study.

The lesson explains why direct policy parameterization is needed, states the policy-gradient theorem at intuition level, and writes the REINFORCE update. It then walks one policy-gradient step on a tiny 2-action softmax policy: sample a_1 with return G = 2, theta climbs to (0.1, -0.1), and pi(a_1) rises from 0.50 to about 0.55. Next it sketches actor-critic as the variance-reduction step that produces the modern algorithms (A2C/A3C, PPO, SAC). It closes the track with the bridge to RLHF: the language model is the policy, the reward comes from a reward model learned from human preferences, and PPO is the algorithm. Track 5’s rlhf-and-dpo covers the alignment side; this track gave you the RL mechanics underneath.
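The single step quoted above can be reproduced in a few lines of Python. This is a minimal sketch, not the lesson's own code. It assumes one logit per action, a starting point of theta = (0, 0) (implied by pi(a_1) = 0.50), and a step size alpha = 0.1 (the value that lands theta on (0.1, -0.1) when G = 2). The function names are hypothetical.

```python
import math

def softmax(theta):
    """Turn a list of logits into action probabilities."""
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, taken):
    """Softmax score function: 1{i = taken} - pi(a_i) for each logit i."""
    pi = softmax(theta)
    return [(1.0 if i == taken else 0.0) - p for i, p in enumerate(pi)]

def reinforce_step(theta, taken, G, alpha):
    """One REINFORCE update: theta + alpha * G * grad log pi(a_taken)."""
    g = grad_log_pi(theta, taken)
    return [t + alpha * G * gi for t, gi in zip(theta, g)]

# Assumed setup: theta starts at zero, alpha = 0.1 (inferred, not stated).
theta0 = [0.0, 0.0]
theta1 = reinforce_step(theta0, taken=0, G=2.0, alpha=0.1)  # index 0 is a_1

pi0 = softmax(theta0)
pi1 = softmax(theta1)
print("theta after step:", theta1)
print("pi(a_1) before:", pi0[0], "after:", round(pi1[0], 4))
```

Under these assumptions the updated logits are (0.1, -0.1) and pi(a_1) moves from 0.50 to roughly 0.55, matching the figures in the summary.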

This is lesson 10 of 10, the final lesson of the track. It uses everything from Phases 1-3 (MDP framework, value functions and Bellman equations, TD learning) plus the function-approximation move from lesson 9, now applied to the policy instead of the value. Together with the value-based side (lessons 4-9), it covers the two big families of modern RL. The track ends here; the recommended next step is Track 5’s rlhf-and-dpo lesson, which builds on this track’s machinery from the alignment side.

Prerequisites: the previous lesson (Function approximation and deep RL) for parameterizing a function with theta, the gradient step, and the deadly-triad framing; lesson 6 (Monte Carlo prediction) for the return G_t that REINFORCE uses and the high-variance property it inherits; lesson 7 (TD learning) for the value baseline used in actor-critic. The worked example also needs a little comfort with the chain rule and softmax derivatives, though the formulas are spelled out.

The lesson has a real but light derivation: the softmax log-likelihood gradient d/d_theta_i log pi(a_taken) = 1{i = taken} - pi(a_i), applied step by step on a 2-action case. The REINFORCE update then scales that gradient by the return and the step size. The policy-gradient theorem itself is stated and used, not proved. PPO’s clipping and SAC’s entropy term are named but not derived.
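For reference, the gradient identity and the update can be written out as follows. This is a sketch under stated assumptions rather than the lesson's exact notation: it assumes a softmax policy with one logit theta_i per action, writes a_k for the action taken, and in the 2-action instance assumes theta = (0, 0) and alpha = 0.1, which are inferred from the lesson's quoted numbers and not stated in it.

```latex
\begin{align*}
\pi_\theta(a_i) &= \frac{e^{\theta_i}}{\sum_j e^{\theta_j}},
\qquad
\log \pi_\theta(a_k) = \theta_k - \log \sum_j e^{\theta_j} \\[4pt]
\frac{\partial}{\partial \theta_i} \log \pi_\theta(a_k)
  &= \mathbf{1}\{i = k\} - \frac{e^{\theta_i}}{\sum_j e^{\theta_j}}
   = \mathbf{1}\{i = k\} - \pi_\theta(a_i) \\[4pt]
\theta &\leftarrow \theta + \alpha \, G \, \nabla_\theta \log \pi_\theta(a_k) \\[8pt]
\text{2-action case: } \theta &= (0, 0), \quad \pi_\theta = (\tfrac12, \tfrac12), \quad k = 1
  \;\Rightarrow\; \nabla_\theta \log \pi_\theta(a_1) = (\tfrac12, -\tfrac12) \\[4pt]
\theta &\leftarrow (0, 0) + 0.1 \cdot 2 \cdot (\tfrac12, -\tfrac12) = (0.1, -0.1),
\qquad
\pi_\theta(a_1) = \frac{1}{1 + e^{-0.2}} \approx 0.55
\end{align*}
```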

  • Explain why parameterizing the policy directly is needed for continuous actions and natural for stochastic policies
  • State the policy-gradient theorem at intuition level and write the REINFORCE update
  • Compute one REINFORCE step on a tiny softmax policy and observe the probability of a rewarded action increase
  • Explain how actor-critic reduces REINFORCE’s variance by replacing the raw return with an advantage estimate built from a learned value baseline
  • Connect policy gradient to the modern RL landscape, including PPO and the RLHF step used to fine-tune large language models
  • Read time: about 14 minutes (slightly above the track-average band because it is the capstone and includes the modern-RL + RLHF bridge)
  • Practice time: about 16 minutes (a self-check, a REINFORCE step on a 3-action softmax, a value-vs-policy-vs-actor-critic method-fits-the-setting drill that closes the track’s value/policy split, and flashcards)
  • Difficulty: standard (one light derivation, one arithmetic step; conceptual challenge is internalizing the value/policy split and recognizing the RLHF recipe as PPO + reward model)
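The variance-reduction objective in the list above can be illustrated exactly on a one-step problem. The sketch below is a hypothetical example, not taken from the lesson: it assumes a 2-action bandit with a uniform softmax policy and made-up rewards (2 and 1), and it uses the true expected reward as the baseline, standing in for the value a critic would learn. It enumerates both actions, so the mean and variance are exact rather than sampled.

```python
def score(pi, a):
    """Softmax score function for action a: 1{i = a} - pi(a_i)."""
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(pi)]

def estimator_stats(pi, rewards, baseline):
    """Exact mean and per-component variance of (r - baseline) * score."""
    n = len(pi)
    # One gradient estimate per action that could have been sampled.
    samples = [[(rewards[a] - baseline) * s for s in score(pi, a)]
               for a in range(n)]
    mean = [sum(pi[a] * samples[a][i] for a in range(n)) for i in range(n)]
    var = [sum(pi[a] * (samples[a][i] - mean[i]) ** 2 for a in range(n))
           for i in range(n)]
    return mean, var

# Hypothetical setup: uniform policy, illustrative rewards.
pi = [0.5, 0.5]
rewards = [2.0, 1.0]
value = sum(p * r for p, r in zip(pi, rewards))  # expected reward under pi

mean_plain, var_plain = estimator_stats(pi, rewards, baseline=0.0)
mean_base, var_base = estimator_stats(pi, rewards, baseline=value)

print("no baseline:   mean", mean_plain, "variance", var_plain)
print("with baseline: mean", mean_base, "variance", var_base)
```

In this toy case both estimators have the same expected gradient, so the baseline adds no bias, while the variance falls from 0.5625 per component to zero. Real problems rarely reach zero variance, and a learned critic only approximates the true value, but the direction of the effect is the same.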