Summary: Policy gradient and the path to modern RL
Policy gradient parameterizes the policy directly and follows the gradient of expected return. This is the policy side of RL; the previous nine lessons were the value side. Together they cover modern RL, and the closing bridge is RLHF. This summary is the scan-in-five-minutes version of the full lesson (which is the track’s capstone).
Core ideas
- Why parameterize the policy. Continuous actions (argmax intractable; sample a parameterized distribution), stochastic optima (rock-paper-scissors), and cases where the policy is simpler than Q. Value-based methods cannot represent these naturally.
- The policy-gradient theorem. grad_theta J(theta) = E_pi [ grad_theta log pi_theta(a | s) * Q^pi(s, a) ]. The log-likelihood gradient of the action taken, scaled by how good that action is. Samplable; you do not need P.
- REINFORCE. The simplest policy-gradient algorithm: estimate Q^pi(s_t, a_t) by the actual observed return G_t and update theta <- theta + eta * G_t * grad_theta log pi_theta(a_t | s_t). Unbiased but high-variance (inherits MC’s properties from lesson 6).
- Worked single step on a softmax policy. 2 actions, theta = (0, 0) -> pi uniform; sample a_1, return G = 2; grad_theta log pi(a_1) = (0.5, -0.5); eta = 0.1 -> theta = (0, 0) + 0.1 * 2 * (0.5, -0.5) = (0.1, -0.1); pi(a_1) climbs from 0.50 to about 0.55. The rewarded action became more likely, which is exactly REINFORCE’s job.
- Actor-critic reduces variance. Replace G_t with an advantage estimate A_t (commonly A = Q - V, the action’s value measured against a learned value baseline) from a TD-trained critic: theta <- theta + eta * A_t * grad_theta log pi_theta(a_t | s_t). Two networks (actor + critic) trained together. Blueprint of A2C/A3C, PPO, SAC.
- PPO is the modern workhorse: actor-critic with a clipped surrogate objective that prevents destabilizing per-update policy changes. Used in robotics, game agents, and RLHF.
- RLHF for LLMs. The LM is the policy; reward is a learned reward model trained on human-preference pairs; PPO is the algorithm. Track 5’s rlhf-and-dpo covers the alignment side; this track gave you the RL mechanics. The loop is closed.
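The update rules in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names (`softmax`, `grad_log_pi`, `policy_gradient_step`, `ppo_clipped_term`), the advantage value 0.5, and the clip range `eps = 0.2` are assumptions chosen for the example. It reproduces the worked softmax step, reuses the same step with an advantage as the weight, and evaluates the standard per-sample PPO clipped term min(r * A, clip(r, 1 - eps, 1 + eps) * A).

```python
import math


def softmax(theta):
    """Softmax probabilities for a list of action preferences."""
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(theta, action):
    """grad_theta log pi_theta(action) for a softmax policy: one_hot(action) - pi."""
    pi = softmax(theta)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(pi)]


def policy_gradient_step(theta, action, weight, eta):
    """theta <- theta + eta * weight * grad_theta log pi_theta(action).

    weight = G_t is the REINFORCE update; weight = A_t is the actor-critic form.
    """
    grad = grad_log_pi(theta, action)
    return [t + eta * weight * g for t, g in zip(theta, grad)]


def ppo_clipped_term(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio is pi_new(a | s) / pi_old(a | s); eps = 0.2 is an assumed default.
    """
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


# REINFORCE, the worked step: theta = (0, 0), action a_1 (index 0), G = 2, eta = 0.1.
theta_reinforce = policy_gradient_step([0.0, 0.0], action=0, weight=2.0, eta=0.1)
p_a1 = softmax(theta_reinforce)[0]
print(theta_reinforce, round(p_a1, 2))

# Actor-critic: same step, but weighted by a hypothetical advantage estimate A = 0.5.
theta_actor = policy_gradient_step([0.0, 0.0], action=0, weight=0.5, eta=0.1)
print(theta_actor)

# PPO: a ratio of 1.5 with positive advantage is capped at 1 + eps.
print(ppo_clipped_term(ratio=1.5, advantage=1.0))
```

With a positive advantage, the clipped term stops rewarding probability ratios above 1 + eps, which is how the objective limits the size of a single policy change. The smaller advantage weight in the actor-critic call produces a proportionally smaller step than the raw return does.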
What changes for you
You have the second of the two families that cover modern RL: value-based (lessons 4-9) and policy-based (this lesson), with actor-critic as the canonical hybrid that wraps both. The MDP framework from Phase 1 is underneath both; the Bellman equations govern values; the policy-gradient theorem governs policies. Most modern systems blend the two and use PPO under the hood. The cleanest practical takeaway is the method-fits-setting rule: discrete + manageable + deterministic-optimal -> value-based; continuous or stochastic-optimal or LM-fine-tuning -> policy-based, usually actor-critic. The cleanest conceptual takeaway is what RLHF actually is: PPO on a language-model policy with a learned reward model, sitting on the same MDP framework the rest of the track is built on. End of track.