Summary: Policy gradient and the path to modern RL

Policy gradient parameterizes the policy directly and follows the gradient of expected return. This is the policy side of RL; the previous nine lessons were the value side. Together they cover modern RL, and the closing bridge is RLHF. This summary is the five-minute version of the full lesson, which is the track’s capstone.

  • Why parameterize the policy. Continuous actions (argmax intractable; sample a parameterized distribution), stochastic optima (rock-paper-scissors), and cases where the policy is simpler than Q. Value-based methods cannot represent these naturally.
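  • The policy-gradient theorem. grad_theta J(theta) = E_pi [ grad_theta log pi_theta(a | s) * Q^pi(s, a) ], where J(theta) is the expected return. This is the log-likelihood gradient of the action taken, scaled by how good that action is. The expectation can be estimated from sampled trajectories, so you do not need the transition model P.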
  • REINFORCE. The simplest policy-gradient algorithm: estimate Q^pi(s_t, a_t) by the actual observed return G_t and update theta <- theta + eta * G_t * grad_theta log pi_theta(a_t | s_t). Unbiased but high-variance (inherits MC’s properties from lesson 6).
  • Worked one-step update on a softmax policy. 2 actions, theta = (0, 0) -> pi uniform; sample a_1, observe return G = 2; grad_theta log pi(a_1) = (1 - 0.5, -0.5) = (0.5, -0.5); with eta = 0.1 the update is eta * G * grad = 0.1 * 2 * (0.5, -0.5), so theta = (0.1, -0.1); pi(a_1) climbs from 0.50 to about 0.55. The rewarded action became more likely, which is exactly REINFORCE’s job.
  • Actor-critic reduces variance. Replace G_t with a critic-based estimate, commonly the advantage A = Q - V, where the baseline V comes from a TD-trained critic: theta <- theta + eta * A_t * grad_theta log pi_theta(a_t | s_t). Two networks (actor + critic) are trained together. This is the blueprint of A2C/A3C, PPO, and SAC.
  • PPO is the modern workhorse: actor-critic with a clipped surrogate objective that prevents destabilizing per-update policy changes. Used in robotics, game agents, and RLHF.
  • RLHF for LLMs. The LM is the policy; reward is a learned reward model trained on human-preference pairs; PPO is the algorithm. Track 5’s rlhf-and-dpo covers the alignment side; this track gave you the RL mechanics. The loop is closed.
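
The worked one-step REINFORCE update from the list above can be checked numerically. The following is a minimal sketch in plain Python, assuming a two-action softmax policy whose preferences are the parameters theta themselves; the function names (softmax, grad_log_softmax, reinforce_step) are illustrative and do not come from the lesson.

```python
import math


def softmax(theta):
    """Softmax over a list of action preferences (shifted for numerical stability)."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(theta, action):
    """Gradient of log pi_theta(action) w.r.t. theta: one_hot(action) - pi."""
    pi = softmax(theta)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(pi)]


def reinforce_step(theta, action, ret, eta):
    """One REINFORCE update: theta <- theta + eta * G * grad log pi(action)."""
    grad = grad_log_softmax(theta, action)
    return [t + eta * ret * g for t, g in zip(theta, grad)]


# Numbers from the worked example: theta = (0, 0), sampled action a_1 (index 0),
# observed return G = 2, step size eta = 0.1.
theta = [0.0, 0.0]
new_theta = reinforce_step(theta, action=0, ret=2.0, eta=0.1)

print("pi before:", softmax(theta))
print("theta after:", new_theta)
print("pi(a_1) after:", softmax(new_theta)[0])
```

Running it should show theta moving to roughly (0.1, -0.1) and pi(a_1) rising from 0.50 to about 0.55, matching the hand calculation. An actor-critic variant would pass an advantage estimate in place of ret; that part is not shown here.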

You have the second of the two families that cover modern RL: value-based (lessons 4-9) and policy-based (this lesson), with actor-critic as the canonical hybrid that wraps both. The MDP framework from Phase 1 is underneath both; the Bellman equations govern values; the policy-gradient theorem governs policies. Most modern systems blend the two and use PPO under the hood. The cleanest practical takeaway is the method-fits-setting rule: discrete + manageable + deterministic-optimal -> value-based; continuous or stochastic-optimal or LM-fine-tuning -> policy-based, usually actor-critic. The cleanest conceptual takeaway is what RLHF actually is: PPO on a language-model policy with a learned reward model, sitting on the same MDP framework the rest of the track is built on. End of track.
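
Since PPO carries so much of the weight above, a per-sample sketch of its clipped surrogate may help make "prevents destabilizing per-update policy changes" concrete. This summary does not spell the objective out; the form below is the standard one, min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r is the probability ratio pi_new(a | s) / pi_old(a | s). The function name clipped_surrogate and the value eps = 0.2 are assumptions chosen for illustration.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio     -- pi_new(a | s) / pi_old(a | s) for the sampled action
    advantage -- advantage estimate A for that action (e.g. from a critic)
    eps       -- clip range; 0.2 is an assumed illustrative value
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, policy already moved far toward it: the objective is capped,
# so there is no incentive to push the ratio beyond 1 + eps.
capped_gain = clipped_surrogate(1.5, 1.0)

# Bad action, policy already moved far away from it: capped as well.
capped_loss = clipped_surrogate(0.5, -1.0)

# Ratio inside the clip range: the plain surrogate r * A is used unchanged.
unclipped = clipped_surrogate(1.1, 1.0)

print(capped_gain, capped_loss, unclipped)
```

In a full implementation this quantity would be averaged over a batch and maximized with respect to the actor's parameters, alongside a critic loss; those parts are omitted here.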