Cheatsheet: Policy gradient and the path to modern RL
The one idea
Parameterize the policy directly, pi_theta(a | s); take gradient steps that increase the probability of actions with high return. The policy side of RL; complements the value side from lessons 4-9.
When to use policy-based methods
| Setting | Why |
|---|---|
| Continuous action spaces (R^n) | argmax over a continuum is intractable; sample a parameterized distribution |
| Stochastic optimal policy | Rock-paper-scissors style; a greedy policy read off a value function is deterministic and cannot represent the mix |
| Policy simpler than Q | Sometimes a short reactive rule is easier than a full value function |
| LM fine-tuning (RLHF) | The LM IS the parameterized stochastic policy |
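
Sketch for the continuous-action row: instead of an argmax, the policy outputs the parameters of a distribution and the action is sampled from it. The example assumes a 1-D Gaussian with mean w * s and a fixed standard deviation; the function names and the linear mean are illustrative choices, not something this cheatsheet prescribes.

```python
import math
import random

def gaussian_policy_sample(w, state, sigma, rng):
    """Sample a ~ N(w * state, sigma^2). Assumed toy policy: linear mean, fixed sigma."""
    return rng.gauss(w * state, sigma)

def log_prob(w, state, sigma, action):
    """log N(action; w * state, sigma^2)."""
    mean = w * state
    return -0.5 * ((action - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def grad_log_prob_w(w, state, sigma, action):
    """d/dw log pi(action | state) = (action - mean) / sigma^2 * state."""
    mean = w * state
    return (action - mean) / sigma ** 2 * state

rng = random.Random(0)  # seeded only so the demo is repeatable
action = gaussian_policy_sample(w=0.5, state=2.0, sigma=0.3, rng=rng)
score = grad_log_prob_w(w=0.5, state=2.0, sigma=0.3, action=action)
```

The `score` value is the grad_theta log pi_theta(a | s) term that a policy-gradient update multiplies by the return.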
The policy-gradient theorem and REINFORCE
J(theta) = E_pi[ G_0 ]    (objective)
grad_theta J(theta) = E_pi[ grad_theta log pi_theta(a | s) * Q^pi(s, a) ]    (theorem)
REINFORCE (MC policy gradient): estimate Q^pi by the observed return G_t.
theta <- theta + eta * G_t * grad_theta log pi_theta(a_t | s_t)
Worked one-step (2-action softmax)
pi(a_i) = exp(theta_i) / sum_j exp(theta_j).
theta = (0, 0) -> pi = (0.5, 0.5).
Sample a_1, observe return G = 2, eta = 0.1.
grad_theta log pi(a_1) = ( 1 - pi(a_1), -pi(a_2) ) = ( 0.5, -0.5 )
theta <- (0, 0) + 0.1 * 2 * (0.5, -0.5) = (0.1, -0.1)
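
A minimal sketch that reproduces this REINFORCE step numerically, assuming the same 2-action softmax with theta = (0, 0), sampled action a_1 (index 0 here), G = 2, and eta = 0.1. The helper names are illustrative.

```python
import math

def softmax(theta):
    """pi(a_i) = exp(theta_i) / sum_j exp(theta_j), shifted by max for stability."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(theta, action):
    """d/dtheta_j log pi(action) = 1[j == action] - pi(a_j)."""
    pi = softmax(theta)
    return [(1.0 if j == action else 0.0) - pi[j] for j in range(len(theta))]

def reinforce_step(theta, action, ret, eta):
    """theta <- theta + eta * G * grad_theta log pi(action)."""
    grad = grad_log_softmax(theta, action)
    return [t + eta * ret * g for t, g in zip(theta, grad)]

theta = reinforce_step([0.0, 0.0], action=0, ret=2.0, eta=0.1)
pi = softmax(theta)
print([round(t, 3) for t in theta])  # [0.1, -0.1]
print([round(p, 3) for p in pi])     # [0.55, 0.45]
```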
After update:
pi(a_1) = exp(0.1) / (exp(0.1) + exp(-0.1)) ~ 1.105 / 2.010 ~ 0.55
pi(a_2) ~ 0.45
=> Rewarded action's probability went UP (0.50 -> 0.55).
Actor-critic (variance fix)
Replace G_t with a signal built from a LEARNED VALUE function, most commonly the ADVANTAGE (V(s) acts as the baseline):
A(s, a) = Q(s, a) - V(s)
theta <- theta + eta * A_t * grad_theta log pi_theta(a_t | s_t)
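
A minimal sketch of one advantage-weighted step, assuming a tabular softmax policy (the actor) and a tabular state-value table V (the critic), with the one-step TD error r + gamma * V(s') - V(s) standing in for A_t. The step sizes, gamma, and all names are illustrative assumptions.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      eta_actor=0.1, eta_critic=0.5, gamma=0.99):
    """Update theta[s] (actor) and V[s] (critic) in place; return the TD error."""
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]              # TD error, used as the advantage estimate A_t
    V[s] += eta_critic * delta         # critic: TD(0) update
    pi = softmax(theta[s])
    for j in range(len(theta[s])):
        grad = (1.0 if j == a else 0.0) - pi[j]      # d/dtheta_j log pi(a | s)
        theta[s][j] += eta_actor * delta * grad      # actor: policy-gradient update
    return delta

theta = {0: [0.0, 0.0]}   # action preferences for state 0
V = {0: 0.0, 1: 0.0}      # critic's state values
delta = actor_critic_step(theta, V, s=0, a=0, r=1.0, s_next=1, done=True)
```

Repeating the same transition gives a smaller TD error each time, because the critic's V(s) moves toward the observed reward and the actor is credited only for the part the critic did not already expect.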
ACTOR = policy network pi_theta (trained with policy-gradient update)
CRITIC = value network V_phi or Q_phi (trained with TD)
Two networks trained together. Blueprint of A2C/A3C, PPO, SAC.
The modern landscape
| Algorithm | What it is |
|---|---|
| REINFORCE | MC policy gradient (high variance) |
| A2C / A3C | Actor-critic with advantage; sync / async parallel |
| PPO | Actor-critic with CLIPPED surrogate objective (modern workhorse) |
| SAC | Continuous-action actor-critic with entropy regularization |
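
A minimal sketch of the clipped surrogate named in the PPO row, for a single (state, action) sample. Here `ratio` stands for pi_new(a | s) / pi_old(a | s); eps = 0.2 is a commonly used value assumed for illustration, and the function name is hypothetical.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action (A > 0): no extra credit once the ratio exceeds 1 + eps.
gain_capped = ppo_clip_objective(ratio=1.5, advantage=1.0)    # about 1.2, not 1.5
# Bad action (A < 0): no extra credit once the ratio drops below 1 - eps.
loss_capped = ppo_clip_objective(ratio=0.5, advantage=-1.0)   # about -0.8, not -0.5
# Moves in the wrong direction are not clipped, so they are still fully penalized.
penalized = ppo_clip_objective(ratio=1.5, advantage=-1.0)     # -1.5
```

Once the ratio leaves the [1 - eps, 1 + eps] band in the direction the advantage favors, the objective is flat, so the gradient gives no reason to push the policy further on that sample.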
RLHF for LLMs (recipe)
POLICY = the language model itself (parameterized, stochastic)
STATE = conversation history
ACTION = next token (or full response)
REWARD = LEARNED REWARD MODEL trained on human-ranked response pairs
ALGORITHM = PPO
=> LM is fine-tuned to maximize the reward model's score.
T17 = the RL machinery RLHF assumes; T5 (rlhf-and-dpo) = the alignment side.
Pitfalls to dodge
- Conflating policy gradient with random search (the gradient is precisely defined).
- Underestimating REINFORCE’s variance (modern methods use a critic/baseline).
- Confusing actor with critic (actor = policy, trained with policy-gradient; critic = value, trained with TD).
- Mistaking PPO’s clipping for optional polish (it keeps each update from moving the policy too far from the one that collected the data, which is what keeps training stable).
- Treating RLHF as a different algorithm (RLHF = PPO + learned reward model).
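
To make the variance pitfall concrete, a small sketch that computes the exact mean and variance of the one-sample REINFORCE gradient estimate for a 2-action softmax bandit, with and without a baseline. The returns (10 and 12) are assumed toy numbers, and only the theta_0 component of the gradient is tracked.

```python
def grad_estimate_stats(pi, returns, baseline):
    """Exact mean and variance of (G - b) * d/dtheta_0 log pi(a) under a ~ pi."""
    samples = []
    for a, (p, g) in enumerate(zip(pi, returns)):
        score = (1.0 if a == 0 else 0.0) - pi[0]   # d/dtheta_0 log pi(a)
        samples.append((p, (g - baseline) * score))
    mean = sum(p * x for p, x in samples)
    var = sum(p * (x - mean) ** 2 for p, x in samples)
    return mean, var

pi = [0.5, 0.5]
returns = [10.0, 12.0]                             # assumed toy returns
value = sum(p * g for p, g in zip(pi, returns))    # V = E_pi[G] = 11

print(grad_estimate_stats(pi, returns, baseline=0.0))    # (-0.5, 30.25)
print(grad_estimate_stats(pi, returns, baseline=value))  # (-0.5, 0.0)
```

The expected gradient is the same in both cases; subtracting the baseline changes only the variance, which is why a critic can be added without biasing the update.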
Words to use precisely
- Policy gradient: gradient of expected return w.r.t. policy parameters.
- REINFORCE: MC policy gradient using observed returns.
- Actor-critic: policy network + value network trained together.
- Advantage A(s, a): Q(s, a) - V(s); how much better than average this action is.
- PPO: actor-critic with a clipped objective; the modern workhorse.
- RLHF: PPO on an LM with a learned reward model from human preferences.
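
A short numeric illustration of the advantage definition, assuming a single state with two actions, made-up Q-values, and V(s) taken as the policy-weighted average of Q. The function name is illustrative.

```python
def advantages(pi, q_values):
    """Return V(s) = sum_a pi(a) * Q(s, a) and A(s, a) = Q(s, a) - V(s) per action."""
    v = sum(p * q for p, q in zip(pi, q_values))
    return v, [q - v for q in q_values]

pi = [0.25, 0.75]
q_values = [4.0, 0.0]                    # assumed toy Q-values
v, adv = advantages(pi, q_values)
print(v, adv)                            # 1.0 [3.0, -1.0]
print(sum(p * a for p, a in zip(pi, adv)))  # 0.0
```

The advantages average to zero under the policy, which is the sense in which A(s, a) measures "better than average".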