Cheatsheet: Policy gradient and the path to modern RL

Parameterize the policy directly, pi_theta(a | s); take gradient steps that increase the probability of actions with high return. The policy side of RL; complements the value side from lessons 4-9.

Setting | Why
Continuous action spaces (R^n) | argmax over a continuum is intractable; sample a parameterized distribution
Stochastic optimal policy | Rock-paper-scissors style; value-based greedy cannot represent it
Policy simpler than Q | Sometimes a short reactive rule is easier than a full value function
LM fine-tuning (RLHF) | The LM IS the parameterized stochastic policy
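To make the continuous-action row concrete, here is a minimal sketch of a one-dimensional Gaussian policy. The Gaussian form, the fixed standard deviation `sigma`, and the function names are illustrative assumptions, not something the cheatsheet prescribes. It shows that sampling an action and differentiating its log-probability are both cheap, with no argmax needed.

```python
import math
import random

def sample_action(mu, sigma, rng):
    """Draw a continuous action from the Gaussian policy N(mu, sigma^2)."""
    return rng.gauss(mu, sigma)

def log_prob(a, mu, sigma):
    """Log-density of action a under N(mu, sigma^2)."""
    z = (a - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def grad_log_prob_mu(a, mu, sigma):
    """d/d mu of log_prob: (a - mu) / sigma^2 (the score w.r.t. the mean)."""
    return (a - mu) / (sigma * sigma)

# Hypothetical parameters, chosen only for illustration.
rng = random.Random(0)
mu, sigma = 0.5, 2.0
a = sample_action(mu, sigma, rng)
print("sampled action:", a)
print("score w.r.t. mu:", grad_log_prob_mu(a, mu, sigma))
```

The score `grad_log_prob_mu` plays the same role for the mean parameter that `grad_theta log pi_theta(a | s)` plays in the formulas that follow.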
J(theta) = E_pi[ G_0 ] (objective)
grad_theta J(theta) = E_pi[ grad_theta log pi_theta(a | s) * Q^pi(s, a) ] (theorem)
REINFORCE (MC policy gradient): estimate Q^pi by the observed return G_t.
theta <- theta + eta * G_t * grad_theta log pi_theta(a_t | s_t)
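The observed return G_t used by REINFORCE can be computed with one backward pass over an episode's rewards. The sketch below assumes `rewards[t]` is the reward received after action a_t and adds a discount factor `gamma`; the cheatsheet does not specify a discount, so `gamma=1.0` (undiscounted) is the default here.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return [G_0, ..., G_{T-1}] where G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each G_t reuses the already-computed G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(returns_to_go([1.0, 0.0, 2.0]))             # [3.0, 2.0, 2.0]
print(returns_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Each G_t then multiplies `grad_theta log pi_theta(a_t | s_t)` for the corresponding step of the episode.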
Worked example, softmax policy over two actions:
pi(a_i) = exp(theta_i) / sum_j exp(theta_j). theta = (0, 0) -> pi = (0.5, 0.5).
Sample a_1, observe return G = 2, eta = 0.1.
grad_theta log pi(a_1) = ( 1 - pi(a_1), -pi(a_2) ) = ( 0.5, -0.5 )
theta <- (0, 0) + 0.1 * 2 * (0.5, -0.5) = (0.1, -0.1)
After update:
pi(a_1) = exp(0.1) / (exp(0.1) + exp(-0.1)) ~ 1.105 / 2.010 ~ 0.55
pi(a_2) = 1 - pi(a_1) ~ 0.45
=> Rewarded action's probability went UP (0.50 -> 0.55).
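The arithmetic above can be checked with a few lines of plain Python. This is a sketch of the same two-action softmax example; the function names are illustrative, and actions are zero-indexed, so a_1 is index 0.

```python
import math

def softmax(theta):
    """Softmax over a list of logits, shifted by the max for stability."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(theta, action):
    """Gradient of log pi(action) w.r.t. theta: one_hot(action) - pi."""
    pi = softmax(theta)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(pi)]

def reinforce_step(theta, action, G, eta):
    """One REINFORCE update: theta <- theta + eta * G * grad log pi(action)."""
    grad = grad_log_softmax(theta, action)
    return [t + eta * G * g for t, g in zip(theta, grad)]

theta = [0.0, 0.0]
new_theta = reinforce_step(theta, action=0, G=2.0, eta=0.1)
new_pi = softmax(new_theta)
print(new_theta)  # approximately [0.1, -0.1]
print(new_pi)     # approximately [0.55, 0.45]
```

Passing a negative G instead would push pi(a_1) below 0.5, which is the same update rule discouraging an action that led to a poor return.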
Replace G_t with an estimate that subtracts a LEARNED-VALUE baseline V(s) (commonly the ADVANTAGE):
A(s, a) = Q(s, a) - V(s)
theta <- theta + eta * A_t * grad_theta log pi_theta(a_t | s_t)
ACTOR = policy network pi_theta (trained with policy-gradient update)
CRITIC = value network V_phi or Q_phi (trained with TD)
Two networks trained together. Blueprint of A2C/A3C, PPO, SAC.
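A minimal tabular sketch of one actor-critic step follows. It makes several assumptions that go beyond the text: lookup tables stand in for the two networks, the one-step TD error serves as the advantage estimate, and the step sizes, `gamma`, and state names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      eta_actor=0.1, eta_critic=0.1, gamma=0.99):
    """Update critic V and actor theta in place from one transition.

    theta: dict mapping state -> list of action logits (the actor).
    V:     dict mapping state -> value estimate (the critic).
    Returns the TD error, used here as the advantage estimate A_t.
    """
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]              # TD error ~ advantage estimate
    pi = softmax(theta[s])             # policy before the update
    V[s] += eta_critic * delta         # critic: TD(0) update
    for i in range(len(theta[s])):     # actor: policy-gradient update
        grad = (1.0 if i == a else 0.0) - pi[i]
        theta[s][i] += eta_actor * delta * grad
    return delta

theta = {"s0": [0.0, 0.0], "s1": [0.0, 0.0]}
V = {"s0": 0.0, "s1": 0.0}
delta = actor_critic_step(theta, V, s="s0", a=0, r=1.0, s_next="s1", done=True)
print(delta, V["s0"], theta["s0"])
```

The critic moves toward the TD target while the actor raises the logit of the action whose outcome beat the critic's prediction.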
Algorithm | What it is
REINFORCE | MC policy gradient (high variance)
A2C / A3C | Actor-critic with advantage; sync / async parallel
PPO | Actor-critic with CLIPPED surrogate objective (modern workhorse)
SAC | Continuous-action actor-critic with entropy regularization
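The clipped surrogate in the PPO row can be sketched for a single sample as below. Here `ratio` stands for pi_new(a | s) / pi_old(a | s), and `eps=0.2` is a commonly used clipping range assumed for illustration; neither name comes from the cheatsheet.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action, ratio already large: the gain is capped at (1 + eps) * A.
print(ppo_clip_objective(1.5, 1.0))   # 1.2
# Bad action, ratio already small: the objective is held at (1 - eps) * A.
print(ppo_clip_objective(0.5, -1.0))  # -0.8
# Bad action made MORE likely: no clipping, the full penalty applies.
print(ppo_clip_objective(1.5, -1.0))  # -1.5
```

In the first two cases the objective is flat in `ratio`, so the gradient gives no incentive to move the policy further from the old one on that sample.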
POLICY = the language model itself (parameterized, stochastic)
STATE = conversation history
ACTION = next token (or full response)
REWARD = LEARNED REWARD MODEL trained on human-ranked response pairs
ALGORITHM = PPO
=> LM is fine-tuned to maximize the reward model's score.
T17 = the RL machinery RLHF assumes; T5 (rlhf-and-dpo) = the alignment side.
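One common way to train a reward model on ranked response pairs is a pairwise (Bradley-Terry style) loss. The cheatsheet does not say which loss is used, so the sketch below is an assumption; `r_chosen` and `r_rejected` are hypothetical scalar scores that the reward model assigns to the preferred and dispreferred response.

```python
import math

def pairwise_preference_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), computed in a stable form."""
    diff = r_chosen - r_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# The loss is small when the preferred response scores higher,
# and large when the ranking is inverted.
print(pairwise_preference_loss(2.0, 0.0))
print(pairwise_preference_loss(0.0, 2.0))
```

Minimizing this loss pushes the reward model to score human-preferred responses higher; the policy-gradient step then uses that score as the REWARD.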
Common pitfalls:
  • Conflating policy gradient with random search (the gradient is precisely defined).
  • Underestimating REINFORCE’s variance (modern methods use a critic/baseline).
  • Confusing actor with critic (actor = policy, trained with policy-gradient; critic = value, trained with TD).
  • Mistaking PPO’s clipping for optional polish (it limits how far each update can move the policy, which is what keeps policy-gradient training stable).
  • Treating RLHF as a different algorithm (RLHF = PPO + learned reward model).
Key terms:
  • Policy gradient: gradient of expected return w.r.t. policy parameters.
  • REINFORCE: MC policy gradient using observed returns.
  • Actor-critic: policy network + value network trained together.
  • Advantage A(s, a): Q(s, a) - V(s); how much better than average this action is.
  • PPO: actor-critic with a clipped objective; the modern workhorse.
  • RLHF: PPO on an LM with a learned reward model from human preferences.