Policy gradient and modern RL: cheatsheet

The one idea

Parameterize the policy directly as pi_theta(a | s) and take gradient steps that raise the probability of actions that led to high return. This is the policy side of RL; it complements the value side from lessons 4-9.

When to use policy-based methods

Setting	Why
Continuous action spaces (R^n)	argmax over a continuum is intractable; sample a parameterized distribution
Stochastic optimal policy	Rock-paper-scissors style; value-based greedy cannot represent it
Policy simpler than Q	Sometimes a short reactive rule is easier than a full value function
LM fine-tuning (RLHF)	The LM IS the parameterized stochastic policy
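
For the continuous-action row, a minimal sketch of sampling from a parameterized distribution instead of taking an argmax. The diagonal Gaussian, the linear mean, and the names gaussian_policy_sample, W, and log_std are illustrative assumptions, not something the lesson prescribes.

```python
import numpy as np

def gaussian_policy_sample(W, log_std, state, rng):
    """Sample a ~ N(W @ state, diag(exp(log_std))^2); return (action, log_prob)."""
    mean = W @ state                      # assumed linear mean, for illustration only
    std = np.exp(log_std)
    action = mean + std * rng.standard_normal(mean.shape)
    # Log-density of a diagonal Gaussian, summed over action dimensions.
    log_prob = -0.5 * np.sum(
        ((action - mean) / std) ** 2 + 2.0 * log_std + np.log(2.0 * np.pi)
    )
    return action, log_prob

rng = np.random.default_rng(0)
W = np.zeros((2, 3))                      # hypothetical 3-dim state, 2-dim action
log_std = np.zeros(2)                     # std = 1 in each action dimension
state = np.array([1.0, 0.5, -0.2])
action, log_prob = gaussian_policy_sample(W, log_std, state, rng)
```

The returned log_prob is the quantity whose gradient with respect to the parameters appears in the policy-gradient update.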

The policy-gradient theorem and REINFORCE

J(theta)            = E_pi[ G_0 ]                            (objective)
grad_theta J(theta) = E_pi[ grad_theta log pi_theta(a | s) * Q^pi(s, a) ]   (theorem)

REINFORCE (MC policy gradient): estimate Q^pi by the observed return G_t.
  theta <- theta + eta * G_t * grad_theta log pi_theta(a_t | s_t)
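
A minimal sketch of the REINFORCE update on a one-state problem (a two-armed bandit), where each episode is a single action and G_t is the observed reward. The bandit, the noise level, and the names softmax and reinforce_bandit are assumptions made for illustration.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))     # subtract max for numerical stability
    return z / z.sum()

def reinforce_bandit(mean_rewards, eta=0.1, steps=2000, seed=0):
    """Run REINFORCE with a softmax policy; return the final action probabilities."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(mean_rewards))
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(len(theta), p=pi)                     # sample a ~ pi_theta
        G = mean_rewards[a] + 0.5 * rng.standard_normal()    # noisy observed return
        grad_log_pi = -pi                                    # one_hot(a) - pi ...
        grad_log_pi[a] += 1.0                                # ... for a softmax policy
        theta = theta + eta * G * grad_log_pi                # REINFORCE step
    return softmax(theta)

pi_final = reinforce_bandit(np.array([1.0, 0.0]))
```

With these assumed rewards, the probability of the first arm should end well above one half, since that arm has the higher mean return.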

Worked one-step (2-action softmax)

pi(a_i) = exp(theta_i) / sum_j exp(theta_j).   theta = (0, 0)  -> pi = (0.5, 0.5).
Sample a_1, observe return G = 2, eta = 0.1.

grad_theta log pi(a_1) = ( 1 - pi(a_1), -pi(a_2) ) = ( 0.5, -0.5 )
theta <- (0, 0) + 0.1 * 2 * (0.5, -0.5) = (0.1, -0.1)

After update:
  pi(a_1) = exp(0.1) / (exp(0.1) + exp(-0.1)) ~ 1.105 / 2.010 ~ 0.55
  pi(a_2) = 1 - pi(a_1) ~ 0.45

=> Rewarded action's probability went UP (0.50 -> 0.55).
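
The arithmetic above can be checked in a few lines. This is only a verification sketch; the variable names are arbitrary, and index 0 stands for a_1.

```python
import numpy as np

theta = np.array([0.0, 0.0])
eta, G, a = 0.1, 2.0, 0                   # a = 0 corresponds to a_1 in the text

pi = np.exp(theta) / np.exp(theta).sum()  # (0.5, 0.5)
grad_log_pi = np.eye(2)[a] - pi           # (1 - pi(a_1), -pi(a_2))
theta_new = theta + eta * G * grad_log_pi
pi_new = np.exp(theta_new) / np.exp(theta_new).sum()

# theta_new is (0.1, -0.1); pi_new rounds to (0.55, 0.45)
print(theta_new, np.round(pi_new, 2))
```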

Actor-critic (variance fix)

Replace G_t with a lower-variance signal built from a LEARNED value function (commonly the ADVANTAGE):
  A(s, a) = Q(s, a) - V(s)
  theta <- theta + eta * A_t * grad_theta log pi_theta(a_t | s_t)

ACTOR  = policy network pi_theta (trained with policy-gradient update)
CRITIC = value network V_phi or Q_phi (trained with TD)
Two networks trained together. Blueprint of A2C/A3C, PPO, SAC.
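
A minimal tabular sketch of one actor-critic update from a single transition. It assumes the one-step TD error as the advantage estimate (a common choice, since its expectation is Q(s, a) - V(s) when V is accurate), and it uses lookup tables in place of networks. The function name and the hyperparameters are illustrative.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      eta_actor=0.1, eta_critic=0.1, gamma=0.99):
    """Update actor and critic in place from one transition (s, a, r, s_next).

    theta: array [n_states, n_actions] of softmax logits (the actor).
    V:     array [n_states] of state values (the critic).
    """
    # TD error, used here as a one-sample advantage estimate (assumption).
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]

    # Critic: TD(0) step toward the bootstrapped target.
    V[s] += eta_critic * td_error

    # Actor: policy-gradient step with the advantage estimate in place of G_t.
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0
    theta[s] += eta_actor * td_error * grad_log_pi
    return td_error

theta = np.zeros((2, 2))
V = np.zeros(2)
delta = actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1, done=False)
```

After this single step the critic's value for state 0 rises, and the actor's logit for the rewarded action rises relative to the other.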

The modern landscape

Algorithm	What it is
REINFORCE	MC policy gradient (high variance)
A2C / A3C	Actor-critic with advantage; sync / async parallel
PPO	Actor-critic with CLIPPED surrogate objective (modern workhorse)
SAC	Continuous-action actor-critic with entropy regularization
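
For the PPO row, a sketch of the clipped surrogate objective computed from per-sample log-probabilities and advantage estimates. The function name is hypothetical, and clip_eps = 0.2 is a commonly used value rather than one given in this lesson.

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective averaged over a batch (to be maximized)."""
    ratio = np.exp(log_prob_new - log_prob_old)      # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Elementwise minimum: no extra credit for pushing the ratio past the clip range.
    return np.mean(np.minimum(unclipped, clipped))

# A ratio of 1.5 with positive advantage is credited only up to 1 + clip_eps.
value = ppo_clip_objective(np.log([1.5]), np.log([1.0]), np.array([1.0]))
```

Once the ratio leaves the clip range in the direction the advantage favors, the objective goes flat, so the gradient stops pushing the policy further from the one that collected the data.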

RLHF for LLMs (recipe)

POLICY = the language model itself (parameterized, stochastic)
STATE  = conversation history
ACTION = next token (or full response)
REWARD = LEARNED REWARD MODEL trained on human-ranked response pairs
ALGORITHM = PPO

=> LM is fine-tuned to maximize the reward model's score.
T17 = the RL machinery RLHF assumes; T5 (rlhf-and-dpo) = the alignment side.
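
A sketch of how a reward model can be trained on ranked pairs. It assumes the pairwise logistic (Bradley-Terry style) loss, which is a common choice but is not specified in this lesson; the function name and the scalar scores are hypothetical.

```python
import numpy as np

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Mean of -log sigmoid(r_chosen - r_rejected) over a batch of ranked pairs."""
    margin = np.asarray(reward_chosen) - np.asarray(reward_rejected)
    # -log(sigmoid(m)) = log(1 + exp(-m)), computed stably with logaddexp.
    return np.mean(np.logaddexp(0.0, -margin))

# Hypothetical reward-model scores for two ranked pairs.
loss = pairwise_preference_loss([1.2, 0.3], [0.4, 0.9])
```

The loss falls as the reward model scores the human-preferred response higher than the rejected one.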

Pitfalls to dodge

Conflating policy gradient with random search (the gradient is precisely defined).
Underestimating REINFORCE’s variance (modern methods use a critic/baseline).
Confusing actor with critic (actor = policy, trained with policy-gradient; critic = value, trained with TD).
Mistaking PPO’s clipping for optional polish (it is what keeps each update from moving the policy too far from the one that collected the data).
Treating RLHF as a different algorithm (in the recipe above, RLHF = PPO + learned reward model).

Words to use precisely

Policy gradient: gradient of expected return w.r.t. policy parameters.
REINFORCE: MC policy gradient using observed returns.
Actor-critic: policy network + value network trained together.
Advantage A(s, a): Q(s, a) - V(s); how much better than average this action is.
PPO: actor-critic with a clipped objective; the modern workhorse.
RLHF: PPO on an LM with a learned reward model from human preferences.