Practice: Policy gradient and the path to modern RL

The first practice exercise drills the REINFORCE update: compute the softmax log-likelihood gradient, scale by the return, take a step, see the rewarded action’s probability rise. The second is the track-closing drill: given a setting, name whether a value-based, policy-based, or actor-critic method fits, with reasoning. Keep a scratchpad.

Self-check

Six short questions. Answer each in your head before opening the collapsible.

1. Why parameterize the policy directly rather than always derive it from Q?

Show answer

(1) Continuous actions: argmax over a continuum is intractable, while sampling from a parameterized distribution (e.g. Gaussian) is easy. (2) Stochastic optima: some problems (rock-paper-scissors) have stochastic optimal policies that value-based greedy methods cannot represent. (3) Sometimes the policy is simpler than the value function, so learning pi directly can be more sample-efficient.
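To make point (1) concrete, here is a minimal sketch of a Gaussian policy over a continuous action vector. The function names, the fixed shared standard deviation sigma, and the 7-dimensional example are assumptions chosen for illustration, not part of the lesson. Sampling is a single call per dimension, and the log-likelihood gradient with respect to the mean has the closed form (a - mu) / sigma^2, so no argmax over the action space is ever needed.

```python
import random

def sample_gaussian_action(mu, sigma, rng=random):
    """Sample one continuous action from N(mu, sigma^2 I).

    mu is the policy's mean vector (in practice the output of a network);
    sigma is a fixed, shared standard deviation (an assumption of this sketch).
    """
    return [rng.gauss(m, sigma) for m in mu]

def grad_log_prob_wrt_mean(action, mu, sigma):
    """Gradient of log N(action; mu, sigma^2 I) with respect to mu."""
    return [(a - m) / sigma ** 2 for a, m in zip(action, mu)]

# Hypothetical 7-dimensional action, e.g. one torque per joint.
mu = [0.0] * 7
action = sample_gaussian_action(mu, sigma=0.5, rng=random.Random(0))
grad = grad_log_prob_wrt_mean(action, mu, sigma=0.5)
```

The gradient points from the mean toward the sampled action, so a positive return pulls the mean toward actions that worked.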

2. State the policy-gradient theorem in words and write the REINFORCE update.

Show answer

The gradient of expected return is the expectation over the policy of the log-likelihood gradient of the action taken, scaled by how good that action is (its Q-value under the current policy). REINFORCE estimates Q by the actual return: theta <- theta + eta * G_t * grad_theta log pi_theta(a_t | s_t).
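The update above can be sketched for a tabular softmax policy as follows. This is a minimal illustration under stated assumptions: the dictionary layout of theta, the (state, action, reward) episode format, and the discount factor gamma (defaulting to 1.0) are choices made for this sketch rather than anything the lesson prescribes.

```python
import math

def softmax(prefs):
    """Turn action preferences into probabilities (max-shifted for stability)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_episode(theta, episode, eta=0.1, gamma=1.0):
    """Apply the REINFORCE update once per step of a finished episode.

    theta:   dict mapping state -> list of action preferences (tabular softmax)
    episode: list of (state, action, reward) tuples in time order
    """
    # Compute returns backward: G_t = r_t + gamma * G_{t+1}.
    G = 0.0
    returns = []
    for _, _, reward in reversed(episode):
        G = reward + gamma * G
        returns.append(G)
    returns.reverse()

    # theta <- theta + eta * G_t * grad log pi(a_t | s_t), one step at a time.
    for (state, action, _), G_t in zip(episode, returns):
        pi = softmax(theta[state])
        for i, p in enumerate(pi):
            grad_i = (1.0 if i == action else 0.0) - p  # one-hot minus pi
            theta[state][i] += eta * G_t * grad_i
    return theta

# One-step episode: action index 1 is taken in state "s" and earns return 3.
theta = reinforce_episode({"s": [0.0, 0.0, 0.0]}, [("s", 1, 3.0)])
```

With a single state, a single step, and return 3, this reproduces the hand calculation in the 3-action exercise later on this page.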

3. Why is REINFORCE described as “unbiased but high-variance”?

Show answer

Because it uses the Monte Carlo return G_t as the Q estimate, inheriting MC’s unbiased-but-high-variance trade-off from lesson 6. The gradient estimate is correct in expectation, but single-episode estimates are noisy, so learning is slow in practice. Actor-critic replaces G_t with a learned critic estimate (lower variance, some bias), which is why most modern policy-gradient methods are actor-critic.
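The variance claim can be checked exactly on a toy problem. The sketch below assumes a single-state, 3-action problem with a uniform softmax policy and hypothetical fixed returns of 10, 11, and 12; it enumerates the three possible samples rather than simulating them. It compares the raw estimator G * grad log pi with the same estimator after subtracting the exact expected return as a baseline. An exact baseline stands in for a learned critic here, so this isolates the variance effect and says nothing about the bias a learned critic introduces.

```python
def grad_log_softmax(pi, action):
    """Gradient of log pi(action) w.r.t. softmax preferences: one-hot minus pi."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(pi)]

def estimator_mean_and_variance(pi, returns, baseline=0.0):
    """Exact mean and total variance of (G - baseline) * grad log pi(a), a ~ pi."""
    n = len(pi)
    samples = [
        [(returns[a] - baseline) * g for g in grad_log_softmax(pi, a)]
        for a in range(n)
    ]
    mean = [sum(pi[a] * samples[a][i] for a in range(n)) for i in range(n)]
    second_moment = sum(pi[a] * sum(x * x for x in samples[a]) for a in range(n))
    variance = second_moment - sum(m * m for m in mean)  # summed over components
    return mean, variance

pi = [1 / 3, 1 / 3, 1 / 3]
returns = [10.0, 11.0, 12.0]                     # hypothetical per-action returns
value = sum(p * g for p, g in zip(pi, returns))  # expected return under pi

mean_raw, var_raw = estimator_mean_and_variance(pi, returns)
mean_base, var_base = estimator_mean_and_variance(pi, returns, baseline=value)
```

Both estimators have the same mean, (-1/3, 0, 1/3), but the total variance falls from about 80.9 to about 0.22 once the baseline is subtracted. The large shared offset in the returns carries no information about which action is better, yet it dominates the raw estimator's noise.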

4. How does actor-critic reduce REINFORCE’s variance?

Show answer

Replace the raw return G_t in the update with a lower-variance estimate of Q^pi from a learned critic (a value network trained with TD). A common form uses the advantage A(s, a) = Q(s, a) - V(s) in place of the return: theta <- theta + eta * A_t * grad_theta log pi_theta(a_t | s_t). The actor (policy) and critic (value) are trained together.
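A minimal tabular sketch of one actor-critic step follows. It assumes the one-step TD error r + gamma * V(s') - V(s) as the sample estimate of the advantage, which is one common choice among several; the function name, the dictionary layout, the separate step sizes, and gamma are all assumptions of this sketch.

```python
import math

def softmax(prefs):
    """Turn action preferences into probabilities (max-shifted for stability)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      eta_actor=0.1, eta_critic=0.1, gamma=1.0):
    """One transition's worth of learning for both actor and critic.

    theta: dict state -> list of action preferences (the actor)
    V:     dict state -> estimated state value (the critic)
    """
    # The critic's target bootstraps from V(s_next); terminal states are worth 0.
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]        # sample estimate of the advantage A(s, a)

    V[s] += eta_critic * td_error   # critic: TD(0) update

    pi = softmax(theta[s])          # actor: policy-gradient step scaled by td_error
    for i, p in enumerate(pi):
        grad_i = (1.0 if i == a else 0.0) - p
        theta[s][i] += eta_actor * td_error * grad_i
    return td_error

theta = {"s0": [0.0, 0.0, 0.0]}
V = {"s0": 0.0}
delta = actor_critic_step(theta, V, "s0", 1, 3.0, None, done=True)
```

On this first step the critic knows nothing, so the TD error equals the raw reward and the actor update matches REINFORCE. As V(s0) moves toward the expected reward, the same reward produces a smaller TD error and a smaller, less noisy actor step.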

5. What is PPO, and where is it used in practice?

Show answer

Proximal Policy Optimization: actor-critic with a clipped surrogate objective that prevents the policy from changing too much per update (preventing destabilizing jumps). It is the modern workhorse of policy-gradient RL: used in robotics, game agents (Dota, StarCraft variants), and crucially in RLHF for large language models.
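The clipping idea can be sketched for a single sample. In the form below, ratio is pi_new(a | s) / pi_old(a | s) and the objective is min(ratio * A, clip(ratio, 1 - epsilon, 1 + epsilon) * A). The function names are hypothetical, and epsilon = 0.2 is assumed here as a commonly used value rather than anything this lesson specifies.

```python
import math

def probability_ratio(logp_new, logp_old):
    """pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(logp_new - logp_old)

def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate for one sample (to be maximized).

    Taking the min of the unclipped and clipped terms removes any incentive
    to push the ratio outside [1 - epsilon, 1 + epsilon] in the direction
    the advantage favors.
    """
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action, already made 50% more likely: the gain is capped at 1.2 * A.
capped = ppo_clipped_objective(1.5, 2.0)
# Bad action, made 50% more likely: the full penalty still applies.
penalized = ppo_clipped_objective(1.5, -1.0)
```

The asymmetry is the point: the objective stops rewarding further movement once the policy has shifted far enough in a helpful direction, but it never hides a change that made things worse.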

6. Sketch the RLHF recipe for a large language model in policy-gradient terms.

Show answer

The policy is the LM (parameterized, stochastic, samples next tokens). The state is the conversation so far; the action is the next token (or full response). The reward is provided by a learned reward model trained on pairs of human-ranked responses to predict human preference. The algorithm is PPO, which uses the reward model’s score as the signal in the policy-gradient update. Track 5’s rlhf-and-dpo covers the alignment side; this track gave you the RL mechanics.

Try it yourself: a REINFORCE step on a 3-action softmax

A softmax policy over three actions {a_1, a_2, a_3} with parameters theta = (theta_1, theta_2, theta_3):

pi_theta(a_i)  =  exp(theta_i) / sum_j exp(theta_j)

Initialize theta = (0, 0, 0) so the policy is uniform: pi(a_1) = pi(a_2) = pi(a_3) = 1/3.

Suppose the agent samples action a_2 and observes return G = 3. Use step size eta = 0.1.

1. Compute grad_theta log pi_theta(a_2). Hint: d/d_theta_i log pi(a_2) is
   (1 - pi(a_2)) if i = 2, and -pi(a_i) otherwise.
2. Apply the REINFORCE update to get the new theta.
3. Compute the new pi(a_2) (you can leave it in terms of exp(0.2) and
   exp(-0.1), or approximate numerically: exp(0.2) approx 1.221,
   exp(-0.1) approx 0.905).
4. Did pi(a_2) increase, and is that the right direction?

Show answer

1. grad_theta log pi_theta(a_2):
     d/d_theta_1 = -pi(a_1) = -1/3
     d/d_theta_2 = 1 - pi(a_2) = 1 - 1/3 = 2/3
     d/d_theta_3 = -pi(a_3) = -1/3
   grad_theta log pi(a_2) = (-1/3, 2/3, -1/3)

2. theta <- theta + eta * G * grad
   theta <- (0, 0, 0) + 0.1 * 3 * (-1/3, 2/3, -1/3)
         = (0, 0, 0) + (-0.1, 0.2, -0.1)
         = (-0.1, 0.2, -0.1)

3. With theta = (-0.1, 0.2, -0.1):
     exp(-0.1) approx 0.905
     exp(0.2)  approx 1.221
     exp(-0.1) approx 0.905
     Z = 0.905 + 1.221 + 0.905 = 3.031
     pi(a_2) = 1.221 / 3.031 approx 0.403   (was 1/3 approx 0.333)
     pi(a_1) = 0.905 / 3.031 approx 0.299
     pi(a_3) = 0.905 / 3.031 approx 0.299

4. YES: pi(a_2) climbed from 0.333 to about 0.403; the other two correspondingly
   dropped (each from 0.333 to about 0.299). The action that earned a positive
   return became more likely; the others less so. Exactly what REINFORCE is
   supposed to do.

Notice the symmetry: the gradient at the not-taken actions points down equally (both -pi(a_i)), so they get pulled the same amount. The taken action gets pushed up by (1 - pi(a_taken)) — larger when its current probability is small (a useful property: the algorithm makes bigger updates when it most needs to).
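The hand calculation can be checked with a few lines of Python. This is a small verification sketch using only the standard library; the variable names are chosen for this example, and actions are zero-indexed, so a_2 is index 1.

```python
import math

def softmax(theta):
    """pi(a_i) = exp(theta_i) / sum_j exp(theta_j)."""
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

theta = [0.0, 0.0, 0.0]   # uniform starting policy
action = 1                # a_2, zero-indexed
G = 3.0                   # observed return
eta = 0.1                 # step size

pi = softmax(theta)
# grad log pi(a_2): (1 - pi) for the taken action, -pi for the others.
grad = [(1.0 if i == action else 0.0) - p for i, p in enumerate(pi)]
new_theta = [t + eta * G * g for t, g in zip(theta, grad)]
new_pi = softmax(new_theta)

print([round(g, 3) for g in grad])       # gradient, step 1
print([round(t, 3) for t in new_theta])  # updated parameters, step 2
print([round(p, 3) for p in new_pi])     # updated probabilities, step 3
```

Rounded to three decimals, the printed values agree with the answer above: gradient (-0.333, 0.667, -0.333), theta (-0.1, 0.2, -0.1), and probabilities (0.299, 0.403, 0.299).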

Try it yourself: value-based, policy-based, or actor-critic?

For each setting, pick the family that fits best and explain in one line.

A. A tabular gridworld with 100 discrete states and 4 discrete actions, known
   dynamics, you want pi^*.
B. A simulated robot with 7 continuous joints; the action space is R^7
   (torques for each joint).
C. An Atari game from raw pixels; you have a CNN and want to maximize game
   score.
D. Rock-paper-scissors against a smart opponent.
E. Fine-tuning a large language model with human preference data, given a
   trained reward model.

Show answer

A: value-based, planning (PI or VI). Known dynamics + small discrete state space -> tabular planning is exact, fast, and right (lessons 4-5).
B: policy-based (or actor-critic). Continuous action space (R^7) — value-based with argmax is intractable. Parameterize a Gaussian policy over joint torques; train with PPO or SAC. Actor-critic is the modern default.
C: value-based deep RL (DQN family). Discrete actions, large pixel state space, single reward signal: classic DQN territory (lesson 9). Experience replay and a target network stabilize learning against the deadly triad (function approximation, bootstrapping, off-policy updates).
D: policy-based with a stochastic policy. The optimal policy is uniform random; any deterministic value-based answer is exploitable. A parameterized stochastic policy can represent the right answer directly.
E: policy-based (specifically PPO) on the LM as the policy, with the reward model providing the signal. The exact RLHF recipe: PPO + reward model + LM = aligned model. Track 5’s lesson covers the alignment side.

The takeaway: value-based dominates when the action set is discrete and small enough to enumerate and the optimal policy is deterministic; policy-based is the right call for continuous actions, stochastic optima, and LM fine-tuning; actor-critic combines both and is the modern default for everything in between.
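Setting D's claim that any deterministic policy is exploitable can be verified directly. The sketch below assumes the usual payoff convention (+1 win, 0 tie, -1 loss) and an opponent who knows our policy and best-responds; the names are chosen for illustration.

```python
# Payoff to us: PAYOFF[our_action][opponent_action], with
# 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [
    [0, -1, 1],   # rock:     ties rock, loses to paper, beats scissors
    [1, 0, -1],   # paper:    beats rock, ties paper, loses to scissors
    [-1, 1, 0],   # scissors: loses to rock, beats paper, ties scissors
]

def worst_case_value(policy):
    """Expected payoff against an opponent who best-responds to our policy."""
    return min(
        sum(policy[i] * PAYOFF[i][j] for i in range(3))
        for j in range(3)
    )

always_rock = worst_case_value([1.0, 0.0, 0.0])   # deterministic policy
uniform = worst_case_value([1 / 3, 1 / 3, 1 / 3]) # stochastic policy
```

The always-rock policy loses every round to an opponent who plays paper (worst case -1), while the uniform policy guarantees an expected payoff of 0 no matter what the opponent does. A greedy policy derived from Q-values is deterministic, so it falls in the first category.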

Flashcards

Eight cards.

Q. Why parameterize the policy directly rather than always derive it from Q?

Continuous actions (argmax intractable; sample a parameterized distribution), stochastic optimal policies (rock-paper-scissors), or settings where the policy is simpler than the value function.

Q. State the policy-gradient theorem.

grad_theta J(theta) = E_pi [ grad_theta log pi_theta(a | s) * Q^pi(s, a) ]. The log-likelihood gradient of the action taken, scaled by how good that action is.

Q. Write the REINFORCE update.

theta <- theta + eta * G_t * grad_theta log pi_theta(a_t | s_t). Run a full episode; then, for each step t, scale the log-likelihood gradient by the return from t onward and move theta in that direction.

Q. REINFORCE is unbiased but ___ . What fixes this?

High-variance (inherits MC’s variance because it uses G_t as the Q estimate). Actor-critic fixes it by replacing G_t with a lower-variance learned estimate (typically the advantage A = Q - V from a TD-trained critic).

Q. Actor vs critic: what does each do?

Actor = the policy network pi_theta, trained with the policy-gradient update. Critic = a value network (V_phi or Q_phi) trained with TD; provides a lower-variance estimate of Q^pi to scale the actor’s update. Two networks, trained together.

Q. What is PPO, and where is it the workhorse?

Proximal Policy Optimization: actor-critic with a clipped surrogate objective that prevents the policy from changing too much per update. The modern default for robotics, game agents, and RLHF for LLMs.

Q. Sketch the RLHF recipe in policy-gradient terms.

Policy = the LM (sample next tokens). State = conversation history. Action = next token/response. Reward = learned reward model predicting human preference. Algorithm = PPO. The LM is fine-tuned to maximize the reward model’s score.

Q. In the 3-action softmax worked example, after one REINFORCE step the rewarded action's probability went from 0.333 to about 0.403. Why?

The log-likelihood gradient at the taken action is (1 - pi(a)) for that action’s parameter and -pi(other) for others. Scaled by a positive return and a positive step size, the taken action’s parameter increases and the others decrease, so its softmax probability rises.