Cheatsheet: PPO (clipped surrogate objective)
The progression: from REINFORCE to PPO
| Method | Data | Stability mechanism |
|---|---|---|
| REINFORCE (Lesson 4) | On-policy, one update per batch | None (high variance, low sample efficiency) |
| Actor-critic (Lesson 5) | On-policy, one update per batch | Variance reduction via critic; same on-policy constraint |
| TRPO | On-policy, multi-step via importance sampling | Hard KL constraint: `KL(π_old \|\| π_θ) ≤ δ` |
| PPO (this lesson) | On-policy, multi-step via importance sampling | Clipped surrogate: min(r·A, clip(r,1-ε,1+ε)·A) |
| DQN (Lesson 7) | Off-policy, replay buffer | Replay + target net + double-Q (Lesson 7 engineering) |
PPO sits between vanilla on-policy and full off-policy: it allows bounded off-policy reuse via importance sampling, with the clip preventing the surrogate from drifting past where it remains a good approximation.
Key definitions
Probability ratio:

    r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t)

At θ = θ_old, r = 1. As θ moves, r measures how much the policy’s probability of the observed action has changed.
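In practice the ratio is usually computed from log-probabilities rather than raw probabilities, which avoids underflow for small values. A minimal sketch, assuming the two log-probabilities are already available (the function name `probability_ratio` is illustrative, not from the lesson):

```python
import math


def probability_ratio(logp_new: float, logp_old: float) -> float:
    """r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t), computed as exp(log π_θ - log π_θ_old)."""
    return math.exp(logp_new - logp_old)


# Identical policies give a ratio of exactly 1.
same = probability_ratio(math.log(0.3), math.log(0.3))

# The new policy assigns twice the probability to the observed action,
# so the ratio is (up to floating-point rounding) 2.
doubled = probability_ratio(math.log(0.6), math.log(0.3))

print(same, doubled)
```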
Importance-sampled surrogate:

    L^IS(θ) = E_{(s,a) ~ π_{θ_old}} [ r_t(θ) · A^{π_old}(s, a) ]

At θ = θ_old, its gradient equals the standard policy gradient. It stays a good approximation while r remains close to 1.
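One way to see why the surrogate agrees with the policy-gradient objective to first order at θ_old is the log-derivative identity ∇π = π ∇log π. A short derivation in the notation of the definitions above:

```latex
\begin{align*}
\nabla_\theta L^{IS}(\theta)
  &= \mathbb{E}_{(s,a)\sim\pi_{\theta_{old}}}
     \left[ \nabla_\theta r_t(\theta)\, A^{\pi_{old}}(s,a) \right] \\
  &= \mathbb{E}_{(s,a)\sim\pi_{\theta_{old}}}
     \left[ r_t(\theta)\, \nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_{old}}(s,a) \right] \\
\nabla_\theta L^{IS}(\theta)\Big|_{\theta=\theta_{old}}
  &= \mathbb{E}_{(s,a)\sim\pi_{\theta_{old}}}
     \left[ \nabla_\theta \log \pi_\theta(a \mid s)\Big|_{\theta=\theta_{old}}\, A^{\pi_{old}}(s,a) \right]
     \qquad \text{since } r_t(\theta_{old}) = 1 .
\end{align*}
```

The last line is the usual policy-gradient estimator with advantage weighting; away from θ_old the extra factor r_t(θ) is what makes the approximation degrade.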
TRPO constrained objective:

    maximize_θ  E [ r_t(θ) · A_t ]
    subject to  E [ KL(π_{θ_old}(·|s) || π_θ(·|s)) ] ≤ δ    (typically δ = 0.01)

Hard constraint; requires a conjugate-gradient solver plus backtracking line search.
PPO clipped surrogate (the practical version):

    L^CLIP(θ) = E_t [ min( r_t(θ) · A_t, clip(r_t(θ), 1 - ε, 1 + ε) · A_t ) ]

Typical ε = 0.2. No hard constraint; clipping the ratio inside the min acts as a soft trust region.
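The clipped surrogate translates almost directly into code. The sketch below is a plain-Python illustration, assuming ratios and advantages arrive as equal-length lists of floats; the helper names (`clip`, `ppo_clip_term`, `ppo_clip_objective`) are made up for this example.

```python
def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def ppo_clip_term(r: float, adv: float, eps: float = 0.2) -> float:
    """Per-sample term: min(r·A, clip(r, 1-ε, 1+ε)·A)."""
    return min(r * adv, clip(r, 1.0 - eps, 1.0 + eps) * adv)


def ppo_clip_objective(ratios, advantages, eps: float = 0.2) -> float:
    """Batch estimate of L^CLIP: the mean of the per-sample terms."""
    terms = [ppo_clip_term(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# A good action whose probability has already grown past 1 + ε is capped,
# while a bad action whose probability grew is charged in full.
print(ppo_clip_term(1.5, 1.0))
print(ppo_clip_term(1.5, -1.0))
print(ppo_clip_objective([0.5, 1.0, 1.5], [1.0, 1.0, 1.0]))
```

Training code would normally negate this value to obtain a loss for a minimizer; that sign convention is a choice of the surrounding framework, not part of the definition.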
The asymmetric clip behavior
| Region | A > 0 (good action) | A < 0 (bad action) |
|---|---|---|
| r < 1 - ε | unclipped: r·A | clipped: (1-ε)·A (saturates) |
| 1 - ε ≤ r ≤ 1 + ε | r·A (no clip) | r·A (no clip) |
| r > 1 + ε | clipped: (1+ε)·A (saturates) | unclipped: r·A |
Rule: the clip caps the upside (saturates the reward for over-shooting in the favorable direction). It does not cap the downside (the optimizer still registers losses on clearly bad moves like A < 0, r > 1 + ε).
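The asymmetry is easiest to see in the derivative of the per-sample term with respect to r: it is either A (the optimizer still gets signal) or 0 (the term has saturated). A small sketch, with the hypothetical helper `clip_term_grad_wrt_r` and the kinks at exactly r = 1 ± ε treated as unclipped:

```python
def clip_term_grad_wrt_r(r: float, adv: float, eps: float = 0.2) -> float:
    """d/dr of min(r·A, clip(r, 1-ε, 1+ε)·A), away from the kink points."""
    if adv > 0 and r > 1.0 + eps:
        return 0.0  # good action already over-rewarded: upside is capped
    if adv < 0 and r < 1.0 - eps:
        return 0.0  # bad action already suppressed enough: upside is capped
    return adv      # everywhere else the unclipped term r·A is active


# Walk the three regions of the table for a good and a bad action.
for r in (0.5, 1.0, 1.5):
    for adv in (1.0, -1.0):
        print(f"r={r}, A={adv:+}: gradient wrt r = {clip_term_grad_wrt_r(r, adv)}")
```

The two zero-gradient cases are exactly the "saturates" cells of the table; the two off-diagonal cells keep a non-zero gradient, which is how the objective continues to penalize moves in the wrong direction.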
Worked example (one (s, a) pair, A = +1, ε = 0.2)
| r | unclipped r·A | clip r to [0.8, 1.2] | clip(r)·A | L^CLIP = min(…) |
|---|---|---|---|---|
| 0.5 | 0.5 | 0.8 | 0.8 | 0.5 |
| 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1.2 | 1.2 | 1.2 | 1.2 | 1.2 |
| 1.5 | 1.5 | 1.2 | 1.2 | 1.2 ← clipped |
| 2.0 | 2.0 | 1.2 | 1.2 | 1.2 ← clipped |
The gradient is zero for r > 1.2, so the optimizer earns no further credit for pushing r higher on this sample.
PPO training loop

    Initialize π_θ, V_φ
    For each iteration:
      1. Collect N timesteps with π_{θ_old} = π_θ
      2. Compute A_t via GAE (Lesson 5)
      3. For K epochs (K = 4 to 10):
           L = L^CLIP(θ) - c_1 · L^V(φ) + c_2 · S[π_θ]
           Gradient ascent step on L
      4. θ_old ← θ

| Hyperparameter | Typical value |
|---|---|
| Clip parameter ε | 0.1 to 0.3 (default 0.2) |
| Epochs per batch K | 4 to 10 |
| Timesteps per batch N | 2048 to 4096 (single-env), more if multi-env |
| GAE λ | 0.95 |
| Value loss coefficient c_1 | 0.5 to 1.0 |
| Entropy bonus c_2 | 0.0 to 0.01 |
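To make the loop structure concrete, here is a deliberately stripped-down sketch on a two-armed bandit. Everything about the setup is an assumption made for illustration: the toy reward table, the softmax-over-logits policy, a mean-reward baseline standing in for GAE and the critic, no entropy bonus, and a small batch size and learning rate chosen so the example runs instantly. It is meant to show collect, compute advantages, K clipped epochs, refresh θ_old, not to be a usable PPO implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit_ppo(iterations=50, n=64, k=4, eps=0.2, lr=0.5, seed=0):
    rng = random.Random(seed)
    rewards = [0.0, 1.0]  # toy assumption: arm 1 always pays 1, arm 0 pays 0
    logits = [0.0, 0.0]   # θ: the policy parameters
    for _ in range(iterations):
        # 1. Collect N samples with π_θ_old = π_θ.
        pi_old = softmax(logits)
        actions = [0 if rng.random() < pi_old[0] else 1 for _ in range(n)]
        # 2. Advantages: reward minus a mean-reward baseline (stand-in for GAE).
        baseline = sum(rewards[a] for a in actions) / n
        advs = [rewards[a] - baseline for a in actions]
        # 3. K epochs of gradient ascent on L^CLIP over the same batch.
        for _ in range(k):
            pi = softmax(logits)
            grad = [0.0, 0.0]
            for a, adv in zip(actions, advs):
                r = pi[a] / pi_old[a]
                saturated = (adv > 0 and r > 1 + eps) or (adv < 0 and r < 1 - eps)
                if saturated:
                    continue  # clipped term is active: zero gradient
                # d(r·A)/d logit_j = A · r · (1[j == a] - π_j)
                for j in range(2):
                    grad[j] += adv * r * ((1.0 if j == a else 0.0) - pi[j]) / n
            logits = [x + lr * g for x, g in zip(logits, grad)]
        # 4. θ_old ← θ happens implicitly: pi_old is recomputed next iteration.
    return softmax(logits)


probs = train_bandit_ppo()
print(probs)  # probability mass should have shifted toward arm 1
```

Within one iteration the later epochs often contribute nothing, because most samples have already hit the clip boundary; that is the mechanism that bounds how far the policy moves per batch.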
PPO in RLHF (Lesson 13 deep-dive)
Full RLHF objective adds a KL penalty against the pretrained model:

    L = L^CLIP - β · KL(π_θ || π_pretrained)

| Term | Purpose |
|---|---|
| L^CLIP (PPO clip) | Limits policy change per epoch within one PPO iteration |
| `β · KL(π_θ \|\| π_pretrained)` | Keeps the policy near the pretrained model (constrains reward hacking) |
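As a rough illustration of how the penalty enters the objective, the sketch below computes an exact KL between two small categorical distributions and subtracts it from a given L^CLIP value. The function names and the value β = 0.1 are assumptions for the example; real RLHF code typically estimates the KL per token from sampled log-probabilities instead of summing over the whole vocabulary.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for categorical distributions on the same support.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def rlhf_objective(l_clip, pi_theta, pi_pretrained, beta=0.1):
    """L = L^CLIP - β · KL(π_θ || π_pretrained); beta=0.1 is an illustrative value."""
    return l_clip - beta * kl_divergence(pi_theta, pi_pretrained)


pretrained = [0.5, 0.3, 0.2]
drifted = [0.9, 0.05, 0.05]

# No drift: the penalty vanishes and the objective equals L^CLIP.
print(rlhf_objective(1.0, pretrained, pretrained))
# Drift toward one token: the same L^CLIP is worth less.
print(rlhf_objective(1.0, drifted, pretrained))
```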
Why PPO for RLHF:
- Vocabulary-sized action space → DQN `argmax_a` infeasible
- On-policy rollouts fit autoregressive generation
- Trust region keeps policy near pretrained distribution → reward hacking constrained
- Simple to implement vs TRPO’s natural-gradient solver
Common pitfalls
- `ε` too high → trust region too wide → importance-sampling approximation breaks → back to vanilla PG
- `K` too high → policy drifts within batch → surrogate becomes approximation of approximation
- Forgetting entropy bonus → premature deterministic collapse
- Confusing PPO with TRPO; both work, PPO is the practical workhorse
- Reading `min` as a regularizer instead of the source of the asymmetric clip
What you should remember
- PPO = on-policy alternative to DQN’s off-policy engineering; same stability goal, different mechanism.
- `r_t(θ) = π_θ / π_{θ_old}`; surrogate `r · A` correct at `θ_old`; degrades as `r` drifts.
- Clipped surrogate `min(r·A, clip(r, 1-ε, 1+ε)·A)` caps the upside but not the downside.
- Workhorse of modern RLHF: vocab-sized actions, on-policy rollouts, trust region constrains reward hacking.