
Cheatsheet: PPO (clipped surrogate objective)

| Method | Data | Stability mechanism |
|---|---|---|
| REINFORCE (Lesson 4) | On-policy, one update per batch | None (high variance, low sample efficiency) |
| Actor-critic (Lesson 5) | On-policy, one update per batch | Variance reduction via critic; same on-policy constraint |
| TRPO | On-policy, multi-step via importance sampling | Hard KL constraint: KL(π_old ‖ π_θ) ≤ δ |
| PPO (this lesson) | On-policy, multi-step via importance sampling | Clipped surrogate: min(r·A, clip(r, 1-ε, 1+ε)·A) |
| DQN (Lesson 7) | Off-policy, replay buffer | Replay + target net + double-Q (Lesson 7 engineering) |

PPO sits between vanilla on-policy and full off-policy methods: it reuses each on-policy batch for several updates via importance sampling, and the clip removes the incentive to push the policy past the region where the surrogate remains a good approximation.

Probability ratio:

r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t)

At θ = θ_old, r = 1. As θ moves, r measures how much the policy’s probability of the observed action has changed.
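In practice the ratio is usually computed from stored log-probabilities rather than raw probabilities, which is numerically safer. A minimal sketch, assuming per-action log-probabilities are available (the function name `probability_ratio` is hypothetical):

```python
import math

def probability_ratio(logp_new: float, logp_old: float) -> float:
    """r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t), computed from log-probabilities."""
    return math.exp(logp_new - logp_old)

# Same policy: the ratio is exactly 1.
same = probability_ratio(math.log(0.25), math.log(0.25))

# The new policy makes the observed action twice as likely: ratio is about 2.
doubled = probability_ratio(math.log(0.5), math.log(0.25))
```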

Importance-sampled surrogate:

L^IS(θ) = E_{(s,a) ~ π_{θ_old}} [ r_t(θ) · A^{π_old}(s, a) ]

At θ = θ_old, the gradient of L^IS equals the standard policy gradient. The surrogate is a good approximation while r stays close to 1.
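The link to the policy gradient at θ = θ_old can be seen by differentiating the surrogate. This is a short sketch using only the definitions above and the log-derivative identity:

```latex
\nabla_\theta L^{IS}(\theta)
  = \mathbb{E}_{(s,a)\sim\pi_{\theta_{old}}}
    \big[\, \nabla_\theta r_t(\theta)\, A^{\pi_{old}}(s,a) \,\big],
\qquad
\nabla_\theta r_t(\theta) = r_t(\theta)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)

% At \theta = \theta_{old} the ratio is 1, so
\nabla_\theta L^{IS}(\theta)\Big|_{\theta=\theta_{old}}
  = \mathbb{E}_{(s,a)\sim\pi_{\theta_{old}}}
    \big[\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big|_{\theta=\theta_{old}}\, A^{\pi_{old}}(s,a) \,\big]
```

The right-hand side is the usual advantage-weighted policy gradient, which is why the surrogate is trustworthy only near r = 1.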

TRPO constrained objective:

maximize_θ E [ r_t(θ) · A_t ]
subject to E [ KL(π_{θ_old}(·|s) || π_θ(·|s)) ] ≤ δ (typically δ = 0.01)

Hard constraint; requires conjugate-gradient solver + backtracking line search.
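The constraint itself is simple to evaluate for discrete actions, even though enforcing it during optimization is what requires the heavier machinery. A minimal sketch that only checks the mean-KL condition over sampled states (it does not implement the conjugate-gradient step or the line search; the function names are hypothetical):

```python
import math

def categorical_kl(p_old, p_new):
    """KL(p_old || p_new) for two discrete action distributions."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)

def satisfies_trust_region(old_dists, new_dists, delta=0.01):
    """Check E_s[ KL(π_old(·|s) || π_new(·|s)) ] <= delta over sampled states."""
    kls = [categorical_kl(po, pn) for po, pn in zip(old_dists, new_dists)]
    mean_kl = sum(kls) / len(kls)
    return mean_kl <= delta, mean_kl

# Two sampled states, two actions each (illustrative numbers only).
old = [[0.5, 0.5], [0.9, 0.1]]
small_step = [[0.52, 0.48], [0.89, 0.11]]
large_step = [[0.9, 0.1], [0.5, 0.5]]

ok_small, kl_small = satisfies_trust_region(old, small_step)
ok_large, kl_large = satisfies_trust_region(old, large_step)
```

A backtracking line search would shrink the proposed step until a check like this passes.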

PPO clipped surrogate (the practical version):

L^CLIP(θ) = E_t [ min( r_t(θ) · A_t, clip(r_t(θ), 1 - ε, 1 + ε) · A_t ) ]

Typical ε = 0.2. No hard constraint; the min provides soft regularization.
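The objective can be estimated over a batch as a plain average of per-sample terms. A minimal sketch in pure Python, assuming the ratios and advantages have already been computed (the helper names `clip` and `clipped_surrogate` are hypothetical):

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))

def clipped_surrogate(ratios, advantages, eps=0.2):
    """Batch estimate of L^CLIP: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    terms = [
        min(r * a, clip(r, 1.0 - eps, 1.0 + eps) * a)
        for r, a in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)

# Illustrative batch: three good actions and one bad action.
ratios = [0.5, 1.0, 1.5, 1.5]
advantages = [1.0, 1.0, 1.0, -1.0]
value = clipped_surrogate(ratios, advantages)
```

In an autodiff framework the same expression would be written on tensors and maximized (or its negative minimized).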

| Region | A > 0 (good action) | A < 0 (bad action) |
|---|---|---|
| r < 1 - ε | unclipped: r·A | clipped: (1-ε)·A (saturates) |
| 1 - ε ≤ r ≤ 1 + ε | r·A (no clip) | r·A (no clip) |
| r > 1 + ε | clipped: (1+ε)·A (saturates) | unclipped: r·A |

Rule: the clip caps the upside (saturates the reward for over-shooting in the favorable direction). It does not cap the downside (the optimizer still registers losses on clearly bad moves like A < 0, r > 1 + ε).
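The asymmetry shows up directly in the derivative of a single term with respect to r. A minimal sketch (the function name is hypothetical, and the kink points r = 1 ± ε are treated as belonging to the unclipped branch):

```python
def clipped_term_grad(r, advantage, eps=0.2):
    """d/dr of min(r*A, clip(r, 1-eps, 1+eps)*A) for one sample.

    Returns A where the unclipped branch is active and 0.0 where the
    clipped branch is active (the term is flat in r there).
    """
    if advantage > 0 and r > 1.0 + eps:
        return 0.0  # upside capped: no more credit for raising r
    if advantage < 0 and r < 1.0 - eps:
        return 0.0  # upside capped: no more credit for lowering r
    return advantage  # downside and in-range cases keep the full gradient

good_overshoot = clipped_term_grad(1.5, +1.0)  # favorable move past the clip
bad_overshoot = clipped_term_grad(1.5, -1.0)   # unfavorable move past the clip
```

The gradient vanishes only when the policy has already moved too far in the favorable direction; a move that made a bad action more likely still produces a corrective gradient.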

Worked example (one (s, a) pair, A = +1, ε = 0.2)

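| r | unclipped r·A | clip r to [0.8, 1.2] | clipped ·A | L^CLIP = min(…) |
|---|---|---|---|---|
| 0.5 | 0.5 | 0.8 | 0.8 | 0.5 |
| 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1.2 | 1.2 | 1.2 | 1.2 | 1.2 |
| 1.5 | 1.5 | 1.2 | 1.2 | 1.2 ← clipped |
| 2.0 | 2.0 | 1.2 | 1.2 | 1.2 ← clipped |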

The objective is flat above r = 1.2, so its gradient there is zero: the optimizer earns no further credit for pushing r higher on this sample.

Initialize π_θ, V_φ
For each iteration:
    1. Collect N timesteps with π_{θ_old} = π_θ
    2. Compute A_t via GAE (Lesson 5)
    3. For K epochs (K = 4 to 10):
           L = L^CLIP(θ) - c_1 · L^V(φ) + c_2 · S[π_θ]
           Gradient ascent step on L
    4. θ_old ← θ

| Hyperparameter | Typical value |
|---|---|
| Clip parameter ε | 0.1 to 0.3 (default 0.2) |
| Epochs per batch K | 4 to 10 |
| Timesteps per batch N | 2048 to 4096 (single-env), more if multi-env |
| GAE λ | 0.95 |
| Value loss coefficient c_1 | 0.5 to 1.0 |
| Entropy bonus c_2 | 0.0 to 0.01 |
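The combined objective from step 3 can be sketched on a small batch. The choices below are assumptions, not taken from the lesson: L^V is a mean squared error between value predictions and return targets, S is the mean entropy of a categorical policy, and the function and argument names are hypothetical.

```python
import math

def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))

def ppo_objective(ratios, advantages, values, returns, action_probs,
                  eps=0.2, c1=0.5, c2=0.01):
    """Return (L, L_clip, L_value, entropy) with L = L_clip - c1*L_value + c2*entropy."""
    n = len(ratios)
    l_clip = sum(
        min(r * a, clip(r, 1.0 - eps, 1.0 + eps) * a)
        for r, a in zip(ratios, advantages)
    ) / n
    # Assumed value loss: mean squared error against return targets.
    l_value = sum((v - g) ** 2 for v, g in zip(values, returns)) / n
    # Assumed entropy bonus: mean entropy of the per-state action distributions.
    entropy = sum(
        -sum(p * math.log(p) for p in probs if p > 0.0)
        for probs in action_probs
    ) / n
    return l_clip - c1 * l_value + c2 * entropy, l_clip, l_value, entropy

total, l_clip, l_value, entropy = ppo_objective(
    ratios=[1.0, 1.0],
    advantages=[1.0, -1.0],
    values=[0.5, 0.0],
    returns=[1.0, 0.0],
    action_probs=[[0.5, 0.5], [0.5, 0.5]],
)
```

The signs follow the pseudocode: the value loss is subtracted and the entropy is added because L is maximized.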

Full RLHF objective adds a KL penalty against the pretrained model:

L = L^CLIP - β · KL(π_θ || π_pretrained)

| Term | Purpose |
|---|---|
| L^CLIP (PPO clip) | Limits policy change per epoch within one PPO iteration |
| β · KL(π_θ ‖ π_pretrained) | Penalizes drift away from the pretrained model |
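The KL term is commonly estimated from samples rather than computed over the whole vocabulary. A minimal sketch, assuming per-token log-probabilities of the sampled tokens under both models are available; the value β = 0.1 and all names are hypothetical, and the estimate is only exact in expectation when the tokens were sampled from π_θ itself:

```python
def kl_estimate(logp_policy, logp_ref):
    """Sample-based estimate of KL(π_θ || π_pretrained).

    Averages log π_θ(a) - log π_pretrained(a) over sampled tokens a.
    """
    diffs = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    return sum(diffs) / len(diffs)

def rlhf_objective(l_clip, logp_policy, logp_ref, beta=0.1):
    """L = L^CLIP - β · KL(π_θ || π_pretrained), with the KL estimated from samples."""
    return l_clip - beta * kl_estimate(logp_policy, logp_ref)

# Illustrative numbers: the policy assigns higher log-probability than the
# pretrained model to both sampled tokens, so the penalty is positive.
kl = kl_estimate([-1.0, -2.0], [-1.5, -2.5])
objective = rlhf_objective(0.3, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
```

Raising β pulls the policy harder toward the pretrained model at the cost of slower reward improvement.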

Why PPO for RLHF:

  • Vocabulary-sized action space → DQN argmax_a infeasible
  • On-policy rollouts fit autoregressive generation
  • Trust region keeps policy near pretrained distribution → reward hacking constrained
  • Simple to implement vs TRPO’s natural-gradient solver

Common pitfalls:

  • ε too high → trust region too wide → importance-sampling approximation breaks → back to vanilla PG
  • K too high → policy drifts within batch → surrogate becomes approximation of approximation
  • Forgetting entropy bonus → premature deterministic collapse
  • Confusing PPO with TRPO; both work, PPO is the practical workhorse
  • Reading min as a regularizer instead of the source of the asymmetric clip

Key takeaways:

  • PPO = on-policy alternative to DQN’s off-policy engineering; same stability goal, different mechanism.
  • r_t(θ) = π_θ / π_{θ_old}; surrogate r · A correct at θ_old; degrades as r drifts.
  • Clipped surrogate min(r·A, clip(r, 1-ε, 1+ε)·A) caps the upside but not the downside.
  • Workhorse of modern RLHF: vocab-sized actions, on-policy rollouts, trust region constrains reward hacking.