
Brief: PPO (trust regions, clipped surrogate, RLHF workhorse)

Derive the PPO clipped surrogate objective from the on-policy stability problem (via TRPO’s trust region). Compute L^CLIP for a worked example and identify where the gradient saturates. Explain why PPO became the workhorse of RLHF over the alternatives in this track (DQN’s discrete-action limit; TRPO’s implementation complexity).

L6 named the deadly triad. L7 was the off-policy resolution: DQN’s engineering tricks (replay buffer, target network, double-Q) patch each leg. L8 is the on-policy resolution: avoid the off-policy leg by construction, then clip per-epoch policy change to bound the importance-sampling approximation error. The L6/L7/L8 triplet should land as a complete chapter: failure mode named (L6), two distinct resolutions presented (L7 off-policy, L8 on-policy), with the L7→L8 contrast explaining the modern algorithm zoo.

PPO is also the contemporary practical workhorse. Many of the most prominent instruction-tuned LLMs since 2022 were trained with PPO or a close variant. The lesson establishes the forward link to RLHF (Lesson 13) without preempting its content.

Berkeley CS285 Lectures 9 (natural gradient, TRPO) and 10 (PPO), Sergey Levine, 2023. Primary papers: Schulman et al. (2017) PPO; Schulman et al. (2015) TRPO; Schulman et al. (2016) GAE. RLHF connection: Christiano et al. (2017) preference-RL; Ouyang et al. (2022) InstructGPT; Bai et al. (2022) Anthropic RLHF.

Phase 2 lesson 3 (phase_order: 3). Completes the L6/L7/L8 chapter on the deadly triad and its two resolutions. Sets up Lessons 9 and 10 (model-based RL, the P-branch of the dispatch table) as the next algorithmic family, and forward-references Lesson 13 (RLHF) where PPO returns as the production workhorse.

  • Recap of L6/L7 framing; this lesson is the on-policy alternative.
  • The on-policy stability problem: REINFORCE’s gradient is correct only when data is from π_θ; reuse breaks the unbiasedness.
  • Importance-sampled surrogate L^IS(θ) = E[r · A] with r = π_θ / π_{θ_old}; correct at θ_old, degrades as r drifts.
  • TRPO as the principled solution: hard KL constraint E[KL(π_old || π_θ)] ≤ δ, natural-gradient solver. Works but heavy to implement.
  • PPO as the engineering solution: clipped surrogate L^CLIP = E[min(r · A, clip(r, 1-ε, 1+ε) · A)] with ε = 0.2.
  • Case-by-case analysis: for A > 0, the clip caps the objective once r rises above 1 + ε; for A < 0, it caps the objective once r falls below 1 - ε (no further credit for pushing a bad action's probability down). Asymmetry: for A > 0, r < 0.8 is not clipped (the unclipped term is smaller); for A < 0, r > 1.2 is not clipped (the unclipped term is more negative). The clip caps gains from moving further in the favorable direction; it does not cap losses from moving in the unfavorable one.
  • Worked example: tables of L^CLIP for A = +1 and A = -1 across r ∈ {0.5, 0.7, 0.8, 1.0, 1.2, 1.5, 2.0}, identifying which rows are clipped vs unclipped.
  • PPO training loop pseudocode with K epochs, GAE-based advantages, value loss, entropy bonus.
  • RLHF integration: vocabulary-sized action space rules out DQN; on-policy rollouts fit autoregressive generation; trust region constrains reward hacking. The full RLHF objective L = L^CLIP - β · KL(π_θ || π_pretrained) introduced (full deep-dive in L13).
  • Common pitfalls: ε too high; K too high; missing entropy bonus; confusing PPO with TRPO; mis-reading the min.
  • “Why this matters when you use AI” anchors PPO as the RLHF workhorse; L6/L7/L8 triplet completes the chapter.
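
The case analysis and worked example outlined above can be checked with a minimal Python sketch. It evaluates the per-sample clipped surrogate at ε = 0.2 for the listed r values and marks the rows where the clipped branch is strictly active, which is where the gradient with respect to r vanishes. The function names `l_clip` and `is_saturated` are illustrative assumptions, not taken from the lesson or from any library.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def l_clip(r, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = r * advantage
    clipped = clip(r, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)


def is_saturated(r, advantage, eps=0.2):
    """True where the clipped branch is strictly active, so dL/dr = 0."""
    if advantage > 0:
        return r > 1.0 + eps
    if advantage < 0:
        return r < 1.0 - eps
    return False  # A == 0 contributes nothing either way


if __name__ == "__main__":
    ratios = [0.5, 0.7, 0.8, 1.0, 1.2, 1.5, 2.0]  # r values from the worked example
    for advantage in (+1.0, -1.0):
        print(f"A = {advantage:+.0f}")
        for r in ratios:
            tag = "clipped" if is_saturated(r, advantage) else "unclipped"
            print(f"  r = {r:.1f}  L^CLIP = {l_clip(r, advantage):+.2f}  ({tag})")
```

At the boundary values r = 0.8 and r = 1.2 both branches of the min agree, so the label there is a matter of convention; this sketch labels them unclipped.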

Two exercises:

  1. Complete L^CLIP tables for both signs of A (ε = 0.2). For A = +1 across r ∈ {0.5, 0.8, 1.0, 1.1, 1.2, 1.3, 1.5, 2.0}, the reader finds that L^CLIP saturates at 1.2 for r above 1.2. For A = -1 across r ∈ {0.5, 0.7, 0.8, 1.0, 1.2, 1.5, 2.0}, the reader finds that L^CLIP is held at -0.8 for r below 0.8 but runs unclipped to -1.5 and -2.0 at r = 1.5 and r = 2.0 (the asymmetric behavior: the loss for clearly bad moves is still registered). Part C asks the reader to sketch the piecewise-linear shape mentally.

  2. Trace a 3-action softmax through one PPO update. Initial policy uniform [1/3, 1/3, 1/3], advantages [+1, 0, -1]. Unconstrained exponentiated-advantage proposal gives [0.665, 0.245, 0.090]. Probability ratios [1.995, 0.735, 0.270] all fall outside [0.8, 1.2]. PPO caps the per-epoch move to roughly [0.400, ?, 0.267]. Part D explains why this gradualism is the point: the importance-sampling surrogate is only reliable when r stays near 1.
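
Exercise 2 can be traced numerically with the sketch below. It assumes the unconstrained proposal is π_new(a) ∝ π_old(a) · exp(A(a)), which matches the [0.665, 0.245, 0.090] figure, and it treats the PPO "cap" as the probability at which the ratio reaches the clip boundary in the direction the advantage pushes. All function and variable names are illustrative. The zero-advantage action is left open, as in the exercise.

```python
import math


def exp_advantage_proposal(probs, advantages):
    """Unconstrained proposal: pi_new(a) proportional to pi_old(a) * exp(A(a))."""
    weights = [p * math.exp(a) for p, a in zip(probs, advantages)]
    total = sum(weights)
    return [w / total for w in weights]


old = [1 / 3, 1 / 3, 1 / 3]   # uniform initial policy
adv = [1.0, 0.0, -1.0]        # per-action advantages
eps = 0.2

proposal = exp_advantage_proposal(old, adv)
ratios = [p_new / p_old for p_new, p_old in zip(proposal, old)]
outside = [not (1 - eps <= r <= 1 + eps) for r in ratios]

# Probability at which each ratio hits the clip boundary that its advantage pushes toward.
caps = []
for p, a in zip(old, adv):
    if a > 0:
        caps.append(p * (1 + eps))   # gradient vanishes once r exceeds 1 + eps
    elif a < 0:
        caps.append(p * (1 - eps))   # gradient vanishes once r drops below 1 - eps
    else:
        caps.append(None)            # zero advantage: no clip-derived target; left open

print("proposal:", [round(p, 3) for p in proposal])
print("ratios:  ", [round(r, 3) for r in ratios])
print("outside: ", outside)
print("caps:    ", [None if c is None else round(c, 3) for c in caps])
```

These caps mark where the clipped surrogate stops supplying gradient. They are not a hard constraint on the policy, which is why the exercise states the capped move only roughly.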

5 flashcards: why naive on-policy reuse breaks unbiasedness; what each piece of L^CLIP does; why the loss is not clipped for r > 1+ε when A < 0; why PPO over DQN for RLHF; TRPO vs PPO relationship.

One-page reference. Progression table (REINFORCE → actor-critic → TRPO → PPO → DQN). Key definitions (r, surrogate, TRPO objective, PPO clipped surrogate). Asymmetric-clip-behavior table organized by region × sign of A. Worked example for A = +1. Training loop skeleton + hyperparameter table. RLHF integration with the KL-to-pretrained term. Common pitfalls.
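
The key definitions listed for the reference page can be written out as follows. The time index t and the conditioning on s_t are notational assumptions added here; the symbols r, A, ε, δ, β and the forms of the objectives follow the lesson outline. The last line states where the per-sample gradient with respect to r vanishes, away from the kinks at r = 1 ± ε.

```latex
\begin{align*}
r_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \\[4pt]
L^{\text{IS}}(\theta) &= \mathbb{E}_t\!\left[ r_t(\theta)\, A_t \right] \\[4pt]
\text{TRPO:}\quad & \max_\theta \; L^{\text{IS}}(\theta)
  \quad \text{s.t.} \quad
  \mathbb{E}_t\!\left[ \mathrm{KL}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \delta \\[4pt]
L^{\text{CLIP}}(\theta) &= \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\, A_t,\;
  \operatorname{clip}\!\left( r_t(\theta),\, 1-\epsilon,\, 1+\epsilon \right) A_t \right) \right] \\[4pt]
L^{\text{RLHF}}(\theta) &= L^{\text{CLIP}}(\theta) - \beta \, \mathrm{KL}\!\left( \pi_\theta \,\|\, \pi_{\text{pretrained}} \right) \\[4pt]
\frac{\partial}{\partial r} \min\!\left( r A,\; \operatorname{clip}(r, 1-\epsilon, 1+\epsilon)\, A \right) &=
\begin{cases}
0 & \text{if } A > 0 \text{ and } r > 1+\epsilon, \\
0 & \text{if } A < 0 \text{ and } r < 1-\epsilon, \\
A & \text{otherwise.}
\end{cases}
\end{align*}
```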

5-minute distillation. One-paragraph framing. Five things to remember. Why-this-matters paragraph completing the L6/L7/L8 chapter. Worked-check memory anchor showing the asymmetric L^CLIP values. Where this fits in the track arc.

Primary: Schulman et al. (2017 PPO; 2015 TRPO; 2016 GAE). RLHF: Christiano et al. (2017 preference-RL foundation); Stiennon et al. (2020 summarization RLHF); Ouyang et al. (2022 InstructGPT); Bai et al. (2022 Anthropic RLHF). Variants/successors: DeepSeek-R1 (2025 GRPO); Rafailov et al. (2023 DPO). Implementation: OpenAI Spinning Up; Engstrom et al. (2020) “Implementation Matters” empirical study. Course: Berkeley CS285 Lectures 9-10. Sutton & Barto: chapter 13 (policy gradient methods) and section 5.5 (off-policy prediction via importance sampling).

  • Stage 2 sweep: em/en dashes, U+2212 vs hyphen, caps emphasis, dead /topics/ links. Acronyms allowed in caps: PPO, TRPO, DPO, GRPO, DQN, RL, RLHF, SAC, DDPG, FA, BS, OP, IS, GAE, MDP, TD, MC, MSE, SGD, KL, LLM, ICML, ICLR, AAAI, NeurIPS, MuJoCo, InstructGPT, Anthropic, DeepSeek, OpenAI, JAIR.
  • No vendor naming triggers (paper authors, course instructors, OpenAI/Anthropic as RLHF practitioners; not commercial framing). No security claims; RLHF is mentioned as the contemporary application context.
  • §6 status: standard pipeline, no triggers. RLHF forward-references properly deferred to Lesson 13.
  • Lesson: 2640 words
  • Cheatsheet: 670 words
  • Practice: 1955 words
  • Summary: 654 words
  • Brief: 935 words
  • References: 633 words

Total ≈ 7487 words across 6 artifacts. Math-heavy band; in line with L5-L7 calibration.

  • Two component placeholders live as MDX comments; the Lead wires them in at promotion.
  • Practice imports the real versions of both components.
  • Numerics on PPO are conservative (ε = 0.2, K = 4 to 10, λ = 0.95 for GAE) and well-documented in the original paper. RLHF specifics deliberately abbreviated; the L13 deep-dive will carry the load.
  • Continues phase-boundary cadence; Phase 2 boundary check after L12.
  • Completes L6/L7/L8 chapter. The L7→L8 contrast (off-policy engineering vs on-policy clip) is the load-bearing pedagogical move and should be preserved through any future edits.