PPO clipped surrogate: brief

Capability gained

Derive the PPO clipped surrogate objective from the on-policy stability problem (via TRPO’s trust region). Compute L^CLIP for a worked example and identify where the gradient saturates. Explain why PPO became the workhorse of RLHF over the alternatives in this track (DQN’s discrete-action limit; TRPO’s implementation complexity).

Why this lesson exists

L6 named the deadly triad. L7 was the off-policy resolution: DQN’s engineering tricks (replay buffer, target network, double-Q) patch each leg. L8 is the on-policy resolution: avoid the off-policy leg by construction, then clip per-epoch policy change to bound the importance-sampling approximation error. The L6/L7/L8 triplet should land as a complete chapter: failure mode named (L6), two distinct resolutions presented (L7 off-policy, L8 on-policy), with the L7→L8 contrast explaining the modern algorithm zoo.

PPO is also the contemporary practical workhorse. The instruction-tuned LLMs that established RLHF from 2022 onward (InstructGPT, Anthropic’s RLHF models) were finetuned with PPO or a close variant. The lesson establishes the link forward to RLHF (Lesson 13) without preempting its content.

Source

Berkeley CS285 Lectures 9 (natural gradient, TRPO) and 10 (PPO), Sergey Levine, 2023. Primary papers: Schulman et al. (2017) PPO; Schulman et al. (2015) TRPO; Schulman et al. (2016) GAE. RLHF connection: Christiano et al. (2017) preference-RL; Ouyang et al. (2022) InstructGPT; Bai et al. (2022) Anthropic RLHF.

Phase advance

Phase 2 lesson 3 (phase_order: 3). Completes the L6/L7/L8 chapter on the deadly triad and its two resolutions. Sets up Lessons 9 and 10 (model-based RL, the P-branch of the dispatch table) as the next algorithmic family, and forward-references Lesson 13 (RLHF) where PPO returns as the production workhorse.

Lesson body (lesson.mdx)

Recap of L6/L7 framing; this lesson is the on-policy alternative.
The on-policy stability problem: REINFORCE’s gradient estimate is unbiased only when the data is sampled from the current π_θ; reusing a batch after θ has moved breaks that unbiasedness.
Importance-sampled surrogate L^IS(θ) = E[r · A] with r = π_θ / π_{θ_old}; correct at θ_old, degrades as r drifts.
TRPO as the principled solution: hard KL constraint E[KL(π_{θ_old} || π_θ)] ≤ δ, solved with a natural-gradient step (conjugate gradient plus backtracking line search). It works but is heavy to implement.
PPO as the engineering solution: clipped surrogate L^CLIP = E[min(r · A, clip(r, 1-ε, 1+ε) · A)] with ε = 0.2.
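Case-by-case analysis: for A > 0, the clip caps the objective once r exceeds 1 + ε; for A < 0, it caps the objective once r falls below 1 - ε, where lowering the action’s probability further earns no additional credit. Asymmetry: for A > 0, r < 1 - ε is not clipped (the unclipped term is smaller, so the min selects it); for A < 0, r > 1 + ε is not clipped (the unclipped term is more negative). The clip removes the incentive to push r further in the favorable direction; it never hides the penalty for moving in the unfavorable one.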
Worked example: tables of L^CLIP for A = +1 and A = -1 across r ∈ {0.5, 0.7, 0.8, 1.0, 1.2, 1.5, 2.0}, identifying which rows are clipped vs unclipped.
PPO training loop pseudocode with K epochs, GAE-based advantages, value loss, entropy bonus.
RLHF integration: vocabulary-sized action space rules out DQN; on-policy rollouts fit autoregressive generation; trust region constrains reward hacking. The full RLHF objective L = L^CLIP - β · KL(π_θ || π_pretrained) introduced (full deep-dive in L13).
Common pitfalls: ε too high; K too high; missing entropy bonus; confusing PPO with TRPO; mis-reading the min.
“Why this matters when you use AI” anchors PPO as the RLHF workhorse; L6/L7/L8 triplet completes the chapter.
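The worked example in the outline above can be checked numerically. The following is a minimal sketch, assuming the per-sample form of L^CLIP with ε = 0.2; the function names (l_clip, gradient_saturated, table) are illustrative and not taken from the lesson files. Rows that sit exactly on a clip boundary (r = 0.8 or r = 1.2) are reported as unclipped here, which is a convention of this sketch.

```python
def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def l_clip(r, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = r * advantage
    clipped = clip(r, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)


def gradient_saturated(r, advantage, eps=0.2):
    """True when the clipped branch is strictly active, so d L^CLIP / dr = 0."""
    if advantage > 0:
        return r > 1.0 + eps
    if advantage < 0:
        return r < 1.0 - eps
    return True  # A == 0: the objective is identically zero


RATIOS = [0.5, 0.7, 0.8, 1.0, 1.2, 1.5, 2.0]


def table(advantage, eps=0.2):
    """Rows of (r, L^CLIP, saturated?) for the ratios used in the worked example."""
    return [
        (r, round(l_clip(r, advantage, eps), 6), gradient_saturated(r, advantage, eps))
        for r in RATIOS
    ]


for adv in (+1.0, -1.0):
    print(f"A = {adv:+.0f}")
    for r, value, flat in table(adv):
        label = "clipped" if flat else "unclipped"
        print(f"  r = {r:.1f}   L_CLIP = {value:+.2f}   {label}")
```

For A = +1 the value stops at 1.2 once r passes 1.2; for A = -1 it is held at -0.8 for r below 0.8 and keeps falling (to -1.5 and -2.0) for r above 1.2, which is the asymmetry the case analysis describes.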

Practice (practice.mdx)

Two exercises:

Complete L^CLIP tables for both signs of A (ε = 0.2). For A = +1 across r values 0.5, 0.8, 1.0, 1.1, 1.2, 1.3, 1.5, 2.0: the reader identifies that L^CLIP saturates at 1.2 for r above 1.2. For A = -1 across r values 0.5, 0.7, 0.8, 1.0, 1.2, 1.5, 2.0: the reader identifies that L^CLIP is held at -0.8 for r below 0.8 but runs unclipped to -1.5 and -2.0 at r = 1.5 and r = 2.0 (the asymmetric behavior: the loss for clearly bad moves is still registered). Part C asks the reader to visualize the piecewise-linear shape.
Trace a 3-action softmax through one PPO update. Initial policy uniform [1/3, 1/3, 1/3], advantages [+1, 0, -1]. Unconstrained exponentiated-advantage proposal (π_new ∝ π_old · exp(A)) gives [0.665, 0.245, 0.090]. Probability ratios [1.996, 0.734, 0.270], computed from the unrounded probabilities, all fall outside [0.8, 1.2]. PPO caps the per-epoch move to roughly [0.400, ?, 0.267]. Part D explains why this gradualism is the point: the importance-sampling surrogate is only reliable when r stays near 1.
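The numbers in the second exercise can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the proposal is taken to be π_new ∝ π_old · exp(A), and the "cap" is read as the probability at which each action's ratio reaches its clip bound. The helper names are hypothetical, and the zero-advantage action is deliberately left open (None), as it is in the exercise text.

```python
import math


def exp_advantage_proposal(old_probs, advantages):
    """Unconstrained proposal: pi_new proportional to pi_old * exp(A), renormalized."""
    weights = [p * math.exp(a) for p, a in zip(old_probs, advantages)]
    total = sum(weights)
    return [w / total for w in weights]


def ratios(new_probs, old_probs):
    """Probability ratios r = pi_new / pi_old, one per action."""
    return [n / o for n, o in zip(new_probs, old_probs)]


def clip_bound_probs(old_probs, advantages, eps=0.2):
    """Probability at which each action's ratio hits the clip bound.

    A > 0 stops at (1 + eps) * p, A < 0 stops at (1 - eps) * p.
    A == 0 contributes no gradient, so no bound is reported for it.
    """
    bounds = []
    for p, a in zip(old_probs, advantages):
        if a > 0:
            bounds.append((1.0 + eps) * p)
        elif a < 0:
            bounds.append((1.0 - eps) * p)
        else:
            bounds.append(None)
    return bounds


old = [1 / 3, 1 / 3, 1 / 3]
adv = [1.0, 0.0, -1.0]

proposal = exp_advantage_proposal(old, adv)
print("proposal:", [round(p, 3) for p in proposal])
print("ratios:  ", [round(r, 3) for r in ratios(proposal, old)])
print("bounds:  ", clip_bound_probs(old, adv))
```

The ratios come out near 2.0, 0.73, and 0.27, so every action lands outside [0.8, 1.2], while the clip bounds for the two actions with non-zero advantage sit at about 0.400 and 0.267.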

5 flashcards: why naive on-policy reuse breaks unbiasedness; what each piece of L^CLIP does; why the loss is not clipped for r > 1+ε when A < 0; why PPO over DQN for RLHF; TRPO vs PPO relationship.

Cheatsheet (cheatsheet.mdx)

One-page reference. Progression table (REINFORCE → actor-critic → TRPO → PPO → DQN). Key definitions (r, surrogate, TRPO objective, PPO clipped surrogate). Asymmetric-clip-behavior table organized by region × sign of A. Worked example for A = +1. Training loop skeleton + hyperparameter table. RLHF integration with the KL-to-pretrained term. Common pitfalls.
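For the training-loop skeleton, the two pieces that most often go wrong are the GAE recursion and the sign conventions in the combined loss. The sketch below is illustrative only and uses plain Python lists rather than a tensor library. ε = 0.2 and λ = 0.95 follow the values quoted in this brief; γ = 0.99, value_coef = 0.5, and entropy_coef = 0.01 are assumed defaults, and the function names are hypothetical.

```python
import math


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory segment.

    rewards[t] and values[t] cover t = 0..T-1; last_value bootstraps V(s_T).
    Assumes no episode termination inside the segment.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # discounted sum of residuals
        advantages[t] = running
        next_value = values[t]
    return advantages


def ppo_loss(new_logp, old_logp, advantages, values, returns, entropies,
             eps=0.2, value_coef=0.5, entropy_coef=0.01):
    """Scalar loss to minimize: -L^CLIP + value_coef * MSE - entropy_coef * entropy."""
    n = len(advantages)
    surrogate = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        r = math.exp(lp_new - lp_old)                  # ratio pi_theta / pi_theta_old
        r_clipped = max(1.0 - eps, min(r, 1.0 + eps))
        surrogate += min(r * adv, r_clipped * adv)
    surrogate /= n
    value_loss = sum((v - g) ** 2 for v, g in zip(values, returns)) / n
    entropy = sum(entropies) / n
    return -surrogate + value_coef * value_loss - entropy_coef * entropy
```

In a full loop, old_logp and the advantages are computed once per batch of rollouts and held fixed while ppo_loss is minimized for K epochs; only new_logp, values, and entropies are recomputed from the current parameters.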

Summary (summary.mdx)

5-minute distillation. One-paragraph framing. Five things to remember. Why-this-matters paragraph completing the L6/L7/L8 chapter. Worked-check memory anchor showing the asymmetric L^CLIP values. Where this fits in the track arc.

References (references.mdx)

Primary: Schulman et al. (2017 PPO; 2015 TRPO; 2016 GAE). RLHF: Christiano et al. (2017 preference-RL foundation); Stiennon et al. (2020 summarization RLHF); Ouyang et al. (2022 InstructGPT); Bai et al. (2022 Anthropic RLHF). Variants/successors: Rafailov et al. (2023 DPO); DeepSeek-R1 (2025; uses GRPO, which was introduced in DeepSeekMath, Shao et al. 2024). Implementation: OpenAI Spinning Up; Engstrom et al. (2020) “Implementation Matters” empirical study. Course: Berkeley CS285 L9-L10. Sutton & Barto: Chapter 13 (policy gradient methods) and Section 5.5 (off-policy prediction via importance sampling).

Editorial discipline

Stage 2 sweep: em/en dashes, U+2212 vs hyphen, caps emphasis, dead /topics/ links. Acronyms allowed in caps: PPO, TRPO, DPO, GRPO, DQN, RL, RLHF, SAC, DDPG, FA, BS, OP, IS, GAE, MDP, TD, MC, MSE, SGD, KL, LLM, ICML, ICLR, AAAI, NeurIPS, MuJoCo, InstructGPT, Anthropic, DeepSeek, OpenAI, JAIR.
No vendor naming triggers (paper authors, course instructors, OpenAI/Anthropic as RLHF practitioners; not commercial framing). No security claims; RLHF is mentioned as the contemporary application context.
§6 status: standard pipeline, no triggers. RLHF forward-references properly deferred to Lesson 13.

Word counts

Lesson 2640
Cheatsheet 670
Practice 1955
Summary 654
Brief 935
References 633

Total ≈ 7487 words across 6 artifacts. Math-heavy band; in line with L5-L7 calibration.

Notes for promotion

Component placeholders (J0, J1) live as MDX comments; Lead wires at promotion.
Practice imports real J0 + J1 components.
Numerics on PPO are conservative (ε = 0.2, K = 4 to 10, λ = 0.95 for GAE) and well-documented in the original paper. RLHF specifics deliberately abbreviated; the L13 deep-dive will carry the load.
Continues phase-boundary cadence; Phase 2 boundary check after L12.
Completes L6/L7/L8 chapter. The L7→L8 contrast (off-policy engineering vs on-policy clip) is the load-bearing pedagogical move and should be preserved through any future edits.