Brief: PPO (trust regions, clipped surrogate, RLHF workhorse)
Capability gained
Derive the PPO clipped surrogate objective from the on-policy stability problem (via TRPO’s trust region). Compute L^CLIP for a worked example and identify where the gradient saturates. Explain why PPO became the workhorse of RLHF over the alternatives in this track (DQN’s discrete-action limit; TRPO’s implementation complexity).
Why this lesson exists
L6 named the deadly triad. L7 was the off-policy resolution: DQN’s engineering tricks (replay buffer, target network, double-Q) patch each leg. L8 is the on-policy resolution: avoid the off-policy leg by construction, then clip per-epoch policy change to bound the importance-sampling approximation error. The L6/L7/L8 triplet should land as a complete chapter: failure mode named (L6), two distinct resolutions presented (L7 off-policy, L8 on-policy), with the L7→L8 contrast explaining the modern algorithm zoo.
PPO is also the contemporary practical workhorse. Many prominent instruction-tuned LLMs since 2022, InstructGPT among them, were trained with PPO or a close variant. The lesson establishes the link forward to RLHF (Lesson 13) without preempting its content.
Source
Berkeley CS285 Lectures 9 (natural gradient, TRPO) and 10 (PPO), Sergey Levine, 2023. Primary papers: Schulman et al. (2017) PPO; Schulman et al. (2015) TRPO; Schulman et al. (2016) GAE. RLHF connection: Christiano et al. (2017) preference-RL; Ouyang et al. (2022) InstructGPT; Bai et al. (2022) Anthropic RLHF.
Phase advance
Phase 2 lesson 3 (phase_order: 3). Completes the L6/L7/L8 chapter on the deadly triad and its two resolutions. Sets up Lessons 9 and 10 (model-based RL, the P-branch of the dispatch table) as the next algorithmic family, and forward-references Lesson 13 (RLHF) where PPO returns as the production workhorse.
Lesson body (lesson.mdx)
- Recap of L6/L7 framing; this lesson is the on-policy alternative.
- The on-policy stability problem: REINFORCE’s gradient is correct only when data is from π_θ; reuse breaks the unbiasedness.
- Importance-sampled surrogate L^IS(θ) = E[r · A] with r = π_θ / π_{θ_old}; correct at θ_old, degrades as r drifts.
- TRPO as the principled solution: hard KL constraint E[KL(π_{θ_old} || π_θ)] ≤ δ, natural-gradient solver. Works but heavy to implement.
- PPO as the engineering solution: clipped surrogate L^CLIP = E[min(r · A, clip(r, 1-ε, 1+ε) · A)] with ε = 0.2.
- Case-by-case analysis: for A > 0, clip caps upside above 1 + ε; for A < 0, clip caps “upside” (saturation of negative-direction reward) below 1 - ε. Asymmetry: for A > 0, r < 0.8 is not clipped (unclipped term smaller); for A < 0, r > 1.2 is not clipped (unclipped term more negative). The clip caps upside rewards, not downside losses.
- Worked example: tables of L^CLIP for A = +1 and A = -1 across r ∈ {0.5, 0.7, 0.8, 1.0, 1.2, 1.5, 2.0}, identifying which rows are clipped vs unclipped.
- PPO training loop pseudocode with K epochs, GAE-based advantages, value loss, entropy bonus.
- RLHF integration: vocabulary-sized action space rules out DQN; on-policy rollouts fit autoregressive generation; trust region constrains reward hacking. The full RLHF objective L = L^CLIP - β · KL(π_θ || π_pretrained) introduced (full deep-dive in L13).
- Common pitfalls: ε too high; K too high; missing entropy bonus; confusing PPO with TRPO; mis-reading the min.
- “Why this matters when you use AI” anchors PPO as the RLHF workhorse; L6/L7/L8 triplet completes the chapter.
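The case-by-case analysis and the worked-example tables can be checked numerically. The following is a minimal sketch, not the lesson's own code: the helper names (`clip`, `l_clip`, `is_saturated`, `table`) are hypothetical, and it evaluates the per-sample term min(r · A, clip(r, 1-ε, 1+ε) · A) with ε = 0.2 over the r values listed for the worked example.

```python
def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def l_clip(r, adv, eps=0.2):
    """Per-sample clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    return min(r * adv, clip(r, 1.0 - eps, 1.0 + eps) * adv)


def is_saturated(r, adv, eps=0.2):
    """True when the clipped branch is active, so the gradient w.r.t. r is zero."""
    return (adv > 0 and r > 1.0 + eps) or (adv < 0 and r < 1.0 - eps)


# Ratio grid from the worked example in the lesson body.
RATIOS = [0.5, 0.7, 0.8, 1.0, 1.2, 1.5, 2.0]


def table(adv, eps=0.2):
    """Rows of (r, L^CLIP, saturated?) for one sign of the advantage."""
    return [(r, round(l_clip(r, adv, eps), 3), is_saturated(r, adv, eps))
            for r in RATIOS]


if __name__ == "__main__":
    for adv in (+1.0, -1.0):
        print(f"A = {adv:+.0f}")
        for r, value, flat in table(adv):
            status = "clipped (zero gradient)" if flat else "unclipped"
            print(f"  r = {r:.1f}  L^CLIP = {value:+.2f}  {status}")
```

For A = +1 the rows with r above 1.2 are the saturated ones; for A = -1 the saturated rows are those with r below 0.8, while r above 1.2 stays unclipped, which is the asymmetry the bullet describes.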
Practice (practice.mdx)
Two exercises:
- Complete L^CLIP tables for both signs of A (ε = 0.2). For A = +1 across r values 0.5, 0.8, 1.0, 1.1, 1.2, 1.3, 1.5, 2.0: identifies that L^CLIP saturates at 1.2 for r above 1.2. For A = -1 across r values 0.5, 0.7, 0.8, 1.0, 1.2, 1.5, 2.0: identifies that L^CLIP saturates at -0.8 for r below 0.8 but goes unclipped to -1.5, -2.0 for r above 1.2 (the asymmetric behavior; the loss for clearly bad moves is still registered). Part C asks the reader to plot the piecewise-linear shape in their head.
- Trace a 3-action softmax through one PPO update. Initial policy uniform [1/3, 1/3, 1/3], advantages [+1, 0, -1]. Unconstrained exponentiated-advantage proposal gives [0.665, 0.245, 0.090]. Probability ratios [1.995, 0.735, 0.270] all fall outside [0.8, 1.2]. PPO caps the per-epoch move to roughly [0.400, ?, 0.267]. Part D explains why this gradualism is the point: the importance-sampling surrogate is only reliable when r stays near 1.
5 flashcards: why naive on-policy reuse breaks unbiasedness; what each piece of L^CLIP does; why the loss is not clipped for r > 1+ε when A < 0; why PPO over DQN for RLHF; TRPO vs PPO relationship.
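The numbers in the second exercise can be reproduced with a few lines of standard-library Python. This is a sketch under one stated assumption: the "exponentiated-advantage proposal" is read as adding each advantage to the old logit with unit step size and renormalizing. That reading matches the quoted figures, but the practice file may construct it differently. All variable names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


EPS = 0.2
old_probs = [1 / 3, 1 / 3, 1 / 3]
advantages = [1.0, 0.0, -1.0]

# Assumption: proposal logits = old logits + advantage (unit step size).
proposal = softmax([math.log(p) + a for p, a in zip(old_probs, advantages)])

# Probability ratio r = pi_new / pi_old for each action.
ratios = [q / p for q, p in zip(proposal, old_probs)]
outside = [not (1 - EPS <= r <= 1 + EPS) for r in ratios]

# Probabilities at which each action's ratio reaches the clip boundary.
upper_cap = [(1 + EPS) * p for p in old_probs]  # binds when A > 0
lower_cap = [(1 - EPS) * p for p in old_probs]  # binds when A < 0

if __name__ == "__main__":
    print("proposal:", [round(q, 3) for q in proposal])
    print("ratios:  ", [round(r, 3) for r in ratios])
    print("outside [0.8, 1.2]:", outside)
    print("cap for the A = +1 action:", round(upper_cap[0], 3))
    print("cap for the A = -1 action:", round(lower_cap[2], 3))
```

The zero-advantage action has a surrogate term that is identically zero, so the sketch asserts no cap for it and leaves the middle entry of the exercise open, as the brief does.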
Cheatsheet (cheatsheet.mdx)
One-page reference. Progression table (REINFORCE → actor-critic → TRPO → PPO → DQN). Key definitions (r, surrogate, TRPO objective, PPO clipped surrogate). Asymmetric-clip-behavior table organized by region × sign of A. Worked example for A = +1. Training loop skeleton + hyperparameter table. RLHF integration with the KL-to-pretrained term. Common pitfalls.
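The pieces of the training loop skeleton (GAE-based advantages, clipped surrogate, value loss, entropy bonus) might be sketched as below. This is an illustrative outline, not the cheatsheet's actual skeleton: the function names are hypothetical, and `gamma=0.99`, `value_coef=0.5`, and `entropy_coef=0.01` are assumed defaults that do not come from this brief. Only ε = 0.2 and λ = 0.95 are taken from it.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    `values` holds len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final reward.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_loss(ratios, advantages, value_preds, returns, entropies,
             eps=0.2, value_coef=0.5, entropy_coef=0.01):
    """Scalar loss to minimize: -L^CLIP + c_v * value MSE - c_ent * entropy."""
    n = len(ratios)
    surrogate = sum(
        min(r * a, max(1.0 - eps, min(1.0 + eps, r)) * a)
        for r, a in zip(ratios, advantages)
    ) / n
    value_loss = sum((v - g) ** 2 for v, g in zip(value_preds, returns)) / n
    entropy = sum(entropies) / n
    return -surrogate + value_coef * value_loss - entropy_coef * entropy


# Loop outline (comments only; rollout collection and optimizer are omitted):
#   1. collect rollouts with the current policy and store old log-probs
#   2. compute advantages with gae(...)
#   3. for each of K epochs: recompute ratios, evaluate ppo_loss(...), step
```

With `gamma=1.0` and `lam=1.0` and a zero value function, `gae` reduces to reward-to-go, which is a convenient sanity check before wiring it into a real loop.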
Summary (summary.mdx)
5-minute distillation. One-paragraph framing. Five things to remember. Why-this-matters paragraph completing the L6/L7/L8 chapter. Worked-check memory anchor showing the asymmetric L^CLIP values. Where this fits in the track arc.
References (references.mdx)
Primary: Schulman et al. (2017 PPO; 2015 TRPO; 2016 GAE). RLHF: Christiano et al. (2017 preference-RL foundation); Stiennon et al. (2020 summarization RLHF); Ouyang et al. (2022 InstructGPT); Bai et al. (2022 Anthropic RLHF). Variants/successors: DeepSeek-R1 (2024 GRPO); Rafailov et al. (2023 DPO). Implementation: OpenAI Spinning Up; Engstrom et al. (2020) “Implementation Matters” empirical study. Course: Berkeley CS285 L9-L10. Sutton & Barto chapter 13, section 5.5.
Editorial discipline
- Stage 2 sweep: em/en dashes, U+2212 vs hyphen, caps emphasis, dead /topics/ links. Acronyms allowed in caps: PPO, TRPO, DPO, GRPO, DQN, RL, RLHF, SAC, DDPG, FA, BS, OP, IS, GAE, MDP, TD, MC, MSE, SGD, KL, LLM, ICML, ICLR, AAAI, NeurIPS, MuJoCo, InstructGPT, Anthropic, DeepSeek, OpenAI, JAIR.
- No vendor naming triggers (paper authors, course instructors, OpenAI/Anthropic as RLHF practitioners; not commercial framing). No security claims; RLHF is mentioned as the contemporary application context.
- §6 status: standard pipeline, no triggers. RLHF forward-references properly deferred to Lesson 13.
Word counts
- Lesson 2640
- Cheatsheet 670
- Practice 1955
- Summary 654
- Brief 935
- References 633
Total ≈ 7487 words across 6 artifacts. Math-heavy band; in line with L5-L7 calibration.
Notes for promotion
- Component placeholders (�J0�, �J1�) live as MDX comments; Lead wires at promotion.
- Practice imports real �J0� + �J1� components.
- Numerics on PPO are conservative (ε = 0.2, K = 4 to 10, λ = 0.95 for GAE) and well-documented in the original paper. RLHF specifics deliberately abbreviated; the L13 deep-dive will carry the load.
- Continues phase-boundary cadence; Phase 2 boundary check after L12.
- Completes L6/L7/L8 chapter. The L7→L8 contrast (off-policy engineering vs on-policy clip) is the load-bearing pedagogical move and should be preserved through any future edits.