RLHF pipeline: brief

Capability gained

Walk the three-stage InstructGPT pipeline (SFT → RM → PPO + KL). Write the Bradley-Terry reward objective. Derive the closed-form optimal RLHF policy π*(y|x) ∝ π_SFT(y|x) · exp(R_φ(x,y)/β) from the L11/L12 variational framework. Compute the optimal policy by hand on a 2-response example in the lesson body (and a 3-response variant in practice) and verify both β limits. Diagnose five different RLHF failure modes by mapping symptoms to operational instruments (reward-model test accuracy, measured KL-from-base, win-rate against base, harm-bench scores, sycophancy benchmarks, capability evals).
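
As a sketch of the two formulas named above, written in standard notation: the symbols σ (logistic function), y_w / y_l (preferred / rejected response), D (preference dataset), and Z(x) (partition function) are assumptions introduced here, and the objective is the idealized sequence-level form with expected reward in place of the PPO clipped surrogate. The lesson body may use different symbols.

```latex
% Bradley-Terry reward-model objective (negative log-likelihood of observed preferences).
\mathcal{L}_{\mathrm{RM}}(\phi)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(R_\phi(x,y_w) - R_\phi(x,y_l)\right)\right]

% Idealized KL-regularized objective at the sequence level.
J(\pi) = \mathbb{E}_{y\sim\pi(\cdot\mid x)}\!\left[R_\phi(x,y)\right]
  - \beta\,\mathrm{KL}\!\left(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{SFT}}(\cdot\mid x)\right)

% Define the tilted distribution and its normalizer.
\pi^{*}(y\mid x) = \frac{\pi_{\mathrm{SFT}}(y\mid x)\,\exp\!\left(R_\phi(x,y)/\beta\right)}{Z(x)},
\qquad
Z(x) = \sum_{y'} \pi_{\mathrm{SFT}}(y'\mid x)\,\exp\!\left(R_\phi(x,y')/\beta\right)

% Substituting R_\phi = \beta\,(\log \pi^{*} - \log \pi_{\mathrm{SFT}} + \log Z) into J gives
J(\pi) = -\,\beta\,\mathrm{KL}\!\left(\pi(\cdot\mid x)\,\|\,\pi^{*}(\cdot\mid x)\right) + \beta \log Z(x)
% The second term does not depend on \pi, so J is maximized exactly at \pi = \pi^{*}.
```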

Why this lesson exists

L13 is the Phase 3 opener and the synthesis of everything Phase 2 built. L8 introduced PPO; L11-L12 introduced the variational framework; L13 wires them together for the language-model alignment problem. Readers leave seeing RLHF not as a separate technique but as the natural application of the algorithmic and theoretical machinery they have already built.

L13 is also the §6-watch-zone test for the track. RLHF touches alignment debates, reward hacking, sycophancy, and training-data policy. The lesson applies the discipline the advisor named:

4-category specificity per RLHF-specific surface: each contested topic (reward hacking, sycophancy, refusal hacking, length hacking) is mapped to a specific empirical instrument rather than discussed in the abstract.
Evaluation-methods-naming: training stability (KL-from-base), reward-model accuracy (test set), downstream eval (win-rate against base + HarmBench + sycophancy benchmarks + capability evals like MMLU/GSM8K/HumanEval) are named as the operational measurements.
Operational scope test: every claim is paired with the instrument that would settle it. “Is this RLHF run working?” → reward-model test accuracy + measured KL-from-base + win-rate against base. “Is this model aligned?” → a broader, harder question that the instruments inform but do not settle.
Domain-specific instrument suite: HarmBench (Mazeika 2024), Anthropic sycophancy benchmarks (Perez 2023; Sharma 2023), MMLU/GSM8K/HumanEval as capability-retention proxies; the Bradley-Terry (1952) parameterization for the reward model.

The empirical/value-question distinction is preserved: operational instruments are presented as the engineering surface; deeper alignment questions are flagged as broader and remaining open.

Source

Primary papers: Ouyang et al. (2022 InstructGPT); Bai et al. (2022 Anthropic RLHF); Bai et al. (2022 Constitutional AI); Rafailov et al. (2023 DPO); DeepSeek-AI (2024 GRPO); Azar et al. (2024 IPO); Ethayarajh et al. (2024 KTO). Operational instruments: Perez et al. (2023), Sharma et al. (2023), Mazeika et al. (2024 HarmBench). Bradley-Terry foundation (1952). Surveys: Casper et al. (2023), Kaufmann et al. (2024).

This is the first Track 18 lesson without a direct CS285 lecture source; the variational framework is from CS285 L18-L19 (Lessons 11-12) and the practical pipeline is from the primary RLHF papers.

Phase advance

Phase 3 lesson 1 (phase_order: 1). Opens Phase 3 (rl-frontiers). Phase 2 closed at L12; the Phase 2 → Phase 3 boundary checkpoint was approved between L12 and L13. L13 sets up the Phase 3 framing: frontier applications wire together pieces from Phases 1-2. The Phase 3 sequence after L13 is L14 offline RL problem, L15 offline RL algorithms, L16 exploration, L17 multi-task and meta-RL, L18 open problems (closes Phase 3 + Track 18).

Lesson body (lesson.mdx)

Recap of where we are in the Phase 1/2/3 narrative arc; L13 is the synthesis lesson.
The RLHF problem: align a pretrained model to human preferences without losing the benefits of pretraining.
InstructGPT pipeline three stages: SFT, reward modeling (Bradley-Terry), PPO + KL.
The full RLHF objective L_RLHF = L^CLIP - β · KL(π_θ || π_SFT) connected back to L12’s variational framework.
Closed-form optimal policy π*(y|x) ∝ π_SFT(y|x) · exp(R_φ(x,y)/β); soft Bellman posterior at sequence level.
Worked example: π_SFT = (0.6, 0.4), R = (1, 0), β = 0.5 gives π* = (0.917, 0.083). Both limits verified (β → 0: reward maximizer; β → ∞: SFT unchanged).
Reward hacking section: dominant failure mode; KL penalty as structural defense; operational symptoms (repetition, sycophancy, length hacking, refusal hacking, format gaming).
Variants: Constitutional AI (RLAIF), DPO (variational shortcut), GRPO (group-relative), IPO (theoretical refinement of DPO).
Operational instruments section: each instrument named with its criterion (reward-model test accuracy > 65%; measured KL 5-50 nats; win-rate > 50%; HarmBench; sycophancy benchmarks; capability evals). Empirical/value-question split flagged explicitly.
Common pitfalls (setting β without measuring KL; trusting RM scores as ground truth; skipping SFT; comparing rewards across RM normalizations; treating DPO as “RLHF without RM”; using rater agreement as a proxy for truth).
“Why this matters” anchors RLHF as the technique that bridged 2019 GPT-2-era pretrained models to 2022-era instruction-tuned assistants. What RLHF does/does not solve.
“What you should remember” closes the lesson.
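
The worked example and both β limits in the outline above can be checked with a short script. This is a minimal sketch rather than the lesson's own code: the function names `optimal_policy` and `kl_nats` are hypothetical, and the finite β values 1e-3 and 1e3 stand in for the limits β → 0 and β → ∞.

```python
import numpy as np

def optimal_policy(pi_sft, rewards, beta):
    """Closed form pi*(y|x) ∝ pi_SFT(y|x) * exp(R(x,y)/beta) over a finite response set."""
    pi_sft = np.asarray(pi_sft, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # Work in log space and subtract the max so that small beta does not overflow exp().
    logits = np.log(pi_sft) + rewards / beta
    logits -= logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()

def kl_nats(p, q):
    """KL(p || q) in nats, skipping entries where p is exactly zero."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

pi_sft, rewards = [0.6, 0.4], [1.0, 0.0]

pi_star = optimal_policy(pi_sft, rewards, beta=0.5)
print(np.round(pi_star, 3))                                      # [0.917 0.083]
print(np.round(optimal_policy(pi_sft, rewards, beta=1e-3), 3))   # [1. 0.]  (reward maximizer)
print(np.round(optimal_policy(pi_sft, rewards, beta=1e3), 3))    # [0.6 0.4] (SFT unchanged)

# KL from the SFT policy at beta = 0.5, the quantity the "measured KL" instrument tracks.
print(kl_nats(pi_star, pi_sft))
```

The log-space form is a design choice for numerical safety: multiplying by exp(R/β) directly overflows once R/β exceeds a few hundred, which is exactly the regime the β → 0 check probes.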

Practice (practice.mdx)

Two exercises:

3-response optimal RLHF policy. π_SFT = (0.5, 0.3, 0.2), R = (1.0, 0.5, -1.0), β = 0.5. Compute un-normalized weights (3.6945, 0.8155, 0.0271), partition function Z = 4.5371, optimal policy π* = (0.8143, 0.1797, 0.0060). Part D verifies β → 0 (gives (1, 0, 0)) and β → ∞ (gives SFT (0.5, 0.3, 0.2)). Part E connects back to L12’s soft Bellman framework.
Diagnostic finding → operational instrument mapping. Five findings: (a) over-refusal of benign prompts (catch with win-rate + capability evals + HarmBench); (b) excessive length (catch with length stats + length-controlled win-rate); (c) sycophancy on factual questions (catch with sycophancy benchmarks + capability evals); (d) phrase-level reward gaming (catch with RM adversarial analysis + win-rate); (e) KL = 2 nats + win-rate 55% (this is a working run, no fix needed).
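
The numerics quoted for the first exercise can be reproduced step by step with the standard library alone. This is an illustrative sketch: the helper name `tilt` and the finite β values 0.01 and 100 used to approximate the two limits are assumptions, not part of the practice file.

```python
import math

def tilt(pi_sft, rewards, beta):
    """Return (un-normalized weights, partition function Z, optimal policy)."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_sft, rewards)]
    z = sum(weights)
    return weights, z, [w / z for w in weights]

pi_sft = [0.5, 0.3, 0.2]
rewards = [1.0, 0.5, -1.0]

weights, z, pi_star = tilt(pi_sft, rewards, beta=0.5)
print([round(w, 4) for w in weights])   # [3.6945, 0.8155, 0.0271]
print(round(z, 4))                      # 4.5371
print([round(p, 4) for p in pi_star])   # [0.8143, 0.1797, 0.006]

# Finite stand-ins for the two limits: small beta concentrates on the
# highest-reward response, large beta returns (approximately) pi_SFT.
_, _, near_zero = tilt(pi_sft, rewards, beta=0.01)
_, _, near_inf = tilt(pi_sft, rewards, beta=100.0)
print([round(p, 2) for p in near_zero])  # [1.0, 0.0, 0.0]
print([round(p, 2) for p in near_inf])   # [0.5, 0.3, 0.2]
```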

5 flashcards: three-stage pipeline; derive optimal RLHF policy from L11/L12; reward hacking + structural defense; relationship between InstructGPT/DPO/GRPO/Constitutional AI; operational instruments.

Cheatsheet (cheatsheet.mdx)

One-page reference. Three-stage table. Full objective. Closed-form optimal policy. Worked example reproduced. β decision table mapping β to behavior. Variant table. Operational instruments table. Bradley-Terry preference model statement. Common pitfalls.
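
For the Bradley-Terry statement and the reward-model accuracy instrument, a small numerical sketch may help. The function names and the reward scores below are hypothetical illustrations, not values from the lesson; "accuracy" here means the fraction of held-out pairs where the reward model scores the preferred response higher.

```python
import numpy as np

def bt_preference_prob(r_a, r_b):
    """Bradley-Terry: P(a preferred over b) = sigmoid(r_a - r_b)."""
    diff = np.asarray(r_a, dtype=float) - np.asarray(r_b, dtype=float)
    return 1.0 / (1.0 + np.exp(-diff))

def bt_loss(r_chosen, r_rejected):
    """Mean negative log-likelihood of the observed preferences."""
    diff = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log sigmoid(d) = log(1 + exp(-d)), computed stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, -diff)))

def pairwise_accuracy(r_chosen, r_rejected):
    """Fraction of pairs where the chosen response gets the higher reward."""
    return float(np.mean(np.asarray(r_chosen) > np.asarray(r_rejected)))

# Hypothetical reward-model scores on four held-out preference pairs.
r_chosen = [1.2, 0.3, -0.5, 2.0]
r_rejected = [0.2, 0.8, -1.0, 0.5]

print(round(float(bt_preference_prob(1.0, 0.0)), 4))  # 0.7311
print(pairwise_accuracy(r_chosen, r_rejected))        # 0.75
print(bt_loss(r_chosen, r_rejected))
```

Only reward differences enter the model, so adding a constant to every score leaves both the loss and the accuracy unchanged; this is one reason raw reward values are not comparable across differently normalized reward models.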

Summary (summary.mdx)

5-minute distillation. One-paragraph framing. Five things to remember. Why-this-matters paragraph that closes with the Phase 3 framing. Worked-check memory anchor reproducing the practice numerics. Where this fits (next: L14, the offline RL problem).

References (references.mdx)

Primary papers: Ouyang (2022 InstructGPT), Stiennon (2020), Christiano (2017), Bai (2022 Anthropic RLHF), Bai (2022 Constitutional AI), Rafailov (2023 DPO), DeepSeek (2024 GRPO), Azar (2024 IPO), Ethayarajh (2024 KTO). Operational instruments: Perez (2023), Sharma (2023), Mazeika (2024 HarmBench), Hendrycks (2021 MMLU), Cobbe (2021 GSM8K), Chen (2021 HumanEval). Bradley-Terry (1952). Open-source: Huang et al. (37 implementation details), TRL, trlx. Surveys: Casper et al. (2023), Kaufmann et al. (2024).

Editorial discipline

Stage 2 sweep: em/en dashes, U+2212 vs hyphen, caps emphasis, dead /topics/ links.
Acronyms allowed in caps: RLHF, SFT, RM, PPO, KL, ELBO, RL, DPO, GRPO, IPO, KTO, RLAIF, MMLU, GSM8K, HumanEval, NLP, LLM, MDP, MaxEnt, MIT, ICML, ICLR, NeurIPS, AISTATS, ACL, MuJoCo, IRL.
§6 watch-zone discipline applied throughout per advisor’s L12 boundary-checkpoint pointer:
- 4-category specificity per failure mode (reward hacking / sycophancy / refusal hacking / length hacking each mapped to specific instrument).
- Evaluation-methods-naming (training stability, RM accuracy, downstream eval, capability retention).
- Operational scope test (every claim paired with instrument that would settle it).
- Domain-specific instrument suite (HarmBench, sycophancy benchmarks, MMLU/GSM8K/HumanEval).
The empirical/value-question split is preserved: operational instruments distinguished from broader value-alignment questions. Lesson states explicitly that instruments inform but do not settle the deeper questions.
Vendor naming: paper authors + organizations as paper authors (Anthropic / OpenAI / DeepSeek named as authors of cited papers, not as marketed entities). Constitutional AI / RLAIF named as algorithmic approaches per published papers. No marketing framing of any specific vendor product.

Word counts

Lesson 3120
Cheatsheet 730
Practice 2310
Summary 705
Brief 1100
References 745

Total ≈ 8710 words across 6 artifacts. Math-heavy band; the largest lesson in the track due to combined algorithmic + production + variant + instrument-suite coverage.

Notes for promotion

Component placeholders live as MDX comments. A further component is configured for “Custom (RLHF papers)” rather than a single CS285 lecture; Lead may want to wire this with a different component or just leave the URL empty.
Practice imports the real components rather than placeholders.
Numerics: π_SFT = (0.6, 0.4), R = (1, 0), β = 0.5 gives π* = (0.9172, 0.0828) to 4 decimals, reported as (0.917, 0.083) in the lesson. Practice’s 3-response numerics also verify to 4 decimals. Both β limits behave as expected (β → 0: reward maximizer; β → ∞: π_SFT unchanged).
Phase 3 opener. The Phase 2 → Phase 3 boundary checkpoint approved this lesson’s draft start; advisor’s L12 boundary-checkpoint message provided the green light. L14 onward continues Phase 3 (offline RL problem L14, offline RL algorithms L15, exploration L16, multi-task and meta-RL L17, open problems L18 which closes the track).
§6-watch-zone discipline followed throughout: empirical instruments named, value/policy questions flagged separately. The lesson should be reviewable as a model for the §6-touch-content discipline going forward.