Brief: RLHF (opens Phase 3)
Capability gained
Walk the three-stage InstructGPT pipeline (SFT → RM → PPO + KL). Write the Bradley-Terry reward objective. Derive the closed-form optimal RLHF policy π*(y|x) ∝ π_SFT(y|x) · exp(R_φ(x,y)/β) from the L11/L12 variational framework. Compute the optimal policy by hand on a 2-response example in the lesson body (and a 3-response variant in practice) and verify both β limits. Diagnose five different RLHF failure modes by mapping symptoms to operational instruments (reward-model test accuracy, measured KL-from-base, win-rate against base, harm-bench scores, sycophancy benchmarks, capability evals).
Why this lesson exists
L13 is the Phase 3 opener and the synthesis of everything Phase 2 built. L8 introduced PPO; L11-L12 introduced the variational framework; L13 wires them together for the language-model alignment problem. Readers leave seeing RLHF not as a separate technique but as the natural application of the algorithmic and theoretical machinery they have already built.
L13 is also the §6-watch-zone test for the track. RLHF touches alignment debates, reward hacking, sycophancy, and training-data policy. The lesson applies the discipline the advisor named:
- 4-category specificity per RLHF-specific surface: each contested topic (reward hacking, sycophancy, refusal hacking, length hacking) is mapped to a specific empirical instrument rather than discussed in the abstract.
- Evaluation-methods-naming: training stability (KL-from-base), reward-model accuracy (test set), downstream eval (win-rate against base + HarmBench + sycophancy benchmarks + capability evals like MMLU/GSM8K/HumanEval) are named as the operational measurements.
- Operational scope test: every claim is paired with the instrument that would settle it. “Is this RLHF run working?” → reward-model test accuracy + measured KL-from-base + win-rate against base. “Is this model aligned?” → a broader and harder question; the instruments inform it but do not settle it.
- Domain-specific instrument suite: HarmBench (Mazeika 2024), Anthropic sycophancy benchmarks (Perez 2023; Sharma 2023), MMLU/GSM8K/HumanEval as capability-retention proxies; the Bradley-Terry (1952) parameterization for the reward model.
The empirical/value-question distinction is preserved: operational instruments are presented as the engineering surface; deeper alignment questions are flagged as broader and remaining open.
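The Bradley-Terry parameterization and the reward-model test-accuracy instrument named above can be sketched together. This is a minimal illustration, not the lesson's implementation: the function names and the example scores are hypothetical, and the rewards are assumed to be scalar reward-model outputs for the rater-preferred ("chosen") and dispreferred ("rejected") response of each pair.

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Mean negative log-likelihood of the rater choices under Bradley-Terry,
    where P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    total = 0.0
    for rc, rr in zip(r_chosen, r_rejected):
        margin = rc - rr
        # -log(sigmoid(m)) = log(1 + exp(-m)), written in an overflow-safe form.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(r_chosen)

def pairwise_accuracy(r_chosen, r_rejected):
    """Fraction of held-out pairs where the reward model ranks the chosen response higher."""
    correct = sum(rc > rr for rc, rr in zip(r_chosen, r_rejected))
    return correct / len(r_chosen)

# Hypothetical reward-model scores on four held-out preference pairs.
r_chosen = [1.2, 0.3, -0.5, 2.0]
r_rejected = [0.4, 0.9, -1.0, 0.1]

print(pairwise_accuracy(r_chosen, r_rejected))  # 0.75 (three of four pairs ranked correctly)
print(bradley_terry_loss(r_chosen, r_rejected))
```

The loss is what reward-model training would minimize; the accuracy is the quantity reported as "reward-model test accuracy" when the pairs come from a held-out test set.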
Source
Primary papers: Ouyang et al. (2022 InstructGPT); Bai et al. (2022 Anthropic RLHF); Bai et al. (2022 Constitutional AI); Rafailov et al. (2023 DPO); DeepSeek-AI (2024 GRPO); Azar et al. (2024 IPO); Ethayarajh et al. (2024 KTO). Operational instruments: Perez et al. (2023), Sharma et al. (2023), Mazeika et al. (2024 HarmBench). Bradley-Terry foundation (1952). Surveys: Casper et al. (2023), Kaufmann et al. (2024).
This is the first Track 18 lesson without a direct CS285 lecture source; the variational framework is from CS285 L18-L19 (Lessons 11-12) and the practical pipeline is from the primary RLHF papers.
Phase advance
Phase 3 lesson 1 (phase_order: 1). OPENS PHASE 3 (rl-frontiers). Phase 2 closed at L12; the Phase 2 → Phase 3 boundary checkpoint was approved between L12 and L13. L13 sets up the Phase 3 framing: frontier applications wire together pieces from Phases 1-2. The Phase 3 sequence after L13 is L14 offline RL problem, L15 offline RL algorithms, L16 exploration, L17 multi-task and meta-RL, L18 open problems (closes Phase 3 + Track 18).
Lesson body (lesson.mdx)
- Recap of where we are in the Phase 1/2/3 narrative arc; L13 is the synthesis lesson.
- The RLHF problem: align pretrained model to preferences without losing pretraining benefits.
- InstructGPT pipeline three stages: SFT, reward modeling (Bradley-Terry), PPO + KL.
- The full RLHF objective `L_RLHF = L^CLIP - β · KL(π_θ || π_SFT)` connected back to L12’s variational framework.
- Closed-form optimal policy `π*(y|x) ∝ π_SFT(y|x) · exp(R_φ(x,y)/β)`; soft Bellman posterior at sequence level.
- Worked example: `π_SFT = (0.6, 0.4)`, `R = (1, 0)`, `β = 0.5` gives `π* = (0.917, 0.083)`. Both limits verified (`β → 0`: reward maximizer; `β → ∞`: SFT unchanged).
- Reward hacking section: dominant failure mode; KL penalty as structural defense; operational symptoms (repetition, sycophancy, length hacking, refusal hacking, format gaming).
- Variants: Constitutional AI (RLAIF), DPO (variational shortcut), GRPO (group-relative), IPO (theoretical refinement of DPO).
- Operational instruments section: named with criterion (reward-model test accuracy > 65%; measured KL 5-50 nats; win-rate > 50%; HarmBench; sycophancy benchmarks; capability evals). Empirical/value-question split flagged explicitly.
- Common pitfalls (setting β without measuring KL; believing RM; skipping SFT; comparing across RM normalizations; treating DPO as “RLHF without RM”; using rater agreement as truth proxy).
- “Why this matters” anchors RLHF as the technique that bridged 2019 GPT-2-era pretrained models to 2022-era instruction-tuned assistants. What RLHF does/does not solve.
- “What you should remember” closes the lesson.
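The closed-form policy listed in the outline above can be recovered in a few lines. The sketch below is an idealization rather than the lesson's exact derivation: it assumes a fixed prompt x with a summable response set, and it replaces the PPO surrogate L^CLIP with the expected reward that the surrogate targets. The normalizer Z(x) is notation introduced here.

```latex
\begin{aligned}
J(\pi) &= \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[R_\phi(x,y)\big]
          - \beta\,\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{SFT}}(\cdot \mid x)\big) \\
       &= -\beta \sum_y \pi(y \mid x)\,
          \log \frac{\pi(y \mid x)}{\pi_{\mathrm{SFT}}(y \mid x)\,\exp\!\big(R_\phi(x,y)/\beta\big)} \\
       &= -\beta\,\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi^*(\cdot \mid x)\big) + \beta \log Z(x), \\[4pt]
\pi^*(y \mid x) &= \frac{\pi_{\mathrm{SFT}}(y \mid x)\,\exp\!\big(R_\phi(x,y)/\beta\big)}{Z(x)},
\qquad
Z(x) = \sum_{y'} \pi_{\mathrm{SFT}}(y' \mid x)\,\exp\!\big(R_\phi(x,y')/\beta\big).
\end{aligned}
```

Because the KL term is non-negative and vanishes only at π = π*, J is maximized at π*. For the worked example in the outline (π_SFT = (0.6, 0.4), R = (1, 0), β = 0.5), the un-normalized weights are 0.6 · e² ≈ 4.4334 and 0.4 · e⁰ = 0.4, so Z ≈ 4.8334 and π* ≈ (0.9172, 0.0828). As β → 0 the exponential concentrates all mass on the highest-reward response; as β → ∞ the factor exp(R_φ/β) tends to 1 and π* tends to π_SFT.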
Practice (practice.mdx)
Two exercises:
- 3-response optimal RLHF policy. `π_SFT = (0.5, 0.3, 0.2)`, `R = (1.0, 0.5, -1.0)`, `β = 0.5`. Compute un-normalized weights `(3.6945, 0.8155, 0.0271)`, partition function `Z = 4.5371`, optimal policy `π* = (0.8143, 0.1797, 0.0060)`. Part D verifies `β → 0` (gives `(1, 0, 0)`) and `β → ∞` (gives SFT `(0.5, 0.3, 0.2)`). Part E connects back to L12’s soft Bellman framework.
- Diagnostic finding → operational instrument mapping. Five findings: (a) over-refusal of benign prompts (catch with win-rate + capability evals + HarmBench); (b) excessive length (catch with length stats + length-controlled win-rate); (c) sycophancy on factual questions (catch with sycophancy benchmarks + capability evals); (d) phrase-level reward gaming (catch with RM adversarial analysis + win-rate); (e) KL = 2 nats + win-rate 55% (this is a working run, no fix needed).
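The numerics of the 3-response exercise can be re-checked with a short script. This is a minimal sketch, assuming a finite response set; the function name `optimal_rlhf_policy` and the particular β values used to approximate the two limits (1e-3 and 1e6) are choices made here, not taken from the practice file.

```python
import math

def optimal_rlhf_policy(pi_sft, rewards, beta):
    """Return pi*(y) proportional to pi_sft(y) * exp(R(y) / beta) over a finite response set."""
    scaled = [r / beta for r in rewards]
    # Subtract the largest exponent before exponentiating so small beta does not overflow;
    # the shift cancels when the weights are normalized.
    shift = max(scaled)
    weights = [p * math.exp(s - shift) for p, s in zip(pi_sft, scaled)]
    z = sum(weights)
    return [w / z for w in weights]

def fmt(values):
    return ", ".join(f"{v:.4f}" for v in values)

pi_sft = [0.5, 0.3, 0.2]
rewards = [1.0, 0.5, -1.0]
beta = 0.5

# Un-normalized weights and partition function, computed directly as in the exercise.
raw_weights = [p * math.exp(r / beta) for p, r in zip(pi_sft, rewards)]
Z = sum(raw_weights)
pi_star = optimal_rlhf_policy(pi_sft, rewards, beta)

print("weights:", fmt(raw_weights))  # 3.6945, 0.8155, 0.0271
print("Z:", f"{Z:.4f}")              # 4.5371
print("pi*:", fmt(pi_star))          # 0.8143, 0.1797, 0.0060

# Approximate the two limits with an extreme small and an extreme large beta.
pi_small_beta = optimal_rlhf_policy(pi_sft, rewards, beta=1e-3)
pi_large_beta = optimal_rlhf_policy(pi_sft, rewards, beta=1e6)
print("beta -> 0:", fmt(pi_small_beta))    # 1.0000, 0.0000, 0.0000
print("beta -> inf:", fmt(pi_large_beta))  # 0.5000, 0.3000, 0.2000
```

Calling `optimal_rlhf_policy([0.6, 0.4], [1.0, 0.0], 0.5)` reproduces the lesson body's 2-response result in the same way.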
5 flashcards: three-stage pipeline; derive optimal RLHF policy from L11/L12; reward hacking + structural defense; relationship between InstructGPT/DPO/GRPO/Constitutional AI; operational instruments.
Cheatsheet (cheatsheet.mdx)
One-page reference. Three-stage table. Full objective. Closed-form optimal policy. Worked example reproduced. β decision table mapping β to behavior. Variant table. Operational instruments table. Bradley-Terry preference model statement. Common pitfalls.
Summary (summary.mdx)
5-minute distillation. One-paragraph framing. Five things to remember. Why-this-matters paragraph closing Phase 3 framing. Worked-check memory anchor reproducing the practice numerics. Where this fits (L14 offline RL: problem next).
References (references.mdx)
Primary papers: Ouyang (2022 InstructGPT), Stiennon (2020), Christiano (2017), Bai (2022 Anthropic RLHF), Bai (2022 Constitutional AI), Rafailov (2023 DPO), DeepSeek (2024 GRPO), Azar (2024 IPO), Ethayarajh (2024 KTO). Operational instruments: Perez (2023), Sharma (2023), Mazeika (2024 HarmBench), Hendrycks (2021 MMLU), Cobbe (2021 GSM8K), Chen (2021 HumanEval). Bradley-Terry (1952). Open-source: Huang et al. (37 implementation details), TRL, trlx. Surveys: Casper et al. (2023), Kaufmann et al. (2024).
Editorial discipline
- Stage 2 sweep: em/en dashes, U+2212 vs hyphen, caps emphasis, dead `/topics/` links.
- Acronyms allowed in caps: RLHF, SFT, RM, PPO, KL, ELBO, RL, DPO, GRPO, IPO, KTO, RLAIF, MMLU, GSM8K, HumanEval, NLP, LLM, MDP, MaxEnt, MIT, ICML, ICLR, NeurIPS, AISTATS, ACL, MuJoCo, IRL.
- §6 watch-zone discipline applied throughout per advisor’s L12 boundary-checkpoint pointer:
  - 4-category specificity per failure mode (reward hacking / sycophancy / refusal hacking / length hacking each mapped to specific instrument).
  - Evaluation-methods-naming (training stability, RM accuracy, downstream eval, capability retention).
  - Operational scope test (every claim paired with instrument that would settle it).
  - Domain-specific instrument suite (HarmBench, sycophancy benchmarks, MMLU/GSM8K/HumanEval).
- The empirical/value-question split is preserved: operational instruments distinguished from broader value-alignment questions. Lesson states explicitly that instruments inform but do not settle the deeper questions.
- Vendor naming: paper authors + organizations as paper authors (Anthropic / OpenAI / DeepSeek named as authors of cited papers, not as marketed entities). Constitutional AI / RLAIF named as algorithmic approaches per published papers. No marketing framing of any specific vendor product.
Word counts
- Lesson 3120
- Cheatsheet 730
- Practice 2310
- Summary 705
- Brief 1100
- References 745
Total ≈ 8710 words across 6 artifacts. Math-heavy band; the largest lesson in the track due to combined algorithmic + production + variant + instrument-suite coverage.
Notes for promotion
- Component placeholders (�J0�, �J1�) live as MDX comments. The �J2� is configured for “Custom (RLHF papers)” rather than a single CS285 lecture; Lead may want to wire this with a different component or just leave the URL empty.
- Practice imports real �J0� + �J1� components.
- Numerics: `π_SFT = (0.6, 0.4)`, `R = (1, 0)`, `β = 0.5` gives `π* = (0.917, 0.083)`; verified to 4 decimals. Practice’s 3-response numerics also verify to 4 decimals. Limits work out as expected.
- Phase 3 opener. The Phase 2 → Phase 3 boundary checkpoint approved this lesson’s draft start; advisor’s L12 boundary-checkpoint message provided the green light. L14 onward continues Phase 3 (offline RL problem L14, offline RL algorithms L15, exploration L16, multi-task and meta-RL L17, open problems L18 which closes the track).
- §6-watch-zone discipline followed throughout: empirical instruments named, value/policy questions flagged separately. The lesson should be reviewable as a model for the §6-touch-content discipline going forward.