Brief: RLHF (opens Phase 3)

Walk the three-stage InstructGPT pipeline (SFT → RM → PPO + KL). Write the Bradley-Terry reward objective. Derive the closed-form optimal RLHF policy π*(y|x) ∝ π_SFT(y|x) · exp(R_φ(x,y)/β) from the L11/L12 variational framework. Compute the optimal policy by hand on a 2-response example in the lesson body (and a 3-response variant in practice) and verify both β limits. Diagnose five RLHF findings (four failure modes and one healthy run) by mapping symptoms to operational instruments (reward-model test accuracy, measured KL-from-base, win-rate against base, harm-bench scores, sycophancy benchmarks, capability evals).
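
The two formulas named above can be sketched as follows. This is a compact outline under stated assumptions, not the lesson's own derivation: the symbols σ (logistic function), y_w and y_l (preferred and rejected responses), D (the preference dataset), and Z(x) (the partition function) are notation introduced here, and the policy is treated at the sequence level, as a distribution over whole responses y for a fixed prompt x.

```latex
% Bradley-Terry reward-model objective (y_w preferred over y_l; notation assumed)
P(y_w \succ y_l \mid x) = \sigma\big(R_\phi(x, y_w) - R_\phi(x, y_l)\big), \qquad
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \big[\log \sigma\big(R_\phi(x, y_w) - R_\phi(x, y_l)\big)\big]

% KL-regularized objective for a fixed prompt x
J(\pi) = \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[R_\phi(x, y)\big]
  - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{SFT}}(\cdot \mid x)\big)

% Candidate optimum and its normalizer
\pi^*(y \mid x) = \frac{\pi_{\mathrm{SFT}}(y \mid x)\, \exp\big(R_\phi(x, y)/\beta\big)}{Z(x)}, \qquad
Z(x) = \sum_{y} \pi_{\mathrm{SFT}}(y \mid x)\, \exp\big(R_\phi(x, y)/\beta\big)

% Substituting \log \pi_{\mathrm{SFT}} = \log \pi^* - R_\phi/\beta + \log Z into J gives
J(\pi) = -\beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi^*(\cdot \mid x)\big) + \beta \log Z(x)
```

Because the KL term is non-negative and vanishes only at π = π*, the objective is maximized at π*, with optimal value β log Z(x).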

L13 is the Phase 3 opener and the synthesis of everything Phase 2 built. L8 introduced PPO; L11-L12 introduced the variational framework; L13 wires them together for the language-model alignment problem. Readers leave seeing RLHF not as a separate technique but as the natural application of the algorithmic and theoretical machinery they have already built.

L13 is also the §6-watch-zone test for the track. RLHF brushes against alignment debates, reward hacking, sycophancy, and training-data policy. The lesson applies the discipline the advisor named:

  1. 4-category specificity per RLHF-specific surface: each contested topic (reward hacking, sycophancy, refusal hacking, length hacking) is mapped to a specific empirical instrument rather than discussed in the abstract.
  2. Evaluation-methods-naming: training stability (KL-from-base), reward-model accuracy (test set), downstream eval (win-rate against base + HarmBench + sycophancy benchmarks + capability evals like MMLU/GSM8K/HumanEval) are named as the operational measurements.
  3. Operational scope test: every claim is paired with the instrument that would settle it. “Is this RLHF run working?” → reward-model test accuracy + measured KL-from-base + win-rate against base. “Is this model aligned?” → a broader and harder question; the instruments inform it but do not settle it.
  4. Domain-specific instrument suite: HarmBench (Mazeika 2024), Anthropic sycophancy benchmarks (Perez 2023; Sharma 2023), MMLU/GSM8K/HumanEval as capability-retention proxies; the Bradley-Terry (1952) parameterization for the reward model.

The empirical/value-question distinction is preserved: operational instruments are presented as the engineering surface; deeper alignment questions are flagged as broader and remaining open.

Primary papers: Ouyang et al. (2022 InstructGPT); Bai et al. (2022 Anthropic RLHF); Bai et al. (2022 Constitutional AI); Rafailov et al. (2023 DPO); DeepSeek-AI (2024 GRPO); Azar et al. (2024 IPO); Ethayarajh et al. (2024 KTO). Operational instruments: Perez et al. (2023), Sharma et al. (2023), Mazeika et al. (2024 HarmBench). Bradley-Terry foundation (1952). Surveys: Casper et al. (2023), Kaufmann et al. (2024).

This is the first Track 18 lesson without a direct CS285 lecture source; the variational framework is from CS285 L18-L19 (Lessons 11-12) and the practical pipeline is from the primary RLHF papers.

Phase 3 lesson 1 (phase_order: 1). Opens Phase 3 (rl-frontiers). Phase 2 closed at L12; the Phase 2 → Phase 3 boundary checkpoint was approved between L12 and L13. L13 sets up the Phase 3 framing: frontier applications wire together pieces from Phases 1-2. The Phase 3 sequence after L13 is L14 offline RL problem, L15 offline RL algorithms, L16 exploration, L17 multi-task and meta-RL, L18 open problems (closes Phase 3 + Track 18).

  • Recap of where we are in the Phase 1/2/3 narrative arc; L13 is the synthesis lesson.
  • The RLHF problem: align a pretrained model to human preferences without losing the benefits of pretraining.
  • InstructGPT pipeline three stages: SFT, reward modeling (Bradley-Terry), PPO + KL.
  • The full RLHF objective L_RLHF = L^CLIP - β · KL(π_θ || π_SFT) connected back to L12’s variational framework.
  • Closed-form optimal policy π*(y|x) ∝ π_SFT(y|x) · exp(R_φ(x,y)/β); soft Bellman posterior at sequence level.
  • Worked example: π_SFT = (0.6, 0.4), R = (1, 0), β = 0.5 gives π* = (0.917, 0.083). Both limits verified (β → 0: reward maximizer; β → ∞: SFT unchanged).
  • Reward hacking section: dominant failure mode; KL penalty as structural defense; operational symptoms (repetition, sycophancy, length hacking, refusal hacking, format gaming).
  • Variants: Constitutional AI (RLAIF), DPO (variational shortcut), GRPO (group-relative), IPO (theoretical refinement of DPO).
  • Operational instruments section: named with criterion (reward-model test accuracy > 65%; measured KL 5-50 nats; win-rate > 50%; HarmBench; sycophancy benchmarks; capability evals). Empirical/value-question split flagged explicitly.
  • Common pitfalls (setting β without measuring KL; trusting the RM score uncritically; skipping SFT; comparing across RM normalizations; treating DPO as “RLHF without RM”; using rater agreement as truth proxy).
  • “Why this matters” anchors RLHF as the technique that bridged 2019 GPT-2-era pretrained models to 2022-era instruction-tuned assistants. What RLHF does/does not solve.
  • “What you should remember” closes the lesson.
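
The worked example in the outline above can be checked with a few lines of Python. This is a minimal sketch, assuming a finite response set with strictly positive π_SFT probabilities; the function name `optimal_rlhf_policy` and the specific β values used to probe the two limits (0.01 and 1000) are illustrative choices, not taken from the lesson.

```python
import math

def optimal_rlhf_policy(pi_sft, rewards, beta):
    """Return pi*(y|x) ∝ pi_SFT(y|x) * exp(R(x,y)/beta) over a finite response set.

    Assumes every entry of pi_sft is strictly positive.
    """
    # Work in log space and subtract the max so small beta does not overflow.
    logits = [math.log(p) + r / beta for p, r in zip(pi_sft, rewards)]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

pi_sft = [0.6, 0.4]
rewards = [1.0, 0.0]

pi_star = optimal_rlhf_policy(pi_sft, rewards, beta=0.5)
print(", ".join(f"{p:.3f}" for p in pi_star))  # 0.917, 0.083

# Small beta approaches the reward maximizer; large beta approaches pi_SFT.
near_greedy = optimal_rlhf_policy(pi_sft, rewards, beta=0.01)
near_sft = optimal_rlhf_policy(pi_sft, rewards, beta=1000.0)
print(near_greedy, near_sft)
```

By hand, the un-normalized weights are 0.6 · e² ≈ 4.4334 and 0.4 · e⁰ = 0.4, so Z ≈ 4.8334 and π* ≈ (0.917, 0.083), matching the code.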

Two exercises:

  1. 3-response optimal RLHF policy. π_SFT = (0.5, 0.3, 0.2), R = (1.0, 0.5, -1.0), β = 0.5. Compute un-normalized weights (3.6945, 0.8155, 0.0271), partition function Z = 4.5371, optimal policy π* = (0.8143, 0.1797, 0.0060). Part D verifies β → 0 (gives (1, 0, 0)) and β → ∞ (gives SFT (0.5, 0.3, 0.2)). Part E connects back to L12’s soft Bellman framework.

  2. Diagnostic finding → operational instrument mapping. Five findings: (a) over-refusal of benign prompts (catch with win-rate + capability evals + HarmBench); (b) excessive length (catch with length stats + length-controlled win-rate); (c) sycophancy on factual questions (catch with sycophancy benchmarks + capability evals); (d) phrase-level reward gaming (catch with RM adversarial analysis + win-rate); (e) KL = 2 nats + win-rate 55% (this is a working run, no fix needed).
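
The numerics of exercise 1 can be reproduced, and the closed form sanity-checked against the KL-regularized objective, with the sketch below. It assumes the sequence-level objective E_π[R] − β · KL(π || π_SFT). The helper names (`objective`, `random_policy`), the seed, and the choice of 1000 random comparison policies are hypothetical and only illustrate the check.

```python
import math
import random

pi_sft = [0.5, 0.3, 0.2]
rewards = [1.0, 0.5, -1.0]
beta = 0.5

# Closed form: un-normalized weights, partition function, optimal policy.
weights = [p * math.exp(r / beta) for p, r in zip(pi_sft, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]
print(", ".join(f"{w:.4f}" for w in weights))  # 3.6945, 0.8155, 0.0271
print(f"{z:.4f}")                              # 4.5371
print(", ".join(f"{p:.4f}" for p in pi_star))  # 0.8143, 0.1797, 0.0060

def objective(pi):
    """E_pi[R] - beta * KL(pi || pi_SFT), treating 0 * log 0 as 0."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_sft) if p > 0)
    return expected_reward - beta * kl

best = objective(pi_star)  # should equal beta * log Z

rng = random.Random(0)

def random_policy():
    """Draw a random point on the probability simplex (normalized exponentials)."""
    draws = [rng.expovariate(1.0) for _ in pi_sft]
    total = sum(draws)
    return [d / total for d in draws]

# No sampled policy should beat the closed-form optimum.
gaps = [best - objective(random_policy()) for _ in range(1000)]
print(min(gaps) >= -1e-12)  # True
```

The gap for any policy π equals β · KL(π || π*), which is why it is never negative. The objective value at the optimum, β log Z, is the sequence-level analogue of the soft value that Part E connects back to L12.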

5 flashcards: three-stage pipeline; derive optimal RLHF policy from L11/L12; reward hacking + structural defense; relationship between InstructGPT/DPO/GRPO/Constitutional AI; operational instruments.

One-page reference. Three-stage table. Full objective. Closed-form optimal policy. Worked example reproduced. β decision table mapping β to behavior. Variant table. Operational instruments table. Bradley-Terry preference model statement. Common pitfalls.
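
The Bradley-Terry statement and the reward-model accuracy instrument listed on the reference sheet can be illustrated numerically. This is a hedged sketch: the function names and the three score pairs are hypothetical, and it assumes the reward model has already produced a scalar score for the preferred ("chosen") and the other ("rejected") response in each comparison.

```python
import math

def neg_log_sigmoid(d):
    """Numerically stable -log(sigmoid(d))."""
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))

def bt_loss(pairs):
    """Mean Bradley-Terry negative log-likelihood over (chosen, rejected) scores."""
    return sum(neg_log_sigmoid(c - r) for c, r in pairs) / len(pairs)

def pairwise_accuracy(pairs):
    """Fraction of pairs where the chosen response scores strictly higher."""
    return sum(c > r for c, r in pairs) / len(pairs)

# Hypothetical held-out reward-model scores: (chosen, rejected).
pairs = [(1.0, 0.0), (0.2, 0.5), (2.0, -1.0)]

print(f"{neg_log_sigmoid(1.0):.4f}")  # 0.3133
print(f"{bt_loss(pairs):.4f}")
print(f"{pairwise_accuracy(pairs):.3f}")  # 0.667
```

Pairwise accuracy computed this way on a held-out preference set is one plausible reading of "reward-model test accuracy"; the lesson's exact definition may differ.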

5-minute distillation. One-paragraph framing. Five things to remember. Why-this-matters paragraph closing with the Phase 3 framing. Worked-check memory anchor reproducing the practice numerics. Where this fits (next: L14, the offline RL problem).

Primary papers: Ouyang (2022 InstructGPT), Stiennon (2020), Christiano (2017), Bai (2022 Anthropic RLHF), Bai (2022 Constitutional AI), Rafailov (2023 DPO), DeepSeek (2024 GRPO), Azar (2024 IPO), Ethayarajh (2024 KTO). Operational instruments: Perez (2023), Sharma (2023), Mazeika (2024 HarmBench), Hendrycks (2021 MMLU), Cobbe (2021 GSM8K), Chen (2021 HumanEval). Bradley-Terry (1952). Open-source: Huang et al. (37 implementation details), TRL, trlx. Surveys: Casper et al. (2023), Kaufmann et al. (2024).

  • Stage 2 sweep: em/en dashes, U+2212 vs hyphen, caps emphasis, dead /topics/ links.
  • Acronyms allowed in caps: RLHF, SFT, RM, PPO, KL, ELBO, RL, DPO, GRPO, IPO, KTO, RLAIF, MMLU, GSM8K, HumanEval, NLP, LLM, MDP, MaxEnt, MIT, ICML, ICLR, NeurIPS, AISTATS, ACL, MuJoCo, IRL.
  • §6 watch-zone discipline applied throughout per advisor’s L12 boundary-checkpoint pointer:
    • 4-category specificity per failure mode (reward hacking / sycophancy / refusal hacking / length hacking each mapped to specific instrument).
    • Evaluation-methods-naming (training stability, RM accuracy, downstream eval, capability retention).
    • Operational scope test (every claim paired with instrument that would settle it).
    • Domain-specific instrument suite (HarmBench, sycophancy benchmarks, MMLU/GSM8K/HumanEval).
  • The empirical/value-question split is preserved: operational instruments distinguished from broader value-alignment questions. Lesson states explicitly that instruments inform but do not settle the deeper questions.
  • Vendor naming: paper authors + organizations as paper authors (Anthropic / OpenAI / DeepSeek named as authors of cited papers, not as marketed entities). Constitutional AI / RLAIF named as algorithmic approaches per published papers. No marketing framing of any specific vendor product.
  • Lesson: 3120 words
  • Cheatsheet: 730 words
  • Practice: 2310 words
  • Summary: 705 words
  • Brief: 1100 words
  • References: 745 words

Total ≈ 8710 words across 6 artifacts. Math-heavy band; the largest lesson in the track due to combined algorithmic + production + variant + instrument-suite coverage.

  • Component placeholders (�J0�, �J1�) live as MDX comments. The �J2� is configured for “Custom (RLHF papers)” rather than a single CS285 lecture; Lead may want to wire this with a different component or just leave the URL empty.
  • Practice imports real �J0� + �J1� components.
  • Numerics: π_SFT = (0.6, 0.4), R = (1, 0), β = 0.5 gives π* = (0.917, 0.083); verified to 4 decimals. Practice’s 3-response numerics also verify to 4 decimals. Limits work out as expected.
  • Phase 3 opener. The Phase 2 → Phase 3 boundary checkpoint approved this lesson’s draft start; advisor’s L12 boundary-checkpoint message provided the green light. L14 onward continues Phase 3 (offline RL problem L14, offline RL algorithms L15, exploration L16, multi-task and meta-RL L17, open problems L18 which closes the track).
  • §6-watch-zone discipline followed throughout: empirical instruments named, value/policy questions flagged separately. The lesson should be reviewable as a model for the §6-touch-content discipline going forward.