Summary: RLHF (opens Phase 3)
The one paragraph version
RLHF (reinforcement learning from human feedback) is the production application of everything Phase 2 built. The canonical InstructGPT pipeline has three stages: SFT (supervised fine-tuning on demonstration pairs) anchors the model in instruction-following format; reward modeling (Bradley-Terry loss on preference pairs) trains a scalar reward model R_φ that scores (prompt, response) pairs; PPO + KL (the Lesson 8 clipped surrogate plus a KL penalty to the SFT model) optimizes the policy against the reward model while preventing drift from SFT. The full RLHF objective J(π_θ) = E[R_φ(x, y)] - β · KL(π_θ || π_SFT) is exactly the soft Bellman backup from Lessons 11-12 applied at the sequence level with π_SFT as the prior and β as the temperature. The optimal policy is π*(y | x) ∝ π_SFT(y | x) · exp(R_φ(x, y) / β). The dominant failure mode is reward hacking: the reward model is an imperfect proxy and the policy finds adversarial responses that score high on R_φ but read as nonsense to humans; the KL penalty is the structural defense. Variants: Constitutional AI uses AI-generated preferences; DPO is the variational shortcut that skips the RM and PPO; GRPO uses group-relative advantage; IPO is a theoretical generalization. Operational instruments (reward-model test accuracy, measured KL-from-base, win-rate against base, harm-bench scores, sycophancy benchmarks) turn “is RLHF working” into specific empirical measurements that are distinct in kind from broader value-alignment questions.
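To make the reward-modeling stage concrete, here is a minimal sketch of the Bradley-Terry loss on a single preference pair, using only the standard library. The function name and the example scores are illustrative assumptions, not part of any particular RLHF codebase; a real reward model would produce these scalars from a neural network and average the loss over a batch.

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Under Bradley-Terry, P(chosen > rejected) = sigmoid(r_chosen - r_rejected).
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch on the sign for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward-model scores (chosen, rejected) for three pairs; illustrative only.
pairs = [(2.0, 0.0), (0.5, 0.5), (-1.0, 1.0)]
losses = [bradley_terry_loss(c, r) for c, r in pairs]
mean_loss = sum(losses) / len(losses)
print([round(x, 4) for x in losses], round(mean_loss, 4))
```

The loss is small when the reward model already ranks the chosen response well above the rejected one, equals log 2 when the two scores tie, and grows roughly linearly when the ranking is inverted.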
Five things to remember
- InstructGPT pipeline: SFT → reward model → PPO + KL. Three stages, each addressing a specific problem (format anchoring, scalable preference signal, stable policy optimization with anti-drift).
- Full RLHF objective: `L_RLHF = L^CLIP - β · KL(π_θ || π_SFT)`. The variational ELBO from Lessons 11-12 with `π_SFT` as prior. The optimal policy is `π* ∝ π_SFT · exp(R / β)`.
- Worked example: `π_SFT = (0.6, 0.4)`, `R = (1, 0)`, `β = 0.5` → `π* = (0.917, 0.083)`. Limits: `β → 0` gives `(1, 0)` (reward maximizer); `β → ∞` gives `(0.6, 0.4)` (SFT unchanged).
- Reward hacking is the dominant failure mode. KL penalty is the structural defense. Operational sweet spot: `β = 0.01` to `0.1`; always measure achieved KL alongside win-rate against base.
- Variants: Constitutional AI (RLAIF), DPO (variational shortcut), GRPO (group-relative), IPO (theoretical DPO refinement). Same variational construction, different design knobs.
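The two-response worked example and its β limits can be reproduced with a few lines of Python. This is a sketch for a small discrete response set; the helper name `tilted_policy` is an assumption introduced here for illustration.

```python
import math

def tilted_policy(prior, rewards, beta):
    """Return pi*(y) proportional to prior(y) * exp(reward(y) / beta)."""
    # Subtract the max reward before exponentiating; the shift cancels after normalizing.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(prior, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

prior = [0.6, 0.4]     # pi_SFT from the worked example
rewards = [1.0, 0.0]   # R from the worked example

# Small beta approaches the reward maximizer; large beta approaches the SFT prior.
for beta in (0.01, 0.5, 100.0):
    probs = tilted_policy(prior, rewards, beta)
    print(beta, [round(p, 3) for p in probs])
```

At β = 0.5 this gives (0.917, 0.083), matching the worked example. The finite values 0.01 and 100 stand in for the β → 0 and β → ∞ limits, so they land near (1, 0) and (0.6, 0.4) rather than exactly on them.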
Why this matters
RLHF is the technique that turned pretrained language models into useful instruction-following assistants. Every commercial instruction-tuned model from 2022 forward used RLHF or a close variant. The technique is now mature: it is reproducible at the ~1B parameter scale with a few thousand dollars of compute, available in open-source implementations, and has well-documented failure modes and well-characterized hyperparameter regimes.
What RLHF operationally verifies: reward-model accuracy, KL-from-base, win-rate against base, harm-bench refusal rates, sycophancy benchmark scores, capability retention. These are specific empirical measurements that distinguish “this RLHF run is working” from “this RLHF run is broken.”
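Three of these instruments reduce to simple averages once the raw numbers are logged. The sketch below assumes a hypothetical logging format (lists of reward-model scores, per-sequence log-probabilities, and pairwise judgments) and made-up values. The function names and the convention of counting ties as half a win are assumptions, not a standard.

```python
def reward_model_accuracy(pairs):
    """Fraction of held-out pairs where the RM scores the human-preferred response higher."""
    return sum(1 for chosen, rejected in pairs if chosen > rejected) / len(pairs)

def estimated_kl(logp_policy, logp_ref):
    """Monte Carlo estimate of KL(policy || reference), in nats per sequence.

    Assumes every sequence was sampled from the tuned policy and scored by both models.
    """
    diffs = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    return sum(diffs) / len(diffs)

def win_rate(judgments):
    """Share of head-to-head comparisons the tuned policy wins; ties count as half (assumption)."""
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(score[j] for j in judgments) / len(judgments)

# Hypothetical logged values, for illustration only.
rm_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0), (0.7, 0.2)]   # (score_chosen, score_rejected)
logp_policy = [-12.0, -15.5, -9.0]    # log pi_theta(y | x) for sampled responses
logp_ref = [-14.0, -16.0, -13.5]      # log pi_SFT(y | x) for the same responses
judgments = ["win", "win", "loss", "tie"]

rm_acc = reward_model_accuracy(rm_pairs)
kl_from_base = estimated_kl(logp_policy, logp_ref)
wr = win_rate(judgments)
print(rm_acc, round(kl_from_base, 3), wr)
```

Reporting the KL estimate next to the win rate makes it visible when a win-rate gain was bought with a large drift from the base model.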
What RLHF does not settle: the deeper value-alignment questions (what preferences should the model be aligned to, whose preferences, how to handle disagreement). The operational instruments inform those questions but they are not the same as those questions. The split is what makes RLHF a productive engineering practice: the operational side is solvable; the value side is broader and ongoing.
The Phase 1/2/3 narrative arc opens its frontier register here: Phase 1 named the algorithms, Phase 2 derived them as resolutions to structural problems, Phase 3 takes the toolkit to deployment realities and to the field’s open problems. RLHF is the opening application. Lessons 14 through 18 continue with the offline-RL pair (L14 problem definition, L15 algorithms BCQ / CQL / IQL), exploration (L16), multi-task and meta-RL (L17), and the field’s open problems (L18, closes the track).
Worked check (memory anchor)
Closed-form RLHF optimal policy:
π*(y | x) = (1 / Z(x)) · π_SFT(y | x) · exp(R_φ(x, y) / β)
Three-response practice numerics: π_SFT = (0.5, 0.3, 0.2), R = (1.0, 0.5, -1.0), β = 0.5.
Numerators: (3.6945, 0.8155, 0.0271)
Z = 4.5371
π* = (0.8143, 0.1797, 0.0060)
The SFT prior put 50% on the best response; RLHF moved this to 81% while shifting almost all mass off the worst response (20% → 0.6%). The KL penalty prevented collapse to a deterministic policy; the reward-model signal moved mass toward the higher-reward responses. β → 0 gives (1, 0, 0); β → ∞ gives (0.5, 0.3, 0.2).
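One way to check the three-response numbers, and to see that the closed form maximizes the objective, is to evaluate J(π) = E_π[R] - β · KL(π || π_SFT) directly. The sketch below does this for the closed-form policy and for two alternatives; the function names are assumptions introduced for illustration.

```python
import math

def tilted_policy(prior, rewards, beta):
    """Closed-form optimum: prior(y) * exp(R(y) / beta), normalized by Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(prior, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

def objective(policy, prior, rewards, beta):
    """J(pi) = E_pi[R] - beta * KL(pi || prior) for a discrete response set."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    # Terms with p == 0 contribute nothing to the KL sum.
    kl = sum(p * math.log(p / q) for p, q in zip(policy, prior) if p > 0)
    return expected_reward - beta * kl

prior = [0.5, 0.3, 0.2]
rewards = [1.0, 0.5, -1.0]
beta = 0.5

pi_star, z = tilted_policy(prior, rewards, beta)
j_star = objective(pi_star, prior, rewards, beta)

# Two alternatives: leave the prior unchanged, or put all mass on the best response.
j_prior = objective(prior, prior, rewards, beta)
j_greedy = objective([1.0, 0.0, 0.0], prior, rewards, beta)

print([round(p, 4) for p in pi_star], round(z, 4))
print(round(j_star, 4), round(j_greedy, 4), round(j_prior, 4))
```

The closed-form policy scores higher than both alternatives, and its objective value equals β · log Z, which follows from substituting π* into J.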
Where this fits
- Previous (Lesson 12): Control as inference. Closed Phase 2 with the variational unification of SAC + RLHF + DPO.
- This lesson: RLHF deep-dive. Opens Phase 3. Takes the L8 PPO and the L11/L12 variational framework and applies them to the language-model alignment problem.
- Next (Lesson 14): Offline RL: the problem. The deployment-realistic setting where new data collection is forbidden and naive Q-learning diverges via extrapolation error.
- Later (Lessons 15-18): Offline RL algorithms (BCQ, CQL, IQL), exploration in hard-reward environments, multi-task and meta-RL, and the field’s open problems (closes the track).
The Phase 3 framing
Phase 1 named the algorithms. Phase 2 derived them as resolutions to structural problems. Phase 3 applies them. The dispatch table from L3 told you what each algorithm estimates; the variational framework from L11-L12 told you why they all fit together; the RLHF pipeline shows you what they look like in production. Lesson 13 is the synthesis lesson: everything you learned in Phases 1-2 wired into one production-deployable pipeline.
What you should remember
RLHF wires together PPO (Lesson 8) and the variational framework (Lessons 11-12) to align pretrained language models to human preferences. Three stages (SFT → RM → PPO + KL), one variational objective, one dominant failure mode (reward hacking), one structural defense (KL penalty). The variants (Constitutional AI, DPO, GRPO, IPO) are different samplers from the same variational construction. Operational instruments distinguish working runs from failures empirically. The deeper alignment questions remain open; the operational engineering is now mature.