
Cheatsheet: RLHF (the InstructGPT pipeline)

| Stage | Loss | Output |
|---|---|---|
| 0. Pretrained | next-token cross-entropy on internet text | π_pretrained |
| 1. SFT | `L_SFT = -E[log π_θ(y* \| x)]` on demonstrations | π_SFT |
| 2. Reward model | Bradley-Terry: `L_RM = -E[log σ(R_φ(x, y_w) - R_φ(x, y_l))]` | R_φ |
| 3. PPO + KL | `L_RLHF = L^CLIP - β · KL(π_θ \|\| π_SFT)` | π_θ |

Stage 1 is a necessary anchor: it supplies the reference policy π_SFT that the KL term pulls toward. Stage 2 fits the reward model on preference pairs. Stage 3 optimizes the policy with PPO machinery plus a KL constraint.
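As a rough illustration of the Stage 3 objective, the sketch below combines the standard PPO clipped term with a sample-based KL penalty. The clip range `eps`, the dictionary field names, and the single-sample KL estimate `log π_θ(y|x) - log π_SFT(y|x)` (which assumes the responses were sampled from the current policy) are assumptions made for this example, not details from the cheatsheet.

```python
import math

def clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate for one sample: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def rlhf_objective(batch, beta=0.1, eps=0.2):
    """Batch estimate of L^CLIP - beta * KL(pi_theta || pi_SFT) (to be maximized).

    Each element of `batch` is a dict with hypothetical keys:
    logp_new, logp_old, logp_sft (sequence log-probs) and advantage.
    """
    n = len(batch)
    clip_mean = sum(
        clipped_term(s["logp_new"], s["logp_old"], s["advantage"], eps) for s in batch
    ) / n
    # Monte Carlo KL estimate, valid when y was sampled from the current policy.
    kl_mean = sum(s["logp_new"] - s["logp_sft"] for s in batch) / n
    return clip_mean - beta * kl_mean
```

With a positive advantage the clipped term stops rewarding ratio increases beyond `1 + eps`; the KL term separately penalizes drift from π_SFT regardless of the advantage sign.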

J(π_θ) = E_{x ~ D} [ E_{y ~ π_θ(·|x)} [ R_φ(x, y) ] - β · KL(π_θ(·|x) || π_SFT(·|x)) ]

= variational ELBO for the optimality-conditioned graphical model with:

  • Latent: response y
  • Prior: π_SFT(y | x)
  • Likelihood: exp(R_φ(x, y) / β)
  • Temperature: β

The maximizer is the posterior of this model:

π*(y | x) = (1 / Z(x)) · π_SFT(y | x) · exp(R_φ(x, y) / β)
Z(x) = Σ_y π_SFT(y | x) · exp(R_φ(x, y) / β)

This is the soft Bellman posterior from L12 at the sequence level. PPO optimizes a tractable surrogate for it.
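One way to check that π* maximizes the objective is to rewrite J for a fixed prompt x in terms of π*. The derivation below is a sketch that uses only the definitions of π* and Z(x) given above; J_x denotes the per-prompt objective, a label introduced here for convenience.

```latex
\begin{align*}
J_x(\pi)
  &= \mathbb{E}_{y \sim \pi}\big[R_\phi(x,y)\big]
     - \beta\,\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{SFT}}(\cdot \mid x)\big) \\
  &= -\beta\,\mathbb{E}_{y \sim \pi}\!\left[
       \log \pi(y \mid x) - \log \pi_{\mathrm{SFT}}(y \mid x) - \tfrac{1}{\beta} R_\phi(x,y)
     \right] \\
  &= -\beta\,\mathbb{E}_{y \sim \pi}\!\left[
       \log \frac{\pi(y \mid x)}{\pi^*(y \mid x)}
     \right] + \beta \log Z(x)
     \qquad \text{since } \log \pi_{\mathrm{SFT}} + \tfrac{1}{\beta} R_\phi = \log \pi^* + \log Z \\
  &= -\beta\,\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi^*(\cdot \mid x)\big) + \beta \log Z(x).
\end{align*}
```

The KL term is non-negative and vanishes only at π = π*, so the maximum value is β log Z(x), attained at π*.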

Single prompt x, two responses. π_SFT = (0.6, 0.4), R = (1, 0), β = 0.5.

| Response | π_SFT · exp(R / β) | Numerator |
|---|---|---|
| y_1 | 0.6 · exp(2) | 4.433 |
| y_2 | 0.4 · exp(0) | 0.400 |

Z = 4.833. π* = (0.917, 0.083). RLHF puts more mass on y_1 than SFT did (0.917 vs. 0.6), but less than a pure reward maximizer would (1.0).
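The worked example can be reproduced in a few lines. This is a minimal sketch for a finite response set; the function name `optimal_policy` is made up for illustration, and the max-reward shift is a numerical-stability detail that cancels in the normalization.

```python
import math

def optimal_policy(prior, rewards, beta):
    """Return pi*(y | x) proportional to prior(y) * exp(reward(y) / beta)."""
    # Shift by the max reward before exponentiating; the shift cancels in Z.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(prior, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

pi_star = optimal_policy(prior=[0.6, 0.4], rewards=[1.0, 0.0], beta=0.5)
print([round(p, 3) for p in pi_star])  # [0.917, 0.083]
```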

| β | π*(y_1) | Behavior |
|---|---|---|
| → 0 | → 1.0 | Pure reward maximization, ignores SFT |
| 0.01 | ≈ 1.0 | Risk of reward hacking |
| 0.1 | ≈ 0.99997 | Typical strong-signal regime |
| 0.5 | 0.917 | Worked example |
| 1.0 | ≈ 0.80 | Weak signal |
| → ∞ | → 0.6 (SFT) | No RL value extracted |

| β too low | β too high | Right β |
|---|---|---|
| Reward hacking; adversarial RM exploits | No improvement over SFT | Empirical sweet spot 0.01-0.1 |
| Measured KL very high | Measured KL near zero | Calibrate per RM quality |
| Win-rate against base may drop | Win-rate ≈ 50% (no improvement) | Win-rate > 50%, KL bounded |

Always measure achieved KL and win-rate; do not pick β by feel.
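To make "measure achieved KL" concrete on the two-response toy problem, the sketch below sweeps β and reports π*(y_1), KL(π* || π_SFT), and expected reward. The helper names and the particular β grid are illustrative assumptions; in a real run the KL would be estimated from sampled responses rather than computed in closed form.

```python
import math

def optimal_policy(prior, rewards, beta):
    """pi*(y | x) proportional to prior(y) * exp(reward(y) / beta), finite response set."""
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(prior, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def sweep(prior, rewards, betas):
    """Return (beta, pi*(y_1), KL from prior, expected reward) for each beta."""
    rows = []
    for beta in betas:
        policy = optimal_policy(prior, rewards, beta)
        kl = kl_divergence(policy, prior)
        expected_reward = sum(p * r for p, r in zip(policy, rewards))
        rows.append((beta, policy[0], kl, expected_reward))
    return rows

for beta, p1, kl, er in sweep([0.6, 0.4], [1.0, 0.0], [0.01, 0.1, 0.5, 1.0, 10.0]):
    print(f"beta={beta:<5} pi*(y_1)={p1:.4f}  KL={kl:.4f}  E[R]={er:.4f}")
```

In this toy setting the KL from π_SFT is capped at log(1/0.6) ≈ 0.51 nats, because the most the policy can do is put all mass on y_1; over real sequence spaces there is no such small cap, which is why the achieved KL has to be tracked.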

| Variant | Key change | When |
|---|---|---|
| InstructGPT/PPO (baseline) | SFT + RM + PPO + KL | Default; needs PPO compute |
| Constitutional AI (Bai 2022) | RLAIF preferences from a constitution | Scalable preferences, principled |
| DPO (Rafailov 2023) | Skip RM + PPO; direct max-likelihood on prefs | Simpler; smaller scale; RM-accuracy is bottleneck |
| GRPO (DeepSeekMath, Shao 2024; popularized by DeepSeek-R1, 2025) | Drop critic; group-normalized rewards (subtract group mean, divide by group std) | Reasoning tasks; sparse rewards |
| IPO (Azar 2024) | Theoretical generalization of DPO | Overconfidence-mitigated DPO |

| Instrument | What it measures | Heuristic pass |
|---|---|---|
| Reward-model test accuracy | RM held-out preference accuracy | > 65% (chance 50%) |
| Measured KL-from-base | `KL(π_θ \|\| π_SFT)` | … |
| Win-rate against base | Human/LLM-judge preference | > 50% |
| HarmBench | Adversarial refusal evaluation | Higher refusal on harmful prompts |
| Sycophancy benchmarks | Answer-change rate under user opinion | Lower is better |
| Capability evals (MMLU, GSM8K, HumanEval) | General capability retention | Should not drop materially from SFT |

These instruments turn “is RLHF working?” into specific empirical measurements. Such empirical questions are distinct in kind from broader value-alignment questions, which the instruments inform but do not settle.
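Two of these instruments reduce to simple counting once the scores and judgments are collected. The sketch below is illustrative only: the function names, the `"win"`/`"loss"`/`"tie"` labels, and the convention that ties count as half a win are assumptions, not part of any benchmark's specification.

```python
def rm_accuracy(chosen_scores, rejected_scores):
    """Fraction of held-out pairs where the RM scores the preferred response higher."""
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return correct / len(chosen_scores)

def win_rate(judgments):
    """Win-rate of the tuned policy against the base model.

    `judgments` is a list of "win", "loss", or "tie" labels from a human or
    LLM judge; ties count as half a win (an assumed convention).
    """
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(points[j] for j in judgments) / len(judgments)

print(rm_accuracy([1.0, 0.2, 0.5, 0.9], [0.5, 0.4, 0.1, 0.3]))  # 0.75
print(win_rate(["win", "win", "loss", "tie"]))                  # 0.625
```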

| Symptom | Mechanism |
|---|---|
| Repetitive output | RM has weakness on confidence; repetition scores high |
| Sycophancy | Agreeing with user always scores high under some RMs |
| Length hacking | Length-bias in human-rater preferences |
| Refusal hacking | Over-cautious responses score higher than risky ones |
| Adversarial JSON / formatting | RM has format-bias from training data |

All of these are mitigated by raising β (the KL penalty weight), which keeps π_θ near π_SFT. Run gradient-ascent reward-hacking analyses periodically.
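A much cheaper check than a gradient-ascent analysis, and not a substitute for one, is to look for a surface-feature bias directly. The sketch below computes the Pearson correlation between response length and reward-model score as a rough probe for length hacking; the sample numbers are made up for illustration and the helper name is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # correlation is undefined for a constant series; report 0
    return cov / math.sqrt(var_x * var_y)

# Illustrative (made-up) data: token counts and RM scores for sampled responses.
lengths = [12, 40, 85, 150, 300]
rm_scores = [0.1, 0.3, 0.2, 0.7, 0.9]
print(f"length/reward correlation: {pearson(lengths, rm_scores):.2f}")
```

A strong positive correlation that grows over training is one hint that the policy is exploiting length rather than improving quality; it does not by itself prove reward hacking.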

For a preference pair (y_w preferred over y_l given x):

P(y_w ≻ y_l | x) = σ(R(x, y_w) - R(x, y_l))

The reward is defined only up to an additive constant per prompt: only differences matter. This is why β cannot be chosen from the raw reward scale; always calibrate it against achieved KL.
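The shift invariance is easy to confirm numerically. The sketch below implements the Bradley-Terry negative log-likelihood for one pair, using a numerically stable form of `-log σ(d)`; the function name is chosen for this example.

```python
import math

def bt_loss(r_w, r_l):
    """Bradley-Terry loss for one pair: -log sigmoid(r_w - r_l)."""
    d = r_w - r_l
    # -log sigmoid(d) = log(1 + exp(-d)); branch on the sign to avoid overflow.
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))

# Adding the same constant to both rewards leaves the loss unchanged...
print(bt_loss(1.0, 0.0) == bt_loss(101.0, 100.0))  # True
# ...while rescaling the rewards does change it.
print(bt_loss(2.0, 0.0) < bt_loss(1.0, 0.0))       # True
```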

  • Setting β by feel without measuring KL
  • Trusting the reward model uncritically (always run reward-hacking analyses)
  • Skipping SFT (RLHF directly on pretrained is unstable)
  • Comparing across reward-model normalizations (only differences matter)
  • Treating DPO as “RLHF without the reward model” (it has its own failure modes)
  • Using human-rater agreement as a proxy for truth (raters disagree systematically)

  • 3-stage pipeline: SFT → RM → PPO+KL.
  • Full objective: L^CLIP - β · KL(π_θ || π_SFT).
  • Optimal policy: π* ∝ π_SFT · exp(R / β) (soft Bellman posterior from L12).
  • Reward hacking is the dominant failure mode; KL penalty is the structural defense.
  • DPO, GRPO, IPO, Constitutional AI: variants of the same variational construction.
  • Operational instruments turn “is RLHF working?” into empirical measurements.
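To illustrate how two of the variants relate to the same construction: inverting π* ∝ π_SFT · exp(R / β) gives an implicit reward β · log(π_θ / π_SFT) (up to a per-prompt constant), which is what the DPO loss plugs into Bradley-Terry, while GRPO keeps an explicit reward but replaces the critic with group statistics. The sketch below is a minimal per-example version under assumptions: log-probabilities are sequence-level sums, the reference policy is π_SFT, the population standard deviation is used, and `eps` is a small stabilizer added here.

```python
import math
import statistics

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (implicit reward margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Stable -log sigmoid(margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: subtract the group mean, divide by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]

# Policy equal to the reference gives zero margin, so the loss is log 2.
print(round(dpo_loss(-5.0, -7.0, -5.0, -7.0), 4))  # 0.6931
# Four sampled responses to one prompt with binary rewards.
print([round(a, 2) for a in grpo_advantages([1.0, 0.0, 0.0, 1.0])])  # [1.0, -1.0, -1.0, 1.0]
```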