
Cheatsheet: How RLHF and DPO align models

A reward model is a measuring tool, not an algorithm.
RLHF (PPO) and DPO are the algorithms that turn preference signal into
actual weight updates on the LLM: PPO through the reward model's scores,
DPO directly from the preference pairs.

| RL term | LLM equivalent |
|---|---|
| Agent | The LLM |
| State | Input seen so far (prompt + tokens generated to date) |
| Action | Next token to generate |
| Environment | Vocabulary of possible tokens |
| Policy | LLM’s probability distribution over next token |
| Reward | From the reward model, delivered after the full completion |

Sparse signal: SFT gets one signal per token. RLHF gets one signal per completion. Less information per training step.
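
The mapping above, and the single end-of-completion reward, can be sketched as a toy episode. This is an illustration only: `toy_policy`, `toy_reward_model`, and `VOCAB` are hypothetical stand-ins, not a real LLM or reward model.

```python
import random

VOCAB = ["the", "cat", "sat", "<eos>"]  # toy vocabulary (hypothetical)

def toy_policy(state):
    """Hypothetical stand-in for the LLM: uniform distribution over next tokens."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def toy_reward_model(prompt, completion):
    """Hypothetical stand-in for a reward model: one scalar per completion."""
    return float(completion.count("cat"))

def rollout(prompt, max_tokens=8, seed=0):
    rng = random.Random(seed)
    state = list(prompt)      # state: prompt + tokens generated so far
    completion = []
    for _ in range(max_tokens):
        probs = toy_policy(state)  # policy: distribution over the next token
        action = rng.choices(list(probs), weights=list(probs.values()))[0]
        state.append(action)       # action: the token that was sampled
        completion.append(action)
        if action == "<eos>":
            break
    # No per-token feedback: the only reward arrives after the full completion.
    reward = toy_reward_model(prompt, completion)
    return completion, reward
```

However many tokens the loop produces, `rollout` returns exactly one scalar reward, which is the sparsity the note above describes.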

| Failure mode | What breaks |
|---|---|
| Catastrophic forgetting | Pushes weights too far; damages pretrained + SFT knowledge |
| Reward hacking | Reward model is imperfect; model exploits gaps in the proxy |
| Training instability | RL updates can diverge; aggressive steps destroy the policy |

Fix: keep the policy close to the SFT reference model via a KL penalty. The objective becomes “maximize reward AND stay close to reference.”
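
One common way to express “maximize reward AND stay close to reference” is to subtract a KL estimate from the reward. The sketch below assumes you already have the per-token log-probabilities of a sampled completion under both models; the function name and the default `beta=0.1` are illustrative choices, not a fixed recipe.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward minus beta times a sample-based KL estimate.

    policy_logprobs / ref_logprobs: log-probabilities that the current policy
    and the frozen reference assign to each token of the sampled completion.
    """
    # Summing log(pi / pi_ref) over the sampled tokens gives a single-sample
    # estimate of KL(policy || reference) for this completion.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate
```

If the policy still matches the reference, the penalty is zero and the reward passes through unchanged. The further the policy's log-probabilities rise above the reference's on its own samples, the more reward is taken away.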

True goal: Give an informative lecture
Proxy: How loudly the audience claps at the end
Failure: Optimizer learns jokes get loud claps
Reward goes up. Lecture is no longer informative.

Generalizes to any AI system trained against an imperfect proxy reward.

Objective: maximize advantage - beta * KL(policy || reference)
(clipped per-step update)

| Term | Role |
|---|---|
| Advantage | Reward minus expected reward (baseline). Reduces gradient variance. |
| KL penalty | Distance between current policy and frozen SFT reference. Beta tunes strength. |
| Clipping | Caps how much policy can change in one iteration. Epsilon tunes the range. |

Math is name-only at this level. The intuition is: maximize reward, anchor to reference, take small steps.
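
For readers who want one level below name-only, here is a minimal sketch of the advantage and the clipped surrogate for a single action. The function names are illustrative, and `epsilon=0.2` is an assumed default. Real implementations work on batched tensors and typically estimate advantages with GAE.

```python
import math

def compute_advantage(reward, baseline):
    """Advantage: how much better the outcome was than expected."""
    return reward - baseline

def ppo_clipped_objective(logp_new, logp_old, advantage, epsilon=0.2):
    """Clipped surrogate for one action (token); higher is better."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes the incentive to push the ratio outside
    # [1 - epsilon, 1 + epsilon] in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, raising the action's probability beyond a ratio of `1 + epsilon` earns nothing extra, which is the “take small steps” part of the intuition.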

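| Model copy | Frozen? | Why |
|---|---|---|
| Policy | No (training) | The LLM you’re updating |
| Reference model | Yes | For the KL penalty (typically the SFT model) |
| Reward model | Yes | From stage one |
| Value function | No (trained jointly) | Estimates expected reward, the baseline used to compute advantage |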

Four model copies. Frontier-LLM scale. Heavy.
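
A back-of-the-envelope sketch of what “heavy” means for weights alone. The 7B parameter count and 2 bytes per parameter (16-bit weights) are assumptions chosen for illustration, and the sketch assumes all four copies are the same size, which need not hold in practice.

```python
def weights_memory_gb(n_params, n_copies, bytes_per_param=2):
    """Memory for model weights only: no gradients, optimizer state, or activations."""
    return n_params * bytes_per_param * n_copies / 1e9

# Policy + reference + reward model + value function, all assumed to be 7B.
ppo_weights_gb = weights_memory_gb(7e9, n_copies=4)
```

This lower bound comes to 56 GB before training starts. The two trainable copies (policy and value function) also need gradients and optimizer state on top of it.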

  • Two-stage training (reward model + policy). Bug in stage one means restart everything.
  • Multiple sensitive hyperparameters (beta, epsilon, GAE parameters, learning rates).
  • Instability risk despite all guardrails.
  • On-policy data: model generates its own training rollouts each iteration.

DPO and the “secretly a reward model” insight

1. Start with PPO objective: max reward - beta * KL(policy || reference)
2. Solve in closed form for optimal policy.
3. Rearrange: express reward as a function of the optimal policy.
4. Plug into Bradley-Terry: P(winner > loser) = sigmoid(R(w) - R(l))
5. Partition function cancels in the subtraction.
6. What remains: a supervised loss on policy log-ratios.

Result: no reward model. Just a supervised loss directly on preference pairs.
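
The six steps can be written out as follows. This is a sketch of the standard derivation in notation assumed here: \pi is the policy, \pi_{\mathrm{ref}} the frozen reference, r the reward, Z(x) the partition function, and y_w, y_l the preferred and rejected completions.

```latex
\begin{align*}
&\text{1. Objective:} &&
  \max_{\pi}\; \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \\
&\text{2. Optimal policy:} &&
  \pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big) \\
&\text{3. Rearranged:} &&
  r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
&\text{4-5. Bradley-Terry:} &&
  P(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)
  = \sigma\!\Big( \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big)
\end{align*}
```

The \beta \log Z(x) term depends only on the prompt, so it is identical for both completions and drops out of the difference. Step 6 is then maximum likelihood on this probability, with the trainable policy in place of \pi^{*}.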

DPO loss shape:
-E[ log sigmoid( beta * ( log(p(yw|x)/p_ref(yw|x))
- log(p(yl|x)/p_ref(yl|x)) ) ) ]

Same Bradley-Terry shape as the reward-model loss from the previous lesson, with policy log-ratios in place of reward scores.
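
A minimal sketch of that loss for a single preference pair, assuming you already have the summed log-probability of each full completion under the policy and under the frozen reference. The argument names are illustrative; `beta=0.1` follows the typical value quoted in this cheatsheet.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full completion
    (w = preferred, l = rejected) under the policy or the frozen reference.
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference, the margin is zero and the loss is log(2). The loss falls as the policy raises the preferred completion's log-ratio relative to the rejected one's.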

| Axis | PPO | DPO |
|---|---|---|
| Model copies | 4 | 2 |
| Training stages | 2 (reward model + RL) | 1 (direct loss) |
| Main hyperparameters | Beta, epsilon, GAE params, etc. | Beta (typically around 0.1) |
| Training type | On-policy RL | Supervised |
| Reported performance | Slightly better on harder benchmarks | Slightly behind PPO; gap varies |
| Pipeline complexity | High | Dramatically lower |
| Known wrinkle | Instability, hyperparameter tuning | Distribution shift (preferences vs model outputs) |

Both need preference data. DPO removes the reward-model-training step, not the data-collection step.

  • Best-of-N (BoN): skip RL entirely. Generate N completions at inference, pick the highest-rated. Pushes cost from training to inference. Good for prototyping; not how production systems are aligned.
  • GRPO (Group Relative Policy Optimization): variant of PPO that drops the value function. Used in some recent reasoning-model training (DeepSeek-Math). Covered in Phase 6.
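
Both alternatives start from the same step: sample several completions for one prompt and score them. The sketch below shows what each does with the scores. It is simplified: the scores are assumed to come from some reward model, and the use of the population standard deviation and the `eps` guard are illustrative choices.

```python
from statistics import mean, pstdev

def best_of_n(completions, scores):
    """Best-of-N: return the completion with the highest reward-model score."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return completions[best_index]

def group_relative_advantages(scores, eps=1e-8):
    """GRPO-style advantages: each score normalized against its own group.

    The group mean plays the role of the baseline, so no separate value
    function is needed.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]
```

Best-of-N uses the scores only to pick an output at inference time. GRPO uses them as a training signal in which completions above the group average get positive advantages and those below get negative ones.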

| Pitfall | Reality |
|---|---|
| “RLHF and DPO are completely different ideas.” | They are not. DPO is mathematically derived from the PPO objective. Same family. |
| “DPO eliminates the need for preference data.” | No. It eliminates reward-model training. Both methods need preference pairs. |
| “The SFT model is replaced by the preference-tuned one.” | Not exactly. SFT becomes the frozen reference. Preference-tuned policy ships, but the reference is load-bearing during training. |
| “Reward hacking is a theoretical worry.” | It is a practical one. The clapping-volume analogy generalizes. Whenever a reward model approximates human preferences, hard optimization can drift the model away from the actual goal. |

  • RLHF: Reinforcement Learning from Human Feedback. Two-stage: train reward model, then RL-tune policy against it.
  • PPO: Proximal Policy Optimization. The classic RL algorithm used in RLHF stage two. Originally a 2017 RL paper.
  • DPO: Direct Preference Optimization. The supervised shortcut, derived from the PPO objective in closed form. 2023.
  • Reference model: the frozen SFT model that anchors the policy via KL penalty. Both PPO and DPO use one.
  • Beta: the KL coefficient. How strongly the reference model anchors the policy. Typically around 0.1 (rough order of magnitude).
  • Advantage: “how much better than expected.” Reward minus baseline. Reduces gradient variance.
  • Clipping: PPO trick that caps per-iteration policy change. Controlled by epsilon.
  • On-policy training: model generates training data from its own current policy. PPO is on-policy. DPO trains on a fixed preference dataset, so it is offline rather than on-policy.
  • Reward hacking: model optimizing too hard against an imperfect proxy reward. Result: high reward, missed actual goal.
  • GRPO: Group Relative Policy Optimization. PPO variant without value function, popularized in reasoning-model training.

A reward model tells you what’s good. It cannot tell the LLM how to get there.
RLHF uses RL with guardrails (PPO) to push toward higher reward without forgetting.
DPO is the supervised shortcut: skip the reward model, optimize the policy directly on preferences.