How RLHF and DPO align models
What you’ll learn
This is the closing lesson of Phase 4, How models become helpful, in Track 5 (AI Foundations). The previous lesson left you with a reward model: a tool that takes a prompt-and-completion pair and returns a single score for how well the completion matches human preferences. That tool is useful, but it is not yet a better LLM. This lesson covers the algorithms that close the gap: RLHF (Reinforcement Learning from Human Feedback), which uses PPO (Proximal Policy Optimization) with a KL penalty against the SFT reference model, and DPO (Direct Preference Optimization), the supervised shortcut derived from the same objective. By the end you will know why “just maximize reward” fails, what PPO does at a conceptual level, why DPO can skip the reward model entirely, and how to choose between the two in practice. The details of the PPO loss (the clipping and KL-term math) are named but not derived, and reward hacking is taught through the lecturer’s “applause volume vs informative lecture” framing. Course materials are at cme295.stanford.edu.
Where this fits
This is lesson 3 of Phase 4, How models become helpful. The previous lesson (How preferences become reward signals) covered stage one of RLHF: training a reward model from human preference pairs, using the Bradley-Terry formulation. This lesson covers stage two: using that reward model (or in DPO’s case, skipping it) to actually update the LLM’s weights toward preferred completions. After this, Phase 4 is complete. Phase 5, How we steer models at inference, picks up after the model is trained and you are using it.
Before you start
Prerequisites: the reward-model lesson is required. We assume you understand what preference pairs are, what a reward model produces (a single score per prompt-and-completion pair), and what the Bradley-Terry formulation does. The SFT lesson is also useful since both PPO and DPO anchor against the SFT model as a reference.
By the end, you’ll be able to
- Explain why “just maximize reward” fails (catastrophic forgetting, reward hacking, training instability) and what the KL penalty against the reference model fixes
- Describe PPO at a conceptual level (advantage maximization, KL penalty, clipping) without deriving its math
- Explain the “your language model is secretly a reward model” insight and what DPO does in plain language
- Distinguish PPO from DPO on practical axes (number of model copies, training stages, hyperparameter complexity, performance trade-off)
- Recognize reward hacking via the lecturer’s “applause volume vs informative lecture” framing
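To make the two objectives named above a little more concrete, here is a minimal Python sketch of the quantities involved. It is an illustration under stated assumptions, not the course's implementation: the function names (`kl_penalized_reward`, `dpo_loss`), the default `beta=0.1`, and the idea of passing in precomputed sequence log-probabilities are all hypothetical choices made for this example. The first function shows the common per-sample form of a reward with a KL penalty against the reference model; the second shows the DPO loss as it is usually written, which needs only log-probabilities from the policy and the reference model and no separate reward model.

```python
import math


def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Reward minus a penalty for drifting away from the reference model.

    logp_policy and logp_ref are the log-probabilities that the current
    policy and the frozen reference (SFT) model assign to the same
    completion. Their difference is a per-sample estimate of the KL term.
    beta is an assumed penalty strength, not a recommended value.
    """
    return reward - beta * (logp_policy - logp_ref)


def dpo_loss(logp_policy_chosen, logp_policy_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(margin)).

    The margin measures how much more the policy has shifted toward the
    chosen completion than toward the rejected one, relative to the
    reference model.
    """
    margin = beta * (
        (logp_policy_chosen - logp_ref_chosen)
        - (logp_policy_rejected - logp_ref_rejected)
    )
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities for one preference pair.
loss_at_reference = dpo_loss(-5.0, -7.0, -5.0, -7.0)  # policy equals reference
loss_after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)   # policy favors chosen more
```

When the policy is identical to the reference, the margin is zero and the loss is log 2. Shifting probability toward the chosen completion and away from the rejected one lowers the loss, which is the behavior the objectives above describe in plain language.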
Time and difficulty
- Read time: about 14 minutes
- Practice time: about 12 minutes (a self-check on the three reasons “just maximize reward” fails, plus flashcards on the PPO/DPO contrast)
- Difficulty: standard