RLHF and DPO: brief

What you’ll learn

This is the closing lesson of Phase 4, How models become helpful, in Track 5 (AI Foundations). The previous lesson left you with a reward model: a tool that takes a prompt-and-completion pair and returns a single number estimating how well the completion matches human preferences. That tool is useful, but it is not yet a better LLM. This lesson covers the algorithms that close the gap: RLHF (Reinforcement Learning from Human Feedback) using PPO (Proximal Policy Optimization) with a KL penalty against the SFT reference model, and DPO (Direct Preference Optimization), the supervised shortcut derived from the same objective.

By the end you will know why “just maximize reward” fails, what PPO does at a conceptual level, why DPO can skip the reward model entirely, and how to choose between the two in practice. The internals of the PPO loss (the clipping term and the KL term) are named but not derived; reward hacking is taught through the lecturer’s “applause volume vs informative lecture” framing. Course materials are at cme295.stanford.edu.

Where this fits

This is lesson 3 of Phase 4, How models become helpful. The previous lesson (How preferences become reward signals) covered stage one of RLHF: training a reward model from human preference pairs, using the Bradley-Terry formulation. This lesson covers stage two: using that reward model (or in DPO’s case, skipping it) to actually update the LLM’s weights toward preferred completions. After this, Phase 4 is complete. Phase 5, How we steer models at inference, picks up after the model is trained and you are using it.

Before you start

Prerequisites: the reward-model lesson is required. We assume you understand what preference pairs are, what a reward model produces (a single score per prompt-and-completion pair), and what the Bradley-Terry formulation does. The SFT lesson is also useful since both PPO and DPO anchor against the SFT model as a reference.
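
If you want a quick check on the Bradley-Terry prerequisite, the sketch below shows the idea in a few lines: the probability that one completion is preferred over another is the sigmoid of the difference between their reward scores, and the reward model is trained to make that probability high for the human-chosen completion. The function names and the example scores are illustrative assumptions, not taken from the course materials.

```python
import math


def bradley_terry_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Probability that the chosen completion beats the rejected one.

    Under Bradley-Terry this is sigmoid(reward_chosen - reward_rejected).
    """
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of one human preference pair."""
    return -math.log(bradley_terry_prob(reward_chosen, reward_rejected))


# Hypothetical scores: the reward model rates the chosen answer higher,
# so the preference probability is above 0.5 and the loss is below log(2).
p = bradley_terry_prob(2.0, 0.5)
loss = preference_loss(2.0, 0.5)
```

Only the difference between the two scores matters here, which is why a reward model's absolute scale carries no meaning on its own.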

By the end, you’ll be able to

Explain why “just maximize reward” fails (catastrophic forgetting, reward hacking, training instability) and what the KL penalty against the reference model fixes
Describe PPO at a conceptual level (advantage maximization, KL penalty, clipping) without deriving its math
Explain the “your language model is secretly a reward model” insight and what DPO does in plain language
Distinguish PPO from DPO on practical axes (number of model copies, training stages, hyperparameter complexity, performance trade-off)
Recognize reward hacking via the lecturer’s “applause volume vs informative lecture” framing
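
To make the first and third objectives concrete, here is a minimal sketch of the two quantities this lesson contrasts, written for a single example rather than a training batch. The first function subtracts a KL-style penalty from the reward model's score, using the per-sample log-probability ratio as the penalty. The second is the DPO loss for one preference pair, where the policy's log-probability ratio against the reference model plays the role of the reward. The function names, the `beta` default, and the log-probability values are assumptions for illustration; real training code works on batched, token-level log-probabilities.

```python
import math


def kl_penalized_reward(reward: float, logp_policy: float,
                        logp_ref: float, beta: float = 0.1) -> float:
    """Reward model score minus a penalty for drifting from the reference.

    (logp_policy - logp_ref) is a single-sample estimate of the KL term.
    """
    return reward - beta * (logp_policy - logp_ref)


def dpo_loss(logp_policy_chosen: float, logp_policy_rejected: float,
             logp_ref_chosen: float, logp_ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair; no reward model is involved."""
    # Implicit rewards: how much more the policy likes each completion
    # than the frozen reference (SFT) model does, scaled by beta.
    implicit_chosen = beta * (logp_policy_chosen - logp_ref_chosen)
    implicit_rejected = beta * (logp_policy_rejected - logp_ref_rejected)
    margin = implicit_chosen - implicit_rejected
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Hypothetical log-probabilities. When the policy equals the reference,
# the margin is zero and the loss is log(2).
start = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# After the policy shifts probability toward the chosen completion,
# the loss drops below that starting value.
later = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

Note that the DPO loss has the same sigmoid-of-a-difference shape as the Bradley-Terry loss used to train a reward model, with the implicit rewards standing in for reward scores. That is the sense in which the language model is “secretly a reward model”.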

Time and difficulty

Read time: about 14 minutes
Practice time: about 12 minutes (a self-check on the three reasons “just maximize reward” fails, plus flashcards on the PPO/DPO contrast)
Difficulty: standard