References: How RLHF and DPO align models

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 5, LLM tuning):
    https://www.youtube.com/watch?v=PmW_TMQ3l0I
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the RLHF and DPO sections of Stanford CME 295 Lecture 5,
covering [00:18:16] RL framing, [00:48:00] reward hacking and the KL penalty,
[00:53:13] PPO loss with clip and KL variants, [00:58:00] advantage and
the value function, [01:23:00] best-of-N, and [01:30:00 onward] DPO
derivation and the PPO-vs-DPO contrast. Clawdemy provides original notes,
summaries, and quizzes derived from this material for educational purposes.
All rights to the original lectures remain with Stanford and the instructors.

Primary sources

The two papers behind this lesson, in chronological order.

“Proximal Policy Optimization Algorithms”, Schulman et al., 2017. The original PPO paper from OpenAI. Sections 3 (clipped surrogate objective) and 4 (KL penalty variant) are what this lesson refers to as “PPO clip” and “PPO with KL penalty.” The paper predates LLMs; the algorithm was adopted later for RLHF. Read after this lesson if you want the math the lesson deliberately skipped. Section 6 (experiments on Atari and continuous control) is RL-specific and not directly relevant to LLM tuning.
“Direct Preference Optimization: Your Language Model Is Secretly a Reward Model”, Rafailov et al., 2023. The DPO paper. Sections 4 (the derivation) and 5 (theoretical analysis) contain the closed-form move this lesson covered at intuition level. Worth reading even at a non-technical level: the derivation is short and the claim in the title holds up. Section 6 (experiments) reports the head-to-head against PPO on summarization and dialogue benchmarks. The “your language model is secretly a reward model” framing comes from section 5.
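
For readers who want to see the DPO objective before opening the paper, here is a minimal sketch of the per-pair loss in plain Python. It assumes you already have summed sequence log-probabilities for the chosen and rejected responses under both the policy being trained and the frozen reference model. The function names, argument names, and the default beta of 0.1 are illustrative choices, not taken from the lecture or from any library.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    beta=0.1 is an illustrative default (assumption), not a recommendation.
    """
    # Implicit rewards: beta times the log-ratio of policy to reference.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) is the same quantity as softplus(-margin).
    return softplus(-margin)


# Policy identical to the reference: margin is 0, so the loss is log(2).
loss_at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: loss drops.
loss_after_shift = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

The loss depends only on how far the policy has moved relative to the reference on each response, which is why no separate reward model appears anywhere in the computation.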

RLHF for LLMs (the bridging paper)

“Training language models to follow instructions with human feedback”, Ouyang et al., 2022 (the InstructGPT paper). This is the paper that brought RLHF from RL research to LLM alignment. The training pipeline it describes (SFT, then reward model, then PPO) became the standard recipe across the industry for two years. Reading this paper after the previous lesson and this one closes the loop: you have seen each stage in isolation; this paper shows them composed at scale. Section 3 (methods) is where the pipeline is laid out.

Going deeper

A short list, chosen for durability.

“A Comprehensive Survey of Reward Models”, Lambert et al., 2024. Surveys the reward-modeling landscape post-RLHF and pre-DPO-dominance. Useful if you want a wider view of what reward models actually look like in practice across different alignment efforts.
“Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study”, Xu et al., 2024. Systematic head-to-head benchmarking of PPO and DPO on the same datasets and seeds. The empirical answer to the practical question this lesson ends on. Mostly: PPO has a small edge that varies by task, while DPO is much easier to run.
“DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models”, Shao et al., 2024. Introduces GRPO (Group Relative Policy Optimization) in the context of mathematical reasoning. Phase 6 will cover reasoning models in detail; this is where to look if you want the algorithmic specifics of how GRPO drops the value function.
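
As a rough preview of what "dropping the value function" can look like, the sketch below computes group-relative advantages: several responses are sampled for the same prompt, and each response's reward is standardized against the group's mean and standard deviation, so the group itself serves as the baseline. The function name, the use of population standard deviation, and the eps guard are assumptions made for this illustration; consult the paper for the exact formulation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    eps (assumption) avoids division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    # Population standard deviation is an assumption of this sketch.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one math prompt, scored 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every answer gets the same reward, no answer stands out.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct answers end up with positive advantages and incorrect ones with negative advantages, without any learned estimate of expected reward.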

Adjacent topics

The reward-hacking literature. The lecture’s clapping-volume analogy is one instance of a broad pattern in optimization. Search terms: “specification gaming,” “Goodhart’s law,” “outer alignment.” These are conceptual rather than algorithmic; useful for understanding why reward hacking is a structural concern, not a bug to fix in one paper.
KL divergence. The mathematical object the KL penalty uses. Treating it intuitively (as we did in this lesson) is enough to use both PPO and DPO. If you want the formal definition, any introductory information theory text covers it; the relevant properties here are non-negativity and the fact that it is zero exactly when the two distributions are identical.
Bradley-Terry, again. Covered in the previous lesson (preferences-into-reward-signals). Re-anchored here because DPO substitutes into the same formula. If you skipped or are fuzzy on the previous lesson’s coverage of Bradley-Terry, the DPO derivation will be harder to follow. Worth a re-read.
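
The KL divergence and Bradley-Terry entries above can be made concrete with a few lines of Python. This is a toy sketch over small hand-written distributions and scalar rewards, not the token-level computation used in training. The function names are illustrative, and the KL function assumes q is nonzero wherever p is nonzero.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def bradley_terry_prob(reward_a, reward_b):
    """Probability that a is preferred over b: sigmoid(reward_a - reward_b)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


reference = [0.5, 0.5]
drifted = [0.9, 0.1]

kl_same = kl_divergence(reference, reference)  # identical distributions
kl_drift = kl_divergence(drifted, reference)   # policy has moved away

p_equal = bradley_terry_prob(1.0, 1.0)   # equal rewards, no preference
p_better = bradley_terry_prob(2.0, 0.0)  # a has the higher reward
```

The first pair of values illustrates the two KL properties named above; the second pair shows that Bradley-Terry depends only on the difference between the two rewards.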

Stanford CME 295 cheatsheet

Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The “preference tuning” section covers the same material in their dense visual style. The cheatsheet is more compressed than the lecture and worth using as a study reference after this lesson.

Community discussion

None selected for this lesson. The published literature is consolidated enough that academic sources are the better entry point. Durable community references will be added at a future quarterly review if any consolidate.