References: How preferences become reward signals

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 5, LLM tuning): https://www.youtube.com/watch?v=PmW_TMQ3l0I
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT

This lesson adapts the preference-data and reward-model section of Stanford
CME 295 Lecture 5 (roughly 00:05:08 through 00:42:00). The RL update step
(PPO, DPO, KL penalty, reward hacking) is deliberately deferred to the next
lesson in this phase. Clawdemy provides original notes, summaries, and
quizzes derived from this material for educational purposes. All rights to
the original lectures remain with Stanford and the instructors.

Going deeper

A short list, chosen for durability. Each link is for a specific next step, not a generic “learn more.”

“Training language models to follow instructions with human feedback”, Ouyang et al., 2022. The InstructGPT paper. Section 3.5 (“Models”) covers the reward model specifically, including the training objective used to turn preference pairs into a scoring function; Sections 3.2 and 3.4 cover the dataset composition and the labeling setup. Together these are the closest published description of the preference-data and reward-model pipeline this lesson describes. Read Section 3.5 if you want to see what the process looks like at a production lab, with the specific choices and numbers.
“Rank analysis of incomplete block designs: I. The method of paired comparisons”, Bradley and Terry, 1952. The original Bradley-Terry paper from Biometrika. It is a statistics paper, not an ML paper, and the notation is dated; most readers will prefer a modern explainer. It is listed here because the name appears in ML papers and release notes, and knowing the primary source helps when you want to understand what the name actually refers to.
“RewardBench: Evaluating Reward Models for Language Modeling”, Lambert et al., 2024. A benchmark for evaluating reward models across several categories: chat, harder adversarial chat (“Chat Hard”), safety, and reasoning. Useful if you want to understand how reward models are assessed after training, what “a good reward model” means empirically, and where current reward models tend to fail. The paper’s evaluation of many reward models also illustrates the “different dimensions, different models” point the lesson makes.
Hugging Face TRL documentation: reward model training. A widely used open-source implementation of RLHF-style training. The RewardTrainer documentation covers the practical setup: how to structure preference datasets, how the Bradley-Terry loss is implemented, and what the output checkpoint looks like. If you want to see a working implementation of what this lesson describes conceptually, this is the entry point.
Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The post-pretraining section (Section 5) covers SFT, preference tuning, and RLHF in their dense visual style. The reward-model and preference-data subsections pair well with this lesson’s flashcards as a single-page review surface.
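
Several of the references above (InstructGPT, Bradley and Terry, TRL) revolve around the same pairwise objective, so a small worked sketch may help before opening them. The code below is an illustration only, written in plain Python with hypothetical function names; it is not the TRL implementation and not the exact InstructGPT formulation, which compares several responses per prompt. It assumes a reward model has already produced one scalar score per response, and shows the Bradley-Terry preference probability, the matching negative log-likelihood loss for a single pair, and the pairwise accuracy that reward-model benchmarks typically report.

```python
import math


def bt_preference_prob(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred.

    Equal to sigmoid(score_chosen - score_rejected), computed in a way
    that avoids overflow for large score gaps.
    """
    diff = score_chosen - score_rejected
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of one observed preference.

    This is -log(sigmoid(diff)) = log(1 + exp(-diff)), written in the
    standard numerically stable softplus form.
    """
    diff = score_chosen - score_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def pairwise_accuracy(score_pairs: list[tuple[float, float]]) -> float:
    """Fraction of (chosen, rejected) score pairs ranked correctly."""
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    correct = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return correct / len(score_pairs)


# Hypothetical scores for four preference pairs: (chosen, rejected).
example_pairs = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (3.0, 0.0)]

for chosen, rejected in example_pairs:
    prob = bt_preference_prob(chosen, rejected)
    loss = pairwise_loss(chosen, rejected)
    print(f"chosen={chosen} rejected={rejected} prob={prob:.3f} loss={loss:.3f}")

print("pairwise accuracy:", pairwise_accuracy(example_pairs))
```

The loss shrinks as the chosen response's score pulls ahead of the rejected one and grows when the ranking is inverted, which is the sense in which preference pairs are turned into a scoring function. In practice the scores come from a neural network and the loss is averaged over a batch; the references above describe those details.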

Adjacent topics

Topics that build on or sit beside this one.

Reward hacking. The reward model is an approximation of human preference, not a perfect measure of it. When the LLM is updated to maximize the reward model’s score (next lesson), it sometimes finds response strategies that score high on the reward model but are not actually helpful or safe. This failure mode is called reward hacking. This lesson only names the phenomenon; understanding it in depth requires the RL update step and its mitigations (KL penalty, iterative data collection), both covered in the next lesson.
Direct Preference Optimization (DPO). DPO is an alternative to the two-stage RLHF procedure this lesson describes. Instead of training an explicit reward model and then running RL, DPO trains the LLM directly on preference pairs, collapsing both stages into one. The next lesson covers this. The DPO paper (Rafailov et al., 2023) is the canonical reference: arxiv.org/abs/2305.18290.
Constitutional AI. Anthropic’s approach to preference data collection that uses a set of principles (a “constitution”) to guide both the AI-feedback labels and the training objective. Related to RLAIF; the paper (Bai et al., 2022) is at arxiv.org/abs/2212.08073.
What comes after stage one. Lesson 3 of this phase, on how the reward model is used to update the LLM (PPO and DPO as the two main algorithm families), is the direct continuation.

Community discussion

None selected for this lesson. The public discussion of reward models and preference data has consolidated around the lab papers above, the TRL implementation, and the RewardBench evaluation work. If a canonical forum thread or blog post surfaces that durably extends one of these references, it will be added at the next quarterly review.