How preferences become reward signals

The previous lesson ended with a structural observation: every SFT example is a positive example. There is no slot in the training data for “this response is worse.” The model learns to produce valid responses but has no way to choose between better and worse ones. This lesson covers the data collection and modeling technique that closes that gap.

The core idea is preference data. Instead of asking labelers to write new responses from scratch (slow, expensive, hard to do consistently), you show them two responses to the same prompt and ask which one is better. The resulting record (the prompt, the preferred response, and the rejected response) is called a preference pair. Judging two existing responses is cheaper to collect at scale than writing SFT examples, and it carries information that SFT data structurally cannot: which response is worse.
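To make the shape of this data concrete, here is a minimal sketch of how one labeler judgment might be stored. The `PreferencePair` class, its field names (`chosen`, `rejected`), the `to_pair` helper, and the example texts are all hypothetical and chosen for illustration. Real datasets vary in naming and often carry extra metadata such as labeler IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One labeler judgment (hypothetical schema, for illustration only)."""
    prompt: str
    chosen: str    # the response the labeler preferred
    rejected: str  # the response the labeler judged worse


def to_pair(prompt: str, response_a: str, response_b: str,
            labeler_picked_a: bool) -> PreferencePair:
    """Turn a binary 'A or B?' judgment into a (chosen, rejected) record."""
    if labeler_picked_a:
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    return PreferencePair(prompt, chosen=response_b, rejected=response_a)


pair = to_pair(
    prompt="Explain what a hash map is.",
    response_a="A hash map stores key-value pairs and finds keys via a hash function.",
    response_b="It is a map.",
    labeler_picked_a=True,
)
```

Note that the record keeps the losing response. That `rejected` field is exactly the slot an SFT example lacks.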

Once you have enough preference pairs, you train a reward model on them. The reward model’s job is to read a prompt and a response and output a single number, a score, reflecting how much a human would prefer it. Stage two of RLHF (the next lesson) uses that score to nudge the LLM toward higher-scoring responses. This lesson is stage one: building the reward model that makes stage two possible.
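As a rough sketch of what "train a reward model on pairs" can mean, one commonly used objective is a Bradley-Terry style logistic loss on the difference between the two scores. This text does not specify the exact loss, so treat the formula below as an assumption. The function name `pairwise_loss` and the example scores are hypothetical. In practice the scores would come from a neural network that reads the prompt and response.

```python
import math


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Assumed objective: -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the chosen response outscores the rejected one
    and large when the ordering is reversed.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Training compares two scores for the same prompt.
loss_correct_order = pairwise_loss(score_chosen=2.0, score_rejected=0.0)
loss_no_preference = pairwise_loss(score_chosen=0.0, score_rejected=0.0)
loss_wrong_order = pairwise_loss(score_chosen=0.0, score_rejected=2.0)
```

Only the difference between the two scores enters the loss, so training needs both responses. The scoring function itself takes a single prompt-response pair, which is why the trained model can later score one response at a time.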

This is lesson 2 of Phase 4, How models learn to be helpful, sitting between SFT (lesson 1) and the reinforcement learning update (lesson 3). Lesson 1 opened the negative-signal gap. This lesson shows how preference data fills it, at least partially. Lesson 3 covers how the reward model trained here is used to actually update the LLM’s weights. The three lessons together constitute the post-training arc from base model to preference-aligned assistant.

Prerequisites: the Phase 4 opener on instruction tuning and SFT. You need to be comfortable with what SFT does, why it creates a “no negative signal” gap, and roughly what the instruction-tuned model produces. No new math beyond what the previous lesson assumed.

Learning objectives:

  • Explain what a preference pair is and why pairwise comparison is more reliable than absolute scoring
  • Describe the RLHF two-stage structure and what stage one (the reward model) produces
  • Explain how the reward model is trained on pairs but used pointwise at inference time
  • Distinguish RLHF from RLAIF based on the source of preference labels
  • Recognize that a reward model captures whichever preference dimension the labelers were asked to judge

Lesson details:

  • Read time: about 19 minutes
  • Practice time: about 12 minutes (a reward-model inference exercise plus flashcards)
  • Difficulty: standard