Preferences into reward signals: brief

What you’ll learn

The previous lesson ended with a structural observation: every SFT example is a positive example. There is no slot in the training data for “this response is worse.” The model learns to produce valid responses but has no way to choose between better and worse ones. This lesson covers the data collection and modeling technique that closes that gap.

The core idea is preference data. Instead of asking labelers to write new responses from scratch (slow, expensive, hard to do consistently), you show them two responses to the same prompt and ask which one is better. The two responses plus the labeler’s choice form a preference pair. Pairs are cheaper to collect at scale than SFT examples, and they carry information SFT data structurally cannot carry: which response is worse.
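To make the shape of the data concrete, here is a minimal sketch of one preference pair as a record. The class name, field names, and the record_comparison helper are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One labeler judgment (hypothetical schema, for illustration only)."""
    prompt: str
    chosen: str    # the response the labeler preferred
    rejected: str  # the response the labeler judged worse


def record_comparison(prompt, response_a, response_b, a_is_better):
    """Turn a single A-versus-B judgment into a preference pair."""
    if a_is_better:
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    return PreferencePair(prompt, chosen=response_b, rejected=response_a)


pair = record_comparison(
    "Explain what a reward model does.",
    "It scores a response to a prompt with a single number.",
    "Reward models are a thing in machine learning.",
    a_is_better=True,
)
print(pair.rejected)  # the negative signal an SFT example has no slot for
```

The rejected field is the part SFT data has no place for: it records a response the labeler saw and ranked lower.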

Once you have enough preference pairs, you train a reward model on them. The reward model’s job is to read a prompt and a response and output a single number, a score, reflecting how much a human would prefer that response. Stage two of RLHF (the next lesson) uses that score to nudge the LLM toward higher-scoring responses. This lesson is stage one: building the reward model that makes stage two possible.
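The toy sketch below shows the two sides of that job: training compares pairs, while scoring takes one prompt and one response at a time. It assumes a pairwise logistic (Bradley-Terry style) loss, which is a common choice but is not specified in this brief. The two hand-made features, the linear model, and the example pairs are all hypothetical stand-ins; a real reward model reads the text with a neural network.

```python
import math


def features(prompt, response):
    """Hypothetical hand-made features standing in for a learned text encoder."""
    words = response.lower().split()
    overlap = len(set(prompt.lower().split()) & set(words))
    return [float(overlap), len(words) / 10.0]


def score(weights, prompt, response):
    """Pointwise use: one prompt and one response in, one number out."""
    return sum(w * x for w, x in zip(weights, features(prompt, response)))


def pairwise_loss(weights, prompt, chosen, rejected):
    """-log(sigmoid(score(chosen) - score(rejected))); small when chosen scores higher."""
    margin = score(weights, prompt, chosen) - score(weights, prompt, rejected)
    return math.log1p(math.exp(-margin))


def train(pairs, steps=200, lr=0.1):
    """Plain gradient descent on the pairwise loss, one pair at a time."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        for prompt, chosen, rejected in pairs:
            fc, fr = features(prompt, chosen), features(prompt, rejected)
            margin = score(weights, prompt, chosen) - score(weights, prompt, rejected)
            push = 1.0 / (1.0 + math.exp(margin))  # sigmoid(-margin)
            weights = [w + lr * push * (c - r) for w, c, r in zip(weights, fc, fr)]
    return weights


# Made-up (prompt, chosen, rejected) triples for illustration.
PAIRS = [
    ("how do I reverse a list in python",
     "call reversed on the list or use slicing to reverse a list in python",
     "lists are useful"),
    ("what is a preference pair",
     "a preference pair is two responses with one marked better",
     "it is data"),
]

weights = train(PAIRS)

# After training, the model scores a single new response with no pair needed.
print(score(weights, "how do I reverse a list in python", "use slicing on the list"))
```

Training only ever sees score differences within a pair, so the absolute values of the scores carry no fixed meaning; what matters is that preferred responses end up scoring higher than rejected ones.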

Where this fits

This is lesson 2 of Phase 4, “How models learn to be helpful.” It sits between SFT (lesson 1) and the reinforcement learning update (lesson 3). Lesson 1 exposed the negative-signal gap. This lesson shows how preference data fills it, at least partially. Lesson 3 covers how the reward model trained here is used to update the LLM’s weights. Together, the three lessons form the post-training arc from base model to preference-aligned assistant.

Before you start

Prerequisites: the Phase 4 opener on instruction tuning and SFT. You need to be comfortable with what SFT does, why it creates a “no negative signal” gap, and roughly what the instruction-tuned model produces. No new math beyond what the previous lesson assumed.

By the end, you’ll be able to

Explain what a preference pair is and why pairwise comparison is more reliable than absolute scoring
Describe the RLHF two-stage structure and what stage one (the reward model) produces
Explain how the reward model is trained on pairs but used pointwise at inference time
Distinguish RLHF from RLAIF based on the source of preference labels
Recognize that a reward model captures whichever preference dimension the labelers were asked to judge

Time and difficulty

Read time: about 19 minutes
Practice time: about 12 minutes (a reward-model inference exercise plus flashcards)
Difficulty: standard