Summary: How preferences become reward signals
SFT teaches the model what to predict. Preference data teaches it which answer to prefer. The reward model is how that preference becomes a number a training loop can use. This summary is the scan-it-in-four-minutes version. The full lesson covers the reasoning behind each step, with the lecturer’s examples and a worked walk-through of how preference pairs are collected and turned into a training signal.
Core ideas
- The gap SFT leaves. Every SFT example is positive: here is the response to produce. There is no slot for a worse option. The model has no way to distinguish between two valid-looking responses; it picks whatever is closest to the average of its training examples. Preference data adds the missing axis.
- A preference pair is the atomic unit. One prompt, two responses, a binary label: A is better. That is all. No rubric score, no rewrite, no per-token annotation.
- Pairwise comparison is more reliable than absolute scoring. Asking a labeler “is this a 0.9 or a 0.85?” is a question humans cannot answer consistently across many examples; the scale drifts and labels become noisy. Asking “is A better than B?” is a simpler judgment humans can make reliably. That is why pairwise binary comparison is the standard.
- Two responses come from the same SFT model. Feed the same prompt twice with temperature greater than zero (temperature adds randomness to sampling, so the same prompt produces different outputs each run). The variation is sampling noise, not a second model.
- RLHF has two stages. Stage one: train a reward model on preference pairs. Stage two: use the reward model to update the LLM. This lesson is stage one. The next lesson is stage two.
- The reward model is a classifier head on a transformer. Take a pretrained transformer, replace the language-modeling head with a single-number output, and train on preference pairs using the Bradley-Terry objective (named after the authors of a 1952 statistics paper), which pushes the winning response’s score above the losing response’s score. The result is a function: prompt and response in, score out.
- Trained pairwise, used pointwise. The pairwise structure is a training-time artifact. At inference time, the reward model scores one input at a time. No comparison needed once training is done.
- The reward model captures one dimension. If labelers were asked “which is more helpful,” it learns helpfulness. If asked “which is safer,” it learns safety. Different labeling guidelines produce different reward models; they are not interchangeable. In practice, frontier labs often train one reward model per dimension.
- RLHF vs RLAIF. The human in RLHF refers to who labels the preference pairs. If humans label them, it is RLHF. If another AI model labels them, it is RLAIF. The downstream training procedure is the same; only the label source differs.
- Supervision is sparse. SFT gives the model roughly one training signal per token (every token in the response contributes to the loss). The reward model gives roughly one signal per full completion: generate the entire response, then assign a single score. That asymmetry is one reason RLHF is harder to make stable than SFT.
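The "trained pairwise, used pointwise" idea above can be sketched in a few lines. This is a toy illustration, not the lesson's implementation: it assumes the pairwise objective is the Bradley-Terry loss, -log sigmoid(score_chosen - score_rejected), and it swaps the transformer for a linear scorer over two hand-made features so the example runs with the standard library alone. The feature names, the `train` helper, and the learning rate are all hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def score(weights, features):
    """Pointwise reward: one (prompt, response) feature vector in, one number out."""
    return sum(w * f for w, f in zip(weights, features))


def pairwise_loss(weights, chosen, rejected):
    """Bradley-Terry loss: -log sigmoid(score(chosen) - score(rejected))."""
    margin = score(weights, chosen) - score(weights, rejected)
    # Stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def train(pairs, dim, lr=0.5, epochs=200):
    """Plain gradient descent on the pairwise loss (hypothetical helper)."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = score(weights, chosen) - score(weights, rejected)
            grad_scale = -sigmoid(-margin)  # d(loss) / d(margin)
            for i in range(dim):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights


# Assumed toy features per response: [answers_the_question, is_rude].
# Each pair is (features of the preferred response, features of the other one).
pairs = [
    ([1.0, 0.0], [0.0, 0.0]),  # answering beats not answering
    ([1.0, 0.0], [1.0, 1.0]),  # polite beats rude
    ([0.0, 0.0], [0.0, 1.0]),  # polite beats rude, even without an answer
]

weights = train(pairs, dim=2)

# Training needed pairs; scoring needs only one response at a time.
polite_answer = score(weights, [1.0, 0.0])
rude_answer = score(weights, [1.0, 1.0])
print(polite_answer > rude_answer)
```

Before training, every margin is zero and each pair's loss is log 2; training pushes each preferred response's score above its counterpart's. The labels only ever say "A is better", yet the trained scorer assigns a standalone number to any single response, which is what a later training loop would consume.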
What changes for you
Before this lesson, “trained with RLHF” in a model announcement was probably opaque shorthand. After it, you know what stage one of RLHF produces (a reward model), why pairwise comparison is the standard collection format, and why two labs both claiming “RLHF training” can produce models with noticeably different personalities (different labeling guidelines, different reward models, different directions). When you see a model that feels evasive on a topic, you have a frame for why: the behavior may reflect a safety reward model that scored refusal highly on that kind of prompt.
Supervised fine-tuning teaches the model to answer when someone asks.
Preference data teaches it which answer to prefer.
The reward model is how that preference becomes a number a training loop can use.