A labeler asked to write the best response must produce quality from scratch. Being shown two options and picking the better one is a much smaller ask.
SFT prompts require balanced distributions
Add too many examples of one type and the model tilts that way. Preference tuning acts on the model’s existing behavior rather than rebalancing what prompts it sees.
SFT teaches what, not what-not
Every SFT example is positive. There is no slot for “this response is worse.” Preference tuning injects the missing negative signal structurally.
The lecturer’s qualifier: preference tuning is not the answer to everything. If the model’s SFT data is genuinely wrong, fix the SFT data. The third stage targets things SFT cannot do by construction.
1. Pick a prompt from production logs or a curated distribution that mirrors real users.
2. Generate two responses: feed the same prompt into the SFT model twice with positive temperature (randomness added to sampling so the two runs can produce different outputs).
3. Rate the pair: a human labeler reads both and marks which response is better, on a binary scale.
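The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_generate` is a hypothetical stand-in for the SFT model (fixed logits over a five-word vocabulary), and the field names in the returned record are invented for this example. The part that carries over is the temperature-scaled sampling, which is why two runs on the same prompt can differ.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Sample one index from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive to get varied samples")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def toy_generate(prompt, temperature, rng, length=5):
    """Hypothetical stand-in for the SFT model (assumption, not a real model).

    It ignores the prompt and samples from fixed logits over a tiny vocabulary.
    """
    vocab = ["yes", "no", "maybe", "sure", "later"]
    logits = [2.0, 1.5, 1.0, 0.5, 0.0]
    picks = [sample_with_temperature(logits, temperature, rng) for _ in range(length)]
    return " ".join(vocab[i] for i in picks)


def collect_pair(prompt, temperature=1.0, seed=0):
    """Steps 1 and 2: one prompt, two sampled responses, label left empty."""
    rng = random.Random(seed)
    return {
        "prompt": prompt,
        "response_a": toy_generate(prompt, temperature, rng),
        "response_b": toy_generate(prompt, temperature, rng),
        "preferred": None,  # step 3: the labeler fills in "a" or "b"
    }


pair = collect_pair("How do I reset my password?")
```

The record leaves `preferred` empty on purpose: the binary label comes from the labeler, not from the sampling code.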
Alternatives to human labelers: LLM-as-a-judge (a separate model assigns the preference), or rule-based metrics (BLEU, ROUGE, less common today). A rarer variant: take a bad response from production logs, have a labeler rewrite it into a good one, then use the original-plus-rewrite as a preference pair.
The reward model is a transformer with a classification head (single-number output) instead of a language-modeling head. A decoder-only backbone is typical (“everything is an LM these days”); an encoder-only backbone (BERT-style CLS projection) also works.
Training objective
Bradley-Terry model (1952): push the winning response’s score above the losing response’s score on every preference pair, calibrated so the gaps reflect the observed preferences.
Trained pairwise
Sees pairs during training; the loss requires a winner and a loser per example.
Used pointwise
At inference time, scores one prompt-response input at a time. No comparison needed.
Score scale
Arbitrary. High score = good, low score = bad. Typically normalized before use in stage two.
Data scale
Roughly tens of thousands of preference pairs or more (the lecturer’s qualitative estimate). Smaller than an SFT dataset; much smaller than a pretraining corpus.
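A minimal sketch of the properties above, assuming a toy linear scorer in place of a transformer. The feature vectors, learning rate, and epoch count are invented for illustration. The loss is the standard Bradley-Terry form, -log sigmoid(r_winner - r_loser). Training consumes pairs, scoring takes one input at a time, and the z-score step is one common way to normalize (the source does not say which normalization is used).

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def score(weights, features):
    """Pointwise use: one prompt-response feature vector in, one number out."""
    return sum(w * f for w, f in zip(weights, features))


def bradley_terry_loss(weights, winner, loser):
    """Pairwise loss for one preference pair: -log sigmoid(r_w - r_l)."""
    margin = score(weights, winner) - score(weights, loser)
    return -math.log(sigmoid(margin))


def train(pairs, dim, lr=0.1, epochs=200):
    """Plain gradient descent on the Bradley-Terry loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for winner, loser in pairs:
            margin = score(weights, winner) - score(weights, loser)
            # d(loss)/d(margin) = sigmoid(margin) - 1
            grad_scale = sigmoid(margin) - 1.0
            for i in range(dim):
                weights[i] -= lr * grad_scale * (winner[i] - loser[i])
    return weights


def normalize(scores):
    """Z-score normalization (an assumed choice) before use in stage two."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var) or 1.0  # avoid dividing by zero
    return [(s - mean) / std for s in scores]


# Hypothetical features per response: [helpful cue, rambling cue].
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.2, 0.9]),
    ([0.9, 0.3], [0.4, 0.2]),
]
weights = train(pairs, dim=2)
raw_scores = [score(weights, winner) for winner, _ in pairs]
normalized_scores = normalize(raw_scores)
```

Note that `score` never sees the other half of the pair. Only the loss does, which is the difference between training pairwise and using the model pointwise.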
The reward model learns whatever the labelers were asked to prefer. Different guidelines produce different reward models.
What labelers were asked
What the reward model learns
“Which response is more helpful?”
Helpfulness
“Which response is safer?”
Safety
“Which response is more concise?”
Conciseness
“Which response is better overall?”
A holistic mix of the labelers’ intuitions, with all the ambiguity that implies
In practice, frontier labs often train one reward model per dimension and combine them in stage two. The lesson covers the one-reward-model case for clarity; knowing the single-reward-model abstraction is a simplification helps when you read about multi-objective alignment.
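One way to picture the multi-reward-model case is a weighted sum of per-dimension scores. This is a sketch under an assumption: the source does not say how labs combine the scores, and the dimension names, scores, and weights below are made up for illustration.

```python
def combine_rewards(scores, weights):
    """Weighted sum of per-dimension reward scores (assumed combination rule).

    scores:  dimension name -> score from that dimension's reward model
    weights: dimension name -> how much that dimension counts
    """
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score for dimensions: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical scores for one prompt-response input.
scores = {"helpfulness": 0.8, "safety": -0.2, "conciseness": 0.5}
# Hypothetical weights; choosing them is a judgment call.
weights = {"helpfulness": 0.5, "safety": 0.4, "conciseness": 0.1}

combined = combine_rewards(scores, weights)
```

A weighted sum only makes sense if the per-dimension scores are on comparable scales, which is one reason scores are typically normalized before stage two.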
Preferences are subjective, even with good guidelines. The reward model averages over labelers. Scores are not objective measures of quality.
Conflating RLHF with RL in general
RLHF uses RL machinery (stage two), but the human feedback part is stage one (reward-model training). “RLHF” without qualification usually means both stages; “RL” alone usually means stage two.
Assuming the reward model needs the original preference pair at inference time
The reward model takes a prompt and a response, nothing else. It does not know which model produced the response, what the reference response was, or what training pair it came from.
Reading “aligned with human preferences” as a universal standard
“Aligned” means “aligned with the specific labelers under the specific guidelines.” It is a real distribution, not a universal one. Different labs, different labelers, different directions.
“Trained with RLHF”
Preference pairs collected, reward model trained (stage one), LLM updated against it (stage two).
“Trained with RLAIF”
Same procedure; AI-model labels instead of human labels.
“Tuned for helpfulness, harmlessness, and honesty”
These are likely the three dimensions the lab built reward models for. Not all three are equally easy to train; the balance reflects the lab’s choices.
“Preference fine-tuned”
Stage one and possibly stage two done, likely with a specific algorithm (RLHF, DPO, or similar; the next lesson covers the algorithm choices).
Preference pair: one prompt, two responses, and a binary label (which response is preferred). The atomic unit of preference data.
Pairwise comparison: a judgment of which of two options is better. The standard format for preference data.
Reward model: a neural network trained to score prompt-response pairs. Takes a single prompt-and-response as input and outputs one number.
Bradley-Terry model: the statistical model behind the reward-model training objective. Named after a 1952 paper on paired comparisons.
RLHF (Reinforcement Learning from Human Feedback): a two-stage post-training process. Stage one trains a reward model from human-labeled preference pairs. Stage two updates the LLM using the reward model.
RLAIF (Reinforcement Learning from AI Feedback): RLHF with AI-model labels instead of human labels. The training procedure is identical.
Sparse supervision: supervision where the feedback signal arrives once per completion rather than once per token. Characteristic of reward-model training and the RLHF second stage.
Dense supervision: supervision where the feedback signal arrives once per token. Characteristic of pretraining and SFT.
LLM-as-a-judge: using a separate language model to assign preference labels instead of human labelers.
Pointwise scoring: assigning an absolute score to each response independently. Less reliable than pairwise comparison for collecting preference data.
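The data-format terms in this glossary can be written down as a small type. This is an illustrative sketch: the class name, field names, and the "a"/"b" label encoding are assumptions, not a standard schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One prompt, two responses, and a binary label (hypothetical schema)."""

    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b": the result of one pairwise comparison

    def __post_init__(self):
        if self.preferred not in ("a", "b"):
            raise ValueError("preferred must be 'a' or 'b'")

    def chosen(self):
        """The response the labeler preferred (the winner)."""
        return self.response_a if self.preferred == "a" else self.response_b

    def rejected(self):
        """The response the labeler did not prefer (the loser)."""
        return self.response_b if self.preferred == "a" else self.response_a


pair = PreferencePair(
    prompt="Explain what a reward model does.",
    response_a="It scores a prompt-response input with one number.",
    response_b="It is a model.",
    preferred="a",
)
winner, loser = pair.chosen(), pair.rejected()
```

The winner and loser are the two inputs a pairwise training loss needs. At scoring time the reward model sees only a prompt and a single response.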
Supervised fine-tuning teaches the model to answer when someone asks. Preference data teaches it which answer to prefer. The reward model is how that preference becomes a number a training loop can use.