Cheatsheet: How preferences become reward signals

SFT (positive examples only)
→ no negative signal
→ model cannot rank valid responses
Preference data (pairwise comparison)
→ negative signal added
→ reward model trained from comparisons
→ reward model used to update LLM (next lesson)

| Reason | What it means |
| --- | --- |
| SFT data is hard to write | A labeler asked to write the best response must produce quality from scratch. Showing two options and picking the better one is a much smaller ask. |
| SFT prompts require balanced distributions | Add too many examples of one type and the model tilts that way. Preference tuning acts on the model’s existing behavior rather than rebalancing what prompts it sees. |
| SFT teaches what, not what-not | Every SFT example is positive. There is no slot for “this response is worse.” Preference tuning injects the missing negative signal structurally. |

The lecturer’s qualifier: preference tuning is not the answer to everything. If the model’s SFT data is genuinely wrong, fix the SFT data. Preference tuning (the third training stage, after pretraining and SFT) targets what SFT cannot do by construction.

| Format | What you collect | Why it is or is not standard |
| --- | --- | --- |
| Pointwise | A single score per response (e.g., 0.9 for A, 0.2 for B) | Not standard. Humans cannot score consistently across many examples; the scale drifts and labels become noisy. |
| Pairwise | A binary preference: A is better than B | Standard. A simpler judgment humans can make reliably. |
| Listwise | A ranking of N responses | Possible but more cognitively demanding than pairwise; not significantly more useful in practice. |
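
The listwise and pairwise formats are related: a full ranking of N responses implies N(N-1)/2 pairwise preferences. The Python sketch below only illustrates that relationship; the function name `ranking_to_pairs` is hypothetical and not part of any library.

```python
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (winner, loser) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always the one ranked higher.
    return list(combinations(ranked_responses, 2))


pairs = ranking_to_pairs(["A", "B", "C"])
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```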

How preference data is collected (three steps)

1. Pick a prompt from production logs or a curated distribution that mirrors real users.
2. Generate two responses: feed the same prompt into the SFT model twice with positive temperature (randomness added to sampling so each run produces a different output).
3. Rate the pair: a human labeler reads both and marks which response is better, on a binary scale.
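
The three steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a real pipeline: `PreferencePair`, `sample_with_temperature`, `collect_pair`, the candidate responses, and the length-based judge are all hypothetical stand-ins. A real setup would sample token by token from the SFT model and route the pair to a human labeler.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the response the judge preferred
    rejected: str  # the other response


def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def collect_pair(prompt, generate, judge):
    """Steps 2 and 3: sample two responses, then record which one wins."""
    a, b = generate(prompt), generate(prompt)
    if judge(prompt, a, b):  # True means "a is better"
        return PreferencePair(prompt, chosen=a, rejected=b)
    return PreferencePair(prompt, chosen=b, rejected=a)


# Toy stand-in for the SFT model: whole responses instead of tokens.
rng = random.Random(0)
candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "I think it might be Paris?",
]
logits = [2.0, 1.5, 0.5]


def toy_generate(prompt):
    return candidates[sample_with_temperature(logits, temperature=1.0, rng=rng)]


def toy_judge(prompt, a, b):
    # Stand-in for a human labeler: prefers the shorter response.
    return len(a) <= len(b)


pair = collect_pair("What is the capital of France?", toy_generate, toy_judge)
print(pair)
```

Because `judge` is a plain callable, the same sketch works whether the label comes from a person or from another model. Note that two samples can come out identical; a real pipeline would presumably discard or resample such pairs.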

Alternatives to human labelers: LLM-as-a-judge (a separate model assigns the preference), or rule-based metrics (BLEU, ROUGE, less common today). A rarer variant: take a bad response from production logs, have a labeler rewrite it into a good one, then use the original-plus-rewrite as a preference pair.

| Stage | What happens | Output |
| --- | --- | --- |
| Stage 1 (this lesson) | Train a reward model on preference pairs | A scoring function: prompt + response → one number |
| Stage 2 (next lesson) | Use the reward model to update the LLM | A preference-aligned LLM |

RLHF vs RLAIF: identical procedure; only the label source differs. Human labels = RLHF. AI model labels = RLAIF.

| Property | Detail |
| --- | --- |
| Architecture | A transformer with a classification head (single-number output) instead of a language-modeling head. Decoder-only is typical (“everything is an LM these days”); encoder-only (BERT-style CLS projection) also works. |
| Training objective | Bradley-Terry model (1952): push the winning response’s score above the losing response’s score on every preference pair, calibrated so the gaps reflect the observed preferences. |
| Trained pairwise | Sees pairs during training; the loss requires a winner and a loser per example. |
| Used pointwise | At inference time, scores one prompt-response input at a time. No comparison needed. |
| Score scale | Arbitrary. High score = good, low score = bad. Typically normalized before use in stage two. |
| Data scale | Roughly tens of thousands of preference pairs or more (the lecturer’s qualitative estimate). Smaller than an SFT dataset; much smaller than a pretraining corpus. |
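
Under the Bradley-Terry model, the probability that the winner is preferred is the sigmoid of the score gap, and the training loss is the negative log of that probability. The sketch below works on plain numbers rather than a neural network, so that the pairwise loss and the pointwise use are easy to see. The function names are hypothetical, and the z-score in `normalize` is only one possible normalization, assumed here for illustration; the lesson does not specify which one is used.

```python
import math


def pairwise_loss(score_winner, score_loser):
    """Bradley-Terry negative log-likelihood: -log(sigmoid(winner - loser))."""
    gap = score_winner - score_loser
    # Numerically stable form of log(1 + exp(-gap)).
    return max(-gap, 0.0) + math.log1p(math.exp(-abs(gap)))


def preference_probability(score_winner, score_loser):
    """Probability the model assigns to the observed preference."""
    return math.exp(-pairwise_loss(score_winner, score_loser))


def normalize(scores):
    """Z-score a batch of pointwise rewards (an assumed normalization)."""
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    if std == 0:
        return [0.0 for _ in scores]
    return [(s - mean) / std for s in scores]


# Training is pairwise: the loss needs a winner score and a loser score.
print(pairwise_loss(2.0, 0.0))  # small loss (about 0.127): ranking is right
print(pairwise_loss(0.0, 2.0))  # large loss (about 2.127): ranking is wrong

# Inference is pointwise: each response has its own score, no pair needed.
print(normalize([1.0, 2.0, 3.0]))
```

Only the gap between the two scores enters the loss, which is why the absolute score scale is arbitrary.
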

| Training stage | Supervision signal | Signals per sequence |
| --- | --- | --- |
| Pretraining / SFT | Per-token loss: every predicted token contributes | One per token (dense) |
| Reward model / RLHF stage 2 | Per-completion score: the entire response is generated, then scored once | Roughly one per full response (sparse) |

Sparse supervision is one reason RLHF is harder to stabilize than SFT: each sequence yields fewer learning signals, and each one carries more weight.
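
The difference in signal count can be made concrete with a toy comparison. In the sketch below, the token probabilities and the constant reward are made-up numbers, and both function names are hypothetical.

```python
import math


def per_token_losses(target_token_probs):
    """Dense signal: one cross-entropy term per target token."""
    return [-math.log(p) for p in target_token_probs]


def per_completion_reward(prompt, response, reward_fn):
    """Sparse signal: one scalar for the entire response."""
    return reward_fn(prompt, response)


# Made-up probabilities a model assigned to each correct next token.
probs = [0.9, 0.6, 0.75, 0.3, 0.8]
dense = per_token_losses(probs)

# Made-up reward function standing in for a trained reward model.
sparse = per_completion_reward(
    "some prompt", "a five token response here", lambda p, r: 0.42
)

print(len(dense), "signals from the per-token loss")
print(1, "signal from the reward model:", sparse)
```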

What dimension does the reward model capture?

The reward model learns whatever the labelers were asked to prefer. Different guidelines produce different reward models.

| What labelers were asked | What the reward model learns |
| --- | --- |
| “Which response is more helpful?” | Helpfulness |
| “Which response is safer?” | Safety |
| “Which response is more concise?” | Conciseness |
| “Which response is better overall?” | A holistic mix of the labelers’ intuitions, with all the ambiguity that implies |

In practice, frontier labs often train one reward model per dimension and combine them in stage two. The lesson covers the one-reward-model case for clarity; knowing that this single-reward-model abstraction is a simplification helps when you read about multi-objective alignment.
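
The lesson does not say how the per-dimension reward models are combined. One simple possibility, assumed here purely for illustration, is a weighted sum of their scores. In the sketch below, `combine_rewards`, the constant toy models, and the weights are all hypothetical.

```python
def combine_rewards(prompt, response, reward_models, weights):
    """Weighted sum of per-dimension scores (one possible combination rule)."""
    missing = set(reward_models) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(
        weights[name] * model(prompt, response)
        for name, model in reward_models.items()
    )


# Toy stand-ins returning constants; real reward models are neural networks
# that take a prompt and a response and return one number.
toy_models = {
    "helpfulness": lambda prompt, response: 0.8,
    "safety": lambda prompt, response: 1.0,
    "conciseness": lambda prompt, response: 0.25,
}
toy_weights = {"helpfulness": 0.5, "safety": 0.25, "conciseness": 0.25}

score = combine_rewards(
    "How do I reset my password?",
    "Open Settings, then Security.",
    toy_models,
    toy_weights,
)
print(score)
```

The weights are where a lab's priorities would enter: raising the safety weight shifts the combined score toward the safety model's judgment.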

| Pitfall | Reality |
| --- | --- |
| Treating preference data as objective truth | Preferences are subjective, even with good guidelines. The reward model averages over labelers. Scores are not objective measures of quality. |
| Conflating RLHF with RL in general | RLHF uses RL machinery (stage two), but the human feedback part is stage one (reward-model training). “RLHF” without qualification usually means both stages; “RL” alone usually means stage two. |
| Assuming the reward model needs the original preference pair at inference time | The reward model takes a prompt and a response, nothing else. It does not know which model produced the response, what the reference response was, or what training pair it came from. |
| Reading “aligned with human preferences” as a universal standard | “Aligned” means “aligned with the specific labelers under the specific guidelines.” It is a real distribution, not a universal one. Different labs, different labelers, different directions. |

| Claim in a release announcement | What it usually means |
| --- | --- |
| “Trained with RLHF” | Preference pairs collected, reward model trained (stage one), LLM updated against it (stage two). |
| “Trained with RLAIF” | Same procedure; AI-model labels instead of human labels. |
| “Tuned for helpfulness, harmlessness, and honesty” | These are likely the three dimensions the lab built reward models for. Not all three are equally easy to train; the balance reflects the lab’s choices. |
| “Preference fine-tuned” | Stage one and possibly stage two done, likely with a specific algorithm (RLHF, DPO, or similar; the next lesson covers the algorithm choices). |

  • Preference pair: one prompt, two responses, and a binary label (which response is preferred). The atomic unit of preference data.
  • Pairwise comparison: a judgment of which of two options is better. The standard format for preference data.
  • Reward model: a neural network trained to score prompt-response pairs. Takes a single prompt-and-response as input and outputs one number.
  • Bradley-Terry model: the statistical model behind the reward-model training objective. Named after a 1952 paper on paired comparisons.
  • RLHF (Reinforcement Learning from Human Feedback): a two-stage post-training process. Stage one trains a reward model from human-labeled preference pairs. Stage two updates the LLM using the reward model.
  • RLAIF (Reinforcement Learning from AI Feedback): RLHF with AI-model labels instead of human labels. The training procedure is identical.
  • Sparse supervision: supervision where the feedback signal arrives once per completion rather than once per token. Characteristic of reward-model training and the RLHF second stage.
  • Dense supervision: supervision where the feedback signal arrives once per token. Characteristic of pretraining and SFT.
  • LLM-as-a-judge: using a separate language model to assign preference labels instead of human labelers.
  • Pointwise scoring: assigning an absolute score to each response independently. Less reliable than pairwise comparison for collecting preference data.

Supervised fine-tuning teaches the model to answer when someone asks.
Preference data teaches it which answer to prefer.
The reward model is how that preference becomes a number a training loop can use.