Cheatsheet: Post-training, SFT and RLHF

pretrain (Phases 1-2)
-> SFT (instruction-following, format)
-> preference tuning (rank plausible outputs)
-> RLHF (3 steps, PPO) OR DPO (1 loss)

Each stage is necessary; none replaces the others. RLHF and DPO are two alternative ways to carry out the preference-tuning stage.

  • Data: chat-formatted (system, user, assistant) messages; instruction-response pairs.
  • Loss: cross-entropy on assistant tokens only (mask the system and user turns so they do not contribute to the loss).
  • Tools: SFTTrainer (TRL) + LoRA / PEFT for affordable runs.
  • Volume: small + curated. Quality > quantity.
  • Result: model follows instructions, holds dialogue, uses the format you trained.
  • Limit: cannot rank two plausible responses.
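
The loss masking described in the list above can be sketched in plain Python. This is a minimal illustration, not the TRL implementation: the helper names (`build_labels`, `masked_cross_entropy`), the toy token ids, and the per-token log-probabilities are all hypothetical. The value `-100` mirrors the default `ignore_index` of PyTorch's `CrossEntropyLoss`, and the usual one-position shift between inputs and next-token targets is omitted for brevity.

```python
import math

# Assumption: -100 marks positions that should not be scored
# (the default ignore_index of PyTorch's CrossEntropyLoss).
IGNORE_INDEX = -100


def build_labels(token_ids, roles):
    """Copy token ids as targets, masking every token not written by the assistant."""
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]


def masked_cross_entropy(token_log_probs, labels):
    """Mean negative log-likelihood over unmasked (assistant) positions only.

    token_log_probs[i] is the log-probability the model assigned to the
    target token at position i.
    """
    scored = [-lp for lp, lab in zip(token_log_probs, labels)
              if lab != IGNORE_INDEX]
    if not scored:
        raise ValueError("no assistant tokens to score")
    return sum(scored) / len(scored)


# Toy sequence: 2 system tokens, 3 user tokens, 2 assistant tokens.
token_ids = [11, 12, 21, 22, 23, 31, 32]
roles = ["system"] * 2 + ["user"] * 3 + ["assistant"] * 2
labels = build_labels(token_ids, roles)

# Hypothetical model outputs: poor on the prompt, better on the answer.
token_log_probs = [math.log(0.1)] * 5 + [math.log(0.5), math.log(0.25)]
loss = masked_cross_entropy(token_log_probs, labels)
```

Only the two assistant positions enter the average, so the low probabilities on the system and user tokens have no effect on the loss.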

SFT imitates the responses it sees -> teaches format + behavior. Cannot encode “A is better than B” for two plausible outputs. Preference tuning fills that gap.

| Step | What it does |
| --- | --- |
| 1. Preference data | Humans pick A vs B for the same prompt -> (prompt, A, B, preferred) tuples |
| 2. Reward model | Maps (prompt, response) -> scalar; fit so that higher = preferred |
| 3. RL update of policy | Sample responses from the policy (initialized from the SFT model), score them with the reward model, update with PPO under a KL penalty toward the SFT initialization |
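
Steps 2 and 3 can be made concrete with two small formulas. The sketch below assumes the reward model is fit with a Bradley-Terry pairwise loss, a common choice that this cheatsheet does not itself specify, and that the KL penalty is applied as a per-sample log-ratio subtracted from the reward-model score. The function names and the `kl_coef` value are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_preferred, r_rejected):
    """Pairwise (Bradley-Terry) loss: small when the preferred response scores higher."""
    return -log_sigmoid(r_preferred - r_rejected)


def kl_penalized_reward(rm_score, policy_logprob, sft_logprob, kl_coef=0.1):
    """Reward used in the RL step: reward-model score minus a drift penalty.

    policy_logprob - sft_logprob is a single-sample estimate of the KL
    divergence between the current policy and the SFT reference.
    """
    return rm_score - kl_coef * (policy_logprob - sft_logprob)


# Reward model ranks the pair correctly vs. incorrectly.
good_ranking = reward_model_loss(2.0, 0.0)
bad_ranking = reward_model_loss(0.0, 2.0)

# The policy assigns the sampled response more probability than the SFT
# model does, so part of the reward-model score is taken back.
shaped = kl_penalized_reward(rm_score=1.0, policy_logprob=-5.0, sft_logprob=-8.0)
```

The KL term is what keeps the policy from drifting arbitrarily far from the SFT model while it chases reward-model score.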

Engineering pains: PPO instability at LLM scale, reward hacking (the policy exploits flaws in the reward model), many moving parts.

  • DPO skips: the explicit reward model AND the RL step.
  • Uses: the closed-form relationship between the optimal policy and the reward under the KL-regularized RLHF objective, which lets preference data train the policy directly.
  • Pipeline: one supervised-shaped loss on (prompt, preferred, dispreferred) examples, with the frozen SFT model as the KL reference.
  • Result: more stable training, simpler hyperparameters, and often comparable quality on preference benchmarks. A common default.
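
The single DPO loss from the list above can be written in a few lines. This is a minimal sketch for one preference pair, assuming sequence-level log-probabilities have already been computed for both the trainable policy and the frozen SFT reference; the function name, argument names, and `beta=0.1` are illustrative choices, not values taken from this cheatsheet.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (prompt, preferred, dispreferred) example.

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model. beta sets how strongly the
    policy is held near the reference.
    """
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# At initialization the policy equals the reference, so the margin is 0.
loss_at_init = dpo_loss(-12.0, -12.0, -12.0, -12.0)

# After some training the policy has shifted probability toward the
# preferred response and away from the dispreferred one.
loss_after = dpo_loss(-10.0, -14.0, -12.0, -12.0)
```

No reward model and no sampling loop appear anywhere: the log-ratio against the reference plays the role of an implicit reward, which is why the training loop looks like ordinary supervised learning.
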

| | Before preference tuning | After preference tuning |
| --- | --- | --- |
| Output distribution | Spread over plausible responses | Concentrated on the preferred direction |
| Model character | SFT-shaped | Opinionated in the preference data’s direction |

In scope: mechanics (data, training loops, distribution shifts). Out of scope: whether the resulting opinion is “aligned with human values” or “safe in deployment” (contested debates).

| Situation | Reach for |
| --- | --- |
| Limited bandwidth / no RL expertise | DPO |
| Want explicit reward shaping flexibility | RLHF (PPO) |
| Default first try | DPO (simpler with the same target) |

  • SFT (supervised fine-tuning): cross-entropy on instruction-response pairs.
  • Preference pair: (prompt, response_A, response_B, preferred).
  • Reward model: scores (prompt, response) to a scalar; fit on preference data.
  • PPO (Proximal Policy Optimization): the RL algorithm used in classic RLHF.
  • DPO (Direct Preference Optimization): trains the policy directly on preference pairs.
  • KL penalty / SFT reference: constrains the optimized policy to stay near the SFT initialization.
  • Stanford CS336, Lecture 15 (Mid/post-training: SFT/RLHF), by Hashimoto and Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.