Summary: Post-training, SFT and RLHF
A pretrained base model is a next-token predictor; post-training turns it into an assistant. The pipeline: pretrain -> SFT -> preference tuning (RLHF or DPO). SFT continues training the base model on chat-formatted instruction-response pairs, teaching format and basic behavior. The SFT objective only imitates; it has no way to say that one plausible response is better than another, so preference tuning follows on (prompt, response_A, response_B, preferred) records. RLHF does it in three steps: collect preference data, train a reward model, then update the SFT policy with RL (PPO) under a KL constraint to the SFT init. DPO simplifies: skip the reward model and the RL step, and train directly on the pairs with a loss derived from the closed-form solution; it is typically more stable and is the common default for modern open models. Mechanically, preference tuning shifts the output distribution toward the preferred responses and makes the model opinionated in that specific direction. The treatment is a technical primer throughout; contested questions about alignment or safety are out of scope.
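The DPO loss described above can be sketched for a single preference pair as follows. This is a minimal illustration, not a library implementation: the function and argument names are hypothetical, the inputs are assumed to be summed log-probabilities of each full response under the policy and under the frozen reference (SFT) model, and `beta = 0.1` is only an illustrative value for the strength of the pull toward the reference.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, an assumption of this sketch
) -> float:
    """DPO loss for one (prompt, chosen, rejected) example.

    Each argument is the summed log-probability of a whole response
    given the prompt, under either the policy or the reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Logistic loss on the scaled difference of the two margins.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# When the policy equals the reference, both margins are zero
# and the loss is exactly log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the chosen response and lowering the rejected one,
# relative to the reference, pushes the loss below that baseline.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

Note that no reward model and no sampling loop appear here: the only inputs are log-probabilities from two forward passes per response, which is why the loss has the shape of ordinary supervised training.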
Core ideas
- Pipeline: pretrain (language) -> SFT (instruction-following, format) -> preference tuning (rank plausible outputs). Each is necessary.
- SFT data: chat-formatted instruction-response pairs; small, curated; quality > quantity. Same cross-entropy loss as pretraining, often with LoRA/PEFT for affordability. Loss-masking grades only the assistant turns.
- SFT vs preference tuning: SFT imitates good responses (teaches format + behavior); preference tuning expresses “A is better than B” (teaches ranking).
- RLHF: (1) preference data, (2) reward model, (3) RL update of SFT policy (PPO) with a KL penalty to SFT init. Original recipe; works; known engineering pains (PPO instability, reward hacking where the policy exploits flaws in the reward model, many moving parts).
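- DPO: under the standard (Bradley-Terry) preference-model assumption, the KL-constrained RLHF objective has a closed-form optimal policy, so the reward can be rewritten in terms of policy and reference log-probabilities; DPO optimizes that directly on the preference pairs with one supervised-shaped loss. Simpler, typically more stable, modern default.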
- Mechanical effect: distribution shifts toward preferred responses; the model becomes opinionated in the preference data’s direction. Out of scope: whether that opinion solves deeper alignment / safety questions.
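The loss-masking idea from the SFT bullet above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the helper names are hypothetical, each position is assumed to already carry the log-probability the model assigned to the correct token (the usual one-position label shift is left out), and `-100` is used as the ignore sentinel because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # sentinel for positions that should not be graded


def mask_labels(token_ids: list[int], is_assistant: list[bool]) -> list[int]:
    """Keep labels on assistant tokens; replace all others with IGNORE_INDEX."""
    return [
        tok if keep else IGNORE_INDEX
        for tok, keep in zip(token_ids, is_assistant)
    ]


def masked_cross_entropy(token_logprobs: list[float], labels: list[int]) -> float:
    """Mean negative log-likelihood over the unmasked positions only.

    token_logprobs[i] is the log-probability the model gave to the
    correct token at position i (an assumption of this sketch).
    """
    kept = [
        -logp
        for logp, label in zip(token_logprobs, labels)
        if label != IGNORE_INDEX
    ]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical five-token conversation: two user tokens, three assistant tokens.
token_ids = [11, 12, 13, 14, 15]
is_assistant = [False, False, True, True, True]
token_logprobs = [-5.0, -4.0, -1.0, -2.0, -3.0]

labels = mask_labels(token_ids, is_assistant)
loss = masked_cross_entropy(token_logprobs, labels)  # averages only 1.0, 2.0, 3.0
```

The user-turn tokens still pass through the model as context; they simply contribute nothing to the loss, so the model is not trained to imitate the user's side of the conversation.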
What changes for you
Post-training is the unglamorous step that separates a base model on the Hub from an assistant a user actually wants to talk to. The honest mechanical picture (SFT teaches format and behavior; preference tuning shapes which response wins among plausible ones) demystifies a lot of what gets discussed in the abstract. The recent shift from PPO-based RLHF to DPO is the kind of “simpler with the same target” win the field has rapidly converged on, in line with the preference for evidence and simplicity that the scaling-laws and evaluation lessons encouraged. With the post-training pipeline understood at the mechanical level, the next lesson, the track capstone, turns to RL applied to reasoning specifically, where the reward is verifiable correctness rather than human preference.
A pretrained base model is a next-token predictor; SFT teaches it to follow instructions; preference tuning (RLHF or, increasingly, DPO) shapes which response it prefers among plausible ones. Both run on the Trainer-class loops you already know, and DPO’s recent dominance is a simpler-same-target win the field has converged on.