Post-training, SFT and RLHF: brief

What you’ll learn

A pretrained base model is a sophisticated next-token predictor; post-training turns it into something users actually talk to. The source curriculum is Stanford CS336, Lecture 15, by Tatsunori Hashimoto and Percy Liang, with lectures freely available on YouTube and the course at cs336.stanford.edu.

You will state the post-training pipeline (pretrain -> SFT -> preference tuning); describe SFT mechanics on chat-formatted instruction-response data and why quality matters more than quantity; understand why SFT alone cannot rank two plausible responses; walk RLHF’s three steps (preference data, reward model, PPO update of the SFT policy with a KL penalty) and DPO’s simplification (skip the reward model and RL step; train directly on preference pairs with a closed-form loss); and describe what preference tuning changes mechanically in the model’s distribution.

§6 framing note: this lesson is taught at a strictly technical-primer level. RLHF and DPO are named as factual pipeline stages with their mechanics explained. Contested questions about whether these methods solve deeper alignment or safety problems are out of scope, the same discipline applied in Track 14 lesson 10.

Where this fits

This is lesson 13 of 14, the fifth lesson of Phase 3 (scale, data, and alignment). It builds on the data lessons (11 and 12, where the SFT and preference-data pipelines actually live) and on Track 14 lesson 10 (the using-side companion). The capstone lesson (14) keeps the RL machinery introduced here but changes the reward from human preference to verifiable correctness, the modern reasoning-model recipe.

Before you start

Prerequisites: lesson 12 (data filtering, dedup, mixing, synthetic, all directly relevant to building SFT and preference datasets). Track 14 lesson 10 is the using-side companion for SFT mechanics; this lesson is the from-scratch / lab POV.

About the math

Light and conceptual. The RLHF/DPO mathematics is real but not derived here; this lesson states what each method does, what data it consumes, and what changes in the model. The “closed-form relationship between optimal policy and preference data” behind DPO is described, not derived; the canonical paper is referenced for the derivation.

By the end, you’ll be able to

The single capability this lesson builds: explain how supervised fine-tuning and RLHF turn a base model into a usable assistant. Concretely, you will be able to:

State the post-training pipeline (pretrain -> SFT -> preference tuning)
Describe SFT mechanics and why data quality matters
Explain why SFT alone cannot rank plausible responses
Walk RLHF’s three steps and DPO’s simplification
Describe what preference tuning changes in the model’s distribution

Time and difficulty

Read time: about 13 minutes
Practice time: about 10 minutes (method-choice exercise + DPO-vs-SFT clarification, plus flashcards)
Difficulty: deep (Stage C; conceptual, technical-primer; no derivations of RLHF/DPO math)