References: Post-training, SFT and RLHF
Source material
Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 15: Mid/post-training (SFT/RLHF)
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture videos are on YouTube under standard terms; slides are public on GitHub without a stated license.
  Required attribution: "Based on the structure of Stanford CS336, 'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang (cs336.stanford.edu). This is an independent structural mirror in original prose; it reproduces no course materials, and Stanford does not endorse it."

This lesson mirrors the structure of Lecture 15 (post-training). Clawdemy's lessons are original prose that follows the pedagogical arc of the course. Because the source publishes no explicit license, we cite it as a recommended companion and reproduce none of its materials. This lesson is taught at a strictly technical-primer level; contested debates about alignment and safety are out of scope.
Watch this next
- Stanford CS336, Lecture 15: Mid/post-training (SFT/RLHF) by Hashimoto and Liang. The lecture this lesson mirrors. It walks through the RLHF mechanics with more attention to PPO details and the practical engineering pain points.
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- “Training language models to follow instructions with human feedback” by Ouyang et al. (2022), the InstructGPT paper. The canonical RLHF-applied-to-LLMs paper. Read it for the original three-step pipeline at scale.
- “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” by Rafailov et al. (2023), the DPO paper. The derivation showing how to skip the reward model and RL, with a clean loss. The technical core of the modern post-training default.
- The TRL library documentation. Reference implementations of SFTTrainer, DPOTrainer, and PPOTrainer. The fastest way to see the three approaches side by side in code.
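As orientation before reading the DPO paper, here is a minimal sketch of the per-example loss it derives, written in plain Python. The function name `dpo_loss`, the helper `_softplus`, the argument names, and the default `beta=0.1` are illustrative assumptions, not TRL's API. The inputs are assumed to be summed sequence log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model.

```python
import math


def _softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss (hypothetical helper, for illustration only)."""
    # Implicit rewards: how much more the policy favors each response
    # than the frozen reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return _softplus(-margin)


# Policy identical to the reference: margin is 0, so the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy shifted toward the chosen response: the loss drops below log(2).
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

No reward model or sampling loop appears anywhere in the computation, which is the sense in which DPO skips the reward model and RL. In practice the log-probabilities come from batched forward passes, and the loss is averaged over a preference dataset.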
Adjacent topics
Where this connects inside the track.
- Data, part 2 (lesson 12). Synthetic data techniques are commonly used to generate SFT and preference data at scale. The same filter-and-dedup discipline applies to that data too.
- Reasoning and alignment RL (lesson 14, capstone). The next lesson keeps the RL machinery introduced here but changes the reward signal from human preference to verifiable correctness (RLVR), the modern reasoning-model recipe.
- Track 14 lesson 10 (Fine-tuning LLMs: SFT). The using-side companion: the same SFT mechanics through the TRL library, framed for practitioners using existing models rather than building one from scratch.