References: Post-training, SFT and RLHF

Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 15:
Mid/post-training (SFT/RLHF)
Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
Course page: https://cs336.stanford.edu/
Lecture videos: YouTube playlist
https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
License: no explicit license is published on the course site; lecture
videos are on YouTube under standard terms; slides are public on GitHub
without a stated license.
Required attribution: "Based on the structure of Stanford CS336,
'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang
(cs336.stanford.edu). This is an independent structural mirror in
original prose; it reproduces no course materials, and Stanford does
not endorse it."
This lesson mirrors the structure of Lecture 15 (post-training). Clawdemy's
lessons are original prose that follows the pedagogical arc of the course.
Because the source publishes no explicit license, we cite it as a recommended
companion and reproduce none of its materials. This lesson is taught at a
strictly technical-primer level; contested debates about alignment and
safety are out of scope.

A short, durable list. Each link is a specific next step, not a generic pile.

Where this connects inside the track.

  • Data, part 2 (lesson 12). Synthetic data techniques are commonly used to generate SFT and preference data at scale, and the filter-and-dedup discipline from that lesson applies to the generated data as well.

  • Reasoning and alignment RL (lesson 14, capstone). The next lesson keeps the RL machinery introduced here but changes the reward signal from human preference to verifiable correctness (RLVR), the modern reasoning-model recipe.

  • Track 14 lesson 10 (Fine-tuning LLMs: SFT). The practitioner-side companion: it covers the same SFT mechanics through the TRL library, for readers fine-tuning existing models rather than building one from scratch.
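
As a reminder of what the filter-and-dedup discipline mentioned in the first bullet can look like on SFT data, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the method of any lesson or library: the "prompt" and "response" field names, the `min_response_chars` threshold, and the helper names are all hypothetical, and real pipelines usually add near-duplicate detection and quality filters on top of this.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.strip().lower())


def filter_and_dedup(examples, min_response_chars=20):
    """Drop very short responses, then drop exact duplicates after normalization.

    `examples` is a list of dicts with hypothetical "prompt" and "response" keys.
    The first occurrence of each (prompt, response) pair is kept.
    """
    seen = set()
    kept = []
    for ex in examples:
        # Filter step: skip responses too short to be useful demonstrations.
        if len(ex["response"].strip()) < min_response_chars:
            continue
        # Dedup step: hash the normalized pair; the NUL byte separates fields.
        joined = normalize(ex["prompt"]) + "\x00" + normalize(ex["response"])
        key = hashlib.sha256(joined.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(ex)
    return kept


examples = [
    {"prompt": "What is SFT?", "response": "Supervised fine-tuning on demonstration data."},
    {"prompt": "what is  SFT?", "response": "Supervised fine-tuning on  demonstration data."},
    {"prompt": "Hi", "response": "ok"},
]
kept = filter_and_dedup(examples)
```

With the three toy examples above, the second entry differs from the first only in case and spacing, so it is treated as a duplicate, and the third is filtered out for its short response. Exact-match hashing is the simplest choice here; it will not catch paraphrased duplicates.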