References: Reasoning and alignment, RL with verifiable rewards
Source material
Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 16: Post-training (RLVR)
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture videos are on YouTube under standard terms; slides are public on GitHub without a stated license.
  Required attribution: "Based on the structure of Stanford CS336, 'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang (cs336.stanford.edu). This is an independent structural mirror in original prose; it reproduces no course materials, and Stanford does not endorse it."
This lesson builds on Lecture 16 (post-training with RLVR) and extends it into a reasoning-and-RL-as-systems synthesis as the track capstone; the RL-as-systems framing here is the lesson's own, not a separate course lecture. Clawdemy's lessons are original prose that follows the pedagogical arc of the course. Because the source publishes no explicit license, we cite it as a recommended companion and reproduce none of its materials. The lesson is taught at a strictly technical-primer level; contested debates about alignment or safety are out of scope.
Watch this next
- Stanford CS336, Lecture 16: Post-training (RLVR) by Hashimoto. The lecture this lesson builds on, with the GRPO mechanics in more depth; the RL-as-systems framing here is the lesson’s own synthesis.
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning” by DeepSeek-AI (2025). The paper that made reasoning-via-RL widely known; includes the recipe (SFT, then RLVR with GRPO) and ablations.
- Open R1 (GitHub). The open community reproduction of DeepSeek-R1. Reading the README and recent issues shows how a frontier capability gets rebuilt in the open: code, datasets, recipes, evaluation.
- “Group Relative Policy Optimization” (GRPO) in TRL. The TRL documentation for GRPO: the reference for the loss and the loop, and the fastest way to see the algorithm in code alongside SFTTrainer and DPOTrainer.
Adjacent topics
Where this connects inside the track.
- Post-training SFT and RLHF (lesson 13). RLVR is the next stage after that lesson’s SFT + preference tuning, with a different reward signal (verifiable vs preference) and the same TRL library.
- Inference (lesson 8). RLVR sampling is decoding at scale; every inference-economics technique from there (KV cache, batching, GQA, speculative decoding, quantization) applies inside sample workers.
- Counting the cost / scaling laws (lessons 2 and 9). RLVR’s compute is dominated by sampling, not gradient updates; the 6ND-style accounting needs adjusting for the sample/verify cost.
- What transformers do (Track 14 lesson 1). The bookend: that lesson’s working picture (tokens in, tokens out, attention in the middle) still describes the reasoning models RLVR trains.