
References: Reasoning and alignment, RL with verifiable rewards

Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 16:
Post-training (RLVR)
Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
Course page: https://cs336.stanford.edu/
Lecture videos: YouTube playlist
https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
License: no explicit license is published on the course site; lecture
videos are on YouTube under standard terms; slides are public on GitHub
without a stated license.
Required attribution: "Based on the structure of Stanford CS336,
'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang
(cs336.stanford.edu). This is an independent structural mirror in
original prose; it reproduces no course materials, and Stanford does
not endorse it."
This lesson builds on Lecture 16 (post-training with RLVR) and, as the track
capstone, extends it into a synthesis of reasoning and RL as a systems problem.
That RL-as-systems framing is the lesson's own, not a separate course lecture.
Clawdemy's lessons are original prose that follows the pedagogical arc of the
course. Because the source publishes no explicit license, we cite it as a
recommended companion and reproduce none of its materials. The lesson is taught
at a strictly technical-primer level; contested debates about alignment or
safety are out of scope.

A short, durable list. Each link is a specific next step, not a generic pile.

Where this connects inside the track.

  • Post-training SFT and RLHF (lesson 13). RLVR is the stage that follows that lesson’s SFT and preference tuning. The reward signal changes (a verifiable check rather than a preference signal), but the TRL library is the same.

  • Inference (lesson 8). RLVR sampling is decoding at scale; every inference-economics technique from that lesson (KV cache, batching, GQA, speculative decoding, quantization) applies inside the sampling workers.

  • Counting the cost / scaling laws (lessons 2 and 9). RLVR’s compute is dominated by sampling, not gradient updates, so the 6ND-style accounting (training FLOPs ≈ 6 × parameters × tokens) needs extra terms for the cost of sampling and verifying.

  • What transformers do (Track 14 lesson 1). The bookend: that lesson’s working picture (tokens in, tokens out, attention in the middle) still describes the reasoning models RLVR trains.
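
As a rough illustration of the cost-accounting point above, here is a minimal sketch of how a 6ND-style estimate might be extended for one RLVR step. It uses the common approximations of about 2N FLOPs per token for a forward-only pass (sampling) and about 6N FLOPs per token for forward plus backward (the gradient update). Everything else is an assumption made for illustration, not something from the course or any library: the `RLVRStep` fields, the utilization figures, and the per-sample verifier time are hypothetical placeholders, and the sketch ignores prompt tokens, reference-model passes, and communication.

```python
from dataclasses import dataclass


@dataclass
class RLVRStep:
    """Hypothetical description of one RLVR step (all fields are illustrative)."""
    n_params: float                  # N: model parameter count
    prompts: int                     # prompts per step
    samples_per_prompt: int          # rollouts sampled per prompt
    tokens_per_sample: int           # average generated tokens per rollout
    sample_util: float = 0.10        # assumed hardware utilization while decoding
    train_util: float = 0.40         # assumed utilization during the gradient update
    verify_seconds_per_sample: float = 0.05  # assumed verifier time per rollout


def step_flops(step: RLVRStep) -> dict:
    """Raw FLOPs: ~2N per sampled token, ~6N per trained token."""
    tokens = step.prompts * step.samples_per_prompt * step.tokens_per_sample
    return {
        "sample": 2 * step.n_params * tokens,  # forward only
        "train": 6 * step.n_params * tokens,   # forward + backward on the same tokens
    }


def step_seconds(step: RLVRStep, peak_flops_per_second: float) -> dict:
    """Time estimate: raw FLOPs divided by the throughput actually achieved."""
    flops = step_flops(step)
    n_samples = step.prompts * step.samples_per_prompt
    return {
        "sample": flops["sample"] / (step.sample_util * peak_flops_per_second),
        "train": flops["train"] / (step.train_util * peak_flops_per_second),
        "verify": n_samples * step.verify_seconds_per_sample,
    }


# Illustrative numbers only: a 7e9-parameter model on hardware with 1e15 peak FLOP/s.
example = RLVRStep(n_params=7e9, prompts=256, samples_per_prompt=8,
                   tokens_per_sample=1000)
example_flops = step_flops(example)
example_seconds = step_seconds(example, peak_flops_per_second=1e15)
sampling_share = example_seconds["sample"] / (
    example_seconds["sample"] + example_seconds["train"])
print(example_flops, example_seconds, sampling_share)
```

Under these assumptions the update costs more raw FLOPs per token than sampling does, yet sampling takes more time because decoding is assumed to run at much lower utilization. Whether sampling dominates in a real run depends on the measured utilization, the rollout length, and how many sampled tokens are actually trained on, so treat the defaults as knobs to replace with measurements.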