References: Reasoning and alignment, RL with verifiable rewards

Source material

Source curriculum (structural mirror, cited as further study):
• Stanford CS336, "Language Modeling from Scratch", Lecture 16:
    Post-training (RLVR)
  Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  Course page: https://cs336.stanford.edu/
  Lecture videos: YouTube playlist
    https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  License: no explicit license is published on the course site; lecture
    videos are on YouTube under standard terms; slides are public on GitHub
    without a stated license.
  Required attribution: "Based on the structure of Stanford CS336,
    'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang
    (cs336.stanford.edu). This is an independent structural mirror in
    original prose; it reproduces no course materials, and Stanford does
    not endorse it."
This lesson builds on Lecture 16 (post-training with RLVR) and, as the track
capstone, extends it into a synthesis of reasoning and RL as a systems problem;
that RL-as-systems framing is the lesson's own, not a separate course lecture.
Clawdemy's lessons are original prose that follows the pedagogical arc of the
course. Because the source publishes no explicit license, we cite it as a
recommended companion and reproduce none of its materials. The lesson is taught
at a strictly technical-primer level; contested debates about alignment or
safety are out of scope.

Watch this next

Stanford CS336, Lecture 16: Post-training (RLVR) by Hashimoto. This is the lecture the lesson builds on, and it covers the GRPO mechanics in more depth; the RL-as-systems framing here is the lesson’s own synthesis.

Going deeper

A short, durable list. Each link is a specific next step, not a generic pile.

“DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning” by DeepSeek-AI (2025). The paper that made reasoning-via-RL widely known; includes the recipe (SFT, then RLVR with GRPO) and ablations.
Open R1 (GitHub). The open community reproduction of DeepSeek-R1. Reading the README and recent issues shows how a frontier capability gets rebuilt in the open: code, datasets, recipes, evaluation.
“Group Relative Policy Optimization” (GRPO) in TRL. The TRL documentation for GRPO. It is the reference for the loss and the loop, and the fastest way to see the algorithm in code alongside SFTTrainer and DPOTrainer.

Adjacent topics

Where this connects inside the track.

Post-training SFT and RLHF (lesson 13). RLVR is the next stage after that lesson’s SFT + preference tuning, with a different reward signal (verifiable vs preference) and the same TRL library.
Inference (lesson 8). RLVR sampling is decoding at scale; every inference-economics technique from that lesson (KV cache, batching, GQA, speculative decoding, quantization) applies inside the sample workers.
Counting the cost / scaling laws (lessons 2 and 9). RLVR’s compute is dominated by sampling, not gradient updates, so the 6ND-style accounting has to be adjusted to include the cost of sampling and verification.
What transformers do (Track 14 lesson 1). The bookend: that lesson’s working picture (tokens in, tokens out, attention in the middle) still describes the reasoning models RLVR trains.