Reasoning and alignment, RL with verifiable rewards
What you’ll learn
This is the track capstone. It covers the current reasoning frontier, then steps back over the whole track. The source curriculum is Stanford CS336, Lecture 16 (post-training with RLVR), by Tatsunori Hashimoto and Percy Liang, with lectures freely available on YouTube and the course at cs336.stanford.edu. The reasoning-as-a-systems-problem framing is the lesson’s own synthesis.
You will learn what reasoning models add over ordinary LLMs and why one-shotting multi-step problems fails. You will also explain RLVR’s reward signal (verifiable correctness) and how it differs from RLHF (no learned reward model, so no reward-model hacking); walk the RLVR training loop and describe GRPO (Group Relative Policy Optimization, the RL algorithm available in TRL); explain why RL at LLM scale is mostly a systems problem (sample, verify, and train workers); and see self-improvement loops as the verifier-grounded synthetic-data pattern. The lesson closes by tying the whole track together as the capstone synthesis.
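The verifiable reward and GRPO's group-relative scoring can be made concrete with a small sketch. It is illustrative only and not taken from the lecture or from TRL: the `Answer:` marker convention, the function names `verify` and `group_relative_advantages`, and the `eps` constant are all assumptions. It shows a binary correctness check standing in for a reward model, and several samples for one prompt being scored relative to their own group's mean and spread.

```python
from statistics import mean, pstdev


def verify(completion: str, reference: str) -> float:
    """Binary verifiable reward: 1.0 if the final answer matches, else 0.0.

    Assumption: the final answer is whatever follows the last "Answer:"
    marker. Real verifiers (math checkers, unit tests) are more involved.
    """
    answer = completion.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if answer == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of samples.

    Simplified version of the group-relative idea: each sample is scored
    against the group mean and standard deviation, so no separate value
    network is needed. eps avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical samples for one prompt whose reference answer is "12".
completions = [
    "3 * 4 = 12. Answer: 12",
    "3 + 4 = 7, so 13. Answer: 13",
    "Answer: 12",
    "I think 12",  # no marker, so the verifier rejects it
]
rewards = [verify(c, "12") for c in completions]
advantages = group_relative_advantages(rewards)
```

Correct samples end up with positive advantages and incorrect ones with negative advantages. When every sample in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.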
Framing note (§6): this lesson is a technical primer throughout, following the same discipline as lesson 13 and Track 14 lesson 12. RLVR, GRPO, DeepSeek R1, and Open R1 are named as factual technical anchors. Contested debates about whether these methods solve deeper alignment problems are out of scope.
Where this fits
This is lesson 14 of 14, the sixth and final lesson of Phase 3 and the capstone of Track 15. It extends the post-training pipeline from lesson 13 one stage further (pretrain -> SFT -> preference tuning -> RLVR for reasoning), reuses the inference toolkit from lesson 8 inside sample workers, and synthesizes everything from lessons 1 through 13. The track ends here.
Before you start
Prerequisites: lesson 13 (the SFT + preference-tuning pipeline that produces the RLVR starting point, and the TRL framing that hosts GRPO alongside SFTTrainer and DPOTrainer). The whole prior track helps for the capstone synthesis; the more of it you have done, the more the closing tie-back lands.
About the math
None. The lesson describes RLVR and GRPO at a mechanical level (sample, verify, update with KL penalty) without derivations. The systems and self-improvement-loop arguments are conceptual.
By the end, you’ll be able to
The single capability this lesson builds: explain how reinforcement learning (including RL with verifiable rewards) is used to improve reasoning, and how it is run as a system. Concretely, you will be able to:
- Explain what reasoning models add and why one-shotting multi-step problems fails
- Explain RLVR’s reward signal and how it differs from RLHF
- Walk the RLVR training loop and describe GRPO
- Explain why RL at LLM scale is mostly a systems problem
- Describe self-improvement loops and the track-wide capstone synthesis
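As a rough picture of the "systems problem" outcome above, the sketch below wires three hypothetical workers together with queues: one samples, one verifies, one trains. Everything in it is an assumption made for illustration: the worker names, the canned completions standing in for an inference engine, and the list that stands in for gradient updates. Real deployments run these roles on separate GPU pools and also have to sync updated weights back to the samplers, which this sketch omits.

```python
import queue
import threading


def sample_worker(samples, out_q):
    # Stand-in for an inference engine: emits canned (prompt, completion) pairs.
    for prompt, completion in samples:
        out_q.put((prompt, completion))
    out_q.put(None)  # sentinel: no more samples


def verify_worker(in_q, out_q, answers):
    # Scores each completion against a known answer (the verifiable reward).
    while (item := in_q.get()) is not None:
        prompt, completion = item
        reward = 1.0 if completion == answers[prompt] else 0.0
        out_q.put((prompt, completion, reward))
    out_q.put(None)  # pass the sentinel downstream


def train_worker(in_q, log):
    # A real trainer would take a policy-gradient step; here we only record.
    while (item := in_q.get()) is not None:
        log.append(item)


def run_pipeline(samples, answers):
    sampled, scored, log = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=sample_worker, args=(samples, sampled)),
        threading.Thread(target=verify_worker, args=(sampled, scored, answers)),
        threading.Thread(target=train_worker, args=(scored, log)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


log = run_pipeline(
    samples=[("2+2", "4"), ("3*3", "6")],
    answers={"2+2": "4", "3*3": "9"},
)
```

The queues decouple the three roles, so a slow stage (typically sampling long reasoning traces) becomes the visible bottleneck rather than stalling the whole loop in lockstep.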
Time and difficulty
- Read time: about 14 minutes
- Practice time: about 12 minutes (capstone synthesis: place Phase-2 tools in RLVR and critically read a reasoning-model claim, plus flashcards)
- Difficulty: deep (Stage C; conceptual capstone with track-wide synthesis; technical-primer scope on RLVR mechanics)