Cheatsheet: Reasoning and alignment, RL with verifiable rewards
What reasoning models add
prompt -> THINKING (explicit step-by-step) -> ANSWER

Trained so the thinking actually improves the answer on multi-step problems (math, code with constraints, logic), where one-shotting often fails.
RLVR (RL with Verifiable Rewards)
| | RLHF (lesson 13) | RLVR (this lesson) |
|---|---|---|
| Reward | Human preference (learned reward model) | Verifiable check: math grader / code tests / puzzle validator |
| Reward model needed? | Yes | No (checker IS the reward) |
| Reward-model hacking? | Possible (policy games the model) | No learned reward model to game (check is fixed); a weak checker can still be exploited |
| Best for | Open-ended response shaping | Multi-step problems with checkable answers |
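The "checker IS the reward" row can be made concrete with a small sketch. It assumes a hypothetical trace format in which the final answer follows an `ANSWER:` marker; the function names and the exact-match rule are illustrative choices, not something the lesson prescribes.

```python
from fractions import Fraction
from typing import Optional


def extract_final_answer(trace: str) -> Optional[str]:
    """Return the text after the last 'ANSWER:' marker, or None if it is missing.

    The 'ANSWER:' convention is an assumption made for this sketch.
    """
    marker = "ANSWER:"
    idx = trace.rfind(marker)
    if idx == -1:
        return None
    return trace[idx + len(marker):].strip()


def math_reward(trace: str, expected: str) -> float:
    """Binary verifiable reward: 1.0 if the final answer equals `expected`, else 0.0."""
    answer = extract_final_answer(trace)
    if answer is None:
        return 0.0  # no parsable answer, no reward
    try:
        # Compare as exact rationals so "0.5" and "1/2" count as the same answer.
        return 1.0 if Fraction(answer) == Fraction(expected) else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0  # malformed numbers score zero instead of crashing the worker


print(math_reward("2 + 2 = 4. ANSWER: 4", "4"))
print(math_reward("Probably five. ANSWER: five", "5"))
```

Nothing here is learned, so the reward cannot drift during training. The trade-off is that the policy is only as well supervised as the check is strict.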
The training loop
1. Sample k reasoning traces per prompt (sample workers; inference)
2. Score each trace with the verifier (verifier workers; checker)
3. Update the policy on the rewards (train workers; gradient step) + KL penalty to start-of-step policy

Modern RL algorithm: GRPO (Group Relative Policy Optimization). Normalizes rewards within each k-group; no separate value network. Lives in TRL alongside SFTTrainer and DPOTrainer.
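A minimal sketch of the within-group normalization, assuming the common formulation in which each reward is centered on its group's mean and divided by the group's standard deviation. The function name, the `eps` guard, and the choice of population standard deviation are assumptions for illustration; this is not TRL's API.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Turn the k rewards for one prompt into group-relative advantages.

    The group's own mean acts as the baseline, which is why no separate
    value network is needed. `eps` (an assumption of this sketch) avoids
    division by zero when every trace in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# k = 4 traces for one prompt, scored by a binary verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # correct traces positive, wrong ones negative

# A group where every trace is correct carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

When every trace in a group is right, or every trace is wrong, all advantages are zero, so that prompt contributes nothing to the gradient step.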
Landscape anchors
| Term | What it is |
|---|---|
| DeepSeek R1 | Model that showed reasoning-via-RL at scale |
| Open R1 | Hugging Face community open reproduction |
| GRPO | The RL algorithm; in TRL |
RL as a systems problem
Three cost centers:
| Stage | Workload | Tooling |
|---|---|---|
| Sample | Many decode passes per prompt | Lesson 8: KV cache, continuous batching, paged attention, GQA, speculative decoding, quantization |
| Verify | Run the checker (code executor, grader, sometimes an LLM judge) | Often a separate worker cluster |
| Train | Gradient updates on rewarded traces | Lessons 6-7: kernels + parallelism (FSDP, TP) |
Coordinating sample + verify + train at scale is most of the engineering; the algorithm is the easy part.
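The three stages can be sketched as a toy pipeline joined by bounded queues. This is a single-process illustration only: the function names, the sentinel protocol, and the string stand-in for decoding are all hypothetical, and real systems run each stage on its own worker cluster.

```python
import queue
import threading

STOP = object()  # sentinel telling the downstream stage to shut down


def sampler(prompts, k, out_q):
    """Sample stage: emit k stand-in traces per prompt."""
    for prompt in prompts:
        for i in range(k):
            # Placeholder for a decode pass on an inference server.
            out_q.put((prompt, f"{prompt} -> guess {i}"))
    out_q.put(STOP)


def verifier(in_q, out_q, check):
    """Verify stage: attach a reward from the checker to every trace."""
    while (item := in_q.get()) is not STOP:
        prompt, trace = item
        out_q.put((prompt, trace, check(prompt, trace)))
    out_q.put(STOP)


def trainer(in_q, batch_size, log):
    """Train stage: consume scored traces in batches.

    Recording the batch's mean reward stands in for a gradient step.
    """
    batch = []
    while (item := in_q.get()) is not STOP:
        batch.append(item)
        if len(batch) == batch_size:
            log.append(sum(reward for _, _, reward in batch) / batch_size)
            batch = []
    if batch:  # flush a final partial batch
        log.append(sum(reward for _, _, reward in batch) / len(batch))


def run_pipeline(prompts, k, check, batch_size=4):
    # Bounded queues apply back-pressure: a fast sampler blocks rather than
    # piling up traces the verifier has not reached yet.
    traces_q = queue.Queue(maxsize=8)
    scored_q = queue.Queue(maxsize=8)
    log = []
    stages = [
        threading.Thread(target=sampler, args=(prompts, k, traces_q)),
        threading.Thread(target=verifier, args=(traces_q, scored_q, check)),
        threading.Thread(target=trainer, args=(scored_q, batch_size, log)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return log


# Toy checker: only the first guess for each prompt counts as correct.
print(run_pipeline(["a", "b"], k=4, check=lambda p, t: float(t.endswith("guess 0"))))
```

The bounded queues show the coordination problem in miniature: whichever stage is slowest sets the throughput of the whole loop.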
Self-improvement loops
attempt many verifiable problems -> filter correct traces -> new SFT (or preference) data -> retrain -> iterate

Verifier prevents blind-spot amplification on verifiable problems (no teacher drift; ground truth is the check).
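The filter step of the loop can be sketched as follows, assuming hypothetical `sample_fn` and `verify_fn` callables that stand in for the model and the checker. The deduplication and the prompt/completion record shape are illustrative choices, not part of the lesson.

```python
def collect_sft_data(problems, sample_fn, verify_fn, k=4):
    """Keep only verifier-approved traces as new training examples.

    sample_fn(problem) -> trace and verify_fn(problem, trace) -> bool are
    hypothetical stand-ins for the policy and the checker.
    """
    dataset = []
    for problem in problems:
        seen = set()
        for _ in range(k):
            trace = sample_fn(problem)
            if trace in seen:
                continue  # skip exact duplicates so one trace is not over-weighted
            seen.add(trace)
            if verify_fn(problem, trace):
                dataset.append({"prompt": problem, "completion": trace})
    return dataset


# Toy usage: the "model" cycles through canned answers and the checker knows 2+2.
import itertools

canned = itertools.cycle(["3", "4", "4", "5"])
data = collect_sft_data(
    ["2+2"],
    sample_fn=lambda problem: next(canned),
    verify_fn=lambda problem, trace: trace == "4",
)
print(data)
```

Only traces that pass the check reach the next round's training set, which is what grounds the loop in the verifier rather than in the model's own judgment.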
Track-wide map (the capstone synthesis)
Phase 1 (model): tokenizer (L1) -> accounting (L2) -> architecture (L3-L4)
Phase 2 (systems): hardware (L5) -> kernels (L6) -> parallelism (L7) -> inference (L8)
Phase 3 (good): scaling (L9) -> evaluation (L10) -> data (L11-L12) -> post-training (L13)
Capstone (L14): RLVR uses every layer above at once

The method that survives the next frontier:
- Account for compute (FLOPs, memory, arithmetic intensity)
- Exploit the hardware (kernels, parallelism, inference economics)
- Scale honestly (Chinchilla + inference-cost adjustment)
- Evaluate the portfolio (not any single benchmark)
- Curate data (filter, dedup, mix, synthetic with verifier)
- Post-train deliberately (SFT, preference tuning, RLVR for reasoning)
Words to use precisely
- RLVR: RL with verifiable rewards; checker IS the reward.
- GRPO: Group Relative Policy Optimization; modern reasoning RL algorithm.
- Sample / verify / train workers: the three roles in production RL infrastructure.
- Self-improvement loop: filter correct traces back into training data; verifier-grounded.
- Reasoning trace: the explicit step-by-step thinking before the answer.
Source
- Stanford CS336, Lecture 16 (Post-training RLVR), by Hashimoto and Liang.
cs336.stanford.edu. Independent structural mirror in original prose; the RL-as-systems framing is the lesson’s own synthesis; see references.