Cheatsheet: Reasoning and alignment, RL with verifiable rewards

prompt -> THINKING (explicit step-by-step) -> ANSWER

The model is trained so that the thinking actually improves the answer on multi-step problems (math, code with constraints, logic), where answering in one shot often fails.
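The prompt -> THINKING -> ANSWER shape above can be made concrete with a small parser. This is a minimal sketch, not a format the lesson specifies: it assumes the reasoning is wrapped in `<think>...</think>` tags with the answer following, and `split_trace` is a hypothetical helper name.

```python
import re

# Assumed trace format (not from the lesson): <think>reasoning</think>answer
_TRACE = re.compile(r"<think>(.*?)</think>(.*)", re.DOTALL)

def split_trace(completion: str):
    """Return (thinking, answer), or None if the completion is malformed."""
    match = _TRACE.fullmatch(completion.strip())
    if match is None:
        return None  # no thinking block: a verifier could score this as 0
    return match.group(1).strip(), match.group(2).strip()

parsed = split_trace("<think>2 + 2 = 4</think> 4")
```

Only the answer part would normally be passed to the checker; the thinking is kept for training but is not graded directly in this sketch.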

| | RLHF (lesson 13) | RLVR (this lesson) |
| --- | --- | --- |
| Reward | Human preference (learned reward model) | Verifiable check: math grader / code tests / puzzle validator |
| Reward model needed? | Yes | No (checker IS the reward) |
| Reward-model hacking? | Possible (policy games the model) | Eliminated (check is fixed) |
| Best for | Open-ended response shaping | Multi-step problems with checkable answers |

1. Sample k reasoning traces per prompt (sample workers; inference)
2. Score each trace with the verifier (verifier workers; checker)
3. Update the policy on the rewards (train workers; gradient step)
   + KL penalty to start-of-step policy
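The three steps above can be sketched as a single function with one pluggable callable per worker role. This is an illustrative skeleton under stated assumptions: `rlvr_step`, `toy_sample`, and `exact_match` are hypothetical names, the sampler is a fixed list rather than a model, the update is just a callback, and the KL penalty is omitted.

```python
from typing import Callable, List, Tuple

Batch = List[Tuple[str, List[str], List[float]]]

def rlvr_step(
    problems: List[Tuple[str, str]],          # (prompt, expected answer) pairs
    sample: Callable[[str, int], List[str]],  # hypothetical sampler: prompt, k -> k traces
    verify: Callable[[str, str], float],      # hypothetical checker: trace, expected -> reward
    update: Callable[[Batch], None],          # hypothetical trainer: consumes scored traces
    k: int = 4,
) -> Batch:
    batch: Batch = []
    for prompt, expected in problems:
        traces = sample(prompt, k)                       # 1. sample workers
        rewards = [verify(t, expected) for t in traces]  # 2. verifier workers
        batch.append((prompt, traces, rewards))
    update(batch)                                        # 3. train workers
    return batch

# Toy stand-ins so the skeleton runs end to end (not real inference or training).
def toy_sample(prompt: str, k: int) -> List[str]:
    return ["4", "5", "4", "22"][:k]

def exact_match(trace: str, expected: str) -> float:
    return 1.0 if trace.strip() == expected.strip() else 0.0

seen: List[Batch] = []
batch = rlvr_step([("2 + 2 = ?", "4")], toy_sample, exact_match, seen.append, k=4)
```

Keeping the three roles behind separate callables mirrors the worker split: each can be scaled or replaced without touching the others.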

Modern RL algorithm: GRPO (Group Relative Policy Optimization). It normalizes rewards within each group of k traces sampled for the same prompt, so no separate value network is needed. Available in TRL (as GRPOTrainer) alongside SFTTrainer and DPOTrainer.
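The within-group normalization can be illustrated in a few lines. This is a minimal sketch of the advantage computation only, not the TRL implementation: the function name is hypothetical, and the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's k traces against each other.

    The group mean plays the role a learned value baseline would play;
    eps (an assumed constant) avoids division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# k = 4 traces for one prompt, two of which passed the verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Traces that beat their group's average get a positive advantage and the rest a negative one. If every trace in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.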

| Term | What it is |
| --- | --- |
| DeepSeek R1 | Model that showed reasoning-via-RL at scale |
| Open R1 | Hugging Face community open reproduction |
| GRPO | The RL algorithm; in TRL |

Three cost centers:

| Stage | Workers | Tooling |
| --- | --- | --- |
| Sample | Many decode passes per prompt | Lesson 8: KV cache, continuous batching, paged attention, GQA, speculative decoding, quantization |
| Verify | Run the checker (code executor, grader, sometimes an LLM judge) | Often a separate worker cluster |
| Train | Gradient updates on rewarded traces | Lessons 6-7: kernels + parallelism (FSDP, TP) |

Coordinating sample + verify + train at scale is most of the engineering. The algorithm is the easy part.

attempt many verifiable problems -> filter correct traces
-> new SFT (or preference) data -> retrain -> iterate

Verifier prevents blind-spot amplification on verifiable problems (no teacher drift; ground truth is the check).
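The filter step of this loop can be sketched as follows. It is an illustration under assumptions: `filter_correct` and the `prompt`/`completion` record layout are hypothetical, and the checker is a plain exact-match function standing in for a real grader or test runner.

```python
def filter_correct(attempts, verify):
    """Keep only traces the verifier accepts, dropping exact duplicates.

    attempts: iterable of (prompt, trace, expected) triples.
    verify:   hypothetical checker returning True/False for (trace, expected).
    """
    seen = set()
    kept = []
    for prompt, trace, expected in attempts:
        if verify(trace, expected) and (prompt, trace) not in seen:
            seen.add((prompt, trace))
            kept.append({"prompt": prompt, "completion": trace})
    return kept

attempts = [
    ("2 + 2 = ?", "4", "4"),
    ("2 + 2 = ?", "5", "4"),   # wrong: filtered out
    ("2 + 2 = ?", "4", "4"),   # duplicate: dropped
    ("3 * 3 = ?", "9", "9"),
]
sft_data = filter_correct(attempts, lambda trace, expected: trace == expected)
```

Because only verified traces survive, the next round of training data is grounded in the check rather than in the model's own judgment of its outputs.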

Phase 1 (model): tokenizer (L1) -> accounting (L2) -> architecture (L3-L4)
Phase 2 (systems): hardware (L5) -> kernels (L6) -> parallelism (L7) -> inference (L8)
Phase 3 (good): scaling (L9) -> evaluation (L10) -> data (L11-L12) -> post-training (L13)
Capstone (L14): RLVR uses every layer above at once

The method that survives the next frontier:

  • Account for compute (FLOPs, memory, arithmetic intensity)
  • Exploit the hardware (kernels, parallelism, inference economics)
  • Scale honestly (Chinchilla + inference-cost adjustment)
  • Evaluate the portfolio (not any single benchmark)
  • Curate data (filter, dedup, mix, synthetic with verifier)
  • Post-train deliberately (SFT, preference tuning, RLVR for reasoning)

Key terms:

  • RLVR: RL with verifiable rewards; checker IS the reward.
  • GRPO: Group Relative Policy Optimization; modern reasoning RL algorithm.
  • Sample / verify / train workers: the three roles in production RL infrastructure.
  • Self-improvement loop: filter correct traces back into training data; verifier-grounded.
  • Reasoning trace: the explicit step-by-step thinking before the answer.

Source: Stanford CS336, Lecture 16 (Post-training RLVR), by Hashimoto and Liang. cs336.stanford.edu. Independent structural mirror in original prose; the RL-as-systems framing is the lesson’s own synthesis; see references.