
Summary: Reasoning and alignment, RL with verifiable rewards

The track closes at the current frontier. Reasoning models generate explicit step-by-step thinking before the answer, trained so the thinking actually improves the answer on multi-step problems. RLVR (RL with Verifiable Rewards) replaces RLHF’s learned reward model with a verifiable check (math grader, code tests, puzzle validator); with no reward model, there is no reward model to hack. The loop: sample k reasoning traces per problem, score each with the verifier, and update the policy to favor high-reward traces, with a KL penalty pulling it back toward the policy it started from. The algorithm of the moment is GRPO, available in the same TRL library as SFTTrainer and DPOTrainer. The landscape anchors are DeepSeek R1 (showed it works at scale) and Open R1 (the open reproduction). At LLM scale, RL is mostly a systems problem: sample workers run inference at decode scale (lesson 8’s toolkit applies), verifier workers score, and train workers update; coordinating the three is most of the engineering. Self-improvement loops filter correct traces back into training data, with the verifier preventing teacher-blind-spot amplification. This lesson is also the track capstone: it uses every layer the track built. The lesson is taught as a technical primer; contested alignment debates are out of scope.
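
The sample, verify, update loop described above can be sketched in a few lines. This is a minimal illustration, not the TRL implementation: verify_math is a hypothetical exact-match grader, the four traces are made-up samples for one problem, and group_relative_advantages shows the group-normalized scoring idea associated with GRPO (each trace's reward is compared to the mean of its own group). The policy update and KL term are omitted.

```python
import statistics

def verify_math(trace: str, expected: str) -> float:
    """Hypothetical verifier: 1.0 if the last line of the trace equals the expected answer."""
    lines = trace.strip().splitlines()
    final = lines[-1].strip() if lines else ""
    return 1.0 if final == expected.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of k traces sampled for the same problem."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every trace gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Made-up traces for the prompt "What is (2 + 2) * 3?" (k = 4).
traces = [
    "2 + 2 = 4\n4 * 3 = 12\n12",
    "2 + 2 = 5\n5 * 3 = 15\n15",
    "(2 + 2) * 3\n12",
    "I think it is 10\n10",
]
rewards = [verify_math(t, "12") for t in traces]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.0f} advantage={a:+.2f}")
```

Correct traces end up with positive advantage and incorrect ones with negative advantage, which is the signal the update step uses to favor high-reward traces. If every trace in a group scores the same, all advantages are zero and that problem contributes no gradient.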

  • Reasoning gap. Ordinary LLMs one-shot multi-step problems and often miss. Reasoning models generate explicit thinking and are trained so the thinking improves correctness.
  • RLVR reward = verifiable correctness. Math grader, code tests, puzzle validator. No reward model; reward-model hacking eliminated.
  • The training loop. Sample k traces -> verify each -> update the policy to favor high-reward traces, with a KL penalty to the start-of-step policy. GRPO is the algorithm of the moment; it lives in TRL alongside SFTTrainer and DPOTrainer.
  • Landscape anchors. DeepSeek R1 demonstrated reasoning-via-RL at scale; Open R1 reproduces openly. GRPO + TRL + open datasets make the recipe accessible.
  • RL at LLM scale is mostly systems. Sample workers (inference economics from lesson 8 apply: KV cache, batching, kernels, GQA, speculative decoding), verifier workers (often model calls), train workers. Coordinating them is the bulk of the engineering.
  • Self-improvement loops. Filter correct traces from the current model, use as new training data, iterate. Verifier prevents blind-spot amplification on verifiable problems.
  • Track capstone. RLVR touches every layer built across the track: Phase-1 model, Phase-2 systems (sample + train + verifier), Phase-3 scaling laws, evaluation, data, post-training.
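
The self-improvement loop from the list above (filter correct traces, reuse them as training data) can be sketched as follows. Everything here is a toy assumption: make_toy_sampler stands in for a real model's generate call and simply cycles through canned outputs, exact_match_verifier is a hypothetical grader, and the prompt/completion record layout is one plausible shape for SFT data, not a required format.

```python
import itertools

def exact_match_verifier(trace: str, expected: str) -> bool:
    """Hypothetical verifier: the last line of the trace must equal the expected answer."""
    lines = trace.strip().splitlines()
    return bool(lines) and lines[-1].strip() == expected.strip()

def make_toy_sampler(candidates):
    """Stand-in for model sampling: returns canned traces in a fixed cycle."""
    cycle = itertools.cycle(candidates)
    return lambda prompt: next(cycle)

def filter_correct_traces(problems, sample_fn, verify_fn, k=3):
    """Sample k traces per problem and keep only those the verifier accepts."""
    kept = []
    for prompt, expected in problems:
        for _ in range(k):
            trace = sample_fn(prompt)
            if verify_fn(trace, expected):
                kept.append({"prompt": prompt, "completion": trace})
    return kept

problems = [("What is (2 + 2) * 3?", "12"), ("What is 7 - 4?", "3")]
sampler = make_toy_sampler(["thinking...\n12", "thinking...\n3", "thinking...\n10"])

# One round: the kept records become training data for the next iteration.
kept = filter_correct_traces(problems, sampler, exact_match_verifier, k=3)
print(len(kept), "verified traces kept")
```

Because only verifier-approved traces survive, a wrong answer the model produces confidently never enters the next round's data. That guarantee holds only for problems the verifier can actually check.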

This is graduation, and the message is not that you have memorized today’s frontier. Reasoning is the headline now; something else will be next. What you leave with is the method: the loop that runs the same regardless of model, and the open tools (TRL, kernels, datasets, Open R1) to apply it. That combination is what lets you pick up the next capability and use it rather than watch from outside. AI is not a fixed body of knowledge you either have or lack; it is a fast-moving ecosystem, and staying useful in it means holding the method steady while the specifics churn. You now have that method. The frontier will keep moving, and you can move with it.

You built it from scratch. RLVR is the current frontier; the loop is sample, verify, update, and the systems work of Phase 2 returns to run it at scale. The track gave you the whole pipeline and, more durably, the method that survives the next frontier and the one after that.