Summary: How reasoning models think differently

Reasoning models are trained, not just prompted. A standard LLM is trained to predict the next token; step-by-step reasoning appears only when the prompt asks for it. A reasoning model has been pushed by training (often reinforcement learning, RL) to produce long internal reasoning chains as part of its policy. The output of a reasoning-model call is a reasoning chain followed by a final answer, both generated as tokens.

The training trick is verifiable rewards. Reasoning models are trained on problems where correctness can be checked mechanically: math with ground-truth answers, coding with test cases. The reward is tied to whether the final answer after the reasoning chain is correct. This sharp signal lets RL push the model toward reasoning that arrives at correct answers, far more cleanly than the preference signals used in RLHF (reinforcement learning from human feedback) can.
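To make "checked mechanically" concrete, here is a minimal sketch of what a verifiable reward could look like. The function names (math_reward, code_reward) and the all-or-nothing 0/1 scoring are illustrative assumptions, not a description of any lab's actual training pipeline.

```python
from fractions import Fraction


def math_reward(model_answer: str, ground_truth: str) -> float:
    """Hypothetical math reward: 1.0 if the final answer equals the
    ground truth as an exact number, else 0.0."""
    try:
        same = Fraction(model_answer.strip()) == Fraction(ground_truth.strip())
    except (ValueError, ZeroDivisionError):
        # An unparseable answer earns no reward.
        return 0.0
    return 1.0 if same else 0.0


def code_reward(candidate_source: str, func_name: str, test_cases) -> float:
    """Hypothetical coding reward: 1.0 only if every test case passes.

    test_cases is a list of (args_tuple, expected_result) pairs.
    Assumption: a real pipeline would run untrusted code in a sandbox;
    plain exec is used here only to keep the sketch short.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Crashes, syntax errors, and missing functions all count as failure.
        return 0.0
    return 1.0 if passed else 0.0


good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"
tests = [((1, 2), 3), ((0, 0), 0)]

print(math_reward("0.5", "1/2"))         # 1.0
print(code_reward(good, "add", tests))   # 1.0
print(code_reward(buggy, "add", tests))  # 0.0
```

Note that neither function looks at the reasoning chain itself: only the final answer is scored, which is what makes the signal cheap to compute and hard to argue with.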

Compute budget is user-facing. Modern chat UIs expose “standard” vs “extended” thinking. The toggle controls how many reasoning tokens the model is allowed to spend before answering. More reasoning costs more dollars and adds latency; you are paying for reasoning tokens whether you see them or not.
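The cost side of that trade-off is simple arithmetic. The sketch below assumes reasoning tokens are billed at the same per-token rate as answer tokens; the price and the token counts are made-up numbers for illustration, not any provider's actual rates.

```python
def call_cost_usd(reasoning_tokens: int, answer_tokens: int,
                  usd_per_million_output_tokens: float) -> float:
    """Cost of one call, assuming hidden reasoning tokens are billed
    at the same rate as visible answer tokens (an assumption)."""
    billed_tokens = reasoning_tokens + answer_tokens
    return billed_tokens * usd_per_million_output_tokens / 1_000_000


PRICE = 10.0  # hypothetical USD per million output tokens

# Same 300-token answer, two hypothetical reasoning budgets.
standard = call_cost_usd(reasoning_tokens=500, answer_tokens=300,
                         usd_per_million_output_tokens=PRICE)
extended = call_cost_usd(reasoning_tokens=8_000, answer_tokens=300,
                         usd_per_million_output_tokens=PRICE)

print(f"standard: ${standard:.4f}")  # standard: $0.0080
print(f"extended: ${extended:.4f}")  # extended: $0.0830
```

Under these assumed numbers the visible answer is identical in length, yet the extended call costs roughly ten times as much, because almost all of the billed tokens are reasoning.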

Pass@K is the metric: the probability that at least one of K attempts is correct. Pass@1 is the strongest claim; a higher K can only raise the number, because more attempts mean more chances. Read the K before the percentage.
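A quick way to see why K matters is to compute the number directly. This sketch adds a simplifying assumption that the text does not state: every attempt succeeds independently with the same probability p. The function name is illustrative.

```python
def pass_at_k_independent(p: float, k: int) -> float:
    """Probability that at least one of k attempts is correct,
    assuming independent attempts with per-attempt success rate p.

    All k attempts fail with probability (1 - p) ** k, so the chance
    that at least one succeeds is the complement.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p) ** k


# The same model, with a 30% per-attempt success rate, reported at two K values.
print(f"Pass@1:  {pass_at_k_independent(0.3, 1):.3f}")   # Pass@1:  0.300
print(f"Pass@10: {pass_at_k_independent(0.3, 10):.3f}")  # Pass@10: 0.972
```

Under this assumption, a model that is right less than a third of the time per attempt still posts about 97% at K = 10, which is why the K has to be read first.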

This summary is the scan-it-in-five-minutes version. The full lesson covers the major reasoning benchmarks (HumanEval, SWE-bench, Codeforces, GSM8K, AIME), the intuition behind the Pass@K formula (“1 minus the probability that all K attempts are wrong”), and the practical pitfalls.

Core ideas

A standard LLM is trained to predict plausible next tokens. A reasoning model is trained on problems with verifiable correctness, with the reward tied to the correctness of the final answer after a reasoning chain.
The output of a reasoning-model call is two things: a reasoning chain, then a final answer. Both are tokens. The model has been trained to spend reasoning tokens productively.
The “thinking” UI is a summary, not the raw chain. Three plausible reasons for hiding the raw chain: legibility, user attention, and competitive-moat protection of training-quality reasoning data.
Compute budget exposes the reasoning-cost trade-off. “Standard” vs “extended” thinking is a real lever; reasoning tokens are billed even when not shown.
Verifiable rewards are why this works. Math (ground truth) and coding (test cases) provide clear correctness signals. The boundary of where this approach generalizes to fuzzy tasks is the current research frontier.
Major reasoning benchmarks. HumanEval (small set of coding problems, mostly saturated). SWE-bench (real GitHub issues, current frontier). Codeforces (competitive programming, scored by rating). GSM8K (about 8,500 grade-school math problems, baseline). AIME (US math olympiad qualifier, hard).
Pass@K is the metric. Probability at least one of K attempts is correct. Equals 1 minus probability all K are wrong. Pass@1 is the strongest claim. Higher K mechanically inflates the number.
Pitfall: Pass@1 and Pass@10 are different claims. Always read the K. A model with a strong Pass@10 can have a mediocre Pass@1.
Pitfall: “thinking” is a loop of forward passes that generates tokens, not literal cognition. The word is useful UI shorthand; don’t read too much into it.
Pitfall: reasoning models are not strictly better. They are stronger where their training distribution gave them practice (math, code). On creative or open-ended tasks, the cost may not justify the marginal gain.
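The Pass@1 versus Pass@10 pitfall above is easy to reproduce from raw sample counts. The sketch below uses the combinatorial estimator commonly used with HumanEval-style evaluations: draw n samples per problem, count the c correct ones, and compute the chance that a random subset of k contains at least one correct sample. The counts are made up for illustration, and this is not necessarily how any particular model card computed its number.

```python
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate Pass@K for one problem from n samples, c of them correct.

    comb(n - c, k) / comb(n, k) is the probability that a random subset
    of k samples contains only wrong ones; the complement is Pass@K.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0
    # whenever there are too few wrong samples to fill a subset of k.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 20 samples drawn, only 3 of them correct.
print(f"Pass@1:  {pass_at_k_estimate(20, 3, 1):.3f}")   # Pass@1:  0.150
print(f"Pass@10: {pass_at_k_estimate(20, 3, 10):.3f}")  # Pass@10: 0.895
```

The same 20 samples support both a 15% headline and a roughly 89% headline, depending only on the K that gets reported.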

What changes for you

After this lesson, model cards stop being opaque. When a reasoning model claims “x% on AIME 2024” or “y% Pass@1 on HumanEval,” you can place it: a benchmark, a metric, a reasoning-model claim. You also know to ask “Pass at what K?” and “What temperature?” before drawing strong conclusions from a number. The numbers move every quarter; the framework does not.

A standard LLM is trained to sound plausible. A reasoning model is trained to be correct on problems where correctness can be checked.
Compute budget is the new dial: more thinking time usually buys more capability, at more cost and latency.
Pass@K is “any of K right.” Read K before you read the percentage.