Cheatsheet: How reasoning models think differently
The one idea that matters
A reasoning model has been TRAINED to produce long reasoning chains as part of its policy, with a reward signal tied to the correctness of the final answer after the chain.
A standard LLM has not.
Standard vs reasoning at a glance
| Aspect | Standard LLM | Reasoning model |
|---|---|---|
| Training reward | Next-token prediction (pretraining) + helpfulness preferences (RLHF) | Correctness of final answer after reasoning chain (RL on verifiable rewards) |
| Reasoning behavior | Emerges only when prompt asks (e.g., “let’s think step by step”) | Default on hard problems; part of the policy |
| Output structure | Final answer (with optional CoT if prompted) | Reasoning chain, then final answer |
| Best on | General-purpose tasks, fuzzy goals | Math, coding, structured logic with verifiable answers |
| API cost | Output tokens you see | Output tokens INCLUDING hidden reasoning tokens |
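The "Output structure" row can be made concrete with a small parser. This is a minimal sketch, assuming the reasoning chain arrives wrapped in `<think>...</think>` tags, as some open-weight reasoning models emit it; hosted APIs often return reasoning in a separate field or hide it entirely. The names `ParsedOutput` and `split_output` are hypothetical.

```python
import re
from typing import NamedTuple


class ParsedOutput(NamedTuple):
    reasoning: str      # text inside the assumed <think> block ("" if absent)
    final_answer: str   # everything after the reasoning block


# Assumption: the reasoning chain is wrapped in <think>...</think>.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_output(raw: str) -> ParsedOutput:
    """Separate the reasoning chain from the final answer in a raw completion."""
    match = THINK_BLOCK.search(raw)
    if match is None:
        # Standard-LLM style output: no reasoning block, only an answer.
        return ParsedOutput(reasoning="", final_answer=raw.strip())
    return ParsedOutput(
        reasoning=match.group(1).strip(),
        final_answer=raw[match.end():].strip(),
    )


parsed = split_output("<think>17 * 3 = 51, plus 4 is 55.</think>The answer is 55.")
print(parsed.final_answer)  # The answer is 55.
```

Only `final_answer` would be checked by a verifier; the reasoning text is the part that costs tokens without necessarily being shown.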
Why this works (verifiable rewards)
Math problem → ground truth answer → can verify correctness
Coding problem → test cases → can verify correctness
Reasoning chain → arrives at correct answer
                  ↑ RL pushes model in this direction
RLHF used learned preferences (fuzzier). Verifiable rewards are sharper.
The boundary: reasoning models work best where verifiable rewards exist. Generalization to fuzzy domains is the current research frontier.
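To illustrate what "verifiable" means in practice, here is a minimal sketch of two mechanical reward checks: one against a ground-truth answer and one against test cases. The function names `math_reward` and `code_reward` are hypothetical, and the exact-match comparison is a simplifying assumption; real pipelines typically normalise answers and run candidate code in a sandbox.

```python
from typing import Callable, Iterable, Tuple


def math_reward(final_answer: str, ground_truth: str) -> float:
    """Reward 1.0 if the final answer matches the ground truth exactly, else 0.0."""
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0


def code_reward(candidate: Callable[[int], int], tests: Iterable[Tuple[int, int]]) -> float:
    """Reward 1.0 only if the candidate passes every (input, expected) test case."""
    for arg, expected in tests:
        try:
            if candidate(arg) != expected:
                return 0.0
        except Exception:
            return 0.0  # a crash counts as a failed test
    return 1.0


print(math_reward(" 55 ", "55"))                       # 1.0
print(code_reward(lambda x: x * x, [(2, 4), (3, 9)]))  # 1.0
print(code_reward(lambda x: x + x, [(2, 4), (3, 9)]))  # 0.0
```

Neither function looks at the reasoning chain. The reward depends only on whether the final result checks out, which is what makes the signal sharp.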
Compute budgets, user-facing
| Setting | What it means |
|---|---|
| Standard thinking | Lower reasoning-token budget. Faster. Cheaper. Less capability on hardest problems. |
| Extended thinking | Higher reasoning-token budget. Slower. More expensive. Better on hardest problems. |
Reasoning tokens are billed. You pay for thinking whether you see it or not.
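A rough cost sketch shows how the budget setting feeds into the bill. It assumes reasoning tokens are billed at the same rate as visible output tokens; the price and token counts below are made-up illustration values, not any provider's actual pricing, and `output_cost_usd` is a hypothetical helper.

```python
def output_cost_usd(visible_tokens: int, reasoning_tokens: int, usd_per_million_output: float) -> float:
    """Estimate output cost when hidden reasoning tokens are billed like visible output tokens."""
    billed_tokens = visible_tokens + reasoning_tokens
    return billed_tokens * usd_per_million_output / 1_000_000


# Hypothetical numbers: 300 visible tokens, made-up price of $10 per million output tokens.
PRICE = 10.0
standard = output_cost_usd(visible_tokens=300, reasoning_tokens=2_000, usd_per_million_output=PRICE)
extended = output_cost_usd(visible_tokens=300, reasoning_tokens=20_000, usd_per_million_output=PRICE)
print(f"standard: ${standard:.4f}")  # standard: $0.0230
print(f"extended: ${extended:.4f}")  # extended: $0.2030
```

The visible answer is the same length in both cases; the difference in cost comes entirely from the reasoning budget.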
Major reasoning benchmarks
| Benchmark | Domain | What it measures | Status |
|---|---|---|---|
| HumanEval | Coding (small) | 164 function-completion problems with unit tests | Mostly saturated |
| SWE-bench | Coding (real) | Real GitHub issues; produce patches; verified by project test suite | Current frontier |
| CodeForces | Coding (competitive) | Competitive programming with rating-based comparison to humans | Active |
| GSM8K | Math (easy) | About 8,500 grade-school word problems | Mostly saturated |
| AIME | Math (hard) | US math olympiad qualifier exam | Active; clear reasoning-model gap |
Pass@K, the metric to know
Pass@K = probability at least one of K attempts is correct
       = 1 - probability all K attempts are wrong
| K | Interpretation | When to care |
|---|---|---|
| K = 1 | First attempt is correct | User-facing reliability |
| K = 10 | Any of 10 attempts is correct | Best-of-N inference workflows |
| K = 100 | Any of 100 attempts is correct | Maximum-effort verification (rare in practice) |
Pass@K never decreases as K grows: a higher K mechanically gives a number at least as large. Pass@1 is the most stringent claim.
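As a worked example of the definition, the sketch below estimates Pass@K for one problem from n sampled attempts of which c were correct, using the combinatorial estimator 1 - C(n-c, K) / C(n, K) commonly used in HumanEval-style evaluation. The sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K from n sampled attempts of which c were correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random subset of k of the n attempts contains no correct attempt.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 20 attempts sampled, 3 of them correct.
for k in (1, 5, 10):
    print(k, round(pass_at_k(n=20, c=3, k=k), 3))
# 1 0.15
# 5 0.601
# 10 0.895
```

The same 3-in-20 success rate reads as 15% at K = 1 and about 90% at K = 10, which is why the K has to be stated alongside the percentage.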
How to read a reasoning-model claim
"75% on AIME 2024"
  ↑ ask: Pass at what K?
"47% Pass@1 on SWE-bench Verified"
  ↑ K is explicit (good)
  ↑ benchmark variant is explicit (good)
  ↑ ask: temperature?
"95% Pass@10 on coding-bench-X"
  ↑ K=10 is high
  ↑ Pass@1 is probably much lower
  ↑ ask: where's the Pass@1?
Three questions to always ask:
- What is K? Pass@1 is much stronger than Pass@10.
- What is the temperature? Higher temperature makes samples more diverse, which tends to raise Pass@K for K > 1, often at the expense of Pass@1.
- Verified by what? Mechanical verifier (test cases, ground truth) or self-evaluation?
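To get a feel for how far apart Pass@1 and Pass@10 can sit, the sketch below inverts Pass@K = 1 - (1 - p)^K for the per-attempt success rate p. It assumes independent attempts with identical success probability on a single problem, which is a simplification: benchmark scores average over problems of varying difficulty, so treat the result as a rough illustration only. `implied_pass_at_1` is a hypothetical helper.

```python
def implied_pass_at_1(pass_at_k: float, k: int) -> float:
    """Per-attempt success rate p that solves 1 - (1 - p)**k == pass_at_k.

    Assumes independent attempts with the same success probability p.
    """
    if not 0.0 <= pass_at_k <= 1.0 or k < 1:
        raise ValueError("require 0 <= pass_at_k <= 1 and k >= 1")
    return 1.0 - (1.0 - pass_at_k) ** (1.0 / k)


p = implied_pass_at_1(0.95, k=10)
print(f"{p:.3f}")  # 0.259
```

Under these assumptions, a 95% Pass@10 is compatible with a first-attempt success rate of roughly 26%.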
Major reasoning models (timeline)
Sept 2024 - OpenAI o1-preview (the first widely-deployed reasoning model)
Dec 2024 - Gemini 2.0 Flash Thinking
Jan 2025 - DeepSeek R1 (made the recipe public; major moment)
2025+ - Anthropic Claude thinking modes, xAI, Mistral, others
The technique is now industry-wide. Specifics vary; the architectural shift is broadly shared.
Pitfalls to dodge
| Pitfall | Reality |
|---|---|
| “Thinking = consciousness.” | No. The model is generating tokens during a forward-pass loop. The UI word is convenient shorthand; don’t read more into it. |
| “Higher Pass@K = better model.” | Only at the same K. Pass@K is monotone in K; comparing models requires the same K. |
| “Reasoning models dominate everywhere.” | They are stronger where verifiable rewards trained them. Creative or open-ended tasks may not see comparable gains. |
| “The thinking summary is the full chain.” | No. It’s a summary. The raw chain is hidden for legibility, attention, and competitive reasons. |
Glossary
- Reasoning model: an LLM trained to produce reasoning chains as part of its policy, with reward tied to final-answer correctness. Different from a standard LLM with CoT prompting.
- Verifiable reward: a correctness signal that can be computed mechanically (test cases, ground-truth answers).
- Compute budget: the number of reasoning tokens the model is allowed before producing the final answer. User-facing in modern chat UIs.
- Pass@K: probability at least one of K attempts is correct. Pass@1 is the strongest claim.
- AIME: American Invitational Mathematics Examination. US math-olympiad qualifier. Hard.
- GSM8K: Grade School Math 8K. About 8,500 grade-school word problems. Mostly saturated.
- HumanEval: OpenAI’s coding benchmark, 164 problems. Mostly saturated.
- SWE-bench: real GitHub issues benchmark. Current frontier.
- CodeForces: competitive programming benchmark with human-comparable rating.
- Reasoning tokens: the tokens a reasoning model produces during its thinking phase. Billed even when not shown.
A standard LLM is trained to sound plausible. A reasoning model is trained to be correct.
Compute budget is the new dial: more thinking time, more capability, more cost.
Pass@K is “any of K right.” Read K before you read the percentage.