Reasoning models: cheatsheet

The one idea that matters

A reasoning model has been TRAINED to produce long reasoning chains
as part of its policy, with a reward signal tied to the correctness
of the final answer after the chain.

A standard LLM has not.

Standard vs reasoning at a glance

Aspect	Standard LLM	Reasoning model
Training reward	Next-token prediction (pretraining) + helpfulness preferences (RLHF)	Correctness of final answer after reasoning chain (RL on verifiable rewards)
Reasoning behavior	Emerges only when prompt asks (e.g., “let’s think step by step”)	Default on hard problems; part of the policy
Output structure	Final answer (with optional chain-of-thought, CoT, if prompted)	Reasoning chain, then final answer
Best on	General-purpose tasks, fuzzy goals	Math, coding, structured logic with verifiable answers
API cost	Output tokens you see	Output tokens INCLUDING hidden reasoning tokens

Why this works (verifiable rewards)

Math problem    → ground truth answer    → can verify correctness
Coding problem  → test cases             → can verify correctness
Reasoning chain → arrives at correct answer
                      ↑
                RL pushes model in this direction

RLHF optimizes against learned human preferences, a fuzzier signal. Verifiable rewards are sharper: the answer is either right or wrong.

The boundary: reasoning models work best where verifiable rewards exist. Generalization to fuzzy domains is the current research frontier.
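
A minimal sketch of what "verifiable" means in practice. The function names (math_reward, code_reward) and the 0/1 reward scale are illustrative assumptions, not any lab's actual training code; real pipelines also sandbox untrusted code before running it.

```python
from fractions import Fraction
from typing import Callable, Iterable, Tuple


def math_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer matches the ground truth, else 0.0."""
    try:
        # Compare as exact rationals so "0.5" and "1/2" count as the same answer.
        return float(Fraction(model_answer.strip()) == Fraction(ground_truth.strip()))
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers fall back to a plain string comparison.
        return float(model_answer.strip() == ground_truth.strip())


def code_reward(candidate: Callable[[int], int],
                test_cases: Iterable[Tuple[int, int]]) -> float:
    """Return 1.0 only if the candidate passes every (input, expected) pair."""
    try:
        return float(all(candidate(x) == expected for x, expected in test_cases))
    except Exception:
        # A solution that crashes is simply wrong.
        return 0.0
```

Neither function needs a human or a learned preference model in the loop, which is why these domains suit RL on verifiable rewards.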

Compute budgets, user-facing

Setting	What it means
Standard thinking	Lower reasoning-token budget. Faster. Cheaper. Weaker on the hardest problems.
Extended thinking	Higher reasoning-token budget. Slower. More expensive. Better on the hardest problems.

Reasoning tokens are billed. You pay for thinking whether you see it or not.
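
A rough sketch of the billing arithmetic. It assumes reasoning tokens are charged at the same rate as visible output tokens and ignores input tokens; the function name, the price, and the token counts are all hypothetical, so check your provider's pricing page for real numbers.

```python
def request_cost(visible_output_tokens: int,
                 reasoning_tokens: int,
                 price_per_million_output: float) -> float:
    """Cost in dollars, assuming reasoning tokens bill at the output-token rate."""
    billed_tokens = visible_output_tokens + reasoning_tokens
    return billed_tokens * price_per_million_output / 1_000_000


# Hypothetical numbers: 500 visible tokens, $10 per million output tokens.
PRICE = 10.0
standard = request_cost(500, reasoning_tokens=2_000, price_per_million_output=PRICE)
extended = request_cost(500, reasoning_tokens=20_000, price_per_million_output=PRICE)
print(f"standard: ${standard:.4f}  extended: ${extended:.4f}")
```

Under these made-up numbers the visible answer is the same length in both cases, yet the extended-budget request costs roughly eight times as much.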

Major reasoning benchmarks

Benchmark	Domain	What it measures	Status
HumanEval	Coding (small)	164 function-completion problems with unit tests	Mostly saturated
SWE-bench	Coding (real)	Real GitHub issues; produce patches; verified by project test suite	Current frontier
CodeForces	Coding (competitive)	Competitive programming with rating-based comparison to humans	Active
GSM8K	Math (easy)	About 8,500 grade-school word problems	Mostly saturated
AIME	Math (hard)	US math olympiad qualifier exam	Active; clear reasoning-model gap

Pass@K, the metric to know

Pass@K = probability at least one of K attempts is correct
       = 1 - probability all K attempts are wrong

K	Interpretation	When to care
K = 1	First attempt is correct	User-facing reliability
K = 10	Any of 10 attempts is correct	Best-of-N inference workflows
K = 100	Any of 100 attempts is correct	Maximum-effort verification (rare in practice)

Pass@K never decreases as K grows: a higher K mechanically gives a number at least as large. Pass@1 is the most stringent claim.
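
The definition above can be sketched two ways. The first function assumes every attempt succeeds independently with the same probability p, which is a simplification. The second is the combinatorial estimator commonly used with HumanEval-style evaluations: draw n samples, count c correct, and compute the chance that a random size-k subset contains at least one correct sample. Function names are illustrative.

```python
from math import comb


def pass_at_k_independent(p: float, k: int) -> float:
    """Pass@K if each attempt succeeds independently with probability p."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate Pass@K from n sampled attempts of which c were correct.

    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# A model that is right 20% of the time per attempt:
for k in (1, 10, 100):
    print(k, round(pass_at_k_independent(0.2, k), 4))
```

The loop makes the "read K first" point concrete: the same 20%-per-attempt model reports 0.2 at K = 1 and about 0.89 at K = 10.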

How to read a reasoning-model claim

"75% on AIME 2024"
   ↑ ask: Pass at what K?

"47% Pass@1 on SWE-bench Verified"
   ↑ K is explicit (good)
   ↑ benchmark variant is explicit (good)
   ↑ ask: temperature?

"95% Pass@10 on coding-bench-X"
   ↑ K=10 is high
   ↑ Pass@1 is probably much lower
   ↑ ask: where's the Pass@1?

Three questions to always ask:

What is K? Pass@1 is much stronger than Pass@10.
What is the temperature? Higher temperature makes samples more diverse, which can inflate Pass@K for K > 1.
Verified by what? Mechanical verifier (test cases, ground truth) or self-evaluation?
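
The three questions can be turned into a small checklist. This is only an illustrative sketch: BenchmarkClaim and open_questions are made-up names, and the verifier labels are free-form strings chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchmarkClaim:
    benchmark: str
    score_percent: float
    k: Optional[int] = None              # the K in Pass@K, if reported
    temperature: Optional[float] = None  # sampling temperature, if reported
    verifier: Optional[str] = None       # e.g. "test suite", "self-evaluation"


def open_questions(claim: BenchmarkClaim) -> List[str]:
    """List the questions a reader should still ask about a claim."""
    questions = []
    if claim.k is None:
        questions.append("Pass at what K?")
    elif claim.k > 1:
        questions.append("Where is the Pass@1?")
    if claim.temperature is None:
        questions.append("What temperature?")
    if claim.verifier is None:
        questions.append("Verified by what?")
    elif claim.verifier == "self-evaluation":
        questions.append("Is there a mechanical verifier?")
    return questions


print(open_questions(BenchmarkClaim("AIME 2024", 75.0)))
print(open_questions(BenchmarkClaim("SWE-bench Verified", 47.0, k=1, verifier="test suite")))
```

The first claim leaves all three questions open; the second, like the SWE-bench example above, leaves only the temperature.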

Major reasoning models (timeline)

Sept 2024  - OpenAI o1-preview  (the first widely deployed reasoning model)
Dec 2024   - Gemini 2.0 Flash Thinking
Jan 2025   - DeepSeek R1  (made the recipe public; major moment)
2025+      - Anthropic Claude thinking modes, xAI, Mistral, others

The technique is now industry-wide. Specifics vary; the underlying shift in how models are trained is broadly shared.

Pitfalls to dodge

Pitfall	Reality
“Thinking = consciousness.”	No. The model is generating tokens one at a time in an autoregressive loop. The UI word is convenient shorthand; don’t read more into it.
“Higher Pass@K = better model.”	Only at the same K. Pass@K is monotone in K; comparing models requires the same K.
“Reasoning models dominate everywhere.”	They are stronger where verifiable rewards trained them. Creative or open-ended tasks may not see comparable gains.
“The thinking summary is the full chain.”	No. It’s a summary. The raw chain is hidden for legibility, attention, and competitive reasons.

Glossary

Reasoning model: an LLM trained to produce reasoning chains as part of its policy, with reward tied to final-answer correctness. Different from a standard LLM with CoT prompting.
Verifiable reward: a correctness signal that can be computed mechanically (test cases, ground-truth answers).
Compute budget: the number of reasoning tokens the model is allowed before producing the final answer. User-facing in modern chat UIs.
Pass@K: probability at least one of K attempts is correct. Pass@1 is the strongest claim.
AIME: American Invitational Mathematics Examination. US math-olympiad qualifier. Hard.
GSM8K: Grade School Math 8K. About 8,500 grade-school word problems. Mostly saturated.
HumanEval: OpenAI’s coding benchmark, 164 problems. Mostly saturated.
SWE-bench: real GitHub issues benchmark. Current frontier.
CodeForces: competitive programming benchmark with human-comparable rating.
Reasoning tokens: the tokens a reasoning model produces during its thinking phase. Billed even when not shown.

A standard LLM is trained to sound plausible. A reasoning model is trained to be correct.
Compute budget is the new dial: more thinking time, more capability, more cost.
Pass@K is “any of K right.” Read K before you read the percentage.