
Cheatsheet: How reasoning models think differently

A reasoning model has been TRAINED to produce long reasoning chains
as part of its policy, with a reward signal tied to the correctness
of the final answer after the chain.
A standard LLM has not.

| Aspect | Standard LLM | Reasoning model |
| --- | --- | --- |
| Training reward | Next-token prediction (pretraining) + helpfulness preferences (RLHF) | Correctness of final answer after reasoning chain (RL on verifiable rewards) |
| Reasoning behavior | Emerges only when prompt asks (e.g., “let’s think step by step”) | Default on hard problems; part of the policy |
| Output structure | Final answer (with optional CoT if prompted) | Reasoning chain, then final answer |
| Best on | General-purpose tasks, fuzzy goals | Math, coding, structured logic with verifiable answers |
| API cost | Output tokens you see | Output tokens INCLUDING hidden reasoning tokens |

Math problem → ground truth answer → can verify correctness
Coding problem → test cases → can verify correctness
Reasoning chain → arrives at correct answer
RL pushes model in this direction

RLHF used learned preferences (fuzzier). Verifiable rewards are sharper.
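
To make "verifiable" concrete, here is a minimal sketch of two mechanical reward functions, one for math answers and one for code. The function names (`math_reward`, `code_reward`) and the 0/1 reward scale are illustrative assumptions, not any lab's actual training code, and the unsandboxed `exec` is for illustration only.

```python
from fractions import Fraction


def math_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer equals the ground truth numerically, else 0.0.

    Hypothetical helper: compares values as exact fractions so that
    "0.5" and "1/2" count as the same answer.
    """
    try:
        matches = Fraction(model_answer.strip()) == Fraction(ground_truth.strip())
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable answers earn no reward
    return 1.0 if matches else 0.0


def code_reward(source: str, func_name: str, test_cases) -> float:
    """Return 1.0 if the generated function passes every test case, else 0.0.

    test_cases is a list of (args_tuple, expected_result) pairs.
    NOTE: exec() here is unsandboxed; a real pipeline would isolate this.
    """
    namespace = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        return 0.0  # crashes and missing functions count as failures
    return 1.0 if passed else 0.0


tests = [((1, 2), 3), ((0, 0), 0)]
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
print(math_reward("0.5", "1/2"), code_reward(good, "add", tests), code_reward(bad, "add", tests))
```

Neither function needs a human or a learned preference model to score an output, which is what makes the signal sharp.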

The boundary: reasoning models work best where verifiable rewards exist. Generalization to fuzzy domains is the current research frontier.

| Setting | What it means |
| --- | --- |
| Standard thinking | Lower reasoning-token budget. Faster. Cheaper. Less capability on hardest problems. |
| Extended thinking | Higher reasoning-token budget. Slower. More expensive. Better on hardest problems. |

Reasoning tokens are billed. You pay for thinking whether you see it or not.
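
A rough sketch of the arithmetic, assuming reasoning tokens are billed at the same rate as visible output tokens. The function name and the price of 10 USD per million output tokens are hypothetical placeholders; check your provider's actual pricing.

```python
def output_cost_usd(visible_tokens: int, reasoning_tokens: int,
                    price_per_million_output_usd: float) -> float:
    """Cost of one response when hidden reasoning tokens are billed as output."""
    billed_tokens = visible_tokens + reasoning_tokens
    return billed_tokens * price_per_million_output_usd / 1_000_000


# Hypothetical numbers: a 500-token visible answer preceded by 4,500 hidden
# reasoning tokens, at an assumed 10 USD per million output tokens.
PRICE = 10.0
seen_only = output_cost_usd(500, 0, PRICE)
with_thinking = output_cost_usd(500, 4_500, PRICE)
print(f"visible only: ${seen_only:.4f}, with reasoning: ${with_thinking:.4f}")
```

In this made-up case the bill is ten times what the visible answer alone would suggest.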

| Benchmark | Domain | What it measures | Status |
| --- | --- | --- | --- |
| HumanEval | Coding (small) | 164 function-completion problems with unit tests | Mostly saturated |
| SWE-bench | Coding (real) | Real GitHub issues; produce patches; verified by project test suite | Current frontier |
| CodeForces | Coding (competitive) | Competitive programming with rating-based comparison to humans | Active |
| GSM8K | Math (easy) | About 8,500 grade-school word problems | Mostly saturated |
| AIME | Math (hard) | US math olympiad qualifier exam | Active; clear reasoning-model gap |

Pass@K = probability that at least one of K attempts is correct
       = 1 - probability that all K attempts are wrong

| K | Interpretation | When to care |
| --- | --- | --- |
| K = 1 | First attempt is correct | User-facing reliability |
| K = 10 | Any of 10 attempts is correct | Best-of-N inference workflows |
| K = 100 | Any of 100 attempts is correct | Maximum-effort verification (rare in practice) |

Pass@K never decreases as K grows: allowing more attempts can only raise the number, so a higher K mechanically gives a bigger (or equal) score. Pass@1 is the most stringent claim.
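
The definition can be sketched in two ways. The first assumes every attempt succeeds independently with the same probability `p` (an assumption added here for illustration, not something the definition requires). The second is the combinatorial estimator commonly used with HumanEval-style evaluation: draw `n` samples per problem, count `c` correct, and compute 1 - C(n-c, k) / C(n, k). Function names are illustrative.

```python
from math import comb


def pass_at_k_independent(p: float, k: int) -> float:
    """Pass@K if each attempt is independently correct with probability p."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate Pass@K for one problem from n samples, c of them correct.

    Equals 1 minus the chance that k samples drawn without replacement
    from the n are all wrong.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# A model that is right 20% of the time per attempt looks very different
# depending on K.
for k in (1, 10, 100):
    print(k, round(pass_at_k_independent(0.2, k), 4))
```

Under the independence assumption, a 20% Pass@1 becomes roughly 89% at K = 10, which is why the K matters more than the headline percentage.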

"75% on AIME 2024"
↑ ask: Pass at what K?
"47% Pass@1 on SWE-bench Verified"
↑ K is explicit (good)
↑ benchmark variant is explicit (good)
↑ ask: temperature?
"95% Pass@10 on coding-bench-X"
↑ K=10 is high
↑ Pass@1 is probably much lower
↑ ask: where's the Pass@1?

Three questions to always ask:

  1. What is K? Pass@1 is much stronger than Pass@10.
  2. What is the temperature? Higher temperature makes samples more diverse, which tends to raise Pass@K for K > 1 (and can lower Pass@1).
  3. Verified by what? Mechanical verifier (test cases, ground truth) or self-evaluation?
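
The three questions can be applied as a simple checklist. The sketch below is purely illustrative: `BenchmarkClaim` and `open_questions` are hypothetical names, and the example claims mirror the annotated quotes above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchmarkClaim:
    """A reported benchmark number plus the context needed to interpret it."""
    benchmark: str
    score_percent: float
    k: Optional[int] = None              # the K in Pass@K
    temperature: Optional[float] = None  # sampling temperature used
    verifier: Optional[str] = None       # e.g. "test cases", "self-evaluation"


def open_questions(claim: BenchmarkClaim) -> List[str]:
    """Return the checklist questions the claim leaves unanswered."""
    questions = []
    if claim.k is None:
        questions.append("What is K?")
    if claim.temperature is None:
        questions.append("What is the temperature?")
    if claim.verifier is None:
        questions.append("Verified by what?")
    return questions


vague = BenchmarkClaim("AIME 2024", 75.0)
better = BenchmarkClaim("SWE-bench Verified", 47.0, k=1, verifier="project test suite")
print(open_questions(vague))
print(open_questions(better))
```
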
Sept 2024 - OpenAI o1-preview (the first widely deployed reasoning model)
Dec 2024 - Gemini 2.0 Flash Thinking
Jan 2025 - DeepSeek R1 (made the recipe public; major moment)
2025+ - Anthropic Claude thinking modes, xAI, Mistral, others

The technique is now industry-wide. Specifics vary; the architectural shift is broadly shared.

| Pitfall | Reality |
| --- | --- |
| “Thinking = consciousness.” | No. The model is generating tokens one at a time in an autoregressive loop of forward passes. The UI word is convenient shorthand; don’t read more into it. |
| “Higher Pass@K = better model.” | Only at the same K. Pass@K never decreases as K grows, so comparing models requires the same K. |
| “Reasoning models dominate everywhere.” | They are stronger where verifiable rewards trained them. Creative or open-ended tasks may not see comparable gains. |
| “The thinking summary is the full chain.” | No. It’s a summary. The raw chain is hidden for legibility, attention, and competitive reasons. |

  • Reasoning model: an LLM trained to produce reasoning chains as part of its policy, with reward tied to final-answer correctness. Different from a standard LLM with CoT prompting.
  • Verifiable reward: a correctness signal that can be computed mechanically (test cases, ground-truth answers).
  • Compute budget: the number of reasoning tokens the model is allowed before producing the final answer. User-facing in modern chat UIs.
  • Pass@K: probability at least one of K attempts is correct. Pass@1 is the strongest claim.
  • AIME: American Invitational Mathematics Examination. US math-olympiad qualifier. Hard.
  • GSM8K: Grade School Math 8K. About 8,500 grade-school word problems. Mostly saturated.
  • HumanEval: OpenAI’s coding benchmark, 164 function-completion problems with unit tests. Mostly saturated.
  • SWE-bench: real GitHub issues benchmark. Current frontier.
  • CodeForces: competitive programming benchmark with human-comparable rating.
  • Reasoning tokens: the tokens a reasoning model produces during its thinking phase. Billed even when not shown.

A standard LLM is trained to sound plausible. A reasoning model is trained to be correct.
Compute budget is the new dial: more thinking time, more capability, more cost.
Pass@K is “any of K right.” Read K before you read the percentage.