Practice: How reasoning models think differently

Self-check

A short retrieval pass. Answer in your head (or on paper) before opening the collapsible.

1. What’s the difference between chain-of-thought prompting and a reasoning model?

Show answer

CoT prompting is a technique for getting any LLM to produce reasoning before its answer, by adding instructions or examples to the prompt. The model has not been specifically trained to reason; the prompt is doing the work.

A reasoning model is an LLM whose training objective specifically pushed it toward producing reasoning chains, often via reinforcement learning on problems with verifiable answers. Reasoning is part of the model’s policy, not just elicited from the prompt. The model produces reasoning by default on hard problems, with the reasoning chain and final answer both being generated tokens.

CoT prompting works on any model with varying effectiveness. Reasoning models bake the technique in.
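A minimal sketch of what “the prompt is doing the work” looks like. The function names and the instruction wording are illustrative assumptions, not a fixed recipe. The point is that CoT prompting only changes the input text, whereas a reasoning model would be sent the plain question and produce its reasoning unprompted.

```python
def build_plain_prompt(question: str) -> str:
    """Prompt as you might send it to a reasoning model: just the question."""
    return question.strip()


def build_cot_prompt(question: str) -> str:
    """Wrap the question with an instruction that elicits step-by-step reasoning.

    The wording below is one possible phrasing, chosen for illustration.
    """
    return (
        f"{question.strip()}\n\n"
        "Think through the problem step by step, then give the final answer "
        "on its own line, prefixed with 'Answer:'."
    )


question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
print(build_plain_prompt(question))
print("---")
print(build_cot_prompt(question))
```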

2. Why do reasoning models work especially well on math and coding tasks?

Show answer

Because those domains have verifiable rewards. A math problem has a ground-truth answer that can be parsed and checked. A coding problem has test cases that can be executed. Either gives a clean yes-or-no signal: did this final answer (after the reasoning chain) match the right answer?

That clean signal is what enables reinforcement learning to push the model toward producing reasoning that arrives at correct answers. RLHF (Phase 4) used a learned preference model approximating human judgments, which is fuzzier and harder to optimize against. Verifiable rewards are sharper; RL on them is more effective.

The trade-off: reasoning models are typically strongest on tasks that look like their training domain. Tasks where “correct” is fuzzy (creative writing, open-ended advice) don’t benefit as obviously, because the reward signal cannot be reduced to a yes-or-no check.
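To make “parsed and checked” concrete, here is a hedged sketch of a verifiable reward for math. It assumes the model ends its output with a line such as `Answer: 80`; that marker, and the function names, are conventions invented for this example and not something the lesson specifies. The reward looks only at the final answer, never at the reasoning chain.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_final_answer(model_output: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", model_output)
    return matches[-1].strip() if matches else None


def math_reward(model_output: str, ground_truth: str) -> int:
    """Binary reward: 1 if the parsed final answer equals the ground truth, else 0."""
    answer = extract_final_answer(model_output)
    if answer is None:
        return 0
    try:
        # Fraction parses strings like "80", "0.5", or "1/2", so "0.5" equals "1/2".
        return int(Fraction(answer) == Fraction(ground_truth))
    except (ValueError, ZeroDivisionError):
        # Unparseable answers (for example, words) count as wrong.
        return 0
```

There is no comparable function for “was this a good piece of creative writing,” which is why the reward signal for fuzzy tasks has to come from somewhere less mechanical.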

3. What is Pass@K, and why does the K matter?

Show answer

Pass@K is the probability that at least one of K attempts at a problem produces a correct answer. Mathematically, it equals 1 minus the probability that all K attempts are wrong.

Pass@1 means “the first attempt is correct.” It is the most stringent and the most user-relevant: it approximates “if a user asks once, what is the chance they get a right answer.”

Pass@K for K greater than 1 means “any of K attempts is correct.” Higher K mechanically inflates the score because more attempts give more chances.

A model can have a great Pass@10 and a mediocre Pass@1: it generates correct answers occasionally but inconsistently. Or a model can have similar Pass@1 and Pass@10: it is reliably correct or reliably wrong. Reading “75% on benchmark X” without knowing K leaves you unable to tell which case you’re in.
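The two cases above can be sketched numerically. This is an illustration under a simplifying assumption: each attempt at a problem succeeds independently with a fixed per-problem probability. The two “models” and their probabilities are hypothetical.

```python
def pass_at_k(per_problem_success: list[float], k: int) -> float:
    """Benchmark-level Pass@K, assuming independent attempts.

    per_problem_success[i] is the assumed probability that a single attempt
    solves problem i. Pass@K for one problem is 1 - (1 - p) ** k; the
    benchmark score is the average over problems.
    """
    return sum(1 - (1 - p) ** k for p in per_problem_success) / len(per_problem_success)


# Hypothetical model A: every problem is solved 30% of the time (inconsistent).
inconsistent = [0.3] * 10
# Hypothetical model B: 3 problems always solved, 7 never solved (consistent).
consistent = [1.0] * 3 + [0.0] * 7

for name, probs in [("inconsistent", inconsistent), ("consistent", consistent)]:
    print(name, round(pass_at_k(probs, 1), 3), round(pass_at_k(probs, 10), 3))
```

Both hypothetical models score 30% at Pass@1, but only the inconsistent one climbs to roughly 97% at Pass@10; the consistent one stays at 30%.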

4. Name the five major reasoning benchmarks the lesson covered and what each measures.

Show answer

HumanEval. 164 hand-written coding problems. Each is a function signature with a docstring; the model writes the function body. Correctness via included unit tests. Mostly saturated by frontier models now.
SWE-bench. Real GitHub issues from open-source projects. The model produces a code patch that fixes the bug. Correctness via the project’s test suite. Current frontier benchmark.
CodeForces. Competitive programming problems, scored with a rating system that places the model’s effective skill on the same scale as human contest participants.
GSM8K. About 8,500 grade-school math word problems requiring a few steps of arithmetic reasoning. Saturated; useful as a baseline.
AIME. US math olympiad qualifier exam. Substantially harder than GSM8K. The cleanest benchmark for showing the gap between reasoning models and standard LLMs.
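For the coding benchmarks, “correctness via unit tests” can be sketched as follows. This is a toy, HumanEval-style check: the problem, the completion, and the helper name `passes_unit_tests` are all made up for illustration, and real harnesses run model output in a sandbox instead of calling `exec` directly.

```python
# The prompt the model sees: a signature plus a docstring.
PROMPT = '''def add_positive(numbers):
    """Return the sum of the strictly positive numbers in the list."""
'''

# Stand-in for a model completion: only the function body is generated.
COMPLETION = "    return sum(n for n in numbers if n > 0)\n"

# Unit tests bundled with the problem.
TESTS = '''
assert add_positive([1, -2, 3]) == 4
assert add_positive([]) == 0
assert add_positive([-1, -5]) == 0
'''


def passes_unit_tests(prompt: str, completion: str, tests: str) -> bool:
    """Run prompt + completion, then the tests; True only if nothing raises.

    Warning: exec on untrusted model output is unsafe outside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)
        exec(tests, namespace)
    except Exception:
        return False
    return True


print(passes_unit_tests(PROMPT, COMPLETION, TESTS))
```

SWE-bench follows the same pass-or-fail idea, except the “completion” is a patch to a real repository and the tests are the project’s own suite.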

5. Modern chat UIs show a “thinking” indicator on reasoning-model calls. What is actually happening, and what does the UI show versus what the model produces?

Show answer

What is actually happening: the model is generating reasoning tokens during the “thinking” period. Each token is one full forward pass through the network. The model is producing tokens as fast as it can; the wait time corresponds to how many reasoning tokens it generates, up to the limit it has been allowed (the compute budget).

What the UI shows: typically a summary of the reasoning, not the raw chain. The lecturer offered three plausible reasons providers don’t show the raw chain: (1) raw chain may be hard to follow as plain English, (2) users don’t necessarily want pages of reasoning to read, (3) raw chains are valuable training data and competitors could distill them into their own models.

Practical implication: you are paying for reasoning tokens in the API bill (output tokens include reasoning tokens) even when you can’t see them. A reasoning-model query costs meaningfully more than a standard-model query for the same prompt.
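A back-of-the-envelope sketch of that bill. Every number here (prices, token counts) is hypothetical and chosen only to make the arithmetic easy, and the example assumes both queries are charged the same per-token prices, which need not hold for real providers.

```python
def query_cost(
    input_tokens: int,
    visible_output_tokens: int,
    reasoning_tokens: int,
    price_in_per_million: float,
    price_out_per_million: float,
) -> float:
    """Cost in dollars when reasoning tokens are billed as output tokens."""
    billed_output = visible_output_tokens + reasoning_tokens
    return (
        input_tokens * price_in_per_million
        + billed_output * price_out_per_million
    ) / 1_000_000


# Same prompt and same visible answer length; only the hidden reasoning differs.
standard = query_cost(500, 300, 0, price_in_per_million=1.0, price_out_per_million=4.0)
reasoning = query_cost(500, 300, 5_000, price_in_per_million=1.0, price_out_per_million=4.0)
print(f"standard: ${standard:.4f}  reasoning: ${reasoning:.4f}")
```

With these made-up numbers the visible answer is identical, yet the reasoning query costs more than ten times as much because of the 5,000 unseen reasoning tokens.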

Try it yourself: read a reasoning-model claim

About 10 minutes. Pen and paper.

You see this claim in a model announcement:

“Our new model achieves 89% on AIME 2024 and 47% Pass@1 on SWE-bench Verified.”

Step 1. What does each part of the claim tell you? Decompose it.

Show one possible answer

89% on AIME 2024: The model solved 89% of problems on the AIME (American Invitational Mathematics Examination) 2024 set. AIME is hard (US math olympiad qualifier); 89% is a strong score that would have been near-impossible for a standard LLM a year ago. Implicit metric is likely Pass@1 unless otherwise specified, but the claim doesn’t actually say.
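47% Pass@1 on SWE-bench Verified: On SWE-bench Verified, a human-validated subset of SWE-bench (real GitHub issues), the model produces a working patch on its first try 47% of the time. The Verified subset filters out issues that are underspecified or whose tests are unreliable, so scores on it are not directly comparable to scores on the full SWE-bench; 47% is a substantial number for that benchmark.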

What’s missing from the claim: the K for the AIME number, the temperature setting for both, and any comparison to a baseline. To compare this model meaningfully to another, you’d want both numbers reported with the same K and temperature.

Step 2. A different paper reports “97% Pass@10 on AIME 2024.” Is this stronger or weaker than “89% on AIME 2024” (assuming the first claim was Pass@1)?

Show one possible answer

It is the weaker claim. Pass@10 inflates the number relative to Pass@1 because more attempts mean more chances to get a problem right. A model could have 97% Pass@10 (great) and 50% Pass@1 (mediocre) if its answers are inconsistent.

The 89% Pass@1 claim is the stronger statement because Pass@1 is the more stringent metric. That still does not let you rank the two models: you would need Pass@1 and Pass@10 for both to compare apples to apples. For a single model, Pass@K never decreases as K grows, so the 89% Pass@1 model has Pass@10 of at least 89%, and the 97% Pass@10 model has Pass@1 of at most 97%. How large the gap between Pass@1 and Pass@10 is depends on the model’s sample distribution and the problem set; it cannot be derived from one number alone.
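If you do have raw samples, both numbers can be computed from the same runs. The sketch below uses a widely used combinatorial estimator (popularized alongside HumanEval): draw n samples per problem, count the c correct ones, and estimate Pass@K as 1 − C(n − c, k) / C(n, k). The per-problem results are hypothetical.

```python
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate Pass@K for one problem from n samples, c of which were correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that k samples
    drawn without replacement from the n are all incorrect.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem results: (samples drawn, samples correct).
results = [(20, 20), (20, 6), (20, 1), (20, 0)]

for k in (1, 10):
    score = sum(pass_at_k_estimate(n, c, k) for n, c in results) / len(results)
    print(f"Pass@{k}: {score:.3f}")
```

For k = 1 the estimate reduces to c / n, the plain fraction of correct samples. Reporting both values from the same samples is what makes two models comparable.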

This is the classic confusion the lesson is trying to inoculate against. Read the K before the percentage.

Step 3. A reasoning model claims “70% Pass@1 on HumanEval, but only 20% Pass@1 on a new benchmark we built called BookSummary.” What explains the gap?

Show one possible answer

Almost certainly because BookSummary doesn’t have a verifiable reward. HumanEval has unit tests; correctness can be mechanically checked. “Summarize this book” doesn’t have a single right answer. The reasoning model was trained on tasks with verifiable rewards. Tasks without that property don’t benefit from the same training signal.

This is the boundary of where reasoning models help. Math and coding (verifiable) get strong gains. Open-ended fuzzy tasks (no verifier) get smaller or no gains. When you encounter a reasoning-model claim that surprises you in either direction, the verifiable-reward question is usually the explanation.

Flashcards

Eight cards.

Q. What's the training-objective difference between a standard LLM and a reasoning model?

A standard LLM is trained to predict the most plausible next token. Reasoning emerges only when the prompt elicits it. A reasoning model is trained (often via RL) on problems with verifiable answers; the reward signal is correctness of the final answer after a reasoning chain. The model learns to spend reasoning tokens productively because reasoning correlates with arriving at correct answers.

Q. What does 'verifiable reward' mean and why does it matter for reasoning models?

A verifiable reward is a correctness signal that can be computed mechanically without human labeling. Math problems with ground-truth answers; coding problems with test cases. The signal is yes-or-no instead of approximate. RL on verifiable rewards is sharper, harder to game, and produces stronger training signals than RLHF’s learned-preference rewards. This is why reasoning models work especially well on math and coding.

Q. What is a 'compute budget' in the context of a reasoning model?

The number of reasoning tokens the model is allowed to spend before producing the final answer. Modern chat UIs expose this as “standard” vs “extended” thinking. The user trades dollars and latency for capability: more reasoning tokens means more compute spent on the problem, which can mean better answers on hard problems. Reasoning tokens are billed even when the UI doesn’t show them.

Q. What is Pass@K, in one sentence and one formula?

Pass@K is the probability that at least one of K attempts at a problem is correct. As a formula: Pass@K = 1 − P(all K attempts are wrong). The K matters: Pass@1 is the most stringent claim (“first attempt right”); higher K mechanically inflates the score.

Q. Why is the 'thinking' UI in modern chat apps not the raw reasoning chain?

Three plausible reasons. First, the raw chain may not be readable as plain English; internal model thinking doesn’t always come out polished. Second, users don’t typically want pages of reasoning to read. Third, raw reasoning chains are valuable training data; hiding them makes it harder for competitors to distill them into their own models.

Q. What does HumanEval measure, and why is it less interesting for cutting-edge models now?

HumanEval is a set of 164 hand-written coding problems. Each is a function signature with a docstring; the model writes the function body, and correctness is checked via included unit tests. It’s mostly saturated by frontier models now (high Pass@1 numbers across the board), so it’s no longer a meaningful discriminator at the frontier. Still cited as a baseline check.

Q. What does SWE-bench measure, and why is it harder than HumanEval?

SWE-bench gives the model a real bug report from a real GitHub project (often a multi-file codebase) and asks it to produce a code patch that fixes the bug. Correctness is checked by running the project’s test suite. It’s harder than HumanEval because the context is much larger (real codebases, not isolated functions) and the problem is less self-contained (it requires understanding a real project’s conventions). Unlike HumanEval, it is not saturated: frontier models still leave substantial room for improvement.

Q. A model claims '95% Pass@10 on AIME 2024' but doesn't report Pass@1. What should you take from that?

The claim sounds strong but is incomplete. Pass@10 inflates the number relative to Pass@1, so 95% Pass@10 could correspond to a much lower Pass@1 (say 50% or 60%) if the model’s answers are inconsistent. The lack of a Pass@1 report is itself a small signal that the Pass@1 isn’t as flattering. Always ask: “Pass at what K?” and look for the most stringent claim available before drawing conclusions.