
Cheatsheet: How chain of thought makes models think out loud

More tokens = more compute.
CoT is how you spend that compute on a hard problem.
The reasoning chain happens in the output tokens.

| Flavor | What’s in the prompt |
| --- | --- |
| Zero-shot CoT | Append “Let’s think step by step” (or similar) to the prompt. No examples. |
| Few-shot CoT | Show examples in the prompt that include the reasoning chain, not just the final answer. |

Both produce a reasoning chain followed by a final answer. Few-shot tends to be more reliable on hard problems because it constrains the style of reasoning.

| Reason | What it captures |
| --- | --- |
| Decomposition | A hard problem may not be in training data; its subproblems usually are. CoT routes through capabilities the model already has. |
| More tokens = more compute | Each token is one full forward pass. Producing reasoning before the answer gives the model more thinking time on the problem. |

Empirical pattern: the gain from CoT scales with model size. Tiny models benefit little and are sometimes hurt by it; large models benefit substantially.

Prompt:
A teddy bear was born in 2021. The current year is 2026.
How old is the teddy bear?

Direct (no CoT):
"5 years old."

Zero-shot CoT (append "Let's think step by step"):
"The bear was born in 2021. Current year is 2026.
Age = 2026 - 2021 = 5. The bear is 5 years old."

Few-shot CoT:
[Show one or two examples with reasoning, then the new query]
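
The two prompt flavors can be sketched as small string builders. This is a minimal illustration, not a prescribed format: the `Q:`/`A:` layout, the function names, and the robot demonstration are all assumptions made up for the sketch.

```python
from dataclasses import dataclass


@dataclass
class WorkedExample:
    """One few-shot demonstration: question, reasoning chain, final answer."""
    question: str
    reasoning: str
    answer: str


COT_TRIGGER = "Let's think step by step."


def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: the question plus a trigger phrase, no examples."""
    return f"Q: {question}\nA: {COT_TRIGGER}"


def few_shot_cot(examples: list[WorkedExample], question: str) -> str:
    """Few-shot CoT: every demonstration shows its reasoning before the answer."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in examples
    ]
    # End with an open "A:" so the model continues in the demonstrated style.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


question = (
    "A teddy bear was born in 2021. The current year is 2026. "
    "How old is the teddy bear?"
)

# Hypothetical demonstration, invented for this sketch.
demo = WorkedExample(
    question="A robot was built in 2010. The current year is 2013. How old is the robot?",
    reasoning="The robot was built in 2010 and the current year is 2013. Age = 2013 - 2010 = 3.",
    answer="3",
)

print(zero_shot_cot(question))
print()
print(few_shot_cot([demo], question))
```

The only difference between the two builders is what precedes the new question: a trigger phrase in one case, full worked demonstrations in the other.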
Self-consistency:
1. Sample N CoT chains in parallel (typical N: 5 to 40).
2. Parse the final answer from each.
3. Take a majority vote: the most common answer wins.

Trade-off: cost scales with N, but because the chains are sampled in parallel, latency stays at roughly one sample’s worth.

When to use: hard reasoning problems where one chain is unreliable and you have budget for multiple samples.
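
The three self-consistency steps can be sketched as follows. This is a rough illustration under stated assumptions: `sample_chain` is a hypothetical stand-in for one sampled model call, the parser simply takes the last number in the chain, and the canned chains are invented for the demo.

```python
import re
from collections import Counter
from typing import Callable, Optional


def parse_final_answer(chain: str) -> Optional[str]:
    """Crude parser: treat the last number in the chain as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None


def self_consistency(sample_chain: Callable[[], str], n: int) -> Optional[str]:
    """Sample n chains, parse each final answer, return the most common one."""
    # Sequential here for simplicity; a real client would issue the n calls
    # concurrently so latency stays near that of a single sample.
    chains = [sample_chain() for _ in range(n)]
    answers = [a for a in map(parse_final_answer, chains) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Canned chains standing in for sampled model outputs (hypothetical).
canned = iter([
    "Age = 2026 - 2021 = 5. The bear is 5 years old.",
    "2026 minus 2021 is 5, so the bear is 5.",
    "Born in 2021, now 2026. Counting both ends gives 6. The bear is 6.",
    "The current year is 2026 and the birth year is 2021. The bear is 5.",
    "Age = 2026 - 2021 = 4. The bear is 4 years old.",
])

print(self_consistency(lambda: next(canned), n=5))  # prints 5
```

Two of the five chains are wrong, yet the vote still returns 5. The vote only helps when errors are scattered across different wrong answers; if most chains share the same mistake, the majority is wrong too.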

Zero-shot
↓ (zero-shot is unreliable)
Few-shot (3 to 5 examples)
↓ (still unreliable on multi-step reasoning)
Zero-shot CoT (append "Let's think step by step")
↓ (need stronger reasoning constraint)
Few-shot CoT (examples with reasoning chains)
↓ (high stakes, willing to pay N times the cost)
CoT with self-consistency (sample N, majority-vote)

Stop at the first level that gives you the reliability you need. Each step costs more tokens.
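
The stopping rule can be sketched as a walk up the ladder from cheapest to most expensive. This assumes you have some way to measure accuracy per level on your own evaluation set; `accuracy_of` is a hypothetical callback and the numbers below are made up for illustration.

```python
from typing import Callable, Optional

# Rungs in the order given above, cheapest first.
LADDER = [
    "zero-shot",
    "few-shot",
    "zero-shot CoT",
    "few-shot CoT",
    "CoT with self-consistency",
]


def first_sufficient_level(
    accuracy_of: Callable[[str], float], target: float
) -> Optional[str]:
    """Return the first rung whose measured accuracy meets the target."""
    for level in LADDER:
        if accuracy_of(level) >= target:
            return level
    return None  # even the top rung is not reliable enough


# Made-up accuracies on a hypothetical evaluation set.
measured = {
    "zero-shot": 0.62,
    "few-shot": 0.71,
    "zero-shot CoT": 0.84,
    "few-shot CoT": 0.91,
    "CoT with self-consistency": 0.95,
}

print(first_sufficient_level(measured.__getitem__, target=0.90))  # prints few-shot CoT
```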

| Scenario | CoT recommended? |
| --- | --- |
| Multi-step math word problems | Yes |
| Multi-hop questions (combining several facts) | Yes |
| Code with subtle conditions or edge cases | Yes |
| Simple knowledge lookup | No (overkill, just paying for tokens) |
| Problems the model genuinely cannot solve | No (can produce confident-sounding nonsense) |
| High-stakes decisions where the chain is mistaken for proof | Use, but validate externally |

Wrong answer with CoT:
→ The chain shows you WHERE it went wrong
→ You can fix the system prompt, context, or examples
→ Faster than guessing at why the model is wrong

Wrong answer without CoT:
→ No trace of how the model got there
→ Guess and check
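
Because the chain is plain text, parts of it can be checked mechanically. The sketch below is one narrow, illustrative form of external validation: it re-computes integer steps of the form `a op b = c` and reports the first one that does not hold. The function name and the regex are assumptions for this example, and it says nothing about logic errors that involve no arithmetic.

```python
import re
from typing import Optional

# Matches simple integer steps such as "2026 - 2021 = 5".
STEP = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def first_bad_step(chain: str) -> Optional[str]:
    """Return the first arithmetic step whose stated result is wrong, if any."""
    for m in STEP.finditer(chain):
        a, op, b, claimed = int(m[1]), m[2], int(m[3]), int(m[4])
        if OPS[op](a, b) != claimed:
            return m.group(0)
    return None


good = "The bear was born in 2021. Age = 2026 - 2021 = 5. The bear is 5 years old."
bad = "The bear was born in 2021. Age = 2026 - 2021 = 6. The bear is 6 years old."

print(first_bad_step(good))  # prints None
print(first_bad_step(bad))   # prints 2026 - 2021 = 6
```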

CoT prompting vs reasoning models (Phase 6 preview)

| | CoT prompting | Reasoning models |
| --- | --- | --- |
| What it is | Technique applied at inference | Architectural shift; models trained to reason |
| Where the reasoning lives | In the user prompt | In the model’s policy |
| Works on any model? | Yes (better on larger) | Only on models trained for it |
| Examples | Any LLM with the right prompt | OpenAI o1, DeepSeek-R1, Gemini Flash Thinking, Claude thinking modes |

The Phase 5 → 6 shift: from “steering one inference call” to “letting the model think longer, look things up, or take actions.”

| Pitfall | Reality |
| --- | --- |
| “Trust a CoT chain because it looks like reasoning.” | The chain is correlated with correctness, not a certification of it. Models can produce confident wrong reasoning. |
| “Add CoT to every prompt.” | CoT costs tokens. On simple lookup, it’s just paying for tokens you don’t need. Use the escalation ladder. |
| “CoT prompting and reasoning models are the same.” | They are not. CoT is a prompting technique; reasoning models are trained to reason as part of their policy. |
| “More tokens always means a better answer.” | Only if the extra tokens are productive (reasoning steps that build to the answer). Padding tokens that aren’t load-bearing don’t help. |

  • Chain-of-thought (CoT) prompting: asking a model to produce reasoning steps before its final answer. Same model, different prompt.
  • Zero-shot CoT: CoT triggered by a phrase like “Let’s think step by step” with no examples.
  • Few-shot CoT: CoT demonstrated by examples in the prompt that include the reasoning chain.
  • Self-consistency: sample N CoT chains, majority-vote on the answer. Cost-for-accuracy multiplier.
  • Compute budget: the amount of compute (tokens times model size) the model is allowed for one query. CoT is one way to spend a larger compute budget on harder problems.
  • Reasoning model: a model trained to produce long internal reasoning as part of its policy. Different from CoT prompting. Phase 6 territory.
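
The compute-budget definition (tokens times model size) can be turned into a back-of-the-envelope comparison. This is a rough sketch in arbitrary units: it ignores prompt tokens, and the token counts and N below are invented for illustration.

```python
def relative_compute(output_tokens: int, model_size: float, samples: int = 1) -> float:
    """Compute in arbitrary units: tokens times model size, times sampled chains."""
    return output_tokens * model_size * samples


# Hypothetical output lengths: a short direct answer vs. a reasoning chain.
direct = relative_compute(output_tokens=5, model_size=1.0)
cot = relative_compute(output_tokens=40, model_size=1.0)
self_consistent = relative_compute(output_tokens=40, model_size=1.0, samples=10)

print(cot / direct)              # prints 8.0
print(self_consistent / direct)  # prints 80.0
```

Under these assumed numbers, a single chain spends 8 times the compute of a direct answer, and self-consistency with N = 10 spends 80 times.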

More tokens means more compute. CoT is how you spend that compute on a hard problem.
Zero-shot CoT costs almost nothing to add to the prompt; few-shot CoT demonstrates the kind of reasoning you want; self-consistency buys extra accuracy at N times the cost.
The chain is a signal, not a certification. The model can be wrong with reasoning that sounds right.