Cheatsheet: How chain of thought makes models think out loud
The one idea that matters
More tokens = more compute.
CoT is how you spend that compute on a hard problem.
The reasoning chain happens in the output tokens.

Two flavors
| Flavor | What’s in the prompt |
|---|---|
| Zero-shot CoT | Append “Let’s think step by step” (or similar) to the prompt. No examples. |
| Few-shot CoT | Show examples in the prompt that include the reasoning chain, not just the final answer. |
Both produce a reasoning chain followed by a final answer. Few-shot tends to be more reliable on hard problems because it constrains the style of reasoning.
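The two flavors differ only in how the prompt string is built. A minimal sketch, assuming a simple `Q:` / `A:` layout; the helper names `zero_shot_cot` and `few_shot_cot` and the shop example are illustrative, not part of any library:

```python
COT_TRIGGER = "Let's think step by step."


def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: append the trigger phrase, no examples."""
    return f"Q: {question}\nA: {COT_TRIGGER}"


def few_shot_cot(question: str, examples: list[tuple[str, str, str]]) -> str:
    """Few-shot CoT: each example is (question, reasoning chain, final answer)."""
    blocks = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in examples
    ]
    # The new query goes last, with the answer slot left open for the model.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Hypothetical demonstration example that includes its reasoning chain.
examples = [
    (
        "A shop opened in 2010. The current year is 2018. How old is the shop?",
        "The shop opened in 2010. The current year is 2018. Age = 2018 - 2010 = 8.",
        "8 years",
    )
]
question = (
    "A teddy bear was born in 2021. The current year is 2026. "
    "How old is the teddy bear?"
)

zero_shot_prompt = zero_shot_cot(question)
few_shot_prompt = few_shot_cot(question, examples)
print(zero_shot_prompt)
print()
print(few_shot_prompt)
```

Either string would then be sent to whatever model client you use; that call is left out here because it depends on the provider.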
Why it works
| Reason | What it captures |
|---|---|
| Decomposition | A hard problem may not be in training data; its subproblems usually are. CoT routes through capabilities the model already has. |
| More tokens = more compute | Each token is one full forward pass. Producing reasoning before the answer gives the model more thinking time on the problem. |
Empirical pattern: the gain from CoT scales with model size. Small models benefit little and are sometimes hurt by it; large models benefit substantially.
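To put rough numbers on "more tokens = more compute", here is a back-of-the-envelope sketch. It assumes the common approximation of about 2 FLOPs per parameter per generated token for a forward pass and ignores prompt processing and attention over the growing context. The 7B parameter count and the token counts are made-up illustration values:

```python
def forward_flops(num_params: float, num_tokens: int) -> float:
    """Rough forward-pass compute: about 2 FLOPs per parameter per token (assumption)."""
    return 2.0 * num_params * num_tokens


PARAMS = 7e9  # hypothetical 7B-parameter model

direct = forward_flops(PARAMS, num_tokens=5)      # short answer only
with_cot = forward_flops(PARAMS, num_tokens=60)   # reasoning chain plus answer
ratio = with_cot / direct

print(f"direct:   {direct:.2e} FLOPs")
print(f"with CoT: {with_cot:.2e} FLOPs")
print(f"ratio:    {ratio:.0f}x")
```

Under this approximation the ratio depends only on the number of output tokens, which is why a longer chain buys proportionally more compute on the same model.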
A worked example
Prompt:
A teddy bear was born in 2021. The current year is 2026.
How old is the teddy bear?

Direct (no CoT):
"5 years old."

Zero-shot CoT (append "Let's think step by step"):
"The bear was born in 2021. Current year is 2026.
Age = 2026 - 2021 = 5. The bear is 5 years old."

Few-shot CoT:
[Show one or two examples with reasoning, then the new query]

Self-consistency
1. Sample N CoT chains in parallel, with sampling temperature above zero so the chains differ (typical N: 5 to 40).
2. Parse the final answer from each.
3. Take the most common answer (majority vote).

Trade-off: cost scales with N; latency stays roughly one sample’s worth because the chains run in parallel.
When to use: hard reasoning problems where one chain is unreliable and you have budget for multiple samples.
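The three steps can be sketched as follows. This is a minimal illustration under stated assumptions: `sample_chain` is a hypothetical stand-in for one sampled model call, the parser assumes the final answer is the last number in the chain, and the samples are drawn sequentially for simplicity rather than in parallel:

```python
import re
from collections import Counter
from typing import Callable, Optional


def parse_final_answer(chain: str) -> Optional[str]:
    """Assumption: the final answer is the last number that appears in the chain."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None


def self_consistency(sample_chain: Callable[[], str], n: int = 5) -> Optional[str]:
    """Sample n chains, parse each final answer, return the most common one."""
    chains = [sample_chain() for _ in range(n)]
    answers = [a for a in map(parse_final_answer, chains) if a is not None]
    if not answers:
        return None
    # On a tie, Counter.most_common keeps the answer that was seen first.
    return Counter(answers).most_common(1)[0][0]


# Canned chains standing in for five sampled model outputs (made-up text).
canned = iter([
    "Age = 2026 - 2021 = 5. The bear is 5 years old.",
    "2026 minus 2021 is 5. The answer is 5.",
    "Born in 2021, now 2026. Counting both end years gives 6.",
    "2026 - 2021 = 5. So the bear is 5.",
    "The difference is 5 years. The answer is 5.",
])

result = self_consistency(lambda: next(canned), n=5)
print(result)  # 5
```

One chain goes wrong and answers 6, but the vote still returns 5. In a real setup the answer parser usually needs to match whatever answer format the prompt asks for.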
The escalation ladder
Zero-shot
  ↓ (zero-shot is unreliable)
Few-shot (3 to 5 examples)
  ↓ (still unreliable on multi-step reasoning)
Zero-shot CoT (append "Let's think step by step")
  ↓ (need stronger reasoning constraint)
Few-shot CoT (examples with reasoning chains)
  ↓ (high stakes, willing to pay N times the cost)
CoT with self-consistency (sample N, majority-vote)

Stop at the first level that gives you the reliability you need. Each step costs more tokens.
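One way to apply the ladder is to score each rung on a small labelled evaluation set and stop at the first rung that meets your target. A sketch under stated assumptions: the strategies below are stubs with canned answers standing in for real prompting setups, and `first_reliable`, the evaluation set, and the 0.75 target are all illustrative:

```python
from typing import Callable, Optional

Strategy = Callable[[str], str]  # question -> final answer


def accuracy(strategy: Strategy, eval_set: list[tuple[str, str]]) -> float:
    """Fraction of evaluation questions the strategy answers correctly."""
    correct = sum(strategy(q) == expected for q, expected in eval_set)
    return correct / len(eval_set)


def first_reliable(
    ladder: list[tuple[str, Strategy]],
    eval_set: list[tuple[str, str]],
    target: float,
) -> Optional[str]:
    """Walk the ladder cheapest-first; return the first rung that meets the target."""
    for name, strategy in ladder:
        if accuracy(strategy, eval_set) >= target:
            return name
    return None


def make_stub(answers: dict[str, str]) -> Strategy:
    """Stand-in for a real prompting strategy: looks up a canned answer."""
    return lambda question: answers[question]


eval_set = [("q1", "5"), ("q2", "12"), ("q3", "7"), ("q4", "40")]
all_correct = {"q1": "5", "q2": "12", "q3": "7", "q4": "40"}

ladder = [
    ("zero-shot", make_stub({"q1": "5", "q2": "10", "q3": "9", "q4": "4"})),
    ("few-shot", make_stub({"q1": "5", "q2": "12", "q3": "9", "q4": "4"})),
    ("zero-shot CoT", make_stub({"q1": "5", "q2": "12", "q3": "7", "q4": "4"})),
    ("few-shot CoT", make_stub(all_correct)),
    ("CoT with self-consistency", make_stub(all_correct)),
]

chosen = first_reliable(ladder, eval_set, target=0.75)
print(chosen)  # zero-shot CoT
```

Because the rungs are ordered by cost, returning the first one that clears the bar also returns the cheapest one that does.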
When CoT helps and when it doesn’t
| Scenario | CoT recommended? |
|---|---|
| Multi-step math word problems | Yes |
| Multi-hop questions (combining several facts) | Yes |
| Code with subtle conditions or edge cases | Yes |
| Simple knowledge lookup | No (overkill, just paying for tokens) |
| Problems the model genuinely cannot solve | No (can produce confident-sounding nonsense) |
| High-stakes decisions where the chain is mistaken for proof | Use, but validate externally |
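
For the last row, "validate externally" can start with something as small as recomputing the arithmetic the chain claims. A narrow sketch, assuming the chain writes steps as `a op b = c` with integers and `+`, `-`, or `*`; the function name is illustrative, and an empty result does not certify the chain, it only means no simple arithmetic slip was found:

```python
import re

# Matches simple integer steps such as "2026 - 2021 = 5".
STEP = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def bad_arithmetic_steps(chain: str) -> list[str]:
    """Return every 'a op b = c' step in the chain whose claimed result is wrong."""
    bad = []
    for match in STEP.finditer(chain):
        left, op, right, claimed = match.groups()
        if OPS[op](int(left), int(right)) != int(claimed):
            bad.append(match.group(0))
    return bad


good_chain = "The bear was born in 2021. Age = 2026 - 2021 = 5. The bear is 5 years old."
wrong_chain = "The bear was born in 2021. Age = 2026 - 2021 = 6. The bear is 6 years old."

print(bad_arithmetic_steps(good_chain))   # []
print(bad_arithmetic_steps(wrong_chain))  # ['2026 - 2021 = 6']
```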
A useful side benefit: debugging
Wrong answer with CoT:
  → The chain shows you WHERE it went wrong
  → You can fix the system prompt, context, or examples
  → Faster than guessing at why the model is wrong

Wrong answer without CoT:
  → No trace of how the model got there
  → Guess and check

CoT prompting vs reasoning models (Phase 6 preview)
| | CoT prompting | Reasoning models |
|---|---|---|
| What it is | Technique applied at inference | Training shift; models trained to reason before answering |
| Where the reasoning behavior comes from | The prompt (trigger phrase or examples) | The model’s trained policy |
| Works on any model? | Yes (better on larger) | Only on models trained for it |
| Examples | Any LLM with the right prompt | OpenAI o1, DeepSeek-R1, Gemini Flash Thinking, Claude thinking modes |
The Phase 5 → 6 shift: from “steering one inference call” to “letting the model think longer, look things up, or take actions.”
Pitfalls to dodge
| Pitfall | Reality |
|---|---|
| “Trust a CoT chain because it looks like reasoning.” | The chain is correlated with correctness, not a certification of it. Models can produce confident wrong reasoning. |
| “Add CoT to every prompt.” | CoT costs tokens. On simple lookup, it’s just paying for tokens you don’t need. Use the escalation ladder. |
| “CoT prompting and reasoning models are the same.” | They are not. CoT is a prompting technique; reasoning models are trained to reason as part of their policy. |
| “More tokens always means a better answer.” | Only if the extra tokens are productive (reasoning steps that build to the answer). Padding tokens that aren’t load-bearing don’t help. |
Glossary
- Chain-of-thought (CoT) prompting: asking a model to produce reasoning steps before its final answer. Same model, different prompt.
- Zero-shot CoT: CoT triggered by a phrase like “Let’s think step by step” with no examples.
- Few-shot CoT: CoT demonstrated by examples in the prompt that include the reasoning chain.
- Self-consistency: sample N CoT chains, majority-vote on the answer. Cost-for-accuracy multiplier.
- Compute budget: the amount of compute (tokens times model size) the model is allowed for one query. CoT is one way to spend a larger compute budget on harder problems.
- Reasoning model: a model trained to produce long internal reasoning as part of its policy. Different from CoT prompting. Phase 6 territory.
More tokens means more compute. CoT is how you spend that compute on a hard problem.
Zero-shot CoT is nearly free to try, few-shot CoT demonstrates the kind of reasoning you want, and self-consistency buys accuracy at N times the cost.
The chain is a signal, not a certification. The model can be wrong with reasoning that sounds right.