Summary: How chain of thought makes models think out loud

Same model, same weights, different output. Append “Let’s think step by step” to a prompt and the model that answered “I’m not sure” or guessed wrong on a multi-step problem will often write out a correct reasoning chain. Nothing about the model has changed. The capability was already there. The prompt nudged it to use the capability instead of jumping to a guess.

That’s chain-of-thought prompting: the technique of asking a model to produce a reasoning path before its final answer. It comes in two flavors: zero-shot CoT (just the phrase, no examples) and few-shot CoT (examples that include reasoning, not just answers).
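
To make the two flavors concrete, here is a minimal sketch of how each prompt might be assembled as plain text. The function names, the `Q:`/`A:` layout, and the "The answer is" suffix are assumptions chosen for illustration, not a format any particular model requires.

```python
# Hypothetical helpers for illustration only; not tied to any model API.

COT_TRIGGER = "Let's think step by step."


def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: the question plus a trigger phrase, no examples."""
    return f"Q: {question}\nA: {COT_TRIGGER}"


def few_shot_cot(question: str, examples: list[tuple[str, str, str]]) -> str:
    """Few-shot CoT: each example is (question, reasoning, answer).

    The reasoning is shown before the answer so the model imitates the
    whole chain, not just the input-output pair.
    """
    blocks = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in examples
    ]
    # The new question goes last, with the answer slot left open.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)
```

Either string would then be sent to whatever completion endpoint is in use. The only difference between the two is whether worked reasoning appears in the prompt.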

It works for two reasons. First, decomposition: a hard problem may not be in training data, but its subproblems usually are. CoT lets the model work through pieces it has seen variants of. Second, compute: every generated token is one full forward pass, so more tokens means more compute. CoT spends that extra compute on reasoning that comes before the answer.
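
The compute argument can be put in rough numbers. The sketch below assumes about 2 FLOPs per parameter per generated token, a common rule of thumb for dense transformers that ignores the cost of attention over the context. The model size and token counts are made-up illustrative values.

```python
def decode_flops(n_params: float, n_generated_tokens: int) -> float:
    """Rough forward-pass cost of generating tokens.

    Assumption: ~2 FLOPs per parameter per generated token (one forward
    pass each). Attention cost over the context is ignored.
    """
    return 2.0 * n_params * n_generated_tokens


# Illustrative numbers only: a 7B-parameter model answering directly
# in 5 tokens versus writing a 150-token reasoning chain first.
direct = decode_flops(7e9, 5)
with_cot = decode_flops(7e9, 150)
ratio = with_cot / direct  # compute spent before committing to an answer
```

Under these assumptions the ratio is simply the ratio of token counts. This is why a longer chain buys proportionally more computation with the same weights.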

This summary is the scan-it-in-five-minutes version. The full lesson covers the empirical scaling behavior, self-consistency, the practical escalation ladder (zero-shot → few-shot → CoT → CoT few-shot → CoT with self-consistency), and the limits of the technique.

  • Chain-of-thought prompting (CoT) is asking a model to produce reasoning steps before the answer. Same model, same weights, just different prompting.
  • Two flavors. Zero-shot CoT appends a phrase like “Let’s think step by step” with no examples. Few-shot CoT shows examples that include the reasoning chain, not just the input-output pair.
  • Why it works, version 1: decomposition. A hard problem may not appear in training data verbatim. Its subproblems often do. CoT lets the model route through capabilities it already has.
  • Why it works, version 2: more tokens equals more compute. Each generated token is one full forward pass through the network. CoT produces more tokens before the answer, which gives the model more thinking time. This effect is more pronounced on larger models.
  • Self-consistency. Sample N CoT chains in parallel, then majority-vote on the final answer. Trades cost for accuracy: roughly N times the tokens, but almost no extra engineering. Cheap multiplier for hard reasoning tasks.
  • Where CoT helps most. Multi-step reasoning, math word problems, multi-hop questions, code with subtle logic.
  • Where CoT is overkill or misleading. Simple lookup (CoT is just paying for tokens). Tasks the model genuinely cannot solve (CoT can produce confident-sounding nonsense). High-stakes decisions where the chain is mistaken for proof of correctness.
  • CoT can be debugged. When the chain is wrong, you can see where it went wrong and adjust the prompt or context. Debugging without CoT is much harder.
  • Modern frontier models often do CoT by default. Many large models have been trained on CoT-shaped data and produce reasoning steps without being asked.
  • Reasoning models are different. They are trained to produce long internal reasoning as part of their policy, not just when prompted. Phase 6 covers them.
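
The self-consistency bullet above can be sketched in a few lines. Here `sample_chain` is a hypothetical callable standing in for one sampled model call at nonzero temperature. The answer-extraction regex assumes each chain ends with a phrase like "the answer is X", which is an assumption about the prompt format, not something a model guarantees.

```python
import re
from collections import Counter
from typing import Callable, Optional


def extract_answer(chain: str) -> Optional[str]:
    """Return the last 'answer is X' value in a chain, or None if absent."""
    matches = re.findall(
        r"answer is\s+(-?\d+(?:\.\d+)?|\w+)", chain, flags=re.IGNORECASE
    )
    return matches[-1] if matches else None


def self_consistency(
    sample_chain: Callable[[str], str], prompt: str, n: int = 5
) -> Optional[str]:
    """Sample n chains and majority-vote on the extracted final answers.

    Chains are sampled sequentially here for simplicity; in practice the
    n calls are independent and can run in parallel.
    """
    answers = []
    for _ in range(n):
        answer = extract_answer(sample_chain(prompt))
        if answer is not None:  # skip chains with no parseable answer
            answers.append(answer)
    if not answers:
        return None
    # Ties resolve to the answer that was seen first.
    return Counter(answers).most_common(1)[0][0]
```

Voting on the extracted answer rather than the full chain is the point of the technique: different chains can reach the same answer by different routes, and agreement across routes is the signal.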

After this lesson, you have a complete inference-time prompting toolkit: zero-shot for the easy case, few-shot when format matters, CoT when reasoning is required, self-consistency when the cost is worth the reliability gain. You can also recognize the difference between prompting a model to reason and using a model that has been trained to reason. The first is what this phase covered. The second is one of the things that changes in Phase 6.

More tokens means more compute. CoT is how you spend that compute on a hard problem.
Zero-shot CoT for free, few-shot CoT to demonstrate the kind of reasoning, self-consistency for the cheap multiplier.
The chain is a signal, not a certification. The model can be wrong with reasoning that sounds right.