References: How chain of thought makes models think out loud

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lectures:
    Lecture 3 (Large Language Models):
      https://www.youtube.com/watch?v=BREr-2cMx-4
    Lecture 6 (LLM Reasoning):
      see course site at https://cme295.stanford.edu/ for the lecture URL
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the chain-of-thought sections of Stanford CME 295 Lectures 3
and 6, covering [01:18:50] CoT introduction in Lecture 3, [01:21:50] self-
consistency in Lecture 3, and [00:13:48-00:16:18] CoT framing for reasoning
problems in Lecture 6 (CoT as the bridge to reasoning models, which Phase 6
covers in detail). Clawdemy provides original notes, summaries, and quizzes
derived from this material for educational purposes. All rights to the original
lectures remain with Stanford and the instructors.

Primary sources

The three papers behind this lesson, all first published in 2022.

“Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, Wei et al., 2022. The paper that introduced few-shot CoT and demonstrated its scaling behavior. The headline finding: CoT prompting produces large performance gains on reasoning benchmarks, but only on sufficiently large models. Smaller models can be hurt by CoT prompting. This is the source of the “scale matters for CoT” claim in the lesson. Sections 3 (math benchmarks) and 4 (commonsense reasoning) are the empirical core.
“Large Language Models are Zero-Shot Reasoners”, Kojima et al., 2022. The paper that documented zero-shot CoT: simply appending “Let’s think step by step” to a prompt produces a reasoning chain. The technique stuck because it’s nearly free (one phrase, no examples) and works across many models. Read section 4 (experiments) for the empirical comparison of zero-shot CoT against direct prompting and few-shot CoT.
“Self-Consistency Improves Chain of Thought Reasoning in Language Models”, Wang et al., 2022. Introduces self-consistency, an add-on to CoT that needs no extra training or examples, only more samples per question. The paper’s core insight: sampling multiple reasoning paths and majority-voting over their final answers outperforms greedy decoding of a single chain. The technique is broadly applicable and is one of the most-cited follow-ups to the original CoT paper.
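
The two techniques above can be sketched in a few lines. This is a minimal illustration, not the papers' code: `sample_chain` stands in for whatever model call you use (it is a hypothetical callable, sampled at nonzero temperature), and `extract_answer` is an assumed parser that simply takes the last number in the chain. The `Q: ... A: Let's think step by step.` layout follows the prompt format described by Kojima et al.

```python
import itertools
import re
from collections import Counter
from typing import Callable, Optional

COT_TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: append the trigger phrase, no worked examples."""
    return f"Q: {question}\nA: {COT_TRIGGER}"


def extract_answer(chain: str) -> Optional[str]:
    """Assumed parser: treat the last number in the chain as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None


def self_consistency(
    prompt: str,
    sample_chain: Callable[[str], str],  # hypothetical model call
    n_samples: int = 5,
) -> Optional[str]:
    """Sample several reasoning chains and majority-vote over final answers."""
    answers = []
    for _ in range(n_samples):
        answer = extract_answer(sample_chain(prompt))
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    # The most frequent final answer wins; the chains themselves may differ.
    return Counter(answers).most_common(1)[0][0]


# Demo with canned chains in place of a real model.
_canned = itertools.cycle([
    "5 + 6 = 11. The answer is 11.",
    "5 + 6 = 12. The answer is 12.",
    "There are 5 and 6 more, so 11.",
])


def fake_sampler(prompt: str) -> str:
    return next(_canned)


print(self_consistency(zero_shot_cot_prompt("What is 5 + 6?"), fake_sampler))
```

Note the cost: the vote needs `n_samples` full generations per question, so the gain in accuracy is paid for in inference compute.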

Going deeper

“Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning”, Wang et al., 2023. A follow-up zero-shot CoT technique: instead of “Let’s think step by step,” instruct the model to “first devise a plan, then carry out the plan.” Improves accuracy on multi-step problems where simple step-by-step is insufficient. Worth reading after Kojima et al.
“Tree of Thoughts: Deliberate Problem Solving with Large Language Models”, Yao et al., 2023. Generalizes CoT from a linear chain to a tree of explored reasoning paths, with backtracking and lookahead. Substantially more compute per query but solves problems that linear CoT cannot. Read after self-consistency.
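
To make the jump from a chain to a tree concrete, here is a toy breadth-first search over partial chains. It is a sketch under loud assumptions: in Tree of Thoughts both the proposal of next steps and the evaluation of partial solutions are done by the language model, whereas `toy_propose` and `toy_score` below are hand-written stand-ins for a made-up puzzle (pick three digits from 1 to 5 that sum to 10). Only the search loop is the point.

```python
from typing import Callable, List, Tuple

State = Tuple[str, ...]  # a partial chain of "thoughts"


def tree_of_thoughts_bfs(
    propose: Callable[[State], List[str]],  # stand-in for model-generated next steps
    score: Callable[[State], float],        # stand-in for model-based evaluation
    depth: int,
    beam_width: int,
) -> State:
    """Expand every kept state, score the children, keep the best few."""
    frontier: List[State] = [()]
    for _ in range(depth):
        candidates = [
            state + (thought,)
            for state in frontier
            for thought in propose(state)
        ]
        if not candidates:
            break
        # Python's sort is stable, so ties keep their generation order.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=score)


def toy_propose(state: State) -> List[str]:
    """Toy stand-in: any digit from 1 to 5 may come next."""
    return ["1", "2", "3", "4", "5"]


def toy_score(state: State) -> float:
    """Toy stand-in: closer to a running sum of 10 is better."""
    return -abs(10 - sum(int(t) for t in state))


print(tree_of_thoughts_bfs(toy_propose, toy_score, depth=3, beam_width=2))
# -> ('5', '4', '1'), which sums to 10
print(tree_of_thoughts_bfs(toy_propose, toy_score, depth=3, beam_width=1))
# -> ('5', '5', '1'), which overshoots to 11
```

With `beam_width=1` the search degenerates into a single greedy chain and commits to a locally best step it cannot undo. Keeping a second candidate alive is what lets the search recover, and it is also where the extra compute per query goes.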

Context for Phase 6

“Chain-of-Thought Reasoning Without Prompting”, Wang and Zhou, 2024. The interesting empirical finding that, with the right decoding strategy, models can produce CoT-shaped output without any prompt manipulation. Worth reading as the bridge to reasoning models (Phase 6): the question stops being “how do we prompt for CoT” and starts being “which models do CoT naturally.”
The reasoning-models literature. OpenAI’s o1 announcement, the DeepSeek-R1 paper (DeepSeek-AI, 2025), and Anthropic’s discussion of “thinking” modes are all worth reading as Phase 6 prep. Each represents a different attempt to train a model to do CoT internally as part of its policy, rather than relying on the prompt to elicit it.

Adjacent topics

Compute budgets and inference-time compute. The lesson’s “more tokens equals more compute” framing is starting to crystallize as a research area. Search terms: “inference-time compute,” “thinking time scaling laws,” “reasoning compute budget.” Useful for understanding why reasoning models (Phase 6) are described in terms of how long they’re allowed to think.
The CoT failure modes. Search terms: “CoT hallucination,” “unfaithful CoT,” “post-hoc rationalization in LLMs.” The literature here is growing because the gap between “the chain looks like reasoning” and “the chain actually drove the answer” is a real and concerning failure mode for high-stakes applications.

Stanford CME 295 cheatsheet

Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The “chain of thought” and “reasoning” sections cover the same material in their dense visual style. Worth using as a study reference after this lesson.

Community discussion

None selected for this lesson. The published literature is consolidated enough that academic sources are the better entry point. Durable community references will be added at a future quarterly review if any consolidate.