References: How chain of thought makes models think out loud

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
Source lectures:
Lecture 3 (Large Language Models):
https://www.youtube.com/watch?v=BREr-2cMx-4
Lecture 6 (LLM Reasoning):
see course site at https://cme295.stanford.edu/ for the lecture URL
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson adapts the chain-of-thought sections of Stanford CME 295 Lectures 3
and 6, covering [01:18:50] CoT introduction in Lecture 3, [01:21:50] self-
consistency in Lecture 3, and [00:13:48-00:16:18] CoT framing for reasoning
problems in Lecture 6 (CoT as the bridge to reasoning models, which Phase 6
covers in detail). Clawdemy provides original notes, summaries, and quizzes
derived from this material for educational purposes. All rights to the original
lectures remain with Stanford and the instructors.

The three core papers behind this lesson, in chronological order, followed by further reading.

  • “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, Wei et al., 2022. The paper that introduced few-shot CoT and demonstrated its scaling behavior. The headline finding: CoT prompting produces large performance gains on reasoning benchmarks, but only on sufficiently large models. Smaller models can be hurt by CoT prompting. This is the source of the “scale matters for CoT” claim in the lesson. Sections 3 (math benchmarks) and 4 (commonsense reasoning) are the empirical core.

  • “Large Language Models are Zero-Shot Reasoners”, Kojima et al., 2022. The paper that documented zero-shot CoT: simply appending “Let’s think step by step” to a prompt produces a reasoning chain. The technique stuck because it’s nearly free (one phrase, no examples) and works across many models. Read the experiments section for the empirical comparison of zero-shot CoT against direct prompting and few-shot CoT.

  • “Self-Consistency Improves Chain of Thought Reasoning in Language Models”, Wang et al., 2022. Introduces self-consistency as the cheap multiplier on top of CoT. The paper’s core insight: sampling multiple reasoning paths and majority-voting outperforms taking the highest-likelihood single chain. The technique is broadly applicable and is one of the most-cited follow-ups to the original CoT paper.

  • “Chain-of-Thought Reasoning Without Prompting”, Wang and Zhou, 2024. Reports the empirical finding that, with the right decoding strategy, models can produce CoT-shaped output without any prompt manipulation. Worth reading as the bridge to reasoning models (Phase 6): the question stops being “how do we prompt for CoT” and starts being “which models do CoT naturally.”

  • The reasoning-models literature. OpenAI’s o1 announcement, the DeepSeek-R1 paper (DeepSeek-AI, 2025), and Anthropic’s discussion of “thinking” modes are all worth reading as Phase 6 prep. Each represents a different attempt to train a model to do CoT internally as part of its policy, rather than relying on the prompt to elicit it.

  • Compute budgets and inference-time compute. The lesson’s “more tokens equals more compute” framing is starting to crystallize as a research area. Search terms: “inference-time compute,” “thinking time scaling laws,” “reasoning compute budget.” Useful for understanding why reasoning models (Phase 6) are described in terms of how long they’re allowed to think.

  • The CoT failure modes. Search terms: “CoT hallucination,” “unfaithful CoT,” “post-hoc rationalization in LLMs.” The literature here is growing because the gap between “the chain looks like reasoning” and “the chain actually drove the answer” is a real and concerning failure mode for high-stakes applications.

  • Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The “chain of thought” and “reasoning” sections cover the same material in their dense visual style. Worth using as a study reference after this lesson.
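
The zero-shot CoT and self-consistency entries above describe two mechanics that compose naturally: append the trigger phrase to the question, sample several reasoning chains, and majority-vote on the final answers. The sketch below is a minimal illustration under stated assumptions, not the implementation from either paper. The names `zero_shot_cot_prompt`, `extract_answer`, `self_consistency`, and `stub_sampler` are hypothetical, the "last number in the chain is the answer" rule is a simplification (Kojima et al. use a second answer-extraction prompt), and the sampler is a stub that replays canned chains in place of a real model call with temperature above zero.

```python
import itertools
import re
from collections import Counter
from typing import Callable, Optional

# The trigger phrase documented by Kojima et al. (2022).
COT_TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str) -> str:
    """Build a zero-shot CoT prompt: the question plus the trigger phrase."""
    return f"Q: {question}\nA: {COT_TRIGGER}"


def extract_answer(chain: str) -> Optional[str]:
    """Assumption for this sketch: the final answer is the last number in the chain."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None


def self_consistency(
    question: str,
    sample_chain: Callable[[str], str],
    n_samples: int = 5,
) -> Optional[str]:
    """Sample several reasoning chains and return the majority-vote answer."""
    prompt = zero_shot_cot_prompt(question)
    answers = []
    for _ in range(n_samples):
        answer = extract_answer(sample_chain(prompt))
        if answer is not None:  # skip chains with no parseable answer
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Stub sampler standing in for a real model: it replays canned chains in order.
# Three chains reach 11; two make different arithmetic slips.
_canned_chains = itertools.cycle([
    "Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.",
    "He buys 2 cans, so 5 + 2 = 7. The answer is 7.",
    "2 cans times 3 balls is 6. 5 + 6 = 11. The answer is 11.",
    "Each can has 3 balls, so 2 * 3 = 6 new balls. 5 + 6 = 11. The answer is 11.",
    "5 + 3 = 8. The answer is 8.",
])


def stub_sampler(prompt: str) -> str:
    return next(_canned_chains)


question = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)
voted = self_consistency(question, stub_sampler, n_samples=5)
print(voted)  # 11
```

In real use, `stub_sampler` would be replaced by a call that samples from a model, and the vote only helps if the samples actually differ, which is why self-consistency is paired with sampling rather than greedy decoding. Note also that the cost grows linearly with `n_samples`, which ties back to the inference-time compute entry above.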

No community references were selected for this lesson. The published literature is consolidated enough that academic sources are the better entry point. Durable community references will be added at a future quarterly review if any emerge.