
References: How reasoning models think differently

Source material:
• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
Instructors: Afshine Amidi & Shervine Amidi, Stanford University
Course site: https://cme295.stanford.edu/
Cheatsheet: https://cme295.stanford.edu/cheatsheet/
Source lecture (Lecture 6, LLM Reasoning):
see course site at https://cme295.stanford.edu/ for the lecture URL
License (lecture videos): as published on Stanford's public YouTube channel
License (Amidi cheatsheets): MIT
This lesson adapts the reasoning-model section of Stanford CME 295 Lecture 6,
covering [00:22:48-00:24:22] the timeline of reasoning models, [00:24:22-00:27:54]
how to recognize a reasoning model in chat UIs and the hidden raw chain,
[00:27:54-00:31:47] the major benchmarks (HumanEval, SWE-bench, CodeForces,
GSM8K, AIME), and [00:31:47 onward] the Pass@K derivation. Clawdemy provides
original notes, summaries, and quizzes derived from this material for
educational purposes. All rights to the original lectures remain with Stanford
and the instructors.

The papers that established the modern reasoning-model recipe.

The papers behind the major reasoning benchmarks. Each is short and worth scanning if you want to understand exactly what each one measures.

A short list, chosen for durability.

  • “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, Wei et al., 2022. The CoT-prompting paper from the previous lesson. Worth re-reading in the reasoning-model context: CoT prompting is the technique that worked at the prompting layer; reasoning-model training is what bakes it into the policy.

  • “Chain-of-Thought Reasoning Without Prompting”, Wang and Zhou, 2024. The empirical finding that, with the right decoding strategy, models can produce CoT-shaped output without any prompt manipulation. It bridges “we prompt for CoT” to “models naturally do CoT” to “models are trained for CoT.”

  • “Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters”, Snell et al., 2024. The paper that puts the “more thinking time means more capability” intuition on an empirical footing by studying how performance scales with inference-time compute. Useful for understanding why compute budgets are a load-bearing dial.

  • The “reasoning chain authenticity” question. Search terms: “unfaithful CoT,” “post-hoc rationalization in LLMs.” A growing literature asks whether the reasoning chain a reasoning model produces actually drove the answer or is a post-hoc rationalization the model produced after deciding the answer some other way. The honest answer right now is: sometimes the chain drove the answer, sometimes it was decoration. Active research area.

  • GRPO (Group Relative Policy Optimization). Mentioned briefly in this lesson and the Phase 4 closer (RLHF/DPO). The DeepSeekMath and DeepSeek-R1 papers are the primary sources. Worth reading if you want the algorithmic specifics of how GRPO drops PPO's learned value function and instead scores each sampled completion against the other completions drawn for the same prompt.

  • Reasoning-model failure modes. Search terms: “reasoning model overconfidence,” “reasoning model alignment regression.” Reasoning models are subject to the same hallucination, reward-hacking, and out-of-distribution failures as standard LLMs, sometimes amplified by the longer reasoning chains they produce. Worth keeping in mind when interpreting reasoning-model claims.

  • Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The reasoning-model section covers the same material in their dense visual style. Worth using as a study reference after this lesson.
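
To make the GRPO entry in the list above concrete, here is a minimal Python sketch of the group-relative advantage step, assuming the outcome-reward form described in the DeepSeekMath paper: each completion's reward is normalized by the mean and standard deviation of the rewards in its group, so no learned value function is needed. The function name `group_relative_advantages`, the `eps` constant, and the choice of population standard deviation are illustrative assumptions, not details taken from the lecture.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion relative to the others sampled for the same prompt.

    rewards: one scalar reward per sampled completion (one group = one prompt).
    eps: small constant (an assumption here) that avoids division by zero
         when every completion in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    baseline = mean(rewards)   # the group mean stands in for PPO's value estimate
    spread = pstdev(rewards)   # population std; the exact estimator is an assumption
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical group of four completions for one prompt: two correct, two wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Completions that beat their group's average get a positive advantage and the rest get a negative one. In a full training loop these values would weight a clipped policy-gradient objective with a KL penalty, which this sketch leaves out.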

None selected for this lesson. The reasoning-model field is moving fast enough that academic sources and primary research are the better entry point. Durable community references will be added at a future quarterly review if any consolidate.