References: How reasoning models think differently

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 6, LLM Reasoning):
    see course site at https://cme295.stanford.edu/ for the lecture URL
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the reasoning-model section of Stanford CME 295 Lecture 6,
covering [00:22:48-00:24:22] the timeline of reasoning models, [00:24:22-00:27:54]
how to recognize a reasoning model in chat UIs and the hidden raw chain,
[00:27:54-00:31:47] the major benchmarks (HumanEval, SWE-bench, CodeForces,
GSM8K, AIME), and [00:31:47 onward] the Pass@K derivation. Clawdemy provides
original notes, summaries, and quizzes derived from this material for
educational purposes. All rights to the original lectures remain with Stanford
and the instructors.

Foundational reasoning-model papers

The papers that established the modern reasoning-model recipe.

“Learning to Reason with LLMs” (OpenAI o1 announcement), OpenAI, September 2024. The release announcement and technical sketch for o1-preview, the first widely deployed reasoning model. Introduces the “more reasoning tokens equals more capability” framing at scale and includes benchmark results that surprised the field. Read for the framing; the DeepSeek paper below gives a far more detailed public account of the technique.
“DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning”, DeepSeek-AI et al., January 2025. One of the most-cited reasoning-model papers, partly because it made the training recipe public. Section 2 (the GRPO algorithm and the reward-model-free training setup) is the load-bearing technical content. The paper reports Pass@1 numbers on AIME and several coding benchmarks that were competitive with OpenAI o1 at the time. Worth reading even if you skip the math.
“DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models”, Shao et al., February 2024. Predecessor to DeepSeek-R1; introduces the GRPO algorithm in the math-reasoning context. Listed here because R1 references it heavily.

The benchmarks

The papers behind the major reasoning benchmarks. Each is short and worth scanning if you want to understand exactly what each one measures.

“Evaluating Large Language Models Trained on Code” (HumanEval), Chen et al., 2021. The HumanEval benchmark introduction. Section 2 introduces the Pass@K metric formally and describes the dataset construction (164 hand-written problems, each with unit tests).
“SWE-bench: Can Language Models Resolve Real-World GitHub Issues?”, Jimenez et al., 2024. Introduces SWE-bench. The construction process (mining GitHub issues with merged fixing PRs and automatable test runs) is interesting in its own right. Worth reading if you want to understand why this benchmark is harder than HumanEval.
“Training Verifiers to Solve Math Word Problems” (GSM8K), Cobbe et al., 2021. Introduces GSM8K. The benchmark is about 8,500 grade-school math word problems (7,473 train + 1,319 test). Section 2 has the dataset details.
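
The Pass@K metric from the HumanEval paper above is easier to follow with a small sketch. Assuming n samples are generated per problem and c of them pass the unit tests, the paper's unbiased estimator is pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The function name pass_at_k and the example numbers below are illustrative assumptions, not taken from the lecture or the paper's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed the tests, k: sample budget.
    Illustrative helper (hypothetical name); follows
    pass@k = 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k draw contains a passing one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Made-up example: 10 samples drawn, 3 of them correct.
    print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
```

With k = 1 the estimate reduces to the fraction of correct samples, c / n, which is why Pass@1 is often read as plain accuracy.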

Going deeper

A short list, chosen for durability.

“Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, Wei et al., 2022. The CoT-prompting paper from the previous lesson. Worth re-reading in the reasoning-model context: CoT prompting is the technique that worked at the prompting layer; reasoning-model training is what bakes it into the policy.
“Chain-of-Thought Reasoning Without Prompting”, Wang and Zhou, 2024. The empirical finding that, with the right decoding strategy, models can produce CoT-shaped output without any prompt manipulation. Bridge from “we prompt for CoT” to “models naturally do CoT” to “models are trained for CoT.”
“Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters”, Snell et al., 2024. The paper formalizing the “more thinking time means more capability” intuition with an empirical study of how performance scales with inference-time compute. Useful for understanding why compute budgets are a load-bearing dial.

Adjacent topics

The “reasoning chain authenticity” question. Search terms: “unfaithful CoT,” “post-hoc rationalization in LLMs.” A growing literature asks whether the reasoning chain a reasoning model produces actually drove the answer or is a post-hoc rationalization the model produced after deciding the answer some other way. The honest answer right now is: sometimes the chain drove the answer, sometimes it was decoration. Active research area.
GRPO (Group Relative Policy Optimization). Mentioned briefly in this lesson and the Phase 4 closer (RLHF/DPO). The DeepSeekMath and DeepSeek-R1 papers are the primary sources. Worth reading if you want the algorithmic specifics of how GRPO drops the value function from PPO and uses groups of sampled completions instead.
Reasoning-model failure modes. Search terms: “reasoning model overconfidence,” “reasoning model alignment regression.” Reasoning models are subject to the same hallucination, reward-hacking, and out-of-distribution failures as standard LLMs, sometimes amplified by the longer reasoning chains they produce. Worth keeping in mind when interpreting reasoning-model claims.
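
For the GRPO entry above, the core idea of replacing PPO's learned value function with a group baseline can be sketched in a few lines. This is a minimal illustration, assuming one scalar reward per sampled completion and normalization by the group's mean and standard deviation; the function name, the eps constant, and the choice of population standard deviation are assumptions for the sketch, and the full algorithm (clipped policy ratio, KL penalty) is left to the papers.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Score each completion relative to its own sampling group.

    rewards: one scalar reward per completion sampled for the same prompt.
    Hypothetical helper; eps and the population std are assumptions made
    here to keep the division well defined.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    baseline = mean(rewards)   # group mean stands in for a value function
    spread = pstdev(rewards)   # group standard deviation
    return [(r - baseline) / (spread + eps) for r in rewards]


if __name__ == "__main__":
    # Made-up group of four completions: two judged correct, two incorrect.
    for adv in group_relative_advantages([1.0, 0.0, 0.0, 1.0]):
        print(f"{adv:+.3f}")
```

Completions that beat their group's average get a positive advantage and the rest a negative one; when every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.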

Stanford CME 295 cheatsheet

Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The reasoning-model section covers the same material in their dense visual style. Worth using as a study reference after this lesson.

Community discussion

None selected for this lesson. The reasoning-model field is moving fast enough that academic sources and primary research are the better entry point. Durable community references will be added at a future quarterly review if any consolidate.