Reasoning models, in brief

What you’ll learn

This is the opening lesson of Phase 6, How models reason and act, in Track 5 (AI Foundations). Phase 5 ended with chain-of-thought prompting, a technique for asking any LLM to produce reasoning before its answer. This lesson covers a different, more recent shift: reasoning models are LLMs trained to produce long reasoning chains as part of their default behavior (their policy), not only when prompted.

A reasoning-model call outputs two things: a reasoning chain, then a final answer. Both are tokens the model generated. Training rewards the correctness of the final answer that follows the reasoning, often via reinforcement learning on problems with verifiable answers (math with ground truth, coding with test cases). The result is a model that reasons natively.

This lesson covers what reasoning models are (OpenAI o1, DeepSeek R1, Gemini Flash Thinking, Anthropic thinking modes), what “thinking time” means in modern chat UIs, the compute-budget framing, and how to read claims about reasoning-model performance on HumanEval, SWE-bench, Codeforces, GSM8K, and AIME, especially claims stated as Pass@K. Course materials are at cme295.stanford.edu.
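
To make the "verifiable answers" idea concrete, here is a minimal sketch of an outcome-only reward for a math problem. It is an illustration, not any lab's actual training code: the function name verifiable_reward and the "Final answer:" output convention are hypothetical, chosen only for this example.

```python
import re


def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Score a completion by its final answer only.

    Assumption (hypothetical format): the completion's last line looks like
    'Final answer: <value>'. The reasoning text before it is never scored.
    """
    match = re.search(r"Final answer:\s*(.+?)\s*$", completion.strip())
    if match is None:
        return 0.0  # no parseable final answer, so no reward
    return 1.0 if match.group(1) == ground_truth.strip() else 0.0


good = "12 * 4 = 48, and 48 + 2 = 50.\nFinal answer: 50"
bad = "12 * 4 = 48.\nFinal answer: 48"
print(verifiable_reward(good, "50"), verifiable_reward(bad, "50"))
```

Only the final answer is checked, so the model is free to reason however it likes as long as the result is correct. For coding problems, the equality check would be replaced by running test cases.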

Where this fits

This lesson opens Phase 6, How models reason and act. The previous lesson (How chain of thought makes models think out loud) covered CoT as a prompting technique. This lesson covers what changes when reasoning is built into the model’s training rather than only elicited by the prompt. The next three lessons in this phase cover RAG (the model fetching text it does not have in its weights), function calling (the model emitting structured calls to external tools), and agent loops (the model chaining tools together into longer-horizon work). Each one further extends what a single LLM call can do.

Before you start

Prerequisites: the chain-of-thought lesson is required. We assume you understand what CoT is at a prompting level and the “more tokens equals more compute” framing. The reward model lesson and RLHF lesson are useful for understanding the training-side claims about verifiable rewards but are not strictly required.

By the end, you’ll be able to

Distinguish a reasoning model from a standard LLM by training objective and what gets rewarded
Explain the compute-budget framing for reasoning models and what “thinking time” means in modern chat UIs
Identify the major reasoning benchmarks (HumanEval, SWE-bench, Codeforces, GSM8K, AIME) and what each measures
Read a Pass@K claim correctly, including the value of K and the sampling-temperature trade-off
Recognize the role of verifiable rewards in why reasoning models work where they do
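
As a preview of the Pass@K objective above, the sketch below computes Pass@K with the estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass, and estimate the chance that at least one of K randomly drawn samples is correct. The function name pass_at_k and the example numbers are illustrative assumptions; a given benchmark report may compute the metric differently, so check its methodology.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct).

    n: samples generated for the problem
    c: how many of those samples passed
    k: the K in Pass@K (must satisfy 1 <= k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so one must pass
    # 1 minus the probability that all k draws come from the n - c failures
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 5 of 20 samples passed.
print(pass_at_k(20, 5, 1))   # 0.25
print(round(pass_at_k(20, 5, 10), 2))  # 0.98
```

The same model scores 0.25 at K=1 and about 0.98 at K=10 on this problem, which is why a Pass@K number means little unless K is stated.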

Time and difficulty

Read time: about 13 minutes
Practice time: about 12 minutes (a self-check on the standard-vs-reasoning distinction, a hands-on Pass@K reading exercise on benchmark claims, and flashcards)
Difficulty: standard