Why benchmarks can mislead, in brief

What you’ll learn

This is lesson 2 of Phase 7 (How we judge models and where they’re going) in Track 5 (AI Foundations). The previous lesson covered LLM-as-a-Judge, in which one LLM evaluates another’s output. This lesson covers the other half of evaluation: standardized benchmarks with hardcoded answer keys, the source of every “85% on MMLU” or “47% Pass@1 on SWE-bench” number you see in model releases.

The lesson walks through the major benchmark categories the Stanford lecturer flagged:

Knowledge benchmarks: MMLU, a multiple-choice test across 57 subjects that mostly probes pretraining retention
Reasoning benchmarks: AIME for hard math, GSM8K for grade-school math, PIQA for common sense
Coding benchmarks: HumanEval (saturated), CodeForces (with human-comparable ratings), SWE-bench (real GitHub issues)

It then surfaces the structural reasons benchmark scores can rise faster than real capability (training on benchmark-shaped data, multiple-choice constraints, single-axis measurement, saturation) and applies the lecturer’s “categorize errors before chasing them” methodology to the question of how to read a benchmark claim. By the end you will be able to read benchmark numbers carefully and ask the right follow-up questions. Course materials are at cme295.stanford.edu.

Where this fits

The previous lesson (How we evaluate models, LLM-as-a-Judge) covered the open-ended half of evaluation; this lesson covers the standardized-benchmark half. The next lesson (Why tool-using models fail) applies the lecturer’s “categorize before chasing” methodology to a specific failure-mode taxonomy. After that, two frontier-direction lessons (transformers beyond text, new generation methods) and the safety recap close the track.

Before you start

Prerequisites: the LLM-as-a-Judge lesson (the Phase 7 opener) is required for narrative continuity. The reasoning models lesson is useful because this lesson references Pass@K, the dominant metric for coding benchmarks.
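
If Pass@K is unfamiliar, the sketch below may help as a refresher. It assumes the common unbiased estimator popularized alongside HumanEval: sample n solutions per problem, count the c that pass the tests, and estimate the chance that at least one of k randomly drawn samples passes. The function names `pass_at_k` and `benchmark_pass_at_k` are illustrative, not taken from the course materials or any specific evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated, c: samples that passed, k: draw size.
    Computes 1 - C(n - c, k) / C(n, k), i.e. one minus the probability
    that all k drawn samples come from the failing ones.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs (hypothetical helper)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Made-up counts for three problems, 10 samples each.
    toy_results = [(10, 3), (10, 0), (10, 10)]
    print(f"pass@1 = {benchmark_pass_at_k(toy_results, 1):.3f}")
    print(f"pass@5 = {benchmark_pass_at_k(toy_results, 5):.3f}")
```

Note that the same model gets a higher score at k=5 than at k=1 on identical data, which is one reason to check which K a headline number reports.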

By the end, you’ll be able to

Identify the major benchmark categories (knowledge, reasoning, coding, common sense) and what each one actually measures
Recognize specific benchmarks by name (MMLU, AIME, GSM8K, HumanEval, SWE-bench, CodeForces, PIQA) and their grading disciplines
Apply the lecturer’s “categorize before chasing” methodology when reading a benchmark claim
Explain three structural reasons benchmark scores can rise faster than real capability (benchmark-shaped training data, format constraints, single-axis measurement)
Use the practical reading checklist (category, metric, saturation, headline-vs-representative, training-cutoff overlap, application match) on benchmark claims you encounter

Time and difficulty

Read time: about 12 minutes
Practice time: about 12 minutes (a self-check on benchmark categories and what each measures, a hands-on exercise in reading realistic model-card benchmark claims, and flashcards)
Difficulty: standard