Why benchmarks can mislead
What you’ll learn
This is lesson 2 of Phase 7 (How we judge models and where they’re going) in Track 5 (AI Foundations). The previous lesson covered LLM-as-a-Judge, where one LLM evaluates another’s output. This lesson covers the other half of evaluation: standardized benchmarks with hardcoded answer keys, the source of every “85% on MMLU” or “47% Pass@1 on SWE-bench” number you see in model releases.
The lesson walks through the major benchmark categories the Stanford lecturer flagged:
- Knowledge benchmarks: MMLU, multiple-choice questions across 57 subjects, which mostly probes pretraining retention
- Reasoning benchmarks: AIME for hard math, GSM8K for grade-school math, PIQA for common sense
- Coding benchmarks: HumanEval (now saturated), CodeForces (with human-comparable ratings), SWE-bench (real GitHub issues)
It then surfaces the structural reasons benchmark scores can rise faster than real capability (training on benchmark-shaped data, multiple-choice constraints, single-axis measurement, saturation) and applies the lecturer’s “categorize errors before chasing them” methodology to the question of how to read a benchmark claim. By the end you will be able to read benchmark numbers carefully and ask the right follow-up questions. Course materials are at cme295.stanford.edu.
Where this fits
This is lesson 2 of Phase 7. The previous lesson (How we evaluate models, LLM-as-a-Judge) covered the open-ended half of evaluation. This lesson covers the standardized-benchmark half. The next lesson (Why tool-using models fail) applies the lecturer’s “categorize before chasing” methodology to a specific failure-mode taxonomy. After that, two frontier-direction lessons (transformers beyond text, new generation methods) and the safety recap close the track.
Before you start
Prerequisites: the LLM-as-a-Judge lesson (the Phase 7 opener) is required for narrative continuity. The reasoning models lesson is useful because this lesson references Pass@K, the dominant metric for coding benchmarks.
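As a quick refresher on Pass@K, here is a minimal sketch of how the metric is commonly computed. It assumes the unbiased estimator popularized by the HumanEval paper: sample n completions per problem, count the c that pass the tests, and estimate the chance that at least one of k randomly drawn samples passes. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical, and the lesson itself does not say which estimator a given model card uses.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K for one problem.

    n: number of completions sampled for the problem
    c: number of those completions that passed the tests
    k: the K in Pass@K (must satisfy 1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1. comb() returns 0 when fewer than k samples failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, one per problem."""
    if not results:
        raise ValueError("results must contain at least one problem")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to c / n for each problem, so a "Pass@1" headline is the average single-attempt success rate across the benchmark's problems.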
By the end, you’ll be able to
- Identify the major benchmark categories (knowledge, reasoning, coding, common sense) and what each one actually measures
- Recognize specific benchmarks by name (MMLU, AIME, GSM8K, HumanEval, SWE-bench, CodeForces, PIQA) and their grading disciplines
- Apply the lecturer’s “categorize before chasing” methodology when reading a benchmark claim
- Explain three structural reasons benchmark scores can rise faster than real capability (benchmark-shaped training data, format constraints, single-axis measurement)
- Use the practical reading checklist (category, metric, saturation, headline-vs-representative, training-cutoff overlap, application match) on benchmark claims you encounter
Time and difficulty
- Read time: about 12 minutes
- Practice time: about 12 minutes (a self-check on benchmark categories and what each measures, a hands-on benchmark-claim-reading exercise on real-style model-card claims, and flashcards)
- Difficulty: standard