Summary: Why benchmarks can mislead

Benchmarks are real evidence, on a narrow slice. A score of 85% on MMLU tells you the model retained a lot of pretraining knowledge. It does not tell you the model can write a working code patch on SWE-bench, follow a multi-step plan, or refuse a malicious prompt. Each benchmark probes one capability axis. Treating any single number as a global verdict is the most common reading error.

The major categories. Knowledge benchmarks (MMLU, multiple-choice across 57 subjects) probe pretraining retention. Reasoning benchmarks (AIME for hard math, GSM8K for grade-school math, PIQA for physical commonsense) probe multi-step thinking. Coding benchmarks (HumanEval, now saturated; CodeForces, with human-comparable ratings; SWE-bench, built from real GitHub issues) probe code understanding and generation.

Why benchmark scores can rise faster than real capability. Training data leakage (modern pretraining corpora are huge and weakly filtered, so benchmark-shaped content slips in). Multiple-choice constraints (much easier than open-ended). Single-axis measurement (each benchmark covers one slice). Saturation (small differences at 95%+ are mostly noise).
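The saturation point can be checked with a little arithmetic. The sketch below is a rough illustration, not a full statistical test: it assumes each benchmark item is an independent pass/fail trial and compares two scores as if they were unpaired samples. The function names are hypothetical. On a 164-problem benchmark, a score of 95% carries a standard error of roughly 1.7 percentage points, so a one-point gap between two models sits well inside the noise.

```python
import math


def accuracy_standard_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy measured on n_items items.

    Assumption: items are independent pass/fail trials.
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


def scores_distinguishable(acc_a: float, acc_b: float, n_items: int,
                           z: float = 1.96) -> bool:
    """Rough check: is the gap larger than z standard errors of the difference?

    Assumption: the two scores are treated as unpaired samples of the same
    size. A paired comparison on shared items would be more sensitive.
    """
    se_a = accuracy_standard_error(acc_a, n_items)
    se_b = accuracy_standard_error(acc_b, n_items)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(acc_a - acc_b) > z * se_diff


if __name__ == "__main__":
    # 95% vs 96% on a 164-item benchmark: the gap is within the noise.
    print(scores_distinguishable(0.95, 0.96, 164))
    # 60% vs 80% on the same benchmark: the gap is clearly larger than the noise.
    print(scores_distinguishable(0.60, 0.80, 164))
```

The same one-point gap becomes more meaningful as the number of items grows, which is one reason small saturated benchmarks are the hardest to read.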

The lecturer’s methodology. Categorize what each benchmark measures before chasing the score. A high MMLU score does not imply a high SWE-bench score. A high SWE-bench score does not imply success in real-world deployment. Match the benchmark to the capability you care about; treat the score as evidence about that slice only.
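To make the "one slice only" point concrete, here is a small sketch with two score profiles. The numbers are invented purely for illustration and are not real model results; the model names and helper functions are hypothetical. Both profiles have the same average, yet they lead to opposite choices depending on which capability matters.

```python
from statistics import mean

# Hypothetical scores, invented for illustration only (not real results).
candidates = {
    "model_a": {"MMLU": 0.90, "GSM8K": 0.95, "SWE-bench": 0.25},
    "model_b": {"MMLU": 0.70, "GSM8K": 0.70, "SWE-bench": 0.70},
}


def average_score(scores: dict) -> float:
    """Collapse a per-benchmark profile into one number (this loses information)."""
    return mean(scores.values())


def best_for(benchmark: str, models: dict) -> str:
    """Pick the model with the highest score on the one benchmark you care about."""
    return max(models, key=lambda name: models[name][benchmark])


if __name__ == "__main__":
    for name, scores in candidates.items():
        print(name, round(average_score(scores), 2))
    print("best for knowledge recall:", best_for("MMLU", candidates))
    print("best for real code patches:", best_for("SWE-bench", candidates))
```

The averages match, but the per-benchmark lookup does not: matching the benchmark to the capability changes the answer.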

This summary is the scan-it-in-five-minutes version. The full lesson covers the specific benchmarks in detail and the practical reading checklist.

Core ideas

One slice per benchmark. Each benchmark probes a specific capability. The score speaks to that capability and that capability only.
Knowledge benchmarks: MMLU. Massive Multitask Language Understanding. 57 subjects, multiple-choice (4 options). Probes pretraining retention. Not reasoning, not application fit.
Reasoning benchmarks: AIME, GSM8K, PIQA. AIME = US math olympiad qualifier (hard). GSM8K = grade-school math (saturated). PIQA = physical commonsense (about 20,000 examples, 2-option multiple-choice).
Coding benchmarks: HumanEval, CodeForces, SWE-bench. HumanEval (164 problems, saturated). CodeForces (rating system vs human contestants). SWE-bench (real GitHub issues, current frontier).
Why scores can rise without capability rising. (1) Training data leakage from benchmark-shaped content. (2) Format constraints (multiple-choice is easier than open-ended). (3) Single-axis measurement misses other capabilities. (4) Saturation.
Reading checklist. Category. Metric (Pass@1, Pass@10, accuracy, exact match). Saturation status. Headline vs representative. Training-cutoff overlap with model. Application match.
Pitfall: averaging across benchmarks. Each benchmark measures something different. “Average score across 8 benchmarks” is not a meaningful single number; you have lost the information about which capabilities the model is strong on.
Pitfall: ignoring the metric. “85% on AIME” without knowing the K (Pass@1, Pass@10, majority-vote-at-K) is incomplete information. The K and the temperature meaningfully change what the number means.
Pitfall: underweighting the leakage problem. When a model dramatically improves on a popular benchmark, the first question is not “what new capability was added?” It is “did the training data include benchmark-shaped content?”
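The "ignoring the metric" pitfall is easier to see with numbers. The sketch below uses the widely used unbiased pass@k estimator: draw n samples per problem, count the c that pass, and estimate the chance that at least one of k samples passes. Whether a given report used this estimator, and at what temperature, is something to check in the report itself. The sample counts in the usage lines are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples drawn, c: samples that passed, k: budget being scored.
    Computes 1 - C(n - c, k) / C(n, k), the probability that a random
    subset of k samples contains at least one passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: every subset of size k has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical problem: 10 samples drawn, 3 of them pass.
    for k in (1, 5, 10):
        print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

The same underlying model behaviour (3 passes in 10 samples) reads as 30% at Pass@1 and 100% at Pass@10, which is why a score quoted without its K is incomplete.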

What changes for you

After this lesson, model-release announcements stop being numerical magic. When you see “the new model scores 92% on Y,” you can place the claim: which benchmark category, which metric, how saturated, and whether the benchmark resembles your use case. You also know to ask the leakage question, which the field is increasingly transparent about: was this benchmark in the training mix, and does the score therefore reflect recall rather than capability?
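One simple heuristic for the leakage question is to look for long word sequences shared between a benchmark item and training text. The sketch below is an assumption-laden illustration, not any lab's actual decontamination pipeline: it uses whitespace tokenization, a hypothetical default window of 8 words, and function names chosen for this example. Real corpora are far too large for this naive in-memory version.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in the corpus text.

    A value near 1.0 suggests the item appears close to verbatim in the corpus.
    Items shorter than n words yield 0.0 because they have no n-grams.
    """
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = word_ngrams(corpus_text, n)
    return len(item_grams & corpus_grams) / len(item_grams)


if __name__ == "__main__":
    # Hypothetical strings, invented for illustration.
    item = "what is the sum of the first ten positive integers"
    leaked = "forum post: what is the sum of the first ten positive integers? answer below"
    clean = "a recipe for sourdough bread with a long overnight fermentation step"
    print(overlap_fraction(item, leaked))
    print(overlap_fraction(item, clean))
```

A high overlap does not prove the score is recall, and a low overlap does not rule out paraphrased leakage; the check only tells you where to look harder.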

A benchmark measures one slice. Treating the score as a global verdict is the most common reading error.
Match the benchmark category to the capability you care about. Read the metric. Note the saturation status.
Benchmark scores can rise faster than real capability. The number is evidence; it is not the whole story.