Cheatsheet: Why benchmarks can mislead
The one idea that matters
A benchmark measures one slice.
The score speaks to that slice and that slice only.
Categorize what each benchmark measures BEFORE chasing the score.

The major categories
| Category | Probes | Specific benchmarks | Grading discipline |
|---|---|---|---|
| Knowledge | Pretraining retention | MMLU (57 subjects, ~14k questions) | 4-option multiple-choice |
| Reasoning, math | Multi-step math thinking | AIME (US olympiad qualifier) | Integer answer (000–999), checked against a fixed key |
| Reasoning, math | Grade-school math | GSM8K (~8,500 problems, saturated) | Numeric answer, checked against a fixed key |
| Reasoning, common sense | Physical-world understanding | PIQA (~20k examples) | 2-option multiple-choice |
| Coding | Function-completion | HumanEval (164 problems, saturated) | Run unit tests |
| Coding | Competitive programming | CodeForces (rating system) | Test cases + rating compared to humans |
| Coding | Real-codebase patches | SWE-bench (real GitHub issues, current frontier) | Project’s existing test suite |
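
The numeric-answer grading listed in the table can be sketched roughly as follows. This is a minimal illustration, not any benchmark's official grader: the function names and the "take the last number in the output" rule are assumptions made for the example.

```python
import re

def extract_final_number(text: str):
    """Return the last number found in the model output as a string, or None.

    Assumption for this sketch: the final number in the text is the answer.
    Thousands separators such as "1,234" are stripped.
    """
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return matches[-1].replace(",", "")

def grade_numeric(model_output: str, reference: str) -> bool:
    """Compare the extracted answer with the fixed reference answer."""
    extracted = extract_final_number(model_output)
    if extracted is None:
        return False
    return float(extracted) == float(reference)

print(grade_numeric("3 + 4 = 7", "7"))               # correct final answer
print(grade_numeric("So the total is 1,234.", "1234"))  # separator handled
print(grade_numeric("I am not sure.", "7"))          # no number to grade
```

Even in this toy version the score depends on the extraction rule, which is one reason the grading discipline belongs next to the number when reading a result.
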
Why benchmark scores can rise faster than real capability
| Reason | What goes wrong |
|---|---|
| Training on benchmark-shaped data | Pretraining corpora include benchmark-similar content; “score went up” can be “saw more similar examples” instead of “got smarter” |
| Format constraints | Multiple-choice is easier than open-ended; 80% on MMLU is not 80% on free-form Q&A in the same domain |
| Single-axis measurement | Each benchmark covers one slice; high on one ≠ high on all |
| Saturation | At 95%+, score differences are mostly noise |
| Differential leakage | Newer benchmarks have less leakage than older ones; freshness vs training cutoff matters |
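
To see why small gaps near the top of a saturated benchmark are mostly noise, here is a rough sketch that treats each benchmark item as an independent pass/fail trial and computes a binomial standard error. The function names, the two-standard-error threshold, and the 500-item benchmark are assumptions for illustration; real evaluations may use paired or bootstrap methods instead.

```python
import math

def score_standard_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy measured on n_items questions.

    Assumes items are independent pass/fail trials (a simplification).
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)

def difference_is_noise(acc_a: float, acc_b: float, n_items: int, z: float = 2.0) -> bool:
    """Return True if the gap between two scores is within z standard errors.

    Treats the two runs as independent and combines their standard errors.
    """
    se_diff = math.sqrt(
        score_standard_error(acc_a, n_items) ** 2
        + score_standard_error(acc_b, n_items) ** 2
    )
    return abs(acc_a - acc_b) <= z * se_diff

# On 164 problems, 96% vs 95% is a gap of fewer than two problems.
print(difference_is_noise(0.96, 0.95, n_items=164))  # True
# A 47% vs 38% gap on a hypothetical 500-item benchmark is well outside the noise.
print(difference_is_noise(0.47, 0.38, n_items=500))  # False
```

The smaller the benchmark and the closer the scores, the less a one-point lead says about which model is better.
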
The reading checklist
1. What's the benchmark CATEGORY? (knowledge, reasoning, coding, etc.)
2. What's the METRIC? (Pass@K with what K, accuracy, exact match)
3. Is the benchmark SATURATED? (95%+ across the field = noise floor)
4. HEADLINE or REPRESENTATIVE? (look for the full table)
5. TRAINING-CUTOFF OVERLAP? (a benchmark newer than the training cutoff = cleaner signal)
6. APPLICATION MATCH? (does it look like your use case?)

Reading a real-style claim
"Our new model scored 47% Pass@1 on SWE-bench Verified."
- Pass@1 → metric
- SWE-bench → benchmark
- Verified → benchmark variant (Verified ≠ original SWE-bench)

GOOD: category named (coding), metric explicit (Pass@1), variant explicit
ASK: what was Pass@5 or Pass@10? what was the temperature? was the benchmark in the training data?

How to compare two models on benchmarks
DON'T: average across benchmarks
  → "model A: 84.3, model B: 82.1"
DO: table by category, side by side
  → "A is stronger on coding (47% vs 38% on SWE-bench)"
  → "B is stronger on knowledge (88% vs 85% on MMLU)"
  → which matters for your application?

Pitfalls to dodge
| Pitfall | Reality |
|---|---|
| “Average across all benchmarks = overall model quality.” | No. Averages flatten capability-by-axis information. Read the table. |
| “Highest score on MMLU = best model.” | Only on knowledge. MMLU is approaching saturation; small differences are noise. |
| “85% on AIME, period.” | Incomplete. Pass at what K? Temperature? Without those, the number doesn’t fully parse. |
| “Big benchmark improvement = big capability improvement.” | Sometimes. Sometimes it’s training-data leakage. The honest answer is usually a mix. |
| “If it’s not benchmarked, it doesn’t matter.” | Most real applications aren’t directly benchmarked. Check whether the benchmarks resemble your use case before drawing conclusions. |
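
The "Pass at what K?" question matters because the same set of sampled attempts yields very different numbers at different K. Below is a minimal sketch using a widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n attempts were sampled and c of them were correct. The function name and the sample counts are illustrative assumptions, not figures from any real evaluation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K from n sampled attempts of which c were correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that k attempts
    drawn without replacement from the n samples are all incorrect.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 sampled attempts, 3 of them correct.
for k in (1, 5, 10):
    print(f"Pass@{k} = {pass_at_k(n=10, c=3, k=k):.3f}")
# Pass@1 = 0.300
# Pass@5 = 0.917
# Pass@10 = 1.000
```

A claim that omits K could refer to any of these three numbers, which is why the metric has to be read together with its K.
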
A capability-axis cheat sheet
| Capability you care about | Benchmark to look at first |
|---|---|
| Factual recall, breadth | MMLU |
| Multi-step math reasoning | AIME (hard) or GSM8K (easy) |
| Common-sense reasoning | PIQA |
| Code completion / small functions | HumanEval (note: saturated) |
| Real-codebase fixes | SWE-bench Verified |
| Competitive programming skill | CodeForces (with rating context) |
| Real-world deployment fit | None directly. Synthetic benchmarks rarely match real use. |
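
The compare-by-category advice from this cheatsheet can be sketched as a small lookup that groups shared benchmarks by capability axis and deliberately avoids computing an average. The category mapping follows the tables here; the function name, model labels, and scores (reused from the comparison example) are illustrative only, not real results.

```python
# Benchmark -> category, following the categories table in this cheatsheet.
CATEGORY = {
    "MMLU": "knowledge",
    "AIME": "reasoning (math)",
    "GSM8K": "reasoning (math)",
    "PIQA": "reasoning (common sense)",
    "HumanEval": "coding",
    "CodeForces": "coding",
    "SWE-bench": "coding",
}

def compare_by_category(scores_a: dict, scores_b: dict) -> dict:
    """Group shared benchmarks by category and report which model leads each.

    Returns {category: [(benchmark, score_a, score_b, leader), ...]}.
    No cross-benchmark average is computed, on purpose.
    """
    table = {}
    for bench in sorted(scores_a.keys() & scores_b.keys()):
        a, b = scores_a[bench], scores_b[bench]
        leader = "A" if a > b else "B" if b > a else "tie"
        category = CATEGORY.get(bench, "uncategorized")
        table.setdefault(category, []).append((bench, a, b, leader))
    return table

# Illustrative scores only (percent), not measurements of any real model.
model_a = {"SWE-bench": 47, "MMLU": 85}
model_b = {"SWE-bench": 38, "MMLU": 88}

for category, rows in compare_by_category(model_a, model_b).items():
    for bench, a, b, leader in rows:
        print(f"{category:10s} {bench:10s} A={a}% B={b}% -> {leader} leads")
```

The output keeps one row per benchmark under its category, so the final question stays where it belongs: which axis matters for the application at hand.
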
The lecturer’s methodology
Don't compare models across all benchmarks at once.
Categorize what each benchmark measures FIRST.
Then evaluate each capability axis SEPARATELY.

The same discipline that works for debugging tool-use failures (next lesson's territory) works for reading benchmarks.

Glossary
- Knowledge benchmark: tests the model’s ability to recall and compose facts. MMLU is the prime example.
- Reasoning benchmark: tests multi-step thinking. AIME, GSM8K, PIQA.
- Coding benchmark: tests code generation. HumanEval, CodeForces, SWE-bench.
- Saturation: when models cluster at the top of a benchmark (typically 95%+); small score differences are noise.
- Pass@K: the probability that at least one of K attempts is correct. Pass@1 is the most stringent setting, since it allows a single attempt.
- MMLU: Massive Multitask Language Understanding.
- AIME: American Invitational Mathematics Examination.
- PIQA: Physical Interaction Question Answering.
- SWE-bench: software-engineering benchmark from real GitHub issues.
- CodeForces: competitive-programming platform whose problems are used as a coding benchmark, commonly for reasoning models.
A benchmark measures one slice. Treating the score as a global verdict is the most common reading error.
Match the benchmark category to the capability you care about. Read the metric. Note the saturation status.
Benchmark scores can rise faster than real capability. The number is evidence; it is not the whole story.