Cheatsheet: Why benchmarks can mislead

A benchmark measures one slice.
The score speaks to that slice and that slice only.
Categorize what each benchmark measures BEFORE chasing the score.
| Category | Probes | Specific benchmarks | Grading discipline |
| --- | --- | --- | --- |
| Knowledge | Pretraining retention | MMLU (57 subjects, ~14k questions) | 4-option multiple-choice |
| Reasoning, math | Multi-step math thinking | AIME (US olympiad qualifier) | Integer answer (000 to 999), checked against a fixed answer key |
| Reasoning, math | Grade-school math | GSM8K (~8,500 problems, saturated) | Numeric answer, checked against a fixed answer key |
| Reasoning, common sense | Physical-world understanding | PIQA (~20k examples) | 2-option multiple-choice |
| Coding | Function completion | HumanEval (164 problems, saturated) | Run unit tests |
| Coding | Competitive programming | CodeForces (rating system) | Test cases + rating compared to humans |
| Coding | Real-codebase patches | SWE-bench (real GitHub issues, current frontier) | Project’s existing test suite |

Why benchmark scores can rise faster than real capability

| Reason | What goes wrong |
| --- | --- |
| Training on benchmark-shaped data | Pretraining corpora include benchmark-similar content; “score went up” can be “saw more similar examples” instead of “got smarter” |
| Format constraints | Multiple-choice is easier than open-ended; 80% on MMLU is not 80% on free-form Q&A in the same domain |
| Single-axis measurement | Each benchmark covers one slice; high on one ≠ high on all |
| Saturation | At 95%+, score differences are mostly noise |
| Differential leakage | Newer benchmarks have less leakage than older ones; freshness vs training cutoff matters |
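
To make the saturation point concrete, here is a minimal sketch of the sampling noise on a benchmark score. It assumes each problem is an independent pass/fail trial and that the two models' scores are independent samples (both are simplifications), and it uses the normal-approximation standard error. The function names and the 96% vs 95% scores are hypothetical, chosen only for illustration.

```python
import math


def score_stderr(accuracy: float, n_problems: int) -> float:
    """Standard error of an accuracy estimate under a simple binomial model."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


def gap_is_within_noise(acc_a: float, acc_b: float, n_problems: int,
                        z: float = 1.96) -> bool:
    """True if the score gap is smaller than z standard errors of the difference.

    Assumption: the two scores are treated as independent estimates.
    """
    se_diff = math.sqrt(score_stderr(acc_a, n_problems) ** 2
                        + score_stderr(acc_b, n_problems) ** 2)
    return abs(acc_a - acc_b) < z * se_diff


# Hypothetical: two models at 96% and 95% on a 164-problem benchmark.
print(gap_is_within_noise(0.96, 0.95, 164))  # True: a 1-point gap is inside the noise
```

With only 164 problems, one percentage point is less than two problems, so a gap of that size near the ceiling says little about which model is stronger. This sketch covers sampling noise only; prompt sensitivity and label errors add further uncertainty.
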
1. What's the benchmark CATEGORY? (knowledge, reasoning, coding, etc.)
2. What's the METRIC? (Pass@K with what K, accuracy, exact match)
3. Is the benchmark SATURATED? (95%+ across the field = noise floor)
4. Is the score a HEADLINE or REPRESENTATIVE? (look for the full table)
5. TRAINING-CUTOFF OVERLAP? (a benchmark newer than the training cutoff = cleaner signal)
6. APPLICATION MATCH? (does it look like your use case?)
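
One way to apply the six questions consistently is to record each reported score as a small structured claim and list whichever questions remain unanswered. The sketch below is only an illustration: the `BenchmarkClaim` class, its field names, and `open_questions` are hypothetical and not part of any evaluation library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchmarkClaim:
    """One reported score; a field left as None is a question still unanswered."""
    benchmark: str
    category: Optional[str] = None           # 1. knowledge, reasoning, coding, ...
    metric: Optional[str] = None             # 2. e.g. "Pass@1", "accuracy"
    saturated: Optional[bool] = None         # 3. is the field clustered at 95%+?
    full_table_seen: Optional[bool] = None   # 4. headline or representative?
    postdates_cutoff: Optional[bool] = None  # 5. benchmark newer than training cutoff?
    matches_use_case: Optional[bool] = None  # 6. does it resemble your application?


def open_questions(claim: BenchmarkClaim) -> List[str]:
    """Return the names of checklist fields that have not been filled in."""
    return [name for name, value in vars(claim).items() if value is None]


claim = BenchmarkClaim(benchmark="SWE-bench Verified",
                       category="coding", metric="Pass@1")
print(open_questions(claim))
# ['saturated', 'full_table_seen', 'postdates_cutoff', 'matches_use_case']
```
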
"Our new model scored 47% Pass@1 on SWE-bench Verified."
  Pass@1    → metric
  SWE-bench → benchmark
  Verified  → benchmark variant (Verified ≠ original SWE-bench)
GOOD: category named (coding), metric explicit (Pass@1), variant explicit
ASK: what was Pass@5 or Pass@10? what was the temperature?
was the benchmark in the training data?
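
The question about Pass@5 or Pass@10 matters because Pass@K can only grow as K grows. A minimal sketch follows, assuming the commonly used unbiased estimator introduced with HumanEval: draw n samples per problem, count the c that pass, and estimate the chance that at least one of k drawn samples passes. The function name and the sample counts are illustrative and do not come from any reported result.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 20 samples drawn, 5 of them correct.
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(20, 5, k):.3f}")
# pass@1 = 0.250
# pass@5 = 0.806
# pass@10 = 0.984
```

The same model looks like a 25% system or a 98% system depending on K, which is why a score quoted without K cannot be compared with anything. Benchmark-level Pass@K is the average of this per-problem estimate over all problems.
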
DON'T: average across benchmarks → "model A: 84.3, model B: 82.1"
DO: table by category, side by side
→ "A is stronger on coding (47% vs 38% on SWE-bench)"
→ "B is stronger on knowledge (88% vs 85% on MMLU)"
→ which matters for your application?
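
The contrast between averaging and reading by category can be sketched in a few lines. This uses the two illustrative score pairs from the example above (SWE-bench 47% vs 38%, MMLU 88% vs 85%); the dictionary layout and function names are assumptions made for the sketch, and it assumes one benchmark per category.

```python
# Illustrative scores taken from the example above, in percent.
scores = {
    "A": {"SWE-bench": 47.0, "MMLU": 85.0},
    "B": {"SWE-bench": 38.0, "MMLU": 88.0},
}
category = {"SWE-bench": "coding", "MMLU": "knowledge"}


def naive_average(model: str) -> float:
    """The DON'T: collapse every benchmark into a single number."""
    values = scores[model].values()
    return sum(values) / len(values)


def leader_by_category() -> dict:
    """The DO: report which model leads on each capability axis."""
    return {
        cat: max(scores, key=lambda m: scores[m][bench])
        for bench, cat in category.items()
    }


print({m: naive_average(m) for m in scores})  # {'A': 66.0, 'B': 63.0}
print(leader_by_category())                   # {'coding': 'A', 'knowledge': 'B'}
```

The average declares A the winner and hides that B leads on knowledge, which is exactly the information a knowledge-heavy application would need.
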

| Pitfall | Reality |
| --- | --- |
| “Average across all benchmarks = overall model quality.” | No. Averages flatten capability-by-axis information. Read the table. |
| “Highest score on MMLU = best model.” | Only on knowledge. MMLU is approaching saturation; small differences are noise. |
| “85% on AIME, period.” | Incomplete. Pass at what K? Temperature? Without those, the number doesn’t fully parse. |
| “Big benchmark improvement = big capability improvement.” | Sometimes. Sometimes it’s training-data leakage. The honest answer is usually a mix. |
| “If it’s not benchmarked, it doesn’t matter.” | Most real applications aren’t directly benchmarked. Check whether the benchmarks resemble your use case before drawing conclusions. |

| Capability you care about | Benchmark to look at first |
| --- | --- |
| Factual recall, breadth | MMLU |
| Multi-step math reasoning | AIME (hard) or GSM8K (easy) |
| Common-sense reasoning | PIQA |
| Code completion / small functions | HumanEval (note: saturated) |
| Real-codebase fixes | SWE-bench Verified |
| Competitive programming skill | CodeForces (with rating context) |
| Real-world deployment fit | None directly. Synthetic benchmarks rarely match real use. |

Don't compare models across all benchmarks at once.
Categorize what each benchmark measures FIRST.
Then evaluate each capability axis SEPARATELY.
The same discipline that works for debugging tool-use failures
(next lesson's territory) works for reading benchmarks.
  • Knowledge benchmark: tests the model’s ability to recall and compose facts. MMLU is the prime example.
  • Reasoning benchmark: tests multi-step thinking. AIME, GSM8K, PIQA.
  • Coding benchmark: tests code generation. HumanEval, CodeForces, SWE-bench.
  • Saturation: when models cluster at the top of a benchmark (typically 95%+); small score differences are noise.
  • Pass@K: probability that at least one of K sampled attempts is correct. Pass@1 is the most stringent, since the model gets a single attempt.
  • MMLU: Massive Multitask Language Understanding.
  • AIME: American Invitational Mathematics Examination.
  • PIQA: Physical Interaction Question Answering.
  • SWE-bench: software-engineering benchmark from real GitHub issues.
  • CodeForces: competitive-programming platform whose problems are used as a reasoning-model benchmark.

A benchmark measures one slice. Treating the score as a global verdict is the most common reading error.
Match the benchmark category to the capability you care about. Read the metric. Note the saturation status.
Benchmark scores can rise faster than real capability. The number is evidence; it is not the whole story.