Why benchmarks can mislead: cheatsheet

The one idea that matters

A benchmark measures one slice.
The score speaks to that slice and that slice only.
Categorize what each benchmark measures BEFORE chasing the score.

The major categories

Category	Probes	Specific benchmarks	Grading discipline
Knowledge	Pretraining retention	MMLU (57 subjects, ~14k questions)	4-option multiple-choice
Reasoning, math	Multi-step math thinking	AIME (US olympiad qualifier)	Integer answer (000 to 999), checked against a fixed key
Reasoning, math	Grade-school math	GSM8K (~8,500 problems, saturated)	Numeric answer, checked against a fixed key
Reasoning, common sense	Physical-world understanding	PIQA (~20k examples)	2-option multiple-choice
Coding	Function-completion	HumanEval (164 problems, saturated)	Run unit tests
Coding	Competitive programming	CodeForces (rating system)	Test cases + rating compared to humans
Coding	Real-codebase patches	SWE-bench (real GitHub issues, current frontier)	Project’s existing test suite

Why benchmark scores can rise faster than real capability

Reason	What goes wrong
Training on benchmark-shaped data	Pretraining corpora include benchmark-similar content; “score went up” can be “saw more similar examples” instead of “got smarter”
Format constraints	Multiple-choice is easier than open-ended; 80% on MMLU is not 80% on free-form Q&A in the same domain
Single-axis measurement	Each benchmark covers one slice; high on one ≠ high on all
Saturation	At 95%+, score differences are mostly noise
Differential leakage	Newer benchmarks have less leakage than older ones; freshness vs training cutoff matters
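To make the saturation row concrete, here is a minimal Python sketch of how much noise a small benchmark carries. It assumes each problem is an independent pass/fail trial (a plain binomial model) and treats the two models' scores as independent, which is a simplification; the function names are hypothetical, not from any evaluation library.

```python
import math


def score_standard_error(accuracy: float, n_problems: int) -> float:
    """Binomial standard error of an accuracy measured on n_problems items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


def gap_is_within_noise(acc_a: float, acc_b: float, n_problems: int,
                        z: float = 1.96) -> bool:
    """True if the gap between two scores is inside a ~95% noise band.

    Assumption: both scores come from the same number of problems and are
    treated as independent samples.
    """
    se_a = score_standard_error(acc_a, n_problems)
    se_b = score_standard_error(acc_b, n_problems)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(acc_a - acc_b) <= z * se_diff


if __name__ == "__main__":
    # HumanEval-sized benchmark: 164 problems, two models near the ceiling.
    print(round(score_standard_error(0.95, 164), 4))
    print(gap_is_within_noise(0.95, 0.97, 164))
```

Under these assumptions, a 95% vs 97% gap on 164 problems sits inside the noise band, so it should not be read as a ranking.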

The reading checklist

1. What's the benchmark CATEGORY?       (knowledge, reasoning, coding, etc.)
2. What's the METRIC?                   (Pass@K with what K, accuracy, exact match)
3. Is the benchmark SATURATED?          (95%+ across the field = noise floor)
4. HEADLINE or REPRESENTATIVE?          (look for the full table)
5. TRAINING-CUTOFF OVERLAP?             (benchmark newer than the cutoff = cleaner signal)
6. APPLICATION MATCH?                   (does it look like your use case?)

Reading a real-style claim

"Our new model scored 47% Pass@1 on SWE-bench Verified."
                          ↓         ↓         ↓
                          metric    benchmark variant (Verified ≠ original SWE-bench)

GOOD: benchmark named (SWE-bench, so the category is coding), metric explicit (Pass@1), variant explicit
ASK:  what was Pass@5 or Pass@10? what was the temperature?
       was the benchmark in the training data?
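Why the ASK line insists on K: the same set of samples can yield very different headline numbers depending on K. The sketch below uses the commonly cited unbiased Pass@K estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass. Whether a given claim was computed this way is an assumption; the claim above does not say.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K for one problem.

    n: samples generated, c: samples that passed, k: attempts allowed.
    Returns the probability that at least one of k samples, drawn without
    replacement from the n generated, is correct.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the result is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical problem: 10 samples generated, 3 of them correct.
    for k in (1, 5, 10):
        print(f"Pass@{k}: {pass_at_k(10, 3, k):.3f}")
```

With 3 correct samples out of 10, Pass@1 is 0.300 while Pass@5 is about 0.917, which is why a score quoted without its K does not fully parse.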

How to compare two models on benchmarks

DON'T: average across benchmarks → "model A: 84.3, model B: 82.1"
DO:    table by category, side by side
       → "A is stronger on coding (47% vs 38% on SWE-bench)"
       → "B is stronger on knowledge (88% vs 85% on MMLU)"
       → which matters for your application?
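The DO side can be sketched in a few lines of Python. The scores are the illustrative numbers from the example above; the data layout and function names are hypothetical, and the sketch assumes higher is better on every benchmark.

```python
from collections import defaultdict

# (category, benchmark, model, score in percent), taken from the example above.
SCORES = [
    ("coding", "SWE-bench", "A", 47.0),
    ("coding", "SWE-bench", "B", 38.0),
    ("knowledge", "MMLU", "A", 85.0),
    ("knowledge", "MMLU", "B", 88.0),
]


def table_by_category(scores):
    """Group scores as {(category, benchmark): {model: score}}."""
    table = defaultdict(dict)
    for category, benchmark, model, score in scores:
        table[(category, benchmark)][model] = score
    return dict(table)


def leaders_by_category(scores):
    """Return the top-scoring model for each (category, benchmark) row."""
    return {
        row: max(by_model, key=by_model.get)
        for row, by_model in table_by_category(scores).items()
    }


if __name__ == "__main__":
    for (category, benchmark), by_model in table_by_category(SCORES).items():
        cells = ", ".join(f"{m}={s:.0f}%" for m, s in sorted(by_model.items()))
        print(f"{category:<10} {benchmark:<10} {cells}")
    print(leaders_by_category(SCORES))
```

The output keeps one row per capability axis, so the "which matters for your application?" question is answered by picking a row, not by reading a blended average.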

Pitfalls to dodge

Pitfall	Reality
“Average across all benchmarks = overall model quality.”	No. Averages flatten capability-by-axis information. Read the table.
“Highest score on MMLU = best model.”	Only on knowledge. MMLU is approaching saturation; small differences are noise.
“85% on AIME, period.”	Incomplete. Pass at what K? Temperature? Without those, the number doesn’t fully parse.
“Big benchmark improvement = big capability improvement.”	Sometimes. Sometimes it’s training-data leakage. The honest answer is usually a mix.
“If it’s not benchmarked, it doesn’t matter.”	Most real applications aren’t directly benchmarked. Check whether the benchmarks resemble your use case before drawing conclusions.

A capability-axis cheat sheet

Capability you care about	Benchmark to look at first
Factual recall, breadth	MMLU
Multi-step math reasoning	AIME (hard) or GSM8K (easy)
Common-sense reasoning	PIQA
Code completion / small functions	HumanEval (note: saturated)
Real-codebase fixes	SWE-bench Verified
Competitive programming skill	CodeForces (with rating context)
Real-world deployment fit	None directly. Synthetic benchmarks rarely match real use.

The lecturer’s methodology

Don't compare models across all benchmarks at once.
Categorize what each benchmark measures FIRST.
Then evaluate each capability axis SEPARATELY.

The same discipline that works for debugging tool-use failures
(next lesson's territory) works for reading benchmarks.

Glossary

Knowledge benchmark: tests the model’s ability to recall and compose facts. MMLU is the prime example.
Reasoning benchmark: tests multi-step thinking. AIME, GSM8K, PIQA.
Coding benchmark: tests code generation. HumanEval, CodeForces, SWE-bench.
Saturation: when models cluster at the top of a benchmark (typically 95%+); small score differences are noise.
Pass@K: probability that at least one of K sampled attempts is correct. Pass@1 is the most stringent; the score can only rise as K grows.
MMLU: Massive Multitask Language Understanding.
AIME: American Invitational Mathematics Examination.
PIQA: Physical Interaction Question Answering.
SWE-bench: software-engineering benchmark from real GitHub issues.
CodeForces: competitive-programming platform whose problems are used as a coding benchmark; results are reported as a rating comparable to human competitors.

A benchmark measures one slice. Treating the score as a global verdict is the most common reading error.
Match the benchmark category to the capability you care about. Read the metric. Note the saturation status.
Benchmark scores can rise faster than real capability. The number is evidence; it is not the whole story.