Lesson: Why benchmarks can mislead

A model card claims “85% on MMLU.” Another claims “47% Pass@1 on SWE-bench.” A press release says the new model is “the first to exceed 90% on AIME.” Each number is doing real work in the announcement; each is also easy to over-read.

The previous lesson covered LLM-as-a-Judge: how the field evaluates outputs that have no single right answer. This lesson is about the other half of evaluation: benchmarks, the standardized tests with hardcoded answer keys that produce the headline numbers in every model release. Benchmarks are real tools and they tell you real things. They also have specific structural limits, and the field’s understanding of those limits has gotten sharper in the last few years.

By the end you will be able to read a benchmark claim and ask the right follow-up questions: what category of capability is this measuring, what does the metric mean, and what can it not tell you?

The frame the Stanford lecturer offers is worth keeping. When you read a benchmark, do not jump to “is this model good?” Instead, categorize what kind of error or capability the benchmark probes, and treat that as the only thing the score speaks to. A high MMLU score tells you the model retained a lot of pretraining knowledge. It does not tell you the model can reason through a SWE-bench task, follow a multi-step plan, or refuse a malicious prompt. Each benchmark is a slice; treating any one of them as a global verdict is the most common mistake.

The lecturer’s broader phrasing: be methodical about categorizing kinds of errors before chasing them. The same applies to capabilities. Categorize what each benchmark measures before treating any number as a global signal of model quality.

The lecturer groups today’s benchmarks into a small set of categories, each probing a different capability axis.

Knowledge benchmarks test whether a model can return facts it should have learned during pretraining. The prime example is MMLU (Massive Multitask Language Understanding): 57 subjects, from law to medicine to everyday life, formatted as multiple-choice questions with four options. The model picks one. A correct answer means the model retained that fact (or correctly inferred it from related facts).

What MMLU measures: how well pretraining retained broad-domain knowledge. What it does not measure: reasoning, multi-step problem-solving, tool use, anything specific to a deployed application. A model with strong MMLU has read a lot and remembered a lot. That is one capability among many.

The multiple-choice format is the load-bearing detail. It exists because it is easy to grade automatically (extract the chosen letter, compare to the answer key) and avoids the overhead of LLM-as-a-Judge (LaaJ). The cost: real-world use is rarely multiple-choice, so the benchmark’s format differs from how the model is actually used.
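The automatic grading described above can be sketched in a few lines. This is a minimal illustration, not any benchmark's official harness: it assumes the model was prompted to finish with a line of the form "Answer: X", and the names `extract_choice` and `accuracy` are made up for this example.

```python
import re
from typing import List, Optional

# Assumption: the prompt asked the model to end with "Answer: <letter>".
ANSWER_PATTERN = re.compile(r"Answer:\s*\(?([A-D])\)?")


def extract_choice(response: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: List[str], answer_key: List[str]) -> float:
    """Fraction of responses whose extracted letter matches the key."""
    if not answer_key or len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must be non-empty and aligned")
    # An unparseable response extracts to None and therefore counts as wrong.
    correct = sum(extract_choice(r) == k for r, k in zip(responses, answer_key))
    return correct / len(answer_key)


responses = ["Let me think. Answer: B", "Answer: (C)", "I am not sure."]
answer_key = ["B", "C", "A"]
print(accuracy(responses, answer_key))  # 2 of 3 correct
```

Note how much rides on the extraction rule: a model that knows the answer but ignores the output format scores zero on that item, which is one way format constraints shape the number.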

Reasoning benchmarks ask the model to think through a problem before producing the answer. Math problems, multi-step logic, commonsense reasoning. The model has to do work, not just recall.

AIME (American Invitational Mathematics Examination): a US qualifier exam for the math olympiad. The lecturer flags it as a “very hard test” of high-school-level math. Every answer is an integer from 000 to 999, a three-digit format that grades cleanly. Reasoning models score substantially better on AIME than standard LLMs; the gap is one of the cleanest signals that reasoning models add capability. AIME, alongside GPQA (the graduate-level Q&A reasoning benchmark), is the 2026 floor: the minimum difficulty at which serious frontier comparisons happen.
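To see why a three-digit answer format grades cleanly, here is a minimal sketch of an exact-match grader. It assumes the final answer has already been isolated from the model's reasoning; `parse_aime_answer` and `is_correct` are hypothetical helper names, not part of any published harness.

```python
from typing import Optional


def parse_aime_answer(text: str) -> Optional[int]:
    """Parse a final answer as an integer in 0..999, or return None."""
    cleaned = text.strip()
    # Accept only 1 to 3 ASCII digits, so "042" and "42" parse to the same value.
    if not (cleaned.isascii() and cleaned.isdigit()) or len(cleaned) > 3:
        return None
    return int(cleaned)


def is_correct(model_answer: str, reference: int) -> bool:
    """Exact match on the integer value; unparseable answers are wrong."""
    parsed = parse_aime_answer(model_answer)
    return parsed is not None and parsed == reference


print(is_correct("042", 42))       # True
print(is_correct("about 42", 42))  # False
```

There is no partial credit and no judge model involved: the answer either equals the key or it does not.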

GSM8K (Grade School Math 8K): about 8,500 grade-school word problems. Easier than AIME. It is effectively saturated by 2026 frontier models and is best read as a historical baseline rather than a current ranking signal. If a 2026 announcement leads with a GSM8K number, that is almost a tell that the model is not competitive on the harder benchmarks where the field has moved.

PIQA (Physical Interaction Question Answering): commonsense reasoning grounded in everyday physical understanding. Two-option multiple-choice (A or B). The lecturer’s example: “How do I find something I lost on the carpet? Vacuum with a solid seal vs vacuum with a hairnet?” Solid seal blocks airflow; hairnet works. The model has to know how a vacuum works at the level a child does. About 20,000 examples.

What reasoning benchmarks measure: the model’s ability to combine knowledge with multi-step thinking. What they do not measure: whether the reasoning is actually grounded (the model can produce confident wrong reasoning), whether it generalizes beyond the benchmark distribution, or whether it works on tasks that mix reasoning with tool use.

Coding benchmarks: HumanEval, CodeForces, SWE-bench

Coding benchmarks ask the model to write or fix code. Each is checked mechanically against test cases. This is the cleanest verifiable-reward setup the lecturer has flagged repeatedly as the place where modern reasoning models shine.

HumanEval: 164 small coding problems, each a function signature with a docstring. The model writes the function body. Correctness is checked via the included unit tests. Mostly saturated.
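The shape of a HumanEval-style check can be sketched as follows. The toy problem below is invented for illustration and is not taken from the benchmark, and `passes_tests` is a hypothetical helper. Real harnesses run model output in a sandbox with timeouts; calling `exec` on untrusted code, as done here, is unsafe outside a demo.

```python
def passes_tests(prompt: str, completion: str, test_code: str) -> bool:
    """Define the function from prompt + completion, then run the tests.

    Returns True only if neither step raises an exception.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)  # builds the candidate function
        exec(test_code, namespace)            # assertions raise on failure
    except Exception:
        return False
    return True


# Hypothetical problem: a signature plus docstring, as in the format above.
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
GOOD_COMPLETION = "    return a + b\n"
BAD_COMPLETION = "    return a - b\n"
TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

print(passes_tests(PROMPT, GOOD_COMPLETION, TESTS))  # True
print(passes_tests(PROMPT, BAD_COMPLETION, TESTS))   # False
```

The score is only as strong as the tests: a completion that passes these two assertions but mishandles other inputs would still count as correct.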

CodeForces: competitive programming problems with a rating system that lets you compare a model’s effective skill to a human contestant’s rating. Useful for high-end frontier models where saturation hasn’t hit.

SWE-bench: real GitHub issues from popular Python repositories. The model produces a code patch that fixes the bug; correctness via the project’s test suite. The lecturer notes the construction process: filter PRs that introduced both a fix and a new test, assume the test was failing before and passing after, run the model’s patch and check whether the introduced test now passes. Current frontier benchmark.
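The construction process the lecturer describes can be sketched with plain dictionaries of test results. This follows the lesson's description only and checks just the tests the fix introduced; the test names and both helper functions are hypothetical.

```python
from typing import Dict, List


def fail_to_pass(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Tests that pass after the human fix but did not pass before it.

    A test missing from `before` (newly introduced by the PR) is treated
    as failing before, matching the assumption described in the lesson.
    """
    return sorted(t for t, ok in after.items() if ok and not before.get(t, False))


def patch_resolves(target_tests: List[str], model_run: Dict[str, bool]) -> bool:
    """The model's patch counts as a fix only if every target test passes."""
    return bool(target_tests) and all(model_run.get(t, False) for t in target_tests)


# Hypothetical results: test name -> passed?
before_fix = {"test_parse": True}
after_fix = {"test_parse": True, "test_issue_regression": True}
targets = fail_to_pass(before_fix, after_fix)

good_patch_run = {"test_parse": True, "test_issue_regression": True}
bad_patch_run = {"test_parse": True, "test_issue_regression": False}

print(targets)                                 # ['test_issue_regression']
print(patch_resolves(targets, good_patch_run)) # True
print(patch_resolves(targets, bad_patch_run))  # False
```

A patch that makes the introduced test pass is scored as resolved even if it differs completely from the human fix, which is why the benchmark says nothing about code quality beyond the test suite.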

What coding benchmarks measure: the model’s ability to produce working code in constrained settings. What they do not measure: full software-engineering practice (architectural decisions, code-review judgment, working with humans), open-ended creativity, anything beyond the test suite the benchmark validates against.

Why benchmark scores can rise faster than real capability

The most useful intuition the lecturer offers is that benchmark scores can rise faster than real-world capability for a few specific reasons. Knowing them is what separates “reads numbers naively” from “reads numbers carefully.”

Training on benchmark-shaped data. Modern pretraining corpora are enormous and weakly filtered. Test data leaks happen. Even when the exact benchmark questions are excluded, similar questions and the same content templates often appear. A model that scored 60% on a benchmark a year ago and scores 90% today might be substantially better, or it might just have seen more benchmark-shaped training data.
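One common way to look for leakage is to measure word n-gram overlap between a benchmark item and training documents. The sketch below is a toy version under simplifying assumptions (whitespace tokenization, a tiny in-memory corpus, an arbitrary `n`); `ngrams` and `overlap_fraction` are illustrative names, not a standard API.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams in the text (whitespace tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(item: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear somewhere in the corpus."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


item = "what is the capital of france and why"
leaky_corpus = ["trivia night what is the capital of france answer paris"]
clean_corpus = ["a recipe for sourdough bread with rye flour"]

print(overlap_fraction(item, leaky_corpus, n=3))  # 4 of 6 trigrams overlap
print(overlap_fraction(item, clean_corpus, n=3))  # 0.0
```

A check like this only catches near-verbatim leaks. Paraphrased questions and shared content templates, the benchmark-shaped data described above, pass straight through it, which is part of why contamination is hard to rule out.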

Saturated benchmarks. When everyone scores above 95%, small differences are mostly noise. Reading “the new model scored 96.2% vs the old model’s 95.8%” is reading rounding error. HumanEval and GSM8K are now in this zone; SWE-bench Verified and AIME are not yet.
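A rough calculation shows why a 0.4-point gap near the ceiling is hard to distinguish from noise. The sketch below uses the binomial standard error of an accuracy and, as a simplifying assumption, treats the two scores as independent samples; a paired comparison on the same items would be more precise. The item count of 164 is HumanEval's size, used here only as an example.

```python
import math


def accuracy_standard_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy measured on n_items questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


def difference_within_noise(acc_a: float, acc_b: float, n_items: int,
                            z: float = 1.96) -> bool:
    """True if the gap is smaller than roughly a 95% noise band.

    Assumption: the two scores are treated as independent samples.
    """
    se_a = accuracy_standard_error(acc_a, n_items)
    se_b = accuracy_standard_error(acc_b, n_items)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(acc_a - acc_b) <= z * se_diff


# The lesson's example gap, on a 164-item benchmark.
print(difference_within_noise(0.962, 0.958, n_items=164))  # True
# A 60% -> 90% jump on the same benchmark is far outside the band.
print(difference_within_noise(0.60, 0.90, n_items=164))    # False
```

The smaller the benchmark and the closer the scores sit to 100%, the fewer individual questions separate two models, sometimes only one or two.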

Multiple-choice and other format constraints. A multiple-choice question is easier than open-ended. A 25% score on MMLU is the random-guessing baseline (4 options). Reaching 80% on multiple-choice doesn’t translate cleanly to reaching 80% on open-ended generation in the same domain.

Single-axis measurement. Each benchmark probes one capability. A model strong on reasoning benchmarks may be weak on tool-use benchmarks, or vice versa. Treating any single benchmark as a global verdict is the most common reading error.

Differential leakage. Some benchmarks are more contaminated than others. Newly-released benchmarks (or benchmark variants like SWE-bench Verified or AIME 2025) generally have less leakage than older ones. The freshness of the benchmark relative to the model’s training cutoff matters.

A practical reading checklist when you encounter benchmark numbers in a model card or paper:

  • What is the benchmark category? Knowledge, reasoning, coding, commonsense, tool use, something else? Match the claim to the capability it actually probes.
  • What is the metric? Pass@K (with what K), accuracy, exact match, BLEU, LaaJ score? Each metric has different reading rules. Pass@1 and Pass@10 can differ substantially (the Phase 6 reasoning models lesson covered this).
  • Is the benchmark saturated? If yes, the difference between the claimed score and competitors’ scores is probably noise.
  • Is this the headline metric or a representative one? Papers often report numbers across many benchmarks; the headline tends to be the one where the model looks best. Look for the full table.
  • How does the benchmark’s release date compare with the model’s training cutoff? A newer benchmark run against an older model gives a cleaner signal. An older benchmark run against a newer model carries leakage risk.
  • Does the benchmark match how the model would be used in practice? If you care about an application, check whether any of the cited benchmarks resemble that use case. They often do not.

The lecturer’s framing again: a benchmark is one slice. Treat the score as evidence about that slice and that slice only.

Three things to hold onto.

  • Benchmark numbers are real but narrow. A high MMLU score is real evidence that the model has substantial pretrained knowledge. It is not evidence that the model can do your specific job. When choosing a model for an application, the right question is “what benchmark resembles my use case, and how does this model do on it?” not “what is the highest-scoring model on the most popular benchmark?”
  • Saturated benchmarks are no longer useful for ranking. If the field has converged on 95%+ on a benchmark, that benchmark is mostly historical at this point. Newer, harder benchmarks (SWE-bench Verified, AIME 2025, frontier-math, reasoning-specific evaluations) are the ones still moving. Read announcements with that filter.
  • Training on benchmark-shaped data is a real mechanism, not a conspiracy theory. It does not require malice; it just requires the benchmark’s content to resemble what’s in pretraining corpora. The honest takeaway: benchmark numbers reflect a mix of real capability and benchmark-shaped familiarity, and disentangling those two is hard.

Three mistakes worth dodging.

Treating one benchmark as a global verdict. A model that scores 90% on MMLU and 50% on SWE-bench is not “almost twice as good at MMLU as SWE-bench.” It is good at recall and weaker at coding. The numbers measure different capabilities and should not be averaged.

Ignoring metric details. “85% on AIME” without knowing the K (Pass@1? Pass@10? majority-vote-at-K?) is incomplete information. “47% Pass@1 on SWE-bench Verified” is much more readable because the metric and benchmark variant are explicit. Train yourself to ask which metric and which variant before drawing conclusions.
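To make the dependence on K concrete, here is a sketch of a widely used Pass@K estimator: given n sampled solutions for a problem, c of which pass, it estimates the chance that at least one of k randomly chosen samples passes. The sample counts below are invented, and the lesson does not say which estimator any particular model card used.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are wrong, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples drawn, 3 of them pass the tests.
print(pass_at_k(n=10, c=3, k=1))   # about 0.3
print(pass_at_k(n=10, c=3, k=5))   # about 0.917
print(pass_at_k(n=10, c=3, k=10))  # 1.0
```

The same model on the same problem reads as roughly 30% or 100% depending only on K. A benchmark-level score averages this quantity over all problems, so a headline number without its K cannot be compared to anything.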

Underweighting the benchmark-as-training-data problem. When a model dramatically improves on a popular benchmark, the question to ask first is not “what new capability was added?” but “did the training data include benchmark-shaped content?” Sometimes both are true. Sometimes only the second is. The two are usually not visibly distinguishable from outside the lab.

  • Benchmarks measure narrow capabilities, not global model quality. Each benchmark probes one slice; treating any single number as a global verdict is the most common reading error.
  • The major categories are knowledge (MMLU), reasoning (AIME, GSM8K, PIQA), coding (HumanEval, CodeForces, SWE-bench), and tool use. Each has its own grading discipline (multiple-choice, three-digit answers, test cases) that matters for what the score actually says.
  • Benchmark scores can rise faster than real capability for structural reasons: training on benchmark-shaped data, saturation, format constraints, single-axis measurement, and differential leakage.
  • The reading checklist: category, metric, saturation status, headline-vs-representative, training-cutoff overlap, application match. Use it before drawing conclusions from a number.
  • The lecturer’s methodology: be methodical about categorizing what each benchmark measures before chasing scores. The same discipline that works for debugging tool-use failures (Phase 7 lesson 3 territory) works for reading benchmarks.

A benchmark measures one slice. Treating the score as a global verdict is the most common reading error.
Match the benchmark category to the capability you care about. Read the metric. Note the saturation status.
Benchmark scores can rise faster than real capability. The number is evidence; it is not the whole story.