References: Why benchmarks can mislead

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 8, LLM Evaluation):
    see course site at https://cme295.stanford.edu/ for the lecture URL
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the benchmarks section of Stanford CME 295 Lecture 8:
the benchmark-category framing and knowledge benchmarks (MMLU)
[01:23:34-01:25:00], reasoning benchmarks (AIME, PIQA)
[01:28:00-01:32:00], and coding benchmarks (SWE-bench, HumanEval,
CodeForces) [01:32:00-01:36:00]. It also draws on the lecturer's
methodological framing, "be methodical about categorizing kinds of
errors and capabilities before chasing scores" [01:23:00]. Clawdemy
provides original notes, summaries, and quizzes derived from this
material for educational purposes. All rights to the original lectures
remain with Stanford and the instructors.

Benchmark papers (in order of mention)

The papers behind each major benchmark covered in this lesson.

“Measuring Massive Multitask Language Understanding”, Hendrycks et al., 2021. The MMLU paper. Introduces the benchmark and its 57 subjects, which span humanities, STEM, social sciences, and other fields (often loosely rounded to “about 60” in the popular press). Section 3 covers dataset construction and is the source of the canonical “MMLU has 57 subjects” claim. Worth reading for its framing of what a knowledge benchmark is for.
“Training Verifiers to Solve Math Word Problems” (GSM8K), Cobbe et al., 2021. The GSM8K paper. About 8,500 grade-school math word problems. Section 2 has the dataset details.
“Evaluating Large Language Models Trained on Code” (HumanEval), Chen et al., 2021. The paper that introduces the HumanEval benchmark (164 hand-written problems, each with unit tests). Section 2 formally introduces pass@k.
“SWE-bench: Can Language Models Resolve Real-World GitHub Issues?”, Jimenez et al., 2024. The SWE-bench paper. Section 2 is the load-bearing part: it describes the construction process (keep PRs that introduced both a fix and tests, then run the model’s patch against those tests). SWE-bench Verified is a human-validated subset that is now the more-cited variant.
“PIQA: Reasoning about Physical Commonsense in Natural Language”, Bisk et al., 2020. The PIQA paper. Common-sense physical reasoning, about 20,000 examples in 2-option format.
AIME is administered by the Mathematical Association of America (MAA). The relevant LLM-evaluation context is in the DeepSeek-R1 paper and other reasoning-model papers that use AIME 2024 and AIME 2025 as benchmarks. AIME problems and answer keys are public, and scoring is straightforward: each answer is an integer from 000 to 999, and a response counts as correct only if it matches exactly.
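
For readers who want the pass@k metric from the HumanEval entry in concrete form, here is a minimal Python sketch of the unbiased estimator described in Chen et al.: draw n samples per problem, count the c samples that pass the unit tests, and estimate pass@k as 1 - C(n-c, k) / C(n, k). The function names and the example counts are illustrative assumptions, not code from the paper, and this version uses exact binomial coefficients rather than a numerically optimized form.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains
    # no passing sample. comb() returns 0 when n - c < k, so the
    # estimate is 1.0 whenever every size-k subset must include a pass.
    p_all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - p_all_fail


def benchmark_pass_at_k(counts, k):
    """Average the per-problem estimate over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


# Hypothetical (n, c) counts for three problems, for illustration only.
example_counts = [(10, 0), (10, 3), (10, 10)]
print(benchmark_pass_at_k(example_counts, k=1))
```

Estimating from n > k samples, rather than drawing exactly k samples and checking whether any passes, reduces the variance of the reported score.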

Benchmark-contamination literature

The structural concern that benchmark scores can rise faster than real capability is being studied empirically.

“Investigating Data Contamination in Modern Benchmarks for Large Language Models”, Deng et al., 2023. Investigates the evidence for benchmark contamination across major datasets. Useful as a primary source for the “training on benchmark-shaped data” claim in this lesson.
“A Careful Examination of Large Language Model Performance on Grade School Arithmetic”, Zhang et al., 2024. Specifically studies whether GSM8K performance reflects capability or contamination by constructing an analogous-but-fresh dataset. Worth reading after the lesson for the empirical methodology of measuring contamination.
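
The fresh-dataset idea in the entry above can be made concrete with a small sketch: score the same model on the public benchmark and on a newly written, analogous set, then compare accuracies. This is a simplified illustration, not the paper's methodology; the function names, the per-item boolean input format, and the normal-approximation standard error are all assumptions made here.

```python
from math import sqrt


def accuracy(results):
    """Fraction of True entries in a list of per-item correctness flags."""
    return sum(results) / len(results)


def contamination_gap(public_results, fresh_results):
    """Return (gap, standard_error) for public accuracy minus fresh accuracy.

    A large positive gap relative to its standard error is consistent
    with contamination or overfitting to the public set, but it does
    not prove either. Assumes independent items (an assumption).
    """
    p_pub = accuracy(public_results)
    p_fresh = accuracy(fresh_results)
    gap = p_pub - p_fresh
    se = sqrt(
        p_pub * (1 - p_pub) / len(public_results)
        + p_fresh * (1 - p_fresh) / len(fresh_results)
    )
    return gap, se


# Hypothetical per-item results, for illustration only.
public = [True] * 90 + [False] * 10
fresh = [True] * 80 + [False] * 20
gap, se = contamination_gap(public, fresh)
print(f"gap={gap:.3f}, se={se:.3f}")
```

The comparison is only meaningful if the fresh set matches the public one in difficulty and format; otherwise the gap mixes contamination with distribution shift.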

Going deeper

A short list, chosen for durability.

“The False Promise of Imitating Proprietary LLMs”, Gudibande et al., 2023. Documents how surface-level benchmark improvements can fail to translate to real-world capability. Relevant to the “scores can rise faster than capability” framing.
“Holistic Evaluation of Language Models (HELM)”, Liang et al., 2022. Stanford’s framework for evaluating LLMs across multiple capability axes simultaneously, instead of single-benchmark headlines. The framework is one model for what “categorize before chasing” looks like at scale.

Adjacent topics

The “benchmark train-on-test” problem. The boundary between memorization and generalization is fuzzy when the test data is included in the training data. Search terms: “data contamination,” “train-test leakage in LLM benchmarks,” “benchmark hygiene.” Active research; the field is increasingly transparent about which benchmarks were excluded from training.
Saturation as a benchmark life-cycle issue. When all frontier models score 95%+, the benchmark stops being useful for ranking and becomes a baseline check. New, harder benchmarks (SWE-bench Verified, FrontierMath, AIME 2025, harder reasoning evaluations) are released regularly to replace saturated ones. The cycle continues.
Real-world deployment evaluation. Benchmarks are necessary but not sufficient for deciding whether a model is fit for an application. Production teams typically build internal evaluations that resemble their actual use case. Search terms: “task-specific eval suites,” “internal LLM benchmarking,” “domain-specific evaluation.” Most of this is in vendor blogs and team writeups rather than academic papers.

Stanford CME 295 cheatsheet

Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The benchmarks section covers the same material in their dense visual style. Worth using as a study reference after this lesson.

Community discussion

None selected for this lesson. Vendor blog posts and the academic literature are the better entry points right now. Durable community references will be added at a future quarterly review if any consolidate.