Cheatsheet: Evaluation
Benchmark formats
| Format | Examples | Strength | Weakness |
|---|---|---|---|
| Multiple-choice | MMLU, ARC, HellaSwag, TruthfulQA | Cheap, reproducible | Contamination; narrow construct |
| Executable | HumanEval, GSM8K, MATH | Hard to fake (run-and-check) | Public test sets can still be discussed online and leak into training data |
| Instruction-following | IFEval | Checkable format adherence | Doesn’t cover knowledge / reasoning |
| Open-ended | Chatbot Arena, LLM-as-judge | Closest to real usability | No automatic ground truth |
Four reasons evaluation is hard
| Reason | What it is | Symptom |
|---|---|---|
| Construct validity | Benchmark vs the capability you actually want | High MMLU != “intelligent” |
| Contamination | Model saw the test set in training | Memorization scores high |
| Format sensitivity | Harness/prompt/decoding swings scores | “MMLU 65 vs 67” may not be comparable |
| Open-ended scoring | No automatic ground truth | Needs human / LLM judges |
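
To make the format-sensitivity row concrete, here is a minimal sketch in which the same model outputs are scored with two different answer parsers. The outputs, gold labels, and both parser functions are hypothetical and invented for illustration; they are not taken from any real harness.

```python
import re

# Hypothetical raw model outputs for four multiple-choice questions.
outputs = ["B", "The answer is C.", "(A)", "I think it's D, because..."]
gold = ["B", "C", "A", "D"]

def strict_parse(text):
    """Accept only an output that is exactly one letter A-D."""
    text = text.strip()
    return text if text in {"A", "B", "C", "D"} else None

def lenient_parse(text):
    """Take the first standalone uppercase A-D letter anywhere in the output."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else None

def accuracy(parser):
    """Fraction of outputs whose parsed answer equals the gold label."""
    hits = sum(parser(o) == g for o, g in zip(outputs, gold))
    return hits / len(gold)

strict_acc = accuracy(strict_parse)    # only the bare "B" is accepted
lenient_acc = accuracy(lenient_parse)  # every answer letter is recovered
print(f"strict={strict_acc:.2f} lenient={lenient_acc:.2f}")
```

The model's answers are identical in both runs; only the parser differs. This is why a score is hard to compare unless the harness, prompt, and parser are pinned.
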
Contamination defenses
- Private or freshly-generated benchmarks
- Executable tasks (run code, check answer): memorizing a single answer token isn’t enough to pass
- Skepticism toward old, widely-discussed public benchmark headlines
- Look for recent, harder, private variants
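
The run-and-check idea behind executable tasks can be sketched as follows. The function `passes_tests` and the toy `add` task are hypothetical names made up for this example. A real harness would run candidates in a sandbox with timeouts, which this sketch omits.

```python
def passes_tests(candidate_source, test_source):
    """Return True if the candidate code runs and every test assertion holds.

    WARNING: exec() on untrusted model output is unsafe. This is an
    illustration only; real harnesses isolate execution.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        exec(test_source, namespace)       # run the assertions against it
    except Exception:
        return False
    return True

# Hypothetical task: implement add(a, b).
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

correct = "def add(a, b):\n    return a + b\n"
# A "memorized" candidate that returns the answer to one well-known test.
memorized = "def add(a, b):\n    return 5\n"

correct_ok = passes_tests(correct, tests)
memorized_ok = passes_tests(memorized, tests)
print(correct_ok, memorized_ok)
```

The memorized candidate satisfies the first assertion but fails the second, so recalling one expected answer does not pass a task that is checked by running it.
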
Open-ended scoring: two paths
| Path | Signal | Cost | Bias |
|---|---|---|---|
| Pairwise human (Arena-style, Elo) | High | Expensive | None systematic (with enough raters) |
| LLM-as-judge | Medium | Cheap | Judge prefers its own style; blind spots |
Useful within a project: LLM-as-judge for fast iteration. Cross-team rankings: human preference.
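
As a rough illustration of how pairwise preferences can be aggregated into Elo ratings, the sketch below applies the standard Elo update to a handful of votes. The starting rating of 1000, the K-factor of 32, the model names, and the votes are all assumptions chosen for the example. Real leaderboards may fit ratings differently.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings, winner, loser, k=32):
    """Shift rating points from loser to winner after one pairwise vote."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_win)  # larger gain for an unexpected win
    ratings[winner] += delta
    ratings[loser] -= delta

# Hypothetical votes: each tuple is (preferred output's model, other model).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)

print(ratings)
```

Sequential Elo depends on the order of votes, so with few comparisons the ratings are noisy. This fits the table's caveat that human pairwise signal is only unbiased with enough raters.
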
The layered pragmatic stack
1. Held-out perplexity (smoke test during/after training)
2. Multiple-choice suite (MMLU, ARC, HellaSwag, TruthfulQA)
3. Executable benchmarks (HumanEval, GSM8K, MATH, harder)
4. Instruction-following (IFEval and friends)
5. Open-ended preference (Arena pairwise; LLM-as-judge supplement)
6. Domain-specific evals (your data, your tasks)

No single score is the model. The portfolio is.
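
For the first layer of the stack, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below assumes you already have per-token natural-log probabilities from a model on held-out text. The `perplexity` helper and the toy inputs are illustrative and not tied to any particular library.

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), using natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Toy check: a model that assigns probability 1/4 to every token
# behaves like a uniform guess over 4 options.
uniform_log_probs = [math.log(0.25)] * 8
ppl = perplexity(uniform_log_probs)
print(round(ppl, 6))
```

As a smoke test, the useful signal is the trend: held-out perplexity that stops falling or jumps during training suggests a problem before any benchmark is run.
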
Reading a benchmark headline
A “state-of-the-art on X” claim is stronger when:
- X is executable or freshly generated (contamination-resistant).
- The model also does well on multiple other benchmarks of different formats.
- The harness is pinned and reproducible.
- The result is reported alongside open-ended evidence (Arena Elo or human preference).

The claim is weaker when X is an old, public multiple-choice benchmark with widely discussed solutions.
Words to use precisely
- Construct validity: gap between benchmark and the underlying capability.
- Contamination: model saw test items during training.
- Format sensitivity: harness/prompt/parser choices change scores.
- Pairwise preference: judges compare two outputs; Elo aggregates.
- LLM-as-judge: another model scores outputs; cheap, biased.
Source
- Stanford CS336, Lecture 12 (Evaluation), by Hashimoto and Liang.
cs336.stanford.edu. Independent structural mirror in original prose; see references.