Language model evaluation: cheatsheet

Benchmark formats

Format	Examples	Strength	Weakness
Multiple-choice	MMLU, ARC, HellaSwag, TruthfulQA	Cheap, reproducible	Contamination; narrow construct
Executable	HumanEval, GSM8K, MATH	Hard to fake (run-and-check)	Public test sets can still leak into training data
Instruction-following	IFEval	Checkable format adherence	Doesn’t cover knowledge / reasoning
Open-ended	Chatbot Arena, LLM-as-judge	Closest to real usability	No automatic ground truth
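A minimal sketch of the run-and-check idea behind executable benchmarks, assuming a hypothetical task ("implement add") and hypothetical candidate solutions; none of this comes from a real harness. It uses bare exec with no sandbox or timeout, which is only acceptable for trusted toy inputs. A real harness would isolate and time-limit the run.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate runs and every assertion in test_src holds.

    Assumption: inputs are trusted toy strings. There is no sandbox or timeout here.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function(s)
        exec(test_src, namespace)       # run the assertions against them
    except Exception:
        return False
    return True


# Hypothetical task and candidates, for illustration only.
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
correct = "def add(a, b):\n    return a + b"
memorized_wrong = "def add(a, b):\n    return 5"  # right on one case, wrong in general

candidates = (correct, memorized_wrong)
pass_rate = sum(passes_tests(c, tests) for c in candidates) / len(candidates)
print(pass_rate)
```

The second candidate returns a remembered answer and fails the second assertion, which is the sense in which run-and-check is hard to fake.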

Four reasons evaluation is hard

Problem	What it is	Symptom
Construct validity	Benchmark vs the capability you actually want	High MMLU != “intelligent”
Contamination	Model saw the test set in training	Memorized answers score high
Format sensitivity	Harness, prompt, and decoding choices swing scores	“MMLU 65 vs 67” may not be comparable
Open-ended scoring	No automatic ground truth	Needs human / LLM judges
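Format sensitivity can be illustrated with answer parsing alone. The sketch below scores the same three hypothetical model outputs with two hypothetical parsers (strict_parse and lenient_parse are made-up names, not taken from any real harness) and gets different accuracies from identical model behavior.

```python
import re

# Hypothetical raw outputs for three multiple-choice items, with gold answers.
outputs = ["B", "The answer is (C).", "Answer: A"]
gold = ["B", "C", "A"]


def strict_parse(text: str):
    """Accept only a bare letter A-D; anything else counts as no answer."""
    t = text.strip()
    return t if t in {"A", "B", "C", "D"} else None


def lenient_parse(text: str):
    """Take the last standalone letter A-D found anywhere in the text."""
    hits = re.findall(r"\b([A-D])\b", text)
    return hits[-1] if hits else None


def accuracy(parse) -> float:
    """Fraction of items where the parsed answer matches the gold letter."""
    return sum(parse(o) == g for o, g in zip(outputs, gold)) / len(gold)


print(accuracy(strict_parse), accuracy(lenient_parse))
```

Here the strict parser credits one item out of three and the lenient parser credits all three, so two reports of "the same benchmark" can differ without any change in the model.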

Contamination defenses

Private or freshly-generated benchmarks
Executable tasks (run code, check answer): memorizing a one-token answer isn’t enough; the output has to actually run and pass
Skepticism toward old, widely-discussed public benchmark headlines
Look for recent, harder, private variants
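One common way to probe for contamination, when you can search the training corpus, is n-gram overlap between test items and corpus documents. The sketch below is an illustration under assumptions: whitespace tokenization, a window of 8 tokens, and made-up example strings. It is a screening heuristic, not a proof that a model did or did not see an item.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also appear in corpus_doc."""
    item = ngrams(test_item, n)
    if not item:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item & ngrams(corpus_doc, n)) / len(item)


# Hypothetical test item and corpus documents.
item = "a train leaves the station at noon traveling sixty miles per hour"
leaked = "forum post: a train leaves the station at noon traveling sixty miles per hour answer is 120"
clean = "the weather today is sunny with a light breeze from the west and no rain"

print(overlap_fraction(item, leaked), overlap_fraction(item, clean))
```

A high fraction flags an item for manual review; paraphrased or translated leaks would slip past this check, which is one reason private and freshly generated benchmarks remain the stronger defense.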

Open-ended scoring: two paths

Path	Signal	Cost	Bias
Pairwise human (Arena-style, Elo)	High	Expensive	Less systematic with enough raters; raters can still favor length and style
LLM-as-judge	Medium	Cheap	Judge prefers its own style; blind spots

Useful within a project: LLM-as-judge for fast iteration. Cross-team rankings: human preference.
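To make "Elo aggregates pairwise votes" concrete, here is a sketch of the standard online Elo update applied to a few hypothetical votes. The model names, the starting rating of 1000, and K = 32 are illustrative assumptions, not values from the lecture or from any leaderboard.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift rating from loser to winner in proportion to how surprising the win was."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical pairwise votes as (winner, loser).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)

print(ratings)
```

Online Elo depends on the order in which votes arrive, so with few votes the ranking can be noisy; more comparisons per pair stabilize it.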

The layered pragmatic stack

1. Held-out perplexity                 (smoke test during/after training)
2. Multiple-choice suite               (MMLU, ARC, HellaSwag, TruthfulQA)
3. Executable benchmarks               (HumanEval, GSM8K, MATH, harder variants)
4. Instruction-following               (IFEval and friends)
5. Open-ended preference               (Arena pairwise; LLM-as-judge supplement)
6. Domain-specific evals               (your data, your tasks)

No single score is the model. The portfolio is.
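Layer 1 of the stack, held-out perplexity, is the exponential of the negative mean per-token log-probability on held-out text. The sketch below assumes you already have natural-log token probabilities from your model; the input list is made up for illustration.

```python
import math


def perplexity(token_logprobs: list) -> float:
    """exp(-mean log p) over tokens, with natural-log probabilities as input."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical case: the model assigns probability 0.25 to every held-out token.
uniform_quarter = [math.log(0.25)] * 10
print(perplexity(uniform_quarter))
```

A model that spreads probability evenly over four options per token has perplexity of about 4, which is why perplexity is often read as an effective branching factor. It is comparable only across models that share a tokenizer and held-out set.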

Reading a benchmark headline

A “state-of-the-art on X” claim is stronger when:

X is executable or freshly generated (contamination-resistant).
The model also does well on multiple other benchmarks of different formats.
The harness is pinned and reproducible.
The result is reported alongside open-ended evidence (Arena Elo or human preference).

Weaker when X is an old public multiple-choice benchmark with widely discussed solutions.

Words to use precisely

Construct validity: gap between benchmark and the underlying capability.
Contamination: model saw test items during training.
Format sensitivity: harness/prompt/parser choices change scores.
Pairwise preference: judges compare two outputs; Elo aggregates the comparisons into a ranking.
LLM-as-judge: another model scores outputs; cheap, but biased toward its own style.

Source

Stanford CS336, Lecture 12 (Evaluation), by Hashimoto and Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.