
Cheatsheet: Evaluation

| Format | Examples | Strength | Weakness |
| --- | --- | --- | --- |
| Multiple-choice | MMLU, ARC, HellaSwag, TruthfulQA | Cheap, reproducible | Contamination; narrow construct |
| Executable | HumanEval, GSM8K, MATH | Hard to fake (run-and-check) | Public test sets can still leak via online discussion |
| Instruction-following | IFEval | Checkable format adherence | Doesn’t cover knowledge / reasoning |
| Open-ended | Chatbot Arena, LLM-as-judge | Closest to real usability | No automatic ground truth |

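
The "run-and-check" idea behind executable benchmarks can be sketched in a few lines. This is a minimal illustration, not the harness any particular benchmark uses: `passes_tests`, the `add` task, and the test strings are all hypothetical, and calling `exec` on untrusted model output without a sandbox is unsafe outside a toy setting.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate code runs and every assertion in test_src holds."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function(s)
        exec(test_src, namespace)       # run the assertions against them
    except Exception:
        # Any failure (syntax error, wrong answer, runtime error) counts as a miss.
        return False
    return True


# Hypothetical task: the model was asked to write `add(a, b)`.
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"

results = [passes_tests(src, tests) for src in (good, bad)]
pass_rate = sum(results) / len(results)
print(results, pass_rate)
```

The score comes from executing the output rather than matching a string, which is why a memorized answer alone does not help.
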

| Problem | What it is | Symptom |
| --- | --- | --- |
| Construct validity | Benchmark vs the capability you actually want | High MMLU != “intelligent” |
| Contamination | Model saw the test set in training | Memorization scores high |
| Format sensitivity | Harness/prompt/decoding swings scores | “MMLU 65 vs 67” may not compare |
| Open-ended scoring | No automatic ground truth | Needs human / LLM judges |

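
Format sensitivity is easy to reproduce on toy data. The sketch below scores the same four model outputs with two answer parsers. The outputs, the gold labels, and both parser functions are made up for illustration and do not come from any real harness.

```python
import re
from typing import Callable, List, Optional

# Hypothetical raw model outputs for four multiple-choice questions.
outputs = ["B", "The answer is (C).", "A. Because ...", "I think D"]
gold = ["B", "C", "A", "D"]


def strict_parse(text: str) -> Optional[str]:
    """Accept only an output that is exactly one letter A-D."""
    text = text.strip()
    return text if text in {"A", "B", "C", "D"} else None


def lenient_parse(text: str) -> Optional[str]:
    """Take the first standalone capital letter A-D found anywhere in the output."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else None


def accuracy(parse: Callable[[str], Optional[str]],
             outs: List[str], answers: List[str]) -> float:
    """Fraction of outputs whose parsed answer equals the gold label."""
    return sum(parse(o) == a for o, a in zip(outs, answers)) / len(answers)


strict_acc = accuracy(strict_parse, outputs, gold)    # 0.25
lenient_acc = accuracy(lenient_parse, outputs, gold)  # 1.0
print(strict_acc, lenient_acc)
```

The model is the same and so are its outputs, yet the score moves from 0.25 to 1.0. This is why two reported numbers only compare when the harness and parser are the same.
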
  • Private or freshly-generated benchmarks
  • Executable tasks (run code, check answer): memorizing a single answer token isn’t enough to pass
  • Skepticism toward old, widely-discussed public benchmark headlines
  • Look for recent, harder, private variants
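
When the training corpus is available, a crude contamination check is to measure n-gram overlap between a test item and the training documents. The cheatsheet does not prescribe this; it is one common heuristic, sketched here under strong assumptions: whitespace tokenization, an arbitrary window of `n = 8`, and hypothetical function names and example strings.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus_docs: List[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams that also appear in any corpus document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical test item and two hypothetical training documents.
item = "what is the capital of france and when was it founded"
leaked_doc = "quiz dump: what is the capital of france and when was it founded answer paris"
clean_doc = "paris is a city in europe with many museums and cafes"

leaked_score = overlap_fraction(item, [leaked_doc])
clean_score = overlap_fraction(item, [clean_doc])
print(leaked_score, clean_score)
```

A high overlap suggests the item leaked verbatim. A low overlap does not rule out paraphrased or translated leakage, which is one reason fresh and private benchmarks are still preferable.
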

| Path | Signal | Cost | Bias |
| --- | --- | --- | --- |
| Pairwise human (Arena-style, Elo) | High | Expensive | No systematic bias (with enough raters) |
| LLM-as-judge | Medium | Cheap | Judge prefers its own style; blind spots |

Useful within a project: LLM-as-judge for fast iteration. Cross-team rankings: human preference.
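
To make "Elo aggregates pairwise preferences" concrete, here is a sketch of the standard online Elo update applied to a handful of votes. The model names, the starting rating of 1000, and `k = 32` are illustrative assumptions. Production leaderboards may fit ratings in batch rather than vote by vote.

```python
from typing import Dict


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(ratings: Dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Shift rating from loser to winner, in proportion to how surprising the result was."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical pairwise votes as (preferred, rejected).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]

update(ratings, *votes[0])
first_step = dict(ratings)  # equal ratings, so the first vote moves each side by k/2 = 16

for winner, loser in votes[1:]:
    update(ratings, winner, loser)
print(first_step, ratings)
```

Each update is zero-sum, so only rating differences carry information. Online updates also depend on the order of the votes.
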

1. Held-out perplexity (smoke test during/after training)
2. Multiple-choice suite (MMLU, ARC, HellaSwag, TruthfulQA)
3. Executable benchmarks (HumanEval, GSM8K, MATH, harder)
4. Instruction-following (IFEval and friends)
5. Open-ended preference (Arena pairwise; LLM-as-judge supplement)
6. Domain-specific evals (your data, your tasks)
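
For step 1, perplexity is the exponential of the mean negative log-likelihood per token on held-out text. The sketch below assumes you already have per-token natural-log probabilities from your model. The function name and the toy input are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp(mean negative log-likelihood), with natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy check: a model that assigns probability 1/4 to every held-out token
# is as uncertain as a uniform choice among 4 options, so perplexity is about 4.
uniform_ppl = perplexity([math.log(0.25)] * 10)
print(uniform_ppl)
```

Perplexity works as a smoke test because it is cheap and moves smoothly during training. It says little about downstream task quality, which is what the later steps are for.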

No single score is the model. The portfolio is.

A “state-of-the-art on X” claim is stronger when:

  • X is executable or freshly generated (contamination-resistant).
  • The model also does well on multiple other benchmarks of different formats.
  • The harness is pinned and reproducible.
  • The result is reported alongside open-ended evidence (Arena Elo or human preference).

Weaker when X is an old public multiple-choice benchmark with widely discussed solutions.

  • Construct validity: gap between benchmark and the underlying capability.
  • Contamination: model saw test items during training.
  • Format sensitivity: harness/prompt/parser choices change scores.
  • Pairwise preference: judges compare two outputs; Elo aggregates.
  • LLM-as-judge: another model scores outputs; cheap, biased.
  • Stanford CS336, Lecture 12 (Evaluation), by Hashimoto and Liang. cs336.stanford.edu. Independent structural mirror in original prose; see references.