Summary: Evaluation

Scaling laws predict loss; what you care about is capability. Evaluation is what bridges them. Benchmarks come in four families: multiple-choice (MMLU, ARC, HellaSwag, TruthfulQA), executable (HumanEval, GSM8K, MATH), instruction-following (IFEval), and open-ended (Chatbot Arena pairwise preference, LLM-as-judge). Evaluation is hard for four reasons: construct validity (does the benchmark measure the capability you care about?), contamination (did the model see the test set in training?), format sensitivity (prompt/harness/decoding choices swing scores), and open-ended scoring (no automatic ground truth). The pragmatic stack is layered: held-out perplexity, a multiple-choice suite, executable benchmarks, instruction-following, open-ended preference, and domain-specific evals. No single score is the model; the portfolio is. This page is the skimmable version; what the lesson itself teaches is the discipline.

  • Loss is not capability. Scaling laws predict cross-entropy; you care about downstream usefulness. The correlation is noisy; some capabilities appear in jumps.
  • Four benchmark formats: multiple-choice (MMLU, ARC, HellaSwag, TruthfulQA), executable (HumanEval, GSM8K, MATH), instruction-following (IFEval), open-ended (Arena pairwise, LLM-as-judge).
  • Four reasons evaluation is hard: construct validity (benchmark vs capability gap), contamination (model saw the test set), format sensitivity (harness choices swing scores), open-ended scoring (no automatic ground truth).
  • Contamination defenses: private or freshly generated benchmarks; executable tasks (run-and-check scoring is harder to pass by memorizing a single answer token); skepticism toward headline numbers on old public benchmarks.
  • Open-ended scoring options: pairwise human preference at scale (high signal, expensive) vs LLM-as-judge (cheap, biased). Use either with its failure modes in mind.
  • Layered pragmatic stack: perplexity smoke test, multiple-choice suite, executable, instruction-following, open-ended preference, domain-specific. Layers fail differently; the portfolio is the model’s measurement, not any single number.
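To make the "run-and-check" idea from the executable bullet concrete, here is a minimal sketch of such a scorer. The task, the function name `sum_evens`, and the test cases are hypothetical illustrations, not taken from any real benchmark, and real harnesses run candidates in a sandbox, which this sketch does not.

```python
from typing import Any, Sequence


def run_and_check(candidate_src: str, entry_point: str,
                  tests: Sequence[tuple[tuple, Any]]) -> bool:
    """Return True only if the candidate defines entry_point and passes every test."""
    namespace: dict[str, Any] = {}
    try:
        # WARNING: exec runs arbitrary code. A real harness would isolate this
        # step (separate process, timeout, no network); this sketch does not.
        exec(candidate_src, namespace)
        fn = namespace[entry_point]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, a missing function, and runtime errors all count as failure.
        return False


# Hypothetical task: "return the sum of the even numbers in a list".
TESTS = [(([1, 2, 3, 4],), 6), (([],), 0), (([7, 9],), 0)]

correct = "def sum_evens(xs):\n    return sum(x for x in xs if x % 2 == 0)\n"
memorized = "def sum_evens(xs):\n    return 6\n"  # parrots one known answer

print(run_and_check(correct, "sum_evens", TESTS))    # True
print(run_and_check(memorized, "sum_evens", TESTS))  # False
```

The second candidate reproduces one remembered answer and still fails, because the score depends on behavior across several inputs rather than on emitting one expected token.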

Evaluation is the feedback loop that the rest of the systems work has been optimizing against. Without it you cannot tell whether your improvements are improvements. With it you can read a model release critically: a state-of-the-art number on an old public benchmark is weaker evidence than the same model holding up on a freshly generated or executable benchmark; “X is faster” claims need format-pinned harnesses to be comparable; “X is smarter” claims need a portfolio of benchmarks with different failure modes to be convincing. The discipline of treating any one number with suspicion is also how you tell a saturated benchmark (everyone at ~95%, differences within noise) from a meaningful gap. The next two lessons turn to the data feeding all of this, where construct-validity-style questions reappear in a different shape (filter on what signal? balance against what target?).
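One rough way to check "differences within noise" is to compare the gap between two accuracies against a sampling-error margin. The sketch below assumes both models are scored on the same number of independent items and uses an unpaired normal approximation. The function name `gap_within_noise` and the example numbers are illustrative, not from the lesson. A paired test on per-item results would be tighter when those are available.

```python
import math


def gap_within_noise(acc_a: float, acc_b: float, n_items: int,
                     z: float = 1.96) -> tuple[bool, float]:
    """Compare an accuracy gap to an approximate 95% margin of error.

    Assumes each model was scored on n_items independent questions and
    ignores pairing between the two models' per-item results.
    """
    # Binomial variance of each accuracy estimate, summed for the difference.
    variance = acc_a * (1 - acc_a) / n_items + acc_b * (1 - acc_b) / n_items
    margin = z * math.sqrt(variance)
    return abs(acc_a - acc_b) <= margin, margin


# Saturated case: 95.0% vs 95.4% on a hypothetical 1,000-item benchmark.
print(gap_within_noise(0.950, 0.954, 1000))
# Clear gap: 70% vs 80% on the same number of items.
print(gap_within_noise(0.70, 0.80, 1000))
```

In the first case the margin is just under two accuracy points, so a 0.4-point lead does not separate the models. In the second, the 10-point gap is well outside the margin.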

If you cannot measure progress, you cannot make it. The portfolio (not any single number) is the measure of the model’s capability, and treating any one benchmark number with skepticism is the difference between a serious training program and a press kit.