Practice: Evaluation

Self-check

Seven short questions. Answer each before opening the collapsible.

1. Why is “loss” not the same as “capability”?

Show answer

Scaling laws predict cross-entropy loss, but what stakeholders care about is downstream usefulness. They are correlated but not identical: some capabilities appear in jumps rather than smoothly with loss, and a lower-loss model can score worse on specific downstream tasks. Evaluation is what bridges the two.

2. Name the four standard benchmark formats and one example of each.

Show answer

Multiple-choice (MMLU, ARC, HellaSwag, TruthfulQA), executable (HumanEval for code, GSM8K and MATH for math), instruction-following (IFEval), and open-ended (Chatbot Arena pairwise preference, or LLM-as-judge). A modern stack uses several across categories.

3. What is the construct-validity problem, in plain terms?

Show answer

A benchmark measures performance on its specific dataset and format. Whether that performance reflects the capability you care about (“the model can reason,” “the model is helpful”) is a separate question. A high MMLU score shows the model is good at multiple-choice academic questions; that is not the same as “intelligent.” The gap between what the benchmark scores and what stakeholders mean is the construct-validity problem.

4. What is contamination, and what defenses help?

Show answer

A model trained on web data may have seen public benchmark questions during training, so high scores reflect memorization rather than capability. Defenses: private or freshly generated benchmarks, executable tasks where memorizing one token isn’t enough (run the code; solve the puzzle), and skepticism toward high scores on old, widely-discussed public benchmarks.
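As a minimal sketch of the "freshly generated" defense: items produced by a program at evaluation time cannot have been memorized verbatim from a web crawl. The function name, the item format, and the choice of three-digit addition are all assumptions made for illustration; a real fresh benchmark would target the capability you actually care about.

```python
import random

def fresh_arithmetic_items(n, seed):
    """Generate n addition questions with known answers (hypothetical item format)."""
    rng = random.Random(seed)  # seeded so a given evaluation run is reproducible
    items = []
    for _ in range(n):
        a, b = rng.randint(100, 999), rng.randint(100, 999)
        items.append({"question": f"What is {a} + {b}?", "answer": str(a + b)})
    return items

# Use a new seed per evaluation round so the items are new each time.
items = fresh_arithmetic_items(5, seed=2024)
for item in items:
    print(item["question"], "->", item["answer"])
```

Keeping the seed private until after the run is what makes the set "fresh"; publishing it turns the items into one more public benchmark.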

5. What is format sensitivity, and why does it matter for cross-paper comparisons?

Show answer

The same model on the same benchmark scores differently depending on prompt format (zero-shot vs few-shot), chain-of-thought, answer parsing, and decoding settings. Reported numbers depend on the evaluation harness, not just the model. An “MMLU 65” in one paper and an “MMLU 67” in another may not be comparable unless both used the same harness and settings; this is why community-maintained, version-pinned harnesses (the lm-evaluation-harness family) exist.
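A small sketch of the answer-parsing part of this effect: the same three model outputs, scored against the same gold answers, give different accuracies under two parsers. The outputs, gold labels, and both parsers are made up for illustration and are not taken from any real harness.

```python
import re

# Hypothetical raw model outputs for three four-option questions.
outputs = ["B", "The answer is (C).", "I think A is wrong, so D"]
gold = ["B", "C", "D"]

def strict_parse(text):
    """Accept only a bare option letter; anything else counts as no answer."""
    text = text.strip()
    return text if text in {"A", "B", "C", "D"} else None

def lenient_parse(text):
    """Take the last standalone option letter found anywhere in the text."""
    letters = re.findall(r"\b([ABCD])\b", text)
    return letters[-1] if letters else None

def accuracy(parser):
    hits = sum(parser(o) == g for o, g in zip(outputs, gold))
    return hits / len(gold)

strict_acc = accuracy(strict_parse)    # only the bare "B" is accepted
lenient_acc = accuracy(lenient_parse)  # all three outputs yield the gold letter
print(f"strict: {strict_acc:.2f}, lenient: {lenient_acc:.2f}")
```

Neither parser is "correct"; the point is that the reported number changes when only the parser changes, so a score is interpretable only alongside the harness that produced it.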

6. What are the two practical paths for open-ended evaluation, and what are their failure modes?

Show answer

Path one is pairwise human preference at scale (Chatbot Arena-style: show two anonymous outputs, ask people which is better, and compute Elo ratings from the votes). It is high signal, but expensive and slow. Path two is LLM-as-judge (another model scores the responses). It is cheap, but biased: judges tend to prefer outputs that look like their own and have systematic blind spots. LLM-as-judge is useful for relative comparisons within a project, less so for cross-team rankings.
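To make "Elo-rate" concrete, here is a minimal sketch of the textbook Elo update applied to pairwise votes. The model names, the votes, the starting rating of 1000, and the step size k=32 are assumptions for illustration; production leaderboards may fit ratings differently.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, winner, loser, k=32.0):
    """Move rating points from loser to winner; upsets move more points."""
    e_w = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - e_w)
    ratings[winner] += delta
    ratings[loser] -= delta

# Hypothetical models and votes; each vote is (preferred, other).
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [("model_x", "model_y"), ("model_x", "model_y"), ("model_y", "model_x")]
for winner, loser in votes:
    update(ratings, winner, loser)

print(ratings)
```

With two wins out of three, model_x ends above model_y, and the total number of rating points is unchanged because every update is zero-sum.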

7. Why is the modern evaluation stack layered rather than a single number?

Show answer

Every layer has known weaknesses, and a layered stack gets you signal because each layer fails differently. Held-out perplexity catches catastrophic training problems; multiple-choice covers knowledge cheaply; executable benchmarks resist contamination; instruction-following checks format adherence; open-ended preference measures actual usability; domain-specific evals capture what matters in production. No single score is the model; the portfolio is.
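For the first layer, perplexity is the exponential of the average per-token cross-entropy. A minimal sketch, assuming you already have natural-log probabilities for each held-out token (the input list here is synthetic, not model output):

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Sanity check: a model that assigns probability 1/4 to every token
# is as uncertain as a uniform choice among 4 options.
uniform_over_4 = [math.log(0.25)] * 8
ppl = perplexity(uniform_over_4)
print(ppl)
```

A sudden jump in this number on held-out text is the "catastrophic training problem" signal; small differences say little about downstream capability.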

Try it yourself: read a benchmark claim critically

About 10 minutes, no setup. The diagnostic muscle is the point.

Part A: dissect the claim. A model release reports “MMLU 87.5, GSM8K 92.0, HumanEval 78.0.” For each number, name one reason to be skeptical and one reason it is still useful evidence.

What you’ll get

MMLU 87.5. Skeptical: contamination (MMLU has been public for years; widely discussed; many models have likely seen it). Useful: still a reasonable broad-knowledge proxy, and a model that does badly on it usually has gaps.
GSM8K 92.0. Skeptical: contamination (large, public, well-known). Useful: math answers must be a specific number, so partial memorization is less helpful; harder math benchmarks (MATH, AIME-class) are stronger evidence at the top end.
HumanEval 78.0. Skeptical: the problems and their solutions can also appear in training data; widely discussed solutions exist. Useful: the metric runs the code and checks that the tests pass, so a memorized but slightly-off solution still fails. Among the harder-to-contaminate benchmarks.

The pattern: harder-to-fake metrics (executable, harder, more recent) are stronger evidence than older multiple-choice headlines.
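A minimal sketch of why run-and-check metrics are harder to fake: a candidate counts only if it executes and passes every test. The task, the tests, and the three candidate solutions are hypothetical, and this is not the HumanEval harness itself.

```python
def passes_tests(candidate_source, tests, func_name="add"):
    """Run a candidate solution and check every test case.

    Assumption for this sketch: plain exec() is acceptable. A real harness
    would run untrusted model output in a sandbox with a timeout.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        fn = namespace[func_name]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False  # syntax errors and runtime errors both count as failures

tests = [((1, 2), 3), ((-1, 1), 0), ((0, 0), 0)]
candidates = [
    "def add(a, b):\n    return a + b",  # correct
    "def add(a, b):\n    return a - b",  # slightly off: fails the first test
    "def add(a, b) return a + b",        # does not parse: fails
]
results = [passes_tests(c, tests) for c in candidates]
pass_rate = sum(results) / len(results)
print(results, pass_rate)
```

The second candidate looks almost right as text, which is exactly the case a string-similarity metric would reward and an executable metric rejects.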

Part B (reasoning). You are building an internal assistant for legal-document QA. Sketch a four-layer evaluation stack for your team. (Stay technical and methodology-focused; do not propose legal claims, just evaluation design.)

What you should notice

Reasonable layers:

Held-out perplexity on a sample of your domain documents (smoke test during/after training or fine-tuning).
Multiple-choice retrieval-style benchmark built from your documents (does the model identify the correct passage / cited section among distractors?). Cheap, contamination-controlled because it is from your private corpus.
Executable / verifiable tasks: extract structured fields (dates, party names, clause types) into a strict schema; check exact match against a private gold set. Hard to fake.
Open-ended evaluation: domain-expert pairwise preference on draft summaries or Q&A, on a held-out sample; LLM-as-judge as a cheaper supplement for in-flight tuning.

The point is the layering: each layer catches a different failure mode, and a private corpus makes contamination far less of a worry than public benchmarks. The recommendation is structural; the actual legal-correctness call belongs to domain experts, not the evaluation pipeline.
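The verifiable-extraction layer can be scored with per-field exact match. A minimal sketch follows; the field names, the schema, and both records are invented for illustration and would be replaced by your own private gold set.

```python
# Hypothetical schema for the extraction task.
FIELDS = ("effective_date", "party_a", "clause_type")

def field_exact_match(predictions, gold):
    """Fraction of documents where each field matches the gold value exactly."""
    scores = {}
    for field in FIELDS:
        hits = sum(p.get(field) == g[field] for p, g in zip(predictions, gold))
        scores[field] = hits / len(gold)
    return scores

# Invented gold records and model predictions, aligned by position.
gold = [
    {"effective_date": "2021-03-01", "party_a": "Acme Ltd", "clause_type": "indemnity"},
    {"effective_date": "2020-11-15", "party_a": "Borel Inc", "clause_type": "termination"},
]
predictions = [
    {"effective_date": "2021-03-01", "party_a": "Acme Ltd", "clause_type": "indemnity"},
    {"effective_date": "2020-11-15", "party_a": "Borel", "clause_type": "termination"},
]

scores = field_exact_match(predictions, gold)
print(scores)
```

Reporting per-field scores rather than one average shows where the model fails (here, a truncated party name), which is the kind of diagnosis the layer is for.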

Part C (reasoning). Why is the portfolio of benchmark scores stronger evidence than any single number, even a state-of-the-art one?

What you should notice

Because each benchmark has known weaknesses (contamination, format sensitivity, narrow construct), and the failure modes are different. A model that tops one benchmark may be overfit to its format or contaminated by its data; a model that does broadly well across many benchmarks of different formats and ages is much less likely to be exploiting one specific weakness. The portfolio is the noise-cancelling version of the score: where the benchmarks agree, the signal is stronger than any one of them alone.

Flashcards

Nine cards.

Q. Why is loss not the same as capability?

Scaling laws predict cross-entropy; you care about downstream usefulness. Correlated but not identical; some capabilities appear in jumps. Evaluation bridges the two.

Q. The four benchmark formats?

Multiple-choice (MMLU, ARC, HellaSwag, TruthfulQA), executable (HumanEval, GSM8K, MATH), instruction-following (IFEval), open-ended (Chatbot Arena, LLM-as-judge).

Q. Construct validity?

A benchmark measures its specific dataset and format. Whether that reflects the capability you care about is a separate question; high MMLU != “intelligent.” The deepest issue in LLM evaluation.

Q. Contamination, and defenses?

The model may have seen public benchmark questions during training, so high scores reflect memorization. Defenses: private/freshly generated benchmarks, executable tasks (run code / check answer), skepticism toward old public benchmarks.

Q. Format sensitivity?

Same model + same benchmark + different prompt format/parser/decoding = different score. Reported numbers depend on the harness, not just the model. Version-pinned harnesses (lm-evaluation-harness family) matter for comparability.

Q. Open-ended evaluation: two paths and their failures?

Pairwise human preference at scale (Chatbot Arena Elo): high signal, expensive. LLM-as-judge: cheap, biased (judges prefer outputs that look like their own; systematic blind spots). Use both with failure modes in mind.

Q. Why a layered stack, not one number?

Every layer has known weaknesses; failure modes differ. Perplexity catches training disasters; MC covers knowledge cheaply; executable resists contamination; instruction-following checks format; open-ended measures usability. No single score is the model; the portfolio is.

Q. Stronger benchmark evidence?

Harder to fake: executable (run-and-check), freshly generated, private held-out, recent, paired with multiple formats. Weaker: old public multiple-choice where contamination is plausible and discussion has leaked answers into the web.

Q. Pragmatic eval stack (layers)?

1) held-out perplexity, 2) multiple-choice suite, 3) executable (code/math) benchmarks, 4) instruction-following, 5) open-ended (pairwise human + LLM-as-judge), 6) domain-specific. Each layer catches a different failure mode.