Evaluation: measuring a language model

Scaling laws (lesson 9) predict loss; what stakeholders care about is capability. This lesson covers what it takes to measure capability honestly. The source curriculum is Stanford CS336, Lecture 12, by Tatsunori Hashimoto and Percy Liang; the lectures are freely available on YouTube, and the course site is cs336.stanford.edu.

You will distinguish the four standard benchmark formats (multiple-choice, executable, instruction-following, open-ended); learn the four reasons evaluation is hard (construct validity, contamination, format sensitivity, open-ended scoring) and the practical defenses against each; compare pairwise human preference to LLM-as-judge for open-ended scoring; and sketch the layered pragmatic evaluation stack that modern LLM teams actually run.

This is lesson 10 of 14, the second lesson of Phase 3 (scale, data, and alignment). It is the critical companion to lesson 9: that lesson's scaling laws predict loss, and this one interrogates what loss actually measures. The next two lessons turn to the data those benchmarks and models are built on, where construct-validity-style questions reappear; the final two lessons cover post-training, which is largely steered by evaluation.

Prerequisites: lesson 9, which supplies the scaling-laws context for the loss-vs-capability gap this lesson examines. Familiarity with the idea of a held-out validation set (Track 14 lesson 5) helps; this lesson assumes you have seen training-vs-validation discipline before.

None. This is a methodology lesson about benchmarks, harnesses, and evaluation discipline. Where numbers appear, they are scores reported by other teams; the lesson teaches how to read them, not derive them.

The single capability this lesson builds: describe how language models are evaluated and why evaluation is hard. Concretely, you will be able to:

  • Distinguish the four benchmark formats (multiple-choice, executable, instruction-following, open-ended)
  • Name the four reasons evaluation is hard and explain each
  • Describe contamination defenses
  • Compare pairwise human preference to LLM-as-judge for open-ended scoring
  • Sketch the layered pragmatic evaluation stack
  • Read time: about 12 minutes
  • Practice time: about 10 minutes (read a benchmark headline critically, sketch an evaluation stack for a use case, and review flashcards)
  • Difficulty: deep (Stage C; conceptual / methodology-focused, no math)