References: Evaluation
Source material
Source curriculum (structural mirror, cited as further study):

- Stanford CS336, "Language Modeling from Scratch", Lecture 12: Evaluation
  - Instructors: Tatsunori Hashimoto and Percy Liang (Stanford)
  - Course page: https://cs336.stanford.edu/
  - Lecture videos: YouTube playlist, https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
  - License: no explicit license is published on the course site; lecture videos are on YouTube under standard terms; slides are public on GitHub without a stated license.
  - Required attribution: "Based on the structure of Stanford CS336, 'Language Modeling from Scratch,' by Tatsunori Hashimoto and Percy Liang (cs336.stanford.edu). This is an independent structural mirror in original prose; it reproduces no course materials, and Stanford does not endorse it."

This lesson mirrors the structure of Lecture 12 (evaluation). Clawdemy's lessons are original prose that follows the pedagogical arc of the course. Because the source publishes no explicit license, we cite it as a recommended companion and reproduce none of its materials.

Watch this next
- Stanford CS336, Lecture 12: Evaluation by Hashimoto and Liang. The lecture this lesson mirrors. It walks through the benchmark families and their limits in more depth, with worked examples of harness sensitivity.
Going deeper
A short, durable list. Each link is a specific next step, not a generic pile.
- The lm-evaluation-harness by EleutherAI. A community-maintained harness widely used for reproducible LLM evaluation, and a reference implementation for running a suite of standard benchmarks with pinned formats.
- Chatbot Arena by LMSYS. A prominent open leaderboard built on pairwise human preferences. The methodology page is worth reading for how Elo-style aggregation works in practice, and for how the human-preference signal compares to multiple-choice scores.
- HELM (Holistic Evaluation of Language Models) by Stanford CRFM. A large multi-scenario evaluation framework that takes the layered-stack idea seriously, with multiple metrics per task and explicit scenario coverage.
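The Chatbot Arena entry mentions Elo aggregation of pairwise preferences. As a rough illustration of the idea only, here is a minimal sketch of a textbook online Elo update over a list of head-to-head votes. It is not the Arena's actual pipeline: the function name `elo_ratings`, the model names, and the constants (`k=32`, a starting rating of 1000, a scale of 400) are all assumptions chosen for the example.

```python
from collections import defaultdict

def elo_ratings(battles, k=32.0, base=1000.0, scale=400.0):
    """Textbook online Elo over pairwise votes (illustrative sketch).

    battles: iterable of (model_a, model_b, outcome), where outcome is
    1.0 if model_a was preferred, 0.0 if model_b was, and 0.5 for a tie.
    k, base, and scale are assumed values, not taken from any leaderboard.
    """
    ratings = defaultdict(lambda: base)
    for a, b, outcome in battles:
        ra, rb = ratings[a], ratings[b]
        # Probability that a is preferred, given the current rating gap.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / scale))
        # Move each rating toward the observed result; the total is conserved.
        ratings[a] = ra + k * (outcome - expected_a)
        ratings[b] = rb + k * (expected_a - outcome)
    return dict(ratings)

# Hypothetical votes between two made-up models.
votes = [
    ("model-x", "model-y", 1.0),
    ("model-x", "model-y", 1.0),
    ("model-y", "model-x", 0.5),
]
ratings = elo_ratings(votes)
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

One property worth noticing when reading a leaderboard built this way: the online update depends on the order in which votes arrive, so the same set of votes processed in a different order can yield slightly different ratings. That is one reason to read a leaderboard's methodology page rather than treat the published numbers as order-free measurements.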
Adjacent topics
Where this connects inside the track.
- Scaling laws (lesson 9). Scaling laws predict cross-entropy loss; this lesson takes a critical look at how well that loss actually tracks downstream capability.
- Curating high-quality datasets (Track 14 Lesson 11). The same construct-validity questions return on the data side: are your training examples measuring what you think they are?
- Reasoning models (lesson 14, this track’s capstone). Reasoning models are often evaluated on more difficult, executable, harder-to-contaminate benchmarks; the discipline of this lesson is what lets you read reasoning-model claims critically.