How we evaluate models, LLM-as-a-Judge

This is the opening lesson of Phase 7, How we judge models and where they’re going, in Track 5 (AI Foundations). Phase 6 closed the “what one LLM call can be augmented to do” arc. This phase asks the next obvious question: how do we know whether any of it is working?

For coding and math problems with verifiable answers (Phase 6 reasoning-models lesson), evaluation is mechanical. For everything else (summaries, advice, conversation, drafts), there is no single right answer and human ratings are too slow and expensive to run at scale. The pragmatic solution the field has converged on is LLM-as-a-Judge (LaaJ): one LLM reads another LLM’s response, plus the prompt that produced it and a description of what counts as good, and returns a score plus a written rationale.

This lesson covers what LaaJ is, the pointwise-vs-pairwise distinction, why it works (the judge LLM was trained on text and human preferences; using it as a judge applies that knowledge to a new response), the structured-output discipline that makes LaaJ pipelines reliable in production, the three named biases (position, verbosity, self-enhancement) and how to mitigate them, and the practitioner’s best practices. Course materials are at cme295.stanford.edu.
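The judge call described above (prompt, response, and guidelines in; rationale and score out) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `build_judge_prompt`, `parse_verdict`, `judge_pointwise`, and `fake_judge` are hypothetical, the template wording is an assumption, and `fake_judge` is a stand-in for whatever model API you actually call. The sketch uses a binary pass/fail scale and asks for the rationale before the score.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    rationale: str
    passed: bool


# Hypothetical template: guidelines, original prompt, and the response to grade.
JUDGE_TEMPLATE = """You are grading a model response.

Guidelines for a good response:
{guidelines}

Prompt given to the model:
{prompt}

Response to grade:
{response}

Reply with a JSON object with exactly two keys, in this order:
"rationale" (a string with your reasoning), then "pass" (true or false)."""


def build_judge_prompt(prompt: str, response: str, guidelines: str) -> str:
    return JUDGE_TEMPLATE.format(
        guidelines=guidelines, prompt=prompt, response=response
    )


def parse_verdict(raw: str) -> Verdict:
    """Validate the judge's structured output instead of trusting it blindly."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("judge output is not a JSON object")
    rationale, passed = data.get("rationale"), data.get("pass")
    if not isinstance(rationale, str) or not isinstance(passed, bool):
        raise ValueError("judge output does not match the expected schema")
    return Verdict(rationale=rationale, passed=passed)


def judge_pointwise(
    call_judge: Callable[[str], str], prompt: str, response: str, guidelines: str
) -> Verdict:
    """Pointwise LaaJ: grade one response on its own."""
    return parse_verdict(call_judge(build_judge_prompt(prompt, response, guidelines)))


def fake_judge(judge_prompt: str) -> str:
    # Stand-in for a real LLM call (an assumption for this sketch).
    return json.dumps(
        {"rationale": "The summary covers the main point in one sentence.", "pass": True}
    )


verdict = judge_pointwise(
    fake_judge,
    prompt="Summarize the article in one sentence.",
    response="The article argues that evaluation needs scalable judges.",
    guidelines="Pass only if the summary is one sentence and states the main point.",
)
print(verdict.passed, "|", verdict.rationale)
```

Parsing into a typed `Verdict` and rejecting anything off-schema is one way to keep a pipeline from silently recording garbage scores when the judge drifts from the requested format.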

The previous lesson (How agent loops work) closed Phase 6 by completing the picture of what one LLM call can do. This lesson opens the evaluation arc. The next lesson covers why benchmarks can mislead, and the one after that covers why tool-using models fail (the failure-mode taxonomy). Then come transformers beyond text (ViT and MoE) and new generation methods (speculative decoding, diffusion language models). The track closes with a safety-lens recap that pulls together every safety thread woven through Phases 4 to 7.

Prerequisites: the agent loops lesson is required for narrative continuity (Phase 6 is the immediate predecessor). The reward model lesson is useful since LaaJ is increasingly the source of synthetic preference data that feeds reward-model training. The chain-of-thought lesson is useful since the rationale-before-score pattern in LaaJ is the same chain-of-thought intuition applied to evaluation.

  • Define LLM-as-a-Judge and explain why the field uses one LLM to evaluate the output of another
  • Distinguish pointwise from pairwise LaaJ setups and describe when each is the right shape
  • Identify the three main biases (position, verbosity, self-enhancement) and the standard mitigations for each
  • Apply best practices for setting up a LaaJ pipeline (crisp guidelines, binary scale, rationale-before-score, structured output)
  • Recognize how synthetic preference data generation via LaaJ feeds back into reward-model training
  • Read time: about 12 minutes
  • Practice time: about 12 minutes (a self-check on the three biases and their mitigations, a hands-on exercise designing a LaaJ prompt with structured output, and flashcards)
  • Difficulty: standard
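
As a preview of the bias objective above: one widely used mitigation for position bias in pairwise setups is to ask the judge twice with the two responses swapped, and to accept a winner only when both orderings agree. The sketch below assumes a hypothetical judge callable that returns the string "first" or "second"; the function names and the toy judges are illustrative stand-ins, not a real model API.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_response, second_response)
# -> "first" or "second", naming the position of the preferred response.
PairwiseJudge = Callable[[str, str, str], str]


def judge_pairwise_debiased(
    call_judge: PairwiseJudge, prompt: str, response_a: str, response_b: str
) -> str:
    """Return "A", "B", or "tie" after judging both orderings."""
    forward = call_judge(prompt, response_a, response_b)   # A shown first
    backward = call_judge(prompt, response_b, response_a)  # B shown first
    for answer in (forward, backward):
        if answer not in ("first", "second"):
            raise ValueError(f"unexpected judge output: {answer!r}")
    winner_forward = "A" if forward == "first" else "B"
    winner_backward = "B" if backward == "first" else "A"
    # A disagreement means the verdict followed the position, not the content.
    return winner_forward if winner_forward == winner_backward else "tie"


def position_biased_judge(prompt: str, first: str, second: str) -> str:
    # Toy judge that always prefers whatever is shown first.
    return "first"


def content_judge(prompt: str, first: str, second: str) -> str:
    # Toy judge that prefers the response mentioning "Paris".
    return "first" if "Paris" in first else "second"


question = "What is the capital of France?"
print(judge_pairwise_debiased(position_biased_judge, question, "Paris", "Lyon"))
print(judge_pairwise_debiased(content_judge, question, "Paris", "Lyon"))
```

The position-biased toy judge contradicts itself across the two orderings and is recorded as a tie, while the content-based judge picks the same response both times. Running both orderings doubles the number of judge calls, which is the cost of this check.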