References: How we evaluate models, LLM-as-a-Judge

Source material

• Stanford CME 295: Transformers & Large Language Models, Autumn 2025
  Instructors: Afshine Amidi & Shervine Amidi, Stanford University
  Course site: https://cme295.stanford.edu/
  Cheatsheet: https://cme295.stanford.edu/cheatsheet/
  Source lecture (Lecture 8, LLM Evaluation):
    see course site at https://cme295.stanford.edu/ for the lecture URL
  License (lecture videos): as published on Stanford's public YouTube channel
  License (Amidi cheatsheets): MIT
This lesson adapts the LLM-as-a-Judge (LaaJ) section of Stanford CME 295
Lecture 8. It covers [00:28:11-00:32:30] the LaaJ definition and
rationale-before-score, [00:32:30-00:38:39] structured output and pointwise
vs pairwise judging, [00:38:39-00:47:00] the three biases (position,
verbosity, self-enhancement) and their mitigations, and [00:47:00-00:50:00]
best practices and human calibration. Clawdemy provides original notes,
summaries, and quizzes derived from this material for educational purposes.
All rights to the original lectures remain with Stanford and the instructors.

Foundational paper

“Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”, Zheng et al., 2023. The paper that popularized “LLM-as-a-Judge” as a term and introduced MT-Bench (a multi-turn benchmark) alongside Chatbot Arena (human-preference battles between models). Section 2 introduces the two benchmarks, Section 3 covers the LaaJ setup together with its limitations (including the position, verbosity, and self-enhancement biases, with empirical measurements) and their mitigations, and Section 4 evaluates agreement with human ratings. Sections 3 and 4 are the conceptual core. Read it after this lesson for the empirical detail behind the qualitative claims: the paper measures each bias and reports how often the judge agrees with human raters.

Bias-specific deep dives

“Large Language Models are not Fair Evaluators”, Wang et al., 2023. Documents and quantifies position bias specifically, and proposes calibration strategies that include judging each pair in both orders and aggregating the results, which is the idea behind the swap-based check now in common use. Useful for understanding the empirical magnitude of the bias.
“LLMs as Evaluators: A Comprehensive Survey”, Li et al., 2024. Surveys the landscape of LaaJ approaches and bias mitigations. Useful as a one-stop overview when this lesson’s three biases feel insufficient and you want to see what else has been studied.
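
As a rough illustration of the swap-based position-bias check discussed in these papers, here is a minimal Python sketch. It is not taken from either paper. The judge signature and its return values ("first", "second", "tie"), the function name judge_both_orders, and the two toy judges are all hypothetical, chosen only to show the control flow: judge the pair in both orders and accept a winner only when the two verdicts agree.

```python
from typing import Callable

# Hypothetical judge signature (an assumption for this sketch):
# takes (prompt, first_answer, second_answer) and returns
# "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def judge_both_orders(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", "tie", or "inconsistent" after judging both orderings."""
    first_pass = judge(prompt, answer_a, answer_b)   # A is shown first
    second_pass = judge(prompt, answer_b, answer_a)  # B is shown first

    # Map positional verdicts back to the underlying answers.
    verdict_1 = {"first": "A", "second": "B"}.get(first_pass, "tie")
    verdict_2 = {"first": "B", "second": "A"}.get(second_pass, "tie")

    # Only trust a verdict that survives the swap.
    return verdict_1 if verdict_1 == verdict_2 else "inconsistent"


# Toy judges standing in for a real model call (hypothetical).
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"  # maximally position-biased


def prefers_longer(prompt: str, first: str, second: str) -> str:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

A judge that always picks the first slot is flagged as "inconsistent", while a judge whose preference depends only on the answers gives the same winner in both orders. Note that the second toy judge is verbosity-biased by construction: the swap check catches position bias only, not the other two biases.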

Production references

OpenAI’s structured output documentation. A widely used working reference for production-grade structured output. Covers schema definition and the API parameters that enforce it.
Anthropic’s tool use and structured output documentation. Same idea, different vendor. Useful for comparing API differences.
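
Whichever vendor enforces the schema, it usually helps to validate the judge's reply on the client side as well. The sketch below assumes a hypothetical two-field reply (a "rationale" string followed by an integer "score" on a 1 to 5 scale); the field names, the scale, and the names Verdict and parse_verdict are illustrative assumptions, not part of either vendor's API. It uses only the Python standard library and makes no API calls.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    rationale: str  # the judge's reasoning, requested before the score
    score: int      # assumed integer scale, 1 to 5 by default


def parse_verdict(raw: str, min_score: int = 1, max_score: int = 5) -> Verdict:
    """Parse and validate a judge reply; raise ValueError if it is malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")

    rationale = data.get("rationale")
    score = data.get("score")

    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("missing or empty rationale")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not min_score <= score <= max_score:
        raise ValueError("score out of range")

    return Verdict(rationale=rationale, score=score)
```

For example, parse_verdict('{"rationale": "Accurate and concise.", "score": 4}') returns a Verdict with score 4, while a reply with a missing rationale or a score of 9 raises ValueError and can be retried or logged.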

Going deeper

A short list, chosen for durability.

“Constitutional AI”, Bai et al., 2022. Anthropic’s approach to using a model’s own critiques as training signal, which is conceptually adjacent to LaaJ-as-preference-data. Useful for understanding the synthetic-preference-data pipeline that LaaJ feeds.
“Self-Rewarding Language Models”, Yuan et al., 2024. A specific instance of the “model judges its own outputs as training signal” loop. Shows the technique can work in some settings, but also surfaces failure modes related to self-enhancement bias when the judge is not carefully calibrated. Worth reading after this lesson for the failure-mode side of the LaaJ loop.

Adjacent topics

The “judge drift” question. As models change (new versions, new alignment runs), the LaaJ judges built on top of them also change. Production teams need recalibration loops. Search terms: “evaluation drift in LLM pipelines,” “LaaJ recalibration.”
Human calibration practices. Spot-checking LaaJ scores against human ratings is the discipline this lesson named. Questions worth researching: how to design calibration samples, how to handle disagreement between LaaJ and human ratings, and what cadence to use. Most teams write internal documentation for this; durable academic references are still consolidating.
Bias propagation through synthetic preference data. The full pipeline (LaaJ → preference data → reward model → aligned model) means LaaJ biases can propagate into the final aligned model. Active research area; the field is trying to understand whether biased judges produce biased models.
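
To make the human-calibration spot check concrete, here is a minimal sketch of two common agreement measures computed over a calibration sample: raw agreement rate and Cohen's kappa (agreement corrected for chance). The function names and the label format are assumptions for illustration; the lesson does not prescribe a specific metric, and the same computation could be rerun on a fixed sample after each judge model update to watch for drift.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def cohens_kappa(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    observed = agreement_rate(judge_labels, human_labels)
    n = len(judge_labels)

    # Expected agreement if both raters labeled independently
    # with their own observed label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

With judge labels ["A", "B", "A", "B"] and human labels ["A", "B", "B", "B"], the agreement rate is 0.75 and kappa is 0.5, which shows how chance correction lowers the headline number.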

Stanford CME 295 cheatsheet

Stanford CME 295 cheatsheet by the Amidi twins. MIT-licensed. The LLM-as-a-Judge section covers the same material in their dense visual style, and includes a biases table that’s a useful study reference.

Community discussion

None selected for this lesson. Vendor docs (OpenAI, Anthropic) and academic sources are the better entry points right now. Durable community references will be added at a future quarterly review if any consolidate.