Summary: How we evaluate models, LLM-as-a-Judge

Evaluating an LLM is itself an LLM-shaped problem. Coding and math have verifiable rewards. Summaries, advice, and most other LLM outputs do not. Human raters are slow and expensive. The pragmatic solution: use one LLM to evaluate another LLM’s output. This is LLM-as-a-Judge (LaaJ), and it is what most modern LLM evaluation pipelines actually do.

The setup is simple. Three inputs: the prompt that produced the response, the response, and a description of what counts as good (criteria). Two outputs: a written rationale, and a score (typically PASS or FAIL on a binary scale). The rationale comes before the score deliberately: the chain-of-thought effect applies here. Writing out the reasoning improves accuracy.
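The setup above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the template wording, the `Verdict` type, and the helper names `build_judge_prompt` and `parse_verdict` are all assumptions made for this example, and the call to the judge model itself is left out.

```python
import json
from dataclasses import dataclass

# Hypothetical template: the wording is illustrative, not a recommended prompt.
JUDGE_TEMPLATE = """You are grading a response against the criteria below.

Criteria:
{criteria}

Prompt:
{prompt}

Response:
{response}

Reply with a JSON object containing two keys, in this order:
"rationale" (your reasoning, written first), then "score" ("PASS" or "FAIL").
"""


@dataclass
class Verdict:
    rationale: str
    score: str  # "PASS" or "FAIL"


def build_judge_prompt(prompt: str, response: str, criteria: str) -> str:
    """Combine the three inputs into a single judge prompt."""
    return JUDGE_TEMPLATE.format(criteria=criteria, prompt=prompt, response=response)


def parse_verdict(raw: str) -> Verdict:
    """Parse the judge's reply and check that the rationale precedes the score."""
    data = json.loads(raw)
    # json.loads keeps keys in the order they appear in the text.
    if not isinstance(data, dict) or list(data) != ["rationale", "score"]:
        raise ValueError("expected exactly the keys 'rationale' then 'score'")
    if data["score"] not in ("PASS", "FAIL"):
        raise ValueError("score must be 'PASS' or 'FAIL'")
    return Verdict(rationale=str(data["rationale"]), score=data["score"])
```

Checking key order is a cheap way to confirm that the judge wrote its reasoning before committing to a score. A reply that fails either check is better rejected and retried than silently coerced.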

Two shapes. Pointwise (one response, absolute judgment) and pairwise (two responses, relative preference). Pairwise is also how synthetic preference data is generated to train reward models cheaply.
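As a rough sketch of how pairwise judgments turn into synthetic preference data, the snippet below converts each A-versus-B verdict into a (prompt, chosen, rejected) record of the kind a reward model can be trained on. The names `PreferencePair`, `label_preferences`, and `toy_judge` are hypothetical, and `toy_judge` is a deterministic stand-in for a real judge model, included only so the example runs.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple

# A pairwise judge takes (prompt, response_a, response_b) and returns "A" or "B".
PairwiseJudge = Callable[[str, str, str], str]


class PreferencePair(NamedTuple):
    prompt: str
    chosen: str
    rejected: str


def label_preferences(
    samples: Iterable[Tuple[str, str, str]], judge: PairwiseJudge
) -> List[PreferencePair]:
    """Turn (prompt, response_a, response_b) triples into preference records."""
    pairs = []
    for prompt, response_a, response_b in samples:
        winner = judge(prompt, response_a, response_b)
        if winner == "A":
            pairs.append(PreferencePair(prompt, response_a, response_b))
        elif winner == "B":
            pairs.append(PreferencePair(prompt, response_b, response_a))
        else:
            raise ValueError(f"judge returned {winner!r}, expected 'A' or 'B'")
    return pairs


def toy_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Stand-in for an LLM call: prefers A only if it contains 'because'."""
    return "A" if "because" in response_a.lower() else "B"


samples = [
    ("Why is the sky blue?", "It just is.", "Because air scatters blue light more."),
]
for pair in label_preferences(samples, toy_judge):
    print(pair.chosen, "|", pair.rejected)
```

In a real pipeline the judge callable would wrap a model call, and each verdict would usually be checked for position bias before the pair is kept.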

Three named biases to defend against. Position bias: judge prefers whichever response is listed first. Mitigated by swapping order. Verbosity bias: judge prefers longer responses. Mitigated by explicit instruction, in-context examples, or length penalty. Self-enhancement bias: judge prefers responses generated by itself. Mitigated by using a different (often bigger) model as the judge.
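The order-swap mitigation for position bias can be sketched as follows: judge the pair twice with the positions exchanged, and accept the verdict only when both runs pick the same underlying response. The function name `judge_with_swap` and the two toy judges are assumptions for illustration; a real judge would be a model call.

```python
from typing import Callable, Optional

# A pairwise judge takes (prompt, response_a, response_b) and returns "A" or "B".
PairwiseJudge = Callable[[str, str, str], str]


def judge_with_swap(
    prompt: str, response_1: str, response_2: str, judge: PairwiseJudge
) -> Optional[int]:
    """Return 1 or 2 for a consistent winner, or None if the verdict flips with order."""
    first = judge(prompt, response_1, response_2)   # response_1 shown in slot A
    second = judge(prompt, response_2, response_1)  # response_1 shown in slot B
    for label in (first, second):
        if label not in ("A", "B"):
            raise ValueError(f"judge returned {label!r}, expected 'A' or 'B'")
    winner_first = 1 if first == "A" else 2
    winner_second = 2 if second == "A" else 1
    return winner_first if winner_first == winner_second else None


def always_first(prompt: str, response_a: str, response_b: str) -> str:
    """Toy judge that is fully position-biased: slot A always wins."""
    return "A"


def prefers_citation(prompt: str, response_a: str, response_b: str) -> str:
    """Toy content-based judge: prefers A only if it carries a citation marker."""
    return "A" if "[1]" in response_a else "B"


print(judge_with_swap("q", "plain answer", "answer with source [1]", always_first))
print(judge_with_swap("q", "plain answer", "answer with source [1]", prefers_citation))
```

Treating an inconsistent pair as "no verdict" (here `None`) costs a second judge call per comparison, but it keeps position-driven preferences out of the results.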

This summary is the scan-it-in-five-minutes version. The full lesson covers the structured-output discipline, the practitioner’s best practices, and how synthetic preference data generation via LaaJ feeds back into reward-model training.

Core ideas

LLM-as-a-Judge uses an LLM to evaluate another LLM’s output. Inputs: prompt + response + criteria. Outputs: rationale + score. Rationale before score, always.
Two shapes. Pointwise: one response, absolute (PASS/FAIL). Pairwise: two responses, relative (A vs B). Pairwise is also used to generate synthetic preference data.
Why it works. The judge LLM was pretrained on text and aligned with human preferences (Phase 4). Asking it to evaluate is asking it to apply the same pattern matching it would apply to its own outputs.
Why rationale-before-score. The chain-of-thought effect applies: writing out the reasoning lets the model spend more tokens, and therefore more computation, on the problem before it commits to a score.
Structured output is required for production. Structured-output features in the major APIs (OpenAI, Anthropic, Google) constrain the judge’s reply to a JSON schema so that it parses reliably. Without them, parsing failures break pipelines.
Three biases. Position (first-listed wins more), verbosity (longer wins more), self-enhancement (same-model wins more). All real, all measured, all mitigable.
Standard mitigations. Position: swap order and verify consistency. Verbosity: explicit instruction, in-context examples, length penalty. Self-enhancement: use a different (bigger) judge model.
Best practices. Crisp guidelines. Binary scale (PASS/FAIL beats 1-5 ratings). Rationale before score. Calibrate against humans quarterly. Use structured output.
Synthetic preference data. Pairwise LaaJ generates preference labels at scale. These labels feed back into reward-model training (Phase 4). The approach is now widely used to bootstrap alignment without humans in the loop, but it introduces its own bias-propagation risks.
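Pitfall: trusting LaaJ scores without calibration. Judges drift: their agreement with human raters can shift as the judge model, the prompts, or the responses being graded change. Quarterly human spot-checks are not optional; they are the calibration step that keeps the proxy honest.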
Pitfall: using the same model for generation and judging. Self-enhancement bias inflates the scores.
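The human calibration step in the list above comes down to comparing judge verdicts with human verdicts on the same sample. The sketch below computes raw agreement and Cohen's kappa, a standard chance-corrected agreement statistic. Kappa is this example's choice, not a metric the lesson prescribes, and the function names and the idea of tracking these numbers per spot-check are assumptions.

```python
from collections import Counter
from typing import Sequence


def _check(judge: Sequence[str], human: Sequence[str]) -> int:
    """Validate that both label lists are non-empty and aligned; return their length."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return len(judge)


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    n = _check(judge, human)
    return sum(j == h for j, h in zip(judge, human)) / n


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = _check(judge, human)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum((judge_counts[l] / n) * (human_counts[l] / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical spot-check sample: eight items labelled by the judge and by a human.
judge_labels = ["PASS", "PASS", "FAIL", "PASS", "FAIL", "FAIL", "PASS", "PASS"]
human_labels = ["PASS", "FAIL", "FAIL", "PASS", "FAIL", "PASS", "PASS", "PASS"]
print(agreement_rate(judge_labels, human_labels))
print(cohens_kappa(judge_labels, human_labels))
```

Raw agreement can look high when almost everything passes, which is why a chance-corrected number is worth tracking alongside it. A drop in either figure between spot-checks is a signal to revisit the judge prompt or the judge model.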

What changes for you

After this lesson, evaluation claims in model cards stop being opaque. When a paper says “we evaluated using GPT-4 as a judge,” you know what is being claimed and what to ask: what were the criteria? Were the biases mitigated? How? Was there human calibration? You can also recognize where LaaJ shows up downstream: in the reward-model pipeline that trained the model itself, in the evaluation suite that measured it, and in the synthetic preference data that bootstraps newer alignment efforts.

Evaluating an LLM is itself an LLM-shaped problem.
LaaJ scales human-style evaluation by replacing the human with another LLM.
The three biases (position, verbosity, self-enhancement) are real, measured, and mitigable, but never fully eliminated.