
Cheatsheet: How we evaluate models, LLM-as-a-Judge

Evaluating an LLM is itself an LLM-shaped problem.
LLM-as-a-Judge (LaaJ) uses one LLM to score another's output.
Three biases (position, verbosity, self-enhancement)
must be mitigated, but can never be fully eliminated.

Inputs | Outputs
--- | ---
Prompt that produced the response | Rationale (written reasoning)
The response being judged | Score (typically PASS or FAIL)
Criteria (what counts as good) | (Rationale before score, always)

Why rationale-before-score: chain-of-thought effect. Writing out its reasoning gives the model more compute to work through the problem before committing to a score.

Shape | Inputs | Output | Use
--- | --- | --- | ---
Pointwise | One response | PASS or FAIL | Absolute judgments (“is this correct?”)
Pairwise | Two responses (A and B) | A, B, or tie | Relative judgments + synthetic preference data

Pairwise also generates training data for reward models (Phase 4). The lecturer flags this as a now-common loop: humans rate a calibration sample, LaaJ rates a large training sample, and the resulting preference labels feed alignment training for future models.

Bias | What goes wrong | Mitigation
--- | --- | ---
Position | Judge prefers first-listed response | Swap order, ask twice, take consistent answer
Verbosity | Judge prefers longer response regardless of quality | Explicit instruction + in-context examples + length penalty
Self-enhancement | Judge prefers responses generated by itself | Use a different (often bigger) judge model

All three are real, measured, and mitigable. None are fully eliminated.

Round 1: prompt + [A=responseA, B=responseB] → "A" or "B"
Round 2: prompt + [A=responseB, B=responseA] → "A" or "B"
Same answer in both rounds → consistent judgment
Different answers → unreliable for this pair
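
The two-round check above can be sketched as a small helper. This is a minimal illustration, not a reference implementation: the `judge` callable, its `(prompt, response_a, response_b)` signature, and the two toy judges are assumptions introduced here. In practice `judge` would wrap a call to the judge model.

```python
from typing import Callable, Optional

# Hypothetical judge signature: (prompt, response_a, response_b) -> "A" or "B".
PairwiseJudge = Callable[[str, str, str], str]

def consistent_preference(judge: PairwiseJudge, prompt: str,
                          first: str, second: str) -> Optional[str]:
    """Ask twice with the order swapped.

    Returns "first" or "second" when both rounds pick the same underlying
    response, or None when the verdict flips with position (or is not A/B).
    """
    round1 = judge(prompt, first, second)   # `first` is shown as A
    round2 = judge(prompt, second, first)   # `first` is shown as B
    # Map positional labels back to the underlying responses.
    winner1 = {"A": "first", "B": "second"}.get(round1)
    winner2 = {"A": "second", "B": "first"}.get(round2)
    if winner1 is not None and winner1 == winner2:
        return winner1
    return None

# Toy judges for demonstration only (no model call).
def always_a(prompt: str, a: str, b: str) -> str:
    return "A"  # pure position bias: always prefers the first slot

def longer_wins(prompt: str, a: str, b: str) -> str:
    return "A" if len(a) >= len(b) else "B"  # order-independent for unequal lengths

print(consistent_preference(always_a, "q", "x", "y"))                   # None
print(consistent_preference(longer_wins, "q", "long answer", "short"))  # first
```

Pairs that come back as `None` are the "unreliable for this pair" case; they can be dropped or routed to a human rater.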

Stack three mitigations against verbosity bias:
1. Add to criteria: "Length is not a quality criterion."
2. Show 1-2 in-context examples where shorter wins.
3. Post-process the score with a length penalty.
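
Step 3 could look something like the sketch below. The source does not specify the penalty, so the linear form, the `target_tokens` budget, and the `alpha` weight are all hypothetical choices, and the sketch assumes a numeric score in [0, 1] (for example a pass rate or pairwise win rate) rather than a single PASS/FAIL label.

```python
def length_penalized(score: float, n_tokens: int,
                     target_tokens: int = 200, alpha: float = 0.1) -> float:
    """Subtract a penalty proportional to how far a response overshoots
    the length budget. Responses at or under the budget are not penalized.

    target_tokens and alpha are illustrative values, not recommendations.
    """
    excess_fraction = max(0, n_tokens - target_tokens) / target_tokens
    return max(0.0, score - alpha * excess_fraction)

print(length_penalized(0.8, 150))  # 0.8 (under budget, unchanged)
```

The constants would need tuning against human ratings; too strong a penalty simply swaps verbosity bias for a brevity bias.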

To limit self-enhancement bias, pick a judge model from a different
family than the model being judged. Bigger if possible. Reasoning-capable
if possible.
Limit: modern frontier models share training data, so "different"
is fuzzy. Still meaningfully better than same-model.

Practice | Why
--- | ---
Crisp guidelines | Vague criteria → noisy judgments. Specific criteria → stable judgments.
Binary scale | PASS/FAIL beats 1-5 on reliability for both judges and humans. Granularity adds noise.
Rationale before score | Chain-of-thought effect; improves accuracy. Always do this.
Calibrate vs humans quarterly | Judges drift. Spot-check a small sample with human raters; if correlation drops, recalibrate or change judge.
Structured output | Production parsing reliability. Define a JSON schema; the API enforces it.
Different model for judging | Reduces self-enhancement bias. Bigger if you can afford it.

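
The quarterly calibration check can be as simple as comparing judge labels with human labels on the same sample. The sketch below reports raw agreement and Cohen's kappa; choosing kappa as the statistic is an assumption made here (the notes only say "correlation"), and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Sequence, Tuple

def agreement_and_kappa(human: Sequence[str],
                        judge: Sequence[str]) -> Tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two label lists
    covering the same items, e.g. "PASS"/"FAIL" from humans and the judge."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(h_counts[label] * j_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return observed, math.nan  # kappa is undefined when only one label occurs
    return observed, (observed - expected) / (1.0 - expected)

human_labels = ["PASS", "PASS", "FAIL", "FAIL"]
judge_labels = ["PASS", "FAIL", "FAIL", "FAIL"]
print(agreement_and_kappa(human_labels, judge_labels))  # (0.75, 0.5)
```

Tracking the same statistic on each spot-check makes drift visible; what threshold triggers recalibration is a team decision the notes do not fix.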
You are evaluating responses against the following criteria:
[criteria description, including PASS/FAIL conditions]
Prompt that was given to the model:
[prompt]
Response to evaluate:
[response]
First write a rationale explaining your evaluation
against the criteria. Then output a score: PASS or FAIL.
Output as JSON:
{"rationale": "...", "score": "PASS" | "FAIL"}
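
A minimal sketch of how the template above might be filled in and how the judge's reply might be validated on the way back. The function names are hypothetical, and the model call itself is left out; where the API enforces a JSON schema, the validation step is a second line of defence rather than the primary one.

```python
import json
from typing import Dict

# The template from above; the doubled braces escape the literal JSON braces
# so that str.format only substitutes the three named fields.
JUDGE_TEMPLATE = (
    "You are evaluating responses against the following criteria:\n"
    "{criteria}\n"
    "Prompt that was given to the model:\n"
    "{prompt}\n"
    "Response to evaluate:\n"
    "{response}\n"
    "First write a rationale explaining your evaluation\n"
    "against the criteria. Then output a score: PASS or FAIL.\n"
    "Output as JSON:\n"
    '{{"rationale": "...", "score": "PASS" | "FAIL"}}'
)

def build_judge_prompt(criteria: str, prompt: str, response: str) -> str:
    return JUDGE_TEMPLATE.format(criteria=criteria, prompt=prompt, response=response)

def parse_verdict(raw: str) -> Dict[str, str]:
    """Parse the judge's JSON reply and reject anything off-schema."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("judge output is not a JSON object")
    rationale, score = data.get("rationale"), data.get("score")
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("missing or empty rationale")
    if score not in ("PASS", "FAIL"):
        raise ValueError(f"unexpected score: {score!r}")
    return {"rationale": rationale, "score": score}

verdict = parse_verdict('{"rationale": "Meets all criteria.", "score": "PASS"}')
print(verdict["score"])  # PASS
```

Keeping `rationale` ahead of `score` in the requested JSON preserves the rationale-before-score ordering in the generated output.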

Synthetic preference data → reward-model training (Phase 4)
→ alignment of next-gen models
The "judge quality" question now sits upstream of "model quality"
in many production pipelines. Bias in the judge propagates.
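
As a rough illustration of the first arrow in that pipeline, pairwise verdicts can be turned into preference records. The `prompt`/`chosen`/`rejected` field names and the input tuple layout are assumptions made for this sketch, not a format the notes prescribe.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical input row: (prompt, response_a, response_b, verdict),
# where verdict is "A", "B", "tie", or None for an inconsistent pair.
Judged = Tuple[str, str, str, Optional[str]]

def to_preference_records(judged: Iterable[Judged]) -> List[Dict[str, str]]:
    """Keep only clear A/B verdicts and emit chosen/rejected pairs."""
    records = []
    for prompt, resp_a, resp_b, verdict in judged:
        if verdict == "A":
            chosen, rejected = resp_a, resp_b
        elif verdict == "B":
            chosen, rejected = resp_b, resp_a
        else:
            continue  # ties and unreliable pairs carry no preference signal
        records.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return records

rows = [
    ("Explain recursion.", "short answer", "long answer", "A"),
    ("Define entropy.", "answer one", "answer two", "tie"),
]
print(len(to_preference_records(rows)))  # 1
```

Any bias left in the verdicts is copied straight into these records, which is why judge quality sits upstream of model quality.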

Pitfall | Reality
--- | ---
“Trust LaaJ scores blindly.” | Judges drift. Quarterly human spot-checks are not optional.
“Same model can judge itself.” | Self-enhancement bias inflates scores. Use a different model.
“Granular scales are more informative.” | They are noisier. Binary forces commitment and produces sharper signal.
“LaaJ replaces human review.” | For most evaluation work, yes. For high-stakes decisions, no. The right framing: LaaJ is a fast, cheap, scalable proxy.
“Rationale is just commentary.” | The rationale is the load-bearing part for debugging and the chain-of-thought-style accuracy boost. Treat it as primary, not decorative.

  • LLM-as-a-Judge (LaaJ): the practice of using one LLM to evaluate the output of another LLM.
  • Pointwise LaaJ: judge a single response (absolute PASS/FAIL).
  • Pairwise LaaJ: judge between two candidate responses (relative preference).
  • Criteria: the description in the LaaJ prompt of what counts as good.
  • Rationale: the written reasoning the judge produces before the score.
  • Position bias: judge prefers first-listed response.
  • Verbosity bias: judge prefers longer response.
  • Self-enhancement bias: judge prefers responses generated by itself.
  • Structured output: API feature that constrains the LLM’s output to a JSON schema; required for production parsing reliability.
  • Synthetic preference data: preference labels generated by pairwise LaaJ instead of human raters; feeds back into reward-model training.

Evaluating an LLM is itself an LLM-shaped problem.
LaaJ scales human-style evaluation by replacing the human with another LLM.
The three biases (position, verbosity, self-enhancement) are real, measured, and mitigable, but never fully eliminated.