LLM-as-a-Judge: cheatsheet

The one idea that matters

Evaluating an LLM is itself an LLM-shaped problem.
LaaJ uses one LLM to score another's output.
Three biases (position, verbosity, self-enhancement)
can be mitigated but are never fully eliminated.

Inputs and outputs

Inputs	Outputs
Prompt that produced the response	Rationale (written reasoning)
The response being judged	Score (typically PASS or FAIL)
Criteria (what counts as good)	(Rationale before score, always)

Why rationale-before-score: chain-of-thought effect. Writing out the reasoning first lets the model spend more tokens, and so more computation, working through the problem before it commits to a score.

Two shapes

Shape	Inputs	Output	Use
Pointwise	One response	PASS or FAIL	Absolute judgments (“is this correct?”)
Pairwise	Two responses (A and B)	A, B, or tie	Relative judgments + synthetic preference data

Pairwise also generates training data for reward models (Phase 4). The lecturer flags this as a now-common loop: humans rate a small calibration sample, LaaJ rates a large training sample, and the resulting labels train the reward model used to align future models.

The three biases

Bias	What goes wrong	Mitigation
Position	Judge prefers first-listed response	Swap order, ask twice, take consistent answer
Verbosity	Judge prefers longer response regardless of quality	Explicit instruction + in-context examples + length penalty
Self-enhancement	Judge prefers responses generated by itself	Use a different (often bigger) judge model

All three are real, measured, and mitigable. None are fully eliminated.

Mitigation details

Position bias

Round 1: prompt + [A=responseA, B=responseB] → "A" or "B"
Round 2: prompt + [A=responseB, B=responseA] → "A" or "B"

Same answer in both rounds → consistent judgment
Different answers              → unreliable for this pair
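
The two-round check above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Judge` callable signature and the name `consistent_preference` are assumptions made for this example, and the judge is expected to return exactly "A" or "B".

```python
from typing import Callable, Optional

# Hypothetical judge signature: (prompt, response_in_slot_A, response_in_slot_B) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def consistent_preference(judge: Judge, prompt: str, first: str, second: str) -> Optional[str]:
    """Return "first" or "second" if both orderings agree, else None (unreliable pair)."""
    round1 = judge(prompt, first, second)  # `first` is shown in slot A
    round2 = judge(prompt, second, first)  # `first` is shown in slot B
    for verdict in (round1, round2):
        if verdict not in ("A", "B"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
    # Map slot labels back to the underlying responses.
    winner1 = "first" if round1 == "A" else "second"
    winner2 = "second" if round2 == "A" else "first"
    return winner1 if winner1 == winner2 else None
```

A judge that always answers "A" picks a different underlying response in each round, so the function returns None and the pair is flagged as unreliable.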

Verbosity bias

Stack three mitigations:
1. Add to criteria: "Length is not a quality criterion."
2. Show 1-2 in-context examples where shorter wins.
3. Post-process the score with a length penalty.
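
Step 3 can be sketched as follows. The cheatsheet does not specify a penalty formula, so everything here is an assumption for illustration: a numeric judge score, a word budget (`budget_words`), and a linear penalty weight (`alpha`).

```python
def length_penalized(score: float, response: str,
                     budget_words: int = 200, alpha: float = 0.1) -> float:
    """Subtract a linear penalty for words beyond a budget (hypothetical formula)."""
    if budget_words <= 0:
        raise ValueError("budget_words must be positive")
    words = len(response.split())
    # Fraction by which the response exceeds the budget; 0 if within budget.
    overflow = max(0, words - budget_words) / budget_words
    return score - alpha * overflow
```

With these assumed defaults, a response at or under the budget keeps its score unchanged, and a response twice the budget loses `alpha` points.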

Self-enhancement bias

Pick a judge model from a different family than the model
being judged. Bigger if possible. Reasoning-capable if possible.

Limit: modern frontier models often share much of their training data, so
"different" is fuzzy. Still meaningfully better than a model judging itself.

Best practices

Practice	Why
Crisp guidelines	Vague criteria → noisy judgments. Specific criteria → stable judgments.
Binary scale	PASS/FAIL beats 1-5 on reliability for both judges and humans. Granularity adds noise.
Rationale before score	Chain-of-thought effect; improves accuracy. Always do this.
Calibrate vs humans quarterly	Judges drift. Spot-check a small sample with human raters; if agreement with the human labels drops, recalibrate or change judge.
Structured output	Makes parsing reliable in production. Define a JSON schema; where the API supports structured output, it enforces the schema.
Different model for judging	Reduces self-enhancement bias. Bigger if you can afford it.
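
For the quarterly calibration row above, one way to quantify judge-versus-human agreement on PASS/FAIL labels is raw agreement plus Cohen's kappa, which corrects for chance. The cheatsheet itself only says to watch for a drop; choosing kappa as the metric is an assumption of this sketch, as is the function name.

```python
import math
from typing import Sequence, Tuple


def agreement_and_kappa(human: Sequence[str], judge: Sequence[str]) -> Tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two aligned label lists."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    labels = set(human) | set(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum((list(human).count(l) / n) * (list(judge).count(l) / n) for l in labels)
    if math.isclose(expected, 1.0):
        return observed, math.nan  # kappa is undefined when only one label is ever used
    return observed, (observed - expected) / (1.0 - expected)
```

Tracking both numbers over time is useful because raw agreement can look high simply because most responses pass; kappa discounts that.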

A LaaJ prompt template

You are evaluating responses against the following criteria:
[criteria description, including PASS/FAIL conditions]

Prompt that was given to the model:
[prompt]

Response to evaluate:
[response]

First write a rationale explaining your evaluation
against the criteria. Then output a score: PASS or FAIL.

Output as JSON:
{"rationale": "...", "score": "PASS" | "FAIL"}
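
A minimal sketch of filling the template above and validating the judge's reply. The helper names `build_judge_prompt` and `parse_verdict` are hypothetical, and the call to the judge model is left out because it depends on the provider.

```python
import json

# Template text copied from the cheatsheet; doubled braces escape str.format.
TEMPLATE = """You are evaluating responses against the following criteria:
{criteria}

Prompt that was given to the model:
{prompt}

Response to evaluate:
{response}

First write a rationale explaining your evaluation
against the criteria. Then output a score: PASS or FAIL.

Output as JSON:
{{"rationale": "...", "score": "PASS" | "FAIL"}}
"""


def build_judge_prompt(criteria: str, prompt: str, response: str) -> str:
    return TEMPLATE.format(criteria=criteria, prompt=prompt, response=response)


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON reply; raise ValueError if it is malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    rationale, score = data.get("rationale"), data.get("score")
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("missing rationale")
    if score not in ("PASS", "FAIL"):
        raise ValueError(f"score must be PASS or FAIL, got {score!r}")
    return {"rationale": rationale, "score": score}
```

Validating on the client side is a fallback for cases where the API does not enforce the schema itself.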

Where LaaJ shows up downstream

Synthetic preference data → reward-model training (Phase 4)
                              ↓
                          Alignment of next-gen models

The "judge quality" question now sits upstream of "model quality"
in many production pipelines. Bias in the judge propagates.
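
To make the first arrow concrete, here is a sketch of turning a pairwise verdict ("A", "B", or "tie") into a preference record for reward-model training. The `PreferencePair` shape and the choice to drop ties are assumptions for illustration; real pipelines may store ties or use a different record format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the judge preferred
    rejected: str  # response the judge did not prefer


def to_preference_pair(prompt: str, response_a: str, response_b: str,
                       verdict: str) -> Optional[PreferencePair]:
    """Convert a pairwise verdict into a training record; ties yield None."""
    if verdict == "A":
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    if verdict == "B":
        return PreferencePair(prompt, chosen=response_b, rejected=response_a)
    if verdict == "tie":
        return None  # assumption: ties carry no preference signal and are skipped
    raise ValueError(f"unexpected verdict: {verdict!r}")
```

Any bias in the verdicts is baked directly into the `chosen`/`rejected` labels, which is how judge bias propagates downstream.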

Pitfalls to dodge

Pitfall	Reality
“Trust LaaJ scores blindly.”	Judges drift. Quarterly human spot-checks are not optional.
“Same model can judge itself.”	Self-enhancement bias inflates scores. Use a different model.
“Granular scales are more informative.”	They are noisier. Binary forces commitment and produces sharper signal.
“LaaJ replaces human review.”	For most evaluation work, yes. For high-stakes decisions, no. The right framing: LaaJ is a fast, cheap, scalable proxy.
“Rationale is just commentary.”	The rationale does two jobs: it makes judgments debuggable, and writing it before the score gives the chain-of-thought accuracy boost. Treat it as primary, not decorative.

Glossary

LLM-as-a-Judge (LaaJ): the practice of using one LLM to evaluate the output of another LLM.
Pointwise LaaJ: judge a single response (absolute PASS/FAIL).
Pairwise LaaJ: judge between two candidate responses (relative preference).
Criteria: the description in the LaaJ prompt of what counts as good.
Rationale: the written reasoning the judge produces before the score.
Position bias: judge prefers first-listed response.
Verbosity bias: judge prefers longer response.
Self-enhancement bias: judge prefers responses generated by itself.
Structured output: API feature that constrains the LLM’s output to a JSON schema; required for production parsing reliability.
Synthetic preference data: preference labels generated by pairwise LaaJ instead of human raters; feeds back into reward-model training.

Evaluating an LLM is itself an LLM-shaped problem.
LaaJ scales human-style evaluation by replacing the human with another LLM.
The three biases (position, verbosity, self-enhancement) are real, measured, and mitigable, but never fully eliminated.