Practice: How we evaluate models, LLM-as-a-Judge

Self-check

1. Why does the field use LLM-as-a-Judge instead of just running benchmarks or asking humans?

Three reasons. First, most LLM outputs (summaries, explanations, advice, conversations) do not have a verifiable right answer the way math or coding outputs do. Verifiable benchmarks (HumanEval, GSM8K) cover only a slice of what real LLM use looks like.

Second, human raters are slow, expensive, and noisy. You cannot ask humans to rate every output of every model in every iteration of training. Human rating doesn’t scale to the iteration loops that modern LLM development requires.

Third, modern LLMs were pretrained on text and aligned with human preferences (Phase 4). Asking a strong model to evaluate another model’s response is, in effect, asking it to apply the same pattern matching it would apply to its own outputs. The result correlates well enough with human judgment to be useful for most evaluation purposes, and it is fast and cheap to run.

LaaJ is not a replacement for human review on high-stakes decisions, but it is the right tool for the bulk of evaluation work the field needs to do.

2. Distinguish pointwise from pairwise LaaJ. When would you use each?

Pointwise. One prompt, one response, one set of criteria. The judge returns an absolute judgment (PASS or FAIL, or a score). Use when you need standalone judgments: “Is this response correct?” “Does this summary cover the key points?”

Pairwise. One prompt, two candidate responses (A and B), one set of criteria. The judge returns a relative judgment (“A is better,” “B is better,” or “tied”). Use when you need to choose between alternatives: comparing two model versions, ranking outputs from a single model, or generating synthetic preference data.

The synthetic-preference-data use case is particularly important. Pairwise LaaJ over many response pairs produces a dataset of preference labels suitable for training reward models (Phase 4 territory). This is now a common loop: humans rate a small calibration sample, LaaJ rates a large training sample, and the result trains the reward models that align future model versions.
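As a rough illustration of the loop described above, pairwise judgments can be converted into preference records. This is a minimal sketch: the `PairwiseJudgment` class, the function names, and the prompt/chosen/rejected record layout are illustrative assumptions rather than anything the lesson prescribes, and dropping ties is only one possible policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PairwiseJudgment:
    """One pairwise LaaJ result (hypothetical container for this sketch)."""
    prompt: str
    response_a: str
    response_b: str
    preference: str  # "A", "B", or "TIE"


def to_preference_record(j: PairwiseJudgment) -> Optional[Dict[str, str]]:
    """Map a judgment to a chosen/rejected record; ties yield None."""
    if j.preference == "A":
        return {"prompt": j.prompt, "chosen": j.response_a, "rejected": j.response_b}
    if j.preference == "B":
        return {"prompt": j.prompt, "chosen": j.response_b, "rejected": j.response_a}
    if j.preference == "TIE":
        return None  # assumption: a tie carries no preference signal, so drop it
    raise ValueError(f"unexpected preference: {j.preference!r}")


def build_preference_dataset(judgments: List[PairwiseJudgment]) -> List[Dict[str, str]]:
    """Keep only the judgments that express a preference."""
    records = (to_preference_record(j) for j in judgments)
    return [r for r in records if r is not None]
```

Whether ties are dropped, kept as a separate label, or sent to human review depends on how the reward model is trained; the sketch simply discards them.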

3. Name the three biases the lesson covered, and the standard mitigation for each.

Position bias. Judge tends to prefer whichever response is listed first. Mitigation: swap the order and ask twice. If the judge picks the same response both times, the judgment is consistent. If it flips, the judgment is unreliable for this case.

Verbosity bias. Judge tends to prefer longer responses, even when length doesn’t track quality. Mitigations: explicit instruction in the criteria (“do not prefer longer responses”), in-context examples showing shorter responses preferred, or a length penalty applied post-hoc to the score. These mitigations stack; production systems typically use at least two.
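The post-hoc length penalty mentioned above could look something like the following. The lesson does not give a formula, so everything here is an assumption for illustration: the linear penalty, the `alpha` weight, and the idea of comparing against a reference length are all hypothetical choices, and the function only makes sense when the judge returns a numeric score.

```python
def length_penalized_score(
    score: float,
    response_tokens: int,
    reference_tokens: int,
    alpha: float = 0.1,
) -> float:
    """Subtract a penalty proportional to how far the response exceeds a reference length.

    Assumed (hypothetical) rule: responses at or below the reference length are
    not penalized; a response twice the reference length loses `alpha` points.
    """
    if reference_tokens <= 0:
        raise ValueError("reference_tokens must be positive")
    # Fractional excess over the reference length, floored at zero.
    excess = max(0.0, (response_tokens - reference_tokens) / reference_tokens)
    return score - alpha * excess
```

In practice the weight would need to be calibrated against human ratings; too large a penalty simply replaces verbosity bias with a bias toward terse answers.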

Self-enhancement bias. Judge tends to prefer responses it generated itself (same model). Mitigation: use a different model for judging, ideally a bigger, reasoning-capable one. The lecturer is candid about the limits here: most modern frontier models share training data and have correlated biases, so “different model” doesn’t fully eliminate the issue. But it meaningfully helps.

4. Why is the rationale generated before the score, not after?

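The chain-of-thought effect (covered in Phase 5). Each token a model generates is one full forward pass, and each new token is conditioned on the tokens before it. Generating a rationale before committing to a score gives the model more compute to actually work the problem rather than rushing to a guess, and the score can draw on that reasoning. A rationale written after the score cannot change it; it can only justify a decision already made. Empirically, rationale-before-score outperforms score-only for the same reason CoT outperforms direct answering on hard reasoning problems.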

There is also a debugging benefit. When a LaaJ score is wrong, the rationale shows you what the judge was reacting to. You can read the rationale and see that the judge was confused about a specific point, which lets you fix the criteria description or change the judge model. Without the rationale, you have a wrong score and no way to know why.

5. What is “structured output” and why is it required for production LaaJ pipelines?

Structured output is a feature of modern LLM APIs, typically implemented with constrained decoding, that lets you specify a JSON schema the model’s output must conform to. The decoding process is constrained at each step to produce only tokens consistent with the schema, so a completed output parses as valid JSON of the specified shape.

In practice, you define a schema like {rationale: string, score: enum["PASS", "FAIL"]}, pass it to the API, and the response comes back with both fields populated and machine-readable. No regex parsing, no fallback-on-malformed handling.

Without structured output, every LaaJ call is a roll of the dice on whether the response will be parseable. The model might add extra commentary, format the score unexpectedly, omit fields, or produce something like “The score is: definitely a PASS, given the response…” which doesn’t parse cleanly. Production pipelines with thousands of judgments need reliable parsing, which is why structured output is standard practice.
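To make the schema concrete, here is a sketch of the rationale-plus-score shape written as a standard JSON Schema, with a small standard-library check on the returned text. How the schema is passed to a model differs between providers and is deliberately not shown; `JUDGMENT_SCHEMA` and `parse_judgment` are hypothetical names for this illustration. The check is a cheap safeguard for cases such as a response cut off at a token limit, which some providers can still return.

```python
import json

# JSON Schema for one judgment. "rationale" is listed before "score" to mirror
# the rationale-before-score ordering; whether a given provider generates
# fields in schema order is an assumption to verify against its documentation.
JUDGMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "rationale": {"type": "string"},
        "score": {"type": "string", "enum": ["PASS", "FAIL"]},
    },
    "required": ["rationale", "score"],
    "additionalProperties": False,
}


def parse_judgment(raw: str) -> dict:
    """Parse a judge response and confirm it has the expected shape.

    Raises ValueError on malformed JSON (json.JSONDecodeError is a ValueError
    subclass) or on missing, extra, or invalid fields.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judgment must be a JSON object")
    if set(data) != {"rationale", "score"}:
        raise ValueError(f"unexpected fields: {sorted(data)}")
    if not isinstance(data["rationale"], str):
        raise ValueError("rationale must be a string")
    if data["score"] not in ("PASS", "FAIL"):
        raise ValueError(f"invalid score: {data['score']!r}")
    return data
```

A free-text reply such as "The score is: definitely a PASS" fails at `json.loads`, which is exactly the failure structured output is meant to rule out.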

Try it yourself: design a LaaJ prompt for a real evaluation task

About 15 minutes. Pen and paper.

Setup. You’re building a customer-support chatbot. You want to evaluate whether its responses are both helpful and not fabricating information. You have access to a stronger LLM (different family from your chatbot’s model) you can use as the judge.

Step 1. Write a LaaJ prompt that evaluates a single chatbot response (pointwise). Include all three required inputs (prompt, response, criteria) and ask for both rationale and score.

One possible answer:

You are evaluating customer-support chatbot responses against
two criteria:

1. HELPFUL: the response actually addresses the customer's
   question or moves toward addressing it. Hedging that doesn't
   make progress is not helpful.

2. ACCURATE: the response does not invent information. If the
   response makes a factual claim, that claim should be
   supported by either the prompt's context or general
   knowledge. Inventing details is not allowed.

To pass, the response must be both helpful AND accurate.
A response that is helpful but inaccurate fails. A response
that is accurate but unhelpful fails.

Customer's prompt:
[CUSTOMER_PROMPT]

Chatbot's response:
[CHATBOT_RESPONSE]

First, write a rationale that explicitly addresses both
criteria (helpful and accurate). Then output a score:
either PASS or FAIL.

Output as JSON: {"rationale": "...", "score": "PASS" | "FAIL"}

The criteria are crisp (each criterion has a definition with examples of what fails). The instruction asks for the rationale to address both criteria explicitly, which forces the judge to reason about both. The output is structured JSON for reliable parsing. The scale is binary (PASS/FAIL), not 1-5.
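In code, the pointwise prompt above is just a template with two slots. The sketch below uses Python's `string.Template` so that customer text is inserted verbatim in a single pass; the `$`-style placeholder names and the `build_pointwise_prompt` helper are assumptions made for this example.

```python
from string import Template

# The pointwise prompt from this exercise, with $-placeholders for the two inputs.
POINTWISE_TEMPLATE = Template("""\
You are evaluating customer-support chatbot responses against
two criteria:

1. HELPFUL: the response actually addresses the customer's
   question or moves toward addressing it. Hedging that doesn't
   make progress is not helpful.

2. ACCURATE: the response does not invent information. If the
   response makes a factual claim, that claim should be
   supported by either the prompt's context or general
   knowledge. Inventing details is not allowed.

To pass, the response must be both helpful AND accurate.
A response that is helpful but inaccurate fails. A response
that is accurate but unhelpful fails.

Customer's prompt:
$customer_prompt

Chatbot's response:
$chatbot_response

First, write a rationale that explicitly addresses both
criteria (helpful and accurate). Then output a score:
either PASS or FAIL.

Output as JSON: {"rationale": "...", "score": "PASS" | "FAIL"}
""")


def build_pointwise_prompt(customer_prompt: str, chatbot_response: str) -> str:
    """Fill both slots; substituted values are not re-scanned for placeholders."""
    return POINTWISE_TEMPLATE.substitute(
        customer_prompt=customer_prompt,
        chatbot_response=chatbot_response,
    )
```

Keeping the criteria in a single template constant means every judgment in an evaluation run is scored against identical wording, which makes results comparable across runs.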

Step 2. Now you want to compare two versions of your chatbot: version A and version B. Convert the pointwise prompt to a pairwise version. Note any extra discipline you’d add to handle position bias.

One possible answer:

You are comparing two customer-support chatbot responses to
the same customer prompt. The criteria are:

1. HELPFUL: which response better addresses the question?
2. ACCURATE: which response avoids inventing information?

Customer's prompt:
[CUSTOMER_PROMPT]

Response A:
[RESPONSE_A]

Response B:
[RESPONSE_B]

First, write a rationale comparing both responses on both
criteria. Then output your preference: A, B, or TIE.

Output as JSON: {"rationale": "...", "preference": "A" | "B" | "TIE"}

Position-bias mitigation. For each pair, run this prompt twice with the order swapped. First pass: version A’s response in the “Response A” slot and version B’s in the “Response B” slot. Second pass: the reverse, so the judge sees the same two responses in the opposite order. Map each preference back to the underlying version before comparing. If both passes pick the same version, the judgment is consistent. If they disagree, mark the pair as inconclusive (or a tie) and either skip it or escalate to human review.

Verbosity-bias mitigation. Add to the criteria: “Length is not a quality criterion. Prefer the response that is more helpful and accurate, regardless of length.”

Self-enhancement-bias mitigation. Make sure the judge model is from a different model family than either chatbot version. Don’t use chatbot version A as the judge of itself versus version B.
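The swap-and-compare procedure for position bias can be sketched as follows. The `Judge` callable is a stand-in for a real call to the judge model, and its signature, the function name, and the "INCONCLUSIVE" label are assumptions for illustration.

```python
from typing import Callable

# Hypothetical judge interface: (prompt, response shown as A, response shown as B)
# -> "A", "B", or "TIE". A real implementation would call the judge model.
Judge = Callable[[str, str, str], str]


def judge_with_swap(judge: Judge, prompt: str, resp_1: str, resp_2: str) -> str:
    """Run the pairwise judge in both orders.

    Returns "1" or "2" if the same underlying response wins both passes,
    "TIE" if both passes tie, and "INCONCLUSIVE" if the passes disagree.
    """
    first = judge(prompt, resp_1, resp_2)   # resp_1 occupies slot A
    second = judge(prompt, resp_2, resp_1)  # resp_2 occupies slot A

    # Map slot labels back to the underlying responses before comparing.
    first_pick = {"A": "1", "B": "2", "TIE": "TIE"}[first]
    second_pick = {"A": "2", "B": "1", "TIE": "TIE"}[second]

    return first_pick if first_pick == second_pick else "INCONCLUSIVE"
```

A judge that always answers "A" regardless of content returns "INCONCLUSIVE" under this scheme, which is the intended behavior: a purely positional preference never survives the swap.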

Step 3. A junior engineer suggests rating responses on a 1-10 scale instead of binary PASS/FAIL because “it gives more granular signal.” How would you respond?

One possible answer:

Granular scales feel more informative but produce noisier judgments in practice. Both LLM judges and humans converge on the middle of large scales (3-7 on a 1-10 scale, for example) for most responses. The middle is where most of the noise lives; small changes in a response can move it from a 5 to a 6 without any actual quality change.

Binary scales force the judge to commit to “this is good enough” or “this is not good enough.” The signal is sharper. Aggregating many binary judgments (the pass rate across N samples, for example) gives you the granularity you wanted without the noise of granular per-judgment scales.
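To show how aggregation recovers granularity, the sketch below computes a pass rate over N binary judgments. The Wilson score interval is an addition for illustration, not something the lesson covers; it is one standard way to express how much the pass rate could move with a different sample. Function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def pass_rate(scores: Sequence[str]) -> float:
    """Fraction of judgments equal to "PASS"."""
    if not scores:
        raise ValueError("need at least one judgment")
    return sum(s == "PASS" for s in scores) / len(scores)


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("require n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Example: 80 PASS out of 100 judgments.
scores = ["PASS"] * 80 + ["FAIL"] * 20
rate = pass_rate(scores)
low, high = wilson_interval(scores.count("PASS"), len(scores))
```

Comparing two chatbot versions then becomes a comparison of two pass rates and their intervals, rather than a comparison of noisy per-response scores.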

The practical compromise that some teams use is a 3-point scale (FAIL, BORDERLINE, PASS) instead of either 2-point or 10-point. It captures the “I’m not sure” case while still being more reliable than 1-10. But binary remains the default in most production LaaJ pipelines.

Flashcards

Eight cards.

Q. What is LLM-as-a-Judge (LaaJ) in one sentence?

LLM-as-a-Judge uses one LLM to evaluate the output of another LLM. The judge takes a prompt, the response that was produced for it, and a description of what counts as good (the criteria), and returns a written rationale plus a score (typically PASS or FAIL on a binary scale).

Q. Why is rationale generated before the score, not after?

Two reasons. First, the chain-of-thought effect: writing out reasoning gives the model more compute (more tokens) to work the problem before committing to a score. Empirically, this improves accuracy. Second, debuggability: when a score is wrong, the rationale shows you what the judge was reacting to, which lets you fix the criteria or change the judge model. Without the rationale, you have a wrong score and no way to know why.

Q. What's the difference between pointwise and pairwise LaaJ?

Pointwise: one prompt, one response, one set of criteria. The judge returns an absolute judgment (PASS or FAIL). Use when you need standalone judgments. Pairwise: one prompt, two candidate responses (A and B). The judge returns a relative judgment (A or B preferred, or tie). Use when you need to compare alternatives or generate synthetic preference data for reward-model training.

Q. What is position bias and how is it mitigated?

Position bias is the LaaJ failure mode where the judge prefers whichever response is listed first, regardless of which is actually better. Mitigation: swap the order and ask twice. If the judge picks the same response both ways, the judgment is consistent. If it flips, the judgment is unreliable for this pair, so either record a tie or escalate to human review.

Q. What is verbosity bias and how is it mitigated?

Verbosity bias is the LaaJ failure mode where the judge prefers longer responses, even when length doesn’t track quality. Three mitigations stack. (1) Explicit instruction in the criteria: “do not prefer longer responses.” (2) In-context examples where shorter responses were preferred. (3) Length penalty applied post-hoc to the score. Production systems typically use at least two of the three.

Q. What is self-enhancement bias and how is it mitigated?

Self-enhancement bias is the LaaJ failure mode where a judge prefers responses generated by itself (same model). Mitigation: use a different model for judging, ideally bigger and reasoning-capable. The lecturer notes that “different” is fuzzy because modern frontier models share training data and have correlated biases, but using a different family is meaningfully better than using the same model.

Q. What is structured output and why is it required for production LaaJ?

Structured output is an API feature, typically implemented with constrained decoding, that lets you specify a JSON schema the model’s output must conform to. The decoding process is constrained to produce only schema-valid tokens. Production LaaJ pipelines need reliable parsing of thousands of judgments per evaluation run. Without structured output, every call is a parser-failure risk: the model might add commentary, format the score unexpectedly, or omit fields. Structured output guarantees parseable output and is non-negotiable in production pipelines.

Q. Why does the field prefer binary (PASS/FAIL) scales over granular (1-10) scales for LaaJ?

Granular scales feel more informative but produce noisier judgments. Both judges and humans cluster their ratings in the middle of large scales (3-7 on a 1-10 scale), and the middle is where the noise lives. Binary scales force commitment, which sharpens the signal. Aggregating many binary judgments (pass rate across N samples) recovers the granularity without the per-judgment noise. A 3-point scale (FAIL, BORDERLINE, PASS) is a compromise some teams use, but binary remains the default.