LLM-as-judge explained

Human reviewers cannot read every model answer your product generates. LLM-as-judge uses a separate language model — often stronger than the one being tested — to score outputs against a written rubric. Teams deploy judges for offline regression suites, A/B comparisons, RAG faithfulness checks, and sampled production monitoring. The technique is powerful and cheap at scale, but judges inherit model biases: position preference, leniency drift, and blind spots on factual errors. This guide covers when judges beat human labels, pointwise versus pairwise protocols, rubric design, bias mitigation, faithfulness scoring for retrieval pipelines, calibration against gold labels, a Harbor Support eval pipeline worked example, method decision tables, common pitfalls, and a production checklist. For the broader eval stack, start with our LLM evaluation and benchmarking primer; for live quality dashboards, see LLM observability and RAG evaluation.

What LLM-as-judge means

An LLM judge is a prompt (or fine-tuned model) that receives a task specification, optional reference material, and a candidate answer, then returns a score, label, or preference. Unlike exact-match metrics, judges handle open-ended text: tone, completeness, reasoning quality, and policy compliance.

Three judge roles in production

Regression gate — score every case in a golden dataset before merge; block deploys when aggregate metrics drop.
Comparative ranking — pick the better of two prompt variants, models, or retrieval configs on the same inputs.
Sampled monitor — grade 1–5% of live traffic for drift detection; escalate outliers to humans.

Judges are not a replacement for domain experts on high-stakes decisions (medical, legal, financial advice). They are a scalable screen that catches obvious regressions and ranks candidates before expensive human review.
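For the sampled-monitor role, one way to pick a stable slice of live traffic is to hash each trace ID into a bucket. The sketch below is illustrative only: the function name `should_sample` and the use of a string trace ID are assumptions, and the 3% default is just one point inside the 1–5% range mentioned above.

```python
import hashlib


def should_sample(trace_id: str, rate: float = 0.03) -> bool:
    """Return True for roughly `rate` of trace IDs, deterministically.

    Hypothetical helper: the same trace_id always gets the same decision,
    so a reply is either always graded or never graded across re-runs.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)
```

Hash-based sampling keeps the decision reproducible without storing state, which helps when a flagged reply has to be traced back later.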

Pointwise vs pairwise judging

Pick your protocol before writing rubrics; it changes prompt shape and how you aggregate results.

Pointwise (absolute scoring)

The judge scores one answer on a numeric scale (1–5), with Likert labels (poor/fair/good), or as pass/fail against criteria. Pointwise scoring works well when you need a dashboard metric (“average helpfulness 4.2”) and when answers are evaluated independently. This style follows frameworks such as G-Eval, which have the judge work through the rubric step by step (chain of thought) before emitting a score.
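A pointwise judge call is mostly a prompt template plus a fixed output format. The sketch below shows one possible shape. The template wording, the name `build_pointwise_prompt`, and the JSON reply format are assumptions for illustration, not a prompt taken from G-Eval or any specific framework.

```python
# Hypothetical template: reason first, then emit a single integer score.
POINTWISE_TEMPLATE = """You are grading a support answer on one criterion: {criterion}.

Rubric:
{rubric}

Question:
{question}

Candidate answer:
{answer}

First reason step by step about how well the answer meets the rubric.
Then reply with JSON: {{"reasoning": "<text>", "score": <integer {low}-{high}>}}"""


def build_pointwise_prompt(criterion, rubric, question, answer, low=1, high=5):
    """Fill the template; the caller sends the result to whichever judge model it uses."""
    if low >= high:
        raise ValueError("low must be smaller than high")
    return POINTWISE_TEMPLATE.format(
        criterion=criterion,
        rubric=rubric,
        question=question,
        answer=answer,
        low=low,
        high=high,
    )
```

Keeping the scale bounds as parameters makes it harder for the prompt and the downstream score parser to disagree about the range.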

Pairwise (relative preference)

The judge sees two answers to the same prompt and picks A, B, or tie. This protocol is more reliable for comparing models or prompt versions because the judge only has to decide which answer is better, so its verdicts depend less on how it calibrates an absolute scale. Aggregate many pairwise results into an Elo or Bradley–Terry ranking.
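As a rough illustration of the aggregation step, here is a minimal Elo update over a list of pairwise verdicts. The function name, the K-factor of 32, and the starting rating of 1000 are conventional but arbitrary assumptions. Elo is also order-dependent, so a Bradley–Terry fit may be preferable when the comparison order is arbitrary.

```python
def elo_ratings(results, k=32.0, base=1000.0):
    """Compute Elo ratings from (variant_a, variant_b, outcome) tuples.

    outcome is "A" (first variant won), "B" (second won), or "tie".
    """
    score_for_a = {"A": 1.0, "B": 0.0, "tie": 0.5}
    ratings = {}
    for a, b, outcome in results:
        ra = ratings.setdefault(a, base)
        rb = ratings.setdefault(b, base)
        # Probability that `a` wins given the current rating gap.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        actual_a = score_for_a[outcome]
        ratings[a] = ra + k * (actual_a - expected_a)
        ratings[b] = rb + k * ((1.0 - actual_a) - (1.0 - expected_a))
    return ratings
```

For example, `elo_ratings([("prompt_v1", "prompt_v2", "A")])` moves the two variants apart symmetrically from the shared starting rating.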

When to use which

Goal	Prefer	Why
CI threshold (“must score ≥ 4.0”)	Pointwise	Absolute scores map to release gates
Model A vs model B	Pairwise	Higher human–judge agreement on preferences
RAG faithfulness pass/fail	Pointwise binary	Clear violation labels per chunk
Leaderboard ranking	Pairwise + Elo	Stable ordering from sparse comparisons

Rubric design: what judges actually grade

Vague instructions (“rate quality 1–5”) produce vague scores. Strong rubrics decompose quality into observable criteria with anchor examples.

Core dimensions for support and Q&A bots

Helpfulness — does the answer resolve the user’s stated intent?
Faithfulness — are claims supported by provided context or retrieved chunks? (Critical for RAG.)
Completeness — missing steps, caveats, or edge cases?
Safety — policy violations, PII leakage, harmful instructions?
Style — tone, brevity, formatting — only when product requirements demand it.

Rubric writing rules

One criterion per judge call when stakes are high; multi-criteria prompts blur failure modes.
Include 2–3 anchor examples per score band in the system prompt (not in every request).
Require the judge to quote evidence before scoring — reduces random high marks.
Use structured output (JSON with reasoning + score) via schema-constrained generation.
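The evidence-first and structured-output rules can be enforced on the receiving side as well. The sketch below validates a judge reply with the standard library only. The field names `evidence`, `reasoning`, and `score` are assumptions about the schema, and a real setup might rely on the provider's schema-constrained generation instead of manual checks.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    evidence: list[str]  # quotes the judge copied from the candidate answer
    reasoning: str
    score: int


def parse_verdict(raw: str, min_score: int = 1, max_score: int = 5) -> Verdict:
    """Parse a judge reply and reject anything that breaks the assumed schema."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    evidence = data.get("evidence")
    if not isinstance(evidence, list) or not evidence:
        raise ValueError("verdict must quote at least one evidence span")
    if not all(isinstance(q, str) and q.strip() for q in evidence):
        raise ValueError("evidence entries must be non-empty strings")
    reasoning = data.get("reasoning")
    if not isinstance(reasoning, str):
        raise ValueError("reasoning must be a string")
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not min_score <= score <= max_score:
        raise ValueError(f"score must be in [{min_score}, {max_score}]")
    return Verdict(evidence=evidence, reasoning=reasoning, score=score)
```

Rejecting malformed verdicts outright, rather than coercing them, keeps silent parsing errors from leaking into aggregate metrics.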

Bias and failure modes

Judges are language models with known evaluation biases. Ignoring them produces confident wrong rankings.

Position bias

In pairwise setups, judges favor the first or second answer depending on model family. Mitigation: run each pair twice with swapped order; count a win only when both passes agree. Drop inconsistent pairs from aggregates.
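The swap-and-agree rule can be sketched as a small wrapper around any pairwise judge. Here `judge` is a hypothetical callable that takes the prompt and two answers in display order and returns "first", "second", or "tie"; how it calls a model is left out on purpose.

```python
def judge_with_swap(judge, prompt, answer_a, answer_b):
    """Run a pairwise judge in both orders; return "A", "B", "tie", or None.

    None means the two passes disagreed, so the pair should be dropped
    from aggregates rather than counted as a win for either side.
    """
    first_pass = judge(prompt, answer_a, answer_b)   # A is shown first
    second_pass = judge(prompt, answer_b, answer_a)  # B is shown first
    # Translate positional verdicts back into answer labels.
    verdict_1 = {"first": "A", "second": "B", "tie": "tie"}[first_pass]
    verdict_2 = {"first": "B", "second": "A", "tie": "tie"}[second_pass]
    return verdict_1 if verdict_1 == verdict_2 else None
```

A judge that always prefers whichever answer appears first will return None for every pair, which makes the bias visible as a high drop rate instead of a skewed ranking.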

Leniency and self-preference

Judges rate outputs from the same model family higher (“self-bias”), and scores can drift lenient over time, for example after rubric edits or silent judge-model updates. Mitigation: use a different judge model than the generator; recalibrate monthly against a frozen human-labeled set; track score distributions, not just means.
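To show why distributions matter more than means, the sketch below compares two score samples by total variation distance. The function names are hypothetical, and total variation distance is just one reasonable choice of shift metric, not something this guide prescribes.

```python
from collections import Counter


def score_distribution(scores):
    """Return the share of answers in each score band."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = Counter(scores)
    return {band: counts[band] / len(scores) for band in sorted(counts)}


def distribution_shift(baseline_scores, current_scores):
    """Total variation distance between two score distributions (0.0 to 1.0)."""
    p = score_distribution(baseline_scores)
    q = score_distribution(current_scores)
    bands = set(p) | set(q)
    return 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in bands)
```

A baseline of all 3s and a current run split evenly between 1s and 5s share the same mean, yet `distribution_shift` reports the maximum possible shift, which a mean-only dashboard would miss.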

Verbosity and confidence bias

Longer, more assertive answers score higher even when wrong. Mitigation: add a rubric line penalizing unsupported claims; provide the retrieved context explicitly and ask the judge to tag each unsupported sentence before it assigns a score.

Hallucinated grading

Judges invent reasons or misread the candidate answer — the same hallucination problem in a meta layer. Mitigation: low temperature, force citation of spans from the candidate text, and spot-check 5–10% against humans.
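Forced span citation is only useful if someone checks the spans. A cheap check is to confirm that every quote the judge returns actually occurs in the candidate answer. The helper below is a sketch under that assumption. It normalizes whitespace and case, so it will miss quotes the judge paraphrased, which is arguably the desired behavior.

```python
def unverified_quotes(candidate: str, quotes):
    """Return the quotes that do not appear verbatim in the candidate answer.

    Comparison ignores case and collapses runs of whitespace. Empty quotes
    count as unverified because they prove nothing.
    """
    normalized_candidate = " ".join(candidate.split()).lower()
    missing = []
    for quote in quotes:
        normalized_quote = " ".join(quote.split()).lower()
        if not normalized_quote or normalized_quote not in normalized_candidate:
            missing.append(quote)
    return missing
```

Verdicts with any unverified quote are good candidates for the human spot-check sample.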

Faithfulness judging for RAG pipelines

Retrieval-augmented apps need judges that separate “well written” from “grounded in sources.” A common pattern:

Pass the user question, retrieved chunks (numbered), and the generated answer.
Ask the judge to list each factual claim in the answer.
Label each claim supported, unsupported, or contradicted by chunk IDs.
Compute faithfulness = supported claims / total claims.
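The final step of the pattern above reduces to a ratio over the judge's claim labels. A minimal sketch, assuming the judge's output has already been parsed into a list of label strings (the function name and the choice to return None for an answer with no claims are illustrative):

```python
ALLOWED_LABELS = {"supported", "unsupported", "contradicted"}


def faithfulness(claim_labels):
    """Return supported claims / total claims, or None if there are no claims."""
    unknown = [label for label in claim_labels if label not in ALLOWED_LABELS]
    if unknown:
        raise ValueError(f"unexpected claim labels: {unknown}")
    if not claim_labels:
        # No factual claims were extracted; leave the score undefined
        # instead of reporting a misleading 0.0 or 1.0.
        return None
    supported = sum(1 for label in claim_labels if label == "supported")
    return supported / len(claim_labels)
```

With two supported claims, one unsupported, and one contradicted, the score is 2 / 4 = 0.5.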

Pair this with cheap deterministic checks: citation regex, embedding similarity between answer sentences and chunks, and empty-retrieval detectors. Judges catch semantic drift; heuristics catch structural bugs.
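Two of these deterministic checks fit in a few lines. The sketch below assumes answers cite chunks with bracketed numbers such as [1] and that chunks are numbered from 1. Both are assumptions about the pipeline, and the issue names are made up for illustration.

```python
from __future__ import annotations

import re

# Assumed citation style: [1], [2], ... referring to numbered chunks.
CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def structural_issues(answer: str, chunk_count: int) -> list[str]:
    """Return cheap structural problems found before any judge call."""
    issues = []
    if chunk_count == 0:
        issues.append("empty_retrieval")
    cited = {int(number) for number in CITATION_PATTERN.findall(answer)}
    if not cited:
        issues.append("no_citations")
    out_of_range = sorted(c for c in cited if c < 1 or c > chunk_count)
    if out_of_range:
        issues.append(f"unknown_chunk_ids:{out_of_range}")
    return issues
```

Running these checks first can skip the judge call entirely for answers that are already structurally broken.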

Calibrating judges against humans

Before trusting a judge in CI, measure agreement with human labels on 100–300 representative cases:

Cohen’s kappa or weighted kappa for ordinal scores
Preference accuracy for pairwise vs human majority vote
False pass rate on known-bad cases (safety, policy, hallucination traps)

Target kappa > 0.6 for internal tooling; higher for customer-facing gates. If agreement is low, fix the rubric before swapping judge models. Document the judge model version in eval reports — upgrading GPT-4 class judges can shift score baselines without any product change.
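For reference, unweighted Cohen's kappa is $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The sketch below computes it, plus a false pass rate, with the standard library. Function names are illustrative, and for ordinal scores a weighted kappa from a statistics library would be the better fit.

```python
from collections import Counter


def cohens_kappa(human_labels, judge_labels):
    """Unweighted Cohen's kappa between two equally long label sequences."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    # Chance agreement: both raters pick the same label independently.
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def false_pass_rate(judge_labels_on_known_bad, pass_label="pass"):
    """Share of known-bad cases that the judge still marked as passing."""
    if not judge_labels_on_known_bad:
        raise ValueError("need at least one known-bad case")
    passes = sum(1 for label in judge_labels_on_known_bad if label == pass_label)
    return passes / len(judge_labels_on_known_bad)
```

With humans labeling pass, pass, fail, fail and the judge labeling pass, pass, fail, pass, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5, which would fall short of the 0.6 target.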

Worked example: Harbor Support eval pipeline

Harbor Support routes tier-one tickets through a RAG-backed assistant. Their nightly eval pipeline:

Golden set — 420 real tickets (redacted) with human “ideal answer” notes and policy tags.
Generate — run current production stack on each ticket; store traces.
Faithfulness judge — GPT-4 class model, pointwise, chunks injected; fail if any unsupported policy claim.
Helpfulness judge — separate prompt, pairwise vs human reference answer when available; else pointwise 1–5.
Aggregate — block deploy if faithfulness pass rate drops > 2 pp or median helpfulness drops > 0.3 vs seven-day baseline.
Sample — 3% of live replies get the same judges; outliers open Linear tickets with trace_id.

Splitting faithfulness and helpfulness into two judge calls reduced false passes where fluent answers invented refund rules. Cost: roughly $12/night on the golden set — cheaper than one hour of manual QA.
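The aggregate step in this pipeline can be expressed as a small gate function. The sketch below uses the thresholds quoted above (2 pp and 0.3), but everything else is assumed: the function name, the argument shapes, and the idea that the seven-day baseline is computed elsewhere and passed in.

```python
from statistics import median


def deploy_gate(
    faith_pass_rate,
    baseline_faith_pass_rate,
    helpfulness_scores,
    baseline_median_helpfulness,
    max_faith_drop_pp=2.0,
    max_helpfulness_drop=0.3,
):
    """Return a list of reasons to block the deploy; an empty list means pass.

    Pass rates are fractions in [0, 1]; drops are measured against a
    baseline supplied by the caller.
    """
    reasons = []
    faith_drop_pp = (baseline_faith_pass_rate - faith_pass_rate) * 100.0
    if faith_drop_pp > max_faith_drop_pp:
        reasons.append(f"faithfulness pass rate dropped {faith_drop_pp:.1f} pp")
    helpfulness_drop = baseline_median_helpfulness - median(helpfulness_scores)
    if helpfulness_drop > max_helpfulness_drop:
        reasons.append(f"median helpfulness dropped {helpfulness_drop:.2f}")
    return reasons
```

Returning reasons instead of a bare boolean gives the CI log something actionable when a deploy is blocked.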

Method decision table

Approach	Best for	Cost	Watch out for
Pointwise G-Eval style	CI gates, dashboards	1 call / answer	Score drift, verbosity bias
Pairwise + swap	Model/prompt A/B	2 calls / pair	Position bias if not swapped
Binary faithfulness	RAG compliance	1 call / answer	Misses subtle contradictions
Human-only	High-stakes, small N	Expensive	Does not scale to nightly CI
Embedding similarity	Near-duplicate checks	Very cheap	Blind to paraphrased errors
Classifier fine-tune	Stable policy labels	Upfront train	Brittle on new failure modes

Common pitfalls

Same model judges itself — self-preference inflates scores; use a stronger or distinct judge.
One rubric for every locale — tone norms differ; localize rubrics or judge prompts.
Chasing leaderboard scores — optimizing for judge metrics that diverge from user thumbs-down.
No frozen baseline — changing judges and rubrics weekly makes trends meaningless.
Ignoring cost — GPT-4 class judges on 10k cases nightly add up; cache chunk embeddings, batch judge calls.
Skipping adversarial cases — include red-team prompts in the golden set; judges that only see easy tickets miss jailbreak regressions.

Production checklist

Define success criteria per task type (faithfulness, helpfulness, safety) before picking judge style.
Write rubrics with anchor examples and forced evidence quotes.
Calibrate judge vs 100+ human labels; record kappa and false-pass rate.
Use pairwise with position swapping for model comparisons.
Split faithfulness and helpfulness into separate judge calls for RAG apps.
Pin judge model version; re-baseline when upgrading.
Wire judge failures to trace IDs in observability tooling.
Sample live traffic; alert on metric regression, not single bad answers.
Re-run judges when retrieval index or system prompt changes.
Keep a human escalation path for disputed grades.

Key takeaways

LLM-as-judge scales quality measurement when rubrics are specific and calibrated.
Pairwise with swap beats pointwise for comparing variants; pointwise beats pairwise for threshold gates.
Bias mitigation (position swap, separate judge model, evidence quoting) is not optional.
Faithfulness judging is the highest-ROI judge for RAG products.
Human agreement metrics validate the judge before it validates your app.