Guide
LLM-as-judge explained
Human reviewers cannot read every model answer your product generates. LLM-as-judge uses a separate language model — often stronger than the one being tested — to score outputs against a written rubric. Teams deploy judges for offline regression suites, A/B comparisons, RAG faithfulness checks, and sampled production monitoring. The technique is powerful and cheap at scale, but judges inherit model biases: position preference, leniency drift, and blind spots on factual errors. This guide covers when judges beat human labels, pointwise versus pairwise protocols, rubric design, bias mitigation, faithfulness scoring for retrieval pipelines, calibration against gold labels, a Harbor Support eval pipeline worked example, method decision tables, common pitfalls, and a production checklist. For the broader eval stack, start with our LLM evaluation and benchmarking primer; for live quality dashboards, see LLM observability and RAG evaluation.
What LLM-as-judge means
An LLM judge is a prompt (or fine-tuned model) that receives a task specification, optional reference material, and a candidate answer, then returns a score, label, or preference. Unlike exact-match metrics, judges handle open-ended text: tone, completeness, reasoning quality, and policy compliance.
Three judge roles in production
- Regression gate — score every case in a golden dataset before merge; block deploys when aggregate metrics drop.
- Comparative ranking — pick the better of two prompt variants, models, or retrieval configs on the same inputs.
- Sampled monitor — grade 1–5% of live traffic for drift detection; escalate outliers to humans.
Judges are not a replacement for domain experts on high-stakes decisions (medical, legal, financial advice). They are a scalable screen that catches obvious regressions and ranks candidates before expensive human review.
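The sampled-monitor role needs a way to choose which live replies get graded. A minimal sketch, assuming each reply carries a string trace ID (the `should_sample` name and the 3% default are illustrative): hashing the ID makes the choice deterministic, so re-running the monitor grades the same replies.

```python
import hashlib


def should_sample(trace_id: str, rate: float = 0.03) -> bool:
    """Return True for roughly `rate` of trace IDs, always the same ones."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


if __name__ == "__main__":
    sampled = [t for t in (f"trace-{i}" for i in range(1000)) if should_sample(t)]
    print(f"{len(sampled)} of 1000 traces selected for judging")
```

Hash-based sampling avoids storing a sampling decision per request, at the cost of not being able to change which traces are picked without changing the rate or the hash input.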
Pointwise vs pairwise judging
Pick your protocol before writing rubrics; it changes prompt shape and how you aggregate results.
Pointwise (absolute scoring)
The judge scores one answer on a numeric scale (1–5), with Likert labels (poor/fair/good), or pass/fail against criteria. This works well when you need a dashboard metric (“average helpfulness 4.2”) and when answers are evaluated independently. The approach follows frameworks like G-Eval, which have the judge reason step by step through the rubric before emitting a score.
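A minimal sketch of a pointwise judge call, under stated assumptions: `call_judge` is a hypothetical function that sends a prompt to whatever judge model you use and returns its raw text, and the template wording is illustrative rather than a canonical G-Eval prompt.

```python
import json
from statistics import mean
from typing import Callable, Iterable, Tuple

# Illustrative template (an assumption, not a fixed standard).
POINTWISE_TEMPLATE = (
    "Criterion: {criterion}\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reason step by step against the criterion, then give a score from 1 to 5.\n"
    'Reply with JSON only: {{"reasoning": "<text>", "score": <1-5>}}'
)


def pointwise_score(question: str, answer: str, criterion: str,
                    call_judge: Callable[[str], str]) -> int:
    """Score one answer on a 1-5 scale using the injected judge function."""
    prompt = POINTWISE_TEMPLATE.format(
        criterion=criterion, question=question, answer=answer
    )
    verdict = json.loads(call_judge(prompt))
    score = int(verdict["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score


def average_score(cases: Iterable[Tuple[str, str]], criterion: str,
                  call_judge: Callable[[str], str]) -> float:
    """Average pointwise scores over (question, answer) pairs for a dashboard."""
    return mean(pointwise_score(q, a, criterion, call_judge) for q, a in cases)
```

Passing the judge in as a function keeps the scoring logic testable with a stub and independent of any particular model provider.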
Pairwise (relative preference)
The judge sees two answers to the same prompt and picks A, B, or tie. This is more reliable for comparing models or prompt versions because a relative judgment does not depend on the judge’s absolute scale staying calibrated. Aggregate many pairwise results into an Elo or Bradley–Terry ranking.
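As an illustration of the aggregation step, here is a small Elo sketch. The K-factor of 32 and starting rating of 1000 are conventional defaults chosen for the example, not values from this guide, and the `elo_ratings` name is hypothetical.

```python
from typing import Dict, Iterable, Tuple


def elo_ratings(results: Iterable[Tuple[str, str, str]],
                k: float = 32.0, base: float = 1000.0) -> Dict[str, float]:
    """Fold pairwise outcomes into Elo ratings.

    Each result is (variant_a, variant_b, outcome), where outcome is
    "A", "B", or "tie" from the judge's point of view.
    """
    ratings: Dict[str, float] = {}
    for a, b, outcome in results:
        ra = ratings.setdefault(a, base)
        rb = ratings.setdefault(b, base)
        # Expected score of A given the current rating gap.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        actual_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[outcome]
        delta = k * (actual_a - expected_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta
    return ratings


if __name__ == "__main__":
    print(elo_ratings([("prompt_v1", "prompt_v2", "A"),
                       ("prompt_v2", "prompt_v3", "tie")]))
```

Elo updates depend on the order in which comparisons are processed. A Bradley–Terry fit over the full set of results does not, so it may be the better choice for a fixed offline batch.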
When to use which
| Goal | Prefer | Why |
|---|---|---|
| CI threshold (“must score ≥ 4.0”) | Pointwise | Absolute scores map to release gates |
| Model A vs model B | Pairwise | Higher human–judge agreement on preferences |
| RAG faithfulness pass/fail | Pointwise binary | Clear violation labels per chunk |
| Leaderboard ranking | Pairwise + Elo | Stable ordering from sparse comparisons |
Rubric design: what judges actually grade
Vague instructions (“rate quality 1–5”) produce vague scores. Strong rubrics decompose quality into observable criteria with anchor examples.
Core dimensions for support and Q&A bots
- Helpfulness — does the answer resolve the user’s stated intent?
- Faithfulness — are claims supported by provided context or retrieved chunks? (Critical for RAG.)
- Completeness — missing steps, caveats, or edge cases?
- Safety — policy violations, PII leakage, harmful instructions?
- Style — tone, brevity, formatting — only when product requirements demand it.
Rubric writing rules
- One criterion per judge call when stakes are high; multi-criteria prompts blur failure modes.
- Include 2–3 anchor examples per score band in the system prompt (not in every request).
- Require the judge to quote evidence before scoring — reduces random high marks.
- Use structured output (JSON with `reasoning` + `score`) via schema-constrained generation.
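The evidence-quoting and structured-output rules can be enforced in code after the judge replies. A sketch, assuming the judge returns JSON with `reasoning` and `score` plus an `evidence` list of verbatim quotes (the `evidence` field and the `parse_verdict` name are assumptions for this example):

```python
import json


def parse_verdict(raw: str, candidate: str) -> dict:
    """Validate a judge reply and check that each quote appears in the candidate."""
    data = json.loads(raw)

    quotes = data.get("evidence")
    if not isinstance(quotes, list) or not quotes:
        raise ValueError("judge returned no evidence quotes")
    # A quote that is not a verbatim substring suggests the judge misread
    # or invented part of the answer.
    missing = [q for q in quotes if not isinstance(q, str) or q not in candidate]
    if missing:
        raise ValueError(f"evidence not found in candidate: {missing}")

    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"invalid score: {score!r}")

    return {
        "reasoning": str(data.get("reasoning", "")),
        "evidence": quotes,
        "score": score,
    }
```

Rejected verdicts can be retried or routed to a human. Exact substring matching is strict, so you may want to normalize whitespace before comparing.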
Bias and failure modes
Judges are language models with known evaluation biases. Ignoring them produces confident wrong rankings.
Position bias
In pairwise setups, judges favor the first or second answer depending on model family. Mitigation: run each pair twice with swapped order; count a win only when both passes agree. Drop inconsistent pairs from aggregates.
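The swap-and-agree rule can be sketched as follows. The `judge` callable is a hypothetical stand-in for your model call: it receives the prompt and two answers in display order and returns "A" for the first shown, "B" for the second, or "tie".

```python
from typing import Callable, Optional


def swapped_preference(prompt: str, answer_1: str, answer_2: str,
                       judge: Callable[[str, str, str], str]) -> Optional[str]:
    """Run the judge in both orders; return a winner only if both passes agree.

    Returns "answer_1", "answer_2", "tie", or None for an inconsistent pair.
    """
    first_pass = judge(prompt, answer_1, answer_2)
    second_pass = judge(prompt, answer_2, answer_1)
    # In the second pass the positions are swapped, so "A" means answer_2.
    second_unswapped = {"A": "B", "B": "A", "tie": "tie"}[second_pass]
    if first_pass != second_unswapped:
        return None  # position-dependent verdict: drop from aggregates
    return {"A": "answer_1", "B": "answer_2", "tie": "tie"}[first_pass]
```

Tracking how often this returns `None` gives a rough measure of how position-sensitive a given judge is on your data.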
Leniency and self-preference
Judges rate outputs from the same model family higher (“self-bias”), and their scores can drift lenient over time, for example after a judge model upgrade or a shift in the traffic mix. Mitigation: use a different judge model than the generator; recalibrate monthly against a frozen human-labeled set; track score distributions, not just means.
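To show why distributions matter more than means, here is a small sketch that compares the share of each score between a baseline run and a current run. The function names and the 1–5 scale are assumptions for the example.

```python
from collections import Counter
from statistics import mean
from typing import Dict, Sequence


def score_shares(scores: Sequence[int],
                 scale: Sequence[int] = (1, 2, 3, 4, 5)) -> Dict[int, float]:
    """Fraction of answers that received each score on the scale."""
    if not scores:
        raise ValueError("no scores to summarize")
    counts = Counter(scores)
    return {s: counts.get(s, 0) / len(scores) for s in scale}


def max_share_shift(baseline: Sequence[int], current: Sequence[int],
                    scale: Sequence[int] = (1, 2, 3, 4, 5)) -> float:
    """Largest absolute change in any score's share between two runs."""
    b, c = score_shares(baseline, scale), score_shares(current, scale)
    return max(abs(c[s] - b[s]) for s in scale)


if __name__ == "__main__":
    baseline, current = [3, 3, 3, 3], [1, 5, 1, 5]
    # Both runs have the same mean, but the distribution has changed completely.
    print(mean(baseline), mean(current), max_share_shift(baseline, current))
```

An alert threshold on the shift is something to tune against your own historical runs.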
Verbosity and confidence bias
Longer, more assertive answers score higher even when wrong. Mitigation: add a rubric line penalizing unsupported claims; provide the retrieved context explicitly and ask the judge to tag unsupported sentences before it assigns a score.
Hallucinated grading
Judges invent reasons or misread the candidate answer — the same hallucination problem in a meta layer. Mitigation: low temperature, force citation of spans from the candidate text, and spot-check 5–10% against humans.
Faithfulness judging for RAG pipelines
Retrieval-augmented apps need judges that separate “well written” from “grounded in sources.” A common pattern:
- Pass the user question, retrieved chunks (numbered), and the generated answer.
- Ask the judge to list each factual claim in the answer.
- Label each claim `supported`, `unsupported`, or `contradicted` by chunk IDs.
- Compute faithfulness = supported claims / total claims.
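The last step of the pattern above is simple arithmetic once the judge has labeled each claim. A sketch, assuming the judge output has already been parsed into `Claim` records (a hypothetical structure for this example):

```python
from dataclasses import dataclass, field
from typing import List, Optional

VALID_LABELS = {"supported", "unsupported", "contradicted"}


@dataclass
class Claim:
    text: str
    label: str                                            # one of VALID_LABELS
    chunk_ids: List[int] = field(default_factory=list)    # chunks the judge cited


def faithfulness_score(claims: List[Claim]) -> Optional[float]:
    """Return supported claims / total claims, or None if there are no claims."""
    if not claims:
        return None  # the ratio is undefined for an answer with no factual claims
    for claim in claims:
        if claim.label not in VALID_LABELS:
            raise ValueError(f"unknown label: {claim.label!r}")
    supported = sum(1 for claim in claims if claim.label == "supported")
    return supported / len(claims)


if __name__ == "__main__":
    claims = [
        Claim("Refunds take 14 days.", "supported", [2]),
        Claim("Refunds are automatic.", "unsupported"),
    ]
    print(faithfulness_score(claims))
```

Returning `None` for answers without claims keeps refusals and clarifying questions from being counted as perfectly faithful or fully unfaithful. How to treat them is a product decision.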
Pair this with cheap deterministic checks: citation regex, embedding similarity between answer sentences and chunks, and empty-retrieval detectors. Judges catch semantic drift; heuristics catch structural bugs.
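Two of the deterministic checks can be sketched without any model call. This example assumes answers cite chunks with bracketed 1-based numbers such as `[2]`; that citation format and the issue names are assumptions. The embedding-similarity check is omitted because it needs an embedding model.

```python
import re
from typing import List

CITATION = re.compile(r"\[(\d+)\]")  # assumed citation style: [1], [2], ...


def structural_issues(answer: str, chunks: List[str]) -> List[str]:
    """Flag structural problems before spending a judge call."""
    issues: List[str] = []
    if not chunks:
        issues.append("empty_retrieval")
    cited = {int(n) for n in CITATION.findall(answer)}
    if not cited:
        issues.append("no_citations")
    # Citations that point at chunk numbers that were never retrieved.
    out_of_range = sorted(n for n in cited if not 1 <= n <= len(chunks))
    if out_of_range:
        issues.append(f"unknown_chunk_ids:{out_of_range}")
    return issues
```

Answers that fail these checks can be marked as failures directly, which saves judge cost and keeps structural bugs out of the semantic scores.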
Calibrating judges against humans
Before trusting a judge in CI, measure agreement with human labels on 100–300 representative cases:
- Cohen’s kappa or weighted kappa for ordinal scores
- Preference accuracy for pairwise vs human majority vote
- False pass rate on known-bad cases (safety, policy, hallucination traps)
Target kappa > 0.6 for internal tooling; higher for customer-facing gates. If agreement is low, fix the rubric before swapping judge models. Document the judge model version in eval reports — upgrading GPT-4 class judges can shift score baselines without any product change.
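For reference, unweighted Cohen’s kappa and the false pass rate can be computed with the standard library alone. This is a sketch for categorical labels; the weighted variant for ordinal scores is not shown, and the function names are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def false_pass_rate(judge_passed_known_bad: Sequence[bool]) -> float:
    """Share of known-bad cases that the judge nevertheless passed."""
    if not judge_passed_known_bad:
        raise ValueError("no known-bad cases provided")
    return sum(judge_passed_known_bad) / len(judge_passed_known_bad)


if __name__ == "__main__":
    human = ["pass", "pass", "fail", "fail"]
    judge = ["pass", "fail", "fail", "fail"]
    print(cohens_kappa(human, judge))
    print(false_pass_rate([True, False, False, False]))
```

Kappa corrects raw agreement for chance, which matters when most cases pass: a judge that always says "pass" can show high raw agreement while having a kappa near zero.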
Worked example: Harbor Support eval pipeline
Harbor Support routes tier-one tickets through a RAG-backed assistant. Their nightly eval pipeline:
- Golden set — 420 real tickets (redacted) with human “ideal answer” notes and policy tags.
- Generate — run current production stack on each ticket; store traces.
- Faithfulness judge — GPT-4 class model, pointwise, chunks injected; fail if any unsupported policy claim.
- Helpfulness judge — separate prompt, pairwise vs human reference answer when available; else pointwise 1–5.
- Aggregate — block deploy if faithfulness pass rate drops > 2 pp or median helpfulness drops > 0.3 vs seven-day baseline.
- Sample — 3% of live replies get the same judges; outliers open Linear tickets with `trace_id`.
Splitting faithfulness and helpfulness into two judge calls reduced false passes where fluent answers invented refund rules. Cost: roughly $12/night on the golden set — cheaper than one hour of manual QA.
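The aggregate step in the pipeline could look roughly like the sketch below. The thresholds (2 pp and 0.3) come from the description above; the function name, argument shapes, and reason strings are assumptions, not Harbor Support’s actual implementation.

```python
from statistics import median
from typing import List, Sequence


def deploy_block_reasons(baseline_pass_rate: float, current_pass_rate: float,
                         baseline_helpfulness: Sequence[float],
                         current_helpfulness: Sequence[float],
                         max_pass_drop_pp: float = 2.0,
                         max_median_drop: float = 0.3) -> List[str]:
    """Return reasons to block a deploy; an empty list means the gate passes.

    Pass rates are fractions in [0, 1]; helpfulness values are 1-5 scores.
    """
    reasons: List[str] = []
    # Convert the pass-rate difference to percentage points.
    pass_drop_pp = (baseline_pass_rate - current_pass_rate) * 100
    if pass_drop_pp > max_pass_drop_pp:
        reasons.append(f"faithfulness pass rate down {pass_drop_pp:.1f} pp")
    median_drop = median(baseline_helpfulness) - median(current_helpfulness)
    if median_drop > max_median_drop:
        reasons.append(f"median helpfulness down {median_drop:.1f}")
    return reasons


if __name__ == "__main__":
    print(deploy_block_reasons(0.96, 0.93, [4, 4, 5], [3, 3, 4]))
```

Returning the reasons instead of a bare boolean makes the CI log explain why a deploy was blocked.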
Method decision table
| Approach | Best for | Cost | Watch out for |
|---|---|---|---|
| Pointwise G-Eval style | CI gates, dashboards | 1 call / answer | Score drift, verbosity bias |
| Pairwise + swap | Model/prompt A/B | 2 calls / pair | Position bias if not swapped |
| Binary faithfulness | RAG compliance | 1 call / answer | Misses subtle contradictions |
| Human-only | High-stakes, small N | Expensive | Does not scale to nightly CI |
| Embedding similarity | Near-duplicate checks | Very cheap | Blind to paraphrased errors |
| Classifier fine-tune | Stable policy labels | Upfront train | Brittle on new failure modes |
Common pitfalls
- Same model judges itself — self-preference inflates scores; use a stronger or distinct judge.
- One rubric for every locale — tone norms differ; localize rubrics or judge prompts.
- Chasing leaderboard scores — optimizing for judge metrics that have diverged from real user signals such as thumbs-down rates.
- No frozen baseline — changing judges and rubrics weekly makes trends meaningless.
- Ignoring cost — GPT-4 class judges on 10k cases nightly add up; cache chunk embeddings, batch judge calls.
- Skipping adversarial cases — include red-team prompts in the golden set; judges that only see easy tickets miss jailbreak regressions.
Production checklist
- Define success criteria per task type (faithfulness, helpfulness, safety) before picking judge style.
- Write rubrics with anchor examples and forced evidence quotes.
- Calibrate judge vs 100+ human labels; record kappa and false-pass rate.
- Use pairwise with position swapping for model comparisons.
- Split faithfulness and style into separate judge calls for RAG apps.
- Pin judge model version; re-baseline when upgrading.
- Wire judge failures to trace IDs in observability tooling.
- Sample live traffic; alert on metric regression, not single bad answers.
- Re-run judges when retrieval index or system prompt changes.
- Keep a human escalation path for disputed grades.
Key takeaways
- LLM-as-judge scales quality measurement when rubrics are specific and calibrated.
- Pairwise with swap beats pointwise for comparing variants; pointwise beats pairwise for threshold gates.
- Bias mitigation (position swap, separate judge model, evidence quoting) is not optional.
- Faithfulness judging is the highest-ROI judge for RAG products.
- Human agreement metrics validate the judge before it validates your app.
Related reading
- LLM evaluation and benchmarking explained — golden sets, public benchmarks, and CI regression suites
- RAG evaluation explained — retrieval metrics and end-to-end faithfulness
- LLM hallucinations explained — why models invent facts judges must catch
- LLM guardrails explained — safety rails beyond scoring