LLM confidence calibration and uncertainty explained

Harbor Analytics' overnight fraud scorer was fluent and decisive. For a disputed wire transfer it returned “Confidence: 97% — likely legitimate” with a polished rationale citing merchant category codes the model half-invented. Compliance escalated manually; the transfer was fraudulent. Post-mortem: the model's verbalized confidence bore almost no relationship to actual error rate. Prompting “only answer if sure” made outputs shorter, not safer. The fix was not a bigger model — it was treating uncertainty as a first-class signal: token-level logprobs on structured verdict fields, semantic entropy across three sampled rationales, and hard abstention when either score crossed a calibrated threshold. False-auto-approve rate dropped 62% while human review volume rose only 11%.

Confidence calibration for LLMs asks whether a stated or inferred certainty matches how often the model is right. Uncertainty quantification extracts scores you can threshold, use to route cases to human review, or feed into LLM-as-judge ensembles. This guide covers verbalized vs logprob-based confidence, semantic entropy and self-consistency disagreement, selective abstention, ties to conformal prediction and classical model calibration, the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.

Why LLM confidence is usually wrong out of the box

Generative models are trained to produce plausible continuations, not calibrated probabilities over facts. Three common failure modes appear in production:

Overconfident fluency — polished prose reads authoritative even when retrieval is empty or the question is out of distribution.
Verbalized confidence illusions — asking the model to output “Confidence: 0.85” rarely improves calibration; the number is another generated token sequence with weak grounding in error statistics.
Ranking vs calibration confusion — a classifier head or judge model may rank risky cases correctly (high AUC) while probability scores are miscalibrated — the same trap that breaks fraud dashboards in classical ML.

Uncertainty work therefore splits into measurement (reliability diagrams on held-out labels), extraction (signals available at inference), and action (abstain, retrieve more, or escalate).

Signals you can extract at inference time

Token logprobs and sequence likelihood

Many inference APIs expose per-token log probabilities for the generated span. Low average logprob on the answer tokens, especially on named entities, numbers, and JSON enum fields, correlates with hallucination risk, though the correlation is task-dependent and must be validated on your data. Normalize by token count so that long answers are not penalized for length alone, and compare against a baseline distribution measured on in-domain answers known to be correct. For structured outputs, score only the value tokens of high-stakes fields (verdict, amount, account ID), not boilerplate preambles.
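
As a minimal sketch of field-level scoring, the snippet below averages logprobs over only the tokens that make up one field's value. The `ScoredToken` type and its `field_name` tag are hypothetical: they assume you have already aligned API tokens to JSON fields, which real APIs do not do for you.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredToken:
    text: str
    logprob: float             # natural-log probability, as commonly reported
    field_name: Optional[str]  # hypothetical tag: field whose value this token belongs to


def field_mean_logprob(tokens: List[ScoredToken], field_name: str) -> float:
    """Mean logprob over the value tokens of one field (length-normalized)."""
    values = [t.logprob for t in tokens if t.field_name == field_name]
    if not values:
        raise ValueError(f"no tokens tagged for field {field_name!r}")
    return sum(values) / len(values)


def field_geometric_prob(tokens: List[ScoredToken], field_name: str) -> float:
    """exp(mean logprob): the geometric-mean token probability for the field."""
    return math.exp(field_mean_logprob(tokens, field_name))


# Illustrative tokens only; boilerplate tokens carry no field tag and are ignored.
tokens = [
    ScoredToken('{"decision": "', -0.01, None),
    ScoredToken("legit", -0.9, "decision"),
    ScoredToken("imate", -0.3, "decision"),
    ScoredToken('"}', -0.02, None),
]
decision_score = field_mean_logprob(tokens, "decision")
```

The resulting score only becomes meaningful once it is compared against the same statistic on audited in-domain answers; the raw number is not a probability of correctness.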

Self-consistency and semantic entropy

Sample N answers at non-zero temperature and measure disagreement. Exact-match variance is brittle for free text; semantic entropy clusters paraphrases with an embedding model or NLI model and computes entropy over semantic equivalence classes. High semantic entropy means the model has multiple incompatible plausible answers — a strong abstention trigger for factual QA and extraction tasks. Pair with self-consistency voting when you need an answer, not only a reject flag.
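
The sketch below illustrates the clustering-then-entropy idea under simplifying assumptions. The `same_meaning` callback is a placeholder for whatever equivalence check you use (an NLI or embedding model in practice); the toy version here only normalizes strings, and the greedy clustering compares each answer to one representative per cluster.

```python
import math
from typing import Callable, List


def cluster_by_meaning(answers: List[str],
                       same_meaning: Callable[[str, str], bool]) -> List[List[str]]:
    """Greedy clustering: an answer joins the first cluster whose representative matches."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(answers: List[str],
                     same_meaning: Callable[[str, str], bool]) -> float:
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    n = len(answers)
    clusters = cluster_by_meaning(answers, same_meaning)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def majority_answer(answers: List[str],
                    same_meaning: Callable[[str, str], bool]) -> str:
    """Self-consistency vote: representative of the largest cluster."""
    return max(cluster_by_meaning(answers, same_meaning), key=len)[0]


def toy_same_meaning(a: str, b: str) -> bool:
    # Stand-in for an NLI or embedding check; an assumption for illustration only.
    return a.strip().lower().rstrip(".") == b.strip().lower().rstrip(".")


samples = ["Fraudulent", "fraudulent.", "Legitimate"]
entropy = semantic_entropy(samples, toy_same_meaning)
vote = majority_answer(samples, toy_same_meaning)
```

With only a handful of samples the entropy estimate is coarse, so thresholds on it should be tuned on audited data and not chosen by intuition.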

Retrieval and tool grounding gaps

Uncertainty is not only inside the model. Empty retrieval, low reranker scores, or failed tool calls are exogenous uncertainty signals. A calibrated pipeline treats “no evidence retrieved” as high epistemic uncertainty even if the LLM sounds sure.
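
One way to make these exogenous signals explicit is to compute flags from pipeline state before looking at anything the model said. The `PipelineState` fields and the `min_rerank` cutoff below are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineState:
    reranker_scores: List[float]  # one score per retrieved chunk; empty means nothing retrieved
    tool_calls_failed: int = 0


def grounding_flags(state: PipelineState, min_rerank: float = 0.3) -> List[str]:
    """Return uncertainty flags derived from retrieval and tool state only."""
    flags: List[str] = []
    if not state.reranker_scores:
        flags.append("empty_retrieval")
    elif max(state.reranker_scores) < min_rerank:
        flags.append("weak_retrieval")
    if state.tool_calls_failed > 0:
        flags.append("tool_failure")
    return flags


flags = grounding_flags(PipelineState(reranker_scores=[], tool_calls_failed=0))
```

Any non-empty flag list can then be treated as high uncertainty regardless of how confident the generated text sounds.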

Judge model scores

A separate rubric-scoring model can output a 0–1 quality or correctness estimate. Like any classifier, judge scores need calibration plots on labeled audit sets; do not assume the raw score is a probability without checking.

Measuring calibration on LLM outputs

Borrow metrics from model calibration:

Reliability diagrams — bin predictions by confidence (verbalized, logprob-derived, or judge score) and plot mean predicted vs observed accuracy per bin.
Expected calibration error (ECE) — weighted average gap between predicted confidence and empirical accuracy across bins.
Brier score — mean squared error between predicted probability and binary outcome; rewards both calibration and sharpness.
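
A small standard-library sketch of these three metrics follows, assuming equal-width confidence bins and binary correctness labels from an audit set. The function names and the toy data are illustrative.

```python
from typing import List, Sequence, Tuple


def reliability_bins(confidences: Sequence[float], correct: Sequence[int],
                     n_bins: int = 10) -> List[Tuple[float, float, int]]:
    """Per non-empty bin: (mean confidence, observed accuracy, count)."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    rows = []
    for b in bins:
        if b:
            mean_conf = sum(p for p, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            rows.append((mean_conf, accuracy, len(b)))
    return rows


def expected_calibration_error(confidences: Sequence[float], correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """Count-weighted average of |confidence - accuracy| across bins."""
    n = len(confidences)
    return sum(count / n * abs(conf - acc)
               for conf, acc, count in reliability_bins(confidences, correct, n_bins))


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(confidences, correct)) / len(confidences)


# Toy audit data: an overconfident 0.95 bin that is right only half the time.
conf = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
hit = [1, 1, 0, 0, 1, 0]
rows = reliability_bins(conf, hit)
ece = expected_calibration_error(conf, hit)
brier = brier_score(conf, hit)
```

Plotting the first two elements of each row against each other gives the reliability diagram; ECE is sensitive to the bin count, so report the binning alongside the number.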

For generative tasks, reduce to scorable subtasks: multiple-choice probes, structured classification fields, or human-labeled correctness on a fixed audit sample. You cannot calibrate what you do not label. Start with 500–2,000 audited examples stratified by topic and difficulty before tuning thresholds.

Temperature scaling and Platt scaling apply to judge heads and small classifiers on top of LLM features; they do not magically fix free-form prose. Use post-hoc scaling on the scalar you actually threshold.
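
To make post-hoc scaling concrete, here is a rough sketch of a Platt-style fit: a two-parameter logistic map from a raw judge score to a probability, trained by plain gradient descent on audited labels. It omits the target smoothing from Platt's original method, and the learning rate, step count, and toy data are assumptions.

```python
import math
from typing import Sequence, Tuple


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def calibrate(score: float, a: float, b: float) -> float:
    """Map a raw score to a calibrated probability."""
    return sigmoid(a * score + b)


def log_loss(scores: Sequence[float], labels: Sequence[int], a: float, b: float) -> float:
    total = 0.0
    for s, y in zip(scores, labels):
        p = calibrate(s, a, b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(scores)


def fit_platt(scores: Sequence[float], labels: Sequence[int],
              lr: float = 0.5, steps: int = 2000) -> Tuple[float, float]:
    """Gradient descent on mean log loss over the slope a and intercept b."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = calibrate(s, a, b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy audit set: raw judge scores with 0/1 correctness labels.
scores = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.6, 0.6, 0.2, 0.2]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
a, b = fit_platt(scores, labels)
```

In practice you would fit on one audit split and check the reliability diagram on another, since a map fitted and evaluated on the same labels looks better than it is.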

Selective abstention and routing

Selective prediction lets the system refuse to answer when uncertainty exceeds a cutoff. Implement abstention as an explicit product behavior, not a hidden retry loop:

Compute one or more uncertainty scores after generation (or mid-pipeline before irreversible actions).
Compare to a threshold tuned for your cost matrix: the cost of a wrong answer vs the cost of human delay vs the cost of lost automation.
On abstain: return a safe fallback (“I don't have enough information”), trigger retrieval expansion, or enqueue human review.
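
The threshold-tuning step above can be sketched as a scan over candidate cutoffs on an audit set. This is a simplified illustration: it folds human delay and lost automation into a single `cost_review` figure, and the cost values and audit rows are made up.

```python
from typing import List, Tuple

Audit = List[Tuple[float, bool]]  # (uncertainty score, was the answer correct?)


def expected_cost(threshold: float, audit: Audit,
                  cost_wrong: float, cost_review: float) -> float:
    """Average cost per case when abstaining whenever uncertainty > threshold."""
    total = 0.0
    for uncertainty, correct in audit:
        if uncertainty > threshold:
            total += cost_review   # abstained: pay for review or delay
        elif not correct:
            total += cost_wrong    # answered and wrong: pay the error cost
    return total / len(audit)


def best_threshold(audit: Audit, cost_wrong: float, cost_review: float) -> float:
    """Pick the observed score (or -inf, meaning abstain on everything) with lowest cost."""
    candidates = [float("-inf")] + sorted({u for u, _ in audit})
    return min(candidates,
               key=lambda t: expected_cost(t, audit, cost_wrong, cost_review))


audit = [(0.1, True), (0.2, True), (0.3, True), (0.4, False),
         (0.5, True), (0.7, False), (0.9, False)]
strict = best_threshold(audit, cost_wrong=50.0, cost_review=5.0)
lenient = best_threshold(audit, cost_wrong=1.0, cost_review=5.0)
```

When errors are expensive relative to review, the chosen cutoff drops and more cases are routed to humans; when errors are cheap, the cutoff rises and the system answers almost everything.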

Dual-threshold policies work well: auto-send only when both the judge score and the logprob margin (for example, the gap between the top verdict option's logprob and the runner-up's) clear high bars; route to review when either falls short. That pattern reduced Harbor's silent errors without sending everything to humans.
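
A dual-threshold router can be as small as the sketch below. The bar values are placeholders, not recommendations, and `logprob_margin` is assumed here to mean the logprob gap between the chosen verdict and the next most likely one.

```python
from enum import Enum


class Route(Enum):
    AUTO_SEND = "auto_send"
    REVIEW = "review"


def route(judge_score: float, logprob_margin: float,
          judge_bar: float = 0.90, margin_bar: float = 2.0) -> Route:
    """Auto-send only when BOTH signals clear their bars; otherwise review."""
    if judge_score >= judge_bar and logprob_margin >= margin_bar:
        return Route.AUTO_SEND
    return Route.REVIEW


decision = route(judge_score=0.95, logprob_margin=3.1)
```

Requiring both signals trades some automation for fewer silent errors; the bars themselves should come from the audited cost analysis.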

Conformal prediction offers distribution-free coverage guarantees for classification-style heads: wrap a score function so that the prediction set, which may contain more than one label, includes the true label with probability at least 1 - α on exchangeable data. The guarantee is marginal (an average over calibration and test draws), not a per-input promise. Adapting conformal methods to open-ended generation is active research; use them where you have clear labels and a finite label or answer set.
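
For the finite-label case, split conformal prediction can be sketched in a few lines. The nonconformity score used here (one minus the probability assigned to the true label) is one common choice, and the calibration scores and label names are invented for illustration.

```python
import math
from typing import Dict, List, Set


def nonconformity(probs: Dict[str, float], true_label: str) -> float:
    """Score used on the calibration set: 1 - probability given to the true label."""
    return 1.0 - probs[true_label]


def conformal_quantile(cal_scores: List[float], alpha: float) -> float:
    """The ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: include every label
    return sorted(cal_scores)[k - 1]


def prediction_set(probs: Dict[str, float], qhat: float) -> Set[str]:
    """All labels whose nonconformity score is within the calibrated quantile."""
    return {label for label, p in probs.items() if 1.0 - p <= qhat}


# Calibration scores from 12 audited cases (illustrative values).
cal_scores = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15,
              0.20, 0.25, 0.30, 0.40, 0.55, 0.70]
qhat = conformal_quantile(cal_scores, alpha=0.2)

unsure = prediction_set({"legitimate": 0.50, "fraud": 0.46, "needs_info": 0.04}, qhat)
sure = prediction_set({"legitimate": 0.97, "fraud": 0.02, "needs_info": 0.01}, qhat)
```

A prediction set with more than one label is itself a usable abstention signal: the model cannot rule out the alternatives at the requested coverage level.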

Harbor Analytics refactor: fraud verdict calibration

Harbor's pipeline was rebuilt around three layers:

Structured verdict JSON — forced schema with decision, risk_score, and evidence_ids via grammar-constrained decoding so logprobs attach to meaningful fields.
Uncertainty features — mean logprob on decision and risk_score tokens, semantic entropy over three rationale samples, retrieval coverage (% of cited evidence IDs present in the chunk store), and a lightweight NLI check between rationale and retrieved spans.
Calibrated router — isotonic regression on a 1,200-case audit set mapped a composite score to empirical fraud-miss rate; auto-approve only below 2% estimated miss rate at the chosen operating point.
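
The calibrated-router layer can be illustrated with a pool-adjacent-violators fit, the standard algorithm behind isotonic regression. This is a standard-library sketch, not Harbor's actual code: it assumes distinct composite scores (pool ties first otherwise), that a higher composite score means higher risk, and the eight audit rows are invented.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def fit_isotonic(scores: Sequence[float],
                 outcomes: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Non-decreasing step fit; returns (upper edge, fitted rate) per block."""
    blocks: List[list] = []  # each block: [sum of outcomes, count, max score]
    for x, y in sorted(zip(scores, outcomes)):
        blocks.append([float(y), 1, x])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            total, count, max_x = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = max_x
    edges = [b[2] for b in blocks]
    rates = [b[0] / b[1] for b in blocks]
    return edges, rates


def estimated_miss_rate(edges: List[float], rates: List[float], score: float) -> float:
    """Look up the fitted rate of the first block whose upper edge is >= score."""
    i = bisect_left(edges, score)
    return rates[min(i, len(rates) - 1)]


def auto_approve(edges: List[float], rates: List[float], score: float,
                 max_miss_rate: float = 0.02) -> bool:
    return estimated_miss_rate(edges, rates, score) < max_miss_rate


# Toy audit rows: composite score and whether auto-approval would have missed fraud.
composite = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
missed = [0, 0, 1, 0, 0, 1, 1, 1]
edges, rates = fit_isotonic(composite, missed)
```

On a real audit set the fitted rates near the operating point rest on few examples, so it is worth checking how many audited cases sit below the approval cutoff before trusting the 2% figure.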

Verbalized “97% confident” strings were removed from customer-facing output; stakeholders see decision, evidence links, and a qualitative band (low / medium / high review priority) derived from the calibrated score, not from model prose.

Technique decision table

Technique	Best when	Weak when
Token logprobs	Structured fields, short answers, same model at inference	API hides logprobs; long creative writing; tool-heavy agents
Semantic entropy / self-consistency	Factual QA, extraction, numeric disputes	Latency budget forbids multiple samples; subjective tone tasks
Verbalized confidence	Demos and rough UI hints only	Any automated threshold or compliance claim
LLM-as-judge score	Rubric quality, tone, policy compliance	Judge itself hallucinates without grounding sources
Conformal prediction sets	Finite labels, need coverage guarantees	Open-ended generation without scorable reduction
Full HITL queue	High stakes, low volume, regulatory exposure	Scale requires automation; use uncertainty to shrink queue

Common pitfalls

Trusting verbalized percentages — they optimize for sounding helpful, not for frequentist accuracy.
Thresholds copied from papers — logprob cutoffs transfer poorly across models, prompts, and languages.
No audit labels — reliability diagrams require human or gold-standard correctness, not proxy clicks.
Single global threshold — fraud, medical, and marketing copy need different cost matrices.
Ignoring retrieval state — confident hallucination with empty context is the common case, not the edge case.
Abstention without UX — users get opaque errors; explain that evidence was insufficient and offer escalation.
Calibration drift — new model version or prompt invalidates isotonic maps; re-audit quarterly.

Production checklist

Define scorable outcomes and build a stratified audit set (500+ labels).
Log logprobs, retrieval coverage, sample disagreement, and judge scores per request.
Plot reliability diagrams and ECE per task and per customer tier.
Fit post-hoc calibration (isotonic or temperature) on the scalar you threshold.
Set dual thresholds for auto-send vs review from the cost matrix, not vibes.
Implement explicit abstention copy and human escalation paths.
Block auto-send when cited evidence IDs are missing from retrieval.
Re-calibrate after model, prompt, or retrieval index changes.
Track abstention rate, review rate, override rate, and silent-error audits.
Document that verbalized confidence is not used for automated decisions.

Key takeaways

LLMs are fluent before they are calibrated — verbalized confidence is not a safety control.
Combine logprobs, semantic entropy, retrieval gaps, and judge scores; threshold the composite after calibration on audited data.
Selective abstention and HITL routing shrink silent errors when the cost of mistakes exceeds the cost of delay.
Harbor Analytics cut false auto-approves by 62% by scoring structured verdict fields, abstaining at calibrated thresholds, and removing prose confidence claims.
Re-audit when models or prompts change — calibration is a maintenance task, not a one-time fit.