Guide
LLM logprobs and logit bias explained
Harbor Legal's contract triage pipeline asked a 70B model to output a clause
label — INDEMNITY, TERMINATION, IP_ASSIGNMENT
— as plain text. At temperature 0.2, labels were usually right, but the
system could not tell a confident 98% classification from a hesitant guess that
happened to print the same string. Parsing the answer gave a label; it did not give
a score. Worse, occasional creative prefixes (“The clause type is
INDEMNITY”) broke downstream JSON validators. Engineers enabled
logprobs on the completion API, constrained the first token to a
closed vocabulary via prompt design plus logit bias, and read
normalized probabilities directly from the logits. Mis-routed reviews to senior
attorneys dropped 31%, and low-confidence clauses now queue for human review
instead of silently auto-filing. Logprobs are not a niche research feature —
they are how production teams extract calibrated signals from models they already
run, without another fine-tune.
Every autoregressive LLM computes a probability distribution over the vocabulary at each decoding step. Logprobs (log probabilities) expose those scores in API responses. Logit bias lets callers add a fixed offset to specific token logits before sampling — nudging toward or away from tokens without changing model weights. Together they power classification, confidence gating, constrained multiple-choice outputs, and light steering cheaper than fine-tuning. This guide covers how logprobs are computed and returned, top-logprobs for multi-class reads, logit bias semantics and limits, pairing with grammar-constrained decoding and structured outputs, the Harbor Legal triage refactor, a technique decision table, pitfalls, and a production checklist.
What logprobs are
At each generation step the model outputs a vector of logits — unnormalized scores — one per token in the vocabulary. A softmax converts logits to probabilities that sum to 1. The natural logarithm of the chosen token's probability is its logprob. APIs typically return logprobs as negative numbers closer to zero for more likely tokens (e.g. −0.05 for a near-certain token, −3.2 for a rare one).
OpenAI-compatible servers accept logprobs: true and
top_logprobs: N on chat or completions endpoints. The response
includes, per generated token, the logprob of the sampled token plus the top
N alternatives with their logprobs. Anthropic and open-source stacks
(vLLM,
TGI, llama.cpp server) expose similar fields with slightly different JSON shapes.
Logprobs apply to output tokens; some providers also return prompt-token
logprobs for perplexity-style evaluation.
Why logprobs beat parsing text for classification
A classic pattern asks the model: “Classify this email as SPAM or HAM.
Answer with one word.” At temperature 0, you might always get
SPAM or HAM, but you still do not know margin. With
logprobs on the first generated token, you read
P(SPAM) and P(HAM) directly from
top_logprobs. Convert logprobs to probabilities with
exp(logprob), renormalize over your label set if needed, and use
the max as prediction plus entropy or margin as confidence. This is often more
stable than asking for JSON and more informative than greedy string match.
Logit bias: steering before sampling
Logit bias is a map from token ID to a bias value (typically −100 to +100 in OpenAI's API) added to that token's logit before temperature, top-p, and sampling run. A large positive bias can effectively force a token; a large negative bias suppresses it. Bias is per-token, not per-string — multi-token labels require tokenizing each label and biasing the first token, or using a single-token abbreviation per class.
Common uses:
- Suppress junk prefixes — bias against tokens like “The”, “I”, or newline when you want a bare label.
- Nudge format tokens — mild positive bias on
{when you want JSON-first outputs before schema validation kicks in. - Soft vocabulary restriction — negative bias on all tokens outside a small allowlist (weaker than true constrained decoding; large biases can distort calibration).
Logit bias does not change model weights. It is ephemeral inference-time control. For hard guarantees (valid JSON, enum-only fields), prefer grammar-constrained decoding or provider structured output modes; use bias for gentle nudges and format hygiene.
Production patterns
Closed-vocabulary classification
Tokenize each class label. Pick labels that start with distinctive first tokens
when possible (single-token labels are ideal). Prompt with few-shot examples,
set max_tokens: 1 (or generate only until label complete), enable
top_logprobs covering all classes, read probabilities, apply a
confidence threshold for human escalation. Calibrate thresholds on a held-out set
— raw logprobs are not always well-calibrated across domains; see our
confidence calibration guide.
Sequence-level confidence
Sum or average token logprobs across a completion for a coarse perplexity-style score. Useful for detecting hallucination-prone answers in RAG: low average logprob on the answer span may flag “model is guessing.” Pair with citation checks rather than trusting logprobs alone.
Contrastive / mutual scoring
Score two candidate completions by total logprob under the same prompt prefix. Useful for A/B phrasing, choosing between tool-call arguments, or reranking draft outputs from speculative decoding without a separate reward model.
Cost and latency notes
Returning top_logprobs adds serialization overhead but negligible
compute — logits already exist. The expensive part is still forward passes.
Logprob-based classifiers with max_tokens: 1 are often cheaper than
full JSON generations with repair loops.
Harbor Legal contract triage refactor
Harbor Legal's v1 pipeline used a chat prompt ending in “Respond with exactly one label from the list.” Labels were long snake_case strings; the tokenizer split them into two or three tokens, so first-token logprobs alone were ambiguous. The refactor:
- Short codes — mapped each clause type to a unique
single-token code (
I42,T09, etc.) in the prompt legend. - First-token classification —
max_tokens: 1,logprobs: true,top_logprobs: 20, temperature 0. - Logit bias hygiene — −100 bias on common preamble tokens; mild +2 on code-prefix characters shared across labels.
- Confidence gate — if normalized max probability < 0.85 or margin to second-best < 0.15, route to attorney queue instead of auto-folder.
- Calibration set — 2,400 labeled clauses; tuned thresholds to hit 96% precision on auto-route (up from 89%).
- Fallback — low-confidence items run through a slower chain with full-text rationale via prompt chaining.
They did not replace structured JSON exports for final filing — codes are an internal routing layer. Export still uses schema-validated objects after human confirmation on edge cases.
Technique decision table
| Your situation | Prefer | Avoid |
|---|---|---|
| Small closed label set, need confidence scores | Single-token codes + first-token top_logprobs | Parsing free-form natural language labels |
| Must guarantee valid JSON or enum fields | Grammar / structured output modes | Extreme logit bias as a fake schema enforcer |
| Gentle format nudge (start with brace, no preamble) | Moderate logit bias on a few tokens | Bias ±100 on dozens of tokens (distorts ranking) |
| Creative long-form generation | Normal sampling; logprobs for monitoring only | Heavy bias that fights the model's distribution |
| Domain shift from training data | Recalibrate thresholds; few-shot refresh | Trusting raw logprob magnitudes as true probabilities |
| Multi-token class names unavoidable | Constrained decoding over token sequences or score full label logprob | First-token-only logprobs on ambiguous prefixes |
Common pitfalls
- Multi-token labels with first-token-only scoring. “TERMINATION” and “TERMINATE” may share a first token; use single-token codes or score complete sequences.
- Treating logprobs as calibrated probabilities. They are model scores, not frequentist frequencies; validate on your data.
- Logit bias so strong it collapses diversity. +100 on one token makes logprobs meaningless for confidence.
- Ignoring BPE merge artifacts. Leading spaces change token IDs; tokenize labels with the same encoder the API uses.
- Comparing logprobs across different models or temperatures. Thresholds do not transfer without recalibration.
- Logprobs on tool-call channels. Some routers strip or alter logprob fields on function-call paths; test your stack.
- Privacy leakage in logged logprobs. Top alternatives can reveal sensitive token candidates; redact logs in regulated workflows.
Production checklist
- Define a closed label set; prefer single-token codes where possible.
- Tokenize every label with the production model's tokenizer; document token IDs.
- Enable
logprobsand sufficienttop_logprobsto cover all classes. - Set
max_tokensto the minimum needed for the label format. - Apply mild logit bias only for preamble suppression; avoid extreme biases on labels.
- Compute normalized probabilities and margin (top1 − top2) per request.
- Calibrate confidence thresholds on a labeled validation set for your precision target.
- Route low-confidence predictions to human review or a stronger fallback chain.
- Log prediction, margin, and raw logprobs for drift monitoring (with PII policy).
- For hard schema needs, pair logprobs with grammar or structured output validation.
- Re-run calibration after model version upgrades or prompt template changes.
- Document API differences if you serve via LiteLLM or multiple providers.
Key takeaways
- Logprobs expose the model's token-level probability distribution at inference time — the foundation for classification confidence without a separate scorer.
- First-token top_logprobs over single-token label codes is a high-precision, low-latency pattern for triage and routing.
- Logit bias nudges logits before sampling; it is not a substitute for constrained decoding or fine-tuning when hard guarantees matter.
- Raw logprobs need threshold calibration on your data — treat them as ranking signals, not oracle probabilities.
- Harbor Legal cut mis-routes 31% by combining logprobs, compact codes, and confidence gates instead of parsing verbose free-text labels.
Related reading
- LLM sampling and decoding strategies explained — temperature, top-p, and where logprobs enter the loop
- LLM grammar-constrained decoding explained — hard guarantees beyond logit bias
- LLM confidence calibration explained — turning scores into trustworthy gates
- LLM intent classification explained — routing queries with classifiers and embeddings