Guide

LLM logprobs and logit bias explained

Harbor Legal's contract triage pipeline asked a 70B model to output a clause label — INDEMNITY, TERMINATION, IP_ASSIGNMENT — as plain text. At temperature 0.2, labels were usually right, but the system could not tell a confident 98% classification from a hesitant guess that happened to print the same string. Parsing the answer gave a label; it did not give a score. Worse, occasional creative prefixes (“The clause type is INDEMNITY”) broke downstream JSON validators. Engineers enabled logprobs on the completion API, constrained the first token to a closed vocabulary via prompt design plus logit bias, and read normalized probabilities directly from the logits. Mis-routed reviews to senior attorneys dropped 31%, and low-confidence clauses now queue for human review instead of silently auto-filing. Logprobs are not a niche research feature — they are how production teams extract calibrated signals from models they already run, without another fine-tune.

Every autoregressive LLM computes a probability distribution over the vocabulary at each decoding step. Logprobs (log probabilities) expose those scores in API responses. Logit bias lets callers add a fixed offset to specific token logits before sampling — nudging toward or away from tokens without changing model weights. Together they power classification, confidence gating, constrained multiple-choice outputs, and light steering cheaper than fine-tuning. This guide covers how logprobs are computed and returned, top-logprobs for multi-class reads, logit bias semantics and limits, pairing with grammar-constrained decoding and structured outputs, the Harbor Legal triage refactor, a technique decision table, pitfalls, and a production checklist.

What logprobs are

At each generation step the model outputs a vector of logits — unnormalized scores — one per token in the vocabulary. A softmax converts logits to probabilities that sum to 1. The natural logarithm of the chosen token's probability is its logprob. APIs typically return logprobs as negative numbers closer to zero for more likely tokens (e.g. −0.05 for a near-certain token, −3.2 for a rare one).

OpenAI-compatible servers accept logprobs: true and top_logprobs: N on chat or completions endpoints. The response includes, per generated token, the logprob of the sampled token plus the top N alternatives with their logprobs. Anthropic and open-source stacks (vLLM, TGI, llama.cpp server) expose similar fields with slightly different JSON shapes. Logprobs apply to output tokens; some providers also return prompt-token logprobs for perplexity-style evaluation.

Why logprobs beat parsing text for classification

A classic pattern asks the model: “Classify this email as SPAM or HAM. Answer with one word.” At temperature 0, you might always get SPAM or HAM, but you still do not know margin. With logprobs on the first generated token, you read P(SPAM) and P(HAM) directly from top_logprobs. Convert logprobs to probabilities with exp(logprob), renormalize over your label set if needed, and use the max as prediction plus entropy or margin as confidence. This is often more stable than asking for JSON and more informative than greedy string match.

Logit bias: steering before sampling

Logit bias is a map from token ID to a bias value (typically −100 to +100 in OpenAI's API) added to that token's logit before temperature, top-p, and sampling run. A large positive bias can effectively force a token; a large negative bias suppresses it. Bias is per-token, not per-string — multi-token labels require tokenizing each label and biasing the first token, or using a single-token abbreviation per class.

Common uses:

  • Suppress junk prefixes — bias against tokens like “The”, “I”, or newline when you want a bare label.
  • Nudge format tokens — mild positive bias on { when you want JSON-first outputs before schema validation kicks in.
  • Soft vocabulary restriction — negative bias on all tokens outside a small allowlist (weaker than true constrained decoding; large biases can distort calibration).

Logit bias does not change model weights. It is ephemeral inference-time control. For hard guarantees (valid JSON, enum-only fields), prefer grammar-constrained decoding or provider structured output modes; use bias for gentle nudges and format hygiene.

Production patterns

Closed-vocabulary classification

Tokenize each class label. Pick labels that start with distinctive first tokens when possible (single-token labels are ideal). Prompt with few-shot examples, set max_tokens: 1 (or generate only until label complete), enable top_logprobs covering all classes, read probabilities, apply a confidence threshold for human escalation. Calibrate thresholds on a held-out set — raw logprobs are not always well-calibrated across domains; see our confidence calibration guide.

Sequence-level confidence

Sum or average token logprobs across a completion for a coarse perplexity-style score. Useful for detecting hallucination-prone answers in RAG: low average logprob on the answer span may flag “model is guessing.” Pair with citation checks rather than trusting logprobs alone.

Contrastive / mutual scoring

Score two candidate completions by total logprob under the same prompt prefix. Useful for A/B phrasing, choosing between tool-call arguments, or reranking draft outputs from speculative decoding without a separate reward model.

Cost and latency notes

Returning top_logprobs adds serialization overhead but negligible compute — logits already exist. The expensive part is still forward passes. Logprob-based classifiers with max_tokens: 1 are often cheaper than full JSON generations with repair loops.

Harbor Legal contract triage refactor

Harbor Legal's v1 pipeline used a chat prompt ending in “Respond with exactly one label from the list.” Labels were long snake_case strings; the tokenizer split them into two or three tokens, so first-token logprobs alone were ambiguous. The refactor:

  1. Short codes — mapped each clause type to a unique single-token code (I42, T09, etc.) in the prompt legend.
  2. First-token classificationmax_tokens: 1, logprobs: true, top_logprobs: 20, temperature 0.
  3. Logit bias hygiene — −100 bias on common preamble tokens; mild +2 on code-prefix characters shared across labels.
  4. Confidence gate — if normalized max probability < 0.85 or margin to second-best < 0.15, route to attorney queue instead of auto-folder.
  5. Calibration set — 2,400 labeled clauses; tuned thresholds to hit 96% precision on auto-route (up from 89%).
  6. Fallback — low-confidence items run through a slower chain with full-text rationale via prompt chaining.

They did not replace structured JSON exports for final filing — codes are an internal routing layer. Export still uses schema-validated objects after human confirmation on edge cases.

Technique decision table

Your situation Prefer Avoid
Small closed label set, need confidence scores Single-token codes + first-token top_logprobs Parsing free-form natural language labels
Must guarantee valid JSON or enum fields Grammar / structured output modes Extreme logit bias as a fake schema enforcer
Gentle format nudge (start with brace, no preamble) Moderate logit bias on a few tokens Bias ±100 on dozens of tokens (distorts ranking)
Creative long-form generation Normal sampling; logprobs for monitoring only Heavy bias that fights the model's distribution
Domain shift from training data Recalibrate thresholds; few-shot refresh Trusting raw logprob magnitudes as true probabilities
Multi-token class names unavoidable Constrained decoding over token sequences or score full label logprob First-token-only logprobs on ambiguous prefixes

Common pitfalls

  • Multi-token labels with first-token-only scoring. “TERMINATION” and “TERMINATE” may share a first token; use single-token codes or score complete sequences.
  • Treating logprobs as calibrated probabilities. They are model scores, not frequentist frequencies; validate on your data.
  • Logit bias so strong it collapses diversity. +100 on one token makes logprobs meaningless for confidence.
  • Ignoring BPE merge artifacts. Leading spaces change token IDs; tokenize labels with the same encoder the API uses.
  • Comparing logprobs across different models or temperatures. Thresholds do not transfer without recalibration.
  • Logprobs on tool-call channels. Some routers strip or alter logprob fields on function-call paths; test your stack.
  • Privacy leakage in logged logprobs. Top alternatives can reveal sensitive token candidates; redact logs in regulated workflows.

Production checklist

  • Define a closed label set; prefer single-token codes where possible.
  • Tokenize every label with the production model's tokenizer; document token IDs.
  • Enable logprobs and sufficient top_logprobs to cover all classes.
  • Set max_tokens to the minimum needed for the label format.
  • Apply mild logit bias only for preamble suppression; avoid extreme biases on labels.
  • Compute normalized probabilities and margin (top1 − top2) per request.
  • Calibrate confidence thresholds on a labeled validation set for your precision target.
  • Route low-confidence predictions to human review or a stronger fallback chain.
  • Log prediction, margin, and raw logprobs for drift monitoring (with PII policy).
  • For hard schema needs, pair logprobs with grammar or structured output validation.
  • Re-run calibration after model version upgrades or prompt template changes.
  • Document API differences if you serve via LiteLLM or multiple providers.

Key takeaways

  • Logprobs expose the model's token-level probability distribution at inference time — the foundation for classification confidence without a separate scorer.
  • First-token top_logprobs over single-token label codes is a high-precision, low-latency pattern for triage and routing.
  • Logit bias nudges logits before sampling; it is not a substitute for constrained decoding or fine-tuning when hard guarantees matter.
  • Raw logprobs need threshold calibration on your data — treat them as ranking signals, not oracle probabilities.
  • Harbor Legal cut mis-routes 31% by combining logprobs, compact codes, and confidence gates instead of parsing verbose free-text labels.

Related reading