Out-of-distribution detection explained

Harbor Analytics shipped a document classifier in February that labeled vendor invoices, purchase orders, and credit memos with 98.1% accuracy on a held-out test set of machine-generated PDFs. In April a logistics partner onboarded and began uploading smartphone photos of crumpled fuel receipts, handwritten delivery slips, and blurry warehouse labels. The model did not crash; it returned confident softmax scores above 0.94 on 41% of those uploads while assigning wrong document types. The failure was not random noise. The inputs were out-of-distribution (OOD): data that differs systematically from what the model saw during training. OOD detection answers a different question than accuracy on a fixed test set: should we trust this prediction at all?

This guide defines OOD detection and how it differs from anomaly detection, concept drift, and adversarial attacks; surveys score-based methods (maximum softmax probability, energy scores, ODIN), density and distance approaches (Mahalanobis distance in penultimate-layer embeddings), ensemble disagreement, and open-set recognition; works through a Harbor Analytics guardrail refactor; and closes with a method decision table, common pitfalls, and a production checklist. For calibrated prediction intervals with coverage guarantees, see conformal prediction; for crafted evasion inputs, see adversarial attacks.

What “out of distribution” means

A classifier trained on distribution P_train(X, Y) assumes future inputs are drawn from a similar distribution P_test(X, Y). When P_test ≠ P_train, performance can collapse even if test labels still belong to known classes. Three common shifts:

Covariate shift — P(X) changes but P(Y|X) is stable. Example: invoices scanned on phones instead of native PDFs; same label semantics, different pixel statistics.
Label shift — class priors change. Example: after tax season, refund requests dominate the upload queue.
Semantic shift / open set — new classes appear at inference time. Example: users upload passports or medical forms the model never saw; every forced prediction is wrong.

OOD detection (also called open-set recognition when paired with a reject option) assigns each input an OOD score. Inputs whose score falls on the OOD side of a tuned threshold (below it for a confidence score such as MSP, above it for a distance or energy score) route to human review, a fallback model, or a polite “unsupported document type” response instead of a hallucinated label.

OOD vs anomaly vs adversarial vs drift

These terms overlap in conversation but imply different mitigations:

Anomaly detection often targets rare in-class outliers (fraudulent transactions among legitimate ones). OOD detection targets inputs that fall outside the support of the training distribution.
Adversarial examples are deliberately crafted to fool a model while staying near a legitimate point. OOD inputs are usually organic — users uploading the wrong file type, not attackers optimizing perturbations.
Concept drift is a temporal phenomenon: the relationship between features and labels evolves. OOD detection is a per-input gate; drift monitoring is a population-level alarm that may trigger retraining.

Baseline score methods: MSP and energy

The simplest OOD detectors piggyback on a trained classifier with no architectural changes. They work despite a known failure mode: neural networks are often overconfident on OOD data, yet relative confidence patterns still carry signal.

Maximum softmax probability (MSP)

For logits z and softmax probabilities π_k = exp(z_k) / Σ_j exp(z_j), the MSP score is max_k π_k. In-distribution (ID) points tend to have higher MSP than OOD points. Flag inputs where MSP falls below a threshold τ tuned on a validation set that includes curated OOD examples.

MSP is a strong zero-cost baseline but fails when OOD inputs accidentally activate a single class strongly (e.g. a blank page that looks like a “cover sheet” class). Always benchmark MSP before adopting heavier methods.
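As a minimal sketch of the MSP rule described above, the following NumPy snippet computes max_k π_k from a batch of logits and flags rows whose score falls below τ. The logits and the value tau=0.7 are illustrative assumptions, not values from Harbor's system.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def msp_score(logits: np.ndarray) -> np.ndarray:
    """Maximum softmax probability per row; higher means more ID-like."""
    return softmax(logits).max(axis=1)

def flag_ood_by_msp(logits: np.ndarray, tau: float) -> np.ndarray:
    """True where MSP falls below tau, i.e. the input is flagged as OOD."""
    return msp_score(logits) < tau

# Hypothetical logits for three inputs over three document classes.
logits = np.array([
    [9.0, 1.0, 0.5],   # one class dominates
    [2.0, 1.9, 1.8],   # nearly flat
    [0.0, 0.0, 0.0],   # exactly flat, so MSP is 1/3
])
flags = flag_ood_by_msp(logits, tau=0.7)  # tau is an illustrative value
```

In practice τ would come from a validation set that mixes ID and curated OOD inputs rather than being fixed by hand.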

Energy score

The energy score uses raw logits without softmax: E(x) = −T · log Σ_k exp(z_k / T) where temperature T is often 1. Lower energy correlates with ID data. Energy-based OOD detection often outperforms MSP on vision benchmarks because softmax normalization discards the overall magnitude of the logits, which the energy score retains.
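A small sketch of the energy formula above, written with a stable log-sum-exp. One way to see how it differs from MSP: adding a constant to every logit leaves the softmax output unchanged but shifts the energy by that constant. The example logits are hypothetical.

```python
import numpy as np

def energy_score(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """E(x) = -T * log sum_k exp(z_k / T); lower means more ID-like."""
    scaled = logits / temperature
    row_max = scaled.max(axis=1, keepdims=True)
    # Stable log-sum-exp: subtract the row max before exponentiating.
    lse = row_max.squeeze(1) + np.log(np.exp(scaled - row_max).sum(axis=1))
    return -temperature * lse

# Two hypothetical logit rows that differ only by a constant offset of 5.
# Their softmax outputs (and therefore MSP) are identical; their energies are not.
logits = np.array([
    [5.0, 0.0, 0.0],
    [0.0, -5.0, -5.0],
])
energies = energy_score(logits)
```

The second row has the higher energy, so an energy threshold can separate the two inputs even though an MSP threshold cannot.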

ODIN: input preprocessing plus temperature scaling

ODIN (Out-of-DIstribution detector for Neural networks) combines temperature scaling on logits with a small gradient-based input perturbation that pushes each input toward a higher maximum softmax score. ID inputs tend to respond more strongly than OOD inputs, which widens the gap between their scores. It improves separation at inference cost (an extra backward and forward pass per input). Useful when MSP alone is close but not quite production-ready.
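To make the two ODIN ingredients concrete, here is a sketch on a deliberately simple stand-in model: a linear softmax classifier, chosen only because its input gradient can be written analytically in NumPy. A real deployment would obtain the gradient from the network's autograd framework. The function name, the weights, and the values of temperature and epsilon are all illustrative assumptions.

```python
import numpy as np

def tempered_softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Row-wise softmax of logits / T, shifted for numerical stability."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=1, keepdims=True)
    exp = np.exp(scaled)
    return exp / exp.sum(axis=1, keepdims=True)

def odin_score_linear(x, W, b, temperature=1000.0, epsilon=1e-3):
    """ODIN-style score for a linear classifier z = x @ W.T + b.

    x: (n, d) inputs, W: (k, d) weights, b: (k,) biases.
    Returns the tempered max softmax after the input perturbation.
    """
    probs = tempered_softmax(x @ W.T + b, temperature)   # (n, k)
    top = probs.argmax(axis=1)                           # predicted class per row
    # Gradient of log softmax_top(z / T) with respect to x for a linear model:
    # (W[top] - sum_k probs_k * W[k]) / T
    grad = (W[top] - probs @ W) / temperature            # (n, d)
    # Step in the direction that raises the predicted-class softmax score.
    x_perturbed = x + epsilon * np.sign(grad)
    return tempered_softmax(x_perturbed @ W.T + b, temperature).max(axis=1)

# Hypothetical model and inputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
x = rng.normal(size=(5, 4))

base = tempered_softmax(x @ W.T + b, 1000.0).max(axis=1)  # tempered MSP, no perturbation
odin = odin_score_linear(x, W, b)
```

Both temperature and epsilon are hyperparameters that would need tuning on a validation set with curated OOD examples.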

Representation-based detection: Mahalanobis and embeddings

Classifiers learn intermediate representations that cluster by class in ID data. OOD points often land far from all class centroids in penultimate-layer space.

Mahalanobis distance

Fit class-conditional Gaussians on penultimate activations φ(x) from the training set: estimate a mean μ_c for each class c and a single covariance Σ shared across all classes. The Mahalanobis score is:

M(x) = min_c (φ(x) − μ_c)^T Σ⁻¹ (φ(x) − μ_c)

Higher M(x) suggests OOD. This method is especially effective on vision and document models where OOD inputs produce off-manifold embeddings even when softmax stays high. Compute Σ with shrinkage (Ledoit-Wolf) when feature dimension is large relative to sample count.
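The fitting and scoring steps above can be sketched in NumPy as follows. This sketch swaps in a plain ridge term for Ledoit-Wolf shrinkage to stay dependency-free (scikit-learn's LedoitWolf estimator is one option for the real thing); the function names, the ridge value, and the synthetic 2-d features are assumptions for illustration.

```python
import numpy as np

def fit_mahalanobis(features: np.ndarray, labels: np.ndarray, ridge: float = 1e-3):
    """Estimate per-class means and a shared (tied) precision matrix."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Center each sample on its own class mean, then pool for the tied covariance.
    centered = features - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(features)
    # Ridge regularization standing in for Ledoit-Wolf shrinkage (assumption).
    cov += ridge * np.eye(cov.shape[0])
    return means, np.linalg.inv(cov)

def mahalanobis_score(features: np.ndarray, means: np.ndarray, precision: np.ndarray):
    """M(x) = min_c (phi - mu_c)^T Sigma^-1 (phi - mu_c); higher means more OOD-like."""
    diffs = features[:, None, :] - means[None, :, :]            # (n, C, d)
    d2 = np.einsum("ncd,de,nce->nc", diffs, precision, diffs)   # (n, C)
    return d2.min(axis=1)

# Synthetic stand-in for penultimate-layer features: two classes in 2-d.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(5.0, 1.0, (200, 2))])
labels = np.repeat([0, 1], 200)
means, precision = fit_mahalanobis(feats, labels)

queries = np.array([[0.2, -0.1],     # close to the class-0 mean
                    [20.0, -20.0]])  # far from both class means
scores = mahalanobis_score(queries, means, precision)
```

For 2048-d embeddings the covariance estimate is much noisier than in this toy case, which is where proper shrinkage matters.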

Contrastive and self-supervised backbones

Models pretrained with contrastive objectives (SimCLR, CLIP-style dual encoders) often yield more separable embedding spaces for OOD. If your pipeline already uses a frozen encoder plus linear head, Mahalanobis on encoder outputs is a natural first production step before fine-tuning the entire stack.

Ensembles, density models, and generative scores

When single-model scores are insufficient, combine multiple uncertainty signals:

Deep ensembles — train M models with different seeds; high predictive variance or low average MSP across members flags OOD. Cost: M× inference latency unless distilled.
Monte Carlo dropout — enable dropout at inference and sample T forward passes; variance in predictions approximates epistemic uncertainty. Cheaper than full ensembles but less reliable on large transformers without careful tuning.
Normalizing flows and VAEs — estimate log p(x) under a generative model trained on ID data (exactly for flows, via a lower bound for VAEs). Low likelihood suggests OOD. Watch for likelihood traps: flow models can assign higher likelihood to some OOD images than to ID images, because likelihood tracks low-level pixel statistics and image complexity more than semantic content.
k-NN distance in embedding space — store ID training embeddings; flag inputs whose distance to the k-th nearest neighbor exceeds a calibrated threshold. Simple, interpretable, scales with vector DB tooling.
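Two of the signals listed above are simple enough to sketch directly: disagreement across ensemble members (or MC dropout passes) and distance to the k-th nearest ID embedding. The k-NN search here is brute force for clarity; a vector index would replace it at scale. Function names and the toy arrays are hypothetical.

```python
import numpy as np

def ensemble_disagreement(member_probs: np.ndarray) -> np.ndarray:
    """Mean per-class variance across members; higher means more disagreement.

    member_probs: (M, n, k) softmax outputs from M models or M dropout passes.
    """
    return member_probs.var(axis=0).mean(axis=1)

def knn_distance(queries: np.ndarray, bank: np.ndarray, k: int = 5) -> np.ndarray:
    """Euclidean distance from each query to its k-th nearest bank embedding."""
    # Brute-force pairwise distances: fine for a sketch, not for a large bank.
    dists = np.linalg.norm(queries[:, None, :] - bank[None, :, :], axis=2)  # (n, N)
    # np.partition places the k-th smallest value at index k - 1 of each row.
    return np.partition(dists, k - 1, axis=1)[:, k - 1]

# Two members, two inputs, two classes: members agree on input 0, disagree on input 1.
member_probs = np.array([
    [[0.9, 0.1], [0.9, 0.1]],
    [[0.9, 0.1], [0.1, 0.9]],
])
disagreement = ensemble_disagreement(member_probs)

# Toy ID embedding bank and two queries, one inside the bank and one far away.
bank = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
queries = np.array([[0.0, 0.0], [10.0, 10.0]])
kth = knn_distance(queries, bank, k=2)
```

Note that a query that is itself stored in the bank has a nearest-neighbor distance of zero, so thresholds should be calibrated on held-out ID embeddings rather than on the bank's own members.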

In production, a weighted combination of MSP (or energy), Mahalanobis distance, and ensemble disagreement often beats any single score on tabular and document pipelines where failure modes are heterogeneous.
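One simple way to form such a weighted combination is to z-score each signal against ID validation statistics and take a weighted sum. This is a sketch under assumptions: the weights, the example numbers, and the function names are hypothetical, and every column is assumed to be oriented so that higher means more OOD-like (MSP negated, energy and Mahalanobis kept as they are).

```python
import numpy as np

def fit_standardizer(id_scores: np.ndarray):
    """Per-signal mean and std estimated on ID validation data. id_scores: (n, s)."""
    return id_scores.mean(axis=0), id_scores.std(axis=0) + 1e-12

def combined_ood_score(scores, mean, std, weights):
    """Weighted sum of z-scored signals; higher means more OOD-like."""
    z = (scores - mean) / std
    return z @ weights

# Hypothetical ID validation rows with columns: -MSP, energy, Mahalanobis.
id_val = np.array([
    [-0.99, -9.0, 2.0],
    [-0.97, -8.0, 3.0],
    [-0.95, -7.0, 4.0],
])
mean, std = fit_standardizer(id_val)
weights = np.array([0.2, 0.4, 0.4])  # illustrative weights, not tuned values

queries = np.array([
    [-0.97, -8.0, 3.0],    # sits at the ID validation mean
    [-0.40, -2.0, 40.0],   # low confidence, high energy, far from class means
])
combined = combined_ood_score(queries, mean, std, weights)
```

Learning the weights from a labeled mix of ID and OOD validation inputs, for example with logistic regression, is a natural refinement of the hand-set weights used here.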

Worked example: Harbor Analytics document intake gate

Harbor’s invoice classifier used a ResNet-50 backbone fine-tuned on 120k synthetic and native PDF renders across three classes. After the partner photo incident, the team added an OOD gate before any label was returned to the API:

Curated OOD validation set — 8k images spanning phone photos, screenshots, handwritten notes, unrelated IDs, and blank uploads; none overlapped training classes.
Score stack — energy score on logits, Mahalanobis on penultimate 2048-d embeddings (shared covariance), and k-NN distance to the 5th nearest training embedding (FAISS index).
Threshold tuning — on a mixed ID/OOD validation stream, target 95% true-positive rate on ID (do not reject valid invoices) while maximizing OOD recall. Chose thresholds via ROC analysis per score, then logistic stacking of the three features.
Routing policy — stacked score below gate → return top-1 label; above gate → HTTP 422 with reason: unsupported_document and queue for human triage. No confident wrong labels on OOD uploads in a two-week shadow deployment.
Monitoring — weekly histogram of OOD scores on live traffic; alert when median shifts (signals new upload channel or drift).
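The threshold-tuning and routing steps above can be sketched as follows, assuming a single stacked score where higher means more OOD-like. The gate is set at the 95th percentile of ID validation scores so that about 95% of valid documents pass, and OOD recall is then measured at that gate. The response payload fields and all numbers are illustrative assumptions.

```python
import numpy as np

def threshold_at_id_tpr(id_scores: np.ndarray, id_tpr: float = 0.95) -> float:
    """Gate value such that roughly id_tpr of ID validation scores fall at or below it."""
    return float(np.quantile(id_scores, id_tpr))

def ood_recall(ood_scores: np.ndarray, gate: float) -> float:
    """Fraction of curated OOD validation inputs that the gate rejects."""
    return float((ood_scores > gate).mean())

def route(score: float, top1_label: str, gate: float) -> dict:
    """Return the label at or below the gate, otherwise a 422-style rejection."""
    if score <= gate:
        return {"status": 200, "label": top1_label}
    return {"status": 422, "reason": "unsupported_document", "queue": "human_triage"}

# Hypothetical stacked scores on validation data.
id_scores = np.arange(100) / 100.0             # 0.00, 0.01, ..., 0.99
ood_scores = np.array([0.5, 1.2, 3.0, 0.99])

gate = threshold_at_id_tpr(id_scores, id_tpr=0.95)
id_pass_rate = float((id_scores <= gate).mean())
recall = ood_recall(ood_scores, gate)

accepted = route(0.10, "invoice", gate)
rejected = route(2.00, "invoice", gate)
```

Fixing the ID pass rate first and reading off OOD recall second keeps the false-reject budget explicit, which matches the tuning order described in the steps above.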

Clean accuracy on the original PDF test set did not drop, because the gate leaves the classifier itself untouched and only decides whether its answer is returned. Human triage volume rose 6%, acceptable versus the prior 41% silent error rate on photo uploads.

Method decision table

Scenario	Recommended approach	Avoid
Need a same-day baseline on existing classifier	MSP + energy score; tune τ on held-out OOD set	Shipping with no OOD eval because test accuracy is high
Vision / document models with confident OOD errors	Mahalanobis on penultimate layer + energy; optional ODIN	MSP alone when OOD activates a spurious class
Safety-critical or high-stakes automation	Deep ensemble or MC dropout + conformal abstain; human fallback	Single scalar threshold with no ID recall constraint
Tabular fraud / risk scoring	Isolation Forest on features + model score disagreement; domain rules	Image OOD tooling on structured data without constraint sets
Open-set with truly novel classes expected	Explicit “unknown” class in training + outlier exposure (OE)	Forcing closed-set softmax over unseen semantics
LLM classification / routing	Logit margin on label tokens, embedding distance to exemplar bank	Raw softmax on free-form generations without calibration

Common pitfalls

No OOD validation set. Tuning thresholds on ID data alone produces thresholds that never fire — or reject everything.
Leakage from OOD into training. If OOD examples resemble ID classes and slip into fine-tuning, detectors lose separation.
Optimizing only OOD recall. Rejecting 30% of legitimate invoices to catch every photo upload destroys product utility; balance ID true-positive rate.
Likelihood traps in generative OOD. High p(x) does not always mean ID for flow models; validate on your domain.
Ignoring near-OOD. Slightly different but in-support inputs (new invoice template from an existing vendor) need different handling than true open-set uploads.
Static thresholds forever. New upload channels shift score distributions; re-calibrate quarterly or when monitoring histograms drift.
Conflating OOD with adversarial. OOD gates do not stop gradient-based evasion; layer defenses appropriately.
Overconfidence after one score. Stack complementary signals; MSP and Mahalanobis failures are often uncorrelated.

Production checklist

Build a curated OOD validation set representative of real failure modes (not only Gaussian noise).
Benchmark MSP and energy score before investing in heavier detectors.
Add representation-based scores (Mahalanobis or k-NN) for vision, document, and embedding pipelines.
Tune thresholds with explicit ID true-positive targets, not OOD recall alone.
Define a routing policy: reject, fallback model, or human queue — never silent wrong labels.
Log OOD scores and outcomes for every production request.
Monitor score distributions weekly; alert on median or tail shifts.
Re-evaluate after model retraining, new data sources, or partner onboarding.
Pair OOD gates with drift monitoring for population-level changes.
Document residual risk: OOD detection reduces but does not eliminate open-set errors.
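For the monitoring items in the checklist, one possible median-shift alert compares live OOD scores against a reference window and measures the shift in units of the reference interquartile range. The function name, the 0.5 IQR alert level, and the synthetic scores are assumptions; a production monitor would likely track tail quantiles as well.

```python
import numpy as np

def median_shift_alert(reference_scores, live_scores, max_shift_in_iqr=0.5):
    """Alert when the live median moves more than max_shift_in_iqr reference IQRs.

    Returns (alert, shift) where shift is the absolute median change divided
    by the interquartile range of the reference window.
    """
    ref_median = np.median(reference_scores)
    q25, q75 = np.percentile(reference_scores, [25, 75])
    iqr = max(q75 - q25, 1e-12)  # guard against a degenerate reference window
    shift = abs(np.median(live_scores) - ref_median) / iqr
    return bool(shift > max_shift_in_iqr), float(shift)

# Synthetic reference window and two hypothetical live weeks.
reference = np.linspace(0.0, 1.0, 101)   # median 0.5, IQR 0.5
stable_week = reference.copy()
shifted_week = reference + 0.4           # e.g. a new upload channel raises scores

stable_alert, stable_shift = median_shift_alert(reference, stable_week)
shifted_alert, shifted_shift = median_shift_alert(reference, shifted_week)
```

Scaling by the IQR keeps the alert level meaningful across scores with different ranges, such as energy versus Mahalanobis distance.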

Key takeaways

High test accuracy does not mean inputs at inference resemble training data — OOD detection is the gate that asks whether to answer at all.
MSP and energy scores are free baselines every production classifier should measure before launch.
Mahalanobis distance on embeddings catches many confident wrong predictions that softmax misses.
Threshold tuning is a product decision balancing false rejects against silent errors — not a pure ML metric exercise.
OOD, anomaly, adversarial, and drift are related but distinct — use the right tool per failure mode and combine layers.