Conformal prediction explained
A fraud classifier outputs 0.87 — but what does that number mean for the
analyst reviewing the alert? A demand forecaster predicts 1,240 units — how
wide should safety stock be? Standard ML models return point predictions or softmax
probabilities that may be miscalibrated under shift.
Conformal prediction wraps any base model with a lightweight calibration
layer that produces prediction sets (classification) or
intervals (regression) with a finite-sample coverage guarantee: if you
ask for 90% coverage, the true label lands inside the set with probability at least 90%,
averaged over calibration and test draws, without assuming Gaussian noise or retraining the neural net. This guide covers
exchangeability, split conformal, nonconformity scores, conditional coverage limits, a
Harbor Payments fraud triage worked example, a method decision table, common pitfalls,
and a practitioner checklist, alongside our model calibration guide, Bayesian inference guide, and anomaly detection guide.
What problem conformal prediction solves
Production ML teams need two things from uncertainty: honest intervals (when we say 95%, we mean 95%) and actionable sets (flag only when the model is genuinely unsure). Post-hoc calibration — Platt scaling, temperature scaling, isotonic regression — improves probability alignment but does not by itself produce sets with proven coverage on new data.
Conformal prediction adds a second stage after training. Given a held-out calibration set of labeled examples the model never saw during training, you score how "nonconforming" each prediction would have been, then use those scores to size prediction sets or intervals on live traffic. The guarantee is distribution-free: it holds for any exchangeable data and any base model (logistic regression, XGBoost, transformer) as long as calibration and test data are drawn from the same underlying process.
When conformal prediction is a good fit
- High-stakes triage — fraud review, medical screening, content moderation queues where human bandwidth is limited.
- Regulatory or SLA language — "we cover the true class in 95% of cases" with audit trail.
- Black-box models — you cannot easily derive analytic confidence intervals from the architecture.
- Batch or streaming scoring where a calibration refresh cadence is acceptable (daily/weekly).
Conformal prediction is weaker when data is heavily non-exchangeable (strong temporal drift without recalibration), when you need tight intervals for every individual subgroup (marginal coverage does not imply conditional coverage), or when labels arrive too slowly to maintain a fresh calibration set. In those cases, combine conformal layers with drift monitoring from our model drift guide or pursue full Bayesian posteriors when the modeling cost is justified.
Exchangeability and the coverage guarantee
The core assumption is exchangeability: if you shuffle calibration and test points, their joint distribution is unchanged. IID sampling satisfies this; so does simple random splitting of a static dataset. Time-series data generally violates exchangeability, so it calls for specialized variants (conformalized quantile regression with proper sliding windows, adaptive conformal inference).
Given desired miscoverage rate α (e.g. 0.10 for 90% coverage), split
conformal computes a threshold q from calibration nonconformity scores so
that, for a new point, the prediction set includes all labels whose score is at most
q. Under exchangeability,
P(true label ∈ prediction set) ≥ 1 − α
The inequality holds for any finite calibration set size under split conformal (it is not merely asymptotic). Full conformal and cross-conformal reuse data more efficiently but cost more compute; jackknife+ variants approximate full conformal at scale.
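A short sketch of why the corrected quantile rank yields the bound may help. The notation here is an assumption for illustration, not taken from the guide: s_1, …, s_n are the calibration scores, s_{n+1} is the score of the new point at its true label, and s_(k) is the k-th smallest calibration score (this requires k ≤ n).

```latex
\begin{aligned}
k &= \lceil (n+1)(1-\alpha) \rceil, \qquad q = s_{(k)} \\
\mathbb{P}\bigl(s_{n+1} \le q\bigr) &\ge \frac{k}{n+1} \;\ge\; 1-\alpha
\end{aligned}
```

The first inequality follows because, under exchangeability, the rank of s_{n+1} among all n+1 scores is equally likely to be any position, and s_{n+1} ≤ s_(k) exactly when that rank is at most k (ties can only increase the probability). The second follows from the ceiling in the definition of k.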
Marginal vs conditional coverage
The guarantee is marginal — averaged over random draws of the test point and calibration set. A model can satisfy 90% marginal coverage while covering only 70% of a rare fraud subtype. Research on conditional conformal prediction and group-aware calibration targets subpopulation fairness, but no free method guarantees tight conditional coverage for all groups simultaneously. Document which subgroups you monitor separately in production dashboards.
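The gap between marginal and per-segment coverage can be made concrete with a small slicing helper. This is a minimal sketch on made-up data; the function name `coverage_by_group` and the segment labels are hypothetical, and `covered` is assumed to be a per-row flag recording whether the true label landed in the prediction set.

```python
import numpy as np

def coverage_by_group(covered, groups):
    """Return {group: (empirical coverage, row count)} from per-row hit flags."""
    covered = np.asarray(covered, dtype=bool)
    groups = np.asarray(groups)
    report = {}
    for g in np.unique(groups):
        mask = groups == g
        report[str(g)] = (float(covered[mask].mean()), int(mask.sum()))
    return report

# Toy data: the overall rate looks fine while one small segment is badly covered.
covered = np.array([True] * 85 + [False] * 5 + [True] * 5 + [False] * 5)
groups = np.array(["established"] * 90 + ["new_seller"] * 10)

marginal = float(covered.mean())           # overall hit rate across all rows
by_group = coverage_by_group(covered, groups)
```

In this toy case the marginal rate meets a 90% target even though the smaller segment is covered only half the time, which is the pattern a per-segment dashboard is meant to surface.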
Adaptive and online variants
Static split conformal assumes tomorrow looks like yesterday. When fraudsters rotate tactics or seasonality shifts demand, adaptive conformal inference (ACI) updates the miscoverage budget online: if recent sets under-cover, widen thresholds; if coverage is slack, tighten to shrink analyst queues. Weighted conformal reweights calibration points by similarity to the current covariate mix — useful when only the recent past is relevant but labels on old data remain valid. Neither removes the need for labeled calibration refresh; they stretch the interval between full retrains.
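The online update behind adaptive conformal inference can be sketched in a few lines. This assumes the commonly cited rule that moves the working miscoverage level by a step proportional to (target − error); the function name `aci_update` and the step size `gamma=0.01` are illustrative choices, not values from this guide.

```python
def aci_update(alpha_t, target_alpha, miscovered, gamma=0.01):
    """One adaptive conformal inference step.

    miscovered is True when the latest prediction set missed the true label.
    A miss lowers alpha_t (wider sets next time); a hit raises it slightly.
    """
    err = 1.0 if miscovered else 0.0
    return alpha_t + gamma * (target_alpha - err)

# Simulated feedback: a burst of misses followed by a run of hits.
target = 0.10
alpha_t = target
history = []
for miscovered in [True] * 5 + [False] * 20:
    alpha_t = aci_update(alpha_t, target, miscovered)
    history.append(alpha_t)
```

The working level can drift outside [0, 1] after long streaks, so a real deployment would likely clip it before looking up the calibration quantile.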
Split conformal workflow
- Split data — training (fit base model), calibration (compute scores only), optional test holdout.
- Train base model on the training split using any standard pipeline from our ML fundamentals guide.
- Define a nonconformity score on calibration points — how "surprising" each true label is under the model.
- Compute quantile — take the ⌈(n+1)(1−α)⌉/n quantile of calibration scores as threshold q.
- Score live points — include every label with score ≤ q (classification) or form a symmetric interval around the prediction (regression).
The calibration set must remain labeled and representative. When fraud patterns shift, refresh calibration weekly or trigger on population stability index spikes — same discipline as recalibrating softmax temperatures in model calibration workflows.
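The workflow steps above can be sketched with NumPy alone. This is a minimal illustration, not a production implementation: the calibration probabilities are synthetic stand-ins for a trained classifier's outputs, the score is 1 − p(true label), and the helper names `conformal_threshold` and `prediction_sets` are hypothetical.

```python
import numpy as np

def conformal_threshold(scores, alpha):
    """Finite-sample-corrected quantile of calibration nonconformity scores."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the order statistic
    if k > n:
        return np.inf  # too few calibration points: every label is included
    return scores[k - 1]

def prediction_sets(probs, q):
    """Boolean mask (rows x classes): True where the score 1 - p is at most q."""
    return (1.0 - np.asarray(probs)) <= q

# Synthetic stand-in for a trained classifier's calibration outputs.
rng = np.random.default_rng(0)
n_cal, n_classes, alpha = 500, 3, 0.10
cal_probs = rng.dirichlet(np.ones(n_classes), size=n_cal)
# Draw labels from the predicted distribution so the toy model is calibrated.
cal_labels = np.array([rng.choice(n_classes, p=p) for p in cal_probs])

# Score each calibration point by how little mass the true label received.
cal_scores = 1.0 - cal_probs[np.arange(n_cal), cal_labels]
q = conformal_threshold(cal_scores, alpha)

# Apply the stored threshold to new points.
test_probs = rng.dirichlet(np.ones(n_classes), size=5)
sets = prediction_sets(test_probs, q)
```

Returning infinity when the corrected rank exceeds the calibration size is one possible convention: it keeps the guarantee by including every label, at the cost of uninformative sets.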
Nonconformity scores for classification and regression
Classification
Common scores:
- 1 − p(y_true) — simple, works with any probabilistic classifier; sets can be large if the model is flat.
- Cumulative mass — sort classes by descending probability, include until true class enters; often yields smaller sets.
- Adaptive prediction sets (APS) — include highest-probability labels until cumulative prob exceeds a data-driven cutoff tied to q.
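Empty sets are uncommon when the quantile is indexed correctly, but they can occur when every label scores above q, so decide in advance how to handle them; singleton sets mean high confidence. Large sets (many classes included) signal epistemic uncertainty, so route them to human review instead of auto-decline.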
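A cumulative-mass score of the kind described above can be sketched as follows. This is the deterministic (non-randomized) variant, written as an illustration rather than as any library's implementation; `aps_scores` is a hypothetical helper and the threshold value is made up for the example.

```python
import numpy as np

def aps_scores(probs):
    """Score for every (row, class): total mass of classes ranked at or above it."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs, axis=1)             # classes by descending prob
    cum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    scores = np.empty_like(probs)
    np.put_along_axis(scores, order, cum, axis=1)  # map back to class order
    return scores

probs = np.array([[0.70, 0.20, 0.10],
                  [0.40, 0.35, 0.25]])
scores = aps_scores(probs)

q = 0.80  # hypothetical threshold; in practice the calibration quantile
sets = scores <= q
```

With this threshold the confident first row yields a singleton and the flatter second row yields a two-label set. For calibration, the score of each point is the entry at its true label, for example `scores[np.arange(len(labels)), labels]`.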
Regression
Use absolute residual |y − ŷ| on calibration data. At inference, predict
interval [ŷ − q, ŷ + q]. For heteroscedastic noise, use
conformalized quantile regression (CQR): train lower/upper quantile
models, then conformalize their combined interval for valid coverage despite
non-constant variance.
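Both regression recipes can be compared on synthetic heteroscedastic data. In this sketch the point predictor and the lower/upper quantile bands are stand-in formulas rather than fitted models, and the helper names are hypothetical; the CQR score used is the larger of the two distances by which the label falls outside the band.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Order statistic at rank ceil((n + 1) * (1 - alpha)), or inf if n is too small."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else scores[k - 1]

rng = np.random.default_rng(1)
n_cal, alpha = 1000, 0.10
x = rng.uniform(0.0, 1.0, size=n_cal)
noise_sd = 0.5 + 2.0 * x                      # noise grows with x
y = 10.0 * x + rng.normal(0.0, noise_sd)

# Absolute-residual score: one constant half-width for every input.
y_hat = 10.0 * x                              # stand-in point predictor
q_abs = conformal_quantile(np.abs(y - y_hat), alpha)

# CQR score: how far y falls outside the stand-in band [lo, hi].
lo = 10.0 * x - noise_sd                      # deliberately too narrow a band
hi = 10.0 * x + noise_sd
cqr_scores = np.maximum(lo - y, y - hi)
q_cqr = conformal_quantile(cqr_scores, alpha)

def cqr_interval(lo_new, hi_new):
    """Widen the band by q_cqr (or shrink it when q_cqr is negative)."""
    return lo_new - q_cqr, hi_new + q_cqr
```

The absolute-residual interval has the same width everywhere, while the conformalized band keeps the input-dependent shape of the quantile models and only shifts both edges by a single scalar.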
Outlier and anomaly use
If every label scores above q, the set is empty — some implementations treat
this as "reject prediction." That behavior overlaps with
anomaly detection:
conformal layers excel when you want coverage statements on in-distribution points, not
open-world novelty discovery alone.
Implementing conformal layers in production
Libraries such as MAPIE (scikit-learn compatible), crepes,
and conformal utilities in PyTorch Lightning wrap the quantile step behind familiar
fit / predict APIs. A typical serving pattern:
- Offline — nightly job retrains base model if needed, recomputes calibration scores, stores threshold q and score metadata in a feature store or model registry artifact.
- Online — inference service returns point prediction plus set members; downstream rules engine maps set cardinality to approve / review / decline.
- Observability — emit histograms of set size, rolling coverage on delayed labels, and segment-level under-coverage alerts.
Conformal adds negligible latency (one pass through softmax or quantile heads plus comparison to a scalar threshold). The expensive part is maintaining labeled calibration data: budget analyst hours or active-learning loops to label edge cases that land in multi-member sets, which improves base model quality and tightens conformal thresholds over time.
Worked example: Harbor Payments fraud alert triage
Harbor Payments routes card transactions through a gradient-boosted fraud classifier. Analysts complained that fixed score cutoffs either flooded the queue or missed coordinated attacks. The team added split conformal APS on top of the existing model without retraining.
- Data split — 70% train (last 90 days), 15% calibration (days 91–100), 15% test (days 101–110), stratified by merchant category.
- Base model — unchanged XGBoost with 0.92 test AUC; probabilities passed through existing isotonic calibrator.
- Score — APS on calibration fraud/legit labels with target α = 0.05 (95% coverage of true class in the set).
- Policy — auto-approve only singleton sets predicting legit; auto-decline only singleton fraud; multi-label sets → analyst queue with SLA boost.
Results on the held-out week: marginal coverage 96.1% (above target), analyst queue volume −34% vs fixed 0.85 threshold, fraud capture −0.8% (acceptable trade). Slicing coverage by merchant tier exposed under-coverage on new marketplace sellers, which triggered weekly recalibration for that segment. The lesson: conformal sets turn opaque scores into operational rules with measurable guarantees, but subgroup dashboards remain mandatory.
Harbor also A/B tested α = 0.10 vs α = 0.05 on a shadow traffic slice: higher α shrank average set size (more singleton auto-decisions) but dropped coverage on high-value corporate cards below 90%. They shipped α = 0.05 globally with a stricter α = 0.02 override on corporate BIN ranges — a pragmatic compromise until conditional conformal methods mature for their scale.
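The triage policy and the per-segment α override described in this example could be expressed as a small rules layer. This is a hypothetical sketch: the function names, action strings, and the choice to send empty sets to analysts are assumptions for illustration, not Harbor's actual code.

```python
def route(prediction_set):
    """Map a conformal prediction set to a triage action.

    prediction_set is a set of labels drawn from {"legit", "fraud"}.
    """
    if prediction_set == {"legit"}:
        return "auto_approve"
    if prediction_set == {"fraud"}:
        return "auto_decline"
    # Multi-label sets go to analysts. Sending empty sets to review as well
    # is an assumption made here; the product has to define that behavior.
    return "analyst_review"

def alpha_for(is_corporate_bin, default_alpha=0.05, corporate_alpha=0.02):
    """Pick the miscoverage rate, with the stricter override on corporate BINs."""
    return corporate_alpha if is_corporate_bin else default_alpha

actions = [route(s) for s in ({"legit"}, {"fraud"}, {"legit", "fraud"}, set())]
```

Keeping the routing separate from the scoring code means the threshold can be recalibrated without touching the business rules.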
Method decision table
| Approach | Best for | Coverage type | Typical cost |
|---|---|---|---|
| Split conformal (APS) | Fast deployment on any classifier | Marginal, finite-sample | One scoring pass + quantile |
| Conformalized quantile regression (CQR) | Heteroscedastic regression, demand forecasting | Marginal interval | Two quantile models + calibration |
| Full / cross-conformal | Small datasets, tighter utilization of labels | Marginal, exact | O(n) refits or jackknife+ approx |
| Platt / temperature scaling | Ranking and expected calibration error | No set guarantee | Single param fit on val set |
| Bayesian posterior | Small models, priors matter, full uncertainty | Credible (not frequentist coverage) | MCMC or variational inference |
| Ensemble variance | Quick heuristic bands | Heuristic, unproven coverage | K forward passes |
Common pitfalls
- Calibration leakage — tuning α or score type on the test set destroys the guarantee; hold out a final evaluation split.
- Stale calibration under drift — exchangeability breaks after product launches; schedule refresh or use adaptive conformal methods.
- Assuming conditional coverage — 95% marginal does not protect every merchant tier; monitor slices explicitly.
- Tiny calibration sets — quantile estimates are noisy below ~500 points per segment; sets become needlessly wide or misleadingly narrow.
- Class imbalance ignored — rare fraud may always produce multi-class sets; pair with cost-sensitive review policies.
- Time-series shuffling — random splits leak future into past; use rolling calibration windows for sequential data.
- Empty-set handling undefined — product must specify reject/abstain behavior before launch, not after first empty output.
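The calibration-size pitfall can be quantified with a short calculation. It relies on a standard result for split conformal with continuous scores: conditional on the calibration set, realized coverage follows a Beta(k, n + 1 − k) distribution with k = ⌈(n+1)(1−α)⌉. The helper name `coverage_spread` is hypothetical.

```python
import math

def coverage_spread(n, alpha):
    """Mean and standard deviation of realized coverage for calibration size n.

    Assumes continuous scores, so coverage given the calibration set follows
    Beta(k, n + 1 - k) with k = ceil((n + 1) * (1 - alpha)).
    """
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        raise ValueError("calibration set too small for this alpha")
    a, b = k, n + 1 - k
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

# Compare how much realized coverage can wander at different calibration sizes.
for n in (100, 500, 5000):
    mean, sd = coverage_spread(n, alpha=0.10)
    print(f"n={n:>5}  mean coverage={mean:.3f}  sd={sd:.4f}")
```

The mean stays at or just above the target for every size, while the spread shrinks as the calibration set grows, which is why small per-segment calibration sets give unstable coverage from one refresh to the next.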
Practitioner checklist
- Reserve a labeled calibration split never used for hyperparameter tuning.
- Pick α from business cost (review budget vs miss rate), not default 0.10.
- Log set size distribution — spikes in multi-label rate signal drift or attack waves.
- Report marginal coverage on rolling weekly holdouts with confidence bands.
- Slice coverage by top business segments (merchant tier, geography, product line).
- Compare APS vs cumulative-mass scores on held-out empirical coverage and average set size.
- For regression, benchmark CQR interval width against naive ±2σ baselines.
- Document abstain/empty-set routing in runbooks alongside on-call playbooks.
- Automate calibration refresh when PSI exceeds threshold on input features.
- Pair conformal outputs with analyst UX — show set members, not just max prob.
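The PSI-triggered refresh item in the checklist can be sketched for a single numeric feature. This is one common way to compute the population stability index, using bins set by reference quantiles; the function name, the bin count, and the 0.2 alert level are assumptions (0.2 is a widely used rule of thumb, not a value from this guide).

```python
import numpy as np

def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one feature."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Bin edges from reference quantiles, so reference bins hold similar mass.
    edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))
    # Assign by interior edges so values outside the reference range
    # land in the first or last bin instead of being dropped.
    ref_idx = np.searchsorted(edges[1:-1], reference, side="right")
    cur_idx = np.searchsorted(edges[1:-1], current, side="right")
    ref_frac = np.bincount(ref_idx, minlength=bins) / reference.size
    cur_frac = np.bincount(cur_idx, minlength=bins) / current.size
    # Floor the fractions to avoid log(0) on empty bins.
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(2)
reference = rng.normal(0.0, 1.0, size=5000)
same = rng.normal(0.0, 1.0, size=5000)       # no drift
shifted = rng.normal(1.0, 1.0, size=5000)    # mean shifted by one sd

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
needs_refresh = psi_shifted > 0.2            # assumed alert level
```

A job like this would run per input feature on a schedule and enqueue a calibration refresh when any monitored feature crosses the chosen level.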
Key takeaways
- Conformal prediction wraps any trained model with sets or intervals that meet a chosen marginal coverage rate.
- Split conformal is simple: calibrate a nonconformity score threshold on held-out labeled data, apply at inference.
- APS and CQR adapt set size to model uncertainty — singleton sets mean confident automation paths.
- Coverage guarantees require exchangeability; drift and time-series need specialized handling and monitoring.
- Marginal guarantees do not replace subgroup fairness checks — slice coverage in production dashboards.
Related reading
- Model calibration explained — reliability diagrams and Platt scaling when you need calibrated probabilities without formal sets
- Bayesian inference explained — credible intervals and posterior uncertainty when priors and generative structure matter
- Anomaly detection explained — novelty scoring when labels are scarce and open-world rejection is the goal
- Machine learning fundamentals explained — train/validation/test splits and evaluation metrics that precede conformal layers