Data leakage in machine learning explained

Data leakage occurs when information that will not be available at prediction time, often including the label itself, sneaks into features or preprocessing, so the model appears brilliant offline and fails in production. It is distinct from overfitting: an overfit model memorizes training noise; a leaky model is handed the answer key. Leakage is the silent killer behind "99% validation AUC, 58% live recall." This guide defines leak types (target, temporal, group, preprocessing), shows how they bypass cross-validation, walks through a Harbor Analytics churn-prediction worked example, and provides a leak-type decision table, detection signals, common pitfalls, and a practitioner checklist. For building features safely, see feature engineering explained; for time-ordered data, see time series forecasting explained.

What data leakage is — and is not

Leakage means the model (or your evaluation pipeline) had access at training or validation time to data that would not be available at the moment you make a real prediction. The classic symptom: metrics that look too good to be true, followed by a cliff when the model meets live traffic.

Leakage is not the same as using a powerful model, having abundant data, or achieving high accuracy on a genuinely hard problem. It is also not necessarily malicious — most leaks come from innocent pipeline ordering mistakes: scaling on the full dataset before splitting, joining tables on the wrong key, or including a column that is updated only after the label event occurs.

The leakage litmus test

For every feature at prediction time, ask: "Could I compute this value knowing only what I knew at decision time, before the outcome happened?" If the answer is no, the feature leaks. Document that timestamp — the as-of moment — for each row in training data.
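
One way to make the litmus test mechanical is sketched below. It assumes each feature carries a "last written" timestamp next to the row's as-of moment; the table and the column names (last_login_updated_at, reason_code_updated_at) are hypothetical and only illustrate the check.

```python
import pandas as pd

# Hypothetical training rows: one as_of (decision time) per row, plus the
# time each feature value was last written. All names are illustrative.
rows = pd.DataFrame({
    "account_id": [1, 2, 3],
    "as_of": pd.to_datetime(["2024-03-01", "2024-03-01", "2024-03-01"]),
    "last_login_updated_at": pd.to_datetime(
        ["2024-02-27", "2024-02-20", "2024-02-29"]),
    "reason_code_updated_at": pd.to_datetime(
        ["2024-03-04", "2024-02-25", "2024-03-10"]),
})

def leaking_share(df, as_of_col, ts_cols):
    """Fraction of rows where each feature was written after the as-of moment."""
    return {col: float((df[col] > df[as_of_col]).mean()) for col in ts_cols}

report = leaking_share(
    rows, "as_of", ["last_login_updated_at", "reason_code_updated_at"])
print(report)
```

Any feature with a nonzero share was written after decision time for at least some rows and deserves a closer look before it enters training.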

Major leak types

Target leakage

Target leakage (label leakage) happens when a feature encodes the label directly or through a proxy that only exists because the outcome already happened. Examples: using refund_issued to predict fraud (refunds follow fraud investigations), using days_until_churn as a feature when churn is the label, or including a "reason code" field populated only after a support ticket is closed when predicting ticket category at creation time.

Temporal leakage and lookahead bias

Temporal leakage shuffles past and future. Random train-test splits on time-series data let the model train on March and validate on January — it has already seen the future regime. The same problem appears when you compute rolling statistics over the full series before splitting, or when backtesting a trading model with same-day close prices for entries you could not have filled until the next session. Use walk-forward validation and strict as-of joins instead of random k-fold when rows have order.
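
A minimal sketch of a walk-forward split using scikit-learn's TimeSeriesSplit. It assumes the rows are already sorted oldest to newest; the toy array stands in for real data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 rows already sorted by time (index 0 is the oldest observation).
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for train_idx, test_idx in splits:
    # Each fold trains on an expanding window and validates on later rows only.
    print("train up to row", train_idx.max(),
          "-> validate rows", test_idx.min(), "to", test_idx.max())
```

Because every training index precedes every validation index, no fold can learn from rows that come after the ones it is scored on.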

Group and duplicate leakage

Group leakage splits related rows across train and test. Multiple CT scans from the same patient, several transactions from one merchant, or duplicate user accounts after a merge can let the model memorize identities. Split by patient_id, user_id, or session_id — not by row index. Near-duplicate text (product descriptions copied across SKUs) causes the same effect in NLP retrieval.
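
The sketch below shows a group-aware split with scikit-learn's GroupKFold. The patient_id values and the zero-filled feature matrix are made up for illustration; the point is only that no group lands on both sides of a fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.zeros((8, 2))                       # placeholder features
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])
patient_id = np.array(["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"])

gkf = GroupKFold(n_splits=4)
overlaps = []
for train_idx, test_idx in gkf.split(X, y, groups=patient_id):
    # Patients that appear in both train and test for this fold (should be none).
    shared = set(patient_id[train_idx]) & set(patient_id[test_idx])
    overlaps.append(shared)

print(overlaps)
```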

Preprocessing and pipeline leakage

Pipeline leakage contaminates splits through transforms fit on all data: standardizing with global mean and variance, imputing missing values from the full column, selecting top-k features by correlation on the combined set, or tuning hyperparameters on the test fold. Correct pattern: split first, then fit scalers, imputers, and feature selectors inside each training fold only — sklearn Pipeline + cross_val_score enforces this when used properly.
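
A minimal sketch of the split-first pattern, assuming a synthetic dataset and a logistic regression as stand-ins for a real problem. Because the imputer and scaler live inside the Pipeline, cross_val_score refits them on the training portion of each fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 5% of values knocked out to exercise the imputer.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # medians from train fold only
    ("scale", StandardScaler()),                   # mean/variance from train fold only
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

The leaky variant would call fit_transform on the full X before cross-validation; the code looks almost identical, which is why this mistake is so common.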

How leakage fools cross-validation

K-fold cross-validation only protects you if folds are constructed with the same constraints production will face. Leakage often survives CV because:

  • Transforms fit globally: if you preprocess before CV, the scaler has already seen every fold's validation rows, so each held-out fold shapes the features its own model trains on.
  • Groups span folds — the same customer appears in fold 2 train and fold 4 validation.
  • Time is ignored — shuffled folds let future rows train models evaluated on the past.
  • Target-encoded categoricals — encoding merchant_id with mean fraud rate computed using validation labels leaks the label into the feature.

Remedies: GroupKFold, TimeSeriesSplit, nested CV for hyperparameter search, and pipeline objects where every learnable step runs inside the training portion of each fold. When in doubt, simulate production: freeze a date, train only on prior data, predict forward one day, repeat.
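
For the target-encoding case in particular, one common remedy is out-of-fold encoding. The sketch below is a hand-rolled illustration, not a library API: oof_target_encode is a hypothetical helper, and falling back to the fold's overall label mean for unseen categories is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cats, y, n_splits=5, seed=0):
    """Encode each row with label means computed from the other folds only."""
    cats = pd.Series(cats).reset_index(drop=True)
    y = pd.Series(y).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=cats.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cats):
        # Per-category means use only rows outside the fold being encoded.
        means = y.iloc[fit_idx].groupby(cats.iloc[fit_idx]).mean()
        prior = y.iloc[fit_idx].mean()  # fallback for categories unseen in fit rows
        encoded.iloc[enc_idx] = (
            cats.iloc[enc_idx].map(means).fillna(prior).to_numpy()
        )
    return encoded

# Worst case for naive encoding: every merchant_id appears exactly once,
# so a full-data mean would simply copy the label into the feature.
merchant_id = [f"m{i}" for i in range(20)]
is_fraud = [i % 2 for i in range(20)]
enc = oof_target_encode(merchant_id, is_fraud)
print(enc.round(2).tolist())
```

Here no row's encoding can depend on its own label. When this encoding is used inside cross-validation, it still needs to be computed within each training fold rather than once up front.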

Worked example: Harbor Analytics churn model

Harbor Analytics builds a subscription churn predictor for a B2B logistics SaaS product. The label is churned_within_30d: whether the account cancelled in the 30 days after the snapshot date as_of.

First attempt (leaky)

A junior data scientist joins billing, support, and product-usage tables, fills missing values using column medians computed on the full dataset, one-hot encodes plan tier, and adds support_tickets_last_30d, last_login_days_ago, and downgrade_requested. The data is split randomly 80/20. Gradient boosted trees hit 0.97 AUC on validation. Leadership is thrilled.

Production launch: live AUC 0.61. Postmortem finds three leaks:

  1. downgrade_requested is stamped when the churn workflow starts, after the customer has already decided to leave. Classic target leakage.
  2. Median imputation used post-churn billing states from the full table, leaking cancellation timing into fill values for active accounts.
  3. Twelve accounts were duplicated after a CRM merge; both copies straddled train and test, memorizing account-specific patterns.

Second attempt (clean)

The team rebuilds with rules: features computed strictly with event_time <= as_of; drop downgrade_requested; split by account_id using GroupKFold; impute and scale inside a sklearn Pipeline; evaluate with walk-forward monthly snapshots. Validation AUC settles at 0.78 — lower but honest. After three months live, production AUC is 0.76, within expected drift. Marketing stops over-promising retention saves; product uses the model to prioritize outreach, not as a proof of model magic.
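
The first rule, event_time <= as_of, might look like the sketch below for support_tickets_last_30d. The two small tables and the tickets_last_30d helper are invented for illustration and are not Harbor's actual code.

```python
import pandas as pd

snapshots = pd.DataFrame({
    "account_id": [1, 2],
    "as_of": pd.to_datetime(["2024-03-01", "2024-03-01"]),
})
tickets = pd.DataFrame({
    "account_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime(
        ["2024-01-15", "2024-02-20", "2024-03-05", "2024-02-10", "2024-02-25"]),
})

def tickets_last_30d(snapshots, tickets):
    """Count tickets in the 30 days up to and including as_of, never after it."""
    joined = snapshots.merge(tickets, on="account_id", how="left")
    window_start = joined["as_of"] - pd.Timedelta(days=30)
    in_window = (joined["event_time"] <= joined["as_of"]) & (
        joined["event_time"] > window_start
    )
    return (
        joined.assign(in_window=in_window)
        .groupby(["account_id", "as_of"])["in_window"]
        .sum()
        .rename("support_tickets_last_30d")
        .reset_index()
    )

features = tickets_last_30d(snapshots, tickets)
print(features)
```

Account 1's ticket on 2024-03-05 falls after the snapshot and is excluded, as is the one from 2024-01-15, which is older than 30 days.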

Leak-type decision table

Situation | Likely leak | Fix
Feature exists only after label event | Target leakage | Remove feature; redefine label window
Ordered events (prices, logs, sessions) | Temporal leakage | Walk-forward split; as-of joins
Multiple rows per entity | Group leakage | GroupKFold; deduplicate on entity ID
Scaler fit before split | Pipeline leakage | Fit transforms inside train fold only
Mean encoding on full data | Target leakage via encoding | Nested CV or out-of-fold encoding
Hyperparams tuned on test set | Evaluation leakage | Hold out a final untouched test set
Train and serve feature code differ | Serving skew (not leakage but similar cliff) | Same pipeline artifact for batch and online

Detection signals before ship

  • Suspiciously high metrics — 99% accuracy on messy real data deserves skepticism, not celebration.
  • Single feature dominance — SHAP shows one column explaining 80% of lift; inspect whether it is a post-outcome artifact.
  • Train vs validation gap near zero on a complex model — may indicate leakage rather than perfect generalization.
  • Performance cliff on time slice — model works on 2024 data, collapses on 2025; often temporal leak or regime change.
  • Ablation surprise — removing an "obvious" feature barely hurts metrics; it may have been redundant with a leaky twin.
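
A quick screen for the single-feature dominance signal is to score each column on its own. The sketch below uses synthetic data with one deliberately leaky column; the feature names and the idea of flagging near-perfect single-column AUC are illustrative assumptions, not a formal test.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def single_feature_auc(X, y, names):
    """AUC of each raw column used alone as a score, ignoring direction."""
    out = {}
    for j, name in enumerate(names):
        auc = roc_auc_score(y, X[:, j])
        out[name] = max(auc, 1 - auc)
    return out

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
honest_signal = rng.normal(size=500) + 0.3 * y     # weak, plausible signal
leaky_flag = y + rng.normal(scale=0.05, size=500)  # near-copy of the label

X = np.column_stack([honest_signal, leaky_flag])
aucs = single_feature_auc(X, y, ["honest_signal", "leaky_flag"])
print(aucs)
```

A column that separates the classes almost perfectly by itself is not proof of leakage, but it is a strong prompt to check when and how that column gets populated.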

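Run a negative control: shuffle the labels, rerun the full pipeline (including label-dependent steps such as target encoding and feature selection), and retrain. AUC should land near 0.5. If shuffled labels still score well, something in your pipeline or evaluation is leaking. This control will not flag a precomputed leaky column such as downgrade_requested, because shuffling breaks its tie to the label; catch those with the litmus test above.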
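
A minimal sketch of the label-shuffle control, assuming a synthetic dataset and a simple scikit-learn pipeline in place of your real one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Baseline: cross-validated AUC with the true labels.
real_auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

# Negative control: same pipeline, labels randomly permuted.
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y)
shuffled_auc = cross_val_score(
    pipe, X, y_shuffled, cv=5, scoring="roc_auc").mean()

print(f"real AUC: {real_auc:.3f}, shuffled AUC: {shuffled_auc:.3f}")
```

In a real project, apply the shuffle before any step that reads the label, so that label-dependent preprocessing is exercised by the control as well.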

Common pitfalls

  • Leakage in feature stores — offline backfills that join future snapshots into historical training rows. Enforce point-in-time correctness in the store API.
  • LLM few-shot contamination — evaluation examples appearing in retrieval context during RAG tests inflate benchmark scores.
  • Data augmentation bleed — augmented copies of a test image placed in training without group splits.
  • Champion-challenger on production labels too early — using outcomes from the same users you just targeted with treatment.
  • Ignoring survivorship bias: training only on accounts that survived to a late snapshot looks like predicting churn, but it conditions on not having churned earlier and so excludes the early churners the model will face in production.
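
For the feature-store pitfall, point-in-time correctness can be sketched with pandas merge_asof. The tables and the seats_in_use feature are hypothetical; a real store would expose its own point-in-time API.

```python
import pandas as pd

training_rows = pd.DataFrame({
    "account_id": [1, 1, 2],
    "as_of": pd.to_datetime(["2024-02-01", "2024-03-01", "2024-03-01"]),
}).sort_values("as_of")

feature_snapshots = pd.DataFrame({
    "account_id": [1, 1, 2, 2],
    "snapshot_time": pd.to_datetime(
        ["2024-01-20", "2024-02-15", "2024-02-25", "2024-03-10"]),
    "seats_in_use": [40, 35, 12, 3],
}).sort_values("snapshot_time")

# For each training row, take the latest snapshot at or before as_of.
# Both frames must be sorted by their time key for merge_asof.
pit = pd.merge_asof(
    training_rows,
    feature_snapshots,
    left_on="as_of",
    right_on="snapshot_time",
    by="account_id",
    direction="backward",
)
pit_sorted = pit.sort_values(["account_id", "as_of"])
print(pit_sorted)
```

Account 2's snapshot from 2024-03-10 is ignored for the 2024-03-01 training row; a plain join on account_id followed by "take the latest" would have pulled it in.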

Practitioner checklist

  • Define the prediction moment (as_of) and label horizon in writing.
  • Audit every feature: available at as_of? If not, drop or lag it.
  • Split by group or time when rows are not independent.
  • Fit all preprocessing inside training folds via a Pipeline.
  • Reserve a final test set never used for any tuning decision.
  • Run label-shuffle and time-slice sanity checks before launch.
  • Match train and serve feature code paths; version the pipeline artifact.
  • Monitor live metrics against honest offline estimates — large gaps trigger leak review.

Key takeaways

  • Data leakage gives models information unavailable at prediction time — metrics lie until production.
  • Target and temporal leaks are among the most common; group and pipeline leaks are close behind.
  • Cross-validation only helps when folds respect time, groups, and pipeline boundaries.
  • Lower honest validation scores beat inflated ones — Harbor's 0.78 AUC that held live beat 0.97 that did not.
  • When metrics look too good, assume leakage until proven otherwise.
