Purged cross-validation for backtesting explained

Harbor Capital's systematic desk trained a gradient-boosted momentum classifier on 12-month forward return labels across 400 liquid U.S. equities. Standard five-fold cross-validation reported a mean out-of-sample Sharpe of 1.42, well above the desk's 0.8 deployment threshold. Walk-forward backtests on the same features reached only 0.61. The gap was not bad luck; it was label overlap leakage. Forward-return labels span overlapping calendar windows, so each training row's label used prices from months that also appeared inside adjacent test folds. Purged cross-validation removes training observations whose label intervals intersect the test set, then adds an embargo buffer so serial correlation cannot bridge the gap. After purging and a 5% embargo, the honest cross-validated Sharpe fell to 0.58, in line with walk-forward and low enough to reject deployment. This guide explains why naive k-fold fails on financial series, how purging and embargo work, how combinatorial purged CV yields stability estimates, and what the Harbor Capital refactor changed. It closes with a method decision table, common pitfalls, and a production checklist.

Why standard k-fold leaks on financial labels

Classical k-fold cross-validation assumes observations are independent and identically distributed. Financial return labels routinely violate both assumptions. A label defined as “return from day t to day t + H” uses price paths that overlap across nearby rows: row at t and row at t + 1 share H − 1 days of future information. When folds are shuffled or blocked without accounting for that overlap, the model trains on prices that also determine test-set labels.
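The overlap between neighbouring labels can be counted directly. The sketch below is illustrative only: the helper names are hypothetical, days are plain integer indices, and H = 252 is an assumed stand-in for a 12-month horizon in trading days.

```python
def label_window(t: int, horizon: int) -> set[int]:
    """Future days whose prices enter the label for a decision made on day t.

    The forward return from t to t + horizon depends on days t+1 .. t+horizon.
    """
    return set(range(t + 1, t + horizon + 1))


def shared_days(t_a: int, t_b: int, horizon: int) -> int:
    """Number of future days used by both labels."""
    return len(label_window(t_a, horizon) & label_window(t_b, horizon))


H = 252  # assumed: roughly 12 months of trading days

print(shared_days(0, 1, H))    # adjacent rows share H - 1 = 251 days
print(shared_days(0, 251, H))  # 1
print(shared_days(0, 252, H))  # 0: windows no longer overlap
```

Two rows stop sharing future information only once they sit a full horizon apart, which is why a fold boundary alone does not separate them.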

Temporal vs random splits

Random k-fold is the worst offender — it places adjacent trading days in train and test simultaneously. Simple chronological splits fix obvious look-ahead but still leak when label horizons span fold boundaries: the last training rows before a test block can include prices from inside the test window. Purging explicitly deletes any training sample whose label interval [t_start, t_end] intersects the test interval.

Serial correlation and embargo

Even after purging intersecting labels, features computed from rolling windows (20-day volatility, 60-day momentum) remain correlated across time. An embargo removes an additional slice of training rows immediately before and after each test segment — typically 1–5% of the sample length. The embargo length should match the longest feature lookback or autocorrelation decay you believe matters.
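One way to turn that sizing rule into a number is to take the larger of a sample fraction and the longest feature lookback. This is a minimal sketch under that assumption; the function name and the example figures are hypothetical, not a prescribed rule.

```python
def embargo_rows(n_samples: int, pct: float, max_lookback: int) -> int:
    """Embargo length in rows.

    Assumption: use whichever is larger, a fraction of the sample length
    or the longest rolling-feature lookback.
    """
    return max(round(n_samples * pct), max_lookback)


print(embargo_rows(1000, 0.02, 60))  # 60: the 60-day lookback dominates
print(embargo_rows(1000, 0.05, 20))  # 50: the 5% slice dominates
```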

Purged k-fold step by step

Marcos López de Prado's purged k-fold procedure adapts blocked time-series CV for overlapping outcomes:

Sort observations by label start time (or decision time).
Partition the timeline into k contiguous test folds.
For each fold, define the test index set T.
Purge: drop training rows whose label end ≥ test start and label start ≤ test end.
Embargo: drop training rows within h observations of either test boundary.
Train on the remaining train set; score the test fold; aggregate metrics across folds.
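The steps above can be sketched in plain Python. This is a simplified illustration, not the reference implementation from any library: `purged_kfold` is a hypothetical name, label intervals are integer bar indices, and the test interval is taken to run from the earliest test label start to the latest test label end.

```python
from typing import Iterator


def purged_kfold(
    intervals: list[tuple[int, int]],  # (label_start, label_end), sorted by start
    n_splits: int,
    embargo: int = 0,
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_idx, test_idx) for contiguous test folds with purge and embargo."""
    n = len(intervals)
    for k in range(n_splits):
        lo, hi = k * n // n_splits, (k + 1) * n // n_splits
        test_idx = list(range(lo, hi))
        test_start = min(intervals[i][0] for i in test_idx)
        test_end = max(intervals[i][1] for i in test_idx)
        train_idx = []
        for i, (start, end) in enumerate(intervals):
            if lo <= i < hi:
                continue  # row belongs to the test fold
            if end >= test_start and start <= test_end:
                continue  # purge: label interval intersects the test interval
            if lo - embargo <= i < hi + embargo:
                continue  # embargo: within `embargo` rows of a test boundary
            train_idx.append(i)
        yield train_idx, test_idx


# Toy data: 100 rows, each label spanning a fixed 5-bar horizon.
intervals = [(t, t + 5) for t in range(100)]
for train_idx, test_idx in purged_kfold(intervals, n_splits=5, embargo=10):
    print(test_idx[0], test_idx[-1], len(train_idx))
```

With variable-length labels the same function applies unchanged, because the purge test reads each row's own interval rather than assuming a fixed horizon.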

In Python ecosystems, implementations appear in mlfinlab, skfolio, and custom pipelines atop scikit-learn's TimeSeriesSplit. The critical input is a per-row label interval — not just a single timestamp. For event-based labels (triple-barrier hits), intervals vary in length; purging must use actual barrier touch times, not assumed fixed horizons.
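To make the point about barrier touch times concrete, the sketch below derives a per-row label interval from a simplified triple-barrier rule. It is an assumption-laden toy: the function name, the symmetric percentage barriers, and the price series are all hypothetical, and production versions usually scale barriers by volatility.

```python
def barrier_label_interval(
    prices: list[float], t: int, up: float, down: float, max_hold: int
) -> tuple[int, int]:
    """Return (start, end) bar indices for a label opened at bar t.

    The label ends at the first bar whose return from bar t reaches +up or
    -down, or at the vertical barrier t + max_hold, whichever comes first.
    """
    last = min(t + max_hold, len(prices) - 1)
    for j in range(t + 1, last + 1):
        ret = prices[j] / prices[t] - 1.0
        if ret >= up or ret <= -down:
            return t, j  # horizontal barrier touched at bar j
    return t, last  # no touch: the vertical barrier closes the label


prices = [100.0, 101.0, 103.0, 99.0, 98.0, 104.0]  # hypothetical closes
intervals = [
    barrier_label_interval(prices, t, up=0.025, down=0.025, max_hold=4)
    for t in range(len(prices) - 1)
]
print(intervals)  # [(0, 2), (1, 4), (2, 3), (3, 5), (4, 5)]
```

The intervals differ in length even though every row shares the same maximum holding period, so purging on an assumed fixed horizon would remove either too many or too few rows.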

Metrics to report per fold

Sharpe or information ratio on purged test returns — not accuracy on overlapping labels.
Turnover and capacity — purged CV on tiny sleeves can still overstate scalability.
Deflated Sharpe when running many trials; purging fixes leakage, not multiple-testing inflation.
Fold stability — wide dispersion across folds signals regime sensitivity, not alpha.
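A per-fold report along these lines might look like the sketch below. It assumes daily returns, a zero risk-free rate, and 252 periods per year; the function names and the toy return series are hypothetical.

```python
import math
import statistics


def annualized_sharpe(returns: list[float], periods_per_year: int = 252) -> float:
    """Annualized Sharpe of a return series (risk-free rate assumed zero)."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return float("nan")  # undefined for a constant series
    return statistics.mean(returns) / sd * math.sqrt(periods_per_year)


def fold_report(fold_returns: list[list[float]]) -> dict:
    """Per-fold Sharpe plus the mean and dispersion across folds."""
    sharpes = [annualized_sharpe(r) for r in fold_returns]
    return {
        "per_fold": sharpes,
        "mean": statistics.mean(sharpes),
        "dispersion": statistics.stdev(sharpes),  # wide values flag regime sensitivity
    }


# Hypothetical purged test-fold returns for two folds.
report = fold_report([[0.01, -0.01, 0.02, 0.0], [0.0, 0.005, -0.02, 0.01]])
print(report)
```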

Combinatorial purged cross-validation (CPCV)

Standard purged k-fold produces only k test folds, which stitch together into a single backtest path. Markets have few independent regimes; five folds may all include 2020–2021 liquidity extremes. Combinatorial purged CV generates many train/test combinations from the same timeline: choose k test groups from n contiguous blocks, purge and embargo each combination, and collect a distribution of performance paths rather than a single mean.
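The combinatorics can be enumerated with the standard library. The sketch below only lists which blocks land in train and test for each combination; purging and embargo at every train/test boundary are deliberately left out, and the function names are hypothetical.

```python
from itertools import combinations
from math import comb
from typing import Iterator


def cpcv_splits(
    n_blocks: int, n_test_blocks: int
) -> Iterator[tuple[tuple[int, ...], tuple[int, ...]]]:
    """Yield (train_blocks, test_blocks) for every choice of test blocks."""
    for test in combinations(range(n_blocks), n_test_blocks):
        train = tuple(b for b in range(n_blocks) if b not in test)
        yield train, test


def n_paths(n_blocks: int, n_test_blocks: int) -> int:
    """Number of complete backtest paths the combinations can be stitched into.

    Each block is tested in comb(n - 1, k - 1) combinations, which equals
    comb(n, k) * k / n.
    """
    return comb(n_blocks, n_test_blocks) * n_test_blocks // n_blocks


splits = list(cpcv_splits(6, 2))
print(len(splits), n_paths(6, 2))  # 15 5
```

Six blocks with two test groups give 15 train/test combinations and 5 full paths, against a single path from ordinary purged k-fold.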

CPCV answers questions k-fold cannot:

What fraction of plausible paths beat a hurdle rate (a practical proxy for the probability of backtest overfitting, PBO)?
How sensitive is the strategy to which crisis years land in test?
What is the range of maximum drawdown across combinatorial paths?

The compute cost grows combinatorially — use CPCV for final gatekeeping on strategies that already pass purged k-fold and walk-forward, not for every hyperparameter grid search. Pair CPCV summaries with bootstrap resampling on trade-level returns when you need confidence intervals on Sharpe, not just point estimates.
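A percentile bootstrap is one simple way to put an interval around a Sharpe estimate. The sketch below resamples returns independently with replacement, which assumes trade-level returns are roughly independent; serially dependent series would call for a block bootstrap instead. The function names and the synthetic returns are hypothetical.

```python
import random
import statistics


def sharpe(returns: list[float]) -> float:
    """Per-period Sharpe (risk-free rate assumed zero, no annualization)."""
    return statistics.mean(returns) / statistics.stdev(returns)


def bootstrap_sharpe_ci(
    returns: list[float], n_boot: int = 2000, alpha: float = 0.05, seed: int = 0
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the Sharpe ratio."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = rng.choices(returns, k=len(returns))
        if len(set(sample)) > 1:  # skip degenerate zero-variance resamples
            stats.append(sharpe(sample))
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi


# Synthetic trade returns, for illustration only.
gen = random.Random(1)
trade_returns = [gen.gauss(0.001, 0.01) for _ in range(200)]
print(sharpe(trade_returns), bootstrap_sharpe_ci(trade_returns))
```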

Harbor Capital momentum sleeve refactor

Harbor's quant team rebuilt validation after the purged-CV shock:

Replaced random k-fold with purged five-fold on 12-month forward labels; embargo set to 5% of sample (~63 trading days).
Logged label start/end per row from triple-barrier events instead of fixed-horizon assumptions.
Ran CPCV with 8 blocks and 2 test groups (28 train/test combinations) to estimate PBO; strategies with more than 40% of paths below the hurdle were rejected.
Reconciled purged CV Sharpe against walk-forward rolls (126-day train, 21-day test step), targeting agreement within 0.1 Sharpe.
Added point-in-time features only; verified no survivorship bias in the underlying universe.
Documented purge/embargo parameters in the model card for allocator audit.

Result: two of three candidate momentum variants failed the purged gate; the surviving sleeve launched at half the originally proposed capital with walk-forward Sharpe 0.54 vs the inflated 1.42. The desk now treats any CV method that ignores label overlap as a research bug, not a conservative estimate.
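The parameters from the refactor could be recorded for audit in a small, versionable structure. The sketch below is a hypothetical layout, not Harbor's actual model card: the class and field names are invented for illustration, the values are the ones quoted in the list above, and the path Sharpes and hurdle in the usage lines are made up.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ValidationCard:
    """Hypothetical model-card fragment holding validation parameters."""

    n_splits: int = 5
    embargo_pct: float = 0.05
    cpcv_blocks: int = 8
    cpcv_test_groups: int = 2
    max_frac_below_hurdle: float = 0.40
    wf_train_days: int = 126
    wf_step_days: int = 21
    cv_vs_wf_tolerance: float = 0.10

    def rejects(self, path_sharpes: list[float], hurdle: float) -> bool:
        """True when too many combinatorial paths fall below the hurdle."""
        below = sum(s < hurdle for s in path_sharpes)
        return below / len(path_sharpes) > self.max_frac_below_hurdle


card = ValidationCard()
print(asdict(card))

# Illustrative path Sharpes only: 3 of 7 below the hurdle is about 43%.
print(card.rejects([0.9, 0.9, 0.9, 0.9, 0.5, 0.5, 0.5], hurdle=0.8))  # True
print(card.rejects([0.9, 0.9, 0.9, 0.9, 0.9, 0.5, 0.5], hurdle=0.8))  # False
```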

Method decision table

Validation method	Strength	Weakness	Best for
Random k-fold	High fold count, familiar APIs	Severe leakage with overlapping labels	Never for horizon-based return labels
Simple chronological split	Easy, no shuffle leakage	Still leaks at fold boundaries; single path	Quick sanity checks only
Walk-forward (rolling train/test)	Mimics live deployment; one clear OOS path	Single path; window choice biases result	Production simulation, final sign-off
Purged k-fold	Removes label overlap leakage; multiple folds	Needs per-row label intervals; tuning embargo	ML strategy selection, hyperparameter screening
Combinatorial purged CV	Distribution of paths; PBO estimates	Expensive; still not live trading	Capital allocation gate on finalists
Purged CV without walk-forward	Fast iteration	Misses transaction-cost drift and capacity	Never as sole deployment approval

Common pitfalls

Purging on decision time only — must use full label interval endpoints; barrier labels end at touch time, not horizon.
Zero embargo with rolling features — 60-day momentum features correlate across purged gaps; add embargo ≥ max lookback.
Confusing purging with point-in-time data — purging fixes overlap; it does not replace survivorship-free universes.
Tuning on purged test folds — repeatedly peeking at purged CV while tweaking still overfits; hold a final sealed walk-forward.
Ignoring multiple testing — purged Sharpe across 200 variants needs deflation or Bonferroni-style discipline.
Fixed embargo for variable labels — event-based labels of uneven length need embargo scaled to feature decay, not a constant row count.
Reporting accuracy instead of P&L — classification accuracy on overlapping labels flatters models that do not translate to net returns after costs.
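For the multiple-testing pitfall, a useful reference point is how large the best Sharpe would be expected to look by chance alone. The sketch below uses the approximation from Bailey and López de Prado's deflated Sharpe ratio work for the expected maximum of N independent trials with zero true Sharpe. It assumes independent, roughly normal Sharpe estimates, and the function name and inputs are hypothetical.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials: int, sharpe_std: float) -> float:
    """Approximate expected best Sharpe among n_trials skill-free strategies.

    sharpe_std is the standard deviation of Sharpe estimates across trials,
    in the same units as the Sharpe ratios being compared. Requires n_trials > 1.
    """
    z = NormalDist().inv_cdf
    return sharpe_std * (
        (1 - EULER_GAMMA) * z(1 - 1 / n_trials)
        + EULER_GAMMA * z(1 - 1 / (n_trials * math.e))
    )


# With 200 variants, the chance-level benchmark is far above zero.
print(expected_max_sharpe(200, sharpe_std=0.5))
print(expected_max_sharpe(20, sharpe_std=0.5))
```

A purged Sharpe that does not clear this benchmark is hard to distinguish from the luckiest of many unskilled trials.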

Production checklist

Store label start and end timestamps (or bar indices) for every training row.
Replace random k-fold with purged blocked splits for any horizon-based label.
Set embargo length ≥ longest feature lookback or estimated autocorrelation horizon.
Report per-fold Sharpe, turnover, and max drawdown — not just pooled accuracy.
Require purged CV mean within 0.15 Sharpe of walk-forward before capital review.
Run CPCV or PBO estimate on strategies passing purged k-fold.
Apply transaction costs and slippage in walk-forward runs, not only in purged CV scoring.
Version purge/embargo parameters in model cards alongside feature definitions.
Cross-check universe for survivorship and corporate-action timing.
Seal a final out-of-sample period never used in any purged fold during research.
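The agreement check in the list above reduces to a one-line gate. A minimal sketch, with a hypothetical function name and the 0.15 tolerance from the checklist as the default; the inputs are the Sharpe figures quoted in the introduction.

```python
def within_tolerance(
    purged_cv_sharpe: float, walk_forward_sharpe: float, tol: float = 0.15
) -> bool:
    """True when purged CV and walk-forward Sharpe agree within tol."""
    return abs(purged_cv_sharpe - walk_forward_sharpe) <= tol


print(within_tolerance(0.58, 0.61))  # True: purged CV agrees with walk-forward
print(within_tolerance(1.42, 0.61))  # False: naive CV is far from walk-forward
```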

Key takeaways

Overlapping return labels leak future prices into training when folds share calendar days.
Purging removes training rows whose label intervals intersect the test set; embargo blocks serial correlation across boundaries.
Purged k-fold Sharpe often lands far below naive CV — treat the gap as corrected truth, not pessimism.
Combinatorial purged CV estimates how often a strategy path could disappoint, not just average performance.
Purged validation complements walk-forward; neither alone approves live capital.