Purged cross-validation for backtesting explained
Harbor Capital's systematic desk trained a gradient-boosted momentum classifier on 12-month forward return labels across 400 liquid U.S. equities. Standard five-fold cross-validation reported a mean out-of-sample Sharpe of 1.42, well above the desk's 0.8 deployment threshold, yet walk-forward backtests on the same features reached only 0.61. The gap was not bad luck; it was label overlap leakage. Because forward-return labels span overlapping calendar windows, each training row used prices from months that also fell inside adjacent test folds. Purged cross-validation removes training observations whose label intervals intersect the test set, then adds an embargo buffer so serial correlation cannot bridge the gap. After purging and a 5% embargo, the honest cross-validated Sharpe fell to 0.58, in line with walk-forward and low enough to reject deployment. This guide explains why naive k-fold fails on financial series, how purging and embargo work, and how combinatorial purged CV yields stability estimates. It then covers the Harbor Capital refactor, a method decision table to read alongside our backtesting guide, common pitfalls, and a production checklist tied to data leakage discipline.
Why standard k-fold leaks on financial labels
Classical k-fold cross-validation assumes observations are independent and identically distributed. Financial return labels routinely violate both assumptions. A label defined as “return from day t to day t + H” uses price paths that overlap across nearby rows: row at t and row at t + 1 share H − 1 days of future information. When folds are shuffled or blocked without accounting for that overlap, the model trains on prices that also determine test-set labels.
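The overlap is easy to count. The sketch below is illustrative only: it assumes daily bars, a fixed horizon, and that the label for row t compounds the daily returns on days t + 1 through t + H. The helper names `label_days` and `shared_days` are hypothetical.

```python
def label_days(t: int, horizon: int) -> range:
    """Days whose daily returns enter the label 'return from day t to day t + horizon'."""
    return range(t + 1, t + horizon + 1)


def shared_days(t_a: int, t_b: int, horizon: int) -> int:
    """Count the future days that two forward-return labels have in common."""
    return len(set(label_days(t_a, horizon)) & set(label_days(t_b, horizon)))


H = 252  # assumption: roughly 12 months of trading days
adjacent = shared_days(0, 1, H)      # rows one day apart
one_month = shared_days(0, 21, H)    # rows about a month apart
one_horizon = shared_days(0, H, H)   # rows a full horizon apart
print(adjacent, one_month, one_horizon)  # 251 231 0
```

Only rows at least a full horizon apart have labels with no days in common, which is why a fold boundary alone does not separate train from test.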
Temporal vs random splits
Random k-fold is the worst offender: it places adjacent trading days in train and test simultaneously. Simple chronological splits fix obvious look-ahead but still leak when label horizons span fold boundaries: the last training rows before a test block can include prices from inside the test window. Purging explicitly deletes any training sample whose label interval [t_start, t_end] intersects the test interval.
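A minimal sketch of that intersection rule, assuming label intervals are stored as inclusive integer bar indices. The `Interval` alias and the `intersects` and `purge` helpers are hypothetical names for illustration, not part of any library.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (t_start, t_end) as inclusive bar indices


def intersects(label: Interval, test: Interval) -> bool:
    """True when the two closed intervals share at least one bar."""
    return label[1] >= test[0] and label[0] <= test[1]


def purge(train_labels: List[Interval], test: Interval) -> List[Interval]:
    """Keep only training labels that do not touch the test interval."""
    return [iv for iv in train_labels if not intersects(iv, test)]


# Illustrative data: 20 rows, each labelled over a 5-bar forward window.
labels = [(t, t + 5) for t in range(20)]
test_rows = labels[10:15]
test_span = (test_rows[0][0], max(end for _, end in test_rows))  # (10, 19)
train_rows = labels[:10] + labels[15:]
kept = purge(train_rows, test_span)
print(kept)  # [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9)]
```

Rows 5 to 9 sit before the test block chronologically, yet their labels end inside it, so a plain chronological split would have kept them.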
Serial correlation and embargo
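Even after purging intersecting labels, features computed from rolling windows (20-day volatility, 60-day momentum) remain correlated across time. An embargo removes an additional slice of training rows adjacent to each test segment, typically 1–5% of the sample length. López de Prado's original formulation embargoes only the rows immediately after a test segment, because purging already covers the rows before it; dropping rows on both sides, as described here, is the more conservative variant. The embargo length should match the longest feature lookback or autocorrelation decay you believe matters.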
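One way to turn that sizing rule into a number is to take the larger of the percentage-based length and the longest feature lookback. This is a sketch under that assumption; the function name `embargo_rows` and its parameters are hypothetical.

```python
import math


def embargo_rows(n_samples: int, pct: float, max_lookback: int = 0) -> int:
    """Embargo length in rows: a share of the sample, floored at the longest lookback."""
    # round() guards against float noise such as 63.00000000000001 before ceil().
    by_pct = math.ceil(round(n_samples * pct, 9))
    return max(by_pct, max_lookback)


print(embargo_rows(1260, 0.05))      # 5% of 1,260 rows
print(embargo_rows(1260, 0.01, 60))  # 1% is shorter than a 60-day lookback, so 60 wins
```

Taking the maximum means a small percentage never undercuts a long rolling feature window.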
Purged k-fold step by step
Marcos López de Prado's purged k-fold procedure adapts blocked time-series CV for overlapping outcomes:
- Sort observations by label start time (or decision time).
- Partition the timeline into k contiguous test folds.
- For each fold, define the test index set T.
- Purge: drop training rows whose label end ≥ test start and label start ≤ test end.
- Embargo: drop training rows within h observations of either test boundary.
- Train on the remaining train set; score the test fold; aggregate metrics across folds.
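The steps above can be sketched as a small index generator. This is an illustrative, standard-library-only version, not the mlfinlab or skfolio implementation. It assumes rows are already sorted by label start, that label endpoints are integer bar indices, and that the embargo is counted in rows from each test block boundary. The name `purged_kfold` is hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple


def purged_kfold(
    label_start: Sequence[int],
    label_end: Sequence[int],
    k: int,
    embargo: int = 0,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs; rows must be sorted by label start."""
    n = len(label_start)
    # Contiguous, near-equal test blocks over the sorted timeline.
    bounds = [(i * n) // k for i in range(k + 1)]
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        test_idx = list(range(lo, hi))
        test_start = min(label_start[i] for i in test_idx)
        test_end = max(label_end[i] for i in test_idx)
        train_idx = []
        for i in range(n):
            if lo <= i < hi:
                continue  # row belongs to the test fold
            # Purge: the row's label interval intersects the test interval.
            if label_end[i] >= test_start and label_start[i] <= test_end:
                continue
            # Embargo: the row lies within `embargo` rows of either test boundary.
            if lo - embargo <= i < lo or hi <= i < hi + embargo:
                continue
            train_idx.append(i)
        yield train_idx, test_idx


# Illustrative data: 100 rows with 5-bar forward labels, 5 folds, 8-row embargo.
starts = list(range(100))
ends = [t + 5 for t in starts]
folds = list(purged_kfold(starts, ends, k=5, embargo=8))
train, test = folds[2]
print(len(train), test[0], test[-1])  # 64 40 59
```

In the middle fold, purging alone removes five rows on each side of the test block; the eight-row embargo widens each gap to eight rows, leaving rows 0 to 31 and 68 to 99 for training.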
In Python ecosystems, implementations appear in mlfinlab, skfolio, and custom pipelines atop scikit-learn's TimeSeriesSplit. The critical input is a per-row label interval, not just a single timestamp. For event-based labels (triple-barrier hits), intervals vary in length; purging must use actual barrier touch times, not assumed fixed horizons.
Metrics to report per fold
- Sharpe or information ratio on purged test returns — not accuracy on overlapping labels.
- Turnover and capacity — purged CV on tiny sleeves can still overstate scalability.
- Deflated Sharpe when running many trials; purging fixes leakage, not multiple-testing inflation.
- Fold stability — wide dispersion across folds signals regime sensitivity, not alpha.
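A per-fold report might look like the sketch below. It assumes each fold supplies a list of daily strategy returns already net of costs, uses a zero risk-free rate, and annualizes with 252 periods. The function names and the sample numbers are made up for illustration.

```python
import math
import statistics
from typing import Dict, Sequence


def annualized_sharpe(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Mean over sample standard deviation, scaled by sqrt(periods per year)."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return float("nan")  # undefined when returns never vary
    return statistics.mean(returns) / sd * math.sqrt(periods_per_year)


def fold_summary(fold_returns: Sequence[Sequence[float]]) -> Dict[str, object]:
    """Per-fold Sharpe plus the dispersion across folds (needs at least two folds)."""
    sharpes = [annualized_sharpe(r) for r in fold_returns]
    return {
        "per_fold": sharpes,
        "mean": statistics.mean(sharpes),
        "stdev": statistics.stdev(sharpes),
        "worst": min(sharpes),
    }


# Made-up daily returns for three purged test folds.
example = [
    [0.004, -0.002, 0.003, 0.001, -0.001],
    [0.001, -0.003, 0.002, -0.002, 0.001],
    [0.006, 0.002, -0.001, 0.004, 0.003],
]
report = fold_summary(example)
print(report["mean"], report["stdev"], report["worst"])
```

Reporting the spread and the worst fold next to the mean keeps a single strong fold from carrying the headline number.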
Combinatorial purged cross-validation (CPCV)
Standard purged k-fold produces only k test folds, which stitch together into a single backtest path. Markets have few independent regimes; five folds may all include 2020–2021 liquidity extremes. Combinatorial purged CV generates many train/test combinations from the same timeline: choose k test groups from n contiguous blocks, purge and embargo each combination, and collect a distribution of performance paths rather than a single mean.
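To make the combinatorics concrete, the sketch below enumerates the block-level train/test combinations and counts the resulting backtest paths. It assumes the usual CPCV path count of (k / n) × C(n, k) and leaves out the purge and embargo that each combination still needs. The names `cpcv_splits` and `n_paths` are hypothetical.

```python
from itertools import combinations
from math import comb
from typing import Iterator, Tuple


def cpcv_splits(n_blocks: int, k_test: int) -> Iterator[Tuple[Tuple[int, ...], Tuple[int, ...]]]:
    """Yield (train_blocks, test_blocks) for every choice of k_test blocks out of n_blocks."""
    for test_blocks in combinations(range(n_blocks), k_test):
        train_blocks = tuple(b for b in range(n_blocks) if b not in test_blocks)
        yield train_blocks, test_blocks


def n_paths(n_blocks: int, k_test: int) -> int:
    """Number of full backtest paths: (k / n) * C(n, k), which is always an integer."""
    return comb(n_blocks, k_test) * k_test // n_blocks


splits = list(cpcv_splits(6, 2))
print(len(splits), n_paths(6, 2))  # 15 5
print(splits[0])                   # ((2, 3, 4, 5), (0, 1))
```

Every block appears in the test role the same number of times, which is what lets the out-of-sample pieces be reassembled into several complete paths.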
CPCV answers questions k-fold cannot:
- What fraction of plausible paths beat a hurdle rate (a path-level proxy for the probability of backtest overfitting, PBO)?
- How sensitive is the strategy to which crisis years land in test?
- What is the range of maximum drawdown across combinatorial paths?
The compute cost grows combinatorially — use CPCV for final gatekeeping on strategies that already pass purged k-fold and walk-forward, not for every hyperparameter grid search. Pair CPCV summaries with bootstrap resampling on trade-level returns when you need confidence intervals on Sharpe, not just point estimates.
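For the bootstrap step, a percentile interval on a per-trade Sharpe can be sketched with the standard library alone. This assumes trade returns are close enough to independent for plain resampling with replacement; serially dependent daily returns would call for a block bootstrap instead. The function names and the simulated trades are illustrative.

```python
import math
import random
from typing import List, Sequence, Tuple


def sharpe(returns: Sequence[float]) -> float:
    """Per-period Sharpe (mean over sample standard deviation), zero risk-free rate."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var) if var > 0 else float("nan")


def bootstrap_sharpe_ci(
    returns: Sequence[float], n_boot: int = 1000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile confidence interval from resampling trades with replacement."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    draws: List[float] = []
    for _ in range(n_boot):
        sample = rng.choices(returns, k=len(returns))
        s = sharpe(sample)
        if not math.isnan(s):
            draws.append(s)
    draws.sort()
    lo = draws[int(alpha / 2 * len(draws))]
    hi = draws[min(len(draws) - 1, int((1 - alpha / 2) * len(draws)))]
    return lo, hi


# Simulated trade returns, for illustration only.
gen = random.Random(42)
trades = [gen.gauss(0.002, 0.02) for _ in range(200)]
low, high = bootstrap_sharpe_ci(trades, n_boot=500)
print(round(sharpe(trades), 3), round(low, 3), round(high, 3))
```

A wide interval around a Sharpe that clears the hurdle is a warning in its own right, even when the point estimate looks comfortable.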
Harbor Capital momentum sleeve refactor
Harbor's quant team rebuilt validation after the purged-CV shock:
- Replaced random k-fold with purged five-fold on 12-month forward labels; embargo set to 5% of sample (~63 trading days).
- Logged label start/end per row from triple-barrier events instead of fixed-horizon assumptions.
- Ran CPCV with 8 blocks and 2 test groups to estimate PBO; strategies with more than 40% of paths below the hurdle were rejected.
- Compared purged CV Sharpe against walk-forward rolls (126-day train, 21-day test step), targeting agreement within 0.1 Sharpe.
- Added point-in-time features only; verified no survivorship bias in the underlying universe.
- Documented purge/embargo parameters in the model card for allocator audit.
Result: two of three candidate momentum variants failed the purged gate; the surviving sleeve launched at half the originally proposed capital with walk-forward Sharpe 0.54 vs the inflated 1.42. The desk now treats any CV method that ignores label overlap as a research bug, not a conservative estimate.
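The refactor's parameters imply some simple arithmetic, worked here as a sanity check. The sample length T is inferred from the stated embargo rather than reported by the desk, and the path count assumes the standard CPCV formula; whether Harbor counted paths or individual splits is not stated.

```latex
\begin{aligned}
T &\approx \frac{63}{0.05} = 1260 \ \text{rows (about five years of daily bars)} \\
\binom{8}{2} &= 28 \ \text{train/test combinations} \\
\varphi &= \frac{k}{N}\binom{N}{k} = \frac{2}{8}\cdot 28 = 7 \ \text{backtest paths} \\
\frac{2}{7} &\approx 28.6\% \le 40\% < \frac{3}{7} \approx 42.9\%
\end{aligned}
```

Under those assumptions, the 40% rule rejects a strategy once three or more of its seven paths fall below the hurdle.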
Method decision table
| Validation method | Strength | Weakness | Best for |
|---|---|---|---|
| Random k-fold | High fold count, familiar APIs | Severe leakage with overlapping labels | Never for horizon-based return labels |
| Simple chronological split | Easy, no shuffle leakage | Still leaks at fold boundaries; single path | Quick sanity checks only |
| Walk-forward (rolling train/test) | Mimics live deployment; one clear OOS path | Single path; window choice biases result | Production simulation, final sign-off |
| Purged k-fold | Removes label overlap leakage; multiple folds | Needs per-row label intervals; tuning embargo | ML strategy selection, hyperparameter screening |
| Combinatorial purged CV | Distribution of paths; PBO estimates | Expensive; still not live trading | Capital allocation gate on finalists |
| Purged CV without walk-forward | Fast iteration | Misses transaction-cost drift and capacity | Never as sole deployment approval |
Common pitfalls
- Purging on decision time only — must use full label interval endpoints; barrier labels end at touch time, not horizon.
- Zero embargo with rolling features — 60-day momentum features correlate across purged gaps; add embargo ≥ max lookback.
- Confusing purging with point-in-time data — purging fixes overlap; it does not replace survivorship-free universes.
- Tuning on purged test folds — repeatedly peeking at purged CV while tweaking still overfits; hold a final sealed walk-forward.
- Ignoring multiple testing — purged Sharpe across 200 variants needs deflation or Bonferroni-style discipline.
- Fixed embargo for variable labels — event-based labels of uneven length need embargo scaled to feature decay, not a constant row count.
- Reporting accuracy instead of P&L — classification accuracy on overlapping labels flatters models that do not translate to net returns after costs.
Production checklist
- Store label start and end timestamps (or bar indices) for every training row.
- Replace random k-fold with purged blocked splits for any horizon-based label.
- Set embargo length ≥ longest feature lookback or estimated autocorrelation horizon.
- Report per-fold Sharpe, turnover, and max drawdown — not just pooled accuracy.
- Require purged CV mean within 0.15 Sharpe of walk-forward before capital review.
- Run CPCV or PBO estimate on strategies passing purged k-fold.
- Apply transaction costs and slippage in walk-forward runs, not only inside purged CV scoring.
- Version purge/embargo parameters in model cards alongside feature definitions.
- Cross-check universe for survivorship and corporate-action timing.
- Seal a final out-of-sample period never used in any purged fold during research.
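Several of these items are mechanical enough to check in code before a capital review. The sketch below is one hypothetical way to encode them. The `ValidationCard` fields and the `failed_checks` function are illustrative names, and the 0.15 tolerance comes from the checklist above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ValidationCard:
    """Validation settings and results recorded alongside a model (hypothetical schema)."""
    labels_have_intervals: bool      # start and end stored for every training row
    embargo_rows: int
    max_feature_lookback: int
    purged_cv_sharpe: float
    walk_forward_sharpe: float
    sealed_holdout_untouched: bool   # final period never used in any purged fold


def failed_checks(card: ValidationCard, max_sharpe_gap: float = 0.15) -> List[str]:
    """Return the checklist items this card fails; an empty list means it may proceed."""
    failures = []
    if not card.labels_have_intervals:
        failures.append("missing per-row label start/end")
    if card.embargo_rows < card.max_feature_lookback:
        failures.append("embargo shorter than longest feature lookback")
    if abs(card.purged_cv_sharpe - card.walk_forward_sharpe) > max_sharpe_gap:
        failures.append("purged CV and walk-forward Sharpe disagree")
    if not card.sealed_holdout_untouched:
        failures.append("sealed out-of-sample period was used during research")
    return failures


# Numbers echo the Harbor example: 63-row embargo, 60-day momentum lookback.
ok_card = ValidationCard(True, 63, 60, 0.58, 0.61, True)
bad_card = ValidationCard(True, 0, 60, 1.42, 0.61, True)
print(failed_checks(ok_card))   # []
print(failed_checks(bad_card))
```

Passing these checks only clears a strategy for review; the hurdle rate, costs, and capacity questions are decided separately.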
Key takeaways
- Overlapping return labels leak future prices into training when folds share calendar days.
- Purging removes training rows whose label intervals intersect the test set; embargo blocks serial correlation across boundaries.
- Purged k-fold Sharpe often lands far below naive CV — treat the gap as corrected truth, not pessimism.
- Combinatorial purged CV estimates how often a strategy path could disappoint, not just average performance.
- Purged validation complements walk-forward; neither alone approves live capital.
Related reading
- Backtesting trading strategies explained — walk-forward rolls, costs, and overfitting controls
- Data leakage in machine learning explained — temporal leakage patterns beyond finance
- Survivorship bias in investing and backtesting explained — honest universes for validation inputs
- Bootstrap resampling explained — confidence intervals on Sharpe and drawdown