
Purged cross-validation for backtesting explained

Harbor Capital's systematic desk trained a gradient-boosted momentum classifier on 12-month forward return labels across 400 liquid U.S. equities. Standard five-fold cross-validation reported a mean out-of-sample Sharpe of 1.42, well above the desk's 0.8 deployment threshold, yet walk-forward backtests on the same features managed only 0.61. The gap was not bad luck; it was label-overlap leakage. Because forward-return labels span overlapping calendar windows, each training row's label was built from prices in months that also fell inside adjacent test folds. Purged cross-validation removes training observations whose label intervals intersect the test set, then adds an embargo buffer so serial correlation cannot bridge the gap. After purging and a 5% embargo, the honest cross-validated Sharpe fell to 0.58: aligned with walk-forward and low enough to reject deployment.

This guide explains why naive k-fold fails on financial series, how purging and embargo work, and how combinatorial purged CV yields stability estimates. It then covers the Harbor Capital refactor, a method decision table to use alongside our backtesting guide, common pitfalls, and a production checklist tied to data leakage discipline.

Why standard k-fold leaks on financial labels

Classical k-fold cross-validation assumes observations are independent and identically distributed. Financial return labels routinely violate both assumptions. A label defined as “return from day t to day t + H” uses price paths that overlap across nearby rows: the labels for rows t and t + 1 share H − 1 days of the same future returns. When folds are shuffled or blocked without accounting for that overlap, the model trains on prices that also determine test-set labels.
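To make the overlap concrete, the sketch below counts how many daily returns two forward-return labels have in common. The function name `shared_label_days`, the integer day indexing, and the 252-day horizon are illustrative assumptions rather than part of any library.

```python
def shared_label_days(t_a: int, t_b: int, horizon: int) -> int:
    """Count daily returns shared by two forward-return labels.

    Assumption: a label that starts at day t is built from the daily
    returns on days t+1 .. t+horizon (inclusive).
    """
    first_shared = max(t_a, t_b) + 1
    last_shared = min(t_a, t_b) + horizon
    return max(0, last_shared - first_shared + 1)


H = 252  # hypothetical 12-month horizon in trading days

print(shared_label_days(0, 1, H))    # adjacent rows: H - 1 = 251 shared days
print(shared_label_days(0, 126, H))  # rows six months apart: 126 shared days
print(shared_label_days(0, 252, H))  # rows a full horizon apart: 0 shared days
```

Two rows stop sharing information only once they are at least a full horizon apart, which is why a fold boundary alone does not separate train from test.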

Temporal vs random splits

Random k-fold is the worst offender because it places adjacent trading days in train and test simultaneously. Simple chronological splits fix obvious look-ahead but still leak when label horizons span fold boundaries: the labels of the last training rows before a test block are computed from prices inside the test window. Purging explicitly deletes any training sample whose label interval [t_start, t_end] intersects the test interval.

Serial correlation and embargo

Even after purging intersecting labels, features computed from rolling windows (20-day volatility, 60-day momentum) remain correlated across time. An embargo removes an additional slice of training rows immediately before and after each test segment, typically 1–5% of the sample length. (López de Prado's original formulation embargoes only the rows that follow a test segment; the symmetric version described here is the more conservative choice.) The embargo length should match the longest feature lookback or autocorrelation decay you believe matters.
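As a rough sizing aid, the sketch below turns a percentage embargo into a row count and floors it at the longest feature lookback. The helper name `embargo_rows` and the 1,260-row sample are hypothetical; only the 5% figure and the 60-day momentum lookback come from this guide.

```python
def embargo_rows(n_samples: int, pct: float, max_lookback: int = 0) -> int:
    """Embargo length in rows: a share of the sample, floored at the
    longest feature lookback (a sizing convention assumed for this sketch)."""
    return max(round(n_samples * pct), max_lookback)


# Hypothetical sample of 1,260 daily rows (about five years).
print(embargo_rows(1260, 0.05, max_lookback=60))   # 63: the percentage dominates
print(embargo_rows(1260, 0.01, max_lookback=60))   # 60: the lookback dominates
```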

Purged k-fold step by step

Marcos López de Prado's purged k-fold procedure adapts blocked time-series CV for overlapping outcomes:

  1. Sort observations by label start time (or decision time).
  2. Partition the timeline into k contiguous test folds.
  3. For each fold, define the test index set T.
  4. Purge: drop training rows whose label end ≥ test start and label start ≤ test end.
  5. Embargo: drop training rows within h observations of either test boundary.
  6. Train on the remaining train set; score the test fold; aggregate metrics across folds.
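The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the mlfinlab or skfolio implementation: the function name `purged_kfold`, its parameters, and the use of integer bar indices for label intervals are all assumptions, and rows are assumed to be sorted by label start already.

```python
from typing import Iterator, List, Sequence, Tuple


def purged_kfold(
    label_start: Sequence[int],
    label_end: Sequence[int],
    n_splits: int = 5,
    embargo: int = 0,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs with purging and a symmetric embargo."""
    n = len(label_start)
    # Step 2: contiguous, near-equal test folds over the sorted rows.
    bounds = [i * n // n_splits for i in range(n_splits + 1)]
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        test = list(range(lo, hi))  # step 3: test index set T
        test_start = min(label_start[i] for i in test)
        test_end = max(label_end[i] for i in test)
        train = []
        for i in range(n):
            if lo <= i < hi:
                continue  # row belongs to the test fold
            # Step 4 (purge): label interval intersects the test interval.
            if label_end[i] >= test_start and label_start[i] <= test_end:
                continue
            # Step 5 (embargo): within `embargo` rows of either boundary.
            if lo - embargo <= i < hi + embargo:
                continue
            train.append(i)
        yield train, test


# Hypothetical data: 100 daily rows, each label spanning 10 bars.
starts = list(range(100))
ends = [t + 10 for t in starts]
train, test = list(purged_kfold(starts, ends, n_splits=5, embargo=15))[1]
print(test[0], test[-1])   # test fold covers rows 20..39
print(train[:5], train[5])  # rows 0..4 survive, then training resumes at row 55
```

In this toy run the purge alone removes rows 10..19 and 40..49, and the 15-row embargo widens the excluded band to rows 5..54. Step 6 (fit, score, aggregate) is left to whatever model and metric the pipeline uses.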

In Python ecosystems, implementations appear in mlfinlab, skfolio, and custom pipelines atop scikit-learn's TimeSeriesSplit. The critical input is a per-row label interval — not just a single timestamp. For event-based labels (triple-barrier hits), intervals vary in length; purging must use actual barrier touch times, not assumed fixed horizons.

Metrics to report per fold

  • Sharpe or information ratio on purged test returns — not accuracy on overlapping labels.
  • Turnover and capacity — purged CV on tiny sleeves can still overstate scalability.
  • Deflated Sharpe when running many trials; purging fixes leakage, not multiple-testing inflation.
  • Fold stability — wide dispersion across folds signals regime sensitivity, not alpha.
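A per-fold report along these lines might look like the sketch below. It assumes daily excess returns (risk-free rate already removed), 252 trading days per year, and hypothetical helper names `annualized_sharpe` and `fold_report`; the sample returns are made up for illustration.

```python
import math
import statistics
from typing import Dict, List, Sequence


def annualized_sharpe(daily_returns: Sequence[float], periods: int = 252) -> float:
    """Annualized Sharpe from daily excess returns.

    Raises ZeroDivisionError if the returns have zero variance.
    """
    mu = statistics.fmean(daily_returns)
    sigma = statistics.stdev(daily_returns)  # sample standard deviation
    return math.sqrt(periods) * mu / sigma


def fold_report(fold_returns: List[Sequence[float]]) -> Dict[str, float]:
    """Summarize Sharpe across test folds, including its dispersion."""
    sharpes = [annualized_sharpe(r) for r in fold_returns]
    return {
        "mean": statistics.fmean(sharpes),
        "stdev": statistics.stdev(sharpes),
        "min": min(sharpes),
        "max": max(sharpes),
    }


# Hypothetical test-fold return series (far shorter than real folds).
folds = [
    [0.010, -0.010, 0.020, 0.000],
    [0.000, 0.010, -0.020, 0.005],
    [0.003, 0.001, -0.002, 0.004],
]
print(fold_report(folds))
```

Reporting the minimum and the standard deviation next to the mean keeps a single strong fold from hiding weak ones.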

Combinatorial purged cross-validation (CPCV)

Standard purged k-fold produces only k test folds, which stitch together into a single backtest path. Markets have few independent regimes; five folds may all include 2020–2021 liquidity extremes. Combinatorial purged CV generates many train/test combinations from the same timeline: choose k test groups from n contiguous blocks (here k counts test groups, not folds), purge and embargo each combination, and collect a distribution of performance paths rather than a single mean.
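The combinatorics can be checked with a few lines of standard-library Python. The sketch enumerates test-group choices and counts complete backtest paths using the counting argument from López de Prado's description of CPCV, where each block appears in C(n − 1, k − 1) test sets. The function names are hypothetical, and purging and embargo per combination are omitted.

```python
from itertools import combinations
from math import comb
from typing import List, Tuple


def cpcv_test_groups(n_blocks: int, n_test: int) -> List[Tuple[int, ...]]:
    """All ways to pick n_test test blocks out of n_blocks contiguous blocks."""
    return list(combinations(range(n_blocks), n_test))


def cpcv_path_count(n_blocks: int, n_test: int) -> int:
    """Number of complete backtest paths.

    Each block is tested comb(n_blocks - 1, n_test - 1) times, so the
    test results can be reassembled into that many full-timeline paths.
    """
    return comb(n_blocks - 1, n_test - 1)


splits = cpcv_test_groups(8, 2)
print(len(splits))            # 28 train/test combinations
print(splits[:3])             # [(0, 1), (0, 2), (0, 3)]
print(cpcv_path_count(8, 2))  # 7 backtest paths
```

With 8 blocks and 2 test groups, 28 model fits yield 7 paths, compared with the single path from five-fold purged CV.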

CPCV answers questions k-fold cannot:

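  • What fraction of plausible paths beat a hurdle rate? This path-level failure rate is a practical proxy for the probability of backtest overfitting (PBO), which formally measures how often the in-sample best configuration ranks below the out-of-sample median.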
  • How sensitive is the strategy to which crisis years land in test?
  • What is the range of maximum drawdown across combinatorial paths?
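A path-level summary of this kind can be as simple as the sketch below. The path Sharpe values are invented for illustration; the 0.8 hurdle and the 40% rejection cutoff echo the Harbor figures quoted elsewhere in this guide, and the function name is hypothetical.

```python
from typing import Sequence


def fraction_below_hurdle(path_sharpes: Sequence[float], hurdle: float) -> float:
    """Share of combinatorial paths whose Sharpe falls below the hurdle."""
    below = sum(1 for s in path_sharpes if s < hurdle)
    return below / len(path_sharpes)


# Hypothetical Sharpe ratios for seven CPCV paths.
paths = [0.90, 0.40, 1.10, 0.70, 0.85, 0.30, 0.95]

frac = fraction_below_hurdle(paths, hurdle=0.8)
reject = frac > 0.40  # assumed rejection rule: more than 40% of paths fail
print(round(frac, 3), reject)  # 0.429 True
```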

The compute cost grows combinatorially — use CPCV for final gatekeeping on strategies that already pass purged k-fold and walk-forward, not for every hyperparameter grid search. Pair CPCV summaries with bootstrap resampling on trade-level returns when you need confidence intervals on Sharpe, not just point estimates.
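For the bootstrap step, a minimal percentile-interval sketch is shown below. It resamples trade-level returns independently with replacement, which assumes trades are roughly independent; a block bootstrap would be the natural substitute when they are not. The function names, the unannualized per-trade Sharpe, and the simulated returns are all assumptions for illustration.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def per_trade_sharpe(returns: Sequence[float]) -> float:
    """Unannualized Sharpe: mean trade return over its sample stdev."""
    return statistics.fmean(returns) / statistics.stdev(returns)


def bootstrap_sharpe_ci(
    returns: Sequence[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the per-trade Sharpe."""
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(returns)
    stats: List[float] = []
    for _ in range(n_boot):
        sample = rng.choices(returns, k=n)  # resample with replacement
        stats.append(per_trade_sharpe(sample))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Simulated trade returns, purely illustrative.
sim = random.Random(1)
trades = [sim.gauss(0.001, 0.01) for _ in range(250)]
print(per_trade_sharpe(trades), bootstrap_sharpe_ci(trades))
```

A wide interval around the point estimate is a reminder that a Sharpe figure from a few hundred trades carries substantial sampling error.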

Harbor Capital momentum sleeve refactor

Harbor's quant team rebuilt validation after the purged-CV shock:

  1. Replaced random k-fold with purged five-fold on 12-month forward labels; embargo set to 5% of sample (~63 trading days).
  2. Logged label start/end per row from triple-barrier events instead of fixed-horizon assumptions.
  3. Ran CPCV with 8 blocks and 2 test groups to estimate PBO; strategies with >40% paths below hurdle were rejected.
  4. Compared purged CV Sharpe against walk-forward rolls (126-day train window, 21-day test step), targeting agreement within 0.1 Sharpe.
  5. Restricted inputs to point-in-time features; verified no survivorship bias in the underlying universe.
  6. Documented purge/embargo parameters in the model card for allocator audit.
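The walk-forward rolls in step 4 can be generated with a small helper like the one below. It is a sketch under assumptions: the test window is taken to equal the 21-day step, the 1,260-row sample is hypothetical, and the optional `gap` parameter (not mentioned in the Harbor description) is included only to show where a label-horizon gap between train and test would go.

```python
from typing import List, Tuple


def walk_forward_windows(
    n_rows: int,
    train_len: int = 126,
    test_len: int = 21,
    gap: int = 0,
) -> List[Tuple[range, range]]:
    """Rolling (train, test) index ranges that advance by test_len rows."""
    windows = []
    start = 0
    while start + train_len + gap + test_len <= n_rows:
        train = range(start, start + train_len)
        test_start = start + train_len + gap
        test = range(test_start, test_start + test_len)
        windows.append((train, test))
        start += test_len  # step forward by one test window
    return windows


windows = walk_forward_windows(1260)  # hypothetical five years of daily rows
print(len(windows))                   # 54 rolls
print(windows[0][1], windows[-1][1])  # first and last test ranges
```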

Result: two of three candidate momentum variants failed the purged gate; the surviving sleeve launched at half the originally proposed capital with walk-forward Sharpe 0.54 vs the inflated 1.42. The desk now treats any CV method that ignores label overlap as a research bug, not a conservative estimate.

Method decision table

| Validation method | Strength | Weakness | Best for |
|---|---|---|---|
| Random k-fold | High fold count, familiar APIs | Severe leakage with overlapping labels | Never for horizon-based return labels |
| Simple chronological split | Easy, no shuffle leakage | Still leaks at fold boundaries; single path | Quick sanity checks only |
| Walk-forward (rolling train/test) | Mimics live deployment; one clear OOS path | Single path; window choice biases result | Production simulation, final sign-off |
| Purged k-fold | Removes label overlap leakage; multiple folds | Needs per-row label intervals; tuning embargo | ML strategy selection, hyperparameter screening |
| Combinatorial purged CV | Distribution of paths; PBO estimates | Expensive; still not live trading | Capital allocation gate on finalists |
| Purged CV without walk-forward | Fast iteration | Misses transaction-cost drift and capacity | Never as sole deployment approval |

Common pitfalls

  • Purging on decision time only — must use full label interval endpoints; barrier labels end at touch time, not horizon.
  • Zero embargo with rolling features — 60-day momentum features correlate across purged gaps; add embargo ≥ max lookback.
  • Confusing purging with point-in-time data — purging fixes overlap; it does not replace survivorship-free universes.
  • Tuning on purged test folds — repeatedly peeking at purged CV while tweaking still overfits; hold a final sealed walk-forward.
  • Ignoring multiple testing — purged Sharpe across 200 variants needs deflation or Bonferroni-style discipline.
  • Fixed embargo for variable labels — event-based labels of uneven length need embargo scaled to feature decay, not a constant row count.
  • Reporting accuracy instead of P&L — classification accuracy on overlapping labels flatters models that do not translate to net returns after costs.
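To illustrate the multiple-testing pitfall, the sketch below computes the annualized Sharpe a strategy would need to clear a one-sided Bonferroni-adjusted test. It rests on a simplifying assumption: returns are i.i.d. normal, so the test statistic is approximately Sharpe × √years. This is plain Bonferroni, not the deflated Sharpe ratio, and the function name and five-year sample are hypothetical.

```python
import math
from statistics import NormalDist


def bonferroni_sharpe_hurdle(n_trials: int, years: float, alpha: float = 0.05) -> float:
    """Annualized Sharpe required for significance after n_trials attempts.

    Assumes i.i.d. normal returns, so z is roughly Sharpe * sqrt(years).
    """
    z_critical = NormalDist().inv_cdf(1 - alpha / n_trials)
    return z_critical / math.sqrt(years)


# Same five-year sample, one variant versus 200 variants.
print(bonferroni_sharpe_hurdle(1, years=5.0))
print(bonferroni_sharpe_hurdle(200, years=5.0))
```

The required Sharpe roughly doubles when 200 variants share one sample, which is why the best purged result in a large search should not be read at face value.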

Production checklist

  • Store label start and end timestamps (or bar indices) for every training row.
  • Replace random k-fold with purged blocked splits for any horizon-based label.
  • Set embargo length ≥ longest feature lookback or estimated autocorrelation horizon.
  • Report per-fold Sharpe, turnover, and max drawdown — not just pooled accuracy.
  • Require purged CV mean within 0.15 Sharpe of walk-forward before capital review.
  • Run CPCV or PBO estimate on strategies passing purged k-fold.
  • Apply transaction costs and slippage in walk-forward, not only in purged CV features.
  • Version purge/embargo parameters in model cards alongside feature definitions.
  • Cross-check universe for survivorship and corporate-action timing.
  • Seal a final out-of-sample period never used in any purged fold during research.
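The alignment item in this checklist could be encoded as a simple gate, sketched below with the Sharpe figures from the Harbor example. The function name, the returned keys, and the choice to test the lower of the two Sharpe values against the deployment threshold are assumptions made for illustration.

```python
from typing import Dict


def capital_review_gate(
    cv_sharpe: float,
    walk_forward_sharpe: float,
    deploy_threshold: float = 0.8,
    max_gap: float = 0.15,
) -> Dict[str, bool]:
    """Check CV/walk-forward agreement and the deployment threshold."""
    aligned = abs(cv_sharpe - walk_forward_sharpe) <= max_gap
    # Conservative assumption: the weaker estimate must clear the bar.
    clears = min(cv_sharpe, walk_forward_sharpe) >= deploy_threshold
    return {"aligned": aligned, "clears_threshold": clears, "proceed": aligned and clears}


print(capital_review_gate(1.42, 0.61))  # naive CV: not aligned with walk-forward
print(capital_review_gate(0.58, 0.61))  # purged CV: aligned, but below threshold
```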

Key takeaways

  • Overlapping return labels leak future prices into training when folds share calendar days.
  • Purging removes training rows whose label intervals intersect the test set; embargo blocks serial correlation across boundaries.
  • Purged k-fold Sharpe often lands far below naive CV — treat the gap as corrected truth, not pessimism.
  • Combinatorial purged CV estimates how often a strategy path could disappoint, not just average performance.
  • Purged validation complements walk-forward; neither alone approves live capital.
