Backtesting trading strategies explained

Harbor Capital’s quant desk spent six weeks tuning a 12-month momentum rotation rule on large-cap U.S. equities. The in-sample backtest showed a 1.4 Sharpe ratio, a smooth equity curve, and a maximum drawdown under 12%. Paper trading for two months delivered half the return with twice the volatility. The gap was not bad luck: it came from biased history, ignored turnover costs, and parameters fit to noise. Backtesting replays a strategy on past prices to estimate how it might have performed, but only honest simulation separates a durable edge from a spreadsheet fantasy. This guide covers what backtests can and cannot prove, validation splits and walk-forward design, the biases that inflate results, realistic cost and liquidity modeling, a Harbor Capital momentum sleeve worked example, a backtest method decision table, common pitfalls, and a production checklist. It complements pairs trading, position sizing, and technical analysis without replacing live risk management.

What backtesting is — and what it is not

A backtest applies explicit rules to historical market data: when to enter, exit, size positions, and rebalance. The output is a simulated equity curve, trade log, and summary statistics (return, volatility, drawdown, win rate). Backtests are cheap experiments; they let you reject bad ideas before allocating capital.

They are not proof of future profits. Markets change regime, liquidity shifts, and your own size can move prices. A backtest answers: “If this rule had been applied mechanically in the past, with these assumptions, what would have happened?” The follow-up question is always whether those assumptions survive contact with reality.

Vectorized vs event-driven engines

Vectorized backtests — operate on full price matrices at once (pandas, numpy). Fast for screening thousands of parameter grids; harder to model complex order logic or partial fills.
Event-driven backtests — step bar-by-bar or tick-by-tick, processing signals, orders, and portfolio state sequentially. Slower but closer to production code paths; preferred before live deployment.

Start vectorized for research; finish event-driven before you trust dollar figures.
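As a rough illustration of the event-driven style described above, the sketch below steps through bars one at a time and fills each order at the next bar's open. The toy rule (hold one unit when the close is above its trailing mean), the `Bar` class, and `run_event_driven` are assumptions made for this example, not part of any particular engine.

```python
from dataclasses import dataclass


@dataclass
class Bar:
    open: float
    close: float


def run_event_driven(bars, lookback=3):
    """Bar-by-bar backtest of a toy long/flat rule (hypothetical example).

    A target position decided at the close of bar t is filled at the
    open of bar t+1, so no decision trades on the price that produced it.
    Returns (final_equity, trades); trades are (bar_index, units, price).
    """
    if not bars:
        return 0.0, []
    cash, position = 0.0, 0   # cash P&L and units currently held
    pending = None            # target position queued on the previous bar
    closes, trades = [], []
    for i, bar in enumerate(bars):
        # 1) Fill the order queued on the previous bar at this bar's open.
        if pending is not None and pending != position:
            delta = pending - position
            cash -= delta * bar.open
            trades.append((i, delta, bar.open))
            position = pending
        pending = None
        # 2) Update state using only information known at this bar's close.
        closes.append(bar.close)
        if len(closes) >= lookback:
            trailing_mean = sum(closes[-lookback:]) / lookback
            pending = 1 if bar.close > trailing_mean else 0
    # Mark any open position to the final close.
    return cash + position * bars[-1].close, trades
```

The same rule could be screened in vectorized form first; the sequential loop mainly earns its keep once order queues, partial fills, or portfolio constraints need to be modeled.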

Validation design: in-sample, out-of-sample, and walk-forward

The cardinal sin is tuning parameters on the same data you report as performance. Split history deliberately:

In-sample (IS) — training window where you choose lookback lengths, thresholds, and universe filters.
Out-of-sample (OOS) — held-out period evaluated once with frozen rules. OOS Sharpe below IS is normal; OOS near zero means the edge was likely curve-fit.
Walk-forward — roll the IS/OOS window forward through time (e.g., train on 3 years, test on 6 months, advance 6 months, repeat). Aggregating OOS segments approximates how a strategy would have been discovered and deployed sequentially — the gold standard for systematic desks.

For time-series strategies, random k-fold cross-validation from machine learning is usually wrong: shuffled folds let the model train on observations that come after the ones it is tested on, which leaks future information. Use purged, embargoed splits when labels overlap, as in finance ML (see data leakage for the parallel in model training).
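The rolling scheme above can be sketched as a small index generator. This is a minimal version that assumes equally spaced observations (for example, months) and a simple embargo gap between train and test; the function name and parameters are illustrative, and real purging of overlapping labels would need label start and end times that are not modeled here.

```python
def walk_forward_splits(n_obs, train_len, test_len, embargo=0):
    """Yield (train_indices, test_indices) over a time-ordered sample.

    The window advances by test_len each step. `embargo` observations
    between the end of train and the start of test are dropped to limit
    leakage from overlapping labels (a simplification of full purging).
    """
    start = 0
    while start + train_len + embargo + test_len <= n_obs:
        train_end = start + train_len
        test_start = train_end + embargo
        yield range(start, train_end), range(test_start, test_start + test_len)
        start += test_len


# Usage: 3 years of monthly data to train, 6 months to test, 1-month embargo.
for train, test in walk_forward_splits(60, train_len=36, test_len=6, embargo=1):
    print(f"train {train.start}-{train.stop - 1}, test {test.start}-{test.stop - 1}")
```

Every training index precedes every test index in each split, which is the property random k-fold breaks.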

Minimum sample size heuristics

Rule of thumb: you need enough independent trades or rebalance periods for statistics to stabilize. A strategy with 30 trades and a 60% win rate has wide confidence intervals; claiming a 1.5 Sharpe on a single year of monthly rebalances, just 12 data points, is storytelling, not inference. Prefer longer histories across multiple macro regimes (2008, 2020, 2022) when data exists.
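To put rough numbers on the two cases above, the sketch below computes a Wilson score interval for a win rate and an approximate standard error for an annualized Sharpe ratio. Both rest on assumptions not stated in the guide: trades are treated as independent, and the Sharpe formula is the common large-sample approximation for i.i.d. returns, so fat tails or autocorrelation would widen the true uncertainty. Function names are illustrative.

```python
import math


def wilson_interval(wins, trades, z=1.96):
    """Wilson score interval for a win rate (z=1.96 is roughly 95%)."""
    p = wins / trades
    denom = 1.0 + z * z / trades
    center = (p + z * z / (2 * trades)) / denom
    half = z * math.sqrt(p * (1 - p) / trades + z * z / (4 * trades ** 2)) / denom
    return center - half, center + half


def sharpe_standard_error(annual_sharpe, n_periods, periods_per_year=12):
    """Approximate standard error of an annualized Sharpe ratio.

    Uses the i.i.d. large-sample approximation
    SE(per-period SR) ~ sqrt((1 + SR^2 / 2) / n), then annualizes.
    """
    sr = annual_sharpe / math.sqrt(periods_per_year)
    se = math.sqrt((1.0 + 0.5 * sr * sr) / n_periods)
    return se * math.sqrt(periods_per_year)


low, high = wilson_interval(wins=18, trades=30)   # 60% win rate on 30 trades
se_12 = sharpe_standard_error(1.5, n_periods=12)  # 12 monthly observations
print(f"win rate interval: {low:.2f} to {high:.2f}; Sharpe SE: {se_12:.2f}")
```

Under these assumptions the win-rate interval still includes 50%, and the Sharpe standard error from 12 monthly observations is about one full Sharpe unit.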

Biases that make backtests lie

These inflate backtested returns more often than any single bug in code:

Survivorship bias — testing only stocks still trading today omits delisted bankruptcies. Use point-in-time universes with historical index constituents.
Look-ahead bias — using revised earnings, restated fundamentals, or same-bar close prices for signals executed at that close. Signals must use data available before the trade timestamp; execute at next open or with explicit lag.
Data snooping / multiple testing — trying 500 parameter combos and reporting the best one without correction. Each trial increases false-discovery risk; track trial count and apply deflated Sharpe or Bonferroni-style skepticism.
Selection bias — backtesting only assets you already believe in (e.g., mega-cap tech since 2010). Pre-register universe rules.
Corporate actions errors — unadjusted splits, missing dividends, bad timezone alignment on intraday data.
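The look-ahead item in the list above is easy to reproduce. In this sketch a signal computed from the close of bar t is applied either to the return that ends at that same close (lag 0, which leaks) or to the following bar's return (lag 1). The "up day" signal and the price series are made up for illustration.

```python
def lagged_pnl(closes, signal, lag):
    """Per-bar strategy returns for a long/flat signal.

    signal[t] is assumed to be known only at the close of bar t.
    lag=0 earns the return ending at that same close (look-ahead);
    lag=1 earns the next bar's return (tradable).
    """
    pnl = []
    for t in range(1, len(closes)):
        bar_return = closes[t] / closes[t - 1] - 1.0
        source = t - lag
        position = signal[source] if source >= 0 else 0
        pnl.append(position * bar_return)
    return pnl


closes = [100.0, 102.0, 101.0, 103.0, 102.0, 104.0]
# Hypothetical signal: 1 after an up day, else 0 (bar 0 has no history).
signal = [0] + [1 if closes[t] > closes[t - 1] else 0 for t in range(1, len(closes))]

leaky = lagged_pnl(closes, signal, lag=0)    # captures only the up days
honest = lagged_pnl(closes, signal, lag=1)   # trades one bar after the signal
print(sum(leaky), sum(honest))
```

With lag 0 the rule never records a losing bar, which is the signature of a signal that has seen its own outcome.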

If you cannot articulate which bias controls you applied, treat the backtest as illustrative only.
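For the multiple-testing control mentioned in the list above, a Bonferroni-style hurdle can be sketched in a few lines. The assumptions here are a one-sided test, a normal approximation, and the i.i.d. shortcut that the t-statistic of mean return is roughly the annualized Sharpe times the square root of the number of years; the function names are illustrative, and a deflated Sharpe ratio would be a more refined alternative.

```python
import math
from statistics import NormalDist


def bonferroni_z_threshold(n_trials, alpha=0.05):
    """One-sided z threshold after dividing alpha across n_trials."""
    return NormalDist().inv_cdf(1.0 - alpha / n_trials)


def min_annual_sharpe(n_trials, years, alpha=0.05):
    """Smallest annualized Sharpe that clears the corrected threshold.

    Assumes t-stat ~ annual Sharpe * sqrt(years), which holds only
    approximately and only for i.i.d. returns.
    """
    return bonferroni_z_threshold(n_trials, alpha) / math.sqrt(years)


# One pre-registered rule versus the best of 500 parameter combinations,
# both evaluated on 15 years of history.
print(min_annual_sharpe(1, 15), min_annual_sharpe(500, 15))
```

The hurdle rises from a z of about 1.64 for a single trial to about 3.72 for 500 trials, which is why the trial count belongs in the research log.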

Costs, slippage, and capacity

Gross returns minus realistic frictions equal what you keep:

Commissions and fees — per-share, per-contract, or bps of notional; include exchange and regulatory fees for futures.
Spread crossing — buy at ask, sell at bid; use half-spread minimum or historical bid-ask if available.
Slippage model — fixed bps, square-root market impact (size vs average daily volume), or participation rate caps for large orders.
Borrow costs — for short equity legs in pairs or long-short books; hard-to-borrow names can erase edge.
Turnover tax — high-frequency rebalancing compounds small costs; report turnover explicitly.

Capacity asks how much capital the strategy absorbs before impact dominates. A micro-cap mean-reversion signal that works at $100k may fail at $50M. Stress-test with 2× your intended AUM in the slippage model.
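A square-root impact model makes the capacity question concrete. The sketch below is one common functional form, with the coefficient, daily volatility, and dollar figures chosen purely for illustration; a real desk would calibrate these to its own fills.

```python
import math


def sqrt_impact_bps(order_value, adv_value, daily_vol_bps, coeff=1.0):
    """Estimated impact in bps: coeff * daily vol * sqrt(order / ADV).

    coeff is an assumed, uncalibrated constant; adv_value is the
    average daily traded value in the same currency as order_value.
    """
    if adv_value <= 0:
        raise ValueError("adv_value must be positive")
    return coeff * daily_vol_bps * math.sqrt(order_value / adv_value)


# Hypothetical micro-cap: $2M average daily volume, 300 bps daily volatility,
# with each order equal to 10% of the book.
small_book = sqrt_impact_bps(10_000, 2_000_000, 300)      # $100k book
large_book = sqrt_impact_bps(5_000_000, 2_000_000, 300)   # $50M book
stressed = sqrt_impact_bps(10_000_000, 2_000_000, 300)    # 2x the large book
print(small_book, large_book, stressed)
```

Under this form, doubling assets raises per-trade impact by a factor of the square root of two, so the 2× stress test shows how quickly the remaining edge shrinks.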

Overfitting: when the backtest is the strategy

Overfitting means fitting rules to noise in historical data rather than to a repeatable effect. Warning signs: many free parameters, sharp performance cliffs (small parameter changes collapse returns), an IS/OOS performance gap well beyond the normal decay, and strategies that only work in one decade or asset class.

Mitigations beyond walk-forward:

Parameter parsimony — prefer simple rules (single moving-average crossover) over 12-indicator confluence.
Economic narrative — can you explain why the edge should exist (risk premium, behavioral slow adjustment, structural flow)? Narrative-free mining is fragile.
Cross-asset robustness — does a related rule work on bonds, FX, or another equity region with adjusted parameters?
Monte Carlo reshuffling — bootstrap trade sequences or shuffle returns to see if observed Sharpe is exceptional vs luck (ties to bootstrap resampling).
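The resampling item above can be sketched with a percentile bootstrap. One caveat guides the design: simply reordering a return series leaves its mean and standard deviation, and therefore its Sharpe ratio, unchanged, so this example resamples with replacement instead. It assumes returns are independent across periods (a block bootstrap would be needed otherwise), and the return series is simulated rather than real.

```python
import math
import random
import statistics


def sharpe(returns, periods_per_year=12):
    """Annualized Sharpe of per-period excess returns (0.0 if no variance)."""
    sd = statistics.stdev(returns)
    return 0.0 if sd == 0 else statistics.mean(returns) / sd * math.sqrt(periods_per_year)


def bootstrap_sharpe_ci(returns, n_boot=2000, level=0.90, seed=0):
    """Percentile bootstrap interval for the Sharpe ratio (i.i.d. resampling)."""
    rng = random.Random(seed)
    n = len(returns)
    draws = sorted(sharpe(rng.choices(returns, k=n)) for _ in range(n_boot))
    lo = draws[int((1 - level) / 2 * n_boot)]
    hi = draws[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


# Simulated monthly returns standing in for a strategy's trade log.
rng = random.Random(1)
monthly = [rng.gauss(0.01, 0.04) for _ in range(60)]
low, high = bootstrap_sharpe_ci(monthly)
print(f"point estimate {sharpe(monthly):.2f}, 90% interval {low:.2f} to {high:.2f}")
```

If the lower end of the interval sits near or below zero, the observed Sharpe is hard to distinguish from luck at that sample size.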

Position sizing after validation should use fractional Kelly or fixed risk budgets — not sizes tuned inside the same backtest that discovered the signal.

Worked example: Harbor Capital 12-month momentum sleeve

Harbor Capital tests a simple cross-sectional momentum sleeve as a satellite to its core equity book:

Universe — S&P 500 constituents as of each rebalance date (point-in-time membership file, survivorship-safe).
Signal — 12-month total return skipping the most recent month (classic 12-1 momentum); rank stocks, long top decile, equal weight, short bottom decile optional for market-neutral variant.
Rebalance — monthly, first trading day; signals computed from prior month-end closes; trades executed at next-day VWAP proxy with 5 bps slippage per leg.
Costs — 2 bps commission, 5 bps slippage, 1 bps financing on gross exposure; turnover ~180% annually on long-only variant.
Validation — 1995–2010 IS for parameter sanity (decile vs quintile); 2011–2025 OOS frozen; walk-forward 3y/1y rolls reported separately.
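The signal and portfolio rules listed above can be sketched as follows. This is a simplified illustration, not Harbor Capital's code: it assumes month-end total-return price levels ordered oldest first, assumes the point-in-time universe filter has already been applied upstream, and covers only the long top-decile leg. Function names and tickers are hypothetical.

```python
def momentum_12_1(monthly_prices):
    """12-1 momentum: return from 12 months ago to 1 month ago.

    monthly_prices holds month-end total-return levels, oldest first,
    with the last element being the current month-end. The most recent
    month is skipped. Returns None when history is too short.
    """
    if len(monthly_prices) < 13:
        return None
    return monthly_prices[-2] / monthly_prices[-13] - 1.0


def top_decile_weights(price_history):
    """Equal-weight the top decile of names ranked by 12-1 momentum."""
    scores = {}
    for ticker, prices in price_history.items():
        score = momentum_12_1(prices)
        if score is not None:
            scores[ticker] = score
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_long = max(1, len(ranked) // 10)
    longs = ranked[:n_long]
    return {ticker: 1.0 / n_long for ticker in longs}


# Hypothetical universe of 20 names; T19 falls sharply in the latest month,
# which the skip-month rule deliberately ignores.
history = {
    f"T{i}": [100.0] * 11 + [100.0 + i, 50.0 if i == 19 else 200.0]
    for i in range(20)
}
print(top_decile_weights(history))
```

Weights computed this way at a month-end would be traded on the following session, matching the execution lag in the rebalance rule.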

Results: IS Sharpe ~0.9, OOS Sharpe ~0.55, walk-forward aggregate ~0.5. The momentum crashes stay in the record rather than being hidden: 2009 falls inside the IS window and 2020 inside the OOS log. Harbor sizes the sleeve at 8% of the portfolio risk budget using volatility targeting, not full Kelly on IS Sharpe. Paper trading runs 90 days before capital allocation, with live fills compared to simulated VWAP within an 8 bps tolerance.
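The cost assumptions in the worked example translate into an annual drag with simple arithmetic. The sketch below makes two assumptions the text does not spell out: turnover is read as total traded notional per year divided by capital, and the 1 bps financing charge is read as an annual rate on 100% gross exposure. It also includes a small helper for the paper-trading fill check. Function names are illustrative.

```python
def annual_cost_drag_bps(turnover, commission_bps, slippage_bps,
                         financing_bps, gross_exposure=1.0):
    """Annual cost drag in bps of capital.

    turnover: traded notional per year as a multiple of capital (assumed
    definition). Commission and slippage are charged per unit traded;
    financing is charged once per year on gross exposure (assumed).
    """
    trading = turnover * (commission_bps + slippage_bps)
    financing = financing_bps * gross_exposure
    return trading + financing


def fill_gap_bps(live_price, simulated_price):
    """Absolute gap between a live fill and the simulated price, in bps."""
    return abs(live_price / simulated_price - 1.0) * 1e4


# Long-only sleeve: ~180% turnover, 2 bps commission, 5 bps slippage,
# 1 bps financing.
drag = annual_cost_drag_bps(1.8, 2, 5, 1)
print(f"estimated drag: {drag:.1f} bps per year")

# Hypothetical fill: live 100.05 versus simulated 100.00, 8 bps tolerance.
print(fill_gap_bps(100.05, 100.00) <= 8)
```

Under these readings the drag comes to about 13.6 bps a year; a different turnover convention (for example, one-way versus two-way) would roughly halve or double the trading component.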

Backtest method decision table

Stage	Method	When to use	Trust level
Screening	Vectorized, coarse costs	Reject obviously bad ideas across 100+ variants	Low — ranking only
Research	Event-driven, realistic costs, IS/OOS split	One strategy family with frozen rules on OOS	Medium
Validation	Walk-forward, purged splits, bootstrap stats	Pre-committee approval for capital	High
Pre-live	Paper trading / shadow mode	Compare live fills, latency, and corporate actions	Very high
Production	Live with kill switches	Small size first; scale on live Sharpe vs model	Ground truth

Common pitfalls

Optimizing on OOS — peeking at held-out data repeatedly until it looks good; OOS becomes IS in disguise.
Ignoring regime change — zero rates and QE distorted many factors; a 2010–2021 backtest may not generalize.
Intraday fantasy fills — assuming you always trade at the daily low when your signal fires at the close.
Leverage without margin calls — backtests that allow infinite leverage hide blow-up paths; model maintenance margin.
Reporting gross only — headline returns before costs mislead allocators and yourself.
Skipping paper trading — code bugs, timezone errors, and API quirks rarely appear until live orders hit the wire.

Production checklist

Document signal, universe, execution lag, and sizing rules before running history.
Use point-in-time data with survivorship-safe universes and adjusted prices.
Split IS/OOS (or walk-forward) and freeze parameters before evaluating OOS.
Model commissions, spread, slippage, borrow, and financing; report net returns.
Track parameter trial count; apply skepticism when many variants were tested.
Report turnover, capacity estimate, and worst drawdown including crisis years.
Compare event-driven results to vectorized screen for material divergences.
Run paper trading or shadow mode; log fill quality vs simulation.
Size live capital with risk budgets, not backtested max leverage.
Define kill switches: max drawdown, Sharpe degradation, or drift vs model.
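The drawdown-based kill switch in the last checklist item can be sketched as below. The 20% limit and the equity values are placeholders; a production version would also track Sharpe degradation and drift against the model, which are not shown here.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline as a fraction of the running peak.

    Assumes a non-empty series of positive equity values.
    """
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def should_halt(equity, drawdown_limit):
    """True once the realized drawdown reaches the configured limit."""
    return max_drawdown(equity) >= drawdown_limit


# Hypothetical live equity path checked against a 20% limit.
live_equity = [100.0, 120.0, 90.0, 110.0]
print(max_drawdown(live_equity), should_halt(live_equity, 0.20))
```

Running the same function over the backtest equity curve gives the reference drawdown that the live limit can be set against.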

Key takeaways

Backtests estimate past hypothetical performance — they do not guarantee future results.
Validation discipline separates edge from curve-fit — walk-forward and honest OOS are non-negotiable.
Biases and costs dominate small edges — survivorship, look-ahead, and turnover can erase gross alpha.
Simpler rules with economic rationale survive longer than over-parameterized indicator soup.
Paper trading is the last backtest — live frictions are the only ground truth that pays bills.