Setup
The Fundamental Problem
A backtest is a simulation of a trading strategy applied to historical data to estimate future performance. The core challenge is that the future is not the past — and research processes that iterate over historical data until a "good" strategy is found will produce results that are statistically meaningful in-sample but economically meaningless out-of-sample.
Backtesting is the most abused tool in quantitative finance. Bailey et al. (2014) estimate that the majority of published financial research results are false discoveries, driven by data mining over a finite sample. The antidote is rigorous statistical testing with explicit multiple hypothesis adjustment.
Conventions throughout. Returns are continuously compounded unless stated otherwise. Daily data assume 252 trading days per year. The Sharpe ratio is computed as the annualised mean excess return divided by the annualised standard deviation of returns. All statistics are defined for excess returns (above cash).
Theory
1. The Sharpe Ratio: Distribution and Inference
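The Sharpe ratio of a strategy with daily excess returns $r_1, \dots, r_T$ is:
$$\widehat{SR} = \frac{\hat\mu}{\hat\sigma}, \qquad \hat\mu = \frac{1}{T}\sum_{t=1}^{T} r_t, \qquad \hat\sigma^2 = \frac{1}{T-1}\sum_{t=1}^{T}(r_t - \hat\mu)^2$$
Annualised: multiply by $\sqrt{252}$ (daily frequency).
Asymptotic distribution. Under IID returns with finite kurtosis and zero skewness (Lo 2002):
$$\sqrt{T}\,\bigl(\widehat{SR} - SR\bigr) \xrightarrow{d} \mathcal{N}\!\left(0,\; 1 + \tfrac{1}{2}SR^2 + \tfrac{\kappa}{4}SR^2\right)$$
where $\kappa$ is the excess kurtosis of returns. For $\kappa = 0$ (Gaussian returns):
$$\sqrt{T}\,\bigl(\widehat{SR} - SR\bigr) \xrightarrow{d} \mathcal{N}\!\left(0,\; 1 + \tfrac{1}{2}SR^2\right)$$
A t-statistic for testing $H_0: SR = 0$ against $H_1: SR > 0$:
$$t = \frac{\widehat{SR}\,\sqrt{T}}{\sqrt{1 + \tfrac{1}{2}\widehat{SR}^2 + \tfrac{\kappa}{4}\widehat{SR}^2}}$$
which follows an approximately standard normal distribution under $H_0$. At daily frequency $\widehat{SR}^2$ is tiny, so $t \approx \widehat{SR}\sqrt{T}$.
Minimum track record. Solving for $T$ such that the one-sided test at level $\alpha$ rejects $H_0$ (with $SR$ measured per period, here daily):
$$T^* = \left(1 + \tfrac{1}{2}SR^2 + \tfrac{\kappa}{4}SR^2\right)\left(\frac{z_{1-\alpha}}{SR}\right)^2$$
For $SR = 1.0$ (annualised) in daily data, $SR_{\text{daily}} = 1/\sqrt{252} \approx 0.063$ and $z_{0.95} = 1.645$: this gives $T^* \approx 680$ observations, or about 2.7 years of daily data, to establish statistical significance at 5%.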
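The minimum track record calculation can be checked numerically. The sketch below is a standard-library illustration under the Gaussian, IID assumptions stated above; the function name `min_track_record_days` is hypothetical and not part of the module's implementation.

```python
from statistics import NormalDist

TRADING_DAYS = 252  # convention used throughout this document


def min_track_record_days(sr_annual: float, alpha: float = 0.05) -> float:
    """Daily observations needed to reject H0: SR = 0 (Gaussian, IID returns)."""
    sr_daily = sr_annual / TRADING_DAYS ** 0.5
    z = NormalDist().inv_cdf(1.0 - alpha)        # one-sided critical value
    variance_factor = 1.0 + 0.5 * sr_daily ** 2  # Gaussian case of the asymptotic variance
    return variance_factor * (z / sr_daily) ** 2


for sr in (0.5, 1.0, 2.0):
    days = min_track_record_days(sr)
    print(f"SR={sr:.1f}: {days:7.0f} days = {days / TRADING_DAYS:5.1f} years")
```

Because the required length scales with $1/SR^2$, halving the Sharpe ratio roughly quadruples the track record needed.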
2. Multiple Hypothesis Testing: The False Discovery Problem
Suppose a researcher tests $N$ independent strategies, each with true $SR = 0$, at significance level $\alpha$. Expected false discoveries under all-null hypotheses:
$$E[\text{false discoveries}] = N\alpha$$
For $N = 100$, $\alpha = 0.05$: expect 5 false discoveries even if every strategy has zero alpha. If only the best Sharpe is reported, selection bias inflates it systematically.
Bonferroni correction. Test each hypothesis at level $\alpha/N$. This controls the familywise error rate (FWER), the probability of any false discovery, but is overly conservative when $N$ is large.
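A small Monte Carlo experiment makes the selection effect concrete. This is a sketch on synthetic data only: it assumes IID Gaussian daily returns with zero mean and 1% daily volatility, and the function name `simulate_null_search` is hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev

TRADING_DAYS = 252


def simulate_null_search(n_strategies=100, n_days=252, alpha=0.05, seed=0):
    """Test many zero-alpha strategies; return (false discoveries, best annualised SR)."""
    rng = random.Random(seed)
    normal = NormalDist()
    false_discoveries = 0
    best_sr = float("-inf")
    for _ in range(n_strategies):
        # Zero-mean Gaussian returns, so the true Sharpe ratio is exactly 0.
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        sr_daily = mean(returns) / stdev(returns)
        t_stat = sr_daily * n_days ** 0.5       # approximately N(0, 1) under H0
        p_value = 1.0 - normal.cdf(t_stat)      # one-sided p-value
        false_discoveries += int(p_value < alpha)
        best_sr = max(best_sr, sr_daily * TRADING_DAYS ** 0.5)
    return false_discoveries, best_sr


n_false, best = simulate_null_search()
print(f"false discoveries: {n_false} of 100, best annualised SR: {best:.2f}")
```

Every "discovery" here is spurious by construction, and the best annualised Sharpe ratio found is typically well above 1 even though no strategy has any edge.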
Benjamini-Hochberg (BH) procedure. Controls the false discovery rate (FDR) — expected proportion of discoveries that are false:
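- Order p-values: $p_{(1)} \le p_{(2)} \le \dots \le p_{(N)}$.
- Find the largest $k$ such that $p_{(k)} \le \frac{k}{N}\alpha$.
- Reject all $H_{(i)}$ with $i \le k$.
BH controls FDR under independence (and under positive dependence, the PRDS condition). For correlated strategies, the Benjamini-Yekutieli (BY) procedure controls FDR under arbitrary dependence at the cost of a factor $c(N) = \sum_{i=1}^{N} 1/i \approx \ln N + \gamma$ in the threshold.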
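To compare the three corrections on one set of p-values, the following standard-library sketch counts rejections under each rule. The p-values are made up for illustration, and the helper name `rejections` is hypothetical.

```python
def rejections(p_values, alpha=0.05):
    """Count rejections under Bonferroni, BH and BY for one list of p-values."""
    n = len(p_values)
    p_sorted = sorted(p_values)
    harmonic = sum(1.0 / i for i in range(1, n + 1))  # BY penalty c(N)

    def step_up(level):
        # Largest k with p_(k) <= k/N * level; the k smallest p-values are rejected.
        k_star = 0
        for k, p in enumerate(p_sorted, start=1):
            if p <= k / n * level:
                k_star = k
        return k_star

    return {
        "bonferroni": sum(p <= alpha / n for p in p_sorted),
        "bh": step_up(alpha),
        "by": step_up(alpha / harmonic),
    }


# Illustrative p-values (not from any real backtest)
p_vals = [0.001, 0.008, 0.012, 0.019, 0.04, 0.2, 0.35, 0.5, 0.7, 0.9]
print(rejections(p_vals))  # {'bonferroni': 1, 'bh': 4, 'by': 1}
```

On this example BH rejects four hypotheses where Bonferroni and BY each reject one, which illustrates the power gained by targeting FDR rather than FWER and the price BY pays for allowing arbitrary dependence.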
3. The Deflated Sharpe Ratio
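Bailey and López de Prado (2014) propose the Deflated Sharpe Ratio (DSR) to adjust for the selection bias introduced by testing multiple strategies:
$$\mathrm{DSR} = \Phi\!\left(\frac{\bigl(\widehat{SR}^* - SR_0\bigr)\sqrt{T-1}}{\sqrt{1 - \hat\gamma_3\,\widehat{SR}^* + \frac{\kappa + 2}{4}\,\widehat{SR}^{*2}}}\right), \qquad SR_0 = \sigma_{SR}\left[(1-\gamma)\,\Phi^{-1}\!\left(1 - \frac{1}{N}\right) + \gamma\,\Phi^{-1}\!\left(1 - \frac{1}{Ne}\right)\right]$$
where Sharpe ratios are per period (not annualised), $T$ is the number of observations, and:
- $\widehat{SR}^*$ is the observed maximum Sharpe across the $N$ tried strategies,
- $SR_0$ is the expected maximum of $N$ iid standard normals (approximation), scaled by the standard deviation $\sigma_{SR}$ of Sharpe ratios across trials,
- $\bar\rho$ is the average pairwise correlation of the strategy returns, used to replace $N$ by an effective number of independent trials when strategies are correlated,
- $\hat\gamma_3$ and $\kappa$ are the skewness and excess kurtosis of the selected strategy's returns,
- $\gamma \approx 0.5772$ is the Euler-Mascheroni constant.
$\mathrm{DSR} > 0.95$ means the observed Sharpe, after deflating for the search over $N$ strategies, remains statistically significant at the 5% level.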
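Minimum Backtest Length. The expected maximum Sharpe ratio over $N$ trials grows as $\sqrt{2\ln N}$. A backtest of length $T$ (in years) with $N$ strategies tried produces a statistically robust conclusion only if the annualised Sharpe ratio $\widehat{SR}$ exceeds what the search alone would be expected to produce:
$$T > \frac{2\ln N}{\widehat{SR}^2}$$
For example, with $N = 100$ and $\widehat{SR} = 1.0$: a minimum of about $2\ln(100) \times 252 \approx 2{,}321$ daily observations ($\approx 9.2$ years).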
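The growth of the expected maximum with the number of trials, and the backtest length it implies, can be tabulated with a few lines of standard-library Python. This is a sketch under two assumptions: the expected maximum uses the Euler-Mascheroni approximation quoted for the DSR, and the length rule is the simple bound $T > 2\ln N / SR^2$ in years with an annualised Sharpe ratio. Function names are hypothetical.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649


def expected_max_z(n_trials: int) -> float:
    """Approximate expected maximum of n_trials iid standard normals (n_trials >= 2)."""
    nd = NormalDist()
    return ((1.0 - EULER_GAMMA) * nd.inv_cdf(1.0 - 1.0 / n_trials)
            + EULER_GAMMA * nd.inv_cdf(1.0 - 1.0 / (n_trials * math.e)))


def min_backtest_years(n_trials: int, sr_annual: float) -> float:
    """Years of data needed under the bound T > 2 ln(N) / SR^2 (assumed rule)."""
    return 2.0 * math.log(n_trials) / sr_annual ** 2


for n in (10, 100, 1000):
    upper = math.sqrt(2.0 * math.log(n))
    print(f"N={n:5d}  E[max]~{expected_max_z(n):.2f}  sqrt(2 ln N)={upper:.2f}  "
          f"min years at SR=1: {min_backtest_years(n, 1.0):.1f}")
```

The $\sqrt{2\ln N}$ term is an upper bound on the expected maximum, so the length rule is conservative relative to the finer approximation.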
4. Walk-Forward Backtest Design
A properly structured backtest:
- Split data before touching it. Designate an out-of-sample (OOS) holdout of at least 20–30% of history. The model is never fitted on this period.
- Training/validation split. Use purged walk-forward CV (see ML Signal Generation module) on the in-sample period to select hyperparameters.
- Single test on holdout. Run the finalised strategy once on the OOS period and report those results.
- Report honestly. Report: number of strategies tried, number discarded, all parameter combinations explored. Failure to disclose is p-hacking.
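The splitting steps above can be sketched as index ranges. This is a simplified illustration, not the purged cross-validation of the ML Signal Generation module: it reserves the final block as the holdout and uses a fixed gap of `purge` observations between training and validation as a stand-in for purging. The function name and default values are assumptions.

```python
def purged_walk_forward_splits(n_obs, holdout_frac=0.25, n_folds=4, purge=5):
    """Return ([(train_range, validation_range), ...], holdout_range)."""
    in_sample_end = int(n_obs * (1.0 - holdout_frac))  # holdout starts here, never used for fitting
    fold_size = in_sample_end // (n_folds + 1)
    splits = []
    for i in range(1, n_folds + 1):
        train_end = i * fold_size
        val_start = train_end + purge  # skip `purge` observations after the training window
        val_end = min((i + 1) * fold_size, in_sample_end)
        if val_start < val_end:
            splits.append((range(0, train_end), range(val_start, val_end)))
    holdout = range(in_sample_end, n_obs)
    return splits, holdout


splits, holdout = purged_walk_forward_splits(1000)
for train, val in splits:
    print(f"train [0, {train.stop})  validate [{val.start}, {val.stop})")
print(f"holdout [{holdout.start}, {holdout.stop}) is evaluated once, at the end")
```

Hyperparameters are chosen using only the `(train, validation)` pairs; the holdout range is touched a single time with the finalised strategy.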
Implementation
"""
Backtesting statistics: Sharpe inference, MHT correction, DSR.

Assumptions:
- All inputs are daily excess returns (return minus cash rate)
- Returns in decimal form (not percent)
- 252 trading days per year
"""
from __future__ import annotations

from typing import Sequence

import numpy as np
import pandas as pd
from scipy import stats

TRADING_DAYS = 252


def annualised_sharpe(returns: pd.Series, ddof: int = 1) -> float:
    """Annualised Sharpe ratio from daily excess returns."""
    mu = returns.mean()
    sigma = returns.std(ddof=ddof)
    if sigma == 0:
        # Constant returns: the ratio is undefined, so report 0 by convention.
        return 0.0
    return float((mu / sigma) * np.sqrt(TRADING_DAYS))


def sharpe_tstat(returns: pd.Series) -> tuple[float, float]:
    """
    t-statistic and one-sided p-value for H0: SR = 0.

    Uses the Lo (2002) asymptotic variance with an excess-kurtosis adjustment.
    The skewness term is omitted, i.e. returns are assumed roughly symmetric.
    Returns (t_stat, p_value).
    """
    sr_daily = returns.mean() / returns.std(ddof=1)
    T = len(returns)
    kurtosis_excess = stats.kurtosis(returns, bias=False)  # Fisher definition (excess)
    # Var(SR_hat) ~ (1 + SR^2/2 + kappa * SR^2/4) / (T - 1); kappa = 0 for Gaussian returns
    variance_sr = (1.0 + (0.5 + 0.25 * kurtosis_excess) * sr_daily**2) / (T - 1)
    t = sr_daily / np.sqrt(variance_sr)
    p = stats.norm.sf(t)  # one-sided upper-tail p-value
    return float(t), float(p)


def minimum_track_record_length(
    sharpe_annualised: float,
    alpha: float = 0.05,
    skewness: float = 0.0,
    excess_kurtosis: float = 0.0,
) -> float:
    """
    Minimum number of daily observations needed to reject H0: SR=0
    at one-sided level alpha, accounting for non-normal return distribution.
    """
    if sharpe_annualised <= 0:
        raise ValueError("sharpe_annualised must be positive")
    sr_daily = sharpe_annualised / np.sqrt(TRADING_DAYS)
    z = stats.norm.ppf(1 - alpha)
    # Asymptotic variance factor of the SR estimator with skewness and kurtosis terms:
    # 1 - skew * SR + (excess_kurtosis + 2) / 4 * SR^2, which is 1 + SR^2 / 2 when Gaussian
    var_factor = (
        1.0
        - skewness * sr_daily
        + 0.25 * (excess_kurtosis + 2.0) * sr_daily**2
    )
    T_star = var_factor * (z / sr_daily) ** 2
    return float(T_star)


def benjamini_hochberg(
    p_values: Sequence[float],
    alpha: float = 0.05,
) -> tuple[np.ndarray, float]:
    """
    Benjamini-Hochberg FDR correction.

    Returns (reject: bool array, adjusted threshold).
    Rejects H_{(k)} for all k <= k_star where p_{(k)} <= k/N * alpha.
    """
    p = np.asarray(p_values, dtype=float)
    N = len(p)
    order = np.argsort(p)
    p_sorted = p[order]
    threshold = np.arange(1, N + 1) * alpha / N
    reject_sorted = np.zeros(N, dtype=bool)
    # Step-up rule: scan from the largest p-value down to find the last one under its threshold
    k_star = -1
    for k in range(N - 1, -1, -1):
        if p_sorted[k] <= threshold[k]:
            k_star = k
            break
    if k_star >= 0:
        reject_sorted[: k_star + 1] = True
    # Map decisions back to the original ordering of p_values
    reject = np.zeros(N, dtype=bool)
    reject[order] = reject_sorted
    return reject, float(threshold[k_star]) if k_star >= 0 else 0.0


def expected_max_sharpe(n_strategies: int, mean_sr: float = 0.0, std_sr: float = 1.0) -> float:
    """
    Expected maximum of n_strategies iid N(mean_sr, std_sr^2) draws.
    Uses the approximation from Bailey & Lopez de Prado (2014).
    """
    # E[max of n iid standard normals]
    #   ~ (1 - gamma_E) * Phi^{-1}(1 - 1/n) + gamma_E * Phi^{-1}(1 - 1/(n*e))
    euler_mascheroni = 0.5772156649
    if n_strategies <= 1:
        return mean_sr
    z1 = stats.norm.ppf(1.0 - 1.0 / n_strategies)
    z2 = stats.norm.ppf(1.0 - 1.0 / (n_strategies * np.e))
    e_max = (1.0 - euler_mascheroni) * z1 + euler_mascheroni * z2
    return float(mean_sr + std_sr * e_max)


def deflated_sharpe_ratio(
    returns: pd.Series,
    n_strategies_tried: int,
    mean_sr_tried: float = 0.0,
    std_sr_tried: float = 1.0,
) -> float:
    """
    Deflated Sharpe Ratio (Bailey & Lopez de Prado 2014).

    Returns the probability that the observed SR is not a false discovery
    given n_strategies_tried were tested.

    Parameters
    ----------
    returns: daily excess returns of the selected strategy
    n_strategies_tried: total number of strategies/parameter sets evaluated
    mean_sr_tried: mean annualised SR across tried strategies (default 0)
    std_sr_tried: std of annualised SRs across tried strategies (default 1)
    """
    T = len(returns)
    sr_star = annualised_sharpe(returns)
    sr_star_daily = sr_star / np.sqrt(TRADING_DAYS)
    skew = stats.skew(returns, bias=False)
    kurt_excess = stats.kurtosis(returns, bias=False)  # excess (raw kurtosis minus 3)
    # Expected maximum SR given the search, converted to daily units
    e_max = expected_max_sharpe(n_strategies_tried, mean_sr_tried, std_sr_tried)
    e_max_daily = e_max / np.sqrt(TRADING_DAYS)
    # Variance of the SR estimator: the kurtosis term is (raw kurtosis - 1) / 4,
    # i.e. (excess kurtosis + 2) / 4, so Gaussian returns give 1 + SR^2 / 2.
    var_numerator = (
        1.0
        - skew * sr_star_daily
        + ((kurt_excess + 2.0) / 4.0) * sr_star_daily**2
    )
    var_sr = var_numerator / (T - 1)
    if var_sr <= 0:
        return 0.0
    z = (sr_star_daily - e_max_daily) / np.sqrt(var_sr)
    return float(stats.norm.cdf(z))


def walk_forward_backtest(
    strategy_returns_func,  # callable(train_data, test_data) -> pd.Series
    prices: pd.DataFrame,  # daily prices, columns = assets
    train_window: int = 756,  # ~3 years
    test_window: int = 252,  # ~1 year
) -> pd.Series:
    """
    Rolling walk-forward backtest.
    Returns concatenated out-of-sample daily strategy returns.
    """
    dates = prices.index
    T = len(dates)
    all_returns = []
    start = train_window
    while start + test_window <= T:
        # Fit on the trailing window only; evaluate on the block that follows it
        train_data = prices.iloc[start - train_window : start]
        test_data = prices.iloc[start : start + test_window]
        oos_returns = strategy_returns_func(train_data, test_data)
        all_returns.append(oos_returns)
        start += test_window
    return pd.concat(all_returns) if all_returns else pd.Series(dtype=float)
Validation
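Sharpe t-statistic. Manually: for 5 years of daily data ($T = 1260$) with annualised SR = 1.0, the daily SR is $1/\sqrt{252} \approx 0.063$. The t-statistic is $0.063 \times \sqrt{1260} \approx 2.24$, p-value $\approx 0.013$. Minimum track record for SR = 1.0 at $\alpha = 0.05$: $T^* \approx (1.645/0.063)^2 \approx 682$ days, approximately 2.7 years.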
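BH correction. For 20 strategies with p-values uniformly drawn from [0, 1] (all null), the expected number of rejections at $\alpha = 0.05$ with uncorrected testing is $20 \times 0.05 = 1.0$. BH instead controls FDR at ≤ 5% (under the global null every rejection is false, so this bounds the probability of making any rejection) without the conservatism of Bonferroni when some strategies are genuinely profitable.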
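DSR. For $N = 100$ strategies tried, expected max SR (iid standard normal) ≈ 2.53 (in standardised units). A backtest reporting SR = 3.0 over 1 year after searching 100 strategies has DSR ≈ 0.68 (assuming Gaussian returns and trial Sharpe ratios with an annualised standard deviation of 1.0), well below the 0.95 threshold and therefore not significant.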
Limitations
Look-Ahead Bias Eliminates Result Validity
Even a single look-ahead instance — using end-of-day prices to make decisions that would have required intraday data, or rebalancing at prices that assumed perfect execution — renders a backtest invalid. Common sources:
- Treating announcement-day returns as tradeable when the announcement came after market close.
- Using index weights from a later rebalancing date when constructing a historical index simulation.
- Computing position sizes from realised vol of the full period rather than an expanding window.
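The last item can be made concrete. The sketch below sizes positions from an expanding window that ends strictly before the day being traded; the function name `vol_target_weights`, the 1% daily volatility target, and the 20-observation warm-up are all assumptions for illustration.

```python
from statistics import stdev


def vol_target_weights(returns, target_vol=0.01, min_obs=20):
    """Position weights using only returns observed strictly before each day."""
    weights = []
    for t in range(len(returns)):
        history = returns[:t]          # excludes returns[t]: no look-ahead
        if len(history) < min_obs:
            weights.append(0.0)        # not enough history yet: stay flat
            continue
        vol = stdev(history)
        weights.append(target_vol / vol if vol > 0 else 0.0)
    return weights


# Synthetic alternating returns, for illustration only
demo_returns = [0.01 * (-1) ** i for i in range(30)]
print([round(w, 3) for w in vol_target_weights(demo_returns)[18:23]])
```

A useful sanity check for any signal is the one implied here: changing a return on day `t` must leave every weight up to and including day `t` unchanged.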
Overfitting via Strategy Space Search
Every parameter choice, universe filter, lookback selection, and signal weighting is a degree of freedom. A strategy with 10 binary decisions has $2^{10} = 1{,}024$ possible variants, all of which were implicitly "tried" if the researcher inspected the data before choosing. Harvey and Liu (2015) suggest that the minimum t-statistic for a new factor to be credible, accounting for the number of factors already published, should be at least 3.0, not the conventional 2.0 used for a single test.
Transaction Costs and Market Impact
A backtest that ignores transaction costs will overstate performance. Typical execution frictions:
- Commission: 1–5 bps per side for institutional equity.
- Market impact: for a position of $Q$ shares in a stock with average daily volume $V$ (ADV), impact in Almgren-Chriss style square-root models scales roughly as $\sqrt{Q/V}$ in volatility units, which is non-trivial for large positions.
- Bid-ask spread: 1–3 bps in large-cap equities; 10–50 bps in small-cap.
Strategies with high turnover (> 50% monthly) are often unviable after costs even if the raw information coefficient (IC) is positive.
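A back-of-the-envelope cost calculation shows why turnover matters. The sketch assumes that monthly turnover is the fraction of the portfolio replaced each month, so each unit of turnover pays the per-side cost twice (one sale, one purchase); the 10 bps per-side cost, 6% gross return and 10% volatility are illustrative numbers, and both function names are hypothetical.

```python
def annual_cost_drag(monthly_turnover: float, cost_bps_per_side: float) -> float:
    """Annual return lost to trading costs, as a decimal.

    Assumes each unit of turnover is one sale plus one purchase,
    so it incurs the per-side cost twice.
    """
    round_trip_cost = 2.0 * cost_bps_per_side / 1e4
    return 12.0 * monthly_turnover * round_trip_cost


def net_sharpe(gross_return: float, vol: float, drag: float) -> float:
    """Annualised Sharpe ratio after subtracting the cost drag."""
    return (gross_return - drag) / vol


drag = annual_cost_drag(monthly_turnover=0.5, cost_bps_per_side=10.0)
print(f"cost drag: {drag:.2%} per year")
print(f"gross SR: {net_sharpe(0.06, 0.10, 0.0):.2f}, "
      f"net SR: {net_sharpe(0.06, 0.10, drag):.2f}")
```

With these inputs the drag is 1.2% per year, which cuts a gross Sharpe ratio of 0.60 to 0.48 before any market impact is considered.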
The Stationarity of Relationships
A backtest spanning 20 years implicitly assumes that the same relationship held throughout. Market microstructure has changed dramatically since the 1990s (decimalization, algorithmic trading, HFT). A signal that worked 1995–2005 may not work at all post-2010. Use time-segmented performance attribution to check whether performance is evenly distributed across sub-periods or concentrated in a specific regime.
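A minimal form of time-segmented attribution is to compute the Sharpe ratio separately on consecutive blocks of the return history. The sketch below uses equal-length blocks and synthetic returns; the function name `subperiod_sharpes` is hypothetical, and any remainder after the last full block is ignored.

```python
from statistics import mean, stdev


def subperiod_sharpes(returns, n_segments=4, periods_per_year=252):
    """Annualised Sharpe ratio of each of n_segments consecutive equal-length blocks."""
    seg_len = len(returns) // n_segments  # trailing remainder is dropped
    sharpes = []
    for i in range(n_segments):
        segment = returns[i * seg_len:(i + 1) * seg_len]
        sigma = stdev(segment)
        sr = mean(segment) / sigma * periods_per_year ** 0.5 if sigma > 0 else 0.0
        sharpes.append(sr)
    return sharpes


# Synthetic example: an edge that exists only in the first half of the sample
early = [0.002 if i % 2 == 0 else 0.0 for i in range(100)]
late = [0.001 if i % 2 == 0 else -0.001 for i in range(100)]
print([round(s, 2) for s in subperiod_sharpes(early + late, n_segments=2)])
```

A full-sample Sharpe ratio would average the two halves and hide the fact that the signal stopped working; the per-segment view exposes it.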
Interview Angle
L1. What is the Sharpe ratio? How do you compute a t-statistic to test whether SR > 0? How many years of daily returns do you need to establish a Sharpe ratio of 1.0 as statistically significant at 5%?
L2. Explain the multiple hypothesis testing problem in backtesting. How does the Benjamini-Hochberg procedure differ from Bonferroni? A researcher reports trying 50 strategies and selecting the best with SR = 2.0 over 3 years. Is this result credible? (Expected max of 50 iid SR draws from $\mathcal{N}(0,1)$: $\approx 2.3$ in standardised units. Over 3 years the null standard deviation of an annualised SR estimate is about $1/\sqrt{3} \approx 0.58$, so the expected best-of-50 SR is about 1.3, and an observed 2.0 is not decisive evidence against the null.)
L3. Define the Deflated Sharpe Ratio. What inputs does it require beyond the observed Sharpe, and why? Design a rigorous backtesting protocol for a cross-sectional equity signal: specify how you would split the data, how you would select hyperparameters, and what statistics you would report. How would you estimate the effective number of independent strategies tried when strategies share features (correlated search space)?
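Correlated search space. When strategies have average pairwise return correlation $\bar\rho$, the effective number of independent tests is smaller than $N$. A simple approximation interpolates linearly between the two extremes: $N_{\text{eff}} \approx \bar\rho + (1 - \bar\rho)\,N$, which gives $N$ at $\bar\rho = 0$ and 1 at $\bar\rho = 1$. High correlation among tested strategies (e.g., variants of the same momentum signal) reduces the severity of the multiple testing problem but also reduces the diversity of the search. Report both $N$ (total trials) and $N_{\text{eff}}$ for full transparency.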
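One way to put a number on this is to estimate the average pairwise correlation from the trial return series and shrink the trial count accordingly. The sketch below assumes a linear interpolation between $N$ (uncorrelated trials) and 1 (perfectly correlated trials); that interpolation, the clipping of negative averages to zero, and all function names are assumptions for illustration rather than a definitive estimator.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def average_pairwise_correlation(strategy_returns):
    """Mean correlation over all distinct pairs of strategy return series."""
    return mean(pearson(x, y) for x, y in combinations(strategy_returns, 2))


def effective_trials(n_trials, avg_corr):
    """Linear interpolation between N (rho = 0) and 1 (rho = 1); an assumed rule."""
    rho = min(max(avg_corr, 0.0), 1.0)  # negative averages are treated as 0
    return rho + (1.0 - rho) * n_trials


# Toy data: two identical strategies plus one uncorrelated with them
x = [1.0, -1.0, 1.0, -1.0]
y = [1.0, 1.0, -1.0, -1.0]
rho_bar = average_pairwise_correlation([x, x, y])
print(f"average correlation: {rho_bar:.3f}, "
      f"effective trials: {effective_trials(3, rho_bar):.2f}")
```

The effective count, rather than the raw count, is then the natural input for the number of trials in the expected-maximum and DSR calculations.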