Backtesting and Statistical Testing

Setup

The Fundamental Problem

A backtest is a simulation of a trading strategy applied to historical data to estimate future performance. The core challenge is that the future is not the past — and research processes that iterate over historical data until a "good" strategy is found will produce results that are statistically meaningful in-sample but economically meaningless out-of-sample.

Backtesting is the most abused tool in quantitative finance. Bailey et al. (2014) argue that backtest overfitting, driven by data mining over a finite sample, makes many published backtest results likely false discoveries. The antidote is rigorous statistical testing with explicit adjustment for multiple hypotheses.

Conventions throughout. Returns are continuously compounded unless stated otherwise. Daily data assume 252 trading days per year. The Sharpe ratio is the annualised mean excess return divided by the annualised standard deviation of excess returns. All statistics are defined on excess returns (above cash).

Theory

1. The Sharpe Ratio: Distribution and Inference

The Sharpe ratio of a strategy with daily excess returns $(r_1, \ldots, r_T)$ is:

$\widehat{\text{SR}} = \frac{\bar{r}}{\hat{\sigma}}, \qquad \bar{r} = \frac{1}{T}\sum_{t=1}^T r_t, \qquad \hat{\sigma}^2 = \frac{1}{T-1}\sum_{t=1}^T (r_t - \bar{r})^2.$

Annualised: multiply by $\sqrt{252}$ (daily frequency).

Asymptotic distribution. Under $\text{iid}$ returns with finite fourth moment (Lo 2002, with the non-Gaussian correction of Mertens 2002):

$\sqrt{T}\left(\widehat{\text{SR}} - \text{SR}\right) \xrightarrow{d} \mathcal{N}\!\left(0,\; 1 - \gamma_3\,\text{SR} + \frac{\gamma_4 - 1}{4}\,\text{SR}^2\right),$

where $\gamma_3$ is the skewness and $\gamma_4$ the kurtosis (not excess) of returns. For Gaussian returns ($\gamma_3 = 0$, $\gamma_4 = 3$):

$\widehat{\text{SR}} \approx \mathcal{N}\!\left(\text{SR},\; \frac{1 + \text{SR}^2/2}{T}\right).$

A t-statistic for testing $H_0: \text{SR} = 0$ against $H_A: \text{SR} > 0$ :

$t = \widehat{\text{SR}} \cdot \sqrt{T} / \sqrt{1 + \widehat{\text{SR}}^2/2},$

which is approximately standard normal under $H_0$. Here $\widehat{\text{SR}}$ is the per-period (non-annualised) Sharpe ratio and $T$ is the number of observations.
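The estimator and test statistic above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `sharpe_stats` and the toy return series are assumptions made for the example, and the t-statistic uses the Gaussian-case variance $1 + \widehat{\text{SR}}^2/2$ with a per-period Sharpe ratio.

```python
import math
from statistics import NormalDist, mean, stdev

def sharpe_stats(excess_returns, periods_per_year=252):
    """Per-period and annualised Sharpe, plus the Gaussian-case t-statistic."""
    T = len(excess_returns)
    # stdev() uses the T-1 divisor, matching the sample variance in the text.
    sr = mean(excess_returns) / stdev(excess_returns)
    t_stat = sr * math.sqrt(T) / math.sqrt(1.0 + sr**2 / 2.0)
    p_value = 1.0 - NormalDist().cdf(t_stat)  # one-sided, H_A: SR > 0
    return {
        "sr": sr,
        "sr_annual": sr * math.sqrt(periods_per_year),
        "t": t_stat,
        "p": p_value,
    }

# Hypothetical daily excess returns, for illustration only.
toy_returns = [0.010, -0.005, 0.007, 0.002, -0.001, 0.004]
print(sharpe_stats(toy_returns))
```

Note that the annualised figure is only for reporting; the test statistic is built from the per-period Sharpe and the number of observations.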

Minimum track record. Solving for $T$ such that the one-sided test at level $\alpha$ rejects $H_0$ :

$T^* = \left(z_\alpha \cdot \frac{\sqrt{1 + \widehat{\text{SR}}^2/2}}{\widehat{\text{SR}}}\right)^2.$

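Here $\widehat{\text{SR}}$ is the per-period Sharpe and $T^*$ is measured in periods. For an annualised $\widehat{\text{SR}} = 1$ in daily data, the daily Sharpe is $1/\sqrt{252} \approx 0.063$; with $z_{0.05} = 1.645$ this gives $T^* \approx 683$ daily observations, or about 2.7 years, to establish statistical significance at the 5% level.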
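As a quick check of the minimum track record formula, the sketch below evaluates $T^*$ after converting an annualised Sharpe to a per-period one. The function name `min_track_record` and its parameters are illustrative assumptions; applying the formula in per-period units is itself an assumption about how the test statistic is defined.

```python
import math
from statistics import NormalDist

def min_track_record(sr_annual, alpha=0.05, periods_per_year=252):
    """Observations T* needed for the one-sided test to reject SR = 0."""
    sr = sr_annual / math.sqrt(periods_per_year)   # per-period Sharpe
    z = NormalDist().inv_cdf(1.0 - alpha)          # upper-alpha normal quantile
    return z**2 * (1.0 + sr**2 / 2.0) / sr**2

t_star = min_track_record(1.0)
print(f"{t_star:.0f} daily observations, {t_star / 252:.2f} years")
```

Because $T^*$ scales with $1/\widehat{\text{SR}}^2$, halving the Sharpe ratio roughly quadruples the required history.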

2. Multiple Hypothesis Testing: The False Discovery Problem

Suppose a researcher tests $N$ independent strategies, each with $H_0: \text{SR}_k = 0$ , at significance level $\alpha = 0.05$ . Expected false discoveries under all-null hypotheses:

$\mathbb{E}[\text{false discoveries}] = N \cdot \alpha.$

For $N = 100$ : expect 5 false discoveries even if every strategy has zero alpha. If only the best Sharpe is reported, selection bias inflates it systematically.
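A small Monte Carlo experiment makes the selection effect concrete. The sketch below is an illustration under assumed settings (100 zero-alpha strategies, 250 days each, Gaussian daily returns with 1% volatility, 20 repetitions); the function names are hypothetical. It counts how many strategies pass the one-sided 5% test and records the best annualised Sharpe in each batch.

```python
import math
import random

def sharpe(r):
    """Per-period Sharpe with the T-1 variance divisor."""
    n = len(r)
    m = math.fsum(r) / n
    var = math.fsum((x - m) ** 2 for x in r) / (n - 1)
    return m / math.sqrt(var)

def simulate_null_strategies(n_strategies=100, n_days=250, z_crit=1.645, seed=0):
    """Simulate zero-alpha strategies; return (false discoveries, best annualised Sharpe)."""
    rng = random.Random(seed)
    false_discoveries, best_sr = 0, -math.inf
    for _ in range(n_strategies):
        r = [rng.gauss(0.0, 0.01) for _ in range(n_days)]  # true Sharpe is zero
        sr = sharpe(r)
        t_stat = sr * math.sqrt(n_days) / math.sqrt(1.0 + sr**2 / 2.0)
        false_discoveries += int(t_stat > z_crit)
        best_sr = max(best_sr, sr * math.sqrt(252))
    return false_discoveries, best_sr

runs = [simulate_null_strategies(seed=s) for s in range(20)]
avg_false = sum(fd for fd, _ in runs) / len(runs)
avg_best = sum(best for _, best in runs) / len(runs)
print(f"average false discoveries: {avg_false:.1f}")
print(f"average best annualised Sharpe: {avg_best:.2f}")
```

With one year of data per strategy, the best of 100 worthless strategies will usually show an annualised Sharpe well above 2, which is why reporting only the winner is misleading.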

Bonferroni correction. Test each hypothesis at level $\alpha / N$ . Controls the familywise error rate (FWER) — probability of any false discovery. Overly conservative when $N$ is large.

Benjamini-Hochberg (BH) procedure. Controls the false discovery rate (FDR) — expected proportion of discoveries that are false:

Order p-values: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(N)}$ .
Find the largest $k$ such that $p_{(k)} \leq \frac{k}{N} \alpha$ .
Reject all $H_{(1)}, \ldots, H_{(k)}$ .

BH controls FDR at level $\alpha$ under independence and under positive dependence (the PRDS condition). For strategies with arbitrary dependence, the Benjamini-Yekutieli (BY) procedure still controls FDR, at the cost of dividing the threshold by $\sum_{i=1}^{N} 1/i \approx \ln N + 0.577$.
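The three corrections can be compared on a small set of p-values. The following is a minimal sketch in plain Python; the function names and the example p-values are made up for illustration. The BY variant divides the BH threshold by the harmonic sum $c(N) = \sum_{i=1}^{N} 1/i$.

```python
def bonferroni(p_values, alpha=0.05):
    """Indices rejected when each hypothesis is tested at level alpha / N."""
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / n]

def benjamini_hochberg(p_values, alpha=0.05, arbitrary_dependence=False):
    """Indices rejected by BH, or by BY when arbitrary_dependence is True."""
    n = len(p_values)
    # BY correction factor: harmonic sum 1 + 1/2 + ... + 1/N.
    c = sum(1.0 / i for i in range(1, n + 1)) if arbitrary_dependence else 1.0
    order = sorted(range(n), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (n * c):
            k_max = rank  # keep the largest rank that passes
    return sorted(order[:k_max])

p = [0.6, 0.012, 0.001, 0.019, 0.014]  # hypothetical p-values
print("Bonferroni:", bonferroni(p))
print("BH:        ", benjamini_hochberg(p))
print("BY:        ", benjamini_hochberg(p, arbitrary_dependence=True))
```

On this input BH rejects four hypotheses while Bonferroni and BY reject only the smallest p-value, which illustrates how much power is given up for stronger guarantees.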

3. The Deflated Sharpe Ratio

Bailey and López de Prado (2014) propose the Deflated Sharpe Ratio (DSR) to adjust for the selection bias introduced by testing multiple strategies:

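$\text{DSR}(\text{SR}^*) = \Phi\!\left(\frac{(\text{SR}^* - \mathbb{E}[\text{SR}^{\max}]) \cdot \sqrt{T - 1}}{\sqrt{1 - \gamma_3\,\text{SR}^* + \frac{\gamma_4 - 1}{4}\,\text{SR}^{*2}}}\right),$

where:

$\text{SR}^*$ is the observed maximum (per-period) Sharpe across $N$ tried strategies,
$\mathbb{E}[\text{SR}^{\max}] \approx \sqrt{\mathbb{V}[\widehat{\text{SR}}_n]}\left((1 - \gamma_E)\Phi^{-1}(1 - 1/N) + \gamma_E\,\Phi^{-1}(1 - 1/(Ne))\right)$ is the approximate expected maximum Sharpe across $N$ independent zero-skill trials, with $\mathbb{V}[\widehat{\text{SR}}_n]$ the variance of the estimated Sharpe ratios across trials,
$\gamma_3$ and $\gamma_4$ are the skewness and kurtosis (not excess) of the selected strategy's returns,
$N$ counts independent trials; when the strategies are correlated (average pairwise correlation $\hat{\rho} > 0$), the effective number of independent trials is smaller than the number tried,
$\gamma_E \approx 0.5772$ is the Euler-Mascheroni constant.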

DSR $> 0.95$ means the observed Sharpe, after deflating for search over $N$ strategies, remains statistically significant.
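To make the deflation step concrete, here is a hedged sketch of a DSR calculation. It follows the form in Bailey and López de Prado (2014) with skewness and non-excess kurtosis in the denominator and the cross-trial standard deviation of Sharpe estimates scaling the expected maximum. The function names, argument names, and all input values are assumptions for illustration, and every Sharpe quantity is per period.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649

def expected_max_sharpe(n_trials, sr_std):
    """Approximate expected maximum Sharpe across n_trials > 1 zero-skill trials."""
    nd = NormalDist()
    return sr_std * (
        (1 - EULER_GAMMA) * nd.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * nd.inv_cdf(1 - 1 / (n_trials * math.e))
    )

def deflated_sharpe(sr_star, n_obs, n_trials, sr_std, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds the selection-adjusted benchmark."""
    sr0 = expected_max_sharpe(n_trials, sr_std)
    denom = math.sqrt(1 - skew * sr_star + (kurt - 1) / 4 * sr_star**2)
    return NormalDist().cdf((sr_star - sr0) * math.sqrt(n_obs - 1) / denom)

# Hypothetical inputs: annualised Sharpe 1.5, five years of daily data,
# 50 trials whose Sharpe estimates have standard deviation 1/sqrt(T).
n_obs = 1260
dsr = deflated_sharpe(
    sr_star=1.5 / math.sqrt(252), n_obs=n_obs, n_trials=50,
    sr_std=1 / math.sqrt(n_obs),
)
print(f"DSR = {dsr:.3f}")
```

Under these assumed inputs the DSR falls below 0.95, so a Sharpe of 1.5 picked from 50 trials over five years would not count as significant after deflation.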

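Minimum Backtest Length. Under the null, the expected maximum of $N$ per-period Sharpe estimates grows roughly as $\sqrt{2 \ln N / T}$. Requiring $\text{SR}^*$ to exceed this level by $z_\alpha$ standard errors gives a conservative condition for a backtest of length $T$ with $N$ strategies tried:

$T > \left(\frac{\sqrt{2\ln N} + z_\alpha\sqrt{1 + \text{SR}^{*2}/2}}{\text{SR}^*}\right)^2,$

with $\text{SR}^*$ per period and $T$ in periods. For $N = 1$ this reduces to the minimum track record $T^*$ of Section 1.

For $N = 50$ and an annualised $\text{SR}^* = 1.5$ (daily $\approx 0.094$): minimum $T \approx 2{,}200$ daily observations ($\approx 8.8$ years).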
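The sketch below evaluates one conservative version of this requirement. It assumes the null expected maximum is bounded by $\sqrt{2 \ln N / T}$ in per-period units and asks that the observed Sharpe exceed it by $z_\alpha$ standard errors, which gives $T > \left((\sqrt{2 \ln N} + z_\alpha \sqrt{1 + \text{SR}^{*2}/2}) / \text{SR}^*\right)^2$. The function name and the choice of condition are assumptions for illustration, not a formula taken from a specific paper.

```python
import math
from statistics import NormalDist

def min_backtest_length(sr_annual, n_trials, alpha=0.05, periods_per_year=252):
    """Observations needed for the best of n_trials Sharpe ratios to clear the null maximum."""
    sr = sr_annual / math.sqrt(periods_per_year)       # per-period Sharpe
    z = NormalDist().inv_cdf(1.0 - alpha)
    null_max = math.sqrt(2.0 * math.log(n_trials))     # bound on E[max] of N standard normals
    return ((null_max + z * math.sqrt(1.0 + sr**2 / 2.0)) / sr) ** 2

for n in (1, 10, 50, 500):
    t_min = min_backtest_length(1.5, n)
    print(f"N = {n:>3}: {t_min:7.0f} observations, {t_min / 252:5.1f} years")
```

The required history grows with $\ln N$, so every additional batch of trials should be matched by more data or a higher Sharpe hurdle.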

4. Walk-Forward Backtest Design

A properly structured backtest:

Split data before touching it. Designate an out-of-sample (OOS) holdout of at least 20–30% of history. The model is never fitted on this period.
Training/validation split. Use purged walk-forward CV (see ML Signal Generation module) on the in-sample period to select hyperparameters.
Single test on holdout. Run the finalised strategy once on the OOS period and report those results.
Report honestly. Report: number of strategies tried, number discarded, all parameter combinations explored. Failure to disclose is p-hacking.
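The splitting steps above can be sketched with plain index ranges. This is a simplified illustration, not the purged walk-forward procedure from the ML Signal Generation module: the function names, the 25% holdout, the number of folds, and the 5-observation purge gap are all assumptions chosen for the example.

```python
def holdout_split(n_obs, holdout_frac=0.25):
    """Split indices chronologically into in-sample and out-of-sample ranges."""
    cut = int(n_obs * (1 - holdout_frac))
    return range(0, cut), range(cut, n_obs)

def purged_walk_forward(in_sample, n_folds=4, purge=5):
    """Yield (train, validation) ranges with an expanding training window.

    `purge` observations are dropped between train and validation so that
    labels overlapping the boundary cannot leak into validation.
    """
    fold = len(in_sample) // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = in_sample.start + k * fold
        val_start = train_end + purge
        val_end = min(in_sample.start + (k + 1) * fold, in_sample.stop)
        yield range(in_sample.start, train_end), range(val_start, val_end)

in_sample, oos = holdout_split(2520)          # ten years of daily data
for train, val in purged_walk_forward(in_sample):
    print(f"train [{train.start}, {train.stop})  validate [{val.start}, {val.stop})")
print(f"holdout [{oos.start}, {oos.stop}) is evaluated once, at the end")
```

Keeping the holdout as a separate range that the fold generator never sees makes it harder to touch the out-of-sample period by accident during model selection.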
