
Backtesting and Statistical Testing

Module 4 of 4 | 22 min read | Level: Hard

Setup

The Fundamental Problem

A backtest is a simulation of a trading strategy applied to historical data to estimate future performance. The core challenge is that the future is not the past — and research processes that iterate over historical data until a "good" strategy is found will produce results that are statistically meaningful in-sample but economically meaningless out-of-sample.

Backtesting is the most abused tool in quantitative finance. Bailey et al. (2014) estimate that the majority of published financial research results are false discoveries, driven by data mining over a finite sample. The antidote is rigorous statistical testing with explicit multiple hypothesis adjustment.

Conventions throughout. Returns are continuously compounded unless stated otherwise. Daily data assume 252 trading days per year. The Sharpe ratio is the annualised mean excess return divided by the annualised standard deviation of excess returns. All statistics are defined on excess returns (returns above cash).


Theory

1. The Sharpe Ratio: Distribution and Inference

The Sharpe ratio of a strategy with daily excess returns $(r_1, \ldots, r_T)$ is:

$$\widehat{\text{SR}} = \frac{\bar{r}}{\hat{\sigma}}, \qquad \bar{r} = \frac{1}{T}\sum_{t=1}^T r_t, \qquad \hat{\sigma}^2 = \frac{1}{T-1}\sum_{t=1}^T (r_t - \bar{r})^2.$$

Annualised: multiply by $\sqrt{252}$ (daily frequency).

Asymptotic distribution. Under iid returns with finite kurtosis and zero skewness (Lo 2002, extended to non-Gaussian kurtosis):

$$\sqrt{T}\left(\widehat{\text{SR}} - \text{SR}\right) \xrightarrow{d} \mathcal{N}\!\left(0,\; 1 + \frac{\text{SR}^2}{2} + \frac{\kappa}{4}\,\text{SR}^2\right),$$

where $\kappa$ is the excess kurtosis of returns and $\text{SR}$ is the per-period (here daily) Sharpe ratio. For $\kappa = 0$ (Gaussian returns):

$$\widehat{\text{SR}} \approx \mathcal{N}\!\left(\text{SR},\; \frac{1 + \text{SR}^2/2}{T}\right).$$

A t-statistic for testing $H_0: \text{SR} = 0$ against $H_A: \text{SR} > 0$:

$$t = \frac{\widehat{\text{SR}} \sqrt{T}}{\sqrt{1 + \widehat{\text{SR}}^2/2}},$$

which follows an approximately standard normal distribution under $H_0$.

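Minimum track record. Solving for the smallest $T$ at which the one-sided test at level $\alpha$ rejects $H_0$, with $z_{1-\alpha}$ the upper-$\alpha$ standard normal quantile:

$$T^* = \left(z_{1-\alpha} \cdot \frac{\sqrt{1 + \widehat{\text{SR}}^2/2}}{\widehat{\text{SR}}}\right)^2.$$

Here $\widehat{\text{SR}}$ is the per-period Sharpe ratio and $T^*$ counts periods. For an annualised $\widehat{\text{SR}} = 1$ in daily data, the daily Sharpe is $1/\sqrt{252} \approx 0.063$ and $z_{0.95} = 1.645$, which gives $T^* \approx 1.645^2 \times 252 \times (1 + 1/504) \approx 683$ daily observations, or about 2.7 years, to establish statistical significance at 5%.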
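The asymptotic standard error and the t-statistic can be checked by simulation. The sketch below is illustrative only: it assumes iid Gaussian daily excess returns with unit volatility, and the sample sizes (`T`, `n_paths`) and the seed are arbitrary choices rather than values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1260                        # 5 years of daily observations (assumed)
sr_daily = 1.0 / np.sqrt(252)   # true per-period Sharpe (annualised SR = 1)
n_paths = 2000                  # number of simulated track records (assumed)

# Simulate iid Gaussian daily excess returns with unit volatility.
returns = rng.normal(loc=sr_daily, scale=1.0, size=(n_paths, T))
sr_hat = returns.mean(axis=1) / returns.std(axis=1, ddof=1)

# Compare the spread of the estimates with the Gaussian asymptotic formula.
empirical_se = sr_hat.std(ddof=1)
asymptotic_se = np.sqrt((1.0 + 0.5 * sr_daily**2) / T)

# Share of paths whose t-statistic clears the one-sided 5% critical value.
t_stats = sr_hat * np.sqrt(T) / np.sqrt(1.0 + 0.5 * sr_hat**2)
power = float((t_stats > 1.645).mean())

print(f"empirical SE {empirical_se:.4f} vs asymptotic SE {asymptotic_se:.4f}")
print(f"share of paths rejecting H0 at 5%: {power:.2f}")
```

The rejection share is the power of the test at this track-record length: even with a true annualised Sharpe of 1 and five years of data, a meaningful fraction of paths fails to reject.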

2. Multiple Hypothesis Testing: The False Discovery Problem

Suppose a researcher tests $N$ independent strategies, each with $H_0: \text{SR}_k = 0$, at significance level $\alpha = 0.05$. Expected false discoveries when every null hypothesis is true:

$$\mathbb{E}[\text{false discoveries}] = N \cdot \alpha.$$

For $N = 100$: expect 5 false discoveries even if every strategy has zero alpha. If only the best Sharpe is reported, selection bias inflates it systematically.
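A small simulation makes the selection effect concrete. This is a sketch under assumed settings (100 pure-noise strategies, 5 years of Gaussian daily returns, an arbitrary seed); the exact counts will vary with the seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

N, T = 100, 1260   # strategies and daily observations (assumed values)
alpha = 0.05

# Every strategy is pure noise: the true Sharpe ratio is exactly zero.
returns = rng.normal(0.0, 0.01, size=(T, N))
sr_daily = returns.mean(axis=0) / returns.std(axis=0, ddof=1)

# One-sided test of H0: SR = 0 for each strategy, without any correction.
t_stats = sr_daily * np.sqrt(T) / np.sqrt(1.0 + 0.5 * sr_daily**2)
p_values = 1.0 - stats.norm.cdf(t_stats)

false_discoveries = int((p_values < alpha).sum())
best_annualised_sr = float(sr_daily.max() * np.sqrt(252))

print(f"false discoveries at alpha={alpha}: {false_discoveries} of {N}")
print(f"best annualised Sharpe among pure-noise strategies: {best_annualised_sr:.2f}")
```

Reporting only `best_annualised_sr` would present a clearly positive Sharpe ratio for a strategy that has none by construction.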

Bonferroni correction. Test each hypothesis at level $\alpha / N$. This controls the familywise error rate (FWER), the probability of any false discovery, but is overly conservative when $N$ is large.

Benjamini-Hochberg (BH) procedure. Controls the false discovery rate (FDR) — expected proportion of discoveries that are false:

  1. Order p-values: $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(N)}$.
  2. Find the largest $k$ such that $p_{(k)} \leq \frac{k}{N} \alpha$.
  3. Reject all $H_{(1)}, \ldots, H_{(k)}$.

BH controls FDR at level $\alpha$ under independence (and under positive dependence, the PRDS condition). For correlated strategies, the Benjamini-Yekutieli (BY) procedure controls FDR under arbitrary dependence at the cost of dividing each threshold by $\sum_{i=1}^{N} 1/i \approx \ln N$.
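The three corrections can be compared side by side on a small set of p-values. The helper name `step_up_rejections` and the ten p-values are made up for illustration; the function is a minimal sketch, not a replacement for a tested library routine.

```python
import numpy as np

def step_up_rejections(p_values, alpha=0.05, method="bh"):
    """Boolean rejection mask for 'bonferroni', 'bh' or 'by' (illustrative helper)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    if method == "bonferroni":
        return p <= alpha / n
    ranks = np.arange(1, n + 1)
    # BY divides the BH thresholds by the harmonic number c(N) = sum_{i<=N} 1/i.
    c = np.sum(1.0 / ranks) if method == "by" else 1.0
    order = np.argsort(p)
    passed = p[order] <= ranks * alpha / (n * c)
    reject = np.zeros(n, dtype=bool)
    if passed.any():
        k_star = np.nonzero(passed)[0].max()   # largest rank meeting its threshold
        reject[order[: k_star + 1]] = True     # reject everything up to that rank
    return reject

# Hypothetical p-values for ten strategies.
p_vals = [0.001, 0.008, 0.012, 0.041, 0.20, 0.35, 0.50, 0.62, 0.80, 0.95]

n_bonf = int(step_up_rejections(p_vals, method="bonferroni").sum())
n_bh = int(step_up_rejections(p_vals, method="bh").sum())
n_by = int(step_up_rejections(p_vals, method="by").sum())
print(n_bonf, n_bh, n_by)   # 1 3 1
```

On this input Bonferroni and BY each keep only the smallest p-value, while BH rejects the three smallest, which shows where BH sits between no correction and the conservative alternatives.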

3. The Deflated Sharpe Ratio

Bailey and López de Prado (2014) propose the Deflated Sharpe Ratio (DSR) to adjust for the selection bias introduced by testing multiple strategies:

$$\text{DSR}(\text{SR}^*) = \Phi\!\left(\frac{(\text{SR}^* - \mathbb{E}[\text{SR}^{\max}]) \sqrt{T - 1}}{\sqrt{1 - \hat{\gamma}_3\, \text{SR}^{*} + \frac{\hat{\gamma}_4 - 1}{4}\, \text{SR}^{*2}}}\right),$$

where:

  • $\text{SR}^*$ is the observed maximum Sharpe across $N$ tried strategies, in per-period (non-annualised) units,
  • $\mathbb{E}[\text{SR}^{\max}] \approx \sqrt{\mathbb{V}[\widehat{\text{SR}}_k]}\left((1 - \gamma_E)\Phi^{-1}(1 - 1/N) + \gamma_E\,\Phi^{-1}(1 - 1/(Ne))\right)$ is the expected maximum Sharpe under the null: the approximate expected maximum of $N$ iid standard normals, scaled by the standard deviation of the Sharpe estimates across trials,
  • $\hat{\gamma}_3$ is the skewness of the selected strategy's returns,
  • $\hat{\gamma}_4$ is the kurtosis of the selected strategy's returns ($\hat{\gamma}_4 = 3$ for Gaussian returns, so excess kurtosis is $\hat{\gamma}_4 - 3$),
  • $\gamma_E \approx 0.5772$ is the Euler-Mascheroni constant.

Correlation among the trials does not enter the formula directly; it enters through the effective number of independent trials used for $N$.

DSR $> 0.95$ means the observed Sharpe, after deflating for search over $N$ strategies, remains statistically significant.
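To see how the deflation behaves numerically, the sketch below checks the expected-maximum approximation against simulation and then evaluates a DSR from summary moments. It uses the skewness and kurtosis form of the estimator variance from Bailey and López de Prado (2014); the function names, the simulation size, and the assumption that trial Sharpe ratios have unit standard deviation in annualised terms are illustrative choices.

```python
import numpy as np
from scipy import stats

EULER_GAMMA = 0.5772156649

def expected_max_std_normal(n):
    """Approximate expected maximum of n iid standard normal draws."""
    z1 = stats.norm.ppf(1.0 - 1.0 / n)
    z2 = stats.norm.ppf(1.0 - 1.0 / (n * np.e))
    return (1.0 - EULER_GAMMA) * z1 + EULER_GAMMA * z2

def deflated_sharpe(sr, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """DSR from per-period Sharpe ratios; kurt is raw kurtosis (3 = Gaussian)."""
    var_num = 1.0 - skew * sr + 0.25 * (kurt - 1.0) * sr**2
    z = (sr - sr_benchmark) * np.sqrt(n_obs - 1) / np.sqrt(var_num)
    return float(stats.norm.cdf(z))

# Compare the approximation, a Monte Carlo estimate, and the sqrt(2 ln N) bound.
rng = np.random.default_rng(1)
check = {}
for n in (10, 50, 100):
    simulated = rng.standard_normal((20000, n)).max(axis=1).mean()
    check[n] = (expected_max_std_normal(n), simulated, np.sqrt(2.0 * np.log(n)))
    print(n, [round(float(v), 2) for v in check[n]])

# Worked case: annualised SR of 3.0 over one year, best of 100 trials whose
# annualised Sharpe ratios are assumed to have standard deviation 1.
ann = np.sqrt(252)
dsr = deflated_sharpe(3.0 / ann, expected_max_std_normal(100) / ann, n_obs=252)
print(f"DSR = {dsr:.2f}")
```

The comparison shows that $\sqrt{2 \ln N}$ sits noticeably above the simulated expected maximum for small $N$, so using it as the benchmark deflates more aggressively than the approximation does.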

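Minimum Backtest Length. The expected maximum of $N$ iid standard normal draws grows as $\sim \sqrt{2 \ln N}$. Under the null, a per-period Sharpe estimate over $T$ observations has standard error of about $1/\sqrt{T}$, so the expected best-of-$N$ Sharpe is roughly $\sqrt{2 \ln N / T}$. Requiring the observed per-period $\text{SR}^*$ to exceed that hurdle by $z_{1-\alpha}$ standard errors gives:

$$T > \left(\frac{z_{1-\alpha}\sqrt{1 + \text{SR}^{*2}/2} + \sqrt{2\ln N}}{\text{SR}^*}\right)^2.$$

For $N = 50$ and an annualised $\text{SR}^* = 1.5$ (daily $\text{SR}^* \approx 0.094$) at $\alpha = 0.05$: minimum $T \approx 2{,}200$ daily observations ($\approx 8.8$ years).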
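One way to turn this into a number is sketched below. It assumes the hurdle is the $\sqrt{2 \ln N}$ bound expressed in standard-error units and that the observed per-period Sharpe must clear it by $z_{1-\alpha}$ standard errors; the function name `min_backtest_length` and its parameters are hypothetical.

```python
import numpy as np
from scipy import stats

def min_backtest_length(sr_annualised, n_trials, alpha=0.05, periods_per_year=252):
    """Observations needed for the best-of-N Sharpe to clear the null hurdle.

    Solves sr * sqrt(T) - sqrt(2 ln N) > z * sqrt(1 + sr^2 / 2) for T,
    where sr is the per-period Sharpe ratio (assumption: Gaussian returns).
    """
    sr = sr_annualised / np.sqrt(periods_per_year)
    z = stats.norm.ppf(1.0 - alpha)
    se_factor = np.sqrt(1.0 + 0.5 * sr**2)
    hurdle = np.sqrt(2.0 * np.log(n_trials))   # bound on E[max] in standard-error units
    return float(((z * se_factor + hurdle) / sr) ** 2)

t_single = min_backtest_length(1.0, n_trials=1)    # no search: plain track-record length
t_search = min_backtest_length(1.5, n_trials=50)   # best of 50 trials

print(f"single strategy, SR=1.0: {t_single:.0f} days ({t_single / 252:.1f} years)")
print(f"best of 50, SR=1.5:      {t_search:.0f} days ({t_search / 252:.1f} years)")
```

With `n_trials=1` the hurdle vanishes and the function reduces to the single-strategy minimum track record, which is a useful sanity check on the search-adjusted figure.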

4. Walk-Forward Backtest Design

A properly structured backtest:

  1. Split data before touching it. Designate an out-of-sample (OOS) holdout of at least 20–30% of history. The model is never fitted on this period.
  2. Training/validation split. Use purged walk-forward CV (see ML Signal Generation module) on the in-sample period to select hyperparameters.
  3. Single test on holdout. Run the finalised strategy once on the OOS period and report those results.
  4. Report honestly. Report: number of strategies tried, number discarded, all parameter combinations explored. Failure to disclose is p-hacking.
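The splitting steps above can be sketched as index bookkeeping. The helper names, the 25% holdout, and the 5-day gap are assumptions for illustration; the gap is a simple embargo between training and test windows, not the full purging procedure.

```python
import pandas as pd

def split_holdout(index, holdout_frac=0.25):
    """Reserve the most recent part of the history as an untouched holdout."""
    cut = int(len(index) * (1.0 - holdout_frac))
    return index[:cut], index[cut:]

def walk_forward_windows(n_obs, train_size, test_size, embargo=5):
    """Yield (train_positions, test_positions) separated by `embargo` observations."""
    start = 0
    while start + train_size + embargo + test_size <= n_obs:
        train = range(start, start + train_size)
        test_start = start + train_size + embargo
        test = range(test_start, test_start + test_size)
        yield train, test
        start += test_size

dates = pd.bdate_range("2010-01-01", periods=2520)   # about 10 years of business days
in_sample, holdout = split_holdout(dates)

# Hyperparameters are chosen on these folds only; the holdout is never touched here.
folds = list(walk_forward_windows(len(in_sample), train_size=756, test_size=252))
for train, test in folds:
    print(in_sample[train[0]].date(), in_sample[train[-1]].date(),
          in_sample[test[0]].date(), in_sample[test[-1]].date())
```

Splitting by position before any model is fitted keeps the holdout out of reach by construction, and the embargo keeps a label that overlaps the training window from leaking into the first test observations.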

Implementation

"""
Backtesting statistics: Sharpe inference, MHT correction, DSR.

Assumptions:
- All inputs are daily excess returns (return minus cash rate)
- Returns in decimal form (not percent)
- 252 trading days per year
"""

from __future__ import annotations

import numpy as np
import pandas as pd
from scipy import stats
from typing import Sequence


TRADING_DAYS = 252


def annualised_sharpe(returns: pd.Series, ddof: int = 1) -> float:
    """Annualised Sharpe ratio from daily excess returns."""
    mu = returns.mean()
    sigma = returns.std(ddof=ddof)
    if sigma == 0:
        return 0.0
    return (mu / sigma) * np.sqrt(TRADING_DAYS)


def sharpe_tstat(returns: pd.Series) -> tuple[float, float]:
    """
    t-statistic and one-sided p-value for H0: SR = 0.
    Uses Lo (2002) asymptotic result adjusted for excess kurtosis.
    Returns (t_stat, p_value).
    """
    sr_daily = returns.mean() / returns.std(ddof=1)
    T = len(returns)
    kurtosis_excess = stats.kurtosis(returns, bias=False)   # Fisher definition (excess)
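    # Gaussian term 1 + SR^2/2 plus the excess-kurtosis adjustment (skewness assumed zero)
    variance_sr = (1.0 + 0.5 * sr_daily**2 + 0.25 * kurtosis_excess * sr_daily**2) / (T - 1)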
    t = sr_daily / np.sqrt(variance_sr)
    p = 1.0 - stats.norm.cdf(t)   # one-sided
    return float(t), float(p)


def minimum_track_record_length(
    sharpe_annualised: float,
    alpha: float = 0.05,
    skewness: float = 0.0,
    excess_kurtosis: float = 0.0,
) -> float:
    """
    Minimum number of daily observations needed to reject H0: SR=0
    at one-sided level alpha, accounting for non-normal return distribution.
    """
    sr_daily = sharpe_annualised / np.sqrt(TRADING_DAYS)
    z = stats.norm.ppf(1 - alpha)
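    # Variance factor of the SR estimator with skewness and excess-kurtosis terms;
    # reduces to the Lo (2002) value 1 + SR^2/2 for Gaussian returns
    var_factor = 1.0 - skewness * sr_daily + 0.25 * (excess_kurtosis + 2.0) * sr_daily**2
    T_star = var_factor * (z / sr_daily) ** 2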
    return float(T_star)


def benjamini_hochberg(
    p_values: Sequence[float],
    alpha: float = 0.05,
) -> tuple[np.ndarray, float]:
    """
    Benjamini-Hochberg FDR correction.
    Returns (reject: bool array, adjusted threshold).
    Rejects H_{(k)} for all k <= k_star where p_{(k)} <= k/N * alpha.
    """
    p = np.array(p_values)
    N = len(p)
    order = np.argsort(p)
    p_sorted = p[order]
    threshold = np.arange(1, N + 1) * alpha / N

    reject_sorted = np.zeros(N, dtype=bool)
    k_star = -1
    for k in range(N - 1, -1, -1):
        if p_sorted[k] <= threshold[k]:
            k_star = k
            break

    if k_star >= 0:
        reject_sorted[: k_star + 1] = True

    reject = np.zeros(N, dtype=bool)
    reject[order] = reject_sorted
    return reject, (float(threshold[k_star]) if k_star >= 0 else 0.0)


def expected_max_sharpe(n_strategies: int, mean_sr: float = 0.0, std_sr: float = 1.0) -> float:
    """
    Expected maximum of n_strategies iid N(mean_sr, std_sr^2) draws.
    Uses the approximation from Bailey & Lopez de Prado (2014).
    """
    # Expected max of n iid standard normals ~ (1 - gamma_E) * Phi^{-1}(1 - 1/n) + gamma_E * Phi^{-1}(1 - 1/(n*e))
    euler_mascheroni = 0.5772156649
    if n_strategies <= 1:
        return mean_sr
    z1 = stats.norm.ppf(1.0 - 1.0 / n_strategies)
    z2 = stats.norm.ppf(1.0 - 1.0 / (n_strategies * np.e))
    e_max = (1.0 - euler_mascheroni) * z1 + euler_mascheroni * z2
    return mean_sr + std_sr * e_max


def deflated_sharpe_ratio(
    returns: pd.Series,
    n_strategies_tried: int,
    mean_sr_tried: float = 0.0,
    std_sr_tried: float = 1.0,
) -> float:
    """
    Deflated Sharpe Ratio (Bailey & Lopez de Prado 2014).
    Returns the probability that the observed SR is not a false discovery
    given n_strategies_tried were tested.

    Parameters
    ----------
    returns:             daily excess returns of the selected strategy
    n_strategies_tried:  total number of strategies/parameter sets evaluated
    mean_sr_tried:       mean annualised SR across tried strategies (default 0)
    std_sr_tried:        std of annualised SRs across tried strategies (default 1)
    """
    T = len(returns)
    sr_star = annualised_sharpe(returns)
    sr_star_daily = sr_star / np.sqrt(TRADING_DAYS)

    skew = stats.skew(returns, bias=False)
    kurt_excess = stats.kurtosis(returns, bias=False)   # excess

    # Expected maximum SR given search
    e_max = expected_max_sharpe(n_strategies_tried, mean_sr_tried, std_sr_tried)
    e_max_daily = e_max / np.sqrt(TRADING_DAYS)

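    # Variance of the SR estimator: 1 - skew*SR + (kurtosis - 1)/4 * SR^2,
    # written with excess kurtosis (kurtosis - 3), so Gaussian returns give 1 + SR^2/2
    var_numerator = 1.0 - skew * sr_star_daily + ((kurt_excess + 2.0) / 4.0) * sr_star_daily**2
    var_sr = var_numerator / (T - 1)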

    if var_sr <= 0:
        return 0.0

    z = (sr_star_daily - e_max_daily) / np.sqrt(var_sr)
    return float(stats.norm.cdf(z))


def walk_forward_backtest(
    strategy_returns_func,        # callable(train_data, test_data) -> pd.Series
    prices: pd.DataFrame,         # daily prices, columns = assets
    train_window: int = 756,      # ~3 years
    test_window: int = 252,       # ~1 year
) -> pd.Series:
    """
    Rolling walk-forward backtest.
    Returns concatenated out-of-sample daily strategy returns.
    """
    dates = prices.index
    T = len(dates)
    all_returns = []

    start = train_window
    while start + test_window <= T:
        train_data = prices.iloc[start - train_window : start]
        test_data = prices.iloc[start : start + test_window]
        oos_returns = strategy_returns_func(train_data, test_data)
        all_returns.append(oos_returns)
        start += test_window

    return pd.concat(all_returns) if all_returns else pd.Series(dtype=float)

Validation

Sharpe t-statistic. Manually: for 5 years of daily data ($T = 1260$) with annualised SR = 1.0, the daily SR is $1.0/\sqrt{252} \approx 0.063$. The t-statistic is $\approx 0.063 \times \sqrt{1260} \approx 2.24$, p-value $\approx 0.013$. Minimum track record for SR = 1.0 at $\alpha = 0.05$: $T^* \approx 1.645^2 \times 252 \times (1 + 1/504) \approx 683$ days, approximately 2.7 years.

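BH correction. For 20 strategies with p-values uniformly drawn from [0, 1] (all null), the expected number of rejections at $\alpha = 0.05$ with uncorrected testing is $20 \times 0.05 = 1.0$. With every null true, any rejection is a false one, so FDR coincides with FWER and BH holds the probability of making at least one rejection to at most 5%. When some strategies are genuine, BH keeps more power than Bonferroni.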

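DSR. For $N = 100$ strategies tried, the expected max SR (iid standard normal) is bounded by $\sqrt{2\ln 100} \approx 3.03$ in standardised units; the finite-$N$ approximation used in expected_max_sharpe gives $\approx 2.53$. A backtest reporting annualised SR = 3.0 over 1 year after searching 100 strategies (Gaussian returns, std_sr_tried = 1) has DSR $\approx 0.68$ with the approximation and $\approx 0.5$ with the cruder bound. Either way it is well below 0.95 and not significant.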


Limitations

Look-Ahead Bias Eliminates Result Validity

Even a single look-ahead instance — using end-of-day prices to make decisions that would have required intraday data, or rebalancing at prices that assumed perfect execution — renders a backtest invalid. Common sources:

  • Treating announcement-day returns as tradeable when the announcement came after market close.
  • Using index weights from a later rebalancing date when constructing a historical index simulation.
  • Computing position sizes from realised vol of the full period rather than an expanding window.
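The last item in the list is easy to demonstrate. The sketch below contrasts a full-sample volatility estimate with a point-in-time one; the simulated returns, the 10% volatility target, and the 60-day warm-up are assumed values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
dates = pd.bdate_range("2020-01-01", periods=500)
returns = pd.Series(rng.normal(0.0, 0.01, 500), index=dates)

target_vol = 0.10 / np.sqrt(252)   # daily volatility target (assumed 10% annualised)

# Look-ahead: the full-sample volatility uses returns that were not yet known.
size_lookahead = pd.Series(target_vol / returns.std(ddof=1), index=dates)

# Point-in-time: expanding volatility, lagged one day so that the size applied
# on day t depends only on returns up to day t-1.
expanding_vol = returns.expanding(min_periods=60).std(ddof=1).shift(1)
size_point_in_time = target_vol / expanding_vol

print(size_lookahead.iloc[100], size_point_in_time.iloc[100])
```

The `shift(1)` is the important part: without it the size on day t would use day t's own return, which is a one-day look-ahead that is easy to miss.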

Overfitting via Strategy Space Search

Every parameter choice, universe filter, lookback selection, and signal weighting is a degree of freedom. A strategy with 10 binary decisions has $2^{10} = 1024$ possible variants, all of which were implicitly "tried" if the researcher inspected the data before choosing. Harvey and Liu (2015) suggest that the minimum t-statistic for a new factor to be credible, accounting for the number of factors already published, should be at least 3.0, not the conventional 2.0 used for a single test.

Transaction Costs and Market Impact

A backtest that ignores transaction costs will overstate performance. Typical execution frictions:

  • Commission: 1–5 bps per side for institutional equity.
  • Market impact: for a position of $Q$ shares in a stock with average daily volume $\text{ADV}$ and volatility $\sigma$, the Almgren-Chriss impact is $\sim 3\sigma \cdot Q/\text{ADV}$ in annualised vol units, which is non-trivial for large positions.
  • Bid-ask spread: 1–3 bps in large-cap equities; 10–50 bps in small-cap.

Strategies with high turnover (> 50% monthly) are often unviable after costs even if the raw information coefficient (IC) is positive.
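A simple proportional cost model shows how turnover eats into gross returns. This is a minimal sketch: the function name `net_returns`, the flat cost in basis points, and the toy weights are assumptions, and market impact is not modelled.

```python
import numpy as np
import pandas as pd

def net_returns(gross, weights, cost_bps=5.0):
    """Subtract proportional trading costs from gross strategy returns.

    Turnover per period is the sum of absolute weight changes; the first
    period is charged for building the initial position from cash.
    """
    turnover = weights.diff().abs().sum(axis=1)
    turnover.iloc[0] = weights.iloc[0].abs().sum()
    return gross - turnover * cost_bps / 1e4

# Toy example: two assets, three periods, 10 bps cost per unit of turnover.
weights = pd.DataFrame({"A": [0.5, 0.6, 0.6], "B": [0.5, 0.4, 0.4]})
gross = pd.Series([0.0010, 0.0020, -0.0010])
net = net_returns(gross, weights, cost_bps=10.0)
print(net.round(6).tolist())   # [0.0, 0.0018, -0.001]
```

The first period's entire gross return is consumed by the cost of building the position, which is why cost-aware backtests should charge for the initial trade as well as for rebalancing.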

The Stationarity of Relationships

A backtest spanning 20 years implicitly assumes that the same relationship held throughout. Market microstructure has changed dramatically since the 1990s (decimalization, algorithmic trading, HFT). A signal that worked 1995–2005 may not work at all post-2010. Use time-segmented performance attribution to check whether performance is evenly distributed across sub-periods or concentrated in a specific regime.
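A basic form of time-segmented attribution is the Sharpe ratio by calendar year. The sketch below uses simulated returns whose drift disappears after 2017, an assumed regime break chosen only to make the pattern visible; `sharpe_by_year` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def sharpe_by_year(returns, periods_per_year=252):
    """Annualised Sharpe ratio of daily excess returns for each calendar year."""
    by_year = returns.groupby(returns.index.year)
    return by_year.mean() / by_year.std(ddof=1) * np.sqrt(periods_per_year)

rng = np.random.default_rng(3)
dates = pd.bdate_range("2015-01-01", periods=1560)   # roughly 2015 through 2020

# Simulated signal decay: positive drift before 2018, none afterwards.
drift = np.where(dates.year < 2018, 0.002, 0.0)
returns = pd.Series(rng.normal(drift, 0.01), index=dates)

yearly = sharpe_by_year(returns)
early = yearly[yearly.index < 2018].mean()
late = yearly[yearly.index >= 2018].mean()
print(yearly.round(2))
print(f"mean Sharpe before 2018: {early:.2f}, from 2018 on: {late:.2f}")
```

A full-sample Sharpe ratio on this series would still look respectable, while the yearly breakdown shows that all of the performance comes from the first regime.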


Interview Angle

L1. What is the Sharpe ratio? How do you compute a t-statistic to test whether SR > 0? How many years of daily returns do you need to establish a Sharpe ratio of 1.0 as statistically significant at 5%?

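L2. Explain the multiple hypothesis testing problem in backtesting. How does the Benjamini-Hochberg procedure differ from Bonferroni? A researcher reports trying 50 strategies and selecting the best with SR = 2.0 over 3 years. Is this result credible? (The expected max of 50 iid draws from $\mathcal{N}(0,1)$ is bounded by $\sqrt{2\ln 50} \approx 2.8$ in standardised units. Over 3 years the standard error of an annualised SR is about $1/\sqrt{3} \approx 0.58$, so the best of 50 zero-alpha strategies is expected to show an annualised SR of up to roughly $2.8 \times 0.58 \approx 1.6$. An observed 2.0 sits less than one standard error above that hurdle, so the result is not unusual under the null.)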

L3. Define the Deflated Sharpe Ratio. What inputs does it require beyond the observed Sharpe, and why? Design a rigorous backtesting protocol for a cross-sectional equity signal: specify how you would split the data, how you would select hyperparameters, and what statistics you would report. How would you estimate the effective number of independent strategies tried when strategies share features (correlated search space)?

Correlated search space. When $N$ strategies have average pairwise return correlation $\bar{\rho}$, the effective number of independent tests is $N_{\text{eff}} = N / (1 + (N-1)\bar{\rho})$. High correlation among tested strategies (e.g., variants of the same momentum signal) reduces the severity of the multiple testing problem but also reduces the diversity of the search. Report both $N$ (total trials) and $N_{\text{eff}}$ for full transparency.
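The effective-trials formula is straightforward to apply to a matrix of strategy returns. In the sketch below the helper names are hypothetical, and the 20 simulated variants share one common driver with equal-variance noise, so their pairwise correlation is about 0.5 by construction.

```python
import numpy as np

def effective_trials(n_trials, rho_bar):
    """N_eff = N / (1 + (N - 1) * rho_bar)."""
    return n_trials / (1.0 + (n_trials - 1) * rho_bar)

def average_pairwise_correlation(strategy_returns):
    """Mean off-diagonal correlation; columns are strategies, rows are dates."""
    corr = np.corrcoef(strategy_returns, rowvar=False)
    n = corr.shape[0]
    return (corr.sum() - n) / (n * (n - 1))

rng = np.random.default_rng(11)
common = rng.standard_normal((5000, 1))    # shared signal component
idio = rng.standard_normal((5000, 20))     # strategy-specific noise
variants = common + idio                   # 20 variants of the same idea

rho_bar = average_pairwise_correlation(variants)
n_eff = effective_trials(variants.shape[1], rho_bar)
print(f"N = 20, average correlation = {rho_bar:.2f}, N_eff = {n_eff:.2f}")
```

Twenty variants of one idea behave like roughly two independent trials here, which is the quantity to use as $N$ when deflating the best Sharpe ratio.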
