Brownian Bridge™

Setup

When ML Is Appropriate in Quant Finance

Machine learning is appropriate when the relationship between features and returns is:

Non-linear or high-dimensional — beyond what OLS factor regression can model.
Latent — the relevant features are not immediately obvious but can be extracted from raw data.
Sufficiently stable — the relationship persists long enough to be exploited after accounting for transaction costs.

The word "signal" in quantitative finance refers to a predictive variable (a feature or combination of features) whose value at time $t$ predicts the cross-sectional or time-series return at time $t+h$ for horizon $h$. The quality of a signal is measured by its information coefficient (IC):

$\text{IC}_t = \text{corr}(\hat{r}_{t+h}, r_{t+h}),$

where $\hat{r}_{t+h}$ is the model's predicted return and $r_{t+h}$ is the realised return, with the correlation taken across assets at date $t$. As a rule of thumb, an IC of 0.05 is considered actionable at scale; 0.10 is strong.
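As a rough illustration of the definition above, the sketch below computes a single-date IC on synthetic data, both as a Pearson correlation and as a Spearman rank correlation. The cross-section size, the signal strength, and the injected outlier are all hypothetical choices made for the example, not values from any real dataset.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
n_assets = 500  # hypothetical cross-section size

# Synthetic cross-section at one date: predictions carry a weak linear signal.
predicted = rng.standard_normal(n_assets)
realised = 0.05 * predicted + rng.standard_normal(n_assets)

pearson_ic, _ = pearsonr(predicted, realised)
rank_ic, _ = spearmanr(predicted, realised)

# Inject one extreme realised return (for example a takeover) and recompute.
realised_outlier = realised.copy()
realised_outlier[np.argmax(predicted)] = 50.0
pearson_ic_outlier, _ = pearsonr(predicted, realised_outlier)
rank_ic_outlier, _ = spearmanr(predicted, realised_outlier)

print(f"Pearson IC: {pearson_ic:.3f} -> {pearson_ic_outlier:.3f} with outlier")
print(f"Rank IC:    {rank_ic:.3f} -> {rank_ic_outlier:.3f} with outlier")
```

In this setup the single outlier moves the Pearson IC far more than the rank IC, which is one reason rank IC is often preferred for fat-tailed returns.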

Conventions. All features are point-in-time — constructed from data available strictly before the forecast period to avoid look-ahead bias. Returns are cross-sectionally demeaned and winsorised (e.g., at ±3 standard deviations) before model fitting to reduce outlier impact. All features are standardised (zero mean, unit variance) within each cross-sectional rebalancing date.

Theory

1. The Feature Engineering Framework

Feature engineering converts raw financial data into predictive inputs. Categories:

Feature Type	Examples	Rationale
Momentum	1-month, 6-month, 12-month returns; return reversals	Trend persistence; short-term reversal
Value	Book-to-market, earnings yield, free cash flow yield	Mean-reversion to fundamentals
Quality	ROE, gross profitability, accruals	Persistent mispricing of quality
Low risk	Idiosyncratic vol, beta, max drawdown	Low-volatility anomaly
Technical	RSI, moving average crossovers, volume ratios	Short-term supply/demand signals
Alternative	Sentiment from news/filings, satellite data	Unstructured data signals

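Winsorisation is critical because return distributions, and many features derived from them, have fat tails. Winsorise each feature $f$ at time $t$ by clamping it to three cross-sectional standard deviations around the cross-sectional mean:

$\tilde{f}_{i,t} = \text{clamp}\!\left(f_{i,t},\; \hat{\mu}_t - 3\hat{\sigma}_t,\; \hat{\mu}_t + 3\hat{\sigma}_t\right),$

where $\hat{\mu}_t$ and $\hat{\sigma}_t$ are the cross-sectional mean and standard deviation of $f$ at time $t$. Then cross-sectionally standardise:

$x_{i,t} = (\tilde{f}_{i,t} - \bar{\tilde{f}}_t) / \text{std}(\tilde{f}_t).$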

2. Regularised Linear Models

Standard OLS breaks down when there are many features: multicollinearity inflates coefficient variance and the model overfits. Regularisation constrains the coefficient vector:

Ridge regression ($\ell_2$ regularisation): $\hat{\beta}^{\text{ridge}} = \arg\min_\beta \left\{\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2\right\} = (X^\top X + \lambda I)^{-1} X^\top y.$

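Ridge shrinks all coefficients toward zero, most strongly along low-variance directions of $X$ (the shrinkage is proportional only when the features are orthonormal). For any $\lambda > 0$ the closed-form inverse exists even when $X^\top X$ is singular. The optimal $\lambda$ is selected by purged cross-validation (see §4).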
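A minimal sketch of the closed form, assuming synthetic data with more features than observations so that $X^\top X$ is singular. The dimensions and the value of `lam` are hypothetical; the comparison against scikit-learn's `Ridge` assumes no intercept is fitted, in which case it minimises the same objective.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_obs, n_features = 20, 50            # more features than observations
X = rng.standard_normal((n_obs, n_features))
y = rng.standard_normal(n_obs)
lam = 1.0                             # hypothetical penalty strength

gram = X.T @ X
gram_rank = np.linalg.matrix_rank(gram)   # at most n_obs, so gram is singular

# Closed form: solve (X'X + lam * I) beta = X'y instead of forming the inverse.
beta_closed = np.linalg.solve(gram + lam * np.eye(n_features), X.T @ y)

# scikit-learn minimises ||y - Xb||^2 + alpha * ||b||^2 when fit_intercept=False.
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print("rank of X'X:", gram_rank, "of", n_features)
print("closed form matches sklearn:", np.allclose(beta_closed, beta_sklearn, atol=1e-6))
```

Solving the linear system is usually preferable to computing the inverse explicitly, both for speed and for numerical stability.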

Lasso ($\ell_1$ regularisation): $\hat{\beta}^{\text{lasso}} = \arg\min_\beta \left\{\|y - X\beta\|_2^2 + \lambda \|\beta\|_1\right\}.$

Lasso produces sparse solutions, with many $\hat{\beta}_j$ exactly zero, and so performs implicit feature selection. There is no closed form; the problem is solved via coordinate descent or the LARS algorithm. Lasso is useful when only a few features are believed to be genuinely predictive.
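To make the sparsity mechanism concrete, here is a small coordinate descent sketch for the lasso objective as written above (without the $1/(2n)$ scaling that some libraries use). The function names, the synthetic data, and `lam=80.0` are illustrative assumptions; a production solver would add convergence checks.

```python
import numpy as np


def soft_threshold(z: float, gamma: float) -> float:
    """Shrink z toward zero by gamma; return exactly 0 when |z| <= gamma."""
    return float(np.sign(z) * max(abs(z) - gamma, 0.0))


def lasso_coordinate_descent(
    X: np.ndarray, y: np.ndarray, lam: float, n_iter: int = 200
) -> np.ndarray:
    """Minimise ||y - X b||_2^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n_features = X.shape[1]
    beta = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)
    resid = y.astype(float)                 # y - X @ beta, with beta starting at 0
    for _ in range(n_iter):
        for j in range(n_features):
            resid += X[:, j] * beta[j]      # add back feature j's contribution
            rho = X[:, j] @ resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * beta[j]      # remove the updated contribution
    return beta


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
true_beta = np.array([1.0, -0.5] + [0.0] * 8)   # only two informative features
y = X @ true_beta + 0.5 * rng.standard_normal(200)

beta_hat = lasso_coordinate_descent(X, y, lam=80.0)
print(np.round(beta_hat, 3))
```

The threshold is `lam / 2` because the squared-error term here carries no factor of one half; the surviving coefficients are shrunk toward zero relative to their true values.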

Elastic Net combines both penalties: $\hat{\beta}^{\text{EN}} = \arg\min_\beta \left\{\|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2\right\}.$
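scikit-learn's `ElasticNet` parameterises the same problem differently: it minimises $\frac{1}{2n}\|y - X\beta\|_2^2 + \alpha\rho\|\beta\|_1 + \frac{1}{2}\alpha(1-\rho)\|\beta\|_2^2$ with `alpha` $=\alpha$ and `l1_ratio` $=\rho$. Matching terms gives $\alpha\rho = \lambda_1/(2n)$ and $\alpha(1-\rho) = \lambda_2/n$. The sketch below applies that mapping; the helper names and the penalty values are hypothetical, and no intercept is fitted.

```python
import numpy as np
from sklearn.linear_model import ElasticNet


def to_sklearn_params(lam1: float, lam2: float, n_obs: int) -> tuple[float, float]:
    """Map (lambda_1, lambda_2) of the unscaled objective to (alpha, l1_ratio)."""
    l1_part = lam1 / (2.0 * n_obs)
    l2_part = lam2 / n_obs
    alpha = l1_part + l2_part
    return alpha, l1_part / alpha


def objective(beta: np.ndarray, X: np.ndarray, y: np.ndarray,
              lam1: float, lam2: float) -> float:
    """Elastic net objective in the unscaled form used in the text."""
    resid = y - X @ beta
    return float(resid @ resid + lam1 * np.abs(beta).sum() + lam2 * beta @ beta)


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = X[:, 0] - 0.5 * X[:, 1] + 0.5 * rng.standard_normal(200)
lam1, lam2 = 20.0, 10.0               # hypothetical penalty strengths

alpha, l1_ratio = to_sklearn_params(lam1, lam2, n_obs=len(y))
model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                   tol=1e-10, max_iter=100_000)
beta_hat = model.fit(X, y).coef_

print("alpha:", alpha, "l1_ratio:", l1_ratio)
print("objective at solution:", objective(beta_hat, X, y, lam1, lam2))
```

Getting this mapping wrong is an easy way to end up comparing models at very different effective penalty strengths, so it is worth checking against the objective directly.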

3. Gradient Boosted Trees

For non-linear feature interactions, gradient boosted decision trees (GBM/XGBoost/LightGBM) are the dominant approach in quant equity signal research.

Gradient boosting algorithm. Fit an additive model $F_m(x) = F_{m-1}(x) + \eta f_m(x)$, where $f_m$ is a regression tree fit to the pseudo-residuals $-[\partial L(y_i, F(x_i))/\partial F(x_i)]_{F=F_{m-1}}$. For squared loss $L(y, \hat{y}) = (y - \hat{y})^2/2$, the pseudo-residuals equal the ordinary residuals $y_i - F_{m-1}(x_i)$. The learning rate $\eta \in (0, 1]$ controls the step size; a smaller $\eta$ combined with more trees generally reduces overfitting.
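The recursion above can be sketched in a few lines for squared loss, using scikit-learn's `DecisionTreeRegressor` as the base learner. This is a teaching sketch under stated assumptions (synthetic data, no subsampling, no early stopping), not a substitute for XGBoost or LightGBM; the function names are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(X, y, n_trees=100, eta=0.05, max_depth=2):
    """Boost regression trees on squared loss; return F_0, trees, training MSE path."""
    f0 = float(np.mean(y))                  # F_0: constant minimising squared loss
    prediction = np.full(len(y), f0)
    trees, train_mse = [], []
    for _ in range(n_trees):
        residuals = y - prediction          # pseudo-residuals for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction = prediction + eta * tree.predict(X)   # F_m = F_{m-1} + eta * f_m
        trees.append(tree)
        train_mse.append(float(np.mean((y - prediction) ** 2)))
    return f0, trees, train_mse


def predict_gradient_boosting(f0, trees, X, eta=0.05):
    """Evaluate the additive model F_M(x) = F_0 + eta * sum_m f_m(x)."""
    return f0 + eta * sum(tree.predict(X) for tree in trees)


rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
y = X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(500)   # pure interaction effect

f0, trees, train_mse = fit_gradient_boosting(X, y)
fitted = predict_gradient_boosting(f0, trees, X)
print(f"training MSE: {train_mse[0]:.4f} -> {train_mse[-1]:.4f}")
```

The training error falls at every round by construction, which is exactly why out-of-sample validation, rather than training fit, has to decide the number of trees.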

Key regularisation parameters:

max_depth: maximum depth of each tree (2–4 recommended for financial data)
n_estimators: number of trees
learning_rate ($\eta$): typically 0.01–0.05
min_samples_leaf: minimum samples per leaf node (limits overfitting)
subsample: fraction of samples used per tree (stochastic gradient boosting)

Feature importance is computed as total reduction in the objective function attributable to each feature across all splits (impurity-based) or via permutation importance (model-agnostic, more reliable for correlated features).
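As an illustration of the parameters and the two importance measures, the sketch below fits scikit-learn's `GradientBoostingRegressor` on synthetic data and compares impurity-based importance with permutation importance on a held-out block. The data-generating process and the specific parameter values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n_obs = 2000
X = rng.standard_normal((n_obs, 4))
# Features 0 and 1 matter (main effect plus interaction); features 2 and 3 are noise.
y = 0.5 * X[:, 0] + 0.5 * X[:, 0] * X[:, 1] + rng.standard_normal(n_obs)

split = 1500                                  # hold out the last block of rows
model = GradientBoostingRegressor(
    max_depth=3,
    n_estimators=200,
    learning_rate=0.05,
    min_samples_leaf=50,
    subsample=0.8,
    random_state=0,
)
model.fit(X[:split], y[:split])

# Impurity-based importance: computed from the training-time splits.
impurity_importance = model.feature_importances_

# Permutation importance: drop in held-out score when one feature is shuffled.
perm = permutation_importance(
    model, X[split:], y[split:], n_repeats=10, random_state=0
)

print("impurity:   ", np.round(impurity_importance, 3))
print("permutation:", np.round(perm.importances_mean, 3))
```

Because permutation importance is measured on held-out data, pure-noise features tend to score near zero, whereas impurity-based importance can still assign them some weight.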

4. Purged Walk-Forward Cross-Validation

Standard $k$-fold cross-validation is invalid for time series because:

Data leakage through time: some test folds precede the training data in time, so information from the future is used to build the model.
Overlapping returns: if the label is a 5-day forward return, consecutive observations share 4 days of return data, so labels on either side of a train/test boundary are strongly correlated and validation scores overstate out-of-sample predictability.
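The overlap effect is easy to quantify. For independent daily returns, two $h$-day forward returns whose start dates are $k < h$ days apart share $h - k$ days, so their correlation is $(h-k)/h$. The sketch below checks this on simulated i.i.d. returns; the sample size and volatility are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
h = 5
daily = 0.01 * rng.standard_normal(100_000)     # i.i.d. daily returns

# h-day forward return at t: sum of daily returns over t+1, ..., t+h.
csum = np.concatenate([[0.0], np.cumsum(daily)])
t = np.arange(len(daily) - h)
forward = csum[t + h + 1] - csum[t + 1]


def autocorr(x: np.ndarray, lag: int) -> float:
    """Sample autocorrelation of x at the given positive lag."""
    return float(np.corrcoef(x[:-lag], x[lag:])[0, 1])


for lag in range(1, h + 1):
    theory = max(h - lag, 0) / h
    print(f"lag {lag}: sample {autocorr(forward, lag):+.3f}, theory {theory:.2f}")
```

Labels one day apart are therefore about 80% correlated even though the underlying daily returns are independent, which is the dependence that purging removes at the train/test boundary.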

Purged walk-forward CV (López de Prado 2018):

For each fold $k$:

Train on observations $t \leq t_k^{\text{train}}$.
Purge the $h$ most recent training observations, whose label windows overlap the test period: remove observations $[t_k^{\text{train}} - h + 1,\, t_k^{\text{train}}]$.
Optionally embargo the $e$ periods after the test set, excluding them from any later training window, to prevent leakage via serial correlation and microstructure effects.
Test on $[t_k^{\text{test, start}}, t_k^{\text{test, end}}]$.

The effect: each training/test split mimics real-world live trading — the model was built on past data and evaluated on future data with no information leakage.
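The purge step can be sanity-checked on plain integer time indices. The sketch below builds one purged split and verifies that the label window of the last training observation ends before the test window starts. The function name and the index values are hypothetical, and labels are assumed to cover periods $t+1, \dots, t+h$.

```python
import numpy as np


def purged_split(n_periods: int, test_start: int, test_size: int, h: int):
    """Return (train_idx, test_idx) with the last h pre-test observations purged."""
    test_end = min(test_start + test_size, n_periods)
    test_idx = np.arange(test_start, test_end)
    train_idx = np.arange(0, max(test_start - h, 0))
    return train_idx, test_idx


h = 5
train_idx, test_idx = purged_split(n_periods=500, test_start=400, test_size=63, h=h)

# The label of observation t covers t+1, ..., t+h.
last_label_end_purged = int(train_idx[-1]) + h
last_label_end_unpurged = (int(test_idx[0]) - 1) + h   # if nothing were purged

print("last training label ends at:", last_label_end_purged)
print("test window starts at:      ", int(test_idx[0]))
print("without purging it would end at:", last_label_end_unpurged)
```

Without the purge, the final training labels would be computed partly from returns inside the test window, which is the leakage the procedure is designed to prevent.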

Implementation

"""
ML signal generation with purged walk-forward cross-validation.

Assumptions:
- Features are already point-in-time (no look-ahead bias)
- Returns are forward returns (h-period, starting at t+1)
- All inputs are pandas DataFrames with DatetimeIndex
- Cross-sectional standardisation applied before calling fit()
"""

from __future__ import annotations

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator
from typing import Sequence


def winsorise(df: pd.DataFrame, n_sigma: float = 3.0) -> pd.DataFrame:
    """Cross-sectional winsorisation at n_sigma standard deviations."""
    mu = df.mean(axis=1)
    sigma = df.std(axis=1)
    lower = mu - n_sigma * sigma
    upper = mu + n_sigma * sigma
    return df.clip(lower=lower, upper=upper, axis=0)


def cross_section_standardise(df: pd.DataFrame) -> pd.DataFrame:
    """Z-score each row (cross-section) independently."""
    mu = df.mean(axis=1)
    sigma = df.std(axis=1).replace(0, np.nan)
    return df.sub(mu, axis=0).div(sigma, axis=0)


def information_coefficient(
    predicted: pd.Series,
    realised: pd.Series,
) -> float:
    """
    Spearman rank IC between predicted and realised cross-sectional returns.
    Rank IC is preferred over Pearson for robustness to outliers.
    """
    from scipy.stats import spearmanr
    valid = predicted.dropna().index.intersection(realised.dropna().index)
    if len(valid) < 10:
        return np.nan
    rho, _ = spearmanr(predicted.loc[valid], realised.loc[valid])
    return float(rho)


class PurgedWalkForwardCV:
    """
    Purged walk-forward cross-validation for panel return prediction.

    Parameters
    ----------
    n_splits:   number of forward-rolling test windows
    test_size:  number of time periods per test fold
    purge_gap:  number of periods to purge from training tail (label overlap)
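    embargo:    additional periods to drop after test window (microstructure);
                stored but not applied by split(), because in this expanding-window
                scheme the training data never follows the test window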
    """

    def __init__(
        self,
        n_splits: int = 5,
        test_size: int = 63,      # ~3 months of daily data
        purge_gap: int = 5,       # 5-day forward return → purge last 5 training obs
        embargo: int = 5,
    ) -> None:
        self.n_splits = n_splits
        self.test_size = test_size
        self.purge_gap = purge_gap
        self.embargo = embargo

    def split(self, dates: pd.DatetimeIndex) -> list[tuple[np.ndarray, np.ndarray]]:
        """
        Generate (train_idx, test_idx) pairs.
        dates: sorted DatetimeIndex of all time periods.
        """
        T = len(dates)
        min_train = T - self.n_splits * self.test_size
        if min_train <= self.purge_gap:
            raise ValueError("Insufficient data for requested splits.")

        splits = []
        for k in range(self.n_splits):
            test_start = min_train + k * self.test_size
            test_end = min(test_start + self.test_size, T)
            train_end = test_start - self.purge_gap   # purge overlap
            train_idx = np.arange(0, max(train_end, 0))
            test_idx = np.arange(test_start, test_end)
            if len(train_idx) > 0 and len(test_idx) > 0:
                splits.append((train_idx, test_idx))
        return splits


def evaluate_signal(
    features: pd.DataFrame,      # (date x asset) panel holding a single feature
    forward_returns: pd.DataFrame,   # same index and columns as features
    model: BaseEstimator,
    cv: PurgedWalkForwardCV,
) -> pd.DataFrame:
    """
    Evaluate a sklearn-compatible model using purged walk-forward CV.
    Both features and forward_returns must have DatetimeIndex rows and asset columns.

    Returns a DataFrame indexed by date with columns: ic, mean_ic, ic_ir (mean IC / std(IC)).
    """
    dates = features.index
    splits = cv.split(dates)

    results = []
    for train_idx, test_idx in splits:
        train_dates = dates[train_idx]
        test_dates = dates[test_idx]

        X_train = features.loc[train_dates].stack().dropna()
        y_train = forward_returns.loc[train_dates].stack().reindex(X_train.index).dropna()
        X_train = X_train.reindex(y_train.index)

        X_test = features.loc[test_dates].stack().dropna()

        if len(X_train) < 50:
            continue

        # Train
        scaler = StandardScaler()
        X_tr_scaled = scaler.fit_transform(X_train.values.reshape(-1, 1) if X_train.ndim == 1
                                            else X_train.values)
        model.fit(X_tr_scaled, y_train.values)

        # Predict and compute IC per date
        X_te_scaled = scaler.transform(X_test.values.reshape(-1, 1) if X_test.ndim == 1
                                        else X_test.values)
        preds = pd.Series(model.predict(X_te_scaled), index=X_test.index)

        for dt in test_dates:
            if dt not in preds.index.get_level_values(0):
                continue
            pred_t = preds.xs(dt, level=0) if preds.index.nlevels > 1 else preds.loc[dt]
            real_t = forward_returns.loc[dt].dropna()
            ic = information_coefficient(pred_t, real_t)
            results.append({"date": dt, "ic": ic})

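    if not results:
        # No fold produced a usable IC; avoid a KeyError on the missing "date" column.
        return pd.DataFrame(columns=["ic", "mean_ic", "ic_ir"])
    df = pd.DataFrame(results).set_index("date")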
    df["mean_ic"] = df["ic"].mean()
    df["ic_ir"] = df["ic"].mean() / df["ic"].std(ddof=1) if len(df) > 1 else np.nan
    return df

Validation

IC benchmarks. Rule of thumb from practitioners (Grinold & Kahn 2000):

IC ≈ 0.02–0.05: weak but potentially usable signal
IC ≈ 0.05–0.10: moderate signal; exploitable with low costs
IC > 0.10: strong signal (rare; often fragile or concentrated)

Residual diagnostics. After fitting, check:

ACF of residuals: should be near zero if serial dependence is captured.
Feature importance stability: feature importances should be stable across CV folds; large instability suggests overfitting.
IC decay curve: IC at lag $h$ should decay monotonically; a spike at $h > 1$ suggests a data processing error.

Overfitting detection. Compare in-sample IC (training set) to out-of-sample IC (test set) per fold. A ratio $\text{IC}^{\text{OOS}} / \text{IC}^{\text{IS}} < 0.5$ indicates significant overfitting. For tree models, reduce max_depth or increase min_samples_leaf.
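A small sketch of the in-sample versus out-of-sample comparison, assuming synthetic data with a deliberately weak signal and two decision trees of very different capacity. The helper name `ic_ratio` and all data parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n_obs = 4000
X = rng.standard_normal((n_obs, 5))
y = 0.1 * X[:, 0] + rng.standard_normal(n_obs)      # weak signal, mostly noise
X_train, y_train = X[:3000], y[:3000]
X_test, y_test = X[3000:], y[3000:]


def ic_ratio(model):
    """Fit the model; return (in-sample IC, out-of-sample IC, OOS/IS ratio)."""
    model.fit(X_train, y_train)
    ic_is, _ = spearmanr(model.predict(X_train), y_train)
    ic_oos, _ = spearmanr(model.predict(X_test), y_test)
    return float(ic_is), float(ic_oos), float(ic_oos / ic_is)


# A fully grown tree memorises the training labels.
deep = ic_ratio(DecisionTreeRegressor(min_samples_leaf=1, random_state=0))
# A heavily constrained tree can only fit coarse structure.
shallow = ic_ratio(DecisionTreeRegressor(max_depth=2, min_samples_leaf=200, random_state=0))

print("deep tree    (IS, OOS, ratio):", np.round(deep, 3))
print("shallow tree (IS, OOS, ratio):", np.round(shallow, 3))
```

The unconstrained tree reaches an in-sample IC near 1 while its out-of-sample IC stays small, so its ratio falls well below the 0.5 threshold mentioned above.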

Limitations

Look-Ahead Bias: The Primary Risk

Look-ahead bias invalidates any backtest. Common sources:

Point-in-time data: accounting ratios must use as-reported data, because databases such as Compustat restate figures retroactively. Using restated fundamentals injects future knowledge into the backtest.
Survivorship bias: if the universe excludes delisted stocks, backtested returns are biased upward. A proper backtest includes all stocks that existed at each point in time.
Feature construction: any feature computed using information that was not available at the forecast date is contaminated. Normalising with means or standard deviations computed over the full sample (rather than over expanding or trailing windows) is a common error.
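One way to test a feature for this kind of contamination is to alter the future and check whether past feature values change. The sketch below contrasts a full-sample z-score with an expanding-window z-score on a synthetic series; the function names are hypothetical, and using only observations up to $t-1$ for the statistics is a deliberately conservative assumption.

```python
import numpy as np
import pandas as pd


def full_sample_zscore(s: pd.Series) -> pd.Series:
    """Look-ahead version: mean and std use the whole sample."""
    return (s - s.mean()) / s.std()


def point_in_time_zscore(s: pd.Series, min_periods: int = 20) -> pd.Series:
    """Expanding-window version: statistics at t use observations up to t-1 only."""
    mean = s.expanding(min_periods=min_periods).mean().shift(1)
    std = s.expanding(min_periods=min_periods).std().shift(1)
    return (s - mean) / std


rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=300, freq="B")
raw = pd.Series(rng.standard_normal(300).cumsum(), index=dates)

# Shock the series from observation 200 onward and recompute both features.
shocked = raw.copy()
shocked.iloc[200:] += 10.0

past = slice(0, 200)
full_changed = not np.allclose(
    full_sample_zscore(raw).iloc[past], full_sample_zscore(shocked).iloc[past]
)
pit_changed = not np.allclose(
    point_in_time_zscore(raw).iloc[past],
    point_in_time_zscore(shocked).iloc[past],
    equal_nan=True,
)

print("full-sample feature changed in the past:  ", full_changed)
print("point-in-time feature changed in the past:", pit_changed)
```

A feature whose historical values move when later data are modified cannot have been computed from information available at the time.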

Signal Decay and Non-stationarity

ML models are trained on historical relationships. These change due to:

Crowding: as capital is deployed into a signal, the alpha is competed away.
Regime shifts: relationships valid in high-vol regimes may not hold in low-vol periods.
Market structure changes: algorithmic trading has eliminated many short-term microstructure signals visible in 1990s data.

Validate using post-publication decay: compare signal IC before and after the academic literature identifies the same feature. McLean and Pontiff (2016) find that anomaly returns are about 26% lower out of sample and about 58% lower post-publication.

Transaction Costs Are Unavoidable

Raw IC does not account for transaction costs. The information ratio after costs is:

$\text{IR}_{\text{net}} \approx \text{IC} \cdot \sqrt{N} - \frac{2 \cdot c}{\sigma_{\text{cross}}} \cdot \sqrt{N \cdot \text{turnover}},$

where $c$ is the one-way cost per unit notional and $\sigma_{\text{cross}}$ is cross-sectional return dispersion. A signal with IC = 0.04 but 100% monthly turnover may have negative IR net of costs in a mid-cap universe.
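Taking the approximation above at face value, the arithmetic can be worked through directly. Setting $\text{IR}_{\text{net}} = 0$ gives a break-even cost $c^* = \text{IC} \cdot \sigma_{\text{cross}} / (2\sqrt{\text{turnover}})$, which does not depend on $N$. The inputs below (500 names, 8% monthly cross-sectional dispersion, costs of 10 and 20 basis points) are hypothetical numbers chosen only to illustrate the formula.

```python
import math


def net_information_ratio(ic: float, n_assets: int, cost: float,
                          sigma_cross: float, turnover: float) -> float:
    """IR net of costs under the approximation given in the text."""
    gross = ic * math.sqrt(n_assets)
    drag = (2.0 * cost / sigma_cross) * math.sqrt(n_assets * turnover)
    return gross - drag


def break_even_cost(ic: float, sigma_cross: float, turnover: float) -> float:
    """One-way cost at which the net IR is exactly zero."""
    return ic * sigma_cross / (2.0 * math.sqrt(turnover))


# Hypothetical inputs: IC = 0.04, 500 names, 100% monthly turnover.
common = dict(ic=0.04, n_assets=500, sigma_cross=0.08, turnover=1.0)
ir_low_cost = net_information_ratio(cost=0.0010, **common)    # about  0.335
ir_high_cost = net_information_ratio(cost=0.0020, **common)   # about -0.224
c_star = break_even_cost(ic=0.04, sigma_cross=0.08, turnover=1.0)  # 16 bp

print(f"net IR at 10 bp: {ir_low_cost:+.3f}")
print(f"net IR at 20 bp: {ir_high_cost:+.3f}")
print(f"break-even one-way cost: {c_star * 1e4:.1f} bp")
```

Under these assumed inputs the signal is viable at 10 basis points and loses money at 20, which is consistent with the mid-cap remark above.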

Tree Models: Interpretability vs. Stability

Gradient boosted trees with many features can produce highly unstable feature importances across sub-samples. Impurity-based importance is biased toward high-cardinality numerical features. Prefer permutation importance or SHAP values (SHapley Additive exPlanations) for interpreting which features drive predictions, especially when features are correlated.

Interview Angle

L1. What is the information coefficient? How would you compute it? Why use rank (Spearman) IC rather than Pearson IC for financial returns? What is look-ahead bias? Give two concrete examples of how it contaminates a backtest.

L2. Explain purged walk-forward cross-validation. Why is standard $k$-fold invalid for time series? What is the purge gap and when is it needed? Compare ridge and lasso regularisation: when would you prefer each? Given a ridge regression solution, what happens to $\hat{\beta}$ as $\lambda \to \infty$?

Ridge at $\lambda \to \infty$: $\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y \to 0$, since the penalty dominates the fit term. For any finite $\lambda$, ridge generically leaves every coefficient non-zero, so it cannot perform feature selection. Lasso at sufficiently large $\lambda$ sets coefficients exactly to zero through the soft-thresholding step of its coordinate descent solution.
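The contrast can be seen numerically with scikit-learn, as a sketch on synthetic data. Note that `Lasso` in scikit-learn scales the squared-error term by $1/(2n)$, so its `alpha` corresponds to $\lambda/(2n)$ in the notation above; the penalty grids here are arbitrary illustrative values.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_obs, n_features = 200, 10
X = rng.standard_normal((n_obs, n_features))
y = X[:, 0] - 0.5 * X[:, 1] + 0.5 * rng.standard_normal(n_obs)

ridge_lambdas = [0.1, 10.0, 1e3, 1e5]
ridge_norms, ridge_nonzero = [], []
for lam in ridge_lambdas:
    coef = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
    ridge_norms.append(float(np.linalg.norm(coef)))
    ridge_nonzero.append(int(np.count_nonzero(coef)))

lasso_alphas = [0.01, 0.1, 5.0]       # sklearn scaling: alpha = lambda / (2 n)
lasso_zeros = []
for alpha in lasso_alphas:
    coef = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    lasso_zeros.append(int(np.sum(coef == 0.0)))

print("ridge norms:        ", np.round(ridge_norms, 4))
print("ridge non-zero coefs:", ridge_nonzero)
print("lasso exact zeros:   ", lasso_zeros)
```

The ridge coefficient norm shrinks toward zero as the penalty grows while every coefficient stays non-zero, whereas lasso reaches an all-zero solution once the penalty is large enough.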

L3. Derive the bias-variance decomposition. Explain how regularisation trades off bias and variance. How would you design a feature engineering pipeline for a cross-sectional equity signal that is robust to look-ahead bias, survivorship bias, and transaction costs? What is the IC information ratio (ICIR), and how does it differ from the Sharpe ratio of a strategy built on the signal?