
ML in Signal Generation

25 min read · Level: Hard

Setup

When ML Is Appropriate in Quant Finance

Machine learning is appropriate when the relationship between features and returns is:

  1. Non-linear or high-dimensional — beyond what OLS factor regression can model.
  2. Latent — the relevant features are not immediately obvious but can be extracted from raw data.
  3. Sufficiently stable — the relationship persists long enough to be exploited after accounting for transaction costs.

The word "signal" in quantitative finance refers to a predictive variable (a feature or combination of features) whose value at time $t$ predicts the cross-sectional or time-series return at time $t+h$ for horizon $h$. The quality of a signal is measured by its information coefficient (IC):

$$\text{IC}_t = \text{corr}(\hat{r}_{t+h}, r_{t+h}),$$

where $\hat{r}_{t+h}$ is the model's predicted return and $r_{t+h}$ is the realised return. An IC of 0.05 is considered actionable at scale; 0.10 is strong.
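To make the definition concrete, here is a minimal sketch that computes a Pearson IC and a rank (Spearman) IC on one synthetic cross-section, then corrupts a single realised return. The data, the seed, the outlier size, and the helper names (`ordinal_rank`, `pearson_ic`, `rank_ic`) are all illustrative assumptions, not part of the lesson's implementation.

```python
import numpy as np


def ordinal_rank(x: np.ndarray) -> np.ndarray:
    """Ranks 0..n-1; ties are broken by position, which is fine for continuous data."""
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(len(x))
    return ranks


def pearson_ic(pred: np.ndarray, real: np.ndarray) -> float:
    return float(np.corrcoef(pred, real)[0, 1])


def rank_ic(pred: np.ndarray, real: np.ndarray) -> float:
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson_ic(ordinal_rank(pred), ordinal_rank(real))


# Synthetic cross-section in standardised units (assumption: weak linear signal).
rng = np.random.default_rng(0)
n_assets = 500
signal = rng.standard_normal(n_assets)
realised = 0.05 * signal + rng.standard_normal(n_assets)

# Corrupt one realised return: the asset with the lowest signal gets an extreme gain.
shocked = realised.copy()
shocked[np.argmin(signal)] = 50.0

pearson_shift = abs(pearson_ic(signal, shocked) - pearson_ic(signal, realised))
rank_shift = abs(rank_ic(signal, shocked) - rank_ic(signal, realised))
print(f"Pearson IC shift: {pearson_shift:.3f}, rank IC shift: {rank_shift:.3f}")
```

A single bad data point moves the Pearson IC far more than the rank IC, which is one reason the rank version is often preferred for noisy return data.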

Conventions. All features are point-in-time — constructed from data available strictly before the forecast period to avoid look-ahead bias. Returns are cross-sectionally demeaned and winsorised (e.g., at ±3 standard deviations) before model fitting to reduce outlier impact. All features are standardised (zero mean, unit variance) within each cross-sectional rebalancing date.


Theory

1. The Feature Engineering Framework

Feature engineering converts raw financial data into predictive inputs. Categories:

| Feature Type | Examples | Rationale |
|---|---|---|
| Momentum | 1-month, 6-month, 12-month returns; return reversals | Trend persistence; short-term reversal |
| Value | Book-to-market, earnings yield, free cash flow yield | Mean-reversion to fundamentals |
| Quality | ROE, gross profitability, accruals | Persistent mispricing of quality |
| Low risk | Idiosyncratic vol, beta, max drawdown | Low-volatility anomaly |
| Technical | RSI, moving average crossovers, volume ratios | Short-term supply/demand signals |
| Alternative | Sentiment from news/filings, satellite data | Unstructured data signals |

Winsorisation is critical: return distributions have fat tails. Winsorise each feature $f$ at time $t$ by

$$\tilde{f}_{i,t} = \text{clamp}\!\left(f_{i,t},\; \hat{\mu}_t - 3\hat{\sigma}_t,\; \hat{\mu}_t + 3\hat{\sigma}_t\right),$$

where $\hat{\mu}_t$ and $\hat{\sigma}_t$ are the cross-sectional mean and standard deviation of $f$ at time $t$, then cross-sectionally standardise: $x_{i,t} = (\tilde{f}_{i,t} - \bar{\tilde{f}}_t) / \text{std}(\tilde{f}_t)$.

2. Regularised Linear Models

Standard OLS breaks down with many correlated features: multicollinearity inflates coefficient variance and the model overfits. Regularisation constrains the coefficient vector:

Ridge regression ($\ell_2$ regularisation):

$$\hat{\beta}^{\text{ridge}} = \arg\min_\beta \left\{\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2\right\} = (X^\top X + \lambda I)^{-1} X^\top y.$$

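Ridge shrinks all coefficients toward zero. The shrinkage is proportional only when the features are orthonormal; in general it is strongest along the low-variance principal directions of $X$. For any $\lambda > 0$ the matrix $X^\top X + \lambda I$ is invertible, so the closed form exists even when $X^\top X$ is singular. The optimal $\lambda$ is selected by purged cross-validation (see §4).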
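The closed form can be checked numerically. The sketch below is a NumPy-only illustration on synthetic data with a deliberately duplicated column, so that $X^\top X$ is singular; the function name `ridge_closed_form`, the seed, and the data are assumptions made for the example.

```python
import numpy as np


def ridge_closed_form(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Solve (X'X + lam * I) beta = X'y without forming the inverse explicitly."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


rng = np.random.default_rng(1)
n = 200
base = rng.standard_normal((n, 3))
X = np.column_stack([base, base[:, 0]])      # column 3 duplicates column 0
y = base[:, 0] - 0.5 * base[:, 1] + rng.standard_normal(n)

# X has 4 columns but rank 3, so X'X is singular and OLS has no unique solution.
rank_X = np.linalg.matrix_rank(X)

lambdas = (1.0, 100.0, 1e6)
betas = {lam: ridge_closed_form(X, y, lam) for lam in lambdas}
norms = [float(np.linalg.norm(betas[lam])) for lam in lambdas]

for lam in lambdas:
    print(f"lambda={lam:>9.1f}  beta={np.round(betas[lam], 4)}")
```

Two things are worth noticing in the output: the duplicated columns receive equal coefficients (ridge splits the weight between them), and the coefficient norm falls toward zero as $\lambda$ grows.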

Lasso ($\ell_1$ regularisation):

$$\hat{\beta}^{\text{lasso}} = \arg\min_\beta \left\{\|y - X\beta\|_2^2 + \lambda \|\beta\|_1\right\}.$$

Lasso produces sparse solutions (many $\hat{\beta}_j = 0$ exactly), performing implicit feature selection. There is no closed form; it is solved via coordinate descent or the LARS algorithm. Useful when only a few features are believed to be genuinely predictive.
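As an illustration of how coordinate descent produces exact zeros, here is a minimal sketch for the objective $\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ as written above. With that scaling, the one-dimensional update for coordinate $j$ is a soft-threshold at $\lambda/2$. The function names, the fixed iteration count (no convergence check), and the synthetic data are assumptions for the example; production work would normally use a library solver.

```python
import numpy as np


def soft_threshold(rho: float, thresh: float) -> float:
    """Shrink rho toward zero by thresh; values inside [-thresh, thresh] become 0."""
    return float(np.sign(rho) * max(abs(rho) - thresh, 0.0))


def lasso_cd(X: np.ndarray, y: np.ndarray, lam: float, n_iter: int = 200) -> np.ndarray:
    """Cyclic coordinate descent for ||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)          # x_j' x_j for each column
    resid = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]     # partial residual excluding feature j
            rho = X[:, j] @ resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * beta[j]     # restore the full residual
    return beta


# Synthetic data: only the first two of ten features carry signal.
rng = np.random.default_rng(5)
n, p = 500, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[0], beta_true[1] = 1.0, -0.8
y = X @ beta_true + rng.standard_normal(n)

beta_hat = lasso_cd(X, y, lam=200.0)

# Above lam_max = 2 * max|X'y| every coefficient is thresholded to zero.
lam_max = 2.0 * float(np.max(np.abs(X.T @ y)))
beta_null = lasso_cd(X, y, lam=1.01 * lam_max)
print(np.round(beta_hat, 3))
```

The surviving coefficients are biased toward zero relative to `beta_true`, which is the price paid for the sparsity.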

Elastic Net combines both:

$$\hat{\beta}^{\text{EN}} = \arg\min_\beta \left\{\|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2\right\}.$$
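The $(\lambda_1, \lambda_2)$ parameterisation above differs from the one used by scikit-learn's `ElasticNet`, whose documented objective is $\frac{1}{2n}\|y - Xw\|_2^2 + \alpha\,\rho\,\|w\|_1 + \tfrac{1}{2}\alpha(1-\rho)\|w\|_2^2$ with $\rho$ = `l1_ratio`. Multiplying that objective by $2n$ gives $\lambda_1 = 2n\alpha\rho$ and $\lambda_2 = n\alpha(1-\rho)$. The helper names below are hypothetical; the sketch only converts between the two conventions.

```python
import math


def to_sklearn_params(lam1: float, lam2: float, n_samples: int) -> tuple[float, float]:
    """Map (lambda_1, lambda_2) to scikit-learn's (alpha, l1_ratio)."""
    alpha = lam1 / (2.0 * n_samples) + lam2 / n_samples
    if alpha <= 0.0:
        raise ValueError("At least one penalty must be positive.")
    l1_ratio = (lam1 / (2.0 * n_samples)) / alpha
    return alpha, l1_ratio


def from_sklearn_params(alpha: float, l1_ratio: float, n_samples: int) -> tuple[float, float]:
    """Inverse mapping: (alpha, l1_ratio) back to (lambda_1, lambda_2)."""
    lam1 = 2.0 * n_samples * alpha * l1_ratio
    lam2 = n_samples * alpha * (1.0 - l1_ratio)
    return lam1, lam2


alpha, l1_ratio = to_sklearn_params(lam1=4.0, lam2=2.0, n_samples=100)
lam1_back, lam2_back = from_sklearn_params(alpha, l1_ratio, n_samples=100)
print(alpha, l1_ratio, lam1_back, lam2_back)
```

Because the conversion depends on the sample size, a penalty tuned on one training window does not carry over unchanged to a window of different length in the $(\lambda_1, \lambda_2)$ convention.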

3. Gradient Boosted Trees

For non-linear feature interactions, gradient boosted decision trees (GBM/XGBoost/LightGBM) are the dominant approach in quant equity signal research.

Gradient boosting algorithm. Fit an additive model $F_m(x) = F_{m-1}(x) + \eta f_m(x)$, where $f_m$ is a regression tree fit to the pseudo-residuals $-[\partial L(y_i, F(x_i))/\partial F(x_i)]_{F=F_{m-1}}$. For squared loss $L(y, \hat{y}) = (y - \hat{y})^2/2$, the pseudo-residuals equal the ordinary residuals. The learning rate $\eta \in (0, 1]$ controls the step size; a smaller $\eta$ combined with more trees reduces overfitting.
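The update rule can be sketched in a few lines for squared loss, using depth-1 trees (stumps) on a single feature. This is a toy illustration under stated assumptions (one feature, a fixed grid of candidate thresholds, synthetic data, hypothetical function names), not how XGBoost or LightGBM are implemented.

```python
import numpy as np


def fit_stump(x: np.ndarray, residual: np.ndarray) -> tuple[float, float, float]:
    """Best single split on x by squared error; returns (threshold, left_mean, right_mean)."""
    best_sse, best = np.inf, (float(np.median(x)), 0.0, 0.0)
    for thr in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left = x <= thr
        if left.all() or not left.any():
            continue
        lv, rv = residual[left].mean(), residual[~left].mean()
        sse = ((residual[left] - lv) ** 2).sum() + ((residual[~left] - rv) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (float(thr), float(lv), float(rv))
    return best


def predict_stump(stump: tuple[float, float, float], x: np.ndarray) -> np.ndarray:
    thr, lv, rv = stump
    return np.where(x <= thr, lv, rv)


def gradient_boost(x, y, n_trees: int = 200, eta: float = 0.05):
    """F_m = F_{m-1} + eta * f_m, with f_m fit to residuals y - F_{m-1}."""
    f0 = float(y.mean())                   # F_0: best constant under squared loss
    F = np.full(len(y), f0)
    stumps = []
    for _ in range(n_trees):
        residual = y - F                   # negative gradient of (y - F)^2 / 2
        stump = fit_stump(x, residual)
        F = F + eta * predict_stump(stump, x)
        stumps.append(stump)
    return f0, stumps, F


def predict_boosted(f0, stumps, eta, x_new):
    F = np.full(len(x_new), f0)
    for stump in stumps:
        F = F + eta * predict_stump(stump, x_new)
    return F


rng = np.random.default_rng(7)
x = rng.uniform(-2.0, 2.0, size=1000)
y = np.maximum(x, 0.0) + 0.5 * rng.standard_normal(1000)   # kinked, non-linear target

f0, stumps, fitted = gradient_boost(x, y)
mse_constant = float(np.mean((y - f0) ** 2))
mse_boosted = float(np.mean((y - fitted) ** 2))
print(f"MSE constant: {mse_constant:.3f}, MSE boosted: {mse_boosted:.3f}")
```

The reported error is in-sample, so it always falls as trees are added; choosing `n_trees` and `eta` still requires out-of-sample validation.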

Key regularisation parameters:

  • max_depth: maximum depth of each tree (2–4 recommended for financial data)
  • n_estimators: number of trees
  • learning_rate $\eta$: typically 0.01–0.05
  • min_samples_leaf: minimum samples per leaf node (controls overfit)
  • subsample: fraction of samples used per tree (stochastic gradient boosting)

Feature importance is computed as the total reduction in the objective function attributable to each feature across all splits (impurity-based) or via permutation importance (model-agnostic and free of the impurity bias, although strongly correlated features can still share or mask each other's importance).
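Permutation importance needs only a prediction function and held-out data, which the following sketch illustrates with a plain least-squares model. The function here is a hand-rolled assumption written for clarity (scikit-learn ships its own `sklearn.inspection.permutation_importance`); the data and seeds are synthetic.

```python
import numpy as np


def permutation_importance(predict, X, y, n_repeats: int = 10, seed: int = 0) -> np.ndarray:
    """Mean increase in MSE when each column of X is shuffled, one column at a time."""
    rng = np.random.default_rng(seed)
    base_mse = np.mean((y - predict(X)) ** 2)
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        shuffled_mse = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link to y
            shuffled_mse.append(np.mean((y - predict(X_perm)) ** 2))
        importance[j] = np.mean(shuffled_mse) - base_mse
    return importance


# Synthetic data: feature 0 drives y, feature 1 is pure noise.
rng = np.random.default_rng(3)
n = 1000
X = rng.standard_normal((n, 2))
y = 2.0 * X[:, 0] + rng.standard_normal(n)

# Time-ordered split: fit on the first half, score importance on the second half.
X_train, X_test = X[:500], X[500:]
y_train, y_test = y[:500], y[500:]
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def predict(Z: np.ndarray) -> np.ndarray:
    return Z @ coef


importance = permutation_importance(predict, X_test, y_test)
print(np.round(importance, 3))
```

Scoring on held-out data matters: computed on the training set, the same procedure rewards features the model has merely memorised.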

4. Purged Walk-Forward Cross-Validation

Standard $k$-fold cross-validation is invalid for time series because:

  1. Data leakage through time: with shuffled or interleaved folds, some test folds precede their training data in time, so information from the future is used to build the model.
  2. Overlapping returns: if the label is a 5-day forward return, consecutive observations share 4 days of return data. Validation scores overstate out-of-sample predictability.
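The overlap effect in point 2 can be quantified. If daily returns are i.i.d., two $h$-day forward returns that start $k < h$ days apart share $h - k$ days, so their correlation is $(h-k)/h$; for $h = 5$ and $k = 1$ that is $0.8$. The simulation below is a sketch under that i.i.d. assumption, with a synthetic series and an arbitrary seed.

```python
import numpy as np

rng = np.random.default_rng(42)
n_days, h = 100_000, 5
daily = 0.01 * rng.standard_normal(n_days)      # i.i.d. daily returns (assumption)

# Forward label at t: sum of daily returns over t+1 .. t+h (log-return style).
cum = np.concatenate([[0.0], np.cumsum(daily)])  # cum[k] = sum of daily[0..k-1]
fwd = cum[h + 1:] - cum[1:-h]                    # fwd[t] = cum[t+h+1] - cum[t+1]


def autocorr(series: np.ndarray, lag: int) -> float:
    return float(np.corrcoef(series[:-lag], series[lag:])[0, 1])


lag1 = autocorr(fwd, 1)     # theory: (h - 1) / h = 0.8
lag_h = autocorr(fwd, h)    # theory: 0, the windows no longer overlap
print(f"lag-1 autocorrelation: {lag1:.3f}, lag-{h} autocorrelation: {lag_h:.3f}")
```

Even though the underlying daily returns are unpredictable, adjacent labels are strongly correlated, so a training observation next to a test observation carries most of the test label's information.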

Purged walk-forward CV (López de Prado 2018):

For each fold $k$:

  1. Train on observations $t \leq t_k^{\text{train}}$.
  2. Purge the $h$ most recent training observations, whose label windows overlap the test period: remove observations $[t_k^{\text{train}} - h + 1, t_k^{\text{train}}]$.
  3. Optionally add an embargo of $e$ periods after the test set to prevent leakage via microstructure effects.
  4. Test on $[t_k^{\text{test, start}}, t_k^{\text{test, end}}]$.

The effect: each training/test split mimics real-world live trading — the model was built on past data and evaluated on future data with no information leakage.


Implementation

"""
ML signal generation with purged walk-forward cross-validation.

Assumptions:
- Features are already point-in-time (no look-ahead bias)
- Returns are forward returns (h-period, starting at t+1)
- All inputs are pandas DataFrames with DatetimeIndex
- Cross-sectional standardisation applied before calling fit()
"""

from __future__ import annotations

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator
from typing import Sequence


def winsorise(df: pd.DataFrame, n_sigma: float = 3.0) -> pd.DataFrame:
    """Cross-sectional winsorisation at n_sigma standard deviations."""
    mu = df.mean(axis=1)
    sigma = df.std(axis=1)
    lower = mu - n_sigma * sigma
    upper = mu + n_sigma * sigma
    return df.clip(lower=lower, upper=upper, axis=0)


def cross_section_standardise(df: pd.DataFrame) -> pd.DataFrame:
    """Z-score each row (cross-section) independently."""
    mu = df.mean(axis=1)
    sigma = df.std(axis=1).replace(0, np.nan)
    return df.sub(mu, axis=0).div(sigma, axis=0)


def information_coefficient(
    predicted: pd.Series,
    realised: pd.Series,
) -> float:
    """
    Spearman rank IC between predicted and realised cross-sectional returns.
    Rank IC is preferred over Pearson for robustness to outliers.
    """
    from scipy.stats import spearmanr
    valid = predicted.dropna().index.intersection(realised.dropna().index)
    if len(valid) < 10:
        return np.nan
    rho, _ = spearmanr(predicted.loc[valid], realised.loc[valid])
    return float(rho)


class PurgedWalkForwardCV:
    """
    Purged walk-forward cross-validation for panel return prediction.

    Parameters
    ----------
    n_splits:   number of forward-rolling test windows
    test_size:  number of time periods per test fold
    purge_gap:  number of periods to purge from training tail (label overlap)
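    embargo:    additional periods to drop after test window (microstructure).
                Stored but not applied by split(): in this walk-forward scheme
                no training observation follows its own test window.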
    """

    def __init__(
        self,
        n_splits: int = 5,
        test_size: int = 63,      # ~3 months of daily data
        purge_gap: int = 5,       # 5-day forward return → purge last 5 training obs
        embargo: int = 5,
    ) -> None:
        self.n_splits = n_splits
        self.test_size = test_size
        self.purge_gap = purge_gap
        self.embargo = embargo

    def split(self, dates: pd.DatetimeIndex) -> list[tuple[np.ndarray, np.ndarray]]:
        """
        Generate (train_idx, test_idx) pairs.
        dates: sorted DatetimeIndex of all time periods.
        """
        T = len(dates)
        min_train = T - self.n_splits * self.test_size
        if min_train <= self.purge_gap:
            raise ValueError("Insufficient data for requested splits.")

        splits = []
        for k in range(self.n_splits):
            test_start = min_train + k * self.test_size
            test_end = min(test_start + self.test_size, T)
            train_end = test_start - self.purge_gap   # purge overlap
            train_idx = np.arange(0, max(train_end, 0))
            test_idx = np.arange(test_start, test_end)
            if len(train_idx) > 0 and len(test_idx) > 0:
                splits.append((train_idx, test_idx))
        return splits


def evaluate_signal(
    features: pd.DataFrame,      # (date x asset) panel holding a single feature
    forward_returns: pd.DataFrame,   # same shape as features
    model: BaseEstimator,
    cv: PurgedWalkForwardCV,
) -> pd.DataFrame:
    """
    Evaluate a sklearn-compatible model using purged walk-forward CV.
    Both features and forward_returns must have DatetimeIndex rows and asset columns.

    Returns a DataFrame indexed by date with columns: ic, mean_ic, ic_ir (mean IC / std(IC)).
    """
    dates = features.index
    splits = cv.split(dates)

    results = []
    for train_idx, test_idx in splits:
        train_dates = dates[train_idx]
        test_dates = dates[test_idx]

        X_train = features.loc[train_dates].stack().dropna()
        y_train = forward_returns.loc[train_dates].stack().reindex(X_train.index).dropna()
        X_train = X_train.reindex(y_train.index)

        X_test = features.loc[test_dates].stack().dropna()

        if len(X_train) < 50:
            continue

        # Train
        scaler = StandardScaler()
        X_tr_scaled = scaler.fit_transform(X_train.values.reshape(-1, 1) if X_train.ndim == 1
                                            else X_train.values)
        model.fit(X_tr_scaled, y_train.values)

        # Predict and compute IC per date
        X_te_scaled = scaler.transform(X_test.values.reshape(-1, 1) if X_test.ndim == 1
                                        else X_test.values)
        preds = pd.Series(model.predict(X_te_scaled), index=X_test.index)

        for dt in test_dates:
            if dt not in preds.index.get_level_values(0):
                continue
            pred_t = preds.xs(dt, level=0) if preds.index.nlevels > 1 else preds.loc[dt]
            real_t = forward_returns.loc[dt].dropna()
            ic = information_coefficient(pred_t, real_t)
            results.append({"date": dt, "ic": ic})

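    if not results:
        # No fold produced predictions (e.g. every fold had too little training data).
        return pd.DataFrame(columns=["ic", "mean_ic", "ic_ir"])

    df = pd.DataFrame(results).set_index("date")
    df["mean_ic"] = df["ic"].mean()
    # ICIR: mean IC divided by its sample standard deviation across test dates.
    df["ic_ir"] = df["ic"].mean() / df["ic"].std(ddof=1) if len(df) > 1 else np.nan
    return df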

Validation

IC benchmarks. Rule of thumb from practitioners (Grinold & Kahn 2000):

  • IC ≈ 0.02–0.05: weak but potentially usable signal
  • IC ≈ 0.05–0.10: moderate signal; exploitable with low costs
  • IC > 0.10: strong signal (rare; often fragile or concentrated)

Residual diagnostics. After fitting, check:

  1. ACF of residuals: should be near zero if serial dependence is captured.
  2. Feature importance stability: feature importances should be stable across CV folds; large instability suggests overfitting.
  3. IC decay curve: IC at lag $h$ should decay monotonically; a spike at $h > 1$ suggests a data processing error.

Overfitting detection. Compare in-sample IC (training set) to out-of-sample IC (test set) per fold. A ratio $\text{IC}^{\text{OOS}} / \text{IC}^{\text{IS}} < 0.5$ indicates significant overfitting. For tree models, reduce max_depth or increase min_samples_leaf.
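The per-fold check is simple to automate. The sketch below applies the 0.5 threshold from the text; the helper names and the fold ICs are hypothetical, and treating a non-positive in-sample IC as "ratio undefined" is an assumption of this example.

```python
import numpy as np


def oos_is_ratio(ic_is: float, ic_oos: float) -> float:
    """IC_OOS / IC_IS, or NaN when the in-sample IC is not positive."""
    if not np.isfinite(ic_is) or ic_is <= 0.0:
        return float("nan")
    return ic_oos / ic_is


def flag_overfit_folds(ic_is, ic_oos, threshold: float = 0.5):
    """Return per-fold ratios and a boolean mask of folds below the threshold."""
    ratios = np.array([oos_is_ratio(a, b) for a, b in zip(ic_is, ic_oos)])
    return ratios, ratios < threshold


# Hypothetical per-fold mean ICs, for illustration only.
fold_ic_is = [0.12, 0.10, 0.11]
fold_ic_oos = [0.07, 0.02, 0.06]

ratios, flagged = flag_overfit_folds(fold_ic_is, fold_ic_oos)
print(np.round(ratios, 2), flagged)
```

A fold with an undefined ratio is not flagged by the mask, so such folds should be inspected separately.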


Limitations

Look-Ahead Bias: The Primary Risk

Look-ahead bias invalidates any backtest. Common sources:

  • Point-in-time data: accounting ratios must use data as-reported (e.g., COMPUSTAT restates figures retroactively). Using restated fundamentals creates future knowledge.
  • Survivorship bias: if the universe excludes delisted stocks, the sample is biased upward. A proper backtest includes all stocks that existed at each point in time.
  • Feature construction: any feature computed using information that was not available at the forecast date is contaminated. Normalising with means or standard deviations computed over the full sample (rather than over rolling or expanding windows that end at the forecast date) is a common error.
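The feature-construction error can be detected mechanically: change only future data and check whether past feature values move. The sketch below compares a full-sample z-score with an expanding-window z-score on a synthetic series; the function names, the `min_periods` default, and the size of the level shift are assumptions made for the example.

```python
import numpy as np


def full_sample_zscore(x: np.ndarray) -> np.ndarray:
    """Uses the mean and std of the whole series, including future observations."""
    return (x - x.mean()) / x.std(ddof=1)


def expanding_zscore(x: np.ndarray, min_periods: int = 20) -> np.ndarray:
    """z-score of x[t] using only x[0..t]; NaN until min_periods values exist."""
    z = np.full(len(x), np.nan)
    for t in range(min_periods - 1, len(x)):
        window = x[: t + 1]                 # data available at time t
        z[t] = (x[t] - window.mean()) / window.std(ddof=1)
    return z


rng = np.random.default_rng(11)
x = rng.standard_normal(500)

# Alternative history: identical up to t = 299, level shift afterwards.
x_alt = x.copy()
x_alt[300:] += 5.0

past = slice(0, 300)
expanding_unchanged = np.allclose(
    expanding_zscore(x)[past], expanding_zscore(x_alt)[past], equal_nan=True
)
full_sample_unchanged = np.allclose(
    full_sample_zscore(x)[past], full_sample_zscore(x_alt)[past]
)
print(f"expanding unchanged: {expanding_unchanged}, full-sample unchanged: {full_sample_unchanged}")
```

The same perturbation test works for any feature pipeline: if editing data after date $t$ changes a feature stamped at or before $t$, the pipeline has look-ahead.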

Signal Decay and Non-stationarity

ML models are trained on historical relationships. These change due to:

  • Crowding: as capital is deployed into a signal, the alpha is competed away.
  • Regime shifts: relationships valid in high-vol regimes may not hold in low-vol periods.
  • Market structure changes: algorithmic trading has eliminated many short-term microstructure signals visible in 1990s data.

Validate using post-publication decay: compare signal IC before and after the academic literature identifies the same feature. McLean and Pontiff (2016) find that anomaly returns are about 26% lower out-of-sample and about 58% lower post-publication.

Transaction Costs Are Unavoidable

Raw IC does not account for transaction costs. The information ratio after costs is:

$$\text{IR}_{\text{net}} \approx \text{IC} \cdot \sqrt{N} - \frac{2 \cdot c}{\sigma_{\text{cross}}} \cdot \sqrt{N \cdot \text{turnover}},$$

where $N$ is the breadth (number of independent bets), $c$ is the one-way cost per unit notional and $\sigma_{\text{cross}}$ is cross-sectional return dispersion. A signal with IC = 0.04 but 100% monthly turnover may have negative IR net of costs in a mid-cap universe.
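Factoring $\sqrt{N}$ out of the approximation gives $\text{IR}_{\text{net}} \approx \sqrt{N}\,(\text{IC} - 2c\sqrt{\text{turnover}}/\sigma_{\text{cross}})$, so the break-even one-way cost is $c^* = \text{IC}\cdot\sigma_{\text{cross}} / (2\sqrt{\text{turnover}})$ and does not depend on $N$. The sketch below evaluates this for the IC = 0.04, 100% turnover case from the text; the dispersion of 8%, the breadth of 500, and the two cost levels are hypothetical inputs chosen only to illustrate the arithmetic.

```python
import math


def ir_net(ic: float, n_bets: float, cost: float, sigma_cross: float, turnover: float) -> float:
    """IR net of costs, using the approximation stated in the text."""
    gross = ic * math.sqrt(n_bets)
    drag = (2.0 * cost / sigma_cross) * math.sqrt(n_bets * turnover)
    return gross - drag


def breakeven_cost(ic: float, sigma_cross: float, turnover: float) -> float:
    """One-way cost at which the approximation gives IR_net = 0."""
    return ic * sigma_cross / (2.0 * math.sqrt(turnover))


ic, turnover = 0.04, 1.0                 # from the text: IC 0.04, 100% turnover
sigma_cross, n_bets = 0.08, 500          # hypothetical dispersion and breadth

c_star = breakeven_cost(ic, sigma_cross, turnover)
ir_cheap = ir_net(ic, n_bets, cost=0.0010, sigma_cross=sigma_cross, turnover=turnover)
ir_costly = ir_net(ic, n_bets, cost=0.0025, sigma_cross=sigma_cross, turnover=turnover)

print(f"break-even one-way cost: {c_star * 1e4:.1f} bps")
print(f"IR_net at 10 bps: {ir_cheap:.3f}, at 25 bps: {ir_costly:.3f}")
```

Under these assumed inputs the signal survives a 10 bps one-way cost but not 25 bps, which is the sense in which the same IC can be viable in a liquid universe and unprofitable in a costlier one.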

Tree Models: Interpretability vs. Stability

Gradient boosted trees with many features can produce highly unstable feature importances across sub-samples. Impurity-based importance is biased toward high-cardinality numerical features. Prefer permutation importance or SHAP values (Shapley Additive Explanations) for interpreting which features drive predictions, especially when features are correlated.


Interview Angle

L1. What is the information coefficient? How would you compute it? Why use rank (Spearman) IC rather than Pearson IC for financial returns? What is look-ahead bias? Give two concrete examples of how it contaminates a backtest.

L2. Explain purged walk-forward cross-validation. Why is standard $k$-fold invalid for time series? What is the purge gap and when is it needed? Compare ridge and lasso regularisation: when would you prefer each? Given a ridge regression solution, what happens to $\hat{\beta}$ as $\lambda \to \infty$?

Ridge as $\lambda \to \infty$: $\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y \to 0$, since the penalty dominates. For finite $\lambda$, ridge generically produces non-zero coefficients for all features; it cannot perform feature selection. Lasso at sufficient $\lambda$ produces exact zeros via the soft-thresholding step of the coordinate descent solution.

L3. Derive the bias-variance decomposition. Explain how regularisation trades off bias and variance. How would you design a feature engineering pipeline for a cross-sectional equity signal that is robust to look-ahead bias, survivorship bias, and transaction costs? What is the IC information ratio (ICIR), and how does it differ from the Sharpe ratio of a strategy built on the signal?
