Setup
When ML Is Appropriate in Quant Finance
Machine learning is appropriate when the relationship between features and returns is:
- Non-linear or high-dimensional — beyond what OLS factor regression can model.
- Latent — the relevant features are not immediately obvious but can be extracted from raw data.
- Sufficiently stable — the relationship persists long enough to be exploited after accounting for transaction costs.
The word "signal" in quantitative finance refers to a predictive variable (a feature or combination of features) whose value at time $t$ predicts the cross-sectional or time-series return at time $t+h$ for horizon $h$. The quality of a signal is measured by its information coefficient (IC):
$$\mathrm{IC}_t = \operatorname{corr}\left(\hat{r}_{t+h},\ r_{t+h}\right)$$
where $\hat{r}_{t+h}$ is the model's predicted return and $r_{t+h}$ is the realised return. An IC of 0.05 is considered actionable at scale; 0.10 is strong.
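As a rough illustration of the IC, the sketch below computes both a Pearson and a rank (Spearman) correlation for a single cross-section. The data are synthetic, the helper name `pearson_and_rank_ic` is hypothetical, and the single extreme observation is injected on purpose to show how differently the two measures react.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
n_assets = 200

# Synthetic cross-section (assumption): predictions carry a weak signal.
predicted = rng.normal(size=n_assets)
realised = 0.05 * predicted + rng.normal(size=n_assets)


def pearson_and_rank_ic(pred: np.ndarray, real: np.ndarray) -> tuple[float, float]:
    """Return (Pearson IC, Spearman rank IC) for one cross-section."""
    pearson, _ = pearsonr(pred, real)
    rank, _ = spearmanr(pred, real)
    return float(pearson), float(rank)


pearson_clean, rank_clean = pearson_and_rank_ic(predicted, realised)

# Inject one extreme pair, e.g. a data error or a takeover jump.
predicted_out, realised_out = predicted.copy(), realised.copy()
predicted_out[0], realised_out[0] = 10.0, 10.0
pearson_out, rank_out = pearson_and_rank_ic(predicted_out, realised_out)

print(f"Pearson IC: {pearson_clean:.3f} -> {pearson_out:.3f}")
print(f"Rank IC:    {rank_clean:.3f} -> {rank_out:.3f}")
```

One outlier moves the Pearson IC far more than the rank IC, which is one reason rank IC is often the preferred diagnostic for fat-tailed returns.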
Conventions. All features are point-in-time — constructed from data available strictly before the forecast period to avoid look-ahead bias. Returns are cross-sectionally demeaned and winsorised (e.g., at ±3 standard deviations) before model fitting to reduce outlier impact. All features are standardised (zero mean, unit variance) within each cross-sectional rebalancing date.
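The point-in-time convention can be made concrete with a small alignment sketch. It assumes daily close prices in a (date x asset) DataFrame; the function names `trailing_return` and `forward_return` and the toy prices are illustrative assumptions, not part of any library.

```python
import numpy as np
import pandas as pd


def trailing_return(prices: pd.DataFrame, lookback: int) -> pd.DataFrame:
    """Feature at row t: uses prices at t - lookback and t only (known at the close of t)."""
    return prices / prices.shift(lookback) - 1.0


def forward_return(prices: pd.DataFrame, horizon: int) -> pd.DataFrame:
    """Label at row t: return earned over t+1 .. t+horizon."""
    return prices.shift(-horizon) / prices - 1.0


dates = pd.bdate_range("2024-01-01", periods=10)
prices = pd.DataFrame(
    {"A": np.arange(100.0, 110.0), "B": np.arange(50.0, 60.0)}, index=dates
)

feature = trailing_return(prices, lookback=2)
label = forward_return(prices, horizon=3)

# Rows without a full lookback or without a realised label are not usable.
usable = feature.notna().all(axis=1) & label.notna().all(axis=1)
train_dates = prices.index[usable]
```

The feature at row t looks backward and the label looks forward, so the two never share a price observation other than the close at t itself.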
Theory
1. The Feature Engineering Framework
Feature engineering converts raw financial data into predictive inputs. Categories:
| Feature Type | Examples | Rationale |
|---|---|---|
| Momentum | 1-month, 6-month, 12-month returns; return reversals | Trend persistence; short-term reversal |
| Value | Book-to-market, earnings yield, free cash flow yield | Mean-reversion to fundamentals |
| Quality | ROE, gross profitability, accruals | Persistent mispricing of quality |
| Low risk | Idiosyncratic vol, beta, max drawdown | Low-volatility anomaly |
| Technical | RSI, moving average crossovers, volume ratios | Short-term supply/demand signals |
| Alternative | Sentiment from news/filings, satellite data | Unstructured data signals |
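Winsorisation is critical: return distributions have fat tails. Winsorise each feature $x_{i,t}$ at time $t$ by:
$$\tilde{x}_{i,t} = \min\left(\max\left(x_{i,t},\ \mu_t - 3\sigma_t\right),\ \mu_t + 3\sigma_t\right)$$
where $\mu_t$ and $\sigma_t$ are the cross-sectional mean and standard deviation at time $t$, then cross-sectionally standardise:
$$z_{i,t} = \frac{\tilde{x}_{i,t} - \tilde{\mu}_t}{\tilde{\sigma}_t}$$
where $\tilde{\mu}_t$ and $\tilde{\sigma}_t$ are the same statistics computed on the winsorised values.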
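A small numerical sketch of the two steps on one cross-section, assuming a 3-sigma clip and the sample standard deviation (`ddof=1`). The data are synthetic and `winsorise_then_zscore` is a hypothetical helper.

```python
import numpy as np


def winsorise_then_zscore(x: np.ndarray, n_sigma: float = 3.0) -> np.ndarray:
    """Clip at mean +/- n_sigma * std, then z-score the clipped values."""
    mu, sigma = x.mean(), x.std(ddof=1)
    clipped = np.clip(x, mu - n_sigma * sigma, mu + n_sigma * sigma)
    return (clipped - clipped.mean()) / clipped.std(ddof=1)


rng = np.random.default_rng(0)
x = rng.normal(size=200)
x[0] = 25.0  # one extreme value in the cross-section

raw_z = (x - x.mean()) / x.std(ddof=1)  # z-scores without winsorisation
z = winsorise_then_zscore(x)

print(f"outlier z-score: raw {raw_z[0]:.2f}, winsorised {z[0]:.2f}")
```

The outlier inflates the standard deviation used for the clip, so its final z-score can still exceed 3 after one pass. The clip limits the damage but does not remove it.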
2. Regularised Linear Models
Standard OLS collapses with many features (multicollinearity, overfitting). Regularisation constrains the coefficient vector:
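Ridge regression ($L_2$ regularisation):
$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\ \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 = \left(X^\top X + \lambda I\right)^{-1} X^\top y$$
Ridge shrinks all coefficients toward zero, most strongly along low-variance directions of $X$ (the shrinkage is exactly proportional only when the features are orthonormal). For $\lambda > 0$ the closed-form inverse exists even when $X^\top X$ is singular. The optimal $\lambda$ is selected by purged cross-validation (see §4).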
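The closed form can be checked numerically. The sketch below uses synthetic data with a deliberately duplicated column, so that $X^\top X$ is singular, and solves the ridge system for increasing $\lambda$. The function name `ridge_closed_form` and all data are assumptions for illustration.

```python
import numpy as np


def ridge_closed_form(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Solve (X'X + lam * I) beta = X'y without forming the inverse explicitly."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 3]  # duplicate column: X'X is singular, OLS is not unique
y = X[:, 0] - 0.5 * X[:, 3] + rng.normal(scale=0.5, size=200)

lams = [1e-2, 1.0, 1e2, 1e4, 1e6]
norms = [float(np.linalg.norm(ridge_closed_form(X, y, lam))) for lam in lams]
beta_mid = ridge_closed_form(X, y, 1.0)

print("coefficient norm by lambda:", [round(v, 4) for v in norms])
print("beta at lambda = 1:", np.round(beta_mid, 3))
```

The coefficient norm falls monotonically toward zero as $\lambda$ grows, and the two identical columns receive equal coefficients instead of an arbitrary split.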
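Lasso ($L_1$ regularisation):
$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\ \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1$$
Lasso produces sparse solutions, with many $\hat{\beta}_j = 0$ exactly, and so performs implicit feature selection. There is no closed form; it is solved via coordinate descent or the LARS algorithm. Useful when only a few features are believed to be genuinely predictive.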
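Elastic Net combines both:
$$\hat{\beta}^{\text{EN}} = \arg\min_{\beta}\ \lVert y - X\beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2$$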
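To see the sparsity difference in practice, the following sketch fits the three penalised models from scikit-learn on synthetic data in which only three of twenty features carry signal. The penalty strengths are arbitrary assumptions. scikit-learn's `alpha` plays the role of $\lambda$ but is scaled differently (for `Lasso` and `ElasticNet` the squared loss is divided by $2n$), so values are not directly comparable with the formulas above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(42)
n_obs, n_features = 500, 20
X = rng.normal(size=(n_obs, n_features))

true_beta = np.zeros(n_features)
true_beta[:3] = [0.5, -0.3, 0.2]  # assumption: only three features matter
y = X @ true_beta + rng.normal(scale=1.0, size=n_obs)

models = {
    "ridge": Ridge(alpha=10.0),
    "lasso": Lasso(alpha=0.1),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

n_zero = {}
for name, model in models.items():
    model.fit(X, y)
    # Count coefficients that are exactly zero after fitting.
    n_zero[name] = int(np.sum(model.coef_ == 0.0))

print(n_zero)
```

Ridge keeps every coefficient non-zero, while the $L_1$ term in Lasso and Elastic Net sets many of the noise coefficients to exactly zero.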
3. Gradient Boosted Trees
For non-linear feature interactions, gradient boosted decision trees (GBM/XGBoost/LightGBM) are the dominant approach in quant equity signal research.
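Gradient boosting algorithm. Fit an additive model $F_M(x) = F_0(x) + \eta \sum_{m=1}^{M} f_m(x)$, where $f_m$ is a regression tree fit to the pseudo-residuals $r_i^{(m)} = -\left[\partial L(y_i, F(x_i)) / \partial F(x_i)\right]_{F = F_{m-1}}$. For squared loss $L(y, F) = \tfrac{1}{2}(y - F)^2$, pseudo-residuals equal ordinary residuals $y_i - F_{m-1}(x_i)$. The learning rate $\eta$ controls step size; a smaller $\eta$ combined with more trees reduces overfitting.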
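The boosting loop for squared loss is short enough to write out by hand. This is a minimal sketch built on scikit-learn's `DecisionTreeRegressor`, not how XGBoost or LightGBM are implemented; the function names and synthetic data are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_trees=100, learning_rate=0.05, max_depth=2):
    """Gradient boosting for squared loss: each tree is fit to the current residuals."""
    f0 = float(np.mean(y))            # initial constant prediction F_0
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred          # pseudo-residuals for L = 0.5 * (y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees


def predict_boosted(f0, trees, X, learning_rate=0.05):
    """Sum the shrunken tree outputs on top of the initial constant."""
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)


rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.1, size=400)

f0, trees = fit_boosted_trees(X, y)
mse_start = float(np.mean((y - f0) ** 2))
mse_end = float(np.mean((y - predict_boosted(f0, trees, X)) ** 2))
print(f"training MSE: {mse_start:.4f} -> {mse_end:.4f}")
```

The same `learning_rate` must be used for fitting and prediction. The falling training error says nothing about out-of-sample performance, which is why the depth and learning-rate limits matter.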
Key regularisation parameters:
- max_depth: maximum depth of each tree (2–4 recommended for financial data)
- n_estimators: number of trees
- learning_rate: typically 0.01–0.05
- min_samples_leaf: minimum samples per leaf node (controls overfit)
- subsample: fraction of samples used per tree (stochastic gradient boosting)
Feature importance is computed as total reduction in the objective function attributable to each feature across all splits (impurity-based) or via permutation importance (model-agnostic, more reliable for correlated features).
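A hedged sketch of both importance measures using scikit-learn's `GradientBoostingRegressor` and `permutation_importance`. The synthetic target, the train/test split and the hyperparameter values are assumptions chosen to fall inside the ranges listed above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
# Synthetic target (assumption): a linear effect in feature 0 plus an
# interaction between features 0 and 1; features 2-4 are pure noise.
y = 0.5 * X[:, 0] + X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=2000)

X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

model = GradientBoostingRegressor(
    max_depth=3,
    n_estimators=300,
    learning_rate=0.03,
    min_samples_leaf=50,
    subsample=0.7,
    random_state=0,
)
model.fit(X_train, y_train)

# Impurity-based importance, computed on the training data.
impurity_importance = model.feature_importances_

# Permutation importance, computed on held-out data.
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
perm_importance = perm.importances_mean

print("impurity:   ", np.round(impurity_importance, 3))
print("permutation:", np.round(perm_importance, 3))
```

Computing permutation importance on held-out data measures what the model actually uses to predict, not what it used to fit.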
4. Purged Walk-Forward Cross-Validation
Standard $k$-fold cross-validation is invalid for time series because:
- Data leakage through time: some test folds precede the training data, so information from the future is used to build the model.
- Overlapping returns: if the label is a 5-day forward return, consecutive observations share 4 days of return data. Validation scores overstate out-of-sample predictability.
Purged walk-forward CV (López de Prado 2018):
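For each fold $k$ with test window $[t_k^{\text{start}},\ t_k^{\text{end}})$ and label horizon $h$:
- Train on observations $t < t_k^{\text{start}}$.
- Purge the most recent training observations that overlap with the test period in their label window: remove observations $t \in [t_k^{\text{start}} - h,\ t_k^{\text{start}})$.
- Optionally add an embargo of $e$ periods after the test set to prevent leakage via microstructure effects.
- Test on $t \in [t_k^{\text{start}},\ t_k^{\text{end}})$.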
The effect: each training/test split mimics real-world live trading — the model was built on past data and evaluated on future data with no information leakage.
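The index arithmetic behind the purge can be sketched as a standalone generator. The function name `purged_walk_forward` and the sizes are illustrative assumptions. The sketch uses an expanding training window and omits the embargo, which only matters in schemes where training data can follow the test window.

```python
import numpy as np


def purged_walk_forward(n_periods: int, n_splits: int, test_size: int, horizon: int):
    """Yield (train_idx, test_idx) with the last `horizon` training periods purged."""
    first_test = n_periods - n_splits * test_size
    if first_test <= horizon:
        raise ValueError("not enough history for the requested splits")
    for k in range(n_splits):
        test_start = first_test + k * test_size
        test_end = test_start + test_size
        # A label at period t spans t+1 .. t+horizon, so the last usable
        # training period is test_start - horizon - 1.
        train_idx = np.arange(0, test_start - horizon)
        test_idx = np.arange(test_start, test_end)
        yield train_idx, test_idx


splits = list(purged_walk_forward(n_periods=300, n_splits=3, test_size=50, horizon=5))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```

With a 5-period horizon, the last training label ends one period before the test window begins, so no training label contains a test-period return.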
Implementation
"""
ML signal generation with purged walk-forward cross-validation.
Assumptions:
- Features are already point-in-time (no look-ahead bias)
- Returns are forward returns (h-period, starting at t+1)
- All inputs are pandas DataFrames with DatetimeIndex
- Cross-sectional standardisation applied before calling fit()
"""
from __future__ import annotations
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator
from typing import Sequence

def winsorise(df: pd.DataFrame, n_sigma: float = 3.0) -> pd.DataFrame:
    """Cross-sectional winsorisation at n_sigma standard deviations."""
    mu = df.mean(axis=1)
    sigma = df.std(axis=1)
    lower = mu - n_sigma * sigma
    upper = mu + n_sigma * sigma
    # Per-date bounds are aligned with the row index (axis=0).
    return df.clip(lower=lower, upper=upper, axis=0)


def cross_section_standardise(df: pd.DataFrame) -> pd.DataFrame:
    """Z-score each row (cross-section) independently."""
    mu = df.mean(axis=1)
    # A zero standard deviation becomes NaN so the division yields NaN, not inf.
    sigma = df.std(axis=1).replace(0, np.nan)
    return df.sub(mu, axis=0).div(sigma, axis=0)


def information_coefficient(
    predicted: pd.Series,
    realised: pd.Series,
) -> float:
    """
    Spearman rank IC between predicted and realised cross-sectional returns.
    Rank IC is preferred over Pearson for robustness to outliers.
    """
    from scipy.stats import spearmanr

    # Use only assets present (and non-missing) in both series.
    valid = predicted.dropna().index.intersection(realised.dropna().index)
    if len(valid) < 10:
        return np.nan
    rho, _ = spearmanr(predicted.loc[valid], realised.loc[valid])
    return float(rho)


class PurgedWalkForwardCV:
    """
    Purged walk-forward cross-validation for panel return prediction.

    Parameters
    ----------
    n_splits: number of forward-rolling test windows
    test_size: number of time periods per test fold
    purge_gap: number of periods to purge from training tail (label overlap)
    embargo: additional periods to drop after test window (microstructure);
        stored but not applied by split(), because an expanding training
        window never contains observations that follow the test window
    """

    def __init__(
        self,
        n_splits: int = 5,
        test_size: int = 63,   # ~3 months of daily data
        purge_gap: int = 5,    # 5-day forward return → purge last 5 training obs
        embargo: int = 5,
    ) -> None:
        self.n_splits = n_splits
        self.test_size = test_size
        self.purge_gap = purge_gap
        self.embargo = embargo

    def split(self, dates: pd.DatetimeIndex) -> list[tuple[np.ndarray, np.ndarray]]:
        """
        Generate (train_idx, test_idx) pairs.
        dates: sorted DatetimeIndex of all time periods.
        """
        T = len(dates)
        # Everything before the first test window is the initial training set.
        min_train = T - self.n_splits * self.test_size
        if min_train <= self.purge_gap:
            raise ValueError("Insufficient data for requested splits.")
        splits = []
        for k in range(self.n_splits):
            test_start = min_train + k * self.test_size
            test_end = min(test_start + self.test_size, T)
            train_end = test_start - self.purge_gap  # purge overlap
            train_idx = np.arange(0, max(train_end, 0))
            test_idx = np.arange(test_start, test_end)
            if len(train_idx) > 0 and len(test_idx) > 0:
                splits.append((train_idx, test_idx))
        return splits


def evaluate_signal(
    features: pd.DataFrame,         # (date x asset) panel holding a single feature
    forward_returns: pd.DataFrame,  # same shape as features
    model: BaseEstimator,
    cv: PurgedWalkForwardCV,
) -> pd.DataFrame:
    """
    Evaluate a sklearn-compatible model using purged walk-forward CV.
    Both features and forward_returns must have DatetimeIndex rows and asset columns.
    Returns a DataFrame indexed by date with columns: ic, mean_ic, ic_ir (mean(IC) / std(IC)).
    """
    dates = features.index
    splits = cv.split(dates)
    results = []
    for train_idx, test_idx in splits:
        train_dates = dates[train_idx]
        test_dates = dates[test_idx]
        # Stack the panel into a Series indexed by (date, asset) and align labels to it.
        X_train = features.loc[train_dates].stack().dropna()
        y_train = forward_returns.loc[train_dates].stack().reindex(X_train.index).dropna()
        X_train = X_train.reindex(y_train.index)
        X_test = features.loc[test_dates].stack().dropna()
        if len(X_train) < 50:
            continue
        # Train (the scaler is fit on training data only)
        scaler = StandardScaler()
        X_tr_scaled = scaler.fit_transform(X_train.values.reshape(-1, 1) if X_train.ndim == 1
                                           else X_train.values)
        model.fit(X_tr_scaled, y_train.values)
        # Predict and compute IC per date
        X_te_scaled = scaler.transform(X_test.values.reshape(-1, 1) if X_test.ndim == 1
                                       else X_test.values)
        preds = pd.Series(model.predict(X_te_scaled), index=X_test.index)
        for dt in test_dates:
            if dt not in preds.index.get_level_values(0):
                continue
            pred_t = preds.xs(dt, level=0) if preds.index.nlevels > 1 else preds.loc[dt]
            real_t = forward_returns.loc[dt].dropna()
            ic = information_coefficient(pred_t, real_t)
            results.append({"date": dt, "ic": ic})
    if not results:
        # No fold produced a usable IC (e.g. too little training data).
        return pd.DataFrame(columns=["ic", "mean_ic", "ic_ir"])
    df = pd.DataFrame(results).set_index("date")
    df["mean_ic"] = df["ic"].mean()
    df["ic_ir"] = (df["ic"].mean() / df["ic"].std(ddof=1)) if len(df) > 1 else np.nan
    return df
Validation
IC benchmarks. Rule of thumb from practitioners (Grinold & Kahn 2000):
- IC ≈ 0.02–0.05: weak but potentially usable signal
- IC ≈ 0.05–0.10: moderate signal; exploitable with low costs
- IC > 0.10: strong signal (rare; often fragile or concentrated)
Residual diagnostics. After fitting, check:
- ACF of residuals: should be near zero if serial dependence is captured.
- Feature importance stability: feature importances should be stable across CV folds; large instability suggests overfitting.
- IC decay curve: IC at lag $\ell$ should decay monotonically as $\ell$ increases; an isolated spike at a longer lag suggests a data processing error.
Overfitting detection. Compare in-sample IC (training set) to out-of-sample IC (test set) per fold. A ratio $\mathrm{IC}_{\text{IS}} / \mathrm{IC}_{\text{OOS}}$ well above 1 indicates significant overfitting. For tree models, reduce max_depth or increase min_samples_leaf.
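The in-sample versus out-of-sample comparison can be sketched with two decision trees of different capacity. The data are synthetic with a weak signal, a single chronological split stands in for one CV fold, and the helper `rank_ic` is a hypothetical name.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
n = 4000
X = rng.normal(size=(2 * n, 3))
y = 0.1 * X[:, 0] + rng.normal(size=2 * n)  # weak true signal in feature 0 only
X_train, X_test, y_train, y_test = X[:n], X[n:], y[:n], y[n:]


def rank_ic(model, features: np.ndarray, target: np.ndarray) -> float:
    """Spearman correlation between model predictions and realised values."""
    rho, _ = spearmanr(model.predict(features), target)
    return float(rho)


configs = {
    "deep": {"max_depth": None, "min_samples_leaf": 1},
    "shallow": {"max_depth": 2, "min_samples_leaf": 200},
}
report = {}
for name, params in configs.items():
    tree = DecisionTreeRegressor(random_state=0, **params).fit(X_train, y_train)
    ic_is = rank_ic(tree, X_train, y_train)
    ic_oos = rank_ic(tree, X_test, y_test)
    report[name] = {"ic_in_sample": ic_is, "ic_out_of_sample": ic_oos, "gap": ic_is - ic_oos}

for name, row in report.items():
    print(name, {key: round(value, 3) for key, value in row.items()})
```

The unconstrained tree memorises the training labels, so its in-sample IC says nothing about its out-of-sample IC. The ratio becomes unstable when the out-of-sample IC is close to zero, so the sketch reports the gap alongside both values.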
Limitations
Look-Ahead Bias: The Primary Risk
Look-ahead bias invalidates any backtest. Common sources:
- Point-in-time data: accounting ratios must use data as-reported (e.g., COMPUSTAT restates figures retroactively). Using restated fundamentals creates future knowledge.
- Survivorship bias: if the universe excludes delisted stocks, the sample is biased upward. A proper backtest includes all stocks that existed at each point in time.
- Feature construction: any feature computed using information after the label period is contaminated. Rolling averages computed over the full sample (rather than expanding windows) are a common error.
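The feature-construction error is easy to reproduce. The sketch below standardises one synthetic series twice, once with full-sample statistics and once with expanding-window statistics, then perturbs only the later part of the sample. The series, the function names and the 60-period minimum window are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def zscore_full_sample(series: pd.Series) -> pd.Series:
    """Contaminated: mean and std use the whole sample, including dates after t."""
    return (series - series.mean()) / series.std()


def zscore_expanding(series: pd.Series, min_periods: int = 60) -> pd.Series:
    """Point-in-time: statistics at t use observations up to and including t only."""
    window = series.expanding(min_periods=min_periods)
    return (series - window.mean()) / window.std()


rng = np.random.default_rng(7)
dates = pd.bdate_range("2020-01-01", periods=500)
# Hypothetical raw series with an upward drift, e.g. a valuation ratio.
raw = pd.Series(np.linspace(0.0, 5.0, 500) + rng.normal(scale=0.5, size=500), index=dates)

# Change only the last 100 observations and see which early values move.
shocked = raw.copy()
shocked.iloc[400:] += 10.0

full_changed = not np.allclose(
    zscore_full_sample(raw).iloc[:400], zscore_full_sample(shocked).iloc[:400]
)
expanding_changed = not np.allclose(
    zscore_expanding(raw).iloc[:400], zscore_expanding(shocked).iloc[:400], equal_nan=True
)
print("early full-sample z-scores changed:", full_changed)
print("early expanding z-scores changed:  ", expanding_changed)
```

If changing future data alters a feature value in the past, that feature is not point-in-time.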
Signal Decay and Non-stationarity
ML models are trained on historical relationships. These change due to:
- Crowding: as capital is deployed into a signal, the alpha is competed away.
- Regime shifts: relationships valid in high-vol regimes may not hold in low-vol periods.
- Market structure changes: algorithmic trading has eliminated many short-term microstructure signals visible in 1990s data.
Validate using post-publication decay: compare signal IC before and after the academic literature identifies the same feature. McLean and Pontiff (2016) report that anomaly returns are on average about 26% lower out-of-sample and about 58% lower post-publication.
Transaction Costs Are Unavoidable
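Raw IC does not account for transaction costs. To first order, the information ratio after costs is:
$$\mathrm{IR}_{\text{net}} \approx \left(\mathrm{IC} - \frac{c \cdot \mathrm{TO}}{\sigma}\right)\sqrt{N}$$
where $c$ is the one-way cost per unit notional, $\sigma$ is cross-sectional return dispersion, $\mathrm{TO}$ is two-way turnover per rebalancing period, and $N$ is the breadth (number of independent bets). A signal with IC = 0.04 but 100% monthly turnover may have negative IR net of costs in a mid-cap universe.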
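A back-of-envelope check of the 100% turnover example is sketched below. It rests on a first-order approximation, not an exact result: gross alpha per unit score is taken as IC times dispersion, and costs are charged on two-way turnover. The 20 bp one-way cost, the 8% monthly dispersion and the reading of "100% turnover" as one-way (200% two-way) are all assumptions.

```python
def cost_adjusted_ic(
    ic: float,
    one_way_cost: float,
    two_way_turnover: float,
    dispersion: float,
) -> float:
    """First-order approximation: IC minus the cost drag expressed in IC units."""
    return ic - one_way_cost * two_way_turnover / dispersion


ic = 0.04                # raw information coefficient (from the text)
one_way_cost = 0.002     # assumption: 20 bp per unit notional traded
two_way_turnover = 2.0   # assumption: 100% one-way turnover per month, buys plus sells
dispersion = 0.08        # assumption: 8% monthly cross-sectional return dispersion

net_ic = cost_adjusted_ic(ic, one_way_cost, two_way_turnover, dispersion)

# One-way cost at which the cost drag exactly offsets the raw IC.
breakeven_cost = ic * dispersion / two_way_turnover

print(f"cost-adjusted IC: {net_ic:.4f}")
print(f"break-even one-way cost: {breakeven_cost * 1e4:.0f} bp")
```

Under these assumptions the break-even one-way cost is 16 bp, so a 20 bp cost turns the signal negative. Halving turnover would double the cost the signal can absorb.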
Tree Models: Interpretability vs. Stability
Gradient boosted trees with many features can produce highly unstable feature importances across sub-samples. Impurity-based importance is biased toward high-cardinality numerical features. Prefer permutation importance or SHAP values (Shapley Additive Explanations) for interpreting which features drive predictions, especially when features are correlated.
Interview Angle
L1. What is the information coefficient? How would you compute it? Why use rank (Spearman) IC rather than Pearson IC for financial returns? What is look-ahead bias? Give two concrete examples of how it contaminates a backtest.
L2. Explain purged walk-forward cross-validation. Why is standard $k$-fold invalid for time series? What is the purge gap and when is it needed? Compare ridge and lasso regularisation: when would you prefer each? Given a ridge regression solution, what happens to $\hat{\beta}^{\text{ridge}}$ as $\lambda \to \infty$?
Ridge at $\lambda \to \infty$: as $\lambda \to \infty$, $\hat{\beta}^{\text{ridge}} \to 0$ since the penalty dominates. For any finite $\lambda$, ridge generically leaves all coefficients non-zero; it cannot perform feature selection. Lasso at sufficiently large $\lambda$ produces exact zeros via the soft-thresholding step of the coordinate descent solution.
L3. Derive the bias-variance decomposition. Explain how regularisation trades off bias and variance. How would you design a feature engineering pipeline for a cross-sectional equity signal that is robust to look-ahead bias, survivorship bias, and transaction costs? What is the IC information ratio (ICIR), and how does it differ from the Sharpe ratio of a strategy built on the signal?