Brownian Bridge™

Setup

When ML Is Appropriate in Quant Finance

Machine learning is appropriate when the relationship between features and returns is:

Non-linear or high-dimensional — beyond what OLS factor regression can model.
Latent — the relevant features are not immediately obvious but can be extracted from raw data.
Sufficiently stable — the relationship persists long enough to be exploited after accounting for transaction costs.

In quantitative finance, a "signal" is a predictive variable (a feature or combination of features) whose value at time $t$ predicts the cross-sectional or time-series return at time $t+h$ for horizon $h$. The quality of a signal is measured by its information coefficient (IC):

$\text{IC}_t = \text{corr}(\hat{r}_{t+h}, r_{t+h}),$

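where $\hat{r}_{t+h}$ is the model's predicted return and $r_{t+h}$ is the realised return; for a cross-sectional signal the correlation is taken across assets at date $t$. As a rule of thumb, an IC of about 0.05 is considered actionable at scale, and 0.10 is strong.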
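
The IC definition above can be sketched for a single rebalancing date as follows. This is a minimal illustration, not a reference implementation: the function name `information_coefficient`, the toy numbers, and the optional rank (Spearman-style) variant are assumptions added here for the example.

```python
import numpy as np

def information_coefficient(pred, realised, rank=False):
    """Correlation between predicted and realised returns for one date."""
    pred = np.asarray(pred, dtype=float)
    realised = np.asarray(realised, dtype=float)
    if rank:
        # Rank variant: correlate ranks instead of raw values.
        # Ties are not averaged in this sketch.
        pred = pred.argsort().argsort().astype(float)
        realised = realised.argsort().argsort().astype(float)
    return float(np.corrcoef(pred, realised)[0, 1])

# Hypothetical predictions and realised returns for five assets on one date.
pred = [0.02, -0.01, 0.00, 0.03, -0.02]
realised = [0.01, -0.02, 0.01, 0.02, -0.01]
ic = information_coefficient(pred, realised)
rank_ic = information_coefficient(pred, realised, rank=True)
```

In practice the per-date values $\text{IC}_t$ are usually averaged over many dates, since any single date is noisy.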

Conventions. All features are point-in-time — constructed from data available strictly before the forecast period to avoid look-ahead bias. Returns are cross-sectionally demeaned and winsorised (e.g., at ±3 standard deviations) before model fitting to reduce outlier impact. All features are standardised (zero mean, unit variance) within each cross-sectional rebalancing date.

Theory

1. The Feature Engineering Framework

Feature engineering converts raw financial data into predictive inputs. Categories:

Feature Type	Examples	Rationale
Momentum	1-month, 6-month, 12-month returns; return reversals	Trend persistence; short-term reversal
Value	Book-to-market, earnings yield, free cash flow yield	Mean-reversion to fundamentals
Quality	ROE, gross profitability, accruals	Persistent mispricing of quality
Low risk	Idiosyncratic vol, beta, max drawdown	Low-volatility anomaly
Technical	RSI, moving average crossovers, volume ratios	Short-term supply/demand signals
Alternative	Sentiment from news/filings, satellite data	Unstructured data signals

Winsorisation is critical because return distributions have fat tails. Winsorise each feature $f$ at time $t$ by $\tilde{f}_{i,t} = \text{clamp}\!\left(f_{i,t},\; \hat{\mu}_t - 3\hat{\sigma}_t,\; \hat{\mu}_t + 3\hat{\sigma}_t\right)$, where $\hat{\mu}_t$ and $\hat{\sigma}_t$ are the cross-sectional mean and standard deviation of $f$ at time $t$, then cross-sectionally standardise: $x_{i,t} = (\tilde{f}_{i,t} - \bar{\tilde{f}}_t) / \text{std}(\tilde{f}_t)$.
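
The two steps above (clamp, then standardise) can be sketched for one feature on one date as follows. The function name `winsorise_and_standardise`, the use of the population standard deviation, and the toy cross-section are assumptions made for this example.

```python
import numpy as np

def winsorise_and_standardise(f, n_sigma=3.0):
    """Clamp one cross-section at mean +/- n_sigma std, then z-score it."""
    f = np.asarray(f, dtype=float)
    mu, sigma = f.mean(), f.std()          # cross-sectional moments (ddof=0)
    clipped = np.clip(f, mu - n_sigma * sigma, mu + n_sigma * sigma)
    # Standardise using the moments of the winsorised values.
    return (clipped - clipped.mean()) / clipped.std()

# Hypothetical cross-section: 49 well-behaved values plus one extreme outlier.
raw = np.concatenate([np.linspace(-1.0, 1.0, 49), [50.0]])
x = winsorise_and_standardise(raw)
```

Note that the outlier inflates $\hat{\sigma}_t$ itself, so a mean/standard-deviation clamp is less aggressive than it looks; percentile-based bounds are a common alternative.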

2. Regularised Linear Models

Standard OLS becomes unreliable with many features: multicollinearity makes the coefficient estimates unstable and the model overfits. Regularisation constrains the coefficient vector:

Ridge regression ($\ell_2$ regularisation): $\hat{\beta}^{\text{ridge}} = \arg\min_\beta \left\{\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2\right\} = (X^\top X + \lambda I)^{-1} X^\top y.$

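Ridge shrinks all coefficients toward zero; the shrinkage is exactly proportional only when the features are orthonormal, and in general it is strongest along the low-variance principal directions of $X$. For $\lambda > 0$ the matrix $X^\top X + \lambda I$ is positive definite, so the closed-form inverse exists even when $X^\top X$ is singular. The optimal $\lambda$ is selected by purged cross-validation (see §4).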
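
A short sketch of the closed form, using a deliberately singular design (two identical columns) to show that the ridge system is still solvable. The helper name `ridge_fit` and the toy data are assumptions; the sketch fits no intercept, which presumes standardised features and demeaned returns as in the conventions above.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X'X + lam * I) beta = X'y without forming an explicit inverse."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Two perfectly collinear features: X'X has rank 1, so OLS is not unique.
x = np.arange(1.0, 6.0)
X = np.column_stack([x, x])
y = 2.0 * x

rank_xtx = np.linalg.matrix_rank(X.T @ X)   # 1, i.e. singular
beta = ridge_fit(X, y, lam=1.0)             # approx [0.991, 0.991]
```

Ridge splits the total effect equally between the two identical columns, which is the typical behaviour for groups of highly correlated features.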

Lasso ($\ell_1$ regularisation): $\hat{\beta}^{\text{lasso}} = \arg\min_\beta \left\{\|y - X\beta\|_2^2 + \lambda \|\beta\|_1\right\}.$

Lasso produces sparse solutions, with many $\hat{\beta}_j = 0$ exactly, and so performs implicit feature selection. It has no closed form and is solved via coordinate descent or the LARS algorithm. It is useful when only a few features are believed to be genuinely predictive.
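
To make the coordinate descent idea concrete, here is a minimal sketch for the objective $\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ exactly as written above. Minimising over one coordinate with the others held fixed gives a soft-thresholding update with threshold $\lambda/2$. The function names, the fixed number of sweeps (no convergence check), and the simulated data are assumptions for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly zero when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y - X @ beta
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]      # partial residual excluding feature j
            rho = X[:, j] @ resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * beta[j]      # restore the full residual
    return beta

# Simulated data: only features 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
true_beta = np.array([1.5, 0.0, 0.0, -2.0, 0.0])
y = X @ true_beta + 0.1 * rng.standard_normal(200)
beta_hat = lasso_cd(X, y, lam=20.0)
```

The surviving coefficients are biased toward zero by roughly $\lambda / (2\,x_j^\top x_j)$, which is the price paid for the exact zeros.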

Elastic Net combines both: $\hat{\beta}^{\text{EN}} = \arg\min_\beta \left\{\|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2\right\}.$

3. Gradient Boosted Trees

For non-linear feature interactions, gradient boosted decision trees (GBM/XGBoost/LightGBM) are the dominant approach in quant equity signal research.

Gradient boosting algorithm. Fit an additive model $F_m(x) = F_{m-1}(x) + \eta f_m(x)$, where $f_m$ is a regression tree fit to the pseudo-residuals $-[\partial L(y_i, F(x_i))/\partial F(x_i)]_{F=F_{m-1}}$. For squared loss $L(y, \hat{y}) = (y - \hat{y})^2/2$, the pseudo-residuals equal the ordinary residuals $y_i - F_{m-1}(x_i)$. The learning rate $\eta \in (0, 1]$ controls the step size; a smaller $\eta$ combined with more trees generally reduces overfitting.
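
The update rule can be illustrated with a from-scratch sketch for squared loss, using depth-1 trees (stumps) on a single feature. This is a teaching sketch under stated assumptions, not how XGBoost or LightGBM are implemented: the helper names `fit_stump` and `boost`, the exhaustive threshold search, and the toy target $y = x^2$ are all chosen here for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Depth-1 regression tree: pick the threshold minimising squared error."""
    best = None
    for thr in np.unique(x)[:-1]:           # exclude max so the right leaf is non-empty
        left = x <= thr
        l_mean, r_mean = residual[left].mean(), residual[~left].mean()
        sse = ((residual[left] - l_mean) ** 2).sum() \
            + ((residual[~left] - r_mean) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, l_mean, r_mean)
    _, thr, l_mean, r_mean = best
    return lambda z: np.where(z <= thr, l_mean, r_mean)

def boost(x, y, n_trees=300, eta=0.1):
    """F_m = F_{m-1} + eta * f_m, with f_m fit to the current residuals."""
    f0 = y.mean()                            # F_0: constant model
    pred = np.full_like(y, f0)
    trees = []
    for _ in range(n_trees):
        residual = y - pred                  # negative gradient of (y - F)^2 / 2
        tree = fit_stump(x, residual)
        pred = pred + eta * tree(x)
        trees.append(tree)
    return lambda z: f0 + eta * sum(t(z) for t in trees)

x = np.linspace(-2.0, 2.0, 80)
y = x ** 2                                   # non-linear target a linear model misses
model = boost(x, y)
mse_const = np.mean((y - y.mean()) ** 2)
mse_boost = np.mean((y - model(x)) ** 2)
```

The errors here are in-sample; on noisy financial data the out-of-sample error is what matters, which is why the regularisation parameters below are kept conservative.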

Key regularisation parameters:

max_depth: maximum depth of each tree (2–4 recommended for financial data)
n_estimators: number of trees
learning_rate $\eta$: typically 0.01–0.05
min_samples_leaf: minimum samples per leaf node (controls overfit)
subsample: fraction of samples used per tree (stochastic gradient boosting)

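Feature importance is computed either as the total reduction in the objective function attributable to each feature across all splits (impurity-based) or via permutation importance. Permutation importance is model-agnostic and can be evaluated on held-out data, which generally makes it more reliable than impurity-based scores; these are computed on training data and favour features with many candidate split points. With strongly correlated features, either measure can spread importance across the correlated group, so interpret the scores with care.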
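
Permutation importance can be sketched in a few lines: shuffle one feature column, re-score the model, and record how much the error rises. The function name `permutation_importance_mse`, the least-squares stand-in model, and the simulated data are assumptions for this example, and for brevity the score is computed in-sample rather than on held-out data.

```python
import numpy as np

def permutation_importance_mse(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    base_mse = np.mean((y - predict(X)) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break the link between feature j and y
            increases.append(np.mean((y - predict(Xp)) ** 2) - base_mse)
        importances[j] = np.mean(increases)
    return importances

# Simulated data: feature 0 matters most, feature 2 a little, feature 1 not at all.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + 0.1 * rng.standard_normal(500)

# Any fitted model works; a least-squares fit stands in for it here.
beta = np.linalg.lstsq(X, y, rcond=None)[0]
imp = permutation_importance_mse(lambda A: A @ beta, X, y)
```

Because the function only needs a `predict` callable, the same code applies unchanged to a boosted-tree model.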

4. Purged Walk-Forward Cross-Validation

Standard $k$-fold cross-validation is invalid for time series because:

Data leakage through time: some test folds precede the data used for training, so information from the future is used to build the model.
Overlapping returns: if the label is a 5-day forward return, consecutive observations share 4 days of return data. Validation scores overstate out-of-sample predictability.
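
To see the size of the overlap problem, assume (for illustration only) i.i.d. daily returns $r_s$ with variance $\sigma^2$ and define the $h$-day forward return $R_t = \sum_{j=1}^{h} r_{t+j}$. Labels at dates $t$ and $t+k$ share $h-k$ daily returns for $0 \leq k < h$, so

$\text{corr}(R_t, R_{t+k}) = \frac{(h-k)\,\sigma^2}{h\,\sigma^2} = 1 - \frac{k}{h}.$

With $h = 5$, adjacent labels have correlation $0.8$ even though the daily returns are independent, so neighbouring train and test observations are far from independent samples.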

Purged walk-forward CV (López de Prado 2018):

For each fold $k$:

Train on observations $t \leq t_k^{\text{train}}$.
Purge the $h$ most recent training observations, whose label windows overlap the test period: remove observations $[t_k^{\text{train}} - h + 1, t_k^{\text{train}}]$.
Optionally add an embargo of $e$ periods after the test set to prevent leakage via microstructure effects.
Test on $[t_k^{\text{test, start}}, t_k^{\text{test, end}}]$ .
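
The fold construction above can be sketched as an index generator. Several details are assumptions made for this example rather than part of the procedure as stated: an expanding training window, equal-sized consecutive test blocks, the test block starting immediately after $t_k^{\text{train}}$, and no embargo. The function name `purged_walk_forward_splits` and its parameters are hypothetical.

```python
def purged_walk_forward_splits(n_obs, n_folds, horizon, min_train):
    """Yield (train_idx, test_idx) lists of integer positions for each fold."""
    test_size = (n_obs - min_train) // n_folds
    for k in range(n_folds):
        test_start = min_train + k * test_size
        test_end = test_start + test_size       # exclusive upper bound
        train_end = test_start - 1              # t_k^train: last candidate training index
        purged_end = train_end - horizon        # drop [train_end - h + 1, train_end]
        train_idx = list(range(0, purged_end + 1))
        test_idx = list(range(test_start, test_end))
        yield train_idx, test_idx

# 100 observations, 4 folds, 5-day labels, at least 60 observations before the first test block.
splits = list(purged_walk_forward_splits(n_obs=100, n_folds=4, horizon=5, min_train=60))
first_train, first_test = splits[0]             # train 0..54, test 60..69
```

In the first fold, observations 55 to 59 are purged because their 5-day label windows extend into the test block that starts at observation 60.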

ML in Signal Generation