Brownian Bridge™

Setup

The Gauss-Newton method solves the normal equations $(J^\top W^2 J)\,\delta\theta = -J^\top W^2 r$ at each iteration. It converges quadratically near a zero-residual solution but fails when:

$J^\top J$ is singular or near-singular (rank deficiency from parameter redundancy or flat objective directions),
The initial guess is far from the solution (large residuals make the Gauss-Newton step overshoot),
The objective is highly non-convex (the Gauss-Newton Hessian approximation is poor far from the solution).

The Levenberg-Marquardt (LM) algorithm (Levenberg 1944; Marquardt 1963) addresses all three failures by blending Gauss-Newton with steepest descent via a single damping parameter $\lambda$. It is the standard algorithm for non-linear least squares in quantitative finance: it is available in scipy.optimize.least_squares (via method='lm', which wraps MINPACK), in MATLAB's lsqnonlin (as the 'levenberg-marquardt' algorithm option), and in most commercial calibration engines.

INSIGHT

Financial Insight. A Heston calibration from a cold start (no prior-day parameters) typically converges in 20–50 LM iterations. An intraday recalibration starting from the previous calibration converges in 3–10 iterations. The difference is the starting point relative to the basin of attraction. Understanding LM convergence diagnostics — gain ratio, gradient norm, step size — is essential for diagnosing when a calibration has failed silently (converged to a bad local minimum) versus when it is genuinely well-calibrated.

Assumptions:

The residual function $r: \Theta \to \mathbb{R}^N$ is continuously differentiable on $\Theta \subseteq \mathbb{R}^p$ . The Jacobian $J(\theta) \in \mathbb{R}^{N\times p}$ exists and is computable (analytically or numerically).
We work with weighted residuals but set $W = I$ (unit weights) for notational clarity. The extension to non-unit $W$ is immediate: replace $J$ with $WJ$ and $r$ with $Wr$ throughout.
The problem is overdetermined or exactly determined: $N \ge p$ .

Theory

Motivation: Failure of Gauss-Newton

Gauss-Newton minimises the linearised objective: $m(\delta) = \tfrac{1}{2}\| r + J\delta \|^2$

The unconstrained minimiser is $\delta^{\text{GN}} = -(J^\top J)^{-1} J^\top r$ , which exists only when $J^\top J$ is invertible. When $J^\top J$ is ill-conditioned, the step is large and unreliable.

The steepest-descent direction is $-\nabla \mathcal{L} = -J^\top r$ (with $\mathcal{L} = \tfrac{1}{2}\|r\|^2$). It is defined even when $J^\top J$ is singular, but it requires choosing a step length and converges slowly (linearly).

LM interpolates between these two extremes via a single scalar.
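The failure mode described above can be reproduced on a toy problem. The sketch below uses a hypothetical 3×2 Jacobian with nearly collinear columns and an arbitrary residual vector (neither comes from a real calibration) and compares the Gauss-Newton step with a lightly damped step using $D = I$.

```python
import numpy as np

# Hypothetical 3x2 Jacobian whose columns are almost collinear
# (a stand-in for two nearly redundant model parameters).
eps = 1e-6
J = np.array([[1.0, 1.0],
              [1.0, 1.0 + eps],
              [1.0, 1.0]])
r = np.array([1.0, 5.0, 3.0])   # arbitrary illustrative residuals

# Gauss-Newton step: least-squares solution of J @ delta = -r.
delta_gn, *_ = np.linalg.lstsq(J, -r, rcond=None)

# Lightly damped step with D = I: (J^T J + lam * I) delta = -J^T r.
lam = 1e-3
delta_lm = np.linalg.solve(J.T @ J + lam * np.eye(2), -J.T @ r)

norm_gn = np.linalg.norm(delta_gn)
norm_lm = np.linalg.norm(delta_lm)
print(f"||delta_GN|| = {norm_gn:.3e}")
print(f"||delta_LM|| = {norm_lm:.3e}")
```

In this example the Gauss-Newton step is several orders of magnitude larger than the damped step, because almost all of its length lies along the near-null direction of $J^\top J$.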

The Levenberg-Marquardt Update

DEFINITION

Definition 2.1 (LM Update Step). Given current iterate $\theta$ , the LM update step $\delta_\lambda$ is the solution to the damped normal equations: $\left(J^\top J + \lambda D\right)\delta_\lambda = -J^\top r$ where $\lambda > 0$ is the damping parameter and $D$ is a positive definite scaling matrix.

Levenberg's original choice: $D = I$ — damps toward zero step.
Marquardt's choice: $D = \mathrm{diag}(J^\top J)$ — scales each parameter direction by its natural curvature. This is the standard modern form.
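A minimal sketch of the two scaling choices, assuming the sign convention $\delta_\lambda = -(J^\top J + \lambda D)^{-1} J^\top r$ used for the Gauss-Newton step. The function name `lm_step`, the random test data and the unit change `S` are illustrative assumptions. The check at the end rescales the parameters and tests whether the step transforms consistently.

```python
import numpy as np

def lm_step(J, r, lam, scaling="marquardt"):
    """Solve (J^T J + lam * D) delta = -J^T r for the step delta."""
    JtJ = J.T @ J
    if scaling == "levenberg":
        D = np.eye(J.shape[1])            # D = I
    elif scaling == "marquardt":
        D = np.diag(np.diag(JtJ))         # D = diag(J^T J)
    else:
        raise ValueError(f"unknown scaling: {scaling}")
    return np.linalg.solve(JtJ + lam * D, -J.T @ r)

rng = np.random.default_rng(0)
J = rng.normal(size=(8, 3))               # illustrative random Jacobian
r = rng.normal(size=8)
lam = 0.5

# Change of units: theta_tilde = S @ theta, so J_tilde = J @ inv(S).
S = np.diag([1.0, 100.0, 0.01])
J_tilde = J @ np.linalg.inv(S)

results = {}
for scaling in ("levenberg", "marquardt"):
    delta = lm_step(J, r, lam, scaling)
    delta_tilde = lm_step(J_tilde, r, lam, scaling)
    # Scale-invariant iff the rescaled step maps back to the original one.
    results[scaling] = bool(np.allclose(np.linalg.solve(S, delta_tilde), delta))
    print(scaling, "scale-invariant:", results[scaling])
```

With a diagonal change of units, the Marquardt step maps back to the original step exactly, while the Levenberg step generally does not.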

THEOREM

Theorem 2.1 (LM Interpolation). For any $\lambda \ge 0$ :

(i) $\lambda = 0$ : $\delta_0 = -(J^\top J)^{-1} J^\top r$ , the Gauss-Newton step (defined when $J^\top J$ is invertible).

(ii) $\lambda \to \infty$ : $\delta_\lambda \to -(1/\lambda) D^{-1} J^\top r$ , which for $D = I$ is $-(1/\lambda) J^\top r$ : a short steepest-descent step of length $O(1/\lambda)$ .

(iii) For any $\lambda > 0$ : $\delta_\lambda$ is a descent direction — the objective strictly decreases along $\delta_\lambda$ for small enough step length.

(iv) The step norm $\|\delta_\lambda\|$ is a strictly decreasing function of $\lambda$ .

Sketch of (iii): With $\delta_\lambda = -(J^\top J + \lambda D)^{-1} J^\top r$ and $\nabla \mathcal{L} = J^\top r$ , the directional derivative of $\mathcal{L}$ along $\delta_\lambda$ is $r^\top J \delta_\lambda = -r^\top J (J^\top J + \lambda D)^{-1} J^\top r$ , which is $\le 0$ since $(J^\top J + \lambda D)^{-1}$ is positive definite. Equality holds only at $J^\top r = 0$ , i.e., at a stationary point.

Property (iv) is the key: $\lambda$ controls the step size. Large $\lambda$ → small conservative step; small $\lambda$ → large aggressive (Gauss-Newton-like) step.
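The four properties can be checked numerically. The sketch below assumes $D = I$, the sign convention $\delta_\lambda = -(J^\top J + \lambda I)^{-1} J^\top r$, and a random full-rank Jacobian chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.normal(size=(10, 3))     # illustrative full-rank Jacobian
r = rng.normal(size=10)
g = J.T @ r                      # gradient of 0.5 * ||r||^2

def lm_step(lam):
    # Levenberg form: (J^T J + lam * I) delta = -J^T r
    return np.linalg.solve(J.T @ J + lam * np.eye(3), -g)

def rel_diff(a, b):
    return np.linalg.norm(a - b) / np.linalg.norm(b)

lams = np.logspace(-6, 6, 25)
steps = [lm_step(lam) for lam in lams]
norms = np.array([np.linalg.norm(d) for d in steps])
slopes = np.array([g @ d for d in steps])       # directional derivatives

delta_gn = np.linalg.solve(J.T @ J, -g)

near_gn = rel_diff(steps[0], delta_gn) < 1e-3          # (i)
near_sd = rel_diff(steps[-1], -g / lams[-1]) < 1e-3    # (ii)
all_descent = bool(np.all(slopes < 0))                 # (iii)
norm_decreasing = bool(np.all(np.diff(norms) < 0))     # (iv)

print("(i)   close to Gauss-Newton at small lambda:", near_gn)
print("(ii)  close to -g/lambda at large lambda:   ", near_sd)
print("(iii) every step is a descent direction:    ", all_descent)
print("(iv)  step norm strictly decreasing:        ", norm_decreasing)
```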

Trust Region Interpretation

THEOREM

Theorem 2.2 (LM as Trust Region Problem). For a given $\lambda > 0$ , the solution $\delta_\lambda$ to the LM damped normal equations is the exact solution of the trust-region sub-problem: $\delta_\lambda = \arg\min_{\delta} \tfrac{1}{2}\| r + J\delta \|^2 \quad \text{subject to} \quad \|\delta\|_D \le \Delta(\lambda)$ where $\|\delta\|_D^2 = \delta^\top D\,\delta$ and $\Delta(\lambda)$ is the radius of the trust region at damping $\lambda$ .

This interpretation is fundamental: $\lambda$ does not fix the step direction but instead enforces a budget on how far the algorithm trusts the linear model $m(\delta) = \tfrac{1}{2}\|r + J\delta\|^2$ . Large $\lambda$ → small trust region → cautious step; small $\lambda$ → large trust region → full Gauss-Newton step if it lies within the region.

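Geometric picture: In parameter space, the trust region is an ellipsoid centred at the current iterate. If the Gauss-Newton point $\theta + \delta^{\text{GN}}$ lies inside the ellipsoid, the step is exactly $\delta^{\text{GN}}$ . Otherwise the LM step lies on the boundary of the ellipsoid, at the point where the linearised objective is smallest. As $\lambda$ increases, this point moves along a curved path from $\theta + \delta^{\text{GN}}$ back toward $\theta$ , rotating toward the (scaled) steepest-descent direction.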
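The trust-region property can be sanity-checked by brute force. The sketch below assumes Marquardt scaling and random illustrative data. It sets the radius to $\Delta(\lambda) = \|\delta_\lambda\|_D$ and confirms that no randomly sampled feasible point achieves a lower value of the linearised objective than the LM step. This is a numerical check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
J = rng.normal(size=(12, 3))            # illustrative Jacobian
r = rng.normal(size=12)
lam = 0.7
D = np.diag(np.sum(J * J, axis=0))      # Marquardt scaling, diag(J^T J)

def model(delta):
    """Linearised objective m(delta) = 0.5 * ||r + J delta||^2."""
    return 0.5 * np.sum((r + J @ delta) ** 2)

def d_norm(delta):
    """Scaled norm ||delta||_D = sqrt(delta^T D delta)."""
    return np.sqrt(delta @ D @ delta)

delta_lm = np.linalg.solve(J.T @ J + lam * D, -J.T @ r)
radius = d_norm(delta_lm)               # implied trust-region radius

# Random feasible points: random direction, D-norm drawn in [0, radius].
best_sampled = np.inf
for _ in range(5000):
    d = rng.normal(size=3)
    d *= radius * rng.uniform() / d_norm(d)
    best_sampled = min(best_sampled, model(d))

print(f"m(delta_lm)       = {model(delta_lm):.6f}")
print(f"best sampled m(d) = {best_sampled:.6f}")
```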

The Gain Ratio and $\lambda$ Update

LM adapts $\lambda$ at each iteration based on the gain ratio $\rho$ :

DEFINITION

Definition 2.2 (Gain Ratio). The gain ratio at iterate $\theta$ with proposed step $\delta$ is: $\rho \;=\; \frac{\mathcal{L}(\theta) - \mathcal{L}(\theta + \delta)}{m(0) - m(\delta)}$ The numerator is the actual reduction in the objective; the denominator is the predicted reduction by the linearised model. $\rho \approx 1$ means the linear model is accurate; $\rho \ll 1$ or $\rho < 0$ means the linear model is poor.

Standard $\lambda$ update heuristic (Nielsen 1999):

Gain ratio $\rho$	Action
$\rho > 0.75$	Step accepted, decrease $\lambda$ : $\lambda \leftarrow \lambda / 3$
$0.25 \le \rho \le 0.75$	Step accepted, keep $\lambda$ unchanged
$0 < \rho < 0.25$	Step rejected, increase $\lambda$ : $\lambda \leftarrow \lambda \times 2$
$\rho \le 0$	Step rejected, increase $\lambda$ : $\lambda \leftarrow \lambda \times 10$

Initialisation: $\lambda_0 = \tau \cdot \max_j (J^\top J)_{jj}$ where $\tau \in [10^{-8}, 1]$ is chosen large when the initial guess is poor and small when the initial guess is good.
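The update rule and initialisation can be written as three small helpers. This is a sketch under stated assumptions: the function names are hypothetical, $\mathcal{L} = \tfrac{1}{2}\|r\|^2$, and the last table row is read as taking precedence whenever $\rho \le 0$. The example uses a linear residual, for which the linearised model is exact and the gain ratio should equal 1.

```python
import numpy as np

def gain_ratio(r_old, r_new, J, delta):
    """Actual over predicted reduction, with m(delta) = 0.5*||r + J delta||^2."""
    actual = 0.5 * (r_old @ r_old - r_new @ r_new)
    r_lin = r_old + J @ delta
    predicted = 0.5 * (r_old @ r_old - r_lin @ r_lin)
    return actual / predicted

def update_lambda(lam, rho):
    """Return (accept_step, new_lambda) following the threshold table."""
    if rho > 0.75:
        return True, lam / 3.0
    if rho >= 0.25:
        return True, lam
    if rho > 0.0:
        return False, lam * 2.0
    return False, lam * 10.0

def initial_lambda(J, tau=1e-3):
    """lambda_0 = tau * max_j (J^T J)_jj."""
    return tau * np.max(np.sum(J * J, axis=0))

# Linear residual r(theta) = A @ theta - b, so the Jacobian is A everywhere.
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 0.0, -1.0])
theta = np.zeros(2)
r_old = A @ theta - b
lam = initial_lambda(A)
delta = np.linalg.solve(A.T @ A + lam * np.eye(2), -A.T @ r_old)
r_new = A @ (theta + delta) - b
rho = gain_ratio(r_old, r_new, A, delta)
accept, lam_next = update_lambda(lam, rho)
print(f"rho = {rho:.6f}, accept = {accept}, lambda: {lam:.4g} -> {lam_next:.4g}")
```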

Convergence Criteria

The algorithm terminates when any of the following holds:

DEFINITION

Definition 2.3 (LM Convergence Conditions). Stop when:

Gradient criterion: $\| J^\top r \|_\infty < \epsilon_g$ — the gradient is negligible; we are at (or very near) a stationary point.
Step size criterion: $\| \delta_\lambda \| < \epsilon_s (1 + \| \theta \|)$ — the update is smaller than a relative tolerance; parameters are not changing meaningfully.
Objective criterion: $| \mathcal{L}(\theta^{(k+1)}) - \mathcal{L}(\theta^{(k)}) | < \epsilon_f (1 + \mathcal{L}(\theta^{(k)}))$ — the objective is stagnant.
Maximum iterations exceeded.

Typical tolerances: $\epsilon_g = \epsilon_s = \epsilon_f = 10^{-8}$ for calibration; $10^{-5}$ for intraday recalibration under time pressure.
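The four stopping conditions can be collected into one helper that reports which criterion fired, which makes later diagnostics easier. The function name, argument order and return labels below are illustrative assumptions.

```python
import numpy as np

def check_convergence(J, r, delta, theta, L_old, L_new, k,
                      eps_g=1e-8, eps_s=1e-8, eps_f=1e-8, max_iter=200):
    """Return the name of the first stopping criterion met, or None."""
    # 1. Gradient criterion: ||J^T r||_inf < eps_g
    if np.linalg.norm(J.T @ r, ord=np.inf) < eps_g:
        return "gradient"
    # 2. Step size criterion: ||delta|| < eps_s * (1 + ||theta||)
    if np.linalg.norm(delta) < eps_s * (1.0 + np.linalg.norm(theta)):
        return "step"
    # 3. Objective criterion: |L_new - L_old| < eps_f * (1 + L_old)
    if abs(L_new - L_old) < eps_f * (1.0 + L_old):
        return "objective"
    # 4. Iteration budget exhausted
    if k >= max_iter:
        return "max_iter"
    return None

I2 = np.eye(2)
ones = np.ones(2)
print(check_convergence(I2, np.array([1e-10, 0.0]), ones, ones, 1.0, 0.5, 3))
print(check_convergence(I2, ones, np.array([1e-12, 0.0]), ones, 1.0, 0.5, 3))
print(check_convergence(I2, ones, np.array([0.1, 0.0]), ones, 1.0, 0.5, 3))
```

Returning the label rather than a bare boolean lets the caller treat a "step" or "objective" exit differently from a "gradient" exit.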

WARNING

Warning — False Convergence. Criteria 2 and 3 can trigger when $\lambda$ is very large and the step is tiny — not because a minimum was found, but because the algorithm is stuck with an enormous damping parameter. Always check the gradient criterion and inspect the final residuals $\| r(\theta^*) \|$ . A large gradient norm at "convergence" signals a failed calibration.

Linear Algebra: Solving the Damped Normal Equations

The $p \times p$ system $(J^\top J + \lambda D)\,\delta = -J^\top r$ is solved at each iteration.

Cholesky approach: If $J^\top J + \lambda D$ is positive definite (it is, for any $\lambda > 0$ with $D$ PD), factor $J^\top J + \lambda D = L L^\top$ and solve two triangular systems. Cost: about $p^3/3$ flops for the factorisation plus $O(p^2)$ for the solves, on top of $O(Np^2)$ to form $J^\top J$ . Efficient for small $p$ ( $p \le 20$ ).

QR approach (more numerically stable): Form the augmented system $\tilde{J} = \begin{pmatrix} J \\ \sqrt{\lambda} D^{1/2} \end{pmatrix}, \quad \tilde{r} = \begin{pmatrix} r \\ 0 \end{pmatrix}$ and solve $\min_\delta \|\tilde{r} + \tilde{J}\delta\|^2$ via QR decomposition of $\tilde{J}$ . Cost: $O(Np^2)$ for the factorisation. Preferred when $N \gg p$ or when $J^\top J$ may be nearly singular.
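Both routes can be compared directly in NumPy. The sketch below uses random illustrative data with Marquardt scaling and the convention $\min_\delta \|\tilde{r} + \tilde{J}\delta\|^2$, so the right-hand side carries a minus sign. `np.linalg.solve` is used for the triangular systems for brevity, although it does not exploit triangularity; `scipy.linalg.solve_triangular` would.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 40, 5
J = rng.normal(size=(N, p))            # illustrative Jacobian
r = rng.normal(size=N)
lam = 1e-2
d = np.sum(J * J, axis=0)              # diag(J^T J), Marquardt scaling D

# Cholesky: factor A = L L^T, then forward and back substitution.
A = J.T @ J + lam * np.diag(d)
L = np.linalg.cholesky(A)
y = np.linalg.solve(L, -J.T @ r)
delta_chol = np.linalg.solve(L.T, y)

# QR: least-squares solve of the augmented system, never forming J^T J.
J_aug = np.vstack([J, np.sqrt(lam) * np.diag(np.sqrt(d))])
r_aug = np.concatenate([r, np.zeros(p)])
Q, R = np.linalg.qr(J_aug)             # reduced QR: Q is (N+p) x p, R is p x p
delta_qr = np.linalg.solve(R, -Q.T @ r_aug)

max_diff = np.max(np.abs(delta_chol - delta_qr))
print(f"max |delta_chol - delta_qr| = {max_diff:.2e}")
```

On a well-conditioned problem like this one, the two steps agree to roughly machine precision. The difference only becomes visible when $J^\top J$ is close to singular and $\lambda$ is small.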

REMARK

Remark. For Heston calibration ( $p = 5$ , $N \le 60$ ), Cholesky is fast enough. For LMM calibration ( $p$ up to 50+), QR or iterative methods are preferred.

Rate of Convergence

THEOREM

Theorem 2.3 (Local Convergence of LM). Suppose $\theta^*$ is a local minimum of $\mathcal{L}$ at which $J^*$ has full column rank $p$ . There exists $\epsilon > 0$ such that if $\| \theta^{(0)} - \theta^* \| < \epsilon$ , the LM iterates converge to $\theta^*$ .

If the residuals vanish at the solution ( $r(\theta^*) = 0$ ), convergence is quadratic: $\|\theta^{(k+1)} - \theta^*\| = O(\|\theta^{(k)} - \theta^*\|^2)$ .

If the residuals are non-zero at the solution, convergence is linear at rate proportional to $\| r(\theta^*) \|$ .

The key implication: a near-perfect model fit (zero residuals) gives fast quadratic convergence. A model that cannot fit the surface exactly (e.g., fitting a 5-parameter Heston to a surface with term-structure features the model cannot reproduce) converges linearly. This is a diagnostic: slow convergence near the solution is evidence of model misspecification.
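The two regimes can be seen on a one-parameter toy problem. The sketch below is an illustration, not a calibration: it uses the made-up residual $r(\theta) = (e^{\theta} - 1,\ \theta^2 + c)$, whose minimiser is $\theta^* = 0$ with $\|r(\theta^*)\| = |c|$, and runs pure Gauss-Newton ($\lambda = 0$), the limit that LM approaches once $\lambda$ has been driven down near the solution. For this residual the linear contraction factor works out to $2|c|$.

```python
import numpy as np

def gauss_newton_errors(c, theta0=0.5, n_iter=20):
    """|theta_k - theta*| for r(theta) = (exp(theta) - 1, theta**2 + c), theta* = 0."""
    theta, errors = theta0, [abs(theta0)]
    for _ in range(n_iter):
        r = np.array([np.exp(theta) - 1.0, theta**2 + c])
        J = np.array([np.exp(theta), 2.0 * theta])   # N = 2 residuals, p = 1 parameter
        theta -= (J @ r) / (J @ J)                   # Gauss-Newton step
        errors.append(abs(theta))
    return np.array(errors)

zero_res = gauss_newton_errors(c=0.0)      # residuals vanish at the solution
nonzero_res = gauss_newton_errors(c=0.2)   # ||r(theta*)|| = 0.2

print("zero-residual errors:   ", zero_res[:6])
print("nonzero-residual errors:", nonzero_res[:6])

# Late-iteration contraction factor in the nonzero-residual case.
ratio = nonzero_res[15] / nonzero_res[14]
print(f"error ratio at k = 14: {ratio:.4f}")
```

In the zero-residual run the error roughly squares at each iteration. In the other run it shrinks by a nearly constant factor per iteration, which is the slow-convergence signature described above.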

Validation

The companion notebook implements LM from scratch (without scipy.optimize) and verifies:

The gain ratio $\rho$ is close to 1 near the solution (linear model is accurate).
The step norm $\|\delta_\lambda\|$ decreases monotonically as $\lambda$ increases.
$\lambda$ is reduced (toward Gauss-Newton) when the objective decreases well.
Convergence to the correct solution on a 2-parameter test problem with known analytic answer.
The QR and Cholesky approaches give identical steps (within machine precision).
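For orientation, a from-scratch LM loop might look like the sketch below. This is not the notebook's code: the function name, the choice $D = I$, the threshold update rule and the test problem (fitting $y = a\,e^{-bt}$ to noise-free data generated with $a = 2$, $b = 0.5$) are all assumptions made for illustration.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, theta0, tau=1e-3,
                        eps_g=1e-8, eps_s=1e-8, max_iter=200):
    """Minimise 0.5 * ||r(theta)||^2 with D = I and a threshold lambda update."""
    theta = np.asarray(theta0, dtype=float)
    r, J = residual(theta), jacobian(theta)
    lam = tau * np.max(np.sum(J * J, axis=0))        # tau * max diag(J^T J)
    for k in range(max_iter):
        g = J.T @ r
        if np.linalg.norm(g, ord=np.inf) < eps_g:
            return theta, "gradient", k
        delta = np.linalg.solve(J.T @ J + lam * np.eye(theta.size), -g)
        if np.linalg.norm(delta) < eps_s * (1.0 + np.linalg.norm(theta)):
            return theta, "step", k
        r_new = residual(theta + delta)
        r_lin = r + J @ delta
        # Gain ratio: actual over predicted reduction (the factor 0.5 cancels).
        rho = (r @ r - r_new @ r_new) / (r @ r - r_lin @ r_lin)
        if rho >= 0.25:                              # accept the step
            theta, r = theta + delta, r_new
            J = jacobian(theta)
            if rho > 0.75:
                lam /= 3.0
        else:                                        # reject and raise damping
            lam *= 2.0 if rho > 0.0 else 10.0
    return theta, "max_iter", max_iter

# Hypothetical 2-parameter test problem with a known answer (a, b) = (2, 0.5).
t = np.linspace(0.0, 4.0, 20)
y = 2.0 * np.exp(-0.5 * t)

def residual(theta):
    a, b = theta
    return a * np.exp(-b * t) - y

def jacobian(theta):
    a, b = theta
    e = np.exp(-b * t)
    return np.column_stack([e, -a * t * e])

theta_hat, reason, n_iter = levenberg_marquardt(residual, jacobian, [1.0, 1.0])
print(theta_hat, reason, n_iter)
```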

PRACTICE

Before opening the notebook: For $r(\theta) = (\theta_1 - 2, \theta_2 - 3)^\top$ with $J = I_2$ (identity Jacobian): (a) What is the Gauss-Newton step from $\theta = (0, 0)$ ? (b) What is the LM step for $\lambda = 1$ , $D = I$ ? (c) Verify that the LM step converges to the steepest descent direction as $\lambda \to \infty$ .

Limitations

WARNING

Warning — Local Convergence Only. LM converges to a local minimum, not necessarily the global one. For non-convex problems (Heston, SABR), the basin of attraction is finite and starting far from the solution risks convergence to a spurious local minimum. Always validate the calibrated parameters against stylised facts (positive vol of vol, negative correlation for equity smiles) and run multiple starting points.

WARNING

Warning — $\lambda$ Explosion. If $\lambda$ grows to machine limits (e.g., $\lambda > 10^{30}$ ) without the gradient criterion being satisfied, the algorithm has stalled: every candidate step is rejected as unpredictive, but no stationary point has been found. Root causes: (i) wrong initial point, (ii) numerical errors in the Jacobian, (iii) poorly scaled parameters. Fix by normalising parameters to $O(1)$ before running LM.

Other limitations:

Parameter scaling: LM with $D = I$ is not scale-invariant. A parameter $\xi = 0.3$ (vol of vol) and a parameter $\kappa = 2.0$ (mean reversion rate) live on different scales. Use Marquardt scaling $D = \mathrm{diag}(J^\top J)$ to compensate, or normalise each parameter to unit scale before running.
MC pricing noise: Standard LM assumes a deterministic objective. With MC pricing, the objective and Jacobian are noisy. Standard LM will oscillate rather than converge. Use noise-robust methods (average over seeds, use control variates, or switch to a derivative-free method).
Jacobian computation cost: Each iteration requires evaluating $J$ , which costs $p$ (forward differences) or $2p$ (central differences) evaluations of the pricing function. For $p = 5$ Heston parameters and 40 instruments, that is 5–10 bumped repricings of the whole instrument set per LM iteration, i.e. 200–400 individual option prices, each involving numerical integration. Analytic or AAD-computed Jacobians are essential at production speed.
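The evaluation count for finite-difference Jacobians can be made concrete. The sketch below uses a hypothetical smooth 5-parameter residual with 40 outputs as a stand-in for a pricing function (it is not a Heston pricer) and counts the bumped calls to `residual`, each of which reprices all 40 outputs. The helper name `fd_jacobian` and the relative bump size are assumptions.

```python
import numpy as np

def fd_jacobian(residual, theta, h=1e-6, central=False):
    """Finite-difference Jacobian of residual(theta); returns (J, n_bumped_evals)."""
    theta = np.asarray(theta, dtype=float)
    r0 = residual(theta)                      # base evaluation, not counted as a bump
    J = np.empty((r0.size, theta.size))
    n_evals = 0
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = h * max(1.0, abs(theta[j]))
        if central:
            J[:, j] = (residual(theta + step) - residual(theta - step)) / (2.0 * step[j])
            n_evals += 2
        else:
            J[:, j] = (residual(theta + step) - r0) / step[j]
            n_evals += 1
    return J, n_evals

# Hypothetical stand-in for a pricing residual: 5 parameters, 40 outputs.
x = np.linspace(0.0, 1.0, 40)

def residual(theta):
    return theta[0] + theta[1] * x + theta[2] * x**2 + np.exp(-theta[3] * x) + np.sin(theta[4] * x)

def exact_jacobian(theta):
    return np.column_stack([np.ones_like(x), x, x**2,
                            -x * np.exp(-theta[3] * x), x * np.cos(theta[4] * x)])

theta = np.array([0.1, 0.5, -0.3, 2.0, 1.5])
J_fwd, n_fwd = fd_jacobian(residual, theta)
J_cen, n_cen = fd_jacobian(residual, theta, central=True)
J_exact = exact_jacobian(theta)
err_fwd = np.max(np.abs(J_fwd - J_exact))
err_cen = np.max(np.abs(J_cen - J_exact))
print(f"forward: {n_fwd} bumped evaluations, max error {err_fwd:.1e}")
print(f"central: {n_cen} bumped evaluations, max error {err_cen:.1e}")
```

Central differences double the number of repricings in exchange for a smaller truncation error, which is the trade-off an analytic or AAD Jacobian avoids.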

Interview Angle

PRACTICE

L1 (Junior) — Typical questions:

What problem does LM solve that Gauss-Newton does not? Expected: LM handles singular or ill-conditioned $J^\top J$ via damping. Also handles poor starting points by restricting step size. Know the basic update formula.
What is the damping parameter $\lambda$ and how is it updated? Expected: controls interpolation between GN and gradient descent. Decreased on good steps (high gain ratio), increased on bad steps (low gain ratio). Initialised proportionally to diagonal of $J^\top J$ .
When would you say a calibration has converged? Expected: gradient norm small, step size small, residuals stable. Check all three — not just "ran for N iterations".

PRACTICE

L2 (Senior) — Typical questions:

Derive the LM update step from the trust-region sub-problem. Expected: minimise $\|r + J\delta\|^2$ subject to $\|\delta\|_D^2 \le \Delta^2$ . Via Lagrangian: $(J^\top J + \lambda D)\delta = -J^\top r$ where $\lambda$ is the multiplier. Show that step norm is decreasing in $\lambda$ .
What is the gain ratio and why is it a good guide for updating $\lambda$ ? Expected: ratio of actual to predicted reduction in objective. $\rho \approx 1$ means linear model is accurate → trust it more (reduce $\lambda$ ). $\rho < 0$ means the step increased the objective although the model predicted a decrease → reduce trust (increase $\lambda$ ).
Under what conditions does LM converge quadratically? Expected: residuals vanish at solution (zero-residual problem). Derive from Taylor expansion of $r(\theta^{(k+1)})$ around $\theta^*$ .

PRACTICE

L3 (Researcher) — Typical questions:

Compare the LM algorithm to Newton's method. Why do we use the Gauss-Newton Hessian approximation rather than the full Hessian? Expected: the full Hessian is $J^\top J + \sum_i r_i \nabla^2 r_i$ , and the second term requires expensive second-order derivatives. Gauss-Newton uses only $J^\top J$ , which is cheaper and always PSD. Full Newton retains quadratic convergence on large-residual problems, but the GN approximation is asymptotically equivalent near a zero-residual solution. Trade-off is computation vs. convergence radius.
In LMM calibration with 50 forward rates, the parameter space is high-dimensional. How would you modify the LM algorithm to exploit structure in the problem? Expected: block diagonal structure of $J^\top J$ (rates are coupled through swap weights only), cascade calibration (sequential single-expiry fits), rank-reduction of the correlation matrix. Discussion of whether LM is the right tool vs. quasi-Newton or trust-region Newton with sparse linear algebra.
Prove that the LM step is a descent direction for any $\lambda > 0$ . Expected: $\nabla \mathcal{L}^\top \delta_\lambda = -r^\top J (J^\top J + \lambda D)^{-1} J^\top r \le 0$ since $(J^\top J + \lambda D)^{-1}$ is PD. Equality iff $J^\top r = 0$ (stationary point).
