Setup
The Gauss-Newton method solves the normal equations at each iteration. It converges quadratically near a good solution but fails when:
- is singular or near-singular (rank deficiency from parameter redundancy or flat objective directions),
- The initial guess is far from the solution (large residuals make the Gauss-Newton step overshoot),
- The objective is highly non-convex (the Gauss-Newton Hessian approximation is poor far from the solution).
The Levenberg-Marquardt (LM) algorithm (Levenberg 1944; Marquardt 1963) resolves all three
failures by blending Gauss-Newton with steepest descent via a single damping parameter .
It is the standard algorithm for non-linear least squares in quantitative finance: it is the
solver used inside scipy.optimize.least_squares, MATLAB's lsqnonlin, and most commercial
calibration engines.
Financial Insight. A Heston calibration from a cold start (no prior-day parameters) typically converges in 20–50 LM iterations. An intraday recalibration starting from the previous calibration converges in 3–10 iterations. The difference is the starting point relative to the basin of attraction. Understanding LM convergence diagnostics — gain ratio, gradient norm, step size — is essential for diagnosing when a calibration has failed silently (converged to a bad local minimum) versus when it is genuinely well-calibrated.
Assumptions:
- The residual function is continuously differentiable on . The Jacobian exists and is computable (analytically or numerically).
- We work with weighted residuals; (unit weights) for notational clarity. The extension to non-unit is immediate: replace with and with throughout.
- The problem is overdetermined or exactly determined: .
Theory
Motivation: Failure of Gauss-Newton
Gauss-Newton minimises the linearised objective:
The unconstrained minimiser is , which exists only when is invertible. When is ill-conditioned, the step is large and unreliable.
The gradient descent direction is , which is small when the gradient is small but requires choosing a step length and converges slowly (linearly).
LM interpolates between these two extremes via a single scalar.
The Levenberg-Marquardt Update
Definition 2.1 (LM Update Step). Given current iterate , the LM update step is the solution to the damped normal equations: where is the damping parameter and is a positive definite scaling matrix.
- Levenberg's original choice: — damps toward zero step.
- Marquardt's choice: — scales each parameter direction by its natural curvature. This is the standard modern form.
Theorem 2.1 (LM Interpolation). For any :
(i) : — the Gauss-Newton step.
(ii) : — a short steepest-descent step of length .
(iii) For any : is a descent direction — the objective strictly decreases along for small enough step length.
(iv) The step norm is a strictly decreasing function of .
Sketch of (iii): The gradient of in direction is , which is since is positive definite. Equality holds only at , i.e., at a stationary point.
Property (iv) is the key: controls the step size. Large → small conservative step; small → large aggressive (Gauss-Newton-like) step.
Trust Region Interpretation
Theorem 2.2 (LM as Trust Region Problem). For a given , the solution to the LM damped normal equations is the exact solution of the trust-region sub-problem: where and is the radius of the trust region at damping .
This interpretation is fundamental: does not fix the step direction but instead enforces a budget on how far the algorithm trusts the linear model . Large → small trust region → cautious step; small → large trust region → full Gauss-Newton step if it lies within the region.
Geometric picture: In parameter space, the trust region is an ellipsoid centred at the current iterate. The LM step is the point on the Gauss-Newton path (from toward ) that lies on the boundary of the ellipsoid if the Gauss-Newton step is too large, or exactly at if it lies inside.
The Gain Ratio and Update
LM adapts at each iteration based on the gain ratio :
Definition 2.2 (Gain Ratio). The gain ratio at iterate with proposed step is: The numerator is the actual reduction in the objective; the denominator is the predicted reduction by the linearised model. means the linear model is accurate; or means the linear model is poor.
Standard update heuristic (Nielsen 1999):
| Gain ratio | Action |
|---|---|
| Step accepted, decrease : | |
| Step accepted, keep unchanged | |
| Step rejected, increase : | |
| Step rejected, increase : |
Initialisation: where is chosen large when the initial guess is poor and small when the initial guess is good.
Convergence Criteria
The algorithm terminates when any of the following holds:
Definition 2.3 (LM Convergence Conditions). Stop when:
-
Gradient criterion: — the gradient is negligible; we are at (or very near) a stationary point.
-
Step size criterion: — the update is smaller than a relative tolerance; parameters are not changing meaningfully.
-
Objective criterion: — the objective is stagnant.
-
Maximum iterations exceeded.
Typical tolerances: for calibration; for intraday recalibration under time pressure.
Warning — False Convergence. Criteria 2 and 3 can trigger when is very large and the step is tiny — not because a minimum was found, but because the algorithm is stuck with an enormous damping parameter. Always check the gradient criterion and inspect the final residuals . A large gradient norm at "convergence" signals a failed calibration.
Linear Algebra: Solving the Damped Normal Equations
The system is solved at each iteration.
Cholesky approach: If is positive definite (it is, for any with PD), factor and solve two triangular systems. Cost: for the factorisation + for the solves. Efficient for small ().
QR approach (more numerically stable): Form the augmented system and solve via QR decomposition of . Cost: for the factorisation. Preferred when or when may be nearly singular.
Remark. For Heston calibration (, ), Cholesky is fast enough. For LMM calibration ( up to 50+), QR or iterative methods are preferred.
Rate of Convergence
Theorem 2.3 (Local Convergence of LM). Suppose is a local minimum of at which has full column rank . There exists such that if , the LM iterates converge to .
If the residuals vanish at the solution (), convergence is quadratic: .
If the residuals are non-zero at the solution, convergence is linear at rate proportional to .
The key implication: a near-perfect model fit (zero residuals) gives fast quadratic convergence. A model that cannot fit the surface exactly (e.g., fitting a 5-parameter Heston to a surface with term-structure features the model cannot reproduce) converges linearly. This is a diagnostic: slow convergence near the solution is evidence of model misspecification.
Validation
The companion notebook implements LM from scratch (without scipy.optimize) and verifies:
- The gain ratio is close to 1 near the solution (linear model is accurate).
- The step norm decreases monotonically as increases.
- is reduced (toward Gauss-Newton) when the objective decreases well.
- Convergence to the correct solution on a 2-parameter test problem with known analytic answer.
- The QR and Cholesky approaches give identical steps (within machine precision).
Before opening the notebook: For with (identity Jacobian): (a) What is the Gauss-Newton step from ? (b) What is the LM step for , ? (c) Verify that the LM step converges to the steepest descent direction as .
Limitations
Warning — Local Convergence Only. LM converges to a local minimum, not necessarily the global one. For non-convex problems (Heston, SABR), the basin of attraction is finite and starting far from the solution risks convergence to a spurious local minimum. Always validate the calibrated parameters against stylised facts (positive vol of vol, negative correlation for equity smiles) and run multiple starting points.
Warning — Explosion. If grows to machine limits (e.g., ) without the gradient criterion being satisfied, the algorithm has stalled: every candidate step is rejected as unpredictive, but no stationary point has been found. Root causes: (i) wrong initial point, (ii) numerical errors in the Jacobian, (iii) poorly scaled parameters. Fix by normalising parameters to before running LM.
Other limitations:
-
Parameter scaling: LM with is not scale-invariant. A parameter (vol of vol) and a parameter (mean reversion rate) live on different scales. Use Marquardt scaling to compensate, or normalise each parameter to unit scale before running.
-
MC pricing noise: Standard LM assumes a deterministic objective. With MC pricing, the objective and Jacobian are noisy. Standard LM will oscillate rather than converge. Use noise-robust methods (average over seeds, use control variates, or switch to a derivative-free method).
-
Jacobian computation cost: Each iteration requires evaluating , which costs or pricing function evaluations for finite-difference derivatives. For Heston parameters and 40 instruments, one LM iteration requires 10–40 option price evaluations (each involving numerical integration). Analytic or AAD-computed Jacobians are essential at production speed.
Interview Angle
L1 (Junior) — Typical questions:
-
What problem does LM solve that Gauss-Newton does not? Expected: LM handles singular or ill-conditioned via damping. Also handles poor starting points by restricting step size. Know the basic update formula.
-
What is the damping parameter and how is it updated? Expected: controls interpolation between GN and gradient descent. Decreased on good steps (high gain ratio), increased on bad steps (low gain ratio). Initialised proportionally to diagonal of .
-
When would you say a calibration has converged? Expected: gradient norm small, step size small, residuals stable. Check all three — not just "ran for N iterations".
L2 (Senior) — Typical questions:
-
Derive the LM update step from the trust-region sub-problem. Expected: minimise subject to . Via Lagrangian: where is the multiplier. Show that step norm is decreasing in .
-
What is the gain ratio and why is it a good guide for updating ? Expected: ratio of actual to predicted reduction in objective. means linear model is accurate → trust it more (reduce ). means the model overpredicted the gain → reduce trust (increase ).
-
Under what conditions does LM converge quadratically? Expected: residuals vanish at solution (zero-residual problem). Derive from Taylor expansion of around .
L3 (Researcher) — Typical questions:
-
Compare the LM algorithm to Newton's method. Why do we use the Gauss-Newton Hessian approximation rather than the full Hessian? Expected: full Hessian requires — expensive second-order derivatives. Gauss-Newton uses only , which is cheaper and always PSD. Full Newton converges faster away from the solution but the GN approximation is asymptotically equivalent near a zero-residual solution. Trade-off is computation vs. convergence radius.
-
In LMM calibration with 50 forward rates, the parameter space is high-dimensional. How would you modify the LM algorithm to exploit structure in the problem? Expected: block diagonal structure of (rates are coupled through swap weights only), cascade calibration (sequential single-expiry fits), rank-reduction of the correlation matrix. Discussion of whether LM is the right tool vs. quasi-Newton or trust-region Newton with sparse linear algebra.
-
Prove that the LM step is a descent direction for any . Expected: since is PD. Equality iff (stationary point).