Levenberg-MarquardtTrust RegionDamping ParameterGauss-NewtonConvergence

The Levenberg-Marquardt Algorithm — Theory and Implementation

Module 2 of 528 min readLevel: Hard

Setup

The Gauss-Newton method solves the normal equations (JW2J)δθ=JW2r(J^\top W^2 J)\,\delta\theta = J^\top W^2 r at each iteration. It converges quadratically near a good solution but fails when:

  • JJJ^\top J is singular or near-singular (rank deficiency from parameter redundancy or flat objective directions),
  • The initial guess is far from the solution (large residuals make the Gauss-Newton step overshoot),
  • The objective is highly non-convex (the Gauss-Newton Hessian approximation is poor far from the solution).

The Levenberg-Marquardt (LM) algorithm (Levenberg 1944; Marquardt 1963) resolves all three failures by blending Gauss-Newton with steepest descent via a single damping parameter λ\lambda. It is the standard algorithm for non-linear least squares in quantitative finance: it is the solver used inside scipy.optimize.least_squares, MATLAB's lsqnonlin, and most commercial calibration engines.

INSIGHT

Financial Insight. A Heston calibration from a cold start (no prior-day parameters) typically converges in 20–50 LM iterations. An intraday recalibration starting from the previous calibration converges in 3–10 iterations. The difference is the starting point relative to the basin of attraction. Understanding LM convergence diagnostics — gain ratio, gradient norm, step size — is essential for diagnosing when a calibration has failed silently (converged to a bad local minimum) versus when it is genuinely well-calibrated.

Assumptions:

  • The residual function r:ΘRNr: \Theta \to \mathbb{R}^N is continuously differentiable on ΘRp\Theta \subseteq \mathbb{R}^p. The Jacobian J(θ)RN×pJ(\theta) \in \mathbb{R}^{N\times p} exists and is computable (analytically or numerically).
  • We work with weighted residuals; W=IW = I (unit weights) for notational clarity. The extension to non-unit WW is immediate: replace JJ with WJWJ and rr with WrWr throughout.
  • The problem is overdetermined or exactly determined: NpN \ge p.

Theory

Motivation: Failure of Gauss-Newton

Gauss-Newton minimises the linearised objective: m(δ)=12r+Jδ2m(\delta) = \tfrac{1}{2}\| r + J\delta \|^2

The unconstrained minimiser is δGN=(JJ)1Jr\delta^{\text{GN}} = -(J^\top J)^{-1} J^\top r, which exists only when JJJ^\top J is invertible. When JJJ^\top J is ill-conditioned, the step is large and unreliable.

The gradient descent direction is L=Jr-\nabla \mathcal{L} = J^\top r, which is small when the gradient is small but requires choosing a step length and converges slowly (linearly).

LM interpolates between these two extremes via a single scalar.

The Levenberg-Marquardt Update

DEFINITION

Definition 2.1 (LM Update Step). Given current iterate θ\theta, the LM update step δλ\delta_\lambda is the solution to the damped normal equations: (JJ+λD)δλ=Jr\left(J^\top J + \lambda D\right)\delta_\lambda = J^\top r where λ>0\lambda > 0 is the damping parameter and DD is a positive definite scaling matrix.

  • Levenberg's original choice: D=ID = I — damps toward zero step.
  • Marquardt's choice: D=diag(JJ)D = \mathrm{diag}(J^\top J) — scales each parameter direction by its natural curvature. This is the standard modern form.
THEOREM

Theorem 2.1 (LM Interpolation). For any λ0\lambda \ge 0:

(i) λ=0\lambda = 0: δ0=(JJ)1Jr\delta_0 = -(J^\top J)^{-1} J^\top r — the Gauss-Newton step.

(ii) λ\lambda \to \infty: δλ(1/λ)Jr\delta_\lambda \to -(1/\lambda) J^\top r — a short steepest-descent step of length O(1/λ)O(1/\lambda).

(iii) For any λ>0\lambda > 0: δλ\delta_\lambda is a descent direction — the objective strictly decreases along δλ\delta_\lambda for small enough step length.

(iv) The step norm δλ\|\delta_\lambda\| is a strictly decreasing function of λ\lambda.

Sketch of (iii): The gradient of L\mathcal{L} in direction δλ\delta_\lambda is rJδλ=rJ(JJ+λD)1Jr-r^\top J \delta_\lambda = -r^\top J (J^\top J + \lambda D)^{-1} J^\top r, which is 0\le 0 since (JJ+λD)1(J^\top J + \lambda D)^{-1} is positive definite. Equality holds only at Jr=0J^\top r = 0, i.e., at a stationary point.

Property (iv) is the key: λ\lambda controls the step size. Large λ\lambda → small conservative step; small λ\lambda → large aggressive (Gauss-Newton-like) step.

Trust Region Interpretation

THEOREM

Theorem 2.2 (LM as Trust Region Problem). For a given λ>0\lambda > 0, the solution δλ\delta_\lambda to the LM damped normal equations is the exact solution of the trust-region sub-problem: δλ=argminδ12r+Jδ2subject toδDΔ(λ)\delta_\lambda = \arg\min_{\delta} \tfrac{1}{2}\| r + J\delta \|^2 \quad \text{subject to} \quad \|\delta\|_D \le \Delta(\lambda) where δD2=δDδ\|\delta\|_D^2 = \delta^\top D\,\delta and Δ(λ)\Delta(\lambda) is the radius of the trust region at damping λ\lambda.

This interpretation is fundamental: λ\lambda does not fix the step direction but instead enforces a budget on how far the algorithm trusts the linear model m(δ)=r+Jδ2m(\delta) = \|r + J\delta\|^2. Large λ\lambda → small trust region → cautious step; small λ\lambda → large trust region → full Gauss-Newton step if it lies within the region.

Geometric picture: In parameter space, the trust region is an ellipsoid centred at the current iterate. The LM step is the point on the Gauss-Newton path (from θ\theta toward δGN\delta^{\text{GN}}) that lies on the boundary of the ellipsoid if the Gauss-Newton step is too large, or exactly at δGN\delta^{\text{GN}} if it lies inside.

The Gain Ratio and λ\lambda Update

LM adapts λ\lambda at each iteration based on the gain ratio ρ\rho:

DEFINITION

Definition 2.2 (Gain Ratio). The gain ratio at iterate θ\theta with proposed step δ\delta is: ρ  =  L(θ)L(θ+δ)m(0)m(δ)\rho \;=\; \frac{\mathcal{L}(\theta) - \mathcal{L}(\theta + \delta)}{m(0) - m(\delta)} The numerator is the actual reduction in the objective; the denominator is the predicted reduction by the linearised model. ρ1\rho \approx 1 means the linear model is accurate; ρ1\rho \ll 1 or ρ<0\rho < 0 means the linear model is poor.

Standard λ\lambda update heuristic (Nielsen 1999):

Gain ratio ρ\rhoAction
ρ>0.75\rho > 0.75Step accepted, decrease λ\lambda: λλ/3\lambda \leftarrow \lambda / 3
0.25ρ0.750.25 \le \rho \le 0.75Step accepted, keep λ\lambda unchanged
ρ<0.25\rho < 0.25Step rejected, increase λ\lambda: λλ×2\lambda \leftarrow \lambda \times 2
ρ0\rho \le 0Step rejected, increase λ\lambda: λλ×10\lambda \leftarrow \lambda \times 10

Initialisation: λ0=τmaxj(JJ)jj\lambda_0 = \tau \cdot \max_j (J^\top J)_{jj} where τ[108,1]\tau \in [10^{-8}, 1] is chosen large when the initial guess is poor and small when the initial guess is good.

Convergence Criteria

The algorithm terminates when any of the following holds:

DEFINITION

Definition 2.3 (LM Convergence Conditions). Stop when:

  1. Gradient criterion: Jr<ϵg\| J^\top r \|_\infty < \epsilon_g — the gradient is negligible; we are at (or very near) a stationary point.

  2. Step size criterion: δλ<ϵs(1+θ)\| \delta_\lambda \| < \epsilon_s (1 + \| \theta \|) — the update is smaller than a relative tolerance; parameters are not changing meaningfully.

  3. Objective criterion: L(θ(k+1))L(θ(k))<ϵf(1+L(θ(k)))| \mathcal{L}(\theta^{(k+1)}) - \mathcal{L}(\theta^{(k)}) | < \epsilon_f (1 + \mathcal{L}(\theta^{(k)})) — the objective is stagnant.

  4. Maximum iterations exceeded.

Typical tolerances: ϵg=ϵs=ϵf=108\epsilon_g = \epsilon_s = \epsilon_f = 10^{-8} for calibration; 10510^{-5} for intraday recalibration under time pressure.

WARNING

Warning — False Convergence. Criteria 2 and 3 can trigger when λ\lambda is very large and the step is tiny — not because a minimum was found, but because the algorithm is stuck with an enormous damping parameter. Always check the gradient criterion and inspect the final residuals r(θ)\| r(\theta^*) \|. A large gradient norm at "convergence" signals a failed calibration.

Linear Algebra: Solving the Damped Normal Equations

The p×pp \times p system (JJ+λD)δ=Jr(J^\top J + \lambda D)\,\delta = J^\top r is solved at each iteration.

Cholesky approach: If JJ+λDJ^\top J + \lambda D is positive definite (it is, for any λ>0\lambda > 0 with DD PD), factor JJ+λD=LLJ^\top J + \lambda D = L L^\top and solve two triangular systems. Cost: O(p3/3)O(p^3/3) for the factorisation + O(p2)O(p^2) for the solves. Efficient for small pp (p20p \le 20).

QR approach (more numerically stable): Form the augmented system J~=(JλD1/2),r~=(r0)\tilde{J} = \begin{pmatrix} J \\ \sqrt{\lambda} D^{1/2} \end{pmatrix}, \quad \tilde{r} = \begin{pmatrix} r \\ 0 \end{pmatrix} and solve minδr~+J~δ2\min_\delta \|\tilde{r} + \tilde{J}\delta\|^2 via QR decomposition of J~\tilde{J}. Cost: O(Np2)O(Np^2) for the factorisation. Preferred when NpN \gg p or when JJJ^\top J may be nearly singular.

REMARK

Remark. For Heston calibration (p=5p = 5, N60N \le 60), Cholesky is fast enough. For LMM calibration (pp up to 50+), QR or iterative methods are preferred.

Rate of Convergence

THEOREM

Theorem 2.3 (Local Convergence of LM). Suppose θ\theta^* is a local minimum of L\mathcal{L} at which JJ^* has full column rank pp. There exists ϵ>0\epsilon > 0 such that if θ(0)θ<ϵ\| \theta^{(0)} - \theta^* \| < \epsilon, the LM iterates converge to θ\theta^*.

If the residuals vanish at the solution (r(θ)=0r(\theta^*) = 0), convergence is quadratic: θ(k+1)θ=O(θ(k)θ2)\|\theta^{(k+1)} - \theta^*\| = O(\|\theta^{(k)} - \theta^*\|^2).

If the residuals are non-zero at the solution, convergence is linear at rate proportional to r(θ)\| r(\theta^*) \|.

The key implication: a near-perfect model fit (zero residuals) gives fast quadratic convergence. A model that cannot fit the surface exactly (e.g., fitting a 5-parameter Heston to a surface with term-structure features the model cannot reproduce) converges linearly. This is a diagnostic: slow convergence near the solution is evidence of model misspecification.


Validation

The companion notebook implements LM from scratch (without scipy.optimize) and verifies:

  1. The gain ratio ρ\rho is close to 1 near the solution (linear model is accurate).
  2. The step norm δλ\|\delta_\lambda\| decreases monotonically as λ\lambda increases.
  3. λ\lambda is reduced (toward Gauss-Newton) when the objective decreases well.
  4. Convergence to the correct solution on a 2-parameter test problem with known analytic answer.
  5. The QR and Cholesky approaches give identical steps (within machine precision).
PRACTICE

Before opening the notebook: For r(θ)=(θ12,θ23)r(\theta) = (\theta_1 - 2, \theta_2 - 3)^\top with J=I2J = I_2 (identity Jacobian): (a) What is the Gauss-Newton step from θ=(0,0)\theta = (0, 0)? (b) What is the LM step for λ=1\lambda = 1, D=ID = I? (c) Verify that the LM step converges to the steepest descent direction as λ\lambda \to \infty.


Limitations

WARNING

Warning — Local Convergence Only. LM converges to a local minimum, not necessarily the global one. For non-convex problems (Heston, SABR), the basin of attraction is finite and starting far from the solution risks convergence to a spurious local minimum. Always validate the calibrated parameters against stylised facts (positive vol of vol, negative correlation for equity smiles) and run multiple starting points.

WARNING

Warning — λ\lambda Explosion. If λ\lambda grows to machine limits (e.g., λ>1030\lambda > 10^{30}) without the gradient criterion being satisfied, the algorithm has stalled: every candidate step is rejected as unpredictive, but no stationary point has been found. Root causes: (i) wrong initial point, (ii) numerical errors in the Jacobian, (iii) poorly scaled parameters. Fix by normalising parameters to O(1)O(1) before running LM.

Other limitations:

  • Parameter scaling: LM with D=ID = I is not scale-invariant. A parameter ξ=0.3\xi = 0.3 (vol of vol) and a parameter κ=2.0\kappa = 2.0 (mean reversion rate) live on different scales. Use Marquardt scaling D=diag(JJ)D = \mathrm{diag}(J^\top J) to compensate, or normalise each parameter to unit scale before running.

  • MC pricing noise: Standard LM assumes a deterministic objective. With MC pricing, the objective and Jacobian are noisy. Standard LM will oscillate rather than converge. Use noise-robust methods (average over seeds, use control variates, or switch to a derivative-free method).

  • Jacobian computation cost: Each iteration requires evaluating JJ, which costs pp or 2p2p pricing function evaluations for finite-difference derivatives. For p=5p = 5 Heston parameters and 40 instruments, one LM iteration requires 10–40 option price evaluations (each involving numerical integration). Analytic or AAD-computed Jacobians are essential at production speed.


Interview Angle

PRACTICE

L1 (Junior) — Typical questions:

  1. What problem does LM solve that Gauss-Newton does not? Expected: LM handles singular or ill-conditioned JJJ^\top J via damping. Also handles poor starting points by restricting step size. Know the basic update formula.

  2. What is the damping parameter λ\lambda and how is it updated? Expected: controls interpolation between GN and gradient descent. Decreased on good steps (high gain ratio), increased on bad steps (low gain ratio). Initialised proportionally to diagonal of JJJ^\top J.

  3. When would you say a calibration has converged? Expected: gradient norm small, step size small, residuals stable. Check all three — not just "ran for N iterations".

PRACTICE

L2 (Senior) — Typical questions:

  1. Derive the LM update step from the trust-region sub-problem. Expected: minimise r+Jδ2\|r + J\delta\|^2 subject to δD2Δ2\|\delta\|_D^2 \le \Delta^2. Via Lagrangian: (JJ+λD)δ=Jr(J^\top J + \lambda D)\delta = J^\top r where λ\lambda is the multiplier. Show that step norm is decreasing in λ\lambda.

  2. What is the gain ratio and why is it a good guide for updating λ\lambda? Expected: ratio of actual to predicted reduction in objective. ρ1\rho \approx 1 means linear model is accurate → trust it more (reduce λ\lambda). ρ<0\rho < 0 means the model overpredicted the gain → reduce trust (increase λ\lambda).

  3. Under what conditions does LM converge quadratically? Expected: residuals vanish at solution (zero-residual problem). Derive from Taylor expansion of r(θ(k+1))r(\theta^{(k+1)}) around θ\theta^*.

PRACTICE

L3 (Researcher) — Typical questions:

  1. Compare the LM algorithm to Newton's method. Why do we use the Gauss-Newton Hessian approximation rather than the full Hessian? Expected: full Hessian requires iri2fi\sum_i r_i \nabla^2 f_i — expensive second-order derivatives. Gauss-Newton uses only JJJ^\top J, which is cheaper and always PSD. Full Newton converges faster away from the solution but the GN approximation is asymptotically equivalent near a zero-residual solution. Trade-off is computation vs. convergence radius.

  2. In LMM calibration with 50 forward rates, the parameter space is high-dimensional. How would you modify the LM algorithm to exploit structure in the problem? Expected: block diagonal structure of JJJ^\top J (rates are coupled through swap weights only), cascade calibration (sequential single-expiry fits), rank-reduction of the correlation matrix. Discussion of whether LM is the right tool vs. quasi-Newton or trust-region Newton with sparse linear algebra.

  3. Prove that the LM step is a descent direction for any λ>0\lambda > 0. Expected: Lδλ=rJ(JJ+λD)1Jr0\nabla \mathcal{L}^\top \delta_\lambda = -r^\top J (J^\top J + \lambda D)^{-1} J^\top r \le 0 since (JJ+λD)1(J^\top J + \lambda D)^{-1} is PD. Equality iff Jr=0J^\top r = 0 (stationary point).