CalibrationNumerical OptimisationHeston ModelImplied Volatility

Levenberg-Marquardt for Model Calibration

25 min readLevel: Hard

Setup

The Calibration Problem

Model calibration is the process of finding parameter values θRp\theta \in \mathbb{R}^p such that a pricing model reproduces observed market prices. In quantitative finance, this typically means:

  • Given: market-implied vols σ^i\hat{\sigma}_i for instruments i=1,,Ni = 1, \ldots, N (strikes KiK_i, maturities TiT_i, put/call flags).
  • Find: model parameters θ=(θ1,,θp)\theta = (\theta_1, \ldots, \theta_p) minimising the weighted sum of squared residuals.

For the Heston model, θ=(κ,νˉ,ξ,ρ,ν0)\theta = (\kappa, \bar{\nu}, \xi, \rho, \nu_0) (mean-reversion speed, long-run variance, vol-of-vol, correlation, initial variance). For SABR, θ=(α,β,ρ,ν)\theta = (\alpha, \beta, \rho, \nu).

Why implied vols, not prices? Fitting prices across different maturities and strikes without normalisation gives excessive weight to long-dated options (higher absolute prices). Fitting implied vols gives roughly equal weight across the surface and aligns with how traders think about calibration quality.

Assumptions

  1. The pricing function σmodel(K,T;θ)\sigma_{\mathrm{model}}(K, T; \theta) is differentiable in θ\theta (needed for the Jacobian).
  2. NpN \geq p: at least as many market instruments as model parameters (identified system). For Heston, N=30100N = 30\text{–}100 instruments is typical.
  3. The loss surface has a local minimum near the initial guess θ0\theta_0. For most models, multiple local minima exist — global optimisation or multi-start is needed in production.

The Nonlinear Least-Squares Objective

Define residuals r:RpRNr: \mathbb{R}^p \to \mathbb{R}^N:

ri(θ)=wi(σmodel(Ki,Ti;θ)σ^i),i=1,,N,r_i(\theta) = w_i \bigl(\sigma_{\mathrm{model}}(K_i, T_i; \theta) - \hat{\sigma}_i\bigr), \qquad i = 1, \ldots, N,

where wi>0w_i > 0 are weights (typically proportional to the liquidity or inverse bid-ask spread of instrument ii). The objective is:

F(θ)=12r(θ)2=12i=1Nri(θ)2.F(\theta) = \frac{1}{2}\|r(\theta)\|^2 = \frac{1}{2}\sum_{i=1}^N r_i(\theta)^2.

The factor 1/21/2 is conventional and simplifies gradient expressions. The calibration seeks:

θ=argminθΘF(θ),\theta^* = \arg\min_{\theta \in \Theta} F(\theta),

subject to box constraints Θ={θ:θu}\Theta = \{\theta : \ell \leq \theta \leq u\} (e.g., κ>0\kappa > 0, ρ(1,1)\rho \in (-1, 1)).


Theory: Levenberg-Marquardt

Gauss-Newton as the Foundation

The gradient and Hessian of FF are:

F(θ)=J(θ)r(θ),2F(θ)=J(θ)J(θ)+i=1Nri(θ)2ri(θ),\nabla F(\theta) = J(\theta)^\top r(\theta), \qquad \nabla^2 F(\theta) = J(\theta)^\top J(\theta) + \sum_{i=1}^N r_i(\theta) \nabla^2 r_i(\theta),

where JRN×pJ \in \mathbb{R}^{N \times p} is the Jacobian matrix with Jij=ri/θjJ_{ij} = \partial r_i / \partial \theta_j.

The Gauss-Newton (GN) approximation drops the second-order term (the sum of ri2rir_i \nabla^2 r_i), giving:

2F(θ)JJ.\nabla^2 F(\theta) \approx J^\top J.

This is exact when the residuals rir_i are small (near the solution). The GN update solves:

(JJ)δθ=Jr,θθ+δθ.(J^\top J)\, \delta\theta = -J^\top r, \qquad \theta \leftarrow \theta + \delta\theta.

GN converges quadratically when the residuals are small and the problem is well-conditioned, but fails when JJJ^\top J is singular or ill-conditioned (degenerate directions in parameter space).

The LM Damping Term

Levenberg-Marquardt adds a damping term λD\lambda D to JJJ^\top J:

(JJ+λD)δθ=Jr,(J^\top J + \lambda D)\, \delta\theta = -J^\top r,

where λ>0\lambda > 0 is the damping parameter and DD is a diagonal matrix.

Two standard choices for DD:

  • Levenberg (1944): D=ID = I (identity). Moves in the steepest-descent direction when λ\lambda is large.
  • Marquardt (1963): D=diag(JJ)D = \mathrm{diag}(J^\top J). Scales each parameter direction by its gradient curvature, making the step invariant to parameter rescaling. This is the standard choice.

Interpolation between GN and gradient descent. When λ0\lambda \to 0: the equation becomes JJδθ=JrJ^\top J\, \delta\theta = -J^\top r, i.e., Gauss-Newton. When λ\lambda \to \infty: the update becomes δθ(λD)1Jr\delta\theta \approx -(\lambda D)^{-1} J^\top r, a scaled steepest-descent step. LM adaptively chooses between these extremes.

Geometric Interpretation

The LM step minimises a quadratic approximation to FF within a trust region:

δθLM=argminδθ{12Jδθ+r2}subject toδθΔ,\delta\theta_{\mathrm{LM}} = \arg\min_{\delta\theta} \left\{\tfrac{1}{2}\|J\delta\theta + r\|^2\right\} \quad \text{subject to} \quad \|\delta\theta\| \leq \Delta,

for some trust-region radius Δ\Delta. The damping λ\lambda is the Lagrange multiplier for the constraint. Large λ\lambda corresponds to a small trust region (conservative step); small λ\lambda allows a large step.

Damping Parameter Update Rule

A standard update rule (Marquardt 1963, as refined by Moré 1978):

  1. Evaluate proposed step: δθLM=(JJ+λD)1Jr\delta\theta_{\mathrm{LM}} = -(J^\top J + \lambda D)^{-1} J^\top r.
  2. Compute gain ratio: ρ=F(θ)F(θ+δθ)L(θ)L(θ+δθ),\rho = \frac{F(\theta) - F(\theta + \delta\theta)}{\mathcal{L}(\theta) - \mathcal{L}(\theta + \delta\theta)}, where L\mathcal{L} is the quadratic model: L(θ+δ)=F(θ)+δJr+12δJJδ\mathcal{L}(\theta + \delta) = F(\theta) + \delta^\top J^\top r + \tfrac{1}{2}\delta^\top J^\top J \delta.
  3. Accept the step if ρ>ρmin\rho > \rho_{\min} (e.g., ρmin=0.25\rho_{\min} = 0.25): set θθ+δθ\theta \leftarrow \theta + \delta\theta and decrease λ\lambda (e.g., λλ/3\lambda \leftarrow \lambda/3).
  4. Reject the step if ρρmin\rho \leq \rho_{\min}: increase λ\lambda (e.g., λλ×3\lambda \leftarrow \lambda \times 3) and retry.

This self-tuning makes LM robust: it contracts to gradient descent near bad regions and expands to GN near the solution.


Jacobian Computation

The Jacobian Jij=ri/θj=wiσmodel(Ki,Ti;θ)/θjJ_{ij} = \partial r_i / \partial \theta_j = w_i \, \partial \sigma_{\mathrm{model}}(K_i, T_i; \theta) / \partial \theta_j is the most expensive part of LM calibration.

Finite Differences (FD)

Forward differences (first-order accurate, order O(h)O(h)):

Jijri(θ+hej)ri(θ)h.J_{ij} \approx \frac{r_i(\theta + h e_j) - r_i(\theta)}{h}.

Requires pp extra model evaluations per iteration (one per parameter direction). For Heston with p=5p = 5 parameters and N=50N = 50 instruments, this is 5×50=2505 \times 50 = 250 Fourier integrals per LM step.

Central differences (second-order accurate, order O(h2)O(h^2)):

Jijri(θ+hej)ri(θhej)2h.J_{ij} \approx \frac{r_i(\theta + h e_j) - r_i(\theta - h e_j)}{2h}.

Twice the cost, but significantly more accurate for the same step size hh. Step size selection: hεmach107h \approx \sqrt{\varepsilon_{\mathrm{mach}}} \approx 10^{-7} for forward, hεmach1/3105h \approx \varepsilon_{\mathrm{mach}}^{1/3} \approx 10^{-5} for central.

Analytic Jacobian for Affine Models

For models with a closed-form characteristic function φT(u;θ)\varphi_T(u; \theta) (e.g., Heston, Bates), the sensitivity of the model price to each parameter can be computed by differentiating under the Fourier integral:

Cθj=θj[Re0φT(ui/2;θ)eiuku2+1/4du]=Re0φTθj(ui/2;θ)eiuku2+1/4du.\frac{\partial C}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \left[ \text{Re} \int_0^\infty \varphi_T(u - i/2; \theta) \frac{e^{-iuk}}{u^2 + 1/4}\, du \right] = \text{Re} \int_0^\infty \frac{\partial \varphi_T}{\partial \theta_j}(u - i/2; \theta) \frac{e^{-iuk}}{u^2 + 1/4}\, du.

For Heston, φT/θj\partial \varphi_T / \partial \theta_j can be computed in closed form by differentiating the Riccati solution. This yields the exact Jacobian at the cost of 5 additional Fourier integrals (one per parameter), which is far cheaper than FD (especially when each Fourier integral prices all NN strikes simultaneously via FFT).


Convergence Criteria

LM terminates when any of the following conditions is met:

  1. Small gradient: Jr<ε1\|J^\top r\|_\infty < \varepsilon_1 (first-order optimality; gradient norm small).
  2. Small step: δθ2<ε2(1+θ2)\|\delta\theta\|_2 < \varepsilon_2 (1 + \|\theta\|_2) (parameter change negligible).
  3. Small residual: F(θ)<ε3F(\theta) < \varepsilon_3 (objective small enough).
  4. Max iterations: k>kmaxk > k_{\max} (wall-clock budget exceeded).

In production, condition 1 is the cleanest termination test. The gradient norm Jr\|J^\top r\| is zero at any stationary point — including saddle points and local minima — so satisfying criterion 1 only guarantees a stationary point, not a global minimum. Always verify by restarting from alternative initial points.


Implementation

#include <Eigen/Dense>
#include <functional>
#include <vector>
#include <cmath>
#include <stdexcept>

// Calibration result
struct CalibResult {
    Eigen::VectorXd params;    // calibrated parameters
    double          final_rss; // residual sum of squares F(theta*)
    int             iters;     // LM iterations used
    bool            converged;
};

// Levenberg-Marquardt calibrator.
//
// residuals_fn: given theta (p-vector), returns r (N-vector) of weighted residuals
// jacobian_fn:  given theta, returns J (N x p matrix); pass nullptr for FD fallback
// theta0:       initial parameter guess
// lb, ub:       box constraints on parameters
// max_iter, tol: stopping criteria
CalibResult levenberg_marquardt(
    std::function<Eigen::VectorXd(const Eigen::VectorXd&)> residuals_fn,
    std::function<Eigen::MatrixXd(const Eigen::VectorXd&)> jacobian_fn,
    const Eigen::VectorXd& theta0,
    const Eigen::VectorXd& lb,
    const Eigen::VectorXd& ub,
    int    max_iter = 200,
    double tol_grad = 1.0e-8,
    double tol_step = 1.0e-12,
    double lambda0  = 1.0e-3
) {
    const int p = static_cast<int>(theta0.size());

    Eigen::VectorXd theta = theta0;
    Eigen::VectorXd r     = residuals_fn(theta);
    double F              = 0.5 * r.squaredNorm();

    double lambda = lambda0;
    bool converged = false;

    for (int k = 0; k < max_iter; ++k) {
        // Compute Jacobian
        Eigen::MatrixXd J = jacobian_fn
            ? jacobian_fn(theta)
            : [&]() {
                // Central-difference fallback
                Eigen::MatrixXd Jfd(r.size(), p);
                const double h = 1.0e-5;
                for (int j = 0; j < p; ++j) {
                    Eigen::VectorXd th_p = theta, th_m = theta;
                    th_p(j) += h; th_m(j) -= h;
                    Jfd.col(j) = (residuals_fn(th_p) - residuals_fn(th_m)) / (2.0 * h);
                }
                return Jfd;
            }();

        const Eigen::MatrixXd JtJ = J.transpose() * J;
        const Eigen::VectorXd Jtr = J.transpose() * r;

        // Convergence check: gradient norm
        if (Jtr.lpNorm<Eigen::Infinity>() < tol_grad) {
            converged = true;
            break;
        }

        // Marquardt scaling: D = diag(J^T J), clamped to avoid zero
        Eigen::VectorXd diag_D = JtJ.diagonal().cwiseMax(1.0e-16);

        // Solve (J^T J + lambda * diag(D)) * delta = -J^T r
        Eigen::MatrixXd A = JtJ;
        A.diagonal() += lambda * diag_D;
        const Eigen::VectorXd delta = A.ldlt().solve(-Jtr);

        if (delta.norm() < tol_step * (1.0 + theta.norm())) {
            converged = true;
            break;
        }

        // Candidate step, projected onto box constraints
        Eigen::VectorXd theta_new = (theta + delta).cwiseMax(lb).cwiseMin(ub);
        Eigen::VectorXd r_new     = residuals_fn(theta_new);
        double          F_new     = 0.5 * r_new.squaredNorm();

        // Gain ratio rho = (actual reduction) / (predicted reduction)
        double predicted = -Jtr.dot(delta) - 0.5 * delta.dot(JtJ * delta);
        double actual    = F - F_new;
        double rho       = (predicted > 0.0) ? actual / predicted : -1.0;

        if (rho > 0.25) {
            // Accept step
            theta = theta_new;
            r     = r_new;
            F     = F_new;
            // Decrease lambda: expand trust region
            lambda = std::max(lambda / 3.0, 1.0e-16);
        } else {
            // Reject step: contract trust region
            lambda = std::min(lambda * 3.0, 1.0e+16);
        }
    }

    return {theta, F, max_iter, converged};
}

Multi-Start Strategy

The LM algorithm is local. To avoid poor local minima, run multiple independent restarts from diverse initial points and select the solution with the lowest final objective:

// Simple multi-start wrapper
CalibResult multi_start_lm(
    std::function<Eigen::VectorXd(const Eigen::VectorXd&)> residuals_fn,
    std::function<Eigen::MatrixXd(const Eigen::VectorXd&)> jacobian_fn,
    const std::vector<Eigen::VectorXd>& starts,
    const Eigen::VectorXd& lb,
    const Eigen::VectorXd& ub
) {
    CalibResult best;
    best.final_rss = std::numeric_limits<double>::infinity();
    for (const auto& theta0 : starts) {
        auto result = levenberg_marquardt(residuals_fn, jacobian_fn, theta0, lb, ub);
        if (result.final_rss < best.final_rss)
            best = result;
    }
    return best;
}

Validation

Synthetic Calibration Test

Plant known Heston parameters, generate a synthetic vol surface via the Lewis formula, then calibrate and verify recovery. A well-implemented calibrator should recover planted parameters to within 10410^{-4} in absolute parameter error for a noise-free surface of N=50N = 50 instruments.

import numpy as np
from scipy.optimize import least_squares

# Reference: planted Heston parameters
true_params = dict(kappa=2.0, theta=0.04, xi=0.4, rho=-0.6, v0=0.04)

# Generate synthetic surface (assume heston_implied_vol is implemented)
strikes   = np.array([80, 90, 95, 100, 105, 110, 120] * 4, dtype=float)
maturities = np.repeat([0.25, 0.5, 1.0, 2.0], 7)
true_vols  = np.array([
    heston_implied_vol(K, T, **true_params)
    for K, T in zip(strikes, maturities)
])

# Residuals function
def residuals(x):
    kappa, theta, xi, rho, v0 = x
    model_vols = np.array([
        heston_implied_vol(K, T, kappa=kappa, theta=theta, xi=xi, rho=rho, v0=v0)
        for K, T in zip(strikes, maturities)
    ])
    return model_vols - true_vols  # weights = 1 here

# Bounds: kappa, theta, xi > 0; rho in (-1, 1); v0 > 0
bounds = ([1e-4, 1e-4, 1e-4, -0.999, 1e-4],
          [20.0,  1.0,  2.0,  0.999, 1.0])

# Initial guess — intentionally different from true values
x0 = [1.0, 0.06, 0.3, -0.4, 0.06]
result = least_squares(residuals, x0, bounds=bounds, method='trf', ftol=1e-12, xtol=1e-12)
print(f"Calibrated: kappa={result.x[0]:.4f}, theta={result.x[1]:.4f}, "
      f"xi={result.x[2]:.4f}, rho={result.x[3]:.4f}, v0={result.x[4]:.4f}")
print(f"True:       kappa=2.0000, theta=0.0400, xi=0.4000, rho=-0.6000, v0=0.0400")
print(f"Final RMS vol error: {np.sqrt(2 * result.cost / len(strikes)):.2e}")

Limitations

Local minima. The Heston loss surface contains ridges along which many parameter combinations produce nearly identical implied vol surfaces (e.g., the κθ=const\kappa\theta = \text{const} ridge). LM finds the nearest local minimum, which may be far from the global optimum. Multi-start or global optimisers (differential evolution, simulated annealing) are needed to escape these regions.

Parameter unidentifiability. The Heston model has five parameters but only two distinct effects dominate the implied vol surface: the ATM vol term structure (driven by κ\kappa, θ\theta, v0v_0) and the smile curvature (driven by ξ\xi and ρ\rho). The individual values of κ\kappa and θ\theta are poorly identified — many pairs with similar κθ\kappa\theta produce indistinguishable surfaces. Regularisation or parameter reduction is required in practice.

Jacobian quality. FD Jacobians introduce truncation error that degrades convergence near the solution. The optimal step size h(εmach)1/2h^* \approx (\varepsilon_{\mathrm{mach}})^{1/2} for forward FD is on the order of 10710^{-7}, but choosing hh too small introduces cancellation error. Central differences are safer but 2× as expensive. Analytic Jacobians are preferred whenever available.

Calibration instability over time. Daily recalibration of all five Heston parameters produces large day-to-day jumps even when the vol surface moves smoothly. This instability is the result of a flat loss surface, not numerical noise. Tikhonov regularisation (penalising deviation from yesterday's calibration) is the standard fix — see the Regularisation module.

Constraint handling. The LM algorithm as described assumes unconstrained optimisation. Box constraints are handled approximately by projecting the step onto [,u][\ell, u], which can cause the gain ratio ρ\rho to be computed incorrectly. A proper constrained LM (e.g., MINPACK's LMSTR) uses an active-set approach or interior-point scaling.


Interview Angle

L1. Explain the calibration setup for the Heston model. What is the objective function? Why use implied vols rather than prices as the calibration target?

Setup: Find θ=(κ,νˉ,ξ,ρ,ν0)\theta = (\kappa, \bar{\nu}, \xi, \rho, \nu_0) minimising iwi(σiHestonσ^i)2\sum_i w_i(\sigma_i^{\mathrm{Heston}} - \hat{\sigma}_i)^2. Implied vols rather than prices: prices have very different magnitudes across strikes and maturities (deep OTM options are worth cents, ATM options worth dollars). Fitting prices gives enormous weight to expensive options. Implied vols are approximately normalised, giving the calibrator roughly equal influence per market instrument.

L2. Derive the LM update equation. Explain what happens to the step as λ0\lambda \to 0 and λ\lambda \to \infty. What is the Marquardt scaling, and why is it preferable to the identity?

Derivation: LM minimises the second-order Taylor expansion of FF augmented by a penalty: minδ{12Jδ+r2+λ2δDδ}.\min_\delta \{\tfrac{1}{2}\|J\delta + r\|^2 + \tfrac{\lambda}{2}\delta^\top D \delta\}. Taking the gradient and setting to zero: (JJ+λD)δ=Jr(J^\top J + \lambda D)\delta = -J^\top r. As λ0\lambda \to 0: δ=(JJ)1Jr\delta = -(J^\top J)^{-1} J^\top r (Gauss-Newton, quadratic convergence near the solution). As λ\lambda \to \infty: δ(1/λ)D1Jr\delta \approx -(1/\lambda) D^{-1} J^\top r (scaled gradient descent, always descends). Marquardt scaling D=diag(JJ)D = \mathrm{diag}(J^\top J) makes the step scale-invariant: multiplying parameter θj\theta_j by a constant leaves the step unchanged, unlike the identity which biases towards parameters with small natural scale.

L3. Discuss the identifiability problem in Heston calibration. Formally characterise the ridge in the loss surface and propose a regularisation strategy. How would you construct the analytic Jacobian for the Heston model via the Lewis integral?

Identifiability: The ATM implied vol at maturity TT is approximately σATM(T)ν0eκT+νˉ(1eκT)\sigma_{\mathrm{ATM}}(T) \approx \sqrt{\nu_0 e^{-\kappa T} + \bar{\nu}(1 - e^{-\kappa T})}. For TT in a finite set, many combinations of (κ,ν0,νˉ)(\kappa, \nu_0, \bar{\nu}) produce the same sequence {σATM(Ti)}\{\sigma_{\mathrm{ATM}}(T_i)\}. The loss surface has a shallow valley along these combinations. Formally, the Hessian JJJ^\top J has small eigenvalues in these directions, producing a large condition number. Tikhonov regularisation adds λθθprev2\lambda \|\theta - \theta_{\mathrm{prev}}\|^2 to the objective, effectively constraining the solution to be near the previous day's calibration and bounding the condition number. The analytic Jacobian element σHeston/θj\partial \sigma^{\mathrm{Heston}} / \partial \theta_j is obtained by differentiating the Lewis integral: C/θj=Re0(φT/θj)(ui/2)eiuk/(u2+1/4)du\partial C / \partial \theta_j = \mathrm{Re} \int_0^\infty (\partial \varphi_T / \partial \theta_j)(u - i/2) \cdot e^{-iuk}/(u^2 + 1/4)\, du, where φT/θj\partial \varphi_T / \partial \theta_j follows from differentiating the Riccati ODE solutions A(u,T;θ)A(u,T;\theta) and B(u,T;θ)B(u,T;\theta) with respect to θj\theta_j.

Verify your understanding before moving on.

Start Quiz →