Linear AlgebraIterative SolversConjugate GradientNumerical MethodsCalibration

Iterative Solvers for Calibration Problems

Module 5 of 525 min readLevel: Hard

Setup

Direct solvers — LU factorisation (Module 2) and Cholesky (Module 2) — cost O(n3)O(n^3) flops and require O(n2)O(n^2) storage. For a Crank-Nicolson finite difference grid with N=104N = 10^4 spatial nodes, forming and factorising the full stiffness matrix is 101210^{12} flops and 800 MB of storage — impractical. For a 2-factor interest rate PDE on a 200×200200 \times 200 grid, the system is n=4×104n = 4 \times 10^4 — far beyond direct methods.

Iterative solvers exploit the structure of the system: they require only matrix-vector products ApAp, never the factorisation. For sparse matrices (finite difference stencils, sparse covariance models), each ApAp costs O(n)O(n) rather than O(n2)O(n^2). The cost of the solve is O(nkiter)O(n \cdot k_\text{iter}) where kiterk_\text{iter} depends on the condition number.

Where iterative solvers appear in quant finance:

  1. PDE pricing (finite differences). Black-Scholes and Heston PDEs on dense grids produce large tridiagonal or banded systems at each time step. For 1D grids the Thomas algorithm (a direct tridiagonal solver, O(n)O(n)) is standard; for 2D grids, alternating direction implicit (ADI) splits into 1D tridiagonals. For 3D or coupled systems, iterative solvers (PCG) are used.

  2. Yield curve calibration. Bootstrapping a swap curve with nn pillars gives an n×nn \times n lower-triangular system (direct, trivial). Fitting a regularised Nelson-Siegel-Svensson model to mnm \gg n market quotes leads to a normal-equations solve (AA+λI)x=Ab(A^\top A + \lambda I)x = A^\top b — either direct (via Cholesky) or iterative (CG) depending on nn.

  3. Portfolio optimisation. Mean-variance with n=500n = 500 assets and kk linear constraints gives a KKT system of dimension n+k600n + k \approx 600. Cholesky is fine here. But factor-model covariance Σ=FF+D\Sigma = F F^\top + D (low-rank plus diagonal) inverts analytically via Woodbury — no solver needed.

  4. Large-scale calibration of stochastic vol models. Iterative refinement is used when the Jacobian is large and only approximately known (numerical differentiation), making direct factorisation unreliable.

Mathematical setting. We focus on three cases:

  • SPD system Ax=bAx = b with AA sparse and symmetric positive definite: solved by Conjugate Gradient (CG).
  • Non-symmetric system: GMRES (Generalised Minimal RESidual).
  • Overdetermined LS minAxb2\min \|Ax - b\|_2: LSQR.

Conventions. ARn×nA \in \mathbb{R}^{n \times n} SPD unless stated. rk=bAxkr_k = b - Ax_k is the residual at step kk. ek=xxke_k = x^* - x_k is the error. vA=vAv\|v\|_A = \sqrt{v^\top A v} is the A-norm (energy norm).

INSIGHT

Financial Insight. In production calibration at a bank, iterative solvers are used not just for speed but for robustness: they can be stopped early (before convergence) when the residual is below market-quote precision. Solving to 101010^{-10} accuracy when quotes are only known to 10410^{-4} wastes computation. Early stopping is equivalent to a form of regularisation — an important practical insight.


Theory

1. Quadratic Minimisation and the CG Motivation

For AA SPD, solving Ax=bAx = b is equivalent to minimising the strictly convex quadratic: f(x)=12xAxbx.f(x) = \frac{1}{2} x^\top A x - b^\top x. The gradient is f(x)=Axb=r(x)\nabla f(x) = Ax - b = -r(x). The minimum is at x=A1bx^* = A^{-1}b.

Steepest descent. The simplest iterative approach: move in the direction of the residual rk=bAxkr_k = b - Ax_k by the optimal step size:

DEFINITION

Definition 5.1 (Steepest Descent). Given x0x_0, iterate: αk=rkrkrkArk,xk+1=xk+αkrk.\alpha_k = \frac{r_k^\top r_k}{r_k^\top A r_k}, \qquad x_{k+1} = x_k + \alpha_k r_k. The step αk\alpha_k is chosen to minimise f(xk+αrk)f(x_k + \alpha r_k) over αR\alpha \in \mathbb{R}.

Convergence of steepest descent:

THEOREM

Theorem 5.1. Let κ=κ2(A)=λmax/λmin\kappa = \kappa_2(A) = \lambda_\text{max}/\lambda_\text{min}. Steepest descent satisfies: ek+1A(κ1κ+1)2ekA.\|e_{k+1}\|_A \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^2 \|e_k\|_A. For ill-conditioned problems (κ1\kappa \gg 1), the convergence factor approaches 1 — very slow.

The fundamental problem with steepest descent: consecutive search directions rk+1rkr_{k+1} \perp r_k in the Euclidean inner product, but they are not AA-orthogonal. The method revisits the same subspaces, causing "zigzag" convergence along narrow valleys of ff when eigenvalues are spread widely.

2. Conjugate Gradient Method

The CG method eliminates the redundant search directions by enforcing A-conjugacy:

DEFINITION

Definition 5.2 (A-conjugate directions). Vectors p,qp, q are A-conjugate (A-orthogonal) if pAq=0p^\top A q = 0. A set {p0,,pk1}\{p_0, \ldots, p_{k-1}\} is A-conjugate if piApj=0p_i^\top A p_j = 0 for all iji \neq j.

THEOREM

Theorem 5.2 (CG Algorithm — Hestenes-Stiefel, 1952). Starting from x0x_0, r0=bAx0r_0 = b - Ax_0, p0=r0p_0 = r_0, iterate: αk=rkrkpkApk,xk+1=xk+αkpk,rk+1=rkαkApk,\alpha_k = \frac{r_k^\top r_k}{p_k^\top A p_k}, \quad x_{k+1} = x_k + \alpha_k p_k, \quad r_{k+1} = r_k - \alpha_k A p_k, βk=rk+1rk+1rkrk,pk+1=rk+1+βkpk.\beta_k = \frac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k}, \quad p_{k+1} = r_{k+1} + \beta_k p_k.

Key properties (proofs by induction on the Krylov subspace structure):

  • rirjr_i \perp r_j for iji \neq j (residuals are mutually orthogonal in the Euclidean inner product).
  • piApj=0p_i^\top A p_j = 0 for iji \neq j (search directions are A-conjugate).
  • xkx_k minimises f(x)=12xAxbxf(x) = \frac{1}{2}x^\top Ax - b^\top x over the Krylov subspace Kk=span{r0,Ar0,,Ak1r0}\mathcal{K}_k = \text{span}\{r_0, Ar_0, \ldots, A^{k-1}r_0\}.
  • In exact arithmetic, CG terminates in at most nn steps with the exact solution.

CG convergence rate:

THEOREM

Theorem 5.3 (CG convergence). The CG iterates satisfy: ekA2(κ1κ+1)ke0A.\|e_k\|_A \leq 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k \|e_0\|_A. For κ=100\kappa = 100: CG convergence factor 0.82\approx 0.82; steepest descent factor 0.96\approx 0.96.

Comparison: CG depends on κ\sqrt{\kappa} rather than κ\kappa. For κ=104\kappa = 10^4 (typical calibration matrix):

  • Steepest descent: 104\approx 10^4 iterations to reduce error by 10410^{-4}.
  • CG: 100\approx 100 iterations (100× faster).
EXAMPLE

Example 5.1 (Tridiagonal FD system). A Crank-Nicolson price step for a n=500n = 500 node grid gives a tridiagonal SPD matrix AA with κ4n2/π2105\kappa \approx 4n^2/\pi^2 \approx 10^5 (for the standard 1D heat equation stencil). CG converges in 105316\approx \sqrt{10^5} \approx 316 iterations to machine precision — competitive with the Thomas algorithm (O(n)=O(500)O(n) = O(500) steps). For 2D ADI, the subproblems are 1D tridiagonals, so Thomas algorithm dominates.

3. Preconditioning

The convergence rate of CG depends entirely on κ2(A)\kappa_2(A). Preconditioning replaces the original system with a better-conditioned equivalent.

DEFINITION

Definition 5.3 (Preconditioned CG). Given a preconditioner MAM \approx A (symmetric positive definite, easy to invert), solve the equivalent system: M1/2AM1/2x^=M1/2b,x^=M1/2x.M^{-1/2} A M^{-1/2} \hat x = M^{-1/2} b, \quad \hat x = M^{1/2} x. The effective condition number is κ2(M1A)\kappa_2(M^{-1}A). If M=AM = A, κ=1\kappa = 1 and CG converges in one step.

Implementation. In practice, PCG avoids computing M1/2M^{-1/2} explicitly. The residual preconditioned step replaces rkr_k with zk=M1rkz_k = M^{-1} r_k, and the inner products rkrkr_k^\top r_k become rkzkr_k^\top z_k.

Common preconditioners:

PreconditionerConstruction costApplication costBest for
Diagonal (Jacobi)O(n)O(n)O(n)O(n)Diagonally dominant AA
Incomplete Cholesky (IC0)O(nnz)O(\text{nnz})O(nnz)O(\text{nnz})SPD sparse matrices
SSORO(1)O(1)O(nnz)O(\text{nnz})Banded / finite difference
AMG (algebraic multigrid)O(nlogn)O(n \log n)O(n)O(n)Elliptic PDE systems
REMARK

Remark. For yield curve calibration, the normal equations (AA+λI)x=Ab(A^\top A + \lambda I)x = A^\top b with Tikhonov regularisation have an effective condition number κ2(AA+λI)σ12/λ\kappa_2(A^\top A + \lambda I) \approx \sigma_1^2 / \lambda when λ\lambda dominates the small singular values. Diagonal preconditioning M=λIM = \lambda I (trivially M1=I/λM^{-1} = I/\lambda) reduces κ\kappa by the regularisation. This is one way to see why regularisation aids iterative convergence.

4. Thomas Algorithm: Direct O(n) for Tridiagonal Systems

Before covering LSQR, it is important to note the most practically used "iterative-like" solver in quant finance: the Thomas algorithm (tridiagonal matrix algorithm), which is in fact a direct solver.

DEFINITION

Definition 5.4 (Thomas algorithm). For the tridiagonal system aixi1+bixi+cixi+1=di,i=1,,n,a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i, \quad i = 1, \ldots, n, the Thomas algorithm performs Gaussian elimination in O(n)O(n) flops via forward sweep (eliminating aia_i) and backward substitution.

Stability. The Thomas algorithm is numerically stable when the matrix is diagonally dominant (bi>ai+ci|b_i| > |a_i| + |c_i|) — which holds for all standard Crank-Nicolson and Crank-Nicolson-like FD time-stepping schemes. No pivoting is needed.

Cost: O(8n)O(8n) flops — orders of magnitude cheaper than LU (O(n3/3)O(n^3/3)) for 1D grids. For a grid with n=104n = 10^4 nodes, Thomas costs 8×1048 \times 10^4 flops vs LU's 3×10113 \times 10^{11} — a factor of 4×1064 \times 10^6 speedup.

5. LSQR: Iterative Least Squares

For large overdetermined LS problems minAxb2\min \|Ax - b\|_2 where ARm×nA \in \mathbb{R}^{m \times n} is sparse, forming AAA^\top A explicitly destroys sparsity and squares the condition number (Module 4). LSQR (Paige-Saunders, 1982) solves the LS problem iteratively using only matrix-vector products with AA and AA^\top.

THEOREM

Theorem 5.4 (LSQR). LSQR applies the Lanczos bidiagonalisation to build a sequence of iterates xkx_k minimising Axb2\|Ax - b\|_2 over Kk(AA,Ab)\mathcal{K}_k(A^\top A, A^\top b). It satisfies:

  • xkx_k minimises the LS residual over the kk-dimensional Krylov subspace.
  • Convergence rate: O ⁣((κ2(A)1κ2(A)+1)k)O\!\left(\left(\frac{\sqrt{\kappa_2(A)}-1}{\sqrt{\kappa_2(A)}+1}\right)^k\right) — same dependence as CG on κ2(A)\sqrt{\kappa_2(A)} (not κ2(A)2\kappa_2(A)^2 as for normal equations via CG).
  • Numerically equivalent to CG applied to AAx=AbA^\top A x = A^\top b, but avoids forming AAA^\top A.

When to use LSQR vs direct methods:

  • n<103n < 10^3: use direct (SciPy lstsq via LAPACK DGELSD).
  • n103n \geq 10^3, AA sparse: use LSQR (SciPy lsqr).
  • n104n \geq 10^4: use LSQR with a good preconditioner on AAA^\top A.

6. Stopping Criteria

WARNING

Warning. Iterative solvers require explicit stopping criteria. The standard choices are:

  • Absolute residual: stop when rk2εabs\|r_k\|_2 \leq \varepsilon_\text{abs}.
  • Relative residual: stop when rk2/b2εrel\|r_k\|_2 / \|b\|_2 \leq \varepsilon_\text{rel}.
  • Maximum iterations: always set a cap to avoid infinite loops on divergent problems.

For calibration, set εrel\varepsilon_\text{rel} to match quote precision (e.g., 10410^{-4} for bid-ask), not machine precision (101610^{-16}). Over-solving wastes time and does not improve the hedge.


Validation

The companion notebook verifies:

  1. Steepest descent vs CG convergence curves for a 50×5050 \times 50 SPD matrix with known κ\kappa.
  2. CG terminates in n\leq n steps on an n×nn \times n SPD system in exact arithmetic (within floating-point tolerance).
  3. A-conjugacy: piApj=0p_i^\top A p_j = 0 for all iji \neq j in the CG sequence.
  4. Preconditioning: diagonal preconditioner reduces iteration count on a poorly scaled matrix.
  5. Thomas algorithm: O(n)O(n) tridiagonal solve matches scipy.linalg.solve_banded to machine precision.
  6. LSQR convergence: matches the direct LS solution on a sparse overdetermined system.
PRACTICE

Hand exercise. Consider the 2×22 \times 2 SPD system A=(4113)A = \begin{pmatrix} 4 & 1 \\ 1 & 3 \end{pmatrix}, b=(12)b = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, starting from x0=(0,0)x_0 = (0, 0)^\top.

(a) Compute r0=bAx0r_0 = b - Ax_0. What is p0p_0?

(b) Compute α0=r0r0/(r0Ar0)\alpha_0 = r_0^\top r_0 / (r_0^\top A r_0).

(c) Compute x1=x0+α0p0x_1 = x_0 + \alpha_0 p_0 and r1=r0α0Ap0r_1 = r_0 - \alpha_0 A p_0.

(d) For a 2×22 \times 2 SPD system, CG converges in at most 2 steps. Is r1r0r_1 \perp r_0?

(Answer: r0=(1,2)r_0 = (1,2)^\top; r0r0=5r_0^\top r_0 = 5; Ar0=(6,7)Ar_0 = (6, 7)^\top; r0Ar0=20r_0^\top Ar_0 = 20; α0=1/4\alpha_0 = 1/4; x1=(1/4,1/2)x_1 = (1/4, 1/2)^\top; r1=(16/4,27/4)=(1/2,1/4)r_1 = (1 - 6/4, 2 - 7/4)^\top = (-1/2, 1/4)^\top; r1r0=1/2+1/2=0r_1 \cdot r_0 = -1/2 + 1/2 = 0 ✓)


Limitations

WARNING

Warning: CG requires AA to be SPD. If AA is only SPSD (rank-deficient), CG breaks down when the search direction pkp_k becomes AA-singular (pkApk=0p_k^\top A p_k = 0). For indefinite or non-symmetric AA, use MINRES (symmetric indefinite) or GMRES (non-symmetric). GMRES stores all previous residuals and may require restarts for large problems.

WARNING

Warning: floating-point loss of orthogonality. In finite-precision arithmetic, the theoretical A-conjugacy piApj=0p_i^\top Ap_j = 0 degrades as kk grows. For ill-conditioned problems, "ghost" eigenvalues (Ritz values that have already converged) pollute the Krylov subspace, causing CG to stagnate. Remedies: periodic restart, re-orthogonalisation (Lanczos with full re-ortho), or switching to a robust direct solver once the problem is small enough.

WARNING

Warning: Thomas algorithm fails with exact zeros on the diagonal. If any pivot biaici1/bi1b_i - a_i c_{i-1} / b_{i-1} becomes zero during the forward sweep, the algorithm fails. For FD schemes from parabolic PDEs with positive coefficients, this cannot happen — but for schemes with non-standard boundary conditions, check diagonal dominance before applying Thomas.

Practical failure modes:

  • Slow convergence: κ\kappa too large without a preconditioner. Symptom: residual stagnates after initial rapid drop. Fix: apply IC0 or block Jacobi preconditioning.
  • Divergence in GMRES: restart frequency too low, or the matrix is nearly singular. Increase restart frequency or switch to a direct solve on the ill-conditioned subproblem.
  • LSQR damping: using LSQR with a damping parameter λ>0\lambda > 0 solves the Tikhonov-regularised LS — equivalent to the direct Tikhonov formula but without forming AAA^\top A.

Interview Angle

PRACTICE

L1 (Junior Quant Developer).

Expected depth: algorithm description, when to use which solver.

  1. "What is the conjugate gradient method and when would you use it over LU factorisation?" Answer: CG is an iterative solver for SPD Ax=bAx = b that only requires matrix-vector products — cost O(nk)O(n \cdot k) vs O(n3)O(n^3) for LU. Use CG when AA is large and sparse (e.g., FD stiffness matrix, sparse covariance). Use LU when AA is small (n<103n < 10^3) or dense.

  2. "Why does CG converge faster than steepest descent?" Answer: Convergence of steepest descent depends on (κ1)/(κ+1)(\kappa-1)/(\kappa+1); CG depends on (κ1)/(κ+1)(\sqrt\kappa-1)/(\sqrt\kappa+1). CG eliminates redundant search directions by enforcing A-conjugacy, so it spans a new dimension of the Krylov subspace at each step. For κ=104\kappa = 10^4: steepest descent needs 104\sim 10^4 iterations; CG needs 100\sim 100.

  3. "What does the Thomas algorithm do, and why is it used in FD pricing?" Answer: The Thomas algorithm is a direct O(n)O(n) solver for tridiagonal systems. Crank-Nicolson PDE pricing produces one tridiagonal system per time step. Thomas is the standard solver — far faster than general LU (O(n3)O(n^3)) and exploits the exact sparsity pattern.

PRACTICE

L2 (Senior Quant / Quant Developer).

Expected depth: derivation of CG update, convergence proof sketch, preconditioning design.

  1. "Derive the CG step size αk\alpha_k from the A-conjugacy requirements." Answer: xk+1=xk+αkpkx_{k+1} = x_k + \alpha_k p_k minimises f(xk+αpk)f(x_k + \alpha p_k) over α\alpha. Differentiating: ddαf=pk(A(xk+αpk)b)=pkrk+αpkApk=0\frac{d}{d\alpha}f = p_k^\top(A(x_k + \alpha p_k) - b) = -p_k^\top r_k + \alpha p_k^\top A p_k = 0. So αk=pkrk/pkApk\alpha_k = p_k^\top r_k / p_k^\top A p_k. By construction p0=r0p_0 = r_0, and subsequently pkrk=rkrkp_k^\top r_k = r_k^\top r_k (provable by induction from the Gram-Schmidt A-conjugation step).

  2. "A Heston PDE on a 100×100100 \times 100 (spot × vol) grid gives a 104×10410^4 \times 10^4 sparse system at each time step. What solver strategy would you recommend?" Answer: ADI (alternating direction implicit) splits the 2D problem into two 1D tridiagonal solves per time step — each solved with the Thomas algorithm in O(n)O(n). If ADI is not accurate enough (e.g., near expiry with large mixed derivative terms), use PCG with IC0 preconditioning. Target residual tolerance 108\sim 10^{-8} (matches pricing accuracy requirement of 10810^{-8} relative price error).

PRACTICE

L3 (Quant Researcher / Model Risk).

Expected depth: Krylov subspace theory, connection to polynomials and eigenvalues, early stopping as regularisation.

  1. "Explain why CG convergence depends on the distribution of eigenvalues of AA, not just on κ2(A)\kappa_2(A)." Answer: CG minimises ekA\|e_k\|_A over the Krylov subspace Kk\mathcal{K}_k. The error satisfies ekAminpkPk,pk(0)=1maxλσ(A)pk(λ)e0A\|e_k\|_A \leq \min_{p_k \in \mathcal{P}_k, p_k(0)=1} \max_{\lambda \in \sigma(A)} |p_k(\lambda)| \|e_0\|_A, where Pk\mathcal{P}_k is the set of degree-kk polynomials. If eigenvalues cluster in a few groups (e.g., kk distinct eigenvalues), CG converges in exactly kk steps regardless of κ2\kappa_2. The worst case is uniformly spread eigenvalues (where Chebyshev polynomials give the κ\sqrt\kappa bound). For FD matrices, eigenvalues cluster near two values at fine grids — explaining why PCG with a good preconditioner converges in far fewer steps than the κ\sqrt\kappa bound predicts.

  2. "In what sense is early stopping of an iterative solver equivalent to regularisation, and how would you set the tolerance for a calibration problem?" Answer: CG at iteration kk produces the minimiser of ff over Kk\mathcal{K}_k. This is equivalent to applying the degree-kk polynomial filter pk(λ)p_k(\lambda) to the eigenvalues of A1A^{-1}. Small eigenvalues of AA (corresponding to large eigenvalues of A1A^{-1}, i.e., ill-conditioned directions) are filtered out in early iterations — CG only resolves them once kk is large enough that Kk\mathcal{K}_k spans those directions. Stopping early thus suppresses contributions from the null-space-proximate directions, exactly like TSVD regularisation (Module 4). The optimal tolerance is εrelδb/b\varepsilon_\text{rel} \approx \|\delta b\|/\|b\| where δb\delta b is the data noise level — solving beyond this adds computational cost without improving the solution quality beyond what the data supports.