Setup
Direct solvers — LU factorisation (Module 2) and Cholesky (Module 2) — cost flops and require storage. For a Crank-Nicolson finite difference grid with spatial nodes, forming and factorising the full stiffness matrix is flops and 800 MB of storage — impractical. For a 2-factor interest rate PDE on a grid, the system is — far beyond direct methods.
Iterative solvers exploit the structure of the system: they require only matrix-vector products , never the factorisation. For sparse matrices (finite difference stencils, sparse covariance models), each costs rather than . The cost of the solve is where depends on the condition number.
Where iterative solvers appear in quant finance:
-
PDE pricing (finite differences). Black-Scholes and Heston PDEs on dense grids produce large tridiagonal or banded systems at each time step. For 1D grids the Thomas algorithm (a direct tridiagonal solver, ) is standard; for 2D grids, alternating direction implicit (ADI) splits into 1D tridiagonals. For 3D or coupled systems, iterative solvers (PCG) are used.
-
Yield curve calibration. Bootstrapping a swap curve with pillars gives an lower-triangular system (direct, trivial). Fitting a regularised Nelson-Siegel-Svensson model to market quotes leads to a normal-equations solve — either direct (via Cholesky) or iterative (CG) depending on .
-
Portfolio optimisation. Mean-variance with assets and linear constraints gives a KKT system of dimension . Cholesky is fine here. But factor-model covariance (low-rank plus diagonal) inverts analytically via Woodbury — no solver needed.
-
Large-scale calibration of stochastic vol models. Iterative refinement is used when the Jacobian is large and only approximately known (numerical differentiation), making direct factorisation unreliable.
Mathematical setting. We focus on three cases:
- SPD system with sparse and symmetric positive definite: solved by Conjugate Gradient (CG).
- Non-symmetric system: GMRES (Generalised Minimal RESidual).
- Overdetermined LS : LSQR.
Conventions. SPD unless stated. is the residual at step . is the error. is the A-norm (energy norm).
Financial Insight. In production calibration at a bank, iterative solvers are used not just for speed but for robustness: they can be stopped early (before convergence) when the residual is below market-quote precision. Solving to accuracy when quotes are only known to wastes computation. Early stopping is equivalent to a form of regularisation — an important practical insight.
Theory
1. Quadratic Minimisation and the CG Motivation
For SPD, solving is equivalent to minimising the strictly convex quadratic: The gradient is . The minimum is at .
Steepest descent. The simplest iterative approach: move in the direction of the residual by the optimal step size:
Definition 5.1 (Steepest Descent). Given , iterate: The step is chosen to minimise over .
Convergence of steepest descent:
Theorem 5.1. Let . Steepest descent satisfies: For ill-conditioned problems (), the convergence factor approaches 1 — very slow.
The fundamental problem with steepest descent: consecutive search directions in the Euclidean inner product, but they are not -orthogonal. The method revisits the same subspaces, causing "zigzag" convergence along narrow valleys of when eigenvalues are spread widely.
2. Conjugate Gradient Method
The CG method eliminates the redundant search directions by enforcing A-conjugacy:
Definition 5.2 (A-conjugate directions). Vectors are A-conjugate (A-orthogonal) if . A set is A-conjugate if for all .
Theorem 5.2 (CG Algorithm — Hestenes-Stiefel, 1952). Starting from , , , iterate:
Key properties (proofs by induction on the Krylov subspace structure):
- for (residuals are mutually orthogonal in the Euclidean inner product).
- for (search directions are A-conjugate).
- minimises over the Krylov subspace .
- In exact arithmetic, CG terminates in at most steps with the exact solution.
CG convergence rate:
Theorem 5.3 (CG convergence). The CG iterates satisfy: For : CG convergence factor ; steepest descent factor .
Comparison: CG depends on rather than . For (typical calibration matrix):
- Steepest descent: iterations to reduce error by .
- CG: iterations (100× faster).
Example 5.1 (Tridiagonal FD system). A Crank-Nicolson price step for a node grid gives a tridiagonal SPD matrix with (for the standard 1D heat equation stencil). CG converges in iterations to machine precision — competitive with the Thomas algorithm ( steps). For 2D ADI, the subproblems are 1D tridiagonals, so Thomas algorithm dominates.
3. Preconditioning
The convergence rate of CG depends entirely on . Preconditioning replaces the original system with a better-conditioned equivalent.
Definition 5.3 (Preconditioned CG). Given a preconditioner (symmetric positive definite, easy to invert), solve the equivalent system: The effective condition number is . If , and CG converges in one step.
Implementation. In practice, PCG avoids computing explicitly. The residual preconditioned step replaces with , and the inner products become .
Common preconditioners:
| Preconditioner | Construction cost | Application cost | Best for |
|---|---|---|---|
| Diagonal (Jacobi) | Diagonally dominant | ||
| Incomplete Cholesky (IC0) | SPD sparse matrices | ||
| SSOR | Banded / finite difference | ||
| AMG (algebraic multigrid) | Elliptic PDE systems |
Remark. For yield curve calibration, the normal equations with Tikhonov regularisation have an effective condition number when dominates the small singular values. Diagonal preconditioning (trivially ) reduces by the regularisation. This is one way to see why regularisation aids iterative convergence.
4. Thomas Algorithm: Direct O(n) for Tridiagonal Systems
Before covering LSQR, it is important to note the most practically used "iterative-like" solver in quant finance: the Thomas algorithm (tridiagonal matrix algorithm), which is in fact a direct solver.
Definition 5.4 (Thomas algorithm). For the tridiagonal system the Thomas algorithm performs Gaussian elimination in flops via forward sweep (eliminating ) and backward substitution.
Stability. The Thomas algorithm is numerically stable when the matrix is diagonally dominant () — which holds for all standard Crank-Nicolson and Crank-Nicolson-like FD time-stepping schemes. No pivoting is needed.
Cost: flops — orders of magnitude cheaper than LU () for 1D grids. For a grid with nodes, Thomas costs flops vs LU's — a factor of speedup.
5. LSQR: Iterative Least Squares
For large overdetermined LS problems where is sparse, forming explicitly destroys sparsity and squares the condition number (Module 4). LSQR (Paige-Saunders, 1982) solves the LS problem iteratively using only matrix-vector products with and .
Theorem 5.4 (LSQR). LSQR applies the Lanczos bidiagonalisation to build a sequence of iterates minimising over . It satisfies:
- minimises the LS residual over the -dimensional Krylov subspace.
- Convergence rate: — same dependence as CG on (not as for normal equations via CG).
- Numerically equivalent to CG applied to , but avoids forming .
When to use LSQR vs direct methods:
- : use direct (SciPy
lstsqvia LAPACK DGELSD). - , sparse: use LSQR (SciPy
lsqr). - : use LSQR with a good preconditioner on .
6. Stopping Criteria
Warning. Iterative solvers require explicit stopping criteria. The standard choices are:
- Absolute residual: stop when .
- Relative residual: stop when .
- Maximum iterations: always set a cap to avoid infinite loops on divergent problems.
For calibration, set to match quote precision (e.g., for bid-ask), not machine precision (). Over-solving wastes time and does not improve the hedge.
Validation
The companion notebook verifies:
- Steepest descent vs CG convergence curves for a SPD matrix with known .
- CG terminates in steps on an SPD system in exact arithmetic (within floating-point tolerance).
- A-conjugacy: for all in the CG sequence.
- Preconditioning: diagonal preconditioner reduces iteration count on a poorly scaled matrix.
- Thomas algorithm: tridiagonal solve matches
scipy.linalg.solve_bandedto machine precision. - LSQR convergence: matches the direct LS solution on a sparse overdetermined system.
Hand exercise. Consider the SPD system , , starting from .
(a) Compute . What is ?
(b) Compute .
(c) Compute and .
(d) For a SPD system, CG converges in at most 2 steps. Is ?
(Answer: ; ; ; ; ; ; ; ✓)
Limitations
Warning: CG requires to be SPD. If is only SPSD (rank-deficient), CG breaks down when the search direction becomes -singular (). For indefinite or non-symmetric , use MINRES (symmetric indefinite) or GMRES (non-symmetric). GMRES stores all previous residuals and may require restarts for large problems.
Warning: floating-point loss of orthogonality. In finite-precision arithmetic, the theoretical A-conjugacy degrades as grows. For ill-conditioned problems, "ghost" eigenvalues (Ritz values that have already converged) pollute the Krylov subspace, causing CG to stagnate. Remedies: periodic restart, re-orthogonalisation (Lanczos with full re-ortho), or switching to a robust direct solver once the problem is small enough.
Warning: Thomas algorithm fails with exact zeros on the diagonal. If any pivot becomes zero during the forward sweep, the algorithm fails. For FD schemes from parabolic PDEs with positive coefficients, this cannot happen — but for schemes with non-standard boundary conditions, check diagonal dominance before applying Thomas.
Practical failure modes:
- Slow convergence: too large without a preconditioner. Symptom: residual stagnates after initial rapid drop. Fix: apply IC0 or block Jacobi preconditioning.
- Divergence in GMRES: restart frequency too low, or the matrix is nearly singular. Increase restart frequency or switch to a direct solve on the ill-conditioned subproblem.
- LSQR damping: using LSQR with a damping parameter solves the Tikhonov-regularised LS — equivalent to the direct Tikhonov formula but without forming .
Interview Angle
L1 (Junior Quant Developer).
Expected depth: algorithm description, when to use which solver.
-
"What is the conjugate gradient method and when would you use it over LU factorisation?" Answer: CG is an iterative solver for SPD that only requires matrix-vector products — cost vs for LU. Use CG when is large and sparse (e.g., FD stiffness matrix, sparse covariance). Use LU when is small () or dense.
-
"Why does CG converge faster than steepest descent?" Answer: Convergence of steepest descent depends on ; CG depends on . CG eliminates redundant search directions by enforcing A-conjugacy, so it spans a new dimension of the Krylov subspace at each step. For : steepest descent needs iterations; CG needs .
-
"What does the Thomas algorithm do, and why is it used in FD pricing?" Answer: The Thomas algorithm is a direct solver for tridiagonal systems. Crank-Nicolson PDE pricing produces one tridiagonal system per time step. Thomas is the standard solver — far faster than general LU () and exploits the exact sparsity pattern.
L2 (Senior Quant / Quant Developer).
Expected depth: derivation of CG update, convergence proof sketch, preconditioning design.
-
"Derive the CG step size from the A-conjugacy requirements." Answer: minimises over . Differentiating: . So . By construction , and subsequently (provable by induction from the Gram-Schmidt A-conjugation step).
-
"A Heston PDE on a (spot × vol) grid gives a sparse system at each time step. What solver strategy would you recommend?" Answer: ADI (alternating direction implicit) splits the 2D problem into two 1D tridiagonal solves per time step — each solved with the Thomas algorithm in . If ADI is not accurate enough (e.g., near expiry with large mixed derivative terms), use PCG with IC0 preconditioning. Target residual tolerance (matches pricing accuracy requirement of relative price error).
L3 (Quant Researcher / Model Risk).
Expected depth: Krylov subspace theory, connection to polynomials and eigenvalues, early stopping as regularisation.
-
"Explain why CG convergence depends on the distribution of eigenvalues of , not just on ." Answer: CG minimises over the Krylov subspace . The error satisfies , where is the set of degree- polynomials. If eigenvalues cluster in a few groups (e.g., distinct eigenvalues), CG converges in exactly steps regardless of . The worst case is uniformly spread eigenvalues (where Chebyshev polynomials give the bound). For FD matrices, eigenvalues cluster near two values at fine grids — explaining why PCG with a good preconditioner converges in far fewer steps than the bound predicts.
-
"In what sense is early stopping of an iterative solver equivalent to regularisation, and how would you set the tolerance for a calibration problem?" Answer: CG at iteration produces the minimiser of over . This is equivalent to applying the degree- polynomial filter to the eigenvalues of . Small eigenvalues of (corresponding to large eigenvalues of , i.e., ill-conditioned directions) are filtered out in early iterations — CG only resolves them once is large enough that spans those directions. Stopping early thus suppresses contributions from the null-space-proximate directions, exactly like TSVD regularisation (Module 4). The optimal tolerance is where is the data noise level — solving beyond this adds computational cost without improving the solution quality beyond what the data supports.