Brownian Bridge™

Setup

The Levenberg-Marquardt algorithm requires the Jacobian $J(\theta) \in \mathbb{R}^{N \times p}$ at every iteration. For a Heston calibration with $N = 40$ instruments and $p = 5$ parameters, $J$ has 200 entries. For an LMM calibration with $N = 100$ swaptions and $p = 50$ parameters, $J$ has 5000 entries — each requiring a pricing function evaluation or a derivative computation.

How $J$ is computed determines the speed and accuracy of the entire calibration pipeline. There are four approaches: (i) analytic differentiation, (ii) finite differences, (iii) complex step differentiation, and (iv) automatic differentiation (AAD). Each involves a different trade-off between implementation cost, accuracy, and computational load.

INSIGHT

Financial Insight. On a production vol desk, the Jacobian must be computed in milliseconds. A Heston calibration running every minute with 40 instruments and 5 parameters requires $5 \times 2 = 10$ additional function evaluations per iteration (central FD) on top of the base evaluation — a $10\times$ overhead. AAD reduces this to approximately $1\times$ (one additional backward pass), making real-time calibration feasible. Understanding this trade-off is the difference between a research-grade implementation and a production system.

Assumptions:

The pricing function $f: \mathbb{R}^p \to \mathbb{R}$ (or $f: \mathbb{R}^p \to \mathbb{R}^N$ ) is smooth in $\theta$ on the domain of interest.
Floating-point arithmetic: all computations are in IEEE 754 double precision, machine epsilon $\varepsilon_{\text{mach}} \approx 2.2 \times 10^{-16}$ .
For AAD: the pricing function is implemented as a sequence of elementary operations (addition, multiplication, $\exp$ , $\ln$ , etc.) — the standard assumption for any AD tool.

Theory

Finite Differences: Forward and Central

DEFINITION

Definition 3.1 (Finite Difference Approximations). For scalar $f: \mathbb{R} \to \mathbb{R}$ :

Forward difference (first order): $f'(x) \approx \frac{f(x + h) - f(x)}{h}$

Backward difference (first order): $f'(x) \approx \frac{f(x) - f(x - h)}{h}$

Central difference (second order): $f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$

Error analysis: By Taylor's theorem:

$f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(x) + \cdots$

For the forward difference: $\frac{f(x+h) - f(x)}{h} = f'(x) + \underbrace{\frac{h}{2} f''(x)}_{\text{truncation}} + O(h^2)$

Truncation error: $O(h)$ . Roundoff error from floating-point evaluation of $f(x+h) - f(x)$ : $O(\varepsilon_{\text{mach}} |f(x)| / h)$ .

For the central difference: $\frac{f(x+h) - f(x-h)}{2h} = f'(x) + \underbrace{\frac{h^2}{6} f'''(x)}_{\text{truncation}} + O(h^4)$

Truncation error: $O(h^2)$ . Roundoff error: $O(\varepsilon_{\text{mach}} |f(x)| / h)$ (same structure, since $f(x)$ cancels but floating-point errors in $f(x\pm h)$ persist).

THEOREM

Theorem 3.1 (Optimal Step Size for Finite Differences). The total error (truncation + roundoff) is minimised at:

Forward FD: $h^* \approx \sqrt{\varepsilon_{\text{mach}} |f| / |f''|} \approx \sqrt{\varepsilon_{\text{mach}}} \approx 10^{-8}$
Central FD: $h^* \approx \left(\frac{3\varepsilon_{\text{mach}} |f|}{|f'''|}\right)^{1/3} \approx \varepsilon_{\text{mach}}^{1/3} \approx 10^{-5}$

At the optimal $h$ , the forward FD achieves accuracy $O(\varepsilon_{\text{mach}}^{1/2})$ and the central FD achieves $O(\varepsilon_{\text{mach}}^{2/3})$ .

Central FD is $\sim 10^4$ times more accurate than forward FD at their respective optimal step sizes, at the cost of twice as many function evaluations. In calibration, the reduced error is almost always worth the extra evaluation.

Cost for the Jacobian:

Method	Evaluations per row of $J$	Total for $J \in \mathbb{R}^{N\times p}$
Forward FD	$p + 1$ (includes base)	$p + 1$ (one base, $p$ perturbed)
Central FD	$2p$	$2p$ (no base reuse)
Complex step	$p$	$p$ (same cost as forward, but no base cancel)

For the Jacobian, all $N$ rows share the same perturbations: compute $f(\theta + he_j)$ for $j=1,\ldots,p$ , compute all $N$ residuals at each perturbed point. Total: $p+1$ (forward) or $2p$ (central) calls to the full pricing engine (all $N$ instruments).

REMARK

Remark. For Heston ( $p = 5$ , central FD): 10 full pricing engine calls per LM iteration. For LMM ( $p = 50$ , central FD): 100 full pricing engine calls per iteration. With 50 LM iterations and 1 ms per pricing call, LMM calibration takes $100 \times 50 \times 1\,\text{ms} = 5\,\text{seconds}$ — acceptable for end-of-day but not for real-time.

The Complex Step Method

THEOREM

Theorem 3.2 (Complex Step Derivative). If $f: \mathbb{R} \to \mathbb{R}$ extends to an analytic function on a strip of the complex plane containing the real axis, then for any $h > 0$ : $f'(x) = \frac{\mathrm{Im}[f(x + ih)]}{h} + O(h^2)$ where the error is purely from the Taylor remainder, with no cancellation error from floating-point subtraction.

Derivation: By the Taylor expansion of $f$ in $\mathbb{C}$ : $f(x + ih) = f(x) + ih\,f'(x) - \frac{h^2}{2} f''(x) - \frac{ih^3}{6} f'''(x) + \cdots$

Taking the imaginary part: $\mathrm{Im}[f(x + ih)] = h f'(x) - \frac{h^3}{6} f'''(x) + \cdots$

Dividing by $h$ : $f'(x) = \mathrm{Im}[f(x+ih)]/h + O(h^2)$ .

The key advantage: there is no subtraction of nearly-equal quantities. Catastrophic cancellation — the main source of roundoff error in FD — does not occur. One can use $h = 10^{-200}$ and achieve machine-precision accuracy.

WARNING

Warning — Implementation Requirement. The complex step method requires evaluating $f$ with complex-number arguments. The entire pricing chain — characteristic functions, numerical integrations, norm distributions — must support complex input. This is straightforward for Black-Scholes (the NormCdf can be extended to complex arguments via the error function), but requires careful implementation for models involving branching logic (if x > 0). A naive implementation that uses fabs(x) (real absolute value) will give wrong derivatives when applied to a complex number.

Automatic Differentiation (AAD)

Finite differences and complex step compute one column of $J$ per perturbation (forward mode in the sense of perturbation direction). When $p \ll N$ (few parameters, many instruments), this is efficient. When $N \ll p$ or when computing a scalar objective gradient, the reverse mode of AAD computes all partial derivatives in a single backward pass.

DEFINITION

Definition 3.2 (Computation Graph and Tape). Any differentiable computation $f(\theta)$ can be written as a sequence of elementary operations $v_1, v_2, \ldots, v_M$ (a computation graph or tape). Each $v_k$ depends on previous nodes; the output is $v_M$ .

Forward mode (tangent): propagate $\dot{v}_k = \partial v_k / \partial \theta_j$ alongside values. Cost per derivative: one forward pass. Total for all $p$ derivatives: $p$ passes.

Reverse mode (adjoint): propagate adjoints $\bar{v}_k = \partial f / \partial v_k$ backward from output to inputs. Cost: one forward pass (to record the tape) plus one backward pass. Computes the full gradient $\nabla_\theta f$ in $O(1)$ passes — independent of $p$ .

THEOREM

Theorem 3.3 (Cost of Reverse-Mode AAD). For a scalar function $f: \mathbb{R}^p \to \mathbb{R}$ , reverse-mode AAD computes the full gradient $\nabla_\theta f \in \mathbb{R}^p$ in at most $c_{\text{rev}} \le 5$ times the cost of a single function evaluation, independent of $p$ .

For the calibration Jacobian $J \in \mathbb{R}^{N \times p}$ , reverse mode computes one row $\nabla_\theta f_i$ per backward pass — $N$ backward passes in total. Forward mode computes one column per pass — $p$ forward passes in total. The efficient choice depends on the ratio $N/p$ :

Mode	Passes	Prefer when
Forward (tangent)	$p$	$p \ll N$ (few parameters, many instruments)
Reverse (adjoint)	$N$	$N \ll p$ (many parameters, few instruments)
Central FD	$2p$	No AAD available; $p$ small

For Heston ( $p=5$ , $N=40$ ): forward mode (5 passes) vs. reverse mode (40 passes) — forward wins. For LMM ( $p=50$ , $N=100$ ): reverse mode (100 passes) vs. forward mode (50 passes) — forward mode still wins on pass count, but each pass is cheaper with AAD than with FD.

INSIGHT

Financial Insight. The true benefit of reverse-mode AAD is not in the Jacobian count — it is in computing the gradient of the scalar objective $\nabla_\theta \mathcal{L}$ in one backward pass. This is what makes AAD critical for gradient-based calibration: $\mathcal{L}(\theta) = \tfrac{1}{2}\|r(\theta)\|^2$ is scalar, and $\nabla_\theta \mathcal{L} = -J^\top r$ is computed in one backward pass regardless of $p$ . For LMM with $p = 50$ : one backward pass vs. 100 forward passes with FD. This is the $100\times$ speedup that motivates AAD implementation on production desks.

Application: Analytic Vega in Black-Scholes

For a European call under Black-Scholes:

$C(S, K, r, T, \sigma) = S\,N(d_+) - K e^{-rT} N(d_-)$

where $d_\pm = [\ln(S/K) + (r \pm \frac{\sigma^2}{2})T] / (\sigma\sqrt{T})$ .

The analytic derivative with respect to $\sigma$ (vega) is:

$\frac{\partial C}{\partial \sigma} = S\sqrt{T}\,n(d_+)$

where $n$ is the standard normal PDF. This is the benchmark for validating FD and AAD implementations: the complex step or reverse-mode AAD result must match this to within $O(\varepsilon_{\text{mach}})$ .

EXAMPLE

Example 3.1. For $S = 100$ , $K = 100$ , $r = 0.05$ , $T = 1$ , $\sigma = 0.20$ :

Analytic vega: $100 \cdot \sqrt{1} \cdot n(d_+) = 100 \cdot n(0.35) \approx 37.52$
Forward FD with $h = 10^{-4}$ : $[C(100,100,0.05,1,0.2001) - C(100,100,0.05,1,0.2)]/10^{-4}$ gives approximately $37.52$ with error $O(h)$ .
Central FD with $h = 10^{-5}$ : matches to 10 significant figures.
Complex step with $h = 10^{-50}$ : matches to 14 significant figures.

The notebook verifies these numerically and plots the FD error as a function of $h$ .

Validation

The companion notebook verifies the four derivative methods on the Black-Scholes vega and on a 2-parameter nonlinear function, confirming:

Forward FD error decreases as $O(h)$ for $h > h^*$ and increases as $O(1/h)$ for $h < h^*$ .
Central FD error decreases as $O(h^2)$ until the roundoff floor.
Complex step achieves machine-precision accuracy for all tested $h$ values.
A simple reverse-mode AAD implementation produces the correct gradient on a small model.
The optimal step size predictions of Theorem 3.1 are verified empirically.

PRACTICE

Before opening the notebook: For $f(\theta) = \theta_1^2 \sin(\theta_2)$ at $\theta = (2, \pi/4)$ : (a) Compute $\partial f/\partial \theta_1$ and $\partial f/\partial \theta_2$ analytically. (b) Estimate the forward FD error at $h = 10^{-4}$ . (c) What step size would you choose for central FD, and why?

Limitations

WARNING

Warning — FD Step Size Selection. Using a single default $h = 10^{-6}$ for all parameters is wrong when parameters have different scales. For $\kappa \approx 2.0$ , $h = 10^{-6}$ gives a relative perturbation of $5 \times 10^{-7}$ — fine. For $v_0 \approx 0.04$ , $h = 10^{-6}$ gives a relative perturbation of $2.5 \times 10^{-5}$ — also fine. But for a correlation $\rho \approx -0.8$ bounded in $(-1, 1)$ , a step of $h = 10^{-6}$ is fine, whereas a step of $h = 0.1$ would give a wrong estimate. Scale parameters to $O(1)$ before applying FD.

WARNING

Warning — AAD Through Branching Code. Reverse-mode AAD requires the computation graph to be differentiable. If the pricing function contains branches (if K > S) or discontinuous operations (e.g., digital payoffs), the derivative may not exist at the branch point. Smoothing or indicator relaxation is required. The complex step method has the same issue: a real-valued branch if real(x) > 0 evaluated at a complex $x$ must be handled carefully.

Other limitations:

Parallelism: Column-wise FD (one column of $J$ per perturbation) parallelises trivially over $p$ threads. Reverse-mode AAD is sequential per backward pass; parallelism requires batching.
Memory cost of AAD: Reverse mode requires storing the tape (the entire computation graph) between the forward and backward passes. For complex pricing functions, this tape can be large. Checkpointing techniques trade recomputation for memory.
Implicit functions: If the pricing function involves solving an implicit equation (e.g., Newton's method for implied vol inversion), the Jacobian requires the implicit function theorem: $\partial \sigma_{\text{impl}} / \partial \theta = -(\partial G/\partial \sigma)^{-1} \partial G/\partial \theta$ where $G(\sigma, \theta) = 0$ defines the implied vol. This must be handled explicitly in AAD — naively differentiating through the Newton iterations works but is less efficient.

Interview Angle

PRACTICE

L1 (Junior) — Typical questions:

What is a finite difference and when would you use it to compute a derivative? Expected: $[f(x+h) - f(x)]/h$ (forward) or $[f(x+h) - f(x-h)]/(2h)$ (central). Use when analytic derivatives are unavailable. Know that $h$ must be chosen carefully.
What is the difference between forward and central finite differences? Expected: forward is $O(h)$ accurate, central is $O(h^2)$ . Central is more accurate at the cost of one extra function evaluation. Know the respective optimal step sizes.
What is vega? How would you compute it numerically? Expected: sensitivity of option price to implied vol. Forward or central FD on $\sigma$ . Better: analytic formula $\mathcal{V} = S\sqrt{T}\,n(d_+)$ for Black-Scholes.

PRACTICE

L2 (Senior) — Typical questions:

Derive the optimal step size for central finite differences. What limits the accuracy? Expected: minimise truncation $O(h^2)$ plus roundoff $O(\varepsilon_{\text{mach}}/h)$ → $h^* \sim \varepsilon_{\text{mach}}^{1/3}$ . At $h^*$ , achieve $O(\varepsilon_{\text{mach}}^{2/3})$ accuracy. The limit is machine epsilon, not the Taylor truncation.
Explain the complex step method. Why is it immune to cancellation error? Expected: evaluate $f$ at complex argument $x + ih$ ; take imaginary part divided by $h$ . The imaginary part is proportional to $f'(x)$ by the Taylor expansion; no real-minus-real cancellation occurs. Step $h$ can be arbitrarily small.
When is reverse-mode AAD preferred over forward-mode or finite differences? Expected: when the function is scalar (computing gradient of $\mathcal{L}$ ) or when $N \ll p$ (fewer outputs than parameters). Derive the cost comparison.

PRACTICE

L3 (Researcher) — Typical questions:

Explain the forward and reverse modes of automatic differentiation. What is the cost complexity of each? Expected: full derivation. Forward mode propagates $\dot{v}$ alongside $v$ ; cost $p$ passes for full Jacobian. Reverse mode propagates adjoints $\bar{v}$ backward; cost $N$ passes. For scalar objective: reverse mode is $O(1)$ passes regardless of $p$ .
How do you differentiate through an implicit function (e.g., the Newton iteration for implied vol inversion)? Expected: implicit function theorem. If $G(\sigma, \theta) = 0$ , then $d\sigma/d\theta = -(\partial G/\partial \sigma)^{-1} \partial G/\partial\theta$ . This avoids differentiating through the Newton loop. Naive AAD through the loop works but is less numerically stable and more expensive.
Compare complex step and dual-number forward-mode AAD. Are they mathematically equivalent? Expected: both extend the real numbers to a two-component system. Dual numbers: $x = a + b\epsilon$ with $\epsilon^2 = 0$ ; complex: $x = a + bj$ with $j^2 = -1$ . Dual-number forward mode gives $f(a + \epsilon) = f(a) + f'(a)\epsilon$ — exact to machine precision. Complex step uses $j^2 = -1$ so higher-order terms appear in the real part (but not the imaginary part if we take only the $O(h)$ imaginary term). For derivatives: equivalent in practice. Dual numbers are exact; complex step has $O(h^2)$ error from the imaginary part being $hf'(x) - h^3 f'''(x)/6 + \cdots$