JacobianFinite DifferencesComplex StepAADAutomatic Differentiation

Jacobian Computation — Finite Difference and AAD

Module 3 of 525 min readLevel: Hard

Setup

The Levenberg-Marquardt algorithm requires the Jacobian J(θ)RN×pJ(\theta) \in \mathbb{R}^{N \times p} at every iteration. For a Heston calibration with N=40N = 40 instruments and p=5p = 5 parameters, JJ has 200 entries. For an LMM calibration with N=100N = 100 swaptions and p=50p = 50 parameters, JJ has 5000 entries — each requiring a pricing function evaluation or a derivative computation.

How JJ is computed determines the speed and accuracy of the entire calibration pipeline. There are four approaches: (i) analytic differentiation, (ii) finite differences, (iii) complex step differentiation, and (iv) automatic differentiation (AAD). Each involves a different trade-off between implementation cost, accuracy, and computational load.

INSIGHT

Financial Insight. On a production vol desk, the Jacobian must be computed in milliseconds. A Heston calibration running every minute with 40 instruments and 5 parameters requires 5×2=105 \times 2 = 10 additional function evaluations per iteration (central FD) on top of the base evaluation — a 10×10\times overhead. AAD reduces this to approximately 1×1\times (one additional backward pass), making real-time calibration feasible. Understanding this trade-off is the difference between a research-grade implementation and a production system.

Assumptions:

  • The pricing function f:RpRf: \mathbb{R}^p \to \mathbb{R} (or f:RpRNf: \mathbb{R}^p \to \mathbb{R}^N) is smooth in θ\theta on the domain of interest.
  • Floating-point arithmetic: all computations are in IEEE 754 double precision, machine epsilon εmach2.2×1016\varepsilon_{\text{mach}} \approx 2.2 \times 10^{-16}.
  • For AAD: the pricing function is implemented as a sequence of elementary operations (addition, multiplication, exp\exp, ln\ln, etc.) — the standard assumption for any AD tool.

Theory

Finite Differences: Forward and Central

DEFINITION

Definition 3.1 (Finite Difference Approximations). For scalar f:RRf: \mathbb{R} \to \mathbb{R}:

Forward difference (first order): f(x)f(x+h)f(x)hf'(x) \approx \frac{f(x + h) - f(x)}{h}

Backward difference (first order): f(x)f(x)f(xh)hf'(x) \approx \frac{f(x) - f(x - h)}{h}

Central difference (second order): f(x)f(x+h)f(xh)2hf'(x) \approx \frac{f(x + h) - f(x - h)}{2h}

Error analysis: By Taylor's theorem:

f(x+h)=f(x)+hf(x)+h22f(x)+h36f(x)+f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(x) + \cdots

For the forward difference: f(x+h)f(x)h=f(x)+h2f(x)truncation+O(h2)\frac{f(x+h) - f(x)}{h} = f'(x) + \underbrace{\frac{h}{2} f''(x)}_{\text{truncation}} + O(h^2)

Truncation error: O(h)O(h). Roundoff error from floating-point evaluation of f(x+h)f(x)f(x+h) - f(x): O(εmachf(x)/h)O(\varepsilon_{\text{mach}} |f(x)| / h).

For the central difference: f(x+h)f(xh)2h=f(x)+h26f(x)truncation+O(h4)\frac{f(x+h) - f(x-h)}{2h} = f'(x) + \underbrace{\frac{h^2}{6} f'''(x)}_{\text{truncation}} + O(h^4)

Truncation error: O(h2)O(h^2). Roundoff error: O(εmachf(x)/h)O(\varepsilon_{\text{mach}} |f(x)| / h) (same structure, since f(x)f(x) cancels but floating-point errors in f(x±h)f(x\pm h) persist).

THEOREM

Theorem 3.1 (Optimal Step Size for Finite Differences). The total error (truncation + roundoff) is minimised at:

  • Forward FD: hεmachf/fεmach108h^* \approx \sqrt{\varepsilon_{\text{mach}} |f| / |f''|} \approx \sqrt{\varepsilon_{\text{mach}}} \approx 10^{-8}

  • Central FD: h(3εmachff)1/3εmach1/3105h^* \approx \left(\frac{3\varepsilon_{\text{mach}} |f|}{|f'''|}\right)^{1/3} \approx \varepsilon_{\text{mach}}^{1/3} \approx 10^{-5}

At the optimal hh, the forward FD achieves accuracy O(εmach1/2)O(\varepsilon_{\text{mach}}^{1/2}) and the central FD achieves O(εmach2/3)O(\varepsilon_{\text{mach}}^{2/3}).

Central FD is 104\sim 10^4 times more accurate than forward FD at their respective optimal step sizes, at the cost of twice as many function evaluations. In calibration, the reduced error is almost always worth the extra evaluation.

Cost for the Jacobian:

MethodEvaluations per row of JJTotal for JRN×pJ \in \mathbb{R}^{N\times p}
Forward FDp+1p + 1 (includes base)p+1p + 1 (one base, pp perturbed)
Central FD2p2p2p2p (no base reuse)
Complex steppppp (same cost as forward, but no base cancel)

For the Jacobian, all NN rows share the same perturbations: compute f(θ+hej)f(\theta + he_j) for j=1,,pj=1,\ldots,p, compute all NN residuals at each perturbed point. Total: p+1p+1 (forward) or 2p2p (central) calls to the full pricing engine (all NN instruments).

REMARK

Remark. For Heston (p=5p = 5, central FD): 10 full pricing engine calls per LM iteration. For LMM (p=50p = 50, central FD): 100 full pricing engine calls per iteration. With 50 LM iterations and 1 ms per pricing call, LMM calibration takes 100×50×1ms=5seconds100 \times 50 \times 1\,\text{ms} = 5\,\text{seconds} — acceptable for end-of-day but not for real-time.

The Complex Step Method

THEOREM

Theorem 3.2 (Complex Step Derivative). If f:RRf: \mathbb{R} \to \mathbb{R} extends to an analytic function on a strip of the complex plane containing the real axis, then for any h>0h > 0: f(x)=Im[f(x+ih)]h+O(h2)f'(x) = \frac{\mathrm{Im}[f(x + ih)]}{h} + O(h^2) where the error is purely from the Taylor remainder, with no cancellation error from floating-point subtraction.

Derivation: By the Taylor expansion of ff in C\mathbb{C}: f(x+ih)=f(x)+ihf(x)h22f(x)ih36f(x)+f(x + ih) = f(x) + ih\,f'(x) - \frac{h^2}{2} f''(x) - \frac{ih^3}{6} f'''(x) + \cdots

Taking the imaginary part: Im[f(x+ih)]=hf(x)h36f(x)+\mathrm{Im}[f(x + ih)] = h f'(x) - \frac{h^3}{6} f'''(x) + \cdots

Dividing by hh: f(x)=Im[f(x+ih)]/h+O(h2)f'(x) = \mathrm{Im}[f(x+ih)]/h + O(h^2).

The key advantage: there is no subtraction of nearly-equal quantities. Catastrophic cancellation — the main source of roundoff error in FD — does not occur. One can use h=10200h = 10^{-200} and achieve machine-precision accuracy.

WARNING

Warning — Implementation Requirement. The complex step method requires evaluating ff with complex-number arguments. The entire pricing chain — characteristic functions, numerical integrations, norm distributions — must support complex input. This is straightforward for Black-Scholes (the NormCdf can be extended to complex arguments via the error function), but requires careful implementation for models involving branching logic (if x > 0). A naive implementation that uses fabs(x) (real absolute value) will give wrong derivatives when applied to a complex number.

Automatic Differentiation (AAD)

Finite differences and complex step compute one column of JJ per perturbation (forward mode in the sense of perturbation direction). When pNp \ll N (few parameters, many instruments), this is efficient. When NpN \ll p or when computing a scalar objective gradient, the reverse mode of AAD computes all partial derivatives in a single backward pass.

DEFINITION

Definition 3.2 (Computation Graph and Tape). Any differentiable computation f(θ)f(\theta) can be written as a sequence of elementary operations v1,v2,,vMv_1, v_2, \ldots, v_M (a computation graph or tape). Each vkv_k depends on previous nodes; the output is vMv_M.

Forward mode (tangent): propagate v˙k=vk/θj\dot{v}_k = \partial v_k / \partial \theta_j alongside values. Cost per derivative: one forward pass. Total for all pp derivatives: pp passes.

Reverse mode (adjoint): propagate adjoints vˉk=f/vk\bar{v}_k = \partial f / \partial v_k backward from output to inputs. Cost: one forward pass (to record the tape) plus one backward pass. Computes the full gradient θf\nabla_\theta f in O(1)O(1) passes — independent of pp.

THEOREM

Theorem 3.3 (Cost of Reverse-Mode AAD). For a scalar function f:RpRf: \mathbb{R}^p \to \mathbb{R}, reverse-mode AAD computes the full gradient θfRp\nabla_\theta f \in \mathbb{R}^p in at most crev5c_{\text{rev}} \le 5 times the cost of a single function evaluation, independent of pp.

For the calibration Jacobian JRN×pJ \in \mathbb{R}^{N \times p}, reverse mode computes one row θfi\nabla_\theta f_i per backward pass — NN backward passes in total. Forward mode computes one column per pass — pp forward passes in total. The efficient choice depends on the ratio N/pN/p:

ModePassesPrefer when
Forward (tangent)pppNp \ll N (few parameters, many instruments)
Reverse (adjoint)NNNpN \ll p (many parameters, few instruments)
Central FD2p2pNo AAD available; pp small

For Heston (p=5p=5, N=40N=40): forward mode (5 passes) vs. reverse mode (40 passes) — forward wins. For LMM (p=50p=50, N=100N=100): reverse mode (100 passes) vs. forward mode (50 passes) — forward mode still wins on pass count, but each pass is cheaper with AAD than with FD.

INSIGHT

Financial Insight. The true benefit of reverse-mode AAD is not in the Jacobian count — it is in computing the gradient of the scalar objective θL\nabla_\theta \mathcal{L} in one backward pass. This is what makes AAD critical for gradient-based calibration: L(θ)=12r(θ)2\mathcal{L}(\theta) = \tfrac{1}{2}\|r(\theta)\|^2 is scalar, and θL=Jr\nabla_\theta \mathcal{L} = -J^\top r is computed in one backward pass regardless of pp. For LMM with p=50p = 50: one backward pass vs. 100 forward passes with FD. This is the 100×100\times speedup that motivates AAD implementation on production desks.

Application: Analytic Vega in Black-Scholes

For a European call under Black-Scholes:

C(S,K,r,T,σ)=SN(d+)KerTN(d)C(S, K, r, T, \sigma) = S\,N(d_+) - K e^{-rT} N(d_-)

where d±=[ln(S/K)+(r±σ22)T]/(σT)d_\pm = [\ln(S/K) + (r \pm \frac{\sigma^2}{2})T] / (\sigma\sqrt{T}).

The analytic derivative with respect to σ\sigma (vega) is:

Cσ=STn(d+)\frac{\partial C}{\partial \sigma} = S\sqrt{T}\,n(d_+)

where nn is the standard normal PDF. This is the benchmark for validating FD and AAD implementations: the complex step or reverse-mode AAD result must match this to within O(εmach)O(\varepsilon_{\text{mach}}).

EXAMPLE

Example 3.1. For S=100S = 100, K=100K = 100, r=0.05r = 0.05, T=1T = 1, σ=0.20\sigma = 0.20:

  • Analytic vega: 1001n(d+)=100n(0.35)37.52100 \cdot \sqrt{1} \cdot n(d_+) = 100 \cdot n(0.35) \approx 37.52

  • Forward FD with h=104h = 10^{-4}: [C(100,100,0.05,1,0.2001)C(100,100,0.05,1,0.2)]/104[C(100,100,0.05,1,0.2001) - C(100,100,0.05,1,0.2)]/10^{-4} gives approximately 37.5237.52 with error O(h)O(h).

  • Central FD with h=105h = 10^{-5}: matches to 10 significant figures.

  • Complex step with h=1050h = 10^{-50}: matches to 14 significant figures.

The notebook verifies these numerically and plots the FD error as a function of hh.


Validation

The companion notebook verifies the four derivative methods on the Black-Scholes vega and on a 2-parameter nonlinear function, confirming:

  1. Forward FD error decreases as O(h)O(h) for h>hh > h^* and increases as O(1/h)O(1/h) for h<hh < h^*.
  2. Central FD error decreases as O(h2)O(h^2) until the roundoff floor.
  3. Complex step achieves machine-precision accuracy for all tested hh values.
  4. A simple reverse-mode AAD implementation produces the correct gradient on a small model.
  5. The optimal step size predictions of Theorem 3.1 are verified empirically.
PRACTICE

Before opening the notebook: For f(θ)=θ12sin(θ2)f(\theta) = \theta_1^2 \sin(\theta_2) at θ=(2,π/4)\theta = (2, \pi/4): (a) Compute f/θ1\partial f/\partial \theta_1 and f/θ2\partial f/\partial \theta_2 analytically. (b) Estimate the forward FD error at h=104h = 10^{-4}. (c) What step size would you choose for central FD, and why?


Limitations

WARNING

Warning — FD Step Size Selection. Using a single default h=106h = 10^{-6} for all parameters is wrong when parameters have different scales. For κ2.0\kappa \approx 2.0, h=106h = 10^{-6} gives a relative perturbation of 5×1075 \times 10^{-7} — fine. For v00.04v_0 \approx 0.04, h=106h = 10^{-6} gives a relative perturbation of 2.5×1052.5 \times 10^{-5} — also fine. But for a correlation ρ0.8\rho \approx -0.8 bounded in (1,1)(-1, 1), a step of h=106h = 10^{-6} is fine, whereas a step of h=0.1h = 0.1 would give a wrong estimate. Scale parameters to O(1)O(1) before applying FD.

WARNING

Warning — AAD Through Branching Code. Reverse-mode AAD requires the computation graph to be differentiable. If the pricing function contains branches (if K > S) or discontinuous operations (e.g., digital payoffs), the derivative may not exist at the branch point. Smoothing or indicator relaxation is required. The complex step method has the same issue: a real-valued branch if real(x) > 0 evaluated at a complex xx must be handled carefully.

Other limitations:

  • Parallelism: Column-wise FD (one column of JJ per perturbation) parallelises trivially over pp threads. Reverse-mode AAD is sequential per backward pass; parallelism requires batching.
  • Memory cost of AAD: Reverse mode requires storing the tape (the entire computation graph) between the forward and backward passes. For complex pricing functions, this tape can be large. Checkpointing techniques trade recomputation for memory.
  • Implicit functions: If the pricing function involves solving an implicit equation (e.g., Newton's method for implied vol inversion), the Jacobian requires the implicit function theorem: σimpl/θ=(G/σ)1G/θ\partial \sigma_{\text{impl}} / \partial \theta = -(\partial G/\partial \sigma)^{-1} \partial G/\partial \theta where G(σ,θ)=0G(\sigma, \theta) = 0 defines the implied vol. This must be handled explicitly in AAD — naively differentiating through the Newton iterations works but is less efficient.

Interview Angle

PRACTICE

L1 (Junior) — Typical questions:

  1. What is a finite difference and when would you use it to compute a derivative? Expected: [f(x+h)f(x)]/h[f(x+h) - f(x)]/h (forward) or [f(x+h)f(xh)]/(2h)[f(x+h) - f(x-h)]/(2h) (central). Use when analytic derivatives are unavailable. Know that hh must be chosen carefully.

  2. What is the difference between forward and central finite differences? Expected: forward is O(h)O(h) accurate, central is O(h2)O(h^2). Central is more accurate at the cost of one extra function evaluation. Know the respective optimal step sizes.

  3. What is vega? How would you compute it numerically? Expected: sensitivity of option price to implied vol. Forward or central FD on σ\sigma. Better: analytic formula V=STn(d+)\mathcal{V} = S\sqrt{T}\,n(d_+) for Black-Scholes.

PRACTICE

L2 (Senior) — Typical questions:

  1. Derive the optimal step size for central finite differences. What limits the accuracy? Expected: minimise truncation O(h2)O(h^2) plus roundoff O(εmach/h)O(\varepsilon_{\text{mach}}/h)hεmach1/3h^* \sim \varepsilon_{\text{mach}}^{1/3}. At hh^*, achieve O(εmach2/3)O(\varepsilon_{\text{mach}}^{2/3}) accuracy. The limit is machine epsilon, not the Taylor truncation.

  2. Explain the complex step method. Why is it immune to cancellation error? Expected: evaluate ff at complex argument x+ihx + ih; take imaginary part divided by hh. The imaginary part is proportional to f(x)f'(x) by the Taylor expansion; no real-minus-real cancellation occurs. Step hh can be arbitrarily small.

  3. When is reverse-mode AAD preferred over forward-mode or finite differences? Expected: when the function is scalar (computing gradient of L\mathcal{L}) or when NpN \ll p (fewer outputs than parameters). Derive the cost comparison.

PRACTICE

L3 (Researcher) — Typical questions:

  1. Explain the forward and reverse modes of automatic differentiation. What is the cost complexity of each? Expected: full derivation. Forward mode propagates v˙\dot{v} alongside vv; cost pp passes for full Jacobian. Reverse mode propagates adjoints vˉ\bar{v} backward; cost NN passes. For scalar objective: reverse mode is O(1)O(1) passes regardless of pp.

  2. How do you differentiate through an implicit function (e.g., the Newton iteration for implied vol inversion)? Expected: implicit function theorem. If G(σ,θ)=0G(\sigma, \theta) = 0, then dσ/dθ=(G/σ)1G/θd\sigma/d\theta = -(\partial G/\partial \sigma)^{-1} \partial G/\partial\theta. This avoids differentiating through the Newton loop. Naive AAD through the loop works but is less numerically stable and more expensive.

  3. Compare complex step and dual-number forward-mode AAD. Are they mathematically equivalent? Expected: both extend the real numbers to a two-component system. Dual numbers: x=a+bϵx = a + b\epsilon with ϵ2=0\epsilon^2 = 0; complex: x=a+bjx = a + bj with j2=1j^2 = -1. Dual-number forward mode gives f(a+ϵ)=f(a)+f(a)ϵf(a + \epsilon) = f(a) + f'(a)\epsilon — exact to machine precision. Complex step uses j2=1j^2 = -1 so higher-order terms appear in the real part (but not the imaginary part if we take only the O(h)O(h) imaginary term). For derivatives: equivalent in practice. Dual numbers are exact; complex step has O(h2)O(h^2) error from the imaginary part being hf(x)h3f(x)/6+hf'(x) - h^3 f'''(x)/6 + \cdots