Brownian Bridge™

Setup

Why the naive definition breaks

You likely learned conditional probability as $\mathbb{P}(A|B) = \mathbb{P}(A \cap B)/\mathbb{P}(B)$ , valid when $\mathbb{P}(B) > 0$ . This is perfectly adequate for discrete problems. It fails the moment you work with continuous random variables.

Fix $Y$ a continuous random variable on $(\Omega, \mathcal{F}, \mathbb{P})$ (Module 1) and ask: what is $\mathbb{E}[X | Y = y]$ for a specific real value $y$ ? The event $\{Y = y\}$ has probability zero for any $y$ . The ratio $\mathbb{P}(\cdot \cap \{Y = y\})/\mathbb{P}(\{Y = y\})$ is $0/0$ — undefined, not merely small. Yet "the expected value of $X$ given $Y = y$ " is a perfectly natural and computationally important quantity.

The same issue arises throughout stochastic analysis. Each member of a Brownian filtration, $\mathcal{F}_t = \sigma(W_s, s \leq t)$, is a $\sigma$-algebra, not an event. Conditioning on $\mathcal{F}_t$ means "conditioning on the information available at time $t$", and that information is encoded as a sub-$\sigma$-algebra of $\mathcal{F}$. There is no single event to plug into the ratio formula.

The modern resolution, due to Kolmogorov (1933) and built on the Radon-Nikodym theorem, defines conditional expectation as a random variable characterised by an integral identity, not a ratio.

Conventions

Throughout this module:

$(\Omega, \mathcal{F}, \mathbb{P})$ is the probability space from Module 1: $\Omega$ the sample space, $\mathcal{F}$ a $\sigma$ -algebra on $\Omega$ , $\mathbb{P}$ a probability measure.
$L^1(\Omega, \mathcal{F}, \mathbb{P})$ denotes the space of $\mathcal{F}$ -measurable functions with $\mathbb{E}[|X|] < \infty$ . $L^2$ adds $\mathbb{E}[X^2] < \infty$ .
$\mathcal{G} \subseteq \mathcal{F}$ denotes a sub- $\sigma$ -algebra — a $\sigma$ -algebra in its own right, but coarser than $\mathcal{F}$ .
"a.s." means $\mathbb{P}$ -almost surely: except on a set of probability zero.
$(\mathcal{F}_t)_{t \geq 0}$ denotes a filtration: an increasing family of sub- $\sigma$ -algebras, $\mathcal{F}_s \subseteq \mathcal{F}_t$ for $s \leq t$ , representing the accumulation of market information over time.
$\mathbb{Q}$ denotes the risk-neutral measure; $r$ the continuously compounded risk-free rate.

The pricing motivation

The risk-neutral price of a derivative with payoff $\Phi(S_T)$ at time $t < T$ is:

$V_t = e^{-r(T-t)} \mathbb{E}^{\mathbb{Q}}\bigl[\Phi(S_T) \,\big|\, \mathcal{F}_t\bigr].$

This formula appears in Black-Scholes, Heston, Hull-White, and every other risk-neutral pricing model. The object $\mathbb{E}^{\mathbb{Q}}[\Phi(S_T) | \mathcal{F}_t]$ is not a number: it is a random variable — one that depends on which path the market has taken up to time $t$ , i.e. on $\mathcal{F}_t$ . To manipulate it — to argue that $V_t$ is a martingale, to apply Itô's lemma to it, to use it in hedging arguments — you need the measure-theoretic definition.

INSIGHT

Why this matters on a desk. Valuations computed inside a risk engine are conditional expectations: "given today's market state (the $\mathcal{F}_t$ information), what is the expected discounted payoff?" XVA desks compute $\mathbb{E}[\text{exposure} | \mathcal{F}_t]$ across thousands of Monte Carlo paths. The tower property (proved below) is the mathematical identity that guarantees consistency when you aggregate over nested simulation steps. When a model produces inconsistent valuations across time steps, a common root cause is a violation of the tower property, typically from approximating the conditional expectation with the wrong conditioning set.

Theory

1. Definition via Radon-Nikodym

DEFINITION

Definition 1.1 (Conditional Expectation). Let $X \in L^1(\Omega, \mathcal{F}, \mathbb{P})$ and let $\mathcal{G} \subseteq \mathcal{F}$ be a sub- $\sigma$ -algebra. The conditional expectation of $X$ given $\mathcal{G}$ , written $\mathbb{E}[X | \mathcal{G}]$ , is defined as any $\mathcal{G}$ -measurable random variable $Z : \Omega \to \mathbb{R}$ satisfying:

$\int_G Z \, d\mathbb{P} = \int_G X \, d\mathbb{P} \qquad \text{for all } G \in \mathcal{G}.$

Such a $Z$ exists and is unique $\mathbb{P}$ -a.s. We call this the defining property of conditional expectation.

Two conditions are imposed: (i) $Z$ must be $\mathcal{G}$ -measurable — it must be determinable from the information in $\mathcal{G}$ alone; (ii) $Z$ must integrate to the same value as $X$ over every set $G \in \mathcal{G}$ . Together these say: $Z$ is the best guess of $X$ given the information in $\mathcal{G}$ , calibrated so that its integral always matches $X$ 's.

Existence and uniqueness via Radon-Nikodym. Define a signed measure $\nu : \mathcal{G} \to \mathbb{R}$ by $\nu(G) = \int_G X \, d\mathbb{P}$ . Since $X \in L^1$ , $\nu$ is a finite signed measure on $(\Omega, \mathcal{G})$ , and $\nu \ll \mathbb{P}|_\mathcal{G}$ (absolutely continuous with respect to $\mathbb{P}$ restricted to $\mathcal{G}$ ). By the Radon-Nikodym theorem, there exists a $\mathcal{G}$ -measurable function $Z = d\nu/d(\mathbb{P}|_\mathcal{G})$ satisfying $\nu(G) = \int_G Z \, d\mathbb{P}$ for all $G \in \mathcal{G}$ . This is exactly the defining property. Any two such functions agree $\mathbb{P}$ -a.s. by the uniqueness clause of the Radon-Nikodym theorem.
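On a finite sample space the Radon-Nikodym derivative can be written down directly: on each atom $A$ of $\mathcal{G}$ with $\mathbb{P}(A) > 0$, $Z = \nu(A)/\mathbb{P}(A)$. The sketch below is an illustration under assumed data (a hypothetical loaded four-sided die; the names `integral` and `cond_exp` are not from the module) and checks the defining property on every set of $\mathcal{G}$ with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical example: a loaded four-sided die (non-uniform P).
P = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8), 4: Fraction(1, 8)}
X = {1: Fraction(4), 2: Fraction(-2), 3: Fraction(8), 4: Fraction(0)}
atoms = [frozenset({1, 2}), frozenset({3, 4})]  # partition generating G


def integral(f, event):
    """Integral of f over `event` with respect to P (a finite sum)."""
    return sum(f[w] * P[w] for w in event)


def cond_exp(f, atoms):
    """E[f | G] as a dict on Omega: on each atom A, the density nu(A) / P(A)."""
    Z = {}
    for A in atoms:
        nu_A = integral(f, A)            # nu(A) = integral of f over A
        p_A = sum(P[w] for w in A)       # P(A), assumed strictly positive
        for w in A:
            Z[w] = nu_A / p_A
    return Z


Z = cond_exp(X, atoms)

# Every G in the sigma-algebra is a union of atoms (including the empty union).
sigma_algebra = [frozenset().union(*c)
                 for r in range(len(atoms) + 1)
                 for c in combinations(atoms, r)]
defining_property_holds = all(integral(Z, G) == integral(X, G) for G in sigma_algebra)
print(Z, defining_property_holds)
```

Here $Z$ equals 2 on $\{1,2\}$ and 4 on $\{3,4\}$. Changing $Z$ on an outcome of probability zero (none exists in this example) would not affect any of the integrals, which is the discrete shadow of a.s. uniqueness.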

2. Geometric interpretation (L² case)

REMARK

Orthogonal projection in $L^2$ . When $X \in L^2(\Omega, \mathcal{F}, \mathbb{P})$ , the conditional expectation $\mathbb{E}[X|\mathcal{G}]$ is the orthogonal projection of $X$ onto the closed subspace $L^2(\Omega, \mathcal{G}, \mathbb{P}) \subseteq L^2(\Omega, \mathcal{F}, \mathbb{P})$ .

Orthogonality here means: the residual $X - \mathbb{E}[X|\mathcal{G}]$ is orthogonal to every $\mathcal{G}$ -measurable $Z \in L^2$ :

$\mathbb{E}\bigl[(X - \mathbb{E}[X|\mathcal{G}]) \cdot Z\bigr] = 0 \qquad \text{for all } Z \in L^2(\Omega, \mathcal{G}, \mathbb{P}).$

This is equivalent to the defining property: taking $Z = \mathbf{1}_G$ for $G \in \mathcal{G}$ recovers it, and conversely linearity and a density argument extend the identity from indicators to all of $L^2(\Omega, \mathcal{G}, \mathbb{P})$. The projection interpretation implies that $\mathbb{E}[X|\mathcal{G}]$ is the best $\mathcal{G}$-measurable predictor of $X$ in the mean-square sense: it minimises $\mathbb{E}[(X - Z)^2]$ over all $\mathcal{G}$-measurable $Z \in L^2$.

The geometric picture makes several properties obvious. Projecting onto a subspace and then onto a smaller subspace contained in it gives the same result as projecting onto the smaller subspace directly: this is the tower property. A vector already lying in the subspace projects to itself: this is $\mathbb{E}[X|\mathcal{F}] = X$. A vector orthogonal to the subspace projects to zero: if $X$ is independent of $\mathcal{G}$, the centred variable $X - \mathbb{E}[X]$ is orthogonal to $L^2(\Omega, \mathcal{G}, \mathbb{P})$, which gives the independence case $\mathbb{E}[X|\mathcal{G}] = \mathbb{E}[X]$.
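The projection picture can be checked by hand on a four-point space. The following sketch uses hypothetical values for $X$ and a coarse search grid, both chosen only for illustration. It verifies that the residual is orthogonal to the indicators of the atoms, which span $L^2(\Omega, \mathcal{G}, \mathbb{P})$ here, and that no $\mathcal{G}$-measurable candidate on the grid beats the conditional expectation in mean-square error.

```python
from fractions import Fraction

omega = ["a", "b", "c", "d"]
P = {w: Fraction(1, 4) for w in omega}
X = {"a": Fraction(1), "b": Fraction(3), "c": Fraction(0), "d": Fraction(6)}
atoms = [("a", "b"), ("c", "d")]


def expect(f):
    return sum(f[w] * P[w] for w in omega)


def cond_exp(f):
    Z = {}
    for A in atoms:
        p_A = sum(P[w] for w in A)
        avg = sum(f[w] * P[w] for w in A) / p_A
        for w in A:
            Z[w] = avg
    return Z


def g_measurable(u, v):
    """A G-measurable variable: value u on {a,b}, value v on {c,d}."""
    return {"a": u, "b": u, "c": v, "d": v}


def mse(Z):
    return expect({w: (X[w] - Z[w]) ** 2 for w in omega})


proj = cond_exp(X)
residual = {w: X[w] - proj[w] for w in omega}

# Orthogonality of the residual to the indicators of the two atoms.
inner_products = [expect({w: residual[w] * Z[w] for w in omega})
                  for Z in (g_measurable(1, 0), g_measurable(0, 1))]

# Mean-square error of the projection versus a grid of G-measurable candidates.
best = mse(proj)
grid = [Fraction(k, 2) for k in range(13)]      # 0, 1/2, ..., 6
smallest_on_grid = min(mse(g_measurable(u, v)) for u in grid for v in grid)
print(inner_products, best, smallest_on_grid)
```

The grid search is not a proof of optimality, only a sanity check: the minimum over the grid is attained at the conditional expectation itself.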

3. Key properties

THEOREM

Theorem 3.1 (Properties of Conditional Expectation). Let $X, Y \in L^1(\Omega, \mathcal{F}, \mathbb{P})$ , $\alpha, \beta \in \mathbb{R}$ , and $\mathcal{G}, \mathcal{H} \subseteq \mathcal{F}$ sub- $\sigma$ -algebras. Then:

(i) Linearity: $\mathbb{E}[\alpha X + \beta Y | \mathcal{G}] = \alpha \mathbb{E}[X|\mathcal{G}] + \beta \mathbb{E}[Y|\mathcal{G}]$ a.s.

(ii) Tower property: If $\mathcal{H} \subseteq \mathcal{G} \subseteq \mathcal{F}$ , then $\mathbb{E}\bigl[\mathbb{E}[X|\mathcal{G}]\,\big|\,\mathcal{H}\bigr] = \mathbb{E}[X|\mathcal{H}]$ a.s.

(iii) Pulling out known factors: If $Y$ is $\mathcal{G}$ -measurable and $XY \in L^1$ , then $\mathbb{E}[XY|\mathcal{G}] = Y \cdot \mathbb{E}[X|\mathcal{G}]$ a.s.

(iv) Trivial conditioning: $\mathbb{E}[X|\{\emptyset, \Omega\}] = \mathbb{E}[X]$ a.s. (constant random variable).

(v) Full conditioning: $\mathbb{E}[X|\mathcal{F}] = X$ a.s.

(vi) Independence: If $X$ is independent of $\mathcal{G}$ (i.e. $\mathbb{P}(\{X \in B\} \cap G) = \mathbb{P}(X \in B)\,\mathbb{P}(G)$ for every Borel set $B$ and every $G \in \mathcal{G}$), then $\mathbb{E}[X|\mathcal{G}] = \mathbb{E}[X]$ a.s.

(vii) Jensen's inequality: If $\varphi : \mathbb{R} \to \mathbb{R}$ is convex and $\varphi(X) \in L^1$ , then $\varphi(\mathbb{E}[X|\mathcal{G}]) \leq \mathbb{E}[\varphi(X)|\mathcal{G}]$ a.s.
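Properties (vi) and (vii) are easy to test numerically before reading the proofs. The sketch below is an illustration on an assumed two-coin space with hypothetical payoffs `X` and `W`; it conditions on the first toss and uses $\varphi(x) = x^2$ for Jensen.

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))            # two fair coin tosses
P = {w: Fraction(1, 4) for w in omega}
atoms = [[w for w in omega if w[0] == first] for first in "HT"]  # G = sigma(first toss)


def cond_exp(f):
    Z = {}
    for A in atoms:
        p_A = sum(P[w] for w in A)
        avg = sum(f[w] * P[w] for w in A) / p_A
        for w in A:
            Z[w] = avg
    return Z


# (vi) X depends only on the second toss, so it is independent of G.
X = {w: Fraction(5 if w[1] == "H" else -1) for w in omega}
mean_X = sum(X[w] * P[w] for w in omega)
independence_ok = all(z == mean_X for z in cond_exp(X).values())

# (vii) Jensen with phi(x) = x^2 for a variable that depends on both tosses.
W = {("H", "H"): Fraction(3), ("H", "T"): Fraction(-1),
     ("T", "H"): Fraction(2), ("T", "T"): Fraction(2)}
lhs = {w: z ** 2 for w, z in cond_exp(W).items()}       # phi(E[W|G])
rhs = cond_exp({w: W[w] ** 2 for w in omega})           # E[phi(W)|G]
jensen_ok = all(lhs[w] <= rhs[w] for w in omega)
print(mean_X, independence_ok, lhs, rhs, jensen_ok)
```

On the atom where the first toss is T, `W` is constant, so Jensen holds with equality there; on the other atom the inequality is strict.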

PROOF

Proof of (ii) — Tower property. We show that the candidate $Z := \mathbb{E}[X|\mathcal{H}]$ satisfies both conditions that define $\mathbb{E}[\mathbb{E}[X|\mathcal{G}]|\mathcal{H}]$. It is $\mathcal{H}$-measurable by definition of $\mathbb{E}[X|\mathcal{H}]$. It remains to check the integral condition: for every $H \in \mathcal{H}$:

$\int_H \mathbb{E}[X|\mathcal{H}] \, d\mathbb{P} \stackrel{?}{=} \int_H \mathbb{E}[X|\mathcal{G}] \, d\mathbb{P}.$

Since $\mathcal{H} \subseteq \mathcal{G}$ , every $H \in \mathcal{H}$ is also in $\mathcal{G}$ . By the defining property applied to $\mathbb{E}[X|\mathcal{G}]$ with $G = H$ :

$\int_H \mathbb{E}[X|\mathcal{G}] \, d\mathbb{P} = \int_H X \, d\mathbb{P}.$

By the defining property applied to $\mathbb{E}[X|\mathcal{H}]$ with $G = H$ :

$\int_H \mathbb{E}[X|\mathcal{H}] \, d\mathbb{P} = \int_H X \, d\mathbb{P}.$

Both sides equal $\int_H X \, d\mathbb{P}$ , so $\mathbb{E}[\mathbb{E}[X|\mathcal{G}]|\mathcal{H}] = \mathbb{E}[X|\mathcal{H}]$ a.s. $\square$

Proof sketch of (iii) — Pulling out known factors. Verify that $Y \cdot \mathbb{E}[X|\mathcal{G}]$ satisfies both conditions of Definition 1.1. (a) It is $\mathcal{G}$ -measurable since $Y$ is $\mathcal{G}$ -measurable and $\mathbb{E}[X|\mathcal{G}]$ is $\mathcal{G}$ -measurable. (b) For any $G \in \mathcal{G}$ :

$\int_G Y \cdot \mathbb{E}[X|\mathcal{G}] \, d\mathbb{P} = \int_G XY \, d\mathbb{P}$

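To see this, first take an indicator $Y = \mathbf{1}_A$ with $A \in \mathcal{G}$. Since $G \cap A \in \mathcal{G}$, the defining property gives $\int_G \mathbf{1}_A \, \mathbb{E}[X|\mathcal{G}] \, d\mathbb{P} = \int_{G \cap A} \mathbb{E}[X|\mathcal{G}] \, d\mathbb{P} = \int_{G \cap A} X \, d\mathbb{P} = \int_G \mathbf{1}_A X \, d\mathbb{P}$. Linearity extends the identity to $\mathcal{G}$-measurable simple functions, and approximating a general $Y$ by simple functions (monotone convergence applied to the positive and negative parts of $X$ and $Y$) completes the argument. $\square$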
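A quick exact check of property (iii), including what goes wrong when the factor is not $\mathcal{G}$-measurable. This is a sketch under assumed data: the measure, `X`, `Y` and `Y_bad` are hypothetical and chosen for illustration.

```python
from fractions import Fraction

omega = ["a", "b", "c", "d"]
P = {"a": Fraction(1, 8), "b": Fraction(3, 8), "c": Fraction(1, 4), "d": Fraction(1, 4)}
atoms = [("a", "b"), ("c", "d")]                     # G = sigma({a,b}, {c,d})
X = {"a": Fraction(2), "b": Fraction(-2), "c": Fraction(1), "d": Fraction(5)}
Y = {"a": Fraction(3), "b": Fraction(3), "c": Fraction(2), "d": Fraction(2)}      # constant on atoms
Y_bad = {"a": Fraction(1), "b": Fraction(0), "c": Fraction(1), "d": Fraction(0)}  # not G-measurable


def cond_exp(f):
    Z = {}
    for A in atoms:
        p_A = sum(P[w] for w in A)
        avg = sum(f[w] * P[w] for w in A) / p_A
        for w in A:
            Z[w] = avg
    return Z


def times(f, g):
    return {w: f[w] * g[w] for w in omega}


ce_X = cond_exp(X)
pull_out_ok = cond_exp(times(X, Y)) == times(Y, ce_X)           # E[XY|G] == Y * E[X|G]
pull_out_bad = cond_exp(times(X, Y_bad)) == times(Y_bad, ce_X)  # fails: Y_bad is not known given G
print(pull_out_ok, pull_out_bad)
```

The second comparison fails because `Y_bad` varies inside an atom, so it cannot be treated as a constant when averaging over that atom.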

The tower property, a qualitative reading. Conditioning on less information than you already have can only reduce information. If $\mathcal{H} \subseteq \mathcal{G}$ (so $\mathcal{H}$ is the coarser $\sigma$-algebra), then conditioning on $\mathcal{H}$ after already conditioning on $\mathcal{G}$ collapses back to what $\mathcal{H}$ would have given you directly. For nested $\sigma$-algebras, iterated conditioning always ends up at the smallest one.

Common mistake. Candidates confuse the direction: $\mathbb{E}[\mathbb{E}[X|\mathcal{G}]|\mathcal{H}] = \mathbb{E}[X|\mathcal{H}]$ requires $\mathcal{H} \subseteq \mathcal{G}$. The other order, $\mathbb{E}[\mathbb{E}[X|\mathcal{H}]|\mathcal{G}]$ with $\mathcal{H} \subseteq \mathcal{G}$, also gives $\mathbb{E}[X|\mathcal{H}]$, but for a different reason: $\mathbb{E}[X|\mathcal{H}]$ is $\mathcal{H}$-measurable and therefore $\mathcal{G}$-measurable, so conditioning on $\mathcal{G}$ returns it unchanged (property (v) with $\mathcal{G}$ in the role of $\mathcal{F}$). In either case, the result is conditioning on the coarser $\sigma$-algebra.
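Both orders of iterated conditioning can be verified exactly on $\Omega = \{1,\dots,8\}$. The sketch below is an independent illustration, not the companion notebook's code; the payoff $X(\omega) = \omega^2$ and the two partitions are assumptions made for the example.

```python
from fractions import Fraction

omega = range(1, 9)
P = {w: Fraction(1, 8) for w in omega}
X = {w: Fraction(w * w) for w in omega}          # hypothetical payoff X(w) = w^2
G_atoms = [(1, 2), (3, 4), (5, 6), (7, 8)]       # finer sigma-algebra G
H_atoms = [(1, 2, 3, 4), (5, 6, 7, 8)]           # coarser sigma-algebra H, contained in G


def cond_exp(f, atoms):
    Z = {}
    for A in atoms:
        p_A = sum(P[w] for w in A)
        avg = sum(f[w] * P[w] for w in A) / p_A
        for w in A:
            Z[w] = avg
    return Z


direct = cond_exp(X, H_atoms)                         # E[X|H]
tower = cond_exp(cond_exp(X, G_atoms), H_atoms)       # E[E[X|G]|H]
reverse = cond_exp(direct, G_atoms)                   # E[E[X|H]|G]
print(direct[1], direct[8], tower == direct, reverse == direct)
```

Both comparisons come out true, for the two different reasons described above: averaging the finer averages reproduces the coarse average, and re-averaging something already constant on the finer atoms changes nothing.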

4. Conditioning on a random variable

DEFINITION

Definition 4.1. For a random variable $Y : \Omega \to \mathbb{R}$ , define

$\mathbb{E}[X|Y] := \mathbb{E}[X|\sigma(Y)],$

where $\sigma(Y) = \{Y^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\}$ is the $\sigma$ -algebra generated by $Y$ (Module 1).

Since $\mathbb{E}[X|Y]$ is $\sigma(Y)$-measurable, the Doob-Dynkin lemma guarantees the existence of a Borel function $g : \mathbb{R} \to \mathbb{R}$ such that $\mathbb{E}[X|Y] = g(Y)$. The value $g(y)$ is what is meant by $\mathbb{E}[X|Y=y]$; it can be computed from a regular conditional distribution when one exists. Like the conditional expectation itself, $g$ is determined only up to a set of values $y$ that $Y$ takes with probability zero.

Regular conditional distributions: when $X$ takes values in a Polish space (this includes $\mathbb{R}^n$ and $C([0,T])$), there exists a probability kernel, which for real-valued $X$ is a map $\kappa : \Omega \times \mathcal{B}(\mathbb{R}) \to [0,1]$, such that $\mathbb{E}[X|Y](\omega) = \int x \, \kappa(\omega, dx)$ a.s. This is the rigorous version of "the conditional distribution of $X$ given $Y = y$." Existence on general measurable spaces is not guaranteed but holds in all practically relevant cases.
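When $Y$ is discrete, the function $g$ from the Doob-Dynkin lemma can be tabulated directly, because every event $\{Y = y\}$ has positive probability and the elementary ratio formula applies. A minimal sketch, assuming two fair dice with $Y$ the first die and $X$ the sum (an example chosen here, not taken from the module):

```python
from fractions import Fraction
from collections import defaultdict

# Two fair dice; Y = first die, X = sum of both.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = {w: Fraction(1, 36) for w in omega}
X = {w: Fraction(w[0] + w[1]) for w in omega}
Y = {w: w[0] for w in omega}

# Atoms of sigma(Y) are the level sets {Y = y}.
level_sets = defaultdict(list)
for w in omega:
    level_sets[Y[w]].append(w)

# g(y) = E[X | Y = y], computed by the ratio formula since P(Y = y) > 0.
g = {y: sum(X[w] * P[w] for w in A) / sum(P[w] for w in A)
     for y, A in level_sets.items()}

# E[X | Y] is the random variable omega -> g(Y(omega)).
ce = {w: g[Y[w]] for w in omega}
mean_of_ce = sum(ce[w] * P[w] for w in omega)
print(g, mean_of_ce)
```

The table gives $g(y) = y + 7/2$, and averaging $g(Y)$ recovers $\mathbb{E}[X] = 7$. In the continuous case the same $g$ exists but can no longer be obtained from a ratio of probabilities.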

EXAMPLE

Example 4.2 (Bivariate Gaussian). Let $(X, Y)$ be jointly Gaussian with means $(\mu_X, \mu_Y)$ , standard deviations $(\sigma_X, \sigma_Y)$ , and correlation $\rho \in (-1,1)$ . Then:

$\mathbb{E}[X|Y=y] = \mu_X + \rho \frac{\sigma_X}{\sigma_Y}(y - \mu_Y).$

Derivation. Write $X = \mu_X + \alpha(Y - \mu_Y) + \varepsilon$ where we choose $\alpha$ to make $\varepsilon$ uncorrelated with $Y$ (and hence, since jointly Gaussian, independent of $Y$ ). Setting $\text{Cov}(X - \alpha Y, Y) = 0$ :

$\text{Cov}(X, Y) - \alpha \text{Var}(Y) = 0 \implies \alpha = \frac{\text{Cov}(X,Y)}{\text{Var}(Y)} = \rho \frac{\sigma_X}{\sigma_Y}.$

Then $\varepsilon = X - \mu_X - \alpha(Y - \mu_Y)$ is Gaussian and independent of $Y$ , so $\mathbb{E}[\varepsilon|Y] = \mathbb{E}[\varepsilon] = 0$ by property (vi). Applying linearity and property (iii):

$\mathbb{E}[X|Y=y] = \mu_X + \alpha(y - \mu_Y) + \mathbb{E}[\varepsilon|Y=y] = \mu_X + \rho\frac{\sigma_X}{\sigma_Y}(y - \mu_Y). \qquad \square$

In finance: if $X$ and $Y$ are jointly Gaussian log-returns, the conditional mean is linear in $Y$, so the best linear predictor of $X$ from $Y$ is also the best predictor overall in the mean-square sense. This is the foundation of factor models and regression-based hedging.
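The formula of Example 4.2 can be compared against a simulated regression. The sketch below uses float arithmetic, hypothetical parameter values and a fixed seed, all assumptions made for illustration. It builds the pair from two independent standard normals and checks that the sample slope $\widehat{\text{Cov}}(X,Y)/\widehat{\text{Var}}(Y)$ is close to $\rho\,\sigma_X/\sigma_Y$.

```python
import random


def gaussian_cond_mean(y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """E[X | Y = y] for a jointly Gaussian pair (formula of Example 4.2)."""
    return mu_x + rho * sigma_x / sigma_y * (y - mu_y)


# Hypothetical parameters, chosen only for illustration.
mu_x, mu_y, sigma_x, sigma_y, rho = 1.0, -0.5, 2.0, 0.5, 0.6
rng = random.Random(0)
n = 100_000

xs, ys = [], []
for _ in range(n):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    ys.append(mu_y + sigma_y * z1)
    xs.append(mu_x + sigma_x * (rho * z1 + (1 - rho ** 2) ** 0.5 * z2))

# Sample regression of X on Y: the slope estimates Cov(X, Y) / Var(Y).
mean_x, mean_y = sum(xs) / n, sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n
slope_hat = cov_xy / var_y
slope_theory = rho * sigma_x / sigma_y

# Residual epsilon = X - E[X|Y]; its sample mean should be near zero.
resid = [x - gaussian_cond_mean(y, mu_x, mu_y, sigma_x, sigma_y, rho)
         for x, y in zip(xs, ys)]
mean_resid = sum(resid) / n
print(slope_hat, slope_theory, mean_resid)
```

The agreement is only statistical: the estimates fluctuate with the seed and sample size, so any tolerance used to compare them should scale like $1/\sqrt{n}$.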

5. Martingales via conditional expectation

THEOREM

Theorem 5.1 (Doob Martingale). Let $X \in L^1(\Omega, \mathcal{F}, \mathbb{P})$ and $(\mathcal{F}_t)_{t \geq 0}$ a filtration with $\mathcal{F}_t \subseteq \mathcal{F}$ . Define $M_t := \mathbb{E}[X|\mathcal{F}_t]$ . Then $(M_t)_{t \geq 0}$ is a martingale with respect to $(\mathcal{F}_t)$ .

PROOF

Proof. We check the three martingale conditions.

Adaptedness: $M_t = \mathbb{E}[X|\mathcal{F}_t]$ is $\mathcal{F}_t$ -measurable by definition. ✓
Integrability: $\mathbb{E}[|M_t|] = \mathbb{E}\bigl[\,|\mathbb{E}[X|\mathcal{F}_t]|\,\bigr] \leq \mathbb{E}\bigl[\mathbb{E}[\,|X| \,\big|\, \mathcal{F}_t]\bigr] = \mathbb{E}[|X|] < \infty$, using Jensen's inequality (property (vii)) with $\varphi(x) = |x|$ and the tower property. ✓
Martingale property: For $s \leq t$ , since $\mathcal{F}_s \subseteq \mathcal{F}_t$ (filtration is increasing), the tower property gives:

$\mathbb{E}[M_t | \mathcal{F}_s] = \mathbb{E}[\mathbb{E}[X|\mathcal{F}_t]|\mathcal{F}_s] = \mathbb{E}[X|\mathcal{F}_s] = M_s. \qquad \square$
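Theorem 5.1 can be watched in action on a three-step coin-toss space, where $\mathcal{F}_t$ is generated by the first $t$ steps. This is a sketch under assumptions: the terminal variable $X = \max(\text{sum of steps}, 0)$ is a hypothetical payoff, and `cond_exp` is a helper defined here for the example.

```python
from fractions import Fraction
from itertools import product

n = 3
omega = list(product((1, -1), repeat=n))          # three fair +/-1 steps
P = {w: Fraction(1, 2 ** n) for w in omega}
X = {w: Fraction(max(sum(w), 0)) for w in omega}  # hypothetical terminal payoff


def cond_exp(f, t):
    """E[f | F_t], where F_t is generated by the first t steps."""
    Z = {}
    for prefix in product((1, -1), repeat=t):
        A = [w for w in omega if w[:t] == prefix]  # atom: paths sharing this prefix
        avg = sum(f[w] * P[w] for w in A) / sum(P[w] for w in A)
        for w in A:
            Z[w] = avg
    return Z


M = [cond_exp(X, t) for t in range(n + 1)]        # M_t = E[X | F_t]

# Martingale property: E[M_t | F_s] = M_s for all s <= t.
is_martingale = all(cond_exp(M[t], s) == M[s]
                    for t in range(n + 1) for s in range(t + 1))
print(M[0][omega[0]], M[n] == X, is_martingale)
```

$M_0$ is the constant $\mathbb{E}[X] = 3/4$ and $M_3 = X$, so the process interpolates between the unconditional mean and the realised terminal value.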

INSIGHT

Martingales and pricing. A Doob martingale is the canonical example showing that "best predictions of a terminal value" form a martingale. In risk-neutral pricing, $V_t = e^{-r(T-t)}\mathbb{E}^{\mathbb{Q}}[\Phi(S_T)|\mathcal{F}_t]$, so the discounted price process $e^{-rt}V_t = \mathbb{E}^{\mathbb{Q}}[e^{-rT}\Phi(S_T)|\mathcal{F}_t]$ is exactly a Doob martingale under $\mathbb{Q}$ with terminal value $e^{-rT}\Phi(S_T)$, provided $\Phi(S_T) \in L^1(\mathbb{Q})$. The First Fundamental Theorem of Asset Pricing links the absence of arbitrage to the existence of an equivalent measure $\mathbb{Q}$ under which discounted prices are martingales. The tower property is the mathematical engine that makes this time-consistency work.
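Written out, the time-consistency claim is a single application of Theorem 3.1(ii). For $s \leq t \leq T$, assuming $\Phi(S_T) \in L^1(\mathbb{Q})$ and a constant rate $r$:

$\mathbb{E}^{\mathbb{Q}}\bigl[e^{-rt}V_t \,\big|\, \mathcal{F}_s\bigr] = \mathbb{E}^{\mathbb{Q}}\bigl[\mathbb{E}^{\mathbb{Q}}[e^{-rT}\Phi(S_T) \,|\, \mathcal{F}_t] \,\big|\, \mathcal{F}_s\bigr] = \mathbb{E}^{\mathbb{Q}}\bigl[e^{-rT}\Phi(S_T) \,\big|\, \mathcal{F}_s\bigr] = e^{-rs}V_s.$

The first equality substitutes the pricing formula at time $t$, the second is the tower property with $\mathcal{F}_s \subseteq \mathcal{F}_t$, and the third is the pricing formula at time $s$.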

The converse question, which martingales arise as Doob martingales, has a direct answer: on a finite horizon $[0,T]$ every martingale does, with $X = M_T$, and on $[0,\infty)$ exactly the uniformly integrable ones do. The Martingale Representation Theorem is a different and deeper statement: under certain conditions (e.g., a Brownian filtration), every square-integrable martingale can be written as a constant plus a stochastic integral against the Brownian motion. This will be covered in the Stochastic Calculus course.

Validation

The companion notebook at /notebooks/probability-theory-conditional-expectation.html verifies every claim in this module using pure Python with exact rational arithmetic (cells 0–4) and float arithmetic (cell 5).

The notebook checks:

Discrete CE from scratch — $\Omega = \{a,b,c,d\}$ uniform, $\mathcal{G} = \sigma(\{a,b\},\{c,d\})$ : compute $\mathbb{E}[X|\mathcal{G}]$ and verify the defining property holds on both atoms.
Tower property — 3-level filtration on $\Omega = \{1,...,8\}$ : verify $\mathbb{E}[\mathbb{E}[X|\mathcal{G}]|\mathcal{H}] = \mathbb{E}[X|\mathcal{H}]$ exactly.
Pulling-out-known-factors — verify $\mathbb{E}[XY|\mathcal{G}] = Y \cdot \mathbb{E}[X|\mathcal{G}]$ on the same discrete setup.
Jensen's inequality — verify $\varphi(\mathbb{E}[X|\mathcal{G}]) \leq \mathbb{E}[\varphi(X)|\mathcal{G}]$ for $\varphi(x) = x^2$ .
Bivariate Gaussian CE — implement $\mathbb{E}[X|Y=y] = \mu_X + \rho(\sigma_X/\sigma_Y)(y-\mu_Y)$ and verify numerically for known parameter values; summarise all checks.

PRACTICE

Hand exercise before opening the notebook. Let $\Omega = \{a,b,c,d\}$ with $\mathbb{P}$ uniform (each outcome has probability $1/4$ ). Let $\mathcal{G} = \sigma(\{a,b\}, \{c,d\})$ , so $\mathcal{G} = \{\emptyset, \{a,b\}, \{c,d\}, \Omega\}$ . Define $X(a)=1, X(b)=3, X(c)=0, X(d)=4$ .

Compute $\mathbb{E}[X|\mathcal{G}]$ . Since $\mathbb{E}[X|\mathcal{G}]$ is $\mathcal{G}$ -measurable, it is constant on each atom. On $\{a,b\}$ : the average of $X$ over $\{a,b\}$ under the uniform measure is $(X(a)+X(b))/2 = (1+3)/2 = 2$ . On $\{c,d\}$ : $(0+4)/2 = 2$ .
Verify the defining property: $\int_{\{a,b\}} \mathbb{E}[X|\mathcal{G}] \, d\mathbb{P} = 2 \cdot \frac{1}{2} = 1$ and $\int_{\{a,b\}} X \, d\mathbb{P} = 1 \cdot \frac{1}{4} + 3 \cdot \frac{1}{4} = 1$ ✓. Check $\{c,d\}$ similarly.
Verify using the notebook — cell 1 replicates this calculation exactly.
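The hand exercise above can be reproduced in a few lines of exact arithmetic. This is an independent sketch of the same calculation, not the notebook's cell; the helper name `integral` is an assumption.

```python
from fractions import Fraction

omega = ["a", "b", "c", "d"]
P = {w: Fraction(1, 4) for w in omega}
X = {"a": Fraction(1), "b": Fraction(3), "c": Fraction(0), "d": Fraction(4)}
atoms = [("a", "b"), ("c", "d")]


def integral(f, event):
    return sum(f[w] * P[w] for w in event)


# E[X|G] is the P-weighted average of X on each atom.
Z = {}
for A in atoms:
    avg = integral(X, A) / sum(P[w] for w in A)
    for w in A:
        Z[w] = avg

# Defining property on each atom: (integral of Z, integral of X).
checks = {A: (integral(Z, A), integral(X, A)) for A in atoms}
print(Z, checks)
```

In this particular example $\mathbb{E}[X|\mathcal{G}] = 2 = \mathbb{E}[X]$ on both atoms, even though $X$ is not independent of $\mathcal{G}$ (its values on $\{a,b\}$ and $\{c,d\}$ differ). A constant conditional mean is weaker than independence.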

Limitations

Almost-sure uniqueness, not pointwise. The defining property determines $\mathbb{E}[X|\mathcal{G}]$ only up to sets of probability zero. Two functions $Z$ and $Z'$ satisfying the defining property may disagree on a $\mathbb{P}$-null set. Always say "a version of $\mathbb{E}[X|\mathcal{G}]$" when precision matters. When many conditional expectations are handled together (e.g., evaluating $\mathbb{E}[X|\mathcal{G}](\omega)$ pointwise for each $\omega$ in a simulation, or $\mathbb{E}[X|\mathcal{F}_t]$ for uncountably many $t$), a poor choice of versions can produce measurability problems.

WARNING

Simulation trap: the null-set ambiguity. In Monte Carlo, you approximate $\mathbb{E}[X|\mathcal{G}]$ by regressing $X$ on the state variables generating $\mathcal{G}$ (this is the Longstaff-Schwartz idea for American options). The regression produces one specific function of the state variables, an approximation to one version of the conditional expectation. If you then use it in a subsequent time step, for example by comparing it against a threshold to decide early exercise, you are implicitly treating that function as "correct" on every simulated path. The version ambiguity itself is harmless in practice: any two versions agree outside a $\mathbb{P}$-null set, which simulated paths hit with probability zero. More subtly: the approximation error from regression and the null-set issue are separate sources of error. Conflating them leads to incorrect convergence analysis.

The $L^1$ vs. $L^2$ gap. The geometric Hilbert-space interpretation (orthogonal projection) requires $X \in L^2$ . For $X \in L^1 \setminus L^2$ , conditional expectation still exists by the Radon-Nikodym argument, but the "best predictor in mean-square" interpretation breaks down — the mean-square error may be infinite. In practice, all bounded payoffs are in $L^p$ for all $p$ , so this distinction rarely bites in pricing, but it does matter for heavy-tailed distributions (e.g., stable processes, power-law returns).

Regular conditional distributions. On general measurable spaces, regular conditional distributions (the kernels $\kappa(\omega, \cdot)$) need not exist. They do exist when $\Omega$ is a Polish space (complete separable metric space) equipped with its Borel $\sigma$-algebra, which covers $\mathbb{R}^n$, $C([0,T])$, and $D([0,T])$ (càdlàg paths, with the Skorokhod topology). The path spaces used by standard models in finance satisfy this. But on exotic path spaces (non-separable function spaces, abstract probability spaces without metric structure) the existence of regular conditional distributions is not automatic.

Monte Carlo regression error. In numerical computation (LSMC, nested simulation), $\mathbb{E}[X|\mathcal{G}]$ is approximated by regression: fit $Z = \sum_k c_k \phi_k(\text{state variables})$ where $\phi_k$ are basis functions. This introduces: (1) approximation error from truncating the basis expansion; (2) statistical error from estimating coefficients on finitely many paths; (3) model error from choosing the wrong state variables to condition on. The tower property holds exactly in theory; in a simulation, small violations of it are a direct measure of regression quality.
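The effect of the basis choice can be seen in a toy regression where the true conditional expectation is known. The sketch below makes several assumptions for illustration: $Y$ standard normal, $X = Y^2 + \text{noise}$ so that $\mathbb{E}[X|Y] = Y^2$ by construction, a fixed seed, and two one-feature bases fitted by ordinary least squares. It compares the sample analogue of $\int_G Z \, d\mathbb{P} - \int_G X \, d\mathbb{P}$ on $G = \Omega$ and on the tail event $G = \{|Y| > 1\}$.

```python
import random

rng = random.Random(1)
n = 20_000
ys = [rng.gauss(0, 1) for _ in range(n)]
xs = [y * y + rng.gauss(0, 0.5) for y in ys]     # true E[X | Y] = Y^2 by construction


def fit_one_feature(features, targets):
    """OLS of targets on [1, feature]; returns (intercept, slope)."""
    m = len(features)
    f_bar, t_bar = sum(features) / m, sum(targets) / m
    cov = sum((f - f_bar) * (t - t_bar) for f, t in zip(features, targets))
    var = sum((f - f_bar) ** 2 for f in features)
    slope = cov / var
    return t_bar - slope * f_bar, slope


def fitted_values(basis):
    feats = [basis(y) for y in ys]
    c0, c1 = fit_one_feature(feats, xs)
    return [c0 + c1 * f for f in feats]


def gap_on_event(fitted, event):
    """Sample analogue of (integral of Z over G) - (integral of X over G), G = {event(Y)}."""
    return sum(z - x for z, x, y in zip(fitted, xs, ys) if event(y)) / n


linear = fitted_values(lambda y: y)              # basis {1, y}: misses the curvature
quadratic = fitted_values(lambda y: y * y)       # basis {1, y^2}: contains the truth

gap_lin_all = gap_on_event(linear, lambda y: True)
gap_lin_tail = gap_on_event(linear, lambda y: abs(y) > 1)
gap_quad_tail = gap_on_event(quadratic, lambda y: abs(y) > 1)
print(gap_lin_all, gap_lin_tail, gap_quad_tail)
```

With an intercept in the regression, the check on $G = \Omega$ passes up to rounding for any basis, so it cannot detect a poor basis. The check on the tail event exposes the linear basis while the quadratic basis passes, which suggests testing the defining property on several sets of the conditioning $\sigma$-algebra rather than on the whole sample alone.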

Conditional Expectation and the Tower Property