
Conditional Expectation and the Tower Property

Module 3 of 5 | 25 min read | Level: Hard

Setup

Why the naive definition breaks

You likely learned conditional probability as $\mathbb{P}(A|B) = \mathbb{P}(A \cap B)/\mathbb{P}(B)$, valid when $\mathbb{P}(B) > 0$. This is perfectly adequate for discrete problems. It fails the moment you work with continuous random variables.

Fix $Y$ a continuous random variable on $(\Omega, \mathcal{F}, \mathbb{P})$ (Module 1) and ask: what is $\mathbb{E}[X | Y = y]$ for a specific real value $y$? The event $\{Y = y\}$ has probability zero for every $y$. The ratio $\mathbb{P}(\cdot \cap \{Y = y\})/\mathbb{P}(\{Y = y\})$ is $0/0$: undefined, not merely small. Yet "the expected value of $X$ given $Y = y$" is a perfectly natural and computationally important quantity.

The same issue arises throughout stochastic analysis. In a Brownian filtration, $\mathcal{F}_t = \sigma(W_s, s \leq t)$ is a $\sigma$-algebra, not an event. Conditioning on $\mathcal{F}_t$ means "conditioning on the information available at time $t$", and that information is a sub-$\sigma$-algebra of $\mathcal{F}$. There is no event to plug into the ratio formula.

The modern resolution, due to Kolmogorov (1933) and built on the Radon-Nikodym theorem, defines conditional expectation as a random variable characterised by an integral identity, not a ratio.

Conventions

Throughout this module:

  • $(\Omega, \mathcal{F}, \mathbb{P})$ is the probability space from Module 1: $\Omega$ the sample space, $\mathcal{F}$ a $\sigma$-algebra on $\Omega$, $\mathbb{P}$ a probability measure.
  • $L^1(\Omega, \mathcal{F}, \mathbb{P})$ denotes the space of $\mathcal{F}$-measurable functions with $\mathbb{E}[|X|] < \infty$. $L^2$ adds $\mathbb{E}[X^2] < \infty$.
  • $\mathcal{G} \subseteq \mathcal{F}$ denotes a sub-$\sigma$-algebra: a $\sigma$-algebra in its own right, but coarser than $\mathcal{F}$.
  • "a.s." means $\mathbb{P}$-almost surely: except on a set of probability zero.
  • $(\mathcal{F}_t)_{t \geq 0}$ denotes a filtration: an increasing family of sub-$\sigma$-algebras, $\mathcal{F}_s \subseteq \mathcal{F}_t$ for $s \leq t$, representing the accumulation of market information over time.
  • $\mathbb{Q}$ denotes the risk-neutral measure; $r$ the continuously compounded risk-free rate.

The pricing motivation

The risk-neutral price of a derivative with payoff $\Phi(S_T)$ at time $t < T$ is:

$$V_t = e^{-r(T-t)} \mathbb{E}^{\mathbb{Q}}\bigl[\Phi(S_T) \,\big|\, \mathcal{F}_t\bigr].$$

This formula appears in Black-Scholes, Heston, Hull-White, and every other risk-neutral pricing model. The object $\mathbb{E}^{\mathbb{Q}}[\Phi(S_T) | \mathcal{F}_t]$ is not a number: it is a random variable, one that depends on which path the market has taken up to time $t$, i.e. on $\mathcal{F}_t$. To manipulate it (to argue that the discounted value $e^{-rt}V_t$ is a martingale, to apply Itô's lemma to it, to use it in hedging arguments) you need the measure-theoretic definition.

INSIGHT

Why this matters on a desk. Valuations computed inside a risk engine are conditional expectations: "given today's market state (the $\mathcal{F}_t$ information), what is the expected discounted payoff?" XVA desks compute $\mathbb{E}[\text{exposure} | \mathcal{F}_t]$ across thousands of Monte Carlo paths. The tower property (proved below) is the mathematical identity that guarantees path-wise consistency when you aggregate over nested simulation steps. When a model produces inconsistent valuations across time steps, a common root cause is a violation of the tower property, typically caused by approximating the conditional expectation with the wrong conditioning set.


Theory

1. Definition via Radon-Nikodym

DEFINITION

Definition 1.1 (Conditional Expectation). Let $X \in L^1(\Omega, \mathcal{F}, \mathbb{P})$ and let $\mathcal{G} \subseteq \mathcal{F}$ be a sub-$\sigma$-algebra. The conditional expectation of $X$ given $\mathcal{G}$, written $\mathbb{E}[X | \mathcal{G}]$, is defined as any $\mathcal{G}$-measurable random variable $Z : \Omega \to \mathbb{R}$ satisfying:

$$\int_G Z \, d\mathbb{P} = \int_G X \, d\mathbb{P} \qquad \text{for all } G \in \mathcal{G}.$$

Such a $Z$ exists and is unique $\mathbb{P}$-a.s. We call this integral identity the defining property of conditional expectation.

Two conditions are imposed: (i) $Z$ must be $\mathcal{G}$-measurable, i.e. determinable from the information in $\mathcal{G}$ alone; (ii) $Z$ must integrate to the same value as $X$ over every set $G \in \mathcal{G}$. Together these say: $Z$ is the best guess of $X$ given the information in $\mathcal{G}$, calibrated so that its integral always matches that of $X$.

Existence and uniqueness via Radon-Nikodym. Define a signed measure $\nu : \mathcal{G} \to \mathbb{R}$ by $\nu(G) = \int_G X \, d\mathbb{P}$. Since $X \in L^1$, $\nu$ is a finite signed measure on $(\Omega, \mathcal{G})$, and $\nu \ll \mathbb{P}|_\mathcal{G}$ (absolutely continuous with respect to $\mathbb{P}$ restricted to $\mathcal{G}$). By the Radon-Nikodym theorem, there exists a $\mathcal{G}$-measurable function $Z = d\nu/d(\mathbb{P}|_\mathcal{G})$ satisfying $\nu(G) = \int_G Z \, d\mathbb{P}$ for all $G \in \mathcal{G}$. This is exactly the defining property. Any two such functions agree $\mathbb{P}$-a.s. by the uniqueness clause of the Radon-Nikodym theorem.
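On a finite sample space the Radon-Nikodym density reduces to a probability-weighted average over each atom of the conditioning σ-algebra, which makes Definition 1.1 easy to check by hand or by machine. The following is a minimal sketch in exact rational arithmetic; the sample space, the non-uniform weights `P`, the values `X`, and the helper names `cond_exp` and `integral` are illustrative assumptions, not taken from the companion notebook.

```python
from fractions import Fraction

def cond_exp(X, P, partition):
    """E[X | sigma(partition)] on a finite space, as a dict omega -> value."""
    Z = {}
    for atom in partition:
        mass = sum(P[w] for w in atom)                 # P(atom), assumed > 0
        avg = sum(X[w] * P[w] for w in atom) / mass    # weighted average on the atom
        for w in atom:
            Z[w] = avg                                 # constant on each atom => G-measurable
    return Z

def integral(f, P, event):
    """Integral of f over the event with respect to P."""
    return sum(f[w] * P[w] for w in event)

# Hypothetical four-point space with non-uniform probabilities.
P = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 8), "d": Fraction(1, 8)}
X = {"a": 1, "b": 4, "c": 0, "d": 6}
partition = [("a", "b"), ("c", "d")]

Z = cond_exp(X, P, partition)

# Every set in G is a union of atoms; check the defining property on all of them.
events = [(), ("a", "b"), ("c", "d"), ("a", "b", "c", "d")]
checks = [integral(Z, P, G) == integral(X, P, G) for G in events]
print(Z, checks)
```

Here `Z` equals 2 on the atom {a, b} and 3 on {c, d}, and the integral identity holds on all four sets of the σ-algebra. The weights matter: with non-uniform `P` the atom value is a weighted average, not the plain mean of the `X` values.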


2. Geometric interpretation (L² case)

REMARK

Orthogonal projection in $L^2$. When $X \in L^2(\Omega, \mathcal{F}, \mathbb{P})$, the conditional expectation $\mathbb{E}[X|\mathcal{G}]$ is the orthogonal projection of $X$ onto the closed subspace $L^2(\Omega, \mathcal{G}, \mathbb{P}) \subseteq L^2(\Omega, \mathcal{F}, \mathbb{P})$.

Orthogonality here means: the residual $X - \mathbb{E}[X|\mathcal{G}]$ is orthogonal to every $\mathcal{G}$-measurable $Z \in L^2$:

$$\mathbb{E}\bigl[(X - \mathbb{E}[X|\mathcal{G}]) \cdot Z\bigr] = 0 \qquad \text{for all } Z \in L^2(\Omega, \mathcal{G}, \mathbb{P}).$$

This is equivalent to the defining property: taking $Z = \mathbf{1}_G$ for $G \in \mathcal{G}$ recovers the integral identity, and a linearity/density argument extends it from indicators to all of $L^2(\Omega, \mathcal{G}, \mathbb{P})$. The projection interpretation implies that $\mathbb{E}[X|\mathcal{G}]$ is the best $\mathcal{G}$-measurable predictor of $X$ in the mean-square sense: it minimises $\mathbb{E}[(X - Z)^2]$ over all $\mathcal{G}$-measurable $Z \in L^2$.

The geometric picture makes several properties obvious. Projecting onto a subspace and then onto a smaller subspace contained in it gives the same result as projecting onto the smaller subspace directly: this is the tower property. A vector already lying in the subspace projects to itself: this is $\mathbb{E}[X|\mathcal{F}] = X$. If $X$ is independent of $\mathcal{G}$, the centred variable $X - \mathbb{E}[X]$ is orthogonal to the subspace and projects to zero, leaving only the constant: this is the independence case $\mathbb{E}[X|\mathcal{G}] = \mathbb{E}[X]$.
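The projection picture can be checked numerically on a small finite space. The sketch below, under illustrative assumptions (a uniform four-point space, the values of `X`, and an arbitrary competing predictor `W`), verifies that the residual is orthogonal to a G-measurable variable and that the mean-square error splits as in Pythagoras' theorem.

```python
from fractions import Fraction

# Illustrative setup: uniform four-point space, G generated by {a,b} and {c,d}.
omega = ["a", "b", "c", "d"]
P = {w: Fraction(1, 4) for w in omega}
X = {"a": 1, "b": 3, "c": 0, "d": 4}
atoms = [("a", "b"), ("c", "d")]

def expect(f):
    """E[f] for a function given as a dict omega -> value."""
    return sum(f[w] * P[w] for w in omega)

# Conditional expectation: probability-weighted average on each atom.
Z = {}
for atom in atoms:
    mass = sum(P[w] for w in atom)
    avg = sum(X[w] * P[w] for w in atom) / mass
    for w in atom:
        Z[w] = avg

# Any G-measurable competitor is constant on atoms; this one is arbitrary.
W = {"a": 5, "b": 5, "c": -1, "d": -1}

inner_product = expect({w: (X[w] - Z[w]) * W[w] for w in omega})   # should be 0
mse_W = expect({w: (X[w] - W[w]) ** 2 for w in omega})
mse_Z = expect({w: (X[w] - Z[w]) ** 2 for w in omega})
gap = expect({w: (Z[w] - W[w]) ** 2 for w in omega})

print(inner_product, mse_W, mse_Z + gap)
```

Because `mse_W = mse_Z + gap` with `gap >= 0`, no G-measurable predictor can beat `Z` in mean square, which is the minimisation statement in the remark above.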


3. Key properties

THEOREM

Theorem 3.1 (Properties of Conditional Expectation). Let $X, Y \in L^1(\Omega, \mathcal{F}, \mathbb{P})$, $\alpha, \beta \in \mathbb{R}$, and $\mathcal{G}, \mathcal{H} \subseteq \mathcal{F}$ sub-$\sigma$-algebras. Then:

(i) Linearity: $\mathbb{E}[\alpha X + \beta Y | \mathcal{G}] = \alpha \mathbb{E}[X|\mathcal{G}] + \beta \mathbb{E}[Y|\mathcal{G}]$ a.s.

(ii) Tower property: If $\mathcal{H} \subseteq \mathcal{G} \subseteq \mathcal{F}$, then $\mathbb{E}\bigl[\mathbb{E}[X|\mathcal{G}]\,\big|\,\mathcal{H}\bigr] = \mathbb{E}[X|\mathcal{H}]$ a.s.

(iii) Pulling out known factors: If $Y$ is $\mathcal{G}$-measurable and $XY \in L^1$, then $\mathbb{E}[XY|\mathcal{G}] = Y \cdot \mathbb{E}[X|\mathcal{G}]$ a.s.

(iv) Trivial conditioning: $\mathbb{E}[X|\{\emptyset, \Omega\}] = \mathbb{E}[X]$ a.s. (constant random variable).

(v) Full conditioning: $\mathbb{E}[X|\mathcal{F}] = X$ a.s.

(vi) Independence: If $X$ is independent of $\mathcal{G}$ (i.e. $X$ is independent of every $G \in \mathcal{G}$), then $\mathbb{E}[X|\mathcal{G}] = \mathbb{E}[X]$ a.s.

(vii) Jensen's inequality: If $\varphi : \mathbb{R} \to \mathbb{R}$ is convex and $\varphi(X) \in L^1$, then $\varphi(\mathbb{E}[X|\mathcal{G}]) \leq \mathbb{E}[\varphi(X)|\mathcal{G}]$ a.s.

PROOF

Proof of (ii): Tower property. We must show that $Z := \mathbb{E}[X|\mathcal{H}]$ satisfies the defining property of $\mathbb{E}[\mathbb{E}[X|\mathcal{G}]|\mathcal{H}]$. The candidate $Z$ is $\mathcal{H}$-measurable by definition of $\mathbb{E}[X|\mathcal{H}]$. It remains to check the integral condition: for every $H \in \mathcal{H}$,

$$\int_H \mathbb{E}[X|\mathcal{H}] \, d\mathbb{P} \stackrel{?}{=} \int_H \mathbb{E}[X|\mathcal{G}] \, d\mathbb{P}.$$

Since $\mathcal{H} \subseteq \mathcal{G}$, every $H \in \mathcal{H}$ is also in $\mathcal{G}$. By the defining property applied to $\mathbb{E}[X|\mathcal{G}]$ with $G = H$:

$$\int_H \mathbb{E}[X|\mathcal{G}] \, d\mathbb{P} = \int_H X \, d\mathbb{P}.$$

By the defining property applied to $\mathbb{E}[X|\mathcal{H}]$ with $G = H$:

$$\int_H \mathbb{E}[X|\mathcal{H}] \, d\mathbb{P} = \int_H X \, d\mathbb{P}.$$

Both sides equal $\int_H X \, d\mathbb{P}$, so $\mathbb{E}[\mathbb{E}[X|\mathcal{G}]|\mathcal{H}] = \mathbb{E}[X|\mathcal{H}]$ a.s. $\square$

Proof sketch of (iii): Pulling out known factors. Verify that $Y \cdot \mathbb{E}[X|\mathcal{G}]$ satisfies both conditions of Definition 1.1. (a) It is $\mathcal{G}$-measurable since $Y$ is $\mathcal{G}$-measurable and $\mathbb{E}[X|\mathcal{G}]$ is $\mathcal{G}$-measurable. (b) For any $G \in \mathcal{G}$:

$$\int_G Y \cdot \mathbb{E}[X|\mathcal{G}] \, d\mathbb{P} = \int_G XY \, d\mathbb{P}.$$

This identity, $\int_G Y f \, d\mathbb{P} = \int_G Y X \, d\mathbb{P}$ with $f = \mathbb{E}[X|\mathcal{G}]$, follows from approximating $Y$ by $\mathcal{G}$-measurable simple functions and using linearity and the defining property. $\square$
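The first step of that approximation, the case where $Y$ is an indicator, can be written out explicitly. Take $Y = \mathbf{1}_A$ with $A \in \mathcal{G}$. For any $G \in \mathcal{G}$ the set $G \cap A$ is again in $\mathcal{G}$, so the defining property applies on it:

```latex
\int_G \mathbf{1}_A \, \mathbb{E}[X|\mathcal{G}] \, d\mathbb{P}
  = \int_{G \cap A} \mathbb{E}[X|\mathcal{G}] \, d\mathbb{P}
  = \int_{G \cap A} X \, d\mathbb{P}
  = \int_G \mathbf{1}_A X \, d\mathbb{P}.
```

Linearity extends this to simple functions $Y = \sum_i a_i \mathbf{1}_{A_i}$ with $A_i \in \mathcal{G}$, and a limiting argument (monotone or dominated convergence, using $XY \in L^1$) extends it to general $\mathcal{G}$-measurable $Y$.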

The tower property: qualitative reading. Conditioning on a coarser $\sigma$-algebra discards information. If $\mathcal{H} \subseteq \mathcal{G}$ (so $\mathcal{H}$ is the coarser $\sigma$-algebra), then conditioning on $\mathcal{H}$ after already conditioning on $\mathcal{G}$ collapses back to what the coarser $\mathcal{H}$ would have given you directly. Iterated conditioning always falls back to the smallest $\sigma$-algebra in the chain.

Common mistake. Candidates confuse the direction: $\mathbb{E}[\mathbb{E}[X|\mathcal{G}]|\mathcal{H}] = \mathbb{E}[X|\mathcal{H}]$ requires $\mathcal{H} \subseteq \mathcal{G}$. The other order, $\mathbb{E}[\mathbb{E}[X|\mathcal{H}]|\mathcal{G}]$ with $\mathcal{H} \subseteq \mathcal{G}$, also gives $\mathbb{E}[X|\mathcal{H}]$, but for a different reason: $\mathbb{E}[X|\mathcal{H}]$ is $\mathcal{H}$-measurable and hence $\mathcal{G}$-measurable, so further conditioning on $\mathcal{G}$ returns it unchanged (property (v) with $\mathcal{G}$ in the role of $\mathcal{F}$). In either case, the result is conditioning on the coarser $\sigma$-algebra.
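Both orders of iterated conditioning can be verified exactly on a finite space. The sketch below assumes a uniform measure on eight points, a fine partition generating G and a coarse one generating H (so that H is contained in G), and an arbitrary choice of X; all of these, and the helper name `cond_exp`, are illustrative assumptions.

```python
from fractions import Fraction

omega = list(range(1, 9))
P = {w: Fraction(1, 8) for w in omega}
X = {w: w * w for w in omega}                      # hypothetical choice of X

G_atoms = [(1, 2), (3, 4), (5, 6), (7, 8)]         # finer partition
H_atoms = [(1, 2, 3, 4), (5, 6, 7, 8)]             # coarser partition, sigma(H) inside sigma(G)

def cond_exp(f, atoms):
    """Conditional expectation given the sigma-algebra generated by a partition."""
    out = {}
    for atom in atoms:
        mass = sum(P[w] for w in atom)
        avg = sum(f[w] * P[w] for w in atom) / mass
        for w in atom:
            out[w] = avg
    return out

direct = cond_exp(X, H_atoms)                           # E[X | H]
tower = cond_exp(cond_exp(X, G_atoms), H_atoms)         # E[ E[X|G] | H ]
reverse = cond_exp(cond_exp(X, H_atoms), G_atoms)       # E[ E[X|H] | G ]

print(direct[1], direct[5])
print(tower == direct, reverse == direct)
```

Both comparisons hold: the first by the tower property, the second because E[X|H] is already constant on every atom of G, so averaging over those atoms changes nothing.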


4. Conditioning on a random variable

DEFINITION

Definition 4.1. For a random variable $Y : \Omega \to \mathbb{R}$, define

$$\mathbb{E}[X|Y] := \mathbb{E}[X|\sigma(Y)],$$

where $\sigma(Y) = \{Y^{-1}(B) : B \in \mathcal{B}(\mathbb{R})\}$ is the $\sigma$-algebra generated by $Y$ (Module 1).

Since $\mathbb{E}[X|Y]$ is $\sigma(Y)$-measurable, the Doob-Dynkin lemma guarantees the existence of a Borel function $g : \mathbb{R} \to \mathbb{R}$ such that $\mathbb{E}[X|Y] = g(Y)$. The function $g$ evaluated at $y$ gives the conditional expectation $\mathbb{E}[X|Y=y]$ in the sense of regular conditional distributions.

Regular conditional distributions: for most practical spaces (Polish spaces, which include $\mathbb{R}^n$ and $C([0,T])$), there exists a probability kernel $\kappa : \Omega \times \mathcal{B}(\mathbb{R}) \to [0,1]$ such that $\mathbb{E}[X|Y](\omega) = \int x \, \kappa(\omega, dx)$. This is the rigorous version of "the conditional distribution of $X$ given $Y = y$." Existence on general measurable spaces is not guaranteed but holds in all practically relevant cases.

EXAMPLE

Example 4.2 (Bivariate Gaussian). Let $(X, Y)$ be jointly Gaussian with means $(\mu_X, \mu_Y)$, standard deviations $(\sigma_X, \sigma_Y)$, and correlation $\rho \in (-1,1)$. Then:

$$\mathbb{E}[X|Y=y] = \mu_X + \rho \frac{\sigma_X}{\sigma_Y}(y - \mu_Y).$$

Derivation. Write $X = \mu_X + \alpha(Y - \mu_Y) + \varepsilon$ where we choose $\alpha$ to make $\varepsilon$ uncorrelated with $Y$ (and hence, since $(\varepsilon, Y)$ is jointly Gaussian, independent of $Y$). Setting $\text{Cov}(X - \alpha Y, Y) = 0$:

$$\text{Cov}(X, Y) - \alpha \text{Var}(Y) = 0 \implies \alpha = \frac{\text{Cov}(X,Y)}{\text{Var}(Y)} = \rho \frac{\sigma_X}{\sigma_Y}.$$

Then $\varepsilon = X - \mu_X - \alpha(Y - \mu_Y)$ is Gaussian and independent of $Y$, so $\mathbb{E}[\varepsilon|Y] = \mathbb{E}[\varepsilon] = 0$ by property (vi). Applying linearity and property (iii):

$$\mathbb{E}[X|Y=y] = \mu_X + \alpha(y - \mu_Y) + \mathbb{E}[\varepsilon|Y=y] = \mu_X + \rho\frac{\sigma_X}{\sigma_Y}(y - \mu_Y). \qquad \square$$

In finance: if $X$ and $Y$ are jointly Gaussian log-returns, this formula is the best mean-square predictor of $X$ from $Y$, and it is linear in $Y$. This is the foundation of factor models and regression-based hedging.
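The algebra in the derivation can be checked with exact arithmetic. The sketch below picks hypothetical parameter values (chosen rational so that `Fraction` applies), computes the slope, and confirms that the residual is uncorrelated with Y. The function name `gaussian_cond_mean` is an illustrative assumption, and the residual variance identity $\text{Var}(\varepsilon) = \sigma_X^2(1 - \rho^2)$ is a standard by-product of the same decomposition that the example above does not state.

```python
from fractions import Fraction

def gaussian_cond_mean(mu_x, mu_y, sigma_x, sigma_y, rho):
    """Return (alpha, g) where g(y) = E[X | Y = y] for a bivariate Gaussian."""
    alpha = rho * sigma_x / sigma_y
    def g(y):
        return mu_x + alpha * (y - mu_y)
    return alpha, g

# Hypothetical parameters for illustration only.
mu_x, mu_y = Fraction(1), Fraction(-2)
sigma_x, sigma_y, rho = Fraction(2), Fraction(1), Fraction(3, 5)

alpha, g = gaussian_cond_mean(mu_x, mu_y, sigma_x, sigma_y, rho)

cov_xy = rho * sigma_x * sigma_y
var_y = sigma_y ** 2

# Cov(eps, Y) = Cov(X, Y) - alpha * Var(Y): zero by the choice of alpha.
resid_cov = cov_xy - alpha * var_y
# Var(eps) = Var(X) - 2 alpha Cov(X, Y) + alpha^2 Var(Y).
resid_var = sigma_x ** 2 - 2 * alpha * cov_xy + alpha ** 2 * var_y

print(alpha, g(mu_y), g(Fraction(0)), resid_cov, resid_var)
```

At $y = \mu_Y$ the conditional mean equals $\mu_X$, and each unit move in $y$ shifts it by `alpha`.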


5. Martingales via conditional expectation

THEOREM

Theorem 5.1 (Doob Martingale). Let $X \in L^1(\Omega, \mathcal{F}, \mathbb{P})$ and $(\mathcal{F}_t)_{t \geq 0}$ a filtration with $\mathcal{F}_t \subseteq \mathcal{F}$. Define $M_t := \mathbb{E}[X|\mathcal{F}_t]$. Then $(M_t)_{t \geq 0}$ is a martingale with respect to $(\mathcal{F}_t)$.

PROOF

Proof. We check the three martingale conditions.

  1. Adaptedness: $M_t = \mathbb{E}[X|\mathcal{F}_t]$ is $\mathcal{F}_t$-measurable by definition. ✓

  2. Integrability: $\mathbb{E}[|M_t|] = \mathbb{E}[|\mathbb{E}[X|\mathcal{F}_t]|] \leq \mathbb{E}[\mathbb{E}[|X| \,|\, \mathcal{F}_t]] = \mathbb{E}[|X|] < \infty$, using Jensen's inequality (property (vii)) and the tower property. ✓

  3. Martingale property: For $s \leq t$, since $\mathcal{F}_s \subseteq \mathcal{F}_t$ (filtration is increasing), the tower property gives:

$$\mathbb{E}[M_t | \mathcal{F}_s] = \mathbb{E}[\mathbb{E}[X|\mathcal{F}_t]|\mathcal{F}_s] = \mathbb{E}[X|\mathcal{F}_s] = M_s. \qquad \square$$
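The theorem can be illustrated on a three-step coin-flip space, where $\mathcal{F}_t$ is generated by the first $t$ flips. The following sketch is an illustration under stated assumptions: fair independent flips, a hypothetical terminal payoff `X`, and helper names `cond_exp_at` and `M` that are not from the source. It builds $M_t = \mathbb{E}[X|\mathcal{F}_t]$ by averaging over paths that share a prefix and checks the martingale identity for every pair $s \leq t$.

```python
from fractions import Fraction
from itertools import product

N = 3
omega = list(product((0, 1), repeat=N))      # all 8 flip sequences, equally likely

def X(w):
    """Hypothetical terminal payoff: call-like function of the number of heads."""
    return max(sum(w) - 1, 0)

def cond_exp_at(f, w, t):
    """E[f | F_t](w): average f over the paths agreeing with w on the first t flips."""
    block = [v for v in omega if v[:t] == w[:t]]
    return Fraction(sum(f(v) for v in block), len(block))

def M(t):
    """The Doob martingale at time t, as a function of the path."""
    return lambda w: cond_exp_at(X, w, t)

# Martingale property: E[M_t | F_s] = M_s for all s <= t and all paths.
martingale_ok = all(
    cond_exp_at(M(t), w, s) == M(s)(w)
    for s in range(N + 1)
    for t in range(s, N + 1)
    for w in omega
)

print(M(0)(omega[0]), martingale_ok)
```

At time 0 the process equals the unconditional mean of `X`, and at time `N` it equals `X` itself, matching properties (iv) and (v).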

INSIGHT

Martingales and pricing. A Doob martingale is the canonical example showing that "best predictions of a terminal value" form a martingale. In risk-neutral pricing: $V_t = e^{-r(T-t)}\mathbb{E}^{\mathbb{Q}}[\Phi(S_T)|\mathcal{F}_t]$, so the discounted price process $e^{-rt}V_t = \mathbb{E}^{\mathbb{Q}}[e^{-rT}\Phi(S_T)|\mathcal{F}_t]$ is exactly a Doob martingale under $\mathbb{Q}$ with terminal value $e^{-rT}\Phi(S_T)$. The no-arbitrage condition requires discounted prices to be $\mathbb{Q}$-martingales; this is the First Fundamental Theorem of Asset Pricing. The tower property is the mathematical engine that makes this time-consistency work.

A related question, which martingales can be written as stochastic integrals, is the content of the Martingale Representation Theorem: under certain conditions (e.g., a Brownian filtration), every square-integrable martingale can be written as a constant plus a stochastic integral against the driving Brownian motion. (On a finite horizon $[0,T]$ every martingale is trivially a Doob martingale with terminal variable $M_T$; on an infinite horizon the Doob martingales are exactly the uniformly integrable ones.) This will be covered in the Stochastic Calculus course.


Validation

The companion notebook at /notebooks/probability-theory-conditional-expectation.html verifies every claim in this module using pure Python with exact rational arithmetic (cells 0–4) and float arithmetic (cell 5).

The notebook checks:

  1. Discrete CE from scratch: $\Omega = \{a,b,c,d\}$ uniform, $\mathcal{G} = \sigma(\{a,b\},\{c,d\})$: compute $\mathbb{E}[X|\mathcal{G}]$ and verify the defining property holds on both atoms.
  2. Tower property: 3-level filtration on $\Omega = \{1,...,8\}$: verify $\mathbb{E}[\mathbb{E}[X|\mathcal{G}]|\mathcal{H}] = \mathbb{E}[X|\mathcal{H}]$ exactly.
  3. Pulling-out-known-factors: verify $\mathbb{E}[XY|\mathcal{G}] = Y \cdot \mathbb{E}[X|\mathcal{G}]$ on the same discrete setup.
  4. Jensen's inequality: verify $\varphi(\mathbb{E}[X|\mathcal{G}]) \leq \mathbb{E}[\varphi(X)|\mathcal{G}]$ for $\varphi(x) = x^2$.
  5. Bivariate Gaussian CE: implement $\mathbb{E}[X|Y=y] = \mu_X + \rho(\sigma_X/\sigma_Y)(y-\mu_Y)$ and verify numerically for known parameter values; summarise all checks.

PRACTICE

Hand exercise before opening the notebook. Let $\Omega = \{a,b,c,d\}$ with $\mathbb{P}$ uniform (each outcome has probability $1/4$). Let $\mathcal{G} = \sigma(\{a,b\}, \{c,d\})$, so $\mathcal{G} = \{\emptyset, \{a,b\}, \{c,d\}, \Omega\}$. Define $X(a)=1, X(b)=3, X(c)=0, X(d)=4$.

  1. Compute $\mathbb{E}[X|\mathcal{G}]$. Since $\mathbb{E}[X|\mathcal{G}]$ is $\mathcal{G}$-measurable, it is constant on each atom. On $\{a,b\}$: the average of $X$ over $\{a,b\}$ under the uniform measure is $(X(a)+X(b))/2 = (1+3)/2 = 2$. On $\{c,d\}$: $(0+4)/2 = 2$.
  2. Verify the defining property: $\int_{\{a,b\}} \mathbb{E}[X|\mathcal{G}] \, d\mathbb{P} = 2 \cdot \frac{1}{2} = 1$ and $\int_{\{a,b\}} X \, d\mathbb{P} = 1 \cdot \frac{1}{4} + 3 \cdot \frac{1}{4} = 1$ ✓. Check $\{c,d\}$ similarly.
  3. Verify using the notebook: cell 1 replicates this calculation exactly.

Limitations

Almost-sure uniqueness, not pointwise. The defining property determines $\mathbb{E}[X|\mathcal{G}]$ only up to sets of probability zero. Two functions $Z$ and $Z'$ satisfying the defining property may disagree on a $\mathbb{P}$-null set. Always say "a version of $\mathbb{E}[X|\mathcal{G}]$" when precision matters. When concatenating conditional expectations (e.g., computing $\mathbb{E}[X|\mathcal{G}](\omega)$ for each $\omega$ in a simulation), a poor choice of version can produce measurability problems.

WARNING

Simulation trap: the null-set ambiguity. In Monte Carlo, you approximate $\mathbb{E}[X|\mathcal{G}]$ by regressing $X$ on the state variables generating $\mathcal{G}$ (this is the Longstaff-Schwartz idea for American options). The regression produces one specific version of the conditional expectation. If you then use this version in a subsequent time step, for example comparing it against a threshold to decide early exercise, you are implicitly assuming this version is "correct" on every simulated path. In practice this is fine for $\mathbb{P}$-almost all paths, but it fails on null sets. More subtly: the approximation error from regression and the null-set issue are separate sources of error. Conflating them leads to incorrect convergence analysis.

The $L^1$ vs. $L^2$ gap. The geometric Hilbert-space interpretation (orthogonal projection) requires $X \in L^2$. For $X \in L^1 \setminus L^2$, conditional expectation still exists by the Radon-Nikodym argument, but the "best predictor in mean-square" interpretation breaks down because the mean-square error may be infinite. In practice, all bounded payoffs are in $L^p$ for all $p$, so this distinction rarely bites in pricing, but it does matter for heavy-tailed distributions (e.g., stable processes, power-law returns).

Regular conditional distributions. On general measurable spaces, regular conditional distributions (the kernels $\kappa(\omega, \cdot)$) need not exist. They do exist when $\Omega$ is a Polish space (complete separable metric space), which covers $\mathbb{R}^n$, $C([0,T])$, and $D([0,T])$ (càdlàg paths). For stochastic processes in finance this is always satisfied. But on exotic path spaces (non-separable function spaces, abstract probability spaces without metric structure) the existence of regular conditional distributions is not automatic.

Monte Carlo regression error. In numerical computation (LSMC, nested simulation), $\mathbb{E}[X|\mathcal{G}]$ is approximated by regression: fit $Z = \sum_k c_k \phi_k(\text{state variables})$ where $\phi_k$ are basis functions. This introduces: (1) approximation error from truncating the basis expansion; (2) statistical error from estimating coefficients on finitely many paths; (3) model error from choosing the wrong state variables to condition on. The tower property holds exactly in theory; in a simulation, small violations of it are a direct measure of regression quality.
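The difference between basis error and statistical error can be seen in a toy regression. The sketch below is a minimal illustration and not an LSMC implementation: it assumes a synthetic model $X = Y^2 + \text{noise}$ with $Y$ standard normal, so the true conditional expectation is $\mathbb{E}[X|Y] = Y^2$, and compares a one-feature least-squares fit on the basis $\{1, y\}$ against the basis $\{1, y^2\}$. The model, the seed, the path count, and the helper names are all assumptions made for illustration.

```python
import random

def fit_simple(features, targets):
    """Ordinary least squares for targets ~ a + b * feature (one feature)."""
    n = len(features)
    mean_f = sum(features) / n
    mean_t = sum(targets) / n
    sxx = sum((f - mean_f) ** 2 for f in features)
    sxy = sum((f - mean_f) * (t - mean_t) for f, t in zip(features, targets))
    b = sxy / sxx
    a = mean_t - b * mean_f
    return a, b

rng = random.Random(0)                      # fixed seed for reproducibility
n_paths = 20000
ys = [rng.gauss(0.0, 1.0) for _ in range(n_paths)]
xs = [y * y + rng.gauss(0.0, 0.5) for y in ys]   # synthetic model: E[X | Y] = Y^2

def mse_to_truth(basis):
    """Mean squared distance between the fitted regression and the true CE y^2."""
    a, b = fit_simple([basis(y) for y in ys], xs)
    return sum((a + b * basis(y) - y * y) ** 2 for y in ys) / n_paths

mse_linear = mse_to_truth(lambda y: y)          # basis {1, y}: cannot represent y^2
mse_quadratic = mse_to_truth(lambda y: y * y)   # basis {1, y^2}: contains the truth

print(mse_linear, mse_quadratic)
```

With the linear basis the error stays near $\text{Var}(Y^2) = 2$ however many paths are used, which is pure basis error. With the quadratic basis only the statistical error remains, and it shrinks as the number of paths grows.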


Interview Angle

L1 — Junior quant / quant developer

Expected depth: State the defining property, identify CE as a random variable not a number, state the tower property correctly, apply independence.

PRACTICE

Q1. "What is $\mathbb{E}[X|\mathcal{F}_t]$ in the context of option pricing?"

Expected: it is the time-$t$ risk-neutral value (up to discounting), a random variable representing the expected payoff given the market information available at time $t$. It is not a single number; it depends on the realised path up to $t$.

Common mistake: saying "it is the conditional probability" (it is an expectation) or "it is a number" (it is an $\mathcal{F}_t$-measurable random variable).

Q2. "State the tower property."

Expected: if $\mathcal{H} \subseteq \mathcal{G} \subseteq \mathcal{F}$, then $\mathbb{E}[\mathbb{E}[X|\mathcal{G}]|\mathcal{H}] = \mathbb{E}[X|\mathcal{H}]$ a.s. Strong answer adds: "The smaller (coarser) $\sigma$-algebra wins."

Common mistake: reversing the direction, i.e. stating the outer conditioning is on $\mathcal{G}$ and the inner on $\mathcal{H}$ without checking the inclusion order.

Q3. "If $X$ and $Y$ are independent, what is $\mathbb{E}[X|Y]$?"

Expected: $\mathbb{E}[X|Y] = \mathbb{E}[X]$ a.s., the constant random variable equal to the unconditional mean. Knowing $Y$ gives no information about $X$.

L2 — Senior quant

Expected depth: Prove the tower property from the defining property, explain why CE is not a number, prove the pulling-out-known-factors property.

PRACTICE

Q1. "Prove the tower property from the defining property of conditional expectation."

Expected: the proof in §3 above. Verify that $\mathbb{E}[X|\mathcal{H}]$ satisfies the defining integral identity for $\mathbb{E}[\mathbb{E}[X|\mathcal{G}]|\mathcal{H}]$ by using $\mathcal{H} \subseteq \mathcal{G}$ and applying the defining property twice.

Q2. "Why is $\mathbb{E}[X|\mathcal{G}]$ a random variable and not a single number?"

Expected: it is a function $\Omega \to \mathbb{R}$, $\mathcal{G}$-measurable. Its value at $\omega$ depends on which $\mathcal{G}$-atom contains $\omega$. When $\mathcal{G} = \sigma(Y)$ for a continuous $Y$, it is a Borel function $g(Y)$ of the conditioning variable, so it varies with the conditioning information (measurably, not necessarily continuously). Saying "it is a number" is the mistake of treating $\mathcal{G}$ as a single event rather than a $\sigma$-algebra.

Q3. "State and prove: if $Y$ is $\mathcal{G}$-measurable, then $\mathbb{E}[XY|\mathcal{G}] = Y \mathbb{E}[X|\mathcal{G}]$ a.s."

Expected: verify the two conditions of Definition 1.1: $\mathcal{G}$-measurability (product of $\mathcal{G}$-measurable functions) and the integral identity. For the integral identity: approximate $Y$ by simple functions $Y_n \nearrow Y$, use linearity and the MCT; or use the functional form argument for bounded $Y$ and extend by density. Strong answers note that $Y$ bounded or $Y \geq 0$ are sufficient conditions; the general $L^1$ case requires $XY \in L^1$.

L3 — Quant researcher

Expected depth: Radon-Nikodym theorem statement and role in CE existence, Monte Carlo regression error decomposition, failure conditions for CE.

PRACTICE

Q1. "What is the Radon-Nikodym theorem and why does it guarantee the existence of conditional expectation?"

Expected: the RN theorem states that if $\mu$ is a $\sigma$-finite measure and $\nu$ a finite signed measure on $(\Omega, \mathcal{G})$ with $\nu \ll \mu$, there exists a $\mathcal{G}$-measurable $f$, unique $\mu$-a.e., with $\nu(G) = \int_G f \, d\mu$ for all $G \in \mathcal{G}$. Apply with $\mu = \mathbb{P}|_\mathcal{G}$ and $\nu(G) = \int_G X \, d\mathbb{P}$: the density $f = d\nu/d\mu$ is $\mathbb{E}[X|\mathcal{G}]$.

Q2. "In Monte Carlo simulation, you approximate $\mathbb{E}[X|\mathcal{G}]$ by regressing $X$ on the state variables that generate $\mathcal{G}$. What are the sources of error in this approximation?"

Expected: (1) Approximation/basis error: the regression function space (e.g., polynomials of degree $\leq k$) may not contain the true conditional expectation; (2) Statistical/variance error: finite sample size means the regression coefficients are estimated with noise, creating $O(1/\sqrt{N})$ error; (3) State variable misspecification: if the chosen state variables do not generate $\mathcal{G}$ exactly (e.g., omitting a dimension of the Markov state), the regression targets the wrong conditional expectation entirely; (4) Nested simulation error: if $X$ itself is estimated by an inner simulation, its noise feeds into the outer regression.

Q3. "Can conditional expectation fail to exist? Under what conditions?"

Expected: CE always exists for $X \in L^1$ on any probability space. The Radon-Nikodym theorem needs a $\sigma$-finite reference measure, and a probability measure is finite ($\mathbb{P}(\Omega) = 1$), so this hypothesis holds automatically. The subtlety is regular conditional distributions: these can fail to exist on general measurable spaces, but always exist on Polish spaces. The genuine failure mode for CE itself is integrability: if $X \notin L^1$ (and $X$ is not nonnegative), $\mathbb{E}[X|\mathcal{G}]$ need not be defined. In all standard finance applications with integrable payoffs, existence is guaranteed.