Rigorous foundation for probability. Sigma-algebras, Lebesgue integration.
Deep-dive lesson - accessible entry point but dense material. Use worked examples and spaced repetition.
Probability looks like “assign a number to an event,” but to do that rigorously you must answer: which subsets count as events, and how do we assign probabilities consistently when there are infinitely many ways to combine events? Measure theory is the framework that makes those questions precise—and it also replaces “area under a curve” with a more flexible notion of integration that works for discontinuous functions, limits of random variables, and continuous distributions.
Measure theory builds probability from three pieces: (1) a σ-algebra of measurable sets (the events you’re allowed to talk about), (2) a measure μ that assigns sizes to those sets and is countably additive, and (3) the Lebesgue integral ∫ f dμ, defined by approximating a function using simple functions and taking limits. This foundation explains why expectations behave well under limits (monotone/dominated convergence), and why densities are “Radon–Nikodym derivatives.”
In calculus you learned integrals as areas under curves, often via Riemann sums. That’s powerful, but it has friction points: highly discontinuous functions (like the indicator of the rationals) are not Riemann integrable; exchanging limits and integrals requires strong hypotheses such as uniform convergence; and the construction is tied to intervals of ℝ, so it does not transfer to abstract sample spaces.
Measure theory addresses these by separating two roles that are blended in Riemann integration:
1) Which subsets are “measurable”? (Which events can we assign sizes/probabilities to?)
2) How do we assign a size μ to those subsets?
Then integration becomes: sum the values of a function weighted by μ, generalized via limits.
A measure-theoretic universe is a triple (Ω, 𝔽, μ): a set Ω of outcomes, a σ-algebra 𝔽 of subsets of Ω, and a measure μ on 𝔽.
In probability, μ is usually written P and has P(Ω) = 1. In analysis, μ could be Lebesgue measure on ℝ (length), or counting measure, or many others.
Measure theory is not just a bag of definitions. It gives a consistent logic for combining countably many events (unions, intersections, complements), for taking limits of increasing or decreasing sequences of sets, and for exchanging limits with integrals.
These “countable” properties are exactly what probability needs, because sequences of random variables and sequences of events are the bread and butter of convergence and laws of large numbers.
We’ll build up in layers: σ-algebras (which sets are events), measures (how big those sets are), measurable functions (random variables), the Lebesgue integral (expectations), convergence theorems, and Radon–Nikodym derivatives (densities).
We’ll keep connecting back to probability intuition: “events,” “probabilities,” “expectations.”
Suppose Ω is uncountably infinite (e.g., Ω = ℝ). You might hope to assign a probability to every subset of Ω.
But there is a deep obstruction: there exist subsets of ℝ (e.g., Vitali sets) for which no translation-invariant “length” can be consistently defined while preserving natural properties like countable additivity. So we compromise: we choose a rich, well-behaved family of subsets—enough to include all sets we care about in applications—on which a measure can live.
That family is a σ-algebra.
A collection 𝔽 ⊂ 𝒫(Ω) (a set of subsets of Ω) is a σ-algebra if:
1) Ω ∈ 𝔽
2) If A ∈ 𝔽 then Aᶜ ∈ 𝔽
3) If A₁, A₂, … ∈ 𝔽 then ⋃ₙ Aₙ ∈ 𝔽 (closure under countable unions)
From these, you automatically also get closure under countable intersections (via De Morgan: ⋂ₙ Aₙ = (⋃ₙ Aₙᶜ)ᶜ), finite unions and intersections, set differences A \ B = A ∩ Bᶜ, and ∅ = Ωᶜ.
Finite unions are not enough. Many limit operations produce genuinely countable unions/intersections: for example, lim supₙ Aₙ = ⋂ₙ ⋃_{k≥n} A_k (“Aₙ occurs infinitely often”) and lim infₙ Aₙ = ⋃ₙ ⋂_{k≥n} A_k (“Aₙ occurs eventually”).
If your “events” weren’t closed under these, you couldn’t even state many convergence results.
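To make the axioms concrete, here is a minimal sketch that checks them on a finite ground set (the function name and example sets are illustrative, not part of the lesson). On a finite Ω, countable unions reduce to finite unions, so checking pairwise unions plus complements suffices:

```python
def is_sigma_algebra(omega, F):
    """Check the sigma-algebra axioms for a family F of subsets of a finite omega."""
    omega = frozenset(omega)
    F = {frozenset(s) for s in F}
    if omega not in F:                 # axiom 1: Omega is in F
        return False
    for A in F:
        if omega - A not in F:         # axiom 2: closed under complements
            return False
    for A in F:
        for B in F:
            if A | B not in F:         # axiom 3 (finite case): closed under unions
                return False
    return True

omega = {1, 2, 3, 4}
F_good = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), frozenset(omega)}
F_bad = {frozenset(), frozenset({1, 2}), frozenset(omega)}  # missing the complement {3,4}

print(is_sigma_algebra(omega, F_good))  # True
print(is_sigma_algebra(omega, F_bad))   # False
```

Note that `F_bad` fails only because it lacks the complement of {1, 2}; adding {3, 4} back repairs all three axioms at once.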
The trivial σ-algebra 𝔽 = {∅, Ω}. Too small for most purposes.
The power set 𝔽 = 𝒫(Ω). Always a σ-algebra, but for uncountable Ω, many interesting measures cannot be defined on all subsets while keeping desired properties.
The Borel σ-algebra 𝔹(ℝ) is generated by open sets (equivalently intervals). Intuitively, it contains sets you can build from intervals using countable unions/intersections/complements.
Notation: 𝔹(ℝ) = σ(open sets).
This is the standard measurable structure for real-valued random variables.
Often you don’t want to specify 𝔽 directly. Instead, you start with “basic” sets 𝒢 (like intervals) and take the smallest σ-algebra containing them.
Formally: σ(𝒢) = ⋂ { 𝔸 : 𝔸 is a σ-algebra on Ω with 𝒢 ⊂ 𝔸 }.
This intersection is again a σ-algebra (intersection of σ-algebras preserves closure properties).
Why this matters in probability: the σ-algebra generated by a random variable X, σ(X) = {X⁻¹(B) : B ∈ 𝔹(ℝ)}, is exactly the collection of events decidable by observing X.
| Object | What it is | Closed under | Why it’s used |
|---|---|---|---|
| Algebra (field) of sets | collection of subsets | complements + finite unions | too weak for limit operations |
| σ-algebra | collection of subsets | complements + countable unions | supports measure + convergence |
| Topology | collection of “open” sets | arbitrary unions + finite intersections | continuity, not size |
| Borel σ-algebra 𝔹(ℝ) | σ-algebra generated by open sets | countable unions/intersections/complements | standard measurable sets on ℝ |
A σ-algebra is not a measure. It does not assign sizes. It only defines which questions (events) are legal.
Think of 𝔽 as the “language” of measurable events; μ will be the “semantics” that assigns numbers.
Once you commit to a σ-algebra 𝔽, you want μ(A) to behave like “size.” The critical property is that disjoint pieces add up—even if you have countably many of them.
A measure is a function μ: 𝔽 → [0, ∞] such that:
1) μ(∅) = 0
2) (Countable additivity) If A₁, A₂, … are disjoint sets in 𝔽, then
μ(⋃ₙ Aₙ) = ∑ₙ μ(Aₙ)
A triple (Ω, 𝔽, μ) is a measure space.
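As a minimal numeric sketch of the definition (the fair-die measure is an illustrative choice; exact fractions avoid floating-point noise so additivity can be checked with equality):

```python
from fractions import Fraction

def mu(A):
    """Uniform probability measure on Omega = {1,...,6}, kept exact."""
    return Fraction(len(A), 6)

A1, A2, A3 = {1, 2}, {3}, {4, 5, 6}   # pairwise disjoint events

assert mu(set()) == 0                                # mu(empty) = 0
assert mu(A1 | A2 | A3) == mu(A1) + mu(A2) + mu(A3)  # additivity over disjoint pieces
print(mu(A1 | A2 | A3))  # 1, since the union is all of Omega
```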
Let A, B ∈ 𝔽.
1) Monotonicity: If A ⊂ B then μ(A) ≤ μ(B)
Proof idea: write B = A ∪ (B \ A) as a disjoint union, so μ(B) = μ(A) + μ(B \ A) ≥ μ(A).
2) Finite additivity: μ(A ∪ B) = μ(A) + μ(B) for disjoint A, B (a special case of countable additivity, obtained by padding the sequence with ∅).
3) Subadditivity: μ(⋃ₙ Aₙ) ≤ ∑ₙ μ(Aₙ)
This holds even without disjointness; you can prove it by converting to disjoint pieces.
This is one of the big reasons countable additivity matters.
If A₁ ⊂ A₂ ⊂ … (increasing sequence) and A = ⋃ₙ Aₙ, then
μ(A) = limₙ→∞ μ(Aₙ)
Sketch: disjointify by setting B₁ = A₁ and Bₙ = Aₙ \ Aₙ₋₁, then apply countable additivity; the partial sums of μ(Bₙ) are exactly μ(Aₙ). (This argument is carried out in full in a worked example below.)
If A₁ ⊃ A₂ ⊃ … (decreasing) and μ(A₁) < ∞ and A = ⋂ₙ Aₙ, then
μ(A) = limₙ→∞ μ(Aₙ)
The finiteness condition matters; without it the statement can fail (∞ − ∞ ambiguities).
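A numeric sketch of both continuity statements, using the length of an interval as a stand-in for Lebesgue measure (the interval families are illustrative choices; exact fractions keep the arithmetic clean):

```python
from fractions import Fraction

def length(a, b):
    """Lebesgue measure of an interval with endpoints a <= b: its length."""
    return b - a

# Continuity from below: A_n = [0, 1 - 1/n] increases to [0, 1);
# mu(A_n) = 1 - 1/n climbs toward mu([0, 1)) = 1.
below = [length(0, 1 - Fraction(1, n)) for n in range(1, 1001)]
print(below[-1])   # 999/1000, approaching 1

# Continuity from above: A_n = [0, 1/n] decreases to {0}, and
# mu(A_1) = 1 < infinity, so the theorem applies: mu(A_n) = 1/n -> 0.
above = [length(0, Fraction(1, n)) for n in range(1, 1001)]
print(above[-1])   # 1/1000, approaching mu({0}) = 0
```

The finiteness hypothesis in the second case is exactly what fails for sets like Aₙ = [n, ∞), where every μ(Aₙ) is infinite but the intersection is empty.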
Counting measure: let Ω be any set and 𝔽 = 𝒫(Ω). Define
μ(A) = number of elements in A (possibly ∞).
With counting measure, ∫ f dμ becomes the sum ∑_{ω∈Ω} f(ω).
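A tiny sketch of that collapse, on an illustrative five-point Ω:

```python
# Counting measure on a finite Omega: mu(A) = |A|. The Lebesgue integral
# of f with respect to counting measure collapses to a plain sum.

omega = [1, 2, 3, 4, 5]

def counting_mu(A):
    return len(A)

def integral(f):
    """Integral of f w.r.t. counting measure on omega: just sum the values."""
    return sum(f(w) for w in omega)

print(counting_mu({2, 4}))        # 2
print(integral(lambda w: w * w))  # 55 = 1 + 4 + 9 + 16 + 25
```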
Lebesgue measure on ℝ is the rigorous notion of “length.” For intervals,
μ((a, b)) = b − a
and it extends to a huge σ-algebra (Lebesgue measurable sets) containing all Borel sets.
Lebesgue measure is the backbone of continuous probability and analysis.
A probability space is a measure space (Ω, 𝔽, P) with P(Ω) = 1.
So probability theory is measure theory plus the normalization P(Ω)=1.
A set N ∈ 𝔽 is a null set if μ(N) = 0.
A measure space is complete if every subset of a null set is measurable (i.e., is in 𝔽). This is desirable because “events of probability zero” should not create measurability paradoxes when you take subsets.
Lebesgue measure is complete on the Lebesgue σ-algebra, while its restriction to the Borel σ-algebra 𝔹(ℝ) is not complete (there are subsets of Borel null sets that are not Borel).
It’s tempting to think every set has a “size.” Measure theory explicitly rejects that on ℝ.
Instead you pick a well-behaved σ-algebra (Borel or Lebesgue measurable sets) together with a measure defined consistently on it.
This is the trade: not every subset is measurable, but everything measurable behaves beautifully.
Riemann integration partitions the x-axis, then samples f(x). Lebesgue integration flips the perspective: partition the range of f into value levels, and measure the preimage {ω : f(ω) ≈ y} of each level.
This is a major advantage when f is wildly discontinuous (the indicator of the rationals is Lebesgue integrable but not Riemann integrable), when you need to pass limits through the integral, or when the domain is an abstract probability space rather than ℝ.
Given measurable spaces (Ω, 𝔽) and (S, 𝒮), a function f: Ω → S is measurable if
∀B ∈ 𝒮, f⁻¹(B) ∈ 𝔽.
In the common case S = ℝ with 𝒮 = 𝔹(ℝ), it suffices to check f⁻¹((−∞, a]) ∈ 𝔽 for every a ∈ ℝ, because such half-lines generate 𝔹(ℝ).
In probability: a random variable is precisely a measurable function X: (Ω, 𝔽) → (ℝ, 𝔹(ℝ)), so that {X ≤ a} is a legitimate event.
A simple function φ is a measurable function that takes only finitely many values:
φ = ∑_{k=1}^m a_k 1_{A_k}
where a_k ≥ 0 (we treat the nonnegative case first) and A_k ∈ 𝔽 are measurable.
Intuition: a simple function is a staircase whose steps sit on measurable sets rather than intervals.
We proceed in layers.
For A ∈ 𝔽,
∫ 1_A dμ = μ(A)
This aligns perfectly with probability: E[1_A] = P(A).
If φ = ∑_{k=1}^m a_k 1_{A_k}, define
∫ φ dμ = ∑_{k=1}^m a_k μ(A_k)
You can check this is well-defined (independent of representation) by refining partitions.
For f: Ω → [0, ∞] measurable,
∫ f dμ = sup{ ∫ φ dμ : 0 ≤ φ ≤ f, φ simple }
This definition is motivated by approximation: every nonnegative measurable f is the pointwise increasing limit of simple functions. Concretely, you can build φ_n by quantizing the values of f into dyadic bins: φ_n = min(⌊2ⁿ f⌋ / 2ⁿ, n), so φ_n ↑ f.
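A numeric sketch of the dyadic construction φ_n = min(⌊2ⁿf⌋/2ⁿ, n), with f(x) = x² on [0,1] as an illustrative target (its integral is 1/3); the Riemann sum is only a numerical stand-in for the exact integral:

```python
import math

def phi_n(f, x, n):
    """Dyadic quantization of a nonnegative f: floor(2^n f)/2^n, capped at n."""
    return min(math.floor((2 ** n) * f(x)) / (2 ** n), n)

def integral_01(g, steps=20_000):
    """Left-endpoint Riemann sum as a stand-in for the integral over [0, 1]."""
    h = 1.0 / steps
    return sum(g(i * h) for i in range(steps)) * h

f = lambda x: x * x   # target function; its integral over [0, 1] is 1/3

# Integrals of the simple approximations phi_n increase toward 1/3.
approx = [integral_01(lambda x: phi_n(f, x, n)) for n in (1, 2, 4, 8)]
print(approx)
```

Each φ_n under-approximates f by at most 2⁻ⁿ, so the integrals climb monotonically toward the true value, exactly as the supremum definition demands.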
For measurable f that can be positive or negative, define positive/negative parts:
f⁺ = max(f, 0),
f⁻ = max(−f, 0),
so f = f⁺ − f⁻ and |f| = f⁺ + f⁻.
Then define:
∫ f dμ = ∫ f⁺ dμ − ∫ f⁻ dμ
provided at least one of ∫ f⁺ dμ, ∫ f⁻ dμ is finite (and for integrable functions we require ∫ |f| dμ < ∞).
A key payoff of this definition: it bakes in limit behavior.
If f_n ↑ f (pointwise increasing), then the integrals converge:
∫ f_n dμ ↑ ∫ f dμ
This is not an extra theorem; it is tightly tied to the “sup of simple functions” construction.
These are the tools you will constantly use in probability and ML theory.
If 0 ≤ f₁ ≤ f₂ ≤ … and f_n → f pointwise, then
limₙ ∫ f_n dμ = ∫ f dμ.
For nonnegative measurable f_n,
∫ (lim infₙ f_n) dμ ≤ lim infₙ ∫ f_n dμ.
If f_n → f pointwise, and there exists an integrable g with |f_n| ≤ g for all n, then
limₙ ∫ f_n dμ = ∫ f dμ.
Why DCT is a big deal in probability: it lets you exchange limits with expectations, limₙ E[Xₙ] = E[X], whenever the Xₙ are dominated by a single integrable random variable; no uniform convergence is needed.
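A numeric sketch of dominated convergence with the textbook-style example f_n(x) = xⁿ on [0, 1] (an illustrative choice): the f_n converge pointwise to 0 for x < 1 and are dominated by the integrable function g ≡ 1, so DCT predicts the integrals tend to 0. The exact values are 1/(n+1):

```python
def integral_01(g, steps=20_000):
    """Midpoint-rule stand-in for the integral of g over [0, 1]."""
    h = 1.0 / steps
    return sum(g((i + 0.5) * h) for i in range(steps)) * h

# f_n(x) = x**n: pointwise limit 0 on [0, 1), dominated by g(x) = 1.
vals = [integral_01(lambda x, n=n: x ** n) for n in (1, 2, 10, 100)]
print([round(v, 4) for v in vals])  # ~[0.5, 0.3333, 0.0909, 0.0099] -> 0
```

Note the convergence of f_n to 0 is not uniform (f_n(x) stays near 1 for x near 1), yet the integrals still converge; the single dominating g is all DCT asks for.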
Riemann integration asks: “How do we slice the domain into intervals?”
Lebesgue integration asks: “How large is the set of points where the function takes certain values?”
That shift is exactly what you want for probability: the probability of an event is the measure of a set of outcomes, and expectation averages the values of a random variable weighted by how much probability mass attains them.
A probability space is (Ω, 𝔽, P) with P(Ω)=1. Then expectation is just a Lebesgue integral:
E[X] = ∫_Ω X(ω) dP(ω)
and for events A ∈ 𝔽,
P(A) = ∫ 1_A dP.
This unifies discrete and continuous cases.
If Ω is countable and P({ω}) = p(ω), then for X: Ω → ℝ,
E[X] = ∑_{ω∈Ω} X(ω) p(ω)
This is exactly ∫ X dP where P is a measure on 𝒫(Ω).
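A minimal sketch of the countable case (the fair-die space and the particular random variables are illustrative; exact fractions keep E[X] clean):

```python
from fractions import Fraction

# Discrete expectation as a Lebesgue integral w.r.t. a measure on P(Omega):
# E[X] = sum over omega of X(omega) * p(omega), here for a fair die.

omega = [1, 2, 3, 4, 5, 6]
p = {w: Fraction(1, 6) for w in omega}

def E(X):
    """Expectation = integral of X with respect to P."""
    return sum(X(w) * p[w] for w in omega)

print(E(lambda w: w))      # 7/2, the mean of a fair die
print(E(lambda w: w * w))  # 91/6, i.e. E[X^2]
```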
If Ω = ℝ and P has a density f with respect to Lebesgue measure μ (length), then
P(A) = ∫ 1_A(x) f(x) dμ(x)
and
E[X] = ∫ x f(x) dμ(x)
But the key phrase is “with respect to.” That is measure-theoretic.
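A numeric check of E[X] = ∫ x f(x) dμ(x) for the Exponential(1) density f(x) = e⁻ˣ, whose mean is 1. (Truncating the domain at 50 and using a midpoint Riemann sum are purely numerical choices standing in for the exact Lebesgue integral.)

```python
import math

def expectation(density, lo, hi, steps=200_000):
    """Midpoint Riemann sum of x * density(x) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += x * density(x)   # integrand: value times density weight
    return total * h

mean = expectation(lambda x: math.exp(-x), 0.0, 50.0)
print(round(mean, 4))  # ~1.0, the mean of Exponential(1)
```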
Given two measures ν and μ on the same measurable space, we say
ν ≪ μ (ν is absolutely continuous w.r.t. μ)
if μ(A)=0 ⇒ ν(A)=0.
In probability: a distribution P on ℝ has a density exactly when P ≪ (Lebesgue measure).
If ν and μ are σ-finite measures and ν ≪ μ, then there exists a measurable function f such that for all A ∈ 𝔽,
ν(A) = ∫_A f dμ.
We write f = dν/dμ, the Radon–Nikodym derivative.
In probability: when μ is Lebesgue measure and ν = P, the derivative dP/dμ is the familiar probability density function.
This is the rigorous replacement for “P has a pdf.”
If you can write
dν = f dμ,
then integrals transform as:
∫ h dν = ∫ h f dμ.
This is the backbone of importance sampling and likelihood ratios.
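A Monte Carlo sketch of that change-of-measure identity (the two Gaussians, the sample size, and the seed are illustrative choices): to estimate E_ν[h] for ν = N(1, 1), sample from μ = N(0, 1) and reweight each draw by the likelihood ratio f = dν/dμ, which for these two Gaussian densities is f(x) = exp(x − 1/2).

```python
import math
import random

random.seed(0)

def likelihood_ratio(x):
    """dnu/dmu for nu = N(1,1) against mu = N(0,1): ratio of the densities."""
    return math.exp(x - 0.5)

def importance_estimate(h, n=200_000):
    """Estimate integral of h dnu as an average of h * (dnu/dmu) under mu."""
    total = 0.0
    for _ in range(n):
        x = random.gauss(0.0, 1.0)            # draw from mu = N(0, 1)
        total += h(x) * likelihood_ratio(x)   # reweight by dnu/dmu
    return total / n

est = importance_estimate(lambda x: x)  # E_nu[X] should be about 1
print(round(est, 2))
```

The estimate hovers near 1, the mean of N(1, 1), even though every sample was drawn from N(0, 1); the Radon–Nikodym derivative carries all the correction.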
Measure theory also defines conditional expectation as an L²/L¹ projection onto a sub-σ-algebra:
E[X | 𝔾]
where 𝔾 ⊂ 𝔽 is a σ-algebra representing partial information.
While a full treatment is another node, the measure-theoretic framing explains why conditioning is about σ-algebras (information) rather than only about random variables.
With σ-algebras, measures, and Lebesgue integration, you can state and prove laws of large numbers, the Borel–Cantelli lemmas, convergence theorems for expectations, and a rigorous theory of conditional expectation.
And in ML theory: importance sampling and likelihood-ratio methods (via Radon–Nikodym derivatives), change-of-measure arguments, and careful statements about expected losses and their limits.
Let Ω = {1,2,3,4}. Suppose you only observe whether the outcome is in A = {1,2} or in Aᶜ = {3,4}. Build the σ-algebra 𝔽 = σ({A}).
Start with the generating family 𝒢 = {A}. We need the smallest σ-algebra containing A and Ω.
So we must include Ω and complements.
Include Ω and ∅:
Ω ∈ 𝔽 by definition of σ-algebra.
Then ∅ = Ωᶜ ∈ 𝔽.
Include A and its complement:
A = {1,2} ∈ 𝔽.
Aᶜ = {3,4} ∈ 𝔽.
Close under countable unions/intersections.
But since Ω is finite, countable unions reduce to finite unions.
The only unions you can form from {∅, A, Aᶜ, Ω} are again one of these four sets.
Therefore:
𝔽 = {∅, {1,2}, {3,4}, Ω}.
Insight: A σ-algebra can be seen as the set of all events that are decidable given a limited observation. Here, observing “in A or not” induces exactly four measurable events: the impossible event ∅, the certain event Ω, A, and Aᶜ.
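The construction in this example can be automated on any finite Ω (the function name is an illustrative choice): repeatedly close under complements and pairwise unions until nothing new appears, which terminates because the power set is finite.

```python
def generate_sigma_algebra(omega, generators):
    """Smallest sigma-algebra on a finite omega containing the generators."""
    omega = frozenset(omega)
    F = {frozenset(), omega} | {frozenset(g) for g in generators}
    while True:
        new = set(F)
        new |= {omega - A for A in F}          # close under complements
        new |= {A | B for A in F for B in F}   # close under (finite) unions
        if new == F:                           # stable: closure reached
            return F
        F = new

F = generate_sigma_algebra({1, 2, 3, 4}, [{1, 2}])
print(sorted(sorted(s) for s in F))
# [[], [1, 2], [1, 2, 3, 4], [3, 4]]
```

Running it on Ω = {1,2,3,4} with generator A = {1,2} reproduces exactly the four-element σ-algebra derived above.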
Let (Ω, 𝔽, μ) be a measure space. Let A₁ ⊂ A₂ ⊂ … and define A = ⋃ₙ Aₙ. Prove μ(A) = limₙ μ(Aₙ).
Define disjoint increments:
B₁ = A₁
Bₙ = Aₙ \ Aₙ₋₁ for n ≥ 2
Then Bₙ are disjoint, and each Bₙ ∈ 𝔽 because σ-algebras are closed under differences.
Show the union is A:
⋃ₙ Bₙ = A₁ ∪ ⋃_{n≥2} (Aₙ \ Aₙ₋₁) = ⋃ₙ Aₙ = A.
(Each new piece adds exactly what was missing before.)
Apply countable additivity:
μ(A) = μ(⋃ₙ Bₙ) = ∑ₙ μ(Bₙ).
Express μ(Aₙ) using the same increments:
Aₙ = ⋃_{k=1}^n B_k (disjoint union)
So μ(Aₙ) = ∑_{k=1}^n μ(B_k).
Take the limit:
limₙ μ(Aₙ) = limₙ ∑_{k=1}^n μ(B_k) = ∑ₙ μ(Bₙ) = μ(A).
(The partial sums converge to the full series by definition.)
Insight: Continuity from below is really “measure respects growing approximations.” It’s the set-level analog of monotone convergence for integrals.
On Ω = [0,1] with Lebesgue measure μ, define φ(x) = 2·1_[0,1/4](x) + 5·1_(1/4,1](x). Compute ∫ φ dμ.
Identify the measurable pieces:
A₁ = [0, 1/4], A₂ = (1/4, 1]
These are Borel (hence Lebesgue measurable) subsets of [0,1].
Compute μ(A₁) and μ(A₂):
μ(A₁) = 1/4 − 0 = 1/4
μ(A₂) = 1 − 1/4 = 3/4
(Endpoints do not affect Lebesgue measure.)
Use the simple function integral rule:
∫ φ dμ = 2·μ(A₁) + 5·μ(A₂).
Plug in values:
∫ φ dμ = 2·(1/4) + 5·(3/4)
= 2/4 + 15/4
= 17/4.
Insight: For step-like functions, Lebesgue integration is literally “value × size of region.” This is the prototype for expectation: E[X] is the average value weighted by probability mass.
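The same computation, written as the “value × size of region” rule (exact fractions are an implementation choice to keep 17/4 exact):

```python
from fractions import Fraction

# Integral of the simple function phi from the worked example:
# phi = 2 on [0, 1/4] and 5 on (1/4, 1]; mu is length.

pieces = [
    (2, Fraction(1, 4)),   # value 2 on a set of measure 1/4
    (5, Fraction(3, 4)),   # value 5 on a set of measure 3/4
]

integral = sum(a * m for a, m in pieces)  # sum of value * measure
print(integral)  # 17/4
```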
A σ-algebra 𝔽 is the collection of sets you are allowed to measure; it is closed under complements and countable unions (hence countable intersections).
A measure μ assigns sizes to sets in 𝔽 and is countably additive over disjoint unions: μ(⨆ₙ Aₙ) = ∑ₙ μ(Aₙ).
Countable additivity implies continuity from below/above, which is essential for reasoning about limits of events.
A random variable is a measurable function X: (Ω, 𝔽) → (ℝ, 𝔹(ℝ)); measurability ensures inverse images of Borel sets are events.
The Lebesgue integral starts with ∫ 1_A dμ = μ(A), extends to simple functions, then to nonnegative measurable functions via supremum over simple under-approximations.
Monotone convergence, Fatou’s lemma, and dominated convergence are the main tools for exchanging limits and integrals (limits and expectations).
Densities are Radon–Nikodym derivatives: if ν ≪ μ, then dν/dμ exists and ν(A) = ∫_A (dν/dμ) dμ.
Probability theory is measure theory with μ(Ω)=1; expectation is just a Lebesgue integral with respect to P.
Assuming every subset of ℝ is measurable; in practice you work with Borel or Lebesgue measurable sets to avoid paradoxes.
Confusing an algebra with a σ-algebra: closure under finite unions is not enough for limit constructions like lim sup/lim inf of events.
Treating “pdf” as always existing; distributions with atoms or singular parts may not be absolutely continuous w.r.t. Lebesgue measure.
Using dominated convergence without actually providing an integrable dominating function g with |f_n| ≤ g.
Let Ω = {1,2,3,4,5,6} (a die). Let A = {1,3,5} (odd outcomes). Write out the σ-algebra σ({A}) explicitly and compute P(B) for each B in that σ-algebra assuming a fair die.
Hint: A single set A generates {∅, Ω, A, Aᶜ}. Then use P(B) = |B|/6 for a fair die.
σ({A}) = {∅, Ω, A, Aᶜ} with A = {1,3,5}, Aᶜ = {2,4,6}.
P(∅)=0, P(Ω)=1, P(A)=3/6=1/2, P(Aᶜ)=3/6=1/2.
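A short sketch that checks this solution mechanically (variable names are illustrative):

```python
from fractions import Fraction

# sigma({A}) for A = odd outcomes of a fair die, with P(B) = |B| / 6
# computed on each of the four events.

omega = frozenset({1, 2, 3, 4, 5, 6})
A = frozenset({1, 3, 5})
sigma_A = {frozenset(), A, omega - A, omega}

probs = {tuple(sorted(B)): Fraction(len(B), 6) for B in sigma_A}
for event, p in sorted(probs.items()):
    print(event, p)
```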
Let (Ω, 𝔽, μ) be a measure space and let A, B ∈ 𝔽 with A ⊂ B and μ(B) < ∞. Show that μ(B\A) = μ(B) − μ(A).
Hint: Write B as a disjoint union of A and (B\A), then use countable additivity (finite case).
Since A ⊂ B, we can write B = A ∪ (B\A) and the union is disjoint.
By additivity: μ(B) = μ(A) + μ(B\A).
Because μ(B) < ∞, subtracting is well-defined, giving μ(B\A) = μ(B) − μ(A).
Define f_n(x) = 1_[0,n](x) on ℝ with Lebesgue measure μ. Let f(x) = 1_[0,∞)(x). Use monotone convergence to compute limₙ ∫ f_n dμ and compare it to ∫ f dμ.
Hint: The sets [0,n] increase to [0,∞). Compute each integral as a measure of an interval, noting it may be infinite.
We have 0 ≤ f₁ ≤ f₂ ≤ … and f_n(x) ↑ f(x) pointwise. By MCT,
limₙ ∫ f_n dμ = ∫ f dμ.
Compute ∫ f_n dμ = μ([0,n]) = n.
Thus limₙ ∫ f_n dμ = limₙ n = ∞.
Also ∫ f dμ = μ([0,∞)) = ∞.
So both sides match (both infinite), illustrating that MCT allows infinite integrals naturally.