Using random sampling to estimate quantities. Integration, simulation.
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
When an integral is too messy to compute and a system is too complex to solve, you can still estimate the answer—by simulating randomness and averaging. That’s the core promise of Monte Carlo methods: convert hard math into repeated sampling.
Monte Carlo methods estimate a quantity by writing it as an expectation I = E_p[f(X)] and approximating it with a sample average Î_N = (1/N)∑_{i=1}^N f(X_i). The estimator is unbiased under mild conditions, converges by the Law of Large Numbers, and has root-mean-square error ≈ √(Var(f(X))/N). Most practical skill comes from controlling variance (and thus error) rather than “just sampling more.”
Monte Carlo methods are a family of techniques that use random sampling to estimate numerical quantities. They are especially useful when the target integral or sum has no tractable closed form, when the system is easier to simulate than to analyze, or when the dimension is too high for grid-based methods.
Many problems can be phrased as: “What is the average value of some function under some distribution?”
Examples: the value of an integral over a region, the probability of an event, or the expected output of a simulator.
The key idea is to turn the numerical goal into an expectation.
Let X be a random variable with distribution p (density or pmf). Let f be a function. The target quantity is:
I = E_p[f(X)]
If X is continuous with density p(x):
I = ∫ f(x) p(x) dx
If X is discrete with pmf p(x):
I = ∑_x f(x) p(x)
This seems like we replaced “compute an integral/sum” with “compute an expectation,” but expectations suggest a strategy: sample X from p and average f(X).
Draw independent samples X₁, X₂, …, X_N ∼ p, then compute:
Î_N = (1/N) ∑_{i=1}^N f(X_i)
This is the Monte Carlo estimator (also called the sample mean estimator).
You already know the Law of Large Numbers (LLN), so you already have the convergence story:
Î_N → I as N → ∞ (under mild conditions)
Monte Carlo is powerful because the same recipe (sample, evaluate f, average) handles integrals, probabilities, and simulations alike, and it requires only the ability to draw samples from p and evaluate f.
Monte Carlo is not magic; it trades algebraic difficulty for computational effort: the error shrinks only like 1/√N, so high precision demands many samples.
A useful mental model: Monte Carlo is an averaging engine. Anything you can phrase as an expectation and simulate, you can estimate.
Monte Carlo begins with a simple thought: if I = E[f(X)], then repeatedly observing f(X) and averaging should approximate I.
If X₁, …, X_N are i.i.d. from p, then each f(X_i) is an i.i.d. sample from the distribution of f(X). The expectation of f(X) is I.
So the sample mean:
Î_N = (1/N) ∑_{i=1}^N f(X_i)
is a natural plug-in estimator.
Under the condition that E[|f(X)|] exists (finite), we can compute:
E[Î_N]
= E[(1/N) ∑_{i=1}^N f(X_i)]
= (1/N) ∑_{i=1}^N E[f(X_i)]
= (1/N) ∑_{i=1}^N E[f(X)]
= (1/N) · N · I
= I
So Î_N is unbiased.
Unbiasedness does not guarantee small error for finite N, but it does mean we are not systematically “off” in one direction.
By the (strong) Law of Large Numbers, if E[|f(X)|] < ∞:
Î_N → I almost surely as N → ∞
This is the convergence guarantee: with enough samples, the estimate stabilizes.
Many integrals can be rewritten in expectation form. Suppose you want:
J = ∫_a^b g(x) dx
Pick a distribution over [a, b]. The most common is uniform:
X ∼ Uniform(a, b), p(x) = 1/(b−a)
Then:
E[g(X)]
= ∫_a^b g(x) · (1/(b−a)) dx
Multiply both sides by (b−a):
∫_a^b g(x) dx = (b−a) E[g(X)]
So a Monte Carlo estimator for J is:
Ĵ_N = (b−a) (1/N) ∑_{i=1}^N g(X_i), X_i ∼ Uniform(a,b)
This is the basic Monte Carlo integration recipe.
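As a concrete sketch of the recipe (the helper name `mc_integrate` and the sine example are illustrative, not from the source):

```python
import math
import random

def mc_integrate(g, a, b, n, seed=0):
    """Estimate the integral of g over [a, b] as (b - a) times
    the average of g(X_i) with X_i ~ Uniform(a, b)."""
    rng = random.Random(seed)
    total = sum(g(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n

# Example: the integral of sin(x) over [0, pi] is exactly 2
estimate = mc_integrate(math.sin, 0.0, math.pi, 100_000)
```

Note the (b−a) factor: dropping it is a classic mistake.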
A classic Monte Carlo use case is estimating the area of a shape S inside a bounding box B. If X is uniform over B:
P(X ∈ S) = Area(S)/Area(B)
So Area(S) = Area(B) · P(X ∈ S)
Estimate P(X ∈ S) with an average of indicator values:
Î_N = (1/N) ∑_{i=1}^N 1{X_i ∈ S}
This viewpoint is important because it shows that probabilities are expectations of indicator functions, so geometric questions about areas and volumes reduce to counting how often uniform random points land in a set.
Computers produce pseudo-random numbers. For Monte Carlo you typically need a good-quality pseudo-random number generator, a way to transform its output into samples from p, and a fixed seed when you want reproducible runs.
Many Monte Carlo algorithms are really about constructing samples from p efficiently. In this node, we assume we can sample from p directly (later, MCMC removes this assumption).
A minimal Monte Carlo estimator looks like:
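Here is one minimal sketch in Python, assuming only that we can draw from p and evaluate f (the function names and the uniform example are illustrative):

```python
import random

def monte_carlo(sample, f, n, seed=0):
    """Generic Monte Carlo estimator: average f over n i.i.d. draws.

    `sample(rng)` returns one draw X_i ~ p; `f` is the quantity to average.
    """
    rng = random.Random(seed)
    return sum(f(sample(rng)) for _ in range(n)) / n

# Example: E[X^2] for X ~ Uniform(0, 1); the true value is 1/3
est = monte_carlo(lambda rng: rng.uniform(0.0, 1.0), lambda x: x * x, 50_000)
```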
Monte Carlo is simple to write down, but to use it well you must understand how the error behaves.
If Monte Carlo error decreased like 1/N, doubling samples would halve your error. But the typical scaling is slower: 1/√N.
That means: halving the error requires 4× as many samples, and each extra decimal digit of precision costs roughly 100× more samples.
Assume X₁, …, X_N are i.i.d. from p and Var(f(X)) is finite.
Let μ = E[f(X)] = I and σ² = Var(f(X)).
Compute Var(Î_N):
Î_N = (1/N) ∑_{i=1}^N f(X_i)
Because the samples are independent:
Var(Î_N)
= Var((1/N) ∑ f(X_i))
= (1/N²) ∑ Var(f(X_i))
= (1/N²) · N · σ²
= σ² / N
So the standard deviation (standard error) is:
SD(Î_N) = σ / √N
This directly yields the typical Monte Carlo error scale.
RMSE is a clean “average size of error” measure:
RMSE(Î_N) = √(E[(Î_N − I)²])
For an unbiased estimator, RMSE equals the standard deviation:
E[Î_N] = I ⇒ RMSE(Î_N) = √(Var(Î_N)) = σ/√N
So, in the unbiased Monte Carlo setting:
RMSE ≈ √(Var(f(X))/N)
This is the key performance law.
LLN tells you that convergence happens eventually, but the CLT tells you how the error is distributed at large finite N.
Under mild conditions:
√N (Î_N − I) ⇒ Normal(0, σ²)
Equivalently:
Î_N ≈ Normal(I, σ²/N)
So an approximate 95% confidence interval is:
Î_N ± 1.96 · (σ/√N)
In practice σ is unknown, so you estimate it from the samples:
s² = (1/(N−1)) ∑_{i=1}^N (f(X_i) − Î_N)²
Then use s/√N as the estimated standard error.
Notice what the formula depends on: only σ = √Var(f(X)) and N. The dimension of X does not appear directly; it enters only through how hard sampling is and how large σ turns out to be.
You can always increase N, but if Var(f(X)) is huge, you may need an impractical amount of computation.
So Monte Carlo practice is often: change how you sample or how you write the expectation so that the variance drops.
Here are common variance reduction ideas (you don’t need to master all here, but you should recognize them):
| Technique | Core idea | When it helps | Typical tradeoff |
|---|---|---|---|
| Control variates | Use a correlated quantity with known expectation to cancel noise | When you can find a strong correlated control | Requires analytic expectation of control |
| Antithetic variates | Sample negatively correlated pairs | Symmetric problems, monotone f | Needs special sampling design |
| Stratified sampling | Force coverage across regions | When distribution has important subregions | More bookkeeping |
| Importance sampling | Sample more from “important” regions and reweight | Rare events, heavy tails, sharp peaks | Can explode variance if proposal is bad |
The unifying idea: variance reduction improves accuracy at fixed N.
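As one concrete illustration from the table, here is a sketch of antithetic variates for a monotone integrand (the choice of e^u is illustrative, not from the source):

```python
import math
import random

def plain_mc(g, n, rng):
    """Ordinary Monte Carlo estimate of E[g(U)], U ~ Uniform(0, 1)."""
    return sum(g(rng.random()) for _ in range(n)) / n

def antithetic_mc(g, n, rng):
    """Same target, but evaluated in negatively correlated pairs (U, 1 - U)."""
    half = n // 2
    total = 0.0
    for _ in range(half):
        u = rng.random()
        total += g(u) + g(1.0 - u)
    return total / (2 * half)

# g = exp is monotone, so g(U) and g(1 - U) are negatively correlated.
# True value: E[e^U] = e - 1 ~= 1.71828
rng = random.Random(0)
plain = plain_mc(math.exp, 10_000, rng)
paired = antithetic_mc(math.exp, 10_000, rng)
```

Both estimates use 10,000 evaluations of g; for this integrand the paired version has a much smaller variance per evaluation.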
The clean variance result Var(Î_N)=σ²/N relies on independence.
If samples are dependent (as in Markov chains), the estimator still often converges, but the effective sample size is lower:
Var(Î_N) ≈ (σ²_eff)/N
where σ²_eff > σ² when there is positive autocorrelation.
This is the bridge to MCMC: when direct sampling is hard, you accept dependence and then analyze it carefully.
At this point, you should be able to: write a target quantity as E_p[f(X)], form the sample-mean estimator, show it is unbiased, derive Var(Î_N) = σ²/N, and explain why the error scales like 1/√N.
Monte Carlo is both a numerical integration tool and a simulation framework. The same estimator supports many workflows.
A major reason Monte Carlo is famous: high-dimensional integrals are painful for grid methods.
Suppose you want:
I = ∫_{ℝ^d} h(x) dx
If you can choose a distribution p(x) that covers the important region, rewrite:
I = ∫ h(x) dx
= ∫ (h(x)/p(x)) p(x) dx
= E_p[ h(X)/p(X) ]
Then estimate:
Î_N = (1/N) ∑_{i=1}^N h(X_i)/p(X_i), X_i ∼ p
This is the general “change of measure” trick that leads directly to importance sampling.
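A sketch of this change-of-measure estimator (the integrand and the Exponential proposal are illustrative choices, not from the source):

```python
import math
import random

def change_of_measure_mc(h, sample_p, pdf_p, n, rng):
    """Estimate the integral of h as the average of h(X)/p(X) with X ~ p."""
    total = 0.0
    for _ in range(n):
        x = sample_p(rng)
        total += h(x) / pdf_p(x)
    return total / n

# Illustration: integral of e^{-x} over [0, inf) equals 1,
# sampled under an Exponential(rate=0.5) proposal that covers the integrand.
rate = 0.5
rng = random.Random(1)
est = change_of_measure_mc(
    h=lambda x: math.exp(-x),
    sample_p=lambda r: r.expovariate(rate),
    pdf_p=lambda x: rate * math.exp(-rate * x),
    n=50_000,
    rng=rng,
)
```

The proposal must put mass wherever h is nonzero; a proposal with much thinner tails than h can make the variance of h(X)/p(X) explode, which is exactly the importance-sampling tradeoff listed in the variance-reduction table.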
Often you don’t have a closed-form model; you have a simulator. Monte Carlo says: run the simulator repeatedly with random inputs, record the output of interest, and average.
Example template: draw i.i.d. inputs X_i ∼ p, run the simulator to obtain outputs f(X_i), and average the outputs.
The estimator is still Î_N.
Probabilities are expectations of indicators:
P(A) = E[1{A}]
So if you can simulate the experiment producing event A, you can estimate the probability by:
P̂_N(A) = (1/N) ∑ 1{A_i}
This becomes crucial for probabilities with no closed form: failure rates, tail risks, and other event probabilities you can only reach by simulation.
But note: for rare events, 1{A} has very small mean and may have problematic variance relative to the mean, motivating specialized methods (often importance sampling).
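The indicator-average estimator can be sketched generically (the event and simulator below are illustrative):

```python
import random

def estimate_probability(event, simulate, n, rng):
    """P(A) = E[1{A}]: simulate the experiment n times and average the indicator."""
    hits = sum(1 for _ in range(n) if event(simulate(rng)))
    return hits / n

# Illustration: P(X + Y > 1.5) for independent X, Y ~ Uniform(0, 1) is 1/8
rng = random.Random(7)
p_hat = estimate_probability(
    event=lambda xy: xy[0] + xy[1] > 1.5,
    simulate=lambda r: (r.random(), r.random()),
    n=40_000,
    rng=rng,
)
```

For a rare event (p near 0), the relative error SD(p̂_N)/p = √((1−p)/(pN)) blows up, which is the variance problem noted above.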
Since SD(Î_N) ≈ σ/√N, to target a standard error ≤ ε:
σ/√N ≤ ε
Solve for N:
√N ≥ σ/ε
N ≥ (σ/ε)²
This is the core scaling law.
Two important practical notes:
1) σ is unknown; estimate it from a pilot run.
2) If f has heavy tails (variance very large or infinite), the CLT-based planning can fail; you may need robust methods or different parameterizations.
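The pilot-run planning step can be sketched as follows (the helper name and the uniform example are illustrative):

```python
import math
import random

def plan_sample_size(f, sample, pilot_n, target_se, rng):
    """Pilot run: estimate sigma from pilot_n draws, then return N >= (sigma/eps)^2."""
    ys = [f(sample(rng)) for _ in range(pilot_n)]
    mean = sum(ys) / pilot_n
    s2 = sum((y - mean) ** 2 for y in ys) / (pilot_n - 1)
    sigma = math.sqrt(s2)
    return math.ceil((sigma / target_se) ** 2)

# Illustration: samples needed for SE <= 0.001 when f(x) = x^2, X ~ Uniform(0, 1)
# (true sigma^2 = 4/45, so N should come out near (0.298/0.001)^2 ~= 89,000)
rng = random.Random(3)
n_needed = plan_sample_size(lambda x: x * x, lambda r: r.random(), 2_000, 1e-3, rng)
```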
Everything so far assumed you can sample X ∼ p easily.
In Bayesian inference, p might be a posterior distribution:
p(θ | data) ∝ p(data | θ) p(θ)
Often you can evaluate p up to a constant but cannot sample from it directly.
MCMC methods construct a Markov chain whose stationary distribution is p and then approximate:
E_p[f(θ)] ≈ (1/N) ∑ f(θ_i)
So MCMC is best understood as a sampling mechanism attached to the same averaging estimator: the draws are dependent, but their long-run distribution is p, so the average still converges to E_p[f(θ)].
If Monte Carlo is the “averaging engine,” MCMC is one of the main ways to generate the inputs to that engine.
When applying Monte Carlo, ask:
1) What exactly is I? (Write it as E_p[f(X)].)
2) Can I sample from p? If not, do I need a different p or MCMC?
3) What is Var(f(X)) likely to be? Where does it get large?
4) Do I need variance reduction to make this feasible?
5) How will I report uncertainty (standard error / confidence interval)?
That workflow is the practical “shape” of Monte Carlo methods.
Estimate I = ∫_0^1 x² dx using Monte Carlo, and derive the estimator’s variance and RMSE in closed form.
Rewrite as an expectation under Uniform(0,1):
Let X ∼ Uniform(0,1).
Then p(x)=1 on [0,1], so
E[X²] = ∫_0^1 x² · 1 dx = ∫_0^1 x² dx = I.
So the Monte Carlo estimator is:
Î_N = (1/N) ∑_{i=1}^N X_i²,
with X_i i.i.d. ∼ Uniform(0,1).
Compute the true value (for reference):
I = ∫_0^1 x² dx
= [x³/3]_0^1
= 1/3.
Compute variance of f(X)=X².
First compute E[X²] and E[X⁴]:
E[X²] = 1/3.
E[X⁴] = ∫_0^1 x⁴ dx = [x⁵/5]_0^1 = 1/5.
So
Var(X²) = E[X⁴] − (E[X²])²
= 1/5 − (1/3)²
= 1/5 − 1/9
= (9−5)/45
= 4/45.
Compute variance of the estimator:
Var(Î_N) = Var(X²)/N = (4/45)/N.
Compute RMSE (unbiased case):
RMSE(Î_N) = √(Var(Î_N))
= √((4/45)/N)
= (2/√45) · 1/√N
= (2/(3√5)) · 1/√N.
Insight: Even in this friendly 1D integral, Monte Carlo converges like 1/√N. The constant depends on Var(f(X)); here it’s modest, so the estimator is fairly well-behaved.
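A quick numerical check of this derivation, assuming nothing beyond the standard library: repeat the N-sample estimate many times and compare the empirical RMSE to the closed form √((4/45)/N).

```python
import math
import random

def estimate_once(n, rng):
    """One Monte Carlo estimate of the integral of x^2 over [0, 1]."""
    return sum(rng.random() ** 2 for _ in range(n)) / n

rng = random.Random(42)
n, reps = 1_000, 2_000
sq_errors = [(estimate_once(n, rng) - 1.0 / 3.0) ** 2 for _ in range(reps)]
empirical_rmse = math.sqrt(sum(sq_errors) / reps)
predicted_rmse = 2.0 / (3.0 * math.sqrt(5.0) * math.sqrt(n))  # = sqrt((4/45)/n)
```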
Use Monte Carlo to estimate π by sampling points uniformly in the square [−1,1]×[−1,1] and counting how many fall inside the unit circle.
Define the geometry:
Let B be the square [−1,1]×[−1,1].
Area(B)=4.
Let S be the unit disk x² + y² ≤ 1.
Area(S)=π·1²=π.
Convert area to probability:
If (X,Y) is uniform over B, then
P((X,Y) ∈ S) = Area(S)/Area(B) = π/4.
Express probability as an expectation:
Define f(X,Y) = 1{X²+Y² ≤ 1}.
Then
E[f(X,Y)] = P((X,Y) ∈ S) = π/4.
Monte Carlo estimator:
Draw i.i.d. samples (X_i, Y_i) uniform over B.
Compute
p̂_N = (1/N) ∑_{i=1}^N 1{X_i²+Y_i² ≤ 1}.
Then estimate
π̂_N = 4 p̂_N.
Quantify variance:
Each indicator is Bernoulli with success probability p = π/4.
So
Var(f) = p(1−p).
Thus
Var(p̂_N) = p(1−p)/N.
And
Var(π̂_N) = 16 Var(p̂_N)
= 16 p(1−p)/N.
Interpret scaling:
Since p≈0.785, p(1−p)≈0.168.
So SD(π̂_N) ≈ √(16·0.168/N) = √(2.688/N) ≈ 1.64/√N.
To get ≈0.01 typical error you’d need on the order of (1.64/0.01)² ≈ 26,896 samples.
Insight: This iconic example shows both the universality and the slowness of Monte Carlo: the estimator is easy, but high precision requires many samples because error shrinks only like 1/√N.
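A sketch of the estimator (only the standard library is assumed):

```python
import random

def estimate_pi(n, seed=0):
    """Estimate pi as 4 * (fraction of uniform points in [-1,1]^2 inside the unit disk)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n

pi_hat = estimate_pi(100_000)  # typical error ~ 1.64/sqrt(N) ~ 0.005 here
```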
You simulate a random system and observe outputs f(X_i). Show how to compute a Monte Carlo estimate and an approximate 95% confidence interval using the sample variance.
Run the simulation:
Generate i.i.d. inputs X₁,…,X_N ∼ p.
Compute outputs Y_i = f(X_i).
Compute the Monte Carlo estimate:
Î_N = (1/N) ∑_{i=1}^N Y_i.
Estimate variance from data:
Compute sample variance
s² = (1/(N−1)) ∑_{i=1}^N (Y_i − Î_N)².
Estimate standard error:
SE ≈ s/√N.
Form an approximate 95% confidence interval:
Using CLT,
Î_N ≈ Normal(I, σ²/N).
Replace σ with s:
CI₉₅% ≈ Î_N ± 1.96 · (s/√N).
(For small N, a t-interval with N−1 degrees of freedom is often used.)
Decide if you need more samples:
If the half-width 1.96·s/√N is too large, increase N.
Since half-width scales like 1/√N, reducing it by a factor k requires ≈k² more samples.
Insight: Monte Carlo is rarely just a point estimate. In practice you almost always want uncertainty reporting, and the sample variance gives you an operational way to do that.
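The steps above can be sketched as a single helper (the function name is illustrative; the example target E[X²] = 1/3 is used as a sanity check):

```python
import math
import random

def mc_with_ci(f, sample, n, rng, z=1.96):
    """Return (estimate, (lo, hi)): sample mean of f with an approximate 95% CI."""
    ys = [f(sample(rng)) for _ in range(n)]
    mean = sum(ys) / n
    s2 = sum((y - mean) ** 2 for y in ys) / (n - 1)  # sample variance
    half = z * math.sqrt(s2 / n)                      # z * estimated standard error
    return mean, (mean - half, mean + half)

# Illustration: E[X^2] = 1/3 for X ~ Uniform(0, 1)
rng = random.Random(11)
est, (lo, hi) = mc_with_ci(lambda x: x * x, lambda r: r.random(), 20_000, rng)
```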
Most Monte Carlo problems start by expressing the target as an expectation: I = E_p[f(X)].
The basic Monte Carlo estimator is the sample average: Î_N = (1/N)∑ f(X_i) with X_i ∼ p.
Under mild conditions, Î_N is unbiased: E[Î_N]=I, and consistent by LLN.
If Var(f(X))=σ² is finite and samples are i.i.d., then Var(Î_N)=σ²/N and SD(Î_N)=σ/√N.
RMSE typically scales like 1/√N, so high precision can require many samples.
The constant in the error (σ) matters: variance reduction can be as important as increasing N.
The CLT provides approximate error bars: Î_N ≈ Normal(I, σ²/N), enabling confidence intervals from the sample variance.
When direct sampling from p is hard, Monte Carlo remains the estimator but you need different sampling machinery—often MCMC.
Forgetting to rewrite the target correctly as E_p[f(X)] (missing a scaling factor like (b−a) in Uniform integration).
Assuming error decreases like 1/N instead of 1/√N, leading to unrealistic sample-size expectations.
Ignoring variance: using a Monte Carlo estimator with extremely high Var(f(X)) and being surprised by unstable results.
Reporting Î_N without any uncertainty estimate (no standard error / confidence interval), making results hard to interpret.
(Integration as expectation) Use Monte Carlo to estimate J = ∫_0^2 e^{−x} dx by sampling X ∼ Uniform(0,2). Write J as (b−a)E[g(X)] and give the estimator Ĵ_N.
Hint: For X ∼ Uniform(0,2), p(x)=1/2 on [0,2]. Relate ∫ g(x) dx to E[g(X)].
Let g(x)=e^{−x} and X ∼ Uniform(0,2).
Then
E[g(X)] = ∫_0^2 e^{−x} · (1/2) dx.
So
∫_0^2 e^{−x} dx = 2 E[e^{−X}].
Monte Carlo estimator:
Ĵ_N = 2 · (1/N)∑_{i=1}^N e^{−X_i}, with X_i i.i.d. ∼ Uniform(0,2).
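A sketch of this estimator in code (the true value J = 1 − e^{−2} ≈ 0.8647 serves as a sanity check):

```python
import math
import random

# J_hat = 2 * mean of e^{-X_i} with X_i ~ Uniform(0, 2); true J = 1 - e^{-2}
rng = random.Random(5)
n = 50_000
j_hat = 2.0 * sum(math.exp(-rng.uniform(0.0, 2.0)) for _ in range(n)) / n
```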
(Error scaling) Suppose Var(f(X)) = 9. How many i.i.d. samples N are needed so that SD(Î_N) ≤ 0.1?
Hint: Use SD(Î_N)=σ/√N with σ=√Var(f(X)).
σ=√9=3.
We need 3/√N ≤ 0.1 ⇒ √N ≥ 30 ⇒ N ≥ 900.
(Indicator variance) You estimate a probability p = P(A) using p̂_N = (1/N)∑1{A_i}. Derive Var(p̂_N) and find the worst-case (largest) variance over p ∈ [0,1].
Hint: 1{A} is Bernoulli(p). The variance of Bernoulli(p) is p(1−p). Maximize this quadratic.
Each indicator Z_i=1{A_i} is Bernoulli(p), so Var(Z_i)=p(1−p).
Since p̂_N is the mean of i.i.d. Z_i,
Var(p̂_N)=p(1−p)/N.
The function p(1−p)=p−p² is maximized at p=1/2, giving maximum variance (1/4)/N.
Next, extend Monte Carlo to cases where direct sampling is difficult: MCMC.
Related foundations you may want to review or connect: the Law of Large Numbers, the Central Limit Theorem, and sampling from standard probability distributions.