Markov Chain Monte Carlo. Sampling from complex distributions.
In Bayesian inference you often know your posterior only up to a constant: π(x) ∝ prior(x)·likelihood(x). That “missing constant” makes direct sampling and normalization hard—exactly where Markov Chain Monte Carlo (MCMC) shines: it builds a Markov chain that spends time in regions proportional to π(x), so you can estimate expectations with samples that come from a single long run.
MCMC turns sampling into a dynamical process: design a Markov transition kernel K(x → x′) whose stationary distribution is the target π(x). If the chain is ergodic, averages along the chain converge to expectations under π. Metropolis–Hastings and Gibbs sampling are the two core “proposal-and-correction” mechanisms that guarantee π is invariant while remaining practical for complex, high-dimensional targets.
In many real problems you can evaluate a target density pointwise, but you can't easily draw independent samples from it or compute its normalizing constant.
A canonical case is Bayesian inference. You want the posterior
π(x) = p(x | data) ∝ p(data | x) p(x)
where the proportionality constant is the evidence Z = p(data).
If your goal is an expectation
Eπ[f(X)] = ∫ f(x) π(x) dx,
then plain Monte Carlo would ask you to draw X₁, …, Xₙ i.i.d. from π. But drawing i.i.d. is often the hard part.
MCMC replaces i.i.d. samples with a dependent sequence
X₀, X₁, X₂, …
generated by a Markov chain with transition kernel K(x → x′). You design K so that:
Then you estimate expectations by time averages:
f̄ₙ = (1/n) ∑ₜ₌₁ⁿ f(Xₜ) ≈ Eπ[f(X)].
The tradeoff you accept is dependence: samples are correlated, which reduces effective sample size. The payoff is you can sample from extremely complex π—high-dimensional, constrained, or only known up to a constant.
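As a concrete illustration of estimating an expectation by a time average along one run, here is a minimal two-state chain in Python; the kernel K and target π = (0.25, 0.75) are invented for the demo:

```python
import numpy as np

# Minimal sketch: estimate E_pi[f(X)] by a time average along one chain.
# Two-state kernel K whose rows sum to 1 and which satisfies pi K = pi
# for pi = (0.25, 0.75). With f(x) = x, E_pi[f(X)] = 0.75.
rng = np.random.default_rng(0)
K = np.array([[0.7, 0.3],
              [0.1, 0.9]])

n = 100_000
x = 0
samples = np.empty(n, dtype=int)
for t in range(n):
    x = rng.choice(2, p=K[x])   # one Markov transition from the current state
    samples[t] = x

# Correlated draws, yet the time average still converges to E_pi[X] = 0.75.
estimate = samples.mean()
print(round(estimate, 3))
```

The draws are dependent (consecutive states are correlated), but the ergodic average still converges; only the rate of convergence suffers.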
A distribution π is invariant (stationary) for K if
π(x′) = ∫ π(x) K(x → x′) dx (continuous)
or π = πK (discrete).
In practice, you rarely verify invariance by solving the stationarity equation directly. Instead, you design K to satisfy a stronger and easier-to-check condition: detailed balance.
A kernel K satisfies detailed balance with π if for all x, x′:
π(x) K(x → x′) = π(x′) K(x′ → x).
If detailed balance holds and K is a valid Markov kernel, then π is invariant:
∫ π(x) K(x → x′) dx
= ∫ π(x′) K(x′ → x) dx
= π(x′) ∫ K(x′ → x) dx
= π(x′).
That last step uses that K(x′ → ·) integrates to 1.
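These two facts, detailed balance and the invariance it implies, can be checked numerically on a small discrete example (the 5-state target weights and ±1 proposal are invented for the demo):

```python
import numpy as np

# Sketch: build a discrete Metropolis kernel and verify detailed balance.
w = np.array([1.0, 2.0, 4.0, 2.0, 1.0])   # unnormalized target on {0,...,4}
pi = w / w.sum()
m = len(w)

K = np.zeros((m, m))
for i in range(m):
    for j in (i - 1, i + 1):              # symmetric +/-1 proposal
        if 0 <= j < m:
            K[i, j] = 0.5 * min(1.0, w[j] / w[i])
    K[i, i] = 1.0 - K[i].sum()            # rejection mass stays at i

flow = pi[:, None] * K                    # flow[i, j] = pi_i K_ij
db_ok = np.allclose(flow, flow.T)         # detailed balance: pi_i K_ij = pi_j K_ji
invariant = np.allclose(pi @ K, pi)       # hence pi K = pi
print(db_ok, invariant)
```

Both checks pass: the flow matrix π_i K_ij is symmetric, and applying K leaves π unchanged.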
Unlike i.i.d. sampling, MCMC has two phases: a transient phase, in which the chain moves from its initialization toward the typical regions of π, and a stationary phase, in which time averages accumulate useful samples.
When MCMC works well, you get practical access to posterior expectations, marginal distributions, credible intervals, and predictive quantities that would otherwise require intractable integrals.
But MCMC is not “press a button.” You must reason about kernel design, ergodicity, and diagnostics. The rest of this lesson builds those pieces carefully.
Suppose you have a target π(x) that is complicated, but you can evaluate an unnormalized version π̃(x) ∝ π(x) pointwise and sample from some proposal distribution q(· | x).
If you naïvely propose X′ ∼ q(· | x) and always accept it, the chain’s stationary distribution will be whatever q induces—not your π.
Metropolis–Hastings (MH) fixes this by adding an accept/reject correction that adjusts for the mismatch between proposal dynamics and target probabilities.
Given current state x:
1) Propose x′ ∼ q(· | x).
2) Accept x′ with probability
α(x, x′) = min{1, [π(x′) q(x | x′)] / [π(x) q(x′ | x)]}
3) If accepted, the next state is x′; otherwise the chain stays at x.
Important: you can replace π with π̃ because the normalizing constant cancels in the ratio.
Define the proposal kernel q(x′ | x) and acceptance α(x, x′). The MH transition has two parts: for x′ ≠ x, K(x → x′) = q(x′ | x) α(x, x′); the remaining probability is a self-loop at x (the chance of proposing some move and rejecting it).
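The transition just described can be sketched as a single function; the interface here (log_pi, propose, log_q) is an assumption of this example, not notation from the text:

```python
import math, random

# Minimal sketch of one Metropolis-Hastings transition. Assumed interface:
# log_pi(x) is the unnormalized log target, propose(x) draws x' ~ q(.|x),
# and log_q(x_to, x_from) = log q(x_to | x_from).
def mh_step(x, log_pi, propose, log_q):
    x_new = propose(x)
    log_r = (log_pi(x_new) + log_q(x, x_new)) - (log_pi(x) + log_q(x_new, x))
    if math.log(random.random()) < log_r:   # accept with probability min(1, r)
        return x_new
    return x                                # reject: self-loop at x

# Usage: standard normal target with a symmetric random-walk proposal,
# so the log_q terms cancel and can be set to zero.
random.seed(0)
x, xs = 0.0, []
for _ in range(50_000):
    x = mh_step(x,
                log_pi=lambda v: -0.5 * v * v,
                propose=lambda v: v + random.gauss(0.0, 1.0),
                log_q=lambda a, b: 0.0)
    xs.append(x)
mean = sum(xs) / len(xs)
```

Working in log space avoids overflow/underflow when π̃ spans many orders of magnitude.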
Now the crucial property: MH is constructed so that detailed balance holds.
Take x ≠ x′. Then:
π(x) K(x → x′)
= π(x) q(x′ | x) α(x, x′)
and with α defined as the min of 1 and a ratio, you can check that
π(x) q(x′ | x) α(x, x′)
= π(x′) q(x | x′) α(x′, x)
= π(x′) K(x′ → x).
Therefore π is invariant.
The MH ratio compares how much the proposed move would distort the target:
r(x, x′) = [π(x′) q(x | x′)] / [π(x) q(x′ | x)].
The q terms matter because if your proposal tends to jump into some region too often, MH corrects by rejecting more there.
A very common choice is a symmetric proposal:
q(x′ | x) = q(x | x′).
For example, in ℝᵈ:
x′ = x + ε, ε ∼ 𝒩(0, σ²I).
Then the acceptance simplifies:
α(x, x′) = min{1, π(x′)/π(x)}.
This is simple and broadly applicable, but it can mix slowly in high dimensions or when π has strong correlations.
Proposal choice is the main engineering degree of freedom in MH.
| Proposal style | Typical q(x′ \| x) | Pros | Cons |
|---|---|---|---|
| Random-walk | x′ = x + ε | Simple; needs only π̃(x) | Can mix slowly; tuning σ is critical |
| Independence | q(x′) independent of x | Can make big jumps | Needs good global approximation to π |
| Langevin / gradient-informed | x′ ≈ x + (step)∇log π(x) + noise | Faster mixing on smooth targets | Requires gradients; careful step-size |
Even without going deep into gradient methods, the lesson is: MH correctness is easy; MH efficiency is hard.
If σ is too small in a random-walk proposal, you accept almost everything but explore slowly (high autocorrelation).
If σ is too large, you propose far-away points with tiny π and reject often (stuck chain).
A good proposal balances movement and acceptance. In many problems, you tune σ so that acceptance is neither near 0 nor near 1.
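This tradeoff is easy to see empirically. The sketch below runs random-walk MH on a standard normal target at three step sizes; the target, step sizes, and chain length are demo choices:

```python
import math, random

# Sketch: acceptance rate vs step size sigma for random-walk MH
# on a standard normal target (log pi~(x) = -x^2/2).
def acceptance_rate(sigma, n=20_000, seed=0):
    rng = random.Random(seed)
    x, accepts = 0.0, 0
    for _ in range(n):
        x_new = x + rng.gauss(0.0, sigma)
        # log acceptance ratio: log pi~(x') - log pi~(x) = (x^2 - x'^2)/2
        if math.log(rng.random()) < 0.5 * (x * x - x_new * x_new):
            x, accepts = x_new, accepts + 1
    return accepts / n

rates = {s: acceptance_rate(s) for s in (0.05, 1.0, 50.0)}
print(rates)  # tiny sigma: accept nearly always; huge sigma: reject nearly always
```

The near-1 acceptance at σ = 0.05 does not mean good sampling; the chain barely moves. The moderate σ gives both movement and a healthy acceptance rate.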
MH makes π invariant by construction; if the chain is also irreducible and aperiodic on the relevant support, it converges to π.
But MH does not guarantee fast convergence, low autocorrelation, or mixing on a timescale you can afford.
Those depend on proposal quality and the geometry of π.
MCMC’s promise is:
(1/n) ∑ₜ₌₁ⁿ f(Xₜ) → Eπ[f(X)]
even though Xₜ are dependent.
This is not magic; it’s an ergodic theorem for Markov chains. You already know (from prerequisite Markov chains) that a stationary distribution π satisfies π = πK. Ergodicity strengthens this: the chain not only has π, it converges to π from broad initial conditions.
Different textbooks phrase conditions differently, but the core ideas are: irreducibility (the chain can reach any region where π has mass), aperiodicity (it does not cycle deterministically), and stability (positive recurrence: the chain keeps returning to typical regions rather than drifting away).
Under these conditions (plus some technical regularity), the chain has a unique stationary distribution π and converges to it.
There are two related but distinct statements:
1) Convergence in distribution: Law(Xₜ) → π.
2) A law of large numbers (the ergodic theorem): (1/n) ∑ₜ₌₁ⁿ f(Xₜ) → Eπ[f(X)] (almost surely, under conditions).
The second is what you use for Monte Carlo estimates.
Because samples are correlated, the variance of the sample mean is larger than the i.i.d. value Varπ(f)/n (though it still shrinks at the 1/n rate).
Let μ = Eπ[f(X)] and consider the estimator f̄ₙ.
Under standard conditions, a Markov chain CLT gives:
√n (f̄ₙ − μ) ⇒ 𝒩(0, σ²_f),
where
σ²_f = Varπ(f(X₀)) + 2 ∑_{k=1}^∞ Covπ(f(X₀), f(X_k)).
Define the integrated autocorrelation time:
τ_int = 1 + 2 ∑_{k=1}^∞ ρ_k,
where ρ_k are autocorrelations. Then roughly:
Var(f̄ₙ) ≈ (Varπ(f)/n) · τ_int.
This motivates an effective sample size:
n_eff ≈ n / τ_int.
So MCMC accuracy depends not just on n, but on how quickly the chain forgets (how fast ρ_k decays).
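An illustrative sketch of estimating τ_int and n_eff from a chain. Truncating the autocorrelation sum at the first negative ρ_k is one common heuristic, not the only one; the AR(1) chain is used because its true τ_int = (1 + φ)/(1 − φ) is known:

```python
import numpy as np

# Sketch: estimate tau_int = 1 + 2*sum(rho_k) and n_eff = n / tau_int.
def tau_and_ess(chain):
    x = np.asarray(chain, dtype=float) - np.mean(chain)
    n = len(x)
    f = np.fft.rfft(x, 2 * n)                 # FFT-based autocovariance
    acov = np.fft.irfft(f * np.conj(f))[:n]
    rho = acov / acov[0]                      # rho_0 = 1
    tau = 1.0
    for k in range(1, n):
        if rho[k] < 0:                        # truncate at first negative rho_k
            break
        tau += 2.0 * rho[k]
    return tau, n / tau

# AR(1) chain with phi = 0.9: true tau_int = (1 + 0.9)/(1 - 0.9) = 19.
rng = np.random.default_rng(1)
phi, x = 0.9, 0.0
chain = []
for _ in range(200_000):
    x = phi * x + rng.normal()
    chain.append(x)
tau, n_eff = tau_and_ess(chain)
print(round(tau, 1), round(n_eff))
```

So 200,000 correlated draws carry roughly the information of 200,000/19 ≈ 10,500 independent ones.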
If you start at X₀ far from typical regions of π, early samples are biased toward the initial state. The usual operational fix is to discard the first B iterations.
Conceptually: burn-in removes initialization bias; it does not reduce the correlation among the samples you keep. If the chain moves slowly, discarding more samples doesn't create independence; it just wastes computation.
Thinning keeps every m-th sample. People do it to “reduce correlation,” but correlation is not removed; you just throw away information.
Thinning can be useful when storage is limited, or when each retained sample feeds an expensive downstream computation.
Otherwise, keeping all samples and estimating τ_int is generally better.
Ergodicity is a theorem, but you still need to check behavior.
Common checks: trace plots, running multiple chains from dispersed initializations, comparing within-chain and between-chain variability (R̂), and monitoring n_eff and acceptance rates.
None of these is a proof of convergence, but they catch many failures.
If π has widely separated modes, a random-walk MH chain may be ergodic in theory but practically non-mixing on feasible timescales.
In such cases you may need tempering schemes (e.g., parallel chains at different temperatures), specialized mode-jumping proposals, or a reparameterization that connects the modes.
The lesson here: ergodicity is about infinite time; efficiency is about the time you can afford.
Metropolis–Hastings is general but needs a good proposal q(x′ | x). Sometimes you don’t have a good global proposal, but you do know and can sample from conditional distributions.
Gibbs sampling exploits a simple idea:
If you can sample exactly from the conditional of one variable given the others, you can construct a Markov chain that leaves the joint target π invariant.
This is especially common in Bayesian models with conjugacy.
Let x = (x₁, …, x_d) and target π(x₁, …, x_d). Suppose you can sample from each full conditional:
π(x_i | x_{−i})
where x_{−i} denotes all coordinates except i.
A single-site Gibbs step updates one coordinate at a time:
For i = 1, …, d:
x_i ← draw from π(x_i | x_{−i}).
After sweeping through all coordinates, you have one full Gibbs iteration.
Each conditional update is a Markov kernel K_i that replaces x_i while leaving the others fixed.
If you start with X ∼ π, and then resample X_i from π(x_i | X_{−i}), the resulting joint distribution is still π.
One way to see this is that π factors as:
π(x) = π(x_i | x_{−i}) π(x_{−i}).
So if x_{−i} is distributed as π(x_{−i}) and you sample x_i from the correct conditional, you reconstruct the correct joint.
Composing kernels K = K_d ∘ … ∘ K_1 preserves invariance.
A useful connection: Gibbs is a special case of MH.
If you propose x′ that differs from x only in coordinate i, with proposal density
q(x′ | x) = π(x′_i | x_{−i})
then the MH acceptance ratio becomes 1, because the proposal already matches the target’s conditional structure.
So Gibbs is “perfectly corrected” by construction.
Sometimes coordinates are strongly coupled; single-site updates mix slowly.
Block Gibbs updates a subset (a block) b at once:
x_b ← draw from π(x_b | x_{−b}).
When you can sample the block conditional, this can greatly reduce autocorrelation.
Gibbs requires the ability to sample from conditionals. When conditionals are not standard distributions, you may combine Gibbs updates for the coordinates whose conditionals you can sample with MH updates for the rest ("MH-within-Gibbs").
This hybrid is extremely common in applied Bayesian computation.
| Aspect | Metropolis–Hastings | Gibbs |
|---|---|---|
| Requires | unnormalized π̃(x) and a proposal q | ability to sample from conditionals |
| Accept/reject | yes | no (acceptance = 1) |
| Tuning | proposal scale/shape | block structure/order |
| Weakness | can reject often | can mix slowly with strong correlations |
The important mental model: MH is about proposing then correcting; Gibbs is about choosing proposals that need no correction.
Bayesian workflows often reduce to computing expectations under the posterior: means, variances, event probabilities, and predictive quantities, all of the form E[f(θ) | data].
If you have samples θ¹, …, θⁿ ∼ (approximately) p(θ | data), then E[f(θ) | data] ≈ (1/n) ∑ₜ₌₁ⁿ f(θᵗ).
MCMC turns integrals into averages.
Because samples are correlated, you should not use i.i.d. standard errors blindly.
Two common approaches: batch means (use the variability of averages over long batches) and autocorrelation-based estimates (estimate τ_int and scale the i.i.d. variance accordingly).
MCMC is flexible about support constraints.
The key is always the same: define a kernel K that doesn’t leave the support and keeps π invariant.
If π(θ) has strong correlations, a random-walk proposal in the raw coordinates can be inefficient.
Example: suppose (θ₁, θ₂) posterior mass lies near a tilted ellipse. Proposing independent Gaussian noise in each coordinate moves you orthogonally to the ridge often, causing rejections or slow diffusion.
A reparameterization that “whitens” the posterior (makes it closer to spherical) can drastically reduce τ_int.
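A sketch of the whitening idea: if Σ = LLᵀ (Cholesky factor), you can run the chain on z against a spherical target and map back via θ = Lz; the Σ below is an invented example with strong correlation:

```python
import numpy as np

# Sketch: whitening via Cholesky. If Sigma = L L^T, then theta = L z with
# z ~ N(0, I) gives theta ~ N(0, Sigma). An MCMC chain run in z-space sees a
# spherical target even though theta is strongly correlated.
rng = np.random.default_rng(4)
Sigma = np.array([[1.0, 0.95],
                  [0.95, 1.0]])
L = np.linalg.cholesky(Sigma)

# Random-walk MH in whitened coordinates: target for z is standard normal.
z, zs = np.zeros(2), []
for _ in range(30_000):
    z_prop = z + rng.normal(0.0, 1.0, size=2)
    if np.log(rng.random()) < 0.5 * (z @ z - z_prop @ z_prop):
        z = z_prop
    zs.append(z)

theta = np.array(zs) @ L.T        # map each z sample back to theta = L z
corr = np.corrcoef(theta.T)[0, 1]
print(round(corr, 2))             # recovers the strong correlation in Sigma
```

The chain never has to fight the tilted ellipse; the geometry is handled by the linear map, not by the proposal.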
Once you internalize (1) invariance via detailed balance and (2) ergodicity for time averages, you're ready for gradient-informed samplers such as Langevin (MALA) and Hamiltonian Monte Carlo, adaptive proposals, and tempering methods for multimodal targets.
But even at this level, you can already implement correct samplers for many Bayesian models and reason about when they will (or won’t) work.
Target on ℝ: π(x) ∝ π̃(x) where π̃(x) = exp(−x⁴ + 2x²). This is not trivial to normalize, but we can evaluate π̃(x) pointwise.
Use a symmetric random-walk proposal: x′ = x + ε, ε ∼ 𝒩(0, σ²).
Because the proposal is symmetric, q(x′|x) = q(x|x′), so the MH acceptance probability simplifies:
α(x, x′) = min{1, π(x′)/π(x)} = min{1, π̃(x′)/π̃(x)}.
Compute the unnormalized log density (for numerical stability):
log π̃(x) = −x⁴ + 2x².
Given current x and proposal x′, form the log acceptance ratio:
log r = log π̃(x′) − log π̃(x)
= [−(x′)⁴ + 2(x′)²] − [−x⁴ + 2x²].
Convert back to an acceptance probability:
r = exp(log r),
α = min{1, r}.
Decision rule: draw u ∼ Uniform(0, 1); if u < α, set the next state to x′; otherwise keep x.
Key observation: the normalizing constant Z = ∫ π̃(x) dx never appears. Any π(x) = π̃(x)/Z gives the same acceptance ratio because Z cancels:
π(x′)/π(x) = [π̃(x′)/Z] / [π̃(x)/Z] = π̃(x′)/π̃(x).
Insight: Metropolis–Hastings converts “I can evaluate π̃(x) up to a constant” into “I can sample from π.” The entire correction happens through local ratios, so unknown normalization is not a blocker.
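The full procedure can be sketched as follows; σ = 1 is an assumed, untuned step size:

```python
import math, random

# Sketch of the worked example: random-walk MH on log pi~(x) = -x^4 + 2x^2.
def log_pi_tilde(x):
    return -x**4 + 2 * x**2

random.seed(0)
sigma, x = 1.0, 0.0
samples = []
for _ in range(50_000):
    x_prop = x + random.gauss(0.0, sigma)                # symmetric proposal
    log_r = log_pi_tilde(x_prop) - log_pi_tilde(x)       # Z never appears
    if math.log(random.random()) < log_r:                # accept w.p. min(1, r)
        x = x_prop
    samples.append(x)

# The target is symmetric in x with modes near x = +/-1, so the sample mean
# should be near 0 while |x| concentrates away from 0.
mean = sum(samples) / len(samples)
mean_abs = sum(abs(s) for s in samples) / len(samples)
print(round(mean, 2), round(mean_abs, 2))
```

Note that only differences of log π̃ are ever computed, which is exactly the "Z cancels" point above and is also numerically stable.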
You have a proposal kernel q(x′|x). You want to construct a corrected kernel K(x→x′) = q(x′|x)α(x,x′) (for x′≠x) such that detailed balance holds with π.
Start from the desired detailed balance condition for x′ ≠ x:
π(x) q(x′|x) α(x,x′) = π(x′) q(x|x′) α(x′,x).
Rearrange to relate α(x,x′) and α(x′,x):
α(x,x′) / α(x′,x) = [π(x′) q(x|x′)] / [π(x) q(x′|x)] = r(x,x′).
We need a choice of α that satisfies this ratio condition while always lying in [0, 1] (it must be a probability), and ideally is as large as possible, to avoid needless rejections.
A standard symmetric construction is:
α(x,x′) = min{1, r(x,x′)}
α(x′,x) = min{1, 1/r(x,x′)}.
Check the ratio in two cases.
Case 1: r ≥ 1.
Then α(x,x′) = 1 and α(x′,x) = 1/r.
So α(x,x′)/α(x′,x) = 1 / (1/r) = r.
Case 2: r < 1.
Then α(x,x′) = r and α(x′,x) = 1.
So α(x,x′)/α(x′,x) = r / 1 = r.
Thus detailed balance holds for x′ ≠ x. Then π is invariant for the full kernel once we include the self-loop probability (the probability of rejecting and staying at x).
Insight: The MH acceptance rule is not arbitrary; it is engineered to satisfy detailed balance while keeping α inside [0,1]. This is the simplest general-purpose way to preserve π as invariant.
Let x = (x₁, x₂) have a bivariate normal target:
x ∼ 𝒩(0, Σ), where Σ = [[1, ρ],[ρ, 1]] with |ρ| < 1.
We will derive the full conditionals π(x₁|x₂) and π(x₂|x₁) and describe a Gibbs sampler.
For a bivariate normal, the conditional distribution is normal. Specifically:
x₁ | x₂ ∼ 𝒩( E[x₁|x₂], Var(x₁|x₂) ).
Compute conditional mean and variance using standard Gaussian conditioning:
E[x₁|x₂] = ρ x₂,
Var(x₁|x₂) = 1 − ρ².
Similarly:
E[x₂|x₁] = ρ x₁,
Var(x₂|x₁) = 1 − ρ².
A single Gibbs iteration (a sweep) is:
1) Sample x₁^{new} ∼ 𝒩(ρ x₂^{old}, 1 − ρ²)
2) Sample x₂^{new} ∼ 𝒩(ρ x₁^{new}, 1 − ρ²)
No accept/reject step is needed: each update draws exactly from the correct conditional.
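The sweep above can be sketched directly; ρ = 0.5 and the chain length are arbitrary demo choices:

```python
import numpy as np

# Sketch: two-variable Gibbs sweep for the bivariate normal with correlation rho.
rng = np.random.default_rng(0)
rho, n = 0.5, 20_000
sd = np.sqrt(1 - rho**2)            # conditional std, same for both coordinates

x1, x2 = 0.0, 0.0
draws = np.empty((n, 2))
for t in range(n):
    x1 = rng.normal(rho * x2, sd)   # x1 | x2 ~ N(rho * x2, 1 - rho^2)
    x2 = rng.normal(rho * x1, sd)   # x2 | x1 ~ N(rho * x1, 1 - rho^2)
    draws[t] = x1, x2

# Every draw is accepted; the empirical correlation approaches rho.
corr = np.corrcoef(draws.T)[0, 1]
print(round(corr, 2))
```

Re-running with ρ close to 1 keeps the code valid but makes successive sweeps highly autocorrelated, which is the mixing point made next.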
Mixing intuition:
If |ρ| is close to 1, then 1 − ρ² is small, so each conditional draw is tightly concentrated around ρ times the other variable.
That makes successive samples highly correlated (slow exploration along the thin ellipse), even though every step is accepted.
Insight: Gibbs can be “perfectly accepted” yet still mix slowly when variables are strongly coupled. Correctness (invariance) is separate from efficiency (autocorrelation).
MCMC builds a Markov chain with transition kernel K(x → x′) whose stationary distribution is the target π(x).
Detailed balance, π(x)K(x→x′) = π(x′)K(x′→x), is a convenient sufficient condition for π to be invariant.
Metropolis–Hastings uses a proposal q(x′|x) plus acceptance α(x,x′) = min{1, [π(x′)q(x|x′)]/[π(x)q(x′|x)]} to enforce invariance—even when π is only known up to a constant.
Ergodicity (irreducibility + aperiodicity + stability) is what justifies replacing Eπ[f(X)] with a time average along one long chain.
Correlation reduces information: n samples from MCMC behave like n_eff ≈ n/τ_int independent samples, where τ_int depends on autocorrelation.
Gibbs sampling updates coordinates (or blocks) by drawing from exact conditionals π(x_i | x_{−i}); it is MH with acceptance probability 1.
Burn-in addresses initialization bias; it does not fix slow mixing. Thinning usually wastes samples unless constrained by storage or computation.
Assuming a high acceptance rate means good sampling: tiny random-walk steps can accept often but explore very slowly (high autocorrelation).
Forgetting support constraints (e.g., proposing negative values for a positive parameter) which can silently cause near-zero acceptance or invalid evaluations of π̃(x).
Relying on a single chain and no diagnostics: trace plots, multiple initializations, and n_eff/R̂ are basic checks against non-convergence and multimodal trapping.
Discarding huge burn-in or heavy thinning as a substitute for fixing a poor proposal or bad parameterization.
You use Metropolis–Hastings with an independence proposal q(x′) (does not depend on x). Write the acceptance probability α(x,x′). Then simplify it as much as possible.
Hint: Start from α(x,x′) = min{1, [π(x′) q(x|x′)]/[π(x) q(x′|x)]}. For independence proposals, q(x′|x)=q(x′) and q(x|x′)=q(x).
With independence proposal:
q(x′|x) = q(x′), and q(x|x′) = q(x).
So
α(x,x′) = min{1, [π(x′) q(x)] / [π(x) q(x′)]}.
This form highlights that you accept moves toward regions where π/q is larger.
Suppose you run an MCMC chain and want to estimate μ = Eπ[f(X)]. You compute the empirical mean f̄ₙ. Explain (in 2–4 sentences) why Var(f̄ₙ) is larger than Varπ(f)/n, and name one method to estimate uncertainty correctly.
Hint: Think about autocorrelation: Cov(f(X₀), f(X_k)) terms appear in the variance of the sample mean.
Because MCMC samples are correlated, the variance of the average includes lagged covariances:
Var(f̄ₙ) ≈ (Varπ(f)/n) · τ_int where τ_int = 1 + 2∑_{k≥1} ρ_k.
So the estimator behaves like it had only n_eff ≈ n/τ_int independent samples.
One correct uncertainty method is batch means (split the chain into batches and use variability of batch averages), or estimating τ_int directly and scaling Varπ(f)/n.
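A sketch of the batch-means idea, using an AR(1) chain as a stand-in for MCMC output (φ = 0.9 and the batch count are assumptions of the demo):

```python
import numpy as np

# Sketch: batch-means standard error for a correlated chain vs the naive
# i.i.d. formula. Long batches have nearly independent means, so their
# spread captures the extra variance that autocorrelation induces.
def batch_means_se(chain, n_batches=50):
    x = np.asarray(chain, dtype=float)
    n = (len(x) // n_batches) * n_batches
    batch_avgs = x[:n].reshape(n_batches, -1).mean(axis=1)
    return batch_avgs.std(ddof=1) / np.sqrt(n_batches)

# AR(1) chain: tau_int = 19, so the naive se is badly optimistic.
rng = np.random.default_rng(2)
phi, x, chain = 0.9, 0.0, []
for _ in range(100_000):
    x = phi * x + rng.normal()
    chain.append(x)

naive_se = np.std(chain, ddof=1) / np.sqrt(len(chain))
ratio = batch_means_se(chain) / naive_se
print(round(ratio, 1))   # roughly sqrt(tau_int), i.e. well above 1
```

The ratio of the two standard errors is about √τ_int, which is exactly the n_eff ≈ n/τ_int story seen from the variance side.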
Consider a two-variable target π(x₁, x₂). You can sample from π(x₁ | x₂) exactly, but π(x₂ | x₁) is not available in closed form. Propose a valid MCMC scheme that still leaves π invariant.
Hint: Mix Gibbs where you can with MH where you can’t: MH-within-Gibbs.
Use a hybrid MH-within-Gibbs sampler:
1) Update x₁ by Gibbs: sample x₁′ ∼ π(x₁ | x₂) and set x₁ ← x₁′.
2) Update x₂ by an MH step targeting the conditional π(x₂ | x₁):
α = min{1, [π(x₁, x₂′) q(x₂ | x₂′, x₁)] / [π(x₁, x₂) q(x₂′ | x₂, x₁)] }
(equivalently using the conditional since x₁ is fixed).
Composing a Gibbs kernel (invariant) with an MH kernel (invariant) yields an overall kernel that preserves π.
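A minimal sketch of this hybrid on an artificial example: the bivariate normal target with ρ = 0.5, where we pretend the x₂ conditional is unavailable and update it by random-walk MH instead:

```python
import math
import numpy as np

# Sketch of MH-within-Gibbs on the bivariate normal with rho = 0.5:
# x1 is updated by an exact conditional draw (Gibbs), while x2 is updated
# by a random-walk MH step, standing in for an intractable conditional.
rng = np.random.default_rng(3)
rho = 0.5

def log_joint(x1, x2):   # log pi~ of the bivariate normal, up to a constant
    return -(x1**2 - 2 * rho * x1 * x2 + x2**2) / (2 * (1 - rho**2))

x1, x2, draws = 0.0, 0.0, []
for _ in range(30_000):
    x1 = rng.normal(rho * x2, math.sqrt(1 - rho**2))   # Gibbs step for x1
    x2_prop = x2 + rng.normal(0.0, 1.0)                # symmetric RW proposal
    if math.log(rng.random()) < log_joint(x1, x2_prop) - log_joint(x1, x2):
        x2 = x2_prop                                   # MH accept; else keep x2
    draws.append((x1, x2))

corr = np.corrcoef(np.array(draws).T)[0, 1]
print(round(corr, 2))    # approaches the target correlation rho
```

Because the Gibbs kernel and the MH kernel each leave π invariant, their composition does too, so the chain recovers the target's correlation structure.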