Partial pooling, hyperpriors, shrinkage estimators. Multilevel regression. Empirical Bayes as approximation. Store-level vs chain-level parameter sharing.
When you have many related groups with small per-group data (stores, hospitals, classrooms), hierarchical Bayesian models give you principled 'partial pooling' that improves estimates and quantifies uncertainty—without forcing all groups to be identical.
Hierarchical Bayesian models place priors on group-level parameters (hyperpriors) to achieve partial pooling and shrinkage; empirical Bayes offers a fast approximation by estimating hyperparameters from the marginal likelihood.
Definition and motivation
A hierarchical Bayesian model (also called a multilevel model) is a probabilistic model in which parameters themselves have probability distributions whose parameters (hyperparameters) are also given distributions. Hierarchical structure encodes the idea of exchangeability and partial pooling: groups are similar but not identical. Rather than estimating each group's parameter independently (no pooling) or assuming all groups share a single parameter (complete pooling), hierarchical models let the data determine how much information to borrow across groups.
Why care: partial pooling reduces estimation variance while controlling bias. Consider stores estimating conversion rates: a tiny store with 2 purchases in 20 visits would have an unreliable raw estimate of $2/20 = 0.10$. A hierarchical model borrows strength from the chain-level distribution to pull (shrink) that estimate toward the chain mean, often giving dramatically better predictive performance.
Core formalism (two-level example)
The canonical two-level hierarchical model for group-specific parameters and observations is:

$$y_{ij} \mid \theta_j \sim p(y \mid \theta_j), \qquad \theta_j \mid \phi \sim p(\theta \mid \phi), \qquad \phi \sim p(\phi).$$

Here $j$ indexes groups (stores) and $i$ indexes observations in group $j$. Exchangeability across groups means that, before seeing data, group labels don't matter and we treat the $\theta_j$ as iid draws from $p(\theta \mid \phi)$.
Concrete numeric toy: Beta-Binomial intuition
If each store $j$ has conversion counts $y_j$ out of $n_j$ visits and we place $\theta_j \sim \mathrm{Beta}(\alpha, \beta)$ with hyperparameters $(\alpha, \beta)$, then posterior inference for each $\theta_j$ is conjugate:

$$\theta_j \mid y_j \sim \mathrm{Beta}(\alpha + y_j,\; \beta + n_j - y_j).$$

Numeric example: a small store had $y = 2$ successes in $n = 20$ visits. If chain-level hyperparameters are $\alpha = 5$, $\beta = 95$ (prior mean 0.05, small variance), the posterior mean is

$$\frac{\alpha + y}{\alpha + \beta + n} = \frac{7}{120} \approx 0.058,$$

which is shrunk toward 0.05 from the raw rate $2/20 = 0.10$.
Contrast with complete pooling: estimating a single $\theta$ for all stores would ignore between-store variability; no pooling would give $\hat\theta_j = y_j / n_j$ with high variance for small $n_j$. The hierarchical model, by introducing hyperparameters and a hyperprior, adapts the amount of pooling to the data.
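The three pooling regimes can be compared directly in a few lines of numpy. A minimal sketch: the tiny-store counts (2 of 20) come from the example above, while the large store's counts (100 of 1000) are illustrative values chosen so both stores have the same raw rate.

```python
import numpy as np

def partial_pooling_means(y, n, alpha, beta):
    """Posterior means under the Beta(alpha, beta) -> Binomial hierarchy."""
    y, n = np.asarray(y, float), np.asarray(n, float)
    return (alpha + y) / (alpha + beta + n)

y = np.array([2, 100])    # conversions: tiny store, large store (large store assumed)
n = np.array([20, 1000])  # visits

no_pool = y / n                     # raw per-store rates: both 0.10
complete_pool = y.sum() / n.sum()   # one chain-wide rate
partial = partial_pooling_means(y, n, alpha=5, beta=95)  # prior mean 0.05

print(no_pool)        # [0.1  0.1]
print(complete_pool)  # 0.1
print(partial)        # tiny store is shrunk far more than the large one
```

Note how the two stores, despite identical raw rates, get different partially pooled estimates: the tiny store is pulled close to the chain prior mean while the large store stays near its data.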
Philosophy and exchangeability
In [Conjugate Priors] we learned how conjugacy gives closed-form posteriors for single-level models like Beta-Binomial and Normal-Normal. Hierarchical models extend this: conjugacy may allow analytic conditional posteriors for lower-level parameters given hyperparameters. But typically we integrate over hyperparameters (or sample them in MCMC) to get full posterior uncertainty.
Exchangeability justifies the iid assumption on the $\theta_j$: if groups are exchangeable, the joint prior over $(\theta_1, \dots, \theta_J)$ factorizes into iid draws from the same mixture. Partial pooling is the Bayesian implementation of the bias-variance tradeoff: the posterior for each $\theta_j$ is shrunk toward the common prior mean, with the extent of shrinkage determined by the relative sizes of within-group noise and between-group variability.
Key quantities to watch: the hyperparameters (e.g. the variance $\tau^2$ of the $\theta_j$ across groups) control shrinkage; the posterior predictive distribution integrates over both $\theta_j$ and the hyperparameters and is what you use for predicting new observations or new groups.
Summary sentence
Hierarchical Bayesian models let you model group heterogeneity while borrowing strength across groups via hyperpriors; this produces principled shrinkage estimators that often dominate non-hierarchical alternatives in prediction and mean squared error.
Normal-Normal hierarchical model (canonical analytic case)
A central worked analytic case is the Normal-Normal hierarchical model with known observation variance. This is where the familiar shrinkage formula arises in closed form and links directly to [Conjugate Priors]. Model:

$$\bar y_j \mid \theta_j \sim N(\theta_j,\; \sigma^2/n_j), \qquad \theta_j \sim N(\mu, \tau^2).$$

Let $\bar y_j$ be the group sample mean and $\sigma^2/n_j$ its sampling variance. Conditional on hyperparameters $(\mu, \tau^2)$ the posterior for $\theta_j$ is Gaussian (by conjugacy):

$$\theta_j \mid \bar y_j, \mu, \tau^2 \sim N\!\big(\hat\theta_j,\; (1/\tau^2 + n_j/\sigma^2)^{-1}\big),$$

where the shrinkage weight is

$$\kappa_j = \frac{n_j \tau^2}{n_j \tau^2 + \sigma^2}.$$

Solving the algebra shows the posterior mean can be written in the intuitively clearer form

$$\hat\theta_j = (1 - \kappa_j)\,\mu + \kappa_j\, \bar y_j.$$

Here $\kappa_j$ is the effective weight on the data vs the group mean. When $n_j \tau^2 \gg \sigma^2$, $\kappa_j \to 1$ and we rely on the data; when $\tau^2 = 0$ (no between-group variability), $\kappa_j = 0$ and every group collapses to the common mean (complete pooling).
Concrete numeric example (normal-normal)
Suppose $\sigma^2 = 4$ (observation SD 2), $\tau^2 = 1$ (between-group SD 1), $n_j = 5$ observations in group $j$, group sample mean $\bar y_j = 10$, and hyper-mean $\mu = 8$. Then

$$\kappa_j = \frac{5 \cdot 1}{5 \cdot 1 + 4} = \frac{5}{9} \approx 0.556,$$

so the posterior mean is

$$\hat\theta_j = \Big(1 - \tfrac{5}{9}\Big) \cdot 8 + \tfrac{5}{9} \cdot 10 = \frac{82}{9} \approx 9.11.$$

Compare: raw mean 10, pooled mean 8, shrunk estimate 9.11.
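The shrinkage formula is two lines of code. A minimal sketch, using illustrative values consistent with the 9.11 result above (assumed: sigma^2 = 4, tau^2 = 1, n_j = 5, group mean 10, hyper-mean 8):

```python
def shrinkage_mean(ybar, n, mu, sigma2, tau2):
    """Posterior mean of theta_j in the Normal-Normal model (known variances)."""
    kappa = n * tau2 / (n * tau2 + sigma2)  # weight on the group's own data
    return (1 - kappa) * mu + kappa * ybar, kappa

post_mean, kappa = shrinkage_mean(ybar=10.0, n=5, mu=8.0, sigma2=4.0, tau2=1.0)
print(kappa)      # ~0.556
print(post_mean)  # ~9.11
```

Varying `tau2` in this helper is a quick way to see the pooling continuum: `tau2=0` returns the pooled mean 8 exactly, and very large `tau2` returns the raw mean 10.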
Derivation sketch
Conditional posterior precision = prior precision + data precision. Prior precision $= 1/\tau^2$, data precision $= n_j/\sigma^2$. Thus the posterior mean is a precision-weighted average:

$$\hat\theta_j = \frac{(1/\tau^2)\,\mu + (n_j/\sigma^2)\,\bar y_j}{1/\tau^2 + n_j/\sigma^2}.$$

This is the canonical "shrinkage" formula. In [Conjugate Priors] you saw the Normal-Normal conjugacy; here it extends across groups with a hyperprior on $\mu$ and $\tau^2$ if desired.
James–Stein connection (brief, illuminating)
The James–Stein estimator is a (frequentist) shrinkage estimator for multivariate normal means that dominates the MLE in mean squared error in dimension $d \geq 3$. Hierarchical Bayes with an appropriate hyperprior produces shrinkage estimators with similar behavior. A heuristic connection: empirical Bayes or hierarchical Bayes produces estimators that shrink toward a common mean; in large-dimension limits the Bayes estimator approximates a form of James–Stein shrinkage.
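The domination result is easy to see empirically. A small simulation sketch (the true means, noise level, and dimension are all assumed for illustration): we draw one noisy observation per mean, apply the classic James–Stein formula shrinking toward zero, and compare mean squared errors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma2 = 50, 1.0
theta = rng.normal(0.0, 1.0, size=d)                   # true means (assumed)
y = theta + rng.normal(0.0, np.sqrt(sigma2), size=d)   # one noisy obs per mean

# James-Stein estimator shrinking toward 0 (requires d >= 3).
js = (1 - (d - 2) * sigma2 / np.sum(y**2)) * y

mse_mle = np.mean((y - theta) ** 2)   # MLE just uses y
mse_js = np.mean((js - theta) ** 2)
print(mse_mle, mse_js)                # JS has lower MSE on this draw
```

With the true means themselves drawn from a normal, the James–Stein shrink factor lands near the Bayes-optimal weight, which is exactly the heuristic connection described above.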
Empirical Bayes (ML-II) approximation
If $\mu$ and $\tau^2$ are unknown, one approach is empirical Bayes: estimate them by maximizing the marginal likelihood of the observed $\bar y_j$'s. Marginally,

$$\bar y_j \sim N\!\big(\mu,\; \tau^2 + \sigma^2/n_j\big).$$

So we can compute the (log) marginal likelihood and obtain MLEs $\hat\mu, \hat\tau^2$; plug them into the posterior mean formula above to get an empirical Bayes point estimate for each $\theta_j$. This ML-II step is often fast and connects to [MCMC] because you can use these estimates to initialize or inform MCMC priors.
Numeric example of empirical Bayes (method of moments)
Given observed $\bar y_1, \dots, \bar y_J$ with known $\sigma^2$ and $n_j$, method-of-moments estimates give

$$\hat\mu = \frac{1}{J} \sum_{j=1}^{J} \bar y_j.$$

Then estimate the between-group variance via

$$\hat\tau^2 = \max\!\left(0,\; S^2 - \frac{1}{J}\sum_j \frac{\sigma^2}{n_j}\right),$$

where $S^2$ is the sample variance of the $\bar y_j$. Concrete numbers: if we have three groups with $\bar y = (9, 11, 10)$, $n_j = 5$ each, and $\sigma^2 = 4$, then the sample variance of the $\bar y_j$ is $1$; the average sampling variance is $4/5 = 0.8$, so method-of-moments gives $\hat\tau^2 = 1 - 0.8 = 0.2$.
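The method-of-moments recipe translates directly into a small helper (function name assumed), reproducing the three-group numbers just given:

```python
import numpy as np

def mom_empirical_bayes(ybar, n, sigma2):
    """Method-of-moments estimates of (mu, tau^2) from group means."""
    ybar = np.asarray(ybar, float)
    mu_hat = ybar.mean()
    s2 = ybar.var(ddof=1)                            # sample variance of group means
    avg_sampling_var = np.mean(sigma2 / np.asarray(n, float))
    tau2_hat = max(0.0, s2 - avg_sampling_var)       # truncate at zero
    return mu_hat, tau2_hat

mu_hat, tau2_hat = mom_empirical_bayes([9, 11, 10], n=[5, 5, 5], sigma2=4.0)
print(mu_hat, tau2_hat)   # mu_hat = 10.0, tau2_hat ~ 0.2
```

The `max(0, ...)` truncation matters in practice: with few groups the raw moment estimate of tau^2 is often negative, and truncating to zero corresponds to complete pooling.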
Summary
This section gave the analytic backbone: conjugate Normal-Normal hierarchies yield explicit shrinkage weights $\kappa_j$; empirical Bayes estimates hyperparameters by marginalizing the lower-level parameters, producing fast approximate shrinkage estimators. In practice, for nonconjugate or complex multilevel regressions we use MCMC (see [MCMC]) with either centered or non-centered parameterizations to ensure efficient sampling.
Multilevel regression: varying intercepts and slopes
Hierarchical modeling becomes especially powerful when combined with regression: we let intercepts and slopes vary by group. A simple varying-intercept, varying-slope model for outcome $y_{ij}$ with predictor vector $x_{ij}$ is:

$$y_{ij} \sim N\!\big(\alpha_j + x_{ij}^\top \beta_j,\; \sigma^2\big), \qquad (\alpha_j, \beta_j) \sim N\!\big((\mu_\alpha, \mu_\beta),\; \Sigma\big),$$

where $\alpha_j$ is the group-specific intercept, $\beta_j$ the group-specific slope vector, and $\Sigma$ is a covariance matrix encoding how intercepts and slopes vary and covary across groups. This fully multilevel regression allows partial pooling in both intercepts and slopes, and allows complex cross-group shrinkage: groups with scarce data will have their $\alpha_j$ and $\beta_j$ pulled toward the population means $(\mu_\alpha, \mu_\beta)$.
Priors on covariance matrices: separation strategy and LKJ
Priors on $\Sigma$ are critical. A common strategy (the "separation" prior) decomposes

$$\Sigma = \mathrm{diag}(\sigma)\; \Omega\; \mathrm{diag}(\sigma),$$

where $\sigma$ is the vector of standard deviations for each random effect and $\Omega$ a correlation matrix. Priors: put half-Cauchy or half-normal priors on the components of $\sigma$ (weakly informative), and an LKJ prior on $\Omega$ (Lewandowski–Kurowicka–Joe) with concentration parameter $\eta$ favoring the identity matrix when $\eta > 1$. Example numeric choices: $\sigma_k \sim \mathrm{HalfNormal}(0, 1)$, $\Omega \sim \mathrm{LKJ}(\eta = 2)$.
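The separation decomposition itself is just matrix algebra. A minimal sketch for a 2x2 random-effects covariance (the standard deviations and correlation are illustrative values, not from the text):

```python
import numpy as np

# Separation decomposition Sigma = diag(s) @ Omega @ diag(s).
s = np.array([0.5, 0.2])          # SDs of intercepts and slopes (assumed)
Omega = np.array([[1.0, 0.3],
                  [0.3, 1.0]])    # correlation matrix (assumed rho = 0.3)

Sigma = np.diag(s) @ Omega @ np.diag(s)
print(Sigma)
# Diagonal recovers the variances s_k^2; off-diagonal = rho * s_1 * s_2.
```

Separating scale from correlation is what lets you place interpretable priors on each piece: half-normals on `s`, LKJ on `Omega`.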
Centered vs non-centered parameterizations (important for MCMC efficiency)
In [MCMC] we learned that parameterization affects convergence and mixing. For hierarchical models, two common parameterizations are the centered form, $\theta_j \sim N(\mu, \tau^2)$ directly, and the non-centered form, $\theta_j = \mu + \tau \eta_j$ with $\eta_j \sim N(0, 1)$.
Why it matters: when the data strongly inform each group's parameter, the centered parameterization mixes well; when data per group are small (so the prior dominates and $\tau$ is weakly identified), the non-centered form often mixes much better. A practical rule: try both or use adaptive diagnostics. Concretely: with abundant data per group, centered is fine; with small $n_j$, non-centered reduces the funnel-shaped posterior geometry and avoids slow MCMC.
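The two parameterizations describe the same prior distribution; only the geometry the sampler sees differs. A quick forward-simulation sketch (mu, tau, and the sample size are illustrative) confirms the distributional equivalence:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, tau, J = 5.0, 0.5, 100_000   # assumed illustrative values

# Centered: draw theta_j ~ N(mu, tau^2) directly.
theta_centered = rng.normal(mu, tau, size=J)

# Non-centered: theta_j = mu + tau * eta_j with eta_j ~ N(0, 1).
eta = rng.normal(0.0, 1.0, size=J)
theta_noncentered = mu + tau * eta

# Same distribution, different parameter geometry for the sampler.
print(theta_centered.mean(), theta_noncentered.mean())  # both ~5.0
print(theta_centered.std(), theta_noncentered.std())    # both ~0.5
```

In the non-centered form the sampler explores the standardized `eta_j`, whose prior geometry does not depend on `tau`; that independence is what breaks the funnel.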
Posterior predictive checks and partial pooling
The full Bayesian workflow uses posterior predictive distributions to judge fit and to predict. For a given group $j$ the posterior predictive for a new observation $\tilde y_j$ is

$$p(\tilde y_j \mid y) = \int p(\tilde y_j \mid \theta_j)\; p(\theta_j, \phi \mid y)\; d\theta_j\, d\phi.$$

In practice we approximate this by MCMC draws: for each posterior draw of $\theta_j$ and $\phi$, simulate $\tilde y_j \sim p(\tilde y \mid \theta_j)$. Posterior predictive checks diagnose over/under-pooling: if the pooled model systematically mispredicts heterogeneity, consider richer group-level structure (e.g., allow covariates to explain group differences).
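For the conjugate Normal-Normal case the draw-then-simulate recipe can be checked end to end. A sketch using illustrative numbers (sigma^2 = 4, tau^2 = 1, n_j = 5, group mean 10, hyper-mean 8, all assumed): the predictive mean should match the shrunken posterior mean, and the predictive variance should be the posterior variance plus the observation variance.

```python
import numpy as np

rng = np.random.default_rng(2)

sigma2, tau2, n_j, mu, ybar = 4.0, 1.0, 5, 8.0, 10.0   # assumed values
post_prec = 1 / tau2 + n_j / sigma2
post_var = 1 / post_prec
post_mean = (mu / tau2 + n_j * ybar / sigma2) / post_prec

# Posterior predictive for one new observation: draw theta, then y_new.
draws = 200_000
theta = rng.normal(post_mean, np.sqrt(post_var), size=draws)
y_new = rng.normal(theta, np.sqrt(sigma2))

print(y_new.mean())   # ~ post_mean (~9.11)
print(y_new.var())    # ~ post_var + sigma2
```

This is exactly the "for each draw, simulate" loop described above, vectorized over draws; with MCMC output you would substitute posterior samples of theta (and of the hyperparameters) for the exact conjugate draws.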
Empirical Bayes vs full Bayes: tradeoffs
In empirical Bayes (ML-II) we estimate hyperparameters $\phi$ by marginal likelihood and then condition on $\hat\phi$. This often gives good point estimates and is fast for large hierarchical problems. However, it underestimates hyperparameter uncertainty: it treats $\hat\phi$ as known, which can understate posterior variance for predictions, especially with few groups. Full Bayes places hyperpriors on $\phi$ and samples it in MCMC, which captures uncertainty but costs more computation.
Numeric demonstration of marginal likelihood (for normal-normal)
Recall the marginal of $\bar y_j$ is $N(\mu, \tau^2 + \sigma^2/n_j)$. For known $\sigma^2$ the log marginal likelihood is

$$\ell(\mu, \tau^2) = -\frac{1}{2} \sum_{j=1}^{J} \left[ \log\!\big(2\pi(\tau^2 + \sigma^2/n_j)\big) + \frac{(\bar y_j - \mu)^2}{\tau^2 + \sigma^2/n_j} \right].$$

For example, with $J = 5$ groups, $\bar y = (4.8, 5.1, 4.9, 5.0, 6.2)$, $n_j = 5$, and $\sigma^2 = 1$, we numerically maximize over $(\mu, \tau^2)$ to get empirical Bayes estimates used in the shrinkage formula.
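The maximization is a two-parameter optimization. A minimal scipy sketch using the five group means that also appear in the exercises (optimizing tau^2 on the log scale to keep it positive):

```python
import numpy as np
from scipy.optimize import minimize

ybar = np.array([4.8, 5.1, 4.9, 5.0, 6.2])   # group means
n, sigma2 = 5, 1.0

def neg_log_marginal(params):
    """Negative log marginal likelihood of the observed group means."""
    mu, log_tau2 = params                    # tau^2 parameterized on the log scale
    v = np.exp(log_tau2) + sigma2 / n        # marginal variance tau^2 + sigma^2/n_j
    return 0.5 * np.sum(np.log(2 * np.pi * v) + (ybar - mu) ** 2 / v)

res = minimize(neg_log_marginal, x0=[5.0, -1.0], method="Nelder-Mead")
mu_hat, tau2_hat = res.x[0], float(np.exp(res.x[1]))
print(mu_hat, tau2_hat)   # ML-II: mu ~ 5.2, tau^2 ~ 0.06
```

Note the ML-II estimate of tau^2 here (about 0.06) is smaller than the method-of-moments value 0.125 computed for the same data later in this lesson: with equal group sizes the MLE implicitly divides squared deviations by J rather than J - 1, so the two empirical Bayes recipes need not agree, especially for small J.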
Practical advice and diagnostics
In practice: monitor the convergence diagnostics from [MCMC] (R-hat, effective sample size, divergent transitions), run posterior predictive checks, and perform sensitivity analysis on the hyperpriors for scale parameters before trusting the shrinkage estimates.
Summary
This section extended the core shrinkage idea to multilevel regressions with random effects, explained priors on covariance matrices, and provided practical sampling and parameterization advice. Non-centered parameterization and proper hyperpriors are essential tools for efficient, robust multilevel Bayesian inference.
Store-level vs chain-level parameter sharing (concrete business example)
Imagine an online retailer with many stores (or merchants) belonging to a chain. We want to estimate store conversion rates and predict revenue. Options: no pooling (fit each store separately, which is noisy for small stores), complete pooling (one chain-wide rate, which ignores real heterogeneity), or partial pooling (a hierarchical model that adaptively borrows strength across stores).
Concrete Beta-Binomial numeric example
Suppose store A has $y_A = 2$ conversions in $n_A = 20$ visits, and a much larger store B has, say, $y_B = 100$ in $n_B = 1000$. With hyperparameters $\alpha = 5$, $\beta = 95$ (prior mean 0.05), the posterior means are

$$\hat\theta_A = \frac{5 + 2}{100 + 20} = \frac{7}{120} \approx 0.058, \qquad \hat\theta_B = \frac{5 + 100}{100 + 1000} = \frac{105}{1100} \approx 0.095.$$
This demonstrates adaptive pooling: big-store estimates stay near their data; small-store estimates move toward the chain-level prior.
Empirical Bayes in large-scale problems
For thousands of stores one may use empirical Bayes to estimate hyperparameters quickly via marginal likelihood. A common pipeline: compute per-store sufficient statistics (e.g. $y_j, n_j$); maximize the marginal likelihood over the hyperparameters; plug the estimates into the conjugate posterior formulas to get per-store shrinkage estimates; and optionally validate or refine with full MCMC on a subsample.
This ML-II approach scales because it reduces high-dimensional posterior inference to a moderate-dimensional optimization; it often suffices for point predictions in industry systems. But be mindful: it ignores hyperparameter uncertainty, which can matter for decision-making, particularly when the number of groups is small.
Real-world use cases
Typical applications include store or merchant conversion and revenue estimation, hospital outcome rates, classroom effects in education, multi-site A/B testing, and small-area estimation in official statistics.
Downstream tasks enabled
Calibrated store-level posteriors feed posterior predictive forecasts for existing and brand-new stores, comparisons between stores with honest uncertainty, and decisions that would be overconfident under naive per-store estimates.
Limitations and modeling caveats
Key caveats: exchangeability can fail when groups differ systematically by observed covariates (include those covariates in the model); empirical Bayes understates hyperparameter uncertainty; and overly informative hyperpriors on scale parameters can force unrealistic shrinkage.
Computational practice and modern tooling
Modern probabilistic programming (Stan, PyMC, Turing, etc.) makes hierarchical Bayes accessible. Use non-centered parameterizations, LKJ priors for correlations, and adapt step sizes/warmups in HMC. For very large problems consider variational inference or empirical Bayes for speed; validate approximations against full MCMC on subsamples.
Summary
Hierarchical Bayesian models operationalize the balance between pooling and group-specific estimation. In practice, they are the default approach for multi-group inference when you want improved estimates, calibrated uncertainty, and principled borrowing of strength across related units.
Store A observed 2 purchases out of 20 visits. Chain-level prior parameters are estimated as alpha=5, beta=95 (prior mean 0.05). Compute the posterior mean and 95% credible interval for store A's conversion rate.
Model: y ~ Binomial(n, theta), theta ~ Beta(alpha, beta). Here y=2, n=20, alpha=5, beta=95.
Posterior is Beta(alpha + y, beta + n - y) = Beta(5+2, 95+20-2) = Beta(7, 113).
Posterior mean = (7) / (7+113) = 7/120 ≈ 0.058333.
Compute the 95% credible interval using Beta quantiles: lower = Q_{0.025}(Beta(7,113)), upper = Q_{0.975}(Beta(7,113)). Numerically, via the beta inverse CDF in any stats package: lower ≈ 0.024, upper ≈ 0.108.
Interpretation: raw rate 2/20 = 0.10; posterior mean = 0.058 (shrunk toward prior mean 0.05); credible interval reflects uncertainty from both data and prior.
Insight: This example shows how conjugacy (Beta-Binomial from [Conjugate Priors]) yields a closed-form posterior that shrinks a noisy small-sample estimate toward the chain-level prior mean.
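The posterior mean and credible interval above can be checked with scipy's beta distribution:

```python
from scipy.stats import beta

# Posterior for store A: Beta(alpha + y, beta + n - y) = Beta(7, 113).
posterior = beta(7, 113)

print(posterior.mean())                   # 7/120 ~ 0.0583
lo, hi = posterior.ppf([0.025, 0.975])    # equal-tailed 95% credible interval
print(lo, hi)                             # close to the (0.024, 0.108) quoted above
```

The raw rate 0.10 sits inside the interval but near its upper end, which is the quantitative version of the statement that the data only weakly contradict the chain-level prior.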
We observe 3 group means: ybar = (9, 11, 10), with each group having n_j=5 observations and known observation variance sigma^2=4. Use method-of-moments (a simple empirical Bayes approach) to estimate mu and tau^2, then compute the posterior mean for group 1.
Model: bar_y_j ~ N(mu, tau^2 + sigma^2/n_j). Here sigma^2/n_j = 4/5 = 0.8 for each group.
Estimate mu by the sample mean: mu_hat = (9+11+10)/3 = 10.
Compute sample variance S^2 of the bar_y: mean 10, deviations (-1,1,0), S^2 = (1+1+0)/(3-1) = 1.
Estimate tau^2 by subtracting average sampling variance: tau2_hat = max(0, S^2 - 0.8) = max(0, 0.2) = 0.2.
Compute shrinkage weight for group 1: kappa = n_j * tau2_hat / (n_j * tau2_hat + sigma^2) = 5*0.2 / (5*0.2 + 4) = 1 / (1 + 4) = 0.2.
Posterior mean for theta_1: (1 - kappa)*mu_hat + kappa*bar_y_1 = 0.8*10 + 0.2*9 = 8 + 1.8 = 9.8.
Insight: This example shows empirical Bayes (method-of-moments) producing hyperparameter estimates and the resulting shrinkage: the group mean 9 is pulled toward the pooled mean 10 by weight 0.2. It also shows numeric marginalization: bar_y marginal variance equals between-group plus sampling variance.
We have J=4 stores, each with n_j observations of sales y_{ij} and a predictor x_{ij} (standardized). Model: y_{ij} ~ N(alpha_j + beta * x_{ij}, sigma^2), alpha_j ~ N(mu_alpha, tau_alpha^2). Assume beta=2 (known), sigma^2=1 (known). Observed group means: bar_y = (5.2, 4.8, 6.1, 5.5), each with n_j=10, and pooled predictor mean is zero. Use non-centered parametrization to compute posterior samples conceptually and compute the posterior mean for alpha_3 using mu_alpha prior mean 5 and prior tau_alpha ~ HalfNormal(0,1) with empirical Bayes tau_alpha_hat computed via method-of-moments.
Given beta and sigma known and x standardized with pooled mean zero, the sample group means bar_y_j ≈ alpha_j + beta*bar_x_j (but bar_x_j ≈ 0), so bar_y_j ≈ alpha_j.
Compute pooled mean mu_hat = mean(bar_y) = (5.2+4.8+6.1+5.5)/4 = 5.4.
Sample variance of bar_y: deviations are (-0.2,-0.6,0.7,0.1), squared sum = 0.04+0.36+0.49+0.01=0.9, so S^2 = 0.9/(4-1)=0.3.
Sampling variance per bar_y = sigma^2/n_j = 1/10 = 0.1. Method-of-moments gives tau_alpha_hat = max(0, S^2 - 0.1) = max(0, 0.2) = 0.2.
Non-centered parametrization: alpha_j = mu_alpha + tau_alpha_hat * eta_j, eta_j ~ N(0,1). Posterior for eta_j given bar_y_j is equivalent to the normal-normal conjugate case with observed mean bar_y_j and known variances; shrinkage weight kappa = n_j * tau2 / (n_j * tau2 + sigma^2) = 10*0.2/(10*0.2 + 1) = 2/(2+1) = 2/3 ≈ 0.6667.
Posterior mean for alpha_3: (1 - kappa)*mu_hat + kappa*bar_y_3 = 0.3333*5.4 + 0.6667*6.1 = 1.8 + 4.0667 = 5.8667.
Insight: This worked example combines varying-intercept regression with non-centered parametrization logic. It shows how to compute empirical Bayes scale, then express alpha_j via a standardized eta_j for better MCMC conditioning; it also computes numeric posterior mean demonstrating substantial shrinkage toward pooled mean.
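The whole calculation in this example chains together in a few lines; a sketch reproducing the numbers (helper-free, using the example's own data):

```python
import numpy as np

ybar = np.array([5.2, 4.8, 6.1, 5.5])   # group means from the example
n_j, sigma2 = 10, 1.0

mu_hat = ybar.mean()                                   # 5.4
tau2_hat = max(0.0, ybar.var(ddof=1) - sigma2 / n_j)   # 0.3 - 0.1 = 0.2
kappa = n_j * tau2_hat / (n_j * tau2_hat + sigma2)     # 2/3
alpha3 = (1 - kappa) * mu_hat + kappa * ybar[2]        # ~5.867

print(mu_hat, tau2_hat, kappa, alpha3)
```

In a real sampler you would carry `tau_alpha` as a parameter rather than plugging in `tau2_hat`, but the empirical Bayes plug-in shown here is exactly the fast approximation the example describes.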
Hierarchical Bayesian models encode exchangeability across groups via priors on group-level parameters (hyperpriors), enabling principled partial pooling.
In conjugate cases (e.g., Normal-Normal, Beta-Binomial) closed-form posteriors yield explicit shrinkage weights that blend group data with group-level means.
Empirical Bayes (ML-II) estimates hyperparameters by marginalizing over group-level parameters and optimizing the marginal likelihood, providing fast approximate pooling but underestimating hyperparameter uncertainty.
Multilevel regression generalizes to varying intercepts and slopes; priors on covariance matrices (separation strategy with LKJ) and non-centered parameterizations are essential for robust MCMC.
Shrinkage reduces mean-squared error relative to no-pooling and connects to frequentist shrinkage estimators like James–Stein; however, model specification (exchangeability, covariates) is crucial to avoid bias.
Posterior predictive checks and sensitivity to hyperpriors should guide modeling choices—use weakly informative priors on scales to prevent over-shrinkage.
Practical pipelines often use empirical Bayes for scale and initialization and full Bayesian sampling for final uncertainty quantification when feasible.
Treating empirical Bayes point estimates of hyperparameters as if they were exact: this underestimates posterior uncertainty and can lead to overconfident predictions, especially with few groups.
Using a centered parameterization when group-level variances are poorly identified (small n_j, weakly identified tau): this leads to funnel-shaped posteriors and slow MCMC mixing—use the non-centered parametrization instead.
Assuming exchangeability without checking covariates: if groups differ systematically by observed covariates, failing to include those covariates causes biased shrinkage toward an inappropriate common mean.
Overly-informative hyperpriors on scales (tau) that force unrealistic shrinkage: prefer weakly informative priors (half-normal, half-Cauchy) and perform sensitivity analysis.
Easy: Beta-Binomial partial pooling. You observe two stores: Store 1 has y1=3 successes out of n1=30, Store 2 has y2=20 successes out of n2=200. The chain-level prior is Beta(alpha=6, beta=114) (prior mean 0.05). Compute the posterior mean for each store's theta and compare to raw rates.
Hint: Posterior for each theta_j is Beta(alpha + y_j, beta + n_j - y_j). Posterior mean = (alpha+y)/(alpha+beta+n).
Store 1: posterior Beta(6+3,114+30-3)=Beta(9,141). Posterior mean = 9/(9+141)=9/150=0.06. Raw rate = 3/30=0.10. Store 2: posterior Beta(6+20,114+200-20)=Beta(26,294). Posterior mean = 26/(26+294)=26/320=0.08125. Raw rate = 20/200=0.10. Both are shrunk toward prior mean 0.05; the small store shrinks more (0.06) than the large store (0.08125).
Medium: Normal-Normal with unequal n_j. You observe bar_y = (12, 8) with n = (10, 2) and known sigma^2=9. Assume prior theta_j ~ N(mu=10, tau^2=4). Compute posterior means and posterior variances for theta_1 and theta_2 using the shrinkage formula.
Hint: Posterior precision = 1/tau^2 + n_j/sigma^2; posterior mean = precision^{-1} * ( (1/tau^2)*mu + (n_j/sigma^2)*bar_y_j ).
For j=1: n1=10, bar_y1=12. Data precision = n1/sigma^2 = 10/9 ≈ 1.1111. Prior precision = 1/tau^2 = 1/4 = 0.25. Posterior precision = 1.3611, posterior variance = 1/1.3611 ≈ 0.7347. Posterior mean = (0.25*10 + 1.1111*12)/1.3611 = (2.5 + 13.3333)/1.3611 ≈ 11.633. For j=2: n2=2, bar_y2=8. Data precision = 2/9 ≈ 0.2222. Posterior precision = 0.25 + 0.2222 = 0.4722, posterior variance = 1/0.4722 ≈ 2.118. Posterior mean = (0.25*10 + 0.2222*8)/0.4722 = (2.5 + 1.7778)/0.4722 ≈ 9.059. Interpretation: the small-sample group 2 is shrunk much closer to mu=10 than group 1.
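To check the arithmetic, a quick precision-weighting sketch (helper name assumed):

```python
def normal_posterior(ybar, n, mu, sigma2, tau2):
    """Precision-weighted posterior mean and variance for theta_j."""
    prec = 1 / tau2 + n / sigma2
    mean = (mu / tau2 + n * ybar / sigma2) / prec
    return mean, 1 / prec

m1, v1 = normal_posterior(12.0, 10, mu=10.0, sigma2=9.0, tau2=4.0)
m2, v2 = normal_posterior(8.0, 2, mu=10.0, sigma2=9.0, tau2=4.0)
print(m1, v1)   # ~11.63, ~0.735
print(m2, v2)   # ~9.06, ~2.12
```

Note the posterior variances tell the same story as the means: group 2's posterior is both closer to the prior mean and wider, reflecting its weaker data.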
Hard: Empirical Bayes vs Full Bayes on between-group variance. You have J=5 groups with bar_y = (4.8, 5.1, 4.9, 5.0, 6.2), n_j=5 each, and known sigma^2=1. (a) Compute the empirical Bayes estimate of tau^2 by maximizing the marginal likelihood numerically (or use method-of-moments approximation). (b) Suppose you put a prior tau ~ HalfNormal(0,1) and perform full Bayes: explain qualitatively how the posterior distribution for tau will differ from the point estimate, and how that affects uncertainty of individual theta_j posteriors.
Hint: For (a) use marginal variance across bar_y minus sampling variance. For (b) recall that full Bayes integrates over tau, so posterior variance for theta_j reflects uncertainty about tau in addition to within-group noise.
(a) Sample mean mu_hat = (4.8+5.1+4.9+5.0+6.2)/5 = 5.2. Sample variance S^2 = ( (4.8-5.2)^2 + (5.1-5.2)^2 + (4.9-5.2)^2 + (5.0-5.2)^2 + (6.2-5.2)^2 ) / (5-1) = (0.16+0.01+0.09+0.04+1.0)/4 = 1.30/4 = 0.325. Sampling variance sigma^2/n = 1/5 = 0.2. Method-of-moments tau^2_hat = max(0, S^2 - 0.2) = 0.125. (b) With full Bayes and prior tau ~ HalfNormal(0,1), the posterior for tau will be a distribution concentrated near 0.125 but with nonzero mass, reflecting data uncertainty. The resulting posterior for each theta_j is a mixture over tau, so posterior variances for theta_j will be larger than those from empirical Bayes which conditions on the point estimate; in particular, uncertainty from tau propagates into wider credible intervals for theta_j, especially for small n_j and when J is small.
Looking back: this lesson builds directly on [Conjugate Priors] where Beta-Binomial and Normal-Normal conjugacy were introduced; those closed-form updates are the building blocks for analytic hierarchical calculations and for deriving marginal likelihoods used in empirical Bayes. It also assumes familiarity with [MCMC]: hierarchical models often require sampling from joint posteriors over many parameters, where non-centered parameterizations (covered here) and diagnostics from [MCMC] are essential. Looking forward: mastering hierarchical Bayes enables hierarchical generalized linear models, Gaussian processes with hierarchical kernels, Bayesian causal hierarchical models (multi-site A/B testing), and small-area estimation in official statistics. Downstream techniques that require this material include: multilevel survival analysis, hierarchical state-space models, and modern probabilistic programming best practices (e.g., priors for covariance matrices using LKJ). Empirical Bayes connects to frequentist shrinkage techniques (James–Stein, ridge regression) and to scalable approximate inference (variational Bayes, MAP/ML-II pipelines) used in large-scale industry systems.