Partial pooling, hyperpriors, shrinkage estimators. Multilevel regression. Empirical Bayes as approximation. Store-level vs chain-level parameter sharing.
When you have many related groups with small per-group data (stores, hospitals, classrooms), hierarchical Bayesian models give you principled 'partial pooling' that improves estimates and quantifies uncertainty—without forcing all groups to be identical.
Hierarchical Bayesian models place priors on group-level parameters (hyperpriors) to achieve partial pooling and shrinkage; empirical Bayes offers a fast approximation by estimating hyperparameters from the marginal likelihood.
Definition and motivation
A hierarchical Bayesian model (also called a multilevel model) is a probabilistic model in which parameters themselves have probability distributions whose parameters (hyperparameters) are also given distributions. Hierarchical structure encodes the idea of exchangeability and partial pooling: groups are similar but not identical. Rather than estimating each group's parameter independently (no pooling) or assuming all groups share a single parameter (complete pooling), hierarchical models let the data determine how much information to borrow across groups.
Why care: partial pooling reduces estimation variance while controlling bias. Consider stores estimating conversion rates: a tiny store with 2 purchases in 20 visits would have an unreliable raw estimate of $2/20 = 0.10$. A hierarchical model borrows strength from the chain-level distribution to pull (shrink) that estimate toward the chain mean, often giving dramatically better predictive performance.
Core formalism (two-level example)
The canonical two-level hierarchical model for group-specific parameters and observations is:

$$y_{ij} \mid \theta_j \sim p(y \mid \theta_j), \qquad \theta_j \mid \phi \sim p(\theta \mid \phi), \qquad \phi \sim p(\phi).$$

Here $j$ indexes groups (stores) and $i$ indexes observations in group $j$. Exchangeability across groups means that, before seeing data, group labels don't matter and we treat the $\theta_j$ as iid draws from $p(\theta \mid \phi)$.
Concrete numeric toy: Beta-Binomial intuition
If each store $j$ has conversion counts $y_j$ out of $n_j$ visits and we place $\theta_j \sim \mathrm{Beta}(\alpha, \beta)$ with hyperparameters $(\alpha, \beta)$, then posterior inference for each $\theta_j$ is conjugate:

$$\theta_j \mid y_j \sim \mathrm{Beta}(\alpha + y_j,\; \beta + n_j - y_j).$$

Numeric example: a small store had $y = 2$ successes in $n = 20$ visits. If chain-level hyperparameters are $\alpha = 5$, $\beta = 95$ (prior mean 0.05, small variance), the posterior mean is

$$\frac{\alpha + y}{\alpha + \beta + n} = \frac{7}{120} \approx 0.058,$$

which is shrunk toward 0.05 from the raw rate $2/20 = 0.10$.
Contrast with complete pooling: estimating a single $\theta$ for all stores would ignore between-store variability; no pooling would give $\hat\theta_j = y_j / n_j$ with high variance for small $n_j$. The hierarchical model, by introducing hyperparameters and a hyperprior, adapts the amount of pooling to the data.
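The three pooling regimes can be compared directly in a few lines of numpy. A minimal sketch: the tiny-store counts (2 of 20) come from the example above, while the large store's counts (100 of 1000) are illustrative values chosen so both stores have the same raw rate.

```python
import numpy as np

def partial_pooling_means(y, n, alpha, beta):
    """Posterior means under the Beta(alpha, beta) -> Binomial hierarchy."""
    y, n = np.asarray(y, float), np.asarray(n, float)
    return (alpha + y) / (alpha + beta + n)

y = np.array([2, 100])    # conversions: tiny store, large store (large store assumed)
n = np.array([20, 1000])  # visits

no_pool = y / n                     # raw per-store rates: both 0.10
complete_pool = y.sum() / n.sum()   # one chain-wide rate
partial = partial_pooling_means(y, n, alpha=5, beta=95)  # prior mean 0.05

print(no_pool)        # [0.1  0.1]
print(complete_pool)  # 0.1
print(partial)        # tiny store is shrunk far more than the large one
```

Note how the two stores, despite identical raw rates, get different partially pooled estimates: the tiny store is pulled close to the chain prior mean while the large store stays near its data.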
Philosophy and exchangeability
In [Conjugate Priors] we learned how conjugacy gives closed-form posteriors for single-level models like Beta-Binomial and Normal-Normal. Hierarchical models extend this: conjugacy may allow analytic conditional posteriors for lower-level parameters given hyperparameters. But typically we integrate over hyperparameters (or sample them in MCMC) to get full posterior uncertainty.
Exchangeability justifies the iid assumption on the $\theta_j$: if groups are exchangeable, the joint prior over $(\theta_1, \dots, \theta_J)$ factorizes into iid draws from the same mixture. Partial pooling is the Bayesian implementation of the bias-variance tradeoff: the posterior for each $\theta_j$ is shrunk toward the common prior mean, with the extent of shrinkage determined by the relative sizes of within-group noise and between-group variability.
Key quantities to watch: the hyperparameters (e.g. the variance $\tau^2$ of the $\theta_j$ across groups) control shrinkage; the posterior predictive distribution integrates over both $\theta_j$ and the hyperparameters and is what you use for predicting new observations or new groups.
Summary sentence
Hierarchical Bayesian models let you model group heterogeneity while borrowing strength across groups via hyperpriors; this produces principled shrinkage estimators that often dominate non-hierarchical alternatives in prediction and mean squared error.
Normal-Normal hierarchical model (canonical analytic case)
A central worked analytic case is the Normal-Normal hierarchical model with known observation variance. This is where the familiar shrinkage formula arises in closed form and links directly to [Conjugate Priors]. Model:

$$\bar y_j \mid \theta_j \sim N(\theta_j,\; \sigma^2/n_j), \qquad \theta_j \sim N(\mu, \tau^2).$$

Let $\bar y_j$ be the group sample mean and $\sigma^2/n_j$ its sampling variance. Conditional on hyperparameters $(\mu, \tau^2)$ the posterior for $\theta_j$ is Gaussian (by conjugacy):

$$\theta_j \mid \bar y_j, \mu, \tau^2 \sim N\!\big(\hat\theta_j,\; (1/\tau^2 + n_j/\sigma^2)^{-1}\big),$$

where the shrinkage weight is

$$\kappa_j = \frac{n_j \tau^2}{n_j \tau^2 + \sigma^2}.$$

Solving the algebra shows the posterior mean can be written in the intuitively clearer form

$$\hat\theta_j = (1 - \kappa_j)\,\mu + \kappa_j\, \bar y_j.$$

Here $\kappa_j$ is the effective weight on the data vs the group mean. When $n_j \tau^2 \gg \sigma^2$, $\kappa_j \to 1$ and we rely on the data; when $\tau^2 = 0$ (no between-group variability), $\kappa_j = 0$ and every group collapses to the common mean (complete pooling).
Concrete numeric example (normal-normal)
Suppose $\sigma^2 = 4$ (observation SD 2), $\tau^2 = 1$ (between-group SD 1), $n_j = 5$ observations in group $j$, group sample mean $\bar y_j = 10$, and hyper-mean $\mu = 8$. Then

$$\kappa_j = \frac{5 \cdot 1}{5 \cdot 1 + 4} = \frac{5}{9} \approx 0.556,$$

so the posterior mean is

$$\hat\theta_j = \Big(1 - \tfrac{5}{9}\Big) \cdot 8 + \tfrac{5}{9} \cdot 10 = \frac{82}{9} \approx 9.11.$$

Compare: raw mean 10, pooled mean 8, shrunk estimate 9.11.
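The shrinkage formula is two lines of code. A minimal sketch, using illustrative values consistent with the 9.11 result above (assumed: sigma^2 = 4, tau^2 = 1, n_j = 5, group mean 10, hyper-mean 8):

```python
def shrinkage_mean(ybar, n, mu, sigma2, tau2):
    """Posterior mean of theta_j in the Normal-Normal model (known variances)."""
    kappa = n * tau2 / (n * tau2 + sigma2)  # weight on the group's own data
    return (1 - kappa) * mu + kappa * ybar, kappa

post_mean, kappa = shrinkage_mean(ybar=10.0, n=5, mu=8.0, sigma2=4.0, tau2=1.0)
print(kappa)      # ~0.556
print(post_mean)  # ~9.11
```

Varying `tau2` in this helper is a quick way to see the pooling continuum: `tau2=0` returns the pooled mean 8 exactly, and very large `tau2` returns the raw mean 10.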
Derivation sketch
Conditional posterior precision = prior precision + data precision. Prior precision $= 1/\tau^2$, data precision $= n_j/\sigma^2$. Thus the posterior mean is a precision-weighted average:

$$\hat\theta_j = \frac{(1/\tau^2)\,\mu + (n_j/\sigma^2)\,\bar y_j}{1/\tau^2 + n_j/\sigma^2}.$$

This is the canonical "shrinkage" formula. In [Conjugate Priors] you saw the Normal-Normal conjugacy; here it extends across groups with a hyperprior on $\mu$ and $\tau^2$ if desired.
James–Stein connection (brief, illuminating)
The James–Stein estimator is a (frequentist) shrinkage estimator for multivariate normal means that dominates the MLE in mean squared error in dimension $d \geq 3$. Hierarchical Bayes with an appropriate hyperprior produces shrinkage estimators with similar behavior. A heuristic connection: empirical Bayes or hierarchical Bayes produces estimators that shrink toward a common mean; in large-dimension limits the Bayes estimator approximates a form of James–Stein shrinkage.
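The domination result is easy to see empirically. A small simulation sketch (the true means, noise level, and dimension are all assumed for illustration): we draw one noisy observation per mean, apply the classic James–Stein formula shrinking toward zero, and compare mean squared errors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma2 = 50, 1.0
theta = rng.normal(0.0, 1.0, size=d)                   # true means (assumed)
y = theta + rng.normal(0.0, np.sqrt(sigma2), size=d)   # one noisy obs per mean

# James-Stein estimator shrinking toward 0 (requires d >= 3).
js = (1 - (d - 2) * sigma2 / np.sum(y**2)) * y

mse_mle = np.mean((y - theta) ** 2)   # MLE just uses y
mse_js = np.mean((js - theta) ** 2)
print(mse_mle, mse_js)                # JS has lower MSE on this draw
```

With the true means themselves drawn from a normal, the James–Stein shrink factor lands near the Bayes-optimal weight, which is exactly the heuristic connection described above.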
Empirical Bayes (ML-II) approximation
If $\mu$ and $\tau^2$ are unknown, one approach is empirical Bayes: estimate them by maximizing the marginal likelihood of the observed $\bar y_j$'s. Marginally,

$$\bar y_j \sim N\!\big(\mu,\; \tau^2 + \sigma^2/n_j\big).$$

So we can compute the (log) marginal likelihood and obtain MLEs $\hat\mu, \hat\tau^2$; plug them into the posterior mean formula above to get an empirical Bayes point estimate for each $\theta_j$. This ML-II step is often fast and connects to [MCMC] because you can use these estimates to initialize or inform MCMC priors.
Numeric example of empirical Bayes (method of moments)
Given observed $\bar y_1, \dots, \bar y_J$ with known $\sigma^2$ and $n_j$, method-of-moments estimates give

$$\hat\mu = \frac{1}{J} \sum_{j=1}^{J} \bar y_j.$$

Then estimate the between-group variance via

$$\hat\tau^2 = \max\!\left(0,\; S^2 - \frac{1}{J}\sum_j \frac{\sigma^2}{n_j}\right),$$

where $S^2$ is the sample variance of the $\bar y_j$. Concrete numbers: if we have three groups with $\bar y = (9, 11, 10)$, $n_j = 5$ each, and $\sigma^2 = 4$, then the sample variance of the $\bar y_j$ is $1$; the average sampling variance is $4/5 = 0.8$, so method-of-moments gives $\hat\tau^2 = 1 - 0.8 = 0.2$.
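The method-of-moments recipe translates directly into a small helper (function name assumed), reproducing the three-group numbers just given:

```python
import numpy as np

def mom_empirical_bayes(ybar, n, sigma2):
    """Method-of-moments estimates of (mu, tau^2) from group means."""
    ybar = np.asarray(ybar, float)
    mu_hat = ybar.mean()
    s2 = ybar.var(ddof=1)                            # sample variance of group means
    avg_sampling_var = np.mean(sigma2 / np.asarray(n, float))
    tau2_hat = max(0.0, s2 - avg_sampling_var)       # truncate at zero
    return mu_hat, tau2_hat

mu_hat, tau2_hat = mom_empirical_bayes([9, 11, 10], n=[5, 5, 5], sigma2=4.0)
print(mu_hat, tau2_hat)   # mu_hat = 10.0, tau2_hat ~ 0.2
```

The `max(0, ...)` truncation matters in practice: with few groups the raw moment estimate of tau^2 is often negative, and truncating to zero corresponds to complete pooling.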
Summary
This section gave the analytic backbone: conjugate Normal-Normal hierarchies yield explicit shrinkage weights $\kappa_j$; empirical Bayes estimates hyperparameters by marginalizing the lower-level parameters, producing fast approximate shrinkage estimators. In practice, for nonconjugate or complex multilevel regressions we use MCMC (see [MCMC]) with either centered or non-centered parameterizations to ensure efficient sampling.
Multilevel regression: varying intercepts and slopes
Hierarchical modeling becomes especially powerful when combined with regression: we let intercepts and slopes vary by group. A simple varying-intercept, varying-slope model for outcome $y_{ij}$ with predictor vector $x_{ij}$ is:

$$y_{ij} \sim N\!\big(\alpha_j + x_{ij}^\top \beta_j,\; \sigma^2\big), \qquad (\alpha_j, \beta_j) \sim N\!\big((\mu_\alpha, \mu_\beta),\; \Sigma\big),$$

where $\alpha_j$ is the group-specific intercept, $\beta_j$ the group-specific slope vector, and $\Sigma$ is a covariance matrix encoding how intercepts and slopes vary and covary across groups. This fully multilevel regression allows partial pooling in both intercepts and slopes, and allows complex cross-group shrinkage: groups with scarce data will have their $\alpha_j$ and $\beta_j$ pulled toward the population means $(\mu_\alpha, \mu_\beta)$.
Priors on covariance matrices: separation strategy and LKJ
Priors on $\Sigma$ are critical. A common strategy (the "separation" prior) decomposes

$$\Sigma = \mathrm{diag}(\sigma)\; \Omega\; \mathrm{diag}(\sigma),$$

where $\sigma$ is the vector of standard deviations for each random effect and $\Omega$ a correlation matrix. Priors: put half-Cauchy or half-normal priors on the components of $\sigma$ (weakly informative), and an LKJ prior on $\Omega$ (Lewandowski–Kurowicka–Joe) with concentration parameter $\eta$ favoring the identity matrix when $\eta > 1$. Example numeric choices: $\sigma_k \sim \mathrm{HalfNormal}(0, 1)$, $\Omega \sim \mathrm{LKJ}(\eta = 2)$.
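The separation decomposition itself is just matrix algebra. A minimal sketch for a 2x2 random-effects covariance (the standard deviations and correlation are illustrative values, not from the text):

```python
import numpy as np

# Separation decomposition Sigma = diag(s) @ Omega @ diag(s).
s = np.array([0.5, 0.2])          # SDs of intercepts and slopes (assumed)
Omega = np.array([[1.0, 0.3],
                  [0.3, 1.0]])    # correlation matrix (assumed rho = 0.3)

Sigma = np.diag(s) @ Omega @ np.diag(s)
print(Sigma)
# Diagonal recovers the variances s_k^2; off-diagonal = rho * s_1 * s_2.
```

Separating scale from correlation is what lets you place interpretable priors on each piece: half-normals on `s`, LKJ on `Omega`.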
Centered vs non-centered parameterizations (important for MCMC efficiency)
In [MCMC] we learned that parameterization affects convergence and mixing. For hierarchical models, two common parameterizations are the centered form, $\theta_j \sim N(\mu, \tau^2)$ directly, and the non-centered form, $\theta_j = \mu + \tau \eta_j$ with $\eta_j \sim N(0, 1)$.
Why it matters: when the data strongly inform each group's parameter, the centered parameterization mixes well; when data per group are small (so the prior dominates and $\tau$ is weakly identified), the non-centered form often mixes much better. A practical rule: try both or use adaptive diagnostics. Concretely: with abundant data per group, centered is fine; with small $n_j$, non-centered reduces the funnel-shaped posterior geometry and avoids slow MCMC.
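The two parameterizations describe the same prior distribution; only the geometry the sampler sees differs. A quick forward-simulation sketch (mu, tau, and the sample size are illustrative) confirms the distributional equivalence:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, tau, J = 5.0, 0.5, 100_000   # assumed illustrative values

# Centered: draw theta_j ~ N(mu, tau^2) directly.
theta_centered = rng.normal(mu, tau, size=J)

# Non-centered: theta_j = mu + tau * eta_j with eta_j ~ N(0, 1).
eta = rng.normal(0.0, 1.0, size=J)
theta_noncentered = mu + tau * eta

# Same distribution, different parameter geometry for the sampler.
print(theta_centered.mean(), theta_noncentered.mean())  # both ~5.0
print(theta_centered.std(), theta_noncentered.std())    # both ~0.5
```

In the non-centered form the sampler explores the standardized `eta_j`, whose prior geometry does not depend on `tau`; that independence is what breaks the funnel.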
Posterior predictive checks and partial pooling
The full Bayesian workflow uses posterior predictive distributions to judge fit and to predict. For a given group $j$ the posterior predictive for a new observation $\tilde y_j$ is

$$p(\tilde y_j \mid y) = \int p(\tilde y_j \mid \theta_j)\; p(\theta_j, \phi \mid y)\; d\theta_j\, d\phi.$$

In practice we approximate this by MCMC draws: for each posterior draw of $\theta_j$ and $\phi$, simulate $\tilde y_j \sim p(\tilde y \mid \theta_j)$. Posterior predictive checks diagnose over/under-pooling: if the pooled model systematically mispredicts heterogeneity, consider richer group-level structure (e.g., allow covariates to explain group differences).
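For the conjugate Normal-Normal case the draw-then-simulate recipe can be checked end to end. A sketch using illustrative numbers (sigma^2 = 4, tau^2 = 1, n_j = 5, group mean 10, hyper-mean 8, all assumed): the predictive mean should match the shrunken posterior mean, and the predictive variance should be the posterior variance plus the observation variance.

```python
import numpy as np

rng = np.random.default_rng(2)

sigma2, tau2, n_j, mu, ybar = 4.0, 1.0, 5, 8.0, 10.0   # assumed values
post_prec = 1 / tau2 + n_j / sigma2
post_var = 1 / post_prec
post_mean = (mu / tau2 + n_j * ybar / sigma2) / post_prec

# Posterior predictive for one new observation: draw theta, then y_new.
draws = 200_000
theta = rng.normal(post_mean, np.sqrt(post_var), size=draws)
y_new = rng.normal(theta, np.sqrt(sigma2))

print(y_new.mean())   # ~ post_mean (~9.11)
print(y_new.var())    # ~ post_var + sigma2
```

This is exactly the "for each draw, simulate" loop described above, vectorized over draws; with MCMC output you would substitute posterior samples of theta (and of the hyperparameters) for the exact conjugate draws.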
Empirical Bayes vs full Bayes: tradeoffs
In empirical Bayes (ML-II) we estimate hyperparameters $\phi$ by marginal likelihood and then condition on $\hat\phi$. This often gives good point estimates and is fast for large hierarchical problems. However, it underestimates hyperparameter uncertainty: it treats $\hat\phi$ as known, which can understate posterior variance for predictions, especially with few groups. Full Bayes places hyperpriors on $\phi$ and samples it in MCMC, which captures uncertainty but costs more computation.
Numeric demonstration of marginal likelihood (for normal-normal)
Recall the marginal of $\bar y_j$ is $N(\mu, \tau^2 + \sigma^2/n_j)$. For known $\sigma^2$ the log marginal likelihood is

$$\ell(\mu, \tau^2) = -\frac{1}{2} \sum_{j=1}^{J} \left[ \log\!\big(2\pi(\tau^2 + \sigma^2/n_j)\big) + \frac{(\bar y_j - \mu)^2}{\tau^2 + \sigma^2/n_j} \right].$$

For example, with $J = 5$ groups, $\bar y = (4.8, 5.1, 4.9, 5.0, 6.2)$, $n_j = 5$, and $\sigma^2 = 1$, we numerically maximize over $(\mu, \tau^2)$ to get empirical Bayes estimates used in the shrinkage formula.
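The maximization is a two-parameter optimization. A minimal scipy sketch using the five group means that also appear in the exercises (optimizing tau^2 on the log scale to keep it positive):

```python
import numpy as np
from scipy.optimize import minimize

ybar = np.array([4.8, 5.1, 4.9, 5.0, 6.2])   # group means
n, sigma2 = 5, 1.0

def neg_log_marginal(params):
    """Negative log marginal likelihood of the observed group means."""
    mu, log_tau2 = params                    # tau^2 parameterized on the log scale
    v = np.exp(log_tau2) + sigma2 / n        # marginal variance tau^2 + sigma^2/n_j
    return 0.5 * np.sum(np.log(2 * np.pi * v) + (ybar - mu) ** 2 / v)

res = minimize(neg_log_marginal, x0=[5.0, -1.0], method="Nelder-Mead")
mu_hat, tau2_hat = res.x[0], float(np.exp(res.x[1]))
print(mu_hat, tau2_hat)   # ML-II: mu ~ 5.2, tau^2 ~ 0.06
```

Note the ML-II estimate of tau^2 here (about 0.06) is smaller than the method-of-moments value 0.125 computed for the same data later in this lesson: with equal group sizes the MLE implicitly divides squared deviations by J rather than J - 1, so the two empirical Bayes recipes need not agree, especially for small J.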
Practical advice and diagnostics
In practice: monitor the convergence diagnostics from [MCMC] (R-hat, effective sample size, divergent transitions), run posterior predictive checks, and perform sensitivity analysis on the hyperpriors for scale parameters before trusting the shrinkage estimates.
Summary
This section extended the core shrinkage idea to multilevel regressions with random effects, explained priors on covariance matrices, and provided practical sampling and parameterization advice. Non-centered parameterization and proper hyperpriors are essential tools for efficient, robust multilevel Bayesian inference.
Store-level vs chain-level parameter sharing (concrete business example)
Imagine an online retailer with many stores (or merchants) belonging to a chain. We want to estimate store conversion rates and predict revenue. Options: no pooling (fit each store separately, which is noisy for small stores), complete pooling (one chain-wide rate, which ignores real heterogeneity), or partial pooling (a hierarchical model that adaptively borrows strength across stores).
Concrete Beta-Binomial numeric example
Suppose store A has $y_A = 2$ conversions in $n_A = 20$ visits, and a much larger store B has, say, $y_B = 100$ in $n_B = 1000$. With hyperparameters $\alpha = 5$, $\beta = 95$ (prior mean 0.05), the posterior means are

$$\hat\theta_A = \frac{5 + 2}{100 + 20} = \frac{7}{120} \approx 0.058, \qquad \hat\theta_B = \frac{5 + 100}{100 + 1000} = \frac{105}{1100} \approx 0.095.$$
This demonstrates adaptive pooling: big-store estimates stay near their data; small-store estimates move toward the chain-level prior.
Empirical Bayes in large-scale problems
For thousands of stores one may use empirical Bayes to estimate hyperparameters quickly via marginal likelihood. A common pipeline: compute per-store sufficient statistics (e.g. $y_j, n_j$); maximize the marginal likelihood over the hyperparameters; plug the estimates into the conjugate posterior formulas to get per-store shrinkage estimates; and optionally validate or refine with full MCMC on a subsample.
This ML-II approach scales because it reduces high-dimensional posterior inference to a moderate-dimensional optimization; it often suffices for point predictions in industry systems. But be mindful: it ignores hyperparameter uncertainty, which can matter for decision-making, particularly when the number of groups is small.
Real-world use cases
Typical applications include store or merchant conversion and revenue estimation, hospital outcome rates, classroom effects in education, multi-site A/B testing, and small-area estimation in official statistics.
Downstream tasks enabled
Calibrated store-level posteriors feed posterior predictive forecasts for existing and brand-new stores, comparisons between stores with honest uncertainty, and decisions that would be overconfident under naive per-store estimates.
Limitations and modeling caveats
Key caveats: exchangeability can fail when groups differ systematically by observed covariates (include those covariates in the model); empirical Bayes understates hyperparameter uncertainty; and overly informative hyperpriors on scale parameters can force unrealistic shrinkage.
Computational practice and modern tooling
Modern probabilistic programming (Stan, PyMC, Turing, etc.) makes hierarchical Bayes accessible. Use non-centered parameterizations, LKJ priors for correlations, and adapt step sizes/warmups in HMC. For very large problems consider variational inference or empirical Bayes for speed; validate approximations against full MCMC on subsamples.
Summary
Hierarchical Bayesian models operationalize the balance between pooling and group-specific estimation. In practice, they are the default approach for multi-group inference when you want improved estimates, calibrated uncertainty, and principled borrowing of strength across related units.
Store A observed 2 purchases out of 20 visits. Chain-level prior parameters are estimated as alpha=5, beta=95 (prior mean 0.05). Compute the posterior mean and 95% credible interval for store A's conversion rate.
Model: y ~ Binomial(n, theta), theta ~ Beta(alpha, beta). Here y=2, n=20, alpha=5, beta=95.
Posterior is Beta(alpha + y, beta + n - y) = Beta(5+2, 95+20-2) = Beta(7, 113).
Posterior mean = (7) / (7+113) = 7/120 ≈ 0.058333.
Compute the 95% credible interval using Beta quantiles: lower = Q_{0.025}(Beta(7,113)), upper = Q_{0.975}(Beta(7,113)). Numerically, via the beta inverse CDF in any stats package: lower ≈ 0.024, upper ≈ 0.108.
Interpretation: raw rate 2/20 = 0.10; posterior mean = 0.058 (shrunk toward prior mean 0.05); credible interval reflects uncertainty from both data and prior.
Insight: This example shows how conjugacy (Beta-Binomial from [Conjugate Priors]) yields a closed-form posterior that shrinks a noisy small-sample estimate toward the chain-level prior mean.
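The posterior mean and credible interval above can be checked with scipy's beta distribution:

```python
from scipy.stats import beta

# Posterior for store A: Beta(alpha + y, beta + n - y) = Beta(7, 113).
posterior = beta(7, 113)

print(posterior.mean())                   # 7/120 ~ 0.0583
lo, hi = posterior.ppf([0.025, 0.975])    # equal-tailed 95% credible interval
print(lo, hi)                             # close to the (0.024, 0.108) quoted above
```

The raw rate 0.10 sits inside the interval but near its upper end, which is the quantitative version of the statement that the data only weakly contradict the chain-level prior.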
We observe 3 group means: ybar = (9, 11, 10), with each group having n_j=5 observations and known observation variance sigma^2=4. Use method-of-moments (a simple empirical Bayes approach) to estimate mu and tau^2, then compute the posterior mean for group 1.
Model: bar_y_j ~ N(mu, tau^2 + sigma^2/n_j). Here sigma^2/n_j = 4/5 = 0.8 for each group.
Estimate mu by the sample mean: mu_hat = (9+11+10)/3 = 10.
Compute sample variance S^2 of the bar_y: mean 10, deviations (-1,1,0), S^2 = (1+1+0)/(3-1) = 1.
Estimate tau^2 by subtracting average sampling variance: tau2_hat = max(0, S^2 - 0.8) = max(0, 0.2) = 0.2.
Compute shrinkage weight for group 1: kappa = n_j * tau2_hat / (n_j * tau2_hat + sigma^2) = 5*0.2 / (5*0.2 + 4) = 1 / (1 + 4) = 0.2.
Posterior mean for theta_1: (1 - kappa)*mu_hat + kappa*bar_y_1 = 0.8*10 + 0.2*9 = 8 + 1.8 = 9.8.
Insight: This example shows empirical Bayes (method-of-moments) producing hyperparameter estimates and the resulting shrinkage: the group mean 9 is pulled toward the pooled mean 10 by weight 0.2. It also shows numeric marginalization: bar_y marginal variance equals between-group plus sampling variance.
We have J=4 stores, each with n_j observations of sales y_{ij} and a predictor x_{ij} (standardized). Model: y_{ij} ~ N(alpha_j + beta * x_{ij}, sigma^2), alpha_j ~ N(mu_alpha, tau_alpha^2). Assume beta=2 (known), sigma^2=1 (known). Observed group means: bar_y = (5.2, 4.8, 6.1, 5.5), each with n_j=10, and pooled predictor mean is zero. Use non-centered parametrization to compute posterior samples conceptually and compute the posterior mean for alpha_3 using mu_alpha prior mean 5 and prior tau_alpha ~ HalfNormal(0,1) with empirical Bayes tau_alpha_hat computed via method-of-moments.
Given beta and sigma known and x standardized with pooled mean zero, the sample group means bar_y_j ≈ alpha_j + beta*bar_x_j (but bar_x_j ≈ 0), so bar_y_j ≈ alpha_j.
Compute pooled mean mu_hat = mean(bar_y) = (5.2+4.8+6.1+5.5)/4 = 5.4.
Sample variance of bar_y: deviations are (-0.2,-0.6,0.7,0.1), squared sum = 0.04+0.36+0.49+0.01=0.9, so S^2 = 0.9/(4-1)=0.3.
Sampling variance per bar_y = sigma^2/n_j = 1/10 = 0.1. Method-of-moments gives tau_alpha_hat = max(0, S^2 - 0.1) = max(0, 0.2) = 0.2.
Non-centered parametrization: alpha_j = mu_alpha + tau_alpha_hat * eta_j, eta_j ~ N(0,1). Posterior for eta_j given bar_y_j is equivalent to the normal-normal conjugate case with observed mean bar_y_j and known variances; shrinkage weight kappa = n_j * tau2 / (n_j * tau2 + sigma^2) = 10*0.2/(10*0.2 + 1) = 2/(2+1) = 2/3 ≈ 0.6667.
Posterior mean for alpha_3: (1 - kappa)*mu_hat + kappa*bar_y_3 = 0.3333*5.4 + 0.6667*6.1 = 1.8 + 4.0667 = 5.8667.
Insight: This worked example combines varying-intercept regression with non-centered parametrization logic. It shows how to compute empirical Bayes scale, then express alpha_j via a standardized eta_j for better MCMC conditioning; it also computes numeric posterior mean demonstrating substantial shrinkage toward pooled mean.
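The whole calculation in this example chains together in a few lines; a sketch reproducing the numbers (helper-free, using the example's own data):

```python
import numpy as np

ybar = np.array([5.2, 4.8, 6.1, 5.5])   # group means from the example
n_j, sigma2 = 10, 1.0

mu_hat = ybar.mean()                                   # 5.4
tau2_hat = max(0.0, ybar.var(ddof=1) - sigma2 / n_j)   # 0.3 - 0.1 = 0.2
kappa = n_j * tau2_hat / (n_j * tau2_hat + sigma2)     # 2/3
alpha3 = (1 - kappa) * mu_hat + kappa * ybar[2]        # ~5.867

print(mu_hat, tau2_hat, kappa, alpha3)
```

In a real sampler you would carry `tau_alpha` as a parameter rather than plugging in `tau2_hat`, but the empirical Bayes plug-in shown here is exactly the fast approximation the example describes.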
Hierarchical Bayesian models encode exchangeability across groups via priors on group-level parameters (hyperpriors), enabling principled partial pooling.
In conjugate cases (e.g., Normal-Normal, Beta-Binomial) closed-form posteriors yield explicit shrinkage weights that blend group data with group-level means.
Empirical Bayes (ML-II) estimates hyperparameters by marginalizing over group-level parameters and optimizing the marginal likelihood, providing fast approximate pooling but underestimating hyperparameter uncertainty.
Multilevel regression generalizes to varying intercepts and slopes; priors on covariance matrices (separation strategy with LKJ) and non-centered parameterizations are essential for robust MCMC.
Shrinkage reduces mean-squared error relative to no-pooling and connects to frequentist shrinkage estimators like James–Stein; however, model specification (exchangeability, covariates) is crucial to avoid bias.
Posterior predictive checks and sensitivity to hyperpriors should guide modeling choices—use weakly informative priors on scales to prevent over-shrinkage.
Practical pipelines often use empirical Bayes for scale and initialization and full Bayesian sampling for final uncertainty quantification when feasible.
Treating empirical Bayes point estimates of hyperparameters as if they were exact: this underestimates posterior uncertainty and can lead to overconfident predictions, especially with few groups.
Using a centered parameterization when group-level variances are poorly identified (small n_j, weakly identified tau): this leads to funnel-shaped posteriors and slow MCMC mixing—use the non-centered parametrization instead.
Assuming exchangeability without checking covariates: if groups differ systematically by observed covariates, failing to include those covariates causes biased shrinkage toward an inappropriate common mean.
Overly-informative hyperpriors on scales (tau) that force unrealistic shrinkage: prefer weakly informative priors (half-normal, half-Cauchy) and perform sensitivity analysis.
Easy: Beta-Binomial partial pooling. You observe two stores: Store 1 has y1=3 successes out of n1=30, Store 2 has y2=20 successes out of n2=200. The chain-level prior is Beta(alpha=6, beta=114) (prior mean 0.05). Compute the posterior mean for each store's theta and compare to raw rates.
Hint: Posterior for each theta_j is Beta(alpha + y_j, beta + n_j - y_j). Posterior mean = (alpha+y)/(alpha+beta+n).
Store 1: posterior Beta(6+3,114+30-3)=Beta(9,141). Posterior mean = 9/(9+141)=9/150=0.06. Raw rate = 3/30=0.10. Store 2: posterior Beta(6+20,114+200-20)=Beta(26,294). Posterior mean = 26/(26+294)=26/320=0.08125. Raw rate = 20/200=0.10. Both are shrunk toward prior mean 0.05; the small store shrinks more (0.06) than the large store (0.08125).
Medium: Normal-Normal with unequal n_j. You observe bar_y = (12, 8) with n = (10, 2) and known sigma^2=9. Assume prior theta_j ~ N(mu=10, tau^2=4). Compute posterior means and posterior variances for theta_1 and theta_2 using the shrinkage formula.
Hint: Posterior precision = 1/tau^2 + n_j/sigma^2; posterior mean = precision^{-1} * ( (1/tau^2)*mu + (n_j/sigma^2)*bar_y_j ).
For j=1: n1=10, bar_y1=12. Data precision = n1/sigma^2 = 10/9 ≈ 1.1111. Prior precision = 1/tau^2 = 1/4 = 0.25. Posterior precision = 1.3611, posterior variance = 1/1.3611 ≈ 0.7347. Posterior mean = (0.25*10 + 1.1111*12)/1.3611 = (2.5 + 13.3333)/1.3611 ≈ 11.633. For j=2: n2=2, bar_y2=8. Data precision = 2/9 ≈ 0.2222. Posterior precision = 0.25 + 0.2222 = 0.4722, posterior variance = 1/0.4722 ≈ 2.118. Posterior mean = (0.25*10 + 0.2222*8)/0.4722 = (2.5 + 1.7778)/0.4722 ≈ 9.059. Interpretation: the small-sample group 2 is shrunk much closer to mu=10 than group 1.
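To check the arithmetic, a quick precision-weighting sketch (helper name assumed):

```python
def normal_posterior(ybar, n, mu, sigma2, tau2):
    """Precision-weighted posterior mean and variance for theta_j."""
    prec = 1 / tau2 + n / sigma2
    mean = (mu / tau2 + n * ybar / sigma2) / prec
    return mean, 1 / prec

m1, v1 = normal_posterior(12.0, 10, mu=10.0, sigma2=9.0, tau2=4.0)
m2, v2 = normal_posterior(8.0, 2, mu=10.0, sigma2=9.0, tau2=4.0)
print(m1, v1)   # ~11.63, ~0.735
print(m2, v2)   # ~9.06, ~2.12
```

Note the posterior variances tell the same story as the means: group 2's posterior is both closer to the prior mean and wider, reflecting its weaker data.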
Hard: Empirical Bayes vs Full Bayes on between-group variance. You have J=5 groups with bar_y = (4.8, 5.1, 4.9, 5.0, 6.2), n_j=5 each, and known sigma^2=1. (a) Compute the empirical Bayes estimate of tau^2 by maximizing the marginal likelihood numerically (or use method-of-moments approximation). (b) Suppose you put a prior tau ~ HalfNormal(0,1) and perform full Bayes: explain qualitatively how the posterior distribution for tau will differ from the point estimate, and how that affects uncertainty of individual theta_j posteriors.
Hint: For (a) use marginal variance across bar_y minus sampling variance. For (b) recall that full Bayes integrates over tau, so posterior variance for theta_j reflects uncertainty about tau in addition to within-group noise.
(a) Sample mean mu_hat = (4.8+5.1+4.9+5.0+6.2)/5 = 5.2. Sample variance S^2 = ( (4.8-5.2)^2 + (5.1-5.2)^2 + (4.9-5.2)^2 + (5.0-5.2)^2 + (6.2-5.2)^2 ) / (5-1) = (0.16+0.01+0.09+0.04+1.0)/4 = 1.30/4 = 0.325. Sampling variance sigma^2/n = 1/5 = 0.2. Method-of-moments tau^2_hat = max(0, S^2 - 0.2) = 0.125. (b) With full Bayes and prior tau ~ HalfNormal(0,1), the posterior for tau will be a distribution concentrated near 0.125 but with nonzero mass, reflecting data uncertainty. The resulting posterior for each theta_j is a mixture over tau, so posterior variances for theta_j will be larger than those from empirical Bayes which conditions on the point estimate; in particular, uncertainty from tau propagates into wider credible intervals for theta_j, especially for small n_j and when J is small.
Looking back: this lesson builds directly on [Conjugate Priors] where Beta-Binomial and Normal-Normal conjugacy were introduced; those closed-form updates are the building blocks for analytic hierarchical calculations and for deriving marginal likelihoods used in empirical Bayes. It also assumes familiarity with [MCMC]: hierarchical models often require sampling from joint posteriors over many parameters, where non-centered parameterizations (covered here) and diagnostics from [MCMC] are essential. Looking forward: mastering hierarchical Bayes enables hierarchical generalized linear models, Gaussian processes with hierarchical kernels, Bayesian causal hierarchical models (multi-site A/B testing), and small-area estimation in official statistics. Downstream techniques that require this material include: multilevel survival analysis, hierarchical state-space models, and modern probabilistic programming best practices (e.g., priors for covariance matrices using LKJ). Empirical Bayes connects to frequentist shrinkage techniques (James–Stein, ridge regression) and to scalable approximate inference (variational Bayes, MAP/ML-II pipelines) used in large-scale industry systems.