Beta-Binomial, Normal-Normal, Gamma-Poisson conjugacy. Closed-form posterior updates. Exponential family and sufficient statistics.
Conjugate priors turn Bayesian updating from a numerical slog into simple algebra: with the right prior, the posterior stays in the same family and the updates are closed-form.
A conjugate prior is a prior distribution that, when combined with a likelihood from a given family, yields a posterior in the same distributional family — enabling closed-form posterior updates (Beta-Binomial, Gamma-Poisson, Normal-Normal) and exposing sufficient statistics via the exponential-family structure.
Definition and motivation
A conjugate prior for a likelihood family is a prior distribution p(θ) such that the posterior p(θ | data) lies in the same parametric family as p(θ). Conjugacy is valuable because it turns Bayesian Inference (the prerequisite where we update priors with likelihoods to get posteriors) into simple algebraic updates instead of requiring numerical integration or sampling.
Formal statement
If the likelihood p(x | θ) belongs to a family L and the prior p(θ) belongs to a family P, we say P is conjugate to L if for any dataset x the posterior p(θ | x) ∈ P. The algebraic consequence is that the posterior parameters are updated by adding data-dependent terms to prior parameters: these update rules are closed form and often involve sufficient statistics (sums, counts, means).
Why care?
Closed-form posteriors mean no numerical integration or sampling: beliefs update instantly and sequentially, and predictive distributions come out analytically.
A simple motivating example (Beta-Binomial)
In Common Distributions (prerequisite) we learned the Binomial likelihood for n Bernoulli trials with success probability θ and the Beta distribution as a flexible prior on θ. The Beta(a,b) density is

p(θ) = θ^{a-1}(1-θ)^{b-1} / B(a,b),   0 < θ < 1.
The Binomial likelihood for k successes in n trials is

p(k|θ) = \binom{n}{k} θ^{k}(1-θ)^{n-k}.
Multiply prior and likelihood to get the posterior up to normalization:

p(θ|k) ∝ θ^{a-1}(1-θ)^{b-1} · θ^{k}(1-θ)^{n-k} = θ^{a+k-1}(1-θ)^{b+n-k-1},

which is again a Beta distribution: Beta(a+k, b+n-k). This is conjugacy: Beta is conjugate to the Binomial.
Numeric example (concrete)
Let the prior be Beta(2,3) (so a=2, b=3). Observe k=8 successes in n=20 trials. The posterior is

Beta(2+8, 3+12) = Beta(10, 15).

So the posterior mean is 10/(10+15) = 0.4. This single arithmetic update — add successes to a, failures to b — is the essence of conjugacy.
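The Beta-Binomial update above is a minimal sketch in code (assuming SciPy is available; the variable names are illustrative):

```python
from scipy import stats

a, b = 2, 3    # prior Beta(2, 3)
k, n = 8, 20   # observed: 8 successes in 20 trials

# conjugate update: add successes to a, failures to b
a_post, b_post = a + k, b + (n - k)   # Beta(10, 15)
posterior = stats.beta(a_post, b_post)
print(a_post, b_post, posterior.mean())   # 10 15 0.4
```

No integration happens anywhere: the posterior object is fully determined by two additions.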
Intuition from sufficient statistics
Conjugate updates typically depend only on a small set of summary statistics of the data (sufficient statistics). For Binomial/Bernoulli that statistic is the count of successes k (or equivalently the sample mean times n). This is the same reason the Beta prior only needs two parameters (a,b): they act like pseudo-counts that add to the observed counts.
Preview of families we will study
We will work through three conjugate pairs: Beta-Binomial, Gamma-Poisson, and Normal-Normal (known variance). Each produces closed-form posterior parameters, closed-form posterior predictive distributions, and clear sufficient statistics.
Beta-Binomial conjugacy — derivation and predictive distribution
Setup. Observations x_1, ..., x_n are iid Bernoulli(θ) (equivalently, we observe k successes in n trials, k ~ Binomial(n,θ)). Prior: θ ~ Beta(a,b). Likelihood: p(k|θ)=\binom{n}{k}θ^{k}(1-θ)^{n-k}. Posterior (derive):

p(θ|k) ∝ θ^{a-1}(1-θ)^{b-1} · θ^{k}(1-θ)^{n-k} = θ^{a+k-1}(1-θ)^{b+n-k-1}.

Hence

θ | data ~ Beta(a+k, b+n-k).
Numeric example. Prior Beta(1,1) (uniform), observe k=30 successes in n=100. Posterior is Beta(1+30, 1+70)=Beta(31,71). Posterior mean = 31/102 ≈ 0.3039.
Posterior predictive for a single future Bernoulli trial
The posterior predictive probability that the next trial is a success is

P(x_{new}=1 | data) = E[θ | data] = (a+k)/(a+b+n).
Numeric example: with Beta(1,1) prior and k=30, n=100, this equals (1+30)/(1+1+100)=31/102 ≈ 0.3039.
Predictive for m future trials (Beta-Binomial)
The predictive distribution for the number y of successes in m future trials (m>0) is the Beta-Binomial:

p(y | data) = \binom{m}{y} B(a'+y, b'+m-y) / B(a', b'),   y = 0, 1, ..., m,
where B(·,·) is the Beta function. Numeric example: with prior Beta(2,3), observed k=8,n=20, compute probability of y=2 successes in m=5 future trials:
Posterior parameters: a'=10, b'=15. Then

p(y=2) = \binom{5}{2} B(10+2, 15+3) / B(10, 15) = 10 · B(12,18) / B(10,15) ≈ 0.3149.
You can compute B via Gamma: B(12,18)=\Gamma(12)\Gamma(18)/\Gamma(30) etc., or use software. The closed form keeps things analytic.
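A sketch of this computation in code, using log-Beta functions for numerical stability (assumes SciPy; the specific posterior Beta(10,15) comes from the example above):

```python
import math
from scipy.special import betaln, comb

a_post, b_post = 10, 15   # posterior Beta(10, 15) from Beta(2,3) + (k=8, n=20)
m, y = 5, 2               # want P(2 successes in 5 future trials)

# Beta-Binomial pmf: C(m,y) * B(a'+y, b'+m-y) / B(a',b'), computed on the log scale
log_p = (math.log(comb(m, y, exact=True))
         + betaln(a_post + y, b_post + m - y)
         - betaln(a_post, b_post))
p = math.exp(log_p)
print(round(p, 4))   # 0.3149
```

Working on the log scale avoids overflow in the Gamma functions for large counts.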
Gamma-Poisson conjugacy — derivation and predictive distribution
Setup. Observed counts x_1,...,x_n drawn iid Poisson(λ). Likelihood:

p(x_1,...,x_n | λ) = ∏_{i=1}^{n} λ^{x_i} e^{-λ} / x_i! ∝ λ^{Σ x_i} e^{-nλ}.
Prior: λ ~ Gamma(α, β) with the rate parametrization (shape α, rate β), so

p(λ) = β^{α} λ^{α-1} e^{-βλ} / Γ(α),   λ > 0.
Multiply prior and likelihood (ignoring data-only constants):

p(λ | x) ∝ λ^{α-1} e^{-βλ} · λ^{Σ x_i} e^{-nλ} = λ^{α+Σ x_i-1} e^{-(β+n)λ}.
This is Gamma(α + Σ x_i, β + n). Numeric example: prior Gamma(2,1) (α=2,β=1). Data x = {3,2,4} => Σ x_i = 9, n=3. Posterior: Gamma(2+9, 1+3) = Gamma(11,4). Posterior mean = (α')/(β') = 11/4 = 2.75.
Predictive distribution for a new observation
Marginalizing λ yields a closed-form predictive for a new count x':

p(x' | data) = Γ(α'+x') / (Γ(α') x'!) · (β'/(β'+1))^{α'} · (1/(β'+1))^{x'},   x' = 0, 1, 2, ...,
where α' = α+Σ x_i and β' = β+n. This Poisson–Gamma mixture is a Negative Binomial distribution with r = α' and success probability β'/(β'+1). Numeric example: with prior Gamma(2,1) and data {3,2,4} we had α'=11, β'=4. The predictive probability that the next count is x'=2 is

p(x'=2) = Γ(13)/(Γ(11)·2!) · (4/5)^{11} · (1/5)^{2}.

Compute numerically: Γ(13)/Γ(11) = 11·12 = 132, so p = (132/2) · (4/5)^{11} · (1/25) = 66 · (4/5)^{11} · (1/25) ≈ 0.2268.
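Because the Poisson–Gamma predictive is a Negative Binomial, the whole calculation can be sketched with SciPy's `nbinom` (assuming SciPy; parametrization r = α', p = β'/(β'+1)):

```python
from scipy import stats

alpha, beta = 2.0, 1.0    # prior Gamma(2, 1), rate parametrization
data = [3, 2, 4]

alpha_post = alpha + sum(data)   # 11
beta_post = beta + len(data)     # 4

# the Poisson-Gamma predictive is Negative Binomial: r = alpha', p = beta'/(beta'+1)
pred = stats.nbinom(alpha_post, beta_post / (beta_post + 1))
print(round(pred.pmf(2), 4))   # 0.2268
```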
What these two conjugate pairs teach
Both updates simply add sufficient statistics to hyperparameters that behave like pseudo-counts: successes and failures for the Beta, total event count and number of observations for the Gamma.
Normal-Normal conjugacy (known variance) — closed-form posterior and derivation
Setup. Observations x_1,...,x_n are iid Normal(μ, σ^2) with known variance σ^2. Prior on the mean: μ ~ Normal(μ_0, τ_0^2). This is the standard conjugate pair for a location parameter with known variance. The posterior for μ is Normal(μ_n, τ_n^2) with

1/τ_n^2 = 1/τ_0^2 + n/σ^2,   μ_n = τ_n^2 (μ_0/τ_0^2 + n\bar{x}/σ^2).
Derivation (completing the square). The likelihood is

p(x_1,...,x_n | μ) ∝ exp(-Σ_i (x_i - μ)^2 / (2σ^2)).
The prior is

p(μ) ∝ exp(-(μ - μ_0)^2 / (2τ_0^2)).
Multiplying and completing the square in μ gives a Normal posterior with precision (inverse variance) equal to the sum of prior precision and data precision: 1/τ_n^2 = 1/τ_0^2 + n/σ^2. The posterior mean is a precision-weighted average of μ_0 and the sample mean \bar{x}.
Numeric example. Prior μ_0=0, τ_0^2=1. Known σ^2=2. Data: x = {1.2, 0.8, 1.5} so n=3, \bar{x}=(1.2+0.8+1.5)/3=3.5/3≈1.1667.
Compute precisions: 1/τ_0^2 = 1, n/σ^2 = 3/2 = 1.5, so 1/τ_n^2 = 2.5 ⇒ τ_n^2 = 0.4. Posterior mean: μ_n = 0.4·(0/1 + 3·1.1667/2) = 0.4·1.75 = 0.7. So μ | data ≈ N(0.7, 0.4).
Posterior predictive for a new observation x_{new}
The predictive distribution integrates over μ and is Normal with mean μ_n and variance σ^2 + τ_n^2:

x_{new} | data ~ Normal(μ_n, σ^2 + τ_n^2).
Numeric example: variance = 2 + 0.4 = 2.4, so predictive standard deviation ≈ 1.549.
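The posterior and predictive computations above fit in a few lines; here is a minimal sketch (pure arithmetic, no libraries needed beyond `math`):

```python
import math

mu0, tau0_sq = 0.0, 1.0   # prior N(0, 1)
sigma_sq = 2.0            # known observation variance
x = [1.2, 0.8, 1.5]
n, xbar = len(x), sum(x) / len(x)

# posterior precision is the sum of prior and data precisions
prec = 1 / tau0_sq + n / sigma_sq                        # 1 + 1.5 = 2.5
tau_n_sq = 1 / prec                                      # 0.4
mu_n = tau_n_sq * (mu0 / tau0_sq + n * xbar / sigma_sq)  # 0.4 * 1.75 = 0.7

# posterior predictive: N(mu_n, sigma^2 + tau_n^2)
pred_sd = math.sqrt(sigma_sq + tau_n_sq)                 # sqrt(2.4)
print(round(mu_n, 4), tau_n_sq, round(pred_sd, 3))   # 0.7 0.4 1.549
```

Note the predictive standard deviation exceeds the observation noise alone: parameter uncertainty widens the forecast.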
Sufficient statistics and reduction of data
For the Normal with known variance, the sufficient statistic for μ is the sample mean \bar{x} (together with n): the posterior depends only on these two numbers, not on the full sample. This mirrors the Beta-Binomial case, where the sufficient statistic is k.
Exponential family and general conjugacy structure
Many common likelihoods (Bernoulli/Binomial, Poisson, Normal, Exponential, Gamma, etc.) belong to the exponential family. A one-parameter exponential family has density (w.r.t. a base measure)

p(x | η) = h(x) exp(η T(x) - A(η)),
where η is the natural parameter, T(x) is the sufficient statistic for a single observation, and A(η) is the log-partition function.
A natural conjugate prior for η has the form

p(η | ξ, ν) ∝ exp(η ξ - ν A(η)),
where ξ and ν are hyperparameters encoding prior pseudo-observations and g normalizes. After observing n iid samples with sufficient statistic S = Σ T(x_i), the posterior becomes

p(η | x) ∝ exp(η (ξ + S) - (ν + n) A(η)),
so updates are additive: ξ' = ξ + S, ν' = ν + n. This is the generic conjugate-update pattern.
Concrete mapping to Beta-Binomial
For Bernoulli, T(x)=x and A(η)=\log(1+e^{η}). The Beta prior in θ-space corresponds to a conjugate prior in the natural parameter space via a change of variables; however, the additive-pseudo-count view is simplest: in Beta(a,b) the parameters (a-1, b-1) can be seen as pseudo-success and pseudo-failure counts that add to observed counts.
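A sketch of the generic additive update and its Bernoulli-to-Beta mapping (the `conjugate_update` function and the ξ = a-1, ν = a+b-2 convention follow the text; names are illustrative):

```python
def conjugate_update(xi, nu, t_values):
    # generic exponential-family conjugate update: xi' = xi + sum(T), nu' = nu + n
    return xi + sum(t_values), nu + len(t_values)

# Bernoulli: T(x) = x, so sum(T) is the success count.
# Map Beta(a, b) to natural-parameter hyperparameters as pseudo-counts
# (using the xi = a-1, nu = a+b-2 convention from the text).
a, b = 2, 3
xi, nu = a - 1, (a - 1) + (b - 1)

x = [1, 0, 1, 1, 0]   # 3 successes in 5 trials
xi_post, nu_post = conjugate_update(xi, nu, x)

# map back: a' = xi' + 1, b' = (nu' - xi') + 1
a_post, b_post = xi_post + 1, (nu_post - xi_post) + 1
print(a_post, b_post)   # 5 5  (same as a + k, b + n - k)
```

The same two-line `conjugate_update` covers every exponential-family pair in this lesson; only the mapping between hyperparameters and T(x) changes.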
Why this matters
Because the updates are just arithmetic, conjugate models support fast, sequential inference in real time; the use cases below all exploit this.
Practical use cases
1) A/B testing and online experimentation
In A/B tests with binary outcomes (click/no-click), Beta-Binomial conjugacy lets you update beliefs about conversion probabilities in real time. Example: uniform-prior Beta(1,1). Group A observes 30 successes in 100 trials → Beta(31,71). Group B observes 40 successes in 120 trials → Beta(41,81). The probability that p_A > p_B can be computed analytically via Beta integrals or approximated via Monte Carlo sampling from the two Betas. These closed forms make sequential decision rules easy (stop when P(p_A>p_B)>0.95).
Numeric snippet (compare posteriors). Posterior means: A = 31/102 ≈ 0.304, B = 41/122 ≈ 0.336. But the posterior distributions capture uncertainty; the probability p_A>p_B requires integrating their joint Beta densities and is tractable numerically.
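The Monte Carlo route for P(p_A > p_B) is a few lines; a sketch using NumPy (assumed available; sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# posteriors from uniform Beta(1,1) priors: A = 30/100, B = 40/120
p_a = rng.beta(31, 71, size=N)
p_b = rng.beta(41, 81, size=N)

prob_a_better = (p_a > p_b).mean()
print(round(prob_a_better, 2))   # roughly 0.30
```

With this estimate well below 0.95, the sequential decision rule mentioned above would keep the experiment running.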
2) Count data and rare-event smoothing
For estimating rates (traffic accidents per month, email arrivals), Poisson likelihood with Gamma prior gives a Gamma posterior. The Gamma prior acts as regularization: for small-sample areas it pulls extreme sample means toward the prior mean (empirical Bayes uses pooled data to set hyperparameters). Predicting future counts uses the Gamma–Poisson mixture (negative-binomial predictive) to capture over-dispersion beyond a simple Poisson.
3) Measurement, sensor fusion, and filtering
Normal-Normal conjugacy with known variance is exactly the static version of the Kalman filter's update step: combine prior estimate and new measurements weighted by their precisions. In many engineering settings where sensor noise variance is known, these closed-form updates are used continuously.
4) Hierarchical models and empirical Bayes
Conjugate models are building blocks for hierarchical Bayes. For example, modeling click rates p_i for many websites with p_i ~ Beta(α,β) and data Binomial for each site yields closed-form per-site posteriors. Empirical Bayes sets α,β via marginal likelihood (which can be computed for conjugate families), giving shrinkage estimates that pool information across units.
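The empirical-Bayes step relies on the Beta-Binomial marginal likelihood being closed-form. A crude grid-search sketch (the per-site counts are hypothetical, and a real implementation would use a proper optimizer rather than a grid; assumes SciPy/NumPy):

```python
import numpy as np
from scipy.special import betaln, gammaln

def log_marginal(k, n, a, b):
    # Beta-Binomial marginal likelihood: log C(n,k) + log B(a+k, b+n-k) - log B(a,b)
    return (gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
            + betaln(a + k, b + n - k) - betaln(a, b))

# hypothetical per-site click data: (successes, trials)
sites = [(3, 50), (7, 40), (2, 60), (5, 45)]

# empirical Bayes: grid search for (alpha, beta) maximizing the
# summed log marginal likelihood across sites
grid = np.linspace(0.5, 20.0, 40)
a_hat, b_hat = max(((a, b) for a in grid for b in grid),
                   key=lambda ab: sum(log_marginal(k, n, *ab) for k, n in sites))

# shrinkage: per-site posterior means pulled toward the shared prior
post_means = [(a_hat + k) / (a_hat + b_hat + n) for k, n in sites]
```

Sites with little data get pulled hardest toward the pooled prior mean; that is the shrinkage effect.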
Connections to downstream methods
The same closed forms reappear as building blocks in Gibbs sampling, variational inference, message passing, and hierarchical modeling.
Numeric case study (decision example)
Suppose a small clinic records 2 successes in 5 trials for a new treatment. With prior Beta(1,1), the posterior is Beta(3,4). The expected success probability is 3/7 ≈ 0.4286, and the predictive probability that the next trial succeeds is a'/(a'+b') = 3/7 ≈ 0.4286 — for a single Bernoulli trial the predictive probability equals the posterior mean. If a competing treatment has posterior predictive 0.35, you might choose the new treatment. Conjugacy makes these computations trivial and transparent.
When conjugacy is not enough
Real models often lack conjugacy (complex likelihoods, non-exponential-family components). Still, conjugate models serve as approximations, initialization for numerical methods, or local updates within larger models (e.g., conditional conjugacy in parts of hierarchical models).
Summary
Conjugate priors give you closed-form Bayesian updates, analytic predictive distributions, and clear sufficient statistics. Recognizing conjugacy (or mapping a model into the exponential family) buys tractability and insight — foundational for modern Bayesian computation and modeling.
Prior: Beta(2,3). Data: observe k=8 successes out of n=15 Bernoulli trials. Compute the posterior and posterior mean.
Write prior parameters: a=2, b=3.
Write data: k=8 successes, n=15 trials, so failures = n-k = 7.
Posterior parameters: a' = a + k = 2 + 8 = 10; b' = b + (n - k) = 3 + 7 = 10.
Therefore posterior is Beta(10,10).
Posterior mean = a'/(a'+b') = 10/(10+10) = 10/20 = 0.5.
Insight: A Beta prior acts as pseudo-counts: the posterior parameters are simply prior counts plus observed counts. Here the prior expectation (2/5=0.4) is pulled toward the data (8/15≈0.533); the posterior mean 0.5 is between them.
Prior: Gamma(3,2) (shape α=3, rate β=2). Data: observed daily counts {4,1,5,2} (n=4, sum Σx = 12). Compute the posterior for λ and the predictive probability that tomorrow's count equals 3.
Write prior parameters: α=3, β=2.
Sum data: n=4, Σx = 4+1+5+2 = 12.
Posterior parameters: α' = α + Σx = 3 + 12 = 15; β' = β + n = 2 + 4 = 6.
Posterior: λ | data ~ Gamma(15, 6) with mean α'/β' = 15/6 = 2.5.
Predictive for x'=3: use the marginal formula
p(x'=3) = Γ(α'+3)/(Γ(α')3!) · (β'/(β'+1))^{α'} · (1/(β'+1))^{3}.
Plug numbers: Γ(18)/Γ(15) = 15·16·17 = 4080. Then 3! = 6, and (β'/(β'+1))^{α'} = (6/7)^{15}, (1/(β'+1))^{3} = (1/7)^{3}. So
p = 4080/6 · (6/7)^{15} · (1/343) = 680 · (6/7)^{15} · (1/343).
Compute numerically if desired: (6/7)^{15} ≈ 0.1051, so p ≈ 680 · 0.1051 / 343 ≈ 715. - compute: 680*0.1051≈71.468; divide by 343 ≈0.2085. So p ≈ 0.2085.
Insight: Gamma prior adds pseudo-counts to the total event rate; the predictive distribution accounts for parameter uncertainty and broadens the forecast relative to a plug-in Poisson with λ=posterior mean.
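As a check on the arithmetic, the predictive above is again a Negative Binomial; a SciPy sketch (assuming SciPy is available):

```python
from scipy import stats

alpha_post, beta_post = 15, 6   # Gamma(15, 6) posterior from Gamma(3,2) + {4,1,5,2}

# predictive = Negative Binomial with r = alpha', p = beta'/(beta'+1) = 6/7
pred = stats.nbinom(alpha_post, beta_post / (beta_post + 1))
print(round(pred.pmf(3), 4))   # 0.1963
```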
Prior: μ ~ N(1.0, 0.25) (μ_0=1.0, τ_0^2=0.25). Known observation variance σ^2 = 0.5. Data: x = {0.8, 1.2, 0.9, 1.5} (n=4). Compute posterior for μ and a 95% posterior credible interval for μ.
Compute sample mean: \bar{x} = (0.8+1.2+0.9+1.5)/4 = 4.4/4 = 1.1.
Compute precisions: 1/τ_0^2 = 1/0.25 = 4. Data precision: n/σ^2 = 4/0.5 = 8. Sum precision = 12, so τ_n^2 = 1/12 ≈ 0.083333.
Compute posterior mean: μ_n = τ_n^2 (μ_0/τ_0^2 + n\bar{x}/σ^2) = (1/12)(1.0·4 + 4·1.1/0.5). Since 4·1.1/0.5 = 4.4/0.5 = 8.8, μ_n = (1/12)(4 + 8.8) = 12.8/12 ≈ 1.0667.
Posterior: μ | data ~ N(1.0667, 0.083333). Posterior standard deviation = sqrt(0.083333) ≈ 0.288675.
95% credible interval (approx): μ_n ± 1.96·sd ≈ 1.0667 ± 1.96·0.288675 ≈ 1.0667 ± 0.5658 ⇒ [0.5009, 1.6325].
Insight: Posterior mean is a precision-weighted average of prior mean and sample mean. The credible interval shrinks as data precision increases; here the data had higher precision (8) than the prior (4) so the posterior leans toward the sample mean.
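This worked example can be verified with a short script (a sketch assuming SciPy; `norm.interval` gives the equal-tailed 95% interval):

```python
import math
from scipy import stats

mu0, tau0_sq = 1.0, 0.25   # prior N(1.0, 0.25)
sigma_sq = 0.5             # known observation variance
x = [0.8, 1.2, 0.9, 1.5]
n, xbar = len(x), sum(x) / len(x)   # n = 4, xbar = 1.1

prec = 1 / tau0_sq + n / sigma_sq                        # 4 + 8 = 12
tau_n_sq = 1 / prec
mu_n = tau_n_sq * (mu0 / tau0_sq + n * xbar / sigma_sq)  # 12.8 / 12

lo, hi = stats.norm.interval(0.95, loc=mu_n, scale=math.sqrt(tau_n_sq))
print(round(mu_n, 4), round(lo, 4), round(hi, 4))   # 1.0667 0.5009 1.6325
```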
A conjugate prior yields a posterior in the same family as the prior; updates are closed-form additive changes to prior parameters using sufficient statistics.
Beta-Binomial: Beta(a,b) prior + Binomial(k|n,θ) ⇒ posterior Beta(a+k, b+n-k); predictive probability for the next success is (a+k)/(a+b+n). Example: Beta(2,3) + k=8,n=20 ⇒ Beta(10,15).
Gamma-Poisson: Gamma(α,β) prior + Poisson data (Σx) ⇒ posterior Gamma(α+Σx, β+n); predictive for new counts is the Poisson–Gamma mixture with a closed-form (Negative-Binomial-like) PMF. Example: Gamma(2,1) + counts {3,2,4} ⇒ Gamma(11,4).
Normal-Normal (known σ^2): Normal(μ_0,τ_0^2) prior + Normal likelihood ⇒ posterior Normal with precision equal to the sum of prior and data precisions, and mean equal to the precision-weighted average. Example calculations give explicit μ_n and τ_n^2.
Exponential-family conjugacy: if p(x|η)=h(x) exp(η·T(x)-A(η)), a conjugate prior can be written as exp(η·ξ - ν A(η)), and posterior updates are ξ' = ξ + Σ T(x_i), ν' = ν + n.
Closed-form posteriors enable fast inference, predictive distributions, empirical Bayes, and are building blocks for hierarchical models, variational inference, Gibbs sampling, and message passing.
Mixing rate and scale parametrizations of the Gamma distribution: Gamma(α,β) can be parametrized with rate β or scale θ = 1/β. Using the wrong form changes the posterior update (with the rate, β' = β + n; with the scale, θ' = θ/(1 + nθ)) — always specify whether β is a rate or a scale.
Forgetting to include both successes and failures when updating Beta parameters: the Beta update requires adding k to a and (n-k) to b. Adding only k (and ignoring failures) yields an incorrect posterior.
Using the posterior mean as a plug-in for predictive variance without accounting for uncertainty: e.g., for Poisson data, plugging λ̂ = E[λ|data] into Poisson(λ̂) underestimates predictive variance compared to integrating λ out using the Gamma–Poisson mixture.
Assuming conjugacy always holds: not all likelihoods have simple conjugate priors; forcing a conjugate prior (or misidentifying the sufficient statistics) can lead to wrong updates — check the algebra or map to the exponential-family form.
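The plug-in pitfall above is easy to see numerically. A sketch comparing the plug-in Poisson variance with the full Gamma–Poisson (Negative Binomial) predictive variance, reusing the Gamma(11,4) posterior from earlier (assumes SciPy):

```python
from scipy import stats

alpha_post, beta_post = 11, 4   # Gamma(11, 4) posterior from Gamma(2,1) + {3,2,4}

lam_hat = alpha_post / beta_post              # posterior mean 2.75
plugin_var = stats.poisson(lam_hat).var()     # plug-in Poisson variance = 2.75

pred = stats.nbinom(alpha_post, beta_post / (beta_post + 1))
print(plugin_var, pred.var())                 # 2.75 3.4375: predictive is wider
```

Integrating λ out inflates the variance from 2.75 to 3.4375; the plug-in forecast is overconfident.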
Easy: Beta-Binomial update. Prior Beta(3,3). You observe 12 successes out of 20 trials. Find the posterior distribution and its mean.
Hint: Add successes to a and failures to b: a' = a + k, b' = b + n - k.
a' = 3 + 12 = 15; b' = 3 + 8 = 11. Posterior = Beta(15,11). Posterior mean = 15/(15+11) = 15/26 ≈ 0.5769.
Medium: Predictive probability under Gamma-Poisson. Prior Gamma(2,3) (shape α=2, rate β=3). Data: counts {0,1,2,0,1} (n=5, Σx=4). Compute the predictive probability that the next day's count is 2.
Hint: Posterior α' = α + Σx, β' = β + n. Then use the marginal predictive formula for p(x').
α' = 2 + 4 = 6; β' = 3 + 5 = 8. Predictive p(x'=2) = Γ(6+2)/(Γ(6)·2!) · (β'/(β'+1))^{6} · (1/(β'+1))^{2}. Compute Γ(8)/Γ(6) = 6·7 = 42. So p = (42/2) · (8/9)^{6} · (1/9)^{2} = 21 · (8/9)^{6} · (1/81). Numerically (8/9)^{6} ≈ 0.4933, so p ≈ 21·0.4933/81 ≈ 0.1279.
Hard: Exponential-family conjugacy derivation. Suppose the likelihood is a one-parameter exponential family p(x|η) = h(x) exp(η T(x) - A(η)). Show that the conjugate prior of the canonical form p(η|ξ,ν) ∝ exp(η ξ - ν A(η)) leads to posterior parameters ξ' = ξ + Σ T(x_i) and ν' = ν + n, and then apply this general formula to derive the Beta-Binomial update mapping (identify η, T(x), A(η), ξ, and ν).
Hint: Multiply prior and likelihood for n iid observations, collect terms multiplying η and A(η), and read off the updated hyperparameters. For Bernoulli, rewrite likelihood in canonical form to find η = log(θ/(1-θ)).
Posterior ∝ p(η|ξ,ν) · ∏_{i=1}^n p(x_i|η) ∝ exp(η ξ - ν A(η)) · ∏_i h(x_i) exp(η T(x_i) - A(η)) = (data-only) · exp(η (ξ + Σ T(x_i)) - (ν + n) A(η)). Thus posterior has the same canonical form with ξ' = ξ + Σ T(x_i) and ν' = ν + n. For Bernoulli(θ), write p(x|θ) = θ^{x}(1-θ)^{1-x} = exp(x log θ + (1-x) log(1-θ)) = h(x) exp(η T(x) - A(η)) with η = log(θ/(1-θ)), T(x)=x, and A(η)=log(1+e^{η}). The conjugate prior in θ-space is Beta, which corresponds to pseudo-counts: choosing ξ = a-1 (sum of pseudo T) and ν = a+b-2 (pseudo sample size) yields posterior updates a' = a + Σ x, b' = b + n - Σ x. Equivalently, in the standard Beta parametrization, the update is a' = a + k, b' = b + n - k.
Looking back: This lesson builds directly on Bayesian Inference (we used the prior×likelihood→posterior operation repeatedly) and Common Distributions (Bernoulli/Binomial, Poisson, Normal, Gamma, Beta). In Bayesian Inference, you learned the general Bayes rule; conjugate priors are special cases where Bayes rule yields algebraic, closed-form updates. In Common Distributions, you learned parametric forms; here we used those densities and simple algebra to derive posteriors.
Looking forward: mastering conjugacy and the exponential-family viewpoint enables several downstream techniques: hierarchical Bayes and empirical Bayes, Gibbs sampling via conditional conjugacy, variational inference with exponential-family approximating distributions, and message passing.
Specific requirements: if you intend to implement online A/B testing, bandit algorithms with Beta priors, or count forecasting with Poisson–Gamma models, you'll directly use the formulas here. If you move to non-conjugate models, you will often approximate them via exponential-family conjugates (Laplace approximations, variational families), so understanding conjugacy is essential foundationally.