Updating probability distributions with data. Prior, likelihood, posterior.
Quick unlock - significant prerequisite investment but simple final step. Verify prerequisites first.
You already know Bayes’ theorem as a rule for flipping conditionals: P(A|B) ∝ P(B|A)P(A). Bayesian inference is what happens when you treat the unknown quantity (often a parameter θ) as the “A” you want to reason about, and the observed dataset x as the “B” you’ve learned from—so your result is not a single best guess, but a whole updated distribution over plausible θ values.
Bayesian inference updates beliefs about unknown parameters θ using data x via p(θ|x) = p(x|θ) p(θ) / p(x).
Conjugate priors make posteriors easy; otherwise you approximate (MCMC, variational inference).
This node builds on ideas you may already know, but it’s easy to get tripped up by missing one small piece. Here’s the explicit checklist.
You should be comfortable with conditional probability and Bayes’ theorem in its standard form: P(A|B) = P(B|A)P(A)/P(B).
You should also understand the “proportional” form: P(A|B) ∝ P(B|A)P(A),
where the missing factor is “whatever makes it sum/integrate to 1.” Bayesian inference uses this proportionality constantly.
You should recognize probability mass/density functions and their parameters.
You should know that the likelihood L(θ; x) is the same expression as p(x|θ), but interpreted as a function of θ for fixed observed x.
MLE chooses: θ̂_MLE = argmax_θ p(x|θ).
Bayesian inference will generalize this: it returns a distribution over θ instead of a single maximizer.
The “evidence” (also called the marginal likelihood) is:
p(x) = ∫ p(x|θ) p(θ) dθ
(or a sum for discrete θ). If you don’t have calculus yet, you can still learn most of Bayesian inference by treating this as “the normalization constant” and focusing on proportional reasoning: p(θ|x) ∝ p(x|θ) p(θ).
You can do many practical updates with conjugate priors without doing the integral yourself.
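If calculus is a barrier, proportional reasoning can also be made concrete on a grid: discretize θ, multiply prior by likelihood pointwise, and normalize by summing. A minimal sketch (NumPy assumed; the 7-heads-in-10-flips numbers are illustrative):

```python
import numpy as np

# Grid approximation: treat the normalization constant as "whatever
# makes the values sum to 1" over a discrete grid of θ values.
theta = np.linspace(0.001, 0.999, 999)   # grid over θ ∈ (0, 1)
prior = np.ones_like(theta)              # flat prior on the grid
likelihood = theta**7 * (1 - theta)**3   # e.g., 7 heads in 10 flips

unnormalized = prior * likelihood
posterior = unnormalized / unnormalized.sum()  # normalize by summing

# The posterior mode on the grid sits near the MLE x/n = 0.7.
print(theta[np.argmax(posterior)])
```

No integral was computed by hand; the sum over the grid plays the role of the evidence.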
People often say “use a flat/uninformative prior.” Two important caveats:
1) ‘Flat’ depends on parameterization. A prior that is uniform in θ is not uniform in a transformed parameter φ = g(θ). For example, if φ = θ², then a uniform prior in θ induces a non-uniform prior in φ.
2) “Non-informative” is subtle. Some priors are designed to be less informative under reparameterizations (e.g., Jeffreys priors), but there is no universal free lunch.
Keep this in mind as we talk about priors: they encode assumptions, and assumptions should be made explicit.
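Caveat (1) is easy to demonstrate by simulation. A minimal sketch (NumPy assumed available; the transform φ = θ² is just one convenient illustration):

```python
import numpy as np

# Draw θ uniformly on (0, 1), then look at φ = θ².  If "flat" were
# parameterization-free, φ would also look uniform — it does not:
# small φ values are heavily over-represented.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 1.0, size=100_000)
phi = theta**2

# Under a uniform distribution, about 25% of samples would fall below
# 0.25; under φ = θ², about 50% of them do (since P(θ² < 0.25) = P(θ < 0.5)).
frac_low = np.mean(phi < 0.25)
print(round(frac_low, 2))
```

So a prior that is “flat in θ” is strongly informative about φ; which parameterization you flatten is itself an assumption.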
In frequentist statistics, parameters are fixed but unknown. In Bayesian statistics, parameters are treated as uncertain quantities described by a probability distribution.
You observe data x (which might be a whole dataset like x = (x₁, …, xₙ)), and you want to reason about an unknown parameter (or parameters) θ.
Bayesian inference is the process of updating your beliefs about θ after seeing x.
Bayes’ theorem in density form is:
p(θ|x) = p(x|θ) p(θ) / p(x)
Each term has a distinct job:
1) p(θ) is the prior: your beliefs about θ before seeing data.
2) p(x|θ) is the likelihood: how well each candidate θ explains the observed data.
3) p(x) is the evidence: the normalizer (and a model score).
4) p(θ|x) is the posterior: your beliefs about θ after seeing data.
Often we write the update in proportional form: p(θ|x) ∝ p(x|θ) p(θ).
That proportional form is not a shortcut; it’s a mindset: start by multiplying prior × likelihood, then normalize.
Bayes’ theorem is a single identity. Bayesian inference is a workflow:
1) Choose a probabilistic model for data: p(x|θ).
2) Choose a prior over unknowns: p(θ).
3) Compute or approximate the posterior: p(θ|x) ∝ p(x|θ) p(θ).
4) Use the posterior for decisions/predictions.
This workflow forces you to express assumptions.
Suppose θ is one-dimensional. Where the prior and the likelihood both place appreciable density, their product (the unnormalized posterior) is large; where either is near zero, the product is near zero, so the posterior concentrates where the two agree.
This “agreement by multiplication” is the heart of Bayesian updating.
MLE finds a point estimate maximizing the likelihood.
Bayesian inference produces a distribution. But you can recover point estimates from the posterior: the posterior mean E[θ|x], the posterior mode (the MAP estimate), or the posterior median.
The key difference: Bayesian inference quantifies uncertainty and naturally supports predictive distributions (integrating over θ).
A Bayesian model usually begins with a story:
1) Nature draws a parameter θ from a prior p(θ).
2) Then Nature generates data x from p(x|θ).
We only observe x. Bayesian inference asks: given x, what should we believe about θ?
A prior can do several jobs: encode domain knowledge, regularize estimates toward plausible values, and make modeling assumptions explicit.
If θ is a probability (like a Bernoulli success rate), then a natural prior is the Beta distribution: p(θ) ∝ θ^{α-1} (1-θ)^{β-1} for θ ∈ (0, 1).
Interpretation (informally): α looks like prior “successes,” β like prior “failures.”
This is a common conceptual speed bump.
Important: likelihoods are not probability distributions over θ, so they do not need to integrate to 1 over θ.
If x₁, …, xₙ are IID given θ, then: p(x₁, …, xₙ | θ) = ∏ᵢ p(xᵢ|θ).
That product is why data accumulates evidence quickly.
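In code, that product is almost always computed as a sum of log-densities to avoid numerical underflow. A minimal sketch for Bernoulli data (the helper name and example data are illustrative):

```python
import math

# The IID likelihood is a product over data points; summing logs keeps
# the computation numerically stable for large datasets.
def bernoulli_log_likelihood(data, theta):
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

data = [1, 1, 1, 0, 1, 0, 1, 1, 1, 0]  # 7 heads in 10 flips
# The log-likelihood prefers θ values that match the observed frequency:
print(bernoulli_log_likelihood(data, 0.7) > bernoulli_log_likelihood(data, 0.3))  # → True
```

Each additional observation adds one term to the sum, which is exactly why data accumulates evidence quickly.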
The posterior is what you use for: point estimates, credible intervals, predictions, and downstream decisions.
The evidence is: p(x) = ∫ p(x|θ) p(θ) dθ.
You can think of it as: the average likelihood of the data under the prior, i.e., how well the model as a whole (prior included) predicts what you actually observed.
This becomes central in model comparison (Bayes factors), because it penalizes overly flexible models that spread probability mass too thin.
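“Average likelihood under the prior” can be made concrete by Monte Carlo: draw θ from the prior and average the likelihood. A sketch for a Beta–Bernoulli model (NumPy assumed; the 7-heads-and-3-tails sequence and Beta(2, 2) prior are illustrative, and the Beta function gives the exact value for comparison):

```python
import numpy as np
from math import gamma

# Evidence as "average likelihood under the prior": draw θ ~ prior,
# then average p(x|θ).  Here x is a specific sequence of 7 heads and
# 3 tails (no binomial coefficient), with a Beta(2, 2) prior.
rng = np.random.default_rng(0)
theta = rng.beta(2, 2, size=500_000)
evidence_mc = np.mean(theta**7 * (1 - theta)**3)

# Exact evidence for this model: B(9, 5) / B(2, 2).
def beta_fn(a, b):
    return gamma(a) * gamma(b) / gamma(a + b)

evidence_exact = beta_fn(9, 5) / beta_fn(2, 2)
print(evidence_mc, evidence_exact)  # both ≈ 9.3e-4
```

A more flexible model spreads this average over many more datasets, which is how the evidence ends up penalizing flexibility.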
| Object | Notation | What varies? | Must integrate/sum to 1 over θ? | Role |
|---|---|---|---|---|
| Prior | p(θ) | θ | Yes | Belief before data |
| Likelihood | p(x\|θ) | θ (x fixed) | No | Data support for θ |
| Posterior | p(θ\|x) | θ | Yes | Belief after data |
| Evidence | p(x) | — | — | Normalizer; model score |
If you observe data in chunks, Bayes updates are consistent.
Let data arrive as x₁ then x₂ (conditionally independent given θ). Then:
p(θ|x₁, x₂) ∝ p(x₂|θ) p(θ|x₁).
So yesterday’s posterior becomes today’s prior. This is not just poetic; it’s computationally useful and conceptually clean.
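For a Beta prior on a coin’s heads-probability, the update just adds observed successes and failures to the prior’s parameters; assuming that, chunked updates match one batch update exactly. A tiny sketch (plain Python; the chunk sizes are arbitrary):

```python
# Sequential Bayesian updating with a Beta prior: yesterday's posterior
# is today's prior.  Two chunked updates give the same Beta posterior
# as a single batch update on all the data.
def beta_update(alpha, beta, heads, tails):
    return alpha + heads, beta + tails

# Batch: 7 heads, 3 tails at once, starting from Beta(2, 2).
batch = beta_update(2, 2, 7, 3)

# Sequential: first 4 heads / 1 tail, then 3 heads / 2 tails.
step1 = beta_update(2, 2, 4, 1)
step2 = beta_update(*step1, 3, 2)

print(batch, step2)  # → (9, 5) (9, 5)
```

The order of the chunks does not matter either, since the update only accumulates counts.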
The posterior requires multiplying and normalizing:
p(θ|x) = p(x|θ) p(θ) / ∫ p(x|θ′) p(θ′) dθ′
Sometimes, that product lands in the same family as the prior. Then the posterior has a closed form, and updating is easy.
That pairing is called conjugacy.
Conjugacy is not required for Bayesian inference, but it’s the clearest way to learn the mechanics.
Assume x is the number of successes in n Bernoulli trials with success probability θ.
Compute the unnormalized posterior:
p(θ|x) ∝ [θ^{x} (1-θ)^{n-x}] · [θ^{α-1} (1-θ)^{β-1}] = θ^{x+α-1} (1-θ)^{n-x+β-1}
So:
θ|x ~ Beta(α + x, β + n - x)
This reveals the “pseudo-count” intuition: successes add to α, failures add to β.
If data are Poisson with rate λ: p(x|λ) = λ^{x} e^{-λ} / x!.
A conjugate prior is: λ ~ Gamma(α, β)
(using the rate parameterization, where the density is proportional to λ^{α-1} e^{-βλ}).
With n IID observations and S = Σᵢ xᵢ: λ|x₁,…,xₙ ~ Gamma(α + S, β + n).
Again: the total count S adds to the shape, and the number of observations n adds to the rate.
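The update is short enough to sketch directly (plain Python; the counts and prior values are made up):

```python
# Gamma–Poisson conjugate update: the total count S adds to the shape
# and the number of observations n adds to the rate.
def gamma_poisson_update(alpha, beta, counts):
    return alpha + sum(counts), beta + len(counts)

counts = [5, 2, 3]                        # three periods of count data
alpha_post, beta_post = gamma_poisson_update(2.0, 1.0, counts)
print(alpha_post, beta_post)              # → 12.0 4.0
print(alpha_post / beta_post)             # posterior mean → 3.0
```

No density evaluations are needed; conjugacy reduces the whole update to bookkeeping on two parameters.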
If: x₁, …, xₙ are IID Normal(μ, σ²) with σ² known,
and prior: μ ~ Normal(μ₀, τ₀²),
then the posterior is also normal. The posterior mean becomes a precision-weighted average of the prior mean and sample mean.
Define precision as inverse variance: the prior precision is 1/τ₀², and the data contribute total precision n/σ².
Let x̄ = (1/n) Σᵢ xᵢ. Then:
1/τₙ² = 1/τ₀² + n/σ² and μₙ = τₙ² (μ₀/τ₀² + n x̄/σ²).
So: μ | x₁, …, xₙ ~ Normal(μₙ, τₙ²).
This shows a deep Bayesian theme: uncertainty shrinks with data.
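The precision-weighted update can be sketched in a few lines (plain Python; the prior and data summary values are illustrative):

```python
# Normal–Normal update with known noise variance σ²: the posterior mean
# is a precision-weighted average of the prior mean and the sample mean,
# and the posterior variance shrinks as precisions add up.
def normal_update(mu0, tau0_sq, xbar, sigma_sq, n):
    prior_prec = 1.0 / tau0_sq        # prior precision 1/τ₀²
    data_prec = n / sigma_sq          # total data precision n/σ²
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * mu0 + data_prec * xbar)
    return post_mean, post_var

# Prior N(0, 1); 9 observations with sample mean 3 and noise variance 9.
print(normal_update(0.0, 1.0, 3.0, 9.0, 9))   # → (1.5, 0.5)
```

Increasing n (or decreasing σ²) raises the data precision, pulling the posterior mean toward x̄ and shrinking the posterior variance.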
Bayesian inference shines when you want to predict future observations x*.
Instead of plugging in a single estimate of θ, you average over the posterior:
p(x*|x) = ∫ p(x*|θ) p(θ|x) dθ
This is called the posterior predictive distribution.
Intuition: if you are uncertain about , your predictions should reflect that uncertainty.
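That averaging can be approximated by simulation: draw θ from the posterior, simulate the next observation from p(x*|θ), and average. A sketch for a Bernoulli model (NumPy assumed; Beta(3, 2) is an arbitrary example posterior, e.g. after 2 successes and 1 failure on a Beta(1, 1) prior, with exact predictive mean 3/5):

```python
import numpy as np

# Posterior predictive by Monte Carlo: sample θ ~ posterior, then
# sample the next flip ~ Bernoulli(θ), and average the results.
rng = np.random.default_rng(0)
theta_draws = rng.beta(3, 2, size=200_000)   # example posterior Beta(3, 2)
next_flip = rng.binomial(1, theta_draws)     # one simulated flip per draw

# The estimate approaches the exact predictive E[θ | data] = 3/5.
print(next_flip.mean())
```

The same two-step recipe (sample parameters, then sample data) works for any model whose posterior you can draw from, which is one reason MCMC output is so useful.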
A Bayesian credible interval is a probability statement about the parameter: a 95% credible interval [a, b] satisfies P(a ≤ θ ≤ b | x) = 0.95.
A frequentist 95% confidence interval is a statement about repeated sampling behavior of the interval procedure, not directly about the realized parameter.
Both can be useful, but do not automatically interpret them the same way.
Conjugate priors are beautiful, but many real models are not conjugate.
In those cases: you approximate the posterior, for example by sampling from it (MCMC) or by optimizing over a family of tractable distributions (variational inference).
This node focuses on building the conceptual and algebraic foundation that those methods rely on.
Bayesian inference gives you three capabilities that show up everywhere in modern ML and statistics:
1) Uncertainty-aware learning (not just point estimates)
2) Principled regularization via priors
3) Model comparison via evidence
Let’s connect those to the nodes this unlocks.
VAEs introduce latent variables z and parameters θ.
A typical generative story: draw a latent variable z ~ p(z), then generate data x ~ p(x|z, θ).
Inference asks for the posterior over latent variables: p(z|x) ∝ p(x|z, θ) p(z).
But p(z|x) is usually intractable, so we approximate with variational inference (ELBO). That is Bayesian inference scaled up.
When the posterior is too complex for closed-form algebra, we draw samples from it instead.
MCMC constructs a Markov chain whose stationary distribution is the posterior, enabling: posterior expectations, credible intervals, and predictive simulation, all estimated from samples.
The target distribution MCMC needs is exactly the Bayesian posterior (often only known up to a normalization constant, which is fine for many MCMC algorithms).
Bayesian optimization maintains a posterior over functions or surrogate-model parameters (often Gaussian processes). Data updates a prior to a posterior, then an acquisition function uses that posterior uncertainty to pick the next point to evaluate.
The key idea is exploration vs exploitation driven by posterior uncertainty.
In auction settings, bidders and the designer reason about unknown private valuations and types. Bayesian models represent beliefs about those unknowns and update from signals or observed behavior. “Bayesian” in mechanism design often literally refers to priors over types.
Many causal workflows use Bayesian inference to:
Even when causal identification is a separate question, Bayesian inference is frequently the engine used once a causal estimand is defined.
If you’ve seen L2 regularization in regression, there is a Bayesian interpretation: a zero-mean Gaussian prior on the weights makes the MAP estimate identical to the ridge (L2-penalized) solution.
This is a bridge between optimization-based ML and probabilistic modeling.
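A small numerical check of that correspondence (NumPy assumed; the synthetic data, the true weights, and the mapping λ = σ²/τ² are all illustrative):

```python
import numpy as np

# Sketch: MAP with a Gaussian prior on weights equals ridge regression.
# Ridge solves (XᵀX + λI) w = Xᵀy; MAP with weight prior N(0, τ²I) and
# noise N(0, σ²) solves (XᵀX/σ² + I/τ²) w = Xᵀy/σ², the same system
# when λ = σ²/τ².
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

sigma_sq, tau_sq = 0.01, 1.0
lam = sigma_sq / tau_sq

w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
w_map = np.linalg.solve(X.T @ X / sigma_sq + np.eye(3) / tau_sq,
                        X.T @ y / sigma_sq)
print(np.allclose(w_ridge, w_map))  # → True
```

A tighter prior (smaller τ²) corresponds to a larger penalty λ, i.e., stronger regularization.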
After this node, you should be comfortable with:
1) the distinct roles of prior, likelihood, evidence, and posterior;
2) proportional reasoning (multiply prior × likelihood, then normalize);
3) conjugate updates (Beta–Binomial, Gamma–Poisson, Normal–Normal);
4) posterior predictive distributions; and
5) how MAP relates to MLE and to regularization.
That’s the conceptual toolkit you need before you dive into MCMC, variational inference, VAEs, Bayesian optimization, and Bayesian causal modeling.
You flip a coin n = 10 times and observe x = 7 heads. You model heads as Bernoulli(θ) with unknown θ. Prior: θ ~ Beta(α=2, β=2) (a mild prior preferring values near 0.5). Compute the posterior, posterior mean, and a simple predictive probability for the next flip being heads.
Write the likelihood for x heads in n flips: p(x|θ) ∝ θ^{x} (1-θ)^{n-x} = θ^{7} (1-θ)^{3} (the binomial coefficient does not depend on θ).
Write the prior density up to proportionality: p(θ) ∝ θ^{α-1} (1-θ)^{β-1} = θ^{1} (1-θ)^{1}.
Compute unnormalized posterior: p(θ|x) ∝ θ^{7+1} (1-θ)^{3+1} = θ^{8} (1-θ)^{4}.
Identify the Beta form: θ^{a-1} (1-θ)^{b-1} with a = 9, b = 5.
Posterior is θ|x ~ Beta(9, 5).
Compute the posterior mean (for Beta(a,b), mean is a/(a+b)): E[θ|x] = 9/14 ≈ 0.643.
Compute posterior predictive probability that next flip is heads: P(next = heads | data) = ∫ θ p(θ|x) dθ.
But ∫ θ p(θ|x) dθ = E[θ|x], so P(next = heads | data) = 9/14 ≈ 0.643.
Insight: The posterior update is just “add successes and failures” to the prior’s pseudo-counts. The predictive probability automatically accounts for uncertainty because it averages over θ instead of plugging in a single estimate.
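The arithmetic above is easy to double-check in a few lines of Python:

```python
# Check the Beta–Binomial worked example: Beta(2, 2) prior, 7 heads in
# 10 flips → Beta(9, 5) posterior whose mean (and predictive probability
# of heads on the next flip) is 9/14.
alpha, beta, n, x = 2, 2, 10, 7
a_post, b_post = alpha + x, beta + (n - x)
post_mean = a_post / (a_post + b_post)

print(a_post, b_post)        # → 9 5
print(round(post_mean, 3))   # → 0.643
```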
A website sees counts of signups per day. Assume x₁,…,xₙ are IID Poisson(λ). You observe n = 5 days with counts: 3, 1, 4, 0, 2 (sum S = 10). Prior: λ ~ Gamma(α=2, β=1) using the rate parameterization (density ∝ λ^{α-1} e^{-βλ}). Compute the posterior and posterior mean.
Write the likelihood for IID Poisson: p(x₁,…,xₙ|λ) = ∏ᵢ λ^{xᵢ} e^{-λ} / xᵢ!.
Separate terms that depend on λ: p(x₁,…,xₙ|λ) = C · λ^{S} e^{-nλ},
where C = 1/∏ᵢ xᵢ! does not depend on λ and S = Σᵢ xᵢ.
Write the prior up to proportionality: p(λ) ∝ λ^{α-1} e^{-βλ} = λ^{1} e^{-λ}.
Compute the unnormalized posterior: p(λ|x) ∝ λ^{S+α-1} e^{-(n+β)λ} = λ^{11} e^{-6λ}.
Recognize the Gamma form: λ|x ~ Gamma(α+S, β+n) = Gamma(12, 6).
Compute the posterior mean (for Gamma(shape α, rate β), mean is α/β): E[λ|x] = 12/6 = 2.
Insight: The posterior mean blends prior information with data: the data contributes S counts and n exposure units (days). The update is algebraic because the Gamma prior is conjugate to the Poisson likelihood.
Assume x₁,…,xₙ are IID Normal(μ, σ²) with known σ² = 4. You observe n = 4 data points: 2, 1, 3, 2 so the sample mean is x̄ = 2. Prior: μ ~ Normal(μ₀ = 0, τ₀² = 1). Compute the posterior mean and variance.
Compute precisions (inverse variances): prior precision 1/τ₀² = 1/1 = 1; total data precision n/σ² = 4/4 = 1.
Use the conjugate update formulas: 1/τₙ² = 1/τ₀² + n/σ² and μₙ = τₙ² (μ₀/τ₀² + n x̄/σ²).
Plug in numbers for posterior variance: τₙ² = 1/(1 + 1) = 0.5.
Plug in numbers for posterior mean: μₙ = 0.5 · (0/1 + 4·2/4) = 0.5 · 2 = 1.
State posterior: μ | data ~ Normal(1, 0.5).
Insight: Even though the sample mean is 2, the posterior mean is 1 because the prior mean 0 pulls it back (shrinkage). With more data (larger n) or lower noise (smaller σ²), the data would dominate and shrinkage would weaken.
Bayesian inference treats unknown parameters θ as random variables and updates beliefs with data via p(θ|x) ∝ p(x|θ) p(θ).
The likelihood is a function of θ for fixed observed x; it is not a probability distribution over θ.
The evidence normalizes the posterior and enables model comparison (marginal likelihood).
Conjugate priors (Beta–Binomial, Gamma–Poisson, Normal–Normal) yield closed-form posteriors and build intuition for updating.
Posterior predictive distributions average over parameter uncertainty: p(x*|x) = ∫ p(x*|θ) p(θ|x) dθ.
MAP estimation is Bayesian point estimation: θ̂_MAP = argmax_θ p(x|θ) p(θ); with a uniform prior it matches MLE (but “uniform” is parameterization-dependent).
‘Flat/uninformative’ priors are not automatically objective; they depend on how you parameterize the problem and can encode assumptions implicitly.
Treating the likelihood as a distribution over θ and trying to interpret it as “probability θ is true.” Likelihood is not normalized over θ.
Forgetting the evidence/normalization and thinking p(θ|x) = p(x|θ) p(θ) exactly (missing the constant that makes it integrate to 1).
Assuming a uniform prior is always non-informative; uniformity changes under reparameterization, so ‘uninformative’ requires care.
Mixing up posterior credible intervals with frequentist confidence intervals and interpreting them identically.
Beta–Binomial practice: You observe n = 20 trials with x = 2 successes. Prior is Beta(α=1, β=1). (a) What is the posterior? (b) What is the posterior mean? (c) What is the posterior predictive probability of success on the next trial?
Hint: Use Beta conjugacy: posterior parameters are (α+x, β+n−x). Predictive success probability is the posterior mean.
(a) Posterior: Beta(α+x, β+n−x) = Beta(1+2, 1+18) = Beta(3, 19).
(b) Posterior mean = 3/(3+19) = 3/22 ≈ 0.13636.
(c) Posterior predictive P(next=1|data) = E[θ|data] = 3/22.
Gamma–Poisson practice: Counts per hour are modeled as Poisson(λ). You observe 8 hours with total count S = 24. Prior is Gamma(α=3, β=2) (rate parameterization). Find the posterior distribution and posterior mean.
Hint: For Poisson with Gamma prior: posterior is Gamma(α+S, β+n). Mean is (α+S)/(β+n).
Posterior: Gamma(α+S, β+n) = Gamma(3+24, 2+8) = Gamma(27, 10).
Posterior mean = 27/10 = 2.7.
MAP vs MLE and priors: Let x₁,…,xₙ ~ Normal(μ, σ²) with known σ². (a) Write the MLE for μ. (b) If the prior is μ ~ Normal(μ₀, τ₀²), derive the MAP estimate for μ by maximizing the posterior (show the algebraic completion of squares or derivative steps).
Hint: The posterior is proportional to likelihood × prior. Taking logs turns products into sums. Differentiate w.r.t. μ and set to 0.
(a) MLE: maximize ∏ᵢ N(xᵢ|μ,σ²). The maximizer is the sample mean: μ̂_MLE = x̄ = (1/n) Σᵢ xᵢ.
(b) Posterior (up to proportionality): p(μ|x) ∝ exp(-Σᵢ (xᵢ-μ)²/(2σ²)) · exp(-(μ-μ₀)²/(2τ₀²)).
Take logs (dropping constants not depending on μ): ℓ(μ) = -(1/(2σ²)) Σᵢ (xᵢ-μ)² - (1/(2τ₀²)) (μ-μ₀)².
Differentiate: dℓ/dμ = (1/σ²) Σᵢ (xᵢ-μ) - (1/τ₀²)(μ-μ₀).
So: dℓ/dμ = (n/σ²)(x̄-μ) - (1/τ₀²)(μ-μ₀).
Set to 0: (n/σ²)(x̄-μ) - (1/τ₀²)(μ-μ₀) = 0.
Rearrange: μ (n/σ² + 1/τ₀²) = (n/σ²) x̄ + (1/τ₀²) μ₀.
Thus the MAP (which equals the posterior mean in this conjugate case) is: μ̂_MAP = ((n/σ²) x̄ + (1/τ₀²) μ₀) / (n/σ² + 1/τ₀²).
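The closed form can be sanity-checked numerically by maximizing the log posterior on a grid (NumPy assumed; the data reuse the numbers from the earlier Normal worked example):

```python
import numpy as np

# Numerically check the MAP formula: the grid maximizer of the log
# posterior should match the precision-weighted closed form.
data = np.array([2.0, 1.0, 3.0, 2.0])
sigma_sq, mu0, tau0_sq = 4.0, 0.0, 1.0
n, xbar = len(data), data.mean()

mu_grid = np.linspace(-2.0, 4.0, 60_001)
log_post = (-np.sum((data[:, None] - mu_grid) ** 2, axis=0) / (2 * sigma_sq)
            - (mu_grid - mu0) ** 2 / (2 * tau0_sq))

map_grid = mu_grid[np.argmax(log_post)]
map_formula = ((n / sigma_sq) * xbar + (1 / tau0_sq) * mu0) \
              / (n / sigma_sq + 1 / tau0_sq)
print(round(map_grid, 3), round(map_formula, 3))  # → 1.0 1.0
```

Because this model is conjugate, the same number is also the posterior mean, matching the shrinkage result from the worked example.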