Posterior predictive distributions for time series. Structural time series models. Bayesian model averaging and model selection. Demand forecasting with uncertainty quantification.
Forecasting without honest uncertainty is betting in the dark; Bayesian forecasting turns every forecast into a full probability statement so you can trade off risk, inventory, and service levels explicitly.
Bayesian forecasting constructs posterior predictive distributions for future time series observations by integrating parameter and latent-state uncertainty, enabling coherent uncertainty quantification, principled model averaging, and robust demand forecasts with interpretable predictive intervals.
Definition and motivation
Bayesian forecasting is the process of producing probability distributions for future observations of a time series by conditioning on historical data and integrating over posterior uncertainty in both parameters and latent states. Formally, for observed data y_{1:T} and future horizon h, the posterior predictive distribution is
p(y_{T+h} | y_{1:T}) = ∫∫ p(y_{T+h} | x_{T+h}, θ) p(x_{T+h} | θ, y_{1:T}) p(θ | y_{1:T}) dx_{T+h} dθ.
This identity is the backbone: it separates aleatory uncertainty (conditional observation noise given states and parameters) from epistemic uncertainty (uncertainty about parameters θ and future latent states x_{T+h}).
Why care: uncertainty quantification, decision-making, and model comparison
Relation to prerequisites
Concrete simple formula and numeric example
Consider IID Poisson counts y_1, …, y_T ∼ Poisson(λ) with prior λ ∼ Gamma(α, β) (shape-rate). The posterior is λ | y ∼ Gamma(α + S, β + T) with S = Σ_t y_t, and the posterior predictive for a future single count is Negative Binomial (Gamma-Poisson mixture):
p(y_{T+1} = k | y) = [Γ(α' + k) / (k! Γ(α'))] · (β'/(β' + 1))^{α'} · (1/(β' + 1))^{k},
where α' = α + S and β' = β + T. Numeric example: if T = 10, S = 50, prior Gamma(2, 1), then posterior shape α' = 52, rate β' = 11. The predictive mean is α'/β' = 52/11 ≈ 4.73 and the predictive variance α'/β' · (β' + 1)/β' ≈ 5.16 is higher than the mean due to parameter uncertainty; the distribution is Negative Binomial with those parameters.
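The Gamma-Poisson predictive above can be checked directly with scipy's Negative Binomial, which uses the (r, p) parameterization with r = α' and p = β'/(β' + 1); a minimal sketch using the example's numbers:

```python
from scipy import stats

# Gamma-Poisson: prior Gamma(alpha, beta), data with sum S over T periods.
alpha, beta = 2.0, 1.0                 # prior shape, rate
S, T = 50, 10                          # sufficient statistics
a_post, b_post = alpha + S, beta + T   # posterior Gamma(52, 11)

# Posterior predictive is Negative Binomial with r = a_post and
# success probability p = b_post / (b_post + 1) in scipy's convention.
pred = stats.nbinom(a_post, b_post / (b_post + 1))
print(pred.mean())   # a_post / b_post = 52/11 ≈ 4.727
print(pred.var())    # mean * (b_post + 1) / b_post ≈ 5.157
```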
Key conceptual point
The posterior predictive is the unique object you should use for decision-making: it already accounts for parameter and latent-state uncertainty. The different computational strategies (closed-form conjugate, analytic Kalman-filter propagation, Monte Carlo integration, Sequential Monte Carlo) are simply ways to evaluate or approximate this integral.
Two canonical, illuminating cases give intuition and practical recipes: (A) conjugate observation models (e.g., Poisson-Gamma, Normal-Inverse-Gamma) and (B) linear Gaussian state-space models where Kalman filter analytic formulas produce predictive distributions conditional on parameters. Both yield explicit predictive forms that we can extend to parameter uncertainty by analytic marginalization or Monte Carlo.
A. Conjugate observation models (closed-form predictive)
If the observation model and prior are conjugate, integration over parameters yields closed-form posterior predictives. Example: Normal mean with unknown variance (Normal-Inverse-Gamma conjugacy). Suppose
y_t | μ, σ^2 ∼ N(μ, σ^2), with prior μ | σ^2 ∼ N(μ_0, σ^2/κ_0), σ^2 ∼ Inv-Gamma(α_0, β_0).
In this case the one-step posterior predictive is a Student-t:
y_{T+1} | y ∼ t_{2α_n}(μ_n, β_n(κ_n + 1)/(α_n κ_n)),
with posterior hyperparameters μ_n = (κ_0 μ_0 + T ȳ)/(κ_0 + T), κ_n = κ_0 + T, α_n = α_0 + T/2, and β_n = β_0 + ½ Σ_t (y_t − ȳ)^2 + κ_0 T (ȳ − μ_0)^2 / (2(κ_0 + T)). With a weak prior the location μ_n is close to the sample mean, the degrees of freedom are 2α_n, and the tails are heavier than Gaussian because uncertainty about σ^2 remains.
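The NIG update and Student-t predictive can be sketched as follows; the prior hyperparameters and the data vector are illustrative values, not from the text:

```python
import numpy as np
from scipy import stats

def nig_posterior_predictive(y, mu0=0.0, kappa0=1.0, a0=2.0, b0=2.0):
    """Normal-Inverse-Gamma update and one-step Student-t predictive.
    (mu0, kappa0, a0, b0) follow the standard NIG parameterization;
    the default prior values here are illustrative."""
    T, ybar = len(y), np.mean(y)
    mu_n = (kappa0 * mu0 + T * ybar) / (kappa0 + T)
    kappa_n = kappa0 + T
    a_n = a0 + T / 2
    b_n = (b0 + 0.5 * np.sum((y - ybar) ** 2)
           + kappa0 * T * (ybar - mu0) ** 2 / (2 * (kappa0 + T)))
    scale = np.sqrt(b_n * (kappa_n + 1) / (a_n * kappa_n))
    return stats.t(df=2 * a_n, loc=mu_n, scale=scale)

pred = nig_posterior_predictive(np.array([2.5, 3.1, 2.8, 3.0, 2.6]))
print(pred.mean(), pred.std())  # heavier tails than a plug-in Gaussian
```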
Why this matters: the predictive Student-t reflects uncertainty in both mean and variance; credible intervals are wider than plug-in intervals.
B. Linear-Gaussian state-space: Kalman predictive and marginalization over parameters
Consider the Dynamic Linear Model (DLM):
State evolution: x_{t+1} = G x_t + η_t, η_t ∼ N(0, W)
Observation: y_t = F x_t + ε_t, ε_t ∼ N(0, V)
If the parameters θ = (F, G, V, W) are known, the Kalman filter gives the predictive distribution for y_{T+1}:
y_{T+1} | y_{1:T}, θ ∼ N(F a_{T+1}, F R_{T+1} F' + V), with a_{T+1} = G m_T and R_{T+1} = G C_T G' + W built from the filtered moments (m_T, C_T).
Concrete numeric example: local level model with scalar state (G = 1), observations (F = 1), prior x_0 ∼ N(0, 1). After observing data y_{1:3} = (1.2, 0.9, 1.5) with known noise variances, the Kalman filter produces a filtered mean m_3 ≈ 1.2 and variance C_3 ≈ 0.4, and the one-step-ahead predictive mean for y_4 is 1.2 with variance C_3 + σ_w^2 + σ_v^2.
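A minimal sketch of the Kalman recursion for this local level example, assuming illustrative noise variances σ_w^2 = σ_v^2 = 0.5:

```python
import numpy as np

def local_level_predictive(y, m0=0.0, C0=1.0, sw2=0.5, sv2=0.5):
    """Kalman filter for the local level model; returns the one-step-ahead
    predictive mean and variance for y_{T+1}. sw2 is the state noise
    variance, sv2 the observation noise variance (illustrative values)."""
    m, C = m0, C0
    for yt in y:
        R = C + sw2               # predicted (prior) state variance
        K = R / (R + sv2)         # Kalman gain
        m = m + K * (yt - m)      # filtered mean
        C = (1 - K) * R           # filtered variance
    return m, (C + sw2) + sv2     # predictive mean and variance for y_{T+1}

m_pred, v_pred = local_level_predictive([1.2, 0.9, 1.5])
print(m_pred, v_pred)
```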
Marginalizing over parameters: combine Kalman conditionals with the posterior over parameters. If unknown variances have conjugate priors (Inverse-Gamma), the marginal predictive can be Student-t-like; otherwise sample θ^{(s)} from p(θ | y_{1:T}) (via MCMC) and average Kalman predictive densities:
p(y_{T+1} | y_{1:T}) ≈ (1/S) Σ_{s=1}^{S} p(y_{T+1} | y_{1:T}, θ^{(s)}).
This Monte Carlo marginalization is simple and effective: run MCMC for parameters, and for each draw run a Kalman update to compute the conditional predictive, then average or simulate draws.
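The marginalization step can be sketched by running the filter once per parameter draw and combining the conditional predictives with the law of total variance; the σ_v^2 draws below are placeholders standing in for MCMC output:

```python
import numpy as np

def filter_local_level(y, sv2, m0=0.0, C0=1.0, sw2=0.5):
    # Kalman recursion; returns conditional predictive mean and variance
    m, C = m0, C0
    for yt in y:
        R = C + sw2
        K = R / (R + sv2)
        m, C = m + K * (yt - m), (1 - K) * R
    return m, C + sw2 + sv2

draws = [0.3, 0.5, 0.8]           # hypothetical posterior samples of sv2
moments = [filter_local_level([1.2, 0.9, 1.5], s) for s in draws]
means = [mu for mu, _ in moments]
vars_ = [v for _, v in moments]
marg_mean = np.mean(means)                   # E over parameter draws
marg_var = np.mean(vars_) + np.var(means)    # E[Var] + Var[E]
print(marg_mean, marg_var)
```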
Tips for practice
Structural time series models and their expressiveness
A structural time series specifies interpretable latent components: level, trend, seasonality, regression effects, interventions, and time-varying coefficients. For example, a commonly used local linear trend plus seasonality model for scalar observations is built from a level μ_t, slope δ_t, seasonal γ_t (period s), and a regression term.
State vector: α_t = (μ_t, δ_t, γ_t, γ_{t−1}, …, γ_{t−s+2}) with evolution
μ_{t+1} = μ_t + δ_t + η_t^{μ}, δ_{t+1} = δ_t + η_t^{δ}, γ_{t+1} = −(γ_t + γ_{t−1} + ⋯ + γ_{t−s+2}) + η_t^{γ}.
Observation: y_t = μ_t + γ_t + x_t'β + ε_t, where x_t are regression covariates with coefficients β. This structure is richer than ARIMA: components have direct interpretation, which is invaluable in business contexts (trend vs seasonal vs promotion lift).
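The component structure maps directly onto state-space matrices; a sketch assembling G and F for a local linear trend plus a period-4 seasonal in dummy-variable form (an illustrative construction, with state layout [level, slope, s_t, s_{t−1}, s_{t−2}]):

```python
import numpy as np

period = 4
n = 2 + (period - 1)          # level + slope + (period-1) seasonal lags
G = np.zeros((n, n))
G[0, 0] = G[0, 1] = 1.0       # level_{t+1} = level_t + slope_t
G[1, 1] = 1.0                 # slope_{t+1} = slope_t
G[2, 2:] = -1.0               # s_{t+1} = -(s_t + s_{t-1} + s_{t-2})
for j in range(3, n):         # shift the stored seasonal lags down
    G[j, j - 1] = 1.0
F = np.zeros(n)
F[0] = F[2] = 1.0             # y_t = level_t + s_t + observation noise
print(G)
```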
Parameter uncertainty and hierarchical priors
In practice you often have multiple related time series (SKU/store pairs). In Hierarchical Bayesian Models we learned partial pooling — use a hierarchical prior on coefficients or innovation variances to borrow strength. For example, for a store-specific promotion effect β_s, set β_s ∼ N(μ_β, τ^2) with hyperpriors on (μ_β, τ^2). Posterior sharing reduces overfitting for low-data series while allowing high-data series to dominate.
Computational strategies: MCMC, SMC, and particle MCMC
Structural models are rarely fully conjugate. Two common approaches:
1) MCMC with forward-filter backward-sample (FFBS): sample parameters (e.g., variances, regression coefficients) using Metropolis or Gibbs; conditionally, sample the latent states via the Kalman smoother (or FFBS for linear-Gaussian) to produce draws from . Each sample yields a draw from the joint posterior; simulate future by forward-propagating the state and adding observation noise.
2) Sequential Monte Carlo / Particle Filters: useful for online updating and non-linear/non-Gaussian models. Particle MCMC (e.g., Particle Gibbs) provides exact (up to Monte Carlo error) draws from the joint posterior.
Numeric sketch: FFBS for DLM
Given parameter draw , run Kalman filter forward to obtain filtering distributions, then run backward sampling to draw . Forward-simulate and . Repeat across samples to obtain Monte Carlo predictive draws. Example numbers: 1000 posterior draws produce 1000 predictive draws, from which empirical predictive mean, 90% interval, and predictive quantiles are computed.
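The forward-simulation half of this recipe can be sketched as follows; the draws of the final state and variances below are placeholders standing in for actual FFBS/MCMC output:

```python
import numpy as np

rng = np.random.default_rng(0)

def predictive_draws(state_draws, sw2_draws, sv2_draws, h=3):
    """Propagate the local level model h steps ahead for each joint
    posterior draw and add observation noise at every horizon."""
    draws = []
    for x, sw2, sv2 in zip(state_draws, sw2_draws, sv2_draws):
        path = []
        for _ in range(h):
            x = x + rng.normal(0, np.sqrt(sw2))           # state evolution
            path.append(x + rng.normal(0, np.sqrt(sv2)))  # observation
        draws.append(path)
    return np.array(draws)

# 1000 placeholder posterior draws of (x_T, sw2, sv2)
sims = predictive_draws(rng.normal(1.2, 0.6, 1000),
                        np.full(1000, 0.5), np.full(1000, 0.5))
print(sims.mean(axis=0))                      # empirical predictive means
print(np.quantile(sims[:, 0], [0.05, 0.95]))  # 90% interval, horizon 1
```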
Bayesian model averaging (BMA) and model selection
BMA weights models by posterior model probability:
p(y_{T+1} | y) = Σ_m p(y_{T+1} | y, M_m) p(M_m | y),
with
p(M_m | y) ∝ p(y | M_m) p(M_m).
The marginal likelihood p(y | M_m) is often computed by Laplace approximation, bridge sampling, or thermodynamic integration. Numeric example: two models with log marginal likelihoods log p(y|M_1) = −120 and log p(y|M_2) = −125, equal priors p(M_1) = p(M_2) = ½. Then posterior odds favor M_1 by a factor e^5 ≈ 148.4, so p(M_1|y) ≈ 0.9933 and p(M_2|y) ≈ 0.0067.
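Turning log marginal likelihoods into posterior model probabilities is a log-sum-exp computation; a minimal sketch with equal model priors:

```python
import numpy as np

# Posterior model probabilities from log marginal likelihoods, computed
# stably by subtracting the max before exponentiating (log-sum-exp trick).
logml = np.array([-120.0, -125.0])
w = np.exp(logml - logml.max())
w /= w.sum()
print(w)   # ≈ [0.9933, 0.0067]
```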
Model averaging vs selection in forecasting
Model stacking (an alternative) optimizes weights to minimize a predictive loss (e.g., log-score) on holdout data rather than using marginal likelihoods; it often improves predictive performance in practice when models are misspecified.
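A stacking sketch: choose simplex weights maximizing the mean held-out log score of the weighted predictive mixture. The density matrix `dens` holds each model's predictive density at four holdout points — synthetic illustrative numbers, not from a fitted model:

```python
import numpy as np
from scipy.optimize import minimize

dens = np.array([[0.30, 0.05, 0.20, 0.25],   # model 1 densities at holdout
                 [0.10, 0.25, 0.15, 0.20]])  # model 2 densities at holdout

def neg_log_score(z):
    w = np.exp(z - z.max())
    w /= w.sum()                  # softmax keeps the weights on the simplex
    return -np.mean(np.log(w @ dens))

res = minimize(neg_log_score, np.zeros(2))
w = np.exp(res.x - res.x.max())
w /= w.sum()
print(w)   # stacking weights for models 1 and 2
```

Unlike BMA, these weights are tuned to out-of-sample predictive performance, which is why stacking often wins when all candidate models are misspecified.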
Practical warnings and diagnostics
Demand forecasting is the canonical application where Bayesian forecasting shines: decisions (ordering, pricing, promotions) depend on tails and asymmetric loss. Below I show concrete modeling patterns, loss-based decision rules, and numeric illustrations.
Count data and overdispersion: Poisson-Gamma & Negative Binomial predictive
For counts, start with Poisson likelihood and add hierarchical priors or latent overdispersion. The Gamma-Poisson conjugacy yields a Negative Binomial predictive that captures extra-Poisson variance due to parameter uncertainty.
Numeric worked mini-case: single SKU daily demand
Observations: 30 days with total sales S = 150 (mean 5/day). Prior λ ∼ Gamma(2, 1) (shape-rate). Posterior: λ | y ∼ Gamma(152, 31), posterior mean 152/31 ≈ 4.903. The one-day predictive is Negative Binomial with mean 4.903 and variance 4.903 × (1 + 1/31) ≈ 5.06. Compare to the plug-in Poisson variance 4.903: the predictive variance is slightly larger.
In-store hierarchy: partial pooling for items/stores
Suppose we have 50 stores and want per-store daily rate λ_s. A hierarchical model
y_{st} ∼ Poisson(λ_s), λ_s ∼ Gamma(α, β),
with hyperpriors on (α, β) implements partial pooling. In Hierarchical Bayesian Models we learned how posteriors shrink store estimates toward the chain mean proportionally to information; this reduces forecast variance for low-volume stores.
Promotion and regression effects
Add covariates (price, promotion flag) in a log-linear rate:
log λ_t = β_0 + β_1 · price_t + β_2 · promo_t.
Posterior uncertainty in β propagates into predictive uncertainty. To compute predictive uplift under a promotion, simulate posterior draws β^{(s)} and produce predictive draws for future periods under different covariate settings x_t — this yields posterior distributions over promotional lift.
Inventory decision example: Newsvendor loss
If stock level q incurs underage cost c_u per unit demanded but not stocked and overage cost c_o per leftover unit, the Bayes-optimal quantity q* minimizes posterior expected loss and corresponds to a posterior predictive quantile:
q* = inf{ q : F_pred(q) ≥ c_u/(c_u + c_o) }.
Numeric example: if the stockout and leftover costs give service quantile c_u/(c_u + c_o) = 0.85 (e.g., c_u = 17, c_o = 3), and the posterior predictive Negative Binomial CDF evaluated at q is 0.81 and at q + 1 is 0.87, choose q + 1 — the smallest stock level whose CDF reaches 0.85.
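The quantile rule is one line once the predictive distribution is in hand; a sketch using the SKU example's Negative Binomial predictive (posterior Gamma(152, 31)) and illustrative unit costs:

```python
from scipy import stats

c_u, c_o = 17.0, 3.0               # illustrative underage / overage costs
target = c_u / (c_u + c_o)         # 0.85 service quantile
pred = stats.nbinom(152, 31 / 32)  # r = 152, p = beta/(beta+1)
q = pred.ppf(target)               # smallest integer q with CDF(q) >= 0.85
print(target, q)
```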
Assessing and communicating uncertainty
Scaling considerations
Summary of applied recipe
1) Specify structural components you believe matter (trend, seasonality, covariates).
2) Choose priors that express domain knowledge; use hierarchical priors for pools of series.
3) Fit via MCMC/SMC/variational methods; validate predictive calibration on holdout data (coverage, PIT).
4) Use posterior predictive draws for decision tasks (newsvendor quantiles, expected costs, promotion uplift distributions).
This pipeline turns time-series forecasting into a decision-ready probabilistic system that explicitly balances bias-variance and model uncertainty.
You observe daily sales for an SKU over T=10 days with counts y: 6,4,7,3,5,6,8,4,3,4 (sum S=50). Prior for rate λ is Gamma(α=2, β=1) (shape-rate). Compute the posterior for λ and the predictive probability of seeing k=7 units tomorrow.
Compute sufficient statistic: S = sum y_t = 50, T = 10.
Posterior shape = α_post = α + S = 2 + 50 = 52. Posterior rate = β_post = β + T = 1 + 10 = 11. Therefore λ | y ∼ Gamma(52,11).
The posterior predictive for y_{T+1} is the Gamma-Poisson (Negative Binomial) mixture: for k≥0,
p(y_{T+1} = k | y) = [Γ(α_post + k) / (k! Γ(α_post))] · (β_post/(β_post + 1))^{α_post} · (1/(β_post + 1))^{k}.
Plug numbers for k=7: compute (β_post/(β_post+1))^{α_post} = (11/12)^{52}. Numerically (11/12)^{52} = exp(52 · ln(11/12)) ≈ exp(52 · (−0.087011)) ≈ exp(−4.5246) ≈ 0.0108.
Compute the combinatorial prefactor: Γ(52+7)/(7! Γ(52)) = (52·53·54·55·56·57·58)/7! (a Gamma ratio). The numerator product is ≈ 1.515×10^{12} and 7! = 5040, so the ratio is ≈ 3.007×10^{8}. Then multiply by (1/12)^7 = 1/35,831,808 ≈ 2.79×10^{-8}. Combining: 3.007×10^{8} × 2.79×10^{-8} × 0.0108 ≈ 8.39 × 0.0108 ≈ 0.0910.
Thus the posterior predictive probability of observing exactly 7 units tomorrow is approximately 0.091. The predictive mean is α_post/β_post = 52/11 ≈ 4.727, slightly below the empirical mean of 5 because the prior (mean 2) shrinks the estimate.
Insight: This example shows how conjugacy yields closed-form posterior and predictive distributions; the predictive variance exceeds the Poisson variance due to uncertainty in λ.
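As a numerical check of the hand computation above, scipy's Negative Binomial with r = 52 and p = 11/12 reproduces both the pmf at 7 and the predictive mean:

```python
from scipy import stats

# Posterior predictive NegBin(r = 52, p = 11/12) from the Gamma(52, 11) posterior
pred = stats.nbinom(52, 11 / 12)
print(pred.pmf(7))   # ≈ 0.091
print(pred.mean())   # 52/11 ≈ 4.727
```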
Local level model x_{t+1}=x_t+η_t, y_t=x_t+ε_t with η_t∼N(0,σ_w^2), ε_t∼N(0,σ_v^2). Prior x_0∼N(0,1). Observed y: 1.2, 0.9, 1.5 (T=3). Assume σ_w^2=0.5 known, but σ_v^2 unknown with prior Inverse-Gamma(α=3, β=2). Compute the one-step-ahead predictive distribution for y_4 marginalizing σ_v^2 approximately using conjugacy (Student-t predictive), and give predictive mean and variance numerically (approx).
Run Kalman filter treating σ_v^2 as symbolic to get filtering mean m_3 and variance C_3. With initial prior m_0=0, C_0=1, known σ_w^2=0.5: after 3 observations the filter gives m_3≈1.2 and C_3≈0.4 (these are approximate from standard recursion).
Conditional on σ_v^2, the one-step predictive for y_4 is Normal with mean m_3 and variance C_3+σ_v^2: y_4|σ_v^2,y ∼ N(m_3, C_3+σ_v^2).
Use conjugacy: prior for σ_v^2 is Inv-Gamma(α,β). The marginal predictive integrating σ_v^2 yields a Student-t with 2α degrees of freedom, location m_3, and scale depending on β and C_3. Specifically,
y_4 | y ∼ t_{2α}(m_3, β(1 + C_3)/α).
Plug numbers: α=3, β=2, m_3≈1.2, C_3≈0.4. Degrees of freedom = 6. Scale variance = β(1+C_3)/α = 2 × (1+0.4)/3 = 2.8/3 ≈ 0.9333. So the predictive is t_6(1.2, variance ≈ 0.9333).
The predictive mean equals m_3 = 1.2 (for df > 1). The predictive variance for the Student-t = scale × df/(df−2) = 0.9333 × 6/4 = 0.9333 × 1.5 = 1.4. So the predictive std ≈ 1.183. This matches intuition: the predictive variance equals C_3 plus the posterior mean of σ_v^2, with extra inflation from the finite degrees of freedom.
Insight: This example demonstrates how combining Kalman conditioning on latent state with conjugate priors on noise variance yields a Student-t posterior predictive that accounts for parameter uncertainty analytically.
Two competing models M1 and M2 are fit to the same time series. Log marginal likelihoods are: log p(y|M1) = -120, log p(y|M2) = -125. Prior model probabilities equal. Compute posterior model probabilities, then compute the BMA one-step predictive as weighted average of the models' predictive densities. Suppose model predictive means are μ1=10 (variance 4) and μ2=12 (variance 9); compute BMA predictive mean and approximate variance assuming independence conditional on model choice.
Compute Bayes factors: log BF_{1,2} = -120 - (-125) = 5. So BF_{1,2} = e^{5} ≈ 148.413.
With equal priors, posterior probability p(M1|y) = BF_{1,2} / (1 + BF_{1,2}) = 148.413 / 149.413 ≈ 0.9933. p(M2|y) ≈ 0.0067.
BMA predictive mean = p(M1|y)·μ1 + p(M2|y)·μ2 = 0.9933×10 + 0.0067×12 = 9.933 + 0.0804 = 10.0134 ≈ 10.01.
To approximate the predictive variance, use the law of total variance: Var(Y) = E[Var(Y|M)] + Var(E[Y|M]). Compute E[Var] = 0.9933×4 + 0.0067×9 = 3.9732 + 0.0603 = 4.0335.
Compute Var(E[Y|M]) = 0.9933(10-10.0134)^2 + 0.0067(12-10.0134)^2 ≈ 0.9933(0.00018) + 0.0067(3.944) ≈ 0.00018 + 0.0264 ≈ 0.0266. Total variance ≈ 4.0335 + 0.0266 ≈ 4.0601. Predictive std ≈ 2.015.
Thus BMA predictive mean ≈10.01 and variance ≈4.06, nearly the M1 forecast but with tiny additional variance from model uncertainty.
Insight: BMA can heavily weight a dominant model but still formally account for model uncertainty; when marginal likelihoods are close, BMA yields meaningful mixtures and often better calibration.
The posterior predictive p(y_{T+h}|y_{1:T}) is the decision-oriented object: always integrate over both parameter and latent-state uncertainty before making decisions.
Conjugate models (Gamma-Poisson, Normal-Inverse-Gamma) yield closed-form predictive distributions (Negative Binomial, Student-t) that transparently reflect parameter uncertainty; always compute these for sanity checks.
For linear Gaussian state-space models, Kalman filter gives conditional predictive distributions; marginalize parameters by Monte Carlo (draw θ, run Kalman) or analytically when conjugate priors exist.
Structural time series with interpretable components (trend, seasonality, regression effects) are especially useful in business forecasting and should be combined with hierarchical priors for many related series.
Bayesian model averaging uses marginal likelihoods to weight models and produces better-calibrated predictive distributions than selecting a single model in the presence of model uncertainty; stacking is a pragmatic alternative optimizing predictive scores.
Use posterior predictive draws for decision problems (newsvendor quantiles, expected stockouts, stochastic optimization); the optimal action is often a quantile of the predictive.
Calibration checks (PIT, predictive coverage, scoring rules like log score and CRPS) are essential to validate probabilistic forecasts.
Plug-in forecasting: using point estimates of parameters (e.g., MLE or posterior mean) and ignoring parameter uncertainty — this typically underestimates predictive variance and leads to overconfident decisions.
Misinterpreting posterior predictive samples as independent when they are correlated via shared parameter draws — when computing effective quantiles or intervals, use the empirical distribution of independent predictive draws (one per posterior sample) rather than naive bootstrap of conditional draws.
Using marginal likelihoods without checking sensitivity to priors — marginal likelihoods can be strongly influenced by diffuse priors; prefer predictive scoring on holdout data or use robust priors when computing BMA weights.
Confusing observation overdispersion with time-varying mean — model choice between a Negative Binomial (overdispersion) and a structural state evolution (latent dynamics) should be guided by residual diagnostics and PIT plots.
Easy — Conjugate Poisson-Gamma predictive: You observe T=5 days with counts 2,3,1,4,2 (S=12). Prior for λ is Gamma(α=1.5, β=0.5). Compute the posterior for λ and the posterior predictive mean and variance for next-day demand.
Hint: Posterior shape = α+S, rate = β+T; predictive mean = (α+S)/(β+T), predictive variance = mean + mean^2/(α+S).
S=12, posterior shape = 1.5+12=13.5, posterior rate = 0.5+5=5.5. Predictive mean = 13.5/5.5 ≈ 2.4545. Predictive variance = mean + mean^2/(13.5) ≈ 2.4545 + (6.024)/(13.5) ≈ 2.4545 + 0.446 ≈ 2.9005.
Medium — Kalman predictive with parameter draws: Consider a scalar local level model with unknown observation variance σ_v^2 and known state noise σ_w^2=0.2. You have posterior samples for σ_v^2: {0.5, 1.0, 1.5}. For each σ_v^2 sample, the Kalman one-step conditional predictive variance after filtering is C_3+σ_v^2 where C_3=0.3 and predictive mean m_3=2.0. Compute the Monte Carlo approximation to the marginal predictive mean and variance using these three parameter draws (equal weights).
Hint: Average the conditional predictive means (all equal) for marginal mean; use law of total variance for marginal variance: E[Var]+Var(E[·|σ]).
Conditional predictive means are all 2.0, so marginal mean = 2.0. Conditional variances are 0.3+0.5=0.8, 0.3+1.0=1.3, 0.3+1.5=1.8. E[Var]=mean of these = (0.8+1.3+1.8)/3=3.9/3=1.3. Var(E[·|σ])=Var(2.0,2.0,2.0)=0. Total marginal variance = 1.3. So predictive mean 2.0 and variance 1.3.
Hard — Hierarchical count forecasting with partial pooling: You have counts for two stores over 4 days each: store A: 10,12,9,11 (sum S_A=42), store B: 1,0,2,1 (sum S_B=4). Model y_{it}∼Pois(λ_i) and λ_i∼Gamma(α,β) with hyperprior α=2 fixed, but unknown β with prior p(β)∝1/β (improper Jeffreys). Derive the joint posterior up to proportionality for (λ_A,λ_B,β) and describe how you would compute posterior predictive for day 5 for both stores using MCMC. (You are not required to run code, but show formulas and sampling scheme.)
Hint: Write likelihoods for each store and the hierarchical prior; marginal posterior for β can be targeted with MCMC; conditional posteriors for λ_i given β are Gamma (due to conjugacy).
Likelihood: p(y|λ) = Π_i Π_t e^{−λ_i} λ_i^{y_{it}} / y_{it}! ∝ Π_i λ_i^{S_i} e^{−T λ_i}. Prior: p(λ_i|β) = [β^{α}/Γ(α)] λ_i^{α−1} e^{−β λ_i}. Joint posterior (up to a constant):
p(λ_A, λ_B, β | y) ∝ p(β) Π_i [β^{α} λ_i^{S_i} e^{−T λ_i} λ_i^{α−1} e^{−β λ_i}] = β^{2α−1} Π_i λ_i^{S_i+α−1} e^{−(T+β)λ_i},
where the β^{2α} factor comes from the two Gamma prior normalizing constants and p(β) = 1/β contributes β^{−1}.
So conditionally λ_i | β, y ∼ Gamma(shape = S_i + α, rate = T + β). For our numbers: T = 4, α = 2 gives λ_A | β ∼ Gamma(42+2 = 44, 4+β), λ_B | β ∼ Gamma(4+2 = 6, 4+β).
The marginal posterior for β (up to a constant) after integrating out the λ_i is:
p(β|y) ∝ β^{2α−1} Π_i [Γ(S_i+α) / (T+β)^{S_i+α}].
Explicitly: p(β|y) ∝ β^{3} (4+β)^{−44} (4+β)^{−6} = β^{3}(4+β)^{−50}, since the Γ(·) factors are constant in β. Sample β via Metropolis-Hastings or slice sampling on β > 0.
Gibbs-within-MH scheme:
1) Given current β, sample λ_A ∼ Gamma(44, 4+β), λ_B ∼ Gamma(6, 4+β).
2) Update β with Metropolis-Hastings proposing β' (e.g., log-normal random walk) targeting p(β|y) ∝ β^{3}(4+β)^{−50}.
Posterior predictive for day 5 for store i: simulate λ_i^{(s)} from posterior draws and then simulate y_{i,5}^{(s)} ∼ Poisson(λ_i^{(s)}) to get predictive samples. From these, compute predictive mean and credible intervals.
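The sampling scheme above can be sketched compactly; note that the β target keeps the β^{2α} factor from the two Gamma prior normalizing constants, so the log density is (2α − 1) log β − 50 log(4 + β) up to a constant:

```python
import numpy as np

rng = np.random.default_rng(2)

alpha, T = 2, 4
S = np.array([42, 4])            # store A, store B

def log_post_beta(b):
    # log p(beta | y) up to a constant: (2*alpha - 1) log b - 50 log(T + b)
    return (2 * alpha - 1) * np.log(b) - (S.sum() + 2 * alpha) * np.log(T + b)

beta, draws = 1.0, []
for _ in range(4000):
    # MH update for beta with a log-normal random walk; the proposal's
    # asymmetry adds log(prop/beta) to the log acceptance ratio
    prop = beta * np.exp(rng.normal(0.0, 0.5))
    log_acc = log_post_beta(prop) - log_post_beta(beta) + np.log(prop / beta)
    if np.log(rng.uniform()) < log_acc:
        beta = prop
    # Gibbs step: lambda_i | beta, y ~ Gamma(S_i + alpha, rate = T + beta)
    lam = rng.gamma(S + alpha, 1.0 / (T + beta))
    # posterior predictive for day 5: one Poisson draw per store
    draws.append(rng.poisson(lam))

draws = np.array(draws[500:])    # drop burn-in
print(draws.mean(axis=0))        # predictive means, stores A and B
```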
Looking back: This lesson builds directly on the prerequisites. From State-Space Models we used the Kalman filter and smoothing for conditional latent-state inference and built predictive formulas for linear-Gaussian models; the FFBS sampler and forward propagation are direct descendants of those algorithms. From Hierarchical Bayesian Models we used partial pooling to borrow strength across series via hierarchical priors on rates or regression coefficients. From Conjugate Priors we exploited Gamma-Poisson and Normal-Inverse-Gamma conjugacy to derive closed-form posterior predictives (Negative Binomial and Student-t), which serve as both computational shortcuts and diagnostic checks.
Looking forward: mastering Bayesian forecasting enables rigorous downstream tasks — stochastic inventory optimization (newsvendor and multi-period ordering), Bayesian decision analysis for promotions and pricing, probabilistic supply chain simulation, causal impact evaluation with state-space counterfactuals, and automated model combination (stacking) pipelines in production. Specific downstream methods that rely on these techniques include probabilistic machine learning methods for hierarchical time series (forecast reconciliation with Bayesian priors), sequential decision making under uncertainty (POMDPs), and Bayesian experimental design for pricing or promotion trials where posterior predictive simulations guide sample-size planning.
Practical deployment often requires combining these inferential foundations with computational tools: particle filtering for online updates, variational inference or amortized MCMC for scaling to thousands of series, and robust predictive scoring and calibration routines for continual monitoring.