Kalman filter, hidden Markov models. Latent state estimation via filtering and smoothing. Forward-backward algorithm.
State-space models let you turn messy time-series and latent-process problems into modular, solvable inference tasks — they are the lingua franca of tracking, econometrics, and probabilistic control.
State-space models formalize a latent (Markov) state driving observed data; Kalman filtering and the forward-backward (smoothing) algorithms give optimal filtering and retrospective estimates for linear-Gaussian and discrete-HMM cases respectively.
A state-space model (SSM) is a probabilistic framework that explicitly separates an unobserved internal state (the "latent" or "hidden" state) from noisy observations. This gives two advantages: (1) you model dynamics in the hidden, typically lower-dimensional state, and (2) you perform principled inference on that state using the observation stream. Formally, in discrete time, an SSM is specified by a state (transition) model and an observation (emission) model. Two canonical families are:

1) Linear-Gaussian SSM (LGSSM):

$$x_t = A x_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, Q), \qquad y_t = C x_t + v_t, \quad v_t \sim \mathcal{N}(0, R),$$

where $x_t$ is the latent state, $y_t$ is the observation, $A$ and $C$ are matrices, and $Q$ and $R$ are covariance matrices. Example: an AR(1) process from "Time Series Foundations" can be written as a one-dimensional LGSSM with $A = \phi$, $C = 1$, and suitable noise variances; e.g., for AR(1) with coefficient $\phi$, process noise variance $\sigma_w^2$, and observation noise variance $\sigma_v^2$, set $A = \phi$, $C = 1$, $Q = \sigma_w^2$, $R = \sigma_v^2$.

2) Hidden Markov model (HMM):

$$P(x_t = j \mid x_{t-1} = i) = A_{ij}, \qquad y_t \sim p(y_t \mid x_t),$$

where $x_t$ takes values in a finite set. Example: a two-state HMM with a $2 \times 2$ transition matrix and emission probabilities (e.g., categorical or Gaussian) is a simple discrete SSM.
Why build SSMs? From a modeling perspective, many time-series phenomena are naturally explained by a latent quantity evolving over time (hidden economy, object position, unobserved volatility). From an algorithmic perspective, the Markov structure (in the latent space) allows recursive, linear-time inference via dynamic programming: filtering (online) and smoothing (offline). In "Markov Chains" we learned how the memoryless property simplifies analysis; SSMs leverage that same idea: the state at time $t$ depends on the past only through $x_{t-1}$. In "Bayesian Inference" we learned to update beliefs via priors and likelihoods; the Kalman filter and forward-backward algorithm are recursive Bayesian updates tailored to sequential data.
Key inference questions in SSMs: filtering (compute $p(x_t \mid y_{1:t})$ online), smoothing (compute $p(x_t \mid y_{1:T})$ using all data), prediction (forecast $p(x_{t+k} \mid y_{1:t})$), and likelihood evaluation $p(y_{1:T})$ for parameter learning.
Concrete small numeric example to anchor the notation: consider a 1D LGSSM with $A = 0.9$, $C = 1$, $Q = 1$, $R = 2$, and initial prior $x_0 \sim \mathcal{N}(0, 1)$. You observe $y_1 = 1.5$. The filtering problem asks for $p(x_1 \mid y_1)$; the Kalman filter will give this in closed form (derived in the next section) and will produce a Gaussian with explicit mean and variance computed from the quantities above.
Intuition: the transition matrix $A$ controls how much we trust the past state to predict the next; the process noise $Q$ injects uncertainty (how much the state can change unpredictably); the observation matrix $C$ and noise $R$ govern how informative each measurement is. When $R$ is large relative to $Q$, measurements are noisy and the filter relies more on predictions; when $Q$ is large, the filter trusts new observations more.
The Kalman filter is the closed-form optimal Bayesian filter for linear-Gaussian SSMs. It computes, for each $t$, the Gaussian posterior $p(x_t \mid y_{1:t}) = \mathcal{N}(\hat x_{t|t}, P_{t|t})$ by alternating a prediction step and an update (correction) step.
Equations (matrix form).
Prediction (time update):

$$\hat x_{t|t-1} = A \hat x_{t-1|t-1}, \qquad P_{t|t-1} = A P_{t-1|t-1} A^\top + Q.$$

Update (measurement update):

$$\tilde y_t = y_t - C \hat x_{t|t-1}, \qquad S_t = C P_{t|t-1} C^\top + R, \qquad K_t = P_{t|t-1} C^\top S_t^{-1},$$
$$\hat x_{t|t} = \hat x_{t|t-1} + K_t \tilde y_t, \qquad P_{t|t} = (I - K_t C) P_{t|t-1},$$

where $\tilde y_t$ is the innovation, $S_t$ the innovation covariance, and $K_t$ the Kalman gain.
Derivation idea (sketch): the posterior is proportional to prior times likelihood. With Gaussians, multiplying the two exponentials gives another Gaussian. Completing the square yields the posterior mean and covariance above; the Kalman gain $K_t$ is the coefficient that balances prior uncertainty against observation uncertainty. This is a direct application of "Bayesian Inference" where the prior is the predictive Gaussian $\mathcal{N}(\hat x_{t|t-1}, P_{t|t-1})$ and the likelihood is Gaussian centered at $C x_t$.
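For the scalar case with $C = 1$, the completing-the-square step can be written out explicitly; a sketch consistent with the update equations:

```latex
p(x_t \mid y_{1:t})
  \propto \exp\!\Big(-\tfrac{(x_t - \hat x_{t|t-1})^2}{2 P_{t|t-1}}\Big)
          \exp\!\Big(-\tfrac{(y_t - x_t)^2}{2R}\Big)
  \;\Rightarrow\;
  \frac{1}{P_{t|t}} = \frac{1}{P_{t|t-1}} + \frac{1}{R}, \qquad
  \hat x_{t|t} = P_{t|t}\Big(\frac{\hat x_{t|t-1}}{P_{t|t-1}} + \frac{y_t}{R}\Big)
  = \hat x_{t|t-1} + K_t\,(y_t - \hat x_{t|t-1}),
  \quad K_t = \frac{P_{t|t-1}}{P_{t|t-1} + R}.
```

The posterior precision is the sum of the two precisions, and rearranging the precision-weighted mean recovers exactly the gain form of the update.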
Numeric 1D worked mini-example (showing every arithmetic step). Keep everything scalar so matrix operations reduce to multiplication.
Model parameters: $A = 0.9$, $C = 1$, $Q = 1$, $R = 2$; prior $x_0 \sim \mathcal{N}(0, 1)$; observation $y_1 = 1.5$.
Step 1 — Predict to time 1: $\hat x_{1|0} = A \hat x_{0|0} = 0.9 \times 0 = 0$; $P_{1|0} = A^2 P_0 + Q = 0.81 \times 1 + 1 = 1.81$.
Step 2 — Compute innovation covariance and Kalman gain: $S_1 = C^2 P_{1|0} + R = 1.81 + 2 = 3.81$; $K_1 = P_{1|0} C / S_1 = 1.81 / 3.81 \approx 0.475$.
Step 3 — Update state mean and covariance with observation $y_1 = 1.5$: innovation $\tilde y_1 = 1.5 - 0 = 1.5$; $\hat x_{1|1} = 0 + 0.475 \times 1.5 \approx 0.7125$; $P_{1|1} = (1 - K_1) P_{1|0} = 0.525 \times 1.81 \approx 0.95$.
Interpretation: the posterior mean is between the prediction (0) and the observation (1.5), weighted by the Kalman gain, which depends on the relative uncertainties. The posterior variance decreased from $1.81$ (predictive) to $0.95$ (updated): adding the observation reduced uncertainty.
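The arithmetic above is easy to check in code. A minimal sketch in plain Python, using parameter values consistent with the arithmetic of this example ($A = 0.9$, $C = 1$, $Q = 1$, $R = 2$, prior $\mathcal{N}(0,1)$, $y_1 = 1.5$):

```python
# One scalar Kalman predict/update step for the worked mini-example:
# x_t = A x_{t-1} + w, w ~ N(0, Q);  y_t = C x_t + v, v ~ N(0, R)
A, C, Q, R = 0.9, 1.0, 1.0, 2.0
m0, P0 = 0.0, 1.0                  # prior mean and variance
y1 = 1.5                           # first observation

# Predict (time update)
m_pred = A * m0                    # 0.0
P_pred = A * P0 * A + Q            # 1.81

# Update (measurement update)
S = C * P_pred * C + R             # innovation covariance, 3.81
K = P_pred * C / S                 # Kalman gain, ~0.475
innov = y1 - C * m_pred            # innovation, 1.5
m_filt = m_pred + K * innov        # ~0.7126
P_filt = (1 - K * C) * P_pred      # ~0.950
```

Every line is a direct transcription of one step of the predict/update equations; in 1D the matrix products reduce to ordinary multiplication.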
Multiple-step operation and relation to "Time Series Foundations": if you write an AR(1) time series $x_t = \phi x_{t-1} + \varepsilon_t$, an equivalent LGSSM can represent the AR(1) variable as the hidden state; the Kalman filter then provides the optimal linear minimum-MSE estimator for missing values or one-step-ahead forecasts. The Kalman filter is thus the dynamic generalization of the Gaussian conjugate update we saw in "Bayesian Inference".
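A quick sanity check of this equivalence: when observations are (nearly) exact, the filter's one-step-ahead forecast should collapse to the direct AR(1) forecast $\phi y_t$. A sketch with illustrative parameter values:

```python
# AR(1) x_t = phi*x_{t-1} + w observed (nearly) exactly: the Kalman
# one-step-ahead forecast should match the AR(1) forecast phi * y_t.
phi, Q, R = 0.7, 1.0, 1e-10   # R ~ 0: observations are (almost) exact
m, P = 0.0, 1.0
ys = [0.5, -0.2, 0.9]
for y in ys:
    # predict
    m, P = phi * m, phi * P * phi + Q
    # update (C = 1)
    S = P + R
    K = P / S                 # gain ~ 1 when R ~ 0
    m, P = m + K * (y - m), (1 - K) * P

forecast = phi * m            # Kalman one-step-ahead forecast
# with R ~ 0 the filtered mean ~ last observation, so forecast ~ phi*ys[-1]
```

With a larger $R$ the filtered mean blends past and present observations, and the Kalman forecast smooths out observation noise that the naive AR(1) forecast would pass through.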
Important practical points:
The Kalman filter costs $O(n^3)$ per time step in the state dimension $n$, dominated by the matrix inversion, but for moderate state dimensions it is efficient and exact.
Filtering answers "what is the current state given the past?" Smoothing answers "given the entire dataset, what was the state at time t?" Smoothing reduces posterior variance by incorporating future observations. Two complementary algorithms handle the main SSM families.
1) Rauch–Tung–Striebel (RTS) smoother — linear Gaussian case
After running the Kalman filter forward and storing $\hat x_{t|t}$, $P_{t|t}$ and the predictive $\hat x_{t+1|t}$, $P_{t+1|t}$ for $t = 1, \dots, T$, the RTS smoother performs a backward pass to compute $\hat x_{t|T}$, $P_{t|T}$ via

$$G_t = P_{t|t} A^\top P_{t+1|t}^{-1}, \qquad \hat x_{t|T} = \hat x_{t|t} + G_t (\hat x_{t+1|T} - \hat x_{t+1|t}), \qquad P_{t|T} = P_{t|t} + G_t (P_{t+1|T} - P_{t+1|t}) G_t^\top.$$

Derivation sketch: the smoothed estimates follow from the conditional Gaussian identity for jointly Gaussian vectors conditioned on all observations: one expresses $p(x_t \mid x_{t+1}, y_{1:T}) = p(x_t \mid x_{t+1}, y_{1:t})$ (by the Markov property) and uses the linear-Gaussian conditional formulas. The matrix $G_t$ is the smoothing gain, analogous to the Kalman gain $K_t$ but applied backward.
Numeric RTS example (2-step) to illustrate arithmetic. Use the previous 1D example with parameters $A = 0.9$, $C = 1$, $Q = 1$, $R = 2$ and two observations $y_1 = 1.5$, $y_2$. From Section 2 we computed $\hat x_{1|1} \approx 0.7125$, $P_{1|1} \approx 0.95025$, and the predictions $\hat x_{2|1} = 0.9 \times 0.7125 \approx 0.6413$, $P_{2|1} = 0.81 \times 0.95025 + 1 \approx 1.7697$. After running the Kalman update at time 2, suppose we compute $\hat x_{2|2}$ and $P_{2|2} \approx 0.9088$ (numbers illustrative). Then the backward step for $t = 1$:

$$G_1 = P_{1|1} A / P_{2|1} = 0.95025 \times 0.9 / 1.7697 \approx 0.4833,$$
$$P_{1|2} = P_{1|1} + G_1^2 (P_{2|2} - P_{2|1}) \approx 0.95025 + 0.4833^2 \times (0.9088 - 1.7697) \approx 0.7492,$$
$$\hat x_{1|2} = \hat x_{1|1} + G_1 (\hat x_{2|2} - \hat x_{2|1}).$$
The smoothed mean moved toward values that explain future data; variance decreased from $0.95025$ to $0.7492$.
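The forward filter plus RTS backward pass fits in a few lines of scalar code. A sketch using the same assumed parameters ($A = 0.9$, $C = 1$, $Q = 1$, $R = 2$); the second observation value is illustrative:

```python
# Scalar Kalman forward pass + Rauch-Tung-Striebel backward smoothing.
A, C, Q, R = 0.9, 1.0, 1.0, 2.0
m, P = 0.0, 1.0                    # prior at t = 0
ys = [1.5, 2.0]                    # observations (second value illustrative)

filt, pred = [], []                # filtered and predictive (mean, var) pairs
for y in ys:
    mp, Pp = A * m, A * P * A + Q  # predict
    S = C * Pp * C + R
    K = Pp * C / S
    m, P = mp + K * (y - C * mp), (1 - K * C) * Pp
    pred.append((mp, Pp))
    filt.append((m, P))

# Backward (RTS) pass: smoothed estimate at t = T equals the filtered one
ms, Ps = filt[-1]
smooth = [(ms, Ps)]
for t in range(len(ys) - 2, -1, -1):
    mf, Pf = filt[t]
    mp, Pp = pred[t + 1]
    G = Pf * A / Pp                        # smoothing gain
    ms = mf + G * (ms - mp)
    Ps = Pf + G * G * (Ps - Pp)
    smooth.insert(0, (ms, Ps))
```

A useful invariant to check: the smoothed variance at any earlier time never exceeds the filtered variance, since smoothing only adds information.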
2) Forward–Backward algorithm — discrete HMM case
For finite-state HMMs, smoothing is performed by the forward–backward algorithm (a dynamic-programming factorization). Define the forward variable $\alpha_t(i) = p(y_{1:t}, x_t = i)$ and the backward variable $\beta_t(i) = p(y_{t+1:T} \mid x_t = i)$. Then the smoothed state posterior is

$$\gamma_t(i) = p(x_t = i \mid y_{1:T}) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_j \alpha_t(j)\,\beta_t(j)}.$$

Recursions:

$$\alpha_t(j) = b_j(y_t) \sum_i \alpha_{t-1}(i)\,A_{ij}, \qquad \beta_t(i) = \sum_j A_{ij}\,b_j(y_{t+1})\,\beta_{t+1}(j),$$

with initialization $\alpha_1(i) = \pi_i\,b_i(y_1)$ and $\beta_T(i) = 1$, where $A_{ij}$ is the transition probability and $b_j(\cdot)$ the emission probability.
Concrete discrete example: a two-state HMM with a $2 \times 2$ transition matrix $A_{ij}$ and emission probabilities $b_1(H)$, $b_2(H)$ (H = "High"). Observations: [H, L], where L denotes Low with complementary probabilities $b_i(L) = 1 - b_i(H)$. Compute the forward $\alpha$ and backward $\beta$ variables and then $\gamma_t(i)$; make sure to renormalize at each step to avoid underflow. Include scaling factors when implementing for long sequences.
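A compact implementation of the scaled forward–backward pass for a two-state HMM; the numeric values of the initial distribution, transitions, and emissions below are illustrative, not taken from the example above:

```python
# Scaled forward-backward for a 2-state HMM (illustrative numbers).
pi = [0.5, 0.5]                        # initial distribution (assumed)
A  = [[0.9, 0.1], [0.2, 0.8]]          # transition matrix (assumed)
bH = [0.8, 0.3]                        # P(H | state); P(L | state) = 1 - bH
obs = ["H", "L"]

def emis(i, o):
    return bH[i] if o == "H" else 1.0 - bH[i]

# Forward pass with per-step normalization (scaling) to avoid underflow
alphas, scales = [], []
for t, o in enumerate(obs):
    if t == 0:
        a = [pi[i] * emis(i, o) for i in range(2)]
    else:
        prev = alphas[-1]
        a = [emis(j, o) * sum(prev[i] * A[i][j] for i in range(2))
             for j in range(2)]
    c = sum(a)                         # scaling factor
    alphas.append([x / c for x in a])
    scales.append(c)

# Backward pass reusing the forward scaling factors
betas = [[1.0, 1.0]]
for t in range(len(obs) - 2, -1, -1):
    nxt = betas[0]
    b = [sum(A[i][j] * emis(j, obs[t + 1]) * nxt[j] for j in range(2))
         / scales[t + 1] for i in range(2)]
    betas.insert(0, b)

# Smoothed marginals gamma_t(i) proportional to alpha_t(i) * beta_t(i)
gammas = []
for a, b in zip(alphas, betas):
    g = [a[i] * b[i] for i in range(2)]
    z = sum(g)
    gammas.append([x / z for x in g])
```

The per-step scaling keeps every stored quantity in [0, 1]; the log of the data likelihood is recoverable as the sum of the logs of the scaling factors.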
Relationship between the two: both RTS and forward–backward perform the same logical operation (compute the posterior over states given all data) but are specialized to linear-Gaussian continuous states and discrete finite states respectively. The backward gains $G_t$ and the backward variables $\beta_t$ play analogous roles.
Smoothing is essential for parameter learning (EM): the E-step requires expected sufficient statistics like $E[x_t x_t^\top \mid y_{1:T}]$ or $E[x_t x_{t-1}^\top \mid y_{1:T}]$, which are computed from smoothed marginals and pairwise smoothed distributions (for HMMs, $\xi_t(i,j) = p(x_t = i, x_{t+1} = j \mid y_{1:T})$).
State-space methods are a crossroads connecting "Time Series Foundations", "Markov Chains", and "Bayesian Inference" to many applied domains. Here are concrete applications, practical considerations, and downstream connections that rely on understanding SSMs.
Major application areas with concrete examples: object tracking (estimating position and velocity from noisy sensor readings), econometrics (hidden economic state, unobserved volatility), and probabilistic control and robotics (navigation, SLAM).
Downstream algorithms that require SSM understanding: EM parameter learning (Baum–Welch for HMMs, EM for LGSSMs), particle filtering for nonlinear/non-Gaussian models, and model-based control (LQG).
Practicalities and numerical issues: covariance updates can lose symmetry or positive-definiteness under rounding (use the Joseph form), forward variables underflow on long sequences (use scaling or log-space arithmetic), and observability/identifiability should be checked before fitting.
Concrete numeric tie-in: the log-likelihood of observations under the LGSSM can be computed incrementally from the Kalman filter via

$$\log p(y_{1:T}) = -\frac{1}{2} \sum_{t=1}^{T} \left[ d \log 2\pi + \log |S_t| + \tilde y_t^\top S_t^{-1} \tilde y_t \right],$$

where $d$ is the observation dimension. In practice this is used in maximum-likelihood parameter estimation; substitute the numeric $\tilde y_t$ and $S_t$ from the filter at each step.
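This accumulation drops directly out of the filter loop; a scalar sketch with illustrative parameters and observations:

```python
import math

# Incremental log-likelihood of a scalar LGSSM from the Kalman filter:
# each step contributes log N(y_t | C*m_pred, S_t).
A, C, Q, R = 0.9, 1.0, 1.0, 2.0    # illustrative parameters
m, P = 0.0, 1.0
ys = [1.5, 0.3, -0.7]

loglik = 0.0
for y in ys:
    mp, Pp = A * m, A * P * A + Q          # predict
    S = C * Pp * C + R                     # innovation covariance
    innov = y - C * mp                     # innovation
    loglik += -0.5 * (math.log(2 * math.pi) + math.log(S) + innov**2 / S)
    K = Pp * C / S                         # update
    m, P = mp + K * innov, (1 - K * C) * Pp
```

Wrapping this loop in a function of $(A, Q, R)$ and handing it to a numerical optimizer is the standard route to maximum-likelihood parameter estimates for an LGSSM.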
Summary: mastering Kalman filtering and forward–backward smoothing gives you powerful, computationally efficient tools to estimate hidden dynamics and to build higher-level algorithms for learning and control. In the next section (worked examples) we will step through concrete instances to cement arithmetic and intuition.
Parameters: $A$, $C$, $Q$, $R$. Prior: $\hat x_{0|0} = m_0$, $P_{0|0} = P_0$. Observation: $y_1$. Compute $\hat x_{1|1}$ and $P_{1|1}$.
Prediction: $\hat x_{1|0} = A m_0$.
Predictive covariance: $P_{1|0} = A P_0 A^\top + Q$.
Innovation covariance: $S_1 = C P_{1|0} C^\top + R$.
Kalman gain: $K_1 = P_{1|0} C^\top S_1^{-1}$.
Innovation: $\tilde y_1 = y_1 - C \hat x_{1|0}$.
Updated mean: $\hat x_{1|1} = \hat x_{1|0} + K_1 \tilde y_1$.
Updated covariance: $P_{1|1} = (I - K_1 C) P_{1|0}$.
Insight: This example shows how the Kalman gain depends on the relative sizes of the predictive uncertainty $P_{1|0}$ and the measurement noise $R$: a larger $R$ would reduce $K_1$, pulling the posterior mean closer to the prediction.
1D model: $A = 0.9$, $C = 1$, $Q = 1$, $R = 2$. Observations: $y_1 = 1.5$, $y_2$, $y_3$. Start with $\hat x_{0|0} = 0$, $P_{0|0} = 1$. Compute $\hat x_{2|3}$ and $P_{2|3}$ (smoothed estimates for time 2) using a forward Kalman pass and backward RTS smoothing.
Forward pass step 1 (t=1): match Section 2 calculations to get $\hat x_{1|1} \approx 0.7125$, $P_{1|1} \approx 0.95$, $\hat x_{2|1} \approx 0.6413$, $P_{2|1} \approx 1.7697$.
Update at t=2 with $y_2$: $S_2 = P_{2|1} + R \approx 3.7697$, $K_2 = P_{2|1}/S_2 \approx 0.4694$, innovation $\tilde y_2 = y_2 - 0.6413$, so $\hat x_{2|2} = 0.6413 + 0.4694\,\tilde y_2$, $P_{2|2} = (1 - K_2) P_{2|1} \approx 0.9390$.
Predict to t=3: $\hat x_{3|2} = 0.9\,\hat x_{2|2}$, $P_{3|2} = 0.81 \times 0.9390 + 1 \approx 1.7606$.
Update at t=3 with $y_3$: $S_3 \approx 3.7606$, $K_3 \approx 0.4682$, innovation $\tilde y_3 = y_3 - \hat x_{3|2}$, so $\hat x_{3|3} = \hat x_{3|2} + K_3\,\tilde y_3$, $P_{3|3} = (1 - K_3) P_{3|2} \approx 0.9363$.
Backward RTS for t=2: compute $G_2 = P_{2|2} A / P_{3|2} \approx 0.9390 \times 0.9 / 1.7606 \approx 0.4800$.
Smoothed mean: $\hat x_{2|3} = \hat x_{2|2} + G_2 (\hat x_{3|3} - \hat x_{3|2})$.
Smoothed covariance: $P_{2|3} = P_{2|2} + G_2^2 (P_{3|3} - P_{3|2}) \approx 0.9390 + 0.2304 \times (0.9363 - 1.7606) \approx 0.749$.
Insight: Smoothing moved the estimate at t=2 towards values that better explain the later observation at t=3 and reduced variance. This illustrates how future data refines past beliefs.
Two states $\{1, 2\}$, initial distribution $\pi = (\pi_1, \pi_2)$, transition matrix with entries $A_{ij}$. Emission probabilities for observation $O \in \{A, B\}$: $b_1(\cdot)$ and $b_2(\cdot)$. Observed sequence: [A, B]. Compute the smoothed marginals $\gamma_1(i)$ and $\gamma_2(i)$.
Forward t=1: $\alpha_1(1) = \pi_1 b_1(A)$, $\alpha_1(2) = \pi_2 b_2(A)$. Normalize by $\alpha_1(1) + \alpha_1(2)$ if desired; unnormalized values suffice for ratios.
Forward t=2: compute the predictive sums: for state 1: $\alpha_1(1) A_{11} + \alpha_1(2) A_{21}$. Multiplying by the emission gives $\alpha_2(1) = b_1(B)\,[\alpha_1(1) A_{11} + \alpha_1(2) A_{21}]$. For state 2: $\alpha_1(1) A_{12} + \alpha_1(2) A_{22}$. Multiplying by the emission gives $\alpha_2(2) = b_2(B)\,[\alpha_1(1) A_{12} + \alpha_1(2) A_{22}]$.
Backward initialization: $\beta_2(1) = \beta_2(2) = 1$.
Backward t=1: $\beta_1(1) = A_{11} b_1(B) \beta_2(1) + A_{12} b_2(B) \beta_2(2)$. Similarly $\beta_1(2) = A_{21} b_1(B) \beta_2(1) + A_{22} b_2(B) \beta_2(2)$.
Smoothed marginals: $\gamma_1(i) \propto \alpha_1(i)\,\beta_1(i)$. Compute the numerators: for 1: $\alpha_1(1)\beta_1(1)$; for 2: $\alpha_1(2)\beta_1(2)$. Their sum (here $0.215$) equals $p(y_{1:2})$; normalizing by it gives $\gamma_1$.
For t=2: $\gamma_2(i) \propto \alpha_2(i)$ since $\beta_2(i) = 1$. The sum $\alpha_2(1) + \alpha_2(2)$ again equals $p(y_{1:2}) = 0.215$. So $\gamma_2(i) = \alpha_2(i) / 0.215$.
Insight: The forward–backward algorithm computes exact smoothed posteriors for HMMs in O(S^2T). Scaling/normalization is necessary for long sequences; the same dynamic-programming idea underlies continuous-state smoothers like RTS.
A state-space model separates latent dynamics (state equation) from noisy observations (observation equation), enabling efficient recursive inference.
The Kalman filter provides exact closed-form filtering for linear-Gaussian models via predict/update equations; every formula can be implemented as matrix algebra (with scalar examples above).
Smoothing (RTS for LGSSM, forward–backward for HMMs) uses future observations to reduce uncertainty about past states and provides required expectations for EM parameter learning.
Forward (alpha) recursions are numerically complemented by backward (beta) recursions; normalization/scaling is essential in discrete HMMs to avoid underflow.
State-space representations unify ARMA models and Markov chain ideas: ARMA can be written in companion form and handled by Kalman filtering; HMMs are discrete-state SSMs.
Practical implementations must address numerical stability (Joseph form for covariance updates, scaled forward variables) and identifiability/observability of the model.
Using the naive covariance update without considering numerical stability: rounding can break symmetry/positive-definiteness; use the Joseph form when necessary.
Mixing filtering and smoothing indices: applying smoothing equations with filtered quantities at the wrong time (e.g., using the filtered $P_{t|t}$ where the predictive $P_{t+1|t}$ is required) produces incorrect estimates.
Forgetting to include process noise $Q$: setting $Q = 0$ (unless it is truly zero) often leads to overconfident estimates and filter divergence when the true process is stochastic.
Not scaling forward variables in HMM forward–backward: naive implementation on long sequences leads to numerical underflow and incorrect normalized probabilities.
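The Joseph-form covariance update mentioned above stays symmetric and positive semi-definite under rounding because it is a sum of PSD terms; a scalar sketch comparing it to the naive form (parameter values illustrative):

```python
# Joseph-form covariance update vs. the naive (I - K C) P form.
# Algebraically identical for the optimal gain, but the Joseph form
# is numerically robust because it is a sum of PSD terms.
P_pred, C, R = 1.81, 1.0, 2.0
K = P_pred * C / (C * P_pred * C + R)     # optimal Kalman gain

P_naive  = (1 - K * C) * P_pred
P_joseph = (1 - K * C) * P_pred * (1 - K * C) + K * R * K
```

In the matrix case the Joseph form reads $P_{t|t} = (I - K_t C) P_{t|t-1} (I - K_t C)^\top + K_t R K_t^\top$, and it remains valid even for a suboptimal gain.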
Easy: Consider a 1D LGSSM with parameters $A$, $C = 1$, $Q$, $R$, prior mean $m_0$, variance $P_0$, and observation $y_1$. Compute $\hat x_{1|1}$ and $P_{1|1}$.
Hint: Perform the prediction step to get $\hat x_{1|0}$ and $P_{1|0}$, then compute $S_1$ and $K_1$, and update the mean and covariance.
Prediction: $\hat x_{1|0} = A m_0$, $P_{1|0} = A^2 P_0 + Q$. Innovation cov: $S_1 = P_{1|0} + R$. Kalman gain: $K_1 = P_{1|0} / S_1$. Innovation: $\tilde y_1 = y_1 - \hat x_{1|0}$. Updated mean: $\hat x_{1|1} = \hat x_{1|0} + K_1 \tilde y_1$. Updated covariance: $P_{1|1} = (1 - K_1) P_{1|0}$.
Medium: For an LGSSM, show how the E-step of EM leads to sufficient statistics involving $E[x_t x_t^\top \mid y_{1:T}]$ and $E[x_t x_{t-1}^\top \mid y_{1:T}]$, and write closed-form M-step updates for $A$ and $Q$ (assume $C$ known). Given smoothed means $\hat x_{t|T}$, covariances $P_{t|T}$, and cross-covariances $P_{t,t-1|T}$, write formulas for the M-step.
Hint: Maximize expected complete-data log-likelihood. The quadratic terms lead to normal equations for A and sample covariances for Q.
The M-step (maximizing w.r.t. $A$ and $Q$) yields

$$A^{\text{new}} = \Big(\sum_{t=2}^{T} S_{t,t-1}\Big) \Big(\sum_{t=2}^{T} S_{t-1}\Big)^{-1},$$

where $S_{t,t-1} = P_{t,t-1|T} + \hat x_{t|T}\,\hat x_{t-1|T}^\top$ and $S_{t-1} = P_{t-1|T} + \hat x_{t-1|T}\,\hat x_{t-1|T}^\top$. The process noise covariance updates to

$$Q^{\text{new}} = \frac{1}{T-1} \sum_{t=2}^{T} \Big( S_t - A^{\text{new}}\, S_{t,t-1}^\top \Big),$$

where $S_t = P_{t|T} + \hat x_{t|T}\,\hat x_{t|T}^\top$. These are implemented with the smoothed moments computed from RTS.
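In the scalar case these formulas are a few lines of code. A sketch where the smoothed means, variances, and lag-one cross-covariances are assumed inputs (illustrative numbers, standing in for an actual RTS pass):

```python
# Scalar M-step updates for A and Q from smoothed moments.
# Illustrative smoothed outputs of an RTS pass: means m[t], variances P[t],
# and lag-one cross-covariances Pcross[t] ~ Cov(x_t, x_{t-1} | y_{1:T}).
m      = [0.7, 1.0, 0.5]
P      = [0.75, 0.70, 0.72]
Pcross = [None, 0.30, 0.28]     # defined for t >= 1

T = len(m)
# Expected sufficient statistics E[x_t x_{t-1}] and E[x_{t-1}^2]
num = sum(Pcross[t] + m[t] * m[t - 1] for t in range(1, T))
den = sum(P[t - 1] + m[t - 1] ** 2 for t in range(1, T))
A_new = num / den

# Q update uses E[x_t^2] together with the new A
Q_new = sum((P[t] + m[t] ** 2) - A_new * (Pcross[t] + m[t] * m[t - 1])
            for t in range(1, T)) / (T - 1)
```

Alternating this M-step with an E-step (Kalman filter plus RTS smoother under the current parameters) is exactly EM for the LGSSM.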
Hard: Convert an AR(2) process $x_t = a_1 x_{t-1} + a_2 x_{t-2} + w_t$ into a 2D state-space model in companion form with state $z_t = (x_t, x_{t-1})^\top$, choose $C$ accordingly (assume observations are exact: $y_t = x_t$), and explain whether the system is observable and how you would apply Kalman smoothing if some $y_t$ are missing.
Hint: Companion form places AR coefficients in the first row of A. Process noise enters only in the first component. Observability can be tested via the observability matrix [C; CA]. Missing observations: run Kalman prediction at missing steps and skip update.
Companion state: $z_t = (x_t, x_{t-1})^\top$. Then

$$z_t = \begin{pmatrix} a_1 & a_2 \\ 1 & 0 \end{pmatrix} z_{t-1} + \begin{pmatrix} w_t \\ 0 \end{pmatrix}, \qquad y_t = C z_t, \quad C = (1 \;\; 0).$$

Process noise: $w_t$ affects the first component only, so set $Q = \mathrm{diag}(\sigma_w^2, 0)$. Observation noise $R = 0$ for exact observations. Observability: the observability matrix is

$$\begin{pmatrix} C \\ CA \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ a_1 & a_2 \end{pmatrix},$$

which has determinant $a_2$, so for $a_2 \neq 0$ it is full rank and the system is observable. For missing $y_k$ (say at time $k$), run the prediction step to get $\hat z_{k|k-1}$ and $P_{k|k-1}$, but skip the update (equivalently set $R = \infty$); continue predictions until an observation appears, then resume updates. For smoothing, run the RTS backward pass using the stored predictions and filtered estimates; missing observations simply mean the corresponding filtered step equals the predictive step.
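The missing-observation logic and the observability check can be sketched in a few lines; 2x2 matrices are written as nested lists and the AR(2) coefficients below are illustrative:

```python
# AR(2) in companion form with a missing observation: predict-only step.
a1, a2, qw = 0.5, 0.3, 1.0
A = [[a1, a2], [1.0, 0.0]]          # companion transition matrix
Q = [[qw, 0.0], [0.0, 0.0]]         # noise enters the first component only

def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_add(X, Y):
    return [[X[i][j] + Y[i][j] for j in range(2)] for i in range(2)]

def transpose(X):
    return [[X[j][i] for j in range(2)] for i in range(2)]

def predict(m, P):
    """Time update: m -> A m, P -> A P A^T + Q."""
    m_new = [A[0][0] * m[0] + A[0][1] * m[1],
             A[1][0] * m[0] + A[1][1] * m[1]]
    P_new = mat_add(mat_mul(mat_mul(A, P), transpose(A)), Q)
    return m_new, P_new

# Observability matrix [C; CA] with C = [1, 0]: det = 1*a2 - 0*a1 = a2.
obs_det = 1.0 * A[0][1] - 0.0 * A[0][0]

# Missing observation at this step: run predict, skip the update.
m, P = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
m, P = predict(m, P)   # uncertainty grows; no measurement correction
```

When an observation is available again, the usual gain computation and measurement update resume from the predicted `(m, P)`; nothing else changes.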
Looking backward: this lesson builds directly on "Time Series Foundations" by showing that ARMA and ARIMA models can be cast as state-space systems (companion form) and treated with Kalman filtering for forecasting and missing-data handling; it relies on the Markov property from "Markov Chains" to enable recursive decomposition; and it uses the update logic from "Bayesian Inference" (prior × likelihood -> posterior), specialized in closed-form for Gaussians. Looking forward: mastering Kalman/RTS and forward–backward enables EM-based parameter estimation (Baum–Welch for HMMs, EM for LGSSMs), sequential Monte Carlo and particle filtering for nonlinear/non-Gaussian models, and model-based control (LQG). Concrete downstream topics that require this material include system identification, SLAM in robotics, stochastic optimal control, variational and Monte Carlo inference in time series, and modern state-space neural architectures (e.g., deep state-space models). Understanding observability/identifiability conditions from linear algebra and the numerical stability techniques shown here is essential before deploying these methods in real systems.