State-Space Models

Probability & Statistics · Difficulty: ████ · Depth: 11 · Unlocks: 2

Kalman filter, hidden Markov models. Latent state estimation via filtering and smoothing. Forward-backward algorithm.

▶ Advanced Learning Details

  • Graph Position: 213
  • Depth Cost: 2
  • Fan-Out (ROI): 1
  • Bottleneck Score: 11
  • Chain Length:
State-space models let you turn messy time-series and latent-process problems into modular, solvable inference tasks — they are the lingua franca of tracking, econometrics, and probabilistic control.

TL;DR:

State-space models formalize a latent (Markov) state driving observed data; Kalman filtering and the forward-backward (smoothing) algorithms give optimal filtering and retrospective estimates for linear-Gaussian and discrete-HMM cases respectively.

What Is a State-Space Model?

A state-space model (SSM) is a probabilistic framework that explicitly separates an unobserved internal state (the "latent" or "hidden" state) from noisy observations. This gives two advantages: (1) you model dynamics in the hidden, typically lower-dimensional state, and (2) you perform principled inference on that state using the observation stream. Formally, in discrete time, an SSM is specified by a state (transition) model and an observation (emission) model. Two canonical families are:

  • Linear Gaussian state-space model (LGSSM), often called the Kalman model:

$$x_t = A x_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, Q)$$
$$y_t = C x_t + v_t, \quad v_t \sim \mathcal{N}(0, R)$$

where $x_t \in \mathbb{R}^n$ is the latent state, $y_t \in \mathbb{R}^m$ is the observation, $A$ and $C$ are matrices, and $Q, R$ are covariance matrices. Example: an AR(1) process from "Time Series Foundations" can be written as a one-dimensional LGSSM with $A = \phi$, $C = 1$, and suitable $Q, R$; e.g., for AR(1) with $\phi = 0.9$, process noise variance $\sigma_w^2 = 1$, and observation noise variance $\sigma_v^2 = 2$, set $A = 0.9$, $Q = 1$, $C = 1$, $R = 2$.

  • Hidden Markov model (HMM), discrete state case:

$$z_t \sim p(z_t \mid z_{t-1}) \quad \text{(Markov chain)}$$
$$y_t \sim p(y_t \mid z_t) \quad \text{(emissions)}$$

where $z_t$ takes values in a finite set. Example: a two-state HMM with transition matrix

$$T = \begin{pmatrix} 0.8 & 0.2 \\ 0.1 & 0.9 \end{pmatrix}$$

and emission probabilities $p(y_t \mid z_t)$ (e.g., categorical or Gaussian) is a simple discrete SSM.

Why build SSMs? From a modeling perspective, many time-series phenomena are naturally explained by a latent quantity evolving over time (hidden economy, object position, unobserved volatility). From an algorithmic perspective, the Markov structure (in the latent space) allows recursive, linear-time inference via dynamic programming: filtering (online) and smoothing (offline). In "Markov Chains" we learned how the memoryless property simplifies analysis; SSMs leverage that same idea: the state at time $t$ depends on the past only through $x_{t-1}$. In "Bayesian Inference" we learned updating beliefs via priors and likelihoods; the Kalman filter and forward-backward algorithm are recursive Bayesian updates tailored to sequential data.

Key inference questions in SSMs:

  • Filtering: compute $p(x_t \mid y_{1:t})$ (belief about the current state given past observations). This is online.
  • Prediction: compute $p(x_{t+1} \mid y_{1:t})$ or $p(y_{t+k} \mid y_{1:t})$.
  • Smoothing: compute $p(x_t \mid y_{1:T})$ for $t < T$; this uses future information and is an offline pass that reduces uncertainty.
  • Parameter learning: estimate $A, C, Q, R$ (or transition and emission probabilities in HMMs) from data, typically via Expectation-Maximization (EM), where the E-step uses smoothing to compute sufficient statistics.

Concrete small numeric example to anchor the notation: consider a 1D LGSSM with $A = 0.9$, $Q = 1$, $C = 1$, $R = 2$, and initial prior $x_0 \sim \mathcal{N}(0, 1)$. You observe $y_1 = 1.5$. The filtering problem asks for $p(x_1 \mid y_1)$; the Kalman filter gives this in closed form (derived in the next section) and produces a Gaussian with explicit mean and variance computed from the matrices above.

Intuition: the transition matrix $A$ controls how much we trust the past state to predict the next; the process noise $Q$ injects uncertainty (how much the state can change unpredictably); the observation matrix $C$ and noise $R$ govern how informative each measurement is. When $R$ is large relative to $Q$, measurements are noisy and the filter relies more on predictions; when $Q$ is large, the filter trusts new observations more.

Core Mechanic 1 — Kalman Filter: Derivation and Worked Mini-Example

The Kalman filter is the closed-form optimal Bayesian filter for linear-Gaussian SSMs. It computes for each $t$ the Gaussian posterior $p(x_t \mid y_{1:t}) = \mathcal{N}(x_{t|t}, P_{t|t})$ by alternating a prediction step and an update (correction) step.

Equations (matrix form).

Prediction (time update):

$$x_{t|t-1} = A x_{t-1|t-1}$$
$$P_{t|t-1} = A P_{t-1|t-1} A^\top + Q$$

Update (measurement update):

$$S_t = C P_{t|t-1} C^\top + R \quad \text{(innovation covariance)}$$
$$K_t = P_{t|t-1} C^\top S_t^{-1} \quad \text{(Kalman gain)}$$
$$x_{t|t} = x_{t|t-1} + K_t (y_t - C x_{t|t-1})$$
$$P_{t|t} = (I - K_t C) P_{t|t-1}$$
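The predict/update equations above translate directly into matrix code. A minimal NumPy sketch (the function name `kalman_step` is ours, not from a particular library):

```python
import numpy as np

def kalman_step(x, P, y, A, C, Q, R):
    """One Kalman filter step: predict with (A, Q), then update with (C, R)."""
    # Prediction (time update)
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update (measurement update)
    S = C @ P_pred @ C.T + R                 # innovation covariance S_t
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain K_t
    x_new = x_pred + K @ (y - C @ x_pred)
    P_new = (np.eye(len(x)) - K @ C) @ P_pred
    return x_new, P_new
```

Running this on the 1D AR(1) example above ($A=0.9$, $Q=1$, $C=1$, $R=2$, prior $\mathcal{N}(0,1)$, $y_1=1.5$) returns a posterior mean of about $0.7126$ and variance of about $0.9501$.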

Derivation idea (sketch): the posterior is proportional to the prior times the likelihood. With Gaussians, multiplying two exponentials gives another Gaussian; completing the square yields the posterior mean and covariance above, and the Kalman gain is the coefficient that balances prior uncertainty against observation uncertainty. This is a direct application of "Bayesian Inference" where the prior is the predictive Gaussian and the likelihood is Gaussian centered at $C x_t$.

Numeric 1D worked mini-example (showing every arithmetic step). Keep everything scalar so matrix operations reduce to multiplication.

Model parameters:

  • $A = 0.9$, $Q = 1$, $C = 1$, $R = 2$.
  • Prior at time 0: $x_{0|0} = 0$, $P_{0|0} = 1$.
  • Observation: $y_1 = 1.5$.

Step 1 — Predict to time 1:

  • $x_{1|0} = A x_{0|0} = 0.9 \times 0 = 0$.
  • $P_{1|0} = A P_{0|0} A^\top + Q = 0.9^2 \times 1 + 1 = 0.81 + 1 = 1.81$.

Step 2 — Compute innovation covariance and Kalman gain:

  • $S_1 = C P_{1|0} C^\top + R = 1 \times 1.81 \times 1 + 2 = 3.81$.
  • $K_1 = P_{1|0} C^\top / S_1 = 1.81 / 3.81 \approx 0.475$. (This is a scalar, so the matrix inverse reduces to division by $S_1$.)

Step 3 — Update state mean and covariance with observation y1=1.5y_1=1.5:

  • Innovation: $y_1 - C x_{1|0} = 1.5 - 0 = 1.5$.
  • $x_{1|1} = x_{1|0} + K_1 \cdot 1.5 = 0 + 0.475 \times 1.5 \approx 0.7125$.
  • $P_{1|1} = (1 - K_1 C) P_{1|0} = (1 - 0.475) \times 1.81 = 0.525 \times 1.81 \approx 0.95025$.

Interpretation: the posterior mean $\approx 0.71$ lies between the prediction (0) and the observation (1.5), weighted by the Kalman gain, which depends on the relative uncertainties. The posterior variance decreased from $1.81$ (predictive) to $0.95$ (updated): adding the observation reduced uncertainty.

Multiple-step operation and relation to "Time Series Foundations": if you write an AR(1) time series $y_t = \phi y_{t-1} + \epsilon_t$, an equivalent LGSSM can represent the hidden state as the AR(1) latent variable; the Kalman filter then provides the optimal linear minimum-MSE estimator for missing values or one-step-ahead forecasts. The Kalman filter is thus the dynamic generalization of the Gaussian conjugate update we saw in "Bayesian Inference".

Important practical points:

  • Numerical stability: the covariance update $P_{t|t} = (I - K_t C) P_{t|t-1}$ can lose symmetry/positive-definiteness due to rounding; implement stabilized forms such as the Joseph form
$$P_{t|t} = (I - K_t C) P_{t|t-1} (I - K_t C)^\top + K_t R K_t^\top.$$
  • Initialization matters: a poor initial $P_{0|0}$ or wrong $Q, R$ will bias early estimates; often one uses a large, uninformative covariance to reflect prior uncertainty.
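The Joseph-form covariance update can be coded directly; a minimal NumPy sketch (the helper name `joseph_update` is ours):

```python
import numpy as np

def joseph_update(x_pred, P_pred, y, C, R):
    """Measurement update using the Joseph-form covariance update,
    which preserves symmetry and positive-definiteness under rounding."""
    S = C @ P_pred @ C.T + R
    K = P_pred @ C.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y - C @ x_pred)
    I_KC = np.eye(len(x_pred)) - K @ C
    # Joseph form: (I - KC) P (I - KC)^T + K R K^T, symmetric by construction
    P_new = I_KC @ P_pred @ I_KC.T + K @ R @ K.T
    return x_new, P_new
```

On the scalar example above ($P_{1|0} = 1.81$, $R = 2$, $y_1 = 1.5$) it reproduces $P_{1|1} \approx 0.9501$, identical to the simple form but numerically safer in higher dimensions.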

The Kalman filter's cost per time step is O(n^3), dominated by matrix multiplications and the inversion of $S_t$, but for moderate state dimension it is efficient and exact.

Core Mechanic 2 — Smoothing: RTS and Forward–Backward

Filtering answers "what is the current state given the past?" Smoothing answers "given the entire dataset, what was the state at time t?" Smoothing reduces posterior variance by incorporating future observations. Two complementary algorithms handle the main SSM families.

1) Rauch–Tung–Striebel (RTS) smoother — linear Gaussian case

After running the Kalman filter forward and storing $x_{t|t}$, $P_{t|t}$ and the predictive $x_{t+1|t}$, $P_{t+1|t}$ for $t = 0, \ldots, T-1$, the RTS smoother performs a backward pass to compute $p(x_t \mid y_{1:T}) = \mathcal{N}(x_{t|T}, P_{t|T})$ via

$$J_t = P_{t|t} A^\top P_{t+1|t}^{-1}$$
$$x_{t|T} = x_{t|t} + J_t (x_{t+1|T} - x_{t+1|t})$$
$$P_{t|T} = P_{t|t} + J_t (P_{t+1|T} - P_{t+1|t}) J_t^\top$$

Derivation sketch: the smoothed estimates follow from the conditional Gaussian identity for the jointly Gaussian vectors $(x_t, x_{t+1})$ conditioned on all observations: one expresses $p(x_t \mid y_{1:T}) = \int p(x_t \mid x_{t+1}, y_{1:t})\, p(x_{t+1} \mid y_{1:T})\, dx_{t+1}$ and uses the linear-Gaussian conditional formulas. The matrix $J_t$ is the smoothing gain, analogous to $K_t$ but applied backward.

Numeric RTS example (2-step) to illustrate the arithmetic. Use the previous 1D example with parameters $A = 0.9$, $Q = 1$, $C = 1$, $R = 2$ and two observations $y_1 = 1.5$, $y_2 = 0.5$. From Section 2 we computed $x_{1|1} \approx 0.7125$, $P_{1|1} \approx 0.95025$, and $x_{2|1} = A x_{1|1} = 0.9 \times 0.7125 \approx 0.6413$, $P_{2|1} = A P_{1|1} A^\top + Q = 0.9^2 \times 0.95025 + 1 \approx 0.7697 + 1 = 1.7697$. After running the Kalman update at time 2, suppose we compute $x_{2|2} \approx 0.55$ and $P_{2|2} \approx 0.9$ (numbers illustrative). Then the backward step for $t = 1$:

  • $J_1 = P_{1|1} A^\top / P_{2|1} = 0.95025 \times 0.9 / 1.7697 \approx 0.4833$.
  • $x_{1|2} = x_{1|1} + J_1 (x_{2|2} - x_{2|1}) = 0.7125 + 0.4833 \times (0.55 - 0.6413) \approx 0.7125 - 0.0441 \approx 0.6684$.
  • $P_{1|2} = P_{1|1} + J_1 (P_{2|2} - P_{2|1}) J_1 \approx 0.95025 + 0.4833 \times (0.9 - 1.7697) \times 0.4833 \approx 0.95025 - 0.2032 \approx 0.7471$.

The smoothed mean moved toward values that explain the future data; the variance decreased from $0.95025$ to $0.7471$.
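The backward step just computed can be packaged as a scalar helper (a minimal sketch; the name `rts_step` is ours):

```python
def rts_step(x_filt, P_filt, x_pred_next, P_pred_next,
             x_smooth_next, P_smooth_next, A):
    """One backward RTS step in the scalar (1D) case.
    Inputs: filtered moments at t, predictive moments x_{t+1|t}, P_{t+1|t},
    and smoothed moments at t+1. Returns smoothed moments at t."""
    J = P_filt * A / P_pred_next                      # smoothing gain J_t
    x_s = x_filt + J * (x_smooth_next - x_pred_next)
    P_s = P_filt + J * (P_smooth_next - P_pred_next) * J
    return x_s, P_s
```

Calling it with the numbers above (filtered $0.7125$/$0.95025$, predictive $0.6413$/$1.7697$, illustrative smoothed $0.55$/$0.9$, $A=0.9$) returns roughly $(0.6684, 0.7471)$.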

2) Forward–Backward algorithm — discrete HMM case

For finite-state HMMs, smoothing is performed by the forward–backward algorithm (a dynamic-programming factorization). Define the forward variable $\alpha_t(i) = p(y_{1:t}, z_t = i)$ and the backward variable $\beta_t(i) = p(y_{t+1:T} \mid z_t = i)$. Then the smoothed state posterior is

$$\gamma_t(i) := p(z_t = i \mid y_{1:T}) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_j \alpha_t(j)\,\beta_t(j)}.$$

Recursions:

$$\alpha_1(i) = \pi_i\, p(y_1 \mid z_1 = i), \quad \alpha_{t+1}(j) = p(y_{t+1} \mid z_{t+1} = j) \sum_i \alpha_t(i)\, p(z_{t+1} = j \mid z_t = i).$$
$$\beta_T(i) = 1, \quad \beta_t(i) = \sum_j p(z_{t+1} = j \mid z_t = i)\, p(y_{t+1} \mid z_{t+1} = j)\, \beta_{t+1}(j).$$

Concrete discrete example: two-state HMM with transitions

$$T = \begin{pmatrix} 0.8 & 0.2 \\ 0.1 & 0.9 \end{pmatrix}, \quad \pi = (0.6, 0.4)$$

Emissions: $p(y_t = H \mid z = 1) = 0.7$, $p(y_t = H \mid z = 2) = 0.3$ (H = "High"). Observations: [H, L], where L denotes "Low" with complementary probabilities. Compute $\alpha$ forward and $\beta$ backward and then $\gamma_t$; make sure to renormalize at each step to avoid underflow. Include scaling factors $c_t = \sum_i \alpha_t(i)$ when implementing for long sequences.
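A scaled implementation of these recursions, applied to the two-state example above (the function name and array layout are ours; rows of `T` index the from-state):

```python
import numpy as np

def forward_backward(pi, T, B, obs):
    """Scaled forward-backward smoothing for a discrete HMM.
    pi: initial distribution (S,); T: transition matrix (S, S), rows = from-state;
    B: emission probabilities (S, num_symbols); obs: observed symbol indices.
    Returns gamma[t, i] = p(z_t = i | y_{1:T})."""
    n, S = len(obs), len(pi)
    alpha = np.zeros((n, S)); beta = np.ones((n, S)); c = np.zeros(n)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]          # scaling avoids underflow
    for t in range(1, n):
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ T)
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    for t in range(n - 2, -1, -1):
        beta[t] = T @ (B[:, obs[t + 1]] * beta[t + 1]) / c[t + 1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Two-state example from the text: symbols H=0, L=1
pi = np.array([0.6, 0.4])
T = np.array([[0.8, 0.2], [0.1, 0.9]])
B = np.array([[0.7, 0.3], [0.3, 0.7]])   # p(H|1)=0.7, p(H|2)=0.3
gamma = forward_backward(pi, T, B, [0, 1])   # observations [H, L]
```

For the sequence [H, L] this yields $\gamma_1 \approx (0.668, 0.332)$ and $\gamma_2 \approx (0.437, 0.563)$ (values computed from the recursions above, not given in the text).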

Relationship between the two: both RTS and forward–backward perform the same logical operation (compute the posterior over states given all data) but are specialized to linear-Gaussian continuous states and discrete finite states respectively. The backward gains $J_t$ and the backward variables $\beta_t$ play analogous roles.

Smoothing is essential for parameter learning (EM): the E-step requires expected sufficient statistics like $\mathbb{E}[x_t x_t^\top]$ or $\mathbb{E}[\mathbf{1}(z_t = i, z_{t+1} = j)]$, which are computed from smoothed marginals and pairwise smoothed distributions (for HMMs, $\xi_t(i,j) = p(z_t = i, z_{t+1} = j \mid y_{1:T})$).

Applications and Connections

State-space methods are a crossroads connecting "Time Series Foundations", "Markov Chains", and "Bayesian Inference" to many applied domains. Here are concrete applications, practical considerations, and downstream connections that rely on understanding SSMs.

Major application areas with concrete examples:

  • Tracking and navigation (robotics, aerospace): an aircraft's position and velocity are latent states; radar returns are noisy observations. An LGSSM with constant-velocity dynamics ($A$ a block matrix of 1s and timestep multipliers) and small process noise models the motion; the Kalman filter provides real-time position estimates. Example: for a 2D constant-velocity model with state $x = [x, \dot{x}, y, \dot{y}]^\top$, $A$ and $Q$ are chosen from the discrete-time kinematic equations and the process acceleration variance. The Kalman filter runs at sensor rate to fuse GPS and IMU.
  • Econometrics (state-space ARMA and unobserved components): many macro time series are modeled as combinations of trend and seasonal latent states. The Kalman smoother recovers the trend component; parameter estimation via EM or maximum likelihood (via Kalman-filtered log-likelihood) yields interpretable model parameters (trend variance, cycle frequency). In "Time Series Foundations" we learned ARMA; any ARMA model can be written in state-space form (companion form) and handled by the Kalman filter for missing data or likelihood evaluation.
  • Signal processing and communications: Kalman filters provide optimal linear equalizers and adaptive filters; smoothing is used in offline deconvolution.
  • Finance: stochastic volatility models are often written as nonlinear/non-Gaussian SSMs; particle filters or specialized smoothers are used for inference. The EM algorithm using smoothing can estimate latent volatility parameters.
  • Machine learning and probabilistic modeling: HMMs are the building block of speech recognition (Baum–Welch algorithm = forward–backward EM), bioinformatics (gene prediction), and sequence labeling. Linear-Gaussian SSMs are used in Gaussian process state-space models and as differentiable building blocks in deep learning architectures (e.g., KalmanNet).

Downstream algorithms that require SSM understanding:

  • Expectation–Maximization (EM) for parameter estimation: the E-step uses smoothing to compute expectations; the M-step updates $A, Q, C, R$ or the HMM transition/emission matrices. Concretely: for an LGSSM, the expected sufficient statistics are $\sum_t \mathbb{E}[x_t x_t^\top]$ and $\sum_t \mathbb{E}[x_t x_{t-1}^\top]$, computed from the smoothing outputs.
  • Particle filters and sequential Monte Carlo: when dynamics or observations are nonlinear/non-Gaussian, the Kalman filter is not optimal; particle filters approximate the filtering distribution by weighted particles. The theoretical underpinning (sequential importance sampling and resampling) builds on the same Markov/Bayesian filtering logic.
  • Control theory (LQG — linear-quadratic-Gaussian): the Kalman filter provides the optimal state estimator that, combined with an LQR regulator, gives the separation principle: estimation and control decouple. A practical example: autopilot design separates state estimation (Kalman) and control (LQR), each designed using SSM parameters.

Practicalities and numerical issues:

  • Underflow in forward–backward: use log-space or per-step scaling (store $c_t = \sum_i \alpha_t(i)$ and normalize). Example: in a long HMM, raw $\alpha_t$ values shrink exponentially; the scaled $\hat{\alpha}_t = \alpha_t / c_t$ avoids underflow.
  • Observability and identifiability: not every SSM parameter is recoverable from data. Observability (a linear-algebra condition on $(A, C)$) ensures the state can be inferred from outputs. Example: if $C = [0\;0\;\dots\;0]$, the state is unobservable no matter how many measurements you collect.
  • Model mismatch: wrong noise covariance assumptions ($Q, R$) bias the Kalman gain. In practice, you may treat $Q$ as a tuning parameter or estimate it via EM.
  • Complexity: the Kalman filter is O(n^3) per time step; HMM forward–backward is O(S^2 T), where S is the number of discrete states; particle filters are O(N T) with sample size N.

Concrete numeric tie-in: the log-likelihood of observations under the LGSSM can be computed incrementally from the Kalman filter via

$$\log p(y_{1:T}) = -\tfrac{1}{2} \sum_{t=1}^T \left[ m \log(2\pi) + \log|S_t| + (y_t - C x_{t|t-1})^\top S_t^{-1} (y_t - C x_{t|t-1}) \right],$$

where $m$ is the observation dimension. In practice this is used in maximum-likelihood parameter estimation; substitute the numeric $S_t$ from the filter at each step.
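This prediction-error decomposition can be accumulated alongside the filter; a scalar sketch (the function name `lgssm_loglik` is ours):

```python
import numpy as np

def lgssm_loglik(ys, A, C, Q, R, x0, P0):
    """Log-likelihood of a scalar LGSSM, accumulated from the Kalman filter's
    innovations e_t and innovation variances S_t (prediction-error decomposition)."""
    x, P, ll = x0, P0, 0.0
    for y in ys:
        x_pred, P_pred = A * x, A * P * A + Q        # predict
        S = C * P_pred * C + R                       # innovation variance S_t
        e = y - C * x_pred                           # innovation e_t
        ll += -0.5 * (np.log(2 * np.pi) + np.log(S) + e * e / S)
        K = P_pred * C / S                           # update
        x, P = x_pred + K * e, (1 - K * C) * P_pred
    return ll
```

For a single observation $y_1 = 1.5$ under the running example ($A=0.9$, $Q=1$, $C=1$, $R=2$, prior $\mathcal{N}(0,1)$), $S_1 = 3.81$ and the formula gives $\approx -1.883$.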

Summary: mastering Kalman filtering and forward–backward smoothing gives you powerful, computationally efficient tools to estimate hidden dynamics and to build higher-level algorithms for learning and control. In the next section (worked examples) we will step through concrete instances to cement arithmetic and intuition.

Worked Examples (3)

1D Kalman Filter Single Update

Parameters: $A = 0.8$, $Q = 0.5$, $C = 1$, $R = 1.5$. Prior: $x_{0|0} = 0$, $P_{0|0} = 2$. Observation: $y_1 = 1.2$. Compute $x_{1|1}$ and $P_{1|1}$.

  1. Prediction: $x_{1|0} = A x_{0|0} = 0.8 \times 0 = 0$.

  2. Predictive covariance: $P_{1|0} = A P_{0|0} A^\top + Q = 0.8^2 \times 2 + 0.5 = 1.28 + 0.5 = 1.78$.

  3. Innovation covariance: $S_1 = C P_{1|0} C^\top + R = 1 \times 1.78 + 1.5 = 3.28$.

  4. Kalman gain: $K_1 = P_{1|0} C^\top / S_1 = 1.78 / 3.28 \approx 0.5427$.

  5. Innovation: $y_1 - C x_{1|0} = 1.2 - 0 = 1.2$.

  6. Updated mean: $x_{1|1} = x_{1|0} + K_1 \times 1.2 = 0 + 0.5427 \times 1.2 \approx 0.6512$.

  7. Updated covariance: $P_{1|1} = (1 - K_1 C) P_{1|0} = (1 - 0.5427) \times 1.78 \approx 0.8140$.

Insight: This example shows how the Kalman gain depends on the relative sizes of the predictive uncertainty $P_{1|0}$ and the measurement noise $R$: a larger $R$ would reduce $K_1$, pulling the posterior mean closer to the prediction.

RTS Smoother for 3-step Linear Gaussian

1D model: $A = 0.9$, $Q = 1$, $C = 1$, $R = 2$. Observations: $y_1 = 1.5$, $y_2 = 0.5$, $y_3 = 1.0$. Start with $x_{0|0} = 0$, $P_{0|0} = 1$. Compute $x_{2|3}$ and $P_{2|3}$ (smoothed estimates for time 2) using a forward Kalman pass and backward RTS smoothing.

  1. Forward pass step 1 (t=1): match Section 2 to get $x_{1|1} \approx 0.7126$, $P_{1|1} \approx 0.9501$, $x_{2|1} = A x_{1|1} \approx 0.6413$, $P_{2|1} = 0.9^2 \times 0.9501 + 1 \approx 1.7696$.

  2. Update at t=2 with $y_2 = 0.5$: $S_2 = P_{2|1} + R = 1.7696 + 2 = 3.7696$, $K_2 = 1.7696 / 3.7696 \approx 0.4694$, innovation $0.5 - 0.6413 = -0.1413$, so $x_{2|2} = 0.6413 + 0.4694 \times (-0.1413) \approx 0.5750$, $P_{2|2} = (1 - K_2) P_{2|1} \approx 0.9389$.

  3. Predict to t=3: $x_{3|2} = A x_{2|2} = 0.9 \times 0.5750 \approx 0.5175$, $P_{3|2} = 0.9^2 \times 0.9389 + 1 \approx 1.7605$.

  4. Update at t=3 with $y_3 = 1.0$: $S_3 = 1.7605 + 2 = 3.7605$, $K_3 = 1.7605 / 3.7605 \approx 0.4682$, innovation $1.0 - 0.5175 = 0.4825$, so $x_{3|3} = 0.5175 + 0.4682 \times 0.4825 \approx 0.7434$, $P_{3|3} \approx 0.9363$.

  5. Backward RTS for t=2: compute $J_2 = P_{2|2} A^\top / P_{3|2} = 0.9389 \times 0.9 / 1.7605 \approx 0.4800$.

  6. Smoothed mean: $x_{2|3} = x_{2|2} + J_2 (x_{3|3} - x_{3|2}) = 0.5750 + 0.4800 \times (0.7434 - 0.5175) \approx 0.6834$.

  7. Smoothed covariance: $P_{2|3} = P_{2|2} + J_2 (P_{3|3} - P_{3|2}) J_2 \approx 0.9389 + 0.4800 \times (0.9363 - 1.7605) \times 0.4800 \approx 0.9389 - 0.1899 \approx 0.7490$.

Insight: Smoothing moved the estimate at t=2 towards values that better explain the later observation at t=3 and reduced variance. This illustrates how future data refines past beliefs.
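The full forward pass plus the single backward step can be checked numerically; a scalar script under the same parameters (variable names ours):

```python
import numpy as np

# Forward Kalman pass for the 3-step example (A=0.9, Q=1, C=1, R=2),
# then one backward RTS step to smooth time 2. All quantities are scalars.
A, Q, C, R = 0.9, 1.0, 1.0, 2.0
ys = [1.5, 0.5, 1.0]
x, P = 0.0, 1.0
xf, Pf, xp, Pp = [], [], [], []          # filtered and predictive sequences
for y in ys:
    x_pred, P_pred = A * x, A * P * A + Q
    S = C * P_pred * C + R
    K = P_pred * C / S
    x, P = x_pred + K * (y - C * x_pred), (1 - K * C) * P_pred
    xf.append(x); Pf.append(P); xp.append(x_pred); Pp.append(P_pred)

# RTS step for t=2 (index 1): smoothed t=3 equals filtered t=3 at the horizon
J = Pf[1] * A / Pp[2]
x_s = xf[1] + J * (xf[2] - xp[2])
P_s = Pf[1] + J * (Pf[2] - Pp[2]) * J
# x_s, P_s hold the smoothed mean and variance for time 2
```

With full precision this gives $x_{2|3} \approx 0.6834$ and $P_{2|3} \approx 0.7490$, matching the hand calculation up to rounding.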

Forward–Backward on a Two-State HMM

Two states {1, 2}, initial distribution $\pi = (0.5, 0.5)$, transition matrix $T = \begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix}$. Emission probabilities for observation O ∈ {A, B}: $p(A \mid 1) = 0.8$, $p(B \mid 1) = 0.2$ and $p(A \mid 2) = 0.3$, $p(B \mid 2) = 0.7$. Observed sequence: [A, B]. Compute the smoothed marginals $\gamma_1$ and $\gamma_2$.

  1. Forward t=1: $\alpha_1(1) = \pi_1 p(A \mid 1) = 0.5 \times 0.8 = 0.4$, $\alpha_1(2) = 0.5 \times 0.3 = 0.15$. Normalize by $c_1 = 0.55$ if desired; unnormalized values suffice for ratios.

  2. Forward t=2: compute the predictive sums. For state 1: $\sum_i \alpha_1(i) T_{i \to 1} = 0.4 \times 0.7 + 0.15 \times 0.4 = 0.28 + 0.06 = 0.34$; multiplying by the emission $p(B \mid 1) = 0.2$ gives $\alpha_2(1) = 0.068$. For state 2: $\sum_i \alpha_1(i) T_{i \to 2} = 0.4 \times 0.3 + 0.15 \times 0.6 = 0.12 + 0.09 = 0.21$; multiplying by $p(B \mid 2) = 0.7$ gives $\alpha_2(2) = 0.147$.

  3. Backward initialization: $\beta_2(1) = \beta_2(2) = 1$.

  4. Backward t=1: $\beta_1(1) = \sum_j T_{1 \to j}\, p(y_2 \mid j)\, \beta_2(j) = 0.7 \times 0.2 + 0.3 \times 0.7 = 0.14 + 0.21 = 0.35$. Similarly $\beta_1(2) = 0.4 \times 0.2 + 0.6 \times 0.7 = 0.08 + 0.42 = 0.5$.

  5. Smoothed marginals: $\gamma_1(i) \propto \alpha_1(i) \beta_1(i)$. Numerators: for state 1, $0.4 \times 0.35 = 0.14$; for state 2, $0.15 \times 0.5 = 0.075$. Normalizing by the sum $0.215$ gives $\gamma_1 = (0.14/0.215, 0.075/0.215) \approx (0.651, 0.349)$.

  6. For t=2: $\gamma_2(i) \propto \alpha_2(i) \beta_2(i) = \alpha_2(i)$ since $\beta_2 = 1$. Sum $0.068 + 0.147 = 0.215$. So $\gamma_2 = (0.068/0.215, 0.147/0.215) \approx (0.316, 0.684)$.

Insight: The forward–backward algorithm computes exact smoothed posteriors for HMMs in O(S^2T). Scaling/normalization is necessary for long sequences; the same dynamic-programming idea underlies continuous-state smoothers like RTS.

Key Takeaways

  • A state-space model separates latent dynamics (state equation) from noisy observations (observation equation), enabling efficient recursive inference.

  • The Kalman filter provides exact closed-form filtering for linear-Gaussian models via predict/update equations; every formula can be implemented as matrix algebra (with scalar examples above).

  • Smoothing (RTS for LGSSM, forward–backward for HMMs) uses future observations to reduce uncertainty about past states and provides required expectations for EM parameter learning.

  • Forward (alpha) recursions are numerically complemented by backward (beta) recursions; normalization/scaling is essential in discrete HMMs to avoid underflow.

  • State-space representations unify ARMA models and Markov chain ideas: ARMA can be written in companion form and handled by Kalman filtering; HMMs are discrete-state SSMs.

  • Practical implementations must address numerical stability (Joseph form for covariance updates, scaled forward variables) and identifiability/observability of the model.

Common Mistakes

  • Using the naive covariance update $P_{t|t} = (I - K_t C) P_{t|t-1}$ without considering numerical stability: rounding can break symmetry/positive-definiteness; use the Joseph form when necessary.

  • Mixing filtering and smoothing indices: applying smoothing equations with 'filtered' quantities at the wrong time (e.g., using $x_{t|t-1}$ where $x_{t|t}$ is required) produces incorrect estimates.

  • Forgetting to include process noise $Q$: setting $Q = 0$ (unless truly known to be zero) often leads to overconfident estimates and filter divergence when the true process is stochastic.

  • Not scaling forward variables in HMM forward–backward: naive implementation on long sequences leads to numerical underflow and incorrect normalized probabilities.

Practice

easy

Easy: Consider a 1D LGSSM with $A = 0.95$, $Q = 0.2$, $C = 1$, $R = 0.5$, prior $x_{0|0} = 1$, $P_{0|0} = 0.3$, and observation $y_1 = 1.4$. Compute $x_{1|1}$ and $P_{1|1}$.

Hint: Perform the prediction step to get $x_{1|0}$ and $P_{1|0}$, then compute $S_1$ and $K_1$, and update the mean and covariance.

Show solution

Prediction: $x_{1|0} = 0.95 \times 1 = 0.95$, $P_{1|0} = 0.95^2 \times 0.3 + 0.2 = 0.27075 + 0.2 = 0.47075$. Innovation covariance: $S_1 = 0.47075 + 0.5 = 0.97075$. Kalman gain: $K_1 = 0.47075 / 0.97075 \approx 0.4850$. Innovation: $1.4 - 0.95 = 0.45$. Updated mean: $x_{1|1} = 0.95 + 0.4850 \times 0.45 \approx 1.1682$. Updated covariance: $P_{1|1} = (1 - 0.4850) \times 0.47075 \approx 0.2425$.

medium

Medium: For an LGSSM, show how the E-step of EM leads to sufficient statistics involving $\sum_t \mathbb{E}[x_t x_t^\top]$ and $\sum_t \mathbb{E}[x_t x_{t-1}^\top]$, and write closed-form M-step updates for $A$ and $Q$ (assume $C$ known). Given smoothed means $x_{t|T}$, covariances $P_{t|T}$, and cross-covariances $P_{t,t-1|T}$, write formulas for the M-step.

Hint: Maximize expected complete-data log-likelihood. The quadratic terms lead to normal equations for A and sample covariances for Q.

Show solution

The M-step (maximize w.r.t. A,Q) yields

$$A^{\text{new}} = \left( \sum_{t=1}^T \mathbb{E}[x_t x_{t-1}^\top] \right) \left( \sum_{t=1}^T \mathbb{E}[x_{t-1} x_{t-1}^\top] \right)^{-1}$$

where $\mathbb{E}[x_t x_{t-1}^\top] = P_{t,t-1|T} + x_{t|T} x_{t-1|T}^\top$ and $\mathbb{E}[x_{t-1} x_{t-1}^\top] = P_{t-1|T} + x_{t-1|T} x_{t-1|T}^\top$. The process noise covariance updates to

$$Q^{\text{new}} = \frac{1}{T} \sum_{t=1}^T \left( \mathbb{E}[x_t x_t^\top] - A^{\text{new}}\, \mathbb{E}[x_t x_{t-1}^\top]^\top \right)$$

where $\mathbb{E}[x_t x_t^\top] = P_{t|T} + x_{t|T} x_{t|T}^\top$. These are implemented with the smoothed moments computed by the RTS smoother.

hard

Hard: Convert the AR(2) process $y_t = 1.2 y_{t-1} - 0.32 y_{t-2} + \epsilon_t$ with $\epsilon_t \sim \mathcal{N}(0, 0.5)$ into a 2D state-space model in companion form with $x_t = [y_t, y_{t-1}]^\top$, choose $A, C, Q, R$ accordingly (assume observations are exact $y_t$), and explain whether the system is observable and how you would apply Kalman smoothing if some $y_t$ are missing.

Hint: Companion form places AR coefficients in the first row of A. Process noise enters only in the first component. Observability can be tested via the observability matrix [C; CA]. Missing observations: run Kalman prediction at missing steps and skip update.

Show solution

Companion state: $x_t = \begin{pmatrix} y_t \\ y_{t-1} \end{pmatrix}$. Then

$$A = \begin{pmatrix} 1.2 & -0.32 \\ 1 & 0 \end{pmatrix}, \quad C = \begin{pmatrix} 1 & 0 \end{pmatrix}.$$

Process noise: $w_t$ affects $y_t$ only, so set $Q = \begin{pmatrix} 0.5 & 0 \\ 0 & 0 \end{pmatrix}$. Observation noise $R = 0$ for exact observations. Observability: the observability matrix is

$$\mathcal{O} = \begin{pmatrix} C \\ C A \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1.2 & -0.32 \end{pmatrix}$$

which has determinant $-0.32 \neq 0$, hence full rank, so the system is observable. For a missing $y_t$ (say at time $k$), run the prediction step to get $x_{k|k-1}$ and $P_{k|k-1}$ but skip the update (equivalently, set $R = \infty$); continue predicting until an observation appears, then resume updates. For smoothing, run the RTS backward pass using the stored predictions and filtered estimates; missing observations simply mean the corresponding filtered step equals the predictive step.
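The companion-form matrices and the observability check can be verified numerically; a minimal sketch:

```python
import numpy as np

# AR(2): y_t = 1.2 y_{t-1} - 0.32 y_{t-2} + eps_t, eps_t ~ N(0, 0.5), companion form
A = np.array([[1.2, -0.32],
              [1.0,  0.0]])
C = np.array([[1.0, 0.0]])
Q = np.array([[0.5, 0.0],
              [0.0, 0.0]])   # process noise enters the first component only

# Observability matrix [C; CA] must have full rank for the state to be inferable
O = np.vstack([C, C @ A])
rank = np.linalg.matrix_rank(O)   # full rank (2) -> observable
det = np.linalg.det(O)            # determinant -0.32, nonzero
```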

Connections

Looking backward: this lesson builds directly on "Time Series Foundations" by showing that ARMA and ARIMA models can be cast as state-space systems (companion form) and treated with Kalman filtering for forecasting and missing-data handling; it relies on the Markov property from "Markov Chains" to enable recursive decomposition; and it uses the update logic from "Bayesian Inference" (prior × likelihood -> posterior), specialized in closed-form for Gaussians. Looking forward: mastering Kalman/RTS and forward–backward enables EM-based parameter estimation (Baum–Welch for HMMs, EM for LGSSMs), sequential Monte Carlo and particle filtering for nonlinear/non-Gaussian models, and model-based control (LQG). Concrete downstream topics that require this material include system identification, SLAM in robotics, stochastic optimal control, variational and Monte Carlo inference in time series, and modern state-space neural architectures (e.g., deep state-space models). Understanding observability/identifiability conditions from linear algebra and the numerical stability techniques shown here is essential before deploying these methods in real systems.
