Memoryless stochastic processes. Transition matrices, stationary distributions.
Many real systems are complicated—but their state updates can be simple: “given where I am now, what’s the probability of where I go next?” Markov chains are the mathematical language for that idea, and they power everything from PageRank to MCMC sampling to reinforcement learning.
A (time-homogeneous) Markov chain is a stochastic process where the next state depends only on the current state. For finite states, its behavior is encoded by a transition matrix P with nonnegative entries and rows summing to 1. Distributions evolve as πₜ₊₁ = πₜ P. A stationary distribution π satisfies π = πP (a left eigenvector with eigenvalue 1). Structure (communicating classes, irreducibility, periodicity) determines long-run behavior and whether πₜ converges to π.
Many processes evolve step-by-step with randomness: a web surfer clicks links, a customer moves between product categories, a queue length changes, a robot localizes itself, a sampler proposes a new value. In all these, you often don’t want to model the entire history—because the history is huge.
A Markov chain is the idea that the present is enough. If you know the current state, the past doesn’t add extra predictive power for the next step.
Let {X₀, X₁, X₂, …} be a sequence of random variables taking values in a finite set of states S = {1, 2, …, n}. It is a (time-homogeneous) Markov chain if, for all times t and states i₀, …, iₜ, j,

$$\Pr(X_{t+1}=j \mid X_t=i_t,\ X_{t-1}=i_{t-1},\ \dots,\ X_0=i_0) = \Pr(X_{t+1}=j \mid X_t=i_t)$$
This is the Markov property: the future depends on the past only through the present.
Time-homogeneous means the transition rule does not change with time:

$$\Pr(X_{t+1}=j \mid X_t=i) = \Pr(X_1=j \mid X_0=i) \quad \text{for all } t$$
For a finite-state chain, we store the dynamics in a matrix P (often called a stochastic matrix):

$$P[i,j] = \Pr(X_{t+1}=j \mid X_t=i), \qquad P[i,j] \ge 0, \qquad \sum_j P[i,j] = 1$$
Interpretation: if you are currently in state i, the i-th row of P is the probability distribution over the next state.
Let πₜ be the row vector of state probabilities at time t:

$$\pi_t = \big[\Pr(X_t=1),\ \Pr(X_t=2),\ \dots,\ \Pr(X_t=n)\big]$$
Then the evolution is a clean matrix rule:

$$\pi_{t+1} = \pi_t P$$
This is one of the most important “mechanics” of Markov chains: random evolution becomes linear algebra.
Suppose S = {Rainy, Sunny}, where a Rainy day is followed by a Rainy day with probability 0.7 (Sunny with 0.3), and a Sunny day is followed by a Sunny day with probability 0.6 (Rainy with 0.4).

Using the state order (Rainy=1, Sunny=2):

$$P = \begin{bmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{bmatrix}$$

If today’s distribution is π₀ = [1, 0] (certainly Rainy), then

$$\pi_1 = \pi_0 P = [0.7,\ 0.3]$$
You should read that as: “tomorrow is 70% Rainy, 30% Sunny.”
You can think about a chain in two complementary ways:
| Viewpoint | Object | Update rule | Interpretation |
|---|---|---|---|
| Single trajectory | X₀ → X₁ → X₂ → … | sample using P | one random run of the system |
| Distribution over states | π₀, π₁, π₂, … | πₜ₊₁ = πₜP | how probabilities flow over time |
In machine learning and statistics, the distribution view is often the one that connects directly to convergence, stationarity, and sampling.
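To make the two viewpoints concrete, here is a minimal sketch (assuming NumPy) using the Rainy/Sunny matrix from the diagram below; the seed and step count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rainy/Sunny transition matrix: rows are current state, columns next state.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Viewpoint 1: a single trajectory, sampled step by step from the rows of P.
state = 0  # start Rainy (index 0)
trajectory = [state]
for _ in range(10):
    state = rng.choice(2, p=P[state])  # next state ~ row `state` of P
    trajectory.append(state)

# Viewpoint 2: the distribution over states, updated by matrix multiplication.
pi = np.array([1.0, 0.0])  # certainly Rainy
for _ in range(10):
    pi = pi @ P  # pi_{t+1} = pi_t P

print(trajectory)   # one random run, e.g. [0, 0, 1, ...]
print(pi.round(4))  # deterministic probability vector, already near its limit
```

Note how the trajectory changes with the seed while the distribution update is fully deterministic: that asymmetry is exactly why the distribution view is easier to analyze.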
To make the matrix less abstract, here’s the same chain as a directed graph:
<svg xmlns="http://www.w3.org/2000/svg" width="720" height="220" viewBox="0 0 720 220">
<defs>
<marker id="arrow" markerWidth="10" markerHeight="10" refX="8" refY="3" orient="auto" markerUnits="strokeWidth">
<path d="M0,0 L10,3 L0,6 Z" fill="#333" />
</marker>
</defs>
<rect x="0" y="0" width="720" height="220" fill="white"/>
<circle cx="200" cy="110" r="48" fill="#f4f7ff" stroke="#2b4cff" stroke-width="2"/>
<text x="200" y="116" text-anchor="middle" font-family="Arial" font-size="16">Rainy (1)</text>
<circle cx="520" cy="110" r="48" fill="#fff7f0" stroke="#ff7a00" stroke-width="2"/>
<text x="520" y="116" text-anchor="middle" font-family="Arial" font-size="16">Sunny (2)</text>
<!-- self loops -->
<path d="M170,80 C140,40 210,35 225,75" fill="none" stroke="#333" stroke-width="2" marker-end="url(#arrow)"/>
<text x="170" y="42" font-family="Arial" font-size="14">0.7</text>
<path d="M490,80 C460,40 530,35 545,75" fill="none" stroke="#333" stroke-width="2" marker-end="url(#arrow)"/>
<text x="490" y="42" font-family="Arial" font-size="14">0.6</text>
<!-- cross edges -->
<path d="M248,110 C310,70 410,70 472,110" fill="none" stroke="#333" stroke-width="2" marker-end="url(#arrow)"/>
<text x="360" y="78" font-family="Arial" font-size="14">0.3</text>
<path d="M472,120 C410,160 310,160 248,120" fill="none" stroke="#333" stroke-width="2" marker-end="url(#arrow)"/>
<text x="360" y="172" font-family="Arial" font-size="14">0.4</text>
<text x="360" y="205" text-anchor="middle" font-family="Arial" font-size="13" fill="#444">
Diagram: directed edges labeled by transition probabilities P[i,j]
</text>
</svg>

Keep this picture in mind: most structural ideas (reachability, recurrence, period) are statements about paths and cycles in this graph.
A single trajectory is noisy: you might see Rainy→Sunny→Sunny even if long-run Sunny is rare (or common). The distribution πₜ is the “average behavior over many parallel universes,” and it’s what we typically analyze.
Suppose πₜ is a row vector. Then

$$\pi_{t+1}[j] = \sum_i \pi_t[i]\, P[i,j]$$

This is just the law of total probability:

$$\Pr(X_{t+1}=j) = \sum_i \Pr(X_t=i)\,\Pr(X_{t+1}=j \mid X_t=i)$$
If one step is P, then two steps is P², and in general k steps is Pᵏ:

$$\pi_{t+k} = \pi_t P^k$$

The entries of Pᵏ have a direct probabilistic meaning:

$$P^k[i,j] = \Pr(X_{t+k}=j \mid X_t=i)$$
Start from the definition:

$$P^2[i,j] = \sum_m P[i,m]\,P[m,j]$$

Interpretation: to go from i to j in two steps, the chain must pass through some intermediate state m; the sum runs over every choice of m.

This is again the law of total probability, but now “conditioning on the middle state.”
The matrix rule can feel symbolic, so here is an explicit visualization of probability mass moving.
We use the Rainy/Sunny chain above. Start at π₀ = [1, 0] (certainly Rainy). Compute a few steps:

$$\pi_1 = [0.7,\ 0.3],\quad \pi_2 = [0.61,\ 0.39],\quad \pi_3 = [0.583,\ 0.417],\quad \pi_4 \approx [0.575,\ 0.425]$$

You can see it drifting toward a limiting mix.
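A minimal sketch (assuming NumPy) that computes these steps:

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
pi = np.array([1.0, 0.0])  # pi_0: certainly Rainy

for t in range(1, 5):
    pi = pi @ P  # one step of pi_{t+1} = pi_t P
    print(f"pi_{t} =", pi.round(4))
# successive values: [0.7, 0.3], [0.61, 0.39], [0.583, 0.417], [0.5749, 0.4251]
```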
Here’s a diagram that shows the distribution bars evolving (each bar height is probability):
<svg xmlns="http://www.w3.org/2000/svg" width="760" height="260" viewBox="0 0 760 260">
<rect x="0" y="0" width="760" height="260" fill="white"/>
<text x="20" y="28" font-family="Arial" font-size="14" fill="#222">Distribution evolution πₜ for Rainy/Sunny chain (starting at Rainy)</text>
<!-- axes baseline -->
<line x1="60" y1="220" x2="740" y2="220" stroke="#aaa" stroke-width="1"/>
<!-- function to draw bar groups manually -->
<!-- group labels -->
<text x="90" y="245" font-family="Arial" font-size="12">t=0</text>
<text x="210" y="245" font-family="Arial" font-size="12">t=1</text>
<text x="330" y="245" font-family="Arial" font-size="12">t=2</text>
<text x="450" y="245" font-family="Arial" font-size="12">t=3</text>
<text x="570" y="245" font-family="Arial" font-size="12">t=4</text>
<!-- bars: Rainy in blue, Sunny in orange. scale: 180px max -->
<!-- t=0: [1.00,0.00] -->
<rect x="70" y="40" width="35" height="180" fill="#2b4cff" opacity="0.8"/>
<rect x="110" y="220" width="35" height="0" fill="#ff7a00" opacity="0.8"/>
<!-- t=1: [0.70,0.30] -->
<rect x="190" y="94" width="35" height="126" fill="#2b4cff" opacity="0.8"/>
<rect x="230" y="166" width="35" height="54" fill="#ff7a00" opacity="0.8"/>
<!-- t=2: [0.61,0.39] -->
<rect x="310" y="110.2" width="35" height="109.8" fill="#2b4cff" opacity="0.8"/>
<rect x="350" y="149.8" width="35" height="70.2" fill="#ff7a00" opacity="0.8"/>
<!-- t=3: [0.583,0.417] -->
<rect x="430" y="115.06" width="35" height="104.94" fill="#2b4cff" opacity="0.8"/>
<rect x="470" y="144.94" width="35" height="75.06" fill="#ff7a00" opacity="0.8"/>
<!-- t=4 computed quickly: π4 = π3 P => Rainy = 0.583*0.7 + 0.417*0.4 = 0.4081 + 0.1668 = 0.5749 -->
<rect x="550" y="116.518" width="35" height="103.482" fill="#2b4cff" opacity="0.8"/>
<rect x="590" y="145.518" width="35" height="74.482" fill="#ff7a00" opacity="0.8"/>
<!-- legend -->
<rect x="640" y="50" width="14" height="14" fill="#2b4cff" opacity="0.8"/>
<text x="660" y="62" font-family="Arial" font-size="12">Rainy</text>
<rect x="640" y="72" width="14" height="14" fill="#ff7a00" opacity="0.8"/>
<text x="660" y="84" font-family="Arial" font-size="12">Sunny</text>
<text x="380" y="18" text-anchor="middle" font-family="Arial" font-size="12" fill="#444">Diagram: bars show πₜ; repeated multiplication by P shifts mass toward a limiting mix</text>
</svg>

When you later learn convergence theorems, this picture is what “converges” means: the bars stabilize.
Be careful: some textbooks use column vectors and write pₜ₊₁ = Pᵀ pₜ. In this lesson we use row vectors πₜ and πₜ₊₁ = πₜP.
A quick consistency check: if πₜ has nonnegative entries summing to 1 and P is row-stochastic, then πₜ₊₁ = πₜP does too, since ∑ⱼ πₜ₊₁[j] = ∑ᵢ πₜ[i] ∑ⱼ P[i,j] = ∑ᵢ πₜ[i] = 1.
A distribution π is stationary if evolving one step doesn’t change it:

$$\pi = \pi P, \qquad \pi_i \ge 0, \qquad \sum_i \pi_i = 1$$
This is a fixed point of the linear map “multiply by P.” Interpreted as flow: the probability leaving each state equals probability entering it (in aggregate, under π).
This equation is also an eigenvector equation: π is a left eigenvector of P with eigenvalue 1.
For any (finite) row-stochastic matrix P, at least one stationary distribution exists: 1 is always an eigenvalue (each row sums to 1), and an associated nonnegative left eigenvector can be normalized into a probability vector.
But uniqueness and convergence require extra structure (irreducible/aperiodic), which we develop next.
If you ask “what happens as t → ∞?”, the honest answer depends on the shape of the chain.
These are not numerical issues—they are graph properties of the transition structure.
We say state j is reachable from i (write i → j) if there exists some k ≥ 0 such that Pᵏ[i,j] > 0.
Two states communicate (i ↔ j) if i → j and j → i. Communication is an equivalence relation, so it partitions the state space into communicating classes.
A communicating class C is closed if you cannot leave it:

$$P[i,j] = 0 \quad \text{for all } i \in C,\ j \notin C$$
A special case: an absorbing state a has P[a,a] = 1. Then {a} is a closed class.
Intuition: once probability mass enters a closed class, it never leaves. That’s why closed classes control long-run behavior.
A Markov chain is irreducible if it has exactly one communicating class (i.e., every state can reach every other).
Why you care: for finite chains, irreducibility guarantees a unique stationary distribution, and (together with aperiodicity) convergence to it.
Even if a chain is irreducible, it might “cycle” in a rigid rhythm.
The period of a state i is

$$d(i) = \gcd\{k \ge 1 : P^k[i,i] > 0\}$$

and a state is aperiodic if d(i) = 1.
In an irreducible chain, all states share the same period. So we can talk about “the chain’s period.”
If the chain has period 2, the distribution may alternate forever between two vectors, never settling.
Aperiodicity is the condition that prevents this “synchronization.”
States {1,2} with

$$P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

Starting at 1:

$$\pi_0 = [1,0],\quad \pi_1 = [0,1],\quad \pi_2 = [1,0],\quad \pi_3 = [0,1],\ \dots$$

It never settles.

Yet it does have a stationary distribution:

$$\pi = [1/2,\ 1/2], \qquad \text{since } [1/2,\ 1/2]\,P = [1/2,\ 1/2]$$
So: existence of a stationary distribution does not automatically imply convergence to it from an arbitrary start. Periodicity blocks convergence.
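A quick numeric check of this behavior (a sketch assuming NumPy):

```python
import numpy as np

# Two-state chain that deterministically swaps states each step.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

pi = np.array([1.0, 0.0])
for t in range(4):
    pi = pi @ P
    print(pi)  # alternates [0,1], [1,0], [0,1], [1,0] -- never settles

# The uniform distribution is nonetheless stationary:
pi_star = np.array([0.5, 0.5])
print(np.allclose(pi_star @ P, pi_star))  # True
```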
| Property | Graph meaning | Long-run consequence (typical) |
|---|---|---|
| Absorbing state | self-loop with prob 1; closed singleton class | eventual trapping possible |
| Reducible chain | multiple communicating classes | long-run depends on which closed class you enter |
| Irreducible chain | one communicating class | behaves as a single component |
| Periodic chain | returns happen only at multiples of d>1 | distribution may oscillate |
| Aperiodic chain | no forced rhythm | enables convergence of πₜ |
If there are multiple closed classes, you can build stationary distributions supported on each closed class, and mixtures of those are also stationary.
Example idea: with two absorbing states a and b, both point masses δₐ and δ_b are stationary, and so is every mixture αδₐ + (1−α)δ_b.
This is one reason uniqueness needs irreducibility.
In many applied contexts, “ergodic” (for finite chains) is shorthand for irreducible and aperiodic.

This combination is what gives the clean convergence story: a unique stationary distribution π, with πₜ → π from any starting distribution.
We’ll state that more formally in the next section.
A stationary distribution π is a probability distribution that is unchanged by one step of the chain.
In applications, π often represents long-run behavior: the fraction of time spent in each state, a ranking score (PageRank), or a target sampling distribution (MCMC).
Stationarity is

$$\pi = \pi P \quad\Longleftrightarrow\quad \pi_j = \sum_i \pi_i\, P[i,j] \ \text{ for all } j$$
This gives n equations, but they are not independent because probabilities sum to 1. The standard approach is:
1) solve (Pᵀ − I) v = 0 (right-nullspace of Pᵀ − I)
2) normalize so entries sum to 1
Using our row-vector convention, π is a left eigenvector. Computationally, it’s often easier to solve the transpose system.
Also include normalization:

$$\sum_i \pi_i = 1$$

So we solve:

$$\pi(P - I) = 0, \qquad \sum_i \pi_i = 1, \qquad \pi_i \ge 0$$
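One concrete way to carry out this recipe (a sketch assuming NumPy; replacing one redundant equation with the normalization constraint is a common convention, not the only option):

```python
import numpy as np

def stationary_distribution(P: np.ndarray) -> np.ndarray:
    """Solve pi = pi P together with sum(pi) = 1 for a row-stochastic P."""
    n = P.shape[0]
    # pi (P - I) = 0  <=>  (P - I)^T pi^T = 0.  One of these n equations is
    # redundant, so overwrite the last with the constraint sum_i pi_i = 1.
    A = (P - np.eye(n)).T
    A[-1, :] = 1.0
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
print(stationary_distribution(P))  # approx [0.5714, 0.4286], i.e. [4/7, 3/7]
```

For reducible chains the system can be singular (many stationary distributions); this solver assumes a unique solution.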
For a finite, irreducible, aperiodic Markov chain: there is a unique stationary distribution π, and for every initial distribution π₀, πₜ = π₀Pᵗ → π as t → ∞.
This is the core “why Markov chains are useful” result: long-run behavior forgets the initial condition.
Two complementary intuitions:
1) Mixing / smoothing: multiplying by P averages across possible next states. Repeated averaging tends to wash out spikes in the distribution.
2) Spectral gap: eigenvalue 1 corresponds to the stationary direction. Other eigenvalues have magnitude < 1 (for ergodic chains), so their contributions decay like |λ|ᵗ.
A sketch (not a full proof): if P is diagonalizable with eigenvalues λ₁ = 1, λ₂, …, λₙ and left eigenvectors v₁ = π, v₂, …, vₙ, expand π₀ in that basis:

$$\pi_0 = c_1 v_1 + c_2 v_2 + \dots + c_n v_n$$

then

$$\pi_t = \pi_0 P^t = c_1 v_1 + c_2 \lambda_2^t v_2 + \dots + c_n \lambda_n^t v_n$$

As t grows, λ₁ = 1 stays, while |λ₂|ᵗ, |λ₃|ᵗ, … shrink to 0, leaving only the stationary component (and c₁ = 1 by normalization).
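The spectral picture is easy to see numerically (a sketch assuming NumPy, using the Rainy/Sunny chain, whose eigenvalues are 1 and 0.3):

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Eigenvalues of P: one equals 1 (the stationary direction);
# the second-largest magnitude controls the convergence rate.
eigvals = np.linalg.eigvals(P)
print(np.sort(eigvals.real))  # approximately [0.3, 1.0]

# Distance to stationarity shrinks geometrically, like |lambda_2|^t = 0.3^t.
pi_star = np.array([4/7, 3/7])
pi = np.array([1.0, 0.0])
for t in range(1, 6):
    pi = pi @ P
    print(t, np.abs(pi - pi_star).sum())  # each step is 0.3x the previous
```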
Many chains (especially in MCMC) are designed to satisfy detailed balance with respect to a desired π:

$$\pi[i]\,P[i,j] = \pi[j]\,P[j,i] \quad \text{for all } i, j$$
If detailed balance holds and the chain is irreducible, then π is stationary.
Show your work (why detailed balance implies stationarity):

Start from the stationarity condition for each state j:

$$(\pi P)[j] = \sum_i \pi[i]\,P[i,j]$$

Using detailed balance, rewrite each term:

$$\sum_i \pi[i]\,P[i,j] = \sum_i \pi[j]\,P[j,i] = \pi[j] \sum_i P[j,i] = \pi[j]$$

So (πP)[j] = π[j] for every j, which is exactly π = πP.
So detailed balance is a powerful “design pattern” for constructing chains with a chosen stationary distribution.
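As a small illustration of the design pattern (a sketch; the target π and the uniform proposal below are made up for the example), here is a Metropolis-style construction on a finite state space whose transition matrix satisfies detailed balance with respect to π by design:

```python
import numpy as np

def metropolis_matrix(pi: np.ndarray) -> np.ndarray:
    """Build a transition matrix in detailed balance with pi, using a
    uniform proposal and Metropolis acceptance min(1, pi_j / pi_i)."""
    n = len(pi)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                P[i, j] = (1.0 / (n - 1)) * min(1.0, pi[j] / pi[i])
        P[i, i] = 1.0 - P[i].sum()  # rejected proposals stay put
    return P

pi = np.array([0.1, 0.2, 0.3, 0.4])  # illustrative target distribution
P = metropolis_matrix(pi)

# Detailed balance: pi_i P[i,j] == pi_j P[j,i] means this flow matrix
# is symmetric.
flows = pi[:, None] * P
print(np.allclose(flows, flows.T))  # True

# Hence pi is stationary: pi P == pi.
print(np.allclose(pi @ P, pi))      # True
```

The key point: we never solved π = πP; symmetry of the probability flow (here, flows[i,j] = (1/3)·min(πᵢ, πⱼ)) gives stationarity for free.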
The web can be viewed as a directed graph; a random surfer follows outgoing links. The transition matrix is built from link structure, often with a “teleport” term to guarantee irreducibility/aperiodicity.
A common form:

$$P = \alpha\, P_{\text{link}} + (1-\alpha)\,\tfrac{1}{n}\mathbf{1}\mathbf{1}^\top$$

where P_link is the row-normalized link matrix, n is the number of pages, and α (often around 0.85) is the damping factor. (Here written in matrix form; conceptually, with probability 1−α you jump uniformly at random.)
The PageRank vector is the stationary distribution π satisfying π = πP.
Why teleport helps: the uniform-jump term makes every entry of P strictly positive, so the chain is irreducible and aperiodic. That guarantees a unique stationary distribution (the PageRank vector) and convergence of the power iteration used to compute it.
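A toy version of the whole pipeline (a sketch assuming NumPy; the 4-page link graph is invented for illustration, and every page here has at least one outgoing link, so no dangling-node handling is needed):

```python
import numpy as np

# Adjacency of a tiny 4-page web: A[i, j] = 1 if page i links to page j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

n = A.shape[0]
P_link = A / A.sum(axis=1, keepdims=True)               # row-normalize links
alpha = 0.85
P = alpha * P_link + (1 - alpha) * np.ones((n, n)) / n  # teleport mix

# Power iteration: repeatedly apply pi -> pi P until it stops changing.
pi = np.ones(n) / n
for _ in range(100):
    pi = pi @ P

print(pi.round(4), pi.sum())  # PageRank scores; they sum to 1
```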
RL and MDPs assume the environment has the Markov property:

$$\Pr(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots) = \Pr(S_{t+1} \mid S_t, A_t)$$
Even if you haven’t introduced actions yet, pure Markov chains are the “no-action” baseline. Concepts like stationary distribution later become state-visitation (occupancy) distributions under a fixed policy.
In MCMC, you want samples from a complex target distribution π (e.g., a posterior). You design a Markov chain whose stationary distribution is π, run it, and treat later states as approximate samples.
Key connection: detailed balance is how you construct a chain whose stationary distribution equals the target, and the convergence theory is why running the chain long enough yields approximate samples from π.
Consider a 3-state Markov chain with transition matrix

$$P = \begin{bmatrix} 0.5 & 0.5 & 0.0 \\ 0.2 & 0.5 & 0.3 \\ 0.1 & 0.0 & 0.9 \end{bmatrix}$$
Find a stationary distribution π = [π₁, π₂, π₃].
Write the stationarity equations π = πP component-wise.
We compute each coordinate:
π₁ = 0.5π₁ + 0.2π₂ + 0.1π₃
π₂ = 0.5π₁ + 0.5π₂ + 0.0π₃
π₃ = 0.0π₁ + 0.3π₂ + 0.9π₃
Bring terms to one side to get linear equations.
(1) π₁ − 0.5π₁ − 0.2π₂ − 0.1π₃ = 0 ⇒ 0.5π₁ − 0.2π₂ − 0.1π₃ = 0
(2) π₂ − 0.5π₁ − 0.5π₂ = 0 ⇒ 0.5π₂ − 0.5π₁ = 0 ⇒ π₂ = π₁
(3) π₃ − 0.3π₂ − 0.9π₃ = 0 ⇒ 0.1π₃ − 0.3π₂ = 0 ⇒ π₃ = 3π₂
Use (2) and (3) to express everything in terms of π₁.
From (2): π₂ = π₁
From (3): π₃ = 3π₂ = 3π₁
Check that equation (1) is consistent.
Plug in:
0.5π₁ − 0.2π₂ − 0.1π₃
= 0.5π₁ − 0.2(π₁) − 0.1(3π₁)
= 0.5π₁ − 0.2π₁ − 0.3π₁
= 0
So it’s consistent (as expected; one equation is redundant).
Apply normalization π₁ + π₂ + π₃ = 1.
π₁ + π₁ + 3π₁ = 5π₁ = 1
⇒ π₁ = 1/5
⇒ π₂ = 1/5
⇒ π₃ = 3/5
Conclude:
π = [1/5, 1/5, 3/5].
Insight: Stationarity is an eigenvector condition with a probability normalization. In finite chains, one of the linear equations is typically redundant, so you solve n−1 independent equations plus ∑πᵢ=1.
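The hand computation can be double-checked numerically (a sketch assuming NumPy; the matrix entries come from the stationarity equations above):

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.1, 0.0, 0.9]])

pi = np.array([1/5, 1/5, 3/5])
print(np.allclose(pi @ P, pi))  # True: pi is stationary

# This chain is irreducible and aperiodic, so iterating from any start
# converges to the same vector.
p = np.array([1.0, 0.0, 0.0])
for _ in range(500):
    p = p @ P
print(p.round(4))  # approx [0.2, 0.2, 0.6]
```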
Consider a 4-state chain with states {1,2,3,4} and transitions: 1 and 2 swap deterministically (1→2 and 2→1 with probability 1), state 3 can move to either side (here, one consistent choice: 3→1 and 3→4 with probability 1/2 each), and state 4 is absorbing (4→4 with probability 1).

In matrix form (rows are current state):

$$P = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1/2 & 0 & 0 & 1/2 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
(a) Identify communicating classes and whether they are closed.
(b) Does πₜ converge from an arbitrary start? Explain briefly.
(c) Find all stationary distributions.
(a) Find communicating classes: {1,2} (states 1 and 2 reach each other), {3} (once the chain leaves state 3 it never returns), and {4}.

Decide which classes are closed: {1,2} and {4} are closed; {3} is not, since probability mass leaves it and never comes back.
(b) Convergence of πₜ.
There are two issues:
1) Reducibility: there are multiple closed classes ({1,2} and {4}). Long-run behavior depends on whether probability mass ends up in {1,2} or in {4}.
2) Periodicity inside {1,2}: the subchain on {1,2} alternates with period 2.
So from an arbitrary start, πₜ does not necessarily converge to a single limit vector. If some mass is in {1,2}, that part oscillates forever between states 1 and 2. Mass that reaches 4 stays at 4.
(c) Find all stationary distributions.
A stationary distribution can place mass only on closed classes, and must be stationary within each one. On {1,2}, the 2-cycle forces equal mass: [1/2, 1/2, 0, 0]. On {4}: [0, 0, 0, 1].
Therefore any convex combination is stationary:
π(α) = α[1/2,1/2,0,0] + (1−α)[0,0,0,1]
= [α/2, α/2, 0, 1−α]
for α ∈ [0,1].
Insight: Closed communicating classes determine where probability can live forever. Multiple closed classes ⇒ non-unique stationary distributions. Periodicity can prevent convergence even when a stationary distribution exists.
Using the Rainy/Sunny chain

$$P = \begin{bmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{bmatrix}$$
compute Pr(X₂ = Sunny | X₀ = Rainy).
Recognize that this is a 2-step transition probability:
Pr(X₂=Sunny | X₀=Rainy) = P²[Rainy,Sunny].
Compute P² = P·P. Multiplying out each entry:

$$P^2 = \begin{bmatrix} 0.61 & 0.39 \\ 0.52 & 0.48 \end{bmatrix}$$
Top-right entry (Rainy→Sunny in 2 steps):
P²[1,2] = 0.7·0.3 + 0.3·0.6 = 0.21 + 0.18 = 0.39.
Conclude:
Pr(X₂ = Sunny | X₀ = Rainy) = 0.39.
Insight: Pᵏ packages “sum over all length-k paths” into matrix multiplication. Each entry of Pᵏ is exactly a k-step conditional probability.
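A one-line numerical check of the worked answer (assuming NumPy):

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
P2 = P @ P
print(P2[0, 1])  # approx 0.39 = Pr(X2 = Sunny | X0 = Rainy)
```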
Markov property: $$\Pr(X_{t+1}=j \mid X_t=i_t, \dots, X_0=i_0) = \Pr(X_{t+1}=j \mid X_t=i_t)$$
For finite chains, the transition matrix P is row-stochastic: P[i,j] ≥ 0 and ∑ⱼ P[i,j] = 1.
Distributions evolve linearly: $$\pi_{t+1} = \pi_t P$$
k-step transition probabilities are entries of Pᵏ: $$P^k[i,j] = \Pr(X_{t+k}=j \mid X_t=i)$$
A stationary distribution satisfies $$\pi = \pi P$$ (a left eigenvector with eigenvalue 1 plus normalization).
Graph structure matters: communicating classes and closed classes determine trapping and non-uniqueness.
Irreducible + aperiodic (often called ergodic in finite chains) implies unique stationary distribution and convergence πₜ → π.
Periodicity can prevent convergence even when a stationary distribution exists (e.g., 2-cycle alternation).
Mixing up row-vector vs column-vector conventions (and accidentally using Pᵀ). Always check whether rows or columns sum to 1 and which side you multiply on.
Assuming “stationary distribution exists” ⇒ “πₜ converges to it.” Periodic or reducible chains can have stationary distributions without convergence from arbitrary starts.
Forgetting that reducible chains can have many stationary distributions (mixtures across closed classes).
Computing π from π=πP but forgetting the normalization constraint ∑ᵢ πᵢ = 1 (or accepting negative entries from an un-constrained solve).
Let

$$P = \begin{bmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{bmatrix}$$
(a) Solve for the stationary distribution π.
(b) Starting from π₀=[1,0], compute π₁ and π₂.
Hint: Use π=πP and π₁+π₂=1. For part (b), multiply row vectors by P.
Let π=[a,b]. Stationarity gives:
a = 0.9a + 0.2b
b = 0.1a + 0.8b (redundant)
From first: a − 0.9a = 0.2b ⇒ 0.1a = 0.2b ⇒ a = 2b.
Normalize: a+b=1 ⇒ 2b+b=1 ⇒ b=1/3, a=2/3.
So π=[2/3, 1/3].
(b) π₁ = [1,0]P = [0.9,0.1].
π₂ = π₁P = [0.9,0.1]P:
First component = 0.9·0.9 + 0.1·0.2 = 0.81+0.02=0.83.
Second component = 0.9·0.1 + 0.1·0.8 = 0.09+0.08=0.17.
So π₂=[0.83,0.17].
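A numeric check of both parts (a sketch assuming NumPy):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# (a) stationary distribution
pi = np.array([2/3, 1/3])
print(np.allclose(pi @ P, pi))  # True

# (b) two steps from [1, 0]
pi1 = np.array([1.0, 0.0]) @ P
pi2 = pi1 @ P
print(pi1, pi2)  # [0.9, 0.1] then [0.83, 0.17]
```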
Consider the 3-state chain with edges 1→2, 2→3, 3→1 each with probability 1.
(a) Is the chain irreducible?
(b) What is its period?
(c) Find a stationary distribution.
(d) Does πₜ converge to that stationary distribution from π₀=[1,0,0]?
Hint: This is a deterministic cycle. Think about return times to a state. For stationarity, symmetry is a clue.
(a) Yes, it is irreducible: from any state you can reach any other by following the cycle.
(b) The period is 3 because you return to state 1 only at times 3,6,9,… so gcd is 3.
(c) Stationary distribution is uniform: π=[1/3,1/3,1/3] (check: multiplying by P permutes coordinates).
(d) No. Starting from [1,0,0], the distribution is a point mass that rotates: π₁=[0,1,0], π₂=[0,0,1], π₃=[1,0,0]. It does not converge, due to periodicity.
A chain has two closed communicating classes C₁ and C₂, each irreducible and aperiodic on its own. Explain why the overall chain can have infinitely many stationary distributions, and give the general form in words or symbols.
Hint: Stationary distributions can be supported entirely on a closed class. Mixtures of stationary distributions are stationary.
Let π¹ be the unique stationary distribution supported on C₁ (zero outside C₁), and π² the unique stationary distribution supported on C₂. Because C₁ and C₂ are closed, probability mass never leaves them, so each π¹ and π² is stationary for the full chain. Any convex combination
π(α) = απ¹ + (1−α)π², α ∈ [0,1]
is also stationary (linearity of πP). As α varies continuously, there are infinitely many stationary distributions.
Next nodes you can unlock: convergence rates and mixing times (via the spectral gap), MCMC samplers (via detailed balance), and Markov decision processes (via the Markov property with actions).