Formal framework for sequential decision-making. Bellman equations.
Markov Decision Processes (MDPs) are the “assembly language” of reinforcement learning: once you can express a problem as an MDP, you can derive Bellman equations, prove what an optimal policy means, and justify algorithms like value iteration, policy iteration, and actor-critic methods.
An MDP formalizes sequential decision-making with (S, A, P, R, γ). A policy π(a|s) induces a Markov reward process, which defines value functions V^π(s) and Q^π(s,a) as expected discounted return. Bellman expectation equations describe V^π and Q^π recursively. Bellman optimality equations define V*(s) and Q*(s,a) via a max over actions. Dynamic programming uses these equations to compute optimal policies when P and R are known.
In reinforcement learning, you interact with an environment over time. The key difficulty is credit assignment across time: an action now may cause a good or bad outcome many steps later. To reason about this cleanly, we want a model in which the relevant history is compressed into a current state and the one-step dynamics are well defined. An MDP is the standard model that accomplishes this.
A (finite) Markov Decision Process is a tuple (S, A, P, R, γ): a finite state set S, a finite action set A, a transition kernel P, a reward function R, and a discount factor γ ∈ [0, 1). The transition kernel is
P(s′ | s, a) = Pr(Sₜ₊₁ = s′ | Sₜ = s, Aₜ = a)
The “atomic step” of an MDP is a decision epoch: observe the state Sₜ, choose an action Aₜ, receive a reward Rₜ₊₁, and transition to the next state Sₜ₊₁ ∼ P(· | Sₜ, Aₜ).
The Markov property means the future depends on the past only through the current state (and chosen action):
Pr(Sₜ₊₁ = s′ | S₀, A₀, …, Sₜ = s, Aₜ = a) = Pr(Sₜ₊₁ = s′ | Sₜ = s, Aₜ = a)
This assumption is what lets us write Bellman equations: it provides the “optimal substructure” for sequential decisions.
A trajectory is a sequence (S₀, A₀, R₁, S₁, A₁, R₂, …). The (discounted) return from time t is:
Gₜ = ∑ₖ₌₀^∞ γᵏ Rₜ₊ₖ₊₁
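For a finite reward sequence (treating rewards beyond the horizon as zero), the discounted return can be computed by folding from the end; a minimal sketch:

```python
def discounted_return(rewards, gamma):
    """Truncated discounted return: rewards beyond the list are assumed zero."""
    g = 0.0
    for r in reversed(rewards):   # uses G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```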
Discounting is not just a mathematical trick. It does three practical jobs: it makes the infinite sum converge (for bounded rewards and γ < 1), it encodes a preference for sooner rewards over later ones, and it sets an effective planning horizon of roughly 1/(1 − γ).
An MDP generalizes a Markov chain by adding actions and rewards, and in doing so it sets up dynamic programming:
| Object | What it specifies | What you can do with it |
|---|---|---|
| Markov chain | P(s′ \| s) | Study long-run state distribution |
| Markov reward process | P(s′ \| s), R(s) | Evaluate long-run returns |
| MDP | P(s′ \| s, a), R, γ | Choose actions to maximize return |
The rest of this lesson builds the two central ingredients: policies and value functions, and then shows how Bellman equations connect them.
An MDP by itself is just a world model. To talk about “expected return,” we must specify how actions are selected. That’s the role of a policy.
A policy π is a decision rule mapping states to actions. It can be deterministic, π(s) = a, or stochastic, π(a|s) = Pr(Aₜ = a | Sₜ = s). Stochastic policies matter even when the environment is fully known: they allow exploration, they break ties among equally good actions, and they are the natural object that policy gradient methods optimize.
Once you fix π, the agent is no longer “choosing” actions freely; actions are sampled from π. The resulting state process becomes a Markov chain with transition matrix:
P^π(s′|s) = ∑ₐ π(a|s) P(s′|s,a)
If rewards depend on (s,a) (or (s,a,s′)), you also get an induced reward model, e.g.:
R^π(s) = ∑ₐ π(a|s) R(s,a)
This reduction is conceptually important: policy evaluation is “just” analyzing the Markov reward process induced by π.
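The reduction above can be sketched with numpy. The tensor shapes P[s, a, s′], R[s, a], and pi[s, a], and the tiny MDP itself, are assumptions of this sketch:

```python
import numpy as np

# Tiny hypothetical MDP, used only to illustrate the reduction.
# Shapes: P[s, a, s'] transition probs, R[s, a] expected rewards, pi[s, a] policy.
P = np.array([[[1.0, 0.0],    # s0, a0: stay in s0
               [0.0, 1.0]],   # s0, a1: go to s1
              [[0.0, 1.0],    # s1, a0: stay in s1
               [0.0, 1.0]]])  # s1, a1: stay in s1
R = np.array([[1.0, 0.0],
              [2.0, 2.0]])
pi = np.array([[0.5, 0.5],    # uniform in s0
               [1.0, 0.0]])   # always a0 in s1

# Induced Markov reward process:
#   P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a),   r_pi[s] = sum_a pi(a|s) R(s, a)
P_pi = np.einsum('sa,sax->sx', pi, P)
r_pi = np.einsum('sa,sa->s', pi, R)
print(P_pi)   # [[0.5 0.5] [0.  1. ]]
print(r_pi)   # [0.5 2. ]
```

Once `P_pi` and `r_pi` are in hand, the agent's identity has vanished: evaluating π is a question about a Markov reward process.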
Now we define the central predictive quantities.
State-value function:
V^π(s) = 𝔼_π[ Gₜ | Sₜ = s ] = 𝔼_π[ ∑ₖ₌₀^∞ γᵏ Rₜ₊ₖ₊₁ | Sₜ = s ]
Action-value function:
Q^π(s,a) = 𝔼_π[ Gₜ | Sₜ = s, Aₜ = a ]
Intuition: V^π(s) answers “how good is it to be in s if I follow π from here?”, while Q^π(s,a) answers “how good is it to take a in s now and then follow π afterward?”
A closely related quantity is the advantage:
A^π(s,a) = Q^π(s,a) − V^π(s)
It measures how much better/worse an action is compared to the policy’s average behavior at s.
Because of the Markov property, the future after the next step depends only on the next state (and the policy). So the return decomposes into:
Gₜ = Rₜ₊₁ + γ Gₜ₊₁
Take conditional expectation under π.
Start from the definition:
V^π(s) = 𝔼_π[ Gₜ | Sₜ = s ]
Substitute Gₜ = Rₜ₊₁ + γ Gₜ₊₁:
V^π(s)
= 𝔼_π[ Rₜ₊₁ + γ Gₜ₊₁ | Sₜ = s ]
= 𝔼_π[ Rₜ₊₁ | Sₜ = s ] + γ 𝔼_π[ Gₜ₊₁ | Sₜ = s ]
Now expand the next-step expectation: by the tower property, 𝔼_π[ Gₜ₊₁ | Sₜ = s ] = 𝔼_π[ V^π(Sₜ₊₁) | Sₜ = s ], so conditioning on Aₜ and Sₜ₊₁ gives:
𝔼_π[ Rₜ₊₁ + γ V^π(Sₜ₊₁) | Sₜ = s ]
= ∑ₐ π(a|s) ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ V^π(s′) )
So:
V^π(s) = ∑ₐ π(a|s) ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ V^π(s′) )
This is the Bellman expectation equation for V^π.
Similarly:
Q^π(s,a) = ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ 𝔼_{a′∼π(·|s′)}[ Q^π(s′,a′) ] )
Or equivalently:
Q^π(s,a) = ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ ∑_{a′} π(a′|s′) Q^π(s′,a′) )
When π is fixed, you can write the state-value Bellman equation in linear algebra form.
Let V^π ∈ ℝ^{|S|} be a vector with entries V^π(s). Define r^π ∈ ℝ^{|S|} with entries r^π(s) = ∑ₐ π(a|s) ∑_{s′} P(s′|s,a) R(s,a,s′), and let P^π be the induced transition matrix from the previous section. Then:
V^π = r^π + γ P^π V^π
Rearrange:
( I − γ P^π ) V^π = r^π
If ( I − γ P^π ) is invertible (it is under standard discounted assumptions), then:
V^π = ( I − γ P^π )⁻¹ r^π
This “closed form” is conceptually useful: policy evaluation becomes solving a linear system. Practically, DP and iterative methods are often preferred.
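As a quick sketch of the linear-system route (the matrix `P_pi` and vector `r_pi` below are arbitrary assumed inputs, not from the text):

```python
import numpy as np

# Policy evaluation: solve (I - gamma * P_pi) V = r_pi for an assumed induced MRP.
gamma = 0.9
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])   # any row-stochastic matrix works
r_pi = np.array([1.0, 0.0])

V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# The solution satisfies the Bellman expectation equation exactly:
assert np.allclose(V, r_pi + gamma * P_pi @ V)
```

The direct solve costs O(|S|³), which is why iterative methods win for large state spaces.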
Many tasks end (episode termination). A common modeling choice is to add an absorbing terminal state with zero reward, or equivalently to fix V(terminal) = 0. Either way, Bellman equations remain valid with appropriate boundary conditions.
Once you define V^π and Q^π, it’s tempting to say “pick the action with highest value.” But value depends on the policy itself: if you change actions now, future actions may also change.
So we define optimality in a fixed point sense: an optimal policy is one whose value dominates all others.
Define the optimal state-value function:
V*(s) = max_π V^π(s)
and optimal action-value function:
Q*(s,a) = max_π Q^π(s,a)
These exist for finite discounted MDPs.
Start from the principle of optimality: if you are in s and act optimally, you choose the best action a, the environment transitions to s′, and you continue optimally from s′.
That gives:
V*(s) = maxₐ ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ V*(s′) )
This is the Bellman optimality equation.
Similarly, for Q* you commit to the first action a, then behave optimally afterward:
Q*(s,a) = ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ max_{a′} Q*(s′,a′) )
Once you have Q*(s,a), you can define a greedy policy:
π*(s) ∈ argmaxₐ Q*(s,a)
This π* is optimal (ties may yield multiple optimal policies).
A key piece of theory is that being greedy with respect to a value function improves (or at least does not worsen) the policy.
Let π be any policy, and define a new policy π′ such that for all s:
π′(s) ∈ argmaxₐ Q^π(s,a)
Then:
V^{π′}(s) ≥ V^π(s) for all s
This theorem justifies policy iteration: evaluate π to get V^π (or Q^π), improve greedily to get π′, repeat.
Define the Bellman optimality operator T acting on a value function V:
(TV)(s) = maxₐ ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ V(s′) )
Then V* is the unique fixed point:
V* = T V*
Under the max norm ‖·‖∞, T is a γ-contraction:
‖T V − T W‖∞ ≤ γ ‖V − W‖∞
This matters because contraction gives existence and uniqueness of the fixed point V* (via the Banach fixed-point theorem) and guarantees that repeatedly applying T converges to V* geometrically.
Even if you don’t prove the contraction fully, the intuition is strong: discounting shrinks the influence of future differences by γ each step.
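A numerical spot check of the contraction property on a randomly generated MDP (the shapes, sizes, and random seed here are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 3, 0.9

# Random tabular MDP: P[s, a, s'] transition probs, R[s, a] expected rewards.
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=-1, keepdims=True)          # normalize rows to probabilities
R = rng.random((n_s, n_a))

def T(V):
    # Bellman optimality operator: (TV)(s) = max_a [ R(s,a) + gamma * E[V(s')] ]
    return (R + gamma * P @ V).max(axis=1)

V = rng.random(n_s)
W = rng.random(n_s)
lhs = np.abs(T(V) - T(W)).max()             # ||TV - TW||_inf
rhs = gamma * np.abs(V - W).max()           # gamma * ||V - W||_inf
assert lhs <= rhs + 1e-12                   # gamma-contraction in the max norm
```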
| Equation type | Recursion | Unknown | Use |
|---|---|---|---|
| Expectation (policy evaluation) | V^π = r^π + γ P^π V^π | V^π | Evaluate a fixed π |
| Optimality (planning) | V* = T V* | V* | Find optimal values |
| Q-optimality | Q* = T_Q Q* | Q* | Extract greedy optimal π* |
A common confusion: “Is reward tied to states or transitions?” In general, reward can depend on (s,a,s′). For many derivations, we use its expectation:
R̄(s,a) = 𝔼[ Rₜ₊₁ | Sₜ=s, Aₜ=a ] = ∑_{s′} P(s′|s,a) R(s,a,s′)
Then Bellman optimality for V* can be written:
V*(s) = maxₐ ( R̄(s,a) + γ ∑_{s′} P(s′|s,a) V*(s′) )
It’s the same idea, just with less notation.
In finite discounted MDPs, there always exists an optimal deterministic stationary policy (depending only on s, not on time). That’s a powerful simplification: you can restrict the search to the |A|^|S| deterministic stationary policies rather than all history-dependent, time-varying, stochastic ones. This fact is part of why MDPs are such a clean framework.
Dynamic programming works when the model is known (P and R are available) and the state and action sets are small enough to enumerate. The Bellman optimality equation is precisely the DP recursion.
Value iteration applies the optimality operator repeatedly.
Initialize V₀(s) arbitrarily (often 0). Then iterate:
V_{k+1}(s) = maxₐ ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ V_k(s′) )
After convergence to V*, extract:
π*(s) ∈ argmaxₐ ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ V*(s′) )
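A minimal value-iteration sketch in numpy, assuming a tabular model with P[s, a, s′] and expected rewards R[s, a] (the toy MDP at the bottom is an assumption of the sketch):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V <- TV until successive sweeps agree to within tol."""
    V = np.zeros(P.shape[0])
    while True:
        V_new = (R + gamma * P @ V).max(axis=1)      # apply T once
        if np.abs(V_new - V).max() < tol:
            break
        V = V_new
    policy = (R + gamma * P @ V).argmax(axis=1)      # greedy extraction
    return V, policy

# Toy 2-state MDP: in s0, a1 moves to s1 (reward 1); s1 self-loops (reward 2).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, :, 1] = 1.0
R = np.array([[0.0, 1.0], [2.0, 2.0]])
V, pi_star = value_iteration(P, R, gamma=0.5)
print(V, pi_star)   # V approx [3. 4.]; the greedy action in s0 is a1
```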
What’s happening conceptually: each sweep applies the operator T once, so V_k = T^k V₀, and the contraction property drives V_k toward V* geometrically at rate γ.
Policy iteration alternates between evaluation and improvement.
1) Policy evaluation: solve for V^π via:
V^π(s) = ∑ₐ π(a|s) ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ V^π(s′) )
In matrix form:
( I − γ P^π ) V^π = r^π
2) Policy improvement:
π_new(s) ∈ argmaxₐ ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ V^π(s′) )
Repeat until the policy stops changing.
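The loop above, sketched in numpy under the same tabular assumptions (P[s, a, s′], R[s, a] known); evaluation is done exactly via the linear system, and the toy MDP is illustrative only:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    n_s = P.shape[0]
    policy = np.zeros(n_s, dtype=int)            # start from an arbitrary policy
    while True:
        # 1) Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi.
        idx = np.arange(n_s)
        P_pi, r_pi = P[idx, policy], R[idx, policy]
        V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
        # 2) Greedy policy improvement.
        new_policy = (R + gamma * P @ V).argmax(axis=1)
        if np.array_equal(new_policy, policy):   # policy stopped changing
            return V, policy
        policy = new_policy

# Same toy MDP as before: in s0, a1 moves to s1 (reward 1); s1 self-loops (reward 2).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, :, 1] = 1.0
R = np.array([[0.0, 1.0], [2.0, 2.0]])
V, pi_star = policy_iteration(P, R, gamma=0.5)
```

Because each improvement step is a strict improvement until the fixed point, the loop terminates after at most |A|^|S| policies in a finite MDP.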
| Method | Main idea | Pros | Cons |
|---|---|---|---|
| Value iteration | Update V toward V* directly | Simple, memory-light | Many iterations when γ≈1 |
| Policy iteration | Evaluate π exactly/approximately, then improve | Often fewer iterations | Evaluation can be expensive |
| Modified policy iteration | Partial evaluation + improvement | Flexible tradeoff | More moving parts |
DP assumes you know P and R. In many RL problems you don’t; you only observe samples.
But the MDP framing still matters: value functions and Bellman targets define what you estimate from samples (TD learning and Q-learning approximate exactly these equations), and the optimality theory tells you what the algorithms are converging toward.
This is the bridge from MDP theory to modern RL.
Policy gradient methods optimize π_θ directly. The objective is typically:
J(θ) = 𝔼_{τ∼π_θ}[ ∑ₜ γᵗ Rₜ₊₁ ]
Even though policy gradients do not require an explicit model P, they still rely on the MDP structure: the return decomposes as Gₜ = Rₜ₊₁ + γ Gₜ₊₁, and critics and baselines estimate V^π or the advantage A^π using exactly these Bellman relationships.
So understanding Bellman equations makes actor-critic updates feel motivated rather than magical.
When you discretize a continuous or ill-defined task (common in LLM agent systems), you often create a state abstraction, a discrete action set, and a reward signal. If the resulting system approximately satisfies a Markov property (the state summarizes relevant history), you can treat it as an MDP and apply planning and evaluation ideas: evaluating a fixed policy, improving it greedily, and reasoning about discounted long-horizon value.
Even if the environment is only approximately Markov, the MDP lens provides a disciplined way to debug and improve the design.
MDPs are clean, but real RL systems introduce complications: partial observability, large or continuous state and action spaces, function approximation, and non-stationary environments.
Still, the MDP + Bellman equations remain the conceptual backbone.
States S = {A, B}. Actions are irrelevant because the policy is fixed and the transition is policy-induced.
Dynamics under π: from A, receive reward 2 and move to B; from B, receive reward 1 and stay in B (all transitions deterministic).
Discount γ = 0.9
Find V^π(A) and V^π(B).
Write Bellman equations.
V^π(A) = 𝔼[R|A] + γ 𝔼[V^π(S′)|A]
V^π(B) = 𝔼[R|B] + γ 𝔼[V^π(S′)|B]
Substitute the deterministic transitions and rewards.
From A: reward 2, next state B.
V^π(A) = 2 + 0.9 V^π(B)
From B: reward 1, next state B.
V^π(B) = 1 + 0.9 V^π(B)
Solve for V^π(B) first.
V^π(B) = 1 + 0.9 V^π(B)
V^π(B) − 0.9 V^π(B) = 1
0.1 V^π(B) = 1
V^π(B) = 10
Plug into V^π(A).
V^π(A) = 2 + 0.9 · 10
V^π(A) = 2 + 9
V^π(A) = 11
Insight: Bellman expectation equations often let you solve values with simple algebra. Notice how a self-loop with constant reward creates a geometric series; the Bellman equation encodes that series compactly.
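The two-state system above can also be checked numerically; a quick sketch with numpy:

```python
import numpy as np

# Numerical check of the worked example: A -> B (reward 2), B -> B (reward 1).
gamma = 0.9
P_pi = np.array([[0.0, 1.0],   # from A: go to B
                 [0.0, 1.0]])  # from B: stay in B
r_pi = np.array([2.0, 1.0])

V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V)   # [11. 10.]
```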
Single nonterminal state s, plus terminal state T with V(T)=0.
At state s you can choose: action a₁, which yields reward 5 and transitions to the terminal state T; or action a₂, which yields reward 1 and returns to s.
Discount γ = 0.8
Compute V*(s), Q*(s,a₁), Q*(s,a₂), and the optimal action.
Write Q* for each action using the Q Bellman optimality equation.
Q*(s,a) = ∑_{s′} P(s′|s,a) ( R(s,a,s′) + γ max_{a′} Q*(s′,a′) )
Compute Q*(s,a₁).
Taking a₁ leads to terminal T with reward 5.
Q*(s,a₁) = 5 + 0.8 · V*(T)
V*(T) = 0
So Q*(s,a₁) = 5
Compute Q*(s,a₂).
Taking a₂ gives reward 1 and returns to s.
Q*(s,a₂) = 1 + 0.8 · max_{a′} Q*(s,a′)
But max_{a′} Q*(s,a′) is exactly V*(s). So:
Q*(s,a₂) = 1 + 0.8 V*(s)
Relate V*(s) to Q*.
V*(s) = max( Q*(s,a₁), Q*(s,a₂) )
= max( 5, 1 + 0.8 V*(s) )
Check which action is optimal by solving the fixed point.
If a₁ is optimal, then V*(s)=5.
Then Q*(s,a₂)=1+0.8·5=1+4=5.
So a₂ ties with a₁; both give value 5.
Confirm the tie is consistent.
If V*(s)=5, then max(5, 1+0.8·5)=max(5,5)=5, consistent.
Therefore V*(s)=5 and both actions are optimal.
Final:
Q*(s,a₁)=5
Q*(s,a₂)=5
V*(s)=5
Insight: Bellman optimality can produce ties and multiple optimal policies. Here, a₂ is a “delayed gratification loop” whose discounted infinite sum matches the immediate terminal payoff.
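A quick numerical confirmation via Q-value iteration (the state indexing and the absorbing-terminal modeling are choices of this sketch):

```python
import numpy as np

# State 0 = s, state 1 = terminal T, modeled as absorbing with zero reward.
gamma = 0.8
P = np.zeros((2, 2, 2))
P[0, 0, 1] = 1.0   # a1: s -> T
P[0, 1, 0] = 1.0   # a2: s -> s
P[1, :, 1] = 1.0   # T absorbs under both actions
R = np.array([[5.0, 1.0],    # R(s, a1) = 5, R(s, a2) = 1
              [0.0, 0.0]])   # terminal: no reward

# Q-value iteration: Q <- R + gamma * P @ max_a' Q
Q = np.zeros((2, 2))
for _ in range(200):
    Q = R + gamma * (P @ Q.max(axis=1))
print(Q[0])   # both entries converge to 5, confirming the tie
```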
MDP with states S={s₁,s₂}, actions A={L,R}. Discount γ=0.5.
Transitions and rewards: in s₁, L gives reward 0 and moves to s₂, while R gives reward 2 and moves to s₂; in s₂, L gives reward 1 and stays in s₂.
Current policy π chooses L everywhere.
Compute Q^π(s₁,L), Q^π(s₁,R) and improve π at s₁.
Evaluate V^π(s₂) under π (which picks L at s₂).
Bellman:
V^π(s₂) = 1 + 0.5 V^π(s₂)
V^π(s₂) − 0.5 V^π(s₂) = 1
0.5 V^π(s₂) = 1
V^π(s₂) = 2
Compute Q^π(s₁,L).
Taking L in s₁ gives reward 0 then goes to s₂.
Q^π(s₁,L) = 0 + 0.5 V^π(s₂)
= 0.5 · 2
= 1
Compute Q^π(s₁,R).
Taking R in s₁ gives reward 2 then goes to s₂.
Q^π(s₁,R) = 2 + 0.5 V^π(s₂)
= 2 + 1
= 3
Improve policy greedily at s₁.
argmax_a Q^π(s₁,a) = R because 3 > 1.
So set π′(s₁)=R (keeping π′(s₂)=L if only improving at s₁).
Insight: Policy improvement is local but principled: compare actions by their Q^π values (immediate reward plus discounted value of the next state). This is the core move behind policy iteration and many RL control methods.
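The arithmetic of this example is small enough to verify inline; a minimal sketch:

```python
# Direct check of the policy-improvement step (gamma = 0.5, pi picks L everywhere).
gamma = 0.5

V_s2 = 1.0 / (1.0 - gamma)      # self-loop in s2 with reward 1: V = 1 / (1 - gamma)
Q_s1_L = 0.0 + gamma * V_s2     # L in s1: reward 0, then continue from s2
Q_s1_R = 2.0 + gamma * V_s2     # R in s1: reward 2, then continue from s2

improved_action = 'R' if Q_s1_R > Q_s1_L else 'L'
print(V_s2, Q_s1_L, Q_s1_R, improved_action)   # 2.0 1.0 3.0 R
```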
An MDP is defined by (S, A, P, R, γ) and models sequential decisions where (state, action) determine a distribution over next states (Markov property).
A policy π(a|s) turns an MDP into a Markov reward process with induced transition P^π(s′|s)=∑ₐ π(a|s)P(s′|s,a).
Value functions V^π(s) and Q^π(s,a) are expectations of the discounted return Gₜ=∑ₖ γᵏRₜ₊ₖ₊₁ under π.
Bellman expectation equations recursively define V^π and Q^π using one-step dynamics plus discounted continuation value.
Optimal value functions V* and Q* satisfy Bellman optimality equations with a max over actions; greedy policies w.r.t. Q* are optimal.
Discount γ controls effective planning horizon and ensures contraction/convergence for standard discounted settings.
Dynamic programming methods (value iteration, policy iteration) solve known MDPs by repeatedly applying Bellman operators or alternating evaluation and improvement.
MDP/Bellman structure underlies modern model-free RL (TD learning, Q-learning) and policy gradient methods (via baselines and advantage).
Confusing V^π and V*: V^π evaluates a specific policy; V* is the best achievable over all policies.
Forgetting to condition on the policy when computing expectations (e.g., using max where ∑ₐ π(a|s) is required, or vice versa).
Mishandling terminal states: not setting V(terminal)=0 (or not modeling absorbing transitions), leading to incorrect recursions.
Treating γ as optional without consequences: changing γ changes the objective (time preference and effective horizon), not just numerical stability.
Consider an MDP with states {0,1}. From state 0, action a keeps you in 0 with prob 1 and reward 1 each step. From state 1, the episode terminates with reward 0. Suppose the policy keeps taking a in state 0 and you never reach state 1. With γ=0.9, compute V^π(0).
Hint: Write V^π(0) = 1 + γ V^π(0) and solve.
V^π(0) = 1 + 0.9 V^π(0)
V^π(0) − 0.9 V^π(0) = 1
0.1 V^π(0) = 1
V^π(0) = 10
Let an MDP have two states s₁,s₂ and one action per state (so it’s a Markov reward process). Transitions: s₁→s₂ with prob 1 and reward 0; s₂→s₂ with prob 1 and reward 2. With γ=0.5, compute V(s₁), V(s₂) using Bellman equations.
Hint: Solve V(s₂)=2+γV(s₂) first, then plug into V(s₁)=0+γV(s₂).
V(s₂)=2+0.5V(s₂)
V(s₂)−0.5V(s₂)=2
0.5V(s₂)=2
V(s₂)=4
V(s₁)=0+0.5V(s₂)=0.5·4=2
In a finite discounted MDP, define the Bellman optimality operator (TV)(s)=maxₐ ∑_{s′} P(s′|s,a)(R(s,a,s′)+γV(s′)). Show that if you have two value functions V and W, then |(TV)(s) − (TW)(s)| ≤ γ max_{s′}|V(s′)−W(s′)| for every state s.
Hint: Use that maxₐ fₐ − maxₐ gₐ ≤ maxₐ (fₐ−gₐ), then apply triangle inequality and the fact that probabilities sum to 1.
Let Δ = max_{x} |V(x)−W(x)|.
For a fixed s and action a define:
F_a = ∑_{s′} P(s′|s,a)(R(s,a,s′)+γV(s′))
G_a = ∑_{s′} P(s′|s,a)(R(s,a,s′)+γW(s′))
Then F_a−G_a = γ ∑_{s′} P(s′|s,a)(V(s′)−W(s′)).
So |F_a−G_a| ≤ γ ∑_{s′} P(s′|s,a) |V(s′)−W(s′)| ≤ γ ∑_{s′} P(s′|s,a) Δ = γΔ.
Now,
(TV)(s)=max_a F_a, (TW)(s)=max_a G_a.
Use the inequality |max_a F_a − max_a G_a| ≤ max_a |F_a−G_a|.
Therefore |(TV)(s)−(TW)(s)| ≤ max_a γΔ = γΔ = γ max_{s′}|V(s′)−W(s′)|.