Learning from interaction. States, actions, rewards, policies.
Multi-session curriculum: substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
Supervised learning learns from labeled examples. Reinforcement learning (RL) learns from consequences: an agent takes actions, the world responds, and the agent updates its behavior to get more reward over time. The feedback is delayed, noisy, and depends on your choices—so you must reason about sequences, not single predictions.
Reinforcement learning models interaction as an MDP: states s, actions a, transition probabilities P(s′ | s, a), and rewards. A policy π(a | s) chooses actions. The goal is to maximize expected return (long-run cumulative reward). Value functions Vπ(s) and Qπ(s, a) quantify “how good” states and actions are under a policy, enabling planning and learning via Bellman-style recursions.
Many problems are not “predict the right label once.” They are “make a sequence of choices where today’s choice changes tomorrow’s situation.”
Examples:
The core difficulty: you do not get direct labels for the best action. Instead, you get rewards that may be delayed and may depend on long-term consequences.
Reinforcement learning is the study of how an agent can learn a behavior (a policy) that maximizes expected long-run reward by interacting with an environment.
At each time step t, the agent observes a state sₜ, selects an action aₜ, and receives a reward rₜ₊₁ along with the next state sₜ₊₁.
Over time, the agent should choose actions that lead to higher total reward.
RL combines:
A helpful comparison:
| Setting | Data source | Output | Feedback timing | Main challenge |
|---|---|---|---|---|
| Supervised learning | fixed dataset (x, y) | predictor f(x) | immediate | generalization |
| Bandits | actions only, no state transitions | action selection | immediate-ish | exploration vs. exploitation |
| RL (MDP) | interactive, sequential | policy π(a \| s) | delayed | planning + exploration + credit assignment |
You already know Markov chains: P(s′ | s). RL generalizes this by adding actions: P(s′ | s, a). The agent’s policy π(a | s) effectively induces a Markov chain over states (because actions are chosen as a function of the current state).
You also know expected value. RL objectives are expectations over random trajectories:
So the core RL question becomes: Which policy π maximizes expected return?
If you want to reason about long-term consequences, you need a framework that says:
That framework is the Markov Decision Process (MDP).
An MDP is commonly specified by a tuple (S, A, P, R, γ): a set of states S, a set of actions A, transition probabilities P(s′ | s, a), a reward function R, and a discount factor γ ∈ [0, 1).
Even if the environment is deterministic, P(s′ | s, a) can encode that by putting probability 1 on the deterministic next state.
The Markov property says that the future depends on the past only through the present state:
P(sₜ₊₁ | s₀, a₀, …, sₜ, aₜ) = P(sₜ₊₁ | sₜ, aₜ)
Intuition: sₜ must contain all decision-relevant information from history.
This is not always true for raw observations (e.g., pixels). In practice, RL often deals with partial observability, but this intro node focuses on the MDP setting.
Rewards are a design choice (or a measurement) that encodes what we want. A reward is not the same as a goal:
Reward shaping can make learning easier but can also change the optimal behavior if done carelessly.
Immediate reward is not enough because actions can have long-term effects. We define the return from time t:
Gₜ = rₜ₊₁ + γ rₜ₊₂ + γ² rₜ₊₃ + … = ∑_{k=0}^{∞} γ^k rₜ₊₁₊k
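As a quick sanity check, a truncated version of this sum is easy to compute. A minimal sketch (the function name and list representation are illustrative):

```python
def discounted_return(rewards, gamma):
    """G_t for a finite reward sequence [r_{t+1}, r_{t+2}, ...]:
    the sum of gamma^k * r_{t+1+k} over k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1, 1, 1], 0.5))
```

For a constant reward r, the infinite sum converges to r / (1 − γ), which is one reason γ < 1 keeps infinite horizons manageable.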
Why discounting helps:
A trajectory is a random sequence:
s₀, a₀, r₁, s₁, a₁, r₂, s₂, …
Given a policy π(a | s) and dynamics P(s′ | s, a), the return Gₜ becomes a random variable. RL optimizes an expected return:
J(π) = Eπ[ G₀ ]
The expectation Eπ[·] is taken over both sources of randomness: the policy's action choices π(a | s) and the environment's transitions P(s′ | s, a).
With a fixed policy π, the probability of transitioning from state s to s′ becomes:
Pπ(s′ | s) = ∑_{a∈A} π(a | s) P(s′ | s, a)
This is exactly a Markov chain transition rule—so your Markov chain knowledge applies.
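The mixture rule is mechanical to compute. A minimal sketch, storing the policy and dynamics as dictionaries (an illustrative representation, not a standard API):

```python
def induced_chain(states, actions, policy, P):
    """P_pi(s'|s) = sum_a pi(a|s) * P(s'|s,a): averaging the controlled
    dynamics over the policy's action choices yields a plain Markov chain."""
    P_pi = {s: {s2: 0.0 for s2 in states} for s in states}
    for s in states:
        for a in actions:
            for s2, p in P.get((s, a), {}).items():
                P_pi[s][s2] += policy[s].get(a, 0.0) * p
    return P_pi

# Tiny illustrative example: from A, Left loops and Right moves to B;
# B is absorbing under both actions.
states, actions = ["A", "B"], ["Left", "Right"]
policy = {"A": {"Left": 0.5, "Right": 0.5}, "B": {"Right": 1.0}}
P = {("A", "Left"): {"A": 1.0}, ("A", "Right"): {"B": 1.0},
     ("B", "Left"): {"B": 1.0}, ("B", "Right"): {"B": 1.0}}
P_pi = induced_chain(states, actions, policy, P)
assert P_pi["A"] == {"A": 0.5, "B": 0.5}  # each row is a proper distribution
```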
A policy can be deterministic (a = π(s)) or stochastic (a ∼ π(a | s)).
Stochastic policies matter because:
Imagine a commute with two actions:
If this were a one-step problem, you’d compare expected reward:
E[Risky] = 0.5·3 + 0.5·0 = 1.5, so Risky is better.
But with state transitions and long horizons, the best choice depends on how today’s action changes tomorrow’s state distribution. That’s the RL leap from bandits to MDPs.
MDP thinking is a discipline:
The agent’s goal is global: maximize expected return over a long horizon.
But decision-making is local: at state s, you must pick an action now.
A value function bridges this gap by summarizing the long-term consequences of being in a state or taking an action.
Given a policy π:
State-value function (how good is state s under π?):
Vπ(s) = Eπ[ Gₜ | sₜ = s ]
Action-value function (how good is taking action a in s under π?):
Qπ(s, a) = Eπ[ Gₜ | sₜ = s, aₜ = a ]
These expectations are over future actions from π and transitions from P.
Start from the definition of return:
Gₜ = rₜ₊₁ + γ Gₜ₊₁
Now take expectations to relate values across time.
By definition:
Vπ(s) = Eπ[ Gₜ | sₜ = s ]
Substitute Gₜ = rₜ₊₁ + γ Gₜ₊₁:
Vπ(s)
= Eπ[ rₜ₊₁ + γ Gₜ₊₁ | sₜ = s ]
= Eπ[ rₜ₊₁ | sₜ = s ] + γ Eπ[ Gₜ₊₁ | sₜ = s ]
But Gₜ₊₁ depends on sₜ₊₁, and future actions follow π, so:
Eπ[ Gₜ₊₁ | sₜ = s ] = Eπ[ Vπ(sₜ₊₁) | sₜ = s ]
So:
Vπ(s) = Eπ[ rₜ₊₁ + γ Vπ(sₜ₊₁) | sₜ = s ]
Write the expectation explicitly over actions and next states:
Vπ(s) = ∑_{a∈A} π(a | s) ∑_{s′∈S} P(s′ | s, a) [ r(s, a, s′) + γ Vπ(s′) ]
This is a Bellman expectation equation: a self-consistency condition.
Similarly:
Qπ(s, a) = ∑_{s′} P(s′ | s, a) [ r(s, a, s′) + γ ∑_{a′} π(a′ | s′) Qπ(s′, a′) ]
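These self-consistency conditions double as an algorithm: use the right-hand side as an update rule and sweep until the values stop changing (iterative policy evaluation). A minimal tabular sketch on a small two-state MDP; the dict-based containers are an illustrative choice:

```python
def evaluate_policy(states, policy, P, R, gamma, sweeps=2000):
    """Repeatedly apply V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a) [r + gamma V(s')];
    for gamma < 1 the iterates converge to the fixed point V_pi."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: sum(pa * sum(p * (R[(s, a, s2)] + gamma * V[s2])
                             for s2, p in P[(s, a)].items())
                    for a, pa in policy[s].items())
             for s in states}
    return V

# Two states: from A, Left loops (reward 0) and Right moves to B (reward 1);
# from B the policy always picks Right, looping with reward 2.
states = ["A", "B"]
policy = {"A": {"Left": 0.5, "Right": 0.5}, "B": {"Right": 1.0}}
P = {("A", "Left"): {"A": 1.0}, ("A", "Right"): {"B": 1.0},
     ("B", "Right"): {"B": 1.0}}
R = {("A", "Left", "A"): 0.0, ("A", "Right", "B"): 1.0,
     ("B", "Right", "B"): 2.0}
V = evaluate_policy(states, policy, P, R, gamma=0.9)  # V["B"] -> 20, V["A"] -> ~17.27
```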
Interpretation:
Often you care about which action is better than the policy’s average behavior.
Define the advantage:
Aπ(s, a) = Qπ(s, a) − Vπ(s)
This will matter later when learning policies.
RL is usually about finding an optimal policy π*.
Define optimal state value:
V*(s) = max_{π} Vπ(s)
and optimal action value:
Q*(s, a) = max_{π} Qπ(s, a)
These satisfy Bellman optimality equations (you’ll go deeper in the MDP node you unlock):
V*(s) = max_{a} ∑_{s′} P(s′ | s, a) [ r(s, a, s′) + γ V*(s′) ]
and
Q*(s, a) = ∑_{s′} P(s′ | s, a) [ r(s, a, s′) + γ max_{a′} Q*(s′, a′) ]
Important pacing point: you do not need to solve these yet; you need to recognize what they mean:
A subtle but powerful view: Bellman equations define a mapping (an operator) whose fixed point is Vπ.
Why that matters later:
You can already see the need: if π is deterministic and suboptimal, you might never see better actions. Stochasticity and off-policy ideas help address that.
Once you have MDP + value functions, there are two broad approaches.
| Approach | What you assume you know | Main operation | Typical algorithms |
|---|---|---|---|
| Planning (model-based) | P(s′ \| s, a) and reward model | compute values/policies by sweeps | value iteration, policy iteration |
| Learning (model-free) | only samples (s, a, r, s′) | estimate values/policies from experience | TD learning, Q-learning, policy gradients |
This node is an intro, so focus on the shape of the problems.
If you always take the best-known action, you may miss a better one.
If you explore too much, you waste reward.
This tension is sharper in MDPs because:
A policy π(a | s) that is stochastic can encode exploration directly.
Suppose reward arrives only at the end of an episode (e.g., win/loss). Which earlier actions deserve credit?
Value functions address this by propagating reward information backward through the Bellman structure:
Some environments have terminal states (games), others run forever (control systems).
In episodic tasks with terminal state s_T, it’s common to define V(s_T) = 0 (no future reward after terminal).
In small MDPs, you can store V(s) in a table.
But real problems have huge state spaces. Then we approximate:
Even though this node is conceptual, it helps to see why vectors matter:
This intro has emphasized what the objects are:
The next node will go deeper on:
Think of this node as learning the language and the “units of thought” you’ll use later.
States: S = {A, B}. Actions: in each state you may choose Left or Right.
Discount: γ = 0.9.
Transitions and rewards (all deterministic): from A, Left stays in A with reward 0 and Right moves to B with reward 1; from B, Right stays in B with reward 2.
Policy π is stochastic: at A it chooses Left or Right with probability 0.5 each; at B it always chooses Right.
Compute Vπ(A) and Vπ(B).
Write the Bellman expectation equations (transitions are deterministic, so each action a leads to a single next state s′):
Vπ(A) = ∑_{a} π(a | A) [ r(A,a) + γ Vπ(s′) ]
Vπ(B) = ∑_{a} π(a | B) [ r(B,a) + γ Vπ(s′) ]
Expand Vπ(A).
At A, π chooses Left (stay in A, reward 0) or Right (move to B, reward 1) with probability 0.5 each.
So:
Vπ(A)
= 0.5 [ 0 + 0.9 Vπ(A) ] + 0.5 [ 1 + 0.9 Vπ(B) ]
Expand Vπ(B).
At B policy always chooses Right.
So:
Vπ(B) = 1 · [ 2 + 0.9 Vπ(B) ]
Solve for Vπ(B):
Vπ(B) = 2 + 0.9 Vπ(B)
Vπ(B) − 0.9 Vπ(B) = 2
0.1 Vπ(B) = 2
Vπ(B) = 20
Plug Vπ(B)=20 into Vπ(A):
Vπ(A) = 0.5(0.9 Vπ(A)) + 0.5(1 + 0.9·20)
Vπ(A) = 0.45 Vπ(A) + 0.5(1 + 18)
Vπ(A) = 0.45 Vπ(A) + 0.5·19
Vπ(A) = 0.45 Vπ(A) + 9.5
Solve for Vπ(A):
Vπ(A) − 0.45 Vπ(A) = 9.5
0.55 Vπ(A) = 9.5
Vπ(A) = 9.5 / 0.55 = 17.2727… ≈ 17.27
Insight: This is policy evaluation: once π is fixed, the MDP becomes a Markov reward process, and Vπ is the fixed point of a linear Bellman system. Notice how the high self-loop reward at B (reward 2 forever) dominates both values through discounting.
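These values are easy to double-check numerically by plugging them back into their Bellman equations (a quick verification sketch):

```python
gamma = 0.9
V_B = 2 / (1 - gamma)                              # from 0.1 V(B) = 2  =>  20
V_A = 0.5 * (1 + gamma * V_B) / (1 - 0.5 * gamma)  # from 0.55 V(A) = 9.5

# Both must satisfy their Bellman expectation equations.
assert abs(V_B - (2 + gamma * V_B)) < 1e-9
assert abs(V_A - (0.5 * gamma * V_A + 0.5 * (1 + gamma * V_B))) < 1e-9
print(round(V_A, 2), round(V_B, 2))  # 17.27 20.0
```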
Use the MDP and Vπ values from the previous example (Vπ(A) ≈ 17.27, Vπ(B)=20). Compute:
1) Qπ(A, Left), Qπ(A, Right)
2) Aπ(A, Left), Aπ(A, Right)
Assume rewards depend only on (s, a) here: r(A,Left)=0, r(A,Right)=1, r(B,Left)=0, r(B,Right)=2.
Use the definition:
Qπ(s, a) = ∑_{s′} P(s′ | s, a) [ r(s,a) + γ Vπ(s′) ]
Because transitions are deterministic here, the sum over s′ collapses to the single next state.
Compute Qπ(A, Left).
Left from A → A with reward 0.
Qπ(A, Left) = 0 + 0.9 Vπ(A)
= 0.9 · 17.2727…
= 15.5454… ≈ 15.55
Compute Qπ(A, Right).
Right from A → B with reward 1.
Qπ(A, Right) = 1 + 0.9 Vπ(B)
= 1 + 0.9 · 20
= 1 + 18
= 19
Check consistency with Vπ(A).
Vπ(A) should equal the policy-weighted average:
Vπ(A) = 0.5 Qπ(A, Left) + 0.5 Qπ(A, Right)
= 0.5(15.5454…) + 0.5(19)
= 7.7727… + 9.5
= 17.2727… (matches).
Compute advantages:
Aπ(A, Left) = Qπ(A, Left) − Vπ(A)
= 15.5454… − 17.2727…
= −1.7273… ≈ −1.73
Aπ(A, Right) = Qπ(A, Right) − Vπ(A)
= 19 − 17.2727…
= 1.7273… ≈ 1.73
Insight: Qπ separates action quality within a state. Vπ is the policy’s average outcome; advantage tells you which actions are better or worse than what π typically does. Policy improvement methods often push probability mass toward positive-advantage actions.
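These numbers can be verified in a few lines; transitions are deterministic, so each Q-value is just the reward plus γ times the value of the single next state:

```python
gamma = 0.9
V_A, V_B = 9.5 / 0.55, 20.0   # values from the policy-evaluation example

Q_A_left = 0 + gamma * V_A    # Left: A -> A, reward 0
Q_A_right = 1 + gamma * V_B   # Right: A -> B, reward 1

# V(A) is the policy-weighted average of the Q-values under pi.
assert abs(V_A - (0.5 * Q_A_left + 0.5 * Q_A_right)) < 1e-9

adv_left = Q_A_left - V_A
adv_right = Q_A_right - V_A
print(round(Q_A_left, 2), Q_A_right, round(adv_left, 2), round(adv_right, 2))
# 15.55 19.0 -1.73 1.73
```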
Consider a 3-state MDP with states S = {0, 1, 2} and actions A = {a, b}.
Policy:
Transitions:
Compute Pπ(s′ | s).
Use the mixture rule:
Pπ(s′ | s) = ∑_{action∈{a,b}} π(action | s) P(s′ | s, action)
From s=0:
So:
Pπ(1 | 0)=0.2, Pπ(2 | 0)=0.8, Pπ(0 | 0)=0
From s=1:
So:
Pπ(2 | 1)=0.6, Pπ(0 | 1)=0.4, Pπ(1 | 1)=0
From s=2:
π(a|2)=1 so always take a, next is 2.
So:
Pπ(2 | 2)=1, Pπ(0 | 2)=0, Pπ(1 | 2)=0
Write the induced transition matrix with state order (0,1,2):
Pπ =
[ Pπ(0|0) Pπ(1|0) Pπ(2|0)
Pπ(0|1) Pπ(1|1) Pπ(2|1)
Pπ(0|2) Pπ(1|2) Pπ(2|2) ]
=
[ 0 0.2 0.8
0.4 0 0.6
0 0 1 ]
Insight: Fixing π collapses control into randomness: the MDP becomes a Markov chain over states with transition matrix Pπ. This is the key bridge from Markov chains to RL—actions disappear into the policy probabilities.
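The mixture rule can be applied mechanically in code. One per-action assignment consistent with the induced matrix above (the exact policy/transition split is an assumption for illustration):

```python
states = [0, 1, 2]
# Assumed per-action details, chosen to reproduce the stated P_pi:
policy = {0: {"a": 0.2, "b": 0.8}, 1: {"a": 0.6, "b": 0.4}, 2: {"a": 1.0}}
next_state = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 2, (1, "b"): 0, (2, "a"): 2}

P_pi = [[0.0, 0.0, 0.0] for _ in states]
for s in states:
    for a, prob in policy[s].items():
        P_pi[s][next_state[(s, a)]] += prob  # mixture rule, deterministic P

for row in P_pi:
    assert abs(sum(row) - 1.0) < 1e-12  # each row is a valid distribution
print(P_pi)
```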
RL is learning to choose actions to maximize expected return from interaction, not from labeled examples.
An MDP formalizes sequential decision-making with states s, actions a, transitions P(s′ | s, a), rewards, and discount γ.
A policy π(a | s) can be deterministic or stochastic; with fixed π, the state dynamics become an induced Markov chain Pπ(s′ | s).
The discounted return is Gₜ = ∑_{k=0}^{∞} γ^k rₜ₊₁₊k, capturing delayed consequences and making infinite horizons manageable.
Value functions Vπ(s) and Qπ(s, a) measure long-run desirability under π and satisfy Bellman expectation equations.
Bellman equations encode the core recursion: value = (reward now) + γ·(value later), averaged over policy and dynamics.
Advantage Aπ(s, a) = Qπ(s, a) − Vπ(s) compares actions within a state and foreshadows policy improvement.
Planning assumes you know P and rewards; model-free learning estimates values/policies from sampled transitions.
Confusing the reward rₜ with the return Gₜ: RL optimizes expected return, not immediate reward alone.
Treating observations as states without checking the Markov property (missing hidden information breaks MDP assumptions).
Forgetting that expectations are over both the environment randomness P(s′ | s, a) and the policy randomness π(a | s).
Mixing up Vπ(s) and Qπ(s, a): V is state quality under π; Q is action quality in a state under π.
A one-state continuing task has a single state s and two actions a₁, a₂. Taking a₁ gives reward 1 and returns to s. Taking a₂ gives reward 2 and returns to s. Discount γ = 0.5. (i) Compute Vπ(s) if π chooses a₁ with prob 0.7 and a₂ with prob 0.3. (ii) Compute Qπ(s, a₁) and Qπ(s, a₂).
Hint: With one state, Vπ(s) must satisfy V = E[r] + γV. And Q(s,a) = r(a) + γV.
Compute expected immediate reward:
E[r] = 0.7·1 + 0.3·2 = 0.7 + 0.6 = 1.3
Solve V = 1.3 + 0.5V ⇒ 0.5V = 1.3 ⇒ V = 2.6.
Then:
Qπ(s,a₁) = 1 + 0.5·2.6 = 1 + 1.3 = 2.3
Qπ(s,a₂) = 2 + 0.5·2.6 = 2 + 1.3 = 3.3
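A quick numerical check of this solution, including the self-consistency between V and the policy-weighted Q-values:

```python
gamma = 0.5
pi = {"a1": 0.7, "a2": 0.3}
r = {"a1": 1.0, "a2": 2.0}

expected_r = sum(pi[a] * r[a] for a in pi)   # E[r] = 1.3
V = expected_r / (1 - gamma)                 # V = E[r] + gamma*V  =>  V = 2.6
Q = {a: r[a] + gamma * V for a in pi}        # Q(s, a) = r(a) + gamma*V

assert abs(V - 2.6) < 1e-9
assert abs(V - sum(pi[a] * Q[a] for a in pi)) < 1e-9  # V is the policy average of Q
print(round(V, 2), round(Q["a1"], 2), round(Q["a2"], 2))  # 2.6 2.3 3.3
```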
Consider an episodic MDP with terminal state T where Vπ(T)=0. From state S, action Go deterministically transitions to T with reward 5. Action Stay deterministically transitions to S with reward 1. Discount γ=0.9. (i) Compute Qπ(S,Go) and Qπ(S,Stay) in terms of Vπ(S). (ii) If π always chooses Stay, solve for Vπ(S).
Hint: Use Qπ(s,a) = r + γVπ(s′). If π always Stay, then Vπ(S)=Qπ(S,Stay).
(i)
Go: next T, so Qπ(S,Go) = 5 + 0.9·Vπ(T) = 5 + 0 = 5.
Stay: next S, so Qπ(S,Stay) = 1 + 0.9·Vπ(S).
(ii) If π always chooses Stay, then Vπ(S) = Qπ(S,Stay) = 1 + 0.9Vπ(S).
So Vπ(S) − 0.9Vπ(S) = 1 ⇒ 0.1Vπ(S)=1 ⇒ Vπ(S)=10.
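The same check works for the episodic case, where Vπ(T) = 0 terminates the recursion:

```python
gamma = 0.9
V_T = 0.0                 # terminal state: no future reward
V_S = 1 / (1 - gamma)     # always-Stay: V(S) = 1 + 0.9 V(S)  =>  V(S) = 10

Q_go = 5 + gamma * V_T    # Go reaches the terminal state
Q_stay = 1 + gamma * V_S  # Stay loops back to S

assert Q_go == 5.0
assert abs(V_S - 10.0) < 1e-9
assert abs(Q_stay - V_S) < 1e-9  # deterministic Stay policy: V(S) = Q(S, Stay)
```

Note that under this always-Stay policy, Qπ(S, Go) = 5 < Vπ(S) = 10, so Go has negative advantage even though it pays the larger immediate reward.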
You are given a policy π and dynamics P(s′ | s, a) for a finite MDP. Describe how to compute the induced Markov chain transition matrix Pπ(s′ | s). Then explain (in 2–4 sentences) why this shows that policy evaluation is closely related to Markov chains.
Hint: Pπ is a policy-weighted mixture of action-conditioned transitions. Under a fixed π, actions are random variables determined by s.
To compute Pπ(s′ | s), combine transitions across actions using the policy probabilities:
Pπ(s′ | s) = ∑_{a∈A} π(a | s) P(s′ | s, a).
This defines a Markov chain over states because the next-state distribution depends only on the current state s (via the mixture), not on earlier history. Therefore, once π is fixed, state visitation and long-run behavior can be analyzed with Markov chain tools, and value functions become expectations over trajectories of that induced chain with rewards.
Unlocks: Markov Decision Processes
Related next steps: