Direct policy optimization. REINFORCE, actor-critic.
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
Value-based RL learns “how good states/actions are,” then acts greedily. Policy gradient methods flip that: they directly learn “how to act” by adjusting a differentiable, stochastic policy πθ(a|s) to increase expected return—using gradients estimated from sampled trajectories.
Policy gradients optimize J(θ)=Eτ[∑γᵗrₜ] directly by ascending an unbiased gradient estimator: ∇θJ(θ)=E[∑∇θ log πθ(aₜ|sₜ)·(return/advantage)]. REINFORCE uses Monte Carlo returns (high variance). Actor-critic replaces returns with learned value baselines (lower variance) and uses advantages (A=Q−V), often with bootstrapping and GAE.
In an MDP, you ultimately care about behavior: which actions you take in each state. A policy is the object that produces that behavior. In policy gradient methods, the policy is parameterized and differentiable, so we can change it continuously and aim those changes toward higher return.
Instead of learning a value function first and deriving a policy from it, we optimize a policy directly:
A standard episodic objective is

J(θ) = E_{τ∼pθ(τ)}[ ∑ₜ γᵗ rₜ ],

where a trajectory (rollout) is

τ = (s₀, a₀, r₀, s₁, a₁, r₁, …),

and the trajectory distribution is induced by the environment dynamics and the policy:

pθ(τ) = ρ(s₀) ∏ₜ πθ(aₜ|sₜ) P(sₜ₊₁|sₜ, aₜ).
The key point: θ controls the probability of your actions, and that changes which states you visit and which rewards you obtain.
Typically, θ parameterizes a neural network that outputs either a categorical distribution over discrete actions or the parameters of a continuous action distribution.

Example (discrete): logits z = fθ(s), with πθ(a|s) = exp(zₐ)/∑ⱼ exp(zⱼ) (a softmax over actions).

Example (continuous, diagonal Gaussian): a mean μθ(s) and log-standard-deviation log σθ(s), with πθ(a|s) = N(a; μθ(s), diag(σθ(s)²)).
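A minimal NumPy sketch of these two parameterizations (function names are illustrative; in practice the logits, mean, and log-std would come from a neural network):

```python
import numpy as np

def softmax_policy(logits):
    """Discrete case: action probabilities from logits z = f_theta(s)."""
    z = logits - np.max(logits)  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def gaussian_log_prob(action, mean, log_std):
    """Continuous case: log-density of a diagonal Gaussian policy at `action`."""
    std = np.exp(log_std)
    return float(np.sum(-0.5 * ((action - mean) / std) ** 2
                        - log_std - 0.5 * np.log(2.0 * np.pi)))

probs = softmax_policy(np.array([2.0, 1.0, 0.1]))   # sums to 1, ordered like logits
lp = gaussian_log_prob(np.zeros(2), np.zeros(2), np.zeros(2))
```

Both representations are differentiable in their parameters, which is exactly what the score-function trick below requires.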
We then perform stochastic gradient ascent on J(θ): θ ← θ + α·ĝ, where ĝ is a sampled estimate of ∇θJ(θ).

In supervised learning, you get a target label. In RL, you get rewards after decisions. The policy gradient trick ties the final outcome back to earlier action probabilities via the score terms ∇θ log πθ(aₜ|sₜ).
Intuition to hold onto:
If an action led to better-than-expected outcomes, increase its probability in that state. If it led to worse-than-expected outcomes, decrease it.
Policy gradient methods operationalize that intuition with a precise gradient estimator.
To make this idea tangible, your canvas can show a tiny 2-state MDP and a 2-action policy.
Canvas panel A: “Policy sliders”
Canvas panel B: “Trajectory outcomes”
Canvas panel C: “Gradient arrows”
This directly externalizes the algebra: gradient = (score) × (signal).
We want ∇θJ(θ), but J(θ) is an expectation over trajectories whose distribution depends on θ. Differentiating “through” sampling is awkward because trajectories are discrete random objects.
The score-function (a.k.a. log-derivative) trick gives a way to move the gradient inside an expectation without differentiating the environment dynamics.
The identity to remember:

∇θ pθ(x) = pθ(x) · ∇θ log pθ(x).

This works whenever pθ(x) is differentiable in θ and the quantities involved are integrable (so gradient and integral can be exchanged).
Start from:

J(θ) = ∫ pθ(τ) R(τ) dτ,

where R(τ) = ∑ₜ γᵗ rₜ.

Differentiate:

∇θJ(θ) = ∫ ∇θ pθ(τ) · R(τ) dτ.

Use ∇θ pθ(τ) = pθ(τ) ∇θ log pθ(τ):

∇θJ(θ) = ∫ pθ(τ) ∇θ log pθ(τ) · R(τ) dτ.

Recognize the expectation:

∇θJ(θ) = E_{τ∼pθ}[ ∇θ log pθ(τ) · R(τ) ].
Now expand ∇θ log pθ(τ). From

pθ(τ) = ρ(s₀) ∏ₜ πθ(aₜ|sₜ) P(sₜ₊₁|sₜ, aₜ),

take logs:

log pθ(τ) = log ρ(s₀) + ∑ₜ [ log πθ(aₜ|sₜ) + log P(sₜ₊₁|sₜ, aₜ) ].

Differentiate w.r.t. θ. Only the policy terms depend on θ (environment dynamics are fixed):

∇θ log pθ(τ) = ∑ₜ ∇θ log πθ(aₜ|sₜ).

So:

∇θJ(θ) = E[ (∑ₜ ∇θ log πθ(aₜ|sₜ)) · R(τ) ].

This already yields an unbiased estimator: sample a trajectory, compute ∑ₜ ∇θ log πθ(aₜ|sₜ), and push the log-prob gradients in the direction determined by R(τ).
Using the same return for every time step credits early and late actions equally, even though late actions cannot affect early rewards.
A standard improvement is the reward-to-go:

Gₜ = ∑_{k=t}^{T−1} γ^{k−t} r_k.

Then the estimator becomes:

∇θJ(θ) = E[ ∑ₜ ∇θ log πθ(aₜ|sₜ) · Gₜ ].
This is still unbiased, but typically has lower variance.
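Reward-to-go can be computed in a single backward pass over a trajectory. A small sketch (NumPy, illustrative names):

```python
import numpy as np

def rewards_to_go(rewards, gamma):
    """G_t = sum_{k>=t} gamma^(k-t) * r_k, computed right-to-left."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # G_t = r_t + gamma * G_{t+1}
        G[t] = running
    return G

G = rewards_to_go([1.0, 0.0, 2.0], gamma=0.5)  # -> [1.5, 1.0, 2.0]
```

The backward recursion Gₜ = rₜ + γ·Gₜ₊₁ avoids the O(T²) cost of summing each tail separately.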
At a high level: sample a batch of trajectories with the current policy, compute the reward-to-go Gₜ at every step, and take a gradient step on the score-weighted sum.

A common minibatch form:

ĝ = (1/N) ∑ᵢ₌₁^N ∑ₜ ∇θ log πθ(aₜ⁽ⁱ⁾|sₜ⁽ⁱ⁾) · Gₜ⁽ⁱ⁾,   θ ← θ + α·ĝ.
For a softmax policy, you can interpret ∇_z log πθ(a|s) = 1[·=a] − πθ(·|s) as: raise the logit of the sampled action, lower every logit in proportion to its current probability.

Then multiplying by Gₜ decides direction: Gₜ > 0 makes the sampled action more likely; Gₜ < 0 makes it less likely.
Add a per-time-step breakdown showing, for each t, the score ∇θ log πθ(aₜ|sₜ) and the signal Gₜ that weights it.
Learners should see that the policy gradient update is not magic—it’s a weighted push on log-probability.
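A short numerical sketch of that softmax score, 1[k=a] − π(k|s) (illustrative NumPy): the gradient component is positive at the sampled action, negative everywhere else, and the components sum to zero.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_softmax(z, a):
    """d/dz_k log pi(a|s) = 1[k == a] - pi(k|s)."""
    onehot = np.zeros_like(z)
    onehot[a] = 1.0
    return onehot - softmax(z)

g = grad_log_softmax(np.array([1.0, 0.0, -1.0]), a=0)
# g[0] > 0 (sampled action pushed up), g[1], g[2] < 0, and g.sum() == 0
```

Multiplying g by a positive Gₜ moves probability toward action 0; a negative Gₜ moves it away.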
REINFORCE is unbiased, but its Monte Carlo returns can have enormous variance: a single R(τ) accumulates noise from every reward and every stochastic transition in the trajectory.
High variance means you need many trajectories (or tiny learning rates) to make stable progress.
The central theme of modern policy gradients is:
Keep the estimator (approximately) unbiased while reducing variance.
Key fact: for any function b(sₜ) that does not depend on aₜ,

E_{aₜ∼πθ(·|sₜ)}[ ∇θ log πθ(aₜ|sₜ) · b(sₜ) ] = 0.

So we can subtract b(sₜ) inside the gradient estimator without changing its expectation:

∇θJ(θ) = E[ ∑ₜ ∇θ log πθ(aₜ|sₜ) · (Gₜ − b(sₜ)) ].
This can drastically reduce variance when b(sₜ) approximates the “typical” return from sₜ.
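A quick Monte Carlo check of the variance claim, in an illustrative 1-state Bernoulli bandit with πθ(a=1)=p and reward r=a (NumPy sketch; the seed and sample count are arbitrary). Subtracting the baseline b = E[r] = p leaves the mean of the per-sample gradient unchanged but shrinks its spread:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.6                                    # pi_theta(a=1) for a sigmoid policy
n = 100_000
a = (rng.random(n) < p).astype(float)      # sampled actions
r = a                                      # reward 1 iff a = 1
score = np.where(a == 1.0, 1 - p, -p)      # d/dtheta log pi(a)

mean_plain = np.mean(score * r)            # both means estimate p*(1-p)
mean_base = np.mean(score * (r - p))
var_plain = np.var(score * r)              # without baseline
var_base = np.var(score * (r - p))         # with baseline b = E[r] = p
```

For p = 0.6 the exact variances are 0.0384 without the baseline and 0.0096 with it, a 4x reduction at identical expected gradient.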
Choose b(sₜ) = V^π(sₜ), where

V^π(s) = E_π[ Gₜ | sₜ = s ]

is the expected return from s under the current policy. Then Gₜ − V^π(sₜ) is an estimate of the advantage:

A^π(s, a) = Q^π(s, a) − V^π(s).
Advantage answers a very specific question:
Was this action better or worse than my policy’s average behavior in this state?
This is exactly the signal you want for improving a stochastic policy.
Actor-critic methods maintain two learned components: an actor (the policy πθ(a|s)) and a critic (a value estimate, e.g., Vφ(s)).
The critic provides a low-variance learning signal; the actor uses it to update the policy.
A common actor update uses an estimated advantage Âₜ:

θ ← θ + α ∑ₜ ∇θ log πθ(aₜ|sₜ) · Âₜ.

Instead of Monte Carlo Gₜ, we can use TD-style targets.

One-step TD error (for a value critic):

δₜ = rₜ + γ·Vφ(sₜ₊₁) − Vφ(sₜ).

A simple advantage estimate is Âₜ = δₜ.
This introduces some bias (because Vφ is approximate), but variance often drops dramatically and learning becomes faster.
GAE blends multi-step TD errors with an additional parameter λ that controls bias/variance.
Define TD residuals δₜ as above, then

Âₜ^{GAE(γ,λ)} = ∑_{l=0}^{∞} (γλ)ˡ δₜ₊ₗ.

λ=0 recovers the one-step estimate Âₜ=δₜ; λ=1 recovers the Monte Carlo return minus the value baseline.
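A sketch of the full computation for a finite episode, treating Vφ = 0 past the terminal state (illustrative NumPy; the backward recursion uses Âₜ = δₜ + γλ·Âₜ₊₁, which telescopes into the sum above):

```python
import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    """TD residuals delta_t = r_t + gamma*V(s_{t+1}) - V(s_t), with V = 0
    past the terminal state, then A_t = sum_{l>=0} (gamma*lam)^l delta_{t+l}."""
    T = len(rewards)
    deltas = np.array([
        rewards[t] + gamma * (values[t + 1] if t + 1 < T else 0.0) - values[t]
        for t in range(T)
    ])
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return deltas, adv
```

With λ=0 the advantages equal the one-step residuals; with λ=1 they accumulate the full discounted tail of residuals.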
If the critic is a value function Vφ(s), a typical squared-error loss is:

L(φ) = E[ (Vφ(sₜ) − targetₜ)² ],

where targetₜ might be the Monte Carlo return Gₜ or the bootstrapped target rₜ + γ·Vφ(sₜ₊₁).
| Method | Signal used in actor update | Bias | Variance | Typical use |
|---|---|---|---|---|
| REINFORCE | Gₜ (Monte Carlo return) | low (unbiased) | high | small problems, pedagogical baseline |
| REINFORCE + baseline | Gₜ − b(sₜ) | low | medium | still Monte Carlo but improved |
| Actor-critic (TD) | δₜ or learned Âₜ | medium | low | common practical choice |
| Actor-critic + GAE | Âₜ^{GAE(γ,λ)} | tunable | tunable | modern on-policy systems (e.g., PPO) |
To address the visualization weakness explicitly, build a panel that runs the same fixed policy for many rollouts and shows the distribution of gradient estimates.
Panel design
What learners should observe
Add a slider for λ (0→1) and animate the histogram tightening/loosening. That makes bias/variance tradeoff visible, not just stated.
Many practical deep RL systems used today (including those in RLHF pipelines) rely on three ideas: (1) policy gradients with advantage estimates, (2) learned value critics (often with GAE), and (3) mechanisms that keep each update close to the previous policy (clipping or KL penalties).
This node focuses on (1) and (2), which are foundational for PPO and RLHF.
A common structure (simplified): update the actor by ascending E[ log πθ(aₜ|sₜ) · Âₜ ] while regressing the critic toward value targets.

Equivalently, minimize the actor loss L(θ) = −E[ log πθ(aₜ|sₜ) · Âₜ ].
PPO is still a policy gradient method, but it modifies the objective so updates do not change the policy too abruptly.
A typical PPO objective uses the probability ratio

rₜ(θ) = πθ(aₜ|sₜ) / π_{θ_old}(aₜ|sₜ)

and then clips it to avoid overly large updates. Notice how everything here assumes you already understand: log-probability gradients, advantage estimates Âₜ, and why the size of a policy update matters.
That’s exactly why mastering policy gradients unlocks PPO and thus RLHF.
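A minimal sketch of PPO's clipped surrogate term for one sample (illustrative; real implementations batch this and add value and entropy losses):

```python
import numpy as np

def ppo_clip_term(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A),
    where r = exp(logp_new - logp_old) is the probability ratio."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * adv, clipped * adv)
```

With a positive advantage, the objective stops rewarding ratio growth beyond 1+ε; with a negative advantage, the min keeps the penalty unclipped, so the update never benefits from moving too far from the old policy.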
In RLHF, you often: treat the language model as the policy, sample completions as actions, score them with a learned reward model, and update the model with a PPO-style policy gradient (usually with a KL penalty toward a reference model).

Even if the “environment” is text generation and the “reward” comes from a reward model, the policy gradient core remains: increase the log-probability of actions with positive advantage, decrease it for actions with negative advantage.
Add a simple toggle in the canvas that switches each update between an unconstrained gradient step and a step whose size is limited by a KL budget.
Even if you don’t implement full clipping, just showing “unconstrained step” vs “KL-limited step” prepares learners for PPO’s motivation.
Score-function gradients, advantage estimation, and variance control are the conceptual and mathematical prerequisites for modern on-policy deep RL.
Consider a 1-state bandit with two actions a∈{0,1}. Reward is r=1 if a=1, and r=0 if a=0. Let the policy be Bernoulli with parameter p=σ(θ): πθ(a=1)=p, πθ(a=0)=1−p. There is one step per episode, so return R(τ)=r.
Goal: compute ∇θJ(θ) exactly, then match it to the REINFORCE estimator.
1) Write the expected return:
J(θ)=E[r]=P(a=1)·1 + P(a=0)·0 = p.
2) Differentiate J(θ)=p=σ(θ):
∇θJ(θ)=dp/dθ = σ(θ)(1−σ(θ)) = p(1−p).
3) Compute the REINFORCE gradient form:
∇θJ(θ)=E[∇θ log πθ(a) · r].
We will compute ∇θ log πθ(a) for each action.
4) For a=1:
log πθ(1)=log p.
∂/∂θ log p = (1/p)·dp/dθ = (1/p)·p(1−p)=1−p.
5) For a=0:
log πθ(0)=log(1−p).
∂/∂θ log(1−p) = (1/(1−p))·(−dp/dθ) = (1/(1−p))·(−p(1−p))=−p.
6) Take the expectation:
E[∇θ log πθ(a)·r]
= P(a=1)·(1−p)·1 + P(a=0)·(−p)·0
= p(1−p).
This matches the exact gradient in step (2).
7) Interpretation:
When a=1 is sampled, the reward is 1 and the gradient term (1−p)·1 > 0 pushes θ (and hence p) up. When a=0 is sampled, the reward is 0, so the term (−p)·0 = 0 leaves θ unchanged. On average the update moves p toward 1, the reward-maximizing direction.
Insight: This bandit shows the core mechanism cleanly: REINFORCE is an unbiased estimator of the true gradient, and ∇θ log πθ(a) tells you how to change θ to increase the probability of the sampled action.
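The match between the exact gradient and the REINFORCE estimator can also be checked by sampling (a sketch; seed and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
p = 1.0 / (1.0 + np.exp(-theta))           # p = sigma(theta)
exact = p * (1.0 - p)                      # exact gradient from step (2)

n = 200_000
a = (rng.random(n) < p).astype(float)      # a = 1 with probability p
r = a                                      # reward 1 iff a = 1
score = np.where(a == 1.0, 1.0 - p, -p)    # grad_theta log pi(a), steps (4)-(5)
estimate = np.mean(score * r)              # Monte Carlo REINFORCE estimate
```

With 200k samples the Monte Carlo mean sits well within 0.01 of the exact value p(1−p).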
We use the same 1-state Bernoulli bandit as Example 1, but now consider adding a constant baseline b (which is allowed since it does not depend on the action). Show that E[∇θ log πθ(a)·(r−b)] equals the original gradient, for any b.
1) Start with the baseline gradient estimator:
E[∇θ log πθ(a)·(r−b)] = E[∇θ log πθ(a)·r] − b·E[∇θ log πθ(a)].
2) We already computed E[∇θ log πθ(a)·r] = p(1−p).
3) Now compute E[∇θ log πθ(a)]:
E[∇θ log πθ(a)]
= P(a=1)·(1−p) + P(a=0)·(−p)
= p(1−p) + (1−p)(−p)
= p(1−p) − p(1−p)
= 0.
4) Therefore:
E[∇θ log πθ(a)·(r−b)] = p(1−p) − b·0 = p(1−p).
5) Variance intuition (qualitative):
Choosing b close to E[r]=p makes (r−b) smaller-magnitude on average, which shrinks the spread of sample gradient values, stabilizing learning.
Insight: Baselines work because the expected score function is zero: E[∇θ log πθ(a|s)]=0. You can subtract any action-independent term to reduce variance without biasing the gradient.
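Both facts can be verified by exact enumeration over the two actions, with no sampling (a small sketch following steps (1)-(4) above):

```python
p = 0.7  # pi_theta(a=1) in the Bernoulli bandit

# E[grad log pi(a)] = P(a=1)*(1-p) + P(a=0)*(-p)  -- should be exactly zero
score_mean = p * (1.0 - p) + (1.0 - p) * (-p)

def grad_with_baseline(b):
    """E[grad log pi(a) * (r - b)], enumerated over a = 1 (r=1) and a = 0 (r=0)."""
    return p * (1.0 - p) * (1.0 - b) + (1.0 - p) * (-p) * (0.0 - b)
```

Plugging in any b (even a wildly wrong one like b = 5) returns the same value p(1−p), exactly as the algebra predicts.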
Consider an episodic problem with two time steps t=0,1 and discount γ. You collect one trajectory (s0,a0,r0,s1,a1,r1,terminal). You have a value critic Vφ(s) that outputs V0=Vφ(s0) and V1=Vφ(s1). Construct the TD residuals δ0, δ1 and a simple advantage estimate, then write the actor update direction.
1) One-step TD residual at t=1 (terminal next state):
Because the episode ends after r1, we treat Vφ(s2)=0.
δ1 = r1 + γ·0 − V1 = r1 − V1.
2) One-step TD residual at t=0:
δ0 = r0 + γ V1 − V0.
3) Use δt as an advantage estimate:
Â1 = δ1,
Â0 = δ0.
4) Write the sampled policy gradient (one trajectory):
ĝ = ∇θ log πθ(a0|s0)·Â0 + ∇θ log πθ(a1|s1)·Â1.
5) Interpret signs:
If δₜ > 0, step t turned out better than the critic predicted, so the update increases log πθ(aₜ|sₜ); if δₜ < 0, it decreases it.
6) Critic update targets (one-step):
A standard critic regression would move V0 toward (r0 + γ V1) and move V1 toward r1.
Insight: Actor-critic turns long-horizon Monte Carlo credit assignment into local prediction errors (δt). The critic learns to predict returns; the actor uses the critic’s surprise as the learning signal.
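The arithmetic of this example with concrete (made-up) numbers:

```python
gamma = 0.9
r0, r1 = 1.0, 0.0        # sampled rewards
V0, V1 = 0.8, 0.5        # critic outputs V_phi(s0), V_phi(s1)

delta1 = r1 - V1                  # terminal: V_phi(s2) = 0
delta0 = r0 + gamma * V1 - V0     # one-step residual at t = 0
A0, A1 = delta0, delta1           # advantage estimates for the actor update
```

Here δ₀ = 1.0 + 0.45 − 0.8 = 0.65 > 0 (push up log πθ(a₀|s₀)) and δ₁ = −0.5 < 0 (push down log πθ(a₁|s₁)), matching the sign interpretation in step (5).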
Policy gradient methods optimize a differentiable stochastic policy πθ(a|s) directly, rather than deriving a policy from value estimates.
The objective is expected (discounted) return: J(θ)=Eτ[∑γᵗrₜ], where τ is sampled under πθ and the environment.
The score-function (log-derivative) trick yields an unbiased estimator: ∇θJ=E[∑∇θ log πθ(aₜ|sₜ)·(return/advantage)].
Reward-to-go Gₜ improves credit assignment by only attributing future rewards to an action at time t.
Baselines b(sₜ) do not change the expected gradient if they are action-independent, but can greatly reduce variance.
Using b(s)=V(s) leads to advantage learning: A(s,a)=Q(s,a)−V(s), which answers “better or worse than average?”
Actor-critic methods learn a critic (value function) to provide low-variance advantage estimates, often via TD residuals δₜ.
GAE provides a tunable bias/variance tradeoff for advantage estimation and is a core ingredient of modern on-policy methods like PPO.
Using a baseline that depends on the sampled action aₜ (this can bias the gradient unless handled carefully).
Forgetting discounting or mixing definitions of return (e.g., using undiscounted Gₜ with a discounted critic target).
Not detaching/stop-gradient through advantage targets when updating the actor (can cause unintended coupling and instability).
Confusing maximizing J(θ) with minimizing a loss: sign errors are common (e.g., descending when you meant to ascend).
Derive the reward-to-go policy gradient from the trajectory-level form:
∇θJ = E[ R(τ) ∑ₜ ∇θ log πθ(aₜ|sₜ) ].
Show the steps that justify replacing R(τ) with Gₜ inside the sum.
Hint: Condition on the history up to time t and use the fact that actions at time t cannot affect rewards before time t.
Start from E[∑ₜ ∇ log π(aₜ|sₜ) · R(τ)]. For a fixed t, decompose R(τ)= (∑_{k=0}^{t-1} γ^k r_k) + (γ^t ∑_{k=t}^{T-1} γ^{k-t} r_k). The first term depends only on rewards before t, which are independent of aₜ given the past; its expectation multiplied by ∇ log π(aₜ|sₜ) is zero (score-function property). The remaining term is proportional to the future return from t, i.e., reward-to-go. After adjusting for γ factors, you obtain E[∑ₜ ∇ log π(aₜ|sₜ) · Gₜ].
In a discrete-action softmax policy with logits z=fθ(s) and π(a|s)=exp(z_a)/∑_j exp(z_j), compute ∂/∂z_k log π(a|s).
Hint: Write log π(a|s)=z_a − log(∑_j exp(z_j)) and differentiate.
log π(a|s)=z_a − log(∑_j exp(z_j)). Then ∂/∂z_k log π(a|s)=1[k=a] − exp(z_k)/∑_j exp(z_j)=1[k=a] − π(k|s).
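The analytic derivative can be sanity-checked against central finite differences (an illustrative NumPy sketch):

```python
import numpy as np

def log_softmax_at(z, a):
    """log pi(a|s) = z_a - log(sum_j exp(z_j))."""
    return z[a] - np.log(np.sum(np.exp(z)))

z = np.array([0.5, -0.2, 1.3])
a = 2
eps = 1e-6

# Numerical gradient: perturb each logit z_k by +/- eps
numeric = np.array([
    (log_softmax_at(z + eps * np.eye(3)[k], a)
     - log_softmax_at(z - eps * np.eye(3)[k], a)) / (2 * eps)
    for k in range(3)
])

# Analytic gradient: 1[k == a] - pi(k|s)
analytic = np.eye(3)[a] - np.exp(z) / np.exp(z).sum()
```

The two gradients agree to finite-difference precision, confirming the identity from the exercise.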
GAE computation practice: given δ0=1.0, δ1=0.5, δ2=−0.2, γ=0.9, compute Â0 for λ=0 and for λ=1 (assume episode ends after t=2 so no further terms).
Hint: Use Â0=∑_{l=0}^{2} (γλ)^l δ_l.
For λ=0: Â0 = (γ·0)^0 δ0 + (γ·0)^1 δ1 + (γ·0)^2 δ2 = δ0 = 1.0. For λ=1: Â0=δ0 + (γ)^1 δ1 + (γ)^2 δ2 = 1.0 + 0.9·0.5 + 0.9^2·(−0.2) = 1.0 + 0.45 − 0.162 = 1.288.
Next nodes and related concepts:
Suggested prior/parallel reinforcement learning nodes (if present in your tech tree):