Policy Gradient Methods

Machine Learning · Difficulty: █████ · Depth: 9 · Unlocks: 1

Direct policy optimization. REINFORCE, actor-critic.


Core Concepts

  • Parameterized stochastic policy as a differentiable mapping (pi_theta(a|s)) that defines behavior and can be optimized
  • Objective is the expected (discounted) return J(theta) = E[sum of rewards under the policy]
  • Policy gradient theorem (score-function estimator): provides an unbiased, sample-based expression for the gradient of J(theta) enabling direct optimization

Key Symbols & Notation

  • π_θ(a|s) — parameterized stochastic policy
  • J(θ) — expected return objective

Essential Relationships

  • grad_theta J(theta) = E_{trajectories~pi_theta}[grad_theta log pi_theta(a|s) * (return or advantage)] (policy gradient theorem; forms the basis for REINFORCE and actor-critic)

Unlocks (1)

  • Advanced Learning Details


All Concepts (16)

  • parameterized stochastic policy: π_θ(a|s) - policy represented by parameters θ that outputs action distributions
  • policy objective J(θ): expected (discounted) return under π_θ treated as a function of θ
  • trajectory τ: sequence (s0,a0,r0,s1,a1,r1,...) sampled from π_θ
  • return G_t: the (discounted) sum of future rewards from time t used as a Monte Carlo target
  • score-function / likelihood-ratio estimator: using ∇_θ log p_θ(·) to move gradient inside expectation
  • policy gradient theorem: closed-form expectation expression for ∇_θ J(θ) in terms of π_θ and value/Q
  • REINFORCE: Monte Carlo policy-gradient algorithm that uses sampled returns as unbiased gradient estimates
  • baseline for variance reduction: any function b(s) subtracted from the return that does not bias the gradient
  • advantage function A^π(s,a) = Q^π(s,a) − V^π(s) used to center policy updates
  • actor-critic architecture: 'actor' updates π_θ, 'critic' learns a value (or Q) estimator to provide targets/advantages
  • TD error δ_t = r_t + γ V(s_{t+1}) − V(s_t) as a bootstrapped estimate usable by the actor
  • bootstrapping vs Monte Carlo tradeoff: bootstrapped (TD) estimates introduce bias but reduce variance; MC is unbiased but high variance
  • on-policy sampling requirement: gradients/estimators assume trajectories sampled from the current policy (π_θ)
  • variance–bias tradeoff in gradient estimation and the role of baselines/critics to manage it
  • entropy regularization (optional): adding policy entropy to the objective to encourage exploration
  • importance-sampling correction (off-policy): reweighting samples by π/μ to use data from a behavior policy (introduces high variance)

Teaching Strategy

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

Value-based RL learns “how good states/actions are,” then acts greedily. Policy gradient methods flip that: they directly learn “how to act” by adjusting a differentiable, stochastic policy πθ(a|s) to increase expected return—using gradients estimated from sampled trajectories.

TL;DR:

Policy gradients optimize J(θ)=Eτ[∑γᵗrₜ] directly by ascending an unbiased gradient estimator: ∇θJ(θ)=E[∑∇θ log πθ(aₜ|sₜ)·(return/advantage)]. REINFORCE uses Monte Carlo returns (high variance). Actor-critic replaces returns with learned value baselines (lower variance) and uses advantages (A=Q−V), often with bootstrapping and GAE.

What Is a Policy Gradient Method?

Why this family exists

In an MDP, you ultimately care about behavior: which actions you take in each state. A policy is the object that produces that behavior. In policy gradient methods, the policy is parameterized and differentiable, so we can change it continuously and aim those changes toward higher return.

Instead of learning a value function first and deriving a policy from it, we optimize a policy directly:

  • Policy: π_θ(a|s), a distribution over actions given a state.
  • Objective: expected discounted return under that policy.

A standard episodic objective is

J(θ) = E_{τ∼π_θ}[ ∑_{t=0}^{T−1} γ^t r_t ]

where a trajectory (rollout) is

τ = (s_0, a_0, r_0, s_1, a_1, r_1, …)

and the trajectory distribution is induced by the environment dynamics and the policy:

P_θ(τ) = ρ(s_0) ∏_{t=0}^{T−1} π_θ(a_t|s_t) P(s_{t+1}|s_t, a_t)

The key point: θ controls the probability of your actions, and that changes which states you visit and which rewards you obtain.

What “differentiable policy” means in practice

Typically, θ parameterizes a neural network that outputs either:

  • Discrete actions: logits → softmax → categorical distribution.
  • Continuous actions: mean (and maybe log-std) of a Gaussian.

Example (discrete):

π_θ(a|s) = softmax(f_θ(s))_a

Example (continuous, diagonal Gaussian):

π_θ(a|s) = N(a; μ_θ(s), diag(σ_θ(s)^2))
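
As a concrete (hedged) sketch of these two parameterizations — with the network f_θ(s) replaced by fixed numbers so the snippet is self-contained — sampling and log-probabilities look like:

```python
import math
import random

# Illustrative sketch (not a specific library's API): a categorical policy
# over logits, and a one-dimensional diagonal-Gaussian policy.

def categorical_sample(logits, rng):
    """Sample an action from softmax(logits); return (action, log pi(a|s))."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    u, acc = rng.random(), 0.0
    for a, p in enumerate(probs):
        acc += p
        if u < acc:
            return a, math.log(p)
    return len(probs) - 1, math.log(probs[-1])

def gaussian_log_prob(a, mu, sigma):
    """log N(a; mu, sigma^2) for a single action dimension."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (a - mu) ** 2 / (2 * sigma ** 2)

rng = random.Random(0)
action, logp = categorical_sample([0.5, -1.0, 2.0], rng)
print(action, logp)                      # a sampled action and its log pi(a|s)
print(gaussian_log_prob(0.3, 0.0, 1.0))  # log-density of a = 0.3 under N(0, 1)
```

In practice both quantities come out of the policy network's head; the point here is just that log π_θ(a|s) is an ordinary differentiable function of the network outputs.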

We then perform stochastic gradient ascent on J(θ):

θ ← θ + α ĝ, where ĝ is a sampled estimate of ∇_θ J(θ)

The conceptual leap: “credit assignment” through log-probability

In supervised learning, you get a target label. In RL, you get rewards after decisions. The policy gradient trick ties the final outcome back to earlier action probabilities via

  • how probable the action was (log π_θ(a|s)), and
  • how good the outcome was (returns or advantages).

Intuition to hold onto:

If an action led to better-than-expected outcomes, increase its probability in that state. If it led to worse-than-expected outcomes, decrease it.

Policy gradient methods operationalize that intuition with a precise gradient estimator.

Visualization plan (interactive canvas)

To make this idea tangible, your canvas can show a tiny 2-state MDP and a 2-action policy.

Canvas panel A: “Policy sliders”

  • Let θ be a single scalar controlling a Bernoulli policy:
  • π_θ(a=1|s) = σ(θ_s) for each state s.
  • Show two sliders (θ for state 0 and state 1).
  • As the learner drags θ, animate action probabilities (bar chart) shifting.

Canvas panel B: “Trajectory outcomes”

  • Sample rollouts under the current policy.
  • Show the returns G_t next to each time step.

Canvas panel C: “Gradient arrows”

  • For each visited (s_t, a_t), display ∇_θ log π_θ(a_t|s_t) as an arrow on θ.
  • Multiply the arrow by an advantage estimate and show the scaled update.

This directly externalizes the algebra: gradient = (score) × (signal).

Core Mechanic 1: The Policy Gradient Theorem (REINFORCE via the Score Function)

Why we need a special gradient identity

We want ∇_θ J(θ), but J is an expectation over trajectories whose distribution depends on θ. Differentiating “through” sampling is awkward because trajectories are discrete random objects.

The score-function (a.k.a. log-derivative) trick gives a way to move the gradient inside an expectation without differentiating the environment dynamics.

The identity to remember:

∇_θ E_{x∼p_θ}[f(x)] = E_{x∼p_θ}[ f(x) ∇_θ log p_θ(x) ]

This works whenever p_θ(x) is differentiable in θ and f(x) is integrable.
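
A quick numerical check of the identity on the smallest possible case — x ∼ Bernoulli(σ(θ)) with an arbitrary f; all names below are illustrative:

```python
import math

# Verify grad_theta E[f(x)] = E[f(x) * grad_theta log p_theta(x)]
# for x ~ Bernoulli(sigmoid(theta)). Both sides are computed exactly
# by summing over x in {0, 1}.

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def expected_f(theta, f):
    p = sigmoid(theta)
    return p * f(1) + (1 - p) * f(0)

def score_function_gradient(theta, f):
    p = sigmoid(theta)
    # grad_theta log p(x=1) = 1 - p;  grad_theta log p(x=0) = -p
    return p * f(1) * (1 - p) + (1 - p) * f(0) * (-p)

theta = 0.3
f = lambda x: 2.0 * x + 1.0   # some integrable f, chosen arbitrarily
eps = 1e-6
finite_diff = (expected_f(theta + eps, f) - expected_f(theta - eps, f)) / (2 * eps)
print(score_function_gradient(theta, f), finite_diff)  # the two agree
```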

Derivation (showing the work)

Start from:

J(θ) = ∑_τ P_θ(τ) R(τ)

where R(τ) = ∑_{t=0}^{T−1} γ^t r_t.

Differentiate:

∇_θ J(θ) = ∑_τ ∇_θ P_θ(τ) R(τ)

Use ∇P = P ∇log P:

∇_θ J(θ) = ∑_τ P_θ(τ) ∇_θ log P_θ(τ) R(τ)

Recognize the expectation:

∇_θ J(θ) = E_{τ∼π_θ}[ R(τ) ∇_θ log P_θ(τ) ]

Now expand log P_θ(τ). From

P_θ(τ) = ρ(s_0) ∏_{t=0}^{T−1} π_θ(a_t|s_t) P(s_{t+1}|s_t, a_t)

take logs:

log P_θ(τ) = log ρ(s_0) + ∑_{t=0}^{T−1} log π_θ(a_t|s_t) + ∑_{t=0}^{T−1} log P(s_{t+1}|s_t, a_t)

Differentiate w.r.t. θ. Only the policy terms depend on θ (environment dynamics are fixed):

∇_θ log P_θ(τ) = ∑_{t=0}^{T−1} ∇_θ log π_θ(a_t|s_t)

So:

∇_θ J(θ) = E_τ[ R(τ) ∑_{t=0}^{T−1} ∇_θ log π_θ(a_t|s_t) ]

This already yields an unbiased estimator: sample a trajectory, compute R(τ), and push the log-prob gradients in the direction of R(τ).

Reward-to-go (better credit assignment)

Using the same return R(τ) for every time step credits early and late actions equally, even though late actions cannot affect early rewards.

A standard improvement is the reward-to-go:

G_t = ∑_{k=t}^{T−1} γ^{k−t} r_k

Then the estimator becomes:

∇_θ J(θ) = E[ ∑_{t=0}^{T−1} ∇_θ log π_θ(a_t|s_t) G_t ]

This is still unbiased, but typically has lower variance.

REINFORCE algorithm (Monte Carlo policy gradient)

At a high level:

  1. Collect trajectories using the current π_θ.
  2. For each time step, compute G_t.
  3. Update θ by ascending the sampled gradient.

A common minibatch form:

ĝ = (1/N) ∑_{i=1}^{N} ∑_{t=0}^{T_i−1} ∇_θ log π_θ(a_t^{(i)}|s_t^{(i)}) G_t^{(i)}
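
A minimal runnable sketch of REINFORCE on the simplest case — a one-state, one-step bandit with a Bernoulli policy p = σ(θ) and reward 1 for action 1; the environment and hyperparameters are illustrative:

```python
import math
import random

# REINFORCE on a one-state, one-step bandit: pi_theta(a=1) = sigmoid(theta),
# reward 1 for action 1 and 0 otherwise. Episodes have length 1, so G_t = r.

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def reinforce_bandit(steps=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        p = sigmoid(theta)
        action = 1 if rng.random() < p else 0
        reward = 1.0 if action == 1 else 0.0
        # grad_theta log pi(action): (1 - p) for action 1, -p for action 0
        score = (1 - p) if action == 1 else -p
        theta += lr * score * reward   # ascend the sampled gradient
    return sigmoid(theta)

print(reinforce_bandit())  # probability of the rewarding action, near 1.0
```

Note the sparse-reward limitation visible even here: sampling action 0 yields reward 0 and therefore a zero update.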

The geometry of the update (what the gradient does)

For a softmax policy, you can interpret ∇_θ log π as:

  • increasing parameters that make the taken action more likely,
  • decreasing parameters that make competing actions likely.

Then multiplying by G_t decides the direction:

  • If G_t is large (good), increase the probability of those actions.
  • If G_t is small or negative (bad), decrease it.

Visualization: “Score × Return” microscope

Add a per-time-step breakdown:

  • Display log π_θ(a_t|s_t).
  • Show ∇_θ log π_θ(a_t|s_t) as a vector arrow (or a scalar bar if θ is 1D).
  • Multiply by G_t and animate the resulting parameter step.

Learners should see that the policy gradient update is not magic—it’s a weighted push on log-probability.

Core Mechanic 2: Variance Reduction — Baselines, Advantages, and Actor-Critic

Why REINFORCE struggles

REINFORCE is unbiased, but its Monte Carlo returns can have enormous variance:

  • randomness in the environment,
  • randomness in the policy,
  • long horizons with discounting.

High variance means you need many trajectories (or tiny learning rates) to make stable progress.

The central theme of modern policy gradients is:

Keep the estimator (approximately) unbiased while reducing variance.

Baselines: subtract something that doesn’t change the expectation

Key fact: for any function b(s_t) that does not depend on a_t,

E_{a_t∼π_θ(·|s_t)}[ ∇_θ log π_θ(a_t|s_t) b(s_t) ] = 0

So we can subtract b(s_t) inside the gradient estimator without changing its expectation:

∇_θ J(θ) = E[ ∑_t ∇_θ log π_θ(a_t|s_t) (G_t − b(s_t)) ]

This can drastically reduce variance when b(s_t) approximates the “typical” return from s_t.
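
A toy Monte Carlo check of this claim — a one-state bandit with Bernoulli policy p = σ(θ) and reward 1 for action 1 (a setup chosen purely for illustration): the baseline leaves the mean gradient unchanged while shrinking its spread.

```python
import math
import random

# Compare sampled gradient estimates with and without a baseline.
# True gradient here is p(1 - p); at theta = 0 that is 0.25.

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def grad_samples(theta, baseline, n=20000, seed=0):
    rng = random.Random(seed)
    p = sigmoid(theta)
    samples = []
    for _ in range(n):
        action = 1 if rng.random() < p else 0
        reward = 1.0 if action == 1 else 0.0
        score = (1 - p) if action == 1 else -p
        samples.append(score * (reward - baseline))
    return samples

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

theta = 0.0
for b in (0.0, sigmoid(theta)):       # no baseline vs b = E[r]
    m, v = mean_var(grad_samples(theta, b))
    print(f"baseline={b}: mean={m:.3f} var={v:.4f}")
# Both means sit near p(1-p) = 0.25; the variance drops with the baseline.
```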

The most useful baseline: the value function

Choose b(s_t) = V^π(s_t), where

V^π(s) = E[ ∑_{k=0}^{∞} γ^k r_{t+k} | s_t = s ]

Then G_t − V^π(s_t) is an estimate of the advantage:

A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t)

Advantage answers a very specific question:

Was this action better or worse than my policy’s average behavior in this state?

This is exactly the signal you want for improving a stochastic policy.

Actor-critic: two function approximators with different jobs

Actor-critic methods maintain:

  • Actor: the policy π_θ(a|s) (parameters θ)
  • Critic: a value-function estimate V_φ(s) or action-value Q_φ(s,a) (parameters φ)

The critic provides a low-variance learning signal; the actor uses it to update the policy.

A common actor update uses an estimated advantage Â_t:

∇_θ J(θ) ≈ E[ ∑_t ∇_θ log π_θ(a_t|s_t) Â_t ]

Bootstrapping: trading bias for lower variance

Instead of the Monte Carlo return G_t, we can use TD-style targets.

One-step TD error (for value critic):

δ_t = r_t + γ V_φ(s_{t+1}) − V_φ(s_t)

A simple advantage estimate is Â_t = δ_t.

This introduces some bias (because V_φ is approximate), but variance often drops dramatically and learning becomes faster.

Generalized Advantage Estimation (GAE)

GAE blends multi-step TD errors with an additional parameter λ that controls bias/variance.

Define the TD residuals δ_t as above; then

Â_t^{GAE(γ,λ)} = ∑_{l=0}^{∞} (γλ)^l δ_{t+l}

  • λ = 0: very low variance, more bias (pure one-step TD)
  • λ → 1: less bias, more variance (approaches the Monte Carlo advantage)
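
The infinite sum truncates at episode end and is usually computed in one backward pass; a hedged sketch (the helper name is illustrative, not from a particular library):

```python
# GAE: A_t = sum_l (gamma * lam)^l * delta_{t+l}, computed backwards
# via the recursion A_t = delta_t + gamma * lam * A_{t+1}, with A = 0
# past the end of the episode.

def gae_advantages(deltas, gamma, lam):
    adv = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

deltas = [1.0, 0.5, -0.2]
print(gae_advantages(deltas, 0.9, 0.0))  # lambda=0: just the one-step deltas
print(gae_advantages(deltas, 0.9, 1.0))  # lambda=1: discounted sums of deltas
```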

Critic learning objective

If the critic is a value function V_φ(s), a typical squared-error loss is:

L_V(φ) = E[ (V_φ(s_t) − V̂_t)^2 ]

where V̂_t might be:

  • the Monte Carlo return G_t
  • the TD target r_t + γ V_φ(s_{t+1})
  • the λ-return (related to GAE)
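
One gradient step on this squared-error loss, in the simplest tabular form (the learning rate and target value below are illustrative):

```python
# Gradient descent on 0.5 * (V - target)^2: the derivative w.r.t. V is
# (V - target), so each step moves V a fraction of the way to the target.

def critic_update(v_s, td_target, lr=0.1):
    return v_s - lr * (v_s - td_target)

v = 0.0
for _ in range(50):
    v = critic_update(v, td_target=1.0)
print(v)  # approaches the target 1.0
```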

Comparison table (variance and bias intuition)

| Method | Signal used in actor update | Bias | Variance | Typical use |
|---|---|---|---|---|
| REINFORCE | G_t | low (unbiased) | high | small problems, pedagogical baseline |
| REINFORCE + baseline | G_t − b(s_t) | low | medium | still Monte Carlo, but improved |
| Actor-critic (TD) | δ_t or learned A | medium | low | common practical choice |
| Actor-critic + GAE | Â_t^{GAE} | tunable | tunable | modern on-policy systems (e.g., PPO) |

Visualization: “Variance comparison panel”

To address the visualization weakness explicitly, build a panel that runs the same fixed policy for many rollouts and shows the distribution of gradient estimates.

Panel design

  • Fix θ for a small MDP.
  • Sample K trajectories (e.g., K=200) and compute gradient estimates for a chosen parameter component.
  • Plot three histograms (or violin plots):
  1. REINFORCE with G_t
  2. Baseline with G_t − V(s_t)
  3. GAE with a chosen λ

What learners should observe

  • REINFORCE histogram is wide (noisy updates).
  • Baseline narrows distribution around the same mean.
  • GAE can further narrow, depending on λ.

Add a slider for λ (0→1) and animate the histogram tightening/loosening. That makes bias/variance tradeoff visible, not just stated.

Application/Connection: From Vanilla Policy Gradient to Actor-Critic Systems (and Why This Unlocks RLHF/PPO)

Why actor-critic is the stepping stone to modern algorithms

Many practical deep RL systems used today (including those in RLHF pipelines) rely on three ideas:

  1. Policy gradient objective (optimize π directly)
  2. Advantage-based updates (baseline/value function)
  3. Stabilization constraints (trust regions, clipping, KL penalties)

This node focuses on (1) and (2), which are foundational for PPO and RLHF.

A canonical on-policy actor-critic training loop

A common structure (simplified):

  1. Collect T steps of experience with the current π_θ.
  2. Fit V_φ to predict returns (or λ-returns).
  3. Compute advantages Â_t (often via GAE).
  4. Update the actor by maximizing:

L_actor(θ) = E[ log π_θ(a_t|s_t) Â_t ]

Equivalently, minimize −L_actor.

  5. (Often) add an entropy bonus to encourage exploration:

L(θ) = L_actor(θ) + β E[ H(π_θ(·|s_t)) ]
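
The entropy term is straightforward to compute for a categorical policy; a small illustrative helper:

```python
import math

# Entropy H(pi) = -sum_a pi(a) log pi(a) for a categorical policy.
# It is maximal for the uniform distribution and near zero when the
# policy is almost deterministic -- which is why adding it as a bonus
# discourages premature collapse onto one action.

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # log 4, the maximum for 4 actions
print(entropy([0.97, 0.01, 0.01, 0.01]))  # near 0: almost deterministic
```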

Where PPO fits (preview-level connection)

PPO is still a policy gradient method, but it modifies the objective so updates do not change the policy too abruptly.

A typical PPO objective uses the probability ratio

ρ_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)

and then clips it to avoid overly large updates. Notice how everything here assumes you already understand:

  • π_θ(a|s) as a differentiable stochastic policy
  • advantage estimates Â_t
  • log-probability gradients (since ratios become differences of logs)
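
At preview level, the clipped surrogate can be sketched per sample (ε = 0.2 is a common but illustrative choice, not a fixed constant):

```python
# PPO-style clipped surrogate for one sample:
# min(rho * A, clip(rho, 1 - eps, 1 + eps) * A), a pessimistic bound
# that removes the incentive to push rho far from 1.

def clipped_surrogate(rho, adv, eps=0.2):
    clipped_rho = max(1.0 - eps, min(1.0 + eps, rho))
    return min(rho * adv, clipped_rho * adv)

# A large ratio with positive advantage is capped at (1 + eps) * A ...
print(clipped_surrogate(1.8, adv=1.0))   # 1.2 rather than 1.8
# ... while a small policy change passes through unchanged.
print(clipped_surrogate(1.05, adv=1.0))  # 1.05
```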

That’s exactly why mastering policy gradients unlocks PPO and thus RLHF.

RLHF connection (conceptual)

In RLHF, you often:

  • Train a reward model from human preferences.
  • Use an RL algorithm (commonly PPO) to optimize the language model policy to maximize that learned reward, subject to constraints (e.g., KL to a reference model).

Even if the “environment” is text generation and the “reward” comes from a reward model, the policy gradient core remains:

  • actions are tokens,
  • states are partial sequences,
  • ∇_θ log π_θ is computed by backprop through the transformer,
  • advantage estimation stabilizes learning.

Visualization: “Tiny MDP → PPO-like constraint” (bridge)

Add a simple toggle in the canvas:

  • Vanilla PG mode: Δθ ∝ ∇_θ log π · Â
  • Constrained mode (toy PPO): if the KL between old and new policy exceeds a threshold, scale down the update.

Even if you don’t implement full clipping, just showing “unconstrained step” vs “KL-limited step” prepares learners for PPO’s motivation.

Summary of what you should be able to do after this node

  • Write down the objective J(θ) and the trajectory distribution.
  • Derive/recognize the score-function policy gradient estimator.
  • Implement REINFORCE with reward-to-go.
  • Add a baseline and explain why it doesn’t bias the gradient.
  • Build an actor-critic update with TD/GAE advantages.

Those are the conceptual and mathematical prerequisites for modern on-policy deep RL.

Worked Examples (3)

Worked Example 1: REINFORCE on a Tiny 1-State Bandit (Exact Gradient vs Sample Estimate)

Consider a 1-state bandit with two actions a∈{0,1}. Reward is r=1 if a=1, and r=0 if a=0. Let the policy be Bernoulli with parameter p=σ(θ): πθ(a=1)=p, πθ(a=0)=1−p. There is one step per episode, so return R(τ)=r.

Goal: compute ∇θJ(θ) exactly, then match it to the REINFORCE estimator.

  1. Write the expected return:

    J(θ)=E[r]=P(a=1)·1 + P(a=0)·0 = p.

  2. Differentiate J(θ)=p=σ(θ):

    ∇θJ(θ)=dp/dθ = σ(θ)(1−σ(θ)) = p(1−p).

  3. Compute the REINFORCE gradient form:

    ∇θJ(θ)=E[∇θ log πθ(a) · r].

    We will compute ∇θ log πθ(a) for each action.

  4. For a=1:

    log πθ(1)=log p.

    ∂/∂θ log p = (1/p)·dp/dθ = (1/p)·p(1−p)=1−p.

  5. For a=0:

    log πθ(0)=log(1−p).

    ∂/∂θ log(1−p) = (1/(1−p))·(−dp/dθ) = (1/(1−p))·(−p(1−p))=−p.

  6. Take the expectation:

    E[∇θ log πθ(a)·r]

    = P(a=1)·(1−p)·1 + P(a=0)·(−p)·0

    = p(1−p).

    This matches the exact gradient in step (2).

  7. Interpretation:

    • If you sampled a=1 and got r=1, the update uses (1−p) > 0, increasing θ (and thus p).
    • If you sampled a=0 and got r=0, the update is zero (no learning signal), which is a limitation in sparse reward settings.

Insight: This bandit shows the core mechanism cleanly: REINFORCE is an unbiased estimator of the true gradient, and ∇θ log πθ(a) tells you how to change θ to increase the probability of the sampled action.

Worked Example 2: Baseline Does Not Change the Expected Gradient (But Reduces Variance)

We use the same 1-state Bernoulli bandit as Example 1, but now consider adding a constant baseline b (which is allowed since it does not depend on the action). Show that E[∇θ log πθ(a)·(r−b)] equals the original gradient, for any b.

  1. Start with the baseline gradient estimator:

    E[∇θ log πθ(a)·(r−b)] = E[∇θ log πθ(a)·r] − b·E[∇θ log πθ(a)].

  2. We already computed E[∇θ log πθ(a)·r] = p(1−p).

  3. Now compute E[∇θ log πθ(a)]:

    E[∇θ log πθ(a)]

    = P(a=1)·(1−p) + P(a=0)·(−p)

    = p(1−p) + (1−p)(−p)

    = p(1−p) − p(1−p)

    = 0.

  4. Therefore:

    E[∇θ log πθ(a)·(r−b)] = p(1−p) − b·0 = p(1−p).

  5. Variance intuition (qualitative):

    Choosing b close to E[r]=p makes (r−b) smaller-magnitude on average, which shrinks the spread of sample gradient values, stabilizing learning.

Insight: Baselines work because the expected score function is zero: E[∇θ log πθ(a|s)]=0. You can subtract any action-independent term to reduce variance without biasing the gradient.

Worked Example 3: Actor-Critic with TD Advantage on a Two-Step Episode

Consider an episodic problem with two time steps t=0,1 and discount γ. You collect one trajectory (s0,a0,r0,s1,a1,r1,terminal). You have a value critic Vφ(s) that outputs V0=Vφ(s0) and V1=Vφ(s1). Construct the TD residuals δ0, δ1 and a simple advantage estimate, then write the actor update direction.

  1. One-step TD residual at t=1 (terminal next state):

    Because the episode ends after r1, we treat Vφ(s2)=0.

    δ1 = r1 + γ·0 − V1 = r1 − V1.

  2. One-step TD residual at t=0:

    δ0 = r0 + γ V1 − V0.

  3. Use δt as an advantage estimate:

    Â1 = δ1,

    Â0 = δ0.

  4. Write the sampled policy gradient (one trajectory):

    ĝ = ∇θ log πθ(a0|s0)·Â0 + ∇θ log πθ(a1|s1)·Â1.

  5. Interpret signs:

    • If δ0>0, the outcome from s0 was better than V0 predicted, so increase log-prob of a0 in s0.
    • If δ1<0, the final reward was worse than V1 predicted, so decrease log-prob of a1 in s1.
  6. Critic update targets (one-step):

    A standard critic regression would move V0 toward (r0 + γ V1) and move V1 toward r1.

Insight: Actor-critic turns long-horizon Monte Carlo credit assignment into local prediction errors (δt). The critic learns to predict returns; the actor uses the critic’s surprise as the learning signal.
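
The arithmetic of this example can be run directly; the reward and critic values below are toy numbers chosen here, not given in the text:

```python
# Toy instantiation of the two-step actor-critic example:
# r0 = 0.0, r1 = 1.0, gamma = 0.9, critic guesses V0 = 0.2, V1 = 0.5
# (all values invented for illustration).

gamma = 0.9
r0, r1 = 0.0, 1.0
V0, V1 = 0.2, 0.5

delta1 = r1 + gamma * 0.0 - V1   # terminal next state, so V(s2) = 0
delta0 = r0 + gamma * V1 - V0

print(delta0, delta1)
# Both residuals are positive here, so both sampled actions get their
# log-probabilities pushed up.
```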

Key Takeaways

  • Policy gradient methods optimize a differentiable stochastic policy πθ(a|s) directly, rather than deriving a policy from value estimates.

  • The objective is expected (discounted) return: J(θ)=Eτ[∑γᵗrₜ], where τ is sampled under πθ and the environment.

  • The score-function (log-derivative) trick yields an unbiased estimator: ∇θJ=E[∑∇θ log πθ(aₜ|sₜ)·(return/advantage)].

  • Reward-to-go Gₜ improves credit assignment by only attributing future rewards to an action at time t.

  • Baselines b(sₜ) do not change the expected gradient if they are action-independent, but can greatly reduce variance.

  • Using b(s)=V(s) leads to advantage learning: A(s,a)=Q(s,a)−V(s), which answers “better or worse than average?”

  • Actor-critic methods learn a critic (value function) to provide low-variance advantage estimates, often via TD residuals δₜ.

  • GAE provides a tunable bias/variance tradeoff for advantage estimation and is a core ingredient of modern on-policy methods like PPO.

Common Mistakes

  • Using a baseline that depends on the sampled action aₜ (this can bias the gradient unless handled carefully).

  • Forgetting discounting or mixing definitions of return (e.g., using undiscounted Gₜ with a discounted critic target).

  • Not detaching/stop-gradient through advantage targets when updating the actor (can cause unintended coupling and instability).

  • Confusing maximizing J(θ) with minimizing a loss: sign errors are common (e.g., descending when you meant to ascend).

Practice

medium

Derive the reward-to-go policy gradient from the trajectory-level form:

∇θJ = E[ R(τ) ∑ₜ ∇θ log πθ(aₜ|sₜ) ].

Show the steps that justify replacing R(τ) with Gₜ inside the sum.

Hint: Condition on the history up to time t and use the fact that actions at time t cannot affect rewards before time t.

Solution:

Start from E[∑ₜ ∇ log π(aₜ|sₜ) · R(τ)]. For a fixed t, decompose R(τ)= (∑_{k=0}^{t-1} γ^k r_k) + (γ^t ∑_{k=t}^{T-1} γ^{k-t} r_k). The first term depends only on rewards before t, which are independent of aₜ given the past; its expectation multiplied by ∇ log π(aₜ|sₜ) is zero (score-function property). The remaining term is proportional to the future return from t, i.e., reward-to-go. After adjusting for γ factors, you obtain E[∑ₜ ∇ log π(aₜ|sₜ) · Gₜ].

easy

In a discrete-action softmax policy with logits z=fθ(s) and π(a|s)=exp(z_a)/∑_j exp(z_j), compute ∂/∂z_k log π(a|s).

Hint: Write log π(a|s)=z_a − log(∑_j exp(z_j)) and differentiate.

Solution:

log π(a|s)=z_a − log(∑_j exp(z_j)). Then ∂/∂z_k log π(a|s)=1[k=a] − exp(z_k)/∑_j exp(z_j)=1[k=a] − π(k|s).
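
This result can be checked numerically with finite differences (the logits are chosen arbitrarily):

```python
import math

# Numerical check of d/dz_k log pi(a|s) = 1[k = a] - pi(k|s).

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def log_pi(z, a):
    return math.log(softmax(z)[a])

z, a, eps = [0.5, -1.0, 2.0], 0, 1e-6
pi = softmax(z)
for k in range(len(z)):
    zp = list(z); zp[k] += eps
    zm = list(z); zm[k] -= eps
    finite_diff = (log_pi(zp, a) - log_pi(zm, a)) / (2 * eps)
    analytic = (1.0 if k == a else 0.0) - pi[k]
    print(k, round(finite_diff, 6), round(analytic, 6))  # the columns agree
```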

medium

GAE computation practice: given δ0=1.0, δ1=0.5, δ2=−0.2, γ=0.9, compute Â0 for λ=0 and for λ=1 (assume episode ends after t=2 so no further terms).

Hint: Use Â0=∑_{l=0}^{2} (γλ)^l δ_l.

Solution:

For λ=0: Â0 = (γ·0)^0 δ0 + (γ·0)^1 δ1 + (γ·0)^2 δ2 = δ0 = 1.0. For λ=1: Â0=δ0 + (γ)^1 δ1 + (γ)^2 δ2 = 1.0 + 0.9·0.5 + 0.9^2·(−0.2) = 1.0 + 0.45 − 0.162 = 1.288.

Connections

Next nodes and related concepts:

Suggested prior/parallel reinforcement learning nodes (if present in your tech tree):

Quality: A (4.0/5)