Visualizes a parameterized stochastic policy πθ(a|s) as action-probability bars for two states, then repeatedly samples short trajectories under πθ. A moving cursor highlights the current timestep's score-function factor ∇θ log πθ(a|s), weighted by the return minus a baseline V(s) (the advantage), illustrating the policy gradient theorem as a sample-based estimator whose updates improve the running objective J(θ).
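The score-function factor the cursor highlights has a closed form for a softmax policy: ∇ log π(a|s) with respect to the logits is onehot(a) − π(·|s). A minimal sketch (names illustrative, not from the animation's source) verifies this identity against a finite-difference estimate:

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over a 1-D logit vector
    z = np.exp(logits - logits.max())
    return z / z.sum()

def score(logits, a):
    """∇_logits log π(a|s) = onehot(a) − π(·|s) for a softmax policy."""
    g = -softmax(logits)
    g[a] += 1.0
    return g

# Central finite-difference check of d/dθ_i log π(a|s).
logits = np.array([0.2, -0.4])
a, eps = 0, 1e-6
fd = np.zeros(2)
for i in range(2):
    bump = np.zeros(2)
    bump[i] = eps
    fd[i] = (np.log(softmax(logits + bump)[a]) -
             np.log(softmax(logits - bump)[a])) / (2 * eps)
```

The identity is what makes the per-timestep update cheap: no autodiff is needed for a two-action softmax, just the sampled action and the current probabilities.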
Implements a tiny 2-state/2-action MDP and a toy softmax policy with persistent parameters (logits W) plus a learned baseline V(s). Every ~520 ms it samples a horizon-6 trajectory, computes discounted returns with γ = 0.95, forms advantages A = G − V(s), then applies a REINFORCE-style gradient step to W and an MSE-gradient step to V. Rendering uses pixel-snapped rectangles on a black background with GREEN/GREEN_DIM and eased cursor motion for smooth, educational animation.
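The learning loop described above can be sketched as follows. The transition table, reward table, learning rates, and episode count here are illustrative assumptions; only the structure (horizon-6 rollouts, γ = 0.95, advantages G − V, REINFORCE step on W, MSE step on V) comes from the description:

```python
import numpy as np

# Hypothetical 2-state / 2-action MDP (tables are assumptions, not the demo's).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] transition probabilities
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],                  # R[s, a] immediate reward
              [0.0, 1.0]])

rng = np.random.default_rng(0)
W = np.zeros((2, 2))   # policy logits W[s, a]
V = np.zeros(2)        # baseline value estimates V(s)
gamma = 0.95
alpha_pi, alpha_v = 0.1, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def sample_trajectory(horizon=6):
    s = rng.integers(2)
    traj = []
    for _ in range(horizon):
        a = rng.choice(2, p=softmax(W[s]))
        traj.append((s, a, R[s, a]))
        s = rng.choice(2, p=P[s, a])
    return traj

def update(traj):
    # Discounted returns G_t, computed backwards over the rollout.
    G, returns = 0.0, []
    for (_, _, r) in reversed(traj):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), G in zip(traj, returns):
        adv = G - V[s]                     # advantage estimate A = G − V(s)
        grad_logp = -softmax(W[s])
        grad_logp[a] += 1.0                # ∇_W log π(a|s) for softmax logits
        W[s] += alpha_pi * adv * grad_logp # REINFORCE-with-baseline step
        V[s] += alpha_v * adv              # gradient step on (G − V)² MSE

for _ in range(500):
    update(sample_trajectory())
```

Subtracting V(s) leaves the gradient estimator unbiased while reducing its variance, which is why the animation's bars visibly stabilize faster than plain REINFORCE would.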