Visualizes one MDP decision epoch: from a highlighted state, an action is sampled from a (stochastic) policy π(a|s); the process transitions to successor states according to P(s'|s,a), collecting immediate rewards r; a Bellman backup then computes Q(s,a) = E[r] + γE[V(s')]. The final stage morphs from policy evaluation (an expectation over actions under π) to Bellman optimality (a max over actions).
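The backup and the evaluation-to-optimality morph can be sketched numerically; the toy MDP below (sizes, random transitions, and uniform policy) is illustrative, not the piece's actual data:

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions (all values are illustrative).
n_s, n_a = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a, s']: transition probs
R = rng.standard_normal((n_s, n_a))               # expected immediate reward r(s, a)
V = rng.standard_normal(n_s)                      # current value estimate V(s')
gamma = 0.9

# Bellman backup for every (s, a): Q(s,a) = E[r] + γ E[V(s')]
Q = R + gamma * P @ V  # P @ V sums P(s'|s,a) * V(s') over s'

# Policy evaluation: expectation over actions under a stochastic π(a|s)
pi = np.full((n_s, n_a), 1.0 / n_a)
V_pi = (pi * Q).sum(axis=1)

# Bellman optimality: replace the expectation with a max over actions
V_star = Q.max(axis=1)
greedy = Q.argmax(axis=1)  # the argmax action highlighted per state
```

Since a max dominates any convex combination, `V_star >= V_pi` holds state by state, which is exactly the gap the evaluation-to-optimality morph animates.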
Time-based 3.8 s loop with eased segments (decision → transition → backup → optimality). Blocky, grid-snapped drawing (4 px × scale) on black using GREEN/GREEN_DIM. γ is shown as a pulsing bar; transitions display P and r labels; the backup panel shows the expected terms and an animated target value; the optimality panel compares action values and highlights the argmax.
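The segmented timing could be driven by a small phase mapper; the segment boundaries and the cubic ease below are assumptions for illustration, not the piece's exact values:

```python
LOOP = 3.8  # total loop length in seconds

# Assumed segment boundaries (start, end) within the loop, in seconds.
SEGMENTS = [
    ("decision",   0.0, 1.0),
    ("transition", 1.0, 2.0),
    ("backup",     2.0, 3.0),
    ("optimality", 3.0, 3.8),
]

def ease_in_out(u: float) -> float:
    """Cubic smoothstep: slow at both ends, fast in the middle."""
    return 3.0 * u * u - 2.0 * u * u * u

def segment_progress(t: float) -> tuple[str, float]:
    """Map absolute time (seconds) to (segment name, eased progress in [0, 1])."""
    t %= LOOP
    for name, start, end in SEGMENTS:
        if t < end:
            return name, ease_in_out((t - start) / (end - start))
    return SEGMENTS[-1][0], 1.0
```

Because the ease is applied per segment, each stage starts and ends at rest, so the decision → transition → backup → optimality handoffs read as discrete beats within the continuous loop.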