Visualizes a small Markov Decision Process (MDP) where an agent repeatedly samples actions from a stochastic policy pi(a|s), transitions to next states via P(s'|s,a), receives rewards, and incrementally updates value estimates V^pi(s) and action-values Q^pi(s,a). The animation shows the policy probabilities per state, the stochastic transition, reward signals, and the value backup target r + γ V(s').
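The sampling-and-backup loop described above can be sketched as follows. This is a minimal illustration, not the actual visualization code: the state/action layout, the reward placement, and the constants GAMMA, ALPHA, and SLIP are assumptions chosen for the example.

```javascript
// Illustrative TD(0) loop for a 4-state line-world MDP (constants are assumptions).
const GAMMA = 0.9, ALPHA = 0.1, SLIP = 0.2;
const N = 4;                                       // states 0..3 in a line
const V = new Array(N).fill(0);                    // V^pi(s)
const Q = Array.from({length: N}, () => [0, 0]);   // Q^pi(s,a), a: 0=left, 1=right

// Stochastic policy pi(a|s): here, slightly prefers moving right in every state.
function sampleAction(s) {
  return Math.random() < 0.7 ? 1 : 0;
}

// Transition P(s'|s,a): intended direction, but with probability SLIP the agent
// slips the other way. Position is clamped to the line's ends.
function step(s, a) {
  const dir = Math.random() < SLIP ? 1 - a : a;
  const s2 = Math.min(N - 1, Math.max(0, s + (dir === 1 ? 1 : -1)));
  const r = s2 === N - 1 ? 1 : 0;                  // reward on entering the rightmost state
  return {s2, r};
}

// One agent step: sample an action, transition, then move V and Q
// incrementally toward the backup target r + GAMMA * V(s').
function tdStep(s) {
  const a = sampleAction(s);
  const {s2, r} = step(s, a);
  const target = r + GAMMA * V[s2];
  V[s] += ALPHA * (target - V[s]);
  Q[s][a] += ALPHA * (target - Q[s][a]);
  return s2;
}
```

Run in a loop, value estimates propagate leftward from the rewarding state, which is the effect the animation renders each cycle.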
Self-contained Canvas2D draw function with persistent closure state. Uses a 4-state line-world MDP, stochastic transitions (intended vs slip), and simple TD(0)-style incremental updates for V and Q. Blocky 4px-snapped layout, green-on-black palette, and a ~0.9s step animation cycle with easing.
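A skeleton of the draw-function structure this describes might look like the following. The helper names (makeDraw, snap4, easeInOut, STEP_MS) and the specific easing curve are illustrative assumptions; only the pattern (persistent closure state, 4px snapping, a ~0.9s eased cycle, green-on-black fills) comes from the description.

```javascript
// 4px grid snapping for the blocky layout (assumed helper).
const snap4 = v => Math.round(v / 4) * 4;

// A standard ease-in-out curve mapping 0..1 to 0..1 (assumed choice of easing).
const easeInOut = t => (t < 0.5 ? 2 * t * t : 1 - Math.pow(-2 * t + 2, 2) / 2);

// Factory returning a draw function with persistent closure state,
// so no globals are needed between animation frames.
function makeDraw() {
  const STEP_MS = 900;            // ~0.9s per animation step
  let stepStart = 0;              // persistent: start time of the current cycle

  return function draw(ctx, tMs) {
    if (tMs - stepStart >= STEP_MS) stepStart = tMs;   // advance to the next cycle
    const t = easeInOut((tMs - stepStart) / STEP_MS);  // eased progress in [0, 1]

    ctx.fillStyle = '#000';                            // green-on-black palette
    ctx.fillRect(0, 0, ctx.canvas.width, ctx.canvas.height);
    ctx.fillStyle = '#0f0';
    // Example element: an agent block easing between snapped positions.
    ctx.fillRect(snap4(20 + t * 100), snap4(40), 16, 16);
  };
}
```

Keeping all mutable animation state inside the closure is what makes the function self-contained: the caller only supplies a 2D context and a timestamp each frame.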