Visualizes one MDP decision epoch: from a highlighted state, an action is sampled from a (stochastic) policy π(a|s); the process transitions to successor states according to P(s'|s,a), collecting immediate rewards r; a Bellman backup then computes Q(s,a) = E[r] + γE[V(s')]. The final stage morphs from policy evaluation (an expectation over actions under π) to Bellman optimality (a max over actions).
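The backup and the evaluation-to-optimality morph can be sketched numerically; the toy MDP below (sizes, random transitions, and uniform policy) is illustrative, not the piece's actual data:

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions (all values are illustrative).
n_s, n_a = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a, s']: transition probs
R = rng.standard_normal((n_s, n_a))               # expected immediate reward r(s, a)
V = rng.standard_normal(n_s)                      # current value estimate V(s')
gamma = 0.9

# Bellman backup for every (s, a): Q(s,a) = E[r] + γ E[V(s')]
Q = R + gamma * P @ V  # P @ V sums P(s'|s,a) * V(s') over s'

# Policy evaluation: expectation over actions under a stochastic π(a|s)
pi = np.full((n_s, n_a), 1.0 / n_a)
V_pi = (pi * Q).sum(axis=1)

# Bellman optimality: replace the expectation with a max over actions
V_star = Q.max(axis=1)
greedy = Q.argmax(axis=1)  # the argmax action highlighted per state
```

Since a max dominates any convex combination, `V_star >= V_pi` holds state by state, which is exactly the gap the evaluation-to-optimality morph animates.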
Time-based 3.8 s loop with eased segments (decision → transition → backup → optimality). Blocky, grid-snapped drawing (4 px × scale) on black using GREEN/GREEN_DIM. γ is shown as a pulsing bar; transitions display P and r labels; the backup panel shows the expected terms and an animated target value; the optimality panel compares action values and highlights the argmax.
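The segmented timing could be driven by a small phase mapper; the segment boundaries and the cubic ease below are assumptions for illustration, not the piece's exact values:

```python
LOOP = 3.8  # total loop length in seconds

# Assumed segment boundaries (start, end) within the loop, in seconds.
SEGMENTS = [
    ("decision",   0.0, 1.0),
    ("transition", 1.0, 2.0),
    ("backup",     2.0, 3.0),
    ("optimality", 3.0, 3.8),
]

def ease_in_out(u: float) -> float:
    """Cubic smoothstep: slow at both ends, fast in the middle."""
    return 3.0 * u * u - 2.0 * u * u * u

def segment_progress(t: float) -> tuple[str, float]:
    """Map absolute time (seconds) to (segment name, eased progress in [0, 1])."""
    t %= LOOP
    for name, start, end in SEGMENTS:
        if t < end:
            return name, ease_in_out((t - start) / (end - start))
    return SEGMENTS[-1][0], 1.0
```

Because the ease is applied per segment, each stage starts and ends at rest, so the decision → transition → backup → optimality handoffs read as discrete beats within the continuous loop.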