Visualizes the RLHF pipeline as a looping 3-step process: (1) humans provide preference comparisons (A vs B) that train a reward model r_phi, (2) r_phi converts policy samples into scalar rewards used as the RL return, and (3) the policy pi_theta is updated (PPO-style) while a KL(pi_theta || pi_ref) term keeps it close to the pretrained reference policy pi_ref to reduce drift.
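The two learning steps in the loop can be sketched numerically. This is a minimal, hedged sketch: the function names, the Bradley-Terry form of the preference loss, and the `beta` penalty coefficient are common RLHF conventions assumed here, not read from the animation itself.

```python
import math

def preference_loss(r_a: float, r_b: float) -> float:
    """Step 1: Bradley-Terry loss for a human comparison 'A preferred over B'.
    Trains r_phi via -log sigmoid(r_phi(A) - r_phi(B))."""
    return math.log(1.0 + math.exp(-(r_a - r_b)))

def kl_penalized_reward(r: float, logp_theta: float, logp_ref: float,
                        beta: float = 0.1) -> float:
    """Steps 2-3: scalar reward for the PPO update, with a KL penalty that
    keeps pi_theta close to the reference policy pi_ref:
        r_phi(x, y) - beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    beta is an illustrative coefficient, not a value from the demo."""
    return r - beta * (logp_theta - logp_ref)
```

When the reward model assigns equal scores, the preference loss sits at `log 2`; as the policy drifts above the reference log-probability, the KL term subtracts from the reward, which is the drift-reduction mechanism the arrows in the animation depict.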
Pure Canvas2D with a time-based loop (~4.2 s). Blocky green-on-black UI with dashed flow arrows. A PPO-objective panel, plus a simple discrete-distribution overlay to illustrate drift and a KL meter driven by a simulated mean shift. Auto-cycling tooltips mimic interactivity without external input hooks.
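The distribution overlay and KL meter could be reproduced numerically along these lines. This is a hedged sketch, not the animation's actual code: the Gaussian binning helper, bin count, and mean-shift values are all assumptions chosen only to show how a mean shift drives KL(pi_theta || pi_ref) upward.

```python
import math

def discrete_kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions over a shared support.
    eps guards against log(0) on near-empty bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def gaussian_bins(mean, sigma=1.0, lo=-4.0, hi=4.0, n=9):
    """Hypothetical helper: discretize a Gaussian into n normalized bins,
    standing in for the overlay's reference / drifted distributions."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    w = [math.exp(-0.5 * ((x - mean) / sigma) ** 2) for x in xs]
    z = sum(w)
    return [wi / z for wi in w]

p_ref = gaussian_bins(0.0)       # reference policy pi_ref
p_theta = gaussian_bins(0.8)     # policy after a simulated mean shift
kl_now = discrete_kl(p_theta, p_ref)  # value the KL meter would display
```

Animating `mean` over the ~4.2 s loop and recomputing `discrete_kl` each frame yields a meter that rises monotonically with the shift, which matches the drift behavior the overlay is meant to illustrate.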