Visualizes the attention mask matrix M (1 = allow, 0 = block) and how causal masking (upper-triangular blocking of future key positions) and padding masking (blocking PAD key positions) combine. The animation cycles between mask types, then shows how the mask is integrated into attention: a large negative value is added to masked logits so the softmax assigns ~0 probability to blocked keys in the active query row.
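The mask construction described above can be sketched as follows, a minimal NumPy illustration (the sequence length and PAD positions are assumed for demonstration, not taken from the program):

```python
import numpy as np

Q = K = 6               # assumed query/key sequence length
pad_positions = [4, 5]  # assumed: last two key positions are PAD

# Causal mask: query i may attend only to keys j <= i
# (i.e. the strict upper triangle of future tokens is blocked).
causal = np.tril(np.ones((Q, K), dtype=int))

# Padding mask: block every PAD key position for all queries.
padding = np.ones((Q, K), dtype=int)
padding[:, pad_positions] = 0

# Combined mask: a key is allowed only if BOTH masks allow it.
M = causal & padding
print(M)
```

Row i of M then has ones only at non-PAD keys j <= i, which is exactly the pattern the left panel renders.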
Grid-snapped, green-on-black rendering. The left panel draws M as a blocky Q×K matrix with an animated active query row. The right panel synthesizes logits, applies the combined (causal AND padding) mask using a finite -8 stand-in for -inf, then computes a numerically stable softmax, showing probabilities collapse to ~0 on masked keys. The animation cycles every ~4.2 s.
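The right panel's pipeline, masked logits followed by a stable softmax, can be sketched like this (the logits, sizes, and PAD positions are synthetic stand-ins, mirroring the -8 finite mask value mentioned above):

```python
import numpy as np

NEG = -8.0  # finite stand-in for -inf, as used by the animation

rng = np.random.default_rng(0)
Q = K = 6
logits = rng.normal(size=(Q, K))  # synthesized attention logits

# Combined causal AND padding mask (keys 4, 5 assumed to be PAD).
causal = np.tril(np.ones((Q, K), dtype=bool))
padding = np.ones((Q, K), dtype=bool)
padding[:, [4, 5]] = False
M = causal & padding

# Replace blocked logits with the large negative value.
masked = np.where(M, logits, NEG)

# Numerically stable softmax: subtract the row max before exponentiating.
z = masked - masked.max(axis=-1, keepdims=True)
probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
```

Because exp(-8) is tiny relative to the allowed logits, each row's probability mass collapses onto the unmasked keys, which is the effect the panel animates per active query row.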