
positional-encoding

Shows how a position index p is mapped to a positional encoding vector PE(p) and injected into token representations (via add or concat), so that order information survives parallel, order-agnostic processing; contrasts this absolute scheme with relative schemes that use offsets (p_i − p_j) as attention biases in an i×j attention grid.
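As a concrete sketch of the absolute scheme, the classic sinusoidal encoding maps p to interleaved sin/cos values at geometrically spaced frequencies and is added element-wise to the token embedding (the "add" variant). This pure-Python version is illustrative only; it is not the visualization's actual code:

```python
import math

def positional_encoding(p, d_model):
    """Sinusoidal PE(p): even dims use sin, odd dims use cos.
    PE(p, 2k)   = sin(p / 10000^(2k / d_model))
    PE(p, 2k+1) = cos(p / 10000^(2k / d_model))
    """
    pe = []
    for k in range(d_model // 2):
        freq = 1.0 / (10000 ** (2 * k / d_model))
        pe.append(math.sin(p * freq))
        pe.append(math.cos(p * freq))
    return pe

def inject(token_embedding, p):
    """Combine E(token) with PE(p) element-wise (the 'add' injection)."""
    pe = positional_encoding(p, len(token_embedding))
    return [e + v for e, v in zip(token_embedding, pe)]
```

Because the encoding is deterministic in p, every token at the same position receives the same PE vector regardless of its content, which is what lets parallel attention recover order.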


practical uses

  • 01. Enable Transformers to represent word order without recurrence or convolutions
  • 02. Improve long-context generalization (e.g., sinusoidal, rotary, or other structured encodings)
  • 03. Model relative relationships, such as distance-based attention biases, for tasks like retrieval, code, and time series
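A minimal sketch of the relative-bias idea from use 03: each cell (i, j) of the attention grid receives a bias computed from the offset Δ = p_i − p_j, so the bias depends only on distance, not absolute position. The linear decay function below is a hypothetical stand-in (schemes such as ALiBi use fixed per-head slopes, while others learn the bias per offset):

```python
def relative_bias_grid(n, bias_fn):
    """n×n grid of attention biases, one per offset delta = p_i - p_j."""
    return [[bias_fn(i - j) for j in range(n)] for i in range(n)]

# Hypothetical distance-decay bias: nearer token pairs get a larger
# (less negative) bias added to their attention logit.
def decay(delta):
    return -0.5 * abs(delta)

grid = relative_bias_grid(4, decay)
```

Because the grid is a function of Δ alone, it is symmetric under translation: shifting both positions by the same amount leaves every bias unchanged.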

technical notes

Three-panel loop (absolute → relative → integration) over ~3.6s. Left column renders token positions and vector bars for E(token), PE(p), and their combination; right column renders an attention matrix where cell intensity depends on |i−j| and a highlighted (i,j) shows Δ=p_i−p_j. All geometry is grid-snapped for a blocky aesthetic; animation is time-based using ease() and cycling indices.
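The timing logic described above can be sketched as follows. The cosine ease-in-out and the 8 px grid size are assumptions; the notes specify only that an ease() exists and that geometry is grid-snapped:

```python
import math

def ease(t):
    """Assumed cosine ease-in-out on t in [0, 1]; the demo's ease() is unspecified."""
    return 0.5 - 0.5 * math.cos(math.pi * t)

def frame_state(elapsed_s, period_s=3.6, n_panels=3):
    """Time-based loop: returns (active panel index, eased local progress).
    Panels cycle absolute -> relative -> integration over one ~3.6 s period."""
    t = (elapsed_s % period_s) / period_s   # loop progress in [0, 1)
    panel = int(t * n_panels)               # cycling panel index: 0, 1, 2
    local = ease((t * n_panels) % 1.0)      # eased progress within the panel
    return panel, local

def snap(x, grid=8):
    """Grid-snap a coordinate for the blocky aesthetic (grid size is a guess)."""
    return round(x / grid) * grid
```

Driving the animation from elapsed time rather than frame count keeps the loop's period stable regardless of the display's refresh rate.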