Techniques for injecting sequence order information into token embeddings (e.g., sinusoidal or learned encodings) so models that process tokens in parallel can sense relative or absolute position.
Deep-dive lesson - accessible entry point but dense material. Use worked examples and spaced repetition.
A Transformer can read every token at once. That’s a superpower—but it also means the model has no built-in sense of “first”, “next”, or “last”. Positional encoding is the trick that gives parallel token processing a usable notion of order.
Positional encoding maps each position index p to a vector PE(p). You combine PE(p) with a token embedding so the model can use order. Absolute encodings represent “this token is at position p”. Relative encodings represent “token i is k steps away from token j”, often implemented as attention biases. Sinusoidal encodings are deterministic and extrapolate; learned encodings can fit data but may not generalize to longer sequences unless designed carefully.
Many sequence models (like RNNs) process tokens in order, so time is baked into the computation: token 7 can only be seen after token 6. Transformers, in contrast, are designed to process a sequence in parallel. Self-attention compares tokens to each other without inherently knowing which token came earlier.
If you hand a Transformer the embeddings for “the cat sat” and “sat cat the”, the multiset of token embeddings might look similar. Without extra information, the model has no explicit “slot index” to tell it which token was first.
Positional encoding fixes this by injecting order information into the representation.
Let p be a position index (typically p ∈ {0, 1, …, L−1} for a sequence length L). A positional encoding is a mapping
PE: {0, 1, …, L−1} → ℝᵈ, p ↦ PE(p),
where d is the model’s embedding dimension.
Each token also has an embedding (from a learned embedding table): E(tokenₚ) ∈ ℝᵈ.
To create the model input at position p, we combine them, often by simple addition:
xₚ = E(tokenₚ) + PE(p)
Here xₚ, E(tokenₚ), and PE(p) are vectors in ℝᵈ. (We’ll write vectors in bold: x, v.)
Positional encoding should make position usable for attention and downstream layers: after integration, a token’s representation must depend on where it appears, in a form later layers can read.
There are two main ways to represent order:
- Absolute: each position p gets its own representation (“this token is at position p”).
- Relative: positions enter only through offsets (“token j is Δ = j−i steps from token i”).
Both can work, but they lead to different inductive biases and different generalization behavior.
There are two common integration mechanisms:
- Input space: add PE(p) to the token embedding before the first layer.
- Attention space: add position-dependent terms (e.g., a bias b(j−i)) inside the attention score.
A useful mental model: absolute encoding stamps each token with its seat number, while relative encoding measures the gap between two seats.
Absolute positional encodings make position a feature of the token representation. If position p has a unique “signature” vector PE(p), then the model can learn rules like “the token at position 0 often starts the sequence” or “treat the final position specially”.
The simplest integration is addition:
xₚ = E(tokenₚ) + PE(p)
Addition is popular because it preserves dimension d and is parameter-free at the integration step.
The most direct approach is to learn a table P ∈ ℝ^(Lmax×d), where Lmax is the maximum length during training, and set PE(p) = P[p].
Pros:
- Flexible: the position vectors are free parameters, so they can fit the training length distribution well.
Cons:
- Tied to Lmax: positions beyond Lmax have no trained vector, so generalization to longer sequences is weak without extra tricks.
- Adds Lmax×d parameters.
Sinusoidal PE is deterministic and defined by sines and cosines of different frequencies. For embedding dimension d (assume d is even), define for p ≥ 0 and i ∈ {0, 1, …, d/2 − 1}:
PE(p)[2i] = sin(p / 10000^(2i/d))
PE(p)[2i+1] = cos(p / 10000^(2i/d))
This produces a length-d vector with alternating sine/cosine pairs.
The key intuition is multi-scale periodic structure: small i gives high-frequency components that change quickly from one position to the next, while large i gives low-frequency components that change slowly across the sequence.
The encoding gives each position a unique pattern across many frequencies.
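These formulas can be sketched in a few lines of pure Python (a minimal helper for intuition, not a production implementation):

```python
import math

def sinusoidal_pe(p, d):
    """Classic sinusoidal positional encoding for position p and even dimension d."""
    pe = [0.0] * d
    for i in range(d // 2):
        omega = 1.0 / (10000 ** (2 * i / d))  # frequency for pair i
        pe[2 * i] = math.sin(p * omega)       # even index: sine
        pe[2 * i + 1] = math.cos(p * omega)   # odd index: cosine
    return pe

print(sinusoidal_pe(0, 4))  # → [0.0, 1.0, 0.0, 1.0]
```

For d = 4 this reproduces the worked example below: PE(0) = [0, 1, 0, 1].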
A major reason sinusoids were chosen is that relative position relationships can be expressed via trig identities.
Let ωᵢ = 1 / 10000^(2i/d). Then the angle-addition identities give:
sin((p+Δ)ωᵢ) = sin(pωᵢ)cos(Δωᵢ) + cos(pωᵢ)sin(Δωᵢ)
cos((p+Δ)ωᵢ) = cos(pωᵢ)cos(Δωᵢ) − sin(pωᵢ)sin(Δωᵢ)
So the pair [sin(pωᵢ), cos(pωᵢ)] can be transformed to [sin((p+Δ)ωᵢ), cos((p+Δ)ωᵢ)] by a 2×2 rotation matrix depending only on Δ.
That matters because attention compares vectors via dot products. If your representation contains sinusoidal components, the model can learn to detect or use relative offsets.
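A quick numeric check of this rotation property (the values of ω, p, and Δ here are arbitrary; any choice works):

```python
import math

def rotate_pair(s, c, d_theta):
    """Rotate the pair [sin(θ), cos(θ)] into [sin(θ+Δθ), cos(θ+Δθ)] via angle addition."""
    return (s * math.cos(d_theta) + c * math.sin(d_theta),
            c * math.cos(d_theta) - s * math.sin(d_theta))

omega, p, delta = 0.1, 3, 5
s2, c2 = rotate_pair(math.sin(p * omega), math.cos(p * omega), delta * omega)
# s2 ≈ sin((p + delta) * omega), c2 ≈ cos((p + delta) * omega)
```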
| Method | Parameters | Extrapolates beyond Lmax? | Typical integration | Notes |
|---|---|---|---|---|
| Learned absolute embedding | Yes (Lmax×d) | No (without hacks) | Add to token embedding | Strong fit to training length distribution |
| Sinusoidal | No | Often yes | Add to token embedding | Deterministic; good default; classic Transformer |
In many implementations, token embeddings are scaled by √d before adding PE:
xₚ = √d · E(tokenₚ) + PE(p)
Why? Early in training, learned embeddings may have small variance; scaling can help balance magnitudes so positional info doesn’t dominate (or vanish).
Also, dropout is often applied to the sum:
xₚ = Dropout(√d · E(tokenₚ) + PE(p))
This regularizes reliance on any single dimension or exact positional signature.
Absolute encodings label positions, but attention often cares about relative patterns: “attend to the previous token”, “look a few steps back”, “weight nearby tokens more than distant ones”.
Absolute encodings can still learn these behaviors, but it can be less direct, especially when you want translation-invariant rules (“the same rule anywhere in the sequence”). That motivates relative positional mechanisms.
Self-attention’s core operation for a query position i and key position j is a similarity score:
score(i, j) = qᵢᵀkⱼ / √dₖ
This has no explicit awareness of distance or direction (left vs right). Relative positional encodings inject information about (j−i) directly into the scoring.
A common pattern:
score(i, j) = qᵢᵀkⱼ / √dₖ + b(j−i)
where b(Δ) is a learned scalar bias for offset Δ (possibly clipped to a range).
Instead of a scalar bias, we can use a vector embedding R[Δ] ∈ ℝᵈ and incorporate it into attention. One family of formulations modifies the keys (and/or values); writing r_{j−i} = R[j−i], the key version is:
score(i, j) = (qᵢ)ᵀ(kⱼ + r_{j−i})
Expand this:
(qᵢ)ᵀ(kⱼ + r_{j−i}) = (qᵢ)ᵀkⱼ + (qᵢ)ᵀr_{j−i}
So the score becomes content similarity plus an offset-dependent term.
Relative encoding bakes in a helpful inductive bias: the same offset-based rule (e.g., “look one step left”) applies at every absolute position, i.e., translation invariance.
It also often improves generalization to longer sequences because offsets (like Δ = ±1, ±2, …) are length-agnostic.
In practice, you can’t store embeddings for every possible offset if sequences are long. Typical strategies:
- Clip offsets to a maximum range k, so all offsets beyond ±k share the boundary entry.
- Bucket offsets, e.g., exact entries for small |Δ| and coarser shared bins for large |Δ|.
So, with clipping:
b(Δ) = b(clip(Δ, −k, k))
This trades exact long-range distance for manageable parameters.
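A minimal sketch of the clipping strategy (the bias values here are made up for illustration):

```python
def relative_bias(delta, table, k):
    """Scalar bias for offset delta; offsets beyond ±k reuse the boundary entries."""
    clipped = max(-k, min(k, delta))
    return table[clipped + k]  # table holds 2k + 1 entries for offsets -k..+k

# hypothetical learned biases for offsets -2, -1, 0, +1, +2
table = [0.1, 0.5, 0.0, -0.5, -0.1]
print(relative_bias(-7, table, 2))  # → 0.1 (clipped to offset -2)
```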
Another widely used modern method is to apply a position-dependent rotation to queries and keys so that dot products encode relative offsets.
High-level idea: split each query/key vector into 2D pairs and rotate each pair by an angle proportional to the token’s position.
For one 2D pair, if u = [u₀, u₁]ᵀ, define the rotation matrix:
R(θ) = [[cos θ, −sin θ],
        [sin θ, cos θ]]
Then u′ = R(θ)u.
If queries and keys are rotated by their positions, the dot product between rotated vectors can become a function of relative angle differences (hence relative position). This tends to generalize well and is popular in LLMs.
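A pure-Python sketch of this idea (not an optimized rotary implementation). The check at the end relies on the property that a dot product of two rotated pairs depends only on the angle difference, so scores depend only on the offset:

```python
import math

def rope(vec, p):
    """Rotate consecutive 2D pairs of vec by position-dependent angles p * ω_i."""
    d, out = len(vec), list(vec)
    for i in range(d // 2):
        theta = p / (10000 ** (2 * i / d))
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x0 * math.cos(theta) - x1 * math.sin(theta)
        out[2 * i + 1] = x0 * math.sin(theta) + x1 * math.cos(theta)
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
# Positions (5, 7) and (0, 2) have the same offset, so the scores agree.
s1 = dot(rope(q, 5), rope(k, 7))
s2 = dot(rope(q, 0), rope(k, 2))
```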
| Aspect | Absolute PE | Relative PE |
|---|---|---|
| Represents | “Token is at p” | “Token j is Δ away from i” |
| Typical implementation | Add PE(p) to embeddings | Add bias b(Δ) to attention scores or modify q/k |
| Inductive bias | Position-specific behaviors | Translation-invariant, distance-aware behaviors |
| Long-length generalization | Sinusoidal: decent; Learned: weak | Often strong (offsets/buckets/RoPE) |
You can think of positional information entering the model in two places: the input (add PE(p) to token embeddings) and the attention computation (bias or modify scores based on j−i).
Many strong architectures use both (or a more sophisticated variant) because input position and attention distance can each be useful.
A simplified Transformer input pipeline looks like: token ids → embedding lookup → scale by √d → add PE(p) → dropout → first layer.
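A minimal sketch of this input pipeline (sinusoidal PE with √d scaling; dropout omitted; the vocabulary and embedding values are hypothetical):

```python
import math

def sin_pe(p, d):
    """Sinusoidal positional encoding for position p, even dimension d."""
    pe = [0.0] * d
    for i in range(d // 2):
        omega = 1.0 / (10000 ** (2 * i / d))
        pe[2 * i], pe[2 * i + 1] = math.sin(p * omega), math.cos(p * omega)
    return pe

def embed_sequence(token_ids, emb_table):
    """ids → √d-scaled embedding lookup + sinusoidal PE (dropout step omitted)."""
    d = len(next(iter(emb_table.values())))
    return [
        [math.sqrt(d) * emb_table[t][j] + sin_pe(p, d)[j] for j in range(d)]
        for p, t in enumerate(token_ids)
    ]

# hypothetical 2-word vocabulary with d = 4
emb = {"cat": [0.1, 0.0, 0.0, 0.0], "sat": [0.0, 0.1, 0.0, 0.0]}
xs = embed_sequence(["cat", "sat"], emb)
```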
Inside each attention layer (single head for simplicity), each input xₚ is projected to a query, key, and value: qₚ = W_Q xₚ, kₚ = W_K xₚ, vₚ = W_V xₚ.
Attention weights:
αᵢⱼ = softmaxⱼ(score(i, j))
With basic attention:
score(i, j) = qᵢᵀkⱼ / √dₖ
With relative bias:
score(i, j) = qᵢᵀkⱼ / √dₖ + b(j−i)
Output:
outᵢ = ∑ⱼ αᵢⱼ vⱼ
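Putting the attention steps together in a small pure-Python sketch (single head, optional relative bias; all vectors and bias values are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(q, keys, values, i, bias=None):
    """Attention output for query position i; bias(delta) adds a scalar for offset j - i."""
    dk = len(q)
    scores = [
        sum(qa * ka for qa, ka in zip(q, k)) / math.sqrt(dk)
        + (bias(j - i) if bias else 0.0)
        for j, k in enumerate(keys)
    ]
    alphas = softmax(scores)
    dim = len(values[0])
    return [sum(a * v[m] for a, v in zip(alphas, values)) for m in range(dim)]

# identical keys → uniform weights without bias; a "look left" bias shifts them
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out_plain = attend([1.0, 0.0], keys, values, i=1)
out_left = attend([1.0, 0.0], keys, values, i=1, bias=lambda d: 0.5 if d == -1 else 0.0)
```

With identical keys, the plain output is just the mean of the values; the left-preferring bias pulls the output toward the value at position 0.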
Once the model has position information, it can learn local rules (attend to adjacent tokens), directional rules (look left vs right), and position-specific behaviors (treat the first token specially).
| Goal | Good default | Reason |
|---|---|---|
| Simple baseline Transformer | Sinusoidal absolute PE | No extra params; decent generalization |
| Fixed max length, plenty of data | Learned absolute PE | Flexible and often strong in-domain |
| Strong long-context behavior | Relative bias / RoPE-style | Better length extrapolation and distance awareness |
Positional encoding is one of the key ingredients that make Transformers work on sequences at all. Without it, self-attention is permutation-invariant: it can’t reliably distinguish different orderings. Once you add PE, the attention mechanism becomes sequence-aware.
In the Transformers node, you’ll see how multi-head attention and stacked layers exploit this positional information to build hierarchical features—from local dependencies to global structure.
Let d = 4 (so we have i = 0, 1). Use the classic sinusoidal formula:
PE(p)[2i] = sin(p / 10000^(2i/d))
PE(p)[2i+1] = cos(p / 10000^(2i/d))
Compute PE(0) and PE(1).
Identify the frequencies.
For i = 0:
2i/d = 0/4 = 0
10000^(0) = 1
So the denominator is 1.
For i = 1:
2i/d = 2/4 = 1/2
10000^(1/2) = √10000 = 100
So the denominator is 100.
Compute PE(0).
For p = 0:
PE(0)[0] = sin(0/1) = sin(0) = 0
PE(0)[1] = cos(0/1) = cos(0) = 1
PE(0)[2] = sin(0/100) = sin(0) = 0
PE(0)[3] = cos(0/100) = cos(0) = 1
So:
PE(0) = [0, 1, 0, 1].
Compute PE(1).
For p = 1:
PE(1)[0] = sin(1/1) = sin(1)
PE(1)[1] = cos(1/1) = cos(1)
PE(1)[2] = sin(1/100) = sin(0.01)
PE(1)[3] = cos(1/100) = cos(0.01)
So:
PE(1) = [sin(1), cos(1), sin(0.01), cos(0.01)].
Optional numeric intuition (radians).
sin(1) ≈ 0.84, cos(1) ≈ 0.54
sin(0.01) ≈ 0.01, cos(0.01) ≈ 0.99995
So:
PE(1) ≈ [0.84, 0.54, 0.01, 0.99995].
Notice how the first pair changes a lot between p=0 and p=1, while the second pair changes very little—this is the multi-scale idea.
Insight: Different dimensions encode position at different “speeds”. Fast-changing components help with local order; slow-changing components help disambiguate larger positions and give the model coarse location cues.
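This multi-scale behavior is easy to see numerically, using the same d = 4 sinusoidal formula:

```python
import math

def pe(p, d=4):
    """Sinusoidal PE as in the worked example above."""
    out = []
    for i in range(d // 2):
        omega = 1.0 / (10000 ** (2 * i / d))
        out += [math.sin(p * omega), math.cos(p * omega)]
    return out

# per-dimension change from p=0 to p=1: fast pair (i=0) vs slow pair (i=1)
d01 = [abs(a - b) for a, b in zip(pe(1), pe(0))]
print([round(x, 3) for x in d01])  # → [0.841, 0.46, 0.01, 0.0]
```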
Consider a single query position i that will attend to three key positions j ∈ {i−1, i, i+1}. Suppose the content-based dot products (already divided by √dₖ) are:
content scores: s_content = [0.2, 0.2, 0.2]
So without position, all are equally likely.
Now add a relative bias that prefers the previous token:
b(−1)=0.5, b(0)=0.0, b(+1)=−0.5.
Compute the new attention distribution α over j.
Write the total scores.
For Δ = j−i:
Δ = −1 → score = 0.2 + 0.5 = 0.7
Δ = 0 → score = 0.2 + 0.0 = 0.2
Δ = +1 → score = 0.2 − 0.5 = −0.3
So:
s_total = [0.7, 0.2, −0.3].
Apply softmax.
α_k = exp(s_total[k]) / ∑ exp(s_total[m])
Compute exponentials (approx):
exp(0.7) ≈ 2.0138
exp(0.2) ≈ 1.2214
exp(−0.3) ≈ 0.7408
Sum ≈ 2.0138 + 1.2214 + 0.7408 = 3.9760
Normalize.
α(Δ=−1) ≈ 2.0138 / 3.9760 ≈ 0.506
α(Δ=0) ≈ 1.2214 / 3.9760 ≈ 0.307
α(Δ=+1) ≈ 0.7408 / 3.9760 ≈ 0.186
Compare to no-bias case.
Without bias, all scores were equal, so α would be [1/3, 1/3, 1/3].
With bias, attention shifts strongly toward the previous token (Δ=−1) even though the content scores were identical.
Insight: Relative positional bias directly shapes attention patterns (like “look left”) independently of token content. This is often an easier way for the model to learn local or directional dependencies than relying on absolute position signatures.
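You can verify the numbers in this example with a few lines of Python:

```python
import math

scores = [0.7, 0.2, -0.3]             # content + relative bias, as computed above
exps = [math.exp(s) for s in scores]
total = sum(exps)
alphas = [e / total for e in exps]
print([round(a, 3) for a in alphas])  # → [0.506, 0.307, 0.186]
```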
Suppose we have two tokens A and B with embeddings E(A)=a and E(B)=b in ℝᵈ, and positional encodings PE(0)=p₀, PE(1)=p₁.
Compare the input sets for sequences [A, B] vs [B, A] under addition: x₀=E(token₀)+PE(0), x₁=E(token₁)+PE(1).
Write inputs for [A, B].
Sequence 1:
x₀ = a + p₀
x₁ = b + p₁
Write inputs for [B, A].
Sequence 2:
x′₀ = b + p₀
x′₁ = a + p₁
Show these are not just a permutation of the same vectors unless special coincidences occur.
If the model only saw {a, b} with no position, swapping A and B would just swap the vectors.
With positions, the actual vectors differ:
a + p₀ ≠ a + p₁ (if p₀ ≠ p₁)
So token A at position 0 is representationally different from token A at position 1.
Connect to a simple linear detector.
A linear layer computes zₚ = Wxₚ.
Then:
z₀ = Wa + Wp₀
z₁ = Wb + Wp₁
So the network can learn W such that Wp₀ and Wp₁ act like learned “position features” that influence downstream behavior.
Insight: Even with a simple add operation, positional vectors act like features that downstream layers can read. They turn “token identity” into “token-at-position”, which is enough to break permutation invariance.
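A tiny numeric illustration of the same point (the 2-d embedding and position vectors are hypothetical):

```python
a = [1.0, 0.0]                       # embedding of token A (hypothetical)
p0, p1 = [0.0, 1.0], [0.1, 0.9]      # positional vectors for positions 0 and 1

x_at_0 = [ai + pi for ai, pi in zip(a, p0)]  # token A at position 0
x_at_1 = [ai + pi for ai, pi in zip(a, p1)]  # token A at position 1
print(x_at_0 != x_at_1)  # → True: same token, different representation
```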
Transformers process tokens in parallel, so they need explicit order information; positional encoding supplies it.
A positional encoding is a mapping p ↦ PE(p) ∈ ℝᵈ, combined with token embeddings (commonly by addition: xₚ = E(tokenₚ) + PE(p)).
Absolute encodings identify positions directly (position p has its own vector), while relative encodings represent offsets (j−i) between tokens.
Sinusoidal PE is deterministic, parameter-free, and has trig structure that helps represent relative shifts; it often extrapolates to longer sequences.
Learned absolute positional embeddings are flexible but usually tied to a maximum training length Lmax and may generalize poorly beyond it.
Relative positional methods often enter as attention biases or query/key modifications, directly shaping score(i, j) based on distance and direction.
You can inject position in input space (add to embeddings) or in attention space (bias scores); both approaches make attention order-aware.
Assuming self-attention “naturally” knows token order—without PE or a relative mechanism, it is largely permutation-invariant.
Mixing up absolute and relative: adding PE(p) to embeddings is absolute; adding a term depending on (j−i) to attention is relative.
For learned absolute PE, forgetting the maximum length constraint (Lmax) and being surprised when evaluation sequences exceed it.
Using positional vectors with mismatched scale (e.g., PE dominating token embeddings), leading to unstable training or poor utilization of content.
Sinusoidal PE practice: For d = 6 and position p = 2, write the symbolic form of PE(2) (you don’t need decimals).
Hint: Use i ∈ {0,1,2}. Denominators are 10000^(2i/d) = 10000^(0), 10000^(2/6), 10000^(4/6).
For d=6, i=0,1,2.
Let denom(i) = 10000^(2i/6).
Then:
PE(2)[0] = sin(2 / 10000^(0))
PE(2)[1] = cos(2 / 10000^(0))
PE(2)[2] = sin(2 / 10000^(2/6))
PE(2)[3] = cos(2 / 10000^(2/6))
PE(2)[4] = sin(2 / 10000^(4/6))
PE(2)[5] = cos(2 / 10000^(4/6))
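If you want decimals, a few lines of Python reproduce the symbolic answer:

```python
import math

d = 6
pe2 = []
for i in range(d // 2):
    omega = 1.0 / (10000 ** (2 * i / d))        # denominators 1, 10000^(1/3), 10000^(2/3)
    pe2 += [math.sin(2 * omega), math.cos(2 * omega)]
# pe2[0] = sin(2), pe2[2] = sin(2 / 10000^(2/6)), pe2[4] = sin(2 / 10000^(4/6)), ...
```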
Relative bias reasoning: Suppose an attention head uses score(i, j) = s_content(i, j) + b(j−i). If b(Δ) is strongly positive for Δ=0 and strongly negative otherwise, what behavior does this encourage? Give one task where this might help and one where it might hurt.
Hint: Think about the model preferring to attend to itself (j=i).
A large positive b(0) and negative b(Δ≠0) encourages self-attention to focus on the same position (j=i), reducing mixing across tokens.
Help: tasks where per-token transformations dominate, e.g., tagging where each token label depends mostly on itself, or when early layers should preserve identity.
Hurt: tasks needing context integration, e.g., coreference, long-range dependencies, or next-token prediction where previous tokens are essential.
Design choice: You need a model to handle sequences up to length 16k at inference, but training is mostly on length ≤ 2k. Which positional strategy (learned absolute vs sinusoidal absolute vs relative bias/RoPE) would you favor and why?
Hint: Focus on length extrapolation and what happens to positions not seen during training.
Favor a relative positional method (relative bias or RoPE), because it tends to generalize better when inference lengths exceed training lengths: offsets (j−i) remain meaningful regardless of absolute index. Sinusoidal absolute PE can also extrapolate reasonably since it’s defined for any p, but relative mechanisms more directly encode distance and often behave better for long-context attention patterns. Learned absolute PE is the least suitable because positions beyond Lmax are undefined/untrained and typically require resizing/interpolation with uncertain generalization.
Next: Transformers
Related ideas you’ll encounter soon: multi-head attention, stacked Transformer layers, and the relative/rotary positional variants used in modern LLMs.