Weighted focus on input elements. Self-attention, cross-attention.
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
You’re building a machine translation system. The input is: “The animal didn’t cross the road because it was tired.” When generating “it”, the model must decide: does “it” refer to “animal” or “road”? In classic seq2seq, that decision is buried in a single hidden state bottleneck. Attention fixes this by letting the decoder look back and place a weighted focus over the relevant input tokens.
Now a curiosity gap: attention layers can fail in surprisingly silent ways. Two common ones: (1) applying softmax along the wrong axis (your model still trains, but attends across the batch or feature dimension), and (2) mask leakage (future tokens “peek” through due to broadcasting or dtype mistakes). This lesson makes the mechanism precise enough that you can derive the shapes, verify the axes, and catch these bugs quickly.
Attention computes relevance between a query and many keys, converts relevance scores into weights (softmax), and uses those weights to blend the corresponding value vectors. Self-attention uses Q,K,V from the same sequence; cross-attention uses queries from one sequence (e.g., decoder) and keys/values from another (e.g., encoder). The core formula is: Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V, with masking and batching details crucial in practice.
In sequence-to-sequence modeling, we often want an output sequence to depend on different parts of the input at different times. Translation, summarization, speech recognition, program synthesis—these tasks all have alignment structure: each output element is mostly explained by a small subset of input elements.
Older encoder–decoder RNNs forced the entire input sequence into one fixed-size vector (or a narrow channel through the final hidden state). This creates an information bottleneck: long sequences degrade because the decoder can’t selectively retrieve what it needs.
Attention removes the bottleneck by turning “memory” into a set of vectors (one per input element) and letting the model compute a weighted combination of those vectors each time it needs context.
Attention is easiest to understand by analogy to retrieval: a query is compared against a set of keys, and the values attached to the best-matching keys are returned.
The algorithm: score the query against every key, normalize the scores into weights, and return the weight-blended values.
This is not just a metaphor; it’s literally what the math does.
Suppose we have one query vector q ∈ ℝᵈ, and n keys/values {(kᵢ, vᵢ)} for i=1..n.
1) Similarity scoring (dot-product attention): sᵢ = qᵀkᵢ for i = 1..n.
2) Score-to-weight via softmax: αᵢ = exp(sᵢ) / Σⱼ exp(sⱼ), so αᵢ ≥ 0 and Σᵢ αᵢ = 1.
3) Aggregate values: o = Σᵢ αᵢ vᵢ.
Here o is the attention output (sometimes called the “context vector”).
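The three steps above can be sketched in a few lines of NumPy; the function name `attend_single` is illustrative, not a library API:

```python
import numpy as np

def attend_single(q, keys, values):
    """One query q (d,), n keys (n, d), n values (n, d_v) -> context vector (d_v,)."""
    scores = keys @ q                        # s_i = q . k_i, shape (n,)
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights = weights / weights.sum()        # softmax over the n keys
    return weights @ values                  # o = sum_i alpha_i v_i

q = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
o = attend_single(q, keys, values)
```

The same numbers reappear in Worked Example 1 below, so you can check the output by hand.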
In neural networks, the input tokens already have embeddings xᵢ. We project them into Q/K/V spaces with learned affine transformations: qᵢ = W_Q xᵢ, kᵢ = W_K xᵢ, vᵢ = W_V xᵢ.
This matters because the raw embedding space need not be a good similarity space: learned projections let the model decide what counts as relevant, and let keys (used for matching) specialize differently from values (used for content).
This node emphasizes a crucial distinction: self-attention draws Q, K, and V from the same sequence, while cross-attention draws queries from one sequence and keys/values from another.
In translation terms: self-attention lets tokens of one sentence contextualize each other, while cross-attention lets each target position query the encoder’s memory of the source sentence.
Attention is easy to write but easy to implement incorrectly.
We’ll keep returning to shapes and axes so you can debug these confidently.
If attention is “weighted focus,” then the score function decides what counts as relevant. The score is computed between a query and each key.
In practice, the most common scoring rule is dot-product similarity because it is fast on GPUs and works well with learned projections.
Assume we have: m queries stacked as Q ∈ ℝ^(m×dₖ), n keys as K ∈ ℝ^(n×dₖ), and n values as V ∈ ℝ^(n×dᵥ).
Compute all pairwise query–key scores: S = QKᵀ.
Shapes: Q is (m×dₖ), Kᵀ is (dₖ×n), so S is (m×n).
Interpretation: S_{ij} is the raw relevance of key j to query i.
This “score matrix” is the object you will mask, normalize, and use to weight V.
In Transformers, the standard formula is scaled dot-product attention: Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V.
Motivation: dot products grow in magnitude with dimension.
A rough variance argument: if the components of q and k are independent with mean 0 and variance 1, then qᵀk = Σᵢ qᵢkᵢ has mean 0 and variance dₖ.
So typical score magnitudes scale like √dₖ. Large magnitudes push softmax into saturation: one weight approaches 1, the rest approach 0, and the gradients through softmax become vanishingly small.
Dividing by √dₖ keeps the score distribution more stable as dₖ changes.
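The full score → scale → softmax → blend pipeline can be sketched as a single NumPy function (the name and interface are ours, not a library API):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q (m, d_k), K (n, d_k), V (n, d_v) -> output (m, d_v), weights (m, n)."""
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                 # scores, shape (m, n)
    S = S - S.max(axis=-1, keepdims=True)      # numerical stability
    A = np.exp(S)
    A = A / A.sum(axis=-1, keepdims=True)      # softmax over keys (last axis)
    return A @ V, A

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 16))
K = rng.normal(size=(7, 16))
V = rng.normal(size=(7, 32))
O, A = scaled_dot_product_attention(Q, K, V)
```

Note that each row of `A` sums to 1: one distribution over the 7 keys per query.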
Dot product is not the only option. Historically, early attention used additive scoring.
| Scoring type | Formula (single pair) | Pros | Cons |
|---|---|---|---|
| Dot-product | s = qᵀk | Fast, simple, GPU-friendly | Can grow with dₖ (needs scaling) |
| Cosine similarity | s = (qᵀk) / (‖q‖ · ‖k‖) | Scale-invariant, interpretable | Norm computation adds cost; less common in Transformers |
| Additive (Bahdanau) | s = wᵀ tanh(W_q q + W_k k) | Flexible, can work well with smaller dims | Slower; less parallel-friendly |
Because you already know cosine similarity: note that dot-product attention can learn to behave like cosine similarity if the model learns to normalize representations (or learns norm control via layer norm / projection matrices). But in standard Transformers, the scaling is the main explicit normalization.
When implementing scoring, always write down: the shapes of Q, K, and V; the resulting shape of S; and the axis along which softmax will be applied.
In a batched setting: Q is (B×m×dₖ), K is (B×n×dₖ), and S = QKᵀ (transposing only the last two axes of K) is (B×m×n).
A common silent bug: transposing the wrong axes so you compute (B×dₖ×dₖ) or normalize over the wrong dimension.
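One way to make the axes explicit, assuming NumPy: `einsum` names every dimension, so a wrong transpose becomes a visible mismatch rather than a silent bug.

```python
import numpy as np

B, m, n, d_k = 2, 3, 4, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(B, m, d_k))
K = rng.normal(size=(B, n, d_k))

# Correct: swap only the LAST TWO axes of K, never the batch axis.
S = Q @ np.swapaxes(K, -1, -2)          # (B, m, d_k) @ (B, d_k, n) -> (B, m, n)
assert S.shape == (B, m, n)

# Equivalent and often clearer: einsum spells out each axis by name.
S2 = np.einsum('bmd,bnd->bmn', Q, K)
assert np.allclose(S, S2)
```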
The score matrix S encodes potential connections: entry S_{ij} links query position i to key position j, so restricting which positions may interact means restricting entries of S.
That restriction is done not at the Q/K/V level but by masking the score matrix before softmax.
We’ll treat masking carefully in the next mechanic because it interacts directly with the probability distribution.
Raw scores S_{ij} are unbounded real numbers. To create a “focus,” we need nonnegative weights that sum to 1 across keys for each query.
Softmax does exactly this, turning each query’s score row into a categorical distribution over keys.
Given scores S ∈ ℝ^(m×n), compute attention weights A = softmax(S), applied row-wise: A_{ij} = exp(S_{ij}) / Σⱼ′ exp(S_{ij′}).
Important: softmax is applied row-wise over the key dimension (size n). That means each row of A is a probability distribution: A_{ij} ≥ 0 and Σⱼ A_{ij} = 1 for every query i.
Then aggregate values: O = AV.
Shapes: A is (m×n), V is (n×dᵥ), so O is (m×dᵥ).
Interpretation: row i of O is a convex combination of the value vectors, weighted by how relevant each key is to query i.
Masking modifies S before softmax so forbidden positions get probability ≈ 0.
Two common masks:
1) Padding mask (ignore pad tokens)
2) Causal mask (prevent “future” access in autoregressive decoding)
Mechanically, we add a large negative number to masked scores: S′ = S + M,
where M_{ij} = 0 if allowed, and M_{ij} = −∞ (or a large negative constant like −10⁹) if disallowed.
Then: A = softmax(S′).
Because exp(-∞) → 0, masked entries get weight 0.
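The additive-mask mechanics can be sketched directly in NumPy (the variable names are illustrative):

```python
import numpy as np

L = 4
rng = np.random.default_rng(0)
S = rng.normal(size=(L, L))   # random scores standing in for Q K^T / sqrt(d_k)

# Causal mask: 0 where key j <= query i (allowed), -inf where j > i (future).
M = np.where(np.tril(np.ones((L, L), dtype=bool)), 0.0, -np.inf)
S_masked = S + M

# Softmax row-wise; exp(-inf) -> 0, so forbidden positions get exactly 0 weight.
A = np.exp(S_masked - S_masked.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)
```

Everything strictly above the diagonal of `A` is zero, and every row still sums to 1 over the allowed keys.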
Masks are often stored with shape (B×1×1×n) or (B×1×m×n) depending on implementation (especially with multi-head attention).
A common bug pattern: building the mask with a shape that broadcasts against the wrong axis of S (e.g., a (B×n) padding mask that ends up masking queries instead of keys), or mixing boolean and additive mask conventions.
Result: you mask the wrong dimension or the wrong positions. The model may still train but exhibits “cheating” (decoder sees future) or ignores padding improperly.
Practical discipline: assert the mask’s shape against S before adding it, and unit-test that masked positions receive weight ≈ 0 after softmax.
Given S of shape (B×m×n): softmax must run over the last axis (the n keys), e.g., axis=-1, so that each of the m queries gets its own distribution.
If you accidentally normalize over queries, you enforce that each key distributes probability over queries, which is not the retrieval interpretation.
A quick invariant check: A.sum(axis=-1) should be all ones (up to floating-point error).
Softmax can overflow if scores are large. The standard stable computation:
For each row i: compute mᵢ = maxⱼ S_{ij}, then A_{ij} = exp(S_{ij} − mᵢ) / Σⱼ′ exp(S_{ij′} − mᵢ).
This doesn’t change results because subtracting a constant from all logits preserves softmax.
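The difference is easy to demonstrate with deliberately large scores; a NumPy sketch:

```python
import numpy as np

def softmax_stable(S):
    """Row-wise softmax with max subtraction, safe for large scores."""
    Z = np.exp(S - S.max(axis=-1, keepdims=True))
    return Z / Z.sum(axis=-1, keepdims=True)

S = np.array([[1000.0, 999.0, 998.0]])

# Naive version: exp(1000) overflows to inf, and inf/inf = nan.
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(S) / np.exp(S).sum(axis=-1, keepdims=True)

stable = softmax_stable(S)   # finite, rows sum to 1, ordering preserved
```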
Sometimes you’ll see a temperature τ: A = softmax(S/τ). Small τ sharpens the distribution toward argmax; large τ flattens it toward uniform.
The Transformer’s √dₖ scaling can be interpreted as a kind of dimension-dependent temperature.
Once you have A, the output is O = AV.
This is a linear combination of V with coefficients from A.
Differentiability: every step (projections, dot products, softmax, weighted sum) is differentiable, so gradients flow from the output back through the weights A into Q, K, and V.
To see the dependency explicitly for a single query i: oᵢ = Σⱼ A_{ij} vⱼ.
If A_{ij} increases, oᵢ moves toward vⱼ.
Keys provide an address space, queries pick addresses, values store content. The softmax makes it a soft (continuous) lookup rather than a hard index.
This is why attention can represent alignment: it’s literally learning a soft alignment matrix A.
At this point, you have the atomic concepts: similarity scores, softmax weights, and the weighted sum of values.
Next we connect that to the self vs cross distinction in full architectural context.
Attention is a general operator: it maps (Q,K,V) to O. The difference between self- and cross-attention is simply where these tensors come from.
This origin choice encodes a modeling decision: should a sequence contextualize itself, or retrieve information from a different sequence?
Let X ∈ ℝ^(L×d_model) be a sequence of L token embeddings (after adding positional information).
We compute: Q = XW_Q, K = XW_K, V = XW_V.
Where: W_Q, W_K ∈ ℝ^(d_model×dₖ) and W_V ∈ ℝ^(d_model×dᵥ) are learned projection matrices.
Then: Output = softmax(QKᵀ/√dₖ + M)V.
Mask M depends on the setting: none (or padding only) in encoders, a causal mask in decoder self-attention.
Interpretation: each token representation is updated by blending information from other tokens.
A key property: self-attention can connect tokens at arbitrary distance in one step (unlike RNNs where information must travel sequentially).
In an encoder–decoder setup: the encoder produces a memory H ∈ ℝ^(L_src×d_model), and the decoder builds target-side representations Y ∈ ℝ^(L_tgt×d_model).
Cross-attention uses: Q = YW_Q, K = HW_K, V = HW_V.
So each decoder position forms a query based on what it has generated so far, and retrieves relevant source information.
Shape intuition: scores QKᵀ have shape (L_tgt×L_src), one row per target position, one column per source position.
This matrix is literally an alignment between target positions and source positions.
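A shape-level sketch of cross-attention in NumPy, with random matrices standing in for learned projections:

```python
import numpy as np

L_src, L_tgt, d_model = 4, 3, 8
rng = np.random.default_rng(0)
H = rng.normal(size=(L_src, d_model))   # encoder memory
Y = rng.normal(size=(L_tgt, d_model))   # decoder states
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = Y @ W_Q, H @ W_K, H @ W_V
S = Q @ K.T / np.sqrt(d_model)          # (L_tgt, L_src): target-to-source alignment
A = np.exp(S - S.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)   # normalize over SOURCE positions
O = A @ V                                # (L_tgt, d_model)
```

Each row of `A` is one target token’s distribution over the 4 source positions.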
This node unlocks Transformers, where attention is typically multi-head.
Multi-head attention repeats the attention computation h times with different learned projections.
For head r: Q_r = XW_Q⁽ʳ⁾, K_r = XW_K⁽ʳ⁾, V_r = XW_V⁽ʳ⁾, and O_r = Attention(Q_r, K_r, V_r).
Each head produces O_r, then we concatenate and project: O = concat(O_1, …, O_h)W_O.
Why multiple heads help: different heads can specialize in different relations (e.g., local position, syntax, coreference) that a single softmax distribution could not represent simultaneously.
But the atomic mechanism remains exactly what you learned: score, softmax, weighted sum.
Let: B be the batch size, h the number of heads, L_q the query length, L_k the key length, and d_h = d_model/h the per-head dimension.
Typical shapes: Q is (B×h×L_q×d_h), K is (B×h×L_k×d_h), V is (B×h×L_k×d_h).
Scores: S = QKᵀ has shape (B×h×L_q×L_k).
Softmax over L_k: A has shape (B×h×L_q×L_k), with each query row summing to 1.
Output: O = AV has shape (B×h×L_q×d_h); heads are then concatenated back to (B×L_q×d_model).
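These four-dimensional shapes can be verified directly; a NumPy sketch using einsum so every axis is named:

```python
import numpy as np

B, h, L_q, L_k, d_h = 2, 4, 6, 10, 16
rng = np.random.default_rng(0)
Q = rng.normal(size=(B, h, L_q, d_h))
K = rng.normal(size=(B, h, L_k, d_h))
V = rng.normal(size=(B, h, L_k, d_h))

S = np.einsum('bhqd,bhkd->bhqk', Q, K) / np.sqrt(d_h)   # (B, h, L_q, L_k)
A = np.exp(S - S.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)                   # softmax over L_k
O = np.einsum('bhqk,bhkd->bhqd', A, V)                  # (B, h, L_q, d_h)
O_merged = O.transpose(0, 2, 1, 3).reshape(B, L_q, h * d_h)  # concat heads
```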
This is where the earlier failure modes live: softmax taken over L_q (or a head/batch axis) instead of L_k, and masks that broadcast against the wrong one of these four dimensions.
When attention behaves oddly, check invariants:
1) Row sum invariant (per query): A summed over the key axis equals 1 everywhere.
2) Mask invariant: every masked position carries weight ≈ 0.
3) Causality invariant (decoder self-attention): A_{tj} = 0 for all j > t.
4) Sanity input test: with constant scores, attention over the allowed keys should be exactly uniform; any structure in A then comes from a bug, not the data.
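These invariants translate directly into executable checks; a NumPy sketch (the demo matrix `A` is built here with a causal mask):

```python
import numpy as np

L = 4
rng = np.random.default_rng(0)
S = rng.normal(size=(L, L))
mask = np.where(np.tril(np.ones((L, L), dtype=bool)), 0.0, -np.inf)
A = np.exp(S + mask - (S + mask).max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)

# 1) Row sum invariant: each query's weights form a distribution.
assert np.allclose(A.sum(axis=-1), 1.0)
# 2) Mask invariant: forbidden positions carry ~0 weight.
assert np.allclose(A[mask == -np.inf], 0.0)
# 3) Causality invariant: A[t, j] == 0 for j > t.
assert np.allclose(np.triu(A, k=1), 0.0)
# 4) Sanity input test: constant scores -> uniform weights over allowed keys.
S0 = np.zeros((L, L)) + mask
A0 = np.exp(S0 - S0.max(axis=-1, keepdims=True))
A0 = A0 / A0.sum(axis=-1, keepdims=True)
assert np.allclose(A0[2, :3], 1/3)
```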
Transformers stack attention layers with residual connections, layer normalization, and position-wise feed-forward networks.
But none of those change what attention is. They make it trainable, stable, and expressive at scale.
If you can derive the score matrix shape and explain why softmax is row-wise, you’re ready to understand multi-head attention, positional encoding, and full Transformer blocks.
We have 1 query and 3 key/value pairs. Use unscaled dot-product attention for simplicity.
Let q = [1, 0].
Keys: k₁ = [1, 0], k₂ = [0, 1], k₃ = [1, 1].
Values: v₁ = [10, 0], v₂ = [0, 10], v₃ = [5, 5].
Compute scores sᵢ = qᵀkᵢ, weights α via softmax, and output o = ∑ αᵢ vᵢ.
Step 1: Compute dot-product scores
s₁ = [1,0]·[1,0] = 1
s₂ = [1,0]·[0,1] = 0
s₃ = [1,0]·[1,1] = 1
Step 2: Softmax normalization
Compute exp scores:
exp(s₁)=e¹,
exp(s₂)=e⁰=1,
exp(s₃)=e¹
Sum = e + 1 + e = 2e + 1
So:
α₁ = e/(2e+1)
α₂ = 1/(2e+1)
α₃ = e/(2e+1)
Step 3: Weighted sum of values
o = α₁v₁ + α₂v₂ + α₃v₃
= α₁[10,0] + α₂[0,10] + α₃[5,5]
First component:
o₁ = 10α₁ + 0α₂ + 5α₃ = 10α₁ + 5α₃
Second component:
o₂ = 0α₁ + 10α₂ + 5α₃ = 10α₂ + 5α₃
Step 4: Substitute α values
Because α₁ = α₃ = e/(2e+1):
o₁ = 10·e/(2e+1) + 5·e/(2e+1) = 15e/(2e+1)
o₂ = 10·1/(2e+1) + 5·e/(2e+1) = (10 + 5e)/(2e+1)
Insight: Even though k₁ and k₃ tie on relevance, the output is not just “pick one”: it blends v₁ and v₃ heavily, with a smaller contribution from v₂. Attention is a soft retrieval mechanism; ties and near-ties naturally produce mixtures.
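The arithmetic above can be verified numerically; a NumPy sketch of the same computation:

```python
import numpy as np

q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # k1, k2, k3
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]) # v1, v2, v3

s = K @ q                                 # scores [1, 0, 1]
alpha = np.exp(s) / np.exp(s).sum()       # softmax weights
o = alpha @ V                             # blended output

# Compare against the closed-form answers derived above.
e = np.e
assert np.allclose(alpha, [e/(2*e+1), 1/(2*e+1), e/(2*e+1)])
assert np.allclose(o, [15*e/(2*e+1), (10+5*e)/(2*e+1)])
```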
You have an encoder–decoder model.
Encoder sequence length L_src = 4, decoder length L_tgt = 3.
Model dimension d_model = 8.
Single-head attention with dₖ = dᵥ = 8.
Batch size B = 2.
Encoder outputs H have shape (B×L_src×d_model) = (2×4×8).
Decoder representations Y have shape (B×L_tgt×d_model) = (2×3×8).
Construct Q,K,V and determine the score matrix shape for:
1) encoder self-attention
2) decoder self-attention
3) decoder cross-attention
Part A: Encoder self-attention
Q = HW_Q, K = HW_K, V = HW_V
So Q,K,V each have shape (2×4×8).
Scores S = QKᵀ: a batched matmul of (2×4×8) with (2×8×4).
So S is (2×4×4).
Softmax must be over the last dimension (keys), so over size 4.
Part B: Decoder self-attention
Q,K,V come from Y, so each is (2×3×8).
Scores S is (2×3×3).
Softmax over the last dimension (keys) so each of 3 query positions has a distribution over 3 key positions.
Additionally, apply a causal mask so query position t cannot attend to keys > t.
Part C: Decoder cross-attention
Q comes from Y: Q is (2×3×8).
K,V come from H: K,V are (2×4×8).
Scores S = QKᵀ gives shape (2×3×4).
Softmax must be over the last dimension (keys), so over size 4 (the source positions).
Padding mask applies to the encoder keys (length 4), not to decoder positions.
Insight: The single biggest implementation detail is: softmax normalizes across keys for each query. In cross-attention the key axis is L_src, not L_tgt. If you normalize over the wrong length, the model no longer expresses “which source tokens explain this target token?”
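All three score-matrix shapes can be checked in a few lines, assuming NumPy (the 8×8 projections are omitted since they do not change shapes):

```python
import numpy as np

B, L_src, L_tgt, d_model = 2, 4, 3, 8
rng = np.random.default_rng(0)
H = rng.normal(size=(B, L_src, d_model))   # encoder outputs
Y = rng.normal(size=(B, L_tgt, d_model))   # decoder representations

def scores(Q, K):
    """Batched Q K^T, transposing only the last two axes of K."""
    return Q @ np.swapaxes(K, -1, -2)

S_enc_self = scores(H, H)    # Part A: encoder self-attention
S_dec_self = scores(Y, Y)    # Part B: decoder self-attention (plus causal mask)
S_cross = scores(Y, H)       # Part C: decoder cross-attention

assert S_enc_self.shape == (2, 4, 4)
assert S_dec_self.shape == (2, 3, 3)
assert S_cross.shape == (2, 3, 4)
```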
Consider decoder self-attention with L = 3 tokens. We want token 1 (0-indexed) to attend only to keys 0..1.
Suppose scaled scores (already divided by √dₖ) for a single head and single batch item are:
S =
[ [2, 1, 0],
[0, 3, 4],
[1, 1, 1] ]
Apply a causal mask and compute the masked softmax weights for row 1 (the second query).
Step 1: Write the causal mask M (0 allowed, -∞ forbidden)
For L=3:
M =
[ [0, -∞, -∞],
[0, 0, -∞],
[0, 0, 0] ]
Step 2: Mask the scores S' = S + M
Row 1 (second query) originally: [0, 3, 4]
After masking (disallow key 2): [0, 3, -∞]
Step 3: Softmax row 1 stably
Compute max = 3
Subtract max: [0-3, 3-3, -∞] = [-3, 0, -∞]
Exponentiate: [e^-3, 1, 0]
Normalize: sum = e^-3 + 1
So weights are:
A = [ e^-3/(1+e^-3), 1/(1+e^-3), 0 ]
Insight: Without the mask, key 2 would dominate because score 4 is largest. With the mask, its probability is forced to 0. This illustrates why mask correctness is a security property for autoregressive models: a single broadcasting error can re-enable that last entry.
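The masked row can be checked numerically; a NumPy sketch reproducing Worked Example 3:

```python
import numpy as np

S = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 4.0],
              [1.0, 1.0, 1.0]])

# Causal mask for L = 3: 0 on and below the diagonal, -inf above.
M = np.where(np.tril(np.ones((3, 3), dtype=bool)), 0.0, -np.inf)

Sm = S + M
A = np.exp(Sm - Sm.max(axis=-1, keepdims=True))   # stable softmax
A = A / A.sum(axis=-1, keepdims=True)

# Row 1 should match the hand computation: [e^-3/(1+e^-3), 1/(1+e^-3), 0].
e3 = np.exp(-3.0)
assert np.allclose(A[1], [e3/(1+e3), 1/(1+e3), 0.0])
```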
Attention is a differentiable retrieval operation: similarity scoring (Q vs K) → softmax weights → weighted sum of V.
In matrix form: Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V, where softmax is applied over the key dimension.
Self-attention uses Q,K,V from the same source sequence; cross-attention uses Q from one source and K,V from another (e.g., decoder queries encoder memory).
The √dₖ scaling prevents dot-product magnitudes from growing with dimension, keeping softmax from saturating and stabilizing gradients.
Masking is done by adding 0 for allowed positions and −∞ (or a large negative) for forbidden positions before softmax.
Two silent implementation failures are common: softmax along the wrong axis (weights don’t sum over keys) and mask leakage via shape/broadcasting mistakes.
Attention weight matrices can be interpreted as soft alignment matrices (especially in cross-attention), connecting directly to seq2seq alignment intuition.
Applying softmax over the wrong dimension (e.g., over queries instead of keys), which breaks the “distribution over keys per query” interpretation while still producing finite outputs.
Mask shape/broadcasting errors that allow attention to forbidden positions (future tokens or padding), causing leakage that may only show up as suspiciously good training loss.
Forgetting the 1/√dₖ scaling (or mis-scaling), leading to overly peaky softmax and poor gradient flow, especially for larger dₖ.
Mixing up K and V: keys are for matching, values are for retrieval; swapping them can reduce performance or destabilize learning.
You have Q ∈ ℝ^(5×16), K ∈ ℝ^(7×16), V ∈ ℝ^(7×32). What are the shapes of the score matrix S, attention weights A, and output O (single head, no batch)? Also: along which axis do you apply softmax?
Hint: Compute S = QKᵀ and track dimensions; softmax should normalize over keys for each query.
S = QKᵀ has shape (5×7). A = softmax(S) has shape (5×7), with softmax applied over the last dimension of size 7 (the keys) for each of the 5 queries. O = AV has shape (5×32).
Consider a decoder self-attention layer with sequence length L=4. Write the causal mask matrix M (entries 0 or −∞) that prevents attending to future tokens. Which entries are allowed for query position 2 (0-indexed)?
Hint: Allowed positions are keys with index ≤ query index.
For L=4,
M =
[ [0, −∞, −∞, −∞],
[0, 0, −∞, −∞],
[0, 0, 0, −∞],
[0, 0, 0, 0] ]
For query position 2, allowed keys are {0,1,2}; key 3 is forbidden.
In cross-attention, a decoder has L_tgt=6 and an encoder has L_src=10. You compute scores S of shape (B×h×6×10). Suppose you accidentally apply softmax over the length-6 axis instead of length-10. Conceptually, what distribution are you computing, and why is it wrong for retrieval?
Hint: Ask: for a fixed query, do weights sum across keys? Or across queries?
Softmax over the length-6 axis normalizes across queries (target positions) for each fixed key, producing a distribution like “how much does this source position contribute across different target queries,” rather than “which source positions are relevant for this target query.” Retrieval requires, for each query position, a distribution over keys (length 10). With the wrong axis, each query no longer forms a proper mixture over encoder values, so the mechanism can’t represent alignment from each target token to source tokens.
Unlocks and extensions:
Related prerequisites and reinforcing nodes: