Attention-based architecture. Multi-head attention, positional encoding.
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
Transformers are the first widely successful neural architecture where the main “engine” is not recurrence or convolution, but a learned, content-based routing system: attention. Once you understand the exact mechanics of scaled dot-product attention, multi-head attention, and the Transformer block (attention + feed-forward + residual + layer norm), most modern language and vision models become variations on a single theme.
A Transformer processes a sequence by projecting token representations into queries, keys, and values, computing attention weights via softmax(QKᵀ/√d_k) (with masks as needed), mixing values with those weights, and then applying a position-wise MLP—each sublayer wrapped with residual connections and layer normalization. Multi-head attention repeats attention in parallel with different projections, concatenates head outputs, and mixes them back to d_model. Stacking these blocks yields encoders and decoders; decoders add causal masking and cross-attention to an encoder output.
Before Transformers, sequence modeling was dominated by RNNs/LSTMs/GRUs and CNN-based sequence models. Those families have two persistent pain points:
1) Long-range dependencies are hard. Even with gating, recurrent models struggle to move information across hundreds or thousands of steps.
2) Parallelism is limited. Recurrence is inherently sequential: to compute step t, you need step t−1. That slows training.
Transformers address both by making the core operation all-to-all token interaction in a single layer: every token can “look at” every other token (subject to masking). This interaction is differentiable, learnable, and highly parallelizable on GPUs/TPUs.
A Transformer layer repeatedly does two things: (1) mix information across positions with self-attention, then (2) transform each position independently with a feed-forward network.
Crucially, both steps are wrapped in residual connections and layer normalization for stable optimization.
Assume a sequence length L and model width d_model.
Because attention has no inherent sense of order, we add positional information (for example, positional encodings added to the token embeddings at the bottom of the stack).
| Family | Primary use | Blocks | Key masking |
|---|---|---|---|
| Encoder-only (e.g., BERT-style) | Understanding, classification, bidirectional context | self-attn + FFN | Padding mask only (no causal mask) |
| Decoder-only (e.g., GPT-style) | Autoregressive generation | masked self-attn + FFN | Causal + padding masks |
| Encoder–decoder (original) | Seq2seq (translation, summarization) | encoder: self-attn + FFN; decoder: masked self-attn + cross-attn + FFN | Decoder uses causal; cross-attn uses padding mask on encoder side |
This lesson focuses on the core mechanics that all of these share: scaled dot-product attention, multi-head attention, and the Transformer layer structure.
Suppose token i needs information from token j (e.g., a pronoun needs its antecedent). Attention lets token i compute a weighted mixture of other tokens’ information.
The key design choice is: weights depend on content (learned similarity), not just distance.
Each token representation xᵢ is linearly projected into three vectors: a query qᵢ, a key kᵢ, and a value vᵢ.
In matrix form for a whole sequence X ∈ ℝ^{L×d_model}:
Q = XW_Q, K = XW_K, V = XW_V,
where W_Q ∈ ℝ^{d_model×d_k}, W_K ∈ ℝ^{d_model×d_k}, and W_V ∈ ℝ^{d_model×d_v} are learned matrices.
Typically, Q, K ∈ ℝ^{L×d_k} and V ∈ ℝ^{L×d_v}. In most standard implementations, d_v = d_k = d_model / h per head.
The raw similarity between token i and token j is the dot product:
sᵢⱼ = qᵢ · kⱼ.
In matrix form, the score matrix is:
S = QKᵀ ∈ ℝ^{L×L}.
If q and k components have roughly unit variance, then the dot product grows with dimension: qᵢ · kⱼ has variance on the order of d_k, so typical magnitudes scale like √d_k.
Large magnitudes push softmax into saturation, giving tiny gradients. To keep logits in a reasonable range, we scale:
Ŝ = S / √d_k.
This is “scaled dot-product attention.”
For each query position i, we take a softmax over j:
Aᵢⱼ = exp(Ŝᵢⱼ) / ∑ₘ exp(Ŝᵢₘ).
So Aᵢⱼ ≥ 0 and ∑ⱼ Aᵢⱼ = 1.
Interpretation: row i of A is a probability distribution over which tokens i will attend to.
Masks are incorporated by adding −∞ (or a very negative number) to disallowed positions before softmax.
Let M ∈ ℝ^{L×L} where:
Mᵢⱼ = 0 if position i may attend to position j, and Mᵢⱼ = −∞ otherwise.
Then:
A = softmax( (QKᵀ)/√d_k + M ).
Two common masks:
1) Padding mask: disallow attending to padding tokens.
2) Causal mask (decoder): disallow attending to future tokens, i.e., j > i.
Finally, the output at each position i is a weighted sum of values:
oᵢ = ∑ⱼ Aᵢⱼ vⱼ.
In matrix form:
O = AV,
where O ∈ ℝ^{L×d_v}.
The full formula:
Attention(Q, K, V) = softmax( (QKᵀ)/√d_k + M ) V.
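The full computation can be sketched in a few lines of NumPy (a minimal single-head implementation; `mask` is an additive matrix of 0s and very negative numbers, as with M above):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Single-head attention: softmax(QK^T / sqrt(d_k) + M) V.
    Q, K: (L, d_k); V: (L, d_v); mask: optional (L, L) additive mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (L, L) similarity logits
    if mask is not None:
        scores = scores + mask                   # very negative entries disable positions
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)        # row-wise softmax: rows sum to 1
    return A @ V, A

# Tiny sanity check with L = 2 tokens and d_k = d_v = 2.
Q = np.array([[1., 0.], [0., 1.]])
K = np.array([[1., 0.], [1., 1.]])
V = np.array([[1., 2.], [3., 4.]])
O, A = scaled_dot_product_attention(Q, K, V)
```

Each row of `A` is a probability distribution, so each output row of `O` is a convex combination of the value vectors.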
Dot products measure alignment. If qᵢ points in a similar direction to kⱼ, token i will attend more to token j. But because W_Q and W_K are learned, the model can invent similarity notions that match the task (syntax, coreference, topic, etc.).
Self-attention builds an L×L matrix of scores. That's O(L²) compute and memory in the sequence length.
This quadratic dependence is why long-context Transformers require approximations or architectural tricks—but the core mechanism remains the same.
A single attention map must decide one set of weights per token. But language often requires multiple simultaneous relationships: a token may need to track its syntactic head, a coreferent mention, and the overall topic at the same time.
Multi-head attention lets the model compute multiple attention distributions in parallel, each in its own learned subspace.
Let h be the number of heads and d_k the per-head query/key dimension.
Commonly:
d_k = d_v = d_model / h.
Example: d_model = 768, h = 12 ⇒ d_k = 64.
For head r ∈ {1,…,h}, we have separate projection matrices:
W_Q^{(r)}, W_K^{(r)} ∈ ℝ^{d_model×d_k} and W_V^{(r)} ∈ ℝ^{d_model×d_v}.
Compute:
Q^{(r)} = XW_Q^{(r)}, K^{(r)} = XW_K^{(r)}, V^{(r)} = XW_V^{(r)}.
Then each head output:
O^{(r)} = softmax( Q^{(r)} (K^{(r)})ᵀ / √d_k ) V^{(r)}.
Concatenate head outputs along the feature dimension:
O_concat = [O^{(1)} | O^{(2)} | … | O^{(h)}].
If each O^{(r)} ∈ ℝ^{L×d_v} and d_v = d_k, then O_concat ∈ ℝ^{L×(h d_k)} = ℝ^{L×d_model}.
Then apply a final learned output projection:
Y = O_concat W_O, with W_O ∈ ℝ^{d_model×d_model}.
This final mixing matters: it lets the model combine information across heads.
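A minimal NumPy sketch of multi-head self-attention, with per-head projections stored as plain lists (illustrative dimensions; a real implementation would use batched tensors):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (L, d_model); W_Q/W_K/W_V: lists of per-head (d_model, d_k) matrices;
    W_O: (d_model, d_model) output projection."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (L, L) weights per head
        heads.append(A @ V)                           # (L, d_k) per head
    O_concat = np.concatenate(heads, axis=-1)         # (L, h*d_k) = (L, d_model)
    return O_concat @ W_O                             # mix information across heads

L, d_model, h = 5, 16, 4
d_k = d_model // h
X = rng.standard_normal((L, d_model))
W_Q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_O = rng.standard_normal((d_model, d_model))
Y = multi_head_attention(X, W_Q, W_K, W_V, W_O)   # same shape as X
```

The output shape matches the input shape (L×d_model), which is what makes the residual addition X + Y possible.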
Multi-head attention is a pattern, and it can be used in different places: self-attention (Q, K, V all come from the same sequence), masked self-attention (the same, plus a causal mask), and cross-attention (Q comes from the decoder, K and V from the encoder output).
Cross-attention formula:
Attention(Q_dec, K_enc, V_enc) = softmax( Q_dec K_encᵀ / √d_k ) V_enc,
where Q_dec is projected from decoder states and K_enc, V_enc from encoder outputs.
Here the decoder learns to retrieve information from the encoded source sequence.
A helpful mental model: queries say what a token is looking for, keys say what each token has to offer, and values carry the content that actually gets transmitted.
So attention isn’t only where to look; it’s also what to bring back.
Holding d_model fixed, increasing h decreases d_k. There is a trade-off:
| Choice | Benefit | Cost |
|---|---|---|
| More heads (higher h) | More parallel subspaces | Smaller d_k per head (less capacity per head), overhead |
| Fewer heads | More capacity per head | Fewer distinct attention patterns |
Empirically, standard settings (8–32 heads depending on width) work well, but variants exist (multi-query, grouped-query attention) to reduce memory/compute during decoding.
A single attention layer can mix tokens once, but deep language understanding requires multiple rounds of mixing information across tokens (attention) and transforming it within each token (FFN).
So Transformers stack identical blocks. The stability of deep stacking relies on residual connections and layer normalization.
An encoder layer has two sublayers:
1) Multi-head self-attention (MHA)
2) Position-wise feed-forward network (FFN)
Each sublayer is wrapped by residual + layer norm. There are two common normalization conventions:
| Convention | Pattern | Notes |
|---|---|---|
| Post-LN (original 2017) | X + Sublayer(X) → LN | Can be harder to optimize at great depth |
| Pre-LN (common today) | X → LN → Sublayer → X + … | Typically more stable for deep stacks |
We’ll write pre-LN, since it’s widely used.
Let X ∈ ℝ^{L×d_model} be the layer input.
(1) Attention sublayer:
H = X + MHA(LN(X))
(2) Feed-forward sublayer:
Y = H + FFN(LN(H))
Output is Y.
The FFN is the same MLP applied independently to each position.
Typical form:
FFN(x) = W₂ φ(W₁x + b₁) + b₂,
where W₁ ∈ ℝ^{d_ff×d_model}, W₂ ∈ ℝ^{d_model×d_ff}, φ is a nonlinearity (ReLU or GELU), and typically d_ff = 4·d_model.
Even though FFN doesn’t mix tokens, it adds substantial capacity: it can reshape and recombine features within each token vector.
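A minimal NumPy sketch of the pre-LN encoder layer and its position-wise FFN. For brevity, the attention sublayer is stubbed out with a zero function and the learnable LN scale/shift are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector (row) independently; learned gamma/beta omitted.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # Position-wise MLP: the same weights are applied to every token row.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU nonlinearity

def pre_ln_encoder_layer(X, attn_sublayer, ffn_params):
    H = X + attn_sublayer(layer_norm(X))     # residual around attention
    Y = H + ffn(layer_norm(H), *ffn_params)  # residual around FFN
    return Y

L, d_model = 3, 8
d_ff = 4 * d_model
X = rng.standard_normal((L, d_model))
W1 = rng.standard_normal((d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)
zero_attn = lambda x: np.zeros_like(x)   # stand-in; a real layer uses MHA here
Y = pre_ln_encoder_layer(X, zero_attn, (W1, b1, W2, b2))   # shape preserved: (3, 8)
```

Note how the zero sublayer leaves H = X: the residual path preserves the input whenever a sublayer contributes nothing.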
A decoder layer adds one more attention sublayer:
1) Masked multi-head self-attention (causal)
2) Cross-attention over encoder outputs (encoder–decoder models only)
3) Feed-forward network
Pre-LN decoder (encoder–decoder) sketch, with encoder output E:
H₁ = X + MaskedMHA(LN(X))
H₂ = H₁ + CrossAttn(LN(H₁), E)
Y = H₂ + FFN(LN(H₂))
If it’s decoder-only (GPT-style), you omit the cross-attention term.
Residuals let the model learn modifications rather than complete rewrites.
If a sublayer initially does something unhelpful, the residual path preserves the input: the output x + Sublayer(x) stays close to x whenever Sublayer(x) is small.
This makes gradients flow more directly through many layers. In deep Transformers (dozens to hundreds of layers), residual pathways are essential.
Layer norm stabilizes the scale of activations, making training less sensitive to initialization and learning rate.
Layer norm operates per token vector xᵢ:
LN(xᵢ) = γ ⊙ (xᵢ − μᵢ)/√(σᵢ² + ε) + β,
where μᵢ, σᵢ² are the mean and variance over the d_model features of xᵢ, and γ, β are learned.
(You already know LN; here it matters because attention logits and FFN activations can drift in magnitude as depth increases.)
Positional information is usually added once at the bottom:
X₀ = TokenEmbeddings + PositionalEncodings.
Then all layers operate on X₀, X₁, …, X_N.
There are alternatives (relative position bias, rotary embeddings), but the principle remains the same: attention needs a way to distinguish positions.
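For concreteness, here is a sketch of the sinusoidal encodings used in the original Transformer, where PE[pos, 2i] = sin(pos / 10000^{2i/d_model}) and PE[pos, 2i+1] = cos(pos / 10000^{2i/d_model}):

```python
import numpy as np

def sinusoidal_positional_encoding(L, d_model):
    """Return PE of shape (L, d_model): sin on even dims, cos on odd dims."""
    pos = np.arange(L)[:, None]                    # (L, 1) positions
    two_i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2) even dims
    angles = pos / np.power(10000.0, two_i / d_model)
    PE = np.zeros((L, d_model))
    PE[:, 0::2] = np.sin(angles)
    PE[:, 1::2] = np.cos(angles)
    return PE

PE = sinusoidal_positional_encoding(16, 8)
# X0 = token_embeddings + PE would then be the input to the first layer.
```

Every entry is bounded in [−1, 1], so the positional signal stays on the same scale as typical embeddings.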
Use case: classification, retrieval, tagging, masked language modeling.
Mechanics: every token attends bidirectionally to every other token (padding mask only, no causal mask).
To classify, you might pool: take a designated classification token's final state, or average the final states across positions.
Use case: text generation, code generation, next-token prediction.
Mechanics: causally masked self-attention only; each position sees the prefix up to and including itself and predicts the next token.
Training objective (next-token):
If logits at position i are zᵢ ∈ ℝ^{|Vocab|}, then:
p(tokenᵢ₊₁ | prefix) = softmax(zᵢ).
Loss is cross-entropy summed across positions: −∑ᵢ log p(tokenᵢ₊₁ | prefix).
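The objective can be sketched in NumPy (illustrative names; `logits[i]` plays the role of zᵢ, and `targets[i]` is the ground-truth id of the token at position i+1):

```python
import numpy as np

def next_token_loss(logits, targets):
    """logits: (L, |Vocab|) predictions at positions 0..L-1;
    targets: (L,) ids of the tokens at positions 1..L.
    Returns cross-entropy summed across positions."""
    z = logits - logits.max(axis=-1, keepdims=True)                 # stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))   # log softmax
    return -log_probs[np.arange(len(targets)), targets].sum()

logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])   # L = 2 positions, |Vocab| = 3
targets = np.array([0, 1])             # the correct class has the largest logit
loss = next_token_loss(logits, targets)   # small but positive
```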
Use case: translation, summarization, speech-to-text.
Mechanics: the encoder processes the source bidirectionally; the decoder applies masked self-attention over the target prefix and cross-attention to the encoder outputs.
Autoregressive decoding generates one token at a time. Recomputing attention over the whole prefix is expensive. Instead, store previous keys/values.
At time step t:
1) Project the new token into q_t, k_t, v_t.
2) Append k_t and v_t to the cached K and V.
3) Attend with the single new query: output = softmax(q_t Kᵀ/√d_k) V.
This reduces per-step cost from O(t²) to roughly O(t) for attention score computation (still linear in context length per step).
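A sketch of KV caching for a single attention head (NumPy; the cache is a plain dict of lists purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(x_t, W_Q, W_K, W_V, cache):
    """One autoregressive step: project the new token, append its key/value
    to the cache, and attend over all cached positions. No causal mask is
    needed because the cache only ever contains the past."""
    q = x_t @ W_Q                      # (d_k,) single query
    cache["K"].append(x_t @ W_K)
    cache["V"].append(x_t @ W_V)
    K = np.stack(cache["K"])           # (t, d_k) all past keys
    V = np.stack(cache["V"])           # (t, d_v) all past values
    A = softmax(q @ K.T / np.sqrt(q.shape[-1]))   # (t,) weights over the prefix
    return A @ V                       # (d_v,) attended output

d_model, d_k = 8, 4
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))
cache = {"K": [], "V": []}
for t in range(5):                     # five decoding steps
    out = decode_step(rng.standard_normal(d_model), W_Q, W_K, W_V, cache)
```

Each step does one query against t cached keys, i.e., O(t) score computations instead of recomputing the full t×t matrix.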
If you forget masking: a decoder can attend to future tokens during teacher-forced training, learning dependencies that will not exist at inference, and padding tokens can contaminate attention distributions.
Mechanically, masks must be added before softmax.
Deep Transformers are sensitive to: normalization placement (pre- vs post-LN), weight initialization, and learning-rate schedule (including warmup).
These details often decide whether training is stable.
Self-attention is permutation-invariant without positional signals:
If you permute tokens, QKᵀ permutes correspondingly, producing the same pattern up to permutation. Positional encoding breaks this symmetry, enabling order-sensitive tasks.
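This permutation property can be checked directly (NumPy sketch, no positional encodings: permuting then attending equals attending then permuting):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
L, d = 4, 8
X = rng.standard_normal((L, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
perm = np.array([2, 0, 3, 1])          # an arbitrary reordering of tokens

out_then_perm = self_attention(X, W_Q, W_K, W_V)[perm]
perm_then_out = self_attention(X[perm], W_Q, W_K, W_V)
# Without positional signals the two paths agree exactly.
```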
When you inspect trained models, attention heads often develop recognizable patterns: attending to the previous token, to repeated or matching tokens, to syntactically related words, or broadly to delimiters.
Not every head is interpretable, and attention weights are not the whole story (FFN and residual streams matter), but the routing intuition remains useful.
Given tokens → embeddings + positions:
X₀ = Embed(tokens) + PosEnc.
For ℓ = 1…N layers:
X_ℓ = Block_ℓ(X_{ℓ−1}), i.e., the attention sublayer followed by the FFN sublayer, each with residual and layer norm.
Final outputs X_N feed task heads (classification, MLM head, etc.).
For decoder-only, the same structure holds, but with causal masking and an output projection to vocabulary logits.
We will compute attention for a sequence of L = 2 tokens with d_k = d_v = 2. Use no mask. Let
Q = [[1, 0],
[0, 1]]
K = [[1, 0],
[1, 1]]
V = [[1, 2],
[3, 4]]
All matrices are in row-major form: each row corresponds to a token position.
Step 1: Compute raw score matrix S = QKᵀ.
Kᵀ = [[1, 1],
[0, 1]]
S = QKᵀ = [[1, 0],
[0, 1]] [[1, 1],
[0, 1]]
= [[1⋅1 + 0⋅0, 1⋅1 + 0⋅1],
[0⋅1 + 1⋅0, 0⋅1 + 1⋅1]]
= [[1, 1],
[0, 1]]
Step 2: Scale by √d_k. Here d_k = 2, so √d_k = √2.
Ŝ = S / √2 = [[1/√2, 1/√2],
[0/√2, 1/√2]]
= [[0.7071, 0.7071],
[0, 0.7071]] (approx)
Step 3: Apply softmax row-wise to get attention weights A.
Row 1: softmax([0.7071, 0.7071]) = [0.5, 0.5]
Row 2: softmax([0, 0.7071])
Compute exp values:
exp(0) = 1
exp(0.7071) ≈ 2.028
Sum ≈ 3.028
So row 2 ≈ [1/3.028, 2.028/3.028] ≈ [0.330, 0.670]
Thus
A ≈ [[0.5, 0.5],
[0.33, 0.67]]
Step 4: Compute output O = AV.
O₁ = 0.5⋅[1,2] + 0.5⋅[3,4] = [2,3]
O₂ = 0.33⋅[1,2] + 0.67⋅[3,4]
= [0.33 + 2.01, 0.66 + 2.68]
= [2.34, 3.34] (approx)
So
O ≈ [[2.00, 3.00],
[2.34, 3.34]]
Insight: Each output token becomes a convex combination of value vectors. Token 1 averaged both tokens equally; token 2 leaned more heavily on token 2 because q₂ aligned better with k₂ than k₁.
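The arithmetic above can be verified mechanically (NumPy, same Q, K, V):

```python
import numpy as np

Q = np.array([[1., 0.], [0., 1.]])
K = np.array([[1., 0.], [1., 1.]])
V = np.array([[1., 2.], [3., 4.]])

S = Q @ K.T                            # step 1: raw scores
S_hat = S / np.sqrt(2.0)               # step 2: scale by sqrt(d_k)
E = np.exp(S_hat)
A = E / E.sum(axis=1, keepdims=True)   # step 3: row-wise softmax
O = A @ V                              # step 4: mix values
```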
Let d_model = 8, number of heads h = 2. Then per-head dimension d_k = d_v = d_model / h = 4. Let sequence length L = 3. We will track tensor shapes through MHA self-attention and the output projection.
Step 1: Start with input X ∈ ℝ^{L×d_model} = ℝ^{3×8}.
Step 2: Project into per-head Q, K, V.
For each head r:
W_Q^{(r)} ∈ ℝ^{8×4}, W_K^{(r)} ∈ ℝ^{8×4}, W_V^{(r)} ∈ ℝ^{8×4}.
Thus:
Q^{(r)} = XW_Q^{(r)} ∈ ℝ^{3×4}
K^{(r)} = XW_K^{(r)} ∈ ℝ^{3×4}
V^{(r)} = XW_V^{(r)} ∈ ℝ^{3×4}
Step 3: Compute attention scores per head.
For head r:
S^{(r)} = Q^{(r)} (K^{(r)})ᵀ.
Shapes: (3×4)(4×3) = 3×3.
Scaling by √d_k keeps shape 3×3.
Softmax row-wise yields A^{(r)} ∈ ℝ^{3×3}.
Step 4: Mix values.
O^{(r)} = A^{(r)} V^{(r)}.
Shapes: (3×3)(3×4) = 3×4.
So each head returns 3×4.
Step 5: Concatenate heads.
O_concat = [O^{(1)} | O^{(2)}] ∈ ℝ^{3×8}.
Because concatenation along features gives 4 + 4 = 8.
Step 6: Output projection.
W_O ∈ ℝ^{8×8}.
Y = O_concat W_O ∈ ℝ^{3×8}.
So MHA maps ℝ^{L×d_model} → ℝ^{L×d_model}, enabling residual addition X + Y.
Insight: Most Transformer components are designed so inputs and outputs share the same shape (L×d_model). That single design choice makes deep stacking with residual connections straightforward.
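The shape walk-through above can be replayed as executable checks (NumPy, random weights; comments mark the corresponding steps):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_model, h = 3, 8, 2
d_k = d_model // h                             # 4 per head

X = rng.standard_normal((L, d_model))          # step 1: (3, 8)
heads = []
for r in range(h):
    W_Q = rng.standard_normal((d_model, d_k))  # step 2: (8, 4) projections
    W_K = rng.standard_normal((d_model, d_k))
    W_V = rng.standard_normal((d_model, d_k))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # each (3, 4)
    S = Q @ K.T / np.sqrt(d_k)                 # step 3: (3, 3) scaled scores
    E = np.exp(S - S.max(axis=1, keepdims=True))
    A = E / E.sum(axis=1, keepdims=True)       # (3, 3) attention weights
    heads.append(A @ V)                        # step 4: (3, 4) per head
O_concat = np.concatenate(heads, axis=1)       # step 5: (3, 8)
W_O = rng.standard_normal((d_model, d_model))
Y = O_concat @ W_O                             # step 6: (3, 8)
```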
We want a mask M for L = 4 such that position i can attend only to positions j ≤ i. We will express M as 0 for allowed and −∞ for disallowed entries, to be added to logits before softmax.
Step 1: Write the allowed pattern.
Row i shows which columns j are visible.
i=1: can see [1]
i=2: can see [1,2]
i=3: can see [1,2,3]
i=4: can see [1,2,3,4]
Step 2: Create the matrix with −∞ above the diagonal.
M =
[[ 0, −∞, −∞, −∞],
[ 0, 0, −∞, −∞],
[ 0, 0, 0, −∞],
[ 0, 0, 0, 0]]
Step 3: Use it in attention.
A = softmax( (QKᵀ)/√d_k + M )
Because exp(−∞) = 0, softmax assigns exactly zero probability to all forbidden future positions.
Step 4: Add padding masking if needed.
If token 4 were padding, you would also add −∞ to the entire column j = 4, so that no position attends to the padding token.
In practice, frameworks combine causal and padding masks by addition.
Insight: Masking is mathematically simple—just add −∞ before softmax—but conceptually essential: it encodes the information constraints that define the task (bidirectional understanding vs left-to-right generation).
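The mask construction can be sketched in NumPy; as is common in practice, a large negative constant stands in for −∞ to avoid NaN propagation:

```python
import numpy as np

NEG_INF = -1e9   # large negative stand-in for -inf

def causal_mask(L):
    # 0 on and below the diagonal, very negative strictly above (future positions).
    return np.triu(np.full((L, L), NEG_INF), k=1)

def padding_mask(is_pad):
    # Forbid attending to padding columns, for every query row.
    L = len(is_pad)
    return np.tile(np.where(is_pad, NEG_INF, 0.0), (L, 1))

L = 4
# Combine causal and padding masks by addition, with token 4 as padding.
M = causal_mask(L) + padding_mask(np.array([False, False, False, True]))
```

Adding `M` to the scaled logits before softmax zeroes out both future positions and the padding column.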
Self-attention mixes token information using learnable content-based weights: A = softmax(QKᵀ/√d_k + M), output O = AV.
Queries, keys, and values are linear projections of token states; they let the model learn what to match (Q/K) and what information to transmit (V).
The √d_k scaling keeps attention logits in a healthy range so softmax does not saturate as dimension grows.
Multi-head attention runs several attention mechanisms in parallel on different learned subspaces, concatenates their outputs, then applies an output projection back to d_model.
A Transformer layer is (attention → FFN), each wrapped with residual connections and layer normalization; stacking layers yields powerful hierarchical context building.
Decoder self-attention requires a causal mask so position i cannot attend to any future position j > i; padding masks prevent attending to padding tokens.
Encoder–decoder models add cross-attention where decoder queries attend to encoder keys/values to retrieve source information.
The main computational bottleneck of vanilla self-attention is O(L²) memory/time in sequence length.
Forgetting to add the mask before softmax (or applying it after softmax), which breaks causality or allows attending to padding.
Mixing up tensor shapes: QKᵀ should produce an L×L score matrix; dimension mismatches often come from incorrect transposes or head reshaping.
Omitting the √d_k scaling, causing attention logits to grow with dimension and making training unstable due to softmax saturation.
Assuming attention weights alone “explain” model behavior; the residual stream and FFN can carry and transform information even when attention looks diffuse.
Given L = 3 and d_k = 1, suppose Q = [[2],[0],[1]], K = [[1],[3],[−1]], V = [[10],[20],[30]]. Compute S = QKᵀ, then A = softmax(S) row-wise (no scaling needed because √d_k = 1), then O = AV.
Hint: Sᵢⱼ = qᵢ kⱼ. Compute each row’s softmax separately. Keep results as exact exponentials if you prefer: softmax([a,b,c]) = [eᵃ, eᵇ, eᶜ]/(eᵃ+eᵇ+eᶜ).
S = QKᵀ gives a 3×3 matrix.
Row 1 (q₁=2): [2⋅1, 2⋅3, 2⋅(−1)] = [2, 6, −2]
Row 2 (q₂=0): [0, 0, 0]
Row 3 (q₃=1): [1, 3, −1]
So S = [[2,6,−2],[0,0,0],[1,3,−1]].
A row-wise:
Row 1: softmax([2,6,−2]) = [e², e⁶, e^{−2}] / (e² + e⁶ + e^{−2}).
Row 2: softmax([0,0,0]) = [1/3,1/3,1/3].
Row 3: softmax([1,3,−1]) = [e¹, e³, e^{−1}] / (e¹ + e³ + e^{−1}).
O = AV where V = [10,20,30]ᵀ applied as weighted sum per row:
O₁ = 10A₁₁ + 20A₁₂ + 30A₁₃
O₂ = (10+20+30)/3 = 20
O₃ = 10A₃₁ + 20A₃₂ + 30A₃₃
You have a Transformer with d_model = 1024 and h = 16 heads. (a) What is d_k if you split evenly? (b) What are the shapes of W_Q, W_K, W_V per head? (c) What is the shape of the attention score matrix for a sequence length L = 128 in a single head?
Hint: Use d_k = d_model / h. Remember score matrix is QKᵀ with Q ∈ ℝ^{L×d_k} and K ∈ ℝ^{L×d_k}.
(a) d_k = 1024 / 16 = 64.
(b) Per head, W_Q^{(r)} ∈ ℝ^{1024×64}, W_K^{(r)} ∈ ℝ^{1024×64}, W_V^{(r)} ∈ ℝ^{1024×64} (assuming d_v = d_k).
(c) For L = 128, Q and K are 128×64, so QKᵀ is (128×64)(64×128) = 128×128.
Consider a decoder-only Transformer generating tokens left-to-right. Explain, using the attention formula, exactly where and how the causal mask changes the computation. Then describe what failure mode occurs if you omit the causal mask during training but keep it during inference.
Hint: Point to the logits matrix QKᵀ/√d_k and the additive mask M. Think about the model seeing future tokens during teacher forcing training.
Causal masking modifies attention logits before softmax:
A = softmax( QKᵀ/√d_k + M_causal ).
M_causal has entries Mᵢⱼ = −∞ for j > i and 0 otherwise, forcing Aᵢⱼ = 0 for all future positions.
If you omit M_causal during training with teacher forcing, the model can attend to future ground-truth tokens to predict the next token. It learns a “cheating” solution that relies on information that will not be available at inference.
At inference, when you reintroduce the causal mask, those information paths disappear, causing a sharp performance drop: perplexity increases and generation quality degrades because the model’s learned dependencies are misaligned with the constraints at test time.