Transformers

Machine Learning · Difficulty: █████ · Depth: 14 · Unlocks: 0

An attention-based sequence architecture built on multi-head attention and positional encoding.


Core Concepts

  • Scaled dot-product self-attention implemented with learnable linear projections (queries, keys, values) to compute contextual token-to-token weighting across a sequence
  • Multi-head attention: multiple parallel attention heads applied to different linear projections whose outputs are concatenated to model multiple representation subspaces
  • Transformer layer: a repeatable block that applies multi-head attention and a position-wise feed-forward network, each wrapped by residual connection plus layer normalization, and stacked to form encoder/decoder (decoder adds masked self-attention and encoder-decoder cross-attention)

Key Symbols & Notation

  • d_model (model hidden dimension)
  • h (number of attention heads)
  • d_k (per-head key/query dimension)

Essential Relationships

  • d_model = h * d_k (model dimension split across heads)
  • Multi-head attention = Concatenate(head_1,...,head_h) * W_O, where head_i = Attention(Q W_Qi, K W_Ki, V W_Vi)
  • Each sublayer uses residual + LayerNorm: output = LayerNorm(x + Sublayer(x)) (applied to attention and feed-forward)
All Concepts (15)

  • Queries, Keys, Values (Q, K, V) as specific linear projections of inputs used in attention computation
  • Scaled dot-product attention: computing attention scores via Q K^T, scaling by 1/sqrt(d_k), applying softmax, then weighting V
  • Multi-head attention: running multiple independent attention 'heads' in parallel, each with its own projections, then concatenating results
  • Head splitting and per-head dimensionality: splitting the model dimension into h heads with per-head dimensions d_k and d_v (typically d_k = d_v = d_model / h)
  • Final output projection after concatenating heads (linear layer W^O that maps concatenated head outputs back to d_model)
  • Position-wise feed-forward network (FFN): two-layer MLP applied independently to each sequence position (same parameters for all positions) with a nonlinearity (e.g., ReLU or GELU)
  • Transformer layer/block structure: attention sublayer followed by residual connection + normalization, then FFN sublayer followed by residual connection + normalization (the 'Add & Norm' pattern as used in Transformers)
  • Encoder stack: repeated identical encoder layers (self-attention + FFN) forming the encoder
  • Decoder stack: repeated identical decoder layers where each layer performs masked self-attention, encoder–decoder (cross) attention, then FFN in that order
  • Masked (causal) self-attention inside the decoder as an implementation detail: using masks so decoder queries cannot attend to future positions
  • Application of attention masks via additive masking to logits (e.g., adding large negative values to masked score positions before softmax)
  • Embedding scaling practice: scaling token embeddings by sqrt(d_model) (or similar) before adding positional encodings to stabilize magnitudes
  • Typical Transformer hyperparameters as design choices to learn (d_model, d_ff, h, number of layers N) and their role in model capacity
  • Motivation/benefit of multi-head attention: different heads can attend to different representation subspaces or patterns simultaneously
  • Shape and batching conventions implicit in Transformer computations (e.g., attention score matrix shape [seq_len_q, seq_len_k] per head and per batch)

Teaching Strategy

Multi-session curriculum: the material is complex and assumes substantial prior knowledge. Use mastery gates and deliberate practice.

Transformers are the first widely successful neural architecture where the main “engine” is not recurrence or convolution, but a learned, content-based routing system: attention. Once you understand the exact mechanics of scaled dot-product attention, multi-head attention, and the Transformer block (attention + feed-forward + residual + layer norm), most modern language and vision models become variations on a single theme.

TL;DR:

A Transformer processes a sequence by projecting token representations into queries, keys, and values, computing attention weights via softmax(QKᵀ/√d_k) (with masks as needed), mixing values with those weights, and then applying a position-wise MLP—each sublayer wrapped with residual connections and layer normalization. Multi-head attention repeats attention in parallel with different projections, concatenates head outputs, and mixes them back to d_model. Stacking these blocks yields encoders and decoders; decoders add causal masking and cross-attention to an encoder output.

What Is a Transformer?

Why Transformers exist (motivation)

Before Transformers, sequence modeling was dominated by RNNs/LSTMs/GRUs and CNN-based sequence models. Those families have two persistent pain points:

1) Long-range dependencies are hard. Even with gating, recurrent models struggle to move information across hundreds or thousands of steps.

2) Parallelism is limited. Recurrence is inherently sequential: to compute step t, you need step t−1. That slows training.

Transformers address both by making the core operation all-to-all token interaction in a single layer: every token can “look at” every other token (subject to masking). This interaction is differentiable, learnable, and highly parallelizable on GPUs/TPUs.

The Transformer idea in one sentence

A Transformer layer repeatedly does:

  • Mix information across tokens using self-attention (content-based weighted averaging).
  • Transform each token independently using a position-wise feed-forward network (an MLP applied to each position).

Crucially, both steps are wrapped in residual connections and layer normalization for stable optimization.

What a token representation looks like

Assume a sequence length L and model width d_model.

  • Input embeddings form a matrix X ∈ ℝ^{L×d_model}
  • Row i, written xᵢ, is the vector for token i.

Because attention has no inherent sense of order, we add positional information:

  • X₀ = E + P, where E are token embeddings and P are positional encodings (sinusoidal or learned).

The three canonical Transformer families

  • Encoder-only (e.g., BERT-style): understanding, classification, bidirectional context. Blocks: self-attn + FFN. Masking: padding mask only (no causal mask).
  • Decoder-only (e.g., GPT-style): autoregressive generation. Blocks: masked self-attn + FFN. Masking: causal + padding masks.
  • Encoder–decoder (original): seq2seq (translation, summarization). Blocks: encoder self-attn + FFN; decoder masked self-attn + cross-attn + FFN. Masking: decoder uses causal; cross-attn uses padding mask on encoder side.

This lesson focuses on the core mechanics that all of these share: scaled dot-product attention, multi-head attention, and the Transformer layer structure.

Core Mechanic 1: Scaled Dot-Product Self-Attention (Q, K, V)

Why attention is “routing”

Suppose token i needs information from token j (e.g., a pronoun needs its antecedent). Attention lets token i compute a weighted mixture of other tokens’ information.

The key design choice is: weights depend on content (learned similarity), not just distance.

Queries, keys, values (the roles)

Each token representation xᵢ is linearly projected into three vectors:

  • qᵢ (query): what token i is looking for
  • kᵢ (key): what token i offers / how it can be matched
  • vᵢ (value): the information token i will contribute if attended to

In matrix form for a whole sequence X ∈ ℝ^{L×d_model}:

  • Q = XW_Q
  • K = XW_K
  • V = XW_V

where W_Q, W_K, W_V are learned matrices.

Typically:

  • W_Q ∈ ℝ^{d_model×d_k}
  • W_K ∈ ℝ^{d_model×d_k}
  • W_V ∈ ℝ^{d_model×d_v}

In most standard implementations, d_v = d_k = d_model / h per head.

Attention scores and why we scale by √d_k

The raw similarity between token i and token j is the dot product:

  • score(i, j) = qᵢ · kⱼ

In matrix form, the score matrix is:

  • S = QKᵀ where S ∈ ℝ^{L×L}

If q and k components have roughly unit variance, then the dot product grows with dimension:

  • Var(qᵢ · kⱼ) ≈ d_k

Large magnitudes push softmax into saturation, giving tiny gradients. To keep logits in a reasonable range, we scale:

  • Ŝ = (QKᵀ) / √d_k

This is “scaled dot-product attention.”
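The scaling claim is easy to verify empirically. The NumPy sketch below (illustrative, not from the source) estimates Var(q · k) for unit-variance components and confirms it grows linearly with d_k, so dividing scores by √d_k restores roughly unit variance.

```python
import numpy as np

# Empirical check: with unit-variance components, Var(q . k) ~ d_k,
# which is why attention logits are scaled by 1/sqrt(d_k).
rng = np.random.default_rng(0)
n = 20000  # number of (q, k) samples per estimate

def dot_variance(d_k):
    q = rng.standard_normal((n, d_k))
    k = rng.standard_normal((n, d_k))
    return np.var(np.sum(q * k, axis=1))

v64 = dot_variance(64)             # approximately 64
v256 = dot_variance(256)           # approximately 256
v_scaled = dot_variance(256) / 256 # dividing scores by sqrt(d_k) divides variance by d_k
```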

Softmax to turn scores into weights

For each query position i, we take a softmax over j:

  • A = softmax(Ŝ)

So Aᵢⱼ ≥ 0 and ∑ⱼ Aᵢⱼ = 1.

Interpretation: row i of A is a probability distribution over which tokens i will attend to.

Masking (padding and causal)

Masks are incorporated by adding −∞ (or a very negative number) to disallowed positions before softmax.

Let M ∈ ℝ^{L×L} where:

  • Mᵢⱼ = 0 if attention from i to j is allowed
  • Mᵢⱼ = −∞ if disallowed

Then:

  • A = softmax( (QKᵀ)/√d_k + M )

Two common masks:

1) Padding mask: disallow attending to padding tokens.

2) Causal mask (decoder): disallow attending to future tokens, i.e., j > i.

Weighted sum of values

Finally, the output at each position i is a weighted sum of values:

  • oᵢ = ∑ⱼ Aᵢⱼ vⱼ

In matrix form:

  • O = AV

where O ∈ ℝ^{L×d_v}.

Putting it together (single-head self-attention)

The full formula:

  • Attention(X) = softmax( (XW_Q)(XW_K)ᵀ / √d_k + M ) (XW_V)
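This formula translates almost line for line into NumPy. A minimal single-head sketch, with illustrative names and random weights:

```python
import numpy as np

# Single-head scaled dot-product self-attention, following the formula above.
# The additive mask uses 0 for allowed positions and -inf for disallowed ones.
def attention(X, W_Q, W_K, W_V, mask=None):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)               # scaled scores, shape L x L
    if mask is not None:
        S = S + mask                         # additive masking before softmax
    S = S - S.max(axis=-1, keepdims=True)    # numerical stability
    A = np.exp(S) / np.exp(S).sum(axis=-1, keepdims=True)  # row-wise softmax
    return A @ V                             # weighted sum of values

rng = np.random.default_rng(0)
L, d_model, d_k = 5, 8, 4
X = rng.standard_normal((L, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))
O = attention(X, W_Q, W_K, W_V)              # shape L x d_k
```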

A geometric intuition

Dot products measure alignment. If qᵢ points in a similar direction to kⱼ, token i will attend more to token j. But because W_Q and W_K are learned, the model can invent similarity notions that match the task (syntax, coreference, topic, etc.).

Complexity note (why long context is expensive)

Self-attention builds an L×L matrix of scores. That’s:

  • Time: O(L² d_k)
  • Memory: O(L²)

This quadratic dependence is why long-context Transformers require approximations or architectural tricks—but the core mechanism remains the same.

Core Mechanic 2: Multi-Head Attention (MHA)

Why one attention “view” isn’t enough

A single attention map must decide one set of weights per token. But language often requires multiple simultaneous relationships:

  • One head might track coreference (pronouns → nouns).
  • Another might track syntactic dependencies (verbs → objects).
  • Another might focus on local context (previous few tokens).

Multi-head attention lets the model compute multiple attention distributions in parallel, each in its own learned subspace.

The shape story: d_model, h, d_k

Let:

  • d_model = model width
  • h = number of heads
  • d_k = per-head query/key width

Commonly:

  • d_k = d_model / h

Example: d_model = 768, h = 12 ⇒ d_k = 64.

Per-head projections

For head r ∈ {1,…,h}, we have separate projection matrices:

  • W_Q^{(r)} ∈ ℝ^{d_model×d_k}
  • W_K^{(r)} ∈ ℝ^{d_model×d_k}
  • W_V^{(r)} ∈ ℝ^{d_model×d_v}

Compute:

  • Q^{(r)} = XW_Q^{(r)}
  • K^{(r)} = XW_K^{(r)}
  • V^{(r)} = XW_V^{(r)}

Then each head output:

  • O^{(r)} = softmax( Q^{(r)} (K^{(r)})ᵀ / √d_k + M ) V^{(r)}

Concatenate and mix

Concatenate head outputs along the feature dimension:

  • O_concat = [O^{(1)} | O^{(2)} | … | O^{(h)}]

If each O^{(r)} ∈ ℝ^{L×d_v} and d_v = d_k, then O_concat ∈ ℝ^{L×(h d_k)} = ℝ^{L×d_model}.

Then apply a final learned output projection:

  • Y = O_concat W_O where W_O ∈ ℝ^{d_model×d_model}

This final mixing matters: it lets the model combine information across heads.
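Putting the per-head projections, concatenation, and output projection together; a NumPy sketch with illustrative names and random weights:

```python
import numpy as np

# Multi-head self-attention with d_v = d_k = d_model / h, as described above.
def softmax(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    L, d_model = X.shape
    d_k = d_model // h
    heads = []
    for r in range(h):
        Q, K, V = X @ W_Q[r], X @ W_K[r], X @ W_V[r]  # each L x d_k
        A = softmax(Q @ K.T / np.sqrt(d_k))           # L x L attention per head
        heads.append(A @ V)                           # L x d_k head output
    O_concat = np.concatenate(heads, axis=-1)         # L x (h * d_k) = L x d_model
    return O_concat @ W_O                             # final mixing across heads

rng = np.random.default_rng(1)
L, d_model, h = 3, 8, 2
d_k = d_model // h
W_Q, W_K, W_V = (rng.standard_normal((h, d_model, d_k)) for _ in range(3))
W_O = rng.standard_normal((d_model, d_model))
Y = multi_head_attention(rng.standard_normal((L, d_model)), W_Q, W_K, W_V, W_O, h)
```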

Self-attention vs cross-attention inside MHA

Multi-head attention is a pattern, and it can be used in different places:

  • Self-attention: Q, K, V come from the same X.
  • Cross-attention: Q comes from decoder states X_dec; K and V come from encoder states X_enc.

Cross-attention formula:

  • Attention(X_dec, X_enc) = softmax( (X_dec W_Q)(X_enc W_K)ᵀ / √d_k + M ) (X_enc W_V)

Here the decoder learns to retrieve information from the encoded source sequence.

Practical interpretation of heads

A helpful mental model:

  • Each head defines its own “matching function” (via W_Q and W_K)
  • and its own “payload space” (via W_V)

So attention isn’t only where to look; it’s also what to bring back.

A note on head dimension choice

Holding d_model fixed, increasing h decreases d_k. There is a trade-off:

  • More heads (higher h): more parallel subspaces, but smaller d_k per head (less capacity per head) and more overhead.
  • Fewer heads: more capacity per head, but fewer distinct attention patterns.

Empirically, standard settings (8–32 heads depending on width) work well, but variants exist (multi-query, grouped-query attention) to reduce memory/compute during decoding.

Core Mechanic 3: The Transformer Layer (Sublayers, Residuals, LayerNorm, FFN)

Why the Transformer is a stack of simple blocks

A single attention layer can mix tokens once, but deep language understanding requires multiple rounds of:

  • gathering context
  • transforming it
  • gathering again at a higher abstraction

So Transformers stack identical blocks. The stability of deep stacking relies on residual connections and layer normalization.

The canonical encoder layer

An encoder layer has two sublayers:

1) Multi-head self-attention (MHA)

2) Position-wise feed-forward network (FFN)

Each sublayer is wrapped by residual + layer norm. There are two common normalization conventions:

  • Post-LN (original 2017): X + Sublayer(X) → LN. Can be harder to optimize at great depth.
  • Pre-LN (common today): X → LN → Sublayer → X + …. Typically more stable for deep stacks.

We’ll write pre-LN, since it’s widely used.

Pre-LN encoder layer equations

Let X ∈ ℝ^{L×d_model} be the layer input.

(1) Attention sublayer

  • U = LN(X)
  • A = MHA(U, U, U; M_padding)
  • X′ = X + Dropout(A)

(2) Feed-forward sublayer

  • Z = LN(X′)
  • F = FFN(Z)
  • Y = X′ + Dropout(F)

Output is Y.
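The two sublayer equations above can be sketched end to end. This is a minimal NumPy sketch under simplifying assumptions: single-head attention (d_k = d_model), no dropout, and parameter-free LayerNorm (γ = 1, β = 0); all function names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization; gamma = 1, beta = 0 for simplicity.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mha_self(U, W_Q, W_K, W_V, W_O):
    # Single-head self-attention stand-in for the MHA sublayer.
    Q, K, V = U @ W_Q, U @ W_K, U @ W_V
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return (A @ V) @ W_O

def ffn(x, W1, b1, W2, b2):
    # Position-wise two-layer MLP with ReLU.
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

def encoder_layer(X, attn_w, ffn_w):
    X = X + mha_self(layer_norm(X), *attn_w)  # pre-LN attention sublayer
    return X + ffn(layer_norm(X), *ffn_w)     # pre-LN feed-forward sublayer

rng = np.random.default_rng(2)
L, d, d_ff = 4, 8, 32
attn_w = tuple(rng.standard_normal((d, d)) * 0.1 for _ in range(4))
ffn_w = (rng.standard_normal((d, d_ff)) * 0.1, np.zeros(d_ff),
         rng.standard_normal((d_ff, d)) * 0.1, np.zeros(d))
Y = encoder_layer(rng.standard_normal((L, d)), attn_w, ffn_w)  # same shape as input
```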

What “position-wise FFN” means

The FFN is the same MLP applied independently to each position.

Typical form:

  • FFN(x) = σ(x W₁ + b₁) W₂ + b₂ (row-vector convention, consistent with Q = XW_Q above)

Where:

  • W₁ ∈ ℝ^{d_model×d_ff}
  • W₂ ∈ ℝ^{d_ff×d_model}
  • σ is a nonlinearity (ReLU, GELU, SwiGLU variants)
  • d_ff is often 4× d_model (e.g., 3072 for 768)

Even though FFN doesn’t mix tokens, it adds substantial capacity: it can reshape and recombine features within each token vector.

The decoder layer

A decoder layer adds one more attention sublayer:

1) Masked multi-head self-attention (causal)

2) Cross-attention over encoder outputs (optional in encoder–decoder)

3) Feed-forward network

Pre-LN decoder (encoder–decoder) sketch:

  • X₁ = X + MHA(LN(X), LN(X), LN(X); M_causal)
  • X₂ = X₁ + MHA(LN(X₁), H_enc, H_enc; M_enc_padding) (keys and values come from the encoder output H_enc)
  • Y = X₂ + FFN(LN(X₂))

If it’s decoder-only (GPT-style), you omit the cross-attention term.

Why residual connections matter (conceptual)

Residuals let the model learn modifications rather than complete rewrites.

If a sublayer initially does something unhelpful, the residual path preserves the input:

  • X_out = X_in + ε

This makes gradients flow more directly through many layers. In deep Transformers (dozens to hundreds of layers), residual pathways are essential.

Why layer normalization is placed around sublayers

Layer norm stabilizes the scale of activations, making training less sensitive to initialization and learning rate.

Layer norm operates per token vector xᵢ:

  • μ = (1/d_model) ∑ₖ xᵢₖ
  • σ² = (1/d_model) ∑ₖ (xᵢₖ − μ)²
  • LN(xᵢ) = γ ⊙ (xᵢ − μ)/√(σ² + ϵ) + β

(You already know LN; here it matters because attention logits and FFN activations can drift in magnitude as depth increases.)
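A quick numeric check of the LN formulas above (with γ = 1, β = 0): every normalized token vector ends up with mean 0 and variance 1 regardless of its original scale.

```python
import numpy as np

# Per-token LayerNorm as in the formulas above, with gamma = 1, beta = 0.
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 0.0, -10.0, 4.0]])   # two token vectors at very different scales
mu = x.mean(axis=-1, keepdims=True)
var = x.var(axis=-1, keepdims=True)
ln = (x - mu) / np.sqrt(var + 1e-9)

row_means = ln.mean(axis=-1)   # each approximately 0
row_vars = ln.var(axis=-1)     # each approximately 1
```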

Where positional encoding enters

Positional information is usually added once at the bottom:

  • X₀ = Emb(tokens) + PosEnc(positions)

Each layer ℓ then maps X_{ℓ−1} to X_ℓ, producing the sequence X₀, X₁, …, X_N.

There are alternatives (relative position bias, rotary embeddings), but the principle remains: attention needs a way to distinguish positions.

Application/Connection: How Transformers Are Used (Encoder, Decoder, Training Objectives, and Practical Concerns)

Encoder-only Transformers (bidirectional)

Use case: classification, retrieval, tagging, masked language modeling.

Mechanics:

  • Full self-attention over the entire sequence (no causal mask).
  • Output is a sequence of contextual embeddings.

To classify, you might pool:

  • use a special [CLS] token representation y_cls
  • or mean-pool: y_pool = (1/L) ∑ᵢ yᵢ

Decoder-only Transformers (autoregressive)

Use case: text generation, code generation, next-token prediction.

Mechanics:

  • Causal mask enforces that token i can only attend to tokens ≤ i.

Training objective (next-token):

  • Given tokens t₁…t_L, model predicts t_{i+1} from context t₁…t_i.

If logits at position i are zᵢ ∈ ℝ^{|Vocab|}, then:

  • p(t_{i+1} | t_{≤i}) = softmax(zᵢ)

Loss is cross-entropy summed across positions.
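A minimal NumPy sketch of this objective; the logits and target tokens here are made-up illustrative values.

```python
import numpy as np

# Next-token cross-entropy: -sum_i log p(t_{i+1} | t_{<=i}), with p = softmax(z_i).
rng = np.random.default_rng(3)
L, vocab_size = 4, 10
logits = rng.standard_normal((L, vocab_size))   # z_i for each position i
targets = np.array([3, 1, 7, 2])                # t_{i+1} for each position i (illustrative)

# Numerically stable log-softmax per position.
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

loss = -log_probs[np.arange(L), targets].sum()  # cross-entropy summed across positions
```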

Encoder–decoder Transformers (seq2seq)

Use case: translation, summarization, speech-to-text.

Mechanics:

  • Encoder produces memory H_enc.
  • Decoder uses masked self-attention for partial output and cross-attention to read from H_enc.

Practical issues that shape real implementations

1) KV caching during decoding

Autoregressive decoding generates one token at a time. Recomputing attention over the whole prefix is expensive. Instead, store previous keys/values.

At time step t:

  • compute K_t, V_t for the new token
  • append to cache
  • attend with Q_t against all cached K_{≤t}

This reduces per-step cost from O(t²) to roughly O(t) for attention score computation (still linear in context length per step).
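The caching loop above can be sketched as follows; all names are illustrative, and the token states are random stand-ins for real decoder activations.

```python
import numpy as np

# KV caching: per step, compute K_t and V_t only for the new token, append them,
# and attend with q_t over the whole cache. Causality is automatic because the
# cache never contains future positions.
def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(4)
d_model, d_k = 8, 4
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))

K_cache, V_cache, outputs = [], [], []
for t in range(5):                         # 5 decoding steps
    x_t = rng.standard_normal(d_model)     # stand-in for the new token's state
    K_cache.append(x_t @ W_K)              # append new key
    V_cache.append(x_t @ W_V)              # append new value
    q_t = x_t @ W_Q
    scores = np.stack(K_cache) @ q_t / np.sqrt(d_k)  # t+1 scores, not t^2
    a = softmax(scores)
    outputs.append(a @ np.stack(V_cache))  # attention output for this step
```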

2) Attention masks are not optional

If you forget masking:

  • the model may attend to padding and learn spurious correlations
  • a decoder may “cheat” by attending to future tokens during training

Mechanically, masks must be added before softmax.

3) Initialization and normalization choices

Deep Transformers are sensitive to:

  • pre-LN vs post-LN
  • residual scaling
  • dropout placement

These details often decide whether training is stable.

4) Why positional encoding is central

Self-attention is permutation-equivariant without positional signals:

If you permute the input tokens, the rows and columns of QKᵀ permute correspondingly, so the outputs are permuted in exactly the same way. Positional encoding breaks this symmetry, enabling order-sensitive tasks.
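This symmetry can be verified numerically; the sketch below (illustrative names, random weights) checks that permuting the input rows permutes the self-attention output the same way.

```python
import numpy as np

# Permutation equivariance of self-attention without positional encodings:
# attention(X[perm]) == attention(X)[perm].
def softmax(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(5)
L, d = 4, 6
X = rng.standard_normal((L, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
perm = np.array([2, 0, 3, 1])

out_then_perm = self_attention(X, W_Q, W_K, W_V)[perm]
perm_then_out = self_attention(X[perm], W_Q, W_K, W_V)
equivariant = np.allclose(out_then_perm, perm_then_out)
```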

Connecting the mechanics to behavior

When you inspect trained models, attention heads often develop recognizable patterns:

  • local heads: attend strongly to nearby tokens
  • delimiter heads: attend to separators (commas, periods)
  • induction heads: copy patterns like “A … A”

Not every head is interpretable, and attention weights are not the whole story (FFN and residual streams matter), but the routing intuition remains useful.

Summary of the full forward pass (encoder-only)

Given tokens → embeddings + positions:

  • X₀ = Emb(t) + PosEnc(pos)

For ℓ = 1…N layers:

  • X_ℓ = TransformerEncoderLayer(X_{ℓ−1})

Final outputs X_N feed task heads (classification, MLM head, etc.).

For decoder-only, the same structure holds, but with causal masking and an output projection to vocabulary logits.

Worked Examples (3)

Worked Example 1: Compute single-head scaled dot-product attention by hand (tiny numbers)

We will compute attention for a sequence of L = 2 tokens with d_k = d_v = 2. Use no mask. Let

Q = [[1, 0],

[0, 1]]

K = [[1, 0],

[1, 1]]

V = [[1, 2],

[3, 4]]

All matrices are in row-major form: each row corresponds to a token position.

  1. Step 1: Compute raw score matrix S = QKᵀ.

    Kᵀ = [[1, 1],

    [0, 1]]

    S = QKᵀ = [[1, 0],

    [0, 1]] [[1, 1],

    [0, 1]]

    = [[1⋅1 + 0⋅0, 1⋅1 + 0⋅1],

    [0⋅1 + 1⋅0, 0⋅1 + 1⋅1]]

    = [[1, 1],

    [0, 1]]

  2. Step 2: Scale by √d_k. Here d_k = 2, so √d_k = √2.

    Ŝ = S / √2 = [[1/√2, 1/√2],

    [0/√2, 1/√2]]

    = [[0.7071, 0.7071],

    [0, 0.7071]] (approx)

  3. Step 3: Apply softmax row-wise to get attention weights A.

    Row 1: softmax([0.7071, 0.7071]) = [0.5, 0.5]

    Row 2: softmax([0, 0.7071])

    Compute exp values:

    exp(0) = 1

    exp(0.7071) ≈ 2.028

    Sum ≈ 3.028

    So row 2 ≈ [1/3.028, 2.028/3.028] ≈ [0.330, 0.670]

    Thus

    A ≈ [[0.5, 0.5],

    [0.33, 0.67]]

  4. Step 4: Compute output O = AV.

    O₁ = 0.5⋅[1,2] + 0.5⋅[3,4] = [2,3]

    O₂ = 0.33⋅[1,2] + 0.67⋅[3,4]

    = [0.33 + 2.01, 0.66 + 2.68]

    = [2.34, 3.34] (approx)

    So

    O ≈ [[2.00, 3.00],

    [2.34, 3.34]]

Insight: Each output token becomes a convex combination of value vectors. Token 1 averaged both tokens equally; token 2 leaned more heavily on token 2 because q₂ aligned better with k₂ than k₁.
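The hand computation above can be checked numerically; this NumPy snippet reproduces the same Q, K, V and confirms A and O to the stated precision.

```python
import numpy as np

# Reproduce Worked Example 1: L = 2 tokens, d_k = d_v = 2, no mask.
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [1.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])

S = Q @ K.T / np.sqrt(2)                       # scaled score matrix
e = np.exp(S - S.max(axis=-1, keepdims=True))
A = e / e.sum(axis=-1, keepdims=True)          # row-wise softmax
O = A @ V                                      # weighted sum of values
# A is approximately [[0.5, 0.5], [0.33, 0.67]]; O approximately [[2, 3], [2.34, 3.34]]
```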

Worked Example 2: Shapes and parameterization of multi-head attention (sanity-checking dimensions)

Let d_model = 8, number of heads h = 2. Then per-head dimension d_k = d_v = d_model / h = 4. Let sequence length L = 3. We will track tensor shapes through MHA self-attention and the output projection.

  1. Step 1: Start with input X ∈ ℝ^{L×d_model} = ℝ^{3×8}.

  2. Step 2: Project into per-head Q, K, V.

    For each head r:

    W_Q^{(r)} ∈ ℝ^{8×4}, W_K^{(r)} ∈ ℝ^{8×4}, W_V^{(r)} ∈ ℝ^{8×4}.

    Thus:

    Q^{(r)} = XW_Q^{(r)} ∈ ℝ^{3×4}

    K^{(r)} = XW_K^{(r)} ∈ ℝ^{3×4}

    V^{(r)} = XW_V^{(r)} ∈ ℝ^{3×4}

  3. Step 3: Compute attention scores per head.

    For head r:

    S^{(r)} = Q^{(r)} (K^{(r)})ᵀ.

    Shapes: (3×4)(4×3) = 3×3.

    Scaling by √d_k keeps shape 3×3.

    Softmax row-wise yields A^{(r)} ∈ ℝ^{3×3}.

  4. Step 4: Mix values.

    O^{(r)} = A^{(r)} V^{(r)}.

    Shapes: (3×3)(3×4) = 3×4.

    So each head returns 3×4.

  5. Step 5: Concatenate heads.

    O_concat = [O^{(1)} | O^{(2)}] ∈ ℝ^{3×8}.

    Because concatenation along features gives 4 + 4 = 8.

  6. Step 6: Output projection.

    W_O ∈ ℝ^{8×8}.

    Y = O_concat W_O ∈ ℝ^{3×8}.

    So MHA maps ℝ^{L×d_model} → ℝ^{L×d_model}, enabling residual addition X + Y.

Insight: Most Transformer components are designed so inputs and outputs share the same shape (L×d_model). That single design choice makes deep stacking with residual connections straightforward.
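The same shape walk can be done in code; the comments note the shapes from each step above.

```python
import numpy as np

# Worked Example 2 in code: d_model = 8, h = 2, L = 3, with random weights.
rng = np.random.default_rng(6)
L, d_model, h = 3, 8, 2
d_k = d_model // h                      # 4 per head
X = rng.standard_normal((L, d_model))   # 3 x 8

heads = []
for r in range(h):
    W_Q = rng.standard_normal((d_model, d_k))
    W_K = rng.standard_normal((d_model, d_k))
    W_V = rng.standard_normal((d_model, d_k))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # each 3 x 4
    S = Q @ K.T / np.sqrt(d_k)                       # 3 x 3
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)            # 3 x 3
    heads.append(A @ V)                              # 3 x 4 per head

O_concat = np.concatenate(heads, axis=-1)            # 3 x 8
W_O = rng.standard_normal((d_model, d_model))        # 8 x 8
Y = O_concat @ W_O                                   # 3 x 8, same shape as X
```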

Worked Example 3: Building a causal mask for a 4-token decoder self-attention

We want a mask M for L = 4 such that position i can attend only to positions j ≤ i. We will express M as 0 for allowed and −∞ for disallowed entries, to be added to logits before softmax.

  1. Step 1: Write the allowed pattern.

    Row i shows which columns j are visible.

    i=1: can see [1]

    i=2: can see [1,2]

    i=3: can see [1,2,3]

    i=4: can see [1,2,3,4]

  2. Step 2: Create the matrix with −∞ above the diagonal.

    M =

    [[ 0, −∞, −∞, −∞],

    [ 0, 0, −∞, −∞],

    [ 0, 0, 0, −∞],

    [ 0, 0, 0, 0]]

  3. Step 3: Use it in attention.

    A = softmax( (QKᵀ)/√d_k + M )

    Because exp(−∞) = 0, the softmax assigns exactly zero probability to all forbidden future positions.

  4. Step 4: Add padding masking if needed.

    If token 4 were padding, you would also set the entire column j = 4 to −∞ so that no query position can attend to it.

    In practice, frameworks combine causal and padding masks by addition.

Insight: Masking is mathematically simple—just add −∞ before softmax—but conceptually essential: it encodes the information constraints that define the task (bidirectional understanding vs left-to-right generation).
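The mask from this example can be built and sanity-checked in a few lines (NumPy sketch; names are illustrative):

```python
import numpy as np

# Build the 4x4 causal mask (-inf above the diagonal, 0 elsewhere) and check
# that masked positions receive exactly zero attention weight after softmax.
L = 4
M = np.where(np.triu(np.ones((L, L)), k=1) == 1, -np.inf, 0.0)

rng = np.random.default_rng(7)
S = rng.standard_normal((L, L)) + M            # logits with additive mask
e = np.exp(S - S.max(axis=-1, keepdims=True))
A = e / e.sum(axis=-1, keepdims=True)          # row-wise softmax

future_weights = A[np.triu_indices(L, k=1)]    # weights on j > i: all exactly 0
```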

Key Takeaways

  • Self-attention mixes token information using learnable content-based weights: A = softmax(QKᵀ/√d_k + M), output O = AV.

  • Queries, keys, and values are linear projections of token states; they let the model learn what to match (Q/K) and what information to transmit (V).

  • The √d_k scaling keeps attention logits in a healthy range so softmax does not saturate as dimension grows.

  • Multi-head attention runs several attention mechanisms in parallel on different learned subspaces, concatenates their outputs, then applies an output projection back to d_model.

  • A Transformer layer is (attention → FFN), each wrapped with residual connections and layer normalization; stacking layers yields powerful hierarchical context building.

  • Decoder self-attention requires a causal mask so position i cannot attend to any future position j > i; padding masks prevent attending to padding tokens.

  • Encoder–decoder models add cross-attention where decoder queries attend to encoder keys/values to retrieve source information.

  • The main computational bottleneck of vanilla self-attention is O(L²) memory/time in sequence length.

Common Mistakes

  • Forgetting to add the mask before softmax (or applying it after softmax), which breaks causality or allows attending to padding.

  • Mixing up tensor shapes: QKᵀ should produce an L×L score matrix; dimension mismatches often come from incorrect transposes or head reshaping.

  • Omitting the √d_k scaling, causing attention logits to grow with dimension and making training unstable due to softmax saturation.

  • Assuming attention weights alone “explain” model behavior; the residual stream and FFN can carry and transform information even when attention looks diffuse.

Practice

Easy

Given L = 3 and d_k = 1, suppose Q = [[2],[0],[1]], K = [[1],[3],[−1]], V = [[10],[20],[30]]. Compute S = QKᵀ, then A = softmax(S) row-wise (no scaling needed because √d_k = 1), then O = AV.

Hint: Sᵢⱼ = qᵢ kⱼ. Compute each row’s softmax separately. Keep results as exact exponentials if you prefer: softmax([a,b,c]) = [eᵃ, eᵇ, eᶜ]/(eᵃ+eᵇ+eᶜ).

Solution

S = QKᵀ gives a 3×3 matrix.

Row 1 (q₁=2): [2⋅1, 2⋅3, 2⋅(−1)] = [2, 6, −2]

Row 2 (q₂=0): [0, 0, 0]

Row 3 (q₃=1): [1, 3, −1]

So S = [[2,6,−2],[0,0,0],[1,3,−1]].

A row-wise:

Row 1: softmax([2,6,−2]) = [e², e⁶, e^{−2}] / (e² + e⁶ + e^{−2}).

Row 2: softmax([0,0,0]) = [1/3,1/3,1/3].

Row 3: softmax([1,3,−1]) = [e¹, e³, e^{−1}] / (e¹ + e³ + e^{−1}).

O = AV where V = [10,20,30]ᵀ applied as weighted sum per row:

O₁ = 10A₁₁ + 20A₁₂ + 30A₁₃

O₂ = (10+20+30)/3 = 20

O₃ = 10A₃₁ + 20A₃₂ + 30A₃₃

Medium

You have a Transformer with d_model = 1024 and h = 16 heads. (a) What is d_k if you split evenly? (b) What are the shapes of W_Q, W_K, W_V per head? (c) What is the shape of the attention score matrix for a sequence length L = 128 in a single head?

Hint: Use d_k = d_model / h. Remember score matrix is QKᵀ with Q ∈ ℝ^{L×d_k} and K ∈ ℝ^{L×d_k}.

Solution

(a) d_k = 1024 / 16 = 64.

(b) Per head, W_Q^{(r)} ∈ ℝ^{1024×64}, W_K^{(r)} ∈ ℝ^{1024×64}, W_V^{(r)} ∈ ℝ^{1024×64} (assuming d_v = d_k).

(c) For L = 128, Q and K are 128×64, so QKᵀ is (128×64)(64×128) = 128×128.

Hard

Consider a decoder-only Transformer generating tokens left-to-right. Explain, using the attention formula, exactly where and how the causal mask changes the computation. Then describe what failure mode occurs if you omit the causal mask during training but keep it during inference.

Hint: Point to the logits matrix QKᵀ/√d_k and the additive mask M. Think about the model seeing future tokens during teacher forcing training.

Solution

Causal masking modifies attention logits before softmax:

A = softmax( QKᵀ/√d_k + M_causal ).

M_causal has entries Mᵢⱼ = −∞ for j > i and 0 otherwise, forcing Aᵢⱼ = 0 for all future positions.

If you omit M_causal during training with teacher forcing, the model can attend to future ground-truth tokens to predict the next token. It learns a “cheating” solution that relies on information that will not be available at inference.

At inference, when you reintroduce the causal mask, those information paths disappear, causing a sharp performance drop: perplexity increases and generation quality degrades because the model’s learned dependencies are misaligned with the constraints at test time.
