An affine transformation applies a linear map (matrix multiply) followed by a bias shift; in neural models this corresponds to learned linear layers that project inputs into query/key/value spaces. Recognizing affine transforms helps understand how attention inputs are linearly combined and projected.
Self-serve tutorial - low prerequisites, straightforward concepts.
Every time a Transformer turns token vectors into queries, keys, and values, it’s doing the same fundamental operation: take an input vector, mix its components with a matrix, then shift the result with a bias. That simple “mix then shift” move—an affine transformation—is the workhorse behind linear layers.
An affine transformation maps x ↦ Wx + b. The matrix W performs a linear map (rotation/scale/shear/projection and component mixing), and the bias b translates (shifts) the output. In neural networks, this is a learned linear layer used to project embeddings into new spaces (like Q/K/V in attention).
In many systems you want a controllable way to transform a vector of features into a new vector of features. In machine learning, you repeatedly need to: (1) mix and reweight feature components into new features, and (2) shift the result by a constant offset.
A linear map does (1). A bias/translation does (2). Together they form an affine transformation.
An affine transformation from ℝⁿ to ℝᵐ is a function of the form:

f(x) = Wx + b

where:

- x ∈ ℝⁿ is the input vector,
- W ∈ ℝᵐˣⁿ is the weight matrix,
- b ∈ ℝᵐ is the bias vector, and
- f(x) ∈ ℝᵐ is the output.
In neural-network language, this is a linear layer (often called “fully connected”), even though mathematically it’s affine unless b = 0.
Think of x as a column of numbers (features). Multiplying by W creates weighted sums of those features—each output component is a mixture of all input components.
Then adding b shifts the result by a constant offset independent of x.
A linear map L(x) = Wx has a special property: L(0) = W·0 = 0.
But an affine map A(x) = Wx + b generally does not: A(0) = b, which is nonzero whenever b ≠ 0.
So the bias is exactly what lets the model output something nonzero even when the input is zero.
Affine transformations preserve straight lines and parallelism. They do not necessarily preserve angles or lengths.
A useful mental model: W reshapes the space around the origin (rotating, scaling, shearing, mixing coordinates), and b then slides the reshaped result to a new location.
In ML you’ll constantly track dimensions. Here’s the standard setup: x ∈ ℝⁿ, W ∈ ℝᵐˣⁿ, b ∈ ℝᵐ, and y = Wx + b ∈ ℝᵐ.
A compact dimension check: (m × n)·(n × 1) → (m × 1), and adding b ∈ ℝᵐ keeps the shape (m × 1).
This one rule prevents many mistakes later when you build attention projections.
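A minimal NumPy sketch of the dimension rule (the sizes here are made up for illustration):

```python
import numpy as np

# Hypothetical sizes: n = 3 input features, m = 2 output features.
n, m = 3, 2
rng = np.random.default_rng(0)

W = rng.standard_normal((m, n))  # W ∈ R^{m×n}
b = rng.standard_normal(m)       # b ∈ R^m
x = rng.standard_normal(n)       # x ∈ R^n

y = W @ x + b                    # affine map: mix with W, then shift by b

# The compact dimension check: (m×n)·(n,) -> (m,)
assert y.shape == (m,)
```

If the shapes were swapped (W ∈ ℝⁿˣᵐ), the matrix product would raise an error, which is exactly the mistake the rule guards against.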
If you strip away the bias, the matrix multiplication Wx is the mixing engine of a linear layer. It’s how models learn to combine features: emphasize some, suppress others, and create new features from old ones.
Let W have rows w₁ᵀ, w₂ᵀ, …, wₘᵀ (each wᵢ ∈ ℝⁿ). Then:

Wx = [w₁ᵀx; w₂ᵀx; …; wₘᵀx]
So each output component is a dot product between the input and a learned weight vector.
Write this explicitly: yᵢ = wᵢᵀx = Wᵢ₁x₁ + Wᵢ₂x₂ + … + Wᵢₙxₙ.
This is why people say a linear layer computes “weighted sums”: each yᵢ is a sum of input components xⱼ multiplied by weights.
If W has entries Wᵢⱼ, then:

yᵢ = Σⱼ Wᵢⱼxⱼ

This shows two important things: every output component can, in principle, depend on every input component, and the individual entry Wᵢⱼ measures how strongly input feature j contributes to output feature i.
Let W’s columns be c₁, …, cₙ (each cⱼ ∈ ℝᵐ). Then:

Wx = x₁c₁ + x₂c₂ + … + xₙcₙ
So the input scalars xⱼ decide how much of each column vector cⱼ is added.
This is a powerful geometric view: the output Wx always lies in the span of W’s columns, and the input coordinates simply choose the recipe for combining those columns.
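The row view and the column view compute the same thing; a short NumPy check (with an arbitrary example matrix) makes this concrete:

```python
import numpy as np

W = np.array([[1.0, 3.0],
              [-2.0, 0.0],
              [5.0, 1.0]])   # columns c1, c2 live in R^3
x = np.array([2.0, -1.0])

# Row view: each output is a dot product with a row of W.
row_view = np.array([W[i] @ x for i in range(W.shape[0])])

# Column view: Wx is a linear combination of W's columns, weighted by x.
col_view = x[0] * W[:, 0] + x[1] * W[:, 1]

assert np.allclose(W @ x, row_view)
assert np.allclose(W @ x, col_view)
```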
A diagonal matrix scales each coordinate independently: diag(d₁, …, dₙ) sends x to [d₁x₁; …; dₙxₙ], so no input feature influences any other output.
But a full matrix creates new features by mixing: each output yᵢ = Σⱼ Wᵢⱼxⱼ blends many input components into a single new feature.
In representation learning, this mixing is essential: the model can rotate into a coordinate system where some later operation (like attention scoring) becomes easier.
For L(x) = Wx:

L(αu + βv) = αL(u) + βL(v)

for all vectors u, v and scalars α, β. You can verify by algebra:
L(αu + βv)
= W(αu + βv)
= αWu + βWv
= αL(u) + βL(v)
This matters conceptually: the linear part preserves the “add and scale” structure of vectors.
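A quick numerical sanity check of this identity, and of how the affine version breaks it (the matrix and scalars are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((2, 3))
u, v = rng.standard_normal(3), rng.standard_normal(3)
alpha, beta = 2.5, -0.75

# Pure linear map: L(x) = Wx satisfies L(alpha*u + beta*v) = alpha*L(u) + beta*L(v).
lhs = W @ (alpha * u + beta * v)
rhs = alpha * (W @ u) + beta * (W @ v)
assert np.allclose(lhs, rhs)

# The affine map A(x) = Wx + b breaks the identity: the two sides
# differ by (1 - alpha - beta) * b, which is nonzero here.
b = np.ones(2)
affine = lambda x: W @ x + b
assert not np.allclose(affine(alpha * u + beta * v),
                       alpha * affine(u) + beta * affine(v))
```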
Affine/linear layers are often used to change feature dimension:
| Goal | n → m | Interpretation |
|---|---|---|
| Compression | large n → small m | projection / bottleneck |
| Expansion | small n → large m | lift into richer feature space |
| Same size | n → n | rotation/scale/shear/mixing |
In Transformers, projections often keep the model dimension d the same (d → d) but also create multiple heads (conceptually splitting into h subspaces). Even when the final dimension is the same, W is still doing a learned change of basis.
If you only have y = Wx, then the output is forced to be 0 when x = 0. That’s not always desirable.
In ML terms: without a bias, the model can only represent functions that pass through the origin. A bias lets the model set a baseline output.
An affine layer is:

f(x) = Wx + b

Evaluate at x = 0:

f(0) = W·0 + b = b
So b is the output the layer produces when given zero input.
The map x ↦ Wx transforms the space around the origin. Adding b then shifts every output by the same vector.
If two inputs differ by Δx:

y₁ = Wx₁ + b
y₂ = W(x₁ + Δx) + b

Subtract:

y₂ − y₁ = W(x₁ + Δx) − Wx₁ = WΔx

Notice b cancels. This reveals an important geometric fact: the bias never changes how two outputs relate to each other; it moves all outputs together.
So W controls how differences are transformed; b controls where the transformed cloud sits.
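A small NumPy sketch of the cancellation (all values are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((4, 4))
b = rng.standard_normal(4)

x1 = rng.standard_normal(4)
dx = rng.standard_normal(4)
x2 = x1 + dx

y1 = W @ x1 + b
y2 = W @ x2 + b

# The bias cancels in differences: y2 - y1 = W @ (x2 - x1).
assert np.allclose(y2 - y1, W @ dx)
```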
A useful trick is to rewrite the affine map as a pure matrix multiply by augmenting the input with a 1.
Create an extended vector and matrix:

x̄ = [ x ; 1 ] ∈ ℝⁿ⁺¹,  W̄ = [ W b ] ∈ ℝᵐˣ⁽ⁿ⁺¹⁾
Then:
W̄x̄
= [ W b ] [ x ; 1 ]
= Wx + b·1
= Wx + b
Why this is conceptually helpful:

- The bias becomes just another column of weights, acting on a constant feature that is always 1.
- Compositions of affine maps can be studied as plain matrix products.
- The map is linear in its parameters, which simplifies reasoning about gradients with respect to W and b.
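The augmentation trick is easy to verify in NumPy (sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((3, 2))
b = rng.standard_normal(3)
x = rng.standard_normal(2)

# Augment: append a constant 1 to x, and append b as an extra column of W.
x_bar = np.append(x, 1.0)           # [x; 1] ∈ R^3
W_bar = np.hstack([W, b[:, None]])  # [W b] ∈ R^{3×3}

# One pure matrix multiply reproduces the affine map.
assert np.allclose(W_bar @ x_bar, W @ x + b)
```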
Even before deep learning, linear models use biases.
A linear classifier might compute:

s(x) = wᵀx + b

The set where s(x) = 0 is a hyperplane:

{ x ∈ ℝⁿ : wᵀx + b = 0 }
If b = 0, the hyperplane must pass through the origin. With b ≠ 0, it can shift, greatly increasing what you can represent.
Many Transformer implementations include bias terms in linear projections, though some variants remove them for efficiency or symmetry (and compensate elsewhere). Conceptually, knowing that b exists helps you interpret a projection as:
Even if a specific architecture sets b = 0, the affine framework is still the general concept.
Attention needs vectors in roles that are not identical: queries (what a token is looking for), keys (what a token advertises for matching), and values (what content a token passes along if selected).
Even if all tokens start as embeddings in the same space ℝᵈ (model dimension d), the model benefits from learning different projections for these different roles.
Given a token representation x ∈ ℝᵈ, attention uses learned affine maps:

q = W_Qx + b_Q
k = W_Kx + b_K
v = W_Vx + b_V

where W_Q, W_K, W_V ∈ ℝᵈˣᵈ (often) and the biases b_Q, b_K, b_V are in ℝᵈ.
With sequences, you apply this to every position. If X ∈ ℝˡˣᵈ is a matrix whose rows are token vectors, then:

Q = XW_Qᵀ + 1b_Qᵀ  (and similarly for K and V)

Here 1 ∈ ℝˡ is a vector of ones, so 1b_Qᵀ adds the same bias row to every token. The important idea is: the same affine transform is applied independently to each token vector.
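A NumPy sketch of the batched projection (sequence length and model dimension are small placeholder values); broadcasting plays the role of 1b_Qᵀ:

```python
import numpy as np

rng = np.random.default_rng(4)
l, d = 5, 8                       # sequence length, model dimension

X = rng.standard_normal((l, d))   # rows are token vectors
W_Q = rng.standard_normal((d, d))
b_Q = rng.standard_normal(d)

# Row convention: each token row x becomes q = W_Q x + b_Q.
Q = X @ W_Q.T + b_Q               # broadcasting adds b_Q to every row

# Same result, computed token by token:
Q_loop = np.stack([W_Q @ X[i] + b_Q for i in range(l)])
assert np.allclose(Q, Q_loop)
```

The loop makes the "applied independently to each token" claim literal; the vectorized form is just the efficient way to write it.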
In multi-head attention with h heads, each head often uses a smaller per-head dimension d_head where d = h·d_head.
One way to view this: use one combined projection W_Q ∈ ℝᵈˣᵈ, then reshape the d-dimensional output into h blocks of size d_head, one per head.
Another view (equivalent conceptually): give each head its own smaller projection mapping ℝᵈ → ℝ^(d_head), and apply the h projections in parallel.
Either way, the key point is that attention relies on learned affine maps to create multiple learned “views” of the same input.
Suppose you compare two tokens x₁ and x₂. Their query difference is:

q₂ − q₁ = (W_Qx₂ + b_Q) − (W_Qx₁ + b_Q) = W_Q(x₂ − x₁)

So the bias does not change relative geometry, but it does change the absolute location. In dot-product attention, absolute location can matter because dot products are not translation-invariant:

(q + b)ᵀk = qᵀk + bᵀk ≠ qᵀk in general.
This is one reason biases can subtly affect attention score distributions.
Embeddings give you dense vectors e(token) ∈ ℝᵈ. On their own they are just coordinates. Affine layers are how the model re-mixes those coordinates into task-relevant features, changes dimension when needed, and creates role-specific views of the same token (such as queries, keys, and values).
In practice, a Transformer block is largely a sequence of affine maps plus nonlinearities and normalization. Recognizing “Wx + b” everywhere helps you read architectures without getting lost.
Masking affects which attention scores are allowed, but the scores themselves come from dot products of affine-projected vectors:

score(i, j) = qᵢᵀkⱼ,  where qᵢ = W_Qxᵢ + b_Q and kⱼ = W_Kxⱼ + b_K

So masking is applied after affine projections have created Q and K. Understanding affine projections helps you see that masking doesn’t change how Q/K/V are computed; it changes which pairings (i,j) are considered.
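A compact sketch of that ordering, using a causal mask as an example (biases are omitted here for brevity, and all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
l, d = 4, 6
X = rng.standard_normal((l, d))
W_Q, W_K = rng.standard_normal((d, d)), rng.standard_normal((d, d))

# Step 1: affine projections create Q and K.
Q, K = X @ W_Q.T, X @ W_K.T

# Step 2: scores are dot products of the projected vectors.
scores = Q @ K.T                          # scores[i, j] = q_i · k_j

# Step 3: masking only decides which (i, j) pairings survive.
causal = np.tril(np.ones((l, l), dtype=bool))
masked = np.where(causal, scores, -np.inf)

assert np.isinf(masked[0, 1])             # future pairing blocked
assert masked[1, 0] == scores[1, 0]       # allowed pairing untouched
```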
| Component | Typical form | Purpose |
|---|---|---|
| Q projection | q = W_Qx + b_Q | prepare “search vectors” |
| K projection | k = W_Kx + b_K | prepare “address vectors” |
| V projection | v = W_Vx + b_V | prepare “content vectors” |
| Output projection | o = W_Oz + b_O | mix heads back together |
| Feed-forward layer 1 | h = W₁x + b₁ | expand dimension |
| Feed-forward layer 2 | y = W₂φ(h) + b₂ | compress back |
Once you can fluently interpret each row of W as “a learned feature detector” and b as “a learned baseline,” the architecture becomes much more transparent.
Let x ∈ ℝ² be x = [2; −1]. Let W ∈ ℝ²ˣ² and b ∈ ℝ² be:
W = [[1, 3],
[−2, 0]]
b = [4; 1]
Compute y = Wx + b, and interpret each output coordinate as a weighted sum plus bias.
Start with y = Wx + b.
Compute Wx using row-by-row dot products.
First row of W is w₁ᵀ = [1, 3].
So (Wx)₁ = [1, 3] · [2; −1]
= 1·2 + 3·(−1)
= 2 − 3
= −1.
Second row of W is w₂ᵀ = [−2, 0].
So (Wx)₂ = [−2, 0] · [2; −1]
= (−2)·2 + 0·(−1)
= −4 + 0
= −4.
So Wx = [−1; −4].
Add the bias b:
y = Wx + b
= [−1; −4] + [4; 1]
= [3; −3].
Insight: Each output is a learned weighted sum of inputs plus a learned offset. Here y₁ = 1·x₁ + 3·x₂ + 4 and y₂ = (−2)·x₁ + 0·x₂ + 1. The matrix mixes features; the bias shifts the baseline.
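The same computation, checked in NumPy:

```python
import numpy as np

W = np.array([[1.0, 3.0],
              [-2.0, 0.0]])
b = np.array([4.0, 1.0])
x = np.array([2.0, -1.0])

y = W @ x + b
assert np.allclose(y, [3.0, -3.0])

# Each output is a row dot product plus its bias entry.
assert y[0] == 1 * 2 + 3 * (-1) + 4      # y1 = 3
assert y[1] == -2 * 2 + 0 * (-1) + 1     # y2 = -3
```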
Consider an affine projection used for queries: q = Wx + b. Take two inputs x₁ and x₂. (1) Derive q₂ − q₁. (2) Show how a shared bias can still affect a dot-product score qᵀk when both sides have biases.
Write the two projected queries:
q₁ = Wx₁ + b
q₂ = Wx₂ + b
Subtract:
q₂ − q₁
= (Wx₂ + b) − (Wx₁ + b)
= Wx₂ + b − Wx₁ − b
= W(x₂ − x₁).
So the bias b does not affect differences between projected vectors; it only shifts them together.
Now consider keys also have a bias: k = Ux + c.
A dot-product score between a query and a key is:
score = qᵀk
= (Wx + b)ᵀ (Ux' + c).
Expand the dot product carefully:
(Wx + b)ᵀ (Ux' + c)
= (Wx)ᵀ(Ux') + (Wx)ᵀc + bᵀ(Ux') + bᵀc.
Even though biases cancel in differences, they introduce extra terms in absolute dot products: (Wx)ᵀc, bᵀ(Ux'), and bᵀc.
Insight: Bias doesn’t change relative geometry (differences), but attention scoring depends on absolute dot products, so biases can shift score distributions via additional cross-terms. This is one reason architectural choices about bias can matter in practice.
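The four-term expansion is easy to confirm numerically (dimensions and values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 5
W, U = rng.standard_normal((d, d)), rng.standard_normal((d, d))
b, c = rng.standard_normal(d), rng.standard_normal(d)
x, xp = rng.standard_normal(d), rng.standard_normal(d)

q = W @ x + b          # query with bias
k = U @ xp + c         # key with bias

# The full score equals the four-term expansion from the derivation.
score = q @ k
expanded = (W @ x) @ (U @ xp) + (W @ x) @ c + b @ (U @ xp) + b @ c
assert np.allclose(score, expanded)

# The last three terms are the bias cross-terms; in general they are
# nonzero, so the biased score differs from the bias-free one.
cross_terms = score - (W @ x) @ (U @ xp)
```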
Let W ∈ ℝ³ˣ² and b ∈ ℝ³ define y = Wx + b. Construct an augmented matrix W̄ and augmented vector x̄ so that y = W̄x̄ with no explicit + b.
Start with y = Wx + b, where x ∈ ℝ² and y ∈ ℝ³.
Augment the input by appending 1:
x̄ = [ x ; 1 ] ∈ ℝ³.
Create the augmented matrix by appending b as an extra column:
W̄ = [ W b ] ∈ ℝ³ˣ³.
Multiply:
W̄x̄
= [ W b ] [ x ; 1 ]
= Wx + b·1
= Wx + b
= y.
Insight: Bias can be treated as weights on a constant feature. This is handy for reasoning and for deriving gradients: affine maps are linear in their parameters.
An affine transformation is f(x) = Wx + b; it’s the mathematical form of a neural-network “linear layer.”
W ∈ ℝᵐˣⁿ mixes and transforms features; each output coordinate is a dot product with a row of W: yᵢ = wᵢᵀx.
The bias b ∈ ℝᵐ translates outputs and sets f(0) = b; without it, the mapping must pass through the origin.
Differences cancel the bias: (f(x₂) − f(x₁)) = W(x₂ − x₁). So W controls relative geometry; b controls absolute placement.
Affine maps preserve straight lines and parallelism (they’re linear maps plus translation).
You can encode the bias by augmenting the input with a 1: Wx + b = W̄[x; 1].
In Transformers, Q/K/V projections are affine maps applied to token embeddings: q = W_Qx + b_Q, etc.
Calling Wx + b “linear” when you mean the strict math sense: the map is affine unless b = 0.
Getting dimensions wrong (e.g., using W ∈ ℝⁿˣᵐ when x ∈ ℝⁿ and expecting an ℝᵐ output). Always check W ∈ ℝᵐˣⁿ.
Forgetting that each output is a dot product with a row of W, leading to confusion about how features are mixed.
Assuming the bias never matters in attention because it cancels in differences; dot-product scores depend on absolute vectors and can be affected by biases via cross-terms.
Let x = [1; 2; −1] ∈ ℝ³, W = [[2, 0, 1], [−1, 3, 2]] ∈ ℝ²ˣ³, and b = [0; 5] ∈ ℝ². Compute y = Wx + b.
Hint: Compute Wx by row dot products, then add b.
Wx:
First row: [2,0,1]·[1;2;−1] = 2·1 + 0·2 + 1·(−1) = 2 − 1 = 1
Second row: [−1,3,2]·[1;2;−1] = (−1)·1 + 3·2 + 2·(−1) = −1 + 6 − 2 = 3
So Wx = [1; 3].
Add b: y = [1;3] + [0;5] = [1;8].
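The same exercise, verified in NumPy:

```python
import numpy as np

x = np.array([1.0, 2.0, -1.0])
W = np.array([[2.0, 0.0, 1.0],
              [-1.0, 3.0, 2.0]])
b = np.array([0.0, 5.0])

# Row dot products give Wx = [1; 3]; adding b gives y = [1; 8].
y = W @ x + b
assert np.allclose(y, [1.0, 8.0])
```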
Suppose f(x) = Wx + b with W ∈ ℝᵐˣⁿ. Prove that for any u, v ∈ ℝⁿ and scalar α, the following holds: f(αu + (1−α)v) = αf(u) + (1−α)f(v).
Hint: Expand both sides using distributivity of matrix multiplication; watch how the bias terms combine.
Left side:
f(αu + (1−α)v) = W(αu + (1−α)v) + b
= αWu + (1−α)Wv + b.
Right side:
αf(u) + (1−α)f(v)
= α(Wu + b) + (1−α)(Wv + b)
= αWu + αb + (1−α)Wv + (1−α)b
= αWu + (1−α)Wv + (α + 1−α)b
= αWu + (1−α)Wv + b.
Both sides match, so the identity holds. (This is a defining “affine” property: it preserves convex combinations.)
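A numeric spot check of the convex-combination property (random matrix and vectors, one arbitrary α):

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
f = lambda x: W @ x + b

u, v = rng.standard_normal(4), rng.standard_normal(4)
alpha = 0.3

# Affine maps preserve combinations whose weights sum to 1:
# the bias contributes (alpha + (1 - alpha)) * b = b on both sides.
lhs = f(alpha * u + (1 - alpha) * v)
rhs = alpha * f(u) + (1 - alpha) * f(v)
assert np.allclose(lhs, rhs)
```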
You have a Transformer with model dimension d = 512 and number of heads h = 8. If per-head dimension is d_head = 64, what are the typical shapes of W_Q, W_K, W_V for the combined projection (single matrix per type), and what is the shape of the per-token bias b_Q?
Hint: Combined projections usually map ℝᵈ → ℝᵈ, then reshape into (h, d_head).
Since d = h·d_head = 8·64 = 512, a common design is:
W_Q ∈ ℝᵈˣᵈ = ℝ⁵¹²ˣ⁵¹² (and similarly W_K, W_V).
The bias b_Q is added to each token’s projected query vector, so b_Q ∈ ℝᵈ = ℝ⁵¹².
After computing q = W_Qx + b_Q, the result in ℝ⁵¹² is reshaped/split into 8 heads of size 64.
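A shape-only sketch of the combined projection and head split described above (weights are random placeholders):

```python
import numpy as np

d, h = 512, 8
d_head = d // h                      # 64

rng = np.random.default_rng(8)
W_Q = rng.standard_normal((d, d))    # combined projection, R^{d×d}
b_Q = rng.standard_normal(d)         # one bias per output coordinate
x = rng.standard_normal(d)           # a single token vector

q = W_Q @ x + b_Q                    # q ∈ R^512
q_heads = q.reshape(h, d_head)       # split into 8 heads of size 64

assert q.shape == (512,)
assert q_heads.shape == (8, 64)
```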