Embeddings are learned low-dimensional continuous vectors that represent discrete items (e.g., tokens) so that semantic relationships are encoded in geometry. Attention operates over these vector representations, so familiarity with embeddings and their properties is important.
Deep-dive lesson: an accessible entry point, but dense material. Use worked examples and spaced repetition.
Modern neural networks can’t “look up” meaning directly from discrete IDs like token 1729 or user 4,381,002. They first need a continuous, learnable space where distances and angles can express relationships. Embeddings create that space—and attention operates almost entirely inside it.
An embedding is a learned function that maps a discrete item x to a dense vector vₓ ∈ ℝᵈ. The embedding table is a parameter matrix E ∈ ℝ^{|V|×d}; selecting an embedding is equivalent to multiplying a one-hot vector by E. Geometry in embedding space (dot products, cosine similarity, norms, and offsets) can encode semantic and relational structure, and these vectors are trained jointly with the rest of the model. Understanding embeddings is essential because attention compares tokens via inner products of learned projections of these embeddings.
Neural networks are built out of operations like matrix multiplication, addition, and nonlinearities. Those operations expect numbers. But many ML inputs are discrete symbols: word or subword tokens, user IDs, item IDs, and other categorical values.
A discrete ID by itself has no meaningful arithmetic: token 5 is not “closer” to token 6 than to token 5000. If we fed the raw integer ID into a model, we’d be smuggling in a fake ordering.
A classic alternative is a one-hot vector eₓ ∈ {0,1}^{|V|}, where |V| is the vocabulary size. This avoids the fake ordering, but it creates a different problem: every pair of distinct items is orthogonal and equally far apart, and the vectors have dimension |V|, which is huge and carries no similarity structure.
We want a representation where similar items end up close together, the dimension d is much smaller than |V|, and the vectors themselves are learnable from data.
That’s exactly what embeddings provide.
An embedding is a learned mapping
x ↦ vₓ, where vₓ ∈ ℝᵈ
We’ll use the key symbol from this node: vₓ, the embedding vector of item x.
In practice, we store embeddings in a matrix (often called an embedding table):
E ∈ ℝ^{|V|×d}
Row x of E (written E[x, :]) is the embedding for item x:
vₓ = E[x, :]
This is why the node description says “embeddings as model parameters”: the rows of E are trainable parameters just like weights in a linear layer.
Even though implementations often do a fast “lookup”, it is mathematically equivalent to multiplying by a matrix.
Let eₓ be the one-hot vector for item x (length |V|). Then:
vₓ = eₓᵀ E
Check the shapes: eₓᵀ is 1×|V| and E is |V|×d, so the product eₓᵀE is 1×d, a single embedding vector.
So an embedding layer is essentially a linear layer with a very special input (one-hot), where the computation reduces to selecting a row.
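This equivalence is easy to check directly. Below is a minimal NumPy sketch with a made-up 5×3 table and 0-indexed ids (the numbers are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                       # toy vocabulary size and embedding dimension
E = rng.normal(0.0, 0.1, (V, d))  # embedding table: one row per item

x = 3                             # an item id (0-indexed here)
v_lookup = E[x]                   # fast path: select row x

e_x = np.zeros(V)                 # one-hot vector for item x
e_x[x] = 1.0
v_matmul = e_x @ E                # linear-layer view: one-hot times the table

assert np.allclose(v_lookup, v_matmul)
```

The lookup and the matrix product agree exactly; implementations use the lookup because it avoids materializing the one-hot vector.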
“Dense” means most coordinates are nonzero, and information is distributed across coordinates. This matters because a few hundred dense dimensions can encode rich similarity structure that sparse one-hot vectors cannot, and every coordinate participates in gradient-based learning.
Embedding vectors are not “features you interpret one-by-one” in general. They’re coordinates in a learned space.
Think of each item x as a point in d-dimensional space. The model learns where to place those points so that tasks become easy: similar items end up near each other; items that interact strongly get high dot products; offsets can represent relations.
This geometric picture becomes especially important once you learn attention: attention weights are computed from dot products between learned projections of embeddings. So if you can reason about dot products and angles, you can reason about why attention focuses where it does.
Embeddings are “just parameters,” but they behave differently from typical weight matrices: only the rows that appear in a batch receive gradients, and how well each row is trained depends on how often its item occurs in the data.
Understanding how embeddings are represented and updated helps you debug training, reason about capacity, and understand common modeling choices.
Let the vocabulary be V with |V| items. The embedding table is
E ∈ ℝ^{|V|×d}
Given an input sequence of token ids (x₁, x₂, …, xₙ), the embedding layer outputs a matrix of vectors:
X = [ v_{x₁} ; v_{x₂} ; … ; v_{xₙ} ] ∈ ℝ^{n×d}, stacking one embedding per row.
This X is what subsequent layers (like attention) operate on.
Write the one-hot matrix for the whole sequence as:
H ∈ {0,1}^{n×|V|}
Each row i is e_{xᵢ}ᵀ. Then:
X = H E
This view makes it clear that embeddings are linear in E.
Suppose your loss is L. You compute ∂L/∂X from later layers, then backpropagate to E.
From X = H E, using matrix calculus:
∂L/∂E = Hᵀ (∂L/∂X)
What does this mean intuitively? Row x of ∂L/∂E accumulates the gradients from every position where token x appears in the batch. If token x never appears, its row E[x, :] never updates.
Because updates are sparse, frequent tokens get many gradient updates while rare tokens get few, so rare tokens tend to end up with noisy, poorly trained embeddings.
This is one reason why tokenization (e.g., subword units) helps: it increases coverage so that even rare words share pieces that appear often.
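The sparse-update rule ∂L/∂E = Hᵀ(∂L/∂X) can be verified numerically. Here is a small NumPy sketch; the token ids and upstream gradients are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 6, 4
ids = np.array([2, 2, 5])          # batch of token ids; id 2 repeats
n = len(ids)
G = rng.normal(size=(n, d))        # upstream gradient dL/dX from later layers

H = np.zeros((n, V))               # one-hot matrix: row i is e_{x_i}^T
H[np.arange(n), ids] = 1.0

dE = H.T @ G                       # dL/dE = H^T (dL/dX)

# Row 2 sums the gradients from both positions where id 2 appears,
# row 5 gets a single gradient, and every other row is exactly zero.
assert np.allclose(dE[2], G[0] + G[1])
assert np.allclose(dE[5], G[2])
assert np.allclose(np.delete(dE, [2, 5], axis=0), 0.0)
```

Deep learning frameworks implement this with a scatter-add rather than a dense matmul, but the result is the same.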
Embeddings are typically initialized with small random values, e.g.
E[x, j] ~ 𝒩(0, σ²)
Why small? Large initial embeddings inflate downstream activations and dot products, which can saturate softmaxes and destabilize early training; small random values keep initial scales controlled while still breaking symmetry.
Transformers also use scaling (like dividing by √d in attention) partly to control magnitude. But stable embedding scales still matter.
Because embedding rows are free to drift, it’s common to apply techniques such as weight decay on E, layer normalization after the embedding lookup, or max-norm constraints on individual rows.
Why? If some vectors become huge, dot products become dominated by norms rather than directions.
Embedding tables can dominate parameter count.
Parameter count = |V|×d
Example: |V| = 50,000 tokens with d = 1,024.
Then parameters = 50,000×1,024 = 51,200,000 ≈ 51M parameters.
This affects memory footprint, optimizer state size (optimizers often store extra per-parameter statistics), and how much of the model’s capacity is spent on the vocabulary.
In language models, you often have two vocabulary-sized matrices: an input embedding table E_in and an output projection W_out that produces logits over the vocabulary.
A common trick is weight tying:
W_out = E_in, i.e., the output projection reuses the embedding table, so the logit for token y is a dot product between the hidden state and vᵧ.
This reduces parameters and often improves generalization because the model uses a shared notion of token similarity for both reading and writing.
An embedding layer is a learned matrix where each row is a vector vₓ. Forward pass is indexing; backward pass updates only rows used in the batch. Scale, regularization, and data frequency strongly affect how good those learned vectors become.
Once tokens become vectors, the model can compare them using geometry. This is not a metaphor: attention scores are built from inner products of projected embeddings.
So you want to be fluent with the main geometric tools: dot products, norms, cosine similarity, and Euclidean distance.
For a, b ∈ ℝᵈ:
a·b = ∑ⱼ aⱼ bⱼ
It can be decomposed as:
a·b = ‖a‖ ‖b‖ cos(θ)
where θ is the angle between vectors.
So a large dot product can happen because the vectors point in similar directions (small angle), because one or both vectors have large norms, or both at once.
This matters because many models treat dot products as “similarity.” But dot products mix direction and scale.
Cosine similarity removes magnitude:
cos_sim(a, b) = (a·b) / (‖a‖ ‖b‖)
This lives in [-1, 1] (for nonzero vectors) and measures angular closeness.
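A quick NumPy sketch contrasting the two measures; the vectors are chosen so they share a direction but differ in norm:

```python
import numpy as np

def cos_sim(a, b):
    # angular closeness in [-1, 1]; insensitive to vector magnitudes
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])     # same direction as a, twice the norm

print(a @ b)                 # dot product: grows with the norms
print(cos_sim(a, b))         # cosine: equal to 1 up to rounding
print(cos_sim(a, 10 * b))    # unchanged by positive rescaling of b
```

Rescaling b changes the dot product proportionally but leaves the cosine similarity untouched.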
In practice: cosine similarity is the default for retrieval and nearest-neighbor search, while raw (scaled) dot products are used inside models such as attention, where the network can learn to control norms.
A useful identity connects distance and dot product:
‖a − b‖² = (a − b)·(a − b)
= a·a − 2a·b + b·b
= ‖a‖² − 2a·b + ‖b‖²
If all vectors are normalized to unit norm (‖a‖ = ‖b‖ = 1), then:
‖a − b‖² = 2 − 2(a·b)
So maximizing dot product is equivalent to minimizing distance when norms are controlled.
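A numeric spot-check of the identity for unit vectors, using random directions:

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(size=8)
b = rng.normal(size=8)
a /= np.linalg.norm(a)           # normalize to unit norm
b /= np.linalg.norm(b)

lhs = np.sum((a - b) ** 2)       # ‖a − b‖²
rhs = 2.0 - 2.0 * (a @ b)        # 2 − 2(a·b), valid when ‖a‖ = ‖b‖ = 1
assert np.isclose(lhs, rhs)
```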
It’s tempting to say “embeddings encode meaning,” but the meaning is shaped by the training objective, the data distribution, and the downstream layers that consume the vectors.
Two words are “similar” in embedding space if the model benefits from treating them similarly for the task.
In language modeling, similarity often reflects substitutability in context: words with similar syntactic roles, topics, and co-occurrence patterns end up nearby.
Sometimes relations are approximately linear in embedding space. A famous pattern (not a guarantee) is:
v_{king} − v_{man} + v_{woman} ≈ v_{queen}
Interpreting this: the offset v_{woman} − v_{man} approximately encodes a gender relation, so adding it to v_{king} lands near v_{queen}.
Why can this happen?
If many relationships behave consistently, a linear subspace can represent them efficiently. But remember: this depends on training and is approximate.
To find “queen” from the analogy, you compute:
u = v_{king} − v_{man} + v_{woman}
Then search for the token y whose embedding vᵧ is closest to u (often by cosine similarity):
y = argmaxᵧ cos_sim(u, vᵧ)
This is one place where cosine similarity is preferred: it reduces sensitivity to norm differences.
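This search can be sketched in NumPy. The 4-word vocabulary and its vectors below are hypothetical and hand-built so the offset structure holds exactly; real learned embeddings only satisfy it approximately:

```python
import numpy as np

def nearest_by_cosine(u, E, exclude=()):
    # rank rows of E by cosine similarity to u, skipping excluded ids
    sims = (E @ u) / (np.linalg.norm(E, axis=1) * np.linalg.norm(u))
    sims[list(exclude)] = -np.inf
    return int(np.argmax(sims))

# Hypothetical vocabulary: 0=king, 1=man, 2=woman, 3=queen.
E = np.array([
    [1.0,  1.0],   # king  ~ royal + male
    [0.0,  1.0],   # man   ~ male
    [0.0, -1.0],   # woman ~ female
    [1.0, -1.0],   # queen ~ royal + female
])
u = E[0] - E[1] + E[2]                       # v_king − v_man + v_woman
best = nearest_by_cosine(u, E, exclude=(0, 1, 2))
print(best)                                  # → 3 (queen)
```

Excluding the three query words from the search is standard practice in analogy evaluation, since one of them is often the raw nearest neighbor of u.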
If a downstream layer applies a linear map, then rotating embedding space can preserve behavior.
For example, if you replace embeddings with E′ = E R where R is an orthogonal matrix (RᵀR = I), and adjust the next linear layer accordingly, many computations can remain unchanged.
So individual coordinates rarely have stable “human meanings.” What’s stable is the relative geometry that the network uses.
| Measure | Formula | Sensitive to norms? | Typical use |
|---|---|---|---|
| Dot product | a·b | Yes | Attention scores, matrix factorization |
| Cosine similarity | (a·b) / (‖a‖‖b‖) | No | Nearest neighbors, retrieval |
| Euclidean distance | ‖a − b‖ | Yes (via norms) | Clustering, k-NN (if normalized, aligns with cosine) |
Geometric semantics means the model learns a space where “useful relationships” become simple geometric relationships—often high dot products, small angles, or consistent offsets. This geometric viewpoint is not optional: it’s the language attention uses.
Attention mechanisms don’t operate on token ids. They operate on vectors.
In a Transformer, the pipeline starts with embeddings: token ids are looked up in E, combined with positional information, and only then projected into queries, keys, and values.
So if embeddings are poorly trained or poorly scaled, attention has bad raw material.
Let X ∈ ℝ^{n×d} be the stacked token embeddings.
A single attention head uses learned matrices:
W_Q ∈ ℝ^{d×dₖ}, W_K ∈ ℝ^{d×dₖ}, W_V ∈ ℝ^{d×dᵥ}
Then:
Q = X W_Q
K = X W_K
V = X W_V
The attention score between token i and token j is based on a dot product:
score(i, j) = ( qᵢ · kⱼ ) / √dₖ
and attention weights are softmaxed scores.
Notice the chain:
xᵢ → v_{xᵢ} → qᵢ
xⱼ → v_{xⱼ} → kⱼ
So the embedding geometry is transformed, but it strongly influences how easily the model can learn useful dot products.
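Putting the chain together, here is a minimal single-head attention sketch in NumPy; random weights stand in for learned projections:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, dk, dv = 4, 8, 6, 5            # sequence length and toy dimensions
X = rng.normal(size=(n, d))          # stacked token embeddings

W_Q = rng.normal(size=(d, dk))       # learned projections (random here)
W_K = rng.normal(size=(d, dk))
W_V = rng.normal(size=(d, dv))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = (Q @ K.T) / np.sqrt(dk)     # score(i, j) = (q_i · k_j) / √dk

# row-wise softmax (shifted by the row max for numerical stability)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

out = weights @ V                    # each output row mixes value vectors

assert weights.shape == (n, n)
assert np.allclose(weights.sum(axis=1), 1.0)
assert out.shape == (n, dv)
```

Every similarity judgment here is a dot product in a space derived linearly from the embeddings.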
If components of q and k have variance σ², the dot product sums dₖ terms, so its variance grows with dₖ. Roughly:
Var(q·k) ∝ dₖ
Dividing by √dₖ keeps the scale of scores more stable as dₖ changes, preventing softmax from saturating too early.
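This scaling effect is easy to observe empirically. The sketch below draws i.i.d. unit-variance components and estimates the variance of raw and scaled dot products:

```python
import numpy as np

rng = np.random.default_rng(4)
samples = 20_000
for dk in (16, 64, 256):
    q = rng.normal(size=(samples, dk))   # components with variance 1
    k = rng.normal(size=(samples, dk))
    raw = (q * k).sum(axis=1)            # unscaled dot products
    scaled = raw / np.sqrt(dk)           # divided by √dk
    # raw variance grows roughly like dk; scaled variance stays near 1
    print(dk, round(raw.var(), 1), round(scaled.var(), 2))
```

Without the division, a head with dk = 256 would feed the softmax scores roughly 16× larger in standard deviation than one with dk = 16.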
Even with scaling, extreme embedding norms can still cause issues upstream. That’s why scale/initialization/normalization choices around embeddings matter.
Tokens alone are insufficient: the sequence order matters.
A common approach is to learn a second embedding table for positions:
P ∈ ℝ^{L×d}
For position i, add:
hᵢ = v_{xᵢ} + pᵢ
Now hᵢ is what the model projects into Q/K/V.
This mirrors the main idea: even “position” is treated as a discrete item that gets a learned dense vector.
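A sketch of the combination hᵢ = v_{xᵢ} + pᵢ, with both tables randomly initialized for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
V, L, d = 10, 6, 4                 # vocab size, max sequence length, dimension
E = rng.normal(0.0, 0.1, (V, d))   # token embedding table
P = rng.normal(0.0, 0.1, (L, d))   # learned positional embedding table

ids = np.array([7, 2, 2, 9])       # an input sequence; token 2 repeats
H = E[ids] + P[:len(ids)]          # h_i = v_{x_i} + p_i

assert H.shape == (len(ids), d)
# The same token id at different positions now has different vectors:
assert not np.allclose(H[1], H[2])
```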
Embeddings are a general pattern: token embeddings in language models, user and item embeddings in recommender systems, node embeddings on graphs, and positional embeddings for sequence order.
The core idea remains: map discrete or complex objects into a shared vector space where geometry supports the task.
In attention, the model learns to create Q/K spaces where dot products reflect “should attend.” This is a task-specific similarity.
For example: a head may learn query/key spaces where a verb’s query aligns with the keys of plausible subjects, even though those tokens are not synonyms.
So embeddings are not just about synonyms—they are about making the downstream operations (especially dot products) predictive.
When embeddings go wrong, you often see unstable early training, attention that is uniform or saturated, rare tokens behaving erratically, or embedding norms growing without bound.
Interventions often involve rescaling initialization, adding weight decay or normalization around the embedding layer, improving tokenization coverage, or tying input and output weights.
Once embeddings are clear, attention becomes less mysterious: it’s “compare vectors, mix values.” You’ll be ready to reason about attention heads as geometric comparators operating on embedded tokens.
Next node:
Let |V| = 5 items, embedding dimension d = 3. The embedding table is
E =
[ 0.2 -0.1 0.0
0.0 0.3 0.1
-0.2 0.4 0.5
0.7 0.0 -0.3
0.1 0.2 0.2 ]
Assume item x = 4 (1-indexed). Show that selecting the embedding is the same as multiplying a one-hot vector by E.
Write the one-hot vector for x = 4 (length 5):
e₄ = [0, 0, 0, 1, 0]ᵀ.
Compute v₄ by lookup (row selection):
v₄ = E[4, :] = [0.7, 0.0, -0.3].
Compute e₄ᵀ E explicitly:
e₄ᵀ E = [0, 0, 0, 1, 0]
·
[ 0.2 -0.1 0.0
0.0 0.3 0.1
-0.2 0.4 0.5
0.7 0.0 -0.3
0.1 0.2 0.2 ]
Use the fact that left-multiplying by a one-hot row picks the corresponding row:
e₄ᵀ E = 1·[0.7, 0.0, -0.3] + 0·(other rows)
= [0.7, 0.0, -0.3].
Conclude:
v₄ = e₄ᵀ E, so embedding lookup is a special case of a linear layer with one-hot input.
Insight: Thinking of embeddings as E and inputs as one-hot vectors makes backpropagation intuitive: gradients route to rows of E corresponding to observed tokens.
Let a = [3, 0] and b = [1, 0]. Compute a·b, cos_sim(a, b), and explain what each says about similarity.
Compute the dot product:
a·b = 3·1 + 0·0 = 3.
Compute norms:
‖a‖ = √(3² + 0²) = 3
‖b‖ = √(1² + 0²) = 1
Compute cosine similarity:
cos_sim(a, b) = (a·b) / (‖a‖‖b‖)
= 3 / (3·1)
= 1.
Interpretation: the dot product (3) is large partly because ‖a‖ = 3; the cosine similarity is exactly 1 because a and b point in the same direction.
Insight: Dot product conflates direction and magnitude; cosine isolates direction. Attention uses dot products (with scaling), so vector norms can influence attention even when directions match.
Let u and v be unit vectors (‖u‖ = ‖v‖ = 1) with u·v = 0.8. Compute ‖u − v‖² and ‖u − v‖. Then explain the relationship.
Start from the identity:
‖u − v‖² = ‖u‖² − 2(u·v) + ‖v‖².
Plug in ‖u‖² = 1, ‖v‖² = 1, u·v = 0.8:
‖u − v‖² = 1 − 2(0.8) + 1
= 2 − 1.6
= 0.4.
Take the square root:
‖u − v‖ = √0.4 ≈ 0.632.
Interpretation:
When vectors are normalized, larger dot product ⇒ smaller distance in a precise way:
‖u − v‖² = 2 − 2(u·v).
Insight: If you normalize embeddings, dot product similarity and Euclidean distance become directly linked, which simplifies nearest-neighbor reasoning and stabilizes similarity comparisons.
An embedding maps a discrete item x to a learned dense vector vₓ ∈ ℝᵈ.
The embedding layer is a parameter matrix E ∈ ℝ^{|V|×d}; vₓ is simply row x of E.
Embedding lookup is mathematically vₓ = eₓᵀ E, i.e., a linear map applied to a one-hot input.
Only embedding rows that appear in a batch receive gradient updates; rare items learn slowly and can be poorly represented.
Geometry is the core: dot products, norms, angles, and distances define similarity and relational structure used by downstream layers.
Dot product mixes direction and magnitude: a·b = ‖a‖‖b‖cos(θ). Cosine similarity removes magnitude effects.
Embedding spaces are not uniquely identifiable (they can be rotated/reparameterized); what matters is relative geometry as used by the network.
Attention relies on these representations: tokens become embeddings, are projected to Q/K/V, and compared via scaled dot products.
Treating token IDs as numeric features (e.g., feeding the integer id directly), which introduces fake ordinality.
Assuming dot product equals “semantic similarity” without considering vector norms; large norms can dominate scores.
Over-interpreting individual embedding coordinates as human-readable features; embeddings are often only meaningful up to rotations and linear transforms.
Ignoring frequency effects: rare items may have noisy embeddings unless you use better tokenization, regularization, or share substructures.
You have |V| = 10,000 tokens and choose embedding dimension d = 512.
1) How many embedding parameters are there?
2) If you store them in float32, roughly how many MB is the table (ignore overhead)?
Hint: Parameters = |V|×d. float32 is 4 bytes per parameter. Convert bytes to MB by dividing by 1024² (or approximate with 10⁶ for a rough estimate).
1) Parameters = 10,000×512 = 5,120,000.
2) Bytes ≈ 5,120,000×4 = 20,480,000 bytes.
In MiB: 20,480,000 / 1024² ≈ 19.5 MiB (about 20 MB).
Let a, b ∈ ℝᵈ be nonzero. Show algebraically that scaling one vector (replacing b with cb for some c > 0) changes the dot product but not the cosine similarity.
Hint: Use linearity of dot product and the fact that ‖cb‖ = c‖b‖ for c > 0.
Dot product:
a·(cb) = c(a·b) (it scales linearly with c).
Cosine similarity (c > 0):
cos_sim(a, cb) = (a·(cb)) / (‖a‖‖cb‖)
= c(a·b) / (‖a‖ (c‖b‖))
= (a·b) / (‖a‖‖b‖)
= cos_sim(a, b).
So cosine similarity is invariant to positive scaling.
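The algebra above can be spot-checked numerically with random vectors and an arbitrary positive scale factor:

```python
import numpy as np

rng = np.random.default_rng(6)
a = rng.normal(size=5)
b = rng.normal(size=5)
c = 7.3                            # an arbitrary positive scale factor

def cos_sim(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

assert np.isclose(a @ (c * b), c * (a @ b))          # dot product scales with c
assert np.isclose(cos_sim(a, c * b), cos_sim(a, b))  # cosine is invariant
```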
Suppose an embedding table E ∈ ℝ^{|V|×d} is used, and a batch contains token ids [2, 2, 5]. Let the upstream gradient for the resulting embeddings X ∈ ℝ^{3×d} be G = ∂L/∂X, where rows are g₁, g₂, g₃.
Describe (in words and formulas) which rows of ∂L/∂E are nonzero and what values they take.
Hint: Use X = H E where H is the one-hot matrix. Then ∂L/∂E = Hᵀ G. Repeated ids mean gradients add into the same row.
Only rows corresponding to ids 2 and 5 can be nonzero.
Because token id 2 appears twice (positions 1 and 2), its embedding row receives the sum of both gradients:
(∂L/∂E)[2, :] = g₁ + g₂.
Token id 5 appears once (position 3):
(∂L/∂E)[5, :] = g₃.
All other rows x ∉ {2, 5} have:
(∂L/∂E)[x, :] = 0.
This reflects sparse updates and accumulation for repeated tokens.
Prerequisite concept you’ll actively use:
This node unlocks:
Related follow-on ideas (often adjacent in a tech tree):