Continuous vector representations that encode discrete items (words, tokens, or features) into a dense numeric space where geometric relationships reflect semantic or functional similarity. Embeddings are the typical inputs to attention layers and determine how items interact via similarity and projection.
Self-serve tutorial - low prerequisites, straightforward concepts.
Modern ML models can’t directly “think” in words, IDs, or categories. They can only compute with numbers—especially vectors. Vector embeddings are the bridge: they turn discrete items into continuous vectors v ∈ ℝᵈ so geometry (dot products, angles, distances) becomes a usable language for meaning and function.
A vector embedding is a learned mapping x ↦ eₓ ∈ ℝᵈ from a discrete item (token/word/category) to a dense vector. Similar items end up with similar directions (high cosine similarity) and often similar positions (small distance). Embeddings are usually implemented as a trainable lookup table (an “embedding matrix”) and are the standard inputs to attention, where dot products between embeddings produce relevance scores.
Discrete items—like words, tokens, product IDs, user IDs, categorical features, or graph nodes—don’t naturally live in a space where “closeness” or “similarity” is meaningful.
Embeddings solve this by giving each item a dense vector eₓ ∈ ℝᵈ. Once items are vectors, we can compare them with dot products, cosine similarity, and distances. That makes “interaction” between items (in attention, retrieval, classification, etc.) a simple geometric computation.
An embedding is a function (often learned) that maps a discrete item x to a continuous vector. We write:

Emb(x) = eₓ ∈ ℝᵈ
In many neural networks, Emb(·) is implemented as a lookup table (a matrix) with one vector per item.
The core intuition is not that embeddings store dictionary definitions. Instead, they store whatever relationships the training signal rewards.
For language, distributional learning yields the classic idea: “You know a word by the company it keeps.” If two words appear in similar contexts, a training objective will encourage their embeddings to become similar.
A one-hot vector x ∈ {0,1}^|V| is sparse: exactly one 1 and the rest 0. An embedding eₓ ∈ ℝᵈ is dense: typically all d components are nonzero.
A useful comparison:
| Representation | Dimensionality | Similarity structure | Parameters | Pros | Cons |
|---|---|---|---|---|---|
| Integer ID | 1 | none | 0 | compact | no geometry |
| One-hot | \|V\| | orthogonal | 0 | simple | huge, no similarity |
| Embedding | d (e.g., 128–4096) | learned geometry | \|V\|·d | expressive, compact | must be learned |
Embeddings are “compact but expressive.” They trade fixed semantics (one-hot) for learnable semantics (geometry).
We’ll use eₓ to denote the embedding vector for item x, i.e., eₓ = Emb(x) ∈ ℝᵈ.
And we’ll often compare embeddings using cosine similarity (prerequisite):
cos(a, b) = (a · b) / (‖a‖ ‖b‖)
Cosine similarity focuses on direction, which is often what matters in learned representation spaces.
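A minimal NumPy helper for this formula (the example vectors are arbitrary):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b: (a . b) / (||a|| ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(cosine_similarity(a, b))  # ≈ 0.7071 (vectors 45° apart)
```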
Embeddings are a learned coordinate system designed to make downstream computation easy.
When x is a discrete ID, the simplest parameterization is: assign a trainable vector to each ID. Collect these vectors into a matrix E.
Let E ∈ ℝ^{|V|×d}, where |V| is the number of items and d is the embedding dimension. Row i of E is the embedding for token i:

eᵢ = E[i, :]
This is literally a learned table.
If x is a one-hot vector representing item i, then the embedding lookup is equivalent to matrix multiplication:
Compute:
e = xᵀ E
Because x has a single 1 at index i, xᵀ E selects the i-th row.
This equivalence is useful for understanding gradients: the model updates the row(s) corresponding to the IDs it sees.
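This lookup/matmul equivalence is easy to check directly (the toy matrix values below are arbitrary):

```python
import numpy as np

# Toy embedding matrix: |V| = 4 tokens, d = 3 dimensions.
E = np.arange(12, dtype=float).reshape(4, 3)

i = 2
x = np.zeros(4)          # one-hot vector for token i
x[i] = 1.0

lookup = E[i]            # direct row lookup
matmul = x @ E           # one-hot times matrix selects the same row
assert np.allclose(lookup, matmul)
print(lookup)  # [6. 7. 8.]
```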
Embeddings are learned because they participate in a loss. The loss might come from next-token prediction, classification, contrastive learning, recommendation, or any other objective that touches the embedded items.
Regardless of the objective, the embedding vectors are parameters. During backprop, the gradient ∂L/∂eₓ updates eₓ.
A simple mental model: gradient descent repeatedly nudges each embedding toward positions where the rest of the network makes fewer mistakes, and geometry emerges as a side effect.
Consider a very common pattern: predict a label y from an embedding h (which could be eₓ or a contextual vector). Use a linear layer and softmax:

p = softmax(W h)
Loss for true class c:
L = −log p_c
Even if you don’t memorize softmax gradients, the key is that the gradient flows through W back into h—and ultimately into the embedding table itself.
So embeddings become whatever vectors make the rest of the network succeed.
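A sketch of this classification pattern, assuming hypothetical sizes d = 4 and 3 classes (the gradient lines illustrate how loss flows back into h):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, num_classes = 4, 3
h = rng.normal(size=d)                   # embedding (or contextual vector)
W = rng.normal(size=(num_classes, d))    # linear classifier weights

p = softmax(W @ h)                       # class probabilities
c = 1                                    # true class
loss = -np.log(p[c])                     # L = -log p_c

# Gradient of the loss w.r.t. the logits is (p - one_hot(c));
# the chain rule then pushes gradient into h, i.e., into the embedding.
grad_logits = p.copy()
grad_logits[c] -= 1.0
grad_h = W.T @ grad_logits
print(loss, grad_h.shape)
```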
Why not make d enormous? Larger d means more parameters, more memory, slower compute, and a greater risk of overfitting when data is limited.
A useful rule of thumb: pick d based on data scale, vocabulary size, and task complexity. In transformers, d often matches the model width so embeddings can be added to positional encodings and fed into attention.
In language models, there are often two related matrices: the input embedding matrix E (token → vector) and the output (unembedding) matrix used to produce logits over the vocabulary.
Sometimes these are tied (shared): output weight matrix W is set to Eᵀ.
Why tie? Tying saves |V|·d parameters and couples the input and output representations of each token, which often helps when the same vocabulary appears on both sides.
Embeddings are usually initialized randomly with small variance. Scale matters because dot products and norms affect attention scores and softmax logits.
In attention, if dot products get too large, softmax can saturate (become too peaky). That’s one reason scaled dot-product attention uses 1/√d.
Even before attention, stable embedding scales help optimization.
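A quick numerical check of the scale argument (the dimensions and sample count below are arbitrary): for random vectors with unit-variance components, the standard deviation of raw dot products grows like √d, and dividing by √d keeps the score scale roughly constant:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (16, 256, 4096):
    # 1000 random vector pairs with unit-variance components.
    q = rng.normal(size=(1000, d))
    k = rng.normal(size=(1000, d))
    dots = (q * k).sum(axis=1)
    # Raw std grows like sqrt(d); scaled std stays near 1,
    # which is why attention divides scores by sqrt(d).
    print(d, round(float(dots.std()), 1), round(float((dots / np.sqrt(d)).std()), 2))
```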
Embeddings aren’t just for words: the same recipe applies to product IDs, users, categorical features, graph nodes, image patches, and more.
In all cases, the embedding is a learned vector that becomes a “handle” for the model to condition on.
This gives us the basic object: eₓ. Next we’ll focus on the geometry of embeddings—what dot products and angles mean, and why that geometry becomes the substrate for attention.
Once items are vectors, the model can compute interactions with fast linear algebra. The most common interaction is a dot product:
Dot products are cheap, differentiable, and deeply connected to angles and lengths. This is why embeddings pair naturally with attention.
But: dot products depend on both direction and magnitude. Cosine similarity removes magnitude to focus on direction:
cos(a, b) = (a · b) / (‖a‖ ‖b‖)
So you should keep two pictures in mind: the dot product, which mixes direction and magnitude, and cosine similarity, which isolates direction.
For a, b ∈ ℝᵈ:
Expand squared distance:
‖a − b‖²
= (a − b) · (a − b)
= a·a − 2a·b + b·b
= ‖a‖² − 2(a·b) + ‖b‖²
This shows a key link: Euclidean distance is fully determined by the two norms and the dot product.
If we L2-normalize embeddings (force ‖eₓ‖ = 1), then:
‖a − b‖² = 2 − 2(a·b)
and since a·b = cos(a, b) for unit vectors:
‖a − b‖² = 2 − 2 cos(a, b)
So for normalized embeddings: minimizing Euclidean distance is the same as maximizing cosine similarity (equivalently, the dot product).
This is why many retrieval systems store normalized embeddings.
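A quick numerical check of this identity with random unit vectors (the dimension is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=8)
a /= np.linalg.norm(a)            # force ||a|| = 1
b = rng.normal(size=8)
b /= np.linalg.norm(b)            # force ||b|| = 1

cos = float(a @ b)                # for unit vectors, dot product = cosine
dist_sq = float(np.sum((a - b) ** 2))
assert np.isclose(dist_sq, 2 - 2 * cos)   # ||a - b||^2 = 2 - 2 cos(a, b)
print(cos, dist_sq)
```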
Embeddings encode whatever similarity the training process rewards.
The geometry becomes a compressed record of these training pressures. Two important consequences: similarity is objective-dependent (two models trained on different tasks can disagree about which items are “close”), and embeddings inherit the biases of the training data.
So “semantic similarity” is better read as useful similarity under the objective.
In many neural modules, the dot product between embeddings stands for compatibility: a query vector q against a key vector k in attention, or a user vector u against an item vector v in recommendation. This is a learned notion of compatibility because q, k, u, v come from learned embeddings or learned projections of embeddings.
Real embedding spaces often have quirks: anisotropy (vectors cluster in a narrow cone rather than spreading over the sphere), norms that correlate with item frequency, and a few dominant directions that swamp the rest.
These effects can harm retrieval or similarity search because cosine similarities become less discriminative.
Mitigations include mean-centering, whitening, L2 normalization, and removing dominant principal components.
Sometimes we add constraints to shape geometry: forcing unit norm (so everything lives on a sphere), applying a temperature to scale dot products, or regularizing norms.
Each choice changes how dot products translate into probabilities.
Another way to view embedding components: each coordinate is a learned feature axis, but meaning is typically distributed across many coordinates rather than tied to any single one.
You’ll occasionally find interpretable directions (e.g., sentiment), but that’s not guaranteed.
Attention mechanisms operate on vectors. If your input is discrete tokens, you must first map them to vectors—embeddings.
In a transformer, a typical pipeline is: token IDs → embedding lookup → add positional information → attention layers mix the vectors → output head.
So embeddings are the “raw material” that attention will mix.
Self-attention doesn’t usually compare raw embeddings directly. It projects them:

qᵢ = W_Q eᵢ,  kⱼ = W_K eⱼ,  vⱼ = W_V eⱼ
Then attention scores use dot products:
score(i, j) = (qᵢ · kⱼ) / √d_k
and weights are softmax over j.
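The project-then-score pipeline can be sketched in NumPy (the sizes n, d_model, d_k and the random matrices below are illustrative stand-ins for learned parameters):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4                 # sequence length, widths (illustrative)
E  = rng.normal(size=(n, d_model))        # token embeddings
Wq = rng.normal(size=(d_model, d_k))      # learned projections (random here)
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

Q, K, V = E @ Wq, E @ Wk, E @ Wv
scores  = Q @ K.T / np.sqrt(d_k)          # score(i, j) = (q_i . k_j) / sqrt(d_k)
weights = softmax(scores, axis=-1)        # softmax over j
out     = weights @ V                     # similarity-weighted mixing
print(weights.sum(axis=-1))               # each row sums to 1
```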
Even though q, k, v are projected, their source is the embedding space. The structure of embeddings strongly influences what the model can learn efficiently: linear projections can reweight and rotate the geometry, but they cannot create distinctions between items whose embeddings are identical.
Transformers also embed positions (learned or fixed positional embeddings); vision transformers embed image patches, and some variants embed segments or other metadata.
The concept is the same: a discrete unit becomes a vector so attention can compare and combine units.
Embeddings are also used for retrieval: embed a query and a corpus of documents in the same space, then rank documents by their similarity to the query.
This is the backbone of semantic search and RAG systems.
If embeddings are normalized, ranking by cosine similarity is equivalent to ranking by dot product.
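A minimal retrieval sketch with random stand-in embeddings (in practice these would come from a trained encoder):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 16))                    # stand-in document embeddings
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # L2-normalize each row
query = rng.normal(size=16)
query /= np.linalg.norm(query)

scores = docs @ query                # dot product == cosine for unit vectors
top5 = np.argsort(-scores)[:5]       # highest-similarity documents first
print(top5, scores[top5])
```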
User and item embeddings are classic: learn a vector u for each user and a vector v for each item from interaction data.
A simple predictor is a dot product:
ŷ = u · v
If user u likes items similar to v, the training objective increases u·v, bringing vectors into alignment.
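A toy sketch with made-up user and item vectors (all values are illustrative, not learned):

```python
import numpy as np

u = np.array([0.9, 0.1, 0.5])        # hypothetical user embedding
items = np.array([
    [1.0, 0.0, 0.4],                 # item well aligned with u
    [0.0, 1.0, 0.1],
    [-0.5, 0.2, 0.0],
])

scores = items @ u                   # y_hat = u . v for each item
print(scores)                        # higher score = stronger predicted preference
print(int(np.argmax(scores)))        # item 0 ranks highest here
```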
Memory: embedding tables can dominate parameter count when |V| is large.
Common solutions: subword tokenization (to shrink |V|), hashing tricks, low-rank factorization of the table, pruning, and quantization.
OOV and rare items: subword or hash-based schemes give unseen items a usable vector; rare items receive few gradient updates, so their embeddings tend to stay noisy.
Fine-tuning vs freezing: pretrained embeddings can be fine-tuned (better task fit, but risks overfitting on small data) or frozen (cheaper and more stable, but less tailored to the task).
Attention is fundamentally about similarity-weighted mixing. Similarity is computed by dot products between vectors.
Embeddings provide: the input vectors from which q, k, v are projected, and a geometric space in which dot products act as learned relevance scores.
Once you have eₓ, attention can form q, k, v and compute relevance.
This is why “Vector Embeddings” is a prerequisite: without vectorization, there is no meaningful similarity computation to drive attention.
Next node: Attention Mechanisms.
Suppose a tiny vocabulary V = {A, B, C} with |V| = 3 and embedding dimension d = 2. Let the embedding matrix be
E = [
[1, 0],
[2, 1],
[0, 3]
] (shape 3×2)
Rows correspond to A, B, C in that order. We want the embedding for token B.
Represent token B as a one-hot vector x ∈ ℝ³:
x = [0, 1, 0]
Compute the embedding as e = xᵀ E:
e = [0, 1, 0] · [
[1, 0],
[2, 1],
[0, 3]
]
Multiply:
e = 0·[1, 0] + 1·[2, 1] + 0·[0, 3]
= [2, 1]
So e_B = [2, 1].
Insight: An embedding lookup is algebraically “one-hot selection.” In backprop, only the selected row(s) receive gradients, which is why embedding tables train efficiently even for huge vocabularies.
Let a, b ∈ ℝ² be two unit vectors (‖a‖ = ‖b‖ = 1). Suppose a·b = 0.8. Compute cos(a, b) and ‖a − b‖², and interpret.
Because both vectors are unit length, cosine similarity equals the dot product:
cos(a, b) = (a·b) / (‖a‖‖b‖) = 0.8 / (1·1) = 0.8
Compute squared distance using the expansion:
‖a − b‖²
= ‖a‖² − 2(a·b) + ‖b‖²
= 1 − 2(0.8) + 1
Finish:
‖a − b‖² = 2 − 1.6 = 0.4
Optionally compute distance:
‖a − b‖ = √0.4 ≈ 0.632
Insight: For unit-normalized embeddings, high cosine similarity implies small Euclidean distance (and vice versa). This is why many retrieval systems normalize embeddings: it makes geometry consistent and simplifies ranking.
Suppose you have three token embeddings in ℝ²:
e₁ = [1, 0]
e₂ = [1, 1]
e₃ = [0, 1]
Treat token 2 as a “query” and compute raw dot-product scores sⱼ = e₂ · eⱼ for j ∈ {1,2,3}.
Compute s₁ = e₂ · e₁:
s₁ = [1, 1] · [1, 0] = 1·1 + 1·0 = 1
Compute s₂ = e₂ · e₂:
s₂ = [1, 1] · [1, 1] = 1 + 1 = 2
Compute s₃ = e₂ · e₃:
s₃ = [1, 1] · [0, 1] = 0 + 1 = 1
Interpretation: token 2 is most similar to itself (score 2) and equally similar to tokens 1 and 3 (score 1).
Insight: Attention scoring is built on dot products of vectors. Even before adding projections (W_Q, W_K), the embedding geometry already determines which tokens are “compatible.”
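The scores from this example can be checked in NumPy; as a preview of attention (a step beyond the example itself), a softmax turns the scores into mixing weights:

```python
import numpy as np

e = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

s = e @ e[1]                         # dot-product scores of token 2 vs. all tokens
print(s)                             # [1. 2. 1.]

w = np.exp(s) / np.exp(s).sum()      # softmax turns scores into attention weights
print(np.round(w, 3))                # ≈ [0.212 0.576 0.212]
```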
An embedding maps a discrete item x to a dense vector eₓ ∈ ℝᵈ, enabling similarity computations.
A common implementation is an embedding matrix E ∈ ℝ^{|V|×d}; lookup selects a row (equivalently one-hot × E).
Embedding geometry is shaped by the training objective, so “similarity” means “useful similarity for the task.”
Cosine similarity compares directions: cos(a, b) = (a·b) / (‖a‖‖b‖).
For unit vectors, ‖a − b‖² = 2 − 2 cos(a, b), linking distance and cosine similarity tightly.
Dot products serve as learned compatibility scores; this is the core numeric operation behind attention and retrieval.
Embedding tables can dominate parameters (|V|·d), motivating subword tokenization, pruning, factorization, or quantization.
Assuming embedding axes have fixed human-interpretable meaning; in general, meaning is distributed and coordinates are not unique.
Comparing dot products across models or setups without accounting for norms/normalization; magnitudes can distort similarity if not controlled.
Forgetting that embeddings reflect the training objective (and data biases), not an absolute notion of semantic truth.
Choosing an embedding dimension d without considering vocabulary size, data availability, and memory/compute constraints.
You have normalized embeddings (‖a‖ = ‖b‖ = 1). If cos(a, b) = 0.3, compute ‖a − b‖² and ‖a − b‖.
Hint: Use ‖a − b‖² = 2 − 2 cos(a, b) for unit vectors.
‖a − b‖² = 2 − 2(0.3) = 2 − 0.6 = 1.4.
‖a − b‖ = √1.4 ≈ 1.183.
Let |V| = 50,000 and d = 768. Approximately how many parameters are in the embedding table? If stored as float32 (4 bytes), about how much memory does it take?
Hint: Parameters = |V|·d. Memory = parameters × 4 bytes. Convert to MB or GB.
Parameters = 50,000 × 768 = 38,400,000.
Memory ≈ 38.4 million × 4 bytes = 153.6 million bytes.
In MB: 153.6e6 / (1024²) ≈ 146.5 MB (about 150 MB).
Suppose E ∈ ℝ^{4×3} is an embedding matrix for tokens {0,1,2,3}. If a training batch contains tokens [1, 1, 3], which rows of E receive gradient updates during backprop through an embedding lookup? Explain briefly.
Hint: Only looked-up rows are involved in the forward computation; repeated tokens accumulate gradients on the same row.
Rows 1 and 3 receive gradient updates. Row 1 appears twice in the batch, so its gradient contributions accumulate (sum) for that row. Rows 0 and 2 receive no update from this batch because they were not looked up.
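A sketch of this accumulation, simulating upstream gradients of all-ones flowing into each looked-up vector (the all-ones gradient is an illustrative assumption):

```python
import numpy as np

V, d = 4, 3
E = np.zeros((V, d))                   # embedding table (values irrelevant here)
batch = [1, 1, 3]                      # token IDs in the batch

grad_E = np.zeros_like(E)
for tok in batch:
    # np.add.at accumulates correctly even for repeated indices.
    np.add.at(grad_E, tok, np.ones(d))

print(grad_E)
# Row 1 accumulates two contributions, row 3 one; rows 0 and 2 stay zero.
```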
Next: Attention Mechanisms
Related nodes you may want nearby in the tech tree: