Representation of discrete tokens (words, subwords, characters) as continuous vectors used as input to neural models; includes learned embeddings and embedding lookup/initialization. Understanding embeddings covers dimensionality, lookup tables, and basic properties like semantic similarity in vector space.
Deep-dive lesson: an accessible entry point, but dense material. Use worked examples and spaced repetition.
Neural networks don’t naturally understand discrete symbols like “cat”, “##ing”, or token ID 50256. Token embeddings are the bridge: they turn a token into a continuous vector v ∈ ℝᵈ that a model can compute with.
A token embedding is a learned vector representation for each token in a vocabulary. All embeddings live in an embedding matrix E ∈ ℝ^(V×d). Given a token ID i, the model retrieves the i-th row E[i] (an embedding lookup) to produce eᵢ ∈ ℝᵈ, which becomes the input to downstream layers (e.g., attention/Transformer blocks). Embeddings are parameters: they’re initialized, updated by backprop, and their dimensionality d controls capacity and compute.
Neural networks are built to process numbers: vectors, matrices, and tensors. But language (and many other discrete domains) starts as symbols: words like “cat”, subword pieces like “##ing”, or raw token IDs like 50256.
A token by itself has no natural numeric geometry. Token ID 7 isn’t “closer” to token ID 8 than to token ID 9000, yet a neural network will treat raw numbers that way.
So we introduce a representation that:
1) Is numeric (so the model can compute),
2) Has a geometry (so “similar” tokens can end up close),
3) Is learnable (so it adapts to the training data).
That representation is the token embedding.
Let the vocabulary size be V (number of distinct tokens) and let the embedding dimensionality be d.
We store an embedding matrix E ∈ ℝ^(V×d): one row per token, each row of length d.
Each token i (an integer ID in {0, …, V−1}) is assigned the i-th row of E: eᵢ = E[i] ∈ ℝᵈ.
This is a lookup table: you don’t compute E[i] by multiplying all of E by something dense; you directly retrieve a row.
You can think of embeddings as placing each token at a point in a d-dimensional space. During training, the model moves these points around to make its predictions better.
This doesn’t guarantee a perfect semantic map, but it gives the model a flexible continuous “surface” on which to build meaning.
In a typical language model pipeline:
1) Text → tokenizer → token IDs (integers)
2) Token IDs → embedding lookup in E → vectors e₁, e₂, …
3) Vectors → Transformer (or other neural network) → predictions
So token embeddings are often the first learned layer of a modern NLP model.
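The three steps above can be sketched in plain Python. The word-level vocabulary and integer-valued embedding matrix below are illustrative assumptions, not a real tokenizer or trained weights:

```python
# Toy pipeline: text -> token IDs -> embedding vectors.
# `vocab` is a hypothetical word-level tokenizer; real systems use subword tokenizers.
vocab = {"the": 0, "cat": 1, "sat": 2}                 # V = 3 distinct tokens
d = 4                                                  # embedding dimensionality
E = [[i * d + j for j in range(d)] for i in range(3)]  # toy V x d embedding matrix

def tokenize(text):
    return [vocab[w] for w in text.split()]

ids = tokenize("the cat sat")          # step 1: text -> IDs
vectors = [E[i] for i in ids]          # step 2: embedding lookup, one row per token
# step 3 would feed `vectors` into a Transformer or other network
print(ids)          # [0, 1, 2]
print(vectors[1])   # [4, 5, 6, 7]
```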
A token ID is discrete. The model must map it to a continuous vector e ∈ ℝᵈ efficiently.
If V is large (e.g., 50k, 200k, 1M), you cannot afford to treat the input as a dense V-dimensional vector at every step. The embedding matrix lets you: store one d-dimensional vector per token and retrieve it directly by index.
Conceptually, you can represent token i as a one-hot vector xᵢ ∈ {0,1}^V with a 1 at position i.
Then embedding lookup can be written as a matrix product: eᵢ = xᵢᵀ E.
Let’s check shapes: xᵢᵀ is 1×V and E is V×d, so xᵢᵀ E is 1×d, a single d-dimensional vector.
This multiplication selects exactly one row of E, because all entries of xᵢ are 0 except at i.
Even though eᵢ = xᵢᵀ E is a nice equation, real implementations do an index operation, eᵢ = E[i] (a row gather), because multiplying by a V-length one-hot vector would waste memory and compute.
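A quick sketch with an assumed toy matrix, confirming that the one-hot product and the row lookup give the same vector at very different cost:

```python
# Row lookup vs one-hot matrix product (toy V = 4, d = 2 example).
V, d = 4, 2
E = [[3, 1],
     [0, 2],
     [5, -1],
     [4, 4]]
i = 2
lookup = E[i]                                   # direct lookup: touches d numbers
x = [1 if j == i else 0 for j in range(V)]      # one-hot x_i
one_hot_product = [sum(x[r] * E[r][c] for r in range(V)) for c in range(d)]  # touches V*d
assert lookup == one_hot_product == [5, -1]
```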
In practice you have a batch of sequences: token IDs X ∈ {0, …, V−1}^(B×T), where B is the batch size and T the sequence length.
Embedding lookup produces: H ∈ ℝ^(B×T×d).
So each position t in each sequence b gets a vector: H[b, t] = E[X[b, t]] ∈ ℝᵈ.
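Under these shape conventions, a batched lookup is just nested indexing; a small sketch with assumed toy values:

```python
# Batched embedding lookup: X is (B, T) integer IDs, result H is (B, T, d).
V, d = 4, 2
E = [[1, 1], [2, 0], [0, 3], [5, 5]]    # toy embedding matrix
X = [[3, 1],                            # B = 2 sequences of length T = 2
     [1, 0]]
H = [[E[i] for i in seq] for seq in X]  # H[b][t] = E[X[b][t]]
assert H[0][0] == [5, 5]                # E[3]
assert (len(H), len(H[0]), len(H[0][0])) == (2, 2, d)
```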
Different sources use different conventions. You might see embeddings as rows E[i] or columns. The key invariant is: each token ID maps to exactly one d-dimensional vector, however E is laid out.
In this lesson we treat E as V rows, each row is eᵢ ∈ ℝᵈ.
Embeddings can dominate parameter count.
Example: V = 50,000 and d = 768.
Parameters = 50,000 · 768 = 38,400,000
That’s 38.4M parameters just for token embeddings.
Once tokens are vectors, you can measure similarity.
Two common measures:
1) Dot product: a·b = Σⱼ aⱼbⱼ.
2) Cosine similarity: cos(a, b) = (a·b) / (‖a‖‖b‖).
Cosine similarity compares direction, not magnitude. Many embedding analyses use cosine similarity to focus on relational structure.
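Both measures are a few lines of plain Python; the vectors here are assumed toy values:

```python
import math

def dot(a, b):
    """Dot product: sum of elementwise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product normalized by both vector norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [2.0, 0.0, 1.0]
b = [1.0, 1.0, 0.0]
print(dot(a, b))               # 2.0
print(round(cosine(a, b), 3))  # 0.632 (= 2 / sqrt(10))
```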
Embeddings are optimized to help the model predict correct outputs. If the model benefits from treating two tokens similarly (because they appear in similar contexts), gradient descent tends to push their embeddings in similar directions.
This is not magic; it’s just shared pressure from the loss function across many training examples.
You could assign each token a random vector and never change it. The model would then have to learn everything in later layers, with no ability to shape the input representation.
Learnable embeddings let the model: shape the input representation itself, move tokens that behave similarly closer together, and allocate representational capacity where the data demands it.
Formally, E is part of the model parameters θ.
Common initialization strategies: sample each entry i.i.d. from a small Gaussian N(0, σ²) (e.g. σ ≈ 0.02), or scale σ with 1/√d so activation magnitudes stay roughly constant across model sizes.
A typical goal is to keep activations at reasonable scale early in training.
If σ is too large: activations and dot products start out huge, softmaxes saturate, and training can be unstable.
If σ is too small: signals are tiny, gradients are weak, and learning is slow to get going.
Suppose in one training step you see token i at some position. The forward pass retrieves eᵢ = E[i]. The loss L depends on that vector through downstream computations.
Backprop computes the gradient: ∂L/∂E[i] = ∂L/∂eᵢ.
But importantly: ∂L/∂E[j] = 0 for every token j that did not appear, so unused rows receive no gradient.
This matches the lookup behavior: you only “touch” the embeddings you used.
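One way to see the sparsity is to represent the embedding gradient as a map from row index to gradient vector; a sketch with assumed upstream gradients:

```python
# Gradients for E touch only the rows looked up in this step.
d = 2
batch_ids = [3, 1, 3]                            # token 3 appears twice
upstream = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]  # assumed dL/de at each position

grad_E = {}                                      # absent rows mean zero gradient
for i, g in zip(batch_ids, upstream):
    row = grad_E.setdefault(i, [0.0] * d)
    for c in range(d):
        row[c] += g[c]                           # repeated IDs accumulate gradient

assert set(grad_E) == {1, 3}                     # only used rows are touched
assert grad_E[3] == [1.5, 0.5]                   # two occurrences summed
```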
Consider a toy model:
1) Lookup: eᵢ = E[i]
2) Linear: z = Weᵢ + b
3) Loss L depends on z
Let’s compute ∂L/∂eᵢ.
We use chain rule: ∂L/∂eᵢ = (∂z/∂eᵢ)ᵀ (∂L/∂z).
But z = Weᵢ + b, so: ∂z/∂eᵢ = W.
Thus: ∂L/∂eᵢ = Wᵀ (∂L/∂z).
And since eᵢ is the i-th row of E, the gradient for E[i] is exactly ∂L/∂eᵢ.
Finally, a gradient descent update (learning rate η): E[i] ← E[i] − η (∂L/∂eᵢ).
Every time token i appears, its vector is nudged to reduce loss.
Tokens that appear more often get updated more often.
This can be good (more data) but also can cause imbalance: frequent tokens get well-trained embeddings, while rare tokens barely move from their random initialization.
Some tokenization strategies (subwords) help reduce the number of truly rare tokens by composing words out of more frequent pieces.
In many language models, the input embedding matrix E is tied (shared) with the output projection matrix used to predict token logits.
If output logits are computed as: logits = h Eᵀ (so the score for token i is h·eᵢ, where h is the final hidden state),
then E serves two roles: input lookup table and output classification weights.
This reduces parameters and often improves performance, but it couples constraints: the same geometry must serve both input and output.
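A minimal sketch of the tied setup, with assumed toy values: the same rows that embed inputs also score outputs, via logitᵢ = h·eᵢ:

```python
# Weight tying: logits = h E^T reuses the embedding rows as output weights.
V, d = 4, 2
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
h = [2.0, 3.0]                          # assumed final hidden state
logits = [sum(h[c] * E[i][c] for c in range(d)) for i in range(V)]
assert logits == [2.0, 3.0, 5.0, -2.0]  # logit_i = h . e_i for each token i
```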
You can initialize E using pretrained vectors (word2vec/GloVe) or from a pretrained transformer.
Comparing options:
| Approach | Pros | Cons | When used |
|---|---|---|---|
| Train E from scratch | Simple; fully task-adapted | Needs lots of data; slow to learn semantics | New domains, sufficient data |
| Initialize from pretrained | Faster convergence; better semantics early | May mismatch tokenizer/vocab; can bake in biases | Many NLP tasks |
| Freeze pretrained E | Stable; fewer trainable params | Limits adaptation; can hurt performance | Low-data or constrained training |
Even when using pretrained embeddings, fine-tuning (updating E) is common.
The embedding dimensionality d is the length of each token vector eᵢ ∈ ℝᵈ.
Choosing d is a trade-off between: representational capacity on one side, and parameter count, memory, and compute on the other.
With larger d, each token has more degrees of freedom.
But “bigger d” isn’t automatically better. If you don’t have enough data or model structure to use it, the extra dimensions can become noisy.
Parameters in E scale linearly in d: params(E) = V·d.
The embedding activations for a batch scale as: B·T·d values.
So increasing d increases both parameter memory and the size of tensors passed through attention/MLP blocks.
In transformers, token embeddings are usually produced in the same dimension as the model’s hidden size (often called d_model).
That way, you can add other vectors (like positional encodings) and pass embeddings directly into attention blocks without extra projections.
Many downstream operations depend on dot products.
For random vectors with independent components, dot products tend to grow with d unless normalized. This is one reason initialization and normalization layers matter.
If a, b have typical component scale σ, then expected magnitude: ‖a‖ ≈ σ√d, and for independent zero-mean components the dot product a·b has typical size around σ²√d.
So as d grows, norms grow unless σ shrinks. Good initialization tries to keep these scales stable.
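A small Monte Carlo sketch (assumed σ and sample counts) illustrating that the average norm tracks σ√d:

```python
import math
import random

def avg_norm(d, sigma, n=200, seed=0):
    """Average Euclidean norm of n random vectors with i.i.d. N(0, sigma^2) entries."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(d)))
    return total / n

sigma = 0.02
for d in (64, 256, 1024):
    # empirical average norm stays close to the sigma * sqrt(d) prediction
    print(d, round(avg_norm(d, sigma), 4), round(sigma * math.sqrt(d), 4))
```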
Common rules of thumb (not laws): small models often use d in the low hundreds (128–512), large language models use d in the thousands, and d tends to grow with vocabulary size and overall model scale.
The key is: d is a design knob controlling representational bandwidth at the input.
A simplified Transformer input step:
1) Token IDs X ∈ {0,…,V−1}^(B×T)
2) Token embeddings: H_tok[b,t] = E[X[b,t]]
3) Add positional information: H₀ = H_tok + P
4) Pass through stacked attention + MLP blocks
The crucial point: attention doesn’t operate on token IDs; it operates on vectors.
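The four steps can be sketched directly (toy sizes, assumed values for E and P; real models learn both):

```python
# Simplified Transformer input step: H0[b][t] = E[X[b][t]] + P[t].
V, d, T = 4, 2, 3
E = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]]  # token embeddings
P = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]               # one position vector per t
X = [[2, 0, 3]]                                        # B = 1 sequence of T = 3 IDs
H0 = [[[E[ids[t]][c] + P[t][c] for c in range(d)] for t in range(T)] for ids in X]
# H0[0][0] is e_2 + p_0 = [2.1, 2.1]; H0 would now enter the attention blocks
assert all(abs(v - 2.1) < 1e-9 for v in H0[0][0])
```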
Token embeddings alone do not encode order. The tokens “dog bites man” vs “man bites dog” would be the same multiset of embeddings.
Transformers typically add a positional encoding/embedding P ∈ ℝ^(T×d) (or learned position embeddings) so each position t has its own vector pₜ.
Then: H₀ = H_tok + P, i.e. the vector at position t becomes eₜ + pₜ.
This simple addition works because both are in ℝᵈ.
Once you have embeddings and pass them into attention, the model computes attention scores between positions.
But you often must prevent attention to: padding positions (which carry no content) and, in causal language models, future positions.
Masking operates on attention score matrices, not on E directly—but embeddings are what make attention possible in the first place.
The same idea applies broadly: user and item IDs in recommender systems, categorical features in tabular models, nodes in graph networks, characters, and any other discrete symbol set.
Whenever you have a discrete symbol set, an embedding matrix is a standard first tool.
A minimal checklist: fix the vocabulary size V and reserve special token IDs (PAD/UNK/BOS/EOS); choose d to match the model’s hidden size (or add a projection); pick an initialization scale; decide whether to tie, pretrain, or freeze E; and make sure PAD positions are masked.
These decisions will directly affect training stability and correctness.
Let V = 5 and d = 3. Suppose the embedding matrix is
E =
[ [ 1, 0, 2],
[ 0, 1, 0],
[-1, 1, 1],
[ 2, 2, 2],
[ 0, -1, 3] ]
Token ID i = 2 (0-indexed). Compute e₂ via lookup and via one-hot multiplication.
Lookup definition:
e₂ = E[2]
So e₂ = [−1, 1, 1].
Construct one-hot x₂ ∈ ℝ^5:
x₂ = [0, 0, 1, 0, 0].
Compute x₂ᵀ E:
x₂ᵀ E = 0·E[0] + 0·E[1] + 1·E[2] + 0·E[3] + 0·E[4]
= E[2]
= [−1, 1, 1].
Conclusion:
The matrix formula eᵢ = xᵢᵀ E is exactly row selection in disguise.
Insight: Thinking in one-hot form explains the math, but thinking in lookup form explains the efficiency: only one row is needed, so gradients and updates also stay sparse over rows.
Toy model: z = Weᵢ (ignore bias). Let d = 2 and W =
[ [2, 0],
[0, 1] ]
Assume token i appears, with current embedding eᵢ = [1, −1]. Suppose backprop gives ∂L/∂z = [3, 4]. Compute ∂L/∂eᵢ and do one gradient step with η = 0.1.
We have z = Weᵢ. By chain rule:
∂L/∂eᵢ = Wᵀ (∂L/∂z).
Compute Wᵀ. Here W is diagonal (hence symmetric), so Wᵀ = W:
Wᵀ =
[ [2, 0],
[0, 1] ].
Multiply:
∂L/∂eᵢ = Wᵀ [3, 4]ᵀ
= [ 2·3 + 0·4,
0·3 + 1·4 ]
= [6, 4].
Gradient descent update:
eᵢ ← eᵢ − η (∂L/∂eᵢ)
= [1, −1] − 0.1·[6, 4]
= [1 − 0.6, −1 − 0.4]
= [0.4, −1.4].
Interpretation:
Only the embedding for token i is updated by this example; other token rows E[j] for j ≠ i are unchanged (for this single-token toy batch).
Insight: Embedding training is ordinary parameter learning; the only special feature is sparsity over vocabulary rows: you update the rows you looked up.
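The arithmetic above can be checked numerically; this sketch just replays the worked example:

```python
# Verify the worked example: dL/de = W^T dL/dz, then one SGD step.
W = [[2.0, 0.0],
     [0.0, 1.0]]
e = [1.0, -1.0]
dLdz = [3.0, 4.0]
eta = 0.1

dLde = [sum(W[r][c] * dLdz[r] for r in range(2)) for c in range(2)]  # W^T dL/dz
assert dLde == [6.0, 4.0]

e_new = [e[c] - eta * dLde[c] for c in range(2)]                     # e - eta * grad
assert all(abs(x - y) < 1e-9 for x, y in zip(e_new, [0.4, -1.4]))
```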
Suppose you have two token embeddings:
a = [2, 0, 1]
b = [1, 1, 0]
Compute dot product and cosine similarity.
Dot product:
a·b = 2·1 + 0·1 + 1·0 = 2.
Norms:
‖a‖ = √(2² + 0² + 1²) = √(4 + 0 + 1) = √5.
‖b‖ = √(1² + 1² + 0²) = √2.
Cosine similarity:
cos(a, b) = (a·b) / (‖a‖‖b‖)
= 2 / (√5 · √2)
= 2 / √10
≈ 0.632.
Insight: Cosine similarity normalizes away vector length, which is useful because embedding norms can vary for reasons unrelated to meaning (frequency, training dynamics, regularization).
Token embeddings map discrete token IDs to continuous vectors eᵢ ∈ ℝᵈ that neural networks can process.
All token vectors live in an embedding matrix E ∈ ℝ^(V×d); lookup retrieves a row: eᵢ = E[i].
The one-hot equation eᵢ = xᵢᵀ E is conceptually helpful, but implementations use efficient indexing (no dense V-dimensional one-hot).
Embeddings are model parameters: they are initialized (often randomly) and updated by backprop; only rows used in a batch receive gradients.
Embedding dimensionality d controls representational capacity, parameter count (V·d), and activation/compute cost (B·T·d).
Vector-space similarity (dot product, cosine similarity) provides a way to analyze learned relations between tokens, though it’s not guaranteed to be purely “semantic.”
In Transformers, token embeddings are the first step; they’re typically combined with positional information before attention layers.
Treating token IDs as numeric features (e.g., feeding raw IDs into a network) instead of using an embedding lookup.
Mismatching dimensions: choosing embedding d that doesn’t match the model’s expected hidden size (or forgetting a projection layer if you intentionally differ).
Forgetting special tokens: not reserving IDs for PAD/UNK/BOS/EOS or accidentally allowing PAD embeddings to influence training without masking.
Assuming embedding similarity always equals human-interpretable semantic similarity; training objective and data distribution strongly shape the geometry.
You have vocabulary size V = 10,000 and embedding dimensionality d = 256.
1) How many parameters are in E?
2) If parameters are stored as 32-bit floats, about how many megabytes does E take (ignore overhead)?
Hint: params = V·d. Memory ≈ params · 4 bytes. 1 MB ≈ 10⁶ bytes (roughly).
1) params(E) = 10,000 · 256 = 2,560,000.
2) Memory ≈ 2,560,000 · 4 = 10,240,000 bytes ≈ 10.24 MB (about 10 MB).
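The same arithmetic in code:

```python
# Parameter count and fp32 memory for V = 10,000, d = 256.
V, d = 10_000, 256
params = V * d                  # one float per (token, dimension) pair
mem_bytes = params * 4          # 4 bytes per 32-bit float
print(params)                   # 2560000
print(mem_bytes / 1e6, "MB")    # 10.24 MB
```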
Let E ∈ ℝ^(4×2) be
E =
[ [ 0, 1],
[ 2, 0],
[−1, 3],
[ 4, −2] ]
A sequence of token IDs is [3, 1, 1, 0]. Write down the corresponding embedding vectors in order, and identify which rows of E are reused.
Hint: Lookup means E[i] is the i-th row. Reuse happens when the same ID appears multiple times.
Embeddings:
ID 3 → E[3] = [4, −2]
ID 1 → E[1] = [2, 0]
ID 1 → E[1] = [2, 0]
ID 0 → E[0] = [0, 1]
Row reuse: row 1 is reused (appears twice).
Suppose you are training a model and you notice that very rare tokens have poorly learned embeddings.
Give two strategies (modeling or preprocessing) that can help, and briefly explain why each helps.
Hint: Think about how often a token gets gradient updates and how tokenization affects frequency. Also consider parameter sharing or regularization.
Two helpful strategies:
1) Use subword tokenization (BPE/WordPiece): rare words are decomposed into more frequent pieces, so the model learns embeddings for pieces with more updates, improving generalization to rare words.
2) Tie embeddings or use pretrained initialization: tying input/output embeddings shares statistical strength; pretrained embeddings (or starting from a pretrained LM) give rare tokens a better starting position in vector space, reducing the amount of task data needed to shape them.
(Other valid ideas include increasing data, using adaptive/hashed embeddings, or regularizing/averaging embeddings for low-frequency tokens.)