A measure of similarity between two vectors defined as the cosine of the angle between them (dot product normalized by norms); used as an attention scoring function and for comparing embeddings. It highlights direction-based similarity independent of vector magnitude.
When you compare two vectors, you often care less about how big they are and more about whether they “point” in the same direction. Cosine similarity is the standard tool for measuring that directional agreement—and it shows up everywhere from search and embeddings to attention scores in transformers.
Cosine similarity between two nonzero vectors a and b is

$$\mathrm{cosSim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}.$$
It equals the cosine of the angle between them: 1 means same direction, 0 means orthogonal (no directional alignment), −1 means opposite direction. It’s magnitude-invariant (scaling a vector doesn’t change it), which makes it ideal for comparing embeddings and as an attention scoring function. Be careful: both vectors must be nonzero, and “cosine distance = 1 − cosSim” is commonly used but is not a true metric in general (triangle inequality can fail).
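To make the definition concrete, here is a minimal sketch in plain Python (the function name `cos_sim` is ours, not from any particular library):

```python
import math

def cos_sim(a, b):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

print(cos_sim((3, 4), (4, 0)))    # 0.6: acute angle, moderate alignment
print(cos_sim((1, 0), (0, 1)))    # 0.0: orthogonal
print(cos_sim((1, 2), (-1, -2)))  # ≈ -1.0: opposite direction
```

Note the explicit guard for zero vectors: the math leaves that case undefined, so the code has to pick a policy (here, raising an error).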
This node is meant to be foundational, but cosine similarity does assume a few micro-skills. Here’s a compact checklist.
| Concept | Meaning | Formula / Note |
|---|---|---|
| Vector a | Ordered list of numbers | a = (a₁, a₂, …, aₙ) |
| Dot product | Measures alignment via componentwise multiplication | $\mathbf{a}\cdot\mathbf{b}=\sum_{i=1}^n a_i b_i$ |
| Euclidean norm | Vector "length" (magnitude) | $\lVert\mathbf{a}\rVert = \sqrt{\sum_{i=1}^n a_i^2}$ |
| Angle interpretation | In geometry, the dot product relates to cos(angle) | $\mathbf{a}\cdot\mathbf{b}=\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert\cos\theta$ |
| Nonzero requirement | Cosine similarity divides by norms | Need $\lVert\mathbf{a}\rVert > 0$ and $\lVert\mathbf{b}\rVert > 0$ |
If those boxes feel unfamiliar, you can still proceed—just revisit them when the formulas appear.
Suppose you have two vectors representing items, such as two document embeddings or two user-preference vectors.
Often, the direction encodes the “type” or “meaning,” while the length might reflect confidence, frequency, or just the model’s internal scaling.
If we used only the dot product to score similarity, then simply making vectors longer (larger magnitude) would inflate the score—even if the direction stayed the same. That can be undesirable when you want comparisons to be about alignment.
Cosine similarity fixes this by normalizing out the magnitudes.
For two nonzero vectors a and b in ℝⁿ, cosine similarity is

$$\mathrm{cosSim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}.$$
There is a key identity connecting dot product and angle:

$$\mathbf{a}\cdot\mathbf{b} = \lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert\cos\theta,$$

where θ is the angle between a and b (in the usual Euclidean geometry). Rearranging gives

$$\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert} = \mathrm{cosSim}(\mathbf{a}, \mathbf{b}).$$
So cosine similarity literally is the cosine of that angle.
Because cosine ranges between −1 and 1:

- a value near 1 means nearly the same direction,
- a value near 0 means nearly orthogonal,
- a value near −1 means nearly opposite directions.
In many embedding systems (especially after certain training setups), values are often mostly positive, but mathematically the full range [−1, 1] is possible.
The formula divides by $\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert$. If either vector is the zero vector, cosine similarity is undefined.
In practice, systems either:

- reject or skip zero vectors before scoring, or
- guard the computation with a small ε in the denominator.
The dot product is

$$\mathbf{a}\cdot\mathbf{b} = \sum_{i=1}^n a_i b_i.$$
It increases when:
1) components match in sign and are large in magnitude, and
2) the vectors point in similar directions.
But here's the catch: if you scale a by a constant c > 0, then

$$(c\mathbf{a})\cdot\mathbf{b} = c\,(\mathbf{a}\cdot\mathbf{b}).$$
So dot product is not scale-invariant.
Let, for example,

$$\mathbf{a} = (1, 1), \qquad \mathbf{b} = (2, 2), \qquad \mathbf{c} = (4, 4).$$

These all point in the same direction (the 45° line). Dot products:

$$\mathbf{a}\cdot\mathbf{b} = 1\cdot 2 + 1\cdot 2 = 4, \qquad \mathbf{a}\cdot\mathbf{c} = 1\cdot 4 + 1\cdot 4 = 8.$$
The second looks “more similar” by dot product, but b and c are equally aligned with a—they just have different lengths.
Cosine similarity divides by both lengths:

$$\mathrm{cosSim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert},$$

so b and c, which point the same way as a, both score exactly 1.
Now see what happens if we scale a by a constant c > 0:
Show your work (scale invariance)
Let a′ = ca with c > 0.

1) Dot product scales:

$$\mathbf{a}'\cdot\mathbf{b} = (c\mathbf{a})\cdot\mathbf{b} = c\,(\mathbf{a}\cdot\mathbf{b}).$$

2) Norm scales:

$$\lVert\mathbf{a}'\rVert = \lVert c\mathbf{a}\rVert = c\,\lVert\mathbf{a}\rVert.$$

3) Plug into cosine similarity:

$$\mathrm{cosSim}(\mathbf{a}', \mathbf{b}) = \frac{c\,(\mathbf{a}\cdot\mathbf{b})}{c\,\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert} = \mathrm{cosSim}(\mathbf{a}, \mathbf{b}).$$

So cosine similarity does not change when you scale one vector by a positive constant.
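The derivation is easy to check numerically. This is a sketch with a local `cos_sim` helper and arbitrary example vectors of our choosing:

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

a, b = (1.0, 2.0, 3.0), (4.0, 0.0, -1.0)
c = 7.5  # any positive constant

scaled_a = tuple(c * x for x in a)
# The dot product scales by c, but cosine similarity is unchanged
# (up to floating-point rounding).
print(cos_sim(a, b))
print(cos_sim(scaled_a, b))
```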
Define normalized (unit-length) vectors:

$$\hat{\mathbf{a}} = \frac{\mathbf{a}}{\lVert\mathbf{a}\rVert}, \qquad \hat{\mathbf{b}} = \frac{\mathbf{b}}{\lVert\mathbf{b}\rVert}.$$

Then

$$\mathrm{cosSim}(\mathbf{a}, \mathbf{b}) = \hat{\mathbf{a}}\cdot\hat{\mathbf{b}}.$$

This is a powerful mental model: cosine similarity is just the dot product after projecting both vectors onto the unit sphere.
A key guarantee is that cosine similarity is always between −1 and 1.
Cauchy–Schwarz says:

$$|\mathbf{a}\cdot\mathbf{b}| \le \lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert.$$

Divide both sides by $\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert$ (nonzero assumption):

$$\frac{|\mathbf{a}\cdot\mathbf{b}|}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert} \le 1.$$

So

$$-1 \le \mathrm{cosSim}(\mathbf{a}, \mathbf{b}) \le 1.$$
This boundedness is one reason cosine similarity is numerically and conceptually convenient.
Because $\mathrm{cosSim}(\mathbf{a}, \mathbf{b}) = \cos\theta$, it inherits the cosine curve's behavior: the score changes slowly near θ = 0° and θ = 180°, and fastest near θ = 90°.
This is slightly different from many intuitive “distance” notions.
In some applications, negative similarity has a clear meaning: the vectors point in genuinely opposing directions, which may encode contrast or opposition.
In other applications (e.g., some retrieval systems), negative values might just be treated as “not similar” and thresholded away.
If both vectors are normalized to unit length, then cosine similarity and Euclidean distance are tightly connected.
Let a and b be unit vectors. Consider the squared Euclidean distance $\lVert\mathbf{a} - \mathbf{b}\rVert^2$:
Show your work
Expand:

$$\lVert\mathbf{a}-\mathbf{b}\rVert^2 = \lVert\mathbf{a}\rVert^2 - 2\,\mathbf{a}\cdot\mathbf{b} + \lVert\mathbf{b}\rVert^2.$$

Since both are unit length:

$$\lVert\mathbf{a}-\mathbf{b}\rVert^2 = 2 - 2\,(\mathbf{a}\cdot\mathbf{b}) = 2 - 2\,\mathrm{cosSim}(\mathbf{a}, \mathbf{b}).$$

Rearrange:

$$\mathrm{cosSim}(\mathbf{a}, \mathbf{b}) = 1 - \frac{\lVert\mathbf{a}-\mathbf{b}\rVert^2}{2}.$$
So on the unit sphere, cosine similarity is basically a monotonic transformation of Euclidean distance.
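The identity is easy to verify numerically. This sketch normalizes two arbitrary vectors of our choosing and checks cosSim = 1 − ‖a − b‖²/2 on the unit sphere (helper names are ours):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

a = normalize((3.0, -1.0, 2.0))
b = normalize((1.0, 4.0, 0.5))

cos_sim = sum(x * y for x, y in zip(a, b))
sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))

# On the unit sphere: cosSim(a, b) == 1 - ||a - b||^2 / 2
print(cos_sim, 1 - sq_dist / 2)
```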
A frequently used "distance-like" quantity is

$$d(\mathbf{a}, \mathbf{b}) = 1 - \mathrm{cosSim}(\mathbf{a}, \mathbf{b}).$$
This has nice properties (0 when vectors match in direction, bigger when they differ), but it is not always a metric on ℝⁿ. In particular, the triangle inequality may fail.
Why that matters: algorithms and data structures that rely on the triangle inequality (metric trees, certain nearest-neighbor bounds, some clustering guarantees) can silently lose their correctness guarantees.

Practical takeaway: it's fine to use $1 - \mathrm{cosSim}$ as a heuristic distance for ranking and optimization, but don't automatically assume metric guarantees unless you've checked the conditions for your specific setting.
When vectors are very small or nearly zero, the denominator can be tiny.
Common fix:

$$\mathrm{cosSim}_\varepsilon(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{\max\!\left(\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert,\ \varepsilon\right)},$$

or add ε inside the product. The exact choice depends on your numerical environment and expectations about zero vectors.
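One way to implement the ε-stabilized denominator; this is a sketch, and both the `max(..., eps)` form and the default value of `eps` are our choices:

```python
import math

def safe_cos_sim(a, b, eps=1e-8):
    # Clamp the denominator away from zero so zero (or tiny) vectors
    # return 0 similarity instead of raising or producing NaN/inf.
    dot = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / max(denom, eps)

print(safe_cos_sim((0.0, 0.0), (1.0, 2.0)))  # 0.0 instead of an error
print(safe_cos_sim((3.0, 4.0), (4.0, 0.0)))  # ≈ 0.6, as for the plain formula
```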
Embeddings map discrete objects (words, items, images) to vectors. A core assumption is: objects with similar meaning map to vectors pointing in similar directions.
Cosine similarity is a natural fit because it ignores overall magnitude. This is especially helpful when vector norms vary due to token frequency, training dynamics, or arbitrary internal scaling.
Typical retrieval pipeline:

1) embed the query and all candidate items,
2) score each candidate by cosine similarity with the query,
3) return the top-k highest-scoring items.
Often, systems normalize all embeddings once (store unit vectors), making scoring just a dot product.
In attention mechanisms, we produce a query vector q for each position, and key and value vectors k, v for every position it can attend to.
A general attention score between a query and key can be any similarity function. One simple option is cosine similarity:

$$\mathrm{score}(\mathbf{q}, \mathbf{k}) = \mathrm{cosSim}(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q}\cdot\mathbf{k}}{\lVert\mathbf{q}\rVert\,\lVert\mathbf{k}\rVert}.$$
Then attention weights are typically obtained with a softmax over the scores:

$$\alpha_i = \frac{\exp(\mathrm{score}(\mathbf{q}, \mathbf{k}_i))}{\sum_j \exp(\mathrm{score}(\mathbf{q}, \mathbf{k}_j))}.$$
In modern transformers, the most common is scaled dot-product attention:

$$\mathrm{score}(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q}\cdot\mathbf{k}}{\sqrt{d_k}},$$

where $d_k$ is the key dimension.
Why not always cosine? In standard attention, the magnitudes of queries and keys carry learned information, forcing unit norms removes that degree of freedom, and the √d_k scaling already keeps score variance under control.
That said, cosine attention variants exist and can be helpful in some regimes.
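To contrast the two scoring choices, here is a toy sketch of a single attention step with both scaled dot-product and cosine scores (pure Python, no framework; the vectors and all names are our illustrative choices):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

q = (1.0, 2.0)
keys = [(2.0, 4.0), (10.0, 20.0), (-1.0, 1.0)]
d_k = len(q)

# Scaled dot-product scores: sensitive to key magnitude.
sdp = [dot(q, k) / math.sqrt(d_k) for k in keys]
# Cosine scores: magnitude-invariant, bounded in [-1, 1].
cos = [dot(q, k) / (norm(q) * norm(k)) for k in keys]

print(softmax(sdp))  # weight concentrates on the long key (10, 20)
print(softmax(cos))  # the two equally aligned keys get equal weight
```

Note that the first two keys point in exactly the same direction as q, so cosine scoring treats them identically, while scaled dot-product strongly prefers the longer one.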
| Scoring | Formula | Sensitive to vector length? | Common use |
|---|---|---|---|
| Dot product | $\mathbf{a}\cdot\mathbf{b}$ | Yes | Many attention layers; fast retrieval if norms are controlled |
| Cosine similarity | $(\mathbf{a}\cdot\mathbf{b})/(\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert)$ | No (direction only) | Embedding similarity; some attention variants |
| Euclidean distance | $\lVert\mathbf{a}-\mathbf{b}\rVert$ | Yes (unless normalized) | Clustering; geometry on unit sphere connects it to cosine |
Because

$$\mathrm{cosSim}(\mathbf{a}, \mathbf{b}) = \hat{\mathbf{a}}\cdot\hat{\mathbf{b}},$$

a very common engineering trick is:

1) store $\hat{\mathbf{a}} = \mathbf{a}/\lVert\mathbf{a}\rVert$ for every embedding,
2) compute similarity as a plain dot product.
This can speed up retrieval (especially with vector databases / ANN indices) because dot products are highly optimized.
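A sketch of that trick: normalize embeddings once at index time, then rank by plain dot product at query time. The toy in-memory "index", the document names, and the helper functions are all ours:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

# Index time: normalize every embedding once and store the unit vectors.
corpus = {
    "doc_a": (3.0, 4.0),
    "doc_b": (4.0, 0.0),
    "doc_c": (-2.0, -2.0),
}
index = {name: normalize(v) for name, v in corpus.items()}

def search(query, top_k=2):
    q = normalize(query)
    # Scoring is now just a dot product == cosine similarity.
    scores = {name: sum(x * y for x, y in zip(q, v)) for name, v in index.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

print(search((1.0, 1.0)))  # doc_a ranks first, doc_c (opposite direction) last
```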
Let a = (3, 4) and b = (4, 0). Compute cosSim(a, b) and interpret the result.
Compute the dot product:

$$\mathbf{a}\cdot\mathbf{b} = 3\cdot 4 + 4\cdot 0 = 12.$$

Compute the norms:

$$\lVert\mathbf{a}\rVert = \sqrt{3^2 + 4^2} = 5, \qquad \lVert\mathbf{b}\rVert = \sqrt{4^2 + 0^2} = 4.$$

Plug into cosine similarity:

$$\mathrm{cosSim}(\mathbf{a}, \mathbf{b}) = \frac{12}{5 \cdot 4} = 0.6.$$

Interpretation:

A cosine similarity of 0.6 means the vectors form an acute angle with moderate alignment (since 1 would be perfectly aligned, 0 would be perpendicular). If you want the angle explicitly:

$$\theta = \arccos(0.6) \approx 53.13^\circ.$$
Insight: Cosine similarity turned the raw dot product (12) into a bounded, scale-free score (0.6). The value directly corresponds to an angle, which is a clean geometric notion of similarity.
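Checking this arithmetic in plain Python:

```python
import math

a, b = (3.0, 4.0), (4.0, 0.0)
dot = sum(x * y for x, y in zip(a, b))     # 12.0
norm_a = math.sqrt(sum(x * x for x in a))  # 5.0
norm_b = math.sqrt(sum(x * x for x in b))  # 4.0
cos = dot / (norm_a * norm_b)              # 0.6
angle_deg = math.degrees(math.acos(cos))   # ≈ 53.13 degrees
print(cos, angle_deg)
```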
Let q = (1, 2). Compare x₁ = (2, 4) and x₂ = (10, 20) by cosine similarity with q.
Notice that x₁ = 2q and x₂ = 10q, so all three vectors point in the same direction.
Compute cosSim(q, x₁):

Dot: $\mathbf{q}\cdot\mathbf{x}_1 = 1\cdot 2 + 2\cdot 4 = 10.$

Norms: $\lVert\mathbf{q}\rVert = \sqrt{5}$, $\lVert\mathbf{x}_1\rVert = \sqrt{4 + 16} = 2\sqrt{5}.$

Cosine: $\mathrm{cosSim}(\mathbf{q}, \mathbf{x}_1) = \dfrac{10}{\sqrt{5}\cdot 2\sqrt{5}} = \dfrac{10}{10} = 1.$

Compute cosSim(q, x₂) similarly:

Dot: $\mathbf{q}\cdot\mathbf{x}_2 = 1\cdot 10 + 2\cdot 20 = 50.$

Norms: $\lVert\mathbf{q}\rVert = \sqrt{5}$, $\lVert\mathbf{x}_2\rVert = \sqrt{100 + 400} = 10\sqrt{5}.$

Cosine: $\mathrm{cosSim}(\mathbf{q}, \mathbf{x}_2) = \dfrac{50}{\sqrt{5}\cdot 10\sqrt{5}} = \dfrac{50}{50} = 1.$

Compare with dot products:

$$\mathbf{q}\cdot\mathbf{x}_1 = 10, \qquad \mathbf{q}\cdot\mathbf{x}_2 = 50.$$
Dot product prefers x₂ purely because it is longer, while cosine similarity treats them as equally aligned.
Insight: Cosine similarity answers: “Do these vectors point the same way?” Dot product answers: “Are these vectors aligned and large?” That difference is exactly why cosine is popular for embedding comparisons.
Let a = (0, 0, 0) and b = (1, −2, 3). Try to compute cosSim(a, b).
Compute the dot product:

$$\mathbf{a}\cdot\mathbf{b} = 0\cdot 1 + 0\cdot(-2) + 0\cdot 3 = 0.$$

Compute the norms:

$$\lVert\mathbf{a}\rVert = 0, \qquad \lVert\mathbf{b}\rVert = \sqrt{1 + 4 + 9} = \sqrt{14}.$$

Plug into the formula:

$$\mathrm{cosSim}(\mathbf{a}, \mathbf{b}) = \frac{0}{0 \cdot \sqrt{14}} = \frac{0}{0},$$

which is undefined.
Practical resolution: if your system might produce zero vectors, you must decide on a policy — reject them, treat the similarity as 0, renormalize differently, or use an ε-stabilized denominator.
Insight: Cosine similarity is about direction, but the zero vector has no direction. The undefined division is not a nuisance—it reflects a real geometric ambiguity.
Cosine similarity measures directional alignment: $$\mathrm{cosSim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}.$$
It requires nonzero vectors; the zero vector has no direction, so cosine similarity is undefined with it.
Cosine similarity is scale-invariant: multiplying a vector by a positive constant does not change the score.
Values interpret cleanly: 1 (same direction), 0 (orthogonal), −1 (opposite direction).
If you normalize vectors to unit length, cosine similarity becomes just a dot product: $\mathrm{cosSim}(\mathbf{a}, \mathbf{b}) = \hat{\mathbf{a}}\cdot\hat{\mathbf{b}}$.
Cauchy–Schwarz guarantees the score lies in [−1, 1].
"Cosine distance" defined as $1 - \mathrm{cosSim}(\mathbf{a}, \mathbf{b})$ is widely used but is not guaranteed to be a metric (the triangle inequality may fail).
Cosine similarity is common for comparing embeddings and can serve as an attention scoring function when magnitude should be ignored.
Forgetting the nonzero requirement and attempting to compute cosine similarity with a zero vector (division by zero / undefined direction).
Using dot product as if it were cosine similarity (confusing “large magnitude” with “high similarity”).
Assuming cosine distance is always a true metric and using it in algorithms that require triangle inequality guarantees.
Interpreting cosine similarity as a probability or as bounded to [0, 1] without justification (it can be negative).
Compute cosSim(a, b) for a = (1, −1, 2) and b = (2, 0, 1).
Hint: Compute the dot product and each norm separately, then divide. Keep radicals until the end if you want exact form.
Dot:

$$\mathbf{a}\cdot\mathbf{b} = 1\cdot 2 + (-1)\cdot 0 + 2\cdot 1 = 4.$$

Norms:

$$\lVert\mathbf{a}\rVert = \sqrt{1 + 1 + 4} = \sqrt{6}, \qquad \lVert\mathbf{b}\rVert = \sqrt{4 + 0 + 1} = \sqrt{5}.$$

Cosine similarity:

$$\mathrm{cosSim}(\mathbf{a}, \mathbf{b}) = \frac{4}{\sqrt{6}\,\sqrt{5}} = \frac{4}{\sqrt{30}} \approx 0.730.$$
Show that cosSim(a, b) = cosSim(3a, 0.5b) for any nonzero vectors a, b.
Hint: Use how dot products and norms scale under scalar multiplication: (ca)·(db) and \|ca\|.
Let a' = 3a and b' = 0.5b.
Dot scales:

$$(3\mathbf{a})\cdot(0.5\mathbf{b}) = 1.5\,(\mathbf{a}\cdot\mathbf{b}).$$

Norms scale:

$$\lVert 3\mathbf{a}\rVert = 3\,\lVert\mathbf{a}\rVert, \qquad \lVert 0.5\mathbf{b}\rVert = 0.5\,\lVert\mathbf{b}\rVert.$$

Cosine similarity:

$$\mathrm{cosSim}(\mathbf{a}', \mathbf{b}') = \frac{1.5\,(\mathbf{a}\cdot\mathbf{b})}{3\,\lVert\mathbf{a}\rVert \cdot 0.5\,\lVert\mathbf{b}\rVert} = \frac{1.5\,(\mathbf{a}\cdot\mathbf{b})}{1.5\,\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert} = \mathrm{cosSim}(\mathbf{a}, \mathbf{b}).$$
Assume \|a\| = \|b\| = 1 (unit vectors). If \|a − b\| = 0.8, compute cosSim(a, b).
Hint: Use the identity \|a − b\|² = 2 − 2(a·b) when both vectors are unit length.
Given unit vectors, we have the identity:

$$\lVert\mathbf{a}-\mathbf{b}\rVert^2 = 2 - 2\,(\mathbf{a}\cdot\mathbf{b}).$$

Compute squared distance:

$$\lVert\mathbf{a}-\mathbf{b}\rVert^2 = 0.8^2 = 0.64.$$

So:

$$0.64 = 2 - 2\,(\mathbf{a}\cdot\mathbf{b}).$$

Solve:

$$\mathbf{a}\cdot\mathbf{b} = \frac{2 - 0.64}{2} = 0.68.$$

Since unit vectors satisfy $\mathrm{cosSim}(\mathbf{a}, \mathbf{b}) = \mathbf{a}\cdot\mathbf{b}$,

$$\mathrm{cosSim}(\mathbf{a}, \mathbf{b}) = 0.68.$$
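A quick numeric sanity check of this exercise: construct two unit vectors exactly 0.8 apart and verify the cosine similarity (the symmetric construction below is our choice):

```python
import math

# Two unit vectors at distance 0.8: a = (c, s), b = (c, -s) with 2s = 0.8.
s = 0.4
c = math.sqrt(1 - s * s)
a, b = (c, s), (c, -s)

dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))  # 0.8
cos = sum(x * y for x, y in zip(a, b))                     # c^2 - s^2 = 0.68
print(dist, cos)
```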