Vector length/magnitude. L1, L2 (Euclidean), Linf norms.
If you can measure “how far” a vector is from zero, you can compare directions, choose the nearest point, control model complexity, and reason about geometry. Norms are the standard way to do that—and different norms create different notions of distance and “closeness.”
A norm ‖v‖ assigns a nonnegative length to a vector, with three rules: (1) it’s zero only for the zero vector, (2) scaling scales length by |α|, and (3) it obeys the triangle inequality. The most common are ‖v‖₁ (sum of absolute values), ‖v‖₂ (Euclidean), and ‖v‖∞ (max absolute component). Different norms change geometry and behavior in algorithms like clustering.
In 2D or 3D, “length” feels obvious: draw an arrow from the origin to a point and measure how long it is. But in computer science and machine learning you constantly work in higher-dimensional spaces (feature vectors, embeddings, gradients) where no arrow can be drawn.
In those settings, you still need a rigorous way to answer questions like: How large is this vector? How far apart are these two points? Which candidate is nearest?
A norm is the mathematical object that turns these questions into consistent computations.
A norm is a function that maps a vector to a nonnegative real number:
‖·‖ : ℝⁿ → ℝ
It must satisfy three properties for all vectors u, v and all scalars α:
1) Positive-definiteness
‖v‖ ≥ 0, and ‖v‖ = 0 ⇔ v = 0
Intuition: length can’t be negative; the only vector with zero length is the zero vector.
2) Absolute homogeneity
‖αv‖ = |α| ‖v‖
Intuition: scaling an arrow by 3 makes it 3× longer; scaling by −3 flips direction but length is still 3×.
3) Triangle inequality
‖u + v‖ ≤ ‖u‖ + ‖v‖
Intuition: walking directly along u + v is never longer than taking the two legs u and then v separately; a detour can’t beat the straight path.
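The three axioms can be spot-checked numerically. Here is a small sketch using NumPy (`np.linalg.norm` computes the Euclidean norm by default); a numerical check, not a proof:

```python
# Spot-check the three norm axioms on random vectors, using the L2 norm
# (np.linalg.norm's default).
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=5), rng.normal(size=5)
alpha = -3.0

# (1) Positive-definiteness: nonnegative, and zero only for the zero vector.
assert np.linalg.norm(v) > 0
assert np.linalg.norm(np.zeros(5)) == 0.0

# (2) Absolute homogeneity: scaling by alpha scales the length by |alpha|.
assert np.isclose(np.linalg.norm(alpha * v), abs(alpha) * np.linalg.norm(v))

# (3) Triangle inequality: two legs are never shorter than the direct path.
assert np.linalg.norm(u + v) <= np.linalg.norm(u) + np.linalg.norm(v)
```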
You said you already know the dot product. Great—because in ℝⁿ the Euclidean norm is closely tied to it:
‖v‖₂ = √(v · v)
But norms are more general than dot products: you can have norms that don’t come from dot products (like ‖·‖₁ and ‖·‖∞). That flexibility is useful: different norms encode different ideas of what it means to be “close,” “large,” or “small.”
A norm automatically defines a distance (a metric) between two vectors:
d(x, y) = ‖x − y‖
So if you choose a norm, you also choose a geometry for your space.
The most used family is the p-norms:
‖v‖ₚ = (∑ᵢ |vᵢ|ᵖ)¹ᐟᵖ, for p ≥ 1
Special cases you’ll use constantly: p = 1 gives ‖v‖₁, p = 2 gives ‖v‖₂, and the limit p → ∞ gives ‖v‖∞.
Each is a valid norm, satisfies the three properties, and leads to a different notion of “ball” and “nearest.”
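The p-norm formula translates directly into code. A from-scratch sketch, assuming finite p ≥ 1; the helper names `p_norm` and `inf_norm` are ours, not a library API:

```python
# A minimal p-norm implementation for finite p >= 1, plus the p -> infinity
# limit as a separate function.
def p_norm(v, p):
    """(sum_i |v_i|^p)^(1/p) for finite p >= 1."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def inf_norm(v):
    """The p -> infinity limit: the largest coordinate magnitude."""
    return max(abs(x) for x in v)

v = [3, -4, 12]
print(p_norm(v, 1))   # 19.0 (L1)
print(p_norm(v, 2))   # 13.0 (L2)
print(inf_norm(v))    # 12   (L-infinity)
```

In practice you would use `np.linalg.norm(v, ord=p)`, which covers the same cases.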
When you implement algorithms, you want norms that are cheap to compute, numerically stable, and easy to interpret.
‖·‖₁, ‖·‖₂, and ‖·‖∞ are the standard trio because they emphasize different aspects of a vector:
Let v = (v₁, v₂, …, vₙ).
L1 norm
‖v‖₁ = ∑ᵢ |vᵢ|
Interpretation: add up the absolute contributions of every coordinate.
L2 (Euclidean) norm
‖v‖₂ = √(∑ᵢ vᵢ²) = √(v · v)
Interpretation: the usual straight-line length.
L∞ norm
‖v‖∞ = maxᵢ |vᵢ|
Interpretation: how large the largest-magnitude coordinate is.
| Norm | Formula | What it emphasizes | Typical use |
|---|---|---|---|
| ‖v‖₁ | ∑ᵢ \|vᵢ\| | overall “mass” across coordinates | sparsity (L1 regularization), Manhattan distance |
| ‖v‖₂ | √(∑ᵢ vᵢ²) | geometric length, energy | geometry, least squares, k-means default |
| ‖v‖∞ | maxᵢ \|vᵢ\| | worst coordinate | constraint bounds, robust tolerances |
A great way to feel norms is to look at their unit balls in 2D: the set of vectors whose norm is ≤ 1. The L2 unit ball is a disk, the L1 unit ball is a diamond (a square rotated 45°), and the L∞ unit ball is an axis-aligned square.
This matters because “closest point” problems (like clustering) depend on the shape of these balls.
If α is a scalar and v is a vector, all norms must satisfy:
‖αv‖ = |α| ‖v‖
For p-norms, you can see it directly:
‖αv‖ₚ
= (∑ᵢ |α vᵢ|ᵖ)¹ᐟᵖ
= (∑ᵢ (|α| |vᵢ|)ᵖ)¹ᐟᵖ
= (∑ᵢ |α|ᵖ |vᵢ|ᵖ)¹ᐟᵖ
= (|α|ᵖ ∑ᵢ |vᵢ|ᵖ)¹ᐟᵖ
= |α| (∑ᵢ |vᵢ|ᵖ)¹ᐟᵖ
= |α| ‖v‖ₚ
So p-norms automatically obey one of the key norm axioms.
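The derivation above can be checked numerically for the three norms we care about, using NumPy’s `ord` parameter to select the norm:

```python
# Numeric check of absolute homogeneity, ||alpha*v||_p = |alpha| * ||v||_p,
# for p = 1, 2, and infinity.
import numpy as np

v = np.array([1.0, -2.0, 3.0])
alpha = -2.5
for p in (1, 2, np.inf):
    lhs = np.linalg.norm(alpha * v, ord=p)
    rhs = abs(alpha) * np.linalg.norm(v, ord=p)
    assert np.isclose(lhs, rhs)
```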
When you rely on a norm inside an algorithm, you’re often using its properties implicitly: convergence tests like ‖xₖ₊₁ − xₖ‖ < ε rely on positive-definiteness, rescaling data relies on homogeneity, and error bounds rely on the triangle inequality.
The three norm axioms are the “license” that lets you do these steps safely.
For p-norms (p ≥ 1), every term |vᵢ|ᵖ is ≥ 0, so the sum is ≥ 0, so the p-th root is ≥ 0.
Also, ‖v‖ₚ = 0 implies:
(∑ᵢ |vᵢ|ᵖ)¹ᐟᵖ = 0
⇒ ∑ᵢ |vᵢ|ᵖ = 0
A sum of nonnegative numbers is 0 only if each term is 0:
∀i, |vᵢ|ᵖ = 0 ⇒ |vᵢ| = 0 ⇒ vᵢ = 0
So v = 0.
You already saw the derivation in the previous section. This property is what makes norms behave like a geometric length. It also prevents weird measures like “length(2v) = length(v) + 7.”
The triangle inequality is often the hardest axiom to prove, but the easiest to use.
It says:
‖u + v‖ ≤ ‖u‖ + ‖v‖
A very common corollary is a bound on differences, known as the reverse triangle inequality:
|‖u‖ − ‖v‖| ≤ ‖u − v‖
This tells you: if two vectors are close, their lengths are close.
Here’s the derivation using triangle inequality twice.
Start with:
u = (u − v) + v
Apply triangle inequality:
‖u‖ = ‖(u − v) + v‖ ≤ ‖u − v‖ + ‖v‖
Rearrange:
‖u‖ − ‖v‖ ≤ ‖u − v‖ (1)
Swap u and v:
‖v‖ − ‖u‖ ≤ ‖v − u‖ = ‖u − v‖ (2)
Combine (1) and (2):
|‖u‖ − ‖v‖| ≤ ‖u − v‖
This inequality is a frequent tool when analyzing iterative algorithms.
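The bound can be stress-tested empirically across many random vector pairs; again a sanity check, not a proof:

```python
# Empirical check of the reverse triangle inequality:
# | ||u|| - ||v|| | <= ||u - v|| for many random pairs.
import numpy as np

rng = np.random.default_rng(42)
for _ in range(1000):
    u = rng.normal(size=8)
    v = rng.normal(size=8)
    gap = abs(np.linalg.norm(u) - np.linalg.norm(v))
    assert gap <= np.linalg.norm(u - v) + 1e-12  # small slack for float error
```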
In finite-dimensional spaces like ℝⁿ, all norms are “equivalent” in the sense that they bound each other up to constants. Practically: if one norm is small, the others can’t be arbitrarily huge.
For the three norms we care about, these inequalities are especially useful:
1) ‖v‖∞ ≤ ‖v‖₂ ≤ ‖v‖₁
Reasoning (intuition): ‖v‖∞ ≤ ‖v‖₂ because the largest |vᵢ|² is just one term inside the sum of squares; ‖v‖₂ ≤ ‖v‖₁ because (∑ᵢ |vᵢ|)² = ∑ᵢ vᵢ² plus nonnegative cross terms.
2) And with dimension n, you can also bound the other direction:
‖v‖₁ ≤ √n ‖v‖₂
‖v‖₂ ≤ √n ‖v‖∞
These tell you a key high-dimensional fact: the gap between norms can grow with √n. So in large n, your choice of norm can meaningfully change distances and nearest neighbors.
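The all-ones vector makes the gaps as large as the equivalence constants allow (‖v‖₁ = n, ‖v‖₂ = √n, ‖v‖∞ = 1), so you can watch the ratios grow with dimension:

```python
# Norm equivalence bounds on the all-ones vector in R^n, where both
# sqrt(n) bounds are tight: ||v||_1 = n, ||v||_2 = sqrt(n), ||v||_inf = 1.
import numpy as np

for n in (2, 100, 10_000):
    v = np.ones(n)
    l1 = np.linalg.norm(v, 1)
    l2 = np.linalg.norm(v, 2)
    linf = np.linalg.norm(v, np.inf)
    assert linf <= l2 <= l1                 # ||v||_inf <= ||v||_2 <= ||v||_1
    assert l1 <= np.sqrt(n) * l2 + 1e-9     # ||v||_1 <= sqrt(n) ||v||_2
    assert l2 <= np.sqrt(n) * linf + 1e-9   # ||v||_2 <= sqrt(n) ||v||_inf
```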
You might see “‖v‖ₚ” written for p < 1 in some ML contexts (e.g., sparsity). But for p < 1 the triangle inequality fails, so it’s not a true norm. People still use it as a penalty, but you lose some guarantees.
Clustering groups points by proximity. But “proximity” is defined by a distance, and a distance is often built from a norm:
Change the norm → change the geometry → change which points are nearest → change the clustering.
Classic k-means is typically presented with squared Euclidean distance:
minimize ∑ (over points x) ‖x − μ(cluster(x))‖₂²
Why L2²? Because it pairs perfectly with means.
If you take a single cluster and want the best center c to minimize:
J(c) = ∑ᵢ ‖xᵢ − c‖₂²
the minimizer is the componentwise mean. That makes the update step simple and fast.
(If you instead used L1 distance, the best “center” is a median per coordinate, leading to k-medians. So the norm choice changes the algorithm’s natural center.)
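The mean-vs-median claim can be seen in one dimension (centers are computed per coordinate, so the 1D picture carries over). A brute-force sketch over a grid of candidate centers; the data values are made up for illustration:

```python
# 1D sketch: the mean minimizes the sum of squared distances, the median
# minimizes the sum of absolute distances -- the reason k-means uses means
# and k-medians uses medians.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 10.0, 12.0])
candidates = np.linspace(-5, 15, 2001)        # fine grid of candidate centers

sq_cost = [np.sum((x - c) ** 2) for c in candidates]   # squared-L2 objective
l1_cost = [np.sum(np.abs(x - c)) for c in candidates]  # L1 objective

best_sq = candidates[np.argmin(sq_cost)]
best_l1 = candidates[np.argmin(l1_cost)]
assert abs(best_sq - np.mean(x)) < 0.02    # near 5.0, the mean
assert abs(best_l1 - np.median(x)) < 0.02  # near 2.0, the median
```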
With L1, distances add coordinate-wise. This can be more robust to outliers in some settings and can better match data where moving along axes is natural (think grid/city streets).
Geometrically, L1 balls are diamonds in 2D. That tends to create cluster boundaries aligned differently than L2.
L∞ treats the distance between x and y as the largest coordinate difference:
‖x − y‖∞ = maxᵢ |xᵢ − yᵢ|
This is useful when you care about maximum error in any feature (e.g., tolerances). In clustering, it makes points “close” if they’re close in every coordinate (no big deviation allowed).
| If your notion of similarity is… | Consider | Why |
|---|---|---|
| straight-line geometric closeness | L2 | rotation-invariant, common default |
| total absolute deviation across features | L1 | robust-ish, encourages axis-aligned structure |
| worst-feature deviation must be small | L∞ | enforces uniform closeness across coordinates |
Norm-based distances are sensitive to units. If one coordinate is measured in dollars and another in millimeters, the large-scale coordinate can dominate any norm.
A common fix is to normalize features (e.g., z-score standardization). This is not a norm concept by itself, but norms make the need obvious: the distance ‖x − y‖ depends on coordinate scales.
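Here is a small sketch of the problem and the fix; the data values (salary in dollars, height in meters) are made up for illustration:

```python
# Why units matter for norm-based distances: a dollar-scale feature swamps a
# meter-scale feature until both columns are z-scored.
import numpy as np

X = np.array([[50_000.0, 1.70],   # [salary in dollars, height in meters]
              [51_000.0, 1.95],
              [90_000.0, 1.71]])

# Raw Euclidean distances: salary dominates, height barely registers.
d01_raw = np.linalg.norm(X[0] - X[1])   # ~1000, height diff contributes ~0
d02_raw = np.linalg.norm(X[0] - X[2])   # ~40000
assert d01_raw < d02_raw

# Z-score each column: subtract the column mean, divide by the column std.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# After scaling, the height difference between rows 0 and 1 actually matters:
# it is now much larger than their salary difference.
assert abs(Z[0, 1] - Z[1, 1]) > abs(Z[0, 0] - Z[1, 0])
```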
Because ‖v‖₂ = √(v · v), anything that uses dot products often implicitly uses L2 norms. Examples: cosine similarity divides a dot product by the two L2 norms, projections scale by squared L2 norms, and least squares minimizes a squared L2 norm.
So norms complete the story: dot products measure alignment; norms measure magnitude; together they describe angle and distance.
Let v = (3, −4, 12). Compute the L1, L2, and L∞ norms.
Compute L1:
‖v‖₁ = |3| + |−4| + |12|
= 3 + 4 + 12
= 19
Compute L2:
‖v‖₂ = √(3² + (−4)² + 12²)
= √(9 + 16 + 144)
= √169
= 13
Compute L∞:
‖v‖∞ = max(|3|, |−4|, |12|)
= max(3, 4, 12)
= 12
Insight: Each norm answers a different question: L1 totals all coordinate magnitudes (19), L2 gives geometric length (13), and L∞ reports the largest coordinate magnitude (12).
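The same three numbers fall out of NumPy via the `ord` parameter:

```python
# Verify the worked example: L1, L2, and L-infinity norms of (3, -4, 12).
import numpy as np

v = np.array([3.0, -4.0, 12.0])
print(np.linalg.norm(v, ord=1))       # 19.0
print(np.linalg.norm(v, ord=2))       # 13.0
print(np.linalg.norm(v, ord=np.inf))  # 12.0
```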
Let u and v be vectors. Suppose ‖u‖₂ = 5 and ‖v‖₂ = 2. You do not know the angle between them. Bound ‖u + v‖₂ and also bound the possible values of ‖u − v‖₂.
Upper bound ‖u + v‖₂ using triangle inequality:
‖u + v‖₂ ≤ ‖u‖₂ + ‖v‖₂
= 5 + 2
= 7
Lower bound ‖u + v‖₂ using the reverse-triangle form, applied to u and −v (valid because ‖−v‖₂ = ‖v‖₂ and u − (−v) = u + v):
|‖u‖₂ − ‖v‖₂| ≤ ‖u + v‖₂
So:
‖u + v‖₂ ≥ |5 − 2| = 3
So the combined bound is:
3 ≤ ‖u + v‖₂ ≤ 7
Now bound ‖u − v‖₂ similarly.
Triangle inequality gives:
‖u − v‖₂ ≤ ‖u‖₂ + ‖v‖₂ = 7
Reverse-triangle gives:
‖u − v‖₂ ≥ |‖u‖₂ − ‖v‖₂| = 3
So:
3 ≤ ‖u − v‖₂ ≤ 7
Insight: Even without knowing directions, the norm axioms let you bound outcomes. In algorithms, these bounds can guarantee stability (values can’t blow up unexpectedly).
Let a = (2, 0) and b = (1, 1). Compare their distances to the origin under L1, L2, and L∞.
Compute norms for a:
‖a‖₁ = |2| + |0| = 2
‖a‖₂ = √(2² + 0²) = √4 = 2
‖a‖∞ = max(|2|, |0|) = 2
Compute norms for b:
‖b‖₁ = |1| + |1| = 2
‖b‖₂ = √(1² + 1²) = √2 ≈ 1.414
‖b‖∞ = max(|1|, |1|) = 1
Compare: under L1, a and b are equally far from the origin (both 2); under L2, b is closer (√2 ≈ 1.414 < 2); under L∞, b is again closer (1 < 2).
Insight: Two points can be equally close under one norm and not under another. That’s why changing the norm can change nearest neighbors and therefore clustering results.
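That flip can be demonstrated directly. A sketch with illustrative points and a hypothetical helper `nearest` (ours, not a library function):

```python
# The nearest of two candidates to a query can change with the norm.
import numpy as np

q = np.array([0.0, 0.0])
a = np.array([2.0, 0.0])   # all deviation in one coordinate
b = np.array([1.1, 1.1])   # moderate deviation in both

def nearest(query, points, p):
    """Index of the point closest to `query` under the p-norm distance."""
    return int(np.argmin([np.linalg.norm(query - pt, ord=p) for pt in points]))

print(nearest(q, [a, b], 1))       # 0: under L1, a wins (2 < 2.2)
print(nearest(q, [a, b], 2))       # 1: under L2, b wins (~1.556 < 2)
print(nearest(q, [a, b], np.inf))  # 1: under L-infinity, b wins (1.1 < 2)
```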
A norm ‖v‖ is a consistent notion of vector length: nonnegative, zero only at 0, scales as |α|, and satisfies triangle inequality.
The p-norms (p ≥ 1) are defined by ‖v‖ₚ = (∑ᵢ |vᵢ|ᵖ)¹ᐟᵖ; common cases are p = 1, 2, and ∞.
‖v‖₁ measures total absolute magnitude, ‖v‖₂ measures Euclidean length, and ‖v‖∞ measures the largest coordinate magnitude.
Every norm induces a distance: d(x, y) = ‖x − y‖, so choosing a norm chooses your geometry.
Triangle inequality enables powerful bounds like |‖u‖ − ‖v‖| ≤ ‖u − v‖, useful for analysis and stability.
In ℝⁿ, norms bound each other (e.g., ‖v‖∞ ≤ ‖v‖₂ ≤ ‖v‖₁), but the gaps can grow with dimension.
Clustering and nearest-neighbor behavior can change significantly depending on whether you use L1, L2, or L∞ distances.
Forgetting absolute values in ‖v‖₁ or ‖v‖∞ (signs don’t cancel in norms).
Mixing up ‖v‖₂ with ∑ᵢ vᵢ² (the L2 norm includes a square root).
Assuming “‖·‖ₚ” is always a norm for any p; for p < 1, triangle inequality fails.
Comparing distances without considering feature scaling; one large-scale coordinate can dominate any norm-based distance.
Let v = (−1, 2, −2, 4). Compute ‖v‖₁, ‖v‖₂, and ‖v‖∞.
Hint: Use absolute values for L1 and L∞. For L2, square first, then sum, then take √.
‖v‖₁ = |−1|+|2|+|−2|+|4| = 1+2+2+4 = 9.
‖v‖₂ = √((−1)²+2²+(−2)²+4²) = √(1+4+4+16) = √25 = 5.
‖v‖∞ = max(1,2,2,4) = 4.
Suppose ‖u‖₁ = 10 and ‖v‖₁ = 6. Give the tightest bounds you can (using only norm axioms) for ‖u + v‖₁.
Hint: Use triangle inequality for an upper bound and the reverse-triangle form for a lower bound.
Upper bound: ‖u+v‖₁ ≤ ‖u‖₁ + ‖v‖₁ = 16.
Lower bound: |‖u‖₁ − ‖v‖₁| ≤ ‖u+v‖₁ ⇒ ‖u+v‖₁ ≥ |10−6| = 4.
So 4 ≤ ‖u+v‖₁ ≤ 16.
Find a nonzero vector v ∈ ℝ² such that ‖v‖₁ = 2 but ‖v‖∞ = 1. Then compute ‖v‖₂.
Hint: You want the max coordinate magnitude to be 1, and the sum of absolute values to be 2.
One example is v = (1, 1). Then ‖v‖₁ = |1|+|1|=2 and ‖v‖∞=max(1,1)=1.
Compute L2: ‖v‖₂ = √(1²+1²) = √2.
Prerequisite reinforcement: Dot Product
Unlocks and next steps: Clustering
Related ideas you’ll likely meet soon: norm-induced metrics, feature scaling, and L1/L2 regularization.