K-means, hierarchical clustering. Unsupervised grouping.
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
When you don’t have labels, clustering is the workhorse tool for discovering structure: it tries to answer “which points belong together?” using only a notion of similarity (distance).
Clustering groups data points so that points in the same group are more similar than points in different groups. K-means does this by minimizing within-cluster squared distances to centroids μⱼ (fast, simple, but assumes roughly spherical clusters and needs k). Hierarchical clustering builds a dendrogram by repeatedly merging clusters using a linkage rule (more flexible shapes, but slower; you choose a cut level instead of k). Distance choice, scaling, initialization, and validation (e.g., silhouette) determine whether clusters are meaningful.
Clustering is an unsupervised learning task: you’re given data points x₁,…,xₙ (no labels), and you want to group them into “clusters” so that points in the same group are “similar,” and points in different groups are “dissimilar.”
The key idea is that clustering is not a single problem—it’s a family of problems defined by choices you make:
1) Representation: what is a data point? (a vector x ∈ ℝᵈ? a document? an image embedding?)
2) Similarity / distance: how do we measure “close”? (‖x−y‖₂, cosine distance, etc.)
3) Cluster model: what does a cluster “look like”? (a centroid-based ball, a connected component, a hierarchy)
4) Objective / procedure: what criterion do we optimize or what rule do we follow?
A practical definition you can use: a clustering is a partition of the data into groups such that, under your chosen distance, within-group distances are small relative to between-group distances.
Notice what’s missing: there is no universal “correct” clustering without extra assumptions. Two different distance metrics can yield different but equally defensible clusterings.
Clustering shows up in many places because it provides structure without labels.
If your data are vectors in ℝᵈ, distance-based clustering is fundamentally geometric.
This is why we study multiple clustering paradigms. In this lesson, we’ll focus on two core methods: K-means and agglomerative hierarchical clustering.
We’ll also treat clustering as a pipeline: choose distance + scale features, run algorithm, then validate and interpret.
The major theme: clustering is about optimizing or constructing a grouping consistent with your similarity notion—and knowing when that grouping is trustworthy.
K-means is the classic centroid-based clustering algorithm. It is popular because it is simple, fast, and often surprisingly effective.
Suppose you believe each cluster can be summarized by a “typical” point—its centroid—and that points should belong to the cluster whose centroid is closest.
This belief translates into an objective: choose k centroids and assignments so that points are close to their assigned centroid.
Given assignments cᵢ and centroids μⱼ, the within-cluster sum of squares (WCSS) is:
J(c, μ) = ∑ᵢ ‖xᵢ − μ_{cᵢ}‖₂²
K-means solves:
min_{c, μ} J(c, μ) = ∑ᵢ ‖xᵢ − μ_{cᵢ}‖₂²
This is sometimes described as “minimizing within-cluster variance.”
But it also implies a geometric assumption: clusters are roughly spherical (balls) in Euclidean space.
K-means uses an iterative procedure (often called Lloyd’s algorithm):
1) Assignment step (E-step-like):
Fix centroids μⱼ. Assign each point to the nearest centroid:
cᵢ ← argminⱼ ‖xᵢ − μⱼ‖₂²
2) Update step (M-step-like):
Fix assignments cᵢ. Update each centroid to the mean of points in that cluster:
μⱼ ← (1 / |Cⱼ|) ∑_{i: cᵢ = j} xᵢ
These steps repeat until assignments stop changing or the objective decreases only slightly.
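The two steps above can be sketched in a few lines of NumPy (a minimal illustration on a tiny made-up dataset, not a production implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for each point.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments (hence centroids) have stabilized
        centroids = new_centroids
    return labels, centroids

# Two well-separated "blobs" of two points each (illustrative data).
X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])
labels, centroids = kmeans(X, k=2)
```

On this tiny dataset the algorithm recovers the two blobs regardless of which points are chosen as initial centroids.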
Focus on one cluster j with points {xᵢ : cᵢ = j}. We want μ that minimizes:
f(μ) = ∑_{i: cᵢ=j} ‖xᵢ − μ‖₂²
Expand using dot products:
‖xᵢ − μ‖₂² = (xᵢ − μ)·(xᵢ − μ)
= xᵢ·xᵢ − 2 xᵢ·μ + μ·μ
So
f(μ) = ∑ (xᵢ·xᵢ) − 2 ∑(xᵢ·μ) + ∑(μ·μ)
Let m = |Cⱼ|. Note ∑(μ·μ) = m(μ·μ).
Take the gradient with respect to μ:
∇ f(μ) = -2 ∑ xᵢ + 2 m μ
Set ∇ f(μ) = 0:
-2 ∑ xᵢ + 2 m μ = 0
⇒ m μ = ∑ xᵢ
⇒ μ = (1/m) ∑ xᵢ
So the centroid update is not a heuristic—it is the exact minimizer of the squared-distance objective given fixed assignments.
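A quick numerical sanity check of this claim (the random data here is purely illustrative): perturbing the mean in any direction should never decrease f.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 points in R^3, treated as one fixed cluster

def f(mu):
    # Within-cluster sum of squared distances to a candidate center mu.
    return ((X - mu) ** 2).sum()

mean = X.mean(axis=0)
# Any perturbation of the mean should give an objective at least as large.
for _ in range(100):
    perturbed = mean + rng.normal(scale=0.1, size=3)
    assert f(perturbed) >= f(mean)
```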
Each iteration can only decrease (or leave unchanged) the objective: the assignment step moves each point to its nearest centroid, and the update step replaces each centroid with the exact minimizer for its cluster.
So J decreases monotonically, and because there are only finitely many possible assignments, the algorithm converges in finitely many steps to a local minimum.
But the local minimum it reaches depends on where it starts: random initialization (choosing k random points as initial μⱼ) can be unstable.
k-means++ improves this by spreading initial centroids apart:
1) Pick one centroid uniformly from the data.
2) For each point x, compute D(x) = distance to nearest chosen centroid.
3) Choose next centroid with probability proportional to D(x)².
4) Repeat until k centroids are chosen.
This tends to reduce bad starts and usually improves results with minimal overhead.
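The four steps translate directly into code. A minimal NumPy sketch (the toy dataset is hypothetical):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: spread initial centroids apart."""
    n = len(X)
    # 1) First centroid uniformly at random from the data.
    centroids = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # 2) D(x)^2 = squared distance to the nearest already-chosen centroid.
        d2 = np.min(((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2), axis=1)
        # 3) Sample the next centroid with probability proportional to D(x)^2.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

# Two tight, far-apart blobs: seeding should almost always pick one point per blob.
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
init = kmeans_pp_init(X, k=2, rng=np.random.default_rng(0))
```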
K-means requires k upfront, which is often the hardest practical choice.
Common heuristics include the elbow method (plot J against k and look for a bend) and the silhouette score.
For a point i, let aᵢ be the mean distance from i to the other points in its own cluster, and bᵢ the smallest mean distance from i to the points of any other cluster.
Silhouette:
sᵢ = (bᵢ − aᵢ) / max(aᵢ, bᵢ)
Average sᵢ over i gives a score for k.
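Computed directly from the definitions above, on a small made-up dataset (a minimal sketch assuming every cluster has at least two points; libraries such as scikit-learn provide an optimized version):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette score: s_i = (b_i - a_i) / max(a_i, b_i)."""
    n = len(X)
    # Pairwise Euclidean distance matrix.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(n):
        same = (labels == labels[i])
        # a_i: mean distance to the *other* points in i's own cluster
        # (D[i, i] = 0, so dividing by same.sum() - 1 averages over the others).
        a = D[i, same].sum() / (same.sum() - 1)
        # b_i: smallest mean distance to the points of any other cluster.
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return np.mean(scores)

# Two clean, well-separated clusters should score close to 1.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
labels = np.array([0, 0, 1, 1])
score = silhouette(X, labels)
```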
K-means uses Euclidean distance, so feature scaling matters.
If one feature has a large numeric range, it dominates distances.
Typical fix: standardize each feature:
x' = (x − mean) / std
Without scaling, you may cluster by “units” rather than by actual structure.
K-means is a sharp tool—but it’s sharp in a particular shape. Next we’ll study hierarchical clustering, which often provides more flexibility and interpretability via a dendrogram.
Hierarchical clustering builds a nested family of clusters rather than committing to a single partition with a fixed k.
Sometimes you don’t believe the data has one “correct” k. Instead, you might believe there are multiple meaningful scales: coarse groups that split into progressively finer subgroups.
Hierarchical clustering captures this by producing a dendrogram (a tree) that shows how clusters merge (or split) as you change the similarity threshold.
There are two main variants: agglomerative (bottom-up: start with singletons and repeatedly merge) and divisive (top-down: start with one cluster and repeatedly split).
Agglomerative clustering is more common in practice, so we’ll focus on it.
1) Start with n clusters: { {1}, {2}, …, {n} }
2) Compute distances between clusters using a linkage rule
3) Merge the two closest clusters
4) Update distances and repeat until one cluster remains
The output is a dendrogram. To get a final clustering, you cut the dendrogram at a chosen height (distance threshold) or choose a number of clusters.
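The four steps can be sketched in plain Python (a naive, quadratic-per-merge illustration rather than an efficient implementation; the linkage rule is passed in as a function):

```python
def agglomerative(points, linkage, n_clusters=1):
    """Greedy agglomerative clustering; records each merge and its height."""
    clusters = [[p] for p in points]  # start: one singleton cluster per point
    merges = []                       # list of (merged members, merge height)
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters under the linkage rule.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        height = linkage(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        merges.append((sorted(merged), height))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

# Single linkage: minimum pairwise distance between the two clusters (1-D data here).
single = lambda A, B: min(abs(a - b) for a in A for b in B)

clusters, merges = agglomerative([0, 2, 3, 10], single)
```

Run on the 1-D points {0, 2, 3, 10}, the recorded merge heights form the dendrogram levels.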
Hierarchical clustering needs a distance between sets of points (clusters), not just between individual points.
Let A and B be two clusters.
Common linkage rules:
| Linkage | Cluster distance d(A,B) | Behavior / bias |
|---|---|---|
| Single | min_{x∈A, y∈B} ‖x−y‖ | Can create “chains”; finds elongated shapes |
| Complete | max_{x∈A, y∈B} ‖x−y‖ | Prefers compact clusters; sensitive to outliers |
| Average | (1/(\|A\|·\|B\|)) ∑_{x∈A}∑_{y∈B} ‖x−y‖ | Balanced; less extreme |
| Ward | Increase in within-cluster SSE after merge | Similar spirit to k-means; prefers spherical clusters |
Ward’s method chooses the merge that causes the smallest increase in total within-cluster sum of squares.
If you define the within-cluster SSE for a cluster C with centroid μ_C:
SSE(C) = ∑_{x∈C} ‖x − μ_C‖₂²
Ward chooses to merge A and B that minimize:
Δ(A,B) = SSE(A∪B) − SSE(A) − SSE(B)
This makes hierarchical clustering feel more “k-means-like,” but it still yields a full hierarchy.
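Δ(A,B) can be computed directly from the SSE definition. A small NumPy sketch with made-up clusters (A and C are near each other; B is far away):

```python
import numpy as np

def sse(C):
    # Within-cluster sum of squared distances to the cluster centroid.
    mu = C.mean(axis=0)
    return ((C - mu) ** 2).sum()

def ward_delta(A, B):
    """Increase in total within-cluster SSE caused by merging A and B."""
    return sse(np.vstack([A, B])) - sse(A) - sse(B)

A = np.array([[0.0, 0.0], [0.0, 1.0]])   # small cluster near the origin
B = np.array([[5.0, 5.0], [6.0, 5.0]])   # small cluster far away
C = np.array([[0.0, 2.0]])               # single point adjacent to A
```

Ward would merge A with C before A with B, because the nearby merge increases total SSE far less.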
In a dendrogram, each leaf is a single data point, and the height of each internal node records the linkage distance at which two clusters merged.
A “natural” clustering often appears as a run of merges at small heights followed by a large jump in merge height.
Cutting the tree at a height between those regimes yields a partition.
Hierarchical clustering is typically more expensive than k-means: naive agglomerative algorithms need the full O(n²) distance matrix and O(n²)–O(n³) time, versus roughly O(nkd) per iteration for k-means.
So it’s often used for small-to-moderate datasets, exploratory analysis, and settings where the multi-scale structure itself is of interest.
Hierarchical clustering can use many distances: Euclidean, Manhattan, cosine, or even precomputed domain-specific dissimilarities.
Different distances can change the dendrogram dramatically. Scaling features is still crucial whenever numeric ranges differ.
Hierarchical clustering gives you a full dendrogram (no k upfront), works with any dissimilarity, and can capture non-spherical shapes with the right linkage.
But it is greedy (merges are never undone) and scales poorly to large n.
K-means is fast, scalable, and easy to rerun with different k.
But it requires k upfront, assumes roughly spherical clusters, and only finds a local minimum.
You’ll often try both: use hierarchical clustering to understand structure (choose k), then run k-means for a scalable final partition.
Clustering algorithms always produce some grouping—even on random noise. The real skill is building a workflow that makes clusters meaningful and defensible.
Different goals imply different choices, so ask first: what would count as a “good” cluster in your domain?
Raw features might not reflect similarity.
Examples: represent documents by TF-IDF vectors or embeddings rather than raw text, and images by learned embeddings rather than raw pixels.
Clustering works best when your representation makes “similarity” meaningful geometrically.
If using Euclidean distance, scale each feature to comparable units.
Typical options:
| Method | Formula | When to use |
|---|---|---|
| Standardization | x' = (x − mean)/std | Most common; good default |
| Min-max scaling | x' = (x − min)/(max − min) | When bounds matter |
| L2 normalization | x' = x/‖x‖₂ | For cosine-like comparisons on direction |
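The three options can each be written as a one-line NumPy transform (sketch only; the sample matrix is hypothetical):

```python
import numpy as np

def standardize(X):
    # (x - mean) / std, per feature (column).
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max(X):
    # (x - min) / (max - min), per feature.
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def l2_normalize(X):
    # x / ||x||_2, per row: keeps direction, discards magnitude.
    return X / np.linalg.norm(X, axis=1, keepdims=True)

# Illustrative data with wildly different feature scales (cm vs USD).
X = np.array([[170.0, 50_000.0], [180.0, 50_500.0], [171.0, 120_000.0]])
Z = standardize(X)
```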
A useful mental map:
| If… | Try… |
|---|---|
| clusters are compact blobs | K-means or Ward hierarchical |
| clusters may be elongated / connected | Single/average linkage hierarchical |
| you want multi-scale structure | Hierarchical clustering |
| n is very large | K-means (with k-means++) |
(There are other families like DBSCAN, GMMs, spectral clustering, but this node focuses on K-means + hierarchical.)
Validation is not just a single number.
Internal metrics can be misleading if your distance metric doesn’t match your semantic goal.
Run clustering under perturbations: subsample the data, add small noise, or change random seeds and initializations.
If cluster assignments drastically change, the structure may be weak.
Even without labels, you can often check whether clusters separate on features you held out of the clustering, and whether they make sense to domain experts.
For each cluster j, compute summaries such as its size, its centroid, and the features that most distinguish it from the rest of the data.
Interpretation is often the main product of clustering.
In high dimensions, Euclidean distances tend to concentrate and become less discriminative. A practical mitigation is to reduce dimension first (e.g., PCA) before clustering, but carefully: projections can also distort structure.
So if you think “variance within clusters” is the right notion, both methods can align: Ward linkage greedily minimizes the same within-cluster SSE that k-means optimizes, so a Ward dendrogram can suggest k and k-means can refine the final partition.
The core professional habit: treat clustering as an iterative scientific process, not a one-click algorithm.
Data points in ℝ²:
x₁=(1,1), x₂=(1,2), x₃=(4,4), x₄=(5,4)
Choose k=2. Initialize centroids:
μ₁=(1,1), μ₂=(5,4). Use squared Euclidean distance.
Assignment step: compute ‖xᵢ−μⱼ‖₂².
For x₁=(1,1):
‖(1,1)−(1,1)‖² = 0
‖(1,1)−(5,4)‖² = (−4)²+(−3)² = 16+9 = 25
⇒ c₁=1.
For x₂=(1,2):
‖(1,2)−(1,1)‖² = 0²+1² = 1
‖(1,2)−(5,4)‖² = (−4)²+(−2)² = 16+4 = 20
⇒ c₂=1.
For x₃=(4,4):
‖(4,4)−(1,1)‖² = 3²+3² = 9+9 = 18
‖(4,4)−(5,4)‖² = (−1)²+0² = 1
⇒ c₃=2.
For x₄=(5,4):
‖(5,4)−(1,1)‖² = 4²+3² = 16+9 = 25
‖(5,4)−(5,4)‖² = 0
⇒ c₄=2.
Update step: recompute centroids as the mean of assigned points.
Cluster 1 has x₁, x₂:
μ₁ ← ( (1,1) + (1,2) ) / 2 = ( (2,3) ) / 2 = (1, 1.5)
Cluster 2 has x₃, x₄:
μ₂ ← ( (4,4) + (5,4) ) / 2 = ( (9,8) ) / 2 = (4.5, 4)
Objective after update (optional check):
J = ∑ᵢ ‖xᵢ − μ_{cᵢ}‖²
For cluster 1 with μ₁=(1,1.5):
‖(1,1)−(1,1.5)‖² = 0²+(−0.5)² = 0.25
‖(1,2)−(1,1.5)‖² = 0²+(0.5)² = 0.25
For cluster 2 with μ₂=(4.5,4):
‖(4,4)−(4.5,4)‖² = (−0.5)²+0² = 0.25
‖(5,4)−(4.5,4)‖² = (0.5)²+0² = 0.25
Total J = 1.0
Insight: K-means alternates between “nearest centroid” assignments and “mean of assigned points” updates. The arithmetic mean appears because it exactly minimizes the sum of squared Euclidean distances within each cluster.
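The full iteration above can be checked in a few lines of NumPy:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0]])
mu = np.array([[1.0, 1.0], [5.0, 4.0]])  # initial centroids

# Assignment step: squared distances to each centroid, take the argmin.
d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
labels = d2.argmin(axis=1)

# Update step: each centroid becomes the mean of its assigned points.
mu = np.array([X[labels == j].mean(axis=0) for j in range(2)])

# Objective after the update.
J = sum(((X[i] - mu[labels[i]]) ** 2).sum() for i in range(len(X)))
```

This reproduces the assignments c = (1, 1, 2, 2), the updated centroids (1, 1.5) and (4.5, 4), and the objective J = 1.0 computed by hand above.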
Two features: height (cm) and income (USD). Consider two people:
A: (170 cm, 50,000)
B: (180 cm, 50,500)
C: (171 cm, 120,000)
Use Euclidean distance on raw features vs standardized features.
Raw distances:
Between A and B:
Δheight=10, Δincome=500
‖A−B‖₂ = √(10² + 500²) = √(100 + 250,000) = √250,100 ≈ 500.10
Between A and C:
Δheight=1, Δincome=70,000
‖A−C‖₂ = √(1² + 70,000²) ≈ 70,000.00
Observation: income dwarfs height.
Even a 10 cm difference barely matters compared to $500.
So clustering will be driven almost entirely by income scale, not necessarily by what you intend.
Standardize each feature (conceptually):
Let height'=(height−mean)/std_height and income'=(income−mean)/std_income.
After standardization, typical variations in height and income become comparable (unitless).
Resulting effect:
Distances become sensitive to both features.
Now A is close to C in height, but far in income; whether A clusters with B or C depends on relative standardized differences, not raw units.
Insight: Clustering is only as meaningful as your distance metric—and Euclidean distance is extremely sensitive to feature scale. Standardization is often not optional; it defines what “similar” means.
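The same three people can be run through standardization to see the effect numerically (a sketch; the exact standardized distances depend on which points define the mean and std):

```python
import numpy as np

# Height (cm) and income (USD) for A, B, C from the example above.
X = np.array([[170.0, 50_000.0], [180.0, 50_500.0], [171.0, 120_000.0]])

def dist(P, i, j):
    return float(np.linalg.norm(P[i] - P[j]))

# Raw distances: income dominates completely.
raw_AB, raw_AC = dist(X, 0, 1), dist(X, 0, 2)

# Standardize each feature, then re-measure: both features now contribute.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_AB, std_AC = dist(Z, 0, 1), dist(Z, 0, 2)
```

On raw features A is over 100× farther from C than from B; after standardization the two distances are comparable.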
Points on a line (ℝ¹): {0, 2, 3, 10}. Use single linkage with Euclidean distance. Build the merge sequence and interpret a dendrogram cut.
Start with singleton clusters:
{0}, {2}, {3}, {10}
Pairwise point distances:
|0−2|=2, |0−3|=3, |0−10|=10
|2−3|=1, |2−10|=8
|3−10|=7
Single linkage distance between clusters is the minimum pairwise distance.
Closest pair is {2} and {3} with distance 1.
Merge:
C₁ = {2,3} at height 1.
Now clusters: {0}, {2,3}, {10}
Compute distances:
d({0},{2,3}) = min(|0−2|,|0−3|)=min(2,3)=2
d({10},{2,3}) = min(|10−2|,|10−3|)=min(8,7)=7
d({0},{10}) = 10
Closest is {0} and {2,3} at distance 2.
Merge:
C₂ = {0,2,3} at height 2.
Now clusters: {0,2,3}, {10}
Final merge:
d({0,2,3},{10}) = min(|10−0|,|10−2|,|10−3|)=min(10,8,7)=7
Merge at height 7.
Dendrogram heights: 1, 2, 7.
Interpretation by cutting:
If you cut at height between 2 and 7 (e.g., 4), you get 2 clusters:
{0,2,3} and {10}.
If you cut below 2 (e.g., 1.5), you get 3 clusters:
{0}, {2,3}, {10}.
Insight: Hierarchical clustering gives you a family of clusterings. The linkage rule controls the geometry: single linkage tends to merge via nearest neighbors, which can produce chained clusters.
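The merge sequence and cuts above can be reproduced with SciPy’s hierarchical-clustering routines (assuming scipy is available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The four 1-D points from the example, as an (n, 1) array.
X = np.array([[0.0], [2.0], [3.0], [10.0]])

# Single-linkage merge tree; column 2 of Z holds the merge heights.
Z = linkage(X, method="single")

# Cut between heights 2 and 7 -> two clusters; below 2 -> three clusters.
two = fcluster(Z, t=4, criterion="distance")
three = fcluster(Z, t=1.5, criterion="distance")
```

The merge heights come out as 1, 2, 7, matching the hand computation, and the two cuts recover {0,2,3} + {10} and {0} + {2,3} + {10} respectively.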
Clustering is defined by a triangle of choices: representation, distance/similarity, and clustering objective/procedure.
K-means minimizes J = ∑ᵢ ‖xᵢ − μ_{cᵢ}‖₂² via alternating assignment (nearest centroid) and update (μⱼ = mean of assigned points).
K-means converges to a local minimum; initialization matters (k-means++ is a strong default).
Feature scaling often determines the result because Euclidean distance is scale-sensitive.
Hierarchical clustering builds a dendrogram via greedy merges; you choose a cut (height or number of clusters) after seeing the structure.
Linkage rules (single/complete/average/Ward) encode different biases about what clusters should look like.
Validation should include more than a single metric: combine silhouette/elbow with stability and domain interpretability checks.
No clustering is ‘correct’ without assumptions; the goal is a useful, stable grouping aligned with your notion of similarity.
Running K-means on unscaled features and interpreting the output as meaningful structure.
Assuming the algorithm discovered “true categories” rather than a partition induced by your distance metric and model assumptions.
Choosing k by eyeballing one run instead of checking multiple initializations and using a validation approach (silhouette, stability).
Using hierarchical clustering on very large n without considering O(n²) memory/time costs (distance matrix blow-up).
You have points x₁=(0,0), x₂=(0,1), x₃=(5,5), x₄=(6,5). Run one full K-means iteration with k=2 starting from μ₁=(0,0), μ₂=(6,5). Give assignments and updated centroids.
Hint: Compute squared distances to each centroid, assign each point to the closer one, then average points in each cluster to update μⱼ.
Assignments by squared distance:
x₁ to μ₁ (0 vs 61)
x₂ to μ₁ (1 vs 52)
x₃ to μ₂ (distance to (6,5) is 1; to (0,0) is 50)
x₄ to μ₂ (0 vs 61)
Updated centroids:
μ₁ = ((0,0)+(0,1))/2 = (0, 0.5)
μ₂ = ((5,5)+(6,5))/2 = (5.5, 5)
For hierarchical clustering, explain (in 2–4 sentences) how single linkage and complete linkage can produce different cluster shapes. Then give one scenario where single linkage is a poor choice.
Hint: Think about min vs max distances between points across clusters, and how a single ‘bridge’ point can connect groups.
Single linkage uses d(A,B)=min distance between points, so clusters can merge through nearest-neighbor chains and form long, thin shapes. Complete linkage uses d(A,B)=max distance, which discourages merges that would create large diameters and thus prefers compact clusters. Single linkage is a poor choice when noise points create ‘bridges’ between two real groups, causing them to chain together into one cluster.
Derive the K-means centroid update: show that μ that minimizes f(μ) = ∑_{i=1}^m ‖xᵢ − μ‖₂² is the mean (1/m)∑ᵢ xᵢ.
Hint: Expand the squared norm, take ∇ with respect to μ, set it to 0, and solve for μ.
f(μ) = ∑ᵢ (xᵢ−μ)·(xᵢ−μ) = ∑ᵢ (xᵢ·xᵢ) − 2∑ᵢ(xᵢ·μ) + ∑ᵢ(μ·μ).
Let m be the number of points. Then ∑ᵢ(μ·μ) = m(μ·μ).
Take gradient:
∇f(μ) = -2∑ᵢ xᵢ + 2mμ.
Set ∇f(μ) = 0:
-2∑ᵢ xᵢ + 2mμ = 0 ⇒ mμ = ∑ᵢ xᵢ ⇒ μ = (1/m)∑ᵢ xᵢ.