Union-Find structure. Near-constant time union and find.
You’re repeatedly asked questions like “Are a and b in the same group?” while also merging groups over time. Disjoint Sets (Union-Find) is the classic structure that makes those queries fast—so fast that in practice it feels constant-time even for huge inputs.
Disjoint Sets (Union-Find) maintains a partition of elements into non-overlapping sets. It supports three operations: make-set, find, and union.
With path compression + union by rank/size, a sequence of m operations on n elements runs in O(m α(n)) time (α is inverse Ackermann), which is “near-constant” amortized.
Many problems evolve by merging groups and asking connectivity questions: users befriending each other in a social network, edges arriving in a graph, cells opening up in a grid.
A naive approach stores each set explicitly (like a list). Then each union must copy or relabel one set's elements, which costs O(n) in the worst case, so a long sequence of unions can cost O(n²) overall.
Union-Find is designed for exactly this pattern: lots of operations, interleaving unions and queries, where we want each operation to be almost constant time.
We maintain a collection of sets S₁, S₂, … such that every element belongs to exactly one set and no two sets overlap.
This is a partition of U.
Instead of storing full sets, we store parent pointers that form a forest of rooted trees.
The representative (root) is a canonical “name” for the set. Two elements are in the same set iff they have the same root.
Formally, define x ~ y to mean that x and y lie in the same set; find returns a canonical representative, so x ~ y iff find(x) = find(y).
We typically implement three operations:
1) make-set(x): create a new singleton set {x}.
2) find(x): return the representative (root) of the set containing x.
3) union(a, b): merge the sets containing a and b.
We’ll use these arrays: parent[x], the parent of x (with parent[x] = x for roots), plus rank[x] or size[x] maintained at roots.
Rank is an estimate of tree height; size is exact set size. Either supports good performance when used correctly.
Suppose we have elements 1..7. Initially each is its own root.
After union(1,2), union(3,4), and union(1,3), we might have:
parent = [1, 1, 1, 3, 5, 6, 7] for x = 1..7: one tree rooted at 1 containing {1,2,3,4} (4 points to 3, which points to 1), with 5, 6, 7 still singleton roots.
Then:
find(4) = 1 and find(5) = 5, so 4 and 5 are in different sets, while find(2) = find(4) = 1 puts 2 and 4 in the same set.
Union-Find’s entire job is to maintain this forest efficiently as it changes.
Every union needs to know which sets it is merging, and every connectivity query needs to know whether two elements share a set. Both tasks reduce to a single primitive:
Find the representative root.
Without optimizations, find(x) just walks up parent pointers until it reaches a node that is its own parent.
That is correct—but could be slow if trees get tall.
Union-Find maintains this invariant: parent[x] = x exactly when x is a root, and following parent pointers from any node terminates at the root of its tree.
As long as union() only links root to root, and never creates cycles, the forest remains valid.
Pseudocode (conceptually):

find(x):
    while parent[x] ≠ x:
        x = parent[x]
    return x

This runs in O(h), where h is the height of x’s tree. If h becomes O(n), find becomes O(n) and everything collapses.
Motivation: If we frequently call find(x), it’s wasteful to traverse the same long path over and over.
Path compression flattens the tree during find: once the root is known, every node on the search path is pointed directly at it.
In a recursive form:

find(x):
    if parent[x] ≠ x:
        parent[x] = find(parent[x])
    return parent[x]
Now, the next time we call find on any of those nodes, it returns quickly.
Important subtlety: path compression alone already helps a lot, but to get the famous near-constant bound we pair it with a good union rule (next section).
Suppose we have a chain of parent pointers: 1 → 2 → 3 → 4 → 5, where 5 is the root.
Calling find(1) without compression visits 1,2,3,4,5.
With path compression, after find(1), nodes 1, 2, 3, and 4 all point directly to 5.
Now find(2) is essentially O(1).
Path compression can make one find operation do extra work (rewiring pointers), but that work pays off later.
Over a long sequence of operations, the total cost is tiny per operation. The formal result (when combined with union by rank/size) is:

total cost of m operations = O(m α(n)),

where α(n) is the inverse Ackermann function.
For any realistic n (even n ≈ 10²⁰), α(n) ≤ 5. So in practice, Union-Find behaves like constant time.
Some environments avoid recursion. You can still compress paths iteratively in two passes:
1) Walk up to find the root.
2) Walk again and set each visited node’s parent to the root.
Both are common; recursive is compact, iterative avoids stack depth issues.
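As a sketch, the two-pass iterative version might look like this in Python (the `parent` list is assumed to encode the forest, with `parent[x] == x` at roots):

```python
def find_iterative(parent, x):
    # Pass 1: walk up to locate the root.
    root = x
    while parent[root] != root:
        root = parent[root]
    # Pass 2: point every node on the path directly at the root.
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root
```

The first pass costs the same as a plain find; the second pass is what flattens the path for future queries.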
A union operation merges two sets. Internally, that means we take two roots r₁ and r₂ and make one the parent of the other.
If we do this carelessly (e.g., always parent[r₂] = r₁ regardless of shape), we can create tall trees: a chain of such unions can leave a path of length Θ(n).
Tall trees make find slow.
So union must be disciplined: always attach the “smaller” or “shallower” tree under the “bigger” or “deeper” one.
Store size[root] = number of nodes in that root’s tree.
Algorithm:
1) r₁ = find(a), r₂ = find(b)
2) if r₁ = r₂: already same set
3) attach smaller to larger: if size[r₁] < size[r₂], swap r₁ and r₂; then set parent[r₂] = r₁ and size[r₁] += size[r₂].
Why it helps: each time a node’s tree is attached under another, the size of the resulting tree at least doubles. A node can only be “moved upward” O(log n) times.
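A minimal sketch of union by size (Python; root-finding is a plain walk here, and the function name is illustrative):

```python
def union_by_size(parent, size, a, b):
    # Find the two roots (compression is orthogonal and omitted here).
    while parent[a] != a:
        a = parent[a]
    while parent[b] != b:
        b = parent[b]
    if a == b:
        return False  # already in the same set
    # Attach the smaller tree under the larger one.
    if size[a] < size[b]:
        a, b = b, a
    parent[b] = a
    size[a] += size[b]
    return True
```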
Store rank[root], which approximates tree height.
Algorithm:
1) r₁ = find(a), r₂ = find(b)
2) if r₁ = r₂: return
3) compare ranks: if rank[r₁] < rank[r₂], set parent[r₁] = r₂; if rank[r₁] > rank[r₂], set parent[r₂] = r₁; on a tie, set parent[r₂] = r₁ and increment rank[r₁].
Why rank works: a root's rank increases only when two equal-rank trees merge, so a tree whose root has rank r contains at least 2ʳ nodes, which keeps heights (without compression) at O(log n) and yields the near-constant bound when combined with path compression.
Both are excellent. Rank is slightly more common in textbooks; size can be easier to reason about and gives you set sizes for free.
| Heuristic | Stored value | Union rule | Extra capability | Typical use |
|---|---|---|---|---|
| Union by rank | rank[root] ≈ height | attach lower-rank under higher-rank; tie ⇒ increment | none directly | classic DSU |
| Union by size | size[root] = set size | attach smaller size under larger size | can answer component sizes | many contest implementations |
Either one + path compression ⇒ near-constant amortized time.
With both heuristics, any sequence of m operations on n elements runs in O(m α(n)) time.
This is an amortized statement: a single operation may do extra pointer rewiring, but the average cost over the whole sequence is near-constant.
α(n) grows so slowly that it’s ≤ 4 for values of n far beyond what you can store in memory.
That’s why Union-Find is one of the few structures where theoretical and practical performance line up remarkably well.
Union must connect roots to roots.
That means union(a,b) should do: r₁ = find(a), r₂ = find(b), and then link one root under the other.
If you accidentally do parent[a] = b without finding roots, you can break the forest structure and even create cycles, making find incorrect or non-terminating.
Union-Find shines when connections are only ever added and you interleave many unions with many connectivity queries.
It does not directly support efficiently removing connections (dynamic connectivity with deletions is harder).
In Kruskal’s algorithm, you sort edges by weight and add them if they don’t create a cycle.
The cycle check is exactly: does find(u) = find(v) for the edge's endpoints u and v?
So DSU provides cycle detection quickly, making Kruskal efficient.
As you scan edges: if an edge's endpoints already share a root, that edge closes a cycle; otherwise union them and continue.
This is a lightweight way to detect cycles without full DFS each time.
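That scan can be sketched as follows (Python; `first_cycle_edge` is an illustrative name, and the union omits the size/rank heuristic for brevity):

```python
def first_cycle_edge(n, edges):
    # Returns the first edge whose endpoints are already connected, or None.
    parent = list(range(n))

    def find(x):
        # Path halving: point each visited node at its grandparent.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return (u, v)  # this edge closes a cycle
        parent[rv] = ru  # union the two components
    return None
```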
Suppose you’re building components from interactions: each interaction (a, b) is a union, and “are a and b in the same community?” is a comparison of two finds.
With union by size, you can also answer: “how large is the component containing x?” as size[find(x)].
In an N×N grid, you may open cells and union adjacent open cells.
Queries like “is there a path from top to bottom?” become connectivity checks between virtual nodes representing borders.
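A sketch of the virtual-node idea for a percolation-style query (Python; `TOP`/`BOTTOM` are hypothetical virtual nodes, and open cells are given as (row, col) pairs):

```python
def percolates(n, open_cells):
    # TOP connects to every open cell in row 0, BOTTOM to every open
    # cell in row n-1; the query reduces to connected(TOP, BOTTOM).
    TOP, BOTTOM = n * n, n * n + 1
    parent = list(range(n * n + 2))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    cells = set(open_cells)
    for (r, c) in cells:
        idx = r * n + c
        if r == 0:
            union(TOP, idx)
        if r == n - 1:
            union(BOTTOM, idx)
        # Union with open orthogonal neighbors.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if (r + dr, c + dc) in cells:
                union(idx, (r + dr) * n + (c + dc))
    return find(TOP) == find(BOTTOM)
```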
Union-Find is a practical way to maintain an equivalence relation ~ over elements: each set is an equivalence class, and the representative names the class.
Union operations add equivalences; find queries membership in equivalence classes.
Think of DSU as a case study where an extremely simple structure (parent pointers) becomes powerful when paired with the right heuristics, building on the arrays, trees, and amortized analysis you already know.
When writing DSU in code: always find roots before linking, update rank/size only on the new root, and compress paths inside find.
The resulting code is short, reliable, and performance-friendly.
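Putting those tips together, one possible compact implementation (union by size plus path compression; a sketch, not the only valid layout):

```python
class DSU:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Two-pass iterative find with full path compression.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        # Attach smaller component under larger; update size on new root only.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True

    def connected(self, a, b):
        return self.find(a) == self.find(b)
```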
Union-Find is less about numeric or vector math and more about combinatorial structure. Still, it’s grounded in precise invariants and amortized bounds, which is why it belongs in a core data structures track.
Elements: {1,2,3,4,5,6}. Start with make-set on all.
We perform operations:
1) union(1,2)
2) union(2,3)
3) union(4,5)
4) find(3)
5) union(3,4)
Track parent[·] and rank[·].
Initialization:
parent[x]=x for x=1..6
rank[x]=0 for x=1..6
1) union(1,2)
find(1)=1, find(2)=2 (both roots)
ranks equal (0 and 0) ⇒ choose 1 as new root
parent[2]=1
rank[1]=rank[1]+1=1
Now: 1 is root of {1,2}
2) union(2,3)
First find(2): parent[2] = 1 and 1 is a root, so find(2) = 1.
(Path compression has nothing to rewire here, since 2 already points directly at its root.)
find(3)=3
Union roots 1 and 3:
rank[1]=1 > rank[3]=0 ⇒ parent[3]=1
Now: root 1 represents {1,2,3}
3) union(4,5)
find(4)=4, find(5)=5
ranks equal ⇒ parent[5]=4 and rank[4]=1
Now: root 4 represents {4,5}
4) find(3)
parent[3]=1, parent[1]=1 ⇒ root is 1
Path compression step sets parent[3]=1 (already 1)
So find(3)=1
5) union(3,4)
find(3)=1
find(4)=4
Compare ranks: rank[1]=1, rank[4]=1 (tie)
Attach one under the other; pick 1 as root:
parent[4]=1
Increment rank[1]: rank[1]=2
Note: node 5 still has parent[5]=4, but 4 now points to 1.
After this, a future find(5) will compress 5 → 1.
Insight: Two ideas are visible here: (1) union by rank prevents systematic growth of height; (2) path compression doesn’t need to run on every node immediately—future find calls gradually flatten the structure, yielding the amortized near-constant cost.
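The trace above can be replayed mechanically with a small union-by-rank sketch (1-indexed to match the example; ties keep the first root, as in the walkthrough):

```python
parent = {x: x for x in range(1, 7)}
rank = {x: 0 for x in range(1, 7)}

def find(x):
    while parent[x] != x:
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra == rb:
        return
    if rank[ra] < rank[rb]:
        ra, rb = rb, ra
    parent[rb] = ra
    if rank[ra] == rank[rb]:
        rank[ra] += 1

for a, b in [(1, 2), (2, 3), (4, 5)]:
    union(a, b)
assert find(3) == 1          # step 4 of the trace
union(3, 4)                  # step 5
assert parent[4] == 1 and rank[1] == 2
```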
Undirected graph with vertices {A,B,C,D}. Edges arrive in this order:
(A,B), (B,C), (C,D), (D,A)
Use DSU to detect when a cycle first appears.
Initialize:
parent[A]=A, parent[B]=B, parent[C]=C, parent[D]=D
Edge (A,B):
find(A)=A, find(B)=B ⇒ different
union(A,B)
Now A and B are in same set.
Edge (B,C):
find(B) = A (B points directly at root A)
find(C)=C ⇒ different
union(B,C) merges C into {A,B,C}
Edge (C,D):
find(C)=A
find(D)=D ⇒ different
union(C,D) merges D into {A,B,C,D}
Edge (D,A):
find(D)=A
find(A)=A ⇒ same representative
Therefore adding (D,A) creates a cycle.
We can report: the first cycle is detected at edge (D,A).
Insight: Cycle detection is just a same-set query: an undirected edge (u,v) forms a cycle exactly when u and v are already connected. DSU turns that global graph property into two near-constant-time finds.
Elements {1,2,3,4,5,6,7}. Use union by size + path compression.
Operations:
union(1,2), union(2,3), union(4,5), union(6,7), union(5,6)
Then query: size of component containing 7.
Initialization:
parent[x]=x
size[x]=1 for all x
union(1,2):
find(1)=1, find(2)=2
sizes tie ⇒ choose 1 as root
parent[2]=1
size[1]=2
union(2,3):
find(2)=find(1)=1
find(3)=3
Attach smaller to larger:
parent[3]=1
size[1]=3
union(4,5):
parent[5]=4
size[4]=2
union(6,7):
parent[7]=6
size[6]=2
union(5,6):
find(5): parent[5]=4 ⇒ root 4
find(6)=6
sizes: size[4]=2, size[6]=2 tie ⇒ pick 4
parent[6]=4
size[4]=4
Now 7 still points to 6, and 6 points to 4.
Query size of component containing 7:
find(7): 7 → 6 → 4 (root)
Path compression updates parent[7]=4 (and possibly parent[6]=4 already)
Answer = size[4] = 4
Insight: Union by size gives you component sizes essentially for free. Path compression ensures that repeated size queries become extremely fast because nodes quickly learn their true root.
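This trace, too, can be replayed with a small union-by-size sketch (1-indexed; ties keep the first root, matching the walkthrough):

```python
parent = {x: x for x in range(1, 8)}
size = {x: 1 for x in range(1, 8)}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # compress as we go
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra == rb:
        return
    if size[ra] < size[rb]:
        ra, rb = rb, ra
    parent[rb] = ra
    size[ra] += size[rb]

for a, b in [(1, 2), (2, 3), (4, 5), (6, 7), (5, 6)]:
    union(a, b)

# Component containing 7 has size 4, matching the worked answer.
assert size[find(7)] == 4
```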
Disjoint Sets maintains a partition: each element belongs to exactly one non-overlapping set.
Internally it stores a forest of rooted trees using parent[x] pointers; each root is a set representative.
find(x) returns the root representative; two elements are in the same set iff their roots are equal.
Path compression rewires nodes during find to point closer to the root, flattening trees over time.
Union must link root to root; otherwise you can break invariants or create cycles.
Union by rank or union by size prevents tall trees by attaching the shallower/smaller tree under the deeper/larger one.
With both heuristics, m operations on n elements take O(m α(n)) time amortized, which is effectively constant in practice.
Forgetting to call find() inside union(), and instead linking a and b directly (can create incorrect structures or cycles).
Updating rank/size on non-root nodes, or failing to update it only on the new root after a merge.
Implementing path compression incorrectly (e.g., compressing to parent[parent[x]] inconsistently) and accidentally skipping necessary updates.
Assuming DSU supports deletions efficiently; standard Union-Find is for merges (unions) and queries, not edge removals.
You have elements {1..8}. Perform: union(1,2), union(3,4), union(2,3), union(5,6), union(7,8), union(6,7). After these operations, are 1 and 4 connected? Are 5 and 8 connected?
Hint: Connectivity is determined entirely by representatives: check whether find(x) = find(y). You don’t need the exact parent array if you track which components got merged.
union(1,2) ⇒ {1,2}
union(3,4) ⇒ {3,4}
union(2,3) merges {1,2} with {3,4} ⇒ {1,2,3,4}
So 1 and 4 are connected.
union(5,6) ⇒ {5,6}
union(7,8) ⇒ {7,8}
union(6,7) merges {5,6} with {7,8} ⇒ {5,6,7,8}
So 5 and 8 are connected.
Consider a DSU using union by rank (no path compression). Prove that the rank of any node is at most ⌊log₂ n⌋, where n is the number of elements.
Hint: Rank increases only when two roots of equal rank are merged. Track a lower bound on the size of a tree as a function of its rank.
Claim: a root of rank r has at least 2ʳ nodes in its tree.
Base: r=0 ⇒ singleton tree has size 1 = 2⁰.
Inductive step: rank increases from r to r+1 only when two trees of rank r are merged. By induction each has size ≥ 2ʳ, so merged size ≥ 2ʳ + 2ʳ = 2ʳ⁺¹.
Thus size ≥ 2ʳ for rank r. Since size ≤ n, we have 2ʳ ≤ n ⇒ r ≤ log₂ n, so rank ≤ ⌊log₂ n⌋.
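The bound can also be sanity-checked empirically: run random unions with union by rank (no compression) and confirm the maximum rank never exceeds ⌊log₂ n⌋ (a sketch; `max_rank_after_random_unions` is an illustrative name):

```python
import math
import random

def max_rank_after_random_unions(n, seed=0):
    # Union by rank, no path compression, on random pairs.
    rng = random.Random(seed)
    parent = list(range(n))
    rank = [0] * n

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for _ in range(4 * n):
        ra, rb = find(rng.randrange(n)), find(rng.randrange(n))
        if ra == rb:
            continue
        if rank[ra] < rank[rb]:
            ra, rb = rb, ra
        parent[rb] = ra
        if rank[ra] == rank[rb]:
            rank[ra] += 1
    return max(rank)

# The proof says max rank ≤ floor(log2 n).
for n in (2, 16, 100, 1000):
    assert max_rank_after_random_unions(n) <= math.floor(math.log2(n))
```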
In an undirected graph with vertices 1..n, edges arrive one by one. Describe an algorithm using DSU to output the first edge that creates a cycle, or report that no cycle is created after all edges arrive. Give the time complexity in terms of m edges and n vertices.
Hint: The cycle condition for an undirected edge (u,v) is that u and v are already connected before adding the edge.
Algorithm:
Initialize DSU with make-set(1..n).
For each edge (u,v) in arrival order: if find(u) = find(v), output (u,v) as the first cycle-creating edge and stop; otherwise union(u,v).
If finished without finding such an edge, report “no cycle.”
Time: Each edge does up to two finds and possibly one union. With path compression + union by rank/size, total time is O(m α(n)).