Key-value storage with O(1) average lookup. Collision handling.
Deep-dive lesson - accessible entry point but dense material. Use worked examples and spaced repetition.
Arrays give O(1) access—if you already know the index. Hash tables are the classic trick for turning “find by key” into “go to index” on average, without scanning.
A hash table stores (key → value) pairs in an array A. A hash function h(key) deterministically maps each key to an array index i. Because different keys can map to the same i (a collision), the table needs a collision strategy (usually chaining or open addressing). With a good hash function and controlled load factor α, lookup/insert/delete are O(1) on average.
You already know a core tradeoff: arrays give O(1) access by index, but finding a value by key means scanning the whole array, which is O(n).
A hash table is a data structure that tries to keep the array’s speed while supporting associative mapping: store and retrieve values by a unique key.
Instead of asking “where is this key in the array?”, we compute where it should go.
A hash table stores key-value pairs (key → value) using:

- an array A with m slots (buckets), and
- a hash function h(key) that maps each key to an index i in 0…m−1.

Then we store the pair in A[i] (or in some structure associated with A[i]).
If every key mapped to a unique index, operations would be trivial: compute i = h(key), then read, write, or clear A[i].
That would be O(1) worst-case. But reality is messier.
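A minimal sketch of that idealized table, assuming (unrealistically) that no two keys ever share an index, and using Python's built-in `hash()` as a stand-in hash function:

```python
m = 8
A = [None] * m  # the backing array: one slot per bucket

def h(key):
    return hash(key) % m  # built-in hash(), compressed into 0..m-1

def put(key, value):
    A[h(key)] = (key, value)   # O(1): compute index, write slot

def get(key):
    entry = A[h(key)]
    if entry is not None and entry[0] == key:
        return entry[1]
    return None   # empty slot, or a different key landed here
```

The equality check in `get` hints at what breaks this idyll: two keys can land in the same slot.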
A collision happens when two different keys map to the same index: key₁ ≠ key₂ but h(key₁) = h(key₂).
Collisions are unavoidable because the space of possible keys is usually far larger than the number of buckets m.
By the pigeonhole principle, if you insert more than m distinct keys, at least one collision must occur.
So a hash table is really three things working together: an array, a hash function, and a collision-handling strategy.
When people say “hash tables have O(1) lookup,” the precise claim is: O(1) on average, assuming a well-distributed hash function and a bounded load factor; the worst case is still O(n).
The common performance lever is the load factor:

α = n / m

where n = number of stored elements, m = number of buckets.
Rough intuition: small α means mostly-empty buckets and short searches; large α means crowded buckets and longer searches.
Hash tables keep α under control by resizing (rehashing) when the table gets too full.
Arrays are fast because indexing is direct. Hashing tries to manufacture an index from a key.
So the goal of h(key) is not “cryptographic security”—it’s speed, determinism, and spreading keys evenly across the indices.
We typically think of hashing in two stages:
1) Map the key to an integer (a hash code), e.g., combine a string’s characters into one number.
2) Compress that integer into the array range: i = hashCode mod m.
So indices always land in 0…m−1.
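A sketch of both stages in code (Python's `%` already returns a non-negative result for positive m, which keeps the index valid even for negative hash codes):

```python
def bucket_index(key, m):
    code = hash(key)   # stage 1: key -> integer hash code (may be negative)
    return code % m    # stage 2: compress into the range 0..m-1

# When m is a power of 2, some implementations compress with a bit-mask
# instead; this is equivalent to % m only in that case.
def bucket_index_pow2(key, m):
    return hash(key) & (m - 1)
```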
Hash tables assume a consistent relationship between equality and hashing: if key₁ = key₂, then h(key₁) = h(key₂).
The reverse is not required (and generally false): h(key₁) = h(key₂) does not imply key₁ = key₂.
That’s exactly collisions.
A common approach is a polynomial rolling hash. For a string s with characters s[0], s[1], …, s[k−1], interpret each character as a number cᵢ (e.g., ASCII code).
One form, for a constant p (e.g., p = 31):

hash(s) = c₀·p^(k−1) + c₁·p^(k−2) + … + cₖ₋₁·p⁰

Then compress: i = hash(s) mod m.
You don’t need to memorize this. The idea is: every character influences the final number, so similar strings still tend to land in different buckets.
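A concrete sketch of the polynomial rolling hash (p = 31 and the Horner-style loop are illustrative choices, not canonical):

```python
def poly_hash(s, p=31):
    # Horner's rule evaluates c0*p^(k-1) + c1*p^(k-2) + ... + c_{k-1}
    # without computing powers explicitly.
    h = 0
    for ch in s:
        h = h * p + ord(ch)   # ord(ch) is the character's numeric code
    return h

def string_index(s, m):
    return poly_hash(s) % m   # compress into 0..m-1
```

Note that `poly_hash("ab")` and `poly_hash("ba")` differ: position matters, which is exactly the point.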
A hash function is “good” for hash tables if it is deterministic, fast to compute, and spreads typical keys roughly uniformly across 0…m−1.
If h is poor (e.g., always returns 0), performance collapses to O(n).
The modulus m matters. Many implementations choose m carefully (often prime, or a power of 2 with bit-mixing).
What you should remember at difficulty 2: hashing turns a key into an index in two steps (hash code, then mod m), and the quality of that spread, not cryptographic strength, is what determines performance.
Suppose keys distribute uniformly across m buckets. Then the expected number of keys per bucket is n/m = α.
That’s not a proof of O(1), but it builds the intuition: average work per bucket is proportional to α.
So keeping α bounded (e.g., α ≤ 0.75) is a strategy to keep operations near constant time.
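A quick simulation makes this concrete, using random bucket choices as a stand-in for a uniform hash (the sizes 1000/750 are arbitrary example values):

```python
import random

random.seed(0)
m, n = 1000, 750                       # load factor α = 0.75
buckets = [0] * m
for _ in range(n):
    buckets[random.randrange(m)] += 1  # uniform-hashing model

alpha = n / m
avg = sum(buckets) / m   # average bucket size is exactly n/m = α
print(avg, max(buckets)) # average 0.75; even the fullest bucket stays small
```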
Collisions are not an edge case—they are the normal case once the table has enough items. Two main families of strategies dominate.
Idea: A[i] stores a collection (often a linked list, dynamic array, or small balanced tree) of all entries whose keys hash to i.
Operations (high-level):

Insert(key, value):
1) i = h(key)
2) search bucket A[i] for key
3) if found, update value; else append new pair

Lookup(key):
1) i = h(key)
2) search bucket A[i] for a matching key; return its value if found

Delete(key):
1) i = h(key)
2) remove the entry from the bucket if present
Cost intuition: computing h(key) is O(1), and scanning the bucket costs time proportional to its length, which is about α on average.
So average time is often described as O(1 + α). If α is kept constant by resizing, that becomes O(1) average.
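The chaining operations above can be sketched as a minimal class (no resizing, and Python's built-in `hash()` stands in for h):

```python
class ChainedHashTable:
    def __init__(self, m=8):
        self.m = m
        self.buckets = [[] for _ in range(m)]  # each bucket is a list of pairs

    def _index(self, key):
        return hash(key) % self.m

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                   # key already present: update
                bucket[i] = (key, value)
                return
        bucket.append((key, value))        # else append a new pair

    def lookup(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False
```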
Idea: Store entries directly in the array. If A[i] is occupied, probe other indices according to a rule until you find an empty slot.
Common probing methods: linear probing (try i, i+1, i+2, …), quadratic probing (try i, i+1², i+2², …), and double hashing (step size comes from a second hash function).
Cost intuition: while α is small, probe sequences are short; as α approaches 1, they grow sharply (linear probing in particular suffers from clustering).
Also, deletion is trickier: you often need a special “tombstone” marker so searches don’t stop early.
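A sketch of linear probing with tombstones (the sentinel object and fixed table size are simplifications; real tables resize before filling up):

```python
TOMBSTONE = object()   # marks a deleted slot so later searches don't stop early

class LinearProbingTable:
    def __init__(self, m=8):
        self.m = m
        self.slots = [None] * m

    def _probe(self, key):
        i = hash(key) % self.m
        for _ in range(self.m):     # try i, i+1, i+2, ... wrapping mod m
            yield i
            i = (i + 1) % self.m

    def insert(self, key, value):
        first_free = None           # first tombstone seen: reusable slot
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is None:        # truly empty: key is not in the table
                target = first_free if first_free is not None else i
                self.slots[target] = (key, value)
                return
            if slot is TOMBSTONE:
                if first_free is None:
                    first_free = i  # remember it, but keep scanning for the key
            elif slot[0] == key:
                self.slots[i] = (key, value)   # update in place
                return
        if first_free is not None:
            self.slots[first_free] = (key, value)
        else:
            raise RuntimeError("table full")

    def lookup(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is None:
                return None         # empty slot: key cannot be further along
            if slot is not TOMBSTONE and slot[0] == key:
                return slot[1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is None:
                return False
            if slot is not TOMBSTONE and slot[0] == key:
                self.slots[i] = TOMBSTONE   # don't set None: that would
                return True                 # cut off probe chains behind it
        return False
```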
| Feature | Separate chaining | Open addressing |
|---|---|---|
| Where entries live | In buckets (lists/arrays) at A[i] | Directly in array slots |
| Handles α > 1? | Yes (buckets can grow) | No (must have n ≤ m) |
| Typical implementation complexity | Simpler | Trickier (probing + tombstones) |
| Memory overhead | Extra pointers/structures | Low overhead |
| Performance when nearly full | More graceful | Can degrade sharply |
At this level, separate chaining is easiest to reason about and is common in teaching.
To keep operations fast, hash tables resize when α crosses a threshold.
Typical policy (example): when α exceeds a threshold (say 0.75), allocate a new array with m′ = 2m buckets and reinsert every entry using h(key) mod m′.
This is called rehashing because h depends on m.
A resize costs O(n) because you reinsert all elements.
But it happens infrequently (e.g., when the table doubles). Over many inserts, the average extra cost per insert stays small.
You don’t need full amortized analysis yet—just remember: resizes are rare, so the occasional O(n) rebuild averages out to O(1) extra cost per insert.
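A resize step for a chaining table might look like this (doubling and the 0.75 threshold are the example policy from above, not universal constants):

```python
def maybe_resize(buckets, n, max_load=0.75):
    """Rehash into a table twice as large once α = n/m exceeds max_load."""
    m = len(buckets)
    if n / m <= max_load:
        return buckets                 # load factor is fine: no work
    new_m = 2 * m
    new_buckets = [[] for _ in range(new_m)]
    for bucket in buckets:             # reinsert every entry: O(n) total
        for key, value in bucket:
            new_buckets[hash(key) % new_m].append((key, value))
    return new_buckets
```

Note that entries must be recomputed against the new modulus; copying buckets over unchanged would leave keys in the wrong places.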
Hash tables appear any time you need fast “lookup by key”: language dictionaries/maps and sets, caches, symbol tables in compilers, database indexes, de-duplication.
Hash tables hit a sweet spot: near-constant-time operations, support for almost any key type, and a simple mental model.
Hash tables are not always the best choice:
1) No ordering by default: you cannot cheaply iterate keys in sorted order.
2) Worst-case can be O(n): unlucky or adversarial key sets can pile into a few buckets.
3) Prefix/range queries are awkward: hashing deliberately scatters similar keys.
A trie is a tree specialized for strings (or sequences). It supports operations in O(k) where k = length of the string, and it naturally supports prefix queries.
So the mental transition is: from “hash the whole key to one index” to “walk the key, character by character, down a tree.”
If your next goal is autocomplete, dictionary prefix search, or storing many strings with shared prefixes, tries are a natural unlock.
We have m = 5 buckets: A[0]…A[4]. The hash function is h(key) = key mod 5 for integer keys. We insert keys with values: (12 → "a"), (7 → "b"), (2 → "c"), (17 → "d"). Use separate chaining (each A[i] is a list). Then look up key 17.
Compute indices:

- 12 mod 5 = 2
- 7 mod 5 = 2
- 2 mod 5 = 2
- 17 mod 5 = 2

All keys collide into bucket A[2]. After insertion (order doesn’t matter conceptually), bucket A[2] contains:
(12 → "a"), (7 → "b"), (2 → "c"), (17 → "d")
All other buckets are empty.
Lookup key 17:
1) Compute i = h(17) = 2
2) Go to bucket A[2]
3) Scan entries until key = 17 is found
4) Return the associated value "d"
Insight: Even with a terrible distribution (everything collides), the hash table is still correct—it just loses its speed. Performance depends on keeping average bucket sizes small (α controlled + decent hashing).
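This worked example is small enough to run directly (integer keys, so the key itself serves as its hash code):

```python
m = 5
buckets = [[] for _ in range(m)]
for key, value in [(12, "a"), (7, "b"), (2, "c"), (17, "d")]:
    buckets[key % m].append((key, value))   # every key % 5 == 2

# Lookup key 17: hash to bucket 2, then scan the chain for the key.
i = 17 % m
result = next(v for k, v in buckets[i] if k == 17)
# result == "d"; all four entries sit in buckets[2]
```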
A hash table with separate chaining has m = 100 buckets and currently holds n = 60 keys. Assume keys distribute uniformly. Estimate the average number of elements you examine during a successful lookup (very rough model: scan half the bucket on average).
Compute load factor:
α = n / m = 60 / 100 = 0.6
Expected bucket size ≈ α ≈ 0.6 elements per bucket.
Interpretation: many buckets are empty; some have 1; a few have 2+.
For a successful lookup, you must search within the bucket containing the key.
A rough rule of thumb: average comparisons in a list of length L is about (L+1)/2.
Plug in L ≈ 0.6:
Average comparisons ≈ (0.6 + 1) / 2 = 0.8
Total work is: one O(1) hash computation plus roughly 0.8 comparisons, i.e., a small constant.
Insight: This is why people say O(1) average: if α stays bounded (like 0.6), the bucket work stays bounded too.
We store integer keys in a table with m = 4 using h(key) = key mod m and separate chaining. Keys inserted: 1, 5, 9. Then we resize to m′ = 8. Show the old and new bucket indices.
With m = 4:

- 1 mod 4 = 1
- 5 mod 4 = 1
- 9 mod 4 = 1
So all three land in A[1] (a collision chain).
Resize to m′ = 8. Now the hash-to-index step changes because the modulus changed:

- 1 mod 8 = 1
- 5 mod 8 = 5
- 9 mod 8 = 1

After rehashing into the new array A′: A′[1] holds {1, 9} and A′[5] holds {5}.
Bucket sizes are smaller than before.
Insight: Resizing isn’t just “making more room”—it changes where keys belong. That’s why you must reinsert (rehash) existing entries.
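In code, the index change is just the modulus change:

```python
keys = [1, 5, 9]
old = {k: k % 4 for k in keys}   # {1: 1, 5: 1, 9: 1} -- all chained in A[1]
new = {k: k % 8 for k in keys}   # {1: 1, 5: 5, 9: 1} -- key 5 moves to A'[5]
```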
A hash table implements associative mapping: store and retrieve values by key (key → value).
It uses an array A and a hash function h(key) to compute an index i, then stores the entry in A[i] (or in a structure at A[i]).
Collisions are inevitable: key₁ ≠ key₂ can still yield h(key₁) = h(key₂). Correctness requires a collision-handling strategy.
Two common collision strategies are separate chaining (buckets hold lists) and open addressing (probe for an empty slot).
Performance depends on distribution quality and load factor α = n/m; average bucket size is about α under uniform hashing assumptions.
Resizing (rehashing) keeps α bounded; it costs O(n) sometimes but keeps average operations near O(1).
Hash tables are great for exact-match lookups, but they don’t naturally support prefix/range queries—this motivates structures like tries.
Assuming O(1) is worst-case: hash tables are typically O(1) average but can be O(n) in the worst case with many collisions.
Forgetting to compare keys after hashing: h(key) only chooses a bucket/slot; you still must check equality to find the right entry.
Ignoring load factor: letting α grow too large without resizing leads to long chains or expensive probing.
Implementing deletion incorrectly in open addressing (not using tombstones), which can break future lookups.
A hash table uses m = 10 buckets and separate chaining. You insert n = 25 keys. (1) Compute the load factor α. (2) If keys distribute uniformly, what is the expected bucket size?
Hint: Use α = n/m. Under uniform distribution, expected bucket size ≈ α.
(1) α = n/m = 25/10 = 2.5
(2) Expected bucket size ≈ 2.5 entries per bucket.
Given m = 7 and h(key) = key mod 7 for integer keys, insert keys 10, 3, 17, 24 into a chaining hash table. List which bucket A[i] each key goes to, and identify collisions.
Hint: Compute each key mod 7. A collision means two keys share the same i.
Compute indices:

- 10 mod 7 = 3
- 3 mod 7 = 3
- 17 mod 7 = 3
- 24 mod 7 = 3
All collide in bucket A[3].
Open addressing (linear probing): m = 8, h(key) = key mod 8. Insert keys in order: 3, 11, 19. Show the final array indices used (assume empty array initially).
Hint: All three keys have the same initial index. With linear probing, you try i, i+1, i+2,… wrapping mod m.
Compute initial indices: 3 mod 8 = 3, 11 mod 8 = 3, 19 mod 8 = 3.
Probe: 3 goes to A[3]. 11 finds A[3] occupied, so it takes A[4]. 19 finds A[3] and A[4] occupied, so it takes A[5].
Final placements: 3@A[3], 11@A[4], 19@A[5].
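The probing walk can be checked with a few lines of code (integer keys again, so `key % m` is the initial index):

```python
m = 8
table = [None] * m

def insert_lp(key):
    i = key % m
    while table[i] is not None:   # linear probing: try i, i+1, ... mod m
        i = (i + 1) % m
    table[i] = key
    return i                      # report where the key finally landed

positions = [insert_lp(k) for k in (3, 11, 19)]
# positions == [3, 4, 5]
```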
Next: Tries
Related reinforcement nodes you likely already know: