Fundamental limits on algorithm performance.
Some algorithmic problems are hard not because we haven’t found the right trick yet, but because the input simply cannot reveal the answer fast enough. Information-theoretic bounds formalize that idea: regardless of cleverness, you need a minimum number of bits from the world to identify the correct output with small error.
Information-theoretic lower bounds argue that any algorithm must acquire enough information to distinguish among many possible answers. You quantify (1) how much uncertainty the problem initially has via entropy H(X), (2) how much information each observation can provide via mutual information I(X;Y), and (3) how remaining uncertainty forces errors via Fano’s inequality. This yields lower bounds on sample complexity, query complexity, and communication—even for optimal algorithms.
In many lower-bound arguments, you don’t analyze a specific algorithm—you analyze the best possible algorithm. Information-theoretic lower bounds do this by treating the unknown solution as a random variable and asking:
If the observations do not carry enough information, then any estimator/algorithm must fail with nontrivial probability.
This perspective is powerful because it separates two things: the information the problem demands (how many bits are needed to pin down the answer) and the information the observations supply (how many bits they can deliver).
Then a lower bound follows from a budget argument:
If each observation yields at most a small number of bits, you need many observations to shrink uncertainty enough.
Depending on the model, the algorithm obtains information by drawing samples, issuing queries, or exchanging messages.
Information-theoretic bounds are most natural when the algorithm’s interaction with the input can be represented as a random variable Y produced from X via some channel p(y|x) (possibly adaptive).
Let X be an unknown “instance label” (e.g., which hypothesis is true). The algorithm observes Y (data, transcript, query answers) and outputs an estimate \(\hat{X}\). If the algorithm must succeed with error ≤ δ, then Y must carry nearly all of the information needed to pin down X; the tools below make this precise.
The three core tools in this node are entropy H(X), mutual information I(X;Y), and Fano’s inequality.
If X is uniform over M possible cases, then
H(X) = log₂ M.
Even if X isn’t uniform, H(X) captures the average number of bits needed to describe X optimally.
This yields the first intuition: if your observations only reveal, say, ≤ 0.1 bits each on average, you need about (log₂ M)/0.1 observations to identify X reliably.
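A quick way to internalize this budget arithmetic is to compute it. The sketch below is illustrative: the function name and the 0.1-bit-per-observation cap are assumptions for the example, not part of any specific problem.

```python
import math

def observations_needed(num_cases: int, bits_per_obs: float) -> float:
    """Bits needed to pin down one of `num_cases` equally likely answers,
    divided by an assumed cap on information per observation."""
    return math.log2(num_cases) / bits_per_obs

# If each observation reveals at most 0.1 bits on average,
# identifying one of 1024 cases needs on the order of 100 observations.
print(observations_needed(1024, 0.1))  # 100.0
```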
They are: statements about information, i.e., bounds on samples, queries, or communication that hold for every algorithm, however clever.
They are not: statements about computation time; a problem can require few bits of information yet still be computationally hard.
| Question | Information-theoretic view | Computational view |
|---|---|---|
| What limits success? | Not enough bits in observations | Not enough time/operations |
| Typical output | Sample/query/communication lower bound | Time lower bound (rare) |
| Tools | H(·), I(·;·), Fano, data processing | Reductions, circuit complexity, fine-grained |
| Handles randomness/noise? | Naturally | Often awkward |
The rest of this lesson builds the standard pipeline: choose a family of hard instances (a random X), model what the algorithm sees (Y), upper-bound I(X;Y), and then invoke Fano to lower-bound error unless m is large.
Before any observation, the algorithm does not know which instance it is facing. If there are many plausible instances, then the algorithm has large uncertainty.
To turn “many plausible instances” into a number, we choose a random variable X indexing instances. Two common designs: (1) X uniform over a finite family of explicitly constructed hard instances; (2) X uniform over a finite packing of a continuous parameter space.
The second is crucial in estimation problems where the unknown is continuous (mean estimation, regression coefficients w, etc.). You discretize the space into many separated candidates so that any sufficiently accurate estimate identifies exactly one candidate.
That creates a bridge from estimation accuracy to multi-way hypothesis testing.
Suppose the algorithm must output an element of some answer set A (e.g., which of M cases holds). If there are M equally likely answers, then:
H(X) = log₂ M.
Interpretation: even with an optimal prefix code, you’d need ~log₂ M bits to specify the answer.
Now connect to observations Y. The maximum information you could possibly gain about X from Y is I(X;Y). If I(X;Y) is much smaller than H(X), then a lot of uncertainty remains.
Suppose the goal is to estimate a parameter θ ∈ Θ with error ≤ ε in some metric d(·,·). Build a set {θ₁,…,θ_M} such that
d(θ_i, θ_j) ≥ 2ε for all i ≠ j.
This is a 2ε-packing (sometimes also called a separated set). If the estimator outputs \(\hat{θ}\) with d(\(\hat{θ}\), θ) ≤ ε, then \(\hat{θ}\) must be closer to the correct θ_i than to any other θ_j. That means from \(\hat{θ}\) we can decode X with zero error.
So: an estimator with error ≤ ε immediately yields a zero-error identifier of X.
Therefore, if we show identification requires many samples/queries, then estimation also requires many.
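A minimal sketch of this reduction for a one-dimensional parameter; the grid of candidates and the value of ε are illustrative choices, not part of any particular problem.

```python
import numpy as np

eps = 0.5
# A 2*eps-separated packing of candidates: |theta_i - theta_j| >= 2*eps.
candidates = np.arange(0.0, 10.0, 2 * eps)

def decode(theta_hat: float) -> int:
    """Map an estimate back to the index of the nearest candidate.
    If the estimate is within eps of the true candidate, no other
    candidate can be as close, so decoding recovers the true index."""
    return int(np.argmin(np.abs(candidates - theta_hat)))

true_index = 7
theta_hat = candidates[true_index] + 0.3   # an estimate with error <= eps
assert decode(theta_hat) == true_index      # accurate estimation => identification
```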
Let X be uniform on {1,…,M}. Conditioned on X=i, the world generates observations Y according to distribution P_i.
Any estimator/algorithm produces output \(\hat{X}\) from Y. If we can prove that for m observations,
I(X;Y) < (1 − δ) log₂ M − 1,
then no algorithm can solve the corresponding estimation problem with high probability at that sample size.
At this point we have identified the “needed information”: H(X) ≈ log₂ M. Next we need to limit how much the algorithm can learn from its interactions.
But before that, notice a subtlety: a large H(X) by itself proves nothing; the bound only bites once we cap how much the observations can deliver. Entropy is the demand (how much must be learned), while mutual information is the supply (how much can be learned).
When selecting {θ_i} or distributions {P_i}, you want many candidates (large M) that are far apart in the target metric yet induce observation distributions that are statistically close, so that I(X;Y) stays small.
Common closeness measures used to upper-bound I(X;Y) include KL divergence, total variation distance, Hellinger distance, and χ² divergence.
Even though KL is not in the node’s prerequisite list, it appears naturally because mutual information can be upper-bounded by expected KL.
You can think of the process as a communication channel: nature picks X, the channel p(y|x) emits Y, and the algorithm acts as a decoder producing \(\hat{X}\).
If the channel is too noisy, even the optimal decoder cannot recover X reliably.
That statement becomes formal with the next mechanic: mutual information and the data processing inequality.
An algorithm doesn’t just need to see data; it needs to see data that depends on the unknown. Mutual information captures exactly how much knowing Y reduces uncertainty about X.
Definition (in bits):
I(X;Y) = H(X) − H(X|Y).
Key interpretations: I(X;Y) is the expected reduction in uncertainty about X from observing Y; it is symmetric (I(X;Y) = I(Y;X)) and always nonnegative.
So if we can upper-bound I(X;Y) by something like m·c (m samples times c bits per sample), then we can relate m to H(X).
A universal constraint: information cannot increase by post-processing.
If X → Y → Z is a Markov chain (Z is computed from Y), then
I(X;Z) ≤ I(X;Y).
This matters because the algorithm’s internal state, transcript, and final answer \(\hat{X}\) are all functions of observed data. DPI says: I(X;\(\hat{X}\)) ≤ I(X;Y).
So it suffices to bound I(X;Y), even if the algorithm is arbitrarily clever.
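The following sketch checks the DPI numerically on a toy example; the joint distribution and the merging map Z = f(Y) are arbitrary choices for illustration.

```python
import numpy as np

def mutual_information(joint: np.ndarray) -> float:
    """I(X;Y) in bits, computed from a joint distribution p(x, y)."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

# X uniform on {0,1}; Y is a noisy 4-symbol observation of X.
joint_xy = np.array([[0.30, 0.10, 0.05, 0.05],
                     [0.05, 0.05, 0.10, 0.30]])

# Post-process: Z = f(Y) merges symbols {0,1} -> 0 and {2,3} -> 1.
joint_xz = np.stack([joint_xy[:, :2].sum(axis=1),
                     joint_xy[:, 2:].sum(axis=1)], axis=1)

print(mutual_information(joint_xy))  # I(X;Y) ~ 0.31 bits
print(mutual_information(joint_xz))  # I(X;Z) ~ 0.28 bits <= I(X;Y), as DPI guarantees
```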
If Y = (Y₁,…,Y_m) are m observations (possibly dependent), then:
I(X;Y) = ∑_{t=1}^m I(X;Y_t | Y₁,…,Y_{t−1}).
This is crucial for adaptive algorithms: Y_t may depend on past observations through the algorithm’s choice of query or experiment.
Even then, the chain rule holds, and you can bound each term by the maximum information one step can reveal.
A typical setup: conditioned on X=i, the observations are i.i.d. from P_i. Then
I(X;Y) ≤ ∑_{t=1}^m I(X;Y_t)
when Y_t are conditionally independent given X.
More generally, with adaptivity you often show:
I(X;Y_t | Y₁,…,Y_{t−1}) ≤ c
for all t, and conclude I(X;Y) ≤ m·c.
One standard identity (useful as a bounding technique) is:
I(X;Y) = ∑_x P(x) · KL(P(Y|X=x) ‖ P(Y)),
where P(Y) is the mixture distribution ∑_x P(x)P(Y|x).
This expresses information as “how far each conditional distribution is from the mixture.” If all P(Y|X=x) are close to each other, then they are all close to the mixture, giving small I(X;Y).
A common upper bound for finite hypothesis sets with X uniform:
I(X;Y) ≤ (1/M²) ∑_{i,j} KL(P_i ‖ P_j)
(up to standard variants/inequalities). The exact constant form depends on the lemma used, but the theme is stable: pairwise closeness limits information.
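Here is a small numeric illustration of both the mixture identity and the pairwise bound; the hypothesis family P is an arbitrary example chosen so the P_i are close to one another.

```python
import numpy as np

def kl_bits(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) in bits (assumes q > 0 wherever p > 0)."""
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

# M = 3 hypotheses, each inducing a distribution over 3 outcomes.
P = np.array([[0.40, 0.35, 0.25],
              [0.35, 0.40, 0.25],
              [0.30, 0.35, 0.35]])
mixture = P.mean(axis=0)                  # P(Y) under uniform X

# Exact identity: I(X;Y) = (1/M) * sum_i KL(P_i || mixture).
info = np.mean([kl_bits(p, mixture) for p in P])

# Looser pairwise bound: I(X;Y) <= (1/M^2) * sum_{i,j} KL(P_i || P_j).
pairwise = np.mean([[kl_bits(p, q) for q in P] for p in P])

print(info, pairwise)  # info <= pairwise; both small since the P_i are close
```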
At a high level, once you have the demand H(X) = log₂ M and the supply I(X;Y) ≤ m·c, the remaining uncertainty is
H(X|Y) = H(X) − I(X;Y) ≥ log₂ M − m·c.
If m·c is much smaller than log₂ M, then H(X|Y) remains large. Large conditional entropy means many labels are still plausible after seeing the data.
But to turn that into a concrete error probability statement, we need Fano’s inequality.
In comparison-based sorting, each comparison yields 1 bit of outcome (less/greater; ignore equality for simplicity). If the hidden object is which permutation is present, then:
H(X) = log₂(n!) ≈ n log₂ n bits.
If the transcript Y consists of m comparison outcomes, then H(Y) ≤ m bits and thus I(X;Y) ≤ H(Y) ≤ m.
This is the information-theoretic skeleton behind the classic Ω(n log n) comparisons lower bound.
Important nuance: this argument assumes the comparisons are the only information source (no extra structure). It cleanly demonstrates how information per query constrains performance.
So far we can show that after observing Y, the algorithm still has uncertainty H(X|Y). But an algorithm doesn’t output a distribution; it outputs a single guess \(\hat{X}\). We need a theorem that converts “remaining uncertainty” into “probability of being wrong.”
Fano’s inequality does exactly that for multi-class identification.
Let X take values in {1,…,M}. Let \(\hat{X}\) be any estimator based on Y. Define error event E = [\(\hat{X} \neq X\)]. Then:
H(X|Y) ≤ h₂(P(E)) + P(E)·log₂(M−1),
where h₂(p) = −p log₂ p − (1−p) log₂(1−p) is the binary entropy.
Rearranging yields a lower bound on P(E). A widely used corollary is:
P(E) ≥ 1 − (I(X;Y) + 1)/log₂ M
(assuming X is uniform; the “+1” comes from bounding h₂(p) ≤ 1).
So if I(X;Y) is significantly smaller than log₂ M, then the error probability must be bounded away from 0.
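As a sketch, the corollary is one line of Python; the example numbers are hypothetical.

```python
import math

def fano_error_floor(info_bits: float, num_hypotheses: int) -> float:
    """Fano corollary for uniform M-way identification:
    P(error) >= 1 - (I(X;Y) + 1) / log2(M)."""
    return max(0.0, 1.0 - (info_bits + 1.0) / math.log2(num_hypotheses))

# With 1024 hypotheses and only 4 bits of mutual information,
# every algorithm errs at least half the time.
print(fano_error_floor(4.0, 1024))  # 0.5
```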
The key idea: to describe X given Y, you can describe whether you made an error, and if you did, which wrong label occurred.
Let E be the error indicator. Then:
1) Expand conditional entropy by introducing E:
H(X|Y)
= H(X, E | Y) − H(E | X, Y)
But E is a function of (X, \(\hat{X}\)), and \(\hat{X}\) is a function of Y, so given X and Y, E is determined. Thus:
H(E | X, Y) = 0
So:
H(X|Y) = H(X, E | Y)
2) Use chain rule:
H(X, E | Y)
= H(E | Y) + H(X | E, Y)
3) Bound each term:
First, H(E | Y) ≤ H(E) = h₂(P(E)), since conditioning cannot increase entropy.
For the second term, condition on whether an error happened:
H(X | E, Y)
= P(E=0)·H(X | E=0, Y) + P(E=1)·H(X | E=1, Y)
If E=0, then \(\hat{X}=X\), so X is determined by \(\hat{X}\) and hence by Y, giving:
H(X | E=0, Y) = 0
If E=1, then X can be any of the M−1 wrong labels, so:
H(X | E=1, Y) ≤ log₂(M−1)
Combine:
H(X | E, Y) ≤ P(E)·log₂(M−1)
4) Put it together:
H(X|Y)
≤ h₂(P(E)) + P(E)·log₂(M−1)
This is Fano.
So: to get low error, you must learn almost all the bits needed to specify X.
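To see how tight the inequality can be, the following sketch evaluates both sides for a toy joint distribution and the MAP estimator; the distribution is an arbitrary illustration.

```python
import numpy as np

def h2(p: float) -> float:
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

# A small joint distribution p(x, y) with M = 3 labels and 3 observations.
joint = np.array([[0.20, 0.05, 0.05],
                  [0.05, 0.20, 0.05],
                  [0.05, 0.05, 0.30]])

py = joint.sum(axis=0)
cond = joint / py                       # p(x | y), columns indexed by y
h_x_given_y = float(-(joint * np.log2(cond)).sum())

# MAP estimator: guess the most likely x for each y; compute its error.
p_correct = sum(joint[np.argmax(cond[:, y]), y] for y in range(3))
p_err = 1.0 - p_correct

M = 3
rhs = h2(p_err) + p_err * np.log2(M - 1)
print(h_x_given_y, rhs)                 # ~1.175 <= ~1.181: Fano holds, nearly tight
assert h_x_given_y <= rhs + 1e-12
```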
If your transcript Y is deterministic and has at most 2^m possible outcomes, you can sometimes argue:
2^m ≥ M, hence m ≥ log₂ M.
That’s a noiseless, worst-case counting bound.
Fano is the noisy/general version: it applies even when observations are random and a small error probability δ is tolerated, in which case I(X;Y) ≥ (1 − δ) log₂ M − 1 is still required.
This makes it indispensable for sample complexity and noisy query models.
Most information-theoretic lower bounds follow a pattern: (1) pick a family of hard instances and a random X over it; (2) model what the algorithm observes as Y; (3) upper-bound I(X;Y), typically via a per-step cap plus the chain rule; (4) invoke Fano to show the error stays large unless the resource budget is big.
This turns “how hard is it?” into “how much can you learn per step?”
Applied to sorting, this pattern recovers m ≥ log₂(n!): a clean information-theoretic bound matching the classical Ω(n log n).
If each comparison is flipped with probability η, then each query conveys < 1 bit of information. You can quantify I(X;Y_t | history) ≤ 1 − h₂(η) bits (a standard channel capacity fact for the binary symmetric channel). Then:
m(1 − h₂(η)) ≥ (1 − δ) log₂(n!) − 1.
So noisy comparisons increase query complexity by a factor ≈ 1/(1 − h₂(η)).
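A quick computation of this inflation factor for a few illustrative noise levels:

```python
import math

def h2(p: float) -> float:
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Multiplicative blow-up in comparisons when each outcome flips w.p. eta.
for eta in (0.01, 0.05, 0.1, 0.2):
    print(eta, 1 / (1 - h2(eta)))
# eta=0.1 gives a factor of roughly 1.9: noisy sorting needs ~2x the comparisons.
```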
Let the unknown be θ ∈ {−ε, +ε}. Each sample Y_t = θ + Z_t where Z_t ∼ N(0,σ²). One sample provides limited information about which sign is correct. With m samples, I(X;Y) grows like m·(ε²/σ²) for small ε/σ, yielding m = Ω(σ²/ε²) to achieve small error. This matches standard statistical rates.
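A sketch of the per-sample information cap using the pairwise-KL bound from earlier (with M = 2); the constants are not tight, but the σ²/ε² scaling is visible.

```python
import math

def per_sample_info_bits(eps: float, sigma: float) -> float:
    """Pairwise-KL bound on I(X; Y_t) for X uniform on {-eps, +eps} and
    Y_t ~ N(X, sigma^2). Uses KL(N(a,s^2) || N(b,s^2)) = (a-b)^2 / (2 s^2) nats,
    so the (1/M^2)-pairwise bound gives (1/4) * 2 * KL."""
    kl_nats = (2 * eps) ** 2 / (2 * sigma ** 2)     # KL in each direction
    return 0.25 * 2 * kl_nats * math.log2(math.e)   # convert nats to bits

# The information budget says m * per_sample_info must reach ~1 bit,
# so m scales like sigma^2 / eps^2 (constants aside).
eps, sigma = 0.1, 1.0
print(1.0 / per_sample_info_bits(eps, sigma))  # ~69; note sigma^2/eps^2 = 100
```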
Let X be split across two parties. The transcript T has length m bits, hence I(X;T) ≤ m. If the function output requires distinguishing among many possibilities, then H(output) is large and you derive m = Ω(H(output)) or stronger.
A lower bound is only as strong as the hardest distribution family you pick. Typical strategies: make M as large as possible while keeping the observation distributions P_i statistically close, and in continuous problems build a 2ε-packing so that estimation reduces to identification.
Big-O is typically used for upper bounds: algorithm A runs in O(f(n)). Information-theoretic arguments yield Ω(·) lower bounds on resources such as samples, queries, and communication bits.
In some problems (sorting), this directly yields a time lower bound in a restricted comparison model. In others (statistical estimation), it yields sample complexity lower bounds; computation may be easier or harder depending on the setting.
When you face a new problem and wonder if an information bound applies, ask: Can the unknown be modeled as a random variable X with large entropy? Can everything the algorithm sees be modeled as a channel output Y? Can the information each observation reveals be capped?
If yes, you likely can produce a meaningful information-theoretic lower bound.
We sort n distinct keys using only pairwise comparisons. Let X be the (unknown) total order/permutation of the n keys, assumed uniform over all n! permutations. The algorithm performs m comparisons and observes transcript Y ∈ {0,1}^m (each bit indicates the comparison outcome). We show m ≥ log₂(n!) comparisons are necessary for zero-error sorting in the comparison model.
1) Model the uncertainty in the input order.
X ∈ S_n, uniform.
H(X) = log₂(n!).
2) Relate the transcript to information gained.
Each comparison outcome is 1 bit, so the full transcript Y has at most 2^m possible values.
Thus H(Y) ≤ m.
3) Use the mutual information bound.
I(X;Y) ≤ H(Y) ≤ m.
4) For a correct sorting algorithm (zero error), Y must determine X’s order uniquely.
That means H(X|Y) = 0.
So:
I(X;Y) = H(X) − H(X|Y) = H(X).
5) Combine.
H(X) = I(X;Y) ≤ m
⇒ m ≥ log₂(n!).
6) Convert log₂(n!) to n log n.
Using Stirling’s approximation:
log₂(n!) ≈ n log₂ n − (log₂ e) n + O(log n).
So m = Ω(n log n).
Insight: This is an information budget argument: the unknown permutation contains about n log₂ n bits. Each comparison reveals at most 1 bit, so you need Ω(n log n) comparisons—no algorithmic cleverness can change that in the comparison model.
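A quick check of the bound for concrete n, using the log-gamma function to evaluate log₂(n!) without huge integers:

```python
import math

def sorting_lower_bound(n: int) -> float:
    """Minimum comparisons for zero-error comparison sorting: log2(n!)."""
    return math.lgamma(n + 1) / math.log(2)   # log2(n!) via log-gamma

for n in (10, 100, 1000):
    print(n, sorting_lower_bound(n), n * math.log2(n))
# log2(n!) tracks n*log2(n) up to the linear correction from Stirling.
```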
Unknown X ∈ {1,…,M} is uniform. Conditioned on X=i, we observe m i.i.d. samples Y_t from a distribution P_i. Suppose we can show a per-sample information bound I(X;Y_t) ≤ α bits (for example because all P_i are very similar). Derive an m requirement to make error ≤ δ using Fano.
1) Start from Fano’s corollary.
P(error) ≥ 1 − (I(X;Y) + 1)/log₂ M.
2) Use conditional independence to add information over samples.
Y = (Y₁,…,Y_m), with Y_t i.i.d. given X.
Then:
I(X;Y) ≤ ∑_{t=1}^m I(X;Y_t) ≤ mα.
3) Plug into Fano.
P(error) ≥ 1 − (mα + 1)/log₂ M.
4) Require P(error) ≤ δ.
We need:
1 − (mα + 1)/log₂ M ≤ δ
⇒ (mα + 1)/log₂ M ≥ 1 − δ
⇒ mα + 1 ≥ (1 − δ) log₂ M
⇒ m ≥ ((1 − δ) log₂ M − 1)/α.
5) Interpret.
If α is small (each sample is only weakly informative), m must scale like (log M)/α.
Insight: Fano turns an information upper bound into an unavoidable error floor. If each sample carries only α bits about which hypothesis is true, you need on the order of log₂ M / α samples to identify the correct hypothesis with small error.
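The final formula is easy to wrap as a utility; the example numbers below are hypothetical.

```python
import math

def samples_needed(num_hypotheses: int, alpha_bits: float, delta: float) -> float:
    """Fano-based sample lower bound: m >= ((1-delta)*log2(M) - 1) / alpha."""
    return ((1 - delta) * math.log2(num_hypotheses) - 1) / alpha_bits

# 1024 hypotheses, 0.05 bits per sample, target error 5%:
print(samples_needed(1024, 0.05, 0.05))  # 170.0
```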
There is an unknown X uniform over {1,…,N}. You may ask yes/no questions adaptively, but each answer is flipped independently with probability η (0 < η < 1/2). Let the transcript be Y = (Y₁,…,Y_m). Show that m must be at least about log₂ N divided by the per-query information (≈ 1 − h₂(η)) to drive error small.
1) Needed information.
H(X) = log₂ N.
2) Apply Fano (target error δ).
δ ≥ P(error) ≥ 1 − (I(X;Y) + 1)/log₂ N
⇒ I(X;Y) ≥ (1 − δ) log₂ N − 1.
3) Use chain rule for adaptive queries.
I(X;Y) = ∑_{t=1}^m I(X;Y_t | Y₁,…,Y_{t−1}).
4) Bound each term by the channel capacity of a noisy yes/no answer.
Given the true (noise-free) answer A_t ∈ {0,1}, the observed Y_t is A_t flipped with probability η.
No matter how the question is chosen, Y_t is A_t passed through a binary symmetric channel BSC(η), so:
I(X;Y_t | history) ≤ I(A_t;Y_t | history) ≤ 1 − h₂(η).
5) Sum over m queries.
I(X;Y) ≤ m(1 − h₂(η)).
6) Combine with step (2).
m(1 − h₂(η)) ≥ (1 − δ) log₂ N − 1
⇒ m ≥ ((1 − δ) log₂ N − 1)/(1 − h₂(η)).
Insight: Noise reduces the information per answer from 1 bit to at most 1 − h₂(η) bits. The lower bound is a direct “bits required / bits per query” calculation, formalized through mutual information and Fano.
Information-theoretic lower bounds treat the unknown instance/answer as a random variable X and the algorithm’s observations as Y.
Entropy H(X) quantifies the minimum information (in bits) needed to identify the correct hypothesis among many possibilities; for uniform M-way choice, H(X)=log₂ M.
Mutual information I(X;Y)=H(X)−H(X|Y) quantifies how much the observations reduce uncertainty; it is the “rate of learning.”
Data processing inequality ensures post-processing can’t create information: I(X;\(\hat{X}\)) ≤ I(X;Y), so you can bound any algorithm by bounding its observation channel.
Chain rule lets you handle adaptivity: I(X;Y)=∑ I(X;Y_t | history), so a per-step information cap yields a total information cap.
Fano’s inequality converts remaining uncertainty into an unavoidable error probability: if I(X;Y) ≪ log₂ M, then any estimator must err with probability bounded away from 0.
A common strategy for continuous problems is packing: build many well-separated parameters so accurate estimation would imply identifying one of M hypotheses.
These bounds typically yield Ω(sample/query/communication) limits; they can match classical results like Ω(n log n) comparisons for sorting and show noise inflation factors in query models.
Assuming each observation gives 1 bit of information: in noisy or continuous settings, the mutual information per sample can be far smaller (and must be bounded, not guessed).
Forgetting adaptivity: bounding I(X;Y) as m·I(X;Y₁) without checking conditional independence or without using the chain rule correctly.
Using Fano with an M that is too small (weak packing), which yields a valid but loose bound; the choice of hypothesis family is crucial.
Confusing worst-case and average-case: information-theoretic bounds are often proved under a chosen prior over instances; you must state clearly what guarantee (average error vs worst-case) you conclude.
You have an unknown X uniform over {1,…,256}. You observe m samples Y₁,…,Y_m such that for every t, I(X;Y_t | Y₁,…,Y_{t−1}) ≤ 0.2 bits. Using Fano’s corollary, give a lower bound on m needed to ensure P(error) ≤ 0.1.
Hint: Use log₂ 256 = 8. Bound I(X;Y) by the chain rule, then require I(X;Y) ≥ (1−δ)log₂ M − 1.
Here M=256 so log₂ M=8 and δ=0.1.
Fano corollary rearranged:
I(X;Y) ≥ (1−δ)log₂ M − 1 = 0.9·8 − 1 = 7.2 − 1 = 6.2 bits.
Chain rule with per-step cap:
I(X;Y) = ∑_{t=1}^m I(X;Y_t | history) ≤ 0.2m.
So 0.2m ≥ 6.2 ⇒ m ≥ 31 (conveniently already an integer).
Comparison sorting with duplicates: Suppose n items contain only k distinct keys (k ≤ n), and the algorithm must output a stable sorted order. Give an information-theoretic lower bound on the number of comparisons in terms of the number of distinct stable outputs.
Hint: Let X be the true stable-sorted outcome among M possibilities. Each comparison yields 1 bit, so need at least log₂ M comparisons (up to zero-error assumptions). What is M for multiset permutations with stability?
Let the multiset have counts c₁,…,c_k summing to n. There are n!/∏_{i=1}^k c_i! distinct arrangements of the multiset, and for a stable sort each arrangement determines a unique output permutation (ties among equal keys are broken by input position). So the algorithm must distinguish exactly M = n!/∏ c_i! cases.
If X is uniform over these M possibilities, then H(X)=log₂ M.
A transcript of m comparisons has at most 2^m outcomes, so m ≥ log₂ M = log₂(n!) − ∑_{i=1}^k log₂(c_i!).
This lower bound reduces to Ω(n log n) when all keys are distinct (all c_i=1), and becomes smaller when many duplicates exist.
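A sketch that evaluates this bound for a couple of count profiles (the profiles are illustrative):

```python
import math

def log2_factorial(k: int) -> float:
    """log2(k!) via the log-gamma function (avoids huge integers)."""
    return math.lgamma(k + 1) / math.log(2)

def stable_sort_lower_bound(counts: list[int]) -> float:
    """Comparisons needed: log2(n!) - sum_i log2(c_i!)."""
    n = sum(counts)
    return log2_factorial(n) - sum(log2_factorial(c) for c in counts)

print(stable_sort_lower_bound([1] * 16))  # all distinct: log2(16!) ~ 44.25
print(stable_sort_lower_bound([8, 8]))    # two keys: log2(C(16,8)) ~ 13.65
```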
Noisy yes/no identification: X is uniform on {1,…,N}. You can ask adaptive yes/no questions, each answer flipped with probability η=0.1. Using the bound m ≥ ((1−δ)log₂ N − 1)/(1 − h₂(η)), estimate m needed for N=1,048,576 (which is 2^20) and δ=0.01.
Hint: Compute log₂ N = 20. Compute h₂(0.1) ≈ −0.1 log₂ 0.1 − 0.9 log₂ 0.9 ≈ 0.469. Then 1 − h₂(η) ≈ 0.531.
Given N=2^20, log₂ N=20. With δ=0.01:
Required information:
(1−δ)log₂ N − 1 = 0.99·20 − 1 = 19.8 − 1 = 18.8 bits.
Binary entropy at η=0.1:
h₂(0.1) = −0.1 log₂ 0.1 − 0.9 log₂ 0.9
≈ −0.1(−3.3219) − 0.9(−0.1520)
≈ 0.3322 + 0.1368
≈ 0.469.
So per-query info cap ≈ 1 − h₂(0.1) ≈ 0.531.
Thus:
m ≥ 18.8 / 0.531 ≈ 35.4.
So at least about 36 noisy questions are required (by this information-theoretic bound) to get error ≤ 0.01.
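A quick check of this arithmetic in Python:

```python
import math

def h2(p: float) -> float:
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def noisy_queries_needed(N: int, eta: float, delta: float) -> float:
    """m >= ((1-delta)*log2(N) - 1) / (1 - h2(eta))."""
    return ((1 - delta) * math.log2(N) - 1) / (1 - h2(eta))

print(noisy_queries_needed(2**20, 0.1, 0.01))  # ~35.4, so at least 36 questions
```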