Maximum reliable transmission rate. Shannon's theorem.
Why can your Wi‑Fi stream video reliably even though the air is noisy, while a slightly higher bitrate suddenly collapses into glitches? Channel capacity is the sharp boundary that explains this: below a certain rate you can drive error probability arbitrarily close to 0 with good codes; above it you cannot, no matter what you do.
A channel is a probabilistic mapping p(y|x). Its capacity C is the maximum mutual information between input and output over all input distributions: C = max_{p(x)} I(X;Y) (bits per channel use). Shannon’s noisy-channel coding theorem says: if R < C, there exist codes with vanishing error; if R > C, reliable communication is impossible. For common channels: BSC(p): C = 1 − H₂(p); AWGN with power P and noise variance σ²: C = (1/2) log₂(1 + P/σ²) per real use (or B log₂(1 + SNR) bits/s for bandwidth B and noise density N₀/2).
Communication over a noisy medium has two competing forces: you want to push as many bits as possible through each channel use, while noise corrupts symbols and forces you to spend some of those uses on redundancy.
Before Shannon, it wasn’t clear whether noise merely degrades performance smoothly, or whether there is a crisp limit. Shannon’s insight is that, for a broad class of channels, there is a single number C that acts as a boundary.
Capacity is therefore the maximum reliable transmission rate.
You already know random variables, entropy, and mutual information. A channel formalizes how an input symbol becomes an output symbol stochastically:
A single channel use is: draw an input X ∼ p(x), pass it through the channel, and observe an output Y ∼ p(y|X).
So the induced joint distribution is:
p(x,y) = p(x)p(y|x)
and the output marginal is:
p(y) = ∑ₓ p(x)p(y|x)
This is the exact setting in which mutual information I(X;Y) is defined.
A block code of length n uses the channel n times.
A communication system is reliable if it can send one of M messages with small error probability. For a given channel, capacity C is the supremum of rates R for which reliable communication is possible.
For a discrete memoryless channel (DMC), Shannon’s capacity has a remarkably clean expression:
C = max_{p(x)} I(X;Y)
where I(X;Y) is computed under p(x,y) = p(x)p(y|x).
This is deep because it turns a problem about codes (combinatorial objects over long blocks) into a problem about choosing a good input distribution p(x) for a single use of the channel.
Mutual information measures how many bits the output reveals about the input:
I(X;Y) = H(X) − H(X|Y)
So I(X;Y) is the net information delivered through the noise in one channel use. If you can design the input distribution, you can maximize this delivered information.
We’ll focus on the conceptual heart: why C = max I(X;Y), what the theorem says, and how to compute C in key examples.
The channel law p(y|x) is fixed by physics. But the input distribution p(x) is (often) under your control: you can choose how frequently to use each symbol.
Different p(x) produce different joint distributions p(x,y), and therefore different I(X;Y). Some choices waste the channel.
Example intuition: if you only ever transmit one symbol, then H(X) = 0 and I(X;Y) = 0 — the channel delivers nothing, no matter how clean it is.
Capacity captures the best you can do:
C = max_{p(x)} I(X;Y)
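To make this concrete, here is a small sketch (Python with numpy; the channel matrix and the input distributions are illustrative choices, and the function name is mine) that computes I(X;Y) directly from p(x) and the transition law, showing that different input choices deliver different information through the same channel:

```python
import numpy as np

def mutual_information(px, W):
    """I(X;Y) in bits, for input distribution px and channel matrix
    W[x][y] = p(y|x), via I = sum_{x,y} p(x) p(y|x) log2(p(y|x)/p(y))."""
    px, W = np.asarray(px, float), np.asarray(W, float)
    py = px @ W                        # output marginal p(y) = sum_x p(x) p(y|x)
    I = 0.0
    for x in range(len(px)):
        for y in range(len(py)):
            pxy = px[x] * W[x, y]      # joint p(x, y) = p(x) p(y|x)
            if pxy > 0:                # convention: 0 * log(...) = 0
                I += pxy * np.log2(W[x, y] / py[y])
    return I

# BSC(0.11): the channel law is fixed; the input distribution is ours to choose
W = np.array([[0.89, 0.11],
              [0.11, 0.89]])
print(mutual_information([0.5, 0.5], W))  # ≈ 0.500: uniform input attains C = 1 - H2(0.11)
print(mutual_information([0.9, 0.1], W))  # ≈ 0.197: a skewed input wastes the channel
```

Capacity is then the result of searching over all valid `px` for the largest value this function can return.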
Start from:
I(X;Y) = ∑ₓ∑ᵧ p(x,y) log₂( p(x,y) / (p(x)p(y)) )
Substitute p(x,y) = p(x)p(y|x):
I(X;Y)
= ∑ₓ∑ᵧ p(x)p(y|x) log₂( p(x)p(y|x) / (p(x)p(y)) )
= ∑ₓ∑ᵧ p(x)p(y|x) log₂( p(y|x) / p(y) )
So:
I(X;Y) = 𝔼[ log₂( p(Y|X) / p(Y) ) ]
This is a clean “information gain” form: how surprising Y is under the unconditional distribution p(y), compared to the conditional distribution p(y|x) when you know the input.
Using identities:
I(X;Y) = H(Y) − H(Y|X)
For a fixed channel law:
H(Y|X) = ∑ₓ p(x) H(Y|X=x)
where H(Y|X=x) depends only on the row p(y|x) in the channel transition matrix.
So changing p(x) changes H(Y) (through the output marginal p(y)) while only re-weighting the fixed row entropies H(Y|X=x).
Capacity is the best tradeoff: produce outputs that are as distinguishable as possible while accounting for noise.
Many important channels are symmetric: roughly, every input symbol “looks the same” from the channel’s perspective up to a permutation of outputs. For such channels, the uniform input distribution achieves capacity.
This matters because it turns an optimization into a direct formula.
BSC(p): input bit flips with probability p.
Because both inputs are treated identically by noise, uniform p(x=0)=p(x=1)=1/2 is optimal.
We’ll compute C explicitly later.
It’s important to separate two ideas: the mutual information I(X;Y) achieved by one particular input distribution, and the capacity C obtained by maximizing over all input distributions.
A channel might have large capacity, but you can still operate far below it if your p(x) is poorly chosen.
Capacity is not merely a number you compute from p(y|x). It corresponds to a coding statement.
To appreciate why this is nontrivial, notice: I(X;Y) is defined for a single channel use, while reliable communication is a claim about codes over many uses, with error probability driven toward zero.
This sets up the next core mechanism: Shannon’s coding theorem, with achievability and converse.
Shannon’s theorem has two halves: achievability (every rate R < C can be realized with vanishing error) and the converse (no rate R > C can be).
This is why capacity is a knife-edge.
We’ll avoid full measure-theoretic proofs, but we will outline the logic, because the structure of the argument is as important as the formulas.
A length-n code consists of: an encoder mapping each message m ∈ {1, …, M} to a codeword xⁿ(m), and a decoder mapping each received sequence yⁿ to an estimate m̂.
The channel is memoryless:
p(yⁿ|xⁿ) = ∏_{i=1}^n p(yᵢ|xᵢ)
Error probability (average over messages):
Pₑ = (1/M) ∑_{m=1}^M Pr[m̂ ≠ m | m sent]
Rate:
R = (1/n) log₂ M
For large n, sequences behave “typically”: with high probability an i.i.d. sequence lands in a typical set of roughly 2^{nH} elements, each with probability close to 2^{−nH}.
This is the engine behind compression and also channel coding.
In channel coding, the idea is: place about 2^{nR} codewords so that their output “clouds” (the conditionally typical sets, each of size about 2^{nH(Y|X)}) barely overlap within the roughly 2^{nH(Y)} typical outputs.
Then decoding can pick the unique codeword whose cloud contains the received output.
Shannon’s achievability proof uses random coding: generate all 2^{nR} codewords independently, with each symbol drawn i.i.d. from a chosen input distribution p(x).
Then show: if R < I(X;Y) (for that chosen p(x)), the expected error probability (averaged over the random codebook) goes to 0 as n → ∞.
Because the average over random codes is small, there must exist at least one deterministic code with small error.
Finally, maximize over p(x) to get rates up to C = max I(X;Y).
The random coding analysis typically yields a bound like:
Pₑ ≤ 2^{−n·E(R)}
for some positive error exponent E(R) when R < I(X;Y). The exact exponent requires more machinery, but the important message is: below mutual information, overlaps become exponentially rare.
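A tiny Monte Carlo sketch of the random-coding idea for a BSC (Python with numpy; the blocklength, rate, seed, and trial count are arbitrary choices of mine): draw a random codebook, decode by minimum Hamming distance, and estimate the error probability at a rate below capacity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, p = 15, 4, 0.11              # blocklength, message bits; R = 4/15 ≈ 0.27 < C ≈ 0.5
M = 2 ** k

# Random codebook: symbols i.i.d. uniform, the capacity-achieving input for a BSC
codebook = rng.integers(0, 2, size=(M, n))

errors, trials = 0, 2000
for _ in range(trials):
    m = rng.integers(M)                              # pick a message at random
    y = codebook[m] ^ (rng.random(n) < p)            # BSC flips each bit with prob. p
    m_hat = np.argmin(np.sum(codebook ^ y, axis=1))  # min-distance = ML decoding (p < 1/2)
    errors += int(m_hat != m)

print(f"estimated P_e at rate {k/n:.2f}: {errors / trials:.3f}")
```

At such a short blocklength the error probability is noticeable rather than vanishing; the theorem’s guarantee kicks in only as n grows, which is exactly the finite-blocklength caveat discussed below.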
The converse often starts from Fano’s inequality, relating decoding error to conditional entropy:
H(M|Yⁿ) ≤ 1 + Pₑ log₂ M
Rearrange to connect the message entropy H(M) = log₂ M to mutual information:
I(M;Yⁿ) = H(M) − H(M|Yⁿ)
≥ log₂ M − (1 + Pₑ log₂ M)
So if Pₑ is small, then I(M;Yⁿ) must be close to log₂ M.
Next, use data processing and memorylessness:
M → Xⁿ → Yⁿ
so:
I(M;Yⁿ) ≤ I(Xⁿ;Yⁿ)
and for a memoryless channel:
I(Xⁿ;Yⁿ) ≤ ∑_{i=1}^n I(Xᵢ;Yᵢ)
Finally, each I(Xᵢ;Yᵢ) ≤ C (by definition of capacity as a max over input distributions). Hence:
I(Xⁿ;Yⁿ) ≤ nC
Combine:
log₂ M ≈ I(M;Yⁿ) ≤ nC
Divide by n:
R = (1/n)log₂ M ≤ C
This shows: if you want vanishing error, your rate cannot exceed capacity.
Assume Pₑ → 0. Then from Fano:
H(M|Yⁿ) ≤ 1 + Pₑ log₂ M
Compute:
I(M;Yⁿ)
= H(M) − H(M|Yⁿ)
≥ log₂ M − 1 − Pₑ log₂ M
= (1 − Pₑ) log₂ M − 1
But also:
I(M;Yⁿ) ≤ I(Xⁿ;Yⁿ)
For a memoryless channel, one can show (using chain rule and conditioning reduces entropy):
I(Xⁿ;Yⁿ)
= ∑_{i=1}^n I(Xⁿ;Yᵢ|Y^{i−1})
≤ ∑_{i=1}^n I(Xᵢ;Yᵢ)
≤ ∑_{i=1}^n C
= nC
Thus:
(1 − Pₑ) log₂ M − 1 ≤ nC
Divide by n and let n → ∞, Pₑ → 0:
R ≤ C
That’s the converse.
The theorem is asymptotic: it promises vanishing error only as the blocklength n → ∞; at any finite n there is a rate backoff and a nonzero error floor.
In practice, modern codes (LDPC, Turbo, Polar) approach capacity with manageable complexity, but there is still a gap due to finite blocklength and decoding limits.
This is why “capacity” is simultaneously: an information-theoretic quantity (a maximum of mutual information), an operational limit (the supremum of reliably achievable rates), and a practical benchmark that real systems are measured against.
Next we’ll compute C for canonical channels to make the definition concrete.
BSC(p): 𝒳 = {0,1}, 𝒴 = {0,1}
Let X ∼ Bern(1/2). Then Y is also Bern(1/2) by symmetry.
Compute:
I(X;Y) = H(Y) − H(Y|X)
First:
H(Y) = 1 (bit)
Next, Y|X=x is a flip with probability p, so:
H(Y|X=x) = H₂(p)
where H₂(p) = −p log₂ p − (1−p) log₂(1−p).
Therefore:
H(Y|X) = ∑ₓ p(x)H(Y|X=x) = H₂(p)
So:
I(X;Y) = 1 − H₂(p)
For a BSC, uniform input is optimal, hence:
C = 1 − H₂(p)
Interpretation: at p = 0 the channel is noiseless and C = 1; at p = 1/2 the output is independent of the input and C = 0; in between, each channel use loses H₂(p) bits to noise.
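The closed form is easy to wrap in a helper (a Python sketch; the function names are mine), with the edge cases p = 0 and p = 1 handled explicitly via the 0·log₂0 = 0 convention:

```python
import math

def binary_entropy(p):
    """H2(p) in bits, with the convention 0 * log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """C = 1 - H2(p) for a binary symmetric channel with crossover p."""
    return 1.0 - binary_entropy(p)

print(bsc_capacity(0.0))    # 1.0: noiseless channel
print(bsc_capacity(0.5))    # 0.0: output independent of input
print(bsc_capacity(0.11))   # ≈ 0.5
```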
BEC(ε): output is either the input bit or an erasure symbol ⊥.
For uniform input: with probability 1 − ε the output reveals X exactly, and with probability ε it reveals nothing, so H(X|Y) = ε and I(X;Y) = 1 − ε.
So:
C = 1 − ε
This clean form is why BEC is a favorite theoretical sandbox.
A standard real-valued AWGN channel use:
Y = X + Z
where Z ∼ 𝒩(0, σ²) independent of X.
To avoid infinite capacity (you could send arbitrarily large amplitudes), impose an average power constraint:
𝔼[X²] ≤ P
The capacity (per real channel use) is:
C = (1/2) log₂(1 + P/σ²)
If you model a continuous-time bandlimited channel of bandwidth B (Hz) with noise spectral density N₀/2, the famous form is:
C = B log₂(1 + SNR) bits/s
where SNR = P/(N₀B) (the exact normalization depends on conventions).
The maximizing input distribution under a variance constraint is Gaussian:
X ∼ 𝒩(0, P)
Reason (high-level): among all distributions with fixed variance, the Gaussian has maximum differential entropy. Since:
I(X;Y) = h(Y) − h(Y|X)
and h(Y|X) = h(Z) is fixed, maximizing I is maximizing h(Y). With Y = X + Z, variance(Y) = P + σ². The Gaussian achieves the largest h(Y) given that variance.
Engineers compare practical schemes to capacity: the gap to capacity at a target error rate is a standard figure of merit. If a design is far from capacity, you ask where the loss comes from: the modulation, the code, or the decoder.
Modern codes often show a sharp transition: as SNR increases past a threshold, BER suddenly drops dramatically. This mirrors the capacity boundary (though shifted by finite-length effects).
Real channels have constraints beyond average power: peak power, bandwidth, fading, and interference, for example.
Capacity generalizes, but the core idea remains: maximize mutual information subject to constraints.
| Channel | Model | Constraint | Capacity C | Optimal p(x) |
|---|---|---|---|---|
| BSC(p) | bit flips | none | 1 − H₂(p) | uniform |
| BEC(ε) | erasures | none | 1 − ε | uniform |
| AWGN | Y=X+Z, Z∼𝒩(0,σ²) | 𝔼[X²]≤P | (1/2)log₂(1+P/σ²) | Gaussian |
In every case: the channel law is fixed, and capacity comes from maximizing I(X;Y) over input distributions subject to the channel’s constraints.
Shannon’s theorem then turns that maximum into an operational promise: the channel can be made to behave like an almost noiseless bit pipe of rate C, if you code across long blocks.
Channel: BSC with crossover probability p. Find C and sanity-check p=0 and p=1/2.
Use the capacity formula for a DMC:
C = max_{p(x)} I(X;Y).
For a BSC, by symmetry, the maximizing input is uniform:
Pr[X=0]=Pr[X=1]=1/2.
Compute I(X;Y)=H(Y)−H(Y|X).
Find H(Y): with uniform input and symmetric flipping,
Pr[Y=0]=Pr[Y=1]=1/2 ⇒ H(Y)=1.
Find H(Y|X): for any x, Y is flipped with probability p, so
H(Y|X=x)=H₂(p).
Because this does not depend on x,
H(Y|X)=∑ₓ p(x)H(Y|X=x)=H₂(p).
Therefore
I(X;Y)=1−H₂(p).
Conclude capacity:
C = 1 − H₂(p) bits/use.
Edge checks: at p = 0, H₂(0) = 0 and C = 1 (noiseless); at p = 1/2, H₂(1/2) = 1 and C = 0 (the output is independent of the input).
Insight: Capacity falls as noise increases, but not linearly; the binary entropy H₂(p) captures how uncertainty from flips eats away at the 1 bit/use available in a noiseless binary channel.
Real AWGN channel: Y = X + Z with Z ∼ 𝒩(0, σ²). Constraint: 𝔼[X²] ≤ P. Derive C = (1/2)log₂(1+P/σ²).
Start from mutual information:
I(X;Y)=h(Y)−h(Y|X).
Given Y=X+Z and Z independent of X,
Y|X=x is just x+Z, so its differential entropy equals that of Z:
h(Y|X)=h(Z).
So maximizing I(X;Y) over inputs subject to 𝔼[X²]≤P is equivalent to maximizing h(Y) because h(Z) is fixed:
max I(X;Y) = max h(Y) − h(Z).
Compute variance of Y:
Var(Y)=Var(X+Z)=Var(X)+Var(Z) (independence)
= 𝔼[X²] + σ² ≤ P + σ².
Use the maximum-entropy fact: among all real-valued distributions with a fixed variance, the Gaussian has the largest differential entropy.
So choose X ∼ 𝒩(0,P). Then Y is Gaussian with variance P+σ²:
Y ∼ 𝒩(0, P+σ²).
Compute differential entropies (in bits):
h(𝒩(0, s²)) = (1/2) log₂(2πe s²).
Thus:
h(Y) = (1/2)log₂(2πe(P+σ²))
h(Z) = (1/2)log₂(2πeσ²).
Subtract:
I(X;Y) = h(Y) − h(Z)
= (1/2)log₂(2πe(P+σ²)) − (1/2)log₂(2πeσ²)
= (1/2)log₂((P+σ²)/σ²)
= (1/2)log₂(1 + P/σ²).
Because Gaussian X is optimal under the power constraint, capacity is:
C = (1/2)log₂(1 + P/σ²) bits per real channel use.
Insight: For AWGN, noise contributes a fixed entropy h(Z). The only lever is how “spread out” the output Y can be under the power constraint, and Gaussian signaling maximizes that spread in the entropy sense.
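The closed form is one line of code; here is a sketch (Python; function name mine, sample size arbitrary) together with a Monte Carlo sanity check that a Gaussian input really produces an output of variance P + σ²:

```python
import math
import random

def awgn_capacity(P, sigma2):
    """C = 0.5 * log2(1 + P / sigma2), bits per real channel use."""
    return 0.5 * math.log2(1 + P / sigma2)

print(awgn_capacity(9, 1))    # ≈ 1.661 bits/use

# Sanity check: with X ~ N(0, P) and Z ~ N(0, sigma2), Var(Y) should be P + sigma2
P, sigma2 = 9.0, 1.0
random.seed(0)
ys = [random.gauss(0, math.sqrt(P)) + random.gauss(0, math.sqrt(sigma2))
      for _ in range(100_000)]
mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / len(ys)
print(var)                    # close to P + sigma2 = 10
```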
Let 𝒳={0,1}, 𝒴={0,1} with transition law:
Pr[Y=1|X=0]=0.1, Pr[Y=1|X=1]=0.6.
Compute I(X;Y) as a function of q=Pr[X=1], then (numerically) find the maximizing q and capacity.
Parameterize the input distribution by q:
Pr[X=1]=q, Pr[X=0]=1−q.
Compute the output probability of Y=1:
Pr[Y=1] = Pr[Y=1|X=0]Pr[X=0] + Pr[Y=1|X=1]Pr[X=1]
= 0.1(1−q) + 0.6q
= 0.1 + 0.5q.
Let r = 0.1 + 0.5q.
Compute H(Y):
H(Y)=H₂(r).
Compute H(Y|X):
H(Y|X)= (1−q)H₂(0.1) + qH₂(0.6).
Thus mutual information is:
I(q) = H₂(0.1+0.5q) − (1−q)H₂(0.1) − qH₂(0.6).
Compute constants:
H₂(0.1) ≈ −0.1log₂0.1 −0.9log₂0.9 ≈ 0.469.
H₂(0.6) ≈ 0.971.
Now evaluate I(q) at a few q values to locate the maximum:
At q = 0.5: r = 0.35, H(Y) = H₂(0.35) ≈ 0.934, and
H(Y|X)=0.5·0.469 + 0.5·0.971=0.720, so
I(0.5)=0.934−0.720=0.214 bits/use.
At q = 0.4: r = 0.30, H(Y) = H₂(0.30) ≈ 0.881, and
H(Y|X)=0.6·0.469 + 0.4·0.971=0.670, so
I(0.4)=0.881−0.670=0.211.
At q = 0.6: r = 0.40, H(Y) = H₂(0.40) ≈ 0.971, and
H(Y|X)=0.4·0.469 + 0.6·0.971=0.770, so
I(0.6)=0.971−0.770=0.201.
The maximum from these samples is near q≈0.5 with I≈0.214 bits/use.
So capacity is approximately:
C ≈ 0.214 bits/use, achieved near q≈0.5.
Insight: Unlike symmetric channels, an asymmetric channel may require a non-uniform input to maximize I(X;Y). Here, extremes q=0 or 1 convey zero information because the output distribution becomes fixed; mixing inputs makes outputs more distinguishable.
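The coarse samples can be refined with a grid search (a Python sketch of I(q) from the derivation above; the step size is an arbitrary choice), which locates the maximum slightly below q = 0.5:

```python
import math

def H2(p):
    """Binary entropy in bits (0 * log2(0) := 0)."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def I(q):
    """I(X;Y) for the channel Pr[Y=1|X=0]=0.1, Pr[Y=1|X=1]=0.6, with Pr[X=1]=q."""
    return H2(0.1 + 0.5 * q) - (1 - q) * H2(0.1) - q * H2(0.6)

qs = [i / 1000 for i in range(1001)]     # grid over q in [0, 1]
q_star = max(qs, key=I)
print(q_star, I(q_star))   # maximum near q ≈ 0.47, I ≈ 0.215 bits/use
```

Note that I(0) = I(1) = 0, matching the insight above: a deterministic input conveys nothing.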
A channel is specified by its transition law p(y|x); it is a probabilistic mapping from inputs to outputs.
Channel capacity C is the maximum reliable rate (bits per channel use): C = max_{p(x)} I(X;Y) for a DMC.
Mutual information can be written as I(X;Y)=H(Y)−H(Y|X)=𝔼[log₂(p(Y|X)/p(Y))], making its “information flow” meaning explicit.
Shannon’s noisy-channel coding theorem: if R < C, there exist codes with Pₑ → 0; if R > C, reliable communication is impossible.
Achievability is shown via random coding and typicality/packing arguments; the converse uses Fano’s inequality plus information inequalities to upper-bound rate by C.
For symmetric channels (e.g., BSC, BEC), uniform inputs achieve capacity; symmetry removes the need for optimization.
Canonical capacities: BSC(p): 1−H₂(p); BEC(ε): 1−ε; AWGN (power P, noise σ²): (1/2)log₂(1+P/σ²) per real use.
Confusing I(X;Y) for a particular input distribution with capacity C; capacity requires maximizing over p(x) (and respecting constraints like power).
Thinking capacity is a guarantee at finite blocklength: Shannon’s theorem is asymptotic; finite-length systems need overhead and have nonzero error.
Forgetting constraints in continuous channels: without an average/peak power constraint, an AWGN channel would have unbounded capacity.
Misinterpreting “bits per channel use” vs “bits per second”: converting requires accounting for symbol rate, bandwidth, and the channel model normalization.
BEC(ε): Show that C = 1 − ε bits/use by computing I(X;Y) for uniform X and arguing optimality.
Hint: Use I(X;Y)=H(X)−H(X|Y). When Y is not erased, X is known exactly; when erased, X remains uniform.
Let X ∼ Bern(1/2) so H(X)=1. For BEC(ε), Y ∈ {0,1,⊥}. If Y∈{0,1}, then X is determined ⇒ H(X|Y∈{0,1})=0. If Y=⊥ (probability ε), then X is still Bern(1/2) ⇒ H(X|Y=⊥)=1. Thus:
H(X|Y)= (1−ε)·0 + ε·1 = ε.
So I(X;Y)=H(X)−H(X|Y)=1−ε. The BEC is symmetric, so uniform X is capacity-achieving ⇒ C=1−ε.
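A quick empirical check of this result (Python; sample size and seed are arbitrary choices): estimate I(X;Y) for a BEC from simulated (x, y) pairs via the plug-in formula and compare to 1 − ε.

```python
import math
import random
from collections import Counter

random.seed(1)
eps, N = 0.3, 200_000

# Simulate N uses of a BEC(0.3) with uniform input
pairs = []
for _ in range(N):
    x = random.randint(0, 1)
    y = '⊥' if random.random() < eps else x    # erase with probability eps
    pairs.append((x, y))

# Plug-in estimate of I(X;Y) from empirical counts
pxy = Counter(pairs)
px = Counter(x for x, _ in pairs)
py = Counter(y for _, y in pairs)
I_hat = sum((c / N) * math.log2((c / N) / ((px[x] / N) * (py[y] / N)))
            for (x, y), c in pxy.items())
print(I_hat)   # close to 1 - eps = 0.7
```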
Consider a BSC with p=0.11. Compute C numerically (in bits/use).
Hint: Use C = 1 − H₂(p). Compute H₂(0.11) = −0.11log₂0.11 −0.89log₂0.89.
H₂(0.11)=−0.11log₂(0.11)−0.89log₂(0.89).
log₂(0.11)≈−3.184, so −0.11log₂(0.11)≈0.350.
log₂(0.89)≈−0.168, so −0.89log₂(0.89)≈0.150.
Thus H₂(0.11)≈0.500 and C≈1−0.500=0.500 bits/use (about 0.50).
Real AWGN: Y = X + Z, Z∼𝒩(0,1). Power constraint 𝔼[X²]≤9. Compute capacity per real channel use and compare to the noiseless 1-bit BSC intuition.
Hint: Use C = (1/2)log₂(1 + P/σ²) with σ²=1, P=9.
C=(1/2)log₂(1+9/1)=(1/2)log₂(10).
log₂(10)≈3.322, so C≈(1/2)·3.322≈1.661 bits per real channel use.
This can exceed 1 bit/use because the AWGN input is not restricted to binary symbols; with higher SNR and continuous amplitudes, one channel use can convey more than one bit reliably (with appropriate coding and modulation), consistent with Shannon’s theorem.
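To see the contrast numerically, one can compute the mutual information of a binary antipodal input (±3, so 𝔼[X²] = 9) on this channel by numerically integrating h(Y) for the two-component Gaussian mixture (a sketch; the grid limits and step are arbitrary choices of mine):

```python
import math

def gauss(y, mu, s2):
    """Density of N(mu, s2) at y."""
    return math.exp(-(y - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

sigma2, a = 1.0, 3.0                   # noise variance; antipodal amplitude, E[X^2] = 9
h_Z = 0.5 * math.log2(2 * math.pi * math.e * sigma2)

# h(Y) for Y = X + Z with X uniform on {-a, +a}: Riemann sum over a wide grid
dy = 0.001
ys = [-10 + dy * i for i in range(20001)]
h_Y = 0.0
for y in ys:
    p = 0.5 * gauss(y, -a, sigma2) + 0.5 * gauss(y, a, sigma2)
    if p > 0:
        h_Y -= p * math.log2(p) * dy

I_binary = h_Y - h_Z
C_gauss = 0.5 * math.log2(1 + a * a / sigma2)
print(I_binary, C_gauss)   # binary input saturates near 1 bit; Gaussian input gives ≈ 1.661
```

At this SNR the two mixture components barely overlap, so the binary input delivers almost exactly its 1 bit while the Gaussian input exploits the continuous amplitude range.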