Optimal lossy compression. Tradeoff between bits and fidelity.
Quick unlock: significant prerequisite investment, but a simple final step. Verify prerequisites first.
Lossless compression asks: “How many bits do I need to reproduce the data exactly?” Rate–distortion theory asks the harder—and more practical—question: “If I’m allowed to be slightly wrong, what is the absolute minimum number of bits I need?”
Rate–distortion theory characterizes the optimal tradeoff between compression rate (bits/symbol) and allowed average distortion. For a source $X$ with distortion measure $d(x,\hat x)$, the rate–distortion function is
$$R(D) = \min_{p(\hat x \mid x)\,:\,\mathbb{E}[d(X,\hat X)] \le D} I(X;\hat X).$$
It is nonincreasing and convex in $D$, with $R(0)=H(X)$ under a “zero-distortion implies exact reconstruction” condition. Operationally, $R(D)$ is the best achievable asymptotic rate for block lossy coding, and the minimizing conditional distribution $p(\hat x\mid x)$ is the “test channel” that describes the optimal stochastic reconstruction mechanism.
Rate–distortion theory is the part of information theory that formalizes optimal lossy compression. “Lossy” means the decoder’s reconstruction $\hat x$ need not equal the original source symbol $x$; instead we control a fidelity criterion.
1) A source $X$ with distribution $p(x)$. Typically we assume an i.i.d. source sequence $X_1,\dots,X_n$ where the $X_i$ are independent and identically distributed like $X$.
2) A distortion measure (single-letter) $d:\mathcal{X}\times\hat{\mathcal{X}}\to[0,\infty)$ satisfying $d(x,\hat x)\ge 0$.
From this we define the per-symbol average distortion constraint
$$\mathbb{E}[d(X,\hat X)] \le D.$$
For blocks, a standard extension is
$$d(x^n,\hat x^n) = \frac{1}{n}\sum_{i=1}^n d(x_i,\hat x_i),$$
and we ask that $\mathbb{E}[d(X^n,\hat X^n)] \le D$.
3) A rate $R$ in bits/symbol. If a block encoder maps $x^n$ to an index $m \in \{1,\dots,2^{nR}\}$, then $R$ measures the number of bits available per source symbol.
Given an allowed distortion level $D$, what is the smallest achievable rate $R$?
This minimum is the rate–distortion function $R(D)$.
For a discrete memoryless source (DMS) with single-letter distortion, Shannon’s rate–distortion theorem states
$$R(D) = \min_{p(\hat x \mid x)\,:\,\mathbb{E}[d(X,\hat X)] \le D} I(X;\hat X).$$
The minimizer $p^*(\hat x\mid x)$ is called a test channel. It is not a physical channel; it is a mathematical object describing the optimal way (in an information-theoretic sense) to correlate the reconstruction with the source under the distortion constraint.
Even without interactivity, it helps to visualize $R(D)$ as a curve:
```
Rate R(D) (bits/symbol)
 ^
 |\
 | \
 |  \
 |   \
 |    \
 |     \
 |      \
 +--------------------> Distortion D
```
The curve decreases until it hits zero at $D_{\max}$. $R(D)$ is a minimum over test channels $p(\hat x\mid x)$. Mutual information $I(X;\hat X)$ measures how many bits per symbol the reconstruction “knows” about the source. Allowing more distortion lets $\hat X$ be less correlated with $X$, reducing $I(X;\hat X)$.
This node builds on several ideas. You said you already know entropy and mutual information; that’s necessary but not sufficient for a smooth ride at difficulty 5.
| Topic | What you need here | Quick reminder |
|---|---|---|
| Entropy | Interpret $H(X)$ as the optimal lossless rate | $H(X) = -\sum_x p(x)\log_2 p(x)$ |
| Mutual information | Understand $I(X;\hat X)$ as “bits of dependence” | $I(X;\hat X) = H(\hat X) - H(\hat X\mid X)$ |
| i.i.d. block coding / typicality | Why coding theorems are asymptotic and why “per-symbol” rates emerge | Typical sequences occupy about $2^{nH(X)}$ outcomes |
| Conditional distributions | Comfort manipulating $p(\hat x\mid x)$ and $p(\hat x)$ | $p(\hat x)=\sum_x p(x)\,p(\hat x\mid x)$ |
| Basic convex optimization | Recognize convexity, constraints, and Lagrange multipliers/KKT | Minimize a convex objective subject to inequality constraints |
1) Discrete vs continuous: for continuous sources, $H(X)$ is replaced by differential entropy $h(X)$, which is not directly a coding rate. Operational statements often involve mutual information directly and require additional regularity.
2) Distortion definition matters: average distortion can be defined in expectation, with high probability, or as excess-distortion probability. Standard Shannon RD uses expected per-letter distortion; variants exist.
3) $R(0)=H(X)$ is conditional: as noted above, it’s true in the common case where $d(x,\hat x)=0$ only when $\hat x = x$.
If any of these prerequisites feel shaky, it’s worth revisiting them before you attempt the proofs or algorithms (e.g., Blahut–Arimoto) in the later sections.
Before optimizing anything, we must decide what “good reconstruction” means. Rate–distortion theory forces you to be explicit: you provide a distortion measure $d(x,\hat x)$ and a target distortion level $D$.
A single-letter distortion assigns a penalty $d(x,\hat x)$ to each symbol pair $(x,\hat x)$. For sequences, we typically average:
$$d(x^n,\hat x^n) = \frac{1}{n}\sum_{i=1}^n d(x_i,\hat x_i).$$
The average matters because it makes the constraint scale nicely with $n$ and matches i.i.d. assumptions.
We then require (one common formulation)
$$\mathbb{E}[d(X^n,\hat X^n)] \le D.$$
A subtle point: This is an expected distortion constraint; it allows rare large-distortion events as long as they don’t contribute too much to the average. Another common alternative is an excess-distortion probability constraint $P\big(d(X^n,\hat X^n) > D\big) \to 0$.
1) Hamming distortion for discrete alphabets:
$$d(x,\hat x) = \mathbf{1}\{x \ne \hat x\}.$$
Then $\mathbb{E}[d(X,\hat X)] = P(X \ne \hat X)$, i.e. average distortion equals symbol error rate.
2) Squared error for real-valued sources:
$$d(x,\hat x) = (x - \hat x)^2.$$
Then $\mathbb{E}[(X-\hat X)^2] \le D$ is a mean-squared error (MSE) constraint.
3) Weighted distortions: sometimes different symbols are more costly to distort.
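To make these definitions concrete, here is a minimal sketch of single-letter and block-averaged distortion (function names are illustrative, not standard):

```python
def hamming(x, xhat):
    """Hamming distortion: 1 if the symbols differ, else 0."""
    return float(x != xhat)

def squared_error(x, xhat):
    """Squared-error distortion for real-valued sources."""
    return (x - xhat) ** 2

def block_distortion(xs, xhats, d):
    """Per-letter average: d(x^n, xhat^n) = (1/n) * sum_i d(x_i, xhat_i)."""
    return sum(d(x, xh) for x, xh in zip(xs, xhats)) / len(xs)

# One mismatch in four symbols gives average Hamming distortion 0.25.
print(block_distortion([0, 1, 1, 0], [0, 1, 0, 0], hamming))  # 0.25
```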
Two distortion levels are especially important: $D = 0$ and $D_{\max}$, the smallest distortion achievable at zero rate.
For Hamming distortion, a zero-rate decoder should output the most probable symbol, so $D_{\max} = 1 - \max_x p(x)$.
This gives a boundary condition you should remember: $R(D) = 0$ for all $D \ge D_{\max}$.
In lossy compression, the reconstruction might be “wrong” but perceptually fine (audio/image), or wrong but still useful (downstream ML tasks). Rate–distortion cleanly separates:
That separation is powerful: once $d$ and $D$ are fixed, the rest is mathematics.
If your distortion is Hamming, the RD problem resembles “coding with errors.” But it is not channel coding: in RD, the encoder sees the entire block $x^n$ and chooses a description; the distortion constraint is on reconstruction quality, not on a noisy channel outcome.
This perspective will matter when we introduce the test channel: the “noise” in $p(\hat x\mid x)$ is controlled by us to meet distortion, not imposed by nature.
Now we connect the operational question (“minimum bits”) to an information quantity.
For a fixed source distribution $p(x)$ and distortion measure $d$, define
$$R(D) = \min_{p(\hat x\mid x)\,:\,\mathbb{E}[d(X,\hat X)]\le D} I(X;\hat X).$$
This is sometimes called the informational rate–distortion function (not to be confused with the Shannon lower bound approximation used in continuous cases). The key point: we are minimizing mutual information over all conditional distributions that satisfy the distortion constraint.
Given $p(x)$ and a candidate test channel $p(\hat x\mid x)$:
1) Induced joint: $p(x,\hat x) = p(x)\,p(\hat x\mid x)$
2) Induced marginal: $p(\hat x) = \sum_x p(x)\,p(\hat x\mid x)$
3) Mutual information: $I(X;\hat X) = \sum_{x,\hat x} p(x,\hat x)\log_2\dfrac{p(\hat x\mid x)}{p(\hat x)}$
4) Expected distortion: $\mathbb{E}[d(X,\hat X)] = \sum_{x,\hat x} p(x,\hat x)\,d(x,\hat x)$
We choose $p(\hat x\mid x)$ to minimize the mutual information, while keeping the expected distortion ≤ D.
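These four quantities are straightforward to compute numerically; a small sketch (array names are illustrative):

```python
import numpy as np

def rd_quantities(p_x, W, d):
    """Given source p(x), test channel W[x, xhat] = p(xhat|x), and a
    distortion matrix d[x, xhat], return (I(X;Xhat) in bits, E[d])."""
    joint = p_x[:, None] * W                  # 1) induced joint p(x, xhat)
    q = joint.sum(axis=0)                     # 2) induced marginal p(xhat)
    mask = joint > 0
    I = float(np.sum(joint[mask] * np.log2((W / q[None, :])[mask])))  # 3)
    Ed = float(np.sum(joint * d))             # 4) expected distortion
    return I, Ed

# Example (illustrative): fair bit through a BSC(0.1) test channel, Hamming d.
p_x = np.array([0.5, 0.5])
W = np.array([[0.9, 0.1], [0.1, 0.9]])
d = np.array([[0.0, 1.0], [1.0, 0.0]])
I, Ed = rd_quantities(p_x, W, d)
print(I, Ed)  # I ≈ 0.531 bits (= 1 - H2(0.1)), E[d] = 0.1
```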
A rough operational interpretation: producing a reconstruction that shares $I(X;\hat X)$ bits of information with the source requires sending about $I(X;\hat X)$ bits per symbol.
RD theory makes this precise: in the limit of large blocklength, the minimum description rate equals the minimum mutual information.
1) Monotonicity: if you allow more distortion, you never need more rate.
If $D_1 \le D_2$, then the feasible set at $D_2$ contains the feasible set at $D_1$, so $R(D_2) \le R(D_1)$.
2) Convexity: $R(D)$ is convex in $D$.
Intuition: you can time-share between two codes (or two test channels) achieving distortions $D_1$ and $D_2$. Using them on fractions $\theta$ and $1-\theta$ of the symbols yields average distortion $\theta D_1 + (1-\theta)D_2$ and average rate $\theta R(D_1) + (1-\theta)R(D_2)$. Optimality then implies
$$R\big(\theta D_1 + (1-\theta)D_2\big) \le \theta R(D_1) + (1-\theta)R(D_2).$$
That inequality is exactly convexity.
3) Boundary values: $R(D) = 0$ for $D \ge D_{\max}$, and $R(0) = H(X)$ when $d(x,\hat x)=0$ forces $\hat x = x$.
Why $R(0)=H(X)$ under that condition: zero distortion then forces $\hat X = X$, so $I(X;\hat X) = I(X;X) = H(X)$.
If instead multiple reconstructions have zero distortion for a given $x$, then $\hat X$ might be a function of $X$ that preserves less information than $X$ itself, and $H(\hat X)$ (hence $I(X;\hat X)$) can be smaller.
We often solve constrained problems by introducing a multiplier $\beta \ge 0$:
$$\min_{p(\hat x\mid x)} \; I(X;\hat X) + \beta\,\mathbb{E}[d(X,\hat X)].$$
Then, for each $\beta$, the optimizer corresponds to a point on the RD curve. (Formally, $\beta$ parameterizes supporting lines to the convex function $R(D)$.)
This is where convex optimization enters: $I(X;\hat X)$ is convex in $p(\hat x\mid x)$ for fixed $p(x)$, and the distortion constraint is linear in $p(\hat x\mid x)$, making the problem a convex program.
At optimum, the conditional often has an exponential-tilt form (for discrete alphabets):
$$p(\hat x\mid x) \propto p(\hat x)\,e^{-\beta d(x,\hat x)}.$$
More explicitly,
$$p(\hat x\mid x) = \frac{p(\hat x)\,e^{-\beta d(x,\hat x)}}{\sum_{\hat x'} p(\hat x')\,e^{-\beta d(x,\hat x')}}.$$
This is not magic; it is the KKT optimality condition for minimizing mutual information subject to expected distortion. It is also the basis for the Blahut–Arimoto algorithm (iterative computation of $R(D)$).
Pause here and internalize the meaning: to compress optimally, you behave as if the reconstruction is drawn from a Gibbs distribution where reconstructions with smaller distortion are exponentially preferred, but also modulated by a global prior that must self-consistently match the induced marginal.
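The self-consistency between $p(\hat x\mid x)$ and $p(\hat x)$ suggests an alternating scheme. The sketch below is a minimal Blahut–Arimoto loop (function and variable names are illustrative) that computes one $(R, D)$ point for a given $\beta$:

```python
import numpy as np

def blahut_arimoto(p_x, d, beta, iters=200):
    """Minimal Blahut-Arimoto sketch: alternate the Gibbs conditional
    p(xhat|x) ∝ q(xhat) exp(-beta * d(x, xhat)) with the induced marginal
    q(xhat) = sum_x p(x) p(xhat|x). Returns (rate in bits, expected distortion)."""
    n_xhat = d.shape[1]
    q = np.full(n_xhat, 1.0 / n_xhat)       # initial reproduction marginal
    for _ in range(iters):
        W = q[None, :] * np.exp(-beta * d)  # unnormalized Gibbs conditional
        W /= W.sum(axis=1, keepdims=True)   # normalize each row p(.|x)
        q = p_x @ W                         # marginal consistency update
    joint = p_x[:, None] * W
    mask = joint > 0
    rate = float(np.sum(joint[mask] * np.log2((W / q[None, :])[mask])))
    dist = float(np.sum(joint * d))
    return rate, dist

# Fair bit, Hamming distortion: compare to the closed form R(D) = 1 - H2(D).
rate, dist = blahut_arimoto(np.array([0.5, 0.5]),
                            np.array([[0.0, 1.0], [1.0, 0.0]]), beta=2.0)
print(rate, dist)
```

For this symmetric case the fixed point is a BSC with crossover $e^{-\beta}/(1+e^{-\beta})$, so the loop converges essentially immediately; for skewed sources the marginal update does real work.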
The single-letter formula is elegant, but why does it equal the true minimum code rate? Rate–distortion theory has an operational meaning: it predicts what block codes can and cannot do as $n \to \infty$.
Fix a DMS $p(x)$ and single-letter distortion $d$. For any distortion level $D$:
1) Achievability: for any rate $R > R(D)$, there exist block codes of rate at most $R$ whose expected distortion approaches $D$ as $n \to \infty$.
2) Converse: any sequence of codes with expected distortion at most $D$ must have rate at least $R(D)$.
So $R(D)$ is the sharp threshold.
Lossy compression gains power from coding across many symbols: a codebook of reproduction sequences covers the set of typical source sequences far more efficiently than any per-symbol quantizer can.
This “vector quantization” viewpoint is one way to understand why $R(D)$ is often lower than what naive symbol-by-symbol quantization would suggest.
Given an optimal test channel $p^*(\hat x\mid x)$ achieving $R(D)$:
1) Compute the induced marginal $p^*(\hat x)$.
2) Randomly generate a codebook of $2^{nR}$ reproduction sequences $\hat x^n(1),\dots,\hat x^n(2^{nR})$ i.i.d. from $p^*(\hat x)$.
3) Given a source block $x^n$, pick an index $m$ such that $\hat x^n(m)$ is “jointly typical” with $x^n$ with respect to $p^*(x,\hat x)$ (or approximately minimizes distortion).
4) Send $m$ ($nR$ bits); decoder outputs $\hat x^n(m)$.
The probability that no suitable codeword exists vanishes when $R > I(X;\hat X)$, which is the core reason mutual information appears.
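The random-coding argument can be sketched with a small Monte Carlo experiment for the fair-bit/Hamming case, where the optimal reproduction marginal is uniform (parameters and the minimum-distortion encoding rule below are illustrative simplifications of the typicality argument):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_code_distortion(n, R, trials=50):
    """Draw 2^{nR} codewords i.i.d. from the (uniform) reproduction marginal
    and encode each source block by minimum Hamming distortion; return the
    average achieved distortion over random source blocks."""
    M = int(2 ** (n * R))
    total = 0.0
    for _ in range(trials):
        codebook = rng.integers(0, 2, size=(M, n))   # random codebook
        x = rng.integers(0, 2, size=n)               # random source block
        total += np.mean(codebook != x, axis=1).min()
    return total / trials

# With R = 0.6 > R(0.1) ≈ 0.531, distortion should approach 0.1 as n grows;
# at n = 20 the finite-blocklength gap is still visible.
print(random_code_distortion(n=20, R=0.6))
```

Increasing `n` (and hence the codebook size $2^{nR}$) pushes the achieved distortion toward the RD-optimal value, at exponentially growing cost — exactly the asymptotic nature of the theorem.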
For any code with encoder output $M \in \{1,\dots,2^{nR}\}$:
$$nR \ge H(M) \ge I(X^n;\hat X^n).$$
Then, using memorylessness and single-letterization arguments (plus convexity and monotonicity of $R(\cdot)$), you get
$$I(X^n;\hat X^n) \ge \sum_{i=1}^n I(X_i;\hat X_i) \ge \sum_{i=1}^n R\big(\mathbb{E}[d(X_i,\hat X_i)]\big) \ge n\,R(D).$$
Thus $R$ must be at least $R(D)$.
1) Benchmarking codecs: Real codecs (JPEG, AAC, H.264) can be compared to the theoretical RD bound for simplified source models.
2) Learning and VAEs: The objective in a variational autoencoder resembles an RD Lagrangian: a reconstruction (distortion) term plus a rate-like KL term, schematically $\mathbb{E}[-\log p(x\mid z)] + \beta\,\mathrm{KL}\big(q(z\mid x)\,\|\,p(z)\big)$.
(Details depend on approximations, but the conceptual link is strong.)
3) Information bottleneck: Minimizing $I(X;T)$ subject to preserving information about a target $Y$ echoes RD structure, with “distortion” defined via prediction loss.
4) Control and estimation: RD-like bounds appear in quantized control and remote estimation (minimum communication to achieve an MSE).
```
R
|•   (low D, high rate)
| •
|   •
|      •
|         •
|             •
|________________•________ D
0               D*    D_max
```
At D=0: often R(0)=H(X).
At D>=D_max: R(D)=0.
The curve is decreasing and convex.
A slope (supporting line) corresponds to a Lagrange multiplier β. When you later compute RD points numerically, you are essentially moving along this curve by varying β.
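For a source where the curve is known in closed form (the fair bit under Hamming distortion), sweeping β traces the curve directly. The sketch below assumes the Gibbs/test-channel form, under which the optimal channel is a BSC with crossover $D = e^{-\beta}/(1+e^{-\beta})$:

```python
import math

def H2(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def rd_point(beta):
    """Parametric point on the fair-bit/Hamming RD curve: the Gibbs test
    channel is a BSC with crossover D = e^{-beta} / (1 + e^{-beta})."""
    D = math.exp(-beta) / (1 + math.exp(-beta))
    return D, 1.0 - H2(D)

# Larger beta penalizes distortion more: D shrinks, R grows.
for beta in (0.5, 1.0, 2.0, 4.0, 8.0):
    D, R = rd_point(beta)
    print(f"beta={beta}: D={D:.4f}, R={R:.4f}")
```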
With these connections in place, the rate–distortion formula should feel less like a definition and more like a deep theorem tying together coding, probability, and optimization.
Let X ~ Bernoulli(1/2) (a fair bit). Use Hamming distortion d(x,x̂)=1{x≠x̂}. Find the rate–distortion function R(D) for 0 ≤ D ≤ 1/2.
Step 1: Write the definition
R(D)=min_{p(x̂|x): E[d(X,X̂)]≤D} I(X;X̂).
Step 2: Parameterize the test channel
For binary X and binary X̂, a natural family is a binary symmetric channel (BSC) with crossover probability q:
P(X̂≠X)=q.
Then E[d(X,X̂)]=q.
So the distortion constraint is q ≤ D.
Step 3: Compute mutual information for a BSC with uniform input
If X ~ Bern(1/2) and X̂ is X flipped with prob q, then X̂ is also uniform.
Thus
I(X;X̂)=H(X̂)−H(X̂|X)=1 − H₂(q),
where H₂(q)=−q log₂ q − (1−q)log₂(1−q) is the binary entropy.
Step 4: Minimize over feasible q
We need to minimize 1 − H₂(q) subject to q ≤ D.
Since H₂(q) increases on [0,1/2], 1−H₂(q) decreases on [0,1/2].
So the minimum occurs at the largest feasible q, i.e. q=D.
Step 5: State R(D)
For 0 ≤ D ≤ 1/2,
R(D)=1 − H₂(D).
For D ≥ 1/2, you can output random bits independent of X to get distortion 1/2 at rate 0, so R(D)=0.
Insight: For a uniform bit, the optimal lossy code behaves like deliberately adding Bernoulli(D) noise: the reconstruction is a noisy version of the source. The “price” in bits is exactly the mutual information left after that noise, 1 − H₂(D).
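The closed form just derived is easy to evaluate; a short sketch (function names are illustrative):

```python
import math

def H2(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def R_fair_bit(D):
    """R(D) = 1 - H2(D) for a fair bit under Hamming distortion; 0 beyond D = 1/2."""
    return 0.0 if D >= 0.5 else 1.0 - H2(D)

print(R_fair_bit(0.0))            # 1.0
print(R_fair_bit(0.5))            # 0.0
print(round(R_fair_bit(0.1), 4))  # 0.531
```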
Let X ~ Bernoulli(p) with p=0.9. Use Hamming distortion. (i) Compute D_max (minimum distortion achievable at zero rate). (ii) Explain what this implies about R(D) for large D.
Step 1: Zero rate means X̂ is independent of X
With R=0, the decoder gets no information, so it must output a constant symbol (or random symbol) not depending on X.
Step 2: Minimize expected Hamming distortion over constant reconstructions
If decoder always outputs 1, distortion is P(X=0)=1−p.
If decoder always outputs 0, distortion is P(X=1)=p.
Pick the smaller:
D_max = min(p, 1−p).
Step 3: Plug in p=0.9
D_max = min(0.9, 0.1)=0.1.
Step 4: Interpret for the RD curve
If you allow distortion D ≥ 0.1, then the decoder can simply always output 1 and achieve expected distortion 0.1 with zero bits.
Therefore R(D)=0 for all D ≥ 0.1.
Insight: The right endpoint of the RD curve depends strongly on the source distribution and distortion. For skewed sources, you can get surprisingly low distortion without sending any bits—just guess the mode.
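The same zero-rate argument works for any discrete source under Hamming distortion; a one-line sketch:

```python
def d_max_hamming(p):
    """Zero-rate Hamming distortion: guess the most probable symbol,
    so the miss probability is 1 - max_x p(x)."""
    return 1.0 - max(p)

# Bernoulli(0.9) source from the example: always guessing 1 costs 0.1.
print(d_max_hamming([0.1, 0.9]))
# A less skewed ternary source has a larger zero-rate distortion.
print(d_max_hamming([0.2, 0.3, 0.5]))
```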
Show the key KKT/variational step that leads to p(x̂|x) ∝ p(x̂)e^{−β d(x,x̂)} for the optimization min I(X;X̂) subject to E[d] ≤ D and ∑_{x̂} p(x̂|x)=1 for each x.
Step 1: Write the objective in a convenient form
I(X;X̂)=∑_{x,x̂} p(x)p(x̂|x) log( p(x̂|x) / p(x̂) ),
where p(x̂)=∑_x p(x)p(x̂|x).
Step 2: Form the Lagrangian
Introduce β ≥ 0 for the distortion constraint and λ(x) for normalization:
L = I(X;X̂) + β( ∑_{x,x̂} p(x)p(x̂|x)d(x,x̂) − D ) + ∑_x λ(x)( ∑_{x̂} p(x̂|x) − 1 ).
Step 3: Take a directional derivative w.r.t. p(x̂|x)
Holding p(x) fixed, set derivative to zero at optimum.
A standard calculation yields the stationarity condition:
log p(x̂|x) − log p(x̂) + 1 + β d(x,x̂) + λ(x)/p(x) = 0.
Step 4: Solve for p(x̂|x)
Rearrange:
log p(x̂|x) = log p(x̂) − β d(x,x̂) − c(x),
where c(x) collects constants for each x.
Exponentiate:
p(x̂|x) = p(x̂) e^{−β d(x,x̂)} e^{−c(x)}.
Step 5: Enforce normalization to find e^{−c(x)}
We need ∑_{x̂} p(x̂|x)=1:
1 = e^{−c(x)} ∑_{x̂} p(x̂)e^{−β d(x,x̂)}
⇒ e^{−c(x)} = 1 / ∑_{x̂} p(x̂)e^{−β d(x,x̂)}.
Step 6: Final form
Therefore
p(x̂|x) = p(x̂)e^{−β d(x,x̂)} / ∑_{x̂′} p(x̂′)e^{−β d(x,x̂′)}.
Insight: This is the mathematical heart of RD computation: the optimal encoder–decoder behavior can be represented by a self-consistent “Gibbs” conditional. Varying β traces the RD curve, and iterating the consistency between p(x̂|x) and p(x̂) leads to Blahut–Arimoto.
A distortion measure d(x,x̂) and constraint E[d(X,X̂)]≤D define what “fidelity” means; everything else follows from that choice.
The rate–distortion function is the constrained minimum mutual information: R(D) = min_{p(x̂|x): E[d(X,X̂)]≤D} I(X;X̂).
R(D) is nonincreasing and convex in D; for D ≥ D_max, R(D)=0.
The endpoint R(0)=H(X) holds when zero distortion forces exact reconstruction (d(x,x̂)=0 ⇒ x̂=x).
The optimizer p(x̂|x) is called the test channel and often has the exponential tilt form p(x̂|x) ∝ p(x̂)e^{−β d(x,x̂)}.
Operationally, R(D) is the asymptotically optimal bits/symbol for block lossy coding; it is not merely a definition.
Time-sharing (mixing codes) explains convexity and provides a constructive intuition for intermediate operating points.
Many modern ML objectives (e.g., bottlenecked representations) resemble RD Lagrangians trading distortion against information.
Assuming R(0)=H(X) always, without checking the distortion measure’s “zero distortion implies equality” property.
Confusing the test channel p(x̂|x) with a real physical channel; it is an optimization variable describing an optimal reconstruction law.
Forgetting that RD results are asymptotic in blocklength n; scalar quantizers can be far from the RD bound.
Mixing discrete and continuous intuitions: replacing H with differential entropy h and treating it like a code rate leads to incorrect conclusions.
Let X ~ Bernoulli(1/2) with Hamming distortion. Compute R(D) at D=0, D=0.1, and D=0.5.
Hint: Use R(D)=1−H₂(D) for 0≤D≤1/2, and R(D)=0 for D≥1/2. Recall H₂(0)=0 and H₂(1/2)=1.
D=0: R(0)=1−H₂(0)=1−0=1 bit/symbol.
D=0.1: R(0.1)=1−H₂(0.1) ≈ 1−0.468995 ≈ 0.5310 bits/symbol.
D=0.5: R(0.5)=0 (since 1−H₂(0.5)=0 and also D≥1/2 implies zero rate).
For a general discrete source X with distribution p(x) and Hamming distortion d(x,x̂)=1{x≠x̂}, compute D_max (the distortion achievable at zero rate).
Hint: At zero rate, X̂ cannot depend on X. The best constant guess is the most probable symbol.
Zero rate means choose a fixed reconstruction x̂ minimizing P(X≠x̂). The minimizer is any mode x̂* ∈ argmax_x p(x). Then D_max = 1 − max_x p(x). Therefore R(D)=0 for all D ≥ 1−max_x p(x).
Show (using time-sharing) that R(D) is convex: for any D₁, D₂ and θ∈[0,1], prove R(θD₁+(1−θ)D₂) ≤ θR(D₁)+(1−θ)R(D₂).
Hint: Take two test channels p₁(x̂|x) and p₂(x̂|x) that achieve (or nearly achieve) R(D₁), R(D₂). Introduce an independent Bernoulli(θ) variable Q selecting which channel to use, and define X̂ based on Q.
Let Q~Bernoulli(θ) independent of X. Define conditional p(x̂|x,Q=q)=p_q(x̂|x) for q∈{1,2}. Then expected distortion is
E[d(X,X̂)] = θE[d(X,X̂)|Q=1] + (1−θ)E[d(X,X̂)|Q=2] ≤ θD₁+(1−θ)D₂.
Now bound mutual information:
I(X;X̂) ≤ I(X;X̂,Q) = I(X;Q)+I(X;X̂|Q) = 0 + θI₁(X;X̂)+(1−θ)I₂(X;X̂),
where I_q is computed under p_q.
Thus there exists a feasible channel at distortion θD₁+(1−θ)D₂ with mutual information ≤ θR(D₁)+(1−θ)R(D₂) (up to approximation if p_q are near-optimal). Taking the minimum over all feasible channels gives
R(θD₁+(1−θ)D₂) ≤ θR(D₁)+(1−θ)R(D₂), proving convexity.
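You can sanity-check the inequality on the known fair-bit curve R(D) = 1 − H₂(D) — a numeric spot check, not a proof:

```python
import math

def H2(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def R(D):
    """Fair-bit / Hamming RD curve, R(D) = 1 - H2(D) on [0, 1/2]."""
    return 1.0 - H2(D)

# Midpoint convexity check (theta = 1/2) on the known curve.
D1, D2 = 0.05, 0.4
lhs = R((D1 + D2) / 2)
rhs = 0.5 * R(D1) + 0.5 * R(D2)
print(lhs <= rhs)  # True
```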