Rate-Distortion Theory

Information Theory | Difficulty: 5/5 | Depth: 8

Optimal lossy compression. Tradeoff between bits and fidelity.


Core Concepts

  • Single-letter distortion measure: a nonnegative function d(x,x̂) and the expected distortion constraint E[d(X,X̂)] ≤ D (per-symbol average).
  • Rate-distortion function: R(D) as the minimal information cost (bits per symbol) required for lossy compression at distortion level D.

Key Symbols & Notation

  • R(D) (rate–distortion function)
  • p(x̂|x) (test channel / conditional reconstruction distribution)

Essential Relationships

  • R(D) = inf_{p(x̂|x): E[d(X,X̂)] ≤ D} I(X;X̂) (variational definition linking distortion, test channel, and mutual information)

All Concepts (16)

  • Distortion measure d(x,x̂): a nonnegative function that quantifies fidelity loss between a source symbol x and a reproduction symbol x̂
  • Reproduction alphabet X̂: the set of allowed reconstruction symbols (distinct from the source alphabet)
  • Expected (average) distortion E[d(X,X̂)] = sum_{x,x̂} p(x,x̂) d(x,x̂): the constraint used in lossy compression
  • Rate–Distortion function R(D): the minimum information rate (bits/symbol) required to achieve average distortion ≤ D
  • Distortion–Rate function D(R): minimal achievable distortion given a rate R (inverse viewpoint)
  • Test channel / conditional reproduction law p(x̂|x): the randomized mapping from source symbol x to reconstruction x̂ used in the single-letter formulation
  • Operational coding primitives for RD: encoder, decoder, codebook, block length n, rate R=(1/n) log M, and average distortion over a block
  • Single-letter characterization: the asymptotic (per-symbol) reduction of the coding problem to a per-letter optimization over p(x̂|x)
  • Lagrangian (parametric) formulation: introducing a multiplier β (or λ) to convert the constrained minimization to an unconstrained one
  • Parametric representation R(β) (or R–D curve via β): computing R(D) by sweeping the Lagrange multiplier
  • Optimal test channel structure (exponential family form) at optimum: p*(x̂|x) proportional to p(x̂) exp(-β d(x,x̂))
  • Convexity and monotonicity properties: R(D) is a nonincreasing, convex function of D
  • Achievability and converse (rate–distortion coding theorem): R(D) is both an achievable rate and a lower bound asymptotically
  • Blahut–Arimoto algorithm for Rate–Distortion: iterative numerical method to compute R(D) and the optimal p(x̂|x)
  • Important canonical examples/closed forms (as concepts): e.g., Gaussian source with MSE where R(D) = (1/2) log(σ^2/D)
  • Shannon lower bound and other analytic bounds useful for approximating R(D)
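
The Gaussian closed form listed above is easy to evaluate. A minimal sketch (the function name is my own, not from the source):

```python
import math

def gaussian_rate_distortion(sigma2: float, D: float) -> float:
    """R(D) for a Gaussian source with variance sigma2 under MSE:
    R(D) = 0.5 * log2(sigma2 / D) for 0 < D < sigma2, else 0."""
    if D >= sigma2:
        return 0.0  # zero rate suffices: just output the mean
    return 0.5 * math.log2(sigma2 / D)

# Each extra bit per symbol cuts the achievable MSE by a factor of 4:
rates = [gaussian_rate_distortion(1.0, D) for D in (0.25, 0.0625)]  # → [1.0, 2.0]
```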

Teaching Strategy

Quick unlock - significant prerequisite investment but simple final step. Verify prerequisites first.

Lossless compression asks: “How many bits do I need to reproduce the data exactly?” Rate–distortion theory asks the harder—and more practical—question: “If I’m allowed to be slightly wrong, what is the absolute minimum number of bits I need?”

TL;DR:

Rate–distortion theory characterizes the optimal tradeoff between compression rate (bits/symbol) and allowed average distortion. For a source X with distortion measure d(x,x̂), the rate–distortion function is

R(D) = min_{p(x̂|x): E[d(X,X̂)] ≤ D} I(X;X̂).

It is nonincreasing and convex in D, with R(0) = H(X) under a "zero-distortion implies exact reconstruction" condition. Operationally, R(D) is the best achievable asymptotic rate for block lossy coding, and the minimizing conditional distribution p(x̂|x) is the "test channel" that describes the optimal stochastic reconstruction mechanism.

What Is Rate–Distortion Theory?

Rate–distortion theory is the part of information theory that formalizes optimal lossy compression. "Lossy" means the decoder's reconstruction X̂ need not equal the original source symbol X; instead we control a fidelity criterion.

The central objects

1) A source X with distribution p(x). Typically we assume an i.i.d. source sequence X^n = (X₁, …, X_n) where the X_i are independent and identically distributed like X.

2) A single-letter distortion measure d(x,x̂) satisfying

  • d(x,x̂) ≥ 0 for all (x,x̂),
  • usually d(x,x) = 0 (not strictly required, but common).

From this we define the per-symbol average distortion constraint

E[d(X,X̂)] ≤ D.

For blocks, a standard extension is

d_n(x^n, x̂^n) = (1/n) Σ_{i=1}^n d(x_i, x̂_i),

and we ask that E[d_n(X^n, X̂^n)] ≤ D.

3) A rate R in bits/symbol. If a block encoder maps X^n to an index M ∈ {1, …, 2^{nR}}, then R measures the number of bits available per source symbol.

The question RD theory answers

Given an allowed distortion level D, what is the smallest achievable rate R?

This minimum is the rate–distortion function R(D).

The clean single-letter characterization

For a discrete memoryless source (DMS) with single-letter distortion, Shannon’s rate–distortion theorem states

R(D) = min_{p(x̂|x): E[d(X,X̂)] ≤ D} I(X;X̂).

The minimizer p(x̂|x) is called a test channel. It is not a physical channel; it is a mathematical object describing the optimal way (in an information-theoretic sense) to correlate the reconstruction with the source under the distortion constraint.

An RD curve: the picture you should have in mind

Even without interactivity, it helps to visualize R(D) as a curve:

Rate R(D) (bits/symbol)
^
|\
| \
|  \
|   \
|    \
|     \
|      \
+--------------------> Distortion D
       D_max
  • Left side (small D): high fidelity → high rate.
  • Right side (large D): low fidelity → low rate.
  • Typically, R(D) is nonincreasing and convex in D.

A useful intuition: “bits buy you correlation”

R(D) is a minimum over I(X;X̂). Mutual information measures how many bits per symbol the reconstruction "knows" about the source. Allowing more distortion lets X̂ be less correlated with X, reducing I(X;X̂).

Caveats (so you don’t overgeneralize)

  • Continuous alphabets: one must be careful with differential entropy, existence of minimizers, and distortion constraints. The formula still has an analog, but the operational meaning needs measure-theoretic care.
  • When does R(0) = H(X) hold? This holds under the condition that zero distortion forces exact reconstruction, i.e. d(x,x̂) = 0 iff x̂ = x (or at least uniquely identifies x). If your distortion measure has d(x,x̂) = 0 for multiple x̂ ≠ x, then R(0) can be smaller.
  • RD is asymptotic: the theorem is about large blocklength n with vanishing probability of exceeding distortion (or average distortion), not necessarily about small finite files.

Prerequisites (Accuracy Box): What You Must Already Be Comfortable With

This node builds on several ideas. You said you already know entropy and mutual information; that’s necessary but not sufficient for a smooth ride at difficulty 5.

Required prerequisites

  • Entropy: interpret H(X) as the optimal lossless rate. Reminder: H(X) = −Σ_x p(x) log p(x).
  • Mutual information: understand I(X;Y) as "bits of dependence". Reminder: I(X;Y) = Σ_{x,y} p(x,y) log[ p(x,y) / (p(x)p(y)) ].
  • i.i.d. block coding / typicality: why coding theorems are asymptotic and why per-symbol rates emerge. Reminder: typical sequences occupy about 2^{nH(X)} outcomes.
  • Conditional distributions: comfort manipulating p(x̂|x) and the induced marginal p(x̂) = Σ_x p(x) p(x̂|x).
  • Basic convex optimization: recognize convexity, constraints, and Lagrange multipliers/KKT (minimizing an objective subject to inequality constraints).

Helpful (but not strictly mandatory) background

  • Jensen’s inequality and convex functions.
  • Log-sum inequality (often used to prove convexity and information inequalities).
  • A bit of familiarity with channel capacity is helpful because RD is a "dual sibling" of capacity (minimize I subject to a distortion constraint vs. maximize I subject to a cost constraint).

Key caveats to keep in mind as you learn

1) Discrete vs continuous: for continuous sources, H(X) is replaced by differential entropy h(X), which is not directly a coding rate. Operational statements often involve mutual information directly and require additional regularity.

2) Distortion definition matters: average distortion can be defined in expectation, with high probability, or as excess-distortion probability. Standard Shannon RD uses expected per-letter distortion; variants exist.

3) R(0) = H(X) is conditional: as noted above, it's true in the common case where d(x,x̂) = 0 only when x̂ = x.

If any of these prerequisites feel shaky, it’s worth revisiting them before you attempt the proofs or algorithms (e.g., Blahut–Arimoto) in the later sections.

Core Mechanic 1: Distortion Measures and What “Fidelity” Really Means

Before optimizing anything, we must decide what "good reconstruction" means. Rate–distortion theory forces you to be explicit: you provide a distortion measure d(x,x̂) and a target distortion level D.

Single-letter distortion and per-symbol averaging

A single-letter distortion assigns a penalty to each symbol pair (x,x̂). For sequences, we typically average:

d_n(x^n, x̂^n) = (1/n) Σ_{i=1}^n d(x_i, x̂_i).

The average matters because it makes the constraint scale nicely with n and matches i.i.d. assumptions.

We then require (one common formulation)

E[d_n(X^n, X̂^n)] ≤ D.

A subtle point: this is an expected distortion constraint; it allows rare large-distortion events as long as they don't contribute too much to the average. Another common alternative is an excess-distortion probability constraint Pr(d_n(X^n, X̂^n) > D) ≤ ε.

Examples of distortion measures

1) Hamming distortion for discrete alphabets:

d(x,x̂) = 1{x ≠ x̂}.

Then E[d(X,X̂)] = Pr(X ≠ X̂), i.e. average distortion equals the symbol error rate.

2) Squared error for real-valued sources:

d(x,x̂) = (x − x̂)².

Then D is a mean-squared-error (MSE) constraint.

3) Weighted distortions: sometimes different symbols are more costly to distort.

Feasible distortion range: D_min to D_max

Two distortion levels are especially important.

  • Best possible distortion (often 0): if X̂ = X is allowed, then D_min = min E[d(X,X̂)] is typically 0.
  • Distortion at zero rate: if the decoder gets no bits, then X̂ cannot depend on X at all. The best it can do is choose a fixed reproduction distribution p(x̂) (or a constant x̂). Define

D_max = min_{p(x̂)} E[d(X,X̂)] subject to X̂ ⊥ X.

For Hamming distortion, a zero-rate decoder should output the most probable symbol, so D_max = 1 − max_x p(x).

This gives a boundary condition you should remember:

  • For D ≥ D_max, you can achieve the target distortion with zero bits, so R(D) = 0.
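
The D_max boundary above can be sketched in code for Hamming distortion (a toy helper of my own naming, assuming a finite alphabet given as a list of probabilities):

```python
def d_max_hamming(p):
    """Zero-rate distortion floor under Hamming distortion: with no bits,
    the best decoder outputs the most probable symbol, so D_max = 1 - max_x p(x)."""
    return 1.0 - max(p)

# Fair bit: the best blind guess is wrong half the time.
fair = d_max_hamming([0.5, 0.5])    # 0.5
# Skewed source: guessing the mode already achieves distortion 0.1 at zero rate.
skewed = d_max_hamming([0.1, 0.9])  # 0.1
```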

Why distortion is not “error”

In lossy compression, the reconstruction might be “wrong” but perceptually fine (audio/image), or wrong but still useful (downstream ML tasks). Rate–distortion cleanly separates:

  • Semantics: you pick d(x,x̂) to encode what you care about.
  • Limits: R(D) tells you the best possible compression given that notion of fidelity.

That separation is powerful: once d is fixed, the rest is mathematics.

A sanity check that prevents confusion

If your distortion is Hamming, the RD problem resembles "coding with errors." But it is not channel coding: in RD, the encoder sees the entire x^n and chooses a description; the distortion constraint is on reconstruction quality, not on a noisy channel outcome.

This perspective will matter when we introduce the test channel: the "noise" in p(x̂|x) is controlled by us to meet the distortion target, not imposed by nature.

Core Mechanic 2: The Rate–Distortion Function as a Constrained Mutual Information Minimum

Now we connect the operational question (“minimum bits”) to an information quantity.

The definition

For a fixed source distribution p(x) and distortion measure d(x,x̂), define

R(D) = min_{p(x̂|x): E[d(X,X̂)] ≤ D} I(X;X̂).

This is the single-letter (informational) form of the rate–distortion function (not to be confused with the Shannon lower bound, a separate approximation used mainly in continuous cases). The key is: we are minimizing mutual information over all conditional distributions p(x̂|x) that satisfy the distortion constraint.

Expand the pieces (so it feels concrete)

Given p(x) and a candidate test channel p(x̂|x):

1) Induced joint:

p(x, x̂) = p(x) p(x̂|x).

2) Induced marginal:

p(x̂) = Σ_x p(x) p(x̂|x).

3) Mutual information:

I(X;X̂) = Σ_{x,x̂} p(x) p(x̂|x) log[ p(x̂|x) / p(x̂) ].

4) Expected distortion:

E[d(X,X̂)] = Σ_{x,x̂} p(x) p(x̂|x) d(x,x̂).

We choose p(x̂|x) to minimize the mutual information while keeping the expected distortion ≤ D.
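
The four steps above translate directly into code. A self-contained sketch (function and variable names are mine) that computes I(X;X̂) and E[d] for any finite source, test channel, and distortion matrix:

```python
import math

def rd_quantities(p_x, W, d):
    """For source p_x[i] = p(x_i), test channel W[i][j] = p(x̂_j | x_i), and
    distortion d[i][j] = d(x_i, x̂_j), return (I(X;X̂) in bits, E[d(X,X̂)])
    by following steps 1-4 above."""
    nx, nxh = len(p_x), len(W[0])
    # Step 2: induced marginal p(x̂) = Σ_x p(x) p(x̂|x)
    q = [sum(p_x[i] * W[i][j] for i in range(nx)) for j in range(nxh)]
    I, Ed = 0.0, 0.0
    for i in range(nx):
        for j in range(nxh):
            joint = p_x[i] * W[i][j]          # step 1: p(x, x̂)
            if joint > 0:
                I += joint * math.log2(W[i][j] / q[j])   # step 3
            Ed += joint * d[i][j]                        # step 4
    return I, Ed

# Fair bit through a BSC(0.1) test channel: I = 1 - H₂(0.1) ≈ 0.531, E[d] = 0.1.
I, Ed = rd_quantities([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0, 1], [1, 0]])
```

The numbers match the binary worked example later in the document.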

Why mutual information is the right “cost”

A rough operational interpretation:

  • If you want the decoder to output X̂^n correlated with X^n, the encoder must convey information about X^n.
  • The best achievable per-symbol information transfer is captured by I(X;X̂) for an optimal randomized reconstruction rule.

RD theory makes this precise: in the limit of large blocklength, the minimum description rate equals the minimum mutual information.

Fundamental properties of R(D)

1) Monotonicity: if you allow more distortion, you never need more rate.

If D₂ ≥ D₁, then the feasible set at D₂ contains the feasible set at D₁, so

R(D₂) ≤ R(D₁).

2) Convexity: R(D) is convex in D.

Intuition: you can time-share between two codes (or two test channels) achieving distortions D₁ and D₂. Using them for fractions θ and 1−θ of the time yields average distortion θD₁ + (1−θ)D₂ and average rate θR(D₁) + (1−θ)R(D₂). Optimality then implies

R(θD₁ + (1−θ)D₂) ≤ θR(D₁) + (1−θ)R(D₂).

That inequality is exactly convexity.
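
The time-sharing inequality can be sanity-checked numerically with the binary closed form R(D) = 1 − H₂(D), derived in the worked examples later in this document; a quick sketch:

```python
import math

def h2(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def R_bsc(D):
    """R(D) = 1 - H₂(D) for a fair bit under Hamming distortion."""
    return 0.0 if D >= 0.5 else 1.0 - h2(D)

# Time-sharing: the curve at the mixed distortion lies on or below the chord.
D1, D2, theta = 0.05, 0.40, 0.3
mixed = R_bsc(theta * D1 + (1 - theta) * D2)
chord = theta * R_bsc(D1) + (1 - theta) * R_bsc(D2)
```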

3) Boundary values:

  • For large enough D (≥ D_max), R(D) = 0.
  • Under the "zero distortion implies exact reconstruction" condition, R(0) = H(X).

Why R(0) = H(X) under that condition:

  • If E[d(X,X̂)] = 0, then d(X,X̂) = 0 almost surely.
  • If d(x,x̂) = 0 only when x̂ = x, then X̂ = X a.s.
  • Then I(X;X̂) = I(X;X) = H(X).
  • Since R(0) is the minimum such I, it equals H(X).

If instead multiple reconstructions have zero distortion for a given x, then X̂ might be a function that preserves less information than X, and I(X;X̂) (hence R(0)) can be smaller.

The Lagrangian viewpoint (a workhorse)

We often solve constrained problems by introducing a multiplier β ≥ 0:

min_{p(x̂|x)} I(X;X̂) + β E[d(X,X̂)].

Then, for each β, the optimizer corresponds to a point on the RD curve. (Formally, β parameterizes supporting lines to the convex function R(D).)

This is where convex optimization enters: I(X;X̂) is convex in p(x̂|x) for fixed p(x), and the distortion constraint is linear in p(x̂|x), making the problem a convex program.

The “test channel” form you should remember

At optimum, the conditional often has an exponential tilt form (for discrete alphabets):

p(x̂|x) ∝ p(x̂) e^{−β d(x,x̂)}.

More explicitly,

p(x̂|x) = p(x̂) e^{−β d(x,x̂)} / Σ_{x̂'} p(x̂') e^{−β d(x,x̂')}.

This is not magic; it is the KKT optimality condition for minimizing mutual information subject to expected distortion. It is also the basis for the Blahut–Arimoto algorithm (iterative computation of R(D)).

Pause here and internalize the meaning: to compress optimally, you behave as if the reconstruction is drawn from a Gibbs distribution where reconstructions with smaller distortion are exponentially preferred, but also modulated by a global prior p(x̂) that must self-consistently match the induced marginal.
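
The self-consistency just described is exactly what Blahut–Arimoto iterates. A compact sketch (interface of my own design, not a library routine): alternate the Gibbs conditional with the marginal update, then read off a rate–distortion point for the chosen β:

```python
import math

def blahut_arimoto(p_x, d, beta, iters=200):
    """Trace one point of the RD curve for a given β by alternating the Gibbs
    conditional p(x̂|x) ∝ q(x̂) exp(-β d(x,x̂)) with the marginal update
    q(x̂) = Σ_x p(x) p(x̂|x). Returns (rate in bits, expected distortion)."""
    nx, nxh = len(p_x), len(d[0])
    q = [1.0 / nxh] * nxh                       # initial reproduction marginal
    for _ in range(iters):
        W = []
        for i in range(nx):
            row = [q[j] * math.exp(-beta * d[i][j]) for j in range(nxh)]
            z = sum(row)
            W.append([w / z for w in row])      # normalized Gibbs conditional
        q = [sum(p_x[i] * W[i][j] for i in range(nx)) for j in range(nxh)]
    rate = sum(p_x[i] * W[i][j] * math.log2(W[i][j] / q[j])
               for i in range(nx) for j in range(nxh) if W[i][j] > 0)
    dist = sum(p_x[i] * W[i][j] * d[i][j] for i in range(nx) for j in range(nxh))
    return rate, dist

# Fair bit, Hamming distortion, β = ln 9 lands at D = 0.1, R = 1 - H₂(0.1).
rate, dist = blahut_arimoto([0.5, 0.5], [[0, 1], [1, 0]], math.log(9))
```

Sweeping β from small to large moves the returned point from the zero-rate end of the curve toward the high-fidelity end.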

Application/Connection: Operational Meaning, Achievability vs Converse, and How RD Is Used

The single-letter formula is elegant, but why does it equal the true minimum code rate? Rate–distortion theory has an operational meaning: it predicts what block codes can and cannot do as n → ∞.

The operational RD theorem (informal but accurate)

Fix a DMS X and single-letter distortion d. For any distortion level D:

  • Achievability: For any rate R > R(D), there exists a sequence of (n,R) lossy codes whose expected distortion is ≤ D for sufficiently large n.
  • Converse: For any rate R < R(D), no sequence of codes can achieve expected distortion ≤ D for all large n.

So R(D) is the sharp threshold.

Why block coding matters

Lossy compression gains power from coding across many symbols:

  • With n = 1, you are designing a scalar quantizer.
  • With large n, you can use a large codebook of 2^{nR} reproduction sequences and map each source block to the nearest (lowest-distortion) reproduction.

This "vector quantization" viewpoint is one way to understand why R(D) is often lower than what naive symbol-by-symbol quantization would suggest.

A typical RD-code construction (high level)

Given an optimal test channel p(x̂|x) achieving R(D):

1) Compute the induced marginal p(x̂).

2) Randomly generate a codebook of 2^{nR} reproduction sequences x̂^n(m), drawn i.i.d. from Π_i p(x̂_i).

3) Given a source block x^n, pick an index m such that (x^n, x̂^n(m)) is "jointly typical" with respect to p(x,x̂) = p(x) p(x̂|x) (or approximately minimizes distortion).

4) Send m; the decoder outputs x̂^n(m).

The probability that no suitable codeword exists vanishes when R > I(X;X̂), which is the core reason mutual information appears.

The converse idea (why you can’t beat R(D))

For any code:

  • The message M is at most nR bits, so H(M) ≤ nR.
  • The reconstruction X̂^n is a function of M, so the information about X^n that reaches the decoder is bottlenecked.
  • One shows that

nR ≥ I(X^n; X̂^n).

Then, using memorylessness and single-letterization arguments (plus convexity), you get

(1/n) I(X^n; X̂^n) ≥ R( E[d(X,X̂)] ).

Thus R must be at least R(D).

Where RD theory shows up in practice

1) Benchmarking codecs: Real codecs (JPEG, AAC, H.264) can be compared to the theoretical RD bound for simplified source models.

2) Learning and VAEs: The objective in a variational autoencoder resembles an RD Lagrangian:

loss ≈ E[distortion] + λ · I(input; latent).

(Details depend on approximations, but the conceptual link is strong.)

3) Information bottleneck: Minimizing I(X;T) subject to preserving information about a target Y echoes the RD structure, with "distortion" defined via prediction loss.

4) Control and estimation: RD-like bounds appear in quantized control and remote estimation (minimum communication to achieve an MSE).

One more static RD diagram (annotated)

R
|            • (low D, high rate)
|          •
|        •
|      •
|    •
|  •
|•________________________ D
  0        D*         D_max

At D=0: often R(0)=H(X).
At D>=D_max: R(D)=0.
The curve is decreasing and convex.
A slope (supporting line) corresponds to a Lagrange multiplier β.

When you later compute RD points numerically, you are essentially moving along this curve by varying β.

Connecting back to prerequisites

  • Entropy explains the lossless endpoint (under conditions).
  • Mutual information explains the “cost of correlation.”
  • Typical sequences explain why random codebooks work.
  • Convex optimization explains why the test channel has an exponential form and why the curve is convex.

With these connections in place, the rate–distortion formula should feel less like a definition and more like a deep theorem tying together coding, probability, and optimization.

Worked Examples (3)

Binary symmetric source with Hamming distortion: compute R(D)

Let X ~ Bernoulli(1/2) (a fair bit). Use Hamming distortion d(x,x̂)=1{x≠x̂}. Find the rate–distortion function R(D) for 0 ≤ D ≤ 1/2.

  1. Step 1: Write the definition

    R(D)=min_{p(x̂|x): E[d(X,X̂)]≤D} I(X;X̂).

  2. Step 2: Parameterize the test channel

    For binary X and binary X̂, a natural family is a binary symmetric channel (BSC) with crossover probability q:

    P(X̂≠X)=q.

    Then E[d(X,X̂)]=q.

    So the distortion constraint is q ≤ D.

  3. Step 3: Compute mutual information for a BSC with uniform input

    If X ~ Bern(1/2) and X̂ is X flipped with prob q, then X̂ is also uniform.

    Thus

    I(X;X̂)=H(X̂)−H(X̂|X)=1 − H₂(q),

    where H₂(q)=−q log₂ q − (1−q)log₂(1−q) is the binary entropy.

  4. Step 4: Minimize over feasible q

    We need to minimize 1 − H₂(q) subject to q ≤ D.

    Since H₂(q) increases on [0,1/2], 1−H₂(q) decreases on [0,1/2].

    So the minimum occurs at the largest feasible q, i.e. q=D.

  5. Step 5: State R(D)

    For 0 ≤ D ≤ 1/2,

    R(D)=1 − H₂(D).

    For D ≥ 1/2, you can output random bits independent of X to get distortion 1/2 at rate 0, so R(D)=0.

Insight: For a uniform bit, the optimal lossy code behaves like deliberately adding Bernoulli(D) noise: the reconstruction is a noisy version of the source. The “price” in bits is exactly the mutual information left after that noise, 1 − H₂(D).
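
Step 3's identity I(X;X̂) = 1 − H₂(q) can be confirmed by summing the four joint outcomes directly (helper names are my own):

```python
import math

def h2(q):
    """Binary entropy in bits."""
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def mi_bsc_uniform(q):
    """I(X;X̂) for a fair bit through a BSC(q): X̂ is also uniform,
    so p(x̂) = 1/2 and the sum runs over the four joint outcomes."""
    I = 0.0
    for x in (0, 1):
        for xh in (0, 1):
            p_cond = q if x != xh else 1 - q   # p(x̂|x)
            I += 0.5 * p_cond * math.log2(p_cond / 0.5)
    return I

# Matches 1 - H₂(q); setting q = D then gives R(D) = 1 - H₂(D).
check = mi_bsc_uniform(0.1)
```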

Non-uniform binary source: compute D_max and interpret the zero-rate point

Let X ~ Bernoulli(p) with p=0.9. Use Hamming distortion. (i) Compute D_max (minimum distortion achievable at zero rate). (ii) Explain what this implies about R(D) for large D.

  1. Step 1: Zero rate means X̂ is independent of X

    With R=0, the decoder gets no information, so it must output a constant symbol (or random symbol) not depending on X.

  2. Step 2: Minimize expected Hamming distortion over constant reconstructions

    If decoder always outputs 1, distortion is P(X=0)=1−p.

    If decoder always outputs 0, distortion is P(X=1)=p.

    Pick the smaller:

    D_max = min(p, 1−p).

  3. Step 3: Plug in p=0.9

    D_max = min(0.9, 0.1)=0.1.

  4. Step 4: Interpret for the RD curve

    If you allow distortion D ≥ 0.1, then the decoder can simply always output 1 and achieve expected distortion 0.1 with zero bits.

    Therefore R(D)=0 for all D ≥ 0.1.

Insight: The right endpoint of the RD curve depends strongly on the source distribution and distortion. For skewed sources, you can get surprisingly low distortion without sending any bits—just guess the mode.

Derive the exponential-form optimality condition for the test channel (discrete case)

Show the key KKT/variational step that leads to p(x̂|x) ∝ p(x̂)e^{−β d(x,x̂)} for the optimization min I(X;X̂) subject to E[d] ≤ D and ∑_{x̂} p(x̂|x)=1 for each x.

  1. Step 1: Write the objective in a convenient form

    I(X;X̂)=∑_{x,x̂} p(x)p(x̂|x) log( p(x̂|x) / p(x̂) ),

    where p(x̂)=∑_x p(x)p(x̂|x).

  2. Step 2: Form the Lagrangian

    Introduce β ≥ 0 for the distortion constraint and λ(x) for normalization:

    L = I(X;X̂) + β( ∑_{x,x̂} p(x)p(x̂|x)d(x,x̂) − D ) + ∑_x λ(x)( ∑_{x̂} p(x̂|x) − 1 ).

  3. Step 3: Take a directional derivative w.r.t. p(x̂|x)

    Holding p(x) fixed, set derivative to zero at optimum.

    A standard calculation yields the stationarity condition:

    log p(x̂|x) − log p(x̂) + 1 + β d(x,x̂) + λ(x)/p(x) = 0.

  4. Step 4: Solve for p(x̂|x)

    Rearrange:

    log p(x̂|x) = log p(x̂) − β d(x,x̂) − c(x),

    where c(x) collects constants for each x.

    Exponentiate:

    p(x̂|x) = p(x̂) e^{−β d(x,x̂)} e^{−c(x)}.

  5. Step 5: Enforce normalization to find e^{−c(x)}

    We need ∑_{x̂} p(x̂|x)=1:

    1 = e^{−c(x)} ∑_{x̂} p(x̂)e^{−β d(x,x̂)}

    ⇒ e^{−c(x)} = 1 / ∑_{x̂} p(x̂)e^{−β d(x,x̂)}.

  6. Step 6: Final form

    Therefore

    p(x̂|x) = p(x̂) e^{−β d(x,x̂)} / Σ_{x̂'} p(x̂') e^{−β d(x,x̂')}.

Insight: This is the mathematical heart of RD computation: the optimal encoder–decoder behavior can be represented by a self-consistent “Gibbs” conditional. Varying β traces the RD curve, and iterating the consistency between p(x̂|x) and p(x̂) leads to Blahut–Arimoto.

Key Takeaways

  • A distortion measure d(x,x̂) and constraint E[d(X,X̂)]≤D define what “fidelity” means; everything else follows from that choice.

  • The rate–distortion function is the constrained minimum mutual information: R(D) = min_{p(x̂|x): E[d(X,X̂)] ≤ D} I(X;X̂).

  • R(D) is nonincreasing and convex in D; for D ≥ D_max, R(D)=0.

  • The endpoint R(0)=H(X) holds when zero distortion forces exact reconstruction (d(x,x̂)=0 ⇒ x̂=x).

  • The optimizer p(x̂|x) is called the test channel and often has the exponential tilt form p(x̂|x) ∝ p(x̂)e^{−β d(x,x̂)}.

  • Operationally, R(D) is the asymptotically optimal bits/symbol for block lossy coding; it is not merely a definition.

  • Time-sharing (mixing codes) explains convexity and provides a constructive intuition for intermediate operating points.

  • Many modern ML objectives (e.g., bottlenecked representations) resemble RD Lagrangians trading distortion against information.

Common Mistakes

  • Assuming R(0)=H(X) always, without checking the distortion measure’s “zero distortion implies equality” property.

  • Confusing the test channel p(x̂|x) with a real physical channel; it is an optimization variable describing an optimal reconstruction law.

  • Forgetting that RD results are asymptotic in blocklength n; scalar quantizers can be far from the RD bound.

  • Mixing discrete and continuous intuitions: replacing H with differential entropy h and treating it like a code rate leads to incorrect conclusions.

Practice

easy

Let X ~ Bernoulli(1/2) with Hamming distortion. Compute R(D) at D=0, D=0.1, and D=0.5.

Hint: Use R(D)=1−H₂(D) for 0≤D≤1/2, and R(D)=0 for D≥1/2. Recall H₂(0)=0 and H₂(1/2)=1.

Solution

D=0: R(0)=1−H₂(0)=1−0=1 bit/symbol.

D=0.1: R(0.1)=1−H₂(0.1) ≈ 1−0.468995 ≈ 0.5310 bits/symbol.

D=0.5: R(0.5)=0 (since 1−H₂(0.5)=0 and also D≥1/2 implies zero rate).
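
These three values can be checked with a throwaway sketch:

```python
import math

def h2(q):
    """Binary entropy in bits (0 at the endpoints)."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def R(D):
    """R(D) for a fair bit with Hamming distortion."""
    return 0.0 if D >= 0.5 else 1.0 - h2(D)

answers = [R(0.0), R(0.1), R(0.5)]  # 1.0, ≈0.5310, 0.0
```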

medium

For a general discrete source X with distribution p(x) and Hamming distortion d(x,x̂)=1{x≠x̂}, compute D_max (the distortion achievable at zero rate).

Hint: At zero rate, X̂ cannot depend on X. The best constant guess is the most probable symbol.

Solution

Zero rate means choose a fixed reconstruction x̂ minimizing P(X≠x̂). The minimizer is any mode x̂* ∈ argmax_x p(x). Then D_max = 1 − max_x p(x). Therefore R(D)=0 for all D ≥ 1−max_x p(x).

hard

Show (using time-sharing) that R(D) is convex: for any D₁, D₂ and θ∈[0,1], prove R(θD₁+(1−θ)D₂) ≤ θR(D₁)+(1−θ)R(D₂).

Hint: Take two test channels p₁(x̂|x) and p₂(x̂|x) that achieve (or nearly achieve) R(D₁), R(D₂). Introduce an independent Bernoulli(θ) variable Q selecting which channel to use, and define X̂ based on Q.

Solution

Let Q~Bernoulli(θ) independent of X. Define conditional p(x̂|x,Q=q)=p_q(x̂|x) for q∈{1,2}. Then expected distortion is

E[d(X,X̂)] = θE[d(X,X̂)|Q=1] + (1−θ)E[d(X,X̂)|Q=2] ≤ θD₁+(1−θ)D₂.

Now bound mutual information:

I(X;X̂) ≤ I(X;X̂,Q) = I(X;Q)+I(X;X̂|Q) = 0 + θI₁(X;X̂)+(1−θ)I₂(X;X̂),

where I_q is computed under p_q.

Thus there exists a feasible channel at distortion θD₁+(1−θ)D₂ with mutual information ≤ θR(D₁)+(1−θ)R(D₂) (up to approximation if p_q are near-optimal). Taking the minimum over all feasible channels gives

R(θD₁+(1−θ)D₂) ≤ θR(D₁)+(1−θ)R(D₂), proving convexity.
