Information Bottleneck

Information Theory · Difficulty: ████ · Depth: 8 · Unlocks: 0

Compression that preserves relevant information.

Core Concepts

  • Compression-relevance trade-off: compress X while preserving information about Y
  • Bottleneck variable T: a compressed/latent representation of X that should retain relevant information about Y

Key Symbols & Notation

  • T: bottleneck / compressed latent variable
  • β: positive trade-off / Lagrange multiplier

Essential Relationships

  • Markov chain Y - X - T (T is produced from X; no direct dependence of T on Y)
  • IB objective: minimize I(X;T) - beta * I(T;Y) (equivalently, with the multiplier rescaled, maximize I(T;Y) - (1/beta) * I(X;T)) to balance compression and relevance

All Concepts (12)

  • Compressed representation T: a new random variable that encodes X with reduced information
  • Information preservation/relevance: the idea of preserving information in T that is specifically relevant to predicting Y
  • IB objective (trade-off) as a design goal: compress X while keeping predictive information about Y
  • Lagrangian and constrained formulations: optimizing a trade-off via a Lagrange multiplier or via a constrained optimization
  • Trade-off parameter beta (β): a scalar that controls the emphasis on relevance versus compression
  • Stochastic encoder/decoder viewpoint: using probabilistic mappings p(t|x) and p(y|t) rather than deterministic mappings
  • Distortion-as-KL interpretation: measuring representational distortion by the KL divergence between conditional predictive distributions
  • Predictive clustering: representing X by clustering inputs that have similar conditional distributions p(y|x)
  • Information Bottleneck (IB) curve / information plane: the frontier of achievable pairs (I(X;T), I(T;Y))
  • Sufficiency and minimal sufficient representation: T is sufficient for Y if it preserves I(T;Y)=I(X;Y); minimal sufficiency minimizes I(X;T) among sufficient T
  • Markov chain constraint for encoders: the encoder must satisfy the conditional-independence structure Y - X - T (T depends only on X)
  • Phase/structural transitions in optimal representations as β varies (changes in number/shape of clusters or support of p(t|x))

Teaching Strategy

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

You have data X that contains “everything,” but your downstream task only cares about Y. The Information Bottleneck principle asks: can we compress X into a smaller representation T that forgets irrelevant details, while keeping what matters for predicting Y?

TL;DR:

Information Bottleneck (IB) formalizes representation learning as a trade-off: minimize I(X;T) (compression) while maximizing I(T;Y) (relevance). With a Lagrange multiplier β>0, IB solves for a stochastic encoder p(t|x) that produces a bottleneck variable T, under the Markov chain T–X–Y. The discrete IB has fixed-point updates; the practical “Variational IB” (VIB) replaces intractable mutual informations with variational bounds, yielding a loss that looks like prediction loss + β·KL regularization—closely related to β-VAEs and regularized neural classifiers.

What Is Information Bottleneck?

The problem IB is trying to solve (why before how)

In many learning settings, your input variable X contains a mix of:

  • Signal: aspects of X that help predict a target Y (labels, future outcomes, etc.).
  • Nuisance: aspects of X that are unrelated to Y (noise, style, background, idiosyncrasies).

If you let a model store everything about X, it can overfit, memorize, or learn brittle features. If you compress too aggressively, you lose predictive power.

The Information Bottleneck (IB) framework turns this into a clean information-theoretic optimization:

  • Create a compressed representation T of X.
  • Ensure T retains information about Y.

Here T is called the bottleneck variable because it limits how much information from X can flow downstream.

The IB story in one line

Find a stochastic mapping p(t|x) such that:

  • T is as independent of X as possible (compression)
  • while T is as informative about Y as possible (relevance)

The standard IB Lagrangian is:

\min_{p(t|x)} \; \mathcal{L}_{IB} = I(X;T) - \beta \, I(T;Y), \quad \beta>0.

Interpretation:

  • I(X;T): how many bits T “remembers” about X. Smaller means stronger compression.
  • I(T;Y): how many bits T carries about Y. Larger means better relevance.
  • β: trade-off knob.
  • Small β → prioritize compression (small I(X;T)), risk losing predictive info.
  • Large β → prioritize relevance (large I(T;Y)), allow more capacity.

A key modeling assumption: the Markov chain

IB usually assumes the representation is formed from X alone:

T – X – Y

This means: given X, T is independent of Y (because you compute T from X).

Formally:

p(t|x,y) = p(t|x).

This assumption is not just a technicality: it encodes the idea that you don’t get to peek at the label Y when forming the representation (at test time, you only see X).

“Compression” vs “relevance” as two competing forces

It helps to visualize two extremes:

| Setting | What happens | Risk |
| --- | --- | --- |
| Very strong compression (force I(X;T) small) | T discards many details of X | T may lose predictive info about Y |
| Very strong relevance (force I(T;Y) large) | T keeps whatever helps predict Y | T may retain lots of X (less robust/generalizable) |

IB doesn’t assume you know which parts of X are relevant. It discovers them by optimizing these two pressures.


Pause and check 1 (sanity questions)

Before going further, make sure these answers feel clear:

  1. If T is a copy of X, what is I(X;T) likely to be? (Large: approximately H(X).)
  2. If T is constant (always the same value), what are I(X;T) and I(T;Y)? (Both 0.)
  3. Why is β needed? (Because you can’t generally minimize I(X;T) and maximize I(T;Y) simultaneously without trading them off.)

If those are intuitive, you’re ready for the mechanics.
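The two extreme cases above are easy to verify numerically. The following is a minimal NumPy sketch using a hypothetical uniform distribution over four inputs:

```python
import numpy as np

def mutual_information(p_xt):
    """I(X;T) in nats, from a joint distribution p(x,t) given as a 2-D array."""
    p_x = p_xt.sum(axis=1, keepdims=True)   # marginal p(x)
    p_t = p_xt.sum(axis=0, keepdims=True)   # marginal p(t)
    mask = p_xt > 0                          # skip zero-probability cells
    return float(np.sum(p_xt[mask] * np.log(p_xt[mask] / (p_x @ p_t)[mask])))

p_x = np.full(4, 0.25)            # hypothetical uniform X over 4 values

# Case 1: T is a copy of X  =>  I(X;T) = H(X) = log 4
p_copy = np.diag(p_x)
print(mutual_information(p_copy), np.log(4))

# Case 2: T is constant (one T value)  =>  I(X;T) = 0
p_const = p_x.reshape(-1, 1)
print(mutual_information(p_const))
```

Running this prints I(X;T) = log 4 ≈ 1.386 nats for the copy and 0 for the constant, matching answers 1 and 2.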

Core Mechanic 1: The IB Objective and What It Really Measures

Step 1: Start from what we control

We do not choose p(x,y); it comes from the world/data. What we can choose is an encoder (possibly stochastic):

p(t|x).

Together with p(x,y), this induces:

  • p(t) = \sum_x p(x)\, p(t|x)
  • p(y|t) = \sum_x p(y|x)\, p(x|t)

Step 2: Expand the two mutual informations

Recall the mutual information identities:

I(X;T) = \sum_{x,t} p(x,t) \log \frac{p(t|x)}{p(t)}.

And

I(T;Y) = \sum_{t,y} p(t,y) \log \frac{p(y|t)}{p(y)}.

The IB Lagrangian:

\mathcal{L}_{IB} = I(X;T) - \beta I(T;Y)

is therefore a functional of the entire conditional distribution p(t|x).

What I(X;T) means operationally

Think of transmitting T given X. If T contains many distinguishable states for different x, then T carries many bits about X. If many different x map to similar distributions over t, then T forgets details.

A useful equivalent form is:

I(X;T) = H(T) - H(T|X).

  • Increasing stochasticity in the encoder (larger H(T|X)) can reduce I(X;T).
  • But if you make the encoder too noisy, you may also reduce I(T;Y).

What I(T;Y) means operationally

T is “good” if it makes Y predictable. Another identity:

I(T;Y) = H(Y) - H(Y|T).

Since H(Y) is fixed by data, maximizing I(T;Y) is equivalent to minimizing the conditional entropy H(Y|T): make Y as predictable from T as possible.
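Both forms of I(T;Y) can be checked on a small joint distribution. The numbers below are hypothetical; the point is that the entropy identity and the definition agree:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

p_ty = np.array([[0.30, 0.05],    # hypothetical joint p(t, y): 3 states, 2 labels
                 [0.10, 0.25],
                 [0.15, 0.15]])
p_t = p_ty.sum(axis=1)
p_y = p_ty.sum(axis=0)

# Identity form: I(T;Y) = H(Y) - H(Y|T)
h_y_given_t = sum(p_t[i] * entropy(p_ty[i] / p_t[i]) for i in range(3))
i_identity = entropy(p_y) - h_y_given_t

# Definition form: sum_{t,y} p(t,y) log [ p(t,y) / (p(t) p(y)) ]
i_definition = float(np.sum(p_ty * np.log(p_ty / np.outer(p_t, p_y))))

print(i_identity, i_definition)   # agree up to floating-point error
```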

The IB curve (rate–distortion flavor)

If you fix a compression budget I(X;T) \le R and maximize relevance, you get a curve of optimal trade-offs. Equivalently, the Lagrangian with β traces out that curve.

This parallels rate–distortion theory:

  • Rate ↔ I(X;T)
  • Distortion ↔ a loss of predictive info about Y

IB can be seen as a task-aware compression method.


Pause and check 2 (micro-summary)

At this point:

  • We choose an encoder p(t|x).
  • I(X;T) penalizes how much T remembers about X.
  • I(T;Y) rewards how much T helps predict Y.

Quick self-test:

  • If you increase β, do you expect I(T;Y) to go up or down at the optimum? (Up.)
  • If you decrease β, do you expect I(X;T) to go up or down? (Down.)

Now we’ll derive the structure of the optimal p(t|x) in the discrete case.

Core Mechanic 2: Discrete IB and the Fixed-Point Equations

Why fixed-point equations appear

The IB optimization is over distributions, not over a few parameters. In the discrete setting (finite X, Y, T), the classic IB solution gives self-consistent update rules for:

  • the encoder p(t|x)
  • the cluster prior p(t)
  • the decoder p(y|t)

This resembles a “soft clustering” procedure where each x is assigned probabilistically to bottleneck states t.

Set up the constrained optimization

We minimize

\mathcal{L}_{IB} = I(X;T) - \beta I(T;Y)

subject to the normalization constraints:

\sum_t p(t|x) = 1 \quad \forall x.

We also rely on the Markov chain T – X – Y.

Key identity: relevance term becomes a KL to class-conditionals

A central result in IB is that the optimal encoder depends on how different p(y|x) is from p(y|t). The natural measure of “difference between predictive distributions” is KL divergence.

Define the KL:

D_{KL}(p(y|x) \,\|\, p(y|t)) = \sum_y p(y|x)\log\frac{p(y|x)}{p(y|t)}.

Intuition: if x and bottleneck state t imply similar label distributions, then assigning x to t costs little relevance.

The fixed-point form (the classic IB equations)

The discrete IB solution satisfies:

1) Encoder update

p(t|x) = \frac{p(t)}{Z(x,\beta)} \exp\Big(-\beta\, D_{KL}(p(y|x)\,\|\,p(y|t))\Big)

where Z(x,\beta) is a normalizer:

Z(x,\beta) = \sum_t p(t)\exp\Big(-\beta\, D_{KL}(p(y|x)\,\|\,p(y|t))\Big).

2) Cluster prior update

p(t) = \sum_x p(x)\, p(t|x).

3) Decoder (label distribution per cluster)

p(y|t) = \sum_x p(y|x)\, p(x|t)

and using Bayes:

p(x|t) = \frac{p(x)\,p(t|x)}{p(t)}.

Putting it together, (3) is often written as:

p(y|t) = \frac{1}{p(t)}\sum_x p(x)\,p(t|x)\,p(y|x).

Show-your-work sketch: where does the exponential form come from?

We’ll outline the logic (not every calculus-of-variations detail, but the key steps).

Start with:

I(X;T) = \sum_{x,t} p(x)\,p(t|x)\log\frac{p(t|x)}{p(t)}.

For the relevance term, use I(T;Y) = \sum_{t}p(t)\sum_y p(y|t)\log\frac{p(y|t)}{p(y)}.

When you take the functional derivative of \mathcal{L}_{IB} with respect to p(t|x) and enforce \sum_t p(t|x)=1 using Lagrange multipliers \lambda(x), the stationary condition yields something of the form:

\log p(t|x) = \log p(t) - \beta\, D_{KL}(p(y|x)\,\|\,p(y|t)) - \lambda(x).

Exponentiate both sides:

p(t|x) \propto p(t)\exp\big(-\beta\, D_{KL}(p(y|x)\,\|\,p(y|t))\big)

and the proportionality constant is exactly 1/Z(x,\beta).
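The three fixed-point updates can be sketched as a short NumPy iteration. The two-input toy problem and the mildly tilted initial encoder below are hypothetical, chosen so the dynamics are easy to follow:

```python
import numpy as np

def ib_update(p_x, p_y_given_x, p_t_given_x, beta):
    """One round of the discrete IB fixed-point updates."""
    p_t = p_x @ p_t_given_x                        # cluster prior p(t)
    p_y_given_t = (p_t_given_x * p_x[:, None]).T @ p_y_given_x
    p_y_given_t /= p_t[:, None]                    # decoder p(y|t)
    # D_KL(p(y|x) || p(y|t)) for every (x, t) pair
    kl = np.sum(p_y_given_x[:, None, :]
                * np.log(p_y_given_x[:, None, :] / p_y_given_t[None, :, :]),
                axis=2)
    w = p_t[None, :] * np.exp(-beta * kl)          # Gibbs/exponential form
    return w / w.sum(axis=1, keepdims=True)        # normalized encoder p(t|x)

p_x = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.1, 0.9],                # very different label
                        [0.8, 0.2]])               # distributions per input
enc = np.array([[0.6, 0.4],                        # hypothetical tilted init
                [0.4, 0.6]])
for _ in range(100):
    enc = ib_update(p_x, p_y_given_x, enc, beta=5.0)
print(np.round(enc, 3))                            # rows become near one-hot
```

With β = 5 the assignments sharpen to near-hard clusters (x₁ → t₁, x₂ → t₂), matching the soft-clustering reading of the encoder update.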

Intuition: IB as soft clustering of inputs by label-meaning

The encoder update says:

  • p(t) favors common clusters.
  • The exponential penalizes assigning x to clusters whose label distribution p(y|t) mismatches p(y|x).
  • β controls how “hard” the assignments become.

As β → 0, the KL term is downweighted and assignments become dominated by p(t) (compression wins).

As β → ∞, the encoder strongly prefers clusters that match p(y|x) (relevance wins), potentially making T preserve nearly all predictive structure.


Pause and check 3 (sanity and pacing)

Answer these before moving on:

  1. In the encoder update, what happens if p(y|x) = p(y|t) exactly? (KL = 0, so x can be assigned to t with no relevance penalty.)
  2. Why does a KL divergence appear rather than, say, squared error? (Because we compare distributions over Y; KL is the natural information-theoretic discrepancy.)
  3. What makes these equations “fixed-point”? (Each update uses the others; iterating them seeks a self-consistent solution.)

Next we’ll shift from theory (discrete exact IB) to practice (continuous/high-dimensional settings).

Application/Connection: From Exact IB to Variational Information Bottleneck (VIB) and Deep Learning

Flagging the shift: why we need a variational approximation

The discrete IB fixed-point equations are elegant, but they assume we can:

  • Sum over all x, t, and y.
  • Represent p(t|x) as a full table.
  • Compute p(y|x) from the data distribution.

In modern ML:

  • X is high-dimensional (images, text).
  • T is continuous (latent vectors).
  • p(y|x) is unknown; we only have samples.

So we use Variational Information Bottleneck (VIB): an optimization that resembles IB but is tractable with neural networks and minibatches.

The VIB setup

We parameterize:

  • Encoder q_\phi(t|x) (neural net producing a distribution, often Gaussian)
  • Decoder/predictor p_\theta(y|t) (classifier/regressor)
  • A prior over latents p(t) (often standard normal)

Goal: keep T predictive of Y while limiting information from X into T.

Replace the intractable terms with bounds

Relevance term

We want to maximize I(T;Y) = H(Y) - H(Y|T). Since H(Y) is constant, we minimize H(Y|T).

In practice, we minimize negative log-likelihood under a decoder:

\mathbb{E}_{p(x,y)}\,\mathbb{E}_{q_\phi(t|x)}[-\log p_\theta(y|t)].

This is a standard prediction loss.

Compression term

The IB compression term is I(X;T). In VIB, a common upper bound is:

I(X;T) \le \mathbb{E}_{p(x)}\left[D_{KL}(q_\phi(t|x)\,\|\,p(t))\right].

Why this makes sense (sketch):

  • I(X;T) = \mathbb{E}_{p(x)}\, D_{KL}(q(t|x)\,\|\,q(t)).
  • q(t) is the aggregated posterior, hard to compute.
  • Replacing q(t) with a chosen prior p(t) yields an upper bound.
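This bound can be confirmed numerically in the discrete case, where the aggregated posterior q(t) is computable exactly. The encoder rows and prior below are hypothetical:

```python
import numpy as np

def kl(p, q):
    """KL divergence in nats between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

p_x = np.array([0.5, 0.5])
q_t_given_x = np.array([[0.7, 0.2, 0.1],      # hypothetical encoder rows
                        [0.1, 0.3, 0.6]])
q_t = p_x @ q_t_given_x                        # aggregated posterior q(t)
prior = np.array([1/3, 1/3, 1/3])              # chosen prior p(t)

# Exact I(X;T) versus the VIB surrogate (KL to the chosen prior)
i_xt  = sum(p_x[i] * kl(q_t_given_x[i], q_t)  for i in range(2))
bound = sum(p_x[i] * kl(q_t_given_x[i], prior) for i in range(2))
print(i_xt, bound)   # the KL-to-prior term upper-bounds I(X;T)
```

The gap between the two numbers is exactly KL(q(t) ‖ p(t)), so the bound is tight when the prior matches the aggregated posterior.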

The VIB objective (the practical loss)

Putting these together gives a common VIB training objective:

\min_{\phi,\theta}\; \mathcal{L}_{VIB} = \mathbb{E}_{p(x,y)}\,\mathbb{E}_{q_\phi(t|x)}[-\log p_\theta(y|t)] + \beta\, \mathbb{E}_{p(x)}\left[D_{KL}(q_\phi(t|x)\,\|\,p(t))\right].

Compare to IB:

  • Prediction loss corresponds to maximizing relevance.
  • KL regularizer corresponds to compression.
  • β again controls the trade-off.

Connection map: IB, β-VAE, and regularization

VIB looks structurally similar to other objectives:

| Method | Objective shape | What it’s for |
| --- | --- | --- |
| IB | I(X;T) − β I(T;Y) | theory of optimal representations |
| VIB | NLL(y\|t) + β · KL(q(t\|x) ‖ p(t)) | supervised representation learning |
| β-VAE | reconstruction + β · KL(q(z\|x) ‖ p(z)) | unsupervised disentangling/regularization |

Key difference: VIB uses Y in the decoder; β-VAE reconstructs X.

Practical encoder choice: Gaussian + reparameterization

A common choice is:

q_\phi(t|x) = \mathcal{N}(\mu_\phi(x), \mathrm{diag}(\sigma^2_\phi(x))).

Sampling is done via the reparameterization trick:

t = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).

This makes gradients flow through stochastic sampling.
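A minimal sketch of the trick (the values of μ and σ are hypothetical encoder outputs for one input): sampling t = μ + σ·ε with ε ~ N(0, I) reproduces draws from N(μ, σ²), while keeping μ and σ outside the randomness so gradients can pass through them.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.5                 # hypothetical encoder outputs for one x
eps = rng.standard_normal(100_000)   # noise that carries no parameters
t = mu + sigma * eps                 # reparameterized samples from N(mu, sigma^2)
print(t.mean(), t.std())             # ~1.0 and ~0.5
```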

Why VIB can improve robustness and generalization

The KL term pushes q_\phi(t|x) toward a simple prior (like \mathcal{N}(0, I)). That tends to:

  • limit memorization (capacity control)
  • encourage smoother representations
  • reduce sensitivity to nuisance variation in X

But if β is too large, you get posterior collapse (the encoder ignores X and T carries little information), harming prediction.


Pause and check 4 (explicit bridge from theory to practice)

Make sure you can articulate these two transitions:

  1. Exact IB → VIB: We can’t compute I(X;T) and I(T;Y) exactly in deep settings, so we optimize a surrogate using NLL and a KL-to-prior.
  2. p(t|x) → q_\phi(t|x): We move from an unconstrained distribution to a parameterized family (neural network).

If that’s clear, the worked examples will feel grounded rather than magical.

Worked Examples (3)

Discrete IB: One encoder update step on a tiny problem

Let X∈{x₁,x₂}, Y∈{0,1}, T∈{t₁,t₂}. Suppose p(x₁)=p(x₂)=0.5.

Assume the label conditionals are:

  • p(y=1|x₁)=0.9, p(y=0|x₁)=0.1
  • p(y=1|x₂)=0.2, p(y=0|x₂)=0.8

Initialize p(t₁)=p(t₂)=0.5 and choose decoder distributions:

  • p(y=1|t₁)=0.8, p(y=0|t₁)=0.2
  • p(y=1|t₂)=0.3, p(y=0|t₂)=0.7

Let β=2. Compute p(t|x) using the IB encoder update.

  1. Write the encoder update:

    p(t|x) = [p(t)/Z(x,β)] · exp(-β · D_KL(p(y|x) || p(y|t))).

  2. Compute D_KL for x₁ vs each t.

    For t₁:

    D₁ = 0.9·log(0.9/0.8) + 0.1·log(0.1/0.2)

    = 0.9·log(1.125) + 0.1·log(0.5)

    ≈ 0.9·0.1173 + 0.1·(-0.6931)

    ≈ 0.1056 - 0.0693

    ≈ 0.0363.

    For t₂:

    D₂ = 0.9·log(0.9/0.3) + 0.1·log(0.1/0.7)

    = 0.9·log(3) + 0.1·log(1/7)

    ≈ 0.9·1.0986 + 0.1·(-1.9459)

    ≈ 0.9887 - 0.1946

    ≈ 0.7941.

  3. Compute unnormalized weights for x₁:

    w(t₁|x₁)=p(t₁)·exp(-βD₁)=0.5·exp(-2·0.0363)

    =0.5·exp(-0.0726)≈0.5·0.9300≈0.4650.

    w(t₂|x₁)=0.5·exp(-2·0.7941)=0.5·exp(-1.5882)

    ≈0.5·0.2043≈0.1021.

  4. Normalize:

    Z(x₁,β)=0.4650+0.1021=0.5671.

    p(t₁|x₁)=0.4650/0.5671≈0.820.

    p(t₂|x₁)=0.1021/0.5671≈0.180.

  5. Repeat for x₂.

    Compute KLs:

    For t₁:

    D₃ = 0.2·log(0.2/0.8) + 0.8·log(0.8/0.2)

    = 0.2·log(0.25) + 0.8·log(4)

    ≈ 0.2·(-1.3863) + 0.8·(1.3863)

    ≈ -0.2773 + 1.1090

    ≈ 0.8317.

    For t₂:

    D₄ = 0.2·log(0.2/0.3) + 0.8·log(0.8/0.7)

    ≈ 0.2·(-0.4055) + 0.8·(0.1335)

    ≈ -0.0811 + 0.1068

    ≈ 0.0257.

  6. Weights for x₂:

    w(t₁|x₂)=0.5·exp(-2·0.8317)=0.5·exp(-1.6634)

    ≈0.5·0.1893≈0.0946.

    w(t₂|x₂)=0.5·exp(-2·0.0257)=0.5·exp(-0.0514)

    ≈0.5·0.9499≈0.4750.

    Normalize:

    Z(x₂,β)=0.0946+0.4750=0.5696.

    p(t₁|x₂)=0.0946/0.5696≈0.166.

    p(t₂|x₂)=0.4750/0.5696≈0.834.

Insight: The encoder assigns x₁ mostly to t₁ and x₂ mostly to t₂ because those clusters have similar label distributions. β controls how sharply the KL mismatch turns into near-hard assignments.
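A few lines of NumPy reproduce these assignments from the same numbers used above:

```python
import numpy as np

def kl(p, q):
    """KL divergence in nats between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

beta = 2.0
p_y_given_x = {"x1": np.array([0.9, 0.1]),   # (p(y=1|x), p(y=0|x))
               "x2": np.array([0.2, 0.8])}
p_y_given_t = {"t1": np.array([0.8, 0.2]),
               "t2": np.array([0.3, 0.7])}
p_t = {"t1": 0.5, "t2": 0.5}

posterior = {}
for x, pyx in p_y_given_x.items():
    # encoder update: p(t|x) ∝ p(t) exp(-beta * KL(p(y|x) || p(y|t)))
    w = {t: p_t[t] * np.exp(-beta * kl(pyx, pyt))
         for t, pyt in p_y_given_t.items()}
    Z = sum(w.values())
    posterior[x] = {t: v / Z for t, v in w.items()}

print({x: {t: round(v, 3) for t, v in d.items()} for x, d in posterior.items()})
# x1 -> t1 with ~0.82, x2 -> t2 with ~0.83, matching the hand computation
```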

VIB loss on a single datapoint with a Gaussian bottleneck

Binary classification with y∈{0,1}. Let the bottleneck T be 1D.

Assume encoder qϕ(t|x)=N(μ,σ²) outputs μ=1.0 and σ=0.5 for this x.

Use prior p(t)=N(0,1).

Decoder pθ(y=1|t)=sigmoid(wt+b) with w=2, b=0. Suppose the true label is y=1.

Let β=0.1. Compute the per-example VIB loss approximately using one Monte Carlo sample ε=0 (so t=μ).

  1. Sample t using reparameterization:

    t = μ + σ·ε. With ε=0:

    t = 1.0 + 0.5·0 = 1.0.

  2. Compute prediction probability:

    pθ(y=1|t)=sigmoid(2·1+0)=sigmoid(2)=1/(1+e^{-2})≈0.8808.

  3. Compute negative log-likelihood for y=1:

    NLL = -log pθ(y=1|t) = -log(0.8808) ≈ 0.1269.

  4. Compute KL(q||p) for 1D Gaussians:

    If q=N(μ,σ²), p=N(0,1), then

    D_KL(q||p) = 0.5·(μ² + σ² - 1 - log σ²).

    Plug in μ=1.0, σ²=0.25:

    D_KL = 0.5·(1.0 + 0.25 - 1 - log 0.25)

    = 0.5·(0.25 - (-1.3863))

    = 0.5·(1.6363)

    ≈ 0.8182.

  5. Combine into VIB loss:

    L = NLL + β·KL ≈ 0.1269 + 0.1·0.8182

    ≈ 0.1269 + 0.0818

    ≈ 0.2087.

Insight: Even when prediction is confident (small NLL), the model pays a compression cost if qϕ(t|x) drifts away from the prior. Increasing β would push μ toward 0 and σ toward 1 (more compressed), potentially hurting accuracy if overdone.
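The same arithmetic in NumPy, with the hypothetical parameters from this example:

```python
import numpy as np

mu, sigma, beta = 1.0, 0.5, 0.1      # encoder outputs and trade-off weight
w, b = 2.0, 0.0                      # decoder parameters
eps = 0.0                            # single Monte Carlo sample, as above

t = mu + sigma * eps                                     # reparameterization
p_y1 = 1.0 / (1.0 + np.exp(-(w * t + b)))                # sigmoid decoder
nll = -np.log(p_y1)                                      # true label y = 1
kl = 0.5 * (mu**2 + sigma**2 - 1.0 - np.log(sigma**2))   # KL(N(mu,s^2)||N(0,1))
loss = nll + beta * kl
print(round(float(nll), 4), round(float(kl), 4), round(float(loss), 4))
# ≈ 0.1269, 0.8181, 0.2087
```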

Effect of β: two limiting behaviors you can compute mentally

Consider the VIB objective L = E[-log pθ(y|t)] + β·KL(qϕ(t|x)||p(t)). Think about what happens as β→0 and β→∞, without doing full training.

  1. Case 1: β→0.

    The loss becomes dominated by the prediction term:

    L ≈ E[-log pθ(y|t)].

    So the encoder is free to choose qϕ(t|x) to make prediction easiest, even if it memorizes x in t.

    In the extreme, qϕ(t|x) could become nearly deterministic with tiny σ and highly varying μ(x).

  2. Case 2: β→∞.

    The KL term dominates:

    L ≈ β·KL(qϕ(t|x)||p(t)).

    The easiest way to minimize KL for all x is to set qϕ(t|x)=p(t) regardless of x.

    Then t becomes independent of x, implying I(X;T)≈0.

    But then y is hard to predict from t, so the NLL term increases.

Insight: β sets an information budget. β too small risks overfitting; β too large forces T to ignore X, collapsing predictive power. Practical work is largely about finding the right regime.

Key Takeaways

  • Information Bottleneck learns a representation T of X that preserves information relevant to Y while discarding the rest.

  • The canonical trade-off is $\min_{p(t|x)} I(X;T) - \beta\, I(T;Y)$ with β>0 controlling compression vs relevance.

  • The Markov chain T–X–Y encodes that T is computed from X (no label leakage).

  • In discrete IB, the optimal encoder has a Gibbs/exponential form using a KL divergence between p(y|x) and p(y|t).

  • The discrete IB equations are fixed-point updates for p(t|x), p(t), and p(y|t), interpretable as soft clustering by label-meaning.

  • Variational IB (VIB) replaces intractable mutual informations with tractable surrogates: prediction NLL + β·KL(q(t|x)||p(t)).

  • β governs representation capacity: too low can memorize; too high can cause posterior collapse and hurt accuracy.

Common Mistakes

  • Confusing the roles of the two mutual informations: I(X;T) is the compression penalty, I(T;Y) is the relevance reward.

  • Dropping the Markov assumption T–X–Y implicitly (e.g., designing T using Y at test time), which breaks the intended meaning of the objective.

  • Assuming VIB’s KL(q(t|x)||p(t)) equals I(X;T) exactly; it is typically an upper bound depending on the chosen prior.

  • Interpreting β as a learning rate-like nuisance parameter; it is a meaningful trade-off that changes what information the representation retains.

Practice

easy

Discrete IB intuition: Suppose two inputs x₁ and x₂ have identical label distributions p(y|x₁)=p(y|x₂). In the discrete IB fixed-point encoder update, what does this suggest about how x₁ and x₂ should be assigned to bottleneck states t? Explain using the KL term.

Hint: Look at D_KL(p(y|x)||p(y|t)). What happens if the two p(y|x) are the same?

Solution

If p(y|x₁)=p(y|x₂), then for any cluster t the KL divergences D_KL(p(y|x₁)||p(y|t)) and D_KL(p(y|x₂)||p(y|t)) are equal. Therefore the encoder update produces the same relative preferences over t for x₁ and x₂ (up to the same normalizer Z). This suggests x₁ and x₂ are information-theoretically indistinguishable with respect to Y and can be compressed together (assigned similarly in T) without loss of relevance.

medium

Compute a KL regularizer: Let q(t|x)=N(μ,σ²) with μ=0.2 and σ=2.0, and prior p(t)=N(0,1). Compute D_KL(q||p).

Hint: Use D_KL = 0.5·(μ² + σ² − 1 − log σ²).

Solution

Here μ²=0.04 and σ²=4. So

D_KL = 0.5·(0.04 + 4 − 1 − log 4)

= 0.5·(3.04 − 1.3863)

= 0.5·(1.6537)

≈ 0.8269.

hard

Reasoning about β: In VIB, you observe posterior collapse (the encoder outputs q(t|x)≈p(t) for all x and accuracy drops). Name two concrete adjustments you could try, and explain the direction each changes the trade-off.

Hint: Think: which term is overpowering the other? How can you reduce that pressure or increase the usefulness of T?

Solution

Posterior collapse indicates the KL/compression pressure is too strong relative to the prediction benefit. Two adjustments: (1) Decrease β, directly reducing the weight on KL(q(t|x)||p(t)), allowing T to carry more information about X (and thus about Y). (2) Increase decoder strength or training signal so using T becomes beneficial (e.g., a more expressive pθ(y|t), better optimization, or annealing β from 0 upward), which increases the relative gain from encoding predictive information, counteracting the incentive to ignore X.
