Compression that preserves relevant information.
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
You have data X that contains “everything,” but your downstream task only cares about Y. The Information Bottleneck principle asks: can we compress X into a smaller representation T that forgets irrelevant details, while keeping what matters for predicting Y?
Information Bottleneck (IB) formalizes representation learning as a trade-off: minimize I(X;T) (compression) while maximizing I(T;Y) (relevance). With a Lagrange multiplier β>0, IB solves for a stochastic encoder p(t|x) that produces a bottleneck variable T, under the Markov chain T–X–Y. The discrete IB has fixed-point updates; the practical “Variational IB” (VIB) replaces intractable mutual informations with variational bounds, yielding a loss that looks like prediction loss + β·KL regularization—closely related to β-VAEs and regularized neural classifiers.
In many learning settings, your input variable $X$ contains a mix of task-relevant signal and task-irrelevant nuisance detail.
If you let a model store everything about $X$, it can overfit, memorize, or learn brittle features. If you compress too aggressively, you lose predictive power.
The Information Bottleneck (IB) framework turns this into a clean information-theoretic optimization:
Here $T$ is called the bottleneck variable because it limits how much information from $X$ can flow downstream.
Find a stochastic mapping $p(t|x)$ such that $I(X;T)$ is small (compression) while $I(T;Y)$ is large (relevance).
The standard IB Lagrangian is:

$$\mathcal{L}_{\text{IB}} = I(X;T) - \beta\, I(T;Y)$$

Interpretation: $I(X;T)$ penalizes storing information about $X$, $I(T;Y)$ rewards keeping information about $Y$, and $\beta > 0$ sets the exchange rate between the two.
IB usually assumes the representation $T$ is formed from $X$ alone:
This means: given $X$, $T$ is independent of $Y$ (because you compute $T$ from $X$).
Formally: $p(t|x,y) = p(t|x)$, i.e., the Markov chain $T$–$X$–$Y$ holds.
This assumption is not just a technicality: it encodes the idea that you don't get to peek at the label $Y$ when forming the representation (at test time, you only see $X$).
It helps to visualize two extremes:
| Setting | What happens | Risk |
|---|---|---|
| Very strong compression (force $I(X;T)$ small) | $T$ discards many details of $X$ | may lose predictive info about $Y$ |
| Very strong relevance (force $I(T;Y)$ large) | $T$ keeps whatever helps predict $Y$ | may retain lots of irrelevant detail from $X$ (less robust/generalizable) |
IB doesn't assume you know which parts of $X$ are relevant. It discovers them by optimizing these two pressures.
Before going further, make sure these answers feel clear: what $I(X;T)$ and $I(T;Y)$ each measure, why the objective trades them off, and why $T$ must be computed from $X$ alone.
If those are intuitive, you’re ready for the mechanics.
We do not choose $p(x,y)$; it comes from the world/data. What we can choose is an encoder (possibly stochastic): $p(t|x)$.
Together with $p(x,y)$, this induces the marginal $p(t) = \sum_x p(t|x)\,p(x)$ and the cluster-conditional $p(y|t)$.
Recall the mutual information identities:

$$I(X;T) = \sum_{x,t} p(x)\,p(t|x)\,\log\frac{p(t|x)}{p(t)}$$

And

$$I(T;Y) = \sum_{t,y} p(t)\,p(y|t)\,\log\frac{p(y|t)}{p(y)}$$
The IB Lagrangian:

$$\mathcal{L}[p(t|x)] = I(X;T) - \beta\, I(T;Y)$$

is therefore a functional of the entire conditional distribution $p(t|x)$.
Think of transmitting $T$ given $X$. If $T$ contains many distinguishable states for different $x$, then $T$ carries many bits about $X$. If many different $x$ map to similar distributions over $T$, then $T$ forgets details.
A useful equivalent form is:

$$I(X;T) = \mathbb{E}_{x \sim p(x)}\Big[ D_{\mathrm{KL}}\big(p(t|x)\,\|\,p(t)\big) \Big]$$
$T$ is "good" if it makes $Y$ predictable. Another identity:

$$I(T;Y) = H(Y) - H(Y|T)$$

Since $H(Y)$ is fixed by data, maximizing $I(T;Y)$ is equivalent to minimizing the conditional entropy $H(Y|T)$: make $Y$ as predictable from $T$ as possible.
If you fix a compression budget $I(X;T) \le R$ and maximize relevance, you get a curve of optimal trade-offs. Equivalently, the Lagrangian with varying $\beta$ traces that curve.
This parallels rate–distortion theory: $I(X;T)$ plays the role of the rate, and lost relevance (reduced $I(T;Y)$) plays the role of the distortion.
IB can be seen as a task-aware compression method.
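To make the two terms concrete, here is a small NumPy sketch (the toy distributions and the candidate encoder are hypothetical, not from the text) that evaluates $I(X;T)$, $I(T;Y)$, and the Lagrangian for one fixed encoder:

```python
import numpy as np

def mutual_info(p_joint):
    """Mutual information between the two axes of a joint table (nats)."""
    px = p_joint.sum(axis=1, keepdims=True)
    py = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (px @ py)[mask])))

# Hypothetical toy world: p(x), p(y|x), and one candidate encoder p(t|x).
p_x = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.9, 0.1],   # p(y|x1)
                        [0.2, 0.8]])  # p(y|x2)
p_t_given_x = np.array([[0.8, 0.2],   # p(t|x1)
                        [0.2, 0.8]])  # p(t|x2)

# Induced joints under the Markov chain T - X - Y.
p_xt = p_x[:, None] * p_t_given_x                    # p(x, t)
p_ty = p_t_given_x.T @ (p_x[:, None] * p_y_given_x)  # p(t, y) = sum_x p(t|x) p(x, y)

I_XT = mutual_info(p_xt)   # compression term
I_TY = mutual_info(p_ty)   # relevance term
beta = 2.0
print(I_XT, I_TY, I_XT - beta * I_TY)  # value of the IB Lagrangian
```

Note that by the data-processing inequality on the chain $Y$–$X$–$T$, the printed $I(T;Y)$ can never exceed $I(X;T)$.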
At this point you have the objective, the Markov assumption, and the trade-off curve.
Quick self-test: which term of the Lagrangian pushes toward forgetting, which pushes toward remembering, and what does $\beta$ control?
Now we'll derive the structure of the optimal $p(t|x)$ in the discrete case.
The IB optimization is over distributions, not over a few parameters. In the discrete setting (finite alphabets for $X$, $T$, $Y$), the classic IB solution gives self-consistent update rules for $p(t|x)$, $p(t)$, and $p(y|t)$.
This resembles a "soft clustering" procedure where each $x$ is assigned probabilistically to bottleneck states $t$.
We minimize

$$\mathcal{L} = I(X;T) - \beta\, I(T;Y)$$

subject to the normalization constraints:

$$\sum_t p(t|x) = 1 \quad \text{for each } x.$$

We also rely on the Markov chain $T$–$X$–$Y$.
A central result in IB is that the optimal encoder depends on how different $p(y|x)$ is from $p(y|t)$. The natural measure of "difference between predictive distributions" is KL divergence.
Define the KL:

$$D_{\mathrm{KL}}\big(p(y|x)\,\|\,p(y|t)\big) = \sum_y p(y|x)\,\log\frac{p(y|x)}{p(y|t)}$$

Intuition: if $x$ and bottleneck state $t$ imply similar label distributions, then assigning $x$ to $t$ costs little relevance.
The discrete IB solution satisfies:

1) Encoder update

$$p(t|x) = \frac{p(t)}{Z(x,\beta)}\,\exp\!\Big(-\beta\, D_{\mathrm{KL}}\big(p(y|x)\,\|\,p(y|t)\big)\Big)$$

where $Z(x,\beta)$ is a normalizer:

$$Z(x,\beta) = \sum_t p(t)\,\exp\!\Big(-\beta\, D_{\mathrm{KL}}\big(p(y|x)\,\|\,p(y|t)\big)\Big)$$

2) Cluster prior update

$$p(t) = \sum_x p(t|x)\,p(x)$$

3) Decoder (label distribution per cluster)

$$p(y|t) = \sum_x p(y|x)\,p(x|t)$$

and using Bayes:

$$p(x|t) = \frac{p(t|x)\,p(x)}{p(t)}$$

Putting it together, (3) is often written as:

$$p(y|t) = \frac{1}{p(t)} \sum_x p(y|x)\,p(t|x)\,p(x)$$
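The three updates translate directly into code. Below is a minimal NumPy sketch of the fixed-point iteration; the function names, the random Dirichlet initialization, and the toy problem are my own choices, and strictly positive $p(y|x)$ is assumed so the logs stay finite:

```python
import numpy as np

def kl_rows(p, q):
    """Pairwise KL: entry [i, j] = D_KL(p[i] || q[j]).
    Assumes strictly positive distributions."""
    return np.sum(p[:, None, :] * (np.log(p[:, None, :]) - np.log(q[None, :, :])),
                  axis=-1)

def ib_fixed_point(p_x, p_y_given_x, n_t, beta, n_iter=200, seed=0):
    """Iterate the three self-consistent IB updates (a sketch)."""
    rng = np.random.default_rng(seed)
    p_t_given_x = rng.dirichlet(np.ones(n_t), size=len(p_x))   # random encoder init
    for _ in range(n_iter):
        p_t = p_x @ p_t_given_x                                # update (2)
        p_xt = p_x[:, None] * p_t_given_x
        p_y_given_t = (p_xt.T @ p_y_given_x) / p_t[:, None]    # update (3) via Bayes
        d = kl_rows(p_y_given_x, p_y_given_t)                  # D_KL(p(y|x) || p(y|t))
        w = p_t[None, :] * np.exp(-beta * d)                   # update (1), unnormalized
        p_t_given_x = w / w.sum(axis=1, keepdims=True)         # divide by Z(x, beta)
    return p_t_given_x, p_t, p_y_given_t

# Toy problem: two inputs with distinct label distributions.
p_x = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
enc, prior, dec = ib_fixed_point(p_x, p_y_given_x, n_t=2, beta=5.0)
print(enc)   # assignments typically become near-hard at this beta
```

Each loop iteration applies the updates in the order (2), (3), (1), which keeps every quantity consistent with the current encoder before the encoder itself is refreshed.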
We’ll outline the logic (not every calculus-of-variations detail, but the key steps).
Start with:

$$\mathcal{L} = I(X;T) - \beta\, I(T;Y)$$

For the relevance term, use the Markov chain to write $p(y|t) = \sum_x p(y|x)\,p(x|t)$.

When you take the functional derivative of $\mathcal{L}$ with respect to $p(t|x)$ and enforce $\sum_t p(t|x) = 1$ using Lagrange multipliers $\lambda(x)$, the stationary condition yields something of the form:

$$\log p(t|x) = \log p(t) - \beta\, D_{\mathrm{KL}}\big(p(y|x)\,\|\,p(y|t)\big) - \tilde\lambda(x)$$

Exponentiate both sides:

$$p(t|x) \propto p(t)\,\exp\!\Big(-\beta\, D_{\mathrm{KL}}\big(p(y|x)\,\|\,p(y|t)\big)\Big)$$

and the proportionality constant is exactly $1/Z(x,\beta)$.
The encoder update says: assign $x$ to cluster $t$ with weight $p(t)$, discounted exponentially by how badly $p(y|t)$ matches $p(y|x)$.
As $\beta \to 0$, the KL term is downweighted and assignments become dominated by $p(t)$ (compression wins).
As $\beta \to \infty$, the encoder strongly prefers clusters whose $p(y|t)$ matches $p(y|x)$ (relevance wins), potentially making $T$ preserve nearly all predictive structure.
Answer these before moving on: why does the encoder update use a KL divergence between $p(y|x)$ and $p(y|t)$, and how do the assignments change as $\beta$ grows?
Next we’ll shift from theory (discrete exact IB) to practice (continuous/high-dimensional settings).
The discrete IB fixed-point equations are elegant, but they assume we can enumerate the states of $X$, $T$, and $Y$, evaluate $p(y|x)$ exactly, and iterate the updates over the full joint distribution.
In modern ML, $X$ is typically high-dimensional and continuous, the true distributions are unknown (we only have samples), and training must run on minibatches with gradient descent.
So we use Variational Information Bottleneck (VIB): an optimization that resembles IB but is tractable with neural networks and minibatches.
We parameterize: an encoder $q_\phi(t|x)$, a decoder $p_\theta(y|t)$, and a prior $p(t)$ (often fixed).
Goal: keep $T$ predictive of $Y$ while limiting information from $X$ into $T$.
We want to maximize $I(T;Y)$. Since $H(Y)$ is constant, we minimize $H(Y|T)$.
In practice, we minimize negative log-likelihood under a decoder:

$$\mathbb{E}\big[-\log p_\theta(y|t)\big]$$
This is a standard prediction loss.
The IB compression term is $I(X;T)$. In VIB, a common upper bound is:

$$I(X;T) \le \mathbb{E}_{x}\Big[ D_{\mathrm{KL}}\big(q_\phi(t|x)\,\|\,p(t)\big) \Big]$$

Why this makes sense (sketch): $I(X;T) = \mathbb{E}_x\big[D_{\mathrm{KL}}(q_\phi(t|x)\,\|\,q(t))\big]$, where $q(t)$ is the true (intractable) marginal of $T$. Replacing $q(t)$ with any fixed prior $p(t)$ can only increase the expression, because the gap is $D_{\mathrm{KL}}(q(t)\,\|\,p(t)) \ge 0$.
Putting these together gives a common VIB training objective:

$$\mathcal{L}_{\text{VIB}} = \mathbb{E}\big[-\log p_\theta(y|t)\big] + \beta\, \mathbb{E}_x\Big[D_{\mathrm{KL}}\big(q_\phi(t|x)\,\|\,p(t)\big)\Big]$$

Compare to IB: the NLL term is a surrogate for maximizing $I(T;Y)$ (via minimizing $H(Y|T)$), and the KL term is an upper bound on $I(X;T)$.
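This objective is easy to write down numerically. The following NumPy sketch assumes a per-example Gaussian encoder, a standard-normal prior, and a toy logistic decoder; all names, shapes, and numbers here are illustrative, not a fixed API:

```python
import numpy as np

def vib_loss(mu, log_var, w, b, y, beta, rng):
    """One-sample Monte Carlo estimate of the VIB objective for
    q_phi(t|x) = N(mu, sigma^2) per example, prior p(t) = N(0, I),
    and a logistic decoder p_theta(y=1|t) = sigmoid(t @ w + b)."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    t = mu + sigma * eps                                         # reparameterized sample
    p1 = 1.0 / (1.0 + np.exp(-(t @ w + b)))                      # decoder probability
    nll = -(y * np.log(p1) + (1 - y) * np.log(1 - p1))           # prediction term
    kl = 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - log_var, axis=1)  # KL vs N(0, I)
    return float(np.mean(nll + beta * kl))

# Toy batch of two examples (hypothetical encoder outputs).
rng = np.random.default_rng(0)
mu = np.array([[1.0], [-0.5]])
log_var = np.log(np.array([[0.25], [1.0]]))
y = np.array([1.0, 0.0])
loss = vib_loss(mu, log_var, w=np.array([2.0]), b=0.0, y=y, beta=0.1, rng=rng)
print(loss)
```

In a real model, `mu` and `log_var` would be neural-network outputs and the loss would be minimized over $\phi$ and $\theta$ by gradient descent.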
VIB looks structurally similar to other objectives:
| Method | Objective shape | What it's for |
|---|---|---|
| IB | $I(X;T) - \beta\, I(T;Y)$ | theory of optimal representations |
| VIB | $\mathrm{NLL}(y \mid t) + \beta\,\mathrm{KL}\big(q(t \mid x)\,\Vert\,p(t)\big)$ | supervised representation learning |
| β-VAE | reconstruction $+\ \beta\,\mathrm{KL}\big(q(z \mid x)\,\Vert\,p(z)\big)$ | unsupervised disentangling/regularization |
Key difference: VIB uses $t$ to predict the label $y$ in the decoder; β-VAE reconstructs $x$.
A common choice is:

$$q_\phi(t|x) = \mathcal{N}\big(t;\ \mu_\phi(x),\ \mathrm{diag}(\sigma_\phi^2(x))\big), \qquad p(t) = \mathcal{N}(0, I)$$

Sampling is done via the reparameterization trick:

$$t = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
This makes gradients flow through stochastic sampling.
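A quick numerical check of the trick (the specific $\mu$ and $\sigma$ are illustrative):

```python
import numpy as np

# Empirical check: t = mu + sigma * eps is distributed as N(mu, sigma^2).
rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.5
eps = rng.standard_normal(100_000)  # all randomness isolated in eps ~ N(0, 1)
t = mu + sigma * eps                # deterministic function of (mu, sigma) given eps

# Because t is deterministic given eps, dt/dmu = 1 and dt/dsigma = eps,
# so gradients reach the encoder parameters despite the sampling step.
print(t.mean(), t.std())
```

The sample mean and standard deviation land near $\mu = 1.0$ and $\sigma = 0.5$, confirming the sampled $t$ has the intended distribution while remaining differentiable in $(\mu, \sigma)$.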
The KL term pushes $q_\phi(t|x)$ toward a simple prior (like $\mathcal{N}(0,I)$). That tends to limit how much information $t$ carries about $x$, acting as a regularizer.
But if $\beta$ is too large, you get posterior collapse (the encoder ignores $x$ and $t$ carries little information), harming prediction.
Make sure you can articulate these two transitions: from exact mutual informations to variational surrogates (NLL and KL), and from distribution-level fixed-point updates to minibatch gradient training of $\phi$ and $\theta$.
If that’s clear, the worked examples will feel grounded rather than magical.
Let X∈{x₁,x₂}, Y∈{0,1}, T∈{t₁,t₂}. Suppose p(x₁)=p(x₂)=0.5.
Assume the label conditionals are:
p(y|x₁) = (0.9, 0.1) and p(y|x₂) = (0.2, 0.8).
Initialize p(t₁)=p(t₂)=0.5 and choose decoder distributions:
p(y|t₁) = (0.8, 0.2) and p(y|t₂) = (0.3, 0.7).
Let β=2. Compute p(t|x) using the IB encoder update.
Write the encoder update:
p(t|x) = [p(t)/Z(x,β)] · exp(-β · D_KL(p(y|x) || p(y|t))).
Compute D_KL for x₁ vs each t.
For t₁:
D₁ = 0.9·log(0.9/0.8) + 0.1·log(0.1/0.2)
= 0.9·log(1.125) + 0.1·log(0.5)
≈ 0.9·0.1178 + 0.1·(-0.6931)
≈ 0.1060 - 0.0693
≈ 0.0367.
For t₂:
D₂ = 0.9·log(0.9/0.3) + 0.1·log(0.1/0.7)
= 0.9·log(3) + 0.1·log(1/7)
≈ 0.9·1.0986 + 0.1·(-1.9459)
≈ 0.9887 - 0.1946
≈ 0.7941.
Compute unnormalized weights for x₁:
w(t₁|x₁)=p(t₁)·exp(-βD₁)=0.5·exp(-2·0.0367)
=0.5·exp(-0.0734)≈0.5·0.9292≈0.4646.
w(t₂|x₁)=0.5·exp(-2·0.7941)=0.5·exp(-1.5882)
≈0.5·0.2043≈0.1021.
Normalize:
Z(x₁,β)=0.4646+0.1021=0.5667.
p(t₁|x₁)=0.4646/0.5667≈0.820.
p(t₂|x₁)=0.1021/0.5667≈0.180.
Repeat for x₂.
Compute KLs:
For t₁:
D₃ = 0.2·log(0.2/0.8) + 0.8·log(0.8/0.2)
= 0.2·log(0.25) + 0.8·log(4)
≈ 0.2·(-1.3863) + 0.8·(1.3863)
≈ -0.2773 + 1.1090
≈ 0.8317.
For t₂:
D₄ = 0.2·log(0.2/0.3) + 0.8·log(0.8/0.7)
≈ 0.2·(-0.4055) + 0.8·(0.1335)
≈ -0.0811 + 0.1068
≈ 0.0257.
Weights for x₂:
w(t₁|x₂)=0.5·exp(-2·0.8317)=0.5·exp(-1.6634)
≈0.5·0.1895≈0.0947.
w(t₂|x₂)=0.5·exp(-2·0.0257)=0.5·exp(-0.0514)
≈0.5·0.9499≈0.4750.
Normalize:
Z(x₂,β)=0.0947+0.4750=0.5697.
p(t₁|x₂)=0.0947/0.5697≈0.166.
p(t₂|x₂)=0.4750/0.5697≈0.834.
Insight: The encoder assigns x₁ mostly to t₁ and x₂ mostly to t₂ because those clusters have similar label distributions. β controls how sharply the KL mismatch turns into near-hard assignments.
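The hand computation above can be verified in a few lines of Python, reusing the example's distributions:

```python
import numpy as np

# Distributions from the worked example.
p_y_given_x = {"x1": np.array([0.9, 0.1]), "x2": np.array([0.2, 0.8])}
p_y_given_t = {"t1": np.array([0.8, 0.2]), "t2": np.array([0.3, 0.7])}
p_t = {"t1": 0.5, "t2": 0.5}
beta = 2.0

def kl(p, q):
    """D_KL(p || q) in nats for strictly positive vectors."""
    return float(np.sum(p * np.log(p / q)))

def encoder_update(x):
    """One IB encoder update: p(t|x) ∝ p(t) · exp(-beta · KL)."""
    w = {t: p_t[t] * np.exp(-beta * kl(p_y_given_x[x], p_y_given_t[t]))
         for t in p_t}
    z = sum(w.values())
    return {t: w[t] / z for t in w}

print(encoder_update("x1"))  # ≈ {'t1': 0.820, 't2': 0.180}
print(encoder_update("x2"))  # ≈ {'t1': 0.166, 't2': 0.834}
```

The printed assignments match the manual arithmetic to three decimal places.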
Binary classification with y∈{0,1}. Let the bottleneck T be 1D.
Assume encoder qϕ(t|x)=N(μ,σ²) outputs μ=1.0 and σ=0.5 for this x.
Use prior p(t)=N(0,1).
Decoder pθ(y=1|t)=sigmoid(wt+b) with w=2, b=0. Suppose the true label is y=1.
Let β=0.1. Compute the per-example VIB loss approximately using one Monte Carlo sample ε=0 (so t=μ).
Sample t using reparameterization:
t = μ + σ·ε. With ε=0:
t = 1.0 + 0.5·0 = 1.0.
Compute prediction probability:
pθ(y=1|t)=sigmoid(2·1+0)=sigmoid(2)=1/(1+e^{-2})≈0.8808.
Compute negative log-likelihood for y=1:
NLL = -log pθ(y=1|t) = -log(0.8808) ≈ 0.1269.
Compute KL(q||p) for 1D Gaussians:
If q=N(μ,σ²), p=N(0,1), then
D_KL(q||p) = 0.5·(μ² + σ² - 1 - log σ²).
Plug in μ=1.0, σ²=0.25:
D_KL = 0.5·(1.0 + 0.25 - 1 - log 0.25)
= 0.5·(0.25 - (-1.3863))
= 0.5·(1.6363)
≈ 0.8182.
Combine into VIB loss:
L = NLL + β·KL ≈ 0.1269 + 0.1·0.8182
≈ 0.1269 + 0.0818
≈ 0.2087.
Insight: Even when prediction is confident (small NLL), the model pays a compression cost if qϕ(t|x) drifts away from the prior. Increasing β would push μ toward 0 and σ toward 1 (more compressed), potentially hurting accuracy if overdone.
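This single-sample loss can be reproduced exactly with the standard library, using the numbers from the example:

```python
import math

# Numbers from the worked example above.
mu, sigma = 1.0, 0.5          # encoder output: q(t|x) = N(mu, sigma^2)
w, b, beta = 2.0, 0.0, 0.1    # decoder weights and trade-off coefficient

t = mu + sigma * 0.0          # one MC sample with eps = 0
p1 = 1.0 / (1.0 + math.exp(-(w * t + b)))                  # sigmoid(2)
nll = -math.log(p1)                                        # prediction term
kl = 0.5 * (mu**2 + sigma**2 - 1.0 - math.log(sigma**2))   # KL(q || N(0,1))
loss = nll + beta * kl
print(round(nll, 4), round(kl, 4), round(loss, 4))  # ≈ 0.1269 0.8181 0.2087
```

Running this confirms each intermediate value from the hand computation.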
Consider the VIB objective L = E[-log pθ(y|t)] + β·KL(qϕ(t|x)||p(t)). Think about what happens as β→0 and β→∞, without doing full training.
Case 1: β→0.
The loss becomes dominated by the prediction term:
L ≈ E[-log pθ(y|t)].
So the encoder is free to choose qϕ(t|x) to make prediction easiest, even if it memorizes x in t.
In the extreme, qϕ(t|x) could become nearly deterministic with tiny σ and highly varying μ(x).
Case 2: β→∞.
The KL term dominates:
L ≈ β·KL(qϕ(t|x)||p(t)).
The easiest way to minimize KL for all x is to set qϕ(t|x)=p(t) regardless of x.
Then t becomes independent of x, implying I(X;T)≈0.
But then y is hard to predict from t, so the NLL term increases.
Insight: β sets an information budget. β too small risks overfitting; β too large forces T to ignore X, collapsing predictive power. Practical work is largely about finding the right regime.
Information Bottleneck learns a representation T of X that preserves information relevant to Y while discarding the rest.
The canonical trade-off is $\min_{p(t|x)}\ I(X;T) - \beta\, I(T;Y)$ with β>0 controlling compression vs relevance.
The Markov chain T–X–Y encodes that T is computed from X (no label leakage).
In discrete IB, the optimal encoder has a Gibbs/exponential form using a KL divergence between p(y|x) and p(y|t).
The discrete IB equations are fixed-point updates for p(t|x), p(t), and p(y|t), interpretable as soft clustering by label-meaning.
Variational IB (VIB) replaces intractable mutual informations with tractable surrogates: prediction NLL + β·KL(q(t|x)||p(t)).
β governs representation capacity: too low can memorize; too high can cause posterior collapse and hurt accuracy.
Confusing the roles of the two mutual informations: I(X;T) is the compression penalty, I(T;Y) is the relevance reward.
Dropping the Markov assumption T–X–Y implicitly (e.g., designing T using Y at test time), which breaks the intended meaning of the objective.
Assuming VIB’s KL(q(t|x)||p(t)) equals I(X;T) exactly; it is typically an upper bound depending on the chosen prior.
Interpreting β as a learning rate-like nuisance parameter; it is a meaningful trade-off that changes what information the representation retains.
Discrete IB intuition: Suppose two inputs x₁ and x₂ have identical label distributions p(y|x₁)=p(y|x₂). In the discrete IB fixed-point encoder update, what does this suggest about how x₁ and x₂ should be assigned to bottleneck states t? Explain using the KL term.
Hint: Look at D_KL(p(y|x)||p(y|t)). What happens if the two p(y|x) are the same?
If p(y|x₁)=p(y|x₂), then for any cluster t the KL divergences D_KL(p(y|x₁)||p(y|t)) and D_KL(p(y|x₂)||p(y|t)) are equal. Therefore the encoder update produces the same relative preferences over t for x₁ and x₂ (up to the same normalizer Z). This suggests x₁ and x₂ are information-theoretically indistinguishable with respect to Y and can be compressed together (assigned similarly in T) without loss of relevance.
Compute a KL regularizer: Let q(t|x)=N(μ,σ²) with μ=0.2 and σ=2.0, and prior p(t)=N(0,1). Compute D_KL(q||p).
Hint: Use D_KL = 0.5·(μ² + σ² − 1 − log σ²).
Here μ²=0.04 and σ²=4. So
D_KL = 0.5·(0.04 + 4 − 1 − log 4)
= 0.5·(3.04 − 1.3863)
= 0.5·(1.6537)
≈ 0.8269.
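A one-line check of this arithmetic using the same closed form:

```python
import math

mu, var = 0.2, 4.0   # sigma = 2.0, so sigma^2 = 4
kl = 0.5 * (mu**2 + var - 1.0 - math.log(var))
print(round(kl, 4))  # 0.8269
```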
Reasoning about β: In VIB, you observe posterior collapse (the encoder outputs q(t|x)≈p(t) for all x and accuracy drops). Name two concrete adjustments you could try, and explain the direction each changes the trade-off.
Hint: Think: which term is overpowering the other? How can you reduce that pressure or increase the usefulness of T?
Posterior collapse indicates the KL/compression pressure is too strong relative to the prediction benefit. Two adjustments: (1) Decrease β, directly reducing the weight on KL(q(t|x)||p(t)), allowing T to carry more information about X (and thus about Y). (2) Increase decoder strength or training signal so using T becomes beneficial (e.g., a more expressive pθ(y|t), better optimization, or annealing β from 0 upward), which increases the relative gain from encoding predictive information, counteracting the incentive to ignore X.
Prereqs you're using here: entropy and conditional entropy, mutual information, KL divergence, Bayes' rule, and (for the derivation) Lagrange multipliers.
Natural next nodes to unlock/relate: rate–distortion theory, β-VAEs and variational autoencoders, and the connections between compression, regularization, and generalization in neural networks.