Shared information between two variables. I(X;Y) = H(X) - H(X|Y).
Quick unlock: significant prerequisite investment, but a simple final step. Verify prerequisites first.
Mutual information answers a deceptively simple question: “How many bits do X and Y actually share?” It’s the quantity that turns vague notions like “correlated”, “predictive”, and “dependent” into a precise, operational measure in bits (or nats).
Mutual information measures how much knowing one random variable reduces uncertainty about the other:
I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X)+H(Y)−H(X,Y).
It is always ≥ 0, equals 0 iff X and Y are independent, and can be viewed as a KL divergence: I(X;Y) = D_KL(p(x,y) ‖ p(x)p(y)).
Entropy H(X) tells you how uncertain X is by itself. But many problems are about relationships:
Correlation is not enough: it is linear, scale-dependent, and can miss nonlinear dependence. We want a measure that:
- detects any dependence, not just linear;
- is zero exactly when the variables are independent;
- is expressed in interpretable units (bits or nats).
Mutual information does exactly this.
Let X and Y be discrete random variables with joint distribution p(x,y). The mutual information between X and Y is
I(X;Y) = H(X) − H(X|Y).
Interpretation:
- H(X) is your uncertainty about X before observing Y;
- H(X|Y) is the uncertainty that remains after observing Y;
- the difference is the information Y provides about X.
Because the definition is symmetric (as we’ll prove), I(X;Y) is also how much X tells you about Y.
If you use log₂, I(X;Y) is in bits. If you use natural log, it’s in nats. The math is identical; only the scale changes.
Think of X as a hidden message and Y as a noisy observation. If Y is pure noise, knowing it changes nothing: H(X|Y) = H(X) and I(X;Y) = 0. If Y reveals X perfectly, H(X|Y) = 0 and I(X;Y) = H(X).
So mutual information is a knob between 0 and “all of H(X)” depending on how informative Y is.
We can rewrite I(X;Y) in several equivalent ways:
1) Symmetric conditional-entropy form
I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X).
2) Joint-entropy form
I(X;Y) = H(X) + H(Y) − H(X,Y).
3) Expectation of a log-likelihood ratio
I(X;Y) = ∑ₓ ∑ᵧ p(x,y) log( p(x,y) / (p(x)p(y)) ).
4) KL divergence form
I(X;Y) = D_KL( p(x,y) ‖ p(x)p(y) ).
Each form emphasizes something different:
| Expression | What it highlights | Useful when |
|---|---|---|
| I(X;Y) = H(X) − H(X\|Y) | information gain about X from Y | prediction, feature usefulness |
| I(X;Y) = H(X) + H(Y) − H(X,Y) | overlap/shared randomness | Venn-style intuition, algebra |
| I(X;Y) = ∑ p(x,y) log( p(x,y)/(p(x)p(y)) ) | average log ratio | manual computation from tables |
| I(X;Y) = D_KL(p(x,y) ‖ p(x)p(y)) | deviation from independence | proving ≥ 0, theory |
Mutual information will become a central “currency” later in channel capacity and information bottleneck.
Mutual information shows up inside larger derivations (channel capacity, rate–distortion, variational bounds). If you can fluidly switch between its equivalent forms, many proofs become “one line” instead of opaque.
We’ll derive the key identities carefully.
Start from the definition:
I(X;Y) = H(X) − H(X|Y).
Use the chain rule for entropy:
H(X,Y) = H(Y) + H(X|Y).
Solve this for H(X|Y):
H(X|Y) = H(X,Y) − H(Y).
Substitute back:
I(X;Y)
= H(X) − (H(X,Y) − H(Y))
= H(X) − H(X,Y) + H(Y)
= H(X) + H(Y) − H(X,Y).
That’s the “overlap” form.
The overlap form is symmetric in X and Y:
I(X;Y) = H(X) + H(Y) − H(X,Y) = I(Y;X).
So mutual information doesn’t care which variable you call “input” vs “output”.
If X and Y share some randomness, then the pair is “less uncertain” than the sum, and the difference is the shared part.
This is where the common Venn-diagram intuition comes from. (Be cautious: entropy isn’t literally a set measure, but the algebra matches.)
From I(X;Y) = H(X) − H(X|Y):
I(X;Y) ≤ H(X).
Similarly, I(X;Y) ≤ H(Y).
So:
0 ≤ I(X;Y) ≤ min{H(X), H(Y)}.
The upper bound is achieved when one variable is a deterministic function of the other. If Y = f(X), then H(Y|X) = 0 and I(X;Y) = H(Y) = min{H(X), H(Y)}; if f is additionally injective (no collisions), knowing either variable fully determines the other and I(X;Y) = H(X) = H(Y).
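A quick numeric sanity check of these bounds (both joint tables below are invented for illustration):

```python
import math

def mutual_info(p):
    """I(X;Y) in bits from a joint probability dict {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), q in p.items():
        px[x] = px.get(x, 0.0) + q
        py[y] = py.get(y, 0.0) + q
    return sum(q * math.log2(q / (px[x] * py[y]))
               for (x, y), q in p.items() if q > 0)

# Deterministic copy of a fair bit: the upper bound min{H(X), H(Y)} = 1 is met.
copy = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_info(copy))  # 1.0

# Noisy relation (10% of mass off the diagonal): strictly inside the bounds.
noisy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
print(round(mutual_info(noisy), 4))  # 0.531, between 0 and 1
```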
Another way to read the definition is pointwise.
Define pointwise mutual information (PMI):
pmi(x,y) = log( p(x,y) / (p(x)p(y)) ).
Then
I(X;Y) = E[ pmi(X,Y) ] = ∑ₓ ∑ᵧ p(x,y) pmi(x,y).
So mutual information is the average log factor by which the joint probability differs from what independence would predict.
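A small numeric illustration (the joint table here is invented): individual PMI values can be negative, yet their p(x,y)-weighted average, the mutual information, is always nonnegative.

```python
import math

# Invented joint distribution; its marginals are px = (0.6, 0.4), py = (0.6, 0.4).
p = {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.3, (1, 1): 0.1}
px = {0: 0.6, 1: 0.4}
py = {0: 0.6, 1: 0.4}

def pmi(x, y):
    """Pointwise mutual information of one outcome pair, in bits."""
    return math.log2(p[(x, y)] / (px[x] * py[y]))

for xy in sorted(p):
    print(xy, round(pmi(*xy), 3))
# (0, 0): ~-0.263  co-occurs *less* often than independence predicts
# (0, 1): ~+0.322  co-occurs more often
# (1, 0): ~+0.322
# (1, 1): ~-0.678

I = sum(p[xy] * pmi(*xy) for xy in p)  # average PMI = mutual information
print(round(I, 4))  # ~0.0464 bits: nonnegative despite the negative PMI cells
```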
This is the “shared bits” story: mutual information counts how many bits about X are carried by Y, on average.
Two foundational facts make mutual information reliable:
1) Nonnegativity: I(X;Y) ≥ 0 always.
2) Independence criterion: I(X;Y) = 0 if and only if X and Y are independent.
These properties are not obvious from H(X) − H(X|Y) alone, because conditional entropy can behave in unintuitive ways (especially in continuous settings). The cleanest explanation comes from KL divergence.
Start from the overlap form:
I(X;Y) = H(X) + H(Y) − H(X,Y).
Write each entropy as an expectation:
H(X) = −∑ₓ p(x) log p(x)
H(Y) = −∑ᵧ p(y) log p(y)
H(X,Y) = −∑ₓ∑ᵧ p(x,y) log p(x,y).
Plug in:
I(X;Y)
= −∑ₓ p(x) log p(x) − ∑ᵧ p(y) log p(y) + ∑ₓ∑ᵧ p(x,y) log p(x,y).
Convert the marginal sums into joint sums (since ∑ᵧ p(x,y)=p(x), ∑ₓ p(x,y)=p(y)):
−∑ₓ p(x) log p(x)
= −∑ₓ∑ᵧ p(x,y) log p(x)
−∑ᵧ p(y) log p(y)
= −∑ₓ∑ᵧ p(x,y) log p(y)
So
I(X;Y)
= ∑ₓ∑ᵧ p(x,y) log p(x,y)
− ∑ₓ∑ᵧ p(x,y) log p(x)
− ∑ₓ∑ᵧ p(x,y) log p(y)
Combine logs:
I(X;Y)
= ∑ₓ∑ᵧ p(x,y) log( p(x,y) / (p(x)p(y)) ).
Now recognize this as a KL divergence:
D_KL( p(x,y) ‖ p(x)p(y) )
= ∑ₓ∑ᵧ p(x,y) log( p(x,y) / (p(x)p(y)) ).
Therefore:
I(X;Y) = D_KL( p(x,y) ‖ p(x)p(y) ).
KL divergence is always nonnegative:
D_KL(P ‖ Q) ≥ 0,
with equality iff P=Q almost everywhere.
So immediately:
I(X;Y) ≥ 0.
This is a major reason the KL form is beloved: it packages a deep inequality into a known theorem (Gibbs’ inequality).
Equality holds iff
p(x,y) = p(x)p(y) for all x,y with p(x,y)>0.
But that is exactly the definition of independence. Hence:
I(X;Y) = 0 ⇔ X ⟂ Y.
This gives mutual information a crisp meaning as a dependence measure.
In bits (log₂), it tells you how many extra bits per sample you pay, on average, by pretending the variables are independent.
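To see the independence criterion numerically (the marginals below are arbitrary choices): build a joint as the product of its marginals and confirm that I(X;Y) vanishes.

```python
import math

px = {0: 0.3, 1: 0.7}        # assumed marginal for X
py = {'a': 0.6, 'b': 0.4}    # assumed marginal for Y

# Independent by construction: p(x, y) = p(x) p(y) for every cell.
p = {(x, y): px[x] * py[y] for x in px for y in py}

# Every log ratio is log2(1) = 0, so the sum is exactly zero.
I = sum(q * math.log2(q / (px[x] * py[y])) for (x, y), q in p.items())
print(I)  # 0.0
```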
Mutual information detects any dependence, not just linear.
That makes it especially useful in machine learning for feature selection and representation learning, where dependencies are often nonlinear.
A closely related quantity you’ll likely meet soon is:
I(X;Y|Z) = H(X|Z) − H(X|Y,Z).
It measures how much X and Y share after accounting for Z. It appears in graphical models and in information bottleneck derivations. You don’t need it yet, but it’s helpful to see that mutual information fits into a larger family of “information differences.”
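A toy sketch of that idea (the three-variable distribution is invented): X and Y are perfectly dependent on their own, yet I(X;Y|Z) = 0 once Z is accounted for, because Z is their common cause.

```python
import math

def H(dist):
    """Shannon entropy in bits of a probability dictionary."""
    return -sum(q * math.log2(q) for q in dist.values() if q > 0)

# Z is a fair bit and X = Y = Z, so X and Y share 1 full bit on their own...
pz   = {0: 0.5, 1: 0.5}
pxz  = {(0, 0): 0.5, (1, 1): 0.5}
pyz  = {(0, 0): 0.5, (1, 1): 0.5}
pxyz = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

# ...but I(X;Y|Z) = H(X|Z) - H(X|Y,Z)
#                 = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)  (chain rule, applied twice)
I_cond = H(pxz) + H(pyz) - H(pz) - H(pxyz)
print(I_cond)  # 0.0 -- conditioning on Z removes all shared information
```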
In many systems you:
- send information through a noisy channel,
- compress a source into a smaller (possibly lossy) representation,
- or learn a representation of the data that keeps only task-relevant structure.
Mutual information becomes the natural objective/constraint because it counts how many bits about one variable survive into another.
A channel takes an input X and produces an output Y according to p(y|x). If you choose an input distribution p(x), the channel induces a joint p(x,y)=p(x)p(y|x).
The quantity I(X;Y) measures how much information the output reveals about the input under that choice of p(x).
Shannon capacity is:
C = max_{p(x)} I(X;Y).
Interpretation:
- the channel p(y|x) is fixed; you get to choose the input distribution p(x);
- C is the largest amount of information per channel use that the output can carry about the input, over all such choices.
Mutual information is the objective because it precisely measures “how many bits made it through.”
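A minimal sketch of the capacity maximization for a binary symmetric channel (the flip probability and grid resolution are arbitrary choices for illustration): grid-search over input distributions and observe that the optimum is the uniform input, recovering C = 1 − H₂(ε).

```python
import math

def H2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

eps = 0.1  # assumed flip probability

def I_for_input(a):
    """I(X;Y) for BSC(eps) with input P(X=1) = a, via I = H(Y) - H(Y|X)."""
    b = a * (1 - eps) + (1 - a) * eps   # P(Y=1)
    return H2(b) - H2(eps)              # H(Y|X) = H2(eps) for every input symbol

# Grid-search the capacity-achieving input distribution.
best_a = max((i / 1000 for i in range(1001)), key=I_for_input)
print(best_a, round(I_for_input(best_a), 4))  # 0.5 0.531  (= 1 - H2(0.1))
```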
Suppose X is a source and you compress it into a reconstruction X̂ that is allowed to be imperfect, with fidelity controlled by a distortion measure d(x, x̂).
Rate–distortion theory studies:
minimize I(X;X̂)
subject to E[d(X,X̂)] ≤ D.
Meaning: among all (possibly randomized) encodings of X into X̂ whose expected distortion stays within the budget D, find the one that retains the fewest bits about X. That minimum is the rate required at fidelity D.
Again, I(·;·) is the rate.
In supervised learning, you often want a representation T that compresses X but keeps what matters for Y.
The information bottleneck formalizes this tradeoff:
minimize I(X;T) − β I(T;Y).
Interpretation:
- I(X;T) is the complexity cost: how many bits of X the representation T retains;
- I(T;Y) is the relevance: how many of those bits are informative about the target Y;
- β controls the tradeoff between compressing X and predicting Y.
This objective is built entirely out of mutual information terms.
If you are deciding which feature Xᵢ to keep for predicting Y, a common heuristic is to prefer larger I(Xᵢ;Y). Unlike correlation, this works for arbitrary discrete relationships.
Caution: estimating mutual information from finite data can be tricky (bias, binning choices), but conceptually it’s a powerful criterion.
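A sketch of that caution in action (the sampling setup is invented): the naive "plug-in" estimator, which just substitutes empirical frequencies into the sum formula, is biased upward on finite samples, reporting nonzero mutual information even for truly independent variables.

```python
import math
import random

def mi_plugin(pairs):
    """Plug-in MI estimate in bits from a list of (x, y) samples."""
    n = len(pairs)
    pxy, px, py = {}, {}, {}
    for x, y in pairs:
        pxy[(x, y)] = pxy.get((x, y), 0.0) + 1 / n
        px[x] = px.get(x, 0.0) + 1 / n
        py[y] = py.get(y, 0.0) + 1 / n
    return sum(q * math.log2(q / (px[x] * py[y])) for (x, y), q in pxy.items())

random.seed(0)
# X and Y are truly independent fair bits, so the true I(X;Y) is 0 --
# yet the plug-in estimate comes out positive, shrinking roughly like O(1/n).
for n in (50, 500, 5000):
    pairs = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(n)]
    print(n, round(mi_plugin(pairs), 4))
```

The estimate is always ≥ 0 (it is a KL divergence of empirical distributions), which is exactly why finite-sample noise can only inflate it.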
| Domain | Variables | Role of mutual information |
|---|---|---|
| Communication | input X, output Y | maximize I(X;Y) to get capacity |
| Lossy compression | source X, reconstruction X̂ | minimize I(X;X̂) under distortion |
| Representation learning | data X, rep T, target Y | trade off I(X;T) vs I(T;Y) |
| Statistics | two measurements X,Y | test dependence via I(X;Y) |
Mutual information is the bridge between probability models and operational goals: how many bits you can transmit, store, or preserve.
Let X ∈ {0,1} be a fair bit: p(x=0)=p(x=1)=1/2. Let Y be X flipped with probability ε (0 ≤ ε ≤ 1/2): p(y=x)=1−ε and p(y≠x)=ε. This is the binary symmetric channel (BSC).
Compute I(X;Y) in bits (log₂).
Step 1: Compute H(X).
X is fair ⇒ H(X)=1 bit.
Step 2: Compute H(X|Y).
For a BSC with a fair input, the posterior p(x|y) has the same error rate ε: by Bayes’ rule and symmetry, p(x=y|y)=1−ε and p(x≠y|y)=ε.
So H(X|Y=y) = H₂(ε), where H₂(ε)= −ε log₂ ε − (1−ε) log₂(1−ε).
Step 3: Average over y.
Because H(X|Y=y) is the same for y=0 and y=1,
H(X|Y)=∑ᵧ p(y) H(X|Y=y)=H₂(ε).
Step 4: Use I(X;Y)=H(X)−H(X|Y).
I(X;Y)=1 − H₂(ε).
Insight: A BSC destroys information exactly according to the binary entropy of its flip probability. When ε=0, H₂(0)=0 ⇒ I=1 bit (perfect). When ε=1/2, H₂(1/2)=1 ⇒ I=0 bits (pure noise).
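The formula I(X;Y) = 1 − H₂(ε) is easy to tabulate (the ε values below are arbitrary sample points):

```python
import math

def H2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Mutual information through a BSC with a fair input: I = 1 - H2(eps).
for eps in (0.0, 0.05, 0.11, 0.25, 0.5):
    print(f"eps = {eps:<4}  I = {1 - H2(eps):.4f} bits")
# eps = 0.0 gives I = 1.0000 (noiseless); eps = 0.5 gives I = 0.0000 (pure noise)
```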
Show that I(X;Y)=0 if and only if X and Y are independent, using the identity I(X;Y)=D_KL(p(x,y) ‖ p(x)p(y)).
Step 1: Start from the KL form.
I(X;Y)=D_KL(p(x,y) ‖ p(x)p(y)).
Step 2: Use the fundamental property of KL divergence.
For any distributions P,Q on the same support:
D_KL(P ‖ Q) ≥ 0,
with equality iff P=Q (for all outcomes where P>0).
Step 3: Apply it to the joint vs product-of-marginals.
I(X;Y)=0 ⇔ D_KL(p(x,y) ‖ p(x)p(y))=0 ⇔ p(x,y)=p(x)p(y) for all relevant (x,y).
Step 4: Recognize independence.
The condition p(x,y)=p(x)p(y) is exactly the definition of X ⟂ Y.
Insight: Mutual information is the “distance from independence” measured in KL divergence. This makes nonnegativity and the independence criterion immediate.
Let X,Y ∈ {0,1} with joint distribution:

| p(x,y) | y=0 | y=1 |
|---|---|---|
| x=0 | 0.4 | 0.1 |
| x=1 | 0.1 | 0.4 |
Compute I(X;Y) in bits.
Step 1: Compute marginals.
p(x=0)=0.4+0.1=0.5, p(x=1)=0.1+0.4=0.5.
p(y=0)=0.4+0.1=0.5, p(y=1)=0.1+0.4=0.5.
Step 2: Use the log-ratio formula.
I(X;Y)=∑ₓ∑ᵧ p(x,y) log₂( p(x,y) / (p(x)p(y)) ).
Here p(x)p(y)=0.5·0.5=0.25 for all pairs.
Step 3: Compute each term.
For (0,0): 0.4 · log₂(0.4/0.25)=0.4 · log₂(1.6).
For (0,1): 0.1 · log₂(0.1/0.25)=0.1 · log₂(0.4).
For (1,0): 0.1 · log₂(0.1/0.25)=0.1 · log₂(0.4).
For (1,1): 0.4 · log₂(0.4/0.25)=0.4 · log₂(1.6).
Step 4: Combine.
I = 0.8 · log₂(1.6) + 0.2 · log₂(0.4).
Use log₂(1.6)≈0.6781 and log₂(0.4)≈−1.3219.
So I ≈ 0.8·0.6781 + 0.2·(−1.3219)
≈ 0.5425 − 0.2644
≈ 0.2781 bits.
Insight: Even with uniform marginals, dependence appears as probability mass concentrating on (0,0) and (1,1). Mutual information quantifies that dependence at ≈0.278 bits—not huge, but clearly nonzero.
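The arithmetic above can be checked directly:

```python
import math

# Joint from the worked example; p(x)p(y) = 0.25 for every cell.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
I = sum(q * math.log2(q / 0.25) for q in p.values())
print(round(I, 4))  # 0.2781 bits, matching the hand computation
```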
Mutual information measures shared information: I(X;Y)=H(X)−H(X|Y).
It is symmetric: I(X;Y)=I(Y;X)=H(X)+H(Y)−H(X,Y).
I(X;Y) can be written as an expectation: I(X;Y)=∑ p(x,y) log( p(x,y)/(p(x)p(y)) ).
KL view: I(X;Y)=D_KL(p(x,y) ‖ p(x)p(y)) makes key properties immediate.
Nonnegativity: I(X;Y) ≥ 0 always (Gibbs’ inequality).
Independence: I(X;Y)=0 ⇔ p(x,y)=p(x)p(y) ⇔ X ⟂ Y.
Bounds: 0 ≤ I(X;Y) ≤ min{H(X), H(Y)}; equality on the right when one variable determines the other.
Mutual information is the core objective/constraint in channel capacity, rate–distortion, and information bottleneck.
Thinking “I(X;Y)=0 means uncorrelated.” Zero correlation is weaker than independence; mutual information detects nonlinear dependence.
Mixing log bases mid-problem (log₂ vs ln), leading to inconsistent units (bits vs nats).
Forgetting to compute marginals correctly when using I(X;Y)=∑ p(x,y) log( p(x,y)/(p(x)p(y)) ).
Assuming I(X;Y)=H(X)+H(Y) (it’s H(X)+H(Y)−H(X,Y)); the joint entropy matters.
Let X be a fair bit and let Y=X (deterministic copy). Compute I(X;Y).
Hint: Use I(X;Y)=H(X)−H(X|Y). What is H(X|Y) when Y determines X?
Since Y=X, knowing Y reveals X exactly ⇒ H(X|Y)=0. Also H(X)=1 bit (fair). Therefore I(X;Y)=1−0=1 bit.
Let X,Y ∈ {0,1} with joint distribution p(0,0)=0.25, p(0,1)=0.25, p(1,0)=0.25, p(1,1)=0.25. Compute I(X;Y).
Hint: Check whether p(x,y)=p(x)p(y). If yes, mutual information is zero.
The distribution is uniform, so p(x)=0.5 and p(y)=0.5. Then p(x)p(y)=0.25 for each pair, matching p(x,y). Thus X and Y are independent and I(X;Y)=0 bits.
Suppose X has entropy H(X)=3 bits. A representation T satisfies H(X|T)=1.2 bits. (a) Compute I(X;T). (b) Give one sentence interpreting the result.
Hint: Mutual information is the reduction in uncertainty: I(X;T)=H(X)−H(X|T).
(a) I(X;T)=H(X)−H(X|T)=3−1.2=1.8 bits.
(b) Observing T reduces uncertainty about X by 1.8 bits on average; equivalently, T retains 1.8 bits of information about X.
Next nodes you can unlock with this concept: channel capacity, rate–distortion theory, and the information bottleneck.
Related background: entropy, conditional entropy and the chain rule, and KL divergence.