Relative entropy. Measuring difference between distributions.
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
Two probability distributions can look “close” on a plot but behave very differently when you use one to make predictions under the other. KL divergence is the tool that turns that mismatch into a single number—with a clear operational meaning: how many extra bits (or nats) you pay, on average, per sample when you code or predict using Q while the world is actually P.
KL divergence measures the average extra log-loss incurred by using $Q$ instead of the true $P$. It is nonnegative, asymmetric, and becomes infinite when $Q$ assigns zero probability where $P$ does not.
When you compare two vectors, Euclidean distance is natural. For probability distributions, Euclidean distance often hides what actually matters for learning and decision-making.
So we want a measure that is prediction-aware: it should tell us how costly it is to pretend the world follows $Q$ when it actually follows $P$.
KL divergence (Kullback–Leibler divergence), also called relative entropy, does exactly that.
Let $P$ and $Q$ be distributions over the same discrete set \(\mathcal{X}\). The KL divergence from $P$ to $Q$ is

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}.$$
A few immediate notes: terms with $p(x) = 0$ contribute 0 (by the convention $0 \log 0 = 0$), and the base of the logarithm sets the units (bits for base 2, nats for base $e$).
For densities $p$ and $q$ (with respect to the same base measure),

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx.$$

The same “support mismatch” rule applies: if $q(x) = 0$ on a region where $p(x) > 0$, the KL is infinite.
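A minimal sketch of the discrete definition, with both conventions (zero mass in $P$ contributes nothing; zero mass in $Q$ where $P$ has mass gives infinity); the function name `kl_divergence` is ours, not from any library:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as lists of probabilities.

    Terms with p(x) = 0 contribute 0 (convention 0 * log 0 = 0);
    if q(x) = 0 while p(x) > 0, the divergence is infinite.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue          # 0 * log(0 / q) = 0 by convention
        if qx == 0.0:
            return math.inf   # support mismatch: P has mass where Q has none
        total += px * math.log(px / qx)
    return total

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))  # inf
```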
KL is an average of a log ratio: the individual terms $\log \frac{p(x)}{q(x)}$ can be negative wherever $q(x) > p(x)$.
So why isn’t KL sometimes negative overall? Because of a deep inequality (Jensen / Gibbs) that forces the expectation to be ≥ 0.
Canvas A: Bernoulli slider
Prompted observations:
This is the first “feel” for what KL cares about: being wrong in the direction of underestimating true events is catastrophic in log-loss terms.
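A quick numeric version of the slider (a sketch; `bernoulli_kl` is our helper name): fix the true event probability and watch KL blow up as the model underestimates it:

```python
import math

def bernoulli_kl(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Fix the true event probability; let the model underestimate it more and more.
p = 0.5
for q in [0.5, 0.3, 0.1, 0.01, 0.001]:
    print(f"q = {q:6.3f}  ->  KL = {bernoulli_kl(p, q):8.4f} nats")
```

The penalty grows without bound as `q` approaches 0 while `p` stays fixed.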
Imagine two distributions over the same real line whose supports barely overlap: there is a region where $p(x) > 0$ but $q(x) = 0$, so $D_{\mathrm{KL}}(P \,\Vert\, Q) = \infty$.
If you overlay them, the key highlighted region is where $p(x)$ has mass but $q(x)$ is zero. KL treats that as an absolute failure, because it corresponds to assigning zero probability to events that actually occur.
Keep those distinctions in mind; they will matter when we interpret mode-seeking vs mode-covering behavior later.
In probabilistic modeling, a standard way to score a model on data drawn from $P$ is negative log-likelihood (log-loss). For an outcome $x$, the log-loss under model $Q$ is

$$-\log q(x).$$

The expected log-loss under the true distribution is

$$\mathbb{E}_{x \sim P}[-\log q(x)] = -\sum_{x} p(x) \log q(x).$$
This quantity shows up everywhere: it is exactly the cross-entropy $H(P, Q)$, and it is the population objective behind maximum-likelihood training.
So the natural question becomes:
How much worse is it to use $Q$ than to use the true $P$, on average?
Start from KL:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}.$$

Expand the log ratio:

$$\log \frac{p(x)}{q(x)} = \log p(x) - \log q(x).$$

Plug in:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_x p(x) \log p(x) - \sum_x p(x) \log q(x).$$

Rewrite each term: the first is $-\mathbb{E}_{x \sim P}[-\log p(x)]$ and the second (with its sign) is the expected log-loss $\mathbb{E}_{x \sim P}[-\log q(x)]$.

So

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \mathbb{E}_{x \sim P}[-\log q(x)] - \mathbb{E}_{x \sim P}[-\log p(x)].$$
Interpretation: KL is exactly the expected extra log-loss you pay for predicting with $Q$ instead of $P$. In base-2 logs, it is literally “extra bits per symbol.”
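To see “extra bits per symbol” concretely, here is a small base-2 computation (the two distributions are arbitrary illustrative choices):

```python
import math

def entropy_bits(p):
    """H(P) in bits."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy_bits(p, q):
    """H(P, Q) in bits."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.2, 0.1]      # true distribution
q = [1/3, 1/3, 1/3]      # uniform model
h = entropy_bits(p)
ce = cross_entropy_bits(p, q)
print(f"H(P) = {h:.4f} bits, H(P,Q) = {ce:.4f} bits, extra = {ce - h:.4f} bits/symbol")
```

The “extra” column is exactly $D_{\mathrm{KL}}(P \,\Vert\, Q)$ in bits.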
The statement $D_{\mathrm{KL}}(P \,\Vert\, Q) \ge 0$ formalizes: on average, you cannot do better (in expected log-loss) than predicting with the true distribution.
A clean proof uses Jensen’s inequality on the concave log function.
Let’s show the common Gibbs inequality form.
We want to show:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0.$$

Equivalently,

$$\sum_x p(x) \log \frac{q(x)}{p(x)} \le 0.$$

Now use Jensen (log is concave):

$$\sum_x p(x) \log \frac{q(x)}{p(x)} \le \log \sum_x p(x) \frac{q(x)}{p(x)} = \log \sum_x q(x) \le \log 1 = 0.$$

Therefore,

$$D_{\mathrm{KL}}(P \,\Vert\, Q) \ge 0,$$

with equality iff $p(x) = q(x)$ for all $x$ where $p(x) > 0$.
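The inequality is easy to stress-test numerically; this sketch (helper names are ours) checks Gibbs’ inequality on many random strictly positive distribution pairs:

```python
import math
import random

def kl(p, q):
    """D_KL(P || Q) in nats for strictly positive q."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def random_dist(n, rng):
    """A random distribution on n outcomes with strictly positive weights."""
    w = [rng.random() + 1e-6 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
for _ in range(1000):
    p, q = random_dist(5, rng), random_dist(5, rng)
    # allow a tiny negative tolerance for floating-point round-off
    assert kl(p, q) >= -1e-12, "Gibbs' inequality violated"
print("D_KL(P || Q) >= 0 held for 1000 random pairs")
```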
Canvas B: per-outcome contribution plot
For a small discrete space (say 5 outcomes), plot bars of the per-outcome contributions $p(x) \log \frac{p(x)}{q(x)}$.
Prompts:
This visually explains why KL focuses on being accurate where $P$ puts mass.
Suppose you have data $x_1, \dots, x_n \sim P$ i.i.d., and a parametric model $Q_\theta$.
The average negative log-likelihood is

$$\hat{L}_n(\theta) = -\frac{1}{n} \sum_{i=1}^n \log q_\theta(x_i).$$

As $n \to \infty$, this converges to

$$\mathbb{E}_{x \sim P}[-\log q_\theta(x)] = H(P, Q_\theta).$$

Since

$$H(P, Q_\theta) = H(P) + D_{\mathrm{KL}}(P \,\Vert\, Q_\theta),$$

minimizing expected NLL over $\theta$ is the same as minimizing $D_{\mathrm{KL}}(P \,\Vert\, Q_\theta)$ (because $H(P)$ does not depend on $\theta$).
So MLE is “KL minimization from data distribution to model distribution.”
That phrasing becomes extremely useful later (cross-entropy, VAEs, variational inference).
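A small simulation makes the equivalence tangible (illustrative values; `avg_nll` and `kl_bern` are our own helper names, and the parameter grid is coarse on purpose):

```python
import math
import random

# True distribution: Bernoulli(p_true). Model family: Bernoulli(theta).
p_true = 0.7
rng = random.Random(42)
data = [1 if rng.random() < p_true else 0 for _ in range(100_000)]
k, n = sum(data), len(data)

def avg_nll(theta):
    """Average negative log-likelihood of the sample under Bernoulli(theta)."""
    return -(k * math.log(theta) + (n - k) * math.log(1 - theta)) / n

def kl_bern(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

thetas = [i / 100 for i in range(1, 100)]
theta_mle = min(thetas, key=avg_nll)                      # minimizes empirical NLL
theta_kl = min(thetas, key=lambda t: kl_bern(p_true, t))  # minimizes D_KL(P || Q_theta)
print(theta_mle, theta_kl)  # both land near p_true = 0.7
```

With enough data, the empirical-NLL minimizer and the KL minimizer pick (essentially) the same grid point.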
Because KL is expectation-weighted under the first argument, swapping arguments changes what errors are emphasized.
This is not just a mathematical curiosity: it determines whether an approximation spreads out to “cover” all modes or collapses onto one.
Recall:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \mathbb{E}_{x \sim P}\!\left[\log \frac{p(x)}{q(x)}\right], \qquad D_{\mathrm{KL}}(Q \,\Vert\, P) = \mathbb{E}_{x \sim Q}\!\left[\log \frac{q(x)}{p(x)}\right].$$
So each direction imposes a different feasibility constraint:
| Divergence | Catastrophic if… | Intuition |
|---|---|---|
| $D_{\mathrm{KL}}(P \,\Vert\, Q)$ | $q(x) \approx 0$ where $p(x) > 0$ | Model refuses to assign probability to real events |
| $D_{\mathrm{KL}}(Q \,\Vert\, P)$ | $q(x) > 0$ where $p(x) \approx 0$ | Model insists on events the truth says are impossible |
Canvas C: mixture of Gaussians vs single Gaussian approximation
Prompts:
This is a central intuition used in variational inference: minimizing reverse KL (common in VI) tends to be mode-seeking.
Write the two directions side by side: forward KL is $D_{\mathrm{KL}}(P \,\Vert\, Q)$ and reverse KL is $D_{\mathrm{KL}}(Q \,\Vert\, P)$.
Forward KL penalizes $Q$ when it underestimates events that happen under $P$.
Reverse KL penalizes $Q$ when it places mass where $P$ does not.
Thinking of KL as “how surprised would I be if I used Q in a world of P?” naturally picks a direction: the world is $P$, your model is $Q$.
So asymmetry is actually part of the meaning.
In discrete problems, if you estimate $Q$ from limited data, you can accidentally set $q(x) = 0$ for an event that later appears. Forward KL then becomes infinite.
Common fix: additive smoothing (Laplace/Dirichlet priors) so that $q(x) > 0$ for all $x$.
This is not a hack; it encodes uncertainty and prevents “impossible” predictions from finite data artifacts.
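A minimal sketch of additive smoothing (`alpha = 1` gives classic Laplace smoothing; the function name is ours):

```python
def smoothed_probs(counts, alpha=1.0):
    """Additive (Laplace/Dirichlet) smoothing: no outcome gets probability zero."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

counts = [7, 3, 0]  # third outcome never observed in the sample
mle = [c / sum(counts) for c in counts]
print("MLE:     ", mle)                     # contains an exact zero
print("Smoothed:", smoothed_probs(counts))  # all strictly positive
```

The unsmoothed estimate makes forward KL from the truth infinite as soon as the unseen outcome is possible; the smoothed one never does.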
You already know entropy $H(P) = -\sum_x p(x) \log p(x)$. Cross-entropy between $P$ and $Q$ is

$$H(P, Q) = -\sum_x p(x) \log q(x).$$

And the identity

$$H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\Vert\, Q)$$

says: cross-entropy is entropy plus a mismatch penalty.
In classification, $P$ is often a one-hot (or soft) label distribution and $Q$ is the model’s predicted probabilities. Minimizing cross-entropy is minimizing KL (since $H(P)$ doesn’t depend on model parameters).
This is why KL is not just a theoretical object—it is embedded in the training objective of most neural classifiers.
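For instance, with a one-hot target the general cross-entropy sum collapses to the familiar negative log-probability of the true class (a sketch, not any specific framework's API):

```python
import math

def cross_entropy(p, q):
    """H(P, Q) in nats; terms with p(x) = 0 drop out."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

label = [0.0, 1.0, 0.0]   # one-hot: true class is index 1
pred = [0.1, 0.7, 0.2]    # model's predicted class probabilities

# With a one-hot target, H(P, Q) reduces to -log q(true class).
print(cross_entropy(label, pred))
print(-math.log(pred[1]))  # same value
```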
In VAEs, we introduce latent variables $z$ and want to maximize $\log p(x)$, but the posterior $p(z \mid x)$ is intractable.
We choose an approximate posterior $q_\phi(z \mid x)$ and maximize the ELBO:

$$\mathrm{ELBO}(\phi) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right] = \log p(x) - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\Vert\, p(z \mid x)\big).$$
Two important KL-related insights: the gap $\log p(x) - \mathrm{ELBO}$ is exactly $D_{\mathrm{KL}}(q_\phi(z \mid x) \,\Vert\, p(z \mid x)) \ge 0$, and rearranging the ELBO exposes an explicit $D_{\mathrm{KL}}(q_\phi(z \mid x) \,\Vert\, p(z))$ regularizer toward the prior.
So maximizing ELBO is minimizing a KL divergence to the true posterior.
This is a major “KL is everywhere” moment: it measures how far your approximation is from an intractable truth.
The information bottleneck trades off compressing the input representation (keeping $I(X;Z)$ small) against preserving relevant information (keeping $I(Z;Y)$ large).
Mutual information is defined via KL:

$$I(X;Y) = D_{\mathrm{KL}}\big(P_{XY} \,\Vert\, P_X \otimes P_Y\big).$$
So KL is the primitive object beneath mutual information. The bottleneck objective can be written using KL-based quantities, and many derivations rely on KL manipulations.
Across these applications, KL is playing the same role: it quantifies the cost of substituting one distribution for another.
Canvas D: loss surface in parameter space
Take a simple family (e.g., Bernoulli with parameter $\theta$) and a fixed target (Bernoulli with parameter $p^*$).
Plot $D_{\mathrm{KL}}(P^* \,\Vert\, Q_\theta)$ as a function of $\theta$.
Prompts:
This links KL to optimization behavior: sometimes the objective is smooth and well-behaved; sometimes it has steep cliffs because of near-zero probabilities.
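The loss surface from Canvas D can be tabulated directly; a sketch with an illustrative target $p^* = 0.3$ (helper name `kl_bern` is ours):

```python
import math

def kl_bern(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p_star = 0.3  # fixed target (illustrative choice)
for theta in [0.01, 0.1, 0.3, 0.5, 0.9, 0.99]:
    print(f"theta = {theta:5.2f}  ->  KL = {kl_bern(p_star, theta):8.4f} nats")
# Smooth and zero at theta = p_star, but steep cliffs as theta -> 0 or 1.
```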
That optimization intuition is crucial when you later study KL-based training objectives such as cross-entropy losses and the ELBO.
Let $P = \mathrm{Bernoulli}(p)$ and $Q = \mathrm{Bernoulli}(q)$. Outcomes are $\{1, 0\}$ with $P(1) = p$, $P(0) = 1 - p$, and similarly for $Q$.
Write the discrete KL definition:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_{x \in \{0, 1\}} P(x) \log \frac{P(x)}{Q(x)}.$$

Expand the sum over the two outcomes and substitute $P(1) = p$, $P(0) = 1 - p$, $Q(1) = q$, $Q(0) = 1 - q$:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}.$$
Concrete numbers (nats): take, for instance, $p = 0.5$ and $q = 0.01$.
Compute pieces:
$p \ln \frac{p}{q} = 0.5 \ln 50 \approx 0.5 \times 3.912 = 1.956$.
$(1 - p) \ln \frac{1 - p}{1 - q} = 0.5 \ln \frac{0.5}{0.99} \approx 0.5 \times (-0.683) = -0.342$.
So $D_{\mathrm{KL}}(P \,\Vert\, Q) \approx 1.956 - 0.342 = 1.614$ nats.
The term $p \ln \frac{p}{q} \to \infty$ as $q \to 0$. Hence KL blows up.
Insight: Even with only two outcomes, the asymmetry is visible: if the true event probability $p$ is bounded away from 0 but the model sets $q$ near 0, the log-loss penalty becomes arbitrarily large. This is the operational meaning of “support mismatch” in miniature.
Let $P$ be the true distribution over $\mathcal{X}$ and $Q$ be a model distribution. Define entropy $H(P) = -\sum_x p(x) \log p(x)$ and cross-entropy $H(P, Q) = -\sum_x p(x) \log q(x)$.
Start from KL:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}.$$

Split the log ratio:

$$\log \frac{p(x)}{q(x)} = \log p(x) - \log q(x).$$

Distribute and sum:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_x p(x) \log p(x) - \sum_x p(x) \log q(x).$$

Multiply by $-1$ inside the definitions: $\sum_x p(x) \log p(x) = -H(P)$ and $-\sum_x p(x) \log q(x) = H(P, Q)$.
Substitute:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = -H(P) + H(P, Q).$$

Rearrange to get the decomposition:

$$H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\Vert\, Q).$$
Insight: Cross-entropy is exactly “irreducible uncertainty” $H(P)$ plus “model mismatch” $D_{\mathrm{KL}}(P \,\Vert\, Q)$. In learning, you can’t reduce $H(P)$ by changing your model, so training focuses on driving the KL term down.
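The decomposition is easy to verify numerically (the two distributions are arbitrary illustrative choices):

```python
import math

p = [0.6, 0.3, 0.1]
q = [0.4, 0.4, 0.2]

entropy = -sum(px * math.log(px) for px in p)
cross_entropy = -sum(px * math.log(qx) for px, qx in zip(p, q))
kl = sum(px * math.log(px / qx) for px, qx in zip(p, q))

print(f"H(P) + D_KL = {entropy + kl:.6f}")
print(f"H(P, Q)     = {cross_entropy:.6f}")  # agrees up to floating-point error
```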
Let $\mathcal{X} = \{A, B, C\}$. Let the target be bimodal: for instance $P(A) = 0.5$, $P(B) = 0.05$, $P(C) = 0.45$. Consider two approximations: a covering $Q_{\mathrm{cov}} = (0.5, 0.25, 0.25)$ that spreads mass over both modes, and a mode-seeking $Q_{\mathrm{seek}} = (0.9, 0.05, 0.05)$ that concentrates on A.
Compute forward KL for $Q_{\mathrm{cov}}$:

$$D_{\mathrm{KL}}(P \,\Vert\, Q_{\mathrm{cov}}) = 0.5 \ln \frac{0.5}{0.5} + 0.05 \ln \frac{0.05}{0.25} + 0.45 \ln \frac{0.45}{0.25}.$$

Plug in each term (nats):
$0.5 \ln 1 = 0$ and $0.05 \ln 0.2 \approx -0.080$.
So A+B contribute $\approx -0.080$.
$0.45 \ln 1.8 \approx 0.265$.
So C contributes $\approx 0.265$.
Total forward KL $\approx 0.184$.
Compute forward KL for $Q_{\mathrm{seek}}$:
Terms: $0.5 \ln \frac{0.5}{0.9} \approx -0.294$, $0.05 \ln 1 = 0$, $0.45 \ln 9 \approx 0.989$.
Total forward KL $\approx 0.695$ (much larger).
Now compute reverse KL for $Q_{\mathrm{seek}}$:
Reverse KL terms (nats): $0.9 \ln \frac{0.9}{0.5} \approx 0.529$, $0.05 \ln 1 = 0$, $0.05 \ln \frac{0.05}{0.45} \approx -0.110$.
Total reverse KL $\approx 0.419$.
Insight: Forward KL heavily punishes missing a mode that has large mass (the big $p(x) \log \frac{p(x)}{q(x)}$ term at the missed mode). Reverse KL weights by $q(x)$, so if $Q$ barely allocates mass to B and C, it barely ‘feels’ being wrong there, capturing the mode-seeking tendency.
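The mode-seeking contrast can be reproduced in a few lines (the three-outcome distributions here are illustrative choices of ours):

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p      = [0.50, 0.05, 0.45]  # bimodal target: most mass on A and C
q_cov  = [0.50, 0.25, 0.25]  # spreads mass to cover both modes
q_seek = [0.90, 0.05, 0.05]  # concentrates on mode A, nearly ignores C

print("forward KL, covering :", round(kl(p, q_cov), 4))
print("forward KL, seeking  :", round(kl(p, q_seek), 4))  # large: misses mode C
print("reverse KL, seeking  :", round(kl(q_seek, p), 4))  # smaller: Q rarely 'visits' C
```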
KL divergence is defined as an expectation under the first distribution: $D_{\mathrm{KL}}(P \,\Vert\, Q) = \mathbb{E}_{x \sim P}\!\left[\log \frac{p(x)}{q(x)}\right]$.
Operational meaning: it is the average extra log-loss (extra bits/nats per sample) you incur when using $Q$ to predict data generated by $P$.
KL is always nonnegative (Gibbs/Jensen), and equals 0 iff $P = Q$ (almost everywhere / on the support of $P$).
KL is asymmetric; swapping arguments changes which regions matter because the expectation reweights by $p(x)$ vs by $q(x)$.
Support mismatch causes infinite KL: if $q(x) = 0$ where $p(x) > 0$, then $D_{\mathrm{KL}}(P \,\Vert\, Q) = \infty$.
Forward KL ($D_{\mathrm{KL}}(P \,\Vert\, Q)$) tends to be mode-covering; reverse KL ($D_{\mathrm{KL}}(Q \,\Vert\, P)$) tends to be mode-seeking in approximations.
Cross-entropy decomposes as $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\Vert\, Q)$, making KL the mismatch term behind common training losses.
KL is a foundational building block under ELBOs (VAEs/variational inference) and mutual information (information bottleneck).
Treating KL as a symmetric distance and casually writing $D_{\mathrm{KL}}(P \,\Vert\, Q) = D_{\mathrm{KL}}(Q \,\Vert\, P)$.
Ignoring support: computing KL numerically without guarding against zeros in $q$ (leading to infinities or NaNs).
Forgetting the expectation is under $P$ (or under $Q$ for reverse KL), leading to incorrect intuition about which errors matter.
Confusing cross-entropy $H(P, Q)$ with entropy $H(P)$, or assuming minimizing cross-entropy always implies $Q$ matches $P$ without considering model capacity and estimation error.
Let $P = \mathrm{Bernoulli}(0.8)$ and $Q = \mathrm{Bernoulli}(0.5)$. Compute $D_{\mathrm{KL}}(P \,\Vert\, Q)$ in nats.
Hint: Use $D_{\mathrm{KL}}(P \,\Vert\, Q) = p \ln \frac{p}{q} + (1 - p) \ln \frac{1 - p}{1 - q}$.
Here $p = 0.8$, $q = 0.5$.
Compute pieces:
$\ln \frac{0.8}{0.5} = \ln 1.6 \approx 0.470$, so first term $\approx 0.8 \times 0.470 = 0.376$.
$\ln \frac{0.2}{0.5} = \ln 0.4 \approx -0.916$, so second term $\approx 0.2 \times (-0.916) = -0.183$.
Total $\approx 0.376 - 0.183 = 0.193$ nats.
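A two-line check of the arithmetic, using the illustrative values $p = 0.8$, $q = 0.5$:

```python
import math

p, q = 0.8, 0.5
kl = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
print(f"{kl:.4f} nats")  # about 0.193
```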
Prove (using Jensen’s inequality) that $D_{\mathrm{KL}}(P \,\Vert\, Q) \ge 0$ for discrete distributions with matching support.
Hint: Rewrite KL as $-\sum_x p(x) \log \frac{q(x)}{p(x)}$ and apply Jensen to the concave $\log$.
Start from

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = -\sum_x p(x) \log \frac{q(x)}{p(x)}.$$

Since $\log$ is concave, Jensen gives

$$\sum_x p(x) \log \frac{q(x)}{p(x)} \le \log \left( \sum_x p(x) \frac{q(x)}{p(x)} \right) = \log \left( \sum_x q(x) \right) = \log 1 = 0.$$

Therefore $-\sum_x p(x) \log \frac{q(x)}{p(x)} \ge 0$, i.e. $D_{\mathrm{KL}}(P \,\Vert\, Q) \ge 0$.
Equality holds iff $\frac{q(x)}{p(x)}$ is constant over $x$ with $p(x) > 0$, which implies $P = Q$.
Consider a target distribution $P$ that is a mixture of two well-separated Gaussians in 1D. You approximate it with a single Gaussian family $Q = \mathcal{N}(\mu, \sigma^2)$. Qualitatively, which direction (forward KL vs reverse KL) is more likely to yield a large $\sigma$ covering both modes, and which is more likely to pick one mode? Explain using the ‘expectation under which distribution?’ idea.
Hint: Forward KL weights by $p(x)$ (penalizes missing where $P$ has mass). Reverse KL weights by $q(x)$ (penalizes placing mass where $p$ is small).
Forward KL is averaged over $P$. If $Q$ fails to assign decent density to either mode, then for many samples from the missed mode, $\log \frac{p(x)}{q(x)}$ becomes large, heavily penalizing the fit. This tends to push $Q$ toward a broader distribution (larger $\sigma$) that covers both modes.
Reverse KL is averaged over $Q$. If $Q$ concentrates around one mode, it rarely samples the other mode, so it does not ‘feel’ the penalty of missing it. Instead, it is strongly penalized for placing mass in the low-density valley between modes (where $p(x)$ is small), encouraging $Q$ to sit on one mode with smaller $\sigma$. This is the classic mode-seeking behavior.
Related next-step ideas you’ll likely want soon: symmetrized variants such as the Jensen–Shannon divergence, the broader family of f-divergences, and Wasserstein distances as alternatives when support mismatch makes KL infinite.