KL Divergence

Information Theory · Difficulty: ████ · Depth: 7 · Unlocks: 7

Relative entropy. Measuring difference between distributions.

Interactive Visualization


Core Concepts

  • KL divergence as a scalar measure of how one probability distribution diverges from another (relative entropy).
  • It is defined via an expectation taken with respect to the true distribution (the divergence is expectation-based).
  • Operational interpretation: the KL gives the average extra log-loss (information) per sample when using Q instead of the true P.

Key Symbols & Notation

D_KL(P || Q)

Essential Relationships

  • D_KL(P || Q) = E_{X ~ P}[ log( P(X) / Q(X) ) ]
  • D_KL(P || Q) >= 0, with equality iff P = Q almost everywhere
▶ Advanced Learning Details

Graph Position: 79
Depth Cost: 7
Fan-Out (ROI): 4
Bottleneck Score: 7
Chain Length:
Cognitive Load: 6
Atomic Elements: 36
Total Elements: L2
Percentile Level: L4
Atomic Level:
All Concepts (15)

  • Definition of KL divergence for discrete distributions: D_KL(P || Q) = sum over x of P(x) * log( P(x) / Q(x) ).
  • Continuous (density) version: KL divergence as integral of p(x) * log(p(x)/q(x)) dx.
  • Interpretation as expected log-likelihood ratio: KL is the expectation under P of the log of the probability ratio log(P(x)/Q(x)).
  • Interpretation as extra expected code length (extra bits or nats) when encoding samples from P using a code optimized for Q instead of P.
  • Non-negativity (Gibbs inequality): KL divergence is always >= 0, with equality only when P and Q are equal almost everywhere.
  • Asymmetry: D_KL(P || Q) is not equal to D_KL(Q || P); KL is not a metric.
  • Support requirement and infinity behavior: if there exists x with P(x)>0 but Q(x)=0, then D_KL(P||Q) = infinity.
  • Cross-entropy concept as a distinct quantity: H(P,Q) = - sum P(x) log Q(x) (or integral), used as a separate loss.
  • Relationship between KL and cross-entropy: D_KL(P||Q) = H(P,Q) - H(P).
  • Relation to maximum likelihood: minimizing KL(P_empirical || Q_theta) over theta (or equivalently minimizing expected negative log-likelihood under P_empirical) yields MLE; MLE can be viewed as KL minimization to the empirical distribution.
  • Units depend on log base: log base 2 gives bits, natural log gives nats; choice affects numerical value and interpretation.
  • Chain rule / conditional decomposition: KL for joint distributions can be decomposed into marginal KL plus expected conditional KL: D_KL(P(X,Y)||Q(X,Y)) = D_KL(P(X)||Q(X)) + E_{P(X)}[ D_KL(P(Y|X)||Q(Y|X)) ].
  • Additivity for independent product distributions: D_KL(P1 x P2 || Q1 x Q2) = D_KL(P1||Q1) + D_KL(P2||Q2).
  • Relation to mutual information: mutual information I(X;Y) = D_KL( P(X,Y) || P(X) P(Y) ).
  • Optimization properties used in ML: convexity in the model distribution Q (for fixed P) making KL a convenient loss for parameter estimation in many cases.

Teaching Strategy

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

Two probability distributions can look “close” on a plot but behave very differently when you use one to make predictions under the other. KL divergence is the tool that turns that mismatch into a single number—with a clear operational meaning: how many extra bits (or nats) you pay, on average, per sample when you code or predict using Q while the world is actually P.

TL;DR:

KL divergence $D_{\mathrm{KL}}(P\|Q)=\mathbb{E}_{x\sim P}\left[\log \frac{P(x)}{Q(x)}\right]$ measures the average extra log-loss incurred by using $Q$ instead of the true $P$. It is nonnegative, asymmetric, and becomes infinite when $Q$ assigns zero probability where $P$ does not.

What Is KL Divergence?

Why we need a special “distance” for distributions

When you compare two vectors, Euclidean distance is natural. For probability distributions, Euclidean distance often hides what actually matters for learning and decision-making:

  • If your model assigns tiny probability to events that happen under the true distribution, your log-loss explodes.
  • If your model assigns probability mass to impossible events, you waste capacity—but you might not be punished as severely.

So we want a measure that is prediction-aware: it should tell us how costly it is to pretend the world follows $Q$ when it actually follows $P$.

KL divergence (Kullback–Leibler divergence), also called relative entropy, does exactly that.

Definition (discrete)

Let $P$ and $Q$ be distributions over the same discrete set $\mathcal{X}$. The KL divergence from $Q$ to $P$ is

$$D_{\mathrm{KL}}(P\|Q)=\sum_{x\in\mathcal{X}} P(x)\,\log\frac{P(x)}{Q(x)}.$$

A few immediate notes:

  • The expectation is with respect to $P$ (the “true” distribution in the common interpretation).
  • The log base sets units: base 2 → bits, base $e$ → nats.
  • If there exists an $x$ with $P(x)>0$ but $Q(x)=0$, then $\log \frac{P(x)}{Q(x)}=\infty$ and KL becomes infinite.

Definition (continuous)

For densities $p(x)$ and $q(x)$ (with respect to the same base measure),

$$D_{\mathrm{KL}}(P\|Q)=\int p(x)\,\log\frac{p(x)}{q(x)}\,dx.$$

The same “support mismatch” rule applies: if $p(x)>0$ on a region where $q(x)=0$, the KL is infinite.

The most important intuition: a log ratio averaged under reality

KL is an average of a log ratio:

  • If $Q$ assigns lower probability than $P$ to likely events, then $P(x)/Q(x)$ is large → the log is positive → KL grows.
  • If $Q$ assigns higher probability than $P$ to likely events, then the log ratio can be negative for those $x$.

So why isn’t KL sometimes negative overall? Because of a deep inequality (Jensen / Gibbs) that forces the expectation to be ≥ 0.

Guided visualization on the interactive canvas (do this now)

Canvas A: Bernoulli slider

  1. Choose a Bernoulli true distribution: $P=\mathrm{Bern}(p)$.
  2. Choose a model distribution: $Q=\mathrm{Bern}(q)$.
  3. Slide $p$ and $q$ independently.

Prompted observations:

  • Hold $p=0.2$ fixed. Move $q$ toward 0. Watch $D_{\mathrm{KL}}(P\|Q)$ rise sharply.
  • Now move $q$ toward 1. It also rises, but the blow-up is much more dramatic when $Q$ places near-zero probability on outcomes that happen under $P$.
  • Set $q=0$ exactly while $p>0$. The KL becomes infinite.

This is the first “feel” for what KL cares about: being wrong in the direction of underestimating true events is catastrophic in log-loss terms.
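The same slider experiment can be run numerically. A minimal sketch (standard library only) of the Bernoulli KL in nats:

```python
import math

def kl_bernoulli(p: float, q: float) -> float:
    """D_KL(Bern(p) || Bern(q)) in nats; infinite on support mismatch."""
    total = 0.0
    for px, qx in ((p, q), (1 - p, 1 - q)):
        if px == 0:
            continue          # 0 * log(0/q) contributes 0 by convention
        if qx == 0:
            return math.inf   # P puts mass where Q puts none
        total += px * math.log(px / qx)
    return total

print(kl_bernoulli(0.2, 0.2))   # 0.0: identical distributions
print(kl_bernoulli(0.2, 0.02))  # grows as q underestimates the true event
print(kl_bernoulli(0.2, 0.0))   # inf: support mismatch
```

Sweeping `q` toward 0 with `p` fixed reproduces the blow-up described above.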

Static anchor diagram: support mismatch

Imagine two distributions over the same line:

  • $p(x)$ is concentrated on an interval, say $[0,1]$.
  • $q(x)$ is concentrated on $[1,2]$.

They barely overlap. On $[0,1]$, $p(x)>0$ but $q(x)=0$, so $D_{\mathrm{KL}}(P\|Q)=\infty$.

If you overlay them, the key highlighted region is where $p(x)$ has mass but $q(x)$ is zero. KL treats that as an absolute failure, because it corresponds to assigning impossible probability to events that occur.

What KL is not

  • It is not symmetric: generally $D_{\mathrm{KL}}(P\|Q)\neq D_{\mathrm{KL}}(Q\|P)$.
  • It is not a metric: it doesn’t satisfy the triangle inequality.
  • It is not simply “area between curves.” It is an expectation of log ratios, weighted by $P$.

Keep those distinctions in mind; they will matter when we interpret mode-seeking vs mode-covering behavior later.

Core Mechanic 1: KL as Expected Extra Log-Loss (Operational Meaning)

Why before how: prediction is about log-loss

In probabilistic modeling, a standard way to score a model $Q$ on data drawn from $P$ is negative log-likelihood (log-loss). For an outcome $x$, the log-loss under model $Q$ is

$$\ell_Q(x) = -\log Q(x).$$

The expected log-loss under the true distribution $P$ is

$$\mathbb{E}_{x\sim P}[\ell_Q(x)] = -\sum_x P(x)\log Q(x).$$

This quantity shows up everywhere:

  • maximum likelihood estimation (MLE)
  • cross-entropy loss in classification
  • coding and compression

So the natural question becomes:

How much worse is it to use $Q$ than to use the true $P$, on average?

Derivation: KL is the excess expected log-loss

Start from KL:

$$D_{\mathrm{KL}}(P\|Q)=\sum_x P(x)\log\frac{P(x)}{Q(x)}.$$

Expand the log ratio:

$$\log\frac{P(x)}{Q(x)} = \log P(x) - \log Q(x).$$

Plug in:

$$\begin{aligned} D_{\mathrm{KL}}(P\|Q) &= \sum_x P(x)\big(\log P(x)-\log Q(x)\big) \\ &= \sum_x P(x)\log P(x) - \sum_x P(x)\log Q(x). \end{aligned}$$

Rewrite each term:

  • Entropy: $H(P) = -\sum_x P(x)\log P(x)$
  • Cross-entropy: $H(P,Q) = -\sum_x P(x)\log Q(x)$

So

$$\begin{aligned} D_{\mathrm{KL}}(P\|Q) &= -H(P) + H(P,Q) \\ &= H(P,Q) - H(P). \end{aligned}$$

Interpretation:

  • $H(P)$ is the best achievable expected log-loss if you predict with the true $P$.
  • $H(P,Q)$ is the expected log-loss if you predict with $Q$ while data comes from $P$.
  • Their difference is the penalty: the average extra log-loss per sample.

In base 2 logs, it is literally “extra bits per symbol.”
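The decomposition can be checked numerically on any small distribution; a quick sketch with made-up three-outcome distributions:

```python
import math

P = [0.5, 0.3, 0.2]            # "true" distribution over 3 outcomes
Q = [0.4, 0.4, 0.2]            # model distribution

entropy       = -sum(p * math.log(p) for p in P)
cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))
kl            =  sum(p * math.log(p / q) for p, q in zip(P, Q))

# H(P,Q) = H(P) + D_KL(P||Q) holds to floating-point precision
print(cross_entropy - entropy, kl)
```

The difference `cross_entropy - entropy` and `kl` agree to machine precision, which is exactly the identity derived above.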

Nonnegativity: why you can’t beat the truth on average

The statement $D_{\mathrm{KL}}(P\|Q)\ge 0$ formalizes this: on average, you cannot do better (in expected log-loss) than predicting with the true distribution.

A clean proof uses Jensen’s inequality on the concave log function.

Let’s show the common Gibbs inequality form.

We want to show:

$$\sum_x P(x)\log\frac{P(x)}{Q(x)} \ge 0.$$

Equivalently,

$$\sum_x P(x)\log\frac{Q(x)}{P(x)} \le 0.$$

Now use Jensen ($\log$ is concave):

$$\sum_x P(x)\log\frac{Q(x)}{P(x)} \le \log\left(\sum_x P(x)\frac{Q(x)}{P(x)}\right) = \log\left(\sum_x Q(x)\right)=\log 1=0.$$

Therefore,

$$D_{\mathrm{KL}}(P\|Q)\ge 0,$$

with equality iff $P(x)=Q(x)$ for all $x$ where $P(x)>0$.
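Gibbs' inequality is easy to sanity-check numerically; a short sketch drawing random distribution pairs (the 5-outcome size and the seed are arbitrary choices):

```python
import math
import random

random.seed(0)

def kl(P, Q):
    # discrete KL in nats; assumes matching support (all entries > 0)
    return sum(p * math.log(p / q) for p, q in zip(P, Q))

def random_dist(n):
    w = [random.random() + 1e-9 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

# Gibbs' inequality: KL is never negative, and KL(P, P) is exactly 0
for _ in range(1000):
    P, Q = random_dist(5), random_dist(5)
    assert kl(P, Q) >= 0.0
    assert abs(kl(P, P)) < 1e-12
print("all checks passed")
```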

Guided visualization: “extra log-loss” as a bar chart

Canvas B: per-outcome contribution plot

For a small discrete space (say 5 outcomes), plot bars of

$$c(x)=P(x)\log\frac{P(x)}{Q(x)}.$$

Prompts:

  1. Make one outcome very likely under $P$ (e.g., $P(x_1)=0.6$), but set $Q(x_1)=0.2$. Watch that bar dominate KL.
  2. Make $Q$ larger than $P$ on a rare outcome. You may see a negative contribution there, but it usually doesn’t compensate much, because it’s weighted by the small $P(x)$.

This visually explains why KL focuses on being accurate where $P$ puts mass.
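The per-outcome contributions can be computed directly; a small sketch with hypothetical numbers matching the prompts above:

```python
import math

# Hypothetical 5-outcome example: P concentrates on x1, Q underestimates it
P = [0.6, 0.1, 0.1, 0.1, 0.1]
Q = [0.2, 0.2, 0.2, 0.2, 0.2]

contributions = [p * math.log(p / q) for p, q in zip(P, Q)]
for i, c in enumerate(contributions, 1):
    print(f"x{i}: {c:+.4f}")
print("KL =", sum(contributions))
```

The $x_1$ bar is large and positive; the remaining bars are slightly negative but cannot outweigh it, since they are scaled by small $P(x)$.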

Connection to MLE (prerequisite bridge)

Suppose you have data $x_1,\dots,x_n \sim P$ i.i.d., and a parametric model $Q_\theta$.

The average negative log-likelihood is

$$\frac{1}{n}\sum_{i=1}^n -\log Q_\theta(x_i).$$

As $n\to\infty$, this converges to

$$\mathbb{E}_{x\sim P}\big[-\log Q_\theta(x)\big] = H(P,Q_\theta).$$

Since

$$H(P,Q_\theta) = H(P) + D_{\mathrm{KL}}(P\|Q_\theta),$$

minimizing expected NLL over $\theta$ is the same as minimizing $D_{\mathrm{KL}}(P\|Q_\theta)$ (because $H(P)$ does not depend on $\theta$).

So MLE is “KL minimization from data distribution to model distribution.”

That phrasing becomes extremely useful later (cross-entropy, VAEs, variational inference).
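The equivalence can be seen on a toy Bernoulli model; a sketch (the sample size, seed, and grid are arbitrary choices) where grid-search MLE and grid-search KL minimization to the empirical distribution land on essentially the same parameter, since the two objectives differ only by the constant $H(P_{\text{empirical}})$:

```python
import math
import random

random.seed(1)
p_true = 0.7
data = [1 if random.random() < p_true else 0 for _ in range(10_000)]
phat = sum(data) / len(data)   # empirical Bernoulli parameter

def avg_nll(theta):
    # average negative log-likelihood of the sample under Bern(theta)
    return -sum(math.log(theta if x == 1 else 1 - theta) for x in data) / len(data)

def kl_to_empirical(theta):
    # D_KL(Bern(phat) || Bern(theta))
    return (phat * math.log(phat / theta)
            + (1 - phat) * math.log((1 - phat) / (1 - theta)))

grid = [i / 100 for i in range(1, 100)]
theta_nll = min(grid, key=avg_nll)
theta_kl  = min(grid, key=kl_to_empirical)
print(theta_nll, theta_kl, phat)   # both near the true 0.7
```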

Core Mechanic 2: Asymmetry, Support, and Mode-Seeking vs Mode-Covering

Why asymmetry matters

Because KL is expectation-weighted under the first argument, swapping arguments changes what errors are emphasized.

  • Forward KL: $D_{\mathrm{KL}}(P\|Q)$ weights errors by $P$ (truth-weighted).
  • Reverse KL: $D_{\mathrm{KL}}(Q\|P)$ weights errors by $Q$ (model-weighted).

This is not just a mathematical curiosity: it determines whether an approximation spreads out to “cover” all modes or collapses onto one.

Support and the “infinite wall”

Recall:

  • If $P(x)>0$ and $Q(x)=0$ anywhere, then $D_{\mathrm{KL}}(P\|Q)=\infty$.
  • If $Q(x)>0$ and $P(x)=0$ somewhere, then $D_{\mathrm{KL}}(Q\|P)=\infty$.

So each direction imposes a different feasibility constraint:

  • $D_{\mathrm{KL}}(P\|Q)$ is catastrophic if $Q=0$ where $P>0$: the model refuses to assign probability to real events.
  • $D_{\mathrm{KL}}(Q\|P)$ is catastrophic if $P=0$ where $Q>0$: the model insists on events the truth says are impossible.

Guided visualization: toggle forward vs reverse KL on a bimodal target

Canvas C: mixture of Gaussians vs single Gaussian approximation

  1. Let $P$ be a bimodal distribution (two separated Gaussians).
  2. Let $Q$ be a single Gaussian with adjustable mean/variance.
  3. Add a toggle: compute either $D_{\mathrm{KL}}(P\|Q)$ (forward) or $D_{\mathrm{KL}}(Q\|P)$ (reverse).

Prompts:

  • In forward KL mode, increase $Q$’s variance. You’ll often see the best fit broaden to cover both modes. Missing one mode is costly because $P$ puts mass there.
  • In reverse KL mode, the best fit often collapses onto one mode (mode-seeking). Why? Because the expectation is under $Q$: if $Q$ puts almost no mass near the other mode, it doesn’t “feel” that region.

This is a central intuition used in variational inference: minimizing reverse KL (common in VI) tends to be mode-seeking.
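The covering vs seeking contrast can be reproduced with a brute-force fit; a sketch that discretizes a bimodal target on a grid and grid-searches a single Gaussian under each direction (the modes at ±2, grid ranges, and step sizes are all illustrative choices):

```python
import math

xs = [-6 + 12 * i / 600 for i in range(601)]

def normal(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normalize(ws):
    s = sum(ws)
    return [w / s for w in ws]

# Bimodal target: equal-weight mixture of N(-2, 0.5^2) and N(2, 0.5^2)
P = normalize([0.5 * normal(x, -2, 0.5) + 0.5 * normal(x, 2, 0.5) for x in xs])

def kl(A, B):
    return sum(a * math.log(a / b) for a, b in zip(A, B) if a > 0)

def best_fit(direction):
    best = None
    for mu in [m / 4 for m in range(-12, 13)]:            # mu in [-3, 3]
        for sigma in [0.3 + 0.1 * s for s in range(28)]:  # sigma in [0.3, 3.0]
            Q = normalize([normal(x, mu, sigma) for x in xs])
            d = kl(P, Q) if direction == "forward" else kl(Q, P)
            if best is None or d < best[0]:
                best = (d, mu, sigma)
    return best

_, mu_f, sig_f = best_fit("forward")
_, mu_r, sig_r = best_fit("reverse")
print("forward KL fit: mu =", mu_f, "sigma =", sig_f)  # broad fit covering both modes
print("reverse KL fit: mu =", mu_r, "sigma =", sig_r)  # narrow fit on a single mode
```

The forward fit sits between the modes with a large variance; the reverse fit locks onto one mode with a variance close to that mode's own.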

A simple discrete example of asymmetry

Let $\mathcal{X} = \{a,b\}$.

  • $P(a)=0.99$, $P(b)=0.01$.
  • Consider two candidate models:
    • $Q_1(a)=0.9$, $Q_1(b)=0.1$ (covers both outcomes)
    • $Q_2(a)=0.9999$, $Q_2(b)=0.0001$ (very confident)

Forward KL $D_{\mathrm{KL}}(P\|Q)$ penalizes $Q$ for underestimating events that happen under $P$.

  • If $Q_2$ makes $b$ extremely unlikely while $P(b)=0.01$, the term $0.01\log\frac{0.01}{0.0001}$ can be significant.

Reverse KL $D_{\mathrm{KL}}(Q\|P)$ penalizes $Q$ for placing mass where $P$ does not.

  • Here the supports match (no zeros), but the weighting changes what matters: reverse KL is dominated by where $Q$ places mass (mostly at $a$), making it less sensitive to a small-probability region unless $Q$ also places mass there.

KL is not symmetric—and not meant to be

Thinking of KL as “how surprised would I be if I used $Q$ in a world of $P$?” naturally picks a direction: the world is $P$, your model is $Q$.

So asymmetry is actually part of the meaning.

When does asymmetry show up in ML practice?

  • Supervised classification: cross-entropy corresponds to forward KL from the data label distribution to the model distribution.
  • Variational inference (ELBO): commonly minimizes reverse KL between an approximate posterior $q(z)$ and the true posterior $p(z\mid x)$.
  • Distillation: the direction determines whether the student covers the teacher’s support broadly or focuses on peak probabilities.

Practical note: smoothing to avoid infinities

In discrete problems, if you estimate $Q$ from limited data, you can accidentally set $Q(x)=0$ for an event that later appears. Forward KL then becomes infinite.

Common fix: additive smoothing (Laplace/Dirichlet priors) so that $Q(x)>0$ for all $x$.

This is not a hack; it encodes uncertainty and prevents “impossible” predictions from finite data artifacts.
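A sketch of how smoothing removes the infinity (the counts and the pseudo-count `alpha` are made up):

```python
import math

def smoothed_dist(counts, alpha=1.0):
    """Additive (Laplace) smoothing: every outcome gets pseudo-count alpha."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

def kl(P, Q):
    out = 0.0
    for p, q in zip(P, Q):
        if p > 0:
            if q == 0:
                return math.inf
            out += p * math.log(p / q)
    return out

P = [0.5, 0.3, 0.2]          # true distribution over 3 outcomes
counts = [6, 4, 0]           # a small sample that never observed outcome 3

Q_mle      = [c / sum(counts) for c in counts]   # 0.6, 0.4, 0.0
Q_smoothed = smoothed_dist(counts)               # all entries > 0

print(kl(P, Q_mle))       # inf: the unsmoothed model forbids outcome 3
print(kl(P, Q_smoothed))  # finite
```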

Applications and Connections: Cross-Entropy, VAEs/ELBO, and Information Bottleneck

Cross-entropy: the most common place you see KL

You already know entropy $H(P)$. Cross-entropy between $P$ and $Q$ is

$$H(P,Q)=-\mathbb{E}_{x\sim P}[\log Q(x)].$$

And the identity

$$H(P,Q)=H(P)+D_{\mathrm{KL}}(P\|Q)$$

says: cross-entropy is entropy plus a mismatch penalty.

In classification, $P$ is often a one-hot (or soft) label distribution and $Q$ is the model’s predicted probabilities. Minimizing cross-entropy is minimizing KL (since $H(P)$ doesn’t depend on model parameters).

This is why KL is not just a theoretical object—it is embedded in the training objective of most neural classifiers.

Variational Autoencoders (VAEs): KL as a regularizer via ELBO

In VAEs, we introduce latent variables $z$ and want to maximize $\log p_\theta(x)$, but the posterior $p_\theta(z\mid x)$ is intractable.

We choose an approximate posterior $q_\phi(z\mid x)$ and maximize the ELBO:

$$\mathcal{L}(\theta,\phi; x)=\mathbb{E}_{z\sim q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\|\,p(z)\big).$$

Two important KL-related insights:

  1. The KL term is a reverse KL (approximate posterior to prior). It encourages $q_\phi(z\mid x)$ not to drift too far from $p(z)$.
  2. The ELBO can be derived by rewriting $\log p_\theta(x)$ and inserting $q_\phi$; the gap between $\log p_\theta(x)$ and the ELBO is itself a KL:

$$\log p_\theta(x) - \mathcal{L}(\theta,\phi; x) = D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\big).$$

So maximizing ELBO is minimizing a KL divergence to the true posterior.

This is a major “KL is everywhere” moment: it measures how far your approximation is from an intractable truth.
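The gap identity can be verified exactly on a tiny discrete latent-variable model; a sketch with made-up numbers:

```python
import math

# Toy discrete latent-variable model to verify the ELBO gap identity:
#   log p(x) - ELBO = KL(q(z) || p(z|x)).
p_z   = [0.5, 0.5]   # prior over 2 latent states
p_x_z = [0.9, 0.2]   # likelihood p(x | z) for the one observed x
q_z   = [0.7, 0.3]   # some approximate posterior

p_x  = sum(pz * px for pz, px in zip(p_z, p_x_z))        # evidence p(x)
post = [pz * px / p_x for pz, px in zip(p_z, p_x_z)]     # true posterior p(z|x)

# ELBO = E_q[log p(x|z)] - KL(q || prior)
elbo = sum(q * (math.log(px) - math.log(q / pz))
           for q, pz, px in zip(q_z, p_z, p_x_z))
kl_gap = sum(q * math.log(q / po) for q, po in zip(q_z, post))

print(math.log(p_x) - elbo, kl_gap)   # equal to machine precision
```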

Information Bottleneck: KL as a constraint on information

The information bottleneck trades off:

  • how much information a representation $T$ retains about $X$
  • vs how much it preserves about the target $Y$

Mutual information is defined via KL:

$$I(X;T)=D_{\mathrm{KL}}\big(p(x,t)\,\|\,p(x)p(t)\big).$$

So KL is the primitive object beneath mutual information. The bottleneck objective can be written using KL-based quantities, and many derivations rely on KL manipulations.
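A sketch computing mutual information directly as this KL, for a made-up 2x2 joint distribution over $(X, Y)$:

```python
import math

# Joint distribution with correlated X and Y (illustrative numbers)
joint = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.1, (1, 1): 0.4}

px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

# I(X;Y) = KL( joint || product of marginals )
mi = sum(p * math.log(p / (px[x] * py[y])) for (x, y), p in joint.items())
print(mi)   # positive: X and Y are dependent
```

For an independent joint (each cell equal to the product of its marginals) the same sum is exactly 0.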

A unifying mental model

Across these applications, KL is playing the same role:

  • There is a “reference truth” distribution (data labels, true posterior, joint distribution).
  • There is a “model/approximation” distribution.
  • KL measures the average log mismatch—the penalty you pay for pretending your approximation is real.

Visualization prompt: KL as a landscape to optimize

Canvas D: loss surface in parameter space

Take a simple family $Q_\theta$ (e.g., Bernoulli with parameter $q$) and a fixed target $P$ (Bernoulli with parameter $p$).

Plot $D_{\mathrm{KL}}(P\|Q_\theta)$ as a function of $q$.

Prompts:

  • Notice the convexity in $q$ of the Bernoulli forward KL.
  • Watch how gradients blow up as $q\to 0$ when $p>0$.

This links KL to optimization behavior: sometimes the objective is smooth and well-behaved; sometimes it has steep cliffs because of near-zero probabilities.
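A sketch of that landscape for a fixed true parameter $p=0.3$ (an arbitrary choice):

```python
import math

p = 0.3   # fixed true Bernoulli parameter

def kl(q):
    # forward KL D(Bern(p) || Bern(q)) in nats
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

qs = [i / 100 for i in range(1, 100)]
vals = [kl(q) for q in qs]

# The minimum sits exactly at q = p; the curve climbs steeply as q -> 0
print(min(zip(vals, qs)))   # (0.0, 0.3)
print(kl(0.001))            # large: near-zero q is heavily punished
```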

That optimization intuition is crucial when you later study:

  • numerical stability in softmax/cross-entropy
  • KL annealing in VAEs
  • mode collapse and distribution mismatch in generative models

Worked Examples (3)

Compute KL for Bernoulli distributions and see the blow-up

Let $P=\mathrm{Bern}(p)$ and $Q=\mathrm{Bern}(q)$. Outcomes are $x\in\{0,1\}$ with $P(1)=p$, $P(0)=1-p$, and similarly for $Q$.

  1. Write the discrete KL definition:

     $$D_{\mathrm{KL}}(P\|Q)=\sum_{x\in\{0,1\}} P(x)\log\frac{P(x)}{Q(x)}.$$
  2. Expand the sum over the two outcomes:

     $$D_{\mathrm{KL}}(P\|Q)=P(1)\log\frac{P(1)}{Q(1)} + P(0)\log\frac{P(0)}{Q(0)}.$$
  3. Substitute $P(1)=p$, $Q(1)=q$, $P(0)=1-p$, $Q(0)=1-q$:

     $$D_{\mathrm{KL}}(\mathrm{Bern}(p)\|\mathrm{Bern}(q))=p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}.$$
  4. Concrete numbers (nats): let $p=0.2$.

     • If $q=0.2$, then each log ratio is 0, so KL = 0.
     • If $q=0.02$:

     $$D=0.2\log\frac{0.2}{0.02} + 0.8\log\frac{0.8}{0.98}.$$

     Compute the pieces:

     $0.2\log 10 \approx 0.2\cdot 2.3026=0.4605$.

     $0.8\log(0.8163)\approx 0.8\cdot(-0.203)= -0.1624$.

     So $D\approx 0.2981$ nats.

     • If $q\to 0$: the term $p\log\frac{p}{q}\to\infty$ as $\log(1/q)\to\infty$. Hence KL blows up.

Insight: Even with only two outcomes, the asymmetry is visible: if the true event probability $p>0$ but the model sets $q$ near 0, the log-loss penalty becomes arbitrarily large. This is the operational meaning of “support mismatch” in miniature.

Show that cross-entropy decomposes into entropy + KL (with full algebra)

Let $P$ be the true distribution over $\mathcal{X}$ and $Q$ be a model distribution. Define entropy $H(P)=-\sum_x P(x)\log P(x)$ and cross-entropy $H(P,Q)=-\sum_x P(x)\log Q(x)$.

  1. Start from KL:

     $$D_{\mathrm{KL}}(P\|Q)=\sum_x P(x)\log\frac{P(x)}{Q(x)}.$$
  2. Split the log ratio:

     $$\log\frac{P(x)}{Q(x)}=\log P(x)-\log Q(x).$$
  3. Distribute $P(x)$ and sum:

     $$D_{\mathrm{KL}}(P\|Q)=\sum_x P(x)\log P(x) - \sum_x P(x)\log Q(x).$$
  4. Multiply by $-1$ inside the definitions:

     $$\sum_x P(x)\log P(x) = -H(P), \qquad -\sum_x P(x)\log Q(x) = H(P,Q).$$
  5. Substitute:

     $$D_{\mathrm{KL}}(P\|Q)= -H(P) + H(P,Q).$$
  6. Rearrange to get the decomposition:

     $$H(P,Q)=H(P)+D_{\mathrm{KL}}(P\|Q).$$

Insight: Cross-entropy is exactly “irreducible uncertainty” $H(P)$ plus “model mismatch” $D_{\mathrm{KL}}(P\|Q)$. In learning, you can’t reduce $H(P)$ by changing your model, so training focuses on driving the KL term down.

Reverse vs forward KL on a toy ‘two-mode’ discrete distribution

Let $\mathcal{X}=\{A,B,C\}$. Let the target be bimodal: $P(A)=0.49$, $P(B)=0.49$, $P(C)=0.02$. Consider two approximations:

  • $Q_{\text{cover}}(A)=0.45$, $Q_{\text{cover}}(B)=0.45$, $Q_{\text{cover}}(C)=0.10$ (covers everything)
  • $Q_{\text{seek}}(A)=0.98$, $Q_{\text{seek}}(B)=0.01$, $Q_{\text{seek}}(C)=0.01$ (seeks one mode)

  1. Compute forward KL for $Q_{\text{cover}}$:

     $$D(P\|Q)=\sum_x P(x)\log\frac{P(x)}{Q(x)}.$$
  2. Plug in each term (nats):

     • For A: $0.49\log(0.49/0.45)$
     • For B: $0.49\log(0.49/0.45)$
     • For C: $0.02\log(0.02/0.10)$
  3. Approximate:

     $\log(0.49/0.45)=\log(1.0889)\approx 0.0852$.

     So A and B together contribute $2\cdot 0.49\cdot 0.0852 \approx 0.0835$.

     $\log(0.02/0.10)=\log(0.2)\approx -1.609$.

     So C contributes $0.02\cdot(-1.609)\approx -0.0322$.

     Total forward KL $\approx 0.0513$.

  4. Compute forward KL for $Q_{\text{seek}}$:

     • A: $0.49\log(0.49/0.98)=0.49\log(0.5)\approx 0.49\cdot(-0.693)= -0.3396$
     • B: $0.49\log(0.49/0.01)=0.49\log(49)\approx 0.49\cdot 3.892=1.907$
     • C: $0.02\log(0.02/0.01)=0.02\log 2\approx 0.02\cdot 0.693=0.0139$

     Total forward KL $\approx 1.581$ (much larger).

  5. Now compute reverse KL for $Q_{\text{seek}}$:

     $$D(Q\|P)=\sum_x Q(x)\log\frac{Q(x)}{P(x)}.$$
  6. Reverse KL terms (nats):

     • A: $0.98\log(0.98/0.49)=0.98\log 2\approx 0.98\cdot 0.693=0.679$
     • B: $0.01\log(0.01/0.49)=0.01\log(0.0204)\approx 0.01\cdot(-3.892)= -0.0389$
     • C: $0.01\log(0.01/0.02)=0.01\log(0.5)\approx 0.01\cdot(-0.693)= -0.0069$

     Total reverse KL $\approx 0.633$.

Insight: Forward KL heavily punishes missing a mode that has large $P$ mass (the big $\log(49)$ term). Reverse KL weights by $Q$, so if $Q$ barely allocates mass to B and C, it barely “feels” being wrong there, capturing the mode-seeking tendency.
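The three totals above can be checked in a few lines:

```python
import math

def kl(A, B):
    return sum(a * math.log(a / b) for a, b in zip(A, B) if a > 0)

P       = [0.49, 0.49, 0.02]
Q_cover = [0.45, 0.45, 0.10]
Q_seek  = [0.98, 0.01, 0.01]

print(kl(P, Q_cover))   # ~0.0513  (forward KL, covering fit)
print(kl(P, Q_seek))    # ~1.581   (forward KL, mode-seeking fit: huge)
print(kl(Q_seek, P))    # ~0.633   (reverse KL, mode-seeking fit: modest)
```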

Key Takeaways

  • KL divergence is defined as an expectation under the first distribution: $D_{\mathrm{KL}}(P\|Q)=\mathbb{E}_{x\sim P}\left[\log\frac{P(x)}{Q(x)}\right]$.

  • Operational meaning: it is the average extra log-loss (extra bits/nats per sample) you incur when using $Q$ to predict data generated by $P$.

  • KL is always nonnegative (Gibbs/Jensen), and equals 0 iff $P=Q$ (almost everywhere / on the support of $P$).

  • KL is asymmetric; swapping arguments changes which regions matter, because the expectation reweights by $P$ vs by $Q$.

  • Support mismatch causes infinite KL: if $P>0$ where $Q=0$, then $D_{\mathrm{KL}}(P\|Q)=\infty$.

  • Forward KL ($P\|Q$) tends to be mode-covering; reverse KL ($Q\|P$) tends to be mode-seeking in approximations.

  • Cross-entropy decomposes as $H(P,Q)=H(P)+D_{\mathrm{KL}}(P\|Q)$, making KL the mismatch term behind common training losses.

  • KL is a foundational building block under ELBOs (VAEs/variational inference) and mutual information (information bottleneck).

Common Mistakes

  • Treating KL as a symmetric distance and casually writing $D_{\mathrm{KL}}(P\|Q)=D_{\mathrm{KL}}(Q\|P)$.

  • Ignoring support: computing KL numerically without guarding against zeros in $Q$ (leading to infinities or NaNs).

  • Forgetting that the expectation is under $P$ (or under $Q$ for reverse KL), leading to incorrect intuition about which errors matter.

  • Confusing cross-entropy $H(P,Q)$ with entropy $H(P)$, or assuming minimizing cross-entropy always implies $Q$ matches $P$ without considering model capacity and estimation error.

Practice

easy

Let $P=\mathrm{Bern}(0.7)$ and $Q=\mathrm{Bern}(0.4)$. Compute $D_{\mathrm{KL}}(P\|Q)$ in nats.

Hint: Use $D=p\log(p/q)+(1-p)\log((1-p)/(1-q))$.

Show solution

Here $p=0.7$, $q=0.4$.

$$D=0.7\log\frac{0.7}{0.4}+0.3\log\frac{0.3}{0.6}.$$

Compute the pieces:

$\log(0.7/0.4)=\log(1.75)\approx 0.5596$, so the first term is $\approx 0.7\cdot 0.5596=0.3917$.

$\log(0.3/0.6)=\log(0.5)\approx -0.6931$, so the second term is $\approx 0.3\cdot(-0.6931)=-0.2079$.

Total: $D\approx 0.1838$ nats.

medium

Prove (using Jensen’s inequality) that $D_{\mathrm{KL}}(P\|Q)\ge 0$ for discrete distributions with matching support.

Hint: Rewrite KL as $-\sum_x P(x)\log(Q(x)/P(x))$ and apply Jensen to the concave $\log$.

Show solution

Start from

$$D_{\mathrm{KL}}(P\|Q)=\sum_x P(x)\log\frac{P(x)}{Q(x)} = -\sum_x P(x)\log\frac{Q(x)}{P(x)}.$$

Since $\log$ is concave, Jensen gives

$$\sum_x P(x)\log\frac{Q(x)}{P(x)} \le \log\left(\sum_x P(x)\frac{Q(x)}{P(x)}\right)=\log\left(\sum_x Q(x)\right)=\log 1=0.$$

Therefore $-\sum_x P(x)\log\frac{Q(x)}{P(x)} \ge 0$, i.e. $D_{\mathrm{KL}}(P\|Q)\ge 0$.

Equality holds iff $Q(x)/P(x)$ is constant over $x$ with $P(x)>0$, which implies $Q=P$.

hard

Consider a target distribution $P$ that is a mixture of two well-separated Gaussians in 1D. You approximate it with a single Gaussian family $Q_\theta=\mathcal{N}(\mu,\sigma^2)$. Qualitatively, which direction (forward KL vs reverse KL) is more likely to yield a large $\sigma$ covering both modes, and which is more likely to pick one mode? Explain using the “expectation under which distribution?” idea.

Hint: Forward KL weights by $P$ (penalizes missing where $P$ has mass). Reverse KL weights by $Q$ (penalizes placing mass where $P$ is small).

Show solution

Forward KL $D_{\mathrm{KL}}(P\|Q)$ is averaged over $x\sim P$. If $Q$ fails to assign decent density to either mode, then for many samples from the missed mode, $\log(P(x)/Q(x))$ becomes large, heavily penalizing the fit. This tends to push $Q$ toward a broader distribution (larger $\sigma$) that covers both modes.

Reverse KL $D_{\mathrm{KL}}(Q\|P)$ is averaged over $x\sim Q$. If $Q$ concentrates around one mode, it rarely samples the other mode, so it does not “feel” the penalty of missing it. Instead, it is strongly penalized for placing mass in the low-density valley between modes (where $P$ is small), encouraging $Q$ to sit on one mode with smaller $\sigma$. This is the classic mode-seeking behavior.

Connections

Related next-step ideas you’ll likely want soon:

  • Mutual information $I(X;Y)$ as a KL between the joint and the product of marginals
  • Jensen–Shannon divergence as a symmetrized, smoothed cousin of KL (useful in GANs)
  • f-divergences (KL is one member of a larger family)