Relative entropy. Measuring difference between distributions.
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
Two probability distributions can look “close” on a plot but behave very differently when you use one to make predictions under the other. KL divergence is the tool that turns that mismatch into a single number—with a clear operational meaning: how many extra bits (or nats) you pay, on average, per sample when you code or predict using Q while the world is actually P.
KL divergence measures the average extra log-loss incurred by using $Q$ instead of the true $P$. It is nonnegative, asymmetric, and becomes infinite when $Q$ assigns zero probability where $P$ does not.
When you compare two vectors, Euclidean distance is natural. For probability distributions, Euclidean distance often hides what actually matters for learning and decision-making.
So we want a measure that is prediction-aware: it should tell us how costly it is to pretend the world follows $Q$ when it actually follows $P$.
KL divergence (Kullback–Leibler divergence), also called relative entropy, does exactly that.
Let $P$ and $Q$ be distributions over the same discrete set \(\mathcal{X}\). The KL divergence from $P$ to $Q$ is

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}.$$
A few immediate notes: terms with $p(x) = 0$ contribute 0 (by the convention $0 \log 0 = 0$), and the base of the logarithm sets the units (bits for base 2, nats for base $e$).
For densities $p$ and $q$ (with respect to the same base measure),

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx.$$

The same “support mismatch” rule applies: if $q(x) = 0$ on a region where $p(x) > 0$, the KL is infinite.
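A minimal sketch of the discrete definition, with both conventions (zero mass in $P$ contributes nothing; zero mass in $Q$ where $P$ has mass gives infinity); the function name `kl_divergence` is ours, not from any library:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as lists of probabilities.

    Terms with p(x) = 0 contribute 0 (convention 0 * log 0 = 0);
    if q(x) = 0 while p(x) > 0, the divergence is infinite.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue          # 0 * log(0 / q) = 0 by convention
        if qx == 0.0:
            return math.inf   # support mismatch: P has mass where Q has none
        total += px * math.log(px / qx)
    return total

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))  # inf
```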
KL is an average of a log ratio: the individual terms $\log \frac{p(x)}{q(x)}$ can be negative wherever $q(x) > p(x)$.
So why isn’t KL sometimes negative overall? Because of a deep inequality (Jensen / Gibbs) that forces the expectation to be ≥ 0.
Canvas A: Bernoulli slider
Prompted observations:
This is the first “feel” for what KL cares about: being wrong in the direction of underestimating true events is catastrophic in log-loss terms.
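A quick numeric version of the slider (a sketch; `bernoulli_kl` is our helper name): fix the true event probability and watch KL blow up as the model underestimates it:

```python
import math

def bernoulli_kl(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Fix the true event probability; let the model underestimate it more and more.
p = 0.5
for q in [0.5, 0.3, 0.1, 0.01, 0.001]:
    print(f"q = {q:6.3f}  ->  KL = {bernoulli_kl(p, q):8.4f} nats")
```

The penalty grows without bound as `q` approaches 0 while `p` stays fixed.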
Imagine two distributions over the same real line whose supports barely overlap: there is a region where $p(x) > 0$ but $q(x) = 0$, so $D_{\mathrm{KL}}(P \,\Vert\, Q) = \infty$.
If you overlay them, the key highlighted region is where $p(x)$ has mass but $q(x)$ is zero. KL treats that as an absolute failure, because it corresponds to assigning zero probability to events that actually occur.
Keep those distinctions in mind; they will matter when we interpret mode-seeking vs mode-covering behavior later.
In probabilistic modeling, a standard way to score a model on data drawn from $P$ is negative log-likelihood (log-loss). For an outcome $x$, the log-loss under model $Q$ is

$$-\log q(x).$$

The expected log-loss under the true distribution is

$$\mathbb{E}_{x \sim P}[-\log q(x)] = -\sum_{x} p(x) \log q(x).$$
This quantity shows up everywhere: it is exactly the cross-entropy $H(P, Q)$, and it is the population objective behind maximum-likelihood training.
So the natural question becomes:
How much worse is it to use $Q$ than to use the true $P$, on average?
Start from KL:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}.$$

Expand the log ratio:

$$\log \frac{p(x)}{q(x)} = \log p(x) - \log q(x).$$

Plug in:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_x p(x) \log p(x) - \sum_x p(x) \log q(x).$$

Rewrite each term: the first is $-\mathbb{E}_{x \sim P}[-\log p(x)]$ and the second (with its sign) is the expected log-loss $\mathbb{E}_{x \sim P}[-\log q(x)]$.

So

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \mathbb{E}_{x \sim P}[-\log q(x)] - \mathbb{E}_{x \sim P}[-\log p(x)].$$
Interpretation: KL is exactly the expected extra log-loss you pay for predicting with $Q$ instead of $P$. In base-2 logs, it is literally “extra bits per symbol.”
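To see “extra bits per symbol” concretely, here is a small base-2 computation (the two distributions are arbitrary illustrative choices):

```python
import math

def entropy_bits(p):
    """H(P) in bits."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy_bits(p, q):
    """H(P, Q) in bits."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.2, 0.1]      # true distribution
q = [1/3, 1/3, 1/3]      # uniform model
h = entropy_bits(p)
ce = cross_entropy_bits(p, q)
print(f"H(P) = {h:.4f} bits, H(P,Q) = {ce:.4f} bits, extra = {ce - h:.4f} bits/symbol")
```

The “extra” column is exactly $D_{\mathrm{KL}}(P \,\Vert\, Q)$ in bits.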
The statement $D_{\mathrm{KL}}(P \,\Vert\, Q) \ge 0$ formalizes: on average, you cannot do better (in expected log-loss) than predicting with the true distribution.
A clean proof uses Jensen’s inequality on the concave log function.
Let’s show the common Gibbs inequality form.
We want to show:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0.$$

Equivalently,

$$\sum_x p(x) \log \frac{q(x)}{p(x)} \le 0.$$

Now use Jensen (log is concave):

$$\sum_x p(x) \log \frac{q(x)}{p(x)} \le \log \sum_x p(x) \frac{q(x)}{p(x)} = \log \sum_x q(x) \le \log 1 = 0.$$

Therefore,

$$D_{\mathrm{KL}}(P \,\Vert\, Q) \ge 0,$$

with equality iff $p(x) = q(x)$ for all $x$ where $p(x) > 0$.
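The inequality is easy to stress-test numerically; this sketch (helper names are ours) checks Gibbs’ inequality on many random strictly positive distribution pairs:

```python
import math
import random

def kl(p, q):
    """D_KL(P || Q) in nats for strictly positive q."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def random_dist(n, rng):
    """A random distribution on n outcomes with strictly positive weights."""
    w = [rng.random() + 1e-6 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
for _ in range(1000):
    p, q = random_dist(5, rng), random_dist(5, rng)
    # allow a tiny negative tolerance for floating-point round-off
    assert kl(p, q) >= -1e-12, "Gibbs' inequality violated"
print("D_KL(P || Q) >= 0 held for 1000 random pairs")
```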
Canvas B: per-outcome contribution plot
For a small discrete space (say 5 outcomes), plot bars of the per-outcome contributions $p(x) \log \frac{p(x)}{q(x)}$.
Prompts:
This visually explains why KL focuses on being accurate where $P$ puts mass.
Suppose you have data $x_1, \dots, x_n \sim P$ i.i.d., and a parametric model $Q_\theta$.
The average negative log-likelihood is

$$\hat{L}_n(\theta) = -\frac{1}{n} \sum_{i=1}^n \log q_\theta(x_i).$$

As $n \to \infty$, this converges to

$$\mathbb{E}_{x \sim P}[-\log q_\theta(x)] = H(P, Q_\theta).$$

Since

$$H(P, Q_\theta) = H(P) + D_{\mathrm{KL}}(P \,\Vert\, Q_\theta),$$

minimizing expected NLL over $\theta$ is the same as minimizing $D_{\mathrm{KL}}(P \,\Vert\, Q_\theta)$ (because $H(P)$ does not depend on $\theta$).
So MLE is “KL minimization from data distribution to model distribution.”
That phrasing becomes extremely useful later (cross-entropy, VAEs, variational inference).
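A small simulation makes the equivalence tangible (illustrative values; `avg_nll` and `kl_bern` are our own helper names, and the parameter grid is coarse on purpose):

```python
import math
import random

# True distribution: Bernoulli(p_true). Model family: Bernoulli(theta).
p_true = 0.7
rng = random.Random(42)
data = [1 if rng.random() < p_true else 0 for _ in range(100_000)]
k, n = sum(data), len(data)

def avg_nll(theta):
    """Average negative log-likelihood of the sample under Bernoulli(theta)."""
    return -(k * math.log(theta) + (n - k) * math.log(1 - theta)) / n

def kl_bern(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

thetas = [i / 100 for i in range(1, 100)]
theta_mle = min(thetas, key=avg_nll)                      # minimizes empirical NLL
theta_kl = min(thetas, key=lambda t: kl_bern(p_true, t))  # minimizes D_KL(P || Q_theta)
print(theta_mle, theta_kl)  # both land near p_true = 0.7
```

With enough data, the empirical-NLL minimizer and the KL minimizer pick (essentially) the same grid point.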
Because KL is expectation-weighted under the first argument, swapping arguments changes what errors are emphasized.
This is not just a mathematical curiosity: it determines whether an approximation spreads out to “cover” all modes or collapses onto one.
Recall:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \mathbb{E}_{x \sim P}\!\left[\log \frac{p(x)}{q(x)}\right], \qquad D_{\mathrm{KL}}(Q \,\Vert\, P) = \mathbb{E}_{x \sim Q}\!\left[\log \frac{q(x)}{p(x)}\right].$$
So each direction imposes a different feasibility constraint:
| Divergence | Catastrophic if… | Intuition |
|---|---|---|
| $D_{\mathrm{KL}}(P \,\Vert\, Q)$ | $q(x) \approx 0$ where $p(x) > 0$ | Model refuses to assign probability to real events |
| $D_{\mathrm{KL}}(Q \,\Vert\, P)$ | $q(x) > 0$ where $p(x) \approx 0$ | Model insists on events the truth says are impossible |
Canvas C: mixture of Gaussians vs single Gaussian approximation
Prompts:
This is a central intuition used in variational inference: minimizing reverse KL (common in VI) tends to be mode-seeking.
Write the two directions side by side: forward KL is $D_{\mathrm{KL}}(P \,\Vert\, Q)$ and reverse KL is $D_{\mathrm{KL}}(Q \,\Vert\, P)$.
Forward KL penalizes $Q$ when it underestimates events that happen under $P$.
Reverse KL penalizes $Q$ when it places mass where $P$ does not.
Thinking of KL as “how surprised would I be if I used Q in a world of P?” naturally picks a direction: the world is $P$, your model is $Q$.
So asymmetry is actually part of the meaning.
In discrete problems, if you estimate $Q$ from limited data, you can accidentally set $q(x) = 0$ for an event that later appears. Forward KL then becomes infinite.
Common fix: additive smoothing (Laplace/Dirichlet priors) so that $q(x) > 0$ for all $x$.
This is not a hack; it encodes uncertainty and prevents “impossible” predictions from finite data artifacts.
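A minimal sketch of additive smoothing (`alpha = 1` gives classic Laplace smoothing; the function name is ours):

```python
def smoothed_probs(counts, alpha=1.0):
    """Additive (Laplace/Dirichlet) smoothing: no outcome gets probability zero."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

counts = [7, 3, 0]  # third outcome never observed in the sample
mle = [c / sum(counts) for c in counts]
print("MLE:     ", mle)                     # contains an exact zero
print("Smoothed:", smoothed_probs(counts))  # all strictly positive
```

The unsmoothed estimate makes forward KL from the truth infinite as soon as the unseen outcome is possible; the smoothed one never does.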
You already know entropy $H(P) = -\sum_x p(x) \log p(x)$. Cross-entropy between $P$ and $Q$ is

$$H(P, Q) = -\sum_x p(x) \log q(x).$$

And the identity

$$H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\Vert\, Q)$$

says: cross-entropy is entropy plus a mismatch penalty.
In classification, $P$ is often a one-hot (or soft) label distribution and $Q$ is the model’s predicted probabilities. Minimizing cross-entropy is minimizing KL (since $H(P)$ doesn’t depend on model parameters).
This is why KL is not just a theoretical object—it is embedded in the training objective of most neural classifiers.
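For instance, with a one-hot target the general cross-entropy sum collapses to the familiar negative log-probability of the true class (a sketch, not any specific framework's API):

```python
import math

def cross_entropy(p, q):
    """H(P, Q) in nats; terms with p(x) = 0 drop out."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

label = [0.0, 1.0, 0.0]   # one-hot: true class is index 1
pred = [0.1, 0.7, 0.2]    # model's predicted class probabilities

# With a one-hot target, H(P, Q) reduces to -log q(true class).
print(cross_entropy(label, pred))
print(-math.log(pred[1]))  # same value
```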
In VAEs, we introduce latent variables $z$ and want to maximize $\log p(x)$, but the posterior $p(z \mid x)$ is intractable.
We choose an approximate posterior $q_\phi(z \mid x)$ and maximize the ELBO:

$$\mathrm{ELBO}(\phi) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p(x, z)}{q_\phi(z \mid x)}\right] = \log p(x) - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\Vert\, p(z \mid x)\big).$$
Two important KL-related insights: the gap $\log p(x) - \mathrm{ELBO}$ is exactly $D_{\mathrm{KL}}(q_\phi(z \mid x) \,\Vert\, p(z \mid x)) \ge 0$, and rearranging the ELBO exposes an explicit $D_{\mathrm{KL}}(q_\phi(z \mid x) \,\Vert\, p(z))$ regularizer toward the prior.
So maximizing ELBO is minimizing a KL divergence to the true posterior.
This is a major “KL is everywhere” moment: it measures how far your approximation is from an intractable truth.
The information bottleneck trades off compressing the input representation (keeping $I(X;Z)$ small) against preserving relevant information (keeping $I(Z;Y)$ large).
Mutual information is defined via KL:

$$I(X;Y) = D_{\mathrm{KL}}\big(P_{XY} \,\Vert\, P_X \otimes P_Y\big).$$
So KL is the primitive object beneath mutual information. The bottleneck objective can be written using KL-based quantities, and many derivations rely on KL manipulations.
Across these applications, KL is playing the same role: it quantifies the cost of substituting one distribution for another.
Canvas D: loss surface in parameter space
Take a simple family (e.g., Bernoulli with parameter $\theta$) and a fixed target (Bernoulli with parameter $p^*$).
Plot $D_{\mathrm{KL}}(P^* \,\Vert\, Q_\theta)$ as a function of $\theta$.
Prompts:
This links KL to optimization behavior: sometimes the objective is smooth and well-behaved; sometimes it has steep cliffs because of near-zero probabilities.
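The loss surface from Canvas D can be tabulated directly; a sketch with an illustrative target $p^* = 0.3$ (helper name `kl_bern` is ours):

```python
import math

def kl_bern(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p_star = 0.3  # fixed target (illustrative choice)
for theta in [0.01, 0.1, 0.3, 0.5, 0.9, 0.99]:
    print(f"theta = {theta:5.2f}  ->  KL = {kl_bern(p_star, theta):8.4f} nats")
# Smooth and zero at theta = p_star, but steep cliffs as theta -> 0 or 1.
```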
That optimization intuition is crucial when you later study KL-based training objectives such as cross-entropy losses and the ELBO.
Let $P = \mathrm{Bernoulli}(p)$ and $Q = \mathrm{Bernoulli}(q)$. Outcomes are $\{1, 0\}$ with $P(1) = p$, $P(0) = 1 - p$, and similarly for $Q$.
Write the discrete KL definition:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_{x \in \{0, 1\}} P(x) \log \frac{P(x)}{Q(x)}.$$

Expand the sum over the two outcomes and substitute $P(1) = p$, $P(0) = 1 - p$, $Q(1) = q$, $Q(0) = 1 - q$:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}.$$
Concrete numbers (nats): take, for instance, $p = 0.5$ and $q = 0.01$.
Compute pieces:
$p \ln \frac{p}{q} = 0.5 \ln 50 \approx 0.5 \times 3.912 = 1.956$.
$(1 - p) \ln \frac{1 - p}{1 - q} = 0.5 \ln \frac{0.5}{0.99} \approx 0.5 \times (-0.683) = -0.342$.
So $D_{\mathrm{KL}}(P \,\Vert\, Q) \approx 1.956 - 0.342 = 1.614$ nats.
The term $p \ln \frac{p}{q} \to \infty$ as $q \to 0$. Hence KL blows up.
Insight: Even with only two outcomes, the asymmetry is visible: if the true event probability $p$ is bounded away from 0 but the model sets $q$ near 0, the log-loss penalty becomes arbitrarily large. This is the operational meaning of “support mismatch” in miniature.
Let $P$ be the true distribution over $\mathcal{X}$ and $Q$ be a model distribution. Define entropy $H(P) = -\sum_x p(x) \log p(x)$ and cross-entropy $H(P, Q) = -\sum_x p(x) \log q(x)$.
Start from KL:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}.$$

Split the log ratio:

$$\log \frac{p(x)}{q(x)} = \log p(x) - \log q(x).$$

Distribute and sum:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_x p(x) \log p(x) - \sum_x p(x) \log q(x).$$

Multiply by $-1$ inside the definitions: $\sum_x p(x) \log p(x) = -H(P)$ and $-\sum_x p(x) \log q(x) = H(P, Q)$.
Substitute:

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = -H(P) + H(P, Q).$$

Rearrange to get the decomposition:

$$H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\Vert\, Q).$$
Insight: Cross-entropy is exactly “irreducible uncertainty” $H(P)$ plus “model mismatch” $D_{\mathrm{KL}}(P \,\Vert\, Q)$. In learning, you can’t reduce $H(P)$ by changing your model, so training focuses on driving the KL term down.
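The decomposition is easy to verify numerically (the two distributions are arbitrary illustrative choices):

```python
import math

p = [0.6, 0.3, 0.1]
q = [0.4, 0.4, 0.2]

entropy = -sum(px * math.log(px) for px in p)
cross_entropy = -sum(px * math.log(qx) for px, qx in zip(p, q))
kl = sum(px * math.log(px / qx) for px, qx in zip(p, q))

print(f"H(P) + D_KL = {entropy + kl:.6f}")
print(f"H(P, Q)     = {cross_entropy:.6f}")  # agrees up to floating-point error
```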
Let $\mathcal{X} = \{A, B, C\}$. Let the target be bimodal: for instance $P(A) = 0.5$, $P(B) = 0.05$, $P(C) = 0.45$. Consider two approximations: a covering $Q_{\mathrm{cov}} = (0.5, 0.25, 0.25)$ that spreads mass over both modes, and a mode-seeking $Q_{\mathrm{seek}} = (0.9, 0.05, 0.05)$ that concentrates on A.
Compute forward KL for $Q_{\mathrm{cov}}$:

$$D_{\mathrm{KL}}(P \,\Vert\, Q_{\mathrm{cov}}) = 0.5 \ln \frac{0.5}{0.5} + 0.05 \ln \frac{0.05}{0.25} + 0.45 \ln \frac{0.45}{0.25}.$$

Plug in each term (nats):
$0.5 \ln 1 = 0$ and $0.05 \ln 0.2 \approx -0.080$.
So A+B contribute $\approx -0.080$.
$0.45 \ln 1.8 \approx 0.265$.
So C contributes $\approx 0.265$.
Total forward KL $\approx 0.184$.
Compute forward KL for $Q_{\mathrm{seek}}$:
Terms: $0.5 \ln \frac{0.5}{0.9} \approx -0.294$, $0.05 \ln 1 = 0$, $0.45 \ln 9 \approx 0.989$.
Total forward KL $\approx 0.695$ (much larger).
Now compute reverse KL for $Q_{\mathrm{seek}}$:
Reverse KL terms (nats): $0.9 \ln \frac{0.9}{0.5} \approx 0.529$, $0.05 \ln 1 = 0$, $0.05 \ln \frac{0.05}{0.45} \approx -0.110$.
Total reverse KL $\approx 0.419$.
Insight: Forward KL heavily punishes missing a mode that has large mass (the big $p(x) \log \frac{p(x)}{q(x)}$ term at the missed mode). Reverse KL weights by $q(x)$, so if $Q$ barely allocates mass to B and C, it barely ‘feels’ being wrong there, capturing the mode-seeking tendency.
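The mode-seeking contrast can be reproduced in a few lines (the three-outcome distributions here are illustrative choices of ours):

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p      = [0.50, 0.05, 0.45]  # bimodal target: most mass on A and C
q_cov  = [0.50, 0.25, 0.25]  # spreads mass to cover both modes
q_seek = [0.90, 0.05, 0.05]  # concentrates on mode A, nearly ignores C

print("forward KL, covering :", round(kl(p, q_cov), 4))
print("forward KL, seeking  :", round(kl(p, q_seek), 4))  # large: misses mode C
print("reverse KL, seeking  :", round(kl(q_seek, p), 4))  # smaller: Q rarely 'visits' C
```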
KL divergence is defined as an expectation under the first distribution: $D_{\mathrm{KL}}(P \,\Vert\, Q) = \mathbb{E}_{x \sim P}\!\left[\log \frac{p(x)}{q(x)}\right]$.
Operational meaning: it is the average extra log-loss (extra bits/nats per sample) you incur when using $Q$ to predict data generated by $P$.
KL is always nonnegative (Gibbs/Jensen), and equals 0 iff $P = Q$ (almost everywhere / on the support of $P$).
KL is asymmetric; swapping arguments changes which regions matter because the expectation reweights by $p(x)$ vs by $q(x)$.
Support mismatch causes infinite KL: if $q(x) = 0$ where $p(x) > 0$, then $D_{\mathrm{KL}}(P \,\Vert\, Q) = \infty$.
Forward KL ($D_{\mathrm{KL}}(P \,\Vert\, Q)$) tends to be mode-covering; reverse KL ($D_{\mathrm{KL}}(Q \,\Vert\, P)$) tends to be mode-seeking in approximations.
Cross-entropy decomposes as $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\Vert\, Q)$, making KL the mismatch term behind common training losses.
KL is a foundational building block under ELBOs (VAEs/variational inference) and mutual information (information bottleneck).
Treating KL as a symmetric distance and casually writing $D_{\mathrm{KL}}(P \,\Vert\, Q) = D_{\mathrm{KL}}(Q \,\Vert\, P)$.
Ignoring support: computing KL numerically without guarding against zeros in $q$ (leading to infinities or NaNs).
Forgetting the expectation is under $P$ (or under $Q$ for reverse KL), leading to incorrect intuition about which errors matter.
Confusing cross-entropy $H(P, Q)$ with entropy $H(P)$, or assuming minimizing cross-entropy always implies $Q$ matches $P$ without considering model capacity and estimation error.
Let $P = \mathrm{Bernoulli}(0.8)$ and $Q = \mathrm{Bernoulli}(0.5)$. Compute $D_{\mathrm{KL}}(P \,\Vert\, Q)$ in nats.
Hint: Use $D_{\mathrm{KL}}(P \,\Vert\, Q) = p \ln \frac{p}{q} + (1 - p) \ln \frac{1 - p}{1 - q}$.
Here $p = 0.8$, $q = 0.5$.
Compute pieces:
$\ln \frac{0.8}{0.5} = \ln 1.6 \approx 0.470$, so first term $\approx 0.8 \times 0.470 = 0.376$.
$\ln \frac{0.2}{0.5} = \ln 0.4 \approx -0.916$, so second term $\approx 0.2 \times (-0.916) = -0.183$.
Total $\approx 0.376 - 0.183 = 0.193$ nats.
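A two-line check of the arithmetic, using the illustrative values $p = 0.8$, $q = 0.5$:

```python
import math

p, q = 0.8, 0.5
kl = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
print(f"{kl:.4f} nats")  # about 0.193
```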
Prove (using Jensen’s inequality) that $D_{\mathrm{KL}}(P \,\Vert\, Q) \ge 0$ for discrete distributions with matching support.
Hint: Rewrite KL as $-\sum_x p(x) \log \frac{q(x)}{p(x)}$ and apply Jensen to the concave $\log$.
Start from

$$D_{\mathrm{KL}}(P \,\Vert\, Q) = -\sum_x p(x) \log \frac{q(x)}{p(x)}.$$

Since $\log$ is concave, Jensen gives

$$\sum_x p(x) \log \frac{q(x)}{p(x)} \le \log \left( \sum_x p(x) \frac{q(x)}{p(x)} \right) = \log \left( \sum_x q(x) \right) = \log 1 = 0.$$

Therefore $-\sum_x p(x) \log \frac{q(x)}{p(x)} \ge 0$, i.e. $D_{\mathrm{KL}}(P \,\Vert\, Q) \ge 0$.
Equality holds iff $\frac{q(x)}{p(x)}$ is constant over $x$ with $p(x) > 0$, which implies $P = Q$.
Consider a target distribution $P$ that is a mixture of two well-separated Gaussians in 1D. You approximate it with a single Gaussian family $Q = \mathcal{N}(\mu, \sigma^2)$. Qualitatively, which direction (forward KL vs reverse KL) is more likely to yield a large $\sigma$ covering both modes, and which is more likely to pick one mode? Explain using the ‘expectation under which distribution?’ idea.
Hint: Forward KL weights by $p(x)$ (penalizes missing where $P$ has mass). Reverse KL weights by $q(x)$ (penalizes placing mass where $p$ is small).
Forward KL is averaged over $P$. If $Q$ fails to assign decent density to either mode, then for many samples from the missed mode, $\log \frac{p(x)}{q(x)}$ becomes large, heavily penalizing the fit. This tends to push $Q$ toward a broader distribution (larger $\sigma$) that covers both modes.
Reverse KL is averaged over $Q$. If $Q$ concentrates around one mode, it rarely samples the other mode, so it does not ‘feel’ the penalty of missing it. Instead, it is strongly penalized for placing mass in the low-density valley between modes (where $p(x)$ is small), encouraging $Q$ to sit on one mode with smaller $\sigma$. This is the classic mode-seeking behavior.
Related next-step ideas you’ll likely want soon: symmetrized variants such as the Jensen–Shannon divergence, the broader family of f-divergences, and Wasserstein distances as alternatives when support mismatch makes KL infinite.