H(p,q) = H(p) + KL(p‖q). Common loss for classification.
Quick unlock: the prerequisites (entropy, KL divergence) take real investment, but the final step is simple. Verify the prerequisites first.
Cross-entropy is the “average surprise” your model feels when it sees real data. If your model assigns low probability to what actually happens, cross-entropy gets large—and training pushes the model to stop being surprised by the truth.
Cross-entropy H(p,q) = E_{x∼p}[−log q(x)] measures how well a model distribution q predicts outcomes generated by the true distribution p. It decomposes as H(p,q) = H(p) + KL(p‖q), so minimizing cross-entropy over q is equivalent to minimizing KL divergence to p (since H(p) is fixed). In classification with one-hot labels, cross-entropy becomes the negative log-likelihood of the correct class: −log q(y).
In information theory, we often ask: How many bits (or nats) do we need to encode outcomes?
H(p) = E_{x∼p}[−log p(x)]
Cross-entropy answers the key practical question:
If reality follows p but I encode/predict using q, what is my average cost?
That cost is the cross-entropy:
H(p,q) = E_{x∼p}[−log q(x)]
So cross-entropy is the expected negative log-probability that the model q assigns to samples drawn from the true distribution p.
When q(x) is small for outcomes that happen often under p, −log q(x) is large. Cross-entropy accumulates these penalties in proportion to how often the world produces each outcome.
As a sanity check, when q = p cross-entropy reduces to entropy: H(p,p) = E_{x∼p}[−log p(x)] = H(p)
For a discrete variable X taking values x ∈ 𝒳:
H(p,q) = −∑_{x∈𝒳} p(x) log q(x)
For continuous variables with densities p(x), q(x), we typically talk about cross-entropy of densities:
H(p,q) = E_{x∼p}[−log q(x)] = −∫ p(x) log q(x) dx
This behaves like differential entropy: it can be negative and depends on coordinate transformations. In ML classification, we’re nearly always in the discrete case (finite classes), so the discrete formula is the main one here.
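The discrete formula is a one-liner in code. A minimal sketch in plain Python (the function name `cross_entropy` is just illustrative); terms with p(x) = 0 are skipped, following the 0·log 0 = 0 convention:

```python
import math

def cross_entropy(p, q):
    """Discrete cross-entropy H(p, q) = -sum_x p(x) log q(x), in nats.

    Terms with p(x) == 0 contribute nothing (the 0 * log 0 convention).
    """
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]
print(cross_entropy(p, q))  # 1.75 * ln 2, about 1.2130 nats
print(cross_entropy(p, p))  # equals H(p) = 1.5 * ln 2, about 1.0397 nats
```

Switching `math.log` to `math.log2` would report the same quantity in bits instead of nats.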
The log base sets the unit.
| log base | unit | comment |
|---|---|---|
| log₂ | bits | common in coding theory |
| ln | nats | common in ML/optimization |
Most ML libraries use natural log, so cross-entropy is in nats.
A loss should be:
1) Small when predictions are good
2) Large when predictions are confidently wrong
3) Smooth enough to optimize (typically differentiable)
Cross-entropy does all three: −log q(y) approaches zero as q(y) → 1, grows without bound as q(y) → 0, and is differentiable wherever q is.
This makes cross-entropy the workhorse loss for probabilistic classification.
When training a model, we want q to get closer to p. But p is unknown. So we need a criterion that can be estimated from samples of p and that is minimized exactly when q = p.
The key identity:
H(p,q) = H(p) + KL(p‖q)
tells you cross-entropy equals the irreducible entropy of the data, H(p), plus the mismatch penalty, KL(p‖q).
Since H(p) doesn’t depend on q, minimizing cross-entropy over q is equivalent to minimizing KL(p‖q).
Start from definitions:
H(p,q) = −∑ p(x) log q(x)
H(p) = −∑ p(x) log p(x)
KL(p‖q) = ∑ p(x) log (p(x)/q(x))
Now expand KL:
KL(p‖q)
= ∑ p(x) [log p(x) − log q(x)]
= ∑ p(x) log p(x) − ∑ p(x) log q(x)
Rearrange to isolate −∑ p(x) log q(x):
−∑ p(x) log q(x) = −∑ p(x) log p(x) + KL(p‖q)
Recognize the left side as cross-entropy and the first term as entropy:
H(p,q) = H(p) + KL(p‖q)
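The identity is easy to check numerically. A small sketch (helper names are illustrative) that computes all three quantities for a pair of distributions:

```python
import math

def entropy(p):
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]
# H(p,q) = H(p) + KL(p||q), up to floating-point error
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12
```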
1) Lower bound: since KL(p‖q) ≥ 0,
H(p,q) ≥ H(p)
and equality holds iff p = q (almost everywhere).
2) What training can and cannot do: even with a perfect model family and infinite data, the best achievable cross-entropy is H(p). You cannot beat the intrinsic randomness of the world.
3) “Cross-entropy minimization” is KL minimization: many ML training objectives are really KL projections:
q* = argmin_q H(p,q)
= argmin_q KL(p‖q)
This is the information-theoretic justification for maximum likelihood and many probabilistic learning procedures.
In practice, we don’t know p, but we have samples x₁,…,xₙ ∼ p. Then:
H(p,q) = E_{x∼p}[−log q(x)]
is estimated by the empirical mean:
Ĥ(p,q) = (1/n) ∑_{i=1}^n [−log q(xᵢ)]
This is exactly the negative log-likelihood (NLL) per sample. Minimizing it is maximum likelihood estimation.
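This estimator is easy to see in action: draw samples from p and average −log q over them; the empirical NLL approaches H(p,q). A sketch (the sample size and random seed are arbitrary choices):

```python
import math
import random

random.seed(0)
p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]

# Empirical estimate: average -log q(x) over samples drawn from p.
n = 200_000
samples = random.choices(range(len(p)), weights=p, k=n)
nll = sum(-math.log(q[x]) for x in samples) / n

# Exact cross-entropy for comparison.
true_ce = -sum(px * math.log(qx) for px, qx in zip(p, q))
print(nll, true_ce)  # the two numbers agree to a few decimal places
```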
In binary classification and multiclass classification, cross-entropy is often called log-loss, binary cross-entropy (BCE), categorical cross-entropy, or negative log-likelihood (NLL).
They differ mostly in framing; mathematically they are the same objective (up to averaging vs summing).
If there exists x with p(x) > 0 but q(x) = 0, then −log q(x) is infinite, so:
H(p,q) = ∞
This is a feature, not a bug: assigning zero probability to an event that occurs is maximally bad under log-loss. In practice, this motivates smoothing, label smoothing, or ensuring q has full support (e.g., softmax always yields positive probabilities).
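In code, the standard defence is to clip probabilities away from zero before taking the log. A sketch (the function name and the epsilon value 1e-12 are illustrative choices, not a standard):

```python
import math

def safe_log_loss(q_true_class, eps=1e-12):
    """-log q with q clipped away from zero, so the loss stays finite."""
    return -math.log(max(q_true_class, eps))

print(safe_log_loss(0.0))  # about 27.6 nats instead of infinity
print(safe_log_loss(0.9))  # about 0.105 nats, unaffected by clipping
```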
Classification asks for probabilities over classes. If the true label is y, you want the model to assign high probability q(y|x). Cross-entropy turns that into a scalar penalty:
ℓ(x,y) = −log q(y|x)
This is exactly the penalty you’d derive from maximum likelihood for a categorical distribution.
Let classes be {1,…,K}.
For a single x, the conditional cross-entropy is:
H(p(·|x), q(·|x)) = −∑_{k=1}^K p(k|x) log q(k|x)
In supervised classification, we typically observe a single label y for each x, and we treat the “true” distribution as one-hot:
p(k|x) = 1 if k = y, else 0
Plugging in:
H(p(·|x), q(·|x))
= −∑_{k=1}^K p(k|x) log q(k|x)
= −log q(y|x)
So the multiclass cross-entropy loss per example is just the negative log probability of the correct class.
A common model outputs a vector of logits z ∈ ℝ^K. Softmax maps logits to probabilities:
q(k|x) = softmax(z)_k = exp(z_k) / ∑_{j=1}^K exp(z_j)
Then the loss for a one-hot label y is:
ℓ = −log q(y|x)
= −log( exp(z_y) / ∑_{j=1}^K exp(z_j) )
Now simplify:
ℓ
= −[z_y − log(∑_{j=1}^K exp(z_j))]
= −z_y + log(∑_{j=1}^K exp(z_j))
This form is numerically and conceptually important.
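Working directly from logits avoids materializing the probabilities at all. A minimal sketch of this form (function name illustrative, classes 0-indexed; this naive version does not yet guard against overflow for large logits, which the max-subtraction trick below fixes):

```python
import math

def softmax_xent_from_logits(z, y):
    """Cross-entropy loss -z_y + log(sum_j exp(z_j)) for true class index y."""
    return -z[y] + math.log(sum(math.exp(zj) for zj in z))

z = [2.0, 1.0, -1.0]
print(softmax_xent_from_logits(z, 1))  # loss when the true class is index 1
```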
Why do practitioners love this loss? Because the derivative is clean.
Let p̂ be the one-hot vector for the true class (so p̂_k = 1{k=y}). Let q be softmax(z).
For each class k,
∂ℓ/∂z_k = q_k − p̂_k
Sketch of the derivation (enough steps to see it):
ℓ = −z_y + log(∑_j exp(z_j))
For k ≠ y:
∂ℓ/∂z_k
= 0 + (1/(∑_j exp(z_j))) · exp(z_k)
= exp(z_k)/∑_j exp(z_j)
= q_k
For k = y:
∂ℓ/∂z_y
= −1 + exp(z_y)/∑_j exp(z_j)
= q_y − 1
Combine using p̂_k:
∂ℓ/∂z_k = q_k − p̂_k
This says: update is proportional to (predicted probability − target probability).
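The clean gradient is easy to verify against finite differences. A sketch (helper names illustrative, classes 0-indexed):

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]

def loss(z, y):
    return -math.log(softmax(z)[y])

z, y = [2.0, 1.0, -1.0], 1
q = softmax(z)
analytic = [qk - (1.0 if k == y else 0.0) for k, qk in enumerate(q)]

# Central finite differences on each logit should match q_k - p_hat_k.
eps = 1e-6
errs = []
for k in range(len(z)):
    zp, zm = z[:], z[:]
    zp[k] += eps
    zm[k] -= eps
    numeric = (loss(zp, y) - loss(zm, y)) / (2 * eps)
    errs.append(abs(numeric - analytic[k]))
print(max(errs))  # tiny: the analytic gradient matches
```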
For y ∈ {0,1}, and model probability q = σ(z) where σ is sigmoid:
σ(z) = 1/(1+exp(−z))
Binary cross-entropy per sample:
ℓ = −[ y log q + (1−y) log(1−q) ]
This matches Bernoulli negative log-likelihood.
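A minimal sketch of the binary case (function names illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, q):
    """Binary cross-entropy -[y log q + (1 - y) log(1 - q)], in nats."""
    return -(y * math.log(q) + (1 - y) * math.log(1 - q))

q = sigmoid(2.0)             # roughly 0.881
print(bce(1, q), bce(0, q))  # small loss if y=1, much larger if y=0
```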
Cross-entropy does more than choose the right class—it pushes the predicted probability on the true class toward 1. That can yield well-calibrated probabilities when the model is correct and not overly flexible, but also encourages high confidence (which can be problematic under dataset shift).
A quick comparison:
| objective | cares about | typical output |
|---|---|---|
| 0–1 loss | only correctness | class label |
| hinge loss | margin between classes | score/margin |
| cross-entropy | full probability of true class | probability distribution |
Cross-entropy is stricter than “just being correct”: predicting 0.51 on the true class is “correct” but still heavily penalized compared to 0.99.
Sometimes we don’t want one-hot targets because they make the model overconfident. Label smoothing replaces p̂ with a softened distribution:
p̃ = (1−ε)·p̂ + ε·u
where u is uniform (u_k = 1/K).
Then the loss is:
ℓ = −∑_{k=1}^K p̃_k log q_k
This prevents assigning infinite penalty-like gradients to small mistakes and often improves generalization and calibration.
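A sketch of the smoothed target and the resulting loss (function names illustrative; ε = 0.1 is a common default, not a rule):

```python
import math

def smoothed_targets(y, K, eps):
    """(1 - eps) * one-hot(y) + eps * uniform over K classes."""
    return [(1 - eps) * (1.0 if k == y else 0.0) + eps / K for k in range(K)]

def cross_entropy(p_target, q):
    return -sum(pk * math.log(qk) for pk, qk in zip(p_target, q) if pk > 0)

q = [0.7, 0.2, 0.1]
hard = smoothed_targets(0, 3, 0.0)  # plain one-hot target
soft = smoothed_targets(0, 3, 0.1)  # mass ~0.933 on class 0, ~0.033 elsewhere
print(cross_entropy(hard, q))  # -log 0.7, about 0.357 nats
print(cross_entropy(soft, q))  # slightly larger; discourages overconfidence
```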
Suppose you have a dataset D = {x₁,…,xₙ} drawn i.i.d. from p, and a parametric model q_θ.
The log-likelihood is:
log L(θ) = ∑_{i=1}^n log q_θ(xᵢ)
Maximizing log-likelihood is equivalent to minimizing negative log-likelihood:
NLL(θ) = −∑_{i=1}^n log q_θ(xᵢ)
Divide by n:
(1/n) NLL(θ) = (1/n) ∑ −log q_θ(xᵢ)
As n → ∞, this converges (under standard assumptions) to:
E_{x∼p}[−log q_θ(x)] = H(p, q_θ)
So MLE is choosing θ to minimize cross-entropy from p to q_θ.
In language modeling and other discrete-sequence models, a standard metric is perplexity.
If logs are natural:
perplexity = exp(H)
If logs are base 2:
perplexity = 2^{H}
Interpretation: perplexity is the “effective number of equally likely choices” the model faces. Lower perplexity means the model is less surprised.
For sequences x₁:T, models often factorize:
q(x₁:T) = ∏_{t=1}^T q(x_t | x_{<t})
Then:
−log q(x₁:T) = −∑_{t=1}^T log q(x_t | x_{<t})
Average per token cross-entropy:
(1/T) ∑_{t=1}^T [−log q(x_t | x_{<t})]
This is exactly what token-level training minimizes.
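Per-token cross-entropy and perplexity then fall out directly. A sketch (the function name is illustrative and the token probabilities are made up for the example):

```python
import math

def perplexity(token_probs):
    """exp of the mean per-token negative log-probability (natural logs)."""
    nll = sum(-math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 1/4 to every token has perplexity 4:
# it is as surprised as if it were choosing among 4 equally likely tokens.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```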
Cross-entropy / log-loss is a strictly proper scoring rule: the expected score is minimized when you predict the true distribution.
Informally: if you can choose q, and your goal is to minimize E_{x∼p}[−log q(x)], the best choice is q = p.
That’s a deep reason it is preferred over ad-hoc losses for probabilistic prediction.
Softmax cross-entropy often uses:
ℓ = −z_y + log(∑_j exp(z_j))
Directly computing exp(z_j) can overflow if z_j is large. The standard trick subtracts m = max_j z_j:
log(∑_j exp(z_j))
= log(∑_j exp(z_j − m) · exp(m))
= m + log(∑_j exp(z_j − m))
This keeps numbers stable.
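A sketch of the stabilized computation (the 1000.0 logits are chosen specifically to make the naive version overflow):

```python
import math

def logsumexp(z):
    """log(sum_j exp(z_j)) computed stably by subtracting the max logit."""
    m = max(z)
    return m + math.log(sum(math.exp(zi - m) for zi in z))

big = [1000.0, 1000.0]
# math.log(sum(math.exp(zi) for zi in big)) would raise OverflowError;
# the shifted version works because exp(z_i - m) is at most 1.
print(logsumexp(big))  # 1000 + log 2, about 1000.693
```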
Cross-entropy assumes your labels represent draws from a categorical/Bernoulli distribution and that you care about probabilistic accuracy.
It can be mismatched when labels are noisy or ambiguous, when misclassification costs are asymmetric, or when you only care about ranking or the top prediction rather than calibrated probabilities.
Even then, cross-entropy is often a good default because it is smooth and statistically grounded, but it’s worth recognizing the assumption.
The identity
H(p,q) = H(p) + KL(p‖q)
says: your average code length using code optimized for q equals the optimal code length (H(p)) plus an inefficiency penalty (KL). In ML terms: you can’t remove inherent uncertainty, but you can remove mismatch by improving q.
This “coding view” is not just philosophy—it’s why log-loss is so ubiquitous across compression, density estimation, and modern neural modeling.
Let 𝒳 = {a,b,c}. True distribution p = (0.5, 0.25, 0.25). Model distribution q = (0.25, 0.5, 0.25). Use natural logs (nats). Compute H(p), H(p,q), KL(p‖q), and verify H(p,q) = H(p) + KL(p‖q).
Compute entropy:
H(p) = −∑ p(x) log p(x)
= −[0.5 log 0.5 + 0.25 log 0.25 + 0.25 log 0.25].
Use log identities:
log 0.5 = −log 2,
log 0.25 = log(1/4) = −log 4 = −2 log 2.
Plug in:
H(p) = −[0.5(−log 2) + 0.25(−2 log 2) + 0.25(−2 log 2)]
= 0.5 log 2 + 0.5 log 2 + 0.5 log 2
= 1.5 log 2.
So H(p) ≈ 1.5 × 0.6931 = 1.0397 nats.
Compute cross-entropy:
H(p,q) = −∑ p(x) log q(x)
= −[0.5 log 0.25 + 0.25 log 0.5 + 0.25 log 0.25].
Substitute logs:
log 0.25 = −2 log 2,
log 0.5 = −log 2.
So:
H(p,q) = −[0.5(−2 log 2) + 0.25(−log 2) + 0.25(−2 log 2)]
Simplify:
H(p,q) = (1.0 log 2) + (0.25 log 2) + (0.5 log 2)
= 1.75 log 2
≈ 1.75 × 0.6931 = 1.2129 nats.
Compute KL:
KL(p‖q) = ∑ p(x) log(p(x)/q(x))
= 0.5 log(0.5/0.25) + 0.25 log(0.25/0.5) + 0.25 log(0.25/0.25).
Compute ratios:
0.5/0.25 = 2,
0.25/0.5 = 0.5,
0.25/0.25 = 1.
So:
KL = 0.5 log 2 + 0.25 log 0.5 + 0.25 log 1.
Replace logs:
log 0.5 = −log 2, log 1 = 0.
KL = 0.5 log 2 − 0.25 log 2 + 0
= 0.25 log 2
≈ 0.1733 nats.
Verify decomposition:
H(p) + KL = (1.5 log 2) + (0.25 log 2) = 1.75 log 2 = H(p,q).
Verified.
Insight: Cross-entropy exceeds entropy by exactly the KL penalty for using the wrong distribution. Here the mismatch is modest (KL ≈ 0.173), so H(p,q) is only slightly larger than H(p).
A 3-class classifier outputs logits z = (2, 1, −1). The true class is y = 2 (1-indexed). Compute softmax probabilities q and the cross-entropy loss ℓ = −log q(y). Use natural logs.
Compute exp of logits:
exp(2) ≈ 7.389,
exp(1) ≈ 2.718,
exp(−1) ≈ 0.368.
Compute normalization constant:
S = ∑ exp(z_j) = 7.389 + 2.718 + 0.368 = 10.475.
Compute softmax probabilities:
q₁ = 7.389 / 10.475 ≈ 0.705,
q₂ = 2.718 / 10.475 ≈ 0.259,
q₃ = 0.368 / 10.475 ≈ 0.035.
Compute cross-entropy for y = 2:
ℓ = −log q₂ = −log(0.259).
Numerical value:
log(0.259) ≈ −1.351,
so ℓ ≈ 1.351 nats.
Insight: Even though class 2 has the second-highest logit, the model assigns it only ~0.259 probability, so the loss is substantial. Cross-entropy is sensitive to probability mass, not just rank.
Binary label y ∈ {0,1}. Compare two predictions for a positive example y=1: (A) q=0.9 and (B) q=0.01. Compute ℓ = −[ y log q + (1−y) log(1−q) ].
For y = 1, the loss reduces to:
ℓ = −log q.
Case A: q = 0.9
ℓ_A = −log(0.9)
≈ 0.1053 nats.
Case B: q = 0.01
ℓ_B = −log(0.01)
= −log(10^{−2})
= 2 log 10
≈ 2 × 2.3026
= 4.6052 nats.
Insight: Cross-entropy punishes confident wrong predictions dramatically more than mildly wrong ones. This is why it tends to improve probability calibration (but can also encourage overconfidence if the model is misspecified).
Cross-entropy is H(p,q) = E_{x∼p}[−log q(x)]: expected negative log-probability assigned by q to data from p.
It decomposes as H(p,q) = H(p) + KL(p‖q), so minimizing cross-entropy over q is equivalent to minimizing KL divergence to p.
Cross-entropy is always ≥ H(p); the gap is exactly the KL mismatch penalty.
In multiclass classification with one-hot labels, the per-example cross-entropy is −log q(y|x), i.e., negative log-likelihood of the correct class.
With softmax logits z, ℓ = −z_y + log(∑ exp(z_j)), enabling stable implementations via the log-sum-exp trick.
The gradient of softmax cross-entropy w.r.t. logits is q_k − p̂_k, giving a clean “prediction minus target” update signal.
If q assigns zero probability to an event that can occur under p, cross-entropy becomes infinite—models should maintain support (or use smoothing).
Perplexity is exp(cross-entropy) (or 2^{cross-entropy} with log₂), widely used in language modeling.
Mixing up H(p,q) and H(q,p): cross-entropy is not symmetric; the expectation is taken under p (the data-generating distribution).
Thinking cross-entropy measures only classification accuracy: it measures probability quality, heavily penalizing confident errors.
Ignoring numerical stability: computing softmax then log can overflow/underflow; use a combined stable softmax-cross-entropy or log-sum-exp trick.
Using one-hot targets when labels are uncertain without considering label smoothing or probabilistic targets; this can lead to overconfidence.
Let p = (0.7, 0.2, 0.1) and q = (0.6, 0.3, 0.1). Compute H(p,q) in nats.
Hint: Use H(p,q) = −∑ p_i log q_i. Keep the 0.1 term simple since q₃ = 0.1.
H(p,q) = −[0.7 log 0.6 + 0.2 log 0.3 + 0.1 log 0.1].
Numerically:
log 0.6 ≈ −0.5108 ⇒ −0.7 log 0.6 ≈ 0.3576
log 0.3 ≈ −1.2040 ⇒ −0.2 log 0.3 ≈ 0.2408
log 0.1 ≈ −2.3026 ⇒ −0.1 log 0.1 ≈ 0.2303
Total ≈ 0.3576 + 0.2408 + 0.2303 = 0.8287 nats.
Show that for one-hot multiclass labels, cross-entropy equals negative log-likelihood of the correct class. Start from H(p(·|x), q(·|x)) = −∑_k p(k|x) log q(k|x).
Hint: One-hot means p(k|x) is zero for all k except the true class y, where it equals 1.
If the observed true class is y, then p(k|x) = 1{k=y}.
So:
H(p(·|x), q(·|x))
= −∑_{k=1}^K 1{k=y} log q(k|x)
= −log q(y|x).
This is exactly the negative log-likelihood of observing y under categorical probabilities q(·|x).
Let softmax probabilities be q = (0.2, 0.5, 0.3) and the true class is y=3. Compute the gradient ∂ℓ/∂z for softmax cross-entropy, where ℓ = −log q(y).
Hint: Use ∂ℓ/∂z_k = q_k − p̂_k where p̂ is the one-hot vector for y.
True class y=3 ⇒ one-hot p̂ = (0,0,1).
Then componentwise:
∂ℓ/∂z₁ = q₁ − 0 = 0.2
∂ℓ/∂z₂ = q₂ − 0 = 0.5
∂ℓ/∂z₃ = q₃ − 1 = 0.3 − 1 = −0.7
So ∂ℓ/∂z = (0.2, 0.5, −0.7).
Next, connect this node to how we choose objectives in practice: Loss Functions. Related foundations: entropy and KL divergence (prerequisites) are the building blocks for understanding why cross-entropy is a principled training objective.