H(p,q) = H(p) + KL(p‖q). Common loss for classification.
Quick unlock: the prerequisites (entropy, KL divergence) take real investment, but the final step is simple. Verify the prerequisites first.
Cross-entropy is the “average surprise” your model feels when it sees real data. If your model assigns low probability to what actually happens, cross-entropy gets large—and training pushes the model to stop being surprised by the truth.
Cross-entropy H(p,q) = E_{x∼p}[−log q(x)] measures how well a model distribution q predicts outcomes generated by the true distribution p. It decomposes as H(p,q) = H(p) + KL(p‖q), so minimizing cross-entropy over q is equivalent to minimizing KL divergence to p (since H(p) is fixed). In classification with one-hot labels, cross-entropy becomes the negative log-likelihood of the correct class: −log q(y).
In information theory, we often ask: How many bits (or nats) do we need to encode outcomes?
H(p) = E_{x∼p}[−log p(x)]
Cross-entropy answers the key practical question:
If reality follows p but I encode/predict using q, what is my average cost?
That cost is the cross-entropy:
H(p,q) = E_{x∼p}[−log q(x)]
So cross-entropy is the expected negative log-probability that the model q assigns to samples drawn from the true distribution p.
When q(x) is small for outcomes that happen often under p, −log q(x) is large. Cross-entropy accumulates these penalties in proportion to how often the world produces each outcome.
As a sanity check, when q = p cross-entropy reduces to entropy: H(p,p) = E_{x∼p}[−log p(x)] = H(p)
For a discrete variable X taking values x ∈ 𝒳:
H(p,q) = −∑_{x∈𝒳} p(x) log q(x)
For continuous variables with densities p(x), q(x), we typically talk about cross-entropy of densities:
H(p,q) = E_{x∼p}[−log q(x)] = −∫ p(x) log q(x) dx
This behaves like differential entropy: it can be negative and depends on coordinate transformations. In ML classification, we’re nearly always in the discrete case (finite classes), so the discrete formula is the main one here.
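The discrete formula is a one-liner in code. A minimal sketch in plain Python (the function name `cross_entropy` is just illustrative); terms with p(x) = 0 are skipped, following the 0·log 0 = 0 convention:

```python
import math

def cross_entropy(p, q):
    """Discrete cross-entropy H(p, q) = -sum_x p(x) log q(x), in nats.

    Terms with p(x) == 0 contribute nothing (the 0 * log 0 convention).
    """
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]
print(cross_entropy(p, q))  # 1.75 * ln 2, about 1.2130 nats
print(cross_entropy(p, p))  # equals H(p) = 1.5 * ln 2, about 1.0397 nats
```

Switching `math.log` to `math.log2` would report the same quantity in bits instead of nats.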
The log base sets the unit.
| log base | unit | comment |
|---|---|---|
| log₂ | bits | common in coding theory |
| ln | nats | common in ML/optimization |
Most ML libraries use natural log, so cross-entropy is in nats.
A loss should be:
1) Small when predictions are good
2) Large when predictions are confidently wrong
3) Smooth enough to optimize (typically differentiable)
Cross-entropy does all three: −log q(y) approaches zero as q(y) → 1, grows without bound as q(y) → 0, and is differentiable wherever q is.
This makes cross-entropy the workhorse loss for probabilistic classification.
When training a model, we want q to get closer to p. But p is unknown. So we need a criterion that can be estimated from samples of p and that is minimized exactly when q = p.
The key identity:
H(p,q) = H(p) + KL(p‖q)
tells you cross-entropy equals the irreducible entropy of the data, H(p), plus the mismatch penalty, KL(p‖q).
Since H(p) doesn’t depend on q, minimizing cross-entropy over q is equivalent to minimizing KL(p‖q).
Start from definitions:
H(p,q) = −∑ p(x) log q(x)
H(p) = −∑ p(x) log p(x)
KL(p‖q) = ∑ p(x) log (p(x)/q(x))
Now expand KL:
KL(p‖q)
= ∑ p(x) [log p(x) − log q(x)]
= ∑ p(x) log p(x) − ∑ p(x) log q(x)
Rearrange to isolate −∑ p(x) log q(x):
−∑ p(x) log q(x) = −∑ p(x) log p(x) + KL(p‖q)
Recognize the left side as cross-entropy and the first term as entropy:
H(p,q) = H(p) + KL(p‖q)
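The identity is easy to check numerically. A small sketch (helper names are illustrative) that computes all three quantities for a pair of distributions:

```python
import math

def entropy(p):
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]
# H(p,q) = H(p) + KL(p||q), up to floating-point error
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12
```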
1) Lower bound: since KL(p‖q) ≥ 0,
H(p,q) ≥ H(p)
and equality holds iff p = q (almost everywhere).
2) What training can and cannot do: even with a perfect model family and infinite data, the best achievable cross-entropy is H(p). You cannot beat the intrinsic randomness of the world.
3) “Cross-entropy minimization” is KL minimization: many ML training objectives are really KL projections:
q* = argmin_q H(p,q)
= argmin_q KL(p‖q)
This is the information-theoretic justification for maximum likelihood and many probabilistic learning procedures.
In practice, we don’t know p, but we have samples x₁,…,xₙ ∼ p. Then:
H(p,q) = E_{x∼p}[−log q(x)]
is estimated by the empirical mean:
Ĥ(p,q) = (1/n) ∑_{i=1}^n [−log q(xᵢ)]
This is exactly the negative log-likelihood (NLL) per sample. Minimizing it is maximum likelihood estimation.
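This estimator is easy to see in action: draw samples from p and average −log q over them; the empirical NLL approaches H(p,q). A sketch (the sample size and random seed are arbitrary choices):

```python
import math
import random

random.seed(0)
p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]

# Empirical estimate: average -log q(x) over samples drawn from p.
n = 200_000
samples = random.choices(range(len(p)), weights=p, k=n)
nll = sum(-math.log(q[x]) for x in samples) / n

# Exact cross-entropy for comparison.
true_ce = -sum(px * math.log(qx) for px, qx in zip(p, q))
print(nll, true_ce)  # the two numbers agree to a few decimal places
```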
In binary classification and multiclass classification, cross-entropy is often called log-loss, binary cross-entropy (BCE), categorical cross-entropy, or negative log-likelihood (NLL).
They differ mostly in framing; mathematically they are the same objective (up to averaging vs summing).
If there exists x with p(x) > 0 but q(x) = 0, then −log q(x) is infinite, so:
H(p,q) = ∞
This is a feature, not a bug: assigning zero probability to an event that occurs is maximally bad under log-loss. In practice, this motivates smoothing, label smoothing, or ensuring q has full support (e.g., softmax always yields positive probabilities).
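In code, the standard defence is to clip probabilities away from zero before taking the log. A sketch (the function name and the epsilon value 1e-12 are illustrative choices, not a standard):

```python
import math

def safe_log_loss(q_true_class, eps=1e-12):
    """-log q with q clipped away from zero, so the loss stays finite."""
    return -math.log(max(q_true_class, eps))

print(safe_log_loss(0.0))  # about 27.6 nats instead of infinity
print(safe_log_loss(0.9))  # about 0.105 nats, unaffected by clipping
```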
Classification asks for probabilities over classes. If the true label is y, you want the model to assign high probability q(y|x). Cross-entropy turns that into a scalar penalty:
ℓ(x,y) = −log q(y|x)
This is exactly the penalty you’d derive from maximum likelihood for a categorical distribution.
Let classes be {1,…,K}.
For a single x, the conditional cross-entropy is:
H(p(·|x), q(·|x)) = −∑_{k=1}^K p(k|x) log q(k|x)
In supervised classification, we typically observe a single label y for each x, and we treat the “true” distribution as one-hot:
p(k|x) = 1 if k = y, else 0
Plugging in:
H(p(·|x), q(·|x))
= −∑_{k=1}^K p(k|x) log q(k|x)
= −log q(y|x)
So the multiclass cross-entropy loss per example is just the negative log probability of the correct class.
A common model outputs a vector of logits z ∈ ℝ^K. Softmax maps logits to probabilities:
q(k|x) = softmax(z)_k = exp(z_k) / ∑_{j=1}^K exp(z_j)
Then the loss for a one-hot label y is:
ℓ = −log q(y|x)
= −log( exp(z_y) / ∑_{j=1}^K exp(z_j) )
Now simplify:
ℓ
= −[z_y − log(∑_{j=1}^K exp(z_j))]
= −z_y + log(∑_{j=1}^K exp(z_j))
This form is numerically and conceptually important.
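Working directly from logits avoids materializing the probabilities at all. A minimal sketch of this form (function name illustrative, classes 0-indexed; this naive version does not yet guard against overflow for large logits, which the max-subtraction trick below fixes):

```python
import math

def softmax_xent_from_logits(z, y):
    """Cross-entropy loss -z_y + log(sum_j exp(z_j)) for true class index y."""
    return -z[y] + math.log(sum(math.exp(zj) for zj in z))

z = [2.0, 1.0, -1.0]
print(softmax_xent_from_logits(z, 1))  # loss when the true class is index 1
```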
Why do practitioners love this loss? Because the derivative is clean.
Let p̂ be the one-hot vector for the true class (so p̂_k = 1{k=y}). Let q be softmax(z).
For each class k,
∂ℓ/∂z_k = q_k − p̂_k
Sketch of the derivation (enough steps to see it):
ℓ = −z_y + log(∑_j exp(z_j))
For k ≠ y:
∂ℓ/∂z_k
= 0 + (1/(∑_j exp(z_j))) · exp(z_k)
= exp(z_k)/∑_j exp(z_j)
= q_k
For k = y:
∂ℓ/∂z_y
= −1 + exp(z_y)/∑_j exp(z_j)
= q_y − 1
Combine using p̂_k:
∂ℓ/∂z_k = q_k − p̂_k
This says: update is proportional to (predicted probability − target probability).
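The clean gradient is easy to verify against finite differences. A sketch (helper names illustrative, classes 0-indexed):

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]

def loss(z, y):
    return -math.log(softmax(z)[y])

z, y = [2.0, 1.0, -1.0], 1
q = softmax(z)
analytic = [qk - (1.0 if k == y else 0.0) for k, qk in enumerate(q)]

# Central finite differences on each logit should match q_k - p_hat_k.
eps = 1e-6
errs = []
for k in range(len(z)):
    zp, zm = z[:], z[:]
    zp[k] += eps
    zm[k] -= eps
    numeric = (loss(zp, y) - loss(zm, y)) / (2 * eps)
    errs.append(abs(numeric - analytic[k]))
print(max(errs))  # tiny: the analytic gradient matches
```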
For y ∈ {0,1}, and model probability q = σ(z) where σ is sigmoid:
σ(z) = 1/(1+exp(−z))
Binary cross-entropy per sample:
ℓ = −[ y log q + (1−y) log(1−q) ]
This matches Bernoulli negative log-likelihood.
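A minimal sketch of the binary case (function names illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, q):
    """Binary cross-entropy -[y log q + (1 - y) log(1 - q)], in nats."""
    return -(y * math.log(q) + (1 - y) * math.log(1 - q))

q = sigmoid(2.0)             # roughly 0.881
print(bce(1, q), bce(0, q))  # small loss if y=1, much larger if y=0
```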
Cross-entropy does more than choose the right class—it pushes the predicted probability on the true class toward 1. That can yield well-calibrated probabilities when the model is correct and not overly flexible, but also encourages high confidence (which can be problematic under dataset shift).
A quick comparison:
| objective | cares about | typical output |
|---|---|---|
| 0–1 loss | only correctness | class label |
| hinge loss | margin between classes | score/margin |
| cross-entropy | full probability of true class | probability distribution |
Cross-entropy is stricter than “just being correct”: predicting 0.51 on the true class is “correct” but still heavily penalized compared to 0.99.
Sometimes we don’t want one-hot targets because they make the model overconfident. Label smoothing replaces p̂ with a softened distribution:
p̃ = (1−ε)·p̂ + ε·u
where u is uniform (u_k = 1/K).
Then the loss is:
ℓ = −∑_{k=1}^K p̃_k log q_k
This prevents assigning infinite penalty-like gradients to small mistakes and often improves generalization and calibration.
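A sketch of the smoothed target and the resulting loss (function names illustrative; ε = 0.1 is a common default, not a rule):

```python
import math

def smoothed_targets(y, K, eps):
    """(1 - eps) * one-hot(y) + eps * uniform over K classes."""
    return [(1 - eps) * (1.0 if k == y else 0.0) + eps / K for k in range(K)]

def cross_entropy(p_target, q):
    return -sum(pk * math.log(qk) for pk, qk in zip(p_target, q) if pk > 0)

q = [0.7, 0.2, 0.1]
hard = smoothed_targets(0, 3, 0.0)  # plain one-hot target
soft = smoothed_targets(0, 3, 0.1)  # mass ~0.933 on class 0, ~0.033 elsewhere
print(cross_entropy(hard, q))  # -log 0.7, about 0.357 nats
print(cross_entropy(soft, q))  # slightly larger; discourages overconfidence
```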
Suppose you have a dataset D = {x₁,…,xₙ} drawn i.i.d. from p, and a parametric model q_θ.
The log-likelihood is:
log L(θ) = ∑_{i=1}^n log q_θ(xᵢ)
Maximizing log-likelihood is equivalent to minimizing negative log-likelihood:
NLL(θ) = −∑_{i=1}^n log q_θ(xᵢ)
Divide by n:
(1/n) NLL(θ) = (1/n) ∑ −log q_θ(xᵢ)
As n → ∞, this converges (under standard assumptions) to:
E_{x∼p}[−log q_θ(x)] = H(p, q_θ)
So MLE is choosing θ to minimize cross-entropy from p to q_θ.
In language modeling and other discrete-sequence models, a standard metric is perplexity.
If logs are natural:
perplexity = exp(H)
If logs are base 2:
perplexity = 2^{H}
Interpretation: perplexity is the “effective number of equally likely choices” the model faces. Lower perplexity means the model is less surprised.
For sequences x₁:T, models often factorize:
q(x₁:T) = ∏_{t=1}^T q(x_t | x_{<t})
Then:
−log q(x₁:T) = −∑_{t=1}^T log q(x_t | x_{<t})
Average per token cross-entropy:
(1/T) ∑_{t=1}^T [−log q(x_t | x_{<t})]
This is exactly what token-level training minimizes.
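Per-token cross-entropy and perplexity then fall out directly. A sketch (the function name is illustrative and the token probabilities are made up for the example):

```python
import math

def perplexity(token_probs):
    """exp of the mean per-token negative log-probability (natural logs)."""
    nll = sum(-math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 1/4 to every token has perplexity 4:
# it is as surprised as if it were choosing among 4 equally likely tokens.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```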
Cross-entropy / log-loss is a strictly proper scoring rule: the expected score is minimized when you predict the true distribution.
Informally: if you can choose q, and your goal is to minimize E_{x∼p}[−log q(x)], the best choice is q = p.
That’s a deep reason it is preferred over ad-hoc losses for probabilistic prediction.
Softmax cross-entropy often uses:
ℓ = −z_y + log(∑_j exp(z_j))
Directly computing exp(z_j) can overflow if z_j is large. The standard trick subtracts m = max_j z_j:
log(∑_j exp(z_j))
= log(∑_j exp(z_j − m) · exp(m))
= m + log(∑_j exp(z_j − m))
This keeps numbers stable.
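A sketch of the stabilized computation (the 1000.0 logits are chosen specifically to make the naive version overflow):

```python
import math

def logsumexp(z):
    """log(sum_j exp(z_j)) computed stably by subtracting the max logit."""
    m = max(z)
    return m + math.log(sum(math.exp(zi - m) for zi in z))

big = [1000.0, 1000.0]
# math.log(sum(math.exp(zi) for zi in big)) would raise OverflowError;
# the shifted version works because exp(z_i - m) is at most 1.
print(logsumexp(big))  # 1000 + log 2, about 1000.693
```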
Cross-entropy assumes your labels represent draws from a categorical/Bernoulli distribution and that you care about probabilistic accuracy.
It can be mismatched when labels are noisy or ambiguous, when misclassification costs are asymmetric, or when you only care about ranking or the top prediction rather than calibrated probabilities.
Even then, cross-entropy is often a good default because it is smooth and statistically grounded, but it’s worth recognizing the assumption.
The identity
H(p,q) = H(p) + KL(p‖q)
says: your average code length using code optimized for q equals the optimal code length (H(p)) plus an inefficiency penalty (KL). In ML terms: you can’t remove inherent uncertainty, but you can remove mismatch by improving q.
This “coding view” is not just philosophy—it’s why log-loss is so ubiquitous across compression, density estimation, and modern neural modeling.
Let 𝒳 = {a,b,c}. True distribution p = (0.5, 0.25, 0.25). Model distribution q = (0.25, 0.5, 0.25). Use natural logs (nats). Compute H(p), H(p,q), KL(p‖q), and verify H(p,q) = H(p) + KL(p‖q).
Compute entropy:
H(p) = −∑ p(x) log p(x)
= −[0.5 log 0.5 + 0.25 log 0.25 + 0.25 log 0.25].
Use log identities:
log 0.5 = −log 2,
log 0.25 = log(1/4) = −log 4 = −2 log 2.
Plug in:
H(p) = −[0.5(−log 2) + 0.25(−2 log 2) + 0.25(−2 log 2)]
= 0.5 log 2 + 0.5 log 2 + 0.5 log 2
= 1.5 log 2.
So H(p) ≈ 1.5 × 0.6931 = 1.0397 nats.
Compute cross-entropy:
H(p,q) = −∑ p(x) log q(x)
= −[0.5 log 0.25 + 0.25 log 0.5 + 0.25 log 0.25].
Substitute logs:
log 0.25 = −2 log 2,
log 0.5 = −log 2.
So:
H(p,q) = −[0.5(−2 log 2) + 0.25(−log 2) + 0.25(−2 log 2)]
Simplify:
H(p,q) = (1.0 log 2) + (0.25 log 2) + (0.5 log 2)
= 1.75 log 2
≈ 1.75 × 0.6931 = 1.2129 nats.
Compute KL:
KL(p‖q) = ∑ p(x) log(p(x)/q(x))
= 0.5 log(0.5/0.25) + 0.25 log(0.25/0.5) + 0.25 log(0.25/0.25).
Compute ratios:
0.5/0.25 = 2,
0.25/0.5 = 0.5,
0.25/0.25 = 1.
So:
KL = 0.5 log 2 + 0.25 log 0.5 + 0.25 log 1.
Replace logs:
log 0.5 = −log 2, log 1 = 0.
KL = 0.5 log 2 − 0.25 log 2 + 0
= 0.25 log 2
≈ 0.1733 nats.
Verify decomposition:
H(p) + KL = (1.5 log 2) + (0.25 log 2) = 1.75 log 2 = H(p,q).
Verified.
Insight: Cross-entropy exceeds entropy by exactly the KL penalty for using the wrong distribution. Here the mismatch is modest (KL ≈ 0.173), so H(p,q) is only slightly larger than H(p).
A 3-class classifier outputs logits z = (2, 1, −1). The true class is y = 2 (1-indexed). Compute softmax probabilities q and the cross-entropy loss ℓ = −log q(y). Use natural logs.
Compute exp of logits:
exp(2) ≈ 7.389,
exp(1) ≈ 2.718,
exp(−1) ≈ 0.368.
Compute normalization constant:
S = ∑ exp(z_j) = 7.389 + 2.718 + 0.368 = 10.475.
Compute softmax probabilities:
q₁ = 7.389 / 10.475 ≈ 0.705,
q₂ = 2.718 / 10.475 ≈ 0.259,
q₃ = 0.368 / 10.475 ≈ 0.035.
Compute cross-entropy for y = 2:
ℓ = −log q₂ = −log(0.259).
Numerical value:
log(0.259) ≈ −1.351,
so ℓ ≈ 1.351 nats.
Insight: Even though class 2 has the second-highest logit, the model assigns it only ~0.259 probability, so the loss is substantial. Cross-entropy is sensitive to probability mass, not just rank.
Binary label y ∈ {0,1}. Compare two predictions for a positive example y=1: (A) q=0.9 and (B) q=0.01. Compute ℓ = −[ y log q + (1−y) log(1−q) ].
For y = 1, the loss reduces to:
ℓ = −log q.
Case A: q = 0.9
ℓ_A = −log(0.9)
≈ 0.1053 nats.
Case B: q = 0.01
ℓ_B = −log(0.01)
= −log(10^{−2})
= 2 log 10
≈ 2 × 2.3026
= 4.6052 nats.
Insight: Cross-entropy punishes confident wrong predictions dramatically more than mildly wrong ones. This is why it tends to improve probability calibration (but can also encourage overconfidence if the model is misspecified).
Cross-entropy is H(p,q) = E_{x∼p}[−log q(x)]: expected negative log-probability assigned by q to data from p.
It decomposes as H(p,q) = H(p) + KL(p‖q), so minimizing cross-entropy over q is equivalent to minimizing KL divergence to p.
Cross-entropy is always ≥ H(p); the gap is exactly the KL mismatch penalty.
In multiclass classification with one-hot labels, the per-example cross-entropy is −log q(y|x), i.e., negative log-likelihood of the correct class.
With softmax logits z, ℓ = −z_y + log(∑ exp(z_j)), enabling stable implementations via the log-sum-exp trick.
The gradient of softmax cross-entropy w.r.t. logits is q_k − p̂_k, giving a clean “prediction minus target” update signal.
If q assigns zero probability to an event that can occur under p, cross-entropy becomes infinite—models should maintain support (or use smoothing).
Perplexity is exp(cross-entropy) (or 2^{cross-entropy} with log₂), widely used in language modeling.
Mixing up H(p,q) and H(q,p): cross-entropy is not symmetric; the expectation is taken under p (the data-generating distribution).
Thinking cross-entropy measures only classification accuracy: it measures probability quality, heavily penalizing confident errors.
Ignoring numerical stability: computing softmax then log can overflow/underflow; use a combined stable softmax-cross-entropy or log-sum-exp trick.
Using one-hot targets when labels are uncertain without considering label smoothing or probabilistic targets; this can lead to overconfidence.
Let p = (0.7, 0.2, 0.1) and q = (0.6, 0.3, 0.1). Compute H(p,q) in nats.
Hint: Use H(p,q) = −∑ p_i log q_i. Keep the 0.1 term simple since q₃ = 0.1.
H(p,q) = −[0.7 log 0.6 + 0.2 log 0.3 + 0.1 log 0.1].
Numerically:
log 0.6 ≈ −0.5108 ⇒ −0.7 log 0.6 ≈ 0.3576
log 0.3 ≈ −1.2040 ⇒ −0.2 log 0.3 ≈ 0.2408
log 0.1 ≈ −2.3026 ⇒ −0.1 log 0.1 ≈ 0.2303
Total ≈ 0.3576 + 0.2408 + 0.2303 = 0.8287 nats.
Show that for one-hot multiclass labels, cross-entropy equals negative log-likelihood of the correct class. Start from H(p(·|x), q(·|x)) = −∑_k p(k|x) log q(k|x).
Hint: One-hot means p(k|x) is zero for all k except the true class y, where it equals 1.
If the observed true class is y, then p(k|x) = 1{k=y}.
So:
H(p(·|x), q(·|x))
= −∑_{k=1}^K 1{k=y} log q(k|x)
= −log q(y|x).
This is exactly the negative log-likelihood of observing y under categorical probabilities q(·|x).
Let softmax probabilities be q = (0.2, 0.5, 0.3) and the true class is y=3. Compute the gradient ∂ℓ/∂z for softmax cross-entropy, where ℓ = −log q(y).
Hint: Use ∂ℓ/∂z_k = q_k − p̂_k where p̂ is the one-hot vector for y.
True class y=3 ⇒ one-hot p̂ = (0,0,1).
Then componentwise:
∂ℓ/∂z₁ = q₁ − 0 = 0.2
∂ℓ/∂z₂ = q₂ − 0 = 0.5
∂ℓ/∂z₃ = q₃ − 1 = 0.3 − 1 = −0.7
So ∂ℓ/∂z = (0.2, 0.5, −0.7).
Next, connect this node to how we choose objectives in practice: Loss Functions. Related foundations: entropy and KL divergence (prerequisites) are the building blocks for understanding why cross-entropy is a principled training objective.