A measure of uncertainty/information content: H(X) = −∑ p log p.
Self-serve tutorial - low prerequisites, straightforward concepts.
Why does a fair coin flip feel like “one bit” of uncertainty, while a coin that lands heads 99% of the time feels like almost none? Entropy is the mathematical tool that turns that intuition into a precise number—and it becomes the currency for compression, prediction, and learning.
Entropy H(X) measures the average uncertainty (average “surprise”) of a random variable X. For discrete X with probabilities p(x), surprisal is I(x) = −log p(x), and entropy is the expected surprisal: H(X) = E[I(X)] = −∑ₓ p(x) log p(x). The log base sets the unit (base 2 → bits, base e → nats). Entropy is maximized by the uniform distribution and minimized (0) when outcomes are certain.
Probability tells you what tends to happen. But often you want a single number answering a different question: overall, how uncertain am I about what will happen?
A single event can be “surprising” or “unsurprising.” If something was already likely, learning it doesn’t teach you much. If it was unlikely, you learn a lot.
Entropy formalizes this by:
1) defining surprisal for one outcome, then
2) averaging surprisal according to the distribution.
Let X be a discrete random variable taking values x with probability p(x).
We define the surprisal (also called self-information) of outcome x as:
I(x) = −log p(x)
Intuition: likely outcomes (p close to 1) carry little surprisal, rare outcomes carry a lot, and a certain outcome (p = 1) carries none.
The log base chooses units: log₂ gives bits, ln gives nats.
Changing base just rescales:
log_b p = (log_a p) / (log_a b)
Surprisal is about one outcome. Entropy is about the whole random variable.
Definition (discrete entropy):
H(X) = E[I(X)]
= ∑ₓ p(x) I(x)
= ∑ₓ p(x) (−log p(x))
= −∑ₓ p(x) log p(x)
This is the canonical formula you see in information theory.
If one outcome has probability 1, then:
H(X) = −1 · log 1 = 0
Zero uncertainty, zero information gained by observing.
If X is uniform over n outcomes, each p(x) = 1/n, then:
H(X) = −∑ₓ (1/n) log(1/n)
= −n · (1/n) · (−log n)
= log n
This grows as the number of equally likely possibilities grows.
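The definition and both special cases above can be checked with a short sketch (the function name `entropy_bits` is illustrative, not from the text):

```python
# Sketch: entropy in bits, H(X) = -sum p log2 p.
# Convention: terms with p = 0 contribute 0 (the limit of -p log p as p -> 0).
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([1.0]))        # certain outcome -> 0.0
print(entropy_bits([0.5, 0.5]))   # fair coin -> 1.0
print(entropy_bits([0.25] * 4))   # uniform over 4 outcomes -> log2 4 = 2.0
```

Note that entropy grows with the number of equally likely outcomes, exactly as the log n formula predicts.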
Entropy is: an average over the whole distribution, a property of the random variable, and always nonnegative.
Entropy is not: a property of any single outcome, nor of a particular observed sequence.
Keep the mental model: entropy is average surprise.
If an outcome has probability p, the “number of yes/no questions” you need to identify it (in the best case) behaves like log₂(1/p). This is because each binary question can cut the remaining possibilities roughly in half.
The log is the unique (up to scaling) function that turns multiplication into addition:
Concretely, if two independent events happen: A with p(A) and B with p(B), then:
p(A ∩ B) = p(A) p(B)
Surprisal of the joint event should be:
I(A ∩ B) = −log(p(A)p(B))
= −(log p(A) + log p(B))
= (−log p(A)) + (−log p(B))
= I(A) + I(B)
So the −log choice makes “information from independent events” add up cleanly.
If we use log₂:
Each halving of probability adds one bit of surprisal.
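The additivity argument above can be verified numerically (probabilities here are illustrative):

```python
# Sketch: surprisal I(x) = -log2 p(x) adds up over independent events.
import math

def surprisal_bits(p):
    return -math.log2(p)

pA, pB = 0.5, 0.25
joint = pA * pB  # independence: p(A and B) = p(A) p(B)
print(surprisal_bits(joint))                    # 3.0 bits
print(surprisal_bits(pA) + surprisal_bits(pB))  # 1.0 + 2.0 = 3.0 bits

# Each halving of probability adds exactly one bit:
print(surprisal_bits(0.5), surprisal_bits(0.25), surprisal_bits(0.125))
```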
You already know expected value:
E[g(X)] = ∑ₓ p(x) g(x)
Entropy is simply that with g(x) = −log p(x):
H(X) = E[−log p(X)]
A key idea: entropy is not mysterious new machinery. It is simply expected value applied to the random variable "surprisal."
If you compute with log₂, H(X) is in bits.
If you compute with ln, H(X) is in nats.
Conversion:
H_bits(X) = H_nats(X) / ln 2
This matters because in machine learning, losses often use ln (because calculus is cleaner), but we still conceptually speak in bits.
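The unit conversion can be sanity-checked on any distribution (the one below is arbitrary):

```python
# Sketch: the same entropy computed in nats and in bits differs by ln 2.
import math

p = [0.7, 0.2, 0.1]
h_nats = -sum(q * math.log(q) for q in p)   # natural log -> nats
h_bits = -sum(q * math.log2(q) for q in p)  # log base 2 -> bits
print(h_bits, h_nats / math.log(2))         # equal: H_bits = H_nats / ln 2
```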
Let X ∈ {0,1} with P(X=1)=p and P(X=0)=1−p.
H(X) = −p log p − (1−p) log(1−p)
This function has a distinctive shape: it is 0 at p=0 and p=1, symmetric about p=1/2, and largest at p=1/2.
If logs are base 2, the maximum is 1 bit.
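A quick sketch of the binary entropy function (the name `binary_entropy` is illustrative):

```python
# Sketch: binary entropy H(p) = -p log2 p - (1-p) log2(1-p).
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0  # certain outcomes carry no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"p={p:<5} H={binary_entropy(p):.4f} bits")
# Symmetric (H(p) == H(1-p)) and peaks at p = 0.5 with exactly 1 bit.
```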
| Probability p(x) | Surprisal I(x)=−log₂ p(x) | Interpretation |
|---|---|---|
| 1 | 0 | No surprise |
| 1/2 | 1 | One bit of info |
| 1/4 | 2 | Two bits |
| 0.1 | ≈ 3.322 | Worth more than 3 yes/no questions |
| 0.01 | ≈ 6.644 | Very surprising |
Takeaway: surprisal is a smooth measure that grows rapidly as p(x) shrinks.
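The table values can be reproduced in one line each:

```python
# Sketch: reproduce the surprisal table, I(x) = -log2 p(x).
import math

for p in (1.0, 0.5, 0.25, 0.1, 0.01):
    print(f"p = {p:<5} surprisal = {-math.log2(p):.3f} bits")
```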
Entropy is useful partly because it obeys clean, predictable rules. These rules let you reason about coding, uncertainty, and learning systems without re-deriving everything.
Because 0 ≤ p(x) ≤ 1, we have log p(x) ≤ 0, hence −p(x) log p(x) ≥ 0 for each term.
So:
H(X) = −∑ₓ p(x) log p(x) ≥ 0
Entropy can’t be negative.
If X always takes one value with probability 1, then H(X)=0.
Conversely, if H(X)=0, all terms −p(x) log p(x) must be 0. For any p(x) in (0,1), log p(x) < 0 and the term is positive, so the only way all terms vanish is when one outcome has probability 1 and the rest 0.
So entropy captures “no uncertainty.”
Suppose X can take n outcomes. Then H(X) ≤ log n, with equality exactly when X is uniform.
Why this matters: if you know only that there are n possible outcomes, “most uncertain” means “assume uniform.”
Let (X,Y) be a pair of discrete random variables.
The chain rule says:
H(X,Y) = H(X) + H(Y|X)
Interpretation: the total uncertainty of the pair equals the uncertainty of X plus the uncertainty that remains in Y once X is known.
This mirrors how you might actually encode (X,Y): encode X, then encode Y using a code adapted to the conditional distribution P(Y|X).
A brief derivation sketch (using definitions):
H(X,Y)
= −∑ₓ ∑ᵧ p(x,y) log p(x,y)
= −∑ₓ ∑ᵧ p(x,y) log( p(x) p(y|x) )
= −∑ₓ ∑ᵧ p(x,y) [log p(x) + log p(y|x)]
= −∑ₓ ∑ᵧ p(x,y) log p(x) − ∑ₓ ∑ᵧ p(x,y) log p(y|x)
For the first term:
−∑ₓ ∑ᵧ p(x,y) log p(x)
= −∑ₓ log p(x) ∑ᵧ p(x,y)
= −∑ₓ p(x) log p(x)
= H(X)
For the second term, by definition that is H(Y|X).
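The chain rule can be checked numerically on a hypothetical 2×2 joint distribution (the numbers below are made up for illustration):

```python
# Sketch: numeric check of the chain rule H(X,Y) = H(X) + H(Y|X).
import math

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(probs):
    """Entropy in bits of any list/iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Marginal of X: p(x) = sum_y p(x, y)
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
H_joint = H(joint.values())
H_x = H(px.values())
# Conditional entropy: H(Y|X) = sum_x p(x) H(Y | X = x)
H_y_given_x = sum(
    px[x] * H([joint[(x, y)] / px[x] for y in (0, 1)]) for x in (0, 1)
)
print(H_joint, H_x + H_y_given_x)  # the two agree
```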
You should expect that knowing more can’t increase uncertainty:
H(X|Y) ≤ H(X)
Interpretation: observing Y can, on average, only make you more certain (or leave you equally uncertain) about X.
This idea is a cornerstone for later nodes like mutual information:
I(X;Y) = H(X) − H(X|Y)
If H(X)=log n, it behaves like you have n equally likely outcomes.
This motivates the notion of “perplexity” used in language modeling:
perplexity(X) = 2^{H_bits(X)}
If H=10 bits, perplexity is 1024: it’s as if, on average, there are ~1024 equally plausible next tokens.
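Perplexity is a one-liner once you have entropy in bits (the skewed distribution below is illustrative):

```python
# Sketch: perplexity = 2 ** H_bits, the "effective number" of outcomes.
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 8] * 8
print(2 ** entropy_bits(uniform))   # 8 equally likely -> perplexity 8.0
skewed = [0.5, 0.25, 0.125, 0.125]
print(2 ** entropy_bits(skewed))    # skew -> fewer "effective" outcomes
```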
Entropy answers questions like: how compressible is this data, how uncertain is this model’s prediction, and how much does a feature tell you about a label?
And importantly, it does so with algebraic rules that compose well across systems.
Imagine X generates symbols with distribution p(x). If you design an optimal prefix-free code (e.g., Huffman coding), the expected code length L̄ satisfies:
H(X) ≤ L̄ < H(X) + 1
Interpretation: entropy is a hard lower bound on the average code length, and an optimal code gets within one bit of it.
So entropy is not just “an abstract uncertainty”—it’s a concrete benchmark for storage and transmission.
In classification, we often model a distribution q over labels given features, and reality follows p. The cross-entropy:
H(p,q) = −∑ₓ p(x) log q(x)
is the expected surprisal when you use q but the world is p.
This decomposes as:
H(p,q) = H(p) + KL(p‖q)
So minimizing cross-entropy (common in neural nets) is equivalent to minimizing KL divergence because H(p) is fixed.
Even if you don’t yet know KL deeply, you can see the story: cross-entropy splits into the irreducible uncertainty H(p) plus a nonnegative penalty KL(p‖q) for modeling the world with the wrong distribution.
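The decomposition H(p,q) = H(p) + KL(p‖q) can be verified directly (both distributions below are hypothetical):

```python
# Sketch: check H(p, q) = H(p) + KL(p || q) numerically.
import math

p = [0.5, 0.3, 0.2]  # "true" distribution
q = [0.4, 0.4, 0.2]  # model distribution

H_p = -sum(pi * math.log2(pi) for pi in p)
cross = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))
print(cross, H_p + kl)  # equal; KL >= 0, so cross-entropy >= entropy
```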
Decision trees choose splits that reduce uncertainty about the label Y.
At a node, compute entropy of labels:
H(Y) = −∑ᵧ p(y) log p(y)
After splitting on feature X, you get conditional entropy:
H(Y|X) = ∑ₓ p(x) H(Y|X=x)
Information gain is:
IG = H(Y) − H(Y|X)
So trees pick the feature that maximizes IG (largest reduction in label uncertainty).
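The split-scoring computation above fits in a few lines; the numbers in this sketch are hypothetical (before the split the labels are 60/40, and the split sends half the data each way):

```python
# Sketch: information gain IG = H(Y) - H(Y|X) for a hypothetical split.
import math

def H(probs):
    """Entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_before = H([0.6, 0.4])  # label entropy at the node
# Each branch gets half the data, with purer label mixes on each side:
h_after = 0.5 * H([0.9, 0.1]) + 0.5 * H([0.3, 0.7])
ig = h_before - h_after
print(ig)  # positive: the split removes label uncertainty
```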
If you use log₂, entropy is measured in bits, and a bit is literally one yes/no question.
So H(X)=3 bits means: with optimal questioning, you need about 3 binary questions on average to identify X.
This mental model helps you sanity-check: a uniform choice among 8 options should take log₂ 8 = 3 questions, and indeed H = 3 bits.
Once entropy feels natural, several powerful concepts become almost inevitable:
Entropy is the hub: it’s the “energy unit” in the economy of information.
Let X ∈ {0,1} with P(X=1)=p and P(X=0)=1−p. Compute H(X) (base 2), and verify it is maximized at p=1/2.
Start from the definition:
H(X) = −∑ₓ p(x) log₂ p(x)
Plug in the two outcomes:
H(p) = −p log₂ p − (1−p) log₂(1−p)
Check a few values for intuition. At p = 1/2:
H(1/2) = −(1/2)log₂(1/2) − (1/2)log₂(1/2)
= −(1/2)(−1) − (1/2)(−1)
= 1 bit
To verify the maximum location, it is easier to differentiate using ln and convert later.
Define h(p) = −p ln p − (1−p) ln(1−p). (This is entropy in nats.)
Differentiate:
∂h/∂p = −(ln p + 1) − [−ln(1−p) − 1]
= −ln p − 1 + ln(1−p) + 1
= ln((1−p)/p)
Set derivative to 0:
ln((1−p)/p) = 0
⇒ (1−p)/p = 1
⇒ 1−p = p
⇒ p = 1/2
Second derivative check:
∂²h/∂p² = ∂/∂p [ln(1−p) − ln p]
= −1/(1−p) − 1/p
< 0 for p ∈ (0,1)
So p=1/2 is a maximum.
Convert back to bits if needed:
H_bits(p) = h(p) / ln 2, and at p=1/2 we get 1 bit.
Insight: Entropy is maximized when outcomes are as balanced as possible. For a binary variable, “most uncertain” means “fair coin,” giving exactly 1 bit of average information.
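The calculus above can be cross-checked numerically: a finite difference recovers the closed-form derivative ln((1−p)/p), and a grid search confirms the peak at p = 1/2.

```python
# Sketch: numeric check of dh/dp = ln((1-p)/p) for h(p) in nats,
# and of the maximum at p = 1/2.
import math

def h_nats(p):
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

eps = 1e-6
for p in (0.2, 0.5, 0.8):
    numeric = (h_nats(p + eps) - h_nats(p - eps)) / (2 * eps)
    print(numeric, math.log((1 - p) / p))  # the two agree closely

grid = [i / 1000 for i in range(1, 1000)]
print(max(grid, key=h_nats))  # 0.5
```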
Let X take values {A,B,C,D} with probabilities p = (1/2, 1/4, 1/8, 1/8). Compute H(X) in bits and compute perplexity 2^{H(X)}.
Write the entropy:
H(X) = −∑ p(x) log₂ p(x)
Compute each surprisal:
I(A)=−log₂(1/2)=1
I(B)=−log₂(1/4)=2
I(C)=−log₂(1/8)=3
I(D)=−log₂(1/8)=3
Average them with probabilities:
H(X) = (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3
Compute:
(1/2)·1 = 0.5
(1/4)·2 = 0.5
(1/8)·3 = 0.375
(1/8)·3 = 0.375
Sum:
H(X) = 0.5 + 0.5 + 0.375 + 0.375 = 1.75 bits
Compute perplexity:
perplexity = 2^{H} = 2^{1.75}
= 2^{1} · 2^{0.75}
= 2 · 2^{3/4}
≈ 2 · 1.6818
≈ 3.36
Insight: Although there are 4 possible outcomes, the distribution is skewed, so uncertainty behaves like having only ~3.36 equally likely outcomes.
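The arithmetic of this worked example checks out in code:

```python
# Sketch: verify the worked example, p = (1/2, 1/4, 1/8, 1/8).
import math

p = [1 / 2, 1 / 4, 1 / 8, 1 / 8]
H = -sum(q * math.log2(q) for q in p)
print(H)       # 1.75 bits
print(2 ** H)  # perplexity ~ 3.36 "effective" outcomes
```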
A dataset has a binary label Y with P(Y=1)=3/4 and P(Y=0)=1/4. A binary feature X splits the data: when X=0 (prob 1/2), P(Y=1|X=0)=1; when X=1 (prob 1/2), P(Y=1|X=1)=1/2. Compute H(Y), H(Y|X), and the entropy reduction H(Y)−H(Y|X) (base 2).
Compute the initial label entropy:
H(Y) = −(3/4)log₂(3/4) − (1/4)log₂(1/4)
Use known logs:
log₂(1/4)=−2
log₂(3/4)=log₂ 3 − log₂ 4 = log₂ 3 − 2 ≈ 1.585 − 2 = −0.415
Plug in:
H(Y) = −(3/4)(−0.415) − (1/4)(−2)
= (3/4)(0.415) + (1/4)(2)
≈ 0.311 + 0.5
≈ 0.811 bits
Compute conditional entropy via the split:
H(Y|X) = ∑ₓ p(x) H(Y|X=x)
= (1/2)H(Y|X=0) + (1/2)H(Y|X=1)
For X=0, P(Y=1|X=0)=1 so the label is certain:
H(Y|X=0)=0
For X=1, P(Y=1|X=1)=1/2 so it is a fair coin:
H(Y|X=1)=1 bit
Therefore:
H(Y|X) = (1/2)·0 + (1/2)·1 = 0.5 bits
Entropy reduction (information gain idea):
H(Y) − H(Y|X) ≈ 0.811 − 0.5 = 0.311 bits
Insight: A feature is useful when it makes the label distribution more certain. Here, X removes about 0.311 bits of uncertainty about Y on average—exactly the quantity decision trees try to maximize.
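The three quantities from this example are easy to confirm:

```python
# Sketch: verify the split example, H(Y), H(Y|X), and the gain.
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_y = H([3 / 4, 1 / 4])                              # ~0.811 bits
h_y_given_x = 0.5 * H([1.0]) + 0.5 * H([0.5, 0.5])   # = 0.5 bits
print(h_y, h_y_given_x, h_y - h_y_given_x)           # gain ~0.311 bits
```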
Surprisal measures information in a single outcome: I(x) = −log p(x). Rare events carry more information.
Entropy is average surprisal: H(X) = E[−log p(X)] = −∑ₓ p(x) log p(x).
The log base sets units: log₂ → bits, ln → nats; changing base rescales by a constant factor.
H(X) ≥ 0, and H(X)=0 exactly when X is deterministic (no uncertainty).
For a fixed number of outcomes n, entropy is maximized by the uniform distribution and equals log n.
The chain rule H(X,Y)=H(X)+H(Y|X) explains how uncertainty composes across multiple variables.
Conditioning reduces uncertainty: H(X|Y) ≤ H(X), paving the way to mutual information.
In coding and ML, entropy is a benchmark: it is the lower bound on average lossless code length and a building block for cross-entropy/KL losses.
Forgetting the negative sign: ∑ p log p is ≤ 0, so entropy is defined with a minus to make it nonnegative.
Mixing log bases mid-calculation (e.g., using ln in one step and log₂ in another) and getting inconsistent units.
Interpreting entropy as a property of a particular observed sequence rather than of the underlying probability distribution.
Confusing entropy H(X) with cross-entropy H(p,q): entropy uses the true p inside both p and log; cross-entropy uses p outside and q inside the log.
Compute the entropy (in bits) of a fair six-sided die roll. Then compute the entropy if the die is loaded so that P(1)=1/2 and P(2)=P(3)=P(4)=P(5)=P(6)=1/10.
Hint: Uniform over n outcomes has H=log₂ n. For the loaded die, compute −∑ p log₂ p and reuse that −log₂(1/2)=1 and −log₂(1/10)=log₂ 10 ≈ 3.322.
Fair die: H = log₂ 6 ≈ 2.585 bits.
Loaded die:
H = −(1/2)log₂(1/2) − 5·(1/10)log₂(1/10)
= (1/2)·1 + 5·(1/10)·log₂ 10
= 0.5 + 0.5·3.322
≈ 0.5 + 1.661
≈ 2.161 bits.
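Both die entropies from this exercise can be confirmed directly:

```python
# Sketch: verify the fair and loaded die entropies.
import math

fair = [1 / 6] * 6
loaded = [1 / 2] + [1 / 10] * 5

def H(probs):
    return -sum(q * math.log2(q) for q in probs)

print(H(fair))    # log2 6 ~ 2.585 bits
print(H(loaded))  # ~2.161 bits: loading reduces uncertainty
```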
Let X take values {a,b,c} with probabilities (1/2, 1/3, 1/6). Compute H(X) in bits.
Hint: Compute each term p(x)·(−log₂ p(x)). You may use log₂ 3 ≈ 1.585 and log₂ 6 = log₂(2·3)=1+log₂ 3 ≈ 2.585.
H(X)= −(1/2)log₂(1/2) − (1/3)log₂(1/3) − (1/6)log₂(1/6)
Compute surprisals:
−log₂(1/2)=1
−log₂(1/3)=log₂ 3 ≈ 1.585
−log₂(1/6)=log₂ 6 ≈ 2.585
Average:
H ≈ (1/2)·1 + (1/3)·1.585 + (1/6)·2.585
≈ 0.5 + 0.528 + 0.431
≈ 1.459 bits.
Suppose X and Y are independent and each is a fair coin flip. Compute H(X), H(Y), and H(X,Y). Verify the chain rule H(X,Y)=H(X)+H(Y|X).
Hint: Fair coin entropy is 1 bit. Independence implies H(Y|X)=H(Y). Joint distribution has 4 equally likely outcomes.
H(X)=1 bit and H(Y)=1 bit.
Because X and Y are independent, knowing X does not change uncertainty about Y, so H(Y|X)=H(Y)=1 bit.
The pair (X,Y) has 4 equally likely outcomes, so:
H(X,Y)=log₂ 4 = 2 bits.
Chain rule check:
H(X)+H(Y|X) = 1 + 1 = 2 = H(X,Y).
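The joint-entropy side of this check is one sum over the four equally likely outcomes:

```python
# Sketch: two independent fair coins give H(X,Y) = 2 bits.
import math

joint = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
H_joint = -sum(p * math.log2(p) for p in joint.values())
print(H_joint)  # log2 4 = 2.0 bits = 1 bit for X + 1 bit for Y|X
```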
Next nodes you can study: conditional entropy, mutual information, KL divergence, cross-entropy, and source coding.