A measure of uncertainty/information content: H(X) = −∑ p log p.
Self-serve tutorial - low prerequisites, straightforward concepts.
Why does a fair coin flip feel like “one bit” of uncertainty, while a coin that lands heads 99% of the time feels like almost none? Entropy is the mathematical tool that turns that intuition into a precise number—and it becomes the currency for compression, prediction, and learning.
Entropy H(X) measures the average uncertainty (average “surprise”) of a random variable X. For discrete X with probabilities p(x), surprisal is I(x) = −log p(x), and entropy is the expected surprisal: H(X) = E[I(X)] = −∑ₓ p(x) log p(x). The log base sets the unit (base 2 → bits, base e → nats). Entropy is maximized by the uniform distribution and minimized (0) when outcomes are certain.
Probability tells you what tends to happen. But often you want a single number answering a different question: overall, how uncertain am I about what will happen?
A single event can be “surprising” or “unsurprising.” If something was already likely, learning it doesn’t teach you much. If it was unlikely, you learn a lot.
Entropy formalizes this by:
1) defining surprisal for one outcome, then
2) averaging surprisal according to the distribution.
Let X be a discrete random variable taking values x with probability p(x).
We define the surprisal (also called self-information) of outcome x as:
I(x) = −log p(x)
Intuition: likely outcomes (p close to 1) carry little surprisal, rare outcomes carry a lot, and a certain outcome (p = 1) carries none.
The log base chooses units: log₂ gives bits, ln gives nats.
Changing base just rescales:
log_b p = (log_a p) / (log_a b)
Surprisal is about one outcome. Entropy is about the whole random variable.
Definition (discrete entropy):
H(X) = E[I(X)]
= ∑ₓ p(x) I(x)
= ∑ₓ p(x) (−log p(x))
= −∑ₓ p(x) log p(x)
This is the canonical formula you see in information theory.
If one outcome has probability 1, then:
H(X) = −1 · log 1 = 0
Zero uncertainty, zero information gained by observing.
If X is uniform over n outcomes, each p(x) = 1/n, then:
H(X) = −∑ₓ (1/n) log(1/n)
= −n · (1/n) · (−log n)
= log n
This grows as the number of equally likely possibilities grows.
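The definition and both special cases above can be checked with a short sketch (the function name `entropy_bits` is illustrative, not from the text):

```python
# Sketch: entropy in bits, H(X) = -sum p log2 p.
# Convention: terms with p = 0 contribute 0 (the limit of -p log p as p -> 0).
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([1.0]))        # certain outcome -> 0.0
print(entropy_bits([0.5, 0.5]))   # fair coin -> 1.0
print(entropy_bits([0.25] * 4))   # uniform over 4 outcomes -> log2 4 = 2.0
```

Note that entropy grows with the number of equally likely outcomes, exactly as the log n formula predicts.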
Entropy is: an average over the whole distribution, a property of the random variable, and always nonnegative.
Entropy is not: a property of any single outcome, nor of a particular observed sequence.
Keep the mental model: entropy is average surprise.
If an outcome has probability p, the “number of yes/no questions” you need to identify it (in the best case) behaves like log₂(1/p). This is because each binary question can cut the remaining possibilities roughly in half.
The log is the unique (up to scaling) function that turns multiplication into addition:
Concretely, if two independent events happen: A with p(A) and B with p(B), then:
p(A ∩ B) = p(A) p(B)
Surprisal of the joint event should be:
I(A ∩ B) = −log(p(A)p(B))
= −(log p(A) + log p(B))
= (−log p(A)) + (−log p(B))
= I(A) + I(B)
So the −log choice makes “information from independent events” add up cleanly.
If we use log₂:
Each halving of probability adds one bit of surprisal.
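The additivity argument above can be verified numerically (probabilities here are illustrative):

```python
# Sketch: surprisal I(x) = -log2 p(x) adds up over independent events.
import math

def surprisal_bits(p):
    return -math.log2(p)

pA, pB = 0.5, 0.25
joint = pA * pB  # independence: p(A and B) = p(A) p(B)
print(surprisal_bits(joint))                    # 3.0 bits
print(surprisal_bits(pA) + surprisal_bits(pB))  # 1.0 + 2.0 = 3.0 bits

# Each halving of probability adds exactly one bit:
print(surprisal_bits(0.5), surprisal_bits(0.25), surprisal_bits(0.125))
```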
You already know expected value:
E[g(X)] = ∑ₓ p(x) g(x)
Entropy is simply that with g(x) = −log p(x):
H(X) = E[−log p(X)]
A key idea: entropy is not mysterious new machinery. It is simply expected value applied to the random variable "surprisal."
If you compute with log₂, H(X) is in bits.
If you compute with ln, H(X) is in nats.
Conversion:
H_bits(X) = H_nats(X) / ln 2
This matters because in machine learning, losses often use ln (because calculus is cleaner), but we still conceptually speak in bits.
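The unit conversion can be sanity-checked on any distribution (the one below is arbitrary):

```python
# Sketch: the same entropy computed in nats and in bits differs by ln 2.
import math

p = [0.7, 0.2, 0.1]
h_nats = -sum(q * math.log(q) for q in p)   # natural log -> nats
h_bits = -sum(q * math.log2(q) for q in p)  # log base 2 -> bits
print(h_bits, h_nats / math.log(2))         # equal: H_bits = H_nats / ln 2
```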
Let X ∈ {0,1} with P(X=1)=p and P(X=0)=1−p.
H(X) = −p log p − (1−p) log(1−p)
This function has a distinctive shape: it is 0 at p=0 and p=1, symmetric about p=1/2, and largest at p=1/2.
If logs are base 2, the maximum is 1 bit.
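A quick sketch of the binary entropy function (the name `binary_entropy` is illustrative):

```python
# Sketch: binary entropy H(p) = -p log2 p - (1-p) log2(1-p).
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0  # certain outcomes carry no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"p={p:<5} H={binary_entropy(p):.4f} bits")
# Symmetric (H(p) == H(1-p)) and peaks at p = 0.5 with exactly 1 bit.
```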
| Probability p(x) | Surprisal I(x)=−log₂ p(x) | Interpretation |
|---|---|---|
| 1 | 0 | No surprise |
| 1/2 | 1 | One bit of info |
| 1/4 | 2 | Two bits |
| 0.1 | ≈ 3.322 | Worth more than 3 yes/no questions |
| 0.01 | ≈ 6.644 | Very surprising |
Takeaway: surprisal is a smooth measure that grows rapidly as p(x) shrinks.
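The table values can be reproduced in one line each:

```python
# Sketch: reproduce the surprisal table, I(x) = -log2 p(x).
import math

for p in (1.0, 0.5, 0.25, 0.1, 0.01):
    print(f"p = {p:<5} surprisal = {-math.log2(p):.3f} bits")
```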
Entropy is useful partly because it obeys clean, predictable rules. These rules let you reason about coding, uncertainty, and learning systems without re-deriving everything.
Because 0 ≤ p(x) ≤ 1, we have log p(x) ≤ 0, hence −p(x) log p(x) ≥ 0 for each term.
So:
H(X) = −∑ₓ p(x) log p(x) ≥ 0
Entropy can’t be negative.
If X always takes one value with probability 1, then H(X)=0.
Conversely, if H(X)=0, all terms −p(x) log p(x) must be 0. For any p(x) in (0,1), log p(x) < 0 and the term is positive, so the only way all terms vanish is when one outcome has probability 1 and the rest 0.
So entropy captures “no uncertainty.”
Suppose X can take n outcomes. Then H(X) ≤ log n, with equality exactly when X is uniform.
Why this matters: if you know only that there are n possible outcomes, “most uncertain” means “assume uniform.”
Let (X,Y) be a pair of discrete random variables.
The chain rule says:
H(X,Y) = H(X) + H(Y|X)
Interpretation: the total uncertainty of the pair equals the uncertainty of X plus the uncertainty that remains in Y once X is known.
This mirrors how you might actually encode (X,Y): encode X, then encode Y using a code adapted to the conditional distribution P(Y|X).
A brief derivation sketch (using definitions):
H(X,Y)
= −∑ₓ ∑ᵧ p(x,y) log p(x,y)
= −∑ₓ ∑ᵧ p(x,y) log( p(x) p(y|x) )
= −∑ₓ ∑ᵧ p(x,y) [log p(x) + log p(y|x)]
= −∑ₓ ∑ᵧ p(x,y) log p(x) − ∑ₓ ∑ᵧ p(x,y) log p(y|x)
For the first term:
−∑ₓ ∑ᵧ p(x,y) log p(x)
= −∑ₓ log p(x) ∑ᵧ p(x,y)
= −∑ₓ p(x) log p(x)
= H(X)
For the second term, by definition that is H(Y|X).
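The chain rule can be checked numerically on a hypothetical 2×2 joint distribution (the numbers below are made up for illustration):

```python
# Sketch: numeric check of the chain rule H(X,Y) = H(X) + H(Y|X).
import math

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(probs):
    """Entropy in bits of any list/iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Marginal of X: p(x) = sum_y p(x, y)
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
H_joint = H(joint.values())
H_x = H(px.values())
# Conditional entropy: H(Y|X) = sum_x p(x) H(Y | X = x)
H_y_given_x = sum(
    px[x] * H([joint[(x, y)] / px[x] for y in (0, 1)]) for x in (0, 1)
)
print(H_joint, H_x + H_y_given_x)  # the two agree
```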
You should expect that knowing more can’t increase uncertainty:
H(X|Y) ≤ H(X)
Interpretation: observing Y can, on average, only make you more certain (or leave you equally uncertain) about X.
This idea is a cornerstone for later nodes like mutual information:
I(X;Y) = H(X) − H(X|Y)
If H(X)=log n, it behaves like you have n equally likely outcomes.
This motivates the notion of “perplexity” used in language modeling:
perplexity(X) = 2^{H_bits(X)}
If H=10 bits, perplexity is 1024: it’s as if, on average, there are ~1024 equally plausible next tokens.
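Perplexity is a one-liner once you have entropy in bits (the skewed distribution below is illustrative):

```python
# Sketch: perplexity = 2 ** H_bits, the "effective number" of outcomes.
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 8] * 8
print(2 ** entropy_bits(uniform))   # 8 equally likely -> perplexity 8.0
skewed = [0.5, 0.25, 0.125, 0.125]
print(2 ** entropy_bits(skewed))    # skew -> fewer "effective" outcomes
```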
Entropy answers questions like: how compressible is this data, how uncertain is this model’s prediction, and how much does a feature tell you about a label?
And importantly, it does so with algebraic rules that compose well across systems.
Imagine X generates symbols with distribution p(x). If you design an optimal prefix-free code (e.g., Huffman coding), the expected code length L̄ satisfies:
H(X) ≤ L̄ < H(X) + 1
Interpretation: entropy is a hard lower bound on the average code length, and an optimal code gets within one bit of it.
So entropy is not just “an abstract uncertainty”—it’s a concrete benchmark for storage and transmission.
In classification, we often model a distribution q over labels given features, and reality follows p. The cross-entropy:
H(p,q) = −∑ₓ p(x) log q(x)
is the expected surprisal when you use q but the world is p.
This decomposes as:
H(p,q) = H(p) + KL(p‖q)
So minimizing cross-entropy (common in neural nets) is equivalent to minimizing KL divergence because H(p) is fixed.
Even if you don’t yet know KL deeply, you can see the story: cross-entropy splits into the irreducible uncertainty H(p) plus a nonnegative penalty KL(p‖q) for modeling the world with the wrong distribution.
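The decomposition H(p,q) = H(p) + KL(p‖q) can be verified directly (both distributions below are hypothetical):

```python
# Sketch: check H(p, q) = H(p) + KL(p || q) numerically.
import math

p = [0.5, 0.3, 0.2]  # "true" distribution
q = [0.4, 0.4, 0.2]  # model distribution

H_p = -sum(pi * math.log2(pi) for pi in p)
cross = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))
print(cross, H_p + kl)  # equal; KL >= 0, so cross-entropy >= entropy
```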
Decision trees choose splits that reduce uncertainty about the label Y.
At a node, compute entropy of labels:
H(Y) = −∑ᵧ p(y) log p(y)
After splitting on feature X, you get conditional entropy:
H(Y|X) = ∑ₓ p(x) H(Y|X=x)
Information gain is:
IG = H(Y) − H(Y|X)
So trees pick the feature that maximizes IG (largest reduction in label uncertainty).
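The split-scoring computation above fits in a few lines; the numbers in this sketch are hypothetical (before the split the labels are 60/40, and the split sends half the data each way):

```python
# Sketch: information gain IG = H(Y) - H(Y|X) for a hypothetical split.
import math

def H(probs):
    """Entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_before = H([0.6, 0.4])  # label entropy at the node
# Each branch gets half the data, with purer label mixes on each side:
h_after = 0.5 * H([0.9, 0.1]) + 0.5 * H([0.3, 0.7])
ig = h_before - h_after
print(ig)  # positive: the split removes label uncertainty
```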
If you use log₂, entropy is measured in bits, and a bit is literally one yes/no question.
So H(X)=3 bits means: with optimal questioning, you need about 3 binary questions on average to identify X.
This mental model helps you sanity-check: a uniform choice among 8 options should take log₂ 8 = 3 questions, and indeed H = 3 bits.
Once entropy feels natural, several powerful concepts become almost inevitable:
Entropy is the hub: it’s the “energy unit” in the economy of information.
Let X ∈ {0,1} with P(X=1)=p and P(X=0)=1−p. Compute H(X) (base 2), and verify it is maximized at p=1/2.
Start from the definition:
H(X) = −∑ₓ p(x) log₂ p(x)
Plug in the two outcomes:
H(p) = −p log₂ p − (1−p) log₂(1−p)
Check a few values for intuition. At p = 1/2:
H(1/2) = −(1/2)log₂(1/2) − (1/2)log₂(1/2)
= −(1/2)(−1) − (1/2)(−1)
= 1 bit
To verify the maximum location, it is easier to differentiate using ln and convert later.
Define h(p) = −p ln p − (1−p) ln(1−p). (This is entropy in nats.)
Differentiate:
∂h/∂p = −(ln p + 1) − [−ln(1−p) − 1]
= −ln p − 1 + ln(1−p) + 1
= ln((1−p)/p)
Set derivative to 0:
ln((1−p)/p) = 0
⇒ (1−p)/p = 1
⇒ 1−p = p
⇒ p = 1/2
Second derivative check:
∂²h/∂p² = ∂/∂p [ln(1−p) − ln p]
= −1/(1−p) − 1/p
< 0 for p ∈ (0,1)
So p=1/2 is a maximum.
Convert back to bits if needed:
H_bits(p) = h(p) / ln 2, and at p=1/2 we get 1 bit.
Insight: Entropy is maximized when outcomes are as balanced as possible. For a binary variable, “most uncertain” means “fair coin,” giving exactly 1 bit of average information.
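The calculus above can be cross-checked numerically: a finite difference recovers the closed-form derivative ln((1−p)/p), and a grid search confirms the peak at p = 1/2.

```python
# Sketch: numeric check of dh/dp = ln((1-p)/p) for h(p) in nats,
# and of the maximum at p = 1/2.
import math

def h_nats(p):
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

eps = 1e-6
for p in (0.2, 0.5, 0.8):
    numeric = (h_nats(p + eps) - h_nats(p - eps)) / (2 * eps)
    print(numeric, math.log((1 - p) / p))  # the two agree closely

grid = [i / 1000 for i in range(1, 1000)]
print(max(grid, key=h_nats))  # 0.5
```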
Let X take values {A,B,C,D} with probabilities p = (1/2, 1/4, 1/8, 1/8). Compute H(X) in bits and compute perplexity 2^{H(X)}.
Write the entropy:
H(X) = −∑ p(x) log₂ p(x)
Compute each surprisal:
I(A)=−log₂(1/2)=1
I(B)=−log₂(1/4)=2
I(C)=−log₂(1/8)=3
I(D)=−log₂(1/8)=3
Average them with probabilities:
H(X) = (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3
Compute:
(1/2)·1 = 0.5
(1/4)·2 = 0.5
(1/8)·3 = 0.375
(1/8)·3 = 0.375
Sum:
H(X) = 0.5 + 0.5 + 0.375 + 0.375 = 1.75 bits
Compute perplexity:
perplexity = 2^{H} = 2^{1.75}
= 2^{1} · 2^{0.75}
= 2 · 2^{3/4}
≈ 2 · 1.6818
≈ 3.36
Insight: Although there are 4 possible outcomes, the distribution is skewed, so uncertainty behaves like having only ~3.36 equally likely outcomes.
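The arithmetic of this worked example checks out in code:

```python
# Sketch: verify the worked example, p = (1/2, 1/4, 1/8, 1/8).
import math

p = [1 / 2, 1 / 4, 1 / 8, 1 / 8]
H = -sum(q * math.log2(q) for q in p)
print(H)       # 1.75 bits
print(2 ** H)  # perplexity ~ 3.36 "effective" outcomes
```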
A dataset has a binary label Y with P(Y=1)=3/4 and P(Y=0)=1/4. A binary feature X splits the data: when X=0 (prob 1/2), P(Y=1|X=0)=1; when X=1 (prob 1/2), P(Y=1|X=1)=1/2. Compute H(Y), H(Y|X), and the entropy reduction H(Y)−H(Y|X) (base 2).
Compute the initial label entropy:
H(Y) = −(3/4)log₂(3/4) − (1/4)log₂(1/4)
Use known logs:
log₂(1/4)=−2
log₂(3/4)=log₂ 3 − log₂ 4 = log₂ 3 − 2 ≈ 1.585 − 2 = −0.415
Plug in:
H(Y) = −(3/4)(−0.415) − (1/4)(−2)
= (3/4)(0.415) + (1/4)(2)
≈ 0.311 + 0.5
≈ 0.811 bits
Compute conditional entropy via the split:
H(Y|X) = ∑ₓ p(x) H(Y|X=x)
= (1/2)H(Y|X=0) + (1/2)H(Y|X=1)
For X=0, P(Y=1|X=0)=1 so the label is certain:
H(Y|X=0)=0
For X=1, P(Y=1|X=1)=1/2 so it is a fair coin:
H(Y|X=1)=1 bit
Therefore:
H(Y|X) = (1/2)·0 + (1/2)·1 = 0.5 bits
Entropy reduction (information gain idea):
H(Y) − H(Y|X) ≈ 0.811 − 0.5 = 0.311 bits
Insight: A feature is useful when it makes the label distribution more certain. Here, X removes about 0.311 bits of uncertainty about Y on average—exactly the quantity decision trees try to maximize.
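The three quantities from this example are easy to confirm:

```python
# Sketch: verify the split example, H(Y), H(Y|X), and the gain.
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_y = H([3 / 4, 1 / 4])                              # ~0.811 bits
h_y_given_x = 0.5 * H([1.0]) + 0.5 * H([0.5, 0.5])   # = 0.5 bits
print(h_y, h_y_given_x, h_y - h_y_given_x)           # gain ~0.311 bits
```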
Surprisal measures information in a single outcome: I(x) = −log p(x). Rare events carry more information.
Entropy is average surprisal: H(X) = E[−log p(X)] = −∑ₓ p(x) log p(x).
The log base sets units: log₂ → bits, ln → nats; changing base rescales by a constant factor.
H(X) ≥ 0, and H(X)=0 exactly when X is deterministic (no uncertainty).
For a fixed number of outcomes n, entropy is maximized by the uniform distribution and equals log n.
The chain rule H(X,Y)=H(X)+H(Y|X) explains how uncertainty composes across multiple variables.
Conditioning reduces uncertainty: H(X|Y) ≤ H(X), paving the way to mutual information.
In coding and ML, entropy is a benchmark: it is the lower bound on average lossless code length and a building block for cross-entropy/KL losses.
Forgetting the negative sign: ∑ p log p is ≤ 0, so entropy is defined with a minus to make it nonnegative.
Mixing log bases mid-calculation (e.g., using ln in one step and log₂ in another) and getting inconsistent units.
Interpreting entropy as a property of a particular observed sequence rather than of the underlying probability distribution.
Confusing entropy H(X) with cross-entropy H(p,q): entropy uses the true p inside both p and log; cross-entropy uses p outside and q inside the log.
Compute the entropy (in bits) of a fair six-sided die roll. Then compute the entropy if the die is loaded so that P(1)=1/2 and P(2)=P(3)=P(4)=P(5)=P(6)=1/10.
Hint: Uniform over n outcomes has H=log₂ n. For the loaded die, compute −∑ p log₂ p and reuse that −log₂(1/2)=1 and −log₂(1/10)=log₂ 10 ≈ 3.322.
Fair die: H = log₂ 6 ≈ 2.585 bits.
Loaded die:
H = −(1/2)log₂(1/2) − 5·(1/10)log₂(1/10)
= (1/2)·1 + 5·(1/10)·log₂ 10
= 0.5 + 0.5·3.322
≈ 0.5 + 1.661
≈ 2.161 bits.
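Both die entropies from this exercise can be confirmed directly:

```python
# Sketch: verify the fair and loaded die entropies.
import math

fair = [1 / 6] * 6
loaded = [1 / 2] + [1 / 10] * 5

def H(probs):
    return -sum(q * math.log2(q) for q in probs)

print(H(fair))    # log2 6 ~ 2.585 bits
print(H(loaded))  # ~2.161 bits: loading reduces uncertainty
```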
Let X take values {a,b,c} with probabilities (1/2, 1/3, 1/6). Compute H(X) in bits.
Hint: Compute each term p(x)·(−log₂ p(x)). You may use log₂ 3 ≈ 1.585 and log₂ 6 = log₂(2·3)=1+log₂ 3 ≈ 2.585.
H(X)= −(1/2)log₂(1/2) − (1/3)log₂(1/3) − (1/6)log₂(1/6)
Compute surprisals:
−log₂(1/2)=1
−log₂(1/3)=log₂ 3 ≈ 1.585
−log₂(1/6)=log₂ 6 ≈ 2.585
Average:
H ≈ (1/2)·1 + (1/3)·1.585 + (1/6)·2.585
≈ 0.5 + 0.528 + 0.431
≈ 1.459 bits.
Suppose X and Y are independent and each is a fair coin flip. Compute H(X), H(Y), and H(X,Y). Verify the chain rule H(X,Y)=H(X)+H(Y|X).
Hint: Fair coin entropy is 1 bit. Independence implies H(Y|X)=H(Y). Joint distribution has 4 equally likely outcomes.
H(X)=1 bit and H(Y)=1 bit.
Because X and Y are independent, knowing X does not change uncertainty about Y, so H(Y|X)=H(Y)=1 bit.
The pair (X,Y) has 4 equally likely outcomes, so:
H(X,Y)=log₂ 4 = 2 bits.
Chain rule check:
H(X)+H(Y|X) = 1 + 1 = 2 = H(X,Y).
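The joint-entropy side of this check is one sum over the four equally likely outcomes:

```python
# Sketch: two independent fair coins give H(X,Y) = 2 bits.
import math

joint = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
H_joint = -sum(p * math.log2(p) for p in joint.values())
print(H_joint)  # log2 4 = 2.0 bits = 1 bit for X + 1 bit for Y|X
```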
Next nodes you can study: conditional entropy, mutual information, KL divergence, cross-entropy, and source coding.