Measure of spread: Var(X) = E[(X − μ)²]. Standard deviation is √Var(X).
Self-serve tutorial - low prerequisites, straightforward concepts.
You can know the “average” of a random variable and still be surprised all the time. Variance is the standard way to quantify how spread out the outcomes are around that average.
Variance measures spread: Var(X) = E[(X − μ)²], where μ = E[X]. It’s always ≥ 0, equals 0 only if X is constant (almost surely), and can be computed efficiently via Var(X) = E[X²] − (E[X])². Standard deviation is √Var(X).
Expected value tells you the center of a distribution: μ = E[X]. But two random variables can have the same μ and behave very differently.
Example intuition: let X = 0 always, while Y = +100 or −100, each with probability 1/2.
Both have E[X] = E[Y] = 0, but Y “swings” wildly while X never moves. Variance is designed to capture that swing.
Let X be a random variable with mean μ = E[X]. The variance of X is
Var(X) = E[(X − μ)²].
Read this as: “take the deviation from the mean, square it, then average.”
We want a measure of spread that:
- never lets deviations above and below the mean cancel each other out,
- penalizes larger deviations more heavily than small ones, and
- is easy to work with algebraically.
Absolute deviation |X − μ| would avoid cancellation too, but squaring is the choice that leads to particularly useful identities (like Var(X) = E[X²] − μ²) and makes many derivations clean.
If X has units (meters, dollars, etc.), then (X − μ)² has squared units, so variance has squared units too. To get back to the original units, define the standard deviation:
σ = √Var(X).
Variance is the main mathematical object; standard deviation is often easier to interpret in the original scale.
If you think of outcomes x as points on a line, μ is the “balance point,” and (x − μ)² is the squared distance to that point. Variance is the expected squared distance to the center.
From the definition, for a discrete random variable:
Var(X) = E[(X − μ)²] = ∑ (x − μ)² P(X = x).
This is conceptually simple:
- for each possible value x, compute its squared distance (x − μ)² to the mean,
- weight it by the probability P(X = x),
- add everything up.
For a continuous random variable with density f(x), replace the sum with an integral:
Var(X) = ∫ (x − μ)² f(x) dx.
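As a sketch, the discrete formula translates directly into code. The three-point distribution below is chosen just for illustration:

```python
# Variance of a discrete random variable, computed straight from the definition:
# Var(X) = sum over x of (x - mu)^2 * P(X = x)

def variance(pmf):
    """pmf: dict mapping value x -> probability P(X = x)."""
    mu = sum(x * p for x, p in pmf.items())               # mean, E[X]
    return sum((x - mu) ** 2 * p for x, p in pmf.items())  # E[(X - mu)^2]

pmf = {0: 0.25, 1: 0.5, 2: 0.25}   # example three-point distribution
print(variance(pmf))               # 0.5
```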
Computing (x − μ)² can be tedious. There’s a standard algebraic identity that makes variance faster to compute.
Start with the definition and expand:
Var(X)
= E[(X − μ)²]
= E[X² − 2μX + μ²].
Now use linearity of expectation and the fact that μ is a constant:
E[X² − 2μX + μ²]
= E[X²] − 2μE[X] + E[μ²]
= E[X²] − 2μ·μ + μ²
= E[X²] − μ².
So:
Var(X) = E[X²] − (E[X])².
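A quick numerical sanity check of the identity on a small discrete distribution (the distribution is arbitrary, chosen only for illustration):

```python
# Verify Var(X) = E[X^2] - (E[X])^2 on a two-point distribution.
pmf = {1: 0.2, 3: 0.8}                               # P(X=1)=0.2, P(X=3)=0.8

mu  = sum(x * p for x, p in pmf.items())             # E[X]
ex2 = sum(x**2 * p for x, p in pmf.items())          # E[X^2]

var_definition = sum((x - mu)**2 * p for x, p in pmf.items())  # E[(X - mu)^2]
var_shortcut   = ex2 - mu**2                                   # E[X^2] - (E[X])^2

print(var_definition, var_shortcut)   # both approximately 0.64
```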
This is one of the most-used formulas in probability and statistics.
It separates variance into two pieces you can often compute easily:
- E[X²], the second moment, and
- (E[X])² = μ², the squared mean.
Many common distributions have known formulas for these moments, making Var(X) almost immediate.
A key property: shifting a random variable by a constant does not change its variance.
Let Y = X + c. Then E[Y] = E[X] + c = μ + c.
Compute variance:
Var(Y)
= E[(Y − E[Y])²]
= E[(X + c − (μ + c))²]
= E[(X − μ)²]
= Var(X).
So variance measures spread, not location.
If you scale outcomes, spread scales quadratically:
Let Y = aX. Then E[Y] = aμ.
Var(Y)
= E[(aX − aμ)²]
= E[a²(X − μ)²]
= a²E[(X − μ)²]
= a²Var(X).
So:
Var(aX) = a²Var(X).
This is why standard deviation scales linearly: √Var(aX) = |a| √Var(X).
| Transformation | New mean | New variance |
|---|---|---|
| Y = X + c | E[Y] = μ + c | Var(Y) = Var(X) |
| Y = aX | E[Y] = aμ | Var(Y) = a²Var(X) |
| Y = aX + c | E[Y] = aμ + c | Var(Y) = a²Var(X) |
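The rules in the table can be checked mechanically. Here is a sketch using exact enumeration over a small pmf (the values of a and c are arbitrary):

```python
# Check Var(X + c) = Var(X) and Var(aX) = a^2 Var(X) by exact enumeration.
def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def var(pmf):
    mu = mean(pmf)
    return sum((x - mu)**2 * p for x, p in pmf.items())

pmf = {-1: 0.25, 0: 0.5, 1: 0.25}
a, c = 3.0, 7.0

shifted = {x + c: p for x, p in pmf.items()}   # Y = X + c
scaled  = {a * x: p for x, p in pmf.items()}   # Y = aX

print(var(pmf), var(shifted), var(scaled))
# the shift leaves variance unchanged; the scaling multiplies it by a^2 = 9
```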
From the definition:
Var(X) = E[(X − μ)²].
But (X − μ)² ≥ 0 for every outcome, because a square can’t be negative. An expectation of a nonnegative random variable is nonnegative, so:
Var(X) ≥ 0.
This is not just a technicality: it tells you variance behaves like a “size” or “energy.”
Variance is zero exactly when there is no spread at all.
Claim:
Var(X) = 0 ⇔ X = μ almost surely.
Meaning: X equals the constant μ with probability 1 (it may differ on a probability-0 set, which doesn’t affect expectations).
If X = μ (with probability 1), then X − μ = 0 (with probability 1), so (X − μ)² = 0 always. Therefore E[(X − μ)²] = 0.
If Var(X) = E[(X − μ)²] = 0 and (X − μ)² is always ≥ 0, the only way the average can be 0 is if it is 0 wherever probability mass exists.
Formally (intuition-first): if X differed from μ with positive probability, there would be some ε > 0 with P(|X − μ| ≥ ε) > 0, and then E[(X − μ)²] ≥ ε² · P(|X − μ| ≥ ε) > 0, contradicting Var(X) = 0.
“Almost surely” (a.s.) means “with probability 1.”
It allows edge cases where X might misbehave on events of probability 0, but those events do not affect expectations, variances, or most practical probability calculations.
Example: suppose X equals 5 on every outcome except a single event of probability 0, where it takes some other value. Then X = 5 almost surely, E[X] = 5, and Var(X) = 0.
For variance, “Var(X) = 0 implies X is constant a.s.” is the correct statement.
The smaller Var(X) is, the more tightly X concentrates around μ.
A very important inequality (you don’t need to master it yet, but it motivates why variance matters) is Chebyshev’s inequality:
P(|X − μ| ≥ kσ) ≤ 1/k² for every k > 0, where σ = √Var(X).
It says: if variance is small, large deviations from the mean are unlikely.
This is a major reason variance is used as a compact summary of uncertainty.
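As a sketch, the inequality can be verified exactly for a fair six-sided die (any distribution would do):

```python
import math

# Check Chebyshev's inequality P(|X - mu| >= k*sigma) <= 1/k^2
# exactly for a fair six-sided die.
pmf = {x: 1/6 for x in range(1, 7)}

mu    = sum(x * p for x, p in pmf.items())             # 3.5
var   = sum((x - mu)**2 * p for x, p in pmf.items())   # 35/12
sigma = math.sqrt(var)

for k in (1.0, 1.2, 2.0):
    # exact tail probability P(|X - mu| >= k*sigma)
    tail = sum(p for x, p in pmf.items() if abs(x - mu) >= k * sigma)
    assert tail <= 1 / k**2 + 1e-12
    print(f"k={k}: tail = {tail:.3f} <= bound {1/k**2:.3f}")
```

The bound is loose for the die (the true tails are much smaller than 1/k²), which is typical: Chebyshev trades tightness for complete generality.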
Once you know how to compute variance, you can summarize a distribution with (mean, variance). Many families are parameterized this way.
A few examples you’ll meet soon:
- Bernoulli(p): mean p, variance p(1 − p),
- Binomial(n, p): mean np, variance np(1 − p),
- Poisson(λ): mean λ, variance λ,
- Normal(μ, σ²): mean μ, variance σ².
Variance is the “second parameter” in many models because it controls spread.
Variance is the special case of covariance when you compare a variable with itself.
Cov(X, Y) is defined as:
Cov(X, Y) = E[(X − E[X])(Y − E[Y])].
If you plug in Y = X:
Cov(X, X) = E[(X − E[X])(X − E[X])]
= E[(X − μ)²]
= Var(X).
So variance is “self-covariance.” This viewpoint is crucial later:
- covariance matrices carry the variances of a random vector along their diagonal,
- the rule Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y) reduces to simple addition when Cov(X, Y) = 0.
Correlation normalizes covariance by standard deviations:
Corr(X, Y) = Cov(X, Y) / (σₓ σᵧ).
Variance is the building block for both.
In deep learning, activations can drift to have large or tiny variance across layers. If variance explodes or vanishes, gradients can become unstable.
Normalization methods (BatchNorm, LayerNorm) explicitly use mean and variance to re-center and re-scale activations.
A typical pattern for a batch of activations a:
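A sketch in plain Python (real frameworks operate on tensors; the small eps constant is assumed here, as typical normalization layers add one for numerical stability):

```python
import math

def normalize(a, eps=1e-5):
    """Re-center and re-scale a batch of activations to ~mean 0, variance 1."""
    n = len(a)
    mean = sum(a) / n
    var = sum((x - mean) ** 2 for x in a) / n          # divide-by-n variance
    return [(x - mean) / math.sqrt(var + eps) for x in a]

batch = [2.0, 4.0, 6.0, 8.0]
norm = normalize(batch)
print(norm)   # approximately mean 0, variance 1
```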
This makes the normalized activations have approximately mean 0 and variance 1.
Even if you’re not training neural nets yet, the message is: variance isn’t just a summary statistic on paper; keeping it under control is a practical engineering concern.
In probability, Var(X) is a population quantity: it assumes you know the true distribution.
In statistics, you estimate variance from samples x₁, …, xₙ. Two common formulas are:
- the population form, which divides by n: (1/n) ∑ᵢ (xᵢ − x̄)²,
- the sample (unbiased) form, which divides by n − 1: (1/(n − 1)) ∑ᵢ (xᵢ − x̄)²,
where x̄ is the sample mean.
You don’t need the full derivation here, but it’s helpful to recognize that variance connects probability theory (true Var) to data analysis (estimated variance).
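Python’s standard library exposes both conventions; a quick sketch (the data is arbitrary):

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

pop_var    = statistics.pvariance(data)  # divides by n   (population formula)
sample_var = statistics.variance(data)   # divides by n-1 (unbiased sample formula)

print(pop_var, sample_var)   # 4.0 and about 4.571
```

With only eight data points the two estimates differ noticeably; as n grows, the factor n/(n − 1) approaches 1 and the gap vanishes.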
Let X take values {0, 1, 2} with probabilities P(0)=1/4, P(1)=1/2, P(2)=1/4. Find μ = E[X], Var(X), and σ.
Compute the mean:
μ = E[X]
= 0·(1/4) + 1·(1/2) + 2·(1/4)
= 0 + 1/2 + 1/2
= 1.
Compute squared deviations from the mean:
(0 − 1)² = 1
(1 − 1)² = 0
(2 − 1)² = 1.
Take the expectation of squared deviation:
Var(X) = E[(X − μ)²]
= 1·(1/4) + 0·(1/2) + 1·(1/4)
= 1/4 + 0 + 1/4
= 1/2.
Compute standard deviation:
σ = √Var(X) = √(1/2) ≈ 0.7071.
Insight: Variance is an average of squared distances to the mean. Symmetry around μ=1 makes the computation especially simple here.
Let X ∼ Bernoulli(p), so X ∈ {0,1} with P(X=1)=p and P(X=0)=1−p. Compute Var(X).
Compute E[X]:
E[X] = 0·(1−p) + 1·p = p.
Compute E[X²]: since X is 0 or 1, we have X² = X for every outcome.
So E[X²] = E[X] = p.
Apply the identity:
Var(X) = E[X²] − (E[X])²
= p − p²
= p(1 − p).
Sanity check at extremes:
If p=0 or p=1, then Var(X)=0, matching the fact that X is constant.
If p=1/2, then Var(X)=1/4, the maximum spread for a Bernoulli.
Insight: For Bernoulli variables, the shortcut is extremely fast because X² = X. This pattern (compute moments, then combine) is the standard way to get variances for many distributions.
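A sketch checking p(1 − p) against the definition across a few values of p:

```python
# Bernoulli variance: check Var(X) = p(1 - p) against the definition
# E[(X - mu)^2] with mu = p.
for p in (0.0, 0.25, 0.5, 0.9, 1.0):
    mu = p                                                # E[X] = p
    var_def = (0 - mu)**2 * (1 - p) + (1 - mu)**2 * p     # E[(X - mu)^2]
    assert abs(var_def - p * (1 - p)) < 1e-12
print("Var(Bernoulli(p)) = p(1-p) holds for the sampled p values")
```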
Suppose Var(X) = 9 and μ = E[X] = 2. Define Y = 3X − 5. Find E[Y], Var(Y), and σᵧ.
Compute the mean using linearity:
E[Y] = E[3X − 5]
= 3E[X] − 5
= 3·2 − 5
= 1.
Compute the variance using the scaling rule (shift doesn’t matter):
Var(Y) = Var(3X − 5)
= 3² Var(X)
= 9·9
= 81.
Compute standard deviation:
σᵧ = √Var(Y) = √81 = 9.
Insight: Adding/subtracting constants moves the distribution but doesn’t change spread; multiplying by 3 triples standard deviation and multiplies variance by 9.
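To see the numbers concretely, here is a sketch with a hypothetical two-point X that has exactly μ = 2 and Var(X) = 9 (X ∈ {−1, 5} with probability 1/2 each, chosen purely for illustration):

```python
# A concrete X with E[X] = 2 and Var(X) = 9, to check the Y = 3X - 5 example.
pmf_x = {-1: 0.5, 5: 0.5}

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def var(pmf):
    mu = mean(pmf)
    return sum((x - mu)**2 * p for x, p in pmf.items())

pmf_y = {3 * x - 5: p for x, p in pmf_x.items()}   # Y = 3X - 5, values -8 and 10

print(mean(pmf_x), var(pmf_x))   # 2.0 and 9.0
print(mean(pmf_y), var(pmf_y))   # 1.0 and 81.0
```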
Variance measures spread around the mean: Var(X) = E[(X − μ)²], μ = E[X].
Var(X) ≥ 0 always, because it is the expectation of a square.
Var(X) = 0 exactly when X is constant almost surely (P(X = μ) = 1).
Efficient computation: Var(X) = E[X²] − (E[X])².
Shifts don’t change variance: Var(X + c) = Var(X).
Scaling changes variance quadratically: Var(aX) = a²Var(X).
Standard deviation is the square root of variance: σ = √Var(X), restoring original units.
Forgetting to compute μ = E[X] first, or using the wrong mean when plugging into E[(X − μ)²].
Mixing up variance and standard deviation (variance is squared units; standard deviation is √variance).
Thinking Var(X + c) changes with c (it doesn’t); only scaling affects variance magnitude.
Dropping the square when expanding (X − μ)² or misapplying Var(X) = E[X²] − (E[X])².
Let X take values {−1, 0, 1} with probabilities {1/4, 1/2, 1/4}. Compute E[X], Var(X), and σ.
Hint: Symmetry suggests E[X]=0. Then Var(X)=E[X²].
Mean:
E[X] = (−1)·(1/4) + 0·(1/2) + 1·(1/4) = 0.
Variance:
Since μ = 0, Var(X) = E[X²].
E[X²] = 1·(1/4) + 0·(1/2) + 1·(1/4) = 1/2.
So Var(X) = 1/2.
Standard deviation:
σ = √(1/2) ≈ 0.7071.
Use Var(X) = E[X²] − (E[X])². Suppose P(X=1)=0.2, P(X=3)=0.8. Compute Var(X).
Hint: Compute E[X] and E[X²] from the two-point distribution, then subtract.
E[X] = 1·0.2 + 3·0.8 = 0.2 + 2.4 = 2.6.
E[X²] = 1²·0.2 + 3²·0.8 = 1·0.2 + 9·0.8 = 0.2 + 7.2 = 7.4.
Var(X) = E[X²] − (E[X])² = 7.4 − (2.6)² = 7.4 − 6.76 = 0.64.
Assume Var(X)=4. Define Z = −2X + 10. Compute Var(Z) and σ_z.
Hint: Use Var(aX + c) = a²Var(X). The sign of a doesn’t matter after squaring.
Var(Z) = Var(−2X + 10) = (−2)²Var(X) = 4·4 = 16.
σ_z = √Var(Z) = √16 = 4.
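All three exercise answers can be double-checked in a few lines. A sketch using exact enumeration (for exercise 3, a hypothetical concrete X with mean 2 and variance 4 is used for illustration, since only Var(X) = 4 is given):

```python
import math

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def var(pmf):
    mu = mean(pmf)
    return sum((x - mu)**2 * p for x, p in pmf.items())

# Exercise 1: X in {-1, 0, 1} with probabilities 1/4, 1/2, 1/4
ex1 = {-1: 0.25, 0: 0.5, 1: 0.25}
print(mean(ex1), var(ex1), math.sqrt(var(ex1)))   # 0.0, 0.5, ~0.7071

# Exercise 2: P(X=1)=0.2, P(X=3)=0.8
ex2 = {1: 0.2, 3: 0.8}
print(var(ex2))                                   # ~0.64

# Exercise 3: Z = -2X + 10, using an illustrative X with Var(X) = 4
x = {0: 0.5, 4: 0.5}                              # mean 2, variance 4
z = {-2 * v + 10: p for v, p in x.items()}        # Z takes values 10 and 2
print(var(z))                                     # 16.0
```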