Measure of spread: Var(X) = E[(X − μ)²]. Standard deviation is √Var(X).
Self-serve tutorial - low prerequisites, straightforward concepts.
You can know the “average” of a random variable and still be surprised all the time. Variance is the standard way to quantify how spread out the outcomes are around that average.
Variance measures spread: Var(X) = E[(X − μ)²], where μ = E[X]. It’s always ≥ 0, equals 0 only if X is constant (almost surely), and can be computed efficiently via Var(X) = E[X²] − (E[X])². Standard deviation is √Var(X).
Expected value tells you the center of a distribution: μ = E[X]. But two random variables can have the same μ and behave very differently.
Example intuition: let X = 0 always, while Y = +100 or −100, each with probability 1/2.
Both have E[X] = E[Y] = 0, but Y “swings” wildly while X never moves. Variance is designed to capture that swing.
Let X be a random variable with mean μ = E[X]. The variance of X is
Var(X) = E[(X − μ)²].
Read this as: “take the deviation from the mean, square it, then average.”
We want a measure of spread that:
- never lets deviations above and below the mean cancel each other out,
- penalizes larger deviations more heavily than small ones, and
- is easy to work with algebraically.
Absolute deviation |X − μ| would avoid cancellation too, but squaring is the choice that leads to particularly useful identities (like Var(X) = E[X²] − μ²) and makes many derivations clean.
If X has units (meters, dollars, etc.), then (X − μ)² has squared units, so variance has squared units too. To get back to the original units, define the standard deviation:
σ = √Var(X).
Variance is the main mathematical object; standard deviation is often easier to interpret in the original scale.
If you think of outcomes x as points on a line, μ is the “balance point,” and (x − μ)² is the squared distance to that point. Variance is the expected squared distance to the center.
From the definition, for a discrete random variable:
Var(X) = E[(X − μ)²] = ∑ (x − μ)² P(X = x).
This is conceptually simple:
- for each possible value x, compute its squared distance (x − μ)² to the mean,
- weight it by the probability P(X = x),
- add everything up.
For a continuous random variable with density f(x), replace the sum with an integral:
Var(X) = ∫ (x − μ)² f(x) dx.
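As a sketch, the discrete formula translates directly into code. The three-point distribution below is chosen just for illustration:

```python
# Variance of a discrete random variable, computed straight from the definition:
# Var(X) = sum over x of (x - mu)^2 * P(X = x)

def variance(pmf):
    """pmf: dict mapping value x -> probability P(X = x)."""
    mu = sum(x * p for x, p in pmf.items())               # mean, E[X]
    return sum((x - mu) ** 2 * p for x, p in pmf.items())  # E[(X - mu)^2]

pmf = {0: 0.25, 1: 0.5, 2: 0.25}   # example three-point distribution
print(variance(pmf))               # 0.5
```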
Computing (x − μ)² can be tedious. There’s a standard algebraic identity that makes variance faster to compute.
Start with the definition and expand:
Var(X)
= E[(X − μ)²]
= E[X² − 2μX + μ²].
Now use linearity of expectation and the fact that μ is a constant:
E[X² − 2μX + μ²]
= E[X²] − 2μE[X] + E[μ²]
= E[X²] − 2μ·μ + μ²
= E[X²] − μ².
So:
Var(X) = E[X²] − (E[X])².
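A quick numerical sanity check of the identity on a small discrete distribution (the distribution is arbitrary, chosen only for illustration):

```python
# Verify Var(X) = E[X^2] - (E[X])^2 on a two-point distribution.
pmf = {1: 0.2, 3: 0.8}                               # P(X=1)=0.2, P(X=3)=0.8

mu  = sum(x * p for x, p in pmf.items())             # E[X]
ex2 = sum(x**2 * p for x, p in pmf.items())          # E[X^2]

var_definition = sum((x - mu)**2 * p for x, p in pmf.items())  # E[(X - mu)^2]
var_shortcut   = ex2 - mu**2                                   # E[X^2] - (E[X])^2

print(var_definition, var_shortcut)   # both approximately 0.64
```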
This is one of the most-used formulas in probability and statistics.
It separates variance into two pieces you can often compute easily:
- E[X²], the second moment, and
- (E[X])² = μ², the squared mean.
Many common distributions have known formulas for these moments, making Var(X) almost immediate.
A key property: shifting a random variable by a constant does not change its variance.
Let Y = X + c. Then E[Y] = E[X] + c = μ + c.
Compute variance:
Var(Y)
= E[(Y − E[Y])²]
= E[(X + c − (μ + c))²]
= E[(X − μ)²]
= Var(X).
So variance measures spread, not location.
If you scale outcomes, spread scales quadratically:
Let Y = aX. Then E[Y] = aμ.
Var(Y)
= E[(aX − aμ)²]
= E[a²(X − μ)²]
= a²E[(X − μ)²]
= a²Var(X).
So:
Var(aX) = a²Var(X).
This is why standard deviation scales linearly: √Var(aX) = |a| √Var(X).
| Transformation | New mean | New variance |
|---|---|---|
| Y = X + c | E[Y] = μ + c | Var(Y) = Var(X) |
| Y = aX | E[Y] = aμ | Var(Y) = a²Var(X) |
| Y = aX + c | E[Y] = aμ + c | Var(Y) = a²Var(X) |
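The rules in the table can be checked mechanically. Here is a sketch using exact enumeration over a small pmf (the values of a and c are arbitrary):

```python
# Check Var(X + c) = Var(X) and Var(aX) = a^2 Var(X) by exact enumeration.
def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def var(pmf):
    mu = mean(pmf)
    return sum((x - mu)**2 * p for x, p in pmf.items())

pmf = {-1: 0.25, 0: 0.5, 1: 0.25}
a, c = 3.0, 7.0

shifted = {x + c: p for x, p in pmf.items()}   # Y = X + c
scaled  = {a * x: p for x, p in pmf.items()}   # Y = aX

print(var(pmf), var(shifted), var(scaled))
# the shift leaves variance unchanged; the scaling multiplies it by a^2 = 9
```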
From the definition:
Var(X) = E[(X − μ)²].
But (X − μ)² ≥ 0 for every outcome, because a square can’t be negative. An expectation of a nonnegative random variable is nonnegative, so:
Var(X) ≥ 0.
This is not just a technicality: it tells you variance behaves like a “size” or “energy.”
Variance is zero exactly when there is no spread at all.
Claim:
Var(X) = 0 ⇔ X = μ almost surely.
Meaning: X equals the constant μ with probability 1 (it may differ on a probability-0 set, which doesn’t affect expectations).
If X = μ (with probability 1), then X − μ = 0 (with probability 1), so (X − μ)² = 0 always. Therefore E[(X − μ)²] = 0.
If Var(X) = E[(X − μ)²] = 0 and (X − μ)² is always ≥ 0, the only way the average can be 0 is if it is 0 wherever probability mass exists.
Formally (intuition-first): if X differed from μ with positive probability, there would be some ε > 0 with P(|X − μ| ≥ ε) > 0, and then E[(X − μ)²] ≥ ε² · P(|X − μ| ≥ ε) > 0, contradicting Var(X) = 0.
“Almost surely” (a.s.) means “with probability 1.”
It allows edge cases where X might misbehave on events of probability 0, but those events do not affect expectations, variances, or most practical probability calculations.
Example: suppose X equals 5 on every outcome except a single event of probability 0, where it takes some other value. Then X = 5 almost surely, E[X] = 5, and Var(X) = 0.
For variance, “Var(X) = 0 implies X is constant a.s.” is the correct statement.
The smaller Var(X) is, the more tightly X concentrates around μ.
A very important inequality (you don’t need to master it yet, but it motivates why variance matters) is Chebyshev’s inequality:
P(|X − μ| ≥ kσ) ≤ 1/k² for every k > 0, where σ = √Var(X).
It says: if variance is small, large deviations from the mean are unlikely.
This is a major reason variance is used as a compact summary of uncertainty.
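As a sketch, the inequality can be verified exactly for a fair six-sided die (any distribution would do):

```python
import math

# Check Chebyshev's inequality P(|X - mu| >= k*sigma) <= 1/k^2
# exactly for a fair six-sided die.
pmf = {x: 1/6 for x in range(1, 7)}

mu    = sum(x * p for x, p in pmf.items())             # 3.5
var   = sum((x - mu)**2 * p for x, p in pmf.items())   # 35/12
sigma = math.sqrt(var)

for k in (1.0, 1.2, 2.0):
    # exact tail probability P(|X - mu| >= k*sigma)
    tail = sum(p for x, p in pmf.items() if abs(x - mu) >= k * sigma)
    assert tail <= 1 / k**2 + 1e-12
    print(f"k={k}: tail = {tail:.3f} <= bound {1/k**2:.3f}")
```

The bound is loose for the die (the true tails are much smaller than 1/k²), which is typical: Chebyshev trades tightness for complete generality.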
Once you know how to compute variance, you can summarize a distribution with (mean, variance). Many families are parameterized this way.
A few examples you’ll meet soon:
- Bernoulli(p): mean p, variance p(1 − p),
- Binomial(n, p): mean np, variance np(1 − p),
- Poisson(λ): mean λ, variance λ,
- Normal(μ, σ²): mean μ, variance σ².
Variance is the “second parameter” in many models because it controls spread.
Variance is the special case of covariance when you compare a variable with itself.
Cov(X, Y) is defined as:
Cov(X, Y) = E[(X − E[X])(Y − E[Y])].
If you plug in Y = X:
Cov(X, X) = E[(X − E[X])(X − E[X])]
= E[(X − μ)²]
= Var(X).
So variance is “self-covariance.” This viewpoint is crucial later:
- covariance matrices carry the variances of a random vector along their diagonal,
- the rule Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y) reduces to simple addition when Cov(X, Y) = 0.
Correlation normalizes covariance by standard deviations:
Corr(X, Y) = Cov(X, Y) / (σₓ σᵧ).
Variance is the building block for both.
In deep learning, activations can drift to have large or tiny variance across layers. If variance explodes or vanishes, gradients can become unstable.
Normalization methods (BatchNorm, LayerNorm) explicitly use mean and variance to re-center and re-scale activations.
A typical pattern for a batch of activations a:
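A sketch in plain Python (real frameworks operate on tensors; the small eps constant is assumed here, as typical normalization layers add one for numerical stability):

```python
import math

def normalize(a, eps=1e-5):
    """Re-center and re-scale a batch of activations to ~mean 0, variance 1."""
    n = len(a)
    mean = sum(a) / n
    var = sum((x - mean) ** 2 for x in a) / n          # divide-by-n variance
    return [(x - mean) / math.sqrt(var + eps) for x in a]

batch = [2.0, 4.0, 6.0, 8.0]
norm = normalize(batch)
print(norm)   # approximately mean 0, variance 1
```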
This makes the normalized activations have approximately mean 0 and variance 1.
Even if you’re not training neural nets yet, the message is: variance isn’t just a summary statistic on paper; keeping it under control is a practical engineering concern.
In probability, Var(X) is a population quantity: it assumes you know the true distribution.
In statistics, you estimate variance from samples x₁, …, xₙ. Two common formulas are:
- the population form, which divides by n: (1/n) ∑ᵢ (xᵢ − x̄)²,
- the sample (unbiased) form, which divides by n − 1: (1/(n − 1)) ∑ᵢ (xᵢ − x̄)²,
where x̄ is the sample mean.
You don’t need the full derivation here, but it’s helpful to recognize that variance connects probability theory (true Var) to data analysis (estimated variance).
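Python’s standard library exposes both conventions; a quick sketch (the data is arbitrary):

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

pop_var    = statistics.pvariance(data)  # divides by n   (population formula)
sample_var = statistics.variance(data)   # divides by n-1 (unbiased sample formula)

print(pop_var, sample_var)   # 4.0 and about 4.571
```

With only eight data points the two estimates differ noticeably; as n grows, the factor n/(n − 1) approaches 1 and the gap vanishes.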
Let X take values {0, 1, 2} with probabilities P(0)=1/4, P(1)=1/2, P(2)=1/4. Find μ = E[X], Var(X), and σ.
Compute the mean:
μ = E[X]
= 0·(1/4) + 1·(1/2) + 2·(1/4)
= 0 + 1/2 + 1/2
= 1.
Compute squared deviations from the mean:
(0 − 1)² = 1
(1 − 1)² = 0
(2 − 1)² = 1.
Take the expectation of squared deviation:
Var(X) = E[(X − μ)²]
= 1·(1/4) + 0·(1/2) + 1·(1/4)
= 1/4 + 0 + 1/4
= 1/2.
Compute standard deviation:
σ = √Var(X) = √(1/2) ≈ 0.7071.
Insight: Variance is an average of squared distances to the mean. Symmetry around μ=1 makes the computation especially simple here.
Let X ∼ Bernoulli(p), so X ∈ {0,1} with P(X=1)=p and P(X=0)=1−p. Compute Var(X).
Compute E[X]:
E[X] = 0·(1−p) + 1·p = p.
Compute E[X²]: since X is 0 or 1, we have X² = X for every outcome.
So E[X²] = E[X] = p.
Apply the identity:
Var(X) = E[X²] − (E[X])²
= p − p²
= p(1 − p).
Sanity check at extremes:
If p=0 or p=1, then Var(X)=0, matching the fact that X is constant.
If p=1/2, then Var(X)=1/4, the maximum spread for a Bernoulli.
Insight: For Bernoulli variables, the shortcut is extremely fast because X² = X. This pattern (compute moments, then combine) is the standard way to get variances for many distributions.
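A sketch checking p(1 − p) against the definition across a few values of p:

```python
# Bernoulli variance: check Var(X) = p(1 - p) against the definition
# E[(X - mu)^2] with mu = p.
for p in (0.0, 0.25, 0.5, 0.9, 1.0):
    mu = p                                                # E[X] = p
    var_def = (0 - mu)**2 * (1 - p) + (1 - mu)**2 * p     # E[(X - mu)^2]
    assert abs(var_def - p * (1 - p)) < 1e-12
print("Var(Bernoulli(p)) = p(1-p) holds for the sampled p values")
```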
Suppose Var(X) = 9 and μ = E[X] = 2. Define Y = 3X − 5. Find E[Y], Var(Y), and σᵧ.
Compute the mean using linearity:
E[Y] = E[3X − 5]
= 3E[X] − 5
= 3·2 − 5
= 1.
Compute the variance using the scaling rule (shift doesn’t matter):
Var(Y) = Var(3X − 5)
= 3² Var(X)
= 9·9
= 81.
Compute standard deviation:
σᵧ = √Var(Y) = √81 = 9.
Insight: Adding/subtracting constants moves the distribution but doesn’t change spread; multiplying by 3 triples standard deviation and multiplies variance by 9.
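To see the numbers concretely, here is a sketch with a hypothetical two-point X that has exactly μ = 2 and Var(X) = 9 (X ∈ {−1, 5} with probability 1/2 each, chosen purely for illustration):

```python
# A concrete X with E[X] = 2 and Var(X) = 9, to check the Y = 3X - 5 example.
pmf_x = {-1: 0.5, 5: 0.5}

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def var(pmf):
    mu = mean(pmf)
    return sum((x - mu)**2 * p for x, p in pmf.items())

pmf_y = {3 * x - 5: p for x, p in pmf_x.items()}   # Y = 3X - 5, values -8 and 10

print(mean(pmf_x), var(pmf_x))   # 2.0 and 9.0
print(mean(pmf_y), var(pmf_y))   # 1.0 and 81.0
```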
Variance measures spread around the mean: Var(X) = E[(X − μ)²], μ = E[X].
Var(X) ≥ 0 always, because it is the expectation of a square.
Var(X) = 0 exactly when X is constant almost surely (P(X = μ) = 1).
Efficient computation: Var(X) = E[X²] − (E[X])².
Shifts don’t change variance: Var(X + c) = Var(X).
Scaling changes variance quadratically: Var(aX) = a²Var(X).
Standard deviation is the square root of variance: σ = √Var(X), restoring original units.
Forgetting to compute μ = E[X] first, or using the wrong mean when plugging into E[(X − μ)²].
Mixing up variance and standard deviation (variance is squared units; standard deviation is √variance).
Thinking Var(X + c) changes with c (it doesn’t); only scaling affects variance magnitude.
Dropping the square when expanding (X − μ)² or misapplying Var(X) = E[X²] − (E[X])².
Let X take values {−1, 0, 1} with probabilities {1/4, 1/2, 1/4}. Compute E[X], Var(X), and σ.
Hint: Symmetry suggests E[X]=0. Then Var(X)=E[X²].
Mean:
E[X] = (−1)·(1/4) + 0·(1/2) + 1·(1/4) = 0.
Variance:
Since μ = 0, Var(X) = E[X²].
E[X²] = 1·(1/4) + 0·(1/2) + 1·(1/4) = 1/2.
So Var(X) = 1/2.
Standard deviation:
σ = √(1/2) ≈ 0.7071.
Use Var(X) = E[X²] − (E[X])². Suppose P(X=1)=0.2, P(X=3)=0.8. Compute Var(X).
Hint: Compute E[X] and E[X²] from the two-point distribution, then subtract.
E[X] = 1·0.2 + 3·0.8 = 0.2 + 2.4 = 2.6.
E[X²] = 1²·0.2 + 3²·0.8 = 1·0.2 + 9·0.8 = 0.2 + 7.2 = 7.4.
Var(X) = E[X²] − (E[X])² = 7.4 − (2.6)² = 7.4 − 6.76 = 0.64.
Assume Var(X)=4. Define Z = −2X + 10. Compute Var(Z) and σ_z.
Hint: Use Var(aX + c) = a²Var(X). The sign of a doesn’t matter after squaring.
Var(Z) = Var(−2X + 10) = (−2)²Var(X) = 4·4 = 16.
σ_z = √Var(Z) = √16 = 4.
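All three exercise answers can be double-checked in a few lines. A sketch using exact enumeration (for exercise 3, a hypothetical concrete X with mean 2 and variance 4 is used for illustration, since only Var(X) = 4 is given):

```python
import math

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def var(pmf):
    mu = mean(pmf)
    return sum((x - mu)**2 * p for x, p in pmf.items())

# Exercise 1: X in {-1, 0, 1} with probabilities 1/4, 1/2, 1/4
ex1 = {-1: 0.25, 0: 0.5, 1: 0.25}
print(mean(ex1), var(ex1), math.sqrt(var(ex1)))   # 0.0, 0.5, ~0.7071

# Exercise 2: P(X=1)=0.2, P(X=3)=0.8
ex2 = {1: 0.2, 3: 0.8}
print(var(ex2))                                   # ~0.64

# Exercise 3: Z = -2X + 10, using an illustrative X with Var(X) = 4
x = {0: 0.5, 4: 0.5}                              # mean 2, variance 4
z = {-2 * v + 10: p for v, p in x.items()}        # Z takes values 10 and 2
print(var(z))                                     # 16.0
```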