Long-run average of a random variable. E[X] = sum of x*P(x).
Self-serve tutorial - low prerequisites, straightforward concepts.
If you repeatedly play a lottery, flip a biased coin for points, or measure noisy sensor readings, you eventually want one number that summarizes what you “typically” get. Expected value is that number: the long-run average you should plan around, even though any single outcome can differ.
The expected value (expectation) E[X] of a random variable X is its probability-weighted average. Discrete: E[X] = ∑ₓ x·P(X=x). Continuous: E[X] = ∫ x f_X(x) dx. Expectation is linear (E[aX+b]=aE[X]+b) and generalizes to E[g(X)]. It may be infinite or undefined if the weighted sum/integral doesn’t converge.
A random variable X can take many values depending on chance. If you want to compare options, set a price, or plan around an uncertain outcome, you need a single summary number.
Expected value is the theoretical long-run average: if you repeatedly draw independent samples X₁, X₂, … from the same distribution, the sample mean

X̄ₙ = (X₁ + X₂ + ⋯ + Xₙ)/n

tends to get close to a fixed value. That fixed value (when it exists) is E[X]. (A later node formalizes this as the Law of Large Numbers.)
Suppose outcomes x happen with probabilities p(x). An ordinary average gives each outcome equal weight. Expected value gives outcome x the weight p(x).
So if big outcomes are rare, they still matter—but only proportionally to how often they occur.
If X is discrete with possible values x and probability mass function P(X = x), the expectation is

E[X] = ∑ₓ x·P(X = x).

You can read this as “sum over all outcomes: (value) × (chance of that value).”
If X is continuous with probability density function f_X(x), then

E[X] = ∫ x f_X(x) dx.

This is the same idea: infinitely many possible values, so the weighted average becomes an integral.
We’ll treat the definitions, linearity, and basic computations as core, and existence issues and heavy-tailed subtleties as optional/advanced. You can learn and apply expected value well with the core, then come back for the optional parts when you need them.
For a discrete X, you need two things:
1) the set of possible values {x}, and
2) the probability of each value p(x).
Then compute the weighted sum ∑ x p(x).
Let X be the value of a fair six-sided die. Then P(X=k)=1/6 for k∈{1,2,3,4,5,6}, so

E[X] = (1 + 2 + 3 + 4 + 5 + 6)·(1/6) = 21/6 = 3.5.

Notice 3.5 is not an outcome on the die—expected value is a planning number, not a predicted single roll.
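As a minimal sketch, the weighted-sum recipe can be written directly in Python (the variable names are illustrative):

```python
# Weighted sum for a fair six-sided die: E[X] = sum of value * probability.
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

expected_value = sum(x * p for x, p in zip(outcomes, probs))
print(expected_value)  # 3.5 (up to floating-point rounding)
```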
For continuous X, the density f_X(x) plays the role of “probability per unit x.” The integral

E[X] = ∫ x f_X(x) dx

is the continuous weighted average.
If X ∼ Uniform(0,1), then f_X(x)=1 for x∈[0,1] and 0 otherwise, so E[X] = ∫₀¹ x dx = 1/2.
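A small Monte Carlo check, using Python’s standard `random` module, illustrates the long-run-average reading of E[X] = 1/2 for Uniform(0,1):

```python
import random

# Sample means of Uniform(0,1) draws settle near E[X] = 1/2.
random.seed(0)  # fixed seed for reproducibility
n = 100_000
samples = [random.random() for _ in range(n)]
sample_mean = sum(samples) / n
print(round(sample_mean, 3))  # close to 0.5
```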
A useful physical analogy: imagine each value x has “mass” p(x) (discrete) or density f(x)dx (continuous). The expected value is the balance point.
1) Range check: If X is always between a and b, then E[X] must lie between a and b.
2) Symmetry: If a distribution is symmetric around 0 and E[X] exists, then E[X]=0.
3) Weights sum to 1: For discrete, verify ∑ p(x)=1; for continuous, ∫ f(x)dx=1.
Computing expectation is usually straightforward bookkeeping—until you meet distributions with extremely large values or tails. That’s where the next sections add nuance.
Many random variables are built from simpler pieces:
Computing the full distribution of a sum can be hard. Expected value often stays easy because expectation is linear.
For random variables X and Y (no independence required):

E[X + Y] = E[X] + E[Y].

For constants a, b:

E[aX + b] = aE[X] + b.

More generally, for any finite sum:

E[a₁X₁ + ⋯ + aₙXₙ] = a₁E[X₁] + ⋯ + aₙE[Xₙ].
Assume (X,Y) are discrete.
Start with the definition:

E[X+Y] = ∑ₓ∑ᵧ (x + y)·P(X=x, Y=y).

Split the sum:

E[X+Y] = ∑ₓ∑ᵧ x·P(X=x, Y=y) + ∑ₓ∑ᵧ y·P(X=x, Y=y).

Now notice:

∑ᵧ P(X=x, Y=y) = P(X=x),

so

∑ₓ∑ᵧ x·P(X=x, Y=y) = ∑ₓ x·P(X=x) = E[X].

Similarly the second term becomes E[Y]. Therefore E[X+Y]=E[X]+E[Y].
Define an indicator random variable I for an event A: I = 1 if A occurs, and I = 0 otherwise. Then

E[I] = 1·P(A) + 0·(1 − P(A)) = P(A).
This turns probability questions into expectation questions.
Example: Let X be the number of heads in n coin flips (not necessarily fair). Let Iᵢ indicate “flip i is heads.” Then X = ∑ᵢ Iᵢ, so

E[X] = ∑ᵢ E[Iᵢ] = ∑ᵢ P(flip i is heads).

If the coin has P(heads)=p each time, then E[X]=np.
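A quick simulation sketch (the names and the choice of n and p are illustrative) compares the linearity answer np with the average count over many simulated runs:

```python
import random

# Check E[heads] = n*p: linearity gives the answer instantly, simulation agrees.
random.seed(1)
n, p = 20, 0.3
expected_heads = n * p  # sum of n indicator expectations, each equal to p

trials = 50_000
avg_heads = sum(
    sum(random.random() < p for _ in range(n)) for _ in range(trials)
) / trials
print(expected_heads, round(avg_heads, 2))
```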
A classic confusion is to assume expectation distributes over products: E[XY] = E[X]·E[Y].
That equality holds under independence (and some integrability conditions), but linearity alone doesn’t give it.
Expected value is not just a number attached to X—it’s an operator that maps a random variable to a number.
Often we care about a transformed quantity g(X):

E[g(X)] = ∑ₓ g(x)·P(X=x) (discrete), or E[g(X)] = ∫ g(x) f_X(x) dx (continuous).

This is the same weighted-average idea, just applied after transforming the outcomes.
A subtle but powerful point: to compute E[g(X)], you usually do not need the distribution of Y=g(X). You can compute directly from X’s distribution using the formulas above.
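For example, E[X²] for a fair die can be computed directly from X’s distribution, without ever deriving the distribution of X²; a minimal sketch:

```python
# E[g(X)] computed directly from X's distribution (no distribution of g(X) needed).
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

def g(x):
    return x * x  # the transformation of interest

e_g = sum(g(x) * p for x, p in zip(outcomes, probs))
print(e_g)  # 91/6 ≈ 15.1667
```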
So far, we’ve treated expectation as always producing a finite number. But expectation is only well-defined when the weighted sum/integral converges.
A practical sufficient condition is that the absolute expectation is finite: ∑ₓ |x|·P(X=x) < ∞ (discrete) or ∫ |x| f_X(x) dx < ∞ (continuous).
If these diverge, several things can happen:
| Situation | What it means | Typical phrase |
|---|---|---|
| E[X] is a finite real number | weighted average converges | “expectation exists” |
| E[X] = +∞ or −∞ | one-sided integral/sum diverges | “infinite expectation” |
| undefined | positive and negative parts both diverge | “does not exist” |
A heavy-tailed distribution puts enough probability on huge values that the “average” never settles.
A famous example is the Cauchy distribution: it’s symmetric, but its tails are so heavy that E[X] is undefined (the integral does not converge in the required sense). That’s why sample means of Cauchy draws behave wildly even for large n.
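A quick simulation sketch, using the standard inverse-CDF construction X = tan(π(U − ½)) for Cauchy draws (an assumption of this sketch, not stated in the text), shows running sample means failing to settle:

```python
import math
import random

# Cauchy samples via X = tan(pi*(U - 1/2)); the sample mean never settles
# because E[X] is undefined.
random.seed(42)
n = 100_000
samples = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]

running_means = []
total = 0.0
for i, x in enumerate(samples, start=1):
    total += x
    if i in (1_000, 10_000, 100_000):
        running_means.append(total / i)

print(running_means)  # typically very different values, unlike a finite-mean case
```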
If you’re learning expectation for the first time, don’t let this scare you: most standard distributions used early (Bernoulli, Binomial, Uniform, Normal, Exponential) have finite expectations. But it’s important to know that “average” is not guaranteed by definition—it’s a property that may or may not hold.
A gamble is often called “fair” if its expected payoff is 0 (or if the price equals expected payout).
If you pay cost c to play and receive random payout X, then the net payoff is X − c. By linearity:

E[X − c] = E[X] − c.

A fair price is c = E[X] (ignoring risk preferences).
In supervised learning, we often minimize expected risk:

R(θ) = E₍ₓ,ᵧ₎∼𝒟 [ loss(f_θ(x), y) ].

We don’t know the true distribution 𝒟, so we approximate R(θ) with the empirical average over data:

R̂(θ) = (1/n) ∑ᵢ loss(f_θ(xᵢ), yᵢ).
The idea “sample average ≈ expected value” is exactly the intuition behind expected value and what the Law of Large Numbers formalizes.
Stochastic Gradient Descent uses a random mini-batch to estimate the gradient of expected loss.
If g(θ) is the gradient computed from a random sample, SGD relies on it being (approximately) unbiased:

E[g(θ)] = ∇R(θ).

This is an expectation statement: on average, the noisy gradient points in the true direction.
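As a toy illustration (not any particular ML library), the sketch below checks that a mini-batch gradient of a simple 1-D squared loss is unbiased for the full-data gradient; the dataset, loss, and batch size are all assumptions of this example:

```python
import random

# Toy check: the average of many mini-batch gradients matches the full-data
# gradient of the loss mean((theta - d)^2 / 2), whose gradient is theta - mean(d).
random.seed(0)
data = [random.gauss(0, 1) for _ in range(1_000)]
theta = 2.0

full_grad = theta - sum(data) / len(data)

def minibatch_grad(batch_size=32):
    batch = random.sample(data, batch_size)  # random mini-batch
    return theta - sum(batch) / batch_size

estimates = [minibatch_grad() for _ in range(20_000)]
avg_estimate = sum(estimates) / len(estimates)
print(round(full_grad, 3), round(avg_estimate, 3))  # nearly equal
```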
Expected value is the first “global” summary of a distribution you should reach for: it’s simple, composable via linearity, and it’s the backbone of many later definitions.
A game pays $10 with probability 0.2, $2 with probability 0.5, and $0 with probability 0.3. Let X be the payout in dollars. Find E[X] and a fair entry price c (ignoring risk).
List outcomes and probabilities: X = 10 with probability 0.2, X = 2 with probability 0.5, X = 0 with probability 0.3.
Apply the discrete expectation formula:

E[X] = 10(0.2) + 2(0.5) + 0(0.3).

Compute:
10(0.2) = 2
2(0.5) = 1
0(0.3) = 0
So

E[X] = 2 + 1 + 0 = 3.

A fair entry price c makes the expected net payoff zero. Net payoff = X − c, so by linearity E[X − c] = E[X] − c = 3 − c = 0, giving c = 3.
Insight: Expected value treats each payout as contributing “value × frequency.” Even though the $10 payout is rare, it still contributes 10 × 0.2 = $2 to the average.
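The arithmetic above can be checked with a short sketch, including the weights-sum-to-1 sanity check from earlier:

```python
# Worked example: payouts 10, 2, 0 with probabilities 0.2, 0.5, 0.3.
payouts = [10, 2, 0]
probs = [0.2, 0.5, 0.3]

assert abs(sum(probs) - 1.0) < 1e-12  # sanity check: weights sum to 1
e_x = sum(x * p for x, p in zip(payouts, probs))
fair_price = e_x  # the fair entry price makes E[X - c] = 0
print(e_x, fair_price)  # 3.0 3.0
```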
You run 5 independent trials. Trial i succeeds with probability pᵢ (not necessarily the same across trials). Let X be the total number of successes. Compute E[X].
Define indicator variables:
Let Iᵢ = 1 if trial i succeeds, else 0.
Then the total number of successes is

X = I₁ + I₂ + I₃ + I₄ + I₅.

Compute each indicator’s expectation:

E[Iᵢ] = 1·pᵢ + 0·(1 − pᵢ) = pᵢ.

Apply linearity of expectation (no extra assumptions needed beyond finiteness):

E[X] = E[I₁] + ⋯ + E[I₅] = p₁ + p₂ + p₃ + p₄ + p₅.
Insight: Linearity lets you avoid computing the distribution of X. Even if the trials have different probabilities, the expected total is just the sum of the individual success probabilities.
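A short sketch (with made-up probabilities p₁,…,p₅) confirms the linearity answer against simulation:

```python
import random

# Linearity: E[total successes] = p1 + ... + p5, even with unequal probabilities.
random.seed(7)
ps = [0.1, 0.3, 0.5, 0.7, 0.9]  # illustrative trial probabilities
expected_total = sum(ps)  # = 2.5

trials = 50_000
avg_total = sum(
    sum(random.random() < p for p in ps) for _ in range(trials)
) / trials
print(expected_total, round(avg_total, 2))
```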
Let X have an exponential distribution with rate λ>0, meaning f_X(x)=λe^{−λx} for x≥0 and 0 otherwise. Compute E[X].
Start from the definition:

E[X] = ∫₀^∞ x·λe^{−λx} dx.

Compute the integral using integration by parts.
Let u = x, so du = dx.
Let dv = λe^{−λx} dx, so v = −e^{−λx}.
Then

E[X] = [−x e^{−λx}]₀^∞ + ∫₀^∞ e^{−λx} dx.

Evaluate the boundary term:
As x→∞, x e^{−λx} → 0, so x(−e^{−λx}) → 0.
At x=0, 0·(−e^{0}) = 0.
So the boundary term vanishes.
Compute the remaining integral:

∫₀^∞ e^{−λx} dx = 1/λ.

Therefore E[X] = 1/λ.
Insight: For continuous variables, expectation is still a weighted average—just spread across a continuum. The exponential distribution’s mean 1/λ matches the intuition: higher rate λ means shorter expected waiting time.
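A Monte Carlo check, using Python’s `random.expovariate` (which takes the rate λ as its argument), shows sample means approaching 1/λ:

```python
import random

# Sample means of Exponential(rate=lam) draws approach E[X] = 1/lam.
random.seed(3)
lam = 2.0
n = 200_000
sample_mean = sum(random.expovariate(lam) for _ in range(n)) / n
print(round(sample_mean, 3))  # close to 1/lam = 0.5
```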
Expected value E[X] is the theoretical long-run average of a random variable, aligning with the sample mean under repeated draws (formalized later by LLN).
Discrete expectation is a probability-weighted sum: E[X]=∑ₓ x·P(X=x). Continuous expectation is a probability-weighted integral: E[X]=∫ x f_X(x) dx.
Linearity is the main computational tool: E[X+Y]=E[X]+E[Y] and E[aX+b]=aE[X]+b, without needing independence.
Indicator variables convert probabilities into expectations: if I is 1 on event A and 0 otherwise, then E[I]=P(A).
Expectation generalizes to transformations: E[g(X)] = ∑ g(x)p(x) or ∫ g(x)f(x)dx (often without finding the distribution of g(X)).
E[X] may be infinite or undefined for heavy-tailed distributions; finiteness typically requires ∑ |x|p(x) < ∞ or ∫ |x|f(x)dx < ∞.
Expected value is foundational for variance, entropy, fair games, and optimizing expected loss in machine learning.
Thinking E[X] must be a value X can actually take (e.g., expecting a die to roll 3.5).
Forgetting that probabilities must sum/integrate to 1 before computing E[X], leading to incorrect weighted averages.
Assuming E[XY]=E[X]E[Y] without checking independence (linearity does not apply to products).
Ignoring existence: applying expectation formulas to heavy-tailed cases where the sum/integral diverges, producing misleading “answers.”
A biased coin lands heads with probability p. Let X be the payout where you get $1 for heads and $0 for tails. Compute E[X].
Hint: List the two outcomes (1 and 0) and weight by their probabilities.
Outcomes: X=1 with probability p, and X=0 with probability 1−p. So

E[X] = 1·p + 0·(1−p) = p.

The expected payout is p dollars.
Let X be the result of a fair six-sided die. Define Y = 2X − 1. Compute E[Y] using linearity (do not re-sum from scratch).
Hint: First recall E[X] for a fair die, then apply E[aX+b]=aE[X]+b.
For a fair die, E[X] = 3.5. Using linearity:

E[Y] = 2E[X] − 1 = 2(3.5) − 1 = 6.
Optional/advanced: Let X take values 1,2,3,… with probability P(X=k)=c/k^2 for k≥1. (a) Find c. (b) Does E[X] exist as a finite number?
Hint: Use that ∑_{k=1}^∞ 1/k^2 converges (to π²/6). For part (b), examine ∑ k·(c/k²).
(a) We need probabilities to sum to 1:

∑_{k=1}^∞ c/k² = c·∑_{k=1}^∞ 1/k² = 1.

Using ∑_{k=1}^∞ 1/k² = π²/6, we get c = 6/π².
(b) Compute expectation:

E[X] = ∑_{k=1}^∞ k·(c/k²) = c·∑_{k=1}^∞ 1/k.
But ∑_{k=1}^∞ 1/k diverges, so E[X] is infinite (does not exist as a finite number). In this case we say the expectation diverges to +∞.
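A numerical sketch shows the partial sums c·∑ 1/k growing without bound at successive checkpoints:

```python
import math

# Partial sums of E[X] = sum k*(c/k^2) = c * sum 1/k keep growing (harmonic series).
c = 6 / math.pi**2
partials = []
total = 0.0
for k in range(1, 1_000_001):
    total += k * (c / k**2)  # each term equals c/k
    if k in (10, 1_000, 1_000_000):
        partials.append(round(total, 3))
print(partials)  # each checkpoint is noticeably larger; the sum never converges
```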
Next nodes you can unlock and why they depend on expected value: