Expected Value

Probability & Statistics · Difficulty: ██░░░ · Depth: 3 · Unlocks: 74

Long-run average of a random variable. E[X] = sum of x*P(x).


Core Concepts

  • Expected value = the theoretical long-run average of a random variable (the value the sample mean approaches under repeated independent draws)
  • Definition as a probability-weighted average: for discrete X, E[X] = sum_x x * P(X = x); for continuous X, E[X] = integral x * f_X(x) dx
  • Existence criterion: the expectation is defined only if the weighted sum/integral converges (finite); otherwise E[X] may be infinite or undefined

Key Symbols & Notation

E[X] (expectation operator on random variable X)

Essential Relationships

  • Linearity: E[aX + b] = a * E[X] + b for constants a,b

Prerequisites (1)

All Concepts (5)

  • Expected value: a single-number summary of a random variable representing its long-run average outcome
  • Expectation defined for discrete random variables as a probability-weighted average of possible outcomes
  • Interpretation of expectation as the average result one would observe over many repeated independent trials (long-run frequency interpretation)
  • Existence/finiteness condition: the expectation is meaningful only when the defining sum converges to a finite value
  • Support-based summation: the expectation is computed by summing only over the random variable's possible values (its support)

Teaching Strategy

Self-serve tutorial - low prerequisites, straightforward concepts.

If you repeatedly play a lottery, flip a biased coin for points, or measure noisy sensor readings, you eventually want one number that summarizes what you “typically” get. Expected value is that number: the long-run average you should plan around, even though any single outcome can differ.

TL;DR:

The expected value (expectation) E[X] of a random variable X is its probability-weighted average. Discrete: E[X] = ∑ₓ x·P(X=x). Continuous: E[X] = ∫ x f_X(x) dx. Expectation is linear (E[aX+b]=aE[X]+b) and generalizes to E[g(X)]. It may be infinite or undefined if the weighted sum/integral doesn’t converge.

What Is Expected Value?

Why we need it (motivation)

A random variable X can take many values depending on chance. If you want to:

  • compare two gambling games,
  • decide whether an insurance policy is “fair,”
  • estimate average runtime of a randomized algorithm,
  • or reason about average loss in machine learning,

you need a single summary number.

Expected value is the theoretical long-run average: if you repeatedly draw independent samples X₁, X₂, … from the same distribution, the sample mean

$$\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$$

tends to get close to a fixed value. That fixed value (when it exists) is E[X]. (A later node formalizes this as the Law of Large Numbers.)
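This convergence is easy to see numerically. A minimal sketch using Python's standard library (the fair-die example is illustrative):

```python
import random

random.seed(42)  # fixed seed so the demo is reproducible

def sample_mean(n):
    """Average of n independent fair six-sided die rolls."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(n, sample_mean(n))  # drifts toward E[X] = 3.5 as n grows
```

With only 10 rolls the mean can be far from 3.5; by 100,000 rolls it is typically within a few hundredths.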

Intuition first: “average with weights”

Suppose outcomes x happen with probabilities p(x). An ordinary average gives each outcome equal weight. Expected value gives outcome x the weight p(x).

So if big outcomes are rare, they still matter—but only proportionally to how often they occur.

Definition (discrete)

If X is discrete with possible values x and probability mass function P(X = x), the expectation is

$$\mathbb{E}[X] = \sum_x x\,\mathbb{P}(X=x).$$

You can read this as “sum over all outcomes: (value) × (chance of that value).”

Definition (continuous)

If X is continuous with probability density function f_X(x), then

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x\, f_X(x)\, dx.$$

This is the same idea: infinitely many possible values, so the weighted average becomes an integral.

Units and interpretation

  • E[X] has the same units as X. If X is dollars, E[X] is dollars.
  • E[X] is not necessarily a value X can actually take (e.g., the average of 0 and 1 is 0.5).

Difficulty calibration note

We’ll treat the following as core vs optional/advanced:

  • Core: compute E[X] for discrete/continuous distributions; linearity; interpretation.
  • Optional/Advanced: existence/undefined expectations; heavy tails; expectation of functions E[g(X)].

You can learn and apply expected value well with the core, then come back for the optional parts when you need them.

Core Mechanic 1: Computing E[X] as a Probability-Weighted Average

Discrete examples: build the pattern

For a discrete X, you need two things:

1) the set of possible values {x}, and

2) the probability of each value p(x).

Then compute the weighted sum ∑ x p(x).

Example pattern: dice

Let X be the value of a fair six-sided die. Then P(X=k)=1/6 for k∈{1,2,3,4,5,6}.

$$\mathbb{E}[X] = \sum_{k=1}^6 k\cdot \frac{1}{6} = \frac{1}{6}(1+2+3+4+5+6)=3.5.$$

Notice 3.5 is not an outcome on the die—expected value is a planning number, not a predicted single roll.
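The weighted-sum recipe is mechanical enough to package as a small helper. A sketch (the `expectation` function and the `pmf`-as-dict layout are illustrative, not a standard API):

```python
def expectation(pmf):
    """E[X] = sum over x of x * P(X = x) for a discrete distribution.

    pmf: dict mapping each possible value x to its probability P(X = x).
    """
    total = sum(pmf.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in pmf.items())

die = {k: 1 / 6 for k in range((1), 7)}  # fair six-sided die
print(expectation(die))  # ≈ 3.5
```

Checking that the weights sum to 1 before averaging catches a common bookkeeping error early.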

Continuous examples: computing E[X] with integrals

For continuous X, the density f_X(x) plays the role of “probability per unit x.” The integral

$$\int x\, f_X(x)\, dx$$

is the continuous weighted average.

Example pattern: Uniform(0,1)

If X ∼ Uniform(0,1), then f_X(x)=1 for x∈[0,1] and 0 otherwise.

$$\mathbb{E}[X]=\int_0^1 x\cdot 1\, dx = \left[\frac{x^2}{2}\right]_0^1 = \frac{1}{2}.$$
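For densities without a convenient antiderivative, the same weighted average can be approximated numerically. A sketch using a midpoint Riemann sum (function names are illustrative):

```python
def expectation_continuous(pdf, a, b, n=100_000):
    """Midpoint Riemann approximation of E[X] = integral of x * pdf(x) over [a, b]."""
    h = (b - a) / n
    midpoints = (a + (i + 0.5) * h for i in range(n))
    return sum(x * pdf(x) * h for x in midpoints)

# Uniform(0,1): f(x) = 1 on [0, 1], so the balance point is 1/2.
print(expectation_continuous(lambda x: 1.0, 0.0, 1.0))  # ≈ 0.5
```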

Expectation as “center of mass” (intuition)

A useful physical analogy: imagine each value x has “mass” p(x) (discrete) or density f(x)dx (continuous). The expected value is the balance point.

  • If probability mass shifts right, E[X] increases.
  • If you add a rare but huge outcome, E[X] can jump noticeably.

Sanity checks when computing

1) Range check: If X is always between a and b, then E[X] must lie between a and b.

2) Symmetry: If a distribution is symmetric around 0, often E[X]=0 (when it exists).

3) Weights sum to 1: For discrete, verify ∑ p(x)=1; for continuous, ∫ f(x)dx=1.

Core takeaway

Computing expectation is usually straightforward bookkeeping—until you meet distributions with extremely large values or tails. That’s where the next sections add nuance.

Core Mechanic 2: Linearity of Expectation (the "superpower")

Why linearity matters

Many random variables are built from simpler pieces:

  • total reward = sum of per-step rewards,
  • total cost = sum of random costs,
  • total heads = sum of indicator variables,
  • ML loss over a dataset = average of per-example losses.

Computing the full distribution of a sum can be hard. Expected value often stays easy because expectation is linear.

Linearity rules (core)

For random variables X and Y (no independence required):

$$\mathbb{E}[X+Y] = \mathbb{E}[X] + \mathbb{E}[Y].$$

For constants a, b:

$$\mathbb{E}[aX+b] = a\,\mathbb{E}[X] + b.$$

More generally, for any finite sum:

$$\mathbb{E}\Big[\sum_{i=1}^n X_i\Big] = \sum_{i=1}^n \mathbb{E}[X_i].$$
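Crucially, no independence is required. A quick exact check on a deliberately dependent pair, where Y is always a copy of X (the joint pmf is a made-up example):

```python
# Joint pmf P(X=x, Y=y) for a fully dependent pair: Y always equals X.
joint = {(0, 0): 0.7, (1, 1): 0.3}

E_X = sum(x * p for (x, y), p in joint.items())          # marginal mean of X
E_Y = sum(y * p for (x, y), p in joint.items())          # marginal mean of Y
E_sum = sum((x + y) * p for (x, y), p in joint.items())  # E[X + Y] directly

print(E_sum, E_X + E_Y)  # equal, despite total dependence
```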

Mini-derivation (discrete)

Assume (X,Y) are discrete.

Start with the definition:

$$\mathbb{E}[X+Y] = \sum_{x,y} (x+y)\,\mathbb{P}(X=x, Y=y).$$

Split the sum:

$$\mathbb{E}[X+Y] = \sum_{x,y} x\,\mathbb{P}(X=x, Y=y) + \sum_{x,y} y\,\mathbb{P}(X=x, Y=y).$$

Now notice:

$$\sum_{y} \mathbb{P}(X=x, Y=y) = \mathbb{P}(X=x)$$

so

$$\sum_{x,y} x\,\mathbb{P}(X=x, Y=y) = \sum_x x\,\mathbb{P}(X=x) = \mathbb{E}[X].$$

Similarly the second term becomes E[Y]. Therefore E[X+Y]=E[X]+E[Y].

Indicators: a common trick

Define an indicator random variable I for an event A:

  • I = 1 if A happens
  • I = 0 otherwise

Then

$$\mathbb{E}[I] = 1\cdot \mathbb{P}(A) + 0\cdot (1-\mathbb{P}(A)) = \mathbb{P}(A).$$

This turns probability questions into expectation questions.

Example: Let X be the number of heads in n coin flips (not necessarily fair). Let Iᵢ indicate “flip i is heads.” Then X = ∑ᵢ Iᵢ, so

$$\mathbb{E}[X] = \sum_{i=1}^n \mathbb{E}[I_i] = \sum_{i=1}^n \mathbb{P}(\text{heads on } i).$$

If the coin has P(heads)=p each time, then E[X]=np.
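A simulation makes the indicator argument concrete. A sketch (the per-flip probabilities are arbitrary illustration values):

```python
import random

random.seed(7)
p = [0.1, 0.5, 0.9, 0.3, 0.3]  # per-flip heads probabilities (hypothetical)

expected = sum(p)  # linearity: E[X] = p_1 + ... + p_n

trials = 200_000
total_heads = sum(
    sum(random.random() < pi for pi in p)  # heads count in one run of the flips
    for _ in range(trials)
)
print(expected, total_heads / trials)  # simulated average ≈ sum of the p_i
```

Note that neither the formula nor the check ever needed the full distribution of X, only its mean.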

What linearity does not say

A classic confusion is to assume expectation distributes over products:

  • Generally, E[XY] ≠ E[X]E[Y].

That equality holds under independence (and some integrability conditions), but linearity alone doesn’t give it.
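A tiny counterexample with exact arithmetic, taking Y = X (fully dependent):

```python
pmf = {0: 0.5, 1: 0.5}  # X uniform on {0, 1}; set Y = X

E_X = sum(x * p for x, p in pmf.items())       # E[X] = 0.5
E_XY = sum(x * x * p for x, p in pmf.items())  # E[XY] = E[X^2] = 0.5

print(E_XY, E_X * E_X)  # 0.5 vs 0.25: E[XY] != E[X]E[Y] here
```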

Optional/Advanced: Expectation of Functions E[g(X)] and Existence Issues

E[g(X)] (operator viewpoint)

Expected value is not just a number attached to X—it’s an operator that maps a random variable to a number.

Often we care about a transformed quantity g(X):

  • squared error: g(X) = (X−c)²
  • absolute deviation: g(X)=|X|
  • utility in economics: g(X)=u(X)
  • loss in ML: g(X)=ℓ(X, y)

Definition (discrete)

$$\mathbb{E}[g(X)] = \sum_x g(x)\,\mathbb{P}(X=x).$$

Definition (continuous)

$$\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx.$$

This is the same weighted-average idea, just applied after transforming the outcomes.

Law of the unconscious statistician (LOTUS)

A subtle but powerful point: to compute E[g(X)], you usually do not need the distribution of Y=g(X). You can compute directly from X’s distribution using the formulas above.
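A small sketch of LOTUS in action (the pmf is a made-up symmetric example):

```python
pmf = {-2: 0.25, 0: 0.5, 2: 0.25}  # pmf of X (illustrative)

def g(x):
    return x * x  # transformation of interest: g(X) = X^2

# LOTUS: weight g(x) by P(X = x) directly; the pmf of Y = g(X) is never built.
E_gX = sum(g(x) * p for x, p in pmf.items())
print(E_gX)  # 2.0
```

Here E[X] = 0 but E[X²] = 2, a first glimpse of why variance (a later node) is defined through E[g(X)].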

When expectation does not exist (or is infinite)

So far, we’ve treated expectation as always producing a finite number. But expectation is only well-defined when the weighted sum/integral converges.

A practical sufficient condition is that the absolute expectation is finite:

  • Discrete: $\sum_x |x|\,\mathbb{P}(X=x) < \infty$
  • Continuous: $\int_{-\infty}^{\infty} |x|\, f_X(x)\, dx < \infty$

If these diverge, several things can happen:

| Situation | What it means | Typical phrase |
| --- | --- | --- |
| E[X] is a finite real number | weighted average converges | “expectation exists” |
| E[X] = +∞ or −∞ | one-sided integral/sum diverges | “infinite expectation” |
| undefined | positive and negative parts both diverge | “does not exist” |

Heavy tails (intuition)

A heavy-tailed distribution puts enough probability on huge values that the “average” never settles.

A famous example is the Cauchy distribution: it’s symmetric, but its tails are so heavy that E[X] is undefined (the integral does not converge in the required sense). That’s why sample means of Cauchy draws behave wildly even for large n.
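You can watch this instability in a short simulation. A sketch that draws standard Cauchy samples via the inverse-CDF transform tan(π(U − ½)); no expected output is shown because the running means genuinely refuse to settle:

```python
import math
import random

random.seed(1)

def cauchy_draw():
    """Standard Cauchy sample via inverse CDF: tan(pi * (U - 0.5)), U ~ Uniform(0,1)."""
    return math.tan(math.pi * (random.random() - 0.5))

# Unlike the die example earlier, these sample means do not stabilize as n grows.
for n in (100, 10_000, 1_000_000):
    print(n, sum(cauchy_draw() for _ in range(n)) / n)
```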

Why this matters in practice

  • In modeling, assuming E[X] exists lets you use many theorems (LLN, variance formulas, etc.).
  • In risk/finance, rare catastrophic outcomes can dominate expectation.
  • In ML, expectation of loss is usually well-defined, but certain data distributions (extreme outliers) can make empirical averages unstable.

If you’re learning expectation for the first time, don’t let this scare you: most standard distributions used early (Bernoulli, Binomial, Uniform, Normal, Exponential) have finite expectations. But it’s important to know that “average” is not guaranteed by definition—it’s a property that may or may not hold.

Applications and Connections: From Fair Games to SGD

Fairness and pricing (games, insurance)

A gamble is often called “fair” if its expected payoff is 0 (or if the price equals expected payout).

If you pay cost c to play and receive random payout X, then net payoff is X−c. By linearity:

$$\mathbb{E}[X-c] = \mathbb{E}[X] - c.$$

A fair price is c = E[X] (ignoring risk preferences).

Expected loss (machine learning viewpoint)

In supervised learning, we often minimize expected risk:

$$R(\theta) = \mathbb{E}_{(x,y)\sim \mathcal{D}}\,[\ell(\theta; x,y)].$$

We don’t know the true distribution 𝒟, so we approximate R(θ) with the empirical average over data:

$$\hat R(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(\theta; x_i, y_i).$$

The idea “sample average ≈ expected value” is exactly the intuition behind expected value and what the Law of Large Numbers formalizes.

Why SGD works with expectation

Stochastic Gradient Descent uses a random mini-batch to estimate the gradient of expected loss.

If g(θ) is the gradient computed from a random sample, SGD relies on it being (approximately) unbiased:

$$\mathbb{E}[g(\theta)] = \nabla R(\theta).$$

This is an expectation statement: on average, the noisy gradient points in the true direction.

Connecting expectation to what comes next

  • Variance measures spread around the mean μ = E[X] using E[(X−μ)²].
  • Entropy uses an expectation too: H(X)=E[−log p(X)] (discrete).
  • Law of Large Numbers explains the long-run convergence of sample means to E[X].
  • Game theory uses expected payoff when players randomize strategies.

Expected value is the first “global” summary of a distribution you should reach for: it’s simple, composable via linearity, and it’s the backbone of many later definitions.

Worked Examples (3)

Compute E[X] for a simple discrete gamble

A game pays $10 with probability 0.2, pays $2 with probability 0.5, and pays $0 with probability 0.3. Let X be the payout in dollars. Find E[X] and a fair entry price c (ignoring risk).

  1. List outcomes and probabilities:

    • x=10 with p=0.2
    • x=2 with p=0.5
    • x=0 with p=0.3
  2. Apply the discrete expectation formula:

    $$\mathbb{E}[X]=\sum_x x\,\mathbb{P}(X=x)=10(0.2)+2(0.5)+0(0.3).$$
  3. Compute:

    10(0.2)=2

    2(0.5)=1

    0(0.3)=0

    So

    $$\mathbb{E}[X]=2+1+0=3.$$
  4. A fair entry price c makes expected net payoff zero:

    Net payoff = X − c

    $$\mathbb{E}[X-c]=\mathbb{E}[X]-c=0 \Rightarrow c=\mathbb{E}[X]=3.$$

Insight: Expected value treats each payout as contributing “value × frequency.” Even though $10 is large, it happens only 20% of the time, so it contributes only $2 to the average.
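The whole computation fits in two lines of Python (the dict layout is illustrative):

```python
payouts = {10: 0.2, 2: 0.5, 0: 0.3}  # payout in dollars -> probability

fair_price = sum(x * p for x, p in payouts.items())
print(fair_price)  # 3.0
```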

Use linearity with indicator variables: expected number of successes

You run 5 independent trials. Trial i succeeds with probability pᵢ (not necessarily the same across trials). Let X be the total number of successes. Compute E[X].

  1. Define indicator variables:

    Let Iᵢ = 1 if trial i succeeds, else 0.

    Then the total number of successes is

    $$X = \sum_{i=1}^5 I_i.$$
  2. Compute each indicator’s expectation:

    $$\mathbb{E}[I_i] = 1\cdot p_i + 0\cdot (1-p_i)=p_i.$$
  3. Apply linearity of expectation (no extra assumptions needed beyond finiteness):

    $$\mathbb{E}[X] = \mathbb{E}\left[\sum_{i=1}^5 I_i\right] = \sum_{i=1}^5 \mathbb{E}[I_i] = \sum_{i=1}^5 p_i.$$

Insight: Linearity lets you avoid computing the distribution of X. Even if the trials have different probabilities, the expected total is just the sum of the individual success probabilities.

Continuous expectation: E[X] for an exponential distribution

Let X have an exponential distribution with rate λ>0, meaning f_X(x)=λe^{−λx} for x≥0 and 0 otherwise. Compute E[X].

  1. Start from the definition:

    $$\mathbb{E}[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_0^{\infty} x\, \lambda e^{-\lambda x}\,dx.$$
  2. Compute the integral using integration by parts.

    Let $u = x$ so $du = dx$.

    Let $dv = \lambda e^{-\lambda x}\,dx$ so $v = -e^{-\lambda x}$.

    Then

    $$\int_0^{\infty} x\, \lambda e^{-\lambda x}\, dx = \left[x(-e^{-\lambda x})\right]_0^{\infty} - \int_0^{\infty} (-e^{-\lambda x})\,dx.$$
  3. Evaluate the boundary term:

    As x→∞, x e^{−λx} → 0, so x(−e^{−λx}) → 0.

    At x=0, x(−e^{0}) = 0.

    So

    $$\left[x(-e^{-\lambda x})\right]_0^{\infty} = 0.$$
  4. Compute the remaining integral:

    $$-\int_0^{\infty} (-e^{-\lambda x})\, dx = \int_0^{\infty} e^{-\lambda x}\, dx = \left[-\frac{1}{\lambda}e^{-\lambda x}\right]_0^{\infty} = 0 - \left(-\frac{1}{\lambda}\right)=\frac{1}{\lambda}.$$
  5. Therefore

    $$\mathbb{E}[X] = \frac{1}{\lambda}.$$

Insight: For continuous variables, expectation is still a weighted average—just spread across a continuum. The exponential distribution’s mean 1/λ matches the intuition: higher rate λ means shorter expected waiting time.
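A quick simulation check using the standard library's exponential sampler `random.expovariate` (the rate is chosen arbitrarily for the demo):

```python
import random

random.seed(3)
lam = 2.0  # rate parameter; the derivation above says E[X] = 1 / lam = 0.5

n = 200_000
mean = sum(random.expovariate(lam) for _ in range(n)) / n
print(mean)  # ≈ 0.5
```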

Key Takeaways

  • Expected value E[X] is the theoretical long-run average of a random variable, aligning with the sample mean under repeated draws (formalized later by LLN).

  • Discrete expectation is a probability-weighted sum: E[X]=∑ₓ x·P(X=x). Continuous expectation is a probability-weighted integral: E[X]=∫ x f_X(x) dx.

  • Linearity is the main computational tool: E[X+Y]=E[X]+E[Y] and E[aX+b]=aE[X]+b, without needing independence.

  • Indicator variables convert probabilities into expectations: if I is 1 on event A and 0 otherwise, then E[I]=P(A).

  • Expectation generalizes to transformations: E[g(X)] = ∑ g(x)p(x) or ∫ g(x)f(x)dx (often without finding the distribution of g(X)).

  • E[X] may be infinite or undefined for heavy-tailed distributions; finiteness typically requires ∑ |x|p(x) < ∞ or ∫ |x|f(x)dx < ∞.

  • Expected value is foundational for variance, entropy, fair games, and optimizing expected loss in machine learning.

Common Mistakes

  • Thinking E[X] must be a value X can actually take (e.g., expecting a die to roll 3.5).

  • Forgetting that probabilities must sum/integrate to 1 before computing E[X], leading to incorrect weighted averages.

  • Assuming E[XY]=E[X]E[Y] without checking independence (linearity does not apply to products).

  • Ignoring existence: applying expectation formulas to heavy-tailed cases where the sum/integral diverges, producing misleading “answers.”

Practice

easy

A biased coin lands heads with probability p. Let X be the payout where you get $1 for heads and $0 for tails. Compute E[X].

Hint: List the two outcomes (1 and 0) and weight by their probabilities.

Solution

Outcomes: X=1 with probability p, and X=0 with probability 1−p.

$$\mathbb{E}[X] = 1\cdot p + 0\cdot (1-p) = p.$$

So the expected payout is p dollars.

medium

Let X be the result of a fair six-sided die. Define Y = 2X − 1. Compute E[Y] using linearity (do not re-sum from scratch).

Hint: First recall E[X] for a fair die, then apply E[aX+b]=aE[X]+b.

Solution

For a fair die, $\mathbb{E}[X]=3.5$. Using linearity:

$$\mathbb{E}[Y]=\mathbb{E}[2X-1]=2\mathbb{E}[X]-1=2(3.5)-1=7-1=6.$$
hard

Optional/advanced: Let X take values 1,2,3,… with probability P(X=k)=c/k^2 for k≥1. (a) Find c. (b) Does E[X] exist as a finite number?

Hint: Use that ∑_{k=1}^∞ 1/k^2 converges (to π²/6). For part (b), examine ∑ k·(c/k²).

Solution

(a) We need probabilities to sum to 1:

$$\sum_{k=1}^{\infty} \frac{c}{k^2} = c \sum_{k=1}^{\infty} \frac{1}{k^2} = 1.$$

Using $\sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6}$, we get

$$c\cdot \frac{\pi^2}{6}=1 \Rightarrow c=\frac{6}{\pi^2}.$$

(b) Compute expectation:

$$\mathbb{E}[X] = \sum_{k=1}^{\infty} k\cdot \frac{c}{k^2} = c\sum_{k=1}^{\infty} \frac{1}{k}.$$

But ∑_{k=1}^∞ 1/k diverges, so E[X] is infinite (does not exist as a finite number). In this case we say the expectation diverges to +∞.
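Partial sums make the divergence visible (a sketch; the cutoffs are arbitrary):

```python
import math

c = 6 / math.pi ** 2  # normalizing constant from part (a)

# c * H_N, where H_N is the N-th harmonic number: grows without bound (like log N)
for N in (10, 1_000, 1_000_000):
    print(N, c * sum(1 / k for k in range(1, N + 1)))
```

Contrast with the probabilities themselves, whose partial sums settle at 1: only the extra factor of k in the expectation sum tips it into divergence.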

Connections

Next nodes you can unlock and why they depend on expected value:

  • Variance: uses the mean μ = E[X] and defines spread via $\mathrm{Var}(X)=\mathbb{E}[(X-\mu)^2]$.
  • Entropy: can be written as an expectation, e.g. discrete $H(X)=\mathbb{E}[-\log p(X)]$.
  • Law of Large Numbers: formal statement that the sample mean approaches E[X] under conditions.
  • Game Theory Introduction: expected payoff evaluates mixed (randomized) strategies.
  • Stochastic Gradient Descent: relies on unbiased gradient estimates, an expectation identity $\mathbb{E}[g(\theta)]\approx \nabla R(\theta)$.