Long-run average of a random variable. E[X] = sum of x*P(x).
Self-serve tutorial - low prerequisites, straightforward concepts.
If you repeatedly play a lottery, flip a biased coin for points, or measure noisy sensor readings, you eventually want one number that summarizes what you “typically” get. Expected value is that number: the long-run average you should plan around, even though any single outcome can differ.
The expected value (expectation) E[X] of a random variable X is its probability-weighted average. Discrete: E[X] = ∑ₓ x·P(X=x). Continuous: E[X] = ∫ x f_X(x) dx. Expectation is linear (E[aX+b]=aE[X]+b) and generalizes to E[g(X)]. It may be infinite or undefined if the weighted sum/integral doesn’t converge.
A random variable X can take many values depending on chance. If you want to compare options, set a price, or plan around an uncertain outcome, you need a single summary number.
Expected value is the theoretical long-run average: if you repeatedly draw independent samples X₁, X₂, … from the same distribution, the sample mean

X̄ₙ = (X₁ + X₂ + ⋯ + Xₙ)/n

tends to get close to a fixed value. That fixed value (when it exists) is E[X]. (A later node formalizes this as the Law of Large Numbers.)
Suppose outcomes x happen with probabilities p(x). An ordinary average gives each outcome equal weight. Expected value gives outcome x the weight p(x).
So if big outcomes are rare, they still matter—but only proportionally to how often they occur.
If X is discrete with possible values x and probability mass function P(X = x), the expectation is

E[X] = ∑ₓ x·P(X = x).

You can read this as “sum over all outcomes: (value) × (chance of that value).”
If X is continuous with probability density function f_X(x), then

E[X] = ∫ x f_X(x) dx.

This is the same idea: infinitely many possible values, so the weighted average becomes an integral.
We’ll treat the definitions, linearity, and basic computations as core, and existence issues and heavy-tailed subtleties as optional/advanced. You can learn and apply expected value well with the core, then come back for the optional parts when you need them.
For a discrete X, you need two things:
1) the set of possible values {x}, and
2) the probability of each value p(x).
Then compute the weighted sum ∑ x p(x).
Let X be the value of a fair six-sided die. Then P(X=k)=1/6 for k∈{1,2,3,4,5,6}, so

E[X] = (1 + 2 + 3 + 4 + 5 + 6)·(1/6) = 21/6 = 3.5.

Notice 3.5 is not an outcome on the die—expected value is a planning number, not a predicted single roll.
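As a minimal sketch, the weighted-sum recipe can be written directly in Python (the variable names are illustrative):

```python
# Weighted sum for a fair six-sided die: E[X] = sum of value * probability.
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

expected_value = sum(x * p for x, p in zip(outcomes, probs))
print(expected_value)  # 3.5 (up to floating-point rounding)
```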
For continuous X, the density f_X(x) plays the role of “probability per unit x.” The integral

E[X] = ∫ x f_X(x) dx

is the continuous weighted average.
If X ∼ Uniform(0,1), then f_X(x)=1 for x∈[0,1] and 0 otherwise, so E[X] = ∫₀¹ x dx = 1/2.
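A small Monte Carlo check, using Python’s standard `random` module, illustrates the long-run-average reading of E[X] = 1/2 for Uniform(0,1):

```python
import random

# Sample means of Uniform(0,1) draws settle near E[X] = 1/2.
random.seed(0)  # fixed seed for reproducibility
n = 100_000
samples = [random.random() for _ in range(n)]
sample_mean = sum(samples) / n
print(round(sample_mean, 3))  # close to 0.5
```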
A useful physical analogy: imagine each value x has “mass” p(x) (discrete) or density f(x)dx (continuous). The expected value is the balance point.
1) Range check: If X is always between a and b, then E[X] must lie between a and b.
2) Symmetry: If a distribution is symmetric around 0 and E[X] exists, then E[X]=0.
3) Weights sum to 1: For discrete, verify ∑ p(x)=1; for continuous, ∫ f(x)dx=1.
Computing expectation is usually straightforward bookkeeping—until you meet distributions with extremely large values or tails. That’s where the next sections add nuance.
Many random variables are built from simpler pieces:
Computing the full distribution of a sum can be hard. Expected value often stays easy because expectation is linear.
For random variables X and Y (no independence required):

E[X + Y] = E[X] + E[Y].

For constants a, b:

E[aX + b] = aE[X] + b.

More generally, for any finite sum:

E[a₁X₁ + ⋯ + aₙXₙ] = a₁E[X₁] + ⋯ + aₙE[Xₙ].
Assume (X,Y) are discrete.
Start with the definition:

E[X+Y] = ∑ₓ∑ᵧ (x + y)·P(X=x, Y=y).

Split the sum:

E[X+Y] = ∑ₓ∑ᵧ x·P(X=x, Y=y) + ∑ₓ∑ᵧ y·P(X=x, Y=y).

Now notice:

∑ᵧ P(X=x, Y=y) = P(X=x),

so

∑ₓ∑ᵧ x·P(X=x, Y=y) = ∑ₓ x·P(X=x) = E[X].

Similarly the second term becomes E[Y]. Therefore E[X+Y]=E[X]+E[Y].
Define an indicator random variable I for an event A: I = 1 if A occurs, and I = 0 otherwise. Then

E[I] = 1·P(A) + 0·(1 − P(A)) = P(A).
This turns probability questions into expectation questions.
Example: Let X be the number of heads in n coin flips (not necessarily fair). Let Iᵢ indicate “flip i is heads.” Then X = ∑ᵢ Iᵢ, so

E[X] = ∑ᵢ E[Iᵢ] = ∑ᵢ P(flip i is heads).

If the coin has P(heads)=p each time, then E[X]=np.
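A quick simulation sketch (the names and the choice of n and p are illustrative) compares the linearity answer np with the average count over many simulated runs:

```python
import random

# Check E[heads] = n*p: linearity gives the answer instantly, simulation agrees.
random.seed(1)
n, p = 20, 0.3
expected_heads = n * p  # sum of n indicator expectations, each equal to p

trials = 50_000
avg_heads = sum(
    sum(random.random() < p for _ in range(n)) for _ in range(trials)
) / trials
print(expected_heads, round(avg_heads, 2))
```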
A classic confusion is to assume expectation distributes over products: E[XY] = E[X]·E[Y].
That equality holds under independence (and some integrability conditions), but linearity alone doesn’t give it.
Expected value is not just a number attached to X—it’s an operator that maps a random variable to a number.
Often we care about a transformed quantity g(X):

E[g(X)] = ∑ₓ g(x)·P(X=x) (discrete), or E[g(X)] = ∫ g(x) f_X(x) dx (continuous).

This is the same weighted-average idea, just applied after transforming the outcomes.
A subtle but powerful point: to compute E[g(X)], you usually do not need the distribution of Y=g(X). You can compute directly from X’s distribution using the formulas above.
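For example, E[X²] for a fair die can be computed directly from X’s distribution, without ever deriving the distribution of X²; a minimal sketch:

```python
# E[g(X)] computed directly from X's distribution (no distribution of g(X) needed).
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

def g(x):
    return x * x  # the transformation of interest

e_g = sum(g(x) * p for x, p in zip(outcomes, probs))
print(e_g)  # 91/6 ≈ 15.1667
```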
So far, we’ve treated expectation as always producing a finite number. But expectation is only well-defined when the weighted sum/integral converges.
A practical sufficient condition is that the absolute expectation is finite: ∑ₓ |x|·P(X=x) < ∞ (discrete) or ∫ |x| f_X(x) dx < ∞ (continuous).
If these diverge, several things can happen:
| Situation | What it means | Typical phrase |
|---|---|---|
| E[X] is a finite real number | weighted average converges | “expectation exists” |
| E[X] = +∞ or −∞ | one-sided integral/sum diverges | “infinite expectation” |
| undefined | positive and negative parts both diverge | “does not exist” |
A heavy-tailed distribution puts enough probability on huge values that the “average” never settles.
A famous example is the Cauchy distribution: it’s symmetric, but its tails are so heavy that E[X] is undefined (the integral does not converge in the required sense). That’s why sample means of Cauchy draws behave wildly even for large n.
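A quick simulation sketch, using the standard inverse-CDF construction X = tan(π(U − ½)) for Cauchy draws (an assumption of this sketch, not stated in the text), shows running sample means failing to settle:

```python
import math
import random

# Cauchy samples via X = tan(pi*(U - 1/2)); the sample mean never settles
# because E[X] is undefined.
random.seed(42)
n = 100_000
samples = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]

running_means = []
total = 0.0
for i, x in enumerate(samples, start=1):
    total += x
    if i in (1_000, 10_000, 100_000):
        running_means.append(total / i)

print(running_means)  # typically very different values, unlike a finite-mean case
```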
If you’re learning expectation for the first time, don’t let this scare you: most standard distributions used early (Bernoulli, Binomial, Uniform, Normal, Exponential) have finite expectations. But it’s important to know that “average” is not guaranteed by definition—it’s a property that may or may not hold.
A gamble is often called “fair” if its expected payoff is 0 (or if the price equals expected payout).
If you pay cost c to play and receive random payout X, then the net payoff is X − c. By linearity:

E[X − c] = E[X] − c.

A fair price is c = E[X] (ignoring risk preferences).
In supervised learning, we often minimize expected risk:

R(θ) = E₍ₓ,ᵧ₎∼𝒟 [ loss(f_θ(x), y) ].

We don’t know the true distribution 𝒟, so we approximate R(θ) with the empirical average over data:

R̂(θ) = (1/n) ∑ᵢ loss(f_θ(xᵢ), yᵢ).
The idea “sample average ≈ expected value” is exactly the intuition behind expected value and what the Law of Large Numbers formalizes.
Stochastic Gradient Descent uses a random mini-batch to estimate the gradient of expected loss.
If g(θ) is the gradient computed from a random sample, SGD relies on it being (approximately) unbiased:

E[g(θ)] = ∇R(θ).

This is an expectation statement: on average, the noisy gradient points in the true direction.
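As a toy illustration (not any particular ML library), the sketch below checks that a mini-batch gradient of a simple 1-D squared loss is unbiased for the full-data gradient; the dataset, loss, and batch size are all assumptions of this example:

```python
import random

# Toy check: the average of many mini-batch gradients matches the full-data
# gradient of the loss mean((theta - d)^2 / 2), whose gradient is theta - mean(d).
random.seed(0)
data = [random.gauss(0, 1) for _ in range(1_000)]
theta = 2.0

full_grad = theta - sum(data) / len(data)

def minibatch_grad(batch_size=32):
    batch = random.sample(data, batch_size)  # random mini-batch
    return theta - sum(batch) / batch_size

estimates = [minibatch_grad() for _ in range(20_000)]
avg_estimate = sum(estimates) / len(estimates)
print(round(full_grad, 3), round(avg_estimate, 3))  # nearly equal
```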
Expected value is the first “global” summary of a distribution you should reach for: it’s simple, composable via linearity, and it’s the backbone of many later definitions.
A game pays $10 with probability 0.2, $2 with probability 0.5, and $0 with probability 0.3. Let X be the payout in dollars. Find E[X] and a fair entry price c (ignoring risk).
List outcomes and probabilities: X = 10 with probability 0.2, X = 2 with probability 0.5, X = 0 with probability 0.3.
Apply the discrete expectation formula:

E[X] = 10(0.2) + 2(0.5) + 0(0.3).

Compute:
10(0.2) = 2
2(0.5) = 1
0(0.3) = 0
So

E[X] = 2 + 1 + 0 = 3.

A fair entry price c makes the expected net payoff zero. Net payoff = X − c, so by linearity E[X − c] = E[X] − c = 3 − c = 0, giving c = 3.
Insight: Expected value treats each payout as contributing “value × frequency.” Even though the $10 payout is rare, it still contributes 10 × 0.2 = $2 to the average.
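The arithmetic above can be checked with a short sketch, including the weights-sum-to-1 sanity check from earlier:

```python
# Worked example: payouts 10, 2, 0 with probabilities 0.2, 0.5, 0.3.
payouts = [10, 2, 0]
probs = [0.2, 0.5, 0.3]

assert abs(sum(probs) - 1.0) < 1e-12  # sanity check: weights sum to 1
e_x = sum(x * p for x, p in zip(payouts, probs))
fair_price = e_x  # the fair entry price makes E[X - c] = 0
print(e_x, fair_price)  # 3.0 3.0
```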
You run 5 independent trials. Trial i succeeds with probability pᵢ (not necessarily the same across trials). Let X be the total number of successes. Compute E[X].
Define indicator variables:
Let Iᵢ = 1 if trial i succeeds, else 0.
Then the total number of successes is

X = I₁ + I₂ + I₃ + I₄ + I₅.

Compute each indicator’s expectation:

E[Iᵢ] = 1·pᵢ + 0·(1 − pᵢ) = pᵢ.

Apply linearity of expectation (no extra assumptions needed beyond finiteness):

E[X] = E[I₁] + ⋯ + E[I₅] = p₁ + p₂ + p₃ + p₄ + p₅.
Insight: Linearity lets you avoid computing the distribution of X. Even if the trials have different probabilities, the expected total is just the sum of the individual success probabilities.
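A short sketch (with made-up probabilities p₁,…,p₅) confirms the linearity answer against simulation:

```python
import random

# Linearity: E[total successes] = p1 + ... + p5, even with unequal probabilities.
random.seed(7)
ps = [0.1, 0.3, 0.5, 0.7, 0.9]  # illustrative trial probabilities
expected_total = sum(ps)  # = 2.5

trials = 50_000
avg_total = sum(
    sum(random.random() < p for p in ps) for _ in range(trials)
) / trials
print(expected_total, round(avg_total, 2))
```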
Let X have an exponential distribution with rate λ>0, meaning f_X(x)=λe^{−λx} for x≥0 and 0 otherwise. Compute E[X].
Start from the definition:

E[X] = ∫₀^∞ x·λe^{−λx} dx.

Compute the integral using integration by parts.
Let u = x, so du = dx.
Let dv = λe^{−λx} dx, so v = −e^{−λx}.
Then

E[X] = [−x e^{−λx}]₀^∞ + ∫₀^∞ e^{−λx} dx.

Evaluate the boundary term:
As x→∞, x e^{−λx} → 0, so x(−e^{−λx}) → 0.
At x=0, 0·(−e^{0}) = 0.
So the boundary term vanishes.
Compute the remaining integral:

∫₀^∞ e^{−λx} dx = 1/λ.

Therefore E[X] = 1/λ.
Insight: For continuous variables, expectation is still a weighted average—just spread across a continuum. The exponential distribution’s mean 1/λ matches the intuition: higher rate λ means shorter expected waiting time.
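A Monte Carlo check, using Python’s `random.expovariate` (which takes the rate λ as its argument), shows sample means approaching 1/λ:

```python
import random

# Sample means of Exponential(rate=lam) draws approach E[X] = 1/lam.
random.seed(3)
lam = 2.0
n = 200_000
sample_mean = sum(random.expovariate(lam) for _ in range(n)) / n
print(round(sample_mean, 3))  # close to 1/lam = 0.5
```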
Expected value E[X] is the theoretical long-run average of a random variable, aligning with the sample mean under repeated draws (formalized later by LLN).
Discrete expectation is a probability-weighted sum: E[X]=∑ₓ x·P(X=x). Continuous expectation is a probability-weighted integral: E[X]=∫ x f_X(x) dx.
Linearity is the main computational tool: E[X+Y]=E[X]+E[Y] and E[aX+b]=aE[X]+b, without needing independence.
Indicator variables convert probabilities into expectations: if I is 1 on event A and 0 otherwise, then E[I]=P(A).
Expectation generalizes to transformations: E[g(X)] = ∑ g(x)p(x) or ∫ g(x)f(x)dx (often without finding the distribution of g(X)).
E[X] may be infinite or undefined for heavy-tailed distributions; finiteness typically requires ∑ |x|p(x) < ∞ or ∫ |x|f(x)dx < ∞.
Expected value is foundational for variance, entropy, fair games, and optimizing expected loss in machine learning.
Thinking E[X] must be a value X can actually take (e.g., expecting a die to roll 3.5).
Forgetting that probabilities must sum/integrate to 1 before computing E[X], leading to incorrect weighted averages.
Assuming E[XY]=E[X]E[Y] without checking independence (linearity does not apply to products).
Ignoring existence: applying expectation formulas to heavy-tailed cases where the sum/integral diverges, producing misleading “answers.”
A biased coin lands heads with probability p. Let X be the payout where you get $1 for heads and $0 for tails. Compute E[X].
Hint: List the two outcomes (1 and 0) and weight by their probabilities.
Outcomes: X=1 with probability p, and X=0 with probability 1−p. So

E[X] = 1·p + 0·(1−p) = p.

The expected payout is p dollars.
Let X be the result of a fair six-sided die. Define Y = 2X − 1. Compute E[Y] using linearity (do not re-sum from scratch).
Hint: First recall E[X] for a fair die, then apply E[aX+b]=aE[X]+b.
For a fair die, E[X] = 3.5. Using linearity:

E[Y] = 2E[X] − 1 = 2(3.5) − 1 = 6.
Optional/advanced: Let X take values 1,2,3,… with probability P(X=k)=c/k^2 for k≥1. (a) Find c. (b) Does E[X] exist as a finite number?
Hint: Use that ∑_{k=1}^∞ 1/k^2 converges (to π²/6). For part (b), examine ∑ k·(c/k²).
(a) We need probabilities to sum to 1:

∑_{k=1}^∞ c/k² = c·∑_{k=1}^∞ 1/k² = 1.

Using ∑_{k=1}^∞ 1/k² = π²/6, we get c = 6/π².
(b) Compute expectation:

E[X] = ∑_{k=1}^∞ k·(c/k²) = c·∑_{k=1}^∞ 1/k.
But ∑_{k=1}^∞ 1/k diverges, so E[X] is infinite (does not exist as a finite number). In this case we say the expectation diverges to +∞.
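A numerical sketch shows the partial sums c·∑ 1/k growing without bound at successive checkpoints:

```python
import math

# Partial sums of E[X] = sum k*(c/k^2) = c * sum 1/k keep growing (harmonic series).
c = 6 / math.pi**2
partials = []
total = 0.0
for k in range(1, 1_000_001):
    total += k * (c / k**2)  # each term equals c/k
    if k in (10, 1_000, 1_000_000):
        partials.append(round(total, 3))
print(partials)  # each checkpoint is noticeably larger; the sum never converges
```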
Next nodes you can unlock and why they depend on expected value: