Bernoulli, binomial, Poisson, uniform, normal distributions.
Deep-dive lesson: an accessible entry point, but dense material. Use worked examples and spaced repetition.
Most of probability and statistics is built on a surprisingly small “toolbox” of distributions. Learn a handful well, and you can model coin flips, counts of arrivals, measurement noise, and uncertainty around unknown quantities—with clean formulas for probabilities, means, and variances.
A distribution is a parameterized family p(x | θ) describing how a random variable X behaves. Discrete distributions use PMFs and sum to get probabilities; continuous distributions use PDFs and integrate. This lesson covers Bernoulli, binomial, Poisson (discrete) and uniform, normal (continuous), including when to use each and their key formulas.
In real problems, you rarely invent a probability model from scratch. Instead, you pick a distribution family—a reusable pattern—and tune a few parameters θ to match the situation.
Examples: coin flips and conversions (Bernoulli/binomial), counts of arrivals per minute (Poisson), measurement noise around a true value (normal).
Each family gives you: a formula for probabilities (a PMF or PDF), closed-form expressions for the mean and variance, and well-studied behavior you can rely on.
The first decision is the type of values X can take.
A common confusion: for continuous X, f(x) is not itself a probability. It’s a density. You must integrate over an interval.
A distribution is often written as p(x | θ) (or f(x | θ)). The parameter set θ chooses one specific member of the family.
Examples of θ: p for Bernoulli(p), (n, p) for Binomial(n, p), λ for Poisson(λ), (a, b) for Uniform(a, b), and (μ, σ²) for Normal(μ, σ²).
These parameters control “location” (where the mass sits) and “scale/spread” (how variable it is).
| Distribution | Support | Type | Parameters θ | Typical meaning |
|---|---|---|---|---|
| Bernoulli(p) | {0, 1} | Discrete (PMF) | p | One trial: success/failure |
| Binomial(n, p) | {0, …, n} | Discrete (PMF) | n, p | # successes in n independent trials |
| Poisson(λ) | {0, 1, 2, …} | Discrete (PMF) | λ | # events in a fixed window |
| Uniform(a, b) | [a, b] | Continuous (PDF) | a, b | “Equally likely” in an interval |
| Normal(μ, σ²) | ℝ | Continuous (PDF) | μ, σ² | Measurement noise / sums / averages |
In the next sections, we’ll build each one slowly: story → support → formula → mean/variance → how to compute probabilities.
Many systems naturally produce counts: conversions among website visitors, requests hitting a server, defective items in a batch.
Discrete distributions let you assign probability to each integer outcome and then sum the relevant probabilities.
One trial. Two outcomes.
Support: X ∈ {0, 1}
PMF: P(X = 1) = p and P(X = 0) = 1 − p.
A compact way to write the PMF is P(X = x) = pˣ(1 − p)^(1−x) for x ∈ {0, 1}.
You’ll use these constantly: E[X] = p and Var(X) = p(1 − p).
Why E[X] = p? Because E[X] = 1·p + 0·(1 − p) = p. And since X² = X on {0, 1}, E[X²] = p, so Var(X) = E[X²] − (E[X])² = p − p² = p(1 − p).
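These identities can be checked mechanically from the PMF. A minimal sketch in Python using exact rational arithmetic (the value p = 7/10 is just an illustration):

```python
from fractions import Fraction

# Bernoulli(p): PMF over the support {0, 1}
p = Fraction(7, 10)
pmf = {0: 1 - p, 1: p}

# E[X] = sum of x * P(X = x) over the support
mean = sum(x * prob for x, prob in pmf.items())

# Var(X) = E[X^2] - (E[X])^2
second_moment = sum(x**2 * prob for x, prob in pmf.items())
variance = second_moment - mean**2

print(mean)      # → 7/10, i.e. p
print(variance)  # → 21/100, i.e. p(1 - p)
```

Using Fraction instead of floats keeps the check exact, so the results match the algebra to the last digit.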
Repeat a Bernoulli trial n times, independently, with the same success probability p. Let X be the number of successes.
Examples: the number of heads in n coin flips, the number of conversions among n visitors, the number of defective items in a batch of n.
Support: X ∈ {0, 1, …, n}
PMF: P(X = k) = (n choose k) pᵏ (1 − p)^(n−k) for k = 0, 1, …, n,
where (n choose k) = n! / (k!(n−k)!) counts the ways to choose which k of the n trials are successes.
Mean and variance: E[X] = np and Var(X) = np(1 − p).
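The PMF is straightforward to implement with Python's math.comb. The sketch below (n = 5 and p = 0.2 are illustrative values) also checks two sanity properties: the probabilities sum to 1, and the mean equals np.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 0.2  # illustrative values
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))

print(round(total, 10))  # → 1.0 (a valid PMF sums to 1)
print(round(mean, 10))   # → 1.0 (equals n*p here)
```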
A useful connection (preview of the CLT): if n is large, a binomial can often be approximated by a normal: X ≈ Normal(np, np(1 − p)).
Poisson models counts of events in a fixed window when events occur independently, at a roughly constant average rate, and only one at a time.
Examples: requests hitting a server in a minute, typos on a page, arrivals at a queue in an hour.
Support: X ∈ {0, 1, 2, …}
PMF: P(X = k) = e^(−λ) λᵏ / k! for k = 0, 1, 2, …
Here λ is both the rate parameter and the mean count per window; in fact E[X] = Var(X) = λ.
That “mean = variance” fact is a diagnostic: if your observed counts have variance much bigger than the mean, a plain Poisson may be too simple.
A classic approximation: if n is large and p is small but np stays moderate, then Binomial(n, p) ≈ Poisson(λ) with λ = np.
This is the “rare events” regime.
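A quick numerical look at this regime, with illustrative values n = 1000 and p = 0.003 (so np = 3):

```python
from math import comb, exp, factorial

def binom_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k: int, lam: float) -> float:
    return exp(-lam) * lam**k / factorial(k)

n, p = 1000, 0.003  # large n, small p; np = 3 stays moderate
for k in range(6):
    b = binom_pmf(k, n, p)
    q = poisson_pmf(k, n * p)
    print(k, round(b, 5), round(q, 5))  # the two columns nearly agree
```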
If X is discrete, you compute probabilities by summing the PMF: P(X ∈ A) = ∑ P(X = x), where the sum runs over the values x in A.
Examples: P(X ≤ 2) = P(X=0) + P(X=1) + P(X=2), and P(X ≥ 1) = 1 − P(X=0).
In practice, you often use a CDF function from software for these sums, but it’s crucial to understand what is being summed and why.
Many measurements are not naturally integer counts: heights, temperatures, response times, weights.
Even if the world is measured with finite precision, continuous distributions are often excellent approximations and give smooth, usable mathematics.
The key mental shift: for continuous X, any single point has probability zero; probability lives on intervals, and you get it by integrating the density.
“All values between a and b are equally plausible.”
Examples: an arrival time within a known one-hour window, rounding error spread evenly over [−0.5, 0.5], a random point on a segment.
Support: X ∈ [a, b]
PDF: f(x) = 1/(b − a) for a ≤ x ≤ b, and f(x) = 0 otherwise.
Mean and variance: E[X] = (a + b)/2 and Var(X) = (b − a)²/12.
For any c, d with a ≤ c ≤ d ≤ b:
P(c ≤ X ≤ d)
= ∫ᶜᵈ 1/(b − a) dx
= (d − c)/(b − a)
So probability is proportional to interval length.
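Because the density is constant, the interval probability is just relative length. The sketch below double-checks the closed form against a crude Riemann sum (a = 0, b = 4, c = 1, d = 2 are illustrative values):

```python
a, b = 0.0, 4.0    # Uniform(a, b), illustrative endpoints
c, d = 1.0, 2.0    # query interval inside [a, b]

exact = (d - c) / (b - a)  # closed form: relative length

# Crude Riemann sum of the constant density 1/(b - a) over [c, d]
steps = 100_000
dx = (d - c) / steps
approx = sum(1.0 / (b - a) * dx for _ in range(steps))

print(exact)             # → 0.25
print(round(approx, 6))  # → 0.25
```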
The normal (Gaussian) distribution models values that cluster around an average μ with symmetric noise.
It also appears because of aggregation: by the Central Limit Theorem, sums and averages of many small independent effects tend toward a normal shape.
Examples: instrument noise built from many tiny disturbances, heights in a population, averages of repeated measurements.
Support: X ∈ ℝ
PDF: f(x) = (1/(σ√(2π))) e^(−(x−μ)² / (2σ²)).
Parameters: μ is the mean (the center of the bell) and σ² is the variance (σ, the standard deviation, sets the spread).
A core technique is converting to the standard normal.
If X ∼ Normal(μ, σ²), define Z = (X − μ)/σ.
Then Z ∼ Normal(0, 1).
This lets you use standard normal tables or software CDFs: P(X ≤ x) = Φ((x − μ)/σ).
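Python's statistics.NormalDist makes the standardization step concrete; μ = 5, σ = 2, and x = 8 are illustrative values:

```python
from statistics import NormalDist

mu, sigma = 5.0, 2.0      # illustrative parameters
x = 8.0

X = NormalDist(mu, sigma)
Z = NormalDist(0.0, 1.0)  # standard normal

z = (x - mu) / sigma      # standardize: z = 1.5
# P(X <= x) computed directly and via the standard normal agree:
print(round(X.cdf(x), 6), round(Z.cdf(z), 6))
```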
For X ∼ Normal(μ, σ²), about 68% of the probability lies within μ ± σ, about 95% within μ ± 2σ, and about 99.7% within μ ± 3σ.
This is not a definition, just a helpful memory aid (the 68–95–99.7 rule).
If X is continuous with PDF f, probabilities come from integrals.
For an interval [c, d]: P(c ≤ X ≤ d) = ∫ᶜᵈ f(x) dx.
For the uniform, this integral is easy.
For the normal, there is no elementary antiderivative, so we use the CDF Φ(z) numerically.
A practical comparison:
| Task | Discrete | Continuous |
|---|---|---|
| Probability at a point | P(X=x) can be > 0 | P(X=x)=0 |
| Probability over a range | sum PMF values | integrate PDF |
| Typical tool | ∑ and CDF tables | ∫ and CDF Φ |
Understanding this split (support → PMF/PDF → sum/integrate) prevents many downstream mistakes in statistics and ML.
A distribution is a compact set of assumptions. Choosing one is not just “picking a formula”—it’s deciding what outcomes are possible and what patterns are likely.
A good first pass is to match the data type and generative story.
| If your variable X is… | And the story is… | Start with… |
|---|---|---|
| 0/1 outcome | one trial with success prob p | Bernoulli(p) |
| integer 0…n | n independent trials, constant p | Binomial(n, p) |
| nonnegative count | events in a window at average rate λ | Poisson(λ) |
| real in [a, b] | equally likely in an interval | Uniform(a, b) |
| real-valued | symmetric noise around μ | Normal(μ, σ²) |
Then sanity-check with the family’s built-in constraints: does the support cover exactly the possible values, is the implied mean/variance relationship plausible (e.g., Poisson forces mean = variance), and do the independence assumptions hold?
In MLE, you assume data x₁, …, xₙ came from a distribution family p(x | θ) and pick θ that makes the observed data most likely.
Examples you’ll soon see: for Bernoulli data the MLE of p is the sample proportion of successes; for Poisson counts the MLE of λ is the sample mean; for normal data the MLE of μ is the sample mean.
To do MLE well, you must recognize which likelihood matches your data (Bernoulli vs binomial vs Poisson, etc.).
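As a preview of the Poisson case: the MLE for λ is the sample mean, which you can confirm numerically by comparing the log-likelihood at the closed-form estimate against nearby candidates. The data values below are hypothetical:

```python
from math import log, factorial

def poisson_loglik(lam: float, data: list) -> float:
    """Log-likelihood of observed counts under Poisson(lam)."""
    return sum(-lam + k * log(lam) - log(factorial(k)) for k in data)

data = [2, 4, 3, 1, 5, 3]        # hypothetical observed counts
lam_hat = sum(data) / len(data)  # closed-form MLE: the sample mean

# The sample-mean estimate scores higher than nearby candidates.
for lam in (lam_hat - 0.5, lam_hat, lam_hat + 0.5):
    print(round(lam, 2), round(poisson_loglik(lam, data), 4))
```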
Bayesian inference updates distributions with data: posterior ∝ likelihood × prior.
The likelihood often comes from a “common distribution.” Examples: a Bernoulli or binomial likelihood paired with a Beta prior, or a Poisson likelihood paired with a Gamma prior.
Knowing the likelihood family is step 1.
A distribution is rarely “true.” It’s a simplified story that is useful if its support matches the possible outcomes, its shape matches the observed patterns, and the predictions it supports hold up in practice.
As you learn more, you’ll add richer families, but these five are the workhorses you’ll keep returning to.
A website A/B test shows a conversion on a visit with probability p = 0.2, assumed constant across visitors. You observe n = 5 independent visitors. Let X be the number of conversions. Compute P(X ≥ 2).
Identify the distribution:
X counts successes in n independent Bernoulli trials ⇒ X ∼ Binomial(n, p) with n=5, p=0.2.
Use the complement to reduce work:
P(X ≥ 2) = 1 − P(X ≤ 1)
= 1 − (P(X=0) + P(X=1)).
Compute P(X=0):
P(X=0) = (5 choose 0) (0.2)⁰ (0.8)⁵
= 1 · 1 · 0.8⁵
= 0.32768.
Compute P(X=1):
P(X=1) = (5 choose 1) (0.2)¹ (0.8)⁴
= 5 · 0.2 · 0.4096
= 0.4096.
Combine:
P(X ≤ 1) = 0.32768 + 0.4096 = 0.73728
P(X ≥ 2) = 1 − 0.73728 = 0.26272.
Insight: For discrete distributions, complements often avoid long sums. Here, summing k=2,3,4,5 is more work than subtracting k=0,1 from 1.
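The worked example is easy to verify in code, including the longer direct sum over k = 2…5:

```python
from math import comb

def pmf(k: int, n: int = 5, p: float = 0.2) -> float:
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_le_1 = pmf(0) + pmf(1)                   # P(X <= 1)
p_ge_2 = 1 - p_le_1                        # complement trick
direct = sum(pmf(k) for k in range(2, 6))  # direct sum, same answer

print(round(p_ge_2, 5))  # → 0.26272
print(round(direct, 5))  # → 0.26272
```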
A server receives requests at an average rate of λ = 3 requests per minute. Model the number of requests in a minute as X ∼ Poisson(3). Compute P(X ≤ 1).
Write the PMF:
P(X=k) = e^(−λ) λᵏ / k! with λ = 3.
Compute P(X=0):
P(X=0) = e^(−3) 3⁰ / 0!
= e^(−3).
Compute P(X=1):
P(X=1) = e^(−3) 3¹ / 1!
= 3e^(−3).
Sum:
P(X ≤ 1) = P(X=0) + P(X=1)
= e^(−3) + 3e^(−3)
= 4e^(−3)
≈ 4 · 0.049787
≈ 0.19915.
Insight: Poisson computations are often a few terms plus a small exponential factor. Also note how λ directly sets the typical count: with λ=3, getting ≤1 is fairly unlikely (~0.20).
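A two-line check of this Poisson calculation:

```python
from math import exp

lam = 3.0
p0 = exp(-lam)        # P(X = 0) = e^(-3)
p1 = lam * exp(-lam)  # P(X = 1) = 3e^(-3)

print(round(p0 + p1, 5))  # → 0.19915, i.e. 4e^(-3)
```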
Let X ∼ Uniform(10, 18). Compute P(12 ≤ X ≤ 15) and the mean and variance.
Write the PDF:
f(x) = 1/(18−10) = 1/8 for 10 ≤ x ≤ 18.
Compute the probability by integrating:
P(12 ≤ X ≤ 15) = ∫¹²¹⁵ (1/8) dx
= (1/8)(15−12)
= 3/8
= 0.375.
Compute the mean:
E[X] = (a+b)/2 = (10+18)/2 = 14.
Compute the variance:
Var(X) = (b−a)²/12
= (8)²/12
= 64/12
= 16/3
≈ 5.333.
Insight: For a uniform distribution, probabilities are purely about lengths of intervals—no calculus tricks required beyond “constant × width.”
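All three quantities from this example reduce to one-liners:

```python
a, b = 10.0, 18.0           # Uniform(10, 18)

prob = (15 - 12) / (b - a)  # P(12 <= X <= 15) = 3/8
mean = (a + b) / 2          # E[X] = (a + b)/2
var = (b - a) ** 2 / 12     # Var(X) = (b - a)^2/12

print(prob, mean, round(var, 3))  # → 0.375 14.0 5.333
```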
Support matters first: discrete X assigns probability to points (PMF), continuous X assigns density and uses integrals (PDF).
Bernoulli(p) models one 0/1 trial with E[X]=p and Var(X)=p(1−p).
Binomial(n, p) models a count of successes in n independent trials: P(X=k)=(n choose k)pᵏ(1−p)^(n−k), with E[X]=np and Var(X)=np(1−p).
Poisson(λ) models counts in a fixed window with PMF e^(−λ)λᵏ/k!, and it has E[X]=Var(X)=λ.
Uniform(a, b) has constant density 1/(b−a) on [a,b], with E[X]=(a+b)/2 and Var(X)=(b−a)²/12.
Normal(μ, σ²) models symmetric noise; standardization Z=(X−μ)/σ converts to Normal(0,1) for probability calculations.
Most probability queries reduce to either a sum (discrete) or an integral (continuous), often aided by complements and CDFs.
Treating a PDF value f(x) as a probability (for continuous X, only integrals over intervals are probabilities).
Using a binomial model when trials are not independent or when p changes from trial to trial (then Binomial(n,p) is not appropriate).
Confusing Poisson(λ) with Binomial(n,p): Poisson is unbounded (0,1,2,…) and is about rates per window, not a fixed number of trials.
Mixing up σ and σ² in the normal distribution, or forgetting to divide by σ when computing a Z-score.
Let X ∼ Bernoulli(p) with p = 0.7. Compute E[X], Var(X), and P(X=0).
Hint: Use E[X]=p and Var(X)=p(1−p). Also P(X=0)=1−p.
E[X]=0.7.
Var(X)=0.7(1−0.7)=0.7·0.3=0.21.
P(X=0)=1−0.7=0.3.
A factory produces items with defect probability p = 0.05 independently. In a batch of n = 20 items, let X be the number of defects. Compute P(X=0) and P(X≥1).
Hint: Use X ∼ Binomial(20, 0.05). P(X≥1)=1−P(X=0).
X ∼ Binomial(20,0.05).
P(X=0) = (20 choose 0)(0.05)⁰(0.95)²⁰ = 0.95²⁰ ≈ 0.3585.
P(X≥1)=1−0.95²⁰ ≈ 1−0.3585 = 0.6415.
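A quick check of the exercise's arithmetic:

```python
p_zero = 0.95 ** 20  # P(X = 0): no defects in 20 independent items
p_at_least_one = 1 - p_zero

print(round(p_zero, 4))          # → 0.3585
print(round(p_at_least_one, 4))  # → 0.6415
```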
Let X ∼ Normal(μ, σ²) with μ = 100 and σ = 15. Compute P(X ≤ 130) in terms of the standard normal CDF Φ, and give a numerical approximation.
Hint: Convert to Z = (X−μ)/σ. Then P(X≤x)=Φ((x−μ)/σ). Use Φ(2)≈0.9772.
Z = (X−100)/15 so Z ∼ Normal(0,1).
P(X ≤ 130) = P(Z ≤ (130−100)/15) = Φ(30/15) = Φ(2).
Numerically, Φ(2) ≈ 0.9772.
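The same answer falls out of Python's statistics.NormalDist, either directly or after standardizing:

```python
from statistics import NormalDist

X = NormalDist(100, 15)
Z = NormalDist()  # standard normal, Normal(0, 1)

print(round(X.cdf(130), 4))  # → 0.9772
print(round(Z.cdf(2.0), 4))  # → 0.9772, i.e. Φ(2)
```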
Next nodes you can tackle: