Distributions over multiple random variables. Marginal and conditional.
Multi-session curriculum: assumes substantial prior knowledge and complex material; use mastery gates and deliberate practice.
Most real-world uncertainty is not about one variable at a time. Weather affects demand; demand affects price; price affects sales. Joint distributions are the language that lets you assign probabilities to these tuples of outcomes—and then extract the pieces you care about via marginals and conditionals.
A joint distribution p_{X,Y}(x,y) assigns probability (discrete) or density (continuous) to pairs (x,y). Marginals come from summing/integrating out the other variable, e.g. p_X(x) = ∑_y p_{X,Y}(x,y) or f_X(x) = ∫ f_{X,Y}(x,y) dy. Conditionals come from “renormalizing a slice,” e.g. p_{X|Y}(x|y) = p_{X,Y}(x,y)/p_Y(y) (when p_Y(y) > 0). Independence is the special case p_{X,Y}(x,y) = p_X(x)p_Y(y).
Single-variable distributions (Bernoulli, Poisson, Normal, …) answer questions like “How likely is X = 3?” But many systems involve multiple random variables simultaneously.
Examples like the ones above (weather and demand, demand and price) show that to model relationships, we need a distribution over tuples of outcomes.
Let X and Y be discrete random variables. The joint pmf is
p_{X,Y}(x,y) = P(X = x, Y = y).
It must satisfy:
p_{X,Y}(x,y) ≥ 0 for all (x,y)
∑_x ∑_y p_{X,Y}(x,y) = 1
Think of p_{X,Y} as a table: rows are x values, columns are y values, and each cell is the probability of that pair.
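As a minimal sketch of this table view, here is a joint pmf stored as a Python dict keyed by (x, y) pairs; the numbers match the worked example later in this lesson:

```python
# A joint pmf as a lookup table: each key is a pair (x, y), each value is
# P(X = x, Y = y). These numbers match the worked example in this lesson.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.30,
}

# A valid joint pmf is nonnegative and sums to 1 over all pairs.
assert all(p >= 0 for p in joint.values())
assert abs(sum(joint.values()) - 1.0) < 1e-12

print(joint[(1, 2)])  # probability of the single pair (X=1, Y=2): prints 0.3
```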
If X and Y are continuous, we use a joint pdf f_{X,Y}(x,y) such that probability is obtained by integrating over regions:
P((X,Y) ∈ A) = ∬_A f_{X,Y}(x,y) dx dy.
Requirements:
f_{X,Y}(x,y) ≥ 0 for all (x,y)
∬ f_{X,Y}(x,y) dx dy = 1 over the whole plane
Crucial subtlety: For continuous variables, f_{X,Y}(x,y) is a density, not a probability. So P(X = x, Y = y) = 0 for exact points, even though f_{X,Y}(x,y) can be positive.
A joint distribution often only assigns nonzero probability/density on a particular set:
For example, if X is uniform on [0,1] and Y = 1 − X, then (X,Y) lies only on the line y = 1 − x. This is a joint distribution, but it’s not a regular 2D pdf (it’s “singular” on a line). In this lesson we focus on standard pmfs and pdfs where the formulas below apply cleanly.
Another way to specify the joint distribution is the joint CDF:
F_{X,Y}(x,y) = P(X ≤ x, Y ≤ y).
For continuous distributions, you can (when differentiable) recover the joint pdf via partial derivatives:
f_{X,Y}(x,y) = ∂²/∂x∂y F_{X,Y}(x,y).
This connects joint distributions to multivariable calculus: joint CDFs are like “accumulated volume,” joint pdfs are like “density per unit area.”
Marginals and conditionals are the real workhorses, and both come directly from the joint.
Even if you build a full model of (X,Y), you frequently need statements about just one variable: its distribution, its mean, or a tail probability.
Marginals let you ignore variables by aggregating over all their possible values.
Starting from p_{X,Y}(x,y), the marginal pmfs are
p_X(x) = ∑_y p_{X,Y}(x,y)
p_Y(y) = ∑_x p_{X,Y}(x,y).
Interpretation: Fix x and add up the probabilities of all pairs (x,y) across all y.
This is literally the “row sum” (or “column sum”) of the joint table.
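The row-sum/column-sum view translates directly into code. A small sketch, reusing the pmf table from the worked example later in this lesson:

```python
# Marginals from a joint pmf table: "row sums" give p_X, "column sums" give p_Y.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.30,
}

def marginalize(joint, keep):
    """Sum out every coordinate except `keep` (0 keeps X, 1 keeps Y)."""
    out = {}
    for pair, p in joint.items():
        out[pair[keep]] = out.get(pair[keep], 0.0) + p
    return out

p_x = marginalize(joint, keep=0)   # p_X(x) = sum over y of p(x, y)
p_y = marginalize(joint, keep=1)   # p_Y(y) = sum over x of p(x, y)

assert abs(p_x[0] - 0.40) < 1e-9 and abs(p_x[1] - 0.60) < 1e-9
assert abs(sum(p_y.values()) - 1.0) < 1e-9   # marginals are valid pmfs
```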
From a joint pdf f_{X,Y}(x,y), the marginal pdfs are
f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x,y) dy
f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x,y) dx.
Interpretation: Fix x and integrate the density along the vertical line at that x.
In practice, the integration limits often come from the support.
Example: Suppose f_{X,Y}(x,y) = c on the triangle 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 − x (and 0 otherwise). Then
f_X(x) = ∫_0^{1−x} c dy = c(1 − x), 0 ≤ x ≤ 1.
The lesson: marginalization is conceptually simple, but you must be careful about the geometry of where the joint density is nonzero.
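A quick numerical sanity check of this triangle example, using c = 2 (the value derived in the worked example later in this lesson) and a midpoint Riemann sum in place of the exact integral:

```python
# Check f_X(x) = c(1 - x) on the triangle 0 <= y <= 1 - x, with c = 2.
c = 2.0

def f_joint(x, y):
    # The density is c inside the triangle and 0 outside; getting this
    # support right is exactly the hard part the text warns about.
    return c if (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 - x) else 0.0

def marginal_x(x, n=10_000):
    """Approximate f_X(x) = integral of f(x, y) dy via a midpoint sum."""
    h = 1.0 / n
    return sum(f_joint(x, (k + 0.5) * h) for k in range(n)) * h

x = 0.3
assert abs(marginal_x(x) - c * (1 - x)) < 1e-3   # closed form: 2(1 - 0.3) = 1.4
```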
If you have a joint pmf table, marginals are easy bookkeeping.
| x \ y | y₁ | y₂ | y₃ | p_X(x) |
|---|---|---|---|---|
| x₁ | p(x₁,y₁) | p(x₁,y₂) | p(x₁,y₃) | row sum |
| x₂ | p(x₂,y₁) | p(x₂,y₂) | p(x₂,y₃) | row sum |
| p_Y(y) | col sum | col sum | col sum | 1 |
The bottom-right cell becomes 1 because the whole table sums to 1.
Once you have marginals, you can compute 1D probabilities such as P(X ≤ t) = ∑_{x ≤ t} p_X(x).
For more variables, the idea is identical.
If you have p_{X,Y,Z}(x,y,z), then
p_{X,Y}(x,y) = ∑_z p_{X,Y,Z}(x,y,z)
and similarly for continuous with integrals.
This “sum/integrate out what you don’t need” pattern will show up everywhere later (Bayes nets, latent-variable models, expectation-maximization, variational inference).
Information changes beliefs. If you observe Y = y, your uncertainty about X should typically shrink or shift.
Conditional distributions formalize this update.
If p_Y(y) > 0, define
p_{X|Y}(x|y) = P(X = x | Y = y) = p_{X,Y}(x,y) / p_Y(y).
This is the key identity:
p_{X,Y}(x,y) = p_{X|Y}(x|y) p_Y(y).
For continuous variables, we use conditional densities (informally, “density of X given Y = y”):
f_{X|Y}(x|y) = f_{X,Y}(x,y) / f_Y(y), when f_Y(y) > 0.
And similarly
f_{X,Y}(x,y) = f_{X|Y}(x|y) f_Y(y).
Why is this the right definition? Fix y. The slice f_{X,Y}(·, y) has total mass
f_Y(y) = ∫ f_{X,Y}(x,y) dx.
So dividing by f_Y(y) forces the slice to integrate to 1:
∫ f_{X|Y}(x|y) dx
= ∫ f_{X,Y}(x,y)/f_Y(y) dx
= (1/f_Y(y)) ∫ f_{X,Y}(x,y) dx
= (1/f_Y(y)) f_Y(y)
= 1.
That’s what a conditional density is: the renormalized slice.
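The same renormalized-slice logic applies in the discrete case. A minimal sketch, reusing the pmf from the worked example later in this lesson:

```python
# Conditioning = take the slice at y, then renormalize it to sum to 1.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.30,
}

def conditional_x_given_y(joint, y):
    """p_{X|Y}(x | y): slice the table at y, divide by the slice's total."""
    slice_ = {x: p for (x, yy), p in joint.items() if yy == y}
    total = sum(slice_.values())        # this total is exactly p_Y(y)
    assert total > 0, "cannot condition on a zero-probability value"
    return {x: p / total for x, p in slice_.items()}

cond = conditional_x_given_y(joint, 1)
assert abs(sum(cond.values()) - 1.0) < 1e-12   # the slice now sums to 1
assert abs(cond[0] - 4/9) < 1e-9               # matches the worked example
```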
Once you accept
p_{X,Y}(x,y) = p_{X|Y}(x|y)p_Y(y) = p_{Y|X}(y|x)p_X(x),
you can equate them to get Bayes’ rule:
p_{X|Y}(x|y) = p_{Y|X}(y|x) p_X(x) / p_Y(y).
And the denominator is just a marginalization:
p_Y(y) = ∑_x p_{Y|X}(y|x)p_X(x) (discrete)
f_Y(y) = ∫ f_{Y|X}(y|x) f_X(x) dx (continuous).
So “Bayes” is not a separate topic from joint distributions—it’s a rearrangement of the joint.
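To make that concrete, here is a small sketch of Bayes' rule as joint-rearrangement: start from a prior p_X and a likelihood p_{Y|X}, and the posterior falls out. The specific numbers are hypothetical, chosen only for illustration:

```python
# Bayes' rule as two factorizations of the same joint.
p_x = {0: 0.4, 1: 0.6}                    # prior p_X (hypothetical numbers)
p_y_given_x = {0: {0: 0.9, 1: 0.1},       # likelihood p_{Y|X}(y | x=0)
               1: {0: 0.2, 1: 0.8}}       # likelihood p_{Y|X}(y | x=1)

def posterior(y):
    """p_{X|Y}(x|y) = p_{Y|X}(y|x) p_X(x) / p_Y(y), where the denominator
    is itself a marginalization: p_Y(y) = sum_x p_{Y|X}(y|x) p_X(x)."""
    p_y = sum(p_y_given_x[x][y] * p_x[x] for x in p_x)
    return {x: p_y_given_x[x][y] * p_x[x] / p_y for x in p_x}

post = posterior(1)
assert abs(sum(post.values()) - 1.0) < 1e-12
# Y=1 is much likelier under X=1, so observing it shifts belief toward X=1:
assert post[1] > p_x[1]
```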
X and Y are independent if learning Y tells you nothing about X:
p_{X|Y}(x|y) = p_X(x) for all x,y with p_Y(y) > 0.
Equivalent factorizations:
p_{X,Y}(x,y) = p_X(x)p_Y(y)
f_{X,Y}(x,y) = f_X(x)f_Y(y).
Independence is strong. Many variables are not independent, but may still be uncorrelated or weakly dependent. Later nodes (covariance/correlation, mutual information) quantify dependence in different ways.
For three variables (X,Y,Z), the joint can be factored as
p_{X,Y,Z}(x,y,z)
= p_{X|Y,Z}(x|y,z) p_{Y|Z}(y|z) p_Z(z).
More generally, for X₁,…,X_n:
p(x₁,…,x_n) = ∏_{i=1}^n p(x_i | x₁,…,x_{i−1}).
This is not “assumptions”—it’s always true when the conditionals exist. Assumptions enter when you simplify the conditionals (e.g., Markov properties).
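You can verify that the chain rule holds for any joint, not just special ones. A sketch that builds a random (hypothetical) three-variable pmf and checks the factorization cell by cell:

```python
import itertools
import random

random.seed(0)

# A random joint pmf over three binary variables (hypothetical numbers).
keys = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in keys]
total = sum(weights)
joint = {k: w / total for k, w in zip(keys, weights)}

def p(fixed):
    """P(coords match `fixed`), marginalizing (summing) over the rest.
    `fixed` maps coordinate index -> required value."""
    return sum(v for k, v in joint.items()
               if all(k[i] == c for i, c in fixed.items()))

# Chain rule: p(x,y,z) = p_{X|Y,Z}(x|y,z) p_{Y|Z}(y|z) p_Z(z) -- always true.
for (x, y, z), v in joint.items():
    chain = (p({0: x, 1: y, 2: z}) / p({1: y, 2: z})   # p(x | y, z)
             * p({1: y, 2: z}) / p({2: z})             # p(y | z)
             * p({2: z}))                              # p(z)
    assert abs(chain - v) < 1e-12
```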
With a joint distribution, probabilities of events about (X,Y) become sums/integrals over regions.
Discrete:
P((X,Y) ∈ A) = ∑_{(x,y)∈A} p_{X,Y}(x,y).
Continuous:
P((X,Y) ∈ A) = ∬_A f_{X,Y}(x,y) dx dy.
This is often the cleanest way to solve problems like P(X + Y ≤ 1) or P(X ≤ Y).
Many quantities of interest are functions g(X,Y). The expected value is
E[g(X,Y)] = ∑_x ∑_y g(x,y) p_{X,Y}(x,y) (discrete)
E[g(X,Y)] = ∬ g(x,y) f_{X,Y}(x,y) dx dy (continuous).
Special cases: g(x,y) = x recovers E[X], g(x,y) = y gives E[Y], and g(x,y) = xy gives the joint moment E[XY] that covariance is built from.
A key identity (law of total expectation) links conditionals back to marginals:
E[X] = E[E[X|Y]].
You can see it directly in the discrete case:
E[E[X|Y]]
= ∑_y E[X|Y=y] p_Y(y)
= ∑_y (∑_x x p_{X|Y}(x|y)) p_Y(y)
= ∑_y ∑_x x (p_{X,Y}(x,y)/p_Y(y)) p_Y(y)
= ∑_x ∑_y x p_{X,Y}(x,y)
= E[X].
This is one of the most useful “bridge” theorems in probability.
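The discrete derivation above can be checked mechanically. A sketch on the pmf table from the worked example in this lesson:

```python
# Law of total expectation: E[X] = E[E[X|Y]], checked on a concrete table.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.30,
}

# E[X] computed directly from the joint.
e_x = sum(x * p for (x, y), p in joint.items())

# E[E[X|Y]]: for each y, take the conditional mean of X, then average
# those means weighted by p_Y(y).
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p

e_of_cond = 0.0
for y, py in p_y.items():
    cond_mean = sum(x * p / py for (x, yy), p in joint.items() if yy == y)
    e_of_cond += cond_mean * py

assert abs(e_x - e_of_cond) < 1e-12   # the two routes agree
```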
Independence is all-or-nothing. In practice you often ask: “How dependent are X and Y?”
Two major upcoming tools depend on joints:
| Tool | What it measures | Needs from joints |
|---|---|---|
| Covariance/correlation | Linear relationship via E[XY] − E[X]E[Y] | Joint moments like E[XY] |
| Mutual information | Any statistical dependence via KL divergence / entropies | p_{X,Y}, p_X, p_Y |
Joint distributions are the raw material those measures are made from.
For a continuous f_{X,Y}, even before computing correlation, you can often “see” dependence from the shape of the joint density.
1) Start with a generative story (conditionals):
Specify a marginal p_Y(y) and a conditional p_{X|Y}(· | y).
This defines a valid joint via p_{X,Y}(x,y) = p_{X|Y}(x|y)p_Y(y).
2) Or start with a joint and derive everything else:
Both are common. In machine learning, you often specify p(y) and p(x|y) (mixture models, Naive Bayes), and marginalize y to get p(x).
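The generative-story route can be sketched in a few lines: specify p(y) and p(x|y), multiply to form the joint, then marginalize y back out. The numbers are hypothetical:

```python
# Build a joint from conditional x marginal, then recover p(x) as a mixture.
p_y = {0: 0.3, 1: 0.7}                      # hypothetical marginal p_Y
p_x_given_y = {0: {0: 0.5, 1: 0.5},         # hypothetical p_{X|Y}(x | y=0)
               1: {0: 0.9, 1: 0.1}}         # hypothetical p_{X|Y}(x | y=1)

# p_{X,Y}(x, y) = p_{X|Y}(x|y) p_Y(y): valid joint by construction.
joint = {(x, y): p_x_given_y[y][x] * p_y[y]
         for y in p_y for x in p_x_given_y[y]}
assert abs(sum(joint.values()) - 1.0) < 1e-12

# Marginalizing out y gives p_X as a p_Y-weighted mixture of the conditionals.
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
assert abs(p_x[0] - (0.3 * 0.5 + 0.7 * 0.9)) < 1e-12
```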
When you have many variables, you often package them into a random vector X ∈ ℝᵈ.
You’ll later see multivariate normals, covariance matrices, and transformations of random vectors. Joint distributions are the foundation.
Let X ∈ {0,1} and Y ∈ {0,1,2}. Suppose the joint pmf is:
p_{X,Y}(0,0)=0.10, p(0,1)=0.20, p(0,2)=0.10,
p_{X,Y}(1,0)=0.05, p(1,1)=0.25, p(1,2)=0.30.
(1) Compute p_X and p_Y.
(2) Compute p_{X|Y}(x|1).
(3) Are X and Y independent?
(1) Compute p_X(x) by summing over y.
p_X(0) = p(0,0)+p(0,1)+p(0,2)
= 0.10 + 0.20 + 0.10
= 0.40
p_X(1) = p(1,0)+p(1,1)+p(1,2)
= 0.05 + 0.25 + 0.30
= 0.60
(1) Compute p_Y(y) by summing over x.
p_Y(0) = p(0,0)+p(1,0) = 0.10 + 0.05 = 0.15
p_Y(1) = p(0,1)+p(1,1) = 0.20 + 0.25 = 0.45
p_Y(2) = p(0,2)+p(1,2) = 0.10 + 0.30 = 0.40
Check: 0.15+0.45+0.40 = 1.00 ✓
(2) Compute p_{X|Y}(x|1) = p_{X,Y}(x,1)/p_Y(1).
p_{X|Y}(0|1) = p(0,1)/p_Y(1) = 0.20 / 0.45 = 4/9 ≈ 0.444...
p_{X|Y}(1|1) = p(1,1)/p_Y(1) = 0.25 / 0.45 = 5/9 ≈ 0.555...
Check: 4/9 + 5/9 = 1 ✓
(3) Test independence using p_{X,Y}(x,y) ?= p_X(x)p_Y(y).
For (x,y)=(0,0):
Right side = p_X(0)p_Y(0) = 0.40·0.15 = 0.06
Left side = p(0,0) = 0.10
Not equal ⇒ not independent.
Insight: Marginals are just sums of the joint table. Conditionals are a “renormalized column” (fix y) or row (fix x). Independence fails as soon as any joint cell disagrees with the product of marginals.
Let (X,Y) have joint pdf f_{X,Y}(x,y) = c on the region 0 ≤ x ≤ 1, 0 ≤ y ≤ 1 − x, and 0 otherwise.
(1) Find c.
(2) Find the marginal pdf f_X(x).
(3) Find P(X ≤ 1/2).
(1) Normalize the pdf: ∬ f_{X,Y}(x,y) dx dy = 1.
The region is: x from 0 to 1, and for each x, y from 0 to 1−x.
So:
1 = ∫_{x=0}^{1} ∫_{y=0}^{1−x} c dy dx
= ∫_0^1 c(1−x) dx
= c ∫_0^1 (1−x) dx
= c [x − x²/2]_0^1
= c (1 − 1/2)
= c/2
Therefore c = 2.
(2) Compute f_X(x) = ∫ f_{X,Y}(x,y) dy.
For x outside [0,1], f_X(x)=0.
For x ∈ [0,1], y ranges 0 to 1−x:
f_X(x) = ∫_0^{1−x} 2 dy
= 2(1−x), 0 ≤ x ≤ 1.
(3) Compute P(X ≤ 1/2) from f_X.
P(X ≤ 1/2) = ∫_0^{1/2} 2(1−x) dx
= 2 [x − x²/2]_0^{1/2}
= 2 (1/2 − (1/4)/2)
= 2 (1/2 − 1/8)
= 2 (3/8)
= 3/4.
Insight: In continuous problems, most difficulty is identifying the support correctly. Once the limits match the geometry, marginalization is straightforward calculus.
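As a cross-check on the calculus, a Monte Carlo sketch: sample (X, Y) uniformly on the triangle by rejection from the unit square, then estimate P(X ≤ 1/2), which the integral above gives as 3/4:

```python
import random

random.seed(42)

# Rejection sampling: propose uniformly on the unit square, accept points
# inside the triangle 0 <= y <= 1 - x (acceptance rate is the area, 1/2).
n, hits, accepted = 200_000, 0, 0
while accepted < n:
    x, y = random.random(), random.random()
    if y <= 1.0 - x:
        accepted += 1
        if x <= 0.5:
            hits += 1

estimate = hits / n
assert abs(estimate - 0.75) < 0.01   # exact answer from the calculus: 3/4
```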
Assume a joint pdf f_{X,Y}(x,y) = 2y for 0 < x < 1 and 0 < y < 1 (and 0 otherwise).
(1) Find f_Y(y).
(2) Find f_{X|Y}(x|y).
(3) Are X and Y independent?
(1) Marginalize out x:
f_Y(y) = ∫_{x=0}^{1} 2y dx
= 2y ∫_0^1 dx
= 2y, 0 < y < 1.
Check normalization:
∫_0^1 2y dy = [y²]_0^1 = 1 ✓
(2) Condition on Y=y:
f_{X|Y}(x|y) = f_{X,Y}(x,y)/f_Y(y)
= (2y)/(2y)
= 1, for 0 < x < 1.
So X|Y=y ~ Uniform(0,1).
(3) Test factorization:
f_X(x) = ∫_{y=0}^{1} 2y dy = 1, for 0 < x < 1.
So f_X(x)=1 on (0,1).
Now f_X(x)f_Y(y) = 1 · (2y) = 2y.
This equals f_{X,Y}(x,y) for 0<x<1, 0<y<1.
Therefore X and Y are independent.
Insight: A joint can look like it “depends on y,” yet still represent independence if it factors cleanly. The conditional f_{X|Y} revealing a constant 1 is a strong clue: X does not change when you learn Y.
A joint distribution p_{X,Y}(x,y) or f_{X,Y}(x,y) assigns probability mass/density to outcome pairs (x,y).
Marginals come from summing/integrating out the other variable: p_X(x)=∑_y p_{X,Y}(x,y) and f_X(x)=∫ f_{X,Y}(x,y) dy.
Conditionals are normalized slices: p_{X|Y}(x|y)=p_{X,Y}(x,y)/p_Y(y) (when p_Y(y)>0), and similarly for densities.
The joint always factors as joint = conditional × marginal: p_{X,Y}=p_{X|Y}p_Y (and also = p_{Y|X}p_X).
Bayes’ rule is a direct consequence of two ways to factor the same joint.
Independence is equivalent to factorization: p_{X,Y}=p_X p_Y; equivalently p_{X|Y}=p_X.
Probabilities of events about (X,Y) are sums/integrals over regions in the (x,y)-plane.
Future tools like covariance/correlation and mutual information are computed from the joint (plus its marginals/conditionals).
Treating a pdf value f_{X,Y}(x,y) as a probability; probabilities require integrating over an area, not reading off a point value.
Forgetting to respect the support when integrating (wrong bounds), especially for triangular or otherwise constrained regions.
Dividing by p_Y(y) or f_Y(y) without checking it’s positive (conditioning on an event of probability 0 is subtle).
Assuming uncorrelated implies independent; independence is stronger and must be checked via factorization or equivalent criteria.
Discrete: Suppose p_{X,Y}(x,y) is given by
p(0,0)=0.3, p(0,1)=0.1,
p(1,0)=0.2, p(1,1)=0.4.
Compute (a) p_X, (b) p_{Y|X}(1|0), (c) P(X=Y).
Hint: Marginals are row/column sums. Conditionals divide a joint cell by the corresponding marginal. P(X=Y) adds the diagonal cells.
(a) p_X(0)=0.3+0.1=0.4, p_X(1)=0.2+0.4=0.6.
Also p_Y(0)=0.3+0.2=0.5, p_Y(1)=0.1+0.4=0.5.
(b) p_{Y|X}(1|0)=p(0,1)/p_X(0)=0.1/0.4=0.25.
(c) P(X=Y)=p(0,0)+p(1,1)=0.3+0.4=0.7.
Continuous: Let f_{X,Y}(x,y) = k(x + y) on 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and 0 otherwise.
(a) Find k.
(b) Find f_X(x).
(c) Find E[X].
Hint: Use ∬ f = 1 to find k. Then f_X(x)=∫_0^1 k(x+y) dy. For E[X], use E[X]=∫ x f_X(x) dx.
(a) Normalize:
1 = ∫_0^1 ∫_0^1 k(x+y) dy dx
= k ∫_0^1 ∫_0^1 (x+y) dy dx.
Compute inner integral:
∫_0^1 (x+y) dy = [xy + y²/2]_0^1 = x + 1/2.
Then
1 = k ∫_0^1 (x+1/2) dx
= k [x²/2 + x/2]_0^1
= k (1/2 + 1/2)
= k.
So k = 1.
(b) f_X(x)=∫_0^1 (x+y) dy = x + 1/2, 0≤x≤1.
(c) E[X]=∫_0^1 x(x+1/2) dx
= ∫_0^1 (x² + x/2) dx
= [x³/3 + x²/4]_0^1
= 1/3 + 1/4
= 7/12.
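To double-check the answer E[X] = 7/12 numerically, here is a midpoint Riemann double sum over the unit square with f(x,y) = x + y:

```python
# Numeric check of E[X] = double integral of x * f(x, y) over [0,1]^2
# with f(x, y) = x + y (k = 1 from part (a)).
n = 400
h = 1.0 / n
total = 0.0
for i in range(n):
    x = (i + 0.5) * h
    for j in range(n):
        y = (j + 0.5) * h
        total += x * (x + y) * h * h

assert abs(total - 7/12) < 1e-4   # agrees with the exact answer 7/12
```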
Independence check: Let X be Uniform(0,1). Given X=x, let Y|X=x be Uniform(0,x).
(a) Write f_{Y|X}(y|x).
(b) Find the joint density f_{X,Y}(x,y).
(c) Are X and Y independent?
Hint: Use f_{X,Y}(x,y)=f_{Y|X}(y|x)f_X(x). Watch the support: 0<y<x<1. Independence would require factoring into f_X(x)f_Y(y).
(a) For x∈(0,1), Y|X=x is Uniform(0,x), so
f_{Y|X}(y|x)=1/x for 0<y<x, else 0.
(b) Since f_X(x)=1 on (0,1):
f_{X,Y}(x,y)=f_{Y|X}(y|x)f_X(x)= (1/x)·1 = 1/x on the region 0<y<x<1, else 0.
(c) Not independent. Intuitively, if you learn X is small, Y must be even smaller because 0<Y<X. Formally, the support is triangular (0<y<x<1), which already prevents factorization into a product of a function of x and a function of y over the full unit square.
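The dependence in this hierarchical model is easy to see by simulation; a small sketch:

```python
import random

random.seed(0)

# Simulate X ~ Uniform(0,1), then Y | X = x ~ Uniform(0, x).
samples = [(x, random.uniform(0.0, x))
           for x in (random.random() for _ in range(100_000))]

# The triangular support 0 < y <= x < 1 holds for every sample.
assert all(y <= x for x, y in samples)

# Under independence, learning X < 1/2 would not change P(Y > 1/2);
# here that conditional probability is exactly 0, since Y <= X.
assert not any(y > 0.5 for x, y in samples if x < 0.5)
```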
Next steps that build directly on this node:
Helpful refreshers: