Measures of linear relationship between random variables.
A city analyst plots two monthly time series: ice cream sales and drownings. The scatterplot slopes upward—strongly. A headline writes itself: “Ice cream causes drownings.”
But the analyst pauses. Covariance and correlation can tell you how tightly two variables move together in a straight-line way. They cannot, by themselves, tell you why (causation), nor whether a relationship is real vs driven by a third variable (season/temperature), nor whether a relationship is nonlinear (e.g., U-shaped).
In this lesson you’ll learn:
Covariance is the expected product of two variables’ deviations from their means: Cov(X,Y) = E[(X−E[X])(Y−E[Y])]. Its sign indicates direction of linear co-movement, and its magnitude depends on units. Correlation ρ₍X,Y₎ standardizes covariance by dividing by σXσY, producing a unitless number in [−1,1] that measures linear association. Zero covariance means “no linear relationship” but does not generally mean independence (except in special cases like jointly normal variables).
Variance tells you how a single random variable spreads around its mean. But many real questions are about pairs of variables: do they tend to rise and fall together, or move in opposite directions?
We want a number that captures co-movement.
Covariance looks at whether X and Y are simultaneously above their means or simultaneously below their means.
Define the mean-centered variables:

X̃ = X − E[X],  Ỹ = Y − E[Y]

Then covariance is the expected product:

Cov(X, Y) = E[X̃ Ỹ] = E[(X − E[X])(Y − E[Y])]
Interpretation by sign:

- Cov(X, Y) > 0: X and Y tend to be above (or below) their means together.
- Cov(X, Y) < 0: when one is above its mean, the other tends to be below.
- Cov(X, Y) ≈ 0: no consistent linear co-movement.
A key subtlety: covariance is unitful. If X is in dollars and Y is in seconds, covariance is in dollar·seconds. Changing units (e.g., dollars → cents) scales covariance.
To compare relationships across different scales, we standardize by each variable’s standard deviation.
Correlation is:

ρ₍X,Y₎ = Cov(X, Y) / (σ_X σ_Y)

where σ_X = √Var(X) and σ_Y = √Var(Y).
Interpretation:

- ρ = +1: perfect positive linear relationship.
- ρ = −1: perfect negative linear relationship.
- ρ = 0: no linear association (dependence may still exist).
If you take paired samples and center them, each data point becomes a 2D vector from the mean: (xᵢ − x̄, yᵢ − ȳ).
A helpful geometric intuition is that correlation behaves like a normalized “alignment” between the centered coordinates of and .
Here’s a simple ASCII scatter showing a positive association. The “+” is the mean; arrows suggest centered deviations.
Y
^ *
| *
| *
| *
| *
| +-----------------> X
| *
|  *

When points trend from bottom-left to top-right, products of deviations are often positive, boosting covariance and correlation.
Covariance/correlation can:

- indicate the direction (sign) of linear co-movement,
- quantify the strength of linear association on a standardized scale (correlation).

They cannot by themselves:

- establish causation,
- rule out a confounding third variable,
- detect nonlinear relationships.
We want a measure that's:

- positive when X and Y deviate from their means in the same direction,
- negative when they deviate in opposite directions.

The product (X − E[X])(Y − E[Y]) does exactly that: it is positive when the deviations share a sign and negative when they differ.
Averaging (expectation) makes it a stable summary.
Start from the definition:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]

Expand step-by-step:

E[(X − E[X])(Y − E[Y])] = E[XY − X·E[Y] − E[X]·Y + E[X]E[Y]]
                        = E[XY] − E[X]E[Y] − E[X]E[Y] + E[X]E[Y]

So:

Cov(X, Y) = E[XY] − E[X]E[Y]
This identity is especially handy in calculations.
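The identity is easy to sanity-check numerically. The sketch below uses simulated data (the seed, sample size, and the relationship y = 2x + noise are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2 * x + rng.normal(size=10_000)

# Definition form: average product of deviations (1/n version)
cov_def = np.mean((x - x.mean()) * (y - y.mean()))
# Identity form: E[XY] - E[X]E[Y]
cov_id = np.mean(x * y) - x.mean() * y.mean()

print(cov_def, cov_id)  # the two formulas agree to floating-point precision
```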
These properties matter constantly in modeling:
1) Adding constants doesn’t change covariance:

Cov(X + a, Y + b) = Cov(X, Y)
Reason: centering removes constants.
2) Scaling scales covariance:

Cov(aX, bY) = ab·Cov(X, Y)
So if you convert meters to centimeters (×100), covariance changes by ×100 on that variable’s side.
Variance is just covariance with itself:

Var(X) = Cov(X, X)
That makes covariance a true generalization of variance.
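All three properties can be verified empirically. A minimal sketch on simulated data (seed and coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50_000)
y = 0.5 * x + rng.normal(size=50_000)

def cov(a, b):
    # 1/n ("population-style") covariance of two samples
    return np.mean((a - a.mean()) * (b - b.mean()))

c = cov(x, y)
print(abs(cov(x + 7, y - 3) - c))      # ~0: constants drop out
print(abs(cov(100 * x, y) - 100 * c))  # ~0: scaling one side scales cov
print(abs(cov(x, x) - np.var(x)))      # ~0: Var(X) = Cov(X, X)
```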
If X and Y are independent, then E[XY] = E[X]E[Y], so:

Cov(X, Y) = E[XY] − E[X]E[Y] = 0

But the reverse is not generally true: Cov(X, Y) = 0 does not guarantee independence. (We’ll build a concrete counterexample later.)
In practice, you usually have paired samples (x₁, y₁), …, (xₙ, yₙ) for (X, Y).
Define sample means:

x̄ = (1/n)∑xᵢ,  ȳ = (1/n)∑yᵢ

A common estimator (unbiased for i.i.d. samples) is:

s_xy = (1/(n−1)) ∑(xᵢ − x̄)(yᵢ − ȳ)
You may also see instead of ; that version is the maximum-likelihood estimator under a normal model but is biased in finite samples.
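As a sketch, the 1/(n−1) estimator can be written directly and checked against NumPy’s built-in `np.cov` (the three-point dataset below is made up):

```python
import numpy as np

def sample_cov(x, y):
    """Unbiased sample covariance with the 1/(n-1) divisor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    return np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

x = [1, 2, 3]
y = [2, 4, 5]
print(sample_cov(x, y))            # 1.5
print(np.cov(x, y, ddof=1)[0, 1])  # 1.5 -- NumPy defaults to 1/(n-1) too
```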
Suppose you compute the covariance between height (in meters) and weight, then redo the calculation with height in centimeters. The relationship is identical, but scaling height by 100 scales covariance by 100. That’s why correlation is so widely used: it removes units.
Covariance answers “do they move together?”, but not “how strong is that co-movement relative to each variable’s natural scale?”
Correlation fixes this by dividing by σ_X σ_Y.

Because Cov(X, Y) has units of X·Y, and σ_X σ_Y also has units of X·Y, the units cancel.
This isn’t just a convention—it’s a theorem. One route uses Cauchy–Schwarz.
Let X̃ = X − E[X] and Ỹ = Y − E[Y].

Apply Cauchy–Schwarz to random variables:

(E[X̃Ỹ])² ≤ E[X̃²]·E[Ỹ²]

But E[X̃²] = Var(X), and similarly E[Ỹ²] = Var(Y).

So:

Cov(X, Y)² ≤ Var(X)·Var(Y),  i.e.  |Cov(X, Y)| ≤ σ_X σ_Y

Divide both sides by σ_X σ_Y (assuming both are nonzero):

|ρ₍X,Y₎| ≤ 1
Equality in Cauchy–Schwarz occurs when one variable is an exact scalar multiple of the other (almost surely) after centering.
That means:

|ρ| = 1 exactly when, after centering, one variable is an exact scalar multiple of the other.

Equivalently, Y = aX + b almost surely for some constants a ≠ 0 and b.
For a dataset, define centered vectors:

x = (x₁ − x̄, …, xₙ − x̄),  y = (y₁ − ȳ, …, yₙ − ȳ)

Then the sample correlation is:

r = (x · y) / (‖x‖ ‖y‖)

That is exactly the cosine of the angle between x and y in ℝⁿ.
Inline “plot” of the vector-angle intuition:
y (centered)
^
/|
/ | cos(θ) = r
/θ |
+---+----------> x (centered)

This is a powerful mental model: correlation is alignment between patterns of deviation-from-mean.
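The cosine identity can be checked directly. A small sketch with made-up numbers:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

xc, yc = x - x.mean(), y - y.mean()   # centered vectors
cos_theta = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
r = np.corrcoef(x, y)[0, 1]

print(cos_theta, r)  # identical values: r is the cosine of the angle
```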
Using the sample covariance s_xy and sample standard deviations s_x, s_y:

r = s_xy / (s_x s_y)
Because it depends on products of deviations, a single extreme point can dominate.
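To see the sensitivity, the sketch below adds one extreme point to otherwise independent data (the seed and the outlier’s coordinates are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = rng.normal(size=30)          # independent draws: true correlation is 0

r_clean = np.corrcoef(x, y)[0, 1]

# Append a single extreme point far from the cloud
x_out = np.append(x, 10.0)
y_out = np.append(y, 10.0)
r_out = np.corrcoef(x_out, y_out)[0, 1]

print(r_clean, r_out)  # one point drags the correlation strongly positive
```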
Practical note: always plot the data; if a few extreme points dominate, consider robust alternatives such as rank (Spearman) correlation.

A strong correlation can arise from:

- a genuine linear relationship,
- a confounding third variable driving both,
- outliers or coincidence in small samples.
In the ice-cream example, temperature (a confounder) increases both ice cream consumption and swimming, which increases drowning risk.
Covariance/correlation quantify association; causal claims require additional design/assumptions.
Cov(X, Y) = 0 means the linear term in their relationship is absent in a precise sense: the best linear predictor of Y from X has slope zero.

But X can still strongly determine Y nonlinearly.
We’ll show an explicit example in the worked examples: X ~ Uniform(−1, 1) and Y = X². Then the covariance is 0, yet Y is completely determined by X.
Special case worth knowing: if X and Y are jointly normal, then zero covariance does imply independence. (Not true in general.)
With n random variables, stack them into a random vector X ∈ ℝⁿ.

Define the covariance matrix:

Σ = E[(X − μ)(X − μ)ᵀ],  where μ = E[X], so Σᵢⱼ = Cov(Xᵢ, Xⱼ)

Key facts:

- Σ is symmetric: Σᵢⱼ = Σⱼᵢ.
- Σ is positive semidefinite: vᵀΣv = Var(vᵀX) ≥ 0 for every vector v.
- The diagonal entries are variances: Σᵢᵢ = Var(Xᵢ).
This matrix is the central object in multivariate statistics.
PCA looks for directions in feature space with maximum variance.
If you have centered data vectors x ∈ ℝᵈ (d features), PCA finds unit vectors v ∈ ℝᵈ maximizing:

Var(vᵀx) = vᵀΣv

The solutions are eigenvectors of the covariance matrix Σ; the maximizer is the eigenvector with the largest eigenvalue.
So learning covariance is not just about pairwise relationships—it’s about understanding the geometry of data clouds and the linear structure PCA extracts.
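A minimal PCA sketch under these definitions (the 2-D synthetic cloud, its 0.8 slope, and the seed are arbitrary): build the sample covariance matrix and take its top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(size=500)
# Second feature mostly follows the first, plus a little noise
data = np.column_stack([z, 0.8 * z + 0.3 * rng.normal(size=500)])

centered = data - data.mean(axis=0)
sigma = centered.T @ centered / (len(data) - 1)  # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(sigma)         # eigh: for symmetric matrices
top = eigvecs[:, np.argmax(eigvals)]             # direction of max variance

print(top)  # roughly aligned with the dominant (1, 0.8) direction, up to sign
```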
| Goal | Use covariance? | Use correlation? | Notes |
|---|---|---|---|
| Keep physical units (e.g., risk in $·days) | ✅ | ❌ | Covariance preserves scale |
| Compare relationships across different units | ❌ | ✅ | Correlation is unitless |
| Build PCA on raw feature scales | ✅ | ❌/✅ | Often you choose covariance PCA or correlation PCA (standardized) |
| Detect any dependence (including nonlinear) | ❌ | ❌ | Need other tools (MI, plots, kernels, etc.) |
You observe n=5 paired measurements:
X: [1, 2, 3, 4, 5]
Y: [2, 1, 4, 3, 6]
Compute the sample covariance s_xy (with 1/(n−1)) and sample correlation r.
Compute means:
x̄ = (1+2+3+4+5)/5 = 15/5 = 3
ȳ = (2+1+4+3+6)/5 = 16/5 = 3.2
Compute centered values and products:
For each i, compute (xᵢ − x̄), (yᵢ − ȳ), and their product.
i=1: x=1 → −2; y=2 → −1.2; product = (−2)(−1.2)= 2.4
i=2: x=2 → −1; y=1 → −2.2; product = (−1)(−2.2)= 2.2
i=3: x=3 → 0; y=4 → 0.8; product = 0·0.8= 0
i=4: x=4 → 1; y=3 → −0.2; product = 1·(−0.2)= −0.2
i=5: x=5 → 2; y=6 → 2.8; product = 2·2.8= 5.6
Sum of products:
∑(xᵢ − x̄)(yᵢ − ȳ) = 2.4 + 2.2 + 0 − 0.2 + 5.6 = 10.0
Sample covariance:

s_xy = 10.0 / (n − 1) = 10.0 / 4 = 2.5
Compute sample standard deviations.
First compute sums of squares.
For X:
(xᵢ − x̄)²: 4, 1, 0, 1, 4 → sum = 10
So s_x² = 10/(n−1)=10/4=2.5 → s_x = √2.5
For Y:
(yᵢ − ȳ)²: (−1.2)² = 1.44, (−2.2)² = 4.84, 0.8² = 0.64, (−0.2)² = 0.04, 2.8² = 7.84
Sum = 1.44+4.84+0.64+0.04+7.84 = 14.8
So s_y² = 14.8/4 = 3.7 → s_y = √3.7
Compute correlation:

r = s_xy / (s_x s_y) = 2.5 / (√2.5 · √3.7) = 2.5 / √9.25
Numerically, √9.25 ≈ 3.041, so
r ≈ 2.5 / 3.041 ≈ 0.822.
Insight: The covariance (2.5) says X and Y tend to be above/below their means together, but its magnitude depends on X and Y units. The correlation (~0.82) shows a fairly strong positive linear association on a standardized scale.
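The arithmetic can be double-checked in a couple of lines using NumPy’s estimators, which use the same 1/(n−1) convention:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 1, 4, 3, 6], dtype=float)

s_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance
r = np.corrcoef(x, y)[0, 1]         # sample correlation

print(s_xy)  # 2.5
print(r)     # ~0.822
```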
Let X ~ Uniform(−1, 1) and define Y = X². Compute Cov(X,Y). Are X and Y independent?
Compute expectations.
By symmetry of Uniform(−1,1), the distribution is symmetric around 0, so E[X] = 0.
Compute E[Y]. Since Y = X², for X ~ Uniform(−1,1) with density f(x) = 1/2 on [−1,1]:

E[Y] = E[X²] = ∫₋₁¹ x² · (1/2) dx = [x³/6]₋₁¹ = 1/3
Compute E[XY]. Since Y = X², we have XY = X·X² = X³. Then:

E[XY] = E[X³] = ∫₋₁¹ x³ · (1/2) dx

But x³ is an odd function and the interval is symmetric, so the integral is 0:

E[XY] = 0
Compute covariance using E[XY] − E[X]E[Y]:

Cov(X, Y) = 0 − 0 · (1/3) = 0
Check independence intuition.
If X and Y were independent, knowing X would not give information about Y.
But here Y is determined exactly by X via Y = X². For example, learning X = 0.7 pins down Y = 0.49 exactly.
So Y is not independent of X.
Insight: Covariance zero means “no linear association,” not “no relationship.” The relationship here is perfectly nonlinear: the scatter is a U-shape. Covariance and correlation miss that shape even though dependence is complete.
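A quick simulation makes the point concrete (sample size and seed are arbitrary): the sample covariance of X and X² hovers near zero even though the dependence is perfect.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=200_000)
y = x ** 2                          # Y is a deterministic function of X

print(np.cov(x, y, ddof=1)[0, 1])   # ~0: no linear association
print(np.corrcoef(x, y)[0, 1])      # ~0 as well, despite total dependence
```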
Let X and Y be random variables with Corr(X,Y)=ρ. Define X' = 3X + 10 and Y' = −2Y + 5. Find Corr(X',Y').
Use covariance and standard deviation scaling rules.
Constants don’t affect covariance:
Cov(X+a, Y+b)=Cov(X,Y).
Scaling: Cov(cX, dY)=cd Cov(X,Y).
Compute Cov(X',Y'):

X' = 3X + 10, Y' = −2Y + 5

Cov(X', Y') = Cov(3X + 10, −2Y + 5) = (3)(−2) Cov(X, Y) = −6 Cov(X, Y)
Compute σ_{X'} and σ_{Y'}.
Standard deviation scales by absolute value:
σ_{3X+10} = |3|σ_X = 3σ_X
σ_{−2Y+5} = |−2|σ_Y = 2σ_Y
Compute correlation:

Corr(X', Y') = Cov(X', Y') / (σ_{X'} σ_{Y'}) = (−6 Cov(X, Y)) / (3σ_X · 2σ_Y) = −Cov(X, Y)/(σ_X σ_Y) = −ρ
Insight: Correlation ignores shifts and positive rescalings (unit changes), but a negative scaling flips the sign because it reverses one axis.
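A numeric sketch of this result (the simulated pair and its particular correlation are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100_000)
y = 0.6 * x + 0.8 * rng.normal(size=100_000)  # construct a correlated pair

rho = np.corrcoef(x, y)[0, 1]
rho_prime = np.corrcoef(3 * x + 10, -2 * y + 5)[0, 1]

print(rho, rho_prime)  # rho_prime equals -rho (up to floating point)
```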
Covariance measures average joint deviation: Cov(X, Y) = E[(X − E[X])(Y − E[Y])].

A useful identity: Cov(X, Y) = E[XY] − E[X]E[Y].
Covariance has units (units of X times units of Y), so its magnitude depends on scaling choices.
Correlation standardizes covariance: ρ₍X,Y₎ = Cov(X, Y)/(σ_X σ_Y), producing a unitless number in [−1, 1].
Correlation captures linear association; strong nonlinear dependence can still yield ρ≈0.
Independence ⇒ zero covariance, but zero covariance ⇏ independence in general (e.g., Y=X² with symmetric X).
In sample form, correlation equals cosine similarity between centered data vectors x and y.
Covariance matrices generalize covariance to many variables and are the core input to PCA via eigenvectors.
Interpreting correlation as causation (confounding variables can create high correlation).
Assuming ρ=0 implies independence (it only rules out linear association unless extra assumptions like joint normality hold).
Comparing covariance magnitudes across datasets with different units/scales (use correlation or standardize first).
Ignoring outliers: a few extreme points can drastically change covariance/correlation and hide the typical pattern.
Let X have E[X]=2, Var(X)=9. Let Y = 5 − 2X. Compute Cov(X,Y) and Corr(X,Y).
Hint: Use Cov(X, a+bX) = b Var(X). For correlation, note Y is exactly linear in X.
Compute covariance:
Y = 5 − 2X ⇒ Cov(X,Y) = Cov(X, 5 − 2X) = −2 Cov(X,X) = −2 Var(X) = −2·9 = −18.
For correlation, since Y is an exact decreasing linear function of X, Corr(X,Y)=−1.
(You can also compute σ_Y = |−2|σ_X = 2·3=6, so Corr = (−18)/(3·6)=−1.)
Suppose E[X]=1, E[Y]=3, and E[XY]=10. Compute Cov(X,Y).
Hint: Use Cov(X,Y)=E[XY]−E[X]E[Y].
Cov(X,Y) = 10 − (1)(3) = 7.
Let X be Uniform(0,1) and Y = X. Compute Corr(X,Y). Then let Z = X² and reason (without heavy computation) whether Corr(X,Z) is closer to 1, closer to 0, or could be negative.
Hint: For Y=X, it’s perfect linear. For Z=X² on (0,1), it’s increasing but nonlinear; think about whether larger X tends to mean larger Z.
For Y=X, correlation is 1 because Y is an exact positive linear function of X (Y=1·X+0).
For Z=X² with X∈(0,1), Z is strictly increasing in X, so larger X tends to correspond to larger Z. That suggests a positive covariance and positive correlation. However, because the relationship is nonlinear, Corr(X,Z) will be less than 1 (not a perfect straight line). It cannot be negative here because X and X² move in the same direction on (0,1). So Corr(X,Z) is closer to 1 than to 0, but still < 1.
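For the Uniform(0,1) case the exact value works out to Corr(X, X²) = √15/4 ≈ 0.968 (using E[Xᵏ] = 1/(k+1)). A simulation sketch confirms it sits near 1 but below it:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=200_000)

r = np.corrcoef(x, x ** 2)[0, 1]
print(r)  # ~0.968: strongly positive, but not a perfect straight line
```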