Measures of linear relationship between random variables.
A city analyst plots two monthly time series: ice cream sales and drownings. The scatterplot slopes upward—strongly. A headline writes itself: “Ice cream causes drownings.”
But the analyst pauses. Covariance and correlation can tell you how tightly two variables move together in a straight-line way. They cannot, by themselves, tell you why (causation), nor whether a relationship is real vs driven by a third variable (season/temperature), nor whether a relationship is nonlinear (e.g., U-shaped).
In this lesson you’ll learn:
Covariance is the expected product of two variables’ deviations from their means: Cov(X,Y) = E[(X−E[X])(Y−E[Y])]. Its sign indicates direction of linear co-movement, and its magnitude depends on units. Correlation ρ₍X,Y₎ standardizes covariance by dividing by σXσY, producing a unitless number in [−1,1] that measures linear association. Zero covariance means “no linear relationship” but does not generally mean independence (except in special cases like jointly normal variables).
Variance tells you how a single random variable spreads around its mean. But many real questions are about pairs of variables: do they tend to rise and fall together, or move in opposite directions?
We want a number that captures co-movement.
Covariance looks at whether X and Y are simultaneously above their means or simultaneously below their means.
Define the mean-centered variables:

X̃ = X − E[X],  Ỹ = Y − E[Y]

Then covariance is the expected product:

Cov(X, Y) = E[X̃ Ỹ] = E[(X − E[X])(Y − E[Y])]
Interpretation by sign:

- Cov(X, Y) > 0: X and Y tend to be above (or below) their means together.
- Cov(X, Y) < 0: when one is above its mean, the other tends to be below.
- Cov(X, Y) ≈ 0: no consistent linear co-movement.
A key subtlety: covariance is unitful. If X is in dollars and Y is in seconds, covariance is in dollar·seconds. Changing units (e.g., dollars → cents) scales covariance.
To compare relationships across different scales, we standardize by each variable’s standard deviation.
Correlation is:

ρ₍X,Y₎ = Cov(X, Y) / (σ_X σ_Y)

where σ_X = √Var(X) and σ_Y = √Var(Y).
Interpretation:

- ρ = +1: perfect positive linear relationship.
- ρ = −1: perfect negative linear relationship.
- ρ = 0: no linear association (dependence may still exist).
If you take paired samples and center them, each data point becomes a 2D vector from the mean: (xᵢ − x̄, yᵢ − ȳ).
A helpful geometric intuition is that correlation behaves like a normalized “alignment” between the centered coordinates of and .
Here’s a simple ASCII scatter showing a positive association. The “+” is the mean; arrows suggest centered deviations.
Y
^ *
| *
| *
| *
| *
| +-----------------> X
| *
|  *

When points trend from bottom-left to top-right, products of deviations are often positive, boosting covariance and correlation.
Covariance/correlation can:

- indicate the direction (sign) of linear co-movement,
- quantify the strength of linear association on a standardized scale (correlation).

They cannot by themselves:

- establish causation,
- rule out a confounding third variable,
- detect nonlinear relationships.
We want a measure that's:

- positive when X and Y deviate from their means in the same direction,
- negative when they deviate in opposite directions.

The product (X − E[X])(Y − E[Y]) does exactly that: it is positive when the deviations share a sign and negative when they differ.
Averaging (expectation) makes it a stable summary.
Start from the definition:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]

Expand step-by-step:

E[(X − E[X])(Y − E[Y])] = E[XY − X·E[Y] − E[X]·Y + E[X]E[Y]]
                        = E[XY] − E[X]E[Y] − E[X]E[Y] + E[X]E[Y]

So:

Cov(X, Y) = E[XY] − E[X]E[Y]
This identity is especially handy in calculations.
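The identity is easy to sanity-check numerically. The sketch below uses simulated data (the seed, sample size, and the relationship y = 2x + noise are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2 * x + rng.normal(size=10_000)

# Definition form: average product of deviations (1/n version)
cov_def = np.mean((x - x.mean()) * (y - y.mean()))
# Identity form: E[XY] - E[X]E[Y]
cov_id = np.mean(x * y) - x.mean() * y.mean()

print(cov_def, cov_id)  # the two formulas agree to floating-point precision
```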
These properties matter constantly in modeling:
1) Adding constants doesn’t change covariance:

Cov(X + a, Y + b) = Cov(X, Y)
Reason: centering removes constants.
2) Scaling scales covariance:

Cov(aX, bY) = ab·Cov(X, Y)
So if you convert meters to centimeters (×100), covariance changes by ×100 on that variable’s side.
Variance is just covariance with itself:

Var(X) = Cov(X, X)
That makes covariance a true generalization of variance.
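All three properties can be verified empirically. A minimal sketch on simulated data (seed and coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50_000)
y = 0.5 * x + rng.normal(size=50_000)

def cov(a, b):
    # 1/n ("population-style") covariance of two samples
    return np.mean((a - a.mean()) * (b - b.mean()))

c = cov(x, y)
print(abs(cov(x + 7, y - 3) - c))      # ~0: constants drop out
print(abs(cov(100 * x, y) - 100 * c))  # ~0: scaling one side scales cov
print(abs(cov(x, x) - np.var(x)))      # ~0: Var(X) = Cov(X, X)
```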
If X and Y are independent, then E[XY] = E[X]E[Y], so:

Cov(X, Y) = E[XY] − E[X]E[Y] = 0

But the reverse is not generally true: Cov(X, Y) = 0 does not guarantee independence. (We’ll build a concrete counterexample later.)
In practice, you usually have paired samples (x₁, y₁), …, (xₙ, yₙ) for (X, Y).
Define sample means:

x̄ = (1/n)∑xᵢ,  ȳ = (1/n)∑yᵢ

A common estimator (unbiased for i.i.d. samples) is:

s_xy = (1/(n−1)) ∑(xᵢ − x̄)(yᵢ − ȳ)
You may also see instead of ; that version is the maximum-likelihood estimator under a normal model but is biased in finite samples.
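As a sketch, the 1/(n−1) estimator can be written directly and checked against NumPy’s built-in `np.cov` (the three-point dataset below is made up):

```python
import numpy as np

def sample_cov(x, y):
    """Unbiased sample covariance with the 1/(n-1) divisor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    return np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

x = [1, 2, 3]
y = [2, 4, 5]
print(sample_cov(x, y))            # 1.5
print(np.cov(x, y, ddof=1)[0, 1])  # 1.5 -- NumPy defaults to 1/(n-1) too
```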
Suppose you compute the covariance between height (in meters) and weight, then redo the calculation with height in centimeters. The relationship is identical, but scaling height by 100 scales covariance by 100. That’s why correlation is so widely used: it removes units.
Covariance answers “do they move together?”, but not “how strong is that co-movement relative to each variable’s natural scale?”
Correlation fixes this by dividing by σ_X σ_Y.

Because Cov(X, Y) has units of X·Y, and σ_X σ_Y also has units of X·Y, the units cancel.
This isn’t just a convention—it’s a theorem. One route uses Cauchy–Schwarz.
Let X̃ = X − E[X] and Ỹ = Y − E[Y].

Apply Cauchy–Schwarz to random variables:

(E[X̃Ỹ])² ≤ E[X̃²]·E[Ỹ²]

But E[X̃²] = Var(X), and similarly E[Ỹ²] = Var(Y).

So:

Cov(X, Y)² ≤ Var(X)·Var(Y),  i.e.  |Cov(X, Y)| ≤ σ_X σ_Y

Divide both sides by σ_X σ_Y (assuming both are nonzero):

|ρ₍X,Y₎| ≤ 1
Equality in Cauchy–Schwarz occurs when one variable is an exact scalar multiple of the other (almost surely) after centering.
That means:

|ρ| = 1 exactly when, after centering, one variable is an exact scalar multiple of the other.

Equivalently, Y = aX + b almost surely for some constants a ≠ 0 and b.
For a dataset, define centered vectors:

x = (x₁ − x̄, …, xₙ − x̄),  y = (y₁ − ȳ, …, yₙ − ȳ)

Then the sample correlation is:

r = (x · y) / (‖x‖ ‖y‖)

That is exactly the cosine of the angle between x and y in ℝⁿ.
Inline “plot” of the vector-angle intuition:
y (centered)
^
/|
/ | cos(θ) = r
/θ |
+---+----------> x (centered)

This is a powerful mental model: correlation is alignment between patterns of deviation-from-mean.
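The cosine identity can be checked directly. A small sketch with made-up numbers:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

xc, yc = x - x.mean(), y - y.mean()   # centered vectors
cos_theta = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
r = np.corrcoef(x, y)[0, 1]

print(cos_theta, r)  # identical values: r is the cosine of the angle
```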
Using the sample covariance s_xy and sample standard deviations s_x, s_y:

r = s_xy / (s_x s_y)
Because it depends on products of deviations, a single extreme point can dominate.
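To see the sensitivity, the sketch below adds one extreme point to otherwise independent data (the seed and the outlier’s coordinates are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = rng.normal(size=30)          # independent draws: true correlation is 0

r_clean = np.corrcoef(x, y)[0, 1]

# Append a single extreme point far from the cloud
x_out = np.append(x, 10.0)
y_out = np.append(y, 10.0)
r_out = np.corrcoef(x_out, y_out)[0, 1]

print(r_clean, r_out)  # one point drags the correlation strongly positive
```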
Practical note: always plot the data; if a few extreme points dominate, consider robust alternatives such as rank (Spearman) correlation.

A strong correlation can arise from:

- a genuine linear relationship,
- a confounding third variable driving both,
- outliers or coincidence in small samples.
In the ice-cream example, temperature (a confounder) increases both ice cream consumption and swimming, which increases drowning risk.
Covariance/correlation quantify association; causal claims require additional design/assumptions.
Cov(X, Y) = 0 means the linear term in their relationship is absent in a precise sense: the best linear predictor of Y from X has slope zero.

But X can still strongly determine Y nonlinearly.
We’ll show an explicit example in the worked examples: X ~ Uniform(−1, 1) and Y = X². Then the covariance is 0, yet Y is completely determined by X.
Special case worth knowing: if X and Y are jointly normal, then zero covariance does imply independence. (Not true in general.)
With n random variables, stack them into a random vector X ∈ ℝⁿ.

Define the covariance matrix:

Σ = E[(X − μ)(X − μ)ᵀ],  where μ = E[X], so Σᵢⱼ = Cov(Xᵢ, Xⱼ)

Key facts:

- Σ is symmetric: Σᵢⱼ = Σⱼᵢ.
- Σ is positive semidefinite: vᵀΣv = Var(vᵀX) ≥ 0 for every vector v.
- The diagonal entries are variances: Σᵢᵢ = Var(Xᵢ).
This matrix is the central object in multivariate statistics.
PCA looks for directions in feature space with maximum variance.
If you have centered data vectors x ∈ ℝᵈ (d features), PCA finds unit vectors v ∈ ℝᵈ maximizing:

Var(vᵀx) = vᵀΣv

The solutions are eigenvectors of the covariance matrix Σ; the maximizer is the eigenvector with the largest eigenvalue.
So learning covariance is not just about pairwise relationships—it’s about understanding the geometry of data clouds and the linear structure PCA extracts.
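A minimal PCA sketch under these definitions (the 2-D synthetic cloud, its 0.8 slope, and the seed are arbitrary): build the sample covariance matrix and take its top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(size=500)
# Second feature mostly follows the first, plus a little noise
data = np.column_stack([z, 0.8 * z + 0.3 * rng.normal(size=500)])

centered = data - data.mean(axis=0)
sigma = centered.T @ centered / (len(data) - 1)  # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(sigma)         # eigh: for symmetric matrices
top = eigvecs[:, np.argmax(eigvals)]             # direction of max variance

print(top)  # roughly aligned with the dominant (1, 0.8) direction, up to sign
```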
| Goal | Use covariance? | Use correlation? | Notes |
|---|---|---|---|
| Keep physical units (e.g., risk in $·days) | ✅ | ❌ | Covariance preserves scale |
| Compare relationships across different units | ❌ | ✅ | Correlation is unitless |
| Build PCA on raw feature scales | ✅ | ❌/✅ | Often you choose covariance PCA or correlation PCA (standardized) |
| Detect any dependence (including nonlinear) | ❌ | ❌ | Need other tools (MI, plots, kernels, etc.) |
You observe n=5 paired measurements:
X: [1, 2, 3, 4, 5]
Y: [2, 1, 4, 3, 6]
Compute the sample covariance s_xy (with 1/(n−1)) and sample correlation r.
Compute means:
x̄ = (1+2+3+4+5)/5 = 15/5 = 3
ȳ = (2+1+4+3+6)/5 = 16/5 = 3.2
Compute centered values and products:
For each i, compute (xᵢ − x̄), (yᵢ − ȳ), and their product.
i=1: x=1 → −2; y=2 → −1.2; product = (−2)(−1.2)= 2.4
i=2: x=2 → −1; y=1 → −2.2; product = (−1)(−2.2)= 2.2
i=3: x=3 → 0; y=4 → 0.8; product = 0·0.8= 0
i=4: x=4 → 1; y=3 → −0.2; product = 1·(−0.2)= −0.2
i=5: x=5 → 2; y=6 → 2.8; product = 2·2.8= 5.6
Sum of products:
∑(xᵢ − x̄)(yᵢ − ȳ) = 2.4 + 2.2 + 0 − 0.2 + 5.6 = 10.0
Sample covariance:

s_xy = 10.0 / (n − 1) = 10.0 / 4 = 2.5
Compute sample standard deviations.
First compute sums of squares.
For X:
(xᵢ − x̄)²: 4, 1, 0, 1, 4 → sum = 10
So s_x² = 10/(n−1)=10/4=2.5 → s_x = √2.5
For Y:
(yᵢ − ȳ)²: (−1.2)² = 1.44, (−2.2)² = 4.84, 0.8² = 0.64, (−0.2)² = 0.04, 2.8² = 7.84
Sum = 1.44+4.84+0.64+0.04+7.84 = 14.8
So s_y² = 14.8/4 = 3.7 → s_y = √3.7
Compute correlation:

r = s_xy / (s_x s_y) = 2.5 / (√2.5 · √3.7) = 2.5 / √9.25
Numerically, √9.25 ≈ 3.041, so
r ≈ 2.5 / 3.041 ≈ 0.822.
Insight: The covariance (2.5) says X and Y tend to be above/below their means together, but its magnitude depends on X and Y units. The correlation (~0.82) shows a fairly strong positive linear association on a standardized scale.
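The arithmetic can be double-checked in a couple of lines using NumPy’s estimators, which use the same 1/(n−1) convention:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 1, 4, 3, 6], dtype=float)

s_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance
r = np.corrcoef(x, y)[0, 1]         # sample correlation

print(s_xy)  # 2.5
print(r)     # ~0.822
```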
Let X ~ Uniform(−1, 1) and define Y = X². Compute Cov(X,Y). Are X and Y independent?
Compute expectations.
By symmetry of Uniform(−1,1), the distribution is symmetric around 0, so E[X] = 0.
Compute E[Y]. Since Y = X², for X ~ Uniform(−1,1) with density f(x) = 1/2 on [−1,1]:

E[Y] = E[X²] = ∫₋₁¹ x² · (1/2) dx = [x³/6]₋₁¹ = 1/3
Compute E[XY]. Since Y = X², we have XY = X·X² = X³. Then:

E[XY] = E[X³] = ∫₋₁¹ x³ · (1/2) dx

But x³ is an odd function and the interval is symmetric, so the integral is 0:

E[XY] = 0
Compute covariance using E[XY] − E[X]E[Y]:

Cov(X, Y) = 0 − 0 · (1/3) = 0
Check independence intuition.
If X and Y were independent, knowing X would not give information about Y.
But here Y is determined exactly by X via Y = X². For example, learning X = 0.7 pins down Y = 0.49 exactly.
So Y is not independent of X.
Insight: Covariance zero means “no linear association,” not “no relationship.” The relationship here is perfectly nonlinear: the scatter is a U-shape. Covariance and correlation miss that shape even though dependence is complete.
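A quick simulation makes the point concrete (sample size and seed are arbitrary): the sample covariance of X and X² hovers near zero even though the dependence is perfect.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=200_000)
y = x ** 2                          # Y is a deterministic function of X

print(np.cov(x, y, ddof=1)[0, 1])   # ~0: no linear association
print(np.corrcoef(x, y)[0, 1])      # ~0 as well, despite total dependence
```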
Let X and Y be random variables with Corr(X,Y)=ρ. Define X' = 3X + 10 and Y' = −2Y + 5. Find Corr(X',Y').
Use covariance and standard deviation scaling rules.
Constants don’t affect covariance:
Cov(X+a, Y+b)=Cov(X,Y).
Scaling: Cov(cX, dY)=cd Cov(X,Y).
Compute Cov(X',Y'):

X' = 3X + 10, Y' = −2Y + 5

Cov(X', Y') = Cov(3X + 10, −2Y + 5) = (3)(−2) Cov(X, Y) = −6 Cov(X, Y)
Compute σ_{X'} and σ_{Y'}.
Standard deviation scales by absolute value:
σ_{3X+10} = |3|σ_X = 3σ_X
σ_{−2Y+5} = |−2|σ_Y = 2σ_Y
Compute correlation:

Corr(X', Y') = Cov(X', Y') / (σ_{X'} σ_{Y'}) = (−6 Cov(X, Y)) / (3σ_X · 2σ_Y) = −Cov(X, Y)/(σ_X σ_Y) = −ρ
Insight: Correlation ignores shifts and positive rescalings (unit changes), but a negative scaling flips the sign because it reverses one axis.
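A numeric sketch of this result (the simulated pair and its particular correlation are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100_000)
y = 0.6 * x + 0.8 * rng.normal(size=100_000)  # construct a correlated pair

rho = np.corrcoef(x, y)[0, 1]
rho_prime = np.corrcoef(3 * x + 10, -2 * y + 5)[0, 1]

print(rho, rho_prime)  # rho_prime equals -rho (up to floating point)
```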
Covariance measures average joint deviation: Cov(X, Y) = E[(X − E[X])(Y − E[Y])].

A useful identity: Cov(X, Y) = E[XY] − E[X]E[Y].
Covariance has units (units of X times units of Y), so its magnitude depends on scaling choices.
Correlation standardizes covariance: ρ₍X,Y₎ = Cov(X, Y)/(σ_X σ_Y), producing a unitless number in [−1, 1].
Correlation captures linear association; strong nonlinear dependence can still yield ρ≈0.
Independence ⇒ zero covariance, but zero covariance ⇏ independence in general (e.g., Y=X² with symmetric X).
In sample form, correlation equals cosine similarity between centered data vectors x and y.
Covariance matrices generalize covariance to many variables and are the core input to PCA via eigenvectors.
Interpreting correlation as causation (confounding variables can create high correlation).
Assuming ρ=0 implies independence (it only rules out linear association unless extra assumptions like joint normality hold).
Comparing covariance magnitudes across datasets with different units/scales (use correlation or standardize first).
Ignoring outliers: a few extreme points can drastically change covariance/correlation and hide the typical pattern.
Let X have E[X]=2, Var(X)=9. Let Y = 5 − 2X. Compute Cov(X,Y) and Corr(X,Y).
Hint: Use Cov(X, a+bX) = b Var(X). For correlation, note Y is exactly linear in X.
Compute covariance:
Y = 5 − 2X ⇒ Cov(X,Y) = Cov(X, 5 − 2X) = −2 Cov(X,X) = −2 Var(X) = −2·9 = −18.
For correlation, since Y is an exact decreasing linear function of X, Corr(X,Y)=−1.
(You can also compute σ_Y = |−2|σ_X = 2·3=6, so Corr = (−18)/(3·6)=−1.)
Suppose E[X]=1, E[Y]=3, and E[XY]=10. Compute Cov(X,Y).
Hint: Use Cov(X,Y)=E[XY]−E[X]E[Y].
Cov(X,Y) = 10 − (1)(3) = 7.
Let X be Uniform(0,1) and Y = X. Compute Corr(X,Y). Then let Z = X² and reason (without heavy computation) whether Corr(X,Z) is closer to 1, closer to 0, or could be negative.
Hint: For Y=X, it’s perfect linear. For Z=X² on (0,1), it’s increasing but nonlinear; think about whether larger X tends to mean larger Z.
For Y=X, correlation is 1 because Y is an exact positive linear function of X (Y=1·X+0).
For Z=X² with X∈(0,1), Z is strictly increasing in X, so larger X tends to correspond to larger Z. That suggests a positive covariance and positive correlation. However, because the relationship is nonlinear, Corr(X,Z) will be less than 1 (not a perfect straight line). It cannot be negative here because X and X² move in the same direction on (0,1). So Corr(X,Z) is closer to 1 than to 0, but still < 1.
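For the Uniform(0,1) case the exact value works out to Corr(X, X²) = √15/4 ≈ 0.968 (using E[Xᵏ] = 1/(k+1)). A simulation sketch confirms it sits near 1 but below it:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=200_000)

r = np.corrcoef(x, x ** 2)[0, 1]
print(r)  # ~0.968: strongly positive, but not a perfect straight line
```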