Dimensionality reduction via eigenvectors of covariance.
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
You have a dataset with 200 features. You suspect only a few underlying “directions” actually matter (e.g., lighting vs pose in images, or overall size vs shape in tabular data). PCA is the classic tool for discovering those directions—purely from geometry—and then compressing the data while losing as little information as possible (in a precise sense).
Principal Component Analysis (PCA) finds orthogonal directions v₁, v₂, … (eigenvectors of the data covariance) that capture maximum variance. Projecting onto the top-k components gives the best k-dimensional linear approximation: it maximizes captured variance and equivalently minimizes squared reconstruction error. PCA can be computed via covariance eigen-decomposition or via SVD of the centered data matrix.
PCA is a linear dimensionality reduction method. It replaces your original coordinate system (your original features) with a new coordinate system whose axes are mutually orthogonal and ordered by how much of the data’s variance they capture.
Those axes are called principal components.
Imagine your data points as a cloud in ℝᵈ (d features). If that cloud is stretched out more in one direction than others, there’s a “long axis” through the cloud. Projecting onto that axis preserves a lot of the spread (variance) of the data. PCA finds that axis, then the next-most-informative orthogonal axis, and so on.
We’ll set up notation carefully so you don’t have to guess later.
We write the data as X ∈ ℝⁿˣᵈ (n samples as rows, d features as columns) and center it:

$$X_c = X - \frac{1}{n}\mathbf{1}\mathbf{1}^T X,$$

where 𝟙 ∈ ℝⁿ is the all-ones column vector.
Two covariance conventions are common: scaling by 1/n or by 1/(n−1). For PCA, the eigenvectors are the same under both (they differ only by a scalar factor in eigenvalues). In this lesson, we’ll mostly use

$$\Sigma = \frac{1}{n} X_c^T X_c$$

to keep algebra clean.
PCA produces:

- orthonormal directions v₁, …, v_d (the eigenvectors of Σ), and
- eigenvalues λ₁ ≥ λ₂ ≥ ⋯ ≥ λ_d ≥ 0.

Interpretation: λⱼ is the variance of the data along direction vⱼ, so the first few components are the highest-variance directions.
When you keep only the top-k components, PCA gives the best k-dimensional linear approximation in two equivalent ways:
1) Maximum captured variance among all k-dimensional subspaces.
2) Minimum squared reconstruction error among all rank-k linear projections.
We’ll prove both—but slowly, with checkpoints.
If you scale your covariance by 1/(n−1) instead of 1/n, what changes?
Answer: eigenvectors (principal directions) do not change; eigenvalues scale by n/(n−1).
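A quick NumPy check of this fact (the data here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Xc = X - X.mean(axis=0)          # center the data
n = X.shape[0]

# Two covariance conventions.
S_n  = Xc.T @ Xc / n
S_n1 = Xc.T @ Xc / (n - 1)

evals_n,  evecs_n  = np.linalg.eigh(S_n)
evals_n1, evecs_n1 = np.linalg.eigh(S_n1)

# Eigenvalues scale by n/(n-1); eigenvectors are unchanged (up to sign).
ratio = evals_n1 / evals_n
same_dirs = np.allclose(np.abs(evecs_n.T @ evecs_n1), np.eye(3), atol=1e-8)
```

Up to sign, both conventions return the same directions; only the eigenvalue scale changes.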
PCA starts with a very specific optimization question:
Among all unit directions v in feature space, which direction makes the projected data have the largest variance?
That direction will be v₁, the first principal component.
Take a unit vector v ∈ ℝᵈ with ‖v‖₂ = 1 (equivalently vᵀv = 1).
Project a centered data point x onto v:

$$z = v^T \mathbf{x} \in \mathbb{R}.$$

For a random centered vector x (think of drawing a row from X_c), the variance of z = vᵀx is:

$$\mathrm{Var}(v^T \mathbf{x}) = \mathbb{E}[(v^T \mathbf{x})^2].$$

Using 𝔼[x] = 0 and standard covariance rules:

$$\mathbb{E}[(v^T \mathbf{x})^2] = v^T \,\mathbb{E}[\mathbf{x}\mathbf{x}^T]\, v = v^T \Sigma v.$$

So we want to maximize vᵀΣv subject to vᵀv = 1.
The expression

$$R(v) = \frac{v^T \Sigma v}{v^T v}$$

is the Rayleigh quotient. Under the unit-norm constraint vᵀv = 1, maximizing R(v) is the same as maximizing vᵀΣv.
Set up the Lagrangian:

$$\mathcal{L}(v, \lambda) = v^T \Sigma v - \lambda (v^T v - 1).$$

Differentiate w.r.t. v and set to zero:

1) Gradient of quadratic form: ∇ᵥ(vᵀΣv) = 2Σv (since Σ is symmetric)
2) Gradient of constraint term: ∇ᵥ[λ(vᵀv − 1)] = 2λv

So:

$$2\Sigma v - 2\lambda v = 0.$$

Divide by 2:

$$\Sigma v = \lambda v.$$

That is exactly the eigenvector equation.
So any optimum must be an eigenvector of Σ.
To see which eigenvector gives the maximum, plug an eigenvector v with eigenvalue λ into the objective:

$$v^T \Sigma v = v^T (\lambda v) = \lambda\, v^T v = \lambda.$$

Under the unit constraint, the objective value at an eigenvector equals its eigenvalue. Therefore, the maximum is attained by the eigenvector with the largest eigenvalue.
So:

$$v_1 = \arg\max_{\|v\|_2 = 1} v^T \Sigma v, \qquad v_1^T \Sigma v_1 = \lambda_1.$$
After choosing v₁, we want v₂ to maximize variance subject to being orthogonal to v₁:

$$v_2 = \arg\max_{\|v\|_2 = 1,\; v^T v_1 = 0} v^T \Sigma v.$$

The solution is the eigenvector with the next-largest eigenvalue, and so on. Because Σ is symmetric positive semidefinite, it has an orthonormal eigenbasis, so the eigenvectors can be chosen orthogonal.
Let $\Sigma = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}$. Which unit vector v maximizes vᵀΣv?
Solution sketch: eigenvalues are 2 and 1 with eigenvectors [1,0]ᵀ and [0,1]ᵀ. Max is along [1,0]ᵀ with value 2.
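You can sanity-check this numerically: for Σ = diag(2, 1), no random unit vector beats the top eigenvalue (a small illustrative sketch):

```python
import numpy as np

Sigma = np.diag([2.0, 1.0])

# Rayleigh quotient v^T Sigma v for many random unit vectors v.
rng = np.random.default_rng(1)
vs = rng.normal(size=(1000, 2))
vs /= np.linalg.norm(vs, axis=1, keepdims=True)   # normalize to unit length
values = np.einsum('ij,jk,ik->i', vs, Sigma, vs)  # v^T Sigma v per row

best_random = values.max()
top_eigenvalue = np.linalg.eigvalsh(Sigma).max()  # attained at [1, 0]
```

Every sampled value stays at or below the top eigenvalue, as the Rayleigh-quotient argument predicts.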
Finding v₁ is only the beginning. PCA’s power comes from selecting a whole k-dimensional subspace that best represents the data.
There are two common ways to describe what “best” means:
1) Keep as much variance as possible (information-as-spread viewpoint)
2) Reconstruct with minimal squared error (compression viewpoint)
A key learning goal is to see why these are actually the same objective.
Let V_k = [v₁ | ⋯ | v_k] ∈ ℝᵈˣᵏ with orthonormal columns (V_kᵀV_k = I_k).
Given a centered data point x, its coordinates in the PCA subspace are:

$$\mathbf{z} = V_k^T \mathbf{x} \in \mathbb{R}^k.$$

The projection back to ℝᵈ (the rank-k approximation of x) is:

$$\hat{\mathbf{x}} = V_k \mathbf{z} = V_k V_k^T \mathbf{x}.$$

So the projection matrix is P = V_k V_kᵀ.
The variance captured by projecting onto the k-dimensional subspace is the expected squared norm of the projected vector:
Compute it:
\[\begin{aligned}
\mathbb{E}[\|V_k^T \mathbf{x}\|_2^2]
&= \mathbb{E}[ (V_k^T \mathbf{x})^T (V_k^T \mathbf{x}) ] \\
&= \mathbb{E}[ \mathbf{x}^T V_k V_k^T \mathbf{x} ] \\
&= \mathrm{tr}(\mathbb{E}[ \mathbf{x}\mathbf{x}^T ] V_k V_k^T) \quad \text{(cyclic trace)} \\
&= \mathrm{tr}(\Sigma V_k V_k^T).
\end{aligned}\]
If V_k is made of the top-k eigenvectors of Σ, then:

$$\mathrm{tr}(\Sigma V_k V_k^T) = \sum_{j=1}^{k} \lambda_j.$$

So the top-k PCA subspace captures variance equal to the sum of the top k eigenvalues.
A common diagnostic is the explained variance ratio:

$$\mathrm{EVR}(k) = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{d} \lambda_j}.$$

This tells you what fraction of total variance you keep.
If eigenvalues are [5, 2, 1, 1, 1], what k gives ≥80% explained variance?
Solution: total = 10. k=1 gives 50%, k=2 gives 70%, k=3 gives 80% → k=3.
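The same computation in code (a minimal sketch of the explained-variance diagnostic):

```python
import numpy as np

eigenvalues = np.array([5.0, 2.0, 1.0, 1.0, 1.0])

# Cumulative explained variance ratio for each k = 1, 2, ..., d.
evr = np.cumsum(eigenvalues) / eigenvalues.sum()

# Smallest k reaching at least 80% explained variance (k is 1-indexed).
k = int(np.searchsorted(evr, 0.80) + 1)
```

Here `evr` is [0.5, 0.7, 0.8, 0.9, 1.0], so the threshold is first met at k = 3, matching the checkpoint.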
Now we switch viewpoints: you want to compress data by keeping only k numbers per sample (the coordinates z) and then reconstruct.
For a point x, the reconstruction error is:

$$\|\mathbf{x} - V_k V_k^T \mathbf{x}\|_2^2.$$

Consider the expected reconstruction error over the data distribution:

$$\mathbb{E}[\|\mathbf{x} - V_k V_k^T \mathbf{x}\|_2^2].$$

Because P = V_k V_kᵀ is an orthogonal projector (P² = P = Pᵀ), the projected part Px is orthogonal to the residual x − Px.
So by Pythagoras:

$$\|\mathbf{x}\|_2^2 = \|V_k V_k^T \mathbf{x}\|_2^2 + \|\mathbf{x} - V_k V_k^T \mathbf{x}\|_2^2.$$

Rearrange:

$$\|\mathbf{x} - V_k V_k^T \mathbf{x}\|_2^2 = \|\mathbf{x}\|_2^2 - \|V_k V_k^T \mathbf{x}\|_2^2.$$

Take expectation:
\[\begin{aligned}
\mathbb{E}[\|\mathbf{x} - V_k V_k^T \mathbf{x}\|_2^2]
&= \mathbb{E}[\|\mathbf{x}\|_2^2] - \mathbb{E}[\|V_k V_k^T \mathbf{x}\|_2^2] \\
&= \mathbb{E}[\|\mathbf{x}\|_2^2] - \mathbb{E}[\|V_k^T \mathbf{x}\|_2^2].
\end{aligned}\]
The first term is constant with respect to the choice of subspace. So:

$$\arg\min_{V_k} \mathbb{E}[\|\mathbf{x} - V_k V_k^T \mathbf{x}\|_2^2] = \arg\max_{V_k} \mathbb{E}[\|V_k^T \mathbf{x}\|_2^2].$$

Minimizing reconstruction error is equivalent to maximizing captured projected energy/variance.
So the same V_k that maximizes captured variance also minimizes reconstruction error.
If you choose V_k as the top-k eigenvectors, the variance you don’t capture is the sum of the remaining eigenvalues:

$$\mathbb{E}[\|\mathbf{x} - V_k V_k^T \mathbf{x}\|_2^2] = \sum_{j=k+1}^{d} \lambda_j.$$

This gives a clean interpretation: the expected reconstruction error is exactly the variance in the discarded directions.
Eigenvalues are [10, 3, 2]. If you keep k=1, what is expected reconstruction error (in variance units)?
Solution: drop eigenvalues 3 and 2 → error = 5.
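Here is an empirical check of this identity on synthetic data (the scaling matrix and seed are illustrative choices): the mean squared reconstruction error after projecting onto the top-k eigenvectors equals the sum of the dropped eigenvalues of the sample covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)
n = X.shape[0]

Sigma = Xc.T @ Xc / n
evals, evecs = np.linalg.eigh(Sigma)        # ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]  # re-sort descending

k = 2
Vk = evecs[:, :k]
X_hat = Xc @ Vk @ Vk.T                      # rank-k reconstruction

# Mean squared reconstruction error = sum of dropped eigenvalues.
mse = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
dropped = evals[k:].sum()
```

The identity holds exactly (up to floating point) because it is a finite-sample statement about the sample covariance, not an asymptotic one.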
So far we reasoned per-vector. PCA also gives the best rank-k matrix approximation of the centered data matrix in Frobenius norm.
Let X_c ∈ ℝⁿˣᵈ have SVD:

$$X_c = U \Sigma_s V^T$$

(where we call the diagonal matrix of singular values Σ_s to avoid confusing it with the covariance Σ).
The Eckart–Young theorem says the best rank-k approximation in Frobenius norm is:

$$X_c^{(k)} = U_k \Sigma_{s,k} V_k^T,$$

obtained by keeping the top k singular values and their singular vectors.
This is exactly what PCA is doing in matrix form.
Connecting to covariance:
\[\begin{aligned}
\frac{1}{n} X_c^T X_c
&= \frac{1}{n} (V \Sigma_s U^T)(U \Sigma_s V^T) \\
&= \frac{1}{n} V \Sigma_s^2 V^T.
\end{aligned}\]
So:

$$\Sigma = \frac{1}{n} V \Sigma_s^2 V^T, \qquad \lambda_j = \frac{\sigma_j^2}{n}.$$

That’s why PCA can be computed without explicitly forming the covariance matrix, especially when d is large.
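A small sketch comparing the two computation paths on random data; both should agree via λⱼ = σⱼ²/n:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)
n = X.shape[0]

# Path A: eigen-decomposition of the covariance matrix.
Sigma = Xc.T @ Xc / n
evals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]

# Path B: SVD of the centered data; no covariance matrix needed.
s = np.linalg.svd(Xc, compute_uv=False)     # singular values, descending

lam_from_svd = s ** 2 / n
```

For large d, Path B avoids forming the d×d covariance entirely and is usually better conditioned.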
PCA is simple to state but easy to misuse in practice. This section focuses on how PCA is applied, what choices matter, and how it connects to the tools you already know.
Common goals:
PCA assumes the data is centered.
If you forget this, the first component can point mostly toward the mean rather than describing variation.
If features have different units/scales, covariance will be dominated by high-variance features.
| Choice | What you compute PCA on | When it’s appropriate | Effect |
|---|---|---|---|
| Covariance PCA | Centered data X_c | Features share comparable units/scales | Preserves natural variance magnitudes |
| Correlation PCA | Standardized data (z-scores) | Mixed units (e.g., dollars + seconds), different scales | Treats each feature as equally scaled |
Standardization: x̃ⱼ = (xⱼ − μⱼ)/σⱼ, where σⱼ is the standard deviation of feature j.
Two equivalent computation paths:
Path A: Covariance eigen-decomposition (good when d is modest): center the data, form Σ = (1/n) X_cᵀ X_c, and eigen-decompose Σ.
Path B: SVD of centered data (often better numerically and when d is large): compute X_c = U Σ_s Vᵀ; the columns of V are the principal directions and λⱼ = σⱼ²/n.
Typical heuristics for choosing k: pick the smallest k whose explained variance ratio exceeds a threshold (e.g., 90–95%), look for an “elbow” in the eigenvalue (scree) plot, or cross-validate downstream task performance.
Transform (encode): z = V_kᵀ(x − μ).
Reconstruct (decode): x̂ = V_k z + μ
(remember to add the mean back).
Why must you apply the training mean and scaling to the test set (not recompute them on test)?
Solution: recomputing uses test information (data leakage) and changes the coordinate system; transformed features wouldn’t be comparable.
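A minimal fit/transform sketch in NumPy illustrating the right discipline (the helper names `encode`/`decode` are ours, not a library API): the mean and directions are estimated on the training split only, then reused on test data.

```python
import numpy as np

rng = np.random.default_rng(4)
X_train = rng.normal(size=(80, 6))
X_test = rng.normal(size=(20, 6))

# Fit on the training set only.
mu = X_train.mean(axis=0)
Xc = X_train - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Vk = Vt[:k].T                     # top-k principal directions (d x k)

# Encode/decode: apply the *training* mean to both splits.
def encode(X):
    return (X - mu) @ Vk

def decode(Z):
    return Z @ Vk.T + mu          # add the training mean back

Z_test = encode(X_test)
X_test_hat = decode(Z_test)
```

Note that `encode(decode(Z)) == Z`: decoding then re-encoding is lossless because the decoded points already lie in the PCA subspace.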
When V is orthonormal (VᵀV = VVᵀ = I), multiplying by Vᵀ is a rotation/reflection (distance-preserving). PCA can be thought of as:
1) Rotate into the eigenvector basis: y = Vᵀx
2) Drop the last d−k coordinates (the small-variance directions)
This makes clear why PCA often denoises: if noise is roughly isotropic, it spreads across many directions and tends to be captured by small eigenvalues.
Sometimes you want the transformed variables to have unit variance and be uncorrelated. PCA already makes them uncorrelated; whitening also rescales.
Let Λ_k = diag(λ₁, …, λ_k) be diagonal with the top-k eigenvalues. PCA coordinates are:

$$\mathbf{z} = V_k^T \mathbf{x}.$$

Whitened coordinates:

$$\mathbf{z}_w = \Lambda_k^{-1/2} V_k^T \mathbf{x}.$$

Caution: whitening can amplify noise in low-variance directions (because you divide by small √λⱼ).
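A short sketch of whitening on synthetic data (the mixing matrix is an arbitrary illustrative choice); the whitened coordinates should have identity covariance:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])
Xc = X - X.mean(axis=0)
n = X.shape[0]

Sigma = Xc.T @ Xc / n
evals, evecs = np.linalg.eigh(Sigma)
evals, evecs = evals[::-1], evecs[:, ::-1]   # descending

k = 2
Vk, Lk = evecs[:, :k], evals[:k]

Z = Xc @ Vk                       # PCA coordinates: uncorrelated
Zw = Z / np.sqrt(Lk)              # whitened: uncorrelated AND unit variance

cov_Zw = Zw.T @ Zw / n            # should be the 2x2 identity
```

Plain PCA coordinates have covariance diag(λ₁, λ₂); dividing by √λⱼ rescales each axis to unit variance.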
Since you already know SVD, here’s the clean link you can reuse: for X_c = U Σ_s Vᵀ, the right singular vectors V are the principal directions, λⱼ = σⱼ²/n, and the score matrix is

$$Z = X_c V_k = U_k \Sigma_{s,k}.$$

This last identity is handy: you can get the k-dimensional representation directly as U_k Σ_{s,k} without forming the covariance matrix explicitly in some streaming/large-scale settings.
PCA is linear. It fails when structure is nonlinear.
Still, PCA is often the baseline: fast, interpretable, and a strong first step.
Suppose your centered 2D data has covariance

$$\Sigma = \begin{bmatrix} 4 & 2 \\ 2 & 3 \end{bmatrix}.$$

Find the principal components and how much variance the first component explains.
Compute eigenvalues from det(Σ − λI)=0:

$$\det\begin{bmatrix} 4-\lambda & 2 \\ 2 & 3-\lambda \end{bmatrix} = 0.$$

Expand:

$$(4-\lambda)(3-\lambda) - 4 = \lambda^2 - 7\lambda + 8 = 0.$$

Solve the quadratic:

$$\lambda = \frac{7 \pm \sqrt{49 - 32}}{2} = \frac{7 \pm \sqrt{17}}{2}.$$

Order them:

$$\lambda_1 = \frac{7 + \sqrt{17}}{2} \approx 5.561, \qquad \lambda_2 = \frac{7 - \sqrt{17}}{2} \approx 1.438.$$

Find eigenvector for λ₁ by solving (Σ − λ₁I)v=0:

$$\begin{bmatrix} 4-\lambda_1 & 2 \\ 2 & 3-\lambda_1 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$

Use the first row: (4−λ₁)v₁ + 2v₂ = 0 ⇒ 2v₂ = (λ₁−4)v₁.
So ratio:

$$\frac{v_2}{v_1} = \frac{\lambda_1 - 4}{2}.$$

Plug λ₁ ≈ 5.561:

$$\frac{v_2}{v_1} \approx \frac{1.561}{2} \approx 0.7805.$$

Choose a convenient vector proportional to [1, 0.7805]ᵀ and normalize:
Norm ≈ √(1² + 0.7805²) = √(1 + 0.609) = √1.609 ≈ 1.268.
So

$$\mathbf{v}_1 \approx \begin{bmatrix} 0.788 \\ 0.615 \end{bmatrix}.$$

Similarly, v₂ is orthogonal to v₁ (or solve for λ₂). One orthonormal choice is approximately:

$$\mathbf{v}_2 \approx \begin{bmatrix} -0.615 \\ 0.788 \end{bmatrix}.$$

Compute explained variance ratio for k=1:
Total variance = tr(Σ)=4+3=7 = λ₁+λ₂.

$$\mathrm{EVR}(1) = \frac{\lambda_1}{\lambda_1 + \lambda_2} \approx \frac{5.561}{7} \approx 0.794.$$
Insight: Eigenvectors give the principal directions; eigenvalues literally are the variances along those directions. The explained variance ratio is just “how much of the trace you keep.”
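You can confirm the arithmetic above with NumPy:

```python
import numpy as np

Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])

evals, evecs = np.linalg.eigh(Sigma)   # eigh returns ascending order
lam1, lam2 = evals[1], evals[0]

v1 = evecs[:, 1]                       # top eigenvector (sign arbitrary)
ratio = v1[1] / v1[0]                  # component ratio v2/v1

evr1 = lam1 / np.trace(Sigma)          # explained variance ratio for k=1
```

The exact values are λ₁ = (7 + √17)/2 and EVR(1) = λ₁/7 ≈ 0.794.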
Let the centered data matrix be

$$X_c = \begin{bmatrix} 2 & 0 \\ 0 & 1 \\ -2 & 0 \\ 0 & -1 \end{bmatrix}.$$

Compute the covariance, its eigenvalues, and relate them to singular values of X_c.
Compute covariance using Σ = (1/n) X_cᵀ X_c with n=4.
First compute X_cᵀ X_c:

$$X_c^T X_c = \begin{bmatrix} 2 & 0 & -2 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 1 \\ -2 & 0 \\ 0 & -1 \end{bmatrix}.$$

Multiply:

- (1,1) entry: 2² + 0² + (−2)² + 0² = 8
- (2,2) entry: 0² + 1² + 0² + (−1)² = 2
- off-diagonal entries: 0

So

$$X_c^T X_c = \begin{bmatrix} 8 & 0 \\ 0 & 2 \end{bmatrix}.$$

Now scale by 1/n = 1/4:

$$\Sigma = \begin{bmatrix} 2 & 0 \\ 0 & 0.5 \end{bmatrix}.$$

Eigenvalues of Σ are clearly λ₁=2, λ₂=0.5 with eigenvectors e₁=[1,0]ᵀ and e₂=[0,1]ᵀ.
Now connect to singular values. By definition, singular values satisfy:

$$\sigma_j^2 = \text{eigenvalue}_j(X_c^T X_c).$$

Here X_cᵀ X_c is diagonal with entries 8 and 2, so the singular values are:

$$\sigma_1 = \sqrt{8} = 2\sqrt{2}, \qquad \sigma_2 = \sqrt{2}.$$

Check the PCA relationship λⱼ = σⱼ²/n:

$$\lambda_1 = \frac{8}{4} = 2, \qquad \lambda_2 = \frac{2}{4} = 0.5.$$
Insight: SVD of centered data is PCA in disguise: right singular vectors are principal directions, and eigenvalues are just scaled squared singular values.
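A quick numerical confirmation, using a data matrix consistent with the entries computed in this example:

```python
import numpy as np

Xc = np.array([[ 2.0,  0.0],
               [ 0.0,  1.0],
               [-2.0,  0.0],
               [ 0.0, -1.0]])
n = Xc.shape[0]

Sigma = Xc.T @ Xc / n
evals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]   # descending

s = np.linalg.svd(Xc, compute_uv=False)            # singular values
lam_from_svd = s ** 2 / n                          # should match evals
```

Both routes give eigenvalues (2, 0.5), and the singular values are √8 and √2.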
Assume a 3D centered dataset has covariance eigenvalues

$$\lambda_1 = 6, \qquad \lambda_2 + \lambda_3 = 3 \qquad (\text{so } \lambda_1 + \lambda_2 + \lambda_3 = 9).$$

Compare expected reconstruction error for k=1 vs k=2.
Total variance (total expected squared norm) is:

$$\mathbb{E}[\|\mathbf{x}\|_2^2] = \lambda_1 + \lambda_2 + \lambda_3 = 9.$$

For k=1, you keep λ₁ and drop λ₂, λ₃.
Expected reconstruction error:

$$\lambda_2 + \lambda_3 = 9 - \lambda_1 = 3.$$

For k=2, you keep λ₁+λ₂ and drop λ₃.
Expected reconstruction error:

$$\lambda_3 = 9 - (\lambda_1 + \lambda_2) \le 3,$$

with strict inequality whenever λ₂ > 0.
Consistency check via 'total − captured':

- Captured (k=1): 6 ⇒ error 9−6=3
- Captured (k=2): λ₁+λ₂ ⇒ error 9−(λ₁+λ₂) = λ₃
Insight: Once you’re in the PCA basis, reconstruction error is literally the energy in the discarded coordinates. That’s why eigenvalues are such a direct diagnostic.
A principal component v is a unit direction maximizing projected variance; it is an eigenvector of the covariance matrix.
Eigenvalues quantify variance captured along their eigenvectors; ordering them as λ₁ ≥ λ₂ ≥ ⋯ orders importance.
Projecting onto the top-k components uses z = V_kᵀx (for centered data).
PCA’s top-k subspace maximizes captured variance 𝔼[‖V_kᵀx‖²] and equivalently minimizes expected squared reconstruction error 𝔼[‖x − V_kV_kᵀx‖²].
PCA can be computed via eigen-decomposition of Σ or via SVD X_c = UΣ_sVᵀ; the eigenvalues satisfy λⱼ = σⱼ²/n.
Centering is essential; scaling (standardization) is often essential when features have different units.
Explained variance ratio is a practical way to choose k, but task-driven cross-validation is often better for predictive goals.
Forgetting to center the data (PCA then captures mean offsets rather than variation).
Mixing scaling conventions unintentionally (covariance PCA vs correlation PCA) and misinterpreting components because feature units differ.
Choosing k solely from explained variance without checking downstream performance or the effect of noise/outliers.
Recomputing mean/standard deviation on the test set (data leakage; inconsistent coordinate system).
Let $\Sigma = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}$. Find eigenvalues and an orthonormal set of principal components.
Hint: This matrix has the form a on diagonal and b off-diagonal; eigenvectors are along [1,1] and [1,−1].
Compute eigenvalues: solve det(Σ−λI)=0.
So (3−λ)² − 1 = 0 ⇒ 3−λ = ±1 ⇒ λ₁ = 4, λ₂ = 2.
Eigenvector for λ₁=4: (3−4)v₁ + 1 v₂ = 0 ⇒ −v₁+v₂=0 ⇒ v₂=v₁ ⇒ direction [1,1]. Normalize: v₁ = (1/√2)[1,1].
Eigenvector for λ₂=2: (3−2)v₁ + 1 v₂ =0 ⇒ v₁+v₂=0 ⇒ direction [1,−1]. Normalize: v₂ = (1/√2)[1,−1].
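A numerical check of this exercise:

```python
import numpy as np

Sigma = np.array([[3.0, 1.0],
                  [1.0, 3.0]])
evals, evecs = np.linalg.eigh(Sigma)          # ascending: [2, 4]

v_top = evecs[:, 1]                           # eigenvector for lambda = 4
# Alignment with [1, 1]/sqrt(2), up to sign.
aligned = abs(v_top @ (np.array([1.0, 1.0]) / np.sqrt(2)))
```

The top eigenvector matches (1/√2)[1, 1] up to an arbitrary overall sign.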
Show that for any orthonormal V_k, the expected reconstruction error satisfies

$$\mathbb{E}[\|\mathbf{x} - V_k V_k^T \mathbf{x}\|_2^2] = \mathbb{E}[\|\mathbf{x}\|_2^2] - \mathbb{E}[\|V_k^T \mathbf{x}\|_2^2].$$

Hint: Use that V_kV_kᵀ is an orthogonal projector and apply Pythagoras: the projected part is orthogonal to the residual.
Let p = V_kV_kᵀx and residual r = x − V_kV_kᵀx.
Because V_kV_kᵀ is an orthogonal projector, p lies in the span of the columns of V_k, while r lies in the orthogonal complement. Therefore pᵀr = 0.
So

$$\|\mathbf{x}\|_2^2 = \|p\|_2^2 + \|r\|_2^2.$$

Rearrange:

$$\|r\|_2^2 = \|\mathbf{x}\|_2^2 - \|p\|_2^2.$$

Now note ‖p‖₂² = ‖V_kV_kᵀx‖₂² = ‖V_kᵀx‖₂² because V_k has orthonormal columns. Take expectation of both sides to obtain the identity.
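The projector/Pythagoras argument is easy to verify numerically for a random orthonormal V_k (built here via QR, an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(6)
d, k = 5, 2
# Random matrix with orthonormal columns via QR decomposition.
Vk, _ = np.linalg.qr(rng.normal(size=(d, k)))
x = rng.normal(size=d)

p = Vk @ Vk.T @ x                 # projection onto span(V_k)
r = x - p                         # residual

orth = p @ r                      # should be ~0, so Pythagoras applies
lhs = r @ r                                       # ||r||^2
rhs = x @ x - (Vk.T @ x) @ (Vk.T @ x)             # ||x||^2 - ||V_k^T x||^2
```

The identity holds pointwise for every x, so it survives taking expectations.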
You have centered data X_c with SVD X_c = UΣ_sVᵀ. Prove that the covariance matrix Σ = (1/n)X_cᵀX_c has eigenvectors equal to the columns of V and eigenvalues λⱼ = σⱼ²/n.
Hint: Compute X_cᵀX_c using the SVD and simplify using UᵀU = I.
Start with the SVD: X_c = UΣ_sVᵀ.
Compute:
\[\begin{aligned}
X_c^T X_c
&= (U\Sigma_s V^T)^T (U\Sigma_s V^T) \\
&= (V \Sigma_s^T U^T)(U\Sigma_s V^T) \\
&= V \Sigma_s^T (U^T U) \Sigma_s V^T \\
&= V \Sigma_s^T \Sigma_s V^T.
\end{aligned}\]
Since Σ_s is diagonal with nonnegative singular values, Σ_sᵀΣ_s = Σ_s² = diag(σ₁², σ₂², …).
Thus:

$$X_c^T X_c = V \Sigma_s^2 V^T.$$

Scale by 1/n:

$$\Sigma = \frac{1}{n} X_c^T X_c = V \left( \frac{\Sigma_s^2}{n} \right) V^T.$$

This is an eigen-decomposition with eigenvectors given by columns of V and eigenvalues λⱼ = σⱼ²/n.
Next steps and related nodes:
Useful prerequisites to revisit as needed: