Dimensionality reduction via eigenvectors of covariance.
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
You have a dataset with 200 features. You suspect only a few underlying “directions” actually matter (e.g., lighting vs pose in images, or overall size vs shape in tabular data). PCA is the classic tool for discovering those directions—purely from geometry—and then compressing the data while losing as little information as possible (in a precise sense).
Principal Component Analysis (PCA) finds orthogonal directions v₁, v₂, … (eigenvectors of the data covariance) that capture maximum variance. Projecting onto the top-k components gives the best k-dimensional linear approximation: it maximizes captured variance and equivalently minimizes squared reconstruction error. PCA can be computed via covariance eigen-decomposition or via SVD of the centered data matrix.
PCA is a linear dimensionality reduction method. It replaces your original coordinate system (your original features) with a new coordinate system whose axes are mutually orthogonal and ordered by how much of the data’s variance they capture.
Those axes are called principal components.
Imagine your data points as a cloud in ℝᵈ (d features). If that cloud is stretched out more in one direction than others, there’s a “long axis” through the cloud. Projecting onto that axis preserves a lot of the spread (variance) of the data. PCA finds that axis, then the next-most-informative orthogonal axis, and so on.
We’ll set up notation carefully so you don’t have to guess later.
We write the data as X ∈ ℝⁿˣᵈ (n samples as rows, d features as columns) and center it:

$$X_c = X - \frac{1}{n}\mathbf{1}\mathbf{1}^T X,$$

where 𝟙 ∈ ℝⁿ is the all-ones column vector.
Two covariance conventions are common: scaling by 1/n or by 1/(n−1). For PCA, the eigenvectors are the same under both (they differ only by a scalar factor in eigenvalues). In this lesson, we’ll mostly use

$$\Sigma = \frac{1}{n} X_c^T X_c$$

to keep algebra clean.
PCA produces:

- orthonormal directions v₁, …, v_d (the eigenvectors of Σ), and
- eigenvalues λ₁ ≥ λ₂ ≥ ⋯ ≥ λ_d ≥ 0.

Interpretation: λⱼ is the variance of the data along direction vⱼ, so the first few components are the highest-variance directions.
When you keep only the top-k components, PCA gives the best k-dimensional linear approximation in two equivalent ways:
1) Maximum captured variance among all k-dimensional subspaces.
2) Minimum squared reconstruction error among all rank-k linear projections.
We’ll prove both—but slowly, with checkpoints.
If you scale your covariance by 1/(n−1) instead of 1/n, what changes?
Answer: eigenvectors (principal directions) do not change; eigenvalues scale by n/(n−1).
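A quick NumPy check of this fact (the data here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Xc = X - X.mean(axis=0)          # center the data
n = X.shape[0]

# Two covariance conventions.
S_n  = Xc.T @ Xc / n
S_n1 = Xc.T @ Xc / (n - 1)

evals_n,  evecs_n  = np.linalg.eigh(S_n)
evals_n1, evecs_n1 = np.linalg.eigh(S_n1)

# Eigenvalues scale by n/(n-1); eigenvectors are unchanged (up to sign).
ratio = evals_n1 / evals_n
same_dirs = np.allclose(np.abs(evecs_n.T @ evecs_n1), np.eye(3), atol=1e-8)
```

Up to sign, both conventions return the same directions; only the eigenvalue scale changes.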
PCA starts with a very specific optimization question:
Among all unit directions v in feature space, which direction makes the projected data have the largest variance?
That direction will be v₁, the first principal component.
Take a unit vector v ∈ ℝᵈ with ‖v‖₂ = 1 (equivalently vᵀv = 1).
Project a centered data point x onto v:

$$z = v^T \mathbf{x} \in \mathbb{R}.$$

For a random centered vector x (think of drawing a row from X_c), the variance of z = vᵀx is:

$$\mathrm{Var}(v^T \mathbf{x}) = \mathbb{E}[(v^T \mathbf{x})^2].$$

Using 𝔼[x] = 0 and standard covariance rules:

$$\mathbb{E}[(v^T \mathbf{x})^2] = v^T \,\mathbb{E}[\mathbf{x}\mathbf{x}^T]\, v = v^T \Sigma v.$$

So we want to maximize vᵀΣv subject to vᵀv = 1.
The expression

$$R(v) = \frac{v^T \Sigma v}{v^T v}$$

is the Rayleigh quotient. Under the unit-norm constraint vᵀv = 1, maximizing R(v) is the same as maximizing vᵀΣv.
Set up the Lagrangian:

$$\mathcal{L}(v, \lambda) = v^T \Sigma v - \lambda (v^T v - 1).$$

Differentiate w.r.t. v and set to zero:

1) Gradient of quadratic form: ∇ᵥ(vᵀΣv) = 2Σv (since Σ is symmetric)
2) Gradient of constraint term: ∇ᵥ[λ(vᵀv − 1)] = 2λv

So:

$$2\Sigma v - 2\lambda v = 0.$$

Divide by 2:

$$\Sigma v = \lambda v.$$

That is exactly the eigenvector equation.
So any optimum must be an eigenvector of Σ.
To see which eigenvector gives the maximum, plug an eigenvector v with eigenvalue λ into the objective:

$$v^T \Sigma v = v^T (\lambda v) = \lambda\, v^T v = \lambda.$$

Under the unit constraint, the objective value at an eigenvector equals its eigenvalue. Therefore, the maximum is attained by the eigenvector with the largest eigenvalue.
So:

$$v_1 = \arg\max_{\|v\|_2 = 1} v^T \Sigma v, \qquad v_1^T \Sigma v_1 = \lambda_1.$$
After choosing v₁, we want v₂ to maximize variance subject to being orthogonal to v₁:

$$v_2 = \arg\max_{\|v\|_2 = 1,\; v^T v_1 = 0} v^T \Sigma v.$$

The solution is the eigenvector with the next-largest eigenvalue, and so on. Because Σ is symmetric positive semidefinite, it has an orthonormal eigenbasis, so the eigenvectors can be chosen orthogonal.
Let $\Sigma = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}$. Which unit vector v maximizes vᵀΣv?
Solution sketch: eigenvalues are 2 and 1 with eigenvectors [1,0]ᵀ and [0,1]ᵀ. Max is along [1,0]ᵀ with value 2.
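You can sanity-check this numerically: for Σ = diag(2, 1), no random unit vector beats the top eigenvalue (a small illustrative sketch):

```python
import numpy as np

Sigma = np.diag([2.0, 1.0])

# Rayleigh quotient v^T Sigma v for many random unit vectors v.
rng = np.random.default_rng(1)
vs = rng.normal(size=(1000, 2))
vs /= np.linalg.norm(vs, axis=1, keepdims=True)   # normalize to unit length
values = np.einsum('ij,jk,ik->i', vs, Sigma, vs)  # v^T Sigma v per row

best_random = values.max()
top_eigenvalue = np.linalg.eigvalsh(Sigma).max()  # attained at [1, 0]
```

Every sampled value stays at or below the top eigenvalue, as the Rayleigh-quotient argument predicts.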
Finding v₁ is only the beginning. PCA’s power comes from selecting a whole k-dimensional subspace that best represents the data.
There are two common ways to describe what “best” means:
1) Keep as much variance as possible (information-as-spread viewpoint)
2) Reconstruct with minimal squared error (compression viewpoint)
A key learning goal is to see why these are actually the same objective.
Let V_k = [v₁ | ⋯ | v_k] ∈ ℝᵈˣᵏ with orthonormal columns (V_kᵀV_k = I_k).
Given a centered data point x, its coordinates in the PCA subspace are:

$$\mathbf{z} = V_k^T \mathbf{x} \in \mathbb{R}^k.$$

The projection back to ℝᵈ (the rank-k approximation of x) is:

$$\hat{\mathbf{x}} = V_k \mathbf{z} = V_k V_k^T \mathbf{x}.$$

So the projection matrix is P = V_k V_kᵀ.
The variance captured by projecting onto the k-dimensional subspace is the expected squared norm of the projected vector:
Compute it:
\[\begin{aligned}
\mathbb{E}[\|V_k^T \mathbf{x}\|_2^2]
&= \mathbb{E}[ (V_k^T \mathbf{x})^T (V_k^T \mathbf{x}) ] \\
&= \mathbb{E}[ \mathbf{x}^T V_k V_k^T \mathbf{x} ] \\
&= \mathrm{tr}(\mathbb{E}[ \mathbf{x}\mathbf{x}^T ] V_k V_k^T) \quad \text{(cyclic trace)} \\
&= \mathrm{tr}(\Sigma V_k V_k^T).
\end{aligned}\]
If V_k is made of the top-k eigenvectors of Σ, then:

$$\mathrm{tr}(\Sigma V_k V_k^T) = \sum_{j=1}^{k} \lambda_j.$$

So the top-k PCA subspace captures variance equal to the sum of the top k eigenvalues.
A common diagnostic is the explained variance ratio:

$$\mathrm{EVR}(k) = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{d} \lambda_j}.$$

This tells you what fraction of total variance you keep.
If eigenvalues are [5, 2, 1, 1, 1], what k gives ≥80% explained variance?
Solution: total = 10. k=1 gives 50%, k=2 gives 70%, k=3 gives 80% → k=3.
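The same computation in code (a minimal sketch of the explained-variance diagnostic):

```python
import numpy as np

eigenvalues = np.array([5.0, 2.0, 1.0, 1.0, 1.0])

# Cumulative explained variance ratio for each k = 1, 2, ..., d.
evr = np.cumsum(eigenvalues) / eigenvalues.sum()

# Smallest k reaching at least 80% explained variance (k is 1-indexed).
k = int(np.searchsorted(evr, 0.80) + 1)
```

Here `evr` is [0.5, 0.7, 0.8, 0.9, 1.0], so the threshold is first met at k = 3, matching the checkpoint.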
Now we switch viewpoints: you want to compress data by keeping only k numbers per sample (the coordinates z) and then reconstruct.
For a point x, the reconstruction error is:

$$\|\mathbf{x} - V_k V_k^T \mathbf{x}\|_2^2.$$

Consider the expected reconstruction error over the data distribution:

$$\mathbb{E}[\|\mathbf{x} - V_k V_k^T \mathbf{x}\|_2^2].$$

Because P = V_k V_kᵀ is an orthogonal projector (P² = P = Pᵀ), the projected part Px is orthogonal to the residual x − Px.
So by Pythagoras:

$$\|\mathbf{x}\|_2^2 = \|V_k V_k^T \mathbf{x}\|_2^2 + \|\mathbf{x} - V_k V_k^T \mathbf{x}\|_2^2.$$

Rearrange:

$$\|\mathbf{x} - V_k V_k^T \mathbf{x}\|_2^2 = \|\mathbf{x}\|_2^2 - \|V_k V_k^T \mathbf{x}\|_2^2.$$

Take expectation:
\[\begin{aligned}
\mathbb{E}[\|\mathbf{x} - V_k V_k^T \mathbf{x}\|_2^2]
&= \mathbb{E}[\|\mathbf{x}\|_2^2] - \mathbb{E}[\|V_k V_k^T \mathbf{x}\|_2^2] \\
&= \mathbb{E}[\|\mathbf{x}\|_2^2] - \mathbb{E}[\|V_k^T \mathbf{x}\|_2^2].
\end{aligned}\]
The first term is constant with respect to the choice of subspace. So:

$$\arg\min_{V_k} \mathbb{E}[\|\mathbf{x} - V_k V_k^T \mathbf{x}\|_2^2] = \arg\max_{V_k} \mathbb{E}[\|V_k^T \mathbf{x}\|_2^2].$$

Minimizing reconstruction error is equivalent to maximizing captured projected energy/variance.
So the same V_k that maximizes captured variance also minimizes reconstruction error.
If you choose V_k as the top-k eigenvectors, the variance you don’t capture is the sum of the remaining eigenvalues:

$$\mathbb{E}[\|\mathbf{x} - V_k V_k^T \mathbf{x}\|_2^2] = \sum_{j=k+1}^{d} \lambda_j.$$

This gives a clean interpretation: the expected reconstruction error is exactly the variance in the discarded directions.
Eigenvalues are [10, 3, 2]. If you keep k=1, what is expected reconstruction error (in variance units)?
Solution: drop eigenvalues 3 and 2 → error = 5.
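Here is an empirical check of this identity on synthetic data (the scaling matrix and seed are illustrative choices): the mean squared reconstruction error after projecting onto the top-k eigenvectors equals the sum of the dropped eigenvalues of the sample covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)
n = X.shape[0]

Sigma = Xc.T @ Xc / n
evals, evecs = np.linalg.eigh(Sigma)        # ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]  # re-sort descending

k = 2
Vk = evecs[:, :k]
X_hat = Xc @ Vk @ Vk.T                      # rank-k reconstruction

# Mean squared reconstruction error = sum of dropped eigenvalues.
mse = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
dropped = evals[k:].sum()
```

The identity holds exactly (up to floating point) because it is a finite-sample statement about the sample covariance, not an asymptotic one.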
So far we reasoned per-vector. PCA also gives the best rank-k matrix approximation of the centered data matrix in Frobenius norm.
Let X_c ∈ ℝⁿˣᵈ have SVD:

$$X_c = U \Sigma_s V^T$$

(where we call the diagonal matrix of singular values Σ_s to avoid confusing it with the covariance Σ).
The Eckart–Young theorem says the best rank-k approximation in Frobenius norm is:

$$X_c^{(k)} = U_k \Sigma_{s,k} V_k^T,$$

obtained by keeping the top k singular values and their singular vectors.
This is exactly what PCA is doing in matrix form.
Connecting to covariance:
\[\begin{aligned}
\frac{1}{n} X_c^T X_c
&= \frac{1}{n} (V \Sigma_s U^T)(U \Sigma_s V^T) \\
&= \frac{1}{n} V \Sigma_s^2 V^T.
\end{aligned}\]
So:

$$\Sigma = \frac{1}{n} V \Sigma_s^2 V^T, \qquad \lambda_j = \frac{\sigma_j^2}{n}.$$

That’s why PCA can be computed without explicitly forming the covariance matrix, especially when d is large.
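A small sketch comparing the two computation paths on random data; both should agree via λⱼ = σⱼ²/n:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)
n = X.shape[0]

# Path A: eigen-decomposition of the covariance matrix.
Sigma = Xc.T @ Xc / n
evals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]

# Path B: SVD of the centered data; no covariance matrix needed.
s = np.linalg.svd(Xc, compute_uv=False)     # singular values, descending

lam_from_svd = s ** 2 / n
```

For large d, Path B avoids forming the d×d covariance entirely and is usually better conditioned.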
PCA is simple to state but easy to misuse in practice. This section focuses on how PCA is applied, what choices matter, and how it connects to the tools you already know.
Common goals:
PCA assumes the data is centered.
If you forget this, the first component can point mostly toward the mean rather than describing variation.
If features have different units/scales, covariance will be dominated by high-variance features.
| Choice | What you compute PCA on | When it’s appropriate | Effect |
|---|---|---|---|
| Covariance PCA | Centered data X_c | Features share comparable units/scales | Preserves natural variance magnitudes |
| Correlation PCA | Standardized data (z-scores) | Mixed units (e.g., dollars + seconds), different scales | Treats each feature as equally scaled |
Standardization: x̃ⱼ = (xⱼ − μⱼ)/σⱼ, where σⱼ is the standard deviation of feature j.
Two equivalent computation paths:
Path A: Covariance eigen-decomposition (good when d is modest): center the data, form Σ = (1/n) X_cᵀ X_c, and eigen-decompose Σ.
Path B: SVD of centered data (often better numerically and when d is large): compute X_c = U Σ_s Vᵀ; the columns of V are the principal directions and λⱼ = σⱼ²/n.
Typical heuristics for choosing k: pick the smallest k whose explained variance ratio exceeds a threshold (e.g., 90–95%), look for an “elbow” in the eigenvalue (scree) plot, or cross-validate downstream task performance.
Transform (encode): z = V_kᵀ(x − μ).
Reconstruct (decode): x̂ = V_k z + μ
(remember to add the mean back).
Why must you apply the training mean and scaling to the test set (not recompute them on test)?
Solution: recomputing uses test information (data leakage) and changes the coordinate system; transformed features wouldn’t be comparable.
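A minimal fit/transform sketch in NumPy illustrating the right discipline (the helper names `encode`/`decode` are ours, not a library API): the mean and directions are estimated on the training split only, then reused on test data.

```python
import numpy as np

rng = np.random.default_rng(4)
X_train = rng.normal(size=(80, 6))
X_test = rng.normal(size=(20, 6))

# Fit on the training set only.
mu = X_train.mean(axis=0)
Xc = X_train - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Vk = Vt[:k].T                     # top-k principal directions (d x k)

# Encode/decode: apply the *training* mean to both splits.
def encode(X):
    return (X - mu) @ Vk

def decode(Z):
    return Z @ Vk.T + mu          # add the training mean back

Z_test = encode(X_test)
X_test_hat = decode(Z_test)
```

Note that `encode(decode(Z)) == Z`: decoding then re-encoding is lossless because the decoded points already lie in the PCA subspace.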
When V is orthonormal (VᵀV = VVᵀ = I), multiplying by Vᵀ is a rotation/reflection (distance-preserving). PCA can be thought of as:
1) Rotate into the eigenvector basis: y = Vᵀx
2) Drop the last d−k coordinates (the small-variance directions)
This makes clear why PCA often denoises: if noise is roughly isotropic, it spreads across many directions and tends to be captured by small eigenvalues.
Sometimes you want the transformed variables to have unit variance and be uncorrelated. PCA already makes them uncorrelated; whitening also rescales.
Let Λ_k = diag(λ₁, …, λ_k) be diagonal with the top-k eigenvalues. PCA coordinates are:

$$\mathbf{z} = V_k^T \mathbf{x}.$$

Whitened coordinates:

$$\mathbf{z}_w = \Lambda_k^{-1/2} V_k^T \mathbf{x}.$$

Caution: whitening can amplify noise in low-variance directions (because you divide by small √λⱼ).
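A short sketch of whitening on synthetic data (the mixing matrix is an arbitrary illustrative choice); the whitened coordinates should have identity covariance:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])
Xc = X - X.mean(axis=0)
n = X.shape[0]

Sigma = Xc.T @ Xc / n
evals, evecs = np.linalg.eigh(Sigma)
evals, evecs = evals[::-1], evecs[:, ::-1]   # descending

k = 2
Vk, Lk = evecs[:, :k], evals[:k]

Z = Xc @ Vk                       # PCA coordinates: uncorrelated
Zw = Z / np.sqrt(Lk)              # whitened: uncorrelated AND unit variance

cov_Zw = Zw.T @ Zw / n            # should be the 2x2 identity
```

Plain PCA coordinates have covariance diag(λ₁, λ₂); dividing by √λⱼ rescales each axis to unit variance.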
Since you already know SVD, here’s the clean link you can reuse: for X_c = U Σ_s Vᵀ, the right singular vectors V are the principal directions, λⱼ = σⱼ²/n, and the score matrix is

$$Z = X_c V_k = U_k \Sigma_{s,k}.$$

This last identity is handy: you can get the k-dimensional representation directly as U_k Σ_{s,k} without forming the covariance matrix explicitly in some streaming/large-scale settings.
PCA is linear. It fails when structure is nonlinear.
Still, PCA is often the baseline: fast, interpretable, and a strong first step.
Suppose your centered 2D data has covariance

$$\Sigma = \begin{bmatrix} 4 & 2 \\ 2 & 3 \end{bmatrix}.$$

Find the principal components and how much variance the first component explains.
Compute eigenvalues from det(Σ − λI)=0:

$$\det\begin{bmatrix} 4-\lambda & 2 \\ 2 & 3-\lambda \end{bmatrix} = 0.$$

Expand:

$$(4-\lambda)(3-\lambda) - 4 = \lambda^2 - 7\lambda + 8 = 0.$$

Solve the quadratic:

$$\lambda = \frac{7 \pm \sqrt{49 - 32}}{2} = \frac{7 \pm \sqrt{17}}{2}.$$

Order them:

$$\lambda_1 = \frac{7 + \sqrt{17}}{2} \approx 5.561, \qquad \lambda_2 = \frac{7 - \sqrt{17}}{2} \approx 1.438.$$

Find eigenvector for λ₁ by solving (Σ − λ₁I)v=0:

$$\begin{bmatrix} 4-\lambda_1 & 2 \\ 2 & 3-\lambda_1 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$

Use the first row: (4−λ₁)v₁ + 2v₂ = 0 ⇒ 2v₂ = (λ₁−4)v₁.
So ratio:

$$\frac{v_2}{v_1} = \frac{\lambda_1 - 4}{2}.$$

Plug λ₁ ≈ 5.561:

$$\frac{v_2}{v_1} \approx \frac{1.561}{2} \approx 0.7805.$$

Choose a convenient vector proportional to [1, 0.7805]ᵀ and normalize:
Norm ≈ √(1² + 0.7805²) = √(1 + 0.609) = √1.609 ≈ 1.268.
So

$$\mathbf{v}_1 \approx \begin{bmatrix} 0.788 \\ 0.615 \end{bmatrix}.$$

Similarly, v₂ is orthogonal to v₁ (or solve for λ₂). One orthonormal choice is approximately:

$$\mathbf{v}_2 \approx \begin{bmatrix} -0.615 \\ 0.788 \end{bmatrix}.$$

Compute explained variance ratio for k=1:
Total variance = tr(Σ)=4+3=7 = λ₁+λ₂.

$$\mathrm{EVR}(1) = \frac{\lambda_1}{\lambda_1 + \lambda_2} \approx \frac{5.561}{7} \approx 0.794.$$
Insight: Eigenvectors give the principal directions; eigenvalues literally are the variances along those directions. The explained variance ratio is just “how much of the trace you keep.”
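You can confirm the arithmetic above with NumPy:

```python
import numpy as np

Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])

evals, evecs = np.linalg.eigh(Sigma)   # eigh returns ascending order
lam1, lam2 = evals[1], evals[0]

v1 = evecs[:, 1]                       # top eigenvector (sign arbitrary)
ratio = v1[1] / v1[0]                  # component ratio v2/v1

evr1 = lam1 / np.trace(Sigma)          # explained variance ratio for k=1
```

The exact values are λ₁ = (7 + √17)/2 and EVR(1) = λ₁/7 ≈ 0.794.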
Let the centered data matrix be

$$X_c = \begin{bmatrix} 2 & 0 \\ 0 & 1 \\ -2 & 0 \\ 0 & -1 \end{bmatrix}.$$

Compute the covariance, its eigenvalues, and relate them to singular values of X_c.
Compute covariance using Σ = (1/n) X_cᵀ X_c with n=4.
First compute X_cᵀ X_c:

$$X_c^T X_c = \begin{bmatrix} 2 & 0 & -2 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 1 \\ -2 & 0 \\ 0 & -1 \end{bmatrix}.$$

Multiply:

- (1,1) entry: 2² + 0² + (−2)² + 0² = 8
- (2,2) entry: 0² + 1² + 0² + (−1)² = 2
- off-diagonal entries: 0

So

$$X_c^T X_c = \begin{bmatrix} 8 & 0 \\ 0 & 2 \end{bmatrix}.$$

Now scale by 1/n = 1/4:

$$\Sigma = \begin{bmatrix} 2 & 0 \\ 0 & 0.5 \end{bmatrix}.$$

Eigenvalues of Σ are clearly λ₁=2, λ₂=0.5 with eigenvectors e₁=[1,0]ᵀ and e₂=[0,1]ᵀ.
Now connect to singular values. By definition, singular values satisfy:

$$\sigma_j^2 = \text{eigenvalue}_j(X_c^T X_c).$$

Here X_cᵀ X_c is diagonal with entries 8 and 2, so the singular values are:

$$\sigma_1 = \sqrt{8} = 2\sqrt{2}, \qquad \sigma_2 = \sqrt{2}.$$

Check the PCA relationship λⱼ = σⱼ²/n:

$$\lambda_1 = \frac{8}{4} = 2, \qquad \lambda_2 = \frac{2}{4} = 0.5.$$
Insight: SVD of centered data is PCA in disguise: right singular vectors are principal directions, and eigenvalues are just scaled squared singular values.
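A quick numerical confirmation, using a data matrix consistent with the entries computed in this example:

```python
import numpy as np

Xc = np.array([[ 2.0,  0.0],
               [ 0.0,  1.0],
               [-2.0,  0.0],
               [ 0.0, -1.0]])
n = Xc.shape[0]

Sigma = Xc.T @ Xc / n
evals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]   # descending

s = np.linalg.svd(Xc, compute_uv=False)            # singular values
lam_from_svd = s ** 2 / n                          # should match evals
```

Both routes give eigenvalues (2, 0.5), and the singular values are √8 and √2.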
Assume a 3D centered dataset has covariance eigenvalues

$$\lambda_1 = 6, \qquad \lambda_2 + \lambda_3 = 3 \qquad (\text{so } \lambda_1 + \lambda_2 + \lambda_3 = 9).$$

Compare expected reconstruction error for k=1 vs k=2.
Total variance (total expected squared norm) is:

$$\mathbb{E}[\|\mathbf{x}\|_2^2] = \lambda_1 + \lambda_2 + \lambda_3 = 9.$$

For k=1, you keep λ₁ and drop λ₂, λ₃.
Expected reconstruction error:

$$\lambda_2 + \lambda_3 = 9 - \lambda_1 = 3.$$

For k=2, you keep λ₁+λ₂ and drop λ₃.
Expected reconstruction error:

$$\lambda_3 = 9 - (\lambda_1 + \lambda_2) \le 3,$$

with strict inequality whenever λ₂ > 0.
Consistency check via 'total − captured':

- Captured (k=1): 6 ⇒ error 9−6=3
- Captured (k=2): λ₁+λ₂ ⇒ error 9−(λ₁+λ₂) = λ₃
Insight: Once you’re in the PCA basis, reconstruction error is literally the energy in the discarded coordinates. That’s why eigenvalues are such a direct diagnostic.
A principal component v is a unit direction maximizing projected variance; it is an eigenvector of the covariance matrix.
Eigenvalues quantify variance captured along their eigenvectors; ordering them as λ₁ ≥ λ₂ ≥ ⋯ orders importance.
Projecting onto the top-k components uses z = V_kᵀx (for centered data).
PCA’s top-k subspace maximizes captured variance 𝔼[‖V_kᵀx‖²] and equivalently minimizes expected squared reconstruction error 𝔼[‖x − V_kV_kᵀx‖²].
PCA can be computed via eigen-decomposition of Σ or via SVD X_c = UΣ_sVᵀ; the eigenvalues satisfy λⱼ = σⱼ²/n.
Centering is essential; scaling (standardization) is often essential when features have different units.
Explained variance ratio is a practical way to choose k, but task-driven cross-validation is often better for predictive goals.
Forgetting to center the data (PCA then captures mean offsets rather than variation).
Mixing scaling conventions unintentionally (covariance PCA vs correlation PCA) and misinterpreting components because feature units differ.
Choosing k solely from explained variance without checking downstream performance or the effect of noise/outliers.
Recomputing mean/standard deviation on the test set (data leakage; inconsistent coordinate system).
Let $\Sigma = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}$. Find eigenvalues and an orthonormal set of principal components.
Hint: This matrix has the form a on diagonal and b off-diagonal; eigenvectors are along [1,1] and [1,−1].
Compute eigenvalues: solve det(Σ−λI)=0.
So (3−λ)² − 1 = 0 ⇒ 3−λ = ±1 ⇒ λ₁ = 4, λ₂ = 2.
Eigenvector for λ₁=4: (3−4)v₁ + 1 v₂ = 0 ⇒ −v₁+v₂=0 ⇒ v₂=v₁ ⇒ direction [1,1]. Normalize: v₁ = (1/√2)[1,1].
Eigenvector for λ₂=2: (3−2)v₁ + 1 v₂ =0 ⇒ v₁+v₂=0 ⇒ direction [1,−1]. Normalize: v₂ = (1/√2)[1,−1].
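A numerical check of this exercise:

```python
import numpy as np

Sigma = np.array([[3.0, 1.0],
                  [1.0, 3.0]])
evals, evecs = np.linalg.eigh(Sigma)          # ascending: [2, 4]

v_top = evecs[:, 1]                           # eigenvector for lambda = 4
# Alignment with [1, 1]/sqrt(2), up to sign.
aligned = abs(v_top @ (np.array([1.0, 1.0]) / np.sqrt(2)))
```

The top eigenvector matches (1/√2)[1, 1] up to an arbitrary overall sign.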
Show that for any orthonormal V_k, the expected reconstruction error satisfies

$$\mathbb{E}[\|\mathbf{x} - V_k V_k^T \mathbf{x}\|_2^2] = \mathbb{E}[\|\mathbf{x}\|_2^2] - \mathbb{E}[\|V_k^T \mathbf{x}\|_2^2].$$

Hint: Use that V_kV_kᵀ is an orthogonal projector and apply Pythagoras: the projected part is orthogonal to the residual.
Let p = V_kV_kᵀx and residual r = x − V_kV_kᵀx.
Because V_kV_kᵀ is an orthogonal projector, p lies in the span of the columns of V_k, while r lies in the orthogonal complement. Therefore pᵀr = 0.
So

$$\|\mathbf{x}\|_2^2 = \|p\|_2^2 + \|r\|_2^2.$$

Rearrange:

$$\|r\|_2^2 = \|\mathbf{x}\|_2^2 - \|p\|_2^2.$$

Now note ‖p‖₂² = ‖V_kV_kᵀx‖₂² = ‖V_kᵀx‖₂² because V_k has orthonormal columns. Take expectation of both sides to obtain the identity.
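The projector/Pythagoras argument is easy to verify numerically for a random orthonormal V_k (built here via QR, an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(6)
d, k = 5, 2
# Random matrix with orthonormal columns via QR decomposition.
Vk, _ = np.linalg.qr(rng.normal(size=(d, k)))
x = rng.normal(size=d)

p = Vk @ Vk.T @ x                 # projection onto span(V_k)
r = x - p                         # residual

orth = p @ r                      # should be ~0, so Pythagoras applies
lhs = r @ r                                       # ||r||^2
rhs = x @ x - (Vk.T @ x) @ (Vk.T @ x)             # ||x||^2 - ||V_k^T x||^2
```

The identity holds pointwise for every x, so it survives taking expectations.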
You have centered data X_c with SVD X_c = UΣ_sVᵀ. Prove that the covariance matrix Σ = (1/n)X_cᵀX_c has eigenvectors equal to the columns of V and eigenvalues λⱼ = σⱼ²/n.
Hint: Compute X_cᵀX_c using the SVD and simplify using UᵀU = I.
Start with the SVD: X_c = UΣ_sVᵀ.
Compute:
\[\begin{aligned}
X_c^T X_c
&= (U\Sigma_s V^T)^T (U\Sigma_s V^T) \\
&= (V \Sigma_s^T U^T)(U\Sigma_s V^T) \\
&= V \Sigma_s^T (U^T U) \Sigma_s V^T \\
&= V \Sigma_s^T \Sigma_s V^T.
\end{aligned}\]
Since Σ_s is diagonal with nonnegative singular values, Σ_sᵀΣ_s = Σ_s² = diag(σ₁², σ₂², …).
Thus:

$$X_c^T X_c = V \Sigma_s^2 V^T.$$

Scale by 1/n:

$$\Sigma = \frac{1}{n} X_c^T X_c = V \left( \frac{\Sigma_s^2}{n} \right) V^T.$$

This is an eigen-decomposition with eigenvectors given by columns of V and eigenvalues λⱼ = σⱼ²/n.
Next steps and related nodes:
Useful prerequisites to revisit as needed: