A = UΣVᵀ. Fundamental matrix factorization.
Most linear algebra tools tell you what a matrix does to special vectors. Eigenvectors work beautifully—when the matrix is square and behaves nicely. The Singular Value Decomposition (SVD) is the “works anyway” factorization: it describes how any real m×n matrix transforms space by rotating (or reflecting), scaling along orthogonal directions, then rotating again.
For any real matrix A ∈ ℝ^{m×n}, there exist orthonormal matrices U ∈ ℝ^{m×m}, V ∈ ℝ^{n×n}, and a diagonal (rectangular) matrix Σ ∈ ℝ^{m×n} with nonnegative diagonal entries σ₁ ≥ σ₂ ≥ … ≥ 0 such that A = UΣVᵀ. Geometrically: Vᵀ rotates the input, Σ scales along orthogonal axes, U rotates the output. The σᵢ are singular values; columns of V are right singular vectors; columns of U are left singular vectors. SVD underlies best low-rank approximation and PCA.
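The factorization is easy to check numerically. Below is a minimal NumPy sketch; the 3×2 matrix is an arbitrary illustrative example, not one from the text.

```python
import numpy as np

# An arbitrary 3x2 example matrix (m=3, n=2).
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# Full SVD: U is m x m, s holds the singular values, Vt is n x n.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Embed s on the diagonal of the rectangular m x n Sigma.
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)

print(np.allclose(A, U @ Sigma @ Vt))           # True: A = U Sigma V^T
print(np.all(s[:-1] >= s[1:]), np.all(s >= 0))  # True True: sorted, nonnegative
```

Note that NumPy returns Vᵀ (here `Vt`), not V, and returns the singular values as a 1-D array rather than the rectangular Σ.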
Eigenvalues and eigenvectors answer: “Are there directions v that A maps onto themselves up to scaling?” That question only makes sense for square matrices, and even then the eigenvalues may be complex and the eigenvectors need not be orthogonal (or exist in sufficient number). SVD fixes these issues by giving an orthonormal coordinate system in the input and output spaces.
For any real matrix A ∈ ℝ^{m×n}, the Singular Value Decomposition is
A = UΣVᵀ
where:
- U ∈ ℝ^{m×m} is orthonormal (its columns are the left singular vectors),
- Σ ∈ ℝ^{m×n} is rectangular-diagonal with nonnegative entries σ₁ ≥ σ₂ ≥ … ≥ 0,
- V ∈ ℝ^{n×n} is orthonormal (its columns are the right singular vectors).
Concretely, Σ looks like
Σ = [ diag(σ₁,…,σ_r)  0 ]
    [ 0               0 ]
with the diagonal sitting in the top-left.
SVD says A can be viewed as three simple transformations composed:
1) Vᵀ: rotate/reflect input space (ℝⁿ) into a special orthonormal basis (the right singular vectors)
2) Σ: scale along coordinate axes by nonnegative factors σᵢ
3) U: rotate/reflect the scaled result into the output space (ℝᵐ), aligning the axes with the left singular vectors
Because U and V are orthonormal, they preserve lengths and angles:
‖Vᵀx‖ = ‖x‖, ‖Uy‖ = ‖y‖.
So all stretching/squashing is completely captured by Σ.
Consider the unit sphere in ℝⁿ:
S = { x ∈ ℝⁿ : ‖x‖ = 1 }.
Under A, the set A(S) is generally an ellipsoid (possibly flattened into lower dimensions). SVD explains why:
- The right singular vectors vᵢ map to the ellipsoid’s axis directions uᵢ.
- The singular values σᵢ are the lengths of the ellipsoid’s semi-axes.
Let V = [v₁ … vₙ] and U = [u₁ … uₘ], so the columns vᵢ form an orthonormal basis of ℝⁿ and the columns uᵢ an orthonormal basis of ℝᵐ. The defining relationships are:
Avᵢ = σᵢ uᵢ
Aᵀuᵢ = σᵢ vᵢ
These say: A maps orthonormal directions vᵢ to orthonormal directions uᵢ, scaled by σᵢ.
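Both defining relations can be verified directly in NumPy; the 2×3 matrix below is an arbitrary illustration.

```python
import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])  # arbitrary 2x3 example

U, s, Vt = np.linalg.svd(A)
V = Vt.T

# Check A v_i = sigma_i u_i and A^T u_i = sigma_i v_i for each singular pair.
for i, sigma in enumerate(s):
    assert np.allclose(A @ V[:, i], sigma * U[:, i])
    assert np.allclose(A.T @ U[:, i], sigma * V[:, i])
print("defining relations hold")
```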
A key promise of SVD: every real matrix has an SVD.
This is not just a convenience—it is the reason SVD becomes a universal tool in numerical linear algebra, data analysis, and machine learning.
A common route to existence uses the fact that AᵀA is symmetric positive semidefinite, hence has an orthonormal eigenbasis. From that eigenbasis we build V and Σ, then define U accordingly. We’ll do this carefully in the next section.
We want orthonormal directions in the input space that A treats “nicely.” AᵀA is the right object because:
xᵀ(AᵀA)x = (Ax)ᵀ(Ax) = ‖Ax‖² ≥ 0.
Since AᵀA is also symmetric, its eigenvalues are real (and, by the computation above, nonnegative) and its eigenvectors can be chosen orthonormal.
Because AᵀA is symmetric, there exists an orthonormal matrix V and diagonal Λ such that:
AᵀA = VΛVᵀ
where Λ = diag(λ₁,…,λₙ) with λᵢ ≥ 0.
Let the eigenpairs be:
(AᵀA)vᵢ = λᵢ vᵢ
with ‖vᵢ‖ = 1 and vᵢ ⟂ vⱼ for i ≠ j.
Singular values are defined as
σᵢ = √λᵢ (nonnegative)
and arranged in decreasing order σ₁ ≥ σ₂ ≥ … ≥ 0.
So singular values are literally the square-roots of eigenvalues of AᵀA.
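This identity is easy to confirm numerically: compare NumPy’s singular values against the square roots of the eigenvalues of AᵀA (the matrix is an arbitrary example).

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])  # arbitrary example

s = np.linalg.svd(A, compute_uv=False)

# Eigenvalues of A^T A, sorted descending; clip tiny negative roundoff.
lam = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

print(np.allclose(s, np.sqrt(np.clip(lam, 0, None))))  # True
```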
For any λᵢ > 0 (equivalently σᵢ > 0), define
uᵢ = (1/σᵢ) Avᵢ.
Then uᵢ is a unit vector, and distinct uᵢ are orthogonal. Let’s show the unit-length claim carefully.
Compute ‖uᵢ‖²:
‖uᵢ‖² = uᵢᵀuᵢ
= ((1/σᵢ)Avᵢ)ᵀ ((1/σᵢ)Avᵢ)
= (1/σᵢ²) vᵢᵀ AᵀA vᵢ
= (1/σᵢ²) vᵢᵀ (λᵢ vᵢ)
= (1/σᵢ²) λᵢ (vᵢᵀvᵢ)
= (1/σᵢ²) λᵢ
= (1/(√λᵢ)²) λᵢ
= 1.
So ‖uᵢ‖ = 1.
Now check orthogonality for i ≠ j (assuming σᵢ,σⱼ > 0):
uᵢᵀuⱼ
= ((1/σᵢ)Avᵢ)ᵀ ((1/σⱼ)Avⱼ)
= (1/(σᵢσⱼ)) vᵢᵀ AᵀA vⱼ
= (1/(σᵢσⱼ)) vᵢᵀ (λⱼ vⱼ)
= (λⱼ/(σᵢσⱼ)) vᵢᵀvⱼ
= 0.
Thus {uᵢ} are orthonormal over the nonzero singular values.
For the remaining columns of U (if m > r), we choose any orthonormal vectors that complete a basis of ℝᵐ. This completion is not unique.
Let r = rank(A) = number of positive singular values.
Then we can show A = UΣVᵀ.
One clean way is to check what A does to each vᵢ:
Avᵢ = σᵢ uᵢ.
Now consider the matrix UΣVᵀ applied to vᵢ:
(UΣVᵀ)vᵢ
= UΣ (Vᵀvᵢ)
= UΣ eᵢ
= U (σᵢ eᵢ)
= σᵢ uᵢ.
So A and UΣVᵀ agree on the basis vectors vᵢ, hence they are the same linear map.
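The construction just described can be carried out step by step in NumPy and checked against A. This is a sketch of the existence argument, not a production algorithm; the example matrix is arbitrary.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])  # arbitrary 3x2 example

# Step 1: orthonormal eigenbasis of the symmetric PSD matrix A^T A.
lam, V = np.linalg.eigh(A.T @ A)      # eigh returns ascending order
order = np.argsort(lam)[::-1]         # reorder to descending
lam, V = lam[order], V[:, order]

# Step 2: singular values sigma_i = sqrt(lambda_i).
s = np.sqrt(np.clip(lam, 0, None))

# Step 3: u_i = (1/sigma_i) A v_i for each positive sigma_i.
r = int(np.sum(s > 1e-12))
U_r = (A @ V[:, :r]) / s[:r]

# The construction reproduces A (thin form).
print(np.allclose(A, U_r @ np.diag(s[:r]) @ V[:, :r].T))  # True
```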
The largest singular value σ₁ tells you the maximum stretching factor of A:
σ₁ = max_{‖x‖=1} ‖Ax‖.
You can see this from
‖Ax‖² = xᵀAᵀAx.
For symmetric matrices, the maximum of xᵀMx over ‖x‖=1 is the largest eigenvalue. Here M = AᵀA, so the maximum is λ₁, and thus max ‖Ax‖ = √λ₁ = σ₁.
More generally, the sum of squared singular values equals the squared Frobenius norm:
‖A‖_F² = ∑_{i=1}^{r} σᵢ².
This becomes useful for measuring how much “energy” is captured by keeping only the top k singular values.
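A quick numerical check of the Frobenius identity, using a random matrix as an illustration:

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(4, 3))  # random illustrative matrix
s = np.linalg.svd(A, compute_uv=False)

# ||A||_F^2 equals the sum of squared singular values.
print(np.isclose(np.linalg.norm(A, 'fro')**2, np.sum(s**2)))  # True
```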
If A is symmetric positive semidefinite, then its eigen-decomposition A = QΛQᵀ is already an SVD.
In that case:
- U = V = Q, and the singular values equal the eigenvalues.
But in general:
- eigenvalues can be negative or complex, eigenvectors need not be orthogonal, and rectangular matrices have no eigen-decomposition at all, while the SVD always exists with orthonormal factors and nonnegative σᵢ.
This is why SVD is the robust “geometric backbone” behind many algorithms.
Matrices can be complicated globally, but rank-1 matrices are simple: any rank-1 matrix can be written as an outer product uvᵀ, which stores only m + n numbers and acts by “measure the component along v, output along u.”
So if we can write A as a sum of rank-1 pieces, we get an interpretable and computable structure.
SVD gives exactly that.
Write the (thin) SVD using r = rank(A):
A = U_r Σ_r V_rᵀ
where:
- U_r ∈ ℝ^{m×r} holds the first r left singular vectors,
- Σ_r = diag(σ₁,…,σ_r) ∈ ℝ^{r×r},
- V_r ∈ ℝ^{n×r} holds the first r right singular vectors.
Then:
A = ∑_{i=1}^{r} σᵢ uᵢ vᵢᵀ.
This is worth pausing on: SVD expresses any matrix as a sum of r rank-1 “layers,” ordered from most to least important by σᵢ. Here is why the formula holds.
Start from thin SVD:
A = U_r Σ_r V_rᵀ.
Let Σ_r have diagonal entries σᵢ. We can write Σ_r as a sum of diagonal basis matrices:
Σ_r = ∑_{i=1}^{r} σᵢ (eᵢ eᵢᵀ).
Then
A = U_r (∑_{i=1}^{r} σᵢ eᵢ eᵢᵀ) V_rᵀ
= ∑_{i=1}^{r} σᵢ (U_r eᵢ) (V_r eᵢ)ᵀ
= ∑_{i=1}^{r} σᵢ uᵢ vᵢᵀ.
That’s the outer-product form.
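The outer-product form can be rebuilt term by term and compared against A; the 2×2 matrix is an arbitrary example.

```python
import numpy as np

A = np.array([[4.0, 0.0],
              [3.0, -5.0]])  # arbitrary example

U, s, Vt = np.linalg.svd(A)

# Rebuild A as a sum of rank-1 layers sigma_i * u_i v_i^T.
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))

print(np.allclose(A, A_sum))  # True
```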
The defining relation Avᵢ = σᵢ uᵢ means: in the bases given by V and U, the map A is diagonal.
If you take an arbitrary input vector x and express it in the V-basis:
x = ∑_{i=1}^{n} αᵢ vᵢ
where αᵢ = vᵢᵀx.
Then
Ax
= A(∑_{i} αᵢ vᵢ)
= ∑_{i} αᵢ Avᵢ
= ∑_{i} αᵢ σᵢ uᵢ.
So the components along vᵢ get scaled by σᵢ and re-expressed along uᵢ.
This is the ellipsoid story in coordinates.
Because Σ has σ₁,…,σ_r > 0 and the rest 0:
- col(A) = span(u₁,…,u_r) and row(A) = span(v₁,…,v_r).
The right singular vectors associated with σᵢ = 0 span null(A):
If σᵢ = 0, then Avᵢ = 0.
Similarly, left singular vectors with σᵢ = 0 span null(Aᵀ).
This makes SVD a “coordinate system” where the fundamental subspaces (row space, column space, null space) become clean blocks.
You will see two common SVD shapes:
| Name | Factorization | Shapes | When used |
|---|---|---|---|
| Full SVD | A = UΣVᵀ | U: m×m, Σ: m×n, V: n×n | Theory, explicit bases for all spaces |
| Thin/compact SVD | A = U_r Σ_r V_rᵀ | U_r: m×r, Σ_r: r×r, V_r: n×r | Computation, low-rank approximation |
The thin SVD discards the “zero singular value” directions and keeps only what A actually uses.
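NumPy exposes both shapes through the `full_matrices` flag. One caveat: NumPy’s “thin” factorization keeps min(m, n) columns rather than r, so zero singular values may still appear.

```python
import numpy as np

A = np.random.default_rng(1).normal(size=(5, 3))  # random 5x3 example

# Full SVD: explicit orthonormal bases for all of R^m and R^n.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, s.shape, Vt.shape)        # (5, 5) (3,) (3, 3)

# Thin SVD: only min(m, n) columns of U are kept.
U_t, s_t, Vt_t = np.linalg.svd(A, full_matrices=False)
print(U_t.shape, s_t.shape, Vt_t.shape)  # (5, 3) (3,) (3, 3)
```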
A remarkable theorem (Eckart–Young–Mirsky) says:
If you want a rank-k matrix B that best approximates A, then truncating the SVD is optimal.
Define
A_k = ∑_{i=1}^{k} σᵢ uᵢ vᵢᵀ.
Then A_k solves:
min_{rank(B)=k} ‖A − B‖_F
and also
min_{rank(B)=k} ‖A − B‖₂.
Intuition: the outer-product expansion A = ∑ σᵢ uᵢ vᵢᵀ already orders the rank-1 layers by size, so the best rank-k approximation simply keeps the k largest layers.
Quantitatively, the truncation error is clean:
‖A − A_k‖_F² = ∑_{i=k+1}^{r} σᵢ²
‖A − A_k‖₂ = σ_{k+1}.
These formulas are incredibly useful in practice: they tell you exactly what you lose by compressing.
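Both error formulas can be verified on a random example with NumPy:

```python
import numpy as np

A = np.random.default_rng(2).normal(size=(6, 4))  # random example
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # truncated (rank-k) SVD

# Frobenius error: sqrt of the sum of discarded sigma_i^2.
print(np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:]**2))))  # True
# Spectral (2-norm) error: the first discarded singular value.
print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))  # True
```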
Principal Component Analysis (PCA) finds directions of maximal variance in data.
Suppose you have a centered data matrix X ∈ ℝ^{N×d}: each of the N rows is one observation with d features, and every column has mean zero.
The sample covariance is
C = (1/(N−1)) XᵀX.
PCA traditionally says: compute eigenvectors of C.
Now write X = UΣVᵀ and notice:
XᵀX = VΣᵀUᵀUΣVᵀ = V(ΣᵀΣ)Vᵀ.
Since UᵀU = I, we get an eigen-decomposition:
C = (1/(N−1)) V(ΣᵀΣ)Vᵀ.
So:
- the eigenvectors of C (the principal directions) are the right singular vectors vᵢ of X,
- the variances along them are λᵢ = σᵢ²/(N−1).
This is why numerical PCA is often implemented via SVD of X directly.
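A short NumPy sketch comparing the two routes on synthetic centered data; both recover the same variances.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))   # 100 samples, 4 features (synthetic)
X = X - X.mean(axis=0)          # center each column

N = X.shape[0]
C = X.T @ X / (N - 1)           # sample covariance

# Route 1: PCA via eigendecomposition of C (descending order).
lam = np.sort(np.linalg.eigvalsh(C))[::-1]

# Route 2: PCA via SVD of X directly.
s = np.linalg.svd(X, compute_uv=False)

print(np.allclose(lam, s**2 / (N - 1)))  # True: variances match
```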
Given an overdetermined system Ax ≈ b (m > n), the least squares solution minimizes
min_{x} ‖Ax − b‖².
SVD provides a stable way to solve it using the Moore–Penrose pseudoinverse A⁺.
If A = UΣVᵀ, then
A⁺ = V Σ⁺ Uᵀ
where Σ⁺ is formed by inverting nonzero singular values:
Σ⁺_{ii} = 1/σᵢ for i ≤ r, and 0 otherwise.
Then one least-squares solution is
x̂ = A⁺ b.
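The pseudoinverse route can be written out with the thin SVD and checked against NumPy’s built-in `pinv` and `lstsq`. The sketch below assumes A has full column rank (a random Gaussian matrix almost surely does), so dividing by the singular values is safe.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(10, 3))    # overdetermined system, m > n (synthetic)
b = rng.normal(size=10)

# Pseudoinverse route: x = V Sigma^+ U^T b, using the thin SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ b) / s)  # assumes all sigma_i > 0 (full rank)

# NumPy's built-in routes for comparison.
x_pinv = np.linalg.pinv(A) @ b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

print(np.allclose(x_svd, x_pinv), np.allclose(x_svd, x_lstsq))  # True True
```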
Why this is good:
- It works even when A is rank-deficient.
- It avoids forming the normal equations AᵀAx = Aᵀb, whose condition number is κ₂(A)².
- Among all minimizers, x̂ = A⁺b is the one with smallest norm.
The condition number (in 2-norm) of a full-rank matrix is
κ₂(A) = σ₁ / σ_n
(or σ₁/σ_r if rank-deficient).
Interpretation: κ₂(A) compares the largest and smallest stretching factors of A. When it is large, directions associated with tiny σᵢ get amplified in solving Ax = b, so small perturbations in the data can cause large relative changes in the solution.
SVD makes this sensitivity visible and measurable.
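A nearly singular 2×2 example (values chosen purely for illustration) makes the ratio concrete; NumPy’s `cond` computes the same quantity from the SVD.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])   # nearly singular illustrative matrix

s = np.linalg.svd(A, compute_uv=False)
kappa = s[0] / s[-1]

print(np.isclose(kappa, np.linalg.cond(A, 2)))  # True
print(bool(kappa > 1e4))                        # True: badly conditioned
```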
When you see A = UΣVᵀ, think:
- Vᵀ: read the input in the best orthonormal coordinates,
- Σ: scale each coordinate by σᵢ,
- U: write the result in the best orthonormal output coordinates.
And when you truncate at k:
A ≈ A_k = U_k Σ_k V_kᵀ
you are choosing to keep only the k most influential input-output modes.
SVD is not just a theorem—it is a lens for understanding what a matrix does, and a toolkit for acting on that understanding.
Let A = [[3, 0], [0, 1]]. Find an SVD and interpret σ₁, σ₂ geometrically.
Recognize A is already diagonal with nonnegative diagonal entries.
We suspect V = I and U = I might work.
Compute AᵀA:
AᵀA = A² = [[9, 0], [0, 1]].
Find eigenvalues/eigenvectors of AᵀA.
Eigenvalues: λ₁ = 9, λ₂ = 1.
Eigenvectors are the standard basis:
v₁ = [1,0]ᵀ, v₂ = [0,1]ᵀ.
Thus V = I.
Convert eigenvalues to singular values:
σ₁ = √9 = 3
σ₂ = √1 = 1
So Σ = [[3,0],[0,1]].
Compute left singular vectors:
uᵢ = (1/σᵢ)Avᵢ.
For i=1:
Av₁ = [3,0]ᵀ ⇒ u₁ = (1/3)[3,0]ᵀ = [1,0]ᵀ.
For i=2:
Av₂ = [0,1]ᵀ ⇒ u₂ = (1/1)[0,1]ᵀ = [0,1]ᵀ.
Thus U = I.
Verify:
UΣVᵀ = I · Σ · I = A.
Insight: This matrix scales the x-axis by 3 and the y-axis by 1. The unit circle maps to an axis-aligned ellipse with semi-axes 3 and 1. SVD reduces to “rotate (none), scale, rotate (none)”—a clean case that matches the general geometric story.
Let A = [[1, 1], [1, 1]]. Compute its singular values and one valid SVD.
Compute AᵀA:
AᵀA = [[1,1],[1,1]]ᵀ[[1,1],[1,1]] = [[2,2],[2,2]].
Find eigenvalues of AᵀA.
Matrix [[2,2],[2,2]] has:
λ₁ = 4 with eigenvector [1,1]ᵀ
λ₂ = 0 with eigenvector [1,−1]ᵀ
Normalize:
v₁ = (1/√2)[1,1]ᵀ
v₂ = (1/√2)[1,−1]ᵀ
Set V = [v₁ v₂].
Singular values:
σ₁ = √λ₁ = 2
σ₂ = √λ₂ = 0
So rank r = 1.
Compute u₁ = (1/σ₁)Av₁:
Av₁ = [[1,1],[1,1]] (1/√2)[1,1]ᵀ
= (1/√2)[2,2]ᵀ
= √2 [1,1]ᵀ
Thus u₁ = (1/2) √2 [1,1]ᵀ = (1/√2)[1,1]ᵀ.
Complete U to an orthonormal basis.
Choose u₂ = (1/√2)[1,−1]ᵀ.
Then U = [u₁ u₂] is orthonormal.
Assemble Σ = [[2,0],[0,0]].
Then A = UΣVᵀ holds (one valid SVD).
Insight: A has rank 1: it only acts nontrivially along the direction v₁ = (1/√2)[1,1]. Everything orthogonal to that direction is sent to 0 (σ₂ = 0). The unit circle is squashed into a line segment (a degenerate ellipsoid).
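The rank-1 structure of this example is easy to confirm numerically:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0]])

s = np.linalg.svd(A, compute_uv=False)

print(np.isclose(s[0], 2.0))             # True: sigma_1 = 2
print(bool(s[1] < 1e-10))                # True: sigma_2 = 0 up to roundoff
print(np.linalg.matrix_rank(A) == 1)     # True: rank 1
```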
Suppose A has singular values σ₁ = 10, σ₂ = 3, σ₃ = 1 (and no others). Compute ‖A − A₁‖_F and ‖A − A₁‖₂.
Recall truncation errors:
‖A − A_k‖_F² = ∑_{i=k+1}^{r} σᵢ²
‖A − A_k‖₂ = σ_{k+1}.
Here r = 3 and k = 1.
So:
‖A − A₁‖_F² = σ₂² + σ₃² = 3² + 1² = 9 + 1 = 10.
Therefore:
‖A − A₁‖_F = √10.
Spectral norm error:
‖A − A₁‖₂ = σ₂ = 3.
Insight: SVD doesn’t just give a method—it gives exact guarantees. The leftover singular values quantify precisely what information you discard when compressing.
SVD factorizes any real matrix A ∈ ℝ^{m×n} as A = UΣVᵀ with U, V orthonormal and Σ diagonal with nonnegative entries.
Geometrically: Vᵀ rotates/reflects the input, Σ scales along orthogonal axes (σᵢ), and U rotates/reflects the output; the unit sphere maps to an ellipsoid.
Singular values satisfy σᵢ = √λᵢ where λᵢ are eigenvalues of AᵀA (or AAᵀ).
Right singular vectors (vᵢ) are eigenvectors of AᵀA; left singular vectors (uᵢ) are eigenvectors of AAᵀ.
The relationships Avᵢ = σᵢuᵢ and Aᵀuᵢ = σᵢvᵢ encode how A acts on special orthonormal directions.
SVD gives an outer-product expansion: A = ∑ σᵢ uᵢ vᵢᵀ, a sum of rank-1 components.
Truncating the SVD yields the best rank-k approximation: A_k = ∑_{i=1}^{k} σᵢ uᵢ vᵢᵀ, with clean error formulas.
PCA is closely connected: for centered data X, the principal directions are right singular vectors of X, and variances are proportional to σᵢ².
Confusing eigenvalues with singular values: eigenvalues can be negative/complex; singular values are always real and ≥ 0, derived from AᵀA.
Assuming U = V always: that holds for special cases (e.g., symmetric positive semidefinite), but not for general matrices.
Forgetting the rectangular shape of Σ in the full SVD: Σ is m×n, not necessarily square; only its diagonal entries matter.
Thinking the SVD is unique: singular vectors are not unique when singular values repeat, and sign flips (uᵢ, vᵢ) → (−uᵢ, −vᵢ) leave A unchanged.
Let A ∈ ℝ^{m×n} have SVD A = UΣVᵀ. Show that ‖Ax‖ ≤ σ₁‖x‖ for all x ∈ ℝⁿ.
Hint: Write x = Vz with z = Vᵀx. Use orthonormality to relate ‖z‖ and ‖x‖, then bound ‖Σz‖ by σ₁‖z‖.
Let z = Vᵀx, so x = Vz and ‖z‖ = ‖x‖ since V is orthonormal.
Then:
Ax = UΣVᵀx = UΣz.
Taking norms and using orthonormality of U:
‖Ax‖ = ‖UΣz‖ = ‖Σz‖.
Since Σ scales coordinates by at most σ₁,
‖Σz‖² = ∑ σᵢ² zᵢ² ≤ σ₁² ∑ zᵢ² = σ₁²‖z‖².
Thus ‖Ax‖ ≤ σ₁‖z‖ = σ₁‖x‖.
Compute the singular values of A = [[2, 0], [0, −1]] and give one valid SVD.
Hint: Compute AᵀA and take square-roots of its eigenvalues. Remember singular values are nonnegative; signs can be absorbed into U or V.
AᵀA = [[4,0],[0,1]]. Its eigenvalues are 4 and 1, so singular values are σ₁=2, σ₂=1.
One SVD: take V = I, Σ = [[2,0],[0,1]].
We need U such that UΣ = A. Since A differs by a sign on the second axis, choose U = diag(1, −1).
Then UΣVᵀ = diag(1,−1) diag(2,1) I = diag(2, −1) = A.
Suppose A has singular values 5, 2, 2, 0. What are rank(A), dim(null(A)), and the best rank-2 Frobenius error ‖A − A₂‖_F?
Hint: Rank is the number of positive singular values. Use ‖A − A_k‖_F² = ∑_{i>k} σᵢ².
Positive singular values: 5,2,2 → rank(A)=3.
If A is m×n, then dim(null(A)) = n − rank(A) = n − 3.
For k=2:
‖A − A₂‖_F² = σ₃² + σ₄² = 2² + 0² = 4.
So ‖A − A₂‖_F = 2.