Projecting vectors onto subspaces. Least squares.
When data doesn’t fit your model perfectly, you don’t “solve” Ax = b—you choose the best approximation. Projections are the geometric rule that turns “best approximation” into a precise, computable object.
Projecting a vector v onto a subspace S means finding the unique u ∈ S that minimizes ‖v − u‖. For S = Col(A) with full column rank, the orthogonal projection is p = Pb, where P = A(AᵀA)⁻¹Aᵀ. Least squares chooses x̂ that makes Ax̂ the projection of b onto Col(A), leading to the normal equations AᵀAx̂ = Aᵀb.
In linear algebra, a subspace S is a “legal set of directions.” Often, the vector you have—call it v—doesn’t lie inside S. That mismatch happens constantly: noisy measurements rarely land exactly on a model line or plane, and an overdetermined system Ax = b typically has no exact solution.
In all these cases, you want a principled way to replace v with a nearby vector u that does lie in S. The key idea is: choose the closest one.
Let S ⊂ ℝⁿ be a subspace, and let v ∈ ℝⁿ. The orthogonal projection of v onto S, written proj_S(v), is the unique vector u ∈ S that minimizes the distance to v:
minimize over u ∈ S: ‖v − u‖.
This is a geometric “drop a perpendicular” idea generalized from 2D/3D to ℝⁿ.
The minimizer isn’t just any closest point. It satisfies a very specific condition:
Let p = proj_S(v). Define the residual (error) r = v − p.
Then r ∈ S⊥, meaning r is orthogonal to every vector in S:
sᵀr = 0 for all s ∈ S.
This is the characterization that makes projections computable.
Because S is a subspace (hence closed and convex) and the objective ‖v − u‖² is strictly convex in u, there is exactly one minimizer.
Geometrically: the set of points in S is “flat,” and the distance function has a single lowest point when you look from v toward that flat set.
Orthogonal projection gives a clean decomposition:
v = p + r, with p = proj_S(v) ∈ S and r = v − p ∈ S⊥.
So projection is not merely “closest point.” It is a structured split into a component inside the subspace and a component orthogonal to it.
If S is the span of a nonzero vector a, then the projection is the familiar formula:
proj_S(v) = ((aᵀv) / (aᵀa)) a.
If a is unit length (‖a‖ = 1), this simplifies to:
proj_S(v) = (aᵀv) a.
The general subspace case is this idea repeated, but with multiple basis directions at once.
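The one-dimensional formula is easy to check numerically. The sketch below uses NumPy (an assumption of this note, not part of the lesson text); the vectors are arbitrary illustrative choices:

```python
import numpy as np

def proj_line(a, v):
    """Project v onto span(a): proj = ((a·v) / (a·a)) a."""
    a = np.asarray(a, dtype=float)
    v = np.asarray(v, dtype=float)
    return (a @ v) / (a @ a) * a

a = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])
p = proj_line(a, v)             # coefficient is 3/25, so p = (9/25, 12/25)
r = v - p                       # residual
assert np.isclose(a @ r, 0.0)   # r is orthogonal to the line, as required
```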
The definition “pick the closest u ∈ S” is conceptually perfect, but it doesn’t yet tell you how to compute u. The trick is to translate “closest” into an equation.
Instead of minimizing ‖v − u‖ directly, minimize the squared norm (same minimizer, easier algebra):
f(u) = ‖v − u‖², over u ∈ S.
Let p ∈ S be the minimizer. Consider any direction s ∈ S (a feasible direction to move within the subspace). If you nudge p inside the subspace by ts, the distance to v cannot decrease at t = 0 (otherwise p wasn’t minimal).
Compute the derivative:
f(p + ts) = ‖v − (p + ts)‖²
= ‖(v − p) − ts‖²
= ((v − p) − ts)ᵀ((v − p) − ts)
= ‖v − p‖² − 2t sᵀ(v − p) + t²‖s‖².
At a minimum, derivative at t = 0 is 0:
d/dt f(p + ts) |_{t=0} = −2 sᵀ(v − p) = 0.
So:
sᵀ(v − p) = 0 for every s ∈ S,
which is exactly:
r = v − p ⟂ S.
This is the most important practical fact: the residual is orthogonal to the model subspace.
Suppose S is spanned by k linearly independent vectors a₁, …, aₖ. Put them into a matrix A ∈ ℝⁿˣᵏ as columns:
A = [a₁ a₂ ⋯ aₖ].
Any vector in S looks like Ax for some x ∈ ℝᵏ.
We want p = Ax that makes r = v − Ax orthogonal to S.
Being orthogonal to S is equivalent to being orthogonal to every column of A:
aᵢᵀ(v − Ax) = 0 for i = 1, …, k.
Stack these k equations:
Aᵀ(v − Ax) = 0
⇒ AᵀAx = Aᵀv.
These are the normal equations in the language of least squares (we’ll return to that). For projection, they are simply the orthogonality conditions written in matrix form.
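As a hedged numerical sketch (NumPy assumed), projecting onto Col(A) means solving the normal equations AᵀAx = Aᵀv; the matrix and vector below are arbitrary illustrative choices:

```python
import numpy as np

# Project v onto Col(A) by solving the normal equations A^T A x = A^T v.
# Solving the linear system is preferred over forming (A^T A)^{-1} explicitly.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
v = np.array([1.0, 2.0, 3.0])

x = np.linalg.solve(A.T @ A, A.T @ v)
p = A @ x                        # projection of v onto Col(A)
r = v - p                        # residual

assert np.allclose(A.T @ r, 0)   # residual orthogonal to every column of A
```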
If the columns of Q ∈ ℝⁿˣᵏ are orthonormal (QᵀQ = I), then projection is especially simple.
Let S = Col(Q). We want p ∈ S of the form p = Qc.
Use the orthogonality condition:
Qᵀ(v − Qc) = 0
⇒ Qᵀv − QᵀQc = 0
⇒ Qᵀv − Ic = 0
⇒ c = Qᵀv.
So:
p = Qc = Q(Qᵀv) = QQᵀv.
Interpretation:
1) Compute coordinates c = Qᵀv (dot products against basis vectors).
2) Rebuild the vector in the subspace: Qc.
This is the cleanest picture of what projection does.
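A small NumPy sketch (illustrative, with an arbitrary A) of the orthonormal-basis recipe, using a thin QR factorization to obtain Q:

```python
import numpy as np

# With orthonormal columns Q, projection reduces to
# coordinates c = Q^T v, then rebuild p = Q c.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
v = np.array([2.0, 0.0, 1.0])

Q, R = np.linalg.qr(A)   # Q has orthonormal columns spanning Col(A)
c = Q.T @ v              # step 1: coordinates in the orthonormal basis
p = Q @ c                # step 2: rebuild the vector; p = Q Q^T v

# Same subspace, so this agrees with the normal-equation projection.
x = np.linalg.solve(A.T @ A, A.T @ v)
assert np.allclose(p, A @ x)
```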
With v = p + r and p ⟂ r:
‖v‖² = ‖p‖² + ‖r‖².
This is often used to reason about error. Projection chooses p to make ‖r‖ as small as possible, so this identity splits ‖v‖² into the part captured by S (‖p‖²) and the leftover orthogonal part (‖r‖²).
If you will project many different vectors onto the same subspace S, you want a reusable operator. Projections are not just geometric constructions; they are linear transformations.
For a fixed subspace S ⊂ ℝⁿ, the orthogonal projection is a function P: ℝⁿ → ℝⁿ such that:
P(v) = proj_S(v) for every v ∈ ℝⁿ.
When written in standard coordinates, this function is multiplication by an n×n matrix, also called P.
Assume A ∈ ℝⁿˣᵏ has full column rank (its columns are independent). Then AᵀA is invertible.
Given b ∈ ℝⁿ, the projection of b onto Col(A) is p = Ax̂ where x̂ solves the normal equations:
AᵀAx̂ = Aᵀb
⇒ x̂ = (AᵀA)⁻¹Aᵀb.
Plug back into p:
p = Ax̂
= A((AᵀA)⁻¹Aᵀb)
= (A(AᵀA)⁻¹Aᵀ)b.
So the projection matrix is:
P = A(AᵀA)⁻¹Aᵀ.
For an orthogonal projection matrix P onto S:
1) Idempotent: P² = P
Apply the projection twice: once you’re in the subspace, projecting again does nothing.
2) Symmetric: Pᵀ = P
This encodes “orthogonal” in matrix form.
3) Range and null space: Range(P) = S and Null(P) = S⊥.
So P keeps the S component and kills the orthogonal component.
Let P = A(AᵀA)⁻¹Aᵀ. Then:
P² = [A(AᵀA)⁻¹Aᵀ][A(AᵀA)⁻¹Aᵀ]
= A(AᵀA)⁻¹(AᵀA)(AᵀA)⁻¹Aᵀ
= A(I)(AᵀA)⁻¹Aᵀ
= A(AᵀA)⁻¹Aᵀ
= P.
The cancellation (AᵀA)⁻¹(AᵀA) = I is valid because AᵀA is invertible, which is guaranteed by A having full column rank.
Pᵀ = [A(AᵀA)⁻¹Aᵀ]ᵀ
= (Aᵀ)ᵀ[(AᵀA)⁻¹]ᵀAᵀ
= A[(AᵀA)⁻¹]Aᵀ
= P,
because AᵀA is symmetric, so (AᵀA)⁻¹ is also symmetric.
If Q has orthonormal columns (QᵀQ = I), then:
P = Q(QᵀQ)⁻¹Qᵀ = QIQᵀ = QQᵀ.
This is why orthonormal bases are so valuable: they make projection formulas simpler and numerically more stable.
From P² = P, any eigenvalue λ satisfies λ² = λ, so λ ∈ {0, 1}. Vectors in S are fixed (eigenvalue 1), and vectors in S⊥ are sent to zero (eigenvalue 0).
So P is literally a “keep these directions, erase those directions” operator.
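These properties can be verified numerically. The sketch below (NumPy assumed, arbitrary full-rank A) builds P and checks idempotence, symmetry, and the eigenvalue claim:

```python
import numpy as np

# Build P = A (A^T A)^{-1} A^T and check the defining properties.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
P = A @ np.linalg.inv(A.T @ A) @ A.T

assert np.allclose(P @ P, P)      # idempotent: projecting twice changes nothing
assert np.allclose(P.T, P)        # symmetric: encodes "orthogonal"

eigvals = np.linalg.eigvalsh(P)   # eigenvalues of a symmetric matrix, ascending
assert np.allclose(eigvals, [0.0, 1.0, 1.0])   # λ ∈ {0, 1}
```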
Consider a system Ax = b. If b ∈ Col(A), there is an exact solution. But if b is not in Col(A), there is no x that makes Ax equal to b.
Least squares replaces “solve exactly” with “get as close as possible”:
minimize over x: ‖Ax − b‖.
This is the same geometry as projection: the set of achievable vectors Ax is exactly Col(A), and b is the point to approximate. So the least-squares fitted vector Ax̂ is:
Ax̂ = proj_Col(A)(b).
Minimize the squared error:
f(x) = ‖Ax − b‖²
= (Ax − b)ᵀ(Ax − b).
Expand:
f(x) = xᵀAᵀAx − 2xᵀAᵀb + bᵀb.
Take gradient and set to zero:
∇f(x) = 2AᵀAx − 2Aᵀb = 0
⇒ AᵀAx̂ = Aᵀb.
These are the normal equations.
Geometric meaning: the residual r = b − Ax̂ is orthogonal to Col(A):
Aᵀ(b − Ax̂) = 0.
That is exactly the orthogonality condition for projection.
If A has full column rank, AᵀA is invertible and:
x̂ = (AᵀA)⁻¹Aᵀb.
If A does not have full column rank, there are infinitely many least-squares minimizers; then one typically uses the pseudoinverse A⁺ (outside this node’s core scope, but good to know).
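A brief NumPy sketch contrasting the two cases; the matrices and vector here are illustrative assumptions:

```python
import numpy as np

# Full column rank: lstsq and the normal equations agree.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0, 2.0])
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(A.T @ A @ x_hat, A.T @ b)    # normal equations hold

# Rank-deficient: A^T A is singular; the pseudoinverse picks the
# minimum-norm solution among the infinitely many minimizers.
A2 = np.array([[1.0, 2.0],
               [2.0, 4.0],
               [3.0, 6.0]])       # second column = 2 x first column
x_min = np.linalg.pinv(A2) @ b
assert np.allclose(A2.T @ (b - A2 @ x_min), 0)  # residual still orthogonal
```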
Let p = Ax̂ and r = b − p. Then b = p + r with p ∈ Col(A) and r ⟂ Col(A).
This gives a stable mental model: least squares chooses the decomposition of b into “explained by the model subspace” plus “leftover orthogonal noise.”
In linear regression, A is the design matrix, x is the parameter vector, and b (often written y) is the target output. The fitted predictions Ax̂ are the projection of y onto Col(A).
That single sentence unifies the geometry of projection, the algebra of the normal equations, and the statistics of linear regression.
Let a = (2, 1) and v = (1, 3). Let S = span(a) ⊂ ℝ². Compute proj_S(v) and the residual r.
Use the line projection formula:
proj_S(v) = ((aᵀv) / (aᵀa)) a.
Compute dot products:
aᵀv = (2, 1)·(1, 3) = 2·1 + 1·3 = 5.
aᵀa = (2, 1)·(2, 1) = 2² + 1² = 5.
Compute the scalar coefficient:
(aᵀv) / (aᵀa) = 5/5 = 1.
Compute the projection:
proj_S(v) = 1·a = (2, 1).
Compute the residual:
r = v − proj_S(v) = (1, 3) − (2, 1) = (−1, 2).
Check orthogonality to the subspace (it suffices to check ⟂ to a):
aᵀr = (2, 1)·(−1, 2) = 2(−1) + 1(2) = 0.
So r ⟂ span(a), as required.
Insight: The closest point on a line is found by matching the component of v along the line direction and discarding the perpendicular component. The residual must be orthogonal to the line—this condition is what generalizes to higher-dimensional subspaces.
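A quick numerical replay of this worked example (NumPy assumed):

```python
import numpy as np

# Worked Example 1: project v = (1, 3) onto span(a) with a = (2, 1).
a = np.array([2.0, 1.0])
v = np.array([1.0, 3.0])

coef = (a @ v) / (a @ a)   # 5/5 = 1
p = coef * a               # projection = (2, 1)
r = v - p                  # residual  = (-1, 2)

assert np.allclose(p, [2.0, 1.0])
assert np.isclose(a @ r, 0.0)   # residual orthogonal to the line
```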
Let A = [[1],[1],[0]] (a 3×1 matrix) and b = (2, 0, 1). Compute the projection of b onto Col(A) and verify P² = P.
Interpretation:
Col(A) is the line in ℝ³ spanned by a = (1, 1, 0). We will compute P explicitly.
Compute AᵀA:
AᵀA = [1 1 0]·[1; 1; 0] = 1² + 1² + 0² = 2.
So (AᵀA)⁻¹ = 1/2.
Compute P:
P = A(AᵀA)⁻¹Aᵀ
= (1/2)·[1; 1; 0]·[1 1 0]
= (1/2)
[[1·1, 1·1, 1·0],
[1·1, 1·1, 1·0],
[0·1, 0·1, 0·0]]
= (1/2)
[[1, 1, 0],
[1, 1, 0],
[0, 0, 0]].
Project b:
p = Pb
= (1/2)
[[1, 1, 0],
[1, 1, 0],
[0, 0, 0]]
[2; 0; 1]
= (1/2)
[1·2 + 1·0 + 0·1;
1·2 + 1·0 + 0·1;
0]
= (1/2)[2; 2; 0]
= (1, 1, 0).
Residual:
r = b − p = (2, 0, 1) − (1, 1, 0) = (1, −1, 1).
Check orthogonality to Col(A):
aᵀr = (1, 1, 0)·(1, −1, 1) = 1 − 1 + 0 = 0.
Verify idempotence (compute P²):
First note P = (1/2)M with M = [[1,1,0],[1,1,0],[0,0,0]].
Compute M²:
M² = [[1,1,0],[1,1,0],[0,0,0]]·[[1,1,0],[1,1,0],[0,0,0]]
= [[2,2,0],[2,2,0],[0,0,0]]
= 2M.
So P² = (1/2)² M² = (1/4)(2M) = (1/2)M = P.
Insight: The matrix P acts as a reusable operator: any b is decomposed into p ∈ Col(A) and r ⟂ Col(A). The algebraic identities P² = P and Pᵀ = P encode the geometric facts “projecting twice changes nothing” and “the residual is orthogonal.”
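The same example, replayed numerically (NumPy assumed):

```python
import numpy as np

# Worked Example 2: P = A (A^T A)^{-1} A^T for the 3x1 matrix A = (1, 1, 0)^T.
A = np.array([[1.0], [1.0], [0.0]])
b = np.array([2.0, 0.0, 1.0])

P = A @ np.linalg.inv(A.T @ A) @ A.T   # = (1/2) [[1,1,0],[1,1,0],[0,0,0]]
p = P @ b

assert np.allclose(p, [1.0, 1.0, 0.0])   # projection from the text
assert np.allclose(P @ P, P)             # idempotent
assert np.allclose(P.T, P)               # symmetric
```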
Fit b = (1, 2, 2) with a model Ax where A has two columns:
A = [[1,0],[1,1],[1,2]].
Compute the least-squares solution x̂ and the fitted vector p = Ax̂.
Set up the normal equations:
AᵀAx̂ = Aᵀb.
Compute AᵀA.
Aᵀ = [[1,1,1],[0,1,2]].
So:
AᵀA = [[1,1,1],[0,1,2]]·[[1,0],[1,1],[1,2]]
= [[1+1+1, 0+1+2],
[0+1+2, 0²+1²+2²]]
= [[3, 3],
[3, 5]].
Compute Aᵀb:
Aᵀb = [[1,1,1],[0,1,2]]·[1;2;2]
= [1+2+2;
0·1 + 1·2 + 2·2]
= [5;
6].
Solve [[3,3],[3,5]] x̂ = [5,6].
Let x̂ = (x₁, x₂).
Equations:
3x₁ + 3x₂ = 5
3x₁ + 5x₂ = 6
Subtract first from second:
(3x₁ + 5x₂) − (3x₁ + 3x₂) = 6 − 5
2x₂ = 1 ⇒ x₂ = 1/2.
Plug back:
3x₁ + 3(1/2) = 5
3x₁ + 3/2 = 5
3x₁ = 5 − 3/2 = 10/2 − 3/2 = 7/2
x₁ = 7/6.
Compute fitted vector p = Ax̂:
Ax̂ = [[1,0],[1,1],[1,2]]·[7/6; 1/2]
= [7/6 + 0;
7/6 + 1/2;
7/6 + 2·(1/2)]
= [7/6;
7/6 + 3/6;
7/6 + 6/6]
= [7/6;
10/6;
13/6]
= (7/6, 5/3, 13/6).
Compute residual r = b − p:
r = (1,2,2) − (7/6, 5/3, 13/6)
= (6/6−7/6, 6/3−5/3, 12/6−13/6)
= (−1/6, 1/3, −1/6).
Check orthogonality to Col(A) by verifying Aᵀr = 0:
Aᵀr = [[1,1,1],[0,1,2]]·[−1/6; 1/3; −1/6]
= [−1/6 + 1/3 − 1/6;
0·(−1/6) + 1·(1/3) + 2·(−1/6)]
= [0;
1/3 − 2/6]
= [0;
0].
Insight: Least squares chooses x̂ so that the residual is orthogonal to every column of A. That orthogonality is exactly the geometric condition for projection onto Col(A).
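A numerical replay of this least-squares example (NumPy assumed):

```python
import numpy as np

# Worked Example 3: least squares for A x ≈ b.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

x_hat = np.linalg.solve(A.T @ A, A.T @ b)   # normal equations
p = A @ x_hat                               # fitted vector
r = b - p                                   # residual

assert np.allclose(x_hat, [7/6, 1/2])
assert np.allclose(p, [7/6, 5/3, 13/6])
assert np.allclose(A.T @ r, 0)   # residual orthogonal to Col(A)
```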
proj_S(v) is the unique u ∈ S minimizing ‖v − u‖; equivalently, the residual v − u lies in S⊥.
Projection decomposes v as v = p + r with p ∈ S, r ∈ S⊥, and p ⟂ r.
If S = Col(A) and A has full column rank, the projection matrix is P = A(AᵀA)⁻¹Aᵀ and proj_S(b) = Pb.
Orthogonal projection matrices satisfy P² = P (idempotent) and Pᵀ = P (symmetric).
With an orthonormal basis Q for S, projection simplifies to proj_S(v) = Q(Qᵀv) and P = QQᵀ.
Least squares, min over x of ‖Ax − b‖, produces Ax̂ = proj_Col(A)(b) and leads to the normal equations AᵀAx̂ = Aᵀb.
Geometrically, least squares finds the closest point in Col(A) to b; algebraically, it enforces Aᵀ(b − Ax̂) = 0.
Forgetting the orthogonality condition: the residual must be orthogonal to the entire subspace (all columns of A), not just to the projected vector.
Using P = A(AᵀA)⁻¹Aᵀ when A is not full column rank (AᵀA not invertible). In that case you need a different tool (e.g., pseudoinverse).
Confusing projection onto Col(A) with projection onto the row space; the formulas involve Aᵀ differently and live in different ambient spaces.
Assuming every idempotent matrix is an orthogonal projection: orthogonal projections require both P² = P and Pᵀ = P.
Let a = (1, −2, 2) and v = (2, 1, 0). Compute proj_span(a)(v) and the residual r. Verify aᵀr = 0.
Hint: Use proj_span(a)(v) = ((aᵀv) / (aᵀa)) a.
aᵀv = (1,−2,2)·(2,1,0) = 2 − 2 + 0 = 0.
aᵀa = 1² + (−2)² + 2² = 1 + 4 + 4 = 9.
Coefficient = 0/9 = 0.
Projection = 0·a = (0,0,0).
Residual r = v − 0 = (2,1,0).
Check: aᵀr = aᵀv = 0, so orthogonality holds.
Let A = [[1,1],[1,0],[0,1]] and b = (1,2,3). (i) Compute x̂ solving AᵀAx̂ = Aᵀb. (ii) Compute p = Ax̂. (iii) Compute r = b − p and verify Aᵀr = 0.
Hint: Compute AᵀA (2×2) and Aᵀb (2×1), solve the 2×2 system, then form p and r.
Aᵀ = [[1,1,0],[1,0,1]].
AᵀA = [[1,1,0],[1,0,1]]·[[1,1],[1,0],[0,1]]
= [[1²+1²+0², 1·1+1·0+0·1],
[1·1+0·1+1·0, 1²+0²+1²]]
= [[2, 1],
[1, 2]].
Aᵀb = [[1,1,0],[1,0,1]]·[1;2;3] = [1+2+0; 1+0+3] = [3;4].
Solve [[2,1],[1,2]] [x₁;x₂] = [3;4].
From 2x₁ + x₂ = 3 ⇒ x₂ = 3 − 2x₁.
Plug into x₁ + 2x₂ = 4:
x₁ + 2(3 − 2x₁) = 4 ⇒ x₁ + 6 − 4x₁ = 4 ⇒ −3x₁ = −2 ⇒ x₁ = 2/3.
Then x₂ = 3 − 2(2/3) = 3 − 4/3 = 5/3.
So x̂ = (2/3, 5/3).
Compute p = Ax̂:
Ax̂ = [[1,1],[1,0],[0,1]]·[2/3; 5/3]
= [2/3+5/3; 2/3; 5/3]
= [7/3; 2/3; 5/3].
Residual r = b − p = [1;2;3] − [7/3;2/3;5/3]
= [3/3−7/3; 6/3−2/3; 9/3−5/3]
= [−4/3; 4/3; 4/3].
Verify Aᵀr:
Aᵀr = [[1,1,0],[1,0,1]]·[−4/3; 4/3; 4/3]
= [−4/3 + 4/3 + 0; −4/3 + 0 + 4/3] = [0;0].
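A numerical check of this exercise (NumPy assumed):

```python
import numpy as np

# Exercise: least squares for A = [[1,1],[1,0],[0,1]], b = (1, 2, 3).
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])

x_hat = np.linalg.solve(A.T @ A, A.T @ b)
p = A @ x_hat
r = b - p

assert np.allclose(x_hat, [2/3, 5/3])
assert np.allclose(p, [7/3, 2/3, 5/3])
assert np.allclose(A.T @ r, 0)   # residual orthogonal to Col(A)
```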
Suppose P is an orthogonal projection matrix in ℝⁿ (so P² = P and Pᵀ = P). Show that for any v, the residual r = v − Pv is orthogonal to every vector in Range(P).
Hint: Take an arbitrary y in Range(P). Write y = Px for some x. Compute yᵀr and use symmetry/idempotence.
Let y ∈ Range(P). Then ∃ x such that y = Px.
Let r = v − Pv.
Compute:
yᵀr = (Px)ᵀ(v − Pv)
= xᵀPᵀ(v − Pv) (move P using transpose)
= xᵀP(v − Pv) (since Pᵀ = P)
= xᵀ(Pv − P²v)
= xᵀ(Pv − Pv) (since P² = P)
= xᵀ0
= 0.
So yᵀr = 0 for all y ∈ Range(P), meaning r ⟂ Range(P).
Next up: projections are the geometric backbone of least squares regression.