Predicting continuous values. Least squares, normal equations.
Linear regression is the “hello world” of predicting numbers: you pick a linear rule, compare its predictions to real outcomes, and adjust the rule to make the squared errors as small as possible. Under the hood, it’s an elegant geometry problem about projecting y onto the column space of X.
Linear regression predicts a continuous target by ŷ = Xβ. We fit β by minimizing the sum of squared residuals ‖y − Xβ‖². The solution satisfies the normal equations XᵀXβ = Xᵀy (when XᵀX is invertible) and geometrically corresponds to projecting y onto the subspace spanned by the columns of X.
In supervised learning for continuous outputs, you want a model that maps an input feature vector to a real number.
Linear regression is the simplest widely useful choice:
For one example with features x, linear regression predicts
ŷ = β₀ + β₁x₁ + β₂x₂ + … + β_d x_d
To keep notation clean, we fold the intercept into the feature vector by adding a constant feature:
Then the prediction becomes a dot product:
ŷ = xᵀβ
Suppose you have n training examples. Stack them into a design matrix X:
Also stack targets into a vector:
Predictions for all examples at once:
ŷ = Xβ
This compact form is the main reason linear regression is such a good “bridge” topic: it connects machine learning, linear algebra, geometry, and optimization.
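This compact form is easy to sketch in code. A minimal NumPy sketch (with made-up numbers) of building a design matrix with an intercept column and predicting for all examples at once:

```python
import numpy as np

# Toy data: 3 examples, 2 raw features (values made up for illustration)
X_raw = np.array([[0.5, 1.0],
                  [1.5, 2.0],
                  [2.5, 0.0]])

# Fold the intercept in by prepending a column of ones
X = np.column_stack([np.ones(len(X_raw)), X_raw])  # shape (3, 3)

beta = np.array([1.0, 2.0, -0.5])  # [β0, β1, β2], arbitrary here

# Predictions for all examples at once: ŷ = Xβ
y_hat = X @ beta  # → [1.5, 3.0, 6.0]
```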
You need a rule for choosing β. Linear regression chooses β to make predictions close to y.
Define residuals:
r = y − Xβ
A natural idea is to minimize total error. But we need a scalar objective. Squared error is the classic choice:
SSE(β) = ∑ᵢ (yᵢ − ŷᵢ)² = ‖y − Xβ‖²
Why square? Squared error is differentiable everywhere, penalizes large errors more heavily than small ones, and turns the fitting problem into tractable linear algebra with a closed-form solution.
Linear regression does not assume “the true relationship is linear.” It assumes the prediction is a linear function of the parameters β.
You can include nonlinear transformations as features (e.g., x², log x), and the model is still linear in β.
Least squares fitting means choosing β to minimize squared residual length:
minimize over β: J(β) = ‖y − Xβ‖²
This is a geometry problem wearing an optimization costume.
As we will derive, the fitted predictions Xβ̂ are the orthogonal projection of y onto Col(X).
Write J(β) as a quadratic form:
J(β) = (y − Xβ)ᵀ(y − Xβ)
Expand carefully:
J(β)
= (yᵀ − βᵀXᵀ)(y − Xβ)
= yᵀy − yᵀXβ − βᵀXᵀy + βᵀXᵀXβ
Note that yᵀXβ is a scalar, and equals its transpose:
yᵀXβ = ( yᵀXβ )ᵀ = βᵀXᵀy
So the middle two terms combine:
J(β) = yᵀy − 2βᵀXᵀy + βᵀXᵀXβ
This makes it clear J(β) is a convex quadratic function in β.
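If you want to sanity-check the expansion numerically, here is a short sketch (random made-up data; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
beta = rng.normal(size=3)

# Direct objective: ‖y − Xβ‖²
J_direct = np.sum((y - X @ beta) ** 2)

# Expanded form: yᵀy − 2βᵀXᵀy + βᵀXᵀXβ
J_expanded = y @ y - 2 * beta @ (X.T @ y) + beta @ (X.T @ X) @ beta

# The two agree up to floating-point error
```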
To minimize J(β), set its gradient to zero.
Using standard matrix calculus results:
∂/∂β (βᵀc) = c for a constant vector c, and ∂/∂β (βᵀAβ) = (A + Aᵀ)β, which simplifies to 2Aβ when A is symmetric.
Here A = XᵀX, which is symmetric.
Compute the gradient:
∇β J(β)
= ∇β( yᵀy − 2βᵀXᵀy + βᵀXᵀXβ )
= 0 − 2Xᵀy + 2XᵀXβ
Set ∇β J(β) = 0:
2XᵀXβ − 2Xᵀy = 0
Divide by 2:
XᵀXβ = Xᵀy
These are the normal equations.
If XᵀX is invertible (full column rank), then
β̂ = (XᵀX)⁻¹ Xᵀ y
This is the classic closed-form least squares solution.
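As a sketch (random made-up data; NumPy assumed), solving the normal equations agrees with NumPy's dedicated least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = rng.normal(size=20)

# Solve XᵀXβ = Xᵀy directly (solve a linear system, don't form the inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with np.linalg.lstsq
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```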
If XᵀX is not invertible (e.g., perfectly collinear features), there are infinitely many solutions that achieve the same minimum SSE. A common choice is the minimum-norm solution via the Moore–Penrose pseudoinverse:
β̂ = X⁺ y
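A minimal sketch of the rank-deficient case (made-up data where the third column is exactly twice the second; NumPy assumed):

```python
import numpy as np

# Perfectly collinear columns make XᵀX singular
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), x1, 2 * x1])
y = np.array([1.0, 3.0, 2.0, 5.0])

# Minimum-norm least-squares solution via the pseudoinverse
beta_hat = np.linalg.pinv(X) @ y
residual = y - X @ beta_hat  # still orthogonal to Col(X)
```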
At the optimum, residual r = y − Xβ̂ is orthogonal to the column space of X.
In algebra:
Xᵀ(y − Xβ̂) = 0
which rearranges to the normal equations.
So you can read the normal equations as: every column of X is orthogonal to the residual.
That is exactly the condition for an orthogonal projection.
Least squares chooses β to balance tradeoffs across all points.
This is why linear regression is both an optimization problem (minimize a convex quadratic) and a geometry problem (project y onto a subspace).
Including an intercept means adding a column of ones:
X = [ 1 | feature columns ]
This matters because without an intercept, the model is forced to go through the origin in feature space (in 1D, the line must pass through (0,0)). In real data, that constraint is often unjustified and increases error.
It helps to keep a mental “type system” for linear regression: X is n×p, β is p×1, and y, ŷ, and r are all n×1.
If you ever feel stuck, check dimensions first.
The set of all predictions the model can make is
{ Xβ : β ∈ ℝᵖ } = Col(X)
That is a subspace of ℝⁿ (or an affine subspace if you treat the intercept differently; with the column-of-ones trick, it’s still a linear subspace in ℝⁿ).
The fitted predictions are
ŷ = Xβ̂ = Proj_{Col(X)}(y)
The residual vector
r = y − ŷ
is orthogonal to Col(X).
(XᵀX) is invertible iff the columns of X are linearly independent (full column rank). Practically, it can fail for several reasons: duplicated or perfectly collinear features, more features than examples (p > n), or redundant encodings (e.g., a full set of one-hot dummy columns alongside an intercept).
Even if it exists, it can be numerically unstable if XᵀX is ill-conditioned.
Although the formula β̂ = (XᵀX)⁻¹Xᵀy is conceptually important, computing the inverse explicitly is usually a bad idea numerically.
Common stable approaches:
| Approach | Idea | Pros | Cons |
|---|---|---|---|
| QR decomposition | X = QR, solve Rβ = Qᵀy | Stable, standard | More linear algebra machinery |
| SVD / pseudoinverse | X = UΣVᵀ | Best for rank-deficient cases | Slower |
| Gradient-based optimization | minimize J(β) iteratively | Scales, works with huge data | Needs learning rate / iterations |
This lesson focuses on the conceptual math; just remember that implementations often use QR/SVD under the hood.
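A sketch of the QR route on made-up data (NumPy assumed); the triangular system Rβ = Qᵀy gives the same β̂ as the normal equations:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 3))])
y = rng.normal(size=30)

# Thin QR: X = QR, Q has orthonormal columns, R is upper triangular
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# Same answer as the normal equations (on this well-conditioned X)
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
```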
Least squares itself doesn’t require scaling, but conditioning and interpretation improve when you center and standardize the features.
Centering also interacts with the intercept: with centered features, the fitted intercept equals the mean of y.
Linear regression is linear in parameters β, not necessarily linear in raw input.
Example: If you use features [1, x, x²], predictions are
ŷ = β₀ + β₁x + β₂x²
This is a quadratic curve in x, but still a linear regression model because it is a linear combination of features with coefficients β.
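A sketch of this idea (made-up points that lie exactly on y = 1 + 2x − x²; NumPy assumed), showing that least squares recovers the quadratic's coefficients:

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
y = 1 + 2 * x - x ** 2  # exact quadratic, no noise

# Design matrix with polynomial features — still linear in β
X = np.column_stack([np.ones_like(x), x, x ** 2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # ≈ [1, 2, -1]
```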
After fitting β̂, the prediction for a new example x_new is
ŷ_new = x_newᵀ β̂
For a batch of new examples X_new:
ŷ_new = X_new β̂
If feature xⱼ increases by 1 (holding other features fixed), the prediction changes by βⱼ.
That “holding fixed” clause is crucial: in correlated data, changing one feature while keeping others constant may describe a scenario that doesn’t naturally occur.
Since the training objective is SSE, related metrics are common: MSE = SSE/n and RMSE = √MSE, which is in the same units as y.
Define SSE = ∑ᵢ (yᵢ − ŷᵢ)² and SST = ∑ᵢ (yᵢ − ȳ)², the squared error of always predicting the mean.
Then
R² = 1 − SSE/SST
Interpretation: R² is the fraction of the variance in y explained by the model; 1 means a perfect fit, and 0 means no better than predicting ȳ.
Be cautious: R² always (weakly) increases when you add features, even useless ones. This is one reason regularization (like ridge regression) matters later.
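These metrics are a few lines of code. A sketch using the 3-point fit from the worked example later in this lesson (y regressed on x = 0, 1, 2):

```python
import numpy as np

y = np.array([1.0, 2.0, 2.0])
y_hat = np.array([7/6, 5/3, 13/6])  # fitted values for x = 0, 1, 2

sse = np.sum((y - y_hat) ** 2)      # 1/6
sst = np.sum((y - y.mean()) ** 2)   # 2/3
mse = sse / len(y)
rmse = np.sqrt(mse)
r2 = 1 - sse / sst                  # 0.75
```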
Linear regression can still overfit when the number of features is large relative to the number of examples, when features are noisy or highly correlated, or when you try many candidate feature transformations and keep whichever fits best.
Least squares will chase patterns in the sample even if they are not stable.
The usual fix is to evaluate on held-out data (validation/test) and/or add regularization.
Linear regression is a gateway to many later ideas: regularized variants like ridge regression, nonlinear fits via basis expansions (as in the polynomial-features example), and the projection view that recurs throughout applied linear algebra.
If you understand linear regression deeply—especially the projection view—you’re set up to understand a lot of ML with much less memorization.
Data: (x, y) = (0, 1), (1, 2), (2, 2). Use an intercept by setting features x = [1, x]ᵀ. Compute β̂ = [β₀, β₁]ᵀ.
Build the design matrix X (n = 3, p = 2):
X = [ [1, 0],
[1, 1],
[1, 2] ]
Build the target vector:
y = [1, 2, 2]ᵀ
Compute XᵀX:
XᵀX = [ [1,1,1],
[0,1,2] ] · [ [1,0],
[1,1],
[1,2] ]
= [ [3, 3],
[3, 5] ]
Compute Xᵀy:
Xᵀy = [ [1,1,1],
[0,1,2] ] · [1,2,2]ᵀ
= [ 1+2+2,
0·1 + 1·2 + 2·2 ]ᵀ
= [5, 6]ᵀ
Solve the normal equations (XᵀX)β = Xᵀy:
[ [3, 3],
[3, 5] ] · [β₀, β₁]ᵀ = [5, 6]ᵀ
Solve the linear system:
From 3β₀ + 3β₁ = 5 ⇒ β₀ + β₁ = 5/3
From 3β₀ + 5β₁ = 6
Subtract the first equation (in its original form, 3β₀ + 3β₁ = 5) from the second:
(3β₀ + 5β₁) − (3β₀ + 3β₁) = 6 − 5
2β₁ = 1 ⇒ β₁ = 1/2
Then β₀ = 5/3 − 1/2 = 10/6 − 3/6 = 7/6
Write the fitted model:
ŷ = β₀ + β₁x = 7/6 + (1/2)x
Compute fitted values and residuals:
For x=0: ŷ=7/6, residual = 1 − 7/6 = −1/6
For x=1: ŷ=7/6+1/2=7/6+3/6=10/6=5/3, residual = 2 − 5/3 = 1/3
For x=2: ŷ=7/6+1=13/6, residual = 2 − 13/6 = −1/6
Insight: Even in this tiny dataset, the best line is not the one that hits the most points exactly; it’s the one whose residual vector has minimal length ‖r‖ and is orthogonal to the column space of X.
Use the previous example’s β̂ with X and y. Compute residual r = y − Xβ̂ and check Xᵀr = 0.
Recall:
X = [ [1,0],
[1,1],
[1,2] ], y = [1,2,2]ᵀ, β̂ = [7/6, 1/2]ᵀ
Compute predictions ŷ = Xβ̂:
Row 1: [1,0]·β̂ = 7/6
Row 2: [1,1]·β̂ = 7/6 + 1/2 = 5/3
Row 3: [1,2]·β̂ = 7/6 + 1 = 13/6
So ŷ = [7/6, 5/3, 13/6]ᵀ
Compute residuals r = y − ŷ:
r = [1 − 7/6,
2 − 5/3,
2 − 13/6]ᵀ
= [−1/6,
1/3,
−1/6]ᵀ
Compute Xᵀr:
Xᵀr = [ [1,1,1],
[0,1,2] ] · [−1/6, 1/3, −1/6]ᵀ
First component (dot with column of ones):
(−1/6) + (1/3) + (−1/6)
= −1/6 + 2/6 − 1/6
= 0
Second component (dot with x column):
0·(−1/6) + 1·(1/3) + 2·(−1/6)
= 1/3 − 2/6
= 2/6 − 2/6
= 0
Therefore Xᵀ(y − Xβ̂) = Xᵀr = 0, confirming the projection/normal-equation condition.
Insight: The normal equations are not arbitrary algebra: they encode the geometric fact that the residual is perpendicular to every feature column, meaning you can’t move within the model’s subspace to get any closer to y.
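The whole check takes a few lines in code. A sketch reproducing this example numerically (NumPy assumed):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.0])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # ≈ [7/6, 1/2]
r = y - X @ beta_hat                          # ≈ [-1/6, 1/3, -1/6]

# The residual is orthogonal to every column of X
orth = X.T @ r  # ≈ [0, 0]
```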
Let X include an intercept and two features:
X = [ [1, 1, 0],
[1, 0, 1],
[1, 1, 1] ],
y = [1, 2, 2]ᵀ.
Compute β̂ by solving XᵀXβ = Xᵀy.
Write columns of X:
Intercept column c₀ = [1,1,1]ᵀ
Feature 1 column c₁ = [1,0,1]ᵀ
Feature 2 column c₂ = [0,1,1]ᵀ
Compute XᵀX using dot products of columns:
XᵀX = [ [c₀ᵀc₀, c₀ᵀc₁, c₀ᵀc₂],
[c₁ᵀc₀, c₁ᵀc₁, c₁ᵀc₂],
[c₂ᵀc₀, c₂ᵀc₁, c₂ᵀc₂] ]
Compute each dot product:
c₀ᵀc₀ = 1+1+1 = 3
c₀ᵀc₁ = 1+0+1 = 2
c₀ᵀc₂ = 0+1+1 = 2
c₁ᵀc₁ = 1+0+1 = 2
c₁ᵀc₂ = (1·0)+(0·1)+(1·1) = 1
c₂ᵀc₂ = 0+1+1 = 2
So:
XᵀX = [ [3, 2, 2],
[2, 2, 1],
[2, 1, 2] ]
Compute Xᵀy by dotting columns with y:
Xᵀy = [ c₀ᵀy, c₁ᵀy, c₂ᵀy ]ᵀ
Compute:
c₀ᵀy = 1+2+2 = 5
c₁ᵀy = 1·1 + 0·2 + 1·2 = 3
c₂ᵀy = 0·1 + 1·2 + 1·2 = 4
So Xᵀy = [5, 3, 4]ᵀ
Solve the system:
[ [3, 2, 2],
[2, 2, 1],
[2, 1, 2] ] · [β₀, β₁, β₂]ᵀ = [5, 3, 4]ᵀ
One solution path (elimination):
From equation (2): 2β₀ + 2β₁ + β₂ = 3
From equation (3): 2β₀ + β₁ + 2β₂ = 4
Subtract (3) − (2): (2β₀ cancels)
(β₁ + 2β₂) − (2β₁ + β₂) = 4 − 3
(−β₁ + β₂) = 1 ⇒ β₂ = β₁ + 1
Use equation (1): 3β₀ + 2β₁ + 2β₂ = 5
Substitute β₂:
3β₀ + 2β₁ + 2(β₁ + 1) = 5
3β₀ + 4β₁ + 2 = 5
3β₀ + 4β₁ = 3
Use equation (2): 2β₀ + 2β₁ + β₂ = 3
Substitute β₂ = β₁ + 1:
2β₀ + 2β₁ + (β₁ + 1) = 3
2β₀ + 3β₁ + 1 = 3
2β₀ + 3β₁ = 2
Solve the 2×2 system:
3β₀ + 4β₁ = 3
2β₀ + 3β₁ = 2
Multiply the second by 3: 6β₀ + 9β₁ = 6
Multiply the first by 2: 6β₀ + 8β₁ = 6
Subtract: (6β₀+9β₁) − (6β₀+8β₁) = 6 − 6
β₁ = 0
Then 2β₀ + 3·0 = 2 ⇒ β₀ = 1
Then β₂ = β₁ + 1 = 1
So β̂ = [1, 0, 1]ᵀ and the model is:
ŷ = 1 + 0·x₁ + 1·x₂ = 1 + x₂
Insight: Even with multiple features, the workflow is the same: build X, compute XᵀX and Xᵀy, then solve. Thinking in terms of column dot products makes XᵀX feel less mysterious.
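A quick numerical check of this example (NumPy assumed):

```python
import numpy as np

X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([1.0, 2.0, 2.0])

# Solve the normal equations XᵀXβ = Xᵀy
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # ≈ [1, 0, 1]
```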
Linear regression predicts with ŷ = Xβ, where X includes a column of ones to represent the intercept.
Least squares chooses β̂ to minimize SSE = ‖y − Xβ‖², a convex quadratic objective with a unique minimum when X has full column rank.
The normal equations XᵀXβ̂ = Xᵀy come directly from setting the gradient ∇β‖y − Xβ‖² to zero.
Geometrically, Xβ̂ is the orthogonal projection of y onto Col(X); the residual r = y − Xβ̂ is orthogonal to every column of X.
(XᵀX) may be singular if features are redundant or if p > n; then use a pseudoinverse or a different solver (QR/SVD).
The intercept prevents the model from being forced through the origin and usually improves fit unless you have strong reasons to omit it.
Metrics like MSE/RMSE connect directly to SSE; R² measures improvement over predicting the mean but can be misleading when adding many features.
Forgetting the intercept (not adding the column of ones), which silently forces predictions to pass through the origin.
Trying to compute (XᵀX)⁻¹ explicitly in code; solving linear systems (QR/SVD) is typically more stable.
Mixing up shapes (e.g., using XXᵀ instead of XᵀX) and getting dimension errors or incorrect equations.
Assuming coefficients are always causal or always directly interpretable even when features are correlated or confounded.
Given data points (x, y) = (1, 1), (2, 2), (3, 2). Fit a 1D linear regression with intercept using normal equations. Report β̂ and the predicted values ŷ for x = 1,2,3.
Hint: Use X = [[1,1],[1,2],[1,3]] and β̂ = (XᵀX)⁻¹Xᵀy (or solve the 2×2 normal equations directly).
Let X = [ [1,1], [1,2], [1,3] ] and y = [1,2,2]ᵀ.
Compute XᵀX = [ [3, 6], [6, 14] ].
Compute Xᵀy = [ 1+2+2, 1·1+2·2+3·2 ]ᵀ = [5, 11]ᵀ.
Solve [ [3,6],[6,14] ] [β₀,β₁]ᵀ = [5,11]ᵀ.
From 3β₀+6β₁=5 and 6β₀+14β₁=11.
Multiply the first by 2: 6β₀+12β₁=10. Subtract from second: 2β₁=1 ⇒ β₁=1/2.
Then 3β₀+6(1/2)=5 ⇒ 3β₀+3=5 ⇒ β₀=2/3.
So β̂ = [2/3, 1/2]ᵀ.
Predictions: x=1: ŷ=2/3+1/2=7/6; x=2: ŷ=2/3+1=5/3; x=3: ŷ=2/3+3/2=13/6.
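A numerical check of this solution (NumPy assumed):

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # ≈ [2/3, 1/2]
y_pred = X @ beta_hat                         # ≈ [7/6, 5/3, 13/6]
```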
Show that at the least-squares optimum β̂, the residual r = y − Xβ̂ is orthogonal to every column of X. (Derive the condition, don’t just state it.)
Hint: Start from J(β) = ‖y − Xβ‖². Expand and take ∇β, then set it equal to 0.
Let J(β) = (y − Xβ)ᵀ(y − Xβ).
Expand:
J(β) = yᵀy − 2βᵀXᵀy + βᵀXᵀXβ.
Take gradient:
∇β J(β) = −2Xᵀy + 2XᵀXβ.
At optimum β̂: ∇β J(β̂) = 0 ⇒ XᵀXβ̂ = Xᵀy.
Rearrange:
Xᵀy − XᵀXβ̂ = 0
⇒ Xᵀ(y − Xβ̂) = 0
⇒ Xᵀr = 0.
This means each column cⱼ of X satisfies cⱼᵀr = 0, i.e., r is orthogonal to every feature column.
Suppose you have two features where the second is exactly double the first for all examples (x₂ = 2x₁), and you include an intercept. What does this imply about XᵀX and the uniqueness of β̂? What would you do in practice?
Hint: Think about linear dependence among columns of X and what that means for invertibility. Consider pseudoinverse or removing a redundant feature.
If x₂ = 2x₁ for all examples, then the column for feature 2 equals 2 times the column for feature 1. Therefore the columns of X are linearly dependent, so X does not have full column rank. Then XᵀX is singular (not invertible).
Least squares still has minimizers, but β̂ is not unique: many coefficient vectors produce the same predictions Xβ (because changing β₁ and β₂ along the dependent direction leaves Xβ unchanged).
In practice you can (1) drop one of the redundant features, restoring full rank; (2) use a pseudoinverse solution β̂ = X⁺y (which selects a minimum-norm β̂ among all minimizers); or (3) add regularization (e.g., ridge regression) to make the problem well-posed.
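A sketch of the non-uniqueness (made-up data; NumPy assumed): shifting weight along the dependent direction [0, 2, −1] leaves the predictions unchanged, and the pseudoinverse picks the minimizer with the smallest norm:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), x1, 2 * x1])  # x2 = 2*x1, so rank(X) = 2
y = np.array([2.0, 3.0, 5.0, 4.0])

beta_min = np.linalg.pinv(X) @ y  # minimum-norm minimizer

# [0, 2, -1] lies in the null space of X: 2*x1 - 1*(2*x1) = 0,
# so this shift changes β without changing the predictions Xβ
beta_alt = beta_min + np.array([0.0, 2.0, -1.0])
```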