Linear Regression

Machine Learning · Difficulty: ███░░ · Depth: 9 · Unlocks: 4

Predicting continuous values. Least squares, normal equations.


Core Concepts

  • Linear predictive model: outputs are a linear combination of input features (include intercept as a constant feature).
  • Least-squares fitting: choose coefficients that minimize the sum of squared residuals (SSE).

Key Symbols & Notation

  • X: design matrix (examples × features; include a column of ones for the intercept)
  • β: parameter/coefficient vector
  • y: target/response vector of observed continuous values

Essential Relationships

  • Normal equations: setting the SSE gradient to zero gives (X^T X) * beta = X^T * y, the linear condition that characterizes the least-squares solution.
All Concepts (14)

  • linear prediction model: hypothesis h_θ(x) = θ^T x (prediction as a linear combination of features)
  • bias / intercept term (handled by augmenting feature vector with a constant 1)
  • design matrix X (matrix whose rows collect the training feature vectors)
  • parameter vector θ (coefficients to be estimated)
  • predicted outputs vector ŷ (y-hat) defined as ŷ = X θ
  • residuals vector r defined as r = y - ŷ
  • mean squared error cost function J(θ) written in ML form (often (1/2m) Σ (h_θ(x_i) - y_i)^2 or (1/2m) ||Xθ - y||^2)
  • normal equations: the closed-form condition for least-squares minimizer (X^T X θ = X^T y)
  • closed-form solution via matrix inverse θ = (X^T X)^{-1} X^T y (when X^T X is invertible)
  • Moore–Penrose pseudoinverse solution θ = X^+ y for singular or rank-deficient cases
  • hat matrix H = X (X^T X)^{-1} X^T that maps y to predictions ŷ
  • orthogonality condition for least squares: residuals are orthogonal to each column of X (X^T r = 0)
  • full column rank condition: X^T X invertible if and only if columns of X are linearly independent
  • interpretation of linear regression as projecting the target vector y onto the column space of X

Teaching Strategy

Multi-session curriculum: substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

Linear regression is the “hello world” of predicting numbers: you pick a linear rule, compare its predictions to real outcomes, and adjust the rule to make the squared errors as small as possible. Under the hood, it’s an elegant geometry problem about projecting y onto the column space of X.

TL;DR:

Linear regression predicts a continuous target by ŷ = Xβ. We fit β by minimizing the sum of squared residuals ‖y − Xβ‖². The solution satisfies the normal equations XᵀXβ = Xᵀy (when XᵀX is invertible) and geometrically corresponds to projecting y onto the subspace spanned by the columns of X.

What Is Linear Regression?

The goal

In supervised learning for continuous outputs, you want a model that maps an input feature vector to a real number:

  • Input (features): a vector x ∈ ℝᵈ
  • Output (target): y ∈ ℝ

Linear regression is the simplest widely useful choice:

  • Model: predict with a linear function of features.

The linear predictive model

For one example with features x, linear regression predicts

ŷ = β₀ + β₁x₁ + β₂x₂ + … + β_d x_d

To keep notation clean, we fold the intercept into the feature vector by adding a constant feature:

  • Define x₀ = 1
  • Define β = [β₀, β₁, …, β_d]ᵀ
  • Define x = [x₀, x₁, …, x_d]ᵀ

Then the prediction becomes a dot product:

ŷ = xᵀβ

From one example to a dataset

Suppose you have n training examples. Stack them into a design matrix X:

  • X ∈ ℝⁿˣᵖ where p = d+1 if you included the intercept column of ones
  • Each row i is xᵢᵀ

Also stack targets into a vector:

  • y ∈ ℝⁿ

Predictions for all examples at once:

ŷ = Xβ

This compact form is the main reason linear regression is such a good “bridge” topic: it connects machine learning, linear algebra, geometry, and optimization.
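The stacking above can be sketched in a few lines of NumPy (an illustrative example with made-up numbers, not data from this lesson):

```python
import numpy as np

# Hypothetical data: 4 examples, 2 raw features.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.0],
                  [3.0, 1.0],
                  [4.0, 3.0]])
beta = np.array([0.5, 2.0, -1.0])  # [intercept, beta_1, beta_2]

# Prepend the column of ones so the intercept is just another coefficient.
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# All n predictions in one matrix-vector product: y_hat = X beta.
y_hat = X @ beta
```

Note that once the column of ones is in place, the intercept needs no special handling anywhere else.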

Why the squared error?

You need a rule for choosing β. Linear regression chooses β to make predictions close to y.

Define residuals:

r = y − Xβ

A natural idea is to minimize total error. But we need a scalar objective. Squared error is the classic choice:

SSE(β) = ∑ᵢ (yᵢ − ŷᵢ)² = ‖y − Xβ‖²

Why square?

  • It penalizes large mistakes more than small ones.
  • It yields a smooth, convex objective (one global minimum).
  • It leads to closed-form solutions via linear algebra.

What you should keep in mind

Linear regression is not “the true relationship is linear.” It’s:

  • “I will approximate the relationship with a linear function of chosen features.”

You can include nonlinear transformations as features (e.g., x², log x), and the model is still linear in β.

Core Mechanic 1: Least-Squares Fitting (Minimizing ‖y − Xβ‖²)

The optimization problem

Least squares fitting means choosing β to minimize squared residual length:

minimize over β: J(β) = ‖y − Xβ‖²

This is a geometry problem wearing an optimization costume.

  • Xβ is restricted to lie in the column space of X (all linear combinations of columns).
  • You want the point Xβ in that subspace that is closest to y.

So the fitted predictions Xβ̂ are the orthogonal projection of y onto Col(X).

Expanding the objective (so we can take derivatives)

Write J(β) as a quadratic form:

J(β) = (y − Xβ)ᵀ(y − Xβ)

Expand carefully:

J(β)

= (yᵀ − βᵀXᵀ)(y − Xβ)

= yᵀy − yᵀXβ − βᵀXᵀy + βᵀXᵀXβ

Note that yᵀXβ is a scalar, and equals its transpose:

yᵀXβ = (yᵀXβ)ᵀ = βᵀXᵀy

So the middle two terms combine:

J(β) = yᵀy − 2βᵀXᵀy + βᵀXᵀXβ

This makes it clear J(β) is a convex quadratic function in β.

The gradient and the normal equations

To minimize J(β), set its gradient to zero.

Using standard matrix calculus results:

∂/∂β (βᵀAβ) = (A + Aᵀ)β, and when A is symmetric: = 2Aβ

Here A = XᵀX, which is symmetric.

Compute the gradient:

∇β J(β)

= ∇β( yᵀy − 2βᵀXᵀy + βᵀXᵀXβ )

= 0 − 2Xᵀy + 2XᵀXβ

Set ∇β J(β) = 0:

2XᵀXβ − 2Xᵀy = 0

Divide by 2:

XᵀXβ = Xᵀy

These are the normal equations.
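As a numerical sanity check on this derivation, a short NumPy sketch (with randomly generated data invented here) compares the analytic gradient 2XᵀXβ − 2Xᵀy against central finite differences of J:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
beta = rng.normal(size=3)

def J(b):
    r = y - X @ b
    return r @ r  # squared residual norm ||y - Xb||^2

# Analytic gradient from the derivation above.
grad_analytic = 2 * X.T @ X @ beta - 2 * X.T @ y

# Central finite differences, one coordinate direction at a time.
eps = 1e-6
grad_numeric = np.array([(J(beta + eps * e) - J(beta - eps * e)) / (2 * eps)
                         for e in np.eye(3)])
```

The two gradients agree to high precision, as they must for a smooth quadratic objective.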

Solving for β̂

If XᵀX is invertible (full column rank), then

β̂ = (XᵀX)⁻¹ Xᵀy

This is the classic closed-form least squares solution.

If XᵀX is not invertible (e.g., perfectly collinear features), there are infinitely many solutions that achieve the same minimum SSE. A common choice is the minimum-norm solution via the Moore–Penrose pseudoinverse:

β̂ = X⁺y
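Both solution routes can be sketched in NumPy (synthetic data invented here for illustration): `np.linalg.solve` covers the invertible case without forming an explicit inverse, and `np.linalg.pinv` covers the general case.

```python
import numpy as np

# Synthetic dataset: 100 examples, intercept column plus 2 features.
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.05 * rng.normal(size=100)

# Normal equations: solve the linear system (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudoinverse route: also returns an answer when X^T X is singular.
beta_pinv = np.linalg.pinv(X) @ y
```

On full-rank data the two routes agree; they differ only when XᵀX is singular, where the pseudoinverse picks the minimum-norm minimizer.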

The projection viewpoint (link to prerequisites)

At the optimum, the residual r = y − Xβ̂ is orthogonal to the column space of X.

In algebra:

Xᵀ(y − Xβ̂) = 0

which rearranges to the normal equations.

So you can read the normal equations as:

  • “Residuals are perpendicular to every column of X.”

That is exactly the condition for an orthogonal projection.

What least squares is really doing

Least squares chooses β to balance tradeoffs across all points.

  • You generally can’t make all residuals zero unless y lies in Col(X) (perfectly representable).
  • Instead, you choose the closest representable vector Xβ to y.

This is why linear regression is both:

  • An optimization algorithm: minimize SSE
  • A linear algebra operation: projection onto a subspace

Core Mechanic 2: Intercept, Geometry, and When the Normal Equations Behave

The intercept as a constant feature

Including an intercept means adding a column of ones:

X = [ 1 | feature columns ]

This matters because without an intercept, the model is forced to go through the origin in feature space (in 1D, the line must pass through (0,0)). In real data, that constraint is often unjustified and increases error.

Shapes and dimensions (so you don’t get lost)

It helps to keep a mental “type system” for linear regression:

  • X: n × p (n examples, p features including intercept)
  • β: p × 1
  • y: n × 1
  • Xβ: n × 1
  • XᵀX: p × p
  • Xᵀy: p × 1

If you ever feel stuck, check dimensions first.

Geometry: column space and projections

The set of all predictions the model can make is

{ Xβ : β ∈ ℝᵖ } = Col(X)

That is a subspace of ℝⁿ (or an affine subspace if you treat the intercept differently; with the column-of-ones trick, it’s still a linear subspace in ℝⁿ).

The fitted predictions are

ŷ = Xβ̂ = Proj_{Col(X)}(y)

The residual vector

r = y − ŷ

is orthogonal to Col(X).

When does (XᵀX)⁻¹ exist?

XᵀX is invertible iff the columns of X are linearly independent (full column rank). Practically, it can fail for several reasons:

  • Perfect multicollinearity: one feature is an exact linear combination of others.
  • Too many features: p > n guarantees rank ≤ n < p, so not invertible.
  • Redundant intercept: e.g., you accidentally include two constant columns.

Even if it exists, it can be numerically unstable if XᵀX is ill-conditioned.

Practical numerical note: don’t literally invert

Although the formula β̂ = (XᵀX)⁻¹Xᵀy is conceptually important, computing the inverse explicitly is usually a bad idea numerically.

Common stable approaches:

| Approach | Idea | Pros | Cons |
| --- | --- | --- | --- |
| QR decomposition | X = QR, solve Rβ = Qᵀy | Stable, standard | More linear algebra machinery |
| SVD / pseudoinverse | X = UΣVᵀ | Best for rank-deficient cases | Slower |
| Gradient-based optimization | Minimize J(β) iteratively | Scales, works with huge data | Needs learning rate / iterations |

This lesson focuses on the conceptual math; just remember that implementations often use QR/SVD under the hood.
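A minimal NumPy sketch of the QR route, compared against the library's SVD-based `lstsq` solver (random data invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=50)

# QR route: X = QR, then solve the small triangular system R beta = Q^T y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# SVD route: lstsq handles rank-deficient X as well.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
```

Solving Rβ = Qᵀy works because R is upper triangular and invertible whenever X has full column rank, so no inverse of XᵀX is ever formed.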

Centering and scaling (feature preprocessing)

Least squares itself doesn’t require scaling, but conditioning and interpretation improve when you:

  • center features (subtract mean)
  • scale features (divide by standard deviation)

Centering also interacts with the intercept:

  • If each non-intercept feature has mean 0, then the fitted intercept β̂₀ equals the mean of y, which makes it easier to interpret.

A note on “linear”

Linear regression is linear in parameters β, not necessarily linear in raw input.

Example: If you use features [1, x, x²], predictions are

ŷ = β₀ + β₁x + β₂x²

This is a quadratic curve in x, but still a linear regression model because it is a linear combination of features with coefficients β.
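A small NumPy sketch (with data constructed here to follow an exact quadratic) shows least squares recovering the coefficients of a curve that is nonlinear in x but linear in β:

```python
import numpy as np

# Invented data following an exact quadratic: y = 1 + 2x - 0.5 x^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x - 0.5 * x**2

# Features [1, x, x^2]: nonlinear in x, but the model is linear in beta.
X = np.column_stack([np.ones_like(x), x, x**2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
```

The fitting machinery is identical to the straight-line case; only the feature construction changed.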

Application/Connection: Prediction, Evaluation, and Why It’s a Foundation for ML

Using the model to predict

After fitting β̂, predictions for a new example xₙₑw are

ŷₙₑw = xₙₑwᵀ β̂

For a batch of new examples Xₙₑw:

ŷₙₑw = Xₙₑw β̂

Interpreting coefficients (carefully)

If feature xⱼ increases by 1 (holding other features fixed), the prediction changes by βⱼ.

That “holding fixed” clause is crucial: in correlated data, changing one feature while keeping others constant may describe a scenario that doesn’t naturally occur.

Common evaluation metrics

Since the training objective is SSE, related metrics are common:

  • MSE = (1/n)‖y − Xβ̂‖²
  • RMSE = √MSE
  • MAE = (1/n)∑ᵢ |yᵢ − ŷᵢ| (not the training objective here, but popular)

R² as “variance explained” (intuition)

Define:

  • SSE = ∑ᵢ (yᵢ − ŷᵢ)²
  • SST = ∑ᵢ (yᵢ − ȳ)² where ȳ is mean of y

Then

R² = 1 − SSE/SST

Interpretation:

  • R² = 0 means “no better than predicting the mean.”
  • R² = 1 means perfect fit.

Be cautious: R² always (weakly) increases when you add features, even useless ones. This is one reason regularization (like ridge regression) matters later.
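These metrics are a few lines of NumPy; the sketch below plugs in the targets and fitted values from this lesson's first worked example:

```python
import numpy as np

# Targets and fitted values from the 1D worked example in this lesson.
y = np.array([1.0, 2.0, 2.0])
y_hat = np.array([7/6, 5/3, 13/6])

residuals = y - y_hat
mse = np.mean(residuals**2)          # (1/n) * SSE
rmse = np.sqrt(mse)
mae = np.mean(np.abs(residuals))

sst = np.sum((y - y.mean())**2)
r2 = 1 - np.sum(residuals**2) / sst  # fraction of variance explained
```

For this tiny fit, R² = 0.75: the line explains three quarters of the variance relative to always predicting the mean.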

Overfitting and generalization

Linear regression can still overfit when:

  • you add too many features
  • you add high-degree polynomial features
  • you have limited data with noisy targets

Least squares will chase patterns in the sample even if they are not stable.

The usual fix is to evaluate on held-out data (validation/test) and/or add regularization.

Why linear regression is foundational

Linear regression is a gateway to many later ideas:

  • Gradient descent: minimizing a convex objective
  • Regularization: ridge/lasso as modified least squares
  • Probabilistic modeling: with Gaussian noise assumptions, least squares becomes maximum likelihood
  • Generalized linear models: logistic regression changes the output link but keeps linear predictors
  • Neural networks: last layer often is linear; training uses similar principles

If you understand linear regression deeply—especially the projection view—you’re set up to understand a lot of ML with much less memorization.

Worked Examples (3)

Fit a line in 1D with an intercept (compute β̂ via normal equations)

Data: (x, y) = (0, 1), (1, 2), (2, 2). Use an intercept by setting features x = [1, x]ᵀ. Compute β̂ = [β₀, β₁]ᵀ.

  1. Build the design matrix X (n = 3, p = 2):

    X = [ [1, 0],

    [1, 1],

    [1, 2] ]

  2. Build the target vector:

    y = [1, 2, 2]ᵀ

  3. Compute XᵀX:

    XᵀX = [ [1,1,1],

    [0,1,2] ] · [ [1,0],

    [1,1],

    [1,2] ]

    = [ [3, 3],

    [3, 5] ]

  4. Compute Xᵀy:

    Xᵀy = [ [1,1,1],

    [0,1,2] ] · [1,2,2]ᵀ

    = [ 1+2+2,

    0·1 + 1·2 + 2·2 ]ᵀ

    = [5, 6]ᵀ

  5. Solve the normal equations (XᵀX)β = Xᵀy:

    [ [3, 3],

    [3, 5] ] · [β₀, β₁]ᵀ = [5, 6]ᵀ

  6. Solve the linear system:

    From 3β₀ + 3β₁ = 5 ⇒ β₀ + β₁ = 5/3

    From 3β₀ + 5β₁ = 6

    Subtract 3β₀ + 3β₁ = 5 from the second equation:

    (3β₀ + 5β₁) − (3β₀ + 3β₁) = 6 − 5

    2β₁ = 1 ⇒ β₁ = 1/2

    Then β₀ = 5/3 − 1/2 = 10/6 − 3/6 = 7/6

  7. Write the fitted model:

    ŷ = β₀ + β₁x = 7/6 + (1/2)x

  8. Compute fitted values and residuals:

    For x=0: ŷ=7/6, residual = 1 − 7/6 = −1/6

    For x=1: ŷ=7/6+1/2=7/6+3/6=10/6=5/3, residual = 2 − 5/3 = 1/3

    For x=2: ŷ=7/6+1=13/6, residual = 2 − 13/6 = −1/6

Insight: Even in this tiny dataset, the best line is not the one that hits the most points exactly; it’s the one whose residual vector has minimal length ‖r‖ and is orthogonal to the column space of X.
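The arithmetic above can be verified with a short NumPy sketch (same X and y as the worked example):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.0])

# Solve the 2x2 normal equations directly.
beta = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta
```

The solver reproduces β̂ = [7/6, 1/2]ᵀ and the residuals [−1/6, 1/3, −1/6]ᵀ computed by hand.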

Verify the orthogonality condition Xᵀ(y − Xβ̂) = 0 for the fitted line

Use the previous example’s β̂ with X and y. Compute the residual r = y − Xβ̂ and check Xᵀr = 0.

  1. Recall:

    X = [ [1,0],

    [1,1],

    [1,2] ], y = [1,2,2]ᵀ, β̂ = [7/6, 1/2]ᵀ

  2. Compute predictions ŷ = Xβ̂:

    Row 1: [1,0]·β̂ = 7/6

    Row 2: [1,1]·β̂ = 7/6 + 1/2 = 5/3

    Row 3: [1,2]·β̂ = 7/6 + 1 = 13/6

    So ŷ = [7/6, 5/3, 13/6]ᵀ

  3. Compute residuals r = y − ŷ:

    r = [1 − 7/6,

    2 − 5/3,

    2 − 13/6]ᵀ

    = [−1/6,

    1/3,

    −1/6]ᵀ

  4. Compute Xᵀr:

    Xᵀr = [ [1,1,1],

    [0,1,2] ] · [−1/6, 1/3, −1/6]ᵀ

  5. First component (dot with column of ones):

    (−1/6) + (1/3) + (−1/6)

    = −1/6 + 2/6 − 1/6

    = 0

  6. Second component (dot with x column):

    0·(−1/6) + 1·(1/3) + 2·(−1/6)

    = 1/3 − 2/6

    = 2/6 − 2/6

    = 0

  7. Therefore Xᵀ(y − Xβ̂) = Xᵀr = 0, confirming the projection/normal-equation condition.

Insight: The normal equations are not arbitrary algebra: they encode the geometric fact that the residual is perpendicular to every feature column, meaning you can’t move within the model’s subspace to get any closer to y.
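The same orthogonality check takes two lines in NumPy (same data as the worked example):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.0])

# Fit via the library solver, then test X^T r against the zero vector.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X @ beta_hat
orth = X.T @ r  # zero up to floating-point error
```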

Multiple features: compute β̂ for a tiny 2-feature design matrix

Let X include an intercept and two features:

X = [ [1, 1, 0],

[1, 0, 1],

[1, 1, 1] ],

y = [1, 2, 2]ᵀ.

Compute β̂ by solving XᵀXβ = Xᵀy.

  1. Write columns of X:

    Intercept column c₀ = [1,1,1]ᵀ

    Feature 1 column c₁ = [1,0,1]ᵀ

    Feature 2 column c₂ = [0,1,1]ᵀ

  2. Compute XᵀX using dot products of columns:

    XᵀX = [ [c₀ᵀc₀, c₀ᵀc₁, c₀ᵀc₂],

    [c₁ᵀc₀, c₁ᵀc₁, c₁ᵀc₂],

    [c₂ᵀc₀, c₂ᵀc₁, c₂ᵀc₂] ]

  3. Compute each dot product:

    c₀ᵀc₀ = 1+1+1 = 3

    c₀ᵀc₁ = 1+0+1 = 2

    c₀ᵀc₂ = 0+1+1 = 2

    c₁ᵀc₁ = 1+0+1 = 2

    c₁ᵀc₂ = (1·0)+(0·1)+(1·1) = 1

    c₂ᵀc₂ = 0+1+1 = 2

  4. So:

    XᵀX = [ [3, 2, 2],

    [2, 2, 1],

    [2, 1, 2] ]

  5. Compute Xᵀy by dotting columns with y:

    Xᵀy = [ c₀ᵀy, c₁ᵀy, c₂ᵀy ]ᵀ

  6. Compute:

    c₀ᵀy = 1+2+2 = 5

    c₁ᵀy = 1·1 + 0·2 + 1·2 = 3

    c₂ᵀy = 0·1 + 1·2 + 1·2 = 4

    So Xᵀy = [5, 3, 4]ᵀ

  7. Solve the system:

    [ [3, 2, 2],

    [2, 2, 1],

    [2, 1, 2] ] · [β₀, β₁, β₂]ᵀ = [5, 3, 4]ᵀ

  8. One solution path (elimination):

    From equation (2): 2β₀ + 2β₁ + β₂ = 3

    From equation (3): 2β₀ + β₁ + 2β₂ = 4

    Subtract (3) − (2): (2β₀ cancels)

    (β₁ + 2β₂) − (2β₁ + β₂) = 4 − 3

    (−β₁ + β₂) = 1 ⇒ β₂ = β₁ + 1

  9. Use equation (1): 3β₀ + 2β₁ + 2β₂ = 5

    Substitute β₂:

    3β₀ + 2β₁ + 2(β₁ + 1) = 5

    3β₀ + 4β₁ + 2 = 5

    3β₀ + 4β₁ = 3

  10. Use equation (2): 2β₀ + 2β₁ + β₂ = 3

    Substitute β₂ = β₁ + 1:

    2β₀ + 2β₁ + (β₁ + 1) = 3

    2β₀ + 3β₁ + 1 = 3

    2β₀ + 3β₁ = 2

  11. Solve the 2×2 system:

    3β₀ + 4β₁ = 3

    2β₀ + 3β₁ = 2

    Multiply the second by 3: 6β₀ + 9β₁ = 6

    Multiply the first by 2: 6β₀ + 8β₁ = 6

    Subtract: (6β₀+9β₁) − (6β₀+8β₁) = 6 − 6

    β₁ = 0

    Then 2β₀ + 3·0 = 2 ⇒ β₀ = 1

    Then β₂ = β₁ + 1 = 1

  12. So β̂ = [1, 0, 1]ᵀ and the model is:

    ŷ = 1 + 0·x₁ + 1·x₂ = 1 + x₂

Insight: Even with multiple features, the workflow is the same: build X, compute XᵀX and Xᵀy, then solve. Thinking in terms of column dot products makes XᵀX feel less mysterious.
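Verifying with NumPy (same X and y as above):

```python
import numpy as np

X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([1.0, 2.0, 2.0])

# Solve the 3x3 normal equations.
beta = np.linalg.solve(X.T @ X, X.T @ y)
```

Here X happens to be square and invertible, so y lies in Col(X): the least-squares fit is exact and every residual is zero.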

Key Takeaways

  • Linear regression predicts with ŷ = Xβ, where X includes a column of ones to represent the intercept.

  • Least squares chooses β̂ to minimize SSE = ‖y − Xβ‖², a convex quadratic objective with a unique minimum when X has full column rank.

  • The normal equations XᵀXβ̂ = Xᵀy come directly from setting the gradient ∇β‖y − Xβ‖² to zero.

  • Geometrically, Xβ̂ is the orthogonal projection of y onto Col(X); the residual r = y − Xβ̂ is orthogonal to every column of X.

  • XᵀX may be singular if features are redundant or if p > n; then use a pseudoinverse or a different solver (QR/SVD).

  • The intercept prevents the model from being forced through the origin and usually improves fit unless you have strong reasons to omit it.

  • Metrics like MSE/RMSE connect directly to SSE; R² measures improvement over predicting the mean but can be misleading when adding many features.

Common Mistakes

  • Forgetting the intercept (not adding the column of ones), which silently forces predictions to pass through the origin.

  • Trying to compute (XᵀX)⁻¹ explicitly in code; solving linear systems (QR/SVD) is typically more stable.

  • Mixing up shapes (e.g., using XXᵀ instead of XᵀX) and getting dimension errors or incorrect equations.

  • Assuming coefficients are always causal or always directly interpretable even when features are correlated or confounded.

Practice

easy

Given data points (x, y) = (1, 1), (2, 2), (3, 2). Fit a 1D linear regression with intercept using normal equations. Report β̂ and the predicted values ŷ for x = 1,2,3.

Hint: Use X = [[1,1],[1,2],[1,3]] and β̂ = (XᵀX)⁻¹Xᵀy (or solve the 2×2 normal equations directly).

Show solution

Let X = [ [1,1], [1,2], [1,3] ] and y = [1,2,2]ᵀ.

Compute XᵀX = [ [3, 6], [6, 14] ].

Compute Xᵀy = [ 1+2+2, 1·1+2·2+3·2 ]ᵀ = [5, 11]ᵀ.

Solve [ [3,6],[6,14] ] [β₀,β₁]ᵀ = [5,11]ᵀ.

From 3β₀+6β₁=5 and 6β₀+14β₁=11.

Multiply the first by 2: 6β₀+12β₁=10. Subtract from second: 2β₁=1 ⇒ β₁=1/2.

Then 3β₀+6(1/2)=5 ⇒ 3β₀+3=5 ⇒ β₀=2/3.

So β̂ = [2/3, 1/2]ᵀ.

Predictions: x=1: ŷ=2/3+1/2=7/6; x=2: ŷ=2/3+1=5/3; x=3: ŷ=2/3+3/2=13/6.
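A NumPy check of this solution (same X and y as the exercise):

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])

# Solve the normal equations and compute the fitted values.
beta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta
```

This reproduces β̂ = [2/3, 1/2]ᵀ and the predictions 7/6, 5/3, 13/6.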

medium

Show that at the least-squares optimum β̂, the residual r = y − Xβ̂ is orthogonal to every column of X. (Derive the condition, don’t just state it.)

Hint: Start from J(β) = ‖y − Xβ‖². Expand and take ∇β, then set it equal to 0.

Show solution

Let J(β) = (y − Xβ)ᵀ(y − Xβ).

Expand:

J(β) = yᵀy − 2βᵀXᵀy + βᵀXᵀXβ.

Take gradient:

∇β J(β) = −2Xᵀy + 2XᵀXβ.

At the optimum β̂: ∇β J(β̂) = 0 ⇒ XᵀXβ̂ = Xᵀy.

Rearrange:

Xᵀy − XᵀXβ̂ = 0

Xᵀ(y − Xβ̂) = 0

Xᵀr = 0.

This means each column cⱼ of X satisfies cⱼᵀr = 0, i.e., r is orthogonal to every feature column.

hard

Suppose you have two features where the second is exactly double the first for all examples (x₂ = 2x₁), and you include an intercept. What does this imply about XX and the uniqueness of β̂? What would you do in practice?

Hint: Think about linear dependence among columns of X and what that means for invertibility. Consider pseudoinverse or removing a redundant feature.

Show solution

If x₂ = 2x₁ for all examples, then the column for feature 2 equals 2 times the column for feature 1. Therefore the columns of X are linearly dependent, so X does not have full column rank. Then XᵀX is singular (not invertible).

Least squares still has minimizers, but β̂ is not unique: many coefficient vectors produce the same predictions Xβ (because changing β₁ and β₂ along the dependent direction leaves Xβ unchanged).

In practice you can (1) drop one of the redundant features, restoring full rank; (2) use the pseudoinverse solution β̂ = X⁺y (which selects the minimum-norm β̂ among all minimizers); or (3) add regularization (e.g., ridge regression) to make the problem well-posed.
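A NumPy sketch of this failure mode (data invented for illustration):

```python
import numpy as np

# x2 = 2*x1 exactly, so the columns of X are linearly dependent.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), x1, 2 * x1])
y = np.array([1.0, 2.0, 2.0, 3.0])

# X^T X has rank 2, not 3: it is singular.
rank = np.linalg.matrix_rank(X.T @ X)

# The pseudoinverse still returns the minimum-norm least-squares solution.
beta = np.linalg.pinv(X) @ y
r = y - X @ beta
```

Even though β̂ is not unique here, the fitted predictions Xβ̂ are: the residual is still orthogonal to Col(X).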
