Decomposition of prediction error. Underfitting vs overfitting.
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
Two models can have the same training error and wildly different test error. The bias–variance tradeoff explains why: learning is a tug-of-war between systematic error (bias) and sensitivity to data (variance), with an unavoidable floor set by noise.
At a fixed input x, the expected squared prediction error decomposes as
E[(f̂(x) − y)²] = (Bias[f̂(x)])² + Var(f̂(x)) + σ².
Bias measures how far the average learned prediction is from the true function f(x). Variance measures how much the learned prediction changes across different training sets. σ² is irreducible noise from the data-generating process.
In supervised learning, you don’t just want to fit the data you already saw—you want to predict new data. The uncomfortable part is that training data is only one random draw from many possible datasets you could have received. If you trained the same algorithm again on a different sample, you’d typically get a different predictor.
So when you ask “How good is my model?” you really mean something like:
“If I got fresh data from the same process, what error should I expect on average?”
Bias–variance is the language for answering that. It separates error into:

- systematic error the model would make even with unlimited data (bias),
- error from sensitivity to the particular training sample (variance),
- error that no model can remove (irreducible noise).

This is what people mean by underfitting vs overfitting: underfitting is error dominated by bias; overfitting is error dominated by variance.
To make this precise, we assume a data-generating process:
y = f(x) + ε
where:

- f is the true (unknown) function,
- ε is random noise with E[ε] = 0 and Var(ε) = σ².
A learning algorithm takes a training set D and produces a predictor:
f̂_D(x)
We will often shorten this to f̂(x), but it’s important to remember: the predictor depends on the random draw of D, so f̂(x) is itself a random quantity.
Fix an input x. Across many possible training sets, you get many learned predictions:
f̂_D₁(x), f̂_D₂(x), …
Define:
Bias(x) = E_D[f̂(x)] − f(x)
Var(x) = Var_D(f̂(x)) = E_D[(f̂(x) − E_D[f̂(x)])²]
Bias is about the center of the cloud of possible learned predictions. Variance is about the spread of that cloud.
The classic decomposition is at a fixed x. To get a single number for a model, you typically average over x as well:
E_x[E_D,ε[(f̂(x) − y)²]]
But the intuition is cleanest at one x first: bias and variance can differ across regions of input space. A model might be stable (low variance) where data is plentiful and unstable (high variance) near the edges.
| Term | Depends on training set D? | Depends on noise ε? | What it measures | Can we reduce it? |
|---|---|---|---|---|
| (Bias(x))² | via E_D[f̂(x)] | no | systematic mismatch from f(x) | yes (richer model, better features, less regularization) |
| Var(x) | yes | no | sensitivity to dataset sampling | yes (more data, regularization, bagging/ensembles) |
| σ² (noise) | no | yes | inherent randomness in y given x | not without changing measurement process |
Bias–variance tradeoff is the practical reality that many interventions that reduce one can increase the other.
When you measure squared error, it’s hard to tell why the error is happening. Is the model too simple? Too wiggly? Is the data noisy? The decomposition tells you how much error comes from each cause—at least conceptually.
We’ll derive the standard result:
E_D,ε[(f̂(x) − y)²] = (Bias(x))² + Var(x) + σ²
where y = f(x) + ε.
Start from:
(f̂(x) − y)²
Substitute y = f(x) + ε:
(f̂(x) − f(x) − ε)²
Now take expectation over both sources of randomness (the training set D and the test noise ε). So we consider:
E_D,ε[(f̂(x) − f(x) − ε)²]
Expand the square:
(f̂(x) − f(x) − ε)²
= (f̂(x) − f(x))² − 2ε(f̂(x) − f(x)) + ε²
Take expectation:
E[(f̂(x) − f(x))²] − 2E[ε(f̂(x) − f(x))] + E[ε²]
Now use two key assumptions:

- E[ε] = 0 (the noise has zero mean),
- ε is independent of the training set D, and hence of f̂(x).

So:
E[ε(f̂(x) − f(x))] = E[ε] · E[f̂(x) − f(x)] = 0
And E[ε²] = Var(ε) = σ².
Thus:
E_D,ε[(f̂(x) − y)²] = E_D[(f̂(x) − f(x))²] + σ²
So the “interesting” part is E_D[(f̂(x) − f(x))²].
Let μ(x) = E_D[f̂(x)]. Add and subtract μ(x):
f̂(x) − f(x)
= (f̂(x) − μ(x)) + (μ(x) − f(x))
Square it:
(f̂(x) − f(x))²
= (f̂(x) − μ(x) + μ(x) − f(x))²
Expand:
= (f̂(x) − μ(x))² + 2(f̂(x) − μ(x))(μ(x) − f(x)) + (μ(x) − f(x))²
Now take expectation over D.
The last term is constant with respect to D:
E_D[(μ(x) − f(x))²] = (μ(x) − f(x))²
The first term is the variance definition:
E_D[(f̂(x) − μ(x))²] = Var_D(f̂(x))
The middle term disappears because E_D[f̂(x) − μ(x)] = 0:
E_D[2(f̂(x) − μ(x))(μ(x) − f(x))]
= 2(μ(x) − f(x)) E_D[f̂(x) − μ(x)]
= 2(μ(x) − f(x)) · 0
= 0
So we get:
E_D[(f̂(x) − f(x))²]
= Var_D(f̂(x)) + (μ(x) − f(x))²
But μ(x) − f(x) is exactly Bias(x). Therefore:
E_D[(f̂(x) − f(x))²] = Var(x) + (Bias(x))²
Recall:
E_D,ε[(f̂(x) − y)²] = E_D[(f̂(x) − f(x))²] + σ²
So:
E_D,ε[(f̂(x) − y)²]
= (Bias(x))² + Var(x) + σ²
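The decomposition can be checked numerically. Below is a minimal Monte Carlo sketch under an assumed setup (NumPy, sin as the true function, and a deliberately underpowered degree-1 fit, all illustrative choices): it retrains the model on many fresh datasets, measures bias² and variance of the prediction at a single x, and compares (Bias)² + Var + σ² against a direct estimate of expected test MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin               # assumed true function for the demo
sigma = 0.3              # noise std, so sigma**2 is the irreducible term
x0 = 2.0                 # fixed query point
n, trials = 30, 5000

preds = np.empty(trials)
for t in range(trials):
    # Draw a fresh training set D and fit a (too simple) straight line
    X = rng.uniform(0, 3, n)
    y = f(X) + rng.normal(0, sigma, n)
    b1, b0 = np.polyfit(X, y, 1)
    preds[t] = b1 * x0 + b0

bias2 = (preds.mean() - f(x0)) ** 2      # (Bias(x0))^2
var = preds.var()                        # Var(x0)

# Direct estimate of E[(f_hat(x0) - y)^2] with fresh noisy test labels
y_test = f(x0) + rng.normal(0, sigma, trials)
mse = np.mean((preds - y_test) ** 2)

print(f"bias^2 + var + sigma^2 = {bias2 + var + sigma**2:.3f}")
print(f"direct MSE estimate    = {mse:.3f}")
```

The two printed numbers should agree up to Monte Carlo error, and the bias² term is nonzero because a straight line cannot represent sin.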
This is a conceptual decomposition of expected test error at x. It does not mean you can directly look at your dataset and perfectly measure bias and variance without extra assumptions. But it does tell you:
This exact decomposition is for squared loss (and closely related losses). The nice algebra comes from expanding squares. For other losses (like 0–1 classification error), there are analogs but not such a clean three-term identity.
At fixed x, imagine f̂(x) as a point on the number line that changes with D.
You might hope to reduce bias and variance simultaneously. Sometimes you can (e.g., better features, more data). But often you face a real tension: flexibility that reduces bias also increases sensitivity to the sample, and constraints that stabilize the fit introduce systematic error.
This is not a law of nature for every method, but it is a persistent pattern in many learning setups.
Fix x. Suppose you repeatedly sample datasets D₁, D₂, … and train the same algorithm. A rigid, heavily constrained learner returns nearly the same prediction every time → low variance. But rigidity often means the model cannot represent f well → higher bias.
It helps to connect the decomposition to the typical learning-curve story.
A common misconception is that “overfitting means the model is too complex.” Complexity is one route to variance, but variance is about sensitivity, not about complexity alone.
Below is a practical map. Effects can depend on the algorithm and data regime, but these are reliable first approximations.
| Intervention | Typical effect on Bias | Typical effect on Variance | Notes |
|---|---|---|---|
| Increase model capacity (more parameters, higher-degree polynomial, deeper tree) | ↓ | ↑ | Lower bias, higher variance risk |
| Increase regularization (L2/L1, pruning, dropout) | ↑ | ↓ | Makes solutions more stable |
| More training data | ↔ or ↓ | ↓ | Usually reduces variance strongly |
| Better features / representation | ↓ | ↔ or ↓ | Can reduce bias without increasing variance much |
| Early stopping (iterative learners) | ↑ | ↓ | Acts like regularization |
| Bagging / averaging multiple models | ↔ | ↓ | Reduces variance by averaging |
| Boosting (often) | ↓ | can ↑ or ↓ | Often reduces bias; variance behavior depends |
KNN is a classic bias–variance illustration. Here the “complexity knob” is k, the number of neighbors averaged (smaller k → more flexible): k = 1 tracks individual noisy points (low bias, high variance), while large k smooths over wide neighborhoods (lower variance, more bias).
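A small simulation makes the knob concrete (an illustrative setup, not from the text: NumPy only, sin as the assumed true function). At a fixed x, 1-NN predictions swing with every resampled training set, while 15-NN predictions are far more stable but can be systematically off where the function curves.

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.sin                          # assumed true function
sigma, n, trials, x0 = 0.3, 50, 2000, 1.5

def knn_predict(Xtr, ytr, x, k):
    """k-NN regression: average the labels of the k nearest training points."""
    idx = np.argsort(np.abs(Xtr - x))[:k]
    return ytr[idx].mean()

results = {}
for k in (1, 15):
    preds = np.empty(trials)
    for t in range(trials):
        X = rng.uniform(0, 3, n)             # fresh training set D each trial
        y = f(X) + rng.normal(0, sigma, n)
        preds[t] = knn_predict(X, y, x0, k)
    results[k] = (preds.mean() - f(x0), preds.var())
    print(f"k={k:2d}  bias={results[k][0]:+.3f}  var={results[k][1]:.4f}")
```

Expect the k = 1 variance to sit near σ² and the k = 15 variance to be roughly an order of magnitude smaller.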
Suppose you fit y as a polynomial of degree d.
If you also add regularization, you can increase d (potentially reduce bias) while controlling variance.
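The same kind of sketch works for the degree knob d (again a hypothetical setup with sin as the true function and small, noisy training sets): at a fixed x, a low-degree fit shows high bias² and low variance, and a more flexible fit flips that.

```python
import numpy as np

rng = np.random.default_rng(2)
f = np.sin                       # assumed true function
sigma, n, trials, x0 = 0.5, 20, 2000, 1.0

results = {}
for d in (1, 5):
    preds = np.empty(trials)
    for t in range(trials):
        X = rng.uniform(0, 3, n)                 # fresh training set each trial
        y = f(X) + rng.normal(0, sigma, n)
        preds[t] = np.polyval(np.polyfit(X, y, d), x0)
    results[d] = ((preds.mean() - f(x0)) ** 2, preds.var())
    print(f"degree={d}  bias^2={results[d][0]:.4f}  var={results[d][1]:.4f}")
```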
A useful fact: averaging reduces variance when errors are not perfectly correlated.
Let Z₁, …, Z_M be random predictions (at x) from M models with the same mean and variance.
Define the average prediction:
Z̄ = (1/M) ∑ᵢ Zᵢ
If the Zᵢ are independent with Var(Zᵢ) = v, then:
Var(Z̄) = Var((1/M) ∑ᵢ Zᵢ)
= (1/M²) ∑ᵢ Var(Zᵢ)
= (1/M²) · M · v
= v/M
In practice, models are correlated, so you don’t get v/M, but you still often get a meaningful reduction. This is the core reason bagging and random forests reduce variance.
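The independent-case result Var(Z̄) = v/M is easy to verify by simulation (illustrative numbers only):

```python
import numpy as np

rng = np.random.default_rng(3)
M, trials, v = 8, 100_000, 1.0

# M independent predictions at a fixed x, each with variance v
Z = rng.normal(0.0, np.sqrt(v), (trials, M))
var_single = Z[:, 0].var()        # variance of one predictor
var_avg = Z.mean(axis=1).var()    # variance of the M-model average

print(f"single: {var_single:.3f}   average: {var_avg:.3f}   v/M = {v / M:.3f}")
```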
A final subtlety worth breathing room: bias and variance are functions of x, not single numbers for a whole model. Variance tends to be highest where training data is sparse. This explains why some models behave nicely “in the middle” but become unstable near the boundaries of the input domain.
You rarely get to compute Bias(x), Var(x), or σ² directly. What you can do is estimate generalization error on held-out data, compare training error to validation error, and watch how both move as you change capacity, regularization, or dataset size.
Bias–variance tradeoff is a diagnostic story that guides which lever to pull next.
A common iterative workflow: train, compare training and validation error, diagnose (both high suggests bias; a large gap suggests variance), pull the matching lever from the table above, and repeat. This is not a proof, just a principled heuristic.
Cross-validation (CV) is a way to approximate the expectation over datasets D.
Remember: variance is about how much f̂ changes when D changes. CV does something related: it trains the same algorithm on several different subsamples of your data and scores each fit on points that fit never saw.
While CV doesn’t directly compute Var_D(f̂(x)), it approximates expected generalization error, which includes both bias and variance effects.
In other words, CV is your practical handle on the left-hand side:
E_D,ε[(f̂(x) − y)²]
averaged over x.
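A minimal K-fold CV sketch in plain NumPy (the degree-1 model and the sin data-generating process are illustrative assumptions): shuffle, split the data into K folds, train on K − 1 of them, score on the held-out fold, and average the fold scores.

```python
import numpy as np

rng = np.random.default_rng(4)
n, K = 60, 5
X = rng.uniform(0, 3, n)
y = np.sin(X) + rng.normal(0, 0.3, n)   # assumed data-generating process

idx = rng.permutation(n)                # shuffle before splitting into folds
folds = np.array_split(idx, K)

fold_mse = []
for k in range(K):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    coeffs = np.polyfit(X[train], y[train], 1)   # fit on K-1 folds
    err = y[test] - np.polyval(coeffs, X[test])  # score on the held-out fold
    fold_mse.append(np.mean(err ** 2))

print(f"{K}-fold CV estimate of expected test MSE: {np.mean(fold_mse):.3f}")
```

The averaged fold score estimates the left-hand side of the decomposition; it bundles bias, variance, and noise into one number.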
Ensembles (bagging, random forests) are practical tools to reduce variance: train many models on bootstrap resamples of the data (random forests also randomize the features each tree sees), then average their predictions. This links directly to the variance term in the decomposition.
Even with a perfect predictor f(x), your expected squared error is still σ².
This matters when you’re stuck: if test error has plateaued near the noise floor, no amount of model tuning will close the gap; only better measurements or more informative features can shrink σ² itself.
In classification, the analogous notion is that if classes overlap intrinsically, there is a minimum achievable error (often called Bayes error). The exact decomposition differs, but the same moral applies: some uncertainty is built into the world.
The decomposition uses (Bias(x))², not Bias(x). This means the sign of the bias does not matter for MSE: predicting systematically too high hurts exactly as much as predicting systematically too low by the same amount.
| Observation | Likely issue | Typical next move |
|---|---|---|
| Training error high; validation error high | high bias | increase capacity, reduce regularization, improve features |
| Training error low; validation error high | high variance | more data, regularization, bagging/ensemble, simplify |
| Training error low; validation error low | good balance | consider whether noise limits further gains |
| Training error high; validation error lower (rare) | training procedure mismatch | check data leakage, metric, preprocessing, optimization |
Bias–variance is not just theory. It’s the bridge between the statistics of learning (expectations over random training sets) and day-to-day practice (reading training and validation curves and deciding what to change). And it motivates the next nodes you’ll unlock: regularization, cross-validation, and ensemble methods.
Suppose the true function value at a particular input is f(x) = 2.
You train the same learning algorithm on many independent training sets, and you observe that the learned prediction at this x is a random variable f̂(x) that takes the value 1 with probability 1/2 and the value 3 with probability 1/2.
Assume the observation noise in the test label is ε with E[ε] = 0 and Var(ε) = σ² = 1, so y = f(x) + ε.
Compute Bias(x), (Bias(x))², Var(x), and the expected test MSE at x.
Compute the expected prediction μ(x) = E_D[f̂(x)]:
μ(x) = (1)(1/2) + (3)(1/2)
= (1/2) + (3/2)
= 2
Compute Bias(x) = μ(x) − f(x):
Bias(x) = 2 − 2 = 0
So (Bias(x))² = 0² = 0
Compute Var(x) = E[(f̂(x) − μ(x))²]:
Possible deviations from μ = 2 are: (1 − 2)² = 1 with probability 1/2, and (3 − 2)² = 1 with probability 1/2.
Thus:
Var(x) = (1)(1/2) + (1)(1/2) = 1
Use the decomposition:
E[(f̂(x) − y)²] = (Bias(x))² + Var(x) + σ²
= 0 + 1 + 1
= 2
Insight: Even though the predictor is unbiased at this x (bias = 0), it is unstable (variance = 1). Averaging multiple such predictors (an ensemble) could reduce the variance term and improve test error, but you can’t beat the noise floor σ² = 1.
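A quick simulation confirms the arithmetic; the coin-flip predictor and unit-variance Gaussian noise below mirror the setup of this example.

```python
import numpy as np

rng = np.random.default_rng(5)
trials = 1_000_000

f_hat = rng.choice([1.0, 3.0], trials)   # prediction is 1 or 3, equally likely
y = 2.0 + rng.normal(0, 1.0, trials)     # test label: f(x) = 2 plus unit noise

mse = np.mean((f_hat - y) ** 2)
print(f"empirical MSE: {mse:.3f}")       # should be close to 0 + 1 + 1 = 2
```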
Assume y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and ε is independent of the training set D.
Let f̂(x) be the learned predictor from D.
Show that:
E_D,ε[(f̂(x) − y)²] = (E_D[f̂(x)] − f(x))² + Var_D(f̂(x)) + σ².
Start with the test squared error:
E[(f̂(x) − y)²]
Substitute y = f(x) + ε:
E[(f̂(x) − f(x) − ε)²]
Expand the square:
(f̂ − f − ε)²
= (f̂ − f)² − 2ε(f̂ − f) + ε²
Take expectation:
E[(f̂ − f)²] − 2E[ε(f̂ − f)] + E[ε²]
Show the middle term is 0:
Because ε is independent of D (hence of f̂) and E[ε] = 0,
E[ε(f̂ − f)] = E[ε]·E[f̂ − f] = 0
So the expression becomes:
E[(f̂ − f)²] + E[ε²]
And E[ε²] = Var(ε) = σ²
Now decompose E[(f̂ − f)²] by adding and subtracting μ = E[f̂]:
Let μ = E_D[f̂(x)]. Then
f̂ − f = (f̂ − μ) + (μ − f)
Square:
(f̂ − f)² = (f̂ − μ)² + 2(f̂ − μ)(μ − f) + (μ − f)²
Take expectation over D:
E[(f̂ − μ)²] + 2(μ − f)E[f̂ − μ] + (μ − f)²
But E[f̂ − μ] = 0, so the cross-term vanishes.
Recognize terms:
E[(f̂ − μ)²] = Var_D(f̂(x))
(μ − f) = Bias(x)
So E[(f̂ − f)²] = Var_D(f̂(x)) + (Bias(x))²
Combine with noise:
E[(f̂ − y)²] = (Bias(x))² + Var_D(f̂(x)) + σ²
Insight: The entire decomposition hinges on two ideas: (1) squared loss lets you expand and regroup terms cleanly, and (2) the cross-term disappears because deviations around the mean have zero expectation.
At a fixed x, suppose two learned models produce predictions Z₁ and Z₂ with:
E[Z₁] = E[Z₂] = m,
Var(Z₁) = Var(Z₂) = v,
Corr(Z₁, Z₂) = ρ.
Let the ensemble prediction be Z̄ = (Z₁ + Z₂)/2.
Compute Var(Z̄) in terms of v and ρ.
Use the variance formula:
Var(Z̄) = Var((Z₁ + Z₂)/2)
= (1/4) Var(Z₁ + Z₂)
Expand Var(Z₁ + Z₂):
Var(Z₁ + Z₂) = Var(Z₁) + Var(Z₂) + 2Cov(Z₁, Z₂)
= v + v + 2Cov(Z₁, Z₂)
= 2v + 2Cov(Z₁, Z₂)
Convert correlation to covariance:
Corr(Z₁, Z₂) = ρ = Cov(Z₁, Z₂) / (√v √v) = Cov(Z₁, Z₂) / v
So Cov(Z₁, Z₂) = ρv
Substitute:
Var(Z₁ + Z₂) = 2v + 2ρv = 2v(1 + ρ)
Thus:
Var(Z̄) = (1/4) · 2v(1 + ρ)
= (v/2)(1 + ρ)
Insight: Averaging helps most when models are less correlated. If ρ = 1 (perfectly correlated), Var(Z̄) = v (no gain). If ρ = 0, Var(Z̄) = v/2. Random forests work partly by reducing correlation between trees, making variance reduction from averaging more effective.
At fixed x, the expected squared prediction error decomposes into (Bias(x))² + Var(x) + σ².
Bias(x) = E_D[f̂(x)] − f(x) measures systematic mismatch; variance measures sensitivity of f̂(x) to the sampled training set.
σ² is irreducible noise from the data-generating process y = f(x) + ε; even perfect f(x) can’t beat it under squared loss.
Underfitting typically corresponds to high bias; overfitting typically corresponds to high variance.
Increasing model flexibility often decreases bias but increases variance; regularization often does the opposite.
More data usually reduces variance (stabilizes the learned predictor) and can sometimes reduce bias indirectly.
Ensembles reduce variance by averaging, especially when individual models are not highly correlated.
Bias and variance can vary across input space; scarce-data regions often have higher variance.
Treating bias–variance as something you can read off from a single trained model without considering randomness over training sets.
Assuming overfitting is only about model size; the core issue is instability (variance), which can come from many sources (features, noise, training procedure).
Forgetting the noise term σ² and expecting test error to go to 0 even when the labels are inherently noisy.
Mixing up Bias(x) with (Bias(x))² in the decomposition; the squared term is what appears in MSE.
At a fixed x, suppose f(x) = 5 and the learned predictor satisfies E_D[f̂(x)] = 4 with Var_D(f̂(x)) = 2. The noise variance is σ² = 3. Compute the expected test MSE at x under squared loss.
Hint: Use E[(f̂ − y)²] = (E[f̂] − f)² + Var(f̂) + σ².
Bias(x) = E[f̂(x)] − f(x) = 4 − 5 = −1
(Bias(x))² = 1
Var(x) = 2
σ² = 3
So expected MSE = 1 + 2 + 3 = 6.
You average M predictors at a fixed x: Z̄ = (1/M)∑ᵢ Zᵢ. Assume each has Var(Zᵢ) = v and pairwise correlation Corr(Zᵢ, Zⱼ) = ρ for i ≠ j. Derive Var(Z̄).
Hint: Use Var(∑ Zᵢ) = ∑ Var(Zᵢ) + 2∑_{i<j} Cov(Zᵢ, Zⱼ) and Cov = ρv.
Var(Z̄) = Var((1/M)∑ᵢ Zᵢ) = (1/M²)Var(∑ᵢ Zᵢ)
Var(∑ᵢ Zᵢ) = ∑ᵢ Var(Zᵢ) + 2∑_{i<j} Cov(Zᵢ, Zⱼ)
= Mv + 2·(number of pairs)·(ρv)
Number of pairs = M(M−1)/2
So Var(∑ᵢ Zᵢ) = Mv + 2·(M(M−1)/2)·ρv = Mv + M(M−1)ρv
Therefore:
Var(Z̄) = (1/M²)[Mv + M(M−1)ρv]
= (v/M) + ((M−1)/M)ρv
= v( (1−ρ)/M + ρ )
Checks: if ρ = 0 → v/M; if ρ = 1 → v.
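The formula can be sanity-checked by simulating equicorrelated predictors (a standard construction, assumed here for illustration): give every Zᵢ a shared component with variance ρv plus an independent component with variance (1 − ρ)v, so each has variance v and every pair has correlation ρ.

```python
import numpy as np

rng = np.random.default_rng(6)
M, trials = 10, 200_000
rho, v = 0.4, 1.0

# Shared component induces the pairwise correlation rho
shared = rng.normal(0, np.sqrt(rho * v), trials)
own = rng.normal(0, np.sqrt((1 - rho) * v), (trials, M))
Z = shared[:, None] + own            # each row: M equicorrelated predictions

var_avg = Z.mean(axis=1).var()
formula = v * ((1 - rho) / M + rho)

print(f"simulated: {var_avg:.4f}   formula v((1-rho)/M + rho): {formula:.4f}")
```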
A learner compares two models A and B. Model A has higher training error but similar validation error compared to B; model B has very low training error but noticeably worse validation error. Using bias–variance language, diagnose A vs B and propose one concrete change to improve each.
Hint: Think: high bias shows up as high training and validation error; high variance shows up as low training but high validation error.
Model A: higher training error suggests it may be underfitting (higher bias). If validation error is similar to B, A may be too simple or too regularized. Improvement: increase capacity (e.g., more features, deeper model) or reduce regularization.
Model B: very low training error but worse validation error indicates overfitting (higher variance). Improvement: add regularization (e.g., L2, dropout, pruning), simplify the model, gather more data, or use bagging/ensembling.