Combining multiple models. Bagging, boosting, random forests.
A single decision tree can be brilliant—and also wildly unstable. Ensemble methods tame that instability (and sometimes push accuracy beyond what any single model can reach) by combining many “weak” or “noisy” models into one strong predictor.
Ensemble methods build a final predictor H(x) by combining base learners hᵢ(x). Bagging trains many models in parallel on bootstrap-resampled data and averages/votes to reduce variance (e.g., random forests). Boosting trains models sequentially, each focusing on previous mistakes, and combines them as a weighted sum to reduce bias (e.g., AdaBoost, Gradient Boosting).
Many ML models have a frustrating property: small changes in the training data can cause large changes in the learned model. Decision trees are a classic example—if you perturb the dataset slightly, the splits can change, leading to very different trees.
Ensemble methods are a systematic way to convert many unstable, individually unreliable models into one stable, accurate predictor.
The core idea is surprisingly simple: instead of trusting one model, train multiple base learners and aggregate their predictions.
An ensemble method constructs a set of base learners h₁(x), …, h_M(x) and defines H(x) as an aggregation of the hᵢ(x).
Typical forms: averaging (regression), majority voting (classification), and weighted sums (boosting).
If every base learner makes the same mistakes, the ensemble gains nothing.
Ensembles work when base learners are individually better than chance and diverse, i.e., they make different errors on different inputs.
A helpful mental model: if each model has some error “noise,” averaging cancels noise—but only if those noises aren’t identical.
Suppose we do regression and each model predicts:
hᵢ(x) = f(x) + εᵢ
where f(x) is the true target function (or the expected prediction) and εᵢ is a zero-mean error term.
The ensemble average is:
H(x) = (1/M) ∑ᵢ hᵢ(x)
= (1/M) ∑ᵢ (f(x) + εᵢ)
= f(x) + (1/M) ∑ᵢ εᵢ
So the ensemble error is (1/M) ∑ᵢ εᵢ.
If the εᵢ are independent with Var(εᵢ) = σ², then:
Var(H(x) − f(x))
= Var((1/M) ∑ᵢ εᵢ)
= (1/M²) ∑ᵢ Var(εᵢ)
= (1/M²) · M · σ²
= σ² / M
That 1/M is the dream scenario.
If errors are correlated, you don’t get the full 1/M; this is why creating diversity is a central design goal.
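The 1/M claim is easy to check empirically. Here is a minimal pure-Python Monte Carlo sketch (illustrative values; σ² = 4 and the trial count are arbitrary choices) that averages M independent zero-mean errors and measures the variance of the average:

```python
import random
import statistics

random.seed(0)
sigma = 2.0            # per-model error std dev, so sigma^2 = 4
n_trials = 40_000      # Monte Carlo repetitions

variances = {}
for M in (1, 4, 16):
    # each trial: average M independent zero-mean errors eps_i
    avg_errors = [
        sum(random.gauss(0.0, sigma) for _ in range(M)) / M
        for _ in range(n_trials)
    ]
    variances[M] = statistics.pvariance(avg_errors)   # should be close to sigma^2 / M
```

The measured variances track σ²/M: quadrupling M cuts the variance of the averaged error by roughly four.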
Ensemble methods mainly fall into two camps:
| Family | Training style | Main goal | Common examples |
|---|---|---|---|
| Bagging | Parallel (independent) | Reduce variance | Bagging, Random Forests |
| Boosting | Sequential (dependent) | Reduce bias (and sometimes variance) | AdaBoost, Gradient Boosting, XGBoost-style methods |
This lesson focuses on: aggregation, bagging, random forests, and boosting—what they optimize, why they work, and how to reason about their behavior using the bias–variance lens.
Training many models is only half the story. The other half is: how do we combine them so the result is better than each one?
Aggregation defines how information flows from base learners hᵢ(x) to the final ensemble H(x). Different choices correspond to different assumptions about the output type (real value, class label, or probability) and about whether all models deserve equal say.
If hᵢ(x) outputs a real number (e.g., house price), the default aggregation is:
H(x) = (1/M) ∑ᵢ hᵢ(x)
Why this makes sense: averaging preserves the shared signal f(x) while canceling zero-mean errors, and under squared loss the mean is the best equal-weight combination.
If each hᵢ(x) outputs a class label in {0,1} (or {1,…,K}), a typical ensemble is:
H(x) = mode{h₁(x), …, h_M(x)}
Why this can help: if each classifier is right more often than chance and errors are not perfectly correlated, the majority is correct more often than any single voter (a Condorcet-style argument).
Often, classifiers output estimated probabilities pᵢ(y=k | x). Then you can average probabilities and choose argmax:
p̄(y=k | x) = (1/M) ∑ᵢ pᵢ(y=k | x)
H(x) = argmaxₖ p̄(y=k | x)
Soft voting tends to be more stable than hard voting because it uses confidence information.
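The voting rules above fit in a few lines. A minimal pure-Python sketch; the three probability vectors are hypothetical, chosen so that hard and soft voting disagree:

```python
from collections import Counter

def hard_vote(labels):
    """Majority vote over already-thresholded class labels."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(prob_lists):
    """Average per-class probabilities across models, then take argmax."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[k] for p in prob_lists) / n_models for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: avg[k])

# three hypothetical classifiers; probabilities for classes [0, 1]
probs = [[0.4, 0.6], [0.45, 0.55], [0.9, 0.1]]
labels = [max(range(2), key=lambda k: p[k]) for p in probs]

hard = hard_vote(labels)   # two of three label-votes say class 1
soft = soft_vote(probs)    # class 0 averages (0.4 + 0.45 + 0.9)/3, which wins
```

Here the two rules disagree: hard voting ignores that the third model is very confident, while soft voting lets that confidence tip the decision.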
Sometimes models shouldn’t be equal. Boosting is the archetype:
H(x) = sign(∑ᵢ αᵢ hᵢ(x)) (binary classification with hᵢ(x) ∈ {−1, +1})
or for regression:
H(x) = ∑ᵢ αᵢ hᵢ(x)
Here αᵢ is learned from performance: more accurate learners receive larger weights, so not all models contribute equally.
Aggregation works best when base learners disagree in useful ways.
How do we create diversity? The main levers: train on different data subsets (bagging), restrict feature choices (random forests), reweight hard examples (boosting), or vary model families and hyperparameters.
A key point: diversity that increases error is not helpful. We want “decorrelated” errors while keeping each hᵢ(x) reasonably accurate.
You already know the bias–variance tradeoff. Aggregation changes that balance: averaging many low-bias, high-variance learners cuts variance, while sequentially summing high-bias, low-variance learners cuts bias.
A simplified but useful heuristic:
Assume M models, each with error εᵢ, Var(εᵢ)=σ², and pairwise Corr(εᵢ, εⱼ)=ρ for i≠j.
Let H be the average. Then:
Var(H − f)
= Var((1/M) ∑ᵢ εᵢ)
= (1/M²) Var(∑ᵢ εᵢ)
Now expand Var of a sum:
Var(∑ᵢ εᵢ)
= ∑ᵢ Var(εᵢ) + 2∑_{i<j} Cov(εᵢ, εⱼ)
= Mσ² + 2 · (M(M−1)/2) · (ρσ²)
= Mσ² + M(M−1)ρσ²
So:
Var(H − f)
= (1/M²)[Mσ² + M(M−1)ρσ²]
= (σ²/M) + ( (M−1)/M ) ρσ²
As M → ∞, this approaches ρσ².
Interpretation: the σ²/M term vanishes as you add models, but the correlated part ρσ² remains as a variance floor no amount of extra models can remove.
This is why random forests deliberately reduce correlation (via feature subsampling), not just train many trees.
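The correlated-variance formula is worth encoding once. A small helper (a sketch, not a library function) makes the ρσ² floor concrete:

```python
def ensemble_variance(M, sigma2, rho):
    """Var(H - f) for an average of M learners, each with error variance
    sigma2 and pairwise error correlation rho (the formula derived above)."""
    return sigma2 / M + (M - 1) / M * rho * sigma2

# with rho = 0.2 and sigma2 = 1, the floor is rho * sigma2 = 0.2
v10 = ensemble_variance(10, 1.0, 0.2)      # 0.1 + 0.9 * 0.2 = 0.28
v1000 = ensemble_variance(1000, 1.0, 0.2)  # already close to the 0.2 floor
floor = 0.2 * 1.0
```

Going from 10 to 1000 models barely helps once you are near the floor; lowering ρ moves the floor itself.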
Decision trees are typically low-bias when grown deep, but high-variance and unstable.
Bagging targets that exact profile.
The plan: train each tree on its own bootstrap sample of the data, then aggregate by averaging (regression) or voting (classification).
You get a “committee” of trees that individually overfit in different ways, and the vote cancels much of that overfitting.
Given training set D of size N, a bootstrap sample Dᵢ is created by sampling N points with replacement from D.
Key facts: each Dᵢ has the same size N as D, individual points can appear multiple times, and some points are left out entirely.
A useful calculation: expected fraction of unique points in a bootstrap sample.
Probability a specific point is not selected in one draw: 1 − 1/N.
Not selected in N draws:
(1 − 1/N)ᴺ ≈ e^{−1} ≈ 0.368
So expected fraction left out is ≈ 36.8%, and included is ≈ 63.2%.
Those “left out” points are the out-of-bag (OOB) set for that tree.
For m = 1..M:
  1. Draw a bootstrap sample D_m from D.
  2. Train base learner h_m on D_m.
Return ensemble H(x) = average/vote of h_m(x)
Because each training point is excluded from about 36.8% of bootstrap samples, you can estimate generalization error without a separate validation set:
This yields an OOB error estimate that is often close to cross-validation for bagged trees.
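A quick simulation (illustrative sizes) verifies the ≈63.2%/36.8% split and shows the OOB bookkeeping: for each bootstrap sample, record which points were never drawn.

```python
import random

random.seed(0)
N = 1000                     # training-set size
M = 200                      # number of bootstrap samples / trees
oob_counts = [0] * N         # how many trees leave each point out
in_bag_fracs = []

for _ in range(M):
    # bootstrap: N draws with replacement from {0, ..., N-1}
    sample = [random.randrange(N) for _ in range(N)]
    unique = set(sample)
    in_bag_fracs.append(len(unique) / N)       # ≈ 0.632 per sample
    for i in range(N):
        if i not in unique:
            oob_counts[i] += 1                 # point i is OOB for this tree

avg_in_bag = sum(in_bag_fracs) / M             # ≈ 1 - e^{-1} ≈ 0.632
avg_oob_frac = sum(oob_counts) / (M * N)       # ≈ e^{-1} ≈ 0.368
```

In a real bagging implementation, point i's OOB prediction would average only the trees that counted it in `oob_counts`.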
Bagging alone reduces variance, but trees can still be correlated because strong predictors dominate splits.
Random forests add a second diversity lever: at each split, only a random subset of features is eligible.
If there are d features total, you consider only m_try randomly chosen features per split; common defaults are m_try ≈ √d for classification and m_try ≈ d/3 for regression.
Why it helps: strong features can no longer dominate the top splits of every tree, so trees explore different structure and their errors decorrelate, even if each individual tree becomes slightly weaker.
Often the net result is better test performance.
| Parameter | Effect on bias | Effect on variance | Notes |
|---|---|---|---|
| # trees M | ~ no change | ↓ (until plateau) | More trees rarely overfit; cost is compute |
| max depth / min leaf | ↑ when constrained | ↓ | Deep trees benefit from bagging; forests often use deep trees |
| m_try | can ↑ if too small | ↓ by reducing correlation | Too small may underuse signal |
| bootstrap on/off | small | diversity changes | Some variants use subsampling without replacement |
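The per-split feature subsampling itself is tiny. A minimal sketch of that one step (in a real forest this runs inside the tree-growing loop; √d is a common convention, not a law):

```python
import random

def candidate_features(d, m_try, rng):
    """Random subset of feature indices eligible at one split."""
    return rng.sample(range(d), m_try)

rng = random.Random(0)
d = 36
m_try = int(d ** 0.5)        # sqrt(d) = 6, a common classification default

# at this split, the tree may only choose among these 6 features
split_candidates = candidate_features(d, m_try, rng)
```

A fresh subset is drawn at every split, not once per tree, which is what keeps even the top splits of different trees varied.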
Random forests often report feature importance, typically the mean decrease in impurity attributed to each feature across all splits, or permutation importance.
This is a great tool, but not a proof of causality.
Bagging is great when your base learner is already powerful but unstable.
What if your base learner is too weak and underfits?
Boosting attacks bias by building an additive model step-by-step, where each new learner focuses on what previous learners got wrong.
Boosting constructs:
H_M(x) = ∑_{m=1..M} α_m h_m(x)
The sequence matters: hₘ is chosen based on the errors of H_{m−1}.
Two big boosting views: AdaBoost, which reweights training points so later learners focus on past mistakes, and gradient boosting, which treats boosting as gradient descent in function space.
We’ll sketch both, because they illuminate the “why.”
Assume labels y ∈ {−1, +1}, and weak learners output h(x) ∈ {−1, +1}.
AdaBoost maintains a weight distribution over training points: initially uniform (wᵢ = 1/N), and renormalized after every round so ∑ᵢ wᵢ = 1.
At step m, weighted error is:
err_m = ∑ᵢ wᵢ [h_m(xᵢ) ≠ yᵢ]
Then compute learner weight:
α_m = (1/2) ln((1 − err_m)/err_m)
Update data weights:
wᵢ ← wᵢ · exp(−α_m yᵢ h_m(xᵢ))
Check what this does: if h_m(xᵢ) = yᵢ then yᵢ h_m(xᵢ) = +1 and the weight shrinks by a factor e^{−α_m}; if misclassified, yᵢ h_m(xᵢ) = −1 and the weight grows by e^{+α_m}. Hard points get more attention in the next round.
Finally normalize w so ∑ᵢ wᵢ = 1.
The final classifier:
H(x) = sign(∑ₘ α_m h_m(x))
Why it reduces bias: each stage targets the “hard” points not yet handled.
Risk: if you keep adding learners, you can start fitting noise (variance can rise), especially with very complex base learners.
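One AdaBoost round can be traced end to end in a few lines. The toy labels and stump predictions below are made up for illustration; note the classic property that after the update, the misclassified points carry half of the total weight:

```python
import math

# toy setup: 5 points, uniform weights, one weak learner's predictions
# (labels and outputs in {-1, +1}); all values are made up
y      = [+1, +1, -1, -1, +1]
h_pred = [+1, +1, -1, -1, -1]          # wrong only on the last point
w      = [0.2] * 5

# weighted error of this learner
err = sum(wi for wi, yi, hi in zip(w, y, h_pred) if hi != yi)

# learner weight: alpha = (1/2) ln((1 - err)/err)
alpha = 0.5 * math.log((1 - err) / err)

# reweight: down-weight correct points, up-weight mistakes, then normalize
w = [wi * math.exp(-alpha * yi * hi) for wi, yi, hi in zip(w, y, h_pred)]
Z = sum(w)
w = [wi / Z for wi in w]
```

With err = 0.2, α = ½ ln 4 ≈ 0.693; the one misclassified point ends up with weight 0.5 and the four correct points 0.125 each, so the next learner is forced to care about it.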
Gradient boosting generalizes boosting to arbitrary differentiable losses.
We want to minimize empirical risk:
L(H) = ∑ᵢ ℓ(yᵢ, H(xᵢ))
We build H as an additive model:
H_m(x) = H_{m−1}(x) + η · f_m(x)
where f_m is a new base learner (often a small tree), and η ∈ (0,1] is a learning rate.
We want to choose f_m that most reduces L. In function space, the steepest descent direction is the negative gradient.
Define pseudo-residuals:
rᵢ^{(m)} = − ∂ℓ(yᵢ, H(xᵢ)) / ∂H(xᵢ) evaluated at H = H_{m−1}
Then fit the next learner to these residuals: choose f_m to approximate them, e.g. f_m ≈ argmin_f ∑ᵢ (rᵢ^{(m)} − f(xᵢ))².
So gradient boosting is like gradient descent in function space: compute the negative gradient of the loss at each training point, fit a base learner to it, and take a step of size η.
Let ℓ(y, H) = (1/2)(y − H)².
Compute:
∂ℓ/∂H = ∂(1/2)(y − H)² / ∂H
= (1/2) · 2(y − H) · (−1)
= −(y − H)
So the negative gradient is:
r = −(∂ℓ/∂H) = y − H
That’s exactly the residual.
Thus gradient boosting with squared loss repeatedly fits residuals.
Boosting often uses weak learners such as decision stumps (depth-1 trees) or shallow trees of depth 2–4.
They have high bias individually, but the additive combination builds complexity gradually. This helps control overfitting and makes training more stable.
| Knob | What it does | Typical effect |
|---|---|---|
| Learning rate η | scales each step | smaller η needs more trees but often generalizes better |
| # estimators M | number of steps | too large can overfit unless η small / trees small |
| Tree depth | base learner complexity | deeper trees fit interactions faster but risk overfitting |
| Subsample (stochastic GB) | fit each step on data subsample | reduces variance, adds randomness |
Boosting is often described as a bias-reduction machine: it greedily builds a complex additive function out of simple pieces.
This is why boosting tends to be powerful but sensitive to tuning.
Use the bias–variance diagnosis you already know: if the model overfits (high variance), reach for bagging or random forests; if it underfits (high bias), reach for boosting.
This is not absolute, but it’s a strong default.
| Situation | Symptoms | Good first choice | Why |
|---|---|---|---|
| Tree is unstable | train accuracy high, test low; results change with small data changes | Bagging | averaging cancels instability |
| Many features, strong predictors dominate | bagged trees still similar | Random Forest | feature subsampling reduces correlation |
| Need top accuracy on tabular data | linear/logistic underfits; interactions matter | Gradient Boosting | sequential correction builds complex function |
| Need minimal tuning | want strong baseline fast | Random Forest | robust defaults, less sensitive |
| Need calibrated probabilities | raw tree votes not calibrated | GB + calibration | boosting can improve ranking; calibrate separately |
Even though an ensemble is “many models,” you still ask the usual questions: is it underfitting or overfitting, and which hyperparameter moves bias versus variance?
For random forests, stability increases with M; for boosting, stability depends on regularization.
| Method | Parallelizable? | Typical training cost | Typical inference cost |
|---|---|---|---|
| Bagging | Yes | M independent fits | M predictions |
| Random Forest | Yes | M independent fits + split feature sampling | M predictions |
| Boosting | Not easily (sequential) | M sequential fits | M predictions |
Boosting’s sequential nature is the price you pay for its targeted bias reduction.
Ensembles are not limited to trees.
However, trees are a perfect match because they train quickly, handle mixed feature types with little preprocessing, and are unstable enough that aggregation has real variance to cancel.
Another ensemble family is stacking, where a meta-model learns how to combine base learners. It’s powerful, but adds complexity and risks leakage. For this node, focus on the foundational trio: bagging, random forests, boosting.
Before choosing/tuning ensembles: establish a simple baseline, set up honest validation, and diagnose whether your main problem is bias or variance.
Understanding ensembles unlocks modern gradient-boosted tree libraries, stacking, and a principled way to compare models on tabular data.
Binary classification with three base learners. Each outputs a probability for class 1 at x. Convert to class labels with threshold 0.5.
Model outputs: p₁(y=1 | x) = 0.51, p₂(y=1 | x) = 0.99, p₃(y=1 | x) = 0.01.
Convert each probability to a hard class prediction with threshold 0.5:
h₁(x) = 1 (since 0.51 ≥ 0.5)
h₂(x) = 1 (since 0.99 ≥ 0.5)
h₃(x) = 0 (since 0.01 < 0.5)
Hard majority vote:
Votes for class 1: 2
Votes for class 0: 1
So H_hard(x) = 1
Soft voting (average probabilities):
p̄ = (1/3)(0.51 + 0.99 + 0.01)
= (1/3)(1.51)
≈ 0.5033
Apply threshold 0.5 to averaged probability:
H_soft(x) = 1 (since 0.5033 ≥ 0.5)
Now consider a tiny change: p₁ becomes 0.49 (barely flips).
Hard predictions become: h₁(x) = 0, h₂(x) = 1, h₃(x) = 0.
Hard vote ⇒ H_hard(x)=0 (2 votes for 0).
Soft voting with new p₁:
p̄ = (1/3)(0.49 + 0.99 + 0.01)
= (1/3)(1.49)
≈ 0.4967
Soft vote ⇒ H_soft(x)=0 (since 0.4967 < 0.5)
Insight: Hard voting is sensitive to thresholding each model first; soft voting uses confidence and can be more stable. In practice, soft voting often performs better when base learners are reasonably calibrated.
Assume each tree’s prediction error (for a fixed x) has variance σ² = 9. Compare two ensembles of M = 100 trees:
A) Bagged trees with high correlation ρ = 0.3
B) Random forest trees with lower correlation ρ = 0.05
Use:
Var(H − f) = (σ²/M) + ((M−1)/M) ρ σ²
Compute common quantities:
M = 100
σ² = 9
(σ²/M) = 9/100 = 0.09
((M−1)/M) = 99/100 = 0.99
Case A (ρ = 0.3):
Var = 0.09 + 0.99 · 0.3 · 9
= 0.09 + 0.99 · 2.7
= 0.09 + 2.673
= 2.763
Case B (ρ = 0.05):
Var = 0.09 + 0.99 · 0.05 · 9
= 0.09 + 0.99 · 0.45
= 0.09 + 0.4455
= 0.5355
Compare:
2.763 / 0.5355 ≈ 5.16
So reducing correlation from 0.3 to 0.05 cuts variance by about 5×, even with the same number of trees and same per-tree variance.
Insight: Adding more trees helps, but reducing correlation ρ can matter even more. Random forests do exactly this via feature subsampling, especially at high-level splits.
We have three training points with scalar x and y:
(x₁, y₁) = (0, 1)
(x₂, y₂) = (1, 3)
(x₃, y₃) = (2, 2)
Start with constant model H₀(x) = 2 (already fit somehow). Use squared error ℓ = (1/2)(y − H)².
Compute residuals and fit a simple base learner f₁ that predicts the mean residual (a constant). Use learning rate η = 0.5.
Compute residuals rᵢ = yᵢ − H₀(xᵢ):
r₁ = 1 − 2 = −1
r₂ = 3 − 2 = +1
r₃ = 2 − 2 = 0
Fit base learner f₁ as a constant equal to mean residual:
mean(r) = (1/3)(−1 + 1 + 0) = 0
So f₁(x) = 0 for all x
Update the model:
H₁(x) = H₀(x) + η f₁(x)
= 2 + 0.5 · 0
= 2
Interpretation:
The residuals had zero mean, so the best constant step is 0; a constant learner cannot improve H₀.
If instead f₁ were a stump that split x (e.g., predict different residual means for x < 0.5 vs ≥ 0.5), it could make progress. (The split must separate points with different residuals: a split at x = 1.5 would group residuals −1 and +1 together, whose mean is again 0, so it would not help either.)
Insight: Gradient boosting only improves when the base learner can represent the structure in the residuals. This is why small trees (not just constants) are common: they can capture patterns in residuals and reduce loss step-by-step.
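The worked example above translates directly to code. A constant learner leaves the loss unchanged, while a stump split at x = 0.5 (a choice that actually separates the residuals) reduces it:

```python
# dataset from the worked example above
X = [0.0, 1.0, 2.0]
y = [1.0, 3.0, 2.0]
H0 = 2.0          # initial constant model
eta = 0.5         # learning rate

residuals = [yi - H0 for yi in y]          # [-1, +1, 0]

# constant base learner: the best constant step is the mean residual
const_step = sum(residuals) / len(residuals)   # 0 -> no progress possible

# stump base learner: split at x = 0.5, predict mean residual per side
left  = [r for xi, r in zip(X, residuals) if xi < 0.5]    # [-1]
right = [r for xi, r in zip(X, residuals) if xi >= 0.5]   # [+1, 0]
def stump(x):
    return (sum(left) / len(left)) if x < 0.5 else (sum(right) / len(right))

def sq_loss(model):
    """Total squared-error loss sum_i (1/2)(y_i - model(x_i))^2."""
    return sum(0.5 * (yi - model(xi)) ** 2 for xi, yi in zip(X, y))

loss_const = sq_loss(lambda x: H0 + eta * const_step)
loss_stump = sq_loss(lambda x: H0 + eta * stump(x))
```

The constant learner reproduces H₁ = H₀ exactly as in the example, while the stump lowers the loss because it can move the two sides of the split in opposite directions.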
An ensemble builds H(x) by aggregating predictions from base learners hᵢ(x) via averaging, voting, or weighted sums.
Averaging reduces variance most effectively when base learner errors are weakly correlated; correlation ρ creates a variance floor.
Bagging creates diversity by training on bootstrap samples; it is especially effective for unstable learners like decision trees.
Out-of-bag (OOB) error provides an efficient internal validation estimate for bagged trees and random forests.
Random forests add feature subsampling at each split to reduce tree correlation, often giving a large additional gain over pure bagging.
Boosting builds an additive model sequentially; each step focuses on previous errors, often reducing bias dramatically.
Gradient boosting can be seen as functional gradient descent: each learner fits pseudo-residuals (negative gradients) of the loss.
Bagging/forests are typically robust and easy to tune; boosting can reach higher accuracy but is more sensitive to hyperparameters and overfitting.
Assuming “more models always helps” without checking correlation: if base learners are too similar, adding more barely improves.
Using boosting with overly deep trees and large learning rate η, leading to fast overfitting and unstable performance.
Comparing methods without controlling evaluation properly (e.g., tuning on test set, ignoring time-aware splits, or mixing leakage into OOB/validation).
Interpreting random forest impurity-based feature importance as causal or as unbiased (it can favor high-cardinality or noisy features).
You train 200 bagged trees for regression. Each tree has error variance σ² = 4 and pairwise correlation ρ = 0.2. Use Var(H − f) = (σ²/M) + ((M−1)/M) ρ σ² to compute the ensemble error variance. Then compute the limiting variance as M → ∞.
Hint: Compute σ²/M and ((M−1)/M) first. The limit removes the σ²/M term and sets ((M−1)/M) → 1.
M = 200, σ² = 4, ρ = 0.2.
σ²/M = 4/200 = 0.02
(M−1)/M = 199/200 = 0.995
Var = 0.02 + 0.995 · 0.2 · 4
= 0.02 + 0.995 · 0.8
= 0.02 + 0.796
= 0.816
As M → ∞:
Var → ρσ² = 0.2 · 4 = 0.8
In a random forest for classification with d = 36 features, compare two settings: m_try = 36 (no feature subsampling) vs m_try = √d. What is √d here, and why might it improve test performance even though each split sees fewer features?
Hint: Compute √36. Then relate feature subsampling to reduced correlation ρ between trees.
√d = √36 = 6.
With m_try = 36, every split considers all features, so many trees choose the same strong features near the top. Trees become more similar, increasing correlation ρ of their errors.
With m_try = 6, each split considers only 6 random features. Different trees are forced to explore different top splits, reducing correlation ρ. Even if individual trees become slightly weaker, the ensemble variance can drop enough to improve test performance.
For squared error regression with loss ℓ = (1/2)(y − H)², derive the pseudo-residual r = −∂ℓ/∂H and show it equals the ordinary residual (y − H).
Hint: Differentiate (1/2)(y − H)² with respect to H using the chain rule.
ℓ(y, H) = (1/2)(y − H)²
∂ℓ/∂H
= (1/2) · 2(y − H) · ∂(y − H)/∂H
= (y − H) · (−1)
= −(y − H)
Therefore the pseudo-residual is:
r = −∂ℓ/∂H = y − H
So for squared loss, gradient boosting fits ordinary residuals.