Preventing overfitting. L1, L2 penalties. Dropout.
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
Training loss going down while validation loss goes up is one of the most common “surprises” in machine learning. Regularization is the toolkit for preventing that surprise: you deliberately restrict (or noise up) your model so it can’t memorize quirks of the training set and is forced to learn patterns that generalize.
Regularization modifies learning to prefer simpler, more robust solutions. The most common form is adding a penalty to the objective: minimize L(θ) + λ·Ω(θ). L2 (‖w‖₂²) shrinks weights smoothly (“weight decay”), L1 (‖w‖₁) promotes sparsity (many weights become exactly 0), and dropout randomly masks units during training to reduce co-adaptation and behaves like implicit ensembling.
In supervised learning, you pick parameters θ to minimize an empirical (training) loss:
J_train(θ) = (1/n) ∑ᵢ ℓ(f(xᵢ; θ), yᵢ)
If the model is flexible enough (especially deep nets), J_train can often be pushed very low. But “low training loss” is not the goal. The goal is low generalization error on unseen data.
Overfitting happens when the model uses its capacity to fit idiosyncrasies: noise, rare coincidences, spurious correlations. The optimization succeeds (training loss drops), but the representation learned is brittle.
Regularization is any technique that changes the learning problem so that the solution is biased toward simpler / smoother / more stable models.
The most canonical framing is: augment the loss with a penalty term.
J_reg(θ) = J_train(θ) + λ·Ω(θ)
This is the atomic concept to keep returning to:
Regularization = minimize loss + penalty.
It’s not merely a trick; it’s a deliberate statement of preference: among many parameter settings that fit the data similarly well, prefer the one with smaller norm, fewer nonzero parameters, or better robustness.
There are several complementary lenses:
1) Bias–variance tradeoff (classical)
2) Stability / robustness (modern intuition)
3) Constrained optimization equivalence
Minimizing loss + penalty is often equivalent to minimizing loss subject to a constraint.
For L2:
minimize J_train(w) + λ‖w‖₂²
is closely related to
minimize J_train(w) subject to ‖w‖₂² ≤ c
The penalty formulation is convenient for gradient methods; the constraint formulation is useful for geometric intuition.
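To make the penalty formulation concrete, here is a minimal NumPy sketch of an L2-regularized objective for linear regression. The function name and toy data are illustrative, not from any particular library:

```python
import numpy as np

def l2_regularized_loss(w, X, y, lam):
    """J(w) = mean squared error + lam * ||w||_2^2 (penalty form)."""
    residual = X @ w - y
    data_loss = np.mean(residual ** 2)
    penalty = lam * np.sum(w ** 2)
    return data_loss + penalty

# Toy data: with lam > 0, larger-norm weights raise the objective
# via the penalty term even if they fit the data equally well.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
w_small = np.array([0.1, 0.1, 0.1])
w_large = np.array([5.0, 5.0, 5.0])
print(l2_regularized_loss(w_small, X, y, lam=0.1))
print(l2_regularized_loss(w_large, X, y, lam=0.1))
```

Note how the penalty is added directly to the data loss: the constraint formulation never appears in code, but for some value of c the two produce the same minimizer.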
Deep networks complicate the story because “complexity” is not perfectly captured by parameter count alone. Still, regularization remains essential.
Three cornerstone tools you’ll use constantly:
| Technique | What it modifies | Primary effect |
|---|---|---|
| L2 penalty (weight decay) | Objective (adds λ‖w‖₂²) | Shrinks weights smoothly; improves stability |
| L1 penalty | Objective (adds λ‖w‖₁) | Drives many weights to 0 (sparsity) |
| Dropout | Training procedure (random masking) | Reduces co-adaptation; implicit model averaging |
In the next sections, we’ll build each one from motivation → math → behavior → practical use.
If you want a regularizer that shrinks weights smoothly, keeps optimization well-conditioned, and improves stability without zeroing parameters outright, then L2 is usually the first choice.
In neural nets, you’ll often see it called weight decay, because the update rule literally decays weights a bit each step.
Let w denote the vector of weights you want to penalize (often all weights, sometimes excluding biases and normalization parameters).
J(w) = J_train(w) + λ‖w‖₂²
Recall:
‖w‖₂² = ∑ⱼ wⱼ²
So the penalty grows quadratically with magnitude.
We’ll derive the gradient term you add during backprop.
Ω(w) = ‖w‖₂² = ∑ⱼ wⱼ²
Take partial derivative for coordinate j:
∂Ω/∂wⱼ = ∂/∂wⱼ (∑ₖ wₖ²)
= 2wⱼ
So
∇Ω(w) = 2w
Therefore the gradient of the regularized objective is:
∇J(w) = ∇J_train(w) + λ∇Ω(w)
= ∇J_train(w) + 2λw
With learning rate η, SGD updates are:
w ← w − η(∇J_train(w) + 2λw)
Rearrange:
w ← w − η∇J_train(w) − 2ηλw
Factor w:
w ← (1 − 2ηλ)w − η∇J_train(w)
That factor (1 − 2ηλ) is the “decay”: every step pulls weights toward 0.
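The decay factor is easy to check numerically. A small sketch, with the data gradient set to zero so only the decay acts (values illustrative):

```python
import numpy as np

# One SGD step with L2 regularization: w <- w - eta*(grad + 2*lam*w)
eta, lam = 0.1, 0.01
w = np.array([1.0, -2.0, 0.5])

grad_data = np.zeros_like(w)          # pretend the data gradient vanishes
w_next = w - eta * (grad_data + 2 * lam * w)

decay = 1 - 2 * eta * lam             # = 0.998
print(w_next)                          # each weight multiplied by 0.998
```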
Consider the constrained form:
minimize J_train(w) subject to ‖w‖₂² ≤ c
The L2 ball is a disk (in 2D) or a solid sphere in higher dimensions. When you intersect a smooth loss surface with a round constraint set, the optimum typically lies on the boundary, but that boundary has no corners on the coordinate axes to snap to. This leads to many small weights rather than a few exactly-zero weights.
L2 regularization corresponds to a Gaussian prior on weights:
p(w) ∝ exp(−(λ)‖w‖₂²)
Minimizing J_train + λ‖w‖₂² is like MAP estimation: fit the data while preferring weights near 0.
1) What to regularize: typically the weight matrices; biases and normalization parameters are usually excluded.
2) Choosing λ: treat it as a hyperparameter and tune it on a validation set, typically over a log-spaced grid.
3) Weight decay vs L2 penalty in adaptive optimizers
In SGD, “L2 regularization” and “weight decay” are effectively the same. In Adam/RMSProp, naïvely adding λ‖w‖₂² to the loss is not identical to decoupled weight decay.
In practice, for Adam-family optimizers, AdamW is often preferred because the regularization effect is more predictable.
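To see why the two differ, here is a simplified single-step sketch contrasting Adam with L2 folded into the gradient against decoupled (AdamW-style) weight decay. The function names and hyperparameters are illustrative; bias correction is included for step t = 1:

```python
import numpy as np

def adam_step_l2(w, grad, m, v, lr=1e-3, lam=0.01,
                 b1=0.9, b2=0.999, eps=1e-8, t=1):
    """Adam with L2 in the loss: the decay term is rescaled by the
    adaptive denominator along with the rest of the gradient."""
    g = grad + 2 * lam * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

def adamw_step(w, grad, m, v, lr=1e-3, wd=0.01,
               b1=0.9, b2=0.999, eps=1e-8, t=1):
    """Decoupled weight decay: the decay is applied directly to w,
    outside the adaptive rescaling."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w

w = np.array([10.0, 0.1])    # one large, one small weight
grad = np.zeros_like(w)       # no data gradient: only regularization acts
m = np.zeros_like(w)
v = np.zeros_like(w)

print(adam_step_l2(w, grad, m, v))
print(adamw_step(w, grad, m, v))
```

With no data gradient, the L2-in-loss variant's pull is normalized away by the adaptive denominator (both weights shrink by nearly the same absolute amount regardless of size), while the decoupled decay shrinks each weight in proportion to its magnitude, which is the behavior you usually intend.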
Even though L2 acts on parameters, its functional effect is to smooth the learned function. Keep one mental picture: L2 spreads influence across many features with small coefficients, rather than betting everything on a few huge weights.
L1 regularization is used when you want sparse solutions or built-in feature selection, or when you suspect many true effects are actually irrelevant.
Where L2 shrinks everything smoothly, L1 tends to create exact zeros.
J(w) = J_train(w) + λ‖w‖₁
where
‖w‖₁ = ∑ⱼ |wⱼ|
Consider the constrained form:
minimize J_train(w) subject to ‖w‖₁ ≤ c
In 2D, the L1 ball is a diamond (a rotated square). Its corners lie on the coordinate axes.
When a smooth loss contour first touches this diamond, it often touches at a corner.
Touching at a corner means one coordinate is exactly 0.
This “corners encourage zeros” intuition is the key:
The absolute value is not differentiable at 0, so we use a subgradient.
For a single coordinate:
d|w|/dw = +1 if w > 0, −1 if w < 0, and undefined at w = 0.
So a subgradient of ‖w‖₁ is:
∂‖w‖₁/∂wⱼ = sign(wⱼ) (with sign(0) ∈ [−1, +1])
Thus a subgradient update looks like:
w ← w − η(∇J_train(w) + λ·sign(w))
For some losses (notably squared loss in linear regression), L1 leads to a closed-form coordinate update called soft-thresholding.
Even if you don’t use the closed form in deep learning, the behavior is worth understanding: L1 applies a constant pull toward 0, not proportional to w.
Compare pulls: L2’s pull toward 0 is 2λw, proportional to the weight and vanishing as w → 0, while L1’s pull is a constant λ regardless of magnitude.
That’s why small weights get “snapped” to zero under L1.
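The soft-thresholding operator mentioned above can be written in a few lines. A sketch (names illustrative):

```python
import numpy as np

def soft_threshold(w, tau):
    """Soft-thresholding: shrink each weight toward 0 by tau, and
    snap anything smaller than tau in magnitude to exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

w = np.array([3.0, -0.5, 0.05, -2.0])
print(soft_threshold(w, tau=0.1))
# -> [ 2.9  -0.4   0.   -1.9]
# Large weights shrink by 0.1; the 0.05 entry becomes exactly 0.
```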
Pure L1 on all network weights is less common than L2/weight decay, because unstructured sparsity rarely yields speedups on dense hardware and the nonsmooth objective complicates optimization.
But L1 still matters a lot in linear models (the lasso), sparse feature selection, and model pruning/compression.
Sometimes you want both sparsity and stability; combining the two penalties is known as the elastic net:
J(w) = J_train(w) + λ₁‖w‖₁ + λ₂‖w‖₂²
L1 alone can be unstable when features are strongly correlated; L2 helps group correlated features and improves conditioning.
| Property | L1 (‖w‖₁) | L2 (‖w‖₂²) |
|---|---|---|
| Pull toward 0 | constant magnitude (λ) | proportional to size (2λw) |
| Exact zeros? | yes, often | rarely |
| Optimization smoothness | nonsmooth at 0 | smooth |
| Typical effect | sparsity / feature selection | shrinkage / stability |
| Common in deep nets | less common (unstructured) | very common |
If L2 is “make everything smaller,” L1 is “make many things vanish.”
Neural networks can overfit by forming co-adaptations: a hidden unit becomes useful only because some other unit reliably provides a complementary signal. This can produce fragile internal representations.
Dropout regularizes by injecting structured noise during training: on each forward pass, every unit is independently zeroed out with some probability.
The motivation is simple: if any unit might disappear, the network must distribute information and learn redundant, robust features.
Let h be a vector of activations at some layer. Sample a mask m with independent Bernoulli entries, where p is the keep probability:
mⱼ ∼ Bernoulli(p)
Apply mask:
h̃ = (m ⊙ h) / p
Show the work:
For one coordinate:
h̃ⱼ = (mⱼ hⱼ)/p
E[h̃ⱼ] = E[mⱼ]·hⱼ/p
= p·hⱼ/p
= hⱼ
So at test time you typically do nothing special (no dropout, no scaling), because the scaling was already handled during training.
Each dropout mask corresponds to a sub-network. Training with dropout is like training a huge ensemble of thinned networks that share weights.
You do not explicitly average predictions over all masks at test time (that would be expensive). Instead, using the full network without dropout approximates that ensemble average.
This is why dropout often improves generalization even when it increases training loss: it optimizes performance across many perturbations, not one fixed computation graph.
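A minimal NumPy sketch of training-time inverted dropout (names illustrative):

```python
import numpy as np

def inverted_dropout(h, p, rng):
    """Training-time inverted dropout with keep probability p:
    zero each activation with probability 1-p, scale survivors by 1/p."""
    mask = rng.random(h.shape) < p      # independent Bernoulli(p) mask
    return (mask * h) / p

rng = np.random.default_rng(0)
h = np.array([1.0, 2.0, 3.0, 4.0])
h_tilde = inverted_dropout(h, p=0.8, rng=rng)
# Surviving entries are scaled by 1/0.8 = 1.25; dropped entries are 0.
# At test time you simply use h unchanged.
print(h_tilde)
```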
Dropout is most effective when the network has large fully-connected layers and shows a clear gap between training and validation performance.
It can be less helpful (or need careful tuning) when layers already use strong normalization, when the model is small, or in early convolutional layers.
Modern practice often uses moderate dropout rates in dense or attention sublayers rather than applying it uniformly everywhere.
Dropout usually isn’t written as loss + λ·Ω(θ) explicitly. It’s a procedural regularizer.
Still, conceptually it penalizes fragile reliance on any single unit and implicitly averages over an exponential family of sub-networks.
A useful comparison:
| Method | How it regularizes | Typical symptom it fixes |
|---|---|---|
| L2 | discourages large weights | overly sharp decision boundaries |
| L1 | enforces sparsity | too many irrelevant features |
| Dropout | disrupts co-adaptation with stochastic masks | brittle internal representations |
A common approach is to start with a small dropout rate (e.g., 0.1–0.3 in dense layers) and increase only if overfitting persists after using weight decay and data augmentation.
A practical workflow is:
1) Pick an architecture capable of fitting the task.
2) Use monitoring: training vs validation curves.
3) Add regularization to address observed gaps.
If training and validation losses are both high → underfitting: reduce regularization or increase capacity.
If training loss is low but validation loss is high → overfitting: increase regularization.
Think in terms of what kind of “simplicity” you want: small weights (L2), few active weights (L1), or representations robust to unit deletion (dropout).
Often you combine them: weight decay plus light dropout plus data augmentation is a common default.
Regularization changes gradients and thus training dynamics.
Regularization is part of a larger family of generalization controls, alongside data augmentation and early stopping.
You’ll often see these combined:
| Tool | What it controls | Notes |
|---|---|---|
| Weight decay | parameter magnitude | cheap, widely applicable |
| Dropout | internal co-adaptation | helps more in dense parts |
| Data augmentation | invariance & sample diversity | often the biggest win in vision |
| Early stopping | effective capacity via training time | requires validation monitoring |
λ is not “a little extra term.” It is a knob that sets the relative scale between fitting the data and shrinking complexity.
If the loss term is scaled (e.g., average vs sum over batch), the same numeric λ can behave differently.
So when you tune λ, do it in context: with the same optimizer, learning rate, batch size, and loss scaling you will use in the final run.
Deep learning systems are powerful partly because they can fit almost anything—so they will happily fit the wrong thing unless you apply constraints.
Understanding regularization prepares you for the rest of the deep learning toolkit, where nearly every architectural and training choice doubles as a generalization control.
This is why regularization is a core “unlock” for the broader Deep Learning node: it turns raw capacity into reliable performance.
Assume a model with parameter vector w and training objective J_train(w). We define the regularized objective:
J(w) = J_train(w) + λ‖w‖₂²
We want to derive the SGD update and interpret it as weight decay.
Start with the penalty term:
Ω(w) = ‖w‖₂² = ∑ⱼ wⱼ²
Differentiate coordinate-wise:
∂Ω/∂wⱼ = 2wⱼ
So ∇Ω(w) = 2w
Differentiate the full objective:
∇J(w) = ∇J_train(w) + λ∇Ω(w)
= ∇J_train(w) + 2λw
Write the SGD step with learning rate η:
w ← w − η(∇J_train(w) + 2λw)
Rearrange to expose decay:
w ← w − η∇J_train(w) − 2ηλw
w ← (1 − 2ηλ)w − η∇J_train(w)
Insight: The factor (1 − 2ηλ) multiplies the current weights every step, shrinking them toward 0 even if the data gradient were zero. This is why L2 is called weight decay: it continuously damps parameter magnitude, which tends to reduce variance and improve generalization.
Consider the regularized objective:
J(w) = J_train(w) + λ‖w‖₁
We’ll examine the (sub)gradient contributed by the L1 term and compare it to L2.
Write the L1 norm:
‖w‖₁ = ∑ⱼ |wⱼ|
For one coordinate wⱼ, the derivative of |wⱼ| is not defined at 0, so use a subgradient:
∂|wⱼ|/∂wⱼ =
+1 if wⱼ > 0
−1 if wⱼ < 0
any value in [−1, +1] if wⱼ = 0
Thus a valid subgradient of the full L1 norm is:
∂‖w‖₁/∂wⱼ = sign(wⱼ), where sign(0) ∈ [−1, +1]
Gradient-style update (conceptually):
w ← w − η(∇J_train(w) + λ·sign(w))
Compare with L2’s contribution (2λw):
Insight: Because L1 applies a constant-magnitude force toward 0, small weights don’t get “protected” the way they do under L2. They are driven to exactly 0, yielding sparsity (feature selection).
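The "constant pull vs proportional pull" contrast can be demonstrated by iterating only the regularization updates on a small weight (a sketch with illustrative values; the data gradient is held at zero):

```python
import numpy as np

eta, lam = 0.1, 0.5
w_l1, w_l2 = 0.2, 0.2

for _ in range(10):
    # L1 subgradient step: constant pull, clipped so it cannot overshoot 0
    step = eta * lam * np.sign(w_l1)
    w_l1 = 0.0 if abs(w_l1) <= abs(step) else w_l1 - step
    # L2 step: proportional pull, geometric decay, never exactly 0
    w_l2 = (1 - 2 * eta * lam) * w_l2

print(w_l1)   # exactly 0.0 after a few steps
print(w_l2)   # small but still nonzero
```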
Let h be a layer’s activation vector during training. We apply inverted dropout with keep probability p. We want to show E[h̃] = h.
Sample a Bernoulli mask m with independent entries:
mⱼ ∼ Bernoulli(p)
Apply inverted dropout:
h̃ = (m ⊙ h) / p
Take expectation coordinate-wise:
h̃ⱼ = (mⱼ hⱼ)/p
E[h̃ⱼ] = E[mⱼ]·hⱼ/p
Use E[mⱼ] = p for a Bernoulli(p):
E[h̃ⱼ] = p·hⱼ/p = hⱼ
Therefore E[h̃] = h
Insight: Inverted dropout preserves the mean activation during training, so at inference you can turn dropout off without additional scaling. The regularization comes from the randomness (variance), not from a shifted mean.
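The expectation identity can also be checked by simulation; a sketch with illustrative values:

```python
import numpy as np

# Empirically verify E[h_tilde] = h for inverted dropout.
rng = np.random.default_rng(42)
h, p, n = 3.0, 0.6, 200_000

masks = rng.random(n) < p        # n independent Bernoulli(p) draws
samples = (masks * h) / p        # inverted dropout applied n times

print(samples.mean())             # close to 3.0 (the randomness adds
                                  # variance, but the mean is preserved)
```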
Regularization is best thought of as optimizing a modified objective: J(θ) = J_train(θ) + λ·Ω(θ), where Ω encodes a preference for simpler solutions.
L2 regularization uses Ω(w) = ‖w‖₂² and adds gradient 2λw, producing weight decay: w ← (1 − 2ηλ)w − η∇J_train(w).
L2 typically yields many small weights (smooth shrinkage) and often improves stability and generalization in deep nets.
L1 regularization uses Ω(w) = ‖w‖₁ and has a sign-based (sub)gradient; its constant pull toward 0 tends to create exact zeros (sparsity).
Geometrically, L1 constraints have corners that encourage axis-aligned (sparse) solutions; L2 constraints are round and rarely produce exact zeros.
Dropout regularizes by randomly masking units/activations during training, reducing co-adaptation and approximating an ensemble of thinned networks.
λ is a true tradeoff parameter; its effective strength depends on loss scaling, batch size, learning rate, and optimizer choice (especially Adam vs AdamW).
Treating λ as universal: the same numeric λ can behave very differently when you change batch size, loss reduction (mean vs sum), or learning rate.
Applying dropout everywhere (especially early convolutional layers) without checking if it hurts representation learning; dropout placement matters.
Assuming “L2 regularization” equals “weight decay” under all optimizers—under Adam, decoupled weight decay (AdamW) is often the intended behavior.
Regularizing all parameters indiscriminately (e.g., including BatchNorm parameters or biases) without verifying whether that helps; sometimes it degrades training.
You are training with SGD and L2 regularization. Suppose η = 0.1 and λ = 0.01. Ignoring the data gradient (assume ∇J_train(w) = 0 for this step), what happens to w after one update? Write the multiplicative factor applied to w.
Hint: Use w ← (1 − 2ηλ)w when the data gradient is zero.
With L2: w ← w − η(2λw) = (1 − 2ηλ)w.
Compute 2ηλ = 2·0.1·0.01 = 0.002.
So w is multiplied by (1 − 0.002) = 0.998 after one step.
Explain, using the constrained-optimization geometry, why L1 regularization tends to produce sparse solutions while L2 does not. Focus on the shape of the constraint sets in 2D and where smooth loss contours typically touch them.
Hint: Compare the L1 ball (diamond) vs the L2 ball (circle) and think about corners vs smooth boundaries.
In 2D, the constraint ‖w‖₁ ≤ c is a diamond with sharp corners on the axes, while ‖w‖₂² ≤ c is a circle with a smooth boundary. A smooth loss contour (e.g., an ellipse) expanded outward from its minimum will typically first touch the feasible set at a point of tangency. Because the L1 feasible set has corners, tangency often occurs at a corner, which lies on an axis, implying one coordinate is exactly 0 (sparsity). The L2 feasible set is round, so tangency usually happens at a point with both coordinates nonzero, producing shrinkage but not exact zeros.
You apply inverted dropout to an activation h with keep probability p. Let h = 3 (a scalar activation). Compute the distribution of h̃ and verify E[h̃] = 3 for p = 0.6.
Hint: With probability p you keep the unit and scale by 1/p; otherwise it becomes 0.
Mask m ∼ Bernoulli(p) with p = 0.6. Inverted dropout gives h̃ = (m·h)/p.
So: h̃ = 3/0.6 = 5 with probability 0.6, and h̃ = 0 with probability 0.4.
Expectation:
E[h̃] = 0.6·5 + 0.4·0 = 3
So E[h̃] = 3, matching the original activation.
Next steps and related nodes:
Related concepts you may want nearby in the tech tree: