Gradient descent: iteratively moving in the negative gradient direction to minimize an objective.
You already know what a gradient is: it points uphill. Gradient descent is the art of repeatedly stepping downhill—carefully—so your objective function gets smaller and smaller until further improvement is hard or unnecessary.
Gradient descent minimizes a differentiable function f(x) by iterating xₜ₊₁ = xₜ − α ∇f(xₜ). The negative gradient gives the local direction of steepest decrease, and the step size α controls how far you move—too big can diverge, too small can crawl. You stop when progress (or gradient size) is small enough.
In optimization, you often want to choose parameters x (a vector) to minimize some objective function:
minimize f(x) over x ∈ ℝⁿ
In machine learning, f is frequently a loss function measuring how wrong your model is on data. In engineering, f might be energy, cost, or error.
Sometimes you can solve ∇f(x) = 0 exactly, but often there is no closed-form solution, or solving the system directly is computationally infeasible.
So we want an iterative method: start somewhere, improve step-by-step.
You already know that the gradient ∇f(x) points in the direction of steepest local increase.
So the direction of steepest decrease is −∇f(x).
Gradient descent turns this into an update rule:
xₜ₊₁ = xₜ − α ∇f(xₜ)
Here α > 0 is the step size (learning rate). It is a scalar that controls how far you move along the downhill direction.
Imagine f(x) as a height function over a landscape. At your current position xₜ, the gradient is the direction of steepest uphill; stepping in the negative gradient points you down the steepest slope available using only local information.
This local nature matters: the gradient only describes f near your current point, so each step is a purely local decision; it carries no information about distant valleys or the global minimum.
Gradient descent is simple and powerful, but its behavior depends on the shape of f.
Even in non-convex settings (like deep learning), it’s still widely used because gradients are cheap to compute, each step scales to very high dimensions, and the solutions it finds are often good enough in practice.
You’ll see this update everywhere:
xₜ₊₁ = xₜ − α ∇f(xₜ)
Keep a mental model: the gradient picks the direction, and α scales how far you move in that direction.
So α controls stability and speed, while the gradient magnitude reflects how steep the landscape is at xₜ.
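A minimal sketch of this loop in Python; the function name and arguments here are illustrative, not from any particular library:

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, steps=100):
    """Iterate x_{t+1} = x_t - alpha * grad(x_t) for a fixed number of steps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose derivative is f'(x) = 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0, alpha=0.1, steps=100)
```

With α = 0.1 the error contracts by a factor of 0.8 per step, so after 100 steps the iterate sits essentially at the minimizer x = 3.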
At x, there are infinitely many directions you could move. We want the one that decreases f the fastest for a tiny step.
So ask: if we take a small step ε in a unit direction u (‖u‖ = 1), how does f change?
For differentiable f, the first-order approximation (Taylor expansion) is:
f(x + εu) ≈ f(x) + ε ∇f(x) · u
So the rate of change in direction u is:
D₍u₎ f(x) = ∇f(x) · u
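This formula is easy to sanity-check numerically. Below, a sample quadratic plays the role of f, and a finite difference approximates the rate of change in direction u:

```python
import numpy as np

f = lambda x: x[0]**2 + 4 * x[1]**2           # a sample differentiable function
grad = lambda x: np.array([2 * x[0], 8 * x[1]])

x = np.array([1.0, 0.5])
u = np.array([3.0, 4.0]) / 5.0                # a unit direction, ||u|| = 1
eps = 1e-6

finite_diff = (f(x + eps * u) - f(x)) / eps   # numerical rate of change along u
exact = grad(x) @ u                           # directional derivative grad(x) . u
```

The two values agree up to first-order truncation error of size O(eps).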
We want the direction u that minimizes this quantity (most negative), under the constraint ‖u‖ = 1.
We want:
minimize over u: ∇f(x) · u
subject to: ‖u‖ = 1
Let g = ∇f(x). Then the objective is g · u.
By Cauchy–Schwarz, g · u ≥ −‖g‖ ‖u‖.
Since ‖u‖ = 1, we get:
g · u ≥ −‖g‖
Equality happens when u points exactly opposite to g:
u = −g / ‖g‖
So the steepest-descent direction at x is −∇f(x) (normalized: −∇f(x) / ‖∇f(x)‖).
Gradient descent chooses this direction and then decides how far to move with α.
If you rewrote the update using a unit direction:
Let uₜ = −∇f(xₜ) / ‖∇f(xₜ)‖
Then:
xₜ₊₁ = xₜ + (α ‖∇f(xₜ)‖) uₜ
So the step length is α ‖∇f(xₜ)‖.
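Numerically, the two ways of writing the step give the same point; the quadratic gradient below is just an illustrative choice:

```python
import numpy as np

grad = lambda x: np.array([2 * x[0], 8 * x[1]])  # gradient of x1^2 + 4*x2^2

x = np.array([2.0, 1.0])
alpha = 0.2
g = grad(x)

plain_step = x - alpha * g                       # x_{t+1} = x_t - alpha * grad
u = -g / np.linalg.norm(g)                       # unit downhill direction
unit_step = x + (alpha * np.linalg.norm(g)) * u  # same point, step length alpha*||g||
```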
This explains an important behavior: with fixed α, steps are long where the landscape is steep and automatically shrink as the gradient vanishes near a stationary point.
The “steepest decrease” statement is local and first-order.
That’s exactly why the next ingredient—step size α—matters so much.
The update rule has a single explicit hyperparameter:
xₜ₊₁ = xₜ − α ∇f(xₜ)
But α does multiple jobs: it sets the step length, trades off speed against stability, and must respect the curvature of f.
If gradient descent feels like “just walk downhill,” α is how big your stride is.
Consider 1D f(x). The update is:
xₜ₊₁ = xₜ − α f′(xₜ)
If f is a simple bowl (like a quadratic), then a small α converges smoothly, a somewhat larger α overshoots and oscillates, and an α past a critical threshold diverges.
This isn’t a subtle effect—it’s often the difference between learning and exploding.
Quadratics are the “physics sandbox” of optimization. Let:
f(x) = 1/2 xᵀ A x (assume A is symmetric positive definite)
Then:
∇f(x) = A x
Gradient descent becomes:
xₜ₊₁ = xₜ − α A xₜ
= (I − α A) xₜ
So the behavior depends on the matrix (I − α A). If its eigenvalues are within (−1, 1), iterates shrink toward 0.
Let λᵢ be eigenvalues of A (all positive). Eigenvalues of (I − α A) are (1 − α λᵢ). We need:
|1 − α λᵢ| < 1 for all i
Solve it:
−1 < 1 − α λᵢ < 1
Left inequality:
−1 < 1 − α λᵢ
α λᵢ < 2
α < 2/λᵢ
Right inequality:
1 − α λᵢ < 1
−α λᵢ < 0
α > 0
So the condition for all i is:
0 < α < 2/λₘₐₓ
This is a clean statement: the stable range of fixed step sizes is determined entirely by the largest eigenvalue, i.e., the steepest curvature direction.
If λₘₐₓ is huge (steep direction), α must be small to avoid instability.
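The 2/λₘₐₓ threshold is easy to see numerically. The matrix below is a small illustrative example with eigenvalues 1 and 10, so the threshold is 0.2:

```python
import numpy as np

A = np.diag([1.0, 10.0])         # eigenvalues 1 and 10, so 2/lambda_max = 0.2

def final_norm(alpha, steps=50):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - alpha * (A @ x)  # gradient of (1/2) x^T A x is A x
    return np.linalg.norm(x)

shrinks = final_norm(alpha=0.15)   # below the threshold: contracts toward 0
blows_up = final_norm(alpha=0.25)  # above the threshold: diverges
```

With α = 0.25 the steep coordinate is multiplied by 1 − 0.25·10 = −1.5 each step, so its magnitude grows geometrically.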
Even if you pick α stable, convergence can still be slow if A is ill-conditioned.
Define condition number:
κ = λₘₐₓ / λₘᵢₙ
When κ is large, the bowl is stretched: steep in some directions, flat in others. Then any stable α is capped by the steep directions, so progress along the flat directions is painfully slow.
So you may see “zig-zagging” down a narrow valley.
This is a major motivation for preconditioning, momentum, adaptive step sizes, and second-order methods.
There are a few common philosophies:
| Approach | What you choose | Pros | Cons |
|---|---|---|---|
| Fixed learning rate | constant α | simple, cheap | must tune carefully; can stall or diverge |
| Step size schedule | αₜ decreases over time | can be stable + accurate | schedule design matters |
| Line search | choose α each step to reduce f | less tuning; robust | extra computation per step |
In many ML settings, you’ll see schedules like step decay, 1/t decay, or cosine annealing.
But the core concept remains: α is a stability-speed trade-off.
Gradient descent is iterative, so you must decide when to stop. Common criteria, each with its own rationale:
- ‖∇f(xₜ)‖ is small: you are near a stationary point, so further steps barely move you.
- ‖xₜ₊₁ − xₜ‖ is small: the iterates have effectively stopped changing.
- f(xₜ) − f(xₜ₊₁) is small: the objective has stopped improving.
- An iteration budget is reached: compute time is bounded.
In ML, you’ll also stop based on validation performance (generalization), not just training loss.
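A sketch of a loop combining a gradient-norm test with an iteration budget (the function name and defaults are illustrative):

```python
import numpy as np

def gd_until_converged(grad, x0, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Stop when the gradient norm is small, or when the budget runs out."""
    x = np.asarray(x0, dtype=float)
    for t in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # near a stationary point
            return x, t
        x = x - alpha * g
    return x, max_iters

# f(x) = (x1 - 1)^2 + (x2 + 2)^2 has its minimum at [1, -2].
grad = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] + 2)])
x, iters = gd_until_converged(grad, [3.0, 0.0])
```

Here the gradient norm shrinks by a factor of 0.8 per step, so the tolerance is reached long before the iteration budget.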
Many models are trained by minimizing an average loss:
f(w) = (1/n) ∑ᵢ₌₁ⁿ ℓᵢ(w)
Here w are model parameters (weights), and ℓᵢ measures error on example i.
Two key reasons gradient descent fits ML so well:
1) Modularity: If you can compute ∇f, you can train.
2) Scalability: Each step costs about “one gradient computation,” which is manageable even in large dimensions.
For linear regression, logistic regression, neural nets—this pattern repeats.
Plain gradient descent (often called batch GD in ML) computes the full gradient over all data:
∇f(w) = (1/n) ∑ᵢ ∇ℓᵢ(w)
This can be expensive when n is large.
SGD (and mini-batch SGD) approximates the gradient using a subset of examples. Conceptually, each step uses a cheap, noisy gradient estimate whose average over random subsets is the true gradient.
You’ll unlock SGD soon; for now, notice what remains the same: the update is still “current point minus step size times a (estimated) gradient.”
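As a preview, here is a toy single-example sketch on a small least-squares problem; all data, names, and settings below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # toy inputs
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                           # noiseless toy targets

loss = lambda w: 0.5 * np.mean((X @ w - y) ** 2)

w = np.zeros(3)
alpha = 0.05
start_loss = loss(w)
for _ in range(1000):
    i = rng.integers(len(X))             # one random example
    g_i = (X[i] @ w - y[i]) * X[i]       # gradient of that example's loss
    w = w - alpha * g_i                  # same shape of update: w <- w - alpha * estimate
```

Each step is a cheap, noisy estimate of the full gradient, yet the average loss still falls.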
Logistic regression minimizes cross-entropy loss. Even if you don’t have the full formula memorized, the training loop is the same: compute predictions, evaluate the gradient of the loss with respect to the weights, take a step downhill, repeat.
This is why gradient descent is often taught before specific ML models: it’s the shared engine underneath.
In supervised learning, you often hear that “training” a model means minimizing a loss over its parameters. Operationally, it means running an iterative optimizer (usually a gradient method) until the loss is low enough. Gradient descent gives you a clean mental map: the model and data define the landscape f(w), and the optimizer walks downhill on it.
Even without fancy second-order methods, you can improve GD performance by shaping the landscape: scaling and normalizing features to reduce the condition number, or adding regularization to make the objective better behaved.
For example, with L2 regularization:
f(w) = data_loss(w) + (λ/2)‖w‖²
Then:
∇f(w) = ∇data_loss(w) + λ w
So gradient descent update becomes:
wₜ₊₁ = wₜ − α (∇data_loss(wₜ) + λ wₜ)
= (1 − αλ)wₜ − α ∇data_loss(wₜ)
This shows a neat interpretation: L2 regularization acts as weight decay, shrinking wₜ by the factor (1 − αλ) each step before applying the data-loss gradient.
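As a quick numerical check, the two forms of this update agree exactly (the data gradient below is a made-up stand-in):

```python
import numpy as np

w = np.array([1.0, -2.0, 3.0])
alpha, lam = 0.1, 0.5
data_grad = np.array([0.3, -0.1, 0.2])   # hypothetical grad of the data loss at w

direct = w - alpha * (data_grad + lam * w)            # update as written
decayed = (1 - alpha * lam) * w - alpha * data_grad   # shrink weights, then step
```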
Gradient descent is foundational, but not always the most efficient choice.
A rough comparison:
| Method | Uses | Typical strength | Typical weakness |
|---|---|---|---|
| Gradient descent | ∇f | simple, general | can be slow on ill-conditioned problems |
| Conjugate gradients | structure of quadratics | very fast for quadratic/linear systems | less direct for non-quadratic, needs careful usage |
| Newton / quasi-Newton | ∇f and (approx) Hessian | fast local convergence | expensive per step in high dimensions |
| SGD / mini-batch | noisy ∇f | scales to huge data | requires more tuning; noisy convergence |
Learning gradient descent well makes all of these easier, because most are variations on “choose direction + choose step size.”
Minimize f(x) = (x − 3)². Start at x₀ = 0. Compare α = 0.1 vs α = 1.1 for a few steps.
Compute derivative:
f(x) = (x − 3)²
f′(x) = 2(x − 3)
Gradient descent update in 1D:
xₜ₊₁ = xₜ − α f′(xₜ)
= xₜ − α · 2(xₜ − 3)
= xₜ − 2α xₜ + 6α
= (1 − 2α) xₜ + 6α
Case A: α = 0.1
Update rule: xₜ₊₁ = (1 − 0.2) xₜ + 0.6 = 0.8 xₜ + 0.6
Compute iterates:
x₀ = 0
x₁ = 0.8·0 + 0.6 = 0.6
x₂ = 0.8·0.6 + 0.6 = 1.08
x₃ = 0.8·1.08 + 0.6 = 1.464
x₄ = 0.8·1.464 + 0.6 = 1.7712
(It moves steadily toward 3.)
Case B: α = 1.1
Update rule: xₜ₊₁ = (1 − 2.2) xₜ + 6.6 = (−1.2) xₜ + 6.6
Compute iterates:
x₀ = 0
x₁ = −1.2·0 + 6.6 = 6.6
x₂ = −1.2·6.6 + 6.6 = −1.32
x₃ = −1.2·(−1.32) + 6.6 = 8.184
x₄ = −1.2·8.184 + 6.6 = −3.2208
(It oscillates wildly and grows in magnitude.)
Interpret stability from the coefficient (1 − 2α):
For convergence we need |1 − 2α| < 1
Solve:
−1 < 1 − 2α < 1
Left: −1 < 1 − 2α ⇒ 2α < 2 ⇒ α < 1
Right: 1 − 2α < 1 ⇒ α > 0
So 0 < α < 1 is stable here.
α = 1.1 violates it, so divergence makes sense.
Insight: Even in the friendliest possible objective (a perfect parabola), α determines whether you converge smoothly, oscillate, or blow up. Stability is not optional tuning—it is part of the algorithm.
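The two trajectories above can be reproduced in a few lines:

```python
def iterate_1d(alpha, x0=0.0, steps=4):
    """Run x <- x - alpha * f'(x) for f(x) = (x - 3)^2."""
    x = x0
    traj = [x]
    for _ in range(steps):
        x = x - alpha * 2 * (x - 3)   # f'(x) = 2(x - 3)
        traj.append(x)
    return traj

stable = iterate_1d(0.1)    # heads steadily toward 3
unstable = iterate_1d(1.1)  # oscillates with growing magnitude
```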
Minimize f(x) = x₁² + 4x₂². Start at x₀ = [2, 1]ᵀ. Use α = 0.2. Compute x₁ and show that f decreases.
Write x = [x₁, x₂]ᵀ. Compute the gradient:
f(x) = x₁² + 4x₂²
∂f/∂x₁ = 2x₁
∂f/∂x₂ = 8x₂
So:
∇f(x) = [2x₁, 8x₂]ᵀ
Evaluate at x₀ = [2, 1]ᵀ:
∇f(x₀) = [2·2, 8·1]ᵀ = [4, 8]ᵀ
Apply gradient descent update:
x₁ = x₀ − α ∇f(x₀)
= [2, 1]ᵀ − 0.2[4, 8]ᵀ
= [2 − 0.8, 1 − 1.6]ᵀ
= [1.2, −0.6]ᵀ
Check objective value decreases:
f(x₀) = 2² + 4·1² = 4 + 4 = 8
f(x₁) = 1.2² + 4·(−0.6)²
= 1.44 + 4·0.36
= 1.44 + 1.44
= 2.88
So f decreased from 8 to 2.88 in one step.
Geometric note:
The curvature in x₂ direction is larger (coefficient 4), so the gradient component in x₂ is scaled more (8x₂ vs 2x₁). This creates the typical behavior: stronger pull (and stricter stability) in steep directions.
Insight: Gradients encode anisotropic steepness: directions with higher curvature produce larger gradient components, which affects both progress and the maximum stable α.
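The same computation in code:

```python
import numpy as np

f = lambda x: x[0]**2 + 4 * x[1]**2
grad = lambda x: np.array([2 * x[0], 8 * x[1]])

x0 = np.array([2.0, 1.0])
x1 = x0 - 0.2 * grad(x0)   # one gradient descent step with alpha = 0.2
```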
Let f(x) = 1/2 xᵀ A x with A symmetric positive definite. Show ∇f(x) = A x, derive xₜ₊₁ = (I − αA)xₜ, and state the stability condition in terms of λₘₐₓ.
Compute gradient (known result for symmetric A):
f(x) = 1/2 xᵀ A x
∇f(x) = A x
Gradient descent update:
xₜ₊₁ = xₜ − α ∇f(xₜ)
= xₜ − α A xₜ
= (I − α A) xₜ
Diagonalize A:
Because A is symmetric, A = QΛQᵀ with orthonormal Q, diagonal Λ with entries λᵢ > 0.
Transform coordinates zₜ = Qᵀ xₜ:
xₜ = Q zₜ
xₜ₊₁ = (I − α QΛQᵀ) Q zₜ
= Q(I − αΛ)zₜ
So:
zₜ₊₁ = (I − αΛ)zₜ
Coordinate-wise behavior:
zᵢ,ₜ₊₁ = (1 − αλᵢ) zᵢ,ₜ
For convergence we need |1 − αλᵢ| < 1 for all i.
Solve for α:
|1 − αλᵢ| < 1
⇒ −1 < 1 − αλᵢ < 1
⇒ 0 < α < 2/λᵢ
To satisfy all i, require:
0 < α < 2/λₘₐₓ
Insight: The largest eigenvalue λₘₐₓ (steepest curvature direction) sets the global stability limit for a fixed learning rate on a quadratic.
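The eigenvalue condition can be checked numerically on a symmetric positive definite matrix (constructed at random here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(3, 3))
A = M @ M.T + np.eye(3)                 # symmetric positive definite
lam_max = np.linalg.eigvalsh(A).max()

alpha = 1.9 / lam_max                   # just inside 0 < alpha < 2/lambda_max
radius = np.abs(np.linalg.eigvals(np.eye(3) - alpha * A)).max()
```

The spectral radius of I − αA is max |1 − αλᵢ|; with α inside the stable range it stays below 1, so the iterates contract.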
Gradient descent iterates xₜ₊₁ = xₜ − α ∇f(xₜ) to minimize a differentiable objective f(x).
The negative gradient −∇f(x) is the direction of steepest local decrease because it minimizes the directional derivative under ‖u‖ = 1.
The learning rate α controls step length; too small makes progress slow, too large can cause oscillation or divergence.
For quadratics f(x) = 1/2 xᵀAx, stability with fixed α requires 0 < α < 2/λₘₐₓ.
Ill-conditioning (large κ = λₘₐₓ/λₘᵢₙ) explains zig-zagging and slow convergence even when α is stable.
Stopping criteria are typically based on small gradient norm, small parameter changes, small objective improvement, or iteration budget.
Gradient descent is the core training engine behind many ML models; SGD is a scalable noisy variant of the same principle.
Using the wrong sign: updating x ← x + α∇f(x) performs gradient ascent, increasing the objective.
Choosing α without checking stability: a learning rate that is slightly too large can make loss explode or oscillate indefinitely.
Assuming GD always finds the global minimum: for non-convex objectives, it may converge to local minima or saddle points.
Stopping solely because iterations are done, without checking whether ‖∇f(xₜ)‖ or the loss improvement has actually become small.
Let f(x) = x². Start at x₀ = 5.
1) Write the gradient descent update.
2) For α = 0.4 compute x₁, x₂, x₃.
3) For which α does this 1D quadratic converge?
Hint: Compute f′(x) and simplify xₜ₊₁ in terms of xₜ. Convergence requires |multiplier| < 1.
1) f′(x) = 2x, so xₜ₊₁ = xₜ − α·2xₜ = (1 − 2α)xₜ.
2) With α = 0.4, multiplier = 1 − 0.8 = 0.2:
x₁ = 0.2·5 = 1
x₂ = 0.2·1 = 0.2
x₃ = 0.2·0.2 = 0.04
3) Need |1 − 2α| < 1 ⇒ 0 < α < 1.
Minimize f(x) = (x₁ − 1)² + (x₂ + 2)².
Start at x₀ = [3, 0]ᵀ and use α = 0.5.
Compute ∇f(x), then compute x₁. What is f(x₀) and f(x₁)?
Hint: Differentiate each squared term: ∂/∂x (x − c)² = 2(x − c).
∇f(x) = [2(x₁ − 1), 2(x₂ + 2)]ᵀ.
At x₀ = [3,0]ᵀ:
∇f(x₀) = [2(3−1), 2(0+2)]ᵀ = [4, 4]ᵀ.
Update:
x₁ = x₀ − 0.5[4,4]ᵀ = [3−2, 0−2]ᵀ = [1, −2]ᵀ.
Objective values:
f(x₀) = (3−1)² + (0+2)² = 4 + 4 = 8.
f(x₁) = (1−1)² + (−2+2)² = 0 + 0 = 0.
Consider f(x) = 1/2 xᵀAx where A has eigenvalues {1, 10}.
1) Give a range of α that guarantees convergence for fixed-step gradient descent.
2) Explain (in one or two sentences) why convergence can still feel slow when eigenvalues are very different.
Hint: Use the condition 0 < α < 2/λₘₐₓ. For slowness, think about the condition number κ = λₘₐₓ/λₘᵢₙ and zig-zagging.
1) λₘₐₓ = 10, so 0 < α < 2/10 = 0.2 guarantees convergence.
2) When eigenvalues differ greatly (κ large), α must be small enough for the steep direction (λ=10), which makes progress along the flat direction (λ=1) comparatively slow; iterates can zig-zag in a narrow valley, reducing efficiency.