Newton's method, BFGS, L-BFGS. Using Hessian information.
Gradient methods know which way is downhill. Second-order methods also know how the ground curves—so they can choose steps that are both directional and properly scaled.
Second-order optimization uses curvature information (the Hessian H or an approximation to H⁻¹) to build a local quadratic model and compute a step that is often far more scale-aware than plain gradient descent. Newton’s method uses H directly (solve H p = −∇f). Quasi-Newton methods (BFGS, L-BFGS) avoid forming H by updating an inverse-Hessian approximation B using gradient differences, giving many of Newton’s benefits at much lower cost.
First-order methods (gradient descent, momentum, Adam) use ∇f(x) to pick a direction of improvement. But they do not know how the objective bends. If the landscape is stretched—steep in one direction and flat in another—then a single global learning rate is a compromise: small enough not to explode in steep directions, but then painfully slow in flat directions.
Second-order methods solve this by using curvature: they estimate how f changes for a whole step vector p, not just for an infinitesimal move. This usually yields steps that are automatically scaled per-direction. In well-behaved problems, this can turn long, zig-zagging trajectories into short, decisive moves.
Second-order methods are built on the second-order Taylor approximation around a point x:
f(x + p) ≈ f(x) + ∇f(x)ᵀ p + ½ pᵀ H(x) p
This approximation says: “near x, the function looks like a quadratic bowl (or saddle).” If we trust that local quadratic, we can choose p by minimizing the quadratic model.
Define the quadratic model:
m(p) = f(x) + ∇f(x)ᵀ p + ½ pᵀ H(x) p
To minimize m(p), differentiate with respect to p and set to zero:
∇ₚ m(p) = ∇f(x) + H(x) p = 0
So the step solves:
H(x) p = −∇f(x)
If H is invertible:
p = −H(x)⁻¹ ∇f(x)
This is the Newton step. It is not “just a better direction”; it is a direction and a scale determined by curvature.
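As a minimal numpy sketch of this step (the quadratic test problem is illustrative, not from the text):

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton step: solve H p = -g; never form H^{-1} explicitly."""
    g = grad(x)
    p = np.linalg.solve(hess(x), -g)
    return x + p

# Quadratic test: f(x) = 1/2 x^T A x - b^T x, whose minimizer solves A x = b.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x1 = newton_step(lambda x: A @ x - b, lambda x: A, np.zeros(2))
# For an SPD quadratic, one Newton step from anywhere lands on the minimizer.
```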
Newton can converge extremely fast near a minimizer—often quadratically: the number of correct digits can roughly double each iteration in ideal settings. That’s why Newton-like methods are central in classical optimization.
But Newton also has costs and pitfalls:
- Forming H requires O(n²) storage, and a dense solve of H p = −∇f typically costs O(n³).
- Away from a minimizer, H can be indefinite, so the Newton step may not be a descent direction.
- The full step can overshoot whenever the quadratic model is only accurate near x.
Quasi-Newton methods (BFGS, L-BFGS) keep the good idea—use curvature to scale steps—while avoiding explicit Hessians.
In practice, many algorithms compute the step as:
p = −B ∇f(x)
with B updated cheaply from gradient information.
If you accept the local quadratic approximation as your model, then the Newton step is simply the optimizer of that model. There’s no extra heuristic—Newton is what happens when you take the quadratic seriously.
Start from the second-order expansion at x:
f(x + p) ≈ f(x) + ∇f(x)ᵀ p + ½ pᵀ H(x) p
Let g = ∇f(x) and H = H(x) for short. Then:
m(p) = f(x) + gᵀ p + ½ pᵀ H p
Take the gradient with respect to p:
∇ₚ m(p) = g + H p
Set to zero at optimum:
g + H p = 0
Solve:
H p = −g
So the update is:
xₖ₊₁ = xₖ + p
Gradient descent uses p = −α g. Newton uses p = −H⁻¹ g.
You can interpret H⁻¹ as a preconditioner: it rescales the gradient by curvature. If the function is steep in one direction, H has large eigenvalues there, and H⁻¹ shrinks the step along that direction. If it is flat, H⁻¹ expands the step.
More explicitly, if H = Q Λ Qᵀ (eigendecomposition with orthonormal Q), then:
H⁻¹ = Q Λ⁻¹ Qᵀ
and
p = −Q Λ⁻¹ Qᵀ g
So each eigen-direction is scaled by 1/λᵢ.
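A tiny numpy demonstration of this per-eigendirection scaling (the matrix is illustrative):

```python
import numpy as np

# A diagonal H makes the eigenvalues explicit: 100 (steep) and 1 (flat).
H = np.diag([100.0, 1.0])
g = np.array([100.0, 1.0])

p = -np.linalg.solve(H, g)   # Newton step: each component is scaled by 1/lambda_i
print(p)  # [-1. -1.]: the steep direction is shrunk 100x, the flat one untouched
```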
A direction p is a descent direction if gᵀ p < 0.
For Newton:
p = −H⁻¹ g
Compute gᵀ p:
gᵀ p = gᵀ (−H⁻¹ g) = −gᵀ H⁻¹ g
If H is symmetric positive definite (SPD), then H⁻¹ is SPD and gᵀ H⁻¹ g > 0 for any nonzero g. Therefore gᵀ p < 0: Newton is a descent direction.
If H is indefinite (has negative eigenvalues), then gᵀ H⁻¹ g can be negative, and the Newton step can ascend.
Even with SPD H, the quadratic model may not be accurate far from x. A standard fix is to take:
xₖ₊₁ = xₖ + αₖ pₖ
where pₖ solves Hₖ pₖ = −gₖ, and αₖ is found via line search (e.g., Armijo/Wolfe conditions). This is often called damped Newton.
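A minimal damped-Newton sketch with Armijo backtracking (the smooth strictly convex test function is mine, chosen so the Hessian is always SPD):

```python
import numpy as np

def damped_newton(f, grad, hess, x, c1=1e-4, tol=1e-10, max_iter=50):
    """Newton direction plus Armijo backtracking line search."""
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(hess(x), -g)
        alpha = 1.0
        # Backtrack until the sufficient-decrease condition holds.
        while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
            alpha *= 0.5
        x = x + alpha * p
    return x

# Smooth strictly convex test: f(x) = exp(x1 + x2) + x1^2 + x2^2.
f = lambda x: np.exp(x[0] + x[1]) + x[0]**2 + x[1]**2
grad = lambda x: np.exp(x[0] + x[1]) * np.ones(2) + 2 * x
hess = lambda x: np.exp(x[0] + x[1]) * np.ones((2, 2)) + 2 * np.eye(2)

x_star = damped_newton(f, grad, hess, np.zeros(2))
```

Because the Hessian here is SPD everywhere, every Newton direction is a descent direction and the backtracking loop always terminates.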
Another fix is to trust the quadratic model only within a region ‖p‖ ≤ Δ:
minimize m(p) subject to ‖p‖ ≤ Δ
This changes the step when Newton would take something too large or unstable. Trust-region Newton methods are especially robust when H is indefinite.
You never compute H⁻¹ explicitly. You solve the linear system:
H p = −g
using:
- a Cholesky factorization when H is SPD,
- a symmetric indefinite or modified factorization otherwise, or
- an iterative solver such as conjugate gradient when only Hessian-vector products are available.
The solve is the “real cost” of Newton.
| Aspect | Gradient descent | Newton’s method |
|---|---|---|
| Uses | ∇f | ∇f and H |
| Step | p = −α ∇f | Solve H p = −∇f |
| Sensitivity to scaling | High | Low (often scale-invariant under affine transforms) |
| Per-iteration cost | Low | High |
| Convergence near optimum | Linear (typical) | Quadratic (ideal conditions) |
| Failure modes | Slow in ill-conditioned valleys | Non-descent if H indefinite; expensive |
In large-scale ML (millions of parameters), full Newton is typically too expensive—this is where quasi-Newton, L-BFGS, and approximations enter.
Newton’s method asks for H and a linear solve. Quasi-Newton methods try to approximate the effect of H⁻¹ without ever forming H.
The guiding question is:
Can we learn a good linear map B that turns gradients into near-Newton steps?
That is, we want:
p = −B g
with B ≈ H⁻¹.
Let xₖ₊₁ = xₖ + sₖ where sₖ = xₖ₊₁ − xₖ.
Let gₖ = ∇f(xₖ) and gₖ₊₁ = ∇f(xₖ₊₁). Define:
yₖ = gₖ₊₁ − gₖ
For twice-differentiable f, for small steps:
yₖ ≈ Hₖ sₖ
A quasi-Newton method chooses an approximation (either to H or to H⁻¹) that satisfies the secant equation:
H̃ₖ₊₁ sₖ = yₖ
or in inverse form:
Bₖ₊₁ yₖ = sₖ
This is the multi-dimensional analogue of matching slopes in 1D.
BFGS maintains Bₖ ≈ Hₖ⁻¹ and updates it using s and y.
Define ρ = 1 / (yᵀ s).
Assuming yᵀ s > 0 (curvature condition), the BFGS inverse update is:
Bₖ₊₁ = (I − ρ s yᵀ) Bₖ (I − ρ y sᵀ) + ρ s sᵀ
Key properties:
- The update is rank-two and uses only the pair (s, y).
- Bₖ₊₁ is symmetric whenever Bₖ is.
- Bₖ₊₁ satisfies the secant condition Bₖ₊₁ y = s by construction.
- If Bₖ is SPD and yᵀ s > 0, then Bₖ₊₁ remains SPD, so p = −B g stays a descent direction.
B is a learned preconditioner.
BFGS accumulates this information over iterations, gradually sculpting B to match the inverse curvature.
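The update can be implemented directly; a small numpy sketch (the function name is mine):

```python
import numpy as np

def bfgs_inverse_update(B, s, y):
    """One BFGS update of the inverse-Hessian approximation B.

    Assumes the curvature condition y @ s > 0 holds.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ B @ V.T + rho * np.outer(s, s)

# Starting from B = I with s = [1, 0] and y = [2, 0]:
B1 = bfgs_inverse_update(np.eye(2), np.array([1.0, 0.0]), np.array([2.0, 0.0]))
print(B1)  # [[0.5, 0.], [0., 1.]] -- inverse curvature 1/2 learned along coordinate 1
```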
The condition yᵀ s > 0 ensures positive curvature along s. With line searches satisfying the (strong) Wolfe conditions, one can guarantee yᵀ s > 0 for smooth functions.
If yᵀ s ≤ 0, BFGS may lose SPD and produce non-descent steps. Practical implementations may:
- skip the update for that iteration,
- damp the update so the curvature condition holds (Powell damping), or
- use a Wolfe line search, which guarantees yᵀ s > 0 for smooth functions.
B is n×n. Storing and updating it costs O(n²) memory and O(n²) time per iteration—often too expensive for large n.
L-BFGS avoids storing B. Instead, it stores only the last m pairs {(sᵢ, yᵢ)} for i = k−m+1,…,k.
To compute p = −Bₖ gₖ, L-BFGS uses the two-loop recursion, which applies the implicit Bₖ to a vector in O(mn) time and O(mn) memory.
Given current gradient g, set q = g.
Loop backwards over stored pairs (newest to oldest): for each i, compute αᵢ = ρᵢ sᵢᵀ q, then set q ← q − αᵢ yᵢ.
Apply an initial scaling H₀ (often γ I) to get r = H₀ q.
Loop forwards (oldest to newest): for each i, compute βᵢ = ρᵢ yᵢᵀ r, then set r ← r + sᵢ (αᵢ − βᵢ).
Then the search direction is:
p = −r
Even though no n×n matrix is stored, the resulting p behaves like a quasi-Newton direction.
A common choice is:
γₖ = (sₖ₋₁ᵀ yₖ₋₁) / (yₖ₋₁ᵀ yₖ₋₁)
and H₀ = γₖ I.
This roughly matches average curvature magnitude.
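The recursion above can be sketched directly in numpy (the function name is mine):

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Two-loop recursion: return p = -B_k g without ever forming B_k."""
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    q = g.copy()
    alphas = []
    # Backward pass, newest pair first.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    # Initial scaling H0 = gamma * I from the most recent pair.
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    r = gamma * q
    # Forward pass, oldest pair first.
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r = r + s * (a - b)
    return -r

# With one stored pair, the implicit B satisfies the secant condition B y = s,
# so feeding g = y must return p = -s.
p = lbfgs_direction(np.array([2.0, 0.0]),
                    [np.array([1.0, 0.0])],
                    [np.array([2.0, 0.0])])
print(p)  # [-1. -0.]
```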
| Method | Uses H explicitly? | Stores n×n matrix? | Typical per-iter cost | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Newton | Yes | Often yes (or factorization) | High (solve) | Fast local convergence, accurate steps | Expensive; needs robust globalization |
| BFGS | No | Yes (B) | O(n²) | Very effective for smooth medium-scale problems | Memory/time blow up for large n |
| L-BFGS | No | No (stores m pairs) | O(mn) | Scales to large n; strong practical default | Needs a line search; sensitive to noisy gradients |
Quasi-Newton methods typically assume deterministic or low-noise gradients. In deep learning with mini-batch noise, L-BFGS can still be used but often requires care (larger batches, stronger line search heuristics, or stochastic quasi-Newton variants).
The conceptual win remains: approximate H⁻¹ well enough to get curvature-aware steps without paying Newton’s full price.
Second-order methods are fundamentally local: they rely on the Taylor approximation being accurate near the current iterate. Far from the solution, that approximation can be wrong in two ways:
- the curvature can mislead (H indefinite or rapidly changing), so the model may have no minimizer or may point the step uphill;
- the full step can be much larger than the region where the quadratic is trustworthy, causing overshoot or divergence.
Globalization strategies are the standard tools that make Newton-like methods robust in practice.
Given a direction p (Newton, BFGS, L-BFGS), pick α to ensure sufficient decrease:
xₖ₊₁ = xₖ + α p
A common sufficient decrease condition (Armijo) is:
f(x + α p) ≤ f(x) + c₁ α ∇f(x)ᵀ p
with 0 < c₁ < 1.
To also ensure curvature information behaves well (especially for BFGS), enforce Wolfe (or strong Wolfe) conditions, which include:
∇f(x + α p)ᵀ p ≥ c₂ ∇f(x)ᵀ p
with c₁ < c₂ < 1.
These conditions help guarantee yᵀ s > 0, preserving SPD updates in BFGS/L-BFGS.
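A sketch of checking both conditions for a candidate step length (the function name and toy problem are mine):

```python
import numpy as np

def satisfies_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the (weak) Wolfe conditions for step length alpha along p."""
    g = grad(x)
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * (g @ p)
    curvature = grad(x + alpha * p) @ p >= c2 * (g @ p)
    return bool(armijo and curvature)

# Toy 1D problem f(x) = x^2 / 2, descent direction p = -1 from x = 1.
f = lambda x: 0.5 * float(x @ x)
grad = lambda x: x
x, p = np.array([1.0]), np.array([-1.0])
ok_full = satisfies_wolfe(f, grad, x, p, alpha=1.0)   # True: full step accepted
ok_tiny = satisfies_wolfe(f, grad, x, p, alpha=1e-6)  # False: curvature condition rejects a tiny step
```

Note how the curvature condition does real work here: the Armijo condition alone would accept arbitrarily small steps.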
Instead of searching along a line, trust region methods solve:
minimize gᵀ p + ½ pᵀ H p
subject to ‖p‖ ≤ Δ
If the Newton step lies inside the region, take it. If not, take a boundary step that balances descent and stability.
In nonconvex problems, trust regions naturally handle indefinite H by restricting step size and allowing directions of negative curvature to be exploited safely.
When H is not SPD, you may:
- add damping, solving (H + λI) p = −g with λ > 0 chosen large enough that H + λI is positive definite;
- use a modified factorization that perturbs the problematic eigenvalues; or
- fall back to a trust-region subproblem, which handles indefiniteness directly.
Damping connects to the idea that you want a model that is convex enough locally to provide a descent direction.
Even if you cannot store H, you may be able to compute H v efficiently (automatic differentiation can do this). Then you can approximately solve:
H p = −g
using conjugate gradient (CG), which needs only matrix-vector products.
This yields Newton-CG (also called truncated Newton). It can be seen as a spectrum: a single CG iteration gives a steepest-descent-like step, running CG to convergence recovers the exact Newton step, and truncating in between trades per-iteration cost against step quality.
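A sketch of the CG inner solve that touches H only through matrix-vector products (the function name and test matrix are mine):

```python
import numpy as np

def newton_cg_step(hvp, g, max_iter=50, tol=1e-10):
    """Approximately solve H p = -g with conjugate gradient,
    using H only through products hvp(v) = H v."""
    p = np.zeros_like(g)
    r = -g.copy()          # residual of H p = -g at p = 0
    d = r.copy()
    for _ in range(max_iter):
        Hd = hvp(d)
        alpha = (r @ r) / (d @ Hd)
        p = p + alpha * d
        r_new = r - alpha * Hd
        if np.linalg.norm(r_new) < tol:
            break
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return p

# For an SPD quadratic, CG recovers the exact Newton step in at most n iterations.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
g = np.array([1.0, 1.0])
p = newton_cg_step(lambda v: H @ v, g)
```

In a real Newton-CG solver, `hvp` would come from automatic differentiation rather than an explicit matrix.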
| Scenario | Recommended tool | Why |
|---|---|---|
| Small n, high accuracy, smooth convex | Newton + line search / trust region | Fast convergence, stable |
| Medium n (10³–10⁵), smooth deterministic | L-BFGS | Excellent cost/benefit |
| Large n, can do H v, need high precision | Newton-CG / trust region Newton-CG | Uses curvature without storing H |
| Noisy mini-batch gradients | First-order (Adam/SGD) or specialized stochastic quasi-Newton | Classical line searches can struggle |
A deep way to view these methods is that they choose steps using a local metric: gradient descent measures step length in the Euclidean norm, while Newton-like methods measure it in the curvature-induced norm ‖p‖_H = (pᵀ H p)^{1/2}.
That perspective connects second-order optimization to:
- preconditioning in numerical linear algebra,
- natural-gradient methods, which use the Fisher information matrix as the local metric, and
- mirror descent and Riemannian optimization, where the geometry of the parameter space defines the step.
You now have the two atomic concepts for this node:
1) second-order quadratic modeling (Taylor) ⇒ Newton step via solving H p = −g
2) quasi-Newton approximation B ≈ H⁻¹ learned via secant information (s, y) ⇒ BFGS/L-BFGS
From here, you can build toward robust implementations (line search, trust region) and scalable solvers (CG, Hessian-vector products).
Minimize f(x) = ½ xᵀ A x − bᵀ x, where A is SPD.
Let A = [[4, 0],[0, 1]] and b = [8, 2]ᵀ. Start at x₀ = [0, 0]ᵀ.
Compute one Newton step.
Compute gradient and Hessian:
∇f(x) = A x − b
H = ∇²f(x) = A
At x₀ = 0:
g₀ = ∇f(x₀) = A 0 − b = −b = [−8, −2]ᵀ
Newton step solves:
H p = −g₀
A p = −(−b) = b
Solve A p = b:
[[4,0],[0,1]] [p₁,p₂]ᵀ = [8,2]ᵀ
So:
4 p₁ = 8 ⇒ p₁ = 2
1 p₂ = 2 ⇒ p₂ = 2
Therefore p = [2,2]ᵀ
Update:
x₁ = x₀ + p = [0,0]ᵀ + [2,2]ᵀ = [2,2]ᵀ
Verify optimality (for a quadratic with SPD A, optimum is A x⋆ = b):
A x₁ = [[4,0],[0,1]] [2,2]ᵀ = [8,2]ᵀ = b
So x₁ = x⋆
Insight: For strictly convex quadratics, Newton’s method reaches the minimizer in one step from any start because the second-order Taylor model is exact (the function really is quadratic).
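The worked example checks out numerically; a quick numpy verification:

```python
import numpy as np

A = np.array([[4.0, 0.0], [0.0, 1.0]])
b = np.array([8.0, 2.0])
x0 = np.zeros(2)

g0 = A @ x0 - b                  # gradient at x0: [-8, -2]
p = np.linalg.solve(A, -g0)      # Newton step: solve A p = -g0 = b
x1 = x0 + p
print(x1)        # [2. 2.]
print(A @ x1)    # [8. 2.] = b, so x1 is the minimizer
```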
Minimize f(x) = x⁴. Start at x₀ = 1. Compute two Newton iterations.
(Here ∇f = f′ and H = f″.)
Compute derivatives:
f′(x) = 4x³
f″(x) = 12x²
Newton update in 1D:
xₖ₊₁ = xₖ − f′(xₖ)/f″(xₖ)
At x₀ = 1:
f′(1) = 4
f″(1) = 12
So:
x₁ = 1 − 4/12 = 1 − 1/3 = 2/3
At x₁ = 2/3:
f′(2/3) = 4 (2/3)³ = 4 (8/27) = 32/27
f″(2/3) = 12 (2/3)² = 12 (4/9) = 16/3
Compute step:
x₂ = x₁ − (32/27)/(16/3)
= 2/3 − (32/27)·(3/16)
= 2/3 − (96)/(432)
= 2/3 − 2/9
= 4/9
Insight: Newton automatically reduces step sizes near flat curvature: as x → 0, f″(x) → 0 too, so the ratio f′/f″ behaves like (4x³)/(12x²) = x/3, giving xₖ₊₁ = (2/3) xₖ. Here convergence is linear because the minimizer is not strongly convex (H at 0 is 0).
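The linear rate is easy to see numerically: each Newton iterate on f(x) = x⁴ is exactly 2/3 of the previous one.

```python
# Newton's method on f(x) = x^4: x_{k+1} = x_k - f'(x_k)/f''(x_k) = (2/3) x_k
x = 1.0
iterates = [x]
for _ in range(5):
    x = x - (4 * x**3) / (12 * x**2)
    iterates.append(x)
print(iterates)  # [1.0, 0.666..., 0.444..., 0.296..., 0.197..., 0.131...]
```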
Suppose at iteration k you have Bₖ = I (2×2 identity).
You took s = [1, 0]ᵀ and observed gradient change y = [2, 0]ᵀ.
Compute Bₖ₊₁ using the BFGS inverse update.
Compute ρ:
ρ = 1/(yᵀ s) = 1/([2,0]·[1,0]) = 1/2
Write the update:
Bₖ₊₁ = (I − ρ s yᵀ) Bₖ (I − ρ y sᵀ) + ρ s sᵀ
Compute matrices:
s yᵀ = [[1],[0]] [2,0] = [[2,0],[0,0]]
So:
I − ρ s yᵀ = I − (1/2) [[2,0],[0,0]] = [[0,0],[0,1]]
y sᵀ = [[2],[0]] [1,0] = [[2,0],[0,0]]
So:
I − ρ y sᵀ = [[0,0],[0,1]] as well
Since Bₖ = I:
(I − ρ s yᵀ) Bₖ (I − ρ y sᵀ)
= [[0,0],[0,1]] I [[0,0],[0,1]]
= [[0,0],[0,1]]
Compute final term:
ρ s sᵀ = (1/2) [[1],[0]] [1,0] = (1/2) [[1,0],[0,0]] = [[1/2,0],[0,0]]
Add them:
Bₖ₊₁ = [[0,0],[0,1]] + [[1/2,0],[0,0]] = [[1/2,0],[0,1]]
Insight: This update learned that curvature along the first coordinate is about 2 (since y ≈ H s), so the inverse curvature along that direction should be about 1/2. The second coordinate was untouched, so it stayed at 1.
Second-order methods build a local quadratic model: f(x + p) ≈ f(x) + gᵀ p + ½ pᵀ H p.
Newton’s step is the minimizer of that quadratic model: solve H p = −g (don’t compute H⁻¹ explicitly).
If H is SPD, the Newton direction is a descent direction since gᵀ p = −gᵀ H⁻¹ g < 0.
Because the Taylor model is local, practical Newton methods use globalization: line search (Armijo/Wolfe) or trust regions.
Quasi-Newton methods replace H⁻¹ with a learned approximation B updated from (s, y) where s = Δx and y = Δg.
BFGS updates B with a rank-two formula that preserves symmetry and (with yᵀ s > 0) positive definiteness.
L-BFGS scales to large n by storing only the last m curvature pairs and computing p = −B g via the two-loop recursion.
Inverting the Hessian explicitly (computing H⁻¹) instead of solving H p = −g with a linear solver.
Using Newton/BFGS steps without globalization (no line search / trust region), leading to overshooting or divergence.
Ignoring indefiniteness: treating a nonconvex Hessian as if it were SPD, which can produce ascent directions.
For BFGS/L-BFGS, failing to enforce yᵀ s > 0 (e.g., by skipping Wolfe line search), which can break SPD guarantees and destabilize updates.
For f(x) = ½ xᵀ A x with A = [[10,0],[0,1]], compare the gradient descent direction −g and the Newton direction −H⁻¹ g at a general point x. What is the qualitative difference in how they scale the two coordinates?
Hint: Compute g = A x and H = A. Then −H⁻¹ g simplifies a lot.
We have g = ∇f(x) = A x and H = A.
Gradient descent direction (ignoring step size α):
−g = −A x = [−10 x₁, −x₂]ᵀ
So it is 10× larger in magnitude along coordinate 1.
Newton direction:
−H⁻¹ g = −A⁻¹ (A x) = −x = [−x₁, −x₂]ᵀ
So Newton cancels the anisotropic scaling of A and treats both coordinates equally (invariant to that scaling).
Show that if H is SPD, then the Newton direction p satisfies gᵀ p < 0 whenever g ≠ 0.
Hint: Use p = −H⁻¹ g and the fact that vᵀ M v > 0 for SPD M and v ≠ 0.
If H is SPD, then H⁻¹ is also SPD.
Newton direction: p = −H⁻¹ g.
Then:
gᵀ p = gᵀ (−H⁻¹ g) = −gᵀ H⁻¹ g.
Since H⁻¹ is SPD and g ≠ 0, we have gᵀ H⁻¹ g > 0.
Therefore gᵀ p < 0, so p is a descent direction.
Given Bₖ = I, s = [1,1]ᵀ, y = [3,1]ᵀ, compute ρ = 1/(yᵀ s) and verify the secant condition Bₖ₊₁ y = s for the BFGS inverse update (you may compute Bₖ₊₁ explicitly or reason using known properties).
Hint: First compute yᵀ s. Then either plug into the BFGS formula or use the theorem that the BFGS inverse update satisfies the secant equation by construction (provided yᵀ s ≠ 0).
Compute yᵀ s = [3,1]·[1,1] = 4, so ρ = 1/4.
For BFGS inverse update:
Bₖ₊₁ = (I − ρ s yᵀ) Bₖ (I − ρ y sᵀ) + ρ s sᵀ.
With Bₖ = I, the update is well-defined since yᵀ s = 4 ≠ 0.
A defining property of the BFGS inverse update is that it satisfies the secant condition:
Bₖ₊₁ y = s.
Therefore, for this (s, y) pair, Bₖ₊₁ y must equal [1,1]ᵀ.
(If you compute Bₖ₊₁ explicitly, you will indeed find a symmetric matrix that maps y to s, confirming the condition.)
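For completeness, computing Bₖ₊₁ explicitly in numpy confirms the secant condition:

```python
import numpy as np

s = np.array([1.0, 1.0])
y = np.array([3.0, 1.0])
B = np.eye(2)

rho = 1.0 / (y @ s)                      # = 1/4
V = np.eye(2) - rho * np.outer(s, y)
B_new = V @ B @ V.T + rho * np.outer(s, s)

print(B_new)       # a symmetric matrix
print(B_new @ y)   # [1. 1.] = s: the secant condition holds
```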