Optimization where every local minimum is global, which is what makes efficient algorithms possible.
Many optimization problems in ML feel like hiking in fog: you follow the steepest downhill direction and hope you’re heading toward the right valley. Convex optimization is the rare case where the landscape is shaped so kindly that every local valley is the same valley—once you get low enough, you can be confident you’ve reached the global minimum. That single geometric guarantee is why convex optimization sits at the core of reliable, scalable learning algorithms.
A convex optimization problem minimizes a convex objective over a convex feasible set. Its defining superpower is: any local minimizer is a global minimizer. This enables strong guarantees (existence/uniqueness under conditions, duality, certificates of optimality) and efficient algorithms (projected gradient, interior-point, proximal methods).
In general optimization, you can easily get stuck in a local minimum that isn’t the global minimum. That uncertainty forces you to rely on heuristics, restarts, and luck.
Convex optimization is the opposite: it’s a mathematical setting where the geometry of the problem rules out “bad” local minima. When you solve a convex optimization problem, you get a solution you can trust—and you can often quantify how close you are to optimal at every step.
This is why convex optimization underpins so many standard ML models: ridge regression, LASSO, logistic regression (with convex losses), SVMs, and many regularized empirical risk minimization problems.
A convex optimization problem has the form
minimize f(x)
subject to x ∈ C
where f : ℝⁿ → ℝ is a convex objective function and C ⊆ ℝⁿ is a convex feasible set.
A point x⋆ is an optimizer if it attains the smallest objective value among feasible points:
x⋆ ∈ argmin_{x ∈ C} f(x)
Here, argmin (“argument of the minimum”) returns the set of points that minimize f over C. (There can be multiple minimizers.)
If f is convex and C is convex, then every local minimizer of f over C is a global minimizer.
Intuitively: convexity prevents the objective from dipping down and then rising and dipping again in a way that would create a “fake” valley.
More formally, if x̂ is feasible and f(x̂) ≤ f(x) for all feasible x in some neighborhood of x̂, then convexity implies f(x̂) ≤ f(x) for all x ∈ C.
Think of two ingredients:
1) Convex set C: if you pick any two feasible points x₁, x₂ ∈ C, then the entire line segment is feasible:
∀t ∈ [0, 1], t x₁ + (1 − t) x₂ ∈ C
2) Convex function f: the value at a mixture is no more than the mixture of values:
f(t x₁ + (1 − t) x₂) ≤ t f(x₁) + (1 − t) f(x₂)
Those two combine into a landscape with no “hidden basins.”
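Both definitions are easy to sanity-check numerically. A minimal sketch, using the illustrative choices f(x) = x² and C = [−1, 2]:

```python
# Check the two convexity definitions numerically for illustrative choices:
# convex function f(x) = x^2 and convex set C = [-1, 2].

def f(x):
    return x * x

def in_C(x):
    return -1.0 <= x <= 2.0

x1, x2 = -0.5, 1.5          # two feasible points
ts = [i / 100 for i in range(101)]

# Convex set: every point on the segment between x1 and x2 stays in C.
segment_feasible = all(in_C(t * x1 + (1 - t) * x2) for t in ts)

# Convex function: the value at a mixture never exceeds the mixture of values.
chord_above = all(
    f(t * x1 + (1 - t) * x2) <= t * f(x1) + (1 - t) * f(x2) + 1e-12
    for t in ts
)
print(segment_feasible, chord_above)  # True True
```

A grid check like this is not a proof, but it is a quick way to catch a nonconvex modeling mistake.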
Most convex problems are written using inequalities and equalities:
minimize f₀(x)
subject to fᵢ(x) ≤ 0 for i = 1,…,m
A x = b
If f₀ is convex, each fᵢ is convex, and A x = b is affine, then the feasible set is convex and the whole problem is convex.
Here are common convex problem families you’ll see in ML and systems:
| Problem class | Objective / constraints | Notes |
|---|---|---|
| LP (Linear Program) | linear objective, linear inequalities | extremely well-studied; piecewise-linear geometry |
| QP (Quadratic Program) | convex quadratic objective, linear constraints | includes least squares + constraints |
| SOCP | second-order cone constraints | norms like ‖A x + b‖₂ ≤ cᵀx + d |
| SDP | semidefinite constraints X ⪰ 0 | powerful but heavier computationally |
| Unconstrained smooth convex | just minimize f(x) | gradient methods shine |
| Composite convex | f(x) = g(x) + h(x) | g smooth, h nonsmooth (e.g., L1) → proximal methods |
Convex optimization is the umbrella that organizes these into one coherent theory.
In optimization, you want a certificate that a candidate solution is optimal.
Optimality conditions become both:
1) conceptual tools (understanding solutions), and
2) algorithmic stopping criteria (when to stop iterating).
Suppose f is convex and differentiable on ℝⁿ. Then x⋆ minimizes f over ℝⁿ if and only if
∇f(x⋆) = 0.
Convexity implies the first-order condition:
f(y) ≥ f(x) + ∇f(x)ᵀ (y − x) for all x, y.
This says the tangent hyperplane at x underestimates the function everywhere.
Now plug x = x⋆. If ∇f(x⋆) = 0, then for any y:
f(y) ≥ f(x⋆) + 0ᵀ (y − x⋆) = f(x⋆).
So x⋆ is globally optimal.
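The first-order inequality that drives this argument can also be checked numerically. A quick sketch using the softplus function f(x) = log(1 + eˣ), a standard smooth convex example (the function and grid are illustrative choices):

```python
import math

# Numerically check the first-order condition f(y) >= f(x) + f'(x)(y - x)
# for the convex softplus function f(x) = log(1 + e^x).

def f(x):
    return math.log1p(math.exp(x))

def fprime(x):
    return 1.0 / (1.0 + math.exp(-x))  # sigmoid, the derivative of softplus

xs = [-3 + 0.5 * i for i in range(13)]  # grid of x and y values in [-3, 3]

# The tangent line at every x should underestimate f at every y.
tangent_underestimates = all(
    f(y) >= f(x) + fprime(x) * (y - x) - 1e-12
    for x in xs for y in xs
)
print(tangent_underestimates)  # True
```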
Now consider minimize f(x) subject to x ∈ C, where C is convex and closed.
Even if ∇f(x⋆) ≠ 0, x⋆ can be optimal because the constraint blocks descent directions.
A direction d is feasible at x⋆ if small steps stay in C:
∃ε > 0 such that x⋆ + t d ∈ C for all t ∈ [0, ε].
If x⋆ is optimal, you should not be able to decrease f by moving in any feasible direction.
For differentiable f, the directional derivative is
( d/dt ) f(x⋆ + t d) |_{t=0} = ∇f(x⋆)ᵀ d.
Optimality requires
∇f(x⋆)ᵀ d ≥ 0 for all feasible directions d.
Geometrically: the gradient points “out of” the feasible set.
Define the normal cone of C at x⋆:
N_C(x⋆) = { g : gᵀ (x − x⋆) ≤ 0 for all x ∈ C }.
Then x⋆ solves min_{x∈C} f(x) (with differentiable convex f) iff
−∇f(x⋆) ∈ N_C(x⋆).
This is the geometric version of constrained optimality.
Many important convex functions are not differentiable: the L1 norm ‖x‖₁ (nondifferentiable wherever a coordinate is 0), the hinge loss max(0, 1 − z), and pointwise maxima of affine functions.
Convex optimization still works because we generalize gradients.
A vector g is a subgradient of convex f at x if
f(y) ≥ f(x) + gᵀ (y − x) for all y.
The set of all subgradients is the subdifferential:
∂f(x) = { g : g is a subgradient at x }.
If f is differentiable at x, then ∂f(x) = {∇f(x)}.
Unconstrained case: x⋆ minimizes convex f over ℝⁿ iff
0 ∈ ∂f(x⋆).
Constrained case: x⋆ minimizes f over C iff
0 ∈ ∂f(x⋆) + N_C(x⋆).
This is a bridge to KKT conditions (which you’ll learn next as an unlocked node): KKT is essentially an explicit form of “subgradient + normal cone = 0” for inequality/equality constraints.
Convexity ensures global minima are well-behaved, but not necessarily unique.
A function is µ-strongly convex if for all x, y and t ∈ [0,1]:
f(t x + (1−t) y) ≤ t f(x) + (1−t) f(y) − (µ/2) t(1−t) ‖x−y‖₂².
Equivalent (when twice differentiable):
∇²f(x) ⪰ µ I.
Strong convexity implies:
1) the minimizer, if it exists, is unique;
2) quadratic growth: f(x) ≥ f(x⋆) + (µ/2)‖x − x⋆‖₂²;
3) gradient methods converge linearly (geometrically) rather than at a sublinear rate.
In ML, adding L2 regularization (λ/2)‖w‖₂² often makes objectives strongly convex.
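You can see this concretely through Hessian eigenvalues. The sketch below, with random illustrative data, checks that adding (λ/2)‖w‖₂² to a least-squares objective shifts every Hessian eigenvalue up by λ:

```python
import numpy as np

# For (1/n)||Xw - y||^2 + (lam/2)||w||^2, the Hessian in w is
# (2/n) X^T X + lam * I.  The L2 term shifts every eigenvalue up by lam,
# so the objective is at least lam-strongly convex.  X is random
# illustrative data.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
lam = 0.1

H_plain = (2 / 50) * X.T @ X          # Hessian without regularization
H_ridge = H_plain + lam * np.eye(10)  # Hessian with the L2 term

mu_plain = np.linalg.eigvalsh(H_plain).min()  # can be near 0
mu_ridge = np.linalg.eigvalsh(H_ridge).min()  # >= lam by construction
print(mu_ridge >= lam - 1e-12)  # True
```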
To decide if a problem is convex, verify:
1) Objective f₀ is convex.
2) Each inequality constraint fᵢ(x) ≤ 0 has fᵢ convex.
3) Equality constraints are affine.
4) Domain restrictions (like x > 0) define a convex set.
Once those hold, you gain the full power of convex optimality conditions and duality-based certificates.
You already know gradient descent: iterate
x_{k+1} = x_k − α_k ∇f(x_k).
In nonconvex landscapes, this can converge to saddle points or poor local minima.
In convex optimization, algorithms are designed around two privileges:
1) Any stationary point is globally optimal (with the right notion of stationarity).
2) We can measure progress via optimality gaps f(x_k) − f(x⋆).
This section focuses on three algorithmic ideas you’ll see constantly.
Use projected gradient descent when you want to minimize a smooth convex f over a simple convex set C (box constraints, ball constraints, the simplex, etc.).
Take a gradient step, then project back to feasibility:
x_{k+1} = Π_C( x_k − α_k ∇f(x_k) )
where Π_C(z) = argmin_{x∈C} ‖x − z‖₂.
If you take an unconstrained step, you might leave C. Projection finds the closest feasible point in Euclidean distance.
The projection itself is a convex optimization problem, but for many sets it has a closed form:
| Set C | Projection Π_C(z) |
|---|---|
| Box [ℓ, u] | clip each coordinate to [ℓᵢ, uᵢ] |
| ℓ₂ ball {x: ‖x‖₂ ≤ r} | if ‖z‖₂ ≤ r then z else r z / ‖z‖₂ |
| Simplex {x ≥ 0, ∑ xᵢ = 1} | sort + threshold (O(n log n)) |
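The first two rows of the table translate directly into code. A minimal sketch (the set parameters and test vector are illustrative):

```python
import numpy as np

# Closed-form projections onto two simple convex sets.

def project_box(z, lo, hi):
    """Clip each coordinate of z to [lo_i, hi_i]."""
    return np.clip(z, lo, hi)

def project_l2_ball(z, r):
    """Return z if it is inside the ball, else rescale it onto the boundary."""
    norm = np.linalg.norm(z)
    return z if norm <= r else (r / norm) * z

z = np.array([3.0, -4.0])
p_box = project_box(z, -1.0, 1.0)   # each coordinate clipped to [-1, 1]
p_ball = project_l2_ball(z, 1.0)    # ||z|| = 5, so z shrinks to z/5
print(p_box, p_ball)                # [ 1. -1.] [ 0.6 -0.8]
```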
If ∇f is L-Lipschitz (smooth), and α ∈ (0, 1/L], PGD achieves
f(x_k) − f(x⋆) = O(1/k)
and with strong convexity you can get linear convergence.
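Putting the pieces together, here is a minimal PGD sketch for least squares over an ℓ₂ ball, using the step size α = 1/L (the problem data A, b, r is synthetic and illustrative):

```python
import numpy as np

# Projected gradient descent for: min (1/2)||Ax - b||^2  s.t.  ||x||_2 <= r.

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
r = 0.5

L = np.linalg.norm(A.T @ A, 2)  # largest eigenvalue of the Hessian A^T A
alpha = 1.0 / L                 # step size in (0, 1/L]

x = np.zeros(5)
for _ in range(500):
    z = x - alpha * A.T @ (A @ x - b)   # unconstrained gradient step
    n = np.linalg.norm(z)
    x = z if n <= r else (r / n) * z    # project back onto the l2 ball

print(np.linalg.norm(x) <= r + 1e-9)  # True: every iterate is feasible
```

With α = 1/L the objective decreases monotonically, so the final iterate is at least as good as the starting point x = 0.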
Many ML objectives are sums:
minimize g(x) + h(x)
where g is smooth and convex (e.g., a data-fit loss) and h is convex but possibly nonsmooth (e.g., an L1 regularizer).
Plain gradient descent struggles with nonsmooth terms because ∇h may not exist.
Define the proximal operator of h:
prox_{α h}(z) = argmin_x ( 1/(2α) ‖x − z‖₂² + h(x) ).
Then proximal gradient updates:
x_{k+1} = prox_{α h}( x_k − α ∇g(x_k) ).
This looks like “gradient step on g” followed by a structured correction that handles h.
If h(x) = λ‖x‖₁, then prox has a closed form coordinatewise:
[prox_{αλ‖·‖₁}(z)]ᵢ = sign(zᵢ) · max(|zᵢ| − αλ, 0).
That single formula is the engine behind efficient sparse learning (LASSO, sparse logistic regression).
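A direct implementation of that formula (the test vector is illustrative):

```python
import numpy as np

# Coordinatewise soft-thresholding: the prox of tau * ||.||_1.

def soft_threshold(z, tau):
    """Shrink each coordinate toward 0 by tau; |z_i| <= tau becomes exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

z = np.array([3.0, -0.2, 0.5, -2.0])
# With tau = 0.5: coordinates with |z_i| <= 0.5 are zeroed, others shrink by 0.5.
print(soft_threshold(z, 0.5))
```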
For convex g+h with g smooth:
f(x_k) − f(x⋆) = O(1/k)
and accelerated variants (FISTA) achieve O(1/k²).
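Combining the gradient step on g with the soft-threshold prox gives ISTA, the basic proximal gradient method for LASSO. A self-contained sketch on synthetic sparse data (the problem sizes, λ, and the ground-truth signal are illustrative choices):

```python
import numpy as np

# ISTA for LASSO:  min  (1/(2n)) ||Xw - y||^2 + lam * ||w||_1.
# The true signal is sparse; the prox step should recover sparsity.

rng = np.random.default_rng(2)
n, d = 100, 20
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]                  # sparse ground truth
y = X @ w_true + 0.01 * rng.standard_normal(n)
lam = 0.1

L = np.linalg.norm(X.T @ X, 2) / n  # smoothness constant of the quadratic part
alpha = 1.0 / L

w = np.zeros(d)
for _ in range(1000):
    grad = X.T @ (X @ w - y) / n                               # gradient step on g
    z = w - alpha * grad
    w = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)  # prox step on h

print(np.count_nonzero(w))  # typically a small number: the solution is sparse
```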
Interior-point methods are among the most powerful general-purpose solvers for many convex problems (LP, QP, SOCP, SDP) at moderate scale.
They’re less common in very large-scale ML training loops (where first-order methods dominate), but they’re essential for medium-scale constrained problems (LP, QP, SOCP, SDP) where high accuracy and robustness matter.
Suppose you have inequality constraints fᵢ(x) ≤ 0. An interior method solves a sequence of unconstrained problems:
minimize f₀(x) + (1/t) ∑_{i=1}^m ϕ(−fᵢ(x))
where ϕ is a barrier like −log(·), and t increases over time.
As t → ∞, the barrier weight 1/t → 0, and solutions approach the constrained optimum while staying strictly feasible.
The barrier-augmented objective remains convex (sum of convex functions), and Newton-type steps behave predictably.
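The idea is easiest to see in one dimension. For minimize x subject to x ≥ 1, the barrier problem is minimize x − (1/t) log(x − 1), whose exact minimizer is x(t) = 1 + 1/t. The sketch below recovers x(t) by plain gradient descent (the step size and iteration count are ad hoc illustrative choices):

```python
# A minimal 1-D log-barrier sketch for:  minimize x  subject to  x >= 1.
# Barrier problem:  minimize  x - (1/t) * log(x - 1),
# with exact minimizer x(t) = 1 + 1/t  ->  1 as t grows.

def barrier_min(t, iters=2000, step=1e-3):
    x = 2.0                                   # strictly feasible start
    for _ in range(iters):
        grad = 1.0 - (1.0 / t) / (x - 1.0)    # derivative of the barrier objective
        x -= step * grad
        x = max(x, 1.0 + 1e-9)                # safeguard: stay strictly feasible
    return x

for t in [1.0, 10.0, 100.0]:
    print(t, barrier_min(t))  # approaches 1 + 1/t for each t
```

Real interior-point methods use Newton steps and a schedule for increasing t, but the mechanism of the central path is the same.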
| Situation | Best-fit method | Why |
|---|---|---|
| Smooth f, simple constraint set C | Projected gradient / accelerated PGD | projection is cheap; scales well |
| g smooth + h nonsmooth with simple prox | Proximal gradient / FISTA | handles L1, hinge-like terms efficiently |
| Medium-scale constrained LP/QP/SOCP/SDP | Interior-point solver | high accuracy, robust, strong theory |
| Huge data, streaming, noisy gradients | (Stochastic) gradient / variance-reduced methods | cheap iterations; convexity still gives guarantees |
The unifying theme is that convexity provides a progress metric (optimality gap) and often a certificate (duality/KKT) that your algorithm has succeeded.
Convex optimization is not just about finding x⋆. It’s also about knowing you’ve found it.
Duality gives:
1) lower bounds on the optimal value (weak duality);
2) certificates of optimality when strong duality holds;
3) alternative formulations that can be easier to solve (e.g., the SVM dual).
This node unlocks KKT conditions because KKT is the language that connects primal feasibility, dual feasibility, and complementary slackness into a single optimality certificate.
For many constrained convex problems, you form a Lagrangian
L(x, λ, ν) = f₀(x) + ∑_{i=1}^m λᵢ fᵢ(x) + νᵀ(Ax − b)
with λ ≥ 0.
The dual function is
g(λ, ν) = inf_x L(x, λ, ν).
Then the dual problem is
maximize g(λ, ν)
subject to λ ≥ 0.
Weak duality always holds: dual optimum ≤ primal optimum.
In convex optimization, strong duality often holds under mild conditions (e.g., Slater’s condition), meaning the dual optimum equals the primal optimum—giving a tight certificate.
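Both statements can be verified on a tiny problem: minimize ½x² subject to x ≥ 1. The Lagrangian is L(x, λ) = ½x² + λ(1 − x); minimizing over x (attained at x = λ) gives the dual function g(λ) = λ − ½λ². Slater's condition holds (x = 2 is strictly feasible), so the dual maximum should equal the primal optimum ½:

```python
# Weak/strong duality on:  minimize (1/2) x^2  subject to  1 - x <= 0.
# Dual function (derived above):  g(lam) = lam - (1/2) lam^2  for lam >= 0.

def g(lam):
    return lam - 0.5 * lam * lam

primal_opt = 0.5  # attained at x = 1

# Weak duality: every dual value is a lower bound on the primal optimum.
weak_holds = all(g(0.1 * k) <= primal_opt + 1e-12 for k in range(50))

# Strong duality: the dual maximum (at lam = 1) matches the primal optimum.
dual_opt = max(g(0.1 * k) for k in range(50))
print(weak_holds, primal_opt, dual_opt)  # True 0.5 0.5
```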
Regularization commonly produces convex objectives.
Example objective (ridge regression):
minimize (1/n) ∑_{i=1}^n (yᵢ − wᵀ xᵢ)² + (λ/2) ‖w‖₂²
Practical meaning: the squared-error term fits the data, while the L2 penalty makes the objective strongly convex, so the minimizer is unique and gradient methods converge quickly.
Swapping the L2 penalty for an L1 penalty gives the LASSO:
minimize (1/n) ∑ (yᵢ − wᵀ xᵢ)² + λ‖w‖₁
This is convex but nonsmooth. Proximal methods solve it efficiently, and the nonsmoothness is what enables sparsity.
The classic soft-margin SVM can be written as a convex optimization problem.
One common primal form:
minimize (1/2)‖w‖₂² + C ∑_{i=1}^n ξᵢ
subject to yᵢ(wᵀ xᵢ + b) ≥ 1 − ξᵢ
ξᵢ ≥ 0
This is a convex QP: the objective is a convex quadratic in (w, b, ξ), and every constraint is linear.
The dual form reveals that the solution depends on the data only through inner products xᵢᵀxⱼ, which identifies the support vectors and opens the door to kernelization.
This is a perfect example of how convex optimization structures an ML model end-to-end: formulation → algorithm → certificate → interpretation.
Convexity interacts beautifully with Hessian-based methods.
If f is twice differentiable convex, then ∇²f(x) ⪰ 0 everywhere. This means the local quadratic models used by Newton-type methods are themselves convex, there are no negative-curvature directions, and Newton steps are well-posed (after regularizing any zero curvature).
Interior-point methods essentially rely on Newton-like steps on barrier-augmented convex objectives.
Meanwhile, quasi-Newton methods (BFGS, L-BFGS) are often effective on convex problems because they build PSD curvature approximations.
Bayesian optimization targets global optimization of expensive black-box functions, typically nonconvex.
So why does convex optimization still matter? Because convex subproblems appear inside nonconvex pipelines, and convex theory supplies the guarantees and stopping criteria for those inner steps.
Convex optimization is often less about calculus and more about modeling: recognizing convex structure, reformulating a problem into a standard family (LP, QP, SOCP, SDP), and choosing an algorithm that exploits that structure.
Once you start thinking this way, many problems become: “Can I rewrite this to be convex (or approximately convex)?” That’s a major professional skill in ML, operations research, and control.
Compute Π_C(z) where C = {x ∈ ℝⁿ : ‖x‖₂ ≤ r} and z is given. This projection is used in constrained convex optimization like min f(x) s.t. ‖x‖₂ ≤ r.
We want the closest feasible point:
Π_C(z) = argmin_{‖x‖₂ ≤ r} ‖x − z‖₂.
Case 1: z is already feasible.
If ‖z‖₂ ≤ r, then choosing x = z yields objective 0, which is minimal.
So Π_C(z) = z.
Case 2: ‖z‖₂ > r.
The solution must lie on the boundary ‖x‖₂ = r (otherwise we could move toward z and reduce distance).
Use the fact that the closest point on a sphere to z lies on the ray from the origin through z.
So let x = α z with α ≥ 0.
Impose feasibility:
‖α z‖₂ = α ‖z‖₂ = r ⇒ α = r / ‖z‖₂.
Therefore the projection is:
Π_C(z) = r z / ‖z‖₂ when ‖z‖₂ > r.
Insight: Projections encode constraints as geometry. For many convex sets used in ML, Π_C has a simple closed form—making projected gradient methods practical at scale.
Minimize f(x) = 1/2 xᵀQx + cᵀx over ℝⁿ where Q ⪰ µI for some µ > 0 (so Q is positive definite). Find argmin and explain why it’s unique.
Because Q ⪰ µI with µ > 0, f is µ-strongly convex. Strong convexity implies the minimizer is unique.
Compute the gradient:
∇f(x) = ∇(1/2 xᵀQx) + ∇(cᵀx)
= Qx + c (using the symmetry of Q).
Set first-order optimality condition (unconstrained convex):
∇f(x⋆) = 0 ⇒ Qx⋆ + c = 0.
Solve the linear system:
Qx⋆ = −c
⇒ x⋆ = −Q⁻¹ c.
Therefore:
argmin_{x∈ℝⁿ} f(x) = { −Q⁻¹ c } (a singleton set).
Insight: For convex quadratics, optimization reduces to solving linear systems. Strong convexity (Q ≻ 0) upgrades “a minimizer exists” to “the minimizer is unique,” which is crucial for stability in learning.
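A numerical check of this reduction, with a small illustrative Q and c:

```python
import numpy as np

# Minimizing f(x) = (1/2) x^T Q x + c^T x with Q positive definite reduces
# to solving the linear system Q x = -c.

Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])    # symmetric with positive eigenvalues
c = np.array([1.0, -1.0])

x_star = np.linalg.solve(Q, -c)  # x* = -Q^{-1} c
grad = Q @ x_star + c            # gradient at the candidate minimizer
print(np.allclose(grad, 0.0))    # True: the first-order condition holds
```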
Compute prox_{α λ‖·‖₁}(z) for z ∈ ℝⁿ. This is the key step in LASSO and sparse logistic regression.
By definition:
prox_{α λ‖·‖₁}(z) = argmin_{x} (1/(2α))‖x − z‖₂² + λ‖x‖₁.
The objective separates by coordinates because both terms are sums:
(1/(2α))∑(xᵢ − zᵢ)² + λ∑|xᵢ|
= ∑ [ (1/(2α))(xᵢ − zᵢ)² + λ|xᵢ| ].
So we can minimize each coordinate independently:
xᵢ⋆ = argmin_{x} (1/(2α))(x − zᵢ)² + λ|x|.
Consider the 1D subgradient optimality condition.
For x ≠ 0, derivative is:
(1/α)(x − zᵢ) + λ sign(x) = 0.
If x > 0:
(1/α)(x − zᵢ) + λ = 0 ⇒ x = zᵢ − αλ.
This is consistent with x > 0 only if zᵢ > αλ.
If x < 0:
(1/α)(x − zᵢ) − λ = 0 ⇒ x = zᵢ + αλ.
This is consistent with x < 0 only if zᵢ < −αλ.
If |zᵢ| ≤ αλ, the optimum is x = 0 (you can verify using subgradients of |x| at 0, which is [−1, 1]).
Combine cases:
xᵢ⋆ = sign(zᵢ) · max(|zᵢ| − αλ, 0).
Insight: The prox step doesn’t merely “nudge” parameters—it can set them exactly to 0. That discrete sparsity effect comes from convex but nonsmooth geometry, not from any heuristic.
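As a sanity check, the closed form can be compared against brute-force minimization of the one-dimensional prox objective over a fine grid (α, λ, and the test points are illustrative):

```python
# Verify the 1-D soft-thresholding formula against brute-force minimization of
#   (1/(2*alpha)) * (x - z)^2 + lam * |x|
# over a fine grid, for a few values of z.

alpha, lam = 0.5, 1.0
tau = alpha * lam  # threshold = 0.5

def closed_form(z):
    return (1 if z > 0 else -1) * max(abs(z) - tau, 0.0)

def brute_force(z):
    grid = [-5 + 1e-3 * k for k in range(10001)]  # fine grid on [-5, 5]
    return min(grid, key=lambda x: (x - z) ** 2 / (2 * alpha) + lam * abs(x))

for z in [2.0, 0.3, -0.3, -2.0]:
    assert abs(closed_form(z) - brute_force(z)) < 2e-3
print("closed form matches brute force")
```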
A convex optimization problem minimizes a convex function over a convex set; its hallmark guarantee is that any local minimum is global.
argmin returns the set of minimizers; convex problems may have multiple solutions unless the objective is strongly convex.
First-order convexity inequality: f(y) ≥ f(x) + ∇f(x)ᵀ(y−x) is the engine behind many proofs and algorithms.
Unconstrained differentiable convex minimization: x⋆ is optimal iff ∇f(x⋆) = 0.
With constraints, optimality is geometric: −∇f(x⋆) lies in the normal cone N_C(x⋆); equivalently 0 ∈ ∂f(x⋆) + N_C(x⋆) for nonsmooth cases.
Projected gradient handles simple convex constraints via x_{k+1} = Π_C(x_k − α∇f(x_k)).
Proximal gradient handles composite objectives g + h where h is nonsmooth but has an efficient prox (e.g., L1 soft-thresholding).
Duality (and then KKT) turns optimality into a verifiable certificate and often yields alternative, more efficient solution pathways (e.g., SVM dual, kernelization).
Assuming “convex” means “differentiable.” Many key convex problems are nonsmooth (L1, hinge loss) and require subgradients or proximal methods.
Checking only the objective for convexity while ignoring constraints. Nonconvex feasible sets break the local=global guarantee even if f is convex.
Treating argmin as a single point by default. Without strong convexity (or other conditions), argmin can be a set.
Using gradient descent on constrained problems without projection (or another constraint-handling method), then misinterpreting infeasible iterates as meaningful.
Let f(x) = |x|. Compute ∂f(0) and use it to determine whether x⋆ = 0 is a minimizer of f over ℝ.
Hint: Use the subgradient inequality f(y) ≥ f(0) + g(y−0) for all y to find all valid g at 0.
We need all g such that |y| ≥ 0 + g y for all y.
For y > 0: |y| = y, inequality becomes y ≥ g y ⇒ 1 ≥ g.
For y < 0: |y| = −y, inequality becomes −y ≥ g y. Divide by y (negative) flips sign: −1 ≤ g.
Thus ∂f(0) = [−1, 1].
Optimality for unconstrained convex minimization is 0 ∈ ∂f(x⋆). Since 0 ∈ [−1,1], x⋆ = 0 is a global minimizer.
Consider minimize f(x) = 1/2 ‖Ax − b‖₂² subject to ‖x‖₂ ≤ r. Write the projected gradient descent update explicitly (in terms of A, b, r).
Hint: Compute ∇f(x) first, then apply the ℓ₂-ball projection formula from the worked example.
We have f(x) = 1/2 ‖Ax − b‖₂².
Gradient:
∇f(x) = Aᵀ(Ax − b).
A PGD step with step size α is:
z = x_k − α Aᵀ(Ax_k − b).
Then project onto C = {‖x‖₂ ≤ r}:
If ‖z‖₂ ≤ r, set x_{k+1} = z.
Else set x_{k+1} = r z / ‖z‖₂.
Show that if f is µ-strongly convex on a convex set C, then it has at most one minimizer on C.
Hint: Assume there are two distinct minimizers x₁ and x₂ with equal objective value, then apply strong convexity to the midpoint (or general t).
Assume for contradiction there exist two distinct minimizers x₁ ≠ x₂ in C such that f(x₁) = f(x₂) = f⋆.
Since C is convex, for any t ∈ (0,1), x_t = t x₁ + (1−t) x₂ ∈ C.
Strong convexity gives:
f(x_t) ≤ t f(x₁) + (1−t) f(x₂) − (µ/2) t(1−t) ‖x₁−x₂‖₂².
Substitute f(x₁)=f(x₂)=f⋆:
f(x_t) ≤ f⋆ − (µ/2) t(1−t) ‖x₁−x₂‖₂².
Because µ > 0, t(1−t) > 0, and ‖x₁−x₂‖₂² > 0, the right-hand side is strictly less than f⋆.
So f(x_t) < f⋆, contradicting that f⋆ is the minimum value on C.
Therefore, there cannot be two distinct minimizers; the minimizer is unique (argmin is a singleton).