Functions where the line segment between any two points on the graph lies above the graph.
Convexity is the reason some optimization problems feel “easy”: the landscape has no deceptive valleys. Once you learn the two core convexity inequalities, you can recognize when gradient-based methods are safe, when a loss is well-behaved, and why “local = global” can hold.
A function f is convex if its value on any line segment is at most the linear interpolation of its endpoint values: f(t x + (1−t) y) ≤ t f(x) + (1−t) f(y). If f is differentiable, an equivalent condition is the supporting-hyperplane inequality: f(y) ≥ f(x) + ∇f(x) · (y − x). These characterizations let you prove convexity, derive bounds, and justify optimization guarantees.
In optimization, we often want to minimize a function f: ℝⁿ → ℝ, like a training loss. The hardest part of minimization isn’t taking derivatives—it’s the geometry of the function.
Convexity is a global property. It doesn’t just tell you what happens at one point (like a derivative); it constrains what happens between points.
A function f is convex if, when you pick any two points x, y in its domain and move between them in a straight line, the function value never rises above the straight-line interpolation of the endpoint function values.
Formally, for all x, y and all t ∈ [0, 1],
f(t x + (1 − t) y) ≤ t f(x) + (1 − t) f(y)
This is the convex combination inequality.
Intuition:
If n = 1, with x, y ∈ ℝ, the inequality compares the curve to its secant line:
f(t x + (1 − t) y) ≤ t f(x) + (1 − t) f(y)
When f is convex, its curve is “cup-shaped.” A parabola f(x) = x² is convex: the secant line between two points sits above the curve.
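The parabola claim is easy to spot-check numerically. The following sketch (the grid of endpoints and weights is an arbitrary choice) verifies that the chord value never falls below the function value at the mixture point:

```python
# Numerically check the chord inequality for f(x) = x^2:
# f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) for sampled x, y, t.
def f(x):
    return x * x

def chord_gap(x, y, t):
    """Chord value minus function value at the mixture point (>= 0 for convex f)."""
    return t * f(x) + (1 - t) * f(y) - f(t * x + (1 - t) * y)

# Sample a grid of endpoints and mixing weights.
points = [-3.0, -1.0, 0.0, 0.5, 2.0]
ts = [0.0, 0.25, 0.5, 0.75, 1.0]
gaps = [chord_gap(x, y, t) for x in points for y in points for t in ts]
print(min(gaps) >= 0)  # True: the secant line never dips below the parabola
```

A numerical check like this is not a proof, but it is a quick sanity test before attempting one.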
Convexity must hold for all t ∈ [0,1], including:
f((x + y)/2) ≤ (f(x) + f(y))/2
Midpoint convexity alone is not always enough to guarantee convexity unless you assume regularity (like continuity). The full definition uses all t.
It helps to separate three related shapes:
| Type | Inequality | Shape intuition | Typical role |
|---|---|---|---|
| Convex | f(t x + (1−t) y) ≤ t f(x) + (1−t) f(y) | “Cup” | Minimization-friendly |
| Concave | f(t x + (1−t) y) ≥ t f(x) + (1−t) f(y) | “Cap” | Maximization-friendly |
| Affine | equality for all x, y, t | perfect plane/line | linear models, constraints |
An affine function is both convex and concave.
Notice the definition compares values at three locations: x, y, and the point in between. A derivative only sees an infinitesimal neighborhood. Convexity is stronger: it bans certain global configurations of the graph.
That’s why convexity buys us strong optimization guarantees later (for example, any local minimum becomes global under convexity).
The convex combination inequality is the most direct way to use convexity:
f(t x + (1 − t) y) ≤ t f(x) + (1 − t) f(y)
It lets you bound the function's value anywhere on a segment using only the endpoint values.
A point z is a convex combination of x and y if:
z = t x + (1 − t) y, with t ∈ [0,1].
Geometrically, z lies on the line segment between x and y.
In higher dimensions (ℝⁿ), nothing changes conceptually: you still move along a segment, but the segment lives in n-dimensional space.
Define the chord interpolation value:
L(t) = t f(x) + (1 − t) f(y)
Convexity says:
f(t x + (1 − t) y) − L(t) ≤ 0
The difference f(…) − L(t) is sometimes called the convexity gap (it’s ≤ 0 for convex functions under this sign convention).
The 2-point definition generalizes to a finite mixture. Suppose we have points x₁, …, xₘ and weights α₁, …, αₘ such that:
αᵢ ≥ 0, and ∑ᵢ αᵢ = 1
Then a convex function satisfies:
f(∑ᵢ αᵢ xᵢ) ≤ ∑ᵢ αᵢ f(xᵢ)
This is a standard form of Jensen’s inequality (finite version).
Why mention this here? Because in machine learning and optimization you constantly average: mini-batch losses, ensemble predictions, iterates. Convexity tells you how the loss behaves under that averaging.
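The finite Jensen inequality can be spot-checked directly. Here is a minimal sketch using f(x) = eˣ (the points and weights are arbitrary illustrative choices):

```python
import math

# Finite Jensen's inequality for f(x) = exp(x):
# f(sum_i a_i * x_i) <= sum_i a_i * f(x_i) for nonnegative weights summing to 1.
f = math.exp
xs = [-1.0, 0.0, 0.5, 2.0]
weights = [0.1, 0.2, 0.3, 0.4]
assert abs(sum(weights) - 1.0) < 1e-12

lhs = f(sum(a * x for a, x in zip(weights, xs)))   # f at the weighted average
rhs = sum(a * f(x) for a, x in zip(weights, xs))   # weighted average of f values
print(lhs <= rhs)  # True
```

In words: "f of the average is at most the average of f" — exactly the 2-point definition extended to a mixture.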
1) f(x) = ‖x‖² is convex.
2) f(x) = eˣ is convex in 1D.
3) f(x) = |x| is convex.
Beyond individual examples, “closure properties” are practical: they let you build new convex functions from known convex pieces.
| Operation | Result | Notes |
|---|---|---|
| Nonnegative weighted sum | convex | If fᵢ convex and cᵢ ≥ 0 then ∑ cᵢ fᵢ is convex |
| Add affine function | convex | f + (a·x + b) stays convex |
| Pointwise maximum | convex | max(f, g) is convex |
| Composition with affine map | convex | If f convex then g(x) = f(Ax + b) is convex |
These are heavily used when designing loss functions and regularizers.
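The closure rules in the table compose. As a sketch (the particular combination below is an arbitrary example), we can assemble a function from convex pieces via nonnegative sums, an affine term, and a pointwise max, then spot-check the chord inequality on a grid:

```python
# Build a new convex function from convex pieces via the closure rules,
# then spot-check the chord inequality. A sanity check, not a proof.
def f1(x):  # convex: x^2
    return x * x

def f2(x):  # convex: |x|
    return abs(x)

def h(x):
    # Nonnegative weighted sum + affine term + pointwise max: all preserve convexity.
    return max(2.0 * f1(x) + 0.5 * f2(x) + (3.0 * x - 1.0), f2(x - 2.0))

def violates_chord(g, x, y, t, tol=1e-9):
    z = t * x + (1 - t) * y
    return g(z) > t * g(x) + (1 - t) * g(y) + tol

grid = [i * 0.5 for i in range(-8, 9)]
ts = [0.1, 0.3, 0.5, 0.7, 0.9]
bad = [(x, y, t) for x in grid for y in grid for t in ts if violates_chord(h, x, y, t)]
print(len(bad))  # 0: no chord violations found on the grid
```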
Convexity is defined on a convex domain (a set where line segments stay inside). If the domain isn’t convex, the inequality can fail simply because t x + (1 − t) y might not be in the domain.
So when you see “f is convex,” it usually implicitly means that the domain of f is a convex set and the chord inequality holds on every segment within it.
The chord condition already hints at why minima behave well: the graph can never rise above a chord, so there is no room for a deceptive basin between any two points.
But to connect convexity to gradients (and to optimization algorithms), we need the second core characterization: the first-order supporting-hyperplane condition.
The convex combination inequality is geometric and global. Optimization algorithms are local and differential: they use ∇f(x).
So we want a condition that is equivalent to convexity but phrased in terms of the gradient at a single point.
That condition is the supporting-hyperplane inequality.
If f is differentiable on a convex domain, then f is convex iff for all x, y:
f(y) ≥ f(x) + ∇f(x) · (y − x)
Interpretation: the tangent hyperplane at x is a global underestimator of f.
This is the opposite “above/below” direction compared to the chord condition: chords sit above the graph, while tangent hyperplanes sit below it.
Both are signatures of a bowl shape.
Define the affine function (a hyperplane in ℝⁿ):
ℓₓ(y) = f(x) + ∇f(x) · (y − x)
Convexity implies:
ℓₓ(y) ≤ f(y) for all y
So ℓₓ “supports” the epigraph of f (the set of points above the graph). In 2D (n=1), it’s the tangent line touching the curve at x and staying below it.
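For the 1D picture, this is easy to check numerically. A minimal sketch for f(x) = x² (the base point x₀ = 1.5 is an arbitrary choice):

```python
# Check that the tangent line of f(x) = x^2 at a point x0 stays below the
# curve: f(y) >= f(x0) + f'(x0) * (y - x0) for sampled y.
def f(x):
    return x * x

def grad_f(x):
    return 2.0 * x

def tangent(x0, y):
    """The supporting line l_x0(y) = f(x0) + f'(x0) * (y - x0)."""
    return f(x0) + grad_f(x0) * (y - x0)

x0 = 1.5
ys = [i * 0.25 for i in range(-20, 21)]
ok = all(f(y) >= tangent(x0, y) - 1e-12 for y in ys)
print(ok)  # True: the tangent at x0 never rises above the parabola
```

Here the gap f(y) − ℓₓ₀(y) works out to (y − x₀)², a nonnegative square — the same "hidden nonnegativity" pattern that shows up in the worked examples below.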
Assume f is convex and differentiable. Fix x, y. Consider the 1D function along the line segment:
g(t) = f(x + t(y − x)), t ∈ [0,1]
Because f is convex, g is convex as a function of t.
Convexity of g implies (in 1D) that the secant slope is at least the derivative at the left endpoint:
g(1) ≥ g(0) + g′(0)·(1 − 0)
Compute each term: g(1) = f(y), g(0) = f(x), and by the chain rule g′(0) = ∇f(x) · (y − x).
Substitute:
f(y) ≥ f(x) + ∇f(x) · (y − x)
That is the supporting-hyperplane inequality.
This direction shows the gradient inequality is not just a consequence—it characterizes convexity.
Assume for all x, y:
f(y) ≥ f(x) + ∇f(x) · (y − x)
We want to show:
f(t x + (1 − t) y) ≤ t f(x) + (1 − t) f(y)
One clean route is to apply the gradient inequality twice with the mixture point as the base. Let:
z = t x + (1 − t) y
Apply the inequality with base point z, evaluated at x:
f(x) ≥ f(z) + ∇f(z) · (x − z)
Similarly with base point z, evaluated at y:
f(y) ≥ f(z) + ∇f(z) · (y − z)
Multiply the first by t, the second by (1 − t), and add. Since t(x − z) + (1 − t)(y − z) = t x + (1 − t) y − z = 0, the gradient terms cancel, leaving:
t f(x) + (1 − t) f(y) ≥ f(z)
which is exactly the chord inequality. The key idea: if every tangent plane underestimates f globally, the function cannot bend downward below a chord.
For this lesson, the crucial takeaway is operational: the chord inequality and the supporting-hyperplane inequality are interchangeable handles on the same shape constraint.
For differentiable convex f, the gradient is a monotone operator:
(∇f(x) − ∇f(y)) · (x − y) ≥ 0
You can derive it by writing the supporting-hyperplane inequality twice:
1) f(y) ≥ f(x) + ∇f(x) · (y − x)
2) f(x) ≥ f(y) + ∇f(y) · (x − y)
Add them:
0 ≥ ∇f(x) · (y − x) + ∇f(y) · (x − y)
Rewrite:
0 ≥ (∇f(x) − ∇f(y)) · (y − x)
Multiply by −1:
(∇f(x) − ∇f(y)) · (y − x) ≥ 0
Equivalently:
(∇f(x) − ∇f(y)) · (x − y) ≥ 0
This property underlies convergence proofs for gradient methods and proximal algorithms.
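Gradient monotonicity can be spot-checked in 1D, where it just says the derivative is nondecreasing. A sketch using the softplus f(x) = log(1 + eˣ), a smooth convex function whose derivative is the sigmoid (the sample points are arbitrary):

```python
import math

# Monotonicity of the gradient for a smooth convex f: f(x) = log(1 + e^x).
# Check (f'(x) - f'(y)) * (x - y) >= 0 on sampled pairs.
def grad_softplus(x):
    return 1.0 / (1.0 + math.exp(-x))  # derivative of log(1 + e^x) is the sigmoid

pts = [-4.0, -1.0, 0.0, 0.5, 3.0]
mono = all(
    (grad_softplus(x) - grad_softplus(y)) * (x - y) >= 0.0
    for x in pts for y in pts
)
print(mono)  # True: the sigmoid is nondecreasing, so the gradient is monotone
```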
If f is twice differentiable, convexity is equivalent to:
∇²f(x) ⪰ 0 for all x
Meaning the Hessian matrix is positive semidefinite (PSD). In quadratic form language:
∀ v, vᵀ ∇²f(x) v ≥ 0
This is often the easiest way to check convexity for smooth functions. But the node you’re learning emphasizes the two inequalities (chord and supporting hyperplane), which remain valid even when Hessians are inconvenient or undefined.
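Here is a minimal sketch of the Hessian test for a specific quadratic (the function and test directions are illustrative choices). For a symmetric 2×2 matrix, PSD is equivalent to nonnegative trace and determinant, which avoids a full eigendecomposition:

```python
# Hessian PSD test for f(w1, w2) = w1^2 + w1*w2 + w2^2.
# Its Hessian is constant: H = [[2, 1], [1, 2]].
H = [[2.0, 1.0], [1.0, 2.0]]

# For a symmetric 2x2 matrix, PSD <=> trace >= 0 and determinant >= 0.
trace = H[0][0] + H[1][1]
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
is_psd = trace >= 0 and det >= 0
print(is_psd)  # True, so f is convex on all of R^2

# Spot-check the quadratic form v^T H v >= 0 on a few directions.
for v in [(1.0, 0.0), (1.0, -1.0), (-2.0, 3.0)]:
    q = H[0][0]*v[0]*v[0] + 2*H[0][1]*v[0]*v[1] + H[1][1]*v[1]*v[1]
    assert q >= 0.0
```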
| You want to… | Use… | Why |
|---|---|---|
| Prove an averaging bound | Convex combination inequality | It directly compares f at mixtures |
| Prove a linear lower bound | Supporting-hyperplane inequality | It gives global underestimators |
| Check convexity of smooth formula | Hessian PSD test | Mechanical computation |
In optimization, you’ll constantly switch between them depending on what the proof needs.
One of the most valuable theorems in optimization is:
If f is convex, then every local minimizer is a global minimizer.
Why? Suppose x★ is a local minimizer but not global. Then there exists y with f(y) < f(x★). Consider points on the segment:
z(t) = t y + (1 − t) x★
By convexity:
f(z(t)) ≤ t f(y) + (1 − t) f(x★)
Since f(y) < f(x★), for small t > 0 the right-hand side is strictly less than f(x★), implying points arbitrarily close to x★ have smaller value—contradicting local optimality.
This is the geometric “no false basins” idea.
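The contradiction argument can be watched numerically. A sketch for f(x) = (x − 2)², with an illustrative "candidate" point x₀ and a better point y:

```python
# Illustrate the "no false basins" argument for convex f(x) = (x - 2)^2:
# if f(y) < f(x0), points on the segment toward y immediately improve on x0.
def f(x):
    return (x - 2.0) ** 2

x0, y = 5.0, 2.5            # f(y) = 0.25 < f(x0) = 9.0
for t in [0.01, 0.001, 1e-6]:
    z = t * y + (1 - t) * x0
    assert f(z) < f(x0)     # arbitrarily close to x0, yet strictly better
print("segment toward the better point improves immediately")
```

This is exactly why x₀ could not have been a local minimizer: every neighborhood of it contains strictly better points along the segment.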
In general (nonconvex), ∇f(x) = 0 might be a max, min, or saddle.
For differentiable convex f, the condition is stronger:
∇f(x★) = 0 ⇒ x★ is a global minimizer.
Proof using supporting hyperplane:
f(y) ≥ f(x★) + ∇f(x★) · (y − x★)
If ∇f(x★) = 0, then:
f(y) ≥ f(x★)
So nothing beats x★.
This is why convexity pairs so well with gradient-based methods: stationary points are safe.
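A minimal sketch of this safety property in action: plain gradient descent on a convex 1D function (the function, starting point, and step size are arbitrary choices) converges to the unique stationary point, which convexity guarantees is the global minimizer:

```python
# Gradient descent on the convex function f(x) = (x - 3)^2. Because f is
# convex, the only stationary point (x = 3) is the global minimizer, so
# the iteration cannot get trapped anywhere else.
def grad(x):
    return 2.0 * (x - 3.0)

x, step = -10.0, 0.25
for _ in range(200):
    x -= step * grad(x)   # x_{k+1} = x_k - step * f'(x_k)
print(round(x, 6))  # 3.0
```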
Many common losses are convex in the model’s linear predictions or in parameters for linear models.
Examples: squared error, logistic loss, and hinge loss are all convex in the model's linear predictions.
Convexity doesn’t automatically make deep networks convex (they aren’t), but it remains crucial for sub-problems, surrogate bounds, and regularization.
Regularizers are often chosen to be convex because they preserve convexity of the overall objective.
If you minimize:
F(w) = loss(w) + λ·R(w)
and both loss and R are convex, then F is convex.
Typical convex regularizers include the ℓ₁ norm ‖w‖₁ and the squared ℓ₂ norm ‖w‖².
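As a spot-check, here is a minimal numerical sketch (the 1D objective, penalty weight, and grid are illustrative choices) that a convex loss plus a convex penalty still satisfies the chord inequality:

```python
# A 1D regularized objective: squared-error loss plus an l1 penalty,
# F(w) = (w - 1)^2 + lam * |w|. Both pieces are convex, so F is convex.
lam = 0.5

def F(w):
    return (w - 1.0) ** 2 + lam * abs(w)

# Spot-check the chord inequality for F on a grid.
grid = [i * 0.25 for i in range(-12, 13)]
ok = all(
    F(t * x + (1 - t) * y) <= t * F(x) + (1 - t) * F(y) + 1e-9
    for x in grid for y in grid for t in (0.2, 0.5, 0.8)
)
print(ok)  # True
```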
The supporting-hyperplane inequality gives a reusable bound:
f(y) ≥ f(x) + ∇f(x)·(y − x)
In algorithms, you often set x to the current iterate and y to the optimum (unknown), then manipulate the inequality to bound suboptimality.
It also motivates methods that iteratively minimize a simple surrogate built from this linear lower bound.
This idea is a stepping stone to gradient descent analysis and to proximal-gradient methods.
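The suboptimality manipulation can be made concrete. Setting y = x★ in the supporting-hyperplane inequality and rearranging gives f(xₖ) − f(x★) ≤ ∇f(xₖ) · (xₖ − x★). A sketch for a simple quadratic (the test points are arbitrary):

```python
# The supporting-hyperplane inequality with y = x* (the minimizer) rearranges
# to a suboptimality bound: f(x_k) - f(x*) <= grad f(x_k) . (x_k - x*).
# Check it for f(x) = (x - 3)^2, whose minimizer is x* = 3.
def f(x):
    return (x - 3.0) ** 2

def grad(x):
    return 2.0 * (x - 3.0)

x_star = 3.0
for x_k in [-2.0, 0.0, 2.9, 10.0]:
    subopt = f(x_k) - f(x_star)
    bound = grad(x_k) * (x_k - x_star)
    assert subopt <= bound + 1e-12   # here bound = 2*(x_k - 3)^2 = 2*subopt
print("suboptimality bound holds at every test point")
```

This is the basic move behind many gradient-descent convergence proofs: the unknown optimum appears only through a quantity the current iterate can control.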
In short: convexity is less about memorizing a definition and more about learning a reliable shape constraint you can exploit everywhere.
Let f(x) = x² on ℝ. Show that for any x, y ∈ ℝ and t ∈ [0,1], f(t x + (1−t) y) ≤ t f(x) + (1−t) f(y).
Start from the convexity inequality we want:
(t x + (1−t) y)² ≤ t x² + (1−t) y².
Expand the left-hand side:
(t x + (1−t) y)²
= t² x² + 2t(1−t) x y + (1−t)² y².
Move everything to the right-hand side (right minus left):
[t x² + (1−t) y²] − [t² x² + 2t(1−t) x y + (1−t)² y²].
Group x² terms and y² terms:
= (t − t²) x² + ((1−t) − (1−t)²) y² − 2t(1−t) x y.
Simplify coefficients:
(t − t²) = t(1−t)
( (1−t) − (1−t)² ) = (1−t)t = t(1−t).
So the expression becomes:
= t(1−t) x² + t(1−t) y² − 2t(1−t) x y
= t(1−t)(x² − 2xy + y²)
= t(1−t)(x − y)².
Since t ∈ [0,1], we have t(1−t) ≥ 0, and (x−y)² ≥ 0.
Therefore t(1−t)(x−y)² ≥ 0.
Thus [right − left] ≥ 0, which implies:
(t x + (1−t) y)² ≤ t x² + (1−t) y².
So f(x) = x² is convex.
Insight: This proof shows a common pattern: after expanding, convexity often reduces to a nonnegative square like (x−y)². Convexity is frequently “hidden” nonnegativity.
Let f(x) = ‖x‖² on ℝⁿ. Compute ∇f(x) and verify f(y) ≥ f(x) + ∇f(x)·(y−x).
Write f explicitly:
f(x) = x·x = ∑ᵢ xᵢ².
Compute the gradient:
∇f(x) = 2x
(because ∂/∂xᵢ of ∑ⱼ xⱼ² is 2xᵢ).
Plugging into the supporting-hyperplane inequality, we need to show:
‖y‖² ≥ ‖x‖² + (2x)·(y−x).
Expand the right-hand side:
‖x‖² + 2x·y − 2x·x
= ‖x‖² + 2x·y − 2‖x‖²
= 2x·y − ‖x‖².
So the inequality becomes:
‖y‖² ≥ 2x·y − ‖x‖².
Bring all terms to the left:
‖y‖² − 2x·y + ‖x‖² ≥ 0.
Recognize the left-hand side as a squared norm:
‖y − x‖² = (y−x)·(y−x)
= ‖y‖² − 2x·y + ‖x‖².
Thus the inequality is:
‖y − x‖² ≥ 0,
which is always true.
Therefore f(x) = ‖x‖² satisfies the supporting-hyperplane inequality and is convex.
Insight: For squared norm objectives, convexity and many optimization inequalities reduce to the identity ‖y−x‖² ≥ 0. This is why least-squares problems are so well-behaved.
Assume f and g are convex functions on a convex domain in ℝⁿ. Define h(x) = max(f(x), g(x)). Prove h is convex using the convex combination inequality.
Take any x, y and any t ∈ [0,1]. Let z = t x + (1−t) y.
Because f is convex:
f(z) ≤ t f(x) + (1−t) f(y).
Because g is convex:
g(z) ≤ t g(x) + (1−t) g(y).
Bound each branch separately. Since f(x) ≤ h(x) and f(y) ≤ h(y):
f(z) ≤ t f(x) + (1−t) f(y) ≤ t h(x) + (1−t) h(y).
Similarly, since g(x) ≤ h(x) and g(y) ≤ h(y):
g(z) ≤ t g(x) + (1−t) g(y) ≤ t h(x) + (1−t) h(y).
Both f(z) and g(z) are bounded by the same quantity, so their maximum is too.
Substitute h:
h(z) = max(f(z), g(z))
≤ t h(x) + (1−t) h(y).
This matches the convex combination inequality, so h is convex.
Insight: Pointwise maxima preserve convexity because “being below a chord” is stable under taking the upper envelope of convex graphs—useful for hinge losses and robust objectives.
Convexity is a global shape constraint: for any x, y, the graph of f lies below the chord between (x, f(x)) and (y, f(y)).
Formal definition: f(t x + (1−t) y) ≤ t f(x) + (1−t) f(y) for all t ∈ [0,1].
If f is differentiable, convexity ⇔ supporting-hyperplane inequality: f(y) ≥ f(x) + ∇f(x)·(y−x).
Supporting hyperplanes give global linear lower bounds; this is the bridge from convex geometry to gradient-based optimization proofs.
For differentiable convex f, the gradient is monotone: (∇f(x)−∇f(y))·(x−y) ≥ 0.
Convexity is preserved under nonnegative sums, adding affine functions, pointwise maxima, and affine changes of variables.
In convex problems, local minima are global; for differentiable convex f, ∇f(x★)=0 is sufficient for global optimality.
Mixing up directions: convex means f at a mixture is ≤ mixture of f values; concave flips the inequality.
Assuming “second derivative ≥ 0” always applies: many convex functions (e.g., |x|) are not twice differentiable everywhere, yet are still convex.
Checking convexity only at t = 1/2 (midpoints) without additional assumptions; full convexity requires all t ∈ [0,1] (or continuity plus midpoint convexity).
Forgetting domain convexity: if the domain isn’t convex, the definition can’t even be applied to all segments.
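The |x| pitfall above is worth verifying directly: no second derivative exists at 0, yet the chord inequality holds everywhere, including across the kink. A minimal sketch (grid and weights are arbitrary):

```python
# f(x) = |x| has no second derivative at 0, yet it is convex.
# Spot-check the chord inequality across the kink.
def f(x):
    return abs(x)

grid = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
ok = all(
    f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-12
    for x in grid for y in grid for t in (0.25, 0.5, 0.75)
)
print(ok)  # True: |x| satisfies the chord inequality everywhere
```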
Show that an affine function a(x) = c·x + d is convex on ℝⁿ using the convex combination inequality.
Hint: Compute a(t x + (1−t) y) and compare to t a(x) + (1−t) a(y). Expect equality.
Let a(x) = c·x + d. Then:
a(t x + (1−t) y)
= c·(t x + (1−t) y) + d
= t (c·x) + (1−t)(c·y) + d.
Also:
t a(x) + (1−t) a(y)
= t(c·x + d) + (1−t)(c·y + d)
= t(c·x) + (1−t)(c·y) + [t d + (1−t) d]
= t(c·x) + (1−t)(c·y) + d.
So a(t x + (1−t) y) = t a(x) + (1−t) a(y) for all x, y, t. Therefore a is convex (and concave).
Let f be differentiable and convex. Prove that if ∇f(x★) = 0, then x★ is a global minimizer.
Hint: Use the supporting-hyperplane inequality with x = x★ and arbitrary y.
Since f is convex and differentiable, for all y:
f(y) ≥ f(x★) + ∇f(x★)·(y − x★).
Given ∇f(x★) = 0, this becomes:
f(y) ≥ f(x★) + 0·(y − x★) = f(x★).
Thus no y has a smaller value than f(x★), so x★ is a global minimizer.
Assume f and g are convex on a convex domain. Show that F(x) = α f(x) + β g(x) is convex for α, β ≥ 0.
Hint: Apply the convex combination inequality to f and g separately, then multiply by α and β and add.
Take any x, y and t ∈ [0,1]. Since f is convex:
f(t x + (1−t) y) ≤ t f(x) + (1−t) f(y).
Similarly for g:
g(t x + (1−t) y) ≤ t g(x) + (1−t) g(y).
Multiply the first inequality by α ≥ 0 and the second by β ≥ 0 and add:
α f(t x + (1−t) y) + β g(t x + (1−t) y)
≤ t[α f(x) + β g(x)] + (1−t)[α f(y) + β g(y)].
The left-hand side is F(t x + (1−t) y) and the right-hand side is t F(x) + (1−t) F(y). Therefore F is convex.