Optimization with equality constraints. Gradient parallelism.
Many real problems aren’t “optimize freely”; they’re “optimize while obeying a rule.” Lagrange multipliers give a clean geometric and algebraic way to solve equality-constrained optimization by turning “stay on this surface” into a condition about gradients lining up.
To optimize f(x) subject to g(x) = c, look for points where ∇f(x) = λ ∇g(x) and g(x) = c. This says: at a constrained optimum, the objective’s gradient is parallel to the constraint’s gradient. Equivalent viewpoint: build the Lagrangian ℒ(x, λ) = f(x) − λ(g(x) − c) and solve ∇xℒ = 0 plus the constraint.
Unconstrained optimization asks: how do you find x that minimizes (or maximizes) f(x)? Candidates are the stationary points where ∇f(x) = 0.
But many optimization problems have equality constraints: optimize f(x) subject to g(x) = c.
The constraint restricts you to a feasible set—typically a curve (in 2D), a surface (in 3D), or a manifold/hypersurface (in higher dimensions). On that set, the “best” point is not usually where ∇f(x) = 0, because you’re not allowed to move in all directions.
So we need a different optimality condition that respects the constraint.
Imagine you’re walking along the constraint curve g(x) = c. At a best feasible point, you can’t move along the curve to improve f.
That means: to first order, f cannot change as you slide along the curve.
In gradient language: ∇f(x)·d = 0 for every direction d tangent to the curve at x.
So at a constrained extremum, ∇f must have no component in any tangent direction.
What vectors are orthogonal to every tangent direction? The normal vector to the constraint surface.
And what gives a normal vector to the level set g(x) = c? The gradient ∇g.
That yields the geometric condition: ∇f(x*) is parallel to ∇g(x*).
Parallel means “same direction up to scaling,” so: ∇f(x) = λ ∇g(x) for some scalar λ.
λ is the Lagrange multiplier.
For one equality constraint:
Optimize f(x) subject to g(x) = c.
If x* is a solution (under regularity conditions like ∇g(x*) ≠ 0), then there exists λ such that: ∇f(x*) = λ ∇g(x*) and g(x*) = c.
This converts the constrained problem into a system of equations.
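For a concrete taste of that system, here is a minimal NumPy sketch. The toy problem (minimize f(x, y) = x² + 2y² subject to x + y = 1) is illustrative, not taken from the text above; since both the stationarity equations and the constraint happen to be linear here, one linear solve finds (x, y, λ):

```python
import numpy as np

# Minimize f(x, y) = x^2 + 2y^2 subject to x + y = 1 (illustrative toy problem).
# Stationarity grad f = lam * grad g gives 2x - lam = 0 and 4y - lam = 0;
# together with the constraint, that is three linear equations in (x, y, lam).
A = np.array([
    [2.0, 0.0, -1.0],   # 2x - lam = 0
    [0.0, 4.0, -1.0],   # 4y - lam = 0
    [1.0, 1.0,  0.0],   # x + y = 1
])
b = np.array([0.0, 0.0, 1.0])

x, y, lam = np.linalg.solve(A, b)
print(x, y, lam)  # x = 2/3, y = 1/3, lam = 4/3
```

Nonlinear objectives or constraints give a nonlinear system instead, but the shape of the problem (n + 1 equations in n + 1 unknowns) is the same.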
λ is not just a trick variable. It often measures sensitivity: how much the optimal value changes if you relax the constraint slightly.
Define the optimal value function: v(c) = the optimal value of f subject to g(x) = c.
Under suitable conditions, one can show: dv/dc = λ.
Interpretation: λ is the “shadow price” of tightening/loosening the constraint.
Lagrange multipliers are designed for: smooth objectives with one or more equality constraints gᵢ(x) = cᵢ.
If you have inequalities (g(x) ≤ c), you need the generalized framework: KKT Conditions.
One picture captures the method: the level set of f tangent to the constraint surface, with the two gradients lined up along the common normal. That single picture explains the method—and helps you remember why it works.
Suppose you’re constrained to g(x) = c, and you’re at some feasible point x.
A feasible direction d is one that keeps you on the constraint to first order. Using a first-order approximation: g(x + εd) ≈ g(x) + ε ∇g(x)·d.
To stay feasible (keep g = c) for small ε, we need: ∇g(x)·d = 0.
So feasible directions d are exactly the vectors orthogonal to ∇g(x). These directions form the tangent space to the constraint surface.
Now look at how the objective changes: f(x + εd) ≈ f(x) + ε ∇f(x)·d.
At a constrained local optimum, we can’t improve f by moving in any feasible direction. That means: ∇f(x)·d = 0 for every feasible direction d.
In linear algebra terms: ∇f(x*) is orthogonal to the tangent space of the constraint surface.
The orthogonal complement of the tangent space is the span of the normal vector ∇g. Therefore: ∇f(x*) ∈ span{∇g(x*)}.
So there exists λ such that: ∇f(x*) = λ ∇g(x*).
This is the “gradient parallelism” condition.
If ∇g(x*) = 0, the constraint surface may not have a well-defined normal at x* (think cusp or degenerate point). Then the simple condition may fail or be incomplete.
A standard regularity assumption for one constraint is: ∇g(x*) ≠ 0.
For multiple constraints, you need gradients of constraints to be linearly independent (more later).
Unconstrained optimum: ∇f(x*) = 0.
Constrained optimum: ∇f(x*) doesn’t need to be 0. It only needs to have no component tangent to the feasible set.
Geometrically: at x*, the level set of f is tangent to the constraint surface, and the two gradients point along the same normal line.
If x ∈ ℝ² with constraint g(x, y) = c, the method yields:
∂f/∂x = λ ∂g/∂x
∂f/∂y = λ ∂g/∂y
g(x, y) = c
That’s 3 equations in 3 unknowns (x, y, λ).
In ℝⁿ, it becomes n + 1 equations in n + 1 unknowns.
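When that system is nonlinear, one standard tactic is Newton's method on the combined map F(x, λ) = (∇f − λ∇g, g − c). A hedged sketch on an illustrative problem (minimize eˣ + eʸ subject to x + y = 2, whose solution is x = y = 1 with λ = e):

```python
import numpy as np

# Newton's method on the stationarity system for an illustrative problem:
# minimize f(x, y) = e^x + e^y subject to x + y = 2.
# Unknowns z = (x, y, lam); F(z) = 0 encodes grad f = lam * grad g and g = c.
def F(z):
    x, y, lam = z
    return np.array([np.exp(x) - lam,   # df/dx - lam * dg/dx = 0
                     np.exp(y) - lam,   # df/dy - lam * dg/dy = 0
                     x + y - 2.0])      # constraint g(x, y) = c

def J(z):  # Jacobian of F
    x, y, lam = z
    return np.array([[np.exp(x), 0.0, -1.0],
                     [0.0, np.exp(y), -1.0],
                     [1.0, 1.0, 0.0]])

z = np.array([0.0, 0.0, 1.0])           # starting guess
for _ in range(20):                     # Newton steps: z <- z - J^{-1} F
    z = z - np.linalg.solve(J(z), F(z))

x, y, lam = z
print(x, y, lam)  # converges to x = y = 1, lam = e
```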
If you have m constraints: gᵢ(x) = cᵢ for i = 1, …, m.
Then the feasible set is the intersection of m level sets. The tangent space is orthogonal to each ∇gᵢ, so the normal space is spanned by them.
The condition generalizes to: ∇f(x*) = ∑ᵢ λᵢ ∇gᵢ(x*), together with all constraints gᵢ(x*) = cᵢ.
This is the same “gradient lives in the span of constraint gradients” idea.
| Setting | Condition | Unknown multipliers |
|---|---|---|
| 1 constraint g(x) = c | ∇f = λ ∇g | λ ∈ ℝ |
| m constraints gᵢ(x) = cᵢ | ∇f = ∑ λᵢ ∇gᵢ | λ ∈ ℝᵐ |
Solving the Lagrange equations gives candidates. You still need to: evaluate f at each candidate, compare values, and (if needed) use second-order reasoning to classify them.
In constrained problems, it’s common to find: maxima, minima, and saddle-like points all appearing among the candidates.
So treat the gradient parallelism equations as a principled way to generate a shortlist of candidates.
Gradient parallelism is a geometric statement, but algebraically it’s nicer to express everything as “take derivatives and set them to zero.”
The Lagrangian packages the objective and constraint into one scalar function whose stationary points encode the Lagrange conditions.
For one constraint g(x) = c, define: ℒ(x, λ) = f(x) − λ(g(x) − c).
(You’ll also see ℒ = f + λ(g − c). The sign convention doesn’t matter as long as you’re consistent; λ will flip sign.)
Take partial derivatives: ∇xℒ = ∇f(x) − λ ∇g(x), and ∂ℒ/∂λ = −(g(x) − c).
Set them to zero:
1) ∇xℒ = 0 ⇒ ∇f(x) − λ ∇g(x) = 0 ⇒ ∇f(x) = λ ∇g(x)
2) ∂ℒ/∂λ = 0 ⇒ g(x) − c = 0 ⇒ g(x) = c
So stationary points of ℒ correspond exactly to the Lagrange multiplier equations.
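That equivalence is easy to sanity-check numerically: at a known solution, finite-difference partials of ℒ should all vanish. A small sketch using an illustrative problem (minimize x² + y² subject to x + y = 1, solution x = y = 1/2 with λ = 1):

```python
# Finite-difference check that the Lagrangian is stationary at the solution
# of the illustrative problem: f = x^2 + y^2 with constraint x + y = 1.
def L(x, y, lam):
    return x**2 + y**2 - lam * (x + y - 1.0)

h = 1e-6
x, y, lam = 0.5, 0.5, 1.0  # known solution and multiplier

# Central differences approximate the three partial derivatives of L.
dLdx = (L(x + h, y, lam) - L(x - h, y, lam)) / (2 * h)
dLdy = (L(x, y + h, lam) - L(x, y - h, lam)) / (2 * h)
dLdlam = (L(x, y, lam + h) - L(x, y, lam - h)) / (2 * h)
print(dLdx, dLdy, dLdlam)  # all approximately 0
```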
Think of λ as adjusting the penalty for violating the constraint. At the solution, λ tunes the trade-off so that a stationary point occurs exactly on the feasible set.
This is different from naive penalty methods like minimizing f(x) + ρ(g(x) − c)² for a large ρ: the penalty enforces the constraint only approximately and becomes ill-conditioned as ρ grows, whereas the multiplier makes the stationary point satisfy g(x) = c exactly.
For constraints gᵢ(x) = cᵢ: ℒ(x, λ) = f(x) − ∑ᵢ λᵢ(gᵢ(x) − cᵢ).
Stationarity gives: ∇f(x) = ∑ᵢ λᵢ ∇gᵢ(x), together with gᵢ(x) = cᵢ for each i.
Solving first-order conditions gives candidates. To classify them, you need to understand curvature along feasible directions.
A practical approach: solve the first-order conditions for a shortlist of candidates, then compare objective values directly (or reason geometrically) to classify them.
It’s worth knowing a bit of the second-order structure without turning this into a full course:
Let d be a feasible direction at x*, i.e., ∇g(x*)·d = 0.
For a constrained local minimum, you expect: the objective to curve upward along every such feasible direction, with the constraint’s own curvature taken into account.
The clean way to incorporate constraint curvature uses the Hessian of the Lagrangian: ∇²xℒ(x, λ) = ∇²f(x) − λ ∇²g(x).
Then you examine dᵀ ∇²xℒ d over tangent directions d.
You don’t always need this in practice, but it’s the right object to look at: once you’ve “encoded” the constraint in ℒ, the Hessian of ℒ governs local behavior on the manifold.
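A brief sketch of that curvature test, using the candidate (1/√2, 1/√2) for f(x, y) = x y on the unit circle (worked in full later in this lesson), where λ = 1/2:

```python
import numpy as np

# Second-order check at a candidate of the illustrative problem
# f(x, y) = x*y on the unit circle x^2 + y^2 = 1, point (1/sqrt(2), 1/sqrt(2)).
lam = 0.5
Hf = np.array([[0.0, 1.0], [1.0, 0.0]])   # Hessian of x*y
Hg = np.array([[2.0, 0.0], [0.0, 2.0]])   # Hessian of x^2 + y^2
H = Hf - lam * Hg                         # Hessian of the Lagrangian

# Tangent direction: orthogonal to grad g = (2x, 2y), which points along (1, 1).
d = np.array([1.0, -1.0]) / np.sqrt(2.0)

curvature = d @ H @ d
print(curvature)  # -2.0: negative along the tangent, so a constrained maximum
```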
Even if you don’t solve analytically, the Lagrangian viewpoint drives algorithms: dual ascent, augmented Lagrangian methods (the method of multipliers), and sequential quadratic programming all iterate on (x, λ) jointly.
These will reappear when you study Lagrangian Duality and KKT Conditions.
1) Write constraint(s) in standard form g(x) = c.
2) Form ℒ(x, λ) = f(x) − ∑ λᵢ(gᵢ(x) − cᵢ).
3) Solve: ∇xℒ = 0 and gᵢ(x) = cᵢ for all i.
4) Evaluate candidates and pick the correct optimum type.
The method is compact, systematic, and scales naturally to multiple constraints—exactly why it’s a cornerstone for later topics like KKT and duality.
Many ML training problems are constrained, either explicitly or implicitly: norm or budget constraints on parameters, probability-simplex constraints on distributions, margin constraints in SVMs.
Even when the original problem is unconstrained, constraints appear when you derive dual problems or enforce structure.
In a (hard-margin) linear SVM, you choose a separating hyperplane (w, b) that maximizes margin subject to classification constraints.
A common equivalent formulation:
Minimize ½‖w‖² subject to
yᵢ(wᵀxᵢ + b) ≥ 1 for every training pair (xᵢ, yᵢ).
Those are inequality constraints, so the full story uses KKT. But the core pattern is identical: introduce one multiplier per constraint, form a Lagrangian, and solve stationarity plus the constraint conditions.
Even before inequalities, the equality-constraint case teaches the essential muscle memory: “Introduce λ, take derivatives, solve stationarity + constraints.”
KKT conditions generalize Lagrange multipliers to include inequalities g(x) ≤ 0 via multipliers αᵢ ≥ 0 and complementary slackness: αᵢ gᵢ(x) = 0 for each i, so a multiplier can be nonzero only when its constraint is active.
If you understand equality constraints well, KKT feels like a small extension rather than a new world.
In equality constraints, we usually solve for λ as part of the stationarity system.
In duality, you instead treat λ as defining a lower bound (for minimization) and optimize that bound: maximize q(λ) = infₓ ℒ(x, λ) over λ.
The Lagrangian is the bridge between primal and dual.
Consider a parameterized constraint g(x) = c. If λ* is large in magnitude at the solution, the optimum is very sensitive to changes in c.
A simple mental model: λ measures how hard the constraint pushes back; a large |λ| means even a small change in c noticeably shifts the optimal value.
In economics and resource allocation, λ is literally the “value” of an additional unit of resource.
Suppose you optimize over probability vectors p ∈ ℝⁿ with constraint ∑ᵢ pᵢ = 1.
That’s an equality constraint. The gradient-parallelism idea says: ∇f(p) = λ ∇(∑ᵢ pᵢ).
But ∇(∑ᵢ pᵢ) = 1 (the all-ones vector). So the condition becomes: ∂f/∂pᵢ = λ for every i.
Meaning: at optimum (with only that constraint), all partial derivatives must be equal. This is a powerful structural fact you can reuse.
(If you also require pᵢ ≥ 0, that’s inequality territory → KKT.)
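A quick illustration of the "all partials equal" fact (assuming all pᵢ > 0, so only the equality constraint matters): maximizing the entropy −∑ pᵢ log pᵢ on the simplex. Each partial is −log pᵢ − 1, so equal partials force the uniform distribution:

```python
import numpy as np

# Entropy maximization on the probability simplex (illustrative example).
# f(p) = -sum_i p_i log p_i with constraint sum_i p_i = 1 and all p_i > 0.
# df/dp_i = -log(p_i) - 1, and gradient parallelism demands these be equal.
n = 4
p = np.full(n, 1.0 / n)        # the uniform candidate p_i = 1/n
partials = -np.log(p) - 1.0    # gradient of the entropy at p
print(partials)                # all entries equal; the shared value is lam
```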
You’re likely in this regime if: the solution keeps every pᵢ strictly positive, so the nonnegativity constraints never activate and the equality-only analysis suffices.
Another way to interpret Lagrange multipliers: λ ∇g is exactly the normal component of ∇f that the constraint absorbs at the optimum.
Equivalently: the projection of ∇f onto the tangent space of the constraint surface is zero.
This “projection to tangent space” language shows up in manifold optimization and constrained gradient methods.
Lagrange multipliers are the conceptual seed for all three: KKT conditions, Lagrangian duality, and constrained/manifold gradient methods.
Optimize f(x, y) = x y subject to g(x, y) = x² + y² = 1. This is “best product on the unit circle.”
Write the gradients:
∇f = (∂/∂x (x y), ∂/∂y (x y)) = (y, x)
∇g = (∂/∂x (x² + y²), ∂/∂y (x² + y²)) = (2x, 2y)
Set the Lagrange condition ∇f = λ ∇g:
(y, x) = λ(2x, 2y)
So:
y = 2λ x
x = 2λ y
Solve the two equations:
From y = 2λ x, substitute into x = 2λ y:
x = 2λ(2λ x) = 4λ² x
⇒ x(1 − 4λ²) = 0
So either x = 0 or 1 − 4λ² = 0
Case A: x = 0
Constraint gives: 0² + y² = 1 ⇒ y = ±1
Then f = x y = 0
Case B: 1 − 4λ² = 0 ⇒ λ = ±1/2
Using y = 2λ x:
If λ = 1/2, then y = x
If λ = −1/2, then y = −x
Subcase B1: y = x
Constraint: x² + x² = 1 ⇒ 2x² = 1 ⇒ x = ±1/√2
So points: (1/√2, 1/√2) and (−1/√2, −1/√2)
Objective: f = x y = x² = 1/2
Subcase B2: y = −x
Constraint: x² + (−x)² = 1 ⇒ 2x² = 1 ⇒ x = ±1/√2
Points: (1/√2, −1/√2) and (−1/√2, 1/√2)
Objective: f = x y = −x² = −1/2
Compare objective values:
Max value is 1/2 at (±1/√2, ±1/√2) with matching signs.
Min value is −1/2 at (±1/√2, ∓1/√2) with opposite signs.
The x=0 points give 0, neither max nor min globally.
Insight: On the unit circle, the best way to maximize x y is to balance x and y equally (same sign), i.e., sit on the line y = x. The gradients line up exactly at the tangency points between level sets of x y and the circle.
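The four nonzero candidates can be double-checked numerically; each should satisfy ∇f = λ ∇g with the λ found in its case:

```python
import numpy as np

# Numeric check of the example's candidates: f = x*y on x^2 + y^2 = 1.
def grad_f(x, y): return np.array([y, x])        # gradient of x*y
def grad_g(x, y): return np.array([2*x, 2*y])    # gradient of x^2 + y^2

s = 1 / np.sqrt(2)
# Matching-sign points: gradient condition holds with lam = 1/2, f = 1/2.
for (x, y) in [(s, s), (-s, -s)]:
    assert np.allclose(grad_f(x, y), 0.5 * grad_g(x, y))
    assert np.isclose(x * y, 0.5)
# Opposite-sign points: gradient condition holds with lam = -1/2, f = -1/2.
for (x, y) in [(s, -s), (-s, s)]:
    assert np.allclose(grad_f(x, y), -0.5 * grad_g(x, y))
    assert np.isclose(x * y, -0.5)
print("all four candidates verified")
```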
Minimize distance to origin subject to a linear equality. Let the constraint be a x + b y = 1 (with constants a, b not both 0). Minimize f(x, y) = x² + y² subject to g(x, y) = a x + b y = 1.
Compute gradients:
∇f = (2x, 2y)
∇g = (a, b)
Set ∇f = λ ∇g:
(2x, 2y) = λ(a, b)
So:
2x = λ a ⇒ x = (λ a)/2
2y = λ b ⇒ y = (λ b)/2
Enforce the constraint a x + b y = 1:
a(λ a/2) + b(λ b/2) = 1
⇒ (λ/2)(a² + b²) = 1
⇒ λ = 2/(a² + b²)
Substitute λ back:
x* = (λ a)/2 = ( (2/(a² + b²)) a )/2 = a/(a² + b²)
y* = b/(a² + b²)
Compute the minimum value:
f(x, y) = x² + y²
= a²/(a² + b²)² + b²/(a² + b²)²
= (a² + b²)/(a² + b²)²
= 1/(a² + b²)
Insight: The minimizer is the orthogonal projection of the origin onto the line a x + b y = 1. The multiplier λ scales the normal vector (a, b) so that the point lands exactly on the constraint line.
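A quick numeric check of the closed form, with illustrative constants a = 3, b = 4:

```python
# Verify the closed-form minimizer of x^2 + y^2 on the line a*x + b*y = 1
# for the illustrative choice a = 3, b = 4.
a, b = 3.0, 4.0
denom = a**2 + b**2
x_star, y_star = a / denom, b / denom   # projection of the origin onto the line

print(x_star, y_star)              # (0.12, 0.16)
print(a * x_star + b * y_star)     # 1.0: the point lies on the line
print(x_star**2 + y_star**2)       # 0.04 = 1/(a^2 + b^2), the minimum value
```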
Optimize f(x, y, z) = x + y + z subject to g₁(x, y, z) = x² + y² + z² = 1 and g₂(x, y, z) = x − y = 0.
Compute gradients:
∇f = (1, 1, 1)
∇g₁ = (2x, 2y, 2z)
∇g₂ = (1, −1, 0)
Use the multi-constraint condition:
∇f = λ₁ ∇g₁ + λ₂ ∇g₂
So:
(1, 1, 1) = λ₁(2x, 2y, 2z) + λ₂(1, −1, 0)
Write component equations:
1 = 2λ₁ x + λ₂
1 = 2λ₁ y − λ₂
1 = 2λ₁ z
Also apply constraints:
x² + y² + z² = 1
x − y = 0 ⇒ x = y
Use x = y in the first two component equations:
1 = 2λ₁ x + λ₂
1 = 2λ₁ x − λ₂
Subtract the second from the first:
0 = 2λ₂ ⇒ λ₂ = 0
With λ₂ = 0, we have:
1 = 2λ₁ x
1 = 2λ₁ y
1 = 2λ₁ z
So x = y = z
Let x = y = z = t. Apply the sphere constraint:
3t² = 1 ⇒ t = ±1/√3
Evaluate f:
If t = 1/√3, then f = 3(1/√3) = √3 (maximum)
If t = −1/√3, then f = −√3 (minimum)
Insight: With two constraints, ∇f must lie in the span of two normals. Here symmetry and the constraint x = y force the optimum to the equal-coordinates direction, then the sphere fixes the magnitude.
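A short check that the claimed maximizer satisfies both constraints and the two-multiplier condition (with λ₁ = √3/2 and λ₂ = 0, as derived above):

```python
import numpy as np

# Verify the two-constraint solution x = y = z = 1/sqrt(3):
# grad f = lam1 * grad g1 + lam2 * grad g2 with lam1 = sqrt(3)/2, lam2 = 0.
t = 1 / np.sqrt(3)
x = y = z = t
lam1, lam2 = np.sqrt(3) / 2, 0.0

grad_f = np.array([1.0, 1.0, 1.0])       # gradient of x + y + z
grad_g1 = np.array([2*x, 2*y, 2*z])      # gradient of the sphere constraint
grad_g2 = np.array([1.0, -1.0, 0.0])     # gradient of x - y

assert np.allclose(grad_f, lam1 * grad_g1 + lam2 * grad_g2)
assert np.isclose(x**2 + y**2 + z**2, 1.0) and np.isclose(x - y, 0.0)
print(x + y + z)  # sqrt(3), the maximum value
```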
Equality-constrained optimization asks for extrema of f(x) restricted to the surface g(x) = c.
At a regular constrained extremum x* (with ∇g(x*) ≠ 0), the gradients satisfy ∇f(x*) = λ ∇g(x*): gradient parallelism.
Geometric reason: feasible (tangent) directions d satisfy ∇g·d = 0, and at optimum you need ∇f·d = 0 for all such d.
The Lagrangian ℒ(x, λ) = f(x) − λ(g(x) − c) converts the constrained first-order conditions into stationary conditions: ∇xℒ = 0, ∂ℒ/∂λ = 0.
With multiple equality constraints, the condition becomes ∇f = ∑ λᵢ ∇gᵢ; ∇f lies in the span of constraint gradients.
Solving Lagrange equations gives candidate points; you still must compare objective values or use second-order reasoning to decide max/min.
The multiplier λ often has sensitivity meaning: it relates to how the optimal value changes as the constraint constant c changes.
Lagrange multipliers are the foundation for KKT (inequalities), SVM derivations, and Lagrangian duality.
Forgetting to enforce the constraint after solving ∇f = λ ∇g (you must also satisfy g(x) = c).
Assuming every solution to the Lagrange equations is a maximum/minimum without checking (some are minima, maxima, or neither on the constraint).
Ignoring degeneracy: if ∇g(x*) = 0 (or multiple constraints are not independent), the standard conditions may be incomplete or misleading.
Mixing sign conventions in the Lagrangian and then interpreting λ incorrectly (the equations still work, but λ’s sign flips).
Find the maximum and minimum of f(x, y) = x² + y² subject to the constraint x + y = 1.
Hint: Use ∇f = (2x, 2y) and ∇g = (1, 1). Solve (2x, 2y) = λ(1, 1) plus x + y = 1. Then think about whether a maximum exists on an unbounded line.
Let g(x, y) = x + y = 1. Then ∇f = (2x, 2y), ∇g = (1, 1).
Lagrange condition:
(2x, 2y) = λ(1, 1)
⇒ 2x = λ and 2y = λ ⇒ x = y.
Constraint x + y = 1 ⇒ 2x = 1 ⇒ x = 1/2, y = 1/2.
Then f = (1/2)² + (1/2)² = 1/2.
This is the minimum (closest point to origin on the line). There is no maximum because along the line x + y = 1 you can take x → ∞, y = 1 − x → −∞, and x² + y² → ∞.
Maximize f(x, y) = x + 2y subject to x² + y² = 5.
Hint: The constraint is a circle. Solve ∇f = λ ∇g, where g(x, y) = x² + y².
Let g(x, y) = x² + y² = 5. Then ∇f = (1, 2), ∇g = (2x, 2y).
Set (1, 2) = λ(2x, 2y):
1 = 2λ x ⇒ x = 1/(2λ)
2 = 2λ y ⇒ y = 1/λ
Use constraint:
x² + y² = 5
(1/(2λ))² + (1/λ)² = 5
1/(4λ²) + 1/λ² = 5
(1/4 + 1)/λ² = 5
(5/4)/λ² = 5
1/λ² = 4 ⇒ λ = ±1/2
If λ = 1/2: x = 1, y = 2, f = 1 + 4 = 5 (maximum)
If λ = −1/2: x = −1, y = −2, f = −1 − 4 = −5 (minimum)
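A few lines suffice to verify this answer:

```python
# Verify the practice answer: maximum of x + 2y on x^2 + y^2 = 5 is at (1, 2).
x, y, lam = 1.0, 2.0, 0.5
assert x**2 + y**2 == 5.0                    # the point lies on the circle
assert (1.0, 2.0) == (2*lam*x, 2*lam*y)      # grad f = lam * grad g
print(x + 2*y)  # 5.0, the maximum value
```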
Use Lagrange multipliers to find critical points of f(x, y, z) = x y z subject to x² + y² + z² = 3. (You may leave the answer as a set of symmetric points.)
Hint: Compute ∇f = (y z, x z, x y) and ∇g = (2x, 2y, 2z). Use y z = 2λ x etc. Look for symmetry cases like x = y = z and sign flips.
Constraint: g(x, y, z) = x² + y² + z² = 3, so ∇g = (2x, 2y, 2z).
Objective: f = x y z, so ∇f = (y z, x z, x y).
Lagrange equations:
(y z, x z, x y) = λ(2x, 2y, 2z)
So:
(1) y z = 2λ x
(2) x z = 2λ y
(3) x y = 2λ z
and x² + y² + z² = 3.
Assume x, y, z are all nonzero. Multiply (1)(2)(3):
(y z)(x z)(x y) = (2λ)³ (x y z)
Left side = x² y² z² = (x y z)².
So (x y z)² = 8λ³ (x y z).
Since x y z ≠ 0, divide by x y z:
x y z = 8λ³.
Multiply (1) by x, (2) by y, and (3) by z: each left side becomes x y z, so 2λx² = 2λy² = 2λz². Since x y z = 8λ³ ≠ 0 forces λ ≠ 0, we get x² = y² = z², i.e., |x| = |y| = |z|.
Let x = s₁ t, y = s₂ t, z = s₃ t where sᵢ ∈ {−1, +1} and t > 0.
Constraint gives 3 t² = 3 ⇒ t = 1.
So candidates are (x, y, z) = (s₁, s₂, s₃).
These satisfy the Lagrange equations with suitable λ because:
For x = ±1, y = ±1, z = ±1, we have y z = ±1 and 2λ x must match it; choose λ = (y z)/(2x) which is ±1/2 consistently across equations.
Thus the critical points with all coordinates nonzero are the 8 sign combinations (±1, ±1, ±1). (Points with two coordinates zero, such as (0, 0, ±√3), also satisfy the equations with λ = 0, but there f = 0.)
Their objective values are f = x y z = ±1.
Maxima: points with product +1 (even number of negatives). Minima: product −1 (odd number of negatives).
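The eight candidates can be checked programmatically, using λ = yz/(2x) from equation (1):

```python
import itertools

# Check all 8 sign combinations (±1, ±1, ±1) against the Lagrange equations
# y*z = 2*lam*x, x*z = 2*lam*y, x*y = 2*lam*z, with lam = y*z/(2*x).
for x, y, z in itertools.product([-1.0, 1.0], repeat=3):
    lam = y * z / (2 * x)
    assert y * z == 2 * lam * x
    assert x * z == 2 * lam * y
    assert x * y == 2 * lam * z
    assert x**2 + y**2 + z**2 == 3.0   # all candidates lie on the sphere
print("all 8 points satisfy the Lagrange equations; f = x*y*z is +1 or -1")
```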