Minimizing or maximizing objective functions. Constraints.
Nearly every “smart” system—training a neural network, fitting a line to data, scheduling deliveries, designing a bridge—can be framed as: pick a decision vector x that makes a single number f(x) as small (or large) as possible, while respecting the rules of the problem.
Optimization is the study of choosing a decision vector x to minimize or maximize an objective f(x) over a feasible set defined by constraints. Unconstrained local optima satisfy the first-order stationarity condition ∇f(x*) = 0; constraints change what “no improving direction” means and motivate tools like projection, Lagrange multipliers, and linear programming.
Optimization is a language for decision-making. Whenever you can:
1) describe what “good” means with a single number, and
2) describe what choices are allowed,
you can write an optimization problem.
A canonical form is the minimization problem:
minimize f(x)
subject to x ∈ 𝒳
where 𝒳 is the feasible set (the set of all choices that satisfy the constraints).
You’ll see these three pieces repeatedly:
An objective is a function f that maps a decision vector to a scalar:
f: ℝⁿ → ℝ, x ↦ f(x)
Maximization is equivalent to minimization by sign flip:
maximize f(x) ⇔ minimize (−f(x))
So in many courses and libraries, “optimization” is taught largely as minimization.
Constraints carve out which x are permitted. A common description:
𝒳 = { x ∈ ℝⁿ : gᵢ(x) ≤ 0 for i = 1,…,m and hⱼ(x) = 0 for j = 1,…,p }
If there are no constraints, then 𝒳 = ℝⁿ and the problem is unconstrained.
A minimizer is any feasible point x⋆ whose objective value is smallest among the candidates it is compared against (all of 𝒳 for a global minimizer; a neighborhood within 𝒳 for a local one).
Local vs global matters because many objectives (especially in ML) are nonconvex: they can have multiple valleys. Algorithms like gradient descent typically aim for a local optimum or even just a “good enough” point.
Here’s a compact glossary.
| Term | Meaning | Example intuition |
|---|---|---|
| Decision variable x | What you control | model parameters, schedule, portfolio weights |
| Objective f(x) | What you measure | error, cost, negative log-likelihood |
| Feasible set 𝒳 | Allowed choices | budget constraints, box bounds, equality laws |
| Minimizer x⋆ | Best choice (local/global) | lowest loss parameters |
| Argmin | The set of minimizers | argmin f(x) can contain multiple points |
Because x is a vector, optimization is inherently geometric: the objective defines a landscape over ℝⁿ, and constraints carve out the region you may search.
Your prerequisite—gradients—already gives you a powerful geometric tool: ∇f(x) points in the direction of steepest ascent, and −∇f(x) points downhill the fastest (locally). This will become your default “direction of improvement” in unconstrained optimization.
Even if you can write down f(x), directly checking “is x⋆ global?” by comparing against all x is impossible in continuous spaces. So optimization uses conditions that any optimum must satisfy.
For unconstrained problems:
minimize f(x) over x ∈ ℝⁿ
the first condition you learn is a first-order necessary condition: at a local minimizer, there should be no infinitesimal direction you can move that decreases f.
Assume f is differentiable. Consider a candidate point x⋆, and take a small step along a direction d with step size ε:
x(ε) = x⋆ + ε d
Define the 1D function φ(ε) = f(x⋆ + εd). If x⋆ is a local minimizer, then for small ε around 0, ε = 0 should be a local minimum of φ.
Compute the derivative at 0:
φ′(0) = d/dε f(x⋆ + εd) |_{ε=0}
By the chain rule:
φ′(0) = ∇f(x⋆)ᵀ d
For x⋆ to be a local minimizer, we need φ′(0) ≥ 0 for every direction d (otherwise moving a little along d decreases f). But −d is also a valid direction, which forces φ′(0) ≤ 0 as well.
So we require:
∇f(x⋆)ᵀ d = 0 for all d
The only vector orthogonal to all directions d is the zero vector. Therefore:
∇f(x⋆) = 0
This is first-order stationarity.
Stationary points include local minima, local maxima, and saddle points.
So ∇f(x⋆) = 0 is necessary for a differentiable unconstrained local optimum, but not sufficient.
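A quick sketch in plain Python (the function is illustrative, not from the text): f(x₁, x₂) = x₁² − x₂² is stationary at the origin, yet the origin is a saddle, so stationarity alone proves nothing about minimality.

```python
# Stationarity is necessary but not sufficient: a saddle-point example.
# f(x1, x2) = x1^2 - x2^2 has gradient (2*x1, -2*x2), which vanishes at the
# origin, yet the origin is neither a minimum nor a maximum.

def f(x1, x2):
    return x1**2 - x2**2

def grad_f(x1, x2):
    return (2 * x1, -2 * x2)

assert grad_f(0.0, 0.0) == (0.0, 0.0)   # stationary point
eps = 1e-3
assert f(eps, 0.0) > f(0.0, 0.0)        # f increases along x1
assert f(0.0, eps) < f(0.0, 0.0)        # f decreases along x2: a saddle
```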
A second-order view uses the Hessian H(x) = ∇²f(x) (a matrix). If ∇f(x⋆) = 0 and H(x⋆) is positive definite, then x⋆ is a strict local minimizer. But many practical algorithms still start from first-order information, because Hessians are expensive to compute and store in high dimensions, while gradients alone already supply usable descent directions.
The dot product ∇f(x)ᵀ d is the directional derivative: the instantaneous rate of change of f if you move in direction d. In particular, choosing d = −∇f(x) (when the gradient is nonzero) gives:
∇f(x)ᵀ (−∇f(x)) = −‖∇f(x)‖² < 0
So you can immediately decrease f by moving a little along −∇f.
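This can be checked numerically. A minimal sketch with an assumed quadratic (any smooth f would do): take a short step along −∇f and confirm the objective drops.

```python
# A short step along -grad f decreases f (for a small enough step size).
def f(x1, x2):
    return x1**2 + 2 * x2**2

def grad_f(x1, x2):
    return (2 * x1, 4 * x2)

x = (3.0, -1.0)                      # an arbitrary non-stationary point
g = grad_f(*x)
alpha = 0.1                          # small step size (assumed)
x_new = (x[0] - alpha * g[0], x[1] - alpha * g[1])
assert f(*x_new) < f(*x)             # f strictly decreases along -grad f
```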
A helpful geometric picture: at a local minimum, level sets form nested curves/surfaces around the minimizer, and stationarity means the “best local place” has no downhill direction.
Some objectives are not differentiable everywhere (e.g., absolute value |x| at 0, ReLU). Then “∇f(x) = 0” is not the right tool at kink points; you use subgradients. For this node, keep the main idea: optimality means no direction improves the objective, and gradients formalize that when differentiable.
In unconstrained optimization, any direction d is allowed. With constraints, you can’t necessarily step in the direction you want.
Instead, you must stay inside the feasible set 𝒳.
So the key question becomes:
Is there a feasible direction you can move that decreases f?
If the answer is “no,” you are (at least) locally optimal with respect to the constraints.
A common constrained optimization problem:
minimize f(x)
subject to gᵢ(x) ≤ 0, i = 1,…,m
hⱼ(x) = 0, j = 1,…,p
Feasible set:
𝒳 = { x : gᵢ(x) ≤ 0 ∀i, hⱼ(x) = 0 ∀j }
Constraints define geometry:
1) Box constraints: ℓ ≤ x ≤ u (componentwise)
2) Ball constraint: ‖x‖ ≤ R
3) Simplex: x ≥ 0, ∑ᵢ xᵢ = 1
4) Linear inequalities: Ax ≤ b
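The four feasible sets above can each be tested by a one-line membership check. A small sketch in plain Python (function names are my own):

```python
import math

def in_box(x, lo, hi):
    # Componentwise bounds: lo <= x <= u
    return all(l <= xi <= h for xi, l, h in zip(x, lo, hi))

def in_ball(x, R):
    # Euclidean norm constraint: ||x|| <= R
    return math.sqrt(sum(xi**2 for xi in x)) <= R

def in_simplex(x, tol=1e-9):
    # x >= 0 and components summing to 1
    return all(xi >= -tol for xi in x) and abs(sum(x) - 1.0) <= tol

def satisfies_linear(x, A, b):
    # Ax <= b, with A given as a list of rows
    return all(sum(a_ij * xj for a_ij, xj in zip(row, x)) <= bi
               for row, bi in zip(A, b))

assert in_box([0.5, 2.0], [0, 0], [1, 3])
assert in_ball([0.6, 0.8], 1.0)
assert in_simplex([0.2, 0.3, 0.5])
assert satisfies_linear([1.0, 2.0], [[1, 1]], [10])
assert not in_ball([2.0, 0.0], 1.0)
```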
The shape of 𝒳 often determines which algorithms are natural: simple sets like boxes and balls admit cheap projections, while general nonlinear constraints call for multiplier-based methods.
A crucial difference from the unconstrained case: the optimum can sit on the boundary of 𝒳 even though the gradient is nonzero there.
Example intuition: minimize f(x) = x over the constraint x ≥ 0. The minimizer is x⋆ = 0, yet f′(0) = 1 ≠ 0.
So what replaces “gradient is zero”?
At a feasible point x⋆, a direction d is a feasible direction if for small ε > 0, x⋆ + εd remains feasible.
Local optimality can be phrased as:
∇f(x⋆)ᵀ d ≥ 0 for all feasible directions d
Interpretation: every allowed infinitesimal move is non-decreasing; you can’t go downhill while staying feasible.
For equality constraints h(x) = 0, feasible directions must satisfy the linearized constraint:
h(x⋆ + εd) ≈ h(x⋆) + ε ∇h(x⋆)ᵀ d = 0
Since h(x⋆) = 0, this requires:
∇h(x⋆)ᵀ d = 0
Meaning: d lies in the tangent space of the constraint surface.
This is the doorway to Lagrange multipliers: at an optimum, the gradient of f must be orthogonal to all tangent directions, so it must lie in the span of constraint normals.
Constraints make the problem more realistic but often harder: the search is restricted, and boundary effects replace simple stationarity. But constraints also add structure that can make some problems easier:
| Class | Objective | Constraints | Typical guarantee |
|---|---|---|---|
| Unconstrained smooth | differentiable f | none | local stationarity ∇f = 0 at local optima |
| Equality-constrained | differentiable f | h(x) = 0 | Lagrange multipliers / KKT-style conditions |
| Inequality-constrained | differentiable f | g(x) ≤ 0 | active-set / KKT conditions |
| Linear programming | linear cᵀx | Ax ≤ b | global optimum at a vertex (if bounded) |
You don’t need all the machinery yet; the goal is to internalize that constraints redefine what “can’t improve” means.
In supervised learning, you choose parameters w to minimize average loss:
minimize (1/N) ∑_{i=1}^{N} ℓ( model(w, xᵢ), yᵢ ) + λ R(w)
This directly unlocks gradient descent, which iteratively updates:
w ← w − α ∇f(w)
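The update rule above can be sketched in a few lines of plain Python. The one-dimensional objective and the step size are illustrative choices, not from the text:

```python
# Minimal gradient descent sketch on a smooth 1D quadratic.
def f(w):
    return (w - 3.0) ** 2

def grad_f(w):
    return 2.0 * (w - 3.0)

w = 0.0              # initial guess
alpha = 0.1          # learning rate (assumed small enough for stability)
for _ in range(200):
    w = w - alpha * grad_f(w)   # w <- w - alpha * grad f(w)

assert abs(w - 3.0) < 1e-6      # converges to the minimizer w* = 3
```

Each iteration shrinks the error by the factor (1 − 2α), so for α = 0.1 the iterate contracts toward w⋆ = 3 geometrically.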
In a game, each player solves an optimization problem given others’ strategies.
Player k chooses strategy xₖ to maximize payoff uₖ(xₖ, x_{−k}) subject to xₖ ∈ 𝒳ₖ.
A Nash equilibrium is a strategy profile where no player can improve by unilateral deviation—an optimality condition across multiple coupled optimization problems.
So optimization vocabulary (objective, feasible set, optimality) is the foundation for equilibrium concepts.
When you must satisfy h(x) = 0 exactly, the best point often occurs where the objective gradient is “balanced” by constraint gradients.
Lagrange multipliers formalize the idea:
∇f(x⋆) + ∑ⱼ λⱼ ∇hⱼ(x⋆) = 0
This is a precise way to express: at optimum, you cannot move along the constraint surface to decrease f.
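The multiplier condition can be verified on a tiny example of my own choosing: minimize f(x) = x₁² + x₂² subject to h(x) = x₁ + x₂ − 1 = 0, whose solution is x⋆ = (0.5, 0.5) by symmetry.

```python
# Checking grad f(x*) + lambda * grad h(x*) = 0 at the constrained optimum
# of: minimize x1^2 + x2^2 subject to x1 + x2 = 1.
x_star = (0.5, 0.5)
grad_f = (2 * x_star[0], 2 * x_star[1])   # = (1.0, 1.0)
grad_h = (1.0, 1.0)                       # gradient of h(x) = x1 + x2 - 1
lam = -1.0                                # the multiplier that balances them
residual = (grad_f[0] + lam * grad_h[0],
            grad_f[1] + lam * grad_h[1])
assert residual == (0.0, 0.0)                     # stationarity with multiplier
assert abs(x_star[0] + x_star[1] - 1.0) < 1e-12   # feasibility
```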
Not all optimization is calculus-based. In linear programming:
minimize cᵀx
subject to Ax ≤ b
If the feasible polyhedron is nonempty and the objective is bounded below on it, an optimum is attained at a vertex. This structural fact enables algorithms like simplex and interior-point methods.
Optimization problems are often described in different but equivalent ways: maximization rewritten as minimization of −f, inequality constraints rewritten as equalities with slack variables, or constraints folded into the objective as penalty terms.
These transformations are not just algebra—they determine algorithm choices.
Given a new scenario, you should be able to:
1) Identify the decision vector x.
2) Write a scalar objective f(x).
3) Specify the feasible set 𝒳 using constraints.
4) Say what “optimal” means (local vs global).
5) For unconstrained differentiable problems, use ∇f(x⋆) = 0 as a necessary condition.
That’s the core toolkit you’ll use immediately in gradient descent, and later in Lagrange multipliers and linear programming.
Minimize f(x) = x₁² + 2x₂² − 4x₁ − 8x₂ over x = (x₁, x₂) ∈ ℝ².
Compute the gradient:
∇f(x) = (∂f/∂x₁, ∂f/∂x₂)
∂f/∂x₁ = 2x₁ − 4
∂f/∂x₂ = 4x₂ − 8
So ∇f(x) = (2x₁ − 4, 4x₂ − 8).
Set the first-order stationarity condition:
∇f(x⋆) = 0
⇒ 2x₁ − 4 = 0 and 4x₂ − 8 = 0
⇒ x₁ = 2, x₂ = 2
So x⋆ = (2, 2).
Check that it’s a (global) minimum by completing the square:
f(x) = (x₁² − 4x₁) + 2(x₂² − 4x₂)
= (x₁² − 4x₁ + 4) − 4 + 2[(x₂² − 4x₂ + 4) − 4]
= (x₁ − 2)² − 4 + 2(x₂ − 2)² − 8
= (x₁ − 2)² + 2(x₂ − 2)² − 12.
Since (x₁ − 2)² ≥ 0 and 2(x₂ − 2)² ≥ 0 for all x, we have:
f(x) ≥ −12
and equality occurs at x = (2, 2).
Therefore x⋆ is the unique global minimizer and f(x⋆) = −12.
Insight: Stationarity gave the candidate quickly; rewriting the function as a sum of squares showed there’s only one valley bottom, so the stationary point is global.
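The worked example is easy to confirm numerically. A small check in plain Python:

```python
# Numeric check of the worked example: f(x) = x1^2 + 2*x2^2 - 4*x1 - 8*x2.
def f(x1, x2):
    return x1**2 + 2 * x2**2 - 4 * x1 - 8 * x2

def grad_f(x1, x2):
    return (2 * x1 - 4, 4 * x2 - 8)

assert grad_f(2, 2) == (0, 0)    # first-order stationarity at (2, 2)
assert f(2, 2) == -12            # the minimum value

# Nearby points are strictly worse, consistent with a strict minimum:
for dx, dy in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]:
    assert f(2 + dx, 2 + dy) > -12
```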
Minimize f(x) = (x − 2)² subject to x ≥ 0.
First, solve the unconstrained problem:
f′(x) = 2(x − 2)
Set f′(x) = 0 ⇒ x = 2.
This candidate is feasible because 2 ≥ 0.
Compare with the boundary intuition anyway:
The feasible set is 𝒳 = [0, ∞).
The unconstrained minimizer x = 2 lies inside 𝒳, so it is also the constrained minimizer.
Compute value: f(2) = 0.
Modify slightly to see the boundary effect: minimize g(x) = (x + 1)² subject to x ≥ 0.
Unconstrained: g′(x) = 2(x + 1) ⇒ stationary at x = −1 (infeasible).
So the constrained optimum must occur at the closest feasible point to −1, which is the boundary x⋆ = 0.
Check the gradient at the constrained optimum:
g′(0) = 2(0 + 1) = 2 ≠ 0.
Yet x⋆ = 0 is optimal: every feasible direction satisfies d ≥ 0, and g′(0)·d = 2d ≥ 0, so no feasible move decreases g. Indeed, g(x) = (x + 1)² is increasing on [0, ∞), so x = 0 is best among feasible points.
Insight: With constraints, “no improving direction” means no improving feasible direction. At a boundary optimum, the derivative/gradient can be nonzero because the decreasing direction points outside the feasible set.
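This boundary behavior can also be checked directly in plain Python:

```python
# Boundary optimum of g(x) = (x + 1)^2 over x >= 0.
def g(x):
    return (x + 1) ** 2

def g_prime(x):
    return 2 * (x + 1)

x_star = 0.0
assert g_prime(x_star) == 2.0        # gradient is NONZERO at the optimum

# Every feasible direction from x = 0 has d >= 0, and g'(0) * d >= 0:
for d in [0.0, 0.5, 1.0]:
    assert g_prime(x_star) * d >= 0  # no feasible descent direction

# g is increasing on [0, inf), so x = 0 beats any other feasible point:
for x in [0.1, 1.0, 5.0]:
    assert g(x) > g(x_star)
```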
You choose a 2D decision vector x = (x₁, x₂) representing amounts of two resources. You must (i) stay nonnegative, (ii) keep total under 10. Objective: maximize utility u(x) = 3x₁ + x₂.
Write constraints:
(i) nonnegative ⇒ x₁ ≥ 0, x₂ ≥ 0
(ii) budget ⇒ x₁ + x₂ ≤ 10
So feasible set:
𝒳 = { x ∈ ℝ² : x₁ ≥ 0, x₂ ≥ 0, x₁ + x₂ ≤ 10 }.
This is a triangle (a 2-simplex scaled by 10).
Convert maximization to minimization (optional):
maximize 3x₁ + x₂ ⇔ minimize f(x) = −(3x₁ + x₂).
But we can reason directly about maximization.
Use linear objective intuition:
Because u(x) is linear and 𝒳 is a polygon, the maximum occurs at a vertex of 𝒳 (a cornerstone fact for linear programming).
Vertices are:
A = (0, 0)
B = (10, 0)
C = (0, 10).
Evaluate u at vertices:
u(A) = 0
u(B) = 3·10 + 0 = 30
u(C) = 3·0 + 10 = 10
So the maximizer is x⋆ = (10, 0) with utility 30.
Insight: Even before learning simplex, you can often solve small linear programs by: (1) writing the feasible set, (2) listing vertices, (3) checking the objective on vertices.
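The three-step recipe above fits in a few lines of Python, with the vertices hard-coded from the triangle derived in the text:

```python
# Solving the small resource problem by vertex enumeration.
def u(x1, x2):
    return 3 * x1 + x2            # linear utility

vertices = [(0, 0), (10, 0), (0, 10)]   # corners of the feasible triangle
best = max(vertices, key=lambda v: u(*v))
assert best == (10, 0)
assert u(*best) == 30
```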
An optimization problem is: choose a decision vector x to minimize or maximize a scalar objective f(x) over a feasible set 𝒳.
Constraints (equalities/inequalities) define the feasible set 𝒳; they can force the solution onto the boundary where ∇f(x⋆) ≠ 0.
Global optimality compares against all feasible points; local optimality compares within a neighborhood of x⋆ inside 𝒳.
For unconstrained differentiable problems, any local minimizer x⋆ must satisfy first-order stationarity: ∇f(x⋆) = 0.
Directional derivatives ∇f(x)ᵀd formalize “can I improve by moving along d?”; with constraints, only feasible directions matter.
Maximization problems can be converted to minimization by negating the objective: maximize f ⇔ minimize −f.
Problem structure matters: linear objective + linear constraints leads to linear programming; equality constraints motivate Lagrange multipliers; smooth unconstrained problems motivate gradient descent.
Assuming ∇f(x⋆) = 0 at constrained optima; boundary solutions often have nonzero gradients.
Confusing local and global optima (especially in nonconvex landscapes): stationarity does not imply global optimality.
Forgetting to define the feasible set precisely (e.g., missing nonnegativity or mixing up ≤ vs ≥), which changes the solution.
Treating maximization and minimization as fundamentally different rather than using the sign-flip equivalence.
Write the optimization problem (objective + feasible set) for: “Choose x = (x₁, x₂) to minimize cost f(x) = (x₁ − 1)² + (x₂ + 2)² subject to x₁ ≥ 0 and x₂ ≥ 0.” Is the unconstrained minimizer feasible?
Hint: First find the unconstrained minimizer by minimizing each squared term, then check the constraints.
Feasible set: 𝒳 = { x ∈ ℝ² : x₁ ≥ 0, x₂ ≥ 0 }.
Unconstrained minimizer occurs at x₁ = 1 and x₂ = −2, i.e., x = (1, −2).
This is not feasible because x₂ = −2 < 0.
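Going one step beyond what the exercise asks: because this f is separable and the constraints are componentwise bounds, the constrained minimizer is obtained by clamping each coordinate of the unconstrained minimizer (1, −2) to [0, ∞). A quick sketch:

```python
# Constrained minimizer of f(x) = (x1 - 1)^2 + (x2 + 2)^2 over x >= 0.
# f is separable and the feasible set is a componentwise box, so clamping
# each coordinate of the unconstrained minimizer (1, -2) to [0, inf) works.
def f(x1, x2):
    return (x1 - 1) ** 2 + (x2 + 2) ** 2

x_c = (max(1.0, 0.0), max(-2.0, 0.0))   # clamp to the feasible set
assert x_c == (1.0, 0.0)

# Spot-check against a few other feasible points:
for x in [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]:
    assert f(*x) >= f(*x_c)
```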
For f(x) = x₁² + x₂², find all stationary points and decide whether they are local/global minima, maxima, or saddle points (unconstrained).
Hint: Compute ∇f(x) and set it to 0. Then use geometry: f(x) = ‖x‖².
∇f(x) = (2x₁, 2x₂). Stationarity gives x₁ = 0 and x₂ = 0, so the only stationary point is 0.
Since f(x) = ‖x‖² ≥ 0 for all x and f(0) = 0, 0 is a global (and strict local) minimum. There is no maximum because f(x) → ∞ as ‖x‖ → ∞.
Consider minimize f(x) = x subject to −1 ≤ x ≤ 2. Identify the feasible set, the global minimizer, and check whether f′(x⋆) = 0 at the solution.
Hint: A linear function over an interval is minimized at an endpoint.
Feasible set: 𝒳 = [−1, 2]. Since f(x) = x increases with x, the minimum occurs at the smallest feasible x: x⋆ = −1.
Derivative f′(x) = 1, so f′(x⋆) = 1 ≠ 0. This is a boundary optimum, so unconstrained stationarity does not apply.
Gradient Descent builds directly on the idea that −∇f(x) is a local improvement direction for unconstrained minimization.
Lagrange Multipliers formalizes constrained optimality for equality constraints by relating ∇f(x⋆) to constraint gradients.
Linear Programming specializes optimization to linear objectives and linear constraints, where geometry (vertices of polyhedra) becomes central.
Nash Equilibrium uses optimization as a building block: each player’s strategy is optimal given others’ strategies, forming a coupled set of optimality conditions.