Vector of partial derivatives. Direction of steepest ascent.
When you change a single-variable function f(x), the derivative f′(x) tells you “which way is uphill” and “how steep.” For a function of many variables f(x, y, z, …), the gradient ∇f plays that role: it’s the single object that tells you the uphill direction in the full space.
The gradient ∇f(x) is the vector of partial derivatives. It points in the direction of steepest increase of f, its magnitude ‖∇f‖ is the maximum directional rate of increase, and it is perpendicular (normal) to level sets f(x) = c.
In 1D, the derivative f′(x) answers two questions at once:
1) Direction: If f′(x) > 0, moving right increases f; if f′(x) < 0, moving right decreases f.
2) Rate: |f′(x)| tells how quickly f changes per unit step.
In multiple dimensions, you can move in infinitely many directions. A single number can’t encode “uphill direction” anymore. We need a vector that points in the direction of fastest increase and whose length gives the rate of that increase.
That vector is the gradient.
Let f: ℝⁿ → ℝ be a differentiable scalar function, and let x = (x₁, …, xₙ). The gradient of f at x is
∇f(x) = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ).
We will write gradients as vectors (bold): ∇f(x) is a vector in ℝⁿ.
The i-th component ∂f/∂xᵢ is the slope of f if you move only along the xᵢ axis, holding all other variables constant.
So the gradient collects all “axis-aligned slopes” into a single object. But the real power is that it also tells you what happens when you move in any direction, not just along coordinate axes.
For f(x, y), the gradient is
∇f(x, y) = (∂f/∂x, ∂f/∂y).
At a point (x, y), this vector points “uphill” on the surface z = f(x, y). But to make that statement precise, we need directional derivatives.
The symbol ∇ (“nabla” or “del”) behaves like a vector of partial-derivative operators:
∇ = (∂/∂x₁, …, ∂/∂xₙ).
Applying it to a scalar function f gives ∇f.
If f has units (say, meters) and xᵢ has units (say, seconds), then ∂f/∂xᵢ has units meters/second. So ∇f is a vector of “rates per coordinate unit.” This becomes important when different coordinates have different scales (a common source of mistakes in optimization).
Partial derivatives tell you slopes along coordinate axes. But most interesting motion is not axis-aligned: you might move diagonally in (x, y), or along an arbitrary direction in ℝⁿ.
The directional derivative formalizes “rate of change if I move in direction u.”
Let u be a unit vector (‖u‖ = 1). The directional derivative of f at x along u is
D_u f(x) = lim_{h→0} [f(x + hu) − f(x)] / h.
This is just the ordinary derivative of the 1D function g(h) = f(x + hu) at h = 0.
If f is differentiable, then
D_u f(x) = ∇f(x) · u.
This is the bridge between “vector of partial derivatives” and “change in any direction.”
Let f(x, y) be differentiable and u = (u₁, u₂) with ‖u‖ = 1. Consider
g(h) = f(x + h u₁, y + h u₂).
By the chain rule,
g′(h) = (∂f/∂x)(x + h u₁, y + h u₂)·u₁ + (∂f/∂y)(x + h u₁, y + h u₂)·u₂.
At h = 0,
D_u f(x, y) = g′(0)
= (∂f/∂x)(x, y)·u₁ + (∂f/∂y)(x, y)·u₂
= (∂f/∂x, ∂f/∂y) · (u₁, u₂)
= ∇f(x, y) · u.
The same reasoning extends to ℝⁿ.
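The identity D_u f = ∇f · u can be sanity-checked numerically: estimate the directional derivative with a finite difference and compare it to the dot product. The function f(x, y) = x²y below is an illustrative choice, not one used elsewhere in this section.

```python
import math

def f(x, y):
    # Illustrative function (an assumption for this demo): f(x, y) = x^2 * y
    return x**2 * y

def grad_f(x, y):
    # Analytic partial derivatives of f
    return (2 * x * y, x**2)

def directional_derivative(x, y, u, h=1e-6):
    # Finite-difference estimate of D_u f at (x, y); u must be a unit vector
    return (f(x + h * u[0], y + h * u[1]) - f(x, y)) / h

# Unit direction at 45 degrees
u = (1 / math.sqrt(2), 1 / math.sqrt(2))
x, y = 1.0, 2.0

numeric = directional_derivative(x, y, u)
gx, gy = grad_f(x, y)
analytic = gx * u[0] + gy * u[1]   # ∇f · u

print(numeric, analytic)  # the two values agree to several decimal places
```

The agreement between the finite-difference estimate and ∇f · u is exactly what the chain-rule derivation above predicts.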
Now that directional change is a dot product, we can ask:
Among all unit directions u (‖u‖ = 1), which maximizes D_u f(x) = ∇f(x) · u?
Use the Cauchy–Schwarz inequality:
∇f · u ≤ ‖∇f‖‖u‖.
Because ‖u‖ = 1, we get
D_u f(x) ≤ ‖∇f(x)‖.
Equality holds exactly when u points in the same direction as ∇f:
u = ∇f / ‖∇f‖ (when ∇f ≠ 0).
So the direction of steepest ascent is ∇f/‖∇f‖, and the maximal rate of increase is ‖∇f(x)‖. What happens when the gradient vanishes?
If ∇f(x) = 0, then every directional derivative is 0:
D_u f(x) = ∇f(x) · u = 0.
That doesn’t automatically mean a maximum or minimum (it could be a saddle). But it does mean the function has no first-order preference for any direction at that point.
Differentiability implies a local linear model:
f(x + h) ≈ f(x) + ∇f(x) · h (for small h).
This is the multivariable analog of
f(x + h) ≈ f(x) + f′(x)h.
This approximation is what makes gradients so useful in optimization: you can predict how f changes under a small step h, then choose h to increase or decrease f.
For f: ℝ² → ℝ, you often visualize the function using contours (level curves):
f(x, y) = c.
For f: ℝ³ → ℝ, you get level surfaces:
f(x, y, z) = c.
These describe “where the function value stays constant.” If you walk along a contour, f does not change.
So the key question becomes:
If I move tangent to a level set, why does the gradient end up perpendicular to that motion?
Suppose you are at a point x on the level set f(x) = c. Consider a small step in a unit direction u tangent to the level set (to first order, such a step stays on it). Along that step, f doesn’t change, so the directional derivative in that tangent direction should be 0.
Using the dot-product identity:
0 = D_u f(x) = ∇f(x) · u
for any unit tangent direction u.
A vector whose dot product with every tangent direction is 0 must be normal to the level set. Therefore:
∇f(x) is orthogonal to the level set f(x) = c at x (when ∇f(x) ≠ 0).
Let the level curve be given implicitly by f(x, y) = c. Suppose it can be parameterized as (x(t), y(t)) staying on the curve. Then
f(x(t), y(t)) = c.
Differentiate both sides with respect to t:
d/dt f(x(t), y(t)) = 0.
Apply the chain rule:
(∂f/∂x)·x′(t) + (∂f/∂y)·y′(t) = 0.
Rewrite in dot-product form:
(∂f/∂x, ∂f/∂y) · (x′(t), y′(t)) = 0
So:
∇f · r′(t) = 0,
meaning ∇f is perpendicular to the tangent vector r′(t).
If you see a contour map of f(x, y), the gradient at each point is perpendicular to the contour through that point and points toward higher values; tightly packed contours indicate a large ‖∇f‖.
This is a powerful intuition for optimization: to decrease f fastest, move opposite the gradient, cutting across contours as directly as possible.
This orthogonality is also the reason gradients appear in constrained optimization. If you constrain x to live on a level set g(x) = 0, then ∇g(x) is normal to the constraint surface. At an optimum, the objective’s gradient must align with the constraint’s normal direction (Lagrange multipliers formalize this).
You can hold three ideas in your head simultaneously:
1) ∇f is built from partial derivatives.
2) ∇f gives first-order change: f(x + h) ≈ f(x) + ∇f · h.
3) Level sets are perpendicular to ∇f, and the steepest-ascent direction is ∇f itself.
These are not separate facts—they are the same fact seen through different lenses.
Most optimization problems can be phrased as
minimize f(x) or maximize f(x).
The gradient tells you the direction that increases f most. So to decrease f, you go the other way.
A basic iterative method is gradient descent:
xₖ₊₁ = xₖ − α ∇f(xₖ),
where α > 0 is the step size (learning rate).
This update is directly motivated by the linear approximation:
f(x + h) ≈ f(x) + ∇f(x) · h.
If you pick h = −α ∇f, then
∇f · h = ∇f · (−α ∇f) = −α ‖∇f‖² ≤ 0,
so the first-order prediction says the objective should go down (for small enough α).
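A minimal sketch of this update on an illustrative quadratic bowl (the function f(x, y) = x² + 4y² is an assumption for the demo, chosen so the minimizer is the origin):

```python
# Gradient descent sketch: x_{k+1} = x_k - alpha * grad_f(x_k)
# f(x, y) = x^2 + 4y^2 is an illustrative choice with minimum at (0, 0).

def grad_f(x, y):
    return (2 * x, 8 * y)

alpha = 0.1           # step size (learning rate)
x, y = 3.0, 2.0       # starting point

for _ in range(100):
    gx, gy = grad_f(x, y)
    x, y = x - alpha * gx, y - alpha * gy   # move against the gradient

print(x, y)  # both coordinates approach the minimizer (0, 0)
```

Note that the two coordinates shrink at different rates (the y-curvature is 4× larger), a preview of why scaling matters for gradient methods.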
In supervised learning, you often minimize a loss function like
L(w) = (1/m) ∑ᵢ ℓ(ŷᵢ(w), yᵢ),
where w are model parameters.
The gradient ∇L(w) tells you how to change parameters to reduce loss. Backpropagation is essentially an efficient way to compute these partial derivatives for neural networks.
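As a toy illustration of this loop (with made-up data, not a real training setup), here is gradient descent on a one-parameter squared loss. The data is generated from y = 2x, so the minimizing parameter is w = 2.

```python
# Toy example (assumed data): minimize L(w) = (1/m) * sum((w*x_i - y_i)^2)
# by gradient descent on the single parameter w.

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated by y = 2x, so the optimum is w = 2

def loss_grad(w):
    # dL/dw = (2/m) * sum((w*x - y) * x)
    m = len(xs)
    return (2.0 / m) * sum((w * x - y) * x for x, y in zip(xs, ys))

w = 0.0
alpha = 0.05
for _ in range(200):
    w -= alpha * loss_grad(w)   # w <- w - alpha * dL/dw

print(w)  # converges toward 2.0
```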
Gradients are the “scalar-output” case of more general derivative objects.
| Object | Maps | Output | Meaning |
|---|---|---|---|
| Gradient ∇f | f: ℝⁿ → ℝ | vector in ℝⁿ | slopes of scalar function |
| Jacobian J | F: ℝⁿ → ℝᵐ | m×n matrix | linear map of first-order changes |
| Hessian ∇²f | f: ℝⁿ → ℝ | n×n matrix | curvature / second derivatives |
If F has components Fⱼ, then the Jacobian rows are gradients:
J(x) has row j equal to (∇Fⱼ(x))ᵀ.
The Hessian shows up when the gradient alone is not enough (e.g., Newton’s method uses curvature to choose better steps).
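The row structure of the Jacobian can be made concrete with a small example; the map F: ℝ² → ℝ² below is an illustrative choice.

```python
# Sketch: the Jacobian of F: R^2 -> R^2 stacks the gradients of its components.
# F(x, y) = (x*y, x + y^2) is an illustrative choice.

def F(x, y):
    return (x * y, x + y**2)

def jacobian(x, y):
    # Row j is the gradient of component F_j
    return [
        [y, x],        # grad of F_1 = xy:      (y, x)
        [1.0, 2 * y],  # grad of F_2 = x + y^2: (1, 2y)
    ]

J = jacobian(2.0, 3.0)
print(J)  # [[3.0, 2.0], [1.0, 6.0]]
```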
A subtle but practical point: the gradient depends on your coordinate system and scaling.
If one feature is measured in kilometers and another in millimeters, a unit step in each coordinate means wildly different real-world changes. Then “steepest” under the usual Euclidean norm might not align with what you intend.
This is why feature scaling (standardization) matters in ML: it changes the geometry so gradient-based methods behave well.
If you maximize f(x) subject to g(x) = 0, the feasible directions are tangent to the level set g(x) = 0. At an optimum, moving in any tangent direction can’t improve f, so ∇f must be orthogonal to the tangent space—i.e., it must be parallel to ∇g.
This becomes the condition
∇f(x*) = λ ∇g(x*),
which is the heart of Lagrange multipliers.
When you see ∇f in later topics, interpret it as the direction (and rate) of steepest increase, the normal vector to level sets, and the object that turns a small step h into a first-order change ∇f · h.
Let f(x, y) = x²y + 3y. Compute ∇f(x, y). Then at (x, y) = (2, 1), predict the change in f for a small step h = (0.1, −0.05) using the linear approximation.
Compute partial derivatives:
∂f/∂x = ∂/∂x (x²y + 3y)
= 2xy.
∂f/∂y = ∂/∂y (x²y + 3y)
= x² + 3.
Assemble the gradient:
∇f(x, y) = (2xy, x² + 3).
Evaluate at (2, 1):
∇f(2, 1) = (2·2·1, 2² + 3)
= (4, 7).
Use the first-order approximation:
Δf ≈ ∇f(2, 1) · h
= (4, 7) · (0.1, −0.05)
= 4(0.1) + 7(−0.05)
= 0.4 − 0.35
= 0.05.
Insight: The gradient converts a small vector step h into an approximate scalar change via a dot product. The sign comes from alignment: the step had a small component against the y-gradient (−0.05 versus +7), nearly canceling the x increase.
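You can check this worked example numerically; the code below reuses the function and gradient derived above, comparing the linear prediction to the actual change.

```python
def f(x, y):
    # Function from the worked example: f(x, y) = x^2 * y + 3y
    return x**2 * y + 3 * y

def grad_f(x, y):
    return (2 * x * y, x**2 + 3)

gx, gy = grad_f(2.0, 1.0)                # (4, 7)
hx, hy = 0.1, -0.05

linear_prediction = gx * hx + gy * hy    # grad . h = 0.05
actual_change = f(2.0 + hx, 1.0 + hy) - f(2.0, 1.0)

print(linear_prediction)  # 0.05
print(actual_change)      # about 0.0395; the gap is the second-order term
```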
Let f(x, y) = x eʸ. At the point (0, 0), find (1) the unit direction u of steepest ascent and (2) the maximum directional derivative value.
Compute the gradient:
∂f/∂x = eʸ.
∂f/∂y = x eʸ.
So ∇f(x, y) = (eʸ, x eʸ).
Evaluate at (0, 0):
∇f(0, 0) = (e⁰, 0·e⁰)
= (1, 0).
Direction of steepest ascent is the normalized gradient:
u = ∇f / ‖∇f‖.
Here ‖∇f(0,0)‖ = √(1² + 0²) = 1.
So u = (1, 0).
Maximum directional derivative equals ‖∇f‖:
max_{‖u‖=1} D_u f(0,0) = ‖∇f(0,0)‖ = 1.
Insight: At (0,0), changing y does nothing to first order because the y-slope is proportional to x. The gradient correctly captures that the only immediate increase comes from moving in +x.
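A quick numerical check of this example, normalizing the gradient to get the steepest-ascent direction:

```python
import math

def grad_f(x, y):
    # Gradient of f(x, y) = x * e^y from the example
    return (math.exp(y), x * math.exp(y))

gx, gy = grad_f(0.0, 0.0)
max_rate = math.hypot(gx, gy)          # ||grad f(0,0)||
u = (gx / max_rate, gy / max_rate)     # steepest-ascent direction

print(u, max_rate)  # (1.0, 0.0) 1.0
```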
Consider f(x, y) = x² + y². Show that ∇f is perpendicular to the level set f(x, y) = 1 at a point (x, y) on the circle.
Compute the gradient:
∇f(x, y) = (∂f/∂x, ∂f/∂y)
= (2x, 2y).
Parameterize the level set f(x, y) = 1 by
r(t) = (cos t, sin t).
Then r′(t) = (−sin t, cos t), which is tangent to the circle.
Evaluate ∇f on the circle:
At r(t), ∇f = (2cos t, 2sin t).
Check orthogonality using a dot product:
∇f · r′(t)
= (2cos t, 2sin t) · (−sin t, cos t)
= 2cos t(−sin t) + 2sin t(cos t)
= −2sin t cos t + 2sin t cos t
= 0.
Insight: For circles centered at the origin, ∇f points radially outward while tangents go around the circle. “Radial” and “tangent” directions are perpendicular—this is the level-set orthogonality principle in a familiar shape.
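The orthogonality computed above holds at every point of the circle, which a short loop can confirm by evaluating ∇f · r′(t) at a few parameter values:

```python
import math

def grad_f(x, y):
    # Gradient of f(x, y) = x^2 + y^2
    return (2 * x, 2 * y)

# Evaluate grad_f . r'(t) at several points on the unit circle r(t) = (cos t, sin t)
dots = []
for t in [0.3, 1.2, 2.5]:
    gx, gy = grad_f(math.cos(t), math.sin(t))
    tx, ty = -math.sin(t), math.cos(t)     # tangent vector r'(t)
    dots.append(gx * tx + gy * ty)

print(dots)  # each entry is 0 (up to floating-point roundoff)
```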
∇f(x) is the vector of partial derivatives: one component per input variable.
Directional derivatives satisfy D_u f(x) = ∇f(x) · u for any unit direction u.
The steepest-ascent direction is ∇f/‖∇f‖ (when ∇f ≠ 0), and the maximal increase rate is ‖∇f‖.
The first-order approximation is f(x + h) ≈ f(x) + ∇f(x) · h for small h.
∇f is orthogonal to level sets f(x) = c (it is a normal vector to contours/surfaces).
If ∇f(x) = 0, then all first-order directional changes vanish; this identifies critical points (not necessarily minima).
In optimization and ML, gradients drive iterative updates like x ← x − α∇f(x).
Forgetting to evaluate the gradient at the point of interest (carrying around ∇f(x, y) but never plugging in (x₀, y₀)).
Using a non-unit direction in a directional derivative without accounting for its length (D_u assumes ‖u‖ = 1 when interpreting “rate per unit distance”).
Confusing “orthogonal to level sets” with “points toward the origin” (true for x² + y² but not general).
Assuming ∇f = 0 implies a minimum; it could be a maximum or saddle without second-order analysis.
Let f(x, y) = 3x² − 2xy + y². Compute ∇f(x, y) and evaluate it at (1, 2).
Hint: Differentiate with respect to x holding y constant, then with respect to y holding x constant.
∂f/∂x = 6x − 2y.
∂f/∂y = −2x + 2y.
So ∇f(x, y) = (6x − 2y, −2x + 2y).
At (1, 2): ∇f(1, 2) = (6·1 − 4, −2·1 + 4) = (2, 2).
For f(x, y) = x² + 4y², find the unit direction u at (1, 1) that gives the fastest decrease, and compute the corresponding directional derivative value.
Hint: Fastest decrease is in direction −∇f/‖∇f‖. The minimum directional derivative over unit vectors equals −‖∇f‖.
∇f(x, y) = (2x, 8y). So ∇f(1, 1) = (2, 8).
‖∇f(1,1)‖ = √(2² + 8²) = √(4 + 64) = √68 = 2√17.
Fastest decrease unit direction:
u = −(2, 8)/√68 = (−2/√68, −8/√68).
The corresponding directional derivative:
D_u f = ∇f · u = −‖∇f‖ = −2√17.
Let f(x, y) = x + y. Consider the level set f(x, y) = 5 (a line). Show that ∇f is orthogonal to this level set by computing a tangent direction and taking a dot product.
Hint: A level set x + y = 5 can be parameterized by (t, 5 − t).
∇f(x, y) = (1, 1) everywhere.
Parameterize the level set: r(t) = (t, 5 − t).
Then r′(t) = (1, −1), which is tangent.
Dot product: ∇f · r′(t) = (1, 1) · (1, −1) = 1 − 1 = 0.
So ∇f is orthogonal to the level set.