Matrix of partial derivatives. Change of variables in integrals.
You already know the gradient tells you how a scalar function changes as you nudge inputs. The Jacobian is the next step: it tells you how an entire vector of outputs changes—capturing the local linear behavior of a multivariable transformation and the volume-scaling you need for change of variables in integrals.
The Jacobian J_f(x) = Df(x) is the matrix of first-order partial derivatives of a vector-valued map f. It is the best linear approximation to f near a point. When f maps ℝⁿ → ℝⁿ, det(J_f) measures local oriented volume scaling; |det(J_f)| is the factor used in change-of-variables formulas in integrals.
For a scalar function g: ℝⁿ → ℝ, the gradient ∇g(x) summarizes first-order change: it’s the vector that best predicts how g changes for a small input step h.
But many important maps are vector-valued: coordinate changes, dynamical systems, and the layers of ML models all send vectors to vectors.
When the output is a vector, the “rate of change” can’t be captured by a single vector. Each output component depends on each input component. The Jacobian packages all those partial derivatives into one matrix.
Let f: ℝⁿ → ℝᵐ be written in components:
f(x) = (f₁(x), f₂(x), …, f_m(x))
where x = (x₁, x₂, …, x_n).
The Jacobian matrix of f at x is
J_f(x) = Df(x) = [ ∂f_i / ∂x_j ]
It is an m×n matrix whose (i, j) entry is:
(J_f(x))_{ij} = ∂f_i(x) / ∂x_j.
So:
If m = 1 (scalar output), then J_f is 1×n:
J_f(x) = [ ∂f/∂x₁ ∂f/∂x₂ … ∂f/∂x_n ]
This is the gradient written as a row vector (a common convention). Meanwhile ∇f(x) is usually a column vector; the transpose connects them:
J_f(x) = (∇f(x))ᵀ.
So the Jacobian generalizes the gradient.
The most important intuition is not “array of partials”, but “best linear approximation.” Near a point a, f behaves like:
f(a + h) ≈ f(a) + J_f(a) h
for small h.
That expression is the multivariable analogue of the 1D approximation:
f(a + h) ≈ f(a) + f′(a) h.
Here, f′(a) is a number; in many dimensions, the derivative becomes a matrix.
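This approximation is easy to check numerically. The following minimal NumPy sketch compares f(a + h) with f(a) + J_f(a) h for the map f(u, v) = (u² − v², 2uv); the evaluation point and step are illustrative choices:

```python
import numpy as np

def f(p):
    # f(u, v) = (u² - v², 2uv), a smooth map R² -> R².
    u, v = p
    return np.array([u**2 - v**2, 2*u*v])

def jacobian(p):
    # Analytic Jacobian: rows index outputs, columns index inputs.
    u, v = p
    return np.array([[2*u, -2*v],
                     [2*v,  2*u]])

a = np.array([1.0, 0.5])
h = np.array([1e-3, -2e-3])

exact = f(a + h)
linear = f(a) + jacobian(a) @ h   # f(a) + J_f(a) h
err = np.max(np.abs(exact - linear))
print(err)   # on the order of ||h||², far smaller than ||h||
```

The error is second order in ‖h‖, which is exactly what the definition of differentiability promises.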
If m = n, then J_f(x) is n×n and you can take its determinant:
det(J_f(x))
This scalar has a deep meaning: it measures how f locally scales oriented volume. That volume scaling is exactly what appears in change-of-variables for integrals.
Most nonlinear functions are hard to analyze globally. But if you zoom in enough, smooth functions look linear. This is the core strategy of multivariable calculus: replace a nonlinear map with its linearization. The Jacobian is the device that turns “zooming in” into a concrete computation.
Let f: ℝⁿ → ℝᵐ be differentiable at a. Then there exists a linear map L such that
f(a + h) = f(a) + L(h) + r(h)
where the remainder satisfies
‖r(h)‖ / ‖h‖ → 0 as ‖h‖ → 0.
That linear map L is the derivative Df(a). When we choose coordinates, L is represented by the Jacobian matrix J_f(a), and we write:
f(a + h) ≈ f(a) + J_f(a) h.
Write eⱼ for the j-th standard basis vector in ℝⁿ (a 1 in position j, else 0). Then:
J_f(a) eⱼ = column j of J_f(a).
But eⱼ corresponds to “nudge only x_j”. So:
column j ≈ how the output vector changes when you increase x_j a tiny bit.
Equivalently, each row i is:
row i = [ ∂f_i/∂x₁ … ∂f_i/∂x_n ]
which is the gradient of the i-th output component (as a row). So the Jacobian stacks the gradients of the output components as its rows.
If f: ℝⁿ → ℝᵐ and g: ℝᵐ → ℝᵏ, then the composition g ∘ f: ℝⁿ → ℝᵏ has Jacobian:
J_{g∘f}(x) = J_g(f(x)) · J_f(x).
This is the multivariable chain rule, and it looks exactly like matrix multiplication.
It’s worth pausing to connect this to “linear approximation of a composition”: the linearization of g ∘ f is the composition of the linearizations of g and f, and composing linear maps corresponds to multiplying their matrices. That is why the chain rule becomes matrix multiplication.
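A quick numerical sanity check of this identity, using central-difference Jacobians (the specific f and g below, and the sample point, are illustrative choices):

```python
import numpy as np

def f(p):          # f: R² -> R², (x, y) -> (x + y, x - y)
    x, y = p
    return np.array([x + y, x - y])

def g(q):          # g: R² -> R², (u, v) -> (u², uv)
    u, v = q
    return np.array([u**2, u * v])

def num_jacobian(func, p, eps=1e-6):
    # Central-difference Jacobian: column j approximates df/dx_j.
    p = np.asarray(p, dtype=float)
    cols = []
    for j in range(p.size):
        e = np.zeros_like(p); e[j] = eps
        cols.append((func(p + e) - func(p - e)) / (2 * eps))
    return np.column_stack(cols)

x = np.array([0.7, -0.3])
lhs = num_jacobian(lambda p: g(f(p)), x)          # J_{g∘f}(x)
rhs = num_jacobian(g, f(x)) @ num_jacobian(f, x)  # J_g(f(x)) · J_f(x)
print(np.allclose(lhs, rhs, atol=1e-5))           # True
```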
For a direction u ∈ ℝⁿ, the first-order change in f at a in direction u is:
Df(a) u = J_f(a) u.
This is the vector-valued directional derivative.
In the scalar-output case (m = 1), this reduces to:
J_f(a) u = (∇f(a))ᵀ u = ∇f(a) · u
which is the familiar directional derivative formula.
Suppose your input x has a small perturbation δx (measurement noise). Then the induced output perturbation is approximately:
δy ≈ J_f(x) δx.
So the Jacobian acts like a “local gain matrix.” This is the mathematical foundation for sensitivity analysis and for linearizing nonlinear systems in control and estimation.
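As a sketch of this “local gain” view, the snippet below pushes small random input perturbations through a map and compares the true output change against the Jacobian prediction J_f(x) δx. The measurement model f here is a hypothetical example:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(p):
    # Hypothetical measurement model R² -> R².
    x, y = p
    return np.array([np.exp(x) * np.cos(y), np.exp(x) * np.sin(y)])

def J(p):
    # Analytic Jacobian of f.
    x, y = p
    return np.array([[np.exp(x)*np.cos(y), -np.exp(x)*np.sin(y)],
                     [np.exp(x)*np.sin(y),  np.exp(x)*np.cos(y)]])

p = np.array([0.2, 1.0])
for _ in range(3):
    dx = 1e-4 * rng.standard_normal(2)       # small input noise
    dy_true = f(p + dx) - f(p)               # actual output change
    dy_lin = J(p) @ dx                       # Jacobian prediction
    print(np.linalg.norm(dy_true - dy_lin))  # tiny: second order in ||dx||
```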
Integration measures “total accumulation.” In multiple dimensions, it’s accumulation over area (2D) or volume (3D and beyond). If you change coordinates, a small patch in the new coordinates may correspond to a differently sized patch in the old coordinates.
So you need a conversion factor between tiny volume elements:
(d volume in x-space) = (scale factor) · (d volume in u-space)
That scale factor is the absolute value of the Jacobian determinant.
Assume f: ℝⁿ → ℝⁿ is differentiable and J_f(a) is invertible.
Near a, f behaves like:
f(a + h) ≈ f(a) + J_f(a) h.
So near a, f is approximately the linear map h ↦ J_f(a) h. For a linear map A, the determinant det(A) gives the oriented volume scaling: a region of volume V maps to a region of volume |det(A)| · V, with orientation reversed when det(A) < 0.
Therefore, for the nonlinear map f, the local volume scaling near a is approximately |det(J_f(a))|.
Let f: U ⊂ ℝⁿ → V ⊂ ℝⁿ be a bijective differentiable map with differentiable inverse (a diffeomorphism), and let g be integrable on V. Then:
∫∫…∫_V g(x) dx = ∫∫…∫_U g(f(u)) · |det(J_f(u))| du.
Here, u ranges over U, x = f(u) ranges over V, and |det(J_f(u))| converts u-space volume elements into x-space volume elements.
The key idea is:
dx = |det(J_f(u))| du.
In 2D, if f(u, v) = (x(u, v), y(u, v)), then
J_f(u, v) = [ ∂x/∂u ∂x/∂v
∂y/∂u ∂y/∂v ]
and the area element transforms as:
dx dy = |det(J_f(u, v))| du dv.
Consider polar coordinates, f(r, θ) = (x, y):
x = r cos θ
y = r sin θ
Compute J_f(r, θ):
∂x/∂r = cos θ,   ∂x/∂θ = −r sin θ
∂y/∂r = sin θ,   ∂y/∂θ = r cos θ
So
J_f = [ cos θ −r sin θ
sin θ r cos θ ]
and
det(J_f) = (cos θ)(r cos θ) − (−r sin θ)(sin θ)
= r cos²θ + r sin²θ
= r.
Thus:
dx dy = r dr dθ.
That single factor r is exactly the Jacobian determinant’s magnitude.
When you see an “extra factor” like r in polar coordinates (or r² sin φ in spherical coordinates), it is not arbitrary—it is the local volume-scaling |det(J_f)|.
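A midpoint Riemann sum makes the polar area element concrete: summing r dr dθ over a grid covering the unit disk should recover its area, π. This is a minimal numerical check, with the grid resolution chosen arbitrarily:

```python
import numpy as np

# Riemann-sum check that dx dy = r dr dθ: the area of the unit disk,
# computed in polar coordinates with the Jacobian factor r, should be π.
n = 1000
r = (np.arange(n) + 0.5) / n                 # midpoints in [0, 1]
theta = 2 * np.pi * (np.arange(n) + 0.5) / n # midpoints in [0, 2π]
dr, dtheta = 1.0 / n, 2 * np.pi / n

R, T = np.meshgrid(r, theta)
area = np.sum(R * dr * dtheta)               # sum of r dr dθ over the grid
print(area, np.pi)                           # ≈ 3.14159…
```

Dropping the factor R from the sum would give 2π instead of π, i.e. the wrong area: the Jacobian factor is doing real work.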
The determinant can be negative. Integrals measure (unsigned) volume/area, so the change-of-variables formula uses:
|det(J_f)|.
If you’re doing differential geometry or oriented integrals, the sign can matter; for standard multivariable calculus integrals over regions, absolute value is the rule.
Many ML models are compositions of vector-valued functions:
x → f₁(x) → f₂(f₁(x)) → … → y
Training relies on derivatives of a loss with respect to parameters, and those derivatives are built from Jacobians (and their transposes) via the chain rule.
Even when you mostly hear “gradients,” under the hood backpropagation is repeated multiplication by Jacobians (more precisely, Jacobian-transpose–vector products).
You already know gradients. The Jacobian sits between gradient and Hessian in complexity:
| Object | Typical function type | Shape | Captures | Notes |
|---|---|---|---|---|
| ∇g(x) | g: ℝⁿ → ℝ | n×1 | first-order change of scalar | steepest ascent direction |
| J_f(x) | f: ℝⁿ → ℝᵐ | m×n | first-order change of vector | linearization matrix |
| H_g(x) | g: ℝⁿ → ℝ | n×n | second-order change | curvature |
A helpful mental model: the gradient is a single row of a Jacobian, and the Hessian is the Jacobian of the gradient map.
Suppose you have residuals r(x) ∈ ℝᵐ and a scalar loss:
L(x) = ½ ‖r(x)‖².
Then the gradient of L can be written using the Jacobian of r:
Let J = J_r(x) (an m×n matrix). Then:
L(x) = ½ ∑_{i=1}^m r_i(x)²
Differentiate component-wise. For the j-th component of the gradient:
∂L/∂x_j = ½ ∑_{i=1}^m 2 r_i(x) · ∂r_i/∂x_j
= ∑_{i=1}^m r_i(x) · J_{ij}
In matrix form:
∇L(x) = J_r(x)ᵀ r(x).
This identity appears in Gauss–Newton, Levenberg–Marquardt, and many optimization routines.
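The identity ∇L(x) = J_r(x)ᵀ r(x) can be verified against a finite-difference gradient. The residual function below is a hypothetical example, not tied to any particular problem:

```python
import numpy as np

def residuals(x):
    # Hypothetical residual vector r: R² -> R³.
    return np.array([x[0]**2 - 1.0,
                     x[0]*x[1] - 0.5,
                     np.sin(x[1])])

def jac(x):
    # Analytic Jacobian of the residuals (3x2).
    return np.array([[2*x[0], 0.0],
                     [x[1],   x[0]],
                     [0.0,    np.cos(x[1])]])

def loss(x):
    r = residuals(x)
    return 0.5 * r @ r            # L(x) = ½ ||r(x)||²

x = np.array([0.8, 0.3])
grad_identity = jac(x).T @ residuals(x)   # ∇L = Jᵀ r

# Finite-difference gradient of L for comparison.
eps = 1e-6
grad_fd = np.array([(loss(x + eps*e) - loss(x - eps*e)) / (2*eps)
                    for e in np.eye(2)])
print(np.allclose(grad_identity, grad_fd, atol=1e-6))  # True
```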
For a dynamical system:
x_{t+1} = f(x_t)
the Jacobian J_f at a fixed point x⋆ characterizes local stability:
x_{t+1} − x⋆ ≈ J_f(x⋆) ( x_t − x⋆ )
Eigenvalues of J_f(x⋆) determine whether perturbations shrink or grow.
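A small sketch of this stability test, using a hypothetical system with fixed point x⋆ = (0, 0): the spectral radius of J_f(x⋆) is below 1, so iterates started near the fixed point contract toward it.

```python
import numpy as np

def f(x):
    # Hypothetical discrete-time system with fixed point x* = (0, 0).
    return np.array([0.5*x[0] + 0.1*np.sin(x[1]),
                     0.3*x[1] + 0.1*x[0]**2])

def J(x):
    # Jacobian of f.
    return np.array([[0.5,      0.1*np.cos(x[1])],
                     [0.2*x[0], 0.3]])

eigvals = np.linalg.eigvals(J(np.zeros(2)))
print(np.max(np.abs(eigvals)))   # spectral radius 0.5 < 1 → locally stable

x = np.array([0.4, -0.3])        # start near the fixed point
for _ in range(30):
    x = f(x)
print(np.linalg.norm(x))         # perturbation has shrunk toward 0
```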
Matrix calculus generalizes these ideas when variables and outputs are vectors/matrices and you want systematic rules.
Key bridge concepts you’ll use next: the chain rule as matrix multiplication, and transpose identities such as ∇L(x) = J_r(x)ᵀ r(x).
This node sets the foundation: once “derivative = linear map = Jacobian matrix” feels natural, matrix calculus becomes mostly careful bookkeeping plus chain rule.
Let f: ℝ² → ℝ² be f(x, y) = (f₁(x, y), f₂(x, y)) where f₁(x, y) = x²y and f₂(x, y) = sin(x + y). Compute J_f(x, y). Then linearize at (1, 0) to approximate f(1.02, −0.01).
Step 1: Compute partial derivatives for f₁(x, y) = x²y.
∂f₁/∂x = 2xy
∂f₁/∂y = x²
Step 2: Compute partial derivatives for f₂(x, y) = sin(x + y).
∂f₂/∂x = cos(x + y)
∂f₂/∂y = cos(x + y)
Step 3: Assemble the Jacobian matrix.
J_f(x, y) = [ 2xy x²
cos(x+y) cos(x+y) ]
Step 4: Evaluate the Jacobian at (1, 0).
J_f(1, 0) = [ 2·1·0 1²
cos(1+0) cos(1+0) ]
= [ 0 1
cos 1 cos 1 ]
Step 5: Compute f(1, 0).
f(1, 0) = (1²·0, sin(1+0)) = (0, sin 1)
Step 6: Form the small displacement h from (1, 0) to (1.02, −0.01).
h = (Δx, Δy) = (0.02, −0.01)
Step 7: Apply the linearization f(a+h) ≈ f(a) + J_f(a) h.
J_f(1,0)h = [ 0 1
cos1 cos1 ] [ 0.02
−0.01 ]
First component: 0·0.02 + 1·(−0.01) = −0.01
Second component: cos1·0.02 + cos1·(−0.01) = cos1·(0.01) = 0.01 cos1
Step 8: Combine.
f(1.02, −0.01) ≈ (0, sin1) + (−0.01, 0.01 cos1)
= (−0.01, sin1 + 0.01 cos1)
Insight: The Jacobian turns “small input change” into “approximate output change” via matrix multiplication. Notice how the first output f₁ is most sensitive to y near (1,0) (since ∂f₁/∂x = 0 there), which is immediately visible in J_f(1,0).
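The worked example above can be checked numerically by comparing the exact value f(1.02, −0.01) with the linear approximation (−0.01, sin 1 + 0.01 cos 1):

```python
import numpy as np

def f(x, y):
    # f(x, y) = (x²y, sin(x + y)) from the worked example.
    return np.array([x**2 * y, np.sin(x + y)])

exact = f(1.02, -0.01)
approx = np.array([-0.01, np.sin(1) + 0.01*np.cos(1)])
print(np.abs(exact - approx))   # both components agree to about 1e-3 or better
```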
Use the transformation f(r, θ) = (x, y) = (r cos θ, r sin θ). Compute det(J_f) and show that dx dy = r dr dθ.
Step 1: Write the Jacobian matrix.
J_f(r, θ) = [ ∂x/∂r ∂x/∂θ
∂y/∂r ∂y/∂θ ]
Step 2: Compute the partial derivatives.
∂x/∂r = cos θ
∂x/∂θ = −r sin θ
∂y/∂r = sin θ
∂y/∂θ = r cos θ
Step 3: Substitute into the matrix.
J_f(r, θ) = [ cos θ −r sin θ
sin θ r cos θ ]
Step 4: Compute the determinant.
det(J_f) = (cos θ)(r cos θ) − (−r sin θ)(sin θ)
= r cos²θ + r sin²θ
= r( cos²θ + sin²θ )
= r
Step 5: Convert the area element.
dx dy = |det(J_f(r, θ))| dr dθ = |r| dr dθ
In standard polar coordinates, r ≥ 0, so |r| = r.
Therefore dx dy = r dr dθ.
Insight: The mysterious “extra r” in polar integrals is exactly local area scaling. A tiny rectangle of size dr×dθ in (r,θ)-space maps to a curved wedge-like region in (x,y)-space whose area is approximately r·dr·dθ.
Let f: ℝ² → ℝ² be f(x, y) = (u, v) = (x + y, x − y). Let g: ℝ² → ℝ² be g(u, v) = (u², uv). Compute J_{g∘f}(x, y) using the chain rule.
Step 1: Compute J_f(x, y).
u = x + y ⇒ ∂u/∂x = 1, ∂u/∂y = 1
v = x − y ⇒ ∂v/∂x = 1, ∂v/∂y = −1
So J_f(x, y) = [ 1 1
1 −1 ]
Step 2: Compute J_g(u, v).
First component: g₁(u, v) = u²
∂g₁/∂u = 2u, ∂g₁/∂v = 0
Second component: g₂(u, v) = uv
∂g₂/∂u = v, ∂g₂/∂v = u
So J_g(u, v) = [ 2u 0
v u ]
Step 3: Apply the chain rule.
J_{g∘f}(x, y) = J_g(f(x, y)) · J_f(x, y)
Substitute u = x + y and v = x − y:
J_g(f(x, y)) = [ 2(x+y) 0
(x−y) (x+y) ]
Step 4: Multiply the matrices.
J_{g∘f} = [ 2(x+y) 0
(x−y) (x+y) ] [ 1 1
1 −1 ]
Compute entry-by-entry:
Top row:
(1,1): 2(x+y)·1 + 0·1 = 2(x+y)
(1,2): 2(x+y)·1 + 0·(−1) = 2(x+y)
Bottom row:
(2,1): (x−y)·1 + (x+y)·1 = (x−y)+(x+y)=2x
(2,2): (x−y)·1 + (x+y)·(−1) = (x−y)−(x+y)=−2y
So J_{g∘f}(x, y) = [ 2(x+y) 2(x+y)
2x −2y ]
Insight: The Jacobian chain rule is “just” matrix multiplication because derivatives are linear maps. Computing J_g at (u,v) and then substituting (u,v)=f(x,y) keeps the structure clean and scales to long compositions.
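A quick check of the closed-form answer at a sample point (the point (0.7, −0.2) is an arbitrary choice):

```python
import numpy as np

def f(x, y):
    # (u, v) = (x + y, x - y)
    return np.array([x + y, x - y])

def Jg(u, v):
    # Jacobian of g(u, v) = (u², uv)
    return np.array([[2*u, 0.0],
                     [v,   u]])

Jf = np.array([[1.0,  1.0],
               [1.0, -1.0]])          # Jacobian of f (constant)

x, y = 0.7, -0.2
u, v = f(x, y)
chain = Jg(u, v) @ Jf                 # J_g(f(x, y)) · J_f(x, y)
closed = np.array([[2*(x+y), 2*(x+y)],
                   [2*x,     -2*y]])  # the closed form derived above
print(np.allclose(chain, closed))     # True
```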
The Jacobian J_f(x) = Df(x) is the m×n matrix with entries (∂f_i/∂x_j), describing first-order change of f: ℝⁿ → ℝᵐ.
Linearization: for small h, f(a+h) ≈ f(a) + J_f(a) h; the Jacobian is the best linear approximation near a.
Columns of J_f describe how the output changes when you perturb one input coordinate; rows are gradients of each output component (as row vectors).
Chain rule: J_{g∘f}(x) = J_g(f(x)) · J_f(x)—composition becomes matrix multiplication.
When n = m, det(J_f(x)) measures local oriented volume scaling; |det(J_f(x))| is the local (unsigned) volume scale factor.
Change of variables in integrals uses dx = |det(J_f(u))| du for x = f(u).
Many “extra factors” in coordinate systems (like r in polar) are exactly Jacobian determinants.
Mixing up the shape: for f: ℝⁿ → ℝᵐ, the Jacobian is m×n (outputs by inputs), not n×m.
Confusing ∇f with J_f: the gradient is for scalar outputs; for vector outputs you need the full Jacobian (or one gradient per component).
Forgetting the absolute value in change-of-variables: integrals over regions use |det(J)|, not det(J) when det could be negative.
Evaluating J_g at the wrong point in the chain rule: J_{g∘f}(x) requires J_g at f(x), not at x.
Let f(x, y, z) = (xy, yz). Compute J_f(x, y, z). What is J_f(1, 2, 3)?
Hint: There are m=2 outputs and n=3 inputs, so J is 2×3. Differentiate each output with respect to x, y, z.
f₁=xy ⇒ ∂f₁/∂x=y, ∂f₁/∂y=x, ∂f₁/∂z=0.
f₂=yz ⇒ ∂f₂/∂x=0, ∂f₂/∂y=z, ∂f₂/∂z=y.
So J_f(x,y,z) = [ y x 0
0 z y ].
At (1,2,3): J_f(1,2,3) = [ 2 1 0
0 3 2 ].
Let f: ℝ² → ℝ² be f(u, v) = (x, y) = (u² − v², 2uv). (This is complex squaring: with z = u + iv, z² = (u² − v²) + i·2uv.) Compute det(J_f(u, v)).
Hint: Compute ∂x/∂u, ∂x/∂v, ∂y/∂u, ∂y/∂v, then take a 2×2 determinant.
x=u²−v² ⇒ ∂x/∂u=2u, ∂x/∂v=−2v.
y=2uv ⇒ ∂y/∂u=2v, ∂y/∂v=2u.
J_f(u,v) = [ 2u −2v
2v 2u ].
det(J_f) = (2u)(2u) − (−2v)(2v)
= 4u² + 4v²
= 4(u²+v²).
Use a Jacobian to perform the substitution in the integral ∬_R (x + y) dx dy, where R is the image of the unit square (u, v) ∈ [0,1]×[0,1] under x = u + v, y = u − v. Compute the value.
Hint: Compute det(J_f) for f(u,v)=(x,y). Rewrite x+y in terms of u,v. Then integrate over the unit square and multiply by |det(J_f)|.
Define f(u,v)=(x,y) with x=u+v, y=u−v.
Jacobian:
J_f = [ ∂x/∂u ∂x/∂v
∂y/∂u ∂y/∂v ]
= [ 1 1
1 −1 ].
det(J_f) = (1)(−1) − (1)(1) = −2, so |det(J_f)|=2.
Rewrite integrand:
x+y = (u+v)+(u−v)=2u.
Change variables:
∬_R (x+y) dx dy = ∬_{[0,1]²} (2u) · 2 du dv = ∬_{[0,1]²} 4u du dv.
Compute:
∫_0^1 ∫_0^1 4u dv du = ∫_0^1 (4u·1) du = 4 ∫_0^1 u du = 4·(1/2)=2.
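A midpoint-rule sketch confirms that the substituted integral ∬_{[0,1]²} 4u du dv evaluates to 2 (the grid size is arbitrary):

```python
import numpy as np

# Midpoint Riemann sum of the substituted integrand 4u over [0,1]².
n = 400
u = (np.arange(n) + 0.5) / n
v = (np.arange(n) + 0.5) / n
U, V = np.meshgrid(u, v)
val = np.sum(4 * U) / n**2    # each cell has area 1/n²
print(val)                    # ≈ 2.0
```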
Next: Matrix Calculus