Jacobian

Calculus · Difficulty: 3/5 · Depth: 6 · Unlocks: 14

Matrix of partial derivatives. Change of variables in integrals.


Core Concepts

  • Jacobian matrix as the array of first-order partial derivatives (entry = partial f_i / partial x_j)
  • Jacobian as the derivative/linearization: the best linear approximation (matrix) of a vector-valued map at a point
  • Jacobian determinant (square case): scalar giving local oriented volume scaling (absolute value used for change of variables)

Key Symbols & Notation

Df(x) or J_f(x) for the Jacobian matrix; det(Df(x)) or |J_f(x)| for its determinant

Essential Relationships

  • Linearization: f(x+dx) ≈ f(x) + Df(x) · dx (matrix times vector)
  • Change-of-variables/volume-scaling: for y=f(x) (locally invertible), infinitesimal volumes transform by |det(Df(x))|, hence integrals change using that factor
All Concepts (9)

  • Jacobian matrix: for a function f: R^n -> R^m, the m×n matrix whose (i,j) entry is ∂f_i/∂x_j
  • Jacobian determinant: determinant of a square Jacobian (when m = n)
  • Jacobian as linear approximation (differential): the Jacobian gives the best linear map approximating f near a point
  • Local volume (area) scaling: the Jacobian determinant measures how infinitesimal volumes are scaled by the map
  • Absolute Jacobian factor in integrals: using the absolute value of the Jacobian determinant when changing variables in integrals
  • Orientation information: the sign of the Jacobian determinant indicates whether the mapping preserves or reverses orientation
  • Local invertibility criterion (inverse function theorem) in terms of Jacobian determinant: nonzero determinant implies local invertibility
  • Rectangular (non-square) Jacobians: Jacobian matrices when m ≠ n have no determinant but still represent the linear differential
  • Column/row interpretation: each column is the partial derivatives of the output(s) with respect to one input variable (or each row is the gradient of one output component)

Teaching Strategy

Multi-session curriculum: this topic assumes substantial prior knowledge and covers complex material. Use mastery gates and deliberate practice.

You already know the gradient tells you how a scalar function changes as you nudge inputs. The Jacobian is the next step: it tells you how an entire vector of outputs changes—capturing the local linear behavior of a multivariable transformation and the volume-scaling you need for change of variables in integrals.

TL;DR:

The Jacobian J_f(x) = Df(x) is the matrix of first-order partial derivatives of a vector-valued map f. It is the best linear approximation to f near a point. When f maps ℝⁿ → ℝⁿ, det(J_f) measures local oriented volume scaling; |det(J_f)| is the factor used in change-of-variables formulas in integrals.

What Is Jacobian?

Why you need a new object beyond the gradient

For a scalar function g: ℝⁿ → ℝ, the gradient ∇g(x) summarizes first-order change: it’s the vector that best predicts how g changes for a small input step h.

But many important maps are vector-valued:

  • A coordinate transform f(u, v) = (x, y)
  • A physics map from state to state
  • A neural network layer that maps an input vector to an output vector

When the output is a vector, the “rate of change” can’t be captured by a single vector. Each output component depends on each input component. The Jacobian packages all those partial derivatives into one matrix.

Definition (matrix of partial derivatives)

Let f: ℝⁿ → ℝᵐ be written in components:

f(x) = (f₁(x), f₂(x), …, f_m(x))

where x = (x₁, x₂, …, x_n).

The Jacobian matrix of f at x is

J_f(x) = Df(x) = [ ∂f_i / ∂x_j ]

It is an m×n matrix whose (i, j) entry is:

(J_f(x))_{ij} = ∂f_i(x) / ∂x_j.

So:

  • Rows correspond to output components f_i
  • Columns correspond to input variables x_j

Relationship to the gradient (a comforting special case)

If m = 1 (scalar output), then J_f is 1×n:

J_f(x) = [ ∂f/∂x₁ ∂f/∂x₂ … ∂f/∂x_n ]

This is exactly the gradient written as a row vector (a matter of convention), whereas ∇f(x) is usually taken as a column vector. The transpose connects them:

J_f(x) = (∇f(x))ᵀ.

So the Jacobian generalizes the gradient.

A geometric preview: local linear map

The most important intuition is not “array of partials”, but “best linear approximation.” Near a point a, f behaves like:

f(a + h) ≈ f(a) + J_f(a) h

for small h.

That expression is the multivariable analogue of the 1D approximation:

f(a + h) ≈ f(a) + f′(a) h.

Here, f′(a) is a number; in many dimensions, the derivative becomes a matrix.
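The "derivative becomes a matrix" idea can be made concrete by building a Jacobian column by column with finite differences. A minimal sketch (assuming NumPy; the map f below is an illustrative choice, not from the text):

```python
import numpy as np

# Illustrative map f: R^2 -> R^3, so its Jacobian is 3x2 (outputs by inputs).
def f(p):
    u, v = p
    return np.array([u * v, u + v, np.exp(u)])

def numerical_jacobian(func, x, eps=1e-6):
    # Column j ≈ (f(x + eps e_j) - f(x - eps e_j)) / (2 eps):
    # each column is the response to nudging one input coordinate.
    cols = [(func(x + eps * e) - func(x - eps * e)) / (2 * eps)
            for e in np.eye(len(x))]
    return np.stack(cols, axis=1)

x = np.array([0.5, 1.5])
J = numerical_jacobian(f, x)
print(J.shape)  # (3, 2)
```

The analytic Jacobian here is [[v, u], [1, 1], [eᵘ, 0]], and the finite-difference matrix matches it to high precision.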

Square case and determinant

If m = n, then J_f(x) is n×n and you can take its determinant:

det(J_f(x))

This scalar has a deep meaning:

  • sign(det) tells whether the map locally preserves or flips orientation
  • |det| tells how much the map locally scales n-dimensional volume

That volume-scaling is exactly what appears in change-of-variables for integrals.

Core Mechanic 1: Jacobian as the Derivative (Linearization)

Why linearization matters

Most nonlinear functions are hard to analyze globally. But if you zoom in enough, smooth functions look linear. This is the core strategy behind:

  • Newton’s method and other root-finding algorithms
  • error propagation and sensitivity analysis
  • optimization and gradient-based learning (locally linear steps)

The Jacobian is the device that turns “zooming in” into a concrete computation.

The best linear approximation statement

Let f: ℝⁿ → ℝᵐ be differentiable at a. Then there exists a linear map L such that

f(a + h) = f(a) + L(h) + r(h)

where the remainder satisfies

‖r(h)‖ / ‖h‖ → 0 as ‖h‖ → 0.

That linear map L is the derivative Df(a). When we choose coordinates, L is represented by the Jacobian matrix J_f(a), and we write:

f(a + h) ≈ f(a) + J_f(a) h.

Interpreting columns and rows

Write eⱼ for the j-th standard basis vector in ℝⁿ (a 1 in position j, else 0). Then:

J_f(a) eⱼ = column j of J_f(a).

But eⱼ corresponds to “nudge only x_j”. So:

column j ≈ how the output vector changes when you increase x_j a tiny bit.

Equivalently, each row i is:

row i = [ ∂f_i/∂x₁ … ∂f_i/∂x_n ]

which is the gradient of the i-th output component (as a row). So:

  • columns = sensitivity directions in input space
  • rows = gradients of each output component

Chain rule in Jacobian form (the practical payoff)

If f: ℝⁿ → ℝᵐ and g: ℝᵐ → ℝᵏ, then the composition g ∘ f: ℝⁿ → ℝᵏ has Jacobian:

J_{g∘f}(x) = J_g(f(x)) · J_f(x).

This is the multivariable chain rule, and it looks exactly like matrix multiplication.

It’s worth pausing to connect this to “linear approximation of a composition”:

  • f turns a small input step h into an approximate output step J_f h
  • g then turns that step into J_g (J_f h)
  • overall: (J_g J_f) h

That is why the chain rule becomes matrix multiplication.
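The identity J_{g∘f}(x) = J_g(f(x)) · J_f(x) can be checked numerically. A sketch, assuming NumPy; f and g below are illustrative maps chosen for the check, not taken from the text:

```python
import numpy as np

def f(p):  # f: R^2 -> R^2
    return np.array([p[0] + p[1], p[0] * p[1]])

def g(q):  # g: R^2 -> R^3
    return np.array([np.sin(q[0]), q[0] * q[1], np.exp(q[1])])

def fd_jacobian(func, x, eps=1e-6):
    # Central-difference Jacobian, one column per input coordinate.
    cols = [(func(x + eps * e) - func(x - eps * e)) / (2 * eps)
            for e in np.eye(len(x))]
    return np.stack(cols, axis=1)

x = np.array([0.3, -0.7])
lhs = fd_jacobian(lambda p: g(f(p)), x)          # Jacobian of the composition
rhs = fd_jacobian(g, f(x)) @ fd_jacobian(f, x)   # product of Jacobians
print(np.allclose(lhs, rhs, atol=1e-5))  # True
```

Note that the inner Jacobian J_g is evaluated at f(x), not at x — the same point-matching that trips people up when applying the chain rule by hand.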

Directional derivatives via the Jacobian

For a direction u ∈ ℝⁿ, the first-order change in f at a in direction u is:

Df(a) u = J_f(a) u.

This is the vector-valued directional derivative.

In the scalar-output case (m = 1), this reduces to:

J_f(a) u = (∇f(a))ᵀ u = ∇f(a) · u

which is the familiar directional derivative formula.

Small-error propagation (a common use)

Suppose your input x has a small perturbation δx (measurement noise). Then the induced output perturbation is approximately:

δy ≈ J_f(x) δx.

So the Jacobian acts like a “local gain matrix.” This is the mathematical foundation for sensitivity analysis and for linearizing nonlinear systems in control and estimation.
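A small sketch of δy ≈ J_f(x) δx, using polar-to-Cartesian conversion as the (assumed) map and comparing predicted vs actual output perturbations:

```python
import numpy as np

def f(p):  # (r, theta) -> (x, y)
    r, th = p
    return np.array([r * np.cos(th), r * np.sin(th)])

def J(p):  # analytic Jacobian of the polar-to-Cartesian map
    r, th = p
    return np.array([[np.cos(th), -r * np.sin(th)],
                     [np.sin(th),  r * np.cos(th)]])

x = np.array([2.0, 0.5])
dx = np.array([1e-3, -2e-3])        # small measurement perturbation
dy_actual = f(x + dx) - f(x)
dy_pred = J(x) @ dx                  # "local gain matrix" applied to the noise
print(np.abs(dy_actual - dy_pred).max() < 1e-4)  # True: mismatch is O(||dx||^2)
```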

Core Mechanic 2: Jacobian Determinant and Change of Variables

Why determinants show up in integrals

Integration measures “total accumulation.” In multiple dimensions, it’s accumulation over area (2D) or volume (3D and beyond). If you change coordinates, a small patch in the new coordinates may correspond to a differently sized patch in the old coordinates.

So you need a conversion factor between tiny volume elements:

(d volume in x-space) = (scale factor) · (d volume in u-space)

That scale factor is the absolute value of the Jacobian determinant.

Local volume scaling intuition

Assume f: ℝⁿ → ℝⁿ is differentiable and J_f(a) is invertible.

Near a, f behaves like:

f(a + h) ≈ f(a) + J_f(a) h.

So near a, f is approximately the linear map h ↦ J_f(a) h. For a linear map A, the determinant det(A) gives the oriented volume scaling:

  • A maps a tiny n-dimensional parallelepiped to another
  • the volume scales by |det(A)|
  • orientation flips if det(A) < 0

Therefore, for the nonlinear map f, the local volume scaling near a is approximately |det(J_f(a))|.

The change-of-variables formula (multivariable substitution)

Let f: U ⊂ ℝⁿ → V ⊂ ℝⁿ be a bijective differentiable map with differentiable inverse (a diffeomorphism), and let g be integrable on V. Then:

∫∫…∫_V g(x) dx = ∫∫…∫_U g(f(u)) · |det(J_f(u))| du.

Here:

  • u are the new coordinates
  • x = f(u) are the old coordinates
  • dx and du represent n-dimensional volume elements

The key idea is:

dx = |det(J_f(u))| du.

2D special case: area scaling

In 2D, if f(u, v) = (x(u, v), y(u, v)), then

J_f(u, v) = [ ∂x/∂u ∂x/∂v

∂y/∂u ∂y/∂v ]

and the area element transforms as:

dx dy = |det(J_f(u, v))| du dv.

Common coordinate transforms

Polar coordinates

x = r cos θ

y = r sin θ

Compute J_f(r, θ):

∂x/∂r = cos θ,  ∂x/∂θ = −r sin θ

∂y/∂r = sin θ,  ∂y/∂θ = r cos θ

So

J_f = [ cos θ −r sin θ

sin θ r cos θ ]

and

det(J_f) = (cos θ)(r cos θ) − (−r sin θ)(sin θ)

= r cos²θ + r sin²θ

= r.

Thus:

dx dy = r dr dθ.

That single factor r is exactly the Jacobian determinant’s magnitude.
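A quick numerical sanity check (a sketch, assuming NumPy): summing the Jacobian factor r over a grid in (r, θ)-space should recover the area of the unit disc, π.

```python
import numpy as np

# Riemann sum of ∬ r dr dθ over r in [0,1], θ in [0,2π]:
# the factor r converts (dr dθ)-cells into actual areas.
n = 2000
r = np.linspace(0, 1, n)
th = np.linspace(0, 2 * np.pi, n)
dr, dth = r[1] - r[0], th[1] - th[0]
R, TH = np.meshgrid(r, th)
area = np.sum(R) * dr * dth
print(abs(area - np.pi) < 1e-2)  # True
```

Dropping the factor r (i.e., summing 1 instead of R) would give 2π instead of π — the "missing r" error made concrete.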

General lesson

When you see an “extra factor” like r in polar coordinates (or r² sin φ in spherical coordinates), it is not arbitrary—it is the local volume-scaling |det(J_f)|.

Orientation vs absolute value

The determinant can be negative. Integrals measure (unsigned) volume/area, so the change-of-variables formula uses:

|det(J_f)|.

If you’re doing differential geometry or oriented integrals, the sign can matter; for standard multivariable calculus integrals over regions, absolute value is the rule.

Application/Connection: Jacobians in Optimization, ML, and Matrix Calculus

Why Jacobians show up constantly in ML

Many ML models are compositions of vector-valued functions:

x → f₁(x) → f₂(f₁(x)) → … → y

Training relies on derivatives of a loss with respect to parameters, and those derivatives are built from Jacobians (and their transposes) via the chain rule.

Even when you mostly hear “gradients,” under the hood:

  • a gradient is a Jacobian of a scalar-output function
  • backprop is repeated application of the Jacobian chain rule

Jacobian vs gradient vs Hessian (positioning)

You already know gradients. The Jacobian sits between gradient and Hessian in complexity:

| Object | Typical function type | Shape | Captures | Notes |
|---|---|---|---|---|
| ∇g(x) | g: ℝⁿ → ℝ | n×1 | first-order change of a scalar | steepest-ascent direction |
| J_f(x) | f: ℝⁿ → ℝᵐ | m×n | first-order change of a vector | linearization matrix |
| H_g(x) | g: ℝⁿ → ℝ | n×n | second-order change | curvature |

A helpful mental model:

  • gradient: “slope vector”
  • Jacobian: “slope matrix”
  • Hessian: “curvature matrix”

Jacobian-transpose trick (common in least squares)

Suppose you have residuals r(x) ∈ ℝᵐ and a scalar loss:

L(x) = ½ ‖r(x)‖².

Then the gradient of L can be written using the Jacobian of r:

Let J = J_r(x) (an m×n matrix). Then:

L(x) = ½ ∑_{i=1}^m r_i(x)².

Differentiate component-wise. For the j-th component:

∂L/∂x_j = ½ ∑_{i=1}^m 2 r_i(x) · ∂r_i/∂x_j

= ∑_{i=1}^m r_i(x) · J_{ij}

In matrix form:

∇L(x) = J_r(x)ᵀ r(x).

This identity appears in Gauss–Newton, Levenberg–Marquardt, and many optimization routines.
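The identity ∇L(x) = J_r(x)ᵀ r(x) is easy to verify numerically. A sketch assuming NumPy; the residual map r below is an illustrative choice:

```python
import numpy as np

def r(x):  # residuals r: R^2 -> R^3
    return np.array([x[0] ** 2 - 1.0, x[0] * x[1], np.sin(x[1])])

def Jr(x):  # analytic Jacobian of r (3x2)
    return np.array([[2 * x[0], 0.0],
                     [x[1],     x[0]],
                     [0.0,      np.cos(x[1])]])

def L(x):  # least-squares loss
    return 0.5 * np.sum(r(x) ** 2)

x = np.array([0.8, -0.4])
grad_identity = Jr(x).T @ r(x)      # the J^T r formula

eps = 1e-6                           # central-difference gradient of L
grad_fd = np.array([(L(x + eps * e) - L(x - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(np.allclose(grad_identity, grad_fd, atol=1e-6))  # True
```

This is exactly the gradient that Gauss–Newton-style solvers assemble at each iteration.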

Jacobian in dynamics and stability

For a dynamical system:

x_{t+1} = f(x_t)

the Jacobian J_f at a fixed point x⋆ characterizes local stability:

x_{t+1} − x⋆ ≈ J_f(x⋆) ( x_t − x⋆ )

Eigenvalues of J_f(x⋆) determine whether perturbations shrink or grow.
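A minimal sketch of this stability test (the map f and its fixed point are assumptions chosen for the example): if all eigenvalues of J_f(x⋆) lie inside the unit circle, perturbations shrink.

```python
import numpy as np

def f(x):  # illustrative discrete-time map with fixed point at the origin
    return np.array([0.5 * x[0] + 0.1 * np.sin(x[1]),
                     0.2 * x[0] ** 2 + 0.3 * x[1]])

x_star = np.array([0.0, 0.0])       # f(0, 0) = (0, 0)
J = np.array([[0.5, 0.1],           # Jacobian of f at the fixed point,
              [0.0, 0.3]])          # computed by hand
rho = np.max(np.abs(np.linalg.eigvals(J)))  # spectral radius
print(rho < 1)  # True: the fixed point is locally stable
```

Iterating f from a small perturbation of x⋆ would show the state contracting toward the origin at rate roughly rho per step.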

Bridge to Matrix Calculus

Matrix calculus generalizes these ideas when variables and outputs are vectors/matrices and you want systematic rules.

Key bridge concepts you’ll use next:

  • organizing derivatives consistently (shapes and conventions)
  • Jacobians of common vector operations
  • combining Jacobians with chain rule for complex compositions

This node sets the foundation: once “derivative = linear map = Jacobian matrix” feels natural, matrix calculus becomes mostly careful bookkeeping plus chain rule.

Worked Examples (3)

Compute a Jacobian and use it to linearize a vector-valued function

Let f: ℝ² → ℝ² be f(x, y) = (f₁(x, y), f₂(x, y)) where f₁(x, y) = x²y and f₂(x, y) = sin(x + y). Compute J_f(x, y). Then linearize at (1, 0) to approximate f(1.02, −0.01).

  1. Step 1: Compute partial derivatives for f₁(x, y) = x²y.

    ∂f₁/∂x = 2xy

    ∂f₁/∂y = x²

  2. Step 2: Compute partial derivatives for f₂(x, y) = sin(x + y).

    ∂f₂/∂x = cos(x + y)

    ∂f₂/∂y = cos(x + y)

  3. Step 3: Assemble the Jacobian matrix.

    J_f(x, y) = [ 2xy x²

    cos(x+y) cos(x+y) ]

  4. Step 4: Evaluate the Jacobian at (1, 0).

    J_f(1, 0) = [ 2·1·0 1²

    cos(1+0) cos(1+0) ]

    = [ 0 1

    cos 1 cos 1 ]

  5. Step 5: Compute f(1, 0).

    f(1, 0) = (1²·0, sin(1+0)) = (0, sin 1)

  6. Step 6: Form the small displacement h from (1, 0) to (1.02, −0.01).

    h = (Δx, Δy) = (0.02, −0.01)

  7. Step 7: Apply the linearization f(a+h) ≈ f(a) + J_f(a) h.

    J_f(1,0)h = [ 0 1

    cos1 cos1 ] [ 0.02

    −0.01 ]

    First component: 0·0.02 + 1·(−0.01) = −0.01

    Second component: cos1·0.02 + cos1·(−0.01) = cos1·(0.01) = 0.01 cos1

  8. Step 8: Combine.

    f(1.02, −0.01) ≈ (0, sin1) + (−0.01, 0.01 cos1)

    = (−0.01, sin1 + 0.01 cos1)

Insight: The Jacobian turns “small input change” into “approximate output change” via matrix multiplication. Notice how the first output f₁ is most sensitive to y near (1,0) (since ∂f₁/∂x = 0 there), which is immediately visible in J_f(1,0).
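The steps above can be replayed numerically (assuming NumPy), confirming both the linearized value and how close it is to the true f(1.02, −0.01):

```python
import numpy as np

f = lambda x, y: np.array([x ** 2 * y, np.sin(x + y)])
Jf = lambda x, y: np.array([[2 * x * y,     x ** 2],
                            [np.cos(x + y), np.cos(x + y)]])

a = np.array([1.0, 0.0])
h = np.array([0.02, -0.01])
approx = f(*a) + Jf(*a) @ h          # (-0.01, sin 1 + 0.01 cos 1)
exact = f(*(a + h))                  # true value f(1.02, -0.01)
print(np.abs(approx - exact).max() < 1e-3)  # True
```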

Derive the polar-coordinate area factor using det(J)

Use the transformation f(r, θ) = (x, y) = (r cos θ, r sin θ). Compute det(J_f) and show that dx dy = r dr dθ.

  1. Step 1: Write the Jacobian matrix.

    J_f(r, θ) = [ ∂x/∂r ∂x/∂θ

    ∂y/∂r ∂y/∂θ ]

  2. Step 2: Compute the partial derivatives.

    ∂x/∂r = cos θ

    ∂x/∂θ = −r sin θ

    ∂y/∂r = sin θ

    ∂y/∂θ = r cos θ

  3. Step 3: Substitute into the matrix.

    J_f(r, θ) = [ cos θ −r sin θ

    sin θ r cos θ ]

  4. Step 4: Compute the determinant.

    det(J_f) = (cos θ)(r cos θ) − (−r sin θ)(sin θ)

    = r cos²θ + r sin²θ

    = r( cos²θ + sin²θ )

    = r

  5. Step 5: Convert the area element.

    dx dy = |det(J_f(r, θ))| dr dθ = |r| dr dθ

    In standard polar coordinates, r ≥ 0, so |r| = r.

    Therefore dx dy = r dr dθ.

Insight: The mysterious “extra r” in polar integrals is exactly local area scaling. A tiny rectangle of size dr×dθ in (r,θ)-space maps to a curved wedge-like region in (x,y)-space whose area is approximately r·dr·dθ.

Use the Jacobian chain rule to differentiate a composition

Let f: ℝ² → ℝ² be f(x, y) = (u, v) = (x + y, x − y). Let g: ℝ² → ℝ² be g(u, v) = (u², uv). Compute J_{g∘f}(x, y) using the chain rule.

  1. Step 1: Compute J_f(x, y).

    u = x + y ⇒ ∂u/∂x = 1, ∂u/∂y = 1

    v = x − y ⇒ ∂v/∂x = 1, ∂v/∂y = −1

    So J_f(x, y) = [ 1 1

    1 −1 ]

  2. Step 2: Compute J_g(u, v).

    First component: g₁(u, v) = u²

    ∂g₁/∂u = 2u, ∂g₁/∂v = 0

    Second component: g₂(u, v) = uv

    ∂g₂/∂u = v, ∂g₂/∂v = u

    So J_g(u, v) = [ 2u 0

    v u ]

  3. Step 3: Apply the chain rule.

    J_{g∘f}(x, y) = J_g(f(x, y)) · J_f(x, y)

    Substitute u = x + y and v = x − y:

    J_g(f(x, y)) = [ 2(x+y) 0

    (x−y) (x+y) ]

  4. Step 4: Multiply the matrices.

    J_{g∘f} = [ 2(x+y) 0

    (x−y) (x+y) ] [ 1 1

    1 −1 ]

    Compute entry-by-entry:

    Top row:

    (1,1): 2(x+y)·1 + 0·1 = 2(x+y)

    (1,2): 2(x+y)·1 + 0·(−1) = 2(x+y)

    Bottom row:

    (2,1): (x−y)·1 + (x+y)·1 = (x−y)+(x+y)=2x

    (2,2): (x−y)·1 + (x+y)·(−1) = (x−y)−(x+y)=−2y

    So J_{g∘f}(x, y) = [ 2(x+y) 2(x+y)

    2x −2y ]

Insight: The Jacobian chain rule is “just” matrix multiplication because derivatives are linear maps. Computing J_g at (u,v) and then substituting (u,v)=f(x,y) keeps the structure clean and scales to long compositions.
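The derived matrix can be checked at a sample point against a finite-difference Jacobian of the composition (a sketch assuming NumPy):

```python
import numpy as np

def gf(p):  # the composition g(f(x, y)) written out directly
    x, y = p
    u, v = x + y, x - y
    return np.array([u ** 2, u * v])

def J_closed_form(x, y):  # the chain-rule result derived above
    return np.array([[2 * (x + y), 2 * (x + y)],
                     [2 * x,       -2 * y]])

p = np.array([0.7, -0.2])
eps = 1e-6
J_fd = np.stack([(gf(p + eps * e) - gf(p - eps * e)) / (2 * eps)
                 for e in np.eye(2)], axis=1)
print(np.allclose(J_fd, J_closed_form(*p), atol=1e-6))  # True
```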

Key Takeaways

  • The Jacobian J_f(x) = Df(x) is the m×n matrix with entries (∂f_i/∂x_j), describing first-order change of f: ℝⁿ → ℝᵐ.

  • Linearization: for small h, f(a+h) ≈ f(a) + J_f(a) h; the Jacobian is the best linear approximation near a.

  • Columns of J_f describe how the output changes when you perturb one input coordinate; rows are gradients of each output component (as row vectors).

  • Chain rule: J_{g∘f}(x) = J_g(f(x)) · J_f(x)—composition becomes matrix multiplication.

  • When n = m, det(J_f(x)) measures local oriented volume scaling; |det(J_f(x))| is the local (unsigned) volume scale factor.

  • Change of variables in integrals uses dx = |det(J_f(u))| du for x = f(u).

  • Many “extra factors” in coordinate systems (like r in polar) are exactly Jacobian determinants.

Common Mistakes

  • Mixing up the shape: for f: ℝⁿ → ℝᵐ, the Jacobian is m×n (outputs by inputs), not n×m.

  • Confusing ∇f with J_f: the gradient is for scalar outputs; for vector outputs you need the full Jacobian (or one gradient per component).

  • Forgetting the absolute value in change-of-variables: integrals over regions use |det(J)|, not det(J) when det could be negative.

  • Evaluating J_g at the wrong point in the chain rule: J_{g∘f}(x) requires J_g at f(x), not at x.

Practice

easy

Let f(x, y, z) = (xy, yz). Compute J_f(x, y, z). What is J_f(1, 2, 3)?

Hint: There are m=2 outputs and n=3 inputs, so J is 2×3. Differentiate each output with respect to x, y, z.

Show solution

f₁=xy ⇒ ∂f₁/∂x=y, ∂f₁/∂y=x, ∂f₁/∂z=0.

f₂=yz ⇒ ∂f₂/∂x=0, ∂f₂/∂y=z, ∂f₂/∂z=y.

So J_f(x,y,z) = [ y x 0

0 z y ].

At (1,2,3): J_f(1,2,3) = [ 2 1 0

0 3 2 ].

medium

Let f: ℝ² → ℝ² be f(u, v) = (x, y) = (u² − v², 2uv). (This is complex squaring: with z = u + iv, z² = (u² − v²) + i(2uv).) Compute det(J_f(u, v)).

Hint: Compute ∂x/∂u, ∂x/∂v, ∂y/∂u, ∂y/∂v, then take a 2×2 determinant.

Show solution

x=u²−v² ⇒ ∂x/∂u=2u, ∂x/∂v=−2v.

y=2uv ⇒ ∂y/∂u=2v, ∂y/∂v=2u.

J_f(u,v) = [ 2u −2v

2v 2u ].

det(J_f) = (2u)(2u) − (−2v)(2v)

= 4u² + 4v²

= 4(u²+v²).

hard

Use a Jacobian to perform the substitution in the integral ∬_R (x + y) dx dy where R is the parallelogram defined by x = u + v, y = u − v with (u, v) ∈ [0,1]×[0,1]. Compute the value.

Hint: Compute det(J_f) for f(u,v)=(x,y). Rewrite x+y in terms of u,v. Then integrate over the unit square and multiply by |det(J_f)|.

Show solution

Define f(u,v)=(x,y) with x=u+v, y=u−v.

Jacobian:

J_f = [ ∂x/∂u ∂x/∂v

∂y/∂u ∂y/∂v ]

= [ 1 1

1 −1 ].

det(J_f) = (1)(−1) − (1)(1) = −2, so |det(J_f)|=2.

Rewrite integrand:

x+y = (u+v)+(u−v)=2u.

Change variables:

∬_R (x+y) dx dy = ∬_{[0,1]²} (2u) · 2 du dv = ∬_{[0,1]²} 4u du dv.

Compute:

∫_0^1 ∫_0^1 4u dv du = ∫_0^1 (4u·1) du = 4 ∫_0^1 u du = 4·(1/2)=2.
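The substitution above can be confirmed with a midpoint-rule sum over the unit square in (u, v)-space (a sketch assuming NumPy):

```python
import numpy as np

# Integrate (x + y) * |det J| = 2u * 2 = 4u over the unit square in (u, v).
n = 400
u = (np.arange(n) + 0.5) / n         # midpoint grid on [0, 1]
v = (np.arange(n) + 0.5) / n
U, V = np.meshgrid(u, v)
integrand = ((U + V) + (U - V)) * 2  # x + y = 2u, times |det J| = 2
value = integrand.mean()             # mean * area(unit square) = integral
print(round(value, 6))  # 2.0
```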

Connections

Next: Matrix Calculus

Forward links this enables:

  • Jacobian chain rule → backprop-style differentiation in vector form
  • Jacobian determinant → multivariable substitution and probability density transforms