Multivariable Chain Rule

Calculus · Difficulty: ███░░ · Depth: 6 · Unlocks: 5

Derivatives of composed functions with multiple variables.

Core Concepts

  • Structure of composition: an inner map from R^n to R^m and an outer map from R^m to R^p (track which variables feed into which function)
  • Derivative as the best linear approximation at a point, represented by the Jacobian matrix (matrix of partial derivatives)
  • Composition of linear maps is given by matrix multiplication (linear maps compose by multiplying their matrices)

Key Symbols & Notation

  • Df: the Jacobian matrix / derivative of f
  • ∘: function composition

Essential Relationships

  • Chain rule (core formula): D(f ∘ g)(x) = Df(g(x)) · Dg(x) (matrix product; evaluate the outer derivative at g(x))

All Concepts (9)

  • Composition of multivariable maps: inner map g: ℝⁿ → ℝᵐ and outer map f: ℝᵐ → ℝᵖ (so f∘g: ℝⁿ → ℝᵖ)
  • Total derivative (or differential) as a linear map at a point (best linear approximation of a multivariable function)
  • Jacobian matrix: the matrix representation of the total derivative for vector-valued functions
  • Derivative of vector-valued functions (componentwise partial derivatives assembled into a Jacobian)
  • Multivariable chain rule as a rule for composing total derivatives (not just scalar chain rule)
  • Component/summation form of the chain rule: expressing partials of a composition as sums over intermediate variables
  • Chain rule along a parameterized curve (rate of change of f along r(t): df/dt = ∇f(r(t)) · r'(t))
  • Evaluation-location dependence: derivatives of outer function must be evaluated at the inner function value (e.g., Df(g(x)))
  • Interpretation of the chain rule as composition of linear approximations (compose best linear approximations to get best linear approximation of composition)

Teaching Strategy

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

In single-variable calculus, the chain rule is a one-line formula. In multivariable calculus, it’s the same idea—“derivatives multiply along a composition”—but the objects are linear maps, so the multiplication becomes matrix multiplication. Once you truly see that, backpropagation stops feeling like magic.

TL;DR:

For a composition $h = f \circ g$ where $g: \mathbb{R}^n \to \mathbb{R}^m$ and $f: \mathbb{R}^m \to \mathbb{R}^p$, the multivariable chain rule is

$$D(f \circ g)(\mathbf{x}) = Df(g(\mathbf{x}))\, Dg(\mathbf{x}).$$

Interpretation: a small input perturbation $\Delta \mathbf{x}$ is pushed forward by $Dg$ into $\Delta \mathbf{u}$, then pushed forward by $Df$ into $\Delta \mathbf{y}$. For scalar output ($p=1$), gradients pull back via the transpose: $\nabla h(\mathbf{x}) = Dg(\mathbf{x})^\top \nabla f(g(\mathbf{x}))$.

What Is the Multivariable Chain Rule?

Why you need a new-looking chain rule

In 1D, composition looks like $h(x) = f(g(x))$ and the chain rule says $h'(x) = f'(g(x))\, g'(x)$.

In multiple dimensions, you still compose functions, but now each function can take multiple inputs and produce multiple outputs. The “derivative” is no longer a single number; it’s a linear map that best approximates the function near a point. Linear maps are represented by matrices, so “multiply the derivatives” becomes “multiply the Jacobian matrices.”

This is the conceptual core:

  • Derivative = best linear approximation near a point.
  • Jacobian matrix = the matrix of partial derivatives that represents that linear approximation.
  • Composition of linear maps corresponds to matrix multiplication.

The composition structure (track shapes)

The cleanest multivariable chain rule is written with a clear inner/outer structure:

  • Inner map: $g: \mathbb{R}^n \to \mathbb{R}^m$
  • Outer map: $f: \mathbb{R}^m \to \mathbb{R}^p$
  • Composition: $h = f \circ g: \mathbb{R}^n \to \mathbb{R}^p$

Let:

  • input vector x ∈ ℝⁿ
  • intermediate vector u = g(x) ∈ ℝᵐ
  • output vector y = f(u) ∈ ℝᵖ

A quick “shape table” you should get used to:

| Object | Meaning | Shape |
| --- | --- | --- |
| $\mathbf{x}$ | input | n×1 |
| $\mathbf{u} = g(\mathbf{x})$ | intermediate | m×1 |
| $\mathbf{y} = f(\mathbf{u})$ | output | p×1 |
| $Dg(\mathbf{x})$ | Jacobian of g at x | m×n |
| $Df(\mathbf{u})$ | Jacobian of f at u | p×m |
| $D(f\circ g)(\mathbf{x})$ | Jacobian of the composition | p×n |

Notice the only multiplication that makes sense dimensionally is:

$$(p\times m)\,(m\times n) = (p\times n).$$

That is already a big part of the multivariable chain rule: the shapes force the correct order.

The chain rule (matrix form)

The multivariable chain rule states:

$$D(f \circ g)(\mathbf{x}) = Df(g(\mathbf{x}))\, Dg(\mathbf{x}).$$

Read it as: "first apply the derivative of $g$ at $\mathbf{x}$, then apply the derivative of $f$ at the resulting point $g(\mathbf{x})$."

What it means geometrically (tiny perturbations)

If you perturb the input by a small vector $\Delta \mathbf{x}$, then:

1) Push forward through g:

$$\Delta \mathbf{u} \approx Dg(\mathbf{x})\, \Delta \mathbf{x}.$$

2) Push forward through f:

$$\Delta \mathbf{y} \approx Df(\mathbf{u})\, \Delta \mathbf{u}.$$

Combine them:

$$\Delta \mathbf{y} \approx Df(g(\mathbf{x}))\, Dg(\mathbf{x})\, \Delta \mathbf{x}.$$

So the Jacobian of the composition is the matrix that takes $\Delta \mathbf{x}$ directly to $\Delta \mathbf{y}$. That matrix is the product $Df\,Dg$.
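This pushforward picture is easy to check numerically. A minimal NumPy sketch (the maps $g$ and $f$ below are illustrative choices, not from the lesson): the chain-rule product $Df\,Dg\,\Delta\mathbf{x}$ should predict the actual change of the composition up to second-order error.

```python
import numpy as np

# Illustrative maps (chosen here): g is the inner map, f the outer scalar map.
g = lambda x: np.array([np.exp(x[0]), x[0] * x[1]])   # R^2 -> R^2
f = lambda u: u[0] + u[1] ** 2                        # R^2 -> R

x0 = np.array([0.5, 2.0])
dx = np.array([1e-4, -3e-4])                          # small input perturbation

# Local Jacobians at the relevant points
Dg = np.array([[np.exp(x0[0]), 0.0],
               [x0[1],         x0[0]]])               # 2x2, evaluated at x0
u0 = g(x0)
Df = np.array([[1.0, 2 * u0[1]]])                     # 1x2, evaluated at u0 = g(x0)

# Chain-rule prediction: delta_y ~= Df(g(x0)) Dg(x0) delta_x
predicted = (Df @ Dg @ dx).item()
actual = f(g(x0 + dx)) - f(g(x0))
assert abs(predicted - actual) < 1e-6
```

The leftover discrepancy is the quadratic remainder of the linearization; shrinking `dx` shrinks it quadratically.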

A note on notation: Df vs ∇f

If you already know gradients, the key is to keep these two objects distinct:

  • $Df$ is the Jacobian (a matrix, in general).
  • $\nabla f$ is the gradient (a vector), defined when $f$ is scalar-valued: $f: \mathbb{R}^n \to \mathbb{R}$.

When $p=1$, the Jacobian of $f$ is a 1×m row vector; the gradient is usually written as an m×1 column vector. They are transposes of each other (depending on convention):

$$Df(\mathbf{u}) \text{ is } 1\times m, \qquad \nabla f(\mathbf{u}) \text{ is } m\times 1.$$

This transpose issue matters a lot in backprop, so we’ll be explicit about it later.

Core Mechanic 1: Derivative as Best Linear Approximation (and the Jacobian)

Why this viewpoint

If you treat multivariable derivatives as “a bunch of partial derivatives,” you can still compute things, but it’s easy to lose track of structure.

If you treat the derivative as “the best linear map near a point,” everything becomes systematic:

  • You can push forward small changes.
  • You can compose derivatives by composing linear maps.
  • You can check correctness by verifying matrix shapes.

The linear approximation definition

Let $g: \mathbb{R}^n \to \mathbb{R}^m$. We say $g$ is differentiable at $\mathbf{x}$ if there exists a linear map $L: \mathbb{R}^n \to \mathbb{R}^m$ such that

$$g(\mathbf{x} + \Delta \mathbf{x}) \approx g(\mathbf{x}) + L\,\Delta \mathbf{x}$$

with an error that becomes negligible compared to $\|\Delta \mathbf{x}\|$ as $\Delta \mathbf{x} \to 0$.

That linear map is the derivative $Dg(\mathbf{x})$.

Jacobian matrix: how the linear map is represented

Write $g(\mathbf{x}) = \begin{bmatrix} g_1(\mathbf{x}) \\ \vdots \\ g_m(\mathbf{x}) \end{bmatrix}$.

Then the Jacobian $Dg(\mathbf{x})$ is the m×n matrix:

$$Dg(\mathbf{x}) = \begin{bmatrix} \frac{\partial g_1}{\partial x_1} & \cdots & \frac{\partial g_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial g_m}{\partial x_1} & \cdots & \frac{\partial g_m}{\partial x_n} \end{bmatrix}.$$

This is exactly the matrix that maps a small input perturbation to an approximate output perturbation:

$$\Delta \mathbf{u} \approx Dg(\mathbf{x})\, \Delta \mathbf{x}.$$
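One concrete way to internalize the Jacobian is to build it column by column with finite differences. A NumPy sketch (the example map is the same $g$ used in the worked examples later in this lesson):

```python
import numpy as np

def numerical_jacobian(g, x, eps=1e-6):
    """Approximate the m x n Jacobian of g at x by central differences, one column per input."""
    x = np.asarray(x, dtype=float)
    m = np.atleast_1d(g(x)).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.atleast_1d(g(x + dx)) - np.atleast_1d(g(x - dx))) / (2 * eps)
    return J

# g: R^2 -> R^2 with u1 = x1^2 + x2, u2 = x1*x2 (from the worked examples)
g = lambda x: np.array([x[0] ** 2 + x[1], x[0] * x[1]])
x0 = np.array([1.0, 2.0])

# Analytic Jacobian at x0: [[2*x1, 1], [x2, x1]]
J_analytic = np.array([[2 * x0[0], 1.0], [x0[1], x0[0]]])
assert np.allclose(numerical_jacobian(g, x0), J_analytic, atol=1e-6)
```

Column j answers "how does the output move if only input j wiggles," which is exactly the partial-derivative reading of the matrix above.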

Interactive-canvas mental model (shape-tracking + arrows)

Imagine an “interactive canvas” with three boxes:

1) x-box: x ∈ ℝⁿ

2) u-box: u = g(x) ∈ ℝᵐ

3) y-box: y = f(u) ∈ ℝᵖ

Now add two kinds of arrows:

Arrow type A: pushforward of perturbations (forward mode)

You draw a little arrow $\Delta \mathbf{x}$ at the input.

  • It transforms to $\Delta \mathbf{u} = Dg\, \Delta \mathbf{x}$.
  • Then transforms to $\Delta \mathbf{y} = Df\, \Delta \mathbf{u}$.

So perturbations flow with the function direction.

Arrow type B: pullback of sensitivities / gradients (reverse mode)

If the final output is scalar (p=1), you draw a gradient arrow at the output: $\nabla_{\mathbf{y}} \ell$ (often just 1 if the scalar is the loss itself).

Then gradients flow backwards through transposes:

  • sensitivity at u: $\nabla_{\mathbf{u}} \ell = Df(\mathbf{u})^\top \nabla_{\mathbf{y}} \ell$
  • sensitivity at x: $\nabla_{\mathbf{x}} \ell = Dg(\mathbf{x})^\top \nabla_{\mathbf{u}} \ell$

This is the heart of backprop. In this lesson, we’re building the “transpose reflex”:

  • Perturbations push forward with $J$.
  • Gradients pull back with $J^\top$.

A small but crucial convention check

There are two common conventions:

1) Jacobian is m×n (outputs by inputs). Gradients are column vectors.

2) Jacobian is n×m (inputs by outputs). Gradients are row vectors.

We’ll use the most common ML convention:

  • Jacobian $Dg$ is m×n.
  • Gradient $\nabla f$ is n×1 for scalar f.

With that convention, the chain rule for scalar output becomes a clean transpose pullback (we’ll derive it soon).

Core Mechanic 2: Chain Rule = Composition of Linear Maps (Matrix Multiplication)

Why matrix multiplication appears

The multivariable chain rule is not a new rule you memorize. It is a consequence of one fact:

If you approximate each function by a linear map near the relevant point, then approximating the composition means composing those linear maps.

And linear maps compose by matrix multiplication.

Derivation (showing the work)

Let $g: \mathbb{R}^n \to \mathbb{R}^m$ and $f: \mathbb{R}^m \to \mathbb{R}^p$.

Define:

  • $\mathbf{u} = g(\mathbf{x})$
  • $\mathbf{y} = f(\mathbf{u}) = (f \circ g)(\mathbf{x})$

Start with a small perturbation $\Delta \mathbf{x}$.

Step 1: Linearize g at x

$$g(\mathbf{x}+\Delta\mathbf{x}) \approx g(\mathbf{x}) + Dg(\mathbf{x})\,\Delta\mathbf{x}.$$

Let $\Delta \mathbf{u} = Dg(\mathbf{x})\,\Delta\mathbf{x}$, so

$$g(\mathbf{x}+\Delta\mathbf{x}) \approx \mathbf{u} + \Delta\mathbf{u}.$$

Step 2: Linearize f at u

$$f(\mathbf{u}+\Delta\mathbf{u}) \approx f(\mathbf{u}) + Df(\mathbf{u})\,\Delta\mathbf{u}.$$

Substitute $\Delta \mathbf{u} = Dg(\mathbf{x})\,\Delta\mathbf{x}$:

$$f(\mathbf{u}+\Delta\mathbf{u}) \approx f(\mathbf{u}) + Df(\mathbf{u})\,Dg(\mathbf{x})\,\Delta\mathbf{x}.$$

But the left side is approximately

$$f(g(\mathbf{x}+\Delta\mathbf{x})) = (f\circ g)(\mathbf{x}+\Delta\mathbf{x}).$$

So we have the linear approximation:

$$(f\circ g)(\mathbf{x}+\Delta\mathbf{x}) \approx (f\circ g)(\mathbf{x}) + \bigl(Df(g(\mathbf{x}))\,Dg(\mathbf{x})\bigr)\,\Delta\mathbf{x}.$$

By uniqueness of the best linear approximation, the derivative must be:

$$D(f \circ g)(\mathbf{x}) = Df(g(\mathbf{x}))\,Dg(\mathbf{x}).$$

That’s the multivariable chain rule.
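The derivation can be sanity-checked numerically: build both sides of $D(f\circ g) = Df(g(\mathbf{x}))\,Dg(\mathbf{x})$ with finite differences and compare. A NumPy sketch (the maps are arbitrary smooth examples chosen here, not from the text):

```python
import numpy as np

def num_jac(F, x, eps=1e-6):
    """Central-difference Jacobian of F at x (outputs by inputs)."""
    F0 = np.atleast_1d(F(x))
    J = np.zeros((F0.size, x.size))
    for j in range(x.size):
        d = np.zeros_like(x)
        d[j] = eps
        J[:, j] = (np.atleast_1d(F(x + d)) - np.atleast_1d(F(x - d))) / (2 * eps)
    return J

# Example maps chosen for this check
g = lambda x: np.array([np.sin(x[0]) * x[1], x[0] + x[1] ** 2])   # R^2 -> R^2
f = lambda u: np.array([u[0] * u[1], np.exp(u[0]), u[1] ** 3])    # R^2 -> R^3

x0 = np.array([0.5, -1.2])
lhs = num_jac(lambda x: f(g(x)), x0)        # D(f∘g)(x0), 3x2
rhs = num_jac(f, g(x0)) @ num_jac(g, x0)    # Df(g(x0)) Dg(x0): (3x2)(2x2)
assert np.allclose(lhs, rhs, atol=1e-5)
```

Note that the outer Jacobian is evaluated at `g(x0)`, not at `x0`; moving that evaluation point is the most common way this check fails.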

Element-wise chain rule (path-tracing through indices)

Sometimes you want the version that looks like “sum over paths.”

Let h=fgh = f \circ g with components:

  • gk(x)g_k(\mathbf{x}) for k = 1..m
  • fi(u)f_i(\mathbf{u}) for i = 1..p
  • hi(x)=fi(g(x))h_i(\mathbf{x}) = f_i(g(\mathbf{x}))

Then for each output component i and input component j:

hixj(x)=k=1mfiuk(u)gkxj(x),u=g(x).\frac{\partial h_i}{\partial x_j}(\mathbf{x}) = \sum_{k=1}^{m} \frac{\partial f_i}{\partial u_k}(\mathbf{u})\,\frac{\partial g_k}{\partial x_j}(\mathbf{x}),\quad \mathbf{u}=g(\mathbf{x}).

This is literally matrix multiplication in coordinates.

Path-tracing interpretation:

  • Each term $\frac{\partial f_i}{\partial u_k}\,\frac{\partial g_k}{\partial x_j}$ corresponds to a path $x_j \to u_k \to y_i$.
  • You add contributions from all intermediate coordinates $u_k$.
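In code, the coordinate form and the matrix form are literally the same computation. A small NumPy check with random matrices standing in for the two Jacobians (shapes as in the text):

```python
import numpy as np

# Random Jacobians with the shapes from the text: Df is p x m, Dg is m x n
rng = np.random.default_rng(0)
p, m, n = 3, 4, 2
Df = rng.normal(size=(p, m))
Dg = rng.normal(size=(m, n))

Dh = Df @ Dg  # matrix form of the chain rule

# Coordinate form: (Dh)_{ij} = sum over intermediate k of (Df)_{ik} (Dg)_{kj}
i, j = 1, 0
entry = sum(Df[i, k] * Dg[k, j] for k in range(m))
assert np.isclose(entry, Dh[i, j])
```

Each term of the Python `sum` is one path $x_j \to u_k \to y_i$; the loop over `k` is the "sum over intermediate coordinates."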

“Computational graph” view (interactive canvas)

Think of a small graph:

  • Nodes are variables (scalars or vectors).
  • Edges are functions.

For our two-layer composition:

x →(g)→ u →(f)→ y

On an interactive canvas, you can show two overlays:

Overlay 1: Jacobians on edges

  • edge x→u labeled $Dg(\mathbf{x})$ (m×n)
  • edge u→y labeled $Df(\mathbf{u})$ (p×m)

To get the total derivative x→y, multiply along the path:

$$D_{\mathbf{x}}\mathbf{y} = Df\,Dg.$$

Overlay 2: live perturbations and gradients

Pick a concrete point x₀.

  • Drag x by a tiny $\Delta \mathbf{x}$. The canvas updates u and y and also shows the predicted linear response $Df\,Dg\,\Delta \mathbf{x}$.
  • Alternatively, set a scalar loss $\ell = \phi(\mathbf{y})$ (or just take p=1). Show a gradient vector at y and animate it flowing backward:

$$\nabla_{\mathbf{x}}\ell = Dg(\mathbf{x})^\top\,Df(\mathbf{u})^\top\,\nabla_{\mathbf{y}}\ell.$$

This “two-way animation” is the visualization you want to internalize:

  • Forward: $\Delta$ vectors multiply by Jacobians.
  • Backward: gradients multiply by Jacobian transposes.

Scalar output special case (gradient form)

Now suppose $f: \mathbb{R}^m \to \mathbb{R}$ is scalar. Then $Df(\mathbf{u})$ is 1×m.

From the Jacobian chain rule:

$$D(f\circ g)(\mathbf{x}) = Df(\mathbf{u})\,Dg(\mathbf{x}).$$

The left side is a 1×n row vector. If we want the gradient as an n×1 column vector, transpose:

\begin{align*}
\nabla (f\circ g)(\mathbf{x}) &= D(f\circ g)(\mathbf{x})^\top \\
&= \bigl(Df(\mathbf{u})\,Dg(\mathbf{x})\bigr)^\top \\
&= Dg(\mathbf{x})^\top\,Df(\mathbf{u})^\top \\
&= Dg(\mathbf{x})^\top\,\nabla f(\mathbf{u}).
\end{align*}

That last line is the standard “gradient chain rule” used everywhere in ML.
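A quick numerical check of that gradient chain rule, using small example functions chosen here (not from the text): $g(\mathbf{x}) = (x_1+x_2,\ x_1x_2)$ and $f(\mathbf{u}) = u_1^2 + u_2$, so $h(\mathbf{x}) = (x_1+x_2)^2 + x_1x_2$.

```python
import numpy as np

# h = f∘g for the example functions above
h = lambda x: (x[0] + x[1]) ** 2 + x[0] * x[1]

x0 = np.array([0.7, -0.3])
u0 = np.array([x0[0] + x0[1], x0[0] * x0[1]])

Dg = np.array([[1.0, 1.0],
               [x0[1], x0[0]]])          # 2x2 Jacobian of g at x0
grad_f = np.array([2 * u0[0], 1.0])      # gradient of f at u0 (column convention)

grad_h = Dg.T @ grad_f                   # gradient chain rule: Dg^T ∇f(u0)

# Compare against central differences on h directly
eps = 1e-6
fd = np.array([(h(x0 + eps * e) - h(x0 - eps * e)) / (2 * eps) for e in np.eye(2)])
assert np.allclose(grad_h, fd, atol=1e-6)
```

The transpose is what lets a 2-vector gradient at `u0` come back as a 2-vector gradient at `x0` even when the intermediate dimension differs from the input dimension.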

One more composition (three layers)

If you have $\mathbf{x} \xrightarrow{g} \mathbf{u} \xrightarrow{f} \mathbf{v} \xrightarrow{r} \mathbf{y}$, then:

$$D(r\circ f\circ g)(\mathbf{x}) = Dr(\mathbf{v})\,Df(\mathbf{u})\,Dg(\mathbf{x}).$$

Forward-mode perturbations are multiplied innermost-first: $Dg$ acts on $\Delta\mathbf{x}$, then $Df$, then $Dr$, the same order in which the functions are applied.

Reverse-mode gradients multiply by transposes in the opposite order:

$$\nabla_{\mathbf{x}}\ell = Dg(\mathbf{x})^\top\,Df(\mathbf{u})^\top\,Dr(\mathbf{v})^\top\,\nabla_{\mathbf{y}}\ell.$$

This is backprop in one line—just applied repeatedly.
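The one-line claim is easy to verify with random matrices standing in for the three local Jacobians (a NumPy sketch under that assumption; the shapes are arbitrary but consistent):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))   # stands in for Dg: R^3 -> R^4
B = rng.normal(size=(5, 4))   # stands in for Df: R^4 -> R^5
C = rng.normal(size=(1, 5))   # stands in for Dr: R^5 -> R (scalar output)

grad_y = np.ones((1, 1))      # seed gradient at the scalar output

# Reverse mode: multiply by transposes, outermost first
grad_x = A.T @ (B.T @ (C.T @ grad_y))

# Must equal the transpose of the full forward Jacobian product
assert np.allclose(grad_x, (C @ B @ A).T)
```

Reverse mode never forms the full product `C @ B @ A`; it only ever does matrix-vector products, which is why it scales to networks with millions of parameters.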

Application/Connection: From Multivariable Chain Rule to Backprop Intuition

Why this matters for ML

Neural networks are compositions of many vector-valued functions:

$$\mathbf{x} \to \mathbf{h}^{(1)} \to \mathbf{h}^{(2)} \to \cdots \to \hat{\mathbf{y}} \to \ell.$$

Training requires $\nabla_{\theta}\ell$: gradients with respect to the parameters. The only tool you need conceptually is the multivariable chain rule, applied efficiently.

The hard part is not the calculus; it’s bookkeeping:

  • What depends on what?
  • What is the shape of each Jacobian?
  • Are we pushing perturbations forward or pulling gradients back?

A concrete computational graph (with explicit shapes)

Let’s build a tiny “network” with one hidden layer and a scalar loss. Define:

  • x ∈ ℝ²
  • Parameters: W ∈ ℝ^{3×2}, b ∈ ℝ³, and c ∈ ℝ³ (a vector used to reduce to a scalar)

Forward computation:

1) $\mathbf{a} = W\mathbf{x} + \mathbf{b}$ (so a ∈ ℝ³)

2) $\mathbf{h} = \sigma(\mathbf{a})$ elementwise (so h ∈ ℝ³)

3) $\ell = \mathbf{c}^\top \mathbf{h}$ (so ℓ ∈ ℝ)

This is a composition:

x →(affine)→ a →(nonlinearity)→ h →(dot)→ ℓ

On an interactive canvas, you can attach:

| Edge | Local derivative | Shape |
| --- | --- | --- |
| x→a | $D_{\mathbf{x}}\mathbf{a} = W$ | 3×2 |
| a→h | $D_{\mathbf{a}}\mathbf{h} = \operatorname{Diag}(\sigma'(\mathbf{a}))$ | 3×3 |
| h→ℓ | $D_{\mathbf{h}}\ell = \mathbf{c}^\top$ | 1×3 |

Forward-mode (push a perturbation)

A perturbation $\Delta \mathbf{x}$ produces:

\begin{align*}
\Delta \mathbf{a} &\approx W\,\Delta \mathbf{x} \\
\Delta \mathbf{h} &\approx \operatorname{Diag}(\sigma'(\mathbf{a}))\,\Delta \mathbf{a} \\
\Delta \ell &\approx \mathbf{c}^\top\,\Delta \mathbf{h}.
\end{align*}

Combine:

$$\Delta \ell \approx \mathbf{c}^\top\,\operatorname{Diag}(\sigma'(\mathbf{a}))\,W\,\Delta \mathbf{x}.$$

So the total Jacobian (a 1×2 row vector) is:

$$D_{\mathbf{x}}\ell = \mathbf{c}^\top\,\operatorname{Diag}(\sigma'(\mathbf{a}))\,W.$$

Reverse-mode (pull back a gradient)

Because ℓ is scalar, we typically want $\nabla_{\mathbf{x}}\ell$ as a 2×1 column vector.

Start with $\nabla_{\ell}\ell = 1$.

  • From $\ell = \mathbf{c}^\top\mathbf{h}$:

$$\nabla_{\mathbf{h}}\ell = \mathbf{c}.$$

  • Through $\mathbf{h} = \sigma(\mathbf{a})$ elementwise:

$$\nabla_{\mathbf{a}}\ell = \operatorname{Diag}(\sigma'(\mathbf{a}))\,\nabla_{\mathbf{h}}\ell = \operatorname{Diag}(\sigma'(\mathbf{a}))\,\mathbf{c}.$$

  • Through $\mathbf{a} = W\mathbf{x}+\mathbf{b}$:

$$\nabla_{\mathbf{x}}\ell = W^\top\,\nabla_{\mathbf{a}}\ell = W^\top\,\operatorname{Diag}(\sigma'(\mathbf{a}))\,\mathbf{c}.$$

Compare with the forward-mode Jacobian expression above:

  • $D_{\mathbf{x}}\ell$ (row) $= \mathbf{c}^\top\operatorname{Diag}(\sigma'(\mathbf{a}))\,W$
  • $\nabla_{\mathbf{x}}\ell$ (column) $= W^\top\operatorname{Diag}(\sigma'(\mathbf{a}))\,\mathbf{c}$

They are transposes, consistent with $\nabla \ell = (D\ell)^\top$.
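The whole forward and backward pass for this tiny network fits in a few lines. A NumPy sketch, taking $\sigma = \tanh$ as one concrete choice of nonlinearity (the text leaves $\sigma$ generic), with the reverse-mode gradient checked against finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 2))
b = rng.normal(size=3)
c = rng.normal(size=3)
sigma = np.tanh                           # assumed concrete nonlinearity
dsigma = lambda a: 1.0 - np.tanh(a) ** 2  # its elementwise derivative

def loss(x):
    """Forward pass: scalar loss c^T sigma(Wx + b)."""
    return c @ sigma(W @ x + b)

x0 = rng.normal(size=2)
a = W @ x0 + b

# Reverse mode from the text: grad_x = W^T Diag(sigma'(a)) c
# (Diag(v) @ c is just elementwise v * c, so no diagonal matrix is built.)
grad = W.T @ (dsigma(a) * c)

# Finite-difference check
eps = 1e-6
fd = np.array([(loss(x0 + eps * e) - loss(x0 - eps * e)) / (2 * eps)
               for e in np.eye(2)])
assert np.allclose(grad, fd, atol=1e-6)
```

Note the `dsigma(a) * c` trick: multiplying by a diagonal Jacobian is an elementwise product, which is how real implementations avoid materializing the 3×3 matrix.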

Visual intuition: pushforward vs pullback

To address visualization explicitly, here’s the picture you should rehearse:

1) Pick a point x₀.

2) Draw a tiny arrow $\Delta \mathbf{x}$ at $\mathbf{x}_0$.

3) Multiply by local Jacobians to watch the arrow morph:

  • it rotates/scales/shears in a-space,
  • then again in h-space,
  • finally collapses to a scalar change $\Delta \ell$.

Now reverse:

1) Draw a gradient arrow at h: it points in the direction that increases ℓ fastest in h-space.

2) Pull it back to a using the transpose of the local Jacobian: $\operatorname{Diag}(\sigma'(\mathbf{a}))$ (diagonal, hence symmetric, so the transpose changes nothing).

3) Pull it back to x using $W^\top$.

This is not two unrelated processes. It’s the same linear maps viewed from two dual perspectives:

  • perturbations: $\Delta$ vectors (tangent vectors)
  • gradients: covectors, pulled back by the transpose

You don’t need the formal differential-geometry language to use it correctly, but you do need the operational rule:

If forward uses $J$, backward uses $J^\top$.

Connection you’ll use next

Backpropagation is essentially the repeated application of:

$$\nabla_{\text{input}}\ell = J^\top\,\nabla_{\text{output}}\ell,$$

where $J$ is the Jacobian of a local block in the computational graph.

Once you’re comfortable multiplying Jacobians (forward) and multiplying by transposes (backward), you’re ready to study backprop as an algorithmic optimization: reuse intermediate results so you don’t form huge Jacobian matrices explicitly.
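That last point can be made concrete: a Jacobian-vector product needs only a directional difference, never the full matrix. A hedged NumPy sketch (the example function is chosen here for illustration):

```python
import numpy as np

def jvp(F, x, v, eps=1e-6):
    """Jacobian-vector product DF(x) v without forming DF (forward mode via central differences)."""
    return (np.atleast_1d(F(x + eps * v)) - np.atleast_1d(F(x - eps * v))) / (2 * eps)

# Example function chosen for this sketch
F = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
x0 = np.array([0.3, 2.0])
v = np.array([1.0, -1.0])

# Analytic Jacobian at x0, for comparison only
DF = np.array([[x0[1], x0[0]],
               [np.cos(x0[0]), 0.0]])
assert np.allclose(jvp(F, x0, v), DF @ v, atol=1e-5)
```

Two function evaluations give `DF @ v` for any direction `v`; reverse mode plays the same game with $J^\top$-vector products. Autodiff systems compute both exactly rather than by finite differences, but the "never build the matrix" structure is the same.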

Worked Examples (3)

Matrix-form chain rule with a 2→2→1 composition (shape-check + gradient)

Let $g: \mathbb{R}^2 \to \mathbb{R}^2$ and $f: \mathbb{R}^2 \to \mathbb{R}$ be

$$g(x_1,x_2) = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} x_1^2 + x_2 \\ x_1x_2 \end{bmatrix}, \qquad f(u_1,u_2)= u_1 + u_2^2.$$

Define $h = f\circ g$. Compute $Dh(\mathbf{x})$ and $\nabla h(\mathbf{x})$ at a general point, then evaluate at $(x_1,x_2)=(1,2)$.

  1. Step 1: Compute $Dg(\mathbf{x})$ (2×2).

    We have

    • $u_1 = x_1^2 + x_2$, so $\partial u_1/\partial x_1 = 2x_1$ and $\partial u_1/\partial x_2 = 1$.
    • $u_2 = x_1x_2$, so $\partial u_2/\partial x_1 = x_2$ and $\partial u_2/\partial x_2 = x_1$.

    Thus

    $$Dg(\mathbf{x}) = \begin{bmatrix} 2x_1 & 1 \\ x_2 & x_1 \end{bmatrix}.$$
  2. Step 2: Compute $Df(\mathbf{u})$ (1×2).

    $f(u_1,u_2)=u_1 + u_2^2$, so

    $$Df(\mathbf{u}) = \begin{bmatrix} \partial f/\partial u_1 & \partial f/\partial u_2 \end{bmatrix} = \begin{bmatrix} 1 & 2u_2 \end{bmatrix}.$$
  3. Step 3: Apply the Jacobian chain rule.

    $$Dh(\mathbf{x}) = Df(g(\mathbf{x}))\,Dg(\mathbf{x}).$$

    Substitute $u_2 = x_1x_2$:

    $$Df(g(\mathbf{x})) = \begin{bmatrix} 1 & 2x_1x_2 \end{bmatrix}.$$

    Now multiply:

    \begin{align*}
    Dh(\mathbf{x}) &= \begin{bmatrix} 1 & 2x_1x_2 \end{bmatrix} \begin{bmatrix} 2x_1 & 1 \\ x_2 & x_1 \end{bmatrix} \\
    &= \begin{bmatrix} 1\cdot 2x_1 + (2x_1x_2)\cdot x_2 & \;\; 1\cdot 1 + (2x_1x_2)\cdot x_1 \end{bmatrix} \\
    &= \begin{bmatrix} 2x_1 + 2x_1x_2^2 & \;\; 1 + 2x_1^2x_2 \end{bmatrix}.
    \end{align*}
  4. Step 4: Convert to gradient (column vector).

    Since $h$ is scalar, $Dh$ is 1×2 and

    $$\nabla h(\mathbf{x}) = Dh(\mathbf{x})^\top = \begin{bmatrix} 2x_1 + 2x_1x_2^2 \\ 1 + 2x_1^2x_2 \end{bmatrix}.$$
  5. Step 5: Evaluate at (1,2).

    $$\nabla h(1,2)=\begin{bmatrix}2\cdot 1 + 2\cdot 1\cdot 2^2 \\ 1 + 2\cdot 1^2\cdot 2\end{bmatrix}= \begin{bmatrix}2+8\\1+4\end{bmatrix}= \begin{bmatrix}10\\5\end{bmatrix}.$$

Insight: The computation stayed organized because we never mixed ‘partial derivative rules’ randomly. We computed two local Jacobians with clear shapes (2×2 and 1×2), multiplied them in the only shape-consistent order, then transposed to get the gradient vector.
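A finite-difference check of this example in NumPy confirms the gradient at (1,2):

```python
import numpy as np

# h = f∘g from the worked example: h(x) = (x1^2 + x2) + (x1 x2)^2
h = lambda x: (x[0] ** 2 + x[1]) + (x[0] * x[1]) ** 2

x0 = np.array([1.0, 2.0])
eps = 1e-6
grad = np.array([(h(x0 + eps * e) - h(x0 - eps * e)) / (2 * eps) for e in np.eye(2)])
assert np.allclose(grad, [10.0, 5.0], atol=1e-5)
```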

Pushforward vs pullback on a small computational graph (explicit J and Jᵀ)

Let x ∈ ℝ². Define

1) $\mathbf{u} = g(\mathbf{x})$ where $g(x_1,x_2) = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} x_1 + 2x_2 \\ x_1 - x_2 \end{bmatrix}$

2) scalar output $\ell = f(\mathbf{u})$ where $f(u_1,u_2)= u_1^2 + 3u_2$

At x₀ = (1,1), compute:

  • (A) the predicted scalar change $\Delta \ell$ for a small perturbation $\Delta \mathbf{x} = (0.01, -0.02)$ using pushforward
  • (B) the gradient $\nabla_{\mathbf{x}}\ell$ using pullback

  1. Step 1: Compute the forward values at x₀.

    At (1,1):

    • $\mathbf{u} = g(1,1) = \begin{bmatrix} 1+2\cdot 1 \\ 1-1 \end{bmatrix} = \begin{bmatrix} 3 \\ 0 \end{bmatrix}$.
    • $\ell = f(3,0)= 3^2 + 3\cdot 0 = 9$.
  2. Step 2: Pushforward (perturbations use Jacobians).

    Since g is linear, its Jacobian (2×2) is constant:

    $$Dg = \begin{bmatrix} 1 & 2 \\ 1 & -1 \end{bmatrix}.$$

    Compute $Df(\mathbf{u})$ (1×2):

    $$Df(\mathbf{u})=\begin{bmatrix}2u_1 & 3\end{bmatrix}.$$

    At u = (3,0):

    $$Df(\mathbf{u}) = \begin{bmatrix} 6 & 3 \end{bmatrix}.$$

    Now push $\Delta \mathbf{x}$ forward:

    $$\Delta \mathbf{u} \approx Dg\,\Delta \mathbf{x} = \begin{bmatrix} 1 & 2 \\ 1 & -1 \end{bmatrix}\begin{bmatrix} 0.01 \\ -0.02 \end{bmatrix} = \begin{bmatrix} 0.01-0.04 \\ 0.01+0.02 \end{bmatrix} = \begin{bmatrix} -0.03 \\ 0.03 \end{bmatrix}.$$

    Then

    $$\Delta \ell \approx Df\,\Delta \mathbf{u} = \begin{bmatrix}6 & 3\end{bmatrix}\begin{bmatrix}-0.03\\0.03\end{bmatrix} = 6(-0.03)+3(0.03) = -0.18+0.09 = -0.09.$$
  3. Step 3: Pullback (gradients use transposes).

    Because $\ell$ is scalar, start with $\nabla_{\mathbf{u}}\ell$:

    $$\nabla_{\mathbf{u}}\ell = \begin{bmatrix} \partial \ell/\partial u_1 \\ \partial \ell/\partial u_2 \end{bmatrix} = \begin{bmatrix}2u_1\\3\end{bmatrix}.$$

    At u = (3,0):

    $$\nabla_{\mathbf{u}}\ell = \begin{bmatrix}6\\3\end{bmatrix}.$$

    Now pull back through g using $Dg^\top$:

    $$\nabla_{\mathbf{x}}\ell = Dg^\top\,\nabla_{\mathbf{u}}\ell = \begin{bmatrix} 1 & 1 \\ 2 & -1 \end{bmatrix}\begin{bmatrix}6\\3\end{bmatrix} = \begin{bmatrix}6+3\\12-3\end{bmatrix} = \begin{bmatrix}9\\9\end{bmatrix}.$$
  4. Step 4: Consistency check via dot product.

    The linear prediction should satisfy

    $$\Delta \ell \approx \nabla_{\mathbf{x}}\ell \cdot \Delta \mathbf{x}.$$

    Compute:

    $$\begin{bmatrix}9\\9\end{bmatrix}\cdot\begin{bmatrix}0.01\\-0.02\end{bmatrix} = 9(0.01)+9(-0.02)=0.09-0.18=-0.09,$$

    matching the pushforward result.

Insight: This example makes the duality visible: forward-mode computes the effect of a small move Δx\Delta \mathbf{x}; reverse-mode computes the gradient that, when dotted with Δx\Delta \mathbf{x}, predicts the same Δ\Delta \ell. The transpose is exactly what makes those two views consistent.
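The same example in NumPy, confirming both the pullback gradient and the predicted $\Delta\ell$:

```python
import numpy as np

# Maps from the worked example
g = lambda x: np.array([x[0] + 2 * x[1], x[0] - x[1]])
f = lambda u: u[0] ** 2 + 3 * u[1]

x0 = np.array([1.0, 1.0])
dx = np.array([0.01, -0.02])

Dg = np.array([[1.0, 2.0], [1.0, -1.0]])   # constant Jacobian (g is linear)
grad_u = np.array([2 * g(x0)[0], 3.0])     # gradient of f at u = (3, 0)
grad_x = Dg.T @ grad_u                     # pullback through g

assert np.allclose(grad_x, [9.0, 9.0])
assert np.isclose(grad_x @ dx, -0.09)      # matches the pushforward prediction
```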

Element-wise chain rule as “sum over intermediate coordinates”

Let $g: \mathbb{R}^3\to\mathbb{R}^2$ and $f: \mathbb{R}^2\to\mathbb{R}^2$ be

$$g(x_1,x_2,x_3)=\begin{bmatrix}u_1\\u_2\end{bmatrix}=\begin{bmatrix}x_1x_2\\x_2+x_3^2\end{bmatrix},\qquad f(u_1,u_2)=\begin{bmatrix}y_1\\y_2\end{bmatrix}=\begin{bmatrix}u_1+u_2\\u_1u_2\end{bmatrix}.$$

Compute $\frac{\partial y_2}{\partial x_3}$ at a general point using path-tracing.

  1. Step 1: Identify the dependency paths.

    We want $y_2 = u_1u_2$.

    • $u_1 = x_1x_2$ does NOT depend on $x_3$.
    • $u_2 = x_2 + x_3^2$ DOES depend on $x_3$.

    So only the path through $u_2$ matters for $\partial y_2/\partial x_3$.

  2. Step 2: Apply the coordinate chain rule.

    Use

    $$\frac{\partial y_2}{\partial x_3} = \sum_{k=1}^{2}\frac{\partial y_2}{\partial u_k}\frac{\partial u_k}{\partial x_3}.$$

    Compute each factor:

    • $\frac{\partial y_2}{\partial u_1} = \frac{\partial (u_1u_2)}{\partial u_1} = u_2$
    • $\frac{\partial y_2}{\partial u_2} = \frac{\partial (u_1u_2)}{\partial u_2} = u_1$
    • $\frac{\partial u_1}{\partial x_3} = 0$
    • $\frac{\partial u_2}{\partial x_3} = 2x_3$
  3. Step 3: Sum over k.

    \begin{align*}
    \frac{\partial y_2}{\partial x_3} &= \frac{\partial y_2}{\partial u_1}\frac{\partial u_1}{\partial x_3} + \frac{\partial y_2}{\partial u_2}\frac{\partial u_2}{\partial x_3} \\
    &= (u_2)(0) + (u_1)(2x_3) \\
    &= 2x_3\,u_1 = 2x_3\,x_1x_2.
    \end{align*}

Insight: The summation form is ‘matrix multiplication with indices.’ It forces you to enumerate intermediate coordinates uku_k and add their contributions—exactly like summing over all paths in a computational graph.
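A finite-difference check of this result in NumPy (the evaluation point is chosen arbitrarily):

```python
import numpy as np

# Maps from the worked example
g = lambda x: np.array([x[0] * x[1], x[1] + x[2] ** 2])
f = lambda u: np.array([u[0] + u[1], u[0] * u[1]])

x0 = np.array([2.0, 3.0, 0.5])
eps = 1e-6
d3 = np.array([0.0, 0.0, eps])

# d(y2)/d(x3) by central difference on the composition's second component
fd = (f(g(x0 + d3))[1] - f(g(x0 - d3))[1]) / (2 * eps)
assert np.isclose(fd, 2 * x0[2] * x0[0] * x0[1], atol=1e-5)  # formula: 2 x3 x1 x2
```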

Key Takeaways

  • A multivariable derivative is best understood as a linear map; its matrix representation is the Jacobian.

  • For $g: \mathbb{R}^n\to\mathbb{R}^m$ and $f: \mathbb{R}^m\to\mathbb{R}^p$, the chain rule is $D(f\circ g)(\mathbf{x}) = Df(g(\mathbf{x}))\,Dg(\mathbf{x})$.

  • Shape-tracking is a correctness tool: $(p\times m)(m\times n)=(p\times n)$ is the only order that composes.

  • Perturbations push forward: $\Delta \mathbf{u} \approx J\,\Delta \mathbf{x}$.

  • Gradients pull back: for scalar output, $\nabla_{\mathbf{x}}\ell = J^\top\,\nabla_{\mathbf{u}}\ell$.

  • The coordinate form $\frac{\partial h_i}{\partial x_j} = \sum_k \frac{\partial f_i}{\partial u_k}\frac{\partial g_k}{\partial x_j}$ is the same rule as matrix multiplication, interpreted as "sum over intermediate coordinates."

  • Backpropagation is repeated application of the pullback rule (multiply by local Jacobian transposes) along a computational graph.

Common Mistakes

  • Multiplying Jacobians in the wrong order (forgetting that composition order reverses: outer derivative on the left).

  • Mixing up Jacobians (matrices) and gradients (vectors), especially the row-vs-column convention; forgetting the transpose when converting $D\ell$ to $\nabla \ell$.

  • Trying to build the full Jacobian for a large network when you only need Jacobian-vector products (forward mode) or Jacobianᵀ-vector products (reverse mode).

  • Losing track of which variables each function actually depends on; missing or adding dependency paths in the summation form.

Practice

easy

Let $g: \mathbb{R}^2\to\mathbb{R}^3$ be $g(x_1,x_2)=(x_1,\; x_1x_2,\; x_2^2)$ and let $f: \mathbb{R}^3\to\mathbb{R}$ be $f(u_1,u_2,u_3)=2u_1-u_2+u_3$. Compute $\nabla (f\circ g)(\mathbf{x})$.

Hint: Compute $Dg(\mathbf{x})$ (3×2) and $\nabla f(\mathbf{u})$ (3×1). Then use $\nabla (f\circ g)=Dg^\top\nabla f$.

Show solution

We have $\nabla f(\mathbf{u}) = \begin{bmatrix}2\\-1\\1\end{bmatrix}$.

Compute

$$Dg(\mathbf{x})=\begin{bmatrix} \partial u_1/\partial x_1 & \partial u_1/\partial x_2 \\ \partial u_2/\partial x_1 & \partial u_2/\partial x_2 \\ \partial u_3/\partial x_1 & \partial u_3/\partial x_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ x_2 & x_1 \\ 0 & 2x_2 \end{bmatrix}.$$

Then

$$\nabla (f\circ g)(\mathbf{x}) = Dg(\mathbf{x})^\top\,\nabla f = \begin{bmatrix}1 & x_2 & 0\\0 & x_1 & 2x_2\end{bmatrix}\begin{bmatrix}2\\-1\\1\end{bmatrix} = \begin{bmatrix}2-x_2\\-x_1+2x_2\end{bmatrix}.$$
medium

Suppose $g: \mathbb{R}^n\to\mathbb{R}^m$ and $f: \mathbb{R}^m\to\mathbb{R}^p$. You are told $Dg(\mathbf{x})$ is m×n and $Df(\mathbf{u})$ is p×m. What is the shape of $D(f\circ g)(\mathbf{x})$? Also, if $p=1$, what is the shape of $\nabla (f\circ g)(\mathbf{x})$ (as a column vector)?

Hint: Use matrix multiplication shapes. For $p=1$, the Jacobian is 1×n and the gradient is its transpose.

Show solution

$D(f\circ g)(\mathbf{x}) = Df(g(\mathbf{x}))\,Dg(\mathbf{x})$ has shape (p×m)(m×n) = p×n.

If $p=1$, then $D(f\circ g)$ is 1×n, so the gradient $\nabla (f\circ g)$ (column) has shape n×1.

hard

Let $\mathbf{u} = g(\mathbf{x})$ with $g: \mathbb{R}^2\to\mathbb{R}^2$ given by $u_1=x_1^2$, $u_2=\sin(x_2)$. Let scalar $\ell=f(\mathbf{u})$ with $f(u_1,u_2)=u_1u_2$. Compute $\nabla_{\mathbf{x}}\ell$.

Hint: Compute $\nabla_{\mathbf{u}}\ell$ first, then multiply by $Dg(\mathbf{x})^\top$. Remember $Dg$ is 2×2 and diagonal here.

Show solution

First, $\ell=u_1u_2$, so

$$\nabla_{\mathbf{u}}\ell = \begin{bmatrix}\partial \ell/\partial u_1\\\partial \ell/\partial u_2\end{bmatrix} = \begin{bmatrix}u_2\\u_1\end{bmatrix}.$$

Next, $Dg(\mathbf{x})$:

  • $u_1=x_1^2$, so $\partial u_1/\partial x_1=2x_1$ and $\partial u_1/\partial x_2=0$
  • $u_2=\sin(x_2)$, so $\partial u_2/\partial x_1=0$ and $\partial u_2/\partial x_2=\cos(x_2)$

Thus

$$Dg(\mathbf{x})=\begin{bmatrix}2x_1 & 0\\0 & \cos(x_2)\end{bmatrix}.$$

So

$$\nabla_{\mathbf{x}}\ell = Dg(\mathbf{x})^\top\,\nabla_{\mathbf{u}}\ell = \begin{bmatrix}2x_1 & 0\\0 & \cos(x_2)\end{bmatrix}\begin{bmatrix}u_2\\u_1\end{bmatrix} = \begin{bmatrix}2x_1\sin(x_2)\\x_1^2\cos(x_2)\end{bmatrix}.$$

Connections

Next up: apply this rule repeatedly and efficiently in neural networks.
