Multivariable Chain Rule

Calculus · Difficulty: ███░░ · Depth: 6 · Unlocks: 5

Derivatives of composed functions with multiple variables.

Core Concepts

  • Structure of composition: an inner map from R^n to R^m and an outer map from R^m to R^p (track which variables feed into which function)
  • Derivative as the best linear approximation at a point, represented by the Jacobian matrix (matrix of partial derivatives)
  • Composition of linear maps is given by matrix multiplication (linear maps compose by multiplying their matrices)

Key Symbols & Notation

  • Df: the Jacobian matrix / derivative of f
  • ∘: function composition

Essential Relationships

  • Chain rule (core formula): D(f ∘ g)(x) = Df(g(x)) · Dg(x) (matrix product; evaluate the outer derivative at g(x))

All Concepts (9)

  • Composition of multivariable maps: inner map g: ℝⁿ → ℝᵐ and outer map f: ℝᵐ → ℝᵖ (so f∘g: ℝⁿ → ℝᵖ)
  • Total derivative (or differential) as a linear map at a point (best linear approximation of a multivariable function)
  • Jacobian matrix: the matrix representation of the total derivative for vector-valued functions
  • Derivative of vector-valued functions (componentwise partial derivatives assembled into a Jacobian)
  • Multivariable chain rule as a rule for composing total derivatives (not just scalar chain rule)
  • Component/summation form of the chain rule: expressing partials of a composition as sums over intermediate variables
  • Chain rule along a parameterized curve (rate of change of f along r(t): df/dt = ∇f(r(t)) · r'(t))
  • Evaluation-location dependence: derivatives of outer function must be evaluated at the inner function value (e.g., Df(g(x)))
  • Interpretation of the chain rule as composition of linear approximations (compose best linear approximations to get best linear approximation of composition)

Teaching Strategy

Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.

In single-variable calculus, the chain rule is a one-line formula. In multivariable calculus, it’s the same idea—“derivatives multiply along a composition”—but the objects are linear maps, so the multiplication becomes matrix multiplication. Once you truly see that, backpropagation stops feeling like magic.

TL;DR:

For a composition $h = f \circ g$ where $g: \mathbb{R}^n \to \mathbb{R}^m$ and $f: \mathbb{R}^m \to \mathbb{R}^p$, the multivariable chain rule is

$$D(f \circ g)(\mathbf{x}) = Df(g(\mathbf{x}))\, Dg(\mathbf{x}).$$

Interpretation: a small input perturbation $\Delta \mathbf{x}$ is pushed forward by $Dg$ into $\Delta \mathbf{u}$, then pushed forward by $Df$ into $\Delta \mathbf{y}$. For scalar output ($p=1$), gradients pull back via the transpose: $\nabla h(\mathbf{x}) = Dg(\mathbf{x})^\top \nabla f(g(\mathbf{x}))$.

What Is the Multivariable Chain Rule?

Why you need a new-looking chain rule

In 1D, composition looks like $h(x) = f(g(x))$ and the chain rule says $h'(x) = f'(g(x))\, g'(x)$.

In multiple dimensions, you still compose functions, but now each function can take multiple inputs and produce multiple outputs. The “derivative” is no longer a single number; it’s a linear map that best approximates the function near a point. Linear maps are represented by matrices, so “multiply the derivatives” becomes “multiply the Jacobian matrices.”

This is the conceptual core:

  • Derivative = best linear approximation near a point.
  • Jacobian matrix = the matrix of partial derivatives that represents that linear approximation.
  • Composition of linear maps corresponds to matrix multiplication.

The composition structure (track shapes)

The cleanest multivariable chain rule is written with a clear inner/outer structure:

  • Inner map: $g: \mathbb{R}^n \to \mathbb{R}^m$
  • Outer map: $f: \mathbb{R}^m \to \mathbb{R}^p$
  • Composition: $h = f \circ g: \mathbb{R}^n \to \mathbb{R}^p$

Let:

  • input vector x ∈ ℝⁿ
  • intermediate vector u = g(x) ∈ ℝᵐ
  • output vector y = f(u) ∈ ℝᵖ

A quick “shape table” you should get used to:

| Object | Meaning | Shape |
| --- | --- | --- |
| $\mathbf{x}$ | input | n×1 |
| $\mathbf{u} = g(\mathbf{x})$ | intermediate | m×1 |
| $\mathbf{y} = f(\mathbf{u})$ | output | p×1 |
| $Dg(\mathbf{x})$ | Jacobian of g at x | m×n |
| $Df(\mathbf{u})$ | Jacobian of f at u | p×m |
| $D(f\circ g)(\mathbf{x})$ | Jacobian of the composition | p×n |

Notice the only multiplication that makes sense dimensionally is:

$$(p\times m)\,(m\times n) = (p\times n).$$

That is already a big part of the multivariable chain rule: the shapes force the correct order.

The chain rule (matrix form)

The multivariable chain rule states:

$$D(f \circ g)(\mathbf{x}) = Df(g(\mathbf{x}))\, Dg(\mathbf{x}).$$

Read it as: "first apply the derivative of $g$ at $\mathbf{x}$, then apply the derivative of $f$ at the resulting point $g(\mathbf{x})$."

What it means geometrically (tiny perturbations)

If you perturb the input by a small vector $\Delta \mathbf{x}$, then:

1) Push forward through g:

$$\Delta \mathbf{u} \approx Dg(\mathbf{x})\, \Delta \mathbf{x}.$$

2) Push forward through f:

$$\Delta \mathbf{y} \approx Df(\mathbf{u})\, \Delta \mathbf{u}.$$

Combine them:

$$\Delta \mathbf{y} \approx Df(g(\mathbf{x}))\, Dg(\mathbf{x})\, \Delta \mathbf{x}.$$

So the Jacobian of the composition is the matrix that takes $\Delta \mathbf{x}$ directly to $\Delta \mathbf{y}$. That matrix is the product $Df\,Dg$.
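This pushforward picture is easy to check numerically. A minimal NumPy sketch (the maps $g$ and $f$ below are illustrative choices, not from the lesson): the chain-rule product $Df\,Dg\,\Delta\mathbf{x}$ should predict the actual change of the composition up to second-order error.

```python
import numpy as np

# Illustrative maps (chosen here): g is the inner map, f the outer scalar map.
g = lambda x: np.array([np.exp(x[0]), x[0] * x[1]])   # R^2 -> R^2
f = lambda u: u[0] + u[1] ** 2                        # R^2 -> R

x0 = np.array([0.5, 2.0])
dx = np.array([1e-4, -3e-4])                          # small input perturbation

# Local Jacobians at the relevant points
Dg = np.array([[np.exp(x0[0]), 0.0],
               [x0[1],         x0[0]]])               # 2x2, evaluated at x0
u0 = g(x0)
Df = np.array([[1.0, 2 * u0[1]]])                     # 1x2, evaluated at u0 = g(x0)

# Chain-rule prediction: delta_y ~= Df(g(x0)) Dg(x0) delta_x
predicted = (Df @ Dg @ dx).item()
actual = f(g(x0 + dx)) - f(g(x0))
assert abs(predicted - actual) < 1e-6
```

The leftover discrepancy is the quadratic remainder of the linearization; shrinking `dx` shrinks it quadratically.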

A note on notation: Df vs ∇f

If you already know gradients, the key is to keep these two objects distinct:

  • $Df$ is the Jacobian (a matrix, in general).
  • $\nabla f$ is the gradient (a vector), defined when $f$ is scalar-valued: $f: \mathbb{R}^n \to \mathbb{R}$.

When $p=1$, the Jacobian of $f$ is a 1×m row vector; the gradient is usually written as an m×1 column vector. They are transposes of each other (depending on convention):

$$Df(\mathbf{u}) \text{ is } 1\times m, \qquad \nabla f(\mathbf{u}) \text{ is } m\times 1.$$

This transpose issue matters a lot in backprop, so we’ll be explicit about it later.

Core Mechanic 1: Derivative as Best Linear Approximation (and the Jacobian)

Why this viewpoint

If you treat multivariable derivatives as “a bunch of partial derivatives,” you can still compute things, but it’s easy to lose track of structure.

If you treat the derivative as “the best linear map near a point,” everything becomes systematic:

  • You can push forward small changes.
  • You can compose derivatives by composing linear maps.
  • You can check correctness by verifying matrix shapes.

The linear approximation definition

Let $g: \mathbb{R}^n \to \mathbb{R}^m$. We say $g$ is differentiable at $\mathbf{x}$ if there exists a linear map $L: \mathbb{R}^n \to \mathbb{R}^m$ such that

$$g(\mathbf{x} + \Delta \mathbf{x}) \approx g(\mathbf{x}) + L\,\Delta \mathbf{x}$$

with an error that becomes negligible compared to $\|\Delta \mathbf{x}\|$ as $\Delta \mathbf{x} \to 0$.

That linear map is the derivative $Dg(\mathbf{x})$.

Jacobian matrix: how the linear map is represented

Write $g(\mathbf{x}) = \begin{bmatrix} g_1(\mathbf{x}) \\ \vdots \\ g_m(\mathbf{x}) \end{bmatrix}$.

Then the Jacobian $Dg(\mathbf{x})$ is the m×n matrix:

$$Dg(\mathbf{x}) = \begin{bmatrix} \frac{\partial g_1}{\partial x_1} & \cdots & \frac{\partial g_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial g_m}{\partial x_1} & \cdots & \frac{\partial g_m}{\partial x_n} \end{bmatrix}.$$

This is exactly the matrix that maps a small input perturbation to an approximate output perturbation:

$$\Delta \mathbf{u} \approx Dg(\mathbf{x})\, \Delta \mathbf{x}.$$
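One concrete way to internalize the Jacobian is to build it column by column with finite differences. A NumPy sketch (the example map is the same $g$ used in the worked examples later in this lesson):

```python
import numpy as np

def numerical_jacobian(g, x, eps=1e-6):
    """Approximate the m x n Jacobian of g at x by central differences, one column per input."""
    x = np.asarray(x, dtype=float)
    m = np.atleast_1d(g(x)).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.atleast_1d(g(x + dx)) - np.atleast_1d(g(x - dx))) / (2 * eps)
    return J

# g: R^2 -> R^2 with u1 = x1^2 + x2, u2 = x1*x2 (from the worked examples)
g = lambda x: np.array([x[0] ** 2 + x[1], x[0] * x[1]])
x0 = np.array([1.0, 2.0])

# Analytic Jacobian at x0: [[2*x1, 1], [x2, x1]]
J_analytic = np.array([[2 * x0[0], 1.0], [x0[1], x0[0]]])
assert np.allclose(numerical_jacobian(g, x0), J_analytic, atol=1e-6)
```

Column j answers "how does the output move if only input j wiggles," which is exactly the partial-derivative reading of the matrix above.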

Interactive-canvas mental model (shape-tracking + arrows)

Imagine an “interactive canvas” with three boxes:

1) x-box: x ∈ ℝⁿ

2) u-box: u = g(x) ∈ ℝᵐ

3) y-box: y = f(u) ∈ ℝᵖ

Now add two kinds of arrows:

Arrow type A: pushforward of perturbations (forward mode)

You draw a little arrow $\Delta \mathbf{x}$ at the input.

  • It transforms to $\Delta \mathbf{u} = Dg\, \Delta \mathbf{x}$.
  • Then transforms to $\Delta \mathbf{y} = Df\, \Delta \mathbf{u}$.

So perturbations flow with the function direction.

Arrow type B: pullback of sensitivities / gradients (reverse mode)

If the final output is scalar (p=1), you draw a gradient arrow at the output: $\nabla_{\mathbf{y}} \ell$ (often just 1 if the scalar is the loss itself).

Then gradients flow backwards through transposes:

  • sensitivity at u: $\nabla_{\mathbf{u}} \ell = Df(\mathbf{u})^\top \nabla_{\mathbf{y}} \ell$
  • sensitivity at x: $\nabla_{\mathbf{x}} \ell = Dg(\mathbf{x})^\top \nabla_{\mathbf{u}} \ell$

This is the heart of backprop. In this lesson, we’re building the “transpose reflex”:

  • Perturbations push forward with $J$.
  • Gradients pull back with $J^\top$.

A small but crucial convention check

There are two common conventions:

1) Jacobian is m×n (outputs by inputs). Gradients are column vectors.

2) Jacobian is n×m (inputs by outputs). Gradients are row vectors.

We’ll use the most common ML convention:

  • Jacobian $Dg$ is m×n.
  • Gradient $\nabla f$ is n×1 for scalar f.

With that convention, the chain rule for scalar output becomes a clean transpose pullback (we’ll derive it soon).

Core Mechanic 2: Chain Rule = Composition of Linear Maps (Matrix Multiplication)

Why matrix multiplication appears

The multivariable chain rule is not a new rule you memorize. It is a consequence of one fact:

If you approximate each function by a linear map near the relevant point, then approximating the composition means composing those linear maps.

And linear maps compose by matrix multiplication.

Derivation (showing the work)

Let $g: \mathbb{R}^n \to \mathbb{R}^m$ and $f: \mathbb{R}^m \to \mathbb{R}^p$.

Define:

  • $\mathbf{u} = g(\mathbf{x})$
  • $\mathbf{y} = f(\mathbf{u}) = (f \circ g)(\mathbf{x})$

Start with a small perturbation $\Delta \mathbf{x}$.

Step 1: Linearize g at x

$$g(\mathbf{x}+\Delta\mathbf{x}) \approx g(\mathbf{x}) + Dg(\mathbf{x})\,\Delta\mathbf{x}.$$

Let $\Delta \mathbf{u} = Dg(\mathbf{x})\,\Delta\mathbf{x}$, so

$$g(\mathbf{x}+\Delta\mathbf{x}) \approx \mathbf{u} + \Delta\mathbf{u}.$$

Step 2: Linearize f at u

$$f(\mathbf{u}+\Delta\mathbf{u}) \approx f(\mathbf{u}) + Df(\mathbf{u})\,\Delta\mathbf{u}.$$

Substitute $\Delta \mathbf{u} = Dg(\mathbf{x})\,\Delta\mathbf{x}$:

$$f(\mathbf{u}+\Delta\mathbf{u}) \approx f(\mathbf{u}) + Df(\mathbf{u})\,Dg(\mathbf{x})\,\Delta\mathbf{x}.$$

But the left side is approximately

$$f(g(\mathbf{x}+\Delta\mathbf{x})) = (f\circ g)(\mathbf{x}+\Delta\mathbf{x}).$$

So we have the linear approximation:

$$(f\circ g)(\mathbf{x}+\Delta\mathbf{x}) \approx (f\circ g)(\mathbf{x}) + \bigl(Df(g(\mathbf{x}))\,Dg(\mathbf{x})\bigr)\,\Delta\mathbf{x}.$$

By uniqueness of the best linear approximation, the derivative must be:

$$D(f \circ g)(\mathbf{x}) = Df(g(\mathbf{x}))\,Dg(\mathbf{x}).$$

That’s the multivariable chain rule.
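The derivation can be sanity-checked numerically: build both sides of $D(f\circ g) = Df(g(\mathbf{x}))\,Dg(\mathbf{x})$ with finite differences and compare. A NumPy sketch (the maps are arbitrary smooth examples chosen here, not from the text):

```python
import numpy as np

def num_jac(F, x, eps=1e-6):
    """Central-difference Jacobian of F at x (outputs by inputs)."""
    F0 = np.atleast_1d(F(x))
    J = np.zeros((F0.size, x.size))
    for j in range(x.size):
        d = np.zeros_like(x)
        d[j] = eps
        J[:, j] = (np.atleast_1d(F(x + d)) - np.atleast_1d(F(x - d))) / (2 * eps)
    return J

# Example maps chosen for this check
g = lambda x: np.array([np.sin(x[0]) * x[1], x[0] + x[1] ** 2])   # R^2 -> R^2
f = lambda u: np.array([u[0] * u[1], np.exp(u[0]), u[1] ** 3])    # R^2 -> R^3

x0 = np.array([0.5, -1.2])
lhs = num_jac(lambda x: f(g(x)), x0)        # D(f∘g)(x0), 3x2
rhs = num_jac(f, g(x0)) @ num_jac(g, x0)    # Df(g(x0)) Dg(x0): (3x2)(2x2)
assert np.allclose(lhs, rhs, atol=1e-5)
```

Note that the outer Jacobian is evaluated at `g(x0)`, not at `x0`; moving that evaluation point is the most common way this check fails.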

Element-wise chain rule (path-tracing through indices)

Sometimes you want the version that looks like “sum over paths.”

Let h=fgh = f \circ g with components:

  • gk(x)g_k(\mathbf{x}) for k = 1..m
  • fi(u)f_i(\mathbf{u}) for i = 1..p
  • hi(x)=fi(g(x))h_i(\mathbf{x}) = f_i(g(\mathbf{x}))

Then for each output component i and input component j:

hixj(x)=k=1mfiuk(u)gkxj(x),u=g(x).\frac{\partial h_i}{\partial x_j}(\mathbf{x}) = \sum_{k=1}^{m} \frac{\partial f_i}{\partial u_k}(\mathbf{u})\,\frac{\partial g_k}{\partial x_j}(\mathbf{x}),\quad \mathbf{u}=g(\mathbf{x}).

This is literally matrix multiplication in coordinates.

Path-tracing interpretation:

  • Each term $\frac{\partial f_i}{\partial u_k}\,\frac{\partial g_k}{\partial x_j}$ corresponds to a path $x_j \to u_k \to y_i$.
  • You add contributions from all intermediate coordinates $u_k$.
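In code, the coordinate form and the matrix form are literally the same computation. A small NumPy check with random matrices standing in for the two Jacobians (shapes as in the text):

```python
import numpy as np

# Random Jacobians with the shapes from the text: Df is p x m, Dg is m x n
rng = np.random.default_rng(0)
p, m, n = 3, 4, 2
Df = rng.normal(size=(p, m))
Dg = rng.normal(size=(m, n))

Dh = Df @ Dg  # matrix form of the chain rule

# Coordinate form: (Dh)_{ij} = sum over intermediate k of (Df)_{ik} (Dg)_{kj}
i, j = 1, 0
entry = sum(Df[i, k] * Dg[k, j] for k in range(m))
assert np.isclose(entry, Dh[i, j])
```

Each term of the Python `sum` is one path $x_j \to u_k \to y_i$; the loop over `k` is the "sum over intermediate coordinates."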

“Computational graph” view (interactive canvas)

Think of a small graph:

  • Nodes are variables (scalars or vectors).
  • Edges are functions.

For our two-layer composition:

x →(g)→ u →(f)→ y

On an interactive canvas, you can show two overlays:

Overlay 1: Jacobians on edges

  • edge x→u labeled $Dg(\mathbf{x})$ (m×n)
  • edge u→y labeled $Df(\mathbf{u})$ (p×m)

To get the total derivative x→y, multiply along the path:

$$D_{\mathbf{x}}\mathbf{y} = Df\,Dg.$$

Overlay 2: live perturbations and gradients

Pick a concrete point x₀.

  • Drag x by a tiny $\Delta \mathbf{x}$. The canvas updates u and y and also shows the predicted linear response $Df\,Dg\,\Delta \mathbf{x}$.
  • Alternatively, set a scalar loss $\ell = \phi(\mathbf{y})$ (or just take p=1). Show a gradient vector at y and animate it flowing backward:

$$\nabla_{\mathbf{x}}\ell = Dg(\mathbf{x})^\top\,Df(\mathbf{u})^\top\,\nabla_{\mathbf{y}}\ell.$$

This “two-way animation” is the visualization you want to internalize:

  • Forward: $\Delta$ vectors multiply by Jacobians.
  • Backward: gradients multiply by Jacobian transposes.

Scalar output special case (gradient form)

Now suppose $f: \mathbb{R}^m \to \mathbb{R}$ is scalar. Then $Df(\mathbf{u})$ is 1×m.

From the Jacobian chain rule:

$$D(f\circ g)(\mathbf{x}) = Df(\mathbf{u})\,Dg(\mathbf{x}).$$

The left side is a 1×n row vector. If we want the gradient as an n×1 column vector, transpose:

\begin{align*}
\nabla (f\circ g)(\mathbf{x}) &= D(f\circ g)(\mathbf{x})^\top \\
&= \bigl(Df(\mathbf{u})\,Dg(\mathbf{x})\bigr)^\top \\
&= Dg(\mathbf{x})^\top\,Df(\mathbf{u})^\top \\
&= Dg(\mathbf{x})^\top\,\nabla f(\mathbf{u}).
\end{align*}

That last line is the standard “gradient chain rule” used everywhere in ML.
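A quick numerical check of that gradient chain rule, using small example functions chosen here (not from the text): $g(\mathbf{x}) = (x_1+x_2,\ x_1x_2)$ and $f(\mathbf{u}) = u_1^2 + u_2$, so $h(\mathbf{x}) = (x_1+x_2)^2 + x_1x_2$.

```python
import numpy as np

# h = f∘g for the example functions above
h = lambda x: (x[0] + x[1]) ** 2 + x[0] * x[1]

x0 = np.array([0.7, -0.3])
u0 = np.array([x0[0] + x0[1], x0[0] * x0[1]])

Dg = np.array([[1.0, 1.0],
               [x0[1], x0[0]]])          # 2x2 Jacobian of g at x0
grad_f = np.array([2 * u0[0], 1.0])      # gradient of f at u0 (column convention)

grad_h = Dg.T @ grad_f                   # gradient chain rule: Dg^T ∇f(u0)

# Compare against central differences on h directly
eps = 1e-6
fd = np.array([(h(x0 + eps * e) - h(x0 - eps * e)) / (2 * eps) for e in np.eye(2)])
assert np.allclose(grad_h, fd, atol=1e-6)
```

The transpose is what lets a 2-vector gradient at `u0` come back as a 2-vector gradient at `x0` even when the intermediate dimension differs from the input dimension.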

One more composition (three layers)

If you have $\mathbf{x} \xrightarrow{g} \mathbf{u} \xrightarrow{f} \mathbf{v} \xrightarrow{r} \mathbf{y}$, then:

$$D(r\circ f\circ g)(\mathbf{x}) = Dr(\mathbf{v})\,Df(\mathbf{u})\,Dg(\mathbf{x}).$$

Forward-mode perturbations are multiplied innermost-first: $Dg$ acts on $\Delta\mathbf{x}$, then $Df$, then $Dr$, the same order in which the functions are applied.

Reverse-mode gradients multiply by transposes in the opposite order:

$$\nabla_{\mathbf{x}}\ell = Dg(\mathbf{x})^\top\,Df(\mathbf{u})^\top\,Dr(\mathbf{v})^\top\,\nabla_{\mathbf{y}}\ell.$$

This is backprop in one line—just applied repeatedly.
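The one-line claim is easy to verify with random matrices standing in for the three local Jacobians (a NumPy sketch under that assumption; the shapes are arbitrary but consistent):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))   # stands in for Dg: R^3 -> R^4
B = rng.normal(size=(5, 4))   # stands in for Df: R^4 -> R^5
C = rng.normal(size=(1, 5))   # stands in for Dr: R^5 -> R (scalar output)

grad_y = np.ones((1, 1))      # seed gradient at the scalar output

# Reverse mode: multiply by transposes, outermost first
grad_x = A.T @ (B.T @ (C.T @ grad_y))

# Must equal the transpose of the full forward Jacobian product
assert np.allclose(grad_x, (C @ B @ A).T)
```

Reverse mode never forms the full product `C @ B @ A`; it only ever does matrix-vector products, which is why it scales to networks with millions of parameters.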

Application/Connection: From Multivariable Chain Rule to Backprop Intuition

Why this matters for ML

Neural networks are compositions of many vector-valued functions:

$$\mathbf{x} \to \mathbf{h}^{(1)} \to \mathbf{h}^{(2)} \to \cdots \to \hat{\mathbf{y}} \to \ell.$$

Training requires $\nabla_{\theta}\ell$: gradients with respect to the parameters. The only tool you need conceptually is the multivariable chain rule, applied efficiently.

The hard part is not the calculus; it’s bookkeeping:

  • What depends on what?
  • What is the shape of each Jacobian?
  • Are we pushing perturbations forward or pulling gradients back?

A concrete computational graph (with explicit shapes)

Let’s build a tiny “network” with one hidden layer and a scalar loss. Define:

  • x ∈ ℝ²
  • Parameters: W ∈ ℝ^{3×2}, b ∈ ℝ³, and c ∈ ℝ³ (a vector used to reduce to a scalar)

Forward computation:

1) $\mathbf{a} = W\mathbf{x} + \mathbf{b}$ (so a ∈ ℝ³)

2) $\mathbf{h} = \sigma(\mathbf{a})$ elementwise (so h ∈ ℝ³)

3) $\ell = \mathbf{c}^\top \mathbf{h}$ (so ℓ ∈ ℝ)

This is a composition:

x →(affine)→ a →(nonlinearity)→ h →(dot)→ ℓ

On an interactive canvas, you can attach:

| Edge | Local derivative | Shape |
| --- | --- | --- |
| x→a | $D_{\mathbf{x}}\mathbf{a} = W$ | 3×2 |
| a→h | $D_{\mathbf{a}}\mathbf{h} = \operatorname{Diag}(\sigma'(\mathbf{a}))$ | 3×3 |
| h→ℓ | $D_{\mathbf{h}}\ell = \mathbf{c}^\top$ | 1×3 |

Forward-mode (push a perturbation)

A perturbation $\Delta \mathbf{x}$ produces:

\begin{align*}
\Delta \mathbf{a} &\approx W\,\Delta \mathbf{x} \\
\Delta \mathbf{h} &\approx \operatorname{Diag}(\sigma'(\mathbf{a}))\,\Delta \mathbf{a} \\
\Delta \ell &\approx \mathbf{c}^\top\,\Delta \mathbf{h}.
\end{align*}

Combine:

$$\Delta \ell \approx \mathbf{c}^\top\,\operatorname{Diag}(\sigma'(\mathbf{a}))\,W\,\Delta \mathbf{x}.$$

So the total Jacobian (a 1×2 row vector) is:

$$D_{\mathbf{x}}\ell = \mathbf{c}^\top\,\operatorname{Diag}(\sigma'(\mathbf{a}))\,W.$$

Reverse-mode (pull back a gradient)

Because ℓ is scalar, we typically want $\nabla_{\mathbf{x}}\ell$ as a 2×1 column vector.

Start with $\nabla_{\ell}\ell = 1$.

  • From $\ell = \mathbf{c}^\top\mathbf{h}$:

$$\nabla_{\mathbf{h}}\ell = \mathbf{c}.$$

  • Through $\mathbf{h} = \sigma(\mathbf{a})$ elementwise:

$$\nabla_{\mathbf{a}}\ell = \operatorname{Diag}(\sigma'(\mathbf{a}))\,\nabla_{\mathbf{h}}\ell = \operatorname{Diag}(\sigma'(\mathbf{a}))\,\mathbf{c}.$$

  • Through $\mathbf{a} = W\mathbf{x}+\mathbf{b}$:

$$\nabla_{\mathbf{x}}\ell = W^\top\,\nabla_{\mathbf{a}}\ell = W^\top\,\operatorname{Diag}(\sigma'(\mathbf{a}))\,\mathbf{c}.$$

Compare with the forward-mode Jacobian expression above:

  • $D_{\mathbf{x}}\ell$ (row) $= \mathbf{c}^\top\operatorname{Diag}(\sigma'(\mathbf{a}))\,W$
  • $\nabla_{\mathbf{x}}\ell$ (column) $= W^\top\operatorname{Diag}(\sigma'(\mathbf{a}))\,\mathbf{c}$

They are transposes, consistent with $\nabla \ell = (D\ell)^\top$.
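The whole forward and backward pass for this tiny network fits in a few lines. A NumPy sketch, taking $\sigma = \tanh$ as one concrete choice of nonlinearity (the text leaves $\sigma$ generic), with the reverse-mode gradient checked against finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 2))
b = rng.normal(size=3)
c = rng.normal(size=3)
sigma = np.tanh                           # assumed concrete nonlinearity
dsigma = lambda a: 1.0 - np.tanh(a) ** 2  # its elementwise derivative

def loss(x):
    """Forward pass: scalar loss c^T sigma(Wx + b)."""
    return c @ sigma(W @ x + b)

x0 = rng.normal(size=2)
a = W @ x0 + b

# Reverse mode from the text: grad_x = W^T Diag(sigma'(a)) c
# (Diag(v) @ c is just elementwise v * c, so no diagonal matrix is built.)
grad = W.T @ (dsigma(a) * c)

# Finite-difference check
eps = 1e-6
fd = np.array([(loss(x0 + eps * e) - loss(x0 - eps * e)) / (2 * eps)
               for e in np.eye(2)])
assert np.allclose(grad, fd, atol=1e-6)
```

Note the `dsigma(a) * c` trick: multiplying by a diagonal Jacobian is an elementwise product, which is how real implementations avoid materializing the 3×3 matrix.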

Visual intuition: pushforward vs pullback

To address visualization explicitly, here’s the picture you should rehearse:

1) Pick a point x₀.

2) Draw a tiny arrow $\Delta \mathbf{x}$ at $\mathbf{x}_0$.

3) Multiply by local Jacobians to watch the arrow morph:

  • it rotates/scales/shears in a-space,
  • then again in h-space,
  • finally collapses to a scalar change $\Delta \ell$.

Now reverse:

1) Draw a gradient arrow at h: it points in the direction that increases ℓ fastest in h-space.

2) Pull it back to a using the transpose of the local Jacobian: $\operatorname{Diag}(\sigma'(\mathbf{a}))$ (diagonal, hence symmetric, so the transpose changes nothing).

3) Pull it back to x using $W^\top$.

This is not two unrelated processes. It’s the same linear maps viewed from two dual perspectives:

  • perturbations: $\Delta$ vectors (tangent vectors)
  • gradients: covectors, pulled back by the transpose

You don’t need the formal differential-geometry language to use it correctly, but you do need the operational rule:

If forward uses $J$, backward uses $J^\top$.

Connection you’ll use next

Backpropagation is essentially the repeated application of:

$$\nabla_{\text{input}}\ell = J^\top\,\nabla_{\text{output}}\ell,$$

where $J$ is the Jacobian of a local block in the computational graph.

Once you’re comfortable multiplying Jacobians (forward) and multiplying by transposes (backward), you’re ready to study backprop as an algorithmic optimization: reuse intermediate results so you don’t form huge Jacobian matrices explicitly.
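That last point can be made concrete: a Jacobian-vector product needs only a directional difference, never the full matrix. A hedged NumPy sketch (the example function is chosen here for illustration):

```python
import numpy as np

def jvp(F, x, v, eps=1e-6):
    """Jacobian-vector product DF(x) v without forming DF (forward mode via central differences)."""
    return (np.atleast_1d(F(x + eps * v)) - np.atleast_1d(F(x - eps * v))) / (2 * eps)

# Example function chosen for this sketch
F = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
x0 = np.array([0.3, 2.0])
v = np.array([1.0, -1.0])

# Analytic Jacobian at x0, for comparison only
DF = np.array([[x0[1], x0[0]],
               [np.cos(x0[0]), 0.0]])
assert np.allclose(jvp(F, x0, v), DF @ v, atol=1e-5)
```

Two function evaluations give `DF @ v` for any direction `v`; reverse mode plays the same game with $J^\top$-vector products. Autodiff systems compute both exactly rather than by finite differences, but the "never build the matrix" structure is the same.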

Worked Examples (3)

Matrix-form chain rule with a 2→2→1 composition (shape-check + gradient)

Let $g: \mathbb{R}^2 \to \mathbb{R}^2$ and $f: \mathbb{R}^2 \to \mathbb{R}$ be

$$g(x_1,x_2) = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} x_1^2 + x_2 \\ x_1x_2 \end{bmatrix}, \qquad f(u_1,u_2)= u_1 + u_2^2.$$

Define $h = f\circ g$. Compute $Dh(\mathbf{x})$ and $\nabla h(\mathbf{x})$ at a general point, then evaluate at $(x_1,x_2)=(1,2)$.

  1. Step 1: Compute $Dg(\mathbf{x})$ (2×2).

    We have

    • $u_1 = x_1^2 + x_2$, so $\partial u_1/\partial x_1 = 2x_1$ and $\partial u_1/\partial x_2 = 1$.
    • $u_2 = x_1x_2$, so $\partial u_2/\partial x_1 = x_2$ and $\partial u_2/\partial x_2 = x_1$.

    Thus

    $$Dg(\mathbf{x}) = \begin{bmatrix} 2x_1 & 1 \\ x_2 & x_1 \end{bmatrix}.$$
  2. Step 2: Compute $Df(\mathbf{u})$ (1×2).

    $f(u_1,u_2)=u_1 + u_2^2$, so

    $$Df(\mathbf{u}) = \begin{bmatrix} \partial f/\partial u_1 & \partial f/\partial u_2 \end{bmatrix} = \begin{bmatrix} 1 & 2u_2 \end{bmatrix}.$$
  3. Step 3: Apply the Jacobian chain rule.

    $$Dh(\mathbf{x}) = Df(g(\mathbf{x}))\,Dg(\mathbf{x}).$$

    Substitute $u_2 = x_1x_2$:

    $$Df(g(\mathbf{x})) = \begin{bmatrix} 1 & 2x_1x_2 \end{bmatrix}.$$

    Now multiply:

    \begin{align*}
    Dh(\mathbf{x}) &= \begin{bmatrix} 1 & 2x_1x_2 \end{bmatrix} \begin{bmatrix} 2x_1 & 1 \\ x_2 & x_1 \end{bmatrix} \\
    &= \begin{bmatrix} 1\cdot 2x_1 + (2x_1x_2)\cdot x_2 & \;\; 1\cdot 1 + (2x_1x_2)\cdot x_1 \end{bmatrix} \\
    &= \begin{bmatrix} 2x_1 + 2x_1x_2^2 & \;\; 1 + 2x_1^2x_2 \end{bmatrix}.
    \end{align*}
  4. Step 4: Convert to gradient (column vector).

    Since $h$ is scalar, $Dh$ is 1×2 and

    $$\nabla h(\mathbf{x}) = Dh(\mathbf{x})^\top = \begin{bmatrix} 2x_1 + 2x_1x_2^2 \\ 1 + 2x_1^2x_2 \end{bmatrix}.$$
  5. Step 5: Evaluate at (1,2).

    $$\nabla h(1,2)=\begin{bmatrix}2\cdot 1 + 2\cdot 1\cdot 2^2 \\ 1 + 2\cdot 1^2\cdot 2\end{bmatrix}= \begin{bmatrix}2+8\\1+4\end{bmatrix}= \begin{bmatrix}10\\5\end{bmatrix}.$$

Insight: The computation stayed organized because we never mixed ‘partial derivative rules’ randomly. We computed two local Jacobians with clear shapes (2×2 and 1×2), multiplied them in the only shape-consistent order, then transposed to get the gradient vector.
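A finite-difference check of this example in NumPy confirms the gradient at (1,2):

```python
import numpy as np

# h = f∘g from the worked example: h(x) = (x1^2 + x2) + (x1 x2)^2
h = lambda x: (x[0] ** 2 + x[1]) + (x[0] * x[1]) ** 2

x0 = np.array([1.0, 2.0])
eps = 1e-6
grad = np.array([(h(x0 + eps * e) - h(x0 - eps * e)) / (2 * eps) for e in np.eye(2)])
assert np.allclose(grad, [10.0, 5.0], atol=1e-5)
```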

Pushforward vs pullback on a small computational graph (explicit J and Jᵀ)

Let x ∈ ℝ². Define

1) $\mathbf{u} = g(\mathbf{x})$ where $g(x_1,x_2) = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} x_1 + 2x_2 \\ x_1 - x_2 \end{bmatrix}$

2) scalar output $\ell = f(\mathbf{u})$ where $f(u_1,u_2)= u_1^2 + 3u_2$

At x₀ = (1,1), compute:

  • (A) the predicted scalar change $\Delta \ell$ for a small perturbation $\Delta \mathbf{x} = (0.01, -0.02)$ using pushforward
  • (B) the gradient $\nabla_{\mathbf{x}}\ell$ using pullback

  1. Step 1: Compute the forward values at x₀.

    At (1,1):

    • $\mathbf{u} = g(1,1) = \begin{bmatrix} 1+2\cdot 1 \\ 1-1 \end{bmatrix} = \begin{bmatrix} 3 \\ 0 \end{bmatrix}$.
    • $\ell = f(3,0)= 3^2 + 3\cdot 0 = 9$.
  2. Step 2: Pushforward (perturbations use Jacobians).

    Since g is linear, its Jacobian (2×2) is constant:

    $$Dg = \begin{bmatrix} 1 & 2 \\ 1 & -1 \end{bmatrix}.$$

    Compute $Df(\mathbf{u})$ (1×2):

    $$Df(\mathbf{u})=\begin{bmatrix}2u_1 & 3\end{bmatrix}.$$

    At u = (3,0):

    $$Df(\mathbf{u}) = \begin{bmatrix} 6 & 3 \end{bmatrix}.$$

    Now push $\Delta \mathbf{x}$ forward:

    $$\Delta \mathbf{u} \approx Dg\,\Delta \mathbf{x} = \begin{bmatrix} 1 & 2 \\ 1 & -1 \end{bmatrix}\begin{bmatrix} 0.01 \\ -0.02 \end{bmatrix} = \begin{bmatrix} 0.01-0.04 \\ 0.01+0.02 \end{bmatrix} = \begin{bmatrix} -0.03 \\ 0.03 \end{bmatrix}.$$

    Then

    $$\Delta \ell \approx Df\,\Delta \mathbf{u} = \begin{bmatrix}6 & 3\end{bmatrix}\begin{bmatrix}-0.03\\0.03\end{bmatrix} = 6(-0.03)+3(0.03) = -0.18+0.09 = -0.09.$$
  3. Step 3: Pullback (gradients use transposes).

    Because $\ell$ is scalar, start with $\nabla_{\mathbf{u}}\ell$:

    $$\nabla_{\mathbf{u}}\ell = \begin{bmatrix} \partial \ell/\partial u_1 \\ \partial \ell/\partial u_2 \end{bmatrix} = \begin{bmatrix}2u_1\\3\end{bmatrix}.$$

    At u = (3,0):

    $$\nabla_{\mathbf{u}}\ell = \begin{bmatrix}6\\3\end{bmatrix}.$$

    Now pull back through g using $Dg^\top$:

    $$\nabla_{\mathbf{x}}\ell = Dg^\top\,\nabla_{\mathbf{u}}\ell = \begin{bmatrix} 1 & 1 \\ 2 & -1 \end{bmatrix}\begin{bmatrix}6\\3\end{bmatrix} = \begin{bmatrix}6+3\\12-3\end{bmatrix} = \begin{bmatrix}9\\9\end{bmatrix}.$$
  4. Step 4: Consistency check via dot product.

    The linear prediction should satisfy

    $$\Delta \ell \approx \nabla_{\mathbf{x}}\ell \cdot \Delta \mathbf{x}.$$

    Compute:

    $$\begin{bmatrix}9\\9\end{bmatrix}\cdot\begin{bmatrix}0.01\\-0.02\end{bmatrix} = 9(0.01)+9(-0.02)=0.09-0.18=-0.09,$$

    matching the pushforward result.

Insight: This example makes the duality visible: forward-mode computes the effect of a small move Δx\Delta \mathbf{x}; reverse-mode computes the gradient that, when dotted with Δx\Delta \mathbf{x}, predicts the same Δ\Delta \ell. The transpose is exactly what makes those two views consistent.
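The same example in NumPy, confirming both the pullback gradient and the predicted $\Delta\ell$:

```python
import numpy as np

# Maps from the worked example
g = lambda x: np.array([x[0] + 2 * x[1], x[0] - x[1]])
f = lambda u: u[0] ** 2 + 3 * u[1]

x0 = np.array([1.0, 1.0])
dx = np.array([0.01, -0.02])

Dg = np.array([[1.0, 2.0], [1.0, -1.0]])   # constant Jacobian (g is linear)
grad_u = np.array([2 * g(x0)[0], 3.0])     # gradient of f at u = (3, 0)
grad_x = Dg.T @ grad_u                     # pullback through g

assert np.allclose(grad_x, [9.0, 9.0])
assert np.isclose(grad_x @ dx, -0.09)      # matches the pushforward prediction
```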

Element-wise chain rule as “sum over intermediate coordinates”

Let $g: \mathbb{R}^3\to\mathbb{R}^2$ and $f: \mathbb{R}^2\to\mathbb{R}^2$ be

$$g(x_1,x_2,x_3)=\begin{bmatrix}u_1\\u_2\end{bmatrix}=\begin{bmatrix}x_1x_2\\x_2+x_3^2\end{bmatrix},\qquad f(u_1,u_2)=\begin{bmatrix}y_1\\y_2\end{bmatrix}=\begin{bmatrix}u_1+u_2\\u_1u_2\end{bmatrix}.$$

Compute $\frac{\partial y_2}{\partial x_3}$ at a general point using path-tracing.

  1. Step 1: Identify the dependency paths.

    We want $y_2 = u_1u_2$.

    • $u_1 = x_1x_2$ does NOT depend on $x_3$.
    • $u_2 = x_2 + x_3^2$ DOES depend on $x_3$.

    So only the path through $u_2$ matters for $\partial y_2/\partial x_3$.

  2. Step 2: Apply the coordinate chain rule.

    Use

    $$\frac{\partial y_2}{\partial x_3} = \sum_{k=1}^{2}\frac{\partial y_2}{\partial u_k}\frac{\partial u_k}{\partial x_3}.$$

    Compute each factor:

    • $\frac{\partial y_2}{\partial u_1} = \frac{\partial (u_1u_2)}{\partial u_1} = u_2$
    • $\frac{\partial y_2}{\partial u_2} = \frac{\partial (u_1u_2)}{\partial u_2} = u_1$
    • $\frac{\partial u_1}{\partial x_3} = 0$
    • $\frac{\partial u_2}{\partial x_3} = 2x_3$
  3. Step 3: Sum over k.

    \begin{align*}
    \frac{\partial y_2}{\partial x_3} &= \frac{\partial y_2}{\partial u_1}\frac{\partial u_1}{\partial x_3} + \frac{\partial y_2}{\partial u_2}\frac{\partial u_2}{\partial x_3} \\
    &= (u_2)(0) + (u_1)(2x_3) \\
    &= 2x_3\,u_1 = 2x_3\,x_1x_2.
    \end{align*}

Insight: The summation form is ‘matrix multiplication with indices.’ It forces you to enumerate intermediate coordinates uku_k and add their contributions—exactly like summing over all paths in a computational graph.
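A finite-difference check of this result in NumPy (the evaluation point is chosen arbitrarily):

```python
import numpy as np

# Maps from the worked example
g = lambda x: np.array([x[0] * x[1], x[1] + x[2] ** 2])
f = lambda u: np.array([u[0] + u[1], u[0] * u[1]])

x0 = np.array([2.0, 3.0, 0.5])
eps = 1e-6
d3 = np.array([0.0, 0.0, eps])

# d(y2)/d(x3) by central difference on the composition's second component
fd = (f(g(x0 + d3))[1] - f(g(x0 - d3))[1]) / (2 * eps)
assert np.isclose(fd, 2 * x0[2] * x0[0] * x0[1], atol=1e-5)  # formula: 2 x3 x1 x2
```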

Key Takeaways

  • A multivariable derivative is best understood as a linear map; its matrix representation is the Jacobian.

  • For $g: \mathbb{R}^n\to\mathbb{R}^m$ and $f: \mathbb{R}^m\to\mathbb{R}^p$, the chain rule is $D(f\circ g)(\mathbf{x}) = Df(g(\mathbf{x}))\,Dg(\mathbf{x})$.

  • Shape-tracking is a correctness tool: $(p\times m)(m\times n)=(p\times n)$ is the only order that composes.

  • Perturbations push forward: $\Delta \mathbf{u} \approx J\,\Delta \mathbf{x}$.

  • Gradients pull back: for scalar output, $\nabla_{\mathbf{x}}\ell = J^\top\,\nabla_{\mathbf{u}}\ell$.

  • The coordinate form $\frac{\partial h_i}{\partial x_j} = \sum_k \frac{\partial f_i}{\partial u_k}\frac{\partial g_k}{\partial x_j}$ is the same rule as matrix multiplication, interpreted as "sum over intermediate coordinates."

  • Backpropagation is repeated application of the pullback rule (multiply by local Jacobian transposes) along a computational graph.

Common Mistakes

  • Multiplying Jacobians in the wrong order (forgetting that composition order reverses: outer derivative on the left).

  • Mixing up Jacobians (matrices) and gradients (vectors), especially the row-vs-column convention; forgetting the transpose when converting $D\ell$ to $\nabla \ell$.

  • Trying to build the full Jacobian for a large network when you only need Jacobian-vector products (forward mode) or Jacobianᵀ-vector products (reverse mode).

  • Losing track of which variables each function actually depends on; missing or adding dependency paths in the summation form.

Practice

easy

Let $g: \mathbb{R}^2\to\mathbb{R}^3$ be $g(x_1,x_2)=(x_1,\; x_1x_2,\; x_2^2)$ and let $f: \mathbb{R}^3\to\mathbb{R}$ be $f(u_1,u_2,u_3)=2u_1-u_2+u_3$. Compute $\nabla (f\circ g)(\mathbf{x})$.

Hint: Compute $Dg(\mathbf{x})$ (3×2) and $\nabla f(\mathbf{u})$ (3×1). Then use $\nabla (f\circ g)=Dg^\top\nabla f$.

Show solution

We have $\nabla f(\mathbf{u}) = \begin{bmatrix}2\\-1\\1\end{bmatrix}$.

Compute

$$Dg(\mathbf{x})=\begin{bmatrix} \partial u_1/\partial x_1 & \partial u_1/\partial x_2 \\ \partial u_2/\partial x_1 & \partial u_2/\partial x_2 \\ \partial u_3/\partial x_1 & \partial u_3/\partial x_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ x_2 & x_1 \\ 0 & 2x_2 \end{bmatrix}.$$

Then

$$\nabla (f\circ g)(\mathbf{x}) = Dg(\mathbf{x})^\top\,\nabla f = \begin{bmatrix}1 & x_2 & 0\\0 & x_1 & 2x_2\end{bmatrix}\begin{bmatrix}2\\-1\\1\end{bmatrix} = \begin{bmatrix}2-x_2\\-x_1+2x_2\end{bmatrix}.$$
medium

Suppose $g: \mathbb{R}^n\to\mathbb{R}^m$ and $f: \mathbb{R}^m\to\mathbb{R}^p$. You are told $Dg(\mathbf{x})$ is m×n and $Df(\mathbf{u})$ is p×m. What is the shape of $D(f\circ g)(\mathbf{x})$? Also, if $p=1$, what is the shape of $\nabla (f\circ g)(\mathbf{x})$ (as a column vector)?

Hint: Use matrix multiplication shapes. For $p=1$, the Jacobian is 1×n and the gradient is its transpose.

Show solution

$D(f\circ g)(\mathbf{x}) = Df(g(\mathbf{x}))\,Dg(\mathbf{x})$ has shape (p×m)(m×n) = p×n.

If $p=1$, then $D(f\circ g)$ is 1×n, so the gradient $\nabla (f\circ g)$ (column) has shape n×1.

hard

Let $\mathbf{u} = g(\mathbf{x})$ with $g: \mathbb{R}^2\to\mathbb{R}^2$ given by $u_1=x_1^2$, $u_2=\sin(x_2)$. Let scalar $\ell=f(\mathbf{u})$ with $f(u_1,u_2)=u_1u_2$. Compute $\nabla_{\mathbf{x}}\ell$.

Hint: Compute $\nabla_{\mathbf{u}}\ell$ first, then multiply by $Dg(\mathbf{x})^\top$. Remember $Dg$ is 2×2 and diagonal here.

Show solution

First, $\ell=u_1u_2$, so

$$\nabla_{\mathbf{u}}\ell = \begin{bmatrix}\partial \ell/\partial u_1\\\partial \ell/\partial u_2\end{bmatrix} = \begin{bmatrix}u_2\\u_1\end{bmatrix}.$$

Next, $Dg(\mathbf{x})$:

  • $u_1=x_1^2$, so $\partial u_1/\partial x_1=2x_1$ and $\partial u_1/\partial x_2=0$
  • $u_2=\sin(x_2)$, so $\partial u_2/\partial x_1=0$ and $\partial u_2/\partial x_2=\cos(x_2)$

Thus

$$Dg(\mathbf{x})=\begin{bmatrix}2x_1 & 0\\0 & \cos(x_2)\end{bmatrix}.$$

So

$$\nabla_{\mathbf{x}}\ell = Dg(\mathbf{x})^\top\,\nabla_{\mathbf{u}}\ell = \begin{bmatrix}2x_1 & 0\\0 & \cos(x_2)\end{bmatrix}\begin{bmatrix}u_2\\u_1\end{bmatrix} = \begin{bmatrix}2x_1\sin(x_2)\\x_1^2\cos(x_2)\end{bmatrix}.$$

Connections

Next up: apply this rule repeatedly and efficiently in neural networks.
