Derivatives of composed functions with multiple variables.
In single-variable calculus, the chain rule is a one-line formula. In multivariable calculus, it’s the same idea—“derivatives multiply along a composition”—but the objects are linear maps, so the multiplication becomes matrix multiplication. Once you truly see that, backpropagation stops feeling like magic.
For a composition $h = f\circ g$ where $g:\mathbb{R}^n\to\mathbb{R}^m$ and $f:\mathbb{R}^m\to\mathbb{R}^p$, the multivariable chain rule is
$$D(f\circ g)(\mathbf{x}) = Df(g(\mathbf{x}))\,Dg(\mathbf{x}).$$
Interpretation: a small input perturbation $\Delta\mathbf{x}$ is pushed forward by $Dg(\mathbf{x})$ into $\Delta\mathbf{u}$, then pushed forward by $Df(\mathbf{u})$ into $\Delta\mathbf{y}$. For scalar output ($p=1$), gradients pull back via the transpose: $\nabla(f\circ g)(\mathbf{x}) = Dg(\mathbf{x})^\top\,\nabla f(\mathbf{u})$.
In 1D, composition looks like $h(x) = f(g(x))$ and the chain rule says $h'(x) = f'(g(x))\,g'(x)$.
In multiple dimensions, you still compose functions, but now each function can take multiple inputs and produce multiple outputs. The “derivative” is no longer a single number; it’s a linear map that best approximates the function near a point. Linear maps are represented by matrices, so “multiply the derivatives” becomes “multiply the Jacobian matrices.”
This is the conceptual core: differentiate each piece as a linear map, then compose the linear maps.
The cleanest multivariable chain rule is written with a clear inner/outer structure:
Let:
1) $g:\mathbb{R}^n\to\mathbb{R}^m$ with $\mathbf{u} = g(\mathbf{x})$ (inner function)
2) $f:\mathbb{R}^m\to\mathbb{R}^p$ with $\mathbf{y} = f(\mathbf{u})$ (outer function)
A quick “shape table” you should get used to:
| Object | Meaning | Shape |
|---|---|---|
| x | input | n×1 |
| u = g(x) | intermediate | m×1 |
| y = f(u) | output | p×1 |
| Dg(x) | Jacobian of g at x | m×n |
| Df(u) | Jacobian of f at u | p×m |
| D(f∘g)(x) | Jacobian of composition | p×n |
Notice the only multiplication that makes sense dimensionally is:
$$Df(\mathbf{u})\,Dg(\mathbf{x}) : \quad (p\times m)(m\times n) = p\times n.$$
That is already a big part of the multivariable chain rule: the shapes force the correct order.
The multivariable chain rule states:
$$D(f\circ g)(\mathbf{x}) = Df(g(\mathbf{x}))\,Dg(\mathbf{x}).$$
Read it as: “first apply the derivative of $g$ at $\mathbf{x}$, then apply the derivative of $f$ at the resulting point $\mathbf{u} = g(\mathbf{x})$.”
If you perturb the input by a small vector $\Delta\mathbf{x}$, then:
1) Push forward through g: $\Delta\mathbf{u} \approx Dg(\mathbf{x})\,\Delta\mathbf{x}$
2) Push forward through f: $\Delta\mathbf{y} \approx Df(\mathbf{u})\,\Delta\mathbf{u}$
Combine them:
$$\Delta\mathbf{y} \approx Df(\mathbf{u})\,Dg(\mathbf{x})\,\Delta\mathbf{x}.$$
So the Jacobian of the composition is the matrix that takes $\Delta\mathbf{x}$ directly to $\Delta\mathbf{y}$. That matrix is the product $Df(\mathbf{u})\,Dg(\mathbf{x})$.
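A minimal numerical sketch of this pushforward picture, with $g$ and $f$ chosen only for illustration: the chain-rule product $Df(\mathbf{u})\,Dg(\mathbf{x})$ applied to a small $\Delta\mathbf{x}$ should match the actual change in $f(g(\mathbf{x}))$.

```python
import numpy as np

# Illustrative maps: g: R^2 -> R^3, f: R^3 -> R^2
def g(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def f(u):
    return np.array([u[0] + u[1], u[1] * u[2]])

def Dg(x):  # 3x2 Jacobian of g
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 0.0],
                     [0.0, 2 * x[1]]])

def Df(u):  # 2x3 Jacobian of f
    return np.array([[1.0, 1.0, 0.0],
                     [0.0, u[2], u[1]]])

x = np.array([0.5, -1.0])
dx = 1e-6 * np.array([1.0, 2.0])           # small input perturbation

u = g(x)
J = Df(u) @ Dg(x)                          # chain rule: (2x3)(3x2) = 2x2
dy_linear = J @ dx                         # pushforward prediction
dy_actual = f(g(x + dx)) - f(g(x))         # actual change

print(np.allclose(dy_linear, dy_actual, rtol=1e-4))  # True
```

The agreement holds up to the second-order error of the linearization, which shrinks like $\|\Delta\mathbf{x}\|^2$.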
You said you know gradients already. The key is to keep these distinct: the Jacobian is a matrix defined for any differentiable vector-valued function, while the gradient is a vector defined only for scalar-valued functions.
When $f:\mathbb{R}^m\to\mathbb{R}$, the Jacobian of $f$ is a 1×m row vector; the gradient is usually written as an m×1 column vector. They are transposes of each other (depending on convention):
$$Df(\mathbf{u}) = \nabla f(\mathbf{u})^\top.$$
This transpose issue matters a lot in backprop, so we’ll be explicit about it later.
If you treat multivariable derivatives as “a bunch of partial derivatives,” you can still compute things, but it’s easy to lose track of structure.
If you treat the derivative as “the best linear map near a point,” everything becomes systematic:
Let $f:\mathbb{R}^n\to\mathbb{R}^m$. We say $f$ is differentiable at $\mathbf{x}$ if there exists a linear map $L:\mathbb{R}^n\to\mathbb{R}^m$ such that
$$f(\mathbf{x}+\Delta\mathbf{x}) = f(\mathbf{x}) + L\,\Delta\mathbf{x} + o(\|\Delta\mathbf{x}\|),$$
with an error that becomes negligible compared to $\|\Delta\mathbf{x}\|$ as $\Delta\mathbf{x}\to\mathbf{0}$.
That linear map is the derivative $Df(\mathbf{x}) = L$.
Write $f = (f_1,\dots,f_m)$ with each $f_i:\mathbb{R}^n\to\mathbb{R}$.
Then the Jacobian is the m×n matrix:
$$[Df(\mathbf{x})]_{ij} = \frac{\partial f_i}{\partial x_j}(\mathbf{x}).$$
This is exactly the matrix that maps a small input perturbation to an approximate output perturbation:
$$\Delta\mathbf{y} \approx Df(\mathbf{x})\,\Delta\mathbf{x}.$$
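One way to make this matrix concrete is a finite-difference Jacobian: build the m×n matrix column by column by perturbing one input coordinate at a time. The map `g` below is an illustrative choice.

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Approximate the m x n Jacobian of func at x by central differences,
    perturbing one input coordinate at a time."""
    x = np.asarray(x, dtype=float)
    m, n = len(func(x)), len(x)
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (func(x + e) - func(x - e)) / (2 * eps)
    return J

# Illustrative map g: R^2 -> R^3
def g(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

x = np.array([0.5, -1.0])
J_exact = np.array([[x[1], x[0]],          # partials of x0*x1
                    [np.cos(x[0]), 0.0],   # partials of sin(x0)
                    [0.0, 2 * x[1]]])      # partials of x1^2
print(np.allclose(numerical_jacobian(g, x), J_exact, atol=1e-6))  # True
```

Each column j answers: "if I nudge input coordinate j, how do all m outputs move?" — exactly the $\Delta\mathbf{y} \approx Df(\mathbf{x})\,\Delta\mathbf{x}$ picture.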
Imagine an “interactive canvas” with three boxes:
1) x-box: x ∈ ℝⁿ
2) u-box: u = g(x) ∈ ℝᵐ
3) y-box: y = f(u) ∈ ℝᵖ
Now add two kinds of arrows: pushforward arrows (forward) and pullback arrows (backward).
You draw a little arrow $\Delta\mathbf{x}$ at the input; it is mapped to $\Delta\mathbf{u} \approx Dg(\mathbf{x})\,\Delta\mathbf{x}$ in the u-box, then to $\Delta\mathbf{y} \approx Df(\mathbf{u})\,\Delta\mathbf{u}$ in the y-box.
So perturbations flow with the function direction.
If the final output is scalar (p=1), you draw a gradient arrow at the output: $\nabla_{\mathbf{y}}\ell$ (often just 1 if the scalar is the loss itself).
Then gradients flow backwards through transposes:
$$\nabla_{\mathbf{u}}\ell = Df(\mathbf{u})^\top\,\nabla_{\mathbf{y}}\ell, \qquad \nabla_{\mathbf{x}}\ell = Dg(\mathbf{x})^\top\,\nabla_{\mathbf{u}}\ell.$$
This is the heart of backprop. In this lesson, we’re building the “transpose reflex”: forward passes multiply by Jacobians, backward passes multiply by their transposes.
There are two common conventions:
1) Jacobian is m×n (outputs by inputs). Gradients are column vectors.
2) Jacobian is n×m (inputs by outputs). Gradients are row vectors.
We’ll use the most common ML convention: Jacobians are outputs-by-inputs (m×n), and gradients are column vectors.
With that convention, the chain rule for scalar output becomes a clean transpose pullback (we’ll derive it soon).
The multivariable chain rule is not a new rule you memorize. It is a consequence of one fact:
If you approximate each function by a linear map near the relevant point, then approximating the composition means composing those linear maps.
And linear maps compose by matrix multiplication.
Let and .
Define:
Start with a small perturbation .
Step 1: Linearize g at x
Let , so
Step 2: Linearize f at u
Substitute :
But the left side is approximately
So we have the linear approximation:
By uniqueness of the best linear approximation, the derivative must be:
That’s the multivariable chain rule.
Sometimes you want the version that looks like “sum over paths.”
Let $\mathbf{y} = f(g(\mathbf{x}))$ with components:
$$y_i = f_i(u_1,\dots,u_m), \qquad u_k = g_k(x_1,\dots,x_n).$$
Then for each output component i and input component j:
$$\frac{\partial y_i}{\partial x_j} = \sum_{k=1}^{m} \frac{\partial y_i}{\partial u_k}\,\frac{\partial u_k}{\partial x_j}.$$
This is literally matrix multiplication in coordinates.
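To see that the summation form is just an entry of the matrix product, pick any shape-compatible Jacobians (the numbers below are arbitrary) and compare one entry computed both ways:

```python
import numpy as np

# Arbitrary Jacobians with compatible shapes (illustrative values)
Df = np.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])          # p x m = 2 x 3
Dg = np.array([[1.0, -1.0],
               [0.5, 2.0],
               [3.0, 0.0]])               # m x n = 3 x 2

# Entry (i, j) of the composition via "sum over intermediate coordinates"
i, j = 1, 0
sum_over_paths = sum(Df[i, k] * Dg[k, j] for k in range(3))

# Same entry via matrix multiplication
assert sum_over_paths == (Df @ Dg)[i, j]
print(sum_over_paths)  # 4*1 + 5*0.5 + 6*3 = 24.5
```

Each term of the sum is one "path" from input j through intermediate k to output i.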
Path-tracing interpretation:
Think of a small graph:
For our two-layer composition:
x →(g)→ u →(f)→ y
On an interactive canvas, you can show two overlays: forward arrows carrying perturbations, and backward arrows carrying gradients.
To get the total derivative x→y, multiply the local derivatives along each path and sum over paths:
$$\frac{\partial y_i}{\partial x_j} = \sum_{k} \frac{\partial y_i}{\partial u_k}\,\frac{\partial u_k}{\partial x_j}.$$
Pick a concrete point x₀.
This “two-way animation” is the visualization you want to internalize.
Now suppose $f:\mathbb{R}^m\to\mathbb{R}$ is scalar. Then $Df(\mathbf{u})$ is 1×m.
From the Jacobian chain rule:
$$D(f\circ g)(\mathbf{x}) = Df(\mathbf{u})\,Dg(\mathbf{x}).$$
This left side is a 1×n row vector. If we want the gradient as an n×1 column vector, transpose:
\begin{align*}
\nabla (f\circ g)(\mathbf{x})
&= D(f\circ g)(\mathbf{x})^\top \\
&= \bigl(Df(\mathbf{u})\,Dg(\mathbf{x})\bigr)^\top \\
&= Dg(\mathbf{x})^\top\,Df(\mathbf{u})^\top \\
&= Dg(\mathbf{x})^\top\,\nabla f(\mathbf{u}).
\end{align*}
That last line is the standard “gradient chain rule” used everywhere in ML.
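A quick check of this pullback formula with an illustrative $g$ and $f$: the column vector $Dg(\mathbf{x})^\top\,\nabla f(\mathbf{u})$ should match a finite-difference gradient of $f\circ g$.

```python
import numpy as np

# Illustrative scalar composition: g: R^2 -> R^2, f: R^2 -> R
def g(x):
    return np.array([x[0] + x[1], x[0] * x[1]])

def f(u):
    return u[0] * u[1]

def Dg(x):  # 2x2 Jacobian of g
    return np.array([[1.0, 1.0],
                     [x[1], x[0]]])

def grad_f(u):  # column gradient of f
    return np.array([u[1], u[0]])

x = np.array([2.0, 3.0])
grad_chain = Dg(x).T @ grad_f(g(x))   # pullback: Dg(x)^T grad f(u)

# Central-difference check of the gradient of f o g
eps = 1e-6
grad_fd = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
                    for e in np.eye(2)])

print(grad_chain)                                   # [21. 16.]
print(np.allclose(grad_chain, grad_fd, atol=1e-4))  # True
```

Note the transpose: `Dg(x).T` is what turns a gradient in u-space into a gradient in x-space.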
If you have $\mathbf{y} = f_L(f_{L-1}(\cdots f_1(\mathbf{x})\cdots))$, then:
$$D\mathbf{y}(\mathbf{x}) = Df_L\,Df_{L-1}\cdots Df_1.$$
Forward-mode perturbations multiply in the same order as the functions apply (inner to outer): $\Delta_1 = Df_1\,\Delta\mathbf{x}$, then $\Delta_2 = Df_2\,\Delta_1$, and so on.
Reverse-mode gradients multiply by transposes in the opposite order:
$$\nabla_{\mathbf{x}} = Df_1^\top\,Df_2^\top\cdots Df_L^\top\,\nabla_{\mathbf{y}}.$$
This is backprop in one line—just applied repeatedly.
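The one-line identity can be demonstrated with random layer Jacobians (shapes are illustrative): the forward-mode change $\Delta y$ and the reverse-mode gradient must satisfy $\Delta y = \nabla_{\mathbf{x}}\cdot\Delta\mathbf{x}$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random Jacobians of a 3-layer chain with shapes 4 -> 5 -> 3 -> 1
J1 = rng.standard_normal((5, 4))   # layer 1
J2 = rng.standard_normal((3, 5))   # layer 2
J3 = rng.standard_normal((1, 3))   # layer 3 (scalar output)

# Forward mode: push a perturbation through the layers, inner to outer
dx = rng.standard_normal(4)
dy = J3 @ (J2 @ (J1 @ dx))         # scalar change, shape (1,)

# Reverse mode: pull the output gradient back through the transposes
g_out = np.array([1.0])            # d(loss)/d(output)
grad_x = J1.T @ (J2.T @ (J3.T @ g_out))  # shape (4,)

# Both views agree: dy == grad_x . dx
print(np.allclose(dy, grad_x @ dx))  # True
```

Note that neither mode ever forms the full product matrix: forward mode only needs matrix-vector products, reverse mode only needs transpose-vector products.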
Neural networks are compositions of many vector-valued functions (layers): affine maps, elementwise nonlinearities, and so on.
Training requires $\nabla_{\theta}\ell$: gradients of a scalar loss with respect to all parameters. The only tool you need conceptually is the multivariable chain rule, but applied efficiently.
The hard part is not the calculus; it’s bookkeeping: tracking shapes, evaluation points, and which variables depend on which.
Let’s build a tiny “network” with one hidden layer and a scalar loss. Define: input $\mathbf{x}\in\mathbb{R}^2$, weight matrix $W\in\mathbb{R}^{3\times 2}$, bias $\mathbf{b}\in\mathbb{R}^3$, output weights $\mathbf{c}\in\mathbb{R}^3$, and an elementwise nonlinearity $\sigma$.
Forward computation:
1) a = Wx + b (so a ∈ ℝ³)
2) h = σ(a) elementwise (so h ∈ ℝ³)
3) ℓ = cᵀh (so ℓ ∈ ℝ)
This is a composition:
x →(affine)→ a →(nonlinearity)→ h →(dot)→ ℓ
On an interactive canvas, you can attach:
| Edge | Local derivative | Shape |
|---|---|---|
| x→a | W | 3×2 |
| a→h | Diag(σ′(a)) | 3×3 |
| h→ℓ | cᵀ | 1×3 |
A perturbation $\Delta\mathbf{x}$ produces:
\begin{align*}
\Delta \mathbf{a} &\approx W\,\Delta \mathbf{x} \\
\Delta \mathbf{h} &\approx \operatorname{Diag}(\sigma'(\mathbf{a}))\,\Delta \mathbf{a} \\
\Delta \ell &\approx \mathbf{c}^\top\,\Delta \mathbf{h}.
\end{align*}
Combine:
$$\Delta \ell \approx \mathbf{c}^\top\,\operatorname{Diag}(\sigma'(\mathbf{a}))\,W\,\Delta\mathbf{x}.$$
So the total Jacobian (1×2 row vector) is:
$$D\ell(\mathbf{x}) = \mathbf{c}^\top\,\operatorname{Diag}(\sigma'(\mathbf{a}))\,W.$$
Because ℓ is scalar, we typically want the gradient $\nabla_{\mathbf{x}}\ell$ as a 2×1 column vector.
Start with $\nabla_{\mathbf{h}}\ell = \mathbf{c}$, then pull back through the transposes:
$$\nabla_{\mathbf{a}}\ell = \operatorname{Diag}(\sigma'(\mathbf{a}))^\top\,\mathbf{c} = \operatorname{Diag}(\sigma'(\mathbf{a}))\,\mathbf{c}, \qquad \nabla_{\mathbf{x}}\ell = W^\top\,\nabla_{\mathbf{a}}\ell.$$
Compare with the forward-mode Jacobian expression above:
They are transposes, consistent with $(AB)^\top = B^\top A^\top$.
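The tiny network can be checked numerically. The particular values of W, b, c, and x below are illustrative, and σ is taken to be the logistic sigmoid (an assumption; the text leaves σ generic).

```python
import numpy as np

# Tiny network: x in R^2, a = W x + b, h = sigma(a), l = c . h
# (W, b, c, x values chosen only for illustration)
W = np.array([[1.0, -1.0],
              [0.5, 2.0],
              [0.0, 1.0]])          # 3x2
b = np.array([0.1, -0.2, 0.3])
c = np.array([1.0, 2.0, -1.0])

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))  # assumed nonlinearity

x = np.array([0.7, -0.3])

# Forward pass
a = W @ x + b
h = sigmoid(a)
l = c @ h

s = sigmoid(a) * (1 - sigmoid(a))   # sigma'(a), elementwise

# Forward mode: row vector c^T Diag(sigma'(a)) W
Dl = c @ np.diag(s) @ W             # shape (2,)

# Reverse mode: pull c back through the transposes
grad_h = c
grad_a = np.diag(s).T @ grad_h      # Diag is symmetric: same as s * c
grad_x = W.T @ grad_a               # shape (2,)

print(np.allclose(Dl, grad_x))      # True: same numbers, dual derivations
```

In practice nobody materializes `np.diag(s)`; the elementwise product `s * grad_h` gives the same result without a 3×3 matrix. That is exactly the "don't form huge Jacobians" point made below.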
To address visualization explicitly, here’s the picture you should rehearse:
1) Pick a point x₀.
2) Draw a tiny arrow at x.
3) Multiply by local Jacobians to watch the arrow morph: $\Delta\mathbf{a} = W\,\Delta\mathbf{x}$, then $\Delta\mathbf{h} = \operatorname{Diag}(\sigma'(\mathbf{a}))\,\Delta\mathbf{a}$, then $\Delta\ell = \mathbf{c}^\top\,\Delta\mathbf{h}$.
Now reverse:
1) Draw a gradient arrow at h: $\nabla_{\mathbf{h}}\ell = \mathbf{c}$ points in the direction that increases ℓ fastest in h-space.
2) Pull it back to a using the transpose of the local Jacobian: $\operatorname{Diag}(\sigma'(\mathbf{a}))^\top$ (symmetric here, so the transpose doesn’t change it).
3) Pull it back to x using $W^\top$.
This is not two unrelated processes. It’s the same linear maps viewed from two dual perspectives: perturbations pushed forward, gradients pulled back.
You don’t need the formal differential-geometry language to use it correctly, but you do need the operational rule:
If forward uses $J$, backward uses $J^\top$.
Backpropagation is essentially the repeated application of:
$$\nabla_{\text{input}} = J^\top\,\nabla_{\text{output}},$$
where $J$ is the Jacobian of a local block in the computational graph.
Once you’re comfortable multiplying Jacobians (forward) and multiplying by transposes (backward), you’re ready to study backprop as an algorithmic optimization: reuse intermediate results so you don’t form huge Jacobian matrices explicitly.
Let $g:\mathbb{R}^2\to\mathbb{R}^2$ and $f:\mathbb{R}^2\to\mathbb{R}$ be
$$g(\mathbf{x}) = \begin{bmatrix} x_1^2 + x_2 \\ x_1 x_2 \end{bmatrix}, \qquad f(\mathbf{u}) = u_1 + u_2^2.$$
Define $h = f\circ g$. Compute $Dh(\mathbf{x})$ and $\nabla h(\mathbf{x})$ at a general point, then evaluate at $\mathbf{x}_0 = (1,2)$.
Step 1: Compute $Dg(\mathbf{x})$ (2×2).
We have $g_1 = x_1^2 + x_2$ and $g_2 = x_1x_2$.
Thus
$$Dg(\mathbf{x}) = \begin{bmatrix} 2x_1 & 1 \\ x_2 & x_1 \end{bmatrix}.$$
Step 2: Compute $Df(\mathbf{u})$ (1×2).
$f(\mathbf{u}) = u_1 + u_2^2$, so
$$Df(\mathbf{u}) = \begin{bmatrix} 1 & 2u_2 \end{bmatrix}.$$
Step 3: Apply the Jacobian chain rule.
$$Dh(\mathbf{x}) = Df(g(\mathbf{x}))\,Dg(\mathbf{x}).$$
Substitute $u_2 = x_1x_2$:
$$Df(g(\mathbf{x})) = \begin{bmatrix} 1 & 2x_1x_2 \end{bmatrix}.$$
Now multiply:
\begin{align*}
Dh(\mathbf{x})
&= \begin{bmatrix} 1 & 2x_1x_2 \end{bmatrix}
\begin{bmatrix} 2x_1 & 1 \\ x_2 & x_1 \end{bmatrix} \\
&= \begin{bmatrix}
1\cdot 2x_1 + (2x_1x_2)\cdot x_2 \; , \; 1\cdot 1 + (2x_1x_2)\cdot x_1
\end{bmatrix} \\
&= \begin{bmatrix}
2x_1 + 2x_1x_2^2 \; , \; 1 + 2x_1^2x_2
\end{bmatrix}.
\end{align*}
Step 4: Convert to gradient (column vector).
Since $h$ is scalar, $Dh(\mathbf{x})$ is 1×2 and
$$\nabla h(\mathbf{x}) = Dh(\mathbf{x})^\top = \begin{bmatrix} 2x_1 + 2x_1x_2^2 \\ 1 + 2x_1^2x_2 \end{bmatrix}.$$
Step 5: Evaluate at (1,2).
$$\nabla h(1,2) = \begin{bmatrix} 2(1) + 2(1)(2)^2 \\ 1 + 2(1)^2(2) \end{bmatrix} = \begin{bmatrix} 10 \\ 5 \end{bmatrix}.$$
Insight: The computation stayed organized because we never mixed ‘partial derivative rules’ randomly. We computed two local Jacobians with clear shapes (2×2 and 1×2), multiplied them in the only shape-consistent order, then transposed to get the gradient vector.
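As a numeric cross-check, a central-difference gradient of the composition $h(\mathbf{x}) = (x_1^2 + x_2) + (x_1x_2)^2$, written out directly, reproduces the answer (10, 5) at (1, 2):

```python
import numpy as np

# h = f o g written out: h(x) = (x1^2 + x2) + (x1*x2)^2
h = lambda x: (x[0] ** 2 + x[1]) + (x[0] * x[1]) ** 2

x0 = np.array([1.0, 2.0])
eps = 1e-6
# Central differences, one coordinate at a time
grad_fd = np.array([(h(x0 + eps * e) - h(x0 - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(np.round(grad_fd, 4))  # [10.  5.]
```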
Let x ∈ ℝ². Define
1) u = g(x) where $g(\mathbf{x}) = \begin{bmatrix} x_1 + 2x_2 \\ x_1 - x_2 \end{bmatrix}$ (linear)
2) scalar output $\ell = f(\mathbf{u})$ where $f(\mathbf{u}) = u_1^2 + e^{u_2}$
At x₀ = (1,1), compute: the forward values, the pushforward of the perturbation $\Delta\mathbf{x} = (0.01, 0.02)$, and the pullback gradient $\nabla_{\mathbf{x}}\ell$.
Step 1: Compute the forward values at x₀.
At (1,1): $\mathbf{u}_0 = g(1,1) = (3, 0)$ and $\ell_0 = f(3,0) = 9 + e^0 = 10$.
Step 2: Pushforward (perturbations use Jacobians).
Compute $Dg(\mathbf{x})$ (2×2). Since g is linear,
$$Dg(\mathbf{x}) = \begin{bmatrix} 1 & 2 \\ 1 & -1 \end{bmatrix} \text{ at every point.}$$
Compute $Df(\mathbf{u})$ (1×2):
$$Df(\mathbf{u}) = \begin{bmatrix} 2u_1 & e^{u_2} \end{bmatrix}.$$
At u = (3,0): $Df(\mathbf{u}_0) = \begin{bmatrix} 6 & 1 \end{bmatrix}$.
Now push forward:
$$\Delta\mathbf{u} = Dg\,\Delta\mathbf{x} = \begin{bmatrix} 0.01 + 0.04 \\ 0.01 - 0.02 \end{bmatrix} = \begin{bmatrix} 0.05 \\ -0.01 \end{bmatrix}.$$
Then
$$\Delta\ell \approx Df(\mathbf{u}_0)\,\Delta\mathbf{u} = 6(0.05) + 1(-0.01) = 0.29.$$
Step 3: Pullback (gradients use transposes).
Because $\ell$ is scalar, start with $\nabla f(\mathbf{u}) = (2u_1,\; e^{u_2})^\top$:
At u=(3,0): $\nabla f(\mathbf{u}_0) = (6, 1)^\top$.
Now pull back through g using $Dg^\top$:
$$\nabla_{\mathbf{x}}\ell = Dg^\top\,\nabla f(\mathbf{u}_0) = \begin{bmatrix} 1 & 1 \\ 2 & -1 \end{bmatrix}\begin{bmatrix} 6 \\ 1 \end{bmatrix} = \begin{bmatrix} 7 \\ 11 \end{bmatrix}.$$
Step 4: Consistency check via dot product.
The linear prediction should satisfy
$$\Delta\ell \approx \nabla_{\mathbf{x}}\ell \cdot \Delta\mathbf{x}.$$
Compute: $7(0.01) + 11(0.02) = 0.07 + 0.22 = 0.29$,
matching the pushforward result.
Insight: This example makes the duality visible: forward-mode computes the effect of a small move ; reverse-mode computes the gradient that, when dotted with , predicts the same . The transpose is exactly what makes those two views consistent.
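The same duality can be replayed in code. The maps below are an illustrative concrete choice consistent with this example's shapes: a linear g with g(1,1) = (3, 0) and a smooth scalar f.

```python
import numpy as np

# Linear g(x) = A x with g(1,1) = (3, 0); f is an illustrative scalar map
A = np.array([[1.0, 2.0],
              [1.0, -1.0]])            # Dg = A at every point
f = lambda u: u[0] ** 2 + np.exp(u[1])
grad_f = lambda u: np.array([2 * u[0], np.exp(u[1])])

x0 = np.array([1.0, 1.0])
u0 = A @ x0                            # (3, 0)

dx = np.array([0.01, 0.02])
# Forward: push the perturbation through Dg, then contract with grad f
du = A @ dx
dl_forward = grad_f(u0) @ du

# Reverse: pull grad f back through Dg^T, then dot with dx
grad_x = A.T @ grad_f(u0)
dl_reverse = grad_x @ dx

print(np.allclose(dl_forward, dl_reverse))  # True
```

Both numbers are the same bilinear quantity $\nabla f(\mathbf{u}_0)^\top Dg\,\Delta\mathbf{x}$, just evaluated with different parenthesizations.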
Let $g:\mathbb{R}^3\to\mathbb{R}^2$ and $f:\mathbb{R}^2\to\mathbb{R}^2$ be
$$u_1 = x_1x_2, \quad u_2 = x_3^2, \qquad y_2 = u_1u_2$$
(the first output component $y_1$ will not matter here).
Compute $\partial y_2/\partial x_3$ at a general point using path-tracing.
Step 1: Identify the dependency paths.
We want $\partial y_2/\partial x_3$.
Since $u_1 = x_1x_2$ does not depend on $x_3$, only the path through $u_2$ matters for $\partial y_2/\partial x_3$.
Step 2: Apply the coordinate chain rule.
Use
$$\frac{\partial y_2}{\partial x_3} = \sum_{k=1}^{2} \frac{\partial y_2}{\partial u_k}\,\frac{\partial u_k}{\partial x_3}.$$
Compute each factor:
$$\frac{\partial y_2}{\partial u_1} = u_2, \quad \frac{\partial y_2}{\partial u_2} = u_1, \quad \frac{\partial u_1}{\partial x_3} = 0, \quad \frac{\partial u_2}{\partial x_3} = 2x_3.$$
Step 3: Sum over k.
\begin{align*}
\frac{\partial y_2}{\partial x_3}
&= \left(\frac{\partial y_2}{\partial u_1}\right)\left(\frac{\partial u_1}{\partial x_3}\right) + \left(\frac{\partial y_2}{\partial u_2}\right)\left(\frac{\partial u_2}{\partial x_3}\right) \\
&= (u_2)(0) + (u_1)(2x_3) \\
&= 2x_3\,u_1 \\
&= 2x_3\,(x_1x_2).
\end{align*}
Insight: The summation form is ‘matrix multiplication with indices.’ It forces you to enumerate intermediate coordinates and add their contributions—exactly like summing over all paths in a computational graph.
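The path-tracing result can be checked at a sample point, following the partials computed above ($y_2 = u_1u_2$, $u_1 = x_1x_2$, $u_2 = x_3^2$):

```python
import numpy as np

# Sample point (illustrative values)
x1, x2, x3 = 2.0, 3.0, 0.5
u1, u2 = x1 * x2, x3 ** 2

# Local partials along each path u_k -> y2 and x3 -> u_k
dy2_du1, dy2_du2 = u2, u1
du1_dx3, du2_dx3 = 0.0, 2 * x3

# Sum over intermediate coordinates (sum over paths)
dy2_dx3 = dy2_du1 * du1_dx3 + dy2_du2 * du2_dx3

# Closed form derived in the text: 2*x3*(x1*x2)
print(dy2_dx3 == 2 * x3 * (x1 * x2))  # True
```

The u₁ path contributes zero, exactly matching the dependency analysis in Step 1.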
A multivariable derivative is best understood as a linear map; its matrix representation is the Jacobian.
For $g:\mathbb{R}^n\to\mathbb{R}^m$ and $f:\mathbb{R}^m\to\mathbb{R}^p$, the chain rule is $$D(f\circ g)(\mathbf{x}) = Df(g(\mathbf{x}))\,Dg(\mathbf{x}).$$
Shape-tracking is a correctness tool: $Df(\mathbf{u})\,Dg(\mathbf{x})$, with shapes (p×m)(m×n), is the only order that composes.
Perturbations push forward: $\Delta\mathbf{y} \approx Df(\mathbf{u})\,Dg(\mathbf{x})\,\Delta\mathbf{x}$.
Gradients pull back: for scalar output, $\nabla(f\circ g)(\mathbf{x}) = Dg(\mathbf{x})^\top\,\nabla f(\mathbf{u})$.
The coordinate form $$\frac{\partial y_i}{\partial x_j} = \sum_{k} \frac{\partial y_i}{\partial u_k}\,\frac{\partial u_k}{\partial x_j}$$ is the same rule as matrix multiplication, interpreted as ‘sum over intermediate coordinates.’
Backpropagation is repeated application of the pullback rule (multiply by local Jacobian transposes) along a computational graph.
Multiplying Jacobians in the wrong order (forgetting that composition order reverses: outer derivative on the left).
Mixing up Jacobians (matrices) and gradients (vectors), especially the row-vs-column convention; forgetting the transpose when converting the 1×n Jacobian $Df$ to the n×1 gradient $\nabla f$.
Trying to build the full Jacobian for a large network when you only need Jacobian-vector products (forward mode) or Jacobianᵀ-vector products (reverse mode).
Losing track of which variables each function actually depends on; missing or adding dependency paths in the summation form.
Let $g:\mathbb{R}^2\to\mathbb{R}^3$ be $g(\mathbf{x}) = (x_1,\; x_2,\; x_1x_2)$ and let $f:\mathbb{R}^3\to\mathbb{R}$ be $f(\mathbf{u}) = u_1^2 + u_2^2 + u_3^2$. Compute $\nabla(f\circ g)(\mathbf{x})$.
Hint: Compute $Dg(\mathbf{x})$ (3×2) and $\nabla f(\mathbf{u})$ (3×1). Then use $\nabla(f\circ g)(\mathbf{x}) = Dg(\mathbf{x})^\top\,\nabla f(\mathbf{u})$.
We have $\mathbf{u} = g(\mathbf{x}) = (x_1,\; x_2,\; x_1x_2)$.
Compute
$$Dg(\mathbf{x}) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ x_2 & x_1 \end{bmatrix}, \qquad \nabla f(\mathbf{u}) = \begin{bmatrix} 2u_1 \\ 2u_2 \\ 2u_3 \end{bmatrix} = \begin{bmatrix} 2x_1 \\ 2x_2 \\ 2x_1x_2 \end{bmatrix}.$$
Then
$$\nabla(f\circ g)(\mathbf{x}) = Dg(\mathbf{x})^\top\,\nabla f(\mathbf{u}) = \begin{bmatrix} 2x_1 + 2x_1x_2^2 \\ 2x_2 + 2x_1^2x_2 \end{bmatrix}.$$
Suppose $g:\mathbb{R}^n\to\mathbb{R}^m$ and $f:\mathbb{R}^m\to\mathbb{R}^p$. You are told $Dg(\mathbf{x})$ is m×n and $Df(\mathbf{u})$ is p×m. What is the shape of $D(f\circ g)(\mathbf{x})$? Also, if $p = 1$, what is the shape of $\nabla(f\circ g)(\mathbf{x})$ (as a column vector)?
Hint: Use matrix multiplication shapes. For $p = 1$, the Jacobian is 1×n and the gradient is its transpose.
$D(f\circ g)(\mathbf{x}) = Df(\mathbf{u})\,Dg(\mathbf{x})$ has shape (p×m)(m×n) = p×n.
If $p = 1$, then $D(f\circ g)(\mathbf{x})$ is 1×n, so the gradient (column) has shape n×1.
Let u = g(x) with $g:\mathbb{R}^2\to\mathbb{R}^2$ given by $u_1 = x_1^2$, $u_2 = \sin x_2$. Let $\ell = f(\mathbf{u})$ scalar with $f(\mathbf{u}) = u_1u_2$. Compute $\nabla_{\mathbf{x}}\ell$.
Hint: Compute $\nabla f(\mathbf{u})$ first, then multiply by $Dg(\mathbf{x})^\top$. Remember $Dg(\mathbf{x})$ is 2×2 and diagonal here.
First, $\nabla f(\mathbf{u}) = (u_2,\; u_1)^\top$, so
$$\nabla f(\mathbf{u}) = \begin{bmatrix} \sin x_2 \\ x_1^2 \end{bmatrix}.$$
Next, $Dg(\mathbf{x})$:
$$Dg(\mathbf{x}) = \begin{bmatrix} 2x_1 & 0 \\ 0 & \cos x_2 \end{bmatrix} = Dg(\mathbf{x})^\top.$$
Thus
$$\nabla_{\mathbf{x}}\ell = Dg(\mathbf{x})^\top\,\nabla f(\mathbf{u}) = \begin{bmatrix} 2x_1\sin x_2 \\ x_1^2\cos x_2 \end{bmatrix}.$$
So $\nabla_{\mathbf{x}}\ell$ matches what you get by differentiating $\ell = x_1^2\sin x_2$ directly.
Next up: apply this rule repeatedly and efficiently in neural networks.