Computing gradients efficiently via chain rule through network.
Quick unlock - significant prerequisite investment but simple final step. Verify prerequisites first.
Training a neural network is mostly one question repeated millions of times: “If I nudge this weight a tiny bit, how does the loss change?” Backpropagation is the algorithmic answer—an efficient way to compute all those derivatives without redoing work for every parameter.
Backpropagation computes ∂L/∂(parameters) efficiently by (1) viewing the network as a computation graph, (2) running a forward pass to cache intermediate values, then (3) running a backward pass that propagates sensitivities δⱼ := ∂L/∂zⱼ from outputs to inputs using node-local partial derivatives, summing gradient contributions where paths merge.
A neural network learns by adjusting parameters (weights and biases) to reduce a loss L. Gradient-based optimization methods (like SGD, Momentum, Adam) need derivatives such as ∂L/∂W and ∂L/∂b.
If you tried to compute these derivatives “from scratch” for each parameter independently, you would do a huge amount of repeated work. A modern network can have millions or billions of parameters, and the forward computation reuses intermediate activations across many parameters. We want the same kind of reuse when differentiating.
Backpropagation is exactly that: a systematic reuse of intermediate derivative calculations.
Any forward computation can be represented as a directed acyclic graph (DAG) where nodes are variables or operations and edges indicate which values feed into which computations.
Example (conceptually): inputs x feed an affine node z = Wx + b, which feeds an activation a = φ(z), which feeds the loss L.
Even if you write this as “layers,” it’s still a composition of functions, which is what the chain rule is about.
A central symbol in backprop is the sensitivity of the loss to a node’s pre-activation (or any intermediate variable). For a scalar loss L and a node value zⱼ, define:
δⱼ := ∂L/∂zⱼ
Think of δ as answering: “If zⱼ changes a tiny bit, how much does the loss change?”
Backpropagation is built from two atomic concepts:
1) Node-local backward rule
A node converts an incoming loss gradient (upstream gradient) into gradients for its inputs using only its local partial derivatives.
If a node computes:
z = f(x)
and you know the upstream gradient ∂L/∂z, then the chain rule gives:
∂L/∂x = ∂L/∂z · ∂z/∂x
This is “local”: it only needs the node’s own derivative ∂z/∂x and the incoming gradient.
2) Reverse traversal + summation at merges
You apply these node-local rules in reverse topological order (from outputs back to inputs). If multiple forward paths feed into the same variable, their gradient contributions add.
This “add at merges” is not a special trick—it is exactly the multivariable chain rule.
Suppose the forward pass computes L at cost C. Backprop computes all parameter gradients in about O(C) additional time (often a small constant multiple).
In contrast, computing a gradient for each parameter via finite differences would take O(P·C) where P is the number of parameters. That’s infeasible.
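To make the contrast concrete, here is a minimal finite-difference gradient routine (`finite_diff_grad` is a hypothetical helper, Python/NumPy): it calls the forward function twice per parameter, so its total cost scales as O(P·C).

```python
import numpy as np

def finite_diff_grad(f, params, eps=1e-6):
    """Estimate dL/d(params) with central differences.

    Calls f twice per parameter: O(P * C) total cost, where C is
    the cost of one forward pass.
    """
    grad = np.zeros_like(params)
    for i in range(params.size):
        p_plus = params.copy();  p_plus.flat[i] += eps
        p_minus = params.copy(); p_minus.flat[i] -= eps
        grad.flat[i] = (f(p_plus) - f(p_minus)) / (2 * eps)
    return grad

# Example: L(w) = ||w||^2 has gradient 2w.
w = np.array([1.0, -2.0, 3.0])
g = finite_diff_grad(lambda p: float(np.sum(p ** 2)), w)
```

Routines like this are far too slow for training, but they remain the standard way to unit-test a hand-written backward pass.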
The big idea to keep in your head: Forward pass computes values. Backward pass computes sensitivities, reusing cached values and local derivatives.
A neural network is built from small operations: add, multiply, matrix multiply, activation functions, normalization, etc. If we can differentiate each operation locally, we can differentiate the whole network by composition.
This makes backprop feel like message passing: each node receives a gradient message from the nodes it feeds, converts it using its local partial derivatives, and passes gradient messages back to its inputs.
Let a node compute:
z = f(x)
Loss is L(z). By the chain rule:
∂L/∂x = ∂L/∂z · ∂z/∂x
Written as a backward rule: take the upstream gradient ∂L/∂z and multiply by the local derivative f′(x) to get ∂L/∂x.
This is the simplest “local rule.”
Most nodes have multiple inputs. Suppose:
z = f(x, y)
Then:
∂L/∂x = ∂L/∂z · ∂z/∂x
∂L/∂y = ∂L/∂z · ∂z/∂y
Same upstream gradient ∂L/∂z fans out to each input, multiplied by the appropriate local partial.
Now suppose a variable x influences L through two different children nodes, z₁ and z₂:
z₁ = f₁(x)
z₂ = f₂(x)
L = g(z₁, z₂)
Then x affects L through both routes. Multivariable chain rule gives:
∂L/∂x = (∂L/∂z₁)(∂z₁/∂x) + (∂L/∂z₂)(∂z₂/∂x)
This is the “sum contributions at merges” rule.
Let: u = x·y, v = u + y, L = v².
Forward: compute u, then v, then L (caching u and v).
Backward (local rules):
1) L = v² ⇒ ∂L/∂v = 2v
2) v = u + y ⇒ ∂v/∂u = 1, ∂v/∂y = 1
So ∂L/∂u = ∂L/∂v · 1
And contribution to ∂L/∂y from this path is ∂L/∂v · 1
3) u = x · y ⇒ ∂u/∂x = y, ∂u/∂y = x
So ∂L/∂x = ∂L/∂u · y
And contribution to ∂L/∂y from this path is ∂L/∂u · x
Finally, add the two contributions to ∂L/∂y:
∂L/∂y = (from v) + (from u)
= (∂L/∂v · 1) + (∂L/∂u · x)
Notice how “sum at merge” naturally appears because y is used in two places.
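The whole example fits in a few lines of plain Python (`backprop_xy` is a hypothetical helper name); each backward line mirrors one node-local rule above.

```python
# Manual backprop for u = x*y, v = u + y, L = v**2.
def backprop_xy(x, y):
    # Forward pass: cache intermediates.
    u = x * y
    v = u + y
    L = v ** 2
    # Backward pass: node-local rules in reverse order.
    dL_dv = 2 * v                         # L = v^2
    dL_du = dL_dv * 1.0                   # v = u + y
    dL_dy_from_v = dL_dv * 1.0
    dL_dx = dL_du * y                     # u = x*y
    dL_dy_from_u = dL_du * x
    dL_dy = dL_dy_from_v + dL_dy_from_u   # sum at the merge
    return L, dL_dx, dL_dy

L, gx, gy = backprop_xy(2.0, 3.0)   # u=6, v=9, L=81
```

With x = 2, y = 3: ∂L/∂x = 2v·y = 54 and ∂L/∂y = 2v(1 + x) = 54, matching the formulas above.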
Neural nets use vector-valued activations and matrix multiplies. The same local-rule logic holds, but you must respect shapes.
A very common pattern:
z = Wx + b
Where W ∈ ℝ^{m×n}, x ∈ ℝⁿ, and b ∈ ℝᵐ, so z ∈ ℝᵐ.
Given upstream gradient g = ∂L/∂z (same shape as z), the local backward rules are:
∂L/∂b = g
∂L/∂W = g xᵀ
∂L/∂x = Wᵀ g
These formulas are not “memorize-only.” They come from componentwise chain rule. For example, for W:
Let zᵢ = ∑ⱼ Wᵢⱼ xⱼ + bᵢ
Then:
∂L/∂Wᵢⱼ = ∑ₖ (∂L/∂zₖ)(∂zₖ/∂Wᵢⱼ)
But zₖ depends on Wᵢⱼ only when k = i, so:
∂zₖ/∂Wᵢⱼ = xⱼ if k = i, else 0
So:
∂L/∂Wᵢⱼ = (∂L/∂zᵢ) xⱼ
In matrix form that is g xᵀ.
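A quick NumPy sketch of the three affine backward rules, checked against a finite difference (the loss L = g·z is chosen so that ∂L/∂z = g by construction):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
b = rng.standard_normal(3)
g = rng.standard_normal(3)        # upstream gradient dL/dz

# Backward rules for z = W @ x + b:
dW = np.outer(g, x)               # dL/dW = g x^T
db = g.copy()                     # dL/db = g
dx = W.T @ g                      # dL/dx = W^T g

# Finite-difference check on one entry of W, using L(W) = g . (W x + b).
eps = 1e-6
Wp = W.copy(); Wp[1, 2] += eps
Wm = W.copy(); Wm[1, 2] -= eps
num = (g @ (Wp @ x + b) - g @ (Wm @ x + b)) / (2 * eps)
```

The check confirms the componentwise derivation: ∂L/∂W₁₂ = g₁x₂, exactly the (1,2) entry of gxᵀ.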
Node-local derivatives typically depend on forward values (like x for σ′(x), or x for gxᵀ).
So the forward pass doesn’t just compute outputs; it stores the needed intermediates (often called “activations” and sometimes “pre-activations”). Backprop then uses them.
A useful mental rule: if a node's backward formula needs a forward value, that value must be cached during the forward pass (or recomputed later).
This is why training uses more memory than inference.
| Forward op | Forward definition | Upstream gradient | Backward outputs |
|---|---|---|---|
| Add | z = x + y | ∂L/∂z | ∂L/∂x = ∂L/∂z, ∂L/∂y = ∂L/∂z |
| Multiply | z = x·y | ∂L/∂z | ∂L/∂x = (∂L/∂z)·y, ∂L/∂y = (∂L/∂z)·x |
| Affine | z = Wx + b | g = ∂L/∂z | ∂L/∂W = gxᵀ, ∂L/∂b = g, ∂L/∂x = Wᵀg |
| ReLU | a = max(0, z) | ∂L/∂a | ∂L/∂z = (∂L/∂a) ⊙ 𝟙[z>0] |
| Sigmoid | a = σ(z) | ∂L/∂a | ∂L/∂z = (∂L/∂a) ⊙ a ⊙ (1−a) |
| Tanh | a = tanh(z) | ∂L/∂a | ∂L/∂z = (∂L/∂a) ⊙ (1−a²) |
(⊙ is elementwise product.)
These are “node-local rules”: upstream gradient in, multiply by local derivative, produce gradients out.
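The activation rows of the table translate directly into elementwise NumPy functions (hypothetical helper names; note sigmoid and tanh use the cached activation a, while ReLU uses the cached pre-activation z):

```python
import numpy as np

def relu_backward(upstream, z):
    # dL/dz = dL/da ⊙ 1[z > 0]
    return upstream * (z > 0)

def sigmoid_backward(upstream, a):
    # a = sigmoid(z) was cached in forward; sigma'(z) = a(1 - a)
    return upstream * a * (1 - a)

def tanh_backward(upstream, a):
    # a = tanh(z) was cached in forward; tanh'(z) = 1 - a^2
    return upstream * (1 - a ** 2)

z = np.array([-1.0, 0.5, 2.0])
up = np.ones(3)                       # pretend upstream gradient
g_relu = relu_backward(up, z)
a_sig = 1 / (1 + np.exp(-z))
g_sig = sigmoid_backward(up, a_sig)
```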
In the forward pass, you compute prerequisites first: inputs → intermediate values → loss.
In the backward pass, you need ∂L/∂(something). But ∂L/∂(something) depends on how that “something” affects later computations. So you must start at L and propagate backward: loss → outputs → earlier intermediates → parameters.
This is exactly reverse topological order on the computation graph.
Given a computation graph that produces scalar loss L:
1) Forward pass: evaluate nodes in topological order, caching the intermediate values the backward rules will need.
2) Initialize the backward pass: set ∂L/∂L = 1 at the loss node.
3) Reverse traversal: visit nodes in reverse topological order; at each node, multiply the upstream gradient by the node's local partials and accumulate the results into its inputs (summing where paths merge).
4) Read off parameter gradients: the accumulated values at the parameter nodes are ∂L/∂W, ∂L/∂b, etc.
Suppose x feeds into two children nodes z₁ and z₂, and both influence L.
Let:
z₁ = f₁(x)
z₂ = f₂(x)
L = g(z₁, z₂)
We can derive ∂L/∂x explicitly:
∂L/∂x = (∂L/∂z₁)(∂z₁/∂x) + (∂L/∂z₂)(∂z₂/∂x)
Steps:
∂L/∂x = ∂g/∂z₁ · ∂z₁/∂x + ∂g/∂z₂ · ∂z₂/∂x
Recognize ∂g/∂z₁ = ∂L/∂z₁ and ∂g/∂z₂ = ∂L/∂z₂.
So if x has multiple outgoing edges in the forward graph, it has multiple incoming gradient contributions in the backward graph, which must be summed.
In neural networks, it’s common to distinguish the pre-activation zˡ from the activation aˡ = φ(zˡ).
For layer ℓ:
zˡ = Wˡ aˡ⁻¹ + bˡ
aˡ = φ(zˡ)
Define the layer sensitivity:
δˡ := ∂L/∂zˡ
This is exactly your node-local symbol δⱼ := ∂L/∂zⱼ, but grouped by layer and vectorized.
Now derive the standard backprop identities.
Step 1 (activation backward). We have aˡ = φ(zˡ). So elementwise:
∂L/∂zˡ = ∂L/∂aˡ ⊙ φ′(zˡ)
So:
δˡ = (∂L/∂aˡ) ⊙ φ′(zˡ)
Step 2 (affine backward). zˡ = Wˡ aˡ⁻¹ + bˡ.
Given δˡ = ∂L/∂zˡ, we get:
∂L/∂Wˡ = δˡ (aˡ⁻¹)ᵀ
∂L/∂bˡ = δˡ
∂L/∂aˡ⁻¹ = (Wˡ)ᵀ δˡ
Now plug Step 2 into Step 1 for the previous layer:
δˡ⁻¹ = (∂L/∂aˡ⁻¹) ⊙ φ′(zˡ⁻¹)
= ((Wˡ)ᵀ δˡ) ⊙ φ′(zˡ⁻¹)
That is the classic recurrence.
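Here is a minimal sketch of the recurrence for a stack of affine + tanh layers (`forward`/`backward` are hypothetical helper names), verified against a finite difference on one weight:

```python
import numpy as np

def forward(Ws, bs, x):
    """Affine + tanh layers; cache (a_prev, z, a) per layer for backprop."""
    a, cache = x, []
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a_prev, a = a, np.tanh(z)
        cache.append((a_prev, z, a))
    return a, cache

def backward(Ws, cache, dL_da):
    """Propagate delta^l = dL/dz^l backward; return per-layer (dW, db)."""
    grads = []
    for W, (a_prev, z, a) in zip(reversed(Ws), reversed(cache)):
        delta = dL_da * (1 - a ** 2)       # delta^l = dL/da^l ⊙ tanh'(z^l)
        grads.append((np.outer(delta, a_prev), delta))
        dL_da = W.T @ delta                # dL/da^{l-1} = (W^l)^T delta^l
    return grads[::-1]

# Verify dL/dW1[1,0] against a finite difference, with L = 0.5 ||a_out - y||^2.
rng = np.random.default_rng(1)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((2, 3))]
bs = [rng.standard_normal(3), rng.standard_normal(2)]
x, y = rng.standard_normal(2), rng.standard_normal(2)

a_out, cache = forward(Ws, bs, x)
analytic = backward(Ws, cache, a_out - y)[0][0][1, 0]

eps = 1e-6
Wp = [W.copy() for W in Ws]; Wp[0][1, 0] += eps
Wm = [W.copy() for W in Ws]; Wm[0][1, 0] -= eps
numeric = (0.5 * np.sum((forward(Wp, bs, x)[0] - y) ** 2)
           - 0.5 * np.sum((forward(Wm, bs, x)[0] - y) ** 2)) / (2 * eps)
```

The backward loop is the recurrence δˡ⁻¹ = ((Wˡ)ᵀδˡ) ⊙ φ′(zˡ⁻¹) unrolled, using the cached activation a to evaluate tanh′.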
The backward pass starts from the loss. Two common cases:
1) Regression with MSE
If L = ½‖aᴸ − y‖², then:
∂L/∂aᴸ = aᴸ − y
Then δᴸ = (∂L/∂aᴸ) ⊙ φ′(zᴸ).
2) Classification with softmax + cross-entropy
A very common final layer is:
p = softmax(z)
L = −∑ᵢ yᵢ log pᵢ
A key simplification (worth remembering) is:
∂L/∂z = p − y
So the starting sensitivity at the logits is just prediction minus label.
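This identity is easy to verify numerically; the sketch below compares p − y against central differences on each logit (shifting by the max inside softmax is a standard numerical-stability trick):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])      # one-hot label
p = softmax(z)
analytic = p - y                    # claimed dL/dz

def loss(z):
    return -np.sum(y * np.log(softmax(z)))

# Central difference on each logit of L = -sum(y * log(softmax(z)))
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[k]) - loss(z - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])
```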
A useful way to imagine backprop in code: each operation records its inputs on a "tape" during the forward pass and implements a backward method; an engine then replays the tape in reverse, feeding each backward method its upstream gradient.
This is reverse-mode automatic differentiation (reverse-mode AD). Backpropagation is reverse-mode AD specialized to neural nets.
| Mode | Computes | Best when | Cost intuition |
|---|---|---|---|
| Forward-mode AD | directional derivatives (Jacobian-vector products) | few inputs, many outputs | roughly scales with #inputs |
| Reverse-mode AD (backprop) | gradients of scalar loss (vector-Jacobian products) | many parameters, scalar loss | roughly scales with #outputs (≈1) |
Neural nets have huge input dimension in parameter space (millions of weights) and a single scalar loss, so reverse-mode is ideal.
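The "backprop in code" picture above can be made concrete with a toy scalar reverse-mode engine (the `Value` class is illustrative, not a real library): each node records its parents and local partials during the forward pass, and `backward` accumulates gradients in reverse topological order, summing at merges.

```python
class Value:
    """Toy scalar reverse-mode AD node (illustrative sketch)."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent
        self.grad = 0.0

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Build reverse topological order via depth-first search.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v.parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0                 # dL/dL = 1
        for v in reversed(order):
            for p, g in zip(v.parents, v.local_grads):
                p.grad += v.grad * g    # sum contributions at merges

x, y = Value(2.0), Value(3.0)
u = x * y
v = u + y          # y is reused: its gradient contributions add
L = v * v
L.backward()
```

Running this reproduces the scalar example from earlier: ∂L/∂x = ∂L/∂y = 54 at x = 2, y = 3.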
In practice, we rarely train on one example at a time. With a minibatch of size B:
Suppose X ∈ ℝ^{n×B} and Z = WX + b, where b is broadcast across columns.
Let G = ∂L/∂Z ∈ ℝ^{m×B}.
Then the batched gradients are:
∂L/∂W = G Xᵀ
∂L/∂b = ∑_{k=1}^{B} G[:, k]
∂L/∂X = Wᵀ G
Notice: the matrix product G Xᵀ sums over the batch dimension automatically, while the bias gradient needs an explicit sum over columns.
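A batched sketch of these rules, with a finite-difference check on one weight entry (the loss L = ∑ G ⊙ Z is chosen so that ∂L/∂Z = G by construction):

```python
import numpy as np

rng = np.random.default_rng(2)
B, n, m = 5, 4, 3
W = rng.standard_normal((m, n))
b = rng.standard_normal(m)
X = rng.standard_normal((n, B))     # columns are examples
G = rng.standard_normal((m, B))     # upstream dL/dZ

dW = G @ X.T                        # sums over the batch automatically
db = G.sum(axis=1)                  # explicit sum over columns
dX = W.T @ G

# Finite-difference check with L(W) = sum(G * (W X + b[:, None]))
def loss(Wc):
    return np.sum(G * (Wc @ X + b[:, None]))

eps = 1e-6
Wp = W.copy(); Wp[0, 1] += eps
Wm = W.copy(); Wm[0, 1] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
```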
Deep learning frameworks implement layers/modules that each provide a forward method (compute outputs and cache needed intermediates) and a backward method (map the upstream gradient to gradients for inputs and parameters).
This is exactly the node-local backward rule, just grouped into a bigger node.
Common modules: linear/dense layers, convolutions, normalization layers, attention, and embedding lookups.
Even when the module is complicated, its backward pass follows the same idea: take the upstream gradient, apply the module's local derivative structure, and emit gradients for its inputs and parameters.
Backprop needs forward intermediates. For deep nets, storing every activation can be expensive.
Two practical strategies:
1) Store everything (standard): cache all activations during the forward pass; fastest backward pass, highest memory use.
2) Checkpointing / recomputation: store only selected activations and recompute the rest during the backward pass, trading extra compute for lower memory.
Backprop multiplies many Jacobians together. For a deep chain:
∂L/∂x = (J₁ᵀ J₂ᵀ … Jₖᵀ) ∂L/∂z
If the norms of these Jacobians are often < 1, gradients shrink (vanish). If often > 1, they grow (explode).
This is why choices like activation functions, weight initialization, normalization layers, and residual connections matter so much. They change the effective Jacobians that backprop multiplies.
A residual block often looks like:
h = x + F(x)
In backprop:
∂L/∂x = ∂L/∂h · ∂h/∂x
But h depends on x through two paths: identity and F.
So:
∂L/∂x = ∂L/∂h · I + (∂L/∂h) · ∂F/∂x
Meaning:
∂L/∂x = ∂L/∂h + (∂L/∂h)·∂F/∂x
The identity path contributes a clean gradient term, which helps gradients flow even if ∂F/∂x becomes small.
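A numerical sanity check of the residual gradient, using a linear F(x) = Ax so that the Jacobian J_F = A exactly (the loss L = u·h is chosen so that ∂L/∂h = u):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))     # F(x) = A x, so J_F = A
x = rng.standard_normal(n)
u = rng.standard_normal(n)          # upstream dL/dh

h = x + A @ x                       # residual block with linear F
dx = u + A.T @ u                    # identity path + F path

# Finite-difference check with L(x) = u . (x + A x)
def loss(xc):
    return u @ (xc + A @ xc)

eps = 1e-6
e0 = np.zeros(n); e0[0] = eps
numeric = (loss(x + e0) - loss(x - e0)) / (2 * eps)
```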
Deep architectures (CNNs, RNNs, Transformers) are essentially large computation graphs with repeated motifs.
Backprop is what makes them trainable at scale: one backward pass produces every parameter gradient for roughly the cost of one extra forward pass, no matter how many parameters there are.
Once you are comfortable with δ propagation and reverse traversal, you're ready to reason about vanishing/exploding gradients, residual and attention architectures, and the automatic differentiation frameworks that train them.
That is the bridge to the next node: deeper architectures and the systems that train them.
Let x and y be scalars. Define:
1) u = x·y
2) v = u + y
3) L = v²
Compute ∂L/∂x and ∂L/∂y using backprop (node-local rules + reverse traversal).
Forward pass: u = x·y, v = u + y, L = v².
Initialize backward: ∂L/∂L = 1.
Work from L backward.
Backprop through L = v²:
∂L/∂v = 2v
Backprop through v = u + y:
Local partials: ∂v/∂u = 1, ∂v/∂y = 1
So:
∂L/∂u = ∂L/∂v · 1 = 2v
Contribution to ∂L/∂y from this path = ∂L/∂v · 1 = 2v
Backprop through u = x·y:
Local partials: ∂u/∂x = y, ∂u/∂y = x
So:
∂L/∂x = ∂L/∂u · y = 2v·y
Contribution to ∂L/∂y from this path = ∂L/∂u · x = 2v·x
Sum contributions at the merge for y:
∂L/∂y = 2v·1 + 2v·x = 2v(1 + x)
And ∂L/∂x = 2v·y, where v = u + y = xy + y = y(x+1).
Insight: The only global coordination is the reverse traversal and the “add at merges” rule. Everything else is local: multiply the upstream gradient by a local derivative.
Consider a 2-layer (one hidden layer) network for a single example:
z¹ = W¹x + b¹, a¹ = φ(z¹)
z² = W²a¹ + b², a² = z² (identity output)
Loss: L = ½‖a² − y‖²
Derive ∂L/∂W², ∂L/∂b², ∂L/∂W¹, ∂L/∂b¹ using δ notation.
Start at the loss:
L = ½‖a² − y‖²
So:
∂L/∂a² = a² − y
Output is identity: a² = z²
Therefore:
δ² := ∂L/∂z² = ∂L/∂a² = a² − y
Backprop through z² = W²a¹ + b² using the affine rules with upstream gradient δ²:
∂L/∂W² = δ² (a¹)ᵀ
∂L/∂b² = δ²
∂L/∂a¹ = (W²)ᵀ δ²
Backprop through a¹ = φ(z¹):
δ¹ := ∂L/∂z¹ = (∂L/∂a¹) ⊙ φ′(z¹)
So:
δ¹ = ((W²)ᵀ δ²) ⊙ φ′(z¹)
Backprop through z¹ = W¹x + b¹:
∂L/∂W¹ = δ¹ xᵀ
∂L/∂b¹ = δ¹
And (if needed) ∂L/∂x = (W¹)ᵀ δ¹
Insight: Backprop for layered networks is just repeated application of two templates: (1) affine backward, (2) activation backward. The δ vectors store exactly the sensitivities you need to reuse.
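The derivation maps line-for-line onto NumPy (a minimal sketch, assuming tanh for φ and the identity output layer above); a finite-difference check confirms the δ-based ∂L/∂W¹:

```python
import numpy as np

rng = np.random.default_rng(4)
W1 = rng.standard_normal((3, 2)); b1 = rng.standard_normal(3)
W2 = rng.standard_normal((2, 3)); b2 = rng.standard_normal(2)
x = rng.standard_normal(2); y = rng.standard_normal(2)
phi = np.tanh
dphi = lambda z: 1 - np.tanh(z) ** 2

def loss(W1c):
    a1 = phi(W1c @ x + b1)
    a2 = W2 @ a1 + b2               # identity output layer
    return 0.5 * np.sum((a2 - y) ** 2)

# Forward pass (caching z1, a1, a2)
z1 = W1 @ x + b1; a1 = phi(z1)
a2 = W2 @ a1 + b2

# Backward pass, following the derivation above
delta2 = a2 - y                     # delta^2 = dL/dz^2 = a^2 - y
dW2 = np.outer(delta2, a1); db2 = delta2
delta1 = (W2.T @ delta2) * dphi(z1) # delta^1 = ((W^2)^T delta^2) ⊙ phi'(z^1)
dW1 = np.outer(delta1, x);  db1 = delta1

# Finite-difference check on W1[0, 1]
eps = 1e-6
W1p = W1.copy(); W1p[0, 1] += eps
W1m = W1.copy(); W1m[0, 1] -= eps
numeric = (loss(W1p) - loss(W1m)) / (2 * eps)
```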
Let logits z ∈ ℝ^K. Softmax probabilities: pᵢ = exp(zᵢ) / ∑ⱼ exp(zⱼ).
Cross-entropy loss with one-hot y: L = −∑ᵢ yᵢ log pᵢ.
Show that ∂L/∂z = p − y.
Rewrite the loss in a convenient form.
Since log pᵢ = zᵢ − log(∑ⱼ exp(zⱼ)), we have:
L = −∑ᵢ yᵢ[zᵢ − log(∑ⱼ exp(zⱼ))]
= −∑ᵢ yᵢ zᵢ + (∑ᵢ yᵢ) log(∑ⱼ exp(zⱼ))
For one-hot (or any distribution label), ∑ᵢ yᵢ = 1, so:
L = −∑ᵢ yᵢ zᵢ + log(∑ⱼ exp(zⱼ))
Differentiate w.r.t. z_k:
∂/∂z_k [−∑ᵢ yᵢ zᵢ] = −y_k
Differentiate the log-sum-exp term:
Let S = ∑ⱼ exp(zⱼ)
Then:
∂/∂z_k [log S] = (1/S) · ∂S/∂z_k
And:
∂S/∂z_k = exp(z_k)
So:
∂/∂z_k [log(∑ⱼ exp(zⱼ))] = exp(z_k) / ∑ⱼ exp(zⱼ) = p_k
Combine terms:
∂L/∂z_k = p_k − y_k
Stacking all k gives:
∂L/∂z = p − y
Insight: This simplification is why softmax + cross-entropy is so popular: it produces a clean starting sensitivity at the logits, avoiding an extra Jacobian factor.
Backpropagation is reverse-mode differentiation on a computation graph: forward computes values; backward computes gradients.
The atomic operation is the node-local backward rule: given ∂L/∂z, compute ∂L/∂(inputs) using local partial derivatives.
Reverse traversal (reverse topological order) ensures every node’s upstream gradient is available before using it.
When a variable influences the loss through multiple forward paths, its gradients from each path add: ∂L/∂x = ∑ paths (contribution).
Sensitivities δⱼ := ∂L/∂zⱼ (and layerwise δˡ) are the reusable quantities that make gradient computation efficient.
For affine layers z = Wx + b, the core backward identities are ∂L/∂W = gxᵀ, ∂L/∂b = g, ∂L/∂x = Wᵀg.
Softmax + cross-entropy yields a particularly simple gradient: ∂L/∂z = p − y.
Backprop’s efficiency (≈ one extra forward pass worth of compute) is what makes deep learning practical.
Forgetting to sum gradient contributions at merges (e.g., residual connections or reused tensors), leading to gradients that are too small.
Mismatching tensor shapes/transposes in affine backward rules (common bug: using xgᵀ instead of gxᵀ).
Confusing pre-activation z and activation a when applying φ′: you need φ′(z) (or an equivalent expression using cached a).
Dropping batch reductions: bias gradients typically require summing over the batch dimension (and sometimes spatial dimensions in CNNs).
Scalar backprop practice: Let a = x + y, b = x·a, L = b². Compute ∂L/∂x and ∂L/∂y in terms of x and y.
Hint: Work backward: L→b→(x,a) and a→(x,y). Remember that x influences b both directly and through a, so you must add contributions for ∂L/∂x.
Forward:
a = x + y
b = x·a = x(x+y)
L = b²
Backward:
∂L/∂b = 2b
For b = x·a:
∂b/∂x = a, ∂b/∂a = x
So contributions:
∂L/∂x (via b direct) = (2b)·a
∂L/∂a = (2b)·x
For a = x + y:
∂a/∂x = 1, ∂a/∂y = 1
So:
Contribution to ∂L/∂x (via a) = ∂L/∂a · 1 = (2b)x
∂L/∂y = ∂L/∂a · 1 = (2b)x
Add for x:
∂L/∂x = 2b·a + 2b·x = 2b(a + x)
Substitute a = x+y and b = x(x+y):
∂L/∂x = 2x(x+y)[(x+y)+x] = 2x(x+y)(2x+y)
∂L/∂y = 2x·b = 2x·x(x+y) = 2x²(x+y)
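A quick numeric check of these closed forms (`exercise_grads` is a hypothetical helper name):

```python
# Manual backprop for a = x + y, b = x*a, L = b**2.
def exercise_grads(x, y):
    a = x + y
    b = x * a
    dL_db = 2 * b
    dL_da = dL_db * x           # via b = x*a
    dL_dx = dL_db * a + dL_da   # direct path plus path through a
    dL_dy = dL_da               # a = x + y, so da/dy = 1
    return dL_dx, dL_dy

gx, gy = exercise_grads(2.0, 1.0)
```

At x = 2, y = 1 the closed forms give ∂L/∂x = 2x(x+y)(2x+y) = 60 and ∂L/∂y = 2x²(x+y) = 24, matching the backprop result.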
Affine + ReLU layer: z = Wx + b, a = ReLU(z). Suppose the upstream gradient is g = ∂L/∂a. Derive expressions for ∂L/∂W, ∂L/∂b, and ∂L/∂x using g and cached z.
Hint: First convert g into δ = ∂L/∂z by multiplying by the ReLU mask 𝟙[z>0]. Then use affine backward rules.
ReLU backward (elementwise):
δ := ∂L/∂z = g ⊙ 𝟙[z>0]
Affine backward with upstream δ:
∂L/∂W = δ xᵀ
∂L/∂b = δ
∂L/∂x = Wᵀ δ
Show the merge-sum rule in vector form: Let h = x + F(x) where F: ℝ^n → ℝ^n is differentiable. Given upstream gradient u = ∂L/∂h, express ∂L/∂x in terms of u and the Jacobian J_F = ∂F/∂x.
Hint: Treat h as two paths from x: identity and F. The derivative of identity is I. Use vector-Jacobian products appropriate for reverse-mode.
We have h(x) = x + F(x).
The Jacobian is:
∂h/∂x = I + J_F
Reverse-mode uses a vector-Jacobian product:
∂L/∂x = (∂h/∂x)ᵀ u
= (I + J_F)ᵀ u
= Iᵀ u + J_Fᵀ u
= u + J_Fᵀ u
This is exactly “sum contributions at the merge”: the identity path contributes u, and the F path contributes J_Fᵀ u.
Unlocks: Deep Learning
Related prerequisite refreshers:
Next good companions (often adjacent in a tech tree):