An algorithmic technique to compute exact derivatives of functions expressed as programs by applying the chain rule systematically (forward- or reverse-mode), which underlies efficient backpropagation in modern frameworks. Knowing AD helps in understanding gradient computation costs and implementation choices.
Self-serve tutorial - low prerequisites, straightforward concepts.
You already know computational graphs show how values flow. Automatic differentiation (AD) is what makes those graphs actionable: it turns “a program that computes a number” into “a program that computes that number and its exact derivative”, with predictable computational cost. This is the algorithmic heart of backpropagation.
Automatic differentiation computes exact derivatives of functions written as programs by systematically applying the chain rule to a trace of primitive operations. Forward-mode propagates tangents (directional derivatives) alongside values and is efficient when inputs are few. Reverse-mode propagates adjoints (sensitivities) backward and is efficient when outputs are few (e.g., scalar loss), which is why it powers backprop in deep learning.
When you want derivatives of a function written as code, you have several options:

- Manual differentiation: derive formulas by hand and implement them (error-prone at scale).
- Symbolic differentiation: build and simplify an explicit formula (expressions can blow up).
- Numerical differentiation: finite differences (approximate; truncation and round-off error).
- Automatic differentiation: exact up to floating point, with predictable cost.
Deep learning depends on fast and reliable gradients. If your loss is a scalar ℓ and your parameters are θ, you typically need ∇_θ ℓ. AD gives you those gradients with a cost that you can reason about.
Let f: ℝⁿ → ℝᵐ be a function implemented as a program, taking inputs x ∈ ℝⁿ to outputs y = f(x) ∈ ℝᵐ.
The Jacobian is
J_f(x) = Df = [∂yᵢ/∂xⱼ] ∈ ℝᵐˣⁿ
Automatic differentiation is an algorithmic technique to compute J_f(x) (or products with it) by decomposing f into a sequence of primitive operations whose derivatives are known, then applying the chain rule systematically along that sequence.
A key AD idea: your code (or computational graph) induces a sequence of intermediate values:
x → v₁ → v₂ → … → v_k → y
Each step is a primitive operation like +, ×, sin, exp, matmul, etc.
Example (scalar):
f(x) = sin(x) + x²
A “trace” might look like:

v₁ = x
v₂ = sin(v₁)
v₃ = v₁²
v₄ = v₂ + v₃ (this is f(x))
AD differentiates this trace.
AD is not algebraic simplification. It does not try to build and simplify a symbolic formula for df/dx. Instead, it evaluates derivatives numerically, step-by-step, using exact local derivative rules.
So the result is exact with respect to the executed operations (subject to floating point), and it naturally supports control flow, loops, and any composition of primitives whose local derivatives are known.
AD comes in two main propagation styles: forward-mode and reverse-mode.
They compute different Jacobian-related objects efficiently:
| Mode | Efficiently computes | Typical use case |
|---|---|---|
| Forward | J_f(x) v (Jacobian-vector product, JVP) | Few inputs, many outputs; sensitivity in a direction |
| Reverse | wᵀ J_f(x) (vector-Jacobian product, VJP) | Many inputs, few outputs (often m = 1) |
Deep learning training typically has m = 1 (a scalar loss), and n can be millions of parameters. That asymmetry makes reverse-mode the star.
Backpropagation is best understood as reverse-mode AD applied to a computational graph. The deep learning context often adds batching, tensor primitives, and GPU kernels, but the underlying algorithm is reverse-mode AD.
The main learning goal for this node: understand what is being propagated, why the two modes have different cost profiles, and how the chain rule is organized algorithmically.
Forward-mode is conceptually straightforward: as you compute each intermediate value v, you also compute how v changes with respect to a chosen input direction.
Think: “If I nudge the input x a tiny amount along direction u, how does the output change?”
This is a directional derivative:
d/dε f(x + εu) |_{ε=0} = J_f(x) u
Forward-mode computes exactly this quantity in one pass.
A classic way to explain forward-mode is dual numbers. Replace a real number x with
x̂ = x + ẋ ε
where ε is a special symbol with ε² = 0 (but ε ≠ 0). Then running the program on x̂ with ordinary algebra produces f(x) + f′(x) ẋ ε: the ε coefficient carries the derivative.
If you initialize ẋ = 1, then the ε coefficient at the end becomes df/dx.
Example: for multiplication,
(a + ȧ ε)(b + ḃ ε)
= ab + (a ḃ + b ȧ) ε
The ε coefficient exactly matches the product rule.
AD implementations don’t need to literally use ε, but the algebra captures the idea: push derivative information forward alongside values.
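A dual-number forward mode fits in a few lines of Python. This is an illustrative sketch (the `Dual` class and `dsin` helper are invented for this node, not a framework API):

```python
import math

# Minimal dual numbers: each value carries (val, dot), where dot is the
# coefficient of ε with ε² = 0. Arithmetic implements the tangent rules.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule; the ε² term vanishes.
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def dsin(x):
    # sin lifted to dual numbers: the local derivative is cos.
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

# f(x) = sin(x) + x², differentiated at x = 2 by seeding the tangent 1.
x = Dual(2.0, 1.0)
y = dsin(x) + x * x
# y.val = sin(2) + 4;  y.dot = cos(2) + 4 = f'(2)
```

Seeding a different tangent in `x` scales the result: `y.dot` is always the directional derivative along the seeded direction.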
Suppose a program defines intermediates vᵢ = opᵢ(parents). Forward-mode maintains (vᵢ, v̇ᵢ), where v̇ᵢ is the directional derivative of vᵢ in the chosen input direction.
If vᵢ = g(vⱼ, v_k), then by chain rule:
v̇ᵢ = (∂g/∂vⱼ) v̇ⱼ + (∂g/∂v_k) v̇_k
You can see the pattern: local partial derivatives times incoming tangents.
For x ∈ ℝⁿ, choose a direction u ∈ ℝⁿ. Initialize input tangents:
ẋ = u
Run the program forward; each intermediate carries its tangent. The output tangent is:
ẏ = J_f(x) u
This is a Jacobian-vector product (JVP).
If you want the full Jacobian J_f, you could run forward-mode n times with u = eⱼ (standard basis vectors). That costs O(n) forward sweeps.
Forward-mode cost is roughly a small constant times the cost of one function evaluation per sweep: each primitive does a fixed amount of extra tangent arithmetic.

The big scaling fact: recovering the full Jacobian takes one sweep per input dimension, so the total cost scales with n, independent of m.

So forward-mode is attractive when n (inputs) is small, even if m is large.
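To make the n-sweep pattern concrete, here is a hand-coded sketch in plain Python: `f_jvp` (an illustrative name) performs one forward sweep for a small two-input, two-output function, with the tangent rules written out by hand:

```python
import math

# One forward sweep: given x and a direction u, return (f(x), J_f(x) u)
# for f(x1, x2) = (x1*x2 + sin(x1), exp(x2) + x1**2).
def f_jvp(x, u):
    (x1, x2), (u1, u2) = x, u
    y1 = x1 * x2 + math.sin(x1)
    y1d = u1 * x2 + x1 * u2 + math.cos(x1) * u1    # product + sin rules
    y2 = math.exp(x2) + x1 ** 2
    y2d = math.exp(x2) * u2 + 2 * x1 * u1          # exp + square rules
    return (y1, y2), (y1d, y2d)

x = (1.0, 0.0)
# Seed each standard basis direction: one sweep per Jacobian column.
cols = [f_jvp(x, e)[1] for e in [(1.0, 0.0), (0.0, 1.0)]]
J = [[cols[j][i] for j in range(2)] for i in range(2)]  # J[i][j] = dy_i/dx_j
# J = [[cos(1), 1], [2, 1]] at this point
```

Two inputs means two sweeps; with n inputs the list of seed directions grows to n, which is exactly the O(n) cost discussed above.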
Forward-mode shines in situations like:

- Functions with few inputs and many outputs (e.g., sensitivity of a large simulation to a handful of parameters).
- Computing a directional derivative (JVP) along one or a few chosen directions.
- Nesting with reverse-mode for higher-order derivatives (e.g., Hessian-vector products).
It is also valuable in deep learning systems as a building block (for higher-order derivatives and for efficient JVP computation).
For common scalar primitives (primal v, tangent v̇):

| Primitive | Tangent rule |
|---|---|
| v = a + b | v̇ = ȧ + ḃ |
| v = a·b | v̇ = ȧ b + a ḃ |
| v = sin(a) | v̇ = cos(a) ȧ |
| v = exp(a) | v̇ = exp(a) ȧ |
| v = aᵏ | v̇ = k aᵏ⁻¹ ȧ |
For vectors/matrices, the same principles apply with appropriate linear algebra. For example, if y = A x, then ẏ = A ẋ (a linear map).
Forward-mode sets the stage: it makes the chain rule feel like a local propagation law. Reverse-mode will use the same local derivatives but run the information in the opposite direction.
In machine learning you often have a scalar loss ℓ = f(θ) with f: ℝⁿ → ℝ, where n (the number of parameters) can be enormous.
You want ∇_θ ℓ, i.e., all ∂ℓ/∂θⱼ.
Forward-mode would require n separate passes (one per θⱼ direction). Reverse-mode computes all n partials at roughly a small constant times the cost of one evaluation of f.
Reverse-mode defines, for each intermediate vᵢ, an adjoint (also called sensitivity):
v̄ᵢ = ∂ℓ/∂vᵢ
Read it as: “how much does the final scalar output ℓ change if vᵢ changes?”
This is exactly what backprop stores as “gradients with respect to activations.”
Suppose an intermediate is computed by
vᵢ = g(vⱼ, v_k)
Given v̄ᵢ, we can distribute it to parents:
v̄ⱼ += v̄ᵢ · ∂vᵢ/∂vⱼ
v̄_k += v̄ᵢ · ∂vᵢ/∂v_k
This is the chain rule written in an accumulation form.
Important: the “+=” is not decoration. If a variable feeds into multiple downstream paths, its total sensitivity is the sum of contributions from each path.
At the end, θ̄ = ∂ℓ/∂θ = ∇_θ ℓ.
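A toy tape-based reverse mode can be sketched in plain Python. The `Var` class and `backward` helper are invented for illustration (real frameworks handle tensors, batching, and GPU kernels), but the accumulation rule is the real thing:

```python
import math

# Each Var records its parents and the local partial derivative toward
# each parent; backward() distributes adjoints with the "+=" rule.
class Var:
    def __init__(self, val, parents=()):
        self.val = val
        self.parents = parents   # list of (parent, local_partial)
        self.bar = 0.0           # adjoint: d(loss)/d(this value)

    def __add__(self, other):
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.val * other.val,
                   [(self, other.val), (other, self.val)])

def vsin(a):
    return Var(math.sin(a.val), [(a, math.cos(a.val))])

def backward(out):
    out.bar = 1.0
    # Topological order via depth-first search, then sweep in reverse.
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            order.append(v)
    visit(out)
    for v in reversed(order):
        for p, local in v.parents:
            p.bar += v.bar * local   # accumulate, never overwrite

# loss = x*y + sin(x) at (x, y) = (1, 2):
x, y = Var(1.0), Var(2.0)
loss = x * y + vsin(x)
backward(loss)
# x.bar = y + cos(x) = 2 + cos(1);  y.bar = x = 1
```

Note that `x.bar` receives two contributions (through the product and through the sine), which is exactly why `+=` rather than `=` is essential.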
Reverse-mode needs values from the forward pass to compute local derivatives. Example:
If v = sin(a), then ∂v/∂a = cos(a). During backward, you need a (or v) to compute cos(a).
So reverse-mode typically stores (“tapes”) intermediate values.
This leads to a real implementation tradeoff:
| Strategy | Memory | Compute | Notes |
|---|---|---|---|
| Store all intermediates | High | Low | Standard backprop |
| Recompute some intermediates (checkpointing) | Lower | Higher | Useful for very deep nets |
For general f: ℝⁿ → ℝᵐ, reverse-mode naturally computes
wᵀ J_f(x)
for a chosen output weighting w ∈ ℝᵐ.
When m = 1 and ℓ = f(x), w = 1 and
1ᵀ J_f(x) = ∇_x ℓ
So gradients are a special case of a vector-Jacobian product (VJP).
Let C be the cost of evaluating f once. One reverse sweep (a forward pass to record the trace plus a backward pass to propagate adjoints) costs a small constant times C and yields one VJP, i.e., one row of the Jacobian. A full Jacobian therefore takes about m sweeps.
So reverse-mode is efficient when m (outputs) is small, especially m = 1.
Consider:
v₁ = x
v₂ = y
v₃ = v₁ v₂
v₄ = sin(v₁)
v₅ = v₃ + v₄
ℓ = v₅
Adjoints (initialize v̄₅ = 1, then distribute backward):

v̄₅ = 1
v̄₃ = v̄₅ · 1 = 1
v̄₄ = v̄₅ · 1 = 1
v̄₁ = v̄₃ · v₂ + v̄₄ · cos(v₁) = v₂ + cos(v₁)
v̄₂ = v̄₃ · v₁ = v₁

So:
∂ℓ/∂x = v̄₁ = y + cos(x)
∂ℓ/∂y = v̄₂ = x
This is the chain rule, reorganized so you reuse shared subcomputations efficiently.
A subtle but important distinction: reverse-mode runs derivative information backward through the trace; it does not run the computation itself backward. Even if an operation is not invertible (e.g., squaring), reverse-mode still works because it doesn't need to invert anything; it only needs local partial derivatives.
Real programs may include primitives like ReLU(x) = max(0, x), which is not differentiable at x = 0.
Frameworks typically define a convention (a subgradient) such as:
d/dx ReLU(x) = 0 if x < 0, 1 if x > 0, and a chosen value at x = 0 (often 0)
AD will propagate according to those primitive rules.
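A sketch of such a primitive rule (the names `relu` and `relu_vjp` are illustrative, not a framework API):

```python
# ReLU with the common subgradient convention: derivative 0 at x = 0.
def relu(x):
    return max(0.0, x)

def relu_vjp(x, upstream):
    # Local derivative is 1 for x > 0, else 0; multiply it into the
    # incoming adjoint, exactly as for any other primitive.
    return upstream * (1.0 if x > 0 else 0.0)

# relu_vjp(3.0, 5.0) -> 5.0; relu_vjp(-3.0, 5.0) -> 0.0
# relu_vjp(0.0, 5.0) -> 0.0 (the chosen convention at the kink)
```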
Reverse-mode is the workhorse for deep learning, but remember: it is one mode of a general AD idea. The real power comes from understanding both modes and the Jacobian products they compute.
In neural networks, you define a scalar loss:
ℓ = L(model(x; θ), target)
Training needs ∇_θ ℓ. The computation of ℓ is a composition of many layers, each a primitive (or a bundle of primitives). Reverse-mode AD traverses the graph backward, accumulating adjoints.
So when you see “.backward()” in a framework, it’s executing reverse-mode AD.
In high dimensions, explicitly building J_f is often impossible: with n and m large, J_f has m·n entries, too many to store, let alone compute.

AD avoids this by computing products: JVPs (J_f v, forward-mode) and VJPs (wᵀ J_f, reverse-mode), each at roughly the cost of one function evaluation.

These are enough for many algorithms: gradient descent only needs a VJP (the gradient), and iterative methods such as conjugate gradient or Gauss–Newton solvers only need JVPs and VJPs.
If you want second derivatives, you can apply AD to an AD-produced derivative computation.
For scalar ℓ(x), the Hessian H = ∇²ℓ(x) ∈ ℝⁿˣⁿ is usually too large to form explicitly, but many algorithms only need Hessian-vector products H v.

A common strategy is: use reverse-mode to obtain the gradient function g(x) = ∇ℓ(x), then apply forward-mode to g along the direction v.

This is “forward-over-reverse” and can compute H v at roughly the cost of a few evaluations of ℓ.
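The nesting idea is easiest to see with dual numbers inside dual numbers ("forward-over-forward"), a simpler cousin of forward-over-reverse. The `D` class below is an illustrative sketch, not a framework API:

```python
import math

# A generic dual number whose components may themselves be duals.
# Nesting two levels gives second derivatives of a scalar function.
class D:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, D) else D(o)
        return D(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, D) else D(o)
        return D(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def dsin(x):
    if isinstance(x, D):
        return D(dsin(x.val), dcos(x.val) * x.dot)
    return math.sin(x)

def dcos(x):
    if isinstance(x, D):
        return D(dcos(x.val), -1.0 * dsin(x.val) * x.dot)
    return math.cos(x)

# f(x) = x * sin(x); analytically f''(x) = 2 cos(x) - x sin(x).
def f(x):
    return x * dsin(x)

a = 1.3
x = D(D(a, 1.0), D(1.0, 0.0))  # outer tangent differentiates the inner derivative
y = f(x)
# y.val.val = f(a); y.val.dot = y.dot.val = f'(a); y.dot.dot = f''(a)
```

Frameworks apply the same composition idea with reverse-mode on the inside, which is cheaper when x has many components.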
Let f: ℝⁿ → ℝᵐ. Forward-mode needs one sweep per seeded input direction; reverse-mode needs one sweep per seeded output direction. So:
| Goal | Better mode | Why |
|---|---|---|
| Full Jacobian J ∈ ℝᵐˣⁿ | forward if n ≪ m, reverse if m ≪ n | One sweep per “small side” |
| Gradient of scalar ℓ | reverse | m = 1 |
| JVP | forward | computes J v directly |
| VJP | reverse | computes wᵀ J directly |
Real programs include control flow: branches, loops, recursion, and data-dependent shapes.

AD works by differentiating the executed trace. That means the derivative you get is the derivative of the branch that actually ran for this input; other branches contribute nothing, and derivatives can jump where control flow switches.

This is usually what you want in ML, but it matters for correctness and debugging.
Implementations generally fall into two families:
| Approach | How it works | Pros | Cons |
|---|---|---|---|
| Operator overloading | Run code with special number/tensor types that record a tape | Easy to integrate with dynamic languages | Runtime overhead; tracing can be subtle |
| Source transformation | Transform the program (or IR) into a new program that computes derivatives | Can be faster; more compile-time optimization | Harder to implement; language/IR constraints |
Modern frameworks blend ideas (e.g., tracing to an IR, then compiling).
AD avoids finite-difference error, but it cannot fix a numerically unstable primal computation.
Example: if your function involves subtracting nearly equal numbers, the primal has catastrophic cancellation; the derivative will reflect that sensitivity.
So you still need good numerical practices.
Backprop is reverse-mode AD, and its efficiency is why deep learning is feasible at scale.
But understanding AD gives you extra superpowers: you can use framework transformations such as `jvp`, `vjp`, `grad`, `jacobian`, and `hessian` deliberately, picking the mode that matches your problem's shape.

Most importantly: you stop treating gradients as “magic,” and start seeing them as a carefully organized application of the chain rule.
Let f: ℝ² → ℝ² be
f(x₁, x₂) = (y₁, y₂)
where
y₁ = x₁ x₂ + sin(x₁)
y₂ = exp(x₂) + x₁²
Compute the Jacobian-vector product J_f(x) u at x = (1, 0) along direction u = (2, -1).
Write a trace with intermediates:
v₁ = x₁
v₂ = x₂
v₃ = v₁ v₂
v₄ = sin(v₁)
v₅ = v₃ + v₄ (this is y₁)
v₆ = exp(v₂)
v₇ = v₁²
v₈ = v₆ + v₇ (this is y₂)
We will propagate tangents v̇ᵢ.
Initialize primals at x = (1, 0):
v₁ = 1
v₂ = 0
Initialize tangents from u:
v̇₁ = 2
v̇₂ = -1
Forward propagate:
v₃ = v₁ v₂ = 1·0 = 0
v̇₃ = v̇₁ v₂ + v₁ v̇₂ = 2·0 + 1·(-1) = -1
Next:
v₄ = sin(v₁) = sin(1)
v̇₄ = cos(v₁) v̇₁ = cos(1)·2 = 2cos(1)
Add for y₁:
v₅ = v₃ + v₄ = 0 + sin(1) = sin(1)
v̇₅ = v̇₃ + v̇₄ = -1 + 2cos(1)
For y₂ part:
v₆ = exp(v₂) = exp(0) = 1
v̇₆ = exp(v₂) v̇₂ = 1·(-1) = -1
Square:
v₇ = v₁² = 1² = 1
v̇₇ = 2 v₁ v̇₁ = 2·1·2 = 4
Add for y₂:
v₈ = v₆ + v₇ = 1 + 1 = 2
v̇₈ = v̇₆ + v̇₇ = -1 + 4 = 3
Collect the JVP:
ẏ = (v̇₅, v̇₈) = ( -1 + 2cos(1), 3 )
Insight: Forward-mode naturally returns J_f(x)u in one sweep, without building J_f. You chose u; AD gave the directional derivative of the entire program along that direction.
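The whole worked sweep fits in a few lines of plain Python, carrying (value, tangent) pairs exactly as above:

```python
import math

# Replaying the trace at x = (1, 0) with direction u = (2, -1).
v1, v1d = 1.0, 2.0
v2, v2d = 0.0, -1.0

v3, v3d = v1 * v2, v1d * v2 + v1 * v2d        # product rule
v4, v4d = math.sin(v1), math.cos(v1) * v1d    # sin rule
v5, v5d = v3 + v4, v3d + v4d                  # y1 and its tangent
v6, v6d = math.exp(v2), math.exp(v2) * v2d    # exp rule
v7, v7d = v1 ** 2, 2 * v1 * v1d               # square rule
v8, v8d = v6 + v7, v6d + v7d                  # y2 and its tangent

# (v5d, v8d) = (-1 + 2*cos(1), 3), matching the hand computation
```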
Let ℓ(x, y) be
ℓ = (x y + sin x)²
Compute ∂ℓ/∂x and ∂ℓ/∂y using reverse-mode AD, and evaluate at (x, y) = (1, 2).
Build a trace:
v₁ = x
v₂ = y
v₃ = v₁ v₂
v₄ = sin(v₁)
v₅ = v₃ + v₄
v₆ = v₅²
ℓ = v₆
Forward pass at (1, 2):
v₁ = 1
v₂ = 2
v₃ = 1·2 = 2
v₄ = sin(1)
v₅ = 2 + sin(1)
v₆ = (2 + sin(1))²
Initialize adjoints (bars). Since ℓ = v₆:
v̄₆ = ∂ℓ/∂v₆ = 1
Backprop through v₆ = v₅²:
∂v₆/∂v₅ = 2 v₅
v̄₅ += v̄₆ · 2 v₅ = 1 · 2(2 + sin(1)) = 2(2 + sin(1))
Backprop through v₅ = v₃ + v₄:
v̄₃ += v̄₅ · 1 = 2(2 + sin(1))
v̄₄ += v̄₅ · 1 = 2(2 + sin(1))
Backprop through v₄ = sin(v₁):
∂v₄/∂v₁ = cos(v₁) = cos(1)
v̄₁ += v̄₄ · cos(1)
= 2(2 + sin(1)) cos(1)
Backprop through v₃ = v₁ v₂:
∂v₃/∂v₁ = v₂ = 2
∂v₃/∂v₂ = v₁ = 1
v̄₁ += v̄₃ · v₂
= 2(2 + sin(1)) · 2
= 4(2 + sin(1))
v̄₂ += v̄₃ · v₁
= 2(2 + sin(1)) · 1
= 2(2 + sin(1))
Now read off gradients:
∂ℓ/∂x = v̄₁ = 4(2 + sin(1)) + 2(2 + sin(1))cos(1)
∂ℓ/∂y = v̄₂ = 2(2 + sin(1))
Insight: Reverse-mode computes all input partials of a scalar output in one backward sweep by accumulating adjoints. Notice how v̄₁ receives contributions from two downstream paths (through v₃ and v₄), which is why accumulation is essential.
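The backward sweep above can be replayed numerically in a few lines of plain Python:

```python
import math

# Forward pass at (x, y) = (1, 2), then the adjoint sweep from the text.
x, y = 1.0, 2.0
v3 = x * y
v4 = math.sin(x)
v5 = v3 + v4
v6 = v5 ** 2             # the loss

v6b = 1.0
v5b = v6b * 2 * v5       # through v6 = v5**2
v3b = v5b                # through the sum
v4b = v5b
v1b = v4b * math.cos(x)  # path through sin
v1b += v3b * y           # second path through the product (accumulate!)
v2b = v3b * x

# v1b = 4(2 + sin 1) + 2(2 + sin 1)cos(1);  v2b = 2(2 + sin 1)
```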
Consider f: ℝ³ → ℝ². You want the full Jacobian J_f ∈ ℝ²ˣ³ at a point. Compare forward-mode and reverse-mode sweep counts conceptually.
Forward-mode computes one JVP J_f(x) u per forward sweep.
To get all 3 columns of J_f, run 3 sweeps with u = e₁, e₂, e₃.
So: ~3 forward sweeps.
Reverse-mode computes one VJP wᵀ J_f(x) per backward sweep (after one forward to record the trace).
To get all 2 rows of J_f, run 2 backward sweeps with w = e₁ and e₂.
So: ~2 backward sweeps (plus the shared forward for values/tape).
Conclude: if m < n (2 < 3 here), reverse-mode needs fewer sweeps to form the full Jacobian; if n < m, forward-mode would.
Insight: The rule “forward is good when inputs are few; reverse is good when outputs are few” is not folklore—it comes directly from how many basis directions you must seed to recover the full Jacobian.
Automatic differentiation computes derivatives of a program by applying the chain rule to a recorded sequence of primitive operations (the trace).
Forward-mode AD propagates tangents and efficiently computes J_f(x) v (a JVP) in one forward sweep.
Reverse-mode AD propagates adjoints and efficiently computes wᵀ J_f(x) (a VJP); gradients of a scalar loss are the special case w = 1.
Reverse-mode requires storing (or recomputing) intermediates from the forward pass; memory is often the limiting factor in deep networks.
To form a full Jacobian J ∈ ℝᵐˣⁿ: forward-mode typically needs n sweeps; reverse-mode typically needs m sweeps.
Backpropagation in deep learning is reverse-mode AD on the computational graph of the loss computation.
AD is “exact” relative to the primitives executed (up to floating-point), unlike finite differences which introduce approximation error.
Confusing AD with numerical differentiation: AD does not pick a small ε or suffer truncation error; it uses analytic local derivatives on the executed trace.
Assuming reverse-mode is always better: for functions with small input dimension or when you specifically need JVPs, forward-mode can be cheaper and simpler.
Forgetting accumulation in reverse-mode: when a value feeds multiple downstream operations, its adjoint must sum contributions (using +=).
Ignoring memory cost: reverse-mode may store many intermediates; without checkpointing or careful design, memory can dominate runtime.
Let f(x) = exp(x) sin(x). Use forward-mode AD (dual-number style) to compute df/dx at x = 0.
Write the primal and tangent updates step-by-step.
Hint: Trace: v₁ = x, v₂ = exp(v₁), v₃ = sin(v₁), v₄ = v₂ v₃. Initialize v̇₁ = 1.
Trace and tangents:
v₁ = x, v̇₁ = 1
At x = 0: v₁ = 0
v₂ = exp(v₁) = exp(0) = 1
v̇₂ = exp(v₁) v̇₁ = 1·1 = 1
v₃ = sin(v₁) = sin(0) = 0
v̇₃ = cos(v₁) v̇₁ = cos(0)·1 = 1
v₄ = v₂ v₃ = 1·0 = 0
v̇₄ = v̇₂ v₃ + v₂ v̇₃ = 1·0 + 1·1 = 1
So df/dx at x = 0 equals v̇₄ = 1.
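A quick numeric replay of the tangent sweep in Python:

```python
import math

# Forward-mode trace for f(x) = exp(x) * sin(x) at x = 0, seed tangent 1.
v1, v1d = 0.0, 1.0
v2, v2d = math.exp(v1), math.exp(v1) * v1d   # exp rule
v3, v3d = math.sin(v1), math.cos(v1) * v1d   # sin rule
v4, v4d = v2 * v3, v2d * v3 + v2 * v3d       # product rule

# v4d = 1.0, matching f'(0) = e^0 (sin 0 + cos 0) = 1
```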
Reverse-mode practice. Define ℓ(x, y, z) = (x y + z) · sin(y).
Compute ∂ℓ/∂x, ∂ℓ/∂y, ∂ℓ/∂z using reverse-mode AD (show the trace, then adjoint propagation).
Hint: A good trace: v₁=x, v₂=y, v₃=z, v₄=v₁ v₂, v₅=v₄+v₃, v₆=sin(v₂), v₇=v₅ v₆ = ℓ.
Trace:
v₁ = x
v₂ = y
v₃ = z
v₄ = v₁ v₂
v₅ = v₄ + v₃
v₆ = sin(v₂)
v₇ = v₅ v₆
ℓ = v₇
Adjoints (initialize): v̄₇ = 1
Backprop v₇ = v₅ v₆:
v̄₅ += v̄₇ · v₆ = 1·v₆ = v₆
v̄₆ += v̄₇ · v₅ = 1·v₅ = v₅
Backprop v₆ = sin(v₂):
v̄₂ += v̄₆ · cos(v₂) = v₅ cos(v₂)
Backprop v₅ = v₄ + v₃:
v̄₄ += v̄₅ · 1 = v₆
v̄₃ += v̄₅ · 1 = v₆
Backprop v₄ = v₁ v₂:
v̄₁ += v̄₄ · v₂ = v₆ v₂
v̄₂ += v̄₄ · v₁ = v₆ v₁
Collect:
∂ℓ/∂x = v̄₁ = y sin(y)
∂ℓ/∂z = v̄₃ = sin(y)
∂ℓ/∂y = v̄₂ = (x y + z) cos(y) + x sin(y)
(Using v₅ = x y + z and v₆ = sin(y).)
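The closed forms can be sanity-checked by replaying the adjoint sweep numerically at an arbitrary point; the point (0.7, 1.1, -0.4) below is just a sample:

```python
import math

# Forward pass for loss = (x*y + z) * sin(y), then the adjoint sweep.
x, y, z = 0.7, 1.1, -0.4
v4 = x * y
v5 = v4 + z
v6 = math.sin(y)
# loss = v5 * v6

v7b = 1.0
v5b = v7b * v6
v6b = v7b * v5
v2b = v6b * math.cos(y)  # path into y through sin(y)
v4b = v5b
v3b = v5b
v1b = v4b * y
v2b += v4b * x           # second path into y (accumulate)

# Closed forms from the solution:
# v1b = y*sin(y);  v3b = sin(y);  v2b = (x*y + z)*cos(y) + x*sin(y)
```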
Let f: ℝⁿ → ℝᵐ. You need the full Jacobian J_f at a point. Suppose one forward evaluation costs C.
1) Estimate the cost (in units of C) to compute J_f with forward-mode.
2) Estimate the cost to compute J_f with reverse-mode.
3) For which regimes of (n, m) is each preferable?
Hint: Forward-mode gives one column per seeded input basis direction; reverse-mode gives one row per seeded output basis direction (after taping).
1) Forward-mode: one sweep gives J_f(x) u. To get all columns, run with u = e₁,…,eₙ. Cost ≈ n·C (up to a small constant factor).
2) Reverse-mode: one backward sweep gives wᵀ J_f(x). To get all rows, run with w = e₁,…,eₘ. Cost ≈ m·C plus the cost of the forward pass to build/store the trace (often counted once). So ≈ (m+1)·C times constants.
3) Prefer forward-mode when n ≪ m (few inputs, many outputs). Prefer reverse-mode when m ≪ n (few outputs, many inputs). In ML with scalar loss, m = 1 and n is huge, so reverse-mode is strongly preferred.
Next steps and related nodes: