A representation of mathematical expressions as directed graphs where nodes are operations and edges are tensors, used to structure and execute forward and backward passes in neural network computation. They make it clear how values flow and where gradients need to be propagated for training.
Deep-dive lesson - accessible entry point but dense material. Use worked examples and spaced repetition.
Neural networks look like “layers”, but training them is really about one thing: computing lots of derivatives efficiently and correctly. Computational graphs are the picture (and data structure) that makes both the forward computation and the backward (gradient) computation explicit—so we can automate learning.
A computational graph represents a function as a directed graph: nodes are operations f(…), edges carry tensors (values). A forward pass evaluates node outputs in dependency order. A backward pass (reverse-mode autodiff / backprop) propagates gradients from outputs to inputs using the chain rule, accumulating contributions when a value fans out to multiple consumers.
A computational graph is a directed graph that represents how a mathematical expression (or program) computes its outputs from its inputs.
Why we need this picture at all:
A computational graph makes two things explicit:
1) Data flow (forward computation): how numbers/tensors are produced.
2) Dependency structure (for derivatives): which intermediate values influence which outputs.
This is the key mental model:
Suppose we define
z = (x + y) · y
We can break this into intermediate values: a = x + y, then z = a · y.
Graphically (conceptually): x and y flow into “+” producing a, and then a and y flow into “×” producing z.
Even at this size, the graph is useful because it records which values depend on which (for instance, y feeds both nodes) and therefore where gradient contributions will need to flow.
In ML, edges typically carry vectors/matrices/tensors. We’ll use v for vectors and uppercase like W for matrices.
The graph doesn’t care whether values are scalars or tensors; it’s the same idea. The only difference is that “derivatives” become vector-Jacobian and Jacobian-vector products under the hood.
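The small example above, z = (x + y) · y, can be written as a two-node forward pass. This is only an illustrative sketch, not framework code:

```python
# Forward pass of the two-node graph for z = (x + y) * y.
# Each line is one node firing; each variable is one edge value.
def forward(x, y):
    a = x + y      # "+" node: consumes x and y, produces a
    z = a * y      # "×" node: consumes a and y (y fans out to both nodes)
    return a, z    # a is returned too: backprop will need intermediates

a, z = forward(2.0, 3.0)
# a == 5.0, z == 15.0
```

Note that the intermediate a is kept around rather than discarded; that choice foreshadows the memory trade-offs discussed below.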
Computational graphs are the backbone of systems like PyTorch, JAX, and TensorFlow.
They enable automatic differentiation, graph-level optimization (scheduling, fusion, memory planning), and execution across CPUs, GPUs, and TPUs.
You can think of a computational graph as the “assembly language” of differentiable programs: explicit enough to analyze, optimize, and differentiate.
The forward pass is simply: evaluate every node in an order that respects dependencies.
If node B uses the output of node A, then A must run before B. This is a partial order induced by the directed edges.
Most computational graphs used for standard feed-forward neural networks are acyclic (a DAG: directed acyclic graph). For a DAG, a topological ordering exists.
Let
u = x · y
w = sin(u)
z = w + y
Dependencies: u depends on x and y; w depends on u; z depends on w and y.
A valid forward order is: compute u, then w, then z.
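This example can be run as a tiny dependency-ordered evaluation. A hypothetical mini-graph sketch (the dict-of-lambdas representation is illustrative, not a real framework's):

```python
import math

# Each node maps to an operation over previously computed edge values.
ops = {
    "u": lambda env: env["x"] * env["y"],
    "w": lambda env: math.sin(env["u"]),
    "z": lambda env: env["w"] + env["y"],
}

def forward(inputs, order=("u", "w", "z")):
    """Evaluate nodes in a dependency-respecting (topological) order."""
    env = dict(inputs)          # edge values produced so far
    for name in order:          # u before w, and w before z
        env[name] = ops[name](env)
    return env

env = forward({"x": 1.0, "y": 2.0})
# env["z"] == math.sin(1.0 * 2.0) + 2.0
```

Any order that respects the dependencies would do; ("u", "w", "z") is simply one valid topological ordering of this DAG.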
Here is a subtle but important point: for backprop, you typically need some of the forward pass's intermediate values. For example, backpropagating through w = sin(u) requires u, because dw/du = cos(u).
So most autodiff systems either store the needed intermediates during the forward pass or recompute them during the backward pass.
This leads to a core engineering trade-off:
| Strategy | What you store | Memory | Extra compute | Common use |
|---|---|---|---|---|
| Store intermediates | u, w, … | higher | low | default training |
| Recompute (checkpoint) | fewer values | lower | higher | very deep nets |
A single node output may feed many later nodes. That means its gradient must collect a contribution from every consumer.
Example: suppose a is consumed twice, say by b = f(a) and c = g(a), and both b and c eventually influence the loss L.
Here a is used in two places. During backprop, the gradient ∂L/∂a has two contributions: one flowing back through b and one through c.
You add them because the total derivative is linear in contributions from different paths.
Edges carry tensors with shapes. Each node must obey shape rules.
Computational graphs make shape mismatches easier to catch because every edge has a well-defined produced tensor.
The forward pass is “just computation,” but the graph clarifies what must run before what and which intermediate values need to be kept.
That clarity becomes crucial when we reverse direction and compute gradients.
Training needs gradients: if L is a loss and θ are parameters, we want ∂L/∂θ.
In deep learning, we usually have many parameters and a single scalar loss, which is exactly the regime where reverse mode is efficient.
Reverse-mode autodiff computes all ∂L/∂θ at roughly a small constant multiple of the cost of one forward pass. That’s why it’s the default for neural nets.
Each node performs an operation. Backprop works by combining:
1) An upstream gradient: how the loss changes with respect to the node’s output.
2) A local derivative: how the node’s output changes with respect to its inputs.
Then it produces downstream gradients for each input.
If a node computes
u = f(a)
then by the chain rule:
dL/da = (dL/du) · (du/da)
In graph terms: the upstream gradient dL/du arrives at the node, is multiplied by the local derivative du/da, and is passed backward along the edge to a.
Let
a = x + y
z = a · y
Assume L = z (so L = (x + y) · y).
Forward: a = x + y, then z = a · y.
Backward: start at the output with dL/dz = 1 (since L = z).
The multiply node z = a · y has local derivatives ∂z/∂a = y and ∂z/∂y = a.
So: dL/da = 1 · y = y, and the direct contribution to dL/dy is 1 · a = a.
The add node a = x + y has local derivatives ∂a/∂x = 1 and ∂a/∂y = 1.
So: dL/dx = dL/da · 1 = y, and the contribution to dL/dy via a is dL/da · 1 = y.
Now combine contributions to y (because y influenced L via two paths):
dL/dy = a + y = (x + y) + y = x + 2y
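A quick numerical sanity check of this result using central finite differences (a sketch; the point x = 2, y = 3 is arbitrary):

```python
def L(x, y):
    a = x + y
    return a * y              # L = (x + y) * y

def numgrad(f, x, y, h=1e-6):
    # Central finite differences approximate the two partial derivatives.
    dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dx, dy

dx, dy = numgrad(L, 2.0, 3.0)
# the chain rule predicts dL/dx = y = 3 and dL/dy = x + 2y = 8
```

Finite differences are slow and approximate, but they are a standard way to spot a missing fan-out contribution in a hand-derived gradient.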
This “sum the contributions at fan-out” rule is one of the most important operational facts about backprop.
Whenever an edge’s value is used in multiple places, its gradient is the sum of gradients coming back from each usage.
Operationally, frameworks maintain an accumulator for each node output: it starts at zero, and each consumer adds its gradient contribution during the backward pass.
For tensors, the same idea holds but with appropriate linear algebra.
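In scalar form, the accumulator idea looks like this (a sketch; `accumulate` is a hypothetical helper, not a framework API):

```python
# Gradient accumulator: one slot per value, summing (never overwriting)
# each contribution that flows back along an edge.
grad = {}

def accumulate(name, contribution):
    grad[name] = grad.get(name, 0.0) + contribution

# In L = (x + y) * y, the value y has two consumers, so it receives
# two contributions during the backward pass (here a = x + y):
x, y = 2.0, 3.0
a = x + y
accumulate("y", a)   # contribution via the multiply node directly
accumulate("y", y)   # contribution via a = x + y
# grad["y"] == a + y == x + 2*y
```

Overwriting instead of summing here is exactly the fan-out bug listed among the pitfalls later in this lesson.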
Example: y = W x
If we have upstream gradient g = dL/dy (a vector in ℝᵐ), then dL/dx = Wᵀ g and dL/dW = g xᵀ.
You don’t need to explicitly form Jacobians; backprop uses efficient vector-Jacobian products.
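As a concrete sketch with plain Python lists (no real framework, and no Jacobian ever materialized), the backward rule for y = W x looks like:

```python
def matvec(W, x):
    # Forward: y = W x
    return [sum(Wij * xj for Wij, xj in zip(row, x)) for row in W]

def vjp_linear(W, x, g):
    # Backward: given upstream g = dL/dy,
    # dL/dx = W^T g  and  dL/dW = g x^T (an outer product).
    dx = [sum(W[i][j] * g[i] for i in range(len(g))) for j in range(len(x))]
    dW = [[gi * xj for xj in x] for gi in g]
    return dx, dW

W = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
y = matvec(W, x)                  # y == [3.0, 7.0]
g = [1.0, 0.0]                    # upstream gradient dL/dy
dx, dW = vjp_linear(W, x, g)
# dx == [1.0, 2.0] (W^T g); dW == [[1.0, 1.0], [0.0, 0.0]] (g x^T)
```

Notice that the backward cost is comparable to the forward matvec; nothing of size m × n × (anything) is ever built.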
To compute gradients correctly, you need to process nodes in reverse topological order (from outputs back to inputs). In a DAG, this is well-defined.
Branches are fine; they just introduce multiple gradient contributions that must sum.
Shared subgraphs (reuse) are also fine; they are exactly “fan-out.”
Some models have logical cycles (recurrent connections). Training still uses backprop by unrolling the cycle over time steps, yielding a larger DAG.
Reverse-mode autodiff is “the chain rule on a graph”: local derivatives at each node, composed along edges, with contributions summed wherever values fan out.
Computational graphs aren’t just a conceptual tool; they determine how training runs on real hardware.
A neural net layer is a subgraph pattern: for example, a linear map, a bias add, and a nonlinearity (y = σ(Wx + b)).
Stacking layers composes graphs.
A whole training step includes:
1) Forward pass to compute predictions and loss L
2) Backward pass to compute gradients ∂L/∂θ
3) Parameter update (e.g., θ ← θ − η ∂L/∂θ)
That’s “forward graph” + “backward graph” + “optimizer graph.”
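Here is the three-part step as a runnable sketch for a toy one-parameter model y = θx with squared loss (all names and values are illustrative):

```python
def train_step(theta, x, t, lr=0.1):
    # 1) Forward pass: prediction and loss
    y = theta * x
    L = 0.5 * (y - t) ** 2
    # 2) Backward pass: dL/dtheta via the chain rule
    dL_dy = y - t
    dL_dtheta = dL_dy * x
    # 3) Parameter update: theta <- theta - eta * dL/dtheta
    return theta - lr * dL_dtheta, L

theta = 0.0
for _ in range(100):
    theta, L = train_step(theta, x=1.0, t=2.0)
# theta approaches 2.0, the value that makes the loss zero
```

Every real training loop has this same three-graph shape; frameworks just automate step 2 from the recorded forward graph.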
If gradients are NaN or 0, a graph helps you ask which node first produced the bad value and along which path the gradient vanished or exploded.
This connects directly to numerical stability: many stability tricks (like log-sum-exp) are specific subgraph rewrites.
Once computation is explicit, systems can optimize it: reorder and schedule operations, fuse kernels, and plan memory reuse.
The graph is also used for device placement: deciding which subgraphs run on which CPU, GPU, or TPU.
Different frameworks expose graphs differently.
| Style | How graph is built | Pros | Cons | Examples |
|---|---|---|---|---|
| Static graph | define graph, then run | global optimization, easier deployment | less flexible debugging, control flow harder | TF 1.x, XLA-compiled graphs |
| Dynamic graph | graph created as code runs | very flexible, Pythonic debugging | fewer global optimizations unless compiled | PyTorch eager, JAX tracing (hybrid) |
Modern systems blur the line (e.g., torch.compile, JAX jit): write dynamic code, then trace/compile into a more static graph for speed.
A central reason computational graphs matter is that learning signals (gradients) follow the same dependency paths as values—but reversed.
That provides a practical “sanity check” when designing architectures: if no forward path connects a parameter to the loss, that parameter receives no gradient and cannot learn.
Computational graphs are the map; autodiff is the navigation algorithm; deep learning is the city you can now traverse.
Define a computational graph with
u = x + y
z = u · y
w = u²
L = z + w
Goal: compute dL/dx and dL/dy using the graph + chain rule.
Forward compute intermediate values: u = x + y, then z = u · y, w = u², and finally L = z + w.
Initialize gradient at the output: dL/dL = 1.
Because L = z + w, local derivatives are ∂L/∂z = 1 and ∂L/∂w = 1.
Backprop through w = u²: the local derivative is dw/du = 2u.
So contribution to u from this path: (∂L/∂w) · 2u = 2u.
Backprop through z = u · y:
Local derivatives: ∂z/∂u = y and ∂z/∂y = u.
So: the contribution to u from this path is y, and the direct contribution to y is u.
Accumulate gradients at u (fan-out merge): dL/du = 2u + y.
Backprop through u = x + y:
Local derivatives: ∂u/∂x = 1 and ∂u/∂y = 1.
So: dL/dx = dL/du = 2u + y, and the contribution to y via u is also 2u + y.
Accumulate total gradient for y (two paths): dL/dy = u + (2u + y) = 3u + y.
Substitute u = x + y: dL/dx = 2(x + y) + y = 2x + 3y, and dL/dy = 3(x + y) + y = 3x + 4y.
Insight: Whenever a value is used by multiple later operations (fan-out), its gradient is the sum of gradients flowing back along each outgoing path. This is the graph version of “collect all chain rule paths and add them.”
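A finite-difference check of this worked example (u = x + y, z = u·y, w = u², L = z + w) confirms the fan-out accounting; the evaluation point is arbitrary:

```python
def L(x, y):
    u = x + y
    z = u * y
    w = u ** 2
    return z + w

def numgrad(f, x, y, h=1e-6):
    # Central finite differences for both partial derivatives.
    return ((f(x + h, y) - f(x - h, y)) / (2 * h),
            (f(x, y + h) - f(x, y - h)) / (2 * h))

x, y = 1.5, -0.5
dx, dy = numgrad(L, x, y)
# chain rule on the graph gives dL/dx = 2x + 3y and dL/dy = 3x + 4y
```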
Let x ∈ ℝⁿ, W ∈ ℝᵐˣⁿ, b ∈ ℝᵐ, target t ∈ ℝᵐ.
Define: y = W x + b and L = 1/2 ‖y − t‖².
Compute gradients dL/dy, dL/dW, dL/db, dL/dx.
Rewrite loss with an intermediate:
Let e = y − t.
Then L = 1/2 ‖e‖² = 1/2 ∑ᵢ eᵢ².
Compute gradient w.r.t. e:
For each component, ∂L/∂eᵢ = 1/2 · 2eᵢ = eᵢ.
So in vector form: dL/de = e.
Backprop through e = y − t: the local derivative ∂e/∂y is the identity.
So: dL/dy = dL/de = e = y − t.
Backprop through y = W x + b:
Split into two nodes: u = W x, then y = u + b.
Upstream gradient is g = dL/dy.
For y = u + b: addition passes gradients through unchanged, so dL/du = g and dL/db = g.
For u = W x:
Using matrix calculus identities: dL/dW = g xᵀ (an outer product) and dL/dx = Wᵀ g.
Optionally substitute g = (y − t) and y = W x + b: dL/db = W x + b − t, dL/dW = (W x + b − t) xᵀ, and dL/dx = Wᵀ (W x + b − t).
Insight: Computational graphs scale because each node only needs a local backward rule (like for + or matmul). Reverse-mode AD stitches those local rules into full gradients without ever building a giant Jacobian explicitly.
Consider probabilities from logits s ∈ ℝᵏ via softmax:
If you then compute the loss as the negative log-probability of class c: L = −log softmax(s)꜀.
Show how the computational graph suggests a stable rewrite using log-sum-exp.
Start from the definition:
softmax(s)꜀ = exp(s꜀) / ∑ⱼ exp(sⱼ)
So:
L = −log(exp(s꜀) / ∑ⱼ exp(sⱼ)).
Use log rules:
L = −(log(exp(s꜀)) − log(∑ⱼ exp(sⱼ)))
= −(s꜀ − log(∑ⱼ exp(sⱼ)))
= log(∑ⱼ exp(sⱼ)) − s꜀.
Identify numerical issue in the original graph:
The naive path computes exp(sⱼ) which can overflow if some sⱼ is large.
Graph insight: the problematic node is exp(·) applied to raw logits.
Apply the log-sum-exp trick by subtracting m = maxⱼ sⱼ:
∑ⱼ exp(sⱼ) = ∑ⱼ exp((sⱼ − m) + m)
= exp(m) ∑ⱼ exp(sⱼ − m)
So:
log(∑ⱼ exp(sⱼ)) = log(exp(m) ∑ⱼ exp(sⱼ − m))
= m + log(∑ⱼ exp(sⱼ − m)).
Final stable expression:
Let m = maxⱼ sⱼ.
Then:
L = (m + log(∑ⱼ exp(sⱼ − m))) − s꜀.
All exponentials now see inputs (sⱼ − m) ≤ 0, reducing overflow risk.
Insight: Computational graphs aren’t just for gradients; they help you spot unstable subgraphs (like exp feeding into sums and logs) and replace them with equivalent but stable subgraphs.
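A minimal sketch of the rewritten subgraph, assuming plain Python floats:

```python
import math

def logsumexp(s):
    # Stable rewrite: log(sum_j exp(s_j)) = m + log(sum_j exp(s_j - m)),
    # where m = max_j s_j, so every exponent is <= 0.
    m = max(s)
    return m + math.log(sum(math.exp(sj - m) for sj in s))

def neg_log_softmax(s, c):
    # L = logsumexp(s) - s_c, the stable form of -log softmax(s)_c
    return logsumexp(s) - s[c]

# The naive graph would compute math.exp(1000.0) and overflow;
# the rewritten graph stays finite:
L = neg_log_softmax([1000.0, 0.0], c=0)
```

With logits [1000, 0] and c = 0, the loss is essentially 0: the model is effectively certain of class 0, yet no exponential ever overflows.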
A computational graph represents a function/program as a directed graph: nodes are operations f(…), edges carry tensors (values) and encode dependencies.
The forward pass evaluates nodes in a dependency-respecting order (often a topological order in a DAG).
Reverse-mode autodiff (backprop) computes gradients by propagating upstream gradients backward through nodes using local derivatives and the chain rule.
When a value fans out to multiple consumers, its gradient is the sum of contributions from each path (gradient accumulation).
For tensor operations, backprop uses efficient vector-Jacobian products (e.g., dL/dx = Wᵀ dL/dy) rather than forming full Jacobians.
Graphs enable practical system features: saving intermediates, checkpointing, scheduling, kernel fusion, and device placement.
Many numerical stability techniques can be seen as graph rewrites (e.g., log-sum-exp for softmax-related computations).
Forgetting to accumulate gradients at fan-out points (overwriting instead of summing contributions).
Confusing the direction of data flow (forward) with the direction of gradient flow (backward) and mixing up which quantity is “upstream.”
Dropping needed forward intermediates (e.g., u, needed for the cos(u) local derivative of sin(u)) and then being unable to compute local derivatives correctly without recomputation.
Ignoring tensor shapes/broadcasting rules, leading to silent shape errors or incorrect gradient shapes.
Let a = x · y, b = a + y, L = b². Compute dL/dx and dL/dy.
Hint: Do forward intermediates a, b. Backprop from L to b to (a, y) and then to (x, y). Remember y affects b both directly and via a.
Forward:
a = x y
b = a + y = x y + y = y(x + 1)
L = b²
Backward:
dL/db = 2b
For b = a + y: ∂b/∂a = 1 and ∂b/∂y = 1, so dL/da = 2b and the direct contribution to dL/dy is 2b.
For a = x y: ∂a/∂x = y and ∂a/∂y = x, so dL/dx = 2b · y and the contribution to dL/dy via a is 2b · x.
Total:
dL/dx = 2by
dL/dy = 2b + 2bx = 2b(1 + x)
Substitute b = y(x + 1):
dL/dx = 2 y(x + 1) · y = 2 y² (x + 1)
dL/dy = 2 y(x + 1) (1 + x) = 2 y (x + 1)²
Consider y = ReLU(W x + b) and loss L = ∑ᵢ yᵢ (sum of outputs). Express dL/dW in terms of x and a mask derived from the pre-activation h = W x + b.
Hint: Use dL/dy = 1 (vector of ones). ReLU’(h) is 1 where hᵢ > 0 else 0. Then dL/dW looks like an outer product.
Let h = W x + b, y = ReLU(h).
Given L = ∑ᵢ yᵢ, we have dL/dy = 1 (all ones).
Backprop through ReLU:
Define mask m where mᵢ = 1 if hᵢ > 0 else 0.
Then dL/dh = dL/dy ⊙ m = 1 ⊙ m = m.
Backprop through h = W x + b:
For matrix multiply, dL/dW = (dL/dh) xᵀ = m xᵀ.
So dL/dW is an outer product of the ReLU mask with the input x.
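A sketch of this computation with plain Python lists (the layer sizes and values are arbitrary):

```python
def relu_layer_grad_W(W, x, b):
    # Forward pre-activation h = W x + b
    h = [sum(Wij * xj for Wij, xj in zip(row, x)) + bi
         for row, bi in zip(W, b)]
    # ReLU mask: with dL/dy all ones, dL/dh = m
    m = [1.0 if hi > 0 else 0.0 for hi in h]
    # dL/dW = m x^T, an outer product
    return [[mi * xj for xj in x] for mi in m]

W = [[1.0, -1.0], [-1.0, -1.0]]
x = [2.0, 1.0]
b = [0.0, 0.0]
dW = relu_layer_grad_W(W, x, b)
# h == [1.0, -3.0], so m == [1.0, 0.0] and dW == [[2.0, 1.0], [0.0, 0.0]]
```

The second row of dW is all zeros because that unit's pre-activation is negative: the ReLU kills its gradient entirely.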
A value u feeds into two branches: v = u² and w = 3u. The final output is L = v · w. Compute dL/du by treating it as a computational graph and showing the gradient accumulation clearly.
Hint: Compute dL/dv and dL/dw first from L = v w. Then push back to u through each branch and add.
Forward:
v = u²
w = 3u
L = v w
Backward:
From L = v w: dL/dv = w and dL/dw = v.
Backprop through v = u²: the local derivative is dv/du = 2u.
So contribution:
dL/du |_via_v = dL/dv · dv/du = w · 2u
Backprop through w = 3u: the local derivative is dw/du = 3.
So contribution:
dL/du |_via_w = dL/dw · dw/du = v · 3
Accumulate at u:
dL/du = (w · 2u) + (v · 3)
Substitute v = u² and w = 3u:
dL/du = (3u · 2u) + (u² · 3)
= 6u² + 3u²
= 9u².
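A finite-difference check of this exercise (a sketch; the point u = 2 is arbitrary):

```python
def L(u):
    v = u ** 2
    w = 3 * u
    return v * w        # L = v * w = 3 u^3

def numgrad(f, u, h=1e-6):
    # Central finite difference in one variable.
    return (f(u + h) - f(u - h)) / (2 * h)

g = numgrad(L, 2.0)
# the graph accumulation gave dL/du = 9 u^2, i.e. 36 at u = 2
```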