Architecture pattern that adds the input of a layer to its output to enable training of very deep networks and ease gradient flow; understand the motivation, basic implementation, and effect on optimization. Residual connections are a structural backbone of Transformer layers.
Self-serve tutorial - low prerequisites, straightforward concepts.
Deep networks can be powerful, but they’re often hard to train: as you stack more layers, optimization can stall or become unstable. Residual (skip) connections fix a surprisingly big part of this with a tiny structural change: let information (and gradients) flow around a layer by adding the input back to the output.
A residual block outputs y = x + F(x). Instead of forcing a stack of layers to learn a full mapping H(x), you let it learn a residual F(x) = H(x) − x. This makes the identity mapping easy to represent, improves gradient flow, and enables training much deeper networks. Residual connections are core to modern architectures, including Transformers (via Add & Norm).
When you build a deep neural network, you typically compose many transformations, one per layer: each layer's output becomes the next layer's input.
In principle, deeper networks can represent more complex functions. In practice, very deep stacks can become hard to optimize: training loss stops improving, gradients become unhelpful, and the model struggles to learn even “simple” behaviors.
Residual (skip) connections are a structural pattern designed to make deep stacks trainable.
A residual connection means a layer (or block) computes:
y = x + F(x)
where x is the block's input and F(x) is the learned transform (for example, a small sub-network). This is why it’s also called a skip connection: x “skips” around the transform F and is added back in.
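As a minimal sketch, the whole pattern is one addition. The function name and the toy transform below are hypothetical stand-ins for a real layer:

```python
def residual_block(x, f):
    """Residual (skip) connection around a transform f: y = x + f(x)."""
    return x + f(x)

# A toy transform standing in for a learned layer
scale_down = lambda x: 0.1 * x
y = residual_block(2.0, scale_down)
print(y)  # 2.0 plus a small learned correction
```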
Without a residual connection, a block must output the target mapping directly: y = H(x). With a residual connection, the block outputs y = x + F(x). So the block isn’t forced to generate the whole signal from scratch; it can instead learn an edit to the input.
If the best thing to do is “do nothing,” the residual block can set F(x) ≈ 0, giving y ≈ x. That “identity solution” turns out to be extremely helpful for optimization.
You can rewrite the residual form as H(x) = x + F(x), i.e., F(x) = H(x) − x.
The representational power is not reduced—H can still be learned—but the parameterization makes certain useful functions (especially identity) much easier to reach during training.
Residual connections show up in many places: ResNets in computer vision are the most famous example, and the idea recurs across modern architectures. In Transformers, you’ll repeatedly see patterns like y = x + Attention(x) and y = x + MLP(x) (the “Add” in Add & Norm).
That single plus sign is a big part of why very deep Transformer stacks can train at all.
Training uses gradient-based optimization. If the gradient of the loss with respect to early layers becomes extremely small or chaotic, learning stalls.
Residual connections help by providing a direct path from later layers back to earlier layers.
To see the key idea without getting lost in tensor notation, consider a single residual block with scalar input x:
Let L be the loss depending on y. By the chain rule:
∂L/∂x = ∂L/∂y · ∂y/∂x
Now compute ∂y/∂x:
y = x + F(x)
∂y/∂x = 1 + ∂F/∂x
So:
∂L/∂x = ∂L/∂y · (1 + ∂F/∂x)
The critical term is the “1” coming from the skip connection.
This doesn’t magically eliminate vanishing/exploding gradients in every setting, but it adds a stable component to gradient propagation.
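A quick numeric illustration of that stable component, using toy scalar derivatives rather than a real network: multiplying small local derivatives across depth vanishes, while multiplying (1 + derivative) terms does not:

```python
# Compare backprop multipliers through 20 stacked blocks when each
# learned transform has a tiny local derivative dF/dx = 0.1.
dF = 0.1
depth = 20
plain = dF ** depth            # product of a's: vanishes toward 0
residual = (1 + dF) ** depth   # product of (1 + a)'s: stays healthy

print(plain)     # astronomically small
print(residual)  # modest, well away from zero
```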
Let x ∈ ℝᵈ, and F maps ℝᵈ → ℝᵈ. Then:
y = x + F(x)
The Jacobian is:
∂y/∂x = I + ∂F/∂x
So the gradient to earlier layers includes an identity term I. Across many stacked blocks, this means there exists a path where the gradient multiplies by (approximately) identity instead of repeatedly multiplying by potentially small/large Jacobians.
Consider a deep stack of residual blocks:
x₀ = input
For k = 0…N−1:
xₖ₊₁ = xₖ + Fₖ(xₖ)
If all Fₖ are near 0 early in training, then:
xₖ₊₁ ≈ xₖ
So the entire deep network initially behaves close to the identity mapping—meaning signals don’t get destroyed just because you made the network deep.
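This near-identity behavior can be simulated with a toy scalar stack; the weights here are hypothetical stand-ins for residual branches that start near zero:

```python
import random

random.seed(0)

def stack_forward(x, weights):
    # Each block computes x + w*x, with w*x playing the role of F(x)
    for w in weights:
        x = x + w * x
    return x

# 50 blocks whose residual branches start near zero
weights = [random.uniform(-0.01, 0.01) for _ in range(50)]
y = stack_forward(1.0, weights)
print(y)  # close to the input 1.0 despite 50 layers
```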
This is one of the most important practical effects:
You can add more layers to gain capacity, and training remains feasible because the network can default to “almost identity” until each block learns useful residual corrections.
A helpful mental model: each block nudges the current representation a little rather than rebuilding it from scratch. This shifts the optimization landscape toward something easier to navigate.
| Property | Plain stack (no skip) | Residual stack |
|---|---|---|
| Block form | y = F(x) | y = x + F(x) |
| Identity mapping | Must be learned by parameters | Achieved by F(x) = 0 |
| Gradient term | ∂F/∂x | I + ∂F/∂x |
| Deep behavior early in training | Can distort signals | Often close to identity |
| Practical depth | Limited | Much larger |
Residual connections don’t replace good initialization, normalization, and learning-rate choices—but they make deep learning dramatically more robust.
Suppose the function you ultimately want is H(x). In many deep models, especially when stacking many similar blocks, H(x) may be close to x much of the time.
Examples of “close to identity” needs: later blocks that only fine-tune already-useful features, or blocks that should pass most of the signal through unchanged.
If H(x) ≈ x, then the difference H(x) − x is small.
Residual blocks make the model explicitly represent that difference:
F(x) = H(x) − x
This can be easier to learn because the optimization target is a small correction near zero rather than a full reconstruction of the input signal.
Imagine a block F is implemented as a small network, like:
F(x) = W₂ σ(W₁ x)
If you want the overall mapping to be identity (y = x), then in a plain network you would need:
W₂ σ(W₁ x) = x
That’s a strong constraint; depending on σ, it might be awkward or require special parameter values.
In a residual network, identity is just:
y = x + W₂ σ(W₁ x) = x
which is achieved whenever:
W₂ σ(W₁ x) = 0
A sufficient condition is simply:
W₂ = 0 (or very small)
This shows why “do nothing” is naturally available to the model.
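A toy one-hidden-unit version of this block (the names and sizes are illustrative) makes the “do nothing” solution concrete:

```python
def relu(z):
    return max(0.0, z)

def block(x, w1, w2):
    # Residual block with a tiny one-hidden-unit MLP: y = x + w2 * relu(w1 * x)
    return x + w2 * relu(w1 * x)

# With w2 = 0 the residual branch vanishes and the block is exactly identity,
# regardless of w1 or the nonlinearity.
print(block(3.5, w1=1.7, w2=0.0))  # 3.5
```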
Rearrange the residual update:
xₖ₊₁ = xₖ + Fₖ(xₖ)
This looks like an iterative refinement method: each block applies a small correction Fₖ to the current estimate xₖ.
If you squint, it resembles a discretized dynamical system:
xₖ₊₁ − xₖ = Fₖ(xₖ)
This perspective helps explain why very deep residual networks behave like many small steps rather than a few huge jumps.
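Under that reading, a residual stack resembles an explicit Euler integrator. Here is a toy version that makes the step size h explicit (in a residual network, h is absorbed into F):

```python
def euler_like_stack(x, f, steps, h):
    # Repeated residual updates x <- x + h*f(x) are Euler steps of dx/dt = f(x)
    for _ in range(steps):
        x = x + h * f(x)
    return x

# Many small steps of f(x) = -x decay x smoothly toward 0,
# mimicking the ODE dx/dt = -x
print(euler_like_stack(1.0, lambda v: -v, steps=100, h=0.05))
```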
Residual connections are conceptually simple, but practical implementations need two details: matching shapes for the addition, and a choice of where to apply normalization. Elementwise addition requires the input and the transform output to have the same shape. If shapes differ, a common strategy is to apply a learned projection P to the shortcut path:

y = P(x) + F(x)
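A minimal sketch of the projection shortcut, using plain Python lists and a hypothetical 2×3 projection matrix:

```python
def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def projected_residual(x, W_proj, F):
    # y = P(x) + F(x): project the shortcut so both terms share a shape
    p = matvec(W_proj, x)
    fx = F(x)
    return [pi + fi for pi, fi in zip(p, fx)]

# Hypothetical shapes: x in R^3, F(x) in R^2, so W_proj is 2x3
x = [1.0, 2.0, 3.0]
W_proj = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]              # keep the first two coordinates
F = lambda v: [0.1 * v[0], 0.1 * v[1]]  # toy transform into R^2
print(projected_residual(x, W_proj, F))
```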
In Transformer literature you’ll see two placements of normalization: pre-norm (normalize the sublayer input) and post-norm (normalize after the residual addition).
Both use residual addition; they differ in optimization behavior and stability for very deep stacks. The key idea for this node is the residual path itself: the “+ x” that preserves a direct route for information.
It’s tempting to think residual connections add representational power. But the main win is reparameterization: the same set of functions remains reachable, while identity and near-identity mappings become trivial to express.
That’s why residual connections are discussed primarily as an optimization technique—even though they also interact with generalization and training dynamics in interesting ways.
A Transformer layer has multiple complex subcomponents: a self-attention sublayer, an MLP (feed-forward) sublayer, and normalization.
Training would be much harder if each layer had to fully rewrite token representations. Residual connections make each sublayer behave like a controlled update to the current representation.
For a token representation vector x (really a matrix of token vectors), a typical sublayer wrapper is:
y = x + Sublayer(x)
Then normalization is applied either before or after depending on design.
A simplified (pre-norm) Transformer block looks like:
1) Attention sublayer
x₁ = x + Attention(LayerNorm(x))
2) MLP sublayer
x₂ = x₁ + MLP(LayerNorm(x₁))
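A toy version of this pre-norm wrapper; the attention and MLP stand-ins below are hypothetical “do nothing” functions, which is exactly the regime residuals make easy:

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and unit variance (no learned scale/shift)
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def pre_norm_block(x, attention, mlp):
    # x1 = x + Attention(LayerNorm(x)); x2 = x1 + MLP(LayerNorm(x1))
    x1 = [a + b for a, b in zip(x, attention(layer_norm(x)))]
    x2 = [a + b for a, b in zip(x1, mlp(layer_norm(x1)))]
    return x2

# Toy stand-ins for the real sublayers
attention = lambda v: [0.0 for _ in v]   # "do nothing" attention
mlp = lambda v: [0.0 for _ in v]         # "do nothing" MLP
print(pre_norm_block([1.0, 2.0, 3.0], attention, mlp))  # input passes through
```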
Each residual addition means the sublayer contributes an update to the token representation rather than replacing it. Self-attention produces context-aware mixtures of token information, but residual addition ensures each token’s original representation survives alongside the attention output.
This is especially important early in training, when attention weights may be noisy.
In real training runs, residual connections often enable much deeper stacks, faster convergence, and more stable optimization.
Residuals are not sufficient alone—Transformers also rely heavily on normalization, learning-rate schedules, and careful initialization. But the skip connection is a structural backbone.
Residuals are usually beneficial, but you still must manage initialization, normalization, and learning-rate schedules.
A common engineering tactic is to ensure the residual branch starts small (e.g., initialization choices), so early training behaves close to identity and gradually learns stronger updates.
Residual connections are a pattern you’ll see across architectures because they solve a general problem: keeping deep stacks of layers optimizable by preserving a direct route for signal and gradient.
Transformers are one of the most important modern examples, but the core math is the same: y = x + F(x).
Let F(x) = a·x (a is a scalar parameter). Compare a plain block y = F(x) to a residual block y = x + F(x). Suppose the loss is L = ½·(y − t)² for target t. Compute ∂L/∂x for both cases.
Plain block:
y = F(x) = a·x
L = ½·(a·x − t)²
Compute ∂L/∂y:
∂L/∂y = (y − t)
Compute ∂y/∂x:
y = a·x ⇒ ∂y/∂x = a
Chain rule:
∂L/∂x = ∂L/∂y · ∂y/∂x
= (y − t) · a
= (a·x − t) · a
Residual block:
y = x + F(x) = x + a·x = (1 + a)·x
L = ½·((1 + a)·x − t)²
Compute ∂y/∂x:
y = (1 + a)·x ⇒ ∂y/∂x = (1 + a)
(Equivalently: y = x + a·x ⇒ ∂y/∂x = 1 + a.)
Chain rule:
∂L/∂x = (y − t) · (1 + a)
= (((1 + a)·x) − t) · (1 + a)
Insight: In the plain case, the backprop multiplier is a. If a is small, gradients shrink. In the residual case, the multiplier is (1 + a), which includes a built-in 1 from the skip path—making it harder for gradients to vanish purely because the learned transform is weak early in training.
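The two formulas can be checked numerically with finite differences (the helper names below are illustrative):

```python
def grad_plain(x, a, t):
    # dL/dx for the plain block y = a*x, with L = 0.5*(y - t)^2
    return (a * x - t) * a

def grad_residual(x, a, t):
    # dL/dx for the residual block y = (1 + a)*x
    return ((1 + a) * x - t) * (1 + a)

def num_grad(loss, x, h=1e-6):
    # Central finite difference approximation of dL/dx
    return (loss(x + h) - loss(x - h)) / (2 * h)

x, a, t = 2.0, 0.05, 1.0
assert abs(grad_plain(x, a, t) - num_grad(lambda z: 0.5 * (a * z - t) ** 2, x)) < 1e-6
assert abs(grad_residual(x, a, t) - num_grad(lambda z: 0.5 * ((1 + a) * z - t) ** 2, x)) < 1e-6
print(grad_plain(x, a, t), grad_residual(x, a, t))
```

Note how the plain multiplier a = 0.05 scales the gradient down sharply, while the residual multiplier 1 + a = 1.05 leaves it near full strength.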
Assume the desired function is H(x) = 3x + 2 (scalar). If we implement a residual block y = x + F(x), what residual function F(x) must be learned? Verify that the residual formulation reproduces H(x).
Start from the relationship:
H(x) = x + F(x)
So:
F(x) = H(x) − x
Plug in H(x) = 3x + 2:
F(x) = (3x + 2) − x
= 2x + 2
Verify by substitution:
y = x + F(x)
= x + (2x + 2)
= 3x + 2
= H(x)
Insight: Residual learning doesn’t restrict what you can represent; it changes what the block is asked to model. If H(x) is close to x, then F(x) is small—often easier to learn robustly with gradient-based optimization.
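The verification can also be run mechanically:

```python
H = lambda x: 3 * x + 2          # target mapping
F = lambda x: 2 * x + 2          # residual: F(x) = H(x) - x
residual_block = lambda x: x + F(x)

# The residual form reproduces H at every input we try
for v in [-3.0, 0.0, 2.5]:
    assert residual_block(v) == H(v)
print("residual form reproduces H(x) = 3x + 2")
```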
Let x ∈ ℝ³, but suppose your transform produces F(x) ∈ ℝ² (e.g., you changed hidden size). You cannot add x + F(x) directly. Show how a projection P fixes this, and write the new residual form.
Problem: elementwise addition requires matching shapes.
x ∈ ℝ³
F(x) ∈ ℝ²
So x + F(x) is undefined.
Introduce a learned projection P: ℝ³ → ℝ², often linear:
P(x) = W x + b
where W is 2×3 and b ∈ ℝ².
Define the residual output:
y = P(x) + F(x)
Now both terms are in ℝ², so the sum is valid.
Interpretation: the shortcut is no longer a pure identity, but it remains a cheap linear path that carries the input forward into the new dimensionality.
Insight: Residual connections fundamentally require an additive merge. If dimensions change, you can keep the residual idea by making the shortcut path match dimensions via a cheap projection.
A residual (skip) connection outputs y = x + F(x) using elementwise addition.
Residual learning reframes the target mapping H(x) as a residual F(x) = H(x) − x, often making optimization easier.
The skip path contributes an identity term to backprop: ∂y/∂x = I + ∂F/∂x, improving gradient flow in deep stacks.
Residual blocks make the identity mapping easy: set F(x) ≈ 0 ⇒ y ≈ x.
Deep residual stacks behave like iterative refinement: xₖ₊₁ = xₖ + Fₖ(xₖ) (many small updates).
Elementwise addition requires matching shapes; if dimensions differ, use a projection shortcut y = P(x) + F(x).
Transformers rely on residual connections around attention and MLP sublayers (Add & Norm), enabling stable training at depth.
Forgetting that “+” is elementwise: trying to add tensors with mismatched shapes without a projection shortcut.
Thinking residual connections mainly increase expressivity; their primary benefit is optimization via easier identity solutions and improved gradient flow.
Assuming residuals guarantee no vanishing/exploding gradients in any circumstance; they help a lot, but normalization, initialization, and learning rates still matter.
Mixing up where normalization happens in Transformers (pre-norm vs post-norm) and attributing all stability differences solely to the residual connection.
Let x ∈ ℝᵈ and a residual block be y = x + F(x). Show (using Jacobians) that ∂y/∂x = I + ∂F/∂x.
Hint: Differentiate y componentwise: yᵢ = xᵢ + Fᵢ(x). What is ∂xᵢ/∂xⱼ?
Write yᵢ = xᵢ + Fᵢ(x).
Then for each i, j:
∂yᵢ/∂xⱼ = ∂xᵢ/∂xⱼ + ∂Fᵢ/∂xⱼ
But ∂xᵢ/∂xⱼ = 1 if i = j, else 0, which is exactly the identity matrix I.
So the Jacobian is:
∂y/∂x = I + ∂F/∂x.
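A finite-difference check of this Jacobian identity, with a small hypothetical F on ℝ²:

```python
def F(x):
    # Hypothetical linear transform on R^2: F(x) = [0.1*x0 + 0.2*x1, 0.3*x0]
    return [0.1 * x[0] + 0.2 * x[1], 0.3 * x[0]]

def residual(x):
    # y = x + F(x), elementwise
    return [xi + fi for xi, fi in zip(x, F(x))]

def jacobian(f, x, h=1e-6):
    # Central finite differences: J[i][j] = d f_i / d x_j
    J = []
    for i in range(len(f(x))):
        row = []
        for j in range(len(x)):
            xp, xm = list(x), list(x)
            xp[j] += h
            xm[j] -= h
            row.append((f(xp)[i] - f(xm)[i]) / (2 * h))
        J.append(row)
    return J

# Expect I + dF/dx = [[1.1, 0.2], [0.3, 1.0]]
print(jacobian(residual, [1.0, 2.0]))
```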
Suppose a deep network is a stack of N residual blocks: xₖ₊₁ = xₖ + Fₖ(xₖ). If all Fₖ are exactly 0, what function does the whole network compute? What does this imply about adding more layers?
Hint: If Fₖ = 0, then xₖ₊₁ = xₖ. Unroll the recurrence.
If Fₖ ≡ 0 for every k, then:
x₁ = x₀ + 0 = x₀
x₂ = x₁ + 0 = x₁ = x₀
...
x_N = x₀
So the whole network computes the identity function.
Implication: adding layers does not force the network to change behavior; the model can default to identity and only learn nonzero residuals where useful. This makes very deep architectures easier to optimize.
You have a residual block where x ∈ ℝ⁶, but the transform produces F(x) ∈ ℝ⁴. Propose two ways to make a valid residual connection, and write the resulting equations.
Hint: Either change the transform to output ℝ⁶, or project the skip path into ℝ⁴ (or project the transform into ℝ⁶).
Two valid approaches:
1) Keep the skip path unchanged and make the transform match ℝ⁶: y = x + F̃(x), where F̃: ℝ⁶ → ℝ⁶.
2) Use a projection shortcut from ℝ⁶ → ℝ⁴: y = P(x) + F(x), where P(x) = W x with W a 4×6 matrix.
(Variants: you could also project F(x) up to ℝ⁶ and add to x, but typically you project the shortcut when changing widths.)
Related next nodes you’ll often want after this: layer normalization (the “Norm” in Add & Norm), pre-norm vs post-norm Transformer blocks, and attention sublayers.