Convolution is the sliding-window dot product between a kernel and an input (1D/2D), used to extract local patterns and build translation-equivariant representations in convolutional networks. Grasping how stride, padding, and kernel size affect output shape and receptive field is key to CNN design.
Deep-dive lesson: an accessible entry point, but dense material. Use worked examples and spaced repetition.
Convolution is the workhorse operation behind CNNs: a small set of weights (a kernel) slides across an input and produces a new signal/image whose values tell you “how much this local pattern appears here.” The power comes from doing the same computation everywhere (weight sharing), which naturally builds translation-equivariant representations.
A (discrete) convolution layer computes a sliding-window dot product between a kernel and local patches of the input. Stride, padding, and dilation control where the kernel is applied and therefore determine output shape and receptive field. In deep learning libraries, what’s called “convolution” is usually cross-correlation (no kernel flip), but the shape math and intuition are the same.
This node is designed to be foundational, but a few micro-prereqs and conventions will save you from common shape bugs.
You should be comfortable with:
1) Indices and summations
2) Dot product
3) Arrays / tensors and shapes
4) Basic CNN layer conventions
In classical signal processing, discrete convolution flips the kernel. In most deep learning libraries, the operation named conv is actually cross-correlation (no flip). For 1D:

y[n] = ∑_{i=0}^{k−1} w[i] x[n+i]   (cross-correlation, as used in DL libraries)

whereas classical convolution sums w[i] x[n−i], i.e. it applies the flipped kernel.
For learning CNNs, the “pattern detector sliding” intuition works either way; the network can learn flipped weights if needed. What matters most in practice is shape math and which positions are included.
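Both conventions fit in a few lines; here is a minimal pure-Python sketch (the function names are illustrative, not a library API):

```python
def cross_correlate_1d(x, w):
    """Slide w over x with no flip: valid positions only (stride 1, no padding)."""
    k = len(w)
    return [sum(w[i] * x[n + i] for i in range(k)) for n in range(len(x) - k + 1)]

def convolve_1d(x, w):
    """Classical convolution is cross-correlation with the flipped kernel."""
    return cross_correlate_1d(x, w[::-1])

print(cross_correlate_1d([1, 2, 0, -1, 3, 1], [2, -1, 1]))  # → [0, 3, 4, -4]
```

Note that a network trained with either convention learns the same family of functions: flipping `w` turns one into the other.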
Different frameworks store tensors differently:
| Name | Meaning | Common in |
|---|---|---|
| NCHW | [batch, channels, height, width] | PyTorch (default), many CUDA kernels |
| NHWC | [batch, height, width, channels] | TensorFlow (often), some accelerators |
The convolution math is identical; only the dimension ordering changes.
Libraries define padding="same" to mean “output spatial size is approximately input size when stride=1,” but when stride>1 there are multiple choices (rounding up vs down, distributing extra pad left/right). When you care about exact sizes, use the explicit formula with (p_left, p_right) rather than relying on a name.
Dilation spaces out kernel taps. It changes the effective kernel size and receptive field without increasing parameter count.
Keep these in mind; they directly address the most common beginner failure mode: implementing a conv layer and getting an output whose shape doesn’t match your expectation.
Many data types have a strong notion of locality:
A fully connected layer mixes everything with everything, which is expensive and ignores locality. Convolution instead says:
1) Look at a small local patch.
2) Compute a dot product with a kernel (a learned pattern).
3) Slide that kernel across positions, reusing the same weights.
This reuse is called weight sharing. It gives you an inductive bias: if a feature (like a vertical edge) matters at one location, it likely matters at other locations too.
Let x be a 1D input of length L and w be a kernel of length k. We form an output y where each y[n] is a weighted sum of a window of x.
A common deep-learning definition (cross-correlation) is:

y[n] = ∑_{i=0}^{k−1} w[i] x[n+i], for n = 0, …, L − k   (stride 1, no padding)

Interpretation: the kernel is anchored at position n and overlaps the patch x[n], x[n+1], …, x[n+k−1].

So each y[n] is a dot product between the kernel w and that local patch.
For an image x with height H and width W, and a 2D kernel w of size k_h × k_w, the output at (u, v) is:

y[u, v] = ∑_{i=0}^{k_h−1} ∑_{j=0}^{k_w−1} w[i, j] x[u+i, v+j]
Again, it’s a dot product between a flattened patch and the flattened kernel.
Images typically have channels (RGB), and intermediate CNN layers have many channels. A convolution kernel spans all input channels.
If input x has C_in channels and the kernel has C_in channels too, then:

y[u, v] = ∑_{c=0}^{C_in−1} ∑_{i=0}^{k_h−1} ∑_{j=0}^{k_w−1} w[c, i, j] x[c, u+i, v+j]

And if you want C_out output channels, you learn C_out different kernels, one per output channel. The full weight tensor is shaped:

(C_out, C_in, k_h, k_w)   (PyTorch convention; TensorFlow stores filters as (k_h, k_w, C_in, C_out))
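To make the channel summation concrete, here is a tiny pure-Python sketch (nested lists stand in for tensors; the example numbers are made up):

```python
def conv_multi_channel_at(x, w, u, v):
    """x: [C_in][H][W] input, w: [C_in][k_h][k_w] single kernel.
    Returns the one output scalar at spatial position (u, v):
    a dot product over all channels and all kernel taps."""
    c_in, k_h, k_w = len(w), len(w[0]), len(w[0][0])
    return sum(
        w[c][i][j] * x[c][u + i][v + j]
        for c in range(c_in) for i in range(k_h) for j in range(k_w)
    )

# 2-channel 2x2 input, one 2-channel 2x2 kernel → a single output value
x = [[[1, 2], [3, 4]],
     [[5, 6], [7, 8]]]
w = [[[1, 0], [0, 1]],
     [[1, 1], [0, 0]]]
print(conv_multi_channel_at(x, w, 0, 0))  # → 16 (= 1+4 from channel 0, 5+6 from channel 1)
```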
1) Local receptive field: y[u,v] depends only on a local neighborhood of x.
2) Translation equivariance: if you shift the input, the output shifts correspondingly (ignoring boundary effects from padding). This comes from applying the same kernel at every location.
Equivariance is not the same as invariance:
Convolution gives you the equivariant building block that deeper architectures can compose into higher-level behavior.
This section is about the “physics” of convolution in code: where the kernel lands, how many outputs you get, and what stride/padding/dilation actually do.
With stride s (1D), you compute outputs at positions n = 0, s, 2s, … rather than every position.
For 2D, you have stride (s_h, s_w). The kernel top-left corner moves by s_h rows and s_w columns.
Effect: larger stride → smaller output spatial size → more downsampling.
Without padding, you can only place the kernel where it fully fits in the input. That shrinks the output.
Padding adds extra values around the border, commonly zeros.
Effect: more padding → larger output spatial size and more border coverage.
Boundary note: padding breaks perfect translation equivariance at the edges because the border now “sees” artificial values.
Dilation d inserts gaps between kernel elements. In 1D, a kernel with k elements and dilation d covers an effective size:

k_eff = d(k − 1) + 1

Example: k=3, d=2 covers positions like [0, 2, 4] → k_eff=5.

In 2D, apply this per dimension:

k_eff,h = d_h(k_h − 1) + 1 and k_eff,w = d_w(k_w − 1) + 1
Effect: increases receptive field without increasing parameter count, but can “skip over” fine details.
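Which input offsets a dilated kernel actually samples can be enumerated directly (a small illustrative sketch):

```python
def dilated_taps(k, d):
    """Input offsets sampled by a k-tap kernel with dilation d, anchored at 0."""
    return [i * d for i in range(k)]

taps = dilated_taps(3, 2)
print(taps)           # → [0, 2, 4]
print(taps[-1] + 1)   # effective size: d*(k-1)+1 = 5
```

The gaps between taps are exactly the fine details a dilated kernel can skip over.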
Let L = input length, k = kernel size, s = stride, d = dilation, and p = total padding (p_left + p_right).
Compute effective kernel size:

k_eff = d(k − 1) + 1

Then output length L_out is:

L_out = floor((L + p − k_eff) / s) + 1
Why the floor? Because you only count kernel placements that fit.
The last valid output position corresponds to the last kernel placement whose rightmost sampled index is within the padded input.
So the last valid start position n satisfies n + k_eff − 1 ≤ L + p − 1, i.e. n ≤ L + p − k_eff.

With stride s, starts are n = 0, s, 2s, …, m s. The largest m such that m s ≤ L + p − k_eff is:

m = floor((L + p − k_eff) / s)

And the number of outputs is m + 1:

L_out = floor((L + p − k_eff) / s) + 1
For input H×W, kernel k_h×k_w, dilation d_h, d_w, stride s_h, s_w, and padding totals p_h = p_top+p_bottom, p_w = p_left+p_right:

H_out = floor((H + p_h − k_eff,h) / s_h) + 1
W_out = floor((W + p_w − k_eff,w) / s_w) + 1

where k_eff,h = d_h(k_h − 1) + 1 and k_eff,w = d_w(k_w − 1) + 1.
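The shape math translates directly into a helper (a sketch; the function names are ours, not a library API):

```python
def conv_out_len(L, k, s=1, p=0, d=1):
    """Output length: count of stride-spaced kernel placements in the padded input."""
    k_eff = d * (k - 1) + 1
    return (L + p - k_eff) // s + 1  # // implements the floor for valid configurations

def conv2d_out_shape(H, W, k_h, k_w, s_h=1, s_w=1, p_h=0, p_w=0, d_h=1, d_w=1):
    """Apply the 1D formula per spatial dimension; p_h, p_w are *total* paddings."""
    return (conv_out_len(H, k_h, s_h, p_h, d_h),
            conv_out_len(W, k_w, s_w, p_w, d_w))

print(conv_out_len(6, 3))                                        # → 4
print(conv2d_out_shape(7, 7, 3, 3, s_h=2, s_w=2, p_h=2, p_w=2))  # → (4, 4)
```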
It helps to name common padding schemes, but always ground them in the formula.
| Scheme | Typical meaning | Outcome (stride=1, dilation=1) |
|---|---|---|
| valid | p=0 | output shrinks: L_out = L-k+1 |
| same | choose p so L_out = L | preserves length/size |
For stride=1, dilation=1, to get L_out = L you need:

(L + p − k) + 1 = L

That means total padding p = k − 1. Usually you split it as evenly as possible:

p_left = floor((k − 1)/2), p_right = ceil((k − 1)/2)
For 2D, do this per dimension.
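The even split can be computed explicitly (a sketch; note that which side receives the extra unit for even total padding varies by library, the right side is chosen here):

```python
def same_padding_1d(k, d=1):
    """(p_left, p_right) so that a stride-1 conv preserves length.
    Total padding is k_eff - 1; any extra unit goes on the right here."""
    k_eff = d * (k - 1) + 1
    p = k_eff - 1
    return p // 2, p - p // 2

print(same_padding_1d(3))       # → (1, 1)
print(same_padding_1d(4))       # → (1, 2): even kernels force an uneven split
print(same_padding_1d(3, d=2))  # → (2, 2): dilation widens the needed padding
```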
When stride>1, there is not a single padding that guarantees H_out = H exactly; libraries choose rules like “output is ceil(H/stride).” If you need exactness, compute padding explicitly.
For a standard 2D conv with C_in input channels and C_out output channels:

parameters = C_out · C_in · k_h · k_w (+ C_out if a bias is used)

Compute is roughly proportional to:

H_out · W_out · C_out · C_in · k_h · k_w multiply-adds
These simple counts help you reason about model size and performance.
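Both counts are one-liners (a sketch; names are illustrative):

```python
def conv2d_params(c_in, c_out, k_h, k_w, bias=True):
    """Weights: one (c_in x k_h x k_w) kernel per output channel, plus optional biases."""
    return c_out * c_in * k_h * k_w + (c_out if bias else 0)

def conv2d_macs(c_in, c_out, k_h, k_w, h_out, w_out):
    """Multiply-accumulates: one (c_in * k_h * k_w)-long dot product per output element."""
    return h_out * w_out * c_out * c_in * k_h * k_w

print(conv2d_params(3, 16, 3, 3, bias=False))  # → 432
print(conv2d_macs(3, 16, 3, 3, 32, 32))        # → 442368
```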
Now that you can compute shapes, the next step is understanding why convolutions are such a good building block.
Think of a kernel as encoding a template.
At each position, the dot product is high when the local patch aligns with the kernel’s pattern.
If you normalize vectors, the dot product relates to cosine similarity. CNNs don’t usually normalize patches, so magnitude also matters. But the intuition remains: convolution measures how much a pattern is present locally.
Weight sharing means the same weights w are used for every location. This creates a structured linear map.
Let T_Δ be a shift operator that shifts the input by Δ (in 1D, Δ is an integer). For interior positions (ignoring boundary padding effects), convolution satisfies:

conv(T_Δ x) = T_Δ conv(x)
That is equivariance.
Why you care: a pattern detector learned at one position automatically applies at every position, and a shifted input produces a correspondingly shifted feature map.

Padding caveat: near the borders, padded values are artificial, so the equality above holds exactly only for interior positions.
The receptive field of an output unit is the region of the input that can affect it.
For a single conv layer with kernel size k and dilation 1, the receptive field of each output unit is exactly k (per dimension).
But when you stack layers, receptive fields grow.
Suppose you apply two conv layers, both with kernel size k=3, stride=1, dilation=1.
Layer 1: y₁[n] = ∑_{i=0}^{2} w₁[i] x[n+i], so y₁[n] depends on x[n..n+2].

Layer 2: y₂[n] = ∑_{i=0}^{2} w₂[i] y₁[n+i], so y₂[n] depends on y₁[n..n+2].

Substitute dependencies: y₁[n+2] reaches x[n+2..n+4], so y₂[n] reaches back to x[n] and forward to x[n+4].

So y₂[n] depends on x[n..n+4], i.e. 5 input positions.
This shows a pattern: with stride=1 and dilation=1, each stacked k=3 layer increases the receptive field by 2.
For 1D layers ℓ=1..L, with kernel sizes k_ℓ and strides s_ℓ (ignore dilation for a moment), define r_ℓ as the receptive field size after layer ℓ and j_ℓ as the jump (distance in the input between adjacent units of layer ℓ), with r₀ = 1, j₀ = 1.

Update per layer:

r_ℓ = r_{ℓ−1} + (k_ℓ − 1) j_{ℓ−1}
j_ℓ = j_{ℓ−1} s_ℓ
This is a standard CNN design tool: stride increases the jump (downsampling), and kernel size increases the field.
If you include dilation d_ℓ, replace (k_ℓ − 1) with (k_{eff,ℓ} − 1) where k_eff = d(k−1)+1.
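The recurrence is easy to run for any stack of layers; a sketch matching the update rules above, including the dilation substitution:

```python
def receptive_field(layers):
    """layers: iterable of (k, s, d). Returns (r, j): receptive field size and jump."""
    r, j = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1
        r += (k_eff - 1) * j  # field grows by (k_eff - 1) jumps of the *previous* layer
        j *= s                # stride multiplies the jump for all later layers
    return r, j

print(receptive_field([(3, 1, 1), (3, 1, 1)]))             # → (5, 1)
print(receptive_field([(3, 1, 1), (3, 2, 1), (3, 1, 1)]))  # → (9, 2)
```

Note the order inside the loop: r uses the old j, then j is updated, which mirrors r_ℓ depending on j_{ℓ−1}.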
A conv layer usually has many output channels. Each output channel corresponds to a different kernel (feature detector).
So at each spatial location, you don’t just output one number; you output a vector of C_out numbers. You can think of that vector as a learned local descriptor.
Convolution itself is a linear operation in x:

conv(a x + b x′) = a conv(x) + b conv(x′)

CNNs become powerful because you compose: conv → nonlinearity (e.g. ReLU) → conv → nonlinearity → …, often with normalization and pooling in between.
The conv operation is the structured linear piece that respects locality and translation structure.
This section connects the operation to practical CNN design decisions and implementation details.
When you design a CNN block, you typically decide:
1) Desired change in spatial size (keep, half, quarter)
2) Number of channels (C_in → C_out)
3) Kernel size (locality) and whether to use dilation
4) Padding to control boundaries
A common pattern: 3×3 convs with stride 1 and padding 1.
For k=3, dilation=1, wanting H_out = H requires total padding p = k − 1 = 2, i.e. padding 1 on each side.
If you want a larger receptive field, you can use a larger kernel, stack several small-kernel layers, or use dilation.
Trade-offs:
| Method | Pros | Cons |
|---|---|---|
| Larger kernel | Direct, dense coverage | More parameters and compute |
| Stack 3×3 | Often efficient; more nonlinearities | Deeper network, may be harder to optimize |
| Dilation | Big receptive field, same params | Can miss fine detail; gridding artifacts |
Even though this node focuses on standard convolution, you’ll see variants such as grouped convolution, depthwise convolution, and pointwise (1×1) convolution.
These change parameter count and mixing across channels.
When you implement Conv2d / tf.nn.conv2d, you must align:
1) Tensor layout: PyTorch defaults to NCHW; tf.nn.conv2d often uses NHWC unless configured.
2) Weight layout: PyTorch Conv2d weights are (C_out, C_in, k_h, k_w); TensorFlow filters are (k_h, k_w, C_in, C_out).
3) Padding semantics: PyTorch takes explicit padding integers (or tuples); TensorFlow accepts "SAME"/"VALID" and pads accordingly.
4) Dilation parameter: called dilation in PyTorch and dilations in tf.nn.conv2d.
A convolution can be implemented by:
1) Extracting all patches into a big matrix (each row = one patch) → im2col
2) Flattening kernels into columns
3) Doing a matrix multiply
This shows convolution is a structured linear operator. It also explains why convolutions run fast on matrix-multiply hardware, and why im2col-based implementations use extra memory: overlapping patches duplicate input values.
You don’t need to implement im2col, but knowing it helps debug shapes: patches correspond to output positions.
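A 1D version of im2col is just a few lines (a pure-Python sketch; real implementations use optimized strided views rather than copies):

```python
def im2col_1d(x, k):
    """One row per output position; each row is a length-k patch."""
    return [x[n:n + k] for n in range(len(x) - k + 1)]

def rows_dot(patches, w):
    """One dot product per row: the whole convolution as a matrix-vector product."""
    return [sum(p * q for p, q in zip(row, w)) for row in patches]

patches = im2col_1d([1, 2, 0, -1, 3, 1], 3)
print(patches)                        # → [[1, 2, 0], [2, 0, -1], [0, -1, 3], [-1, 3, 1]]
print(rows_dot(patches, [2, -1, 1]))  # → [0, 3, 4, -4]
```

The number of rows equals the number of output positions, which is exactly the shape-formula count.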
Once you’re fluent with convolution mechanics, you can understand pooling and downsampling, receptive-field design in deep architectures, and the convolution variants mentioned above.
This node is the gateway from basic linear algebra (dot products) to deep learning feature extraction.
Input x (length L=6): x = [1, 2, 0, -1, 3, 1]
Kernel w (length k=3): w = [2, -1, 1]
Use cross-correlation convention (no flip), stride s=1, padding p=0, dilation d=1.
List the patches of length 3 (since k=3) that fit with p=0:

[1, 2, 0], [2, 0, -1], [0, -1, 3], [-1, 3, 1]
Compute output length using the formula:
Effective kernel size: k_eff = d(k-1)+1 = 1·(3-1)+1 = 3
Total padding p = 0, so L_out = floor((6 + 0 − 3)/1) + 1 = 3 + 1 = 4.
So y has length 4 (indices 0..3).
Compute each y[n] as a dot product y[n] = ∑_{i=0}^{2} w[i] x[n+i].
n=0: y[0] = 2·1 + (−1)·2 + 1·0 = 2 − 2 + 0 = 0
n=1: y[1] = 2·2 + (−1)·0 + 1·(−1) = 4 + 0 − 1 = 3
n=2: y[2] = 2·0 + (−1)·(−1) + 1·3 = 0 + 1 + 3 = 4
n=3: y[3] = 2·(−1) + (−1)·3 + 1·1 = −2 − 3 + 1 = −4
Final output:
y = [0, 3, 4, -4]
Insight: Each output is just a dot product between the kernel and a local patch. The output length comes from counting how many valid kernel placements fit; the formula matches the patch list exactly.
Input image x has shape H×W = 7×7 (ignore channels for shape math).
Kernel size k_h×k_w = 3×3.
Stride s_h=s_w=2.
Dilation d_h=d_w=1.
Padding: p_top=p_bottom=p_left=p_right=1 (so totals p_h=2, p_w=2).
Compute effective kernel sizes: k_eff,h = k_eff,w = 1·(3−1)+1 = 3.
Compute output height:
H_out = floor((7 + 2 − 3)/2) + 1
Simplify:
7 + 2 − 3 = 6, floor(6/2) = 3, so H_out = 3 + 1 = 4.
Compute output width (same numbers): W_out = 4.
So the output spatial shape is 4×4.
Sanity check by thinking in placements: the padded input is 9×9; along each dimension a 3-wide kernel can start at positions 0, 2, 4, 6 (stride 2), and the last start still fits since 6 + 3 − 1 = 8 ≤ 8. That is 4 placements per dimension.
Insight: The shape formula is just counting how many stride-spaced kernel placements fit inside the padded input. Padding made the padded input 9×9, enabling 4 placements along each dimension with stride 2.
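The placement counting from the sanity check can be written down directly (a tiny sketch):

```python
# Padded dimension length 9, kernel size 3, stride 2:
# valid starts n satisfy n + 3 - 1 <= 8, i.e. n <= 6.
starts = list(range(0, 9 - 3 + 1, 2))
print(starts)       # → [0, 2, 4, 6]
print(len(starts))  # → 4 placements per dimension, hence the 4x4 output
```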
Consider a 2D CNN block with two convolution layers, both with kernel 3×3, stride 1, dilation 1, and padding 1 (so spatial sizes are preserved). We track the receptive field size (in one dimension; it’s symmetric in H and W here).
Initialize receptive field and jump: r₀ = 1, j₀ = 1.
Layer 1: k=3, s=1 → r₁ = r₀ + (3 − 1)·j₀ = 1 + 2 = 3, j₁ = j₀·1 = 1.
After the first conv, each output pixel sees a 3-pixel-wide region (in 1D).
Layer 2: k=3, s=1 → r₂ = r₁ + (3 − 1)·j₁ = 3 + 2 = 5, j₂ = 1.
Conclusion:
Two stacked 3×3 stride-1 conv layers yield a receptive field of 5×5 in 2D (since 5 in height and 5 in width).
Insight: Stacking small kernels grows receptive field gradually while adding nonlinearities between them—one reason repeated 3×3 convs are so common in CNNs.
A convolution (as used in CNNs) is a sliding-window dot product: each output value is a weighted sum of a local input patch.
Most deep learning libraries implement cross-correlation (no kernel flip) but still call it convolution; the learning behavior is essentially unaffected.
Weight sharing (same kernel at every position) is what gives convolution translation equivariance (away from boundary effects).
Stride controls downsampling: larger stride reduces output spatial size by skipping kernel placements.
Padding controls boundary behavior and output size; explicit padding values are safer than relying on ambiguous “same” rules for stride>1.
Dilation increases effective kernel size k_eff = d(k−1)+1, expanding receptive field without increasing parameter count.
Output shapes follow floor-based formulas; they come directly from counting how many kernel placements fit in the padded input.
Stacking convolution layers expands receptive field; stride increases the jump between receptive field centers.
Mixing up tensor layouts (NCHW vs NHWC) and weight layouts (PyTorch vs TensorFlow), leading to silent shape mismatches.
Assuming padding="same" means identical behavior across libraries or for stride>1; the rounding/distribution of padding can differ.
Forgetting dilation when computing output shape: you must use k_eff, not k.
Confusing translation equivariance with invariance (equivariance preserves shifts; invariance requires pooling/aggregation).
1D shape practice: An input of length L=20 is convolved with kernel size k=5, stride s=2, dilation d=1, and total padding p=4 (e.g., 2 left + 2 right). What is L_out?
Hint: Use k_eff = d(k−1)+1 and L_out = floor((L+p−k_eff)/s)+1.
Compute k_eff: k_eff = 1·(5 − 1) + 1 = 5
Then: L_out = floor((20 + 4 − 5)/2) + 1 = floor(19/2) + 1 = 9 + 1 = 10
So L_out = 10.
2D shape + channels: You have an input tensor with shape (N=8, C_in=3, H=32, W=32) in NCHW. You apply a Conv2d with C_out=16, kernel 3×3, stride 1, dilation 1, padding 1. What is the output shape and how many weight parameters (ignore bias)?
Hint: Padding 1 with kernel 3 and stride 1 preserves H and W. Parameters are C_out·C_in·k_h·k_w.
Spatial shape: H_out = floor((32 + 2 − 3)/1) + 1 = 31 + 1 = 32.
Similarly W_out=32.
So output shape is (8, 16, 32, 32).
Parameter count: C_out · C_in · k_h · k_w = 16 · 3 · 3 · 3 = 432
So there are 432 weights (plus 16 biases if bias were included).
Receptive field reasoning: In 1D, stack three conv layers with kernel sizes [3, 3, 3], strides [1, 2, 1], dilations all 1. Compute the receptive field size r₃ using the update rules j_ℓ = j_{ℓ−1}s_ℓ and r_ℓ = r_{ℓ−1} + (k_ℓ−1)j_{ℓ−1}.
Hint: Start with r₀=1, j₀=1 and apply the updates layer by layer. Be careful: stride affects j first, but r uses j_{ℓ−1}.
Initialize: r₀ = 1, j₀ = 1
Layer 1: k₁=3, s₁=1 → r₁ = 1 + (3 − 1)·1 = 3, j₁ = 1·1 = 1
Layer 2: k₂=3, s₂=2 → r₂ = 3 + (3 − 1)·1 = 5, j₂ = 1·2 = 2
Layer 3: k₃=3, s₃=1 → r₃ = 5 + (3 − 1)·2 = 9, j₃ = 2·1 = 2
So the final receptive field is 9 input positions wide.
Unlocks and next steps:
Related nodes you may want next (if available in your tree):