Nonlinear functions applied elementwise to neuron outputs (e.g., ReLU, sigmoid, tanh) that enable networks to model complex, non-linear relationships and affect training dynamics. Understanding their properties (saturation, gradient behavior, sparsity) is important for architecture design and optimization.
Deep-dive lesson - accessible entry point but dense material. Use worked examples and spaced repetition.
A neural network without activation functions is secretly just one big linear model—no matter how many layers you stack. Activation functions are the small elementwise “twists” that make deep learning expressive, trainable, and (sometimes) numerically fragile.
An activation function applies an elementwise nonlinearity $f$ to each neuron’s pre-activation $z$, producing $a = f(z)$. This breaks linearity so deep networks can model complex functions, and its derivative $f'(z)$ controls gradient flow in backprop. Key properties: saturation (tiny gradients), sparsity (many zeros for ReLU), and smoothness/scale (affects optimization and stability).
A single neuron computes a weighted sum and bias, then optionally squashes it:

$z = \mathbf{w}^\top \mathbf{x} + b, \qquad a = f(z)$

The function $f$ is the activation function. In standard feedforward layers, it is applied elementwise to each component of the vector $\mathbf{z}$ (one value per neuron).
The key reason activation functions matter is nonlinearity.
If you remove $f$ (or choose $f(z) = z$), then each layer is linear:

$\mathbf{a} = W\mathbf{x} + \mathbf{b}$

Stack two such layers:

$\mathbf{y} = W_2 (W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$

Now expand:

$\mathbf{y} = W_2 W_1 \mathbf{x} + W_2 \mathbf{b}_1 + \mathbf{b}_2$

That is still just one linear layer with new weights and bias:

$\mathbf{y} = W' \mathbf{x} + \mathbf{b}', \qquad W' = W_2 W_1, \quad \mathbf{b}' = W_2 \mathbf{b}_1 + \mathbf{b}_2$
So without nonlinear activations, adding depth does not add expressive power. Activations prevent this “collapse,” allowing networks to represent highly non-linear mappings.
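This collapse is easy to verify numerically. A minimal NumPy sketch (shapes and random seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with no activation in between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass through the stacked layers.
y_stacked = W2 @ (W1 @ x + b1) + b2

# Equivalent single affine layer: W' = W2 W1, b' = W2 b1 + b2.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
y_single = W_prime @ x + b_prime

print(np.allclose(y_stacked, y_single))  # True: the two layers collapsed into one
```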
For a layer with $n$ neurons, you typically compute:

$\mathbf{z} = W\mathbf{x} + \mathbf{b}$

Then apply the activation elementwise:

$\mathbf{a} = f(\mathbf{z}), \qquad a_i = f(z_i)$

There’s no mixing between neurons inside $f$; all the mixing is in $W$.
Backprop uses the chain rule. For a scalar loss $L$, the gradient with respect to a pre-activation $z$ includes $f'(z)$. If $a = f(z)$, then

$\dfrac{\partial L}{\partial z} = \dfrac{\partial L}{\partial a}\, f'(z)$

So the local slope $f'(z)$ determines how strongly error signals flow backward.

Two immediate consequences: where $f'(z) \approx 0$ (saturation), the error signal is nearly blocked; where $f'(z) \approx 1$, it passes backward almost unchanged.
Below is an inline static SVG overlaying common activation curves and (scaled) derivatives. The shapes make saturation and gradient flow tangible.
<svg width="780" height="310" viewBox="0 0 780 310" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="Overlay of activation functions (sigmoid, tanh, ReLU) and their derivatives">
<rect x="0" y="0" width="780" height="310" fill="#ffffff" />
<!-- axes -->
<line x1="40" y1="155" x2="760" y2="155" stroke="#222" stroke-width="2"/>
<line x1="400" y1="20" x2="400" y2="290" stroke="#222" stroke-width="2"/>
<text x="745" y="148" font-size="12" fill="#222">z</text>
<text x="408" y="30" font-size="12" fill="#222">value</text>
<!-- grid ticks -->
<g stroke="#eee" stroke-width="1">
<line x1="40" y1="55" x2="760" y2="55"/>
<line x1="40" y1="255" x2="760" y2="255"/>
<line x1="220" y1="20" x2="220" y2="290"/>
<line x1="580" y1="20" x2="580" y2="290"/>
</g>
<g fill="#666" font-size="11">
<text x="208" y="170">-2</text>
<text x="568" y="170">+2</text>
<text x="392" y="170">0</text>
<text x="15" y="60">+1</text>
<text x="15" y="260">-1</text>
</g>
<!-- sigmoid (approx) -->
<path d="M40 230 C140 225, 220 210, 300 185 C340 170, 360 160, 400 155 C440 150, 460 140, 500 125 C580 100, 660 85, 760 80" fill="none" stroke="#1f77b4" stroke-width="3"/>
<!-- tanh (approx) -->
<path d="M40 255 C140 250, 220 225, 300 185 C340 165, 370 155, 400 155 C430 155, 460 145, 500 125 C580 85, 660 60, 760 55" fill="none" stroke="#ff7f0e" stroke-width="3"/>
<!-- ReLU -->
<path d="M40 155 L400 155 L760 55" fill="none" stroke="#2ca02c" stroke-width="3"/>
<!-- derivatives (scaled to fit): sigmoid' peak 0.25 -> scale x4 to show as 1 -->
<path d="M40 155 C170 155, 300 140, 400 105 C500 140, 630 155, 760 155" fill="none" stroke="#1f77b4" stroke-width="2" stroke-dasharray="6,4"/>
<!-- tanh' = 1 - tanh^2 peak 1 at 0 (already) -->
<path d="M40 155 C170 155, 300 135, 400 55 C500 135, 630 155, 760 155" fill="none" stroke="#ff7f0e" stroke-width="2" stroke-dasharray="6,4"/>
<!-- ReLU' piecewise: 0 for z<0, 1 for z>0 (draw as step) -->
<path d="M40 155 L400 155" fill="none" stroke="#2ca02c" stroke-width="2" stroke-dasharray="6,4"/>
<path d="M400 55 L760 55" fill="none" stroke="#2ca02c" stroke-width="2" stroke-dasharray="6,4"/>
<circle cx="400" cy="55" r="3" fill="#2ca02c"/>
<!-- legend -->
<g font-size="12" fill="#222">
<text x="50" y="25">Solid = activation f(z), dashed = derivative f'(z) (scaled where noted)</text>
<rect x="50" y="35" width="10" height="3" fill="#1f77b4"/><text x="65" y="42">sigmoid</text>
<rect x="140" y="35" width="10" height="3" fill="#ff7f0e"/><text x="155" y="42">tanh</text>
<rect x="210" y="35" width="10" height="3" fill="#2ca02c"/><text x="225" y="42">ReLU</text>
</g>
</svg>
Reading the figure: the solid sigmoid and tanh curves flatten for large $|z|$, and their dashed derivatives drop toward 0 there (saturation); tanh is zero-centered with peak slope 1, while sigmoid’s peak slope is only 0.25; ReLU’s derivative is a step, exactly 0 for $z<0$ and exactly 1 for $z>0$.
This single picture captures most of what you’ll care about when choosing activations: where gradients flow, where they die, and what ranges outputs can take.
A linear classifier in 2D has decision boundary $\mathbf{w}^\top \mathbf{x} + b = 0$, which is a line. If you compose linear layers only, you still get a line.

When you insert a nonlinearity, the network can “bend” space. Intuitively, each nonlinear unit lets the decision boundary change direction.

ReLU is

$\mathrm{ReLU}(z) = \max(0, z)$

Consider a 1-hidden-layer network:

$y = \mathbf{v}^\top \mathrm{ReLU}(W\mathbf{x} + \mathbf{b}) + c$

Each hidden unit $j$ splits input space by a hyperplane $\mathbf{w}_j^\top \mathbf{x} + b_j = 0$.
Because different subsets of ReLUs turn “on” in different regions, the whole network becomes piecewise linear: linear in each region, but with many regions.
This SVG illustrates how two ReLU units create a partition of 2D space into regions where different linear pieces apply.
<svg width="780" height="320" viewBox="0 0 780 320" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="2D partition of the plane into regions by two ReLU hyperplanes">
<rect x="0" y="0" width="780" height="320" fill="#ffffff"/>
<text x="20" y="25" font-size="14" fill="#222">Two ReLU pre-activations define two lines (z₁=0 and z₂=0), partitioning the plane into 4 regions.</text>
<!-- coordinate frame -->
<line x1="80" y1="270" x2="360" y2="270" stroke="#222" stroke-width="2"/>
<line x1="220" y1="300" x2="220" y2="60" stroke="#222" stroke-width="2"/>
<text x="345" y="290" font-size="12" fill="#222">x₁</text>
<text x="230" y="70" font-size="12" fill="#222">x₂</text>
<!-- lines z1=0 and z2=0 -->
<line x1="110" y1="80" x2="330" y2="260" stroke="#1f77b4" stroke-width="3"/>
<line x1="110" y1="250" x2="330" y2="90" stroke="#ff7f0e" stroke-width="3"/>
<text x="115" y="75" font-size="12" fill="#1f77b4">z₁=0</text>
<text x="115" y="265" font-size="12" fill="#ff7f0e">z₂=0</text>
<!-- region labels -->
<g font-size="12" fill="#222">
<text x="120" y="140">Region A:</text>
<text x="120" y="158">z₁<0, z₂>0</text>
<text x="250" y="140">Region B:</text>
<text x="250" y="158">z₁>0, z₂>0</text>
<text x="120" y="215">Region C:</text>
<text x="120" y="233">z₁<0, z₂<0</text>
<text x="250" y="215">Region D:</text>
<text x="250" y="233">z₁>0, z₂<0</text>
</g>
<!-- explanation on the right -->
<rect x="410" y="60" width="350" height="220" fill="#fafafa" stroke="#ddd"/>
<text x="430" y="90" font-size="13" fill="#222">In each region, the network is linear:</text>
<text x="430" y="115" font-size="12" fill="#222">- If zⱼ<0 ⇒ ReLU(zⱼ)=0 (unit off)</text>
<text x="430" y="138" font-size="12" fill="#222">- If zⱼ>0 ⇒ ReLU(zⱼ)=zⱼ (unit on)</text>
<text x="430" y="170" font-size="12" fill="#222">So different regions activate different subsets</text>
<text x="430" y="190" font-size="12" fill="#222">of linear pieces, yielding a “bent” boundary</text>
<text x="430" y="210" font-size="12" fill="#222">when you solve y(x)=0.</text>
</svg>
This picture explains why ReLU networks are powerful: with many units, you get many regions, and therefore many linear pieces. Depth increases the number of regions dramatically.
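A tiny NumPy sketch of this partitioning, using two made-up ReLU units (the weights below are illustrative assumptions, not taken from the figure):

```python
import numpy as np

# Two hypothetical ReLU units in 2D (weights chosen for illustration).
W = np.array([[1.0, -1.0],   # z1 = x1 - x2
              [1.0,  1.0]])  # z2 = x1 + x2
b = np.zeros(2)

def active_set(x):
    """Return which ReLU units are 'on' at input x."""
    z = W @ x + b
    return tuple(z > 0)

# Points in different regions activate different subsets of units,
# so each region gets its own linear formula.
points = [np.array([2.0, 0.0]), np.array([-2.0, 0.0]),
          np.array([0.0, 2.0]), np.array([0.0, -2.0])]
for p in points:
    print(p, active_set(p))  # four points, four distinct activation patterns
```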
Sigmoid:

$\sigma(z) = \dfrac{1}{1 + e^{-z}}$

Tanh:

$\tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
These functions don’t create sharp “kinks” like ReLU; they create smooth transitions. That can be beneficial (smooth gradients) but can also cause saturation for large |z|.
A practical (often overlooked) design detail is the range and mean of activations:
| Activation | Range | Zero-centered? | Typical note |
|---|---|---|---|
| Sigmoid | (0, 1) | No | Good for probabilities; saturates |
| tanh | (-1, 1) | Yes | Often better than sigmoid in hidden layers |
| ReLU | [0, ∞) | No | Sparse activations; simple; risk of dead units |
Zero-centering matters because if activations are mostly positive, the next layer’s gradients can become biased (all weights pushed similarly), sometimes slowing optimization.
Let one neuron be:

$z = wx + b, \qquad a = f(z)$

During backprop, the gradient that flows into $z$ is:

$\dfrac{\partial L}{\partial z} = \dfrac{\partial L}{\partial a}\, f'(z)$

So $f'(z)$ is literally a multiplier on the error signal.

If you stack many layers, you multiply many such terms (plus weight matrices). A simplified 1D intuition:

$\dfrac{\partial L}{\partial z_1} \propto \prod_{l=1}^{L} w_l \, f'(z_l)$
If the product shrinks toward 0, you get vanishing gradients. If it grows huge, exploding gradients.
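A toy 1D sketch of this product effect (the slopes, weights, and depth below are illustrative assumptions):

```python
import numpy as np

def grad_product(slopes, weights):
    """1D chain-rule product across layers: prod over l of w_l * f'(z_l)."""
    return np.prod([w * s for w, s in zip(weights, slopes)])

depth = 20
# Best-case sigmoid slopes (0.25) with weights of 1.0: shrinks fast.
vanish = grad_product([0.25] * depth, [1.0] * depth)
# Active ReLU slopes (1.0) with slightly large weights: blows up.
explode = grad_product([1.0] * depth, [1.5] * depth)

print(f"vanishing: {vanish:.3e}")   # ~9.1e-13
print(f"exploding: {explode:.3e}")  # ~3.3e+03
```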
Sigmoid derivative:

First write sigmoid cleanly:

$\sigma(z) = (1 + e^{-z})^{-1}$

Differentiate:

$\sigma'(z) = \dfrac{e^{-z}}{(1 + e^{-z})^2}$

A more useful identity:

$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$

This peaks at $z = 0$ (i.e., $\sigma(0) = 0.5$):

$\sigma'(0) = 0.5 \times 0.5 = 0.25$

So even in the best case, each sigmoid layer multiplies gradients by at most 0.25 (before considering weights). In the saturated tails (large $|z|$), $\sigma'(z) \approx 0$.
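A quick numerical check of the sigmoid-derivative identity and its 0.25 peak (a minimal NumPy sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # identity: sigma'(z) = sigma(z) * (1 - sigma(z))

print(sigmoid_prime(0.0))    # 0.25, the global maximum
print(sigmoid_prime(10.0))   # ~4.5e-05: saturated tail
print(sigmoid_prime(-10.0))  # same tiny value on the other tail
```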
Tanh derivative:

$\tanh'(z) = 1 - \tanh^2(z)$

This peaks at 1 when $z = 0$, but still goes to 0 as $|z|$ grows.

Interpretation: near the origin, tanh passes gradients at full strength (unlike sigmoid’s 0.25 cap), which is one reason it often works better in hidden layers; in the tails, it saturates just like sigmoid.
ReLU derivative is:

$\mathrm{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \end{cases}$

(Undefined at 0, but implementations pick 0 or 1; it rarely matters in practice.)
So for active units (z>0), gradients pass through unchanged (multiplied by 1). That’s a big reason ReLU enabled very deep networks to train effectively.
But the zero-derivative region causes the dying ReLU problem: if a neuron’s inputs make $z < 0$ for most data, it outputs 0 and gets no gradient to recover.
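A minimal sketch of how a unit can die, assuming inputs clustered near 0 and a badly chosen negative bias (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Inputs clustered near 0; bias initialized very negative (a bad choice).
x = rng.normal(0.0, 1.0, size=1000)
w, b = 0.1, -5.0

z = w * x + b
a = np.maximum(z, 0.0)             # ReLU output
grad_mask = (z > 0).astype(float)  # ReLU'(z) per example

print("fraction of inputs with z > 0:", grad_mask.mean())  # 0.0: the unit is dead
print("gradient reaching w and b:", grad_mask.sum())       # 0.0: it cannot recover
```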
Most activation variants are attempts to tune one or more properties:
| Activation | Definition | Derivative behavior | Typical use |
|---|---|---|---|
| Leaky ReLU | $\max(\alpha z, z)$ | small slope $\alpha$ for $z<0$ | reduces dead ReLUs |
| ELU | $z$ if $z>0$ else $\alpha(e^z - 1)$ | smooth negative side | sometimes improves convergence |
| GELU | $z\,\Phi(z)$ (approx.) | smooth, nonzero slope | Transformers/modern NLP |
| Softplus | $\log(1 + e^z)$ | smooth ReLU; never exactly 0 | when you need smoothness |
Leaky ReLU derivative:

$f'(z) = \begin{cases} 1 & z > 0 \\ \alpha & z < 0 \end{cases}$

So even when “off,” some gradient passes.
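NumPy sketches of these variants (the GELU line uses the common tanh approximation; treat these as illustrations, not any library’s reference implementations):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def gelu(z):
    # Widely used tanh approximation of z * Phi(z).
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def softplus(z):
    return np.log1p(np.exp(z))

z = np.array([-2.0, 0.0, 2.0])
for name, f in [("leaky_relu", leaky_relu), ("elu", elu),
                ("gelu", gelu), ("softplus", softplus)]:
    print(name, f(z))
```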
ReLU-like activations produce many exact zeros. That implies sparse representations: only a subset of units participates for any given input, which can act a bit like regularization and can make computation cheaper.
But sparsity is not free: too many zeros can reduce effective capacity and can stall learning for dead units.
Activation functions interact with floating-point behavior. For example, computing sigmoid naively as $1/(1+e^{-z})$ overflows in $e^{-z}$ for large negative $z$; stable implementations branch on the sign of $z$, and softplus is computed as $\max(z, 0) + \log(1 + e^{-|z|})$. This avoids overflow when $|z|$ is large.
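A minimal NumPy sketch of numerically stable sigmoid and softplus, using the standard sign-branching and $\max(z,0) + \mathrm{log1p}$ tricks:

```python
import numpy as np

def sigmoid_stable(z):
    # Branch on sign so exp() only ever sees non-positive arguments.
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

def softplus_stable(z):
    # log(1 + e^z) = max(z, 0) + log1p(e^{-|z|}); the exponent is never positive.
    z = np.asarray(z, dtype=float)
    return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z)))

z = np.array([-1000.0, 0.0, 1000.0])
print(sigmoid_stable(z))   # [0.  0.5 1. ] with no overflow
print(softplus_stable(z))  # [0.  0.693... 1000.] with no overflow
```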
These concerns connect directly to careful initialization and normalization methods you’ll meet later.
You rarely choose an activation in isolation. You choose it given the architecture and depth, the initialization and normalization scheme, and what the output must mean for your task.

A useful rule of thumb: match the output activation to the quantity you need to predict.
| Task | Output activation | Output meaning |
|---|---|---|
| Binary classification | sigmoid | $P(y = 1 \mid x)$ |
| Multi-class (single label) | softmax | categorical distribution |
| Regression (unbounded) | identity | any real value |
| Regression (positive) | softplus or exp | positive real |
| Regression (bounded) | tanh/sigmoid | constrained range |
Softmax is not elementwise (it mixes logits), but it’s commonly discussed alongside activations. Elementwise activations typically happen in hidden layers; softmax is a special output nonlinearity.
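To make the contrast concrete, here is a minimal softmax sketch; every output depends on every logit, unlike an elementwise activation (subtracting the max is the usual stability trick and does not change the result, since softmax is shift-invariant):

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability; softmax is shift-invariant.
    shifted = logits - np.max(logits)
    e = np.exp(shifted)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p)        # a categorical distribution over the three classes
print(p.sum())  # 1.0: the outputs are coupled through the shared normalizer
```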
ReLU: a strong, simple default for hidden layers in MLPs and CNNs.

GELU (common in Transformers): a smooth default in modern NLP and large-scale architectures.
Even though sigmoid/tanh can saturate, they remain important: sigmoid for output probabilities and for the gates of LSTMs/GRUs, tanh for bounded hidden states in recurrent networks.
Activations influence how variance propagates forward/backward. While you’ll study initialization formally later, the intuition is: each layer should roughly preserve the variance of activations and gradients, and different activations scale that variance differently (ReLU, for instance, zeroes about half its inputs, roughly halving the variance).
This is why “He initialization” is often paired with ReLU-like activations, and “Xavier/Glorot” is often paired with tanh.
| Property | Helps with | Often hurts |
|---|---|---|
| Bounded outputs (sigmoid/tanh) | stability, interpretability | saturation → vanishing gradients |
| Unbounded positive (ReLU) | gradient flow, simplicity | dead units, nonzero mean |
| Smoothness (tanh, softplus, GELU) | stable optimization, differentiability | can reduce sparsity; may saturate |
| Sparsity (ReLU) | regularization-like effect | too many inactive units |
Activation functions are not just “a nonlinearity.” They are a design lever that shapes geometry (expressiveness) and learning (gradient flow).
Consider a 2-layer network with no activation functions:

Layer 1: $\mathbf{h} = W_1 \mathbf{x} + \mathbf{b}_1$.

Layer 2: $\mathbf{y} = W_2 \mathbf{h} + \mathbf{b}_2$.

Show that $\mathbf{y}$ is an affine function of $\mathbf{x}$ and write the equivalent single-layer parameters.

Substitute Layer 1 into Layer 2:

$\mathbf{y} = W_2 (W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$

Distribute $W_2$:

$\mathbf{y} = W_2 W_1 \mathbf{x} + W_2 \mathbf{b}_1 + \mathbf{b}_2$

Group terms as a single affine map $\mathbf{y} = W' \mathbf{x} + \mathbf{b}'$:

$W' = W_2 W_1$,

$\mathbf{b}' = W_2 \mathbf{b}_1 + \mathbf{b}_2$
Conclude: any number of stacked linear layers (no nonlinear activation between them) equals one linear layer, so depth adds no representational power in that case.
Insight: This is the core “why before how” for activations: the nonlinearity $f$ prevents this collapse, allowing each layer to change the function class rather than merely re-parameterize a linear map.
Let a scalar neuron be $z = wx + b$, $a = f(z)$, and loss $L = \tfrac{1}{2}(a - t)^2$.

Pick $w = 1$, $b = 0$, and target $t = 1$, so that $z = x$ and we can set $z$ directly.

Compare gradients for:

1) sigmoid at $z = 0$ and at $z = -10$

2) ReLU at the same z values.
Compute generic derivatives.

Since $L = \tfrac{1}{2}(a - t)^2$, we have:

$\dfrac{\partial L}{\partial a} = a - t$.

And by chain rule:

$\dfrac{\partial L}{\partial z} = (a - t)\, f'(z)$.

With $t = 1$: $\dfrac{\partial L}{\partial z} = (a - 1)\, f'(z)$.
Case A (sigmoid) at z=0.

$a = \sigma(0) = 0.5$.

$\sigma'(0) = 0.25$.

So $\dfrac{\partial L}{\partial z} = (0.5 - 1)(0.25) = -0.125$.
Case B (sigmoid) at z=-10.

$a = \sigma(-10) \approx 4.5 \times 10^{-5}$.

$\sigma'(-10) \approx 4.5 \times 10^{-5}$.

So $\dfrac{\partial L}{\partial z} \approx (4.5 \times 10^{-5} - 1)(4.5 \times 10^{-5}) \approx -4.5 \times 10^{-5}$.

The gradient magnitude shrank from 1e-1 to about 1e-5 due to saturation.
Case C (ReLU) at z=0.

$a = \mathrm{ReLU}(0) = 0$.

At exactly 0, the ReLU derivative is undefined; implementations pick 0 or 1. Consider a tiny positive z (e.g., z=+ε) to represent “active” behavior: then $f'(z) = 1$.

If z≈0⁺: $a \approx 0$, so $\dfrac{\partial L}{\partial z} \approx (0 - 1)(1) = -1$.
Case D (ReLU) at z=-10.

$a = \mathrm{ReLU}(-10) = 0$ and $f'(-10) = 0$ (inactive).

So $\dfrac{\partial L}{\partial z} = 0$.

No gradient flows: this is the dying-ReLU risk if many datapoints keep z<0.
Insight: Sigmoid gives you some gradient when saturated, but it can be extremely small; ReLU gives you strong gradients when active, but exactly zero when inactive. Training dynamics are dominated by these local derivative regimes.
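The four cases can be checked numerically. This sketch assumes the setup $L = \tfrac{1}{2}(a - t)^2$ with $t = 1$, and uses a tiny positive z to stand in for the “active ReLU” case:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_wrt_z(z, f, f_prime, t=1.0):
    """dL/dz for L = 0.5 * (f(z) - t)^2, via the chain rule."""
    a = f(z)
    return (a - t) * f_prime(z)

sig_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))
relu = lambda z: max(z, 0.0)
relu_prime = lambda z: 1.0 if z > 0 else 0.0

print(grad_wrt_z(0.0, sigmoid, sig_prime))    # -0.125 (order 1e-1)
print(grad_wrt_z(-10.0, sigmoid, sig_prime))  # ~ -4.5e-05 (saturated)
print(grad_wrt_z(1e-6, relu, relu_prime))     # ~ -1.0 (active ReLU)
print(grad_wrt_z(-10.0, relu, relu_prime))    # 0.0 (dead: no gradient at all)
```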
Let $x \in \mathbb{R}$ and define a 2-unit ReLU network:

$y(x) = \mathrm{ReLU}(z_1) + \mathrm{ReLU}(z_2), \qquad z_1 = x + 1, \quad z_2 = x - 1$.

Show explicitly that $y(x)$ is linear on each interval cut by the hinge points $x = -1$ and $x = 1$.

Identify hinge points where each ReLU switches.

$z_1 = x + 1$ switches at $x = -1$.

$z_2 = x - 1$ switches at $x = 1$.

So consider intervals: $(-\infty, -1)$, $(-1, 1)$, $(1, \infty)$.
Interval 1: $x < -1$.

Then $z_1 = x + 1 < 0$ and $z_2 = x - 1 < 0$.

So both ReLUs are 0:

$y = 0$ (constant, hence linear).
Interval 2: $-1 < x < 1$.

Then $z_1 > 0$ but $z_2 < 0$.

So:

$\mathrm{ReLU}(z_1) = x + 1$,

$\mathrm{ReLU}(z_2) = 0$.

Thus $y = x + 1$ (linear).
Interval 3: $x > 1$.

Then both $z_1 > 0$ and $z_2 > 0$.

So:

$y = (x + 1) + (x - 1) = 2x$ (linear).
Conclude: the network is piecewise linear with ‘kinks’ at x=-1 and x=1; adding more ReLU units adds more hinge points, increasing shape flexibility.
Insight: This is the 1D version of the 2D partition picture: ReLUs introduce regions where different linear formulas apply, letting you build complex shapes by stitching simple pieces.
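A short numerical check of this stitching, using one concrete 2-unit network with hinge points at x = -1 and x = 1 (this particular choice of weights is an illustrative assumption):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def net(x):
    # A 2-unit ReLU network with hinge points at x = -1 and x = 1.
    return relu(x + 1.0) + relu(x - 1.0)

# On each interval between hinges, the output matches a single linear formula.
xs1 = np.linspace(-3.0, -1.01, 50)   # both units off
xs2 = np.linspace(-0.99, 0.99, 50)   # only the first unit on
xs3 = np.linspace(1.01, 3.0, 50)     # both units on

print(np.allclose(net(xs1), 0.0))        # True: y = 0
print(np.allclose(net(xs2), xs2 + 1.0))  # True: y = x + 1
print(np.allclose(net(xs3), 2.0 * xs3))  # True: y = 2x
```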
Activation functions apply an elementwise mapping $a = f(z)$ to each neuron’s pre-activation $z$.
Without nonlinear activations, stacked linear layers collapse into a single linear (affine) transformation—depth gives no extra expressiveness.
The derivative directly gates backpropagation: $\dfrac{\partial L}{\partial z} = \dfrac{\partial L}{\partial a}\, f'(z)$.
Sigmoid and tanh saturate for large |z|, causing vanishing gradients in their tails; sigmoid’s maximum slope is only 0.25.
ReLU is non-saturating for z>0 (good gradient flow) but has zero gradient for z<0 (risk of dead neurons).
ReLU networks represent piecewise-linear functions; in higher dimensions, ReLU hyperplanes partition space into regions with different linear behaviors.
Activation choice affects output range, zero-centering, sparsity, and numerical stability—so it influences both modeling and optimization.
Using sigmoid (or tanh) in many deep hidden layers without understanding saturation, then wondering why gradients vanish and training stalls.
Assuming activations are “just a detail” and ignoring their derivatives, when $f'(z)$ is the main determinant of gradient flow locally.
Forgetting that ReLU can die: if pre-activations stay negative, the unit outputs 0 and receives zero gradient (especially with large learning rates or biased initialization).
Mismatching output activation to the task (e.g., using ReLU for probabilities, or sigmoid for unbounded regression).
Compute derivatives: (a) $\sigma'(z)$ for sigmoid $\sigma(z) = \dfrac{1}{1 + e^{-z}}$, and (b) $\tanh'(z)$. Then evaluate each derivative at z=0 and describe what it implies for gradient flow near the origin.

Hint: For sigmoid, try rewriting the result in terms of $\sigma(z)$ after differentiating. For tanh, differentiate $\tanh(z) = \dfrac{e^z - e^{-z}}{e^z + e^{-z}}$ directly or use the identity $\tanh'(z) = 1 - \tanh^2(z)$.
Sigmoid:

$\sigma(z) = (1 + e^{-z})^{-1}$

Differentiate:

$\sigma'(z) = \dfrac{e^{-z}}{(1 + e^{-z})^2}$

And using $\sigma(z) = \dfrac{1}{1 + e^{-z}}$ gives:

$\sigma'(z) = \sigma(z)(1 - \sigma(z))$

At z=0: $\sigma(0) = 0.5$ so $\sigma'(0) = 0.25$.

Tanh:

$\tanh'(z) = 1 - \tanh^2(z)$

At z=0: $\tanh(0) = 0$ so $\tanh'(0) = 1$.
Implication: near z=0, tanh passes gradient more strongly than sigmoid; sigmoid’s slope is capped at 0.25 even in its best region.
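Both closed-form results can be sanity-checked against central finite differences (a minimal sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def numeric_derivative(f, z, h=1e-6):
    # Central finite difference as an independent check of the closed forms.
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Closed-form derivatives at z = 0:
sig_prime_0 = sigmoid(0.0) * (1.0 - sigmoid(0.0))  # 0.25
tanh_prime_0 = 1.0 - np.tanh(0.0) ** 2             # 1.0

print(sig_prime_0, numeric_derivative(sigmoid, 0.0))   # both ~0.25
print(tanh_prime_0, numeric_derivative(np.tanh, 0.0))  # both ~1.0
```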
Consider a 2-layer network with ReLU between layers: $\mathbf{h} = \mathrm{ReLU}(W_1 \mathbf{x} + \mathbf{b}_1)$ and $y = \mathbf{w}_2^\top \mathbf{h} + b_2$. Explain (in words) why this network can represent non-linear decision boundaries in x-space, unlike the same architecture without ReLU.

Hint: Focus on what happens in regions where each component of $W_1 \mathbf{x} + \mathbf{b}_1$ is positive vs negative.
With ReLU, each hidden unit outputs either 0 (if its pre-activation is negative) or a linear function of x (if positive). The set of signs of the hidden pre-activations partitions input space into regions; within each region, the network behaves like a linear model, but different regions have different linear formulas because different subsets of hidden units are active. The boundary y=0 can therefore bend across regions, forming a non-linear decision boundary. Without ReLU, the whole network is just one affine map in x, so y=0 is a single hyperplane (linear boundary).
Dying ReLU thought experiment: Suppose a neuron uses ReLU and its pre-activation is $z = wx + b$. If your dataset has x mostly around 0 and you initialize b = -5 with small w, what happens to this neuron during early training? Propose one fix.

Hint: Evaluate the sign of z initially and connect it to $\mathrm{ReLU}'(z)$.

If x≈0 and w is small, then initially z≈b=-5, so z<0 for most examples. The neuron outputs a=ReLU(z)=0 and its derivative is ReLU'(z)=0 in that region. In backprop, the upstream gradient gets multiplied by 0, so $\partial L/\partial w$ and $\partial L/\partial b$ for that neuron are ~0; it may not recover and becomes a dead unit. Fixes include: (1) initialize biases closer to 0 (or slightly positive) so some examples activate the unit, (2) use Leaky ReLU so the negative side has slope α>0 and gradients can update w and b, or (3) use normalization (e.g., BatchNorm) to keep the z distribution near 0.
Next nodes you can explore: