Layers of nonlinear transformations. Universal approximators.
Multi-session curriculum: assumes substantial prior knowledge and covers complex material. Use mastery gates and deliberate practice.
A neural network is what you get when you stop asking a model to be “one good formula” and instead let it be “many simple formulas composed together.” Each layer is easy to understand (an affine map plus a nonlinearity), but stacking layers creates a surprisingly rich family of functions—rich enough to approximate almost any smooth pattern you can express with data.
Neural networks are parametric functions built by alternating affine transformations (x ↦ Wx + b) and elementwise nonlinearities. The nonlinearity is essential: without it, many layers collapse into one linear map. With it, depth creates expressive, flexible models that generalize logistic regression to multi-layer feature learning.
Logistic regression is powerful because it turns a linear score (Wx + b) into a nonlinear probability using σ(·). But it still fundamentally draws a linear decision boundary in the original feature space: it can only separate classes that are linearly separable (or close to it).
Neural networks extend this idea by repeatedly doing two steps:

1. Apply an affine transformation: z = Wh + b.
2. Apply an elementwise nonlinearity: h = φ(z).

By composing these steps many times, a network can learn intermediate representations—new “features” that make a difficult task easy for the final layer.
A (feedforward) neural network defines a function:
ŷ = f(x; θ)
where θ is the set of all parameters, typically weights and biases across layers:
θ = { (W¹, b¹), (W², b²), …, (Wᴸ, bᴸ) }
A standard L-layer multilayer perceptron (MLP) has hidden states hˡ computed as:

hˡ = φ( Wˡ hˡ⁻¹ + bˡ ), with h⁰ = x.
The final layer is sometimes left “linear” (no activation) depending on the task.
If the input has dimension d and layer l has width nˡ (with n⁰ = d): Wˡ is an (nˡ × nˡ⁻¹) matrix and bˡ is a vector with nˡ entries.
This is just matrix-vector multiplication plus a bias.
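As a minimal sketch (the values here are illustrative), an affine layer in NumPy is one line:

```python
import numpy as np

def affine(W, b, x):
    """Affine map x -> W @ x + b (matrix-vector product plus bias)."""
    return W @ x + b

W = np.array([[2.0, 0.0], [1.0, -1.0]])  # shape (2, 2): 2 outputs, 2 inputs
b = np.array([0.5, -0.5])
x = np.array([1.0, 2.0])
print(affine(W, b, x))  # pre-activation z = Wx + b
```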
A key motivation: neural networks can approximate very complicated functions. Roughly: a network with a single hidden layer and enough units can approximate any continuous function on a compact set to arbitrary accuracy (the universal approximation theorem).
Important nuance: this is an existence statement—suitable parameters exist, but it says nothing about whether training will find them or how many units are needed.
A helpful mental model is: each layer re-represents its input so the next layer’s job becomes simpler—a linear score, then a nonlinear squashing, over and over.
This is logistic regression’s idea—applied repeatedly.
Binary logistic regression can be written as a 1-layer network:
p(y = 1 | x) = σ( Wx + b )
A deeper network replaces the direct linear score with a learned feature map h = g(x) and then uses:
p(y = 1 | x) = σ( wᵀ h + b )
where g(·) is itself a composition of affine maps and nonlinearities.
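The two views side by side, as a sketch with illustrative random parameters (`g` here is a single affine + ReLU layer standing in for a deeper feature map):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# 1-layer "network": logistic regression directly on raw inputs
w0, b0 = rng.normal(size=3), 0.1
p_direct = sigmoid(w0 @ x + b0)

# deeper version: learned feature map h = g(x), then the same sigmoid head
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
h = np.maximum(0.0, W1 @ x + b1)   # g(x): affine map + ReLU
w, b = rng.normal(size=4), 0.0
p_deep = sigmoid(w @ h + b)

print(p_direct, p_deep)  # both are probabilities in (0, 1)
```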
An affine transformation
x ↦ Wx + b
is the simplest learnable operation that can:

- rotate, scale, and shear space (via W)
- mix input coordinates into new linear features (via the rows of W)
- shift the result (via b)
If you have matrix calculus background, you already know how gradients behave for affine maps, which is one reason they are so central.
Think of x as a point in ℝᵈ.
A linear layer can:

- rotate, reflect, and stretch the point cloud
- project it into lower dimensions or embed it in higher ones
- mix coordinates into new linear combinations
But a single affine layer cannot “bend” space. It cannot create curved decision boundaries by itself.
Suppose the first layer is:
z¹ = W¹ x + b¹
Each coordinate z¹ᵢ is:
z¹ᵢ = w¹ᵢᵀ x + b¹ᵢ
So every neuron computes a linear score of the input—very similar to logistic regression’s logit. The difference comes next: we don’t directly interpret this as a probability; we feed it forward as a feature.
Without b, every hyperplane zᵢ = 0 must pass through the origin. Adding bias lets each neuron choose its own threshold.
This matters especially when combined with piecewise-linear activations (like ReLU), where the bias controls where the “kink” happens.
The hidden dimension nˡ affects what the network can represent.
| Choice | What it enables | What it risks |
|---|---|---|
| nˡ < nˡ⁻¹ (compression) | bottleneck features, dimensionality reduction | information loss |
| nˡ = nˡ⁻¹ | stable capacity | may need depth for expressiveness |
| nˡ > nˡ⁻¹ (expansion) | rich feature mixing, sparse or disentangled features | overfitting, optimization cost |
If you stack affine maps without nonlinearities:
h¹ = W¹ x + b¹
h² = W² h¹ + b²
then:
h² = W²(W¹ x + b¹) + b²
= (W²W¹)x + (W²b¹ + b²)
This is still just one affine map.
So if we only used affine layers, depth would be pointless. The entire expressive leap of neural networks comes from the nonlinearity.
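A quick numeric check of this collapse, with illustrative random parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
x = rng.normal(size=2)

# two stacked affine layers...
h2 = W2 @ (W1 @ x + b1) + b2

# ...equal a single affine layer with collapsed parameters
W_tilde, b_tilde = W2 @ W1, W2 @ b1 + b2
print(np.allclose(h2, W_tilde @ x + b_tilde))  # True
```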
A nonlinearity φ applied elementwise:
h = φ(z) meaning hᵢ = φ(zᵢ)
is what prevents the network from collapsing into a single affine transformation.
Nonlinearities let networks represent:

- curved decision boundaries
- interactions between features
- piecewise behavior: different linear rules in different regions of input space
You’ll see a small set of activations repeatedly:
| Activation | Formula | Range | Pros | Cons |
|---|---|---|---|---|
| Sigmoid | σ(t) = 1/(1+e⁻ᵗ) | (0,1) | probabilistic interpretation | saturates, vanishing gradients |
| Tanh | tanh(t) | (−1,1) | zero-centered | saturates |
| ReLU | max(0, t) | [0, ∞) | simple, sparse, strong gradients for t>0 | “dead” units for t≤0 |
| Leaky ReLU | max(αt, t) | (−∞, ∞) | reduces dead units | extra hyperparameter α |
| GELU | t·Φ(t), Φ = standard normal CDF | (−∞, ∞) | smooth, strong in transformers | more compute |
For many modern MLPs, ReLU/GELU variants dominate.
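The table’s formulas as a sketch in NumPy (this GELU uses the exact Φ form via `math.erf`, so it is scalar-only here):

```python
import numpy as np
from math import erf, sqrt

def sigmoid(t): return 1.0 / (1.0 + np.exp(-t))
def tanh(t): return np.tanh(t)
def relu(t): return np.maximum(0.0, t)
def leaky_relu(t, alpha=0.01): return np.maximum(alpha * t, t)
def gelu(t):  # t * Phi(t), with Phi the standard normal CDF
    return t * 0.5 * (1.0 + erf(t / sqrt(2.0)))

print(relu(-2.0), leaky_relu(-2.0), sigmoid(0.0))  # 0.0 -0.02 0.5
```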
Consider a 1D input x and a single neuron:
h = ReLU(wx + b)
This is:

- 0 whenever wx + b ≤ 0
- the linear function wx + b whenever wx + b > 0
So it is a “hinge” function with a kink at x = −b/w.
If you sum many such hinges, you can approximate complex shapes. In higher dimensions, each ReLU neuron corresponds to a half-space boundary (wᵀx + b = 0), and the network becomes a partition of input space into regions where the overall function is linear.
This is part of why deep networks can represent complex functions more efficiently than shallow ones.
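A tiny numeric sketch of a single hinge (illustrative values of w and b), showing the output is zero up to the kink at x = −b/w and linear with slope w after it:

```python
import numpy as np

w, b = 2.0, -3.0      # illustrative parameters
kink = -b / w         # 1.5: where w*x + b crosses zero

xs = np.array([0.0, 1.0, kink, 2.0, 3.0])
h = np.maximum(0.0, w * xs + b)
print(h)  # zero up to the kink, then growing with slope w
```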
The last layer is often chosen to match the meaning of outputs:
| Task | Output | Typical final layer |
|---|---|---|
| Binary classification | p(y=1) | sigmoid |
| Multi-class classification | p(y=k) | softmax |
| Regression | real value | linear (identity) |
For multi-class, with logits s ∈ ℝᴷ:
softmax(s)ₖ = exp(sₖ) / ∑ⱼ exp(sⱼ)
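A sketch of this formula in NumPy, with the standard max-subtraction trick for numerical stability (the subtraction leaves the result unchanged because it cancels in the ratio):

```python
import numpy as np

def softmax(s):
    # subtract the max before exponentiating to avoid overflow
    e = np.exp(s - np.max(s))
    return e / e.sum()

s = np.array([2.0, 0.0, -1.0])
p = softmax(s)
print(p, p.sum())  # probabilities ordered like the logits, summing to 1
```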
Elementwise nonlinearities are simple and efficient. But note: they act on each coordinate independently, so all mixing of information across coordinates must come from the affine layers.
Many advanced architectures introduce nonlinearities that depend on multiple coordinates (attention, normalization, gating). But elementwise activations are the core starting point.
A neural network is best understood as a composition of functions:
f(x) = ( fᴸ ∘ fᴸ⁻¹ ∘ … ∘ f¹ )(x)
where each layer function is typically:
fˡ(h) = φ( Wˡ h + bˡ )
Composition matters because:

- each layer operates on features computed by all earlier layers
- complexity compounds with depth: simple pieces combine into intricate mappings
- intermediate features are reused by everything downstream
For a 2-hidden-layer network:
h¹ = φ( W¹ x + b¹ )
h² = φ( W² h¹ + b² )
y = W³ h² + b³
Even though each step is simple, the final mapping x ↦ y can be highly nonlinear.
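The three lines above translate directly into code. A sketch with illustrative widths and random parameters:

```python
import numpy as np

def phi(z):
    """Elementwise nonlinearity; ReLU here."""
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
d, n1, n2, k = 3, 5, 4, 2                     # illustrative dimensions
W1, b1 = rng.normal(size=(n1, d)),  rng.normal(size=n1)
W2, b2 = rng.normal(size=(n2, n1)), rng.normal(size=n2)
W3, b3 = rng.normal(size=(k, n2)),  rng.normal(size=k)

x = rng.normal(size=d)
h1 = phi(W1 @ x + b1)
h2 = phi(W2 @ h1 + b2)
y = W3 @ h2 + b3                              # linear output layer
print(y.shape)  # (2,)
```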
Each neuron computes:
hᵢ = φ(wᵀ h_prev + b)
This can be seen as: a small detector that fires (outputs a large value) when a particular linear combination of the previous layer’s features exceeds its threshold.
In early layers, features might correspond to simple patterns.
In deeper layers, features become combinations of combinations.
Many networks can be decomposed conceptually:
h = g(x; θ_g)
y = Ah + c
where g is the deep feature extractor and (A, c) is a linear “head.”
This view is helpful because:

- it connects deep networks back to linear models: the head is just linear (or logistic) regression on the learned features h
- it separates “learn a representation” from “make a prediction,” clarifying what each part of the network contributes
A neural network becomes useful when paired with a loss.
Given dataset {( x⁽ⁱ⁾, t⁽ⁱ⁾ )}ᵢ:
Minimize: (1/N) ∑ᵢ ℓ( f(x⁽ⁱ⁾), t⁽ⁱ⁾ )
Examples:

- squared error ℓ(ŷ, t) = (ŷ − t)² for regression
- cross-entropy (log loss) for classification
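The average-loss objective above, sketched with squared error and a toy model (the dataset and the model f(x) = 2x are purely illustrative):

```python
import numpy as np

def mse(pred, target):
    """Per-example squared error."""
    return (pred - target) ** 2

# toy dataset {(x_i, t_i)} and a toy "model"
xs = np.array([0.0, 1.0, 2.0])
ts = np.array([0.0, 2.1, 3.9])
f = lambda x: 2.0 * x

# (1/N) * sum of per-example losses
loss = np.mean(mse(f(xs), ts))
print(loss)
```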
Because you know matrix calculus, you can view training as gradient-based optimization in a high-dimensional parameter space.
If layer l has nˡ units and previous layer has nˡ⁻¹:
Total parameters ≈ ∑ˡ (nˡ·nˡ⁻¹ + nˡ)
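The count formula as a sketch; `widths` lists the input dimension followed by each layer width (the example widths are illustrative):

```python
def mlp_param_count(widths):
    """widths = [d, n1, ..., nL]; each layer has an (n_out x n_in) weight
    matrix plus an n_out bias vector."""
    return sum(n_out * n_in + n_out
               for n_in, n_out in zip(widths[:-1], widths[1:]))

print(mlp_param_count([3, 4, 2]))  # (4*3 + 4) + (2*4 + 2) = 26
```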
Large capacity can fit complex functions—but increases overfitting risk, motivating regularization (an unlock node).
Universal approximation results say “there exist parameters.” In practice you must also consider:

- optimization: can gradient-based training actually find good parameters?
- generalization: does the fitted function work on unseen data?
- cost: how much data, compute, and width/depth are required?
This is why deep learning is both a theory of function classes and an engineering discipline.
When feature engineering is hard, neural networks shine because they can learn representations.
They appear in many forms: MLPs for tabular data, convolutional networks for images, recurrent networks for sequences, and transformers for language.
The common thread is still layers of affine-like transforms plus nonlinearities, often with architectural constraints.
Logistic regression gives a hyperplane boundary.
An MLP can build boundaries that are unions and compositions of many half-spaces.
A helpful picture (conceptual): each hidden layer warps the input space, and a well-trained network keeps warping until the classes become (approximately) linearly separable for the final layer.
To train, you need gradients ∂ℓ/∂Wˡ and ∂ℓ/∂bˡ for every layer.
Naively computing derivatives separately for each parameter would be expensive.
Backpropagation is the efficient application of chain rule through the layered composition. This is exactly the next node you unlock.
High-capacity models can memorize. Regularization techniques (L2, dropout, early stopping) and normalization (batch norm, layer norm) help:

- regularization limits effective capacity and discourages memorization
- normalization stabilizes and speeds up training
These are also unlocked nodes—and they become much easier to appreciate once the basic network mapping is clear.
Variational autoencoders (VAEs) and many other generative models use neural networks to parameterize distributions: for example, a network maps an input to the mean and variance of a Gaussian over latent variables.
In this sense, “neural network” is not the whole algorithm; it’s the function approximator inside the algorithm.
Autoencoders are neural networks trained to reconstruct inputs through a bottleneck:
x → encoder → z (low-dim) → decoder → x̂
This creates a learned, nonlinear dimensionality reduction—connected to the dimensionality reduction node.
Compute the output of a 2-layer network with ReLU hidden layer and a linear output. Let x ∈ ℝ².
Given:
W¹ = [[1, −2],
[0, 3]] , b¹ = [−1, 2]
W² = [[2, −1]] , b² = [0.5]
Activation φ = ReLU.
Input x = [2, 1].
Step 1: Compute pre-activation z¹ = W¹x + b¹.
W¹x = [[1, −2],[0, 3]] [2,1]
= [1·2 + (−2)·1,
0·2 + 3·1]
= [0, 3]
z¹ = [0, 3] + [−1, 2] = [−1, 5]
Step 2: Apply ReLU: h¹ = ReLU(z¹).
ReLU([−1, 5]) = [0, 5]
Step 3: Compute output pre-activation z² = W²h¹ + b².
W²h¹ = [2, −1] [0,5] = 2·0 + (−1)·5 = −5
z² = −5 + 0.5 = −4.5
Step 4: Since output is linear, y = z².
Final output y = −4.5
Insight: Even with tiny matrices, you can see the pattern: affine → ReLU → affine. ReLU zeroed out the first hidden unit, so only the second feature contributed to the final score. This “selective routing” is a core behavior of ReLU networks.
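The four steps can be checked in a few lines of NumPy:

```python
import numpy as np

W1 = np.array([[1.0, -2.0], [0.0, 3.0]]); b1 = np.array([-1.0, 2.0])
W2 = np.array([[2.0, -1.0]]);             b2 = np.array([0.5])
x = np.array([2.0, 1.0])

z1 = W1 @ x + b1            # [-1.0, 5.0]
h1 = np.maximum(0.0, z1)    # ReLU -> [0.0, 5.0]: first unit zeroed out
y = W2 @ h1 + b2            # linear output
print(y)  # [-4.5]
```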
Show that stacking affine layers without an activation is still just an affine map.
Let h¹ = W¹x + b¹ and h² = W²h¹ + b².
Start with the definition of the second layer:
h² = W²h¹ + b²
Substitute h¹ = W¹x + b¹:
h² = W²(W¹x + b¹) + b²
Distribute W²:
h² = W²W¹x + W²b¹ + b²
Group terms into a single affine form:
Let W̃ = W²W¹ and b̃ = W²b¹ + b².
Then h² = W̃x + b̃
Insight: Depth without nonlinearity gives no extra expressive power. Activations prevent this collapse, making layered composition meaningful.
Approximate a simple piecewise-linear function on x ∈ [0, 2] using a sum of shifted ReLUs.
Target function:
f(x) = { x for 0 ≤ x ≤ 1
{ 2 − x for 1 < x ≤ 2
This is a triangle peak at x=1. Show it can be written using ReLU(·).
Recall ReLU(t) = max(0, t). Consider these three hinge functions:
h₁(x) = ReLU(x)
h₂(x) = ReLU(x − 1)
h₃(x) = ReLU(x − 2)
Construct a piecewise-linear function by combining them:
g(x) = 1·h₁(x) − 2·h₂(x) + 1·h₃(x)
Check intervals.
For 0 ≤ x ≤ 1: h₁(x) = x, h₂(x) = 0, h₃(x) = 0.
So g(x) = x (matches f).
For 1 ≤ x ≤ 2: h₁(x) = x, h₂(x) = x − 1, h₃(x) = 0.
So g(x) = x − 2(x−1) = x − 2x + 2 = 2 − x (matches f).
For x ≥ 2: h₁(x) = x, h₂(x) = x − 1, h₃(x) = x − 2.
So g(x) = x − 2(x−1) + (x−2) = 0 (the triangle returns to 0).
Insight: A sum of a few shifted ReLUs can build a nontrivial shape. In higher dimensions and deeper networks, this idea scales: many hinges compose into extremely rich functions.
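The construction is easy to verify numerically:

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)
g = lambda x: relu(x) - 2.0 * relu(x - 1.0) + relu(x - 2.0)

xs = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0])
print(g(xs))  # rises to a peak of 1 at x = 1, falls back to 0 by x = 2
```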
A feedforward neural network is a composition of layers: hˡ = φ(Wˡhˡ⁻¹ + bˡ).
Affine layers (Wx + b) mix and shift features but cannot create curvature by themselves.
Elementwise nonlinearities (ReLU, tanh, sigmoid, GELU, …) prevent stacked affine maps from collapsing into one affine map.
Depth matters because composition can create complex functions from simple building blocks; width controls how many features exist per layer.
The last-layer activation should match the task (sigmoid for binary, softmax for multi-class, identity for regression).
Neural networks generalize logistic regression: logistic regression is essentially a 1-layer network with sigmoid output.
Universal approximation is an existence statement: a network can represent many functions, but training and generalization require additional tools (backprop, regularization, normalization).
Assuming more layers automatically help while forgetting that, without nonlinearities, multiple layers collapse to one affine map.
Mixing up shapes: Wˡ is (nˡ × nˡ⁻¹) and multiplies hˡ⁻¹ on the right; many bugs are silent shape/broadcast errors.
Using sigmoid/tanh everywhere in deep nets without considering saturation and vanishing gradients (often ReLU/GELU are better defaults).
Interpreting “universal approximator” as “guaranteed to train well” rather than “able to represent.”
You are given a network h¹ = ReLU(W¹x + b¹), y = σ(wᵀh¹ + b). If ReLU were removed (replaced with identity), show that the model reduces to logistic regression in the original input x.
Hint: Substitute h¹ = W¹x + b¹ into the output and regroup terms into a single weight vector and bias.
Without ReLU, h¹ = W¹x + b¹.
Then the logit is:
s = wᵀh¹ + b
= wᵀ(W¹x + b¹) + b
= (wᵀW¹)x + (wᵀb¹ + b)
Define w̃ᵀ = wᵀW¹ and b̃ = wᵀb¹ + b.
So p(y=1|x) = σ(s) = σ(w̃ᵀx + b̃), which is logistic regression.
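The reduction can be checked numerically with illustrative random parameters:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
w, b = rng.normal(size=4), 0.3
x = rng.normal(size=3)

# identity "activation": h1 is just affine in x
h1 = W1 @ x + b1
p_network = sigmoid(w @ h1 + b)

# collapsed logistic regression parameters
w_tilde, b_tilde = w @ W1, w @ b1 + b
p_logreg = sigmoid(w_tilde @ x + b_tilde)
print(np.isclose(p_network, p_logreg))  # True
```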
Consider an MLP with input dimension d = 10, one hidden layer of width n¹ = 64, and output dimension K = 5 (multi-class). The hidden layer uses ReLU and the output uses softmax. How many parameters are there total (including biases)?
Hint: Count parameters per layer: W¹, b¹, W², b².
Layer 1: W¹ ∈ ℝ⁶⁴×¹⁰ has 64·10 = 640 parameters. b¹ ∈ ℝ⁶⁴ has 64 parameters.
Layer 2: W² ∈ ℝ⁵×⁶⁴ has 5·64 = 320 parameters. b² ∈ ℝ⁵ has 5 parameters.
Total = 640 + 64 + 320 + 5 = 1029 parameters.
Let φ be ReLU. For a 1-hidden-layer network f(x) = ∑ᵢ aᵢ ReLU(wᵢ x + bᵢ) + c in 1D, explain why f(x) is piecewise linear, and where its slope can change.
Hint: Each ReLU term changes from 0 to linear at the point where wᵢ x + bᵢ = 0.
Each term ReLU(wᵢ x + bᵢ) is either 0 (when wᵢ x + bᵢ ≤ 0) or a linear function (wᵢ x + bᵢ) (when wᵢ x + bᵢ > 0). Therefore, on any interval where the sign of every (wᵢ x + bᵢ) is fixed, every term is linear (either constant 0 or linear), and the sum is linear.
A slope change can only occur when at least one neuron switches regime, i.e., at a breakpoint x where wᵢ x + bᵢ = 0 ⇒ x = −bᵢ / wᵢ (for wᵢ ≠ 0). Thus f(x) is piecewise linear with possible kinks at those breakpoint locations.
Next nodes you’re set up for:

- Backpropagation
- Regularization (L2, dropout, early stopping)
- Normalization (batch norm, layer norm)

Related refreshers:

- Logistic regression
- Matrix calculus
- Dimensionality reduction (autoencoders)