Neural networks with many layers: CNNs, RNNs, and related architectures.
Multi-session curriculum for learners with substantial prior knowledge tackling complex material. Use mastery gates and deliberate practice.
Deep learning is the art of building a useful family of functions by stacking simple transformations into a long composition—and then making that composition trainable and stable at scale.
A deep network is a composed function $f = f^{(L)} \circ f^{(L-1)} \circ \cdots \circ f^{(1)}$. Depth creates hierarchical representations (each layer builds features from earlier features). Architecture is about inductive bias: choosing structure (convolutions, recurrence, attention, normalization, residual paths) that makes learning feasible and generalization likely. Training success depends as much on conditioning (initialization, normalization, residuals) as on optimization (SGD variants).
Deep learning is not “just bigger neural nets.” It’s a strategy for representing complicated functions using many simple, reusable parts (layers), and for learning representations that make downstream prediction easy.
The core object is a composed function:

$$f(\mathbf{x}) = f^{(L)}\big(f^{(L-1)}(\cdots f^{(1)}(\mathbf{x}))\big).$$

At each layer $\ell$, we maintain an activation / representation vector $\mathbf{h}^{(\ell)}$ (often written $\mathbf{h}_\ell$ when shape is clear):

$$\mathbf{h}^{(0)} = \mathbf{x}, \qquad \mathbf{h}^{(\ell)} = f^{(\ell)}(\mathbf{h}^{(\ell-1)}).$$

A very common concrete layer is affine + nonlinearity:

$$\mathbf{h}^{(\ell)} = \sigma\big(\mathbf{W}^{(\ell)}\mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}\big).$$
Depth matters because it changes what is easy to represent and what is easy to learn.
You already know backprop and SGD; let’s anchor deep learning in one concrete “small but real” example.
Task: binary classification in 2D. Input $\mathbf{x} \in \mathbb{R}^2$, label $y \in \{0, 1\}$. Suppose the decision boundary is not linearly separable (e.g., two moons).

A 2-layer MLP (one hidden layer) is:

$$\hat{y} = \sigma\big(\mathbf{w}^{2\top}\,\mathrm{ReLU}(\mathbf{W}^1\mathbf{x} + \mathbf{b}^1) + b^2\big).$$
Interpretation: the hidden layer $\mathbf{h} = \mathrm{ReLU}(\mathbf{W}^1\mathbf{x} + \mathbf{b}^1)$ re-represents the input, and the output layer is just a linear classifier on $\mathbf{h}$.
Even here, you can see the deep learning pattern:
1) Representation: $\mathbf{h} = \mathrm{ReLU}(\mathbf{W}^1\mathbf{x} + \mathbf{b}^1)$ is not hand-designed—it’s learned.
2) Composition: the model builds a nonlinear function from simple parts.
3) Trainability: success depends on gradients flowing from $\hat{y}$ back to $\mathbf{W}^1$.
Now scale that idea: more layers, richer inductive biases (convolution, attention), and careful conditioning (normalization/residuals) to make training stable.
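The two-moons setup above can be trained end to end with nothing but NumPy. The sketch below is illustrative, not the original author's code: the moon generator, hidden width, learning rate, and step count are arbitrary choices, and the backprop follows the 2-layer MLP defined above.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_two_moons(n=200, noise=0.1):
    # Hand-rolled two-moons generator (a stand-in for sklearn's make_moons).
    t = rng.uniform(0.0, np.pi, n)
    upper = np.stack([np.cos(t), np.sin(t)], axis=1)
    lower = np.stack([1.0 - np.cos(t), 0.5 - np.sin(t)], axis=1)
    X = np.concatenate([upper, lower]) + noise * rng.standard_normal((2 * n, 2))
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X, y = make_two_moons()
H = 16                                               # hidden width (arbitrary)
W1 = rng.standard_normal((H, 2)) * np.sqrt(2.0 / 2)  # He-style init, fan-in 2
b1 = np.zeros(H)
w2 = rng.standard_normal(H) * np.sqrt(1.0 / H)
b2 = 0.0

lr = 0.5
for _ in range(3000):
    Z = X @ W1.T + b1                  # (N, H) pre-activations
    A = np.maximum(Z, 0.0)             # learned ReLU representation h
    p = sigmoid(A @ w2 + b2)           # predicted P(y=1 | x)
    # Gradients of mean binary cross-entropy (logit trick: dL/dz = p - y).
    g = (p - y) / len(y)
    w2_new = w2 - lr * (A.T @ g)
    b2 -= lr * g.sum()
    gA = np.outer(g, w2) * (Z > 0)     # back through the ReLU
    W1 -= lr * (gA.T @ X)
    b1 -= lr * gA.sum(axis=0)
    w2 = w2_new

acc = ((p > 0.5) == (y == 1)).mean()   # training accuracy
```

The hidden representation `A` is exactly the learned $\mathbf{h}$: gradient descent invents features under which the final linear layer can separate the moons.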
Before going further, keep these three questions in mind:
1) What family of functions does this architecture represent?
2) What representations will intermediate layers tend to discover?
3) Will gradients and signals propagate stably through depth?
Deep learning is largely the practice of answering those three questions well.
You could increase width (more units per layer) or increase depth (more layers). Both add parameters, but they add different representational structure.
A useful mental model: width adds more features at the same level of abstraction, while depth adds new levels of abstraction built from the previous ones.
Depth encourages distributed hierarchical representations:
In images, this often looks like edges → textures → parts → objects. In language, characters/subwords → local syntax → semantics.
Write the network as repeated transformations:

$$\mathbf{h}^{(\ell)} = f^{(\ell)}(\mathbf{h}^{(\ell-1)}), \qquad \mathbf{h}^{(0)} = \mathbf{x}.$$

Think of $\mathbf{h}^{(\ell)}$ as a coordinate system the network invents. Learning aims to make the final layer’s problem “simple” (often linearly separable).
A very common pattern is:

$$\mathbf{h}^{(\ell)} = \sigma\big(\mathrm{Norm}(\mathbf{W}^{(\ell)}\mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)})\big),$$

where Norm might be BatchNorm, LayerNorm, RMSNorm, etc.
Deep nets are compositions, so their derivatives are products (chains) of Jacobians.
Let $\mathbf{h}^{(\ell)} = f^{(\ell)}(\mathbf{h}^{(\ell-1)})$. Define the Jacobian

$$\mathbf{J}^{(\ell)} = \frac{\partial \mathbf{h}^{(\ell)}}{\partial \mathbf{h}^{(\ell-1)}}.$$

Then:

$$\frac{\partial \mathbf{h}^{(L)}}{\partial \mathbf{h}^{(0)}} = \mathbf{J}^{(L)}\mathbf{J}^{(L-1)}\cdots\mathbf{J}^{(1)}.$$

This single equation explains a lot: if the typical singular values of each $\mathbf{J}^{(\ell)}$ sit below 1, the product shrinks exponentially with depth (vanishing gradients); if they sit above 1, it grows exponentially (exploding gradients).

You don’t need to compute these Jacobians explicitly to benefit from this mental model; it motivates initialization, normalization, and residual connections.
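The exponential behavior of Jacobian products is easy to see numerically. A small sketch, with the per-layer scale and dimension chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(1)

def product_norm(depth, scale, dim=64):
    # Spectral norm of a product of `depth` random layer Jacobians,
    # each J = scale * G / sqrt(dim) with G standard Gaussian.
    P = np.eye(dim)
    for _ in range(depth):
        J = scale * rng.standard_normal((dim, dim)) / np.sqrt(dim)
        P = J @ P
    return np.linalg.norm(P, 2)

shrinking = product_norm(20, 0.5)   # per-layer scale < 1: vanishing regime
critical  = product_norm(20, 1.0)   # per-layer scale ~ 1: roughly stable
growing   = product_norm(20, 2.0)   # per-layer scale > 1: exploding regime
```

Twenty layers are enough to separate the three regimes by many orders of magnitude, which is why per-layer conditioning matters so much.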
Without assumptions, learning in high dimensions is sample-inefficient (curse of dimensionality). Architectural choices encode assumptions like:

- locality: nearby inputs (pixels, timesteps) interact most strongly
- translation/shift invariance: the same pattern matters wherever it appears
- sequential structure: the present depends on a summary of the past
- relational/permutation structure: interactions matter, not absolute ordering

These biases restrict the function class to something that matches the world.
| Architecture | Core operation | Inductive bias | Strengths | Common failure mode |
|---|---|---|---|---|
| MLP (feedforward) | dense affine + nonlinearity | weak (mostly none) | flexible; works on tabular/embeddings | data-hungry; ignores structure |
| CNN | convolution (weight sharing, locality) | translation equivariance; local patterns | vision, audio; parameter efficient | struggles with global context unless deep/augmented |
| RNN / LSTM / GRU | recurrence | sequential state; temporal locality | streaming, variable-length sequences | long-range dependencies; parallelization limits |
| Attention / Transformer | content-based mixing (self-attn) | flexible pairwise interactions; permutation equivariance with positional encoding | long-range dependencies; parallelizable | quadratic cost in sequence length; needs lots of data |
| GNN | message passing on graphs | graph equivariance/invariance | molecules, networks, relational data | oversmoothing; limited expressivity for some tasks |
We’ll focus on CNNs and sequence models (RNNs/attention), since they are canonical deep learning building blocks.
A 2D convolution layer applies a kernel over local neighborhoods. If you already know the convolution operation, the key deep-learning additions are:
1) Channels: kernels map $C_{\text{in}}$ input channels to $C_{\text{out}}$ output channels.
2) Stacking: repeated convs grow the receptive field.
A simplified expression (single output channel) is:

$$y[i, j] = \sum_{u}\sum_{v} k[u, v]\, x[i + u,\, j + v].$$

With multiple channels:

$$y_{c'}[i, j] = \sum_{c=1}^{C_{\text{in}}} \sum_{u}\sum_{v} k_{c', c}[u, v]\, x_c[i + u,\, j + v] + b_{c'}.$$
Why it helps: weight sharing means you learn “edge detector” once and reuse it across the image. Locality reduces parameters and encourages features to be local.
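Weight sharing is concrete in code: one small kernel slides over the whole image. A minimal single-channel sketch (the step-edge image and 1×2 difference kernel are illustrative):

```python
import numpy as np

def conv2d_single(x, k):
    # Valid-mode 2D cross-correlation (what DL frameworks call "convolution")
    # for a single input and output channel.
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(k * x[i:i + kh, j:j + kw])
    return out

# One 1x2 difference kernel acts as an edge detector reused at every
# position (weight sharing): it fires only where the step edge sits.
x = np.zeros((5, 5))
x[:, 2:] = 1.0                      # image with a vertical step edge
k = np.array([[1.0, -1.0]])
y = conv2d_single(x, k)
```

The output is zero everywhere except the column containing the edge: the same two learned weights detect that pattern at every position.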
Common CNN design motifs: conv → Norm → ReLU blocks, occasional striding or pooling to downsample, channel counts that grow as spatial resolution shrinks, and residual connections once the stack gets deep.
An RNN maintains a state $\mathbf{h}_t$ updated over time:

$$\mathbf{h}_t = \phi\big(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b}\big).$$
This encodes an inductive bias: “the present depends on a compressed summary of the past.”
Training issue: backprop through time multiplies many Jacobians across timesteps, causing vanishing/exploding gradients. LSTMs/GRUs mitigate this with gating, roughly creating more stable paths for gradient flow.
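The backprop-through-time Jacobian product can be measured directly. A sketch with a tanh RNN whose recurrent weights are deliberately contractive (the sequence length, width, and 0.5 scaling are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def rnn_states(xs, Wh, Wx, b):
    # Vanilla RNN: h_t = tanh(Wh h_{t-1} + Wx x_t + b); returns every state.
    h = np.zeros(Wh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x + b)
        states.append(h)
    return states

T, d = 50, 8
xs = rng.standard_normal((T, d))
Wh = 0.5 * np.eye(d)                         # contractive recurrence (illustrative)
Wx = rng.standard_normal((d, d)) / np.sqrt(d)
b = np.zeros(d)
hs = rnn_states(xs, Wh, Wx, b)

# Sensitivity of h_T to h_1: the product of per-step Jacobians
# J_t = diag(1 - h_t^2) @ Wh for a tanh RNN.
J = np.eye(d)
for h in hs[1:]:
    J = (np.diag(1.0 - h**2) @ Wh) @ J
sensitivity = np.linalg.norm(J, 2)
```

With each step contributing a factor of at most 0.5, the influence of the first state on the last is astronomically small, which is the vanishing-gradient problem in miniature.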
Self-attention computes a weighted average of value vectors using query-key similarity.
Given matrices $\mathbf{Q}, \mathbf{K}, \mathbf{V}$:

$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}.$$
The inductive bias shifts from locality/recurrent state to learned interactions between all positions.
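The formula above fits in a few lines of NumPy, and the permutation behavior (hence the need for positional encodings) can be checked directly. Shapes and values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # max-shift for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V, A

n, d_k, d_v = 6, 4, 5
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
out, A = attention(Q, K, V)

# Without positional information, attention is order-blind: permuting the
# key/value rows together leaves the output unchanged.
perm = rng.permutation(n)
out_perm, _ = attention(Q, K[perm], V[perm])
```

Each row of `A` is a probability distribution over positions; the mixing is content-based, not location-based.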
Transformers add: positional encodings (to restore order information), multi-head attention, position-wise feedforward layers, and residual connections with layer normalization.
A practical decision rule: grid-structured data with local patterns → start with a CNN; streaming or tightly memory-constrained sequences → consider an RNN; long-range dependencies with ample data and compute → a Transformer.
Architecture is not just accuracy—it’s compute, latency, memory, and data efficiency.
In shallow models, SGD “just works” surprisingly often. In deep models, optimization can fail even when the model is expressive enough.
The chain-of-Jacobians view tells you why: signals and gradients must propagate through many transformations. If their magnitudes drift, training becomes unstable.
We’ll build a stable mental model in three steps:
1) initialization tries to keep variance roughly constant across layers
2) normalization actively stabilizes distributions during training
3) residual connections provide easy paths for gradient flow
Consider a layer:

$$a_i = \sum_{j=1}^{n} W_{ij} h_j.$$

Assume the $h_j$ are i.i.d. with mean 0 and variance $\sigma_h^2$. If weights have mean 0 and variance $\sigma_W^2$, then (roughly):

$$\mathrm{Var}(a_i) \approx n\, \sigma_W^2\, \sigma_h^2,$$

where $n$ is fan-in.

To keep $\mathrm{Var}(a_i)$ from blowing up with depth, choose $\sigma_W^2 \approx 1/n$.
Two famous schemes:

- Xavier/Glorot: $\sigma_W^2 = \frac{2}{n_{\text{in}} + n_{\text{out}}}$, suited to tanh/sigmoid-like activations.
- He: $\sigma_W^2 = \frac{2}{n_{\text{in}}}$, compensating for ReLU zeroing half the pre-activations.

These are not magic constants; they are attempts to keep forward activations and backward gradients in a reasonable range.
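A quick experiment makes the variance argument tangible: push Gaussian inputs through a stack of ReLU layers with matched vs. mismatched initialization. Depth, width, and sample count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)

def final_std(depth, fan_in, init_std, n_samples=1000):
    # Push Gaussian inputs through `depth` bias-free ReLU layers and
    # report the standard deviation of the final activations.
    h = rng.standard_normal((n_samples, fan_in))
    for _ in range(depth):
        W = init_std * rng.standard_normal((fan_in, fan_in))
        h = np.maximum(h @ W.T, 0.0)
    return h.std()

n = 256
he_std   = final_std(20, n, np.sqrt(2.0 / n))  # He: Var = 2/fan_in, matched to ReLU
tiny_std = final_std(20, n, np.sqrt(0.5 / n))  # too small: signal decays each layer
```

With He initialization the activation scale survives 20 layers; with a slightly-too-small variance it collapses by orders of magnitude, even though both settings look harmless for a single layer.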
Even with good initialization, distributions drift as parameters update. Normalization layers reduce internal covariate shift and improve conditioning.
For a mini-batch, BN normalizes pre-activations per feature:

$$\hat{a}_i = \frac{a_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y_i = \gamma \hat{a}_i + \beta,$$

where $\mu_{\mathcal{B}}, \sigma_{\mathcal{B}}^2$ are the batch mean and variance of that feature, and $\gamma, \beta$ are learned.
Pros: strong stabilizer, often speeds up CNN training.
Cons: batch-size dependence; tricky for RNNs/online/very small batches.
LN normalizes across features within a single example:

$$\hat{a}_i = \frac{a_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad \mu = \frac{1}{d}\sum_{j=1}^{d} a_j, \quad \sigma^2 = \frac{1}{d}\sum_{j=1}^{d}(a_j - \mu)^2.$$
Pros: works well in Transformers; independent of batch size.
RMSNorm scales by root-mean-square without subtracting the mean:

$$y_i = \frac{a_i}{\mathrm{RMS}(\mathbf{a})}\,\gamma_i, \qquad \mathrm{RMS}(\mathbf{a}) = \sqrt{\frac{1}{d}\sum_{j=1}^{d} a_j^2 + \epsilon}.$$
Often used in modern LLM stacks for simplicity and stability.
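The two per-example normalizers differ only in whether the mean is removed. A minimal sketch (scalar `gamma`/`beta` for brevity; real layers learn per-feature parameters):

```python
import numpy as np

def layer_norm(a, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize across the feature axis of each example, then scale/shift.
    mu = a.mean(axis=-1, keepdims=True)
    var = a.var(axis=-1, keepdims=True)
    return gamma * (a - mu) / np.sqrt(var + eps) + beta

def rms_norm(a, gamma=1.0, eps=1e-5):
    # Scale by root-mean-square only; the mean is left untouched.
    rms = np.sqrt((a ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * a / rms

a = np.array([[1.0, 2.0, 3.0, 10.0]])
ln = layer_norm(a)
rn = rms_norm(a)
```

LayerNorm output has zero mean and unit variance per example; RMSNorm output has unit RMS but keeps a nonzero mean, which is part of why it is cheaper and often sufficient.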
A residual block computes:

$$\mathbf{h}^{(\ell)} = \mathbf{h}^{(\ell-1)} + F\big(\mathbf{h}^{(\ell-1)}\big).$$

Differentiate w.r.t. $\mathbf{h}^{(\ell-1)}$:

$$\frac{\partial \mathbf{h}^{(\ell)}}{\partial \mathbf{h}^{(\ell-1)}} = \mathbf{I} + \frac{\partial F}{\partial \mathbf{h}^{(\ell-1)}}.$$

The identity term ensures there is always a path with derivative near 1, which combats vanishing gradients.
This is a key reason very deep networks (ResNets, deep Transformers) are trainable.
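The identity term can be verified numerically: estimate the block Jacobian by finite differences and inspect its singular values. The branch below (a small tanh layer) is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(5)

d = 16
W = 0.05 * rng.standard_normal((d, d))    # deliberately weak residual branch

def block(h):
    return h + np.tanh(W @ h)             # residual: identity plus branch F(h)

# Finite-difference Jacobian of the block at a random point.
h0 = rng.standard_normal(d)
eps = 1e-6
J = np.zeros((d, d))
for j in range(d):
    e = np.zeros(d)
    e[j] = eps
    J[:, j] = (block(h0 + e) - block(h0 - e)) / (2 * eps)

# J = I + dF/dh: its singular values cluster near 1, so the block passes
# gradient through no matter how weak the branch is.
singular_values = np.linalg.svd(J, compute_uv=False)
```

Even though the branch alone has tiny derivatives, the block's Jacobian stays close to the identity, which is exactly the "gradient highway" effect.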
When a deep model won’t train, ask:

- Are activations and gradients at a sane scale at every depth (no NaNs, no collapse to zero)?
- Is initialization matched to the activation function?
- Is normalization present and placed sensibly?
- Are there residual paths for gradients?
- Is the learning rate (and any warmup/schedule) appropriate?

These are not “details”—they are often the difference between success and failure.
In theory, you can specify $f$ and a loss and run SGD. In practice, deep learning is an engineering loop:
1) pick an architecture with the right inductive bias
2) ensure optimization is stable (normalization, residuals, schedules)
3) regularize and validate (to generalize)
4) scale data/compute appropriately
Let’s connect the concepts to concrete workflows.
Decide output head and loss together: regression → linear output + squared error; binary classification → sigmoid + binary cross-entropy; multiclass → softmax + cross-entropy.

For classification with logits $\mathbf{z} \in \mathbb{R}^K$ and label $y$:

$$\mathcal{L} = -\log p_y = -z_y + \log \sum_{j=1}^{K} e^{z_j}.$$

Compute with stable log-sum-exp:

$$\log \sum_j e^{z_j} = m + \log \sum_j e^{z_j - m}, \qquad m = \max_j z_j.$$

This prevents overflow in $e^{z_j}$.
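The max-shift trick is a two-line change in code. A minimal sketch with deliberately extreme logits:

```python
import numpy as np

def cross_entropy(z, y):
    # -log softmax(z)[y] via the max-shift log-sum-exp trick.
    m = z.max()
    lse = m + np.log(np.exp(z - m).sum())   # never exponentiates anything > 0
    return lse - z[y]

z = np.array([1000.0, 0.0, -1000.0])        # naive np.exp(1000) would overflow
loss = cross_entropy(z, 0)
```

Since every exponent after the shift is at most 0, no term can overflow, and the result matches the naive formula whenever the naive formula is computable at all.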
Even if you know SGD, deep learning often uses schedules and adaptive methods.
| Optimizer | Typical use | Notes |
|---|---|---|
| SGD + momentum | CNNs, large-scale vision | often best generalization; needs tuning + LR schedule |
| Adam/AdamW | Transformers, NLP | fast convergence; AdamW decouples weight decay |
Learning rate schedules (cosine decay, step decay, warmup) can be as important as the optimizer.
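A common recipe combines the two schedule ideas: linear warmup followed by cosine decay. A small sketch; the base rate, warmup length, and horizon are arbitrary:

```python
import math

def lr_at(step, total_steps, base_lr=3e-4, warmup_steps=100):
    # Linear warmup to base_lr, then cosine decay toward zero.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=1000) for s in range(1000)]
```

Warmup protects the early, poorly-conditioned phase of training; the slow cosine tail lets the model settle into a minimum.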
You already know L1/L2/dropout. In deep learning, common additional regularizers include: weight decay (decoupled, as in AdamW), data augmentation, label smoothing, early stopping, and stochastic-depth/dropout variants adapted to the architecture.
Tie back to our earlier 2-layer classifier: on two moons, light weight decay plus early stopping is usually enough; as depth and parameter count grow, regularization and data augmentation carry more of the load.

Deep learning succeeds when your architecture makes the right representations cheap to discover with gradient descent.
If you can explain:
1) $f$ as a composition of layers,
2) $\mathbf{h}^{(\ell)}$ as learned representations,
3) inductive bias as the reason architectures differ,
4) trainability as controlling Jacobian products,
…then you have a working deep learning “tech tree” model that scales to modern architectures.
Let $\mathbf{x} = \begin{bmatrix}1\\ -2\end{bmatrix}$. Define a 2-layer network:

Layer 1: $\mathbf{h} = \mathrm{ReLU}(\mathbf{W}^1\mathbf{x} + \mathbf{b}^1)$ with

$\mathbf{W}^1 = \begin{bmatrix}1 & 1\\ -1 & 2\\ 0 & -1\end{bmatrix}$, $\mathbf{b}^1 = \begin{bmatrix}0\\ 1\\ 1\end{bmatrix}$.

Layer 2 (logit): $z = \mathbf{w}^{2\top}\mathbf{h} + b^2$ with $\mathbf{w}^2 = \begin{bmatrix}1\\ 1\\ 1\end{bmatrix}$ and $b^2 = -2$ (the bias and layer-2 values are chosen for illustration). Output probability $p = \sigma(z)$.

Compute the pre-activation $\mathbf{W}^1\mathbf{x}$:

$\mathbf{W}^1\mathbf{x} = \begin{bmatrix}1 & 1\\ -1 & 2\\ 0 & -1\end{bmatrix}\begin{bmatrix}1\\-2\end{bmatrix}
= \begin{bmatrix}1\cdot1 + 1\cdot(-2)\\ (-1)\cdot1 + 2\cdot(-2)\\ 0\cdot1 + (-1)\cdot(-2)\end{bmatrix}
= \begin{bmatrix}-1\\ -5\\ 2\end{bmatrix}$.

Add bias:

$\mathbf{W}^1\mathbf{x} + \mathbf{b}^1 = \begin{bmatrix}-1\\ -5\\ 2\end{bmatrix} + \begin{bmatrix}0\\ 1\\ 1\end{bmatrix} = \begin{bmatrix}-1\\ -4\\ 3\end{bmatrix}$.

Apply ReLU elementwise:

$\mathbf{h} = \begin{bmatrix}0\\ 0\\ 3\end{bmatrix}$.

Compute the logit:

$z = \mathbf{w}^{2\top}\mathbf{h} + b^2 = (1)(0) + (1)(0) + (1)(3) - 2 = 1$.

Convert to probability with sigmoid:

$p = \sigma(1) = \dfrac{1}{1 + e^{-1}} \approx 0.731$.
Insight: Even this tiny deep net builds a representation where the final decision is simple (a dot product). ReLU created a sparse feature vector: only the third feature is active for this input. Scaling depth increases the space of learned features and their compositional reuse.
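The arithmetic above is easy to check by machine. This sketch uses the worked example's $\mathbf{W}^1$ and $\mathbf{x}$; the bias and layer-2 values (`b1`, `w2`, `b2`) are the illustrative choices, not canonical constants:

```python
import numpy as np

# W1 and x come from the worked example; b1, w2, b2 are illustrative values.
W1 = np.array([[1.0, 1.0], [-1.0, 2.0], [0.0, -1.0]])
b1 = np.array([0.0, 1.0, 1.0])
w2 = np.array([1.0, 1.0, 1.0])
b2 = -2.0
x = np.array([1.0, -2.0])

z1 = W1 @ x + b1                     # pre-activation
h = np.maximum(z1, 0.0)              # ReLU: sparse learned features
logit = w2 @ h + b2
p = 1.0 / (1.0 + np.exp(-logit))     # sigmoid output probability
```

The intermediate `h` confirms the sparsity observation: only one of the three hidden features is active for this input.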
Consider a depth-$L$ scalar network (for intuition):

$h^{(0)} = x$, and for $\ell = 1, \dots, L$:

$h^{(\ell)} = w\, h^{(\ell-1)}$ (a linear layer with scalar weight $w$).

Output is $f(x) = h^{(L)}$. We examine $\frac{df}{dx}$ and how it scales with depth.

Write the closed form:

By induction:

$h^{(\ell)} = w^{\ell} x$, so $f(x) = w^{L} x$.

Differentiate w.r.t. input:

$\dfrac{df}{dx} = w^{L}$.

Analyze cases:

If $|w| < 1$, then $|w|^{L} \to 0$ as $L$ grows ⇒ gradients vanish.
If $|w| > 1$, then $|w|^{L} \to \infty$ ⇒ gradients explode.
If $|w| \approx 1$, gradients stay in a workable range.
Insight: Real networks are not scalar, but the principle survives: deep learning stability depends on keeping the effective Jacobian product near an isometry (singular values near 1). Initialization, normalization, and residual connections are practical tools to approximate this behavior.
Compare two ways to process a 32×32 RGB image ($\mathbf{x} \in \mathbb{R}^{3072}$). Option A: a dense layer to 100 hidden units. Option B: a conv layer with 64 kernels of size 3×3.

We count parameters (ignoring biases for simplicity).

Dense layer: flattened input size is $32 \times 32 \times 3 = 3072$.

Parameters = $3072 \times 100 = 307{,}200$.

Convolution: each kernel has size $3 \times 3 \times 3 = 27$.

With 64 output channels:

Parameters = $27 \times 64 = 1{,}728$.

Compare:

Dense: 307,200 parameters
Conv: 1,728 parameters

The conv layer uses about $178\times$ fewer parameters.
Insight: Weight sharing and locality massively reduce parameters while matching image structure. This is inductive bias made concrete: CNNs restrict the function family to translation-equivariant local pattern detectors, improving sample efficiency.
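The counting argument above reduces to three lines of arithmetic:

```python
# Parameter counts for the 32x32 RGB comparison (biases ignored).
dense_params = (32 * 32 * 3) * 100   # flattened input -> 100 hidden units
conv_params = (3 * 3 * 3) * 64       # 64 kernels, each 3x3 over 3 input channels
ratio = dense_params / conv_params
```

Note the conv count is independent of image size: the same 1,728 weights would process a 1024×1024 image, while the dense layer's count would grow with the number of pixels.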
A deep network is a nested composition: $f(\mathbf{x}) = f^{(L)}(\cdots f^{(1)}(\mathbf{x}))$, producing intermediate representations $\mathbf{h}^{(1)}, \dots, \mathbf{h}^{(L)}$.
Depth primarily helps by enabling hierarchical feature composition and feature reuse—not merely by adding parameters.
Training stability is governed by products of Jacobians; vanishing/exploding gradients are expected failure modes without careful design.
Architectures are defined by inductive biases (locality, weight sharing, recurrence, attention) that improve sample efficiency and generalization.
CNNs encode translation equivariance and locality; RNNs encode sequential state; attention enables flexible long-range interactions.
Initialization schemes (Xavier/He) aim to keep activation/gradient scales reasonable across layers.
Normalization (BatchNorm/LayerNorm/RMSNorm) improves conditioning and stability; residual connections provide gradient highways.
Deep learning practice is an engineering loop balancing architecture, optimization, regularization, and compute constraints.
Treating depth as automatically beneficial: deeper models can be harder to optimize and may overfit without the right stabilizers and regularization.
Ignoring conditioning: using arbitrary initialization or omitting normalization/residuals often causes silent training failure (loss plateaus or NaNs).
Choosing architecture by trend rather than inductive bias: e.g., using an MLP on images without exploiting locality, or using attention when streaming constraints require recurrence.
Debugging only the optimizer: many issues blamed on SGD/Adam are actually numerical stability, normalization placement, or learning-rate schedule problems.
You have a depth-10 network where each layer (locally) has an average Jacobian spectral norm of about 0.9. Roughly how will gradient magnitudes scale from output back to the input? What qualitative behavior do you expect during training?
Hint: Use the idea that gradient scales like a product of per-layer factors.
If each layer contributes a factor ≈ 0.9, then over 10 layers the scale is about $0.9^{10} \approx 0.35$. Gradients shrink as they propagate backward (vanishing tendency). Training may be slower for early layers and may require residual connections, normalization, or different initialization to keep effective scales closer to 1.
Design choice: You need to classify 1-second audio clips sampled at 16 kHz. You can represent them as a spectrogram (time × frequency grid) or as raw waveform. Which inductive bias suggests a CNN is a strong baseline, and what structure is the CNN exploiting?
Hint: Think locality and weight sharing on a grid.
A CNN is a strong baseline because audio (especially as a spectrogram) has local time-frequency structure: nearby time frames and frequencies form local patterns (harmonics, onsets). Convolutions exploit locality (small receptive fields) and weight sharing (same detector across time/frequency shifts), giving translation-equivariant feature extraction and parameter efficiency.
Suppose you remove residual connections from a 48-layer Transformer block stack but keep everything else the same. Using the chain-of-Jacobians viewpoint, explain why optimization becomes much harder. Propose two architectural/training modifications that could partially compensate (even if imperfect).
Hint: Residuals add an identity term to the layer-to-layer derivative; without it the product of Jacobians must stay well-conditioned by itself.
Without residuals, the layer-to-layer derivative is $\frac{\partial F}{\partial \mathbf{h}}$ rather than $\mathbf{I} + \frac{\partial F}{\partial \mathbf{h}}$. The gradient becomes a product of many non-identity Jacobians, making vanishing/exploding gradients much more likely (singular values drift away from 1). Two partial compensations: (1) stronger/appropriate normalization (e.g., careful LayerNorm/RMSNorm placement, possibly pre-norm) to stabilize activation distributions and Jacobian spectra; (2) adjusted initialization and learning-rate schedule (smaller LR, warmup, scaled init) to keep updates small and maintain conditioning. Other possible aids include gradient clipping and reducing depth.