Deep Learning

Prerequisite: Machine Learning · Difficulty: █████ · Depth: 12 · Unlocks: 3

Neural networks with many layers: CNNs, RNNs, and related architectures.


Core Concepts

  • Layered function composition: a model is a nested composition of parameterized transformations (layers)
  • Distributed hierarchical representations: each layer computes feature representations that combine lower-level features into higher-level abstractions
  • Architectural inductive biases: structural constraints (e.g., locality, weight sharing, recurrence, attention) that restrict the function family and encode useful invariances

Key Symbols & Notation

  • $f_\theta(x) = f_L(\cdots f_2(f_1(x)))$ — the composed network function
  • $\mathbf{h}^\ell$ — activation / representation vector at layer $\ell$

Essential Relationships

  • The composed function f_theta produces the layerwise representations h^l (hierarchical, distributed features), and architectural inductive biases determine which compositions are parameter-efficient and generalize from data
All Concepts (20)

  • Depth and hierarchical feature learning (many stacked layers produce progressively higher-level representations)
  • Convolutional network architecture elements beyond the convolution operation (pooling, strided convolution for downsampling, transposed/upsampling convolution)
  • Architectural efficiency tricks for CNNs (depthwise-separable convolutions, bottleneck layers)
  • Dilated (atrous) convolution (increase receptive field without increasing parameter count)
  • Effective receptive field vs. theoretical receptive field (actual influence of input pixels on deeper activations)
  • Residual and skip connections (identity/shortcut paths between non-adjacent layers)
  • Normalization layers specific to deep architectures (BatchNorm, LayerNorm, GroupNorm) and their use
  • Recurrent neural networks (RNNs) as sequence models: recurrence, hidden state, timestep indexing
  • Backpropagation Through Time (BPTT) - applying backpropagation to unfolded recurrent computations
  • Gated recurrent units and memory cells (LSTM and GRU): gate vectors and cell state to control information flow
  • Gradient issues specific to deep and recurrent architectures (vanishing and exploding gradients in very deep nets / long sequences) and practical fixes
  • Gradient clipping (rescaling or truncating gradients to stabilize training)
  • Attention mechanisms (query/key/value formulation) and self-attention
  • Transformer-style architectures: replacing recurrence with multi-head self-attention plus positional encodings
  • Softmax output layer and cross-entropy loss (converting logits to a probability distribution and measuring mismatch)
  • Advanced optimization algorithms commonly used in deep learning (Adam, RMSProp, AdamW and their per-parameter adaptive learning rates / moment estimates)
  • Initialization schemes tuned for deep nets (Xavier/Glorot, He/Kaiming) to control variance propagation
  • Training at scale: pretraining, transfer learning, fine-tuning, and checkpointing
  • Design trade-offs in architecture depth vs width, and compute/parameter-efficiency strategies
  • Regularization and augmentation strategies commonly used with deep architectures (data augmentation, early stopping, label smoothing - beyond L1/L2 and dropout)

Teaching Strategy

Multi-session curriculum: complex material that assumes substantial prior knowledge. Use mastery gates and deliberate practice.

Deep learning is the art of building a useful family of functions by stacking simple transformations into a long composition—and then making that composition trainable and stable at scale.

TL;DR:

A deep network is a composed function $f_\theta(x)=f_L(\cdots f_2(f_1(x)))$. Depth creates hierarchical representations (each layer builds features from earlier features). Architecture is about inductive bias: choosing structure (convolutions, recurrence, attention, normalization, residual paths) that makes learning feasible and generalization likely. Training success depends as much on conditioning (initialization, normalization, residuals) as on optimization (SGD variants).

What Is Deep Learning? (And a Minimal Working Mental Model)

Why before how

Deep learning is not “just bigger neural nets.” It’s a strategy for representing complicated functions using many simple, reusable parts (layers), and for learning representations that make downstream prediction easy.

The core object is a composed function:

$$f_\theta(x)=f_L\big( f_{L-1}(\cdots f_2(f_1(x)) )\big)$$

At each layer $\ell$, we maintain an activation / representation vector $\mathbf{h}^{\ell}$ (often written $h^\ell$ when the shape is clear):

  • $\mathbf{h}^{0} = \mathbf{x}$ (the input)
  • $\mathbf{h}^{\ell} = f_{\ell}(\mathbf{h}^{\ell-1};\theta_{\ell})$
  • the output is $\hat{\mathbf{y}} = \mathbf{h}^{L}$

A very common concrete layer is affine + nonlinearity:

$$\mathbf{z}^{\ell} = \mathbf{W}^{\ell}\mathbf{h}^{\ell-1} + \mathbf{b}^{\ell}, \qquad \mathbf{h}^{\ell} = \phi(\mathbf{z}^{\ell})$$

Depth matters because it changes what is easy to represent and what is easy to learn.
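The composed forward pass above can be sketched directly. This is a minimal NumPy illustration (the layer shapes and the ReLU-layers-plus-linear-head convention are illustrative choices, not prescribed by the text):

```python
import numpy as np

def forward(x, layers):
    """Run a composed network f_theta(x) = f_L(... f_2(f_1(x))).

    `layers` is a list of (W, b) pairs. Each layer is affine followed by
    ReLU, except the last, which stays linear (a plain head).
    Returns the output and every intermediate representation h^l.
    """
    h = x
    reps = [h]                         # reps[l] is h^l, with h^0 = x
    for i, (W, b) in enumerate(layers):
        z = W @ h + b                  # z^l = W^l h^{l-1} + b^l
        h = np.maximum(z, 0.0) if i < len(layers) - 1 else z
        reps.append(h)
    return h, reps

# Hypothetical 3-layer net: R^2 -> R^4 -> R^4 -> R^1
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 2)), np.zeros(4)),
          (rng.standard_normal((4, 4)), np.zeros(4)),
          (rng.standard_normal((1, 4)), np.zeros(1))]
y, reps = forward(np.array([1.0, -2.0]), layers)
print([r.shape for r in reps])  # [(2,), (4,), (4,), (1,)]
```

The list `reps` is exactly the sequence $\mathbf{h}^0, \dots, \mathbf{h}^L$ discussed above.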

Minimal working mental model: a 2-layer network on a simple task

You already know backprop and SGD; let’s anchor deep learning in one concrete “small but real” example.

Task: binary classification in 2D. Input $\mathbf{x} \in \mathbb{R}^2$, label $y\in\{0,1\}$. Suppose the decision boundary is not linearly separable (e.g., two moons).

A 2-layer MLP (one hidden layer) is:

$$\mathbf{h}^1 = \mathrm{ReLU}(\mathbf{W}^1\mathbf{x}+\mathbf{b}^1)$$
$$\hat{y} = \sigma(\mathbf{w}^2 \cdot \mathbf{h}^1 + b^2)$$

Interpretation:

  • Layer 1 creates a set of learned “features” (half-spaces gated by ReLU).
  • Layer 2 mixes those features into a probability via logistic regression.

Even here, you can see the deep learning pattern:

1) Representation: $\mathbf{h}^1$ is not hand-designed—it’s learned.

2) Composition: the model builds a nonlinear function from simple parts.

3) Trainability: success depends on gradients flowing from $\hat{y}$ back to $\mathbf{W}^1$.

Now scale that idea: more layers, richer inductive biases (convolution, attention), and careful conditioning (normalization/residuals) to make training stable.

Checkpoint: what “deep” adds

Before going further, keep these three questions in mind:

1) What family of functions does this architecture represent?

2) What representations will intermediate layers tend to discover?

3) Will gradients and signals propagate stably through depth?

Deep learning is largely the practice of answering those three questions well.

Core Mechanic 1: Layered Function Composition → Representations

Why depth is not just “more parameters”

You could increase width (more units per layer) or increase depth (more layers). Both add parameters, but they add different representational structure.

A useful mental model:

  • Width adds “parallel feature templates.”
  • Depth adds “feature reuse and hierarchy.”

Depth encourages distributed hierarchical representations:

  • Early layers: local/simple patterns.
  • Middle layers: combinations of patterns.
  • Late layers: task-level abstractions.

In images, this often looks like edges → textures → parts → objects. In language, characters/subwords → local syntax → semantics.

The forward pass as representation building

Write the network as repeated transformations:

$$\mathbf{h}^{\ell} = f_{\ell}(\mathbf{h}^{\ell-1})$$

Think of $\mathbf{h}^{\ell}$ as a coordinate system the network invents. Learning aims to make the final layer’s problem “simple” (often linearly separable).

A very common pattern is:

$$f_{\ell}(\mathbf{h}) = \phi\big(\mathrm{Norm}(\mathbf{W}\mathbf{h}+\mathbf{b})\big)$$

where Norm might be BatchNorm, LayerNorm, RMSNorm, etc.

A little math: how composition shapes sensitivity

Deep nets are compositions, so their derivatives are products (chains) of Jacobians.

Let $\mathbf{h}^{\ell} \in \mathbb{R}^{d_\ell}$. Define the Jacobian

$$\mathbf{J}^{\ell} = \frac{\partial \mathbf{h}^{\ell}}{\partial \mathbf{h}^{\ell-1}} \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$$

Then:

$$\frac{\partial \mathbf{h}^{L}}{\partial \mathbf{x}} = \mathbf{J}^{L}\,\mathbf{J}^{L-1}\cdots \mathbf{J}^{1}$$

This single equation explains a lot:

  • If typical singular values of $\mathbf{J}^\ell$ are > 1, gradients can explode.
  • If they are < 1, gradients can vanish.
  • If they cluster near 1, training tends to be stable.

You don’t need to compute these Jacobians explicitly to benefit from this mental model; it motivates initialization, normalization, and residual connections.
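A quick numerical sketch of this effect (the depth, width, and random-Jacobian scaling here are arbitrary choices for illustration, not a statement about any particular network):

```python
import numpy as np

def product_norm(scale, depth=20, d=32, seed=0):
    """Push a unit vector through `depth` random Jacobians whose entries
    are scaled so typical singular values sit around `scale`; return the
    norm of the result."""
    rng = np.random.default_rng(seed)
    g = np.ones(d) / np.sqrt(d)                      # unit vector
    for _ in range(depth):
        # entry variance scale^2/d keeps typical singular values ~ scale
        J = scale * rng.standard_normal((d, d)) / np.sqrt(d)
        g = J @ g
    return np.linalg.norm(g)

for s in (0.8, 1.0, 1.2):
    print(s, product_norm(s))   # vanishing vs. exploding trend with depth
```

Shrinking the per-layer scale below 1 collapses the product; growing it above 1 blows it up, exactly as the singular-value argument predicts.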

Checkpoint: what you should carry forward

  • Representations $\mathbf{h}^\ell$ are the real “product” of deep learning.
  • The chain-of-Jacobians view predicts optimization pathologies.
  • Architecture is about shaping both representations and Jacobians.

Core Mechanic 2: Architectural Inductive Biases (CNNs, RNNs, Attention, MLPs)

Why inductive bias is the point of architecture

Without assumptions, learning in high dimensions is sample-inefficient (curse of dimensionality). Architectural choices encode assumptions like:

  • locality
  • translation equivariance
  • temporal recurrence
  • permutation invariance/equivariance
  • long-range interactions

These biases restrict the function class to something that matches the world.

A comparison table of major deep learning architectures

| Architecture | Core operation | Inductive bias | Strengths | Common failure mode |
|---|---|---|---|---|
| MLP (feedforward) | dense affine + nonlinearity | weak (mostly none) | flexible; works on tabular/embeddings | data-hungry; ignores structure |
| CNN | convolution (weight sharing, locality) | translation equivariance; local patterns | vision, audio; parameter efficient | struggles with global context unless deep/augmented |
| RNN / LSTM / GRU | recurrence $\mathbf{h}_t=f(\mathbf{h}_{t-1},\mathbf{x}_t)$ | sequential state; temporal locality | streaming, variable-length sequences | long-range dependencies; parallelization limits |
| Attention / Transformer | content-based mixing (self-attn) | flexible pairwise interactions; permutation equivariance with positional encoding | long-range dependencies; parallelizable | quadratic cost in sequence length; needs lots of data |
| GNN | message passing on graphs | graph equivariance/invariance | molecules, networks, relational data | oversmoothing; limited expressivity for some tasks |

We’ll focus on CNNs and sequence models (RNNs/attention), since they are canonical deep learning building blocks.


CNNs: locality + weight sharing

A 2D convolution layer applies a kernel over local neighborhoods. If you already know the convolution operation, the key deep-learning additions are:

1) Channels: kernels map $C_{in}$ input channels to $C_{out}$ output channels.

2) Stacking: repeated convs grow the receptive field.

A simplified expression (single output channel) is:

$$y[i,j] = \sum_{u,v} k[u,v] \; x[i+u, j+v]$$

With multiple channels:

$$y_c[i,j] = \sum_{c'}\sum_{u,v} k_{c,c'}[u,v] \; x_{c'}[i+u, j+v]$$

Why it helps: weight sharing means you learn an “edge detector” once and reuse it across the image. Locality reduces parameters and encourages features to be local.
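The multi-channel sum above can be written out naively (loop-based and slow, for clarity only; real frameworks use highly optimized kernels, and, as in those frameworks, this is technically cross-correlation):

```python
import numpy as np

def conv2d(x, k):
    """Naive multi-channel 2D convolution (cross-correlation, as in deep
    learning frameworks). x: (C_in, H, W); k: (C_out, C_in, kh, kw).
    Valid padding, stride 1 -> output (C_out, H-kh+1, W-kw+1)."""
    c_in, H, W = x.shape
    c_out, _, kh, kw = k.shape
    out = np.zeros((c_out, H - kh + 1, W - kw + 1))
    for c in range(c_out):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[c, i, j] = np.sum(k[c] * x[:, i:i + kh, j:j + kw])
    return out

x = np.arange(2 * 5 * 5, dtype=float).reshape(2, 5, 5)   # 2-channel "image"
k = np.zeros((1, 2, 3, 3))
k[0, :, 1, 1] = 1.0          # kernel that sums the centre pixel's channels
y = conv2d(x, k)
print(y.shape)               # (1, 3, 3)
```

Note that the same kernel `k[c]` is applied at every spatial position: that loop structure *is* weight sharing.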

Common CNN design motifs:

  • small kernels (3×3) stacked
  • pooling or strided conv for downsampling
  • residual blocks (ResNet)

RNNs: recurrence for sequences

An RNN maintains a state $\mathbf{h}_t$ updated over time:

$$\mathbf{h}_t = \phi(\mathbf{W}_h\mathbf{h}_{t-1} + \mathbf{W}_x\mathbf{x}_t + \mathbf{b})$$

This encodes an inductive bias: “the present depends on a compressed summary of the past.”
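The recurrence is only a few lines of code (a sketch with $\phi = \tanh$; the shapes and 0.5 weight scaling are illustrative choices):

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b):
    """Vanilla RNN: h_t = tanh(W_h h_{t-1} + W_x x_t + b), h_0 = 0.
    xs: (T, d_in). Returns the stacked hidden states, shape (T, d_h)."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t + b)   # same weights at every step
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(1)
T, d_in, d_h = 6, 3, 4
hs = rnn_forward(rng.standard_normal((T, d_in)),
                 0.5 * rng.standard_normal((d_h, d_h)),
                 0.5 * rng.standard_normal((d_h, d_in)),
                 np.zeros(d_h))
print(hs.shape)   # (6, 4)
```

Each `h` is the compressed summary of everything seen so far; unrolling this loop over time is exactly the computation BPTT differentiates through.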

Training issue: backprop through time multiplies many Jacobians across timesteps, causing vanishing/exploding gradients. LSTMs/GRUs mitigate this with gating, roughly creating more stable paths for gradient flow.


Attention/Transformers: content-based routing

Self-attention computes a weighted average of value vectors using query-key similarity.

Given matrices $\mathbf{Q},\mathbf{K},\mathbf{V}$:

$$\mathrm{Attn}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathrm{softmax}\Big(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\Big)\mathbf{V}$$

The inductive bias shifts from locality/recurrent state to learned interactions between all positions.
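A minimal sketch of the formula above (single head, no masking, and no learned projection matrices, which a real Transformer block would add):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    Q: (n, d); K: (m, d); V: (m, d_v). Returns output and weights."""
    d = Q.shape[-1]
    w = softmax(Q @ K.T / np.sqrt(d))   # (n, m); each row sums to 1
    return w @ V, w

rng = np.random.default_rng(2)
out, w = attention(rng.standard_normal((4, 8)),   # 4 queries
                   rng.standard_normal((5, 8)),   # 5 keys
                   rng.standard_normal((5, 8)))   # 5 values
print(out.shape, w.shape)   # (4, 8) (4, 5)
```

Each output row is a convex combination of value vectors, with mixing weights chosen by query-key similarity: content-based routing in one matrix product.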

Transformers add:

  • positional information (positional encodings)
  • residual connections
  • normalization
  • MLP blocks

Checkpoint: choosing an architecture

A practical decision rule:

  • If your data has grid locality (images, spectrograms): start with CNNs.
  • If your data is sequential and you need streaming/low latency: consider RNNs.
  • If you need long-range dependencies and can afford batch processing: attention/Transformers.

Architecture is not just accuracy—it’s compute, latency, memory, and data efficiency.

Making Deep Nets Trainable: Initialization, Normalization, Residual Paths

Why this section exists

In shallow models, SGD “just works” surprisingly often. In deep models, optimization can fail even when the model is expressive enough.

The chain-of-Jacobians view tells you why: signals and gradients must propagate through many transformations. If their magnitudes drift, training becomes unstable.

We’ll build a stable mental model in three steps:

1) initialization tries to keep variance roughly constant across layers

2) normalization actively stabilizes distributions during training

3) residual connections provide easy paths for gradient flow


1) Initialization as variance control

Consider a layer:

$$\mathbf{z} = \mathbf{W}\mathbf{h}$$

Assume the $h_i$ are i.i.d. with mean 0 and variance $\mathrm{Var}(h_i)=\sigma_h^2$. If the weights have mean 0 and variance $\mathrm{Var}(W_{ij})=\sigma_w^2$, then (roughly):

$$\mathrm{Var}(z_j) \approx n\,\sigma_w^2\,\sigma_h^2$$

where $n$ is the fan-in.

To keep $\mathrm{Var}(z_j)$ from blowing up with depth, choose $\sigma_w^2 \propto 1/n$.

Two famous schemes:

  • Xavier/Glorot (good for tanh-like activations): $\sigma_w^2 \approx \frac{2}{\text{fan-in}+\text{fan-out}}$
  • He/Kaiming (good for ReLU): $\sigma_w^2 \approx \frac{2}{\text{fan-in}}$

These are not magic constants; they are attempts to keep forward activations and backward gradients in a reasonable range.
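A small experiment makes the variance argument concrete: push a batch through a deep ReLU stack with the naive $\sigma_w^2 = 1/n$ choice versus the He choice $\sigma_w^2 = 2/n$ (depth, width, and batch size here are arbitrary):

```python
import numpy as np

def final_activation_std(weight_std, depth=30, n=256, seed=0):
    """Forward a random batch through `depth` ReLU layers whose weights are
    i.i.d. N(0, weight_std^2); return the std of the final activations."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((n, 64))              # 64 examples, n features
    for _ in range(depth):
        W = weight_std * rng.standard_normal((n, n))
        h = np.maximum(W @ h, 0.0)                # affine + ReLU
    return h.std()

n = 256
naive = final_activation_std(np.sqrt(1.0 / n))    # sigma_w^2 = 1/n: shrinks
he = final_activation_std(np.sqrt(2.0 / n))       # He: sigma_w^2 = 2/n
print(naive, he)   # naive collapses toward 0; He stays order 1
```

The factor of 2 exists because ReLU zeroes out roughly half the signal; He initialization compensates for exactly that.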


2) Normalization as conditioning control

Even with good initialization, distributions drift as parameters update. Normalization layers reduce internal covariate shift and improve conditioning.

BatchNorm (BN)

For a mini-batch, BN normalizes pre-activations per feature:

$$\hat{z} = \frac{z-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}, \qquad y = \gamma\hat{z}+\beta$$

  • $\mu_B, \sigma_B^2$ are computed over the batch
  • $\gamma,\beta$ are learned scale/shift parameters

Pros: strong stabilizer, often speeds up CNN training.

Cons: batch-size dependence; tricky for RNNs/online/very small batches.

LayerNorm (LN)

LN normalizes across features within a single example:

$$\hat{\mathbf{z}} = \frac{\mathbf{z}-\mu}{\sqrt{\sigma^2+\epsilon}}$$

Pros: works well in Transformers; independent of batch size.

RMSNorm

RMSNorm scales by root-mean-square without subtracting the mean:

$$\mathrm{RMS} = \sqrt{\frac{1}{d}\sum_i z_i^2 + \epsilon}, \qquad \hat{\mathbf{z}} = \frac{\mathbf{z}}{\mathrm{RMS}}$$

Often used in modern LLM stacks for simplicity and stability.
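Following the formulas above, LayerNorm and RMSNorm are each a few lines (a sketch with scalar `gamma`/`beta` rather than the per-feature learned vectors a real layer would use):

```python
import numpy as np

def layer_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    """LayerNorm: normalize across the features of each example, then scale/shift."""
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return gamma * (z - mu) / np.sqrt(var + eps) + beta

def rms_norm(z, gamma=1.0, eps=1e-5):
    """RMSNorm: scale by root-mean-square; no mean subtraction."""
    rms = np.sqrt(np.mean(z * z, axis=-1, keepdims=True) + eps)
    return gamma * z / rms

z = np.array([[1.0, 2.0, 3.0, 4.0]])
ln = layer_norm(z)
rn = rms_norm(z)
print(ln.mean(), ln.std())   # ~0 and ~1 after LayerNorm
```

Both operate per example along the feature axis, which is why they are independent of batch size, unlike BatchNorm.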


3) Residual connections as gradient highways

A residual block computes:

$$\mathbf{h}^{\ell+1} = \mathbf{h}^{\ell} + F(\mathbf{h}^{\ell})$$

Differentiate with respect to $\mathbf{h}^{\ell}$:

$$\frac{\partial \mathbf{h}^{\ell+1}}{\partial \mathbf{h}^{\ell}} = \mathbf{I} + \frac{\partial F}{\partial \mathbf{h}^{\ell}}$$

The identity term $\mathbf{I}$ ensures there is always a path with derivative near 1, which combats vanishing gradients.

This is a key reason very deep networks (ResNets, deep Transformers) are trainable.
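A toy comparison of Jacobian products with and without the identity path (here each block's $F$ is a small random linear map, so its Jacobian is just a matrix $A$; the depth, width, and 0.1 scale are illustrative):

```python
import numpy as np

def chained_jacobian_norm(depth=50, d=16, residual=True, seed=0):
    """Multiply layer Jacobians for blocks h <- h + F(h) (I + A per layer)
    or plain h <- F(h) (A per layer), where F is a small random linear map;
    return the norm of the product applied to a unit vector."""
    rng = np.random.default_rng(seed)
    J = np.eye(d)
    for _ in range(depth):
        A = (0.1 / np.sqrt(d)) * rng.standard_normal((d, d))
        J = ((np.eye(d) + A) if residual else A) @ J
    return np.linalg.norm(J @ (np.ones(d) / np.sqrt(d)))

print(chained_jacobian_norm(residual=True))    # stays order 1
print(chained_jacobian_norm(residual=False))   # collapses toward 0
```

With the identity term, 50 layers of small perturbations leave gradient magnitudes roughly intact; without it, the same 50 maps annihilate the signal.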


Checkpoint: the stability toolkit

When a deep model won’t train, ask:

  • Are activations exploding/vanishing? (initialization, normalization)
  • Are gradients unstable? (residuals, normalization, learning rate)
  • Is the loss numerically unstable? (log-sum-exp tricks, mixed precision care)

These are not “details”—they are often the difference between success and failure.

Application/Connection: Designing and Training Deep Models in Practice

Why practice looks different from theory

In theory, you can specify $f_\theta$ and run SGD. In practice, deep learning is an engineering loop:

1) pick an architecture with the right inductive bias

2) ensure optimization is stable (normalization, residuals, schedules)

3) regularize and validate (to generalize)

4) scale data/compute appropriately

Let’s connect the concepts to concrete workflows.


A practical blueprint: from data to model

Step 1: Represent input and output

  • Images: tensors in $\mathbb{R}^{C\times H\times W}$
  • Text: token IDs → embeddings in $\mathbb{R}^{T\times d}$
  • Tabular: normalized numeric features + embeddings for categoricals

Decide output:

  • classification: softmax / sigmoid
  • regression: linear head (maybe with bounded activation)
  • sequence-to-sequence: autoregressive decoder or encoder-decoder

Step 2: Choose an inductive bias

  • Spatial locality? Use CNN or vision transformer with patches.
  • Long-range dependencies? Use attention.
  • Need streaming? Consider RNNs or efficient attention variants.

Step 3: Choose a loss and ensure numerical stability

For classification with logits $\mathbf{s}$ and label $y$:

$$\mathrm{CE}(\mathbf{s},y) = -\log \frac{e^{s_y}}{\sum_k e^{s_k}}$$

Compute with stable log-sum-exp:

$$\log \sum_k e^{s_k} = m + \log \sum_k e^{s_k-m}, \qquad m=\max_k s_k$$

This prevents overflow when computing $e^{s_k}$.
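The stable computation can be sketched as (a minimal single-example version):

```python
import numpy as np

def cross_entropy(logits, y):
    """CE(s, y) = -log softmax(s)[y], computed via stable log-sum-exp."""
    m = logits.max()
    log_z = m + np.log(np.sum(np.exp(logits - m)))   # log sum_k e^{s_k}
    return log_z - logits[y]

s = np.array([1000.0, 0.0, -5.0])   # naive exp(1000) would overflow
print(cross_entropy(s, 0))          # ~0.0: the correct class dominates
```

Subtracting the max makes the largest exponent exactly $e^0 = 1$, so nothing overflows; the underflow of the tiny terms is harmless.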

Step 4: Optimization choices (SGD family)

Even if you know SGD, deep learning often uses schedules and adaptive methods.

| Optimizer | Typical use | Notes |
|---|---|---|
| SGD + momentum | CNNs, large-scale vision | often best generalization; needs tuning + LR schedule |
| Adam/AdamW | Transformers, NLP | fast convergence; AdamW decouples weight decay |

Learning rate schedules (cosine decay, step decay, warmup) can be as important as the optimizer.

Step 5: Regularize for generalization

You already know L1/L2/dropout. In deep learning, common additional regularizers include:

  • data augmentation (especially vision)
  • early stopping
  • label smoothing
  • stochastic depth / drop-path (deep residual nets)

Worked mental model: “depth creates features, architecture chooses which features are easy”

Tie back to our earlier 2-layer classifier:

  • adding layers lets the model build intermediate representations that progressively linearize the task
  • CNN bias makes “edge-like” features easy to learn everywhere
  • attention bias makes “copy/align” behavior easy to learn across positions

Deep learning succeeds when your architecture makes the right representations cheap to discover with gradient descent.


Connections forward: why this node unlocks attention and meta-learning

  • Attention mechanisms refine the idea of learned representations by letting the model decide where to read from in its own activations.
  • Meta-learning treats the learning process itself as an object: deep nets can learn representations that adapt quickly, or learn optimizers/updates.

Final checkpoint

If you can explain:

1) $f_\theta$ as a composition of layers,

2) $\mathbf{h}^\ell$ as learned representations,

3) inductive bias as the reason architectures differ,

4) trainability as controlling Jacobian products,

…then you have a working deep learning “tech tree” model that scales to modern architectures.

Worked Examples (3)

Example 1: Forward pass as feature building in a 2-layer ReLU network

Let $\mathbf{x} = \begin{bmatrix}1\\-2\end{bmatrix}$. Define a 2-layer network:

Layer 1: $\mathbf{h}^1 = \mathrm{ReLU}(\mathbf{W}^1\mathbf{x}+\mathbf{b}^1)$ with

$$\mathbf{W}^1 = \begin{bmatrix}1 & 1\\ -1 & 2\\ 0 & -1\end{bmatrix}, \qquad \mathbf{b}^1=\begin{bmatrix}0\\1\\-1\end{bmatrix}.$$

Layer 2 (logit): $s = \mathbf{w}^2\cdot \mathbf{h}^1 + b^2$ with $\mathbf{w}^2=\begin{bmatrix}2\\-1\\1\end{bmatrix}$ and $b^2=0$. The output probability is $\hat{y}=\sigma(s)$.

  1. Compute the pre-activation $\mathbf{z}^1 = \mathbf{W}^1\mathbf{x}+\mathbf{b}^1$:

    $$\mathbf{W}^1\mathbf{x} = \begin{bmatrix}1 & 1\\ -1 & 2\\ 0 & -1\end{bmatrix}\begin{bmatrix}1\\-2\end{bmatrix} = \begin{bmatrix}1\cdot1 + 1\cdot(-2)\\ (-1)\cdot1 + 2\cdot(-2)\\ 0\cdot1 + (-1)\cdot(-2)\end{bmatrix} = \begin{bmatrix}-1\\ -5\\ 2\end{bmatrix}.$$

    Add the bias:

    $$\mathbf{z}^1 = \begin{bmatrix}-1\\ -5\\ 2\end{bmatrix} + \begin{bmatrix}0\\1\\-1\end{bmatrix} = \begin{bmatrix}-1\\ -4\\ 1\end{bmatrix}.$$

  2. Apply ReLU elementwise:

    $$\mathbf{h}^1 = \mathrm{ReLU}(\mathbf{z}^1)=\begin{bmatrix}\max(0,-1)\\ \max(0,-4)\\ \max(0,1)\end{bmatrix} = \begin{bmatrix}0\\0\\1\end{bmatrix}.$$

  3. Compute the logit:

    $$s = \mathbf{w}^2\cdot \mathbf{h}^1 = \begin{bmatrix}2\\-1\\1\end{bmatrix}\cdot\begin{bmatrix}0\\0\\1\end{bmatrix} = 2\cdot0 + (-1)\cdot0 + 1\cdot1 = 1.$$

  4. Convert to probability with sigmoid:

    $$\hat{y} = \sigma(1)=\frac{1}{1+e^{-1}} \approx 0.731.$$

Insight: Even this tiny deep net builds a representation $\mathbf{h}^1$ where the final decision is simple (a dot product). ReLU created a sparse feature vector: only the third feature is active for this input. Scaling depth increases the space of learned features and their compositional reuse.

Example 2: Why gradients can vanish/explode (a Jacobian product toy calculation)

Consider a depth-$L$ scalar network (for intuition):

$h^0=x$, and for $\ell=1,\dots,L$:

$h^{\ell} = a\,h^{\ell-1}$ (a linear layer with scalar weight $a$).

The output is $h^L = a^L x$. We examine $\frac{\partial h^L}{\partial x}$ and how it scales with depth.

  1. Write the closed form:

    $h^1 = a x$

    $h^2 = a h^1 = a(ax) = a^2 x$

    By induction:

    $h^L = a^L x$.

  2. Differentiate w.r.t. input:

    $\frac{\partial h^L}{\partial x} = \frac{\partial (a^L x)}{\partial x} = a^L.$

  3. Analyze cases:

    If $|a|<1$, then $|a|^L \to 0$ as $L$ grows ⇒ gradients vanish.

    If $|a|>1$, then $|a|^L \to \infty$ ⇒ gradients explode.

    If $|a|\approx 1$, gradients stay in a workable range.

Insight: Real networks are not scalar, but the principle survives: deep learning stability depends on keeping the effective Jacobian product near an isometry (singular values near 1). Initialization, normalization, and residual connections are practical tools to approximate this behavior.

Example 3: CNN parameter efficiency vs dense layers (quick comparison)

Compare two ways to process a 32×32 RGB image ($C=3$). Option A: a dense layer to 100 hidden units. Option B: a conv layer with 64 kernels of size 3×3.

We count parameters (ignoring biases for simplicity).

  1. Dense layer: the flattened input size is $32\cdot32\cdot3 = 3072$.

    Parameters $= 3072 \times 100 = 307{,}200$.

  2. Convolution: each kernel has size $3\times3\times C_{in} = 3\times3\times3 = 27$.

    With 64 output channels:

    Parameters $= 27 \times 64 = 1{,}728$.

  3. Compare:

    Dense: 307,200 parameters

    Conv: 1,728 parameters

    The conv layer uses about $307{,}200 / 1{,}728 \approx 178\times$ fewer parameters.

Insight: Weight sharing and locality massively reduce parameters while matching image structure. This is inductive bias made concrete: CNNs restrict the function family to translation-equivariant local pattern detectors, improving sample efficiency.
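The same count as a quick check (weights only, biases ignored, matching the example above):

```python
def dense_params(in_features, out_features):
    return in_features * out_features        # weight matrix only, no biases

def conv_params(c_in, c_out, kh, kw):
    return c_in * c_out * kh * kw            # kernel weights only, no biases

dense = dense_params(32 * 32 * 3, 100)
conv = conv_params(3, 64, 3, 3)
print(dense, conv, round(dense / conv))      # 307200 1728 178
```

Note the conv count is independent of the image's height and width: that is weight sharing showing up directly in the arithmetic.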

Key Takeaways

  • A deep network is a nested composition: $f_\theta(x)=f_L(\cdots f_2(f_1(x)))$, producing intermediate representations $\mathbf{h}^\ell$.

  • Depth primarily helps by enabling hierarchical feature composition and feature reuse—not merely by adding parameters.

  • Training stability is governed by products of Jacobians; vanishing/exploding gradients are expected failure modes without careful design.

  • Architectures are defined by inductive biases (locality, weight sharing, recurrence, attention) that improve sample efficiency and generalization.

  • CNNs encode translation equivariance and locality; RNNs encode sequential state; attention enables flexible long-range interactions.

  • Initialization schemes (Xavier/He) aim to keep activation/gradient scales reasonable across layers.

  • Normalization (BatchNorm/LayerNorm/RMSNorm) improves conditioning and stability; residual connections provide gradient highways.

  • Deep learning practice is an engineering loop balancing architecture, optimization, regularization, and compute constraints.

Common Mistakes

  • Treating depth as automatically beneficial: deeper models can be harder to optimize and may overfit without the right stabilizers and regularization.

  • Ignoring conditioning: using arbitrary initialization or omitting normalization/residuals often causes silent training failure (loss plateaus or NaNs).

  • Choosing architecture by trend rather than inductive bias: e.g., using an MLP on images without exploiting locality, or using attention when streaming constraints require recurrence.

  • Debugging only the optimizer: many issues blamed on SGD/Adam are actually numerical stability, normalization placement, or learning-rate schedule problems.

Practice

easy

You have a depth-10 network where each layer (locally) has an average Jacobian spectral norm of about 0.9. Roughly how will gradient magnitudes scale from output back to the input? What qualitative behavior do you expect during training?

Hint: Use the idea that gradient scales like a product of per-layer factors.

Solution

If each layer contributes a factor ≈ 0.9, then over 10 layers the scale is about $0.9^{10} \approx 0.35$. Gradients shrink as they propagate backward (a vanishing tendency). Training may be slower for early layers and may require residual connections, normalization, or different initialization to keep effective scales closer to 1.

medium

Design choice: You need to classify 1-second audio clips sampled at 16 kHz. You can represent them as a spectrogram (time × frequency grid) or as raw waveform. Which inductive bias suggests a CNN is a strong baseline, and what structure is the CNN exploiting?

Hint: Think locality and weight sharing on a grid.

Solution

A CNN is a strong baseline because audio (especially as a spectrogram) has local time-frequency structure: nearby time frames and frequencies form local patterns (harmonics, onsets). Convolutions exploit locality (small receptive fields) and weight sharing (same detector across time/frequency shifts), giving translation-equivariant feature extraction and parameter efficiency.

hard

Suppose you remove residual connections from a 48-layer Transformer block stack but keep everything else the same. Using the chain-of-Jacobians viewpoint, explain why optimization becomes much harder. Propose two architectural/training modifications that could partially compensate (even if imperfect).

Hint: Residuals add an identity term to the layer-to-layer derivative; without it the product of Jacobians must stay well-conditioned by itself.

Solution

Without residuals, the layer-to-layer derivative is dominated by $\partial F/\partial \mathbf{h}$ rather than $\mathbf{I}+\partial F/\partial \mathbf{h}$. The gradient becomes a product of many non-identity Jacobians, making vanishing/exploding gradients much more likely (singular values drift away from 1). Two partial compensations: (1) stronger/appropriate normalization (e.g., careful LayerNorm/RMSNorm placement, possibly pre-norm) to stabilize activation distributions and Jacobian spectra; (2) adjusted initialization and learning-rate schedule (smaller LR, warmup, scaled init) to keep updates small and maintain conditioning. Other possible aids include gradient clipping and reducing depth.
