Principles for avoiding floating-point issues (overflow/underflow, catastrophic cancellation) and understanding problem conditioning; includes common fixes like log-sum-exp, normalization, and careful initialization. These concerns are crucial for reliable training of deep models.
Self-serve tutorial - low prerequisites, straightforward concepts.
Training deep models is often less about “the math is wrong” and more about “the math was computed in a fragile way.” Numerical stability is the craft of making mathematically correct formulas behave reliably under finite-precision floating-point arithmetic—so your softmax doesn’t overflow, your variance doesn’t go negative, and your gradients don’t silently drift due to reduction order.
Finite-precision arithmetic introduces rounding and a limited dynamic range. Numerical stability is about choosing algorithms that control rounding error and avoid overflow/underflow; conditioning is about how sensitive the problem itself is to small input perturbations. In deep learning, common fixes include log-sum-exp for softmax/log-likelihoods, stable variance/normalization formulas, careful reduction (pairwise/Kahan), and scale-aware initialization/normalization—especially in float16/float32 training.
You can write down a correct mathematical expression and still get nonsense on a real computer.
These aren’t “bugs” in deep learning libraries—they’re consequences of finite-precision arithmetic.
Numerical analysis separates two ideas:
A well-conditioned problem can be solved poorly by an unstable algorithm. A well-designed stable algorithm cannot fully rescue an ill-conditioned problem—but it can avoid making things worse.
Computers represent real numbers in floating-point format (float16/float32/float64). Two consequences matter constantly:
A standard model of floating-point arithmetic is:

fl(a ∘ b) = (a ∘ b)(1 + δ), with |δ| ≤ ε_machine,

where:
- ∘ is one of +, −, ×, ÷
- δ is the relative rounding error of that single operation

We’ll use machine epsilon (often denoted ε_machine) as the “scale” of relative rounding error.
Typical values (approximate):
| type | significand bits | ε_machine (≈) | dynamic range (rough) |
|---|---|---|---|
| float16 | 11 (incl. hidden bit) | 9.77e-4 | ~6e−5 to ~6.5e4 (subnormals reach ~6e−8) |
| bfloat16 | 8 (incl. hidden bit) | 7.81e-3 | ~1e−38 to ~3e38 |
| float32 | 24 | 1.19e-7 | ~1e−38 to ~3e38 |
| float64 | 53 | 2.22e-16 | ~1e−308 to ~1e308 |
Two important takeaways:
Consider computing y = f(x). If a tiny perturbation of x can cause a large change in f(x), the problem is ill-conditioned near x.
A classic measure is the relative condition number:

κ(x) = |x · f′(x)| / |f(x)|
Interpretation:
Example intuition:
In deep learning, conditioning shows up as:
An algorithm is (backward) stable if the computed result equals the exact result for a slightly perturbed input:

f̂(x) = f(x + Δx), with ‖Δx‖ / ‖x‖ = O(ε_machine).
This is the gold standard: rounding happens anyway; stability means your algorithm behaves like it just saw a tiny bit of noise.
Key mental model:
A core stability intuition is that floats are denser near 0 and sparser at large magnitudes.
```
value magnitude: 0 ---- 1 ----- 10 ----- 1e3 ---- 1e6 ---- 1e9
float spacing:   tiny   small   bigger   larger   huge     enormous
```

At large magnitudes, the next representable number can be far away, so adding a small number can do nothing (it rounds away). This drives reduction-order issues and “small updates disappear” problems.
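You can inspect this spacing directly; `np.spacing` gives the gap from a value to the next representable float:

```python
import numpy as np

# Gap to the next representable float32 at various magnitudes.
for x in [1.0, 10.0, 1e3, 1e6, 1e9]:
    gap = np.spacing(np.float32(x))
    print(f"{x:>10.0e}  next float32 is {gap:.3e} away")

# Consequence: adding a number smaller than half the gap does nothing.
big = np.float32(1e8)          # spacing here is 8.0
print(big + np.float32(1.0) == big)  # True: the 1.0 rounds away
```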
This section builds a concrete catalog of failure modes you will repeatedly see in ML code. The goal is to recognize them quickly, then reach for the right stable pattern.
Exponentials, products over many terms, and repeated multiplications are common.
Naive softmax:

softmax(z)ᵢ = exp(zᵢ) / ∑ⱼ exp(zⱼ)

If some zᵢ is large (say 1000 in float32), exp(zᵢ) overflows to ∞, and you get ∞/∞ → NaN.
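A minimal numpy reproduction of this failure mode (warnings suppressed so the NaNs are visible rather than noisy):

```python
import numpy as np

z = np.array([1000.0, 1001.0, 999.0], dtype=np.float32)
with np.errstate(over="ignore", invalid="ignore"):
    e = np.exp(z)      # each exp exceeds float32's max (~3.4e38) -> inf
    p = e / e.sum()    # inf / inf -> nan
print(e)  # [inf inf inf]
print(p)  # [nan nan nan]
```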
Subtracting two nearly equal numbers destroys significant digits.
If a ≈ b, then a − b is small, but the floating-point representations of a and b each carry rounding error. The subtraction exposes that error.
The identity is correct:

Var(x) = E[x²] − (E[x])²

But numerically, if x has large mean and small variance, both terms are large and close, so subtraction cancels digits. You may get a small negative result.
Floating-point addition is not associative:

(a + b) + c ≠ a + (b + c) in general,

because each intermediate sum is rounded.
Training is full of reductions:
On GPU/TPU, parallel reductions change order depending on hardware and scheduling. That can change results—usually slightly, sometimes significantly.
If you add a tiny number to a huge number, the tiny number may not change the sum at all (it gets rounded away). So order matters: adding small numbers together first can preserve more information.
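A minimal float32 demonstration (2²⁴ is used because the float32 spacing there is exactly 2, so adding 1.0 rounds away):

```python
import numpy as np

a = np.float32(2**24)    # 16777216; float32 spacing here is 2
b = np.float32(1.0)
c = np.float32(-2**24)

left  = (a + b) + c      # a + b rounds back to 2**24, so left = 0.0
right = a + (b + c)      # b + c = -16777215 is exactly representable, so right = 1.0
print(left, right)       # 0.0 1.0
```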
Normalization often has the form:

x̂ = (x − μ) / σ

If σ is near 0, this blows up. Many algorithms add an ε term:

x̂ = (x − μ) / √(σ² + ε)

The ε here is a stability constant, not machine epsilon (though it’s motivated by it). In float16, ε often must be larger.
Even if your forward pass is stable, gradients can underflow or become 0 in low precision.
Examples:
This interacts with floating-point: in float16, values below ~6e−5 become subnormal, and below ~6e−8 they flush to 0.
Given prerequisites about computational graphs, here’s the key operational point:
This is why stable local formulas (softmax, log-softmax, normalization) matter so much: they sit early in the graph and affect everything.
To address these issues systematically, your interactive canvas should have three mini-labs, each with:
Controls:
Display:
Controls:
Display:
Controls:
Display:
This canvas ties directly to: overflow in softmax, cancellation in variance, and reduction-order error—the three most common “why did my model blow up?” numerics in practice.
Now we turn the failure modes into stable, reusable techniques.
You want to compute:

LSE(z) = log ∑ᵢ exp(zᵢ)

Directly computing exp(zᵢ) is risky if any zᵢ is large.
Let m = maxᵢ zᵢ. Then:

∑ᵢ exp(zᵢ) = exp(m) · ∑ᵢ exp(zᵢ − m)

Taking log:

LSE(z) = m + log ∑ᵢ exp(zᵢ − m)
Key benefits:
Stable softmax:

softmax(z)ᵢ = exp(zᵢ − m) / ∑ⱼ exp(zⱼ − m), where m = maxⱼ zⱼ

Stable log-softmax:

log softmax(z)ᵢ = zᵢ − m − log ∑ⱼ exp(zⱼ − m)
In practice you compute log-softmax directly (stable) and then NLL loss, rather than softmax then log.
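A minimal numpy sketch of this pattern (the function names are illustrative, not any particular library’s API):

```python
import numpy as np

def log_softmax(z):
    # Shift by the max: algebraically exact, keeps every exponent <= 0.
    m = z.max()
    return z - (m + np.log(np.exp(z - m).sum()))

def nll_loss(z, y):
    # Cross-entropy = negative log-softmax at the true class index y.
    return -log_softmax(z)[y]

z = np.array([1000.0, 1001.0, 999.0])
print(log_softmax(z))   # ≈ [-1.4076 -0.4076 -2.4076], all finite
print(nll_loss(z, 1))   # ≈ 0.4076
```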
Variance is everywhere: batch norm, layer norm, feature scaling, monitoring.
The naive formula is prone to cancellation:

Var = E[x²] − (E[x])²
Given samples x₁, …, xₙ, maintain a running mean μₖ and a running sum of squared deviations M2ₖ.
Initialize:

μ₁ = x₁, M2₁ = 0

For k = 2, …, n:

δ = xₖ − μₖ₋₁
μₖ = μₖ₋₁ + δ/k
M2ₖ = M2ₖ₋₁ + δ · (xₖ − μₖ)

Then:

population variance = M2ₙ / n (or sample variance = M2ₙ / (n − 1))
It avoids subtracting two nearly equal large numbers. It tracks deviations relative to an evolving mean, keeping intermediate quantities closer to the scale of the variance.
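A plain-Python sketch of the update, compared against the naive identity. Note that on this data the naive formula fails even in float64:

```python
def var_naive(xs):
    # Var = E[x^2] - (E[x])^2: mathematically exact, numerically fragile.
    n = len(xs)
    mean = sum(xs) / n
    return sum(x * x for x in xs) / n - mean * mean

def var_welford(xs):
    # Streaming mean and sum of squared deviations (Welford's algorithm).
    mean, m2 = 0.0, 0.0
    for k, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / k
        m2 += delta * (x - mean)   # second factor uses the *updated* mean
    return m2 / len(xs)            # population variance

xs = [1e8, 1e8 + 1, 1e8 - 1, 1e8]  # mean 1e8, true variance 0.5
print(var_naive(xs))    # 0.0 — cancellation wiped out the answer, even in float64
print(var_welford(xs))  # 0.5 — correct
```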
Summing many numbers is ubiquitous: losses, gradient accumulation, dot products.
Instead of summing linearly, sum in a balanced tree structure. This tends to reduce worst-case error growth from O(n · ε_machine) to closer to O(log(n) · ε_machine) in many practical settings.
Maintain a compensation term for lost low-order bits.
Algorithm (for summing values x₁, …, xₙ): keep a running sum s and a compensation c, both starting at 0. For each xᵢ:
1) y = xᵢ − c (apply the stored compensation)
2) t = s + y (low-order bits of y may be rounded away here)
3) c = (t − s) − y (algebraically zero; numerically, it recovers what was just lost)
4) s = t
Interpretation: c estimates what was lost to rounding.
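The same algorithm in Python, on a sum where naive accumulation loses every term that falls below float64’s resolution at 1.0:

```python
def kahan_sum(xs):
    s, c = 0.0, 0.0        # running sum and compensation (lost low-order bits)
    for x in xs:
        y = x - c          # apply the stored compensation
        t = s + y          # low-order bits of y may be rounded away here
        c = (t - s) - y    # recover exactly what was lost this step
        s = t
    return s

# 1.0 followed by a million values each below half an ulp of 1.0:
xs = [1.0] + [1e-16] * 10**6
print(sum(xs))        # 1.0 — every 1e-16 rounds away against the running 1.0
print(kahan_sum(xs))  # ≈ 1.0000000001 — the tiny contributions survive
```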
Normalization isn’t only about optimization speed; it’s also about preventing overflow/underflow and controlling conditioning.
If one feature has magnitude ~1e6 and another ~1e−3, dot products and gradients become ill-scaled. Scaling features can reduce condition numbers and improve numerical behavior.
Careful initialization (e.g., Xavier/He) aims to keep variance of activations/gradients roughly stable across depth. That’s partly a conditioning story (keeping Jacobian singular values reasonable) and partly a numerical range story (avoid overflow/underflow).
A useful mental diagram:
| | Stable algorithm | Unstable algorithm |
|---|---|---|
| Well-conditioned problem | Great: accurate results | Often fixable by changing algorithm |
| Ill-conditioned problem | Best you can do: results sensitive to noise | Worst case: meaningless results |
Deep learning often has locally ill-conditioned regions (e.g., during early training or near sharp minima). Your job is to avoid compounding that with unstable computations.
Modern training often uses float16/bfloat16 for speed, with float32 master weights or accumulators.
Common stability patterns:
These are not “cheats”—they’re explicit numerics engineering.
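A toy numpy illustration of gradient underflow and loss scaling. The scale factor 1024 is illustrative; real mixed-precision frameworks choose and adjust the scale dynamically:

```python
import numpy as np

grad = 1e-6                      # a "true" gradient: small but meaningful

# Cast to float16 directly: 1e-6 is below float16's min normal (~6e-5),
# so it survives only as a coarse subnormal; smaller values flush to 0.
print(np.float16(grad))          # ~1e-6, but with large relative error
print(np.float16(1e-8))          # 0.0 — underflow, gradient signal gone

# Loss scaling: multiply the loss (hence all gradients) by S before the
# backward pass, then divide by S in float32 when applying the update.
S = 1024.0
scaled = np.float16(grad * S)    # 1.024e-3 sits safely in float16's range
recovered = np.float32(scaled) / S
print(recovered)                 # close to 1e-6 again
```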
This section connects the numerics toolbox to everyday deep learning components.
In classification, the loss is often:

L = −log softmax(z)_y

where y is the true class index.
Naively:
1) compute softmax
2) take log
This is numerically fragile because:
Stable approach: compute log-softmax directly via log-sum-exp and index it, L = −(z_y − LSE(z)).
This avoids both overflow and log(0) underflow.
Normalization layers compute statistics over batches/features.
Key stability decisions:
Important nuance:
Attention uses logits:

scores = QKᵀ / √d_k

Then softmax over each row of the score matrix.
The scaling is partly about gradients/optimization, but also about numerics:
FlashAttention and similar kernels implement careful stabilization (blockwise softmax with max-subtraction) explicitly to keep values in safe ranges.
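A toy sketch of the blockwise idea: merge running (max, sum) statistics so that no block’s logits are ever exponentiated without a shift. This is simplified — real kernels also carry per-row weighted outputs, not just the normalizer:

```python
import numpy as np

def blockwise_lse(blocks):
    # Running max m and running sum s, kept relative to the current max.
    m, s = -np.inf, 0.0
    for b in blocks:
        new_m = max(m, b.max())
        # Rescale the old sum to the new reference max, then add this block.
        s = s * np.exp(m - new_m) + np.exp(b - new_m).sum()
        m = new_m
    return m + np.log(s)

z = np.array([1000.0, 1001.0, 999.0, 998.0])
full = z.max() + np.log(np.exp(z - z.max()).sum())   # one-shot stable LSE
print(blockwise_lse([z[:2], z[2:]]), full)           # agree up to rounding
```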
Whenever you multiply many probabilities, underflow is inevitable:

p₁ · p₂ · ⋯ · pₙ → 0 as n grows, since each pᵢ < 1.

Use logs:

log ∏ᵢ pᵢ = ∑ᵢ log pᵢ
And whenever you need to sum probabilities in log space, use log-sum-exp.
This pattern generalizes far beyond softmax.
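A small numpy illustration with a sequence of 1000 probabilities around 0.1 — the product underflows even float64, while the log-likelihood stays perfectly ordinary:

```python
import numpy as np

p = np.full(1000, 0.1)     # 1000 "token probabilities" of 0.1 each
print(np.prod(p))          # 0.0 — true value 1e-1000 underflows float64
print(np.sum(np.log(p)))   # ≈ -2302.585 — finite and easy to work with
```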
Exploding gradients can cause:
Gradient clipping (by norm) can be viewed as enforcing a bound on magnitude to keep values in range.
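A sketch of clipping by global norm (the function name and signature are illustrative, not a specific framework’s API):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Rescale all gradients jointly if their global L2 norm exceeds max_norm.
    total = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

grads = [np.array([3.0, 4.0]), np.array([12.0])]     # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)                                          # 13.0
print(np.sqrt(sum(np.sum(g * g) for g in clipped)))  # ≈ 1.0 after clipping
```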
When training becomes NaN/Inf:
A key mindset: you’re not only debugging code—you’re debugging a numeric process under rounding.
Conditioning appears as “how hard the optimizer must work.”
While this lesson focuses on floating-point issues, it’s important to see that:
Let logits be z = [1000, 1001, 999]. Compute softmax and log-softmax stably (in exact math), and explain what goes wrong with the naive computation in float32.
Naive softmax uses exp(zᵢ). But exp(1000) and exp(1001) overflow in float32, producing ∞.
So the naive numerator/denominator become ∞/∞, which is undefined → NaN.
Use the stability shift m = max(z) = 1001.
Compute shifted logits: z' = z − m = [−1, 0, −2].
Compute exponentials of shifted logits:
exp(z') = [e^{−1}, e^{0}, e^{−2}] = [0.367879..., 1, 0.135335...].
These are all in (0, 1], safe in float16/32/64.
Compute denominator:
S = 0.367879... + 1 + 0.135335... = 1.503214...
Compute softmax:
softmax(z) = exp(z') / S
= [0.367879/1.503214, 1/1.503214, 0.135335/1.503214]
≈ [0.244728, 0.665241, 0.090031].
Compute log-sum-exp:
LSE(z) = m + log(∑ exp(zᵢ − m))
= 1001 + log(1.503214...)
≈ 1001 + 0.407606 = 1001.407606.
Compute log-softmax:
log softmax(z)ᵢ = zᵢ − LSE(z)
So:
log softmax = [1000 − 1001.407606,
1001 − 1001.407606,
999 − 1001.407606]
= [−1.407606, −0.407606, −2.407606].
Sanity check: exponentiating log-softmax returns the softmax probabilities:
exp(−0.407606) ≈ 0.665241, etc.
Insight: Subtracting max(logits) is not a heuristic; it’s an algebraically exact transformation that keeps exponentials in a safe numeric range. Log-sum-exp is the reusable pattern: whenever you see log(sum(exp(·))), you should think “shift by max.”
Suppose x values are near 1e8 with tiny spread: x = [100000000.0, 100000001.0, 99999999.0, 100000000.0]. Show why Var = E[x²] − (E[x])² is fragile, and outline Welford’s stable computation.
Compute the mean (exact math):
μ = (100000000 + 100000001 + 99999999 + 100000000) / 4
= 400000000 / 4 = 100000000.
Compute deviations from the mean:
(x − μ) = [0, 1, −1, 0].
So the true population variance is:
Var = (0² + 1² + (−1)² + 0²)/4 = 2/4 = 0.5.
Now look at the naive formula components:
E[x²] is about (1e8)² = 1e16 scale.
(μ)² is also (1e8)² = 1e16 scale.
Their difference should be 0.5 — extremely tiny relative to 1e16.
In float32, you only have ~7 decimal digits of precision.
At magnitude 1e16, the spacing between representable floats is huge (much larger than 1).
So x² and μ² are rounded so coarsely that subtracting them cannot reliably yield 0.5.
This is catastrophic cancellation: the small result is dominated by rounding error in the large terms.
Welford’s algorithm avoids subtracting huge nearly-equal numbers.
Initialize:
μ₁ = x₁ = 100000000
M2₁ = 0
k=2, x₂=100000001:
δ = x₂ − μ₁ = 1
μ₂ = μ₁ + δ/2 = 100000000 + 0.5 = 100000000.5
δ₂ = x₂ − μ₂ = 0.5
M2₂ = M2₁ + δ·δ₂ = 0 + 1·0.5 = 0.5
k=3, x₃=99999999:
δ = x₃ − μ₂ = 99999999 − 100000000.5 = −1.5
μ₃ = μ₂ + δ/3 = 100000000.5 − 0.5 = 100000000.0
δ₂ = x₃ − μ₃ = −1
M2₃ = 0.5 + (−1.5)(−1) = 2.0
k=4, x₄=100000000:
δ = x₄ − μ₃ = 0
μ₄ = μ₃ + 0/4 = 100000000
δ₂ = 0
M2₄ = 2.0
Population variance:
Var = M2₄ / 4 = 2/4 = 0.5 (correct).
Insight: The naive identity is mathematically correct but numerically dangerous when mean ≫ std. Welford keeps intermediate values near the scale of deviations, so rounding error doesn’t get amplified by subtracting huge nearly-equal numbers.
Sum the numbers [2²⁴, 1, −2²⁴] = [16777216, 1, −16777216] in float32 to illustrate non-associativity. Compare (a+b)+c vs a+(b+c).
Let a = 16777216 (= 2²⁴), b = 1, c = −16777216.
Compute (a + b) + c:
First a + b = 16777216 + 1.
At 2²⁴ in float32, the spacing between representable floats is 2, so 16777216 + 1 rounds back to 16777216 (round-half-to-even).
So (a+b) = 16777216.
Then (a+b)+c = 16777216 + (−16777216) = 0.
Compute a + (b + c):
First b + c = 1 + (−16777216) = −16777215.
This lies just below 2²⁴, where the spacing is 1, so it is exactly representable and keeps the +1 effect.
Then a + (b+c) = 16777216 + (−16777215) = 1.
So:
(a+b)+c = 0, but a+(b+c) = 1.
Same real-number expression, different floating-point result due to rounding in intermediate steps.
Insight: Summation order matters because adding a tiny number to a huge one can lose the tiny number entirely. Pairwise summation and compensated summation reduce this loss, which matters for large reductions (loss totals, gradient sums, dot products).
Finite-precision arithmetic means every operation is rounded; machine epsilon (and unit roundoff) sets the scale of relative rounding error.
Overflow/underflow are about limited dynamic range; catastrophic cancellation and reduction-order error are about limited precision.
Conditioning describes sensitivity of the true solution to input perturbations; stability describes whether an algorithm controls rounding error (often via backward stability).
Log-sum-exp (shift by max) is the standard fix for softmax/log-softmax and log-probability computations.
Variance is a classic cancellation trap; prefer Welford or stable two-pass methods over E[x²] − (E[x])² when mean ≫ std.
Floating-point addition is not associative; for large reductions, consider pairwise summation, FP32 accumulation, or compensated summation.
Normalization and scale-aware initialization help both optimization conditioning and numerical range control.
In mixed precision, keep sensitive ops and accumulations in FP32 and use loss scaling to prevent gradient underflow.
Computing softmax then log for cross-entropy instead of using a stable log-softmax (log-sum-exp).
Using Var = E[x²] − (E[x])² in low precision when data have a large mean and small variance, leading to negative variance and NaNs.
Assuming sums/reductions are deterministic and associative across hardware; changing parallel reduction order can change results.
Confusing the stability epsilon added in normalizations (a modeling/engineering constant) with machine epsilon (a property of the dtype).
Compute stably: L = log( exp(100) + exp(101) + exp(99) ). Give the exact stable expression and a numerical approximation (to ~4 decimals).
Hint: Use m = max = 101 and write L = m + log(∑ exp(zᵢ − m)).
Let m = 101.
Then:
L = 101 + log( exp(100−101) + exp(101−101) + exp(99−101) )
= 101 + log( e^{−1} + 1 + e^{−2} ).
Compute: e^{−1}≈0.3679, e^{−2}≈0.1353, sum≈1.5032.
So L ≈ 101 + log(1.5032) ≈ 101 + 0.4076 = 101.4076.
You are summing 10 million float16 values to compute a loss total. Name two strategies to reduce summation error and briefly explain why each helps.
Hint: Think about changing the summation order and using higher precision for the accumulator.
Two good strategies:
1) Accumulate in float32 (FP32 accumulation): float32 has much smaller ε_machine than float16, so each addition rounds less, reducing total error.
2) Pairwise/tree reduction (or compensated summation like Kahan): summing numbers of similar magnitude earlier reduces the “tiny added to huge gets lost” effect, lowering rounding error compared to naive left-to-right summation.
Explain (no code required) why computing cross-entropy as −log(softmax(z)[y]) can produce −∞ or NaN even when the true value is finite. Then state the stable alternative formula.
Hint: Consider underflow in softmax probabilities and overflow in exp(z). Use log-sum-exp.
Naively computing softmax requires exp(z). Large logits can overflow exp(z) to ∞, producing ∞/∞ → NaN. Small probabilities can underflow to 0, and then log(0) → −∞, even if the mathematically correct probability is merely very small but nonzero. The stable approach is to compute log-softmax directly:
log softmax(z)ᵢ = zᵢ − (m + log ∑ⱼ exp(zⱼ − m)), where m = maxⱼ zⱼ.
Then cross-entropy is −log softmax(z)_y, avoiding overflow and log(0) underflow.
Related next-step ideas you’ll likely meet in the Deep Learning node: