The softmax function converts raw model outputs (logits) into a probability distribution over classes, and is central to attention weight computation and classification losses. Familiarity with numerical stability tricks (log-sum-exp) and interpreting logits/probabilities is included.
Self-serve tutorial - low prerequisites, straightforward concepts.
A model often outputs any real numbers it wants—positive, negative, huge, tiny. Softmax is the bridge that turns those raw scores (logits) into a clean probability distribution you can interpret, train with, and reuse inside attention.
Logits are unnormalized scores z ∈ ℝᴷ (one per class). Softmax converts them into probabilities pᵢ = exp(zᵢ) / ∑ⱼ exp(zⱼ). Softmax is invariant to adding a constant to all logits, so you compute it stably by subtracting max(z) before exponentiating (the log-sum-exp trick). Logits encode relative preference via differences: pᵢ/pⱼ = exp(zᵢ − zⱼ).
Most models do not naturally output probabilities. A neural network layer typically outputs a vector of unconstrained real numbers, e.g. z = (2.3, −1.7, 0.4).
But in classification (and in attention), we want a probability distribution: numbers between 0 and 1 that sum to 1.
That’s what softmax does.
Suppose you have K classes. The model outputs a vector of logits z = (z₁, …, z_K) ∈ ℝᴷ, one per class.
Interpretation: logits are not probabilities. They are better thought of as: “how much evidence do we have for each class?”
Softmax maps logits to probabilities:
pᵢ = softmax(z)ᵢ = exp(zᵢ) / ∑ⱼ exp(zⱼ)
Key properties:
1) 0 < pᵢ < 1 for all i (because exp(·) > 0)
2) ∑ᵢ pᵢ = 1 (because you divide by the sum)
3) If zᵢ is bigger, pᵢ tends to be bigger (monotonic relationship)
So softmax turns an unconstrained vector in ℝᴷ into a point on the probability simplex.
If one logit is much larger than the others, softmax becomes very “peaky” (high confidence). If logits are close, softmax becomes more uniform (uncertainty).
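As a concrete sketch of the definition (using NumPy, which is an assumption about your stack), the formula translates line for line:

```python
import numpy as np

def softmax(z):
    """Naive softmax, straight from the definition.

    Fine for small logits; stability comes later in this lesson."""
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.0]))
print(p)         # peaked toward the largest logit
print(p.sum())   # normalization forces the entries to sum to 1
```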
Softmax does two intuitive operations:
1) Exponentiate: converts differences in logits into multiplicative ratios.
2) Normalize: turns those positive weights into probabilities.
This exponential step is why small logit differences can matter a lot.
Softmax gives a clean relationship:
pᵢ / pⱼ = exp(zᵢ) / exp(zⱼ) = exp(zᵢ − zⱼ)
So the ratio of two probabilities depends only on the difference of the corresponding logits. This is the core interpretation: logits are “log-probability-like scores” where differences correspond to multiplicative preference.
If K = 2 with logits (a, b), then
p(class 1) = exp(a) / (exp(a)+exp(b))
Divide numerator and denominator by exp(b):
p(class 1) = exp(a−b) / (exp(a−b)+1) = 1 / (1+exp(−(a−b)))
That’s the sigmoid σ(a−b). So softmax generalizes sigmoid.
This also reinforces: only the difference (a−b) matters, not the absolute values.
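A tiny script makes both claims checkable: the two-class softmax equals σ(a−b), and shifting both logits changes nothing.

```python
import math

def softmax2(a, b):
    """Probability of class 1 under a two-class softmax."""
    ea, eb = math.exp(a), math.exp(b)
    return ea / (ea + eb)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

a, b = 2.0, 0.5
print(softmax2(a, b), sigmoid(a - b))   # same value: softmax2 = sigmoid of the difference
print(softmax2(a + 10, b + 10))         # shift both logits by 10: unchanged
```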
Before you worry about numerical stability, you need to understand what softmax is doing conceptually: it exponentiates scores, normalizes them, depends only on logit differences, and sharpens as the logit spread grows. These facts show up constantly in practice: calibration, confidence, attention distributions, and why subtracting max(z) is “allowed.”
Let c be any real number and define z'ᵢ = zᵢ + c.
Compute softmax(z')ᵢ:
softmax(z')ᵢ = exp(zᵢ + c) / ∑ⱼ exp(zⱼ + c)
Now factor out exp(c):
= (exp(c)exp(zᵢ)) / (∑ⱼ exp(c)exp(zⱼ))
= exp(zᵢ) / ∑ⱼ exp(zⱼ)
= softmax(z)ᵢ
So adding the same constant to every logit does not change the probabilities.
Why you care: this is the mathematical reason you can subtract max(z) for stability.
Now consider scaling logits by a positive scalar α:
z' = αz
Then:
pᵢ(α) = exp(αzᵢ) / ∑ⱼ exp(αzⱼ)
Often this scaling is expressed via a temperature T > 0 (i.e., α = 1/T):
softmax_T(z)ᵢ = exp(zᵢ / T) / ∑ⱼ exp(zⱼ / T)
Let m = maxⱼ zⱼ and let k be an index achieving that max.
Consider the ratio for any i ≠ k:
pᵢ / p_k = exp(α(zᵢ − z_k))
Since zᵢ − z_k < 0 (assuming the maximum is unique), as α → ∞:
exp(α(zᵢ − z_k)) → 0
So p_k → 1 and all others → 0.
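The temperature effect is easy to see numerically. A minimal NumPy sketch (the function name `softmax_T` mirrors the notation above and is not a library API):

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Temperature-scaled softmax: small T sharpens, large T flattens."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = [2.0, 1.0, 0.0]
for T in (0.1, 1.0, 10.0):
    print(T, softmax_T(z, T))   # T=0.1 is nearly one-hot; T=10 is nearly uniform
```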
Softmax can be motivated (informally) as producing a distribution that favors high-scoring options exponentially while still spreading some probability over the rest, trading off fidelity to the scores against entropy.
This viewpoint helps when you later see attention: attention weights are softmax over similarity scores. Changing scale (like dividing by √d in Transformers) directly changes entropy/peakedness.
Because softmax exponentiates, it can output very high probabilities even when the model is wrong.
Example: logits z = (10, 9, 0). Softmax assigns p ≈ (0.731, 0.269, 0.00003): almost all mass goes to the first two classes, with the third near 0.
This is not a bug in softmax—it’s a reflection of what logits encode. If the model produces large-magnitude logits, it is expressing high confidence.
Practical takeaway: to interpret probabilities sensibly, you often need calibration techniques (temperature scaling, etc.). But as a foundational building block, softmax is doing exactly what its math says.
Softmax uses exp(zᵢ). Exponentials grow (and shrink) extremely fast.
In floating-point arithmetic (like float32), this causes two common problems: overflow (exp of a large positive logit becomes ∞) and underflow (exp of a large negative logit becomes 0, destroying information). Even when the final softmax probabilities are well-defined, naive computation can blow up.
Because softmax is shift-invariant, choose:
m = maxᵢ zᵢ
Define shifted logits:
z̃ᵢ = zᵢ − m
Then compute:
softmax(z)ᵢ = exp(z̃ᵢ) / ∑ⱼ exp(z̃ⱼ)
Now every shifted logit satisfies z̃ᵢ ≤ 0, so exp(z̃ᵢ) ∈ (0, 1], and the maximum entry maps to exactly 1. This prevents overflow and keeps values in a sane numeric range.
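Here is the shift in code, contrasted with the naive computation that fails on large logits (a NumPy sketch; the logits are from the worked example later in this lesson):

```python
import numpy as np

def stable_softmax(z):
    """Subtract the max before exponentiating; mathematically identical, no overflow."""
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1000.0, 1001.0, 999.0])

# naive: exp(1000) overflows to inf, and inf/inf is nan
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(z) / np.exp(z).sum()

print(naive)               # all nan
print(stable_softmax(z))   # well-defined probabilities
```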
Start from the original softmax:
pᵢ = exp(zᵢ) / ∑ⱼ exp(zⱼ)
Subtract m:
exp(zᵢ) = exp(zᵢ − m)exp(m)
So:
pᵢ = exp(zᵢ − m)exp(m) / ∑ⱼ exp(zⱼ − m)exp(m)
Cancel exp(m):
pᵢ = exp(zᵢ − m) / ∑ⱼ exp(zⱼ − m)
Often you don’t just need softmax; you need the log of its denominator:
log ∑ⱼ exp(zⱼ)
This appears in cross-entropy losses, log-likelihoods, and log-partition (normalization) constants.
The stable identity is:
log ∑ⱼ exp(zⱼ) = m + log ∑ⱼ exp(zⱼ − m)
where m = maxⱼ zⱼ.
Let S = ∑ⱼ exp(zⱼ). Factor out exp(m):
S = ∑ⱼ exp(zⱼ)
= ∑ⱼ exp((zⱼ − m) + m)
= ∑ⱼ exp(zⱼ − m)exp(m)
= exp(m) ∑ⱼ exp(zⱼ − m)
Now take log:
log S = log( exp(m) ∑ⱼ exp(zⱼ − m) )
= log exp(m) + log ∑ⱼ exp(zⱼ − m)
= m + log ∑ⱼ exp(zⱼ − m)
Now all exponentials are exp(≤ 0), so you avoid overflow.
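The identity above is a few lines of NumPy (the helper name `logsumexp` matches the common convention; SciPy ships an equivalent as `scipy.special.logsumexp`):

```python
import numpy as np

def logsumexp(z):
    """Stable log ∑ exp(z): factor out the max m, then add it back."""
    z = np.asarray(z, dtype=np.float64)
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

print(logsumexp([0.0, 0.0]))         # log 2 ≈ 0.6931
print(logsumexp([1000.0, 1000.0]))   # 1000 + log 2, no overflow
```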
Many training losses require log probabilities:
log pᵢ = log softmax(z)ᵢ
Expand the definition:
log pᵢ = log(exp(zᵢ) / ∑ⱼ exp(zⱼ))
= zᵢ − log ∑ⱼ exp(zⱼ)
This form is great because it avoids forming tiny probabilities directly. Combine it with the stable log-sum-exp identity:
log pᵢ = zᵢ − (m + log ∑ⱼ exp(zⱼ − m))
Computing pᵢ first and then taking its log would risk log(0) from underflow; computing log-softmax directly avoids this.
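A minimal NumPy sketch of the stable log-softmax (deep learning frameworks provide fused versions, e.g. `torch.nn.functional.log_softmax`):

```python
import numpy as np

def log_softmax(z):
    """Stable log-softmax: z_i − logsumexp(z), never forming tiny probabilities."""
    z = np.asarray(z, dtype=np.float64)
    m = z.max()
    return z - (m + np.log(np.exp(z - m).sum()))

lp = log_softmax(np.array([2.0, 1.0, 0.0]))
print(lp)                    # log-probabilities, all ≤ 0
print(np.exp(lp).sum())      # exponentiating recovers probs that sum to 1
```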
| Goal | Naive expression | Stable expression | Typical use |
|---|---|---|---|
| Softmax probs | exp(zᵢ)/∑ exp(z) | exp(zᵢ−m)/∑ exp(z−m) | inference, attention weights |
| log-sum-exp | log ∑ exp(z) | m + log ∑ exp(z−m) | normalization constants |
| log-softmax | log softmax(z) | zᵢ − logsumexp(z) | cross-entropy training |
When implementing or debugging ML code, it’s useful to ask: “Do I need probs, or only log-probs?” If you only need log-probs, prefer log-softmax.
Softmax is not an isolated formula. It sits at two of the most important junctions in modern ML:
1) Multiclass classification (via cross-entropy)
2) Transformer attention (via softmax-normalized similarity scores)
Understanding what softmax is doing in these contexts makes later nodes (masking, Transformers) much easier.
You have logits z ∈ ℝᴷ and a true class label y ∈ {1, …, K}.
The negative log-likelihood (cross-entropy with a one-hot target) is:
L = −log p_y
Substitute softmax:
L = −log( exp(z_y) / ∑ⱼ exp(zⱼ) )
Now simplify:
L = −( z_y − log ∑ⱼ exp(zⱼ) )
= log ∑ⱼ exp(zⱼ) − z_y
This form is the reason log-sum-exp is central: it’s literally inside the loss.
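The loss form above implements directly (a NumPy sketch; the logits and label reproduce worked example 3 later in this lesson):

```python
import numpy as np

def cross_entropy_from_logits(z, y):
    """L = logsumexp(z) − z_y, computed stably straight from logits."""
    z = np.asarray(z, dtype=np.float64)
    m = z.max()
    lse = m + np.log(np.exp(z - m).sum())
    return lse - z[y]

# true class index 1 (the logit with value 1)
print(cross_entropy_from_logits([3.0, 1.0, -2.0, 0.0], 1))
```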
If you add a constant c to all logits:
z'_j = z_j + c
Then:
L' = log ∑ⱼ exp(zⱼ + c) − (z_y + c)
= log( exp(c)∑ⱼ exp(zⱼ) ) − z_y − c
= (c + log ∑ⱼ exp(zⱼ)) − z_y − c
= L
So the loss is shift-invariant too.
Even without doing a full backprop derivation, one famous result explains a lot:
For cross-entropy with softmax, the gradient w.r.t. logits is:
∂L/∂zᵢ = pᵢ − 𝟙[i = y]
Meaning: the gradient pushes each predicted probability toward its target, 1 for the true class and 0 for the others, and it vanishes exactly when the prediction matches the label.
This elegant “p − y_onehot” structure is one reason softmax + cross-entropy is standard.
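You can verify the “p − y_onehot” gradient numerically with a finite-difference check (a NumPy sketch, reusing the stable loss form from above):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # stable softmax
    return e / e.sum()

def loss(z, y):
    m = z.max()               # cross-entropy from logits: logsumexp(z) − z_y
    return m + np.log(np.exp(z - m).sum()) - z[y]

z = np.array([3.0, 1.0, -2.0, 0.0])
y = 1

# analytic gradient: p − one_hot(y)
grad = softmax(z)
grad[y] -= 1.0

# finite-difference check on coordinate 0
eps = 1e-6
e0 = np.eye(4)[0]
num = (loss(z + eps * e0, y) - loss(z - eps * e0, y)) / (2 * eps)
print(grad[0], num)   # the two estimates agree
```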
In Transformers, attention scores are computed (simplified) as:
scoresᵢ = ( q · kᵢ ) / √d
Then attention weights are:
αᵢ = softmax(scores)ᵢ
And the output is a weighted sum:
o = ∑ᵢ αᵢ vᵢ
Here softmax ensures the weights αᵢ are nonnegative and sum to 1. So attention becomes a convex combination of value vectors. This is essential for stability and interpretability.
Dot products grow with dimension d. If scores become too large in magnitude, softmax becomes too peaky (near one-hot), harming learning.
Dividing by √d reduces score scale and keeps the softmax in a healthier regime.
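A minimal single-query sketch of this computation (NumPy; random q, K, V stand in for learned projections):

```python
import numpy as np

def attention(q, K, V):
    """Single-query scaled dot-product attention."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)    # similarity of the query to each key, scaled
    scores = scores - scores.max() # stable softmax over the scores
    w = np.exp(scores)
    w = w / w.sum()                # attention weights: nonnegative, sum to 1
    return w @ V                   # convex combination of value vectors

rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(5, 4))   # 5 keys of dimension 4
V = rng.normal(size=(5, 3))   # 5 values of dimension 3
out = attention(q, K, V)
print(out)                    # one output vector of dimension 3
```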
In attention, you often must prevent attending to certain positions (padding or future tokens). The standard trick is:
scoresᵢ ← scoresᵢ + maskᵢ
where maskᵢ is 0 for positions that may be attended to and −∞ (in practice, a very large negative number) for positions that must be ignored.
Then softmax assigns ~0 probability to masked positions because exp(−∞) ≈ 0.
This works because softmax converts large negative logits into near-zero probability.
So when you learn Sequence Masking (causal and padding masks), you’ll essentially be learning how to engineer logits before applying softmax.
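The masking trick in code (a NumPy sketch; `-1e9` stands in for −∞, as is common in practice):

```python
import numpy as np

def masked_softmax(scores, allowed):
    """Replace disallowed scores with a huge negative value before softmax."""
    s = np.where(allowed, scores, -1e9)   # -1e9 plays the role of −∞
    s = s - s.max()                       # stable softmax
    e = np.exp(s)
    return e / e.sum()

scores = np.array([2.0, 1.0, 3.0, 0.5])
allowed = np.array([True, True, False, True])   # e.g. position 2 is padding
p = masked_softmax(scores, allowed)
print(p)   # p[2] is ~0; the remaining weights renormalize among themselves
```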
A few practical guidelines: subtract max(z) before exponentiating, prefer log-softmax and log-sum-exp during training, mask by adding large negative values to logits, and remember that scale/temperature directly controls how peaked the distribution is.
Softmax is simple—but its behavior under shifts, scales, and masking is the backbone of many modern architectures.
A 3-class classifier outputs logits z = (2, 1, 0). Compute softmax(z) and interpret the ratio p₁/p₂.
Write the definition:
pᵢ = exp(zᵢ) / (exp(z₁)+exp(z₂)+exp(z₃))
Compute exponentials:
exp(2) ≈ 7.389
exp(1) ≈ 2.718
exp(0) = 1
Compute the normalization constant:
S = 7.389 + 2.718 + 1 = 11.107
Compute probabilities:
p₁ = 7.389 / 11.107 ≈ 0.665
p₂ = 2.718 / 11.107 ≈ 0.245
p₃ = 1 / 11.107 ≈ 0.090
Interpret the ratio using logit differences:
p₁/p₂ = exp(z₁ − z₂) = exp(2 − 1) = exp(1) ≈ 2.718
Check numerically:
0.665 / 0.245 ≈ 2.714 (close, rounding differences)
Insight: Softmax turns additive differences in logits into multiplicative differences in probabilities. A 1-point advantage in logits means about 2.7× higher probability relative to a competitor, all else equal.
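The hand computation above can be checked in a few lines (standard-library Python only):

```python
import math

z = [2.0, 1.0, 0.0]
S = sum(math.exp(v) for v in z)          # normalization constant ≈ 11.107
p = [math.exp(v) / S for v in z]
print([round(x, 3) for x in p])          # matches the hand computation
print(p[0] / p[1], math.exp(z[0] - z[1]))  # the ratio equals exp(z₁ − z₂) exactly
```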
Compute softmax(z) for z = (1000, 1001, 999) using a numerically stable method.
Identify the maximum logit:
m = max(1000, 1001, 999) = 1001
Shift logits:
z̃ = z − m = (1000−1001, 1001−1001, 999−1001) = (−1, 0, −2)
Exponentiate shifted logits (these are safe):
exp(−1) ≈ 0.3679
exp(0) = 1
exp(−2) ≈ 0.1353
Normalize:
S = 0.3679 + 1 + 0.1353 = 1.5032
p₁ = 0.3679 / 1.5032 ≈ 0.2447
p₂ = 1 / 1.5032 ≈ 0.6652
p₃ = 0.1353 / 1.5032 ≈ 0.0900
Explain why the naive way fails:
Naively you’d compute exp(1000), exp(1001), exp(999), which overflow even in float64 (let alone float32), producing ∞ values and the undefined ratio ∞/∞ = NaN.
Subtracting m uses shift invariance so the final probabilities are unchanged.
Insight: Stable softmax is not a heuristic—it’s mathematically identical to naive softmax, but keeps the exponentials in a safe range by exploiting shift invariance.
A 4-class model outputs logits z = (3, 1, −2, 0). The true class is y = 2 (the logit 1). Compute L = −log p_y stably.
Use the stable loss form:
L = log ∑ⱼ exp(zⱼ) − z_y
Compute m = max(zⱼ):
m = max(3, 1, −2, 0) = 3
Compute shifted exponentials:
z − m = (0, −2, −5, −3)
exp(0)=1
exp(−2)≈0.1353
exp(−5)≈0.0067
exp(−3)≈0.0498
Compute the shifted sum:
S̃ = 1 + 0.1353 + 0.0067 + 0.0498 = 1.1918
Compute log-sum-exp:
log ∑ⱼ exp(zⱼ) = m + log S̃ = 3 + log(1.1918)
log(1.1918) ≈ 0.1755
So log ∑ⱼ exp(zⱼ) ≈ 3.1755
Subtract the true logit z_y = 1:
L ≈ 3.1755 − 1 = 2.1755
Insight: Cross-entropy from logits naturally involves log-sum-exp. Computing it stably avoids ever forming tiny probabilities like p_y directly.
Logits z ∈ ℝᴷ are unnormalized real-valued scores; softmax converts them into a probability distribution.
Softmax is pᵢ = exp(zᵢ)/∑ⱼ exp(zⱼ), so differences in logits become probability ratios: pᵢ/pⱼ = exp(zᵢ − zⱼ).
Softmax is shift-invariant: softmax(z + c·1) = softmax(z). This enables stable computation.
Compute softmax stably with m = max(z) and exp(zᵢ−m); compute log-sum-exp as m + log ∑ exp(zⱼ−m).
For training, prefer log-softmax / log-sum-exp forms; they avoid underflow and give stable cross-entropy.
Scaling logits changes distribution sharpness; temperature T controls how peaky or uniform softmax becomes.
In attention, softmax turns similarity scores into nonnegative weights that sum to 1, enabling weighted sums of value vectors.
Masking works by adding large negative values to selected logits so softmax assigns them ~0 probability.
Computing exp(zᵢ) directly on large logits (overflow) instead of subtracting max(z) first.
Interpreting logits as probabilities (e.g., thinking a negative logit means “negative probability”). Logits are unconstrained scores.
Forgetting that softmax only cares about relative differences; adding the same constant to all logits should not change outputs.
Using softmax probabilities in the loss computation instead of a combined stable cross-entropy-from-logits (leading to log(0) or NaNs).
Compute softmax(z) for z = (0, 0, 0, 0). What is the entropy intuition here?
Hint: All exponentials are the same; normalization divides equally among classes.
exp(0)=1 for each class, so ∑ exp(zⱼ)=4 and each pᵢ = 1/4 = 0.25. Intuition: identical logits mean no preference, so softmax returns the uniform (maximum-entropy) distribution.
Show that softmax is shift-invariant: prove softmax(z) = softmax(z − m·1) for any scalar m.
Hint: Factor exp(−m) out of numerator and denominator (or exp(m) depending on direction).
Let z'ᵢ = zᵢ − m. Then softmax(z')ᵢ = exp(zᵢ − m)/∑ⱼ exp(zⱼ − m) = (exp(zᵢ)exp(−m)) / (∑ⱼ exp(zⱼ)exp(−m)) = exp(zᵢ)/∑ⱼ exp(zⱼ) = softmax(z)ᵢ. So shifting all logits by the same m does not change the distribution.
Given logits z = (5, 2, 1) and true class y = 1 (the first entry), compute the cross-entropy loss L = log ∑ⱼ exp(zⱼ) − z_y using a stable log-sum-exp step.
Hint: Let m = max(z)=5, compute ∑ exp(zⱼ−m), then add m and subtract z_y.
m=5. Shifted logits: (0, −3, −4). Exponentials: exp(0)=1, exp(−3)≈0.0498, exp(−4)≈0.0183. Sum S̃≈1.0681. logsumexp ≈ 5 + log(1.0681). log(1.0681)≈0.0659, so logsumexp≈5.0659. With y=1, z_y=5, so L≈5.0659−5=0.0659.
Next, softmax becomes a tool you control by editing logits before normalization: masking and temperature scaling are both just edits to z.