A function that converts a vector of real values into a probability distribution by exponentiating and normalizing each entry; commonly used to produce attention weights. Understanding softmax behavior, numerical stability, and temperature scaling is important for interpreting attention scores.
Deep-dive lesson - accessible entry point but dense material. Use worked examples and spaced repetition.
Whenever a model needs to turn “scores” into “choices”, it needs a bridge from arbitrary real numbers to probabilities. Softmax is that bridge: it takes a vector of real-valued logits and returns a probability distribution—smoothly, differentiably, and with behavior you can control (via shifting for stability and temperature for sharpness).
Softmax maps a vector x ∈ ℝⁿ to probabilities by exponentiating and normalizing: softmax(x)ᵢ = exp(xᵢ)/∑ⱼ exp(xⱼ). It’s shift-invariant (adding a constant to all logits changes nothing), so we can subtract max(x) for numerical stability. Temperature scaling softmax(x/T) controls how peaked the distribution is: low T → more confident/peaked; high T → flatter/more uniform.
In many ML systems, we compute scores for several options: which class is present, which token to attend to, which action to take. Those scores often live in ℝ: they can be negative, huge, and not constrained to sum to 1.
But downstream we often want a probability distribution: nonnegative entries that sum to 1, usable as class probabilities, attention weights, or sampling weights.
Softmax is the standard way to convert a vector of real-valued scores (“logits”) into a probability distribution.
Let x = (x₁, x₂, …, xₙ) be a vector of real numbers (logits).
The softmax function returns a vector p = softmax(x) where each component is
pᵢ = exp(xᵢ) / ∑ⱼ exp(xⱼ).
Softmax does two simple things: it exponentiates each logit (making every entry strictly positive), then normalizes by the sum (making the entries sum to 1).
So softmax turns relative score gaps into relative probability mass.
For each i: pᵢ > 0, and ∑ᵢ pᵢ = 1.
So softmax(x) lies on the probability simplex (the set of all probability vectors).
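The exponentiate-and-normalize recipe fits in a few lines of pure Python; this is a minimal sketch (the helper name `softmax` is illustrative, not from any particular library):

```python
import math

def softmax(x):
    """Exponentiate each logit, then normalize by the sum."""
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([2.0, 1.0, 0.0])
print([round(v, 3) for v in p])  # ≈ [0.665, 0.245, 0.090]
```

Note this naive version is fine for small logits; we harden it against overflow later.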
Softmax is simple to write down, but its behavior (and pitfalls) matter a lot in real models—especially numerical stability and temperature scaling, which we’ll build up next.
Exponentials have two key effects: they make every entry strictly positive, and they turn additive gaps between logits into multiplicative ratios between probabilities.
A crucial identity is the ratio form:
softmax(x)ᵢ / softmax(x)ₖ = exp(xᵢ − xₖ).
This says softmax compares logits via their differences. If xᵢ exceeds xₖ by Δ, then i gets e^Δ times more probability than k.
If n = 2 with logits (a, b), then
p₁ = exp(a) / (exp(a) + exp(b)) = 1 / (1 + exp(−(a − b))) = σ(a − b).
That’s exactly a sigmoid in the logit difference (a − b). This is a nice mental model: softmax is the multi-option generalization of the sigmoid.
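The two-option equivalence is easy to check numerically; a small sketch (helper names are illustrative):

```python
import math

def softmax2(a, b):
    """Softmax probability of the first of two options."""
    ea, eb = math.exp(a), math.exp(b)
    return ea / (ea + eb)

def sigmoid(d):
    """Logistic sigmoid of the logit difference."""
    return 1.0 / (1.0 + math.exp(-d))

# Two-way softmax equals a sigmoid of the difference, for any pair of logits.
a, b = 1.7, -0.4
assert abs(softmax2(a, b) - sigmoid(a - b)) < 1e-12
```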
Softmax is not invariant to scaling of logits. If you multiply logits by a constant c, softmax typically becomes more or less peaked (we’ll formalize this with temperature later).
Consider three logits: x = (2, 1, 0).
Compute exponentials: e² ≈ 7.389, e¹ ≈ 2.718, e⁰ = 1.
Sum ≈ 11.11
So probabilities ≈ (0.665, 0.245, 0.090).
A gap of 1 between logits becomes a factor of e ≈ 2.72 in weight; a gap of 2 becomes e² ≈ 7.39. This is why softmax can produce strong preferences even from modest logit gaps.
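The gap-to-ratio behavior can be verified directly; a small sketch (the `softmax` helper is illustrative):

```python
import math

def softmax(x):
    exps = [math.exp(v) for v in x]
    s = sum(exps)
    return [e / s for e in exps]

p = softmax([2.0, 1.0, 0.0])
# A logit gap of 1 becomes a probability ratio of e; a gap of 2 becomes e².
print(p[0] / p[1])  # ≈ 2.718
print(p[0] / p[2])  # ≈ 7.389
```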
For n = 3, the output probabilities (p₁, p₂, p₃) satisfy p₁ + p₂ + p₃ = 1 and each pᵢ ≥ 0. That set is a 2D triangle (a simplex) embedded in 3D.
Here’s an ASCII simplex diagram to orient you:
          p3=1
           ▲
          / \
         /   \
        /  •  \        • interior points: all pᵢ in (0,1)
       /       \
      /         \
     /___________\
  p1=1           p2=1

Softmax maps any logits vector x to some point inside this triangle.
For two options, the softmax probability of option 1 depends only on the logit difference d = x₁ − x₂:
p₁ = σ(d) = 1 / (1 + exp(−d)); with temperature T, p₁ = 1 / (1 + exp(−d/T)).
Below is an inline diagram showing how changing T changes the curve. The horizontal axis is d, vertical is p₁.
  p1
 1.0 |                      ............   T=0.5 (sharper)
     |                 .....
 0.8 |             ....
     |           ...        _________      T=1 (baseline)
 0.6 |         ..      _____
 0.5 |--------+--------------------------- d
     |      ..   ____
 0.4 |     ..               - - - - - -    T=2 (flatter)
     |    ...
 0.2 |  ....
     | ....
 0.0 |............
      -6   -4   -2    0    2    4    6

Interpretation:
We’ll connect this to attention weights: low temperature makes attention concentrate on a few tokens; high temperature spreads it out.
In attention, you’ll see softmax applied to a vector of scores for a given query over all keys. If you have a matrix of scores, softmax is applied per row (or per last dimension), producing a distribution over positions for each query.
This first mechanic—exponentiate then normalize—gives the core behavior. Next we’ll cover the crucial property that makes softmax usable in real systems: shift invariance and numerical stability.
Exponentials can overflow or underflow: in float32, exp(x) overflows to ∞ once x exceeds roughly 88, and underflows to 0 once x drops below roughly −104.
Yet logits in neural nets can easily reach magnitudes where naive exp() is unsafe. So we need a stable way to compute softmax.
Softmax is unchanged if you add the same constant c to every logit: softmax(x + c·1) = softmax(x), where 1 = (1, 1, …, 1).
Derivation (showing work):
Let yᵢ = xᵢ + c. Then
softmax(y)ᵢ = e^(xᵢ + c) / ∑ⱼ e^(xⱼ + c) = (e^c · e^(xᵢ)) / (e^c · ∑ⱼ e^(xⱼ)) = e^(xᵢ) / ∑ⱼ e^(xⱼ) = softmax(x)ᵢ.
So adding a constant doesn’t change the output probabilities.
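The derivation can be sanity-checked numerically; a quick sketch (the `softmax` helper is illustrative):

```python
import math

def softmax(x):
    exps = [math.exp(v) for v in x]
    s = sum(exps)
    return [e / s for e in exps]

# Shifting every logit by the same constant leaves softmax unchanged.
x = [2.0, 1.0, 0.0]
shifted = [v + 10.0 for v in x]
for p, q in zip(softmax(x), softmax(shifted)):
    assert abs(p - q) < 1e-12
```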
Because of shift-invariance, we can choose c conveniently. The most common choice is c = −max(x), i.e., subtract the largest logit.
Define m = max(x) and zᵢ = xᵢ − m.
Then maxᵢ zᵢ = 0, so every zᵢ ≤ 0 and every exp(zᵢ) ∈ (0, 1].
Now compute softmax using z:
softmax(x)ᵢ = exp(zᵢ) / ∑ⱼ exp(zⱼ).
This is stable because: the largest exponent is exp(0) = 1 (no overflow), and the denominator is at least 1 (no division by a vanishingly small sum).
Suppose x = (1000, 1001, 999).
Naively, e^{1001} overflows.
Use max trick: m = 1001
Shifted logits: z = (−1, 0, −2). Exponentials: e⁻¹ ≈ 0.3679, e⁰ = 1, e⁻² ≈ 0.1353.
Sum ≈ 1.5032
Probabilities ≈ (0.2447, 0.6652, 0.0900)
These are perfectly reasonable—no overflow.
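The max trick is a one-line change to the naive implementation; a minimal sketch (`stable_softmax` is an illustrative name):

```python
import math

def stable_softmax(x):
    m = max(x)                           # shift-invariance lets us subtract the max
    exps = [math.exp(v - m) for v in x]  # largest exponent is exp(0) = 1: no overflow
    s = sum(exps)                        # s >= 1, so the division is safe
    return [e / s for e in exps]

# Naive exp(1001) would overflow; the shifted version is perfectly safe.
p = stable_softmax([1000.0, 1001.0, 999.0])
print([round(v, 4) for v in p])  # ≈ (0.2447, 0.6652, 0.0900)
```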
Shifting logits by a constant slides x along the direction 1 = (1,1,1,…). Softmax “forgets” that direction completely.
For n=3, imagine two different logit vectors, say x = (2, 1, 0) and x + 5·1 = (7, 6, 5).
They map to the exact same point (p₁, p₂, p₃) on the simplex triangle.
Here’s a conceptual diagram combining both ideas—shift vs. scale:
      Simplex (n=3 probabilities)

        (0,0,1)
          ▲
         / \
        /   \
       /  A  \        A = softmax(x)
      /       \       softmax(x + 10·1) = A              (shift: unchanged)
     /    •    \      softmax(x/T) moves toward a vertex or the center (scale)
    /___________\
 (1,0,0)       (0,1,0)

- Shift logits: stay at the same point A.
- Scale logits (or change T): slide along a path toward a vertex (peaked) or toward the center (uniform).

Always compute softmax as:
1) m = max(x)
2) zᵢ = xᵢ − m
3) pᵢ = exp(zᵢ) / ∑ⱼ exp(zⱼ)
This gives identical results in exact math, and far better results in floating-point.
Often you want log probabilities (e.g., for cross-entropy). Use:
log softmax(x)ᵢ = xᵢ − log ∑ⱼ exp(xⱼ).
Stably, compute:
log softmax(x)ᵢ = zᵢ − log ∑ⱼ exp(zⱼ), where z = x − max(x).
Even if you don’t implement it now, it’s important conceptually: stability is not optional when exponentials are involved.
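A stable log-softmax follows the same max-subtraction pattern; a minimal sketch (`log_softmax` is an illustrative name):

```python
import math

def log_softmax(x):
    """Log probabilities computed without ever forming huge exponentials."""
    m = max(x)
    lse = m + math.log(sum(math.exp(v - m) for v in x))  # stable log-sum-exp
    return [v - lse for v in x]

# Finite, sensible log probabilities even for huge logits.
lp = log_softmax([1000.0, 1001.0, 999.0])
print([round(v, 4) for v in lp])  # ≈ (−1.4076, −0.4076, −2.4076)
```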
Next we’ll look at temperature scaling, which is like a controlled scaling of logits that changes the softness/hardness of the distribution.
Sometimes you want probabilities that are: sharper (more decisive, concentrated on the top option) or softer (more exploratory, spread across options).
Temperature scaling gives a single knob T > 0 that controls this.
Given logits x, temperature-scaled softmax is
softmax(x/T)ᵢ = exp(xᵢ/T) / ∑ⱼ exp(xⱼ/T).
Equivalent viewpoint: dividing by T is like multiplying logits by 1/T.
Let p(T) = softmax(x/T).
1) As T → 0⁺: the distribution approaches a one-hot vector on the argmax (all mass on the largest logit).
2) As T → ∞: the distribution approaches uniform (1/n everywhere).
You can see this via differences: the ratios are
pᵢ(T) / pₖ(T) = exp((xᵢ − xₖ)/T),
which blows up as T → 0⁺ (when xᵢ > xₖ) and tends to 1 as T → ∞.
In dot-product attention, scores often look like
sᵢ = (q · kᵢ) / √d,
where q is a query vector and the kᵢ are key vectors of dimension d.
Then attention weights are a = softmax(s).
The factor 1/√d plays a temperature-like role: it prevents dot products from growing too large with dimension d (which would make softmax too peaked too early).
Take logits x = (2, 1, 0). Consider three temperatures.
Compute probabilities:
T = 2: ≈ (0.506, 0.307, 0.186)
T = 1: ≈ (0.665, 0.245, 0.090)
T = 0.5: ≈ (0.867, 0.117, 0.016)
On the simplex triangle, these three points lie along a path from the center-ish region toward the vertex (1,0,0) as T decreases.
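Temperature scaling is a tiny wrapper around stable softmax; this sketch reproduces the three cases above (`softmax_t` is an illustrative name):

```python
import math

def softmax_t(x, T):
    """Temperature-scaled softmax: divide logits by T, then stable softmax."""
    z = [v / T for v in x]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

for T in (2.0, 1.0, 0.5):
    print(T, [round(p, 3) for p in softmax_t([2.0, 1.0, 0.0], T)])
# T=2   → ≈ (0.506, 0.307, 0.186)  flatter
# T=1   → ≈ (0.665, 0.245, 0.090)  baseline
# T=0.5 → ≈ (0.867, 0.117, 0.016)  more peaked
```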
Temperature scaling is also used for calibration: you can adjust T (often on a validation set) so predicted probabilities better match empirical accuracy.
This is a big reason softmax is interpreted carefully: the raw logits contain information beyond just the top class.
At this point you know: what softmax computes, why it is shift-invariant, how to compute it stably, and how temperature controls sharpness.
Next we connect it directly to attention mechanisms, masking, and how to interpret attention scores.
In attention, you compute a score for each key/value relative to a query. These scores are logits s.
Softmax turns them into weights a that sum to 1:
aᵢ = exp(sᵢ) / ∑ⱼ exp(sⱼ).
Then the attention output is a weighted sum:
output = ∑ᵢ aᵢ vᵢ.
So softmax is the mechanism that converts similarities into a convex combination of values.
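The whole step — scores → weights → weighted sum — fits in a few lines; a toy sketch with scalar values (all names and numbers are illustrative):

```python
import math

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    s = sum(exps)
    return [e / s for e in exps]

# Toy attention step: scores for one query over three keys; values are scalars here.
scores = [2.0, 1.0, 0.0]
values = [10.0, 20.0, 30.0]
weights = softmax(scores)
output = sum(a * v for a, v in zip(weights, values))  # convex combination of values
print(round(output, 2))  # ≈ 14.25, pulled toward the highest-scoring value (10.0)
```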
Because aᵢ ≥ 0 and ∑ᵢ aᵢ = 1: the output is a convex combination of the values, and each aᵢ can be read as the fraction of the output contributed by position i.
But interpret carefully: a high attention weight means a large contribution to this particular weighted sum, not necessarily that the position is causally important to the model’s prediction.
In sequence models you often must prevent attending to: padding positions, and (in causal/autoregressive models) future positions.
The standard technique: add a large negative number (−∞ in math; a big negative constant in practice) to masked logits before softmax.
Let mask mᵢ be 0 for allowed, and −∞ for disallowed. Define
s̃ᵢ = sᵢ + mᵢ.
Then
aᵢ = exp(s̃ᵢ) / ∑ⱼ exp(s̃ⱼ), and masked positions get exp(−∞) = 0, i.e., zero weight.
This works because softmax only cares about exponentials; setting a logit to −∞ removes it from the sum.
In floating point, you use something like −1e9 (float32) or a framework-provided mask fill value.
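Masking before softmax can be sketched as adding a large negative constant; `NEG_INF = -1e9` here is an illustrative stand-in, not a framework constant:

```python
import math

NEG_INF = -1e9  # practical stand-in for −∞

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    s = sum(exps)
    return [e / s for e in exps]

scores = [2.0, 1.0, 0.0, 3.0]
mask = [0.0, 0.0, NEG_INF, NEG_INF]   # last two positions disallowed
weights = softmax([sc + mk for sc, mk in zip(scores, mask)])
print([round(w, 3) for w in weights])  # masked entries get 0 weight;
                                       # the rest renormalize to ≈ (0.731, 0.269)
```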
Softmax is commonly paired with cross-entropy loss.
If the true class is k and predicted probabilities are pᵢ, then
loss = −log pₖ.
When the model assigns low probability to the correct class, the loss is large.
A key internal quantity is log-sum-exp:
LSE(x) = log ∑ⱼ exp(xⱼ), computed stably as max(x) + log ∑ⱼ exp(xⱼ − max(x)).
Then log softmax(x)ᵢ = xᵢ − LSE(x), and the cross-entropy loss is LSE(x) − xₖ.
This is one reason stable log-softmax implementations are so common.
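Cross-entropy can be computed straight from logits via log-sum-exp; a minimal sketch (`cross_entropy` is an illustrative name):

```python
import math

def cross_entropy(logits, true_index):
    """−log p_k, computed stably without forming the full softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))  # stable LSE(x)
    return lse - logits[true_index]                            # = −log softmax(x)[k]

loss = cross_entropy([2.0, 1.0, 0.0], true_index=0)
print(round(loss, 4))  # ≈ 0.4076, i.e. −log(0.665…)
```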
Logits contain “un-normalized evidence.” Softmax converts them to probabilities, but: the output depends on the whole set of competing logits (the denominator) and on any temperature, so a softmax probability is not an absolute confidence score.
Use softmax when you need: a distribution over mutually exclusive options (one class, one token, one attended position).
Avoid or reconsider when you need: independent yes/no decisions per label (use a sigmoid per label), or hard discrete choices (use argmax or sampling on top of the probabilities).
With these connections, you’re ready to use softmax as a dependable building block for attention mechanisms, masking, and sequence-to-sequence models.
Let x = (2, 1, 0). Compute softmax(x) exactly as exponentiate-and-normalize.
Write the definition:
softmax(x)ᵢ = exp(xᵢ) / (exp(2) + exp(1) + exp(0)).
Compute exponentials:
exp(2) ≈ 7.389,
exp(1) ≈ 2.718,
exp(0) = 1.
Sum them:
S = 7.389 + 2.718 + 1 = 11.107.
Normalize each component:
p₁ = 7.389 / 11.107 ≈ 0.665,
p₂ = 2.718 / 11.107 ≈ 0.245,
p₃ = 1 / 11.107 ≈ 0.090.
Check the distribution sums to 1 (up to rounding):
0.665 + 0.245 + 0.090 = 1.000.
Insight: Softmax cares about differences: (2 vs 1 vs 0) becomes roughly (0.665, 0.245, 0.090). A 1-point logit gap turns into a factor of e ≈ 2.72 between the unnormalized weights, and that ratio carries over to the probabilities.
Let x = (1000, 1001, 999). Show why naive computation fails and compute softmax stably using the max trick.
Naive approach would require exp(1000), exp(1001), exp(999).
In float32/float64, exp(1001) overflows (becomes ∞), making the result undefined (∞/∞).
Use shift-invariance:
Let m = max(x) = 1001.
Define zᵢ = xᵢ − m, so z = (-1, 0, -2).
Compute exponentials safely:
exp(-1) ≈ 0.3679,
exp(0) = 1,
exp(-2) ≈ 0.1353.
Sum:
S = 0.3679 + 1 + 0.1353 = 1.5032.
Normalize:
p₁ = 0.3679 / 1.5032 ≈ 0.2447,
p₂ = 1 / 1.5032 ≈ 0.6652,
p₃ = 0.1353 / 1.5032 ≈ 0.0900.
Insight: Subtracting max(x) doesn’t change softmax outputs, but it bounds the largest exponent at 1, preventing overflow and improving precision.
Let x = (2, 1, 0). Compute softmax(x/T) for T = 2, 1, 0.5 and compare.
Case T = 2:
x/2 = (1, 0.5, 0).
exp values: (2.718, 1.649, 1).
Sum S ≈ 5.367.
Probabilities: (0.506, 0.307, 0.186).
Case T = 1:
Already computed: (0.665, 0.245, 0.090).
Case T = 0.5:
x/0.5 = (4, 2, 0).
exp values: (54.598, 7.389, 1).
Sum S ≈ 62.987.
Probabilities: (0.867, 0.117, 0.016).
Compare:
As T decreases, p₁ increases and the distribution becomes more peaked.
The argmax remains index 1 for all T > 0 (since scaling by 1/T preserves order).
Insight: Temperature doesn’t change which logit is largest, but it strongly affects how much probability mass concentrates on the top options—critical for attention sharpness and calibration.
Softmax converts logits x ∈ ℝⁿ into probabilities: softmax(x)ᵢ = exp(xᵢ)/∑ⱼ exp(xⱼ).
Softmax outputs are always positive and sum to 1, so they lie on the probability simplex.
Softmax depends on logit differences: softmax(x)ᵢ / softmax(x)ₖ = exp(xᵢ − xₖ).
Shift-invariance: adding the same constant to all logits leaves softmax unchanged; this enables the stable max-subtraction trick.
For stable computation, use z = x − max(x) before exponentiating to avoid overflow/underflow.
Temperature scaling softmax(x/T) controls sharpness: low T → peaked; high T → flat; T → ∞ approaches uniform.
In attention, softmax turns similarity scores into attention weights; masking is implemented by adding −∞ (or a large negative value) to disallowed logits before softmax.
Computing softmax as exp(xᵢ)/∑exp(xⱼ) without subtracting max(x), leading to overflow, underflow, or NaNs.
Confusing shift-invariance with scale-invariance: adding a constant changes nothing, but multiplying/dividing logits (including temperature) changes the distribution.
Using softmax for multi-label problems where labels are independent; sigmoid per label is usually appropriate there.
Interpreting softmax probabilities as absolute confidence without considering temperature, calibration, or the set of competing logits in the denominator.
Compute softmax(x) for x = (0, 0, 0, 0). What distribution do you get and why?
Hint: All exponentials are equal; normalize by their sum.
exp(0)=1 for each entry, sum = 4, so each probability is 1/4. Softmax returns the uniform distribution when all logits are equal.
Show (algebraically) that softmax is shift-invariant: softmax(x + c1) = softmax(x).
Hint: Factor e^c out of numerator and denominator.
Let yᵢ = xᵢ + c. Then softmax(y)ᵢ = e^{xᵢ+c}/∑ⱼ e^{xⱼ+c} = (e^c e^{xᵢ})/(e^c ∑ⱼ e^{xⱼ}) = e^{xᵢ}/∑ⱼ e^{xⱼ} = softmax(x)ᵢ.
Let x = (3, 1, -1). Compute softmax(x/T) for T = 1 and T = 2 (use the max trick if you want). Which is more peaked? Explain using ratios.
Hint: Compare p₁/p₂ = exp((x₁-x₂)/T).
For T=1: exponentials are (e^3, e^1, e^{-1}) ≈ (20.085, 2.718, 0.368). Sum ≈ 23.171. So p ≈ (0.867, 0.117, 0.016).
For T=2: logits are (1.5, 0.5, -0.5). exponentials ≈ (4.482, 1.649, 0.607). Sum ≈ 6.738. So p ≈ (0.665, 0.245, 0.090).
T=1 is more peaked. Ratio explanation: p₁/p₂ = exp((3-1)/T) = exp(2/T). For T=1 ratio is e^2≈7.39; for T=2 ratio is e^1≈2.72, so the top class dominates more at lower T.