Chernoff, Hoeffding bounds. Tail probabilities.
When you average noisy data, you feel like it should stabilize. Concentration inequalities make that feeling precise: they turn “rare big deviations” into concrete, exponentially small tail probabilities—with constants you can compute and optimize.
Concentration inequalities bound tail probabilities like P(S − E[S] ≥ t). The core move is the exponential Markov method: apply Markov to exp(λ(S−E[S])) to get P(S−E[S] ≥ t) ≤ exp(ψ_S(λ) − λt). Optimizing over λ gives a Legendre/duality picture: the best exponent is sup_λ (λt − ψ_S(λ)) (a convex conjugate). For sums of independent bounded variables, Hoeffding’s lemma yields a quadratic upper bound on ψ, giving sub-Gaussian tails (Hoeffding bounds). For Bernoulli sums, sharper ψ yields Chernoff bounds with KL-divergence exponents.
A concentration inequality is a guarantee that a random quantity stays close to its typical value (usually its expectation) with high probability.
Concretely, if S is a random variable (often a sum or average), we want bounds of the form P(S − E[S] ≥ t) ≤ (something small in t), and similarly P(S − E[S] ≤ −t) ≤ (something small in t).
The defining feature is that these probabilities decay fast in t (often like exp(−c t²) or exp(−c t)). This is the mathematical version of “averaging reduces noise.”
You already know the Central Limit Theorem (CLT): sums of many independent variables become approximately normal under mild conditions.
But the CLT answers a different question: it describes the limiting distribution of the rescaled sum as n → ∞, not how small P(S − E[S] ≥ t) is at a fixed, finite n.
In practice, concentration is what you use when you want statements like:
“With probability at least 1 − δ, the sample mean differs from the true mean by at most ε.”
This is a tail probability question.
Most concentration results for sums of independent variables share a template: bound the exponential moment E[e^{λS}], apply Markov’s inequality, then optimize over λ.
This node focuses on the classic pair: Hoeffding’s inequality (for bounded summands) and Chernoff bounds (for Bernoulli/Binomial sums).
And the unifying engine behind both: the exponential Markov method, phrased through the cumulant-generating function ψ(λ).
For a random variable X, define its moment-generating function (mgf) when it exists: M_X(λ) = E[e^{λX}],
and its cumulant-generating function (CGF): ψ_X(λ) = log M_X(λ) = log E[e^{λX}].
Why the log? Because for independent sums it turns products into sums:
If X and Y are independent, then ψ_{X+Y}(λ) = ψ_X(λ) + ψ_Y(λ).
This additivity is the backbone of concentration for sums.
1) Tail bound as an optimization problem:
We will repeatedly get bounds that look like P(Z ≥ t) ≤ exp(ψ_Z(λ) − λt)
and then choose the best λ ≥ 0.
2) Geometry / duality:
The exponent comes from comparing the curve ψ(λ) against the line λt.
We will make that picture explicit later, because it is the most reliable way to remember what’s going on.
The starting point is humble: Markov’s inequality.
For a nonnegative random variable Y and a > 0, P(Y ≥ a) ≤ E[Y]/a.
This is often too weak directly. The trick is to manufacture a useful nonnegative variable by exponentiating.
Suppose we want an upper tail for X: P(X ≥ t).
For any λ > 0, the event {X ≥ t} is the same as {e^{λX} ≥ e^{λt}}.
Since e^{λX} ≥ 0, we can apply Markov: P(X ≥ t) ≤ E[e^{λX}] / e^{λt}.
That is, P(X ≥ t) ≤ exp(ψ_X(λ) − λt).
This is the exponential Markov method.
Most concentration statements are about deviations from the mean. Let Z = S − E[S]. Then for any λ > 0, P(S − E[S] ≥ t) = P(Z ≥ t) ≤ exp(ψ_Z(λ) − λt).
Note that ψ_Z(λ) = ψ_S(λ) − λE[S].
So you can either work with ψ_S and subtract λE[S], or directly define ψ for the centered variable.
The inequality holds for every λ > 0. So take the best one: P(Z ≥ t) ≤ inf_{λ>0} exp(ψ_Z(λ) − λt).
Equivalently, P(Z ≥ t) ≤ exp(−sup_{λ>0}(λt − ψ_Z(λ))).
Define the convex conjugate (Legendre transform) of ψ_Z: ψ_Z*(t) = sup_{λ>0}(λt − ψ_Z(λ)).
Then the clean conceptual form is: P(Z ≥ t) ≤ exp(−ψ_Z*(t)).
This “duality” is not just fancy notation: it is exactly the geometry of “supporting lines” to a convex function.
Because ψ_Z(λ) is convex in λ (a log mgf is always convex), the function h(λ) = ψ_Z(λ) − λt
is convex too (convex minus linear). Minimizing it has a crisp first-order condition when differentiable: ψ_Z′(λ*) = t.
Interpretation: the optimal λ* exponentially tilts the distribution until the tilted mean of Z equals t, making the deviation you care about “typical” under the tilt.
This directly matches the visualization suggestion: plot ψ against the family of lines λt, linked to the resulting tail bound.
If your interactive environment supports it, show two linked panels:
Panel A (duality / optimization): the curve ψ_Z(λ) and the line λt on the same axes; the vertical gap λt − ψ_Z(λ) is the exponent, and sliding λ shows the gap being maximized where ψ_Z′(λ*) = t.
Panel B (resulting tail): the bound exp(−ψ_Z*(t)) as a function of t, updating as the line in Panel A moves.
This makes “optimize over λ” feel like reading off a best supporting line.
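The supporting-line picture is easy to check numerically. A minimal sketch, assuming the toy case of a standard Gaussian CGF ψ(λ) = λ²/2, where the conjugate should come out to t²/2, attained at λ* = t:

```python
import numpy as np

# Assumed toy case: standard Gaussian, psi(lam) = lam^2 / 2.
def psi(lam):
    return lam**2 / 2.0

t = 1.5
lams = np.linspace(0.0, 10.0, 100001)   # grid of candidate slopes lambda
gaps = lams * t - psi(lams)             # vertical gap between line and curve
best_exponent = gaps.max()              # numeric conjugate psi*(t)
lam_star = lams[gaps.argmax()]          # numeric maximizer
# First-order condition psi'(lam*) = t predicts lam* = t and psi*(t) = t^2 / 2.
```

Up to grid resolution, `lam_star` lands at λ* = t = 1.5 and `best_exponent` at t²/2 = 1.125, matching the condition ψ′(λ*) = t.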
So far, everything is exact but useless unless we can bound ψ_Z(λ).
The entire field of concentration inequalities is essentially a library of ways to upper bound ψ for different kinds of random variables.
For this node, we focus on: bounded variables (Hoeffding’s lemma and inequality) and Bernoulli sums (Chernoff bounds).
Before going there, one more crucial simplification:
Let S = ∑ᵢ Xᵢ with independent Xᵢ. For centered terms (or after centering), ψ_{S−E[S]}(λ) = ∑ᵢ ψ_{Xᵢ−E[Xᵢ]}(λ).
So if you can bound each ψ_{X_i}(λ), you can bound ψ_S(λ) by summing.
That is why bounded-difference / bounded-summand principles are so powerful: they give a uniform bound per term.
Hoeffding’s inequality is the canonical “bounded independent noise averages out” theorem.
We’ll build it in two stages:
1) Hoeffding’s lemma: a single bounded random variable has a sub-Gaussian mgf.
2) Additivity of ψ for independent sums + λ-optimization yields Hoeffding’s inequality.
Let X be a random variable with support in an interval [a, b]. Define the centered variable: X̃ = X − E[X].
Hoeffding’s lemma states: E[e^{λX̃}] ≤ exp(λ²(b−a)²/8) for all λ ∈ ℝ.
Equivalently in CGF form: ψ_{X̃}(λ) ≤ λ²(b−a)²/8.
This is the precise bridge from boundedness to quadratic ψ, and quadratic ψ is what produces Gaussian-like tails exp(−t²).
A useful mental model: a Gaussian with variance σ² has ψ(λ) = λ²σ²/2 exactly, so a quadratic CGF bound means “Gaussian-like tails.”
So Hoeffding’s lemma says: “a bounded centered variable behaves like a Gaussian in terms of its mgf, with effective variance proxy (b−a)²/4.”
A full proof uses convexity (or a comparison to a two-point distribution at the endpoints). The intuition: the worst case pushes all the mass to the endpoints a and b, and even then the mgf grows no faster than a Gaussian’s.
For this lesson, the important takeaway is the usable inequality: ψ_{X−E[X]}(λ) ≤ λ²(b−a)²/8.
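The lemma can be sanity-checked numerically in a case where the centered mgf has a simple closed form: a Bernoulli(p) on [0,1], so (b−a)²/8 = 1/8. A minimal sketch; the p values and λ grid are arbitrary illustrative choices:

```python
import numpy as np

# Centered Bernoulli(p): X − p equals −p w.p. 1−p and 1−p w.p. p,
# so its CGF is log((1−p)·e^{−λp} + p·e^{λ(1−p)}).
def centered_bernoulli_cgf(lam, p):
    return np.log((1 - p) * np.exp(-lam * p) + p * np.exp(lam * (1 - p)))

lams = np.linspace(-5.0, 5.0, 2001)
# Hoeffding's lemma predicts CGF(λ) ≤ λ²/8 for every p and every λ.
lemma_holds = all(
    np.all(centered_bernoulli_cgf(lams, p) <= lams**2 / 8 + 1e-12)
    for p in [0.1, 0.3, 0.5, 0.9]
)
```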
Let X₁, …, X_n be independent, with Xᵢ ∈ [aᵢ, bᵢ] almost surely. Let S = ∑ᵢ Xᵢ and μ = E[S].
Define the centered sum Z = S − μ = ∑ᵢ (Xᵢ − E[Xᵢ]).
For λ > 0, P(Z ≥ t) ≤ exp(ψ_Z(λ) − λt).
Because the centered terms are independent, ψ_Z(λ) = ∑ᵢ ψ_{Xᵢ−E[Xᵢ]}(λ).
Each Xᵢ − E[Xᵢ] is still bounded in an interval of width (bᵢ − aᵢ). So ψ_{Xᵢ−E[Xᵢ]}(λ) ≤ λ²(bᵢ−aᵢ)²/8.
Summing gives ψ_Z(λ) ≤ (λ²/8) ∑ᵢ (bᵢ−aᵢ)².
Define the “range-sum” parameter V = ∑ᵢ (bᵢ−aᵢ)².
Then P(Z ≥ t) ≤ exp(λ²V/8 − λt).
We minimize the exponent in λ: h(λ) = λ²V/8 − λt.
Differentiate: h′(λ) = λV/4 − t.
Set h′(λ) = 0: λ* = 4t/V.
Plug back in:
1) Compute the quadratic term: (λ*)²V/8 = (16t²/V²)·V/8 = 2t²/V.
2) Compute the linear term: λ*t = 4t²/V.
So h(λ*) = 2t²/V − 4t²/V = −2t²/V.
Therefore P(S − μ ≥ t) ≤ exp(−2t²/V).
That is Hoeffding’s inequality.
Similarly, P(S − μ ≤ −t) ≤ exp(−2t²/V) (apply the same argument to −Z).
By a union bound, P(|S − μ| ≥ t) ≤ 2 exp(−2t²/V).
If Xᵢ ∈ [0,1], then (bᵢ − aᵢ)=1 and V = n.
Let \bar{X} = (1/n)∑ Xᵢ and E[\bar{X}] = μ̄.
Then P(\bar{X} − μ̄ ≥ ε) = P(S − μ ≥ nε) ≤ exp(−2nε²).
And two-sided: P(|\bar{X} − μ̄| ≥ ε) ≤ 2 exp(−2nε²).
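A quick Monte Carlo check of the averaged form; the Uniform[0,1] distribution and the specific n, ε, and trial count are illustrative choices:

```python
import numpy as np

# Empirical frequency of |mean − 1/2| ≥ eps for Uniform[0,1] samples,
# compared against the two-sided Hoeffding bound 2·exp(−2nε²).
rng = np.random.default_rng(0)
n, eps, trials = 100, 0.1, 20000
sample_means = rng.random((trials, n)).mean(axis=1)
empirical = np.mean(np.abs(sample_means - 0.5) >= eps)
bound = 2 * np.exp(-2 * n * eps**2)   # ≈ 0.271 here
# The bound is valid but loose: the true tail for this distribution is far smaller.
```

The gap between `empirical` and `bound` is expected: Hoeffding uses only boundedness, not the uniform distribution’s small variance.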
This is the second visualization the node asked for.
In Hoeffding’s proof, we replaced the true ψ_Z(λ) by an upper quadratic: ψ_Z(λ) ≤ λ²V/8.
Geometrically, you can plot: the true ψ_Z(λ), the quadratic λ²V/8, and the line λt on the same axes.
Then the optimized exponent becomes the conjugate of the quadratic: sup_{λ>0}(λt − λ²V/8) = 2t²/V.
So the tail is bounded by exp(−2t²/V).
Interactive canvas idea (explicit): sliders for n and ε (or t), with the quadratic-vs-line picture on one side and the resulting bound 2 exp(−2nε²) on the other.
This connects directly to “how many samples do I need for error ≤ ε with probability ≥ 1−δ?”
Hoeffding is powerful and simple, but it is sometimes loose because it uses only boundedness.
Chernoff bounds refine the story when the distribution is known (especially Bernoulli) by using a sharper ψ.
Let Xᵢ ∼ Bernoulli(p) independent. Then S = ∑ Xᵢ ∼ Binomial(n, p), and μ = E[S] = np.
For a single Bernoulli X: E[e^{λX}] = 1 − p + pe^{λ}.
So ψ_X(λ) = log(1 − p + pe^{λ}).
For the sum S, independence gives ψ_S(λ) = n log(1 − p + pe^{λ}).
For the centered sum Z = S − μ, ψ_Z(λ) = n log(1 − p + pe^{λ}) − λnp.
Now apply exponential Markov: P(Z ≥ t) ≤ exp(ψ_Z(λ) − λt).
Let’s express the event in relative terms. Set t = nε, i.e. S ≥ n(p+ε). Then P(S ≥ n(p+ε)) ≤ exp(n log(1 − p + pe^{λ}) − λn(p+ε)).
We optimize over λ ≥ 0.
Let q = p+ε (so q > p). The optimal exponent is sup_{λ>0}(λq − log(1 − p + pe^{λ}) + λ·0... ) per sample; concretely, sup_{λ>0}(λq − log(1 − p + pe^{λ})) = D(q‖p),
where D(q‖p) is the Bernoulli KL divergence: D(q‖p) = q log(q/p) + (1−q) log((1−q)/(1−p)).
Thus the Chernoff bound becomes P(S/n ≥ q) ≤ exp(−n·D(q‖p)).
Similarly for the lower tail (q < p): P(S/n ≤ q) ≤ exp(−n·D(q‖p)).
This is a major conceptual upgrade: the exponent now adapts to the actual distribution. Since D(q‖p) ≥ 2(q−p)² (a Pinsker-type inequality), the KL exponent is never worse than Hoeffding’s quadratic exponent, and it is much better when p is far from 1/2.
A popular parameterization is S ≥ (1+δ)μ with δ>0.
Since μ=np, the event is S ≥ (1+δ)np.
A standard Chernoff upper tail bound is: P(S ≥ (1+δ)μ) ≤ (e^{δ}/(1+δ)^{1+δ})^{μ}.
And a simpler (looser) but easy-to-remember form, for δ ∈ (0, 1]: P(S ≥ (1+δ)μ) ≤ exp(−δ²μ/3).
Lower tail for δ ∈ (0,1): P(S ≤ (1−δ)μ) ≤ exp(−δ²μ/2).
These are derived by selecting a convenient λ rather than solving exactly, or by bounding KL by a quadratic.
For Bernoulli sums, the per-sample CGF ψ(λ) = log(1 − p + pe^{λ}) is exact, and its convex conjugate at q is exactly D(q‖p).
This is the same duality picture as before, but now the optimized supporting line corresponds to a KL divergence.
Extend the earlier Panel A: alongside the Hoeffding quadratic λ²/8, draw the exact centered Bernoulli curve log(1 − p + pe^{λ}) − λp and the same line λt; the exact curve lies below the quadratic, so its supporting-line gap (the exponent) is larger.
Then in Panel B, plot two bounds versus q: the Chernoff bound exp(−n·D(q‖p)) and the Hoeffding bound exp(−2n(q−p)²).
This comparison makes the “distribution-aware” advantage visible.
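The advantage can also be quantified without plotting. A minimal comparison, assuming illustrative values n = 100, p = 0.3, q = 0.5, with the exact tail computed from the binomial pmf:

```python
import math

def kl(q, p):
    """Bernoulli KL divergence D(q‖p)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

n, p, q = 100, 0.3, 0.5
k0 = math.ceil(n * q)
# Exact upper tail P(S ≥ nq) for S ~ Binomial(n, p).
exact = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k0, n + 1))
chernoff = math.exp(-n * kl(q, p))          # KL exponent
hoeffding = math.exp(-2 * n * (q - p)**2)   # quadratic exponent
# Expected ordering: exact ≤ Chernoff ≤ Hoeffding, since D(q‖p) ≥ 2(q−p)².
```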
Even though this node centers on independent sums, the exponential Markov method generalizes.
In Markov chain settings, you often want concentration for empirical averages along a trajectory: (1/n) ∑ₜ f(Xₜ) around its stationary mean.
Independence fails, but one can still bound exponential moments using spectral gaps / mixing and obtain Hoeffding-like or Bernstein-like inequalities for dependent data.
The important connection is conceptual: the exponential Markov engine does not require independence, only control of exponential moments; what changes is how you bound ψ.
So learning the ψ(λ) vs λt optimization here pays off later in dependent concentration.
| Inequality | Setting | Tail behavior | What ψ bound uses | Typical exponent |
|---|---|---|---|---|
| Hoeffding | Independent, bounded Xᵢ ∈ [aᵢ,bᵢ] | Sub-Gaussian | Quadratic upper bound on ψ (Hoeffding lemma) | 2t² / ∑(bᵢ−aᵢ)² |
| Chernoff | Bernoulli/Binomial (and related) | Often tighter; KL form | Exact ψ, then optimize | n·D(q‖p) |
Both are the same engine + different ψ information.
Let X₁,…,X_n be independent with Xᵢ ∈ [0,1]. Let \bar{X} = (1/n)∑Xᵢ and μ̄ = E[\bar{X}]. We want n such that P(|\bar{X} − μ̄| ≥ ε) ≤ δ.
Start from two-sided Hoeffding for bounded [0,1] variables:
P(|\bar{X} − μ̄| ≥ ε) ≤ 2 exp(−2nε²).
Set the bound ≤ δ:
2 exp(−2nε²) ≤ δ.
Divide by 2:
exp(−2nε²) ≤ δ/2.
Take logs (note log is increasing):
−2nε² ≤ log(δ/2).
Multiply by −1 (flips inequality):
2nε² ≥ log(2/δ).
Solve for n: n ≥ log(2/δ) / (2ε²).
Interpretation: to cut ε in half, you need ~4× more samples; to make δ 10× smaller, you need only an additive increase of (log 10)/(2ε²).
Insight: This is the classic PAC-style sample complexity tradeoff. The square dependence on ε comes from the quadratic ψ upper bound (sub-Gaussianity), while the log dependence on 1/δ comes from the exponential tail.
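The resulting sample-size rule is a one-liner. A minimal sketch; `hoeffding_sample_size` is a hypothetical helper name, and the (ε, δ) values are illustrative:

```python
import math

def hoeffding_sample_size(eps, delta):
    # Smallest n with 2·exp(−2nε²) ≤ δ, i.e. n ≥ log(2/δ) / (2ε²).
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

n1 = hoeffding_sample_size(0.05, 0.05)    # ε = 0.05, δ = 0.05
n2 = hoeffding_sample_size(0.025, 0.05)   # halving ε roughly quadruples n
```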
Let X₁,…,X_n independent with Xᵢ ∈ [aᵢ,bᵢ]. Let S=∑Xᵢ and μ=E[S]. Define V=∑(bᵢ−aᵢ)². Show P(S−μ ≥ t) ≤ exp(−2t²/V).
Let Z = S − μ = ∑(Xᵢ − E[Xᵢ]). We apply exponential Markov:
For any λ>0,
P(Z ≥ t) = P(e^{λZ} ≥ e^{λt}) ≤ E[e^{λZ}] / e^{λt}.
Take logs via ψ:
P(Z ≥ t) ≤ exp( ψ_Z(λ) − λt ), where ψ_Z(λ)=log E[e^{λZ}].
Use independence to add CGFs:
ψ_Z(λ) = ∑ ψ_{Xᵢ−E[Xᵢ]}(λ).
Apply Hoeffding’s lemma to each term:
ψ_{Xᵢ−E[Xᵢ]}(λ) ≤ λ²(bᵢ−aᵢ)²/8.
Sum the bounds:
ψ_Z(λ) ≤ (λ²/8)∑(bᵢ−aᵢ)² = λ²V/8.
So P(Z ≥ t) ≤ exp( λ²V/8 − λt ).
Optimize h(λ)=λ²V/8 − λt:
h'(λ)=λV/4 − t = 0 ⇒ λ* = 4t/V.
Plug in:
h(λ*) = ( (16t²/V²)·V / 8 ) − (4t/V)·t = (2t²/V) − (4t²/V) = −2t²/V.
Conclude: P(S − μ ≥ t) ≤ exp(−2t²/V).
Insight: This example makes the optimization/duality tangible: once ψ is upper-bounded by a quadratic, the entire tail bound becomes a one-line quadratic minimization in λ.
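A short numeric check of that one-line minimization, with V and t picked arbitrarily:

```python
import numpy as np

# Minimize h(λ) = λ²V/8 − λt on a fine grid and compare with the
# closed-form answers λ* = 4t/V and h(λ*) = −2t²/V.
V, t = 9.0, 2.0
lams = np.linspace(0.0, 5.0, 500001)
h = lams**2 * V / 8 - lams * t
lam_star = lams[h.argmin()]   # should be near 4t/V = 8/9
h_min = h.min()               # should be near −2t²/V = −8/9
```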
Let S ∼ Binomial(n,p). For a target fraction q>p, bound P(S/n ≥ q).
Use exponential Markov on S:
For λ>0,
P(S ≥ nq) = P(e^{λS} ≥ e^{λnq}) ≤ E[e^{λS}] / e^{λnq}.
Compute the mgf of a binomial via independence:
E[e^{λS}] = (E[e^{λX₁}])^n with X₁∼Bern(p).
E[e^{λX₁}] = 1−p + pe^{λ}.
So:
P(S/n ≥ q) ≤ exp( n log(1−p+pe^{λ}) − λnq ).
Optimize over λ>0:
Take derivative w.r.t. λ of the exponent per-sample:
φ(λ)=log(1−p+pe^{λ}) − λq.
φ'(λ) = (pe^{λ})/(1−p+pe^{λ}) − q.
Set φ'(λ*)=0:
(pe^{λ*})/(1−p+pe^{λ*}) = q.
Solve for e^{λ*}:
q(1−p+pe^{λ*}) = pe^{λ*}
q(1−p) = pe^{λ*}(1−q)
⇒ e^{λ*} = (q(1−p))/(p(1−q)).
Plugging λ* back yields the KL form:
min_{λ>0} [log(1−p+pe^{λ}) − λq] = −D(q‖p),
so P(S/n ≥ q) ≤ exp(−n·D(q‖p)).
Insight: Chernoff is the same exponential Markov method, but with an exact ψ. The optimization condition ψ'(λ*)=t becomes an explicit equation whose solution encodes the tilted (exponentially reweighted) Bernoulli distribution, and the exponent becomes a KL divergence.
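The closed-form tilt can be verified directly; the p and q values below are arbitrary illustrative choices:

```python
import math

# Plug the stationarity solution e^{λ*} = q(1−p)/(p(1−q)) back into the
# per-sample exponent φ(λ) = log(1−p+pe^{λ}) − λq and compare with −D(q‖p).
p, q = 0.3, 0.5
e_lam = q * (1 - p) / (p * (1 - q))   # optimal tilt e^{λ*}
lam_star = math.log(e_lam)
phi_star = math.log(1 - p + p * e_lam) - lam_star * q
kl = q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))
# phi_star should equal −kl up to floating point.
```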
The exponential Markov method is the universal engine: apply Markov to e^{λX} to turn P(X≥t) into an mgf/CGF bound.
The cumulant-generating function ψ_X(λ)=log E[e^{λX}] controls tail decay; for sums of independent variables, ψ adds: ψ_{∑Xᵢ}(λ)=∑ψ_{Xᵢ}(λ).
Tail bounds are optimization problems: P(Z≥t) ≤ exp( inf_{λ>0}(ψ_Z(λ)−λt) ) = exp(−sup_{λ>0}(λt−ψ_Z(λ))).
The ψ(λ) vs λt picture is geometric duality: maximizing the gap λt−ψ(λ) gives the best exponent; the optimizer satisfies ψ'(λ*)=t when differentiable.
Hoeffding’s lemma gives a quadratic upper bound on ψ for bounded variables, implying sub-Gaussian tails; Hoeffding’s inequality follows by summing ψ and optimizing λ.
For Bernoulli/Binomial sums, Chernoff bounds use a sharper ψ and yield KL-divergence exponents exp(−nD(q||p)), often tighter than Hoeffding.
In high-probability estimation, concentration turns (ε,δ) requirements into explicit sample complexity like n ≥ log(2/δ)/(2ε²) for [0,1] averages.
Forgetting to center: bounding P(S≥t) when you really need P(S−E[S]≥t) changes ψ and often worsens constants or breaks symmetry.
Optimizing λ with the wrong sign (e.g., using λ>0 for a lower tail); for P(Z≤−t) you typically apply the method to −Z (or λ<0).
Assuming Hoeffding always beats everything: it ignores variance and distribution shape; Bernstein/Chernoff can be much tighter when additional information exists.
Mixing up t for sums vs averages: if Z = \bar{X}−E\bar{X}, then the corresponding sum deviation is n·t, which changes exponents by n.
Let X₁,…,X_n be independent with Xᵢ ∈ [−1,2]. Give a bound on P(|\bar{X}−E\bar{X}| ≥ ε).
Hint: Use two-sided Hoeffding on the sum with (bᵢ−aᵢ)=3, then translate from S to \bar{X}.
Here (bᵢ−aᵢ)=3 for each i, so V=∑(bᵢ−aᵢ)² = n·9 = 9n.
For S=∑Xᵢ and μ=E[S], Hoeffding gives:
P(|S−μ| ≥ t) ≤ 2 exp(−2t²/V) = 2 exp(−2t²/(9n)).
Now |\bar{X}−E\bar{X}| ≥ ε is equivalent to |S−μ| ≥ nε, so t=nε:
P(|\bar{X}−E\bar{X}| ≥ ε) ≤ 2 exp(−2(nε)²/(9n)) = 2 exp(−2nε²/9).
Let S ∼ Binomial(n,p). Using the exponential Markov method, show that for any λ>0:
P(S/n ≥ q) ≤ exp(n[log(1−p+pe^{λ}) − λq]). Then write the condition that defines the optimal λ*.
Hint: Compute E[e^{λS}] via independence; then differentiate the exponent per-sample and set to zero.
Exponential Markov:
P(S≥nq)=P(e^{λS}≥e^{λnq})≤E[e^{λS}]/e^{λnq}.
For Bernoulli X, E[e^{λX}]=1−p+pe^{λ}. For S=∑Xᵢ, E[e^{λS}]=(1−p+pe^{λ})^n.
Thus
P(S/n ≥ q) ≤ (1−p+pe^{λ})^n / e^{λnq} = exp(n[log(1−p+pe^{λ}) − λq]).
For optimal λ*, minimize f(λ)=log(1−p+pe^{λ}) − λq.
Set derivative to zero:
f'(λ)=(pe^{λ})/(1−p+pe^{λ}) − q = 0,
so λ* satisfies (pe^{λ*})/(1−p+pe^{λ*}) = q.
Suppose Xᵢ ∈ [0,1] independent, and you want P(\bar{X} − E\bar{X} ≥ ε) ≤ δ (one-sided). Solve for the smallest n that Hoeffding guarantees.
Hint: Use the one-sided Hoeffding bound exp(−2nε²) and solve exp(−2nε²) ≤ δ.
For Xᵢ∈[0,1], Hoeffding one-sided gives:
P(\bar{X} − E\bar{X} ≥ ε) ≤ exp(−2nε²).
Require exp(−2nε²) ≤ δ.
Take logs:
−2nε² ≤ log δ.
Multiply by −1 (flip):
2nε² ≥ log(1/δ).
Thus n ≥ log(1/δ) / (2ε²), so the smallest n Hoeffding guarantees is ⌈log(1/δ)/(2ε²)⌉.
Next nodes you might connect: variance-aware (Bernstein-type) inequalities, bounded-difference inequalities for functions of independent variables, and concentration for dependent data via Markov chain mixing and spectral gaps.