Chernoff, Hoeffding bounds. Tail probabilities.
When you average noisy data, you feel like it should stabilize. Concentration inequalities make that feeling precise: they turn “rare big deviations” into concrete, exponentially small tail probabilities—with constants you can compute and optimize.
Concentration inequalities bound tail probabilities like P(S − E[S] ≥ t). The core move is the exponential Markov method: apply Markov to exp(λ(S−E[S])) to get P(S−E[S] ≥ t) ≤ exp(ψ_S(λ) − λt). Optimizing over λ gives a Legendre/duality picture: the best exponent is sup_λ (λt − ψ_S(λ)) (a convex conjugate). For sums of independent bounded variables, Hoeffding’s lemma yields a quadratic upper bound on ψ, giving sub-Gaussian tails (Hoeffding bounds). For Bernoulli sums, sharper ψ yields Chernoff bounds with KL-divergence exponents.
A concentration inequality is a guarantee that a random quantity stays close to its typical value (usually its expectation) with high probability.
Concretely, if S is a random variable (often a sum or average), we want bounds of the form P(S − E[S] ≥ t) ≤ (something small in t), and similarly P(S − E[S] ≤ −t) ≤ (something small in t).
The defining feature is that these probabilities decay fast in t (often like exp(−c t²) or exp(−c t)). This is the mathematical version of “averaging reduces noise.”
You already know the Central Limit Theorem (CLT): sums of many independent variables become approximately normal under mild conditions.
But the CLT answers a different question: it describes the limiting distribution of the rescaled sum as n → ∞, not how small P(S − E[S] ≥ t) is at a fixed, finite n.
In practice, concentration is what you use when you want statements like:
“With probability at least 1 − δ, the sample mean differs from the true mean by at most ε.”
This is a tail probability question.
Most concentration results for sums of independent variables share a template: bound the exponential moment E[e^{λS}], apply Markov’s inequality, then optimize over λ.
This node focuses on the classic pair: Hoeffding’s inequality (for bounded summands) and Chernoff bounds (for Bernoulli/Binomial sums).
And the unifying engine behind both: the exponential Markov method, phrased through the cumulant-generating function ψ(λ).
For a random variable X, define its moment-generating function (mgf) when it exists: M_X(λ) = E[e^{λX}],
and its cumulant-generating function (CGF): ψ_X(λ) = log M_X(λ) = log E[e^{λX}].
Why the log? Because for independent sums it turns products into sums:
If X and Y are independent, then ψ_{X+Y}(λ) = ψ_X(λ) + ψ_Y(λ).
This additivity is the backbone of concentration for sums.
1) Tail bound as an optimization problem:
We will repeatedly get bounds that look like P(Z ≥ t) ≤ exp(ψ_Z(λ) − λt)
and then choose the best λ ≥ 0.
2) Geometry / duality:
The exponent comes from comparing the curve ψ(λ) against the line λt.
We will make that picture explicit later, because it is the most reliable way to remember what’s going on.
The starting point is humble: Markov’s inequality.
For a nonnegative random variable Y and a > 0, P(Y ≥ a) ≤ E[Y]/a.
This is often too weak directly. The trick is to manufacture a useful nonnegative variable by exponentiating.
Suppose we want an upper tail for X: P(X ≥ t).
For any λ > 0, the event {X ≥ t} is the same as {e^{λX} ≥ e^{λt}}.
Since e^{λX} ≥ 0, we can apply Markov: P(X ≥ t) ≤ E[e^{λX}] / e^{λt}.
That is, P(X ≥ t) ≤ exp(ψ_X(λ) − λt).
This is the exponential Markov method.
Most concentration statements are about deviations from the mean. Let Z = S − E[S]. Then for any λ > 0, P(S − E[S] ≥ t) = P(Z ≥ t) ≤ exp(ψ_Z(λ) − λt).
Note that ψ_Z(λ) = ψ_S(λ) − λE[S].
So you can either work with ψ_S and subtract λE[S], or directly define ψ for the centered variable.
The inequality holds for every λ > 0. So take the best one: P(Z ≥ t) ≤ inf_{λ>0} exp(ψ_Z(λ) − λt).
Equivalently, P(Z ≥ t) ≤ exp(−sup_{λ>0}(λt − ψ_Z(λ))).
Define the convex conjugate (Legendre transform) of ψ_Z: ψ_Z*(t) = sup_{λ>0}(λt − ψ_Z(λ)).
Then the clean conceptual form is: P(Z ≥ t) ≤ exp(−ψ_Z*(t)).
This “duality” is not just fancy notation: it is exactly the geometry of “supporting lines” to a convex function.
Because ψ_Z(λ) is convex in λ (a log mgf is always convex), the function h(λ) = ψ_Z(λ) − λt
is convex too (convex minus linear). Minimizing it has a crisp first-order condition when differentiable: ψ_Z′(λ*) = t.
Interpretation: the optimal λ* exponentially tilts the distribution until the tilted mean of Z equals t, making the deviation you care about “typical” under the tilt.
This directly matches the visualization suggestion: plot ψ against the family of lines λt, linked to the resulting tail bound.
If your interactive environment supports it, show two linked panels:
Panel A (duality / optimization): the curve ψ_Z(λ) and the line λt on the same axes; the vertical gap λt − ψ_Z(λ) is the exponent, and sliding λ shows the gap being maximized where ψ_Z′(λ*) = t.
Panel B (resulting tail): the bound exp(−ψ_Z*(t)) as a function of t, updating as the line in Panel A moves.
This makes “optimize over λ” feel like reading off a best supporting line.
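The supporting-line picture is easy to check numerically. A minimal sketch, assuming the toy case of a standard Gaussian CGF ψ(λ) = λ²/2, where the conjugate should come out to t²/2, attained at λ* = t:

```python
import numpy as np

# Assumed toy case: standard Gaussian, psi(lam) = lam^2 / 2.
def psi(lam):
    return lam**2 / 2.0

t = 1.5
lams = np.linspace(0.0, 10.0, 100001)   # grid of candidate slopes lambda
gaps = lams * t - psi(lams)             # vertical gap between line and curve
best_exponent = gaps.max()              # numeric conjugate psi*(t)
lam_star = lams[gaps.argmax()]          # numeric maximizer
# First-order condition psi'(lam*) = t predicts lam* = t and psi*(t) = t^2 / 2.
```

Up to grid resolution, `lam_star` lands at λ* = t = 1.5 and `best_exponent` at t²/2 = 1.125, matching the condition ψ′(λ*) = t.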
So far, everything is exact but useless unless we can bound ψ_Z(λ).
The entire field of concentration inequalities is essentially a library of ways to upper bound ψ for different kinds of random variables.
For this node, we focus on: bounded variables (Hoeffding’s lemma and inequality) and Bernoulli sums (Chernoff bounds).
Before going there, one more crucial simplification:
Let S = ∑ᵢ Xᵢ with independent Xᵢ. For centered terms (or after centering), ψ_{S−E[S]}(λ) = ∑ᵢ ψ_{Xᵢ−E[Xᵢ]}(λ).
So if you can bound each ψ_{X_i}(λ), you can bound ψ_S(λ) by summing.
That is why bounded-difference / bounded-summand principles are so powerful: they give a uniform bound per term.
Hoeffding’s inequality is the canonical “bounded independent noise averages out” theorem.
We’ll build it in two stages:
1) Hoeffding’s lemma: a single bounded random variable has a sub-Gaussian mgf.
2) Additivity of ψ for independent sums + λ-optimization yields Hoeffding’s inequality.
Let X be a random variable with support in an interval [a, b]. Define the centered variable: X̃ = X − E[X].
Hoeffding’s lemma states: E[e^{λX̃}] ≤ exp(λ²(b−a)²/8) for all λ ∈ ℝ.
Equivalently in CGF form: ψ_{X̃}(λ) ≤ λ²(b−a)²/8.
This is the precise bridge from boundedness to quadratic ψ, and quadratic ψ is what produces Gaussian-like tails exp(−t²).
A useful mental model: a Gaussian with variance σ² has ψ(λ) = λ²σ²/2 exactly, so a quadratic CGF bound means “Gaussian-like tails.”
So Hoeffding’s lemma says: “a bounded centered variable behaves like a Gaussian in terms of its mgf, with effective variance proxy (b−a)²/4.”
A full proof uses convexity (or a comparison to a two-point distribution at the endpoints). The intuition: the worst case pushes all the mass to the endpoints a and b, and even then the mgf grows no faster than a Gaussian’s.
For this lesson, the important takeaway is the usable inequality: ψ_{X−E[X]}(λ) ≤ λ²(b−a)²/8.
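The lemma can be sanity-checked numerically in a case where the centered mgf has a simple closed form: a Bernoulli(p) on [0,1], so (b−a)²/8 = 1/8. A minimal sketch; the p values and λ grid are arbitrary illustrative choices:

```python
import numpy as np

# Centered Bernoulli(p): X − p equals −p w.p. 1−p and 1−p w.p. p,
# so its CGF is log((1−p)·e^{−λp} + p·e^{λ(1−p)}).
def centered_bernoulli_cgf(lam, p):
    return np.log((1 - p) * np.exp(-lam * p) + p * np.exp(lam * (1 - p)))

lams = np.linspace(-5.0, 5.0, 2001)
# Hoeffding's lemma predicts CGF(λ) ≤ λ²/8 for every p and every λ.
lemma_holds = all(
    np.all(centered_bernoulli_cgf(lams, p) <= lams**2 / 8 + 1e-12)
    for p in [0.1, 0.3, 0.5, 0.9]
)
```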
Let X₁, …, X_n be independent, with Xᵢ ∈ [aᵢ, bᵢ] almost surely. Let S = ∑ᵢ Xᵢ and μ = E[S].
Define the centered sum Z = S − μ = ∑ᵢ (Xᵢ − E[Xᵢ]).
For λ > 0, P(Z ≥ t) ≤ exp(ψ_Z(λ) − λt).
Because the centered terms are independent, ψ_Z(λ) = ∑ᵢ ψ_{Xᵢ−E[Xᵢ]}(λ).
Each Xᵢ − E[Xᵢ] is still bounded in an interval of width (bᵢ − aᵢ). So ψ_{Xᵢ−E[Xᵢ]}(λ) ≤ λ²(bᵢ−aᵢ)²/8.
Summing gives ψ_Z(λ) ≤ (λ²/8) ∑ᵢ (bᵢ−aᵢ)².
Define the “range-sum” parameter V = ∑ᵢ (bᵢ−aᵢ)².
Then P(Z ≥ t) ≤ exp(λ²V/8 − λt).
We minimize the exponent in λ: h(λ) = λ²V/8 − λt.
Differentiate: h′(λ) = λV/4 − t.
Set h′(λ) = 0: λ* = 4t/V.
Plug back in:
1) Compute the quadratic term: (λ*)²V/8 = (16t²/V²)·V/8 = 2t²/V.
2) Compute the linear term: λ*t = 4t²/V.
So h(λ*) = 2t²/V − 4t²/V = −2t²/V.
Therefore P(S − μ ≥ t) ≤ exp(−2t²/V).
That is Hoeffding’s inequality.
Similarly, P(S − μ ≤ −t) ≤ exp(−2t²/V) (apply the same argument to −Z).
By a union bound, P(|S − μ| ≥ t) ≤ 2 exp(−2t²/V).
If Xᵢ ∈ [0,1], then (bᵢ − aᵢ)=1 and V = n.
Let \bar{X} = (1/n)∑ Xᵢ and E[\bar{X}] = μ̄.
Then P(\bar{X} − μ̄ ≥ ε) = P(S − μ ≥ nε) ≤ exp(−2nε²).
And two-sided: P(|\bar{X} − μ̄| ≥ ε) ≤ 2 exp(−2nε²).
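A quick Monte Carlo check of the averaged form; the Uniform[0,1] distribution and the specific n, ε, and trial count are illustrative choices:

```python
import numpy as np

# Empirical frequency of |mean − 1/2| ≥ eps for Uniform[0,1] samples,
# compared against the two-sided Hoeffding bound 2·exp(−2nε²).
rng = np.random.default_rng(0)
n, eps, trials = 100, 0.1, 20000
sample_means = rng.random((trials, n)).mean(axis=1)
empirical = np.mean(np.abs(sample_means - 0.5) >= eps)
bound = 2 * np.exp(-2 * n * eps**2)   # ≈ 0.271 here
# The bound is valid but loose: the true tail for this distribution is far smaller.
```

The gap between `empirical` and `bound` is expected: Hoeffding uses only boundedness, not the uniform distribution’s small variance.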
This is the second visualization the node asked for.
In Hoeffding’s proof, we replaced the true ψ_Z(λ) by an upper quadratic: ψ_Z(λ) ≤ λ²V/8.
Geometrically, you can plot: the true ψ_Z(λ), the quadratic λ²V/8, and the line λt on the same axes.
Then the optimized exponent becomes the conjugate of the quadratic: sup_{λ>0}(λt − λ²V/8) = 2t²/V.
So the tail is bounded by exp(−2t²/V).
Interactive canvas idea (explicit): sliders for n and ε (or t), with the quadratic-vs-line picture on one side and the resulting bound 2 exp(−2nε²) on the other.
This connects directly to “how many samples do I need for error ≤ ε with probability ≥ 1−δ?”
Hoeffding is powerful and simple, but it is sometimes loose because it uses only boundedness.
Chernoff bounds refine the story when the distribution is known (especially Bernoulli) by using a sharper ψ.
Let Xᵢ ∼ Bernoulli(p) independent. Then S = ∑ Xᵢ ∼ Binomial(n, p), and μ = E[S] = np.
For a single Bernoulli X: E[e^{λX}] = 1 − p + pe^{λ}.
So ψ_X(λ) = log(1 − p + pe^{λ}).
For the sum S, independence gives ψ_S(λ) = n log(1 − p + pe^{λ}).
For the centered sum Z = S − μ, ψ_Z(λ) = n log(1 − p + pe^{λ}) − λnp.
Now apply exponential Markov: P(Z ≥ t) ≤ exp(ψ_Z(λ) − λt).
Let’s express the event in relative terms. Set t = nε, i.e. S ≥ n(p+ε). Then P(S ≥ n(p+ε)) ≤ exp(n log(1 − p + pe^{λ}) − λn(p+ε)).
We optimize over λ ≥ 0.
Let q = p+ε (so q > p). The optimal exponent is sup_{λ>0}(λq − log(1 − p + pe^{λ}) + λ·0... ) per sample; concretely, sup_{λ>0}(λq − log(1 − p + pe^{λ})) = D(q‖p),
where D(q‖p) is the Bernoulli KL divergence: D(q‖p) = q log(q/p) + (1−q) log((1−q)/(1−p)).
Thus the Chernoff bound becomes P(S/n ≥ q) ≤ exp(−n·D(q‖p)).
Similarly for the lower tail (q < p): P(S/n ≤ q) ≤ exp(−n·D(q‖p)).
This is a major conceptual upgrade: the exponent now adapts to the actual distribution. Since D(q‖p) ≥ 2(q−p)² (a Pinsker-type inequality), the KL exponent is never worse than Hoeffding’s quadratic exponent, and it is much better when p is far from 1/2.
A popular parameterization is S ≥ (1+δ)μ with δ>0.
Since μ=np, the event is S ≥ (1+δ)np.
A standard Chernoff upper tail bound is: P(S ≥ (1+δ)μ) ≤ (e^{δ}/(1+δ)^{1+δ})^{μ}.
And a simpler (looser) but easy-to-remember form, for δ ∈ (0, 1]: P(S ≥ (1+δ)μ) ≤ exp(−δ²μ/3).
Lower tail for δ ∈ (0,1): P(S ≤ (1−δ)μ) ≤ exp(−δ²μ/2).
These are derived by selecting a convenient λ rather than solving exactly, or by bounding KL by a quadratic.
For Bernoulli sums, the per-sample CGF ψ(λ) = log(1 − p + pe^{λ}) is exact, and its convex conjugate at q is exactly D(q‖p).
This is the same duality picture as before, but now the optimized supporting line corresponds to a KL divergence.
Extend the earlier Panel A: alongside the Hoeffding quadratic λ²/8, draw the exact centered Bernoulli curve log(1 − p + pe^{λ}) − λp and the same line λt; the exact curve lies below the quadratic, so its supporting-line gap (the exponent) is larger.
Then in Panel B, plot two bounds versus q: the Chernoff bound exp(−n·D(q‖p)) and the Hoeffding bound exp(−2n(q−p)²).
This comparison makes the “distribution-aware” advantage visible.
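The advantage can also be quantified without plotting. A minimal comparison, assuming illustrative values n = 100, p = 0.3, q = 0.5, with the exact tail computed from the binomial pmf:

```python
import math

def kl(q, p):
    """Bernoulli KL divergence D(q‖p)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

n, p, q = 100, 0.3, 0.5
k0 = math.ceil(n * q)
# Exact upper tail P(S ≥ nq) for S ~ Binomial(n, p).
exact = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k0, n + 1))
chernoff = math.exp(-n * kl(q, p))          # KL exponent
hoeffding = math.exp(-2 * n * (q - p)**2)   # quadratic exponent
# Expected ordering: exact ≤ Chernoff ≤ Hoeffding, since D(q‖p) ≥ 2(q−p)².
```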
Even though this node centers on independent sums, the exponential Markov method generalizes.
In Markov chain settings, you often want concentration for empirical averages along a trajectory: (1/n) ∑ₜ f(Xₜ) around its stationary mean.
Independence fails, but one can still bound exponential moments using spectral gaps / mixing and obtain Hoeffding-like or Bernstein-like inequalities for dependent data.
The important connection is conceptual: the exponential Markov engine does not require independence, only control of exponential moments; what changes is how you bound ψ.
So learning the ψ(λ) vs λt optimization here pays off later in dependent concentration.
| Inequality | Setting | Tail behavior | What ψ bound uses | Typical exponent |
|---|---|---|---|---|
| Hoeffding | Independent, bounded Xᵢ ∈ [aᵢ,bᵢ] | Sub-Gaussian | Quadratic upper bound on ψ (Hoeffding lemma) | 2t² / ∑(bᵢ−aᵢ)² |
| Chernoff | Bernoulli/Binomial (and related) | Often tighter; KL form | Exact ψ, then optimize | n·D(q‖p) |
Both are the same engine + different ψ information.
Let X₁,…,X_n be independent with Xᵢ ∈ [0,1]. Let \bar{X} = (1/n)∑Xᵢ and μ̄ = E[\bar{X}]. We want n such that P(|\bar{X} − μ̄| ≥ ε) ≤ δ.
Start from two-sided Hoeffding for bounded [0,1] variables:
P(|\bar{X} − μ̄| ≥ ε) ≤ 2 exp(−2nε²).
Set the bound ≤ δ:
2 exp(−2nε²) ≤ δ.
Divide by 2:
exp(−2nε²) ≤ δ/2.
Take logs (note log is increasing):
−2nε² ≤ log(δ/2).
Multiply by −1 (flips inequality):
2nε² ≥ log(2/δ).
Solve for n: n ≥ log(2/δ) / (2ε²).
Interpretation: to cut ε in half, you need ~4× more samples; to make δ 10× smaller, you need only an additive increase of (log 10)/(2ε²).
Insight: This is the classic PAC-style sample complexity tradeoff. The square dependence on ε comes from the quadratic ψ upper bound (sub-Gaussianity), while the log dependence on 1/δ comes from the exponential tail.
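The resulting sample-size rule is a one-liner. A minimal sketch; `hoeffding_sample_size` is a hypothetical helper name, and the (ε, δ) values are illustrative:

```python
import math

def hoeffding_sample_size(eps, delta):
    # Smallest n with 2·exp(−2nε²) ≤ δ, i.e. n ≥ log(2/δ) / (2ε²).
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

n1 = hoeffding_sample_size(0.05, 0.05)    # ε = 0.05, δ = 0.05
n2 = hoeffding_sample_size(0.025, 0.05)   # halving ε roughly quadruples n
```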
Let X₁,…,X_n independent with Xᵢ ∈ [aᵢ,bᵢ]. Let S=∑Xᵢ and μ=E[S]. Define V=∑(bᵢ−aᵢ)². Show P(S−μ ≥ t) ≤ exp(−2t²/V).
Let Z = S − μ = ∑(Xᵢ − E[Xᵢ]). We apply exponential Markov:
For any λ>0,
P(Z ≥ t) = P(e^{λZ} ≥ e^{λt}) ≤ E[e^{λZ}] / e^{λt}.
Take logs via ψ:
P(Z ≥ t) ≤ exp( ψ_Z(λ) − λt ), where ψ_Z(λ)=log E[e^{λZ}].
Use independence to add CGFs:
ψ_Z(λ) = ∑ ψ_{Xᵢ−E[Xᵢ]}(λ).
Apply Hoeffding’s lemma to each term:
ψ_{Xᵢ−E[Xᵢ]}(λ) ≤ λ²(bᵢ−aᵢ)²/8.
Sum the bounds:
ψ_Z(λ) ≤ (λ²/8)∑(bᵢ−aᵢ)² = λ²V/8.
So P(Z ≥ t) ≤ exp( λ²V/8 − λt ).
Optimize h(λ)=λ²V/8 − λt:
h'(λ)=λV/4 − t = 0 ⇒ λ* = 4t/V.
Plug in:
h(λ*) = ( (16t²/V²)·V / 8 ) − (4t/V)·t = (2t²/V) − (4t²/V) = −2t²/V.
Conclude: P(S − μ ≥ t) ≤ exp(−2t²/V).
Insight: This example makes the optimization/duality tangible: once ψ is upper-bounded by a quadratic, the entire tail bound becomes a one-line quadratic minimization in λ.
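A short numeric check of that one-line minimization, with V and t picked arbitrarily:

```python
import numpy as np

# Minimize h(λ) = λ²V/8 − λt on a fine grid and compare with the
# closed-form answers λ* = 4t/V and h(λ*) = −2t²/V.
V, t = 9.0, 2.0
lams = np.linspace(0.0, 5.0, 500001)
h = lams**2 * V / 8 - lams * t
lam_star = lams[h.argmin()]   # should be near 4t/V = 8/9
h_min = h.min()               # should be near −2t²/V = −8/9
```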
Let S ∼ Binomial(n,p). For a target fraction q>p, bound P(S/n ≥ q).
Use exponential Markov on S:
For λ>0,
P(S ≥ nq) = P(e^{λS} ≥ e^{λnq}) ≤ E[e^{λS}] / e^{λnq}.
Compute the mgf of a binomial via independence:
E[e^{λS}] = (E[e^{λX₁}])^n with X₁∼Bern(p).
E[e^{λX₁}] = 1−p + pe^{λ}.
So:
P(S/n ≥ q) ≤ exp( n log(1−p+pe^{λ}) − λnq ).
Optimize over λ>0:
Take derivative w.r.t. λ of the exponent per-sample:
φ(λ)=log(1−p+pe^{λ}) − λq.
φ'(λ) = (pe^{λ})/(1−p+pe^{λ}) − q.
Set φ'(λ*)=0:
(pe^{λ*})/(1−p+pe^{λ*}) = q.
Solve for e^{λ*}:
q(1−p+pe^{λ*}) = pe^{λ*}
q(1−p) = pe^{λ*}(1−q)
⇒ e^{λ*} = (q(1−p))/(p(1−q)).
Plugging λ* back yields the KL form:
min_{λ>0} [log(1−p+pe^{λ}) − λq] = −D(q‖p),
so P(S/n ≥ q) ≤ exp(−n·D(q‖p)).
Insight: Chernoff is the same exponential Markov method, but with an exact ψ. The optimization condition ψ'(λ*)=t becomes an explicit equation whose solution encodes the tilted (exponentially reweighted) Bernoulli distribution, and the exponent becomes a KL divergence.
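The closed-form tilt can be verified directly; the p and q values below are arbitrary illustrative choices:

```python
import math

# Plug the stationarity solution e^{λ*} = q(1−p)/(p(1−q)) back into the
# per-sample exponent φ(λ) = log(1−p+pe^{λ}) − λq and compare with −D(q‖p).
p, q = 0.3, 0.5
e_lam = q * (1 - p) / (p * (1 - q))   # optimal tilt e^{λ*}
lam_star = math.log(e_lam)
phi_star = math.log(1 - p + p * e_lam) - lam_star * q
kl = q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))
# phi_star should equal −kl up to floating point.
```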
The exponential Markov method is the universal engine: apply Markov to e^{λX} to turn P(X≥t) into an mgf/CGF bound.
The cumulant-generating function ψ_X(λ)=log E[e^{λX}] controls tail decay; for sums of independent variables, ψ adds: ψ_{∑Xᵢ}(λ)=∑ψ_{Xᵢ}(λ).
Tail bounds are optimization problems: P(Z≥t) ≤ exp( inf_{λ>0}(ψ_Z(λ)−λt) ) = exp(−sup_{λ>0}(λt−ψ_Z(λ))).
The ψ(λ) vs λt picture is geometric duality: maximizing the gap λt−ψ(λ) gives the best exponent; the optimizer satisfies ψ'(λ*)=t when differentiable.
Hoeffding’s lemma gives a quadratic upper bound on ψ for bounded variables, implying sub-Gaussian tails; Hoeffding’s inequality follows by summing ψ and optimizing λ.
For Bernoulli/Binomial sums, Chernoff bounds use a sharper ψ and yield KL-divergence exponents exp(−nD(q||p)), often tighter than Hoeffding.
In high-probability estimation, concentration turns (ε,δ) requirements into explicit sample complexity like n ≥ log(2/δ)/(2ε²) for [0,1] averages.
Forgetting to center: bounding P(S≥t) when you really need P(S−E[S]≥t) changes ψ and often worsens constants or breaks symmetry.
Optimizing λ with the wrong sign (e.g., using λ>0 for a lower tail); for P(Z≤−t) you typically apply the method to −Z (or λ<0).
Assuming Hoeffding always beats everything: it ignores variance and distribution shape; Bernstein/Chernoff can be much tighter when additional information exists.
Mixing up t for sums vs averages: if Z = \bar{X}−E\bar{X}, then the corresponding sum deviation is n·t, which changes exponents by n.
Let X₁,…,X_n be independent with Xᵢ ∈ [−1,2]. Give a bound on P(|\bar{X}−E\bar{X}| ≥ ε).
Hint: Use two-sided Hoeffding on the sum with (bᵢ−aᵢ)=3, then translate from S to \bar{X}.
Here (bᵢ−aᵢ)=3 for each i, so V=∑(bᵢ−aᵢ)² = n·9 = 9n.
For S=∑Xᵢ and μ=E[S], Hoeffding gives:
P(|S−μ| ≥ t) ≤ 2 exp(−2t²/V) = 2 exp(−2t²/(9n)).
Now |\bar{X}−E\bar{X}| ≥ ε is equivalent to |S−μ| ≥ nε, so t=nε:
P(|\bar{X}−E\bar{X}| ≥ ε) ≤ 2 exp(−2(nε)²/(9n)) = 2 exp(−2nε²/9).
Let S ∼ Binomial(n,p). Using the exponential Markov method, show that for any λ>0:
P(S/n ≥ q) ≤ exp(n[log(1−p+pe^{λ}) − λq]). Then write the condition that defines the optimal λ*.
Hint: Compute E[e^{λS}] via independence; then differentiate the exponent per-sample and set to zero.
Exponential Markov:
P(S≥nq)=P(e^{λS}≥e^{λnq})≤E[e^{λS}]/e^{λnq}.
For Bernoulli X, E[e^{λX}]=1−p+pe^{λ}. For S=∑Xᵢ, E[e^{λS}]=(1−p+pe^{λ})^n.
Thus
P(S/n ≥ q) ≤ (1−p+pe^{λ})^n / e^{λnq} = exp(n[log(1−p+pe^{λ}) − λq]).
For optimal λ*, minimize f(λ)=log(1−p+pe^{λ}) − λq.
Set derivative to zero:
f'(λ)=(pe^{λ})/(1−p+pe^{λ}) − q = 0,
so λ* satisfies (pe^{λ*})/(1−p+pe^{λ*}) = q.
Suppose Xᵢ ∈ [0,1] independent, and you want P(\bar{X} − E\bar{X} ≥ ε) ≤ δ (one-sided). Solve for the smallest n that Hoeffding guarantees.
Hint: Use the one-sided Hoeffding bound exp(−2nε²) and solve exp(−2nε²) ≤ δ.
For Xᵢ∈[0,1], Hoeffding one-sided gives:
P(\bar{X} − E\bar{X} ≥ ε) ≤ exp(−2nε²).
Require exp(−2nε²) ≤ δ.
Take logs:
−2nε² ≤ log δ.
Multiply by −1 (flip):
2nε² ≥ log(1/δ).
Thus n ≥ log(1/δ) / (2ε²), so the smallest n Hoeffding guarantees is ⌈log(1/δ)/(2ε²)⌉.
Next nodes you might connect: variance-aware (Bernstein-type) inequalities, bounded-difference inequalities for functions of independent variables, and concentration for dependent data via Markov chain mixing and spectral gaps.