Testing claims about populations using sample data. p-values, significance.
Multi-session curriculum - substantial prior knowledge and complex material. Use mastery gates and deliberate practice.
You run an A/B test and see Variant B converts 2% better than A. Is that real—or just random luck in the sample? Hypothesis testing is the machinery that turns that question into a repeatable decision procedure.
Hypothesis testing compares two precise claims (H₀ vs Hₐ) about a population parameter using a test statistic and its null distribution. A p-value is the probability (computed assuming H₀ is true) of seeing evidence at least as extreme as what you observed. You reject H₀ when p ≤ α, where α is a pre-chosen significance level that controls the long-run false-rejection rate.
Hypothesis testing is a structured way to use sample data to evaluate a claim about a population.
The core difficulty is that samples vary. Even if nothing is changing in the population, random sampling will produce different means, proportions, and counts. Hypothesis testing acknowledges this uncertainty and asks: is the deviation we observed larger than what sampling variability alone would plausibly produce?
A hypothesis test always starts with two competing hypotheses about a population parameter (a fixed but unknown number like μ, p, or λ).
Example (mean): H₀: μ = 10.0 vs Hₐ: μ ≠ 10.0
Example (proportion): H₀: p = 0.10 vs Hₐ: p > 0.10
Notice that both hypotheses are statements about the population, not the sample. The sample is our window.
The alternative hypothesis encodes what "extreme evidence" means. It can take three forms: Hₐ: parameter ≠ value (two-sided), Hₐ: parameter > value (right-tailed), or Hₐ: parameter < value (left-tailed).
This choice is not cosmetic. It determines which tail(s) of the null distribution count as “as extreme or more extreme.”
1) Choose H₀/Hₐ → 2) compute a test statistic from the sample → 3) compare it to its null distribution → 4) compute a p-value → 5) reject or fail to reject using α.
Under H₀, your test statistic has a distribution. You mark either:
- a rejection region: tail area(s) totaling α, fixed before seeing the data, or
- a p-value: the tail area beyond the statistic you actually observed.
Both are just shaded areas under the same “null curve.”
Below is a generic null distribution (often approximately Normal). The two critical values cut off α/2 in each tail.
        Null distribution of test statistic (under H₀)

                      /\
                     /  \
                    /    \
          _________/      \_________
         /                          \
   ------|----|----------|----|------> t
             -c     0   +c
         α/2                  α/2
       (reject)            (reject)

Decision rule (two-sided): reject H₀ if t ≤ −c or t ≥ +c.

Here the alternative is "greater than," so only the right tail counts.
        Null distribution of test statistic (under H₀)

                      /\
                     /  \
                    /    \
          _________/      \_________
         /                          \
   ------|----------0------|\\\\\\\\--> t
                         t_obs
                            (p-value area)

p-value = P(T ≥ t_obs | H₀)

These pictures are the backbone of hypothesis testing. Everything else is computation.
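These shaded areas are ordinary Normal tail probabilities, so they are easy to compute directly. A minimal sketch using only Python's standard library (the standard Normal null and the value t_obs = 2.0 are assumptions for illustration):

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal null distribution, N(0, 1)
t_obs = 2.0       # an example observed test statistic

# Right-tailed p-value: area under the null curve beyond t_obs
p_right = 1 - Z.cdf(t_obs)

# Two-sided p-value: area beyond |t_obs| in both tails
p_two = 2 * (1 - Z.cdf(abs(t_obs)))

print(round(p_right, 4))  # 0.0228
print(round(p_two, 4))    # 0.0455
```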
A dataset is many numbers. A hypothesis test needs a single number that summarizes the evidence against H₀.
That number is the test statistic: a function of the sample.
Common patterns:
- a sample mean x̄ compared against a null mean μ₀,
- a sample proportion p̂ compared against a null proportion p₀,
- a difference between two sample means or proportions.
But the raw statistic (like x̄) is hard to interpret without scale. We typically convert it into a standardized form that answers:
“How many standard errors away from the null value is the observed estimate?”
That’s what z-scores and t-scores do.
If you repeatedly sample n observations from a fixed population, the statistic varies. Its standard deviation is the standard error (SE).
A key CLT-driven idea (you already know the CLT) is:

- For a sample mean: SE(x̄) = σ/√n, so x̄ ≈ Normal(μ, σ²/n) for large n.
- For a sample proportion: SE(p̂) = √(p(1−p)/n), so p̂ ≈ Normal(p, p(1−p)/n) for large n.
The null distribution is the sampling distribution of the test statistic assuming H₀ is true.
If H₀ specifies μ = μ₀, then under H₀ the sampling distribution of x̄ is centered at the null value: x̄ ≈ Normal(μ₀, σ²/n).
A canonical z-test statistic for a mean with known σ is:

Z = (x̄ − μ₀) / (σ/√n)

Under H₀ (and with CLT / Normal assumptions), Z ≈ N(0, 1).
That last line is crucial: it tells you how to turn an observed Z into a tail probability.
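A sketch of that computation in code; the helper name z_statistic is ours for illustration, not from any library, and the numbers are the bolt example used later in this lesson:

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(xbar, mu0, sigma, n):
    """How many standard errors the sample mean sits from the null value."""
    se = sigma / sqrt(n)       # standard error of the mean
    return (xbar - mu0) / se

# Illustrative numbers: x̄ = 10.3, μ₀ = 10.0, σ = 0.9, n = 36
z = z_statistic(10.3, 10.0, 0.9, 36)
print(round(z, 6))  # 2.0

# Turn the observed Z into a right-tail probability under N(0, 1)
p_tail = 1 - NormalDist().cdf(z)
print(round(p_tail, 4))  # 0.0228
```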
Once you have the null distribution, “extreme” means “in the tail(s) consistent with Hₐ.”
Let T be your test statistic with observed value t_obs.
For symmetric null distributions (like the Normal), the two-sided p-value is often:

p = P(|T| ≥ |t_obs| | H₀) = 2·P(T ≥ |t_obs| | H₀)
The p-value measures evidence; α is a decision threshold.
Decision rule: reject H₀ if p ≤ α; otherwise, fail to reject H₀.
This is equivalent to using critical values.
For example, in a two-sided z-test with α = 0.05, the critical values are ±1.96.
This equivalence is worth seeing explicitly.
If Z ∼ N(0, 1) under H₀, and α = 0.05 two-sided, we choose c so that:

P(|Z| ≥ c) = 0.05

By symmetry:

P(Z ≥ c) = 0.05/2 = 0.025

So c ≈ 1.96 (the 97.5th percentile of the standard Normal).

Then: "reject if |z_obs| ≥ 1.96" and "reject if p = 2·P(Z ≥ |z_obs|) ≤ 0.05" describe exactly the same event.
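This equivalence can be checked numerically with the standard library's inverse Normal CDF (a sketch, not a testing framework; the sample z-values are arbitrary):

```python
from statistics import NormalDist

Z = NormalDist()
alpha = 0.05

# Two-sided critical value: the point cutting off alpha/2 in the right tail
c = Z.inv_cdf(1 - alpha / 2)
print(round(c, 3))  # 1.96

# The p-value rule and the critical-value rule make the same decision:
for z_obs in (-2.5, -1.0, 0.0, 1.5, 2.2):
    p = 2 * (1 - Z.cdf(abs(z_obs)))           # two-sided p-value
    assert (p <= alpha) == (abs(z_obs) >= c)  # identical verdicts
```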
A p-value is:
- the probability, computed assuming H₀ is true, of observing evidence at least as extreme as what you actually observed.

It is not:
- the probability that H₀ is true,
- the probability that the result is "due to chance," or
- the probability that rejecting is a mistake.
A small p-value means: “If H₀ were true, this would be rare.” It does not, by itself, tell you whether the effect is practically important.
The biggest source of confusion in hypothesis testing is mixing up α (a fixed, pre-chosen tail area that defines the rejection region) and the p-value (a data-dependent tail area beyond the observed statistic).
This section focuses on making the tail logic visual and automatic.
You should be able to answer: “Which sample outcomes would convince me H₀ is wrong?”
This is not something you should decide after seeing data.
Think of α as “how much tail area we’re willing to label as ‘reject’ when H₀ is true.”
Here are the three standard rejection-region pictures.
        Null distribution under H₀ (right-tailed test)

                      /\
                     /  \
                    /    \
          _________/      \_________
         /                          \
   ------|----------0------|\\\\\\\\--> t
                           c
                        α (reject)

Reject if t ≥ c.

        Null distribution under H₀ (left-tailed test)

                      /\
                     /  \
                    /    \
          _________/      \_________
         /                          \
   \\\\\\\\|------0----------|------> t
           c
        α (reject)

Reject if t ≤ c.

        Null distribution under H₀ (two-sided test)

                      /\
                     /  \
                    /    \
          _________/      \_________
         /                          \
   \\\\|----|----------|----|\\\\--> t
           -c     0   +c
    α/2                       α/2

Reject if t ≤ −c or t ≥ +c.

The p-value is not "α in the tail." It is the observed tail area beyond your observed statistic, using the appropriate tail rule.
If you remember one sentence, use this:
p-value = shaded area in the tail(s) of the null distribution beyond the observed statistic, in the direction(s) specified by Hₐ.
Examples of shading:
| Test type | Hₐ form | Rejection region area | p-value computed as |
|---|---|---|---|
| Right-tailed | parameter > value | α in right tail | P(T ≥ t_obs \| H₀) |
| Left-tailed | parameter < value | α in left tail | P(T ≤ t_obs \| H₀) |
| Two-sided | parameter ≠ value | α/2 in each tail | P(\|T\| ≥ \|t_obs\| \| H₀) |
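The three rows of this table fold naturally into one helper. A standard-library sketch assuming a standard Normal null (the name p_value and the alternative labels are ours):

```python
from statistics import NormalDist

def p_value(t_obs, alternative):
    """Tail area beyond t_obs under an N(0, 1) null, per the alternative."""
    Z = NormalDist()
    if alternative == "greater":    # right-tailed: P(T >= t_obs)
        return 1 - Z.cdf(t_obs)
    if alternative == "less":       # left-tailed: P(T <= t_obs)
        return Z.cdf(t_obs)
    if alternative == "two-sided":  # P(|T| >= |t_obs|)
        return 2 * (1 - Z.cdf(abs(t_obs)))
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")

print(round(p_value(2.0, "greater"), 4))    # 0.0228
print(round(p_value(-1.5, "less"), 4))      # 0.0668
print(round(p_value(2.0, "two-sided"), 4))  # 0.0455
```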
Suppose we do a right-tailed z-test at α = 0.05.
So the rejection region is z ≥ c, where c ≈ 1.645 (the 95th percentile of the standard Normal).

Now compare two observed z-values:

1) z_obs = 1.2: p = P(Z ≥ 1.2) ≈ 0.1151 > 0.05, and 1.2 < 1.645, so both rules say fail to reject.
2) z_obs = 2.0: p = P(Z ≥ 2.0) ≈ 0.0228 ≤ 0.05, and 2.0 ≥ 1.645, so both rules say reject.
The rule “reject if p ≤ α” and the rule “reject if z_obs ≥ 1.645” always agree because they are the same geometric comparison under the null curve.
When p > α, we say fail to reject H₀, not “accept H₀.”
Why? Because the test is asymmetric: it can accumulate evidence against H₀, but it is not designed to confirm H₀.

Large p-values can happen because:
- H₀ really is (approximately) true,
- a real effect exists but is too small to detect at this sample size, or
- the data are too noisy (the test has low power).
Hypothesis testing is a reusable template. Once you internalize the tail logic, you can apply it across many settings.
1) A/B testing (proportions): is the new variant's conversion rate higher than the baseline's?
2) Quality control (means): has the process mean drifted from its target?
3) Healthcare / experiments: does the treatment change the outcome relative to the control?

In each case you identify the population parameter, state H₀ and Hₐ, compute a standardized statistic, and shade the tail(s) dictated by Hₐ.
α is the probability of rejecting H₀ when H₀ is true: α = P(reject H₀ | H₀ true).
That is a guarantee about a procedure, not about a single dataset. If you repeatedly run the same testing procedure in a world where H₀ is actually true, about α fraction of runs will (incorrectly) reject.
This motivates why α should be chosen before looking at the data: it’s part of the design of the decision rule.
A tiny effect can be statistically significant if n is huge (SE becomes small). Conversely, a meaningful effect can fail to be significant if n is small.
Because the standardized statistic is often of the form:

z = (estimate − null value) / SE

Increasing n shrinks SE like 1/√n, making it easier for a fixed effect to appear "many SEs away."
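A small demo of that scaling (σ = 1 and a fixed true effect of 0.1 are arbitrary choices for illustration):

```python
from math import sqrt

sigma = 1.0    # population standard deviation (assumed for the demo)
effect = 0.1   # a fixed true deviation from the null value

for n in (25, 100, 400, 1600, 6400):
    se = sigma / sqrt(n)   # SE shrinks like 1/sqrt(n)
    z = effect / se        # the same effect, measured in standard errors
    print(n, se, round(z, 1))
```

Quadrupling n halves the SE, so the same fixed effect of 0.1 climbs from z = 0.5 (unremarkable) to z = 8 (far in the tail).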
In practice, you should pair hypothesis tests with effect-size estimates and confidence intervals, so you report how big the effect is, not just whether it is "significant."
A correct interpretation template: "At α = 0.05 we reject H₀; if H₀ were true, data at least this extreme would occur with probability p."

Avoid: "we proved Hₐ," "we accepted H₀," and "there is a p% chance that H₀ is true."
Hypothesis testing is closely tied to other core stats tools: confidence intervals (the set of null values a two-sided test would not reject), the CLT (which justifies Normal null distributions), and power analysis (the chance of detecting a real effect).
Even if you don’t go deep into theory, the tail diagrams and “null distribution + shaded area” mental model will transfer directly.
A factory claims its bolts have mean length μ = 10.0 cm. You sample n = 36 bolts and measure sample mean x̄ = 10.3 cm. Assume the population standard deviation is known: σ = 0.9 cm. Test at significance level α = 0.05.
Hypotheses: H₀: μ = 10.0 vs Hₐ: μ ≠ 10.0 (two-sided)

Test statistic: Z = (x̄ − μ₀)/(σ/√n)
Compute the standard error:
SE = σ/√n = 0.9/√36 = 0.9/6 = 0.15
Compute the observed z-value:
z_obs = (x̄ − μ₀)/SE
= (10.3 − 10.0)/0.15
= 0.3/0.15
= 2.0
Compute the two-sided p-value:
p = P(|Z| ≥ 2.0 | H₀)
= 2·P(Z ≥ 2.0)
Using standard normal tables (or known value): P(Z ≥ 2.0) ≈ 0.0228
So p ≈ 2·0.0228 = 0.0456
Decision using p-value:
Since p ≈ 0.0456 ≤ α = 0.05, reject H₀.
Same decision using rejection region (critical values):
For α = 0.05 two-sided, critical values are ±1.96.
Reject if |z_obs| ≥ 1.96.
Here |2.0| ≥ 1.96, so reject.
Insight: Two equivalent lenses: (1) compare p to α (shaded tail area beyond ±|z_obs|), or (2) compare z_obs to critical values (fixed α/2 tails). Both are literally the same geometry under the null distribution.
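Example 1 end to end, as a standard-library sketch (note the exact p-value is ≈ 0.0455; tables rounded to 0.0228 per tail give the 0.0456 above):

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar, alpha = 10.0, 0.9, 36, 10.3, 0.05

se = sigma / sqrt(n)                        # 0.15
z_obs = (xbar - mu0) / se                   # 2.0
p = 2 * (1 - NormalDist().cdf(abs(z_obs)))  # two-sided p-value

print(round(z_obs, 2), round(p, 4))  # 2.0 0.0455
print("reject H0" if p <= alpha else "fail to reject H0")  # reject H0
```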
A website historically has conversion rate p = 0.10. After a UI change, you observe n = 400 visitors with x = 52 conversions, so p̂ = 52/400 = 0.13. Test if conversion increased at α = 0.01.
Hypotheses: H₀: p = 0.10 vs Hₐ: p > 0.10 (right-tailed)

Approximate (CLT) test statistic: Z = (p̂ − p₀)/√(p₀(1−p₀)/n)
Under H₀, Z ≈ N(0,1).
Compute p̂:
p̂ = 52/400 = 0.13
Compute the standard error under H₀:
SE = √(p₀(1−p₀)/n)
= √(0.10·0.90/400)
= √(0.09/400)
= √(0.000225)
= 0.015
Compute the observed z-value:
z_obs = (p̂ − p₀)/SE
= (0.13 − 0.10)/0.015
= 0.03/0.015
= 2.0
Compute the right-tailed p-value:
p = P(Z ≥ 2.0 | H₀) ≈ 0.0228
Decision:
Compare p to α:
0.0228 > 0.01, so fail to reject H₀.
Rejection-region equivalent:
For a right-tailed test at α = 0.01, the critical value is about 2.326.
Since 2.0 < 2.326, z_obs is not in the rejection region, so fail to reject.
Insight: Same z_obs can be ‘significant’ at α=0.05 but not at α=0.01. Tightening α shrinks the rejection region (less shaded tail area), making rejection harder.
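Example 2 follows the same pattern, but right-tailed and with the SE computed under H₀ (a standard-library sketch):

```python
from math import sqrt
from statistics import NormalDist

p0, n, x, alpha = 0.10, 400, 52, 0.01

phat = x / n                     # 0.13
se = sqrt(p0 * (1 - p0) / n)     # 0.015, computed under H0
z_obs = (phat - p0) / se         # 2.0
p = 1 - NormalDist().cdf(z_obs)  # right-tailed p-value

print(round(z_obs, 2), round(p, 4))  # 2.0 0.0228
print("reject H0" if p <= alpha else "fail to reject H0")  # fail to reject H0
```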
Re-use Example 1 where z_obs = 2.0 from testing μ₀ = 10.0.
Consider two different alternatives:
A) Hₐ: μ > 10.0 (right-tailed)
B) Hₐ: μ ≠ 10.0 (two-sided)
Compute the p-values and compare.
Right-tailed p-value (μ > 10):
p_right = P(Z ≥ 2.0 | H₀) ≈ 0.0228
Two-sided p-value (μ ≠ 10):
p_two = P(|Z| ≥ 2.0 | H₀)
= 2·P(Z ≥ 2.0)
≈ 2·0.0228
= 0.0456
Interpretation:
The two-sided p-value is (for symmetric nulls) twice the one-sided p-value because it counts extremes in both tails.
Insight: This is why you must choose one- vs two-sided before seeing the data: you’re defining what counts as “as extreme.” The diagram literally changes from one shaded tail to two shaded tails.
A hypothesis test compares two precise population claims: H₀ (baseline) vs Hₐ (the direction/shape of deviation you care about).
A test statistic compresses the sample into one number that measures evidence against H₀, often as “estimate minus null value, measured in standard errors.”
The null distribution is the sampling distribution of the test statistic assuming H₀ is true; it’s the reference curve you shade areas under.
The p-value is a probability computed under H₀: the tail area at least as extreme as the observed statistic, in the direction(s) specified by Hₐ.
α is chosen in advance and is the total area of the rejection region under the null curve; in two-sided tests it splits into α/2 per tail.
The rules “reject if p ≤ α” and “reject if statistic is beyond critical value(s)” are equivalent views of the same geometry.
Failing to reject H₀ is not the same as proving H₀; it may reflect low power, small effects, or high noise.
Statistical significance (small p) is not the same as practical importance; always consider effect size and uncertainty (e.g., confidence intervals).
Interpreting the p-value as P(H₀ is true) instead of P(data as-or-more-extreme | H₀).
Choosing one-sided vs two-sided after looking at the data, which silently changes the tail area you count as ‘extreme.’
Forgetting to split α into α/2 and α/2 for two-sided tests, leading to wrong critical values and wrong conclusions.
Saying “accept H₀” when p > α; the correct language is “fail to reject H₀,” because the test is not designed to confirm the null.
A machine fills bottles with target mean μ₀ = 500 ml. You sample n = 64 bottles and get x̄ = 497 ml. Assume σ = 16 ml is known. Test H₀: μ = 500 vs Hₐ: μ < 500 at α = 0.05. Compute z_obs and the p-value, and decide.
Hint: Use SE = σ/√n, then z = (x̄ − μ₀)/SE. Since Hₐ is left-tailed, p = P(Z ≤ z_obs).
SE = 16/√64 = 16/8 = 2.
z_obs = (497 − 500)/2 = −3/2 = −1.5.
Left-tailed p-value: p = P(Z ≤ −1.5) ≈ 0.0668.
Since 0.0668 > 0.05, fail to reject H₀ (not enough evidence at α=0.05 that the mean is below 500).
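A quick numerical check of this solution (standard-library sketch; left-tailed, so the p-value is the left tail):

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar, alpha = 500.0, 16.0, 64, 497.0, 0.05

se = sigma / sqrt(n)         # 2.0
z_obs = (xbar - mu0) / se    # -1.5
p = NormalDist().cdf(z_obs)  # left-tailed: P(Z <= z_obs)

print(round(z_obs, 2), round(p, 4))  # -1.5 0.0668
print("reject H0" if p <= alpha else "fail to reject H0")  # fail to reject H0
```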
A support team claims their on-time rate is p₀ = 0.95. In a week, they handle n = 200 tickets and 184 are on time (p̂ = 0.92). Test H₀: p = 0.95 vs Hₐ: p ≠ 0.95 at α = 0.05 using a normal approximation. Compute z_obs and decide.
Hint: Two-sided: p = 2·P(Z ≥ |z_obs|). Use SE under H₀: √(p₀(1−p₀)/n).
p̂ = 184/200 = 0.92.
SE = √(0.95·0.05/200) = √(0.0475/200) = √(0.0002375) ≈ 0.01541.
z_obs = (0.92 − 0.95)/0.01541 ≈ −0.03/0.01541 ≈ −1.947.
Two-sided p-value: p = 2·P(Z ≥ 1.947).
P(Z ≥ 1.947) ≈ 0.0257, so p ≈ 0.0514.
Since 0.0514 > 0.05, fail to reject H₀ (barely).
You compute a test statistic with observed value t_obs = 2.4. Under H₀, T ∼ N(0,1). (a) Find the right-tailed p-value. (b) Find the two-sided p-value. (c) For α = 0.05, state reject/fail-to-reject for each alternative.
Hint: Use standard normal tail probabilities. Two-sided p is twice the one-sided tail beyond |t_obs|.
(a) Right-tailed: p_right = P(Z ≥ 2.4) ≈ 0.0082.
(b) Two-sided: p_two = 2·P(Z ≥ 2.4) ≈ 2·0.0082 = 0.0164.
(c) At α = 0.05:
- Right-tailed alternative: p_right ≈ 0.0082 ≤ 0.05, so reject H₀.
- Two-sided alternative: p_two ≈ 0.0164 ≤ 0.05, so reject H₀.
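Checking both tail rules numerically (standard-library sketch):

```python
from statistics import NormalDist

t_obs, alpha = 2.4, 0.05
Z = NormalDist()

p_right = 1 - Z.cdf(t_obs)           # (a) right-tailed p-value
p_two = 2 * (1 - Z.cdf(abs(t_obs)))  # (b) two-sided p-value

print(round(p_right, 4))  # 0.0082
print(round(p_two, 4))    # 0.0164
print(p_right <= alpha, p_two <= alpha)  # True True (reject under both)
```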