P(A|B) = P(B|A)P(A)/P(B). Updating beliefs with evidence.
Self-serve tutorial - low prerequisites, straightforward concepts.
If you know how to compute P(A|B), Bayes’ Theorem teaches you how to “turn it around” into P(B|A)—and, more importantly, how to update beliefs about a hypothesis when new evidence arrives.
Bayes’ Theorem is
P(A|B) = P(B|A)P(A)/P(B).
Interpretation:
It is a rule for rational belief-updating under uncertainty.
In real problems you often face a direction mismatch: you can model P(B|A), the probability of the evidence given a hypothesis, but what you actually want is P(A|B), the probability of the hypothesis given the evidence.
Bayes’ Theorem bridges this gap. It’s also the mathematical backbone of “updating beliefs with evidence”: start with an initial belief (a prior), observe data (the evidence), and compute an updated belief (the posterior).
For events A and B, with P(B) > 0:
P(A|B) = P(B|A)P(A) / P(B)
You’ll see several names for each term: P(A) is the prior, P(B|A) the likelihood, P(B) the evidence, and P(A|B) the posterior.
Even if the formula feels simple, the meaning is subtle: probability is redistributed across hypotheses when evidence arrives.
Bayes’ Theorem is not a “special trick”—it is a direct consequence of the definition of conditional probability.
Start with the definition:
P(A|B) = P(A ∩ B) / P(B)
Also:
P(B|A) = P(A ∩ B) / P(A)
Solve the second equation for P(A ∩ B):
P(A ∩ B) = P(B|A)P(A)
Plug into the first equation:
P(A|B)
= P(A ∩ B) / P(B)
= [P(B|A)P(A)] / P(B)
That’s Bayes’ Theorem.
Posterior = (how well A predicts the evidence) × (how plausible A was) ÷ (how surprising the evidence is overall).
Often we treat A as the hypothesis and B as the observed evidence.
But the theorem is symmetric: you can swap roles depending on what you’re trying to compute.
Bayes is most useful when:
1) You can model P(B|A) (data given hypothesis) more easily than P(A|B).
2) You have multiple competing hypotheses (A₁, A₂, …) and want to update which is most plausible.
3) Base rates matter (priors), and ignoring them would lead to bad conclusions.
When you write P(A|B) directly, it can hide structure. Bayes explicitly factors the update into likelihood × prior ÷ evidence.
This decomposition is powerful because each piece comes from a different source: the likelihood from a model of how the data are generated, the prior from base rates or background knowledge, and the evidence from the Law of Total Probability.
Suppose A is a hypothesis and B is a newly observed fact.
Bayes:
P(A|B) ∝ P(B|A)P(A)
Read “∝” as “proportional to.” Since P(B) is the same for every hypothesis, it only rescales the results; the relative ranking of hypotheses is determined entirely by likelihood × prior. This proportionality form is often how you reason informally:
1) Hypotheses that better predict the evidence get boosted.
2) Hypotheses that poorly predict the evidence get penalized.
3) Priors still matter—rare things stay rare unless the evidence is very strong.
Imagine two hypotheses A and ¬A (not A). Using Bayes on both:
P(A|B) = P(B|A)P(A)/P(B)
P(¬A|B) = P(B|¬A)P(¬A)/P(B)
Take the ratio (posterior odds):
P(A|B) / P(¬A|B)
= [P(B|A)P(A)] / [P(B|¬A)P(¬A)]
This shows:
Posterior odds = Likelihood ratio × Prior odds
Where likelihood ratio = P(B|A) / P(B|¬A).
This is useful because the annoying P(B) cancels, and you can see the “strength of evidence” as a ratio.
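As a sketch (the function names here are my own), the odds form is easy to compute directly. The numbers match the rare-hypothesis example that follows: P(A) = 0.01, P(B|A) = 0.90, P(B|¬A) = 0.05.

```python
def posterior_odds(prior, lik_given_a, lik_given_not_a):
    """Posterior odds of A given B: likelihood ratio times prior odds."""
    prior_odds = prior / (1 - prior)
    likelihood_ratio = lik_given_a / lik_given_not_a
    return likelihood_ratio * prior_odds

def odds_to_probability(odds):
    """Convert odds o = p / (1 - p) back to the probability p."""
    return odds / (1 + odds)

# Rare hypothesis: P(A) = 0.01, P(B|A) = 0.90, P(B|not A) = 0.05
odds = posterior_odds(0.01, 0.90, 0.05)
print(round(odds_to_probability(odds), 4))  # 0.1538
```

Note the likelihood ratio here is 0.90 / 0.05 = 18, fairly strong evidence, yet the posterior stays modest because the prior odds are about 1:99.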
Each term has an interpretation you can check against the problem statement; the table below ties them together.
A common conceptual trap is to confuse P(B|A) with P(A|B). They can be wildly different.
Suppose A is rare (P(A) = 0.01), the evidence B is likely when A holds (P(B|A) = 0.90), and B still occurs occasionally otherwise (P(B|¬A) = 0.05).
We’ll compute P(A|B), but first note: we need P(B). That’s not optional.
Using total probability:
P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)
Compute:
P(B)
= 0.90·0.01 + 0.05·0.99
= 0.009 + 0.0495
= 0.0585
Now Bayes:
P(A|B)
= (0.90·0.01) / 0.0585
= 0.009 / 0.0585
≈ 0.1538
Even with strong evidence, a rare prior can keep the posterior moderate.
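The same calculation as a minimal Python sketch (the helper name `bayes_posterior` is mine):

```python
def bayes_posterior(prior, lik_true, lik_false):
    """P(A|B) via Bayes, with P(B) expanded by the law of total probability."""
    evidence = lik_true * prior + lik_false * (1 - prior)  # P(B)
    return lik_true * prior / evidence

p = bayes_posterior(prior=0.01, lik_true=0.90, lik_false=0.05)
print(round(p, 4))  # 0.1538
```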
| Name | Symbol | Meaning | Typical source |
|---|---|---|---|
| Prior | P(A) | Belief before seeing B | Base rate, historical data |
| Likelihood | P(B\|A) | Evidence frequency if A true | Measurement/test model |
| Evidence | P(B) | Overall chance of evidence | Computed via total probability |
| Posterior | P(A\|B) | Updated belief after B | Result of Bayes |
Bayes’ Theorem divides by P(B). This ensures that the posterior is a valid probability.
If you compute unnormalized weights:
w(A) = P(B|A)P(A)
then the normalized posterior is:
P(A|B) = w(A) / ∑ₖ w(Aₖ)
Where {Aₖ} are mutually exclusive, exhaustive hypotheses.
So P(B) is exactly:
P(B) = ∑ₖ P(B|Aₖ)P(Aₖ)
That is the Law of Total Probability.
If hypotheses are A and ¬A:
P(B)
= P(B|A)P(A) + P(B|¬A)P(¬A)
This formula is the workhorse behind medical-test and spam-filter calculations.
If you have n mutually exclusive, exhaustive hypotheses A₁, A₂, …, Aₙ, then:
P(B) = ∑ᵢ P(B|Aᵢ)P(Aᵢ)
And Bayes becomes:
P(Aᵢ|B) = P(B|Aᵢ)P(Aᵢ) / ∑ⱼ P(B|Aⱼ)P(Aⱼ)
The evidence term answers: “How often would we see B regardless of which hypothesis is true?”
This matches everyday reasoning: a surprising clue carries more information than a mundane one.
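The n-hypothesis update follows a weights-then-normalize pattern; here is a sketch (function name and example numbers are mine):

```python
def bayes_update(priors, likelihoods):
    """Posterior over hypotheses: w_i = P(B|A_i) * P(A_i), then normalize by P(B)."""
    weights = [lik * p for lik, p in zip(likelihoods, priors)]
    evidence = sum(weights)  # P(B) by the law of total probability
    return [w / evidence for w in weights]

# Illustrative numbers (assumed): three hypotheses, evidence favors A2 and A3
post = bayes_update([0.5, 0.3, 0.2], [0.1, 0.4, 0.4])
print([round(x, 2) for x in post])  # [0.2, 0.48, 0.32]
```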
When doing Bayes problems by hand:
1) Compute the numerator: P(B|A)P(A).
2) Compute P(B) using total probability.
3) Divide.
This prevents mistakes like forgetting the ¬A term or miscomputing complements.
Think of the prior probabilities across hypotheses as “mass” that sums to 1. Evidence neither creates nor destroys mass: updating shifts mass toward hypotheses that predicted the evidence well, away from those that did not, and renormalizes so the total is still 1.
This is the essence of Bayesian updating.
Before finalizing a Bayes calculation:
1) Check that P(B) includes a term for every hypothesis (including ¬A).
2) Check complements: P(¬A) = 1 − P(A), not 1 − P(B).
3) Check that the posteriors across all hypotheses sum to 1.
Bayes’ Theorem is the simplest formal model of learning from data: start with a prior, observe evidence, compute a posterior, and let that posterior serve as the prior for the next observation.
Even many modern ML systems can be described as “compute something proportional to likelihood × prior, then normalize.”
Medical test problems are the classic Bayes showcase because humans often ignore priors.
Key terms you’ll see: sensitivity (P(positive | disease)), specificity (P(negative | no disease)), and prevalence (the base rate, P(disease)).
What you usually want is:
P(disease | positive)
That is Bayes with A = disease, B = positive.
The important lesson: even a highly accurate test can yield many false positives when the disease is rare.
Suppose you can estimate P(spam), P(contains ‘free’ | spam), and P(contains ‘free’ | not spam) from a labeled corpus of emails.
Then Bayes says:
P(spam | contains ‘free’)
= P(contains ‘free’ | spam)P(spam) / P(contains ‘free’)
This is the skeleton of naïve Bayes classifiers (where B is many word-features). You will later learn more advanced versions, but the core update logic is identical.
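A minimal log-space sketch of that skeleton (the probabilities in `word_probs` are made-up values for illustration, not estimates from real data):

```python
import math

def spam_posterior(words, p_spam, word_probs):
    """Naive-Bayes-style update: P(spam | words) from likelihood * prior, normalized.

    word_probs maps word -> (P(word|spam), P(word|ham)); log space avoids underflow.
    """
    log_spam = math.log(p_spam)
    log_ham = math.log(1 - p_spam)
    for w in words:
        pw_spam, pw_ham = word_probs[w]
        log_spam += math.log(pw_spam)
        log_ham += math.log(pw_ham)
    m = max(log_spam, log_ham)  # subtract the max before exp for stability
    w_spam, w_ham = math.exp(log_spam - m), math.exp(log_ham - m)
    return w_spam / (w_spam + w_ham)

word_probs = {"free": (0.30, 0.02), "meeting": (0.01, 0.10)}  # assumed values
print(round(spam_posterior(["free"], 0.2, word_probs), 3))  # 0.789
```

Adding the innocuous word "meeting" pulls the posterior back down: `spam_posterior(["free", "meeting"], 0.2, word_probs)` ≈ 0.27.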
In tracking problems, the prior is your current belief about the object’s state, the likelihood is a sensor model, and each measurement produces a posterior that becomes the prior for the next time step.
Repeated Bayes updates over time lead to filters like the Kalman filter and particle filter (conceptually Bayesian, though implementation details differ).
Bayes’ Theorem for events is the entry point to Bayes for distributions.
Event version:
P(A|B) = P(B|A)P(A)/P(B)
Distribution version (preview):
p(θ|D) ∝ p(D|θ)p(θ)
Where θ is a (possibly continuous) parameter, D is the observed data, p(θ) is the prior density, and p(D|θ) is the likelihood.
This node unlocks that next step.
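As a preview of that distribution version, here is a grid-approximation sketch with assumed data (7 heads in 10 flips) and a uniform prior over a grid of θ values:

```python
# Approximate p(theta | D) on a grid: posterior ∝ likelihood * prior, then normalize.
n_heads, n_flips = 7, 10                       # assumed data: 7 heads in 10 flips
grid = [i / 100 for i in range(1, 100)]        # candidate theta values in (0, 1)
prior = [1 / len(grid)] * len(grid)            # uniform prior over the grid
lik = [t ** n_heads * (1 - t) ** (n_flips - n_heads) for t in grid]
weights = [l * p for l, p in zip(lik, prior)]  # unnormalized posterior
evidence = sum(weights)                        # plays the role of p(D)
posterior = [w / evidence for w in weights]
best = max(range(len(grid)), key=lambda i: posterior[i])
print(grid[best])  # 0.7 (posterior mass concentrates near the observed frequency)
```

The structure is identical to the event version: weight each candidate θ by likelihood × prior, then divide by the total so the posterior sums to 1.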
| Aspect | Frequentist (very rough) | Bayesian (very rough) |
|---|---|---|
| Probability means | Long-run frequency | Degree of belief (coherent with axioms) |
| Parameters | Fixed unknown constants | Random variables with priors |
| Output | Point estimates, confidence intervals | Posterior distributions, credible intervals |
You don’t need to “choose a side” to use Bayes’ Theorem; you just need to be clear about what probabilities represent in your problem.
A disease affects 1% of the population. A test has sensitivity 99% and specificity 95%.
Let A = “person has the disease” and B = “person tests positive”.
Given:
P(A) = 0.01
P(B|A) = 0.99
Specificity = P(negative|¬A) = 0.95 ⇒ P(positive|¬A) = P(B|¬A) = 0.05
Goal: compute P(A|B).
Compute the complement prior:
P(¬A) = 1 − P(A)
= 1 − 0.01
= 0.99
Compute evidence via total probability:
P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)
= 0.99·0.01 + 0.05·0.99
= 0.0099 + 0.0495
= 0.0594
Apply Bayes’ Theorem:
P(A|B) = P(B|A)P(A) / P(B)
= (0.99·0.01) / 0.0594
= 0.0099 / 0.0594
≈ 0.1667
Interpretation:
Even with a good test, the posterior is ≈ 16.7%, not ≈ 99%, because false positives among the many healthy people are substantial when the disease is rare.
Insight: The base rate (prior) can dominate. A positive result is evidence, but it’s not the same as near-certainty unless the test is extremely specific or the disease is common.
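A quick numeric check of this worked example (a throwaway script, not library code):

```python
prevalence, sensitivity, specificity = 0.01, 0.99, 0.95
false_positive_rate = 1 - specificity  # P(positive | no disease)
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
posterior = sensitivity * prevalence / p_positive
print(round(posterior, 4))  # 0.1667
```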
A factory has two machines making the same part.
Let:
A₁ = “part came from Machine 1”
A₂ = “part came from Machine 2”
B = “part is defective”
Given:
P(A₁)=0.70, P(A₂)=0.30
P(B|A₁)=0.02, P(B|A₂)=0.05
Goal: compute P(A₂|B) (probability it came from Machine 2 given defect).
Compute evidence (defect probability overall):
P(B) = P(B|A₁)P(A₁) + P(B|A₂)P(A₂)
= 0.02·0.70 + 0.05·0.30
= 0.014 + 0.015
= 0.029
Apply Bayes for Machine 2:
P(A₂|B) = P(B|A₂)P(A₂) / P(B)
= (0.05·0.30) / 0.029
= 0.015 / 0.029
≈ 0.5172
Optional: compute P(A₁|B) as a sanity check:
P(A₁|B) = (0.02·0.70)/0.029
= 0.014/0.029
≈ 0.4828
And indeed 0.5172 + 0.4828 = 1.
Insight: Even though Machine 2 produces fewer parts (lower prior), a defect strongly shifts probability toward it because its likelihood of defect is higher.
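Checking the machine example in code (the machine labels are mine):

```python
priors = {"M1": 0.70, "M2": 0.30}        # production shares
defect_rate = {"M1": 0.02, "M2": 0.05}   # P(defective | machine)
p_defect = sum(defect_rate[m] * priors[m] for m in priors)  # P(B) = 0.029
posterior = {m: defect_rate[m] * priors[m] / p_defect for m in priors}
print(round(posterior["M2"], 4))  # 0.5172
```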
You have three hypotheses about a coin:
A₁: fair (P(heads)=0.5)
A₂: biased toward heads (P(heads)=0.8)
A₃: biased toward tails (P(heads)=0.2)
Your prior beliefs are:
P(A₁)=0.6, P(A₂)=0.2, P(A₃)=0.2
You flip once and observe B = “heads”.
Goal: compute P(Aᵢ|heads) for i=1..3 using weights and normalization.
Compute unnormalized weights w(Aᵢ)=P(B|Aᵢ)P(Aᵢ):
w(A₁) = 0.5·0.6 = 0.30
w(A₂) = 0.8·0.2 = 0.16
w(A₃) = 0.2·0.2 = 0.04
Compute evidence as the sum of weights:
P(B) = ∑ᵢ w(Aᵢ)
= 0.30 + 0.16 + 0.04
= 0.50
Normalize to get posteriors:
P(A₁|heads)=0.30/0.50=0.60
P(A₂|heads)=0.16/0.50=0.32
P(A₃|heads)=0.04/0.50=0.08
Interpretation:
One heads result increases belief in the heads-biased coin and decreases belief in the tails-biased coin, while the fair coin remains most probable due to its strong prior.
Insight: Computing Bayes via “weights then normalize” generalizes cleanly to many hypotheses and avoids repeatedly recomputing P(B) from scratch.
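The same weights-then-normalize computation takes only a few lines of Python:

```python
priors = [0.6, 0.2, 0.2]    # fair, heads-biased, tails-biased
p_heads = [0.5, 0.8, 0.2]   # P(heads | each hypothesis)
weights = [lik * p for lik, p in zip(p_heads, priors)]  # 0.30, 0.16, 0.04
evidence = sum(weights)                                 # P(heads) = 0.50
posteriors = [w / evidence for w in weights]
print([round(p, 2) for p in posteriors])  # [0.6, 0.32, 0.08]
```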
Bayes’ Theorem is derived directly from conditional probability: P(A|B) = P(B|A)P(A)/P(B).
Interpret Bayes as belief updating: posterior ∝ likelihood × prior.
P(B) (evidence) is computed using the Law of Total Probability and acts as a normalizing constant.
P(B|A) and P(A|B) are not interchangeable; confusing them is a major source of errors.
Rare hypotheses (small priors) require strong evidence (large likelihood ratio) to become likely.
For multiple hypotheses A₁…Aₙ: P(Aᵢ|B)=P(B|Aᵢ)P(Aᵢ)/∑ⱼ P(B|Aⱼ)P(Aⱼ).
Posterior odds = likelihood ratio × prior odds is often the cleanest way to interpret “strength of evidence.”
Mixing up directions: treating P(B|A) as if it were P(A|B).
Forgetting to compute P(B) using all relevant hypotheses (e.g., omitting the ¬A term).
Using an incorrect complement: writing P(¬A)=1−P(B) or similar mismatches of events.
Interpreting likelihood as “probability the hypothesis is true”; likelihood is about evidence given the hypothesis.
A spam filter flags an email if it contains the word “winner”. Suppose:
P(spam)=0.2,
P(“winner”|spam)=0.6,
P(“winner”|not spam)=0.05.
Compute P(spam|“winner”).
Hint: Compute P(“winner”) = P(“winner”|spam)P(spam) + P(“winner”|¬spam)P(¬spam), then apply Bayes.
Let A=spam, B=contains “winner”.
P(A)=0.2, P(¬A)=0.8.
P(B)=0.6·0.2 + 0.05·0.8
=0.12 + 0.04
=0.16.
P(A|B)=P(B|A)P(A)/P(B)
=(0.6·0.2)/0.16
=0.12/0.16
=0.75.
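A two-line check of this answer:

```python
p_spam, p_word_spam, p_word_ham = 0.2, 0.6, 0.05
p_word = p_word_spam * p_spam + p_word_ham * (1 - p_spam)  # P("winner") = 0.16
print(round(p_word_spam * p_spam / p_word, 2))  # 0.75
```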
Two coins are in a box. Coin 1 is fair. Coin 2 lands heads with probability 0.9. You pick a coin uniformly at random and flip it once; it shows heads. What is P(you picked Coin 2 | heads)?
Hint: Use hypotheses A₁ (fair) and A₂ (biased). Priors are 0.5 and 0.5. Evidence is heads.
Let A₂=Coin 2 chosen, B=heads.
P(A₂)=0.5, P(A₁)=0.5.
P(B|A₂)=0.9, P(B|A₁)=0.5.
P(B)=0.9·0.5 + 0.5·0.5
=0.45 + 0.25
=0.70.
P(A₂|B)=0.9·0.5 / 0.70
=0.45/0.70
≈ 0.6429.
A test for a condition has sensitivity 0.97 and specificity 0.98. The condition prevalence is 0.4%. If a person tests positive, compute P(condition | positive). Then explain in one or two sentences why the result is not close to 97%.
Hint: Convert specificity to false positive rate: P(positive|¬condition)=1−0.98. Use P(condition)=0.004.
Let A=condition, B=positive.
P(A)=0.004, P(¬A)=0.996.
P(B|A)=0.97.
Specificity=0.98 ⇒ P(B|¬A)=0.02.
P(B)=0.97·0.004 + 0.02·0.996
=0.00388 + 0.01992
=0.02380.
P(A|B)=0.97·0.004 / 0.02380
=0.00388/0.02380
≈ 0.1630 (≈ 16.3%).
Explanation: although the test is sensitive, the condition is rare, so false positives among the many healthy people contribute heavily to positive results.
Next you’ll generalize this event-based rule to distributions and parameters in Bayesian Inference.
Related foundations:
Related applications (later nodes often build on Bayes):