Distinguishing correlation from causation. DAGs, do-calculus.
Quick unlock: significant prerequisite investment but a simple final step. Verify prerequisites first.
You see two variables move together: a medicine and recovery, education and income, rain and umbrella sales. Your brain wants a story—one causes the other. Causal inference is the discipline of turning that story into a testable, formal claim: “If I intervene and force X to be x, how would Y change?”
Causal inference distinguishes observing from doing. A causal DAG encodes assumptions about how variables generate one another. The causal effect is defined by an interventional distribution P(Y | do(X=x)). Identification asks whether P(Y | do(X)) can be computed from observational data P(·) given the DAG. Do-calculus provides sound transformation rules to rewrite interventional queries into observational ones when possible (often via backdoor, frontdoor, or more general adjustments).
In everyday data analysis, you often estimate something like P(Y | X=x): the distribution of outcomes among units where X happens to equal x. That is an observational quantity. It answers: “Among the units I observed with X=x, what does Y look like?”
A causal question is different. It asks: “If I were to force X to be x (possibly contrary to what it would naturally be), what would Y look like?” That is P(Y | do(X=x)). The do-operator is not a stylistic choice; it marks a different data-generating regime.
A classic trap: treating P(recovery | took drug) as if it were P(recovery | do(take drug)).
These are not equivalent because of confounding: some variable Z (e.g., disease severity) might influence both drug-taking and recovery. Then P(Y | X) mixes multiple mechanisms.
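This trap is easy to reproduce. The sketch below uses a hypothetical data-generating process (all numbers invented): a drug whose true effect on recovery is +0.2, confounded by severity Z, which both raises drug-taking and lowers recovery.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical structural model: severity Z confounds treatment X and recovery Y.
z = rng.binomial(1, 0.5, n)                          # Z=1: severe case
x_obs = rng.binomial(1, np.where(z == 1, 0.8, 0.2))  # severe patients take the drug more
y_obs = rng.binomial(1, 0.5 + 0.2 * x_obs - 0.4 * z) # drug helps (+0.2), severity hurts (-0.4)

# Observational contrast: mixes the drug effect with severity.
obs = y_obs[x_obs == 1].mean() - y_obs[x_obs == 0].mean()

# Interventional contrast: force X for everyone, leave Z alone (the do-operator).
y_do1 = rng.binomial(1, 0.5 + 0.2 - 0.4 * z)
y_do0 = rng.binomial(1, 0.5 - 0.4 * z)
do = y_do1.mean() - y_do0.mean()

print(f"observational diff: {obs:+.3f}")   # pulled negative by confounding
print(f"interventional diff: {do:+.3f}")   # close to the true +0.2
```

The observational contrast can even have the wrong sign: the drug looks harmful because the people who take it were sicker to begin with.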
Causal inference (in the Pearl/DAG framework) revolves around three atomic ideas:
1) Intervention (do-operator)
2) Causal directed acyclic graph (DAG)
3) Identification (often via do-calculus)
It’s tempting to read P(Y | X=x) and P(Y | do(X=x)) as “conditional probabilities with different notation.” But their meanings differ: conditioning filters the observed population down to units where X happens to equal x; the do-operator changes the data-generating process so that X is set to x for everyone.
A helpful mental model: conditioning is looking at a subpopulation; intervening is running a different experiment on the whole population.
A common semantics for DAGs is a set of structural assignments, e.g. for the graph X → Y:
X := fₓ(Uₓ)
Y := fᵧ(X, Uᵧ)
where Uₓ, Uᵧ are exogenous “noise” terms.
Intervening do(X=x) replaces the equation for X with:
X := x
Everything else stays the same. This formalizes “cutting incoming arrows into X.”
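A minimal simulation of this surgery, under an assumed toy SCM (all coefficients invented for illustration), implements do(X=x) by literally replacing X’s assignment while leaving every other equation untouched:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy SCM: Z := U_z; X := f_X(Z, U_x); Y := f_Y(X, Z, U_y).
def sample(n, do_x=None):
    u_z, u_x, u_y = rng.random(n), rng.random(n), rng.random(n)
    z = (u_z < 0.5).astype(int)
    if do_x is None:
        x = (u_x < 0.2 + 0.6 * z).astype(int)  # Z pushes X up
    else:
        x = np.full(n, do_x)                   # do(X=x): X's equation is replaced
    y = (u_y < 0.3 + 0.3 * x + 0.3 * z).astype(int)
    return z, x, y

# Only X's assignment changed; Z's and Y's mechanisms are identical in both regimes.
_, _, y1 = sample(100_000, do_x=1)
_, _, y0 = sample(100_000, do_x=0)
print(f"E[Y|do(X=1)] - E[Y|do(X=0)] = {y1.mean() - y0.mean():+.3f}")  # near the true +0.3
```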
Because a causal graph is acyclic, it admits a topological order. That’s not just an algorithmic convenience: it reflects a generative ordering—causes upstream, effects downstream. When we factor a joint distribution according to a DAG,
P(V₁, …, Vₙ) = ∏ᵢ P(Vᵢ | Pa(Vᵢ))
we are implicitly using that ordering. Under intervention, we alter exactly one (or more) of those factors.
Causal inference can answer “what if we change X?” questions from observational data, but only if you are willing to state your causal assumptions explicitly (typically as a DAG) and accept conclusions that are only as credible as those assumptions.
This is a feature, not a bug: it forces you to separate what the data say from what your causal assumptions add.
If you only use probability tables, every dependence can be “explained” by many stories. DAGs provide a language of mechanisms: they constrain which variables can directly influence which others. That constraint lets you reason about which associations are causal and which are spurious.
A DAG is a directed acyclic graph where:
1) nodes are variables,
2) a directed edge A → B asserts that A directly influences B (relative to the variables in the graph), and
3) no directed cycles are allowed.
The DAG does not automatically tell you the effect size; it tells you what adjustments are needed to estimate causal effects from observational data.
Most causal reasoning problems are combinations of three motifs:
Chain: X → M → Y
Fork: X ← Z → Y
Collider: X → C ← Y
This collider rule is one of the most counterintuitive pieces of causal inference and is responsible for many real-world errors (e.g., selection bias, Berkson’s paradox).
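Berkson’s paradox is easy to reproduce. In the assumed setup below, two independent rare diseases each suffice for admission to a study; conditioning on admission (the collider) manufactures a strong negative correlation between them:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Two marginally independent diseases; either can cause admission.
disease_a = rng.binomial(1, 0.1, n)
disease_b = rng.binomial(1, 0.1, n)
admitted = (disease_a | disease_b).astype(bool)   # collider: A → admitted ← B

corr_all = np.corrcoef(disease_a, disease_b)[0, 1]
corr_adm = np.corrcoef(disease_a[admitted], disease_b[admitted])[0, 1]
print(f"correlation overall:        {corr_all:+.3f}")  # near zero, as constructed
print(f"correlation among admitted: {corr_adm:+.3f}")  # strongly negative
```

Among the admitted, not having disease A makes disease B the likely explanation for admission, so the two become anti-correlated even though neither causes the other.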
A DAG implies conditional independence constraints via d-separation. The idea is to determine whether all paths between two sets of variables are blocked by a conditioning set.
A path is blocked (by a conditioning set S) if it contains:
1) a chain A → B → C or fork A ← B → C whose middle node B is in S, or
2) a collider A → B ← C such that neither B nor any descendant of B is in S.
While the full formal definition can be dense, the practical payoff is clear: if X and Y are d-separated given S in the DAG, then X ⊥ Y | S holds in every distribution compatible with that DAG.
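The per-segment blocking rules fit in a tiny helper. This is only a sketch; `segment_blocked` and the path encoding are invented for illustration (a full d-separation check would also enumerate all paths between the two node sets):

```python
# A sketch of the blocking rules for a single three-node segment of a path.
def segment_blocked(kind: str, mid_in_cond: bool, descendant_in_cond: bool = False) -> bool:
    """kind: 'chain' (A→B→C), 'fork' (A←B→C), or 'collider' (A→B←C)."""
    if kind in ("chain", "fork"):
        return mid_in_cond                                  # conditioning on B blocks
    if kind == "collider":
        return not (mid_in_cond or descendant_in_cond)      # conditioning on B (or below) opens
    raise ValueError(kind)

# A path is blocked iff at least one of its segments is blocked.
# Example: X ← Z → C ← Y with conditioning set S = {Z}.
path = [("fork", True, False), ("collider", False, False)]
print(any(segment_blocked(*seg) for seg in path))  # True: the path is blocked
```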
When estimating the causal effect of X on Y, you worry about paths that create association between X and Y that is not due to the directed causal influence X → … → Y.
A backdoor path from X to Y is any path that begins with an arrow into X.
Example: in the DAG Z → X, Z → Y, X → Y, the path X ← Z → Y is a backdoor path, because it starts with the arrow Z → X pointing into X.
Backdoor paths matter because they transmit non-causal association. If you can block all backdoor paths (without introducing new bias), then the remaining association corresponds to the causal effect.
A set of variables S satisfies the backdoor criterion relative to (X, Y) if:
1) No node in S is a descendant of X.
2) S blocks every backdoor path from X to Y.
If such an S exists, then the causal effect is identified via adjustment:
P(Y | do(X=x)) = ∑ₛ P(Y | X=x, S=s) P(S=s)
This looks like “standardize over S,” but the why is causal: conditioning on S blocks spurious paths, and averaging over S restores the overall population distribution.
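A quick check of the formula on simulated data (a hypothetical DAG Z → X, Z → Y, X → Y whose true effect is +0.2 by construction):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300_000

# Hypothetical data-generating process: Z confounds X and Y; true effect of X is +0.2.
z = rng.binomial(1, 0.4, n)
x = rng.binomial(1, 0.2 + 0.6 * z)
y = rng.binomial(1, 0.1 + 0.2 * x + 0.5 * z)

def adjusted(x_val):
    # Backdoor adjustment: sum_s P(Y=1 | X=x, S=s) P(S=s)
    return sum(y[(x == x_val) & (z == s)].mean() * (z == s).mean() for s in (0, 1))

naive = y[x == 1].mean() - y[x == 0].mean()
adj = adjusted(1) - adjusted(0)
print(f"naive:    {naive:+.3f}")   # inflated: treated units have high Z
print(f"adjusted: {adj:+.3f}")     # close to the true +0.200
```

Note the weighting: each stratum’s conditional mean is multiplied by the marginal P(Z=s), not by P(Z=s | X=x).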
It’s easy to say “we’ll control for variables.” But which variables you control for matters.
A compact comparison:
| Structure | Graph | What happens if you condition on the middle node? | Typical name |
|---|---|---|---|
| Fork | X ← Z → Y | Blocks spurious association (good) | Confounding |
| Collider | X → C ← Y | Creates spurious association (bad) | Selection bias |
| Chain | X → M → Y | Blocks mediated effect (depends on goal) | Mediation |
Real systems often include unobserved variables U (e.g., genetics, socioeconomic factors). If U causes both X and Y, you have an unblocked backdoor path X ← U → Y, but you can’t condition on U.
Then adjustment may fail even if you “control for everything you measured.” This is where identification becomes subtle: you may still identify effects via alternative structures (e.g., frontdoor), instruments, or additional assumptions.
You may see causal effects with covariate vectors Z. The adjustment formula generalizes:
P(Y | do(X=x)) = ∑_z P(Y | X=x, Z=z) P(Z=z)
(For continuous variables, replace ∑ with ∫.)
The important part is conceptual: you’re averaging the conditional outcome model over the marginal distribution of confounders, not the conditioned-on distribution within a particular X group.
A DAG plus observational data gives you P(V) and conditional distributions like P(Y | X, Z). Your causal target, however, is interventional: P(Y | do(X)).
Identification asks: can P(Y | do(X)) be rewritten purely in terms of the observational distribution P(V), given the assumptions encoded in the DAG?
This matters because randomized experiments directly approximate do(X) by design, but observational studies do not. Identification is the bridge.
If the joint distribution factorizes along the DAG as:
P(V) = ∏ᵢ P(Vᵢ | Pa(Vᵢ))
then intervening do(X=x) yields:
P(V | do(X=x)) = 𝟙[X=x] ∏_{Vᵢ ≠ X} P(Vᵢ | Pa(Vᵢ))
Equivalently, for the post-intervention distribution over non-intervened variables:
P(V \ {X} | do(X=x)) = ∏_{Vᵢ ≠ X} P(Vᵢ | Pa(Vᵢ)), with X fixed to x wherever it appears as a parent
This is the formal “cut incoming arrows into X” idea.
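The truncated factorization can be computed exactly from conditional probability tables. The numbers below are invented; the point is that dropping the factor P(X | Z) reweights Z by its marginal P(Z) rather than by P(Z | X):

```python
# Exact truncated factorization on a toy discrete DAG: Z → X, Z → Y, X → Y.
# All CPT numbers are made up for illustration.
P_z = {0: 0.6, 1: 0.4}
P_x1_given_z = {0: 0.2, 1: 0.8}                    # P(X=1 | Z=z)
P_y1_given_xz = {(0, 0): 0.2, (1, 0): 0.4,         # P(Y=1 | X=x, Z=z)
                 (0, 1): 0.5, (1, 1): 0.7}

def p_y1_do_x(x):
    # Drop the factor P(X | Z); keep P(Z) and P(Y | X, Z) with X clamped to x.
    return sum(P_y1_given_xz[(x, z)] * P_z[z] for z in (0, 1))

def p_y1_given_x(x):
    # Ordinary conditioning: weight Z by P(Z | X=x) instead of P(Z).
    joint = {z: (P_x1_given_z[z] if x == 1 else 1 - P_x1_given_z[z]) * P_z[z]
             for z in (0, 1)}
    norm = sum(joint.values())
    return sum(P_y1_given_xz[(x, z)] * joint[z] / norm for z in (0, 1))

print(f"P(Y=1 | do(X=1)) = {p_y1_do_x(1):.3f}")    # 0.4*0.6 + 0.7*0.4 = 0.520
print(f"P(Y=1 | X=1)     = {p_y1_given_x(1):.3f}") # larger: Z skews toward 1 given X=1
```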
Two major identification patterns appear so often they deserve names.
If S satisfies the backdoor criterion, then:
P(Y | do(X=x)) = ∑ₛ P(Y | X=x, S=s) P(S=s)
Frontdoor applies when:
1) M intercepts all directed paths from X to Y,
2) there is no unblocked backdoor path from X to M, and
3) every backdoor path from M to Y is blocked by X.
Then:
P(Y | do(X=x)) = ∑ₘ P(M=m | X=x) ∑_{x'} P(Y | M=m, X=x') P(X=x')
Intuition: the first factor P(M=m | X=x) captures the effect of X on M, which is unconfounded by assumption; the inner sum ∑_{x'} P(Y | M=m, X=x') P(X=x') is itself a backdoor adjustment for the effect of M on Y, using X to block the path through the hidden confounder.
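The frontdoor estimator can be verified by simulation. In the assumed SCM below (all coefficients invented), U is hidden, the true effect is 0.7 × 0.4 = +0.28, and the naive contrast is badly inflated:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000

# Assumed SCM with hidden confounder U: U → X, U → Y, X → M → Y, no direct X → Y.
u = rng.binomial(1, 0.5, n)
x = rng.binomial(1, 0.2 + 0.6 * u)
m = rng.binomial(1, 0.1 + 0.7 * x)            # X affects M; U does not
y = rng.binomial(1, 0.1 + 0.4 * m + 0.4 * u)  # M and U affect Y; X only through M

def frontdoor(x_val):
    est = 0.0
    for m_val in (0, 1):
        p_m = m[x == x_val].mean() if m_val == 1 else 1 - m[x == x_val].mean()
        # Inner sum: backdoor adjustment for M → Y, averaging over the marginal of X.
        inner = sum(y[(m == m_val) & (x == xp)].mean() * (x == xp).mean()
                    for xp in (0, 1))
        est += p_m * inner
    return est

naive = y[x == 1].mean() - y[x == 0].mean()
fd = frontdoor(1) - frontdoor(0)
print(f"naive:     {naive:+.3f}")  # inflated by the hidden U
print(f"frontdoor: {fd:+.3f}")     # near the true +0.28, despite never seeing U
```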
Backdoor/frontdoor are consequences of do-calculus. Do-calculus provides rules for transforming expressions like P(Y | do(X), Z) into forms where the do(·) can be removed.
You don’t need to memorize all details to use causal inference effectively, but you do need to understand what the rules do:
Let X, Y, Z, W be disjoint sets of nodes. Let G be the causal DAG.
We define modified graphs: G\bar{X} is G with all arrows into X removed, and G\underline{Z} is G with all arrows out of Z removed (combinations like G\bar{X}\underline{Z} apply both surgeries).
Then the do-calculus rules (informally) are:
1) Insertion/deletion of observations
If Y is d-separated from Z given X and W in G\bar{X}, then:
P(Y | do(X), Z, W) = P(Y | do(X), W)
2) Action/observation exchange
If Y is d-separated from Z given X and W in G\bar{X}\underline{Z}, then:
P(Y | do(X), do(Z), W) = P(Y | do(X), Z, W)
3) Insertion/deletion of actions
If Y is d-separated from Z given X and W in G\bar{X}\bar{Z(W)}, where Z(W) is the subset of Z-nodes that are not ancestors of any W-node in G\bar{X}, then:
P(Y | do(X), do(Z), W) = P(Y | do(X), W)
These statements can look intimidating because of the graph modifications, but the core idea is consistent: each rule licenses a syntactic rewrite exactly when a d-separation condition in a surgically modified graph guarantees the corresponding probabilistic invariance.
When handed a causal query, a DAG, and observational data, a practical workflow is:
1) State the target: e.g., P(Y | do(X=x)).
2) List candidate adjustment variables using backdoor (if possible).
3) If backdoor fails due to unobserved confounding, check frontdoor conditions.
4) If neither applies, use general do-calculus / algorithmic tools (e.g., ID algorithm) to test identification.
5) If not identifiable, redesign: collect more variables, use instruments, exploit experiments, or accept partial identification bounds.
Suppose S blocks all backdoor paths from X to Y and contains no descendants of X.
We want to show:
P(Y | do(X=x)) = ∑ₛ P(Y | X=x, S=s) P(S=s)
Sketch with “show your work” steps (conceptual algebra):
1) Start with law of total probability under intervention:
P(Y | do(X=x))
= ∑ₛ P(Y, S=s | do(X=x))
= ∑ₛ P(Y | S=s, do(X=x)) P(S=s | do(X=x))
2) Because S has no descendants of X and we only intervene on X, distribution of S is unchanged:
P(S=s | do(X=x)) = P(S=s)
3) Because S blocks all backdoor paths from X to Y, once we condition on S the only X–Y association left is the directed effect, so conditioning on X=x and intervening do(X=x) agree:
P(Y | S=s, do(X=x)) = P(Y | X=x, S=s)
4) Substitute back:
P(Y | do(X=x))
= ∑ₛ P(Y | X=x, S=s) P(S=s)
The DAG is doing the heavy lifting in steps (2) and (3). The formula is “statistics,” but the justification is “causality.”
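Both identities, steps (2) and (3), can be checked numerically in a toy SCM (an assumed structure Z → X, Z → Y, X → Y, with made-up coefficients) by simulating the observational and interventional regimes side by side:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400_000

# Assumed SCM: Z → X, Z → Y, X → Y.
def sample(do_x=None):
    z = rng.binomial(1, 0.4, n)
    x = rng.binomial(1, 0.2 + 0.6 * z) if do_x is None else np.full(n, do_x)
    y = rng.binomial(1, 0.1 + 0.2 * x + 0.5 * z)
    return z, x, y

z_obs, x_obs, y_obs = sample()
z_do, _, y_do = sample(do_x=1)

# Step (2): intervening on X leaves the distribution of the non-descendant Z alone.
print(f"P(Z=1)         = {z_obs.mean():.3f}")
print(f"P(Z=1|do(X=1)) = {z_do.mean():.3f}")   # the same, up to sampling noise

# Step (3): within a Z stratum, conditioning on X=1 matches intervening do(X=1).
checks = {}
for s in (0, 1):
    p_cond = y_obs[(x_obs == 1) & (z_obs == s)].mean()
    p_do = y_do[z_do == s].mean()
    checks[s] = (p_cond, p_do)
    print(f"Z={s}: P(Y=1|X=1,Z={s})={p_cond:.3f}  vs  P(Y=1|Z={s},do(X=1))={p_do:.3f}")
```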
Identification is symbolic: it tells you what expression equals the causal effect in the population.
Estimation is numerical: given finite data, how do you estimate that expression?
Even if a causal effect is identifiable, estimation can still be hard due to:
This lesson focuses on identification logic, but keep the distinction clear: do-calculus answers “can we,” not “how well with this dataset.”
Modern ML and data science frequently optimize predictive accuracy: estimate P(Y | X). But decision-making needs causal quantities: what happens to Y if we change X?
Examples:
Pricing: demand among items currently priced at p vs demand if you set the price to p.
Recommendations: engagement among users who happened to see an item vs engagement if you force the item to be shown.
Medicine: recovery among the treated vs recovery if you treat everyone.
In each case, you are implicitly asking about P(Y | do(X)).
A strong habit: translate English into an estimand.
Once you have the estimand, you can ask: is it identifiable from available data and assumptions?
| Situation | Typical DAG symptom | Identification move | Notes |
|---|---|---|---|
| Measured confounding | X ← Z → Y with observed Z | Backdoor adjustment | Most common, but don’t condition on colliders |
| Hidden confounding but observed mediator | X → M → Y and X ↔ Y confounded, but M measured | Frontdoor adjustment | Requires strong structural assumptions |
| Randomized experiment | do(X) approximated by design | No adjustment needed (in principle) | Still adjust for precision / imbalance |
| Selection bias | Conditioning on S where X → S ← Y | Avoid / model selection | Often sneaks in via “only users who…” |
| Unidentifiable effect | No valid adjustment; hidden confounding; complex feedback avoided by DAG | Collect more variables, use instruments, partial ID | Honest “can’t answer” is a valid result |
You already know Bayesian inference. Once identification gives you an estimand like:
P(Y | do(X=x)) = ∑ₛ P(Y | X=x, S=s) P(S=s)
you can estimate the components using Bayesian models: place models (and priors) on P(Y | X, S) and on P(S), compute their posteriors from data, and push posterior draws through the adjustment formula to obtain a posterior distribution over the causal effect itself.
Causal inference tells you what to estimate; Bayesian inference tells you how to estimate with uncertainty.
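A minimal sketch of this workflow, assuming made-up stratified counts and independent Beta(1, 1) priors on each conditional recovery probability (a full analysis would also model P(S) rather than plug in stratum frequencies):

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical counts per severity stratum s:
#   n1/y1 = treated count / treated recoveries, n0/y0 = untreated, n_s = stratum size.
counts = {0: dict(n1=100, y1=80, n0=400, y0=240, n_s=500),
          1: dict(n1=400, y1=160, n0=100, y0=20, n_s=500)}
n_total = sum(c["n_s"] for c in counts.values())

draws = 10_000
ate = np.zeros(draws)
for s, c in counts.items():
    # Beta(1,1) prior + binomial likelihood gives a Beta posterior per (x, s) cell.
    p1 = rng.beta(1 + c["y1"], 1 + c["n1"] - c["y1"], draws)
    p0 = rng.beta(1 + c["y0"], 1 + c["n0"] - c["y0"], draws)
    # Backdoor adjustment applied to each posterior draw, weighted by P(S=s).
    ate += (p1 - p0) * (c["n_s"] / n_total)

print(f"posterior mean ATE: {ate.mean():+.3f}")
print(f"95% interval: ({np.quantile(ate, 0.025):+.3f}, {np.quantile(ate, 0.975):+.3f})")
```

The estimand comes from identification; the interval comes from Bayes. Neither can substitute for the other.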
Some common ML pitfalls become clearer with DAGs:
1) Target leakage
If a feature is a descendant of the label (Y → Feature), the model will “predict” using consequences of Y. That’s not causally meaningful.
2) Controlling for post-treatment variables
If you adjust for a mediator M (X → M → Y) while trying to estimate the total effect of X, you will generally underestimate it.
3) Selection on engagement
Analyzing “active users only” can create collider bias: Feature change (X) affects engagement (C), and user satisfaction (Y) affects engagement; conditioning on active users selects on C.
When you face a causal question in practice:
1) Specify variables: treatment X, outcome Y, potential confounders Z, mediators M, selection variables S.
2) Draw a DAG: even if imperfect, it makes assumptions explicit.
3) Decide estimand: ATE, conditional effect, policy value, etc.
4) Identify using backdoor/frontdoor/do-calculus.
5) Assess assumptions: are all relevant confounders plausibly measured? Are any conditioned variables colliders or descendants of treatment? Is the DAG scientifically defensible?
6) Estimate with appropriate methods (regression adjustment, matching, IPW, doubly robust, Bayesian models).
7) Sensitivity analysis for unmeasured confounding.
The conceptual leap is steps (2)–(4): that is what causal inference adds beyond standard statistics.
Variables: X = treatment (0/1), Y = recovery (0/1), Z = severity (0/1). DAG: Z → X, Z → Y, and X → Y. Goal: identify P(Y | do(X=1)) and the ATE.
Assume Z is observed.
1) Identify backdoor paths from X to Y. The only backdoor path is X ← Z → Y.
2) Choose an adjustment set S. Try S = {Z}:
(i) Z is not a descendant of X (true).
(ii) Conditioning on Z blocks the path X ← Z → Y (true).
So Z is a valid backdoor adjustment set.
3) Write the adjustment formula.
P(Y | do(X=1)) = ∑_z P(Y | X=1, Z=z) P(Z=z)
Similarly,
P(Y | do(X=0)) = ∑_z P(Y | X=0, Z=z) P(Z=z)
4) Convert to an ATE expression (difference in expectations).
Let Y be binary, then E[Y | do(X=x)] = P(Y=1 | do(X=x)).
ATE = E[Y | do(X=1)] − E[Y | do(X=0)]
= ∑_z [P(Y=1 | X=1, Z=z) − P(Y=1 | X=0, Z=z)] P(Z=z).
5) Interpret.
Within each severity stratum Z=z, compare treated vs untreated (a like-for-like comparison), then average those differences over how common each severity level is in the population.
Insight: The key move is not “control for Z because it predicts Y,” but “control for Z because it opens/closes causal paths.” Z is required because it is a common cause of X and Y. The DAG explains why the standardization ∑_z (…) P(Z=z) matches an intervention.
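Concrete (made-up) numbers make the point vivid: the drug can help by +0.2 in every stratum while the naive comparison flips sign, because severe patients are treated far more often:

```python
# Exact arithmetic for the severity example; all probabilities invented.
# Strata: z=1 severe, z=0 mild. The drug helps by +0.2 within each stratum.
P_z = {0: 0.5, 1: 0.5}
P_x1_given_z = {0: 0.1, 1: 0.9}          # severe patients are treated far more often
P_y1 = {(1, 0): 0.9, (0, 0): 0.7,        # P(Y=1 | X=x, Z=z)
        (1, 1): 0.3, (0, 1): 0.1}

# Backdoor adjustment: average stratum effects over the marginal P(Z).
ate = sum((P_y1[(1, z)] - P_y1[(0, z)]) * P_z[z] for z in (0, 1))

# Naive contrast: weight strata by P(Z | X=x) instead.
def naive_mean(x):
    w = {z: (P_x1_given_z[z] if x == 1 else 1 - P_x1_given_z[z]) * P_z[z]
         for z in (0, 1)}
    norm = sum(w.values())
    return sum(P_y1[(x, z)] * w[z] / norm for z in (0, 1))

naive = naive_mean(1) - naive_mean(0)
print(f"adjusted ATE: {ate:+.3f}")   # +0.200: the drug helps in every stratum
print(f"naive diff:   {naive:+.3f}") # -0.280: Simpson-style sign flip
```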
Variables: X = skill (continuous), Y = friendliness (continuous), C = hired (0/1). DAG: X → C ← Y. You only observe people who are hired (C=1) and then you compute the correlation between X and Y.
Question: Can conditioning on C create an association between X and Y even if they are independent marginally?
1) Read the DAG. X and Y share no common cause and no direct edge; both influence C.
2) Identify the path between X and Y. The only path is X → C ← Y.
3) Apply collider rule. C is a collider on this path, so the path is blocked as long as you do not condition on C (or its descendants).
So X and Y can remain independent in the overall population.
4) Condition on C=1 (selection).
Getting hired requires some combination of skill and friendliness, so among the hired, low skill must be “explained away” by high friendliness (and vice versa). This induces a negative association between X and Y within the selected set.
5) Consequence.
If you regress Y on X using only hired people, you may conclude X ‘causes’ Y or at least that they are strongly related, but the association is an artifact of conditioning on C.
Insight: “Control for everything you can” is not safe. Conditioning is an operation on distributions, not a free improvement. DAGs tell you when conditioning opens paths (colliders) and thereby manufactures correlations that were not present before.
Variables: X = smoking (0/1), M = tar exposure (continuous), Y = lung disease (0/1), U = genetic risk (unobserved). DAG: U → X and U → Y (hidden confounding), X → M → Y, and no direct arrow X → Y.
Assume: (i) all causal effect of X on Y goes through M, (ii) no unobserved confounding between X and M, (iii) no unobserved confounding between M and Y given X.
Goal: identify P(Y | do(X=x)) from observational data.
1) Recognize why backdoor fails. The backdoor path X ← U → Y can only be blocked by conditioning on U, and U is unobserved.
Thus no standard backdoor adjustment is available (in this simplified setup).
2) Check frontdoor conditions. M intercepts all directed paths from X to Y (assumption i); the X → M relationship is unconfounded (assumption ii); and X blocks the backdoor path M ← X ← U → Y from M to Y (assumption iii).
3) Write the frontdoor formula.
P(Y | do(X=x)) = ∑_m P(M=m | X=x) ∑_{x'} P(Y | M=m, X=x') P(X=x')
4) Explain the two-stage averaging. The first factor P(M=m | X=x) is the effect of X on M, identified directly because X → M is unconfounded. The inner sum ∑_{x'} P(Y | M=m, X=x') P(X=x') is a backdoor adjustment for the effect of M on Y, using X to block the path through U.
5) Result.
The causal effect of X on Y is identified despite unobserved U, because the mediator M provides a measurable pathway that “transmits” the effect and can be isolated with the frontdoor logic.
Insight: Frontdoor is a powerful reminder that hidden confounding does not automatically make causal inference impossible. But it replaces “measure confounders” with stronger structural assumptions about mediation and the absence of certain confounders—assumptions you must defend scientifically.
Causal questions are about interventions: P(Y | do(X=x)) is fundamentally different from P(Y | X=x).
A causal DAG encodes mechanistic assumptions; it is not merely a visualization. It determines which statistical adjustments are valid.
Confounding arises from forks (X ← Z → Y); selection bias arises from conditioning on colliders (X → C ← Y).
Backdoor adjustment identifies causal effects when you can block all backdoor paths with a set S that contains no descendants of X.
Frontdoor adjustment can identify effects even with unobserved confounding, but requires strong mediator-based assumptions.
Do-calculus generalizes backdoor/frontdoor: it provides rules for removing do(·) operators when graph-separation conditions hold in modified graphs.
Identification (symbolic equality in the population) is separate from estimation (numerical computation from finite data).
Good causal practice: write the estimand, draw the DAG, check paths, then pick an identification strategy—only then choose an estimator.
Treating P(Y | X=x) as if it were P(Y | do(X=x)) without checking for backdoor paths (confounding).
“Controlling for everything,” especially conditioning on colliders or post-treatment variables, which can introduce bias.
Assuming a DAG is validated because it ‘looks reasonable’; many different DAGs can fit the same observational distribution.
Confusing identification with estimation: a correctly identified estimand can still be poorly estimated due to lack of overlap, high-dimensional covariates, or measurement error.
Backdoor practice: Consider the DAG Z → X → Y and Z → Y. (a) List all backdoor paths from X to Y. (b) Does S={Z} satisfy the backdoor criterion? (c) Write P(Y | do(X=x)) in terms of observational quantities.
Hint: A backdoor path must start with an arrow into X. In this graph, check X ← Z → Y.
(a) The backdoor path is X ← Z → Y.
(b) Yes. Z is not a descendant of X, and conditioning on Z blocks X ← Z → Y.
(c) P(Y | do(X=x)) = ∑_z P(Y | X=x, Z=z) P(Z=z).
Collider reasoning: Suppose X → C ← Y and additionally C → D (D is a descendant of the collider). Are X and Y independent given D? Explain using collider logic.
Hint: Conditioning on a collider or any of its descendants opens the path through the collider.
In X → C ← Y, C is a collider, so the path between X and Y is blocked marginally. But D is a descendant of C. Conditioning on D provides information about C, which effectively conditions on (or partially conditions on) the collider. This opens the path X → C ← Y, inducing an association between X and Y given D. So X and Y are generally not independent given D.
Frontdoor check: You observe X, M, Y with DAG X → M → Y, and there is an unobserved U such that U → X and U → Y. Additionally, suppose there is also an unobserved W such that W → M and W → Y. Is the causal effect of X on Y identifiable by frontdoor? Why or why not?
Hint: Frontdoor requires no unblocked confounding between M and Y given X. Hidden common causes of M and Y break that.
No, not by the standard frontdoor criterion. The unobserved W creates confounding between M and Y via M ← W → Y. Even conditioning on X does not block this backdoor path because W is not a descendant of X and is unobserved. Therefore the relationship between M and Y cannot be learned unbiasedly from P(Y | M, X), and the frontdoor formula is not justified.
Prerequisites you’re using here:
Next nodes you’ll likely want: