Causal Inference

Probability & Statistics · Difficulty: 5/5 · Depth: 8

Distinguishing correlation from causation. DAGs, do-calculus.


Core Concepts

  • Intervention (do-operator): causal effect defined as the distribution of outcomes when a variable is externally set, i.e., P(Y | do(X=x))
  • Causal directed acyclic graph (DAG): a graph whose directed edges encode direct causal influence and whose structure encodes conditional independences among variables
  • Identification (do-calculus): formal rules that determine when and how an interventional distribution P(Y | do(X)) can be expressed using only observational distributions given a DAG

Key Symbols & Notation

do(X = x) (the do-operator denoting intervention)

Essential Relationships

  • Intervention vs observation: P(Y | do(X=x)) is generally not equal to P(Y | X=x); the DAG reveals confounding (backdoor) paths, and do-calculus specifies when observational data suffice to recover the interventional distribution
All Concepts (26)

  • Causation vs correlation: the idea that statistical association (correlation) does not by itself imply a directional cause–effect relationship
  • Causal Directed Acyclic Graph (causal DAG): a DAG where edges represent direct causal influences (not merely statistical dependence)
  • Structural Causal Model (SCM): a set of structural (functional) equations tying each variable to its parents and an independent noise term
  • Structural equation (mechanism) for a variable: X := f(Pa(X), U_X) (the variable is generated by a function of its parents and an exogenous noise term)
  • Exogenous (noise) variables: unobserved independent variables U_X that model omitted randomness and allow deterministic structural functions
  • Intervention (surgical intervention): an operation that replaces the structural equation of a variable (or set) with a fixed value or new mechanism
  • do-operator: formal notation for an intervention, do(X=x), distinct from observational conditioning
  • Interventional / causal distribution: P(Y | do(X=x)), the distribution of Y after intervening to set X to x
  • Identifiability of causal effect: whether P(Y | do(X)) can be expressed uniquely in terms of the available (observational) distribution P(V)
  • Backdoor path: any non-causal path between X and Y that begins with an arrow into X and can create confounding
  • Confounder: a variable (or set) that creates a spurious association between treatment and outcome by opening backdoor paths
  • Backdoor criterion: a graphical condition specifying when a set Z suffices to adjust for confounding so that P(Y|do(X)) can be computed from P
  • Adjustment set (sufficient set): a set Z satisfying the backdoor criterion enabling the adjustment formula
  • Adjustment formula (backdoor adjustment): expressing P(Y | do(X)) as a sum/integral over an adjustment set: P(Y|do(X=x)) = sum_z P(Y | X=x, Z=z) P(Z=z) (when Z satisfies backdoor)
  • Front-door criterion: a graphical condition using a mediator M that, under certain conditions, permits identification of P(Y|do(X)) even with unmeasured confounding
  • Front-door adjustment formula: the specific formula that identifies P(Y|do(X)) via a mediating variable meeting front-door conditions
  • Collider: a node on a path with two arrowheads into it (A -> C <- B); conditioning on colliders can induce spurious associations (collider bias)
  • d-separation: a graphical criterion that determines whether a conditional independence (X ⟂ Y | Z) holds in all distributions compatible with the DAG under the causal Markov condition
  • Causal Markov condition: given the causal DAG, each variable is independent of its non-descendants conditional on its parents
  • Faithfulness (stability): the assumption that all and only the conditional independencies entailed by the causal graph appear in the probability distribution
  • Do-calculus: a set of graph-based inference rules (three rules) that transform expressions with do-operators into observational expressions when justified by d-separation
  • Graph surgery viewpoint of intervention: thinking of interventions as removing incoming edges to intervened nodes and replacing the generating mechanism
  • Counterfactuals / potential outcomes: variables of the form Y_x (or Y(x)) representing the value Y would take under a hypothetical intervention do(X=x)
  • Natural (direct and indirect) effects and mediation decomposition: decomposing total causal effect into parts attributable to specific causal paths (direct vs mediated)
  • Instrumental variable (IV): a variable Z that affects X and is independent of unobserved confounders and affects Y only through X, enabling identification under certain conditions
  • Path-specific effect and path-based identification: attributing causal influence to particular directed paths in a DAG and conditions to identify those effects

Teaching Strategy

Quick unlock: significant prerequisite investment, but a simple final step. Verify prerequisites first.

You see two variables move together: a medicine and recovery, education and income, rain and umbrella sales. Your brain wants a story—one causes the other. Causal inference is the discipline of turning that story into a testable, formal claim: “If I intervene and force X to be x, how would Y change?”

TL;DR:

Causal inference distinguishes observing from doing. A causal DAG encodes assumptions about how variables generate one another. The causal effect is defined by an interventional distribution P(Y | do(X=x)). Identification asks whether P(Y | do(X)) can be computed from observational data P(·) given the DAG. Do-calculus provides sound transformation rules to rewrite interventional queries into observational ones when possible (often via backdoor, frontdoor, or more general adjustments).

What Is Causal Inference?

Why you can’t get causation “for free” from correlation

In everyday data analysis, you often estimate something like P(Y | X=x): the distribution of outcomes among units where X happens to equal x. That is an observational quantity. It answers: “Among the units I observed with X=x, what does Y look like?”

A causal question is different. It asks: “If I were to force X to be x (possibly contrary to what it would naturally be), what would Y look like?” That is P(Y | do(X=x)). The do-operator is not a stylistic choice; it marks a different data-generating regime.

A classic trap:

  • Observational: People who take a certain drug recover more often.
  • Causal claim: Taking the drug causes recovery.

These are not equivalent because of confounding: some variable Z (e.g., disease severity) might influence both drug-taking and recovery. Then P(Y | X) mixes multiple mechanisms.

The core objects: interventions, DAGs, and identification

Causal inference (in the Pearl/DAG framework) revolves around three atomic ideas:

1) Intervention (do-operator)

  • P(Y | do(X=x)) is defined as the distribution of Y when an external action sets X to x.
  • This “cuts” the normal causes of X; X is no longer generated by its usual parents.

2) Causal directed acyclic graph (DAG)

  • Nodes are variables; directed edges encode direct causal influence.
  • The graph implies conditional independences and supports reasoning about which paths create spurious associations.

3) Identification (often via do-calculus)

  • A causal effect is identified if it can be expressed purely in terms of observational distributions (things you can estimate from passive data) given the DAG.
  • Do-calculus provides formal rules to transform expressions with do(·) into expressions without do(·) when warranted.

Observing vs doing: same symbols, different worlds

It’s tempting to read P(Y | X=x) and P(Y | do(X=x)) as “conditional probabilities with different notation.” But their meanings differ:

  • P(Y | X=x): you condition on an event in the observed world.
  • P(Y | do(X=x)): you modify the data-generating process and then ask about Y.

A helpful mental model:

  • Conditioning is like filtering your dataset to rows where X=x.
  • Intervention is like editing reality so that everyone’s X is set to x, then sampling outcomes.
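The filtering-vs-editing contrast can be made concrete with exact arithmetic on a tiny made-up model (all CPT numbers below are illustrative assumptions, not from the text):

```python
from itertools import product

# Hypothetical discrete model: Z (confounder) -> X (treatment), Z -> Y, X -> Y.
p_z = {0: 0.5, 1: 0.5}                      # P(Z)
p_x_given_z = {0: 0.2, 1: 0.8}              # P(X=1 | Z=z)
p_y_given_xz = {(0, 0): 0.1, (1, 0): 0.3,   # P(Y=1 | X=x, Z=z)
                (0, 1): 0.6, (1, 1): 0.8}

def p_joint(x, y, z):
    """Observational joint P(X=x, Y=y, Z=z) from the DAG factorization."""
    px = p_x_given_z[z] if x == 1 else 1 - p_x_given_z[z]
    py = p_y_given_xz[(x, z)] if y == 1 else 1 - p_y_given_xz[(x, z)]
    return p_z[z] * px * py

# "Filtering": P(Y=1 | X=1) -- condition on the part of the world where X=1.
num = sum(p_joint(1, 1, z) for z in (0, 1))
den = sum(p_joint(1, y, z) for y, z in product((0, 1), repeat=2))
p_obs = num / den

# "Editing reality": P(Y=1 | do(X=1)) -- X is set to 1 for everyone,
# so Z keeps its population (marginal) distribution.
p_do = sum(p_y_given_xz[(1, z)] * p_z[z] for z in (0, 1))

print(round(p_obs, 3), round(p_do, 3))  # 0.7 vs 0.55: observing != doing
```

Filtering to X=1 over-represents high-Z units (giving 0.70), while intervening keeps Z at its population distribution (giving 0.55).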

Structural causal model intuition (without getting lost)

A common semantics for DAGs is a set of structural assignments:

  • X := fₓ(Parents(X), Uₓ)
  • Y := fᵧ(Parents(Y), Uᵧ)

where Uₓ, Uᵧ are exogenous “noise” terms.

Intervening do(X=x) replaces the equation for X with:

  • X := x

Everything else stays the same. This formalizes “cutting incoming arrows into X.”
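A minimal sketch of the surgery idea in code, assuming a toy three-variable SCM with made-up mechanisms: passing `do_x` replaces the structural equation for X with a constant, leaving every other equation untouched.

```python
import numpy as np

def sample_scm(n, do_x=None, seed=0):
    """Sample a toy SCM Z -> X -> Y, Z -> Y (all parameters are assumptions).
    do_x implements graph surgery: the mechanism for X becomes a constant."""
    rng = np.random.default_rng(seed)
    u_z, u_x, u_y = rng.random((3, n))          # exogenous noise terms
    z = (u_z < 0.5).astype(int)                 # Z := f_Z(U_Z)
    if do_x is None:
        x = (u_x < 0.2 + 0.6 * z).astype(int)   # X := f_X(Z, U_X)
    else:
        x = np.full(n, do_x)                    # do(X=x): cut arrows into X
    y = (u_y < 0.1 + 0.2 * x + 0.5 * z).astype(int)  # Y := f_Y(X, Z, U_Y)
    return x, y, z

x, y, _ = sample_scm(200_000)
obs = y[x == 1].mean()                # estimates P(Y=1 | X=1)
_, y_do, _ = sample_scm(200_000, do_x=1)
interv = y_do.mean()                  # estimates P(Y=1 | do(X=1))
print(round(obs, 2), round(interv, 2))  # roughly 0.70 vs 0.55
```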

Why topological order matters (a prerequisite connection)

Because a causal graph is acyclic, it admits a topological order. That’s not just an algorithmic convenience: it reflects a generative ordering—causes upstream, effects downstream. When we factor a joint distribution according to a DAG,

P(V₁, …, Vₙ) = ∏ᵢ P(Vᵢ | Pa(Vᵢ))

we are implicitly using that ordering. Under intervention, we alter exactly one (or more) of those factors.

What causal inference promises—and what it requires

Causal inference can answer “what if we change X?” questions from observational data, but only if you are willing to:

  • encode assumptions (via a DAG or equivalent),
  • check testable implications where possible,
  • accept that some causal questions are not identifiable without additional data or assumptions.

This is a feature, not a bug: it forces you to separate what the data say from what your causal assumptions add.

Core Mechanic 1: Causal DAGs, Paths, and the Difference Between Confounding and Selection

Why DAGs are the right abstraction

If you only use probability tables, every dependence can be “explained” by many stories. DAGs provide a language of mechanisms: they constrain which variables can directly influence which others. That constraint lets you reason about which associations are causal and which are spurious.

A DAG is a directed acyclic graph where:

  • An arrow X → Y means X is a direct cause of Y (in your model).
  • Absence of an arrow means “no direct causal influence” (again, in your model).

The DAG does not automatically tell you the effect size; it tells you what adjustments are needed to estimate causal effects from observational data.

Three fundamental causal motifs

Most causal reasoning problems are combinations of three motifs:

1) Chain (mediation)

X → M → Y

  • X affects Y through mediator M.
  • If you condition on M, you may block part (or all) of X’s effect on Y.

2) Fork (confounding)

X ← Z → Y

  • Z is a common cause of X and Y.
  • Without adjusting for Z, X and Y are associated even if X has no causal effect on Y.

3) Collider (selection)

X → C ← Y

  • C is a common effect of X and Y.
  • Unconditioned, X and Y can be independent.
  • Conditioning on C (or a descendant of C) can create a spurious association between X and Y.

This collider rule is one of the most counterintuitive pieces of causal inference and is responsible for many real-world errors (e.g., selection bias, Berkson’s paradox).
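The collider rule can be verified by exact enumeration on a hypothetical model (X and Y are independent coin flips, and C is their logical OR):

```python
from itertools import product

# Toy collider X -> C <- Y; all exact arithmetic, no sampling.
p = {}
for x, y in product((0, 1), repeat=2):
    c = int(x or y)                 # the collider: a common effect of X and Y
    p[(x, y, c)] = 0.25             # X ~ Bern(0.5), Y ~ Bern(0.5), independent

def prob(pred):
    return sum(v for k, v in p.items() if pred(*k))

# Marginally, X and Y are independent: the collider blocks the only path.
assert abs(prob(lambda x, y, c: x == 1 and y == 1)
           - prob(lambda x, y, c: x == 1) * prob(lambda x, y, c: y == 1)) < 1e-12

# Conditioning on C = 1 opens the path: X and Y become dependent.
pc1 = prob(lambda x, y, c: c == 1)                                   # 0.75
p_xy = prob(lambda x, y, c: x == 1 and y == 1 and c == 1) / pc1      # 1/3
p_x = prob(lambda x, y, c: x == 1 and c == 1) / pc1                  # 2/3
p_y = prob(lambda x, y, c: y == 1 and c == 1) / pc1                  # 2/3
print(round(p_xy, 3), round(p_x * p_y, 3))  # 0.333 != 0.444
```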

d-separation: reading independences from a DAG

A DAG implies conditional independence constraints via d-separation. The idea is to determine whether all paths between two sets of variables are blocked by a conditioning set.

A path is blocked if it contains:

  • a non-collider that is conditioned on, or
  • a collider that is not conditioned on and has no conditioned descendants.

While the full formal definition can be dense, the practical payoff is clear:

  • d-separation tells you which conditional independences must hold if your DAG is correct.
  • those independences can sometimes be tested against data (though not always, especially when latent variables are present).
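A compact sketch of the criterion itself, implementing active-trail reachability (a Bayes-ball-style procedure); the `parents`-dict graph encoding and variable names are illustrative choices:

```python
def d_separated(parents, x, y, z):
    """Check whether x and y are d-separated given the set z in a DAG.
    `parents` maps each node to a list of its parents."""
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)

    # Ancestors of the conditioning set (needed for the collider rule).
    anc, stack = set(), list(z)
    while stack:
        n = stack.pop()
        if n not in anc:
            anc.add(n)
            stack.extend(parents[n])

    # Search over (node, direction): 'up' = trail arrives from a child,
    # 'down' = trail arrives from a parent (arrow pointing into the node).
    visited, frontier, reached = set(), [(x, 'up')], set()
    while frontier:
        node, direction = frontier.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in z:
            reached.add(node)
        if direction == 'up' and node not in z:
            frontier += [(p, 'up') for p in parents[node]]
            frontier += [(c, 'down') for c in children[node]]
        elif direction == 'down':
            if node not in z:                      # chain continues downward
                frontier += [(c, 'down') for c in children[node]]
            if node in anc:                        # active collider: bounce up
                frontier += [(p, 'up') for p in parents[node]]
    return y not in reached

# The three motifs (chain, fork, collider) behave as the text describes:
chain = {'X': [], 'M': ['X'], 'Y': ['M']}
fork = {'Z': [], 'X': ['Z'], 'Y': ['Z']}
collider = {'X': [], 'Y': [], 'C': ['X', 'Y'], 'D': ['C']}
print(d_separated(chain, 'X', 'Y', {'M'}),      # True: conditioning blocks
      d_separated(fork, 'X', 'Y', set()),       # False: open backdoor
      d_separated(collider, 'X', 'Y', set()),   # True: collider blocks
      d_separated(collider, 'X', 'Y', {'D'}))   # False: descendant opens it
```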

Backdoor paths: the key to confounding

When estimating the causal effect of X on Y, you worry about paths that create association between X and Y that is not due to the directed causal influence X → … → Y.

A backdoor path from X to Y is any path that begins with an arrow into X.

Example:

  • X ← Z → Y is a backdoor path.
  • X ← Z ← W → Y is also a backdoor path (it still starts with an arrow into X).

Backdoor paths matter because they transmit non-causal association. If you can block all backdoor paths (without introducing new bias), then the remaining association corresponds to the causal effect.

Adjustment sets (backdoor criterion)

A set of variables S satisfies the backdoor criterion relative to (X, Y) if:

1) No node in S is a descendant of X.

2) S blocks every backdoor path from X to Y.

If such an S exists, then the causal effect is identified via adjustment:

P(Y | do(X=x)) = ∑ₛ P(Y | X=x, S=s) P(S=s)

This looks like “standardize over S,” but the why is causal: conditioning on S blocks spurious paths, and averaging over S restores the overall population distribution.
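Applied to data, the criterion yields exactly this standardization. A sketch on a hypothetical record set whose frequencies are confounded by construction (all counts are made up and chosen so the arithmetic comes out exact):

```python
from collections import Counter

# Hypothetical records (x, z, y) following Z -> X, Z -> Y, X -> Y.
counts = Counter()
for z, x, n, n_y1 in [(0, 0, 4000, 400), (0, 1, 1000, 300),
                      (1, 0, 1000, 600), (1, 1, 4000, 3200)]:
    counts[(x, z, 1)] = n_y1          # rows with Y = 1
    counts[(x, z, 0)] = n - n_y1      # rows with Y = 0

def naive(counts, x):
    """Observational P(Y=1 | X=x): ignores the confounder Z."""
    n_x = sum(v for (xx, _, _), v in counts.items() if xx == x)
    n_xy = sum(v for (xx, _, yy), v in counts.items() if xx == x and yy == 1)
    return n_xy / n_x

def backdoor_adjust(counts, x):
    """P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) P(Z=z)."""
    total = sum(counts.values())
    est = 0.0
    for z in {key[1] for key in counts}:
        n_xz = counts[(x, z, 0)] + counts[(x, z, 1)]
        n_z = sum(v for (_, zz, _), v in counts.items() if zz == z)
        est += (counts[(x, z, 1)] / n_xz) * (n_z / total)
    return est

print(round(naive(counts, 1), 3), round(backdoor_adjust(counts, 1), 3))  # 0.7 vs 0.55
```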

Confounding vs selection: two different dangers

It’s easy to say “we’ll control for variables.” But which variables you control for matters.

  • Confounder control (forks): often necessary.
  • Collider control (selection): often harmful.

A compact comparison:

| Structure | Graph | What happens if you condition on the middle node? | Typical name |
| --- | --- | --- | --- |
| Fork | X ← Z → Y | Blocks spurious association (good) | Confounding |
| Collider | X → C ← Y | Creates spurious association (bad) | Selection bias |
| Chain | X → M → Y | Blocks mediated effect (depends on goal) | Mediation |

Latent variables and hidden confounding

Real systems often include unobserved variables U (e.g., genetics, socioeconomic factors). If U causes both X and Y, you have an unblocked backdoor path X ← U → Y, but you can’t condition on U.

Then adjustment may fail even if you “control for everything you measured.” This is where identification becomes subtle: you may still identify effects via alternative structures (e.g., frontdoor), instruments, or additional assumptions.

A note on notation: vectors vs scalar variables

You may see causal effects with covariate vectors Z. The adjustment formula generalizes:

P(Y | do(X=x)) = ∑_z P(Y | X=x, Z=z) P(Z=z)

(For continuous variables, replace ∑ with ∫.)

The important part is conceptual: you’re averaging the conditional outcome model over the marginal distribution of confounders, not the conditioned-on distribution within a particular X group.

Core Mechanic 2: Interventions, Identification, and Do-Calculus

Why we need identification rules

A DAG plus observational data gives you P(V) and conditional distributions like P(Y | X, Z). Your causal target, however, is interventional: P(Y | do(X)).

Identification asks:

  • Can we express P(Y | do(X)) using only observational quantities implied by P(V) (and the DAG)?
  • If yes, what formula computes it?
  • If no, what extra data/assumptions would be required?

This matters because randomized experiments directly approximate do(X) by design, but observational studies do not. Identification is the bridge.

The truncated factorization (intervention) formula

If the joint distribution factorizes along the DAG as:

P(V) = ∏ᵢ P(Vᵢ | Pa(Vᵢ))

then intervening do(X=x) yields:

P(V | do(X=x)) = 𝟙[X=x] ∏_{Vᵢ ≠ X} P(Vᵢ | Pa(Vᵢ))

Equivalently, for the post-intervention distribution over non-intervened variables:

P(V \ {X} | do(X=x)) = ∏_{Vᵢ ≠ X} P(Vᵢ | Pa(Vᵢ)), with X set to x wherever it appears as a parent

This is the formal “cut incoming arrows into X” idea.
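The truncated factorization can be implemented directly by enumerating assignments and dropping the factors of intervened variables; the binary-variable restriction and CPT encoding below are simplifying assumptions:

```python
from itertools import product

def interventional(cpts, parents, order, do, query_var, query_val):
    """P(query_var = query_val | do(...)) via the truncated factorization:
    drop the factors of intervened variables and clamp their values.
    cpts[v] maps (value, parent_values_tuple) -> probability."""
    total = 0.0
    domains = {v: (0, 1) for v in order}            # binary variables here
    for assign_vals in product(*(domains[v] for v in order)):
        assign = dict(zip(order, assign_vals))
        if any(assign[v] != val for v, val in do.items()):
            continue                                 # clamp intervened vars
        if assign[query_var] != query_val:
            continue
        p = 1.0
        for v in order:
            if v in do:
                continue                             # factor removed by surgery
            pa = tuple(assign[u] for u in parents[v])
            p *= cpts[v][(assign[v], pa)]
        total += p
    return total

# The running confounded example Z -> X -> Y, Z -> Y (illustrative numbers):
parents = {'Z': (), 'X': ('Z',), 'Y': ('X', 'Z')}
cpts = {
    'Z': {(1, ()): 0.5, (0, ()): 0.5},
    'X': {(1, (0,)): 0.2, (0, (0,)): 0.8, (1, (1,)): 0.8, (0, (1,)): 0.2},
    'Y': {(1, (0, 0)): 0.1, (0, (0, 0)): 0.9, (1, (1, 0)): 0.3, (0, (1, 0)): 0.7,
          (1, (0, 1)): 0.6, (0, (0, 1)): 0.4, (1, (1, 1)): 0.8, (0, (1, 1)): 0.2},
}
p = interventional(cpts, parents, ['Z', 'X', 'Y'], {'X': 1}, 'Y', 1)
print(round(p, 3))  # 0.55, matching the backdoor adjustment
```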

Backdoor and frontdoor as special identification results

Two major identification patterns appear so often they deserve names.

Backdoor adjustment (confounding control)

If S satisfies the backdoor criterion, then:

P(Y | do(X=x)) = ∑ₛ P(Y | X=x, S=s) P(S=s)

Frontdoor adjustment (mediated identification under hidden confounding)

Frontdoor applies when:

  • There is unobserved confounding between X and Y,
  • but X affects Y only through a measured mediator M,
  • and there is no unobserved confounding between X and M, nor between M and Y conditional on X.

Then:

P(Y | do(X=x)) = ∑ₘ P(M=m | X=x) ∑_{x'} P(Y | M=m, X=x') P(X=x')

Intuition:

  • First, learn how X changes M (observable).
  • Then, learn how M relates to Y by averaging over natural variation in X (observable), carefully avoiding the confounding path.
  • Combine them to simulate intervening on X.
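The formula can be checked numerically on a hypothetical model with a latent U: build the full joint, compute the frontdoor expression from observational quantities only, and compare with the ground-truth intervention (possible here only because we constructed the model ourselves). All parameter values are assumptions.

```python
from itertools import product

# Hidden confounder U of X and Y; mediator M: U -> X, U -> Y, X -> M -> Y.
p_u = {0: 0.5, 1: 0.5}
p_x_u = {0: 0.2, 1: 0.8}                                        # P(X=1 | U=u)
p_m_x = {0: 0.3, 1: 0.9}                                        # P(M=1 | X=x)
p_y_mu = {(0, 0): 0.1, (1, 0): 0.4, (0, 1): 0.5, (1, 1): 0.8}   # P(Y=1 | M, U)

def joint(u, x, m, y):
    px = p_x_u[u] if x else 1 - p_x_u[u]
    pm = p_m_x[x] if m else 1 - p_m_x[x]
    py = p_y_mu[(m, u)] if y else 1 - p_y_mu[(m, u)]
    return p_u[u] * px * pm * py

def p_obs(pred):  # observational probability of an event (U stays latent)
    return sum(joint(*k) for k in product((0, 1), repeat=4) if pred(*k))

def frontdoor(x):
    """sum_m P(m|x) sum_{x'} P(y=1|m,x') P(x'), observational terms only."""
    est = 0.0
    for m in (0, 1):
        p_m_given_x = p_obs(lambda u, xx, mm, y: xx == x and mm == m) / \
                      p_obs(lambda u, xx, mm, y: xx == x)
        inner = 0.0
        for xp in (0, 1):
            p_xp = p_obs(lambda u, xx, mm, y: xx == xp)
            p_y = p_obs(lambda u, xx, mm, y: xx == xp and mm == m and y == 1) / \
                  p_obs(lambda u, xx, mm, y: xx == xp and mm == m)
            inner += p_y * p_xp
        est += p_m_given_x * inner
    return est

# Ground truth computed from the full (latent-including) model:
truth = sum(p_u[u] * (p_m_x[1] * p_y_mu[(1, u)] + (1 - p_m_x[1]) * p_y_mu[(0, u)])
            for u in (0, 1))
print(round(frontdoor(1), 3), round(truth, 3))  # both 0.57
```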

Do-calculus: the general tool

Backdoor/frontdoor are consequences of do-calculus. Do-calculus provides rules for transforming expressions like P(Y | do(X), Z) into forms where the do(·) can be removed.

You don’t need to memorize all details to use causal inference effectively, but you do need to understand what the rules do:

  • They allow you to replace interventions with observations when certain graph-separation conditions hold in manipulated graphs.
  • They are sound and (together with standard probability) complete for identification in causal DAG models.

The three rules (high-level)

Let X, Y, Z, W be disjoint sets of nodes. Let G be the causal DAG.

We define modified graphs:

  • G\bar{X}: remove all incoming arrows into X (representing do(X)).
  • G\underline{Z}: remove all outgoing arrows from Z (representing conditioning vs intervention manipulations).

Then the do-calculus rules (informally) are:

1) Insertion/deletion of observations

If Y is d-separated from Z given X and W in G\bar{X}, then:

P(Y | do(X), Z, W) = P(Y | do(X), W)

2) Action/observation exchange

If Y is d-separated from Z given X and W in G\bar{X}\underline{Z}, then:

P(Y | do(X), do(Z), W) = P(Y | do(X), Z, W)

3) Insertion/deletion of actions

If Y is d-separated from Z given X and W in G\bar{X}\bar{Z(W)}, where Z(W) is the subset of Z-nodes that are not ancestors of any W-node in G\bar{X}, then:

P(Y | do(X), do(Z), W) = P(Y | do(X), W)

These statements can look intimidating because of the graph modifications, but the core idea is consistent:

  • If intervening/conditioning on something doesn’t open a path to Y (given what you already control), you can add or remove it.

Identification workflow (practical)

When handed a causal query, a DAG, and observational data, a practical workflow is:

1) State the target: e.g., P(Y | do(X=x)).

2) List candidate adjustment variables using backdoor (if possible).

3) If backdoor fails due to unobserved confounding, check frontdoor conditions.

4) If neither applies, use general do-calculus / algorithmic tools (e.g., ID algorithm) to test identification.

5) If not identifiable, redesign: collect more variables, use instruments, exploit experiments, or accept partial identification bounds.

A careful derivation: why adjustment works

Suppose S blocks all backdoor paths from X to Y and contains no descendants of X.

We want to show:

P(Y | do(X=x)) = ∑ₛ P(Y | X=x, S=s) P(S=s)

Sketch with “show your work” steps (conceptual algebra):

1) Start with law of total probability under intervention:

P(Y | do(X=x))

= ∑ₛ P(Y, S=s | do(X=x))

= ∑ₛ P(Y | S=s, do(X=x)) P(S=s | do(X=x))

2) Because S has no descendants of X and we only intervene on X, distribution of S is unchanged:

P(S=s | do(X=x)) = P(S=s)

3) Because S blocks backdoor paths from X to Y, Y is conditionally independent of the intervention once we condition on (X, S):

P(Y | S=s, do(X=x)) = P(Y | X=x, S=s)

4) Substitute back:

P(Y | do(X=x))

= ∑ₛ P(Y | X=x, S=s) P(S=s)

The DAG is doing the heavy lifting in steps (2) and (3). The formula is “statistics,” but the justification is “causality.”

Estimation vs identification (don’t conflate them)

Identification is symbolic: it tells you what expression equals the causal effect in the population.

Estimation is numerical: given finite data, how do you estimate that expression?

Even if a causal effect is identifiable, estimation can still be hard due to:

  • high-dimensional Z (curse of dimensionality),
  • model misspecification,
  • limited overlap/positivity (some X levels rarely occur for some covariates),
  • measurement error.

This lesson focuses on identification logic, but keep the distinction clear: do-calculus answers “can we,” not “how well with this dataset.”

Application/Connection: From Causal Questions to Analysis Plans

Why causal inference shows up everywhere

Modern ML and data science frequently optimize predictive accuracy: estimate P(Y | X). But decision-making needs causal quantities: what happens to Y if we change X?

Examples:

  • Policy: effect of a training program on income.
  • Medicine: effect of a treatment on survival.
  • Product: effect of a feature change on retention.
  • ML fairness: effect of using a sensitive attribute (or a proxy) on outcomes.

In each case, you are implicitly asking about P(Y | do(X)).

Turning a vague question into a causal estimand

A strong habit: translate English into an estimand.

  • “Does the ad campaign work?”
  • Target: average treatment effect (ATE) in some population: E[Y | do(X=1)] − E[Y | do(X=0)]
  • “What if we set the price to $p?”
  • Target: E[Sales | do(Price=p)]
  • “What would the conversion rate be if everyone saw the new UI?”
  • Target: P(Y=1 | do(X=1))

Once you have the estimand, you can ask: is it identifiable from available data and assumptions?

Choosing an identification strategy: a decision table

| Situation | Typical DAG symptom | Identification move | Notes |
| --- | --- | --- | --- |
| Measured confounding | X ← Z → Y with observed Z | Backdoor adjustment | Most common, but don’t condition on colliders |
| Hidden confounding but observed mediator | X → M → Y with X and Y confounded, but M measured | Frontdoor adjustment | Requires strong structural assumptions |
| Randomized experiment | do(X) approximated by design | No adjustment needed (in principle) | Still adjust for precision / imbalance |
| Selection bias | Conditioning on S where X → S ← Y | Avoid / model selection | Often sneaks in via “only users who…” |
| Unidentifiable effect | No valid adjustment; hidden confounding | Collect more variables, use instruments, partial ID | Honest “can’t answer” is a valid result |

Example connection: Bayesian inference as estimation machinery

You already know Bayesian inference. Once identification gives you an estimand like:

P(Y | do(X=x)) = ∑ₛ P(Y | X=x, S=s) P(S=s)

you can estimate the components using Bayesian models:

  • Put priors on parameters of P(Y | X, S) and P(S).
  • Compute posterior predictive quantities.

Causal inference tells you what to estimate; Bayesian inference tells you how to estimate with uncertainty.
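As a sketch of that division of labor, assume hypothetical cell counts and a conjugate Beta(1, 1) prior on each P(Y=1 | X, Z) cell (both the data and the model choice are assumptions for illustration); the posterior-mean plug-in then estimates the identified estimand:

```python
# Estimand (from identification): P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) P(Z=z).
# Estimation (Bayesian): Bernoulli likelihood per (x, z) cell, Beta(1, 1) prior,
# so the posterior mean is (successes + 1) / (trials + 2).

# counts[(x, z)] = (n_trials, n_recoveries) -- a hypothetical dataset
counts = {(0, 0): (4000, 400), (1, 0): (1000, 300),
          (0, 1): (1000, 600), (1, 1): (4000, 3200)}

n_total = sum(n for n, _ in counts.values())
p_z = {z: sum(n for (x, zz), (n, _) in counts.items() if zz == z) / n_total
       for z in (0, 1)}

def posterior_mean_do(x):
    est = 0.0
    for z in (0, 1):
        n, s = counts[(x, z)]
        est += (s + 1) / (n + 2) * p_z[z]   # Beta(1,1) posterior mean, weighted
    return est

ate = posterior_mean_do(1) - posterior_mean_do(0)
print(round(posterior_mean_do(1), 3), round(ate, 3))
```

A full treatment would propagate posterior uncertainty (e.g., by sampling the Beta posteriors) rather than plugging in means; the plug-in keeps the sketch short.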

Causal inference and ML: where the DAG matters

Some common ML pitfalls become clearer with DAGs:

1) Target leakage

If a feature is a descendant of the label (Y → Feature), the model will “predict” using consequences of Y. That’s not causally meaningful.

2) Controlling for post-treatment variables

If you adjust for a mediator M (X → M → Y) while trying to estimate the total effect of X, you block the mediated pathway and generally bias the estimate toward the direct effect alone.

3) Selection on engagement

Analyzing “active users only” can create collider bias: Feature change (X) affects engagement (C), and user satisfaction (Y) affects engagement; conditioning on active users selects on C.

A realistic end-to-end analysis plan (template)

When you face a causal question in practice:

1) Specify variables: treatment X, outcome Y, potential confounders Z, mediators M, selection variables S.

2) Draw a DAG: even if imperfect, it makes assumptions explicit.

3) Decide estimand: ATE, conditional effect, policy value, etc.

4) Identify using backdoor/frontdoor/do-calculus.

5) Assess assumptions:

  • Are confounders measured?
  • Any colliders being conditioned on by your sampling strategy?
  • Positivity/overlap plausible?

6) Estimate with appropriate methods (regression adjustment, matching, IPW, doubly robust, Bayesian models).

7) Sensitivity analysis for unmeasured confounding.

The conceptual leap is steps (2)–(4): that is what causal inference adds beyond standard statistics.

Worked Examples (3)

Backdoor adjustment: estimating the effect of a treatment with a measured confounder

Variables: X = treatment (0/1), Y = recovery (0/1), Z = severity (0/1). DAG: Z → X, Z → Y, and X → Y. Goal: identify P(Y | do(X=1)) and the ATE.

Assume Z is observed.

  1. Identify backdoor paths from X to Y.

    • There is a directed causal path X → Y (this is the effect we want).
    • There is a backdoor path X ← Z → Y (starts with arrow into X).
  2. Choose an adjustment set S.

    • Candidate: S = {Z}.
    • Check backdoor criterion:

    (i) Z is not a descendant of X (true).

    (ii) Conditioning on Z blocks the path X ← Z → Y (true).

    So Z is a valid backdoor adjustment set.

  3. Write the adjustment formula.

    P(Y | do(X=1)) = ∑_z P(Y | X=1, Z=z) P(Z=z)

    Similarly,

    P(Y | do(X=0)) = ∑_z P(Y | X=0, Z=z) P(Z=z)

  4. Convert to an ATE expression (difference in expectations).

    Let Y be binary, then E[Y | do(X=x)] = P(Y=1 | do(X=x)).

    ATE = E[Y | do(X=1)] − E[Y | do(X=0)]

    = ∑_z [P(Y=1 | X=1, Z=z) − P(Y=1 | X=0, Z=z)] P(Z=z).

  5. Interpret.

    Within each severity stratum Z=z, compare treated vs untreated (a like-for-like comparison), then average those differences over how common each severity level is in the population.

Insight: The key move is not “control for Z because it predicts Y,” but “control for Z because it opens/closes causal paths.” Z is required because it is a common cause of X and Y. The DAG explains why the standardization ∑_z (…) P(Z=z) matches an intervention.
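A classic numeric illustration of why the standardization matters (hypothetical counts, arranged so severe patients get treated more often): the crude comparison and the stratum-averaged ATE disagree in sign (Simpson's paradox).

```python
# (z, x): (n_patients, n_recovered); all counts are made up for illustration.
counts = {(1, 1): (800, 400), (1, 0): (200, 80),    # severe stratum (Z=1)
          (0, 1): (200, 190), (0, 0): (800, 720)}   # mild stratum (Z=0)

def rate(pred):
    n = sum(v[0] for k, v in counts.items() if pred(*k))
    r = sum(v[1] for k, v in counts.items() if pred(*k))
    return r / n

# Aggregate ("crude") comparison: treatment looks harmful.
crude = rate(lambda z, x: x == 1) - rate(lambda z, x: x == 0)

# Backdoor-adjusted ATE: compare within each stratum, then average over P(Z).
n_total = sum(v[0] for v in counts.values())
ate = sum((rate(lambda z, x, zz=zz: z == zz and x == 1)
           - rate(lambda z, x, zz=zz: z == zz and x == 0))
          * sum(v[0] for (z, _), v in counts.items() if z == zz) / n_total
          for zz in (0, 1))
print(round(crude, 3), round(ate, 3))  # -0.21 vs +0.075: the sign flips
```

The treatment helps in every stratum (0.5 > 0.4 among severe, 0.95 > 0.9 among mild), yet the crude rates reverse because treatment assignment tracks severity.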

Collider bias: why adjusting for a selection variable can create a fake causal effect

Variables: X = skill (continuous), Y = friendliness (continuous), C = hired (0/1). DAG: X → C ← Y. You only observe people who are hired (C=1) and then you compute the correlation between X and Y.

Question: Can conditioning on C create an association between X and Y even if they are independent marginally?

  1. Read the DAG.

    • X and Y both cause C.
    • There is no arrow between X and Y, so the model allows X ⟂ Y marginally (they can be independent in the population).
  2. Identify the path between X and Y.

    • There is one path: X → C ← Y.
    • C is a collider on this path.
  3. Apply the collider rule.

    • If we do not condition on C (and none of C’s descendants are conditioned on), the path through the collider is blocked.

    So X and Y can remain independent in the overall population.

  4. Condition on C=1 (selection).

    • Conditioning on a collider opens the path X → C ← Y.
    • Intuition: among hired people, if someone has low X (skill), they must “compensate” with higher Y (friendliness) to be hired, and vice versa.

    This induces a negative association between X and Y within the selected set.

  5. Consequence.

    If you regress Y on X using only hired people, you may conclude X ‘causes’ Y or at least that they are strongly related, but the association is an artifact of conditioning on C.

Insight: “Control for everything you can” is not safe. Conditioning is an operation on distributions, not a free improvement. DAGs tell you when conditioning opens paths (colliders) and thereby manufactures correlations that were not present before.
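A quick simulation of this hiring story (the distributions and the hiring rule are made-up assumptions): skill and friendliness are independent in the population, yet strongly anti-correlated among the hired.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
skill = rng.standard_normal(n)
friendliness = rng.standard_normal(n)          # independent of skill
hired = skill + friendliness > 1.0             # the collider: X -> C <- Y

corr_all = np.corrcoef(skill, friendliness)[0, 1]
corr_hired = np.corrcoef(skill[hired], friendliness[hired])[0, 1]
print(round(corr_all, 3), round(corr_hired, 3))
# corr_all is near 0; corr_hired is strongly negative (selection artifact)
```

A regression of friendliness on skill fit only to hired people would report a large negative coefficient that exists nowhere in the population.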

Frontdoor adjustment: identifying a causal effect with unobserved confounding

Variables: X = smoking (0/1), M = tar exposure (continuous), Y = lung disease (0/1), U = genetic risk (unobserved). DAG: U → X and U → Y (hidden confounding), X → M → Y, and no direct arrow X → Y.

Assume: (i) all causal effect of X on Y goes through M, (ii) no unobserved confounding between X and M, (iii) no unobserved confounding between M and Y given X.

Goal: identify P(Y | do(X=x)) from observational data.

  1. Recognize why backdoor fails.

    • There is a backdoor path X ← U → Y.
    • U is unobserved, so we cannot condition on it.

    Thus no standard backdoor adjustment is available (in this simplified setup).

  2. Check frontdoor conditions.

    • X affects M (X → M) and M affects Y (M → Y): mediator observed.
    • All directed paths from X to Y go through M: satisfied by assumption (no X → Y edge).
    • No unblocked backdoor from X to M: U does not cause M (assumed), so OK.
    • Backdoor paths from M to Y are blocked by conditioning on X: since U affects Y and X but not M, conditioning on X blocks M ← X ← U → Y type paths (under the stated assumptions).
  3. Write the frontdoor formula.

    P(Y | do(X=x)) = ∑_m P(M=m | X=x) ∑_{x'} P(Y | M=m, X=x') P(X=x')

  4. Explain the two-stage averaging.

    • First term P(M | X=x): how changing X changes mediator M (estimable observationally because no confounding between X and M).
    • Second term: compute P(Y | do(M=m)) indirectly by averaging P(Y | M=m, X=x') over the natural distribution of X (this step neutralizes the confounding between X and Y because we are not comparing groups at fixed X; we are using X as a stratifier to learn the M → Y relationship).
  5. Result.

    The causal effect of X on Y is identified despite unobserved U, because the mediator M provides a measurable pathway that “transmits” the effect and can be isolated with the frontdoor logic.

Insight: Frontdoor is a powerful reminder that hidden confounding does not automatically make causal inference impossible. But it replaces “measure confounders” with stronger structural assumptions about mediation and the absence of certain confounders—assumptions you must defend scientifically.

Key Takeaways

  • Causal questions are about interventions: P(Y | do(X=x)) is fundamentally different from P(Y | X=x).

  • A causal DAG encodes mechanistic assumptions; it is not merely a visualization. It determines which statistical adjustments are valid.

  • Confounding arises from forks (X ← Z → Y); selection bias arises from conditioning on colliders (X → C ← Y).

  • Backdoor adjustment identifies causal effects when you can block all backdoor paths with a set S that contains no descendants of X.

  • Frontdoor adjustment can identify effects even with unobserved confounding, but requires strong mediator-based assumptions.

  • Do-calculus generalizes backdoor/frontdoor: it provides rules for removing do(·) operators when graph-separation conditions hold in modified graphs.

  • Identification (symbolic equality in the population) is separate from estimation (numerical computation from finite data).

  • Good causal practice: write the estimand, draw the DAG, check paths, then pick an identification strategy—only then choose an estimator.

Common Mistakes

  • Treating P(Y | X=x) as if it were P(Y | do(X=x)) without checking for backdoor paths (confounding).

  • “Controlling for everything,” especially conditioning on colliders or post-treatment variables, which can introduce bias.

  • Assuming a DAG is validated because it ‘looks reasonable’; many different DAGs can fit the same observational distribution.

  • Confusing identification with estimation: a correctly identified estimand can still be poorly estimated due to lack of overlap, high-dimensional covariates, or measurement error.

Practice

Easy

Backdoor practice: Consider the DAG Z → X → Y and Z → Y. (a) List all backdoor paths from X to Y. (b) Does S={Z} satisfy the backdoor criterion? (c) Write P(Y | do(X=x)) in terms of observational quantities.

Hint: A backdoor path must start with an arrow into X. In this graph, check X ← Z → Y.

Solution

(a) The backdoor path is X ← Z → Y.

(b) Yes. Z is not a descendant of X, and conditioning on Z blocks X ← Z → Y.

(c) P(Y | do(X=x)) = ∑_z P(Y | X=x, Z=z) P(Z=z).

Medium

Collider reasoning: Suppose X → C ← Y and additionally C → D (D is a descendant of the collider). Are X and Y independent given D? Explain using collider logic.

Hint: Conditioning on a collider or any of its descendants opens the path through the collider.

Solution

In X → C ← Y, C is a collider, so the path between X and Y is blocked marginally. But D is a descendant of C. Conditioning on D provides information about C, which effectively conditions on (or partially conditions on) the collider. This opens the path X → C ← Y, inducing an association between X and Y given D. So X and Y are generally not independent given D.

Hard

Frontdoor check: You observe X, M, Y with DAG X → M → Y, and there is an unobserved U such that U → X and U → Y. Additionally, suppose there is also an unobserved W such that W → M and W → Y. Is the causal effect of X on Y identifiable by frontdoor? Why or why not?

Hint: Frontdoor requires no unblocked confounding between M and Y given X. Hidden common causes of M and Y break that.

Solution

No, not by the standard frontdoor criterion. The unobserved W creates confounding between M and Y via M ← W → Y. Even conditioning on X does not block this backdoor path because W is not a descendant of X and is unobserved. Therefore the relationship between M and Y cannot be learned unbiasedly from P(Y | M, X), and the frontdoor formula is not justified.

Connections

Prerequisites you’re using here:

  • Bayesian Inference — once you identify an estimand (e.g., ∑_z P(Y | X, Z)P(Z)), Bayesian modeling is a natural way to estimate it with uncertainty.
  • Topological Sort — DAGs admit an ordering that underlies factorization and “graph surgery” under interventions.
