15 June 2021 by Richard

*Okay so correlation does not imply causation. What now?*

**It isn’t my job to disappoint people, but I’m good at
it.** Other researchers are out there writing books about the
wonder of science, capturing the imagination of the public, inspiring
the thinkers that will secure our species’ just and sustainable future.
Meanwhile, I’m telling anyone who will listen that, if we are very
careful and try very hard, we might not completely mislead ourselves.

There’s a chance, is what I’m saying.

But people find the message disappointing. I suspect the message is disappointing because an education in science oversells the wins and hides the losses. It’s a form of propaganda that functions to convince us to believe things. But it’s shit at teaching us how to discover (or dis-discover) things. Of course this means it is sometimes bad at justifying those beliefs.

When the topic turns to data analysis, what my students and colleagues want is a way to separate spurious associations from true causal relationships. We want some method—a “scientific method” maybe—to reliably infer causation.

There are many scientific methods and there is no causal inference police. Science is anarchy. But it’s not too disappointing to compress the statistical versions into three types.

First, there is **Causal Salad**: You put everything into a
**regression equation**, toss with some creative
story-telling, and hope the reviewers eat it. In general, this is not a
valid approach, for well-known reasons. But it can get you published.
Causal salad can discover causes too. But you have to get lucky. The
Salad isn’t only regression. Really any procedure that hopes to take a
list of variables (features) and return causal inference is Causal
Salad. No amount of data reliably turns salad into sense.

Second, there is **Causal Design**: Use a framework like
**do-calculus** to derive a causal estimate, given a causal
model. If such an estimate is possible—often it is not—this approach
will tell you which variables to use and how to use them. Sometimes the
result will resemble a regression equation. Often it will not. I include
in this approach the idea of **randomized** and **natural
experiments**, because these are justified only through
some logic of causal design.

Third, there is **Full-Luxury Bayesian Inference**: Program
the entire causal model as a joint probability distribution and let the
logic of probability theory work out the implications. In principle,
this is equivalent to **Causal Design**—all of the same
results can be derived either way, given the same causal model. In
practice, the approaches are different. They have different strengths
and require different procedures.

To understand these approaches, we need to see how each manages a common inference task—same data, same goal. Let’s set up the task now.

Suppose a sample of mother and daughter pairs, such as the pair pictured below, have completed their families. Many things influence the number of children a woman ends up with. We are interested here in the social transmission of family norms from mother to daughter. What is the causal influence, if any, of mom’s family size on the family size of her daughter?

We know that there are unmeasured confounds, because mothers and
daughters share common environmental exposures that could explain some
part of the association between their completed family sizes. The only
data we have for all women is their **birth order** (first
born or not). Suppose we know from prior research that first born
daughters have higher fertility.

There are no tricks here. Take this story as given and consider what you would do, in order to infer the causal influence of mom’s family size on her daughter’s. Are the birth order variables useful? How would you use them?

To make the story more solid, let’s go ahead and simulate some data that approximate the story. This story is already pretty complicated, because the variables are counts and daughter’s birth order must be influenced by mom’s family size—your chance of being first born depends upon how many siblings you have. I will leave out that complexity, so we can focus instead on the structural issues of which variables directly influence which other variables. Here’s some simple R code to produce a synthetic data set. We’ll use these data in the sections below. But go ahead and explore the data, if you feel frisky. I will assume that mom’s family size has zero direct influence on daughter’s, just to make the lessons starker.
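
The original code is in R; here is a Python sketch of one generative model consistent with the story. The structure is what the text specifies (B1 → M, B2 → D, a shared confound U → M and U → D, and zero direct M → D); the specific coefficient values (2 for the birth-order effects, unit-scale noise and confound) are assumptions chosen only to make the pattern visible.

```python
import numpy as np

rng = np.random.default_rng(1908)
N = 200  # number of mother-daughter pairs

# U: unmeasured confound shared within a pair (common environmental exposures)
U = rng.normal(0.0, 1.0, N)

# B1: mom's birth order (1 = first born), B2: daughter's birth order.
# To keep things simple, B2 is independent of M here, as the text warns.
B1 = rng.binomial(1, 0.5, N)
B2 = rng.binomial(1, 0.5, N)

# M: mom's family size, influenced by her birth order and the confound
M = rng.normal(2 * B1 + U)

# D: daughter's family size -- note the 0 * M: zero direct influence of mom
D = rng.normal(2 * B2 + U + 0 * M)
```

Change the `0 * M` to turn on a real causal influence of mom's family size, if you want to explore.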

In this code, the variables are: **B1** mom’s birth order,
**M** mom’s family size, **B2** daughter’s
birth order, **D** daughter’s family size. Those are the
variables we get to use for inference. The variable **U**
is part of the generative model, but we haven’t measured it in our
thought experiment, so it isn’t available for use in what follows.

If you were going to apply **Causal Salad** to this
problem, you would build a regression with daughter’s family size
(**D**) as the outcome and mother’s (**M**) as
a predictor. In common linear formula notation: **D ~ M**.
This model will measure the association between **D** and
**M**.

You already know that some of this association is due to unmeasured
common exposures. Let’s call those unmeasured exposures
**U**. The question now is what happens when we produce the
causal salad by adding the birth order variables to the model. It can
only help, right?

Unfortunately, it can hurt. The dis-logic of naive regression is that
omitted variables, like **U**, are the key threat. Included
variables don’t usually hurt us. We can just add them and see if the fit
improves and which sparkling p-values emerge.

To see what I’m saying, let’s fit two regression models. The first is
just D ~ M, which measures the association between **D**
and **M**. Nothing more. Then we’ll fit D ~ M + B1 + B2,
the Causal Salad model. We’ll also compare these models with AIC, a
common measure of a model’s predictive quality, not just its fit to
sample. Smaller AIC values are better, because AIC is a measure of
prediction error.

The first model, D ~ M, gives you:

```
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.7128     0.1244   5.729 3.71e-08 ***
M             0.2596     0.0619   4.194 4.12e-05 ***
```

The coefficient for **M** is positive and has a small
standard error. This is a result of the **U** confound, as we
know from the generative model above. The AIC for this model is 740. The
second regression gives us:

```
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03980    0.15280  -0.260   0.7948
M            0.39468    0.06326   6.239 2.67e-09 ***
B1          -0.52544    0.22049  -2.383   0.0181 *
B2           1.80475    0.17101  10.553  < 2e-16 ***
```

Adding **B1** and **B2** actually made things
worse. The coefficient on M is even further from zero (its true value)
now. And the coefficient on **B1** is actually on the wrong
side of zero. We know the influence of **B1** is positive,
because we simulated the data. The AIC on this model is 647. It’s better
(at prediction) than the first model, despite having worse inferences.

What’s going on here? To really explain why the coefficients turn out this way, we’ll need the next approach, Causal Design. But note the following.

First, **adding variables can actually make inference
worse**. It isn’t harmless to just add variables to the
salad and let the model sort it out. You have to give some thought to
the causal relationships among the predictor variables, not just between
the outcome and the predictors. That is what the next two approaches
force us to do.

Second, the model with worse estimates (the second model) is expected to make better predictions (it has a smaller AIC). And it is very likely that it would make better predictions. Making good predictions does not require an accurate causal model. It just requires a good description of associations. As long as we do not intervene in the system, this kind of model can make very useful predictions.

But scientific understanding is something different. When we understand
a causal influence, all we usually mean (statistically) is that we can
predict the consequences of an intervention. And the models above
definitely do not provide that, because the simulated influence of
**M** on **D** is zero. So intervening on
**M**, maybe through incentives or communication, would
have no direct impact on **D**.
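
One way to see this is to simulate the intervention directly. Under the generative story (again a Python sketch with my assumed coefficients), forcing **M** to a value by fiat, a crude stand-in for do(M = m), leaves the distribution of **D** untouched, because D’s simulation never uses M.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000  # large sample so means are stable

def simulate_daughters(m_intervention=None):
    """Simulate daughter family sizes, optionally forcing mom's family size."""
    U = rng.normal(size=N)                    # unmeasured confound
    B1 = rng.binomial(1, 0.5, N)
    M = rng.normal(2 * B1 + U)
    if m_intervention is not None:
        # do(M = m): override M, cutting the arrows into it
        M = np.full(N, float(m_intervention))
    B2 = rng.binomial(1, 0.5, N)
    D = rng.normal(2 * B2 + U + 0 * M)        # M's direct effect is zero
    return D

d_natural = simulate_daughters()
d_do_low  = simulate_daughters(m_intervention=0)
d_do_high = simulate_daughters(m_intervention=5)

# Intervening on M shifts nothing: mean D is about the same in every case
print(d_natural.mean(), d_do_low.mean(), d_do_high.mean())
```

Observationally M and D are correlated (through U), yet pushing M around does nothing to D. That gap between association and intervention is the whole point.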

That’s enough for now. Next time, in
Part 2/3, I’ll pick up with
**Causal Design** and revisit the same scenario and data.
You’ll see that we actually can use **B1** to improve
inference. But doing so requires something other than thinking like a
regression.

There’s a chance, is what I’m saying.