15 June 2021 by Richard
Regression, Fire, and Dangerous Things (1/3)
Okay so correlation does not imply causation. What now?
Three Disappointing Things
It isn’t my job to disappoint people, but I’m good at it. Other researchers are out there writing books about the wonder of science, capturing the imagination of the public, inspiring the thinkers that will secure our species’ just and sustainable future. Meanwhile, I’m telling anyone who will listen that, if we are very careful and try very hard, we might not completely mislead ourselves.
There’s a chance, is what I’m saying.
But people find the message disappointing. I suspect the message is disappointing because an education in science oversells the wins and hides the losses. It’s a form of propaganda that functions to convince us to believe things. But it’s shit at teaching us how to discover (or dis-discover) things. Of course this means it is sometimes bad at justifying those beliefs.
When the topic turns to data analysis, what my students and colleagues want is a way to separate spurious associations from true causal relationships. We want some method—a “scientific method” maybe—to reliably infer causation.
There are many scientific methods and there is no causal inference police. Science is anarchy. But it’s not too disappointing to compress the statistical versions into three types.
First, there is Causal Salad: You put everything into a regression equation, toss with some creative story-telling, and hope the reviewers eat it. In general, this is not a valid approach, for well-known reasons. But it can get you published. Causal salad can discover causes too. But you have to get lucky. The Salad isn’t only regression. Really any procedure that hopes to take a list of variables (features) and return causal inference is Causal Salad. No amount of data reliably turns salad into sense.
Second, there is Causal Design: Use a framework like do-calculus to derive a causal estimate, given a causal model. If such an estimate is possible—often it is not—this approach will tell you which variables to use and how to use them. Sometimes the result will resemble a regression equation. Often it will not. I include in this approach the idea of randomized and natural experiments, because these are justified only through some logic of causal design.
Third, there is Full-Luxury Bayesian Inference: Program the entire causal model as a joint probability distribution and let the logic of probability theory work out the implications. In principle, this is equivalent to Causal Design—all of the same results can be derived either way, given the same causal model. In practice, the approaches are different. They have different strengths and require different procedures.
To understand these approaches, we need to see how each manages a common inference task—same data, same goal. Let’s set up the task now.
A Tale of Two Mothers
Suppose a sample of mother and daughter pairs, such as the pair pictured below, have completed their families. Many things influence the number of children a woman ends up with. We are interested here in the social transmission of family norms from mother to daughter. What is the causal influence, if any, of mom’s family size on the family size of her daughter?
We know that there are unmeasured confounds, because mothers and daughters share common environmental exposures that could explain some part of the association between their completed family sizes. The only other data we have for all women is their birth order (first born or not). Suppose we know from prior research that first born daughters have higher fertility.
There are no tricks here. Take this story as given and consider what you would do, in order to infer the causal influence of mom’s family size on her daughter’s. Are the birth order variables useful? How would you use them?
To make the story more solid, let’s go ahead and simulate some data that approximate the story. This story is already pretty complicated, because the variables are counts and daughter’s birth order must be influenced by mom’s family size—your chance of being first born depends upon how many siblings you have. I will leave out that complexity, so we can focus instead on the structural issues of which variables directly influence which other variables. Here’s some simple R code to produce a synthetic data set. We’ll use these data in the sections below. But go ahead and explore the data, if you feel frisky. I will assume that mom’s family size has zero direct influence on daughter’s, just to make the lessons starker.
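Something like the following will do. The seed, sample size, and effect sizes below are arbitrary choices of mine; what matters is the structure: B1 influences M, B2 influences D, the unmeasured U influences both family sizes, and M has zero direct effect on D.

    set.seed(1908)                         # arbitrary seed, just for reproducibility
    N  <- 200                              # number of mother-daughter pairs
    U  <- rnorm(N)                         # unmeasured confound, shared within each pair
    B1 <- rbinom(N, size = 1, prob = 0.5)  # mom's birth order (1 = first born)
    M  <- rnorm(N, mean = 2 * B1 + U)      # mom's family size: first borns more fertile
    B2 <- rbinom(N, size = 1, prob = 0.5)  # daughter's birth order (independent of M:
                                           # the simplification noted above)
    D  <- rnorm(N, mean = 2 * B2 + U + 0 * M)  # daughter's family size: M's direct effect is zero
    d  <- data.frame(B1, M, B2, D)         # U stays out; we haven't measured it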
In this code, the variables are: B1 mom’s birth order, M mom’s family size, B2 daughter’s birth order, D daughter’s family size. Those are the variables we get to use for inference. The variable U is part of the generative model, but we haven’t measured it in our thought experiment, so it isn’t available for use in what follows.
Regression
If you were going to apply Causal Salad to this problem, you would build a regression with daughter’s family size (D) as the outcome and mother’s (M) as a predictor. In common linear formula notation: D ~ M. This model will measure the association between D and M.
You already know that some of this association is due to unmeasured common exposures. Let’s call those unmeasured exposures U. The question now is what happens when we produce the causal salad by adding the birth order variables to the model. It can only help, right?
Unfortunately, it can hurt. The dis-logic of naive regression is that omitted variables, like U, are the key threat. Included variables don’t usually hurt us. We can just add them, see whether the fit improves, and watch which sparkling p-values emerge.
To see what I’m saying, let’s fit two regression models. The first is just D ~ M, which measures the association between D and M. Nothing more. Then we’ll fit D ~ M + B1 + B2, the Causal Salad model. We’ll also compare these models with AIC, a common measure of a model’s predictive quality, not just its fit to sample. Smaller AIC values are better, because AIC is a measure of prediction error.
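In base R, the comparison looks something like this (the model names m1 and m2 are my own):

    m1 <- lm(D ~ M, data = d)            # the bivariate model
    m2 <- lm(D ~ M + B1 + B2, data = d)  # the Causal Salad model
    summary(m1)                          # coefficient table for the first model
    summary(m2)                          # coefficient table for the salad
    AIC(m1, m2)                          # smaller means less expected prediction error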
The first model, D ~ M, gives you:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   0.7128     0.1244   5.729 3.71e-08 ***
    M             0.2596     0.0619   4.194 4.12e-05 ***
The coefficient for M is positive and has a small standard error. We know from the generative model above that this is a result of the U confound. The AIC for this model is 740. The second regression gives us:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -0.03980    0.15280  -0.260   0.7948
    M            0.39468    0.06326   6.239 2.67e-09 ***
    B1          -0.52544    0.22049  -2.383   0.0181 *
    B2           1.80475    0.17101  10.553  < 2e-16 ***
Adding B1 and B2 actually made things worse. The coefficient on M is even further from zero (its true value) now. And the coefficient on B1 is on the wrong side of zero: we know its influence is positive, because we simulated the data. The AIC of this model is 647. It’s better (at prediction) than the first model, despite having worse inferences.
What’s going on here? To really explain why the coefficients turn out this way, we’ll need the next approach, Causal Design. But note the following.
First, adding variables can actually make inference worse. It isn’t harmless to just add variables to the salad and let the model sort it out. You have to give some thought to the causal relationships among the predictor variables, not just between the outcome and the predictors. That is what the next two approaches force us to do.
Second, the model with worse estimates (the second model) is expected to make better predictions (it has a smaller AIC). And it very likely would, on new data from the same process. Making good predictions does not require an accurate causal model. It just requires a good description of associations. As long as we do not intervene in the system, this kind of model can make very useful predictions.
But scientific understanding is something different. When we understand a causal influence, all we usually mean (statistically) is that we can predict the consequences of an intervention. And the models above definitely do not provide that, because the simulated influence of M on D is zero. So intervening on M, maybe through incentives or communication, would have no direct impact on D.
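We can check that directly, because we own the generative model. Here’s a small sketch (the helper function is my own invention) that simulates daughters under different imposed values of M:

    # simulate daughters under an intervention that fixes mom's family size at M_value;
    # since M's direct effect is zero in the generative model, D should not respond
    sim_D_do_M <- function(M_value, N = 1e4) {
      U  <- rnorm(N)
      B2 <- rbinom(N, size = 1, prob = 0.5)
      rnorm(N, mean = 2 * B2 + U + 0 * M_value)
    }
    mean(sim_D_do_M(1))   # about the same...
    mean(sim_D_do_M(10))  # ...no matter what value we impose on M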
To Be Continued
That’s enough for now. Next time, in Part 2/3, I’ll pick up with Causal Design and revisit the same scenario and data. You’ll see that we actually can use B1 to improve inference. But doing so requires something other than thinking like a regression.
There’s a chance, is what I’m saying.