# There Are No Magic Outcome Variables

I’m very sorry but I am going to write about statistics and causal inference again. I’d much rather be doing science. I’ll make it brief.

I was reading a preprint about the relationship between kinship institutions and economic development. Do different kinship systems influence the economy in different ways? The paper uses log-GDP per capita as a measure of economic welfare. This is just the logarithm of GDP divided by the population density. I am not going to complain about this variable.

Instead I want to point out that just because log-GDP per capita uses population density, that is not sufficient for saying that your inference will control for population density. This is what the paper says (page 23): The regressions exclude population density, and the reason given is that the outcome variable is in per capita terms already. It’s not clear what the reasoning is here. But anyway it (the justification) is wrong. Population density may still bias inference, even if the outcome variable is in per capita terms. There is no principle of causal inference that says using a variable in constructing another variable will control for any biasing paths.

Okay let me draw out the logic here. To make the narrative easier, here’s a causal diagram I drew with dagitty.net: The variable X in the upper right is the cause of interest. We want to learn the effect of intervening on X on the outcome GDP/P in the lower left. Notice that there is no direct path between X and GDP/P. X influences the economy (GDP) directly. And GDP is also influenced by population density P. But GDP/P is just a calculation that uses GDP and P. The only direct causes of GDP/P are the variables used to compute it.

However it does make sense, I think, to ask about less proximate causes of GDP/P. Like X. X influences GDP/P through GDP (and possibly other things not shown). By the logic of structural causal models, in order to decide which covariates we should adjust for, we should at minimum find any so-called backdoor paths connecting the exposure (X) to the outcome (GDP/P). In the diagram above, there is a backdoor path through population density P. P is a confound, because I have assumed that population density influences the kinship system. The fact that GDP/P uses P in its calculation does NOT close that backdoor path. We still need to adjust for P in our inference. Otherwise we’ll get a bias, possibly a substantial one.
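The backdoor logic can also be checked mechanically. Here is a minimal sketch using the dagitty R package, the R companion to dagitty.net. The edge list is a transcription of the diagram as described in the text, and the node name `GDPpc` (standing in for GDP/P) is made up for illustration, so treat both as assumptions.

```r
library(dagitty)

# Edges transcribed from the diagram described in the text (an assumption).
# GDPpc is a hypothetical node name standing in for GDP/P.
g <- dagitty("dag {
  X -> GDP
  P -> GDP
  P -> X
  GDP -> GDPpc
  P -> GDPpc
}")

# Minimal adjustment sets for the total effect of X on GDPpc
sets <- adjustmentSets(g, exposure = "X", outcome = "GDPpc")
print(sets)
```

Under these assumed edges, the only minimal adjustment set should be P. Draw a different diagram and the answer may change, but the decision still comes from the diagram and not from how the outcome was computed.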

Now maybe you want to argue for a different diagram. Fine. The point is that just because GDP/P uses P, that is no basis for deciding whether or not you should adjust for P in the model. You have to make some causal assumptions in order to justify that decision. I think the authors of the paper can easily address the problem, given that they have thought quite a lot about the causal structure in other parts of the paper.

There are lots of constructed outcome variables in the wilds of the sciences. And people often use them in this way: as a kind of back-alley adjustment strategy. This is rarely a good idea.

Alright that’s enough statistics for this month. Below is a little simulation of the causal model above that you can use to demonstrate that in this example we do need to adjust for P to get an unbiased estimate. I’ve assumed kinship K (the X in the diagram) has no effect on GDP, just to make it easier to recognize the bias.

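```
# Simulate the causal model above; kinship K has no effect on GDP
N <- 5000
P <- runif(N,1,10)                     # population density (the confound)
K <- scale( rnorm(N,-log(P)) )         # kinship, influenced by P
G <- rpois(N,exp( -0*K + log(P)))+1    # GDP: depends on P, effect of K set to zero
GP <- scale( log(G/P) )                # log-GDP per capita

(lm(GP ~ K))              # no adjustment: coefficient on K is biased away from zero
(lm(GP ~ K + log(P) ))    # adjusting for P: coefficient on K is near zero
```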
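
A single simulated data set could mislead by chance, so here is a small sketch that repeats the simulation many times and compares the average coefficient on K in the two models. The function name `sim_once`, the seed, and the 200 replicates are arbitrary choices for illustration, and the outcome is computed as `log(G/P)`, the log of GDP per capita.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulate one data set from the same model and return the estimated
# coefficient on K without and with adjustment for log(P)
sim_once <- function(N = 5000) {
  P <- runif(N, 1, 10)
  K <- as.numeric(scale(rnorm(N, -log(P))))
  G <- rpois(N, exp(0 * K + log(P))) + 1   # true effect of K is zero
  GP <- as.numeric(scale(log(G / P)))
  c(unadjusted = unname(coef(lm(GP ~ K))["K"]),
    adjusted   = unname(coef(lm(GP ~ K + log(P)))["K"]))
}

est <- replicate(200, sim_once())  # 2 x 200 matrix of estimates
rowMeans(est)                      # average estimate in each model
apply(est, 1, sd)                  # spread across simulated data sets
```

If the argument holds, the unadjusted estimates should center away from zero while the adjusted ones center near zero, even though K has no effect at all in the simulation.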

If you are new to this way of thinking about model construction and justification, I have made just for you an entire 10-week course of lectures, all freely available: Statistical Rethinking 2022 Lectures