Regression, Fire, and Dangerous Things (3/3)

Not falling, not ignoring: Odd and even are on one die.

Thinking Like a Probability Distribution

Hyakujō’s Fox is a classic Zen kōan attested from the year 1036 CE [link]. It is surely much older. It gives the story of a monk who is transformed into a fox, because he denies that an enlightened person is subject to cause and effect. He is later freed when he realizes his error.

Like all kōan’s, there are varying interpretations, and they are all correct. But many of them assert that there is no escape from causation. The enlightened person becomes enlightened through cause and effect, and denying it abnegates enlightenment. Everyone is subject to cause and effect, whether they ignore it through ignorance or realize it through enlightenment.

I’m talking about foxes and enlightenment, because I want to use this story to tie together the three parts of this series.

In Part 1, Thinking Like a Regression (if you haven’t read it, go back and start there), we used a common statistical model that, like most common statistical models, had no conscious connection to causation. It mostly ignored it. But its behavior was of course a direct consequence of the causal model that it ignored. This is the approach I called Causal Salad. It can provide valid inference. But you have to get lucky. And more information can even hurt you.

In Part 2, Thinking Like a Graph, we took a big step on the path to enlightenment. We made a distinction between the causal model and the statistical model. The statistical model was then deliberately designed to answer a specific causal question. But the statistical model does not contain the causal model, and the causal model cannot be reconstructed from the statistical model. The statistical model will not in general even contain all of the data in the full causal model, and any one causal model may imply a large number of different statistical models, depending upon the question and the data. This is the approach I labeled Causal Design. It serves us in the design of statistical models as well as in the design of research.

Now in Part 3, Thinking Like a Probability Distribution, the distinction between the causal model and the statistical model shrinks. Instead of building a separate statistical model, we’ll just use the causal model itself. This is enlightenment. This means we’ll use all of the variables and express all of their relationships as a joint probability distribution. Then any available data can be used to constrain the joint distribution, eliminating possibilities and refining our information about causal effects.

This is the approach I called (in Part 1) Full-Luxury Bayesian Inference. Instead of many statistical models for any one causal model, it uses one statistical model and possibly many distinct simulations from it. This approach doesn’t require us to realize or derive the statistical implications of the causal model. It does that automatically. And it leads readily to ways to estimate quantities in finite samples, even when there is missing data and measurement error and other commonplace terrors.

But before you get too excited, I use the term Full-Luxury with some irony. Like all luxuries, this approach has costs. The machinery required to make it work can be complicated. And for some causal models, it is highly impractical or even impossible to reliably perform the calculations. It’s often true that we can make the machine more efficient by transforming the generative model a bit. This is one of several reasons that Causal Design, using analysis in the spirit of do-calculus, provides unique benefits that can be combined with Full-Luxury Inference. I’ll show you what I mean.

Full-Luxury Bayesian Inference

So let’s return to the Tale of Two Mothers (pictured, below) from Parts 1 and 2 and see how to apply the Full-Luxury approach. The key idea is to just use the generative (causal) model to define a joint probability distribution of all of the variables. The steps involved are: (1) express the causal model as a joint probability distribution, (2) teach this distribution to a computer and let the computer figure out what the data imply about the other variables, (3) use generative simulations to measure different causal interventions. I’ll walk you through a complete example, with code. Let’s go one step at a time.

(1) Express the model as a joint distribution

To remind you, the generative model in this example is given by the simulation itself:

And the graphical version (from Part 2), which includes symbols like b and m for the unknown causal effects, looks like this:

Our job now is to write this generative model as a joint probability distribution. And in this case, all that is required is to re-express the statements from the code, using the symbols from the graph. Like this:

    \begin{align*} M_i &\sim \text{Normal}(\mu_i,\sigma)\\ \mu_i &= \alpha_1 + b B_{1,i} + k U_i\\[6pt] D_i &\sim \text{Normal}(\nu_i,\tau)\\ \nu_i &= \alpha_2 + b B_{2,i} + m M_i + k U_i\\[6pt] B_{j,i} &\sim \text{Bernoulli}(p)\\[6pt] U_i &\sim \text{Normal}(0,1) \end{align*}

The little i subscripts in this model indicate some particular dyad of women i. This is just a convention for saying that the definition applies to every dyad of women. The first two lines express the causes influencing M, U and B1. These correspond to line 6 in the simulation code. The third and fourth lines above express the influences on D, U and B2 and M. Line 8 in the simulation has the same structure. Then the last two lines express the generative models for B1 and B2 (lines 5 and 7 in the code) and U (line 3 in the code). Note that there is a new symbol, p, for the probability of either woman being a first born. We know this probability is 0.5 in the population. But here we have to estimate it.

The statement for U, the unobserved confound, needs some explanation. So bear with me for a brief tour of our unobserved confound.

The U values haven’t been observed, so there is no way to estimate their mean and variance. But it’s fine, as you’ll see, to just assign them a standardized normal distribution. That’s what I’ve done here. If we had observed these values, the Normal(0,1) would be called a “likelihood”. But here we have not observed the values. So instead it’s conventional to call it a “prior”. But notice that the information doesn’t change, when a variable is observed. Whether a variable is observed or not is something that happens after the model is specified, not before. These terms “likelihood” and “prior” are useful, but they exist in a state of information super-position, until the variable is observed. This is one of the best features of the Full-Luxury approach, because when data go missing (are not observed), the generative model already knows how to cope with missing data. I’ll provide a link at the end, for depth on this point.

Okay, tour over. Back to building the joint distribution.

This isn’t a complete joint probability distribution yet. We still need probability distributions for the remaining latent variables: b, m, k, and so on. Some people call these variables “parameters”, but they are also unobserved variables, like U. There are several different, useful ways to assign prior distributions to such variables. In my book, I argue that a good way to understand prior distributions is through their effects on observed variables. This involves simulating the observations implied by prior probability assignments. But that’s not the topic here, and I don’t want to get off track. So I’ll just impose some weakly-regularizing distributions here that encode skepticism of large causal effects.

    \begin{align*} \alpha_1,\alpha_2,b,m &\sim \text{Normal}(0,0.5)\\ k,\sigma,\tau &\sim \text{Exponential}(1)\\ p &\sim \text{Beta}(2,2) \end{align*}

The parameter k needs some more explanation. I’ve assigned it an exponential distribution, which is a simple way to constrain it to be positive. We don’t know if the effect of U on M or D is positive or negative. But we need to force it to be one or the other, because the sign of the U values will determine what happens in each dyad. This is a common problem with latent (unobserved) variables. The product of k and any particular U value can be positive if (1) both are positive or (2) both are negative. The model can predict fine either way. If we don’t constrain k, the analysis still works. But sometimes you really have to enforce constraints like this to get the computer to work. That’s Full-Luxury.

(2) Teach the distribution to a computer

Now that we have expressed the generative model this way, it automatically implies (by the axioms of probability) a joint probability distribution of all of the variables. The variables are the data (the observed variables) and the parameters (the latent variables). So if we acquire information about any of these variables, we can update the joint distribution (again, by the axioms of probability) to see if this new information also implies new information about any of the other variables. In this example, the information we receive is observations of M, D, B1, and B2. We want to know what, if anything, these observations imply about m, the causal influence of mom on her daughter.

This might sound a little overwhelming. It’s one thing to state the above, which is really nothing more than a statement that we’ll obey the axioms of probability. There are people who disobey (or restrict) the axioms of probability, and sometimes that’s okay. This town is big enough for all of us. But we obey the axioms here, and that’s very respectable and quite luxurious.

The Bayesian cowboy (left) lives in luxury. The probability outlaw (right) gets the job done anyway he can. [image source]

Still, however luxurious, it’s another thing to actually do it. In a large joint distribution, applying the axioms isn’t easy. But we’re in luck. Since the axioms of probability are rules, and computers are good at applying rules, we do all of this with the computer. There are lots of different algorithms for accomplishing it. But the approach I favor is to use something called a probabilistic programming language to express the generative model (as above) in code, and then let the computer do the rest. This is a well-established approach, and there many different languages and interfaces to choose from.

For this example, I’ll use the Stan language and define the model with my own R interface (rethinking), which simplifies the model syntax so that it more resembles the mathematical definition above. To use this code, you’ll need to install both the rethinking package and the cmdstanr package. Instructions are here. But you could also not install it and just follow along with the example. You can loop back later and deal with the installation. It’s enough to focus on concepts for now.

Here’s the code.

There are many details here that vary from one language and interface to another. But it’s worth scanning each line and making sure you can see which part of the mathematical version it corresponds to.

When you run the code, the computer (the Stan math library) uses automatic differentiation to sample from the approximate joint distribution of the model, conditioning on the observed data. This is a fancy way of saying the computer returns a numerical approximation that we can use to see what information the data contain about the causal effects, if any.

The joint distribution has a lot of variables. More than 200, because there is a variable for each unobserved U value, recall. So let’s just focus on the inference for the effect m, b, and k. The rethinking package gives a concise summary of the marginal distributions of these variables with precis(flbi,pars=c("m","b","k")). The output is:

  mean   sd  5.5% 94.5% n_eff Rhat4
m 0.12 0.10 -0.04  0.27   237  1.01
b 1.85 0.12  1.65  2.05  1889  1.00
k 0.68 0.20  0.30  0.96   156  1.02

Recall that the naive regression (Part 1) returned a central estimate for m of 0.39 with a standard deviation of 0.06. The regression was confident the effect is positive. Now we get a better answer, closer to real causal effect: 0.12 with standard deviation 0.10. Small and not confidently positive. If you increase the sample size or adjust the true value of m in the simulation, you’ll see as you did with the graph-derived estimate in Part 2 that this approach also gets it right.

(3) Simulate causal interventions

In this example, the causal query is so simple that there is a single parameter for it, m. But that isn’t always true. Suppose we also want to know the causal effect of intervening on B1 on D. If we change mom’s birth order, what’s the effect on the daughter’s family size? Now more than one parameter matters. This is a linear model, so from theory the average effect of making the mom first born (B1=1) must be the product of b and m. But if the system weren’t linear, or you didn’t realize that bm is the answer, you can always simulate any intervention you like, even on more than one variable at a time.

I’ll show the code and then explain it.

The first chunk of code computes the distribution of the causal effect of B1 on D. The output is:

        0%        25%        50%        75%       100% 
-0.4564898  0.1027334  0.2273751  0.3477799  0.8091831

So the median is 0.23 and there is a wide range of both negative and positive effects. There really is no clear signal here, and this is because m is small and uncertain.

The rest of the code above simulates the same conclusion. First we suppose B1=0. We simulate the distribution of M, assuming B1=0. This means using the entire joint probability distribution, not just the mean. Then we use those simulated values of M to simulate values of D. Second we suppose B1=1, and we repeat the same steps. This yields two distributions of D, one for B1=0 and one for B1=1. We take the difference of these distributions to get a distribution of the difference, the causal effect of changing B1=0 to B1=1. The quantiles end up exactly the same as those above, because they are computed from the same samples from the joint probability distribution.

You can compute any intervention you like this way. In Part 2, we’d need a different statistical model for each causal query. Here instead we use a single “statistical” model (the joint distribution), but we need different causal simulations for each query. So it’s equivalent work, but back-loaded instead of front loaded.

Ups and Downs

A key advantage of the Full-Luxury approach is that it can automatically discover ways to de-confound inference. You didn’t need to realize anything here about the subtle importance of B1 for removing the contaminating influence of U. Probability theory just figured it out silently. This can be a big help, when the situation changes or becomes more complicated.

For example, suppose B1 also influences D directly, as in the graph below. This might happen if mom’s birth order entitled her to inheritance that could support her daughters as well as herself. In this case, B1 will no longer remove the confound U. Instead its a confound itself that we have to control for. But I’ve also added two new observed variables, V and W. These are children of (caused by) U. We still don’t see U in the data. But it turns out we can use V and W to remove the confounding influence of U. But how? Just adding them to a multiple regression does not fix it.

You can screw around with the algebraic system that the graph above implies and figure out an estimator for m again. Or you can just dump the graph into Full-Luxury Bayesian Inference. I don’t show the code in this post, but you can find it here. It is built in the same fashion as before. It’s just more complicated now.

This approach isn’t all perfume and champagne though. Sometimes the calculations are inefficient or even impossible, and we can do much better if we mix in some algebraic thinking. In the Tale of Two Mothers model, we can make the inference more reliable and faster by not even bothering to include the U values. Instead we can average over their unknown values and get the same answer. This is accomplished by treating M and D as pairs of values drawn from a common distribution with some correlation induced by the unmeasured confound U. After this conversion, the model looks like this:

    \begin{align*} \begin{pmatrix}M_i \\ D_i \end{pmatrix} &\sim \text{MVNormal}\left(  \begin{pmatrix} \mu_i \\ \nu_i \end{pmatrix} , \mathbf \Sigma \right) \\ \mu_i &= \alpha_1 + b B_{1,i} \\ \nu_i &= \alpha_2 + b B_{2,i} + m M_i \\[6pt] B_{j,i} &\sim \text{Bernoulli}(p) \end{align*}

where \mathbf \Sigma is a covariance matrix that attempts to learn the residual association between M and D due to U. To save space (this post is already too long), I don’t show the code the implement this model. But you can find it here. It runs much more efficiently than the previous code. This is maybe not surprising, since it omits 200 parameters (the unobserved U values).

Another problem with the Full-Luxury approach is that it doesn’t actually explain how the inference works. It analyzes the data, but not the model. So we don’t necessarily understand how the analysis extracts information. It’s all hidden under the fog of probability theory. Sometimes that’s okay. But understanding why an inference is or is not possible helps us anticipate or design better research. It’s not something we should ignore.

Still, the Causal Design approach (Part 2) by itself is incomplete, because it needs some way to really do the calculation. The expected value in an infinite sample isn’t something we can actually use. In Part 2, I did a quick bootstrap to get uncertainty. But bootstrap procedures are not always possible or practical. The inference situation here is actually a special example of an instrumental variable analysis. Many people would conduct such an analysis using a procedure called Two-Stage-Least-Squares (2SLS). This procedure tries to calculate the correct uncertainty in a finite sample. But it is neither efficient nor particularly reliable, as the people who use it are well-aware. The Full-Luxury approach here is both more reliable and more flexible. It’s also only recently practical to implement.

The Full-Luxury approach and the Causal Design approach complement one another. One of the most important ways this is true is to use do-calculus (or analogous logic) to figure out when inference is possible and which variables can be ignored, but then to use Bayes to do the calculations. Sure, we could include all the of variables in the analysis, but do-calculus (or similar) will usually tell us that we don’t have to.

This is important to emphasize, because there is a traditional division between these communities. Occasionally both sides imply the other is fundamentally wrong about something (see this post and links in first paragraphs). Some of the disagree is surely just confusion. The different approaches can use common vocabulary like “condition” in different ways. In a multiple regression (Part 1), “condition on X” means to include X as a predictor. In Causal Design (Part 2), “condition on X” means p(Y|X). It’s very literal. In Full-Luxury Bayes, it is something of a commandment to “condition on all available information”. But “condition on X” for Bayes means to include X in the joint probability distribution. But how it plugs into the join distribution depends upon the causal model. This lets Full-Luxury Bayes do things like “condition on a collider” (something Causal Design rightly warns against) without problems, because “condition” here doesn’t mean “use a predictor in a regression”.

I use both Causal Design and Full-Luxury Bayes (also in my book) and find them highly synergistic. And both can be profitable without the other as well. But they are also the same in an important way. They both depend upon the definition and analysis of a generative model. This model doesn’t need to be a DAG or even a graph. It could be a set of differential equations. But either way, the implications of the generative model need to be used to build a statistical strategy. And when that is done logically, both end up at approximately equivalent answers, even though the steps to those answers can look very different.

And neither promises to deliver the answer you want, but rather that you have to put in a lot of assumptions—an entire generative model—in order to get meager inference out. Often it is possible to test some of those assumptions. But never all of them.

When I started this series, I promised to disappoint you. I hope, despite that promise, that this has been a hopeful message. In the midst of all the necessary assumptions and computations and fighting with computers, there is a fighting chance to produce powerful inference, and the tools that make it possible are mostly free and highly democratized.

The more we use these tools, and the less we do “statistics”, the better.

Read More

Causal Inference with Bayes Rule: [blog post] [paper]

My book has many examples of both do-calculus and Full-Luxury Bayes: [Statistical Rethinking]

Instrumental variables used to be very popular in econometrics. But the field has soured on them, for good reasons. See this paper for example. Meanwhile, genetics has become more interested in them, under the name Mendelian randomization.

The Full-Luxury Approach to missing data: The last lecture of my 2019 course focused on this – [slides] [video]. This talk covers similar ground.

The Stan library used here is extremely well documented. There are many examples in the user guide.

Another very good interface is Turing.jl.