29 June 2021 by Richard

Hyakujō’s Fox is a classic Zen kōan attested from the year 1036 CE [link]. It is surely much older. It gives the story of a monk who is transformed into a fox, because he denies that an enlightened person is subject to cause and effect. He is later freed when he realizes his error.

Like all kōan’s, there are varying interpretations, and they are all correct. But many of them assert that there is no escape from causation. The enlightened person becomes enlightened through cause and effect, and denying it abnegates enlightenment. Everyone is subject to cause and effect, whether they ignore it through ignorance or realize it through enlightenment.

I’m talking about foxes and enlightenment, because I want to use this story to tie together the three parts of this series.

In **Part 1**,
**Thinking Like a Regression** (if you haven’t read it, go
back and start there), we used a common statistical model that, like
most common statistical models, had no conscious connection to
causation. It mostly ignored it. But its behavior was of course a direct
consequence of the causal model that it ignored. This is the approach I
called **Causal Salad**. It can provide valid inference.
But you have to get lucky. And more information can even hurt you.

In **Part 2**,
**Thinking Like a Graph**, we took a big step on the path
to enlightenment. We made a distinction between the causal model and the
statistical model. The statistical model was then deliberately designed
to answer a specific causal question. But the statistical model does not
contain the causal model, and the causal model cannot be reconstructed
from the statistical model. The statistical model will not in general
even contain all of the data in the full causal model, and any one
causal model may imply a large number of different statistical models,
depending upon the question and the data. This is the approach I labeled
**Causal Design**. It serves us in the design of
statistical models as well as in the design of research.

Now in Part 3, **Thinking Like a Probability
Distribution**, the distinction between the causal model and the
statistical model shrinks. Instead of building a separate statistical
model, we’ll just use the causal model itself. This is enlightenment.
This means we’ll use all of the variables and express all of their
relationships as a joint probability distribution. Then any available
data can be used to constrain the joint distribution, eliminating
possibilities and refining our information about causal effects.

This is the approach I called (in Part 1) **Full-Luxury Bayesian
Inference**. Instead of many statistical models for any one
causal model, it uses one statistical model and possibly many distinct
simulations from it. This approach doesn’t require us to realize or
derive the statistical implications of the causal model. It does that
automatically. And it leads readily to ways to estimate quantities in
finite samples, even when there is missing data and measurement error
and other commonplace terrors.

But before you get too excited, I use the term
**Full-Luxury** with some irony. Like all luxuries, this
approach has costs. The machinery required to make it work can be
complicated. And for some causal models, it is highly impractical or
even impossible to reliably perform the calculations. It’s often true
that we can make the machine more efficient by transforming the
generative model a bit. This is one of several reasons that
**Causal Design**, using analysis in the spirit of
do-calculus, provides unique benefits that can be combined with
Full-Luxury Inference. I’ll show you what I mean.

So let’s return to the Tale of Two Mothers (pictured, below) from Parts
1 and 2 and see how to apply the Full-Luxury approach. The key idea is
to just use the generative (causal) model to define a joint probability
distribution of all of the variables. The steps involved are: (1)
**express** the causal model as a joint probability
distribution, (2) **teach** this distribution to a computer
and let the computer figure out what the data imply about the other
variables, (3) use generative **simulations** to measure
different causal interventions. I’ll walk you through a complete
example, with code. Let’s go one step at a time.

- Express the model as a joint distribution

To remind you, the generative model in this example is given by the simulation itself:

And the graphical version (from Part 2), which includes symbols like
* b* and

Our job now is to write this generative model as a joint probability distribution. And in this case, all that is required is to re-express the statements from the code, using the symbols from the graph. Like this:

The little * i* subscripts in this model indicate
some particular dyad of women

The statement for **U**, the unobserved confound, needs
some explanation. So bear with me for a brief tour of our unobserved
confound.

The **U** values haven’t been observed, so there is no way
to estimate their mean and variance. But it’s fine, as you’ll see, to
just assign them a standardized normal distribution. That’s what I’ve
done here. If we had observed these values, the Normal(0,1) would be
called a “likelihood”. But here we have not observed the values. So
instead it’s conventional to call it a “prior”. But notice that the
information doesn’t change, when a variable is observed. Whether a
variable is observed or not is something that happens *after* the
model is specified, not before. These terms “likelihood” and “prior” are
useful, but they exist in a state of information super-position, until
the variable is observed. This is one of the best features of the
Full-Luxury approach, because when data go missing (are not observed),
the generative model already knows how to cope with missing data. I’ll
provide a link at the end, for depth on this point.

Okay, tour over. Back to building the joint distribution.

This isn’t a complete joint probability distribution yet. We still need
probability distributions for the remaining latent variables:
* b*,

The parameter * k* needs some more explanation.
I’ve assigned it an exponential distribution, which is a simple way to
constrain it to be positive. We don’t know if the effect of

- Teach the distribution to a computer

Now that we have expressed the generative model this way, it
automatically implies (by the axioms of probability) a joint probability
distribution of all of the variables. The variables are the data (the
observed variables) and the parameters (the latent variables). So if we
acquire information about any of these variables, we can update the
joint distribution (again, by the axioms of probability) to see if this
new information also implies new information about any of the other
variables. In this example, the information we receive is observations
of **M**, **D**, **B1**, and
**B2**. We want to know what, if anything, these
observations imply about * m*, the causal
influence of mom on her daughter.

This might sound a little overwhelming. It’s one thing to state the above, which is really nothing more than a statement that we’ll obey the axioms of probability. There are people who disobey (or restrict) the axioms of probability, and sometimes that’s okay. This town is big enough for all of us. But we obey the axioms here, and that’s very respectable and quite luxurious.

Still, however luxurious, it’s another thing to actually do it. In a
large joint distribution, applying the axioms isn’t easy. But we’re in
luck. Since the axioms of probability are rules, and computers are good
at applying rules, we do all of this with the computer. There are lots
of different algorithms for accomplishing it. But the approach I favor
is to use something called a
**probabilistic
programming language** to express the generative model (as
above) in code, and then let the computer do the rest. This is a
well-established approach, and there many different languages and
interfaces to choose from.

For this example, I’ll use the
**Stan**
language and define the model with my own R interface
(rethinking),
which simplifies the model syntax so that it more resembles the
mathematical definition above. To use this code, you’ll need to install
both the rethinking package and the
cmdstanr
package. Instructions are
here.
But you could also not install it and just follow along with the
example. You can loop back later and deal with the installation. It’s
enough to focus on concepts for now.

Here’s the code.

There are many details here that vary from one language and interface to another. But it’s worth scanning each line and making sure you can see which part of the mathematical version it corresponds to.

When you run the code, the computer (the Stan math library) uses automatic differentiation to sample from the approximate joint distribution of the model, conditioning on the observed data. This is a fancy way of saying the computer returns a numerical approximation that we can use to see what information the data contain about the causal effects, if any.

The joint distribution has a lot of variables. More than 200, because
there is a variable for each unobserved **U** value,
recall. So let’s just focus on the inference for the effect
* m*,

mean sd 5.5% 94.5% n_eff Rhat4 m 0.12 0.10 -0.04 0.27 237 1.01 b 1.85 0.12 1.65 2.05 1889 1.00 k 0.68 0.20 0.30 0.96 156 1.02

Recall that the naive regression (Part 1) returned a central estimate
for * m* of 0.39 with a standard deviation of
0.06. The regression was confident the effect is positive. Now we get a
better answer, closer to real causal effect: 0.12 with standard
deviation 0.10. Small and not confidently positive. If you increase the
sample size or adjust the true value of

- Simulate causal interventions

In this example, the causal query is so simple that there is a single
parameter for it, * m*. But that isn’t always
true. Suppose we also want to know the causal effect of intervening on

I’ll show the code and then explain it.

The first chunk of code computes the distribution of the causal effect
of **B1** on **D**. The output is:

0% 25% 50% 75% 100% -0.4564898 0.1027334 0.2273751 0.3477799 0.8091831

So the median is 0.23 and there is a wide range of both negative and
positive effects. There really is no clear signal here, and this is
because * m* is small and uncertain.

The rest of the code above simulates the same conclusion. First we
suppose **B1**=0. We simulate the distribution of
**M**, assuming **B1**=0. This means using the
entire joint probability distribution, not just the mean. Then we use
those simulated values of **M** to simulate values of
**D**. Second we suppose **B1**=1, and we
repeat the same steps. This yields two distributions of
**D**, one for **B1**=0 and one for
**B1**=1. We take the difference of these distributions to
get a distribution of the difference, the causal effect of changing
**B1**=0 to **B1**=1. The quantiles end up
exactly the same as those above, because they are computed from the same
samples from the joint probability distribution.

You can compute any intervention you like this way. In Part 2, we’d need a different statistical model for each causal query. Here instead we use a single “statistical” model (the joint distribution), but we need different causal simulations for each query. So it’s equivalent work, but back-loaded instead of front loaded.

A key advantage of the Full-Luxury approach is that it can automatically
discover ways to de-confound inference. You didn’t need to realize
anything here about the subtle importance of **B1** for
removing the contaminating influence of **U**. Probability
theory just figured it out silently. This can be a big help, when the
situation changes or becomes more complicated.

For example, suppose **B1** also influences
**D** directly, as in the graph below. This might happen if
mom’s birth order entitled her to inheritance that could support her
daughters as well as herself. In this case, **B1** will no
longer remove the confound **U**. Instead its a confound
itself that we have to control for. But I’ve also added two new observed
variables, **V** and **W**. These are children
of (caused by) **U**. We still don’t see **U**
in the data. But it turns out we can use **V** and
**W** to remove the confounding influence of
**U**. But how? Just adding them to a multiple regression
does not fix it.

You can screw around with the algebraic system that the graph above
implies and figure out an estimator for * m*
again. Or you can just dump the graph into Full-Luxury Bayesian
Inference. I don’t show the code in this post, but you can
find
it here. It is built in the same fashion as before. It’s just more
complicated now.

This approach isn’t all perfume and champagne though. Sometimes the
calculations are inefficient or even impossible, and we can do much
better if we mix in some algebraic thinking. In the Tale of Two Mothers
model, we can make the inference more reliable and faster by not even
bothering to include the **U** values. Instead we can
average over their unknown values and get the same answer. This is
accomplished by treating **M** and **D** as
pairs of values drawn from a common distribution with some correlation
induced by the unmeasured confound **U**. After this
conversion, the model looks like this:

where the Sigma ∑ is a covariance matrix that attempts to learn the
residual association between **M** and **D**
due to **U**. To save space (this post is already too
long), I don’t show the code the implement this model. But you can
find
it here. It runs much more efficiently than the previous code. This
is maybe not surprising, since it omits 200 parameters (the unobserved
**U** values).

Another problem with the Full-Luxury approach is that it doesn’t actually explain how the inference works. It analyzes the data, but not the model. So we don’t necessarily understand how the analysis extracts information. It’s all hidden under the fog of probability theory. Sometimes that’s okay. But understanding why an inference is or is not possible helps us anticipate or design better research. It’s not something we should ignore.

Still, the Causal Design approach (Part 2) by itself is incomplete, because it needs some way to really do the calculation. The expected value in an infinite sample isn’t something we can actually use. In Part 2, I did a quick bootstrap to get uncertainty. But bootstrap procedures are not always possible or practical. The inference situation here is actually a special example of an instrumental variable analysis. Many people would conduct such an analysis using a procedure called Two-Stage-Least-Squares (2SLS). This procedure tries to calculate the correct uncertainty in a finite sample. But it is neither efficient nor particularly reliable, as the people who use it are well-aware. The Full-Luxury approach here is both more reliable and more flexible. It’s also only recently practical to implement.

The Full-Luxury approach and the Causal Design approach complement one
another. One of the most important ways this is true is to use
do-calculus (or analogous logic) to figure out when inference is
possible and which variables can be ignored, but then to use Bayes to do
the calculations. Sure, we *could* include all the of variables
in the analysis, but do-calculus (or similar) will usually tell us that
we don’t have to.

This is important to emphasize, because there is a traditional division
between these communities. Occasionally both sides imply the other is
fundamentally wrong about something (see
this
post and links in first paragraphs). Some of the disagree is surely
just confusion. The different approaches can use common vocabulary like
“condition” in different ways. In a multiple regression (Part 1),
“condition on **X**” means to include **X** as
a predictor. In Causal Design (Part 2), “condition on
**X**” means
*p*(**Y**|**X**). It’s very literal. In
Full-Luxury Bayes, it is something of a commandment to “condition on all
available information”. But “condition on **X**” for Bayes
means to include **X** in the joint probability
distribution. But how it plugs into the join distribution depends upon
the causal model. This lets Full-Luxury Bayes do things like “condition
on a collider” (something Causal Design rightly
warns
against) without problems, because “condition” here doesn’t mean
“use a predictor in a regression”.

I use both Causal Design and Full-Luxury Bayes (also in my book) and find them highly synergistic. And both can be profitable without the other as well. But they are also the same in an important way. They both depend upon the definition and analysis of a generative model. This model doesn’t need to be a DAG or even a graph. It could be a set of differential equations. But either way, the implications of the generative model need to be used to build a statistical strategy. And when that is done logically, both end up at approximately equivalent answers, even though the steps to those answers can look very different.

And neither promises to deliver the answer you want, but rather that you have to put in a lot of assumptions—an entire generative model—in order to get meager inference out. Often it is possible to test some of those assumptions. But never all of them.

When I started this series, I promised to disappoint you. I hope, despite that promise, that this has been a hopeful message. In the midst of all the necessary assumptions and computations and fighting with computers, there is a fighting chance to produce powerful inference, and the tools that make it possible are mostly free and highly democratized.

The more we use these tools, and the less we do “statistics”, the better.

Causal Inference with Bayes Rule: [blog post] [paper]

My book has many examples of both do-calculus and Full-Luxury Bayes: [Statistical Rethinking]

Instrumental variables used to be very popular in econometrics. But the field has soured on them, for good reasons. See this paper for example. Meanwhile, genetics has become more interested in them, under the name Mendelian randomization.

The Full-Luxury Approach to missing data: The last lecture of my 2019 course focused on this – [slides] [video]. This talk covers similar ground.

The Stan library used here is extremely well documented. There are many examples in the user guide.

Another very good interface is Turing.jl.