21 June 2021 by Richard

The brilliance of artificial intelligence is that it is much better than us, its creators, at tasks that we find challenging. Decades of research have resulted in chess programs that can beat even the best human players. Even simple AI is much better than we are at basic mathematics and search. Children train for years to perform tasks that simple computers excel at.

The stupidity of artificial intelligence is that it is often comically bad at tasks that people find easy. Like distinguishing curly tails from cinnamon rolls (pictured above). Even very good image recognition systems—deep neural networks—can be tricked by the introduction of noise that has no effect on human observers. For example, introducing noise to the panda below leads the AI to become very confident that it is a picture of a gibbon (see the paper).

So-called **adversarial images** can trick AI systems into
potentially deadly mistakes. Panda or gibbon is maybe not so serious,
but a self-driving car that can be forced to mistake STOP for YIELD
might be a problem.

**An image recognition AI thinks like a regression.** Its
job is to compress patterns into some algorithmic form that can
categorize and repeat those patterns. The AI, like a regression, knows
nothing about the causes of those patterns. So it is confused by
superficial similarities—it has no way to judge them as superficial—and
can be fooled by tiny changes that would not fool any person.

Why are people better at this than the computer? One reason is that we experience the image as the result of a scene that contains 3D objects. The computer does not. It’s just a grid of colors and edges and features. It has no way to recognize superficial resemblance. And as the panda example above shows, the AI is really dreaming in a space no person wants to visit. It’s trapped in Plato’s Cave.

People think **more like graphs than like regressions**.
When we walk around town or succeed at some science, we learn about
causes, not just patterns. Why is this thinking like a graph? What I
mean by graph is a directed graph, a graph with arrows. These arrows
represent causes. A regression in contrast has no direction to it. It
sees no causes just as the AI sees no objects. A graph can draw and see
the same pictures as a regression, but it understands them differently
and has expectations for what would happen if we were to make tiny
changes.

In **Part 1**, I
presented **Causal Salad**. Widespread in the sciences,
causal salad tosses data together without any thought given to the
causal relationships among them. The regression (or any other machine
learning tool) learns the patterns in the data, and then a human
interprets these patterns as measures of causes. In that presentation, I
showed that with this approach more data can actually produce worse
inference. It’s gibbons and cinnamon buns.

Now let’s turn to the second approach described in Part 1,
**Causal Design**. This is just my label for a family of
approaches that use explicit causal models to design and interpret
statistical estimates. Causal models vary in form, but they are all
graphs in the sense that they contain arrows, directional relationships
in which some things are inputs for other things. **X —>
Y** means that a change in **X** can result in a
change in **Y**, but that a change in **Y**
will have no impact on **X**. A statistical model, in
contrast, is a set of associations. These associations do not have
arrows. Scientific models do, even though we don’t (or can’t) always
represent them as graphs.

Let’s revisit the Tale of Two Mothers from Part 1. Same data, same goal.
But now let’s think like a graph. How do the variables
**M**, **D**, **B1**, and
**B2** relate to one another? And how can we use these
relationships to design a statistical analysis?

The first step is to stop thinking about regression and start thinking about a generative model of the data. So let’s think about each variable and which other variables influence it. Here is the simulation code again, to get us moving.
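The original R simulation isn't reproduced in this text, so here is a Python sketch of the same generative structure as described below (the coefficient values of 2 are illustrative assumptions; the true effect of **M** on **D** is set to zero, as in Part 1):

```python
import numpy as np

rng = np.random.default_rng(42)
N = 200                             # sample size
U = rng.normal(size=N)              # unobserved confound shared by mom and daughter
B1 = rng.binomial(1, 0.5, N)        # mom's birth order
M = rng.normal(2 * B1 + U)          # mom's family size: influenced by B1 and U
B2 = rng.binomial(1, 0.5, N)        # daughter's birth order
D = rng.normal(2 * B2 + U + 0 * M)  # daughter's family size: true effect of M is 0
```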

First let’s consider **M**, mom’s family size. This is
influenced by **B1**, her birth order. But it’s also
influenced by the unobserved confounds **U**. So
**M** is some function of **B1** and
**U**. We could make this into a graph, with arrows
indicating causal influence, like this:

**B1 —> M <— U**

The daughter’s family size **D** is influenced, possibly,
by **M**. That’s the point of the exercise, to estimate
that influence. What else influences **D**?
**B2** and **U**, just like
**M**. So **D** is some function of
**M**, **B2**, and **U**. If you
draw this as a graph, there will be three arrows entering
**D**.

The other variables, **B1** and **B2** and
**U**, aren’t influenced by any of the other variables we
have here. While they surely have their own causes, we aren’t modeling
those.

Let’s put it all together now, as a single graph. I will also add some
labels on the arrows, so that we can talk about specific arrows. We’ll
need to talk about arrows, if we are going to turn this into a
statistical strategy. I’ll also draw **U** with a circle
around it, which is a convention for indicating that the variable is
unobserved. We don’t see it in the data, but we do believe it has
influenced the data.

Putting the pieces together, the full graph is: **B1 —> M** (labeled *b*), **U —> M**, **M —> D** (labeled *m*), **B2 —> D**, and **U —> D**. The arrow labeled *m* is what we are after. What is the causal influence of **M** on **D**?

Now let’s try thinking like a graph. The generative model is a linear one: Each variable is some additive combination of influences. You can see this in the simulation in Part 1. In a linear model like this one, the covariance between two variables can be calculated directly from the graph. There are a bunch of rules for how to do this, and if you’ve ever studied structural equation models or path analysis, then you know what I’m talking about.

But you don’t need all the rules for this example. Here’s what we need.

If *b* is the causal influence of **B1** on **M**, then thinking like the graph, the expected covariance is cov( **B1** , **M** ) = *b* var( **B1** ). So we can estimate *b* as:

*b* = cov( **B1** , **M** ) / var( **B1** )

and the result (about 2.2) is the same estimate you get from a
regression of **M** on **B1**: lm( M ~ B1 ).
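In code (a Python sketch with simulated data; the true *b* = 2 here is an assumption for illustration), the ratio of covariances and the regression slope agree:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000                          # large N so the estimate is precise
U = rng.normal(size=N)
B1 = rng.binomial(1, 0.5, N)
M = rng.normal(2 * B1 + U)           # true b = 2 (illustrative assumption)

# b from the graph: cov(B1, M) / var(B1)
b_hat = np.cov(B1, M)[0, 1] / np.var(B1, ddof=1)

# the same estimate from an ordinary regression of M on B1
slope = np.polyfit(B1, M, 1)[0]
```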

Now here’s the cool trick. We want to learn *m*, but we can’t get it directly, because a regression lm( D ~ M ) is confounded by **U**.

But watch this. The covariance of **B1** and
**D** is something we can compute. And the expected
covariance, thinking like the graph, is:

cov( **B1** , **D** ) = *b m* var( **B1** )

When there are multiple arrows on the path, we multiply the causes. Okay
so now we are ready to estimate *m*, because we
know or can estimate every other part of the formula above. Solving for
*m*:

*m* = cov( **B1** , **D** ) / ( *b* var( **B1** ) )

We know the formula for *b* = cov( **B1** , **M** ) / var( **B1** ), so substituting it in:

*m* = cov( **B1** , **D** ) / cov( **B1** , **M** )

And there you have it. Try out that calculation in R and you’ll see it
gets very close to the right answer (zero). Now modify the simulation so
that * m* is some non-zero value like 0.5 and try
again. You’ll see that you again get close to the right answer, using
the ratio of covariances above. If you increase the sample size in the
simulation to, for example, N = 1e5, you should get the causal effect
exactly right.
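Here is that calculation as a Python sketch (simulated data with the same assumed structure; true *m* = 0): the naive slope of **D** on **M** is pulled away from zero by the confound, while the ratio of covariances lands near the right answer:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000
U = rng.normal(size=N)
B1 = rng.binomial(1, 0.5, N)
M = rng.normal(2 * B1 + U)
B2 = rng.binomial(1, 0.5, N)
D = rng.normal(2 * B2 + U + 0 * M)   # true m = 0

# naive slope of D on M: confounded by U
naive = np.cov(M, D)[0, 1] / np.var(M, ddof=1)

# graph-derived estimator: m = cov(B1, D) / cov(B1, M)
m_hat = np.cov(B1, D)[0, 1] / np.cov(B1, M)[0, 1]
```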

But obviously what is missing here is any sense of the uncertainty of
our estimate of * m*, the causal influence of mom
on daughter. That’s a separate issue, and one that thinking like a graph
doesn’t really solve. I will have much more to say about this in Part 3.
In this case, you can bootstrap and get a useful error estimate. The
code to repeat the data simulation and get a bootstrap estimate is
below. You can modify the “true” effect of mom on daughter in line 8
(the 0*M part) to simulate and recover different causal effects. You
might also want to play with the sample size in line 2.
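The author's R code is not reproduced in this text (the line numbers above refer to it), so here is a Python sketch of the same idea: simulate the data once, then bootstrap-resample it to get a standard error for the ratio estimator. The coefficient values and helper names are mine, for illustration:

```python
import numpy as np

def simulate(N=200, m=0.0, rng=None):
    """Simulate the Tale of Two Mothers data; m is the true effect of M on D."""
    rng = rng if rng is not None else np.random.default_rng()
    U = rng.normal(size=N)
    B1 = rng.binomial(1, 0.5, N)
    M = rng.normal(2 * B1 + U)
    B2 = rng.binomial(1, 0.5, N)
    D = rng.normal(2 * B2 + U + m * M)   # change m to simulate other true effects
    return B1, M, D

def estimate_m(B1, M, D):
    """The graph-derived ratio estimator: cov(B1, D) / cov(B1, M)."""
    return np.cov(B1, D)[0, 1] / np.cov(B1, M)[0, 1]

rng = np.random.default_rng(42)
B1, M, D = simulate(N=200, m=0.0, rng=rng)

# bootstrap: resample rows with replacement, recompute the estimator
boot = []
for _ in range(1000):
    idx = rng.integers(0, len(B1), size=len(B1))
    boot.append(estimate_m(B1[idx], M[idx], D[idx]))

point, se = estimate_m(B1, M, D), np.std(boot)
```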

Now you’re thinking like a graph. We used only the same data as in Part
1, where including **B1** in the model actually made
inference worse. Here we found a way to use it to get the right answer.
But we didn’t use a regression, not in the usual sense. Instead we
started with a generative perspective and derived a way to estimate the
causal effect of interest.

I’ll say more about this general approach. But first let’s address the
looming question of why the regression (Part 1) gets things so wrong.
This example is a special case of a general phenomenon known as
**bias amplification**. Bias amplification arises when (1) the
exposure of interest (**M** here) is confounded (by
**U** here) and (2) we add a variable (**B1**
here) that is a strong predictor of the exposure (**M**
here). When this happens, the additional variable tends to amplify the
original confound and overall make inference worse. That is what we saw
in Part 1. As a rule of thumb, it is better to add variables like
**B2**, strong predictors of the outcome, than like
**B1**, strong predictors of the exposure. I’ll drop a
citation at the end of this post, if you want to chase the details.

Bias-amplification happens when you think like a regression: Which variables should I add to the model? Instead we’re thinking like a graph. This provides better questions and principled answers. Schematically, thinking like a graph involves three stages:

**Step 1**: Specify a generative model of the data,
including any unmeasured confounds. It could be a crude graph or a
detailed system of equations. The important thing is that it contains
causal structure.

**Step 2**: Choose a specific exposure of interest (here
**M**) and outcome of interest (here **D**).

**Step 3**: Use the model structure (in this example, the
graph) to deduce a procedure for calculating the causal influence of the
exposure on the outcome.

If you want to know more than one causal effect, you need to repeat
steps 2 and 3 more than once. This is something to really emphasize:
**A single causal model implies a variety of statistical models,
possibly a different statistical model for each causal
question.**

In this example, I used a causal model that specified the exact
functions that related the different variables. It was all linear. But
there is a more general approach that avoids the functions altogether.
It uses graphs called **DAGs: Directed Acyclic Graphs**.
DAGs look like the graph we made above, but they don’t say anything
about the exact functions that relate the variables to one another. It
turns out a lot can be said about the possibility (or impossibility) of
causal inference, even when all we have is a DAG. There is a framework,
known as **do-calculus**, that allows us to query a DAG
about a causal effect and decide if there is a way to estimate it. This
framework is very general, and can deal with many diverse problems
ranging from missing data to measurement error to generalizability.

There is a large literature on this approach. It’s rather standard in some sciences, despite being completely absent in others. I’ll put some pointers at the bottom of this post.

The key idea of do-calculus is to ask how we can use either statistical
or design choices to modify a graph so that it contains no confounds for
some association of interest. When the confounds are removed, the
remaining association is an estimate of the causal effect. In principle,
modifying a graph to remove all confounds is easy: Just remove all
arrows entering the exposure of interest. This is how we represent an
intervention, setting the exposure to some value. The causal influence
of interest is defined by this intervention, written as
*p*(**D**|*do*(**M**)), the
distribution of **D** when we intervene on
**M**.

For example, here is the graph from before, but with all arrows entering
**M** removed: only **M —> D**, **B2 —> D**, and **U —> D** remain.

In this scenario, the association between **M** and
**D** reflects the causal relationship
*p*(**D**|*do*(**M**)). (We
might still want to include **B2** in an analysis, because
it would help with precision.) When we can do a controlled experiment,
we are effectively doing exactly this, removing the arrows entering the
exposure. If we could experimentally set family sizes
**M**, that would mean removing all other influences on
**M**. That means no arrows entering it. This is what a
randomized experiment attempts to accomplish.
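We can check this in simulation (a sketch using the same assumed coefficients as before): assigning **M** at random severs **B1 —> M** and **U —> M**, and then the plain slope of **D** on **M** recovers the true effect, which is zero here:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000
U = rng.normal(size=N)
B2 = rng.binomial(1, 0.5, N)

# the intervention do(M): M is assigned at random, so no arrows enter it
M = rng.normal(size=N)
D = rng.normal(2 * B2 + U + 0 * M)   # true effect of M on D is 0

slope = np.cov(M, D)[0, 1] / np.var(M, ddof=1)
```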

In practice, experiments don’t always assign treatments effectively (the “intent-to-treat” problem), and in our example it would be both impractical and monstrously unethical to try to randomly assign family sizes to women. In these cases, do-calculus provides an algorithm for deducing purely statistical ways to convert the original graph into the one above. So-called “natural” experiments try for the same result, finding some statistical way to mimic a randomized experiment and get the new graph.

When you have more detail than a DAG, the general approach remains
valid, although the details are different. We still want to
statistically transform one graph (or set of equations) into another. We
did exactly that with our *m* = cov( **B1** , **D** ) / cov( **B1** , **M** ) calculation above.

You don’t have to eat the causal salad. Neither your own nor anyone else’s. The first and easiest step is just to ask which generative assumptions allow us to claim that some estimate is a causal estimate. Really the interpretation of statistical results always depends upon causal assumptions, assumptions that ultimately cannot be tested with available data.

If you find this disappointing, I have three things to say.

First, I warned you already at the start of Part 1.

Second, this is only disappointing because methods have been oversold. Researchers have been taught to think of statistical methods as a kind of sorcery that can conjure causal facts from data, as long as you have enough of it. Maybe this sounds harsh, but when a group of experts rely upon a set of methods to make decisions for them, yet have little mechanical understanding of those methods, fear to deviate from convention, and lack any formal framework for justifying those conventions, that sounds like sorcery. We must do better.

Third, the perspective here is actually very powerful. We shouldn’t be disappointed at all. The do-calculus and related formal methods of causal inference are extraordinary achievements. They make inference possible in frustrating contexts, like the tale of the two mothers.

There’s a chance, is what I’m saying.

If you learn nothing else, learn this: The Table 2 Fallacy

On bias amplification: Understanding Bias Amplification

Short textbook do-calculus and causal inference: Causal Inference in Statistics: A Primer

My own book & lectures contain working examples (with code) of this framework. The DAG content begins with Lecture 5. All materials: Statistical Rethinking Book and Lectures