21 June 2021 by Richard
Regression, Fire, and Dangerous Things (2/3)
I’ll worry about the singularity when AI isn’t confused about cinnamon rolls
Thinking Like a Graph
The brilliance of artificial intelligence is that it is much better than us, its creators, at tasks that we find challenging. Decades of research have resulted in chess programs that can beat even the best human players. Even simple AI is much better than we are at basic mathematics and search. Children train for years to perform tasks that simple computers excel at.
The stupidity of artificial intelligence is that it is often comically bad at tasks that people find easy. Like distinguishing curly tails from cinnamon rolls (pictured above). Even very good image recognition systems—deep neural networks—can be tricked by the introduction of noise that has no effect on human observers. For example, introducing noise to the panda below leads the AI to become very confident that it is a picture of a gibbon (see the paper).
So-called adversarial images can trick AI systems into potentially deadly mistakes. Panda or gibbon is maybe not so serious, but a self-driving car that can be forced to mistake STOP for YIELD might be a problem.
An image recognition AI thinks like a regression. Its job is to compress patterns into some algorithmic form that can categorize and repeat those patterns. The AI, like a regression, knows nothing about the causes of those patterns. So it is confused by superficial similarities (it has no way to judge them as superficial) and can be fooled by tiny changes that would not fool any person.
Why are people better at this than the computer? One reason is that we experience the image as the result of a scene that contains 3D objects. The computer does not. To the computer, the image is just a grid of colors and edges and features. It has no way to recognize superficial resemblance. And as the panda example above shows, the AI is really dreaming in a space no person wants to visit. It’s trapped in Plato’s Cave.
People think more like graphs than like regressions. When we walk around town or succeed at some science, we learn about causes, not just patterns. Why is this thinking like a graph? What I mean by graph is a directed graph, a graph with arrows. These arrows represent causes. A regression in contrast has no direction to it. It sees no causes just as the AI sees no objects. A graph can draw and see the same pictures as a regression, but it understands them differently and has expectations for what would happen if we were to make tiny changes.
In Part 1, I presented Causal Salad. Widespread in the sciences, causal salad tosses data together without any thought given to the causal relationships among them. The regression (or any other machine learning tool) learns the patterns in the data, and then a human interprets these patterns as measures of causes. In that presentation, I showed that with this approach more data can actually produce worse inference. It’s gibbons and cinnamon buns.
Now let’s turn to the second approach described in Part 1, Causal Design. This is just my label for a family of approaches that use explicit causal models to design and interpret statistical estimates. Causal models vary in form, but they are all graphs in the sense that they contain arrows, directional relationships in which some things are inputs for other things. X —> Y means that a change in X can result in a change in Y, but that a change in Y will have no impact on X. A statistical model, in contrast, is a set of associations. These associations do not have arrows. Scientific models do, even though we don’t (or can’t) always represent them as graphs.
Let’s revisit the Tale of Two Mothers from Part 1. Same data, same goal. But now let’s think like a graph. How do the variables M, D, B1, and B2 relate to one another? And how can we use these relationships to design a statistical analysis?
The first step is to stop thinking about regression and start thinking about a generative model of the data. So let’s think about each variable and which other variables influence it. Here is the simulation code again, to get us moving.
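A minimal sketch of that simulation follows. The seed, the sample size, and the coefficient values (2 for birth order, 1 for the confound) are assumptions chosen for illustration and are not necessarily the values used in Part 1. The true effect of M on D is set to zero through the 0 * M term.

```r
set.seed(1908)                            # assumed seed, for reproducibility
N <- 200                                  # number of mother-daughter pairs (assumed)
U <- rnorm(N)                             # unobserved confound shared by each pair
# birth order (1 = first born) and family sizes
B1 <- rbinom(N, size = 1, prob = 0.5)     # mother's birth order
M <- rnorm(N, mean = 2 * B1 + U)          # mother's family size
B2 <- rbinom(N, size = 1, prob = 0.5)     # daughter's birth order
D <- rnorm(N, mean = 2 * B2 + U + 0 * M)  # daughter's family size; true effect of M is 0
```

Each line is one node of the generative model: the arguments to mean are the variables with arrows pointing into that node.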
First let’s consider M, mom’s family size. This is influenced by B1, her birth order. But it’s also influenced by the unobserved confounds U. So M is some function of B1 and U. We could make this into a graph, with arrows indicating causal influence, like this:
B1 —> M <— U
The daughter’s family size D is influenced, possibly, by M. That’s the point of the exercise, to estimate that influence. What else influences D? B2 and U, just like M. So D is some function of M, B2, and U. If you draw this as a graph, there will be three arrows entering D.
The other variables, B1 and B2 and U, aren’t influenced by any of the other variables we have here. While they surely have their own causes, we aren’t modeling those.
Let’s put it all together now, as a single graph. I will also add some labels on the arrows, so that we can talk about specific arrows. We’ll need to talk about arrows, if we are going to turn this into a statistical strategy. I’ll also draw U with a circle around it, which is a convention for indicating that the variable is unobserved. We don’t see it in the data, but we do believe it has influenced the data.
The arrow labeled m is what we are after. What is the causal influence of M on D? The arrows labeled b are the influence of birth order on each woman’s family size. I’ve labeled these the same, because they should be the same influence, assuming any influence of birth order is the same over time. Likewise, we assume the influence k of the confound U is the same for both women. This doesn’t have to be true. It’s just to keep things a little simpler.
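If it helps to have the graph in a form you can query, here is one way to write it down as an edge list. This is only a sketch: the data frame layout and the parents() helper are my own illustrative choices, not part of any package.

```r
# Edge list for the graph described above; labels follow the text (b, k, m).
dag <- data.frame(
  from  = c("B1", "U", "M", "B2", "U"),
  to    = c("M",  "M", "D", "D",  "D"),
  label = c("b",  "k", "m", "b",  "k"),
  stringsAsFactors = FALSE
)

# parents() is a hypothetical helper: the variables with an arrow into `node`.
parents <- function(graph, node) graph$from[graph$to == node]

parents(dag, "M")   # "B1" "U"
parents(dag, "D")   # "M" "B2" "U"  (three arrows entering D)
parents(dag, "B1")  # character(0): no causes modeled for B1
```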
Now let’s try thinking like a graph. The generative model is a linear one: Each variable is some additive combination of influences. You can see this in the simulation in Part 1. In a linear model like this one, the covariance between two variables can be calculated directly from the graph. There are a bunch of rules for how to do this, and if you’ve ever studied structural equation models or path analysis, then you know what I’m talking about.
But you don’t need all the rules for this example. Here’s what we need.
If b is the causal influence of B1 on M, then the expected covariance between B1 and M is the causal effect b multiplied by the variance of B1. cov(B1,M) = b var(B1). Of course usually we don’t know b but want to estimate it. In that case we just solve for b and get b = cov(B1,M)/var(B1). You can try this on the R command line:
cov( B1 , M ) / var( B1 )
and the result (about 2.2) is the same estimate you get from a regression of M on B1: lm( M ~ B1 ).
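A self-contained check of that equivalence, as a sketch. The seed and the coefficients (b = 2, k = 1) are assumptions, so the number you get will differ from the one quoted above, but the two calculations should agree with each other.

```r
set.seed(1)                               # assumed seed
N <- 200
U <- rnorm(N)                             # unobserved confound
B1 <- rbinom(N, size = 1, prob = 0.5)
M <- rnorm(N, mean = 2 * B1 + U)          # assumed coefficients: b = 2, k = 1

b_cov <- cov(B1, M) / var(B1)             # covariance divided by variance
b_lm  <- unname(coef(lm(M ~ B1))["B1"])   # slope from the regression of M on B1

all.equal(b_cov, b_lm)                    # TRUE
```

This works because nothing points into B1: there is no back-door path between B1 and M, so the simple regression slope is an estimate of b.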
Now here’s the cool trick. We want to learn m, but we can’t get it directly, because a regression lm( D ~ M ) is confounded by U. The expected covariance cov( M , D ) is not just from the m path, but also from the path that goes through U. There is a formula for this expected covariance, but since we don’t know k or the variance of U, we can’t use it to get m.
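To see the confounding in numbers, here is a sketch under assumed values (b = 2, k = 1, unit-variance noise, true m = 0). In a linear model of this form, the expected covariance is cov(M,D) = m var(M) + k² var(U), so the regression slope is off by k² var(U) / var(M). We can only compute that bias here because we simulated U ourselves.

```r
set.seed(2)                               # assumed seed
N <- 1e5                                  # large N so sampling noise is small
U <- rnorm(N)
B1 <- rbinom(N, size = 1, prob = 0.5)
M <- rnorm(N, mean = 2 * B1 + U)
B2 <- rbinom(N, size = 1, prob = 0.5)
D <- rnorm(N, mean = 2 * B2 + U + 0 * M)  # true m = 0

naive <- unname(coef(lm(D ~ M))["M"])     # confounded estimate of m
expected_bias <- var(U) / var(M)          # k^2 var(U) / var(M), with k = 1
c(naive = naive, expected_bias = expected_bias)
```

With these assumed values var(M) is about 3, so both numbers should land near 1/3 even though the true effect is zero.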
But watch this. The covariance of B1 and D is something we can compute. And the expected covariance, thinking like the graph, is:
cov( B1 , D ) = b m var( B1 )
When there are multiple arrows on the path, we multiply the causes. Okay so now we are ready to estimate m, because we know or can estimate every other part of the formula above. Solving for m:
m = cov( B1 , D ) / ( b var( B1 ) )
We know the formula for b = cov( B1 , M ) / var( B1 ), so we can substitute that in and simplify to:
m = cov( B1 , D ) / cov( B1 , M )
And there you have it. Try out that calculation in R and you’ll see it gets very close to the right answer (zero). Now modify the simulation so that m is some non-zero value like 0.5 and try again. You’ll see that you again get close to the right answer, using the ratio of covariances above. If you increase the sample size in the simulation to, for example, N = 1e5, the estimate should land almost exactly on the true causal effect.
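Here is a sketch of that experiment. The sim_ratio() helper is hypothetical, written just for this illustration, and its default coefficients (b = 2, k = 1) are assumptions.

```r
# sim_ratio() is a hypothetical helper: simulate one data set with true effect m
# and return the ratio-of-covariances estimate cov(B1, D) / cov(B1, M).
sim_ratio <- function(N, m, b = 2, k = 1) {
  U  <- rnorm(N)                          # unobserved confound
  B1 <- rbinom(N, size = 1, prob = 0.5)
  M  <- rnorm(N, mean = b * B1 + k * U)
  B2 <- rbinom(N, size = 1, prob = 0.5)
  D  <- rnorm(N, mean = b * B2 + k * U + m * M)
  cov(B1, D) / cov(B1, M)
}

set.seed(3)                               # assumed seed
est_zero <- sim_ratio(N = 1e5, m = 0)     # should be near 0
est_half <- sim_ratio(N = 1e5, m = 0.5)   # should be near 0.5
c(m_zero = est_zero, m_half = est_half)
```

Notice that U never appears in the estimate. The function uses it to generate the data, but the calculation on the last line needs only B1, M, and D.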
But obviously what is missing here is any sense of the uncertainty of our estimate of m, the causal influence of mom on daughter. That’s a separate issue, and one that thinking like a graph doesn’t really solve. I will have much more to say about this in Part 3. In this case, you can bootstrap and get a useful error estimate. The code to repeat the data simulation and get a bootstrap estimate is below. You can modify the “true” effect of mom on daughter in line 8 (the 0*M part) to simulate and recover different causal effects. You might also want to play with the sample size in line 2.
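A minimal sketch of the simulation plus a bootstrap, using only base R. The seed, sample size, coefficients, and number of bootstrap replicates are all assumptions. In this sketch the sample size sits on line 2 and the 0 * M term on line 8.

```r
set.seed(1908)                            # assumed seed
N <- 200                                  # sample size (assumed)
U <- rnorm(N)                             # unobserved confound
# birth order and family sizes
B1 <- rbinom(N, size = 1, prob = 0.5)
M <- rnorm(N, mean = 2 * B1 + U)
B2 <- rbinom(N, size = 1, prob = 0.5)
D <- rnorm(N, mean = 2 * B2 + U + 0 * M)  # "true" effect of M on D is the 0 here

dat <- data.frame(M = M, D = D, B1 = B1)

# Ratio-of-covariances estimate of m for one (re)sample of rows
ratio_est <- function(d) cov(d$B1, d$D) / cov(d$B1, d$M)

# Nonparametric bootstrap: resample pairs with replacement, recompute the estimate
boot_est <- replicate(2000, ratio_est(dat[sample(N, replace = TRUE), ]))

ratio_est(dat)                            # point estimate
sd(boot_est)                              # bootstrap standard error
quantile(boot_est, c(0.025, 0.975))       # 95% percentile interval
```

Resampling whole rows keeps each mother paired with her own daughter, which is what we want, since the confound U is shared within a pair.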
Do-Calculus, Not Too Much, Mostly Graphs
Now you’re thinking like a graph. We used only the same data as in Part 1, where including B1 in the model actually made inference worse. Here we found a way to use it to get the right answer. But we didn’t use a regression, not in the usual sense. Instead we started with a generative perspective and derived a way to estimate the causal effect of interest.
I’ll say more about this general approach. But first let’s address the looming question of why the regression (Part 1) gets things so wrong. This example is a special case of a general phenomenon known as bias amplification. Bias amplification arises when (1) the exposure of interest (M here) is confounded (by U here) and (2) we add a variable (B1 here) that is a strong predictor of the exposure (M here). When this happens, the additional variable tends to amplify the original confound and overall make inference worse. That is what we saw in Part 1. As a rule of thumb, it is better to add variables like B2, strong predictors of the outcome, than like B1, strong predictors of the exposure. I’ll drop a citation at the end of this post, if you want to chase the details.
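You can watch bias amplification happen in a simulation. This is a sketch under assumed values (b = 2, k = 1, unit-variance noise, true m = 0). With these values the coefficient on M should sit near 1/3 when M is the only predictor, rise to about 1/2 when B1 is added, and stay near 1/3 when B2 is added instead.

```r
set.seed(4)                               # assumed seed
N <- 1e5                                  # large N so the pattern is not noise
U <- rnorm(N)
B1 <- rbinom(N, size = 1, prob = 0.5)
M <- rnorm(N, mean = 2 * B1 + U)
B2 <- rbinom(N, size = 1, prob = 0.5)
D <- rnorm(N, mean = 2 * B2 + U + 0 * M)  # true m = 0

# Coefficient on M under three different sets of predictors
m_only <- unname(coef(lm(D ~ M))["M"])
m_B1   <- unname(coef(lm(D ~ M + B1))["M"])
m_B2   <- unname(coef(lm(D ~ M + B2))["M"])
c(M_only = m_only, add_B1 = m_B1, add_B2 = m_B2)
```

Adding B1 soaks up the part of M that was not confounded. What is left of M is more purely U, so the confounded share of its association with D goes up.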
Bias amplification happens when you think like a regression: Which variables should I add to the model? Instead we’re thinking like a graph. This provides better questions and principled answers. Schematically, thinking like a graph involves three stages:
Step 1: Specify a generative model of the data, including any unmeasured confounds. It could be a crude graph or a detailed system of equations. The important thing is that it contains causal structure.
Step 2: Choose a specific exposure of interest (here M) and outcome of interest (here D).
Step 3: Use the model structure (in this example, the graph) to deduce a procedure for calculating the causal influence of the exposure on the outcome.
If you want to know more than one causal effect, you need to repeat steps 2 and 3 more than once. This is something to really emphasize: A single causal model implies a variety of statistical models, possibly a different statistical model for each causal question.
In this example, I used a causal model that specified the exact functions that related the different variables. It was all linear. But there is a more general approach that avoids the functions altogether. It uses graphs called DAGs: Directed Acyclic Graphs. DAGs look like the graph we made above, but they don’t say anything about the exact functions that relate the variables to one another. It turns out a lot can be said about the possibility (or impossibility) of causal inference, even when all we have is a DAG. There is a framework, known as do-calculus, that allows us to query a DAG about a causal effect and decide if there is a way to estimate it. This framework is very general, and can deal with many diverse problems ranging from missing data to measurement error to generalizability.
There is a large literature on this approach. It’s rather standard in some sciences, despite being completely absent in others. I’ll put some pointers at the bottom of this post.
The key idea of do-calculus is to ask how we can use either statistical or design choices to modify a graph so that it contains no confounds for some association of interest. When the confounds are removed, the remaining association is an estimate of the causal effect. In principle, modifying a graph to remove all confounds is easy: Just remove all arrows entering the exposure of interest. This is how we represent an intervention, setting the exposure to some value. The causal influence of interest is defined by this intervention, written as p(D|do(M)), the distribution of D when we intervene on M.
For example, here is the graph from before, but with all arrows entering M removed:
In this scenario, the association between M and D reflects the causal relationship p(D|do(M)). (We might still want to include B2 in an analysis, because it would help with precision.) When we can do a controlled experiment, we are effectively doing exactly this, removing the arrows entering the exposure. If we could experimentally set family sizes M, that would mean removing all other influences on M. That means no arrow entering it. This is what a randomized experiment attempts to accomplish.
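In a simulation, at least, we are free to intervene. The sketch below assigns M at random, so that neither B1 nor U has an arrow into it, and then runs the plain regression. The true effect of 0.5, the other coefficients, and the distribution of the assigned values are all assumptions for illustration.

```r
set.seed(5)                               # assumed seed
N <- 1e5
U <- rnorm(N)                             # still influences D, but no longer M
B2 <- rbinom(N, size = 1, prob = 0.5)
m_true <- 0.5                             # assumed "true" effect of M on D

# do(M): values of M are assigned at random, independent of B1 and U
M_do <- rnorm(N, mean = 1, sd = sqrt(3))  # arbitrary assigned values
D <- rnorm(N, mean = 2 * B2 + U + m_true * M_do)

est_plain <- unname(coef(lm(D ~ M_do))["M_do"])        # close to m_true
est_B2    <- unname(coef(lm(D ~ M_do + B2))["M_do"])   # also close; B2 adds precision
c(plain = est_plain, with_B2 = est_B2)
```

U is still in the data-generating process, and we still do not condition on it. It just cannot bias the estimate any more, because it no longer connects M to D.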
In practice, experiments don’t always assign treatments effectively (the “intent-to-treat” problem) and in our example, it would be both impractical and monstrously unethical to try to randomly assign family sizes to women. In these cases, do-calculus provides an algorithm for deducing purely statistical ways to convert the original graph into the one above. So-called “natural” experiments try for the same result, finding some statistical way to mimic a randomized experiment and get the new graph.
When you have more detail than a DAG, the general approach remains valid, although the details are different. We still want to statistically transform one graph (or set of equations) into another. We did exactly that with our m = cov( B1 , D ) / cov( B1 , M ) solution.
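One way to read that solution, sketched under the same assumed values as before (b = 2, k = 1, and here a true m of 0.5): it is the ratio of two regression slopes, neither of which is confounded, because nothing points into B1. The slope of D on B1 estimates b times m, and the slope of M on B1 estimates b.

```r
set.seed(6)                               # assumed seed
N <- 1e4
U <- rnorm(N)
B1 <- rbinom(N, size = 1, prob = 0.5)
M <- rnorm(N, mean = 2 * B1 + U)
B2 <- rbinom(N, size = 1, prob = 0.5)
D <- rnorm(N, mean = 2 * B2 + U + 0.5 * M)  # assumed true m = 0.5

total_B1_on_D <- unname(coef(lm(D ~ B1))["B1"])  # estimates b * m
b_hat         <- unname(coef(lm(M ~ B1))["B1"])  # estimates b
m_slopes <- total_B1_on_D / b_hat                # ratio of the two slopes
m_covs   <- cov(B1, D) / cov(B1, M)              # ratio of covariances from the text

all.equal(m_slopes, m_covs)               # TRUE: the same estimator written two ways
```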
Less Salad, More Exercise
You don’t have to eat the causal salad. Neither your own nor anyone else’s. The first and easiest step is just to ask which generative assumptions allow us to claim that some estimate is a causal estimate. Really the interpretation of statistical results always depends upon causal assumptions, assumptions that ultimately cannot be tested with available data.
If you find this disappointing, I have three things to say.
First, I warned you already at the start of Part 1.
Second, this is only disappointing because methods have been oversold. Researchers have been taught to think of statistical methods as a kind of sorcery that can conjure causal facts from data, as long as you have enough of it. Maybe this sounds harsh. But when a group of experts relies upon a set of methods to make decisions for them, has little mechanical understanding of those methods, fears to deviate from convention, and lacks any formal framework for justifying these conventions, that sounds like sorcery. We must do better.
Third, the perspective here is actually very powerful. We shouldn’t be disappointed at all. The do-calculus and related formal methods of causal inference are extraordinary achievements. They make inference possible in frustrating contexts, like the tale of the two mothers.
There’s a chance, is what I’m saying.
If you learn nothing else, learn this: The Table 2 Fallacy
On bias amplification: Understanding Bias Amplification
Short textbook do-calculus and causal inference: Causal Inference in Statistics: A Primer