24 August 2017 by Richard
Multilevel Regression as Default
Abstract: Don’t ask why you should use a multilevel model. Ask instead, Why not?
There is a chronic shortage of organs. Too many people die needlessly while waiting for replacement kidneys, hearts, and lungs. This is despite high public approval [PDF] of organ donation. One component of the mismatch between supply and approval appears to be whether a nation treats donation as the default or as opt-in. Countries with opt-in policies, which require people to take an extra step to be listed as a donor, have much lower rates of donation.
I am not an expert on the topic of organ donation. Maybe the details complicate this story. But it seems plausible. Which choice is defined as the default can have extraordinary power. I know it happens in statistics.

Organ donation, colored by default. Gold: Opt-in. Blue: Default donor. [source]
Statistical Defaults
In applied statistics, defaults are king. Default models, and default algorithms, and default priors, and default diagnostics, and default criteria. Defaults are often sensible, but they also embody historical inertia and exert unwarranted normative force. Not using a default must be justified, because Reviewer #2 is lurking, just waiting for you to do something unusual.
So it’s worth rethinking our defaults. Some of them are actively harmful.
Ordinary, non-hierarchical fixed-effects regression is a bad default. I started teaching statistics in 2002. One of the most common questions I’ve heard from students and colleagues has been: “Would a multilevel model be appropriate for my project?” I nearly always answer, “YES!” Of course there are cases where such models do not make sense. But multilevel models are so broadly useful that they should be our default. It matters where you begin. We should be justifying those times when we choose not to use a multilevel model, not asking in every instance whether one would be useful.
Why So Useful
Multilevel models extend the core benefit of regression to higher orders of abstraction. The core benefit of regression is, well, regression to the mean. When we assign variables to a common distribution, predictions regress to the mean. In the majority of cases, this improves prediction.
Data often contain clusters. Individuals, households, countries, and time periods are all clusters containing repeated observations. When there are clusters in the data, then we can exploit regression among those clusters as well. If we allow each cluster to have its own tendencies, and then assign those individual tendencies to a common distribution, we get regression and improved prediction.
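To make the idea of assigning cluster tendencies to a common distribution concrete, here is a minimal sketch of partial pooling. It is not a full multilevel model fit. It assumes a normal model with one shared within-cluster variance and uses a simple method-of-moments estimate of the between-cluster variance. The function name `partial_pool` and the toy data are hypothetical, introduced only for illustration.

```python
from statistics import mean


def partial_pool(clusters):
    """Shrink each cluster mean toward the grand mean.

    Assumes a normal model with one shared within-cluster variance,
    at least two clusters, and more observations than clusters.
    Returns (raw cluster means, grand mean, partially pooled means).
    """
    raw = {k: mean(v) for k, v in clusters.items()}
    sizes = {k: len(v) for k, v in clusters.items()}
    n_obs = sum(sizes.values())
    n_clusters = len(clusters)

    # Grand mean: the unweighted average of the cluster means.
    grand = mean(list(raw.values()))

    # Pooled within-cluster variance (noise around each cluster's own mean).
    ss_within = sum((y - raw[k]) ** 2 for k, v in clusters.items() for y in v)
    sigma2 = ss_within / (n_obs - n_clusters)

    # Between-cluster variance by method of moments, floored at zero:
    # spread of the observed means minus the spread expected from noise alone.
    var_of_means = sum((m - grand) ** 2 for m in raw.values()) / (n_clusters - 1)
    tau2 = max(var_of_means - mean([sigma2 / n for n in sizes.values()]), 0.0)

    pooled = {}
    for k, n in sizes.items():
        noise = sigma2 / n
        # Weight on the cluster's own mean; small or noisy clusters get less.
        weight = tau2 / (tau2 + noise) if tau2 + noise > 0 else 1.0
        pooled[k] = grand + weight * (raw[k] - grand)
    return raw, grand, pooled


# Hypothetical data: three clusters with 2, 4, and 1 observations.
data = {"A": [4.0, 6.0], "B": [9.0, 11.0, 10.0, 10.0], "C": [1.0]}
raw, grand, pooled = partial_pool(data)
for k in data:
    print(k, round(raw[k], 2), "->", round(pooled[k], 2))
```

Every pooled estimate lands between the cluster's own mean and the grand mean. In this toy run the single-observation cluster C is pulled toward the grand mean proportionally more than the four-observation cluster B, because its own mean is the noisier of the two.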
This benefit of multilevel regression has little to do with statistical paradigm, research design, sample size, the number of clusters (2 is enough), whether the clusters were sampled from a larger population of clusters, or any of the other metaphysical justifications for multilevel models. If you want improved estimates and predictions, use a multilevel model.
Or use a technique that achieves the same benefits. There are lots of machine learning procedures that obscurely exploit the same principles. I prefer to focus on benefits, not the tools themselves. So if posterior distributions upset your stomach, then it’s fair to find some machine learning wizardry that achieves similar benefits but remains compatible with your physiology.

Fixed-effect regression prepares breakfast. [src]
Dumb Robots and Dumber Robots
It can be helpful to think of statistical models as little, dumb robots. What kind of robot would act like a classical, fixed-effect regression?
A robot with anterograde amnesia would act that way. Anterograde amnesia is the inability to form new memories. Fixed-effect models treat each cluster as unrelated to the others. They forget everything they’ve learned, every time they switch clusters. If the clusters are of the same type, then it is better to use the joint information to improve inference about each. Doing otherwise means ignoring, for example, that you have ever before visited a café, treating every new café as if it were the first. Cafés do differ. But they are also alike. A multilevel model estimates how alike and uses that estimate to pool information among cafés. That usually improves the estimate for each.
I riff on these points in Chapter 12 of my book. You can read it for free here: [PDF of Chapters 1 and 12, Statistical Rethinking].
Why Sometimes Not So Useful
Multilevel models should be our default. But there will always be reasons to use classical regression models. These include:
(1) There are no clusters in the data. Even then, sometimes it makes sense to treat each outcome as a “cluster” and assign a random effect to it, because this helps to account for over-dispersion. And the absence of categorical clusters does not necessarily mean that some multilevel model isn’t possible—devices like Gaussian process regression can extend the benefits of partial pooling to continuous differences in, for example, age or location or time.
(2) The clusters are not very different from one another. If the cluster tendencies are very similar, then the multilevel model may add very little. However, since a multilevel model adapts to the variation in the sample, it is often no harm to use a multilevel model, even in this case.
(3) The multilevel model just won’t fit reliably. With certain samples and certain models and certain algorithms, no reliable set of estimates can be made. In such cases, it can be reasonable to fall back on pooling the evidence and ignoring the variation among clusters. As algorithms like Hamiltonian Monte Carlo improve, this excuse is harder to use. But it will never go away entirely.
(4) Inadequate training or experience. All of this assumes, unfairly, that a person has been adequately exposed to multilevel models and knows how to specify and interpret them. While many training programs provide substantial material on multilevel models, many still do not. And many researchers, inside and outside the academy, have little time to pick up new tools. So I’m very sympathetic to the excuse, “I just haven’t had the proper training.” That’s something that can be remedied, and there is no shame in not knowing something.
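Reason (2) says that a multilevel model adapts to the variation in the sample. For the simplest case this can be written down directly. As a sketch, assume a normal model with known within-cluster variance $\sigma^2$, between-cluster variance $\tau^2$, population mean $\mu$, and $n_j$ observations with sample mean $\bar{y}_j$ in cluster $j$. All of this notation is introduced here for illustration. The partially pooled estimate of the cluster mean is then a weighted average:

```latex
\hat{\alpha}_j = w_j \, \bar{y}_j + (1 - w_j) \, \mu,
\qquad
w_j = \frac{\tau^2}{\tau^2 + \sigma^2 / n_j}
```

When the clusters are nearly identical, $\tau^2$ is close to zero, so $w_j$ is close to zero and every estimate collapses toward $\mu$, much as complete pooling would. When the clusters differ a lot relative to the noise, $w_j$ approaches 1 and each cluster keeps its own mean. In practice $\tau^2$ is estimated from the sample, which is the sense in which the model adapts.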
So I don’t mean to say one should always use a multilevel model. But one should always be prepared to defend the choice not to, given the well-documented inferential benefits of the multilevel approach.
Read More
Outside Bayesian circles, the benefits of multilevel models are traditionally linked to the James-Stein estimator, sometimes referred to as “Stein’s paradox.”
Andrew Gelman has a readable essay about the creation and role of defaults in applied statistics. PDF is here.
I tried to find an accessible introduction to Gaussian processes, but failed. If someone knows one that isn’t saturated with matrix algebra but focuses instead on the purpose and implications, please let me know.