28 August 2017 by Richard
First World Modeling Problems
Abstract: Convenience is the mindkiller. Convenient statistical software postpones thinking and disguises assumptions.
Food is just about the best thing for you. As a consequence, all animals love food. Human animals have spent millennia engineering the food supply to produce increasing amounts of increasingly delicious food. Our massive brains, fueled by high energy diets, devise better and better ways to produce, ship, combine, store and prepare so much food. Snack chips, together with the necessary supply and manufacturing chains, rank among humanity’s greatest technological achievements.
And now of course, in the wealthiest human societies, food is among the most dangerous things for you. The leading cause of death in so-called First World nations has long been various kinds of heart disease. Heart disease is complex, but overeating is a major contributor. Food is a First World Problem.
Twenty Seconds to Live
Modern statistical software is a bit like snack chips.
I use Stan (mc-stan.org) for most of my statistics these days. Stan compiles model code into a binary sampler. For simpler models and small data sets, the compilation step is most of the running time. So in the Stan community, we often see fair and reasonable comments like [src]:
I agree. That 20 seconds can feel like an eternity.
I also feel guilty for agreeing. I have been doing science long enough to remember when a simple model might take substantially longer than 20 seconds. The models I ran for my Master’s degree took a couple of weeks to execute. And execution time wasn’t the worst part. They consumed a few months to write and debug. My PhD supervisor did his statistical analyses on a campus mainframe computer, using punch card input. Students back then bought blank punch cards from the campus bookstore, by the linear foot. After delivering your punched cards to the queue, you might wait a day or two until your program got a turn on the machine. A common experience was to return to collect your output and receive only an error report. Back to making punch cards!
So my elders checked their code a bit more carefully than I check mine now. Mistakes were costly in both time and money.
Costs of Convenience
To have my “old man yells at cloud” moment, I’m going to ask: Could convenient statistical software actually be bad for us?
The most obvious cost: Convenience—in terms of speed of execution—may allow us to try too many models and data transformations. Instead of thinking about principled models, we can just stumble our way through, adding and subtracting terms. The trouble is that the justification for the final model may be little more than “it worked.”
I think this concern is widely shared. P-hacking is easier with SPSS than it would be with punch cards. It’s easier now when the computer does the math than when we did the sums-of-squares by hand.
But there’s a less obvious cost that worries me more: Convenient statistical software hides its assumptions. I’m thinking more here of convenient input languages than I am of speed of execution. Let’s compare two different multilevel model specifications. The R package lme4 [doc] is fantastic for fitting multilevel models of many types. All it needs to specify a generalized linear mixed model (GLMM) with random intercepts and slopes is:
y ~ (1 + x|group) + x
That’s it. In contrast, the R package that I wrote, rethinking [repo], requires the following catastrophe:
y ~ dnorm(mu,sigma),
mu <- a[group] + b[group]*x,
c(a,b)[group] ~ dmvnorm2( c(abar,bbar) , Sigma_group , Rho_group ),
abar ~ dnorm( 0 , 10 ),
bbar ~ dnorm( 0 , 1 ),
Sigma_group ~ dexp(1),
Rho_group ~ dlkjcorr(2),
sigma ~ dexp(1)
This is a mess, but it’s a lovely mess. It’s all about managing the proper amount of convenience.
A Love Story
The two models above are essentially the same. Of course lme4 isn’t a Bayesian creature, but it specifies the same pooling relationship among the random effects. You just can’t see it. The packages brms [doc], rstanarm [doc], and MCMCglmm [doc] are Bayesian, like rethinking. But they use nearly the same input language as lme4. In these packages, the priors are implicit. They can be customized, but typical code doesn’t present them. All of these packages—lme4, brms, rstanarm, and MCMCglmm—are great and much more convenient than my package.
My package is inconvenient because there are costs to convenience, especially when people are learning to model. The convenience may do harm in two ways, both stemming from hidden aspects of the model.
First, convenient input forms hide assumptions and discourage learning about them. I could rewrite rethinking to use simpler model specifications, like lme4. All that’s needed is to pick some default distributions and priors and link functions. But I won’t do it, because all the tiresome information needed to specify that model is necessary. Either you make those choices, and are aware of them, or the computer makes them for you. Possibly you never learn about them. I’m not against users graduating to use convenient packages. But I am against hiding the model itself while people are still learning. Otherwise students risk ending up in the familiar swamp of push-button mysticism.
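As a small illustration of the computer making those choices, here is a sketch that uses base R's glm rather than any of the packages discussed in this post. The data are simulated and the variable names are made up for the example. The model call names neither an outcome distribution nor a link function, yet both exist in the fitted object; you only see them if you go looking.

```r
# Simulated data; the variable names and values are illustrative only.
set.seed(1)
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)

# One short line: no likelihood and no link function appear anywhere in it.
fit <- glm(y ~ x, data = d)

# The choices the software made on our behalf are visible only if we ask.
fam <- family(fit)
print(fam$family)  # outcome distribution chosen by default
print(fam$link)    # link function chosen by default
```

Nothing here is wrong with glm. The point is only that the formula alone does not tell a reader, or a student, which assumptions were made.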
In contrast, rethinking is so annoying that one cannot specify even a simple linear regression with fewer than four lines of code. The exact relationship between each parameter, variable and the outcome must be typed out using the arcane symbols of arithmetic. You’re welcome.
Second, the simple input languages are not self-documenting. It is not easy to figure out the full model from a simple model specification. One must hunt through documentation, parsing odd parameter conventions, to figure out something as simple as which Wishart prior a package uses. It’s a special kind of hell. So even when the person who conducted the analysis knew the model, readers will have a hard time discovering what the model was. This isn’t a huge problem when people take the responsibility to write mathematical model definitions in their papers or supplemental materials. But usually all we get is a citation to MCMCglmm version something something.
In contrast, rethinking input is so detailed that the model specification itself documents the model assumptions. A reader does not have to hunt through the arcane folios of CRAN.
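To illustrate, the random intercepts and slopes specification shown earlier can be transcribed nearly line by line into a mathematical definition of the sort one might put in a paper. This is a sketch, and the notation is my assumption rather than anything the code fixes: i indexes observations, j indexes groups, the subscripted sigmas stand for the two elements of Sigma_group, R stands for Rho_group, and S is the covariance matrix that dmvnorm2 builds from the standard deviations and the correlation matrix.

```latex
\begin{aligned}
y_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\
\mu_i &= a_{\mathrm{group}[i]} + b_{\mathrm{group}[i]} \, x_i \\
\begin{bmatrix} a_j \\ b_j \end{bmatrix}
  &\sim \mathrm{MVNormal}\!\left(
    \begin{bmatrix} \bar{a} \\ \bar{b} \end{bmatrix}, \mathbf{S} \right) \\
\mathbf{S} &=
  \begin{bmatrix} \sigma_a & 0 \\ 0 & \sigma_b \end{bmatrix}
  \mathbf{R}
  \begin{bmatrix} \sigma_a & 0 \\ 0 & \sigma_b \end{bmatrix} \\
\bar{a} &\sim \mathrm{Normal}(0, 10) \\
\bar{b} &\sim \mathrm{Normal}(0, 1) \\
\sigma_a, \sigma_b &\sim \mathrm{Exponential}(1) \\
\mathbf{R} &\sim \mathrm{LKJcorr}(2) \\
\sigma &\sim \mathrm{Exponential}(1)
\end{aligned}
```

Each line of the math corresponds to one line of the model code, apart from the line for S, which spells out what dmvnorm2 does with its last two arguments.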
Of course rethinking ends up being a lot more flexible than lme4 and the like. But that isn’t the reason for its inconvenience. The inconvenience is the design, because it forces learning and self-documentation.
But my package was intended as a scaffold—use it to build your wall, then let it fall. I never really intended for people to keep using it, after they got the hang of multilevel modeling. There’s a direct path from rethinking to daily use of brms and rstanarm instead. Of course the documentation issue looms, but life is full of tradeoffs. And rethinking is, let’s face it, tedious. You don’t need to code the same random slopes model from scratch every time.
But it turns out that lots of people prefer to keep using it, or even to graduate to working with raw Stan code instead. For some of us, the clarity of the full model is hard to let go of.
The important thing is to manage convenience. We need the right amount of junk in all the right places, like we need the right amounts of the right kinds of food in our diet. More convenience in one aspect of our work will usually cost us in some other.