7 September 2017 by Richard
Metamorphosis and the Multilevel Model
Abstract: The same multilevel model can be written in different ways. The algebra doesn't mind. But the differences do matter to the computer.
A terrible fact about applied statistical inference is that the same statistical model can be expressed in many different ways. This truth contains two aspects.
The first aspect is the cunning of algebra. Mathematical expressions can be arranged and rearranged into many equivalent forms. But these equivalent forms can sometimes look very different, and we can exploit this to aid our understanding.
The second aspect is the terror of implementation. To give life to a statistical model, we now use computers. The computer implementation typically makes one or another mathematically equivalent form of a model superior. So it can really pay to understand how to rearrange a model—how to reparameterize it.
I want to briefly explain by example, using multilevel models. By “multilevel model,” I mean a model with partially pooled parameters: random effects and the like. These models are not exotic. But they can be factored and expressed in many different, equivalent, and confusing forms. Typically, one form or another works much better inside the computer. So it’s worth knowing about your options.
Friendly Model
Consider a simple varying intercepts model. This model predicts an outcome $y$ that occurs repeatedly inside different units, like individuals and households. We allow each individual or household to have its own unique mean value for $y$. We want to estimate the mean in each unit, but by pooling information among all units to regularize and improve inference. This is standard multilevel strategy.
A typical model looks like this:
$$
\begin{aligned}
y_i &\sim \mathrm{Bernoulli}(p_i) \\
\mathrm{logit}(p_i) &= \alpha_{\text{UNIT}[i]} \\
\alpha_j &\sim \mathrm{Normal}(a, \tau) \\
a &\sim \mathrm{Normal}(0, 10) \\
\tau &\sim \mathrm{HalfCauchy}(0, 1)
\end{aligned}
\tag{1}
$$
Each unit $j$ in the sample is given a parameter $\alpha_j$ for its mean value of $y$. These unit means are assigned a common prior with mean $a$ and standard deviation $\tau$. This prior is learned from the data as well, and so partial pooling regularizes the individual $\alpha_j$ parameters.
The model form above is known as the centered parameterization. You can recognize it by the presence of parameters inside the adaptive prior for the varying intercepts. For the sake of illustration, let’s use a data set from my rethinking R package:
library(rethinking)
data(bangladesh)

# subsample 100 women
set.seed(1)
idx <- sample( 1:nrow(bangladesh) , size=100 )
dat_list <- list(
    y = bangladesh$use.contraception[idx],
    district = coerce_index( as.factor(bangladesh$district[idx]) )
)
These data are a classic study on the adoption of contraception in Bangladesh. I’ve just sampled 100 women from the full data, pulling out their contraceptive use into the outcome variable y and the district each lived in into the index variable district. Now fit the varying intercepts model above [I’m using the Experimental branch of my package, so if something doesn’t work for you, that’s probably why]:
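If you want to peek at what ended up in the subsample before fitting, something like the following works (just a quick check, not part of the analysis; the exact counts depend on your seed and R version):

# quick look at the subsample: observations, distinct districts,
# and the split of the outcome
str(dat_list)
length(unique(dat_list$district))
table(dat_list$y)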
set.seed(1)
m_centered <- map2stan(
    alist(
        y ~ bernoulli(p),
        logit(p) <- alpha[district],
        alpha[district] ~ normal( a , tau ),
        a ~ normal(0,10),
        tau ~ cauchy(0,1)
    ),
    data=dat_list , chains=2 , control=list(adapt_delta=0.8) )
You should see a buffet of scary warnings like:
Warning messages:
1: There were 8 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
2: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
3: Examine the pairs() plot to diagnose sampling problems
4: In map2stan(alist(y ~ bernoulli(p), logit(p) <- alpha[district], :
There were 8 divergent iterations during sampling.
Check the chains (trace plots, n_eff, Rhat) carefully to ensure they are valid.
What all of this indicates is that the chains aren’t mixing very well. A look at the trace plot confirms:
The number of effective samples, n_eff, is very low. We drew 2000 samples, but they carry only about as much information as 74 independent samples would. You could try adjusting Stan’s various control parameters, like that adapt_delta in the code above. But that won’t help much in this case. There is something much better. Algebra is going to rescue us.
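If you want to reproduce these diagnostics yourself, here is a sketch using the rethinking tools (the exact numbers will differ with your seed, Stan version, and machine):

# n_eff and Rhat for every parameter, including each alpha[j]
precis( m_centered , depth=2 )

# trace plots for the fitted model
plot( m_centered )

# pairs plot of the hyperparameters, as the warnings suggest,
# to look for where the divergent transitions concentrate
pairs( m_centered@stanfit , pars=c("a","tau") )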
Die Verwandlung
Suppose the model above is accidentally exposed to highly radioactive algebra. It transforms into:
$$
\begin{aligned}
y_i &\sim \mathrm{Bernoulli}(p_i) \\
\mathrm{logit}(p_i) &= a + \alpha_{\text{UNIT}[i]} \\
\alpha_j &\sim \mathrm{Normal}(0, \tau) \\
a &\sim \mathrm{Normal}(0, 10) \\
\tau &\sim \mathrm{HalfCauchy}(0, 1)
\end{aligned}
\tag{2}
$$
This is exactly the same model as before, but it looks different. The mean $a$ was smuggled out into the linear model. This leaves the adaptive prior with a mean of zero. And now the “random effects” $\alpha_j$ are deviations from zero. This is the same as before, because we can always subtract the mean from a normal distribution to center the distribution at zero. And we can get the unit intercepts from Equation (1) back just by defining them as $a + \alpha_j$.
We could fit this model to the same data. But let’s first splash it with some algebra one more time, to complete the transformation:
$$
\begin{aligned}
y_i &\sim \mathrm{Bernoulli}(p_i) \\
\mathrm{logit}(p_i) &= a + \tau z_{\text{UNIT}[i]} \\
z_j &\sim \mathrm{Normal}(0, 1) \\
a &\sim \mathrm{Normal}(0, 10) \\
\tau &\sim \mathrm{HalfCauchy}(0, 1)
\end{aligned}
\tag{3}
$$
Now the random effects are just little z-scores, $z_j$, with a fixed prior that is unit normal, $\mathrm{Normal}(0,1)$. All the scaling and centering of the effects gets done inside the linear model, $a + \tau z_{\text{UNIT}[i]}$. This works, because when we multiply the z-scores by their scale $\tau$, they end up back on the right scale. If this isn’t obvious, consider that you make z-scores by dividing values by their standard deviation. So the model above is just reversing that transformation.
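If you want to convince yourself of the equivalence, a quick simulation (my own illustration, not part of the model code) does it: draws from $\mathrm{Normal}(a, \tau)$ and draws built as $a + \tau z$ with unit-normal $z$ have the same distribution.

# check: Normal(a,tau) draws versus a + tau * Normal(0,1) draws
set.seed(1)
a <- 1.5
tau <- 0.7
x_centered <- rnorm( 1e5 , a , tau )
x_noncentered <- a + tau * rnorm( 1e5 , 0 , 1 )
c( mean(x_centered) , mean(x_noncentered) )   # both close to 1.5
c( sd(x_centered) , sd(x_noncentered) )       # both close to 0.7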
The model above uses a non-centered parameterization of the varying effects. This form looks monstrous. But it often samples much more efficiently. Here’s the code to do it:
set.seed(1)
m_noncentered <- map2stan(
    alist(
        y ~ bernoulli(p),
        logit(p) <- a + tau*z[district],
        z[district] ~ normal( 0 , 1 ),
        a ~ normal(0,10),
        tau ~ cauchy(0,1)
    ),
    data=dat_list , chains=2 , control=list(adapt_delta=0.8) ,
    constraints=list(tau="lower=0") )
No more scary warnings. And the trace plot looks great:
Despite the beastly n_eff on tau, this is a much less monstrous trace plot. The messy model works better.
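One practical note: the non-centered fit samples the z-scores, not the unit intercepts themselves, so you may want to rebuild those intercepts afterwards. A sketch with rethinking’s extract.samples, assuming the fit object above:

# rebuild the unit intercepts alpha_j = a + tau * z_j from the posterior samples
post <- extract.samples( m_noncentered )
alpha <- post$a + post$tau * post$z    # matrix: samples x districts
apply( alpha , 2 , mean )              # posterior mean log-odds for each district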
Coda
I’ve previously argued that multilevel models deserve to be the default form of regression. I do believe that. But these models also demand more attention to use properly.
One aspect we need to attend to is good estimation. Sometimes the seemingly monstrous version of a model will sample more efficiently and provide more reliable results. While the example above is for a one-dimensional vector of varying intercepts, this same strategy works for multi-dimensional matrices of varying intercepts and slopes. The algebra is a little more complicated, but the underlying logic is the same: smuggle all the parameters out of the prior and into the linear model. I have some examples of this in Chapter 13 of my book.
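To give a flavor of the multi-dimensional case without writing out the whole model: the z-scores become a matrix, and the scaling is done with a Cholesky factor of the correlation matrix. A purely illustrative sketch (the names J, tau, Rho, and Z are mine, not from any model above):

# non-centered construction of correlated varying intercepts and slopes:
# start from independent unit normals, then give them scale and correlation
set.seed(1)
J <- 1e4                                  # many units, just to check the math
tau <- c( 1.0 , 0.5 )                     # scales for intercepts and slopes
Rho <- matrix( c(1,0.7,0.7,1) , 2 , 2 )   # target correlation matrix
Z <- matrix( rnorm(J*2) , nrow=J )        # J x 2 matrix of z-scores
V <- Z %*% chol(Rho) %*% diag(tau)        # rows are (intercept, slope) pairs
round( cor(V) , 2 )                       # approximately Rho
round( apply( V , 2 , sd ) , 2 )          # approximately tau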
However, the non-centered approach is not always best. With some combinations of data and model, it’s instead the centered version, the original easy-to-read one, that works better. So it pays to know both forms and be comfortable with both. Sometimes you need the non-centered prior for one set of varying effects and the centered prior for another set, in the same model.
The same strategy works for more than the normal distribution as well. If you can subtract or factor a parameter out of a distribution—standardizing the distribution—then you can use this non-centering strategy with it. Basically any distribution with a location or scale parameter abides.
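As a quick illustration of that last point (again my own example): an Exponential with rate $\lambda$ is just a unit Exponential scaled by $1/\lambda$, so the same smuggling trick applies to its scale parameter.

# scale-parameter example: Exponential(lambda) versus Exponential(1) / lambda
set.seed(1)
lambda <- 2
x_direct <- rexp( 1e5 , rate=lambda )
x_scaled <- rexp( 1e5 , rate=1 ) / lambda
c( mean(x_direct) , mean(x_scaled) )   # both close to 1/lambda = 0.5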
Read More
Michael Betancourt, one of the members of the Stan development team, wrote a thorough case study on diagnosing mixing problems and the impact of non-centered parameterizations. Find it here.