There is Always Prior Information

Abstract: If priors did not exist, you would have to invent them. But this is no guarantee they are easy to use.

The flames burn higher, when you add Bayes.

This is a joke about SPSS.

The well-known statistical suite SPSS, which stands for Smoldering Pile of Statistical Software, is tossing the gasoline of Bayesian inference onto the flames. I take this to mean that Bayes has gained so much ground in recent years that even the most conservative statistical software is taking it on board.

This is, all joking aside, a good sign. If nothing else, the presence of Bayesian options will remind some users that there is no One True Way to perform statistical inference. And people who know a little Bayes often come to understand their favorite non-Bayes methods better, similar to how learning a foreign language teaches us about our native tongues.

The challenge, then, is to introduce Bayes to a user community that is largely unfamiliar with it. In the link above, there is a conventional introduction. It repeats the common view that priors are something of a burden:

While these [Bayesian] methods work very well when there is prior information, often in applied work there is little such information available.

Lots of statisticians talk this way, as if needing prior information were a burden. Efron & Hastie talk the same way in their sensible book, Computer Age Statistical Inference.

I argue that this view is very limiting. We can always do better than pretending we know nothing about a parameter, even when we know nothing in particular about it. The fact that it is a parameter is information enough. Ironically, this point isn’t always appreciated by active users of Bayesian procedures either.

Flat Priors Are Flat Dangerous

Even non-Bayesian procedures are improved by introducing devices that resemble priors, because these devices reduce overfitting. Overfitting here just means learning too much from a sample. In all branches of statistics, it is appreciated that prediction of the next sample will be more accurate when less than everything is learned from the first sample. What we want to learn are the “regular” or recurrent features of a sample.

As a result, statisticians have introduced procedures for regularizing inference. In some cases, these procedures are mathematically equivalent to using prior information that down-weights extreme parameter values. Penalized likelihood is the best known example. This is equivalent to a prior distribution that is less than perfectly flat. We can always do better than to use a flat prior. Any prior that is slightly less flat will be an improvement.
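To make that equivalence concrete, here is a minimal numerical sketch (my own, with made-up simulated data, not part of any example above): with a Gaussian likelihood and a known residual standard deviation sigma, ridge-penalized least squares with penalty lambda = sigma^2 / tau^2 gives the same estimate as the posterior mode under independent Normal(0, tau^2) priors on the coefficients.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([0.5, -1.0, 0.0])   # arbitrary values for the sketch
sigma = 1.0                              # residual sd, assumed known here
y = X @ beta_true + rng.normal(scale=sigma, size=n)

tau = 2.0                    # prior sd on each coefficient
lam = sigma**2 / tau**2      # the equivalent ridge penalty

# Penalized likelihood (ridge) estimate, closed form:
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Posterior mode with beta_j ~ Normal(0, tau^2), found numerically:
def neg_log_post(b):
    return (0.5 / sigma**2) * np.sum((y - X @ b)**2) + (0.5 / tau**2) * np.sum(b**2)

beta_map = minimize(neg_log_post, np.zeros(p)).x

print(beta_ridge)
print(beta_map)   # agrees with beta_ridge up to optimizer tolerance

The penalty strength and the prior width are two names for the same assumption: lambda goes up exactly as the prior variance tau^2 goes down.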

Of course if the prior is too concentrated in the wrong place, then it will hurt inference. But there is a universe of prior distributions that beat the flat prior implied by classical statistical methods. And that is why non-Bayesian statisticians use regularizing procedures that achieve the same advantages as prior information.
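Here is a small Monte Carlo sketch (my own construction, with arbitrary simulation settings) of that claim. The flat prior corresponds to plain least squares (lam = 0 below); the mildly regularizing alternatives typically predict a fresh sample better on average.

import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 20, 8, 1.0
beta_true = rng.normal(scale=0.5, size=p)   # modest true effects, chosen for the sketch

def avg_test_mse(lams, reps=4000):
    out = {lam: [] for lam in lams}
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = X @ beta_true + rng.normal(scale=sigma, size=n)
        Xt = rng.normal(size=(n, p))                        # a fresh sample
        yt = Xt @ beta_true + rng.normal(scale=sigma, size=n)
        for lam in lams:
            b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
            out[lam].append(np.mean((yt - Xt @ b) ** 2))
    return {lam: round(float(np.mean(v)), 3) for lam, v in out.items()}

print(avg_test_mse([0.0, 1.0, 5.0]))   # lam = 0 is the flat-prior (least squares) answer

The exact numbers depend on the settings, but the ordering is the point: learning a little less from the first sample buys better prediction of the next one.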

This is why, if priors didn’t exist, we would have to invent them. Any procedure which does not attempt to control overfitting is failing to include enough prior information.

Everybody Overfits

This has little to do with historical brawls between Frequentists and Bayesians. Non-Bayesian statisticians can and do invent devices for including prior information, independent of Bayes. Bayes has no unique advantage here. In fact, it shares the disadvantage of flat priors, because Bayesian practitioners use them so often. Lots of canned Bayesian models use flat or nearly-flat priors. It’s not unusual to see priors chosen for the explicit purpose of matching the inferences of a non-Bayesian procedure, as if this were reassuring.

Nearly every BUGS model I have seen uses absurdly flat priors. Here’s a typical example:

model{
  for (i in 1:N) {
    THEFT[i] ~ dnorm(mu[i], tau)
    mu[i] <- beta0 + beta1*MAN[i] + beta2*DIST2[i] + beta3*DIST3[i]
  }
  beta0 ~ dnorm(0, 0.00001)
  beta1 ~ dnorm(0, 0.00001)
  beta2 ~ dnorm(0, 0.00001)
  beta3 ~ dnorm(0, 0.00001)
  tau ~ dgamma(0.001, 0.001)
  sigma2 <- 1/tau
}

You don’t need to understand most of this. Just focus on the priors for the beta parameters. In BUGS, the second argument to dnorm is a precision, the reciprocal of the variance, so those normal priors have variance 100,000. They are trying very hard to be flat. At best, they give you the same results as a non-Bayesian model that will overfit your sample.
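To see just how flat that is, a normal prior with that much variance has a standard deviation around 316 and puts roughly three quarters of its mass on coefficient values larger than 100 in absolute value. A quick side calculation (mine, not part of the quoted example):

from scipy.stats import norm

precision = 0.00001
sd = precision ** -0.5                # about 316
print(sd)
print(2 * norm.sf(100, scale=sd))     # prior Pr(|beta| > 100), about 0.75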

Everybody needs to guard against overfitting. Priors are a simple and transparent way to do it. They are not a special burden. They are a safety device. You can ignore them, like you can ignore a seat belt. Most trips, they won’t make any difference. But I don’t want to use a statistical procedure without regularization any more than I want to ride in a car without seat belts.

At the same time, in many complex models it is not always clear how to achieve good regularization. This is just as true of non-Bayesian models as of Bayesian ones. These models need priors (or the functional equivalent), but choosing them is hard on everyone. Flat is still not an option, however.

More To Read

Andrew Gelman has a short, clear post from 2013 about the dangers of noninformative priors.

I wrote an entire book about Bayesian statistics, and it is obsessed with overfitting. Sample chapters are available here.

Ridge regression is a common non-Bayesian procedure that employs regularization and is mathematically the same as using prior information. It is described in lots of places, including Efron & Hastie’s Computer Age Statistical Inference. There is also, of course, the typically obtuse Wikipedia page.