17 July 2023 by Richard

*Yeah p-values suck. But replacing them with another metric is no
solution. The problem lies deeper.*

I am not a fan of p-values and null hypothesis significance testing (NHST). I do not use them, and I hardly ever talk about them. Statisticians have written many times about the inadequacy of NHST for cumulative science (see here and here), but seemingly with no impact on practice, so writing more about it feels futile at this point.

When I do talk about them, it is usually to make a joke about how few scientists have a basic understanding of them (see right). We laugh because the alternative is crying. Is there any other profession in which the normative quantitative metric is so misunderstood by practitioners? Worse, when confronted with their own lack of understanding and evidence of misuse, many scientists simply say, “but the reviewers demand them, what are we supposed to do?” I have personally heard colleagues at my own institute justify a design-inappropriate analysis because it was the one that gave them a smaller p-value. The goal must never be the criterion. But what are they supposed to do?

The first step is some moral courage of course. The second is easier.

Suppose you accept the consensus among statisticians that p-values are misused in dangerous and counter-productive ways in the sciences. Many scientists do or are starting to. So now what?

I get asked often about two alternatives: Bayes factors and confidence (credibility) intervals. I don’t think either of these is any solution at all.

Bayes factors are one Bayesian way of doing null hypothesis testing. They are popular in some pockets, like psychology and phylogenetics. But Bayesian statisticians are not in general positive about them. Gelman et al.’s definitive text, *Bayesian Data Analysis*, actively recommends against them. To begin, there are technical problems. They are hard to compute. For years, one phylogenetics program was using a Bayes factor estimator that had literally infinite variance. An estimate with infinite variance contains no information of value. There are better algorithms now. But computing Bayes factors remains hard, even though there are generally robust ways to compute related quantities like cross-validation scores (via importance sampling).

However, even if all the computational problems were solved, Bayes factors are sensitive to priors in ways that parameter estimates are not. This point is subtle. We want estimates to be sensitive to priors. If they were not, then they would also not be sensitive to data, because priors are mathematically the same as previous observations. But Bayes factors are sensitive in ways that have nothing to do with learning the posterior distribution: differences in priors that have essentially no impact on the posterior distribution within a model can have a massive impact on its Bayes factor. Christian Robert has written a nice comment that contains many citations for background.
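A minimal numerical sketch of this sensitivity (my own toy example, not from any of the linked sources): one observation y drawn from Normal(θ, 1), with a point null H0: θ = 0 against an alternative H1: θ ~ Normal(0, τ²). Widening the prior scale τ barely moves the posterior for θ, but it steadily drives the Bayes factor toward the null.

```python
import numpy as np
from scipy import stats

# Illustrative setup (numbers are mine): one observation y ~ Normal(theta, 1).
# H0: theta = 0 versus H1: theta ~ Normal(0, tau^2).
y = 2.0

for tau in [1.0, 10.0, 100.0]:
    # Marginal likelihood under H1: y ~ Normal(0, 1 + tau^2)
    m1 = stats.norm.pdf(y, loc=0.0, scale=np.sqrt(1.0 + tau**2))
    m0 = stats.norm.pdf(y, loc=0.0, scale=1.0)  # likelihood under the point null
    bf01 = m0 / m1  # Bayes factor in favor of the null

    # Posterior for theta under H1 (conjugate normal update)
    post_var = 1.0 / (1.0 + 1.0 / tau**2)
    post_mean = post_var * y
    print(f"tau={tau:6.1f}  BF01={bf01:8.2f}  "
          f"posterior mean={post_mean:.3f}, sd={np.sqrt(post_var):.3f}")
```

Between τ = 10 and τ = 100 the posterior for θ is essentially unchanged, while the Bayes factor in favor of the null grows roughly tenfold. This is the Lindley paradox: a prior choice that is irrelevant for estimation dominates the hypothesis test.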

However, the biggest problem with Bayes factors is none of these. Rather it is that people want to use them to do null hypothesis testing, and testing null hypotheses is the fuel of the garbage fire in the first place. Cumulative science needs non-null scientific models and ways to confront them with data. More on this later.

The other alternative is to report estimates and confidence intervals (or Bayesian credible intervals). This is better than raw null hypothesis testing, okay. With the estimate and its uncertainty, we can distinguish between a large estimate with a large uncertainty and a small estimate with a small uncertainty, even though these two could have the same p-value. However, readers will often interpret confidence intervals just as badly as they interpret p-values. So maybe it makes little difference in practice.

I certainly report posterior distributions in my own work, and sometimes with intervals (although these Bayesian intervals are not appropriate for null hypothesis testing, they are just summaries of the distribution). So I am not against this form of summary. But still I don’t think it’s a solution, because the real problem is still deeper.

The real problem is that researchers cannot usually justify an analysis in the first place.

Many well-meaning reformers see the problems with how p-values and NHST are used and propose that we reform the curriculum to teach better and also encourage pre-registration, so that researchers are disincentivized to cheat in the pursuit of statistical significance. However, it will hardly matter to pre-register an analysis that has no logical connection to the phenomenon in the first place. How can an analysis inform scientific knowledge when few explicitly specified scientific models are logically connected to it? Others have made this and related points very thoroughly.

Pre-registration may have other benefits. But its promise for improving data analysis is limited. Researchers often treat statistical control like a magic sauce. They were never taught a framework that could logically justify a set of control variables, nor the functional relationships those variables require. And so they lack the training necessary to produce a scientifically justified pre-registration. For example, including post-treatment variables is very likely to produce misleading estimates, and these estimates could easily be statistically significant (or have impressive Bayes factors). Mistakes like this can ruin experiments just as surely as observational studies.
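The post-treatment problem is easy to demonstrate by simulation (a hypothetical example; the variable names and effect sizes are mine): a randomized treatment X with zero true effect on the outcome Y, and a post-treatment variable Z influenced by both X and an unobserved cause U of Y. "Controlling" for Z manufactures a spurious effect out of nothing.

```python
import numpy as np

# Hypothetical simulation (illustrative numbers):
# randomized treatment X -> post-treatment Z, unobserved U -> Z and U -> Y.
# X has NO effect on Y, so the true treatment effect is zero.
rng = np.random.default_rng(1)
n = 100_000
x = rng.binomial(1, 0.5, n).astype(float)   # randomized treatment
u = rng.normal(size=n)                      # unobserved common cause
z = x + u + rng.normal(size=n)              # post-treatment variable
y = u + rng.normal(size=n)                  # outcome: depends only on U

def ols(outcome, *cols):
    """Least-squares coefficients with an intercept in column 0."""
    X = np.column_stack((np.ones(len(outcome)),) + cols)
    return np.linalg.lstsq(X, outcome, rcond=None)[0]

b_correct = ols(y, x)      # Y ~ X: coefficient on X is near zero (correct)
b_biased = ols(y, x, z)    # Y ~ X + Z: coefficient on X is biased away from zero
print("Y ~ X     :", round(b_correct[1], 3))
print("Y ~ X + Z :", round(b_biased[1], 3))
```

Conditioning on Z opens the path X → Z ← U → Y, so the second regression reports a substantial negative "effect" of a treatment that does nothing. With a sample this size, that spurious coefficient would be wildly statistically significant.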

So yes this is about causal inference. What is needed are scientific models that exist outside the statistical models. These models do many things.

In the first place, they help us clarify and develop theories. If theories remain verbal, we can usually get our way. But once we start really building quantitative models of observable phenomena, we are forced to deal with difficult problems like expected effect sizes, rates, spatial patterns, and the many other realities of real data. Fields like population genetics discovered decades ago that there are few meaningful and useful null models in complex natural systems (see). The only solution is to commit to developing and comparing scientifically informed, rigorously analyzed, non-null models. Often we find at this stage that we had the whole thing wrong from the beginning, that the verbally-persuasive explanation is internally incoherent, and can save ourselves the trouble of doing pointless research.

Second, scientific models allow us to logically and transparently justify a research design and an analysis. Too many studies just use some clever rhetoric to connect data to inference. For example, a classic paper in cognitive psychology claimed to debunk the widespread belief that basketball players experience a “hot hand” during which their shooting accuracy is higher (see). The authors proposed no scientific model of the hot hand but instead just chose some rhetorically convincing way to analyze the data. Later model-based analyses showed that the original paper missed the evidence of the hot hand that was in their data all along. Here is an excellent explanation of the original, its problems, and the solution.

The solution was to think logically with the aid of a scientific model that could be used to justify a statistical analysis. In the hot hand example, the scientific model provides a way to use the data appropriately at all. But the general problem of covariate selection also requires a scientific model. Covariates can cause as many problems as they can solve.
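The bias that the later model-based analyses identified can be sketched in a few lines (my own simulation, assuming the streak-selection setup described in the linked explanation): in short sequences of fair coin flips, compute the proportion of hits that immediately follow a hit within each sequence, then average across sequences. Naively we expect 0.5; the average falls noticeably below it, so the original comparison was biased against finding a hot hand even in a player with none.

```python
import numpy as np

# Minimal simulation of the streak-selection bias (my sketch, not the
# original paper's code). Condition on a hit, look at the next flip,
# average the within-sequence proportions across many short sequences.
rng = np.random.default_rng(0)
n_seq, seq_len = 100_000, 10
props = []
for _ in range(n_seq):
    flips = rng.binomial(1, 0.5, seq_len)
    after_hit = flips[1:][flips[:-1] == 1]  # outcomes right after a hit
    if len(after_hit) > 0:                  # skip sequences with nothing to condition on
        props.append(after_hit.mean())
print("average P(hit | previous hit):", round(float(np.mean(props)), 3))
```

The average lands well below 0.5 for a process with no hot hand at all, which means a real player whose conditional proportion merely matched 0.5 would actually be showing streakiness. The original analysis lacked a model of this selection effect, so it read the bias as evidence of no effect.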

Third, without scientific models, what are we even doing? How can diverse sources of evidence combine to inform a common, cumulative body of knowledge about nature, if we don’t invest in formalized representations of that knowledge? Science as a vocation must be about building, testing, revising scientific models. It must not be about making papers that metaphorically relate any convenient measurement made from a convenience sample to a convenient estimator that conveniently gets us trade-book deals.

So let’s stop talking about replacing p-values with different summaries or supporting them with better teaching and fixes like pre-registration. Instead let’s support a foundation in scientific modeling and workflows for deducing appropriate statistical procedures from unambiguous assumptions about data-generative processes.

Of course p-values, Bayes factors, and intervals have proper uses. But none of them addresses the foundational problem of linking statistical procedure to scientific modeling. And without that, none of them are inferential. They are merely statistical.

Now go read this by Sander Greenland.

2016 ASA Statement on p-Values: Context, Process, and Purpose [link]

2021 “The p-value statement, five years on” [link]

2008 “The Harmonic Mean of the Likelihood: Worst Monte Carlo Method Ever” [link]

2015 (updated 2022) “Pareto Smoothed Importance Sampling” [link]

2015 “The expected demise of the Bayes factor” [link]

2021 “The case for formal methodology in scientific reform” [link]

2018 “How Conditioning on Posttreatment Variables Can Ruin Your Experiment and What to Do about It” [link]

2021 “Causal Inference: Science Before Statistics” [link]

2010 “The standard of neutrality: still flapping in the breeze?” [link]

1985 “The hot hand in basketball: On the misperception of random sequences” [link]

2017 “Momentum isn’t magic – vindicating the hot hand with the mathematics of streaks” [link]

2023 “Connecting Simple and Precise P-values to Complex and Ambiguous Realities” [link]