14 July 2018 by Richard
Statistical Rethinking, Edition 2: ETA 2020
It came as a complete surprise to me that I wrote a statistics book. Really, I am an anthropologist. I study human evolution. Statistics is for me only a necessary activity, required for making inferences from data. In my list of professional identities, statistician falls somewhere below asker-of-unfair-questions and somewhere above hobbyist-pizza-chef.
It is even more surprising how popular the book has become. But I had set out to write the statistics book that I wish I could have had in graduate school. No one should have to learn this stuff the way I did. I am glad there is an audience to benefit from the book.
It took five years to write the book. There was an initial set of course notes, melted down and hammered into a first 200-page manuscript. I discarded that first manuscript, but it taught me the outline of the book I really wanted to write. Then several years of teaching with the manuscript refined it further.
Really I could have continued refining it every year. Going to press carries the penalty of freezing a dynamic process of both learning how to teach the material and keeping up with changes in the material. As time goes on, I see more elements of the book that I wish I had done differently. I’ve also received a lot of feedback on the book, and that feedback has given me ideas for improving it.
So now I am working on a second edition. The goal with a second edition is only to refine the strategy that made the first edition a success. I’m revising the text and code now, aiming to teach with it in Winter 2019. Then I’ll take student and colleague feedback and aim for a 2020 publication date.
So it isn’t happening soon. But it will happen soon enough. Here is a brief outline of both ongoing and planned changes. I am naturally interested in comments.
Look out you rock ‘n rollers
An emphasis on generative modeling. I feel most guilty that the first edition adopts the usual model construction approach of taking the data set as given and building the model conditional on what has been observed. This encourages the mistaken view that a Bayesian model is just a non-Bayesian model plus a prior. This view is maybe harmless in very basic regression examples. But it is limiting, because it makes it harder to teach things like missing data and measurement error, contexts in which the distinction between “observed” and “unobserved” is not so perfect. It also makes it seem like statistical models are detached from substantive scientific models—often they are, but they shouldn’t be. So I plan to emphasize instead the perspective that variables can be observed or unobserved in different contexts. The scientific model stays the same across these contexts. This will also be an attempt to move away from re-using non-Bayesian terminology to describe Bayesian modeling. My hope is that all this will actually make the early chapters simpler.
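As a minimal illustration of the generative view, here is a sketch in Python rather than the book's R, with made-up parameter values: write the simulation first, and estimation becomes the problem of inverting it. If x below were unobserved, nothing about the joint model would change; its distribution would simply act as a prior.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

# A tiny generative model. The scientific model is the whole
# simulation; which variables happen to be observed is a separate
# question. Parameter values here are arbitrary, for illustration.
x = rng.normal(0.0, 1.0, size=n)   # predictor, drawn from the model
mu = 1.0 + 2.0 * x                 # hypothetical intercept and slope
y = rng.normal(mu, 0.5)            # outcome

# Because we simulated the data, we know what estimation should recover:
slope_hat, intercept_hat = np.polyfit(x, y, 1)
```

The point is not the least-squares fit at the end, but that the model existed before the data did.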
An emphasis on understanding priors through prior predictive distributions. To go along with the emphasis on generative modeling, the book will emphasize that a primary way to understand distributional assumptions about variables (observed or not) is to simulate from the pre-data model. Isolated likelihoods and priors can be mystical things, since their impacts happen only in the context of the joint model. So we can simulate from the joint model to see what it expects, before the data arrive. A simple example where this matters a lot is in the choice of priors for logistic models—flat priors put nearly all the mass on insane possibilities. Yes, this was said in Chapter 10 of the first edition, but it was brief and wasn’t really integrated throughout.
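The logistic case is easy to demonstrate. Here is a sketch in Python (the book's code is in R, but the idea is identical), with prior scales chosen by me for illustration: draw intercepts from a "flat" Normal(0, 10) prior, push them through the inverse-logit, and see where the implied probabilities land.

```python
import numpy as np

rng = np.random.default_rng(2020)

def prior_predictive_p(scale, n=10_000):
    """Simulate intercepts from Normal(0, scale) and push them
    through the logistic (inverse-logit) link to get the implied
    prior distribution of probabilities."""
    alpha = rng.normal(0.0, scale, size=n)
    return 1.0 / (1.0 + np.exp(-alpha))

# A "flat" Normal(0, 10) prior on the log-odds scale...
p_flat = prior_predictive_p(10.0)
# ...versus a mildly regularizing Normal(0, 1.5) prior.
p_reg = prior_predictive_p(1.5)

# Fraction of prior mass on extreme probabilities (< 1% or > 99%):
extreme_flat = np.mean((p_flat < 0.01) | (p_flat > 0.99))
extreme_reg = np.mean((p_reg < 0.01) | (p_reg > 0.99))
```

The "flat" prior puts the majority of its mass on probabilities below 1% or above 99%, which is rarely what anyone intends.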
A stronger emphasis on the difference between prediction and inference. Chapter 5 in the first edition does emphasize the problems of just dumping variables into a model and letting regression sort it out. But the presentation isn’t forceful enough and isn’t integrated well with later chapters. There needs to be a memorable example of conditioning on a collider (language I do not like). I might split Chapter 5 into two shorter chapters, to give the issue proper emphasis. I’d like to have a brief presentation of DAGs, including an example where a DAG doesn’t work.
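To make the collider idea concrete, here is a toy simulation in Python (the scenario and variable names are my invention, not from the book): two independent traits both cause selection, and conditioning on the selected sample manufactures a correlation between them.

```python
import numpy as np

rng = np.random.default_rng(2019)
n = 5000

# Two causes of "getting funded", independent in the population.
newsworthy = rng.normal(size=n)
trustworthy = rng.normal(size=n)

# Funding is a collider: it depends on both traits.
score = newsworthy + trustworthy
funded = score > np.quantile(score, 0.9)  # top 10% get funded

# In the full population: essentially zero correlation.
r_all = np.corrcoef(newsworthy, trustworthy)[0, 1]
# Conditioning on the collider: a strong negative correlation appears.
r_funded = np.corrcoef(newsworthy[funded], trustworthy[funded])[0, 1]
```

No causal relationship between the two traits exists anywhere in the simulation; the association among funded proposals is entirely an artifact of selection.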
New model types. I’d like to include examples of time series, item-response (factor analysis), survival analysis, hidden-Markov, and ODE analyses. Not all of these will make it. But hopefully some of them will.
New data sets. I’ve collected several new data sets to serve as integrated examples across chapters. One of them is the historical Japanese cherry blossom data set. Time series examples were missing from the first edition, so I’ve adopted these data to remedy that. I only regret that the book isn’t in full color, so I won’t be able to show the blossom data points in pink. I also have a bigger, better primate brain size data set to use, courtesy of Dr Sally Street. I am still looking for a good data set to illustrate survival analysis. I am also considering some item-response examples.
Streamlining Chapter 6. The overfitting and information theory chapter has too much historical content, like AIC and DIC, and too little study of the pointwise nature of WAIC. It should also cover pointwise cross-validation, including PSIS-LOO. Most importantly, I want to emphasize much more that model comparison is often most useful for understanding the behavior of a target model, not for deciding which model to use. This ties into the causal inference material as well. I also want to discuss model stacking in contrast to model averaging.
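The pointwise structure is easy to show. Here is a sketch in Python (not the book's R code) of the standard WAIC formula, computed piece by piece from a matrix of pointwise log-likelihoods; the toy inputs at the bottom are fabricated just to exercise the function.

```python
import numpy as np

def waic(loglik):
    """WAIC from an S x N matrix of pointwise log-likelihoods
    (S posterior samples, N observations). Returns the total
    and the per-observation pieces."""
    # Pointwise log predictive density: log of the average likelihood.
    lppd_i = np.log(np.mean(np.exp(loglik), axis=0))
    # Pointwise penalty: variance of the log-likelihood across samples.
    p_i = np.var(loglik, axis=0, ddof=1)
    waic_i = -2.0 * (lppd_i - p_i)
    return waic_i.sum(), waic_i

# Toy check: fake log-likelihoods for 5 observations, 1000 samples.
rng = np.random.default_rng(1)
ll = rng.normal(-1.0, 0.1, size=(1000, 5))
total, pieces = waic(ll)
```

Keeping the per-observation pieces is the whole point: they show which observations a model predicts poorly, and they are what PSIS-LOO works with as well.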
A new chapter on models beyond GLMs and GLMMs. A central irony with GLMs is that they are so useful and also so limiting. It is impressive how many disparate problems can be expressed as GLMs. But many inferential problems require thinking outside the additive predictor approach and beginning with a substantive model of the system. I have in mind three examples. First, a biologically-motivated study of ecological and population dynamics. Second, an example in which I reanalyze a published child development study to infer process instead of just behavior frequencies. Third, a hidden-Markov latent state model. I would also like to include an example of modeling competitive outcomes—like football matches or round-robin preference experiments—but I am not sure which chapter to place it in. It could go in this chapter or another. The necessary models are technically GLMs, because they can be expressed that way. But these models make much more sense when we construct them in ignorance of the usual additive list of predictors approach.
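For the competitive-outcomes case, a Bradley-Terry-style sketch shows what I mean (Python, with hypothetical team abilities I made up): each match is a Bernoulli trial whose win probability is logistic in the difference of latent abilities. You could express this as a GLM, but it is more natural to start from the latent abilities.

```python
import numpy as np

rng = np.random.default_rng(7)

# Latent "ability" for each of 4 hypothetical teams.
ability = np.array([0.0, 0.5, 1.0, 1.5])

def p_win(i, j):
    """Bradley-Terry: win probability is logistic in the ability gap."""
    return 1.0 / (1.0 + np.exp(-(ability[i] - ability[j])))

# Simulate a round-robin: each ordered pair plays 100 matches.
n_teams = len(ability)
wins = np.zeros((n_teams, n_teams))
for i in range(n_teams):
    for j in range(n_teams):
        if i != j:
            wins[i, j] = rng.binomial(100, p_win(i, j))
```

Inference then runs the simulation in reverse: given the win counts, estimate the latent abilities.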
More raw Stan examples. I still think that wrappers like map2stan are needed. Students who know R but have never programmed in a statically typed language can have problems with Stan code. They can of course get it eventually. But a smooth ramp is better. The ramp can be made even smoother with some strategic examples of simple models expressed in both raw Stan and map2stan, emphasizing additional optimizations that one can perform in raw Stan.
A new version of map2stan. I already have a prototype complete refactor of map2stan. The new version is ignorant of GLMs and so should be able to express many more model types. It is also much easier to template, extend, and maintain. For the contents of the book, it will look and function much the same, but for those who want to keep using map2stan in their own work, it will be much better.
Simultaneous Python code translation. This is just an ambition at this point. It would be nice to have a translation into PyMC3 and/or PyStan done when the book is released.
Pretty soon now you’re gonna get older
As work goes on, new sections will trickle out for comment. I’ll also have new code and data examples up in an online repository. I’ll start putting links at the top of this post. The plans above might also change, in which case I’ll update the post.