14 July 2018 by Richard

# Statistical Rethinking, Edition 2: ETA March 2020

**[updated 18 Dec 2019 — see second edition table of contents at bottom]**

It came as a complete surprise to me that I wrote a statistics book. Really, I am an anthropologist. I study human evolution. Statistics is for me only a necessary activity, required for making inferences from data. In my list of professional identities, *statistician* falls somewhere below *asker-of-unfair-questions* and somewhere above *hobbyist-pizza-chef*.

It is even more surprising how popular the book has become. But I had set out to write the statistics book that I wish I could have had in graduate school. No one should have to learn this stuff the way I did. I am glad there is an audience to benefit from the book.

It took five years to write the book. There was an initial set of course notes, melted down and hammered into a first 200-page manuscript. I discarded that first manuscript. But it taught me the outline of the book I really wanted to write. Then several years of teaching with the manuscript further refined it.

Really I could have continued refining it every year. Going to press carries the penalty of freezing a dynamic process of both learning how to teach the material and keeping up with changes in the material. As time goes on, I see more elements of the book that I wish I had done differently. I’ve also received a lot of feedback on the book, and that feedback has given me ideas for improving it.

## Ch-ch-ch-ch-changes

So now I have almost finished a second edition. The goal with a second edition is only to refine the strategy that made the first edition a success. I revised the text and code and taught with it in Winter 2019. Now I’ve taken student and colleague feedback, revised more, and the book is in production for a target March 2020 publication.

The soul of the book is the same. But there is a lot of new material as well. Here is an outline of the changes.

## Look out you rock ‘n rollers

**The R package has some new tools.** The **map** tool from the first edition is still here, but now it is named **quap**. This renaming is just to avoid misunderstanding. We only ever used it to get a quadratic approximation to the posterior, so now it is named as such. A bigger change is that **map2stan** has been replaced by **ulam**. The new **ulam** is very similar to **map2stan**, and in many cases can be used identically. But it is also much more flexible, mainly because it does not make any assumptions about GLM structure and allows explicit variable types within the formula list. All the **map2stan** code is still in the package and will continue to work. But now **ulam** allows for much more, especially in later chapters. Both of these tools allow sampling from the prior distribution, using **extract.prior**, as well as the posterior. This helps with the next change.
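To make "quadratic approximation" concrete, here is a minimal sketch in plain Python rather than the package's R code. It is not how **quap** is implemented. It only illustrates the idea: find the posterior mode, then use the curvature of the log-posterior at that mode to define a Gaussian. The binomial data (6 successes in 9 trials), the flat prior, and every function name below are assumptions made for the illustration.

```python
import math

def log_posterior(p, successes=6, trials=9):
    """Unnormalized log-posterior for a binomial proportion under a flat prior."""
    return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)

def quadratic_approximation(log_post, lo=1e-6, hi=1.0 - 1e-6, h=1e-4):
    """Return (mode, sd) of the Gaussian matched to log_post at its mode."""
    # Golden-section search for the maximum of a unimodal function.
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > 1e-10:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if log_post(c) > log_post(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    mode = (a + b) / 2.0
    # Central finite difference for the second derivative at the mode.
    curvature = (log_post(mode + h) - 2.0 * log_post(mode) + log_post(mode - h)) / h**2
    # A Gaussian's log-density has second derivative -1 / sd^2.
    sd = math.sqrt(-1.0 / curvature)
    return mode, sd

mode, sd = quadratic_approximation(log_posterior)
print(f"mode = {mode:.3f}, sd = {sd:.3f}")  # prints: mode = 0.667, sd = 0.157
```

The pair `(mode, sd)` then stands in for the whole posterior, which is why the approximation is fast but can be poor when the true posterior is skewed.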

**Much more prior predictive simulation.** A prior predictive simulation means simulating predictions from a model, using only the prior distribution instead of the posterior distribution. This is very useful for understanding the implications of a prior. There was only a vestigial amount of this in the first edition. Now most modeling examples have some prior predictive simulation. I think this is the most useful addition to the second edition, since it helps so much with understanding not only priors but also the model itself.
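As a rough illustration of the idea, the following Python sketch draws regression lines from two candidate priors and counts how many imply impossible outcomes, before any data is involved. It is not code from the book or the R package. The height-on-weight setting, the prior values, the 0 to 272 cm plausibility band, and all names are assumptions chosen for the example.

```python
import random

def simulate_prior_lines(n, slope_prior, rng):
    """Draw n (intercept, slope) pairs from the priors alone; no data is used."""
    lines = []
    for _ in range(n):
        a = rng.gauss(178.0, 20.0)  # intercept: height (cm) at average weight
        b = slope_prior(rng)        # slope: cm of height per kg of weight
        lines.append((a, b))
    return lines

def fraction_implausible(lines, x_offsets=(-30.0, 30.0), lo=0.0, hi=272.0):
    """Share of prior lines predicting a height outside [lo, hi] cm at either offset."""
    bad = 0
    for a, b in lines:
        # x is weight in kg relative to the average weight.
        if any(not (lo <= a + b * x <= hi) for x in x_offsets):
            bad += 1
    return bad / len(lines)

rng = random.Random(2019)
# A vague slope prior that allows strongly negative and strongly positive slopes.
vague = simulate_prior_lines(10_000, lambda r: r.gauss(0.0, 10.0), rng)
# A slope prior restricted to positive values.
positive = simulate_prior_lines(10_000, lambda r: r.lognormvariate(0.0, 1.0), rng)

frac_vague = fraction_implausible(vague)
frac_positive = fraction_implausible(positive)
print(f"implausible lines, Normal(0, 10) slope:   {frac_vague:.2f}")
print(f"implausible lines, LogNormal(0, 1) slope: {frac_positive:.2f}")
```

Comparing the two fractions shows how a prior that looks harmless on paper can place most of its mass on predictions nobody believes.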

**More emphasis on the distinction between prediction and inference.** Chapter 5, the chapter on multiple regression, has been split into two chapters. The first chapter focuses on helpful aspects of regression. The second focuses on ways that it can mislead. This also allows a more direct discussion of causal inference, which means that DAGs (directed acyclic graphs) make an appearance. The chapter on overfitting, now Chapter 7, is also more direct in cautioning about the predictive nature of information criteria and cross-validation. Cross-validation and importance sampling approximations of it are now discussed explicitly.

**New model types.** Chapter 4 now presents simple splines. Chapter 7 introduces one kind of robust regression. Chapter 12 explains how to use ordered categorical predictor variables. Chapter 13 presents a very simple type of social network model, the social relations model. Chapter 14 has an example of a phylogenetic regression, with a somewhat critical and heterodox presentation. And there is an entirely new chapter, Chapter 16, that focuses on models that are not easily conceived of as GLMMs, including ordinary differential equation models.

**Some new data examples.** There are some new data examples, including the Japanese cherry blossoms time series on the cover and a larger primate evolution data set with 300 species and a matching phylogeny.

**More presentation of raw Stan models.** There are many more places now where raw Stan model code is explained. I hope this makes a transition to working directly in Stan easier. But most of the time, working directly in Stan is still optional.

**Much more material on the details of Hamiltonian Monte Carlo.** There is detailed material on divergent transitions now, and complete raw R code for implementing a simple HMC simulation.
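The book's implementation is in R and is not reproduced here. As a rough sketch of the mechanics only, the Python below runs Hamiltonian Monte Carlo on a one-dimensional standard normal target. The target, the step size, the number of leapfrog steps, and all function names are assumptions picked to keep the example short.

```python
import math
import random

def neg_log_density(q):
    """Potential energy U(q) for a standard normal target (up to a constant)."""
    return 0.5 * q * q

def grad_neg_log_density(q):
    """Gradient of U(q)."""
    return q

def hmc_step(q, rng, step_size=0.1, n_leapfrog=20):
    """One HMC transition: random momentum, leapfrog path, accept or reject."""
    p = rng.gauss(0.0, 1.0)
    q_new, p_new = q, p
    # Leapfrog integrator: half step in momentum, then alternating full steps.
    p_new -= 0.5 * step_size * grad_neg_log_density(q_new)
    for i in range(n_leapfrog):
        q_new += step_size * p_new
        if i < n_leapfrog - 1:
            p_new -= step_size * grad_neg_log_density(q_new)
    # Final half step in momentum.
    p_new -= 0.5 * step_size * grad_neg_log_density(q_new)
    # Total energy before and after; a perfect integrator would conserve it.
    h_old = neg_log_density(q) + 0.5 * p * p
    h_new = neg_log_density(q_new) + 0.5 * p_new * p_new
    # Metropolis correction for the integrator's numerical error.
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new, True
    return q, False

rng = random.Random(42)
q = 0.0
samples = []
accepted = 0
for _ in range(5000):
    q, ok = hmc_step(q, rng)
    samples.append(q)
    accepted += ok

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
accept_rate = accepted / len(samples)
print(f"mean = {mean:.2f}, variance = {var:.2f}, acceptance = {accept_rate:.2f}")
```

In this sketch a large gap between `h_old` and `h_new` signals that the simulated path has drifted far from the true one. That kind of integration failure is the problem divergent-transition warnings are meant to flag.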

## Pretty soon now you’re gonna get older

Not everything has changed. Mostly it is the same book, with the same kind style.

As in the first edition, I have tried to make the material as kind as possible. None of this stuff is easy, and the journey into understanding is long and haunted. It is important that readers expect that confusion is normal. This is also the reason that I have not changed the basic modeling strategy in the book.

First, I force the reader to explicitly specify every assumption of the model. Some readers of the first edition lobbied me to use simplified formula tools like **brms** or **rstanarm**. Those are fantastic packages, and graduating to use them after this book is recommended. But I don’t see how a person can come to understand the model when using those tools. The priors being hidden isn’t the most limiting part. Instead, since linear model formulas like y ~ (1|x) + z don’t show the parameters, nor even all of the terms, it is not easy to see how the mathematical model relates to the code. It is ultimately kinder to be a bit cruel and require more work. So the formula lists remain. In this book, you are programming the log-posterior, down to the exact relationship between each variable and coefficient. You’ll thank me later.
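To show what "programming the log-posterior" can look like when nothing is hidden, here is a small Python sketch of a linear regression in which every prior and every likelihood term is its own line. It is not the book's formula-list syntax. The model, the prior values, the made-up toy data, and the helper names are all assumptions for illustration.

```python
import math

def normal_logpdf(x, mean, sd):
    """Log-density of Normal(mean, sd) at x."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def exponential_logpdf(x, rate):
    """Log-density of Exponential(rate) at x; -inf outside the support."""
    return math.log(rate) - rate * x if x >= 0 else -math.inf

def log_posterior(a, b, sigma, xs, ys):
    """Unnormalized log-posterior, one line per model assumption."""
    if sigma <= 0:
        return -math.inf
    lp = 0.0
    lp += normal_logpdf(a, 0.0, 10.0)      # a ~ Normal(0, 10)
    lp += normal_logpdf(b, 0.0, 1.0)       # b ~ Normal(0, 1)
    lp += exponential_logpdf(sigma, 1.0)   # sigma ~ Exponential(1)
    for x, y in zip(xs, ys):
        mu = a + b * x                     # mu_i = a + b * x_i
        lp += normal_logpdf(y, mu, sigma)  # y_i ~ Normal(mu_i, sigma)
    return lp

# Made-up toy data, roughly following y = x.
xs = [-1.0, 0.0, 1.0, 2.0]
ys = [-0.9, 0.2, 1.1, 1.8]

good = log_posterior(0.0, 1.0, 0.5, xs, ys)
bad = log_posterior(0.0, -1.0, 0.5, xs, ys)
print(good > bad)  # a slope near the data's trend scores higher
```

Every parameter in the function signature appears in exactly one prior line and at least one likelihood line, which is the mapping a compact formula such as y ~ x leaves implicit.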

Second, half the book goes by before MCMC appears. Some readers of the first edition wanted me to start instead with MCMC. I do not do this because Bayes is not about MCMC. We seek the posterior distribution, but there are many legitimate approximations of it. MCMC is just one set of strategies. Using quadratic approximation in the first half also allows a clearer tie to non-Bayesian algorithms. And since finding the quadratic approximation is fast, it means readers don’t have to struggle with too many things at once.

## Turn and face the links

The publisher has a page for the second edition up. The R package will remain on github. The current Experimental branch will become the master branch when the book appears in print.

## Table of Contents for Second Edition


**Preface to the Second Edition**

**Preface**

Audience

Teaching strategy

How to use this book

Installing the rethinking R package

Acknowledgments

**Chapter 1. The Golem of Prague**

Statistical golems

Statistical rethinking

Tools for golem engineering

Summary

**Chapter 2. Small Worlds and Large Worlds**

The garden of forking data

Building a model

Components of the model

Making the model go

Summary

Practice

**Chapter 3. Sampling the Imaginary**

Sampling from a grid-approximate posterior

Sampling to summarize

Sampling to simulate prediction

Summary

Practice

**Chapter 4. Geocentric Models**

Why normal distributions are normal

A language for describing models

Gaussian model of height

Linear prediction

Curves from lines

Summary

Practice

**Chapter 5. The Many Variables & The Spurious Waffles**

Spurious association

Masked relationship

Categorical variables

Summary

Practice

**Chapter 6. The Haunted DAG & The Causal Terror**

Multicollinearity

Post-treatment bias

Collider bias

Confronting confounding

Summary

Practice

**Chapter 7. Ulysses’ Compass**

The problem with parameters

Entropy and accuracy

Golem taming: regularization

Predicting predictive accuracy

Model comparison

Summary

Practice

**Chapter 8. Conditional Manatees**

Building an interaction

Symmetry of interactions

Continuous interactions

Summary

Practice

**Chapter 9. Markov Chain Monte Carlo**

Good King Markov and his island kingdom

Metropolis algorithms

Hamiltonian Monte Carlo

Easy HMC: ulam

Care and feeding of your Markov chain

Summary

Practice

**Chapter 10. Big Entropy and the Generalized Linear Model**

Maximum entropy

Generalized linear models

Maximum entropy priors

Summary

**Chapter 11. God Spiked the Integers**

Binomial regression

Poisson regression

Multinomial and categorical models

Summary

Practice

**Chapter 12. Monsters and Mixtures**

Over-dispersed counts

Zero-inflated outcomes

Ordered categorical outcomes

Ordered categorical predictors

Summary

Practice

**Chapter 13. Models With Memory**

Example: Multilevel tadpoles

Varying effects and the underfitting/overfitting trade-off

More than one type of cluster

Divergent transitions and non-centered priors

Multilevel posterior predictions

Summary

Practice

**Chapter 14. Adventures in Covariance**

Varying slopes by construction

Advanced varying slopes

Instruments and causal designs

Social relations as correlated varying effects

Continuous categories and the Gaussian process

Summary

Practice

**Chapter 15. Missing Data and Other Opportunities**

Measurement error

Missing data

Categorical errors and discrete absences

Summary

Practice

**Chapter 16. Generalized Linear Madness**

Geometric people

Hidden minds and observed behavior

Ordinary differential nut cracking

Population dynamics

Summary

Practice

**Chapter 17. Horoscopes**

Endnotes