Algebra and the Missing Oxen

Abstract: How to sample discrete parameters with Hamiltonian Monte Carlo and learn something about probability theory at the same time.

In a particular village in central Asia, children are not allowed to drink tea in the evening, until the family ox is stabled. No one remembers why this is the rule, but it has always been so.

You are the village enforcer of oxen customs. No one remembers why, but it has always been so. Each evening, you must determine which children have properly stabled their oxen. For many houses, you can see whether or not the ox is stabled. But other houses have enclosed stables, so you cannot observe the ox without appearing to accuse that child of violating the rule. To do so and then discover a properly stabled ox would be embarrassing for everyone. No one remembers why, but it has always been so.

You’d like to maximize detection of lazy children while minimizing social embarrassment. How to do this?


This is a story about parameters and how to estimate them. Many models contain discrete parameters, whether these represent latent states or missing values. In this parable, the discrete parameters of interest are the unobserved oxen.

Unfortunately, discrete parameters don’t play nice with newer, modern MCMC algorithms like Hamiltonian Monte Carlo and NUTS. You want to use NUTS, because it is a lot better than older approaches like Gibbs sampling. I wrote an entire post about the advantages. So people who learned to program models for Gibbs engines like BUGS find themselves unable to program their models in a NUTS engine like Stan.

The vast majority of models with discrete parameters can be easily coded in Stan, however. All it takes is a modicum of secondary school algebra. In the process, the models sample more efficiently and we learn something perhaps about the logic of modeling. Let’s see how to do it.

Tea and Algebra

Algebra is never the answer, but it is always the path.

In this case, you cannot see inside all the stables, but you can see which children are drinking tea. Everyone likes tea, so most children who have stabled their oxen are drinking tea. These children have earned their tea. And since everyone likes tea so much, some children who have not yet stabled their oxen are also drinking tea. These children are cheaters.

Is it worth your time, and potential embarrassment, to check the covered stables of children who are drinking tea? What about the children who are not drinking tea? Should you investigate them as well? If you had to choose, which nets you more empty stables, drinkers or non-drinkers? To answer these questions, we need a model.

The probability that a child drinks tea depends upon whether or not they have stabled their ox. Let p_\textsc{drink}=\Pr(\textsc{tea}|\textsc{ox}) be the probability a child, who has stabled their ox, drinks tea. They don’t have to drink tea, so this number may be less than 1. Let p_\textsc{cheat}=\Pr(\textsc{tea}|\textsc{no-ox}) be the probability a child, who has not yet stabled their ox, drinks tea. This number is certainly greater than zero, because tea. Finally, let s indicate whether or not a child has stabled their ox. All together, these definitions imply this statistical model:

    \begin{align*} \textsc{tea}_i &\sim \mathrm{Bernoulli}( \pi_i )\\ \pi_i &= s_i p_\textsc{drink} + (1-s_i) p_\textsc{cheat}\\ s_i &\sim \mathrm{Bernoulli}( \sigma ) \end{align*}

where the new symbol \sigma is the proportion of kids who stable their oxen properly. The variables \textsc{tea}_i and s_i are data for child i. Everything else in unobservable and must be estimated. Give these other symbols some priors to finish to joint probability model:

    \begin{align*} p_\textsc{drink} &\sim \mathrm{Beta}(2,2)\\ p_\textsc{cheat} &\sim \mathrm{Beta}(2,2)\\ \sigma &\sim \mathrm{Beta}(2,2) \end{align*}

These are very mildly regularizing priors. If you want to see what they look like, try curve( dbeta(x,2,2) , from=0 , to=1 ) in R.

And of course the entire problem stems from the fact that some of the s_i values are unobserved. We need to treat them like parameters, and Bayes will do the rest. Where is their prior? It is right there in front of you: \mathrm{Bernoulli}(\sigma). This lets us do what priors are for: Average over our ignorance. The probability of observing a child drinking tea is, by the laws of probability:

    \begin{align*} \underbrace{\Pr(\textsc{tea})}_\textsc{A} = \underbrace{ \Pr(\textsc{ox}) \Pr(\textsc{tea}|\textsc{ox}) }_\textsc{B} + \underbrace{ \Pr(\textsc{no-ox}) \Pr(\textsc{tea}|\textsc{no-ox}) }_\textsc{C} \end{align*}

This just says that the unconditional probability of tea drinking is the sum of the probability in each state—ox or not—times the chance of being in each state. It’s just the average probability over states. I’ve labeled the terms A, B, and C so we can refer back to them later. In the terms of our model, these become:

    \begin{align*} \Pr(\textsc{tea}) = \sigma p_\textsc{drink} + (1-\sigma) p_\textsc{cheat} \end{align*}

This is the bit that a Gibbs sampler does for you, by dancing over discrete states for the unobserved s values. It computes this expectation iteratively, by sampling. But we have the formula! The sampling isn’t necessary.

Finally, we can compute the probability that a child who is drinking tea has not stabled their ox. Bayes formula tells us that:

    \begin{align*} \Pr(\textsc{no-ox}|\textsc{tea}) = \frac{ \Pr(\textsc{no-ox}) \Pr(\textsc{tea}|\textsc{no-ox}) }{ \Pr(\textsc{tea}) } \end{align*}

The conditional probability is just the joint probability (tea and no-ox) divided by the thing we want to condition on (tea). But wait. Where have we seen these probabilities before? Oh yeah, they are the terms A and C above! Substituting in the model symbols:

    \begin{align*} \Pr(\textsc{no-ox}|\textsc{tea}) = \frac{\textsc{C}}{\textsc{A}} = \frac{ (1-\sigma) p_\textsc{cheat} }{ \sigma p_\textsc{drink} + (1-\sigma) p_\textsc{cheat} } \end{align*}

The rule here is that the terms we need to compute the marginal probability \Pr(\textsc{tea}) are the same terms we need to compute the probability of each unobserved s_i value.

But this also means we need to fit the model, to get estimates of p_\textsc{drink}, p_\textsc{cheat}, and \sigma from the observed oxen, before we can do the forensics on the covered stables. So let’s do that. Let’s fit the model, using the marginal likelihood.

Marginal Oxen

Since we have the formula, we can just write it into our model. I’ll write the Stan model in pieces, so it doesn’t explode all at once and look more complicated than it really is.

First, we define the observed values that are input to the model.

  int N_children;      // number of children
  int tea[N_children]; // [0,1] observed drinking tea
  int s[N_children];   // [0,1,-1] stabled ox

We’ll use -1 as a missingness code for the stabled oxen. When s[i] == -1, that means the stable is covered, so we can’t see whether or not the ox is there.

Now there are three parameters to define:

  real p_cheat;
  real p_drink;
  real sigma;

And then the model block will calculate the log-probability of the observed variables, conditional on the unobserved ones (the parameters). This is where we average over—marginalize over—the unobserved oxen.

  // priors
  p_cheat ~ beta(2,2);
  p_drink ~ beta(2,2);
  sigma ~ beta(2,2);

  // probability of tea
  for ( i in 1:N_children ) {
    if ( s[i] == -1 ) {
      // ox unobserved
      target += log_mix( 
                  sigma , 
                  bernoulli_lpmf( tea[i] | p_drink ) , 
                  bernoulli_lpmf( tea[i] | p_cheat ) );
    } else {
      // ox observed
      tea[i] ~ bernoulli( s[i]*p_drink + (1-s[i])*p_cheat );
      s[i] ~ bernoulli( sigma );

If you are already familiar with Stan models, then the only tricky part of this is the log_mix bit. All that function does is compute

log( sigma*bernoulli(tea[i]|p_drink) + (1-sigma)*bernoulli(tea[i]|p_cheat) )

But it does it with better precision. Remember, we keep things on the log-probability scale, or else small probabilities risk underflowing to zero. So log_mix, and its parent log_sum_exp (check the Stan reference manual), are really useful in these marginalization tasks.

To put this into action, we need some data. So let’s synthesize some.

N_children <- 51
s <- rbinom( N_children , size=1 , prob=0.75 )
s_obs <- s
s_obs[ sample( 1:N_children , size=21 ) ] <- -1
tea <- rbinom( N_children , size=1 , prob=s*1 + (1-s)*0.5 )

data_list <- list(
  N_children = N_children,
  tea = tea,
  s = s_obs )

Now run the Markov chains and draw samples from the joint posterior distribution of p_drink, p_cheat, and sigma.

m <- stan( model_code=stan_model , data=data_list )
print( m , probs=c( (1-0.89)/2 , 1-(1-0.89)/2 ) )
          mean se_mean   sd   5.5%  94.5% n_eff Rhat
p_cheat   0.48    0.00 0.14   0.26   0.71  3327    1
p_drink   0.93    0.00 0.05   0.84   0.98  4000    1
sigma     0.76    0.00 0.07   0.65   0.87  2924    1
lp__    -44.54    0.03 1.35 -47.17 -43.13  1771    1

So the model recovers the simulation values pretty well. But our goal is to decide whether it is worth inspecting stables of tea drinkers. That is, we want to compute C/A from the previous section. This is easy enough, at the posterior mean:

    \begin{align*} \Pr(\textsc{no-ox}|\textsc{tea}) &= \frac{ (1-\sigma) p_\textsc{cheat} }{ \sigma p_\textsc{drink} + (1-\sigma) p_\textsc{cheat} } \\ &= \frac{ (1-0.76)(0.48) }{ (1-0.76)(0.48) + (0.76)(0.93) }\\ &\approx 0.14 \end{align*}

But we don't deal in posterior means alone. We want the entire posterior distribution. So just perform the calculation above for each sample from the posterior:

post <- extract(m)

# probability no-ox|tea
B <- with( post, sigma*p_drink )
C <- with( post, (1-sigma)*p_cheat )
A <- B + C
p_noox_tea <- C / A

And now to plot it:

plot( density( p_noox_tea ) , lwd=2 ,
      xlim=c(0,1) , xlab="Pr(no-ox|tea)" ,
      main="" , bty="l" )

Most likely, a child who is drinking tea has properly stabled their ox. Why? Because the simulation assumed that most kids (75%) stable their oxen and half also obey the tea rule. What you are looking at above is the posterior probability that s_i=0 when \textsc{tea}_i=1. It is the posterior probability of a missing observation.

What about those not drinking tea? This calculation is just as before, but now we need probabilities of not drinking.

# probability no-ox|no-tea
B <- with( post, sigma*(1-p_drink) )
C <- with( post, (1-sigma)*(1-p_cheat) )
A <- B + C
p_noox_notea <- C / A
plot( density( p_noox_notea ) , lwd=2 ,
      xlim=c(0,1) , xlab="Pr(no-ox|no-tea)" ,
      main="" , bty="l" )

And this is the posterior probability that s_i=0 when \textsc{tea}_i=0. These children are much more likely to be guilty of improper animal husbandry than those who are drinking tea.

So should you check their stables? That depends upon the magnitude of the embarrassment and the importance of catching lazy children. But at least with these posterior distributions, you could do a proper decision theoretic analysis, given those costs and benefits.


In most missing observation models, we'll want to automate the calculation above. The calculations will often be different for each row with missing data, because there will be other covariates that influence the outcome and the outcome might take on more than 2 values. We can add the automation to the model's generated quantities. Here's a quick way to do it. This code will compute the posterior probability that s_i=1 on each row i.

generated quantities{
  vector[N_children] s_impute;
  for ( i in 1:N_children ) {
    if ( s[i] == -1 ) {
      vector[2] log_pox;
      real pox;
      log_pox[1] = log(sigma) + bernoulli_lpmf( tea[i] | p_drink );
      log_pox[2] = log1m(sigma) + bernoulli_lpmf( tea[i] | p_cheat );
      pox = exp(log_pox[1]) / ( exp(log_pox[1]) + exp(log_pox[2]) );
      s_impute[i] = pox;
    } else {
      s_impute[i] = s[i];

A vector of values, s_impute, is calculated, one value for each child i. When s_i was observed, the observed value is stored again. When it was instead missing, the terms from before are calculated and used to compute the posterior probability. This code could be better, by using something more efficient like the softmax function. But the version above is hopefully more useful for learning.

Now when you run the model, there will be a bunch of samples for each of these imputation variables. They will be constant for the observed cases. But they will be samples from the right posterior distribution for the unobserved cases.

This approach generalizes directly to models in which sigma is itself a function of other data (maybe teenagers are less likely to stable their oxen) and the probabilities p_drink and p_cheat are functions of various variables. Just compute the values of those symbols for each case i and then substitute them in.


For a model with discrete parameters, the recipe for defining the Stan model is:

(1) Write the probably of an outcome y[i] conditional on known values of the discrete parameters. Call this L, the conditional likelihood.

(2) List all the possible states the discrete parameters could take. For example, if you have two binary parameters, then there are four possible states: 11, 10, 01, and 00. Let j be an index for state, so that in this example j can take the values 1, 2, 3, and 4.

(3) For each state in (2), compute the probability of that state. Your model provides these probabilities, and they will depend upon the details of your model. Call each state's probability P_j.

(4) For each state in (2), compute the probability of an outcome y[i] when the discrete parameters take on those values. For example, there is a different probability for each of 11, 10, 01, and 00. You can use the expression from (1) and just insert the values of the parameters for each state. Call each state's corresponding likelihood L_j.

(5) Now you can compute the unconditional probability of y[i] by multiplying each P_j by L_j. Then sum these products for all states: M=\sum_j P_j L_j. This M is the marginal likelihood, the probability of y[i] averaging over the unknown values of the discrete parameters.

In the actual code, we must do all of the above on the log-probability scale, or otherwise numerical precision will be poor. So in practice each P_j L_j term is computed as a sum of log probabilities: term[j] = logP[j] + logL[j]. And then we can compute \log M as log_sum_exp(term).

Even More Automation

I've been thinking about these algorithms a lot lately, because I've been working on automating them in my rethinking R package, which is basically a scaffold for using Stan. It guesses at the variable definitions and automates a few things. The Experimental branch (v1.71) has prototype code for automatically generating the required mixture and imputation code. It looks like this, for our oxen example:

library(rethinking) # only version 1.71 or higher

# recode -1 to NA
data_list2 <- list(
  tea = tea,
  s = ifelse( s_obs==-1 , NA , s_obs ) )

m2 <- map2stan(
    tea ~ bernoulli(p),
    p <- s*p_drink + (1-s)*p_cheat,
    s ~ bernoulli(sigma),
    p_drink ~ beta(2,2),
    p_cheat ~ beta(2,2),
    sigma ~ beta(2,2)
  data=data_list2 , 
    sigma="lower=0,upper=1") ,
  do_discrete_imputation=TRUE , chains=4 )

The package has done this for continuous missing values for some time. It's a major topic in Chapter 14 of the textbook. Getting the discrete code to work for arbitrary binary predictors has taken more time. If you look at the Stan code—stancode(m2) will display it—you'll see that the code for a general case is much more complex.

Read More

Section 15 (page 210 in version 2.17) of the Stan reference manual has several examples of coded models with latent discrete variables.

Bob Carpenter wrote a detailed guide to coding a multispecies occupancy model.

Some models contain far too many latent states, or configurations of latent states, to efficiently marginalize over. In that case, approximations are available. See for example Fitting mechanistic epidemic models to data: a comparison of simple Markov chain Monte Carlo approaches.