## Non-randomly missing data is hard, or why weights won’t solve your survey problems and you need to think generatively

Throw this onto the big pile of stats problems that are a lot more subtle than they seem at first glance. This all started when Lauren pointed me at the post Another way to see why mixed models in survey data are hard on Thomas Lumley’s blog. Part of the problem is all the jargon in survey sampling—I couldn’t understand Lumley’s language of estimators and least squares; part of it is that missing data is hard.

The full data model

Imagine we have a a very simple population of $N^{\textrm{pop}}$ items with values normally distributed members with standard deviation known to be 2,

$y_n \sim \textrm{normal}(\mu, 2) \ \textrm{for} \ i \in 1:N^{\textrm{pop}}.$

To complete the Bayesian model, we’ll assume a standard normal prior on $\mu$,

$\mu \sim \textrm{normal}(0, 1).$

Now we’re not going to observe all $y_n$, but only a sample of the $N^{\textrm{pop}}$ elements. If the model is correct, our inferences will be calibrated in expection given a random sample of items $y_n$ from the population.

Missing data

Now let’s assume the sample of $y_n$ we observe is not drawn at random from the population. Imagine instead that we have a subset of $N$ items from the population, and for each item $n$, there is a probability $\pi_n$ that the item will be included in the sample. We’ll take the log odds of inclusion to be equal to the item’s value,

$\pi_n = \textrm{logit}^{-1}(y_n)$.

Now when we collect our sample, we’ll do something like poll $N = 2000$ people from the population, but each person $n$ only has a $\pi_n$ chance of responding. So we only wind up with $N^{\textrm{obs}}$ observations, with $N^{\textrm{miss}} = N - N^{\textrm{obs}}$ observations missing.

This situation arises in surveys, where non-response can bias results without careful adjustment (e.g., see Andrew’s post on pre-election polling, Don’t believe the bounce).

So how do we do the careful adjustment?

Approach 1: Weighted likelihood

A traditional approach is to inverse weight the log likelihood terms by the inclusion probability,

$\sum_{n = 1}^{N^{\textrm{obs}}} \frac{1}{\pi_n} \log \textrm{normal}(y_n \mid \mu, 2).$

Thus if an item has a 20% chance of being included, its weight is 5.

In Stan, we can code the weighted likelihood as follows (assuming pi is given as data).

for (n in 1:N_obs)
target += inv(pi[n]) * normal_lpdf(y[n] | mu, 2);


If we optimize with the weighted likelihood, the estimates are unbiased (i.e., the expectation of the estimate $\hat{\pi}$ is the true value $\pi$). This is borne out in simulation.

Although the parameter estimates are unbiased, the same cannot be said of the uncertainties. The posterior intervals are too narrow. Specifically, this approach fails simulation-based calibration; for background on SBC, see Dan’s blog post You better check yo self before you wreck yo self.

One reason the intervals are too narrow is that we are weighting the data as if we had observed $N$ items when we’ve only observed $N^{\textrm{obs}}$ items. That is, their weights are what we’d expect to get if we’d observed $N$ items.

So my next thought was to standardize. Let’s take the inverse weights and normalize so the sum of inverse weights is equal to $N^{\textrm{obs}}.$ That also fails. The posterior intervals are still too narrow under simulation.

Sure, we could keep fiddling weights in an ad hoc way for this problem until they were better calibrated empirically, but this is clearly the wrong approach. We’re Bayesians and should be thinking generatively. Maybe that’s why Lauren and Andrew kept telling me I should be thinking generatively (even though they work on a survey weighting project!).

Approach 2: Missing data

What is going on generativey? We poll $N$ people out of a population of $N^{\textrm{pop}}$, each of which has a $\pi_n$ chance of responding, leading to a set of responses of size $N^{\textrm{obs}}.$

Given that we know how $\pi$ relates to $y$, we can just model everything (in the real world, this stuff is really hard and everything’s estimated jointly).

Specifically, the $N^{\textrm{miss}} = N - N^{\textrm{obs}}$ missing items each get parameters $y^{\textrm{miss}}_n$ representing how they would’ve responded had they responded. We also model response, so we have an extra term $\textrm{bernoulli}(0 \mid \textrm{logit}^{-1}(y_n^{\textrm{miss}}))$ for the unobserved values and an extra term $\textrm{bernoulli}(1 \mid \textrm{logit}^{-1}(y_n))$ for the observed values.

This works. Here’s the Stan program.

data {
int N_miss;
int N_obs;
vector[N_obs] y_obs;
}
parameters {
real mu;
vector[N_miss] y_miss;
}
model {
// prior
mu ~ normal(0, 1);
// observed data likelihood
y_obs ~ normal(mu, 2);
1 ~ bernoulli_logit(y_obs);
// missing data likelihood and missingness
y_miss ~ normal(mu, 2);
0 ~ bernoulli_logit(y_miss);
}


The Bernoulli sampling statements are vectorized and repeated for each element of y_obs and y_miss. The suffix _logit indicates the argument is on the log odds scale, and could have been written:

for (n in 1:N_miss)
0 ~ bernoulli(inv_logit(y_miss[n]))


or even more explicitly,

for (n in 1:N_miss)
target += bernoulli_lpmf(0 | inv_logit(y_miss[n]));


And here’s the simulation code, including a cheap run at SBC:

library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores(), logical = FALSE)

printf <- function(msg, ...) { cat(sprintf(msg, ...)); cat("\n") }
inv_logit <- function(u) 1 / (1 + exp(-u))

printf("Compiling model.")
model <- stan_model('missing.stan')

for (m in 1:20) {

# SIMULATE DATA
mu <- rnorm(1, 0, 1);
N_tot <- 1000
y <- rnorm(N_tot, mu, 2)
z <- rbinom(N_tot, 1, inv_logit(y))
y_obs <- y[z == 1]
N_obs <- length(y_obs)
N_miss <- N_tot - N_obs

# COMPILE AND FIT STAN MODEL
fit <- sampling(model,
data = list(N_miss = N_miss, N_obs = N_obs, y_obs = y_obs),
chains = 1, iter = 5000, refresh = 0)
mu_ss <- extract(fit)\$mu
mu_hat <- mean(mu_ss)
q25 <- quantile(mu_ss, 0.25)
q75 <- quantile(mu_ss, 0.75)
printf("mu = %5.2f in 50pct(%5.2f, %5.2f) = %3s;  mu_hat = %5.2f",
mu, q25, q75, ifelse(q25 <= mu && mu <= q75, "yes", "no"), mean(mu_ss))

}


Here's some output with random seeds, with mu, mu_hat and 50% intervals and indicator of whether mu is in the 50% posterior interval.

mu =  0.60 in 50pct( 0.50,  0.60) =  no;  mu_hat =  0.55
mu = -0.73 in 50pct(-0.67, -0.56) =  no;  mu_hat = -0.62
mu =  1.13 in 50pct( 1.00,  1.10) =  no;  mu_hat =  1.05
mu =  1.71 in 50pct( 1.67,  1.76) = yes;  mu_hat =  1.71
mu =  0.03 in 50pct(-0.02,  0.08) = yes;  mu_hat =  0.03
mu =  0.80 in 50pct( 0.76,  0.86) = yes;  mu_hat =  0.81


The only problem I'm having is that this crashes RStan 2.19.2 on my Mac fairly regularly.

Exercise

How would the generative model differ if we polled members of the population at random until we got 1000 respondents? Conceptually it's more difficult in that we don't know how many non-resondents were approached on the way to 1000 respondents. This would be tricky in Stan as we don't have discrete parameter sampling---it'd have to be marginalized out.

Lauren started this conversation saying it would be hard. It took me several emails, part of a Stan meeting, buttonholing Andrew to give me an interesting example to test, lots of coaching from Lauren, then a day of working out the above simulations to convince myself the weighting wouldn't work and code up a simple version that would work. Like I said, not easy. But at least doable with patient colleagues who know what they're doing.

1. this kind of thing becomes REALLY HARD when you have a likelihood that involves lots of predictors, and unknown values in all the predictors, and unknown response function

I mean suppose for example you have 1 million people, you poll them in random order until you get 1000 respondents, and you have a model in which their age, sex, weight, height, income, education level, and zip-code determine say a score on a brief test of problem solving…

You don’t know how many people didn’t answer… you don’t know what their age,sex,weight,height,income,education level, or zip code was… and you don’t know what their problem solving score was.

all you know is that there is probably some nonresponse probability function which is *also* a function of all the demographic variables.

One suspects this is shockingly common in the world of “big data”. Like, it describes pretty much every opt-in survey anyone’s ever done right?

The usual thing is just to model the behavior of “people who answered” and pretend it’s the same as “everyone else”

2. Eric B Rasmusen says:

I’ve gotten myself confused on this before. Is it correct that one simple case is that if pi_n, the probability of being sampled, is correlated with the X’s but not with the disturbance term (so it’s correlate with Y only via the X’s), then you don’t have to do anything special and you still get an unbiased estimate? (standard errors are a different matter).

• Can you elaborate? I think you’re saying that you can learn the shape of a function Y vs X in an unbiased way even if you never see data in a certain region of the domain? That seems obviously false. Are you talking about a pure linear function?

• Thomas Lumley says:

If the probability of being sampled is known for people in the sample and bounded away from zero for people in the population, you can get unbiased estimation (and consistent estimation, under reasonable asymptotic embeddings). You can’t if any of the probabilities are zero. A Bayesian version that gets a posterior distribution centered at the right place is easy.

Similarly, if pairwise sampling probabilities are known for all pairs in the sample and bounded away from zero for pairs in the population, you can get unbiased estimation of the variance (and consistent estimation, under reasonable asymptotic embeddings). You can’t if any of the pairwise probabilities are zero. A Bayesian version that gets the right posterior variance is non-trivial.

The big problem in many (but, pace Andrew, not all) real applications is that you don’t actually know the probabilities; they have to be estimated.

• Suppose there are 1M people in a population. you sample 50, you try to regress say test score vs income, the probability to sample people with income under 10k is nonzero but small, but the number of actual people in the sample with income under 10k is zero. The response function is nonlinear. How could you *possibly* get an unbiased estimate of the shape of this function between 0 and 10k?

• Thomas Lumley says:

You can’t simultaneously fix the observed number of people with income under 10k and talk about bias. Bias is a property of sampling distributions, not of single numbers. You can (reasonably) argue that being unbiased isn’t a strong enough property to be useful, but that’s different.

And obviously it won’t work for completely arbitrary functional forms]

• Makes sense… I guess my real point is that estimate variance can be so high that the unbiased property is pretty useless as whatever value you wind up with in your actual data set is very likely to be far from correct.

and yes to the comment about arbitrary functional forms, which is very relevant to many kinds of real-world data, where for example test scores rise a lot with a bit of education, but saturate with a lot of education or similarly for things like amount of money spent on say rent as a function of income or such like.

• Interesting, now that I think about it,

As someone who is always doing Bayesian analyses, my intuitive thought when you said “unbiased” was zero average error across the posterior predictive distribution…

but of course you mean zero average error across repeated data collection.

• That wasn’t Andrew writing the post, it was me. As I was trying to extract a simple workable example from them, Lauren and Andrew kept emphasizing that everything’s unknown and estimated, usually from misspecified models. The hard lesson for me was realizing that simple inverse probability weighting, even when normalized to the number of actual items, doesn’t get the right posterior uncertainty.

3. Thomas Lumley says:

Bracketing any philosophical issues about weights and non-response, can I just note that the jargon is *not* from survey sampling?
It’s from Doug Bates’ computational formulation of mixed model loglikelihood and REML as penalised least squares problems.

4. Martha (Smith) says:

“If the model is correct, our inferences will be calibrated in expection given a random sample of items y_n from the population.”

Is there a typo here, or am I missing something?

• I meant a sample of items y[n[1]], …, y[n[M]]. But otherwise, I just meant that you automatically get calibration with Bayes if the model’s well specified.

5. Thomas Lumley says:

Another interesting example to think about is case-control sampling and logistic regression.

The model is Y_i as Bernoulli(p_i) with logit(p)=Xb, and you sample everyone with Y=1 (‘cases’) and a small fraction f of the people with Y=0 (‘controls’). The weighted analysis uses weights 1/f for the controls.

A likelihood-based or full Bayesian analysis would apparently have to model the distribution of X in the unmeasured controls. What make it interesting is that the maximum likelihood estimator turns out to be *unweighted* logistic regression with an offset to correct the log(f) bias in the intercept. The maximum likelihood estimator is the same as ignoring the sampling and not trying to model the distribution of the unobserved X. There are a series of papers in Biometrika showing that Bayesian versions of this also work (eg, https://www.jstor.org/stable/29777153)

It’s a useful check for principled approaches to deriving weighted estimation to see what they do with case-control logistic regression.

• Andrew says:

Thomas:

This is related to the principle that your data collection model is ignorable if inclusion depends only on variables included in the model, and it has applications, for example if you have a sequential data collection rule, you should include time in your model. Many Bayesians seem to miss this point and discuss sequential data collection and other selection rules too glibly.

• Tanja says:

How would one go about doing this? Where to start when trying to grasp the idea behind this in simple toy examples?

In psychophysics it’s customary to use sequential designs in which stimulus selections depend on observed responses. For example if the participant correctly identifies the location of a signal twice in a row, signal level is decreased; or in more modern approaches signal level would depend on what would minimize the expected entropy of the posterior distribution.

I don’t think I’ve ever seen anyone taking sequntiality into account when analyzing data from such designs. What would be the starting point to adding “time” to designs like these?

• Andrew says:

Tanja:

To start with, I’d try a linear or logistic regression with time as a predictor. Another option is a sequential learning model such as done in Bush and Mosteller (1954).

• Tanja says:

Thanks for these pointers.

Just by happenstance I was earlier – before reading this blog post – reading an article which you had co-authored and discussed those dog-shocking experiments… that somehow I think were related to that/a paper by Mosteller. At least there was an earlier post in your blog. Ah, tiredness makes me confused!

• gec says:

In most staircasing procedures, the operative thing is not so much time per se as it is the history of prior responses. As a result, if it is assumed that each trial depends only on the stimulus presented on that trial, as many measurement models assume (like how the Binomial likelihood assumes order of events doesn’t matter), then there is no role for past responding and it doesn’t enter into the model.

I’m not necessarily advocating this, just saying that this is probably the reason why staircase designs aren’t modeled with explicit temporal components—it is assumed that there is no trial-by-trial learning/adaptation. And time itself might play a role if there are variations in arousal/vigilance over the course of the experiment, though in my experience these changes have negligible effects on psychophysical experiments.

To me, modeling effects of trial history seems like the most useful route. Two forms of history effects seem particularly important, first, response adaptation (a participant is more likely to make responses they have made before) and second stimulus adaptation (depending on the design, a participant may “shrink” their perception of a stimulus on one trial toward or away from their perception of previous stimuli).

If we keep things simple and assume that these history effects have exponential forms, we have four history parameters to estimate: a rate of decay for response attraction (alpha_r) representing bias *toward* past responses; a rate of decay for response repulsion (beta_r) representing bias *away* from past responses; a rate of decay for stimulus attraction (alpha_s) representing a tendency to blend toward previous stimuli; and a rate of decay for stimulus repulsion (beta_s) representing a tendency to differentiate away from previous stimuli. We also assume that stimuli can be represented along a single underlying continuum “mu”. We associate each possible response category k on trial i with a “strength” A_{ik}:
A_{ik} = mu[Stimulus on trial i] + sum_{j=1}^{i – 1} ((exp(-alpha_r * (i – j)) – exp(-beta_r * (i – j))) * response[j] + (exp(-alpha_s * (i – j)) – exp(-beta_s * (i – j))) * stimulus[j])
Then we use softmax/Luce’s choice to convert these strengths to probabilities, e.g., Pr(response k on trial i) = exp(A_{ik}) / sum_{l = 1}^{Num. responses} exp(A_{il})

To be clear, the exponential is an assumption for convenience, though it is not an unreasonable model for drift/adaptation effects. I also have made no attempt to determine if this is identifiable at all. But it seems to me a reasonable starting point for modeling sequential effects in psychophysical experiments.

• gec says:

EDIT: in the strength equation, “response[j]” should really be an 0/1 indicator, “response[j] == k”.

• Thomas Lumley says:

That’s true, and important, but for me the fun point about the case-control example is that sampling *is not* ignorable, but that you still get unweighted logistic regression as the likelihood or Bayesian analysis because all the bias ends up in the intercept. Since sampling is really not ignorable, all the rules of thumb for relative efficiency of weighted analysis under ignorable sampling are wrong.

• Yuling says:

It is probably more than that: the case-control invariant mentioned by Thomas only holds for logit model, not probit, not normal normal, whereas change of a distribution does not change any data collection or variable inclusion. So there is still some other principles. Indeed, a case-control logit model does not model the selection explicitly at all.

In the high level, in many problems we often encounter (y,x) with data selection on either y or x. Most Bayesian models addresses the conditional model y|x directly (discriminative model), as long as the likelihood is invariant under samples and observation. The weighted approach can be viewed as a special case where we model p(x,y) jointly (generative model), in which p(x) is modeled through inverse probability weighting.

The naive weighting approach can be horrible. Nevertheless, these is no universal answer to whether the discriminative or generative approach is always better for all problem, otherwise the alternative will not be invented in the first place. For some problems such as measurement error, modeling X is necessary and inevitable. For some other problems such as the case-control logit model, the conditional model is as good as the joint model even without taking into account the selection. In the bottom line, there should at least be a robustness-efficiency tradeoff between these two approaches.

6. Aki Vehtari says:

One typo
0 ~ bernoulli(y_miss[n] | inv_logit(y_miss[n]))
should be
0 ~ bernoulli(inv_logit(y_miss[n]))

7. Sean says:

I think the assumption that mu is the same across the obs and missing is quite strong. I guess others are pointing this out by talking about covariates and all the methods to control for this like using an IV. I wonder how much the value a survey taker gets, either in terms of monetary compensation or some personal intrinsic value aiding data analysis, effects the estimates? Does it matter that their personal “utility” of taking the survey is higher than non-responders? Is there a fundamental difference in the two, that *may* bias estimates of the true mu?

• Yes, way too strong. I didn’t mean this to be a realistic example, just one that illustrates the problem when inclusion weights relate to outcomes. As a computer scientist, I like to build from very simple examples, but I’ve seen that statisticians like to introduce problems with more realistic examples.

In a realistic setting, we’d have to estimate the effect of incentives on inclusion probability.

8. Peter says:

Is there a reason for this statement?
1 ~ bernoulli_logit(y_obs);
All its components are observed or stipulated, so I don’t understand what Stan is doing with it probabilistically under the hood, and when I run the code without that line I get the same results. Apologies if my question is very ignorant.

As an aside, as a more applied researcher, this exercise made me feel pretty hopeless. To get this model “right” you had to know an incredibly unrealistic amount of information about the function generating missing responses. I’m left with more sympathy for those who don’t even try to build the generative model.

• Andrew says:

Peter:

In some way, the reason this model is so awkward to build is that it’s so artificial. In a typical survey, you’ll model the probability of inclusion in the sample as depending on some background variables of the respondents such as sex, ethnicity, age, and education. You can then fit a regression model for the outcome of interest, including sex, ethnicity, age, and education as predictors and get inferences for the general population using poststratification, and no survey weights are necessary. Bob’s model (or the model that Yajuan, Natesh, and I fit here) is complicated in because it is contextless, with abstract “weights” that appear out of nowhere. This sort of exercise can be useful to help us understand why it is typically a good idea to model survey inclusion as a function of demographic variables of interest.

• Peter says:

Thanks Andrew, that’s helpful context. This exercise did help me better understand simulation-based calibration and how missingness can have such a strong impact on standard errors.

• For poststratification to work you need each sub-population that you define in your sample to be represented in your sample in a representative way, or you need to know how the covariates relate to the probability to be included so you can correct the sample bias…

If thin truck drivers with beards are largely censored from your sample, and you aren’t aware of this, then you’ll conclude that overall truck drivers with beards are like your sample of truck drivers with beards, rather than the true population which has more thin ones, and your extrapolation will extend this bias.

Of course, this is true in general, if you are systematically seeing very few of some kind of thing and you aren’t aware of the fact, so you can’t try to correct for it through some model… you’ll get the wrong answer. But it’s worth mentioning because the complexity of something like this can make it look like magic.

Another way to say this is the generative model needs to be at least approximately correct, so you had better think through if there are “unknown unknowns” affecting your sample.

• Andrew says:

Daniel:

Or, to put it more formally, your inferences will be sensitive to (a) your model for who is included in the sample, and (b) your prior distribution for groups that are unrepresented or underrepresented in the sample.

An example where this arose was in some state polls in 2016 that did not poststratify on education.

• Thomas Lumley says:

There’s a practical problem in official surveys in that the design variables are often variables about a geographic area, and these variables may not be available to the analyst. For example, in NHANES, the public-use data sets include information about race/ethnicity for sampled individuals, but the sampling also involves proportion of minority individuals in small areas (?census blocks — I don’t remember exactly) and these are not part of the data set.

If you’re analysing your own survey, you always have access to the actual design variables, but if you’re analysing someone else’s survey you may not.

• The way all this works is buried deep in the Stan reference manual. What happens is that a sampling statement using ~ will drop all constant terms in the log density. In this case, as Peter notes, that means the whole thing gets dropped and it has no effect. I was just writing out the full model for symmetry, but now see how that’s confusing. Alternatively if you write

target += bernoulli_logit_lpmf(1 | y_obs);


Then you get a constant contribution to the log density. So you’ll see different values for lp__ (the unnormalized target density defined by the Stan program), but the sampling will be the same because they’re constant. We’re going to fix this going forward in the Stan 3 parser so that we’ll be able to control whether or not to include constant terms in both sampling and log probability density/mass functions.

You’re right the model’s unrealistic. It’s just an illustration of how hard this is even if you know the missingness pattern perfectly. As others pointed out, normally you have to estimate this.

9. Ben Goodrich says:

When I taught missing data classes, the explanation I gave for why “Approach 1” isn’t right in the Bayesian sense is that if you upweight the likelihood by 1 / pi_n you are assuming that the missing people that the n-th observed person represents all have y_n as their outcome. But you don’t know that for sure. Even if the missing people look like the n-th observed person on exogenous covariates, they could randomly differ on the endogenous outcome being modeled. So, you end up conditioning on something that you are not certain about and the posterior uncertainties are all off.

These weighting schemes are derived from a frequentist perspective where if you think about the sampling distribution of some estimator over datasets, instead of conditioning on the outcomes in a particular dataset, the estimator can be consistent or sometimes even unbiased.

• Andrew says:

Ben:

Yes, what you say in your first paragraph is exactly what I told Bob at the meeting! But he wanted to see it in a simulation . . .

• > But he wanted to see it in a simulation …
Hmm having been scolded as a student for stooping to simulation to check my work or get a better grasp of something, I am worried some might read this as a similarly disparaging remark.

Simulation is an experiment on probability models to learn “anything” you want about them, and if designed and analysed well, no one should be criticized for making use of them.

In fact, I would suggest anyone (no accusations here) that would refrain from such stooping are being possibly overly certain of their math skills, putting elegance above function (avoiding questions for which they can get analytical results) and more concerned about seeming to be right than actually being right. For instance, it likely still is true as Rob Tibshirani reported a couple years ago that it was impossible to publish methods in stats journals which did not have analytical results even though extensive simulation had shown them to be superior.

OK, my rant is over ;-)

• I couldn’t understand Lumley’s blog post, which Lauren pointed me to. I then couldn’t understand Lauren’s attempts to clarify. So I brought it up during the Stan meeting. But I couldn’t follow Andrew’s explanation there, either. I’m afraid the language barrier was high enough that I had to get Andrew to write down a concrete example, then work through the two modeling approaches. I find Ben’s explanation easier to follow because he kindly paraphrased “endogenous” and “exogenous” and talked about the core problem.