At a high level, many problems involve data (y, x) with selection on either y or x. Most Bayesian models address the conditional model y|x directly (a discriminative model), which is valid as long as the likelihood is the same for the sample as for the population. The weighted approach can be viewed as a special case of modeling p(x,y) jointly (a generative model), in which p(x) is modeled through inverse probability weighting.

The naive weighting approach can perform horribly. Nevertheless, there is no universal answer to whether the discriminative or the generative approach is better for all problems; otherwise the alternative would not have been invented in the first place. For some problems, such as measurement error, modeling X is necessary and unavoidable. For others, such as the case-control logit model, the conditional model is as good as the joint model even without taking the selection into account. The bottom line is that there should at least be a robustness-efficiency tradeoff between the two approaches.

]]>target += bernoulli_logit_lpmf(1 | y_obs);

Then you get a constant contribution to the log density. So you’ll see different values for `lp__` (the unnormalized target density defined by the Stan program), but the sampling will be the same because they’re constant. We’re going to fix this going forward in the Stan 3 parser so that we’ll be able to control whether or not to include constant terms in both sampling and log probability density/mass functions.
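To see why a term that depends only on data cannot change the sampling, here is a minimal sketch in plain Python rather than Stan. The data values, the grid, and the unit-variance normal likelihood are all made up for illustration; the only point is that adding the same constant to the log density at every parameter value leaves the normalized posterior unchanged while shifting the reported log density.

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def normalize(log_dens):
    # Turn unnormalized log densities on a grid into probabilities.
    m = max(log_dens)
    w = [math.exp(v - m) for v in log_dens]
    total = sum(w)
    return [v / total for v in w]

# Hypothetical data and a grid over a single parameter mu.
y_obs = [0.3, -1.2, 2.0]
grid = [i / 10.0 for i in range(-30, 31)]

def log_lik(mu):
    # Unit-variance normal log likelihood (an assumption for this sketch).
    return sum(-0.5 * (y - mu) ** 2 for y in y_obs)

# Analogue of the inclusion term: it involves only observed data,
# so it takes the same value at every mu.
const = sum(math.log(inv_logit(y)) for y in y_obs)

without_term = [log_lik(mu) for mu in grid]
with_term = [log_lik(mu) + const for mu in grid]

post_without = normalize(without_term)
post_with = normalize(with_term)

# Largest pointwise difference between the two normalized posteriors.
max_diff = max(abs(a - b) for a, b in zip(post_with, post_without))
# Shift in the unnormalized log density (the analogue of lp__).
lp_shift = with_term[0] - without_term[0]
```

The two posteriors agree up to floating-point error, and the log density differs by exactly the constant.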

You’re right the model’s unrealistic. It’s just an illustration of how hard this is even if you know the missingness pattern perfectly. As others pointed out, normally you have to estimate this.

]]>In a realistic setting, we’d have to estimate the effect of incentives on inclusion probability.

]]>Hmm having been scolded as a student for stooping to simulation to check my work or get a better grasp of something, I am worried some might read this as a similarly disparaging remark.

Simulation is an experiment on probability models to learn “anything” you want about them, and if designed and analysed well, no one should be criticized for making use of them.

In fact, I would suggest that anyone (no accusations here) who would refrain from such stooping is possibly being overly certain of their math skills, putting elegance above function (avoiding questions for which they cannot get analytical results), and more concerned about seeming to be right than actually being right. For instance, it likely is still true, as Rob Tibshirani reported a couple of years ago, that it was impossible to publish methods in stats journals that did not have analytical results, even though extensive simulation had shown them to be superior.

OK, my rant is over ;-)

]]>If you’re analysing your own survey, you always have access to the actual design variables, but if you’re analysing someone else’s survey you may not.

]]>Just by happenstance I was earlier – before reading this blog post – reading an article which you had co-authored and discussed those dog-shocking experiments… that somehow I think were related to that/a paper by Mosteller. At least there was an earlier post in your blog. Ah, tiredness makes me confused!

]]>Yes, what you say in your first paragraph is exactly what I told Bob at the meeting! But he wanted to see it in a simulation . . .

]]>Or, to put it more formally, your inferences will be sensitive to (a) your model for who is included in the sample, and (b) your prior distribution for groups that are unrepresented or underrepresented in the sample.

An example where this arose was in some state polls in 2016 that did not poststratify on education.

]]>If thin truck drivers with beards are largely censored from your sample, and you aren’t aware of this, then you’ll conclude that overall truck drivers with beards are like your sample of truck drivers with beards, rather than the true population which has more thin ones, and your extrapolation will extend this bias.

Of course, this is true in general, if you are systematically seeing very few of some kind of thing and you aren’t aware of the fact, so you can’t try to correct for it through some model… you’ll get the wrong answer. But it’s worth mentioning because the complexity of something like this can make it look like magic.

Another way to say this is the generative model needs to be at least approximately correct, so you had better think through if there are “unknown unknowns” affecting your sample.

]]>These weighting schemes are derived from a frequentist perspective where if you think about the sampling distribution of some estimator over datasets, instead of conditioning on the outcomes in a particular dataset, the estimator can be consistent or sometimes even unbiased.
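A small simulation can make the "sampling distribution over datasets" idea concrete. This is only a sketch under assumptions of my own: a made-up population of 100 values, independent (Poisson) sampling with known inclusion probabilities that increase with the value, and the standard Horvitz-Thompson estimator of the total. The naive estimator that scales up the unweighted sample mean is included for contrast.

```python
import random

def horvitz_thompson_total(values, probs, included):
    # Weight each sampled value by the inverse of its inclusion probability.
    return sum(v / p for v, p, inc in zip(values, probs, included) if inc)

rng = random.Random(1)

# Hypothetical population: larger values are more likely to be sampled.
values = [float(i) for i in range(1, 101)]
probs = [0.1 + 0.008 * v for v in values]  # ranges from 0.108 to 0.9
true_total = sum(values)  # 5050.0

n_reps = 5000
ht_sum = 0.0
naive_sum = 0.0
naive_count = 0
for _ in range(n_reps):
    included = [rng.random() < p for p in probs]
    ht_sum += horvitz_thompson_total(values, probs, included)
    sample = [v for v, inc in zip(values, included) if inc]
    if sample:  # guard against an empty sample
        naive_sum += len(values) * sum(sample) / len(sample)
        naive_count += 1

# Averages over repeated datasets, not over a posterior.
ht_mean = ht_sum / n_reps
naive_mean = naive_sum / naive_count
```

Averaged over replications, the weighted estimator lands near the true total while the naive one is pulled upward. Nothing here says anything about how either behaves conditional on one particular dataset.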

]]>I’m not necessarily advocating this, just saying that this is probably the reason why staircase designs aren’t modeled with explicit temporal components—it is assumed that there is no trial-by-trial learning/adaptation. And time itself might play a role if there are variations in arousal/vigilance over the course of the experiment, though in my experience these changes have negligible effects on psychophysical experiments.

To me, modeling effects of trial history seems like the most useful route. Two forms of history effect seem particularly important: first, response adaptation (a participant is more likely to make responses they have made before), and second, stimulus adaptation (depending on the design, a participant may “shrink” their perception of a stimulus on one trial toward or away from their perception of previous stimuli).

If we keep things simple and assume that these history effects have exponential forms, we have four history parameters to estimate: a rate of decay for response attraction (alpha_r) representing bias *toward* past responses; a rate of decay for response repulsion (beta_r) representing bias *away* from past responses; a rate of decay for stimulus attraction (alpha_s) representing a tendency to blend toward previous stimuli; and a rate of decay for stimulus repulsion (beta_s) representing a tendency to differentiate away from previous stimuli. We also assume that stimuli can be represented along a single underlying continuum “mu”. We associate each possible response category k on trial i with a “strength” A_{ik}:

A_{ik} = mu[Stimulus on trial i] + sum_{j=1}^{i - 1} ((exp(-alpha_r * (i - j)) - exp(-beta_r * (i - j))) * response[j] + (exp(-alpha_s * (i - j)) - exp(-beta_s * (i - j))) * stimulus[j])

Then we use softmax/Luce’s choice to convert these strengths to probabilities, e.g., Pr(response k on trial i) = exp(A_{ik}) / sum_{l = 1}^{Num. responses} exp(A_{il})

To be clear, the exponential is an assumption for convenience, though it is not an unreasonable model for drift/adaptation effects. I also have made no attempt to determine if this is identifiable at all. But it seems to me a reasonable starting point for modeling sequential effects in psychophysical experiments.
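For what it's worth, here is a rough Python sketch of the response-history part of the model above, followed by the softmax step. Several things are my own assumptions and not part of the description: `base[k]` stands in for the stimulus-driven term for response k, `response[j]` is coded as an indicator that the response on trial j was k, and the stimulus-history term is left out because its coding is not specified. The function names are hypothetical.

```python
import math

def history_kernel(lag, alpha, beta):
    # Attraction decaying at rate alpha minus repulsion decaying at rate beta.
    return math.exp(-alpha * lag) - math.exp(-beta * lag)

def softmax(strengths):
    # Luce's choice rule on exponentiated strengths, computed stably.
    m = max(strengths)
    w = [math.exp(a - m) for a in strengths]
    total = sum(w)
    return [v / total for v in w]

def response_probs(base, past_responses, alpha_r, beta_r):
    # base[k]: assumed baseline strength of response k on the current trial.
    # past_responses: responses on trials 1..i-1, most recent last.
    n_prev = len(past_responses)
    strengths = []
    for k, b in enumerate(base):
        # Trial at 0-based position idx is n_prev - idx trials in the past.
        hist = sum(
            history_kernel(n_prev - idx, alpha_r, beta_r)
            for idx, r in enumerate(past_responses)
            if r == k
        )
        strengths.append(b + hist)
    return softmax(strengths)

# Two response categories with equal baseline strength.
probs_no_history = response_probs([0.0, 0.0], [], alpha_r=0.5, beta_r=2.0)
probs_after_zeros = response_probs([0.0, 0.0], [0, 0, 0], alpha_r=0.5, beta_r=2.0)
```

With alpha_r smaller than beta_r the kernel is positive at every lag, so a run of identical past responses raises the probability of repeating that response. Reversing the two rates would produce net repulsion instead.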

]]>In some ways, the reason this model is so awkward to build is that it’s so artificial. In a typical survey, you’ll model the probability of inclusion in the sample as depending on some background variables of the respondents such as sex, ethnicity, age, and education. You can then fit a regression model for the outcome of interest, including sex, ethnicity, age, and education as predictors, and get inferences for the general population using poststratification; no survey weights are necessary. Bob’s model (or the model that Yajuan, Natesh, and I fit here) is complicated because it is contextless, with abstract “weights” that appear out of nowhere. This sort of exercise can be useful to help us understand why it is typically a good idea to model survey inclusion as a function of demographic variables of interest.
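The poststratification step itself is simple arithmetic once the regression has produced an estimate for each cell. The sketch below uses invented numbers for a single education split; the cell estimates, population counts, and sample sizes are all hypothetical, and a real analysis would take the cell estimates from the fitted model.

```python
def poststratify(cell_estimates, cell_counts):
    # Average the per-cell estimates using population cell counts as weights.
    total = sum(cell_counts[c] for c in cell_estimates)
    weighted = sum(cell_estimates[c] * cell_counts[c] for c in cell_estimates)
    return weighted / total

# Hypothetical model-based estimates of the outcome within each cell.
cell_estimates = {"no_college": 0.40, "college": 0.60}
# Hypothetical population counts, e.g. from a census table.
population_counts = {"no_college": 700, "college": 300}
# Hypothetical sample sizes: college graduates are overrepresented.
sample_counts = {"no_college": 50, "college": 150}

population_estimate = poststratify(cell_estimates, population_counts)
# Same formula with sample counts reproduces the unadjusted sample average.
raw_sample_estimate = poststratify(cell_estimates, sample_counts)
```

The adjusted estimate is 0.46 against 0.55 for the raw sample average, and no survey weights appear anywhere: the population counts do that work.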

]]>1 ~ bernoulli_logit(y_obs);

All its components are observed or stipulated, so I don’t understand what Stan is doing with it probabilistically under the hood, and when I run the code without that line I get the same results. Apologies if my question is very ignorant.

As an aside, as a more applied researcher, this exercise made me feel pretty hopeless. To get this model “right” you had to know an incredibly unrealistic amount of information about the function generating missing responses. I’m left with more sympathy for those who don’t even try to build the generative model.

]]>To start with, I’d try a linear or logistic regression with time as a predictor. Another option is a sequential learning model such as done in Bush and Mosteller (1954).

]]>0 ~ bernoulli(y_miss[n] | inv_logit(y_miss[n]))

should be

0 ~ bernoulli(inv_logit(y_miss[n])) ]]>

In psychophysics it’s customary to use sequential designs in which stimulus selections depend on observed responses. For example if the participant correctly identifies the location of a signal twice in a row, signal level is decreased; or in more modern approaches signal level would depend on what would minimize the expected entropy of the posterior distribution.
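As a concrete picture of such a design, here is a minimal sketch of a staircase rule. The decrease after two consecutive correct responses follows the description above; the increase after any error, the step size, and the function name are my assumptions (this is the common "two-down, one-up" convention, not something stated here).

```python
def staircase_update(level, correct, streak, step=1.0):
    # Returns the next signal level and the updated run of correct responses.
    if correct:
        streak += 1
        if streak == 2:
            # Two correct in a row: make the task harder.
            return level - step, 0
        return level, streak
    # Assumed rule: any error makes the task easier and resets the run.
    return level + step, 0

level, streak = 10.0, 0
levels = []
for correct in [True, True, True, False, True, True]:
    level, streak = staircase_update(level, correct, streak)
    levels.append(level)
```

Every stimulus level after the first is a deterministic function of earlier responses, which is exactly the dependence the question is about.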

I don’t think I’ve ever seen anyone take sequentiality into account when analyzing data from such designs. What would be the starting point for adding “time” to models for designs like these?

]]>This is related to the principle that your data collection model is ignorable if inclusion depends only on variables included in the model. It has applications: for example, if you have a sequential data collection rule, you should include time in your model. Many Bayesians seem to miss this point and discuss sequential data collection and other selection rules too glibly.

]]>The model is Y_i as Bernoulli(p_i) with logit(p)=Xb, and you sample everyone with Y=1 (‘cases’) and a small fraction f of the people with Y=0 (‘controls’). The weighted analysis uses weights 1/f for the controls.

A likelihood-based or fully Bayesian analysis would apparently have to model the distribution of X in the unmeasured controls. What makes it interesting is that the maximum likelihood estimator turns out to be *unweighted* logistic regression with an offset to correct the log(f) bias in the intercept. That is, the maximum likelihood estimator is the same as ignoring the sampling and not trying to model the distribution of the unobserved X. There is a series of papers in Biometrika showing that Bayesian versions of this also work (e.g., https://www.jstor.org/stable/29777153).

It’s a useful check for principled approaches to deriving weighted estimation to see what they do with case-control logistic regression.
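The intercept shift can be checked directly with a few lines of arithmetic. This sketch assumes the setup described above (all cases kept, controls kept with probability f) and uses made-up coefficients; it verifies that, conditional on being sampled, the log odds of being a case equal the population log odds minus log(f), with the slope untouched.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def sampled_case_prob(p, f):
    # P(Y = 1 | x, sampled) when cases are always kept and controls
    # are kept with probability f (Bayes' rule on the selection).
    return p / (p + f * (1.0 - p))

f = 0.05            # hypothetical control sampling fraction
b0, b1 = -3.0, 0.8  # hypothetical population intercept and slope

max_gap = 0.0
for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    p = inv_logit(b0 + b1 * x)
    in_sample = logit(sampled_case_prob(p, f))
    # Claimed form: same slope, intercept shifted by -log(f).
    claimed = (b0 - math.log(f)) + b1 * x
    max_gap = max(max_gap, abs(in_sample - claimed))
```

Because the shift does not depend on x, an unweighted fit on the sampled data recovers b1 directly, and adding log(f) to the fitted intercept recovers b0.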

]]>As someone who is always doing Bayesian analyses, my intuitive thought when you said “unbiased” was zero average error across the posterior predictive distribution…

but of course you mean zero average error across repeated data collection.

]]>and yes to the comment about arbitrary functional forms, which is very relevant to many kinds of real-world data, where for example test scores rise a lot with a bit of education, but saturate with a lot of education or similarly for things like amount of money spent on say rent as a function of income or such like.

]]>And obviously it won’t work for completely arbitrary functional forms]

]]>Is there a typo here, or am I missing something?

]]>Similarly, if pairwise sampling probabilities are known for all pairs in the sample and bounded away from zero for pairs in the population, you can get unbiased estimation of the variance (and consistent estimation, under reasonable asymptotic embeddings). You can’t if any of the pairwise probabilities are zero. A Bayesian version that gets the right posterior variance is non-trivial.

The big problem in many (but, pace Andrew, not all) real applications is that you don’t actually know the probabilities; they have to be estimated.

]]>It’s from Doug Bates’ computational formulation of mixed model loglikelihood and REML as penalised least squares problems. ]]>

I mean suppose for example you have 1 million people, you poll them in random order until you get 1000 respondents, and you have a model in which their age, sex, weight, height, income, education level, and zip-code determine say a score on a brief test of problem solving…

You don’t know how many people didn’t answer… you don’t know what their age, sex, weight, height, income, education level, or zip code was… and you don’t know what their problem solving score was.

All you know is that there is probably some nonresponse probability function which is *also* a function of all the demographic variables.

One suspects this is shockingly common in the world of “big data”. Like, it describes pretty much every opt-in survey anyone’s ever done right?

The usual thing is just to model the behavior of “people who answered” and pretend it’s the same as “everyone else”
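A back-of-the-envelope calculation shows how far off that pretence can be. The numbers below are entirely invented: two groups with different response rates and different mean scores. The sketch compares the population mean with the mean you would expect to see among respondents.

```python
# Hypothetical groups: population share, response rate, and mean score.
groups = {
    "group_a": {"share": 0.6, "response_rate": 0.05, "mean_score": 40.0},
    "group_b": {"share": 0.4, "response_rate": 0.20, "mean_score": 60.0},
}

# True population mean: weight each group by its population share.
population_mean = sum(g["share"] * g["mean_score"] for g in groups.values())

# Expected mean among respondents: each group's weight is its share
# multiplied by its response rate.
resp_weight = sum(g["share"] * g["response_rate"] for g in groups.values())
respondent_mean = sum(
    g["share"] * g["response_rate"] * g["mean_score"] for g in groups.values()
) / resp_weight

bias = respondent_mean - population_mean
```

Here the respondent mean is about 54.5 against a population mean of 48, and nothing in the respondents' data alone reveals the gap unless the response rates, or the variables that drive them, are known or modeled.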

]]>