Hi Andrew,

Another related and probably simple question. I understand the general idea of MRP when all of the variables are survey strata variables – e.g. sex, race, region, … etc. However, in practice, these are not the only variables of interest. Other non-strata type variables such as attitudes, opinions, etc., are going to be in a regression model. In my original post, I mentioned student and school factors that are not strata type variables. How are those estimates handled in MRP? Is it reasonable to assume that there is no adjustment that can be made to them and that posterior predictive distributions would still be better when combined with strata-variables that have been post-stratified? I guess what I am really asking is how does one do post-stratification in a Bayesian regression analysis when some of the variables are strata (e.g. sex) for which I can get population estimates, and some are not. I hope that made sense.

]]>Thanks.

]]>Thomas, Aleja:

There are 3 ways of handling this sort of non-census variable in MRP:

1. Compute weighted averages within poststratification cells, as we do here; see here for clean code.

2. Include these additional variables in the poststratification and do some modeling to estimate the population distribution; see here for an early example.

3. Adjust for enough census variables that you don’t feel you have to worry about any remaining adjustment variables. Remember that the weights are just a means to an end, which is inference for the general population.
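
For readers who want the mechanics spelled out, here is a minimal sketch of the poststratification step that all three options feed into: cell-level estimates are averaged using population cell counts. The cell labels, estimates, and counts are made-up numbers for illustration, and `poststratify` is a hypothetical helper, not code from the linked examples.

```python
def poststratify(cell_estimates, cell_counts):
    """Population estimate = sum_j N_j * theta_j / sum_j N_j over cells j."""
    total = sum(cell_counts[cell] for cell in cell_estimates)
    weighted = sum(cell_counts[cell] * cell_estimates[cell] for cell in cell_estimates)
    return weighted / total

# Hypothetical model-based estimates for each (sex, region) cell.
cell_estimates = {
    ("female", "north"): 0.60,
    ("female", "south"): 0.50,
    ("male", "north"): 0.40,
    ("male", "south"): 0.30,
}

# Hypothetical population counts for the same cells (e.g. from a census).
cell_counts = {
    ("female", "north"): 300,
    ("female", "south"): 200,
    ("male", "north"): 250,
    ("male", "south"): 250,
}

estimate = poststratify(cell_estimates, cell_counts)
print(round(estimate, 3))
```

Under option 2, the counts for cells defined by a non-census variable would themselves be model estimates rather than known quantities, but the averaging step stays the same.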

]]>Or the Fragile Families Survey. There are many other surveys where the sampling design variables are not included in the public use files. What is the procedure in that situation?

]]>I infer that your weights model something like probability of response, to account for oversampling, and that’s increasing variance. But that approach does nothing to account for (potential) similarity of responses among those with similarly sized networks. What if you created “classes” representing different ranges of network size, via a median split or some substantive or theoretically driven partitioning? It seems plausible that people with a similar number of connections might respond similarly (or, if they are on different sides of an issue, might hold similarly extreme or moderate views). If, in fact, the lower third of network sizes all respond similarly, and the same holds for the middle and upper thirds, that could soak up a lot of the variance your weights are adding. Put another way, you currently have up to n classes based on number of connections, with a different weight assigned to each class. You might reduce the number of classes to as few as 2 or 3, then apply the weights within classes. Not only might the classes themselves absorb variance, but the nested weights might also add less variance (since the class itself already explains some of the oversampling). I should disclose that surveys aren’t my area of expertise, but it sounds like a cool idea that you could try, checking the ICCs to see whether it helps.

]]>Previous research on multilevel PISA analysis suggests using weights at level 1, but scaling weights for level 2 (in this case, schools). For example: Rabe‐Hesketh, S., & Skrondal, A. (2006). Multilevel modelling of complex survey data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 169(4), 805-827. There is some consensus in recommending against weights at level 2, because the student weight (at level 1) already corresponds to the inverse of the joint probability of selection for a particular student in a given school. See: Rutkowski, L., Gonzalez, E., Joncas, M., & von Davier, M. (2010). International large-scale assessment data: Issues in secondary analysis and reporting. Educational Researcher, 39(2), 142-151.

Additionally, in the case of PISA, there are 80 replicate weights (BRR) to address sampling variance. In R, for example, BIFIEsurvey can handle 2-level regressions with weights, replicate weights (and plausible values as outcome variables). I am not sure whether there is any ‘ready to use’ Bayesian approach.

Obviously this is far from MrP but, in some cases, it is worth knowing about when it is not possible to use post-stratification (in my case, I do cross-country models).

]]>Alain:

I discuss respondent-driven sampling here. The short answer is that in real-world respondent-driven sampling, you don’t actually know the probability of selection. The way I recommend analyzing such data is to poststratify on gregariousness (number of social connections) or some similar variable. The distribution of gregariousness in the population won’t be known, so that itself will have to be estimated from the sample.

]]>Thanks for your reply. Is multilevel modeling still needed if the bias comes from a SINGLE variable?

For example, with respondent-driven sampling (RDS), participants with a larger social network get over-represented in the sample. It is common practice to correct for this bias by using weights that are inversely proportional to the degree of participants (i.e. number of social connections). The adjustment is therefore based on a single variable.
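
To make the inverse-degree correction concrete, here is a minimal sketch. It assumes each respondent reports a positive degree, and it uses Kish's approximation for the effective sample size to show how much the weights cost in precision. The responses and degrees are invented for illustration.

```python
def inverse_degree_estimate(y, degree):
    """Weighted mean with weights 1/degree, plus Kish's effective sample size."""
    weights = [1.0 / d for d in degree]
    total_weight = sum(weights)
    estimate = sum(w * yi for w, yi in zip(weights, y)) / total_weight
    # Kish's approximation: n_eff = (sum w)^2 / sum w^2
    n_eff = total_weight ** 2 / sum(w ** 2 for w in weights)
    return estimate, n_eff

# Hypothetical binary responses and reported network sizes.
y = [1, 1, 0, 1, 0, 0]
degree = [20, 10, 2, 10, 4, 1]

unweighted_mean = sum(y) / len(y)
weighted_mean, n_eff = inverse_degree_estimate(y, degree)

print(unweighted_mean, round(weighted_mean, 3), round(n_eff, 2))
```

In this toy example the well-connected respondents mostly answer 1, so down-weighting them moves the estimate well below the raw mean, while the effective sample size drops to roughly half of the six actual responses.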

Assuming the degree distribution of the target population was known (which is not the case in practice…), controlling for degree before post-stratification should work? No need for multilevel in that case?

]]>Alain:

No method corrects bias from unmodeled variables, but, to the extent that weighting corrects for bias, poststratification does so too. No tradeoff needed. Poststrat increases variance if you poststratify the raw data but not if you first do multilevel regression or regularized prediction. See here and here.
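
Here is a rough sketch of why regularizing first helps. It is not a full multilevel regression: it uses the textbook normal-normal shrinkage formula with the variances `sigma2` and `tau2` treated as known, which is a simplifying assumption, and all numbers are invented.

```python
def partial_pool(cell_mean, cell_n, grand_mean, sigma2, tau2):
    """Precision-weighted compromise between a cell's raw mean and the grand mean."""
    data_precision = cell_n / sigma2   # grows with the cell's sample size
    prior_precision = 1.0 / tau2       # strength of pooling toward the grand mean
    return (data_precision * cell_mean + prior_precision * grand_mean) / (
        data_precision + prior_precision
    )

def poststratify(estimates, counts):
    """Average cell estimates using population counts."""
    return sum(n * e for e, n in zip(estimates, counts)) / sum(counts)

# Hypothetical cells: the first has only 2 respondents, the second has 200.
raw_means = [0.9, 0.4]
sample_sizes = [2, 200]
population_counts = [500, 500]
grand_mean, sigma2, tau2 = 0.5, 0.25, 0.01  # assumed known for this sketch

pooled = [
    partial_pool(m, n, grand_mean, sigma2, tau2)
    for m, n in zip(raw_means, sample_sizes)
]

raw_estimate = poststratify(raw_means, population_counts)
pooled_estimate = poststratify(pooled, population_counts)

print(round(raw_estimate, 3), [round(p, 3) for p in pooled], round(pooled_estimate, 3))
```

The noisy two-person cell gets pulled strongly toward the grand mean, while the large cell barely moves. That is how the poststratified estimate avoids inheriting the variance of sparse cells.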

]]>When the sample is not representative, not weighting yields a biased estimate, but weighting increases variance (which is annoying when you are after statistically significant results…)

Does post-stratification increase the variance (of the population estimate) as weighting does?

PS: if it does not, it would help me convince my boss to move to Bayesian analysis and post-stratification

]]>