This has come up before (in 2011 and in 2018) but someone just asked me again today so I thought I’d give this advice again, since I haven’t seen it written anywhere.
Here was the question:
I’m reaching out to seek your advice on how to integrate two probability samples for the new Poverty Tracker cohort.
In the Poverty Tracker 2024 cohort, we finally got the chance to include an Address-Based Sampling (ABS) sample. Our sample design has half of the sample coming from the Random Digit Dial frame and the other half recruited through Address-Based Sampling. I’m trying to map out how we’d be able to integrate the samples from the two frames but am having a hard time finding resources.
And here’s my reply:
The right thing to do is to simply pool the data from both samples into a single dataset. In that dataset, include an indicator that says, for each data point, which sample it came from. Then do all analysis (including construction of weights, if we want to do that) using the combined dataset. When fitting models or running regressions, include the sample indicator as a predictor, just in case it is predictive of anything.
You can do something similar if you want to combine more than two samples; just include an indicator for each sample. And the same idea applies when combining raw data from multiple surveys (although then you might need to do some work to line up relevant poststratification variables, for example if the two surveys use different categories or different question wordings when asking about education or ethnicity or party identification or whatever).
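Here is a minimal sketch of the pooling step in Python, assuming each sample arrives as a list of records that already share the same variables. The records, the field names (`age`, `income`), and the helper functions `pool` and `ols` are made up for illustration; in practice you would use your usual regression software rather than a hand-rolled solver.

```python
# Toy records for illustration only; field names and values are made up.
rdd_sample = [{"age": 25, "income": 31.0}, {"age": 40, "income": 52.0},
              {"age": 60, "income": 47.0}]
abs_sample = [{"age": 30, "income": 38.0}, {"age": 50, "income": 61.0},
              {"age": 70, "income": 44.0}]


def pool(samples):
    """Stack records from any number of samples, tagging each with its source."""
    pooled = []
    for name, records in samples.items():
        for rec in records:
            pooled.append({**rec, "sample": name})
    return pooled


def ols(X, y):
    """Least squares via the normal equations (X'X) b = X'y, Gauss-Jordan solve."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


pooled = pool({"rdd": rdd_sample, "abs": abs_sample})

# Regress income on age plus the sample indicator (1 = ABS, 0 = RDD).
X = [[1.0, r["age"], 1.0 if r["sample"] == "abs" else 0.0] for r in pooled]
y = [r["income"] for r in pooled]
intercept, age_coef, abs_coef = ols(X, y)
print("coefficient on ABS indicator:", abs_coef)
```

With more than two samples, the same `pool` call works unchanged, and the regression would get one 0/1 column per sample beyond the baseline.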
Strictly from a sampling perspective (i.e., not accounting for differential nonresponse and measurement error across modes), this is a multiple-frame survey estimation problem. There is a pretty large literature on that in survey sampling, with various approaches to integrating the samples. It’s almost 20 years old now, but Lohr and Rao in JASA (https://www.jstor.org/stable/27590779) is a great review paper on this topic.
At ANES this time, we integrated the various sample sources (FtF, Web and Panel) using a Hartley estimator minimizing the MSE of the overall vote choice estimate.
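For readers who haven’t seen it, here is a rough sketch of the textbook two-frame Hartley estimator for a population total. This is not the ANES procedure, which involved three sample sources and an MSE criterion; it is only the standard variance-minimizing form, assuming two independent samples. The domain totals, variances, and function names are hypothetical, and clamping the mixing parameter to [0, 1] is a choice made for this sketch.

```python
def hartley_estimate(y_a, y_ab_from_a, y_ab_from_b, y_b, theta):
    """Dual-frame estimate of a total.

    y_a, y_b      : estimated totals for units reachable through only frame A / only frame B
    y_ab_from_a/b : two estimates of the overlap-domain total, one from each sample
    theta         : weight in [0, 1] given to sample A's overlap estimate
    """
    return y_a + theta * y_ab_from_a + (1.0 - theta) * y_ab_from_b + y_b


def hartley_theta(var_ab_from_a, var_ab_from_b, cov_a=0.0, cov_b=0.0):
    """Variance-minimizing theta for independent samples.

    cov_a = Cov(y_a, y_ab_from_a), cov_b = Cov(y_b, y_ab_from_b).
    Both default to 0, which is a simplifying assumption of this sketch.
    """
    theta = (var_ab_from_b + cov_b - cov_a) / (var_ab_from_a + var_ab_from_b)
    return min(1.0, max(0.0, theta))  # clamp so the result stays a convex combination


# Made-up numbers: sample B measures the overlap domain more precisely,
# so it receives most of the weight there.
theta = hartley_theta(var_ab_from_a=4.0, var_ab_from_b=1.0)
total = hartley_estimate(100.0, 50.0, 60.0, 30.0, theta)
print(theta, total)
```

The idea is that the overlap domain is estimated twice, once from each sample, and theta leans toward whichever estimate is less noisy.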
Please elaborate on “simply pool the data together from both samples into a single dataset.” If each survey has a complex stratified design with different weights, how are they “simply” pooled? What do you do regarding the weights from each design?
Dale:
I would pool the data together without weights and then construct weights on the pooled sample.
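To make that concrete, here is a bare-bones sketch of poststratification weights built on the pooled sample, ignoring the original design weights. The cell variable (`age_group`), the population shares, and the function name are hypothetical; a real application would use more cells or raking, and would need to handle empty cells.

```python
from collections import Counter


def poststratification_weights(pooled, cell_key, population_shares):
    """Weight each record by (population share of its cell) / (pooled-sample share of its cell)."""
    counts = Counter(rec[cell_key] for rec in pooled)
    n = len(pooled)
    return [population_shares[rec[cell_key]] / (counts[rec[cell_key]] / n)
            for rec in pooled]


# Toy pooled data: records from both samples, already stacked with a source indicator.
pooled_records = [
    {"sample": "rdd", "age_group": "young"},
    {"sample": "rdd", "age_group": "young"},
    {"sample": "abs", "age_group": "young"},
    {"sample": "abs", "age_group": "old"},
]
# Hypothetical population shares, e.g. taken from census figures.
shares = {"young": 0.5, "old": 0.5}

weights = poststratification_weights(pooled_records, "age_group", shares)
print(weights)
```

The weights depend only on each record’s cell, not on which sample it came from; the sample indicator stays in the data so it can be used as a predictor or as an additional adjustment variable.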
So, if the surveys each have complex but different designs, constructing the weights after pooling sounds somewhat complex to me. So, I am questioning your claim to “simply” pool the data – is it still simple in such cases?
Dale:
Sure, real life can get complicated! Even with just one survey and not two, if you have a complicated design it can be difficult to generalize to the population, and simple weighting will often not do the job.
being careful, of course, not to construct so many weights that the sample sinks to the bottom of the pool… :)
Hi Team,
Suppose the respondents in the two surveys are sampled from the same large population, but the surveys have different sets of covariates, depending on the system each was sampled from. If I also want to include those system-specific covariates, what would be the best way to maximize the bias correction?
Sample 1 has A * B * C * D
Sample 2 has A * B * E * F