Of buggy whips and moral hazards; or, Sympathy for the Aapor

We’ve talked before about those dark-ages classical survey sampling types who say you can’t do poop with opt-in samples. The funny thing is, these people do all sorts of adjustment themselves, in the sampling or in post-data weighting or both, to deal with the inevitable fact that the people you can actually reach when you try to do a random sample are not actually representative of the general population. They talk like hard-liners, but if they were really hard-liners they’d have given up on surveys already.

Anyway, I was talking with someone today (29 Jan, not that it matters, but there you have it) about this dispute and he was stunned that those buggy-whip guys could take such a ridiculous position, given what we know about the real world of surveys (and also what we’ve learned about the effectiveness of Mister P and other approaches to model-based survey adjustment).

And I realized that I wasn’t presenting the whole story. I’ve been handling the “statics” ok while missing some of the “dynamics.”

What I mean is, I think the buggy-whip people have (some of) a point. I disagree with them, but they have a legitimate argument. It was hard for me to find this argument because they weren’t actually making it explicitly, but I think I can find it, deep inside their pile of rhetoric.

The legitimate argument of the buggy-whip people has nothing to do with “grounding in theory” or any of the other pseudo-rigorous stuff that they’re saying. Rather, their best argument is all about moral hazard.

It goes like this. If it becomes widely accepted that properly adjusted opt-in samples can give reasonable results, then there’s a motivation for survey organizations to not even try to get representative samples, to simply go with the sloppiest, easiest, most convenient thing out there. Just put up a website and have people click. Or use Mechanical Turk. Or send a couple of interviewers with clipboards out to the nearest mall to interview passersby. Whatever. Once word gets out that it’s OK to adjust, there goes all restraint.

It’s the same reason why we shouldn’t put air bags and bumpers on cars—it just encourages people to drive recklessly.

I don’t find the moral hazard argument particularly convincing—for one thing, I worry about the framing in terms of bad opt-in samples and good probability samples, as I think it encourages a dangerous complacency with probability samples.

And, for that matter, I’m not a fan of crappy sampling: the worse the sampling, the more of a burden it puts on the adjustment. That’s why I think we should be emphasizing sampling design, practical sampling, careful measurement, and comprehensive adjustment as complementary tools in surveys. You want to do all four of these as best you can.

But I can see how the Aapor people could worry that the knowledge of powerful adjustment tools could lead some practitioners to get sloppy on their sampling, or it could lead some potential clients to not bother with a high-quality survey, under the mistaken belief that any problem in the design can be fixed in the analysis.

23 thoughts on “Of buggy whips and moral hazards; or, Sympathy for the Aapor”

  1. Nice example of steel-manning. (Instead of putting up a deliberately weak version of your opponent’s arguments to knock down, steel-manning is the idea that you should address the strongest version of your opponent’s arguments, even fixing weaknesses or omissions in their argument if necessary.) http://wiki.lesswrong.com/wiki/Steel_man

  2. I think I agree with Andrew….but…..there is another effect that needs to be considered.

    Here is a peer-reviewed survey published in PLOS Medicine:

    http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001533

    The sampling design is nothing fancy but it isn’t a simple random sample because it has clustering and stratification. The estimates do account for clustering but not for stratification. I’ve reanalyzed the data and the stratification matters (which is not a huge shock….)
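
    If, for instance, the issue shows up in the standard errors, the difference is easy to see in a toy calculation. Below is a minimal sketch with made-up data (the column names and numbers are mine, not from the paper): the same weighted mean gets its standard error computed two ways, once treating all PSUs as one pooled group (clustering only) and once grouping the PSUs into their strata. The function just applies the usual with-replacement Taylor-linearization formula.

```python
import numpy as np
import pandas as pd

def weighted_mean_and_se(df, y="y", w="w", psu="psu", stratum=None):
    """Design-based weighted mean with a Taylor-linearized standard error,
    treating PSUs as if sampled with replacement within strata."""
    W = df[w].sum()
    est = np.sum(df[w] * df[y]) / W
    g = df.assign(z=df[w] * (df[y] - est))            # linearized scores
    strat = g[stratum] if stratum is not None else pd.Series(1, index=g.index)
    var = 0.0
    for _, sdf in g.groupby(strat):
        psu_tot = sdf.groupby(psu)["z"].sum()         # PSU totals of the scores
        n_h = len(psu_tot)
        if n_h > 1:
            var += n_h / (n_h - 1) * np.sum((psu_tot - psu_tot.mean()) ** 2)
    return est, np.sqrt(var) / W

# Toy data: 2 strata, 20 PSUs each, 10 units per PSU, outcome differs by stratum
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "stratum": np.repeat(["urban", "rural"], 200),
    "psu": np.repeat(np.arange(40), 10),
    "w": rng.uniform(1, 3, 400),
    "y": rng.normal(np.repeat([10.0, 6.0], 200), 2.0),
})
print(weighted_mean_and_se(df))                       # clustering only
print(weighted_mean_and_se(df, stratum="stratum"))    # clustering + stratification
```

    In this toy example the second standard error is noticeably smaller, because the between-stratum variation is removed from the variance — the sort of difference a reanalysis would pick up.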

    This reminds me of a conversation I had a while back with a demographer. We were discussing a methodology document put out for epidemiologists working in emergency situations. The manual stressed really strongly that every household in the sampling frame had to have an equal probability of selection. I said this was obviously wrong and stupid. The demographer replied that there are a lot of people out there doing these surveys who have no knowledge of sampling theory. If you tell them that they don’t have to bust their butts to try to ensure that every household has the same selection probability, then they will run with that, but their estimates will still assume equal selection probabilities.

    It seems to me that he had a point.
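
    Just to make the demographer’s point concrete, here is a toy simulation (all numbers hypothetical). When selection probabilities vary with something related to the outcome, the plain sample mean — which implicitly assumes equal selection probabilities — is off, and it is the inverse-probability (Hájek-style) weighting that undoes it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: household size drives both the outcome and the
# chance of being selected (say, larger households are easier to find at home).
N = 100_000
size = rng.integers(1, 7, N)                  # household size, 1..6
y = 5.0 + 2.0 * size + rng.normal(0, 1, N)    # outcome correlated with size
pi = 0.002 * size                             # unequal selection probabilities

sampled = rng.random(N) < pi
y_s, pi_s = y[sampled], pi[sampled]

print("true population mean:    ", y.mean())
print("unweighted sample mean:  ", y_s.mean())                          # biased
print("weighted by 1/pi (Hajek):", np.sum(y_s / pi_s) / np.sum(1 / pi_s))
```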

    Another way to put this is that Andrew’s comment assumes that as sampling methods evolve the estimation methods will evolve in parallel. Andrew’s worry is that too little effort might go into sampling and too much effort will go into fancy estimation to undo the effect of lazy sampling. But what if the sampling gets lazier and the estimation just ignores this deterioration in sampling effort?

    Obviously, this problem would be solved by better education. But will this happen?

    A lot of stuff gets published with sampling descriptions that don’t go much further than saying: “we took a random sample.” You could do all sorts of sampling and still say that. Again, you could say the solution here is that people need to demand more information about sampling. But will that happen?

    I really want to be with Andrew on this but I’ve got reservations…..

    • Ancient post, I know, but can you (or anyone) point me to a good source on sampling theory? I’ve mostly gotten the standard “you must use a random sample” thing, and while I know from this blog that no one has a random sample, I don’t really know much about other techniques or ways to adjust for the nonrandomness.

      • I don’t know if I’d call it “outmoded technology”, unless you mean random digit dialing or something specific. Basically, a random sample is a great thing to have, but in the presence of an “adversary” who is avoiding you, or a subset of the population who are adversaries… you have very little you CAN do to ensure a sample that is free of serious bias. Your only alternative is to try to model that bias!

        • Daniel:

          Telephone surveys still have their uses, so, yes, the buggy-whip analogy isn’t perfect. The real point is that Michael W. Link is pointing to some nonexistent theory to justify an aggressive stance against a new method he doesn’t understand, degrading the reputation of his organization in the process. I used to respect Aapor, and I still respect a lot of what they stand for. But it’s hard for me to take seriously an organization that officially opposes what I do, with no justification beyond rhetoric. Instead of changing with the times, they’re fighting the new. And that’s why I tag them with the “buggy whip” label.

    • A bit OT but I can’t resist. The buggy-whip analogy may work but the photo does not. The picture at the head of the post does not display a “buggy”. I believe it is a sulky.

      For American readers, I believe your President Grant once was cited for speeding while driving one in Washington D.C.

  3. I’ve been thinking of writing a blog post about this issue myself, but from a different perspective. I am grappling with sampling issues right now at work, and I am not as sanguine as Andrew seems to be about the ability to post-stratify to fix everything up. I’m going to describe the project here in hopes of getting some useful advice.

    We want to learn about the use of electric motors in U.S. industry. Ideally, we’d like to know the statistics on the entire installed base of motors, by industry and sub-sector (e.g. auto parts manufacturers, auto assembly plants, …): what’s the distribution of motor sizes, ages, efficiencies, how many hours per year they’re used at each power output, etc., etc.

    Obviously we can’t go look at, and measure, every motor in the country. Indeed, even visiting a single facility and checking their motors can take hours or even days. Realistically, we can sample maybe 200 facilities total, and for the larger facilities we’ll have to do within-facility sampling. And we’ll need to make sure the facilities in our sample aren’t scattered all over tarnation, because travel time and costs will eat us alive.

    What Andrew might call the “old-school” way to handle this would be to divide the country into Primary Sampling Units (PSUs), choose some of them at random (using a weighting scheme so that some PSUs are more likely to be included than others), then divide the selected PSUs into Secondary Sampling Units (SSUs) and do the same, then pick a random sample of facilities in each selected SSU (using stratification to make sure we get about the right number of facilities in each industry).
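
    For concreteness, a stripped-down sketch of that kind of selection is below: one stage of probability-proportional-to-size PSU selection plus a stratified draw of facilities within the selected PSUs (the SSU stage would look the same). All the frames and size measures here are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def pps_systematic(sizes, n, rng):
    """Systematic PPS selection: unit i is chosen with probability roughly
    n * size_i / total_size (assumes no single size exceeds the interval)."""
    sizes = np.asarray(sizes, dtype=float)
    cum = np.cumsum(sizes)
    step = cum[-1] / n
    points = rng.uniform(0, step) + step * np.arange(n)
    return np.searchsorted(cum, points)

# Hypothetical PSU frame (say, groups of counties) with a size measure such
# as total manufacturing employment.
psus = pd.DataFrame({"psu_id": np.arange(50),
                     "employment": rng.gamma(2.0, 5000.0, 50)})
chosen = psus.iloc[pps_systematic(psus["employment"], n=8, rng=rng)]

# Hypothetical facility frame; keep facilities in the selected PSUs and draw
# a stratified sample within them, stratifying on industry.
facilities = pd.DataFrame({
    "psu_id": rng.integers(0, 50, 5000),
    "industry": rng.choice(["auto parts", "textiles", "food"], 5000),
})
in_scope = facilities[facilities["psu_id"].isin(chosen["psu_id"])]
sampled = in_scope.groupby("industry").sample(n=5, random_state=1)
print(sampled["industry"].value_counts())
```

    (In a real design the sampling weights would carry the PSU and facility selection probabilities, and the within-PSU allocation would be tuned toward the motor-energy-heavy industries; this is just the skeleton.)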

    An important issue is that we expect the participation rate to be very low, perhaps in single digits. That is, we expect fewer than 10% of the facilities that we approach to allow us to come and look at their motors and make power measurements. Participation requires some time and hassle on the part of the plant facility manager or equivalent person, and there’s very little reason for them to agree. We recognize that an incentive worth a few hundred dollars would substantially increase the participation rate — we could give them an iPad or something, and a lot more of them would say Yes — but the project sponsor won’t let us provide incentives.

    So…if we’re only going to get 8% participation (or whatever) in any case, what’s the point of doing an elaborate sampling scheme? There’s pretty much no chance that the 8% of facilities that agree to participate will really be representative of the 92% that won’t. Just to give a few examples: maybe they’re more likely to say Yes if they’re not very busy, so we might get an overrepresentation of businesses that are not doing well. Or maybe we’ll get an overrepresentation of businesses that are doing especially well, because they’ll have adequate staff and won’t mind taking the time for our survey, whereas the struggling ones have laid off people and whoever is left is working flat out. One can imagine many other systematic biases.

    My initial instinct was that we should use a geographic randomization scheme based on PSUs and SSUs in any case, because why not? The answer is, it’s not easy. We need to collect information about the geographic distribution of facilities in different industries in order to define the sampling units and assign them reasonable sampling weights. All of the large textile mills are in a small number of geographic areas, so we’d better make sure we sample at least one of those areas…but there are similar concentrations in other industries, so there’s a complicated set of constraints that we’d have to apply in order to be sure that we get adequate coverage of the industries we’re interested in. (Well, we’re interested in all industries, but we’re more interested in the ones that use more motor energy).

    Another option is to use a sampling scheme that is a lot less formal. For example, we know we need N auto parts manufacturing plants. We could call around (or, rather, have a contractor call around) until we find one that is willing to participate, and then we can start calling other businesses in the area to see if we can get them to participate too. Just sort of adjust on the fly to try to get the right mix of sizes and industries etc.

    There are some obvious problems with this approach. The biggest problem is that the absence of randomization means we’ve got no protection against unknown biases. It reminds me of the way people used to do “man on the street” polls: supposedly, a newspaper would send reporters out with orders to speak to 100 people, including at least 30 white women, 10 black men, 30 people over age 60, etc. to find out who they plan to vote for. They discovered to their embarrassment that even if the sample “looked like America” in these crude demographics, the samples were terrible when it came to predicting the vote for president or whatever.

    Anyway, our time and budget constraints being what they are, we are probably going to go with an informal “take what we can get” sampling scheme, and try to post-stratify on facility size, region, and a few other parameters. This makes me very uncomfortable. But Andrew, it seems that you think this is just fine!
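
    For concreteness, the adjustment I have in mind is basically this — a minimal sketch with invented numbers, assuming we can get population facility counts by size class and region from some published source:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Known (assumed) population facility counts by size class and region.
pop = pd.DataFrame({
    "size":   ["small", "small", "large", "large"],
    "region": ["east",  "west",  "east",  "west"],
    "N":      [4000,    3000,    600,     400],
})

# The facilities we actually measured; y is the quantity of interest (say,
# motor electricity use per facility). Large facilities are over-represented
# here relative to the population counts above.
sample = pd.DataFrame({
    "size":   ["small"] * 30 + ["large"] * 20,
    "region": ["east"] * 20 + ["west"] * 10 + ["east"] * 8 + ["west"] * 12,
    "y":      np.concatenate([rng.normal(50, 10, 30), rng.normal(400, 80, 20)]),
})

# Post-stratified estimate: cell means from the sample, weighted by the
# known population count in each size-by-region cell.
cell_means = sample.groupby(["size", "region"])["y"].mean().rename("ybar")
merged = pop.merge(cell_means.reset_index(), on=["size", "region"])
poststrat = np.sum(merged["N"] * merged["ybar"]) / merged["N"].sum()

print("raw sample mean:      ", sample["y"].mean())
print("post-stratified mean: ", poststrat)
```

    The worry, of course, is everything this leaves out: cells with no respondents, and selection within a cell that’s related to y in ways that size and region don’t capture.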

    Constructive advice (or comforting words) would be welcome from any quarter!

      • As I mentioned, we’d like to know the statistics on the entire installed base of motors, by industry and sub-sector (e.g. auto parts manufacturers, auto assembly plants, …): what’s the distribution of motor sizes, ages, efficiencies, how many hours per year they’re used at each power output, etc., etc.

        If you’re saying “why does the sponsor want to know that stuff,” there are several reasons. The information can be used to help figure out things like how long it takes improved motors to be taken up by industry; to figure out how much electricity is wasted by inefficient or poorly sized motors and whether there is enough of a problem that it’s worth designing programs to try to address it; to figure out the likely effects on energy use if various industries grow or shrink in the future; and so on.

    • I think your hesitation is very justified, but I take Andrew’s meaning to be “well at least you’ll have SOME information, whereas you have NONE now”.

      This issue becomes even more problematic when there is some kind of financial incentive to sample poorly. I was recently involved in a lawsuit in which an expert went and looked at the condition of the walls in a building where a fire had occurred. They looked at all the rooms in the same stack, one room per floor. They took samples from each room. They made no effort to ensure _anything_ in their sampling. In fact the only thing they mention in the report is “we sampled two chosen stacks”. My argument was that they learned essentially that it was possible to find 12 separate square feet of the building where there wasn’t much damage, and that it didn’t mean much of anything about the other portions of the building. Whereas they claimed the whole building was undamaged (at least, with respect to the type of damage they were investigating).

      Here, though, there’s no possibility for the rooms to “refuse to cooperate,” so there was no excuse not to use proper random sampling of the rooms throughout the building. In fact that’s exactly what I recommended. But there is also an incentive to “go and find places where there aren’t problems,” just as with the “man on the street” samples, where there’s an incentive to go find people who are easy to spot, not very busy, and likely to be cooperative (to get you out of the hot sun and off your tired feet as fast as possible).

      In your situation Phil, I think you are partially up a creek, but if you want to spend the money you should at least find as much information to help adjust as possible. For example, maybe more places would be willing to give you total electric usage over the last 3 months than would be willing to actually let you creep around their building looking for motors. You could then see how the ones that let you creep around compare to the average in terms of total usage… stuff like that.
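
      A minimal sketch of that kind of check, with made-up numbers (the usage variable and participation rates are hypothetical): compare the participants against the full frame on whatever auxiliary variable you can get for everyone, and if they differ, that variable becomes a natural one to adjust on.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Hypothetical frame: every facility has a known 3-month electric usage (kWh);
# only a small fraction agreed to the on-site motor survey, and bigger users
# are assumed here to be somewhat more likely to say yes.
frame = pd.DataFrame({"usage_kwh": rng.lognormal(11, 1, 2000)})
big = frame["usage_kwh"] > frame["usage_kwh"].median()
frame["participated"] = rng.random(2000) < np.where(big, 0.12, 0.04)

# If participants skew high (or low) on usage, that's direct evidence of
# nonresponse bias, and usage bins become a candidate adjustment variable.
print(frame.groupby("participated")["usage_kwh"].median())
print(frame.groupby("participated")["usage_kwh"].mean())
```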

        • Note, the experts were working for the insurance company, so their client benefited if there was less damage, whereas I was working for experts on the owners’ side. The obvious thing to say there is that there’s an incentive to find damages. I encouraged “our side” not to HIDE that, but rather to describe in detail the things we did to minimize bias by taking the choices out of our own hands (such as using a proper computer RNG, and collecting complete data about multiple aspects of each room).

        The point is, people who are going to the effort of trying to avoid bias will generally find ways to reduce that bias, so it’s not wrong to hold a kind of “hard line” about good low-bias sampling practices, but it is wrong to pretend you have a low-bias method because you ignore a huge source of bias… namely the “adversarial” nature of your subjects!

    • > but the project sponsor won’t let us provide incentives.
      This seems like the insurmountable opportunity – get them to think through the harms/costs of this objection.

      An executive at the Canadian Institutes of Health Research a number of years ago overcame strong objections to wasting scarce research funding providing incentives to participate in surveys with something like “we can spend $100,000.00 and end up with a study of limited relevance, or $110,000.00 and get a study we can actually interpret and make good use of”.

      There is a reason the sponsor wants the study done, and done well (or as David pointed out done poorly but not too obviously so) and so they should be able to grasp the cost/benefits involved.

      If there are political/legal barriers to gifts, these might be overcome by providing a [I am just wildly guessing here] report to the plant manager, perhaps with some additional expert opinions about their particular motors (e.g. suggested maintenance or replacement plans).

      Hey, in Canada an extremely important mandatory survey was cancelled, and it is actually hard for me to believe the decision to do that was properly and fully informed on the issues. (OK, the Chief Statistician resigned, and what they tried to communicate, and especially how, is unknown – moral hazard is as good a guess as any.)

      • Keith, I’ve lobbied for incentives as hard as I can. They just refuse. This is just a huge huge issue for federally funded surveys. There’s even an organization, the “Council of Professional Associations on Federal Statistics” (I’m not kidding) that held a conference devoted to the issue. There are a few arms of the federal government that are willing to allow incentives to be used but in general it’s not allowed. They understand that they can get better results at lower cost if they provide incentives, but there are overriding political considerations. Evidently the political calculus on this hasn’t changed in years.

    • So why don’t you use two-stage sampling? You get your 10% response, and then you take a random sample from the remaining 90% and really go hard at them. Not perfect (there is also non-response in the second stage) but better perhaps.

      • Yes, this is a good idea and I suggested it. It’s probably a good idea whether or not we truly randomize. Thing is, it’s not clear what “go hard at them” would mean, given that we can’t provide incentives.
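
        For reference, the arithmetic behind the two-phase idea (classic double sampling for nonresponse) is simple. Here is a minimal sketch with made-up numbers, assuming for simplicity that the hard-pursued follow-up subsample responds completely:

```python
import numpy as np

def double_sample_mean(y_resp, y_followup, n_nonresp):
    """Two-phase (double-sampling) estimator of the mean: combine the phase-1
    respondents with a random follow-up subsample of the n_nonresp phase-1
    nonrespondents, weighting each group by its share of the phase-1 sample."""
    n_r = len(y_resp)
    n = n_r + n_nonresp
    return (n_r / n) * np.mean(y_resp) + (n_nonresp / n) * np.mean(y_followup)

# Hypothetical numbers: 16 initial participants, 184 refusals, 20 followed up.
rng = np.random.default_rng(2)
y_resp = rng.normal(50, 10, 16)       # some motor metric among easy responders
y_followup = rng.normal(65, 10, 20)   # the reluctant facilities look different
print(double_sample_mean(y_resp, y_followup, n_nonresp=184))
```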

  4. For what it’s worth, I was at AAPOR’s conference this year and there were quite a few panels on opt-in panels and how to adjust the data. AAPOR’s official stance may be against them, but there’s still a fair amount of work going on amongst its members.

  5. On moral hazard.

    I find this is a real problem. Once I worked in a major international organization. Managers always wanted the latest data on economic developments. But of course, they were investing zero money in data collection, so the data were always several periods behind.

    When presented with this problem, the response was not “ok, let’s invest in better and more timely data collection”. Rather, the response was: “Make an estimate”. In principle, there can be a perfectly reasonable cost-benefit analysis for this, but I doubt that is what was going on.

    The problem, as I saw it, is that managers, and their bosses, can only monitor by looking at the final PowerPoint slide, or by how fancy the data interface is. Whether the data are in fact bogus is something they don’t know and that is very costly to check. So in this principal-agent problem, estimates and eye candy are as addictive as crack.
