Someone wrote in:
We are about to conduct a voting list experiment. We came across your comment recommending that each item be removed from the list. Would greatly appreciate it if you could take a few minutes to spell out your recommendation in a little more detail. In particular: (a) Why are you “uneasy” about list experiments? What would strengthen your confidence in list experiments? (b) What do you mean by “each item be removed”? As you know, there are several non-sensitive items and one sensitive item in a list experiment. Do you mean that the non-sensitive items should be removed one by one for the control group, or are you suggesting a multiple-arm design in which each arm of the experiment has one non-sensitive item removed? What would be achieved by this design?
I replied: I’ve always been a bit skeptical about list experiments, partly because I worry that the absolute number of items on the list could itself affect the response. For example, someone might not want to check off 6 items out of 6 but would have no problem checking off 6 items out of 10: even if 4 items on that latter list were complete crap, their presence on the list might make the original 6 items look better by comparison. So this has made me think that a list experiment should really have some sort of active control. But the problem with the active control is that then any effects will be smaller. Then that made me think that one might be interested in interactions, that is, which groups of people would be triggered by different items on the list. But that’s another level of difficulty…
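(For readers who haven’t run one of these: the control group sees J non-sensitive items and reports only how many apply to them; the treatment group sees the same J items plus the sensitive item, and the difference in mean counts between the arms estimates the prevalence of the sensitive item. Here’s a minimal simulation sketch of that arithmetic; the item prevalences and sample sizes are invented for illustration:)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000                          # respondents per arm (hypothetical)
p_items = [0.2, 0.5, 0.7, 0.4]    # made-up prevalences of 4 non-sensitive items
p_sensitive = 0.15                # made-up prevalence of the sensitive item

# Respondents report only the COUNT of items that apply, never which ones.
control = sum(rng.binomial(1, p, n) for p in p_items)
treatment = sum(rng.binomial(1, p, n) for p in p_items) + rng.binomial(1, p_sensitive, n)

# The difference in mean counts estimates the prevalence of the sensitive item.
est = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / n + control.var(ddof=1) / n)
print(f"estimate = {est:.3f} (truth 0.15), se = {se:.3f}")
```

(With these made-up numbers the standard error comes out around 0.02 even at 5,000 respondents per arm, which previews the low-power complaints quoted below.)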
And then I remembered that I’ve never actually done such an experiment! So I thought I’d bring in some experts. Here’s what they said:
I have had mixed experiences with list experiments.
Enumerators are sometimes confused by them, and so are subjects, and sometimes we have found enumerators implementing them badly, e.g., getting the subjects to count out item by item as they go along reading the list (which reveals the item-level answers that the design is supposed to conceal), that kind of thing. Great enumerators shouldn’t have this problem, but some of ours have.
In one implementation that we thought went quite well, we cleverly did two list experiments with the same sensitive item and different non-sensitive items, but got very different results. So that is not encouraging.
The length-of-list issue is, I think, not the biggest one. You can keep lists a constant length and include an item that you know the answer to (maybe because you ask it elsewhere, or because you are willing to bet on it). Tingley gives some references that discuss this kind of thing: http://scholar.harvard.edu/files/dtingley/files/fall2012.pdf
A bigger issue, though, is that list experiments don’t incentivize people to give you information that they don’t want you to have. E.g., if people do not want you to know that there was fraud, and if they understand the list experiment, you should not get evidence of fraud. The technique only seems relevant for cases in which people DO want you to know the answer but don’t want to be identifiable as the person that told you.
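(To put a rough number on that incentive problem: if some fraction of respondents who hold the sensitive trait understand the design and simply answer as though the sensitive item were not on their list, the difference-in-means estimate attenuates by that fraction. A toy sketch, with the awareness fraction assumed rather than estimated from anything:)

```python
import numpy as np

rng = np.random.default_rng(1)
n, pi, aware = 5000, 0.15, 0.5    # true prevalence; assumed "sees through it" share
p_items = [0.5, 0.3, 0.6]         # made-up non-sensitive item prevalences

control = sum(rng.binomial(1, p, n) for p in p_items)
base = sum(rng.binomial(1, p, n) for p in p_items)
holds = rng.binomial(1, pi, n).astype(bool)
hides = holds & (rng.random(n) < aware)   # holders who understand the design

treatment = base + (holds & ~hides)       # only unaware holders add the +1
print(f"naive estimate = {treatment.mean() - control.mean():.3f}")
print(f"expected value = {pi * (1 - aware):.3f}   (true rate {pi})")
```

(The estimator still “works” in the sense of returning a number; it just quietly measures only the share of holders who didn’t see through the design.)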
Simon Jackman and I ran a number of list experiments in the 2008 Cooperative Campaign Analysis Project. Substantively, we were interested in Obama’s race, Hillary Clinton’s sex, and McCain’s age. We ran them in two different waves (March of 2008 and September of 2008).
Like the others, we got some strange results that prevented us from writing them up. Ultimately, I think we both concluded that this was not a method we would use again in the future.
In the McCain list, people would freely say “his age” was a reason they were not voting for him. We got massive effects here. We didn’t get much at all on the Clinton list (“She’s a woman.”). And, on the Obama list, we got results in the OPPOSITE direction in the second wave! I will let you make of those patterns what you will, but it seemed to us to echo what Macartan writes above: if it’s truly a sensitive item, people seem to figure out what is going on and won’t comply with the “treatment.”
If the survey time is easily available (i.e., running this is cheap), I think I still might try it. But if you are sacrificing other potentially interesting items, you should probably reconsider doing the list. Also, one more thing: if you are going to go back to these people in any kind of capacity, you don’t want to do anything that will damage the rapport you have with the respondents. If they “figure out” what you’re up to in the list experiment, they may be less likely to give you honest answers to other questions down the line. As you develop the survey, you want to avoid fostering the notion that surveys are “just out to trick people.” I’d put a premium on that just now if I were you.
I’ve had experiences similar to what Macartan and Lynn reported. I think Macartan’s last point about the incentives makes a lot of sense. If the respondent is not motivated in that way, then the validity of the experiment requires that the respondent be attentive enough to follow the instructions but not so attentive as to see through the trick. That may not be a reasonable assumption.
There’s also the work that Jason Lyall and coauthors have done using both list experiments and endorsement experiments in Afghanistan. E.g., http://onlinelibrary.wiley.com/doi/10.1111/ajps.12086/abstract
They seem to think that the techniques have been effective, so it may be useful to contact Jason to get some tips that would be specifically relevant to research in Afghanistan. It’s possible that the context really moderates the performance of these techniques.
“List” experiments — aka “item-count” experiments — seem most prone to run into trouble when the “sensitive item” jumps off the page. This gives rise to the “top-coding” problem: if all J items are things I’ve done, including the sensitive item, then I’m going to respond “J” only if I’m ok revealing myself as someone who would respond “yes” to the sensitive item.
Then you’ve got to figure out how to have J items, including your sensitive item, such that J-1 is a plausible upper bound on the item count. This can be surprisingly hard. Pre-testing would seem crucial: you have to field your lists and refine them to avoid “top-coding.”
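(A quick way to see the top-coding bias is to simulate it: give every respondent whose truthful count would be the maximum J+1 a reported count of J instead. The list composition and the deflation rule below are assumptions, chosen to make the ceiling easy to hit:)

```python
import numpy as np

rng = np.random.default_rng(2)
n, pi = 5000, 0.15
p_items = [0.8, 0.9, 0.7]   # made-up high-prevalence items; ceiling is easy to hit
J = len(p_items)

control = sum(rng.binomial(1, p, n) for p in p_items)
truthful = sum(rng.binomial(1, p, n) for p in p_items) + rng.binomial(1, pi, n)

# A truthful answer of J+1 would out the respondent, so they report J instead.
reported = np.minimum(truthful, J)

print(f"estimate with top-coding: {reported.mean() - control.mean():.3f}")
print(f"estimate if truthful:     {truthful.mean() - control.mean():.3f}   (true {pi})")
```

(With these numbers roughly half the trait-holders sit at the ceiling, so the estimate comes in at about half the true rate, and nothing in the data flags that it happened.)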
I still use the technique now and then (including a paper out now under R&R), but I’ve come to realize they can be expensive to do well, especially in novel domains of application, given the test cases you have to burn through to get the lists working well.
More generally, the item-count technique seems like a lot of work for an estimate of the population rate of the “sensitive” attitude or self-report of the sensitive behavior. Sure, modeling (a la Imai) can get you estimates of the correlates of the sensitive item, and stratification lets you estimate rates in sub-populations. But if the lists aren’t working well to begin with, then the validity of “post-design,” model-based swings at the problem has to be somewhat suspect.
One thing I’m glad Lynn and I did in our 2008 work was to put the whole “misreport/social-desirability” rationale to a test. For the context we were working in — Americans’ attitudes about Obama and McCain on a web survey — there were more than a few people willing to quite openly respond that they wouldn’t vote for Obama because he’s black, or wouldn’t vote for McCain because he’s too old. These provided useful lower bounds on what we ought to have got from the item-count approach. Again, note the way you’re blowing through sample to test & calibrate the lists.
And Brendan Nyhan adds:
I suspect there’s a significant file-drawer problem with list experiments. I have an unpublished one too! They have low power and are highly sensitive to design quirks and respondent compliance, as others mentioned. Another problem we found is interpretive: they work best when the social desirability effect is unidirectional. In our case, however, we realized that there was a plausible case that some respondents were overreporting misperceptions as a form of partisan cheerleading and others were underreporting due to social desirability concerns, which could create offsetting effects.
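(The offsetting-effects point is worth a back-of-the-envelope calculation, because the cancellation can be nearly exact. All the rates below are invented:)

```python
# All rates invented, purely to illustrate how the two biases can cancel.
pi = 0.30       # true rate of the misperception
over = 0.10     # non-holders who claim it anyway (partisan cheerleading)
under = 0.25    # holders who hide it (social desirability)

observed = pi * (1 - under) + (1 - pi) * over
print(f"observed rate = {observed:.3f} vs. true rate = {pi}")
# 0.30*0.75 + 0.70*0.10 = 0.295: nearly identical to the truth, even though
# about 15% of respondents are misreporting in one direction or the other.
```

(A list experiment run on respondents like these would return a reassuring-looking number while both biases were at work, which is Brendan’s interpretive problem in miniature.)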
That makes sense to me. Regular blog readers will know that I’m generally skeptical about claims of unidirectional effects.
And Alissa Stollwerk discusses some of her experiences here.