Thinking of doing a list experiment? Here’s a list of reasons why you should think again

Someone wrote in:

We are about to conduct a voting list experiment. We came across your comment recommending that each item be removed from the list. Would greatly appreciate it if you could take a few minutes to spell out your recommendation in a little more detail. In particular: (a) Why are you “uneasy” about list experiments? What would strengthen your confidence in list experiments? (b) What do you mean by “each item be removed”? As you know, there are several non-sensitive items and one sensitive item in a list experiment. Do you mean that the non-sensitive items should be removed one by one for the control group, or are you suggesting a multiple-arm design in which each arm of the experiment has one non-sensitive item removed? What would be achieved by this design?

I replied: I’ve always been a bit skeptical about list experiments, partly because I worry that the absolute number of items on the list could itself affect the response. For example, someone might not want to check off 6 items out of 6 but would have no problem checking off 6 items out of 10: even if 4 items on that latter list were complete crap, their presence on the list might make the original 6 items look better by comparison. So this has made me think that a list experiment should really have some sort of active control. But the problem with the active control is that then any effects will be smaller. Then that made me think that one might be interested in interactions, that is, which groups of people would be triggered by different items on the list. But that’s another level of difficulty…

And then I remembered that I’ve never actually done such an experiment! So I thought I’d bring in some experts. Here’s what they said:

Macartan Humphreys:

I have had mixed experiences with list experiments.

Enumerators are sometimes confused by them, and so are subjects, and sometimes we have found enumerators implementing them badly, e.g., getting the subjects to count out as they go along reading the list, that kind of thing. Great enumerators shouldn’t have this problem, but some of ours have.

In one implementation that we thought went quite well we cleverly did two list experiments with the same sensitive item and different nonsensitive items, but got very different results. So that is not encouraging.

The list-length issue, I think, is not the biggest. You can keep lists a constant length and include an item that you know the answer to (maybe because you ask it elsewhere, or because you are willing to bet on it). Tingley gives some references that discuss this kind of thing: http://scholar.harvard.edu/files/dtingley/files/fall2012.pdf

A bigger issue, though, is that list experiments don’t incentivize people to give you information that they don’t want you to have. E.g., if people do not want you to know that there was fraud, and if they understand the list experiment, you should not get evidence of fraud. The technique only seems relevant for cases in which people DO want you to know the answer but don’t want to be identifiable as the person who told you.

Lynn Vavreck:

Simon Jackman and I ran a number of list experiments in the 2008 Cooperative Campaign Analysis Project. Substantively, we were interested in Obama’s race, Hillary Clinton’s sex, and McCain’s age. We ran them in two different waves (March of 2008 and September of 2008).

Like the others, we got some strange results that prevented us from writing up the results. Ultimately, I think we both concluded that this was not a method we would use again in the future.

In the McCain list, people would freely say “his age” was a reason they were not voting for him. We got massive effects here. We didn’t get much at all on the Clinton list (“She’s a woman.”) And, on the Obama list, we got results in the OPPOSITE direction in the second wave! I will let you make of those patterns what you will — but, it seemed to us to echo what Macartan writes below — if it’s truly a sensitive item, people seem to figure out what is going on and won’t comply with the “treatment.”

If the survey time is easily available (i.e., running this is cheap), I think I still might try it. But if you are sacrificing other potentially interesting items, you should probably reconsider doing the list. Also, one more thing: If you are going to go back to these people in any kind of capacity, you don’t want to do anything that will damage the rapport you have with the respondents. If they “figure out” what you’re up to in the list experiment, they may be less likely to give you honest answers to other questions down the line. As you develop the survey, you want to be careful not to foster the notion that surveys are “just out to trick people.” I’d put a premium on that just now if I were you.

Cyrus Samii:

I’ve had experiences similar to what Macartan and Lynn reported. I think Macartan’s last point about the incentives makes a lot of sense. If the respondent is not motivated in that way, then the validity of the experiment requires that the respondent can follow the instructions but is not so attentive as to avoid being tricked. That may not be a reasonable assumption.

There’s also the work that Jason Lyall and coauthors have done using both list experiments and endorsement experiments in Afghanistan. E.g., http://onlinelibrary.wiley.com/doi/10.1111/ajps.12086/abstract
They seem to think that the techniques have been effective, so it may be useful to contact Jason to get some tips that would be specifically relevant to research in Afghanistan. It’s possible that the context really moderates the performance of these techniques.

Simon Jackman:

“List” experiments — aka “item-count” experiments — seem most prone to run into trouble when the “sensitive item” jumps off the page. This gives rise to the “top-coding” problem: if all J items are things I’ve done, including the sensitive item, then I’m going to respond “J” only if I’m ok revealing myself as someone who would respond “yes” to the sensitive item.

Then you’ve got to figure out how to have J items, including your sensitive item, such that J-1 might be the plausible upper bound on the item count. This can be surprisingly hard. Pre-testing would seem crucial: fielding your lists and refining them so as to avoid “top-coding”.
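
To make the top-coding arithmetic concrete, here is a minimal simulation sketch; every number in it (arm size, endorsement rates, the true rate of the sensitive item) is a made-up assumption rather than a figure from any study discussed here.

```python
# Minimal sketch of the "top-coding" / ceiling problem (hypothetical numbers).
# Respondents for whom every item applies would have to answer J, which
# identifies them, so suppose they shave their count down by one.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000                        # respondents per arm (assumed)
p_sensitive = 0.30               # assumed true rate of the sensitive item
J_control = 3                    # number of non-sensitive items

control = rng.binomial(J_control, 0.5, size=n)       # non-sensitive counts
sensitive = rng.binomial(1, p_sensitive, size=n)
treatment = rng.binomial(J_control, 0.5, size=n) + sensitive

# With honest answers, the difference in means recovers the true rate:
print(treatment.mean() - control.mean())              # roughly 0.30

# Ceiling avoidance: anyone who would report the maximum (J = 4) deflates by one.
deflated = np.where(treatment == J_control + 1, treatment - 1, treatment)
print(deflated.mean() - control.mean())                # biased downward, roughly 0.26
```

Even a modest fraction of ceiling-avoiders pulls the estimate noticeably below the true rate, which is the reason for wanting J-1 to be the plausible upper bound.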

I still use the technique now and then (including a paper out now under R&R), but I’ve come to realize they can be expensive to do well, especially in novel domains of application, given the test cases you have to burn through to get the lists working well.

More generally, the item-count technique seems like a lot of work for an estimate of the population rate of the “sensitive” attitude or self-report of the sensitive behavior. Sure, modeling (a la Imai) can get you estimates of the correlates of the sensitive item, and stratification lets you estimate rates in sub-populations. But if the lists aren’t working well to begin with, then the validity of “post-design”, model-based swings at the problem has to be somewhat suspect.

One thing I’m glad Lynn and I did in our 2008 work was to put the whole “misreport/social-desirability” rationale to a test. For the context we were working in — Americans’ attitudes about Obama and McCain on a web survey — there were more than a few people willing to quite openly respond that they wouldn’t vote for Obama because he’s black, or wouldn’t vote for McCain because he’s too old. These provided useful lower bounds on what we ought to have got from the item-count approach. Again, note the way you’re blowing through sample to test & calibrate the lists.

And Brendan Nyhan adds:

I suspect there’s a significant file drawer problem on list experiments. I have an unpublished one too! They have low power and are highly sensitive to design quirks and respondent compliance as others mentioned. Another problem we found is interpretive. They work best when the social desirability effect is unidirectional. In our case, however, we realized that there was a plausible case that some respondents were overreporting misperceptions as a form of partisan cheerleading and others were underreporting due to social desirability concerns, which could create offsetting effects.

That makes sense to me. Regular blog readers will know that I’m generally skeptical about claims of unidirectional effects.

And Alissa Stollwerk discusses some of her experiences here.

Comments:

  1. “Like the others, we got some strange results that prevented us from writing up the results.”

    Surely the strangeness of the results would be interesting to the scientific community, especially from a methodological standpoint.

    We need to move beyond “significant” “interesting” “novel” “substantive” storytelling. Let’s report our research findings, warts and all.

    • That’s what I thought. Isn’t this precisely the file-drawer problem in action? “Ultimately, I think we both concluded that this was not a method we would use again in the future.” There’s your paper.

    • Maybe it’s something like this: Say I’m commissioned to survey a small farm. I lug around my surveying tools & at the end of the day my computer spews out a farm area of 9000 sq. kilometers.

      Now do I report this pronto? Or repeat the study, recalibrate my lasers, or discuss my non-intuitive results with some colleagues, etc.? Assume after discussion it turns out that people report this brand of instrument is buggy & known to yield weird results every once in a while. Would I be OK in discarding these results & that instrument?

      • If, after reflection, you can explain why the tools produce weird results at certain farms, you clearly should publish that result, yes … or maybe I don’t understand the hypo.

  2. I wrote a technical paper that looked like it was going to end up mostly being variations on Imai’s results, so it’s ended up in my own file drawer. On statistical efficiency grounds it’s likely to be pretty poor, but we all knew that. Randomized response is much more efficient statistically, but of course it’s very time consuming and highly obtrusive.

    My feeling, as a psych person (such as it is… psychometrics degree ;), is that it’s hard to justify a list experiment in most survey research and that statistical concerns will be dominated by substantive/context effects. The chances that the other items affect the hot item are pretty large, so the context effects are simultaneously non-trivial and very hard to know anything about. We have to make some pretty tall assumptions to assume that “I have gotten drunk on the job in the last 30 days” by itself and in the context of a list of other items are the same thing, and that the interactions don’t happen. Some kind of meta-analytic or pooling approach might help, so rather than aiming for one list and hoping against hope that the other items don’t interact, aim for several lists with the other items swapped around and then pool.

    That said, it might be useful in contexts where folks have to respond, such as an employee survey, where there is a very large risk of social desirability effects.

    • My response to this post would’ve been “why don’t you do randomized response instead?” Now reading your answer, I’d like to understand better why you say randomized response is “very time consuming and highly obtrusive”. If answers are solicited electronically, why is this the case? My question might be very naive. I haven’t done any actual user experiments that involved randomized response.

      • Mainly they’re time consuming for participants because they require complicated instructions and probably a dry run to implement. It’s also very obtrusive because the instructions make it plain what’s going on.

        Consider a list experiment about the prevalence of workplace drinking. (I’m not saying this is a particularly well designed list experiment, mind you, this is just for illustration.) The list experiment requires the respondent to read a short prompt and give a number. For example:

        Please indicate the number of the following you have engaged in in the last 30 days while at work:

        I have become intoxicated on the job. (item of interest)
        I have watched internet videos on my office computer.
        I have slept in the break room.
        I have used the office copier for personal copies.

        We are not interested in which ones, only how many.

        If the list was what I called (in the aforementioned file drawered paper) “simple deletion”, the treatment group would get these four and the control group would only get the last three. The naive method of moments estimate of the proportion of on the job drinking would simply be mean(treatment) – mean(control). Unsurprisingly, this is likely to have a fairly wide standard error.
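
        A minimal sketch of that estimator and its standard error, using simulated data and made-up endorsement rates, which also shows why the standard error tends to be wide compared with putting the question directly to the same people:

        ```python
        # Sketch of the "simple deletion" difference-in-means estimator described
        # above. All data are simulated; the prevalence and endorsement rates are
        # assumptions for illustration only.
        import numpy as np

        rng = np.random.default_rng(1)
        n = 1_000                               # respondents per arm (assumed)
        p_drunk = 0.10                          # assumed true prevalence

        control = rng.binomial(3, 0.4, size=n)  # count of the 3 innocuous items
        treatment = rng.binomial(3, 0.4, size=n) + rng.binomial(1, p_drunk, size=n)

        est = treatment.mean() - control.mean()
        se = np.sqrt(treatment.var(ddof=1) / n + control.var(ddof=1) / n)
        print(f"list-experiment estimate = {est:.3f}, se = {se:.3f}")

        # For comparison, a direct question put to the same 2n respondents:
        se_direct = np.sqrt(p_drunk * (1 - p_drunk) / (2 * n))
        print(f"direct-question se would be about {se_direct:.3f}")
        ```

        With the numbers assumed here, the list-experiment standard error comes out several times larger than the direct-question one, which is the efficiency cost being paid for anonymity.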

        From the respondent’s perspective, this is a somewhat peculiar survey item but not all that difficult to answer (in theory). It might not even be obvious that the survey is about drinking as opposed to anything else.

        A forced choice randomized response prompt would be something like this:

        We are interested in whether you were intoxicated at the workplace in the last 30 days. To protect your privacy, we will use the randomized response technique, which prevents us from knowing your individual response. You will be given two dice, which you will roll out of the sight of the interviewer. If the sum of the two dice is 2 or 3, answer “no” to the question regardless of whether you were intoxicated or not. If the sum of the two dice is 11 or 12, answer “yes” regardless of whether you were intoxicated or not. If the sum is any value from 4 to 10, answer honestly.

        Of course computerization would help, but it would be essential to feel confident that the respondent actually understood what was expected and wasn’t cheating. Remember, you’re asking about something they don’t want to fess up to!
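
        For what it’s worth, here is a sketch of how responses from that forced-choice dice design would be analyzed; the data are simulated and the true prevalence is a made-up assumption. With two dice, the forced “yes” and forced “no” outcomes each have probability 3/36, so the observed “yes” rate equals 3/36 plus 30/36 times the true prevalence, which you then invert:

        ```python
        # Sketch of the forced-choice randomized-response estimator for the dice
        # design above. Data are simulated; the true prevalence is an assumption.
        import numpy as np

        rng = np.random.default_rng(2)
        n = 1_000
        pi_true = 0.10                 # assumed prevalence of on-the-job intoxication

        P_FORCED_YES = 3 / 36          # dice sum 11 or 12
        P_FORCED_NO = 3 / 36           # dice sum 2 or 3
        P_HONEST = 1 - P_FORCED_YES - P_FORCED_NO

        truth = rng.binomial(1, pi_true, size=n)
        u = rng.random(n)
        answer = np.where(u < P_FORCED_YES, 1, np.where(u < P_FORCED_YES + P_FORCED_NO, 0, truth))

        # Invert P(yes) = P_FORCED_YES + P_HONEST * pi to recover the prevalence.
        pi_hat = (answer.mean() - P_FORCED_YES) / P_HONEST
        se = np.sqrt(answer.mean() * (1 - answer.mean()) / n) / P_HONEST
        print(f"pi_hat = {pi_hat:.3f} (se roughly {se:.3f})")
        ```

        With 30/36 of answers given honestly, the precision loss relative to a direct question is modest, which is in line with the earlier point that randomized response is statistically more efficient than the list approach.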

  3. List experiments certainly are challenging to implement, but I think that they can be a useful tool to answer questions where social desirability bias might be at play. Several authors have looked at questions of list size and other methodological issues (Glynn, Berinsky, Imai for starters), and I think those are less of an issue, especially since best practices dictate a list of 3 or 4 control items, which is easier to handle (and design!) than a list of 6 or 10. This isn’t to say that designing a control list is easy, and I think Simon’s point about making sure that the sensitive item doesn’t stand out is key. Combine this with other standard recommendations (see Glynn 2013) — creating a list that avoids floor and ceiling effects (a respondent choosing all or none, thus losing anonymity) and creating a list with items that are negatively correlated to reduce variance — and it can be very challenging to come up with perfect control items. In my mind, the control list is the hardest part of designing a list experiment, but it can be done, especially with the help of pilot testing and past surveys.
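
    As a small illustration of the negative-correlation point (with made-up endorsement probabilities, not numbers from any of the surveys mentioned here): a control pair in which endorsing one item usually rules out the other concentrates the list total near the middle, which both shrinks the variance of the count and keeps respondents away from the floor and the ceiling.

    ```python
    # Illustration of why negatively correlated control items reduce the variance
    # of the list total (made-up endorsement probabilities).
    import numpy as np

    rng = np.random.default_rng(3)
    n = 100_000

    # Two independent control items, each endorsed with probability 0.5.
    a = rng.binomial(1, 0.5, size=n)
    b = rng.binomial(1, 0.5, size=n)
    print("independent pair:           var of total =", round(np.var(a + b), 3))    # about 0.5

    # Strongly negatively correlated pair: endorsing one usually rules out the other.
    a2 = rng.binomial(1, 0.5, size=n)
    b2 = np.where(rng.random(n) < 0.9, 1 - a2, rng.binomial(1, 0.5, size=n))
    print("negatively correlated pair: var of total =", round(np.var(a2 + b2), 3))  # about 0.05
    ```

    In the extreme, a perfectly negatively correlated pair would make the control total constant, though real survey items will only get part of the way there.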

    I conducted two list experiments with Jeffrey Lax and Justin Phillips on the 2013 CCES to see if people are lying to pollsters about their support for same-sex rights. Our first list experiment, on same-sex marriage, was quite informative, showing that overall support matched population support, but that different subgroups might be lying in intuitive and logical ways. (See the working paper here: http://polisci.columbia.edu/files/polisci/u230/Lax_Phillips_Stollwerk_List_Experiment.pdf) Our second list experiment, on employment non-discrimination, unfortunately demonstrated some of the pitfalls of the list experiment — especially on the question of trust that Lynn brings up. We found that survey results were quite different and hard to explain among those who had already seen one list experiment treatment on gay rights (but normal for those who were in the control in our first list experiment). It may be the case that list experiments are a tool to use occasionally, when the question requires it and the format is conducive to it, rather than a design to incorporate as a standard survey work-horse.

    • Just to add a quick comment to Alissa’s — and she’s more of an expert than I am, certainly — we found the pilot testing (and close readings of previous surveys) to be crucial. That is, my intuitions about question pairs with sufficiently negative correlations would have been a poor substitute. It might be easier in some settings than others to find the right control questions with such correlations while also avoiding floor and ceiling effects (and avoiding obviously fake choices).

  4. My secondary analysis of list experiments in the 1991 National Race and Politics Survey produced results that paralleled Lynn’s finding regarding the opposite direction. The survey included a list experiment with a test item of “a black family moving next door to you” and a list experiment with a test item of “black leaders asking the government for affirmative action”; the survey also included fourteen stereotype measures asking respondents to rate from 0 to 10 how well “most blacks” could be described as intelligent in school, aggressive or violent, etc.

    For whites who never rated “most blacks” at the favorable end of a stereotype measure, the list experiment estimated that 11 percent were angered by the thought of a black family moving in next door; but for whites who rated “most blacks” at the favorable end of at least four stereotype measures, the list experiment produced a point estimate of negative 92 percent angered by the thought of a black family moving in next door.

    This “opposite” effect did not appear to happen in the affirmative action list experiment: for whites who rated “most blacks” at the favorable end of at least four stereotype measures, the list experiment estimated that 49 percent were angry about black leaders asking the government for affirmative action, compared to 54 percent angered among whites who never rated “most blacks” at the favorable end of a stereotype measure.

    Some of this deflation appeared to result from some respondents in the “black family” condition reporting that zero items angered them, possibly to send an unambiguous signal of racial tolerance. These results were described in the SSQ article, “You Wouldn’t Like Me When I’m Angry.”
