Poisoning the well with a within-person design? What’s the risk?

I was thinking more about our recommendation that psychology researchers routinely use within-person rather than between-person designs.

The quick story is that a within-person design is more statistically efficient because, when you compare measurements within a person, you should get less variation than when you compare different groups. But researchers often use between-person designs out of a concern with “poisoning the well”: the worry that, if you apply treatments A and B to someone, the effects of A might persist until the second measurement period, or the two treatments can interact.
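
The efficiency argument can be checked with a small simulation. What follows is only a sketch under assumed numbers: the effect size (0.2), the person-to-person standard deviation (1.0), the measurement noise (0.3), and the function names are all hypothetical, and the within-person version assumes no carryover between the two measurements, which is exactly the assumption the "poisoning the well" worry questions. Both designs use the same total number of measurements.

```python
import random
import statistics

def simulate_between(n_per_group, effect, sd_person, sd_noise, rng):
    """One between-person study: two independent groups, one measurement each."""
    control = [rng.gauss(0, sd_person) + rng.gauss(0, sd_noise)
               for _ in range(n_per_group)]
    treated = [rng.gauss(0, sd_person) + effect + rng.gauss(0, sd_noise)
               for _ in range(n_per_group)]
    return statistics.fmean(treated) - statistics.fmean(control)

def simulate_within(n_people, effect, sd_person, sd_noise, rng):
    """One within-person study: each person measured under A and under B.

    Assumes no carryover: the person's baseline cancels in the difference.
    """
    diffs = []
    for _ in range(n_people):
        baseline = rng.gauss(0, sd_person)
        y_a = baseline + rng.gauss(0, sd_noise)
        y_b = baseline + effect + rng.gauss(0, sd_noise)
        diffs.append(y_b - y_a)
    return statistics.fmean(diffs)

rng = random.Random(1)
n_sims = 2000

# 100 measurements in each design: 2 groups of 50, or 50 people measured twice.
between = [simulate_between(50, 0.2, 1.0, 0.3, rng) for _ in range(n_sims)]
within = [simulate_within(50, 0.2, 1.0, 0.3, rng) for _ in range(n_sims)]

sd_between = statistics.stdev(between)  # theory: sqrt(2 * (1.0**2 + 0.3**2) / 50)
sd_within = statistics.stdev(within)    # theory: sqrt(2 * 0.3**2 / 50)

print(f"sd of between-person estimate: {sd_between:.3f}")
print(f"sd of within-person estimate:  {sd_within:.3f}")
```

Under these assumed numbers the between-person estimate is several times noisier, because person-to-person variation enters its standard error and drops out of the within-person difference.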

I think there’s a common view among researchers that, even if the within-person design might be more efficient, the between-person design is safer in that it gives an unbiased estimate. And it’s considered a better scientific decision to choose the safer option.

I have a few things to say about this attitude, in which people want to use the safe, conservative statistical analysis.

1. As John Carlin and I explain, if you restrict yourself to summarizing with statistically significant comparisons (as is standard practice), your estimates are not at all unbiased. Type M error can be huge.

2. When uncontrolled variation is high, type S errors can also be huge: in short, if you have a noisy study, you’re likely to draw substantively wrong conclusions.

3. Finally, just on its own terms—even if you accept the (false) belief that the noisy, between-person design is “safer”—even then, so what? Scientific research is not supposed to be safe. Power pose, ovulation and voting, embodied cognition, etc.: These are not “safe” ideas. They are controversial, risky ideas—they’re surprising, and that’s one reason they hit the headlines. We’re talking about researchers who in general don’t consider the safe path as a virtue: they want to make new, surprising discoveries.
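
Points 1 and 2 can be illustrated by simulation. The sketch below assumes the estimate is normally distributed around the true effect with a known standard error and that only estimates with |estimate| > 1.96 standard errors get reported. The function name `retrodesign_sim` and all the numbers are hypothetical, chosen only to contrast a noisy study with a more precise one.

```python
import random
import statistics

def retrodesign_sim(true_effect, se, n_sims=100_000, z_crit=1.96, seed=0):
    """Simulate replications of a study and summarize the significant ones.

    Returns (power, type_s, type_m), where
      power  = share of replications that reach statistical significance,
      type_s = share of significant estimates with the wrong sign,
      type_m = average |significant estimate| divided by |true effect|.
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    significant = [e for e in estimates if abs(e) > z_crit * se]
    if not significant:
        raise ValueError("no significant replications; increase n_sims")
    power = len(significant) / n_sims
    type_s = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    type_m = statistics.fmean([abs(e) for e in significant]) / abs(true_effect)
    return power, type_s, type_m

# Same assumed true effect (0.1), two assumed levels of noise.
noisy = retrodesign_sim(true_effect=0.1, se=0.5)
precise = retrodesign_sim(true_effect=0.1, se=0.05)

for label, (power, type_s, type_m) in [("noisy", noisy), ("precise", precise)]:
    print(f"{label:8s} power={power:.3f} type S={type_s:.3f} type M={type_m:.1f}")
```

In the noisy setting, the estimates that survive the significance filter are many times larger than the assumed true effect, and a substantial share have the wrong sign. In the precise setting both problems mostly go away.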

Putting this all together, I thought it could be useful to frame questions of experimental design and analysis in terms of risks and benefits.

In a typical psychology experiment, the risk and benefits are indirect. No patients’ lives are in jeopardy, nor will any be saved. There could be benefits in the form of improved educational methods, or better psychotherapies, or simply a better understanding of science. On the other side, the risk is that people’s time could be wasted with spurious theories or ineffective treatments. Useless interventions could be costly in themselves and could do further harm by crowding out more effective treatments that might otherwise have been tried.

The point is that “bias” per se is not the risk. The risks and benefits come later on when someone tries to do something with the published results, such as to change national policy on child nutrition based on claims that are quite possibly spurious.

Now let’s apply these ideas to the between/within question. I’ll take one example, the notorious ovulation-and-voting study, which had a between-person design: a bunch of women were asked about their vote preference, the dates of their cycle, and some other questions, and then women in a certain phase of their cycle were compared to women in other phases. Instead, I think this should’ve been studied (if at all) using a within-person design: survey these women multiple times at different times of the month, each time asking a bunch of questions including vote intention. Under the within-person design, there’d be some concern that some respondents would be motivated to keep their answers consistent, but in what sense does that constitute a risk? What would happen is that changes would be underestimated, but when this propagates down to inferences about day-of-cycle effects, I’m pretty sure this is a small problem compared to all the variation that tangles up the between-person design. One could do a more formal version of this analysis; the point is that such comparisons can be done.
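
Here is one way a more formal version might start, as a rough sketch rather than an analysis of the actual study. It assumes the consistency motive shrinks each observed within-person change by a fixed factor (`attenuation`, hypothetical), so the within-person estimate is biased toward zero, and it compares root mean squared error with an unbiased between-person estimate. All numbers and function names are made up for illustration.

```python
import math

def rmse_between(n_per_group, sd_person, sd_noise):
    """RMSE of an unbiased difference in means between two independent groups."""
    return math.sqrt(2 * (sd_person**2 + sd_noise**2) / n_per_group)

def rmse_within(n_people, effect, attenuation, sd_noise):
    """RMSE of a within-person mean difference whose expected value is
    attenuation * effect (attenuation = 1 means no bias)."""
    bias = (attenuation - 1) * effect
    se = math.sqrt(2 * sd_noise**2 / n_people)
    return math.sqrt(bias**2 + se**2)

# Assumed values: true effect 0.2, half of each change hidden by consistency,
# person-level sd 1.0, measurement noise sd 0.3.
for n in (50, 5000):
    b = rmse_between(n, sd_person=1.0, sd_noise=0.3)
    w = rmse_within(n, effect=0.2, attenuation=0.5, sd_noise=0.3)
    print(f"n={n:5d}  between RMSE={b:.3f}  within RMSE={w:.3f}")
```

With these assumptions the biased within-person estimate has the lower error at n=50, while at n=5000 the bias dominates and the between-person design comes out ahead. Which regime a real study is in depends on quantities that have to be argued for, which is the point of framing this as a tradeoff.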

44 thoughts on “Poisoning the well with a within-person design? What’s the risk?”

  1. Great points! Also important to consider that there are many ways to control or quantify the effects of “poisoning the well”, such as counterbalancing, ABA sequence (or its more elaborate cousins) designs, or by including sufficiently long time periods between assessments to reduce consistency demand effects.

  2. Funny, i just did an Implicit Association Task (IAT) to investigate something. I did a few of these as a participant at university. While doing that, i found the task to be quite hard. I had to try and focus on one of my fingers (e.g. from the left hand) and keep repeating the category or categories in my head that were associated with pressing the key below my left hand.

    I was trying to investigate whether the test results would be influenced by me either focusing from the start of the experiment on my left finger (and the associated category/categories), or on my right finger (and the associated category/categories). The thing i was trying to investigate was whether the first choice of focus on which finger (and associated category/categories) would influence the results.

    I used this site:

    http://www.projectimplicit.net/deadoralive/

    I first tried focusing on the left finger and got:

    1) slight automatic identification with alive compared to dead
    2) moderate automatic identification with alive compared to dead
    3) little to no identification with alive compared to dead

    Hmm, not very reliable, but perhaps i could still investigate what i wanted to, by now focusing on the right finger (and associated category/categories). I got:

    1) moderate automatic identification with alive compared to dead

    Then i stopped. I would have continued if i would have found “moderate automatic identification with dead compared to alive”, but i did not find that.

    I did learn something though: within-persons experiments can be exhausting.

    • It’s all fun and games, I get it, but something that surprised me, when I was getting into reaction time tests, was that the standard keyboards are actually temporally fairly inaccurate. That’s a really concrete source of noise–in this blog sources of variation and noise are regularly discussed, and sometimes it is on a really abstract level, but here, I’d say, we are on a really down-to-the-earth level. And of course since that is on the Amazing Internet (AI.. oh! Wait what?) that is going to add some noise in there too. Hey-o, everyone get on the noise-adding train!

      I must say, as a sort of post scriptum, that this is not intended as any sort of defense for IAT type tests. They are, as I see it, quite wishy-washy. I was horrified to hear that they are still used as a quantitative measurement of racial bias. I’d wish we’d get back to the good ol’ skull measurements.

      • It depends on what you mean by inaccurate. Keyboards are usually a source of noise but that’s about it, and a relatively small source. Most of the smallest RT effects you care about are in the range of the absolute error added by the keyboard (polled at 100Hz). However, the only amount that matters is the SE of that error and therefore it can be reduced easily by collecting a few more responses. Or do the right thing and collect a lot more responses. I’ve run many an RT study that wound up with effect CIs in the <4ms range.

        There is a bigger issue in this particular case though. Some keyboards can introduce bias because they scan slowly (sometimes longer than 10ms) in a consistent direction and the keys on the lower right end up being detected later than the left. I'm not sure of the exact design of this study, or how similar ones are, but if you're looking at pretty small RT effects between right and left buttons that can be problematic.

      • It’s worth noting that the online versions of IAT are intended for edification only. Those who collect data in controlled settings are intensely aware of the problems with keyboards (they don’t just add noise, but bias too, since different keys are scanned at different times) and often use specialized button-boxes instead.

        IAT, as an experimental paradigm, has been explored quite well as far as its replicability, variance and robustness; my verdict is that it’s definitely measuring “something.” Whether that something is racial bias, is unclear to me. It’s also a something that seems to be mostly interesting at population level and less at individual level.

        It fits into a broader literature of reaction time studies and the speed-accuracy tradeoff in psychophysics. I think of it more generally as revealing something about what mechanisms (and compensatory mechanisms) are involved in visual perception for categories. I had very similar reaction time effects when studying an illusion composed of conflicting motion cues, for instance. The reaction time effects were in line with what you would expect if one cue was fast to process but easily biased and the other was slower but more reliable.

        Within subjects experiments *are* exhausting. I paid volunteers $15 for a typically 50 minute session where they made ~1000 judgements.

        • Your last paragraph suggests that “time at which the judgment occurred” is likely to be a confounding factor in such experiments. Is some attempt made to adjust for this? (Is it even recorded? Is order of judgments randomized?)

        • Well, I recorded a timestamp every time a pixel changed on the screen and did a bunch of work to make sure those timestamps were accurate to a millisecond, so yes, the time within session is recorded, and there’s usually some attempt to check variation of results using trials early / late in sessions or early/late in that subject’s overall data.

          It’s common to use an “interleaved staircase” design; say if you wanted to measure sensitivities for 8 stimulus variants, you set up 8 parallel staircases (i.e. sequential online estimation procedures) and perform an equal number of trials from each procedure in shuffled order.

          There are definitely serial-ordering effects, where the stimulus/judgement from the last trial affects the next; mostly we try to minimize those by balancing the order of trials.

  3. While risk aversion might partly explain this, the bigger reason is bias. A between person RCT has no real possibility of bias and thus the estimate that emerges is “unbiased.” Victory! The fact that the within subject estimator is *possibly* biased makes it a nonstarter even though the variance will almost surely be dramatically lower. Even where potentially biased estimators are allowed by the reviewer, the combination of possible bias and low variance leads to the suspicion that one has determined a statistically significant bias, not a statistically significant effect. Try getting the paper published in which you say: “There’s a definite effect here, but whether it’s bias in the experimental setup or a real effect in the world I can’t convincingly say.” Much easier to avoid the bias question altogether by accepting a high-variance zero-bias setup, learning about the world be damned.

    • > A between person RCT has no real possibility of bias
      That is just not true (even in a between animal RCT) – the randomization just provides equal in distribution comparison groups.

      It would be very unusual for everything to go as planned – some un-blinding, some informative loss of follow-up, non-compliances, etc. etc. will lead to some bias.

      The position of not trading any bias for variance reduction, because you have so far avoided any bias, is not really defensible.

      Now in any context there is an amount of bias that would make the experiment not very informative. So there needs to be an assessment of its size as for instance discussed here https://www.ncbi.nlm.nih.gov/pubmed/12933636

      • Well, sure. I was being somewhat terse here. No possibility of bias in the experimental setup, but the real world has actual people performing actual experiments which may not precisely match the setup. But in any case where individual response has big variance between but small variance within, that variance reduction has to be compared to what I assume are the tiny biases introduced by performance noise. (Though in some cases I grant they may not be tiny at all.) Also, to answer Andrew, I never meant unbiased after the application of a statistical significance filter.

  4. as i have heard it, the poisoning the well argument typically refers to cases where being exposed to multiple treatments leads people to change their behavior in line with the experimenter’s hypothesis (i.e. “demand effects”). so, unlike the ovulation-and-voting example (where the hypothesis is sufficiently esoteric that people are unlikely to guess it and change their behavior accordingly), the concern is that the within design can produce a spurious effect of experimental condition, rather than washing out a real difference.

    • Sourdough:

      Yes, there is this concern, and it needs to be addressed in the experimental design and also in the modeling. And in some cases these expectation effects are so large that a crossover design won’t work at all. The point of the above post is that there is a tradeoff, that the difficulty of identification amid these potential crossover effects must be balanced against the hopelessness of learning anything with noisy between-person comparisons. And my problem with lots and lots of research is that people don’t seem to consider the tradeoff at all, instead doing useless between-person comparisons because they don’t know any better.

      • “And in some cases these expectation effects are so large that a crossover design won’t work at all”. Would you mind amplifying this remark ? If that expectation effect is *that* large, can the effect be in fact “objectively” studied ?

        In this case, the experimenter is, in fact, part of the experiment… The only way out I can see is some kind of “meta-experiment”, with the meta-experimenter attributing to the experimenters hypotheses to test (and reasons to do so). Hierarchical designs, anyone ?

  5. Hundreds of books and articles are written on the STATISTICAL analysis of ws and bs designs. Hardly any research exists on the comparison of results of ws and bs designs examining the same phenomenon in psychology. This kind of research is very much needed! It will increase our understanding of the relative advantages of both designs AND of the phenomenon itself.

    General advice on what design to use is not a smart thing to do, if one does not know how a design interacts with the phenomenon studied.

    Importantly, some theories having implications for individual behavior should be tested at the individual level to truly test the theory. Often this does not happen. A famous example is prospect theory with the reflection effect. Often this was tested using a between-subjects design, but as the theory implies reflection at the individual level, this does not make sense – it should be tested using a ws design.

    • Marcel:

      I agree that psychology researchers should be thinking about these issues in the context of their particular experiments. As a statistician, I think I can make a useful contribution here by explaining the problems with naive ideas about unbiasedness.

    • Recent focus has been on statistical analysis in published papers and undergrad textbooks but at the graduate level BS and WS design costs and benefits are in all of the decent textbooks. Some studies, like the ovulation and voting one, most definitely should have been done WS. Too many clinical and social psychologists end up working with and around medical professionals who only understand control v. experimental group designs. If you look at cognitive psych, which works primarily purely academically or within industry where they are the scientific expert, BS designs are a last resort.

      • Design costs and benefits – that is all clear. But what is missing is which findings are better examined by ws than bs designs (no evidence on asymmetric order effects), and meta-analytic findings on the effect of designs on the examined effect size.

  6. “Poisoning the well” (AKA “contamination”) can also happen in between-subject designs. I was once on the Ph.D. committee of a student studying “stereotype threat” who encountered this. The student subjects (who were administered the “intervention” one at a time) talked to each other (presumably despite instructions not to).

  7. I think the risk that the well will be poisoned really depends on the experiment. For instance, asking somebody who they will vote for once probably doesn’t change what they will say if you ask them who they will vote for again. But maybe putting somebody through the Milgram experiment twice really would change how they responded. I guess this is another thing that psychological theories need to specify.

    • Asking someone once who they will vote for anchors future responses. People are reluctant to change their mind much. Your best bet in that case wouldn’t be getting a bunch of binomial responses because you’ll only get variation from those on the fence. Your best bet would be to instead have a candidate rating of the variable in question. In this particular case it wouldn’t have been too bad I suppose because the variable was kind of incidental (red shirt or something).

  8. Thanks for updating this advice. I certainly agree that there’s a bias/variance tradeoff when deciding between within subject and between subject (which you would probably abbreviate “bs”) designs. And I also agree that there’s a tendency for people to ignore this tradeoff and do bs designs because they’re unbiased even in situations where the tradeoff clearly favors a ws design. (And yes, the significance filter and measurement problems can make published results from otherwise flawed bs designs biased also.) But I’m glad you’re stopping short of generally recommending ws designs.

  9. With complete counterbalancing, every within-subject design contains its own (underpowered) between subject design — the first stimulus/response pair in each respondent’s set. With a large enough N, that slice of the data can give a sense of what the “uncontaminated” responses look like, right?

  10. Following Molenaar’s advice I see this in terms of two different covariance structures. A cross-sectional observational study estimates the between-person covariance of X & Y. A within-person study with repeated measures estimates the within-person covariation of X & Y – potentially something completely different. Just look at the field of nutrition all tied up in knots because in many of their studies calories consumed doesn’t correlate highly with weight status (or even swings negative). And yet it’s basically axiomatic that change in intake (up or down) correlates very highly with change in weight status. But that’s the within person covariation of intake and weight – and that’s not what the big surveys typically measure.

  11. While I think this is essentially the source of bias you’re talking about, when psychologists are trained they are generally taught that the gold standard for internal validity is the between-subjects experiment. I don’t wholly agree or disagree with this, but it’s not hard to see how for a lot of research questions, a within-subjects design may be less likely to have clear internal validity given this poisoning of the well, as you put it.

    • When I went to school that wasn’t the case at all. WS designs were strongly encouraged and the circumstances where you wouldn’t use them were laid out pretty well. Further, even some intro texts (i.e. Stanovich) point out that both kinds of designs are important for understanding a topic.

      • Jacob said: “when psychologists are trained they are generally taught that the gold standard for internal validity is the between-subjects experiment.”

        I think the phrase “gold standard” has done a lot of harm — it tends to seduce people into believing that if you have the thing that is called “gold standard,” then everything is fine. So, for example, they may think they are using the “gold standard” when they do a “randomized controlled trial,” but they ignore blinding, don’t ensure that the randomization is really random, ignore drop-outs, don’t measure negative effects, take full advantage of the garden of forking paths, etc. — and naively think they’ve got a good study because it is “the gold standard”.

  12. It’s a little odd reading these discussions coming from the perceptual / neuroscience arm of psychology, where WS designs are normal. I often find myself in the position of being asked by lay audiences why a study with “only” n=5 participants (in perception studies, one of whom is often the lead author) should be trusted. The significance of WS versus BS designs isn’t obvious to many and the fact that one has, say, 2000 data points per subject rather than a small handful often doesn’t register either.

    Does anyone have any useful metaphors or explanations?

    I’ve tried pointing out that choosing to select more subjects to study trades off directly against the amount of data you can collect per subject. So there is a continuum of sorts, on the one end is something like a presidential election: an experiment that uses as many subjects as possible but the amount of data collected per subject is 1 bit. On the other end you have something like high-energy physics where experiments have a very large number of trials, but there are only 1 or 2 instruments capable of doing the experiment and moreover we can only study our given (n=1) universe so we can actually only perform within-subject experiments.

    Another tack I’ve tried is to point out why generally, people will have different sensitivities to the same stimuli, different reaction times; people are diverse, much more so than a line of lab mice. And generally we aren’t interested in cataloguing the distribution of sensitivities across the population, but rather in determining the common relations among them — which we think of as existing within each subject. So our hypothesis comes in the form of a model that has the same form across subjects but has some parameters that are allowed to vary between subjects. For any one subject we can get a read on how well such a model captures behavior for that subject. Done this way, an N=5 study starts to look like 5 “replications” of a single-subject experiment, and the question becomes, “if H1 beats H0 at a particular p-level in this subject, how many times do you need to repeat that measure before you start thinking that H1 is a useful hypothesis for humans generally?”

    • It seems to me that this very-low-N approach is valid to the extent there are strong neurobiological invariances underlying the effects studied. One sign of this would be a high replicability of these effects.

      In any case, invariances might be much fewer in other domains of study. I guess a common intuition is that the “farther away” you get from “low-level” neural processes, the more flexibility there is (hello psychology!).

    • Thanks so much for this comment! I’ve been trying to express something like this for a while, but I don’t think I came close to this…

      You wrote:
      “And generally we aren’t interested in cataloguing the distribution of sensitivities across the population, but rather in determining the common relations among them — which we think of as existing within each subject.”

      I think I agree with you that this is the status quo. But why do we think this way? Why don’t we think of the variation as a tool to understand the systems we study? My current work examines a population of children who have a type of genetically-linked epilepsy. Here it is clear to me why I am studying the variation.

      • I suspect that a lot of people really underestimate the amount of variation between individuals. But I often wonder why that is? As a child, I observed the differences between my relatives, even though they are genetically related. As a professor, I noticed the differences between individual students whenever I graded exams or asked or answered questions in class. Do most people not see these differences — not just in how people look, but in the ways they look at things and the ways they think?
