Zach Horne writes:

A student of mine was presenting at the annual meeting of the Law and Society Association. She sent me this note after she gave her talk:

I presented some research at LSA which used a within subject design. I got attacked during the Q&A session for using a within subjects design and a few people said it doesn’t mean much unless I can replicate it with a between subjects design and it has no ecological validity.

She asked me if I had any thoughts on this and whether I had previously had problems defending a within subjects design. She also wondered what one should say when people take issue with within subjects designs.

I sent her a note with my initial thoughts but I thought it would be worth bringing up (again) on your blog because I’ve run into this criticism a lot. I don’t think she should just buckle and start running between subjects designs just to appease reviewers. We need people to understand the value of within designs, but the mantra “measurement error is important to think about” doesn’t seem to be doing the trick.

For background, here are two old posts of mine that I found on this topic from 2016 and 2017:

Poisoning the well with a within-person design? What’s the risk?

Now, to quickly answer the questions above:

First off, the “ecological validity” thing is a red herring. Whoever said that either was misunderstood or didn’t know what they were talking about. Ecological validity refers to generalization from the lab to the real world, and it’s an important concern—but it has nothing to do with whether your measurements are within or between people.

Second, I think within-person designs are generally the best option when studying within-person effects. But there are settings where a between-person design is better.

In order to understand why I prefer the within-person design, it’s helpful to see the key advantage of the between-person design, which is that, by giving each person only one treatment, the effects of the treatment are pure. No crossover effects to worry about.

The disadvantage of the between-person design is that it does not control for variation among people, which can be huge.

In short, the between-person design is often cleaner, but at the cost of being so variable as to be essentially useless.

OK, at this point you might say, Fine, just do the between-person design with a really large N. But this approach has two problems. First, researchers can’t always get a really large N. One reason for that is the naive view that, if you have statistical significance, then your sample size was large enough. Second, all studies have bias (for example, in a psychology experiment there will be information leakage and demand effects), and ramping up N won’t solve that problem.
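To make the tradeoff concrete, here’s a minimal simulation sketch, with invented numbers (true effect 0.2, person-to-person sd of 2, measurement noise of 0.5), comparing the standard error each design delivers at the same sample size:

```python
import random

random.seed(1)

n = 200          # people per arm (all numbers here are invented)
effect = 0.2     # true treatment effect
sd_person = 2.0  # person-to-person variation, assumed large
sd_noise = 0.5   # measurement noise within a person

def sd(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5

# Between-person: each person contributes one measurement, treated or control.
treated = [random.gauss(0, sd_person) + effect + random.gauss(0, sd_noise)
           for _ in range(n)]
control = [random.gauss(0, sd_person) + random.gauss(0, sd_noise)
           for _ in range(n)]
between_se = (sd(treated) ** 2 / n + sd(control) ** 2 / n) ** 0.5

# Within-person: each person is measured under both conditions;
# taking the difference removes the person-level variation entirely.
diffs = []
for _ in range(n):
    person = random.gauss(0, sd_person)
    y_treat = person + effect + random.gauss(0, sd_noise)
    y_ctrl = person + random.gauss(0, sd_noise)
    diffs.append(y_treat - y_ctrl)
within_se = sd(diffs) / n ** 0.5

print(between_se, within_se)
```

With these made-up numbers, the between-person standard error comes out roughly the size of the effect itself, while the within-person design estimates the same effect with several times the precision.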

Here’s what I wrote a few years ago:

The clean simplicity of [between-person] designs has led researchers to neglect important issues of measurement . . .

Why use between-subject designs for studying within-subject phenomena? I see a bunch of reasons. In no particular order:

1. The between-subject design is easier, both for the experimenter and for any participant in the study. You just perform one measurement per person. No need to ask people a question twice, or follow them up, or ask them to keep a diary.

2. Analysis is simpler for the between-subject design. No need to worry about longitudinal data analysis or within-subject correlation or anything like that.

3. Concerns about poisoning the well. Ask the same question twice and you might be concerned that people are remembering their earlier responses. This can be an issue, and it’s worth testing for such possibilities and doing your measurements in a way to limit these concerns. But it should not be the deciding factor. Better a within-subject study with some measurement issues than a between-subject study that’s basically pure noise.

4. The confirmation fallacy. Lots of researchers think that if they’ve rejected a null hypothesis at the 5% level with some data, then they’ve proved the truth of their preferred alternative hypothesis. Statistically significant, so case closed, is the thinking. Then all concerns about measurement get swept aside: after all, who cares if the measurements are noisy, if you got significance? Such reasoning is wrong wrong wrong, but lots of people don’t understand this.

One motivation for between-subject design is an admirable desire to reduce bias. But we shouldn’t let the apparent purity of randomized experiments distract us from the importance of careful measurement.

And here’s a framing of questions of experimental design and analysis in terms of risks and benefits:

In a typical psychology experiment, the risk and benefits are indirect. No patients’ lives are in jeopardy, nor will any be saved. There could be benefits in the form of improved educational methods, or better psychotherapies, or simply a better understanding of science. On the other side, the risk is that people’s time could be wasted with spurious theories or ineffective treatments. Useless interventions could be costly in themselves and could do further harm by crowding out more effective treatments that might otherwise have been tried.

The point is that “bias” per se is not the risk. The risks and benefits come later on when someone tries to do something with the published results, such as to change national policy on child nutrition based on claims that are quite possibly spurious.

Now let’s apply these ideas to the between/within question. I’ll take one example, the notorious ovulation-and-voting study, which had a between-person design: a bunch of women were asked about their vote preference, the dates of their cycle, and some other questions, and then women in a certain phase of their cycle were compared to women in other phases. Instead, I think this should’ve been studied (if at all) using a within-person design: survey these women multiple times at different times of the month, each time asking a bunch of questions including vote intention. Under the within-person design, there’d be some concern that some respondents would be motivated to keep their answers consistent, but in what sense does that constitute a risk? What would happen is that changes would be underestimated, but when this propagates down to inferences about day-of-cycle effects, I’m pretty sure this is a small problem compared to all the variation that tangles up the between-person design. One could do a more formal version of this analysis; the point is that such comparisons can be done.

So, to get back to the question from my correspondent: what to do if someone hassles you to conduct a between-person design?

First, you can do a simulation study or design calculation and show the huge N that you would need to get a precise enough estimate of your effect of interest.
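Here’s what such a back-of-the-envelope design calculation might look like, with invented numbers (effect 0.2, person-to-person sd 2, measurement noise 0.5) and the standard 2.8 multiplier for 80% power at the 5% level:

```python
import math

effect = 0.2       # hypothetical effect size
sd_person = 2.0    # person-to-person variation (invented)
sd_noise = 0.5     # measurement noise within a person (invented)

# For 80% power at the 5% level, the standard error needs to be
# about effect / 2.8.
target_se = effect / 2.8

# Within-person: each person's treatment-minus-control difference has
# variance 2 * sd_noise^2, so se = sqrt(2) * sd_noise / sqrt(n).
n_within = math.ceil(2 * sd_noise ** 2 / target_se ** 2)

# Between-person: the difference in group means has variance
# 2 * (sd_person^2 + sd_noise^2) / n, with n people per arm.
n_between = math.ceil(2 * (sd_person ** 2 + sd_noise ** 2) / target_se ** 2)

print(n_within, n_between)  # roughly 98 vs. 1666 people per arm
```

With this much person-to-person variation, the between-person design needs roughly 17 times as many people per arm to reach the same precision.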

Second, you can point out that inferences from the between-person design are entirely indirect and only of averages, even though for substantive reasons you almost certainly are interested in individual effects.

Third, you can throw the “ecological validity” thing back at them and point out that, in real life, people are exposed to all sorts of different stimuli. Real life is a within-person design. In psychology experiments, we’re not talking about lifetime exposures to some treatment. In real life, people do different things all the time.

Andrew: Excellent points all. And let’s not forget Pavlov’s brilliant n=1 research on classical conditioning. No aggregates, no group comparisons, no statistics. Intensive studies of one dog at a time, each a replication and extension of the prior. And Pavlov knew each of his dogs intimately and by name, a tradition carried on into the late 1940s by his former students (Gantt and Liddell), who studied and published experimental work on 3 dogs (“Nick,” “Fritz,” and “Peter”) over a 12-year period!

Regarding the two points:

“First, you can do a simulation study or design calculation and show the huge N that you would need to get a precise enough estimate of your effect of interest.

“Second, you can point out that inferences from the between-person design are entirely indirect and only of averages, even though for substantive reasons you almost certainly are interested in individual effects.”

People usually aren’t taught how to simulate data; this is a big gap in the teaching curriculum everywhere. Even in the MSc in Statistics I did at Sheffield, this point wasn’t raised even once, not in medical stats, not when we studied linear mixed models. In retrospect, I find this shocking. What hope is there of teaching this stuff to non-statisticians if even stats students don’t get this education?

The same point holds for individual differences. Nobody is taught to care about them; we are taught to look at average differences. I keep quoting Spiegelhalter and Blastland’s book (The Norm Chronicles) on this point: “The average is an abstract. The reality is variation.” Why do we study the average difference? I really have no idea. I wasted my whole life (I mean, 17 years) focusing on average differences, and it was the dumbest thing I may have done so far.
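Both points can be illustrated with a few lines of simulated data. A toy sketch, with made-up numbers in which half the simulated people have a true effect of +1 and half have −1:

```python
import random

random.seed(7)

n = 1000
# Invented population: half the people respond at +1, half at -1.
true_effects = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]
# Observed effect = true effect plus a little measurement noise.
observed = [e + random.gauss(0, 0.1) for e in true_effects]

avg = sum(observed) / n
sd_obs = (sum((x - avg) ** 2 for x in observed) / (n - 1)) ** 0.5

# The average is near zero even though every individual effect is near +-1.
print(avg, sd_obs)
```

The average effect is essentially zero, yet not a single simulated individual has an effect anywhere near zero: the average is an abstract, the reality is variation.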

Someone needs to write a textbook focusing on these two points. I think that someone might be me (I have contracts with CRC Press for some upcoming books). It would be great if someone like Andrew also incorporated a full chapter on individual differences (I know that fake-data simulation is already part of his workflow; I learnt it from his books).

On a related note, Andrew, a psychologist suggested something to me: don’t call it fake-data simulation. Just call it simulated data. I think this is a good suggestion; the words fake data may be disconcerting to some newcomers not used to Andrewspeak.

I agree that looking at average differences ignores the important issue of individual differences. The average does have some value, but largely as a reference for considering variation of individuals from the “average” or “typical”.

An additional problem with averages is that different types of “average” (or “measure of center”) may be appropriate in different situations. One that does at least sometimes occur in beginning textbooks is the median, which is more “typical” of the group than the mean when the distribution is skewed (as, for example, is usually the case with housing prices in a particular locality). Other types of “average/measure of center” that may be more appropriate in some circumstances include:

Harmonic means (including weighted harmonic means): Appropriate when considering average speed or corporate average fuel economy

Geometric mean: Appropriate in situations where there is “proportional growth” — e.g., in investments with variable interest rates.
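These can be checked with a couple of toy calculations (the speeds and growth rates below are invented for illustration):

```python
import math

def harmonic_mean(xs):
    return len(xs) / sum(1 / x for x in xs)

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Average speed over equal distances: 30 mph out, 60 mph back.
# Total distance 2d over total time d/30 + d/60 gives 40 mph, not 45.
avg_speed = harmonic_mean([30, 60])  # 40.0

# Average growth factor: a year at +10% followed by a year at -10%
# leaves you slightly below where you started (1.1 * 0.9 = 0.99).
avg_growth = geometric_mean([1.10, 0.90])  # about 0.995
```

The harmonic mean answers “total distance over total time,” and the geometric mean answers “what constant growth factor would give the same end result.”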

Certainly the median is a different beast from the mean, and is often more useful.

But the harmonic mean and geometric mean are just ways of refusing to acknowledge that you used the wrong measure in the first place. The geometric mean is just the exponential of the mean of the log of the measure. The harmonic mean is just the reciprocal of the mean of the reciprocal of the measure. Really, it just means that your original measure should have been the log or reciprocal of what you used. If you think about it, for example, of course the appropriate metric for fuel economy is gallons per mile, not miles per gallon.

Indeed gallons/mile is better than miles/gallon for measuring fuel economy. See http://nsmn1.uh.edu/dgraur/niv/theMPGIllusion.pdf . Both versions appear now on the window stickers on new cars.
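The illusion is easy to reproduce with a toy calculation (the mileage figures are made up): over the same distance driven, upgrading a 10 mpg car to 20 mpg saves more gas than upgrading a 25 mpg car to 50 mpg, even though the second swap “gains” more mpg.

```python
miles = 10_000  # miles driven per year (hypothetical)

def gallons_used(mpg, miles):
    return miles / mpg

# Upgrade A: 10 mpg -> 20 mpg.  Upgrade B: 25 mpg -> 50 mpg.
saved_a = gallons_used(10, miles) - gallons_used(20, miles)  # 500 gallons
saved_b = gallons_used(25, miles) - gallons_used(50, miles)  # 200 gallons

# In gallons/mile the comparison is transparent: 0.10 -> 0.05 saves
# 0.05 gal/mile, while 0.04 -> 0.02 saves only 0.02 gal/mile.
print(saved_a, saved_b)
```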

I don’t think hours/mile is better than miles/hour for speed, which you might want to average sometimes.

“Better” isn’t an absolute thing… what’s better for one purpose isn’t necessarily better for another.

For example, if you take two snapshots from an airplane you can calculate the distance traveled by all the cars in the snapshot and calculate an average speed of cars on a stretch of road.

Or, you can put a sensor at a certain point in the road that measures your speed as you pass it.

Now, within a certain stretch of road, the slow cars are on that stretch longer. Civil engineers use the harmonic mean when averaging across space because it gives more weight to the slow cars, which spend more time on the stretch.

On the other hand, if you want to know what the average speed that people go past your sensor is, you use the time-average of all the passing events, just the average of the speeds…
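Those two averages can be checked with a toy fleet (equal flows of cars at 30 and 60 mph, numbers invented):

```python
# Hypothetical fleet: equal flows of cars at 30 mph and 60 mph.
speeds = [30.0, 60.0]

# Time-mean: average the speeds recorded as cars pass a fixed sensor.
time_mean = sum(speeds) / len(speeds)  # 45.0

# Space-mean: in a snapshot of a stretch of road, each car appears in
# proportion to the time it spends there (1/speed), so the average over
# the snapshot works out to the harmonic mean of the spot speeds.
space_mean = len(speeds) / sum(1 / v for v in speeds)  # 40.0
```

The snapshot average is lower because the slow cars spend twice as long on the stretch and so show up twice as often in the picture.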

So, you should choose your measure based on what you want to accomplish. Clyde’s point is I think very valid. What’s needed is a careful thoughtful choice, rather than the application of a default.

Even miles per gallon vs. gallons per mile is not so straightforward. If your question is “How much farther can I go before I run out of gas?” then miles/gallon is easy to use: you can just multiply by the approximate number of gallons left in your tank, and most people find multiplication easier than division…

but if your question is “how much gas will I have to buy to drive to the next city?” then gallons/mile is the relevant quantity.

All of this basically comes down to: people find multiplication intuitive for extrapolation, and division intuitive for comparing sizes.

I do within-person experiments all the time, with myself. The idea that you can’t learn from them is ridiculous.

Indeed, I recently read a book, ‘Smoking Ears and Screaming Teeth: A Celebration of Scientific Eccentricity and Self-Experimentation’, by Trevor Norton, that contains nearly nothing except descriptions of how science was advanced by people doing within-person experiments (on themselves). Many of these examples are famous and celebrated, like the guy who drank H. pylori to see if it would give him an ulcer (it did).

Here’s a relevant post from a year and a half ago: https://statmodeling.stat.columbia.edu/2018/07/10/exercise-weight-loss-long-term-follow/