Before data analysis: Additional recommendations for designing experiments to learn about the world

Statisticians talk a lot about what to do with your data. We can go further by considering what comes before data analysis: design of experiments and data collection. Here are some recommendations for design and data collection:

Recommendation 1. Consider measurements that address the underlying construct of interest.

The concepts of validity and reliability of measurement are well known in psychology but are often forgotten in experimental design and analysis. Often we see exposure or treatment measures and outcome measures that connect only indirectly to substantive research goals. This can be seen in the frequent disconnect between the title and abstract of a research paper, on one hand, and the actual experiment, on the other. A notorious example in psychology is a paper that referred in its title to a “long-term experimental study” that in fact was conducted for only three days.

Our recommendation goes in two directions. First, set up your design and data collection to measure what you want to learn about. If you are interested in long-term effects, conduct a long-term study if possible. Second, to the extent that it is not possible to take measurements that align with your inferential goals, be open about this gap and explicit about the theoretical assumptions or external information that you are using to support your more general conclusions.

Recommendation 2. When designing an experiment, consider realistic effect sizes.

There is a tendency to overestimate effect sizes when designing a study. Part of this is optimism and availability bias: it is natural for researchers who have thought hard about a particular effect to think that it will be important, to envision scenarios where the treatment will have large effects, and not to think so much about cases where it will have no effect or where it will be overwhelmed by other possible influences on the outcome. In addition, results are much more likely to be published if they have reached a significance threshold, and this produces literature reviews that vastly overestimate effect sizes.

Overestimation of effect sizes leads to overconfidence in design, with researchers being satisfied with small sample sizes and sloppy measurements in the mistaken belief that the underlying effect is so large that it can be detected even with very crude inference. And this causes three problems. First, it is a waste of resources to conduct an experiment that is so noisy that there is essentially no chance of learning anything useful, and this sort of work can crowd out the more careful studies that would be needed to detect realistic effect sizes. Second, a false expectation of high power creates a cycle of hype and disappointment that can discredit a field of research. Third, the process of overestimation can be self-perpetuating, with a noisy experiment being analyzed until apparently statistically significant results appear, leading to another overestimate to add to the literature. These problems arise not just in statistical power analysis (where the goal is to design an experiment with a high probability of yielding a statistically significant result) but also in more general design analyses where inferences will be summarized by estimates and uncertainties.
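To see the consequences in a small example (the 0.5 and 0.1 standard-deviation effects here are hypothetical numbers chosen for illustration): a study powered at 80% for an optimistic effect has almost no power if the true effect is modest.

```r
# Illustration with made-up effect sizes: power a two-sample study for an
# optimistic 0.5 sd effect, then see its power against a realistic 0.1 sd effect.
optimistic <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)
n_per_group <- ceiling(optimistic$n)  # about 64 per group
realistic <- power.t.test(n = n_per_group, delta = 0.1, sd = 1, sig.level = 0.05)
realistic$power  # under 10%: almost no chance of detecting the realistic effect
```

The same design that looks comfortably powered under the optimistic assumption is essentially hopeless under the realistic one, which is the waste-of-resources problem described above.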

Recommendation 3. Simulate your data collection and analysis on the computer first.

In the past, we have designed experiments and gathered data in the hope that the results would lead to insight and possible publication, but then the actual data would end up too noisy, and we would realize in retrospect that our study never really had a chance of answering the questions we wanted to ask. Such an experience is not a complete waste, as we learn from our mistakes and can use them to design future studies, but we can often do better by preceding any data collection with a computer simulation.

Simulating a study can be more challenging than conducting a traditional power analysis. The simulation does not require any mathematical calculations; the challenge is the need to specify all aspects of the new study. For example, if the analysis will use regression on pre-treatment predictors, these must be simulated too, and the simulated model for the outcome should include the possibility of interactions.

Beyond the obvious benefit of revealing designs that look to be too noisy to detect main effects or interactions of interest, the construction of the simulation focuses our ideas by forcing us to make hard choices in assuming the structure and sizes of effects. In the simulation we can make assumptions about variation in measurement and in treatment effects, which can facilitate the first two recommendations above.
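As a minimal sketch of such a simulation (the sample size, effect sizes, noise level, and interaction below are all assumed numbers for illustration, not recommendations):

```r
# Fake-data simulation of a planned experiment: simulate the pre-treatment
# predictor too, and include a treatment-by-predictor interaction in the
# simulated outcome. All numbers are assumptions made for this sketch.
set.seed(123)
n <- 200
x <- rnorm(n)                    # pre-treatment predictor: must be simulated too
z <- rep(0:1, n / 2)             # randomized treatment indicator
y <- 0.2 * z + 0.5 * x + 0.3 * z * x + rnorm(n)  # assumed effects plus noise
fit <- lm(y ~ z * x)             # the planned analysis
se_z <- summary(fit)$coefficients["z", "Std. Error"]
se_z  # if this is comparable to the assumed effect of 0.2, the design is too noisy
```

Running this before collecting any data immediately shows whether the planned design can estimate the main effect and interaction with useful precision.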

Recommendation 4. Design in light of analysis.

In his book, The Chess Mysteries of Sherlock Holmes, logician Raymond Smullyan (1979) wrote, “To know the past, one must first know the future.” The application of this principle to statistics is that design and data collection should be aligned with how you plan to analyze your data. As Lee Sechrest put it, “The central issue is the validity of the inferences about the construct rather than the validity of the measure per se.”

One place this arises is in the collection of pre-treatment variables. If there is concern about imbalance between treatment and control groups in an observational study or an experiment with dropout, it is a good idea to think about such problems ahead of time and gather information on the participants to use in post-data adjustments. Along similar lines, it can make sense to recruit a broad range of participants and record information on them to facilitate generalizations from the data to larger populations of interest. A model to address problems of representativeness should include treatment interactions so that effects can vary by characteristics of the person and scenario.
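As a quick sketch of the payoff from collecting such a pre-treatment variable (all numbers below are assumptions for illustration): a baseline measurement that predicts the outcome lets the analysis adjust for chance imbalance and shrinks the standard error of the treatment effect.

```r
# Made-up example: a predictive baseline covariate, collected before treatment,
# substantially sharpens the treatment-effect estimate in the planned analysis.
set.seed(42)
n <- 400
baseline <- rnorm(n)                       # pre-treatment variable
z <- rbinom(n, 1, 0.5)                     # treatment assignment
y <- 0.3 * z + 0.8 * baseline + rnorm(n, sd = 0.6)
se_unadj <- summary(lm(y ~ z))$coefficients["z", "Std. Error"]
se_adj <- summary(lm(y ~ z + baseline))$coefficients["z", "Std. Error"]
c(unadjusted = se_unadj, adjusted = se_adj)  # adjusted SE is noticeably smaller
```

None of this is possible after the fact if the baseline variable was never collected, which is why the design stage has to anticipate the analysis.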

In summary, we can most effectively learn from experiments if we plan the design and data collection ahead of time, which involves: (1) using measurements that relate well to the underlying constructs of interest, (2) considering realistic effect sizes and variation, (3) simulating experiments on the computer before collecting any data, and (4) keeping analysis plans in mind at the design stage.

The background on this short paper was that I was asked by Joel Huber from the Journal of Consumer Psychology to comment on an article by Michel Wedel and David Gal, “Beyond statistical significance: Five principles for the new era of data analysis and reporting.” Their recommendations were: (1) summarize evidence in a continuous way, (2) recognize that rejection of statistical model A should not be taken as evidence in favor of preferred alternative B, (3) use substantive theory to generalize from experimental data to the real world, (4) report all the data rather than choosing a single summary, (5) report all steps of data collection and analysis. (That’s my rephrasing; you can go to the Wedel and Gal article to see their full version.)

10 thoughts on “Before data analysis: Additional recommendations for designing experiments to learn about the world”

  1. Also, if one ends up doing the analysis, Recommendation 3 has essentially zero additional cost ex post, as simulation code would be written for a lot of purposes anyway (checking and debugging inference, posterior predictive checks, etc.).

  2. Re: Recommendation 2. When designing an experiment, consider realistic effect sizes.

    1. I have long been curious why ‘effect size’ should come into play in determining sample size n, when a researcher is trying to move away from the dichotomous and cynical view that the ‘goal’ of an experiment is to reject the null hypothesis, rather than to learn something new about the world. Certainly we should be concerned that the estimate from an experiment is sufficiently precise and accurate, but using, say, the formula for the standard error to determine n wouldn’t usually require us to assume anything about the effect size.

    2. That said, when I was first introduced to the concept of determining n from a power analysis, I was taught that the effect size to use is the smallest value for the true effect that would be *practically* meaningful. Now, I may be ‘anchored’ to the view that if you are going to use those formulas, that is the right way to use them. Talking to a lot of academic researchers over the years, I don’t think I’ve ever encountered any who take practical significance into account. Instead, I’ve seen what seems like the very cynical decision of plugging in the estimate of the effect from a previous study. That just seems to suggest to me that the researcher is focusing too much on trying to reject H0. And if the estimate were the true effect size, then we wouldn’t need to run the follow-up in the first place. And, yes, use a previous study to help you make your follow-up more efficient; plugging in its estimate doesn’t do that.

    3. Also, in talking to academic researchers, even statisticians, it always struck me that a lot of them thought that a statistician’s contribution to the design of experiments was just this: determining n from a power analysis. This is baffling, when statistical design of experiments is such a rich field. There are 800-page textbooks, even at the introductory level, that don’t even mention doing a power analysis.

    Personally, I think some of the best advice a statistician could give when asked to determine n for an experiment is ‘make n as large as your budget affords. Then, once *you* tell me that n, I’ll show you how best to assign those units to the different treatments to ensure the effects you want to estimate are not confounded and can be estimated efficiently.’
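    To make that assignment point concrete, here is a toy sketch (the n of 16, the four blocks, and the 2x2 factorial are all made up for illustration): with n fixed, arranging a two-factor experiment in blocks keeps main effects and the interaction unconfounded with block differences.

```r
# Toy blocked 2x2 factorial: each block of 4 units gets all 4 treatment
# combinations in random order, so treatment effects are unconfounded with
# blocks. All design choices here are hypothetical.
set.seed(99)
combos <- expand.grid(A = 0:1, B = 0:1)  # the 2x2 treatment combinations
design <- do.call(rbind, lapply(1:4, function(b) {
  cbind(block = b, combos[sample(nrow(combos)), ])  # randomize within block
}))
xtabs(~ A + B, data = design)  # every combination appears equally often
```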

    • Jyd:

      1. Setting “statistical significance” and dichotomization aside, some studies are so noisy as to be useless. I don’t think it’s good to conduct a study estimating an effect of size 0.1 if the study is gonna come up with an estimate whose standard error is 0.5. That might sound ridiculous, but we see such studies all the time. The size of the underlying effect is very relevant to how a study should be designed.

      2. “Practically meaningful” is fine, but I’m also concerned about reality. It would be practically meaningful if unicorns and mermaids existed, but they don’t [read to bottom of linked post]. It would be practically meaningful if 20% of women were to change their voting preferences depending on the time of the month, but they don’t! It would be practically meaningful if subconscious priming with elderly-related words caused people to walk 10% more slowly, but it doesn’t! Etc.

      3. Yes, that’s why we wrote Chapter 16 in Regression and Other Stories. You might argue that the topic is important enough that we shouldn’t have delayed it to chapter 16, and you might be right about that!

      Finally, no, I don’t recommend increasing N as a first step. I think the first step is to think about low-bias, low variance measurements, then the second step is to take multiple measurements on each person. I understand what you’re saying about increasing N, but I think it’s a mistake to consider the experimental design and measurement to be fixed. Lots can typically be done in those areas!
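      To put numbers on that first point, here’s a quick R sketch (the 0.1 effect and 0.5 standard error are the hypothetical values from above): if only the “significant” replications of such a noisy study get reported, the reported estimates wildly overstate the true effect.

```r
# Hypothetical noisy-study scenario: true effect 0.1, standard error 0.5.
# Simulate many replications and look only at the significant ones.
set.seed(1)
true_effect <- 0.1
est <- rnorm(1e5, mean = true_effect, sd = 0.5)  # sampling dist. of estimates
sig <- abs(est) > 1.96 * 0.5                     # which reach p < 0.05
mean(sig)              # only about 5% of replications are significant
mean(abs(est[sig]))    # those significant estimates average far above 0.1
```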

      • Oh, I may need to clarify:

        When I said,
        “This is baffling, when statistical design of experiments is such a rich field. There are 800-page textbooks, even at the introductory level, that don’t even mention doing a power analysis.”

        I am referring to 800-page introductions to statistical design of experiments (not statistics textbooks more generally).

        And so I am not saying I consider it an ‘omission’ for statistical design of experiments texts not to cover power analysis. I don’t think determining n via a power analysis is particularly useful, when it is largely an economic decision.

        But I am saying that these texts suggest there is a great deal for statisticians to consider in assigning treatments to units in their experimental designs, even for a fixed n.

      • Also, in #2, I said ‘…the smallest value for the true effect that would be *practically* meaningful…’

        Your response makes it sound like I said to use any ‘practically meaningful’ effect size in the power calculation.

        It’s exactly plugging large effect estimates from previously published studies (like the voting-preference and priming examples you provide) into a power analysis formula that would result in the feedback loop of underpowered studies that you’re concerned with, right?

  3. (2) recognize that rejection of statistical model A should not be taken as evidence in favor of preferred alternative B, (3) use substantive theory to generalize from experimental data to the real world

    There is a tension between this and the underlying assumption being made that a study should be about estimating an “effect size”. When testing a substantive theory, you want to constrain the parameters and see if the theory can still be consistent with observation.

    All this concern about effect size is a relic of NHST. The error is designing a study with that as a goal, rather than testing your theory.

    • If your theory can’t predict a magnitude, it’s not much of a theory (Meehl). A particular magnitude may not be very generalizable, but within a given method there should be some predictive power.

      Therefore, effect sizes matter.

      In addition, I didn’t get the impression that the goal of the study is about effect sizes per se but that real effect sizes matter when designing studies about real things.

      Considering a study I just read as an example: there was a 3-page discussion of the importance of drawing as a study method when drawing had a 1% effect (95% CI: [0.02, 1.98]) on a test. The discussion did not once acknowledge the observed effect size, only that it was statistically significant.

      I don’t think an extensive discussion of the importance of such a finding was warranted. And the reason I think that, and many people reading the study rationally may feel the same, is because the effect size was so small.

      [BTW, if your comment was something to do with standardized effect sizes in some way perhaps my response here is way off base. Clarification would be appreciated. (I’m not sure why you put effect size in quotes so maybe I’m missing something there.)]

  4. Think of “effect size” as a parameter of a model. It only means something if the model means something.

    You mentioned drawing as a study method. Say you have a theory that drawing improves learning on some task. First thing to do is derive some functional form to model learning on the task.

    It depends on the task, but let’s say it can be modeled as a sequence of Bernoulli trials with probability p of success. Learning is modeled by increasing p over time. Then we ask: what would determine this value p? Well, the student can either perform a successful action s or make some kind of error e; the probability of success is then:

    p(s) = s/(s + e)

    where s + e is the entire universe of possible actions on the task. Every time the student is successful, they are more likely to repeat that behavior in the future. That is, where k is the rate of learning from success:

    s[t+1] = s[t] + k

    Similarly, they may learn from errors by becoming less likely to make the same (or a similar) mistake twice. Where c is the rate of learning from failure:

    e[t+1] = e[t] - c

    So then we work out the consequences of the model. I wrote a quick R simulation (maybe someone else wants to work out the analytical result):

    n_trials = 20
    out = matrix(nrow = n_trials, ncol = 4)
    colnames(out) = c("Result", "S", "E", "Prob_S")
    s = 10   # initial successful-action count
    e = 100  # initial error count
    k = 10   # rate of learning from success
    c = 5    # rate of learning from failure
    for (t in 1:n_trials) {
      p_s = s/(s + e)
      res = sample(0:1, 1, prob = c(1 - p_s, p_s))
      if (res == 1) {
        s = s + k
      } else {
        e = max(e - c, 0)
      }
      out[t, ] = c(res, s, e, p_s)
    }

    There are four parameters (really three, since only the ratio of the initial values of s and e matters), and if you plot p_s you will see a sigmoidal curve, which is a common shape for learning curves.

    Now we get to the effect of drawing. Is the theory that it increases the rate of learning from success, or from failure? That is, is it going to affect k or c, or both, and by how much? Maybe we don’t know; this is exploratory. So collect data with and without drawing and fit the above model. Then compare the parameter values you get, assuming this model reflects reality. And obviously there are lots of ways to expand on it, or totally different models you could think up and derive instead.
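    To connect this to data, here is one hypothetical way the fitting step could look (the likelihood below just mirrors the simulation above; with only 20 trials the estimates will be very noisy, so this is a sketch, not a recommended analysis):

```r
# Hypothetical maximum-likelihood fit for the learning model: given a
# sequence of 0/1 results, estimate the learning rates k and c.
# s0 and e0 are held fixed, since only their ratio matters.
neg_log_lik <- function(par, results, s0 = 10, e0 = 100) {
  k <- par[1]; c <- par[2]
  s <- s0; e <- e0
  nll <- 0
  for (res in results) {
    p_s <- s / (s + e)
    nll <- nll - if (res == 1) log(p_s) else log(1 - p_s)
    if (res == 1) s <- s + k else e <- max(e - c, 0)  # same updates as above
  }
  nll
}
# With simulated results (e.g., the Result column of the matrix above):
#   optim(c(1, 1), neg_log_lik, results = out[, "Result"],
#         method = "L-BFGS-B", lower = c(0.01, 0.01))$par  # estimates of (k, c)
```

    Comparing the fitted rates with and without drawing would then be the actual test of the theory.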

    • The point is that the theory (or better, the theories you try to distinguish between) needs to guide what kind of data gets collected. E.g., how can you tell whether your model has anything to do with reality from a single test result?
