Yes on design analysis, No on “power,” No on sample size calculations

Kevin Lewis points us to this paper, “Sample-Size Planning for More Accurate Statistical Power: A Method Adjusting Sample Effect Sizes for Publication Bias and Uncertainty,” by Samantha Anderson, Ken Kelley, and Scott Maxwell.

My reaction: Yes, it’s reasonable, but I have two big problems with the general approach:

1. I don’t like talk of power because it’s all about trying to get statistical significance, which I think is misguided. I think design analysis is important, and I think what Anderson et al. are doing is basically sound, but I just don’t think anyone should care about the probability of getting “p less than 0.05” in a study. Yes, I know the idea is that the low p-value is a success and can be published, but I think this attitude is a mistake, and in my recent empirical work I’ve been firmly insisting that we not treat a low p-value (or a posterior interval excluding zero) as a determinant of success.

2. I think the focus should be on better measurements, not higher sample size. The question “How large does N need to be?” is a bit of a trap: if your measurements are really noisy, I don’t know that it’s such a good idea to try to sweep your problems away with a large sample size. Increasing N is a very crude way of decreasing variance, and it does nothing about bias at all. I say this in full awareness that Jennifer and I have a chapter on sample size calculations in our book. We have it in there because people feel the need to do it, but it bothers me.
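
To make that concrete, here is a minimal sketch (made-up numbers, not from any real study): when the measurement carries a systematic bias, a larger N shrinks the uncertainty interval, but the estimate just converges, ever more confidently, to the wrong value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up numbers: the outcome is measured with noise plus a constant bias.
# Increasing N shrinks the standard error like 1/sqrt(N), but the estimate
# converges to (true effect + bias), not to the true effect.
true_effect = 0.2
bias = 0.3       # systematic measurement bias; sample size does nothing to it
noise_sd = 1.0

for n in [50, 500, 5000, 50000]:
    measurements = true_effect + bias + rng.normal(0, noise_sd, size=n)
    est = measurements.mean()
    se = measurements.std(ddof=1) / np.sqrt(n)
    print(f"N = {n:6d}: estimate = {est:.2f} +/- {1.96 * se:.2f} (truth = {true_effect})")
```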

My point in raising these issues is not to slam these three hardworking researchers who are making progress within the standard paradigm. We move forward one step at a time, after all. Their paper is fine. But I want to register my suggestion that we move forward, that we reduce the level of methodological research devoted to understanding and modeling epicycles, and move to a more direct or Copernican approach toward our study of statistics and research methods.

53 thoughts on “Yes on design analysis, No on “power,” No on sample size calculations”

  1. “But I want to register my suggestion that we move forward, that we reduce the level of methodological research devoted to understanding and modeling epicycles, and move to a more direct or Copernican approach toward our study of statistics and research methods.”

    To solve the replication crisis in the social sciences, I think we need to make two major changes, one for the theorists and one for the empirical researchers who test theories.

    First, social science theorists need to start developing theories that, like theories of the heavens, predict specific parameter values. That way, increasingly precise estimates to get p < 0.05 would be a great success…at disproving the theory. If my theory predicts x = 3, exactly, and a well-designed study can reject that hypothesis, p < 0.05 (or better yet, p < 5 sigma), then my theory is wrong, and science has moved forward a step.

    Second, the professional success of those of us who collect data to test theories should be based on our study designs and execution, which we can control, not our outcomes, which we can't (unless we p-hack).

    • Precisely. And the way to achieve your second point is by performing direct replications based on publicly available info. Researchers with a history of generating reproducible research reports should be rewarded.

      Regarding the first (theory) point, the first issue is a widespread lack of training in math/programming. Most researchers in many fields like biomed (90%+) are incapable of either coming up with actual quantitative models of what they believe is happening or of assessing those of others.

      Regarding the second (data collection) point, people seem to dislike replications for some reason. I don’t know why; when I was doing research, I loved it. It is very satisfying to check someone else’s work or have them check yours. At the end you can be much more confident that there is some phenomenon present to be explained. I honestly think people who dislike replications actually dislike science and should go do something else…

      But in the end I think the main problem is that there are just too many sacred cows that are threatened. Standards have been very lax on both fronts, so there are all sorts of “theories” explaining stuff that doesn’t deserve to be explained. There are also many vague explanations for real phenomena that fall apart upon closer scrutiny (e.g., a drug raises calcium levels in the cell, supposedly via some receptor, but on closer inspection that would require reaction rates orders of magnitude higher than ever thought possible).

      Due to the sacred cow problem, I don’t see it happening without a larger cultural change. An entirely new funding model and set of institutions is probably required. But I would love to be proven wrong on that.

      • “…people seem to dislike replications for some reason.”

        I can’t speak for everyone, but many journals want novel findings, not replication. If your goal is to get published in one of those journals, with the accompanying benefits of promotion/grant funding, I can see why replication is unappealing.

        • If your goal is to get published in one of those journals, with the accompanying benefits of promotion/grant funding

          That is fine as a secondary goal, but shouldn’t the primary goal be “do good science”? If it is required to comply with the science-preventing policy to have a career in research then I say it is time for a different career. The original point of doing the research is lost.

        • No, the primary goal of research (for most people) is to earn a living. A more comfortable living is correlated with more income, which is associated with a more “successful” research career under the rules that currently exist — publishing.

          So, one of two things needs to happen (in my opinion):
          1) Incentives for publishers and employers should change — they should be penalized for publishing rubbish that can’t be reproduced or is pure hype. Then they can give resources to publishing strong science and reward researchers who do well-designed, well-thought-out research.

          2) Research could become (again) the field of independently wealthy people who can earn a living from things other than research.

        • No, the primary goal of research (for most people) is to earn a living.

          One conclusion I came to is that (post-WW2) government funding turned academic research into a jobs program. But still, if you ask people who donate to the AHA or wherever why they are doing it, I doubt they will say it’s to help a poor PhD student have a job.

          1) Incentives for publishers and employers should change — they should be penalized for publishing rubbish that can’t be reproduced or is pure hype. Then they can give resources to publishing strong science and reward researchers who do well-designed, well-thought-out research.

          I don’t see why this can’t happen in principle, but people refuse to do this for whatever reason. So something like your #2 is happening by default.

        • Incentives come from the top where the funding agencies only look for supposedly revolutionary research ideas to support.

        • yyw said: “Incentives come from the top where the funding agencies only look for supposedly revolutionary research ideas to support.”

          My impression (maybe I’m behind the times) is that it’s “innovative” or “novel” rather than “revolutionary”. Innovative or novel is sometimes progress and sometimes baloney.

      • @Anoneuoid: “Precisely. And the way to achieve your second point is by performing direct replications based on publicly available info. Researchers with a history of generating reproducible research reports should be rewarded.”

        There are really two kinds of “replication”, and they serve different purposes. One is to reproduce some existing experiment as closely as possible. This kind can find mistakes or statistical flukes in the work to be replicated.

        The second kind is a more general replication, using, for instance, different populations (the elderly vs college students, perhaps) and possibly slightly different experimental methods. This kind of “replication” tests whether the original results might be more generally useful.

        So there’s a place for both kinds, but they are mostly smushed together in discussions.

        • I hoped to refer to the first type of replication by using the term “direct replication”. People like to do the second type instead, since there is always some excuse for why the result is different, and it can easily lead to a series of “cookie-cutter” studies, since there is a never-ending supply of “holes in our knowledge” to be filled in.

          Stuff like “the treatment…”:

          Study 1: “works in male aged rats”
          Study 2: “kind of works in female adolescent rats”,
          Study 3: “doesn’t work in male adolescent rats”,

          Then for study 4: “Wow the next study should be to check female aged rats and our lab is perfectly set up to do it, how perfect”.

          Combined with standard p-hacking (use a couple different definitions of “treatment works”) you can quite easily get an elaborate theory that explains why it works sometimes but not others (and an entire career) out of nothing.

          Meehl described this perfectly over 50 years ago:

          this ad hoc challenging of auxiliary hypotheses is repeated in the course of a series of related experiments, in which the auxiliary hypothesis involved in Experiment 1 (and challenged ad hoc in order to avoid the latter’s modus tollens impact on the theory) becomes the focus of interest in Experiment 2, which in turn utilizes further plausible but easily challenged auxiliary hypotheses, and so forth. In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of “an integrated research program,” without ever once refuting or corroborating so much as a single strand of the network. Some of the more horrible examples of this process would require the combined analytic and reconstructive efforts of Carnap, Hempel, and Popper to unscramble the logical relationships of theories and hypotheses to evidence. Meanwhile our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the “exactitude” of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship. In terms of his contribution to the enduring body of psychological knowledge, he has done hardly anything. His true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring.

          https://meehl.dl.umn.edu/sites/g/files/pua1696/f/074theorytestingparadox.pdf

    • I think I get the idea about predicting a specific value, but it’s not obvious to me that this is a good idea – mainly because it’s plainly insane to think you can exactly guess a parameter in most social science contexts. I’m not actually sure that there’s a good way to get good approximation unless you’re trying to get an idea of the parameter in one specific context – say, something like how many likes a Facebook post needs for people to feel that they’re popular. Estimating that parameter and generalizing to non-Facebook situations, though, would almost certainly end in failure even if the general principle of “more votes of approval = more perceived popularity” held in all contexts.

      Even in just the context of Facebook, though, that parameter would drift over time. In a way, you could say, “That’s fine, just model the factors that cause it to shift.” And people could of course test things and find variables that seem to have an effect. Finding the actual equation, though, seems insanely unlikely, since this will probably be a mixture of a bunch of different equations, each one of which on its own is insanely messy and inelegant. When you make a more specific model it can be more easily proven incorrect, but when you go from a less specific model to a more specific model you can move your theory from actually being correct to being incorrect. Perhaps that’s okay – if I understand the situation right physicists are pretty sure their own models are incorrect, and they just use them because, well, they’re good enough for most purposes and there’s nothing better around. But I still have a hard time feeling particularly optimistic whenever anyone suggests specific equations or specific parameter predictions.

      What do you say about this? Is there something I’m getting wrong? Are there any examples of what you think are good research using the paradigm you’re suggesting?

      • Austin,

        First of all, yes, the idea is to make our social science theories much more amenable to genuine *tests*. Currently, our theories only predict that x > 0, so we have a 50% chance of passing that test even if our theory is dead wrong (I’m obviously simplifying for the sake of argument). Increasingly precise estimates of x should put our theories at risk, not ensure that they are left standing no matter what. Being able to confidently reject a theory as our measurements improve is a good thing.
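
        A minimal simulation of that point (illustrative numbers, nothing to do with the study described below): when the true effect is zero, the directional prediction x > 0 comes out “right” about half the time no matter how big the study, while an exact point prediction gets rejected more and more reliably as precision improves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative numbers only: the theory is dead wrong (true effect = 0).
true_effect = 0.0
point_prediction = 0.3   # a hypothetical exact prediction, x = 0.3
n_sims = 5000

for n in [25, 100, 400, 1600]:
    right_direction = 0
    point_rejected = 0
    for _ in range(n_sims):
        x = rng.normal(true_effect, 1.0, size=n)
        est, se = x.mean(), x.std(ddof=1) / np.sqrt(n)
        right_direction += est > 0                                  # "x > 0" comes out right
        point_rejected += abs(est - point_prediction) / se > 1.96   # exact prediction rejected
    print(f"n = {n:4d}: directional prediction 'confirmed' {right_direction / n_sims:.0%}, "
          f"point prediction rejected {point_rejected / n_sims:.0%}")
```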

        Second, I agree that it might not be easy to formulate social science theories to predict specific parameter values, but how many people have tried? My grad student and I formulated a theory of behavior in the ultimatum game (UG). Basically, we proposed that participants infer the situation, and then make a “fair” offer given the situation. We manipulated the situation so that fair offers should deviate substantially from those in the standard UG. By the usual standards of social science, our experiment was a resounding success. Offers did deviate substantially from the standard UG, in the directions we predicted (clustering near 0 and 1, instead of near 0.50 as in the standard ultimatum game). They did match “fair” offers for that situation. And by usual social science standards, we had huge effect sizes (d ~ 2). But our theory predicted that participants would infer the fair offer, and then make *exactly* that offer, i.e., the difference between stated and actual offers should be exactly 0. Although the offers of many participants did conform to our theory — their actual offers exactly matched their stated fair offer — many didn’t. Our theory is wrong, and that’s OK! I write a bit more about it here:

        https://grasshoppermouse.github.io/2018/01/16/the-replication-crisis-our-statistics-don-t-suck-our-theories-do/

      • Who cares about “how many likes a Facebook post needs for people to feel that they’re popular,” or “the ultimatum game”? If you’re going to illustrate with an example, please try to find one that is worth studying rather than a toy example.

        • Martha, were you being facetious? You do realize that Nobel prizes have been awarded for work on econ games like the ultimatum game, right? Here’s a quote from Daniel Kahneman, from the Nobel Prize website:

          “One question that arose during this research was whether people would be willing to pay something to punish another agent who treated them “unfairly”, and in some circumstances would share a windfall with a stranger in an effort to be “fair”. We decided to investigate these ideas using experiments for real stakes. The games that we invented for this purpose have become known as the ultimatum game and the dictator game. Alas, while writing up our second paper on fairness (Kahneman, Knetsch and Thaler, 1986b) we learned that we had been scooped on the ultimatum game by Werner Guth and his colleagues, who had published experiments using the same design a few years earlier. I remember being quite crestfallen when I learned this. I would have been even more depressed if I had known how important the ultimatum game would eventually become.”

          https://www.nobelprize.org/prizes/economic-sciences/2002/kahneman/biographical/

        • I wasn’t trying to be facetious, but I can see how my comment might have sounded facetious to you. Maybe a better way to have put it would have been something like, “These examples seem like toy examples to me.” I do find it amazing that a Nobel prize could be awarded for studying something like the ultimatum game. Possibly I am missing something; but I come to this with a strong skepticism of a lot that is considered important in some of the social sciences.

        • Martha:

          I’m in the poli sci dept, and colleagues are always talking about experiments where they go to some African village and run the dictator game. I respect these colleagues a lot—but I too have a queasy feeling about the whole thing. It all just seems like a gimmick, and I’m skeptical of claims that this is a good measure of cooperativity or whatever.

        • Andrew, Martha, I wrote the Anonymous comment, but for some reason my name didn’t come through.

          Anyways, Martha asked, “who cares”? And the answer is, a lot of people do. We are also skeptical of standard interpretations of the UG, DG, and other econ games. But instead of grumbling about them deep in the comment section of some blog, we designed an experiment to test our critique, got a grant to run it, ran it again to make sure it replicated, and then published it as a registered report. We show that with simple manipulations we can get offers near 0 or near 1, i.e., dramatically different than almost all other studies of the UG (which almost always find offers to be around 0.4-0.5). To quote from our abstract, our result “provides further evidence that it is difficult to draw firm conclusions about economic decision-making from decontextualized games.”

        • Ed:

          Fair enough. I’m just giving my impression, which is perhaps suitable for deep in a blog discussion thread but not in any more prominent place. I’ve never said that I think the dictator game etc. is the wrong thing to do, just that it seems iffy to me, that’s all.

        • Ed,

          Given that a lot of people do seem to believe that these games are reasonable models of human behavior in realistic settings, I do think that the research you describe is worthwhile. Thanks for the further details.

        • I made a toy example because it was sufficient to illustrate the idea and I’m pretty busy. But on the subject of examples that people care about, this is one of the things that I see being a problem with making the questions you ask specific enough for a parameter estimate.

          For particular studies where practice and policy are mixed – for an example, if you’ve gotten approved for a large scale test of an educational intervention that the government is considering applying universally – parameter estimation makes great sense. Parameter testing doesn’t – saying, “Whelp, whatever the effect was, I can say that my experimental and control groups weren’t drawn from two normal mixture populations which had different means with an effect size of d = .43”, is completely useless and uninformative. Even saying “the effect was some positive number” is more useful than that. Of course you can do your specific test and make a parameter estimate in the same study, but I just don’t see how an

          And it would be worse for basic research. When interpreting that, I don’t think very many people are going to care about whether your effect size predictions are exactly right there, even if they’d like a general idea of how big the effect is. Probably in most situations, they’re going to want to know if there is an effect and in what direction. An idea of whether it will probably generalize or not is also desirable. But basic research on how big the effect of receiving a gift is on helping behavior, for Stanford students, in 2006, when it’s sunny out, is not a research question people want to know the answer to. They want to know if it increased helping behavior there (ideally by a good amount), and if it can generally be relied upon to do so elsewhere. But I think that’s more easily achieved through replication than through exact modeling.

        • I think at the end of that second paragraph I was going to say that I don’t see how the test would really add to it, since your model is obscenely unlikely to be precisely correct anyhow.

        • Two additions to the above: first, to be clear, if there is some situation where the model testing approach would be productive, I would certainly like to hear it. Anything that improves understanding of a productive paradigm for research is welcome.

          But also, in retrospect it appears that you (Martha) were advocating for parameter estimation, not a model-and-test approach. I can see that working much more easily, at least for some forms of research.

  2. “…social science theorists need to start developing theories that, like theories of the heavens, predict specific parameter values.”

    It’s hard to argue with this. Unfortunately, my experience with biomedical researchers and social scientists is that the notion of a “parameter” is completely alien. In many minds there is data and only data, and effects are “found” or “not found”, usually via standard methods like ANOVA.

    It has taken me years to recognize that disconnect between the way statisticians view things and the way that many applied researchers view things.

    • Good point. I suspect that this may reflect a prevalence of/preference for “dichotomous thinking” rather than “continuous thinking”.

    • I comment in more detail under Ed Hagen’s comment, but how do you propose people come up with specific parameters in social science? I don’t feel particularly optimistic about exactly guessing a correct parameter, and in social science I’d generally expect most parameters to be variables that change from situation to situation, rather than constants.

      As an aside, without measurement at the interval scale or above I suspect parameters won’t mean much. Though maybe “approximately interval scale” measurements could enable approximate parameter estimates.

      At any rate, what do you think about the chances of this working? How would one go about doing it? Are there any examples of research in the social sciences you think did a good job of this?

      • “in social science I’d generally expect most parameters to be variables that change from situation to situation, rather than constants.”

        Yes — and in biology and medicine also, and in many engineering and physics situations. So, for example, we may believe a particular random variable has a normal distribution — but there are many possible normal distributions, each specified by the two parameters mean and standard deviation. If we are talking about modeling a phenomenon with a normal distribution, the parameters will indeed vary from situation to situation, even if the normal model is appropriate for each situation. Estimating a parameter can only be done when we have enough specificity of the situation we are studying.
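
        A small sketch of what that looks like (hypothetical numbers): the same normal model holds in every situation, but each situation has its own mean and standard deviation, so a single pooled estimate papers over exactly the variation you would want to study.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: the same normal model applies in every situation, but
# each situation has its own mean and sd. A single pooled estimate hides the
# situation-to-situation variation that actually matters.
n_situations, n_per_situation = 8, 30
situation_means = rng.normal(0.5, 0.4, size=n_situations)
situation_sds = rng.uniform(0.5, 1.5, size=n_situations)

data = [rng.normal(m, s, size=n_per_situation)
        for m, s in zip(situation_means, situation_sds)]

pooled = np.concatenate(data)
print(f"pooled estimate: mean = {pooled.mean():.2f}, sd = {pooled.std(ddof=1):.2f}")
for i, d in enumerate(data):
    print(f"situation {i}: mean = {d.mean():5.2f}, sd = {d.std(ddof=1):.2f}")
```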

    • >> “…social science theorists need to start developing theories that, like theories of the heavens, predict specific parameter values.”

      >> It’s hard to argue with this.

      I’ll try to argue with this!

      Isn’t the problem that most of the quantities the social sciences are trying to measure are, in fact, unquantifiable? Maybe the social sciences should stop pretending they are physics and do social science instead?

        • Well, the cognitive stuff seems pretty important for predicting/explaining behavior, so there are good reasons for trying to measure it.

      • Well, I can think of a few examples, but that might not really be the issue here. I think the problem is that well-meaning statisticians insist that parameter estimation should replace significance testing in applied statistics. However, many applied researchers don’t think in terms of parameters but they *do* find value in significance testing, justified or not. I fear that no amount of drum beating by well-meaning statisticians will change things until applied scientists find value in parameter estimation. I want to be wrong, but I don’t think that I am.

        As for an example from my days as a student, we would predict the optimal diet for hunter-gatherers using theories from evolutionary biology and micro-economics. The predictions were very specific (include/do not include a particular prey animal in the diet), which had implications for conservation biology, human evolution, and cultural anthropology. Actually collecting the necessary data and making these predictions was very hard, as you can imagine. I don’t know if that approach ever got traction.

  3. I really like the ideas in the retro paper and the idea of not chasing p-values and power. Sample size is a big issue practically. Studies aren’t cheap, and my boss wants to know how much he has to spend. Recently, rather than doing traditional power analysis, I have been asking the question, “What is the minimum clinically interesting/relevant effect that you might expect to see out of this intervention?” Once I find that answer, I go about simulating data based on prior information of parameters and variance, program in the minimum clinically relevant effect size, and then run the Bayesian model that I would have proposed to analyze this data. I then program this to do it hundreds of times, altering the sample size, until I find the sample size where my 95% credible interval doesn’t include zero 80% of the time out of a large number of simulations.
    Perhaps this is sort of a weird blend of power analysis via simulation and Bayesian models, but it does seem like a way to find a sample size where, if the population effect is of the given relevant size and my programmed variance is reasonable (assumptions, all), then I have a reasonable chance of ‘finding’ that effect without under- or over-spending $, where ‘finding’ is defined as the ability to discover the population effect with a low enough error to be able to publish it.
    If there is a better way, that is practical in the real world, then I’m all ears.
    Also, I’m a big fan of actually trying to simulate the study dataset beforehand (many times!), because it helps give an idea of potential pitfalls. Helps with design.
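
    A bare-bones sketch of this kind of loop might look like the following. It is only a stand-in: a two-group comparison with a flat prior, so the 95% posterior interval for the difference is approximated by a normal interval around the observed difference; the effect size, SD, and 80% target are placeholders, and a real version would swap in the actual Bayesian model, priors, and design.

```python
import numpy as np

rng = np.random.default_rng(2023)

def prop_intervals_excluding_zero(n_per_group, effect=0.5, sd=1.0, n_sims=500):
    """Simulate a two-group study n_sims times and return the proportion of
    simulations in which the central 95% interval for the group difference
    excludes zero. A flat prior + normal approximation stands in for the
    full Bayesian model; effect and sd are placeholder assumptions."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, sd, size=n_per_group)
        treated = rng.normal(effect, sd, size=n_per_group)
        diff = treated.mean() - control.mean()
        se = np.sqrt(control.var(ddof=1) / n_per_group +
                     treated.var(ddof=1) / n_per_group)
        lo, hi = diff - 1.96 * se, diff + 1.96 * se
        hits += (lo > 0) or (hi < 0)
    return hits / n_sims

# Walk the sample size up until ~80% of simulated intervals exclude zero.
for n in range(20, 200, 10):
    prop = prop_intervals_excluding_zero(n)
    if prop >= 0.80:
        print(f"n per group = {n}: {prop:.0%} of simulated intervals exclude zero")
        break
```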

    • find the sample size where my 95% credible interval doesn’t include zero

      The procedure you describe is great for achieving that goal. But why do you care so much whether the interval includes zero? Unless a theory you are interested in predicted that the value should be near zero, who cares? The parameter could be “not zero” for many reasons.

  4. “I think the focus should be on better measurements, not higher sample size.”

    I couldn’t agree more! But that’s a hard sell in a field (psychology) where some appear to take pride in measuring one’s entire personality in a few minutes and with a handful of items. And what doesn’t fit into this approach is more or less ignored.

    • “But that’s a hard sell in a field (psychology) where some appear to take pride in measuring one’s entire personality in a few minutes and with a handful of items. And what doesn’t fit into this approach is more or less ignored.”

      Sad, but probably true.

      • Somewhat sarcastic remark. Even worse than a psychologist with a dodgy measure is a social scientist with no idea of the mechanics of testing being let loose. I remember reading a paper by a health economist who suggested that it would be much more “efficient” to cut down a sense-of-well-being questionnaire from 37 items to 3. Think of the time savings alone!

        • As a counter argument, I always ask someone who proposes a 37-item scale whether they can point to any one of those items that they are absolutely certain is valid. It’s amazing how often those 37-items scales come about in the hope that many dodgy items will somehow average out into a solid scale score.

        • Plus there’s the risk that people will get bored if the questionnaire is too long and stop paying attention to the questions, or (especially in the case of online surveys) the possibility that people will decide it’s not worth their time and drop out of your sample entirely. Though I tend to get pretty nervous about questionnaires that have like, two questions per construct.

        • A big “gotcha” with the sort of scales my colleagues work with is deciding up front that you need at least some certain number (usually a half dozen or so) of items per construct. This can lead to including items in one scale that could almost as plausibly have been used in a scale for some related construct. It’s all too easy to end up including items with no discriminating power, arbitrarily allocated to whatever scale was short on items. That creates measures for supposedly distinct constructs that have spurious correlations due to those in-betweener items.

          Not having enough items is definitely a concern. But if you can’t come up with enough really, really on-point items then best to take your medicine and go with a two- or three-item scale.
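
          A toy illustration of the in-betweener problem (all numbers hypothetical): two constructs that are truly uncorrelated end up with noticeably correlated scale scores once each scale is padded with a couple of items that load on both.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical example: constructs A and B are truly uncorrelated.
n = 2000
A = rng.normal(size=n)
B = rng.normal(size=n)

def item(loading_A, loading_B, noise=1.0):
    """One questionnaire item: a noisy mix of the two constructs."""
    return loading_A * A + loading_B * B + rng.normal(0, noise, size=n)

# Clean scales: six items each, loading only on their own construct.
scale_A_clean = np.mean([item(0.8, 0.0) for _ in range(6)], axis=0)
scale_B_clean = np.mean([item(0.0, 0.8) for _ in range(6)], axis=0)

# Padded scales: four clean items plus two "in-betweener" items that load on both.
scale_A_padded = np.mean([item(0.8, 0.0) for _ in range(4)] +
                         [item(0.5, 0.5) for _ in range(2)], axis=0)
scale_B_padded = np.mean([item(0.0, 0.8) for _ in range(4)] +
                         [item(0.5, 0.5) for _ in range(2)], axis=0)

print("correlation, clean scales: ", round(np.corrcoef(scale_A_clean, scale_B_clean)[0, 1], 2))
print("correlation, padded scales:", round(np.corrcoef(scale_A_padded, scale_B_padded)[0, 1], 2))
```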

        • ” It’s all too easy to end up including items with no discriminating power, arbitrarily allocated to whatever scale was short on items. That creates measures for supposedly distinct constructs that have spurious correlations due to those in-betweener items.”

          Sad, but very believable.

        • But even if you thought you had a sufficient number of relevant items, what I am missing so far in psychology is a rigorous examination of how a person’s response to the items on a specific scale comes about. What is the process behind the response? To what extent does it even reflect to a significant extent the construct I want to tap (e.g., extraversion, well-being, self-esteem) or indexes other, half-related or even completely unrelated things such as availability heuristics, properties of memory retrieval processes, pseudoneglect, and so on? That’s where better measurement — and more well-validated and understood measurement — should come into play. And it simply does too rarely.

        • Oliver said,
          “To what extent does it [the response] even reflect to a significant extent the construct I want to tap (e.g., extraversion, well-being, self-esteem)”

          The problem here seems to be at least partly that the construct YOU want to tap might not be the same as the construct (with the same name) someone else wants to tap. In other words, you have to define the construct before you can decide whether or not the response “taps” it — yet (in my experience, at least) definitions in psychology are often given in terms of the responses the construct generates.

        • I think we are thinking about 2 different beasts here. The type of instrument I was thinking of would likely have been developed and the items refined over, perhaps, years of research and analysis. And it would generally be used in the type of situation where boredom would not be the primary worry.

          I had not even considered an on-line survey. In fact, it would be almost certainly unethical to use it in such a way.

          On the other hand, I agree that I would be really, really nervous about any survey that depended on two items per construct. One does carry out normal diagnostics on the test, does one not?

        • Let’s say there’s a very specific construct that is crucial to your theory and there are two items of which you have little doubt as to their validity. If no other items seem specific enough to the construct (i.e. they indicate that construct but not the others in your theory) then what is the proper response? Make arbitrary permutations of wording from the only two good items you have? Go ahead and use items that are not entirely specific to the construct? Just leave out the construct and deem it unmeasurable?

          Or is it best to simply have a two-item scale and not be able to use the type of correlation-based measures that we usually apply to multi-item scales?

        • Absolute validity? Probably impossible but any decent instrument should have had the really dodgy items tossed early in the construction phase.

          It may be that we just are looking at different types of instruments. I come from a psych background where we could spend years developing, testing and analyzing even a fairly basic instrument before using it seriously.

          Not that all psychologists do, unfortunately.

  5. I am totally on board with thinking the usual practice of “power calculations” simply fails to make sense in many cases. But I’m at a loss to think of what sort of procedure is well principled and leads to a conclusion other than “Get as big a sample as your funding agency will allow you to budget”.

    If we have to speculate about the degree of noise in our measures AND we are avoiding any reference to NHST AND we don’t want to incorporate anything smacking of binary decision rules then what exactly do we know, a priori, that might inform an estimate of required sample size? Maybe if I had any Bayesian background at all in my training I wouldn’t be asking this but how does someone like Andrew go about deciding how many participants should be enrolled in a clinical trial or an evaluation of some randomized, controlled intervention study?

  6. Andrew, it seems somewhat inconsistent to say that “the low p-value is a success and can be published, but I think this attitude is a mistake” just before saying that your book includes power analysis “because people feel the need to do it.” Perhaps if you left it out people would feel less inclined to do it. Or maybe they wouldn’t buy/use the book, which is akin to an editor not publishing a paper because p-values are not provided or interpreted conventionally. In both cases, concessions are being made by the author to meet the demands of the audience for tools that are frequently abused. Of course, you do criticize the “attitude” toward using p-values, which implies the odd conclusion that the authors of this paper are less culpable if they don’t genuinely believe what they wrote, i.e., so long as they feel as uneasy about the paper as you do about your chapter and eschew these practices in their future work. I’m not trying to be snarky; it just stood out to me that your usual, admirable practice of drawing a sharp, objective line on principle gets a little squishy there for a moment. I should also acknowledge that a book like yours is a survey of the field and not a set of recommendations for their own sake, and not having read it, for all I know you may include a disclaimer in the chapter about your reservations, which would be consistent with both your principles and your reservations.

    • Michael:

      It’s a struggle. Chapter 20 in our book, Data Analysis Using Regression and Multilevel/Hierarchical Models, was pretty much traditional power analysis (although with a more realistic flavor than the usual textbook treatment, I hope). In revising this chapter for Regression and Other Stories, I’ve changed things and also added a bunch of material explaining the problems with naive power analysis. I hope that will do the trick. You’ll have to take a look at it when it comes out!

  7. Andrew:

    Many times I’ve heard you say people should improve the quality of their measurements. Have you considered that people may be quite close to the best quality of measurement they can achieve?
    Have you thought about the degree of measurement improvement that might actually be achievable?
    And what that would mean for the quality of statistical inferences?

    Competent psychophysicists are getting measurements that are close to the best they can reasonably achieve. Equipment that costs ten times more might only reduce error by one thousandth. It’s the variation between people that gets ya.

    • Dan:

      Thanks for the comment, and you make a good point.

      There are subfields where measurement is taken seriously. You mention psychophysics; other examples include psychometrics and old-fashioned physics and chemistry. In those fields, I agree that there can be diminishing returns from improved measurement.

      What I was talking about are the many, many fields of social research where measurement is sloppy and noisy. I think the source of much of this is a statistical ideology according to which measurement doesn’t really matter.

      The reasoning, I think, goes like this:

      1. Measurement has bias and variance.

      2. If you’re doing a randomized experiment, you don’t need to worry about bias because it cancels out in the two groups.

      3. Variance matters because if your variance is higher, your standard errors will be higher and so you’ll be less likely to achieve statistical significance.

      4. If your findings are statistically significant, then retroactively you can say that your standard error was not too high, hence measurement variance did not materially affect your results.

      5. Another concern is that you were not measuring quite what you thought you were measuring. But that’s ok because you’ve still discovered something. If you claimed that Y is predicted from X, but you didn’t actually measure X and were actually measuring Z, then you just change the interpretation of your finding: you’ve now discovered that Y is predicted from Z, and you still have a finding.

      Put the above 5 steps together and you can conclude that as long as you achieve statistical significance from a randomized experiment, you don’t have to worry about measurement. And, indeed, I’ve seen lots and lots of papers in top journals, written by respected researchers, that don’t seem to take measurement at all seriously (again, with exceptions, especially in fields such as psychometrics that are particularly focused on measurement).

      I’ve never seen steps 1-5 listed explicitly in the above form, but it’s my impression that this is the implicit reasoning that allows many, many researchers to go about their work without concern about measurement error. Their reasoning is, I think, that if measurement error were a problem, it would show up in the form of big standard errors. So when standard errors are big and results are not statistically significant, then they might start to worry about measurement error. But not before.

      I think the apparent syllogism of steps 1-5 above is wrong. As Eric Loken and I have discussed, when you have noisy data, a statistically significant finding doesn’t tell you much. The fact that a result is statistically significant does not imply that measurement error was low enough for the finding to be trusted.
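
      Here is a toy version of that point (illustrative numbers, not from any particular study): with a small true effect and noisy measurements, the estimates that happen to clear the significance filter are, on average, several times larger than the true effect.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy numbers: a small true effect, measured noisily, in a modest study.
true_effect = 0.1
sd = 1.0        # noisy measurement
n = 50          # per group
n_sims = 10000

estimates, significant = [], []
for _ in range(n_sims):
    control = rng.normal(0.0, sd, size=n)
    treated = rng.normal(true_effect, sd, size=n)
    diff = treated.mean() - control.mean()
    se = np.sqrt(control.var(ddof=1) / n + treated.var(ddof=1) / n)
    estimates.append(diff)
    significant.append(abs(diff / se) > 1.96)

estimates, significant = np.array(estimates), np.array(significant)
print("true effect:                     ", true_effect)
print("mean estimate, all simulations:  ", round(float(estimates.mean()), 3))
print("mean estimate, significant only: ", round(float(estimates[significant].mean()), 3))
# Conditioning on statistical significance exaggerates the estimate severalfold:
# significance does not certify that measurement noise was low enough to trust it.
```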

      If all of social and behavioral science were like psychometrics and psychophysics, I’d still have a lot to talk about, but I don’t think I’d need to talk so much about measurement error.

    • P.S. When I’ve given this speech to psychometricians and other researchers who are aware of the importance of measurements, they seem to be happy that I’m raising the point. Researchers who are interested in measurement are often bothered that lots of other researchers in neighboring fields don’t seem to care about measurement, and they are interested in the efforts of my collaborators and myself to untangle exactly where the reasoning of steps 1-5 above goes wrong.

      When I speak on this stuff, people don’t step up and tell me that the importance of measurement is obvious and everybody knows it. No, they tell me that measurement is important, and they’re frustrated that so many researchers don’t seem to realize this.

      That said, there’s selection bias. It could well be that the people who disagree with my message either don’t show up to my talks or don’t speak up in the question period!
