“Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001].”

E. J. Wagenmakers points me to a delightful bit of silliness from PPNAS, “Hunger promotes acquisition of nonfood objects,” by Alison Jing Xu, Norbert Schwarz, and Robert Wyer. It has everything we’re used to seeing in this literature: small-N, between-subject designs, comparisons of significant to non-significant, and enough researcher degrees of freedom to buy Uri Simonsohn a lighthouse on the Uruguayan Riviera.

But this was my favorite part:

Participants in study 2 (n = 77) were recruited during lunch time (between 11:30 AM and 2:00 PM) either when they were entering a campus café or when they had eaten and were about to leave. . . . Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001].

Ya think?

But seriously, folks . . .

To me, the most interesting thing about this paper is that it’s so routine, nothing special at all. Published in PPNAS? Check. Edited by bigshot psychology professor (in this case, Richard Nisbett)? Check. Statistical significance with t=1.99? Check. What could possibly go wrong???

I happened to read the article because E. J. sent it to me, but it’s not particularly bad. It’s far better than the himmicanes and hurricanes paper (which had obvious problems with data selection and analysis), or the ovulation and clothing paper (data coding problems and implausible effect sizes), or the work of Marc Hauser (who wouldn’t let people see his data), or Daryl Bem’s ESP paper (really bad work, actually I think people didn’t even realize how bad it was because they were distracted by the whole ESP thing), or the beauty and sex ratio paper (sample size literally about a factor of 100 too low to learn anything useful from the data).

I guess I’d put this “hungry lunch” paper in roughly the same category as embodied cognition or power pose: it could be true, or the opposite could be true (hunger could well reduce the desire to acquire nonfood objects; remember that saying, “You can’t have your cake and eat it too”?). This particular study is too noisy and sloppy for anything much to be learned, but their hypotheses and conclusions are not ridiculous. I still wouldn’t call this good science—“not ridiculous” is a pretty low standard—but I’ve definitely seen worse.

And that’s the point. What we have here is regular, workaday, bread-and-butter pseudoscience. An imitation of the scientific discovery process that works on its own, week after week, month after month, in laboratories around the world, chasing noise around in circles and occasionally moving forward. And, don’t get me wrong, I’m not saying all this work is completely useless. As I’ve written on occasion, even noise can be useful in jogging our brains, getting us to think outside of our usual patterns. Remember, Philip K. Dick used the I Ching when he was writing! So I can well believe that researchers can garner useful insights out of mistaken analyses of noisy data.

What do I think should be done? I think researchers should publish everything, all their data, show all their comparisons and don’t single out what happens to have p less than .05 or whatever. And I guess if you really want to do this sort of study, follow the “50 shades of gray” template and follow up each of your findings with a preregistered replication. In this case it would’ve been really easy.

50 thoughts on “Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001].”

  1. >”And that’s the point. What we have here is regular, workaday, bread-and-butter pseudoscience. An imitation of the scientific discovery process that works on its own, week after week, month after month, in laboratories around the world, chasing noise around in circles and occasionally moving forward. And, don’t get me wrong, I’m not saying all this work is completely useless. As I’ve written on occasion, even noise can be useful in jogging our brains, getting us to think outside of our usual patterns. Remember, Philip K. Dick used the I Ching when he was writing! So I can well believe that researchers can garner useful insights out of mistaken analyses of noisy data.”

    I’ve begun thinking of it (the NHST ritual) as playing the same role as meditation/prayer. It is essentially a replacement for the “spiritual/mystical stuff” which is frowned upon by many elite circles these days.

    Sure it *might* be possible for NHST to help you make progress more than flailing about randomly, just like thinking about something deeply could help. However, it should be more of a personal ritual, done in the privacy of your office/lab/home. An experience shared with, at most, a small number of family and colleagues. Once it gets organized and institutionalized and intertwined with “the State”, bad stuff starts to happen.

  2. The paper uses some measure denoted as $b$ whenever they report something like a correlation. What does it stand for? Example:

    > The likelihood of correctly identifying hunger-related words
    > increased significantly with self-reported hunger
    > (b = 0.024, SE = 0.012, t = 1.99, P = 0.05).
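
    My guess would be that b is the unstandardized regression slope, i.e., the estimated change in the outcome per one-unit increase in self-reported hunger, reported alongside its standard error, t statistic, and p-value. A minimal sketch in Python (made-up stand-in data, hypothetical variable names, assuming a simple linear-probability-style regression) of how that kind of line is typically produced:

        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(0)

        # Made-up stand-in data: self-reported hunger (0-10) and whether a
        # hunger-related word was correctly identified (0/1)
        hunger = rng.uniform(0, 10, 69)
        identified = rng.binomial(1, 0.4 + 0.02 * hunger)

        X = sm.add_constant(hunger)        # intercept plus hunger as predictor
        fit = sm.OLS(identified, X).fit()
        print(fit.summary())               # the slope's "coef" is the b, next to its SE, t, and P>|t|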

  3. The paper refers to other work to suggest plausibility of the underlying thesis:

    “In many studies, concepts rendered accessible by allegedly unrelated tasks had a profound impact on behavior. For example, using words such as “support” and “share” to construct sentences can activate the concept of cooperation, leading people to sacrifice personal benefits for the public good (5). Merely labeling a game the “Community Game” rather than the “Wall Street Game” is sufficient to elicit differential cooperation (6). Voting at a school can activate school-relevant norms, such as one should support education and care about children, which can increase support for school-funding initiatives (7). Similarly, listening to a political speech by a candidate one opposes can activate a disposition to counterargue and consequently increase counterarguing in response to an unrelated advertisement encountered later (8). We propose that internal states, such as hunger, can have similar effects.”

    What if – as seems likely – these other works are themselves only weakly supported by the evidence and analyses? Where does that leave the foundation for the current paper?

    This is a serious problem that so much published-but-weak work creates. It’s nearly impossible to keep on top of which claims are well supported and which are not. And the whole edifice keeps growing.

    • Agreed. When you do this kind of work you can (post-hoc) find a plausible sounding theoretical justification for effects of any kind.
      I also think it odd that Nisbett was the action editor on this. Schwarz (second author) worked at the University of Michigan for twenty years (up until 2013) in the same department as Nisbett and they have published together quite a bit.

      • Mark:

        You write, “I also think it odd that Nisbett was the action editor on this. Schwarz (second author) worked at the University of Michigan for twenty years (up until 2013) in the same department as Nisbett and they have published together quite a bit.”

        I thought, no joke, that this was the point of PNAS: that being a journal editor gives you the privilege to publish work that you like and work by your friends and former students. We discussed this recently in the context of Susan Fiske. It seems that if you have work that you think is generally acceptable, you can publish it in JPSP or whatever. If you have work that’s not so strong and you happen to be friends with an editor, you send it to PNAS. The only part of this that puzzles me is why journalists seem to respect PNAS articles almost without question, given this pattern of how articles get published there.

        • Mark and Andrew: Richard Nisbett, Robert Wyer, and Norbert Schwarz are all big names in social psychology. (Mark knows this, I’m sure). Alison Jing Xu is younger, having earned her PhD in marketing at the University of Illinois in 2010. Wyer (professor emeritus at UIUC) was her dissertation advisor and Schwarz was on her committee. The article in question seems to have been part of Xu’s dissertation; impossible to say for sure, because Xu blocked access on both ProQuest Theses and Dissertations and the UIUC dissertation database.

        • That is interesting and weird. I wonder if this would turn out to be another case of what O’Boyle calls the Chrysalis Effect: that is, dissertations that undergo metamorphosis (changes in sample size, dropped hypotheses, added hypotheses, etc.) on their way to becoming a journal article.

        • Mark: The abstract is available in both databases. She has three sets of studies in her dissertation; the PNAS article seems to be based on the third set. She talks about the third set of studies in a YouTube video; search for “An empty stomach can lead to an empty wallet.”

          Sometimes people block access to their dissertations until they have published the studies contained therein.

  4. OK, this is garden variety stuff. We don’t know how many experiments they did that they didn’t report, for example.

    I don’t have a problem with the research so much as the use of my tax dollars to pay for it (although in this case, it’s Canadian taxpayers).

    • I’m gonna go out on a limb here and suggest that the sign of the finding regarding hunger level and whether people are entering or exiting a restaurant is actually correct regarding the nature of people in the world*. I’m not so worried about p-hacking or non-reporting or whatever here. I’m just gonna go ahead and believe this one. I don’t think I’ve ever said that here.

      *except for the sub-group of people who are restaurant employees.

  5. I had an issue with the desire to test a specific form of decision-making, thus connecting hunger to “acquisition,” when, to give them the benefit of the doubt, they could just be seeing bits of worse decision-making by hungry people. You know, more “have a Snickers because you’re not yourself” than “I’m hungry, so I’ll buy towels.” Being too hungry to think properly is, I think, a universal experience. (I admit to having given up on the study about halfway through. Maybe because I was hungry.) My point is, even giving them full benefit of the doubt, I don’t get how or why they draw a line from hunger to any specific behavior, except maybe as a correlate of hunger’s effect on decision-making in general. And I don’t expect them to test that, because that would be harder. My guess is I could find support for the proposition that hunger makes people more powerful (no, maybe less powerful) because they adopt a power stance to get food, unless of course maybe they don’t.

  6. > What we have here is regular, workaday, bread-and-butter pseudoscience. An imitation of the scientific discovery process that works on its own, week after week, month after month, in laboratories around the world, chasing noise around in circles and occasionally moving forward.

    That’s what gets me. It’s not like we’ve run out of real science to do. Life is short. Why not engage in the real thing?

    “Participants reported being hungrier when they walked into the café than when they walked out… Participants liked the food items more before eating than after eating, but eating did not significantly affect their liking for nonfood items …”

    Someone thought it was important enough to check those things that they funded the study. Fair enough. I’m trying to wrap my head around what we’d do if we found those things not to be true. Suppose we discovered that people were hungrier when they walked out of the cafe than when they walked in. Would you suspect it to be an effect specific to that cafe or a more general phenomenon? Could it be that lots of people are hungrier after they eat than before they eat, but they keep quiet about it because they’re concerned that others will think they’re weird? If the results of the study had been different, it seems like it could have opened up a bunch of interesting new lines of research.

    • Chris:

      See my response to Raghuveer in comments. Short answer is that the main claim of the paper was “Hunger promotes acquisition of nonfood objects.” A claim for which the authors unfortunately supplied no real evidence.

      • > Yes, I fully recognize that the quote about being hungrier is not the main finding of the paper, it is just a reality check.

        Andrew,

        To the extent that it has utility, I don’t think it’s as a reality check, since the check isn’t on a model but on beliefs; the useful thing would be to calculate the covariance of hunger and acquisition of non-food objects. (Speaking of that, where’s their plot of propensity to acquire non-food objects vs. hunger?)

        As a physical scientist, I do reality checks (comparison of uncontroversial data with model prediction) to confirm that 1) I wrote down my equations correctly, and 2) if I have a model which enables predictions of different types of observables but only have validation data for one type, I can develop confidence in my ability to predict the observables I can’t currently measure because I can accurately predict the ones I can measure. The authors’ hunger assessment doesn’t meet either of those criteria.

        An example of #2 above: In atomic and molecular physics you could have a quantum mechanical model which predicts the central wavelength of an emission feature as well as the lineshape where both follow from a single set of model parameters. Your reality check might be to confirm that you predict some wavelengths correctly. If you get the wavelengths right then your reality check passes and you don’t bother to check the accuracy of lineshape predictions because they’re generated using the same parameters. (Eventually you’d probably want to check lineshapes but in the short term you could be satisfied that you’ve got things right enough to proceed.)

  7. “Participants reported being hungrier when they walked into the café than when they walked out”

    That’s just a manipulation check (i.e. that their assumption that people are hungrier beforehand is actually true). It’s not the finding itself.

  8. “Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001]”

    I don’t understand why you (Andrew) are mocking the paper for this statement.

    First of all, it’s not at all the point of the study, which looks at whether being hungry affects people’s desire for non-food items. If I made some sort of photovoltaic device and used it to measure light levels somewhere, but first placed it in the sunlight and reported that it did, in fact, return a large voltage, I would hardly expect to be mocked for reporting that sunlight is bright. In fact, one *wants* in general to see in experimental papers some sort of connection to “obvious” things, if one’s measurement methods aren’t standard.

    Second, you’ll probably reply that you’re not mocking that statement, but rather the overall content of the paper. I don’t care one way or another about the paper — though I think your criticism of it is vague, and I don’t see why this paper merits singling out — but clearly “leading off” with this headline invites the reader to misinterpret what the paper is about (see Chris G, above).

    • Raghuveer:

      Yes, I fully recognize that the quote about being hungrier is not the main finding of the paper, it is just a reality check. Reporting “p < 0.001” here is a bit over the top, but that’s just a stylistic thing. My real criticism is that the paper is, as I wrote, pseudoscience: it could be true, or the opposite could be true (hunger could well reduce the desire to acquire nonfood objects; remember that saying, “You can’t have your cake and eat it too”?). This particular study is too noisy and sloppy for anything much to be learned.

      You write, “I don’t care one way or another about the paper.” I don’t either. My problem with it is, as I wrote, that it is “regular, workaday, bread-and-butter pseudoscience. An imitation of the scientific discovery process that works on its own, week after week, month after month, in laboratories around the world, chasing noise around in circles and occasionally moving forward.” I’m bothered not because this one particular paper is so bad, but rather because this is what thought leaders such as Richard Nisbett think is science.

      You write, “I don’t see why this paper merits singling out.” I’m using it as an example. I think it’s useful here to have an example that’s not on-its-face ridiculous such as ESP or ovulation-and-voting. That’s part of my point: having a vaguely plausible hypothesis and vaguely relevant data and getting published in a top journal is not enough.

      You write that my criticism is “vague.” This is another thing. I don’t believe in the “incumbency effect” by which we should take seriously a paper just because it appeared in PPNAS. I will take published claims seriously to the extent they are supported by evidence. In this case, the evidence consists of t statistics such as 1.99, coming from small-N, between-subject designs, comparisons of significant to non-significant, and copious researcher degrees of freedom. That’s no evidence at all.

      The above criticism is “vague” only because I didn’t put in the time to list the N’s, to explain where it says that the designs are between subject, to identify which comparisons are made between significant and non-significant, and to point out researcher degrees of freedom. It should be easy for you or others to fill in these blanks if you’d like. It would’ve been easy for the PPNAS editors, too—but they’re so used to cargo-cult science, they didn’t even think to look. Same as with the editors of the ESP paper, the himmicanes paper, the power pose paper, the beauty-and-sex-ratio paper, and all the rest.

      • And the reason these papers are important is that there may be other papers just like this one, except that they appear reasonably argued to laymen, appeal to laymen’s inherent biases, seem plausible, can be used to justify laymen’s preferred policies, and quickly become highly cited and accepted dogma in the sciences. But the truth is that those papers are just using the same pseudo-scientific approach as these other papers and have only been lucky that nobody listened to their critics.

      • Andrew:

        You write, “Yes, I fully recognize that the quote about being hungrier is not the main finding of the paper, it is just a reality check.”

        I’m struggling to understand: how is this not just a non-commercial version of clickbait? Is cherry-picking a sentence from a weak paper to set up a snarky sucker-punch comment completely informed by hindsight bias functionally any different or better than promising ‘jaw dropping’ photos to get people to click through so you can bombard them with ads?

        You write, “…but they’re so used to cargo-cult science, they didn’t even think to look.”

        I completely agree with the underlying point of this comment: psychology has been and continues to be a cargo-cult science. I’m curious: What do you think would be a good way to convince someone to leave a cult? Do you think mocking, ridiculing and shaming are effective strategies for convincing cult members, or anyone for that matter, that their whole belief system and conception of reality is wrong? Do you think it would be considered ethical for an academic researcher studying a cargo cult to write a blog wherein she regularly mocks, ridicules and shames the cult members she studies?

        • Sentinel:

          I think it’s just fine to mock things that are mockable. I’ve written hundreds of papers and I’m sure I’ve written some mockable things; people can feel fine mocking them. If I have a paper with some silly jargon or whatever and somebody mocks my jargon, fine, they can go at it. I’ll respond by pointing out all the serious things in my paper and I’ll either say the jargon was a mistake or I’ll argue that it was appropriate in context. In this particular case, though, there is no serious content to the paper. Indeed, the only really well-supported thing in the paper is the claim that I am justly mocking, which is the unsurprising fact that people are less hungry after they’ve had lunch.

          Finally, I don’t consider it my job to convince people to leave a cult. If you want to do that, go at it. I consider my job to tell the truth as I see it. We have a division of labor here. I can talk about problems and solutions (and I’ve spent lots of time writing about both). You can convince cult members. I’ll do what I do as best I can, you do what you do as best you can. But if you’re serious about convincing cult members, I’d suggest you go at it, rather than wasting time commenting here!

        • Andrew:

          You’re dodging my question. I didn’t ask you what you think your job is; I asked you what you think would be a good way to convince someone to leave a cult, and whether the use of mocking, ridicule and shaming might be effective strategies for convincing people to leave a cult, which involves helping them understand that their perceptions of reality are distorted.

          But since you brought it up, I’m guessing that you do think teaching is part of your job. You’ve written a book about it and I have to imagine that it’s part of the job description for a professor at an Ivy League school. Do you generally think mocking, ridiculing and shaming are effective strategies for helping students learn difficult and complex subjects like statistics and quantitative methods? Do you think it’s just fine to mock students whose work rises to the level of mockable, as you define it?

          I work every day in the ways afforded me to help move the field forward, making the science truly rigorous and sound. Regardless of what you think your job is, the tone and approach you use here matter. It’s hard enough to get people to accept that everything they know might not be so without the sense that they are under attack from people who want to destroy their careers. Despite what you might want to believe, as soon as you take an interest in the field of psychology as a critic and a scholar, it becomes your job to care about and consider the repercussions of your words. I assume you blog because at some level you believe you have important things to say and want to influence the thoughts and beliefs of others. To then suggest that you get to say whatever you want without concern for how it influences others, just because others are free to mock you, is a copout, professionally and morally.

        • Sentinel:

          I’m not kidding when I say I believe in the division of labor. I do work by studying statistical problems, statistical errors, and potential solutions to these problems. I also contribute by drawing attention to these problems so that they’re not just a bunch of isolated incidents. That’s why I blog about these things rather than just trying to publish a series of letters to the editors of all these journals.

          But I recognize that I can’t do everything, in particular I think I have zero chance of convincing Satoshi Kanazawa, Amy Cuddy, Susan Fiske, or Ed Wegman of anything. I’ve never met any of these people (OK, I might have met Wegman once, I’m not sure); I come to these opinions based on what I’ve read from them.

          Kanazawa I kinda feel bad for: he seems like an ideologue who sees himself as a statistical expert but really doesn’t know what’s going on, and is not in contact with anyone who can straighten him out. Being trapped in an ideological bubble doesn’t help, and I fully expect he’ll continue on his path for the rest of his career.

          Cuddy irritates me a bit. I think she’s even more of a statistical naif than Kanazawa, but being at Harvard she has ample opportunity for access to statistical expertise. So even though I expect that her errors were honest mistakes, products of nothing more than zeal for scientific discovery, I fear that at this point she is actively avoiding any opportunities to clean up her act. Given this, I don’t see any real chance of persuading her of anything.

          Fiske is in a different state: she’s reached a position of eminence doing what she does, and it would take a big leap for her to admit error. I think she should do it, and I think she just might do so—ultimately, if she cares about psychology research, and I think she does, that’s the way to go—but I don’t think I’m the one to convince her, and I don’t think any silver-tongued persuasion on my part would help. On the other hand, maybe it will help if her colleagues, and ultimately she, realize that her brand name is being destroyed by PPNAS papers on himmicanes and the like. I’m not trying to “shame” her, but I do think it should be helpful for people to realize that her name on a PPNAS paper is not a signal of quality.

          Finally, Wegman copied without attribution and refuses to admit it. That’s really tacky and I think it deserves shaming, not just from me. Again, I don’t think my words will have any persuasive power on Wegman, and I’m not trying to persuade him. But I am sending the message to the larger community that people who behave like Wegman should not be taken seriously.

          Finally, to get back to the division of labor: You can feel free to devote some of your effort to convincing people to leave a cult. That’s fine. This is not something I know anything about. At this point in my life I think it makes sense for me to make contributions where I can, recognizing that others such as yourself can contribute in ways that I cannot.

        • Shravan:

          Not literally a brand name. But “brand name” in the sense that her name is supposed to be some sort of badge of quality. That’s what it means if she’s listed as the editor of a PPNAS article, right? News organizations trust the article because it is in PPNAS, and PPNAS trusts the article because Fiske vouched for it. Thus, news organizations trust the article because Fiske vouched for it. It’s her seal of approval. If she does enough of these, and people start to catch on, I’m assuming that her seal of approval will start to change its received meaning.

        • “Kanazawa …not in contact with anyone who can straighten him out. Cuddy …being at Harvard she has ample opportunity for access to statistical expertise.”
          Perhaps from where you sit the London School of Economics is a statistical desert, but there are several other universities within 20 mins walk, or if worse came to worst the train from Kings Cross to Cambridge takes 50 minutes.

        • Frederick:

          Sure, Kanazawa would have easy access to statistical expertise; I wasn’t claiming otherwise. In his case the problem is that he probably thinks he has the statistical expertise—indeed, he has the demonstrated expertise to pull statistically significant p-values out of pure noise and get the results published in real journals, and unlike, say, Richard Tol, he doesn’t even have to screw up the data to do it. Hence I’m guessing that he (Kanazawa) doesn’t see any benefit from talking with statistics experts.

  9. I analyze relatively large survey data or administrative data, so I (usually) don’t have to worry too much about sample size. This also means I’m not really familiar with sample size (i.e., power) calculations. So I have a question about the sample sizes; I hope someone can help.

    To summarize the studies (but please correct me if I’m wrong): Study 1 (N=69) shows 9 acquisition-related words, 4 hunger-related words, and 9 control words. Then they estimate the correlation between self-reported hunger and subjects’ ability to correctly identify the words. So you do have 69 respondents per word category, but because the same respondents answer questions for the other word categories, you “sort of” end up with lower power, say for 69/3=23 subjects per word category. Study 2 (N=77) has four dependent variables (desire to acquire food, desire to acquire nonfood, liking for food, and liking for nonfood). If I understand the experimental setup correctly, half the respondents (not reported, but that’s what I assume, so about 39) were asked questions when walking into the cafe (i.e., being hungry), whereas the other half were asked questions walking out (i.e., not hungry). Then they do two-sample t-tests to compare the two groups for the 4 dependent variables. In both studies 1 and 2, the analyses seem to have been carried out separately, thus not taking into account that the same respondents are used for 3 (study 1) or 4 (study 2) dependent variables.

    Assuming my interpretation is correct, Andrew or anyone else, what SHOULD the sample size have been for these two studies? Sort of, ball park estimate?

    Why am I asking this? Psychologists I know find the above setups completely reasonable, and refer to their psych stats books that state something like: after an experiment in which you’re interested only in comparing means of two groups, you need n=35 in each group to be able to identify a medium-sized effect. (I have no idea if these are the actual numbers, but it’s in that range.) To me, that sounds almost ridiculous, but then again, if you do a fine double-blind randomized controlled trial, then perhaps it’s good enough to have 35 respondents in each group? Can anyone walk me through what the sample size should have been, to be able to detect a small, medium, or large effect? That would be very helpful, because I think that *that’s* how I can convince (naive) colleagues to not do such experiments again. I don’t mean this in a bad way, but with the simple statement “small-N, between-subject designs, comparisons of significant to non-significant, and copious researcher degrees of freedom. That’s no evidence at all” you’re preaching to the Gelman choir. The readers of this blog agree with you anyway, and may have read most of your work, so they know exactly what’s wrong here already. But it doesn’t really help those who are not ‘in the know’. A step-by-step explanation of what the researchers should have done differently would be very helpful, I think. Thanks!

    PS. Not sure how study 2 recruited the hungry people, I would never have participated before my lunch ;)

    • Anonymous: You have detected that the analysis for Study 1 was not done correctly. Study 1 did three separate regressions, one for the acquisition words, one for the hunger words, and one for the control words, using the same set of participants for each regression. A more appropriate design would have used the three kinds of words in a one-way within-subjects (repeated measures) analysis of variance, with the three kinds of words serving as the three levels of the word factor.
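
      For concreteness, here is a rough sketch (Python with statsmodels, made-up long-format data, hypothetical column names) of that repeated-measures analysis:

          import numpy as np
          import pandas as pd
          from statsmodels.stats.anova import AnovaRM

          rng = np.random.default_rng(0)

          # Made-up data: one accuracy score per participant per word type, in long format
          rows = []
          for pid in range(69):
              person = rng.normal(0, 0.10)   # stable person-to-person differences
              for word_type, base in [("acquisition", 0.45), ("hunger", 0.50), ("control", 0.40)]:
                  rows.append({"participant": pid, "word_type": word_type,
                               "accuracy": base + person + rng.normal(0, 0.10)})
          df = pd.DataFrame(rows)

          # One-way within-subjects (repeated measures) ANOVA with word type as the factor
          print(AnovaRM(data=df, depvar="accuracy", subject="participant",
                        within=["word_type"]).fit())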

      A number of books provide formulas for determining power for different study designs. The classic is Jacob Cohen’s (1988) STATISTICAL POWER ANALYSIS FOR THE BEHAVIORAL SCIENCES (2nd edition). There are several newer and more comprehensive books, also.

      To do a power analysis, one must have some idea of the effect size.

      Some of the standard statistical packages can do power analysis. There is also G*Power. Karl Wuensch at East Carolina University has some power-analysis demonstrations on his website, including one for the analysis of variance that I described above. On his webpage, click on “Power Analysis for One-Way Repeated Measures ANOVA.”

      I haven’t looked beyond Study 1 yet.

      • It’s worth noting that within-subjects/repeated measures designs are more powerful than between-subjects designs. Because you have multiple measures on the same unit, you can remove within-unit error (noise) from the error term and gain some precision in your estimate. From a power perspective, the authors shot themselves in the foot by not using this or a related model. (I suspect Andrew would point out that the ideal model in this situation is hierarchical). From a p-hacking perspective, they probably tried the within-subjects model and found that the critical interaction was not significant and so went with multiple between-subjects models to get the p-value they need where they needed it.
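
        A quick simulation sketch (plain numpy, made-up numbers) of how much precision the within-subjects design buys by letting the stable person-to-person variation cancel out:

            import numpy as np

            rng = np.random.default_rng(1)
            n, subject_sd, noise_sd, effect = 40, 2.0, 1.0, 0.5   # made-up numbers
            within_se, between_se = [], []
            for _ in range(5000):
                # Within-subject: each person measured in both conditions,
                # so the person effect cancels in the paired difference
                person = rng.normal(0, subject_sd, n)
                diffs = ((person + effect + rng.normal(0, noise_sd, n))
                         - (person + rng.normal(0, noise_sd, n)))
                within_se.append(diffs.std(ddof=1) / np.sqrt(n))
                # Between-subject: two separate groups, so the person effect stays in the error
                ga = rng.normal(0, subject_sd, n) + rng.normal(0, noise_sd, n)
                gb = rng.normal(0, subject_sd, n) + effect + rng.normal(0, noise_sd, n)
                between_se.append(np.sqrt(ga.var(ddof=1) / n + gb.var(ddof=1) / n))
            print("average SE of the effect, within-subject :", round(float(np.mean(within_se)), 3))
            print("average SE of the effect, between-subject:", round(float(np.mean(between_se)), 3))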

        • Sentinel:

          I’m not sure, but I have the impression that researchers tend to prefer the between-subject design because they’re worried about bias, about poisoning the well by taking multiple measurements on the same person. I haven’t written much on this, but when I give talks I often bring up this point and explain that there is a bias-variance tradeoff here, and the practical consequence of between-subject designs is often just to create a dataset full of noise. People are often interested in this point because I think many of them have been trained to believe that the between-subject design is safer.

        • Andrew–maybe you could post something on the blog about this. I understand the tradeoffs between between-subject designs and within-subject designs, but I’d be interested in you writing more about this.

        • I think that’s generally true but it depends a lot on the area of research and the nature of the constructs you’re manipulating and measuring. Cognitive psych tends to use a lot of within-subject designs and I think that has to do with the ability to counterbalance or present stimuli in random order to avoid bias. I haven’t read the Jing Xu paper, but it sounds like they could have done their analyses within-subject; I can’t help but assume that that was their intention until they didn’t get a significant condition X word interaction. Based on my experience, it is considered best practices to analyze reactions to word stimuli within-subjects. The fact that they used and got away with multiple between-subjects analyses is actually surprising. Reviewer 2 must have slept on this paper.

        • Anonymous and Sentinel Chicken: I should have mentioned that the various tests to be done after ANOVA to compare the three groups will control for multiple comparisons (that is, adjust the p value). The three separate regressions used by the authors do not do this.

    • Anonymous,

      This is not what you are asking for, but might be better than (or at least a worthwhile supplement to) what you are asking for (since sometimes when we don’t understand things, we don’t know enough to ask the right questions).

      The bottom line is that power depends on lots of things. If, for example, you’re talking about comparing two means, then power depends on the difference between those two means, the variances of the variables involved, and the significance level used, as well as the sample size.

      You may find it helpful to play with the shinyapp at https://istats.shinyapps.io/power/ (Note that there is a tab at the top which allows you to choose either proportion or mean.)
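
      For a concrete number, here is a minimal sketch using statsmodels in Python (assuming a two-sided, two-sample t test and a standardized “medium” effect of d = 0.5, with values chosen only for illustration):

          from statsmodels.stats.power import TTestIndPower

          analysis = TTestIndPower()
          # Power with 35 participants per group at d = 0.5, alpha = 0.05 (two-sided)
          print(analysis.power(effect_size=0.5, nobs1=35, alpha=0.05))
          # Per-group sample size needed to reach 80% power for the same effect
          print(analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8))

      If those assumptions are roughly right, the “n = 35 per group” rule of thumb corresponds to only about 50–55% power for a medium effect, and something like 64 per group is needed to reach the conventional 80%.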

      Rather than giving more details here, if you care for more details, I will refer you to things I’ve already written down at:

      pp. 14 – 45 of http://www.ma.utexas.edu/users/mks/CommonMistakes2016/SSISlidesDayThree2016.pdf

      and

      http://www.ma.utexas.edu/blogs/mks/2014/07/03/beyond-the-buzz-part-vi-better-ways-of-calculating-power-and-sample-size/.

      • Another option, especially useful for Bayesian analysis, is to directly simulate fake data, run your Bayesian fit, and then see how the posterior distribution concentrates. Since Bayesian models can be quite complicated, ultimately this is the most flexible and accurate way to figure out how informative N data points of a certain kind are going to be. However, you need to stop thinking about “power to detect an effect of size…” and start thinking about “information acquired in observing N data points”. You can informally measure this information in terms of say variance reduction from prior to posterior, or directly use relative entropy reduction from prior to posterior.

        If you have a “practical level of effect size” which is reasonably well defined, you can use this technique to determine how many data points would be needed to rule out zero effect if the effect were of the “practical size”.

        Also, if you can quantify a decision analysis in terms of tradeoffs between the cost of uncertainty and the cost of collecting data, and/or the benefits of discovering an effect of size X, you can form an optimization problem to find the right sample size.

        I personally think these are all much better ways to think about how large of a study to run than classical “power” calculations.
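
        A minimal sketch of the fake-data approach (assuming a normal model with known SD and a normal prior on the effect; all numbers are made up for illustration):

            import numpy as np

            rng = np.random.default_rng(0)
            true_effect, sigma = 1.0, 2.5      # assumed "practical" effect and within-group SD
            prior_sd, n_per_group = 5.0, 40    # N(0, 5^2) prior on the effect; candidate sample size
            post_sds, excludes_zero = [], 0
            for _ in range(2000):
                hungry = rng.normal(true_effect, sigma, n_per_group)
                fed = rng.normal(0.0, sigma, n_per_group)
                se2 = 2 * sigma**2 / n_per_group             # sampling variance of the mean difference
                post_var = 1 / (1 / prior_sd**2 + 1 / se2)   # conjugate normal update, prior mean 0
                post_mean = post_var * (hungry.mean() - fed.mean()) / se2
                post_sds.append(post_var**0.5)
                if abs(post_mean) > 1.96 * post_var**0.5:    # 95% posterior interval excludes zero
                    excludes_zero += 1
            print("prior sd:", prior_sd, "-> average posterior sd:", round(float(np.mean(post_sds)), 2))
            print("share of simulations whose interval rules out zero:", excludes_zero / 2000)

        Varying n_per_group (or the assumed effect) then tells you roughly how many observations you would need before the posterior reliably rules out zero at the “practical” effect size.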

    • Thanks for your answers! Very helpful. Sentinel Chicken, thanks for the note on within- vs. between-subjects designs, I didn’t think of that. But Andrew’s reaction makes sense too, so I second Larry’s request to post something more in-depth about this. Martha, your response was super helpful (similar to Carol’s, except I don’t need to go buy Cohen’s book and can instead read a blog post), and I love the shiny app.
