“Tweeking”: The big problem is not where you think it is.

In her recent article about pizzagate, Stephanie Lee included this hilarious email from Brian Wansink, the self-styled “world-renowned eating behavior expert for over 25 years”:

OK, what grabs your attention is that last bit about “tweeking” the data to manipulate the p-value, where Wansink is proposing research misconduct (from NIH: “Falsification: Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record”).

But I want to focus on a different bit:

. . . although the stickers increase apple selection by 71% . . .

This is the type M (magnitude) error problem—familiar now to us, but not so familiar a few years ago to Brian Wansink, James Heckman, and other prolific researchers.

In short: when your data are noisy, you’ll expect to get large point estimates. Also large standard errors, but as the email above illustrates, you can play games to get the standard errors down.

My point is: an unreasonably high point estimate (in this case, “stickers increase apple selection by 71%,” a claim that’s only slightly less ridiculous than the claim that women are three times more likely to wear red or pink during days 6-14 of their monthly cycle) is not a sign of a good experiment or a strong effect—it’s a sign that whatever you’re looking for is being overwhelmed by noise.

One problem, I fear, is the lesson sometimes taught by statisticians, that we should care about “practical significance” and not just “statistical significance.” Generations of researchers have been running around, petrified that they might overinterpret an estimate of 0.003 with standard error 0.001. But this distracts them from the meaninglessness of estimates that are HUGE and noisy.

I say “huge and noisy,” not “huge or noisy,” because “huge” and “noisy” go together. It’s the high noise level that allows you to get that huge estimate in the first place.
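
To see the point concretely, here is a minimal simulation sketch (the effect size, standard error, and cutoff are made-up numbers for illustration, not Wansink’s data): a small true effect, a noisy design, and a filter that keeps only the statistically significant replications together produce reported estimates that are roughly an order of magnitude too large.

```python
# Minimal sketch of the type M (magnitude) error: small true effect + noisy design
# + significance filter => the surviving estimates are wildly exaggerated.
# All numbers below are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

true_effect = 2.0        # hypothetical true effect, in percentage points
std_error = 10.0         # standard error of each experiment's estimate (a noisy design)
n_experiments = 100_000  # number of simulated replications

# Each replication yields one unbiased but noisy estimate of the effect.
estimates = rng.normal(true_effect, std_error, n_experiments)

# Keep only the replications that reach conventional significance, |estimate/se| > 1.96.
significant = estimates[np.abs(estimates / std_error) > 1.96]

print(f"Mean of all estimates:              {estimates.mean():7.2f}")              # close to 2: unbiased
print(f"Mean |estimate| among significant:  {np.abs(significant).mean():7.2f}")    # an order of magnitude larger than the true effect
print(f"Share of replications significant:  {len(significant) / n_experiments:7.3f}")
print(f"Wrong sign among significant:       {(significant < 0).mean():7.3f}")      # type S errors
```

Nothing in the simulation is rigged; the exaggeration comes entirely from the noise plus the filter.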

P.S. I wrote this post about 6 months ago and it happens to be appearing today. Wansink is again in the news. Just a coincidence. The important lesson here is GIGO. All the preregistration in the world won’t help you if your data are too noisy.

15 thoughts on ““Tweeking”: The big problem is not where you think it is.”

  1. Sure, the “tweeking” suggestion and the “71%” magnitude error are both awful, but the bit that really jumped out at me was the “If you can get the data….”. I mean, at a bare minimum, shouldn’t he know where the data for a paper they are working to publish is? It seems bad enough when people lose their data post-publication, but is losing it pre-publication also a common occurrence?

  2. A couple things:

    First, I don’t like working with percentages. If apple sales increase from 1% to 2%, that’s a 100% increase, even though it’s only one percentage point.

    Second, it might not surprise you, but they already worked their magic to get the p value down to .06 by halving the chi-square p value, as Nick explains:
    http://steamtraen.blogspot.com/2017/02/a-different-set-of-problems-in-article.html

    Third, I found this “71%” number interesting. In the JAMA article this number isn’t mentioned, although in their figure the percent change does look around 71%.

    However, they do have another paper, also involving apples, where they trumpet a 71% increase in sales. But, as Eric Robinson notes here: https://peerj.com/preprints/3137/, it’s difficult to understand where that number comes from. In fact, in the results section it specifically states “apple sales increased from 12.7% to 14.1%”.

    I can’t help but wonder if he just had this 71% number in his head when he wrote these 2 apple papers around the same time.

    The big problem with this whole story is the misconceptions it has created. There is a legitimate difference between data diving and p-hacking, between cleaning data and data “tweeking” (whatever that is). If you have participants who didn’t follow instructions, for example, it’s perfectly reasonable to remove them. It’s an entirely different story, though, to remove participants based on whether they help your p value or not.

    • Carlos:

      It’s the usual problem with the suggestion that substantial progress in outcome Y, which is difficult to achieve with direct, careful interventions, is easy to do with some small indirect intervention. All things are possible, but these claims always make me suspicious. I felt the same way about the claim that women’s vote preferences changed by 20 percentage points during their monthly cycle. Convincing 20% of voters to change their mind in a high-profile election would be extremely difficult to do by any means, so I find it implausible that this sort of indirect treatment would do much of anything at all. I felt the same way about the claim that flashing a subliminal smiley face would cause huge shifts in attitudes toward immigration, or the claim that whether a hurricane has a boy or girl name would cause large aggregate behavioral changes. With all these things, if the evidence is there, I’d reassess, but in all these cases, the evidence wasn’t there. In Wansink’s case, it appears that, not only wasn’t the evidence there, the data weren’t there either.

      • If the outcome is “apple selection” and “Elmo stickers” are a “small indirect intervention”, I’m pretty sure one could achieve a better outcome with “direct, careful interventions” like eliminating all the other choices or bundling apples with toys.

        If I’ve looked at the right paper, it seems they reported that apple selection (vs cookie) increased from 21% to 34%. Clearly the paper had issues if it was retracted (twice) but it doesn’t seem to me that the big problem here is that those numbers are ridiculous.

        • Carlos:

          The problem is not so much that the numbers are ridiculous (though I think they are), but rather that it is a mistake to take this big number as a sign that the experiment is doing something right. If you do an experiment that is very noisy, and you apply a filter so that you only report statistically significant comparisons, then by necessity your resulting estimates will be very high. So the fact that the results look really good is not evidence of anything.

          If someone wants to say, “We did a really noisy experiment, and the data are consistent with just about any plausible effect sizes, but for other reasons we think that a huge effect size makes sense here,” then that’s fine with me. They’re laying their beliefs out there on the table for us all to see. But if they treat this large estimate as some piece of evidence that they’re really on to something, that’s a mistake.

        • Good answer to the question I was going to ask, but I looked at the comments first. The question was:
          In economics, the conventional wisdom, informally, is that noisy data (measurement error, etc.) biases your estimates down. But you seem to be saying it’s more likely to give estimates that are too big. So what’s going on?

          Your answer, I think, based on the comment I’m replying to, would be:

          Yes, it’s true that noise tends to bias your estimates downwards (too small magnitude). The problem I’m talking about is different. Suppose you have noisy data, PLUS you are reporting only your statistically significant results. In that case, since you’re looking for big t-values, you by definition are selecting for magnitudes far from zero, and your magnitudes are all biased upwards.

        • Why is it that “the conventional wisdom, informally, is that noisy data (measurement error, etc.) biases your estimates down”?

          What is the thinking here? I realize that economists don’t typically use Bayesian methods, and that they typically like unbiased estimators. If I have some high-quality data, and I add unbiased noise to it, and I use an unbiased frequentist estimator, I don’t expect any systematic bias downwards in effect size, though I certainly expect fewer significant results.

          If I use the statistical significance filter, then yes I do expect the “significant” results to be over-estimates, since the noise makes it more likely that the estimate is far from the true value, and the significance filter makes it more likely that the estimate is far from zero. Combine the two and most likely the estimate is too high.

        • Eric:

          Regarding the question about when noise biases the estimate up or down, see this paper by Eric Loken and myself. It may be that some of the confusion on this point is that various people, especially those trained in econometrics, have the impression that noise attenuates estimates, hence they get in the habit of (a) thinking that all their estimates are conservative underestimates, and (b) not thinking that measurement error is such a problem, because it just makes their estimates more conservative. These two attitudes can cause damage, I think, as in the example I’ve discussed many times of the estimated effect of a pre-school intervention in Jamaica.

        • Eric, the issue is whether the noise is in the measurement of the left-hand variable or the right-hand variable, or in the variance of the error. Andrew is referring to noise in the error term or the outcome variable. Attenuation bias comes from error in the right-hand variable: that error indeed attenuates the estimate of your coefficient, though only in single regression. In multiple regression, the bias can go either way. (A small simulation illustrating both kinds of bias appears after the comments.)

  3. I agree this reasoning is wrong, but would the opposite inference be valid? Can we say that if effect sizes are small, most likely nothing of significance is really going on?

    • Angryeyeballs:

      Once we accept that huge effects from small interventions are pretty much not out there (the piranha problem), this recalibrates what we think is a small effect.

      To put it another way, the “job” of behavior researchers should not be to discover huge effects. The problem is not just that Wansink used bad research methods, it’s also that things were set up so he was supposed to find these big effects that weren’t there.

      We discussed the general issue here recently, making the point that unrealistically huge effect sizes should not be our standard of comparison.

  4. Authorities suggest it may be much higher than 71% :)

    “Just imagine what will happen when we take our kids to the grocery store, and they see Elmo and Rosita and the other Sesame Street Muppets they love up and down the produce aisle,” said First Lady Michelle Obama today…. In her remarks, the First Lady referenced a recent study published in the Archives of Pediatrics and Adolescent Medicine conducted by researchers at Cornell University…

    —-

    Good news! As part of the agreement, the Produce Marketing Association is supposed to share an estimated sales impact with the Partnership for a Healthier America. The rigor of the PR is everything one might expect.

    https://www.ahealthieramerica.org/progress-reports/2016/partners/sesame-workshop-pma

    https://www.ahealthieramerica.org/progress-reports/2017/partners/sesame-workshop-pma

    And there is even a methodology; search for “sesame” on the page below. (No branding yet on the actual sesame seeds; now that is a confounder, eh?)

    https://www.ahealthieramerica.org/progress-reports/2016/introduction/methodology

  5. IEHO, the main problem is that the garden of forking paths is huge and overgrown with weeds. You need some strong theoretical constraints to really make progress by restricting the available parameter space and eliminating the chaff. Statistics alone is a prospecting tool.
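
Following up on the attenuation-versus-selection exchange in the comments above, here is a small simulation sketch (all numbers invented, not taken from any of the papers discussed): classical measurement error in the predictor attenuates the regression slope toward zero, while conditioning on statistical significance pushes the reported slopes back out past the true value.

```python
# Sketch of the two biases discussed in the comments (invented numbers):
# (1) measurement error in the predictor attenuates the OLS slope;
# (2) a significance filter inflates the slopes that get reported.
import numpy as np

rng = np.random.default_rng(1)

beta, n, n_reps = 0.2, 50, 20_000   # true slope, sample size per study, number of studies
noise_sd, meas_err_sd = 1.0, 1.0    # outcome noise and measurement error in x

def ols_slope_and_se(x, y):
    """Simple-regression slope and its standard error."""
    xc = x - x.mean()
    slope = (xc @ (y - y.mean())) / (xc @ xc)
    resid = y - y.mean() - slope * xc
    se = np.sqrt(resid @ resid / (len(x) - 2) / (xc @ xc))
    return slope, se

all_slopes, significant_slopes = [], []
for _ in range(n_reps):
    x = rng.normal(size=n)
    y = beta * x + rng.normal(scale=noise_sd, size=n)
    x_obs = x + rng.normal(scale=meas_err_sd, size=n)   # predictor measured with error
    slope, se = ols_slope_and_se(x_obs, y)
    all_slopes.append(slope)
    if abs(slope / se) > 1.96:                          # significance filter
        significant_slopes.append(slope)

all_slopes = np.array(all_slopes)
significant_slopes = np.array(significant_slopes)
print(f"True slope:                        {beta:.2f}")
print(f"Mean slope, all studies:           {all_slopes.mean():.2f}")              # attenuated toward zero (about 0.10 here)
print(f"Mean |slope|, significant studies: {np.abs(significant_slopes).mean():.2f}")  # overshoots the true 0.2
```

The average over all studies is biased down, exactly as the econometric intuition says; the average over the studies that clear the significance filter is biased up, which is the point of the post.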
