Here’s something I wrote in the context of one of those “power = .06” studies:

My criticism of the ovulation-and-voting study is ultimately quantitative. Their effect size is tiny and their measurement error is huge.

My best analogy is that they are trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down.

At some point, a set of measurements is so noisy that biases in selection and interpretation overwhelm any signal and, indeed, nothing useful can be learned from them. I assume that the underlying effect size in this case is not zero—if we were to look carefully, we would find some differences in political attitude at different times of the month for women, also different days of the week for men and for women, and different hours of the day, and I expect all these differences would interact with everything—not just marital status but also age, education, political attitudes, number of children, size of tax bill, etc etc. There’s an endless number of small effects, positive and negative, bubbling around.

I like the weighing-a-feather-while-the-kangaroo-is-jumping analogy. It includes measurement accuracy and also the idea that there are huge biases that are larger than the size of the main effect.

Tweeting this I had to prune “they are trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down” and was about to remove the “resting loosely” part. But I think it’s key, right? If you know the weight of the kangaroo (minus the feather) and the physical mechanism of the scale and (I guess) something about the jumping, you might be able (with very sophisticated physics) to get a signal on the feather. But if it’s loose in the pouch you’ve lost contact with reality, right?

If you’re curious, I went with this.

Thomas:

Hmmm, maybe the “resting loosely” was overkill. My point was that the biases of measurement are both unknown (of course, because if you had a known bias you’d correct for it) and variable (over time and across populations). No single number, whether it be “power” or “signal to noise ratio” will capture that: these standard measures are typically constructed assuming bias = 0.

Andrew:

Is it possible to put numbers on this? e.g. What was the SNR or power for the ovulation study or the himmicanes study?

And what are the corresponding numbers for a typical “good” Psych study? Or maybe even one of your own studies?

PS. Maybe there’s some other metric in addition to my naive SNR / power that you’d like to add?

“Example: Menstrual Cycle and Political Attitudes….”

“Running this through our retrodesign() function, setting the true effect size to 2% and the standard error of measurement to 8.1%, the power comes out to 0.06, the Type S error probability is 24%, and the expected exaggeration factor is 9.7.” (Gelman & Carlin, 2014)

http://www.stat.columbia.edu/~gelman/research/published/PPS551642_REV2.pdf
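
For readers who want to reproduce those numbers, here is a minimal Python sketch of the same kind of calculation. It is not the paper's R function: it assumes a normally distributed estimate (the infinite-degrees-of-freedom case) and swaps the paper's simulation step for a closed-form truncated-normal mean, so the exaggeration ratio can differ slightly from the simulated value. The Python signature is my own choice for illustration.

```python
from statistics import NormalDist


def retrodesign(true_effect, se, alpha=0.05):
    """Return (power, Type S error rate, exaggeration ratio).

    Assumes the estimate is normal with mean true_effect and sd se,
    and that "significant" means |estimate| > z * se (two-sided test).
    """
    if true_effect <= 0 or se <= 0:
        raise ValueError("true_effect and se must be positive")
    std = NormalDist()
    z = std.inv_cdf(1 - alpha / 2)   # critical value, about 1.96 for alpha = 0.05
    a = true_effect / se             # true effect in standard-error units
    p_hi = 1 - std.cdf(z - a)        # significant, correct sign
    p_lo = std.cdf(-z - a)           # significant, wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power
    # E[|estimate|; significant] in SE units, from truncated-normal means
    mean_abs = a * (p_hi - p_lo) + std.pdf(z - a) + std.pdf(-z - a)
    exaggeration = mean_abs / power / a
    return power, type_s, exaggeration


if __name__ == "__main__":
    power, type_s, exaggeration = retrodesign(true_effect=2, se=8.1)
    print(f"power={power:.2f}  type_s={type_s:.2f}  exaggeration={exaggeration:.1f}")
```

With a true effect of 2 and a standard error of 8.1, this gives power near 0.06 and a Type S rate near 24%, in line with the quoted figures, and an exaggeration ratio a little under 10.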

I would add that I think maybe a “power against curve” might be an interesting metric. Meaning: Under the experimental design and alternative hypothesis of an effect size X, what is the probability that the null is rejected, and given rejecting the null what is the probability of a type-M error of at least size Y? This is a lot like what Gelman and Carlin do with Type-S errors and power, but here you graph that out over a bunch of potential effect sizes and error tolerances, and maybe indicate in some manner regions where you think the true effect size is likely to be (for whatever reasons you have for thinking that). Then you could see that under the assumption of a small true effect, the probability of a massive over-estimate given rejecting the null is .6, or .8, or 1.
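
One way to sketch that curve in Python, under assumptions of my own: the estimate is normal with a known standard error, the test is two-sided, and a "type-M error of at least size Y" is read as the absolute estimate being at least Y times the true effect (which also counts wrong-sign estimates). The function name `design_curve_point` is hypothetical, and the grid of effect sizes is arbitrary apart from reusing the 8.1 standard error from the quote.

```python
from statistics import NormalDist

STD = NormalDist()


def design_curve_point(true_effect, se, factor, alpha=0.05):
    """Return (power, P(|estimate| >= factor * true_effect | significant))."""
    z = STD.inv_cdf(1 - alpha / 2)
    a = true_effect / se                       # true effect in SE units
    power = (1 - STD.cdf(z - a)) + STD.cdf(-z - a)
    # An estimate is both significant and over-stated by `factor`
    # when its magnitude exceeds the larger of the two thresholds.
    t = max(z, factor * a)
    joint = (1 - STD.cdf(t - a)) + STD.cdf(-t - a)
    return power, joint / power


if __name__ == "__main__":
    se = 8.1  # standard error from the quoted example
    print("effect    power  P(>=2x | sig)")
    for effect in (1, 2, 4, 8, 16, 32):
        power, p_over = design_curve_point(effect, se, factor=2)
        print(f"{effect:6.0f} {power:8.3f} {p_over:14.3f}")
```

For small true effects the second column is 1: the significance threshold alone is already more than twice the true effect, so every significant estimate is a large overestimate. Plotting both columns against effect size for several values of `factor` would give the family of curves described above.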

Nice paper.

Andrew: I wish you had also given an example of a “good” study. I would love to see you similarly calculate the Type S error probability and exaggeration factor for, say, one of your own studies.

That is, show how to do it right with an example, to contrast against the ovulation study and the sex ratio / beauty study.

I’m not sure the jumping matters. Compare to particle physics where they generate a huge number of events and the signal may consist of a few dozen. That sounds a lot like weighing a kangaroo to detect the presence, and get the weight, of a possible feather in its pouch. You can do it if you know the background (kangaroo weight).

If you are incorrect about the kangaroo weight, even to the smallest decimal place, you will eventually detect this deviation from the model. It is only a matter of collecting enough data to overcome the instability due to “jumping”. If no prior information is used when interpreting the signal as “weight of the feather” you may publish a paper saying the feather weighed some implausibly high or low amount. The headline will be “The kangaroo has a feather in its pouch!”, rather than the mundane reality “Scientists don’t know the exact weight of the kangaroo”.
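
A rough sketch of the “enough data” point, assuming independent readings with a fixed noise standard deviation (a simplification of the jumping). The expected z-statistic for the mean deviation from the assumed background is then bias × sqrt(n) / noise_sd, so even a tiny error in the assumed kangaroo weight becomes “significant” once n is large enough. The function name and the numbers in the example are hypothetical.

```python
import math


def readings_needed(bias, noise_sd, z_crit=1.96):
    """Smallest n at which the expected z-statistic reaches z_crit.

    bias:     error in the assumed background weight (same units as noise_sd)
    noise_sd: standard deviation of a single reading
    """
    if noise_sd <= 0:
        raise ValueError("noise_sd must be positive")
    if bias == 0:
        return math.inf  # a perfectly known background is never flagged on average
    return math.ceil((z_crit * noise_sd / abs(bias)) ** 2)


if __name__ == "__main__":
    # Hypothetical numbers: the background is off by 1 g, the jumping adds 5 kg of noise.
    n = readings_needed(bias=0.001, noise_sd=5.0)
    print(f"about {n:.2e} readings before the 1 g error looks significant")
```

The required n grows with the square of noise_sd / bias, so the deviation is always detectable eventually, and nothing in the calculation says whether it came from a feather or from a slightly wrong kangaroo weight.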

If instead you try to answer the question “Does the ant have a feather on its back?”, the same problem will arise if we blindly interpret deviations from the estimated weight of the ant. However, once again prior information can be used to help us. We can say that

1) It is implausible that the additional weight measured is accounted for by inaccuracy in our estimate of the ant’s weight.

2) The additional weight is consistent with what would be expected for a feather.

Then it is OK to say “The ant has something on its back! Scientists suspect feather!”. However, other items will also be consistent with the excess weight. These can be ruled out via controlled experiments, and/or by increasing the precision of the theoretical weight of the feather along with our estimate of the excess weight.

Hey, Anon. You should use a handle so you can distinguish yourself from other anonymous commenters.

S/he’ll choose one anon.

I couldn’t think of a good handle. But realized that “Anon” can be short for “Anoneuoid”, meaning: A-=not; non-=name; eu-=good; -oid=like that of. So “not having a good name”.

Better?

Do we have a good quantitative way to identify such jumping kangaroos? e.g. signal-to-noise ratios?

My fear is we will selectively use the jumping kangaroo critique against conclusions we do not like. That’s my worry with qualitative critiques.

In other words: The methods, sampling & measurement used by the red-clothes-fertility study seem not so different from so many other Psych studies out there. If their conclusions were not so bizarre, would we have caught those other noise-overwhelmed artifacts masquerading as legitimate effects?

Rahul:

I think the problems that have been revealed with these studies have made many psychology researchers look more carefully at the studies that they are working with. It is clear that these are general methodological critiques that apply more broadly, and Carlin and I made this point in our recent article in Perspectives on Psychological Science.

disclaimer: no analogy is perfect

The jumping kangaroo helped me visualize that the background noises are changing over time, but it doesn’t show that the effect size (weight of the feather) is also changing over time.

Love the analogy.

Don’t worry, as they say, we don’t have one kangaroo, we have thousands of them jumping around, so problem solved.

And the thousands of kangaroos are jumping out of sync.

And there is still only one feather.

From a Newtonian mechanics perspective, is it even possible in principle to weigh a feather under such conditions? I think not. It comes down to two factors: 1) solving a differential equation without sufficient initial conditions, namely the velocity of the kangaroo; 2) the kangaroo changes shape, so you can’t get the velocity of the kangaroo from time-of-flight data.

You know, just to be pedantic

Though, with a long enough sample, you might be able to argue that the average velocity of the kangaroo during the entire sequence was zero, since its elevation presumably doesn’t change…. that’s an interesting possibility. Who do I apply to for grant funding on this topic?