You know, just to be pedantic

Though, with a long enough sample, you might be able to argue that the average velocity of the kangaroo during the entire sequence was zero, since its elevation presumably doesn’t change… that’s an interesting possibility. Who do I apply to for grant funding on this topic?

]]>Don’t worry, as they say, we don’t have one kangaroo, we have thousands of them jumping around, so problem solved.

And the thousands of kangaroos are jumping out of sync.

And there is still only one feather. ]]>

Andrew: I wish you had also given an example of a “good” study. I would love to see you similarly calculate the Type S error probability and exaggeration factor for, say, one of your own studies.

That is, showing how to do it right with an example, to contrast against the ovulation study and the sex ratio / beauty study.

]]>The jumping kangaroo helped me visualize that the background noises are changing over time, but it doesn’t show that the effect size (weight of the feather) is also changing over time.

]]>“Running this through our retrodesign() function, setting the true effect size to 2% and the standard error of measurement to 8.1%, the power comes out to 0.06, the Type S error probability is 24%, and the expected exaggeration factor is 9.7.” (Gelman & Carlin, 2014)

http://www.stat.columbia.edu/~gelman/research/published/PPS551642_REV2.pdf
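For anyone who wants to check those numbers, here is a minimal sketch of the same calculation in Python. The paper's retrodesign() is an R function; the name `retrodesign_normal` below is hypothetical, and the sketch assumes a normal sampling distribution and a positive true effect, and uses a closed-form expression for the exaggeration factor instead of simulation.

```python
from statistics import NormalDist


def retrodesign_normal(effect, se, alpha=0.05):
    """Return (power, Type S probability, expected exaggeration factor).

    Assumes the estimate is distributed Normal(effect, se) and effect > 0.
    """
    if effect <= 0 or se <= 0:
        raise ValueError("effect and se must both be positive")
    std = NormalDist()
    z = std.inv_cdf(1 - alpha / 2)      # two-sided critical value
    shift = effect / se                 # true effect in standard-error units
    p_hi = 1 - std.cdf(z - shift)       # significant with the correct sign
    p_lo = std.cdf(-z - shift)          # significant with the wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power
    # E[|estimate|; significant], from the truncated-normal mean formula
    mean_abs_sig = effect * (p_hi - p_lo) + se * (
        std.pdf(z - shift) + std.pdf(-z - shift)
    )
    exaggeration = mean_abs_sig / power / effect
    return power, type_s, exaggeration


power, type_s, exaggeration = retrodesign_normal(effect=2.0, se=8.1)
print(f"power={power:.2f}  typeS={type_s:.2f}  exaggeration={exaggeration:.1f}")
```

With a true effect of 2 and a standard error of 8.1, this should give a power of about 0.06 and a Type S probability of about 24%, with an exaggeration factor between 9 and 10, in line with the quoted figures.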

I would add that I think maybe a “power against curve” might be an interesting metric. Meaning: Under the experimental design and alternative hypothesis of an effect size X, what is the probability that the null is rejected, and given rejecting the null, what is the probability of a Type M error of at least size Y? This is a lot like what Gelman and Carlin do with Type S errors and power, but here you graph that out over a bunch of potential effect sizes and error tolerances, and maybe indicate in some manner regions where you think the true effect size is likely to be (for whatever reasons you have for thinking that). Then you could see that under the hypothesis of a small true effect, the probability of a massive over-estimate given rejecting the null is .6, or .8, or 1.
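A rough sketch of how such a curve could be tabulated, in Python. Everything here is an assumption for illustration: the function name `exaggeration_curve_point` is hypothetical, the estimate is taken to be normally distributed around the true effect, and "Type M error of at least size Y" is read as the estimate being at least Y times the true effect in absolute value.

```python
from statistics import NormalDist


def exaggeration_curve_point(effect, se, ratio, alpha=0.05):
    """Return (power, P(|estimate| >= ratio * |effect| given significance)).

    Assumes estimate ~ Normal(effect, se) and a nonzero true effect.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    est = NormalDist(mu=effect, sigma=se)

    def two_tail(cutoff):
        # P(|estimate| > cutoff)
        return (1 - est.cdf(cutoff)) + est.cdf(-cutoff)

    power = two_tail(z * se)
    # A significant estimate already exceeds z * se in absolute value, so the
    # joint event is |estimate| above the larger of the two cutoffs.
    cutoff = max(z * se, ratio * abs(effect))
    return power, two_tail(cutoff) / power


# Tabulate over a small grid of true effects and exaggeration ratios,
# holding the standard error at the 8.1 quoted above.
grid = {
    (effect, ratio): exaggeration_curve_point(effect, 8.1, ratio)
    for effect in (1.0, 2.0, 4.0, 8.0, 16.0)
    for ratio in (2, 5, 10)
}
for (effect, ratio), (power, prob) in grid.items():
    print(f"effect={effect:5.1f} ratio={ratio:2d} power={power:.2f} P(exagg)={prob:.2f}")
```

For small true effects the conditional probability hits 1 for modest ratios, because any estimate large enough to be significant is necessarily several times the true effect.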

]]>Is it possible to put numbers on this? For example, what was the signal-to-noise ratio (SNR) or power for the ovulation study or the himmicanes study?

And what are the corresponding numbers for a typical “good” Psych study? Or maybe even one of your own studies?

PS. Maybe there’s some other metric in addition to my naive SNR / power that you’d like to add?

]]>Better?

]]>If you are incorrect about the kangaroo weight, even to the smallest decimal place, you will eventually detect this deviation from the model. It is only a matter of collecting enough data to overcome the instability due to “jumping”. If no prior information is used when interpreting the signal as “weight of the feather”, you may publish a paper saying the feather weighed some implausibly high or low amount. The headline will be “The kangaroo has a feather in its pouch!”, rather than the mundane reality “Scientists don’t know the exact weight of the kangaroo”.

If instead you try to answer the question “Does the ant have a feather on its back?”, the same problem will arise if we blindly interpret deviations from the estimated weight of the ant. However, once again prior information can be used to help us. We can say that

1) It is implausible that the additional weight measured is accounted for by inaccuracy in our estimate of the ant’s weight.

2) The additional weight is consistent with what would be expected for a feather.

Then it is OK to say “The ant has something on its back! Scientists suspect feather!”. However, other items will also be consistent with the excess weight. These can be ruled out via controlled experiments, and/or by increasing the precision of the theoretical weight of the feather along with our estimate of the excess weight.

]]>Hmmm, maybe the “resting loosely” was overkill. My point was that the biases of measurement are both unknown (of course, because if you had a known bias you’d correct for it) and variable (over time and across populations). No single number, whether it be “power” or “signal to noise ratio” will capture that: these standard measures are typically constructed assuming bias = 0.

]]>I think the problems that have been revealed with these studies have made many psychology researchers look more carefully at the studies that they are working with. It is clear that these are general methodological critiques that apply more broadly, and Carlin and I made this point in our recent article in Perspectives on Psychological Science.

]]>My fear is we will selectively use the jumping kangaroo critique against conclusions we do not like. That’s my worry with qualitative critiques.

In other words: The methods, sampling, and measurement used by the red-clothes-fertility study seem not so different from so many other Psych studies out there. If their conclusions were not so bizarre, would we have caught those other noise-overwhelmed artifacts masquerading as legitimate effects?

]]>