http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0175302

‘Bayes factors’ and posterior model probabilities are also calculated.

I believe selecting effect sizes of zero, small, medium, and large is more meaningful than -.001, 0, .001.

> 84% as attractive, 12% as unattractive, and the rest as neither

The beauty of the British people is legendary :-)

At age 7 (the measure used in that study), the children were 85% attractive, 12% unattractive and 4% worse than unattractive (“Looks underfed”, “Abnormal feature”, “Scruffy and dirty”).

The rating at age 11 (also available) was a bit less enthusiastic: 80% attractive, 15% unattractive and 5% worse than unattractive (“Undernourished”, “Abnormal feature”, “Slovenly, dirty”).

The question of why that particular measure was used is very pertinent, especially if one considers that another study by the same author focused on people rated as attractive at both age 7 and age 11 by two different teachers, 62% of the population. By the way, this number seems very low: I don’t know if that indicates that the correlation between both ratings is low, that in many cases either of the ratings is missing, that there are many cases where the same teacher filled out the form at ages 7 and 11 and therefore was ignored (why?)…

His latest paper may be interesting: “Why are there more same-sex than opposite-sex dizygotic twins?”

Unfortunately he hasn’t posted the pdf on his webpage yet.

https://academic.oup.com/humrep/advance-article-abstract/doi/10.1093/humrep/dey046/4925331?redirectedFrom=fulltext

Misunderstanding of NHST/p-values abounds; I would say science *is* still screwed up ;)

As I understand it, there are very large studies on births, and so the prior information is very comprehensive about birth rates and their variability in various circumstances.

If you know ahead of time, due to studies involving literally millions of births, that any given situation is unlikely to move the needle by more than the 3rd or 4th decimal place, then when someone comes along proposing to study 300 attractive people or whatever you can say ahead of time “this is most likely worthless.”

If you made that claim based on just a gut feeling or whatever, sure, you could find fault with it, but when there are 7 billion people living and birth records in various countries are comprehensive and hence you might be able to get access to summaries of half a billion birth records or something… it’s worth it to consider that information pretty seriously.

Jason:

The discussion was from an email that Kahan sent me.

Carlos:

I’ve seen Kanazawa’s claimed replication. Big forking paths problem, or I guess we could say p-hacking. In the first paper, he had data with attractiveness on a 1-5 scale, and he compares 5 (“very attractive”) to 1,2,3,4. (I don’t think any other comparison would yield statistical significance.) In the second paper the data are coded differently, and it ends up that he labels 84% as attractive, 12% as unattractive, and the rest as neither. This is completely different than the first paper where most of the people are *not* characterized as “very attractive.” So, basically, enough degrees of freedom to find statistical significance.

But I didn’t bother even commenting on the paper (except, when Kanazawa sent this paper to me, I replied that I thought his results could entirely be explained by noise; he thanked me but did not take my advice to heart), because, for reasons discussed above, the study had no chance of finding anything useful.

And, yes, I agree that a very fine statistical analysis was not needed here. (If you read the literature on sex ratios, you’ll see that any difference of more than one percentage point would be extremely hard to imagine.) And it’s not that “the result may be a fluke”; it’s that the data from that survey provide essentially zero evidence on the topic of beauty and sex ratios.

But that’s what’s so amazing! The beauty-and-sex-ratio paper was published in a reputable biology journal! For real! And it was featured on the Freakonomics website! Even though it was the statistical equivalent of a perpetual motion machine.

Science is (or, until recently, was) really screwed up. Anything could get published: this paper, that ESP paper, all sorts of things; all that was needed was statistical significance. In retrospect, it’s stunning that so much statistical firepower has been needed to reveal these problems.

And nothing in that above post was an exaggeration, except for that “power=.0500001” thing, which I’ve now fixed.

Thanks for your answer. Maybe rather than circular reasoning I was thinking of begging the question (but of course you won’t agree with that either).

Given how people misunderstand power, I’m not sure stating your issues with this study in terms of power helps. Especially if you do it in an exaggerated fashion for higher dramatic effect.

You say that there is a “plausible range of underlying differences” and whether it is +/- one tenth of a percentage point or +/- one percentage point, clearly it is quite narrow.

If the measured effect is two orders of magnitude larger than what is considered possible, I don’t think a very fine statistical analysis is needed to suggest that the result may be a fluke. By the way, I don’t know if you’ve commented somewhere on the (according to Kanazawa) replication based on British data published in 2011.

I’m not saying “the study is useless because the power is ridiculously low.” I’m saying the study is useless because the measurements are too noisy given any plausible underlying difference between the groups. This problem can be expressed as “low power,” and I talk about power because that’s a scale that many people are familiar with, but the fundamental problem here is not “low power,” it’s that the measurements are very noisy relative to the size of any plausible underlying differences.

This is not circular reasoning. There is no circle here. From the scientific literature and our understanding of statistics we can get a sense of a plausible range of underlying differences. Then, from statistical analysis, we can see that this particular study will be hopeless. This is direct reasoning, no circles involved.
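The "noisy relative to any plausible difference" point can be put in numbers with a rough design calculation. The sketch below is only an illustration under stated assumptions, not the analysis from the paper: it assumes a normal approximation, group sizes of 284 and roughly ten times that (2840), a baseline proportion of 0.5 for the variance, and a two-sided 5% test. The function name `design_analysis` is made up for this example. It returns the power, the probability that a statistically significant estimate has the wrong sign, and the expected factor by which a significant estimate overstates the true difference.

```python
from math import sqrt
from statistics import NormalDist


def design_analysis(true_diff, se, alpha=0.05):
    """Power, wrong-sign probability and exaggeration factor for a
    normally distributed estimate with mean true_diff (> 0) and sd se."""
    std = NormalDist()
    cutoff = std.inv_cdf(1 - alpha / 2) * se   # |estimate| must exceed this
    a_hi = (cutoff - true_diff) / se           # standardized upper cutoff
    a_lo = (-cutoff - true_diff) / se          # standardized lower cutoff
    p_hi = 1 - std.cdf(a_hi)                   # significant and positive
    p_lo = std.cdf(a_lo)                       # significant and negative
    power = p_hi + p_lo
    wrong_sign = p_lo / power                  # assumes true_diff > 0
    # E[|estimate|; significant], from truncated-normal means.
    e_hi = true_diff * p_hi + se * std.pdf(a_hi)
    e_lo = -true_diff * p_lo + se * std.pdf(a_lo)
    exaggeration = (e_hi + e_lo) / power / true_diff
    return power, wrong_sign, exaggeration


# Assumed design: 284 births in one group, about ten times as many in the other.
se = sqrt(0.25 / 284 + 0.25 / 2840)

for diff in (0.01, 0.001, 0.0001):  # 1, 0.1 and 0.01 percentage points
    power, wrong_sign, exaggeration = design_analysis(diff, se)
    print(f"true diff {diff:.4f}: power {power:.4f}, "
          f"wrong sign {wrong_sign:.2f}, exaggeration {exaggeration:.0f}x")
```

Under these assumptions the standard error is about 3 percentage points, so even a true difference of 1 percentage point leaves power barely above 5%, and any estimate that reaches significance has to be several times larger than the truth.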

I use 0.500 for convenience; the results wouldn’t change much with another baseline. Comparing the means of the “very attractive” group with the “not very attractive” group (which is ten times as large) wouldn’t change my analysis much either. I was just trying to get an idea of how close the alternative hypothesis had to be to the null hypothesis to claim that the power is that close to 0.05, using a very simple model. I would be curious to see if another power analysis yields a very different answer.

I take back the “you know the answer” bit, but I really don’t understand what that power calculation is supposed to mean. All I can see is a circular argument: “The study is useless because the power is ridiculously low, but the power is ridiculously low because I calculate it for an alternative hypothesis which is very close to the null because I think the study is useless.”

Carlos:

1. The probability of a girl birth is something like 0.485 or 0.488.

2. I haven’t always been so precise on this myself, but I try to use the term “comparison” rather than “effect” here because what’s being studied is a comparison between two groups, not a causal effect.

3. I think the difference in sex ratios between the two groups is likely to be very small, in part because there’s no clear reason to expect any systematic difference, and in part because the measurement of attractiveness in this particular study is itself so noisy, so we’re not even really comparing two distinct groups.

4. I don’t “know the answer anyway.” As I wrote, I *expect* the true difference in the population to be of order of magnitude 0.01 percentage points. In evaluating the Kanazawa paper, it was enough to point out that the analysis would be hopeless, even if the true population difference were as high as a (scientifically implausible) 1 percentage point. If someone had asked me ahead of time whether this study was worth doing, I’d’ve said no, even if I’d thought the underlying population difference were 1 percentage point. I actually expect the underlying difference to be much less, but it was not really necessary to develop that reasoning to make that point, so I didn’t bother.

5. If I really wrote, “based on the scientific literature it is just possible that beautiful parents are 1 percent more likely than others to have a girl baby,” then I guess I was being generous with the phrase “just possible.” I should’ve written that sentence more clearly.

6. It appears that my “power = .0500001” statement was an exaggeration! I’ll fix it in the above post.

Ok, so you think that the proper alternative hypothesis to calculate the power of the study is P(girl)=50.01% vs P(girl)=50.00%.

This seems a bit extreme, but now you’re indeed just one zero away from your power=.0500001 statement.

But then, why do you bother discussing that “based on the scientific literature it is just possible that beautiful parents are 1 percent more likely than others to have a girl baby” in that paper?

Just say that it is impossible that there is any effect, that the power has to be calculated against the alternative hypothesis which is equal to the null hypothesis and therefore power=0.05 by definition and that there is no need to do any study because you know the answer anyway.

Carlos:

In the beauty and sex ratio example, I’d expect the true difference in the population to be of order of magnitude 0.01 percentage points, which I’d write as 0.0001 except that it’s hard to keep track with all these zeroes.

What definition of power is consistent with power=.0500001 or something like that?

Let’s say that I have a dataset of N=284 births from very attractive parents and I want to test if the percentage of female births is different from 50% (to keep it simple).

My two-tailed test will reject the null hypothesis if the number of girls is 125 (or lower) or 159 (or higher).

If the null hypothesis P(girl)=50% is true, it will be rejected with probability 0.0500 (as it should).

I calculate the power for a few alternative hypotheses, based on the remark “Given that we only expect to see effects in the range of ±1 percent”:

If the alternative hypothesis P(girl)=51% is true, the null will be rejected with probability (i.e. the power is) 0.0631.

If the alternative hypothesis P(girl)=50.3% is true, the null will be rejected with probability (i.e. the power is) 0.0512.

If the alternative hypothesis P(girl)=50.1% is true, the null will be rejected with probability (i.e. the power is) 0.0501.
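For anyone who wants to check these figures, here is a minimal Python sketch of the calculation described above: an exact binomial test with N=284 that rejects when the number of girls is 125 or lower, or 159 or higher. The function names are made up for this example, and the output is not guaranteed to match the rounded figures quoted above to the last digit.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k girls in n births when P(girl) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def rejection_probability(p, n=284, lower=125, upper=159):
    """Probability that the count of girls lands in the rejection region
    (count <= lower or count >= upper) when P(girl) = p."""
    low_tail = sum(binom_pmf(k, n, p) for k in range(0, lower + 1))
    high_tail = sum(binom_pmf(k, n, p) for k in range(upper, n + 1))
    return low_tail + high_tail


# p = 0.50 gives the size of the test; the other values give its power.
for p in (0.50, 0.501, 0.503, 0.51):
    print(f"P(girl) = {p:.3f}: rejection probability = "
          f"{rejection_probability(p):.4f}")
```

Because the rejection region is symmetric around 142, the rejection probability is lowest at P(girl)=50% and rises only slowly as the alternative moves away from it.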

Jacob:

I disagree, for the following reason.

Consider your statements: “I think it is not so uncommon that it will be believed that something has an effect, but opinions will differ on the direction of the effect. . . . if I wanted to make a strong statement about whether the effect is positive or negative . . .”

I don’t think “the effect” will be positive or negative. I think it will be positive in some settings and negative in others. As I put it in yesterday’s post, “having an effect that varies by context and is sometimes positive and sometimes negative.”

There have been several follow-ups to support this as well as follow-ups that suggest both zero and opposite effects. I’ve done an (unpublished) meta-analysis and found many p < .05 studies, but they are split about 50/50 positive/negative. Some of this is statistical (the inclusion/exclusion of certain control variables seems to be influential) and there are problems with the predominantly cross-sectional data used to think about this problem. But if I wanted to make a strong statement about whether the effect is positive or negative, I think the point null comes in handy, with due consideration of the Type S error rate given the design and presumed effect size.

A GitHub repository is a great idea. I find myself writing little bits of code to illustrate things like type M/S errors, hypothesis testing in low power studies, etc. all the time, so having a central database to pull from would be convenient. A lot of it could easily be assembled into a sort of “tutorial R package” to let students/researchers get a sense of how the techniques they’re using actually behave in noisy settings.

I’m wondering if any other readers might be interested in working with me on a public GitHub repository dedicated to Andrew’s technical posts? I have already done a lot of work with a tutor on the “80% Power Lie” post. We worked up numerous graphics, and additional code, to make Andrew’s points more understandable at a basic undergrad level (i.e. where I’m at). When I’ve completely worked through the “80% Power Lie”, I will post a GH link in the comments to that post.

> From the literature and some math reasoning (not shown here) having to do with measurement error in the predictor, reasonable effect sizes are..

Andrew, would you please consider elaborating on your math reasoning? Or can anyone guess and spell it out explicitly, please?
