Comments on: Incorporating Bayes factor into my understanding of scientific information and the replication crisis

By: Jim Hatton

Jim Hatton — Mon, 12 Mar 2018 15:13:59 +0000

My idea for a new elementary statistics textbook. There are millions of statistics textbooks, and students of them, that rely solely on p-values. And I know that purely Bayesian texts have not caught on. So how about writing a text with exactly the same problems as addressed by the classic texts, but solved from a modeling perspective? Beginning students treat any computer use as a black box, so why not use modeling software rather than, say, standard regression programs? What would happen, I think, is that the few students who become practitioners will use the better models and the rest of the students will see best practice.

By: Marcel van Assen

Marcel van Assen — Mon, 12 Mar 2018 07:22:37 +0000

Here we use a Bayesian approach where we compare the likelihoods of zero, small, medium, and large effects, taking into account the statistical significance of the original effect size, when jointly evaluating an original and a replication effect:

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0175302

‘Bayes factors’ and posterior model probabilities are also calculated.

I believe selecting effect sizes of zero, small, medium, and large is more meaningful than -.001, 0, .001.

By: Carlos Ungil

Carlos Ungil — Sun, 11 Mar 2018 17:11:36 +0000

In reply to Andrew.

> 84% as attractive, 12% as unattractive, and the rest as neither

The beauty of the British people is legendary :-)

At age 7 (the measure used in that study), the children were 85% attractive, 12% unattractive and 4% worse than unattractive (“Looks underfed”, “Abnormal feature”, “Scruffy and dirty”).

The rating at age 11 (also available) was a bit less enthusiastic: 80% attractive, 15% unattractive and 5% worse than unattractive (“Undernourished”, “Abnormal feature”, “Slovenly, dirty”).

The question of why that particular measure was used is very pertinent, especially if one considers that another study by the same author focused on people rated as attractive at both age 7 and age 11 by two different teachers, 62% of the population. By the way, this number seems very low: I don’t know if that indicates that the correlation between both ratings is low, that in many cases either of the ratings is missing, that there are many cases where the same teacher filled out the form at ages 7 and 11 and which were therefore ignored (why?)…

His latest paper may be interesting: “Why are there more same-sex than opposite-sex dizygotic twins?”
Unfortunately he hasn’t posted the pdf on his webpage yet.
https://academic.oup.com/humrep/advance-article-abstract/doi/10.1093/humrep/dey046/4925331?redirectedFrom=fulltext

By: Chris Wilson

Chris Wilson — Sun, 11 Mar 2018 13:10:01 +0000

In reply to Andrew. Misunderstanding of NHST/p-values abounds- I would say *is* still screwed up ;)

By: Daniel Lakeland

Daniel Lakeland — Sun, 11 Mar 2018 05:07:53 +0000

In reply to Carlos Ungil.

As I understand it, there are very large studies on births, and so the prior information is very comprehensive about birth rates and their variability in various circumstances.

If you know ahead of time, due to studies involving literally millions of births, that any given situation is unlikely to move the needle by more than the 3rd or 4th decimal place, then when someone comes along proposing to study 300 attractive people or whatever, you can say ahead of time “this is most likely worthless.”

If you made that claim based on just a gut feeling or whatever, sure, you could find fault with it, but when there are 7 billion people living and birth records in various countries are comprehensive and hence you might be able to get access to summaries of a half a billion birth records or something… it’s worth it to consider that information pretty seriously.

By: Andrew

Andrew — Sun, 11 Mar 2018 03:36:22 +0000

In reply to Jason Farnon. Jason: The discussion was from an email that Kahan sent me.

By: Andrew

Andrew — Sun, 11 Mar 2018 03:35:49 +0000

In reply to Carlos Ungil.

Carlos:

I’ve seen Kanazawa’s claimed replication. Big forking paths problem, or I guess we could say p-hacking. In the first paper, he had data with attractiveness on a 1-5 scale, and he compares 5 (“very attractive”) to 1,2,3,4. (I don’t think any other comparison would yield statistical significance.) In the second paper the data are coded differently, and it ends up that he labels 84% as attractive, 12% as unattractive, and the rest as neither. This is completely different than the first paper where most of the people are not characterized as “very attractive.” So, basically, enough degrees of freedom to find statistical significance.

But I didn’t bother even commenting on the paper (except, when Kanazawa sent this paper to me, I replied that I thought his results could entirely be explained by noise; he thanked me but did not take my advice to heart), because, for reasons discussed above, the study had no chance of finding anything useful.

And, yes, I agree that a very fine statistical analysis was not needed here. (If you read the literature on sex ratios, you’ll see that any difference of more than one percentage point would be extremely hard to imagine.) And it’s not that “the result may be a fluke”; it’s that the data from that survey provide essentially zero evidence on the topic of beauty and sex ratios.

But that’s what’s so amazing! The beauty-and-sex-ratio paper was published in a reputable biology journal! For real! And it was featured on the Freakonomics website! Even though it was the statistical equivalent of a perpetual motion machine.

Science is (or, until recently, was) really screwed up. Anything could get published: this paper, that ESP paper, all sorts of things; all that was needed was statistical significance. In retrospect, it’s stunning that so much statistical firepower has been needed to reveal these problems.

And nothing in that above post was an exaggeration, except for that “power=.0500001” thing, which I’ve now fixed.

By: Jason Farnon

Jason Farnon — Sun, 11 Mar 2018 01:09:36 +0000

May I ask where the linked discussion is drawn from? Whenever I google quotes I just get the transcript-thing hosted on this site.

By: Carlos Ungil

Carlos Ungil — Sun, 11 Mar 2018 00:42:24 +0000

In reply to Andrew.

Thanks for your answer. Maybe more than circular reasoning I was thinking of begging the question (but of course you won’t agree with that either).

Given how people misunderstand power, I’m not sure stating your issues with this study in terms of power helps. Especially if you do it in an exaggerated fashion for higher dramatic effect.

You say that there is a “plausible range of underlying differences” and whether it is +/- one tenth of a percentage point or +/- one percentage point, clearly it is quite narrow.

If the measured effect is two orders of magnitude larger than what is considered possible, I don’t think a very fine statistical analysis is needed to suggest that the result may be a fluke. By the way, I don’t know if you’ve commented somewhere on the (according to Kanazawa) replication based on British data published in 2011.

By: Andrew

Andrew — Sat, 10 Mar 2018 23:44:39 +0000

Carlos:

I’m not saying “the study is useless because the power is ridiculously low.” I’m saying the study is useless because the measurements are too noisy given any plausible underlying difference between the groups. This problem can be expressed as “low power,” and I talk about power because that’s a scale that many people are familiar with, but the fundamental problem here is not “low power,” it’s that the measurements are very noisy relative to the size of any plausible underlying differences.

This is not circular reasoning. There is no circle here. From the scientific literature and our understanding of statistics we can get a sense of a plausible range of underlying differences. Then, from statistical analysis, we can see that this particular study will be hopeless. This is direct reasoning, no circles involved.

By: Carlos Ungil

Carlos Ungil — Sat, 10 Mar 2018 23:26:29 +0000

In reply to Andrew.

I use 0.500 for convenience; the results wouldn’t change much with another baseline. Comparing the means of the “very attractive” group with the “not very attractive” group (which is ten times as large) wouldn’t change my analysis much either. I was just trying to get an idea of how close the alternative hypothesis had to be to the null hypothesis to claim that the power is that close to 0.05, using a very simple model. I would be curious to see if another power analysis yields a very different answer.

I take back the “you know the answer” bit, but I really don’t understand what that power calculation is supposed to mean. All I can see is a circular argument: “The study is useless because the power is ridiculously low, but the power is ridiculously low because I calculate it for an alternative hypothesis which is very close to the null because I think the study is useless.”

By: Andrew

Andrew — Sat, 10 Mar 2018 23:05:01 +0000

In reply to Carlos Ungil.

Carlos:

1. The probability of a girl birth is something like 0.485 or 0.488.

2. I haven’t always been so precise on this myself, but I try to use the term “comparison” rather than “effect” here because what’s being studied is a comparison between two groups, not a causal effect.

3. I think the difference in sex ratios between the two groups is likely to be very small, in part because there’s no clear reason to expect any systematic difference, and in part because the measurement of attractiveness in this particular study is itself so noisy, so we’re not even really comparing two distinct groups.

4. I don’t “know the answer anyway.” As I wrote, I expect the true difference in the population to be of order of magnitude 0.01 percentage points. In evaluating the Kanazawa paper, it was enough to point out that the analysis would be hopeless, even if the true population difference were as high as a (scientifically implausible) 1 percentage point. If someone had asked me ahead of time whether this study was worth doing, I’d’ve said no, even if I’d thought the underlying population difference were 1 percentage point. I actually expect the underlying difference to be much less, but it was not really necessary to develop that reasoning to make that point, so I didn’t bother.

5. If I really wrote, “based on the scientific literature it is just possible that beautiful parents are 1 percent more likely than others to have a girl baby,” then I guess I was being generous with the phrase “just possible.” I should’ve written that sentence more clearly.

6. It appears that my “power = .0500001” statement was an exaggeration! I’ll fix it in the above post.
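Points 1 and 4 above can be turned into a rough two-group calculation. The following is only a sketch under stated assumptions, not the calculation behind the original post: it uses a normal approximation, a two-sided 5% test, a baseline girl-birth probability of 0.485, and group sizes of 284 and 2840 (the "very attractive" group and a comparison group ten times as large, as mentioned elsewhere in this thread). The function name and its defaults are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_group_power(diff, n1=284, n2=2840, base=0.485, alpha=0.05):
    """Approximate power of a two-sided test for a difference in proportions.

    diff is the assumed true difference in P(girl) between the groups,
    expressed as a proportion (0.01 means one percentage point).
    Assumption: both groups share roughly the same baseline variance.
    """
    nd = NormalDist()
    # Standard error of the difference between the two sample proportions.
    se = sqrt(base * (1 - base) * (1 / n1 + 1 / n2))
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = diff / se  # true difference measured in standard errors
    # Probability that the estimate lands beyond either critical value.
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)


for diff in (0.01, 0.001, 0.0001):
    print(f"true difference = {diff:.4f}: power = {two_group_power(diff):.7f}")
```

Under these assumptions the standard error of the comparison is about 3 percentage points, so even a difference of 1 percentage point is a small fraction of one standard error.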

By: Carlos Ungil

Carlos Ungil — Sat, 10 Mar 2018 22:44:42 +0000

In reply to Andrew.

Ok, so you think that the proper alternative hypothesis to calculate the power of the study is P(girl)=50.01% vs P(girl)=50.00%.

This seems a bit extreme, but now you’re indeed just one zero away from your power=.0500001 statement.

But then, why do you bother discussing that “based on the scientific literature it is just possible that beautiful parents are 1 percent more likely than others to have a girl baby” in that paper?

Just say that it is impossible that there is any effect, that the power has to be calculated against the alternative hypothesis which is equal to the null hypothesis and therefore power=0.05 by definition and that there is no need to do any study because you know the answer anyway.

By: Andrew

Andrew — Sat, 10 Mar 2018 22:20:24 +0000

In reply to Carlos Ungil. Carlos: In the beauty and sex ratio example, I'd expect the true difference in the population to be of order of magnitude 0.01 percentage points, which I'd write as 0.0001 except that it's hard to keep track with all these zeroes.

By: Carlos Ungil

Carlos Ungil — Sat, 10 Mar 2018 22:11:26 +0000

> (not a power=.06 study but a power=.0500001 study or something like that)

What definition of power is consistent with power=.0500001 or something like that?

Let’s say that I have a dataset of N=284 births from very attractive parents and I want to test if the percentage of female births is different from 50% (to keep it simple).

My two-tailed test will reject the null hypothesis if the number of girls is 125 (or lower) or 159 (or higher).
If the null hypothesis P(girl)=50% is true, the test will reject with probability 0.0500 (as it should).

I calculate the power for a few alternative hypotheses, based on the remark “Given that we only expect to see effects in the range of ±1 percent”:
If the alternative hypothesis P(girl)=51% is true, the test will reject with probability (i.e. the power is) 0.0631.
If the alternative hypothesis P(girl)=50.3% is true, the test will reject with probability (i.e. the power is) 0.0512.
If the alternative hypothesis P(girl)=50.1% is true, the test will reject with probability (i.e. the power is) 0.0501.
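
The numbers above can be checked with a few lines of code. This is a minimal sketch of the exact binomial calculation described in this comment (N=284, reject at 125 or fewer girls, or 159 or more); the function name is hypothetical and only the Python standard library is used.

```python
from math import comb


def rejection_probability(p, n=284, lower=125, upper=159):
    """P(X <= lower or X >= upper) for X ~ Binomial(n, p).

    With p = 0.5 this is the test's size; with any other p it is the
    power against that alternative.
    """
    def pmf(k):
        # Exact binomial probability of observing k girls in n births.
        return comb(n, k) * p**k * (1 - p) ** (n - k)

    low_tail = sum(pmf(k) for k in range(0, lower + 1))
    high_tail = sum(pmf(k) for k in range(upper, n + 1))
    return low_tail + high_tail


for p in (0.50, 0.501, 0.503, 0.51):
    print(f"P(girl) = {p:.3f}: rejection probability = {rejection_probability(p):.4f}")
```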

By: Andrew

Andrew — Sat, 10 Mar 2018 21:59:53 +0000

In reply to Jacob. Jacob: I disagree, for the following reason. Consider your statements: "I think it is not so uncommon that it will be believed that something has an effect, but opinions will differ on the direction of the effect. . . . if I wanted to make a strong statement about whether the effect is positive or negative . . ." I don't think "the effect" will be positive or negative. I think it will be positive in some settings and negative in others. As I put it in yesterday's post, "having an effect that varies by context and is sometimes positive and sometimes negative."

By: Jacob

Jacob — Sat, 10 Mar 2018 21:06:24 +0000

I haven’t deeply reasoned about this and may not have the (mathematical) training to do so, but I feel like the point null has a particular benefit that relates to the idea of Type S errors. I think it is not so uncommon that it will be believed that something has an effect, but opinions will differ on the direction of the effect. One example of interest to me is the effect of political disagreement on engagement in politics. There were some studies (see Diana Mutz’s book) suggesting that there was an ironic effect of disagreement, which is supposed to be necessary for making good decisions in a democracy, in that those who encountered disagreement in discussions of politics were less likely to engage in politics.

There have been several follow-ups to support this as well as follow-ups that suggest both zero and opposite effects. I’ve done an (unpublished) meta-analysis and found many p < .05 studies, but they are split about 50/50 positive/negative. Some of this is statistical (the inclusion/exclusion of certain control variables seems to be influential) and there are problems with the predominantly cross-sectional data used to think about this problem. But if I wanted to make a strong statement about whether the effect is positive or negative, I think the point null comes in handy — with due consideration of the Type S error rate given the design and presumed effect size.
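
For readers who want to see what a Type S error rate looks like numerically, here is a rough sketch under a normal approximation: given an assumed true effect and the standard error of the estimate, it returns the probability that a statistically significant estimate has the wrong sign. The function name and the example inputs are hypothetical, not taken from any of the studies mentioned.

```python
from statistics import NormalDist


def type_s_rate(true_effect, se, alpha=0.05):
    """P(wrong sign | statistically significant), normal approximation."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = abs(true_effect) / se  # effect size in standard-error units
    p_right_sign = 1 - nd.cdf(z_crit - shift)  # significant, correct sign
    p_wrong_sign = nd.cdf(-z_crit - shift)     # significant, wrong sign
    return p_wrong_sign / (p_right_sign + p_wrong_sign)


# Hypothetical inputs: a true effect of 0.1 measured with standard errors
# of 1.0 (very noisy) and 0.05 (precise).
for se in (1.0, 0.05):
    print(f"se = {se}: Type S rate = {type_s_rate(0.1, se):.3f}")
```

When the true effect is small relative to the standard error, the rate approaches one half, which is one way a literature can end up with significant results split between positive and negative.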

By: Corson N. Areshenkoff

Corson N. Areshenkoff — Sat, 10 Mar 2018 20:00:48 +0000

In reply to Brad Stiritz. A GitHub repository is a great idea. I find myself writing little bits of code to illustrate things like type M/S errors, hypothesis testing in low power studies, etc., all the time, so having a central database to pull from would be convenient. A lot of it could easily be assembled into a sort of "tutorial R package" to let students/researchers get a sense of how the techniques they're using actually behave in noisy settings.

By: Brad Stiritz

Brad Stiritz — Sat, 10 Mar 2018 19:39:33 +0000

Hi Andrew, thanks for this example. Along with your recent post on the “80% Power Lie” (http://statmodeling.stat.columbia.edu/2017/12/04/80-power-lie/), these types of calculation-based discussions are extremely helpful.

I’m wondering if any other readers might be interested in working with me on a public GitHub repository, dedicated to Andrew’s technical posts? I have already done a lot of work with a tutor on the “80% Power Lie” post. We worked up numerous graphics, and additional code, to make Andrew’s points more understandable at a basic undergrad level (i.e. where I’m at). When I’ve completely worked through the “80% Power Lie”, I will post a GH link in the comments to that post.

> From the literature and some math reasoning (not shown here) having to do with measurement error in the predictor, reasonable effect sizes are..

Andrew, would you please consider elaborating on your math reasoning? Or can anyone guess and spell it out explicitly, please?

By: Ulrich Schimmack

Ulrich Schimmack — Sat, 10 Mar 2018 18:40:37 +0000

But what should readers do when there is no credible independent evidence that can produce a reasonable prior for effect sizes?

By: Dieter Menne

Dieter Menne — Sat, 10 Mar 2018 17:00:50 +0000

Interesting point, but the text reads like it was truncated at the point where “what to write” should have started. How would you formulate this in a publication?