Tuck Ngun, one of the researchers involved in the “Twin study reveals five DNA markers that are associated with sexual orientation” project, posted a disagreement with some criticisms relayed by science reporter Ed Yong. I’d thought Yong’s points were pretty good and I was interested in seeing what Ngun had to say. Ngun wrote:
I wanted to clarify and correct some of the claims made about my work in an article in the Atlantic by Ed Yong. . . . I have reached out to the Atlantic via Twitter about this but have heard nothing back as of this posting.
I hope nobody is ever reaching out to me via Twitter. You can reach out all you want but I won’t hear you!
Ed’s claim that inappropriate statistics were used is not credible because he clearly misunderstood the analytical procedure. . . . All models (from the very first to the final one) were built using JUST the training data. Only after we had created the model did we test their performance on the test data (the algorithm didn’t ‘see’ these during model creation). If performance was unsatisfactory, we remade the model by selecting a different set of predictors/features/data based on information from the TRAINING set and then reevaluating on the test set. This approach is used widely in statistical/predictive modeling field. . . . If this approach is wrong, someone needs to tell Amazon, Netflix, Google, and just about everyone doing statistical modeling and machine learning.
Nooooooooo! The problem is here:
If performance was unsatisfactory, we remade the model by selecting a different set of predictors/features/data based on information from the TRAINING set and then reevaluating on the test set.
Wrong! Once you go back like that, you’ve violated the “test set” principle.
Now let me say right here that I think the whole training/test-set idea has serious limitations, especially when you’re working with n=47. But if you want to play the training/test-set game and the p-value game, you should do it right. Otherwise your p-values don’t mean what you think they do.
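To see why going back to the test set is a problem, here's a minimal sketch in Python. Everything in it is made up (the actual pipeline is unpublished): the data are pure noise, the split is hypothetical, and the classifier is a toy stand-in. But the loop matches the procedure as described: select features on the training data, evaluate on the test set, and if unsatisfied, select again and re-evaluate on the same test set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions loosely matching what was reported: 47 subjects, ~6,000 marks.
n, p = 47, 6000
X = rng.normal(size=(n, p))        # pure noise: no mark carries any real signal
y = rng.integers(0, 2, size=n)     # arbitrary binary labels

train = np.arange(32)              # hypothetical split; the real one is unpublished
test = np.arange(32, n)

def test_set_accuracy(features):
    """Fit a toy threshold classifier on the training rows, score on the test rows."""
    score = X[:, features].sum(axis=1)
    cutoff = np.median(score[train])            # threshold chosen from training data only
    predictions = (score > cutoff).astype(int)
    return (predictions[test] == y[test]).mean()

# The loop in question: if test performance is "unsatisfactory," pick a
# different set of predictors and re-evaluate on the SAME test set.
best = 0.0
for attempt in range(500):
    candidates = rng.choice(p, size=5, replace=False)
    best = max(best, test_set_accuracy(candidates))

print(f"best 'test' accuracy on pure noise: {best:.2f}")   # typically ~0.85, not 0.5
```

Every individual model here is built on training data alone, just as Ngun says. But because the decision to keep or discard each model uses test-set performance, the test set leaks into model selection, and the best-looking "held-out" accuracy is wildly optimistic even though there is no signal at all.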
The second issue I want to discuss is his claim that we needed multiple testing correction. Again he is misunderstanding the approach and rationale. We did not need to correct for multiple testing because we did one hypothesis test. We are not testing whether each of the 6000 marks/loci are significantly associated with sexual orientation. If we had done that, multiple testing correction would have certainly been warranted. But we didn’t. The single test we did was to ask whether the final model we had built was performing better than random guessing. It seemed to be because its p-value was below the nearly universal statistical threshold of 0.05.
Ngun is, I believe, making the now-classic garden-of-forking-paths error. Sure, you only did one test on your data. But had the data been different, you would’ve done a different test (because your remaking of the model, as described in the above quote, would’ve been different). Hence your p-value is not as stated. See page 2 of this paper.
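And here's what forking paths does to that single p-value. Another toy simulation, same caveats as above (invented data, stand-in classifier, my own choice of parameters): generate pure-noise "studies," run the remake-until-satisfactory loop, then perform exactly one binomial test on the final model, asking whether it beats random guessing. Under the null, p < 0.05 should happen 5% of the time.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(1)

def one_null_study(n=47, p_marks=200, n_test=15, remakes=50):
    """One simulated study under the null: noise data, adaptive model
    remaking, then a single 'better than random guessing?' test."""
    X = rng.normal(size=(n, p_marks))
    y = rng.integers(0, 2, size=n)
    train, test = np.arange(n - n_test), np.arange(n - n_test, n)

    best_correct = 0
    for _ in range(remakes):                    # "remade the model" until satisfied
        feats = rng.choice(p_marks, size=5, replace=False)
        score = X[:, feats].sum(axis=1)
        cutoff = np.median(score[train])
        correct = int(((score[test] > cutoff).astype(int) == y[test]).sum())
        best_correct = max(best_correct, correct)

    # The "one hypothesis test" on the final model:
    return binomtest(best_correct, n_test, 0.5, alternative="greater").pvalue

pvals = np.array([one_null_study() for _ in range(1000)])
print("share of null studies with p < 0.05:", (pvals < 0.05).mean())
# roughly 0.5-0.6 here, an order of magnitude above the nominal 0.05
```

Only one test is ever run on each simulated dataset, yet the false-positive rate is nowhere near 0.05, because the model being tested was itself chosen in response to the data. That's the forking-paths point exactly.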
The big issue
All this is garden-variety statistical misunderstanding, which in many ways is excusable. I'm a statistician and I don't understand biology very well, and so it's reasonable enough that a biologist can make some statistical errors.
And at this point I’d usually say the problem is with the scientific publication process, that errors get past peer review, we need post-publication review, etc.
But . . . in this case there is no paper! No publication, not even a preprint.
Talk is cheap. I want to see what Ngun and his colleagues actually did. (I’d say “I want to see the raw data” but I’m no expert in genetics so I don’t know that I’d know what to do with the raw data!)
As I wrote in my earlier post: Why should we believe these headlines? Because someone from a respected university gave a conference talk on it? That’s not enough: conference talks are full of speculative research efforts. Because it was featured in a news article in Nature? No.
Ngun is providing no evidence at all. I think a healthy skepticism is the appropriate attitude to take when someone makes bold claims based on an n=47 study. Ngun may well feel that he did the statistics right, but how can we possibly judge? Lots of people think they did the statistics right when they didn’t. I’m guessing Daryl Bem thought he didn’t have multiple comparisons problems either. I agree with Thomas Lumley that it’s pretty ridiculous to be having this discussion when there’s no actual document by Ngun and his collaborators saying what they did.
To me, the key part of Ngun’s note is the following:
I [Ngun] would have appreciated the chance to explain the analytical procedure in much more detail than was possible during my 10-minute talk but he didn’t give me the option.
I’m sorry but that’s just ridiculous. Ngun can give all of us the option by just writing up what he did and posting the preprint.
I have some sympathy for Ngun, as this is a tough position for him to be in. It seems kinda weird to me for there to be this high-profile talk without any preprint. Maybe that’s how they do things in biology. But it seems a bit much to complain that someone isn’t giving you the option to explain your procedure in detail, when you and your colleagues are free to write it up at any time.
P.S. I did not agree with the content of Ngun’s note, but I found its tone to be pleasant. I agree with him that vigorous criticism is fine and should not be taken personally. Next step is to dial down the defensiveness and realize that (a) statistics can be tricky, and (b) if you don’t make a preprint available, you can’t really blame people for doing their best to guess at what you’ve been doing.
P.P.S. I see from Lumley that the publicity ball got rolling via a press release from the American Society of Human Genetics which includes an interview and a publicity photo of Ngun and a link to an abstract. Based on the abstract and what Ngun wrote on his webpage, it looks like overfitting to me. But, again, we don’t really have enough information to judge. In general it seems like you’re asking for trouble when you start publicizing technical claims without supplying the accompanying evidence. Everything seems to depend on trust—or perhaps the fear of getting scooped by another news outlet. If you’re Nature and the American Society of Human Genetics emits a press release, and you know it’s gonna get covered by the Daily Mail etc., there’s some pressure to run the story.
But there are incentives in the other direction, too. If you keep up the hype hype hype, and the word gets out that Nature is a less reliable source for science news than BuzzFeed or the Atlantic, that's not so good for your brand.