The other day I commented on a new Science News article by Tom Siegfried about statistics and remarked:
If there were a stat blogosphere like there’s an econ blogosphere, Siegfried’s article would’ve spurred a ping-ponging discussion, bouncing from blog to blog.
In response, various people pointed out to me in comments and emails that there has been a discussion on statistics blogs of this article; we just don’t have the critical mass of cross-linkages to maintain a chain reaction of discussion.
I’ll try my best to inaugurate a statistics-blogosphere symposium, though.
Before going on, though . . . Note to self: Publish an article in Science News. Tom Siegfried’s little news article got more reaction than just about anything I’ve ever written!
OK, on to the roundup, followed at the end by my latest thoughts (including a phrase in bold!).
– I didn’t really have much to say about the news article when it came out and in fact only posted on it because four different people emailed to ask my thoughts on it. I was already aware of the controversy surrounding the disconnect between statistics as practiced in the field and as presented in textbooks, and I thought Siegfried captured the issue pretty well (except for the discussion of Bayesian statistics, which promulgated some common misconceptions that I tried to correct in my blog entry).
– Dan Lakeland used the Science News article to compare actual scientific research (in its best form) to the cargo-cult version of scientific method presented in statistics textbooks (“null hypotheses,” “alternative hypotheses,” p-values, and the rest).
See also Lakeland’s follow-up, where he threw in a description of me that’s pretty accurate, considering that we’ve never met:
If you are a productive, professional, grant-funded scientist today, you are probably about 50 years old. You went to graduate school in the 1980’s. When you learned about statistics, computers were just about fast enough that they could sort of keep up with your typing speed over the 1200 baud modem that connected you to the university mainframe. The idea of running a 10000 iteration MCMC sampling scheme on a partially nested 4 level model with 1/2 million observations was something Andrew Gelman was maybe just dreaming about, and if he was trying it out he was certainly writing custom FORTRAN code to do it.
I indeed wrote custom Fortran code in my thesis! And I remember loving the 1200 baud modem. It was so, so much better than the 300 baud connection. But that was when I was in college. By the time I was in grad school we were using workstations.
– Real-life private-sector statistician Kaiser Fung slammed Siegfried for sensationalism. Kaiser says that, realistically, we’re never going to have absolute truth in our statistical analyses, but that the problem is not with p-values, significance levels, Bayes, or anything else like that, but just the nature of human knowledge: “False results are part of the process of scientific inquiry, not a sign of its failure.”
Or, as we tell the students when teaching sampling theory: In real life, sampling error is just a lower bound on uncertainty; nonsampling error is the most important problem. But as statisticians, we focus on sampling error because that’s our unique contribution to the endeavor. Your doctor helps with your health, your minister gives you religion, you get your music from WFUV, and your friendly neighborhood statistician computes your standard errors. It’s called division of labor, and criticizing statistics for not solving all your scientific problems makes no more sense than criticizing your rabbi for not curing your pneumonia or sadly concluding that your D.J.–despite his wit and excellent taste in music–can’t do anything useful about those rude drivers on your morning commute.
– James Annan agreed with Siegfried that p-values can mislead, but he doesn’t seem to feel that statistics as a whole is about to fall apart.
– A physics blogger called Tamino wrote: “the foundation [of statistics] is not flimsy, it’s solid as a rock. Statistics works, it does what it’s supposed to do. But it is susceptible to misinterpretation, to false results purely due to randomness, to bias, and of course to error. That’s what the ScienceNews article is really about, although it takes liberties (in my opinion) in order to sensationalize the issue. But hey, that’s what magazines (not peer-reviewed journals) do.” Well put, although I disagree with some of Tamino’s later statements on probability (more on this below).
– Tamino’s remarks are ultimately focused not so much on statistics but on applications in climate science, and he was responding to Anthony Watts, who welcomed Siegfried’s article for “pointing out an over-reliance on statistical methods can produce competing results from the same base data.” Watts also links to this fun page of statistics quotes, but I’m not at all impressed by this quote from Ernest Rutherford: “If your experiment needs statistics, you ought to have done a better experiment.” That’s just obnoxious. In the meantime, before you have the “better experiment,” you still might have to make some decisions.
– Physicist Lubos Motl used the Siegfried article as a springboard for a very reasonable discussion of the role of hypothesis testing in statistical reasoning. I was trained as a physicist myself, so maybe that’s one reason I’m comfortable with this way of thinking. Motl writes: “statistical methods have always been essential in any empirically based science. In the simplest situation, a theory predicts a quantity to be “P” and it is observed to be “O”. The idea is that if the theory is right, “O” equals “P”. In the real world, neither “O” nor “P” is known infinitely accurately. . . . if “O” and “P” are (much) further from one another than both errors of “O” as well as “P”, the theory is falsified. It’s proven wrong. If they’re close enough to one another, the theory may pass the test: we failed to disprove it. But as always in science, it doesn’t mean that the theory has been proven valid. Theories are never proven valid “permanently”. They’re only temporarily valid until a better, more accurate, newer, or more complete test finds a discrepancy and falsifies them.” This is a refreshing departure from naive and (to me) pointless discussions of “the probability the null hypothesis is true” (again, more on that below).
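Motl’s compare-O-to-P idea can be sketched in a few lines of code. This is my own toy rendering of the logic, not anything from his post; the function name, the choice of adding errors in quadrature, and the 2-sigma cutoff are all my assumptions for illustration:

```python
import math

def consistent(obs, obs_err, pred, pred_err, n_sigma=2.0):
    """Crude consistency check: is the observation within n_sigma
    combined standard errors of the prediction? Passing this test
    means we failed to falsify the theory, not that it is true."""
    combined = math.hypot(obs_err, pred_err)  # errors added in quadrature
    return abs(obs - pred) <= n_sigma * combined

# Hypothetical numbers: theory predicts 1.00 +/- 0.02, we measure 1.05 +/- 0.02
print(consistent(1.05, 0.02, 1.00, 0.02))  # True: within 2 combined sigmas
print(consistent(1.20, 0.02, 1.00, 0.02))  # False: theory is in trouble
```

The asymmetry Motl describes lives in the return value: `False` falsifies the theory, while `True` only means it survives for now.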
Unfortunately, Motl went a bit too far for me when he starts talking about social and environmental science, saying that if effects are “claimed to be established at the 90% confidence level, it’s just an extremely poor evidence.” At a mathematical level, I know what he’s saying: 90% confidence is just 1.65 standard errors from zero, and that’s not far at all from a statistically insignificant 1 standard error from zero. Still, to go back to our earlier point (or to Phil’s discussion of inference for climate change), decisions do need to be made, and it’s best to summarize the inference we do have as best we can, even as we wait for better data and models.
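To make the arithmetic behind “90% confidence is just 1.65 standard errors” concrete, here is a quick stdlib-only check of the two-sided p-values at a few z-scores (using the error function for the normal CDF):

```python
import math

def normal_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def two_sided_p(z):
    # two-sided p-value for an estimate z standard errors from zero
    return 2 * (1 - normal_cdf(abs(z)))

print(round(two_sided_p(1.65), 3))  # ~0.099: borderline "significant at 90%"
print(round(two_sided_p(1.96), 3))  # ~0.05: the conventional cutoff
print(round(two_sided_p(1.00), 3))  # ~0.317: clearly "insignificant"
```

The point is how little daylight there is between 1.65 and 1.0 standard errors, which is why Motl finds 90%-level claims weak; the counterpoint in the text is that weak evidence can still be the best available when a decision is due.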
– Statistical consultant Mark Palko (who wrote, “I nearly stopped reading when I hit the phrase ‘mutant form of math'”) took the discussion in a different direction: “I [Palko] wonder if in an effort to make things as simple as possible, we haven’t actually made them simpler. . . . Letting everyone pick their own definition of significance is a bad idea but so is completely ignoring context. Does it make any sense to demand the same level of p-value from a study of a rare, slow-growing cancer (where five years is quick and a sample size of 20 is an achievement) and a drug to reduce BP in the moderately obese (where a course of treatment lasts two weeks and the streets are filled with potential test subjects)? Should we ignore a promising preliminary study because it comes in at 0.06?” This is a point that I’ve talked about on occasion in the political science context: There have been fewer than 20 presidential elections in modern (post-World War 2) politics, so, yes, demanding 95% confidence for inferences from such data seems to miss the point. There’s already more than a 1 in 20 chance, I think, of some sort of major change that would make your model irrelevant. On a related point, Palko asked, “In fields like econ where researchers often have to rely on natural experiments based on rare combinations of events, does it even make sense to discuss replication?”
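A quick calculation shows how hopeless the 95% standard is with election-sized samples. The “12 of 17 elections” count below is invented purely for illustration (an exact binomial test against a fair coin, doubling the upper tail):

```python
from math import comb

def binom_two_sided_p(k, n, p_null=0.5):
    # exact binomial p-value vs. a fair coin, doubling the upper tail
    # (tail-doubling is fine here because the null is symmetric at 0.5)
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical: some electoral pattern held in 12 of 17 post-WW2 elections
print(round(binom_two_sided_p(12, 17), 3))  # ~0.143: not "significant" at 0.05
```

Even a pattern holding in 12 of 17 elections, which any political scientist would consider worth discussing, fails the conventional cutoff, which is Palko’s context point in miniature.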
– An engineer named William Connolley linked to Siegfried’s article and wrote, “much of science isn’t statistical at all. . . . the sciencenews thing itself seems to be mostly thinking about medicine, where they use stats a lot because they don’t know what is really going on.” Well, yes and no. Sometimes medical researchers know what is really going on and sometimes they don’t, but in either case there’s a lot of individual variation–people’s bodies are different–and so statistics can be helpful.
As I already noted, I thought Siegfried’s article was basically OK in that he was capturing some real discontent among users of statistics. Whether or not statistics has a firm foundation, many scientists certainly feel that there are fundamental problems, and it’s not really Siegfried’s job to take a stand here. As a reporter, he’s reporting what different people think. I mean, sure, he could’ve concluded from his interview with me that statistics has a firm foundation–but why should he have trusted me more than the various other people he interviewed? What if he had made the mistake of trusting someone who said that statistics is only for people who “don’t know what is really going on”?? Setting the rhetoric aside, and also setting aside a few technical mistakes (noted in my earlier blog entry; see the very first link above), I think Siegfried did a reasonable job of laying out the controversy.
My perspective on some of this is, I believe, similar to Feynman’s irritated reaction when people asked him if light is a particle, or a wave, or a “wavicle.” From his perspective, light is particles: yes, particles that go around corners, but particles nonetheless. Now, I certainly don’t want to get into a discussion of quantum physics here; my point is that I share Feynman’s annoyance with pseudo-deep philosophical discussions which might begin as attempts to explain tricky concepts but quickly become morasses in themselves.
That’s how I feel about this whole subjective Bayesian thing. When I set up, fit, check, and improve a “Bayesian” model, it’s no more subjective than when Brad Efron decides what “estimator” to use and what set of replications to “bootstrap” over, or when Neyman and Pearson decided what “probability law” to use, or when Savage decided what “loss function” to minimize or when Cox decided how to construct his “semiparametric” model, etc etc etc. It’s about what we do, and what information we use. I can see how Tukey could’ve gotten so fed up with all the theory and philosophy that he decided just to present some graphical methods and not specify where they came from. If it’s all about the method, just present the method. I don’t go that far–when it comes to statistics, I ultimately find modeling to be more flexible and effective than direct construction of algorithms–but I see the appeal of chucking it all. Especially after hearing one more time the same old B.S. about subjectivity and objectivity. (For perhaps my definitive take on the topic, see here.)
That said, the connection between statistical modeling and reality can be tricky. You have Larry Wasserman, who works with physicists and should know better, thinking that, in particle physics, 95 percent of published 95% intervals will actually contain the truth, while Lubos Motl, who is a physicist and actually does know better, reminding us that, no, our models are full of errors and we should be wary of the nominal probabilities that come out of our statistical estimation.
Please don’t talk to me about the Pr (null hypothesis)
I agree with Tom Siegfried, Don Rubin, and the many many others who have criticized p-values–whatever their performance might be in theory–for being routinely misunderstood in practice. A p-value is the probability of seeing something as extreme as was observed, if the model were true. It is not under any circumstances a measure of the probability that the model is true.
The logical next step, which I hate hate hate hate hate, is to then try to calculate the probability that the null hypothesis is true. No. I refuse to do this. As a statistician, I am generally supportive of “give the people what they want” sorts of arguments, but this time I say no. In all the settings I’ve ever worked on, the probability that the model is true is . . . zero! I prefer Bayesian inference (or, more generally, interval estimation) for quantitative parameters and graphical checks (or, on occasion, p-values) to summarize ways in which the model doesn’t fit reality. Lots of problems are caused, I believe, by the often unquestioned idea that we should want to calculate, or estimate, the probability of a model being true. See here for more more more on this topic.
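For readers who want to see the p-value misinterpretation fail numerically, here is a toy simulation. It lives in the textbook world where a point null can be exactly true (which, per the paragraph above, is already a fiction in real applications), and every number in it (the 10% base rate of real effects, the 2-SE effect size) is made up for illustration:

```python
import math
import random

random.seed(1)

def normal_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def one_study(true_effect_in_se):
    # observed z-score = true effect (in SE units) plus standard normal noise
    z = true_effect_in_se + random.gauss(0, 1)
    return 2 * (1 - normal_cdf(abs(z)))  # two-sided p-value

# Invented setup: 10% of studied hypotheses are real effects of 2 SEs
true_effect_rate, effect_size, n = 0.10, 2.0, 100_000

signif = null_and_signif = 0
for _ in range(n):
    is_null = random.random() > true_effect_rate
    p = one_study(0.0 if is_null else effect_size)
    if p < 0.05:
        signif += 1
        null_and_signif += is_null

# Among "p < 0.05" results, the null was true far more often than 5%
print(f"Fraction of significant results where the null was true: "
      f"{null_and_signif / signif:.2f}")
```

With these invented numbers roughly half the “significant” findings come from true nulls, driving home that p < 0.05 says nothing direct about the probability the null is true; change the base rate and the answer changes, with the threshold held fixed.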
P.S. I still think the econ blogosphere has us beat: no matter how hard I try, I can’t capture the drama of the Krugmeister battling the freshwaterites, with Cowen, Tabarrok, et al. throwing fuel on the fire and Mark Thoma keeping score. And it’s also funny for this entire discussion to have been sparked by an innocuous if dramatically-phrased article in Science News. But, hey, we gotta start somewhere.
P.P.S. Actually, Alex T. did comment on Siegfried’s article, making an observation similar to Motl’s (noted above) that all the discussions of p-values shouldn’t obscure the importance of errors in the model.