Significance testing in economics: McCloskey, Ziliak, Hoover, and Siegler

Scott Cunningham writes,

Today I was rereading Deirdre McCloskey and Ziliak’s JEL paper on statistical significance, and then reading for the first time their detailed response to a critic who challenged their original paper. I was wondering what opinion you had about this debate. Are statistical significance and Fisher tests of significance as maligned and problematic as McCloskey and Ziliak claim? In your professional opinion, what is the proper use of seeking to scientifically prove that a result is valid and important?

The relevant papers are:

McCloskey and Ziliak, “The Standard Error of Regressions,” Journal of Economic Literature 1996.

Ziliak and McCloskey, “Size Matters: The Standard Error of Regressions in the American Economic Review,” Journal of Socio-Economics 2004.

Hoover and Siegler, “Sound and Fury: McCloskey and Significance Testing in Economics,” Journal of Economic Methodology, 2008.

McCloskey and Ziliak, “Signifying Nothing: Reply to Hoover and Siegler.”

My comments:

1. I think that McCloskey and Ziliak, and also Hoover and Siegler, would agree with me that the null hypothesis of zero coefficient is essentially always false. (The paradigmatic example in economics is program evaluation, and I think that just about every program being seriously considered will have effects–positive for some people, negative for others–but not averaging to exactly zero in the population.) From this perspective, the point of hypothesis testing (or, for that matter, of confidence intervals) is not to assess the null hypothesis but to give a sense of the uncertainty in the inference. As Hoover and Siegler put it, “while the economic significance of the coefficient does not depend on the statistical significance, our certainty about the accuracy of the measurement surely does. . . . Significance tests, properly used, are a tool for the assessment of signal strength and not measures of economic significance.” Certainly, I’d rather see an estimate with an assessment of statistical significance than an estimate without such an assessment.
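
Here is a minimal sketch of point 1 in Python. Everything in it is an assumption made for illustration and none of it comes from the papers above: a hypothetical program with an average effect of 0.02 standard deviations, a two-group comparison with equal group sizes, and a point estimate that happens to land exactly on that effect. The helper name z_and_interval is made up. The same tiny effect is indistinguishable from zero with 200 people per group and many standard errors away from zero with 200,000 per group; the effect did not change, only the uncertainty about it did.

```python
import math


def z_and_interval(effect, sd, n_per_group):
    """Return (z, lower, upper) for a difference in means between two
    equal-sized groups, using a normal approximation.

    Illustrative assumption: the point estimate equals `effect` exactly.
    """
    se = sd * math.sqrt(2.0 / n_per_group)  # standard error of the difference
    z = effect / se
    return z, effect - 1.96 * se, effect + 1.96 * se


# Hypothetical program effect: 0.02 standard deviations, tiny but not zero.
small_n = z_and_interval(effect=0.02, sd=1.0, n_per_group=200)
large_n = z_and_interval(effect=0.02, sd=1.0, n_per_group=200_000)

for label, (z, lower, upper) in [("n=200", small_n), ("n=200,000", large_n)]:
    print(f"{label}: z = {z:.2f}, 95% interval = ({lower:.4f}, {upper:.4f})")
```

In both cases the estimate is 0.02. Whether an effect of that size matters is a substantive question that the z-score does not answer.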

2. Hoover and Siegler’s discussion of the logic of significance tests (section 2.1) is standard but, I believe, wrong. They talk all about Type 1 and Type 2 errors, which are irrelevant for the reasons described in point 1 above.

3. I agree with most of Hoover and Siegler’s comments in their Section 2.4, in particular with the idea that the goal in statistical inference is often not to generalize from a sample to a specific population, but rather to learn about a hypothetical larger population, for example generalizing to other schools, other years, or whatever. Some of these concerns can best be handled using multilevel models, especially when considering different possible generalizations. This is most natural in time-series cross-sectional data (where you can generalize to new units, new time points, or both) but also arises in other settings. For example, in our analyses of electoral systems and redistricting plans, we were careful to set up the model so that our probability distribution generalized to other possible elections in existing congressional districts, not to hypothetical new districts drawn from a common population.

4. Hoover and Siegler’s Section 2.5, while again standard, is I think mistaken in ignoring Bayesian approaches, which limits their “specification search” approach to the two extremes of least squares or setting coefficients to zero. They write, “Additional data are an unqualified good thing, which never mislead.” I’m not sure if they’re being sarcastic here or serious, but if they’re being serious, I disagree. Data can indeed mislead on occasion.

Later Hoover and Siegler cite a theorem that states “as the sample size grows toward infinity and increasingly smaller test sizes are employed, the test battery will, with a probability approaching unity, select the correct specification from the set. . . . The theorem provides a deep justification for search methodologies . . . that emphasize rigorous testing of the statistical properties of the error terms.” I’m afraid I disagree again–not about the mathematics, but about the relevance, since, realistically, the correct specification is not in the set, and the specification that is closest to the ultimate population distribution should end up including everything. A sieve-like approach seems more reasonable to me, where more complex models are considered as the sample size increases. But then, as McCloskey and Ziliak point out, you’ll have to resort to substantive considerations to decide whether various terms are important enough to include in the model. Statistical significance or other purely data-based approaches won’t do the trick.

Although I disagree with Hoover and Siegler in their concerns about Type 1 error etc., I do agree with them that it doesn’t pay to get too worked up about model selection and its distortion of results–at least in good analyses. I’m reminded of my own dictum that multiple comparisons adjustments can be important for bad analyses but are not so important when an appropriate model is fit. I agree with Hoover and Siegler that it’s worth putting in some effort in constructing a good model, and not worrying if said model was not specified before the data were seen.

5. Unfortunately my copy of McCloskey and Ziliak’s original article is not searchable, but if they really said, “all the usual econometric problems have been solved”–well, hey, that’s putting me out of a job, almost! Seriously, there are lots of statistical (thus, I assume econometric) problems that are still open, most notably in how to construct complex models on large datasets, as well as more specific technical issues such as adjustments for sample surveys and observational studies, diagnostics for missing-data imputations, models for time-series cross-sectional data, etc etc etc.

6. I’m not familiar enough with the economics to comment much on the examples, but the study of smoking seems pretty wacky to me. First there is a discussion of “rational addiction.” Huh?? Then Ziliak and McCloskey say “cigarette smoking may be addictive.” Umm, maybe. I guess the jury is still out on that one . . . .

OK, regarding “rational addiction,” I’m sure some economists will bite my head off for mocking the concept, so let me just say that presumably different people are addicted in different ways. Some people are definitely addicted in the real sense that they want to quit but they can’t, perhaps others are addicted rationally (whatever that means). I could imagine fitting some sort of mixture model or varying-parameter model. I could imagine some sort of rational addiction model as a null hypothesis or straw man. I can’t imagine it as a serious model of smoking behavior.

7. Hoover and Siegler must be correct that economists overwhelmingly understand that statistical and practical significance are not the same thing. But Ziliak and McCloskey are undoubtedly also correct that most economists (and others) confuse these all the time. They have the following quote from a paper by Angrist: “The alternative tests are not significantly different in five out of nine comparisons (p<0.02), but the joint test of coefficient equality for the alternative estimates of theta.t leads to rejection of the null hypothesis of equality.” This indeed does not look like good statistics. Similar issues arise in the specific examples. For instance, Ziliak and McCloskey describe where Becker, Grossman, and Murphy summarize their results in terms of t-ratios of 5.06, 5.54, etc, which indeed miss the point a bit. But Hoover and Siegler point out that Becker et al. also present coefficient estimates and interpret them on relevant scales. So they make some mistakes but present some things reasonably.

8. People definitely don’t understand that the difference between significant and not significant is not itself statistically significant.
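
Point 8 comes down to a line of arithmetic. The sketch below, in Python, uses made-up numbers (two estimates, 25 and 10, each with standard error 10) and hypothetical helper names. It also assumes the two estimates are independent; otherwise the standard error of the difference would need a covariance term.

```python
import math


def z_score(estimate, se):
    """Estimate divided by its standard error."""
    return estimate / se


def difference(est_a, se_a, est_b, se_b):
    """Difference of two independent estimates and its standard error."""
    return est_a - est_b, math.sqrt(se_a ** 2 + se_b ** 2)


# Made-up numbers for illustration only.
est_a, se_a = 25.0, 10.0  # 2.5 standard errors from zero: "significant"
est_b, se_b = 10.0, 10.0  # 1.0 standard error from zero: "not significant"

z_a = z_score(est_a, se_a)
z_b = z_score(est_b, se_b)

diff, se_diff = difference(est_a, se_a, est_b, se_b)
z_diff = z_score(diff, se_diff)

print(f"A: z = {z_a:.2f}   B: z = {z_b:.2f}")
print(f"A - B = {diff:.1f}, se = {se_diff:.1f}, z = {z_diff:.2f}")
```

One estimate clears the conventional 1.96 threshold and the other does not, yet the difference between them is only about one standard error from zero. Sorting the two results into "significant" and "not significant" would suggest a contrast that the data do not support.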

9. Finally, what does this say about the practice of statistics (or econometrics)? Does it matter at all, or should we just be amused by the gradually escalating verbal fireworks of the McCloskey/Ziliak/Hoover/Siegler exchange? In answer to Scott’s original questions, I do think that statistical significance is often misinterpreted but I agree with Hoover and Siegler’s attitude that statistical significance tells you about your uncertainty of your inferences. The biggest problem I see in all this discussion is the restriction to simple methods such as least squares. When uncertainty is an issue, I think you can gain a lot from Bayesian inference and also from expanding models to include treatment interactions.

P.S. See here for more.

11 thoughts on “Significance testing in economics: McCloskey, Ziliak, Hoover, and Siegler”

  1. One of the few certainties in life is that the probability of a type I error is zero :) You can't make a type I error if the null hypothesis IS false.

    Well, in most contexts this should ring true. I have to admit that I have seen a few psychologists make valiant, and disturbingly threatening, attempts at creating counterexamples to my statement.

  2. KMC,

    Yes, some effects really are zero (or as close to zero as can be imagined). But I don't spend my time studying such phenomena. The key issue, I think, is that for things like program evaluation, individual effects are almost certainly happening, and even if they average to something small they won't average to exactly zero.


    It didn't actually get "picked up." I cheated and sent an email to Mark directly.

  3. Ah, rational addiction, thanks for reminding me. I read the original theoretical article a few years back, and it reads like a parody written by a sociologist who dislikes economic imperialism. The basic idea is that if you are the person that gets hooked easily on cigarettes, you take this information into account before smoking your first cigarette, for example.

  4. Rational addiction, as far as I understand it, is a situation where an individual maximizes the sum of current and future utilities, taking into account the impact that current decisions have on future utilities. The addiction comes from the assumption that, over some range at least, the consumption of the addictive good does not induce diminishing marginal utility. Rationality is just a way of saying that you could be a forward-looking maximizer and still get addicted. Or if you like, you don't have to be a myopic minimizer to get stuck into an addiction. The literature has nice examples such as jogging. This is my understanding of "rational addiction", based on a recollection of some reading a while back, and could well be incorrect.

    The literature is an early paper by Becker and Stigler in the 1970s, a paper by Iannaccone in Economics Letters around 1986, and a paper by Becker and Murphy around 1988 or perhaps early 1990s — you will find the papers easily on jstor and sciencedirect.

  5. Standard significance tests have a high probability of rejecting a valid hypothesis because of specific properties of the error distribution. So, in many real cases one has to be very careful in straightforward application of the least-squares (LSQ) approach.
    My best example is that there is no cointegrating relation (i.e. the corresponding coefficients are insignificant) between the population counts for the same year of birth taken in subsequent years, for example 9-year-olds this year and 10-year-olds next year. Obviously, the populations are essentially the same, with just a very small change due to migration and total deaths.

    However, due to Census Bureau's revisions there are large steps in the population distribution (especially after census years), which make standard statistical approach a bit problematic. The steps are artificial by nature but exist in the time series.

    Some more in papers:

    Chiarella, C., & Gao, S. (2002). Type I spurious regression in econometrics. Working Paper No 114, School of Finance and Economics, University of Technology Sydney.


    Kitov, I., Kitov, O., & Dolinskaya, S. (2007). Relationship between inflation, unemployment and labor force change rate in France: cointegration test. MPRA Paper 2736, University Library of Munich, Germany.

    Kitov, I., Kitov, O., & Dolinskaya, S. (2007). Inflation as a function of labor force change rate: cointegration test for the USA. MPRA Paper 2734, University Library of Munich, Germany.

  6. Dear Professor Cunningham:

    You have missed the main point, not on purpose, I am sure, but now obscured by higher-level comments that do miss it.

    The point? "Statistical" significance is not the same thing as importance. Fit is not the same thing as scientific significance. Significance is not a measure of anything unless there is a scale along which the importance of the variable, its Oomph as we call it, is measured. You can't substitute probability measures for scientific importance, not on the logic of Fisherian tests lacking loss functions. It's a very simple and elementary point.

    Grasping it is routine in physics and engineering. But in medicine and psychology and economics people—and I guess you too?–want to use the fit to decide on importance.

    It's a mistake, which by now scores of statisticians applied and theoretical have pointed out. (BTW: confining the reply to McCloskey and Ziliak has the effect of avoiding Gosset, Kruskal, and the others, who made our same, elementary point.)


    Deirdre McCloskey

  7. hiiiiiiii


    For more than 20 years, Deirdre McCloskey has campaigned to convince the economics profession that it is hopelessly confused about statistical significance. She argues that many practices associated with significance testing are bad science and that most economists routinely employ these bad practices: 'Though to a child they look like science, with all that really hard math, no science is being done in these and 96 percent of the best empirical economics …' (McCloskey 1999). McCloskey's charges are analyzed and rejected. That statistical significance is not economic significance is a jejune and uncontroversial claim, and there is no convincing evidence that economists systematically mistake the two. Other elements of McCloskey's analysis of statistical significance are shown to be ill-founded, and her criticisms of practices of economists are found to be based in inaccurate readings and tendentious interpretations of those economists' work. Properly used, significance tests are a valuable tool for assessing signal strength, for assisting in model specification, and for determining causal structure.


