The new rule in economics: One star is p < 0.20, two stars is a set of steak knives, three stars is you're fired.

Someone pointed me to a series of applied economics papers:

1. George Borjas and Nate Breznau, Ideological bias in the production of research findings:

Our study exploits an opportunity to observe 158 researchers working independently in 71 teams during an experiment. After being asked their position on immigration policy, they used the same data to answer the same empirical question: Does immigration affect public support for social welfare programs? . . . teams composed of pro-immigration researchers estimated more positive impacts of immigration on public support for social programs, while anti-immigration teams estimated more negative impacts. The differences arise because different teams adopted different model specifications. . .

The results include an unusual labeling of statistical significance:

Usually it’s one star for p < 0.05, two stars for p < 0.01, as here:

or here:

These are not intended to be authoritative references; they just turned up in a quick search. The point is that 0.05 is the usual standard. Using 0.10 is a way of manufacturing a “statistically significant” result when you don’t have it in your data (as here). In the case of the Borjas and Breznau paper, the data were too variable to get a conventionally strong result, but they still wanted to get it published, and so they shifted the stars. I’m surprised that the reviewers didn’t catch it!

Don’t get me wrong. I don’t think people should be using statistical significance, at any level, as a threshold. To get a sense of my perspective you can read our paper, Abandon Statistical Significance. Even if you have an estimate that’s just one standard error from zero, that’s still evidence of the direction of the effect, as long as no selection is going on.
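
To put a rough number on "one standard error from zero": under a normal approximation (an assumption of this sketch, along with no selection on the estimate), the two-sided p-value is about 0.32, far from any of the usual cutoffs. The function name below is made up for illustration.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z-score, assuming a standard normal reference distribution."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for Z ~ N(0, 1)
    return math.erfc(abs(z) / math.sqrt(2))

# z = estimate / standard error; these values are illustrative, not from any of the papers
for z in (1.0, 1.645, 1.96):
    print(f"z = {z:.3f}: two-sided p = {two_sided_p(z):.3f}")
```

So an estimate one standard error from zero would not earn a star even under a 0.20 rule, which is part of why I'd rather report the estimate and its uncertainty than a star.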

2. Katrin Auspurg and Josef Brüderl, Fragile Evidence for an Ideological Bias in the Production of Research Findings: Comment on Borjas and Breznau:

Although we were able to reproduce B&B’s numerical results, our reanalysis shows that the reported association is not robust. Specifically, the association hinges on a coding error. Data from four teams that contradict the ideology hypothesis were excluded from the analysis due to idiosyncratic variable coding. Correcting this error renders the ideology effect no longer statistically significant. Also, B&B employed a different outcome variable and weighting scheme to that used in a previous paper based on the same data. These two analytical decisions further contribute to the observed ideology effect. Correcting the coding error or using the same specification as in the previous paper renders the ideology effect indistinguishable from zero. . . .

They also go with the 10% significance level, I guess to be consistent with the original paper?

3. Nate Breznau and George Borjas, A Lack of Robustness in Robustness Checking from Auspurg and Brüderl:

In our published paper, we explicitly acknowledged the limitations of our findings which are based on secondary data and a small sample. After examining Auspurg and Brüderl’s claims, we conclude that they have not presented any new evidence that warrants any correction to our conclusions. . . .

This rejoinder includes the table at the top of this post, in which the significance level has now crept up to 0.20.

I’m anticipating a few more rounds of this, culminating in a table by Breznau and Borjas in which anything with a two-sided p-value of less than 0.5 is given a star. Everybody’s a winner!
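
Here is a minimal sketch of how the same p-value picks up stars as the cutoffs drift. The scheme labels and the example p-value are made up for illustration; only the headline cutoffs (0.05, 0.10, 0.20, and my joke 0.5) come from this post and its comments, and none of this reproduces any table from the papers.

```python
def stars(p, thresholds):
    """Return one star for each threshold that p falls strictly below."""
    return "*" * sum(p < t for t in thresholds)

# Hypothetical labeling schemes, for illustration only
schemes = {
    "0.05 / 0.01": (0.05, 0.01),
    "0.10 / 0.05 / 0.01": (0.10, 0.05, 0.01),
    "one star at 0.20": (0.20,),
    "one star at 0.50": (0.50,),
}

p = 0.15  # made-up p-value, not taken from any of the papers
for name, thresholds in schemes.items():
    print(f"{name:>20}: {stars(p, thresholds) or '(no stars)'}")
```

The data are identical in every row; only the labeling changes.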

P.S. Just kidding in the title of the post. This “p < 0.20” thing isn't really the new rule in econ; it's just something from this one paper. It may be that its authors got some special exemption from the 0.05 threshold.

20 thoughts on “The new rule in economics: One star is p < 0.20, two stars is a set of steak knives, three stars is you're fired.”

  1. I think your focus on the unfortunate use of p-values is possibly misleading. I take the thrust of the paper to explore whether/how ideological bias may creep (or barge) into analyses. So, I see the reference to p values as one measure of whether studies are finding support for, or opposition to, immigration. I think that is a reasonable measure to use when looking for such bias – partially since so many journals insist on “significance” for publishing papers. While I would like to see p value cutoffs abandoned entirely, this is one potential use for them that makes some sense to me. So, yes, it is ludicrous to have the creeping increases in reported p-values (reminds me of this funny website: https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/), but I didn’t think this paper was seriously suggesting that the p values were a good way to determine truth.

    • Dale:

      I’m not saying that the three papers in question are about p-values. I do think that all of them rely on p-values as evidence: they’re making claims based on what is statistically significant. Given that, it’s notable that they started with a nonstandard threshold (p < 0.10) and then moved to a threshold of p < 0.20 which I've never seen before. I get their motivation: they have noisy data and they don't want to report non-significance or a null effect. So they shift the threshold so they can get significance. I have sympathy for this approach because already I don't think it's right to consider an effect as zero just because the data don't reach that threshold; see here. But, to assign a star to p < 0.20, that's just an abuse of notation. I'd prefer them to just openly say their data are noisy and they want to say what they can say.

  2. The p<0.1 star isn't some kind of novel cheating. The convention in economics for decades (maybe since significance stars were introduced here) has been * p<0.1 ** p<0.05 *** p<0.01, although p<0.1 is usually treated as a pretty marginal result that shouldn't be overly interpreted (p=0.049 is a proven effect you can take to the bank though!). It's also the default when outputting regression tables in Stata using popular packages. p<0.2 is hilarious and brazen though.

    For what it's worth, the Borjas article (like just about everything Borjas writes on immigration) is seriously flawed in other ways as well. https://braddelong.substack.com/p/crosspost-noah-smith-friends-dont

    • Joseph:

      Interesting! I guess this makes sense. In psychology you can keep adding people to your experiments so, if you’re studying a real effect with good measurements, you should eventually be able to attain statistical significance at whatever level you need. But in economics and political science we’re usually working with observational data and sample size is limited (or you can increase the sample size by widening the scope of inquiry but then you’re not necessarily still studying the question of interest), so you can’t expect/demand such a high level of certainty.

    • For some reason, I didn’t read the bottom half of the post properly before posting. It covers my final sentence, although I think Borjas’ long history of dodgy anti immigration research remains relevant.

    • Joseph:

      I followed your link and it was informative, but one part puzzled me. One of the authors of the paper in question teaches at Harvard, and DeLong wants to post this question to the Harvard administration:

      What have you been doing? And there are the deans—Look: I respected Joe Nye enormously and liked him a lot. I like and profoundly respect Carnesale, Ellwood, Elmendorf, and Weinstein. But it is the role of a dean to call faculty members in, and say: Things are in a state that your next paper needs to replicate, to be bulletproof, and to be well-respected, so how do we make this happen?

      I kinda see what he means, but . . . is it really the role of the dean to chew out faculty about their research methods? It’s hard for me to imagine this ever happening. This just doesn’t seem like the role of deans, or department chairs, in any way. Indeed, I don’t think I’ve ever heard of this happening, outside of cases of flat-out fraud. Maybe it would be good if university administrations took this role, or maybe not, but I don’t think it’s actually done.

      To put it another way: DeLong teaches at the University of California, which has a famous psychology professor who’s committed blatant research misconduct, and they’ve refused to do anything about it. And then there’s John Yoo, who’s not only evil, he also publishes unscholarly crap. Nobody expects the deans to go to these people to make sure their results are “bulletproof” or “well-respected”!

      It’s an interesting question, how universities would differ if faculty were expected to publish “bulletproof” work . . . it’s not the world we live in.

      • Andrew – what exactly do you think deans should do? If they don’t oversee the quality of work done by the faculty (I’m not addressing the requirement that it be “bulletproof” which doesn’t seem like an appropriate standard to me), then what do you see them doing? Making class schedules? Chairing meetings? For once, I’d like to see deans actually engaged with the quality of the work done in the department/school that they are in charge of.

        • Dale:

          I don’t know what deans should do. I’ve seen what they actually do, and it’s never had anything to do with overseeing the quality of work done by the faculty, except in indirect ways such as appointing outside committees, overseeing tenure review, etc. Mostly they seem to be involved in budgetary issues and paperwork. I’m not saying the job is easy (they have to make decisions such as when a department can hire a new lecturer, or whether a new program is approved); I just can’t imagine them trying to directly assess the quality of a faculty member’s work, let alone intervening in the way that DeLong suggests. But that’s just my experience.

        • Andrew’s right. Deans are not, in general, involved in overseeing the quality of faculty research. At many (most?) places, deans are not now, or have not been in the past, active or productive researchers, so they would have little experiential basis for judging research. Even those deans that are qualified would be limited in their disciplinary expertise. In an actual case I know of, a dean who is a good researcher is a biologist, but would have little basis for judging the work of, say, historians or economists in the same college.

        • Andrew and Gregory
          I agree with your experiences. I’ve seen plenty of deans and their roles have mostly been focused on paperwork, budgets, overseeing processes, etc. And, in larger schools, there are assistant or associate deans who do most of that legwork, leaving the deans to be “in charge.” But if I look for one thing to change in university structure, either reducing/eliminating such deans or involving them in meaningful discussions of quality might be a good place to start. They do not need to be experts in a variety of subjects to exercise that role. And, I thought universities were primarily there to educate students, so teaching should at least be equally important as research, shouldn’t it? Who exactly should be involved in promoting or evaluating the quality of teaching?

          If you reject deans in all these roles, then you are left with a bunch of high paid pencil-pushers (a description that probably describes too many of them). I would say that some deans are heavily involved in fund-raising, and that is a different matter. However, my experience (mostly at smaller schools) has been that deans’ roles in fund raising are often restricted by the university advancement bureaucracy.

        • About deans and their work: I write from the perspective of a faculty member who has also been a department chair at a large research-intensive state university. The dean of my college has always been an active researcher, although their research slows down a lot while they are in the position of dean. A lot of their time is spent on financial issues. Fundraising is a large chunk. Overseeing the budget is a large chunk. They have to make decisions on which departments will get funds to hire, etc. Their main contribution to quality of research and teaching is in their role in promotion and tenure (and the analogous process for the many faculty who are not in the tenure system). They also have a large staff to manage, including associate deans and support staff members. I doubt that deans at large research-intensive universities directly assess the quality of faculty members’ research or teaching. That doesn’t happen here, and I doubt that it happens at Berkeley, where DeLong, I think, is on the faculty. In my college a dean might be a mathematician, who would not be very able to assess the work of a faculty member in plant biology, for example. I think they would get involved if there were something like fraud, or if there were evidence that the research quality of many members of a department had become problematic. It seems like a department chair is more able to assess and deal with research and teaching quality issues. (A department might have 25-100 or so faculty. A college might have over 1000.)

        • DeLong was talking about George Borjas, who teaches at Harvard’s school of public policy. Public policy is a professional school, and it’s possible that deans at a professional school could be more involved with faculty research than deans at an arts and sciences school. I doubt it, but I guess it’s possible?

        • P.S. We once had a dean who was a mathematician and whose only involvement in our department (statistics) came when he tried to stop us from admitting a Ph.D. student who’d only scored 650 on his math GRE. That really bothered the dean. The good news is that we stood up for ourselves and the dean took the L. The student was admitted, did very well in his Ph.D. program, and has had an outstanding career.

          I have to say that I’m really bad at these sorts of internal politics. If I’d been in charge of the department, I probably would’ve just collapsed under pressure and let the dean tell us who to admit. My stat dept colleagues did well on this one.

        • Good point about Borjas being part of a professional school. The culture in those places (in my experience, education, engineering, business) is rather different than the culture in arts and sciences. I still doubt that a dean would get involved, but I don’t have enough direct experience to say that it’s crazy.

  3. The significance threshold is adjusted based on how expensive it is to collect the data vs funding available. That is why it is so small for particle physics and GWAS studies (large sample size -> so many significant results it becomes obvious that deviations from a strawman null hypothesis are meaningless).

    The grad students need a hard, but not too hard, threshold to overcome before publishing. It seems to be optimized so something like 1/3 things that get tried are “publishable”.

  4. Why are the stars connected to being fired? I thought that, just like in a video game, we were collecting significance stars!
    On a more serious note, the origin of the significance stars is shrouded in some mystery (read to the end, it is not such a long thread): https://www.statalist.org/forums/forum/general-stata-discussion/general/1754832-why-do-many-economists-put-an-asterisk-next-to-p-10
    It might explain why I as a trained economist did not flinch at the p<0.1 significance star.

  5. I find the one star = 0.1 thing strange too (shouldn’t a star mean significant, all the issues with that approach aside), but it is the modal approach in economics. I just read this in the Census disclosure review guidelines:

    “To qualify for S&S review, the significance level thresholds need to be standard in your field (e.g., 0.01, 0.05, and 0.10 in economics). Using non-standard significance thresholds would be treated as numeric output.” (https://www2.census.gov/adrm/FSRDC/Resources/FSRDC-Disclosure-Avoidance-Methods-Handbook.pdf)
