What’s the point of a robustness check?

Diomides Mavroyiannis writes:

I am currently a doctoral student in economics in France. I’ve been reading your blog for a while and I have this question that’s bugging me.

I often go to seminars where speakers present their statistical evidence for various theses. I was wondering if you could shed light on robustness checks: what is their link with replicability? I ask this because robustness checks are always just mentioned as a side note to presentations (yes we did a robustness check and it still works!). Is there any theory on what percent of results should pass the robustness check? Is it not suspicious that I’ve never heard anybody say that their results do NOT pass a check? Is this selection bias? Is there something shady going on? Or is there no reason to think that a proportion of the checks will fail?

Good question. Robustness checks can serve different goals:

1. The official reason, as it were, for a robustness check, is to see how your conclusions change when your assumptions change. From a Bayesian perspective there’s not a huge need for this—to the extent that you have important uncertainty in your assumptions you should incorporate this into your model—but, sure, at the end of the day there are always some data-analysis choices so it can make sense to consider other branches of the multiverse.

2. But the usual reason for a robustness check, I think, is to demonstrate that your main analysis is OK. This sort of robustness check—and I’ve done it too—has some real problems. It’s typically performed under the assumption that whatever you’re doing is just fine, and the audience for the robustness check includes the journal editor, referees, and anyone else out there who might be skeptical of your claims.

Sometimes this makes sense. For example, maybe you have discrete data with many categories, you fit using a continuous regression model which makes your analysis easier to perform, more flexible, and also easier to understand and explain—and then it makes sense to do a robustness check, re-fitting using ordered logit, just to check that nothing changes much.

Other times, though, I suspect that robustness checks lull people into a false sense of you-know-what. It’s a bit of the Armstrong principle, actually: You do the robustness check to shut up the damn reviewers, you have every motivation for the robustness check to show that your result persists . . . and so, guess what? You do the robustness check and you find that your result persists. Not much is really learned from such an exercise.

As Uri Simonsohn wrote:

Robustness checks involve reporting alternative specifications that test the same hypothesis. Because the problem is with the hypothesis, the problem is not addressed with robustness checks.

True story: A colleague and I used to joke that our findings were “robust to coding errors” because often we’d find bugs in the little programs we’d written—hey, it happens!—but when we fixed things it just about never changed our main conclusions.

31 thoughts on “What’s the point of a robustness check?”

  1. Yes, as far as I am aware, “robustness” is a vague and loosely used term by economists – used to mean many possible things and motivated for many different reasons. The idea is as Andrew states – to make sure your conclusions hold under different assumptions. But which assumptions and how many are rarely specified. And, the conclusions never change – at least not the conclusions that are reported in the published paper. So, at best, robustness checks “some” assumptions for how they impact the conclusions, and at worst, robustness becomes just another form of the garden of forked paths.

    I think this is related to the commonly used (at least in economics) idea of “these results hold, after accounting for factors X, Y, Z, …”. This usually means that the regression models (or other similar technique) have included variables intending to capture potential confounding factors. As discussed frequently on this blog, this “accounting” is usually vague and loosely used. Does including gender as an explanatory variable really mean the analysis has accounted for gender differences?

    In both cases, I think the intention is often admirable – it is the execution that falls short. And, sometimes, the intention is not so admirable.

  2. I like robustness checks that act as a sort of internal replication (i.e. keeping the data set fixed).

    So if it is an experiment, the result should be robust to different ways of measuring the same thing (i.e. measures one should expect to be positively or negatively correlated with the underlying construct you claim to be measuring).

    If it is an observational study, then a result should also be robust to different ways of defining the treatment (e.g. windows for regression discontinuity, different ways of instrumenting), robust to what those treatments are bench-marked to (including placebo tests), robust to what you control for…

    In both cases, if there is a justifiable ad hoc adjustment, like data exclusion, then it is reassuring if the result remains with and without exclusion (better if it’s even bigger).

    Shouldn’t a Bayesian be doing this too?

    Of course these checks can give false reassurances: if something is truly, and wildly, spurious then it should be expected to be robust to some of these checks (but not all).

    • To some extent, you should also look at “biggest fear” checks, where you simulate data that should break the model and see what the inference does. E.g., put an un-modelled change point in a time series.

    • > Shouldn’t a Bayesian be doing this too?
      There are other routes to getting less wrong Bayesian models by plotting marginal priors or analytically determining the impact of the prior on the primary credible intervals. I think this would often be better than specifying a different prior that may not be that different in important ways.

      And there are those prior and posterior predictive checks.

      Anyway that was my sense for why Andrew made this statement – “From a Bayesian perspective there’s not a huge need for this”.
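The “biggest fear” check suggested in the first reply above can be sketched in a few lines. This is only an illustration under assumptions of my own: a flat series with Gaussian noise, a level shift at the midpoint, and a straight-line trend as the (misspecified) model. The function names, the shift size, and the seed are hypothetical and do not come from the thread.

```python
import random

def ols_slope(xs, ys):
    """Return the least-squares slope of ys on xs and its usual standard error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, se

def simulate_series(n=200, shift=2.0, noise_sd=1.0, seed=1):
    """Noisy flat series with a level shift at the midpoint; there is no true trend."""
    rng = random.Random(seed)
    times = list(range(n))
    values = [(shift if t >= n // 2 else 0.0) + rng.gauss(0.0, noise_sd) for t in times]
    return times, values

times, jumpy = simulate_series(shift=2.0)   # the "feared" data: un-modelled change point
_, flat = simulate_series(shift=0.0)        # same noise, no change point

for label, series in [("with change point", jumpy), ("no change point", flat)]:
    slope, se = ols_slope(times, series)
    print(f"{label}: slope = {slope:.4f}, slope / se = {slope / se:.1f}")
```

If the trend model reports a confident slope on the series with the jump, that tells you how the inference behaves when this particular fear is realized, which is the point of the exercise.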

  3. It’s interesting this topic has come up; I’ve begun to think a lot in terms of robustness.

    First, robustness is not binary, although people (especially people with econ training) often talk about it that way. It’s all a matter of degree; the point, as is often made here, is to model uncertainty, not dispel it.

    Second, robustness has not, to my knowledge, been given the sort of definition that could standardize its methods or measurement. I think that’s a worthwhile project.

    Third, for me robustness subsumes the sort of testing that has given us p-values and all the rest. That is, p-values are a sort of measure of robustness across potential samples, under the assumption that the dispersion of the underlying population is accurately reflected in the sample at hand. (Yes, the null is a problematic benchmark, but a t-stat does tell you something of value.)

    But then robustness applies to all other dimensions of empirical work. You can be more or less robust across measurement procedures (apparatuses, proxies, whatever), statistical models (where multiple models are plausible), and—especially—subsamples. Machine learning is a sort of subsample robustness, yes? I think it’s crucial, whenever the search is on for some putatively general effect, to examine all relevant subsamples. The variability of the effect across these cuts is an important part of the story; if its pattern is problematic, that’s a strike against the effect, or its generality at least. And from this point of view, replication is also about robustness in multiple respects.

    As with all epiphanies of the it-all-comes-down-to sort, I may be shoehorning concepts that are better left apart. If I have this wrong I should find out soon, before I teach again….
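To make the subsample point concrete, here is a minimal sketch that estimates the same effect in each subsample and sets it beside the pooled estimate. The records are invented purely for illustration, and the “effect” is assumed to be a plain difference in means; a real analysis would use whatever estimator the paper uses.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, invented for illustration: (subsample, treated flag, outcome).
records = [
    ("young", 1, 5.0), ("young", 1, 6.0), ("young", 0, 3.0), ("young", 0, 4.0),
    ("middle", 1, 5.0), ("middle", 1, 5.0), ("middle", 0, 4.0), ("middle", 0, 4.0),
    ("old", 1, 3.0), ("old", 1, 4.0), ("old", 0, 4.0), ("old", 0, 5.0),
]

def effect(rows):
    """Difference in mean outcome between treated and control rows."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)

def effects_by_subsample(rows):
    """Estimate the effect separately within each subsample label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return {label: effect(members) for label, members in groups.items()}

pooled = effect(records)
by_group = effects_by_subsample(records)

print(f"pooled effect: {pooled:.2f}")
for label, value in by_group.items():
    print(f"  {label}: {value:+.2f}")
print(f"range across subsamples: {min(by_group.values()):+.2f} to {max(by_group.values()):+.2f}")
```

In this made-up example the pooled effect is positive while one subsample has the opposite sign, which is the kind of pattern the comment says should count against the generality of an effect.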

    • I like the analogy between the data generation process and the model generation process (where ‘the model’ also includes choices about editing data before analysis).

  4. People use this term to mean so many different things.

    In many papers, “robustness test” simultaneously refers to:
    1. Demonstrating a result holds after changes to modeling assumptions (the example Andrew describes)

    but also (in observational papers at least):
    2. Testing “alternative arguments” — which usually means “alternative mechanisms” for the claimed correlation, attempts to rule out an omitted variable, rule out endogeneity, etc.

    Drives me nuts as a reviewer when authors describe #2 analyses as “robustness tests”, because it minimizes #2’s (huge) importance (if the goal is causal inference at least). Those types of additional analyses are often absolutely fundamental to the validity of the paper’s core thesis, while robustness tests of the type #1 often are frivolous attempts to head off nagging reviewer comments, just as Andrew describes. Yet many people with papers that have very weak inferences that struggle with alternative arguments (i.e., have huge endogeneity problems, might have causation backwards, etc) often try to just push the discussions of those weaknesses into an appendix, or a footnote, so that they can be quickly waved away as a robustness test. I realize it’s just semantic, but it’s evidence of serious misplaced emphasis.

    (I’m a political scientist if that helps interpret this.)

    • I’ve also encountered “robust” used in a third way: For example, if a study about “people” used data from Americans, would the results be the same if the data were from Canadians? Mexicans? Nigerians? etc. (In other words, is it a result about “people” in general, or just about people of a specific nationality?)

  5. I have no answers to the specific questions, but Leamer (1983) might be useful background reading:

    http://faculty.smu.edu/millimet/classes/eco7321/papers/leamer.pdf

    Among other things, Leamer shows that regressions using different sets of control variables, both of which might be deemed reasonable, can lead to different substantive interpretations (see Section V.). Economists reacted to that by including robustness checks in their papers, as mentioned in passing on the first page of Angrist and Pischke (2010):

    http://ftp.iza.org/dp4800.pdf
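The point about control sets can be sketched with simulated data. This is not Leamer’s procedure, just a small illustration under assumptions I am making up: a confounder `z` drives both `x` and `y`, `x` has no effect on `y`, and `w` is an irrelevant control. The code fits the regression once per subset of controls and records the coefficient on `x`.

```python
import random
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(columns, y):
    """Least-squares coefficients of y on an intercept plus the given columns."""
    design = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Simulated data (all of it hypothetical): z confounds x and y; x does nothing to y.
rng = random.Random(0)
n = 2000
z = [rng.gauss(0, 1) for _ in range(n)]
w = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

controls = {"z": z, "w": w}
results = {}
for size in range(len(controls) + 1):
    for names in combinations(controls, size):
        coefs = ols([x] + [controls[name] for name in names], y)
        results[names] = coefs[1]  # coefficient on x (index 0 is the intercept)

for names, coef in results.items():
    print(names or ("no controls",), round(coef, 3))
```

Two of the four specifications suggest a sizable effect of `x` and two suggest none, so which ones get reported as “the” robustness checks matters a great deal.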

  6. I think of robustness checks as FAQs, i.e, responses to questions the reader may be having. They are a way for authors to step back and say “You may be wondering whether the results depend on whether we define variable x as continuous or discrete. Well, that occurred to us too, and so we did … and we found it didn’t make a difference, so you don’t have to be concerned about that.” These types of questions naturally occur to authors, reviewers, and seminar participants, and it is helpful for authors to address them.

    This doesn’t seem particularly nefarious to me. In fact, it seems quite efficient. It helps the reader because it gives the current reader the wisdom of previous readers. So it is a social process, and it is valuable.

    Is it a statistically rigorous process? No. But it isn’t intended to be. It incorporates social wisdom into the paper and isn’t intended to be statistically rigorous.

    • Terry:

      I never said that robustness checks are nefarious. What I said is that it’s a problem to be using a method whose goal is to demonstrate that your main analysis is OK. If robustness checks were done in an open spirit of exploration, that would be fine. But it’s my impression that robustness checks are typically done to rule out potential objections, not to explore alternatives with an open mind.

    • This may be a valuable insight into how to deal with p-hacking, forking paths, and the other statistical problems in modern research.

      This website tends to focus on useful statistical solutions to these problems. And that is well and good.

      But, there are other, less formal, social mechanisms that might be useful in addressing the problem. Discussion of robustness is one way that dispersed wisdom is brought to bear on a paper’s analysis.

      Another social mechanism is bringing the wisdom of “gray hairs” to bear on an issue. It can be useful to have someone with deep knowledge of the field share their wisdom about what is real and what is bogus in a given field. Such honest judgments could be very helpful. Unfortunately, a field’s “gray hairs” often have the strongest incentives to render bogus judgments because they are so invested in maintaining the structure they built.

      Another social mechanism is calling on the energy of upstarts in a field to challenge existing structures. This seems to be more effective. Unfortunately, upstarts can be co-opted by the currency of prestige into shoring up a flawed structure.

      Maybe what is needed are cranky iconoclasts who derive pleasure from smashing idols and are not co-opted by prestige.

      I don’t know. There is probably a Nobel Prize in it if you can shed some light on which social mechanisms work, and when they work and don’t work.

  7. My pet peeve here is that the robustness checks almost invariably lead to results termed “qualitatively similar.” That in turn is of course code for “not nearly as striking as the result I’m pushing, but with the same sign on the important variable.” Then the *really* “qualitatively similar” results don’t even have the results published in a table — the academic equivalent of “Don’t look over there. I did, and there’s nothing really interesting.” Of course when the robustness check leads to a sign change, the analysis is no longer a robustness check. It’s now the cause for an extended couple of paragraphs of why that isn’t the right way to do the problem, and it moves from the robustness checks at the end of the paper to the introduction where it can be safely called the “naive method.”

    • Jonathan (another one):

      You paint an overly bleak picture of statistical methods research and/or published justifications given for methods used.

      Or just an often very accurate picture ;-)

    • Yes, I’ve seen this many times. But to be naive, the method also has to employ a leaner model so that the difference can be chalked up to the necessary bells and whistles. I don’t think I’ve ever seen a more complex model that disconfirmed the favored hypothesis being chewed out in this way. Maybe a different way to put it is that the authors we’re talking about have two motives, to sell their hypotheses and display their methodological peacock feathers. “Naive” pretty much always means “less techie”.

      • Correct. When the more complicated model fails to achieve the needed results, it forms an independent test of the unobservable conditions for that model to be more accurate.

    • I get what you’re saying, but robustness is in many ways a qualitative concept eg structural stability in the theory of differential equations. Or, essentially, model specification. If you get this wrong who cares about accurate inference ‘given’ this model?

  8. Perhaps not quite the same as the specific question, but Hampel once called robust statistics the stability theory of statistics and gave an analogy to stability of differential equations.

    What you’re worried about in these terms is the analogue of non-hyperbolic fixed points in differential equations: those that have qualitative (dramatic) changes in properties for small changes in the model etc.

    A pretty direct analogy is to the case of having a singular Fisher information matrix at the ML estimate. This breaks pretty much the same regularity conditions for the usual asymptotic inferences as having a singular Jacobian does for the theory of asymptotic stability based on a linearised model. Unfortunately, as soon as you have non-identifiability, hierarchical models, etc., these cases can become the norm.

    Funnily enough both have more advanced theories of stability for these cases based on algebraic topology and singularity theory.

  9. There is one area where I feel robustness analyses need to be used more often than they are: the handling of missing data.

    It is quite common, at least in the circles I travel in, to reflexively apply multiple imputation to analyses where there is missing data. This sometimes happens in situations where even cursory reflection on the process that generates missingness cannot be called MAR with a straight face. In situations where missingness is plausibly strongly related to the unobserved values, and nothing that has been observed will straighten this out through conditioning, a reasonable approach is to develop several different models of the missing data and apply them. Ideally one would include models that are intentionally extreme enough to revise the conclusions of the original analysis, so that one has a sense of just how sensitive the conclusions are to the mysteries of missing data.
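One simple way to build a family of deliberately extreme missing-data models is to impute each missing value as the observed mean plus a shift, and then scan the shift until the conclusion flips. This is sometimes called a delta-adjustment or tipping-point analysis; it is my choice of illustration, not something the comment prescribes. The data below are invented, and single mean imputation is used only to keep the sketch short.

```python
from statistics import mean

# Hypothetical outcomes, invented for illustration; None marks a missing value.
treated = [6.0, 7.0, 5.0, 6.0, None, None, 7.0, 5.0, None, 6.0]
control = [5.0, 4.0, 5.0, 6.0, 5.0, 4.0, 5.0, 6.0, 5.0, 5.0]

def mean_with_shifted_imputation(values, delta):
    """Impute each missing value as (observed mean + delta), then average everything.

    delta = 0 treats the missing cases like the observed ones; a negative delta
    encodes the assumption that the missing cases did worse than the observed ones.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed) + delta
    return mean([fill if v is None else v for v in values])

def sensitivity_scan(deltas):
    """Treated-minus-control difference under each assumed shift for the missing cases."""
    return {d: mean_with_shifted_imputation(treated, d) - mean(control) for d in deltas}

scan = sensitivity_scan([0, -1, -2, -3, -4, -5])
tipping = next((d for d, diff in scan.items() if diff <= 0), None)

for delta, diff in scan.items():
    print(f"delta = {delta:+d}: estimated difference = {diff:+.2f}")
print("first delta at which the sign flips:", tipping)
```

The useful output is the tipping point itself: the reader can then judge whether a shift that large is plausible for the process that generated the missingness.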

    Regarding the practice of burying robustness analyses in appendices, I do not blame authors for that. I blame publishers. At least in clinical research most journals have such short limits on article length that it is difficult to get an adequate description of even the primary methods and results in. It is the journals that force important information into appendices; it is not something that authors want to do, at least in my experience.

    • +1 on both points. But on the second: Wider (routine) adoption of online supplements (and linking to them in the body of the article’s online form) seems to be a reasonable solution to article length limits.

  10. ‘And, the conclusions never change – at least not the conclusions that are reported in the published paper.’
    I understand conclusions to be what is formed based on the whole of theory, methods, data and analysis, so obviously the results of robustness checks would factor into them. Also, the point of the robustness check is not to offer a whole new perspective, but to increase or decrease confidence in a particular finding/analysis. I find them used as such. True, positive results are probably overreported and some really bad results are probably hidden, but at the same time it’s not unusual to read that results are sensitive to specification, or that the sign and magnitude of an effect are robust, while significance is not or something like that.
    It’s better than nothing.

    ‘My pet peeve here is that the robustness checks almost invariably lead to results termed “qualitatively similar.” That in turn is of course code for “not nearly as striking as the result I’m pushing, but with the same sign on the important variable.”’
    It is not in the rather common case where the robustness check involves logarithmic transformations (or logistic regressions) of variables whose untransformed units are readily accessible. Or Andrew’s ordered logit example above. In those cases I usually don’t even bother to check ‘strikingness’ for the robustness check, just consistency and have in the past strenuously and successfully argued in favour of making the less striking but accessible analysis the one in the main paper. Your experience may vary.

  11. Formalizing what is meant by robustness seems fundamental. Ignoring it would be like ignoring stability in classical mechanics. The unstable and stable equilibria of a classical circular pendulum are qualitatively different in a fundamental way.

    That a statistical analysis is not robust with respect to the framing of the model should mean roughly that small changes in the inputs cause large changes in the outputs. Of course the difficult thing is giving operational meaning to the words small and large, and, concomitantly, framing the model in a way sufficiently well-delineated to admit such quantifications (however approximate). Of course, there is nothing novel about this point of view, and there has been a lot of work based on it.
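A toy version of “small changes in the inputs cause large changes in the outputs” is a regression with two nearly collinear predictors. In the sketch below, which is only an assumed illustration, each 2x2 matrix stands in for the normal-equations matrix of two standardized regressors with a given correlation, and the right-hand side stands in for their covariances with the outcome. The helper names and the bump size are hypothetical.

```python
def solve_2x2(a, b):
    """Solve a 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return [x0, x1]

def amplification(a, b, bump=1e-4):
    """Largest change in the solution divided by the size of a bump to one input."""
    base = solve_2x2(a, b)
    bumped = solve_2x2(a, [b[0], b[1] + bump])
    return max(abs(p - q) for p, q in zip(base, bumped)) / bump

def correlation_matrix(rho):
    """Stand-in for X'X with two standardized regressors of correlation rho."""
    return [[1.0, rho], [rho, 1.0]]

rhs = [1.0, 1.0]
moderate = amplification(correlation_matrix(0.5), rhs)
near_collinear = amplification(correlation_matrix(0.9999), rhs)

print(f"correlation 0.5:    output change is about {moderate:.1f} times the input change")
print(f"correlation 0.9999: output change is about {near_collinear:.0f} times the input change")
```

The amplification factor is one crude way to put a number on “small” and “large” here; it says nothing yet about which perturbations of the inputs are the relevant ones, which is the hard part the comment goes on to discuss.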

    However, while the analogy with physical stability is useful as a starting point, it does not seem to be useful in guiding the formulation of the relevant definitions (I think this is a point where many approaches go astray). Here one needs a reformulation of the classical hypothesis testing framework that builds such considerations in from the start, but adapted to the logic of data analysis and prediction. (To give an example: much of physics focuses on near-equilibrium problems, and stability can be described very airily as tending to return towards equilibrium, or not escaping from it. In statistics there is no obvious corresponding notion of equilibrium, and to the extent that there is (maybe long-term asymptotic behavior is somehow grossly analogous), a lot of the interesting problems are far from equilibrium (e.g. small data sets). So one had better avoid the mistake made by economists of trying to copy classical mechanics; where it might be profitable to look for ideas, and this has of course been done, is statistical mechanics.)

    Conclusions that are not robust with respect to input parameters should generally be regarded as useless.

    • Dan:

      I think there are two dimensions here.

      One dimension is what you’re saying, that it’s good to understand the sensitivity of conclusions to assumptions. Sensitivity to input parameters is fine if those input parameters represent real information that you want to include in your model; it’s not so fine if the input parameters are arbitrary.

      The other dimension is what I’m talking about in my above post, which is the motivation for doing a robustness check in the first place. If the reason you’re doing it is to buttress a conclusion you already believe, to respond to referees in a way that will allow you to keep your substantive conclusions unchanged, then all sorts of problems can arise. The most extreme is the pizzagate guy, where people keep pointing out major errors in his data and analysis, and he keeps saying that his substantive conclusions are unaffected: it’s a big joke. But really we see this all the time—I’ve done it too—which is to do alternative analysis for the purpose of confirmation, not exploration.

      Anyway, both dimensions are important.
