A colleague pointed me to a debate among some political science methodologists about robust standard errors, and I told him that the topic didn’t really interest me because I haven’t found a use for robust standard errors in my own work.

My colleague urged me to look at the debate more carefully, though, so I did. But before getting to that, let me explain where I’m coming from. I won’t be trying to make the “Holy Roman Empire” argument that they’re not robust, not standard, and not an estimate of error. I’ll just say why I haven’t found those methods useful myself, and then I’ll get to the debate.

The paradigmatic use case goes like this: You’re running a regression to estimate a causal effect. For simplicity suppose you have good identification and also suppose you have enough balance that you can consider your regression coefficient as some reasonably interpretable sort of average treatment effect. Further assume that your sample is representative enough, or treatment interactions are low enough, that you can consider the treatment effect in the sample as a reasonable approximation to the treatment effect in the population of interest.

But . . . your data are clustered or have widely unequal variances, so the assumption of a model plus independent errors is not appropriate. What you can do is run the regression, get an estimate and standard error, and then use some method of “robust standard errors” to inflate the standard errors so you get confidence intervals with close to nominal coverage.
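To make that recipe concrete, here is a minimal sketch in plain Python for a one-predictor regression with unequal variances. The simulated data, the function name `slope_standard_errors`, and the choice of the HC0 variant (no small-sample correction) are assumptions made for illustration, not anything prescribed by the papers discussed here.

```python
import math
import random


def slope_standard_errors(x, y):
    """Return (slope, classical_se, robust_se) for y = a + b*x fit by least squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Classical formula: one common error variance for every observation.
    sigma2 = sum(e * e for e in resid) / (n - 2)
    classical_se = math.sqrt(sigma2 / sxx)

    # HC0 "sandwich" formula: each squared residual stands in for that
    # observation's own error variance (no small-sample correction).
    robust_se = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    return slope, classical_se, robust_se


# Simulated data (an assumption for illustration): the noise sd grows with |x|.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(2000)]
y = [1.0 + 2.0 * xi + rng.gauss(0, abs(xi)) for xi in x]

slope, classical_se, robust_se = slope_standard_errors(x, y)
print(f"slope={slope:.3f}  classical_se={classical_se:.4f}  robust_se={robust_se:.4f}")
```

In this setup the noisiest observations are also the ones with the most leverage, so the robust standard error comes out larger than the classical one. The point estimate is the same either way.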

That all sounds reasonable. And, indeed, robust standard errors are a popular statistical method. Also, speaking more generally, I’m a big fan of getting accurate uncertainties. See, for example, this paper, where Houshmand Shirani-Mehr, David Rothschild, Sharad Goel, and I argue that reported standard errors in political polls are off by approximately a factor of 2.

But this example also illustrates why I’m not so interested in robust standard errors: I’d rather model the variation of interest (in this case, the differences between polling averages and actual election outcomes) directly, and get my uncertainties from there.

This all came up because a colleague pointed me to an article, “A Note on ‘How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It,'” by Peter Aronow. Apparently there’s some debate about all this going on among political methodologists, but it all seems pointless to me.

Let me clarify: it all seems pointless *to me* because I’m not planning to use robust standard errors: I’ll model my clustering and unequal variances directly, to the extent these are relevant to the questions I’m studying. That said, I recognize that many researchers, for whatever reason, *don’t* want to model clustering or unequal variances, and so for them a paper like Aronow’s can be helpful. So I’m not saying this debate among the political methodologists is pointless: it could make a difference when it comes to a lot of actual work that people are doing. Kinda like a debate about how best to format large tables of numbers, or how to best use Excel graphics, or what’s the right software for computing so-called exact p-values (see section 3.3 of this classic paper to hear me scream about that last topic), or when the local golf course is open, or what’s the best car repair shop in the city, or who makes the best coffee, or which cell phone provider has the best coverage: all these questions could make a difference to a lot of people, just not me.

**Unraveling some confusion about the distinction between modeling and inference**

One other thing. Aronow writes:

And thus we conclude . . . in light of Manski (2003)’s Law of Decreasing Credibility: “The credibility of inference decreases with the strength of the assumptions maintained.” Taking Manski’s law at face value, then a semiparametric model is definitionally more credible than any assumed parametric submodel thereof.

What the hell does this mean: “Taking Manski’s law at face value”? Is it a law or is it not a law? How do you measure “credibility” of inference or strength of assumptions?

Or this: “a semiparametric model is definitionally more credible than any assumed parametric submodel thereof”? At first this sounds kinda reasonable, even rigorous with those formal-sounding words “definitionally” and “thereof.” I’m assuming that Aronow is following Manski and referring to the credibility of inferences from these models. But then there’s a big problem, because you can flip it around and get “Inference from a parametric model is less credible than inference from any semiparametric model that includes it.” And that can’t be right. Or, to put it more carefully, it all depends how you fit that semiparametric model.

Just for example, and not even getting into semiparametrics, you can get some really really goofy results if you indiscriminately fit high-degree polynomials when fitting discontinuity regressions. Now, sure, a nonparametric fit should do better—but not *any* nonparametric fit. You need some structure in your model.

And this reveals the problem with Aronow’s reasoning in that quote. Earlier in his paper, he defines a “model” as “a set of possible probability distributions, which is assumed to contain the distribution of observable data.” By this he means the distribution of data conditional on (possibly infinite-dimensional) parameters. No prior distribution in that definition. That’s fine: not everyone has to be Bayesian. But if you’re *not* going to be Bayesian, you can’t speak in general about “inference” from a model without declaring how you’re gonna perform that inference. You can’t fit an infinite-parameter model to finite data using least squares. You need some regularization. That’s fine, but then you get some tangled questions, such as comparing a class of distributions that’s estimated using too-weak regularization to a narrower class that’s estimated more appropriately. It makes no sense in general to say that inference in that first case is “more credible,” or even “definitionally more credible.” It all depends on what you’re doing. Which is why we’re not all running around fitting 11th-degree polynomials to our data, and why we’re not all sitting in awe of whoever fit a model that includes ours as a special case. You don’t have to be George W. Cantor to know that’s a mug’s game. And I’m pretty sure Aronow understands this too; I suspect he just got caught up in this whole debating thing and went a bit too far.

“And I’m pretty sure Aronow understands this too”

Yes, not much speculation is needed. A couple of paragraphs earlier he writes:

“We have no aversion to parametric models when justified by the researcher. This is especially true in small samples, where additional structure can be used profitably for efficiency gains, even if the assumptions are incorrect. Similarly, the small sample behavior of robust standard errors may suggest measures of uncertainty that exploit more structure (e.g., classical standard errors) are preferable when n is small…The world of small samples is a difficult one – filled with tradeoffs – and we hesitate to make any general recommendations.”

“What the hell does this mean: “Taking Manski’s law at face value”? Is it a law or is it not a law?”

I’m pretty sure it’s meant to be obvious that Manski’s Law is not a real law and the phrase “at face value” is meant to gesture at all the major caveats that Aronow mentioned a little earlier and that you mention in your post.

Z:

That’s all fine, but then look at the whole passage from Aronow:

Given that Manski’s “law” does not make sense (i.e., the major caveats etc.), what does it mean for Aronow to be using this law to evaluate some recommendations?

Aronow may well be giving good advice in the context of this debate among political science methodologists—as noted in my above post, considering that I don’t use robust standard errors myself, I don’t have so much interest in the ways in which others use them—but I think he’s going overboard in the above quote, making an evaluation on the basis of a slogan or “law” that makes no sense. This is unfortunately a common mode of argument that I’ve seen in many areas of statistics, that someone will pull out a power-word such as “Bayesian” or “unbiased” or “robust” or “semiparametric” or “minimax” or “assumptions” and think that this wins the argument for them.

I read the tone differently. I think that in the discussion section as a whole he’s essentially saying: “Most intuition on this topic is based on flawed heuristic reasoning. Countering the flawed heuristic reasoning that favors full parametric models, there exists flawed heuristic reasoning in the form of Manski’s ‘Law’ that opposes full parametric models. So there are competing flawed intuitions without a clear resolution. Maybe I’m a little more sympathetic to the Manski brand of flawed intuition (which is why I’m concluding with it), but regardless we can at least make the uncontroversial statement that researchers should not be ‘forced to adopt a full parametric model without strong ex ante theoretical justification.'”

Granted, I’m not sure who’s doing the ‘forcing’ here or if that phrase was a fair characterization of what King was advocating in the paper that this was a note on.

Z:

You could be right; it might be that Aronow here is in the middle of an argument in which many of these points are already understood, and that, coming as an outsider, I’m missing these subtleties in inflection.

All samples are finite. It’s always a choice between sampling error and bias.

Related to this issue is a recent paper in Psychological Methods: On the unnecessary ubiquity of hierarchical linear modeling, by McNeish, et al. https://www.ncbi.nlm.nih.gov/pubmed/27149401, which suggests that robust standard errors should be used more.

Jeremy:

I followed your link and . . . wow. That article is really bad. At least, that’s what it looks like to me from reading the abstract.

When someone makes a Bayesian Hierarchical model, I at least know what it means, something approximately like: thing A is in a certain region, data points B_i are clustered around a point in the region of A, the clustering of B is in a certain range that is in the high probability distribution of C etc. etc.

Can someone even explain what “clustered standard errors” actually means in an example case? I.e., what is the logical content of the model? It seems like this heuristic: our data aren’t independent, so let’s widen the intervals using some method so that …. something happens.

Daniel:

I think the whole point of clustered standard errors is that there is no model. It’s supposed to give intervals with good confidence coverage properties. I don’t like such methods because they don’t do anything with the point estimates, and as such are leaving information on the table. And, as a regular reader of my writing, you’ll know that I consider it a plus, not a minus, to make assumps.

If anyone is interested in the latest thinking on clustered SEs, this paper by a rather impressive quartet of authors is probably the place to go:

https://arxiv.org/abs/1710.02926

When Should You Adjust Standard Errors for Clustering?

Alberto Abadie, Susan Athey, Guido Imbens, Jeffrey Wooldridge

(Submitted on 9 Oct 2017 (v1), last revised 24 Oct 2017 (this version, v2))

In empirical work in economics it is common to report standard errors that account for clustering of units. Typically, the motivation given for the clustering adjustments is that unobserved components in outcomes for units within clusters are correlated. However, because correlation may occur across more than one dimension, this motivation makes it difficult to justify why researchers use clustering in some dimensions, such as geographic, but not others, such as age cohorts or gender. It also makes it difficult to explain why one should not cluster with data from a randomized experiment. In this paper, we argue that clustering is in essence a design problem, either a sampling design or an experimental design issue. It is a sampling design issue if sampling follows a two stage process where in the first stage, a subset of clusters were sampled randomly from a population of clusters, while in the second stage, units were sampled randomly from the sampled clusters. In this case the clustering adjustment is justified by the fact that there are clusters in the population that we do not see in the sample. Clustering is an experimental design issue if the assignment is correlated within the clusters. We take the view that this second perspective best fits the typical setting in economics where clustering adjustments are used. This perspective allows us to shed new light on three questions: (i) when should one adjust the standard errors for clustering, (ii) when is the conventional adjustment for clustering appropriate, and (iii) when does the conventional adjustment of the standard errors matter.

Andrew:

“I don’t like such methods because they don’t do anything with the point estimates”

It’s only the simplest methods that do this. The widely-used GEE approach – which uses robust standard error estimates to allow for clustering but with a “working” correlation structure within each cluster – does give different point estimates depending on both the assumed correlation structure and the data.

I don’t know whether this is the type of thing you seek, but you might look at p. 5 of http://www.stat.berkeley.edu/~census/mlesan.pdf

From linked article: “It remains unclear why applied workers should care about the variance of an estimator for the wrong parameter.”

I think that says what I was intuiting. People seem to want to do science by pushbutton in which “model assumptions” are like some kind of deviant behavior to be avoided or denounced. I come from the opposite point of view: if you don’t have strong assumptions about a mechanism that you could codify into a model you haven’t really started working on your project yet, at least in science, maybe not in sales forecasting and marketing, where the mechanism is transient anyway.

The author of that paper seems to be basically in agreement that these methods are used inappropriately.

Daniel, this paper by David Freedman makes a similar point: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.119.135&rep=rep1&type=pdf

That’s actually the same paper as linked by Russell Lyons above, from a different site.

Dear Daniel, Andrew and Russell,

I took the liberty of pre-registering this comment 10 days ago.

http://statmodeling.stat.columbia.edu/2017/12/13/yes-can-statistical-inference-nonrandom-samples-good-thing-considering-nonrandom-samples-pretty-much-weve-got/#comment-627772

Best,

jrc

So, admittedly I’m not very into how this all works, but the way I see it is something like this: I want to know how much say changes in food prices affect spending on entertainment. I create a sampling scheme where I sample households within zip codes… I figure people have similarities if they live in the same zipcode. I collect some data before and after a price shock in food by RNG sampling of addresses and manual interviewing of the people…. I pretend that the observed data is independent from zip code to zip code but within a zip code there are similarities…. So far so good.

If I follow the linked paper Russell mentions, the next step is this:

Assume the observed data come from an iid sampling scheme in which the distribution is P(theta) even though we know this isn’t true and even though we explicitly think there is dependency, and even though the original math explicitly assumes independence (facts mentioned in the linked paper). Create a Likelihood function, calculate the maximum likelihood estimator of theta, use some math to argue that if you repeated the process with different RNGs you’d have seen MLE estimates of theta within a region whose length is given by RobustSE but whose location you really don’t know much about. Ignore the fact that you have bias that makes the location unknown, and quote the region PointEstimate +- 2 RobustSE as … what exactly? the range of values you might expect to have come out of your assumed to be incorrect MLE procedure if you’d used a different RNG seed to collect the data, even though you explicitly started out assuming this was false… Publish paper, hope to affect food stamp policy…

Yikes!

There is no likelihood function in the cluster-robust method – it is a sort of “non-parametric” variance estimate computed in a least-squares framework. The idea is that you can directly estimate the important parts of the Variance/Covariance matrix of X’s and E’s to estimate appropriately-sized standard errors using the observed X’s and the residuals (Ehat) from the regression. The “Practitioner’s Guide” article I linked to last week gives this generally un-pasteable description of why it works on the bottom of page 8:

“What is striking about this is that for each cluster g, the Ng × Ng matrix Ehat Ehat’ is bound to be a very poor estimate of the Ng × Ng matrix E[EE’] – there is no averaging going on to enable use of a Law of Large Numbers. The “magic” of the CRVE is that despite this, by averaging across all G clusters, we are able to get a consistent variance estimate.”
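As a rough illustration of that description, here is a minimal sketch of the cluster-robust calculation for the slope of a one-predictor regression, in plain Python. The function name, the simulated data, and the absence of any small-sample correction (the so-called CR0 form) are assumptions made to keep the example short.

```python
import math
import random
from collections import defaultdict


def cluster_robust_slope_se(x, y, cluster):
    """Cluster-robust (CR0) standard error of the slope in y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    resid = [(yi - y_bar) - slope * d for d, yi in zip(dx, y)]

    # Add up the contributions (x_i - x_bar) * e_i within each cluster ...
    totals = defaultdict(float)
    for d, e, g in zip(dx, resid, cluster):
        totals[g] += d * e
    # ... then sum the squared cluster totals across all clusters.
    meat = sum(t * t for t in totals.values())
    return math.sqrt(meat) / sxx


# Simulated data (an assumption): 100 clusters of 20 units, with the
# regressor and part of the error shared within each cluster.
rng = random.Random(1)
x, y, cluster = [], [], []
for g in range(100):
    x_g = rng.gauss(0, 1)
    u_g = rng.gauss(0, 1)
    for _ in range(20):
        x.append(x_g)
        y.append(0.5 * x_g + u_g + rng.gauss(0, 1))
        cluster.append(g)

se_clustered = cluster_robust_slope_se(x, y, cluster)
# Treating every observation as its own cluster gives the plain HC0 formula.
se_unclustered = cluster_robust_slope_se(x, y, list(range(len(x))))
print(f"clustered={se_clustered:.4f}  unclustered={se_unclustered:.4f}")
```

Each within-cluster total is a noisy quantity on its own. The estimate only becomes stable because the squared totals are combined over many clusters, which is the "magic" the quoted passage refers to.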

The reason people in economics and political science like these cluster-robust methods so much is because there are good papers showing it generates appropriate rejection rates (CI width) when you do repeated (like a whole bunch) placebo trials on commonly-used data sets (setting the treatment assignment mechanism by the Cluster-by-Period level, or something similar, which is common in the field when analyzing things like state-level policies). That is better evidence of their usefulness than we have for other potential uncertainty estimates, given the kinds of analyses we commonly do.

Honestly, if other people would replicate the “How much do we trust differences-in-differences” paper linked in that same post but using hierarchical models, and show that a) they can get reasonable coverage properties or the appropriate analog using their methods; and/or b) their estimates are more tightly centered on the known (simulated, added-in) treatment effect… that would be influential in this world I think. I honestly don’t understand why there isn’t more work like that. It all seems like a very answerable question, using fairly basic simulation techniques and publicly available data.
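The placebo exercise described above can be sketched on a small scale in plain Python. Everything specific here is an assumption for illustration: the purely simulated data (the papers in question use real data sets), the number and size of clusters, the shared cluster shock, and the 1.96 cutoff.

```python
import math
import random


def slope_and_ses(x, y, cluster):
    """Least-squares slope with classical and cluster-robust (CR0) standard errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    resid = [(yi - y_bar) - slope * d for d, yi in zip(dx, y)]
    classical_se = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)
    totals = {}
    for d, e, g in zip(dx, resid, cluster):
        totals[g] = totals.get(g, 0.0) + d * e
    cluster_se = math.sqrt(sum(t * t for t in totals.values())) / sxx
    return slope, classical_se, cluster_se


def placebo_rejection_rates(n_sims=400, n_clusters=40, cluster_size=10, seed=0):
    """Share of placebo trials in which each interval excludes the true effect of zero."""
    rng = random.Random(seed)
    reject_classical = 0
    reject_cluster = 0
    for _sim in range(n_sims):
        # Placebo treatment assigned to exactly half of the clusters.
        treat = [1.0] * (n_clusters // 2) + [0.0] * (n_clusters - n_clusters // 2)
        rng.shuffle(treat)
        x, y, cluster = [], [], []
        for g in range(n_clusters):
            u_g = rng.gauss(0, 1)  # shock shared by everyone in cluster g
            for _unit in range(cluster_size):
                x.append(treat[g])
                y.append(u_g + rng.gauss(0, 1))  # the true effect is zero
                cluster.append(g)
        slope, se_classical, se_cluster = slope_and_ses(x, y, cluster)
        reject_classical += int(abs(slope) > 1.96 * se_classical)
        reject_cluster += int(abs(slope) > 1.96 * se_cluster)
    return reject_classical / n_sims, reject_cluster / n_sims


classical_rate, cluster_rate = placebo_rejection_rates()
print(f"classical: {classical_rate:.3f}  cluster-robust: {cluster_rate:.3f}")
```

The same loop could in principle wrap a hierarchical model instead, recording interval coverage and how tightly the estimates concentrate around zero, which is the comparison being asked for.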

Well, “least squares” immediately implies “Maximum likelihood under a normal iid sampling model” because the least squares *is* the maximum likelihood for that model, so to say that there is no likelihood already confuses things I think. It’s again a case of people wanting to make a ‘default’ assumption and pretend that because it’s behind the curtain, they’ve made no assumptions at all. But in general I don’t think this is a conscious thing, it’s more like… by not consciously making any particular assumptions people think they’re “OK” especially if someone proved some math to say that their fitting method is “robust against misspecification”
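For reference, the equivalence being invoked can be written out. The notation below (outcomes $y_i$, predictor vectors $x_i$, coefficients $\beta$, error variance $\sigma^2$) is assumed for illustration. If $y_i = x_i^\top \beta + \varepsilon_i$ with $\varepsilon_i$ independent $\mathrm{N}(0, \sigma^2)$, then

$$
\log L(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2 ,
$$

so for any fixed $\sigma^2 > 0$,

$$
\arg\max_{\beta}\, \log L(\beta, \sigma^2) = \arg\min_{\beta} \sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2 .
$$

This shows only that the two estimators coincide under that model. Whether using least squares commits you to the normal model is a separate question.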

The part about “averaging across all g clusters gives a consistent variance estimate” sounds a lot like an unregularized hierarchical model. I’m not really sure.

In general, I don’t think “getting reasonable coverage properties under repeated simulations of fake effects” should be the goal… my goal would be something like: regularly put relatively high density on the true value under fake data simulations… the frequency with which the true value is in a given conventional interval is not that interesting to me

In a multivariate situation, it’d be something like: always have the true vector be in the high probability region / typical set of the posterior.

Someone should run such models on the diff-in-diff paper mentioned above and show that the high-density regions produced by such estimates are more tightly centered around zero than similar-coverage confidence intervals (or whatever precision estimate you want to compare it to). I think that something actually showing the precision gains of these models in real-world environments would go a long ways to convincing people of their value. But maybe those papers are out there and I just don’t know them.

For the record, I’m not advocating using the exact width of the resulting CI’s as some firm and final “test” of “statistical significance”. I just think these are methods of estimating standard errors that seem to, at least in some sense, work out in the wild. If people show me that other methods will work better for the kinds of analyses I want to do, I’ll probably start using those. Which is why I think comparing the precision of the methods and discussing clearly the trade-offs between adding assumptions and being agnostic is the way to move empirical practice forward in this case.

Basically I agree with you on the need for papers using Bayesian / Hierarchical models in typical Econ or other Social Sci contexts. But ultimately I think the advantages of Bayesian models aren’t that they give tighter intervals in the common cases in use today… it’s really that they provide a very interpretable and understandable method to fit much better kinds of models. This is what I was doing about six months ago with housing / food / cost of living across time and space at the Census PUMA level using Stan. We should get together and talk about that stuff, because I put it on hold while many other things happened, but I still think it’s useful, but requires some of the tools for model debugging that have been developed over the last say 6 months on the Stan Discourse site and etc. The model was hard to get to fit without divergences but the fits looked very reasonable when it did run.

Oh…my argument was pre-supposing the appropriate and useful application of this class of spatio-temporal difference-in-difference models and their carefully-considered, context-specific epistemological grounding in the potential outcomes framework. But of course if you wanna find absurdity in the way people employ that logic in the real-world, it isn’t all that difficult to find*.

Also let’s chat soon, as long as you promise not to discuss anything you learned on the Stan listserv and keep the discussion dumbed down to the economics of it all and the link between the model and the real world. Sounds fun.

*n.b. – that seems to me true of basically every discipline and their relationship with their statistical methods.

Replying to Daniel:

“Well, “least squares” immediately implies “Maximum likelihood under a normal iid sampling model” because the least squares *is* the maximum likelihood for that model.”

This is fallacious. Normal i.i.d. sampling and MLEs are one way to motivate least squares, but there are multiple others. (For example, average pairwise slope: https://brenocon.com/blog/2008/04/a-regression-slope-is-a-weighted-average-of-pairs-slopes/) Your conclusion is like insisting that someone *must* be making an omelette because you saw them buy some eggs.
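The pairwise-slope motivation is easy to check numerically. The sketch below, in plain Python with made-up data and function names chosen for illustration, compares the usual least-squares slope with the average of all pairwise slopes, each weighted by the squared difference in x for that pair.

```python
from itertools import combinations


def ols_slope(x, y):
    """Least-squares slope of y on x (with an intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def pairwise_weighted_slope(x, y):
    """Average of the slopes through each pair of points, weighted by (x_j - x_i)^2."""
    num = 0.0
    den = 0.0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx = xj - xi
        if dx == 0:
            continue  # such a pair has weight zero and an undefined slope
        weight = dx * dx
        num += weight * (yj - yi) / dx
        den += weight
    return num / den


# Made-up data for illustration.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 1.0, 5.0, 6.0]
print(ols_slope(x, y), pairwise_weighted_slope(x, y))
```

The two numbers agree up to floating-point error, and no distributional assumption enters the second calculation.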

Least squares is the MLE of something. Saying that people aren’t thinking of it as a likelihood doesn’t mean that the method isn’t a special case of maximum likelihood estimation, or that the mathematical theory of MLEs doesn’t apply. The sandwich estimator article linked above discusses robust SE methods with respect to MLE point estimators. The mathematics applies in least squares conditions because there is a likelihood that has the Fisher information properties discussed in the article.

Daniel: theory of MLEs that relies on parametric model assumptions does not apply, outwith those assumptions. But the rather more general estimating equation theory, that robust SEs are built on, does apply. These equation-based estimates need not be MLEs for any parametric model, MLEs are just a specific case.
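A compact statement of the estimating-equation view may help here. The notation is assumed for illustration, and regularity conditions are omitted. The estimate $\hat\theta$ solves

$$
\sum_{i=1}^{n} \psi(y_i, x_i; \hat\theta) = 0 ,
$$

and, under suitable conditions,

$$
\sqrt{n}\,(\hat\theta - \theta_0) \;\to\; \mathrm{N}\!\left(0,\; A^{-1} B \,(A^{-1})^{\top}\right), \qquad
A = -\,\mathrm{E}\!\left[\frac{\partial \psi}{\partial \theta^{\top}}\right], \quad
B = \mathrm{E}\!\left[\psi\,\psi^{\top}\right],
$$

with both expectations evaluated at $\theta_0$. For least squares, $\psi = x_i\,(y_i - x_i^\top \beta)$, so $A = \mathrm{E}[x_i x_i^\top]$ and $B = \mathrm{E}[\varepsilon_i^2\, x_i x_i^\top]$, and plugging in residuals $\hat e_i$ gives the familiar sandwich form

$$
\widehat{\mathrm{Var}}(\hat\beta) = (X^\top X)^{-1} \left(\sum_{i=1}^{n} \hat e_i^{\,2}\, x_i x_i^\top\right) (X^\top X)^{-1} .
$$

A score function from a parametric likelihood is one possible choice of $\psi$, but nothing in the display requires it.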

I take three points away from Aronow’s paper:

1. There exists a statistical modeling framework within which robust standard errors arise naturally, and this framework is the semi-parametric framework. (For those who want a quick read to see this framework in action, see lecture 3 here: http://cyrussamii.com/?page_id=2426 For those wanting the deeper treatment, Halbert White’s or Peter Huber’s work are good places to start.)

2. There exist arguments for why the semi-parametric framework is attractive relative to parametric alternatives. These arguments are quite deep in terms of their underlying theory. In referencing Manski’s “law,” Aronow was using a shorthand to point to a rich set of ideas that deserve more thorough engagement than what can be done in trying to parse a short expression. The example that Aronow provides in section 2.4 is also helpful in illustrating these arguments. Finally, these arguments are also based on empirical assessments, as Aronow notes in referencing Angrist & Pischke and Imbens & Kolesar. Z’s comment above (referencing the line of research on diff-in-diff methods kicked off by Bertrand et al., and carried forward by the likes of Cameron, Gelbach, and Miller) is also in line with this notion of “proof in the [empirical] pudding.”

3. Most importantly, Aronow’s note is proposing that King and Roberts do not give adequate consideration to points 1 and 2 in their paper, and by such a crucial omission, the King and Roberts paper has the potential to cause confusion.

Andy – Now, in reading your post, I get the sense that you already recognized that this was what Aronow was saying. But I also got the sense that you took Aronow’s effort to “set things straight” as an implicit invitation to debate the notions of “robust” inference. If that is the goal, then I would say the focus of your critique shouldn’t be Aronow’s paper, which does not attempt to go into depth with respect to such a debate, but rather the work that appears in Aronow’s bibliography (e.g., Angrist & Pischke, Goldberger, Hansen, Hayashi, Manski, Stock & Watson, Wooldridge, or White).

(Disclosure — Aronow and I are coauthors on a number of papers, including ones that derive robust standard errors.)