Gaurav Sood writes:

There are legions of permutation-based methods that permute the values of a feature to determine whether the variable should be added to the model (e.g., the Boruta algorithm) or to estimate its importance. I couldn’t reason for myself why that is superior to just dropping the feature and checking how much worse the fit is, or what have you. Do you know why permuting values may be superior?

Here’s the feature importance based on permutation: “We measure the importance of a feature by calculating the increase in the model’s prediction error after permuting the feature. A feature is “important” if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction.”

From here: https://christophm.github.io/interpretable-ml-book/feature-importance.html

Another way to get at variable importance: estimate the model with all the variables, estimate the model after nuking the variable, and see the change in MSE, etc.

Under what circumstance would permuting the variable be better? Permuting, because of chance alone, would create some stochasticity in learning, no? There is probably some benefit in run time for RF as you permute xs after the model is estimated. But statistically, is it better than estimating variable importance through simply nuking a variable?

My reply:

I guess people like permutation testing because it’s nonparametric? I’m probably the wrong person to ask because this is not what I do either. Some people think of permutation testing as being more pure than statistical modeling. Even though there’s generally no real justification for the particular permutations being used (see section 3.3 of this paper).

That doesn’t mean that permutation testing is bad in practice: lots of things work even though their theoretical justification is unclear, and, conversely, lots of things have seemingly strong theoretical justification but have serious problems when applied in practice.

Gaurav followed up by connecting to a different issue:

It is impressive to me that we produce so much statistical software with so little theoretical justification.

There is then the other end of statistical software which doesn’t include any guidance for known-known errors:

Most word processing software helpfully point out grammatical errors and spelling mistakes. Some even autocorrect. And some, like Grammarly, even give style advice.

Now consider software used for business statistics. Say you want to compute the correlation between two vectors: [100, 2000, 300, 400, 500, 600] and [1, 2, 3, 4, 5, 17000]. Most (all?) software will output .65. (Most—all?—software assume you want Pearson’s.) Experts know that the relatively large value in the second vector has a large influence on the correlation. For instance, switching it to -17000 will reverse the correlation coefficient to -.65. And if you remove the last observation, the correlation is 1. But a lay user would be none the wiser. Common software, e.g., Excel, R, Stata, Google Sheets, etc., do not warn the user about the outlier and its potential impact on the result. They should.

Take another example—the fickleness of the interpretation of AUC when you have binary predictors (see here), as much depends on how you treat ties. It is an obvious but subtle point. But commonly used statistical software does not warn people about the issue, and I am sure a literature search will turn up multiple papers that fall prey to it.
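The tie issue is easy to demonstrate. Here is a toy sketch (made-up data, not from any cited paper) contrasting two tie conventions for the rank-based AUC, computed as the probability that a random positive outranks a random negative:

```python
# Toy illustration of how tie handling changes AUC when the predictor is
# binary: counting ties as losses vs. counting them as half-wins.
import numpy as np

score = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # binary predictor
label = np.array([1, 1, 0, 0, 1, 0, 1, 0])  # binary outcome

pos = score[label == 1]
neg = score[label == 0]

# Enumerate all positive/negative pairs.
strict = np.mean([p > n for p in pos for n in neg])                   # ties lose
half = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])  # ties half-count

print(strict, half)  # 0.5625 vs 0.75 on the same data
```

Same data, same model, two defensible-sounding AUCs. This is exactly the kind of discrepancy software could flag.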

Given the rate of increase in the production of knowledge, increasingly everyone is a lay user. For instance, in 2013, Lin showed that estimating the ATE using OLS with a full set of treatment-covariate interactions improves the precision of the ATE estimate. But such analyses are uncommon in economics papers. The analyses could be absent for a variety of reasons: 1. ignorance, 2. difficulty, 3. disputing the result, etc. But only ignorance stands up to scrutiny. The model is easy to estimate, so the second explanation is unlikely to explain much. The third also seems unlikely, given that the result was published in a prominent statistics journal and experts use it. And while we cannot be sure, ignorance is likely the primary explanation. If ignorance is the primary reason, should the onus of staying well informed about the latest useful discoveries in methods be on a researcher working in a substantive area? Plausibly. But that is clearly not working very well. One way to accelerate dissemination is to provide such guidance as ‘warnings’ in commonly used statistical software.
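The Lin-style estimator really is easy to compute. Here is a minimal sketch on simulated data (all variable names and the simulation setup are illustrative, not from Lin's paper): regress the outcome on treatment, the demeaned covariate, and their interaction; the treatment coefficient estimates the ATE.

```python
# Sketch of an interacted-OLS ATE estimate on simulated data.
# True ATE here is 0.5 by construction.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)                  # pre-treatment covariate
t = rng.integers(0, 2, size=n)          # randomized binary treatment
y = 1.0 + 0.5 * t + x + 0.8 * t * x + rng.normal(size=n)

xc = x - x.mean()                       # demean the covariate
A = np.column_stack([np.ones(n), t, xc, t * xc])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
ate_hat = coef[1]                       # treatment coefficient, close to 0.5
print(ate_hat)
```

The model is one `lstsq` call, which supports the point that difficulty is not what keeps it out of applied papers.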

I agree. This fits in with the whole workflow thing, where we recognize that we’ll be fitting lots of models, many of which will be horribly inappropriate to the task. Better to recognize this ahead of time rather than starting with the presumption that everything you’re doing is correct, and then spending the rest of your time scrambling to defend all your arbitrary decisions.

I am not sure how this will work. Won’t software makers have to wade into arguments about the appropriateness of every method for every edge case? For example, do we show an alert for use of a linear probability model? Yes/no? If so, when? Always? Sometimes, but only if the base proportions are close to 0 or 1? How close? Not only will there be an entire extra layer to code, debug, and maintain, but also judgment calls which I imagine will be seen as either insufficient or paternalistic. Just getting the ASA to come up with a statement about the use of p-values caused extreme controversy. Do we really want software to make similar suggestions, suggestions that will of course be totally different depending on the software, across the board?

Matt:

In Stan we have pedantic mode which spits out warnings, so there’s nothing stopping you from fitting the model you really want to fit, if you really want to do it.

+1

The benefit of permuting vs removing the variable is that if your method is liable to overfit, removal of a spurious variable will still cause an increase in MSE – potentially a large increase in MSE, whereas permutation will on average not. Of course if the method you are using is well known and you aren’t worried about this, or if you have a good performance metric that adjusts for this, then it doesn’t matter so much. But permutation is, I think, generally quite robust.
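To make the contrast concrete, here is a small sketch (simulated data, with OLS standing in for an arbitrary learner; all names are illustrative) computing both importance measures for one informative and one spurious feature. With a rigid model like OLS both measures behave; the overfitting concern above bites mainly with flexible learners.

```python
# Permutation importance (shuffle a column, reuse the fitted model) vs.
# drop-and-refit importance (remove the column, refit from scratch),
# measured as the increase in MSE over the baseline fit.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)                  # informative feature
x2 = rng.normal(size=n)                  # spurious feature
y = 2.0 * x1 + rng.normal(size=n)
X = np.column_stack([x1, x2])

def fit_ols(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(coef, X, y):
    A = np.column_stack([np.ones(len(X)), X])
    return np.mean((y - A @ coef) ** 2)

coef = fit_ols(X, y)
base = mse(coef, X, y)

importances = {}
for j, name in enumerate(["x1", "x2"]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])        # permutation importance
    perm = mse(coef, Xp, y) - base
    Xd = np.delete(X, j, axis=1)                # drop-and-refit importance
    drop = mse(fit_ols(Xd, y), Xd, y) - base
    importances[name] = (perm, drop)

print(importances)  # x1 large both ways; x2 near zero both ways
```

Note also the computational asymmetry: permutation reuses the single fitted model, while drop-and-refit requires one refit per feature.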

Hi Andrew:

Another great post! For what it is worth, the idea of the permutations in variable selection is to generate the false positive distribution, since you “know” they are garbage. Beyond permutations, you can also do this with simple noise variables or more sophisticated knockoffs. However, I have to put a plug in for a new method that blows all of these away: Thompson sampling…

https://doi.org/10.1080/01621459.2021.1928514

I think you missed one of Gaurav’s links that he had in parentheses (or he forgot the link):

“Take another example—the fickleness of the interpretation of AUC when you have binary predictors (see here)”.

Rodney

> One way to accelerate dissemination is to provide such guidance as ‘warnings’ in commonly used statistical software.

I doubt there’s so much agreement on the usefulness of various discoveries/development and whatnot that this would ever happen (or should really — like there’s a lot happening and it’s hard to really pay attention to it all)

> Common software, e.g., Excel, R, Stata, Google Sheets

> where we recognize that we’ll be fitting lots of models, many of which will be horribly inappropriate to the task. Better to recognize this ahead of time

My guess is the way forward on things like this is recognition that we’ll be writing lots of programs, many of which will be horribly inappropriate to the task. Better to recognize this ahead of time and start writing new ones than to try to patch a bunch of extra ideas into the existing ones.

At least in the example given, of a correlation coefficient at the mercy of one datapoint, I wonder if a more general approach would be for the software to always output the uncertainty in the correlation coefficient along with its value. (If r is important, people using it should provide the uncertainty of their estimate anyway!) At least doing a very quick jackknife estimate of uncertainty, I think this gives about 0.15 in the +17000 case, and 0.7 (!) in the -17000 case, which I would hope would trigger some red lights in the user’s mind.

Of course, it’s not as good as actually thinking about the data…
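The quick jackknife can be sketched as follows. (I use 200 rather than 2000 for the second x value, since a later comment notes the printed 2000 appears to be a typo; the exact standard error also depends on which jackknife variant you use, so the number below differs somewhat from the 0.15 quoted above.)

```python
# Leave-one-out (jackknife) uncertainty for the Pearson correlation.
# Data: the post's example, with 200 in place of the apparent typo 2000.
import numpy as np

def pearson_r(x, y):
    return np.corrcoef(x, y)[0, 1]

def jackknife_se(x, y):
    n = len(x)
    loo = np.array([pearson_r(np.delete(x, i), np.delete(y, i))
                    for i in range(n)])
    return np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))

x = np.array([100.0, 200, 300, 400, 500, 600])
y = np.array([1.0, 2, 3, 4, 5, 17000])

print(pearson_r(x, y), jackknife_se(x, y))
```

Either way, the standard error is huge relative to r itself, which is the red light the commenter has in mind.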

Check out the rvars in posterior: https://mc-stan.org/posterior/articles/rvar.html

Not all software is the same – they all provide some guidance and warnings and some do a better job than others (I have not performed an evaluation, but I tend to like the way JMP handles things more than most). Users should include this in their evaluation of alternative programs. However, I doubt it is feasible, or desirable, to try to include all such warnings. For example, I don’t like fitting a linear regression with a binary response variable, but it often works fine and almost as well as logistic regression. So many things are case-specific and depend on the particular data set. For that matter, I think the grammar suggestions in word processing programs have the same issue – they are fairly good, but far from perfect. I think there is less danger with inappropriate warnings in word processing software than statistical software, but I think it is a matter of degree, not of kind.

This reminds me of a recent article I read for software developers on an AI Tool announced by Microsoft that can recommend code as developers write. “CLIPPY”

Statistical software aiding in model building is one reason I appreciate Stan so much. For example, in a model I used recently, accounting for heterogeneity with multiple levels removed divergences, improved speed, and extracted effects that may prove useful in one of my contracts.

While I understand that the intentions are good*, efforts to get software—or, as in the other thread, large scientific societies—to make default recommendations are really attempts to cure the symptom, rather than the disease. As I see it, the “disease” is a lack of fluency in translating between concepts and models. Building a model is a series of interconnected choices, and a good model is one in which each choice is justified by its connection to observables or to theoretical constructs. For example, using a binomial likelihood (or normal, or Poisson, etc.) is justified by the scale on which an observable quantity is measured. Using an exponential is justified by a belief (which may or may not be true) that some process operates in a proportional way. This is what I mean by “translate”: that each element of a model has some clear meaning in the context of its application.

Typical stats education is in terms of a series of tricks that worked well in specific prior applications. When these tricks work well enough in enough cases, they become standards or “defaults” (as Andrew has said, statistics is a “science of defaults”). Then many people applying statistics, when they are at their best, try to find a default method that has been applied in situations that are similar to the one they are currently dealing with. At worst, they just pick a common option and throw it at the problem.

It sounds like the software warnings might deter this “worst” option, but it is not clear to me that the “best” case is really much better. In both cases, most of the choices that went into the model were picked by “default” or “convention” and are totally unconnected to the application. This can be fine if the model is just meant as a starting point, but most applied stats usually don’t further refine the structure of the model. Instead, they focus on things like parameter selection, as in the example above.

I can also envision a case where the warnings actually make things worse. If the warnings themselves become “defaults” or “standards”, then people will keep fitting arbitrary models until they don’t get a warning any more. They will then point to the lack of warning and say, “look, my model is entirely justified and you must trust my conclusions because it was not explicitly discouraged.” It becomes a negative version of the “find the asterisk” game of statistical significance. And, again, the choices of the model were never justified by their connection to observables or theory in an application, only by the degree to which the model conforms to some conventional default/standards. Any meaning the model might have is accidental.

*oh Lord, please don’t let me be misunderstood

Nicely put.

Also strategies like bootstrap, resampling and permutation just switch to an implicit finite probability model that often is not even thought about.

One of Don Rubin’s early criticism of the bootstrap was about how silly the implicit model was for applications. However, even very wrong models can be useful for some purposes.

> Most word processing software helpfully point out grammatical errors and spelling mistakes. Some even autocorrect. And some, like Grammarly, even give style advice.

I wonder if this analogy is intentionally chosen as the seed of destruction? (1) Despite spelling and grammar checks in software, people make plenty of mistakes and don’t bother to correct them, just as many envision that warnings will be ignored. (2) Often the meaning remains clear even if spelling and grammar mistakes are abundant, which may explain why the suggestions are ignored. (3) Double negatives aren’t necessarily wrong but may completely obscure the meaning the author intended. Spelling and grammar checks will not catch logical inconsistencies such as ‘this sentence is false,’ or ‘the previous sentence is true.’

My understanding is that permuting features (rather than refitting the model) is done because it is much, much faster. So the reason is computational. Tracking out-of-sample error would be better, but would require refitting the model for every subset of features you compare (up to 2^p models). That is, “There is probably some benefit in run time for RF as you permute xs after the model is estimated” is the full answer.

Huh? The Pearson correlation for the data you printed is -.036. I think you mistyped the second x value as 2000 instead of 200. Changing that value to 200 yields a correlation of .655.
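With that correction, the post's three claims check out numerically. A quick sketch:

```python
# Checking the post's correlation claims, using 200 (not 2000) for the
# second x value as this comment suggests.
import numpy as np

x = np.array([100.0, 200, 300, 400, 500, 600])
y = np.array([1.0, 2, 3, 4, 5, 17000])

r = np.corrcoef(x, y)[0, 1]                                 # ≈ 0.65
r_flip = np.corrcoef(x, np.append(y[:-1], -17000.0))[0, 1]  # ≈ -0.65
r_drop = np.corrcoef(x[:-1], y[:-1])[0, 1]                  # = 1.0
print(r, r_flip, r_drop)
```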

A few years ago I wrote a Java statistics program called Advisor that did warn about outliers, etc. For regression alone, Advisor examined residuals and considered dozens of regression models before printing a “Noteworthy” result. It didn’t just do the usual residual tests. Advisor also looked for clumpiness in the residuals, unusual serial dependencies, predominance of zeros (when considering zero-inflated models), etc. When encountering missing values, the program computed multiple imputations — not just for regression models, but for every model in the software (MANOVA, cluster analysis, etc.)

One thing Advisor does for these (corrected) data is to print the discrepancy between the Pearson and Spearman correlations. For large enough batches of data, it raises an alarm if this discrepancy is large in absolute value.
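That check is easy to reproduce in a few lines. This is a hand-rolled sketch on the corrected example data, not Advisor's actual code (and the simple argsort-based ranks below are only valid because this data has no ties):

```python
# Flagging possible outlier influence via the Pearson vs. Spearman gap.
import numpy as np

def spearman_r(x, y):
    # Spearman = Pearson correlation of the ranks (no-ties case).
    rank = lambda v: np.argsort(np.argsort(v))
    return np.corrcoef(rank(x), rank(y))[0, 1]

x = np.array([100.0, 200, 300, 400, 500, 600])
y = np.array([1.0, 2, 3, 4, 5, 17000])

pearson = np.corrcoef(x, y)[0, 1]   # ≈ 0.65
spearman = spearman_r(x, y)         # = 1.0 (the data are perfectly monotone)
print(abs(pearson - spearman))      # a large gap suggests outlier influence
```

The large gap comes entirely from the one extreme y value, which is exactly the situation the alarm is meant to catch.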

The first package I wrote (SYSTAT) also provided regression warnings based on serial correlation and other statistics. However, Advisor was a research project to see if a computer program could provide a comparable level of model diagnosis as an expert statistician. I ran it against several friends/colleagues like Howard Wainer, Jerry Dallal, and Gunther Zawitzki, including some of their final exam questions in their graduate statistics courses.

I agree with the conclusion of this post. It’s about time data science packages started giving a second opinion.