Gaurav Sood writes:

There are legions of permutation-based methods that permute the values of a feature, either to decide whether the variable should be included (e.g., the Boruta algorithm) or to measure its importance. I couldn’t work out for myself why that is superior to just dropping the feature and checking how much worse the fit gets. Do you know why permuting values may be superior?

Here’s the feature importance based on permutation: “We measure the importance of a feature by calculating the increase in the model’s prediction error after permuting the feature. A feature is “important” if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction.”

From here: https://christophm.github.io/interpretable-ml-book/feature-importance.html

Another way to get at variable importance: estimate the model with all the variables, re-estimate it after nuking the variable, and see how much the MSE (or some other measure of fit) worsens.

Under what circumstances would permuting the variable be better? Permuting would, by chance alone, add some stochasticity to the importance estimate, no? There is probably some run-time benefit for random forests, since you permute the x’s after the model is estimated and so never have to refit it. But statistically, is it better than estimating variable importance by simply nuking a variable?
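To make the comparison concrete, here is a minimal sketch of the two approaches in plain Python. Everything in it is an assumption made for illustration: the simulated data (two nearly collinear predictors, `x1` and `x2`), the use of ordinary least squares instead of a random forest, in-sample MSE as the error measure, and the helper names `solve`, `fit_ols`, `mse`, and `with_intercept`. It is not how Boruta or any particular library computes importance.

```python
import random

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b_ for a, b_ in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_ols(X, y):
    """Least-squares coefficients from the normal equations X'X beta = X'y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def mse(X, y, beta):
    """Mean squared error of the linear predictions X beta."""
    errs = [yi - sum(b * v for b, v in zip(beta, row)) for row, yi in zip(X, y)]
    return sum(e * e for e in errs) / len(errs)

def with_intercept(*cols):
    """Build design-matrix rows [1, col1, col2, ...]."""
    return [[1.0, *vals] for vals in zip(*cols)]

# Hypothetical data: y depends on x1 only; x2 is a noisy copy of x1.
rng = random.Random(0)
n = 500
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + rng.gauss(0, 0.1) for a in x1]
y = [a + rng.gauss(0, 0.5) for a in x1]

X_full = with_intercept(x1, x2)
beta = fit_ols(X_full, y)
base_mse = mse(X_full, y, beta)

# Permutation importance of x1: shuffle x1, keep the fitted model as is.
# Each shuffle gives a different answer, so average over several repeats.
perm_increases = []
for _ in range(20):
    shuffled = x1[:]
    rng.shuffle(shuffled)
    perm_increases.append(mse(with_intercept(shuffled, x2), y, beta) - base_mse)
perm_importance = sum(perm_increases) / len(perm_increases)

# Drop-column importance of x1: refit without x1 and compare the fit.
X_drop = with_intercept(x2)
drop_importance = mse(X_drop, y, fit_ols(X_drop, y)) - base_mse

print("permutation:", perm_importance, "drop-column:", drop_importance)
```

In this setup the permutation score for `x1` should come out far larger than the drop-column score, because once `x1` is dropped the refitted model can lean on `x2` instead. So the two numbers answer somewhat different questions: how much this fitted model relies on the feature, versus how much the feature adds beyond the others. The spread across the 20 shuffles is the stochasticity raised in the question.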

My reply:

I guess people like permutation testing because it’s nonparametric? I’m probably the wrong person to ask because this is not what I do either. Some people think of permutation testing as being more pure than statistical modeling, even though there’s generally no real justification for the particular permutations being used (see section 3.3 of this paper).

That doesn’t mean that permutation testing is bad in practice: lots of things work even though their theoretical justification is unclear, and, conversely, lots of things have seemingly strong theoretical justification but have serious problems when applied in practice.

Gaurav followed up by connecting to a different issue:

It is impressive to me that we produce so much statistical software with so little theoretical justification.

There is then the other end of statistical software which doesn’t include any guidance for known-known errors:

Most word processing software helpfully points out grammatical errors and spelling mistakes. Some even autocorrects. And some, like Grammarly, even gives style advice.

Now consider software used for business statistics. Say you want to compute the correlation between two vectors: [100, 200, 300, 400, 500, 600] and [1, 2, 3, 4, 5, 17000]. Most (all?) software will output .65. (Most, perhaps all, software assumes you want Pearson’s.) Experts know that the relatively large value in the second vector has a large influence on the correlation. For instance, switching it to -17000 will flip the correlation coefficient to -.65. And if you remove the last observation, the correlation is 1. But a lay user would be none the wiser. Common software, e.g., Excel, R, Stata, Google Sheets, etc., does not warn the user about the outlier and its potential impact on the result. It should.
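The arithmetic is easy to check. The sketch below takes the first vector to be 100, 200, ..., 600, which is the version that reproduces the reported .65 and the perfect correlation once the last observation is removed. The function `pearson_with_warning` and its `tol` threshold are hypothetical, just one crude way a package could flag a result that hinges on a single observation. No existing software is claimed to work this way.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def pearson_with_warning(x, y, tol=0.2):
    """Return (r, flag); flag is True if deleting any one point moves r by more than tol.

    The leave-one-out check and the default tol are illustrative assumptions.
    """
    r = pearson(x, y)
    loo = [pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(len(x))]
    flag = max(abs(r - v) for v in loo) > tol
    return r, flag

x = [100, 200, 300, 400, 500, 600]
y = [1, 2, 3, 4, 5, 17000]

r, flag = pearson_with_warning(x, y)
r_flipped = pearson(x, y[:-1] + [-17000])   # switch the outlier's sign
r_trimmed = pearson(x[:-1], y[:-1])         # drop the last observation

print(round(r, 2))          # 0.65
print(round(r_flipped, 2))  # -0.65
print(round(r_trimmed, 2))  # 1.0
print(flag)                 # True: one observation drives the result
```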

Take another example: the interpretation of AUC is fickle when you have binary predictors (see here), as much depends on how you treat ties. It is an obvious but subtle point. But commonly used statistical software does not warn people about the issue, and I am sure a literature search will bring up multiple papers that fall prey to it.
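A small sketch shows how much room ties leave. It computes AUC as the share of (positive, negative) pairs that the predictor ranks correctly, with a tied pair earning `tie_credit`. The eight observations and the `tie_credit` parameter are made up for illustration and are not taken from the linked post. Half credit for ties is the usual convention, but the sketch also shows the two extremes.

```python
def auc(scores, labels, tie_credit=0.5):
    """Share of (positive, negative) pairs ranked correctly; tied pairs earn tie_credit."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    total = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                total += 1.0
            elif p == q:
                total += tie_credit
    return total / (len(pos) * len(neg))

# Hypothetical data: a binary predictor, so most pairs are either won or tied.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [1, 1, 1, 0, 1, 0, 0, 0]

auc_half = auc(scores, labels, tie_credit=0.5)   # ties count as half
auc_none = auc(scores, labels, tie_credit=0.0)   # ties count as wrong
auc_full = auc(scores, labels, tie_credit=1.0)   # ties count as right

print(auc_half, auc_none, auc_full)  # 0.75 0.5625 0.9375
```

Of the 16 pairs here, 9 are ranked correctly, 1 is ranked incorrectly, and 6 are tied, so the same data support anything from 0.56 to 0.94 depending on the tie rule.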

Given the rate of increase in the production of knowledge, increasingly everyone is a lay user. For instance, in 2013, Lin showed that estimating the ATE using OLS with a full set of treatment-by-covariate interactions improves the precision of the estimate. But such analyses are uncommon in economics papers. The analyses could be absent for a variety of reasons: (1) ignorance, (2) difficulty, (3) disagreement with the result, etc. But only ignorance withstands scrutiny. The model is easy to estimate, so the second explanation is unlikely to explain much. The last explanation also seems unlikely, given that the result was published in a prominent statistical journal and experts use it. And while we cannot be sure, ignorance is likely the primary explanation. If ignorance is the primary reason, should the onus of being well informed about the latest useful discoveries in methods be on a researcher working in a substantive area? Plausibly. But that is clearly not working very well. One way to accelerate dissemination is to provide such guidance as ‘warnings’ in commonly used statistical software.
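To show how little is involved, here is a rough sketch of an interacted estimator on simulated data, compared with a plain difference in means. It is not Lin’s code. The data-generating process, sample sizes, and function names are all assumptions for illustration. With one covariate centered at its overall mean, the coefficient on treatment in the fully interacted OLS equals the difference between two arm-specific regression lines evaluated at that mean, which is what `interacted_ate` computes.

```python
import random
import statistics

def slope(x, y):
    """Simple-regression slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def interacted_ate(x, t, y):
    """Treatment coefficient from OLS of y on t, (x - xbar), and t * (x - xbar).

    Computed via the equivalent form: fit a line in each arm and
    difference the two predictions at the overall covariate mean.
    """
    xbar = statistics.fmean(x)
    at_xbar = {}
    for arm in (0, 1):
        xa = [a for a, g in zip(x, t) if g == arm]
        ya = [b for b, g in zip(y, t) if g == arm]
        at_xbar[arm] = statistics.fmean(ya) + slope(xa, ya) * (xbar - statistics.fmean(xa))
    return at_xbar[1] - at_xbar[0]

def diff_in_means(t, y):
    """Unadjusted estimate: treated mean minus control mean."""
    y1 = [b for b, g in zip(y, t) if g == 1]
    y0 = [b for b, g in zip(y, t) if g == 0]
    return statistics.fmean(y1) - statistics.fmean(y0)

def simulate(rng, n=200):
    """Hypothetical experiment: true ATE of 2, covariate slope differs by arm."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    t = [1] * (n // 2) + [0] * (n - n // 2)
    rng.shuffle(t)  # complete randomization: half treated, half control
    y = [(2 + 3 * a if g else a) + rng.gauss(0, 1) for a, g in zip(x, t)]
    return x, t, y

rng = random.Random(1)
lin_est, dim_est = [], []
for _ in range(200):
    x, t, y = simulate(rng)
    lin_est.append(interacted_ate(x, t, y))
    dim_est.append(diff_in_means(t, y))

# Spread of each estimator across the simulated experiments.
sd_lin = statistics.stdev(lin_est)
sd_dim = statistics.stdev(dim_est)
mean_lin = statistics.fmean(lin_est)
print("interacted sd:", sd_lin, "difference-in-means sd:", sd_dim)
```

Under these made-up settings the interacted estimates should vary noticeably less from experiment to experiment than the unadjusted ones, which is the sense in which the adjustment improves precision.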

I agree with Gaurav that software should provide this kind of guidance. This fits in with the whole workflow thing, where we recognize that we’ll be fitting lots of models, many of which will be horribly inappropriate to the task. Better to recognize this ahead of time rather than starting with the presumption that everything you’re doing is correct, and then spending the rest of your time scrambling to defend all your arbitrary decisions.