Why we hate stepwise regression

Haynes Goddard writes:

I have been slowly working my way through the grad program in stats here, and the latest course was a biostats course on categorical and survival analysis. I noticed in the semi-parametric and parametric material (Wang and Lee is the text) that they use stepwise regression a lot.

I learned in econometrics that stepwise is poor practice, as it defaults to the “theory of the regression line”, that is no theory at all, just the variation in the data.

I don’t find the topic on your blog, and wonder if you have addressed the issue.

Stepwise regression is one of these things, like outlier detection and pie charts, which appear to be popular among non-statisticians but are considered by statisticians to be a bit of a joke. For example, Jennifer and I don’t mention stepwise regression in our book, not even once.

To address the issue more directly: the motivation behind stepwise regression is that you have a lot of potential predictors but not enough data to estimate their coefficients in any meaningful way. This sort of problem comes up all the time; here’s an example from my research, a meta-analysis of the effects of incentives in sample surveys.

The trouble with stepwise regression is that, at any given step, the model is fit using unconstrained least squares. I prefer methods such as factor analysis or lasso that group or constrain the coefficient estimates in some way.
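One way to see the difference between "unconstrained" and "constrained" estimates is a small sketch for the special case of orthonormal predictors (an assumption made here purely for illustration). In that case the lasso, written as minimizing half the residual sum of squares plus `lam` times the sum of absolute coefficients, shrinks each least-squares coefficient toward zero by `lam`, while a keep-or-drop rule of the stepwise kind leaves the survivors at their full least-squares values. The function names, the coefficients, and the cutoff of 1.0 are all hypothetical.

```python
def soft_threshold(b, lam):
    """Lasso-style estimate for one coefficient under an orthonormal design:
    shrink the least-squares value b toward zero by lam, stopping at zero."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def hard_threshold(b, cut):
    """Keep-or-drop rule: the coefficient is either removed entirely or
    kept at its unconstrained least-squares value."""
    return b if abs(b) > cut else 0.0


# Hypothetical least-squares coefficients for four predictors.
ols = [3.0, -0.5, 1.2, -2.0]

lasso_like = [soft_threshold(b, 1.0) for b in ols]
stepwise_like = [hard_threshold(b, 1.0) for b in ols]

for b, s, h in zip(ols, lasso_like, stepwise_like):
    print(f"ols={b:+.2f}  shrunk={s:+.2f}  keep_or_drop={h:+.2f}")
```

Both rules drop the small coefficient, but only the shrinkage rule also pulls the retained coefficients toward zero, which is the "constrain the coefficient estimates" part.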

67 thoughts on “Why we hate stepwise regression”

1. “The trouble with stepwise regression is that, at any given step, the model is fit using unconstrained least squares. I prefer methods such as factor analysis or lasso that group or constrain the coefficient estimates in some way.”

As a wanna-be statistician, I’d be greatly indebted if you could explicate this further, or provide a reference where I can read more.

Thanks.

• Google sums it up quite nicely; search “outlier detection” and see how long it takes you to find the phrase “data mining.”

• Is this a matter of phrasing? For example, outlier detection has reasonable uses – statistical process control leaps to mind as an area where, at least conceptually, what we’re doing is trying to detect events that don’t fit a model. Seems pretty legitimate to me, but I’m not a pro statistician.

• That’s what I was thinking. Flagging outliers seems fairly common and quite useful practically in a lot of settings.

• Rahul:

Outlier detection can be a good thing. The problem is that non-statisticians seem to like to latch on to the word “outlier” without trying to think at all about the process that creates the outlier. Also, some textbooks have rules that look stupid to statisticians such as myself, rules such as labeling something as an outlier if it is more than some number of sd’s from the median, or whatever. The concept of an outlier is useful but I think it requires context—if you label something as an outlier, you want to try to get some sense of why you think that.

• I think the distinction is outlier detection vs outlier rejection. In my work, I use exactly the method you describe – I check whether servers are healthy by frequently checking response times for certain actions, and if I get several readings in a row that are more than 5 standard deviations from the mean of recent values, I send an alert message. It’s intentionally crude, and it’s just supposed to be a tool to grab attention and help push for investigation of what the heck is going on.

The point is, this is very literally outlier *detection*. No automatic action is taken other than an alarm going off, since it could be anything: new code is deployed that’s not performant, or a power outage in a data center across the country that we depend on, or a momentary blip due to everyone starting a House of Cards episode at the same time.

I realize this is basically the general point you were making, but it feels worth fleshing out in defense of outlier detection :)
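The alerting rule described above might be sketched roughly as follows. This is not the commenter's actual code: the class name, the window of 50 readings, the requirement of 3 consecutive extreme readings, and the choice to keep extreme readings out of the baseline are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean, stdev


class ResponseTimeMonitor:
    """Raise an alert after several consecutive readings far from recent values."""

    def __init__(self, window=50, threshold_sd=5.0, consecutive=3):
        self.recent = deque(maxlen=window)  # baseline of recent calm readings
        self.threshold_sd = threshold_sd
        self.consecutive = consecutive
        self.streak = 0  # number of extreme readings seen in a row

    def observe(self, value):
        """Record one reading; return True if an alert should be sent."""
        is_extreme = False
        if len(self.recent) >= 2:
            mu = mean(self.recent)
            sd = stdev(self.recent)
            is_extreme = sd > 0 and abs(value - mu) > self.threshold_sd * sd
        if is_extreme:
            self.streak += 1
        else:
            self.streak = 0
            # Assumption: only non-extreme readings update the baseline,
            # so a burst of slow responses does not inflate the sd.
            self.recent.append(value)
        return self.streak >= self.consecutive


monitor = ResponseTimeMonitor()
readings = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100, 500, 500, 500]
alerts = [monitor.observe(r) for r in readings]
print(alerts)
```

The only action taken is returning a flag; deciding what the spike means is left to whoever receives the alert.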

• Ironically, take even what seems like a stupid rule to you, the “labeling something as an outlier if it is more than some number of sd’s from the median” prescription: in practice I can think of very few processes where that’s a terrible rule.

i.e. even that crude rule of outlier detection seems to work fine for many real world examples. e.g. QC / QA or log correlation etc.

• …except that the sd is really not a good statistic for doing this, because it is itself heavily affected by outliers.
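A small made-up example of this masking effect: two wild readings inflate the standard deviation so much that neither of them ends up even 2 sd's from the mean. For comparison, the sketch also scores the same data with the median and the median absolute deviation (MAD), a common robust substitute that nobody in this thread proposed; the 1.4826 factor is the usual constant that makes the MAD comparable to an sd for normal data.

```python
from statistics import mean, median, stdev


def sd_scores(values):
    """Distance of each value from the mean, in units of the sample sd."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


def mad_scores(values):
    """Distance of each value from the median, in units of the scaled MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # makes the MAD comparable to an sd under normality
    return [(v - med) / scale for v in values]


# Made-up data: eight ordinary readings and two wild ones.
data = [10, 11, 9, 10, 12, 10, 11, 9, 500, 500]

print(max(abs(z) for z in sd_scores(data)))   # the wild points look unremarkable
print(max(abs(z) for z in mad_scores(data)))  # the wild points stand out clearly
```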

• Yes, I agree with both of you. The problem is not with outlier detection, it is what is done after applying such a rule.

• My gripe is mainly with automatic methods of detection or labelling that are not sensitive to context and have no understanding of what the researcher is trying to do.

• Andrew, I think your original statement was correct. The problem with outlier detection is when people don’t see the gap between what they conceive an outlier to be (“a bad point”, “a suspicious transaction”, etc) and what outlier tests do: determine how likely a point is with respect to a particular model.

People who stop at detection are implicitly acknowledging that their model may be useful but it’s not well-motivated. That’s why you as a statistician can tolerate that approach: who can argue with a kind of tripwire that simply says, “You might want to look more closely at this.” But “outlier” is still a loaded term and one of the more dangerous ones in statistics, and “outlier test” without further explanation should raise a red flag in your mind.

(It’s more than just the definition of “outlier” and the key concept of probability under a model. As someone else pointed out, some outlier tests are themselves influenced by outliers, leading to epicycle-like kludges. Then there is the baseline period which is considered “normal” for establishing your limits, then obsessions with intervals in models that do not reflect tolerance or prediction…)

• Agree with this point and all others related to it.

I work in behavioral statistics, and I have yet to hear of a really striking instance in which outlier testing was the lynchpin that people seem to think it is. It actually effectively tells me that the person doesn’t know what they’re doing.

That being said, as Wayne pointed out, there’s clearly a definitive difference between outlier detection and outlier testing and/or outlier rejection. I would be wary of one who doesn’t practice outlier detection because, at least in my field, it’s often the outliers that we’re supposed to be paying attention to. But that doesn’t beget any sort of data manipulation necessarily, and I think people fail to realize this.

Rather, because “outlier detection” seems to be a term that would be defined in a vocabulary section of a statistics textbook, I think people end up conflating that with other such terms, which are typically tests or calculations, thereby implying that “outlier detection” is a test or method, rather than simply a behavior of acknowledgement. I’ve had countless students ask me for the formula for outlier detection when I ask them questions about the topic in assignments or tests.

2. There are many things not to like about stepwise. One not mentioned in the post is that it doesn’t even necessarily do a good job at what it purports to do. Given a set of predictors, there is no guarantee that stepwise will find the “best” combination of predictors (defined as, say, the highest adjusted R^2); it can get stuck in local optima. Example here:
http://stats.stackexchange.com/questions/29851/does-a-stepwise-approach-produce-the-highest-r2-model
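A tiny constructed dataset (not taken from the linked question) shows the local-optimum problem. Here y equals x1 + x2 exactly, but x3 has the largest correlation with y on its own, so greedy forward selection takes x3 first and can never reach the perfect pair. Plain R^2 is used instead of adjusted R^2, which gives the same ranking among models of equal size. All names are hypothetical and the least-squares solver is a bare-bones one written for this sketch.

```python
from itertools import combinations


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def r_squared(columns, y):
    """R^2 of a least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)
    fitted = [sum(bj * xij for bj, xij in zip(beta, row)) for row in X]
    ybar = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - rss / tss


def forward_select(predictors, y, k):
    """Greedy forward selection: add whichever predictor most improves R^2."""
    chosen = []
    while len(chosen) < k:
        remaining = [j for j in range(len(predictors)) if j not in chosen]
        best = max(remaining, key=lambda j: r_squared(
            [predictors[i] for i in chosen + [j]], y))
        chosen.append(best)
    return chosen


def best_subset(predictors, y, k):
    """Exhaustive search over all subsets of size k."""
    return max(combinations(range(len(predictors)), k),
               key=lambda s: r_squared([predictors[i] for i in s], y))


# Three mutually orthogonal building blocks.
a = [1, -1, 1, -1, 1, -1, 1, -1]
b = [1, 1, -1, -1, 1, 1, -1, -1]
e = [1, 1, 1, 1, -1, -1, -1, -1]

x1 = [3 * ai + bi for ai, bi in zip(a, b)]
x2 = [-3 * ai for ai in a]
x3 = [bi + ei for bi, ei in zip(b, e)]
y = [u + v for u, v in zip(x1, x2)]  # y = x1 + x2 exactly

predictors = [x1, x2, x3]
greedy = forward_select(predictors, y, 2)
exhaustive = best_subset(predictors, y, 2)
print(greedy, r_squared([predictors[i] for i in greedy], y))
print(exhaustive, r_squared([predictors[i] for i in exhaustive], y))
```

Exhaustive search is only feasible here because there are three candidate predictors; with many predictors the number of subsets explodes, which is part of why greedy procedures are used at all.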

On an unrelated note, I wonder if Andrew or someone else could say a little more about why outlier detection is considered “a bit of a joke” to statisticians. Do you just mean they take a dim view of simple, thoughtless rules like “delete any observation with Cook’s D above a certain threshold,” or that they view the entire enterprise of identifying and dealing with outlying observations as fundamentally dubious? The former view is certainly understandable, I’m just wondering if you’re actually thinking of the second sort of view.

• I’m curious about the outlier thing, too. Developing influence statistics for multilevel models seemed to me to even be a recent area of applied statistical research. I think I’ve seen some stuff from Snijders and Berkhof and from Loy and Hofmann.

• I was primarily looking for diagnostics and solutions for heteroscedasticity issues, though*, and spotted the influence statistics stuff only by accident.

*To be more precise, I wondered if there is any work on multilevel models where heteroscedasticity is not seen as something you have to correct for but as something which is of substantive interest.

3. How do you equate stepwise regression and Lasso with something like BMA?

Aren’t these all a form of model selection procedures which if not useful for theory testing, can be legitimate for forecasting?

4. I was also skeptical of stepwise regression as a Biology major with an emphasis on Ecology and Molecular Ecology. Interestingly, I have seen this practice creep into both the climate and ecological literature, and it seems to be gaining popularity in fields with “messy data”. For now I’m avoiding using these methods (stepwise regression, quantile regression, etc.), but keeping my eye on them to see if they gain a broader acceptance. I do agree though that these methods will tend to produce statistically significant results that might not actually be “biologically relevant”, i.e. a trend that can actually be applied and developed into a useable model.

• I’m curious, why are you lumping stepwise and quantile regression together here? While I can see the problems of stepwise regression, I can imagine settings in which quantile regression may be reasonable.

5. Andrew doesn’t mention this piece, but here is a nice little review of the problems with stepwise: http://www.nesug.org/proceedings/nesug07/sa/sa07.pdf

Issues are (paraphrased from Harrell, 2001):

1. R^2 values are biased high

2. The F and χ² test statistics do not have the claimed distribution.

3. The standard errors of the parameter estimates are too small.

4. Consequently, the confidence intervals around the parameter estimates are too narrow.

5. p-values are too low, due to multiple comparisons, and are difficult to correct.

6. Parameter estimates are biased high in absolute value.

7. Collinearity problems are exacerbated

6. There is yet another problem with Stepwise Regression; a big one. It encourages you not to think.

7. Doing stepwise using significance values of the parameters is definitely a bit of a joke, but I wouldn’t necessarily say so when using a criterion such as AIC. There is at least a theoretical justification for finding the ‘optimal’ model as measured by AIC. Stepwise does not necessarily find this optimum, but it does do approximate optimization.

The real sin though is when p-values are reported with a stepwise regression (shudder).

• People such as Frank Harrell would, and do, argue differently on this point. Frank has often said that (me paraphrasing) using the AIC in this manner is just the same as stepwise using p-values because the AIC is just a restatement of the p-value. They don’t give the same result but the process is the same; you’re just (potentially) using a different threshold than say p <= 0.05 when you use AIC.
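The "different threshold" remark can be made concrete with a rough calculation. The sketch assumes two nested models that differ by a single parameter and the usual large-sample chi-squared approximation for the likelihood ratio statistic. Adding the parameter lowers AIC exactly when that statistic exceeds 2, and lowers BIC when it exceeds log(n). The function names are hypothetical.

```python
import math


def chi2_1df_tail(x):
    """P(chi-squared with 1 df > x), using chi2_1 = Z^2 for standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))


def implied_alpha_aic():
    """p-value cutoff implied by 'add the term if AIC goes down'."""
    return chi2_1df_tail(2.0)


def implied_alpha_bic(n):
    """p-value cutoff implied by 'add the term if BIC goes down', n observations."""
    return chi2_1df_tail(math.log(n))


print(round(implied_alpha_aic(), 3))      # roughly 0.157
print(round(implied_alpha_bic(100), 3))   # stricter, and it tightens as n grows
```

Under these assumptions, AIC-based stepwise behaves like p-value stepwise with a cutoff near 0.157 instead of 0.05.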

Anything that imposes a hard selection threshold will fall foul of at least:

6. Parameter estimates are biased high in absolute value.

from the list above.
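A short simulation sketch of why a hard threshold biases the retained estimates upward in absolute value. It assumes a single coefficient whose true value equals one standard error, estimated many times with normal noise, and a keep-if-|z|>1.96 rule; all of those settings, and the function name, are made up for illustration.

```python
import random
from statistics import mean


def simulate_selected_estimates(true_effect=1.0, se=1.0, n_sims=10000,
                                z_cut=1.96, seed=0):
    """Draw noisy estimates of one coefficient; keep those passing the cutoff."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    kept = [b for b in estimates if abs(b / se) > z_cut]
    return estimates, kept


estimates, kept = simulate_selected_estimates()

print("true effect:                 1.0")
print("mean of all estimates:      ", round(mean(estimates), 2))
print("share passing the cutoff:   ", round(len(kept) / len(estimates), 2))
print("mean |estimate| among kept: ", round(mean(abs(b) for b in kept), 2))
```

Every retained estimate is larger than 1.96 standard errors in absolute value by construction, so the retained ones must overstate a true effect of 1, however the cutoff was chosen.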

8. I love stepwise regression.
It is a very simple effective way to
do variable selection.
The lasso and stepwise are approximately the same
(as shown in the LARS paper by Efron et al)
There are results by Andrew Barron et al that show that
stepwise achieves optimal risk.
see:
Barron, Andrew R., et al. “Approximation and learning by greedy algorithms.” The annals of statistics (2008): 64-94.
Of course one should not use the output of this (or any selection method) for inference.
But for prediction it is great.

Larry

• I think the LARS paper is actually pretty critical of stepwise; Efron et al find it to be too greedy. The paper shows that LASSO and STAGEwise are approximately the same, and have better properties than stepwise regression. Stagewise takes smaller steps than stepwise, and as such allows multiple collinear variables into the model in a way that might be better for predictive accuracy.

• Yes it is stagewise that is closer.
But in practice, when the dimension is large,
I find they are almost always very similar.
There is little practical difference.
And stepwise is easier to implement and easier to explain.
And the risk bounds, as I mentioned, are the same as those derived
by Greenshtein-Ritov for the lasso. So from that perspective they
are the same.

Larry

• Larry: “Of course one should not use the output of this (or any selection method) for inference.” Of course? What are the AIC and other people doing? Just black box prediction? This distinction between inference/prediction is coming up on my current post (on Potti and Duke), and if what you say is true, then it seems problematic to be using any of these model selection techniques in recommending treatments for patients.

9. If anyone is serious about reliably calling out poor statistical practices rather than cherry picking pet conflicts with other professors, how about taking a look at things like this:

Journalists being nothing but parrots with large amounts of salt-and-pepper noise is understandable. In general tenured researchers being completely incompetent at statistics is less so.

• If you want to criticise that research, why are you not doing it yourself? What are your problems with the approach? Etc.

10. In a manuscript review I performed last year, I criticized the use of stepwise regression and recommended the authors select covariates for their model based on their knowledge of the field (in which they are experts). I also referenced Frank Harrell’s criticisms of stepwise regression.

The reply to this criticism: “This is a standard method in the field”
(Not an exact quote but it went something like that.)

Oh, and the assigned statistical reviewer did not criticize the use of stepwise regression, but noted that perhaps the study may have been underpowered. The dataset was approximately the same size as the three previous datasets used to study the effect of interest (by now, to confirm that the effect was probably not present).

Yes, I am still a little annoyed by this…

• You win some and you lose some.

In an earlier version of this paper –

Intravenous immunoglobulin therapy for streptococcal toxic shock syndrome — a comparative observational study –

– I was originally displaced from the research group by a well known biostats department that one of the co-authors was associated with; that co-author had been convinced by them that only the one best adjustment model (found by all possible selection) should be presented in the paper.

The reviewers of the initial journals they submitted the research to stepped on them hard enough that it enabled the co-author I was associated with to re-instate me. Both propensity score analysis and a summary of all possible linear adjusted estimates were given along with, I think, a clear discussion of uncertainties that could not be further refined.

• Even for experts, their understanding of the underlying science is limited, which is especially true for biomedical sciences. So all the variable selection methods including stepwise regression can be useful for discovering something new (no guarantees though). I think it is too arrogant to believe that the experts in that field can really know all relevant predictors/covariates to be used in a regression model.

11. Stepwise regression has two massive advantages over the more advisable alternatives. One, it’s intuitive – unlike even lasso, it’s simple to explain to a non-statistician why some variables enter the model and others do not. Two, it’s implemented in an easy-to-use way in most modern statistical packages, which the alternatives are not. Would I publish a paper with it or advise its inclusion in a statistical plan? No way. Am I okay with folks using it to explore their own data sets, with all the necessary caveats? Yep.

So, please consider the alternative hypothesis that the researchers who use stepwise regression are aware of the problems in a general sense, but perhaps don’t know a better option. Not unlike the problem with overreliance on p-values, actually.

And… to the frequently-repeated assertion by statisticians that clinicians/scientists should ‘use their domain expertise to select variables manually, rather than relying on the computer’: close your eyes and imagine reading that sentence in a manuscript or a grant. Now imagine just how quickly it would be shot down for lack of rigor, or suspected of data-dredging.

• cassowary37: if the clinicians/scientists don’t give just a vague statement about domain expertise, but instead say “we will adjust for X and V because of plausible confounding as described in figure Z” with some relevant citations to show they know what they’re talking about, I think getting shot down would be harsh.

However, to successfully argue one is using domain expertise, there has to be a very specific goal in mind for the analysis – a specific aim of the grant, say. When that’s not available (and it may not be) I agree stepwise approaches may have some merit as exploratory tools, although other tools are – these days – easy to use and should at least be considered.

• cassowary37, you suggest that using domain expertise to select variables would be shot down as data dredging, but stepwise regression would not. You may be right that some reviewers would react that way, but those reviewers would have it backwards. It is stepwise regression that is “data dredging”, and explicitly so: the procedure tries to identify the set of explanatory variables with the most power, whether or not they make any sense whatsoever. If you throw in a bunch of random vectors of explanatory ‘data’, some of them will be selected by the stepwise regression procedure for inclusion in the model, whereas no educated human would make that mistake.

I consider stepwise regression to be a useful tool for exploratory data analysis — here are a bunch of variables that I think might be predictive, show me which ones actually are — but for going beyond the exploratory stage it can easily lead you down the garden path.
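The point about random vectors is easy to check with a simulation sketch. Everything here is pure noise: an outcome of 20 observations and 200 candidate predictors, none related to the outcome. The first step of forward stepwise picks the predictor most correlated with the outcome, and the sketch counts how many noise predictors would pass a conventional two-sided 5% test on their own. The sample sizes, the seed, and the approximate critical correlation of 0.444 for 18 degrees of freedom are assumptions of this illustration.

```python
import math
import random


def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def screen_noise(n_obs=20, n_noise=200, r_crit=0.444, seed=1):
    """Correlate a noise outcome with many noise predictors.

    Returns all correlations and the ones exceeding r_crit in absolute value.
    r_crit=0.444 is the approximate two-sided 5% critical value for n_obs=20.
    """
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    rs = []
    for _ in range(n_noise):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        rs.append(correlation(x, y))
    hits = [r for r in rs if abs(r) > r_crit]
    return rs, hits


rs, hits = screen_noise()
print("noise predictors screened:        ", len(rs))
print("passing the 5% cutoff on their own:", len(hits))
print("largest |correlation|:            ", round(max(abs(r) for r in rs), 2))
```

With 200 candidates one would expect about 10 to clear the cutoff by chance alone, and a stepwise procedure will happily start with the most extreme of them.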

• I’m curious, did your alternative model have more or less explanatory power than the consultant’s brute force model?

Further, isn’t what you are criticizing essentially the over-fitting aspect? If an ad hoc model performs with as good an explanatory power on out-of-sample data, can you still apply the “nonsense model” critique? i.e. a validation step is what’s needed?

At some degree of performance doesn’t one have to concede the superior explanatory power of a model notwithstanding how silly one thinks the model structure is?

• Explanatory power without causal understanding can be dangerous because you don’t know when the correlations that make the model work will be broken. You end up with black swan types of failure modes.

It also becomes unclear how to move forward with improving the model when you don’t understand why it works.

There are niche applications where you don’t care, for example, image editing / texturing software. But in cases where the goal is scientific, then no – out-of-sample prediction is not the be all end all.

• Yes, but can I dismiss a model with bad causal structure ( & excellent explanatory power ) because my alternative has an appealing causal structure yet crappy explanatory power?

e.g. in Phil’s example I think it’s too easy to make fun of the consultant but is a predictively crappy alternative any better, no matter how enticing its causal structure?

• that’s not the right dichotomy. both a model that’s poorly predictive and a model that can’t be interpreted are fairly useless for scientific purposes.

i would however, generally make more use of a model with a plausible causal interpretation with reasonable predictive power (relative to measurement error) than one that’s slightly more predictive of the available data but uninterpretable.

keep in mind our estimates of generalization error tend to be very crude.

• Rahul, the consultant’s model performed better on the data at hand but everyone realized it would perform much worse when applied to new data. Therein lies the whole problem, or at least most of the problem.

• Interesting. I didn’t know that. Thanks.

So what’s an objective way to evaluate the generalizability of a model?

• roughly speaking there’s two aspects to generalizability – the bias variance tradeoff (which encompasses “overfitting”) and heterogeneity.

You can think of reality as a mixture distribution. Often cross validation error won’t translate into a real out of sample error because your sample underestimates the variance of the variance of the variance of the variance, etc. I tend to be wary of machine learners who do a single train/test/validation error and think they’re done. What’s the credible interval on one’s estimate of a prediction error? Guess what, that’s going to depend on a model assumption (most of the time researchers don’t even provide one).

It’s important to realize that cross validation is relying on modeling assumptions which are just as subject to modeling failures as anything else.

The best case scenario for characterizing generalization error is probably when you are doing a timeseries prediction with bounded outcomes (eg election prediction).

There’s some interesting debate of this here http://statmodeling.stat.columbia.edu/2012/07/23/examples-of-the-use-of-hierarchical-modeling-to-generalize-to-new-settings/

• If the consultant had used stepwise regression to find a model based on data from half of the sampling units, he would have come up with a different model from the one he came up with. It would have performed well on the data being fit, and poorly in cross-validation. What that _should_ tell you is not to use stepwise regression, or at least not for constructing your final model.

If, instead, you keep doing different random selections and testing them, you will eventually find one that works well on both the fitted dataset and the cross-validation set. But it will generate nonsense if applied to new data.

12. I think there is a much bigger problem with how many people like to interpret the results of whatever variable selection procedure than with any specific one including stepwise. People need to understand that many things they would like to identify cannot be identified from the data, particularly “variable A has an effect on Y whereas variable B hasn’t”. I don’t think that there is anything more to like about such interpretations if they use a result of Lasso or something Bayesian than of stepwise.

13. “Stepwise regression is one of these things, like outlier detection and pie charts, which appear to be popular among non-statisticans but are considered by statisticians to be a bit of a joke.”

Tibshirani and Hastie in their recent Statistical Learning MOOC were quite positive about stepwise regression, in particular forward stepwise selection for variable selection. Their book also covers these topics. http://statweb.stanford.edu/~tibs/ElemStatLearn/

• A reasonable use case for stepwise regression in variable selection would be simply to rank order the theoretical ‘importance’ of each variable to the model. But I consider the outputs of a fwd stepwise regression a mere guide on which variables to begin with, not a viable model. In fact, I will use fwd stepwise iteratively.

Let’s say I’m trying to develop a reasonably well parameterized, well fit, minimal deviance and information loss, and variable interaction accounted model which captures the probability of some event y occurring. And let’s say I have 200-300 variables to examine for candidacy in the training model (and that 200-300 may have been a reduction from several thousand other dimensions). Where do I reasonably begin in that case?

I find that fwd stepwise helps streamline the process in this regard. I may take these variables and simply output an initial rank ordered list of variables the stepwise may be inclined to include (examining aic/bic, deviance of the residuals that theoretically may be reduced, and a few GoF measures). I may begin examining each variable as I add them, one by one, into the model.

What fwd stepwise allows me to do is determine a ‘stopping point’, which may be reached in the list when a further reduction in deviance becomes insignificant. Once I’ve reached that stopping point, I stop the manual model dev and will then run another fwd stepwise, this time having it consider a new model wherein the variables that made it successfully thus far are known. This generates a new rank ordered list, and I then return to variable testing in the model with that. I cycle through this until I’ve run through the entire list of possible candidates.

At that point I search for possible interactions and go through the process of examining each. Of course this is still quite a raw model and candidate interactions should be somewhat intuitive (and that is an admitted source of bias, but there is little perfect about ‘explanatory/predictive’ output). There is inevitably some subject domain expertise (the ‘art’ of this entire process) that comprises selection bias on what interactions are reasonable to test. I haven’t really experimented yet with how this might be improved (reduce selection bias, choice of degree of interactions [n-way] to consider).

Once that is completed, I will then use backward stepwise to examine what variables, if any, may now add little value to the model once new within variable associations have been discovered.

For the test set, I apply the training model to examine how it accepts new data. I then conduct the entire process again, and then perform a set of model comparison tests (for error noted between training/test application, what would explain this? By having both the applied model and independent model outputs generated, diagnosing potential issues I believe is aided immensely.).

So yeah, this can be quite laborious. But independent model development, particularly if what is being modeled is all rather ‘new’ (a ‘first run’), is I think a valuable added bit of ‘insurance’ that the resultant model is sound.

For certain applications, such as in certain types of risk, where a single event’s maximum severity is in the scale of things rather low in margin, I often find the more stable generalizable model over time is the one which is slightly *underfit* in the grand scheme of things. For examining what is behind more severe risk events, I don’t believe this may be sufficient, but then I don’t generally use this modeling paradigm (ie GLM, etc) for those varieties of problems anyway, unless ‘intuition’ is all that is requested. The controversy over the importance of model parsimony and stability vs accuracy is truly context dependent.

• Right, but they aren’t very complimentary about such methods in their Elements of Statistical Learning book, which whilst not the main text for that course was suggested reading for more savvy participants. There was a distinct focus of the Stanford StatLearn course on prediction, so they weren’t specifically using it for inference either.

14. This outlier detection is performed a lot by the neuroscience community. I must admit, I am one of those non-statisticians. Could you elaborate why it is a bad idea? Is it because of bias introduction?

• Luca:

Speaking generally, we want to understand where the outliers are coming from. There’s a big difference between an observation that happens to have a high value, and a data recording error, for example. Automatic rules for removing outliers can’t really handle that. Beyond this, the concept of an “outlier” seems in many cases to be a crude substitute for the more valuable concept of a “distribution.” It disturbs me that, of all the statistics jargon, the term “outlier” is so popular.

15. So my lecturer has asked we compare/contrast stepwise & hierarchical multiple regression and give an example of when we would use both.

I can think of all the reasons we shouldn’t use stepwise in social sciences and I can’t think of a time I would willingly use stepwise. Hjaelp! Example where this is a good idea…

16. I am sorry I am entering this discussion. I am an engineer and often confront this problem in my work. Generally, I have used singular value decomposition to help identify variables with low predictive power. Ultimately, I select variables with singular values above some threshold. This procedure has not let me down, at least so far, in numerous problem areas. The views of the statistical experts would be appreciated. Thank you.

• beginner in statistics here, but using SVD for feature selection seems wrong. By choosing singular values above certain threshold, you are using the top singular vectors (linear combination of your original features), meaning that you are transforming your feature linearly before truncation. (In particular, you aren’t selecting your original features at all)

17. Interesting discussion and very helpful. As a layman to the concepts behind statistical modelling, I would like to share my experience and see if it reveals another aspect of outliers and stepwise as valuable tools to those of us using the math to derive meaning in very useful modelling systems.

I’m a software engineer and developed a web-tool for a company that needed a way to quickly develop non-linear regression modelling and charting, and constantly re-analyze energy data. Billions of dollars are at stake in this industry. With the tool I developed, I was able to pick a database from a long list, pick dependent variables, and any number of independent variables, along with a wide selection of expressions to apply to each coefficient/variable. Very complex non-linear regressions were then run, deriving equations from these and various charts with best fit lines, scatter charts, R^2 results. Unlike those in the medical field or social sciences, we were not looking for boolean or Bayesian results, but levels of variability and accuracy of the models, so that companies could benchmark themselves. Models were then used such that companies could get an idea of “best practices” based on the use of their own independent variables used in the model and their own data.

Very good results were achieved in the benchmarking of these multinational companies using these models and in helping them achieve some level of improvement in their practices, saving many of them millions of dollars.

After I fine-tuned the software, out of curiosity, I went back to see how the models and algorithms created by my software compared to stepwise regression results. (Some of my stepwise implementations implemented stagewise strategies). It turned out the stepwise results picked the same variables the software and users picked. In other words, the intuitive choices of experienced users (familiar with years of working with the data) and what the software created in the final non-linear regressions fit the stepwise results looking at all the available data columns in several databases that I tested. So, I would say, stepwise is not the “evil” it’s portrayed as here, but useful in verifying models already created by software systems. We got very good correlations in the data we were using, and got the maximum value out of it using non-linear regressions and stepwise confirmation.

Second, “outlier” may be a dangerous term, and not accurate in helping people in certain fields derive understanding of distributions, etc. But we found, looking at data points from multiple models, that in almost every case, such data was flawed in one or more data columns in our databases. These outliers later revealed proven flaws in SQL procedures done by users in the past that wrote over correct values, corrupting the databases. The use of outliers and analysis of outliers AFTER creating non-linear regression modelling by my software revealed easily identifiable “mistakes” in the data that prompted further analysis of broader errors in data entry systems and software used elsewhere in the company, and even of specific people who entered the data. This allowed us to improve processes by designing mathematical analysis queries in SQL to identify associations between the flawed outlier columns and sister columns ALSO containing flawed data.

Again, “outlier” here isn’t defined as beyond range, accepted distribution, or data beyond regression modelling. It’s defined as “flawed data”, i.e. bad data, which the math helped us to identify. So I think there is much more value to be found in use of outliers and the use of outlier identification than just the statistical perspective here.

18. A bit late to this discussion, but I have a question.

Are there any circumstances where it is OK to use stepwise regression? More specifically, I am thinking of a problem where I have way more data points than I have regressors and in the final model, I have roughly 100 or more data points per term in the model.

Basically, are there ever any circumstances where overfitting is so unlikely that a stepwise procedure is unlikely to give you a model that doesn’t generalise well, in particular if you specify a strong penalty for including additional terms (e.g. stronger than BIC)?

The reason I ask is that, and not being ideological about certain techniques, it strikes me as though stepwise can be useful provided its limitations are known, and the modelling exercise does not take you outside the appropriate circumstance for using stepwise regression.