Rob Tibshirani writes:
Hastie et al. (2001) coined the informal “Bet on Sparsity” principle. The l1 methods assume that the truth is sparse, in some basis. If the assumption holds true, then the parameters can be efficiently estimated using l1 penalties. If the assumption does not hold—so that the truth is dense—then no method will be able to recover the underlying model without a large amount of data per parameter.
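To make the principle concrete, here's a small simulation sketch (scikit-learn assumed; the dimensions, penalty value, and data-generating process are all invented for illustration, not from the quote): fit the lasso to a sparse truth and to a dense truth with the same sample size, noise level, and overall signal strength, and compare estimation error.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 200  # more parameters than observations

def relative_error(beta):
    """Fit the lasso and return ||beta_hat - beta|| / ||beta||."""
    X = rng.standard_normal((n, p))
    y = X @ beta + rng.standard_normal(n)
    fit = Lasso(alpha=0.1, max_iter=10000).fit(X, y)
    return np.linalg.norm(fit.coef_ - beta) / np.linalg.norm(beta)

# Sparse truth: 5 large coefficients, the rest exactly zero.
beta_sparse = np.zeros(p)
beta_sparse[:5] = 3.0

# Dense truth: the same total signal strength spread over all 200 coefficients.
beta_dense = np.full(p, 3.0 * np.sqrt(5 / p))

err_sparse = relative_error(beta_sparse)
err_dense = relative_error(beta_dense)
print(f"sparse truth: {err_sparse:.2f}, dense truth: {err_dense:.2f}")
```

With everything else held equal, the l1 fit estimates the sparse coefficient vector far more accurately than the dense one, which is the "large amount of data per parameter" point in the quote.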
I’ve earlier expressed my full and sincere appreciation for Hastie and Tibshirani’s work in this area.
Now I’d like to briefly comment on the above snippet. The question is: how should we think about the “bet on sparsity” principle in a world where the truth is dense? I’m thinking here of social science, where no effects are clean and no coefficient is exactly zero (see page 960 of this article or various blog discussions in the past few years), where every contrast is meaningful, but some of these contrasts will be lost in the noise at any realistic sample size.
I think there is a way out here, which is that in a dense setting we are not actually interested in “recovering the underlying model.” The underlying model, such as it is, is a continuous mix of effects. If there’s no discrete thing to recover, there’s no reason to worry that we can’t recover it!
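One way to see this computationally (a sketch, again with scikit-learn and an invented data-generating process): make every coefficient small but nonzero, fit a heavily regularized ridge regression, and judge the fit by its predictions rather than by whether it finds the "right" support.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n, p = 100, 200
beta = rng.normal(0.0, 0.3, size=p)  # dense truth: every coefficient nonzero

X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

fit = Ridge(alpha=50.0).fit(X, y)

# There is no sparse support to recover here, but the shrunken fit still
# predicts much better than ignoring the predictors entirely.
X_new = rng.standard_normal((2000, p))
mse_model = np.mean((fit.predict(X_new) - X_new @ beta) ** 2)
mse_null = np.mean((X_new @ beta) ** 2)  # no model at all: predict zero
print(f"ridge: {mse_model:.1f}, no model: {mse_null:.1f}")
```

No individual coefficient is estimated as zero or estimated precisely, yet the continuous mix of effects is captured well enough to be useful, which is the sense in which "recovering the underlying model" is the wrong target here.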
I’m sure things are different in a field such as chemistry, where you can try to identify the key compounds that make up some substance.
P.S. The above quote and link come from Rob’s chapter, “In praise of sparsity and convexity,” in the Committee of Presidents of Statistical Societies volume. My chapter, “How do we choose our default methods?”, is here.
P.P.S. I do think it can often make sense to consider the decision-analytic reasons to go for sparsity: sparse models can be faster to compute, easier to understand, and can yield more stable inferences. (Sometimes people say that a sparse model is less likely to overfit, but I don’t think that’s quite right, as you can also get rid of overfitting by using a strong regularizer. I do think it is fair to say that a sparse model can yield more stable inferences, in that inferences from a more complex model can be sensitive to the details of the regularizer or the prior distribution.)