Rob Tibshirani writes:

Hastie et al. (2001) coined the informal “Bet on Sparsity” principle. The l1 methods assume that the truth is sparse, in some basis. If the assumption holds true, then the parameters can be efficiently estimated using l1 penalties. If the assumption does not hold—so that the truth is dense—then no method will be able to recover the underlying model without a large amount of data per parameter.

I’ve earlier expressed my full and sincere appreciation for Hastie and Tibshirani’s work in this area.

Now I’d like to briefly comment on the above snippet. The question is, how do we think about the “bet on sparsity” principle in a world where the truth *is* dense? I’m thinking here of social science, where no effects are clean and no coefficient is zero (see page 960 of this article or various blog discussions in the past few years), where every contrast is meaningful—but some of these contrasts might be lost in the noise with any realistic size of data.

I think there is a way out here, which is that in a dense setting we are not actually interested in “recovering the underlying model.” The underlying model, such as it is, is a continuous mix of effects. If there’s no discrete thing to recover, there’s no reason to worry that we can’t recover it!

I’m sure things are different in a field such as chemistry, where you can try to identify the key compounds that make up some substance.

P.S. The above quote and link come from Rob’s chapter, “In praise of sparsity and convexity,” in the Committee of Presidents of Statistical Societies volume. My chapter, “How do we choose our default methods?”, is here.

P.P.S. I do think it can often make sense to consider the decision-analytic reasons why it can make sense to go for sparsity: sparse models can be faster to compute, easier to understand, and yield more stable inferences. (Sometimes people say that a sparse model is less likely to overfit but I don’t think that’s quite right, as you can also get rid of overfitting by using a strong regularizer. But I think it is fair to say that a sparse model can yield more stable inferences, in that the inferences for the more complex model can be sensitive to the details of the regularizer or the prior distribution.)

In your Chapter: “true that, conditional on the model, Bayesian inference gives the right” probabilities (rather than answer)?

Also, the Gelman (2007) referred to is not in the references.

I strongly disagree with the “pessimistic” view that in the social sciences every variable is (causally) associated with every other variable bc no (causal) parameter is ever zero. If network density is such that all variables are connected then even _prediction_ is likely to be misleading out of the training sample (among other disturbances can never be (conditionally) independent).

Moreover, even if density is true many connections will likely be so close to zero we can practically treat them as zero. In other words, a little coarsening may drastically reduce the density of the causal network.

And if you insist on sticking to your guns remember that in a dense world even randomized controlled experiments have to be uninformative. After all exclusion restrictions, randomization, and non-interference are all structural zero assumptions (if you analyze experiments with graphs you will see this). Why should we believe these structural zeros and not others?

I think extreme Cartesian doubt has a use but I would not take it so seriously. This being said I do not expect to uncover the true model, just a useful approximation. But the key is useful. In a truly dense world with reverse causality, everything connected, etc.. I don’t think even useful approximations can exists. If your models are useful it is bc implicitly – even if you don’t assume it — there is some sparsity in the world.

PS By the way there are many instances where causal structure was learned form observation including smoking and cancer, cholera (john Snow), etc… Humans rely on this sort of learning all the time. Without it we would not function.

Fernando:

I’m not expressing pessimism or extreme Cartesian doubt, I’m just saying that in the sorts of problems I work on, everything is statistically dependent on everything else. For example, we’ve modeled responses to “How many X’s do you know?” questions. Older people know more older people, younger people know more younger people. Women know more women, men know more men. If we have enough data, we’d find that the differences between old and young are different comparing men and women. Of course the differences will differ—that is, of course there are two-way interactions, if we were to look at the entire population. There’s no way that the differences would be identical. Similarly there is geographic structure in who you know. People know other people who live closer to them. But the relevant distance scale is different in NY than in LA. But it’s not just travel time either. And of course the differences by distance are different for different age groups. They have to be: how could they be exactly the same? And so on. I agree that some of these differences are small but I don’t think any are exactly zero, nor do I think it’s such a good idea to work on statistical methods that are based on finding the zeros or finding the nonzero interactions. Cos the trouble is, that in the worlds where I work, there’s a continuum of dependences. It’s not like there are a few huge things and a bunch of tiny things.

Also, I of course agree with you that causal structure can be learned from observation. Indeed, some of my most successful political science papers involve causal inference from observational data. I wrote more about the topic here.

Andrew:

Just to be clear by network I was referring mainly to a Bayesian network, and more specifically to DAGs, not necessarily to social networks. But I am ok having the discussion about the latter.

Causal networks, or DAGs, are completely heterogeneous so I don’t understand your point about heterogeneity. Interaction is a quantitative property of a functional form, not a qualitative feature of a causal structure. That is, the network X ￫ Y ￩ Z is compatible with any functional form such as Y = X + Z or Y = X*Z or whatever. Stronger interactions may in fact make it easier to find conditional independence if it exists.

PS I think your point about heterogeneity is this: Say X causes Y in the population, but is moderated by Z. Now suppose in some Z=z strata data are sparse and a hypothesis test rejects that X has an effect on Y. Your point — and I agree if this is your point — is that this is somewhat silly. Most likely this is due to the sparsity. A hierarchical model would do better.

If this is your point then note this is an inference about the parameterization of a qualitative relation (e.g. a piece-wise function or whatever whereby X affect Y on some levels of Z but not others), and not about the causal relation itself (whether X causes Y for any level of Z).

The structural zeros I am talking about involve deleting an arrow unconditionally (e.g. as would be the case if X had no effect on Y whatsoever, irrespective of Z). I am not talking so much about X having some effect at some level of Z and not others. That is an order of magnitude more ambitious. And yes, for that kind of inference I think shrinkage is the way to go against strata sparsity.

Fernando:

I have no doubt that what you are talking about is interesting and important, but it’s really different than what I’m talking about. When Tibshirani refers to sparsity, I think he’s talking about setting parameters in some large model to zero: the sparse solution is of lower dimensionality than the full, non-sparse solution. I prefer a nonsparse model with regularization (“soft constraints”) but I completely agree with Tibshirani’s point on the general effectiveness of statistical methods such as lasso that enforce sparsity. My point is that, in the problems that I work on, reality is not sparse. None of the coefficients are exactly zero, and they only look like they could be zero because of small sample sizes. In addition, I don’t work in worlds where there are a few large parameters and many tiny ones (in which case a sparse model could be an excellent approximation). Instead, I work on problems where parameters take on a continuous range of values. Again, sparsity in solutions can be useful for various good reasons, but the underlying reality is not sparse in the worlds where I work.

If I understand you correctly, then, you are saying that you don’t want your regularization to do variable selection as well as shrinkage. You want only shrinkage. The reason is you don’t believe any parameter in your model is zero.

That is fine, but this is quite different from saying the world is dense and there are butterfly effects (everything is connected to everything). If so, including less than an infinite number of variables in your model is imposing, at least implicitly, some zero coefficients. Variable selection at some level.

Now, you might say that I am taking you literally, and you are only saying that in your particular applications it is only the variables in the model that don’t have zero coefficients. That is fine. But when I read your paper linked above, that is not the sense I get in your critique of graphical modeling, where, as a a reader, I was left with the impression that mapping conditional independencies — which is what DAGs do — is not well grounded bc there is universal dependence in social sciences.

Fernando:

There’s a big difference between what I think the world is, and what I think our estimates should look like. Tibshirani writes that, if “the truth is sparse, in some basis . . . then the parameters can be efficiently estimated using l1 penalties. If the assumption does not hold—so that the truth is dense—then no method will be able to recover the underlying model without a large amount of data per parameter.” What I’m saying, is that in the problems I work on, the truth is dense, but it can still be useful to produce sparse estimates (see my PPS in the above post).

But in the papers I’ve read by you, Gelman, you sort of induce sparsity. You don’t include all interactions that would be possible, not even higher polynomials. So, instead of letting the data (and the constraint or regularization) induce the variables selection, you chose them based on theory or whatever.

Sure, you use posterior predictive checks, which helps you to go back and see what else could be included in the model in the first place. But I’d like to be able to formally (or quantitatively) compare these two approaches on variable selection, or model building.

Or maybe we could start with Lasso, do posterior predictive checks, and try to include a few variables that we think would improve the model, but this time using other regularization thank l1?

Also, even if the world is dense, we may be forced to induce sparsity. I’m thinking here of the ugly-duckling theorem and all no-free lunch theorems.

Last, but not leas, as far as I understand, you don’t like to average models. Does this mean that in this case you prefer sparse models?

Manoel:

Yes, for the reasons given in my PPS, I find sparse models useful in many settings. Other times, I like some density. For example in my public opinion models, if I have a few predictor variables, I like to include all their interactions. But you could still say that such models are sparse because I’m only including a few predictors. I tried in my post above to emphasize that I do not necessarily disagree with Tibshirani’s methods, I’m just putting a different spin on his justification. I use sparse models for various reasons but not because I think reality is sparse (in the problems where I work).

So you are not dismissing differences but rather choosing not to make distinctions on them (for now)?

I have an idea bouncing around my head that involves “dense” modeling followed by sparsification plus estimation of non-negligible causal effects via an approximation justified in Bayesian terms (as in the recent paper Bayesian Estimation of the Discrepancy with Misspecified Parametric Models).

Corey:

Yup. I added PPS above in response.

I have several text classification and regression examples where I separately tried L1 versus L2 penalties. Sometimes L1 gives better held-out predictions, but often L2 is better. I have no idea whether the truth is sparse, but it seems like “bet on sparsity” was sometimes violated here.

Brendan:

Maybe a mix of L1 and L2 (with weights fit to data) could work even better?

Sometimes it does. Sometimes it doesn’t…

If the assumption holds true, then the parameters can be efficiently estimated using l1 penalties. If the assumption does not hold—so that the truth is dense—then no method will be able to recover the underlying model without a large amount of data per parameter.

Am I to assume Brendan has a large amount of data per parameter? What is large, anyway? 30 (the usual number)?

What exactly are we referring to when we say that truth is “sparse”? It’s causal connectivity rather than “significant” correlations. Nature’s causal graph tends to be sparse but connected. Thus, if one is just testing associations or theory is lacking, non-zero correlations will be everywhere. Without a theory, “selecting variables” based on a “nature is sparse” argument is a flawed, dead-end approach to science imo.

Why is it that sparsity principles are intuitively appealing? What I leap to immediately is something about computational complexity; it seems very difficult to make the world work at all if at every point in time every variable must somehow consult every other variable’s value before updating properly. It feels intractable somehow, though I’m not sure how to fit that into what I know about modeling tractability. From that perspective, the density-sparsity contrast between the social/human/behavioral sciences and the physical sciences is really intriguing, especially as in the former you’d expect computational constraints on system complexity to be all the more natural.

Maybe that’s in part the distinction between “having only a moderate number of predictors” (sparsity in the large?) and “only selecting a small subset of your predictors as relevant” (sparsity in the small?), though. The former seems really appealing to me, and it’s troubling to think that apparent denseness in social sciences data might be intrinsic and not just a biproduct of our limited understanding. The latter seems like a modeling tool more than anything, although I guess that blurs into meta-methodological statement about underlying reality pretty easily if you let your data set get large enough.

re: the computational constraints and social sciences.

It might be that any particular organism/person/agent only responds to a small number of inputs. But which they respond to might be very situation dependent such that, on average, there are many, many inputs that appear in a true model of the aggregate causal relationships.

Hm! I like that example/idea, Dean. Maybe it suggests that there’s not much reason in the first place to assume causal sparsity? Or is there another ‘fundamental reason’ we tend to have faith in causal sparsity aside from computational complexity?

On the other hand: what is the precise relation of this idea to causal density? I mean — suppose we are considering a number of variables N and describing a system that evolves from time t to time t+k, moving one unit of time at a time. The ‘efficiency of computation’ principle might be something like: the organism has some bound B, and whatever rule describing their N variables’ change from time t to t+1 can’t use more than B causal connections in total. In each single time step this means we’re not allowed to use more than a fraction B/N^2 of the total, directional, possible causal connections between pairs of the N variables. If N is large and B is roughly fixed, this ratio’s close to 0; I think this is what I have in mind when I think of causal sparsity.

But using your idea, we could, for each variable i in our set of N variables, have some associated variable j the levels of which tell i which of the other N-2 variables i should use to determine its state in the next step. There are efficiency restrictions on this process: the process j uses to pick the relevant subset for i can’t itself be intractable, we must be able to identify the relevant subset. I’m not sure what restrictions that makes for causal density, if any? A second restriction is that, at any particular time t, summing over all variables i, the variable subsets they depend on can’t exceed B, since this would require more than the organism’s computational power.

But even so, while we can only use B ‘causal influences’ from t to t+1, in moving from t to t+2 we could seemingly use B^2 influences, and from t to t+3, B^3 influences? So the number of causal influences seems to grow exponentially, if our system switches subsets in a clever enough way (supposing this can be done while still making it easy to identify which subsets are relevant at each time step, for each variable). Of course if variable i could be influenced by N-1 variables in stepping from t to t+1, then it could be ‘influenced-at-distance’ of k time steps by sequences of (N-1)^k variables, and B is small compared to N, so B^k/(N-1)^k is presumably shrinking very quickly, not growing.

Well I’ve rambled enough, sorry! I really like your idea, though. It seems to me there’s something here that could be played with and elaborated in a formal model, but maybe I’m just over-attached to complexity considered. I am really curious whether there’s an alternative, fundamental justification for sparsity principles…

This is a bit off-topic but I bring it up anyway: LASSO is presented in the context of linear mixing model. It’s formulated to deal with a scenario where the weight coeffs of the basis functions are sparse. When we ask “Is the world (non-)sparse?” to what extent are we asking “Is the world (non-)sparse in the context of a linear subspace model?” Linear subspace models are extremely useful but not universal. Suppose your data can be described as a low-dimensional manifold which, when viewed globally, isn’t well accounted for using a subspace model? Would you call your data sparse or non-sparse? (I’d call it sparse if the manifold occupied only a modest fraction of the volume available in its dimensionality.)

Two references:

C.M.Bachmann et al, “Improved Manifold Representations of Hyperspectral Imagery,” link = http://www.dtic.mil/dtic/tr/fulltext/u2/a452673.pdf

S.T.Roweis and L.K.Saul, “Nonlinear Dimensionality Reduction by Locally-Linear Embedding,” Science, vol. 290, pp.2323-2326, 22 Dec 2000. link = http://www.sciencemag.org/content/290/5500/2323

Every time we wonder about sparsity, we generally wonder about sparsity in a very specific context. As Chris G pointed out, Rob is probably talking about it in the context of linear models and that nonlinear ones are left out of that discussion. An extension of compressive sensing for instance goes into the manifold signal processing route because we have noticed that indeed signals have more structure than just sparsity and most data actually live in a world on their own. When you make this assumption, you end up needing fewer measurements for the reconstruction of the signal.

Brendan O’Connor mentioned that his l1 or l1/l2 models are of varying quality, again, what I am hearing from these experiments is that sparsity in a linear context may not fit his reality, but even then, l1 solvers are really just a relaxation of the actual l0 problem so that when l1 fails, there is still the nagging feeling that somehow the solver itself does not go far enough. In effect, given his problem is really in the class of problem where sparsity in the main feature, that solver is probably hitting the universal sharp phase transition of Donoho and Tanner ( Observed Universality of Phase Transitions in High-Dimensional Geometry, with implications for modern data analysis and Signal processing, David L. Donoho AND Jared Tanner http://people.maths.ox.ac.uk/tanner/papers/DoTa_Universality.pdf )

In compressive sensing, for instance, these sharp phase transitions are used for many purposes: http://nuit-blanche.blogspot.com/2013/11/sunday-morning-insight-map-makers.html

Even within the linear models there are several good reasons why a reconstruction could not work, I personally tend to work those out first before going outside. But then again if you are on sensors and making sense of their data, you really want to stay inside the linear models.

Igor.

Thanks for the links to the Donono and Tanner reference and your blog post. Reading DoTa’s abstract makes me think of Hughes’ “On the mean accuracy of statistical pattern recognizers” (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=1054102), i.e., ‘curse of dimensionality’ considerations.

Thanks for the reference, I’ll have to look it up.

Igor.

[…] it actually has lots of good stuff, including the chapter by Tibshirani which I discussed last year (in the context of the “bet on sparsity principle”), and chapters by XL and […]

[…] estimation techniques, including the GLASSO, rely on the assumption of sparsity, also termed the bet on sparsity. That is, these methods work very well assuming the true model is sparse (e.g. contains relatively […]