If the assumption holds true, then the parameters can be efficiently estimated using l1 penalties. If the assumption does not hold—so that the truth is dense—then no method will be able to recover the underlying model without a large amount of data per parameter.

Am I to assume Brendan has a large amount of data per parameter? What is large, anyway? 30 (the usual number)?

]]>Igor.

]]>Thanks for the links to the Donono and Tanner reference and your blog post. Reading DoTa’s abstract makes me think of Hughes’ “On the mean accuracy of statistical pattern recognizers” (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=1054102), i.e., ‘curse of dimensionality’ considerations.

]]>Brendan O’Connor mentioned that his l1 or l1/l2 models are of varying quality, again, what I am hearing from these experiments is that sparsity in a linear context may not fit his reality, but even then, l1 solvers are really just a relaxation of the actual l0 problem so that when l1 fails, there is still the nagging feeling that somehow the solver itself does not go far enough. In effect, given his problem is really in the class of problem where sparsity in the main feature, that solver is probably hitting the universal sharp phase transition of Donoho and Tanner ( Observed Universality of Phase Transitions in High-Dimensional Geometry, with implications for modern data analysis and Signal processing, David L. Donoho AND Jared Tanner http://people.maths.ox.ac.uk/tanner/papers/DoTa_Universality.pdf )

In compressive sensing, for instance, these sharp phase transitions are used for many purposes: http://nuit-blanche.blogspot.com/2013/11/sunday-morning-insight-map-makers.html

Even within the linear models there are several good reasons why a reconstruction could not work, I personally tend to work those out first before going outside. But then again if you are on sensors and making sense of their data, you really want to stay inside the linear models.

Igor.

]]>Hm! I like that example/idea, Dean. Maybe it suggests that there’s not much reason in the first place to assume causal sparsity? Or is there another ‘fundamental reason’ we tend to have faith in causal sparsity aside from computational complexity?

On the other hand: what is the precise relation of this idea to causal density? I mean — suppose we are considering a number of variables N and describing a system that evolves from time t to time t+k, moving one unit of time at a time. The ‘efficiency of computation’ principle might be something like: the organism has some bound B, and whatever rule describing their N variables’ change from time t to t+1 can’t use more than B causal connections in total. In each single time step this means we’re not allowed to use more than a fraction B/N^2 of the total, directional, possible causal connections between pairs of the N variables. If N is large and B is roughly fixed, this ratio’s close to 0; I think this is what I have in mind when I think of causal sparsity.

But using your idea, we could, for each variable i in our set of N variables, have some associated variable j the levels of which tell i which of the other N-2 variables i should use to determine its state in the next step. There are efficiency restrictions on this process: the process j uses to pick the relevant subset for i can’t itself be intractable, we must be able to identify the relevant subset. I’m not sure what restrictions that makes for causal density, if any? A second restriction is that, at any particular time t, summing over all variables i, the variable subsets they depend on can’t exceed B, since this would require more than the organism’s computational power.

But even so, while we can only use B ‘causal influences’ from t to t+1, in moving from t to t+2 we could seemingly use B^2 influences, and from t to t+3, B^3 influences? So the number of causal influences seems to grow exponentially, if our system switches subsets in a clever enough way (supposing this can be done while still making it easy to identify which subsets are relevant at each time step, for each variable). Of course if variable i could be influenced by N-1 variables in stepping from t to t+1, then it could be ‘influenced-at-distance’ of k time steps by sequences of (N-1)^k variables, and B is small compared to N, so B^k/(N-1)^k is presumably shrinking very quickly, not growing.

Well I’ve rambled enough, sorry! I really like your idea, though. It seems to me there’s something here that could be played with and elaborated in a formal model, but maybe I’m just over-attached to complexity considered. I am really curious whether there’s an alternative, fundamental justification for sparsity principles…

]]>Sometimes it does. Sometimes it doesn’t…

]]>re: the computational constraints and social sciences.

It might be that any particular organism/person/agent only responds to a small number of inputs. But which they respond to might be very situation dependent such that, on average, there are many, many inputs that appear in a true model of the aggregate causal relationships.

Two references:

C.M.Bachmann et al, “Improved Manifold Representations of Hyperspectral Imagery,” link = http://www.dtic.mil/dtic/tr/fulltext/u2/a452673.pdf

S.T.Roweis and L.K.Saul, “Nonlinear Dimensionality Reduction by Locally-Linear Embedding,” Science, vol. 290, pp.2323-2326, 22 Dec 2000. link = http://www.sciencemag.org/content/290/5500/2323

Maybe that’s in part the distinction between “having only a moderate number of predictors” (sparsity in the large?) and “only selecting a small subset of your predictors as relevant” (sparsity in the small?), though. The former seems really appealing to me, and it’s troubling to think that apparent denseness in social sciences data might be intrinsic and not just a biproduct of our limited understanding. The latter seems like a modeling tool more than anything, although I guess that blurs into meta-methodological statement about underlying reality pretty easily if you let your data set get large enough.

]]>So you are not dismissing differences but rather choosing not to make distinctions on them (for now)?

]]>Brendan:

Maybe a mix of L1 and L2 (with weights fit to data) could work even better?

]]>Manoel:

Yes, for the reasons given in my PPS, I find sparse models useful in many settings. Other times, I like some density. For example in my public opinion models, if I have a few predictor variables, I like to include all their interactions. But you could still say that such models are sparse because I’m only including a few predictors. I tried in my post above to emphasize that I do not necessarily disagree with Tibshirani’s methods, I’m just putting a different spin on his justification. I use sparse models for various reasons but not because I think reality is sparse (in the problems where I work).

]]>But in the papers I’ve read by you, Gelman, you sort of induce sparsity. You don’t include all interactions that would be possible, not even higher polynomials. So, instead of letting the data (and the constraint or regularization) induce the variables selection, you chose them based on theory or whatever.

Sure, you use posterior predictive checks, which helps you to go back and see what else could be included in the model in the first place. But I’d like to be able to formally (or quantitatively) compare these two approaches on variable selection, or model building.

Or maybe we could start with Lasso, do posterior predictive checks, and try to include a few variables that we think would improve the model, but this time using other regularization thank l1?

Also, even if the world is dense, we may be forced to induce sparsity. I’m thinking here of the ugly-duckling theorem and all no-free lunch theorems.

Last, but not leas, as far as I understand, you don’t like to average models. Does this mean that in this case you prefer sparse models?

]]>Fernando:

There’s a big difference between what I think the world is, and what I think our estimates should look like. Tibshirani writes that, if “the truth is sparse, in some basis . . . then the parameters can be efficiently estimated using l1 penalties. If the assumption does not hold—so that the truth is dense—then no method will be able to recover the underlying model without a large amount of data per parameter.” What I’m saying, is that in the problems I work on, the truth is dense, but it can still be useful to produce sparse estimates (see my PPS in the above post).

]]>If I understand you correctly, then, you are saying that you don’t want your regularization to do variable selection as well as shrinkage. You want only shrinkage. The reason is you don’t believe any parameter in your model is zero.

That is fine, but this is quite different from saying the world is dense and there are butterfly effects (everything is connected to everything). If so, including less than an infinite number of variables in your model is imposing, at least implicitly, some zero coefficients. Variable selection at some level.

Now, you might say that I am taking you literally, and you are only saying that in your particular applications it is only the variables in the model that don’t have zero coefficients. That is fine. But when I read your paper linked above, that is not the sense I get in your critique of graphical modeling, where, as a a reader, I was left with the impression that mapping conditional independencies — which is what DAGs do — is not well grounded bc there is universal dependence in social sciences.

]]>Fernando:

I have no doubt that what you are talking about is interesting and important, but it’s really different than what I’m talking about. When Tibshirani refers to sparsity, I think he’s talking about setting parameters in some large model to zero: the sparse solution is of lower dimensionality than the full, non-sparse solution. I prefer a nonsparse model with regularization (“soft constraints”) but I completely agree with Tibshirani’s point on the general effectiveness of statistical methods such as lasso that enforce sparsity. My point is that, in the problems that I work on, reality is not sparse. None of the coefficients are exactly zero, and they only look like they could be zero because of small sample sizes. In addition, I don’t work in worlds where there are a few large parameters and many tiny ones (in which case a sparse model could be an excellent approximation). Instead, I work on problems where parameters take on a continuous range of values. Again, sparsity in solutions can be useful for various good reasons, but the underlying reality is not sparse in the worlds where I work.

]]>PS I think your point about heterogeneity is this: Say X causes Y in the population, but is moderated by Z. Now suppose in some Z=z strata data are sparse and a hypothesis test rejects that X has an effect on Y. Your point — and I agree if this is your point — is that this is somewhat silly. Most likely this is due to the sparsity. A hierarchical model would do better.

If this is your point then note this is an inference about the parameterization of a qualitative relation (e.g. a piece-wise function or whatever whereby X affect Y on some levels of Z but not others), and not about the causal relation itself (whether X causes Y for any level of Z).

The structural zeros I am talking about involve deleting an arrow unconditionally (e.g. as would be the case if X had no effect on Y whatsoever, irrespective of Z). I am not talking so much about X having some effect at some level of Z and not others. That is an order of magnitude more ambitious. And yes, for that kind of inference I think shrinkage is the way to go against strata sparsity.

]]>Andrew:

Just to be clear by network I was referring mainly to a Bayesian network, and more specifically to DAGs, not necessarily to social networks. But I am ok having the discussion about the latter.

Causal networks, or DAGs, are completely heterogeneous so I don’t understand your point about heterogeneity. Interaction is a quantitative property of a functional form, not a qualitative feature of a causal structure. That is, the network X ￫ Y ￩ Z is compatible with any functional form such as Y = X + Z or Y = X*Z or whatever. Stronger interactions may in fact make it easier to find conditional independence if it exists.

]]>Corey:

Yup. I added PPS above in response.

]]>Fernando:

I’m not expressing pessimism or extreme Cartesian doubt, I’m just saying that in the sorts of problems I work on, everything is statistically dependent on everything else. For example, we’ve modeled responses to “How many X’s do you know?” questions. Older people know more older people, younger people know more younger people. Women know more women, men know more men. If we have enough data, we’d find that the differences between old and young are different comparing men and women. Of course the differences will differ—that is, of course there are two-way interactions, if we were to look at the entire population. There’s no way that the differences would be identical. Similarly there is geographic structure in who you know. People know other people who live closer to them. But the relevant distance scale is different in NY than in LA. But it’s not just travel time either. And of course the differences by distance are different for different age groups. They have to be: how could they be exactly the same? And so on. I agree that some of these differences are small but I don’t think any are exactly zero, nor do I think it’s such a good idea to work on statistical methods that are based on finding the zeros or finding the nonzero interactions. Cos the trouble is, that in the worlds where I work, there’s a continuum of dependences. It’s not like there are a few huge things and a bunch of tiny things.

Also, I of course agree with you that causal structure can be learned from observation. Indeed, some of my most successful political science papers involve causal inference from observational data. I wrote more about the topic here.

]]>Moreover, even if density is true many connections will likely be so close to zero we can practically treat them as zero. In other words, a little coarsening may drastically reduce the density of the causal network.

And if you insist on sticking to your guns remember that in a dense world even randomized controlled experiments have to be uninformative. After all exclusion restrictions, randomization, and non-interference are all structural zero assumptions (if you analyze experiments with graphs you will see this). Why should we believe these structural zeros and not others?

I think extreme Cartesian doubt has a use but I would not take it so seriously. This being said I do not expect to uncover the true model, just a useful approximation. But the key is useful. In a truly dense world with reverse causality, everything connected, etc.. I don’t think even useful approximations can exists. If your models are useful it is bc implicitly – even if you don’t assume it — there is some sparsity in the world.

PS By the way there are many instances where causal structure was learned form observation including smoking and cancer, cholera (john Snow), etc… Humans rely on this sort of learning all the time. Without it we would not function.

]]>Also, the Gelman (2007) referred to is not in the references.

]]>