Sparse regression using the “ponyshoe” (regularized horseshoe) model, from Juho Piironen and Aki Vehtari

Posted on July 9, 2017 9:30 PM by Andrew

The article is called “Sparsity information and regularization in the horseshoe and other shrinkage priors,” and here’s the abstract:

The horseshoe prior has proven to be a noteworthy alternative for sparse Bayesian estimation, but has previously suffered from two problems. First, there has been no systematic way of specifying a prior for the global shrinkage hyperparameter based on the prior information about the degree of sparsity in the parameter vector. Second, the horseshoe prior has the undesired property that there is no possibility of specifying separately information about sparsity and the amount of regularization for the largest coefficients, which can be problematic with weakly identified parameters, such as the logistic regression coefficients in the case of data separation.

So, what are they going to do?

Funny you should ask:

This paper proposes solutions to both of these problems. We introduce a concept of effective number of nonzero parameters, show an intuitive way of formulating the prior for the global hyperparameter based on the sparsity assumptions, and argue that the previous default choices are dubious based on their tendency to favor solutions with more unshrunk parameters than we typically expect a priori. Moreover, we introduce a generalization to the horseshoe prior, called the regularized horseshoe, that allows us to specify a minimum level of regularization to the largest values. We show that the new prior can be considered as the continuous counterpart of the spike-and-slab prior with a finite slab width, whereas the original horseshoe resembles the spike-and-slab with an infinitely wide slab. Numerical experiments on synthetic and real world data illustrate the benefit of both of these theoretical advances.

This is a big deal. It’s the modern way of variable selection for regressions with many predictors. And it’s in Stan! (and soon, we hope, in rstanarm)

52 thoughts on “Sparse regression using the “ponyshoe” (regularized horseshoe) model, from Juho Piironen and Aki Vehtari”

Eh2406 on July 10, 2017 10:18 AM at 10:18 am said:

This is exciting. I look forward to an example of how to use it in stan, with discussion of when it is/not appropriate. Some of us have a lot to learn before we are knowledgeable.

Reply ↓
- Daniel Simpson on July 10, 2017 11:22 AM at 11:22 am said:
  
  Last time I was loitering in the Stan offices (because sometimes I just hang around in the offices of people much smarter than me. I recommend it!) someone (probably Aki) pointed out something that I’d never really thought about: These priors aren’t about sparsity. Instead you can think of them as “sensible priors” for high-dimensional problems, while things like iid Gaussian priors are *not* sensible priors in more than a few dimensions.
  
  Why sensible? Because sensible priors for regression coefficients have to allow the coefficients to be small (otherwise you’ll get nonsense results and overfitting). Once you start thinking like that (rather than chasing the frequentists), things like the Finnish horseshoe (Aside: I don’t love “ponyshoe” Andrew. It’s a shame they’re not from Iceland) suddenly make a lot of sense.
  
  Reply ↓
  - Andrew on July 10, 2017 2:32 PM at 2:32 pm said:
    
    Dan:
    
    1. See here: “Whither the “bet on sparsity principle” in a nonsparse world?” The comment thread is pretty good too.
    
    2. I don’t like “ponyshoe” either: that’s why I recommended that Juho and Aki call it “regularized horseshoe” instead.
    
    Reply ↓
    - Michael Betancourt on July 10, 2017 3:18 PM at 3:18 pm said:
      
      I think Dan is making a more geometric argument (D – please correct me if you’re thinking about something else!). If you add together a bunch of extraneous random effects with Gaussian priors then the prior for their overall sum will grow with the square root of the number of dimensions. In other words, independent Gaussian priors will be less effective and regulating overfitting the more dimensions that you have.
      
      One of the nice features of the Finnish horseshoe (although I would love “Reindeer show” to catch on) is that you can control the joint behavior directly so that the regularization is robust to the number of parameters.
  - ojm on July 10, 2017 6:44 PM at 6:44 pm said:
    
    > These priors aren’t about sparsity. Instead you can think of them as “sensible priors” for high-dimensional problems
    
    I don’t really understand this comment. Isn’t the idea of sparsity that one is interested in an approximate, ’emergent’ lower dimensional model of high-dimensional data?
    
    I suppose this isn’t a prior in the sense that one doesn’t ‘believe’ that the ‘true’ coefficients are zero (as would technically be the case with formal Bayes), but that they can be treated approximately ‘as if’ zero. So, OK, not really a prior in the sense that it’s not a probability assignment to a proposition about an underlying truth.
    
    But how are these to be interpreted as ‘priors’ rather than say sparsity regularisers? A sparsity regularised problem is simply ‘find an adequate, low dimensional representation of the data’.
    
    I guess I’m saying I see the rationale for sparsity regularisation, but struggle to see in what sense it can be interpreted as ‘prior’, other than a formal mathematical correspondence. Which comes first, the regularisation or the prior?
    
    In what sense does ‘otherwise you’ll get nonsense results and overfitting’ define a ‘sensible prior’?
    
    Reply ↓
    - Andrew on July 10, 2017 7:42 PM at 7:42 pm said:
      
      Ojm:
      
      Here’s one way to think about it: What makes a distribution a “prior distribution” rather than a mere “distribution” is that it is getting multiplied by a “likelihood.” To put it another way, the properties of a prior (for example, whether it can be reasonably considered “weakly informative” depend on available information and context, of course, but also on the likelihood.
      
      So what Dan is saying, I think, is that in the context of regression-type likelihoods, the regularized horseshoe prior can behave well. That is, have good “frequency properties.” (Sorry, Dan.)
    - ojm on July 10, 2017 8:11 PM at 8:11 pm said:
      
      > What makes a distribution a “prior distribution” rather than a mere “distribution” is that it is getting multiplied by a “likelihood.”
      
      OK, fine. But I guess then I see regularisation, and e.g. the existence of emergent sparse approximations, as the deeper principles. And not necessarily tied to probability theory as such.
    - ojm on July 10, 2017 8:15 PM at 8:15 pm said:
      
      (which makes historical sense for what that’s worth since, as far as I am aware, Tikhonov and others first introduced the ideas of regularisation in the context of deterministic inverse/ill-posed problems…)
    - Bob Carpenter on July 10, 2017 11:59 PM at 11:59 pm said:
      
      I think you may be thinking more of low-rank matrix factorization models or the various attempts to embed data in lower-dimensional manifolds nonlinearly.
      
      I am curious what these priors do in the case of perfectly correlated linear predictors. For L1 (Laplace), the penalized MLE is arbitrary for a fixed regularization scale. For L2 (normal), the coefficient is doled out evenly between the two predictors. In matrix factorization, they just get combined and you lose one rank.
    - ojm on July 11, 2017 1:07 AM at 1:07 am said:
      
      Is that question for me? The nesting to hard to figure out.
      
      Something like truncated SVD is a regularisation method, right? In general I find regularisation an easy enough idea to understand, but struggle to see how to interpret it as a prior in _formal_ Bayesian terms beyond taking exponentials.
      
      Even something like the typical set sampling makes sense to me as a regularisation concept, but little sense to me as a Bayesian concept. For example, what definition of Bayes’ theorem do you use for continuous models? If in terms of densities it seems ill-posed until you add something like typicality.
    - Stephen Martin on July 11, 2017 3:12 AM at 3:12 am said:
      
      There’s a point where one has to realize that “prior” can be a misnomer.
      “Priors” can have lots of uses, and I think about them sometimes as data-independent information about the parameter process, or at least the process underlying unknowns.
      Priors, which are ultimately just probability statements, or a part of a model that is a function of unknowns, can:
      
      – Act as soft probabilistic constraints for model identification
      – Act as regularization for estimates
      – Guard against sampling-error induced extreme estimates
      – Act functionally, to improve estimation efficiency or to improve inferences
      – Include prior empirical information
      – Describe parameter generation processes (e.g., random effects models)
      And other things.
      
      I think horseshoe, lasso, ponyshoe, etc, fall under my heading of ‘functional priors’, in the sense that I’m not sure anyone is arguing that the horseshoe methods describe a DGP, but they nevertheless permit better inferences. It’s still a probabilistic component to the joint density, but it’s one used functionally.
      If it’s described as a continuous spike-and-slab prior, then it’s a bit like describing a parameter process in which a parameter is either essentially nil or it isn’t a-priori, and the known portions of the joint density permit marginal inference about which parameters are indeed nil or not. I don’t see that is particularly non-bayesian; it’s a joint density, it’s a probabilistic process on parameters, and one obtains marginal posteriors assuming that process and the likelihood.
      
      TLDR; I think once one gets rid of the “prior” nomenclature, some misconceptions about Bayesian inference can go away; in the end, what matters is a joint density represents a model of knows and unknowns.
    - ojm on July 11, 2017 5:13 AM at 5:13 am said:
      
      > in the end, what matters is a joint density represents a model of knows and unknowns.
      
      My general point was that it seems to me that probability density-based modelling is a special case of more general regularisation concepts.
      
      Two examples: 1. Two densities can be arbitrarily far apart yet define the same probability measure arbitrarily closely. 2. Probability is by definition normalised over the set of possibilities.
      
      So probability density modelling can be useful, sure, but I’m not convinced it’s the more general or interesting way to understand the problem of learning from data etc.
    - ojm on July 11, 2017 5:47 AM at 5:47 am said:
      
      (This was all promoted by Daniel’s comment above about how to think about sparsity priors. Similarly I think Andrew has described not appreciating sparsity initially because ‘no true effects are zero’. If Bayes is to guide regularisation then it seems like it should have its own interpretation that is a helpful guide to more complex problems. If it’s just for regularisation then it seems worth considering other ways of thinking about this)
    - Keith O'Rourke on July 11, 2017 8:38 AM at 8:38 am said:
      
      Stephen: Nicely put.
      
      I have put more vaguely as the prior representing (via a probability model) how parameter values came about or were set and then the data generating model representing (via a probability model) how observations came about and ended up in your current data set (given various points in the parameter space). So the getting multiplied by the “likelihood” simply reflects it being a joint model. There is some discussion of that here https://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf (have you written anything up on this?)
      
      Now if I am getting Michael’s comment here, a joint prior with dependencies is a better (less wrong) representation of “the process underlying unknowns”.
    - Daniel Lakeland on July 11, 2017 12:34 PM at 12:34 pm said:
      
      ojm: you’ll be shocked, shocked I tell you to find out that I am currently working on an alternative model (in the sense of model theory, I think, I honestly know about >||< that much about model theory) of Bayesian inference which may actually answer many of your questions, or at least bring the interpretation of Bayes closer to the Box quote about all models are wrong….
    - ojm on July 11, 2017 8:31 PM at 8:31 pm said:
      
      Daniel – glad to hear my comments aren’t falling on completely deaf ears :-)
      
      If you’re really brave enough to tackle this question via model theory etc (definitely braver than me!) then an idle thought occurs to me:
      
      Bayes usually claims to be an epistemological theory (of what we know, states of information etc) but implicitly uses what seem to be strong, non-constructive assumptions about ‘truth’ via its reliance on Boolean algebra/the excluded middle (even in continuous contexts).
      
      Thus it seems to me to be more connected to model-theoretic truth semantics. And thus it fits awkwardly with the more epistemological idea that ‘all models/our knowledge are wrong but some are useful tools for thinking about some aspects of the world’. And with a desire for ‘generative’ or constructive model falsification as opposed to NHST style reasoning.
      
      On the other hand, a proof-theoretic approach to ‘truth’ is typically related to constructive reasoning and hence avoiding use of the excluded middle and e.g. replacing Boolean algebras by Heyting algebras and such things. The issue is that this is at odds with all proofs of e.g. Cox’s theorem that I know of, and explicitly introduces a gap between what is true and what can be known to be true (see also Godel, as you alluded to).
      
      So, one challenge is to develop a justification of Bayes using only constructive or proof-theoretic reasoning, rather than relying on e.g. Boolean algebra. My feeling is that it can’t be done in a non-controversially _unique_ way i.e. without allowing for a variety of other approaches to uncertain inferences that exist.
      
      Anyway, a very off-topic, wildly speculative comment…
    - ojm on July 11, 2017 8:43 PM at 8:43 pm said:
      
      (PS, I also know very little about model theory, so don’t trust me on anything I just said!)
    - Daniel Lakeland on July 12, 2017 1:29 AM at 1:29 am said:
      
      ojm. not deaf at all. My current theory goes more along the lines of defining a concept of “accordance”, which is in essence something about a degree to which all model assumptions are jointly met. The accordance function then needs a way to take an accordance with a given scientific model and some data, and generate a new accordance by including additional data…
      
      that this process should be independent of the order in which you apply the data, then will I think imply the product rule. adding in a requirement for the total accordance to integrate to either 0 or 1 (so that, we can rule out a theory if it predicts that certain data is absolutely impossible, and yet we observe that data) will lead to a sum rule.
      
      One can perhaps then argue for alternative accordances… that’s fine, but I suspect mine will be the unique one that is order independent and conserves accordance.
      
      Then, rather than probability being a “plausibility that X=a is true” it will be something like “the degree to which X=a accords with all theory and data available”. In the end then, non-identifiability becomes a fact about how the model accords with many possibilities. That then becomes a fact we can use in observer logic to make decisions about the “usefulness” of a false but intended to be useful model.
    - Corey on July 11, 2017 12:52 PM at 12:52 pm said:
      
      A challenge for you: (i) predict the glib, facile response of a Jaynesian such as myself or Daniel Lakeland; (ii) say why it’s unsatisfying.
    - Daniel Lakeland on July 11, 2017 1:17 PM at 1:17 pm said:
      
      Corey, I think ojm has a point. When I use Bayes I’m not doing it with things like some kind of theory of fundamental particles and interactions, where I might have a hope of calling them “True” with a capital T. What I’m typically doing is trying to find what is the reasonable set of parameter values that contains parameters that give the best approximations in the sense of agreeing with data to within the soft margins defined by the likelihood. And, I’m also wanting that to be a soft subset of those that I already knew to search inside of.
      
      The more fundamental concept for actual statistical practice is something like filtering subsets using a quantity that describes degree to which the subset has been included in the candidates.
      
      That Cox’s theorem gives us uniqueness of a particular kind of filtering (filtering *one* truth out of many possibilities) and that there is a model (Kolmogorov probability) and therefore a proof of consistency, makes a powerful argument that Probability is a good way to filter. But I do think in Statistics the more fundamental concept is filtering out those things that are contradicted by either data (likelihood) or purely theoretical (prior) considerations.
      
      In this sense, a my-little-pony-shoe prior is a statement about the regions of N dimensional space within which we expect accordance of a particular predictive model with data. It becomes more obvious too why you need to constantly alter your prior as you alter your likelihood, because the meaning of the parameter space is tied directly to the predictive model, and the likelihood, and so the region that “makes sense” a-priori is also tied inevitably to that likelihood.
      
      There really are no priors without likelihoods. Or to put it better, the joint distribution is more fundamental than any factorization.
    - Keith O'Rourke on July 11, 2017 1:42 PM at 1:42 pm said:
      
      > the joint distribution is more fundamental than any factorization
      I think so and there was a discussion of that on this blog earlier (2 or 3 years ago?)
    - Daniel Lakeland on July 11, 2017 1:55 PM at 1:55 pm said:
      
      I think this also makes good sense out of the ABC method. “Likelihood free” inference is nothing of the kind, it’s inference in which the function that defines accordance with the theory (the likelihood function) is to take the predicted outcomes calculate a low dimensional summary of them, and then in the simplest version say there is 100% accordance if the difference between the summary of the prediction and the summary of the observed is less than some explicit epsilon below which we don’t care about the difference and zero accordance otherwise….
      
      the next step is to smoothen that filter and you get something like a kernel ABC method…
      
      All of this plays nicely into my stuff on declarative models (in which we declare that some function of prediction and data is in a soft region defined by a probability distribution). The fact that all this works in practice, gives hints as well. For example I have a model of certain economic conditions in which I have a predictor function F that takes some covariates, and an observed data D, and I’m assigning D/F ~ some_distribution() in Stan. It’s worked very well to give meaningful inferences on certain costs, and the inferences have a long right tail which is in accord with my expectation that some people just like to spend more on those things, more than is necessary for a family of their size.
      
      So next, I’m working on formalizing this notion into something in such a way that this notion becomes an alternative model of probability theory in the way that frequentist, and Cox interpretations are also alternative models.
    - Daniel Lakeland on July 11, 2017 2:17 PM at 2:17 pm said:
      
      I think it also makes sense out of the problem with thinking purely in a Generative Model context, that is, Data is as if it comes out of plugging parameters into a random number generator. I personally give what you might call “generative” likelihoods and what you might call “pseudo-likelihoods” equal footing in the theory. The preference for generative likelihoods comes out of a history of Frequentist thinking. But if you conceive of a statement
      
      p(F(Data,Params) | F, Params, OtherParams, Model)
      
      as a statement about the accordance of the prediction about the combined quantity F(Data,Params) with the theoretical expectations for such a quantity, then it makes just as much sense as
      
      p(Data | Params, Model)
      
      since this is just a special case, where you are now just using the “identity function”
      
      p(ID(Data) | Params, Model)
      
      The thing that I suspect though, is that the structure of the Cox proof really doesn’t need much if any alteration to become a proof about this new interpretation (and I think, though I’m woefully ignorant, that in Model Theory this reads as my interpretation is a model of probability theory in the same way that Cox generalized boolean logic is a model of probability theory)
      
      The fundamental difference seems to be to interpret the probability as a measure of agreement between theoretical and observed rather than a measure of credence that the quantity is the correct one.
    - Corey on July 11, 2017 11:31 PM at 11:31 pm said:
      
      When I’ve wrestled with the issue of capital T Truth I’ve come to the conclusion that de Finetti was basically right to focus on observables. When I go parametric (andparticularly when I “widen” a parametric model and/or prior in light of data) I generally think of what I’m doing as approximating inference under a Bayesian nonparametric prior stochastic process; and the essential underlying meaning of the resulting posterior on parameter space is to be found in the predictive distribution it induces. (When I “narrow” a model it’s generally to incorporate prior information that I had left out that removes an ambiguity not addressed by the data.)
    - Daniel Lakeland on July 12, 2017 1:19 AM at 1:19 am said:
      
      Yes, exactly, the real purpose of doing the Bayesian inference is to find the portion of parameter space that makes your predictions do what you expect they should do. Sometimes this is because you want to know the value of the parameter, and sometimes this is because you want to make good predictions… but either way, you don’t want to find Truth with a capital T, you want to find some kind of smaller truth within the model, a truth about what the model is doing relative to what you theoretically think it should be able to do.
    - ojm on July 11, 2017 4:01 PM at 4:01 pm said:
      
      Corey: fair enough. I’m also not really sure why I bother…
  - Bob Carpenter on July 10, 2017 11:50 PM at 11:50 pm said:
    
    I recommended “reindeer shoe” during the meeting. Google tells me that’s “poron kenkä” in Finnish.
    
    Reply ↓
    - Aki Vehtari on July 11, 2017 4:06 AM at 4:06 am said:
      
      I’m replying to many comments here
      
      For me this is about prior information. 1) In these biomarker logistic regression examples, it’s very unlikely that any single biomarker could predict the correct class with high probability, which means that none of the weights should be really large. This has been well discussed, e.g., in https://projecteuclid.org/euclid.ba/1340371048 and https://projecteuclid.org/euclid.ba/1488855634 2) In these biomarker examples it’s likely that many of the biomarkers measured are not relevant at all and those which are relevant have may have different effect sizes.
      
      Sometimes these priors are use also as regularizers. For example, I would choose p_0n, we also know that the data has to lie on lower dimensional space restricted by the number of observations, and we could include the prior information that the effects will have dependencies due to non-identifiability leading to sparse factor models and then we could again say it’s a prior and not a regularizer.
      
      Comments about high-dimensionality are relevant and the same examples used to illustrate the concentration of measure in posterior sampling are useful. When the number of dimensions increase Gaussian prior will put most of the mass far away from the mode. Horseshoe is sharp enough near the mode, so that even when the number of dimensions increase it’s possible to have most of the mass near the mode. Alternatively factor models will effectively set the prior on lower dimensional manifold allowing more mass near the mode.
      
      Reindeer’s don’t (at least usually) have shoes like horse’s have.
    - Michael Betancourt on July 11, 2017 3:19 PM at 3:19 pm said:
      
      “Reindeer’s don’t (at least usually) have shoes like horse’s have.”
      
      But Aki we’re innovating here! Well-posed sparsity for posteriors and shoes for reindeers! Let’s start with flats and then build up to heels when the reindeer are ready.
    - ojm on July 11, 2017 3:58 PM at 3:58 pm said:
      
      > we could again say it’s a prior and not a regularizer.
      
      What’s the difference to you?
    - Daniel Simpson on July 11, 2017 12:57 PM at 12:57 pm said:
      
      I picture a gloriously kitsch pair of christmas-themed high heels, which is certainly the image of Bayesian analysis that I most want to promote.
    - Daniel Lakeland on July 11, 2017 1:32 PM at 1:32 pm said:
      
      https://litbimg6.rightinthebox.com/images/190×190/201608/qfuggv1472095578978.jpg
    - Daniel Simpson on July 11, 2017 1:44 PM at 1:44 pm said:
      
      I was thinking more https://cdn.stylefrizz.com/img/rudolf-reindeer-shoes.jpg
Daniel Lakeland on July 11, 2017 3:50 PM at 3:50 pm said:

Ok, so seriously, thinking about this in terms of accordance with a theory and all that jazz, I set up the following Stan “prior” for a sparse vector. Imagine we have a vector of positive quantities, most of which are near zero, and some small number of which we expect to be far from zero. I set up the following Stan model as a prior without any data.

parameters{ vector [100] q;
} transformed parameters{ real summary; summary = sum(inv_logit((q-6)*5)); } model{ q ~ normal(0,1); summary ~ normal(5,.2); }

The idea being that in each vector of 100 q values there are about 5 of the values that are somewhere bigger than or close to 6

When I run the model, I get exactly that, if I do as.matrix(stansamples) and then exclude columns other than the q, and then do a density plot of any given row of the samples, I get a big bump near 0, and a little bump near 6.5 or so.

Of course, in the absence of data, the Rhat is atrocious, because there are around choose(100,5) possible partitions between the two modes. But in the presence of data that would pick out the 5… you’d expect to get what you were looking for pretty nicely.

Now, I see this as taking an accordance for each individual q, that each one is most likely close to zero, and combining it with an accordance for the theoretical quantity equal to the sum of the nonlinear functions, to get a distribution that describes the accordance with the overall information, namely that a few of these dimensions are outliers.

Reply ↓
- Daniel Lakeland on July 11, 2017 3:51 PM at 3:51 pm said:
  
  Stupid blog:
  parameters{ vector<lower=0> [100] q;
  } transformed parameters{ real summary; summary = sum(inv_logit((q-6)*5)); } model{ q ~ normal(0,1); summary ~ normal(5,.2); }
  
  Reply ↓
- Daniel Lakeland on July 11, 2017 3:53 PM at 3:53 pm said:
  
  choose(100,5) is around 75 million, so if you asked Stan to give you maybe 300 million samples you’d expect your Rhat to be pretty good.
  
  ;-)
  
  Reply ↓
laurie davies on July 12, 2017 11:24 AM at 11:24 am said:

Some time ago

https://statmodeling.stat.columbia.edu/2017/01/17/laurie-davies-time-series-decomposition-birthday-data/

I offered an analysis of the birthday data. The same method can be applied to the microarray cancer classification data sets considered in Section 4.2 of Piironen and Vehtari. The only difference is that a robust version was used for the the birthday data whereas the following uses the much simpler least squares method.

Prostate data: the first three covariates together with their p-values are (2619 0.000),(203,0.57),(1735,0.50). The time required was 0.15 seconds. Using the cut-off p-value 0.01 gives 2619 as the only relevant covariate of the three. A simple linear regression using
these three covariates gives the following p-values (2619,2e-16),(203,1.97e-5),(1735,1.11e-4) which are misleading. A classification based on 2619 alone results in 8 misclassifications. Including the covariates 203 and 1735 reduces this to 7. In fact you have a 88% chance of a better classification, that s 6 or less misclassifications, if you retain 2619, replace all other covariates by white Gaussian noise and then calculate the number of misclassifications base on the first three covariates. These always include 2619. The average number of misclassifications is 5. This is based on 1000 simulations.

This does not imply that 2619 is the only relevant covariate, only that given 2619 the remaining covariates are no better than Gaussian noise. If 2619 is removed and replaced by Gaussian noise the relevant covariates are (5016,0.00) and (5035, 2.63e-05). These two result in 12 misclassifications. Now replace 5015 and 5035 by white Gaussian noise. This results in (1839 6.70e-13),(4898,1.37e-6) and (2503,5.32e-3). This process can be continued until there are no more relevant covariates (cut-off p-value 0.01). This results in 180 specified relevant genes in 80 clusters of 1-4 genes. The smallest number of misclassifications was 5 but all were based on 3-4 genes and so no better than 2619 with Gaussian covariates. The simple conclusion is that you can not do better than gene 2619 alone. More information may be available from an expert who may be able to detect a pattern in the 80 clusters of “relevant” genes.

The birthday data: a robustified version was used. The sample size is n=7305. The covariates are sin(pi*i*(1:7305)/7305) and cos(pi*i*(1:7305977305) for i=1,…,7305. This gives 14610 covariates but they were treat in pairs, sin(pi*i*(1:7305)/7305) and
cos(pi*i*(1:7305)/7305) together giving 7305 pairs. There were no dummy covariates. The choice of 0.01 for the cut-off value leads to 105 pairs. The time required was 27.5 minutes. Using the residuals one can clearly identify Valentine’s, Halloween, 1st April etc. and the drop in the number of births on the 13th of each month.

The method is not based on a linear model with error term. As there is no error term there is no modelling of an error term. The analysis applies to the data at hand. It is not Bayesian, it is not frequentist. It does not require any possible prior specification of the number of relevant covariates. No simulations are required nor cross-validation. It is fast, simple and interpretable. There is no hypothesis testing in contrast to the claim by Andrew. What hypothesis is to be tested?

Various versions of this paper have an impressive list of rejections: JRSSB, AoS, AoAS, EJS, JASA. Piironen and Vehtari do not mention it in their bibliography. On the positive side ojm likes it and Corey has stated that it is “clear, simple and powerful” or words to this effect- thank you.

Reply ↓
- ojm on July 12, 2017 6:25 PM at 6:25 pm said:
  
  Yup, I do indeed like this method. I haven’t yet had the chance to use it properly myself but plan to.
  
  Perhaps similarly to Andrew and sparsity, I didn’t initially understand/appreciate the underyling idea because I was thinking in terms of modelling, noise, truth etc.
  
  It became much clearer to me when I started thinking in terms of regularisation, approximation, adequacy etc.
  
  Reply ↓
- Andrew on July 13, 2017 6:01 PM at 6:01 pm said:
  
  Laurie:
  
  Thanks for the comment. The purpose of the method described by Piironen and Vehtari is prediction, not variable selection; they just use variable selection as an intermediate step to improve predictive performance.
  
  Reply ↓
laurie davies on July 12, 2017 12:56 PM at 12:56 pm said:

Addendum: if you analyse the prostate data using the logistic model and least squares you get gene 5016 with a p-values of 6.10e-11 and 9 misclassifications. The logistic model combined with Kullback-Leibler again gives gene 2619 with a p-value of 0.00.

Reply ↓
ojm on July 12, 2017 9:48 PM at 9:48 pm said:

Very dumb question – is the unnumbered expression on pg 4 relating beta bar and beta hat (just above eq 2.3) correct? I would expect an extra X tranpose X there (given the definition of beta hat)? But I haven’t looked very closely so feel free to ignore.

Reply ↓
Kevin S. Van Horn on July 13, 2017 12:36 AM at 12:36 am said:

I like a lot of things about this paper, but it goes off the rails when discussing the “effective number of nonzero coefficients” m_eff. The authors write,

“it is evident that to keep our prior information about m_eff consistent, τ must scale as σ/√n”

and then proceed to advocate setting τ = τ0 = Sσ/√n or τ ~ HalfCauchy(τ0) for a particular positive S. This data-dependent “prior” is nonsense — for τ to depend on n, one of the following would have to be true:

* the number of observations n somehow retrocausally influences the magnitudes of the β parameters, or

* the magnitudes of the β parameters somehow influence your choice of how much data to collect, or

* in some other way YOUR CHOICE of the amount of data to collect provides information about the β parameters.

They continue,

“Priors [on τ] that fail to [scale as σ/√n]… favor models of varying size depending on the noise level σ and the number of data points n.”

This is a problem with their concept of “effective number of parameters,” not with priors that are independent of n. OF COURSE, the effective number of parameters is going to depend on σ/√n; as this quantity gets smaller, you are capable of detecting ever-smaller regression coefficients β_j!

Reply ↓
- Carlos Ungil on July 13, 2017 3:22 AM at 3:22 am said:
  
  Your comment reminded me of the following remark from Radford Neal:
  
  “From a Bayesian perspective, adjusting the complexity of the model based on the amount of training data makes no sense. A Bayesian defines a model, selects a prior, collects data, computes the posterior, and then makes predictions. There is no provision in the Bayesian framework for changing the model or the prior depending on how much data was collected. If the model and prior are correct for a thousand observations, they are correct for ten observations as well (though the impact of using an incorrect prior might be more serious with fewer observations).”
  
  Reply ↓
  - Aki Vehtari on July 14, 2017 5:52 AM at 5:52 am said:
    
    Note that the prior we propose is directly on the model complexity, ie, how many effectively non-zero components we would observe and thus we assume fixed model complexity (with some uncertainty).
    
    Consider this example: There has been a murder, and we know there is exactly 1 murderer. You have either n=10 suspects or n=100 suspects. What is you prior probability that a random suspect is the murderer in these two cases? How would you choose this prior probability so that it doesn’t depend on n?
    
    Reply ↓
    - Keith O'Rourke on July 14, 2017 9:28 AM at 9:28 am said:
      
      Nice – the separate “terms defines a model, selects a prior, collects data” do give a false impression of really being separate.
    - Kevin S. Van Horn on July 14, 2017 10:30 AM at 10:30 am said:
      
      You seem to be mixing up D (number of parameters) and n (number of observations). Furthermore, the notion of “effectively nonzero” is not a property of the the model parameters themselves, as it also depends on σ and n.
    - Carlos Ungil on July 14, 2017 11:03 AM at 11:03 am said:
      
      My comment was not a direct response to your paper, that I have not read so what follows may be completely off-target. But I understand you change the model (by modifying the priors) depending on the amount of data to get the kind of posterior that you want.
      
      I don’t follow your analogy. By changing the number of suspects from 10 to 100 you’re not changing the amount of data, you’re changing the number of parameters. The procedure seems to me closer to the following:
      
      You have several suspects and you would like the result of your analysis to identify the murderer beyond any reasonable doubt. If you don’t have enough data the posterior probability may be too dispersed, so you adapt your prior depending on the amount of data available to increase the chance that the “effective number of likely authors” is one.
    - Daniel Lakeland on July 14, 2017 10:09 PM at 10:09 pm said:
      
      I too haven’t read the paper, so may be off in the way that Carlos is. But, can we resolve this by thinking in terms of an unmentioned parameter?
      
      There is effectively a “count of nonzero parameters, C” in our background information. Given a data set, and resulting values for n and sigma, there exists a probability distribution for some other parameters q that depends on some quantity, call it f, such that p(q|f)p(D|q,f) has the property “count of nonzero parameters ~ C” a property that the posterior has to have in order to accord with the knowledge set K.
      
      suppose that in some more complex model where f is directly a parameter, the posterior for f is sharply peaked around a closed form expression in terms of n and sigma… then we can approximate the more full model by substituting the closed form expression for f.
      
      ??
    - Daniel Lakeland on July 14, 2017 11:32 PM at 11:32 pm said:
      
      Thinking a little more on this, let C be the effective parameter count, and c the approximately known value that this count should have. then the posterior should be:
      
      p(C,D|q)p(q|f)
      
      now if the posterior value for f can be inferred “mechanically” or “in closed form” so to speak, and it depends only on sigma, n then effectively
      
      p(D|q)p(q|sigma,n)
      
      is a “data dependent prior for q” which is in fact just a closed form approximation to the posterior for f
  - ojm on July 14, 2017 8:02 AM at 8:02 am said:
    
    Sample size: data or parameter?
    
    Reply ↓
- Kevin S. Van Horn on July 13, 2017 6:08 PM at 6:08 pm said:
  
  I also take issue with this:
  
  “…Fig. 6 illustrates the importance of scaling τ with the noise level σ… when the observations are scaled by multiplying them by 0.1 (bottom row), the value for τ that does not scale with σ yields clearly worse results than in the first case, while the results for the latter value remain practically unchanged.”
  
  But by scaling *everything* by 0.1, they have created a situation wherein the β’s are *by design* proportional to σ as you move from the first to the second case! An honest test would have been to scale only the errors while keeping the β’s unchanged.
  
  BTW, despite these criticisms, I’m very glad they wrote this paper–I’m facing exactly the issues they bring up (need a prior that both encodes sparsity and bounds effect size) and will be trying out the regularized horseshoe soon. I just won’t be using their prior on τ. :-)
  
  Reply ↓
  - Daniel Lakeland on July 14, 2017 10:10 PM at 10:10 pm said:
    
    Kevin, how does a model similar in style to the one above work in your problem?
    
    https://statmodeling.stat.columbia.edu/2017/07/09/updated-ponyshoe-paper-juho-piironen-aki-vehtari/#comment-523431
    
    Reply ↓

Statistical Modeling, Causal Inference, and Social Science

Sparse regression using the “ponyshoe” (regularized horseshoe) model, from Juho Piironen and Aki Vehtari

52 thoughts on “Sparse regression using the “ponyshoe” (regularized horseshoe) model, from Juho Piironen and Aki Vehtari”

Leave a Reply Cancel reply