Joël Gombin writes:

I’m wondering what your take would be on the following problem. I’d like to model a proportion (e.g., the share of the vote for a given party at some territorial level) as a function of some compositional data (e.g., the sociodemographic makeup of the voting population), and to do this in a multilevel fashion (allowing the parameters to vary across space – or time, for that matter).

Now, I know I should use a beta distribution to model the response. What I’m less certain of is how I should deal with the compositional data that make up the predictors. If I use them in a naive model, they don’t satisfy the independence hypothesis, they create multicollinearity, and they are difficult to interpret. The marginal effects don’t make sense, since whenever one variable goes up, the others must go down (so none of them can be held constant).

Until now my strategy has been to remove one of the predictors or to remove the constant. But obviously this is not satisfactory, and one can do better. Another strategy I’ve used, and I think it was inspired by you, was to center-scale the predictors. I wonder what you think about that.

However I’m sure one can do better than that. I’ve read about different strategies, for example using principal components or log ratios, but what bothers me with this kind of solution is that I find it very difficult to interpret the parameters when they are transformed this way.

So I wondered what strategy you would use in such a case.

My reply:

First off, no, there’s no reason you “should use a beta distribution to model the response.” I know that we often operate this way in practice—using the range of the data space to determine a probability model—but that’s just a convenience thing, certainly not a “should”! For modeling vote shares, you can sometimes get away with a simple additive model, if the proportions are never or rarely close to 0 or 1; another option is a linear model of logistic-transformed vote shares. In any case, if you have zeroes (candidates getting no votes or essentially no votes, or parties not running a candidate in some races), you’ll want to model that separately.
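To make the logistic-transform option concrete, here is a minimal sketch with simulated data (the predictor, coefficients, and noise level are all made up for illustration): fit ordinary least squares to logit(vote share), which is fine as long as the shares stay strictly inside (0, 1).

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def inv_logit(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 1, n)                                    # one made-up predictor
share = inv_logit(-0.5 + 2.0 * x + rng.normal(0, 0.3, n))   # simulated vote shares in (0, 1)

# Ordinary least squares on the logit scale.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, logit(share), rcond=None)
pred = inv_logit(X @ beta)    # back-transformed predictions stay in (0, 1)
```

Zeros would break the logit, which is why any zero vote shares need to be modeled separately before a transform like this applies.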

Now for the predictors. The best way to set up your regression model depends on how you’d like to model y given x.

It’s not necessarily wrong to simply fit a linear regression of y on the components of x. For example, suppose x is x1, x2, x3, x4, with the constraint x1 + x2 + x3 + x4 = 1. It could be that the model y = b1*x1 + b2*x2 + b3*x3 + b4*x4 + error will be reasonable. Indeed, there is no “independence hypothesis” that regression coefficients are expected to satisfy. You can just put in the 4 predictors and remove the constant term (unnecessary since they sum to 1) and go from there.
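A quick simulated check of this point (a sketch using Dirichlet draws for the compositions; nothing here is specific to the election setting): with the constant term removed, least squares recovers the coefficients on the four components without trouble.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.dirichlet([2.0, 2.0, 2.0, 2.0], size=n)   # rows are compositions summing to 1
b_true = np.array([0.1, 0.4, 0.3, 0.8])           # made-up coefficients
y = X @ b_true + rng.normal(0, 0.05, n)

# No constant term: since the four columns sum to 1 row-wise,
# an intercept would be redundant.
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```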

That said, your model might be more interpretable if you reparameterize. How best to do this will depend on context.


My suggestion was to use a hierarchical model, where the coefficients on the predictors come from some hyperdistribution. You could do that in Stan, and you wouldn’t even need MCMC; you could use variational inference.

Variational inference may or may not work, but in my experience with anything as complex as a model varying across time and space… it won’t work.

What it will do is give you a reasonably fast way to get starting values / initialization points for HMC, so that the chains don’t take as long to converge to the stable equilibrium position.

Maybe I don’t understand what’s being asked, but can’t you basically use any method to generate the logits and then softmax?

https://en.wikipedia.org/wiki/Softmax_function

Not exactly… the softmax is self-normalising, but it really pushes values out to the corners of the simplex. If you don’t exponentiate the logits you get better compositional properties (in both elements of the fraction), but then you risk spurious correlations. The standard way of working around this is to use log ratios to work on an unbounded transformation of the simplex, but Joel finds this unintuitive.
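For concreteness, here is a minimal sketch of the softmax and of the additive log-ratio (the simplest of the log-ratio transforms), which is just a softmax run backwards with one component held out as reference; the example composition is made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))     # subtract the max for numerical stability
    return e / e.sum()

def alr(p):
    # additive log-ratio: composition of length D -> unbounded R^(D-1),
    # taking the last component as the reference
    return np.log(p[:-1] / p[-1])

def alr_inv(z):
    # the inverse is just a softmax with a 0 appended for the reference
    return softmax(np.append(z, 0.0))

p = np.array([0.5, 0.3, 0.2])
z = alr(p)            # unbounded coordinates
p_back = alr_inv(z)   # recovers the original composition
```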

If you have G groups, where everyone in the population belongs to 1 and only 1 group, and so you have G proportions of the population, couldn’t you just enter G – 1 of these proportions as predictors? Then the coefficient on group g’s proportion would be interpreted as the change in the outcome associated with a unit increase in the proportion in group g, at the expense of the reference group. As with dummy coding, the choice of reference group wouldn’t matter, as you could shift the effective reference group using model contrasts.

I see your correspondent mentioned this option, but said it is obviously unsatisfactory. I’m not sure why it would be, but I guess it depends on what he is trying to accomplish.
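To see how the reference-group parameterization relates to the drop-the-constant one, here is a simulated sketch (three groups, made-up effects): the intercept estimates the reference group’s coefficient, and each slope estimates the difference from it.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
X = rng.dirichlet([2.0, 2.0, 2.0], size=n)     # three group proportions, rows sum to 1
b = np.array([1.0, 2.0, 3.0])                  # made-up group effects
y = X @ b + rng.normal(0, 0.01, n)

# Full parameterization: all three proportions, no intercept.
full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Reference-group parameterization: intercept plus the first two proportions,
# with group 3 as the reference. The intercept estimates b3, and each slope
# estimates b_g - b3 (the effect of shifting share from group 3 to group g).
Xr = np.column_stack([np.ones(n), X[:, :2]])
ref, *_ = np.linalg.lstsq(Xr, y, rcond=None)
```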

One reason this could be unsatisfactory is that the direction “foo goes up and the reference group goes down” is not necessarily a good, interpretable direction in the N-dimensional space of the data. Let’s take an example. We choose the US “Peace and Freedom” party (aka “California’s Feminist Socialist Political Party”) as the reference party.

“Democrats go up by 1% and the Peace and Freedom party goes down by 1%.”

Well, in the US the Peace and Freedom party is an irrelevantly small group of people. They receive far less than 1% of the vote, and even if they received more, a *swing* of 1% from the Democrats to the PFP is totally unlikely. This “direction” just doesn’t tell you much of anything.

Principal component analysis would at least give you reasonable “directions”; that’s what it does: rotate the coordinate system until its axes align with the directions of maximal variation.
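A sketch of that with simulated compositions (via the SVD; the Dirichlet parameters are made up): because each composition sums to 1, the centered data are rank-deficient, and the leading principal directions describe the variation within the simplex.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.dirichlet([5.0, 3.0, 2.0, 1.0], size=300)   # simulated 4-part compositions

Xc = X - X.mean(axis=0)                   # center each component
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Each row of X sums to 1, so each centered row sums to 0 and the data lie
# in a 3-dimensional subspace: the smallest singular value is (numerically) zero.
# The leading rows of Vt are the directions of maximal variation.
scores = Xc @ Vt.T                        # principal-component scores
```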

> It’s not necessarily wrong to simply fit a linear regression of y on the components of x. For example, suppose x is x1, x2, x3, x4, with the constraint x1 + x2 + x3 + x4 = 1. It could be that the model y = b1*x1 + b2*x2 + b3*x3 + b4*x4 + error will be reasonable. Indeed, there is no “independence hypothesis” that regression coefficients are expected to satisfy. You can just put in the 4 predictors and remove the constant term (unnecessary since they sum to 1) and go from there.

I was wondering if there are some rigorous bounds that one could put on the errors. Suppose we consider a more textbook-ish example with a spherical constraint, x1^2 + x2^2 = 1. The nature of the predictors is then fundamentally different, in that the underlying space of variation is a circle. Can we still go ahead and fit y = b1*x1 + b2*x2 + error and expect reasonable answers?

A lot depends on the specifics of the “real” process. If y really is in reality a*x1 + b*x2 + error for some certain a,b then obviously it will work really well. If this is just our model and the reality doesn’t approximate that… it won’t work.
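A simulated check of that first case (my own sketch, with made-up coefficients and noise): when y really is linear in (x1, x2), confining the predictors to the unit circle doesn’t stop least squares from recovering the coefficients.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
theta = rng.uniform(0, 2 * np.pi, n)
x1, x2 = np.cos(theta), np.sin(theta)             # predictors confined to the unit circle
y = 1.5 * x1 - 0.7 * x2 + rng.normal(0, 0.1, n)   # truly linear in (x1, x2)

X = np.column_stack([x1, x2])
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```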

Yes, but is it possible to break down the error into two misspecification terms here? One is obviously misspecification due to a *wrong* regression, but there should be another due to a *wrong* domain assignment. My curiosity comes from how this works in physics and in applied mathematics, where the underlying topological space and manifold structure are important analytically for calculating things. It still surprises me that statisticians really do not care about the underlying topological and manifold structure in practice. If there is some mathematical intuition behind this, I would like to know.

It’s not clear to me what you mean by a wrong domain assignment. We assume x1 and x2 are measured values. The fact that they need to be on the circle is irrelevant to how they enter into the prediction for y.

Let me ask a different question. The family of functions y = f(x1,x2) that can be defined on (x1,x2) \in \mathbb{R}^2 is not the same as the family of functions that can be defined on S^1. Mathematically, when we are trying to pick \hat{y} from a space of functions, it makes a difference which space we are picking from. And I am wondering whether this makes a difference anywhere in our analysis.

Nothing in the usual theory for linear regression makes a distinction between predictors in R^2 and predictors restricted to a subset of it. All you need is for the linear model for Y to be a good approximation at the observed predictors. Like this: https://goo.gl/rdEkuM

What Corey said.

Elaborating: it’s not relevant how badly the chosen function would predict y in the region where it’s impossible for x to lie… It’s like asking how well physiological models predict the caloric needs of 7-story fire-breathing dragons.

> remove the constant term (unnecessary since they sum to 1)

I remember David Cox warning us not to do that in one of the PhD seminars – unless you know you have the true model (e.g., output from a simulation).

He used a simple one-variable regression model where physics would guarantee the y value had to be 0 when x was zero.

His advice: do not set the intercept to zero, because the relationship will not be strictly linear; there will be lack of fit, and the lower-order terms (here the intercept) do a better job of allowing for that lack of fit. Fix the intercept at its true value and you get a worse overall fit.

Daniel seems to be making the same argument – “If y really is in reality a*x1 + b*x2 + error for some certain a,b”.

As this is just another distraction for me – does anyone know of a reference for this, or counterarguments?

I think this is a different case because here the constant term is strictly redundant; include the constant term and you just get a rank-deficient predictor matrix.
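This is easy to check numerically (a sketch with simulated Dirichlet compositions): adding a constant column to predictors that already sum to 1 does not increase the rank of the predictor matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.dirichlet([1.0, 1.0, 1.0, 1.0], size=100)   # rows sum to 1

rank_without = np.linalg.matrix_rank(X)                                # 4
rank_with = np.linalg.matrix_rank(np.column_stack([np.ones(100), X]))  # still 4: the
# constant column equals the sum of the composition columns, an exact dependence
```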

As no one has mentioned it yet: the way I have typically gone with compositional predictors is the isometric log-ratio (ILR) transform, which maps (defines an isometry between) a D-dimensional simplex to (D-1)-dimensional real space, with the bonus that the dimensions are orthogonal to one another. Other transforms are available, e.g. the centred log-ratio, but these aren’t as good in my view. You can then use the ILR-transformed components as predictors, and since the mapping is one-to-one you can convert points between the two spaces quite simply to predict effects across your composition. You can’t really just put the untransformed composition into a regression model, as the coefficients are uninterpretable: you can’t increase one predictor without changing the value of the others.
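For anyone who wants to try this, here is a minimal sketch of an ILR transform and its inverse, using a Helmert-type contrast basis (one valid choice among many; other orthonormal bases give rotations of the same coordinates). The example composition is made up.

```python
import numpy as np

def helmert_basis(D):
    # (D-1) x D matrix with orthonormal rows, each orthogonal to the
    # all-ones vector: one valid choice of ILR contrast basis.
    V = np.zeros((D - 1, D))
    for i in range(1, D):
        V[i - 1, :i] = 1.0 / np.sqrt(i * (i + 1))
        V[i - 1, i] = -i / np.sqrt(i * (i + 1))
    return V

def ilr(p):
    p = np.asarray(p, dtype=float)
    clr = np.log(p) - np.log(p).mean()        # centred log-ratio
    return helmert_basis(len(p)) @ clr        # project onto the orthonormal basis

def ilr_inv(z):
    z = np.asarray(z, dtype=float)
    clr = helmert_basis(len(z) + 1).T @ z
    p = np.exp(clr)
    return p / p.sum()                        # close back onto the simplex

p = np.array([0.5, 0.3, 0.2])
z = ilr(p)            # two unconstrained, orthogonal coordinates
p_back = ilr_inv(z)   # recovers the composition
```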