Still more on dolphin fetuses

Eric Archer sent in a question on Bayesian logistic regression. I don’t really have much to say on this one, but since it’s about dolphins I’ll give it a try anyway. First Eric’s question, then my reply.

Eric Archer writes,

I’ve been working on a Bayesian logistic regression as part of a larger Bayesian model and have run into a confusing conundrum that I would really appreciate your input on if you have the time. It has to do with whether or not to explicitly model a random effect for each observation in a data set.

My data consists of 1273 dolphin specimens which are either fetuses or postnatal individuals (coded as 0 or 1 respectively). Each animal has been measured, so I’d like to model the probability of being born as a function of length. Specifically, I’m interested in estimating length at birth. I’ve been using a standard logit link function:

logit(p) = alpha + beta * length

I’ve run this model in BUGS as specified here:

model {
for(i in 1:n) {
postnatal[i] ~ dbern(p[i])
logit(p[i]) <- alpha + beta * length[i]
}
alpha ~ dunif(-100, 0)
beta ~ dunif(0, 1)
lab <- -alpha/beta
}

where 'postnatal' is 0 if fetus and 1 if born, and 'lab' is the estimated length at birth. This model runs and I get reasonable posterior distributions on alpha and beta, and more importantly on lab. However, I wondered if the uncertainty in alpha and beta might be inappropriately inflated because there was no explicit random-effect error term in the model, as standard GLM models have. So I tried to convert the model to include this, so that the link function would be:

logit(p) = alpha + beta * length + eps

where eps is a random error term. I restructured the model according to the "Seeds" example in BUGS:

model {
for(i in 1:n) {
postnatal[i] ~ dbern(p[i])
logit(p[i]) <- x[i]
x[i] ~ dnorm(mu[i], tau)
mu[i] <- alpha + beta * length[i]
}
alpha ~ dunif(-100, 0)
beta ~ dunif(0, 1)
var ~ dunif(0, 100)
tau <- 1 / var
lab <- -alpha/beta
}

I had a lot of trouble getting this to run. By tweaking the priors on beta and var, I discovered that for some reason the logit function was causing an overflow error because some values of p[i] were being estimated that ultimately evaluated to 1. I then changed the model to the following, which runs fine:

model {
for(i in 1:n) {
postnatal[i] ~ dbern(p[i])
p[i] <- 1 / (1 + exp(-x[i]))
x[i] ~ dnorm(mu[i], tau)
mu[i] <- alpha + beta * length[i]
}
alpha ~ dunif(-100, 0)
beta ~ dunif(0, 1)
var ~ dunif(0, 100)
tau <- 1 / var
lab <- -alpha/beta
}

After watching these runs, I noticed that the chains for alpha, beta, and var would wander in a correlated manner until var hit the maximum prior value, and then alpha and beta would lazily wander near their current values for the remainder of the chain. Increasing or decreasing var would duplicate this behavior regardless of where alpha and beta were started. I then created some simulated data, and even if I started alpha, beta, and var at the simulated parameter values, they would wander in the same manner toward the maximum var.

While exploring the effect of these parameters on the distribution of the data, it became obvious to me that I could create essentially the same data with small values of alpha and beta (high intercept, shallow slope) and low variance, or large values of alpha and beta (near-zero intercept and high slope, approximating a step function) and high variance. Thus, there seemed to be a high-density "ridge" in the 3-d joint posterior.

One final test was to examine the marginal posterior of lab in the model both with and without a variance component. The distribution was exactly the same in both cases, suggesting that including variance in the model does not reduce the uncertainty around lab as I had expected. I'm stumped on this one.

This led me back to questioning whether it's really appropriate to be estimating a variance component in a model like this. The response variable is a binary state, so what is the plain-English concept of variance with data like this? Does the uncertainty around alpha and beta correctly account for and allocate all of the uncertainty in the model/data? It doesn't seem to affect the parameter that I'm really concerned about, lab, so I'm perfectly happy to use the simpler model. However, it seems to be an exception to the standard GLM format, unless eps is fixed at 0. I notice in "Bayesian Data Analysis, 2nd ed.", on page 90, you do not use a variance component in equation 3.20, where the logistic model is first introduced.
On page 416, following equation 16.1, you state that the dispersion parameter is fixed at 1 in binomial models. However, on page 441, exercise 3 involves making an inference on the variance of a logistic regression. Is there something different about that data (I'm about to convert it to binary data to make it comparable to mine, although I realize that removes the independence between dosages and the correlation within them) that makes this approach appropriate?

My reply:

There are so many things I don’t understand about this example. First, I don’t know why alpha is restricted to be negative and beta is restricted to lie between 0 and 1. Second, I’d think you’d want to start with a plot (maybe you did this already) of probability of being born as a function of (discretized) length, to see if the linear model on the logit scale is reasonable. (Also you need to think about selection issues: where did your sample come from?) Third, I don’t see why “lab” is the length at birth. I mean, I understand that it’s the length at which Pr(born)=1/2, but I don’t see how this quite maps to “length at birth.”

But yes, in answer to your question, you don’t need to add an epsilon to logistic regression. “Epsilon” is already there in the non-deterministic aspect of the logit model. (We discuss this further in Section 5.2 of our forthcoming book.) Basically, adding an “epsilon” doesn’t change the model any, it just changes the interpretation of alpha and beta. (This statement is not 100% strictly true, since epsilon has a normal distribution and the latent-data errors have a logistic distribution in your model, but realistically you’re not going to be untangling a convolution of the logit and the normal in any case.)
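A minimal simulation sketch of this point (in Python, with made-up values for alpha, beta, and the length): generating y = 1 whenever alpha + beta*length plus a standard logistic error is positive reproduces Pr(y = 1) = invlogit(alpha + beta*length), so the error term is already built into the logit model.

import numpy as np

rng = np.random.default_rng(0)
alpha, beta = -20.0, 0.2          # made-up values; -alpha/beta = 100
length = 95.0

# the implicit "epsilon" of the logit model: a standard logistic error
e = rng.logistic(0.0, 1.0, 1_000_000)
p_sim = np.mean(alpha + beta * length + e > 0)
p_exact = 1.0 / (1.0 + np.exp(-(alpha + beta * length)))
print(p_sim, p_exact)             # agree up to Monte Carlo error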

Actually, though, adding the completely unnecessary epsilon does not affect the ratio alpha/beta; it affects the scale of the two parameters together, hence inference for “lab” should indeed be unchanged.
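A numerical sketch of that rescaling (Python, with made-up numbers): averaging the logistic curve over an added normal epsilon gives roughly the same curve as shrinking alpha and beta by a common factor, and either way the curve crosses 1/2 at length -alpha/beta.

import numpy as np

rng = np.random.default_rng(0)

def invlogit(x):
    return 1.0 / (1.0 + np.exp(-x))

alpha, beta, sd = -40.0, 0.4, 10.0     # made-up steep curve plus a latent sd
lengths = np.linspace(60, 140, 9)
eps = rng.normal(0.0, sd, (500_000, 1))

p_with_eps = invlogit(alpha + beta * lengths + eps).mean(axis=0)   # curve with epsilon added
p_rescaled = invlogit((alpha + beta * lengths) / 6.0)              # alpha and beta shrunk together

print(np.round(p_with_eps, 2))
print(np.round(p_rescaled, 2))    # the curves are close, and both equal 0.5 at length 100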

Finally, the exercise on p.443 involves overdispersion in the binomial model with cases where the binomial n>1. In your case (as is typical with logistic regression), you have binary (0 or 1) data, in which the binomial “n” parameter is always equal to 1. In this case, you can’t estimate epsilon.
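A small sketch of why there is nothing left to estimate (Python, with an assumed latent mean and sd on the logit scale): mixing the success probability over a latent distribution just produces another Bernoulli whose probability is the marginal mean, so binary data cannot reveal the extra variance.

import numpy as np

rng = np.random.default_rng(1)
n = 200_000
mu, sd = 0.3, 1.2                 # assumed latent mean and sd on the logit scale

def invlogit(x):
    return 1.0 / (1.0 + np.exp(-x))

# "overdispersed" data: each observation gets its own p drawn from the latent normal
p_i = invlogit(rng.normal(mu, sd, n))
y_mixed = rng.binomial(1, p_i)

# plain Bernoulli data with a single p equal to the marginal mean of p_i
y_plain = rng.binomial(1, p_i.mean(), n)

print(y_mixed.mean(), y_plain.mean())   # essentially identical
print(y_mixed.var(), y_plain.var())     # both are p*(1-p); no extra variance to see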

5 thoughts on “Still more on dolphin fetuses”

  1. Andrew,

    Thanks for your reply about my logistic regression query. I think I finally understand this business about estimating epsilon and will continue leaving it out of the model.

    To clarify some of the items you mentioned in the opening of your reply,

    1) alpha is negative and beta is positive (just happens that in my case I know it won't be more than 1) because the data exists in the upper right quadrant (positive x and y) and P(born)=0 if you are still a fetus (i.e., of small size). So the curve goes from 0 at small lengths to 1 at large lengths. A negative alpha and positive beta specify such a curve (at least as I have formulated the model).

    2) As far as I know, the linear model on a logit scale for data such as this is standard operating procedure. One is attempting to describe the relationship of a state (being born or not) as a monotonic function of a continuous characteristic (length). Without further information, the safest assumption is that the "rate" of this state change is a constant multiple of length. However, (as may be obvious) much of my modelling is self-taught, so I'd appreciate pointers to other approaches that may be more appropriate.

    3) We use the length at which Pr(born)=1/2 ("lab") as the best estimate of the average length at which animals are born. It has the convenient property that it is the point at which the frequency of the two types of "errors" are equal – animals larger than lab which have not been born, and animals smaller than lab which have been born. Again, if this interpretation seems odd, I'd be interested in hashing it out so I know I've got my ducks in a row.
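    A quick simulation sketch of that equal-error property (in Python, with made-up alpha and beta, and assuming lengths are spread symmetrically around lab):

    import numpy as np

    rng = np.random.default_rng(3)
    alpha, beta = -20.0, 0.2                  # made-up values; lab = -alpha/beta = 100
    lab = -alpha / beta

    lengths = rng.uniform(lab - 40, lab + 40, 100_000)
    p = 1.0 / (1.0 + np.exp(-(alpha + beta * lengths)))
    born = rng.binomial(1, p)

    born_small = np.sum((lengths < lab) & (born == 1))   # born but shorter than lab
    fetus_large = np.sum((lengths > lab) & (born == 0))  # longer than lab but not born
    print(born_small, fetus_large)                        # roughly equal counts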

    Thanks again for your time and this cool blog.

    Cheers,
    e.

  2. Jason,

    Could you explain more? Are you talking about using a linear rather than a logit model, and just not worrying about the transformation? Or are you talking about something different? I too have had crashes with the logit model in Bugs, and I use the following workaround:

    y[i] ~ dbin (p.bound[i], 1)
    p.bound[i] <- max(0, min(1, p[i]))

  3. For some reason some of my comment was left out.

    I was talking about specifying the model the usual way (logit of p equals linear function of x)

    logit(p[i])=x[i]*beta

    versus specifying it as p equal to the inverse logit

    p[i]=1/(1+exp(-x[i]*beta))

    They're mathematically equivalent, obviously, but I have found that there are fewer crashes with the latter specification. Not sure why.

    I haven't tried the bounding approach, but that sounds like a good suggestion.

  4. From the tinkering I've done with my model, I think that what happens is that when you specify

    logit(p[i])=x[i]*beta

    some p's get estimated that evaluate to 1, so

    log(p/(1-p))

    is undefined. This error can't happen when you specify it like:

    p[i]=1/(1+exp(-x[i]*beta))

    Thanks for the heads up on the bounding approach – I'll check that out as well.
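    A quick numerical check of that overflow (a Python sketch): once the linear predictor is large enough, the inverse logit rounds to exactly 1 in double precision, and log(p/(1-p)) blows up.

    import numpy as np

    x = np.array([5.0, 20.0, 40.0])           # increasingly large linear predictors
    p = 1.0 / (1.0 + np.exp(-x))
    print(p)                                   # the last entry is exactly 1.0 in double precision

    with np.errstate(divide="ignore"):
        print(np.log(p / (1.0 - p)))           # inf where p == 1, so the logit form breaks down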
