## Modeling probability data

Rafael Huber writes:

I conducted an experiment in which subjects were asked to estimate the probability of a certain event given several pieces of information (like a weather forecaster or a stock-market trader). These probability estimates are the dependent variable of my experiment. My goal is to model the data with a (hierarchical) Bayesian regression. A linear equation in all the presented information (quantified as log odds) defines the mu of a normal likelihood. The precision tau is another free parameter.

y[r] ~ dnorm( mu[r] , tau[ subj[r] ] )
mu[r] <- b0[ subj[r] ] + b1[ subj[r] ] * x1[r] + b2[ subj[r] ] * x2[r] + b3[ subj[r] ] * x3[r]

My problem is that I do not believe the normal is the correct distribution for modeling probability data (... because the error is bounded). However, so far nobody has been able to tell me how to model probability data correctly.
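For concreteness, the hierarchical normal model above can be simulated directly; this is a minimal NumPy sketch in which the subject counts, coefficient values, and precisions are all made-up illustrative choices, not anything from the actual experiment. Note that BUGS/JAGS `dnorm` is parameterized by precision, so the standard deviation is `1/sqrt(tau)`:

```python
import numpy as np

rng = np.random.default_rng(42)

n_subj, n_trials = 5, 40          # hypothetical design
R = n_subj * n_trials
subj = np.repeat(np.arange(n_subj), n_trials)

# Per-subject intercepts, slopes, and precisions (made-up values)
b0 = rng.normal(0.0, 0.5, n_subj)
b1 = rng.normal(1.0, 0.2, n_subj)
b2 = rng.normal(0.5, 0.2, n_subj)
b3 = rng.normal(0.2, 0.2, n_subj)
tau = rng.gamma(2.0, 1.0, n_subj)  # precision = 1 / variance

# Cue values, already on the log-odds scale as in the question
x1, x2, x3 = rng.normal(size=(3, R))

mu = b0[subj] + b1[subj] * x1 + b2[subj] * x2 + b3[subj] * x3
y = rng.normal(mu, 1.0 / np.sqrt(tau[subj]))  # dnorm uses precision, not sd
```

Simulating like this also makes the questioner's worry concrete: nothing constrains `y` to (0,1), which is exactly why a normal likelihood on raw probabilities is suspect.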

My reply: You can take the logit of the data before analyzing them. That assumes there are no stated probabilities of 0 or 1; you could round those to 0.01 and 0.99. In any case, you should graph the data and the fitted model to check for problems.
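The rounding-then-logit step can be written as one small helper; this is a sketch in NumPy, where the cutoff 0.01 matches the suggestion above but is otherwise an arbitrary choice:

```python
import numpy as np

def safe_logit(p, eps=0.01):
    # Round stated probabilities of 0 and 1 inward to eps and 1 - eps,
    # then take log-odds, so the transform is finite at the endpoints.
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

probs = np.array([0.0, 0.25, 0.5, 0.99, 1.0])
z = safe_logit(probs)   # finite even for the stated 0 and 1
```

One caveat worth graphing for: after clipping, every stated 0 maps to the same log-odds value, which can show up as a spike in the residuals.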

### One Comment

1. Jerzy says:

I agree that the linear-model part of the regression may work best on the logit scale. But you could also consider the Beta distribution for your dependent variable, since it is defined for data in the (0,1) range.

So either you could do as Andrew suggests:
logit(y) = mu
mu = X*B + e
e ~ Normal(0,sigma^2)
with some diffuse prior on sigma^2.
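As a quick sanity check of this first option, here is a non-Bayesian analogue in NumPy: simulate probabilities from the logit-normal model and recover B by least squares on the logit scale (the priors and MCMC of the full Bayesian version are omitted, and all parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.2, 1.0, -0.5])   # hypothetical coefficients

# Simulate from the model: logit(y) = X*B + e, e ~ Normal(0, 0.3^2)
z = X @ beta_true + rng.normal(0.0, 0.3, n)
y = 1.0 / (1.0 + np.exp(-z))             # back to the (0,1) scale

# Analysis: transform to log-odds, then ordinary least squares
z_obs = np.log(y / (1.0 - y))
beta_hat, *_ = np.linalg.lstsq(X, z_obs, rcond=None)
```

With no 0s or 1s in the simulated data the transform is exact, so `beta_hat` lands close to `beta_true`; the clipping issue only arises with real respondents who state extreme probabilities.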

Or you could try:
y ~ Beta(a,b)
mu = a/(a+b)
gamma = a+b
logit(mu) = X*B
with some diffuse prior on gamma
so that the randomness in the residuals comes from the Beta distribution, instead of Normal on the logit scale.
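The mean/precision parameterization above inverts to the standard shape parameters as a = mu*gamma and b = (1-mu)*gamma, which implies Var(y) = mu*(1-mu)/(gamma+1). A minimal NumPy check, with mu and gamma chosen arbitrarily for illustration:

```python
import numpy as np

def mean_precision_to_ab(mu, gamma):
    # Invert mu = a/(a+b), gamma = a+b back to Beta(a, b) shape parameters.
    return mu * gamma, (1.0 - mu) * gamma

rng = np.random.default_rng(1)
mu, gamma = 0.7, 20.0                     # hypothetical values
a, b = mean_precision_to_ab(mu, gamma)

draws = rng.beta(a, b, size=100_000)
# Sample mean should be near mu; sample variance near mu*(1-mu)/(gamma+1)
```

This is why gamma acts as a precision: for fixed mu, larger gamma shrinks the variance, with all draws automatically staying inside (0,1).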

If you *do* have extremes (0s or 1s) in your data, consider a zero/one-inflated model, as in Wieczorek and Hawala (2011).