Aki, thanks for the additional comments. It’s not just that the initial range is the issue: I’ve also found that when you start well out of equilibrium, you sometimes wind up with tremendous overshoot in certain variables. Perhaps potential energy coming from one variable gets transferred into another during the equilibration process. If the second variable is constrained to *never* go into bananas territory, it may help. It may not. But particularly with a variable you’re planning to exponentiate, keeping x from ever being 1000 seems like a good idea ;-) Obviously, mileage may vary. I would rarely put a hard constraint on a variable (except if logically it can’t be outside the range). This kind of case, where I’m transforming the variable and the result can be truly ridiculous quantities on the other end of the transform, is probably the only time I usually put in hard constraints.

I did know that in the real paper the prior is not directly on log(PM2.5) but felt that it would be helpful for people to see the kind of modeling that can go into choice of priors, so I abstracted that away for the comments here.

However, when it comes to setting priors on parameters that indirectly affect another measurement, I do think we can do well by using a “pseudo-likelihood” function. That is, imagine you have a,b,c with some kind of simple order of magnitude priors:

a ~ normal(10,5);

b ~ normal(10,5);

c ~ normal(10,5);

But through some function you have a fourth quantity that is physically meaningful about which you have prior information.

d = f(a,b,c)

In the Stan language there’s nothing that prevents you from calculating d in the transformed parameters block, and then in the model block doing:

d ~ gamma(5,4.0/10);

or whatever is meaningful for the physical information. The idea is, you know that d needs to be something in the vicinity of 10 so you reweight your prior on a,b,c by a function that downweights the regions of a,b,c space when they result in d being well outside the range you expect. The prior is now implicitly a multidimensional correlated one, but if your information is primarily in d space, this prior should be much more effective than the generic ones on the individual coefficients.
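Here is a minimal sketch of that reweighting idea, in Python rather than Stan so it can be run without compiling a model. The function f below is purely hypothetical (the comment above leaves f unspecified), and the importance-weighting approach is just one way to see what the extra statement does to the prior; it is not how Stan samples.

```python
import math
import random

def f(a, b, c):
    # Hypothetical link function, chosen only for illustration.
    return a * b / c

def gamma_logpdf(d, shape, rate):
    # Log density of gamma(shape, rate), the parameterization Stan uses.
    if d <= 0:
        return -math.inf
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1) * math.log(d) - rate * d)

rng = random.Random(1)
draws = []
for _ in range(20000):
    # Generic order-of-magnitude priors on a, b, c.
    a, b, c = (rng.gauss(10, 5) for _ in range(3))
    if c == 0:
        continue
    d = f(a, b, c)
    # Weight each draw by the pseudo-likelihood d ~ gamma(5, 4.0/10).
    w = math.exp(gamma_logpdf(d, 5, 4.0 / 10))
    draws.append((d, w))

total_w = sum(w for _, w in draws)
weighted_mean_d = sum(d * w for d, w in draws) / total_w
mass_d_under_40 = sum(w for d, w in draws if 0 < d < 40) / total_w
print(round(weighted_mean_d, 1), round(mass_d_under_40, 4))
```

Draws of (a, b, c) that push d negative or far above 10 get essentially zero weight, so the reweighted prior lives on the part of a,b,c space where d is physically plausible.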

The prior predictive data y_tilde, log(PM2.5), are generated conditional on x, the log satellite measurement at each ground station. x_obs is common to the prior predicted y_tilde and the observed y_obs.

Daniel, I like your additional considerations on feasible range. Two other comments:

1) In the paper, the prior is not directly on log(PM2.5) but on model parameters. It is more difficult to set priors on several parameters so that the outcome is constrained to some range.

2) You justified the constrained range by avoiding “wacky” initial values. With your constraint, due to the transformation, Stan’s default initial values would come from the range [-7.85, 5.85]. Without the constraint, the default initial values come from the range [-2, 2].
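The range in point 2 can be reproduced with a short Python sketch. It assumes Stan's usual behavior: initial values are drawn uniformly on (-2, 2) on the unconstrained scale and mapped through a scaled inverse logit for a variable with lower and upper bounds. It also assumes the bounds in question are -10 and 8, which is what matches the quoted numbers; the helper name is made up for illustration.

```python
import math

def inv_logit(u):
    return 1.0 / (1.0 + math.exp(-u))

def constrained_init_range(lower, upper, init_radius=2.0):
    # Map the unconstrained init interval (-init_radius, init_radius)
    # onto the constrained scale: x = lower + (upper - lower) * inv_logit(u).
    lo = lower + (upper - lower) * inv_logit(-init_radius)
    hi = lower + (upper - lower) * inv_logit(init_radius)
    return lo, hi

lo, hi = constrained_init_range(-10.0, 8.0)
print(round(lo, 2), round(hi, 2))  # -7.85 5.85
```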

Dan, I do miss a lot of points and even entire lines! So what is this particular point that I missed? Kaniav and I tried to publish a rebuttal to this paper in TAS, but between some maladresses in the early version and some difficulties in communicating with the TAS editors, and again another maladresse in selecting the next journal it could be submitted to, the paper ended up in a predatory OMICS journal. I was going to suggest that our experiment in the first example with the unlikely distributions of death was akin to your elimination of vague priors in the paper. But since I missed “the” point, this is presumably not the case.

That’s right, continuous is an approximation to discrete, not the other way around as is often thought.

Other than (possibly) time, this seems true for just about everything. What is commonly measured that is “truly” continuous?

Sure, but my suggested prior is normal(11, 2) with bounds at -11 and 15. The main reason for the bounds here is to prevent problems with getting really wacky initial values. Remember we’re probably exponentiating this value since we’re working on the log scale!

All that being said, the *real* concern here is that my back of the envelope calculation suggesting that 10 to 12 is a reasonable quantity completely fails to cover the range of the actual data which is 1-5 !!!

I think the problem is that I understood the airnow values to be in parts per million, but in fact they are of course micrograms/m^3 for pm2.5 (they are parts per billion for ozone, for example, so it’s always *really important* to check the units of measurement). So my calculations are off, because I was multiplying by the density of air to get micrograms/m^3.

Always check the sensibility of your basic math first. Issues like whether or not you’re putting a hard boundary out pretty deep in the tail of your distribution come second ;-)

So, correcting my math: log(1000) = 6.9 gives the high end of pollution, -10 is the lower bound, and 50 ug/m^3 is a typical value during the day in Pasadena, so log(50) = 3.9. We can set our prior as something like

real<lower=-10,upper=8> logpm25;

logpm25 ~ normal(4,2);

Now going back to the graph, we’ll find that all the data is in fact somewhere between 1 and 5, and the tail at either -10 or 8 does not enter into the calculation, so whether it’s got a hard bound or not won’t really matter. (I only do it, as I say, because if it gets initialized to 15 or something, things could go very wrong and take a long time to get initialized.)
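As a rough check on how much prior mass the bounds at -10 and 8 actually cut off under normal(4, 2), here is a small Python sketch using only the normal CDF. The variable names are mine, and the interval [1, 5] is just the data range mentioned above.

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 4.0, 2.0
lower, upper = -10.0, 8.0

mass_below = normal_cdf(lower, mu, sigma)          # prior mass under the lower bound
mass_above = 1.0 - normal_cdf(upper, mu, sigma)    # prior mass over the upper bound
mass_data_range = normal_cdf(5.0, mu, sigma) - normal_cdf(1.0, mu, sigma)

print(mass_below, round(mass_above, 4), round(mass_data_range, 3))
```

The lower bound sits 7 sd out and removes essentially nothing. The upper bound is only 2 sd out, so it trims roughly 2% of the prior, which should matter little as long as the data keep the posterior down in the 1 to 5 range.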

Daniel:

I think hard constraints are generally a mistake. Instead of bounding a parameter between A and B, better to give it a normal prior with mean (A+B)/2 and sd (B-A)/2. I say this both for computational reasons and modeling reasons.

Two advantages of the normal rather than uniform prior:

1. Partial pooling toward the center of the range.

2. No pathological behavior at the ends of the range.
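A small Python sketch of this soft-constraint recipe, taking the sd as the half-width of the interval, |A-B|/2. The bounds -10 and 8 are borrowed from the logpm25 example elsewhere in this thread purely as an illustration, and the function name is made up.

```python
import math

def soft_constraint_prior(a, b):
    # Normal prior centered on the interval, with sd equal to its half-width.
    mean = (a + b) / 2.0
    sd = abs(b - a) / 2.0
    return mean, sd

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mean, sd = soft_constraint_prior(-10.0, 8.0)
mass_inside = normal_cdf(8.0, mean, sd) - normal_cdf(-10.0, mean, sd)
print(mean, sd, round(mass_inside, 4))  # -1.0 9.0 0.6827
```

With this choice the interval [A, B] is always mean plus or minus one sd, so about 68% of the prior mass lies inside it and the rest tapers off smoothly instead of hitting a wall.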

Oh my. X did rather miss the point, didn’t he.

Yup – as I found out when I tried to publish this in 2011 (which had similar but much briefer arguments) http://statmodeling.stat.columbia.edu/wp-content/uploads/2011/05/plot13.pdf

> I had not seen that paper!

So you missed Xi’an’s and my comment on it in 2013

https://xianblog.wordpress.com/2013/11/21/hidden-dangers-of-noninformative-priors/

I ripped this plot out of context, but yeah, it should definitely be labelled better. Ironic given it was a paper about visualization :p

I had not seen that paper! I like it.

As to why it’s not in a prominent technical journal, I know the answer to that: prominent technical journals have no interest in publishing this sort of work, so you put it where you can.

Of course, WordPress ate the angle brackets; the Stan code should look like:

real<lower=-11,upper=15> logpm25;

Also, let’s go for a lower bound. One 2.5 micron cuboid particle of carbon in 1 m^3 of air. Volume (2.5e-6)^3 m^3, density of graphite = 2.3 g/cm^3. I hate converting units by hand because I’m always making mistakes, so I ask GNU units to do it… 3.6e-5 micrograms/m^3

ln(3.6e-5) = -10
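For anyone without GNU units handy, the same conversion can be sketched in a few lines of Python. The variable names are mine; the inputs are the ones stated above (a 2.5 micron cube, graphite at 2.3 g/cm^3, one particle per cubic meter of air).

```python
import math

side_m = 2.5e-6                       # cube edge, meters
volume_m3 = side_m ** 3               # particle volume, m^3
density_g_per_cm3 = 2.3               # graphite
# g -> micrograms is a factor of 1e6; per cm^3 -> per m^3 is another 1e6.
density_ug_per_m3 = density_g_per_cm3 * 1e6 * 1e6
mass_ug = volume_m3 * density_ug_per_m3

# One particle in 1 m^3 of air, so the concentration in ug/m^3 equals mass_ug.
log_conc = math.log(mass_ug)
print(mass_ug, round(log_conc, 2))
```

This gives about 3.6e-5 micrograms/m^3 and a log of about -10.2, in line with the -10 used as the lower bound.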

so now in Stan we can do something like

real logpm25;

Now if we leave this uniform it’s relatively uninformative, all it does is say “we’re somewhere between the smallest conceivable non-zero value and the highest you can breathe without dying rapidly of smoke inhalation (1000ppm).”

But beyond that, we know it should be typically something like 10ppm to 200ppm with median around say 50ppm because we live in LA and watch the airnow.gov so our kids don’t stay out at summer camp too long in bad air. 200ppm is something like 12 on the log scale, and 20 ppm is 10, so let’s do

normal(11,2)

Having done all this, we now need to ask if something went wrong. It seems that 11 is a reasonable pm2.5 number: it corresponds to 5.9e4 micrograms/m^3, and the density of air is 1.2e9 micrograms/m^3, so we’re talking 50ppm, which is a typical mid-day September reading near Pasadena… yet the observed levels in Dan’s paper are log(pm2.5) of say 3, or exp(-8) = 3.3e-4 times the mass density typical on a good day where I live, which suggests a units issue, like using cm^3 instead of m^3 or kg instead of g or something. Or maybe he’s taking readings in an industrial clean room?
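The sanity check in this comment is easy to replay in Python. This sketch only reproduces the arithmetic as stated (prior centered at 11 on the log scale, air density 1.2e9 micrograms/m^3, observed log values around 3); it says nothing about which units are actually right.

```python
import math

air_density_ug_m3 = 1.2e9
prior_center_log = 11.0
observed_log = 3.0

conc_ug_m3 = math.exp(prior_center_log)              # implied concentration
ppm_by_mass = conc_ug_m3 / air_density_ug_m3 * 1e6   # as parts per million by mass
ratio = math.exp(observed_log - prior_center_log)    # observed / prior center = exp(-8)

print(round(conc_ug_m3), round(ppm_by_mass, 1), ratio)
```

A prior center that sits a factor of exp(8), roughly 3000, above the data is the kind of mismatch that should send you back to check units before arguing about tails.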

By the time you’re seeing counts into say 10000 or so, it probably makes sense to divide everything by 1000 and treat it as continuous with 3 decimal digits of accuracy. Lots of things we automatically treat as continuous, like the output of a digital voltmeter, are obviously discrete when you think about it, but it really doesn’t matter so long as the discreteness is at a small enough scale relative to the measurement overall value.

Wouldn’t it be better to label both axes with “Log (PM_2.5)”, followed by “[ground measurement]” / “[satellite data estimate]” / “[simulated satellite data estimate]”?

I love that bit about breathing concrete and neutron stars. Let’s take it farther though. You can’t breathe 50% smoke particles by mass. Density of air is 1.2e9 micrograms per cubic meter, and ln of that is about 21.

Next let’s look at the airnow.gov site where they state 500ppm smoke is beyond the range of air quality and into hazardous to life… So we can go a factor of 1000 smaller: ln(1.2e6)=14

So now I think we’ve finally got to an uninformative prior, in that we’ve only ruled out the completely impossible.
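The two upper bounds above are one-liners to verify. This Python sketch uses the stated air density of 1.2e9 micrograms per cubic meter and treats the factor-of-1000 reduction as 1000 ppm by mass, which is my reading of the comment.

```python
import math

air_density_ug_m3 = 1.2e9

# Crude ceiling: smoke as dense as air itself.
log_air_density = math.log(air_density_ug_m3)

# A factor of 1000 smaller, i.e. 1000 ppm by mass.
log_thousand_ppm = math.log(air_density_ug_m3 / 1000.0)

print(round(log_air_density, 1), round(log_thousand_ppm, 1))  # 20.9 14.0
```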

Plots often are good at making the implications of poor priors obvious, but my guess is that almost anyone who has done Bayesian analyses in real applications has been blindsided by this issue. Wonder what percent of the time it was noticed (within and over analysts)? On the other hand, there did seem to be a reluctance to simulate and plot priors, which has hopefully passed.

Unlike your claim of it being hazardous to breathe concrete, I can point to some empirical evidence here – Hidden Dangers of Specifying Noninformative Priors. John W. Seaman III, John W. Seaman Jr. & James D. Stamey, The American Statistician, Volume 66, Issue 2, May 2012, pages 77-84. http://amstat.tandfonline.com/doi/full/10.1080/00031305.2012.695938

Now why did this arise only(?) in an “amstat” journal rather than a more prominent technically oriented stats journal, and only in 2012?

(To date about 40 citations, so not totally being ignored.)

I was about to say that we saw this instance of the folk theorem arise for Poisson distributions on the Stan mailing list in the generated quantities block. Simulations would generate counts that were bigger than our 32-bit integers can hold (about 2 billion; we’re going to move to long ints one of these days).

I think this is also a really good way of reminding people that their inferences are a blend of prior information and information in the data, so that if you don’t have much data, you need more prior information to make reasonable inferences. In the extreme case of not having collected any data, you get the prior predictive distribution.
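To put a number on the overflow problem, here is a Python sketch of how often a prior on the log rate implies a Poisson rate bigger than a 32-bit integer can hold. The normal(0, 10) and normal(0, 2) priors are hypothetical stand-ins, not the priors from the mailing-list cases.

```python
import math

INT32_MAX = 2**31 - 1  # 2147483647, the "about 2 billion" ceiling

def prob_rate_exceeds_int32(mu, sigma):
    # P(exp(log_rate) > INT32_MAX) when log_rate ~ normal(mu, sigma).
    # erfc keeps precision far out in the upper tail.
    z = (math.log(INT32_MAX) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))

p_vague = prob_rate_exceeds_int32(0.0, 10.0)  # hypothetical vague prior
p_tight = prob_rate_exceeds_int32(0.0, 2.0)   # hypothetical tighter prior
print(round(p_vague, 4), p_tight)
```

Under the vague prior, between 1 and 2 percent of prior draws have a rate that cannot even be represented as a count, so a prior predictive simulation in generated quantities will hit the problem quickly.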

I’ve done it too!

This happened to Susanna and me with the Poisson distribution once! We were setting up a prior distribution for a model involving survey weighting (I don’t remember the details, but I think it’s related to what’s in this paper), and we were having some computing difficulties, and then we did some prior predictive simulation and we realized that our prior for the log of the rate parameter was way too vague; we were getting some simulations where the rates were all things like 0.001. The discreteness of the Poisson distribution implies that the scale of the prior can really matter. It’s not like the normal distribution this way.

Also this was a good example of the folk theorem.
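A quick prior simulation along these lines, sketched in Python. The normal(0, 10) prior on the log rate and the 100 observations are both hypothetical, since the original details aren't given; the point is only how often a vague prior lands on rates where the counts would be all zeros.

```python
import math
import random

rng = random.Random(7)
n_draws = 100000
n_obs = 100  # hypothetical number of Poisson observations per simulated dataset

# Hypothetical vague prior on the log of the Poisson rate.
log_rates = [rng.gauss(0.0, 10.0) for _ in range(n_draws)]

# Fraction of prior draws with rate below 0.001.
frac_tiny = sum(lr < math.log(0.001) for lr in log_rates) / n_draws

# Fraction of draws where a dataset of n_obs counts is more likely than not
# to be all zeros: P(all zero | rate) = exp(-n_obs * rate) > 0.5.
frac_all_zero = sum(
    math.exp(-n_obs * math.exp(lr)) > 0.5 for lr in log_rates
) / n_draws

print(round(frac_tiny, 3), round(frac_all_zero, 3))
```

Roughly a quarter of the draws have rates under 0.001, and close to a third would most likely produce nothing but zeros. On the log scale those values look unremarkable, which is why the discreteness only shows up once you simulate.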
