Skip to content

Data-dependent prior as an approximation to hierarchical model

Andy Solow writes:

I have a question about Bayesian statistics. Why is it wrong to use the same data to formulate the prior and to update it to the posterior? I am having a hard time coming up with – or finding in the literature – a formal reason.

I asked him to elaborate and he wrote:

Here’s an example. Independent observations X1, X2, …, Xn from the uniform distribution on (0, U). Interested in the posterior distribution of U. Likelihood is 1/U**n for U > xmax, 0 otherwise. Exponential prior for U with parameter log 2/xmax so that prob(U < xmax) = 1/2. Apart from lacking justification, the prior depends on the data.

I replied: Yes, in general you don’t want the prior to depend on the data. When prior is expressed to depend on data, we think of it as an approx to a prior distribution that depends on some unknown parameter that is estimated from the data. So the idea is that there is a coherent model that is being approximated for convenience.

For another example, consider regression models with standardized predictors, as discussed in my 2008 paper, “Scaling regression inputs by dividing by two standard deviations.” If you rescale predictors based on the data, then put informative priors on their coefficients (as we recommend in our other 2008 paper), then you have a prior that depends on data.

I’d like to think of this as an approximation to a prior that depends on a hyperparameter—the population sd of the predictor in question—which in turn is estimated from data using a simple point estimate. Thus, an approximation to a hierarchical model.

Now that Stan is available it would make sense to set up the full hierarchical model, partly for practical purposes when sample sizes are small, partly because the full model will be easier to extend if we want to go to other scalings, and partly as a demonstration so that if we do use the approximate approach, we’ll know what we’re approximating.

Maybe we can put this in our Bayesian econometrics text. I’m thinking of an intro chapter, or maybe an appendix, with all the basic regression models with different priors.


  1. Corey says:

    If you use the same data to construct the prior and update it to the posterior then you’re violating the premises of Cox’s theorem. The resulting probability distribution will not map to plausibility in the manner that Cox’s theorem guarantees for posteriors that do not make double use of the data.

    • Laplace says:

      Here’s Bayes theorem with a “prior” dependent on the data.

      P(param|Data, Data) = P(Data|param, Data)P(param|Data)/P(Data|Data)

      this reduces to the statement P(param|Data, Data) = P(param|Data) which seems completely consistent with “plausibility in the manner that Cox’s theorem guarantees for posteriors”.

    • Corey: I remember reading this paper previously. It’s a great paper, but it’s dense. Do you happen to know what is being violated here? It’s not clear to me that “time” has any actual role in Cox’s theorem, there’s no notion of “before and after” that I remember. The only thing is plausibility conditional on your model and the data. And in this case, the problem with using data to construct your model is that your model is conditional on the data you found. So the “goodness” of that model is only as “good” as the data you used to construct it.

      Obviously, in a Bayesian updating system, you always have limitations that come from your data. But when the limitations are on the structure of your model.. it’s not clear that that’s a good thing, so it may be a good idea to avoid conditioning your choice of model on some sample statistic of the data.

      On the other hand, in the case where you’re estimating the upper bound of x ~ uniform(0,U) and you have some data. You are 100% sure that U is not less than max(x). So using the data to choose a model (ie. maybe U-max(x) ~ something() ) seems like it’s unlikely to invalidate anything Cox’s theorem would tell us. But I’d be interested to see if you can point to something specific being violated in Cox’s theorem.

      My suspicion is that Cox’s theorem still applies, but that using the data to form your model can sometimes restrict or alter your model in unjustified ways (ie. in essence it adds noise to the process of model choice, or it implies higher plausibility than is really justified by the information you have by excluding alternative models you should have chosen based on your true state of knowledge)

      Here’s the basic reasoning:

      Suppose my state of knowledge is (model M is a decent model for this data D).
      Now, suppose that someone tells me (oh if you subset data D and do some stuff then f(D) = F, so my state of knowledge is M is a decent model for D and also f(D) = F).

      If you take the second state of knowledge as implying that some hyperparameter F’ is very close to F… this can be reasonable under many circumstances especially when the sample size involved in the calculation of f(D) is large, but it can be excessively restrictive when you actually set F’ = F rather than merely F’ ~ distribution_around(F)

      But, the second state of knowledge is still a valid state of knowledge. In fact, it maps pretty closely to what’s done in stat-mech (ie. Pressure measured by my meter is P is pretty similar to f(D) = F)

      • Corey says:

        The prior is supposed to encode the state of knowledge exclusive of the data; this is the premise being violated. Using the data to set the prior is tantamount to plugging the data into a (fairly arbitrary) mapping from the data space to the space of priors and using the prior picked out by the data. Updating to the posterior is then the process of using plugging the data into Bayes’ theorem and using the posterior picked out by the data. In this process the P(param|Data, Data) does not necessarily cancel back to P(param|Data) as it would in a correct application of probability theory.

        The whole process is no longer under control, so while it’s possible that you could get results that make sense, you no longer have the protection of Cox’s theorem to ensure that they make sense. For example you could be using an approximation for convenience as described in the OP, and then you’ll be okay; or you could be using a point estimate of a hyperparameter in a situation where the uncertainty in the estimation is substantial, whereupon your “posterior” will be far narrower than it would be under full Bayes and a reasonable hyperprior.

        • “Using the data to set the prior is tantamount to plugging the data into a (fairly arbitrary) mapping from the data space to the space of priors and using the prior picked out by the data”

          Ok, I think you, Laplace, and I have all made that point. It’s possible to go wrong, and the way in which you’re likely to go wrong is to pick something too specific to the accident of the data compared to what you really know about the whole thing.

          “the prior is supposed to encode the state of knowledge exclusive of the data”

          well, I think I disagree with this, I mean, it encodes SOME state of knowledge. The fact that it’s exclusive of the data is the usual case. but you could imagine the following case: a probe on Mars is sending data to us. It first calculates some data statistic f(D) and sends it to us, and then it begins a download of the full data which thanks to the 300 baud speed and a lot of noise on the channel, takes a month.

          During that month, we’d like to set up our model so we’re ready to go. But, we have that statistic. It’s real actual information about what’s going on. And if we have a model for how it occurred, then the data value is valid data to put into a model.

          p(F’|f(D)) = something….

          It’s reasonable to choose something here. whether it’s reasonable to choose

          p(F’|f(D)) = delta_function(f(D))

          is a different story, most likely it’s not. But it’s not so much that we’ve used the data twice here, as that our model for f(D) is a poor and overly-certain one.

          let’s say we develop our model along the lines of

          P(F’|f(D)) = normal(f(D),something)

          and we choose “something” based on our knowledge about how f(D) was calculated and etc.

          Now, when the data comes, do we have to go back to the drawing board and say “gee this was a terrible model because it used the data twice and it violates Cox’s theorem?” I think NO. Where we might have to go back to the drawing board is if we discover that our model for how f(D) arises is lousy. For example if we assumed something like a central limit theorem, but the sample size was small and the distribution for D was odd.

          • Rahul says:

            Isn’t the million dollar question how to tell apart what’s “specific to the accident of the data” versus “what you really know about the whole thing”?

          • Corey says:

            Now, when the data comes, do we have to go back to the drawing board and say “gee this was a terrible model because it used the data twice and it violates Cox’s theorem?” I think NO.

            It’s not categorically NO. It’s easy to contrive situations where for most possible data sets lying on the level set {D | f(D) = F} we’d be happy with our approximation and yet there exist subsets of the level set where, upon receiving the full data, we’d definitely want to go back to the drawing board. But this whole discussion is a bit removed from the issue of double-use of the data…

            • So, one thing I’ve done in the past, is I’ve tried to build models like the one Andrew mentioned, where you’re rescaling the data by something. (as is well known, I’m a stickler for rescaling).

              But in most of my models rather than rescaling the data by the sample standard devation, or some other such thing. I create a parameter, say s, and then I rescale by s, and I also inform the model that s is somehow related to the data. The interpretation is: “there’s an unknown value by which all the things should be rescaled in order for my model to work, and by the way we can discover information about that value from the data”

              mu ~ exponential(1/100);
              s ~ exponential(1/10);
              data ~ normal(mu,s);

              those above lines have interpretation p(data | s,mu) p(s) p(mu)

              But we don’t have just a model for all the data as an ensemble, we also have predictions for individual points after rescaling (or after some other transformation), then we move on to things like:

              data[i]/s ~ normal(regression-model-here, prediction_error_here);

              this last line has the interpretation of
              p(Data[i]/s | s, RegressParams)

              assuming I’ve defined the regressParams distributions previously. We now have an issue. In Stan at least, it’s going to throw a warning about the nonlinear transform on the left hand side. Do you need to do anything?

              Suppose instead of just one parameter you had three or four.

              f(data[i],s,q,n) ~ normal(regression-model-here,prediction_error_here)

              In the first case g(x) = x/s, or what’s the same, data[i]/s you have one parameter, and you can talk about the jacobian and get yourself confused. In the second case, it’s obvious that there is no jacobian, f has values in a 1 dimensional space, and (s,q,n) are in a 3 dimensional space.

              What I finally decided was both of these statements are about samples in a newly constructed space, the d/s or the f() space as the case may be. The probabilities involved are about fixed-width and infinitesimal regions in those spaces conditional on all the s,q,n etc values being true. My conclusion here was that we shouldn’t be transforming anything.

              It seems like the “full bayesian” version of the transformed-by-sample-sd model is a subset of this type of model. It is, however, not entirely a trivial matter to understand what is going on.

              • Chris Wilson says:

                Daniel, do you apply this method of rescaling to predictors (inputs in Gelman’s terms) or to the response variable or both? Also, do I interpret correctly that you conclude that Stan’s warning message is not a problem for your method? Seems like an interesting idea.


              • Chris Wilson: I apply it all over the place, I always write my models in dimensionless form, so essentially everything is rescaled in some way. Note, I tend to work on physical stuff, so not things on “reported scales” on surveys or the like. Also, I don’t always have uncertainty in my rescaling, sometimes I just rescale by chosen constants. But when I believe that there’s a symmetry, in other words that many different cases all collapse down to the same basic thing so long as you rescale each one appropriately, then I put uncertainty on my rescaling factors.

                Also, yes, I concluded that Stans warning was useful for people who are doing things like transformations of variables (from say n variables to n other variables) but that it wasn’t applicable in this case. There is certainly no mathematical way it CAN be applicable when you’re doing f(data, param1, param2…) so that there doesn’t exist a jacobian correction because you map from N variables down to 1.


              • Also, Chris, if your left hand side DOESN’T involve the data (ie. it’s only a function of multiple parameters), then I’m not sure what to think of it. pretty clearly, if you’re transforming 1-1

                log(foo) ~ normal(a,b);

                you’re probably making a mistake and need a jacobian correction.

                but what if you’re doing:

                log(foo+bar) ~ normal(a,b)

                There is no sense in which a jacobian correction is possible, since you’re transforming two parameters foo, and bar into 1 value.

                On the other hand, if you had two such transformations involving foo,bar you could do the jacobian correction, but you’d have to think hard about what your statements mean. You may be better off reparameterizing your model. Remember Stan is imposing a constraint that frequencies of samples of parameters have a certain probability measure on the parameter space.

                On the other hand, when you have data values on the left hand side, the interpretation is that under the appropriate values of the parameters, your output is a transformed data value which (when the parameter values are appropriate) has a defined distribution *in the outcome space*. (ie. it’s a conditional probability conditional on the values of the parameters).

                you’ll notice that if you write out the full probability measure, you’ll need an extra differential factor to make the units come out to probability:

                p(mu)dmu p(s) ds p(data|mu,s)ddata p(f(data,mu,s)| mu,s) df(data,mu,s)

                it’s not that you’re transforming some other variable and need to preserve a measure, you’re creating a new probability space and your statement places a measure on the new space.

                The overall model sketched out above translates to the following state of knowledge:

                mu is about 100, s is about 10, if mu,s have the right values then the data itself is distributed around mu with a standard deviation s, and if s is correct (conditional probability again), each of the data[i] values when divided by s is distributed around regression-model-here with some error prediction-error-here (which I’d have to give priors on to complete the model).

                in essence the statement:
                f(data[i],s,q,n) ~ normal(regression-model-here,prediction_error_here)

                is a kind of likelihood, but a likelihood of outcomes in a transformed space. it plays the same role as p(data|mu,s)ddata in my above symbols

                another way to think of this is as two models, one on mu, s, and the data that produces a posterior for mu,s and one which uses the posterior on mu,s as a prior in a higher level model on transformed data, further modifying the posterior for mu,s.

                I hope that helps :-)

              • ack, wordpress ate my long string of spaces that put the ^^^ values in the right place, I was hilighting the df(data,mu,s) differential factor.

              • Chris Wilson says:

                Thanks Daniel! Although I’ve modeled uncertainty in predictors before (measurement error, etc.), and have done plenty of standardization (usually Gelman’s divide by 2 SD approach), I’d not considered incorporating uncertainty with re-scaling. Definitely have to think about this more for future applications…

              • The question you have to ask yourself before going to all this trouble is why do you want uncertainty on your rescaling? When it’s sufficient to rescale things by known constants, you should do that. But if you’re claiming something like:

                there exists a scale factor S[i] for each group of data indexed by i, such that
                f(data[i]/S[i]) = g(somestuff) + error

                in other words, with the appropriate rescaling all the groups have the same behavior g.

                then in realistic cases it’s rare for you to know what the s[i] values are exactly ahead of time, and you’ll want to put uncertainty on them.

                The microarray example (which I discuss elsewhere on this page) is a good one. You have brightness measurements off of an array of DNA spots. The overall brightnesses are affected by HOW MUCH material you put on the slide. So there’s some constant with units of brightness/mass and an unknown mass, which you could multiply together to get something in units of brightness which you could then divide all your brightness measurements by and all the arrays would be in a common comparable dimensionless form.

                But you don’t know the constant, or the mass… however you know that the median value on the array varies fairly linearly with the mass. so you can start there, and add some uncertainty and determine a scaling factor with uncertainty.

  2. Christian Hennig says:

    The whole Bayesian setup depends on the prior being a prior, i.e., prior to the data. If you use the data in choosing the prior, the data is used twice in a setup in which it is supposed to be used only once.

    In the example, you choose an exponential prior for U with parameter log 2/xmax. You could be even more radical and choose the prior as a one point-distribution on xmax. So the data would determine your prior and your posterior would say that you *know* that U=xmax for sure. The prior is meant to formalise uncertainty before data, and the posterior then gives you your uncertainty after data. If you mix the data into the prior, you will underestimate the uncertainty after data because the data was used not only to update the prior but also to remove some uncertainty in the prior that should have been there because before data indeed there was no certainty about what the data would be.

    Of course using a one point data dependent prior looks extremely foolish but it should serve to illustrate what happens; it is just more extreme than letting the data sneak in in a less extreme fashion by for example using it as a parameter estimator for a parametric prior.

    (I’m not saying what Andrew wrote about approximation is wrong, but I think it’s better to make a bigger and straighter point of what’s the problem in having the prior depending on the data before starting to devise approaches that can somehow justify doinbg it in certain cases.)

    • Rahul says:

      Naive question:

      Suppose you retain the strict textbook separation of your Prior-1 & Data-1. You get a pristine Posterior-1.

      Now for your next study couldn’t Posterior-1 become Prior-2? So although you may still retain the strict separation between Prior-2 & Data-2, your Prior-2 is already contaminated by Data-1, right?

      So, it’s a question of whether you contaminate your prior with Data-1 alone or both.

      My point is, how much does this really matter in practice? Unless you choose some extreme prior.

      • Carlos Ungil says:

        Not only you could use Posterior-1 as Prior-2, you really should do that if you want to use all the available information. Or you could use Prior-1 and (Data-1+Data-2) to arrive to the same results.

        I don’t know why do you think there is a “contamination” problem when you analyse Data-2 using the all the prior information available (which obviously includes Data-1).

        However, I hope you will agree that if one makes a copy of Data-2 and calls it Data-1 then one cannot include Data-1 in the prior used to analyse Data-2. One cannot simply make copies of the data to increase the number of observations when doing a frequentist analysis either.

        • Rahul says:

          No, to clarify: I don’t think there’s a contamination problem. At least not a serious one.

          But the strict separation between prior & data camp made me think they perceive there is a serious problem?

          • Christian Hennig says:

            No. The separation between prior and data is relative to the specific situation. You have already observed data-1 so now your uncertainty is expressed by posterior-1, but if you’re about to observe and analyse new data-2, relative to these data it’s prior-2. No problem there. (It would be a problem if you wanted to use data-1 once more for updating after you already used it to arrive at posterior-1.)

          • Carlos Ungil says:

            You seemed to be arguing that prior-2 including information from data-2 is not worse than prior-2 including information from data-1. In reality including information from data-1 in prior-2 is not wrong (in fact it is optimal) while including information from data-2 in prior-2 is indeed wrong.

            How wrong? Maybe not “can’t look myself in the mirror anymore” wrong but wrong nevertheless. How much does it really matter? If it doesn’t really matter it’s because you’re getting the same result as if you were doing the right thing, so what is the point in doing the wrong thing in the first place? And the more it makes a difference, the more wrong it is.

            Taking a bit of information from the data to formulate a prior might be easier than doing it correctly. And maybe it doesn’t do much damage. But one should be aware when going off-limits.

  3. Laplace says:

    The approach people want to use is to view everything philosophically and to impute to Bayesian statistics whatever half learned philosophical crap they picked up in some dumb stats text book. Even the use of the word “prior” is an example of this tendency. There are no distinct “priors” in statistics. Just distributions conditional on whatever they’re conditional on, and manipulated by the basic equations of probability theory. A “prior” as a distinct kind of probability distribution is purely a figment of your philosophical imagination.

    An alternative approach is to take the basic equations of probability theory as given, and see what they actually do. Not what you imagine they do based on your philosophical prejudice, or hope they do, or want them to do, but what they actually do as mathematical facts.

    If you do that, you’ll discover there’s all kinds of setups where a term which holds the “prior” position in the equation depends on the data. Perhaps the simplest case is when there’s uncertainty about the model M and you have to average over that uncertainty using the basic equations of probability theory. It often happens that the data effectively picks out a specific model (call it M_0). Since both the “likelihood” term and “prior” both depend on the model, it will appear like the “prior” depends on the data. Specifically, you get something like:

    P(params | Data ) = \int dM P(params,M|Data)
    = \int dM P(params|M, Data)P(M|Data)
    \propto \int dM P(Data|params, M)P(params|M)P(M|Data)
    \propto P(Data|params, M_0) P(params|M_0)

    whenever P(M|Data) is sharply peaked about M_0.

    Or if you don’t like thinking of M as a model, think of it as some higher (uncertain) parameter as in Gelman’s post above.

    Thats not the only situation though. There are many where the basic equations of probability theory lead to a distribution in the “prior” position which depends on the data. Please note: if you disagree with this, you’re not just disagreeing with me, or Bayesians, or whoever, rather you’re claiming the two most basic equations of probability theory (which everyone agrees on) are wrong. Good luck with that.

    • So, I think you and I made the same point in different ways. Above I talk about how sometimes f(D) = F implies that some hyperparameter F’ is basically equal to F, other times it implies just that F’ is “close to” F. But in both cases, we can say that having seen f(D) so that it’s included in our state of knowledge, we can now create a model p(D|M,F’)p(F’|f(D))

      The key is to be realistic about how we’re using our knowledge of f(D) to form p(F’|f(D))

    • Rahul says:

      Dumb question: Which are the “two most basic equations of probability theory”?

      • Sum rule and product rule.

        P(A or B) = P(A) + P(B) if A, B mutually exclusive

        p(A and B) = P(A|B) P(B)

      • Laplace says:

        the product rule: P(A and B|C) = P(A|B,C)P(B|C)

        and the sum rule (probabilities sum to 1):

        if {A_i} is an exhaustive, mutually exclusive list then

        P(B|C) = \sum_{A_i} P(B and A_i|C)

        as well as various generalizations of either.

      • Laplace says:

        Only two things are required to get out of the endless stupid stat debates: (1) You believe in the basic equations of probability theory and (2) you believe they apply with enough generality to yield Bayes theorem as used by Bayesians.

        Every statistician agrees with (1) and is thus halfway home. All “bayesians” presumably agree with (2). So just take (1) and (2) and trust them. Instead of forcing your own prejudices, intuitions, and beliefs, on to the equations and trying to twist them into what you think is true, just respect the equations and see what they say. You’ll be surprised how much of the ridiculousness of statistics just melts away if you do that.

    • Corey says:

      It often happens that the data effectively picks out a specific model (call it M_0). Since both the “likelihood” term and “prior” both depend on the model, it will appear like the “prior” depends on the data. Specifically, you get something like:

      whenever P(M|Data) is sharply peaked about M_0.

      This is just one of those valid approximations to a fully Bayesian analysis. The validity of the approximation is guaranteed by an additional premise over and above those of Cox’s theorem, to wit, the assumption that P(M|Data) is sharply peaked about M_0.

      • Laplace says:

        What if M_0 is uniquely determined by Data (but changes as the data changes). Then it’s not an approximation.

        This is just one example though. There are practically infinite number of scenarios the basic equations of probability theory can be applied to. Some will yield such a result in approximation, and some will yield it exactly. Although in practice, pretty much every bayesian analysis is really an approximate bayesian analysis under close inspection.

      • Laplace says:

        In fact, we can make this a little more concrete. Suppose we’re studying some newly discovered fly. There are two candidate metabolic mechanisms for controlling the age/death of these flies. With mechanism M_0 the flies will never live more than 7 days. With mechanism M_1 the flies can live up to a month.

        We are interested in estimating some parameter lambda related to the fly’s metabolism. It turns out that the prior P(lambda|M_0) is different from P(lambda|M_1). So now we collect some data and get FLY_LIFESPAN = 3 days, 4 days, 20 days, ….

        Then an analysis using the sum/product rules of probability theory will cause us to select P(lambda|M_1) for the prior exactly since P(M_1|data)=1 in this case. If the data had been different, we’d get a mix of M_0 and M_1 in general. A naive interpretation of Bayes would say it’s wrong to look at the data and use it to select the prior, but it’s absolutely and exactly correct in this case and this requires no new principles beyond those needed to derive Bayes Theorem .

        • Corey says:

          Recall that the issue is using the data to formulate the prior. This is not the kind of process being criticized as double-use of the data — it’s pretty much the definition of data double-use that it results in a posterior more precise than what you’d get from a proper application of Bayesian probability theory.

        • Bill Jefferys says:

          An alternative analysis would simply look at this as a mixture of M_0 and M_1, put a prior on the two models (any prior so long as it doesn’t assign 0 to either), and then a standard analysis will show that the likelihood under M_0 is 0 so the likelihood throws that model away. It’s not necessary to look at the data to choose M_1; the likelihood does it for you.

          • Laplace says:

            Bill, that’s what I said.

            In practice though, a biologist would often do the same thing but without writing out the full Bayesian equations. They would have seen from their data that M_0 couldn’t be true. Then they would have simply used P(lambda|M_1) as the prior in bayes theorem without ever mentioning M_0 in the equations (perhaps without even mentioning M_1). If asked why they did so, they would say they chose it after seeing the data.

            What the biologist is doing is entirely equivalent to the formal Bayesian analysis and is as correct as the sum and product rules of probability theory.

            Then some ridiculous Frequentist will come along and point out the intuitive correctness of Biologists decision, and claim this contradicts “bayes Theorem” and “priors” being pre-data. Then the asinine Frequentist will announce it to the world as big paradox which invalidates Bayesian statics. Then we will have to listen to another century of their never ending drivel.

            That’s how things work in Statistics. Statisticians would rather be wrong for a century than do 30 seconds of simple math and clearheaded thinking.

            • Bill Jefferys says:

              Hi Laplace, I thought you were saying to use the data point that rules out M_0 to choose the prior. Maybe I misunderstood. My point is that regardless of the prior on the two models (so long as none is excluded), the real work is done by the likelihood.

              There’s no need to choose a prior based on the data.

              • I think he’s saying that informally, when you go to build your model, and you see the data rules out M0, you move to the prior for M1, and this looks like you “peeked at the data and chose your model m1” and although the math will work out correctly, plenty of people will have a fit. However, all you’ve really done is used your logical certainty (about the fact that the data rules out M0) to shortcut the process.

              • Bill Jefferys says:

                Sure, Daniel, but my point is just that this is not really an example of using the data to decide on a prior, since the likelihood automatically takes care of it in any case.

                Informally as you say you can just ignore M_0; but you don’t have to do this because formally Bayes does it correctly.

    • Christian Hennig says:

      The situation in which the data “picks a model” sharply from a range of possible models is not particularly interesting here, because this is just a case in which computation that do *not* use the data in the prior agree with using the data in constructing the “prior” (which then indeed is not an appropriate term anymore). So in these situations having the prior dependent on the data just doesn’t do harm, but in others it does.

      By the way, I doubt that this happens “often”; I’d rather think that this happens extremely rarely and I haven’t seen it even once in a situation that is of real research interest, apart from artificial examples constructed for the sake of making an argument.

      Neither is the statement particularly interesting that you can do valid probability computations involving conditioning in which everything depends on the data. Of course you can, but I’d think that this is not what the question was about.

      What I find interesting about this discussion is that I’d have thought that the use of the term “prior” implies a specific way of using probability calculus to formalise how to learn from data in which the data are used to update some pre-data uncertainty specified in the prior in order to arrive at a posterior. If this is so, *by definition* the prior cannot depend on the data (although approximations with a data-dependent prior can still be valid, and in exceptional circumstances such approximations may be “perfect”).

      But of course there can be valid computations in which something that looks like a “prior” depends on the data. However, it needs to be explained in a different manner what their use is.

      I’m not objecting to such computations, I rather object to having a prior that depends on the data and then interpreting the posterior as if it resulted from piecing together uncertainty before data and the data in a proper Bayesian way.

      If you’re ready to have the prior dependent on the data, why not cut the work and write down P(parameter|Data) without doing tedious Bayesian statistics to arrive at a “posterior”? (An answer is Daniel’s setup in which you know certain information from the data but not all the data, which is fair enough but also a rather artificial and rare situation.)

      So I think that the term “prior” should be reserved for coding of pre-data information/uncertainty; if you have it depending on the data, a different term should be used, and there is a requirement to explain in which way what is done is still a valid and useful way of combining the information from the data with other (prior) information.

      • “(An answer is Daniel’s setup in which you know certain information from the data but not all the data, which is fair enough but also a rather artificial and rare situation.)”

        Well, it’s not an artificial situation when instead of the mars rover, what you have is an R script

        mydata = read.csv(“mydatafile.csv”)

        sfoo = sd(mydata$foo)

        ### ok, now what do I do, do I use sfoo as a known value that informs the distribution I put on some parameter s, or not??

        that’s precisely the case in every “data peeking” situation.

        • For example, you could say that the “sd” function gives you something that is on average the same as the underlying standard deviation from a large population. So you could say

          s ~ exponential(1/sfoo)

          on the theory that you’re using a maximum entropy model with an approximately known mean. I think you’re unlikely to go wrong here, as you’re inferring pretty much the minimum amount of information from your sfoo calculation.

          or you could say s = sfoo, in which case you’ve gone wrong already.

          • Christian Hennig says:

            Your mean is not known so you’re hiding uncertainty that actually exists. Granted, as an approximation this may be OK (but with small amounts of data uncertainty may be substantially understated). But I’m still not totally sure whether you mean this as an approximation of an analysis which does not use a data dependent prior, and which can be used to measure how good your approximation is, or whether you claim that this is a good and valid thing to do in its own right, in which case I wonder what exactly it means to “go wrong” and how to show that it’s “unlikely”.

            • so, in this model I’m saying that the “real” standard deviation s is somewhere in the range about 0 to say 6 times sfoo.

              So, let’s suppose foo really is a random sample from a large population, maybe it’s the mass of the active ingredient in some pills produced by a pharma company or something like that.

              At this point, we may know nothing about the “real” distribution of foo, except that it has a mean and a standard deviation, and maybe an upper and lower bound (the mass of the pills, and zero)

              since it’s a real random sample, we have some kind of sampling distribution theorems, central limit theorems, or some kind of resampling approximations etc we could potentially rely on. Those theorems are essentially always going to place the high probability region of s entirely within the high probability region of exp(1/sfoo) because this is the maximum entropy distribution, and we actually have stronger information than this (it’s pretty hard to take a random sample N > 2 and calculate a sample standard deviation and be off by a factor greater than 6!)

              So, I think this ISN’T hiding uncertainty, in fact it’s probably always *conservative* when the sample size is more than maybe 3.

              I think it’s always fine to approach things this way, provided you are modeling your state of information accurately, or conservatively.

              The alternative would be something like a guess at the size of s, and it’s entirely possible to guess wrong in a serious way if you aren’t being honest. So it’s easily possible to go WORSE from a “pure Bayesian” model than to peek at one or two statistics in the data and use those to create a conservative model which converges quickly.

              Unless you know a way to extract “THE TRUE” distributional model of your “REAL” state of information, there’s no way to say what is the approximation and what is reality.

              IMHO every interpretation of a bayesian analysis basically starts out like “If I make the following assumptions, and I look at the data, then I calculate that such and such is probably about…”

              Provided that the assumptions are reasonable (such as, the real standard devation is exponentially distributed with a mean equal to the actual sample standard deviation) then you will not go far wrong, but if you make bad assumptions (such as the actual standard deviation is exactly equal to the sample one) then you can.

              ** some simulations from a normal, for 3 data points only about 2% of samples are off by a factor of 6, for 6 data points only 1 in 1000. so assuming you have more than 3 data points from a normal, you’re probably being conservative (you can use a density plot to show the exponential vs the observed… I’m sure it’s known analytically, but I’m lazy)

              sum(replicate(1000,1/sd(rnorm(3,10,1)) > 6));
              [1] 22
              > sum(replicate(1000,1/sd(rnorm(4,10,1)) > 6));
              [1] 6
              > sum(replicate(1000,1/sd(rnorm(5,10,1)) > 6));
              [1] 1

              • Christian Hennig says:

                What puzzles me about this is that you defend your “prior” by argueing that it makes sense (and is actually conservative) to say that this is your information after knowing the data (or more precisely its sd). That’s fair enough. But if you use this as a prior in the usual way, you then use the information in the data including what is expressed in the value of the sd once more for updating, and the issue is whether you’re understating the uncertainty after that.

                Also, what is meant by this: “At this point, we may know nothing about the “real” distribution of foo”?
                I thought there is no such thing as a “real” distribution of foo, this distribution is meant to formalise *your* uncertainty, and because this is about your information or lack of information, there’s nothing unknown about it (apart from the fact that it may be difficult to translate it into a prior).

              • “At this point, we may know nothing about the “real” distribution of foo”

                In my example I was assuming we had a very large population of values, and that foo was a true random (RNG) sample from it. So there is a “real” distribution of the fixed finite population. I don’t believe in actual infinite populations. But, for example, if we’re sampling pills in a million pill lot, it might be that several pallets worth had some totally different concentration of the drug due to machine failure or whatever, the distribution might be lumpy and multi-modal and generally weird.

                As for the other bits about whether doing the updating under-states your uncertainty, I suspect this is model dependent, but I’ll think about it a little before replying. One thing I can say is it’s easily possible to make up a guess for your prior which under-states your uncertainty in a different way, possibly unintentionally, and so there are risks to both methods.

              • Suppose we have two options: we write a model without peeking, or we write a model with peeking, we’ll use the symbol K to represent our background knowledge:

                p(data | s) p(s | K)


                p(data | s) p(s | K , sd(data))

                the factor p(data|s) presumes that we know (or at least have a model for) how data is distributed if we know the true value of s (ie. conditional on s). So it shouldn’t change under the two scenarios. Note that our model may not express things accurately when s is very far from its true value. For example suppose the true s is 1 and we plug in 10^36, we presumably get a VERY small value for p(data|s), but is it an *accurately* small value? Is there even a sense in which we can answer that question? Most likely the p value should really be 0 as we can actually rule out this possibility entirely, but in real world cases it usually be some small positive value. Usually we only demand that our likelihoods be accurate in the neighborhood of the correct parameter values.

                It’s entirely appropriate that p(s | K, sd(data)) is different from p(s | K) because they represent two different states of knowledge. But HOW should the be different?

                The first model p(s | K), takes into account whatever we know about s just from background. Like maybe that it should be smaller than 100, or that it’s about as big as some other value we saw last week or whatever.

                p(s | K, sd(foo)) should take into account both all the stuff in K, as well as what we know about the process of sampling foo and calculating a sample standard deviation. For example, we might know that foo is a true random sample of a large population, or that foo is a biased sample of the population, or that foo is a convenience sample or that foo is a moderately large sample from a smallish finite population … in each case our model for p(s) should change. in other words, we could symbolically break it out into

                p(s | K, sd(foo)) p(sd(foo)|K)

                Provided we’re doing that, I don’t see any way you can object, because in each case we’re using valid information to condition our model on. Now, if we do something dumb, like saying p(s|sd(data)) = delta_function(sd(data)) then we’ve made a mistake. If we do something like pretending data is a true random sample from a large population when it’s really a biased cluster sample with approximately known biases… then again we’ll have problems.

                It’s not the *using data twice* that is the problem here, it’s incorrectly interpreting the information we get from sd(data).

                Note that it’s possible for EITHER of these models to be wrong. if K has bad information in it about s, but has good information in it about how sd(data) arises, then the second model is likely to be a lot better (less wrong) than the first.

              • Note also that the more sample statistics you use to construct your model, the harder it is to specify realistic priors on the parameters, because you need to start taking into account correlations between the sample statistics and things. So I don’t recommend writing your model in terms of a large number of sample statistics. But I think it can make sense to look at a few of them if you otherwise would have very vague ideas of what to do for your priors.

                Let me give an example, in microarrays in Biology you get a brightness measurement on an arbitrary scale. To give a meaningful scale to the measurement you need to divide by some reference brightness. One thing that’s done is to simply divide by the median across the entire array. This gets you approximately into a measurement scale that is independent of the amount of stuff you pipetted onto the array.

                Is it appropriate? Well, it can work fine. But it might be better to define a parameter Norm[i] for the normalizer in each array. The measurements are pretty arbitrary though, so it could be quite hard to determine a prior for Norm. It makes sense to write a model where you put your prior for Norm somewhere in the vicinity of the observed median.

                Norm[i] ~ normal(median(data[i]), sdnorm)

                where sdnorm is informed by the number of spots on the array and various characteristics of your sample etc.

                this is clearly “data peeking” because you’re putting a sample statistic in your prior for Norm, but I imagine it’s also a lot “less wrong” than just putting something like

                Norm[i] ~ uniform(0,max_possible_brightness)

                or something like that. That really over-states the uncertainty you have because you know that the median is an ok proxy for the normalizing constant.

              • Christian Hennig says:

                Daniel, thanks. This looks fine to me. The thing is that up to now, whenever I have seen work in which some aspect of the prior was estimated from the data, the prior was not justified in this way. If the prior was justified at all (which is apparently not fashionable in some branches of Bayesian statistics), it looked like they attempted to model the information before data which had some parameter that was then without further justification replaced by a statistic of the data.

                But oh, wait a minute… if we want to specify this: p(s | K, sd(foo)) p(sd(foo)|K) … isn’t it the case that people would usually think about p(sd(foo)|K) through p(s|K), interpreting s as the “truth” behind sd(foo), i.e., wouldn’t it be the more straightforward way to specify p(s|K) directly instead of taking the detour through sd(foo)?
                “if K has bad information in it about s, but has good information in it about how sd(data) arises, then the second model is likely to be a lot better (less wrong) than the first” – in the second model, p(s | K, sd(foo)) will be in trouble then, and I don’t see how it can be in less trouble than p(s|K) in the first approach.

              • p(s|K,sd(foo)) p(sd(foo)|K) could be in a lot less trouble than p(s|K) when something like the following happens:

                we take a random sample of foo, and we know that through random sampling sd(foo) is not too far from s, so we write a model p(s|sd(foo),K) = normal(sd(foo),somethingorother), or if you like maybe you’d use a chi distribution here, as I think it’s the sampling distribution…

                but without sd(foo) your K may say something like “s is somewhere between 0 and 500”, whereas sd(foo) turns out to be say 3… now your s has way too long of a tail (sure your likelihood would help).

                or maybe K says something like “s is around 75 +- 10” s ~ normal(75,10)… but sd(foo) = 3 and that would never happen if s really was around 75…

                hopefully you’d think harder about s and be realistic about its value, but sometimes it’s just better to specify something more reasonable and base it on a sample statistic. For an example look again at my biological microarray example.

              • Christian Hennig says:

                OK, that’s a good example. Next time I see data dependent priors, I’ll be less critical… if the authors make their point as well as you do. ;-)

      • Laplace says:

        On the contrary it’s done all the time by model builders, it’s just done informally. And when done informally, it will appear exactly like a “prior” depending the data. In particular, it or something resulting from a similar analysis, is done whenever someone following Gelman’s books does posterior checks and then modifies the model based on the check.

        You can’t reserver “prior” to mean anything other than a distribution that takes the “prior” position in any equation that looks sorta like Bayes Theorem. That’s as close as you can come to defining “prior” since a “prior” doesn’t really make an appearance in the mathematics otherwise. It comes entirely from your philosophical prejudices and fantansies.

        The mistake you keep making Christian is you want to first take a philosophical viewpoint, which in your case is sometimes Bayesian or sometimes Frequentist, and then derive the equations/methods from it. What you can never seem to understand is this is the mother of all mistakes and failures in statistics. What you don’t get is that we already know the correct equations and can already derive correct methods. Instead you should take the equations as given and derive the philosophy from them, not the other way around.

        Because of this you miss the point. The point isn’t what you find interesting. The point is to determine what can happen in Bayesian statistics. If you agree to the sum and product rule, and you agree they apply in enough generality to use Bayes theorem the way Bayesians do, then “priors” sometimes depend on the data. It’s irrelevant what you find interesting, or whether you agree, or whether you understand it, or whether you like it. That’s just the way the mathematics works out.

        • When the likelihood of your data under some initially plausible model is zero, then yes, it will “look like” the prior depends on the data. But mathematically it’s really that the p(Data | model0) is zero.

          Just define the word “prior” to mean the probability distribution of a parameter conditional only on things that are not functions of the data (including the data itself).

          But it seems to me it can be reasonable to condition on summaries of the data for efficiency purposes, without necessarily harming your analysis. My example above (sfoo = sd(data$foo), s ~ exponential(1/sfoo)) is very reasonable. It keeps your model from going off into left field, and it doesn’t assume anything other than pretty much the order of magnitude of the s parameter is the same as the value of the sample standard deviation.

          In a model like this, there is no prior as I defined it. But it’s still a full Bayesian model. It’s just one that is conditional on a state of knowledge that includes sd(data$foo). It’s a different model than you would have chosen if that were left out of your knowledge… but it’s still a full bayesian model.

          • Laplace says:

            Daniel, it’s a mistake to get bogged down in definitions of priors. The only reason anyone cares about it is because there are situations where our intuition tells us strongly we should use the data to modify the prior, yet a simplistic application of Bayes theorem say we should. The fact is, if you find the Bayesian solution for the real scenario/assumptions being considered you get exactly the behavior your intuition tells you should happen.

            That’s true no matter how you define “prior”. So everyones definition of prior is irrelevant. It doesn’t matter what any of us consider a “prior” because Bayesian statistics still does the right thing no matter what.

            So just forget about “priors”. They were never a part of the mathematics anyway and were just an intuitive philosophical extravagance that was tacked onto the mathematics. Use the basic equations of probability to derive everything and you’ll be fine.

            • Sure I agree, but I also think it’s useful to keep in mind whether your model depends on some ideas and guesses about how the world works… or some ideas and guesses about how the world works, and also how certain functions of the data have certain values.

        • Christian Hennig says:

          We don’t disagree about the mathematics but surely the mathematics need to be interpreted in order to tell us something about the world, and in order to choose what mathematics to use for what problem. The term “prior” implies a certain interpretation, I’d say; it’s not a purely mathematical technical term. Some of what you write seems to agree with this, actually (even if it’s the bit in which you rather seem to despise the use of the term).
          Mathematics alone has nothing to say about what are “the correct methods” for any real problem.

          • Laplace says:


            In both instances the prior for the params depends only on M: P(params|M). If M is certain, then this prior will not depend on or be affected by the data. If M is uncertain, it can be affected by the data causing the form of the prior to change.

            Now you claim the former case corresponds to your notion of “prior” while the later is something conceptually and wholly different. But you’re just blowing smoke here. The difference has nothing to do with “pre” or “post” data, rather the different is entirely “certain” or “uncertain” M.

            The problem isn’t with interpreting the equations (which are perfectly clear) or with the mathematics (which is perfectly right). The problem is the gratuitous and irrelevant philosophical sludge you decided to overlay on them.

            • Christian Hennig says:

              “If M is uncertain, it can be affected by the data causing the form of the prior to change.” Not necessarily. You can model the uncertainty pre-data if you want. Whether or not to do it that way and what this means is a matter of interpretation, not of mathematics.

  4. gagan says:

    although i know it’s a ‘sad holiday’: “happy” easter dr gelman ;)))

  5. Dustin says:

    Just to emphasize, it shouldn’t be called ‘wrong’ to use data-dependent priors, in the same way it’s not ‘wrong’ to do penalized maximum likelihood estimation. It’s just an approximation to the full Bayesian thing, which you can interpret and reason with as you will.

    • Christian Hennig says:

      It would then be helpful to give some more details about what is approximated, so that there is a possibility to figure out whether this approximation is any good.

  6. Anoneuoid says:

    As I understand it, Bayes’ rule is derived by considering the case that A is independent of B and vice versa. The “prior” refers to the probability an explanation is correct, p(Explanation), independent of the data used to get P(Data|Explanation).

    This thread makes me think have missed something… What is wrong with this:

    • It’s simplistic because it considers primarily yes/no variables. But there’s nothing major wrong there. It’s just that this is a special case.

      The fundamental modeling construct in bayesian analysis is the join probability of the data and the parameters.

      p(data,param_a,param_b,param_c | Knowledge) if you know how to write that down directly, you’re golden. But unfortunately we usually don’t know how to just write this down. Instead what we know is how to factor it into different pieces.

      p(data | param_a,param_b,param_c) p(param_a,param_b, param_c | Knowledge)

      and then we might know how to further factor the second term (ie. certain parameters might be independent, certain parameters might be connected together through various modeling schemes).

      Also, when we build our model with different knowledge, we wind up with different models. A lot of the discussion here is ultimately about how to incorporate some knowledge we gain by looking at certain aspects of the data before writing down our model. It makes sense that we write down a different model if we know additional stuff, but we have to do it in a logically consistent way.

      • Anoneuoid says:

        Can you link to or give a derivation of Bayes’ rule you think is general enough to cover the data-dependent prior use?

        • The fundamental thing is not Bayes’ theorem, but rather the joint probability of the data and the unknown parameters. For example:

          p(data1,data2,data3,param1,param2,param3 | Knowledge)

          Bayes’ theorem is just factoring this into

          p(data… | param…, Knowledge) x p(param… | Knowledge)

          and realizing that it has to be the same as factoring the other way

          p(param… | data… , Knowledge) x p(data | Knowledge)

          With a data dependent prior, you just condition your whole thing on additional knowledge of the sample statistic and how the sample statistic arises (the sampling process). The reason I gave my example of the mars rover is that it’s a situation where we mentally can see how we might get the value of the sample statistic well before knowing the rest of the data, and we might be tempted to put it into the category of “things we know” during our model building. The same things follows even if there’s no multi-month delay.

          The key is not so much that we peeked at the data, we acknowledge that by making the whole thing conditional on the sampling process and value of the sample statistic… the thing is whether or not our modeled uncertainty about the parameter post-sample-statistic-observation is realistic.

          • Carlos Ungil says:

            > With a data dependent prior, you just condition your whole thing on additional knowledge of the sample statistic and how the sample statistic arises (the sampling process).

            Can you be more explicit on how do you do that?

            I think we all agree on Bayes’ rule (I’m dropping the background Knowledge term that appears everywhere and using ~ as “proportional to” to drop the normalization term as well)

            p(params|data) ~ p(data|params) x p(params) = likelihood x prior

            Let’s imagine you can restate the data as three datasets mean(data), sd(data), and standardizedData (where mean(data) and sd(data) are the sample mean and standard deviation and standardizedData is the standardized sample). If the data points were independent I think we can write p(data|params) = p(mean(data)|params) x p(sd(data)|params) x p(standardizedData|params).

            If you really want to include the information provided by sd(data) in your prior, you can do so by defining an “informed prior” p(params|sd(data)) ~ p(sd(data)|params) x p(params) which is just the posterior obtained from combining the “prior prior” with the likelihood of getting sd(data) given the parameters. But then you’ve lost the p(sd(data)|params) term and the Bayes’ rule becomes

            p(params|data) ~ p(mean(data),standardizedData|params) x p(params|sd(data)) = likelihood (on the reduced data set, excluding sd(data)) x “prior” (informed by sd(data))

            I don’t think you can have your cake (keep sd(data) in the likelihood) and eat it too (include sd(data) in the prior).

            Using an ad-hoc procedure to change the prior based on sd(data) will make a non-optimal use of that bit of info if you do indeed fix the likelihood term to ignore it, and will make an unwarranted use of this bit of info if you are not fixing the likelihood to account for it. Why don’t you just use the whole dataset in the data-ignorant model? I could see a justification for this approximation if it was to simplify the model. However, in the example you give elsewhere you’re still going to include uncertainty around this scaling parameter of yours and it seems easier to do it properly: start with a non-informative prior and let Bayes’ rule take care of getting the posterior distribution.

            • p(data | params) p(params | sd(data))

              if you like you could think of p(params | sd(data)) as:

              ~ p(sd(data) | params) p(params)

              with p(params) as an improper uniform “prior prior”, but while p(params) is potentially improper, p(sd(data) | params) p(params) ISN’T and so using your assumption that you understand the sampling process and you are going to identify your uncertainty with the uncertainty involved in sampling then you can just write down some approximation based on the sampling distribution, however if it makes people feel better to think of this improper hyper-prior that’s an ok idea.

              In the end you have:

              p(params | sd(data))

              based on sampling theory or some approximation or some model of bias together with sampling theory…

              And, presumably, the term

              p(data | params) relies on the parameter, let’s call it s, and should not rely on the value of sd(data) directly.

              • Carlos Ungil says:

                > p(data | params) relies on the parameter, let’s call it s, and should not rely on the value of sd(data) directly.

                Are you’re using the original likelihood term p(data|params) without any correction for putting sd(data) into the “prior”?

                If we define data’ as the data excluding sd(data), and assuming it is independent from sd(data),

                p(data|params) = p(sd(data),data’|params) = p(data’|sd(data),params) p(sd(data)|params) = p(data’|params) p(sd(data)|params)

                I maintain that to keep things consistent you should be getting to

                p(params|data) ~ p(data’|params) p(params|sd(data))

                and, unless I misunderstand your reply or I am making a mistake at some point, you arrive at

                p(params|data) ~ p(data | params) p(params | sd(data)) = p(data’|params) * p(sd(data)|params) * p(params|sd(data))

                Do you agree that your result has an extra p(sd(data)|params) term in the likelihood compared to mine?

                I’m not sure if your point is:
                a) my result is wrong and yours is right
                b) my result is right and yours is approximately right
                c) both results are equivalent because I’m not understanding your notation
                d) none of the above

              • “Are you using the original likelihood term p(data | params) without any correction for putting sd(data) into the “prior” ?

                There is no correction involved. If you tell me the “true” value of s and mu (say from your random number generator that generated data) then I’ll tell you what the probability of getting some data is… p(data | s,mu) ~ normal(s,mu) or whatever distribution your RNG uses.

                When you aren’t simulating from an RNG, there’s no “one true” likelihood anyway. Also, there’s no way for data to be independent of sd(data), sd(data) is a function of the data!

                So, that’s the correct likelihood under this model regardless of what prior you put on s, or mu

                I think this concept of “correction” comes from frequentist unbiased estimation theory, where you’re trying to correct some kind of bias, but that’s not what we’re doing, we’re building a Bayesian model. The Bayesian model biases things towards whatever the prior says. As long as the prior says valid things, that’s ok.

                Instead I explicitly say that my model is conditional on sd(data) and my knowledge of the sampling procedure, and so it IS A DIFFERENT MODEL than if I only used “K” which you dropped out of everything, but I think is important to keep K in (at least in your head).

                Keeping K in the notation keeps us from getting confused, because it shows that the two models are conditional on different things, one is conditional on K alone, and the other is conditional on sd(data) together with K.

                I suspect you might say that the “correct” model is whatever the model would be if you used only K, and the new model where I put sd(data) into the prior should be “corrected” to get me back to where I was when I knew only K, or something like that. I don’t think that’s a useful way to think about this. I’m not trying to build a model based only on what K says about s, I’m acknowledging that K is uninformative about s but informative about the sampling procedure, so I’m better off modeling s in terms of a sample statistic and my knowledge of the sampling procedure. In other words, I have choices to make, and I use the information in K as wisely as possible.

                But, another way to think about it (as discussed above) is that my model is equivalent to “K says s has an improper prior” together with some likelihood information about the sampling distribution of sd(data) under my model for sampling (iid, biased, cluster, whatever I use for my model).

                You’re probably going to use this technique only when K is truly unreasonably vague about what the distribution should be. such as when your “hyper-prior” for s is approximately improper. If you have even better information in K about s than what the sampling model tells you, you can put it in as your prior directly. If you have a poor sampling model (ie. there’s potentially a bunch of bias, but you don’t know what it is). then you might be better off just putting your vague prior directly on s. Or you might do something like:

                b ~ uniform(-1e6,1e6); /* your vague model for the bias*/

                s <- sd(data) + b;

              • “Do you think you could improve your model by changing the mu prior as follows?


                If not, why not?”

                Improve it for what purpose? ie. what’s your utility function?

              • Carlos Ungil says:

                We may be having different discussions. My point is that

                1) from some prior knowledge *K* and some *data* the optimal way to get a posterior probability distribution for the *params* in the model is

                p(params|data,K) ~ p(data|params,K) * p(params|K) = likelihood * prior

                Maybe we are already in disagreement, otherwise we carry on:

                2) you are following a procedure that leads you to the following

                p(params|data,K) ~ p(data|params,K) * p(params|sd(data),K) = same likelihood * another prior

                If the result (1) and the result (2) are different and the result (1) is optimal, then (2) is non-optimal and it’s at best an approximation.

                Maybe we agree on this as well, I’m not sure.

            • It might help here to use a concrete example. We have 20 data points that we think came from a normal distribution with a positive mean in the vicinity of 1000, and want to find a mean for the distribution. We’ll use a simulation of sampling to get an estimate of the sampling distribution for the standard deviation:

              s = 10:1000; foo <- sapply(s,function(x) {sd(replicate(100,sd(rnorm(20,rexp(1,1/1000.0),x))*sqrt(20)))}); ggplot(data.frame(s=s,err=foo)) + geom_point(aes(s,err))+xlim(c(0,1000))+ylim(c(0,1000));

              This shows that sqrt(N) * sd(sampling_distribution_of_sd) \approx .75 * sd(data) for the full range of standard deviations between 10 and 1000. We can estimate it for other sample sizes etc, and see that it does in fact scale like this. Now using that knowledge, we'll write a model:

              s ~ normal(sd(data),0.75 * sd(data)/sqrt(20));
              mu ~ exponential(1.0/1000); /* don't know much about the mean except it's positive and in the vicinity of 1000 */

              data ~ normal(mu,s); /* doesn't depend on sd(data) directly, depends only on the value of the parameter s*/

              The key is that the model for s is based on valid information not that it comes "before" the data.

              • Carlos Ungil says:

                Do you think you could improve your model by changing the mu prior as follows?


                If not, why not?

              • Carlos Ungil says:

                > Improve it for what purpose? ie. what’s your utility function?

                For whatever purpose made you use
                s ~ normal(sd(data),0.75 * sd(data)/sqrt(20));
                instead of a real prior.

  7. Tom Passin says:

    In truth, the probabilities on both sides of Bayes’ theorem span the entire population space, so they apply to the entire data collection (i.e., prior and posterior, if one insists on the conventional terminology). As @Lakeland said, that’s fundamental to the derivation of the theorem, Some problems may lend themselves to partitioning such that the left hand side depends mainly on previous data and the right side on the newer data. But there’s no reason why that should usually be true.

    The philosophical conundrum comes about when one comes up with a “prior” distribution, or at least one that depends on “beliefs” about the data set. In that case, the terms conditional on the prior should themselves be expanded to be conditional on the probability of the prior itself. But what in the world could that be, really: what would be the prior’s probability, and how could one determine it from some data set? Better to stick close to the data and try hard to partition it in some sensible way.

  8. Carlos Ungil says:

    I’ve been playing with the problem given as an example by Andy Solow (“Independent observations X1, X2, …, Xn from the uniform distribution on (0, U). Interested in the posterior distribution of U.”). In fact the data-peeking prior suggested (“Exponential prior for U with parameter log 2/xmax so that prob(U < xmax) = 1/2") gives results that are almost identical to the non-informative prior 1/x. Unsurprisingly, the uniform prior gives results that are more spread out. Something more interesting is that it is the 1/x prior which gives a 90% posterior interval which includes the true value 90% of the time (in this experiment we have fixed the true value and are sampling from uniform(0,10), if the sampling space was different the results could obviously differ).

    • Above you ask me “Do you think you could improve your model by changing the mu prior as follows?


      If not, why not?”

      and I replied, but the nesting is too difficult to follow. so I’m replying here, hope you see it.

      my answer is: “improve it in what way? What is the utility function that determines whether an estimate is better or worse?”

      • Carlos: nesting is broken above. can we start here?


        We may be having different discussions. My point is that

        1) from some prior knowledge *K* and some *data* the optimal way to get a posterior probability distribution for the *params* in the model is

        p(params|data,K) ~ p(data|params,K) * p(params|K) = likelihood * prior

        Maybe we are already in disagreement, otherwise we carry on:


        Yes, fine, but what is your prior knowledge? If K basically says only “data was sampled in a way that sd(data) is close to s and the error is similar in size to the one you get according to a certain simulation model… and I know very little else about s” then we could write:

        e ~ normal(0,error_size_from_K);

        s <- sd(data) + e;

        or what's the same in this case

        s ~ normal(sd(data),error_size_from_K);

        does this re-parameterization help you understand what's going on? Now the prior on e doesn't depend on the data, but the information in K is being used to estimate s from some other parameter e.

        there's no reason why this has to be additive or normal either.

        The point is, the models are different whether you use some "prior knowledge of s available directly from K" or "some prior knowledge about how different s is from sd(data) that we get from background knowledge K"

        • Carlos Ungil says:

          > The point is, the models are different whether you use some “prior knowledge of s available directly from K”

          Yes, this is what the prior is.

          > “some prior knowledge about how different s is from sd(data) that we get from background knowledge K”

          Whatever information sd(data) adds to the prior it will be properly included in p(params|data,K) ~ p(data|params,K) * p(params|K)

          Do you agree that (1) is optimal?

          Do you agree that (1) and (2) give different answers?

          I think my point is clear. But I still don’t understand if you are changing the models or not, you say now that “the models are different” but you also said before “that’s the correct likelihood under this model”. I’ll be happy to continue the discussion if you can provide a yes/no answer to the two questions above so I understand if there is really a disagreement.

          • Carlos:

            Do we agree that there’s a difference between the two models:

            1) p(s | K) = exponential(1/1000); /*where I ask K what is s and it says “somewhat up to several thousand”*/


            2) p(e | K) = normal(1,0.75/sqrt(20)); s = sd(data)*e; /* where I ask K what is s and it says that it knows something fairly specific but not exact about the multiplicative factor e which get’s you to s*/

            but that *both* of them are consistent with K, and neither of them have data in the “prior” for the parameter as defined, but that if you think of “s” as the parameter in (2) then there is data in the prior for “s” since s ~ normal(sd(data),sd(data)*0.75/sqrt(20));

            If (1) and (2) are both “proper” according to not having data in the priors… and they’re both consistent with K, then we have to choose which one we want.

            Since these are both consistent with K, we have to choose which one we want, it’s not necessarily the case that (1) is “optimal” and so your statement about

            p(data|s) p(s|K)

            being optimal is not necessarily true.

            • Carlos Ungil says:

              Of course there is a difference, one is right and the other is not :-)

              Seriously, using the same logic we could also ask K what mu is and it will say it’s close to the mean of the data so instead of

              1) p(mu|K) ~ exponential(1/1000)

              we can choose

              2) p(e|K) ~ normal(0,1) ; mu = mean(data) + sd(data) *e ;

              because both of them are consistent with K and neither of them have data in the “prior” for the parameter as defined.

              • The normal distributions are not exactly the sampling distributions, but there is NOTHING wrong with the following model:

                es ~ normal(1,0.75/sqrt(N));

                s = sd(data)*es;

                emu ~ normal(0,s/sqrt(N));

                mu = mean(data) + emu;

                data ~ normal(mu,s);

                This model is consistent with:

                1) what I think I know about how the data arises.
                2) what I think I know about the errors between the sample statistics and the true mu, s based on some simulations or analytical results or whatever.

                there is nothing wrong with this model… It’s a fully consistent Bayesian model. It has no data in the priors. If data really does come from a random number generator that’s normal(a,b) this will give good inference for a,b.

                HOWEVER, this ONLY incorporates information that we have about sampling. in the case where we’ve got an RNG that’s pretty much the best we could do. In real science, we can sometimes add additional information based on science, but we also have to remember that our sampling theory becomes only approximate.

              • Carlos Ungil says:

                Ok, I think we agree to disagree.

                I believe Bayes’ rule shows the right way to combine prior information and data.

                I don’t think estimating the location parameter in a
                model using as prior
                mu ~ N(mean(data),std.error(data))
                is a fully consistent Bayesian model.

              • Carlos, I’m fine with agreeing to disagree, but I don’t think we disagree that Bayesian statistics is the right way to go. I just think you don’t acknowledge as much freedom to form models as I do. Before you commit to that, consider this scenario:

                I will choose out of my own head two valid double-precision IEEE floating point numbers m and s. s will be > 0 and abs(s/m) > 100 * .Machine$double.eps I will then go out and get a random seed from and then set the R RNG seed, and call rnorm(20,m,s) in R and report the values to you.

                You in turn will run Stan using this model (translated to proper Stan syntax)
                m ~ uniform(-max-float, max-float);
                s ~ uniform(0,max-float);
                data ~ normal(m,s);

                I in turn will use the parameterization on emu and es which you agree has no information about what m and s is built in but relies instead on sampling simulations. (and I will check my simulations first and make any corrections in case I made some error)

                Now we will make a bet on which model had the better result.

                You will agree that my model is fully consistent with our knowledge, but you refuse to use it because it is not somehow “proper”?

                Your model uses K = “m is a double float, s is a positive double float, data comes from normal(m,s) as IID samples, and simulations of IID sampling from normals suggests m-mean(data) is distributed according to me, and s/sd(data) is distributed according to se as specified by Dan’s simulations”

                my model uses the same K. Your model however refuses to use the sampling information because it’s not information directly placed on m,s by K.

                Which of us will do better?

                By better, I’ll mean after 20000 samples from Stan with 10000 warmup, we’ll plug the real m,s values into our model and calculate the posterior probability of those values averaged over all 10000 samples and whoever has the larger expected posterior probability “wins”.

                I don’t have the time to write the code and do the computing, but if someone wants to run this simulation I’d be interested to hear it. Since we’ve now specified your prior based on knowledge, and you won’t be able to utilize any information I reveal here, I can freely just type in the m and s values… so let me think some up.

                m = 1.87704995e43
                s = 5.60779002e36

                whatever the closest IEEE float is to each.

                well, since we’re not really betting money here. I might as well tell you a seed: 21028580 which came from, and then generate the values.

                m = 1.87704995e43; s = 5.60779002e36; set.seed(21028580); d <- rnorm(20,m,s); sprintf(“%.18g”,d);
                [1] “1.87704963915473391e+43” “1.87705044510327081e+43”
                [3] “1.87704956431864945e+43” “1.87705057829903988e+43”
                [5] “1.87705100035275354e+43” “1.87704976139413546e+43”
                [7] “1.8770494342744514e+43” “1.87704949418607237e+43”
                [9] “1.87704987820835268e+43” “1.87704969307615267e+43”
                [11] “1.8770500637371498e+43” “1.87705073605542073e+43”
                [13] “1.87705003940398127e+43” “1.87704914950130018e+43”
                [15] “1.87704916824537839e+43” “1.87704888896367e+43”
                [17] “1.87704992843512256e+43” “1.87704865770861173e+43”
                [19] “1.87705089639225344e+43” “1.87704989742548195e+43”

                This models pretty well a situation where you know a lot more about sampling than you do about what I am going to choose for m, s

                I suspect you are MUCH better off working with my scale free model based on sampling theory.

              • Just for posterity sake I’m mentioning here that we’ve detected and solved the confusion that arises here in comments elsewhere on the page. And yes, Carlos is right you have to condition both prior and likelihood on the information you’re using. Explained here:


                another instance of the imortance of good conditional probability notation for clear thinking as advocated here:


        • or in my example above:

          e ~ normal(1,0.75/sqrt(20));

          s <- sd(data)*e;

          there's no way for you to say this is a "non real" prior because it depends on data. Since the prior on e doesn't depend on data here it depends only on my knowledge of how the simulations work.

          But this is the same model as my other one above, but also a very different model than if I just queried K about the value of s directly.

          s ~ exponential(1/1000); /* K doesn't tell me much about s directly*/

          So I get to choose models, one in which I model s directly given crappy knowledge, and one in which I model s through knowledge of a multiplicative factor based on sampling theory.

          Which model should I choose?

          Can I *always* take a data dependent prior and turn it into a data independent prior on a different parameter and a function connecting that parameter to the parameter of interest? I'm not sure. My guess is probably.

  9. Carlos Ungil says:

    Just so it doesn’t seem that I refuse to answer your questions, let me stress again my point.

    >Yes, fine, but what is your prior knowledge? If K basically says ….

    K is the same in both

    1) p(params | data,K) ~ p(data | params,K) * p(params | K)

    and what I understand you accept as you approach (“if you like you could think of”)

    2) p(params | data,K) ~ p(data | params,K) * p(sd(data) | params,K) * p(params | K)

    If I’m not misunderstanding your previous replies your likelihood p(data | params,K) is identical to mine and your prior is “better”.
    This extra term p(sd(data) | params,K) is the only difference between (1) and (2). And again, if (1) is optimal (2) cannot be better.

    • 1 is not necessarily optimal. It may not be the best model for the data available to us under K. We might do better by asking K what else we know, and modifying the model. See my discussion where I reparameterize my model to get s in terms of a multiplicative factor e.

      If I ask K “what is my prior on s directly” vs “what is my prior on e such that s = e*sd(data)” I could get two different answers. These are two different models, they have different parameters, they’re both consistent with K, and they’re both data independent in the prior. But the second one reduces to a model where s is a parameter that is data dependent.

      • I now suspect but have no proof, that for every model with a data dependent prior, there’s another equivalent model with different or additional parameters whose priors are free from data.

        • George says:

          Doubtful. If that was true there would be no marginalization paradoxes (and the like… the coherence that comes with “full Bayes”) with data-dependent priors.

          • Well, at the very least we can say that people who disagree in principal with data dependent priors must agree that there’s no principal violated when the data dependent prior can be converted by reparameterization to a data independent prior in a different model.

            • George says:

              Perhaps. I wouldn’t expect such people to get on board with that idea, if in practice the data-independent formulation gave a prior or model that seemed unrealistic. And there are also those for whom expressing a prior as a distribution is against all principles of statistical inference… I wouldn’t expect them to agree.

              Even restricting to just Bayesian analyses, I think one wants the equivalence class of ALL models/priors that lead to the same results, or approximately the same, regardless of how they are formulated. This class tells you the range of types of analysis among which it’s not worth arguing over which one to use. Its complement tells you who you can expect to disagree with, which is useful for helping plan any next steps.

      • Carlos Ungil says:

        * On optimality:

        Imagine you have a random variable with a normal(0,1) distribution.

        Do you think there is an optimal probability distribution to describe this random variable?

        I would say that the normal(0,1) distribution is the optimal probability distribution.

        Note, however, that the normal(0,0.95) distribution will “win” more than two thirds of the time using your “larger expected posterior probability” rule.

        * On your “reparametrization”:

        > e ~ normal(1,0.75/sqrt(20));
        > s
        > there’s no way for you to say this is a “non real” prior because it depends on data. Since the prior on e doesn’t depend on data here it depends only on my knowledge of how the simulations work.

        Believe it or not, I will say it’s not a real prior because it depends on the data. The parameter is s and I don’t see how what you wrote above is different from

        s ~ normal(sd(data),0.75*sd(data)/sqrt(20))

        I could do a similar “reparametrization” to convert a model of the form

        y ~ normal(mu,sigma)


        e ~ normal(0,1)
        y = mu + sigma * e

        Does it mean that the “new” model doesn’t have any parameters ?

        Maybe you think that getting sd(data) out of the distribution makes your prior data free but the fact is that s is still a parameter and it depends on sd(data).

        • “Imagine you have a random variable with a normal(0,1) distribution.”

          As a Bayesian I never have such a thing. I have some data, which is fixed and finite (and may have come from a RNG with normal(0,1) distribution), and I have distributions over my parameters which describe what I know about the true value of a parameter. (let’s just suppose here that such a thing as a “true” value does exist).

          If I have some data that comes from
          data = rnorm(20,0,1);

          and a parameter m representing the true mean of the normal distribution RNG that generated this data (in this case exactly 0)

          and I have some information that leads me to determine that (never mind how I got this information, perhaps I got it by putting a bug on your keyboard or something)

          m ~ uniform(-0.000001,0.000001);

          then m *is* a better distribution than normal(0,1/sqrt(20)) for determining what the “true” value of m is, in this case the true value *is* 0 and this distribution is highly peaked around the true value. It is telling me highly specific information about the true value of the parameter.

          You: “Believe it or not, I will say it’s not a real prior because it depends on the data.”

          Then I think at this point I simply agree to disagree. I am free to define any parameters I want in my model, and you will not convince me otherwise. In my model the parameter is e, and e tells me the ratio of s/sd(data) and its posterior value will be of interest to me, and the prior on e is not data dependent.

          in my model, s *is not* a parameter, it is a function of a parameter e

          It sounds like you are confused about how Bayesian distributions work. There is no uncertainty on the data. The data are fixed things. The only uncertainty is on the parameters. p(data | parameters) is thought of as a probability (density) but you PLUG IN the fixed data values, and so it is only a function of “parameters” it is a weighting function that downweights the “prior” probability in regions that are not consistent with what you would know about data if you were told the true values of parameters.

          • Carlos Ungil says:

            > (and may have come from a RNG with normal(0,1) distribution)

            This is what I would call a random variable with a normal(0,1) distribution. I was not talking about priors, inference, etc. My point was that when you have a probility distribution there might be multiple ways to describe it that might be “better” in some sense, but nevertheless there is an optimal one. In the same way, Bayes’ rule gives you the optimal posterior probability distribution and your alternative might allow you to “win” on your game, but can only be non-optimal.

            I understand that I won’t convince you of anything, but I think what you’re suggesting is outside of Bayesian orthodoxy (looks like empirical Bayes to me). I’ll try to make my point as clear and simple as possible.

            Feel free to ignore this, but if you want to reply I would appreciate if you can tell me precisely where you think the argument is derailing.

            The problem – we know there is a satellite in a geostationary orbit, but we have no idea were is it. The parameter mu is the (unknown) position of the satellite measured in km along the 265000km of the orbit (the reference frame is fixed with respect to the Earth so the position doesn’t change). We can measure the position of the satellite but there is a normal error term with standard deviation 1km.

            The data – we get one measurement which happens to be, expressed in km,


            The prior – we describe our (non-existent) knowledge about the position mu before making the measurement with uniform distribution over (0,265000):

            p(mu|K) = 1/265000

            The likelihood – given the parameter mu, the probability of observing x is

            p(x|mu,K) = 1/sqrt(2*pi) exp(1/2*(x-mu)^2)

            (the tails are essentially zero long before merging at the opposite side of the orbit, this probability is well defined)

            The posterior – using Bayes’ theorem, we can combine the data and the prior under this model and get the following probability distribution for the parameter (~ means proportinal to)

            p(mu|x,K) ~ p(x|mu,K) * p(mu|K) ~ exp(1/2*(42-mu)^2)

            Our knowledge about the parameter, based on the background knowledge, the prior, and the data, is therefore described by a normal distribution centered at 42km with standard deviation 1km.

            The orthodox Bayesian theory stops here (and of course I don’t think what follows is valid, but it’s what I understand you’re claiming).

            Now, we can actually do better using the same background knowledge arriving at the following “fully consistent Bayesian model”.

            We don’t have to content ourselves with a uniform, non-informative prior on the parameter mu (the position of the satellite), becase after getting the measurement we know approximately where it is.

            Instead of using the uniform prior on mu, we can reparametrize the model to use a new parameter e (the error, in km) which doesn’t depend on the data and we know is distributed as e ~ normal(0,1). We can then write mu = x + e

            Therefore, the “informative” prior term will be a normal distribution centered at 42 with standard deviation 1

            1/sqrt(2*pi) exp(1/2*(42-mu)^2)

            and we have the same likelihood as before

            1/sqrt(2*pi) exp(1/2*(42-mu)^2)

            Multiplying both terms, we get the “improved” posterior:

            p(mu|x,K) ~ exp(1/2*(42-mu)^2) * exp(1/2*(42-mu)^2) = exp((42-mu)^2)

            This is a normal distribution centered at 42km with standard deviation 707m

            In summary – from the same starting point
            1) the orthodox Bayesian procedure gives a normal(42,1) posterior
            2) your procedure gives a normal(42,0.7) posterior

            • Yes, I see where things have gone wrong. You mentioned before modifying the likelihood. As I stated before the whole model is conditional on our knowledge of some sample statistic, and you were right, I should modify the likelihood as well as the prior to be conditional on the value of the sample statistic. Again, proving that it’s always best to keep the conditional stuff in the notation!

              When we do that in your example, and K includes the pre-knowledge of the mean then:

              p(data | parameter, K) = 1 in the case where N = 1 since if you include the knowledge of the mean in K and the data point is the mean, then you know ahead of time what data is. :-)

              In fact, I believe you can in general do this in this type of case by simply dropping one data point at random, since if you know N-1 data points, and you know the sample mean, you can calculate the other data point so it’s contribution to the likelihood is just a factor of 1. So correcting the likelihood is easy. However, as you say, in this case you have no advantage from formulating this type of prior. It still has some value computationally, as it avoids you having to put in a uniform prior on the whole real line (or on all IEEE floats or whatever).


              Now, here’s an alternative situation which is more along the lines of what I imagined this type of thing would be used for. I have a physical simulation and I can put in some information, run the simulation, and get out a value of a parameter Q. I do this for a large variety of conditions, and get a distribution on Q.

              now, I can index this distribution on Q based on something… like say a total energy E.

              Now I get some data. The experimenter can set the total energy E to anything they like. But they don’t tell me what they set it to.

              Putting in a model of what I think the experimenter’s motivations are to choose different amounts of energy E is truly ridiculous here. Instead, it makes sense to say something like:

              E 3){
              observed_quantity[i] ~ some_distribution(some_nonlinear_function(Q,, energy[i], other-observables-and-parameters));

              However, it would be wrong in my opinion to start modeling a prior for the total energy in terms of what you know about what the experimenter decided to have for lunch and how that made them feel about where they should set the dials… you’re likely to go far wrong there! And, if you know the total energy (which you do as soon as you sum it up) then it would be wrong to ignore your vast simulation of the physics which tells you how Q behaves at that total energy level, particularly when Q can be vastly different at different levels of E.

              • ack, BLOG ATE MY CODE… I was saying this approximately:

                it makes sense to say something like:

                Q ~ fitted_dist(sum(E));

                but you need to modify the likelihood to reflect the fact that sum(E) is in your knowledge! However, if the likelihood is independent of the sum(E) then you don’t!

                what if you have a complex model on some subset of the events (a subset where say there’s “high” energy where your model is relevant):

                for(i in 1:N){

                if(energy[i] > threshold) {
                observed_quantity[i] ~ some_distribution(some_nonlinear_function(Q,, energy[i], other-observables-and-parameters));

                It’s going to be impossible in general to figure out whether knowledge of the sum(E) alters the likelihood in some subtle way.

                Thanks for persisting though, I think we’ve gotten to the bottom of your objection, and I agree with you, we DO need a conditional likelihood but in general, that might or might not be a different likelihood than what we’re otherwise using, depending on the model.

              • Carlos Ungil says:

                I’m happy we reached a point of agreement. My point all along was that if you take a bit of the data to update the “prior prior” then you may end up using it twice if you are not careful The proper thing to do is to condition also the likelihood on that info, as you said you were doing but you failed to do :-)

                In this example is easy, because once you remove the mean from the data you don’t have anything else and the likelihood becomes 1. And the posterior remains unchanged, as it should be.

                In general one degree of freedom is lost. You can remove x_n from the dataset and substitute it with a function of the sample mean and the other x_i. Or you can reparametrize the model. In a previous message I wrote how one could transform the dataset {data} into {mean(data),sd(data),standardizedData} to make explicit the transfer of the sd(data) bit from the likelihood to the “informeded prior” in your array example. In my satellite example, one can convert {x1,x2,..,xn} to {mean(x),z1,z2,..,zn} where zi=xi-mean(x). Instead of losing one variable we keep n of them, in addition to the sample mean, but one degree of freedom is lost as their sum is zero. I don’t remember the details but this helps later as some terms in the likelihood function cancel out. The procedure is of course model-dependent and can be tedious, but if it is done right the resulting posterior should be the Bayes’ one.

                In practice, in particular if one has lots of data, maybe one can skip the correction part and just consider the procedure an approximation. Which might be useful if it simplifies the interpretation of the model or the numerical computation as in the cases you suggest. But as a matte r of principle, it’s not strictly Bayesian and we are sacrificing correctness for convenience.

              • Carlos Ungil:

                Yes, I so rarely do things like fit the mean and standard deviation of a normal model, that this kind of thing wasn’t in my mind. Much more often I’d be interested in doing say a regression where I’m looking for parameters that define a functional form, and where the likelihood may not actually depend or may depend only very vaguely on the fact that I know some sample statistic.

                I just put another post on my blog regarding this, and give you the credit you’re due there:


        • Carlos Ungil says:

          A few hours ago I asked you about the difference (which I didn’t see) between
          e ~ normal(1,0.75/sqrt(20));
          s <- e*sd(data);
          s ~ normal(sd(data),0.75*sd(data)/sqrt(20))

          My conclusion was that s is a parameter of the model and depends on sd(data), but you replied:
          "I am free to define any parameters I want in my model, and you will not convince me otherwise.
          "In my model the parameter is e, and e tells me the ratio of s/sd(data) and its posterior value will be of interest to me, and the prior on e is not data dependent.
          "in my model, s *is not* a parameter, it is a function of a parameter e

          Imagine my surprise when I visit your blog and I see that you wrote yesterday:
          "e ~ normal(1,0.75/sqrt(20));
          "s <- e*sd(data);
          "s ~ normal(sd(data),sd(data)*0.75/sqrt(20));
          "where the last two models are totally equivalent, but the final one has a "data dependent prior".

          And you added today (I don't know if it was before or after denying that s is a parameter, or maybe simultaneously):
          "Strong Distributional Assumptions;
          "When we construct a prior such as the following:
          "e ~ normal(1, 0.75/sqrt(N));
          "s <- e*sd(data);
          "or what's the same, in a data dependent prior on s:
          "s ~ normal(sd(data),0.75*sd(data)/sqrt(N));

          So the model with a parameter s and the model with a parameter e and a non-parameter s are totally equivalent but one has a data dependent prior and the other doesn't. I won't convince you that s is a parameter because it's also not a parameter, and both models are the same but they are different. A bit like the trinity but one person short. As a bonus we get a Schrödinger prior, which is simultaneously data-dependent and data-independent. I guess that's how you can have your cake and eat it too! *Now* I'm confused.

          • Hey, I’m writing up my summary of what we’re discovering here. I am not committed to being error free or entirely correct!

            if you look above, I now agree with you that although we’re free to parameterize how we like, we need to keep the rules of probability in mind, and in particular we need to conditionalize the likelihood on the full set of knowledge, and when that includes the values of sample statistics, it requires us to do something different in the likelihood… sometimes.

            In fact, perhaps just because of my bias on the types of problems I work on, I don’t think in the cases where I would be likely to use this technique, the likelihood necessarily needs corrections. See way up at the top where I describe normalizing microarray data.

            normalizer[i] ~ normal(median(brightness),some_error_here)

            brightness[i]/normalizer ~ foo_bar_baz(quux);

            The knowledge of the median is not being used to infer the median… and the value of the normalizer is marginalized away as I don’t care about it. In fact, what I’m doing is adding uncertainty to an analysis that otherwise would normally be done without any uncertainty in the normalizer.

            it still is the case that when I set up my model with e, then s is not a parameter, it’s a function of a parameter. I see a difference in part because there’s a computational difference. You may disagree depending on what your definition of a parameter is.

            • for example, in a Stan code when we set up

              e ~ foo()

              s <- sd(data) * e

              e is a parameter on which we're specifying distributional information. But s is purely calculated from e, it has no distributional information SPECIFIED on it. It has a distribution only because e has a distribution. That's what I mean by s is not a parameter.

              just as if I wrote

              m ~ normal(0,1);

              data ~ normal(2*m,sigma);

              2*m is not a parameter, m is a paramter. So I apologize if I confused you there due to terminology differences.

            • I can see that we really did talk past each other on the word “parameter”.

              I’m using parameter strictly to mean something that i specify distributional information directly on. With that definition, it’s equivalent to something that would appear in the “parameters” section of a Stan code.

            • Carlos Ungil says:

              My definition of a parameter is one that is invariant to algebraic manipulations that transform a model into a totally equivalent model.

              In the example I wrote elsewhere, the parameters of a model are mu and sigma regardless of whether I write it as
              y ~ normal(mu,sigma)
              or as
              e ~ normal(0,1)
              y = mu + sigma * e

              Also, a parameter is what goes into the likelihood and it is not data. If you are keeping s in the stan code it’s because you’re using it somewhere!

              In that sense I considered that s=sd(data)*e is a parameter with a prior distribution which depends on the data. Exactly as the distribution of y above is normal centered at mu with standard deviation sigma in both cases.

              Anyway, that’s a secondary issue and given that we agree on the main issue there is no point in discussing it further.

              • Cheers! Yes, I think you’re talking about the fact that s is a parameter of the Random Number Generator and no amount of altering the parameters of the model changes the fact that someone put something specific for s in to the RNG.

                In any case, I agree with you about the main issue, and am going to do a few more attempts at summarizing things over at my blog.

  10. Herman says:

    Hi guys

    Thanks for this discussion so far. I’m not following all of this with complete understanding, but I’m working on a scenario that is slightly different to those mentioned but reasonably common, I think. Maybe you have some comments.

    I have a nonlinear regression model with errors in X (X_obs = X_true*Normal(1,0.1)). I know that finding a Weighted Minimum Absolute Error (WMAE) fit/estimation will get me quite close to the true params; WMAE errors on data such as mine are ~Normal(2%,25%). Although I want to keep my prior as “clean” and data-free as I can, I know the WMAE estimates going in. If I spec my prior as p(params)=Normal(WMAE_mean,0.25*WMAE_mean), my Bayesian parameter estimates are better than if I feign ignorance, because I simulate X_true and X_obs and calculate the errors, my errors are smaller and the variance on my errors is also smaller than for non-WMAE priors.

    So on the one hand, after simulating multiple data sets, I can say it works (what it does to the variance on my estimates, I’m less confident about). And it’s not just data reinforcing data, because X_obs (X measured with error) is misleading and attenuates/biases the parameter estimates away from the true values. On the other hand, I’m kinda data-peeking for a prior.

    Any thoughts?

    • Hi Herman.

      Can you be a little clearer here. I take it you have some observed values x which you think of as a true value X times a random normal error with distribution normal(1,0.1);

      so you could define a model with some multiplier params:

      xmult ~ normal(1,0.1);

      and some transformed parameters that represent the true value of X:

      xtrue <- x/xmult;

      and then you have a regression model:

      y ~ some_distribution( f(xtrue), some other params);

      Now, what is your concern?

      • Specifically, it sounds like you have some information about the sampling distribution of your regression errors? so some kind of info about “some other params” in my notation above on the second to last line, which defines your distribution of the regression error. Is that right?

        • Herman says:

          Thank you Daniel

          Yes, that’s about right. Basically, I have a non-Bayesian parameter estimation procedure: WMAE. I want to use its results as a prior for my Bayesian analysis, but such a prior is obviously based on the data.

          I get my idea of the WMAE errors from generating multiple synthetic data sets for x_obs from x_true. I find (params_WMAE | x_obs), and compare to (params_WMAE | x_true). It is from these simulations that I kinda know what to expect from the WMAE method in terms of estimation errors and their distribution for my specific case of 10% measurement error in X.

          To elaborate:

          It’s like using OLS regression estimates as priors for a Bayesian regression, except that there is this caveat about measurement error: I know that my prior is based on mismeasured data, which is what I seek to correct/compensate for in my Bayesian appproach. WMAE is still naive in the sense of using x_obs as a surrogate for x_true, but I limit the damage by putting smaller weights on larger values (which have larger variances). So this procedure uses a form of the data, but it isn’t quite as bad as using OLS regression estimates as priors in standard regression.

          A Bayesian approach allows me to put a distribution on each x_obs point, (say, x_B ~ Normal(X_obs, X_obs*0.1)) simulating the measurement error, and bringing me closer to y ~ some_distribution( f(x_true), some other params). The only time the prior really does refer to the same data points, is for the MCMC cases where x_B = x_obs.

          Using the prior described above for params does help, it seems, even after I spec the starting point for the NUTS sampler to be params_WMAE. I’m just curious as to whether such an approach is a mortal or a venial Bayesian sin :)

          A more complete description of the problem is available here:
          altough I don’t want to make more work for you, and hopefully it is not necessary to read it. It is also slightly outdated, and I hope to upload a more correct version within a day or two.

          • Herman, very nice project indeed, I love measurement error :-)

            I think what you’re saying is that you don’t have a good model for the error in the regression, but that simulation studies have helped you figure one out. I’ll use yobs and ytrue and xobs and xtrue here. You’d like your model to look like:

            xerr ~ normal(1,0.1);
            xtrue <- xobs/xerr; /*potentially long tails!*/

            yerr ~ normal(1,0.1);

            yobs/yerr ~ some_distrib( (1+a)xtrue*cos(phi+phierr), some_err_param);

            I think you're happy with the priors on xerr and yerr, but you don't know what the regression errors look like, namely what to pick for some_distrib and some_err_param, is that right?

            OR am I misunderstanding and you're not happy with the normal priors on xerr? or are you trying as in your code to do

            xtrue ~ normal(xobs,xobs*0.1); /*which is NOT the same thing */

            and then trying to compensate for the error here by altering 0.1 to something else from your WMAE analysis?

            If so, I'd suggest doing what I do above, create the parameter for the normal multiplier, and then divide the xobs by it. It fits your mental model of what's going on more exactly.

            But, I do think your situation is ripe for some kind of information coming out of simulations to determine what some_distrib and some_err_param should be. And that simulation might well be informed by the actual data you have, and then you'd have a data-dependent model, and the question would arise again, have you done something secretly wrong.

            This is where I think the advice "don't use the data twice" is not helpful. The typical stats 101 example that Carlos Ungil gave me above to convince me that I had some error has the flavor of "using the data twice" but what's "really" wrong with it is *failure to condition properly*. So the question you have now is whether it's possible to do simulations based on your data and some information you have about the physics and soforth, to then produce your regression error model in such a way that you believe you are conditioning properly on whatever your model really depends on.

            • Herman says:

              I’m quite comfortable with my prior on xerr and yobs, and I think selecting a T-dist on yobs is well-founded. I’m just wondering about my priors on parameter estimates (x, phi_err, etc) for the y-function, given their data dependence. What you say about conditioning makes sense, especially in the light of Darnieder’s PhD thesis shared below by Carlos. Darnieder provides an interesting overview and a rigorous framework within which the data-dependent prior can be set (admittedly, I’m only halfway through the thesis at the moment). I think he addresses Laplace’s comments above as well. I’ll have to figure out if my use of the estimates fall within his procedure if I want to be really rigorous, but I suspect that they do.

              I really appreciate your help with clarifying a bunch of issues for me, even if just by forcing me to think about some of them.

              Thank you!

              • A big part of why I engage in all of this is to help clarify issues that I also face in the course of doing often fairly complex stuff. And I hope that it helps other people! Doing this takes a willingness to be wrong in public. Something I appreciate about this blog vs the typical academic journal situation is that it’s a discussion.

                The concept of “using the data twice” is clearly not the real problem, but if that’s how you look at it, there’s no way to proceed that makes sense in something like my molecular dynamics example (over at the blog). It makes NO sense for a model of how physics works to include within the model some psychological model of what the experimenter decided to do, simply because the experimenter forgot to tell you. If you can figure it out from the data… and then condition on it properly when you analyze the data… that makes sense.

                technically even using N, a count of how many data items you were given, is “using the data twice”. But, you condition on it, so it’s fine!

                Alternatively, you could put a prior on N and put random N values in until Stan didn’t crash… does that make sense?? of course not. But it’s the same as conditioning on N.

  11. Ok, yet again another blog article trying to summarize this discussion:

    When you use some statistics of the data in your model, your WHOLE MODEL becomes conditional on knowing that statistic of the data. If you write your model so that it actually expresses your conditional knowledge, then you haven’t gone wrong. But if you naively write down something without realizing that you’ve failed to condition, you can go seriously wrong, as I did above.

    In the above blog post I reiterate Carlos Ungil’s simple example of going wrong, and then I give two examples that are closer to what I had in mind when defending the idea of using statistics of the data to build a model. In both of those examples you use statistics of the data, but then condition your model on knowing that statistic, and now you’re not violating the logic of probability theory.

    In the physics simulation example in particular, it’s clear what the issue is. The fact that someone else ran the simulation and didn’t tell you the mean(energy) they put in to the computer is no reason to bring in a parameter and a prior model based on their psychology. Just calculate the mean energy, and then condition the model properly on knowing that value.

Leave a Reply