“Can you change your Bayesian prior?”

Deborah Mayo writes:

I’m very curious as to how you would answer this for subjective Bayesians, at least. I found this section of my book showed various positions, not in agreement.

I responded on her blog:

As we discuss in BDA and elsewhere, one can think of one’s statistical model, at any point in time, as a placeholder, an approximation or compromise given constraints of computation and of expressing one’s model. In many settings the following iterative procedure makes sense:

1. Set up a placeholder model (that is, whatever statistical model you might fit).

2. Perform inference (no problem, now that we have Stan!).

3. Look at the posterior inferences. If some of the inferences don’t “make sense,” this implies that you have additional information that has not been incorporated into the model. Improve the model and return to step 1.

If you look carefully you’ll see I said nothing about “prior,” just “model.” So my answer to your question is: Yes, you can change your statistical model. Nothing special about the “prior.” You can change your “likelihood” too.
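
To make the loop concrete, here is a toy sketch of steps 1-3 in code. Everything in it is made up for illustration (fake data and a deliberately simple conjugate normal model), and a posterior predictive check stands in for the informal "do the inferences make sense?" judgment of step 3:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=50)        # stand-in data

# Step 1: placeholder model: y_i ~ Normal(mu, 1), prior mu ~ Normal(0, 10^2)
# Step 2: inference (conjugate, so the posterior for mu has a closed form)
n = len(y)
post_var = 1.0 / (n + 1.0 / 10.0**2)
post_mean = post_var * y.sum()

# Step 3: check the fit by comparing replicated data sets to the observed data
mu_draws = rng.normal(post_mean, np.sqrt(post_var), size=1000)
y_rep = rng.normal(mu_draws[:, None], 1.0, size=(1000, n))
p_val = np.mean(y_rep.std(axis=1) >= y.std())
print(post_mean, p_val)  # a p-value near 0 or 1 flags a misfit: improve the model and return to step 1
```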

And Mayo responded:

Thanks. But surely you think it’s problematic for a subjective Bayesian who purports to be coherent?

I wrote back: No, subjective Bayesianism is inherently incoherent. As I've written, if you could in general express your knowledge in a subjective prior, you wouldn't need Bayesian Data Analysis or Stan or anything else: you could just look at your data and write your subjective posterior distribution. The prior and the data model are just models; in practice they are neither correct nor complete.

More here on noninformative priors.

And here’s an example of the difficulty of throwing around ideas like “prior probability” without fully thinking them through.

95 thoughts on “Can you change your Bayesian prior?”

  1. Here is what’s going on. The sum and product rules of probability theory are fundamental. Everyone agrees with them and they’re implicitly used in every nontrivial probability calculation. Frequentists believe they only apply to “random variables”, while Bayesians think they apply more generally, including to hypotheses and so on.

    If you agree with Bayesians, you can immediately derive Bayes Theorem and use it the way Bayesians do. However, Bayes Theorem assumes the model is given with certainty.

    If there’s doubt about the model, you need to re-do the derivation from the sum/product rules with that new assumption. When you do, the result is that the distribution in the “prior” spot can easily depend on the data.

    So basically everyone has taken a result which assumed a fixed model, applied it when the model isn’t fixed, found discrepancies, and then variously claimed that Bayes is nonsense or incomplete, according to philosophical taste.
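
    Spelled out, the identity behind that claim is presumably something like this (my notation; the linked post may set it up differently):

    P(\theta \mid D) = \sum_M P(\theta \mid D, M) \, P(M \mid D)

    Each term carries its own within-model prior P(\theta \mid M), but the weights P(M \mid D) attached to those terms are functions of the data.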

    The details are here:

    http://www.bayesianphilosophy.com/the-data-can-change-the-prior/

    In general, there are lots of instances in which the sum/product rules imply you should effectively change assumptions in ways separate from Bayes Theorem. See some more examples here:

    http://www.bayesianphilosophy.com/wrong-distributions/

      • It might be more correct to say “the data can change which model you’re going to pay attention to”. The model = the prior, as well as any information you’re using about the data/outcomes.

        In the simple discrete case, you could say that there are, say, 5 or 6 models you might think are worth pursuing. You can start by fitting the one you think a priori is the best and, on seeing that it doesn’t fit at all, down-weight your belief in the goodness of this model and go back and start working with one of the other models. Furthermore, in the presence of the information you gleaned about the first model’s lack of fit, you can revise your probabilities on the other models.

        Of course, you might in all these cases keep what is thought of as the likelihood information constant, and simply change the portion of the model that is thought of as “prior”.

        Another way to say this is, initially, your prior was some complicated mixture model of multiple different priors, but you started with just one of the mixture components to see if you could avoid doing a lot of work, and you got unlucky.

  2. #3 is highly problematic:

    I expect my inference to return result X because it “makes sense”; the inference from model 1 did not return result X. I run model 2; the inference returns result X. Therefore, I will use model 2 and my study supports result X.

    Obviously an oversimplification, but not by much.

    • If you assign probabilities to models/hypotheses/fixed parameters/whatever, then #3 is often exactly what the basic equations of probability say you should do. For a Bayesian to accept this requires them to accept nothing more than is needed to derive Bayes Theorem. Therefore it should be acceptable to all Bayesians, and everyone else with a clue for that matter.

      • I agree that there is a way to formalize the incorporation of prior probabilities for models/hypotheses via the laws of probability, but this is not what is described in #3.

        In your post

        http://www.bayesianphilosophy.com/the-data-can-change-the-prior/

        you formalize the claim that it follows the laws of probability, but moving from your Equation 2 to Equation 3 requires the strong assumption that the data strongly indicates a particular model. Upon failure of this assumption, only your Equation 2 holds, which is not what is described in #3 of the post on this blog.

        So yes, to say this does require them to accept more than is needed to derive Bayes Theorem. To use your Equation 2 does not.

        • But in the presence of this issue Laplace’s initial point that #3 is “exactly what the basic equations of probability say you should do” is false.

        • Sure; and Laplace has been perfectly clear on this point. The full quote is “#3 is often exactly what the basic equations of probability say you should do” (emphasis added). If you look at the original post, you’ll see that the antecedent to equation 3 is “Consider the case when the data strongly picks a model so that P(M|D_n) is sharply peaked about the most likely model”.
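
          For reference, the approximation being described is presumably something like this (my notation, not necessarily the linked post's Equation 3):

          P(\theta \mid D_n) = \sum_M P(\theta \mid D_n, M) \, P(M \mid D_n)
                             \approx P(\theta \mid D_n, \hat{M}) \propto P(D_n \mid \theta, \hat{M}) \, P(\theta \mid \hat{M}), \quad \hat{M} = \arg\max_M P(M \mid D_n)

          It holds only when P(M \mid D_n) is sharply peaked at \hat{M}; the prior-like factor P(\theta \mid \hat{M}) then depends on the data through \hat{M}.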

  3. A number theorist is interested in some question or other. While investigating it, they happen to infer that 1=2. Do they (a) publish a note saying that Peano’s axioms of arithmetic are inconsistent, or (b) look for their error, correct it, and go on? If you chose (b), why would you think that there is anything “problematic” with a subjective Bayesian correcting their errors?

    To put it another way, subjective Bayesians do not “purport to be coherent”. Like all other non-divine beings, they realize that they can make errors. Or they realize at the meta-level of inference that doing the “right thing” is a lot of work, and will probably not make any difference anyway. Except sometimes they realize later that it does make a difference, and have to go back and do more work.

    • Radford: Nicely put.

      However, at the JSM 2015 there were folks (who now accept 3) who talked about others (sometimes earlier versions of themselves) who refused to entertain 3 because it undid mathematical coherence (not realizing that coherence never actually applies to anything other than abstractions).

      Specifically, Tony O’Hagan talking about Dennis Lindley, and Stephen Fienberg talking about his own earlier objections to the prior depending on the data.

      These issues perhaps need more discussion in the literature.

      Holding a model as true (momentarily) one should stay consistent with it (but only until it is really doubted).

  4. Clearly this is a non-trivial problem for Bayesians, as was evident from the conversation that resulted from my blog, and from your own experience with Bayesians being reluctant to test their models. I spoze the most well worked out view, in the discussion to my blog, was Phil Dawid’s. The thing about the likelihood is that it describes the data generating procedure via explicit statistical assumptions that can be checked via the n data points. With a prior, we typically imagine selecting a single instance of a parameter. Priors serve very different purposes, so is one testing whether it works for a given purpose or whether it adequately describes the distribution of the parameter? The other issue is how to distinguish between model violations (without the prior) and prior violations. You speak of changing the prior if the inference doesn’t “make sense” — although it might flag some problems, that’s scarcely a test with given properties. Granted, my blog query was mainly focused on subjective priors. A high posterior to H might make perfect sense to one who believes strongly in H, despite model violations. I guess we’ve been through a lot of this before. I was seeking the current Bayesian position, and it’s varied.

  5. An extreme form of this is how my research group worked. Set prior. Analyze data. If results confirm Principal Scientist worldview (“make sense” in Andrew’s more diplomatic language) stop. Otherwise, change prior. Bayesian statistics helps because you can modify the prior to make sure results follow a given worldview, and get philosophical cover for it. With frequentist statistics, you have to tamper with the data or the model, and feel guilty about it.

    As long as highly subjective terms such as “makes sense” are incorporated in the data analysis process, the results will be subjective, and not in the good sense of “what a person believes” but in the sense of “what a person believes will secure the next grant”. Maybe it’s time to abandon any ambition of mathematical rigor in the model generation phase and rely purely on the model verification phase, e.g. direct replication: a little like how theorems can be inspired by dreams, deities and what not, but the consensus comes from proof verification, which seems to be not so controversial.

    I don’t really think there is an alternative to the iterative process Andrew describes, in either a Bayesian or a frequentist setting (see Tukey, “The Future of Data Analysis,” for a more authoritative opinion on the matter), but it is very prone to abuse.

    • It’s this kind of abuse that worries me about advising people to check that results ‘make sense’ in an informal fashion. Glad to see you articulate it more clearly than I’ve been able to before!

      • +1

        Excellently articulated. And your research group isn’t the only one abusing this. I’ve seen this abuse in other contexts.

        Which is why I’m a huge fan of Bayesian models in projects where *predictive power* is very clearly the arbiter of a good model.

        In other contexts both Bayesian & Frequentist approaches are, sadly, equally prone to abuse.

        • I think very often, I’m interested in using a Bayesian model to find out information about an unobserved quantity. In many cases such a quantity is highly theoretically motivated, examples could include things like the diffusivity of a protein, the time-averaged rate of growth of cells, the degree of damage in a certain structure (measured as a reduction in stiffness for example), the permeability of soil under the ground… whatever. The point of a Bayesian model is to connect observable data which doesn’t directly tell me what I want to know, through an often well founded mathematical model, to a quantity I can not directly observe.

          In other contexts, where underlying theory is more limited, people are often left with something like linear regression as a workhorse, to at least specify the first two terms in a Taylor series.

          Also, people often like to categorize things into “yes/no” categories, and when they do that, they often wind up with “what’s the difference in means between categories”. I know sometimes you can’t do better than that because of the available data… but at each step going from the partial differential equations of the evolution of protein concentration in a fluid…. down to “is there a difference between these two groups” you are doing more and more glossing over of details, and getting less and less information out of your analysis.

          When you get down to “is there a difference in the frequency of red shirts in women during peak fertility” or “is there a difference between two schools in the test scores of black male students” you’ve glossed over so much detail that there’s a lot of room to “put back” details in your model in the wrong, misleading, dishonest, too-hopeful or other ways.

        • @Daniel

          It is in the first type of project you described, i.e. “forecast an unobservable quantity from observables,” that the risk of abuse is highest.

          The priors can be, and often are, manipulated to yield an estimate for the unobservable quantity that matches the author’s prejudices or interests. In many areas the consensus on priors is so weak, and the range of reasonable-looking priors so wide, that you can move the predicted quantity to wherever you want it to be.

          Camouflaging the author’s purely speculative hunch via the respectable intermediary of a Bayesian model lends it the false appearance of “objectivity”.

    • Anon, Convex:

      Yes, the idea is that “makes sense” is defined with respect to actual prior knowledge, not prior hopes and dreams. If you have a principal scientist who believes that political attitudes will change by 20 percentage points during the menstrual cycle, or a principal scientist who believes that the children of beautiful parents can be 8 percentage points more likely to be girls, then you can indeed end up with “garbage out.” I think (hope? dream?) that expressing prior information explicitly is a useful step, in that it puts researchers face to face with their assumptions about the world.

    • Anon, any time I read a post that attacks some statistical framework on the basis of it being “prone to abuse”, I wonder whether the writer is blind to the world around them. The currently popular statistical approaches must be prone to abuse because they are so clearly being abused! We should make sure that the behaviours that constitute ‘abuse’ are obvious rather than trying for a mythical ‘non-abusable’ method.

    • This shows that transparency (detailing just what was done and why) is important in reporting Bayesian as well as frequentist analyses. In terms of Andrew’s Garden of Forking Paths metaphor, this means detailing all decisions and the criteria on which they were made (e.g., what the criterion for “makes sense” is when tweaking a prior).

  6. I think of the BDA approach as similar to the idea of proof by construction in mathematics. First assume a (Bayesian) solution exists. If it does, then it should satisfy some other known properties (Bayes theorem etc.), so construct the candidate based on those properties. However, the proof is not finished until you verify that the candidate actually satisfies the main criterion (i.e. fits the data). So check this.

    I do think that ‘frequentist style’ reasoning tends to enter under point #3, but this can be done within the BDA Bayesian framework.

    • Or, to put it another way. First find the relative fit of a set of candidates (likelihood or regularized likelihood functions) then check the absolute fit. Both are necessary for obtaining a good solution. The first quantifies how good they are if they exist (relative to the best say) and the second establishes that they exist (eg that the best is satisfactory).

      • A prior is just a way of specifying the feasible/candidate parameter space (in a more continuous way than sharp bounds). Why wouldn’t it be part of the model? And why wouldn’t we be able to change the candidate space if no good solution exists?

        When mathematical models are ill-posed we regularize/add more info. Would someone criticise a person using linear programming for modifying their constraints if they couldn’t find a good solution in the original parameter space?
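
        To make the regularization analogy concrete (a standard textbook correspondence, not something specific to this thread): a Gaussian prior on regression coefficients gives ridge regression as the MAP estimate,

        y \sim \mathrm{N}(X\beta, \sigma^2 I), \quad \beta \sim \mathrm{N}(0, \tau^2 I) \;\Longrightarrow\; \hat{\beta}_{\mathrm{MAP}} = \arg\min_\beta \|y - X\beta\|^2 + (\sigma^2/\tau^2)\,\|\beta\|^2,

        so tightening the prior (smaller \tau) plays the same role as tightening the constraints in the linear-programming analogy.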

        • Fine, so long as I know how much one had to fish around for the right prior till the model gave you the fit you are claiming as reasonable.

          Basically, do I look at the prior as a distillation of the pre-existing, consensus knowledge in the field or as a freely adjustable parameter that can be tweaked around till you get predictions that you like.

        • Depends on the circumstances I suppose. At minimum you report what prior you used and people can decide if it is reasonable. You might go further and report some priors that don’t work. The best would be to provide your code for people to play with.

          As for consensus vs adjustable parameter specifically, I’d say it’s a bit of both. What makes a reasonable parameter space may depend on consensus knowledge outside the current experiment, but you may wish to make weaker or stronger assumptions.

          I’d say some frequentist-esque (or maybe predictive rather than inverse inference) reasoning comes in when you want to consider how well your analysis would hold up on data not observed. Posterior predictive checks are one way of doing some limited investigations based on hypothetical replications; cross validation and actual replications are also helpful.

        • Depending on authors to report which priors didn’t work is sort of like expecting NHST studies to self-report on fishing expeditions.

          There seems to be little consensus on what constitutes a reasonable prior.

          What I hate is a prior tweaked till a desirable model was obtained & then the prior presented as if it was a priori chosen.

        • There’s no magic bullet of course. I just prefer an explicit mathematical model with a clear estimation method and explicit parameter constraints to point hypothesis testing. This at least promotes making the theories more explicit.

          Would you be similarly annoyed if someone presented a differential equation and boundary conditions as if they had been a priori chosen when in fact they had to try a few different derivations and bcs to get a decent model?

  7. As Andrew points out, the model includes the prior and the likelihood (or, if you are doing frequentist inference, the sampling distribution…or, for some other kind of inference, some other assumptions). And as George Box reminds us, “All models are wrong, but some models are useful.” So it seems to me that finding that the model produces results that don’t make sense is just telling us that the model we are using is not only wrong (which we already knew) but also not useful. If this is so, there is no compelling reason not to revisit the model to try to find one that makes more sense.

    This is true regardless of whether you are doing Bayesian inference or some other sort of inference.

    https://en.wikipedia.org/wiki/All_models_are_wrong

    • Bill: Lumping together all treatments on the grounds that “all models are wrong” is a way to downplay the distinctions that exist in the uses of models. People may scoff, but I’ll say it anyway. The frequentist is interested in employing models that enable flaws to be determined and that allow approximately true things to be learned; for this only aspects of patterns of variability and associated error probabilities need to hold approximately. The “all models are false” mantra, as I see it, makes it easy to trivialize the genuine differences between approaches to specifying, testing, and using statistical models with integrity.

      • Mayo:

        You write, “The frequentist is interested in employing models that enable flaws to be determined and that allow approximately true things to be learned; for this only aspects of patterns of variability and associated error probabilities need to hold approximately.”

        It’s funny that you say that, because it’s also true that the Bayesian is interested in employing models that enable flaws to be determined and that allow approximately true things to be learned; for this only aspects of patterns of variability and associated error probabilities need to hold approximately.

        OK, not all Bayesians. But, then again, not all frequentists are interested in employing models that enable flaws to be determined etc. A lot of frequentists are happy to run logistic regressions or whatever. The best frequentists are interested in finding flaws in their models and doing better; so are the best Bayesians.

      • Mayo: Sorry that you didn’t seem to notice that my remark was really directed at which models are useful. Box’s comment was basically a lead-in to that point.

  8. Michael, I am sorry my comments were so easy to misunderstand. I am interested in how different methods support or not the scientific inquiry under real world conditions and, having been sent back to my desk many times because the results “did not make sense”, I am keenly aware that subjective elements are not helpful to real analysts, or to subjects of inquiry like mine that are experiencing a paradigm struggle, with widely diverging world views represented at the highest levels. For the record, we never used any Bayesian method, that was not the problem, I am here to learn about them and how they can help. I have plenty of misgivings about frequentist statistics. I am just saying, we’ve got to find a better test than “makes sense”.

  9. @andrew: this is one of my favorite quotes of yours ever:

    “subjective Bayesianism is inherently incoherent.”

    i know you’ve said it before, but this version of the sentiment is so gloriously unabashed and honest.
    all inferential methodologies have pros and cons.
    knowing the cons is perhaps even more important than knowing their pros.
    maybe one day, the review process will be more supportive of us writing about the cons of our methods….

  10. Laplace

    How often do you get the true model and how do you know it is true
    when you get it? Is it possible to have two different true models?

    Mayo

    The thing about the distribution function is that it describes the
    data generating procedure via explicit statistical assumptions that
    can be checked via the n data points

    The thing about the likelihood is that it describes the data
    generating procedure via explicit statistical assumptions that can be
    checked via the n data points.

    Data are generated at the level of distribution
    functions. Distributions which are close generate data which are
    close.

    Distribution function and likelihood are connected by the
    pathologically discontinuous differential operator. That is, the
    distribution functions can be arbitrarily close and at the same time
    the likelihoods arbitrarily far apart. How do you intend
    to get around this problem?

    Tell us what you mean by
    “for this only aspects of patterns of variability and associated error probabilities need to hold approximately”.

    The “all models are false” mantra, as I see it, ….
    Agreed but do tell us what you mean by approximation, a simple example
    would help: when do you accept the standard i.i.d. normal model as an
    approximation?

    Andrew

    If you operate mathematically correctly within a probability model you
    are coherent, you have no choice. If you apply any such model to real
    data you cannot afford to be coherent, whatever it may mean in this
    context. I agree.

  11. Laplace
    Not at all.

    These two failures require different remedies. If K is wrong we must replace it with a true K’ to get P(x|K’). If K is true but misleading we need to add relevant information H to get P(x|K,H).

    try again.

    • All I said was either an assumption is wrong, or they’re all right but missing something relevant. You’re the first person to ever deny that. But what’s really funny is this is just a convenient verbal formulation of what the basic equations of probability (sum/product rules) say to do. Do you really think if I have to choose between “Laurie Davies” and the foundational equations of probability theory, I’m going to choose “Laurie Davies”?

  12. Laplace
    All I asked was

    How often do you get the true model and how do you know it is true
    when you get it? Is it possible to have two different true models?

    My answers are (1) never, (2) I don’t and (3) no.

    What are yours?

    • Well Davies, I suppose we could argue about this for another 200 years like statisticians have already done; each giving our opinions, hopes, and philosophical ideology. But there is another approach, which seems to be favored by no one but me. We could simply look at the two basic equations of probability theory, the sum and product rules. Everyone agrees on these two equations and they’re used in every non-trivial probability calculation ever done. So I think we’re on firm ground there. We can take those two equations and use them in full generality. The same generality needed to get Bayes Theorem.

      Then we can drop all opinions/philosophy/hopes and just examine what those equations actually do. Forget about what we imagine they do and just use them as is, without any additional inputs or other ad-hoc features. Just take them seriously and see what they say.

      When you do that something amazing happens. Problem after problem disappears. Issue after issue melts away. In particular, those equations say to do the following in general. Judge P(x|M) in part by how high P(x*|M) is, relatively speaking, where x* is the actual value of x. This isn’t a philosophical position; it’s just literally what the equations do.

      In some cases x is a vector or sequence of outcomes. So we get P(x1,….,xn|M) and we judge it by how high P(x*1,…,x*n|M) is relatively speaking. Furthermore in some cases this reduces to the following: think of this as a repeated “sample” from P(x) and then P(x*1,…,x*n|M) will be high whenever the histogram of x*1,…,x*n looks like P(x) approximately.

      People have taken this special case (and I’m guessing you’re one of those people) and mistakenly come to believe it’s the general case. They try to force every application of statistics into this mold no matter how ridiculously ill fitting it is. If you make that mistake, you start to think things like “you can’t have two true models”.

      But when you go back and look at the general case, a very different picture emerges. The distributions describe uncertainty and have nothing to do with frequencies in general. Even when frequencies are involved as in the above example, P(x1,…,xn|M) describes an uncertainty range of the actual x*1,…,x*n. And when you realize that it is possible to not only have true models but more than one.

      Now you may object to that, but it is a historical fact that Laplace ~200 years ago gave a (Bayesian) interval estimate for the mass of Saturn. Our most precise modern estimates confirm the true value is within his interval. If that isn’t a “true” model then I don’t know what is. Furthermore, if Saturn’s mass is within our modern shorter/more precise interval estimate, then our current statistical model is “true” as well. So it’s a historical fact that we got two “true” interval estimates for Saturn’s mass, one just has less uncertainty than the other.

      So I think we’re at an impasse here. Because I’m not going to believe your philosophy of statistics if it means rejecting the basic equations of probability theory and nor am I going to believe it if it means there can’t be two truthful interval estimates for Saturn’s Mass when that’s exactly what we got.

  13. Laplace

    P(x|K_1) is “more right” than P(x|K_2) whenever P(x^*|K_1) > P(x^*|K_2).

    27 (real) measurements of amount of copper (milligrammes per litre) in a
    sample of drinking water.

    2.16 2.21 2.15 2.05 2.06 2.04 1.90 2.03 2.06 2.02 2.06 1.92 2.08 2.05
    1.88 1.99 2.01 1.86 1.70 1.88 1.99 1.93 2.20 2.02 1.92 2.13 2.13

    Two models: N(mu,sigma^2) and Comb(mu,sigma^2)

    Log-likelihoods 20.31 and 31.37 respectively.

    Which model do you choose and why?

    See Chapter 1.3.2 and Table 1.2 of

    Data Analysis and Approximate Models, Laurie Davies (2014)
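
    For anyone who wants to check the first of those numbers from the data as given, here is a quick sketch. The comb density is defined in the book and not reproduced in this thread, so only the normal model's maximized log-likelihood can be recomputed here; it comes out at about 20.3:

    ```python
    import numpy as np

    # the 27 copper measurements (mg per litre) listed above
    x = np.array([2.16, 2.21, 2.15, 2.05, 2.06, 2.04, 1.90, 2.03, 2.06, 2.02,
                  2.06, 1.92, 2.08, 2.05, 1.88, 1.99, 2.01, 1.86, 1.70, 1.88,
                  1.99, 1.93, 2.20, 2.02, 1.92, 2.13, 2.13])

    n = len(x)
    mu_hat = x.mean()        # MLE of mu under N(mu, sigma^2)
    s2_hat = x.var()         # MLE of sigma^2 (divides by n)

    # maximized log-likelihood of the normal model
    loglik_normal = -0.5 * n * (np.log(2 * np.pi * s2_hat) + 1)
    print(mu_hat, s2_hat**0.5, loglik_normal)
    ```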

    • If these differences are due to measurement errors, for example, then I would use properties of the measurement process/devices to get a handle on the general order of magnitude of their size (not the frequency of their occurrence, which we generally don’t know). Then I’d create a distribution off that information so that the actual errors in those numbers you gave were in the high probability region of the error distribution (i.e. it’s a ‘good’ distribution by the criterion you seem to think is silly).

      At the end of this and a bunch of other stuff, I’d get a P(mu|all evidence), and if the true mu* is high in that distribution (i.e. a ‘good’ distribution by that criterion you think is silly) then the Bayesian credibility interval I compute from it will contain mu*.

      Again, you may not like this, but my goal is to get an interval estimate for the true mu*. I want this interval estimate to contain the true value and, beyond that, hopefully be as short as possible. It’s silly to claim I never do this (no true models) or that there’s at most one true answer (only one interval could ever possibly contain the true mu*).

    • > Log-likelihoods 20.31 and 31.37 respectively

      I assume these are max likelihood values?

      a) I wouldn’t compare likelihoods between different models, just within a model
      b) I would look at the whole likelihood function for each before making a decision
      c) I would adjust the likelihood function to take into account measurement accuracy if relevant
      d) I would regularize my parameter solutions using a prior if relevant

      • Also

        – I would say that the parameters mu and sigma have meaning only wrt a model and hence are different for different models, unless you give an explicit transformation between models

        – Presumably there is some characteristic you like of the normal model and some you don’t of the ‘comb’ model. If you would like to compare different models then include this as a parameter in the mapping between the models, preferably in some smooth way

        – Encode this preference as a prior on the mapping parameter to ‘regularize away’ the thing you don’t like about the comb model, if this is desirable

  14. Laplace

    They are measurements with errors as standard in analytical chemistry. The standard deviation is a good indication of the size of the errors. For your information both models are equally good based on the Kuiper metric, i.e. they both fit the data.

    The 95% confidence intervals for mu are [1.970,2.062] and [2.0248,2.0256]
    respectively. Bayesian posteriors for mu are essentially concentrated on these
    intervals. So as the likelihood under the comb is much higher than
    under the normal, you choose the comb, after all

    P(x|K_1) is “more right” than P(x|K_2) whenever P(x^*|K_1) > P(x^*|K_2).

    I am prepared to accept that there is a true amount of copper in the
    sample of water, number of copper atoms which can be expressed as
    milligrammes per litre. Is this the same as your true mu*?

    • Yes, that’s the same as his true mu; unfortunately, it’s not observable. If it were observable, we wouldn’t need statistics. The fact that the “comb” model (whatever that means; I’m not familiar with your usage of the symbol “Comb”) gives a smaller interval estimate is only indicative that it’s a better model if the interval actually contains the true mu (the true number of atoms of copper in the water).

      Let’s suppose that another 100 measurements indicate that the concentration is 1.99507 mg/L. Then the normal is a true model, and the Comb is not. To be specific, the comb would be using some information which was unjustified.

      Unfortunately, we can’t know ahead of time. What we can know is something about how well our models fit the mechanisms and scientific information we have available. If comb uses some information that we believe to be true which the Norm model doesn’t then we’d expect it to give sharper estimates and still be correct. But if that information is incorrect, then we’ve made a mistake. There’s no way to avoid making any mistakes ever.

    • Davies,

      There’s one truth in statistics that’s super simple. Beyond obvious really, but for anyone who’s been exposed to frequentist statistics (i.e. everyone) it’s nearly impossible to put into words in a way they’ll get. I’ll give it another try though, since there is truly no hope of progress until it becomes well known.

      The goal in setting up an error distribution isn’t to have P(error) match the frequency of errors in any sense whatsoever. Not even approximately. There’s no intention at all of chasing after a mythical interval estimate which contains the true value most of the time in repeated trials that never happen.

      The goal rather is to create an interval which contains the true value for the majority of potential error numbers my one data set could reasonably have.

      The former goal requires us to somehow have God-like knowledge of frequencies that don’t exist. And if we’re off by too much our (1-alpha)% interval is more like an alpha% interval.

      The latter merely requires us to have a range of possible values for the one set of errors in my one real data set. That’s it. An uncertainty range for actual error numbers that actually exist. The information needed to do this doesn’t require God-like knowledge and it’s practically checkable. Not every error distribution is suitable for this goal, but in practice many will usually work reasonably well.

      You may think this absurd, but pretty well every major statistician since Laplace’s time has remarked on how often multivariate normal distributions are used for errors and seem to work in examples where the frequency distribution of those errors is highly unlikely to be “normal”. It works because while NIID may be worse-than-useless as a prediction of future frequencies, it’s a sensible way of describing the range of uncertainty in the one set of errors numbers actually in my data.

  15. Laplace

    Sorry, missed your last-but-one reply; it’s getting late here (Berlin).

    Judge P(x|M) in part by how high P(x*|M) is relatively where x* is the
    actual value of x. This isn’t a philosophical position it’s just
    literally what the equations do.

    `In part’? You can drive a bus through this.

    … how ridiculously ill fitting it is …

    Do tell me what you mean by this. Or rather, tell me what you mean by a
    good fit.

    I understand that a model is a specification of the distribution of
    observations. A model is true if the observations do in fact have this
    distribution. There is no way of checking this.

    Now you may object to that, but it is a historical fact that Laplace
    ~200 years ago gave a (Bayesian) interval estimate for the mass of
    Saturn. Our most precise modern estimates confirm the true value is
    within his interval. If that isn’t a “true” model then I don’t know
    what is.

    This is a very different definition of truth. A true model is one
    which produces an interval containing the true value, in this case of
    the physical quantity you are trying to measure. So any model, no
    matter how absurd, is true if the calculated interval contains the true
    value of the physical quantity. Why so cautious and put true in
    inverted commas? If this is what you mean by true, remove them and
    state your case with vigour. This is the weakest definition of truth
    I have ever come across. No `tertium non datur' for you.

    • “I understand that a model is a specification of the distribution of
      observations. A model is true if the observations do in fact have this
      distribution. There is no way of checking this. ”

      “Laplace” will quickly disagree with this.

      Regardless of what the observations do in fact have as a frequency distribution, any distribution which makes the true value high in the posterior distribution is good enough. For example, a Normal(0,1) model is good enough whenever the actual data hardly ever fall more than about 1 to 3 units away from 0. All it does is downweight things that we know can’t be true (like values around 6 or -10 or 17541).

      Here is an example: http://models.street-artists.org/2014/03/21/the-bayesian-approach-to-frequentist-sampling-theory/

      the model I use for the data doesn’t even approximately fit the frequency distribution of the data… but the inference I get is accurate.
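
      A toy variant of the same point, with everything made up (this is not the example at the link): pretend uniform data are Normal(mu, 1). The assumed sigma overstates the actual spread, so the posterior interval for the mean is wider than it needs to be, but it is not misleading:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      n, reps, hits = 20, 2000, 0
      for _ in range(reps):
          y = rng.uniform(-1.0, 1.0, n)   # data are uniform; true mean is 0
          # data model: y_i ~ Normal(mu, 1) with a flat prior, so mu | y ~ Normal(ybar, 1/n)
          half = 1.96 / np.sqrt(n)
          hits += (y.mean() - half < 0.0 < y.mean() + half)
      print(hits / reps)  # essentially 1: wrong frequency model, but the inference isn't misleading
      ```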

  16. Daniel Lakeland

    I agree with you. Laplace seems to have a very different concept of
    truth. Any model whose Bayesian interval estimate contains the true
    value is “true”. So we have true, “true”, possibly false and “false”,
    a four-valued logic or maybe a continuum. I prefer standard binary
    logic. How about you?

    The normal distribution does not downweight the extreme values; it is,
    on the contrary, very sensitive to them. What you mean is something
    different. If the data contains large outliers you would not use the
    normal distribution, precisely because it would not downweight
    them. What do you mean by “units”? Certainly not standard deviations.

    100.00 100.00 100.00 -1.36 -1.14 -0.98 0.18 -0.36 -1.14 -3.84
    0.45 -0.40 0.20 -0.49 -0.96 -0.63 0.60 0.07 -0.53 -0.40

    All observations are within 3 standard deviations of the mean so you
    use the normal distribution.
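
    A quick check of that claim from the numbers as given (the three 100s inflate the standard deviation so much that they sit only about 2.3 sample standard deviations above the mean):

    ```python
    import numpy as np

    x = np.array([100.00, 100.00, 100.00, -1.36, -1.14, -0.98, 0.18, -0.36, -1.14, -3.84,
                  0.45, -0.40, 0.20, -0.49, -0.96, -0.63, 0.60, 0.07, -0.53, -0.40])

    z = (x - x.mean()) / x.std(ddof=1)   # standardize with the sample standard deviation
    print(np.abs(z).max())               # about 2.3, so a naive "within 3 sd" rule flags nothing
    ```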

    In your example you wish to estimate the average content of a jug. You
    take 10 and calculate the mean as you use the normal model. The central
    limit theorem tells you that this will be roughly normally distributed
    and this regardless, within limits, of the actual distribution of the
    volumes of the jugs and also of your prior. This has nothing to do
    with Bayes and to understand it you appeal to the central limit
    theorem. What do you do if you are interested in the median
    content? Exactly the same model?

    What is the true value you talk about? You have a true value for the
    100 jars but you also have a parameter mu. Are they the same? Does
    the data come with a true mu attached to it and that for every
    possible model? Take a log-normal model. Do the data come with true
    values for the parameters of this model as well? I prefer a less
    baroque ontology.

    Note that when we use a continuous probability distribution and then
    observe particular data, the probability of observing exactly that
    data is a slightly problematic concept that I don’t want to get into
    too much at this point.

    The answer is zero. Gauss had problems with this in the context of maximum likelihood, here in a letter to
    Bessel

    “For I must in every way consider it less important to determine that
    value of an unknown quantity whose probability is the greatest, which
    after all always remains only infinitely small, …” (translated from
    the German)

    You get to the likelihood by differentiating the distribution function. The
    operator is pathologically discontinuous; in mathematical terms you
    have an ill-posed problem which mathematicians typically solve by
    regularizing. How do you regularize?

    • Laurie, you seem to have missed a few things. The most glaring is “In your example you wish to estimate the average content of a jug. You take 10 and calculate the mean as you use the normal model.”

      Actually, I use a model for the data which is exponential with an unknown rate (or average value mu). The actual data is produced using a uniform distribution between 1.4 and 2.0. There is no central limit theorem in my model. Why does it work?

      It works because when the rate in the exponential distribution is about 1/mu, with mu being *the true average over all 100 jugs*, then the data values will have a relatively large value for exp(-x/mu)/Z(mu). When mu is too big, the Z value will be large, and the overall value will be smaller. When mu is too small, the exp(-x/mu) value will be very small. So, when mu is about right, the data will fall in a high probability region of the distribution I chose. mu is constrained to 0 – 2.25 with a uniform prior implied by Stan’s defaults, based on my knowledge of the situation (that you can’t fit an extra 0.25 liters in a 2-liter jug).

      The distribution isn’t the frequency of anything. It isn’t dependent on a central limit theorem. It just is set up so that exp(-x/mu)/Z(mu) is large when mu is near the correct value, and small otherwise.

      second:

      “The normal distribution does not downweight the extreme values, it is on the contrary very sensitive to them.”

      In your world, the data are random, and the parameters fixed.

      In my world view, the parameter is unknown and the data are fixed. If the data fall very far from the center of the normal distribution for a given parameter value, then the *parameter value* is highly down-weighted by the fact that that value is inconsistent with the data. In that sense, the normal radically downweights any parameter value that causes the data to fall more than a few sigma from the mean. If it’s true in the world that the data do not ever fall more than a few sigma from the mean, then this is a true model ***regardless of the actual frequency histogram of the data***. For example such a model will give you good inference even if the actual data are uniform between -2,2 or bimodal mixture of normal(-1,0.25) and normal(0.8,0.55). If, however, this is a bad assumption, if for example some of your data collectors report values in mm and other in meters, so that there is a big disconnect in scale between the two groups of observations, then the assumption that the data will always be a few sigma from the mean is a bad assumption.

      In my world view, the goal of statistics is to find out approximate true facts about unknown things in the world. Many times, those things are very well defined. In my case, mu is not an abstract parameter in the model, it is an actual average over 100 jugs of the actual amount of OJ in the jugs. Other times the thing is an abstract not well defined thing. For example, the poisson rate of car accidents along a certain road. There is no “randomness” in the world which enforces a poisson process, yet the abstract estimated rate can tell us true information about what’s going on, namely when two different time periods have different estimated rates, probably something changed in the traffic environment.

      I personally do not believe in “infinite sequences” of measurements in the world. All science is about finite distributions. sometimes, as an asymptotic approximation, you can treat the distribution as infinite/continuous. Whenever that becomes a problem, it is the model that must change, not the amount of orange juice in the purely theoretically infinite number of containers that the machine will produce in an infinite set of future possibilities.

      • As for your example with the data concentrated about 0 and with a few values of 100 thrown in: the sigma you are using is obviously the sample value. The sigma I am interested in is the sigma which describes the scale of the variation caused by some physical process that results in variation in the data. When the variation is of the type you show, it’s the model that must change, not the parameter value. Such variation is inevitably produced by non-homogeneous mechanisms.

        I think it’s a mistake to start with the math and make the data fit the math in some way. Just as it’s a problem to start with a distribution function, describe the density as a derivative, and then say that the derivative operator is unbounded therefore the science is shaky.

        Start with a physical understanding of the problem, describe an approximation of that understanding in mathematics, and then proceed to calculate. If the calculation returns nonsense, start over with the mathematical description, or change your physical understanding. Mathematical models that return nonsense are wrong. That is the essence of science and the difference between modeling using math, and pure math. In pure math, the initial assumptions are true by assumption. In science, the initial assumptions are assumed true until the data shows they are false.

        • In science, the data never come from random number generators with given frequency distributions. They always come from some set of atoms and sub-atomic particles interacting to alter the position of some other particles. it’s pretty much always position, or occasionally velocity, too.

          Position of a needle on a gauge caused by velocity of electrons flowing through a wire causing a magnetic field which provides a torque to counteract the torque applied by a spring.

          Position of electrons in a capacitor that produces a voltage for an A/D converter.

          Position of graphite molecules on a paper survey answered by an enormous bundle of atoms we call a Psychology Student who uses the position of photons striking her retina to interpret the meaning of the position of the ink molecules on the page which alters the position of sodium ions in the neurons of her brain.

          Position of electrons in a bank computer which cause a money balance to be shown on an account which causes a bundle of molecules called a consumer to be able to alter the position of a bundle of molecules called a product from one location on the earth called a store to another location on the earth called a home without another bundle of molecules called a police officer confining the consumer bundle into a small space by strong molecular repulsion.

          It’s never Platonic infinite sequences of bits coming from a particular abstract mathematical sequence whose Godel number is some particular value and which satisfies the Kolmogorov axioms for a particular distribution function to within a given epsilon when calculated by the non-constructive uniformly most powerful computable test.

  17. hjk

    I apologize, I completely missed your contribution, I was writing way past
    my bedtime.

    Yes, these are maximum likelihood values but if you take a uniform
    prior for mu over [1.78,2.25] and an independent one for sigma over
    [0.042747,0.315859] then the posterior for mu is essentially
    concentrated on the interval [2.02122,2.02922] agreeing more or less
    with the 0.95-confidence interval for mu based on this model.

    (a) The posterior for the comb model is virtually concentrated on
    [2.02122,2.02922]. If you take it seriously and compare its value for
    different mu it will tell you that mu=2.00 is way off with a
    likelihood essentially zero. But 2.00 is a perfectly reasonable value for the
    data so comparing within the model doesn’t help in general.

    (b) Again if your likelihood is pathological it won’t help.

    (c) Well, how do you adjust? Adjust suggests a minor alteration but
    you may need to reject it and start again.

    (d) There seems to be a feeling amongst some Bayesians that one can
    regularize using a prior. This is sometimes true, say ridge
    regression, but not always. You can’t repair the comb disaster by
    regularizing your prior.

    Bayesian statistics in its standard form requires truth, behaving as
    if the model were true. They are not alone in this. The frequentist
    also requires truth: how do you otherwise define a confidence
    interval? All arguments about the likelihood principle are conducted
    on both sides on the assumption that the model is true. Birnbaum uses
    the word adequate but there is a huge chasm between adequate and
    true. It is perfectly possible for two different models to be adequate
    but they cannot both be true. Worse, there is no continuity principle
    to take you from adequate to true. On the contrary there is a
    discontinuity principle, namely the unbounded linear differential
    operator.

    The location-scale problem with choice of model is
    ill-posed. Bayesian priors will not regularize it. One way of
    regularizing it, but not the only or even the best way, is to minimize
    Fisher information. If you do this for the copper data and the normal
    and comb model the empirical Fisher informations are 2.08e3 and
    3.73e7 respectively so you choose the normal model. The comb model
    imports efficiency in the frequentist sense. It gives you a very
    precise result which is nonsense because it is so precise. Tukey calls
    this a free lunch and states that there are no free lunches in
    statistics. He calls models that don't import efficiency bland or
    hornless and these are the ones to be used.

    Your “Also”:
    – A model is something you impose on the data from the outside. The data
    do not come with the true parameters attached. So if you are interested in
    some numerical quantity, say the amount of copper in the sample of
    water, this quantity must be related to the parameters of your
    model. Suppose we also consider the double exponential model for the
    copper data which is also consistent with the data. If we use maximum
    likelihood for mu based on the normal model the answer is the mean, if we use the
    double exponential it is the median. So the same parameter in two
    different models has two different interpretations. However in this
    case we can give a consistent interpretation for mu across the models,
    namely that it is the point of symmetry (the comb model is also
    symmetric). But Tukey, who early in his career was a chemist, states that such data
    are often not symmetric. Consider the log-normal distribution. What
    function of the parameters do you now associate with the copper content
    of the water sample? How do you achieve a consistent interpretation
    across the different models? Make a suggestion.

    – Yes, it minimizes the Fisher information across an interesting class
    of models. I do however prefer Huber’s minimization.

    – As I write above I don’t think you can do this using a prior. The
    problem goes much deeper.

    The above may give the impression that likelihood only goes wrong in
    ill-posed problems and that this can be overcome by some form of
    regularization. An example due to Andrew Gelman in his Bayesian EDA
    paper shows that likelihood can go wrong in well-posed problems. The
    problem is to estimate the four parameters of a 50-50 mixture of
    normal distributions. This problem is well posed, there is nothing
    pathological about it. Nevertheless he states `the Bayesian averaging
    over uncertainty does not save us'. Why not just choose the parameters
    to minimize the Kuiper distance?

    • Thanks for your response :-) Interesting. Here is my attempt at responding.

      > (a) …But 2.00 is a perfectly reasonable value for the data so comparing within the model doesn’t help in general.

      Do you mean that 2.00 is perfectly reasonable as a parameter value or as a data value? I would only interpret the relation of mu to y via the model. If y_predict(mu=2.0) is far from the data y0 then mu is a bad parameter *within* that model. The model itself may be ‘bad’ according to some other criteria – which is in some ways your point – but then we have to be careful how parameters are interpreted. To me, here, they are (part of) the mechanism of generating data, not data themselves.

      > (b) Again if your likelihood is pathological it won’t help.

      True, but it may help identify pathological likelihoods.

      > (c) Well, how do you adjust? Adjust suggests a minor alteration but you may need to reject it and start again.

      You may reject it and start again, sure. In terms of using an ‘adequacy’ criterion I would suggest a relaxed rather than exact conditioning on the observed data a la something like ABC or a concentrated measurement model on top of a latent variable.

      > (d) There seems to be a feeling amongst some Bayesians that one can regularize using a prior. This is sometimes true, say ridge regression, but not always. You can’t repair the comb disaster by regularizing your prior.

      I’m not sure I agree that this cannot be done – if a description of the ‘disaster’ aspect of the comb model is included as an additional parameter then why can’t a prior over the ‘disaster’ parameter do some regularizing? Your model parameters should include all those relevant for describing your model, not just those of primary interest. What quantity (or quantities) besides mu and sigma differentiates the normal and comb models and what is the implicit prior in each case? Could you give more details so that I could try sometime?

      > Bayesian statistics in its standard form requires truth, behaving as if the model were true. They are not alone in this. The frequentist also requires truth: how do you otherwise define a confidence interval? All arguments about the likelihood principle are conducted on both sides on the assumption that the model is true. Birnbaum uses the word adequate but there is a huge chasm between adequate and true. It is perfectly possible for two different models to be adequate but they cannot both be true. Worse, there is no continuity principle to take you from adequate to true. On the contrary there is a discontinuity principle, namely the unbounded linear differential operator.

      This does make me nervous. But all mathematical frameworks assume that their assumptions can be interpreted as true in order to trust their consequences. This may be ‘raised a level’ to statements of inequality or accuracy about a model embedded at a lower level. For example you assume that your adequacy criterion is exactly satisfied when performing mathematical manipulations, even though it itself only imposes an approximate constraint on a substructure of your framework (the ‘generating model’ or whatever). Would you accept a criticism based on an assumption that violates your adequacy criterion?

      Similarly, if adequacy is defined in Bayesian terms using an approximate conditioning (ABC) or an additional measurement model then this moves the mathematical premises to be satisfied by reality to a higher level than those of the generating model itself. This relates to your comment about Gauss, perhaps.

      Again though, I also think when comparing different models one should also explicitly define the relation between them. This mapping is parameterised by some set of parameters – it may be that when including this parameter that two of these parameter values are given equal posterior weight etc, indicating that the models corresponding to those values are equally valid. One must distinguish between the ‘model’ being true and the ‘parameters’ being true. The *highest-level* ‘model’ (or mathematical framework) should be accepted as ‘true’ but different parameters can be equally well supported. The parameters can represent equally well supported sub-models and thus this ‘truth’ issue presents no obvious difficulty as far as I can tell.

      > The location-scale problem with choice of model is ill-posed. Bayesian priors will not regularize it. One way of regularizing it, but not the only or even the best way, is to minimize Fisher information…He calls models that don’t import efficiency bland or hornless and these are the ones to be used.

      See above – introduce a ‘blandness’ parameter with a prior favouring bland models.

      > How do you achieve a consistent interpretation across the different models? Make a suggestion.

      My point is that it is (probably) not in general possible to give consistent interpretations across different models that holds under all circumstances. Establishing these correspondences and their conditions is an important scientific activity. For example deriving equilibrium thermodynamics for an ideal gas from kinetic theory and thus relating the thermodynamic temperature to the average kinetic energy in kinetic theory. Theories have domains of applicability. It’s interesting to explore these. It may not always be possible to connect them.

      > – Yes, it minimizes the Fisher information across an interesting class of models. I do however prefer Huber’s minimization.

      Again index the class by an interpretable parameter and impose a bias towards models you like (‘bland’ etc).

      > – As I write above I don’t think you can do this using a prior. The problem goes much deeper.

      I think the problem is deep but can often be translated into the language of priors.

      > The above may give the impression that likelihood only goes wrong in ill-posed problems and that this can be overcome by some form of regularization. An example due to Andrew Gelman in his Bayesian EDA paper shows that likelihood can go wrong in well-posed problems…

      I’ll have to look into this. Maybe Gelman can answer himself?

    • An example due to Andrew Gelman in his Bayesian EDA paper shows that likelihood can go wrong in well-posed problems. The problem is to estimate the four parameters of a 50-50 mixture of normal distributions. This problem is well posed, there is nothing pathological about it. Nevertheless he states `the Bayesian averaging over uncertainty does not save us’

      Slow down now, hold on a second. In that example Bayesian averaging over uncertainty with an improper prior doesn’t save us, but improper priors should only ever be considered approximations to proper priors. If you use a proper prior, you’re guaranteed a proper posterior.

  18. Corey
    I admit not to having done the calculations. The paragraph commences `Unfortunately, this problem is {\it not} immediately solved through Bayesian inference’. Is it the case that it is immediately solved by taking proper priors? If this is so I don’t understand the paragraph as there is no hint of this. I assumed that taking proper priors over large intervals and letting these tend to infinity would also lead to most of the probability being concentrated near a zero of one of the sigmas. But maybe this is not so. Let me know.

    • I assumed that taking proper priors over large intervals and letting these tend to infinity would also lead to most of the probability being concentrated near a zero of one of the sigmas.

      I assume so too. I wasn’t talking about taking this kind of limit though — I was just talking about any one given proper prior.

  19. hjk, Keith O’Rourke, Daniel Lakeland

    Almost sinking under the weight but here goes. I always thought the
    comments were ordered chronologically but that is not so, hence some
    missed comments.

    Daniel Lakeland

    I distinguish between the true amount of copper in the water and a
    parameter mu used to model the data. I can and do identify them but
    this is speculative and always subject to revision. You can take
    Stephen Stigler’s historical data sets, calculate the, say, 95% interval
    for the `true’ value and then see how many of these really do contain
    the true modern value. I agree with your comments about the comb
    model. I only made it up to show that you require some form of
    regularization unless you have very specific knowledge of what is
    going on.

    hjk

    (a) A reasonable value for the quantity of copper in the water
    sample. If the data really were comb and you had good reasons for
    assuming this, then there is nothing to be said against the comb. If
    you don’t know this then you should use bland models which lead to
    large intervals. mu=2 is indeed bad within the comb model, but the comb
    model is highly informative, being based on a very particular density
    function, and you have no reason for assuming that the measurements
    behave in this manner.

    (b) There are too many of them, uncountably many.

    (c) A confession of ignorance. People use concepts that I have to look
    up. The last but one was an ancillary statistic. Now ABC. As far as I
    understand it you have to generate data for a given set of
    parameters. Data are simulated at the level of distribution functions
    via F^-1(U) with U uniform. If F and G are very close, say within 10^-6
    in the Kolmogorov metric, then F^-1(U) and G^-1(U) will in general also
    be very close, given finite precision maybe even equal. However F and G
    can be orthogonal in spite of the 10^-6, that is, the total variation
    distance is 1. So you can’t possibly distinguish the densities of F
    and G even though they are disjoint.
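
    A small numerical illustration of this point (the specific F and G are my own choice): take F = Uniform(0,1) and G = the uniform distribution on the grid {k*1e-6 : k = 1,...,1e6}. Then sup_x |F(x)-G(x)| <= 1e-6, yet F is continuous and G discrete, so their total variation distance is 1.

    import numpy as np

    rng = np.random.default_rng(1)
    u = rng.uniform(size=10)

    f_inv = u                              # quantile function of F = Uniform(0,1)
    g_inv = np.ceil(u * 1e6) / 1e6         # quantile function of G = uniform on the 1e-6 grid

    print(np.max(np.abs(f_inv - g_inv)))   # at most 1e-6: the simulated samples are practically identical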

    (d) It is not easy to regularize an ill-posed problem. A standard
    method is Tikhonov regularization of ill-posed linear
    operators. The linear operator in our case is the differential
    operator. This is a big problem. I doubt very much you can regularize
    the comb density, that is, reduce its Fisher information without making
    it very like the normal or Huber distributions.

    hjk, Daniel Lakeland

    The easiest thing is to explain how I look at it. I have a data set
    of size n. I think about how to model it and in the present discussion
    these models will be stochastic ones, let us say with parameters. I
    specify a set of parameter values and generate 999 samples of size
    n. I insert the real data and ask an expert to specify 50 abnormal
    data sets. If the real data set is not one of the 50 then I say that
    the model with these parameter values is an adequate approximation to
    the data. More formally, I will give a precise definition of what I
    mean by an adequate approximation and this will be in principle
    computable. The inputs are data + model and the output is 1 or 0
    indicating whether the model is an adequate approximation or not. I
    specify a number alpha, say alpha = 0.95, so that if the
    inputs are a sample generated by the model and the model itself then the
    output is 1 with frequency 0.95. Not all data generated by a model
    will satisfy the approximation constraints but typical data sets
    should. alpha quantifies what you mean by typical. You run the
    programme with the real data for all possible parameter values. The set
    of parameter values which result in 1 is the approximation region. It
    may well be empty. There is no assumption of a true model, in
    particular that the data is generated by the model for some parameter
    values. There is no attempt to approximate the `true generating
    mechanism’ of the data, whatever that may be. The data are
    approximated. It is a concept of approximation without truth,
    `approximatio sine veritate’. There are no true parameters and no
    unknown parameters. Every parameter value is or is not an adequate
    approximation. The concept is neither Bayesian nor frequentist. You
    approximate the data as they are, you do not need to repeat them. You
    can approximate the binary expansion of pi using the binomial
    distribution. I have no problems with this.
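
    A minimal sketch of this recipe, with a normal location model and the sample mean as a (deliberately crude) adequacy statistic; the model, the statistic and the grid are illustrative choices of mine, not part of the original description.

    import numpy as np

    rng = np.random.default_rng(2)
    data = rng.normal(2.0, 0.1, size=27)          # stand-in for a real data set of size n

    def adequate(data, mu, sigma=0.1, alpha=0.95, nsim=999):
        sims = rng.normal(mu, sigma, size=(nsim, data.size)).mean(axis=1)
        lo, hi = np.quantile(sims, [(1 - alpha) / 2, 1 - (1 - alpha) / 2])
        return int(lo <= data.mean() <= hi)       # 1 = adequate approximation, 0 = not

    mu_grid = np.linspace(1.8, 2.2, 81)
    region = [mu for mu in mu_grid if adequate(data, mu)]
    print(len(region), "adequate values of mu")   # the approximation region (it may well be empty)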

    hjk
    Ask Gelman. See Corey above.

    Daniel
    So my world is nothing like you described it.

    Daniel
    Yes I did misunderstand your model. I got mixed up between a maximum
    entropy prior, which could be exponential, and the model, where in this
    context you mention the Gaussian distribution towards the end. No
    matter. You want the mean. Your model is the exponential with parameter
    mu. The sufficient statistic is the mean of the data. The maximum
    likelihood estimator for mu is the mean. So clearly your posterior will
    be clustered around the mean of the data, a consequence of the
    exponential model. The same would apply for the Gaussian model, so yes
    you do calculate the mean of the data, which was my main point. As long
    as the data do not contain outliers it is irrelevant how you
    simulate. Replace the exponential model by the double exponential
    model or Laplace model with mu the mean. Again choose a prior for the
    mean. Do the analysis again. What happens now? If your data are to two
    decimal places, multiply by 100 and use the Poisson model with
    parameter mu and divide by 100 at the end. This will
    presumably work as well.

    If you have a good estimate of the variability of the data then you
    can clearly use it. You only mentioned units so I asked. But often you
    don’t have good estimates and the variability must be quantified from
    the data. Interlaboratory tests are one example and they abound with
    outliers. The data example I gave shows that the standard deviation is
    not to be trusted and this was one reason for altering the German
    DIN standard for evaluating such tests.

    If you do engineering or physics your models are typically
    regularized, solutions of differential equations, so you don’t have
    the problem of regularization. If you don’t have a good physical
    model (and I don’t), and if you intend to use likelihood (although I
    don’t), then you have to regularize. If the normal and double
    exponential models both fit the copper data, you choose the normal
    distribution unless you have very good reason for the double
    exponential. The normal distribution is the blandest in this
    situation; any other model will import efficiency which is not
    justified. That is all that I am saying. Why do you object?

    From In science … to …most powerful computable test. Nothing to
    do with me. Bohmian deterministic mechanics or Copenhagen randomness,
    nothing to do with me.

    Keith O’Rourke
    It was the other way around. He started off with maximum likelihood
    and hence the Gaussian distribution. He may of course have retracted
    the retraction but I have not heard of it if he did.

    • Laurie. As far as I can tell you are interested in how to tune random number generators so that they produce data which are similar to some particular data set. You then test your data set against the RNG to see if it’s reasonable to think that the data came from such a distribution. You then downweight parameter values which make the data fail your test. Except, instead of doing so continuously, you do so in a binary manner, 0,1

      BUT, THIS IS EXACTLY WHAT BAYESIAN INFERENCE DOES. In particular this is exactly what ABC does (it generates a set of fake data, computes distance(summarize(fake_data(params)), summarize(real_data)) and accepts the params when this distance is less than some epsilon, rejects otherwise).

      Congratulations you have reinvented Bayesian inference with flat priors.
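
      For reference, the recipe in the parenthesis above as a bare-bones rejection sketch; the exponential model, the mean as summary and the value of epsilon are placeholders of mine, not anything from the original example.

      import numpy as np

      rng = np.random.default_rng(3)
      real_data = rng.exponential(1.6, size=100)           # stand-in for the observed data

      def summarize(x):
          return x.mean()

      def fake_data(mu, n=100):
          return rng.exponential(mu, size=n)

      epsilon, accepted = 0.05, []
      for _ in range(20000):
          mu = rng.uniform(0.0, 5.0)                       # "flat" prior over a wide range
          if abs(summarize(fake_data(mu)) - summarize(real_data)) < epsilon:
              accepted.append(mu)

      print(len(accepted), np.mean(accepted))              # crude ABC posterior sample for mu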

      The only problem is if you think that having tuned an RNG, the world will continue to behave like that RNG. That’s where the actual science comes in and is what I mean by all the stuff about positions of subatomic particles. Having found an RNG that can generate datasets similar to the actual set, you have pretty much no scientific information built into your model. It’s like finding a Godel number that encodes a sequence that just happens to start with the exact same sequence as your data set, and then continues with slight perturbations of the same dataset. It doesn’t tell you anything about the dissolution rate of copper in a pipe under today’s pH conditions. Unless you also use scientific knowledge to make your set of RNGs have parameters that correspond to facts about the world, you are basically just defining compression algorithms. If you do connect that science to the RNG, then you are doing Bayesian statistics like “Laplace” advocates.

      • Another way to put it, is that unless you are incorporating scientific knowledge into your model construction, then as I said earlier:

        “It’s never Platonic infinite sequences of bits coming from a particular abstract mathematical sequence whose Godel number is some particular value and which satisfies the Kolmogorov axioms for a particular distribution function to within a given epsilon when calculated by the non-constructive uniformly most powerful computable test.”

        is exactly what is wrong with your approach, namely that you’re using some computable test to accept or reject the idea that some particular RNG might have produced your observed data.

        There are billions of RNGs that produce stuff that looks like your data. The vast majority of them are scientifically meaningless. They may do a good job of compressing your data, so they may be useful for other purposes, but to find out about the world, you had better encode some scientific knowledge into the model.

        The same idea is true when it comes to fitting regressions using polynomials. There is an N degree polynomial that goes through your N data points exactly. So what? What do the parameters tell you about the world?

  20. Daniel Lakeland

    In your example you model the data with, as far as I can tell, the
    exponential distribution with parameter mu, which is the mean of the
    corresponding exponential distribution. What are your scientific
    grounds for this choice? Tell me them in great detail. For a given mu I would generate
    999 samples of size 10 and calculate their means. I would order the
    means and, given alpha=0.95, I would check if the actual mean is either
    one of the 25 smallest or one of the 25 largest means. If not, the
    value of mu used for the simulations would be defined as adequate. But
    of course I have no need to do this. I know the distribution of the
    sum of exponential random variables so I could simply use the quantiles
    of this to specify the set of adequate values. Moreover it does not have
    to be a 0-1 decision. For each value of mu calculate the appropriate
    percentile of the mean of the original data. I end up with a curve
    similar to yours but with a completely different
    interpretation. Furthermore, having simulated data under the model, I am
    in a position to tell you that the model produces data nothing like
    the actual data apart from agreement in the means. You can of course
    reply that this doesn’t interest you in the slightest, all you
    want is the mean. This seems to me to be a very strange attitude for a
    scientist. You mention towards the end that you could use another
    model, perhaps the Gaussian, to obtain a smaller interval (your choice
    of interval is also a 0-1 situation, so it’s allowed for you but not
    for me). You could use the comb distribution to get an even smaller
    one. However you offer no further justification for the normal model
    apart from a shorter interval. This is at odds with your claim of
    incorporating your scientific knowledge. In this situation, which is a
    location problem, I would regularize and use the normal model on the
    grounds that it is minimum Fisher information, certainly not on the
    grounds of hoping for a shorter interval.
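
    The exact version of that check is easy to write down, since for a sample of size n from an exponential with mean mu the sample mean has a Gamma(n, mu/n) distribution; the data below are simulated stand-ins, not the real example.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    data = rng.exponential(1.6, size=10)                 # stand-in for the n = 10 sample
    n, mbar = data.size, data.mean()

    mu_grid = np.linspace(0.5, 5.0, 451)
    pct = stats.gamma.cdf(mbar, a=n, scale=mu_grid / n)  # percentile of the observed mean under each mu
    adequate = (pct > 0.025) & (pct < 0.975)             # the 0-1 version at alpha = 0.95
    print(mu_grid[adequate].min(), mu_grid[adequate].max())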

    The copper data come from analytical chemistry for
    water. Interlaboratory tests have to be statistically evaluated and,
    according to law, by a set procedure. For a long time there had been
    dissatisfaction with the old German DIN Norm and so a new one was
    proposed and is now in use.
    @misc{DIN38402,
    Howpublished = {Deutsches Institut f\”ur Normierung},
    Title = {{DIN} 38402-45:2003-09 {G}erman standard methods for the examination of water, waste water and sludge – {G}eneral information (group {A}) – {P}art 45: {I}nterlaboratory comparisons for proficiency testing of laboratories ({A} 45).},
    Year = {2003}}

    I was involved with this together with a group of very experienced
    analytical chemists. Tell me what is wrong with it and how a Bayesian
    version would be much superior.

    There is a data set on the weight of newly born children
    @Book{HOSLEM89,
    author = {Hosmer,~D.~W. and Lemeshow,~S.},
    title = {Applied Logistic Regression},
    publisher = {Wiley},
    year = {1989},
    address = {New York},
    }

    For the present purpose there is one dependent variable the weight of
    the child and 9 covariates ranging from the weight and age of the
    mother, smoking, hypertension, to race. The way from hypertension to
    weight of child is long and complicated if there is indeed an
    effect. In other words in this situation there are no well founded
    scientific grounds for assuming a dependency other than a feeling that
    it could be important. I assume you would refuse to analyse such a
    data set. It turns out that hypertension does have an effect. This
    would have to be checked using future data but then we have learnt
    something in spite of no substantial scientific grounds to start
    with. On the other hand the number of prenatal visits to a doctor
    conveys no more information than random noise.

    I have the strong feeling that you intensely dislike what I am
    doing.

    There are billions of RNGs that produce stuff that looks like your
    data

    Yes there are. The problem is ill-posed and requires
    regularization. If I regularize over the set of RNGs which produce
    finite variances then I end up with the normal models. If I regularize
    over a Kolmogorov neighbourhood of the data there are not billions
    of models but uncountably many. I regularize using Fisher
    information and obtain the Huber distributions. I know what I am doing
    and why.

    There is an N degree polynomial that goes through your N data points
    exactly. So what? What do the parameters tell you about the world?

    The problem of fitting a curve to a set of data points is ill-posed
    and requires regularization. You clearly dislike this. Take an example from
    thin film physics. The data are photon counts dependent on the angle
    of diffraction. There are grounds for assuming a Poisson distribution
    for a given angle but there is also electronic noise which has a
    definite effect for small counts. Of interest is the number, position
    and power of the peaks. Instead of a model based on the Poisson
    distribution we take the model

    Y(x)=f(x)+sqrt(f(x))Z(x)

    where Z is standard Gaussian noise. Given any function g(x) you look
    at all intervals and calculate the absolute normalized sum of the
    residuals (y(x)-g(x))/sqrt(y(x)) over each interval. You then bound the maximum of
    this, which is going to be about sqrt(2 log(n)). The set of functions
    g for which this holds is the approximation region. It will include
    any function which interpolates the data, as these residuals are
    zero. As I said, the problem is ill-posed. So we regularize by
    minimizing the total variation of functions in the approximation
    region. This will give us the peaks, their location and their size. It
    is a linear programming problem. We can go further and minimize the
    total variation of the first derivative subject to the monotonicity
    constraints and the adequacy constraints. This will give us the
    intervals of convexity and concavity. Given all this we may decide we
    want a smooth function. So we minimize the total variation of the fourth
    derivative. More than the fourth is numerically not possible. (A small
    numerical sketch of the multiresolution check is given after the
    references below.) Here are some references.
    @article{DAGAMEMEMI08,
    Author = {Davies,~P.~L. and Gather,~U. and Meise,~M. and Mergel,~D. and Mildenberger,~T.},
    Journal = {Annals of Applied Statistics},
    Number = {3},
    Pages = {861-886},
    Title = {Residual Based Localization and Quantification of Peaks in X-Ray Diffractograms.},
    Volume = {2},
    Year = {2008}}

    @article{DAVKOV01,
    Author = {Davies,~P.~L. and Kovac,~A.},
    Journal = {Annals of Statistics},
    Number = {1},
    Pages = {1-65},
    Title = {Local extremes, runs, strings and multiresolution (with discussion)},
    Volume = {29},
    Year = {2001}}
    @article{DAVKOVMEI09,
    Author = {Davies,~P.~L. and Kovac, ~A. and Meise, ~M.},
    Journal = {Annals of Statistics},
    Number = {5B},
    Pages = {2597-2625},
    Title = {Nonparametric regression, confidence regions and regularization},
    Volume = {37},
    Year = {2009}}
    Tell me why this is all wrong and how you could do much better with a
    flat prior over the set of functions over [0,1] and ABC. I’ll send you the
    data if you tell me how.
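
    For what it is worth, a literal (and inefficient, O(n^2)) reading of the multiresolution check described above; the simulated one-peak data and the threshold sqrt(2 log n) are rough stand-ins, since the papers calibrate the cut-off more carefully.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 200
    x = np.linspace(0, 1, n)
    f = 50 + 400 * np.exp(-0.5 * ((x - 0.5) / 0.02) ** 2)      # one peak on a flat background
    y = f + np.sqrt(f) * rng.normal(size=n)                    # Y(x) = f(x) + sqrt(f(x)) Z(x)

    def multiresolution(y, g):
        r = (y - g) / np.sqrt(y)                               # normalized residuals
        c = np.concatenate([[0.0], np.cumsum(r)])
        stat = max(abs(c[j] - c[i]) / np.sqrt(j - i)
                   for i in range(len(y)) for j in range(i + 1, len(y) + 1))
        return stat, stat <= np.sqrt(2 * np.log(len(y)))

    print(multiresolution(y, f))                      # the generating f should usually pass
    print(multiresolution(y, np.full(n, f.mean())))   # a flat g that ignores the peak fails badly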

  21. Dear Laurie,

    Summary of where I think we are at:

    – I agree that a notion of ‘adequacy’ of one dataset as an approximation to another, purely in ‘data space’, is frequently important and not often emphasised in naive Bayesian accounts

    – However, I think that this particular notion is, in part, addressed in a Bayesian context by more explicit procedures – e.g. hierarchical Bayes, ABC etc (I can name a couple of others, if interested)

    – Perhaps you may wish to explore the connection between these and your approach? One other thought – your procedure of seeing if experts can distinguish datasets may be similar to de Finetti’s exchangeability?

    – I agree the other core issue is that these are ill-posed inverse problems and require regularization

    – However I think that most (all?) good regularization procedures can be interpreted in terms of priors over an expanded model class or model space

    – Being explicit and careful about distinguishing your meta-model/model space, model(s), parameters of interest, nuisance parameters etc and their relations seems to deal with most issues you’ve raised about ‘truth’ (to me, anyway)

    – I think yours is an interesting perspective; I am interested in incorporating/relating it to other ideas that I find appealing rather than throwing everything out. This is a matter of taste, probably

  22. “You can of course reply that this doesn’t interest you in the slightest, all you want is the mean. This seems to me to be a very strange attitude for a scientist”

    Not at all! If someone has paid me to determine whether pallets of orange juice have the appropriate total quantity of orange juice in them, then I build a model to determine if this is true. I use scientific and logical information to build this model. Namely:

    The quantity of OJ in a bottle, measured in liters, is a number definitely between 0 and 2.25 L inclusive.

    The quantity in a bottle is a positive number with an average.

    The maximum entropy model for a positive number with an average is exponential. This incorporates the least information possible. You would call this regularization. In my blog post I was doing this simply because I wanted to point out that you need not match the frequency histogram of the data to the way in which you choose to “filter” the prior down to the posterior.

    The posterior distribution successfully predicts the approximate value of the average, putting the true value in the high probability region.

    You are absolutely right that if I want a smaller interval I need to justify the choice of alternative more informed model. I can not choose a model simply because it gives a smaller interval! That was the point of starting with maximum entropy! Hooray, we agree!

    Everything you are saying above seems to fall into the category:

    1) Reinventing Bayesian Inference
    2) Bolting on the additional skill of finding distributions from among specific mathematically motivated classes whose frequency distribution matches an observed data set well.

    you may be surprised to find that I have no objections to this, provided that you can also use your class of distributions to connect your parameters to physical facts about the world.

    In fact, the 2 steps above are basically exactly what I and a few others here advocate, provided you perhaps make the third caveat

    3) When given a choice between, on the one hand, matching the distribution frequency to the observed frequency of the data with strong mathematical principles used for regularization, and, on the other, using physical knowledge and logical principles to create models that connect data directly to unobserved facts about the world, choose the latter in all cases. The frequency histogram doesn’t matter; what matters is matching the value of the parameter to the physical fact! The frequency histogram only matters when the frequency histogram is the physical fact of interest!

    Also, I would never argue in favor of “flat priors over…” I always argue in favor of using as much of the information that you have which you can be sure of. The OJ in the bottle is between 0 and 2.25 I am confident… The function starts at (0,0), goes through (1,1), is differentiable, all values in this interval are between 0 and 1, and is everywhere concave downwards… etc etc

    When the frequency histogram of highly repeatable experiments is a physical fact of interest I have no doubt that you will do well and we will agree completely!

    • I am being too glib, I apologize. Your procedures for finding distributions that “look like” the data may be excellent for designing Bayesian inference procedures under certain circumstances. In particular, when the only information we have about what data points are likely to come out of our measurement apparatus is the historical behavior of that apparatus, then using some good regularized means to choose a distribution which adequately summarizes this knowledge is a great thing.

      But, by itself, it is a piece of the puzzle. And insisting (as some do, possibly not yourself) that it is a lynchpin piece belies the reality, which is that we care about some physical fact today, not about the characteristics of the random number generator that approximates the historical facts in the past.

      Put another way, you have discussed coming up with adequate models of the data, stochastic replacements for previous measurements if you will. Thinking that this is the goal of statistics is like thinking that the goal of microscopy is to take pictures. Instead, the goal of microscopy is to discover what very small things are doing!

  23. “For the present purpose there is one dependent variable the weight of
    the child and 9 covariates ranging from the weight and age of the
    mother, smoking, hypertension, to race. The way from hypertension to
    weight of child is long and complicated if there is indeed an
    effect. In other words in this situation there are no well founded
    scientific grounds for assuming a dependency other than a feeling that
    it could be important. I assume you would refuse to analyse such a
    data set.”

    Hardly ever would I refuse to analyze any data set, only maybe if I think it’d been created fraudulently, and/or so sloppily as to be meaningless, or if the analysis was being pushed heavily in certain directions for political reasons etc.

    I think there would be several steps here:

    1) look at the data to see what I have
    2) Check for obvious data errors and discrepancies (measurement units, transposed digits etc)
    3) Discuss with doctors some hypothetical ways in which the variables could cause the outcomes:

    a) smoking can cause hypertension
    b) smoking can cause damage to fetus
    c) in absence of smoking hypertension is probably indicative of genetic and lifestyle issues (exercise etc)
    d) exercise can cause health improvements for mother and fetus
    e) age of a person together with lifestyle is causal for overweight in many people
    f) age of mother is causal for genetic issues in eggs which can cause fetal health issues
    g) socioeconomic status is causal for having time available for exercise and for affording good nutrition and living conditions

    4) because of physiology weight is right skewed. it is easier to put on extra weight and remain alive than to be severely underweight and remain alive.

    etc etc .

    from that I glean things such as the expected direction in which certain effects act and how they may inter-act. When using a covariate which is an indicator of some underlying observed value I may actually model that relationship, for example income may affect nutrition and exercise which are unobserved. I may look to pull in other observational data on how income and nutrition or exercise are related.

    Then, I build some kind of model starting probably with the idea that there is an overall average behavior, and that perturbations from the average produce effects according to the directions hypothesized above, and the interactions hypothesized above.

    I also write down priors for the parameters in my model, my priors are informed by the expected direction but typically would not rule out the alternative directions unless logically impossible (you can’t have negative weight etc). They’re informed by order of magnitude estimates as well.

    Then I run my model through Bayesian machinery, and may find out a variety of things, or may not have enough data to find out anything considering how my model works. I may choose to simplify my model if it seems appropriate. I may look for additional datasets to augment my analysis. I may try to justify collecting more data. I may do many things. I may not be so certain that the effects you mention about hypertension or doctors visits are so clear cut. I may be even more certain than you are.

  24. hjk, Daniel Lakeland

    Let us take ABC which you both mention. As far as I understand it
    involves comparing the original sample with a sample generated from
    the prior. This does indeed have a certain similarity but it depends
    how you do the comparison. It seems that a bound on the Euclidean
    distance between the two samples is a standard method of comparing
    samples. This runs counter to my philosophy and the way I think about
    statistics. For me the basic concept is that of the space of
    probability measures. I think in terms of probability measures, weak
    metrics on the space of probability measures (think
    Vapnik-Cervonenkis), functionals, continuous and locally differentiable functionals
    etc. Moreover some form of Popperian falsifiability must hold.

    hjk

    Experts and simulated data sets. This is an old idea going back at
    least to Neyman and Scott

    @article{NEYSCOSHA53,
    Author = {Neyman,~J. and Scott,~E.~L. and Shane,~C.~D.},
    Journal = {Astrophysical Journal},
    Pages = {92-133},
    Title = {On the spatial distribution of galaxies a specific model},
    Volume = {117},
    Year = {1953}}

    @article{NEYSCOSHA54,
    Author = {Neyman,~J. and Scott,~E.~L. and Shane,~C.~D.},
    Journal = {Astrophysical Journal Supplement},
    Pages = {269-294},
    Title = {The index of clumpiness of the distribution of images of galaxies},
    Volume = {8},
    Year = {1954}}
    See also David Donoho
    @article{DON88,
    Author = {Donoho, D. L.},
    Journal = {Annals of Statistics},
    Optpages = {1390-1420},
    Optvolume = {16},
    Title = {One-sided inferences about functionals of a density},
    Year = {1988}}

    and Andreas Buja and co-authors
    @Article{BUJETAL09,
    author = {Buja,~A. and Cook,~D. and Hofmann,~H. and Lawrence,~M. and Lee,~E-K. and Swayne,~D.F. and Wickham,H.},
    title = {Statistical inference for exploratory data analysis and model diagnostics},
    journal = {Philosophical Transactions of the Royal Society A},
    year = {2009},
    volume = {367},
    pages = {4361–4383},
    }

    I am not sure what de Finetti has to do with this. Be more explicit.

    I can’t offer a theorem to show that all forms of regularization can
    be accomplished by taking a Bayesian prior but it may well be the
    case. Take the location-scale problem. Put a prior on the set of all
    models with finite Fisher information and then apply Bayes. Not
    easy. Or for the curve fitting. Put a prior on all functions with a
    finite number of local extremes, a finite number of intervals of
    convexity-concavity and a piecewise constant fourth derivative. Again
    not easy and I fail to see what will be gained even if you managed it,
    especially because the functions you most want, smallest number of peaks
    etc will be the ones with the smallest likelihood.

    Daniel Lakeland

    I still don’t understand your orange juice example. It seems that the
    bottles have a maximum volume of say 2.25 litres. You model the data by
    the exponential distribution with mean mu. Your posterior for mu includes
    the value 2 with a high density. If you generate 10 i.i.d. exponential
    random variables with mean two then on average about 32% of them
    (exp(-2.25/2) ≈ 0.32) will exceed 2.25, and this doesn’t worry you?

    The model I use for the data doesn’t even approximately fit the
    frequency distribution of the data… but the inference I get is
    accurate.

    You also write

    I use scientific and logical information to build this model. Namely:
    The quantity of OJ measured in Liters, in a bottle is a number
    definitely between 0 and 2.25 L inclusive. The quantity in a bottle is
    a positive number with an average.

    And yet you ignore the upper bound of 2.25 in your model

    The justification is: `The maximum entropy model for a positive number
    with an average is exponential. This incorporates the least
    information possible. You would call this regularization.’

    Yes it is regularization, not that I think the exponential
    distribution can be so called, but that is another matter. But if you
    take the standard incorrect expression for the entropy of a continuous
    random variable then you should maximize it under the constraint that
    the support is [0,2.25].
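
    For the record, that constrained problem has a standard closed form (textbook material, not something from the thread): with b = 2.25 and mean constraint m,

    \max_p \; -\int_0^b p(x)\log p(x)\,dx
    \quad \text{subject to} \quad \int_0^b p(x)\,dx = 1, \qquad \int_0^b x\,p(x)\,dx = m,

    is solved by the truncated exponential p(x) = \lambda e^{-\lambda x}/(1 - e^{-\lambda b}) on [0,b], with \lambda chosen so that 1/\lambda - b/(e^{\lambda b}-1) = m (\lambda > 0 for m < b/2, \lambda = 0 giving the uniform at m = b/2, and \lambda < 0 for m > b/2).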

    For the record I would regularize by taking the minimum Fisher information
    models with support [0,2.25] indexed by their variance as parameter.

    Here’s a good/bad book for you

    Science from Fisher Information: A Unification by Roy Frieden

    Reinventing Bayesian Inference

    You are having great difficulty in understanding me. As I said above I
    think in terms of spaces of probability measures, weak metrics,
    differentiable functionals etc. Likelihood plays no role.

    Bolting on the additional skill of finding distributions from among
    specific mathematically motivated classes whose frequency distribution
    matches an observed data set well.

    Well sometimes I do that but sometimes I do the exact opposite: this
    distribution does not fit the data a la Popper.

    The OJ in the bottle is between 0 and 2.25 I am confident

    3) etc
    OK, then give me a convincing explanation of (i) why my approach to
    X-ray diffraction is wrong and (ii) how you would do it. I really mean
    this. You have criticized my approach very strongly. You asked about
    a polynomial of degree n through n points. I gave you the references to
    three papers. Now do tell me what is wrong with my approach and how your
    Bayesian analysis would improve on it. Tell me also what small things
    I am missing. Come on, I do want to know.

    Thinking that this is the goal of statistics is like thinking that
    the goal of microscopy is to take pictures. Instead, the goal of
    microscopy is to discover what very small things are doing!

    using physical knowledge and logical principles to create models that
    connect data directly to unobserved facts about the world

    The copper data comes from analytical chemistry. All you want is a good
    interval for the range of possible values of the true water content
    (note, the true water content, not the true parameter). Determining the
    quantity of copper in a water sample is taking a picture and I should
    be really interested in the whole process from the effect of an atom
    of copper right through to the final reading with noise all the
    way. You seriously think this would improve the analysis? For the
    record the mean is no good.

    using physical knowledge and logical principles to create models that
    connect data directly to unobserved facts about the world

    Now tell me also what physical knowledge and principles you would invoke
    for the effect of hypertension on the weight of a newly born child. Maybe you don’t think this is important. On the other hand, if treatment of hypertension is possible and it does reduce the possibility of a low birth weight, then we should use this information even if we don’t understand why the effect occurs.

    Do you model the simple coin toss with the binomial distribution or Newtonian mechanics? Presumably the latter.
    @Article{STGRSTPEKA08,
    author = {Strza{\l}ko,~J. and Grabski,~J. and Stefa\'{n}ski,~A. and Perlikowski,~P. and Kapitaniak,~T.},
    title = {Dynamics of coin tossing is predictable},
    journal = {Physics Reports},
    year = {2008},
    volume = {469},
    OPTpages = {59-92}
    }

    • Put a prior on the set of all models with finite Fisher information and then apply Bayes. Not easy. Or for the curve fitting. Put a prior on all functions with a finite number of local extremes, a finite number of intervals of convexity-concavity and a piecewise constant fourth derivative. Again not easy and I fail to see what will be gained even if you managed it, especially because the functions you most want, smallest number of peaks etc will be the ones with the smallest likelihood.

      Actually these are solved problems whose solutions have been fruitfully applied in a number of settings. Dirichlet process mixture models provide priors over probability measures with proven consistency properties (f’r’instance). Gaussian process models provide priors over function spaces, and covariance functions are known that lead to posteriors with expectations that have piecewise constant derivatives. Usually we’re talking piecewise-constant third derivatives, as these correspond to regularized cubic splines (see equation 6.28 here), but I think that’s just because Grace Wahba is awesome and set the agenda there.

      It’s worth pointing out that these stochastic process priors don’t give rise to Radon-Nikodym derivatives that can be used as likelihood functions; nevertheless, posterior stochastic processes exist and often have very nice properties. Bayesian analysis at its most fundamental level isn’t about likelihoods — it’s about updating and getting the posterior.
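
      As a compact numerical illustration of “a prior over a function space”, here is a Gaussian process prior conditioned on a few noisy points; the squared-exponential kernel and the made-up observations are my choices for illustration (the spline correspondence mentioned above uses a different covariance).

      import numpy as np

      rng = np.random.default_rng(6)
      xs = np.linspace(0, 1, 101)

      def k(a, b, ell=0.15):
          # squared-exponential covariance function
          return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

      # three draws from the prior over functions
      prior_draws = rng.multivariate_normal(np.zeros(xs.size), k(xs, xs) + 1e-9 * np.eye(xs.size), size=3)

      # posterior mean given noisy observations y_i = f(x_i) + eps_i, eps_i ~ N(0, s2)
      xobs = np.array([0.1, 0.4, 0.5, 0.9]); yobs = np.array([0.0, 1.0, 0.8, -0.5]); s2 = 0.01
      post_mean = k(xs, xobs) @ np.linalg.solve(k(xobs, xobs) + s2 * np.eye(xobs.size), yobs)
      print(post_mean[::25])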

  25. Davies: the point of the orange juice example is that fitting a distribution to the data is not the goal, and the actual goal, determining the approximate value of the average OJ content, can be accomplished in many ways, some of which fit the data very badly in the sense of generating from the distribution and comparing to the data. The point was to intentionally use as little information as possible, to fit the data badly in the sense of generating from the distribution, and to still get good inference for the thing I want to know, the average OJ quantity. I mention at the end of the example that it is possible to get better inference by incorporating better information. For example you mention taking into account the upper limit. It is not necessary, but it could be helpful!

    Bayesian inference works like this according to those who believe in Cox’s theorem as foundation of bayesian inference:

    1) A distribution encodes a kind of “reasonableness” for a given value. Higher probability in a certain region of space = you should believe more in this being where the true value lies.

    2) A distribution is valid when the true value in the world of the unobserved thing is located in a place where there is relatively high density. This is only determinable after-the-fact (ie. after collecting even more data)

    3) A distribution is more helpful when it is more narrowly peaked while remaining a valid distribution (ie. while keeping the true value in the high probability region).

    Anything you can do to make these things happen is great, but it is in practice only possible when you incorporate lots of valid scientific information into your model. The validity of the information is always up for argument, just like the correctness of computer code, or the validity of the standard model of particles, or the validity of a medical treatment for actually curing an illness.

    In this formulation, as in your discussions, likelihood, in the sense of IID draws from a distribution, plays no role! Or rather, it is a special case.

    I start with a bag of parameter values that I think could be true (a prior) and then I use any valid information at all about the data to filter this bag (multiply by a function of data and parameters that encodes a true if approximate statement). Afterwards, the bag has more of some parameters and less of others. That is the essence.

    I admit, many bayesians stick to “generative iid likelihoods” in their modeling efforts. My OJ model did. But there is nothing in Cox’s theorem that requires it. In my dissertation I fit the parameters of an ODE model by assuming the difference in logarithm between observed total kinetic energy and predicted total kinetic energy through time was a non-stationary transformed gaussian process. The idea being that it should fit well when the energy was kinetic and need not fit well when the energy was mostly potential and only thermal noise remained so we are taking a logarithm of near zero. There were no iid draws, so no iid likelihood principle. I just knew that parameter values that fit the data well at a series of certain times when the data was meaningful were “more believable” and so I got very good inference on those parameters that made the ODE solution produce results that were more believable relative to the data.

    I have just finished a series of blog posts about “declarative vs generative” models: http://models.street-artists.org/tag/declarative-models/

    in this you will see that I use the success of ABC as an argument for the declarative style (in addition to the generative).

    you say “It seems that a bound on the Euclidean distance between the two samples is a standard method of comparing samples.”

    I would say that this is most likely due to the mathematical limitations of many practitioners, or to just convenience. It is certainly NOT a LOGICAL requirement for the method.

    I would describe my understanding of your approach as basically using different, clever, and very interesting alternative definitions of the abstract “distance” and “summarize” functions in the abstract declarative statement that defines ABC:

    distance(summarize(fake_data(parameters)),summarize(real_data)) < epsilon

    so, in this broader sense, my naive view of what you are talking about is designing particular nice classes of declarative statements of the form above for use in ABC :-)

    You start with an improper prior (a bag of parameter values that contains all possible values equally weighted) and you use a declarative statement of the form above, to filter down this bag to a much smaller set. That is the essence of Bayes to those of us who have drunk the full dose of kool-aid ;-)

    my only concern is that you also remember to make your classes of models have direct connection to science, not purely the whole realm of what is mathematically possible regularized by some mathematical tool

    I believe we are much closer in philosophy than you think, but that probably you have not encountered this full radical version of Bayes. Our blog commenter "Laplace" has been one of the more radical advocates here, also in his former pseudonym "Entsophy" prior to reorganizing his online presence.

  26. “I still don’t understand your orange juice example. It seems that the
    bottles have a maximum volume of say 2.25 litres. You model the data by
    the exponential distribution with mean mu. Your posterior for mu includes
    the value 2 with a high density. If you generate 10 i.i.d. exponential
    random variables with mean two then on average about 32% of them
    (exp(-2.25/2) ≈ 0.32) will exceed 2.25, and this doesn’t worry you?”

    That is the beauty and purpose of this exercise. To force people to confront this fact. The exponential does not fit the data by frequency, but it gives good inference in the posterior because it encodes true information, namely that mu > about 2.25 is unlikely and mu 2.25 because I will NEVER GENERATE EXPONENTIAL VARIATES. I will ONLY generate from the posterior for mu, and the posterior for mu is concentrated in the range 0.8 to 2.25 with a peak about 1.6 which is very close to the actual value.

    If, on the other hand, I was trying to fit the frequency distribution that generates the data, I would be very concerned if I came up with an exponential. But notice, there is NOTHING unknown about the data, there is only something unknown about the mu, an average over the 100 jugs.

    Put another way, in my stan program:

    wsamp ~ exponential(1/mu);

    can be read as: “when mu has its correct value, wsamp (the data) are all within the high probability portion of the exponential distribution. They need not fill it, they need not have its same frequency distribution, they just need to be in the high probability region.”

    Such a statement is only helpful if for correct values of mu it is true, and simultaneously for definitely wrong values of mu it fails to be true. That is the main criterion by which you should choose such statements.
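
    A grid-based sketch of that reading of the statement; the simulated “jug” data and the flat prior over (0, 2.25] are stand-ins of mine, not the original blog-post example. The data below are bounded and nothing like an exponential in shape, yet the posterior for mu still piles up near the sample mean.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    wsamp = rng.uniform(1.4, 1.8, size=10)              # bounded data, nothing like an exponential
    mu_grid = np.linspace(0.05, 2.25, 441)              # prior support for mu

    log_lik = stats.expon.logpdf(wsamp[:, None], scale=mu_grid).sum(axis=0)
    post = np.exp(log_lik - log_lik.max()); post /= post.sum()   # flat prior on the grid, so the posterior is proportional to the likelihood

    print(mu_grid[post.argmax()], wsamp.mean())         # posterior mode sits near the sample mean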

    • ack, I apologize the blog eats things after a less than sign assuming they are HTML…

      the essence of what I was saying that got cut off was: I do not worry about the fact that the exponential will generate values greater than 2.25 and thus fit the data very badly because I WILL NEVER GENERATE EXPONENTIAL VARIATES … etc

    • I’d add that fitting a distribution is feasible in a Bayesian context, though as a special case. It’s a matter of adding a layer in the hierarchy that models the parameter values of the data distribution. Each sample in the HPD region of the data distribution parameters should generate one data-like instantiation of the data distribution.

  27. Why do I regularize with Fisher information? There is a reason. In the absence of concrete knowledge which would justify some other model, it is, in the location-scale case, a class of models which do not introduce an increase in `precision’ which cannot be justified. If your model is not minimum Fisher it will per se give an unjustified increase in precision. In the photon diffraction case the physicist is interested in peaks. That is why I regularize with total variation. Every time I regularize I have a reason for the regularization I choose.
    On your orange juice example. If you model the data as double exponential with mean mu instead of exponential with mean mu your posterior will be concentrated around the median and not necessarily the mean of the data. Were you just lucky that the maximum likelihood estimator for the exponential distribution is the mean or were you aware of this when you chose it?

    I am now off to Warwick for the workshop. We now may be the only ones reading this. We can continue via email if you wish.

    • In my Orange Juice example you’re right that the choice of data model DOES matter, it just doesn’t matter in the way people often think. (for example, the data need not fill the “typical set” of the data model, as they don’t in my case)

      My criterion for choosing the exponential is that I knew that I had positive data with a mean value that I was interested in. I chose the maximum entropy distribution for positive data with a mean, and things worked well. I didn’t incorporate the information that the data couldn’t be greater than 2.25, except that I did incorporate the information that the average couldn’t be bigger than 2.25. Incorporating that appropriately might have improved things.

      More generally, when I have relatively little knowledge of the shape or form of my distribution, it makes sense to choose a flexible family with interpretable parameters which I can then give additional model structure to. This is the Hierarchical model that Gelman advocates frequently. Hierarchical models are hugely flexible. For example, I’m going to assert without proof that every distribution with a density can be approximated by a hierarchical mixture of uniform(a,b) with the appropriate distribution p(a,b) at the second level of hierarchy. The intuition is that the density is Lebesgue integrable so if I can pile rectangles on top of each other in the right way I can build up the area under the density. I blogged an example of doing this for symmetric distributions a while back. I started with the normal, Corey generalized to all symmetric distributions in the comments.

      http://models.street-artists.org/2013/08/15/what-is-a-gaussian/
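
      My own numerical check of one such construction (the chi-with-3-degrees-of-freedom mixing law below is my working, not quoted from the post): a standard normal arises as a mixture of uniform(-a, a) distributions when the half-width a is itself drawn from a chi(3) distribution.

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(8)
      n = 200_000
      a = np.sqrt(rng.chisquare(3, n))   # mixing distribution for the half-width: chi with 3 df
      x = rng.uniform(-a, a)             # each draw is Uniform(-a, a) given its own a

      print(stats.kstest(x, "norm"))     # should typically not reject standard normality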

      In practice, we’d more often use a hierarchical model with non-uniform densities that express things like “the data tends to be lumped around a certain value and then have maybe a bit of skew caused by a certain kind of error, and then also have a second smaller mode caused by a different cluster of less usual issues”

      This can be modeled by something like (pseudocode)

      (data - skew_error) ~ mixture(pct1 * normal(mainmode,mainscale), pct2 * normal(secondmode,secondscale))

      or in actual Stan code (plus or minus my own coding errors):

      increment_log_prob(log_sum_exp(normal_log((data - skew_error),mainmode,scale1)+log(pct1),normal_log((data-skew_error),secondmode,scale2) + log(pct2)));

      with appropriate information put into the priors on skew_error, mainmode, secondmode, pct1, pct2, etc to specify what we know about the relative percentage of the mixtures and the location of the two lumps, and the width of the two lumps.

      or perhaps we’d use some kind of Dirichlet process as Corey mentioned above. We’re not required to fit the data in a frequency sense, only to make it so that the data isn’t too weird compared to the data distribution when the parameters have their appropriate values. It’s a focus on putting a distribution on the PARAMETERS that distinguishes the Bayesian approach.

      The fact that perhaps the lumps are actually better approximated by say t distributions, or by triangular distributions with finite tails, or by gammas or some other family is nowhere near as important as the bigger picture, that the data tend to fall into one main clump and one second clump and have a small bit of skew etc.

      From your question:

      The shifted scaled Laplace distribution is maximum entropy for data with a certain mean absolute deviation from a location. I would expect it to work well when I am trying to infer information about the scale of the mean absolute deviation, or the location about which that MAD is calculated. The mean and the median are the same for symmetric data, but are different for asymmetric data. That information, whether data arrive in a symmetric or asymmetric way, is something you should probably take into account when choosing your model, as mentioned above.

      So, am I allowed to choose any model? Well, any model that is justified by my knowledge, and doesn’t assume too much information. The maximum entropy formulation helps with that, and I assume the Fisher information does as well. The maximum entropy formulation has different justifications, and I think they make good sense in a Bayesian context, but I confess to not having compared the approaches in detail.

      In practice however, we’re often in a position as Bayesian modelers where we need to choose some distribution that is adequate to filter the parameter down to a posterior distribution. The importance of the OJ example is that it fixes in the mind that the criterion for a good “filter” distribution is NOT that the data fills the typical set of the distribution and does so in a way that closely approximates the data spit out by an RNG with that distribution.

      Instead, the criterion is that when the parameter has its true value, the data fall in a high probability region of the data distribution, so that that parameter gets a relatively large posterior probability. This means that if you expect things like asymmetry, multi-modedness and soforth, it’s worthwhile to incorporate that into the data distribution so you’re not fooled into downweighting true parameters.

      Inventing extra parameters to describe real uncertainty is always appropriate as well and flows naturally in the Hierarchical context.

      • > choose some distribution that is adequate to filter the parameter down to a posterior distribution.
        I would agree but far too much here for me to read carefully :-(

        But I think it clarifies a disagreement I had with Martha on this blog regarding meta-analysis – from my recollection she preferred that study medians (and inter-quartile or other range) from each study be used while I strongly preferred the means and sds.

        Now, she is correct that the median and whatever range will reflect the shape of the data distribution within the study better, but what I want to do is approximate the addition of that study (because we can’t get the original data) to the posterior of the other studies, and the mean and sd will be much better at that. And that’s what I am primarily interested in – the posterior distribution given all the studies, not the shapes of data distributions within the individual studies.

        So thanks for all the comments!

        • I’m glad I wasn’t just wasting everyone’s time. I think this stuff is really important, and I thank Laurie for engaging us here in a polite way to have this conversation once again, perhaps in more detail than before.

          You’re exactly right, we as Bayesians only ever care that the posterior for the *parameters* matches what it ought to be given the input model knowledge, and the data. The shape of the data distribution is only relevant in so far as it helps us discriminate between different parameter values given our scientific knowledge.

          Whether you get the individual data points, or just some kind of summary statistic, the main information that is relevant to you is some way to filter your parameter space. If you are doing inference on average values, and someone gives you median values, you will have to take an extra step, and basically model the average as unobserved conditional on an observed median. If on the other hand, you observe the average, you can avoid all that extra uncertainty.

          If you were trying to do inference on parameters that represent medians, your preference would probably reverse.

        • > do inference on parameters that represent medians, your preference would probably reverse.
          Actually it depends on what you assume as the data generating model for the raw data from which you only got the chosen summary.

          The likelihood is the probability of what you observed, which here is the chosen summary and this is seldom available in closed form. That is what my Phd thesis was about and in cases where you can get something in closed form you can compare the concentration of the likelihood for the median parameter based on observing the mean to likelihood for the median parameter based on observing the median.

          Fortunately, ABC given current computer speed has made this all trivial as it gives the posterior for the reported summary (to a chosen level of accuracy) and you can divide out the prior (remove it) to get a likelihood for whatever summary for whatever data generating model (non-identifiability of the chosen parameter will cause problems). So my thesis is now obsolete :-)

  28. In response to Anonymous’ request, for what it’s worth I sent Laurie a long email emphasising further points of agreement. I’ll paste most of it below (with some edits), but if Gelman or others would prefer I’m happy to take it back offline…


    Dear Laurie,

    After going back over our exchanges and looking again at your book (which I quite enjoy) I have a refined sense of where we agree. I will try not to focus on the points on which we may disagree, as I do not think they are as important for the moment (ironic, at first blush, given the Popperian theme of what follows).

    The following is quite long, apologies if you do not have the time.

    I think our agreement centres on a limitation of the traditional Bayesian approach which we both perceive, albeit perhaps differently.

    First, I note that I am a fan of emphasising prior and posterior predictive distributions as the most important objects in the Bayesian or Bayesian-esque approaches [see Corey’s comment above]. You may wish to interpret these each as a family of stochastic models under consideration (where each member within a family is ‘indexed’ by ‘parameters’ if you wish) a) before seeing the data and b) after retaining those models (parameters) which are an ‘adequate fit’ to the data in a sense to be discussed. Predictions are generated by a) choosing a ‘parameter’ (in some way e.g. according to a prior in the standard Bayesian account) b) generating a number of data realisations for a given parameter. These simulated realisations are then compared to the actual data.

    My view is that the usual Bayesian conditioning procedure is not the only nor best way of updating the prior predictive distribution into an ‘adequate’ posterior predictive distribution, even within a fixed model class (which was perhaps chosen by some ‘meta regularization’ procedure over model classes).

    In particular, when I raised the possibility of a correspondence between Approximate Bayesian Computation (ABC) and your approach you raised the point ‘Moreover some form of Popperian falsifiability must hold’ which I now think is a valid objection. Allow me to ignore measure-theoretic and/or other technical difficulties (though important) for the moment to focus on the conceptual issues, and hence I will speak loosely. I do think there are possible issues of ‘robustness’ or stability of the Bayesian procedure, as well as a number of potential conditioning paradoxes which may undermine the usual story somewhat. I think resolving the conceptual issues also motivates the resolution of the technical issues, but I won’t focus on this.

    So, back to Popper. I have always been fond of the idea that ‘theories that predict everything, predict nothing’. Of course this led Popper to further propose that we must take some ‘risk’ in our predictions and thus allow us to falsify our theories; that is, we must subject our theories to some sort of ‘severe test’. Probabilistic theories, however, are often capable of predicting an extremely large range of possibilities with each having at least some non-zero – though possibly small – probability (speaking loosely in a technical sense). Thus it has been argued by those taking a Bayesian approach that it is best to not think of our theories (parameters within a model) as falsified, rather as being assigned a probability which is continuously updated as new data arrives. On the other hand, Popper might respond by asking whether we should really allow a particular theory to survive when it predicts observations with at best a small probability. To me, this is a central tension in attempts to relate Popperian and Bayesian approaches.

    I believe this can be resolved by first seeing Bayesian conditioning as a (possibly weak) falsification procedure – it restricts attention to a subset of those parameters that predict the observed data with some non-zero probability (or probability density in the continuous case), thus ruling out those parameters that do not. The remaining parameters are ranked in a relative sense according to posterior probability (or likelihood in the uniform prior case). In order to respond to Popper’s objection and have some ‘risk’ of falsification, however, this usual conditioning procedure should be replaced by a type of ‘approximate conditioning’, relating both a set of model predictions and given observed data, as a primitive notion – perhaps it could be called ‘severe conditioning’ (or ‘conditioning at a given severity level’) to (unofficially) follow Popper/Mayo. At first I thought ABC might cover this, but it doesn’t quite.

    In particular, we can introduce Popperian risk by only allowing the ‘most likely’ (at some level) predictions to be compared with the given data. We could in principle then carry out the analogue of the usual Bayesian parameter updating during this comparison (with some technical caveats about how to define the conditioning in a ‘safe’ way), but only in the restricted space of ‘adequate’ data. We hence only retain parameters (in the form of a ‘posterior’ collection of parameters) which have a high probability of predicting the data, reminding me of part of Mayo’s severity criteria. Within this set the parameters are ranked in a relative sense according to posterior probability, or whatever (this relative ranking perhaps relating to the other part of Mayo’s criteria), but the set itself has a new ‘size’ depending on the ‘severity’ or ‘approximation’ level, or whatever you want to call it.
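
    Here is one rough way such ‘severe conditioning’ might look in code – a sketch under assumptions, not a worked-out algorithm: for each candidate parameter we estimate the probability of an ‘adequate’ match to the observed summary by simulation, and retain the parameter only if that probability clears a chosen severity level. The model, tolerance, and severity level are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented toy setting: normal data with unit sd, the sample mean as the summary.
y_obs = rng.normal(1.0, 1.0, size=20)
t_obs = y_obs.mean()
tol = 0.4        # what counts as an "adequate" match (assumed)
severity = 0.5   # required probability of an adequate match (assumed)

def match_probability(theta, n_rep=500):
    """Estimate Pr(|mean(y_rep) - t_obs| < tol) under theta by simulation."""
    t_rep = rng.normal(theta, 1.0, size=(n_rep, y_obs.size)).mean(axis=1)
    return np.mean(np.abs(t_rep - t_obs) < tol)

# "Severe conditioning": keep only parameters for which an adequate match is a
# high-probability event, not merely a possible one.  Letting the severity
# level approach zero retains any parameter that can match at all, i.e. the
# 'zero-severity' case discussed below.
candidates = rng.uniform(-4.0, 4.0, size=2000)   # draws from a wide prior
retained = np.array([th for th in candidates
                     if match_probability(th) >= severity])

print(f"retained {retained.size} of {candidates.size} candidate parameters")
if retained.size:
    print(f"retained range: {retained.min():.2f} to {retained.max():.2f}")
```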

    [Note also that the retained parameters may still then predict unusual future data when plugged back into the model, just with low probability, as we have only conditioned on ‘realised data’ and ‘typical predictions’ to choose parameters. More speculatively, I believe this may offer a response to philosophical issues such as the ‘grue’ problem and other issues with ‘unusual events’. We may retain models predicting unusual things but we don’t evaluate them against an unusual event unless it actually occurs. Unusual events will tend to rule out similar models, perhaps keeping those with ‘ad-hoc’ (non-general) fixes. This doesn’t seem like a bad thing though – unusual events may require ‘unusual’ (non-general) explanations – e.g. ‘the theory of relativity works and I forgot to connect the red wire’.]

    The standard Bayesian approach then corresponds to a case of ‘zero-severity conditioning’, since it retains all parameters which predict the data, no matter how improbably. This can be related to Gelman’s Bayesian ‘fix’ for this – adding ‘outside-the-model’ posterior predictive checks as an ‘absolute’ test of model adequacy to complement the ‘inside-the-model’ use of Bayesian conditioning, which only gives a relative ranking of parameter values within a given model. Thus ‘severe conditioning’ is a proposed ‘single-step’ approach (at a given severity level), in some sense equivalent to the two-step ‘Gelman-Bayes’. This could complement and help interpret (perhaps replace) the usual two-step procedure; I also think that it offers the possibility of avoiding misspecification/robustness issues more directly or ‘upfront’, intuitively by avoiding conditioning on events that are unlikely under most models. The extreme result may be an empty set, where no parameter values are adequate (and hence also why we need a ‘safe’ version of conditioning, not discussed here); it would then make sense to explore the misfit by relaxing the adequacy criterion, which in the opposite limit would give something like Gelman’s Bayesian exploratory data analysis.
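
    For contrast, here is the two-step procedure in a toy conjugate setting – step 1 is ordinary Bayesian conditioning, step 2 an ‘outside-the-model’ posterior predictive check; again the model and numbers are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy conjugate setting (normal data, known sd 1, normal prior on the mean);
# the model and numbers are invented for illustration.
y_obs = rng.normal(1.0, 1.0, size=25)

# Step 1: inside-the-model Bayesian conditioning (conjugate update).
prior_mean, prior_sd, sigma = 0.0, 5.0, 1.0
n = y_obs.size
post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
post_mean = post_var * (prior_mean / prior_sd**2 + y_obs.sum() / sigma**2)

# Step 2: outside-the-model posterior predictive check with a test statistic
# (here: the proportion of points more than 2 sd from the sample mean).
def stat(y):
    return np.mean(np.abs(y - y.mean()) > 2 * sigma)

t_obs = stat(y_obs)
t_rep = []
for _ in range(2000):
    theta = rng.normal(post_mean, np.sqrt(post_var))      # draw from the posterior
    t_rep.append(stat(rng.normal(theta, sigma, size=n)))  # replicated data

p_value = np.mean(np.array(t_rep) >= t_obs)
print(f"posterior predictive p-value for the tail statistic: {p_value:.2f}")
```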

    In summary, I propose a shift in emphasis during the ‘conditioning’ step (taking the prior collection to a posterior collection of parameters/models) from an ‘exact match to realised data but occurring at any probability level’ (usual Bayes) to an ‘approximate match to realised data occurring with high probability’ criterion, representing a ‘severe conditioning’ combination. The ‘high probability’ component of this involves retaining, during the step of comparing predictions to realised data, only predictions which are made with high probability for a given parameter and can be related to Mayo’s informal severity criterion and Popper’s ideas of ‘risk of falsification’ (I think). The result of an analysis performed using this approach is analogous to a Bayesian posterior distribution over the parameters, as well as an associated severity level. This posterior parameter distribution gives a posterior predictive distribution having a high probability of generating replicated data that fit the observed data, but may also predict other ‘unusual’ data with some probability.

    Finally, parameters should only be given meaning within a fixed model (or meta-model). If you want to relate parameters in different models then you need to provide a transformation between them, and you may require additional regularization at a ‘higher’ level. This regularization may take the form ‘consider models with high/low values of property X more often’ and is then in some sense a prior over the property X (think about sampling candidates from your prior). Furthermore, I take parameters to be most meaningful when they have the same value in multiple contexts (datasets) for the same model, i.e. are invariant under a wide range of scenarios. Parameters that are only valid for a restricted context are valid for that context but do not generalise and are hence less interesting scientifically.

    I think most of the above is reasonably consistent with your perspective, but would be keen to hear your thoughts.

  29. All

    I am glad that others found it interesting, so we were not wasting our time. A very first version of my thoughts was sent to John Tukey in 1993. He was kind enough to reply with a paper entitled (I quote from memory, being on the move) Some issues in statistics in the light of Laurie Davies’s paper. It contains his words ‘bland’ and ‘hornless’. If anybody would like a pdf file, send an email to [email protected]
