Untangling the Jeffreys-Lindley paradox

Ryan Ickert writes:

I was wondering if you’d seen this post, by a particle physicist with some degree of influence. Dr. Dorigo works at CERN and Fermilab.

The penultimate paragraph is:

From the above expression, the Frequentist researcher concludes that the tracker is indeed biased, and rejects the null hypothesis H0, since there is a less-than-2% probability (P’<α) that a result as the one observed could arise by chance! A Frequentist thus draws, strongly, the opposite conclusion than a Bayesian from the same set of data. How to solve the riddle?

He goes on to not solve the riddle. Perhaps you can?

Surely with the large sample size they have (n=10^6), the precision on the frequentist p-value is pretty good, is it not?

My reply:

The first comment on the site (by Anonymous [who, just to be clear, is not me; I have no idea who wrote that comment], 22 Feb 2012, 21:27pm) pretty much nails it: In setting up the Bayesian model, Dorigo assumed a silly distribution on the underlying parameter. All sorts of silly models can work in some settings, but when a model gives nonsensical results—in this case, stating with near-certainty that a parameter equals zero, when the data clearly reject that hypothesis—then, it’s time to go back and figure out what in the model went wrong.

It’s called posterior predictive checking and we discuss it in chapter 6 of Bayesian Data Analysis. Our models are approximations that work reasonably well in some settings but not in others.

P.S. Dorigo also writes:

A Bayesian researcher will need a prior probability density function (PDF) to make a statistical inference: a function describing the pre-experiment degree of belief on the value of R. From a scientific standpoint, adding such a “subjective” input is questionable, and indeed the thread of arguments is endless; what can be agreed upon is that in science a prior PDF which contains as little information as possible is mostly agreed to be the lesser evil, if one is doing things in a Bayesian way.

No. First, in general there is nothing more subjective about a prior distribution than about a data model: both are based on assumptions. Second, if you have information, then it’s not “the lesser evil” to include it. It’s not evil at all! See, for example, the example in Section 2.8 of Bayesian Data Analysis.

P.P.S. I ran the above by a couple of physicist types. They said it was ok, but one wrote:

There is a lot of poop being thrown these days between bayesians and frequentists. I am not sure why; that fight seems so 2003 to me. But anyway, I guess it is worth responding.

I agree and have no desire to throw any poop. It can be useful to highlight the differences between different approaches, and I think it can also be useful to clear up misconceptions, such as the idea that Bayesian inference is particularly “subjective” (or, to put it another way, the idea that non-Bayesian inference is particularly “objective”). One thing I like about Bayesian methods is how they force you to put your assumptions out there, potentially to be criticized. But I understand that others prefer a mode of inference that makes minimal assumptions.

63 thoughts on “Untangling the Jeffreys-Lindley paradox”

  1. Their mixing up of discrete and continuous distributions could also be screwing things up. I’m not at all convinced the calculation has been done right.

    Is it even possible for the probability of a hypothesis to go from .5 in the prior to .97 in the posterior when the data move against it?

    • Yeah, the more I look at the calculation the less sense it makes. The posterior doesn’t look to be normalized to one and they seem to be evaluating a continuous posterior at a single point.

      • Ok, I see what they were getting at. I think I could express my remaining confusions like this:

        The prior is presumably of the form: p(R) proportional to .5*delta(R-.5)+.5*F(R), where F(R) is some function such that its integral over [0,1] minus some epsilon region around .5 goes to 1 in the limit as epsilon goes to zero.

        They in fact seem to be using F(R)=1 here. If they use a different one, then they definitely get a different posterior. There is actually quite a bit of ambiguity here: F(R) could be something wildly different. But even using F(R) = 1-delta(R-.5), which is morally identical to the one they do use, gives a very different answer.

        But then examine the prior they do use P(R) = .5*delta(R-.5)+.5*F(R). It is clearly normalized to 1 over the interval [0,1]. But if you evaluate P(R=.5) the way they did the posterior you get P(R=.5) = 1.

        So according to their calculation the probability that R=.5 goes from 100% in the prior to 97% in the posterior!

        • Finally got it. They are using “areas” to get probabilities, not evaluating point values. They’re using the prior P(R) = .5*delta(R-.5)+.5, and the probability of R=.5 goes from .5 to .97.

          Taking the prior as given, does that result make sense?

          Using a uniform prior gives a result that can be expressed as “if we start out with no idea where R is, then we can be pretty certain that R is in the range [.48,.51]”

          Using their prior gives a result that can be expressed as “if we think it very likely that R=.5 beforehand, then we are pretty certain that it actually is after seeing the data”

          I guess that kinda makes sense. (A numerical sketch of this update appears right after this thread.)

    • Yes, it’s possible. The discrepancy between use of posterior probabilities and p-values can be as strong as you like, given extreme enough priors and data that provides sufficient precision. The example here has both.

      • Fred, I see your point now. You could make this example even more extreme by increasing the prior density on .5 and moving the actual data further out into the tails of the likelihood, thereby giving even smaller p-values with the same posterior probability.

        I guess what throws the intuition off is that the data by itself is consistent with the idea that R is near .5. So in the uniform-prior case the posterior is giving fine detail about where in the vicinity of .5 the true R is. With the prior they use, the same data is simply confirming their prior belief.

        • Yes, there’s only support in the likelihood near 0.5 – but enough to reject 0.5 as the null, in the classical analysis.

          In the Bayesian version the prior gives (perhaps modest) support to the exact value 0.5, and gives very little support to values near 0.5, because the rest of the prior is so diffuse. Consequently the value at exactly 0.5 comes out with lots of posterior support, exactly the opposite of the classical result.
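
For concreteness, here is a minimal sketch of the calculation worked out in this thread, assuming a prior with mass 1/2 on the point R = 0.5 and the other 1/2 uniform on [0,1], a binomial likelihood, and the n = 10^6, x = 0.4988 numbers quoted in the comments:

```python
# Sketch of the point-null-vs-uniform calculation discussed above.
# Prior: Pr(R = 0.5) = 0.5, and R ~ Uniform(0,1) with probability 0.5.
# Data (numbers quoted in the thread, assumed here): n = 1e6, k = 498,800.
from scipy import stats

n = 1_000_000
k = 498_800                        # observed positive tracks (fraction 0.4988)

# Marginal likelihood under the point null R = 0.5.
m_null = stats.binom.pmf(k, n, 0.5)

# Marginal likelihood under the uniform slab: the integral of Binom(k | n, R)
# over R in [0,1] equals 1/(n+1) exactly.
m_alt = 1.0 / (n + 1)

post_null = 0.5 * m_null / (0.5 * m_null + 0.5 * m_alt)

# Two-sided p-value for H0: R = 0.5, normal approximation.
z = (k / n - 0.5) / (0.25 / n) ** 0.5
p_value = 2 * stats.norm.sf(abs(z))

print(f"posterior Pr(R = 0.5) ~ {post_null:.3f}")   # roughly 0.97
print(f"two-sided p-value     ~ {p_value:.4f}")     # roughly 0.016
```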

  2. It’s hard to accept that model evidence is a purely mechanical calculation, and that the collection of models implied by a prior can on average be terrible even if a set of them are great, AND that the data nails down which ones it is. In this case the R!=.5 model is on average terrible even if there are particular R which are clearly much better than .5.

    I’ve wondered if there’s a better answer than the host of partial-BF-type (intrinsic, fractional, CPO, WAIC, etc) approaches. Essentially, a multiple-testing-like correction to the posterior-BF for having used the data twice. I’ve never seen an optimality result suggesting how much of the data one should use for the two tasks in the partial-BF-type approaches. Doing interval estimation is attractive, but doesn’t work for points on the boundary where the prior has zero mass (like variance parameters in most setups).

  3. > All sorts of silly models can work in some settings […]

    Which settings would that be?

    It strikes me that your philosophical articles don’t address this point. If (almost) all models are approximations, then given enough data, they can all be “falsified” using Chap.6 of BDA. Then what?

    Suppose that I’m considering two models for a particular problem: G is a simple Gaussian fit, and C is a more complex model, with a better fit (not due to overfitting – it’s a genuinely better model).

    Isn’t it true that if I care about the expectation, G could give me a better answer than C (because the average is a sufficient statistic of G)? Yet, I could “falsify” G, and choose C instead. This might be a silly example, but it makes me worry a great deal about my own complex models.

    • Cedric:

      From Chapter 6 of BDA:

      The relevant goal is not to answer the question, ‘Do the data come from the assumed model?’ (to which the answer is almost always no), but to quantify the discrepancies between data and model, and assess whether they could have arisen by chance, under the model’s own assumptions. . . . the discrepancies found by posterior predictive checking should be considered in their applied context. A model can be demonstrably wrong but can still work for some purposes, as we illustrate in a linear regression example in Section 14.3.

      There’s a choice of whether it’s ok to stick with a particular, flawed, model. That choice depends in general on the data and also on the applied context.

  4. I’m not at all sure a p-value of 0.01 ought to clearly reject a null hypothesis when your sample size is 10^6. The fundamental problem with not increasing the test level as sample size increases (what I was taught is desirable) is that all the gains from increased sample size go to power, none to reducing Type I error. So, for any given mix of true and false null hypotheses, all your errors become Type I errors – and you’re guaranteed to be making errors, in the case of the test used in the paper, at least 2% of the time, no matter how large your sample size is. Not likely to be ideal!

    • This only holds if we assume that the effect sizes under plausible alternatives are the same in sample sizes of e.g. 10^2 vs 10^6. In many fields (though not all) we may do larger studies in order to assess smaller effects.

  5. It’s frustrating when people don’t realize that making “no assumptions” is an assumption. A flat prior is an assumption. Zero imputation is an assumption.
    Also, making “no assumptions” when there is available information is to assume that the available information is wrong, which is an assumption.

    • Maybe a flat prior is an assumption, but if everyone assumes the same thing, that’s one less degree of freedom for people to play (cheat) with. What worries me is cherry-picking a prior to get the result one wants.

    • The nice thing about flat priors is that, if you publish your results using a flat prior, anyone can multiply them (if only approximately in their head) with their preferred prior and normalize to get the result _they_ would’ve gotten, if they’d done the analysis using their prior from the beginning.

      Of course, often there are good reasons for presenting data with a specific non-uniform prior pre-applied. In such cases, a (prior, likelihood, posterior) triplot is a useful tool, since it provides all the information together; after all, the normalized likelihood is exactly what the posterior would be, if one had used a flat prior.
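
As a toy illustration of this reweighting idea (the published flat-prior draws and the Beta(50, 50) “preferred prior” below are made up for the example):

```python
# Sketch of re-applying one's own prior to a published flat-prior posterior.
# Hypothetical setup: the published analysis used a flat prior on R and
# released posterior draws; the reader prefers a Beta(50, 50) prior.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Stand-in for the published flat-prior posterior draws of R
# (here: roughly Normal(0.4988, 0.0005), as in the example in this thread).
draws = rng.normal(0.4988, 0.0005, size=50_000)

# Importance weights: preferred prior density divided by the flat prior (density 1).
weights = stats.beta.pdf(draws, 50, 50)
weights /= weights.sum()

post_mean = np.sum(weights * draws)
print(f"reweighted posterior mean of R ~ {post_mean:.4f}")
```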

  6. “There is a lot of poop being thrown these days between bayesians and frequentists. I am not sure why; that fight seems so 2003 to me.” 2003 is being generous, I would have said maybe 1997.

  7. This “riddle” (and the reason why the a priori PDF is silly) might make more sense if we reformulate it in discrete terms, with bins corresponding to R=0-0.001, 0.001-0.002, … , 0.498-0.499, 0.499-0.501, 0.501-0.502, …, 0.999-1.

    The frequentist says that, if the tracker is unbiased, given the number of samples, the outcome of the experiment will almost always end up in the middle bin (0.499-0.501). The probability of being in any other bin is 4.55%. When he runs the experiment and finds the result in the bin 0.498-0.499, he concludes that it is an unlikely outcome and the tracker might be biased.

    By using the a priori PDF in question, the Bayesian researcher says that he expects the tracker to have R in the middle bin with probability 50%, and in the bin 0.498-0.499 with probability 0.05%. In essence, he expresses extreme confidence in the null hypothesis – he disfavors the possibility of having a slightly biased tracker vs. the possibility of an accurate tracker at 1000 to 1. If this is indeed the case, then finding x=0.4988 is insufficient evidence to rule out the null hypothesis.

    In practice, of course, 1000:1 odds are almost certainly unwarranted, and the researcher would be better off using a flat a priori PDF, which would result in the paradox going away. (A numerical version of this binned update is sketched at the end of this exchange.)

    • Blaise:

      Yes, there’s a long literature on the topic. I’m not adding anything new here, I’m just emphasizing that here’s a case where a weird prior can lead to a weird result. The problem is that people are lulled into a false sense of security: in some settings, you can use a weird prior and get an ok posterior, but not here.

  8. I would scarcely know where to begin with this round of hoop throwing, but one thing cannot really be meant: “there is nothing more subjective about a prior distribution than about a data model: both are based on assumptions.” Is the presumption that two evidence sources that involve assumptions are equally subjective (in the sense of equally open to error, bias, and unreliable inferences)? Of course not. Some assumptions are testable and relevant for a reliable inference. What matters is not that there are assumptions, but what is being measured, and how it’s to be used in inference. (a) I assume the scale that correctly weighed objects of known weight one minute ago is working now, and I use it to infer my approximate current weight (after which I can recheck the scale). (b) I assume Senator X really believes that such-and-such economic policy Q is best for the world, and use this to infer that Q is the best economic policy.
    People will (rightly) say that measuring & using priors is not analogous to measuring and using senator X’s beliefs. Fine, then why appeal to such an analogy as if it sufficed to defend priors (as just as checkable/reliable as models or instruments), simply on the grounds that all involve assumptions?

    • Mayo:

      I don’t know who you’re talking to, but I don’t think it’s me. I never talked about Senator X—that’s your example. So when you ask, “why appeal to such an analogy?”, I don’t know who your asking. I referred above to the example in Section 2.8 of our book, an example that’s all about cancer rates and nothing about senators.

      • Sorry I missed all this; I am focusing on writing a chapter on this and related issues! Now it was you who said “there is nothing more subjective about a prior distribution than about a data model: both are based on assumptions.” So I asked: Is the presumption that two evidence sources that involve assumptions are equally subjective (in the sense of equally open to error, bias, and unreliable inferences)? Clearly not, and yet that is what you said, and you’re not alone. If you don’t hold this view, great! Then I expect it will be absent from future discussions of objectivity! As for the illustration (the first example that popped into my head), as I explain to logic students, an instantiation of an argument is used in order to show the argument is fallacious in general. Fortunately, we do not have to demonstrate fallacious forms for each particular instantiation, else we’d never get done.

    • Mayo,

      You seem to be under the impression that, for example, the iid normal model for the error measurements is some kind of physical fact which can be verified. I’ve never once seen this verified in practice and turn out to be correct. In fact, it’s almost always assumed or impossible to verify. So most of the time the data model IS an assumption, and to the extent that this frequency distribution is objective, it is objectively wrong. And yet these methods can still lead to useful answers an amazing amount of the time.

      This latter point is the primary difficulty to be explained.

      Incidentally, if you do use a prior based on the Senator’s belief that Q is best for the world, and you are led in the final analysis to a conclusion which you know to be wrong, haven’t you just learned that Q isn’t best for the world?

      • Indeed, we need only that the computed error probabilities adequately approximate the actual ones. And this too is an assumption; yet it is both relevant (for finding out what is the case) and checkable. I was merely denying that any two methods with assumptions are equally open to charges of subjectivity. It’s a non sequitur. At the risk of alluding to an example, choice of low power to detect m’ is a background discretionary choice, which can be due to all sorts of things, but my ability to scrutinize an inference based on using such a test is not subjective. See my discussion of “clean hands” and objectivity on my blog. Sorry, I’m dashing off!

        • The statement “we need only that the computed error probabilities adequately approximate the actual ones” is flat out false.

          Suppose you take ten measurements from a model Y_i = mu + error_i and the errors are assumed to be IID N(0,100).

          If the actual errors are error_i = 1 for i=1 to 10 then both a Bayesian (uniform prior) and Frequentist will report intervals that contain the true value of mu.

          If the actual errors are error_i = +500 for i=1 to 5 and error_i=-500 for i=6 to 10 then both a Bayesian (uniform prior) and Frequentist will report intervals that contain the true value of mu.

          Now in no way does either set of errors even approximately resemble IID N(0,100). Yet they get the right answer. (See the numerical check at the end of this exchange.)

          You’ve completely misidentified what is driving these inferences. If we can’t even get this stuff right there’s no point in addressing the rest.

        • I think we’ve been over this rather recently, but “intervals that contain the true value of mu” is not the same thing as “the right answer”, even approximately. For starters, a 50% credible interval should *not* contain the true value half of the time.

        • If a carpenter or scientist uses a ruler whose smallest division is 100 units and takes a series of measurements of an object of length mu and then determines mu = 2000 +/- 50, then they naively believe that they do indeed have the right answer if mu is actually in the interval [1950,2050].

          Furthermore, they’re liable to believe that they have the wrong answer if mu isn’t in the interval [1950,2050].

          Now if a statistician wants to invent some further measurements, which were never taken and only exist in their fevered imagination, and speculate profusely about their values, then they’re certainly free to do so. But I assure you that the carpenter or scientist will stubbornly continue to believe that they have the right answer if and only if mu is in the interval [1950,2050].

          So what is the necessary and sufficient condition that the actual realized error_i have to satisfy to guarantee that mu is in the stated interval?

          Hint: it’s definitely not that the histogram of the actual errors resemble N(0,100)!

  9. To state the absolutely obvious, and it is not worth doing so in this audience:

    The frequentist and the bayesian more-or-less agree about the relative consistency of each hypothesis in the bag of models with the data, in the sense that they agree on the values of the likelihoods. At this point they have both made many assumptions by just choosing what set of models can be considered.

    The (true, not cartoon) frequentist won’t go further, because the bayesian needs a metric on hypothesis space (or a prior) to say something about the relative plausibilities or posterior probabilities of the hypotheses. That’s an additional assumption but gives power to, eg, marginalize out the hypothesis choice (or parameters of the model). That power is essential in many circumstances.

    Oddly, Dorigo (and full disclosure, I am the source of the “poop” quote except that I didn’t say “poop”) explains the difference between the frequentist and the bayesian in terms of what they each *decide* about the outcome of the experiment, but that requires, in addition to a choice of models and a choice of prior, a description of *utility*, which is outside the domain of *either* frequentism or bayesianism.

    So it is odd to state the difference between the frequentists and the bayesians in terms of what they end up *deciding*.

    Also, all hell would have broken loose if the two experimenters had included in their bag of models (at the beginning) a model in which the bias of the device was permitted to have *a different value for every different trial*. That would have had *awesome* likelihood but *terrible* posterior probability and would *not* have been decided upon by any (sane) experimenter. But by Dorigo’s ridiculous set-up, the frequentist would have been forced to accept it.

    • David:

      Sometimes it is not obvious how to make the obvious obvious. (D.A.S. Fraser quote)

      This is why I always insist on a Bayesian triplot (prior, posterior/prior, posterior), marginally and, if feasible, on the full model/parameters (e.g., see Nameless’s post above).

      The likelihood can be defined as c*posterior/prior and is invariant to the prior on the full model.

      So Bayesians and frequentists _always_ agree on the full parameter likelihood.

      But being silent on loss functions should keep us silent regarding decisions!

      Thanks

  10. Well… “anonymous”, the one Andrew referred to, was me. I was happy to see that he had a similar view of the issue. My knowledge of physics is within epsilon of zero, so I didn’t want to get into those aspects there.

    IMHO, too much is made of the whole “Bayesians vs. Frequentists” debate. In the end, we all agree on using likelihoods for inference. Frequentists were hampered by computational limitations, and so looked at the “maximum likelihood”, crossed their fingers that asymptotics would kick in, and made inferences. Now, we can do better.

    I rarely write into blogs, especially with diatribes of my own, but recently there was an influential article, causing all sorts of hair-pulling, by Simmons, Nelson, and Simonsohn, psychologists making recommendations for other psychologists (it’s here: http://people.psych.cornell.edu/~jec7/pcd%20pubs/simmonsetal11.pdf). The problem is, they are all statistical recommendations. Some are sensible, others (shall we say) less so. But one made me see red; here it is in full:

    “Using Bayesian statistics. We have a similar reaction to calls for using Bayesian rather than frequentist approaches to analyzing experimental data (see, e.g., Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011). Although the Bayesian approach has many virtues, it actually increases researcher degrees of freedom. First, it offers a new set of analyses (in addition to all frequentist ones) that authors could flexibly try out on their data. Second, Bayesian statistics require making additional judgments (e.g., the prior distribution) on a case-by-case basis, providing yet more researcher degrees of freedom.”

    Whoa! Don’t do the CORRECT analysis because it’s more work, and it gives psychologists another “degree of freedom”? I sent out a note to all our grad students about this. It’s a bit scary that this sort of advice will be taken as gospel by unsuspecting readers, given the imprimatur of Psych Science and three prominent authors.

    • You know, I can see where they are coming from. Don’t you see the potential for abuse? There’s abuse in other techniques too, so it is up to the practitioners of a field to decide what the potential for harm is and whether it is offset by the increased utility of applying Bayesian techniques to a problem. Till they know and trust a tool, it is indeed prudent to treat it with caution.

      I don’t think they are objecting to the increased work; the objection is selection of a prior is a very subjective process and too easily lets a researcher twist conclusions by cherry-picking priors. Most areas don’t have a good way of vetting priors.

      • I don’t think the point of the statement in the paper was so much “Bayesian methods make things even worse” (although that’s how it comes across, and how it could be interpreted by a naive reader or one with an anti-Bayesian slant) as “Bayesian methods aren’t a magic bullet against data snooping either”. Or at least that’s how I took it.

  11. There might be insufficient attention paid on this thread to the question of why the proposed prior is “weird”. It does seem weird if you think of it the way Nameless puts it above, but that is not necessarily an obvious way to think about it. Andrew, perhaps you should link again to your “Bayesian model building through pure thought” paper, in which you discuss several reasonable-seeming priors and show that they have very undesirable properties.

    In this case, assigning a probability of 0.5 to the case that R=0.50000000000000000000000000000… is ridiculous. There is just no way that both particles can be detected with a probability that is equal even out to the 10th decimal place, much less the 100,000,000th decimal place. And this prior says that there is a 50% chance (before seeing the data) that they are equal out to the 100,000,000,000,000,000,000,000,000,000,000,000,000,000,000th decimal place and beyond!

    And yet, this prior seemed so reasonable to Dorigo, a presumably intelligent physicist, and to Jeffreys and Lindley whoever they are, and probably to everyone those people have talked to about it, that Dorigo wrote up this piece, and apparently all of these people are either baffled by it or are willing to condemn Bayesian statistics on the basis of this example. (One wonders what Dorigo and the rest would have said if there had been a counterproposal to assign 50% prior probability to the range 0.499999999999999999999999999 – 0.500000000000000000000000001; would that have seemed more obviously wrong?)

    Rather than simply telling them that they are wrong, and telling ourselves that we would never use such a crazy prior, I think it is worth asking — perhaps literally asking Dorigo — why this prior makes sense to some people. Perhaps no reader of this blog would make such a trivial mistake, but evidently many potential users of Bayesian statistics would. Choosing a reasonable prior is apparently not so easy for them. What could we do to make it easier? And, as Andrew’s paper on Bayesian model building points out, even priors that seem reasonable to a highly trained Bayesian statistician can be demonstrably “wrong” (just read the paper, you’ll see what I mean). Maybe we need a general theory of prior distributions, or at least some principles or rules of thumb, to help people avoid this kind of mistake.

    • Phil:

      As I noted above (albeit briefly), the issue is that all sorts of silly priors work in some data settings. For example, the N(0,100000000) prior on my weight works when combined with the likelihood obtained from a single measurement on a bathroom scale. (A quick numerical version of this is sketched at the end of this sub-thread.) The problem arises when someone takes Silly Prior that works in Problem A, and then unthinkingly applies it to Problem B. The problem is not just with the prior, it’s with the prior in this setting. One way to notice there’s a problem is through posterior predictive checking.

      • Is there a definition of what constitutes a “silly” prior? Say you showed his prior to 10 statisticians: would they agree it was silly?

        • Rahul:

          Just about any prior is silly if you put it in a situation where its weaknesses are exposed. I think that just about any statistician who thought about it carefully would agree that the proposed distribution does not make much sense as a model for the parameter being studied. The real question is whether the resulting posterior inferences make sense.

        • Rahul:

          All inference is circular in some way, but maybe better to think of it as a helix: we go in circles but move forward. See BDA or my published papers for many examples.

          By analogy, dictionary definitions are circular: we define words in terms of other words. But dictionaries still can be useful.
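
Andrew’s bathroom-scale example a few comments up is easy to make concrete. A minimal sketch with made-up numbers (one reading of 80 kg with measurement standard deviation 1 kg, and N(0,100000000) read as a normal prior with variance 10^8):

```python
# Conjugate normal-normal update for the weight example above.
# Made-up numbers: one scale reading of 80 kg with sd 1 kg, prior N(0, 1e8)
# (variance 1e8) on the true weight.
prior_mean, prior_var = 0.0, 1e8
y, meas_var = 80.0, 1.0

post_var = 1.0 / (1.0 / prior_var + 1.0 / meas_var)
post_mean = post_var * (prior_mean / prior_var + y / meas_var)

print(f"posterior: N({post_mean:.4f}, {post_var:.4f})")
# ~ N(80.0, 1.0): the silly-looking prior is harmless here, because it is
# essentially flat over the region where the likelihood has any mass.
```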

    • Phil – I really hope “Jeffreys and Lindley whoever they are” is a joke. Lindley (1957) is not naive about the choice of prior; he explicitly mentions some situations where this form of prior *might* be reasonable, and points out there are many situations where it isn’t.

      Furthermore, the “paradox” can also persist when the prior mixes a diffuse part with a concentration at the null value – even if the concentration isn’t a point mass – so your point about the bazillionth decimal place does not clear things up.

    • Imagine you’ve just learned about antimatter, and are then interested in the hypothesis that electrons and positrons have the same mass. There’s a principled reason to suspect that they truly, truly, are the same, and that a deviation at the billion’th (or ANY) decimal place would be important (e.g. it would mean theoretical models that postulate full symmetry cannot be the end of the story.) So a prior like this is not inherently silly.

        • I don’t understand. I’m trying to give an example where the hypothesis that some difference is really, truly, absolutely zero is reasonable (*). Measurement error comes into how how my sample data does (or does not) speak to that point but how does it affect the hypothesis of interest?

          (*) You often state something to the effect that in the social sciences all the =0 hypotheses are false. I’m merely trying to make the point that in physics there may sometimes be questions where ==0 (not incredibly close to 0, but really 0) is actually interesting, yet where statistics remains relevant (mathematics has lots of crisp ==0 or !=0 questions, but we don’t think of answering them via sampling!).

        • Andrew’s point was that the question was not whether the electron and positron actually DO have the exact same mass, but rather whether some detector is able to detect some kind of discrepancy in their mass, that’s where the measurement error comes in.

        • An example is the recent claim of superluminal neutrino propagation.

          Recent reports are that the experimenters have found a problem with a loose connection in their apparatus that delayed the GPS information by about the 60 ns claimed.

          Systematic errors (biases) are hard to detect, especially when they are small (60 nanoseconds???)

          These are not problems with the hypothesis, they are problems with the experimental implementation.

          For example, Delampady and Berger mention in their paper (search jstor.org) that one might think that talking to your plants does not affect their growth.

          Yet it might, if you talk to them while breathing excess carbon dioxide over them.

          So, when considering actual experiments claiming to test null hypotheses, “buyer beware.”

        • > at first I thought the vaccine almost surely had some effect but after talking to a vaccine expert they convinced absolutely zero effect was the best bet.

          Absolutely zero? I can’t make sense of this if I take it literally. Are you saying that you’d bet, at better than even odds, against me if I instead said I expected the absolute magnitude of the effect to exceed 1/10^10^10^10^10^10^10^10 (in one direction or another), supposing someone else showed up with some amazing analysis which, if pursued, could distinguish between the absolutely-zero world and my alternative?

      • You make an excellent point. Consider, in physics, CPT symmetry, which says (roughly) that physical laws should be symmetric in charge, parity, and time (i.e., if you swapped ALL three at the same time; violations of C and P were long known). The question is what you’d do with “small” violations coming from, say, an accelerator experiment. How much ‘weight of evidence’ would be required to abandon this cornerstone of physics?

        You might reason that thousands of experiments found no such violation. But you might also reason that C-symmetry and P-symmetry *also* seemed reasonable, until they were found to be wrong. If the prior on CPT being true were fantastically concentrated on “yes”, then the degree of evidence for violation would need to be exceedingly strong to abandon it. That’s pretty much what’s happening now with the superluminal neutrino data: no one is believing it, at least not without more corroborating evidence.

        There’s that old physics quip that, if your experiment requires a statistician, you should design a better experiment. I find this obnoxious, in that it views statistics only as something to pick up extremely weak signals. On the other hand, there are some experiments where the results are so obvious (e.g., double slits) where you don’t need stats to help with them.

        Anyway, it’s not clear what physicists and others with strong “prior theory” should be doing with that in their analyses. That is, no specific prior that makes more sense than others.

        • I might say that if Bayesians and Frequentists reach different conclusions from your experiment, you should design a better experiment.

        • Fisher (in his later career) would have agreed with you: the main point of good design is to make assumptions true or not very important for inference.

          And some might see what Mayo and her crew are doing as trying to make this so for analysis – structure the analysis so that prior assumptions are not important, if not outright _unnecessary_.

        • OTOH a staggeringly large body of existing scientific results out there has been demonstrated using Frequentist approaches and has weathered the test of time. So if indeed there is any systematic disagreement between the results of the two methods, the onus should lie on the Bayesians to prove their case or resolve the differences. In a pinch, I bet most scientists (outside of statistics) would trust the Frequentist results more than the Bayesian ones.

        • Rahul, seriously? Here are some points:

          (1) At this point there is clear evidence that any effect reported in the life and social sciences that isn’t so big it could be seen without statistics is very suspect. 65% of peer-reviewed academic drug trials cannot be replicated by pharmaceutical companies that want to commercialize them. About half of the papers in the life and social sciences seem to be wrong.

          (2) The original applications of statistics to science were due to Laplace (whom many would consider the real originator of Bayesian statistics). I defy you to name one result Laplace got by applying Bayesian statistics to astronomy which has since been overturned.

          (3) In my own field, far from designed experiments and clean data, I don’t think I’ve ever seen a Frequentist analysis that got it right. As soon as I see a p-value or hypothesis test I know I’m going to have to redo the entire analysis.

          (4) On the other hand, very sophisticated Bayesian methods applied to, say, radar target analysis and discrimination have been fantastically successful.

          (5) There’s overwhelming evidence that the “most scientists outside of statistics” naturally think like Bayesians.

          (6) Check out that paper someone referenced on another post where they examine what actual coverage 95% Confidence intervals had for physical constants in physics. The 95% coverage claims are absurdly false.

          (7) Do you have any idea how many health recommendations have been changed during my adult life? “Salt is bad”, “salt is ok” and so on? To say that the Frequentist statistics used to back up these claims have “held up” is beyond absurd.

          I’m sure other could add another 20 points.

        • > I might say that if Bayesians and Frequentists reach different conclusions from your experiment, you should design a better experiment.

          But there’s an even earlier step to take: perhaps the conclusions are different because the are addressing different _questions_ in the first place. One should be first sure that you are clear what you are asking. The experiment doesn’t need changing if the two
          analyses are using it for distinctly different purposes: one says “yes”, the other “no”, but if to different questions than perhaps the experiment is just fine as is.

          The physicist here is NOT asking the question whether the detector bias is zero. S/he knows it isn’t, and doesn’t need even a single data point to be morally certain of this. That would be a silly question, and for practical purposes a silly interpretation (except for other cases where it isn’t, but _here_ it’s silly). But on the other hand the Bayesian analysis is predicated on the question of whether the bias is 0 – indeed it uses a contrived prior to actually make this plausible. And the strangeness of the Bayesian answer is precisely because it’s giving a lot of credit to “=0, truly” vs “it’s really extremely small” – because the prior specifically tells us to think about the former very seriously. Different answer, to a different question.

          The actual physicists’ question of interest concerns, broadly speaking, whether possible bias is small enough to be ignorable. (You can read the comments to the original article: this is not my speculation, this really is the real question.) What is small enough, what size errors qualify as ignorable, aren’t clear, but from the his comments after the article it’s something like (my paraphrase) “unlikely to be influential in _other_ statistical analyses using this detector where the sample size is similar”. I really don’t know enough to see whether that specifically actually the interesting question (versus whether it’s a cop-out to avoid having to think about loss functions & etc) but if this is the specific question it seems reasonable to think that the frequentist test speaks very well to it. But no one told the Bayesian analysis that this was the goal! The Bayesian was told “hypothesis: R == 0.5”, go to it!

          IMO frequentist statistics has demonstrated one clear fault here. It has given us a convenient language so you can jump immediately from (to cite the original article) “the researcher might decide to study with high precision the possible charge bias” to then “After counting many tracks, a result significantly different from zero would …” and so then right on to “The hypothesis under test is that the fraction of positive tracks is R=0.5”.

          So with the convenient intervention of the significant vs. statistically significant confusion, you’ve got yourself very quickly a seemingly clear question that (a) didn’t even require you to THINK about whether you needed a loss function or some measure of what error would be too much, (b) sounds very scientific, (c) already bakes in many frequentist assumptions, (d) will trap any Bayesian analyst who mistakenly interprets this in her world as “ok, so you want to know whether R=0.5 or not?”. Read the article and comments if you disagree with this, but I think it’s clear that this is precisely what was happening, and it’s as clear an example as you will ever see. “R=0.5?” is absolutely not the question that the author (or his fictional protagonist) wants to ask or that the frequentist analysis is trying to answer. Bayes has strived to make an entirely different question (i.e. R=0.5?) reasonable – it’s generally not – and then tried to answer _that_.

        • bxg: The paper by Delampady and Berger, which I mentioned earlier, gives a criterion to decide whether an exact point prior on the null hypothesis is a good approximation to a prior that is somewhat spread out. IOW, even though an exact point prior may not be physically realistic (given possible bias in the experimental setup), it may be just fine in practice. OTOH, if their criterion is not satisfied, then the null hypothesis should be represented by a spread-out prior that reflects the possible bias.

  12. fred, if I ever heard of Jeffreys and Lindley I’ve forgotten about them. No joke!

    The problem (or a problem) with the prior used by Dorigo is not that it assigns 50% probability to a point value per se; it is that it assigns a 50% probability to an essentially impossible range of values, of which the point value is just the reductio-ad-absurdum limit that Dorigo chose. If Dorigo had assigned 50% probability to the range 0.4999999-0.5000001 the situation would be little better (see the numerical check at the end of this thread); in either case he is assigning a lot of prior probability to a region that is in fact very very very unlikely to contain the right answer. To me, one way to clarify this is to look at the absurdity of suggesting that the right answer could match your estimate out to a zillion digits, but I am no expert on pedagogy, and if you want to say there’s a better way to make the point then I don’t necessarily disagree.

    bxg, I understand what you are saying, and indeed I almost wrote up that example in my previous comment in order to contrast it with the experimental issue in question. We could imagine a thought experiment in which Physicist A thinks two particles must have exactly the same mass, and Physicist B thinks they can be different but should be within some range of each other. We could assign each physicist a 50% prior probability of being right, and maybe end up with a prior like the one in question. I agree it is not a crazy prior in all contexts.

    If you have as much data as in Dorigo’s example, you should really just use a fairly noninformative prior and let the data dominate. There is no point even asking the question of whether the detector is biased, you know the answer before you even see any data.

    A more interesting and perhaps realistic question might have been to say: typically detectors like this have a bias of up to 0.5%, but we have made a change that we think should make this one a lot better. Researcher A thinks the bias should definitely be under 0.1%, but Researcher B thinks the change won’t help very much and could still be as high as 5%. We want to use the data to figure out which of them is right. Of course, if you see a very small bias then this tends to argue for Researcher A but it is not literally inconsistent with Researcher B…and vice versa for a large bias.

    Andrew, I disagree that “the issue is that all sorts of silly priors work in some data settings.” That’s a true statement but almost completely unhelpful…unless you take it to the next step, which is to give some guidance on how to tell whether the prior is silly in the setting you are interested in at the moment. You’re right that posterior predictive checks can help with this — if the actual data are very unlikely under your final model then you should suspect a problem with your prior or with your model or with your coding — but helping people recognize when their prior is a poor choice after they have already used it is unsatisfying for a bunch of reasons, not least of which is that it doesn’t necessarily tell them how to make a better prior.

    • Phil: A marginal (focussed on the parameter of interest) Bayesian Triplot _should_ help separate where the model is most wrong.

      [that is what those Zombie plots of mine were about; unfortunately a submitted paper on that has been under review since Sept 2011].

      More formally, Mike Evans (U of T) has formalized this separating of model failure in his work on relative surprise inference (pretty technical) but some do not think separating makes sense (Andrew, I believe is or was of that opinion).
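
Phil’s point above, that a 50% lump on the narrow range 0.4999999-0.5000001 would be little better than the point mass (and fred’s point that the “paradox” persists without an exact point mass), can be checked with the same assumed numbers (n = 10^6, k = 498,800). A minimal sketch:

```python
# Sketch comparing a point-mass-at-0.5 prior with a prior that instead puts
# 50% mass uniformly on the narrow interval [0.4999999, 0.5000001]
# (the rest uniform on [0,1]).  Numbers assumed as elsewhere in the thread.
from scipy import stats

n, k = 1_000_000, 498_800

def slab_marginal(lo, hi):
    """Data likelihood averaged over R ~ Uniform(lo, hi)."""
    a, b = k + 1, n - k + 1
    return (stats.beta.cdf(hi, a, b) - stats.beta.cdf(lo, a, b)) / ((n + 1) * (hi - lo))

m_uniform = 1.0 / (n + 1)                      # likelihood averaged over Uniform(0,1)

# (a) 50% point mass at R = 0.5
m_point = stats.binom.pmf(k, n, 0.5)
post_point = 0.5 * m_point / (0.5 * m_point + 0.5 * m_uniform)

# (b) 50% mass spread over the tiny interval [0.4999999, 0.5000001]
m_narrow = slab_marginal(0.4999999, 0.5000001)
post_narrow = 0.5 * m_narrow / (0.5 * m_narrow + 0.5 * m_uniform)

print(f"posterior mass on 'unbiased' lump, point prior:  {post_point:.3f}")
print(f"posterior mass on 'unbiased' lump, narrow prior: {post_narrow:.3f}")
# Both come out around 0.97-0.98: the narrow interval behaves like the point mass.
```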

  13. I don’t think there’s really a paradox at all, and I also think the prior is not really the problem. In my understanding, the two approaches are answering different questions.

    The Frequentist is answering “is R=0.5 a good explanation for the observation?”

    The Bayesian is answering the question “which of (R = 0.5) or (R uniformly drawn from [0,1]) is a better explanation of the data, given that either hypothesis is equally likely based on prior knowledge?”

    The small p-value of (R=0.5) means the Frequentist answers “no”. The high posterior probability of (R=0.5) vs. (R \in [0,1]) means the Bayesian answers that (R=0.5) is the better of the two choices.

    I spent some time editing the Wikipedia article to try to make the distinction more clear (and to clean it up a bit in general), and would appreciate feedback from those with a better understanding of Frequentist or Bayesian methods, to make sure I didn’t screw it up.
