What’s wrong with Bayes

My problem is not just with the methods—although I do have problems with the methods—but also with the ideology.

My problem with the methods

It’s the usual story. Bayesian inference is model-based. Your model will never be perfect, and if you push hard you can find the weak points and magnify them until you get ridiculous inferences.

One example we’ve talked about a lot is the simple case of the estimate,
theta_hat ~ normal(theta, 1)
that’s one standard error away from zero:
theta_hat = 1.
Put a flat prior on theta and you end up with an 84% posterior probability that theta is greater than 0. Step back a bit, and it’s saying that you’ll offer 5-to-1 odds that theta>0 after seeing an observation that is statistically indistinguishable from noise. That can’t make sense. Go around offering 5:1 bets based on pure noise and you’ll go bankrupt real fast. See here for more discussion of this example.
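
In R, the arithmetic behind this is just a couple of lines (a minimal check; with the flat prior the posterior for theta is normal with mean theta_hat and standard deviation 1):

theta_hat <- 1                        # the estimate, one standard error from zero
p_pos <- 1 - pnorm(0, theta_hat, 1)   # Pr(theta > 0 | y) under the flat prior
p_pos                                 # 0.841, the 84% above
p_pos / (1 - p_pos)                   # implied betting odds on theta > 0, about 5.3 to 1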

That was easy. More complicated examples will have more complicated problems, but the way probability works is that you can always find some chink in the model and exploit it to result in a clearly bad prediction.

What about non-Bayesian methods? They’re based on models too, so they’ll also have problems. For sure. But Bayesian inference can be worse because it is so open: you can get the posterior probability for anything.

Don’t get me wrong. I still think Bayesian methods are great, and I think the proclivity of Bayesian inferences to tend toward the ridiculous is just fine—as long as we’re willing to take such poor predictions as a reason to improve our models. But Bayesian inference can lead us astray, and we’re better statisticians if we realize that.

My problem with the ideology

As the saying goes, the problem with Bayes is the Bayesians. It’s the whole religion thing, the people who say that Bayesian reasoning is just rational thinking, or that rational thinking is necessarily Bayesian, the people who refuse to check their models because subjectivity, the people who try to talk you into using a “reference prior” because objectivity. Bayesian inference is a tool. It solves some problems but not all, and I’m exhausted by the ideology of the Bayes-evangelists.

Tomorrow: What’s wrong with null hypothesis significance testing.

85 thoughts on “What’s wrong with Bayes”

  1. “exploit it to result in a clearly bad prediction.”

    Three points:

    (1) The goal isn’t to make a good prediction. That is usually impossible given the information put into the problem. The goal is “to do the best you can with the information used, however good that may be”.

    (2) If it’s that “clearly” then you haven’t used all the information available. There’s been a long history of mass producing “bad Bayes” examples this way. Simply do a bayesian analysis without using all the information that one’s intuition uses, and then claim Bayes doesn’t agree with the intuitive answer. It’s a favorite strategy of frequentist fanatics.

    (3) I won’t go through a lengthy analysis of your example (it’s been done before and had no impact). Just let me say that in your example, I believe you’re confusing two very different states of information. You can have either (a) “the frequency distribution of a changing lambda is normal” or (b) “lambda is fixed with an uncertainty described by a normal”. If you’re frequentist or, worse yet, half-frequentist/half-bayesian, then you’re liable to confuse these two very different types of information and easily generate problems/paradoxes.

    Perhaps it would be more accurate to say “the problem with Bayes is half-frequentists/half-bayesians”

    • Anon:

      1. Whether or not the goal is to make a good prediction, I do think we want to avoid clearly bad predictions.

      2. Exactly: In the example above, there is available information that’s not used in the model. The concern is that we are often in this situation. Researchers commonly make Bayesian statements (or give classical confidence intervals with an implicit Bayesian implication) that don’t make use of relevant information. When above I wrote, “What’s wrong with Bayes,” I was talking about what’s wrong with Bayesian methods as they are used, not what’s wrong with ideal Bayesian methods.

      3. Yes, indeed, a few years ago I gave this example at a conference where someone said that Bayesians should be more frequentist and that frequentists should be more Bayesian. I replied that I’d be happy if Bayesians were more Bayesian (including more real information in their models) and if frequentists were more frequentist (evaluating procedures as they are actually done, including forking paths etc).

      • >In the example above, there is available information that’s not used in the model.

        Sorry if this is extremely basic or is an example you’ve expanded on elsewhere, but what additional information is available for this model but not incorporated? I assume that this affects the choice of prior, but I don’t see what prior would be more appropriate in this example.

        • You need just slightly more context before you can apply prior information. I mean, what are we measuring? There’s always some information we have once we know that information. For example suppose we’re measuring the number of gallons of milk in people’s fridges… it can’t be negative, and people don’t often buy more than a couple gallons at a time, so an exponential(1) prior might be appropriate. Or we’re measuring the vote differential in percentage points for some relatively close election… maybe normal(0,5) is appropriate… etc

          I think the point was an infinitely wide uniform prior is never a good model for any real world problem… but which alternative can’t be specified without knowing something about the actual problem.
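
          As a rough sketch of how much this matters (these particular priors are made-up illustrations, not recommendations for any real problem), the conjugate normal-normal update gives:

          theta_hat <- 1; se <- 1                        # the estimate from the post
          prob_pos <- function(prior_sd) {               # prior taken to be normal(0, prior_sd)
            post_var  <- 1 / (1 / prior_sd^2 + 1 / se^2)
            post_mean <- post_var * theta_hat / se^2
            1 - pnorm(0, post_mean, sqrt(post_var))      # Pr(theta > 0 | theta_hat)
          }
          prob_pos(1e6)   # essentially flat: 0.84
          prob_pos(1)     # normal(0, 1) prior: about 0.76
          prob_pos(0.5)   # normal(0, 0.5) prior: about 0.67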

        • If you ask a different question, it will have a different answer. I am not seeing a problem with the original answer to the original question.

          If the problem is that an infinite flat prior fails to encode the available information, that is a problem for the original question. There is no sense in complaining that the machine answered the question you actually asked, and not the one that you should have asked.

          Besides, the unrealism of the infinite prior is a red herring, at least in this problem. You get a similar posterior probability for theta > 0 from a flat prior on [-10,10]. Or flat on [-50,5]. Or N(-5,10). (Calculated from simulation.) Your prior has to be pretty informative about the sign of theta in the region where most of N(1,1) lies before you get a posterior outside 0.8 to 0.9.

          And finally, who estimates the sign of a variable from one observation?
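
          A quick grid version of that simulation (using the priors mentioned above; the grid range and spacing are arbitrary but wide and fine enough):

          post_prob_pos <- function(prior_dens, y = 1) {
            theta <- seq(-60, 60, by = 0.001)                 # fine grid over theta
            post  <- dnorm(y, theta, 1) * prior_dens(theta)   # unnormalized posterior
            sum(post[theta > 0]) / sum(post)                  # Pr(theta > 0 | y)
          }
          post_prob_pos(function(t) dunif(t, -10, 10))   # about 0.84
          post_prob_pos(function(t) dunif(t, -50, 5))    # about 0.84
          post_prob_pos(function(t) dnorm(t, -5, 10))    # about 0.83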

        • I agree that the example should be a better example. I’ve never liked this example because it obviously seems meaningful to Andrew, but it isn’t that meaningful to me. Basically I’m with you.

          Nevertheless, it’s a very typical thing in the recent past at least that people will use some kind of default or “non-informative” prior on problems where information is available, then get posterior inference that makes not very good sense, and then assert that it “must be right” because they followed the rubric.

          This is particularly true when the parameter vector is moderate to high dimensional and the “uninformative priors” tend to overwhelmingly pick out stupid unrealistic prior predictive data values.

          For example, the “uninformative prior” which is say uniform(-very_big_number, very_big_number) is actually *extremely informative*… it tells you that there’s a 95% chance that the value is outside the range (-very_big_number/20, very_big_number/20). So if “very_big_number” is something like the largest possible float at 1.79e308 or so you’re saying that there’s a really good chance your parameter is bigger than about 9e306

          The only sense in which it’s not informative, is that it says nothing about which values within the range where the likelihood is appreciable are more or less likely.
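
          To put a number on that (the bound B is an arbitrary stand-in for “very_big_number”; the 95% doesn’t depend on it):

          B <- 1e10                                        # stand-in for "very_big_number"
          punif(B/20, -B, B) - punif(-B/20, -B, B)         # Pr(|theta| < B/20) = 0.05
          1 - (punif(B/20, -B, B) - punif(-B/20, -B, B))   # so Pr(|theta| > B/20) = 0.95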

        • Richard:

          Indeed, the problem is not that the original prior is improper, it’s that it’s too weak.

          Also, if you look at the statement of the problem, we’re not estimating from one observation. Theta-hat is an estimate, not a single data point. Imagine an experiment is conducted with unbiased estimate theta-hat with standard error of 1.

          Finally, you write, “There is no sense in complaining that the machine answered the question you actually asked, and not the one that you should have asked.” My problem is not with the Bayesian math, it’s with the entire “machine” which includes the social structure by which a uniform or very weak prior is the default. This entire procedure, prior and all, will routinely give bad answers if it is taken seriously. Indeed, the only reason these Bayesian inferences are not so horrible in practice is that practitioners don’t go around making those 5:1 bets.

    • Regarding point (2), even if we take the argument on its own terms we can say that however indistinguishable the observation is from noise it is equally indistinguishable from what would be expected if theta = 2. Maybe 5-to-1 odds for theta > 0 don’t seem so bad in that light.
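
      Numerically, the two densities at the observed point are identical:

      dnorm(1, mean = 0, sd = 1)   # likelihood of theta_hat = 1 under theta = 0: 0.242
      dnorm(1, mean = 2, sd = 1)   # likelihood of theta_hat = 1 under theta = 2: 0.242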

  2. > I replied that I’d be happy [..] if frequentists were more frequentist (evaluating procedures as they are actually done, including forking paths etc).

    Would you? You told us the other day that when someone asked you for a clarification on frequentist methods your response was “Statistical significance doesn’t answer any relevant question. Forget statistical significance and p-values. The goal is not to reject a null hypothesis; the goal is to estimate the treatment effect or some other parameter of your model.”

    • Carlos:

      “Frequentist” describes the approach of choosing and understanding statistical procedures based on their frequency properties, i.e. how they would work on repeated use. So, if I do a frequency analysis and determine that classical null hypothesis significance testing has poor frequency properties, in the sense that if it is applied repeatedly it will often lead to bad estimates and bad decisions, then I’m being frequentist.

      To put it another way: “frequentist” != null hypothesis significance testing.

      • Ok.

        By the way, talking about “an observation that is statistically indistinguishable from noise” sounds a bit null-hypothesis-significance-testingy.

  3. Is this a test?
    The 84% is one minus the one sided p-value but you shouldn’t bet using that.
    For betting you need the likelihood ratio which is about 1.6.
    If you have no other information then the prior odds equal one and the post odds are 1.6.
    So a bookmaker would offer 3/2 not 5/1.
    (Disclaimer: I often bet on the horses and usually lose)

    • Nick:

      I can see why you usually lose when you bet! The math in my above post is correct, as I’m talking about a bet on theta being positive. In this example, Pr(theta>0|y) = 0.84, i.e., Pr(theta>0|y) / Pr(theta<0|y) = 5.

        • so I think you’re doing dnorm(1,1,1) / dnorm(0,1,1) which is about 1.65, so you’re comparing the likelihood of mu = 1 to mu = 0 but the bet isn’t if mu = 0 we pay 1.65 and if mu = 1 we keep your dollar, the bet is “if mu is less than 0 we pay 5 vs if mu is greater than 0 we keep your dollar”

          the likelihood ratio answers the wrong question, there are a continuum of possible mu values.
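
          Spelling out the two calculations being contrasted in this thread (flat-prior model from the post):

          y <- 1
          dnorm(y, 1, 1) / dnorm(y, 0, 1)   # likelihood ratio of theta = 1 vs theta = 0: about 1.65
          p <- 1 - pnorm(0, y, 1)           # posterior Pr(theta > 0 | y) with a flat prior: 0.84
          p / (1 - p)                       # posterior odds on theta > 0: about 5.3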

        • I’m using generalised maximum likelihood of a composite hypothesis, mu=0 being the supremum of mu less than 0 compared with mu=1 being the supremum of mu greater than zero.

          See:
          A Law of Likelihood for Composite Hypotheses, Zhiwei Zhang.
          The Strength of Statistical Evidence for Composite Hypotheses: Inference to the Best Explanation, David R. Bickel.
          Both preprints, easily googled.

        • Nick:

          This seems super-complicated, given that I just want to know Pr(theta>0|y), which I can compute directly from the posterior distribution with no supremums etc. needed. I’m not interested in the probability that theta=1. y=1 is just the point estimate, that’s all.

        • Doesn’t this totally break down if you have a high thin spike in your likelihood… like p(mu|data) ~ 10% uniform[.9999,1.0001] mixed with say 90% [-10,10]

          The maximum likelihood for it to be positive is 500 (0.1 * 1/.0002) and the max likelihood for negative is .9/20 = .045 giving you 11111 to 1 odds?

          whereas the probability to be positive if you normalize this likelihood is about .9 * 1/2 + .1 = 0.55

          I don’t see why we should care about that first calculation.

        • I guess for most real-world problems the likelihood function is reasonably smooth without discontinuities. If you have a really weird likelihood function (and you’re not aware of such) it’s probably going to cause a lot of problems for any method: Bayesian, frequentist or maximum likelihood.

        • This was just a way to point out that your calculation is only sensitive to the value at a single point. The general problem still holds for any function. Probability depends both on the density and the width of the interval, where supremum of likelihood depends only on a point density.

        • For example, suppose you have something like dgamma(x+1,2,1) which has a peak at x=0.

          about 26% of the probability mass is left of 0, and 74% is to the right of 0.

          which is the better bet? 50/50 which you get for the likelihood ratio thing, or 25/75 for the probability?
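
          Checking those numbers (the density is dgamma(x+1, 2, 1), so the mass left of zero is the gamma(2,1) mass below 1, and the sup of the density is the same on both sides of zero):

          pgamma(1, shape = 2, rate = 1)       # mass left of 0: about 0.26
          1 - pgamma(1, shape = 2, rate = 1)   # mass right of 0: about 0.74
          dgamma(1, shape = 2, rate = 1)       # sup of the density on either side of 0: 0.368, hence 50/50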

        • Nick:

          For “most real-world problems the likelihood function is reasonably smooth without discontinuities” no, actually it is discrete unless some day someone finds a way to determine observations to all decimal places (not finite).

          Thinking one can actually observe _continuous_ outcomes leads to all sorts of anomalies.

        • Keith, although I agree that all real world problems involve data measured to finite precision (it’s usually not even that many decimal places, between say 2 and 6) nonstandard analysis shows us that discrete and continuous aren’t really that different provided precision is sufficiently high. My example with a high peak could easily be made continuous, even infinitely smooth, just convolve my function with a normal distribution of width .00000001 😉

          it’s not the rapid change I create that makes Nick’s suggested method problematic, that’s just a device to amplify the problem to make it more visible, it’s the fact that he makes a global decision on the basis of local information.

        • Nick:

          Likelihood gives you the answer to the question, what are the relative probabilities of theta=1 and theta=0. But that’s not typically a question of direct interest. Pr(theta>0) is what I’m after here.

        • If you are assuming the prior is symmetric around 0 (which I think you are doing), Pr(theta>0|y)/Pr(theta<0|y) can be interpreted as the Bayes factor of “theta>0” vs “theta<0”. The Bayes factor of “theta>0” vs “theta without inequality restriction” is (again assuming a symmetric prior around 0) equal to 2*.84, which seems to be a less counterintuitive result. (I am aware of the fact that you hate Bayes factors, but given that Pr(theta>0|y)/Pr(theta<0|y) can be interpreted as a Bayes factor, you might as well consider a more relevant Bayes factor.)

        • Wolf:

          I don’t quite understand your comment. I think it is a very reasonable question to ask: What is the probability that the average treatment effect is positive? In this model, that is Pr(theta>0) and the posterior probability Pr(theta>0|y) is 0.84 in this case. If you believe your model, you’ll be willing to say that there’s a 5/6 chance that theta is positive, that is, you’d bet 5:1 on this. It’s just a probability. You can call it a Bayes factor if you want; it doesn’t really matter one way or the other!

          It’s funny to see how many comments this example received. I think the problem is that I’ve thought about this example a lot, but I’ve never formally written it up. Because the example is so familiar to me, I presented it in abbreviated form, which led to confusion.

  4. In your example, is the model just

    y ~ normal(theta, 1)

    with data y and parameter theta and an improper uniform prior on theta and a single observation y = 1? Then the claim is Pr[theta > 0 | y] = 0.84 given the model?

    If so, I don’t see a problem unless it’s the improper prior, which will defeat running calibration tests as there is no way to generate fake data. Otherwise, you just seem to be saying the model is wrong.

    • Bob:

      What Daniel said. The improper prior typically makes no sense in these situations, which is why it would be a ludicrously bad idea to be betting 5:1 or anything like that on Pr(theta > 0|y) based on seeing an estimate that’s 1 standard error away from zero. The fact that the bet looks so wrong is telling us that we have prior information not in the model, and Bayesians typically duck this problem by just not thinking hard about the implications of this posterior.

        • Not being Bayesian enough in the sense of fully appreciating how to do good empirical inquiry utilizing Bayes theorem, and/or not having a truly scientific attitude but just wanting to pull the Bayesian crank and submitting an invoice or claiming to have made an academic contribution.

        • Richard:

          The bet looks so wrong because it’s a strong statement (Pr(theta>0) = 5/6) based on data that are indistinguishable from noise. If you go around making 5:1 bets based on noise, you’re just overreacting all the time.

        • According to the simulation I made, confirmed by considering what it is telling me about the underlying mathematics, any prior that is flattish over the bulk of N(1,1) (say, 1 +/- 3) will give a posterior probability for theta > 0 of around 5 to 1. Betting at those odds is fair. If someone is offering even money on that bet, bet that theta > 0. You will win 5 times out of 6.

          It does not matter what the prior looks like outside that range. In fact, the wider the prior, the less like noise a single observation of 1 (or anything else) looks like. By the definition of the puzzle, a single observed value t narrows the posterior down to something very close to N(t,1), and exactly so in the limit of the infinite flat prior.

          In what sort of situation do you know the standard deviation and shape of a distribution, but not its mean? The example that I am imagining is measuring something of unknown length, with a device whose measurement errors are distributed as N(0,1). For convenience I take the standard error as the unit of length. Let us suppose it is a micrometer gauge. I measure an object that I can see is about an inch long, but the gauge is far more accurate than my feeling for what an inch looks like, so my prior range of uncertainty might be a few thousand standard deviations of the gauge. The gauge reads 1 inch plus 1 standard deviation. Is the object more or less than one inch long? 5:1 that it’s greater.

          When statistics is being applied to some practical end, a single observation almost always is indistinguishable from noise. This hypothetical puzzle is not such a situation. A single measurement from a micrometer is very much distinguishable from noise.

        • Richard:

          Agree you will win 5 times out of 6 IFF the “truth” is a random draw from a prior that is flattish over the bulk of N(1,1).

          A reassurance for it being roughly like that would require a large amount of credible background knowledge that you don’t have.

        • To even say “5 times out of 6” is problematic. In a given instance, you will either win (if, after collecting a lot of data, everyone agrees that the “truth” is a positive number) or lose (if, after a lot of data, the truth is a negative number).

          To say 5 times out of 6 you have to define some population over which you will use this procedure. And then, the population of truth values shouldn’t be uniform over the bulk of N(1,1) but rather more or less contained in some reasonably flat distribution over the true values of the problems this procedure will be put to use to detect…

          In which case the Bayesian model is also a Frequency model.

          But if the procedure is “ask some people about what they’re doing, then formulate a prior for that problem, and then collect one data point, and apply the likelihood to that problem and then use the posterior to make bets” then the actual frequency with which you win will be a function of how well you set your priors for each individual problem.

          With this Bayesian procedure, the frequency of actually winning is not the same as the credence you give to the idea that you will win because there is no real world frequency based probability for “the uses you will put your procedure to”… except of course when there is, like if you’re an analytical chemistry lab and doing the same water quality test day in and day out…

        • Richard:

          You speak of “any prior that is flattish over the bulk of N(1,1).” My point is that such a prior is really strong; it assumes a high possibility of very large effects, enough so to yield that very strong 5:1 odds.

          To put it another way: if, after seeing theta-hat = 1 with that model, you’re really willing to bet 5:1 that theta>0, then, yeah, that flat prior might be reasonable. If not, you should be concerned. What I’m saying is that in the typical examples we see, we are not and should not feel comfortable with that 5:1 bet, hence we should not be comfortable with that flat prior.

          Finally, in my example it’s not a single observation. Theta-hat is an estimate. It’s often reasonable to suppose that theta-hat has an approximately normal sampling distribution with standard deviation approximately equal to the standard error.

        • Just to try to clear up some possible confusion: In an earlier comment you wrote:

          > Indeed, the problem is not that the original prior is improper, it’s that it’s too weak.

          You mean a flat prior is weak in the sense that it doesn’t do much shrinkage.

          In this comment you’re saying a flat prior is strong in the sense that it yields a strong conclusion (5:1 odds).

        • Every prior is “strong”, for being that prior and no other. In this discussion we have seen that Uniform(-BIGNUM,BIGNUM) is “strong” because it is so broad, while Uniform(-2,4) is “strong” because it is so narrow. But this deprives the word “strong” of any meaning. What is a “strong” posterior? 5:1 odds is weak for making a one-off high-value decision, and you would be better to hold off, if you can, while you collect better data. But if you’re a professional gambler on the horses, those sorts of odds are literally your bread and butter.

          Is the problem with the original example that in realistic applications of statistics you have no way to express your prior knowledge in terms of exact numbers, and therefore you want your posterior to be robust to any reasonable choice of prior? Then how robust that 5:1 result is will depend on the range of reasonable priors. For the micrometer example, the naked eye gives me no information on the scale of thousandths of an inch, so every reasonable prior is flattish everywhere that matters. In situations where one does use statistics practically, with so little data you could find a reasonable-looking prior to obtain any conclusion you want, and parlaying a single data point into 5:1 odds is not useful.

          The Bayesian machine is indifferent to these considerations. It will answer the question you ask of it.

        • Richard: I just wanted to point out that Andrew has called the flat (or flattish) prior both too weak and too strong in this discussion. I was only trying to clear up some linguistic confusion.

        • Erik: Yes, exactly.

          Richard: It is a Bayesian principle that if a model consistently gives predictions that are wrong in a particular direction, that model should be fixed. If I were to consistently bet on the sign of effects whose estimates are 1 se from 0, I’m pretty sure I’d get the sign wrong more than 1/6 of the time. Hence there’s a problem with the model.

  5. I tend to agree. I’m not opposed to the idea of inverse probability, and to the extent I trust any statistical inference procedure I tend toward Bayesianism. In principle I think it ought to be possible to write down the model and prior that properly captures what the data analyst knows or believes, update that knowledge via Bayes rule, and so on. In practice, I think Bayesian data analysts are prone to specifying models poorly, not doing sufficient model checking, and relying on summary measures (most egregiously, the Bayes factor) that can be very sensitive to uninteresting or unimportant characteristics of the prior, the model and even the data.

    • Danielle:

      Your comment makes me think of an analogy between the following two things:

      – Model checking in Bayesian statistics, and

      – The self-correcting nature of science.

      The story of model checking in Bayesian statistics is that the fact that Bayesian inference can give ridiculous answers is a good thing, in that, when we see the ridiculous answer, this signals to us that there’s a problem with the model, and we can go fix it. This is the idea that we would rather have our methods fail loudly than fail quietly. But this all only works if, when we see a ridiculous result, we confront the anomaly. It doesn’t work if we just accept the ridiculous conclusion without questioning it, and it doesn’t work if we shunt the ridiculous conclusion aside and refuse to consider its implications.

      Similarly with the self-correcting nature of science. Science makes predictions which can be falsified. Scientists make public statements, many (most?) of which will eventually be proved wrong. These failures motivate re-examination of assumptions. That’s the self-correcting nature of science. But it only works if individual scientists do this (notice anomalies and explore them) and it only works if the social structure of science allows it. Science doesn’t self-correct if scientists continue to stand by refuted claims, and it doesn’t work if they attack or ignore criticism.

      • Under the circumstances I think it would be best if I did not comment much on this analogy. Suffice it to say I agree with much of what you’re saying here but would also argue that there are relevant respects in which the two scenarios are quite different. Perhaps another time, or in another forum :-)

        • Danielle:

          Sure, these are two completely different situations. My point with the analogy is that if a system produces failure, this can be bad (if the failures are not recognized as failures, or if the failures are avoided), or it can be good (if the failures are used as opportunities for improvement). Bayesian inference can easily fail spectacularly. I’m with Jaynes that these failures are good news as they can be great opportunities for learning. But if someone were to perform a Bayesian inference, get ridiculous conclusions, and then bet on these conclusions or otherwise act as if they are real, this can cause problems. Or if people perform Bayesian inferences and then just ignore the inferences that don’t make sense, then the result is a sort of minefield where later researchers have to step carefully to avoid the bad conclusions.

          It’s similar with scientific research in general, I think. But I agree that, in other ways, the society of scientific research is different than a single statistical model, Bayesian or otherwise. I don’t want the details of the imperfect analogy to distract from whatever understanding we have of each situation.

        • I’m sure everyone already knows this anecdote: Supposedly some famous physicists — Einstein and Pauli? Bell? Dirac? — were once arguing about the consistency of quantum mechanics. They kept proposing thought experiments, working out the answers, and finding that everything worked out and gave sensible answers. It seemed like an afternoon wasted until they stumbled on two different ways of solving a problem that gave different answers. “Aha!”, said one of them, “now we’re getting somewhere.”

  6. Since we’re interested in P(theta > 0 | y), the value zero is apparently special. To make this explicit, reparameterize theta into its sign and absolute value. Put a uniform prior on the sign, i.e. P(theta<0) = P(theta>0) = 1/2, but don’t put a prior on the absolute value. Now if you observe y=1 then the MLE of |theta| is zero! I think it’s kind of interesting how much information is in that prior on the sign.

  7. Sure. Since there’s a prior on the sign of theta, the only parameter is |theta|. The likelihood is

    lik(|theta|)= 0.5 * phi(y+|theta|) + 0.5 * phi(y-|theta|),

    where phi is the standard normal density. If you put any value for y between -1 and 1 and plot the likelihood, you’ll see it’s decreasing.

    y = 1                                               # the single observation
    abs.theta = seq(0, 2, 0.01)                         # grid of values for |theta|
    lik = 0.5*dnorm(y, abs.theta, 1) + 0.5*dnorm(y, -abs.theta, 1)   # mixture likelihood above
    plot(abs.theta, lik)                                # decreasing in |theta| when -1 < y < 1
    abs.theta[which.max(lik)]                           # MLE of |theta|, which is 0 here

    I could send you the proof if you’re interested.

    • As Carlos points out, there are two parameters; one is p(positive), which has a prior of 0.5, but the posterior will be something else. Without thinking too hard about it I’d expect it to be around the probability that a normal(1,1) draw is positive.

      basically I don’t expect your reparameterization to change the inference at all.

    • > Since there’s a prior on the sign of theta, the only parameter is |theta|. The likelihood is
      > lik(|theta|)= 0.5 * phi(y+|theta|) + 0.5 * phi(y-|theta|),

      That’s not a y ~ N(theta,1) model with “a prior on the sign of theta”. That’s a 50/50 mixture of two normals N(theta,1) and N(-theta,1).

      There is no such thing as “the sign of theta” because both |theta| and -|theta| define the same model.

      That likelihood function is symmetric in both |theta| and y; the sign of the data doesn’t matter either.

  8. > there’s a prior on the sign of theta

    What does that mean? If you mean that there is a parameter with possible values {+,-} and prior {1/2, 1/2} then the posterior won’t be the same.

    • If I’m not mistaken, in the « standard » problem, the likelihood L(theta) is proportional to exp[-1/2*(1-theta)^2] and the MLE is theta=1.

      If we separate the sign and absolute value the likelihood L(sign,abs) is proportional to exp[-1/2*(1-sign*abs)^2] and the MLE is sign=+1, abs=1.

      • Carlos, Daniel:

        Suppose you observe y ~ N(theta,1) and you’re interested in the sign of theta. You can choose not to put a prior on theta and do frequentist inference, for example test H0 : theta > 0. Or you can put a prior on theta (proper or not) and compute P(theta > 0 |y). I just wanted to point out something in between, namely putting a prior on the sign of theta but not its absolute value. This “semi-Bayesian” approach has some interesting features. For instance, contrary to the purely frequentist approach we can talk about P(theta > 0 |y). If you observe y=1, then the MLE of |theta| is zero and hence the MLE of P(theta > 0 |y=1) is 1/2.

        I’m not saying that the semi-Bayesian approach is better or anything. I just think it’s kind of interesting. I got the term “semi-Bayes” from Greenland, S. (1992), “A semi‐Bayes approach to the analysis of correlated multiple associations, with an application to an occupational cancer‐mortality study,” Statistics in Medicine.

        • So in this case it seems like you aren’t creating a continuous parameter that represents the probability that theta is greater than 0, which means that your prior probability of 1/2 remains as the posterior probability as well, a constant no matter how much data you observe. So if you observe the data {10.1,9.9,11.2,8.4,9.2,11.6} you will still say there’s a 50% chance theta is less than 0.

          which is obviously silly.

          rather than this, the only sane model is a likelihood like you wrote, but with the parameter p,

          so L(theta) = p*dnorm(theta,1,1) + (1-p)*dnorm(-theta,1,1)

          where theta is the magnitude parameter, and p is the probability of a positive theta.

          Your model is equivalent to putting a prior on p that is a delta-function at 1/2, whereas a more reasonable prior on p is beta(1,1) or the same as uniform(0,1), where the expected value is 1/2 but it could range over any logically possible value.

          It’s not surprising that if you insist that there will always and forever be a 50% chance that the parameter is negative no matter how much data you collect, that you will find it “kind of interesting how much information is in that prior on the sign.”

  9. No, that’s not what I’m saying. If the observation y is greater than 1 (or less than -1) then the MLE of |theta| is greater than zero. If |y| is very large then the MLE of |theta| will be approximately equal to |y|, and there will be near certainty about the sign of theta.

    • You have only one parameter, the absolute value of theta. How can you learn the sign?

      after seeing data like 900 you may learn that the absolute value of theta is about 900, but you still are 50/50 on it being positive or negative

      • No, that’s not true.

        P(theta > 0 |y) = dnorm(y,|theta|,1) / [ dnorm(y,|theta|,1) + dnorm(y,-|theta|,1) ]

        We get the MLE of P(theta > 0 |y) by plugging in the MLE of |theta|. If the observed |y| is large, then the MLE of |theta| will be approximately equal to |y|. If y is large and positive then P(theta > 0 |y) will be approximately 1, and if y is large and negative it will be approximately 0.
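
        A small grid sketch of that plug-in calculation (hypothetical code, just to show the mechanics):

        plug_in <- function(y) {
          atheta <- seq(0, 10, by = 0.001)                               # grid for |theta|
          lik <- 0.5 * dnorm(y, atheta, 1) + 0.5 * dnorm(y, -atheta, 1)  # mixture likelihood
          a_hat <- atheta[which.max(lik)]                                # MLE of |theta|
          dnorm(y, a_hat, 1) / (dnorm(y, a_hat, 1) + dnorm(y, -a_hat, 1))
        }
        plug_in(1)   # MLE of |theta| is 0, so the plug-in P(theta > 0 | y) is 0.5
        plug_in(3)   # MLE of |theta| is about 3, so the plug-in probability is essentially 1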

        • Another way to put this: there’s no possibility for the probability of the sign to be negative vs positive to ever change in your model, it’s not a parameter that has any uncertainty associated with it, it’s a constant.

        • let sign be either -1 or 1 with 50/50 chance. Let atheta be the absolute value parameter. Let p(y | atheta,sign) be the probability of seeing the data if atheta,sign are true.

          p(y | atheta,sign) = dnorm(y,sign*atheta,1)

          p(atheta,sign | y , background) \propto p(y | atheta,sign,background)p(atheta | background)p(sign | background)

          suppose p(atheta | background) = uniform(0,maxfloat)… it’s a constant… drop it

          p(atheta,sign=1 | y,background) \propto dnorm(y,atheta,1) * 0.5
          p(atheta,sign=-1 | y,background) \propto dnorm(y,-atheta,1) * 0.5

          p(atheta,sign=1 | y,background) = dnorm(y,atheta,1)*0.5/ (dnorm(y,atheta,1)*0.5 + dnorm(y,-atheta,1)*0.5)

          drop the 0.5s…

          so for y large and positive, only the first one has any component and the answer is as you said for the atheta, the MAP estimate is 0.

          However, the probability for theta itself looks like a mixture model with two bumps, one around 1, and another around -1.

          The MAP estimate for theta itself doesn’t change, I think, since sign=1 is the MAP for sign, and when we plug that in, p(theta | y) = dnorm(y,theta,1) which has a bump around theta and is truncated at theta=0.
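
          A small grid check of the numbers in this sub-thread (a wide uniform grid standing in for the flat prior on atheta):

          y <- 1
          atheta <- seq(0, 30, by = 0.001)              # grid over the magnitude |theta|
          post_pos <- 0.5 * sum(dnorm(y,  atheta, 1))   # unnormalized posterior mass for sign = +1
          post_neg <- 0.5 * sum(dnorm(y, -atheta, 1))   # unnormalized posterior mass for sign = -1
          post_pos / (post_pos + post_neg)              # Pr(sign = +1 | y), about 0.84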

        • Sure. If you let sign be either -1 or 1 with probability 1/2 and you put the uniform(0,infty) prior on atheta, then that’s the same as the usual flat prior on theta.

          But the semi-Bayesian model I’m talking about does ~not~ have a prior on atheta. So while sign is random, atheta is not. We can talk about p(sgn=1|y) but not about p(atheta|y).

        • I really don’t know what you mean. I think you’re talking about a hybrid model in which you allow that 50/50 on the sign is a reasonable thing, but you are only allowing inference on atheta using maximum likelihood.

          In that model, you can’t do coherent inference on the sign either. I mean, in the Bayesian model you get about 80-90% probability for a positive theta after seeing 1. But in your model, you have to plug in atheta=0 since it’s the max likelihood estimate. At this point you have theta=0 precisely, there is no probability on the sign…

          basically that kind of model is incoherent.

          also, when you supply the prior on atheta, you don’t get the inference on the theta that’s the same as the uniform prior… because you have this little shadow lump on the symmetric side that only asymptotically for large theta goes away.

        • Yes, I just put 50/50 on the sign of theta. Then I have a (single) observation from

          f(y | atheta)= 0.5*dnorm(y,atheta,1) + 0.5*dnorm(y,-atheta,1).

          Now I can do frequentist inference – nothing special. For example, I can try to estimate atheta or p(sign=1|y=1,atheta). Maybe even put confidence intervals around them. Again, I’m not saying this is a great model though.

          I’m not sure, but I don’t think you’re right about that “shadow lump”. You can think about a prior that’s symmetric around zero as independently choosing the sign (50/50) and the magnitude.

        • RE the shadow lump… well I don’t know what all you’re up to with your inference procedure, but at least in the Bayesian model with a flat prior on atheta, you always have

          p(y | atheta, sign)

          as two different functions depending on whether sign=1 or sign=-1… unless p(sign=1) = 1 or p(sign=-1) = 1 then you’ve got some posterior probability in a lump symmetrically opposite whatever the main lump is.

          so, if you see y = 900 you’ll have essentially nothing… but if you see y=1 you’ll have posterior predictive distribution for future data which are 80 or 90% distributed around the area of 1, and 10-20% distributed around the area of -1

          that’s what posterior probability of say 80% that sign greater than 0 means… it means you’ve got future data predicted to be out around the region of atheta 80%, and around the region of -atheta 20%

          until posterior probability of sign is near 1, you’ll always have that shadow lump in your posterior predictive distribution.

  10. Here’s a contextualized example of this problem that’s interesting to think about. Say we randomly sample a school and we’re interested in the probability that the average IQ of this school is higher than 100 (the known standard deviation being 15). We test one student and get 115. So we have:

    y ~ normal(theta, 15)
    y1 = 115
    P(theta > 100|y1)?

    So the flat prior gives P ~ 0.84 which indeed seems silly. But what would be a better prior? Given that there are schools for the gifted as well as schools for the intellectually challenged, we shouldn’t pick something too narrow. But if we go with N(100, 20) or something similar we still get P ~ 0.80, so not that different.

    What would be a good prior in this example? I suppose something multimodal (corresponding to normal schools, schools for the gifted, and schools for the intellectually challenged) but relatively narrow around the modes?
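
    For what it’s worth, here is the conjugate normal-normal update behind those numbers (using the N(100, 20) prior suggested above):

    y <- 115; sigma <- 15                        # one student, known sd
    prior_mean <- 100; prior_sd <- 20
    post_var  <- 1 / (1 / prior_sd^2 + 1 / sigma^2)
    post_mean <- post_var * (prior_mean / prior_sd^2 + y / sigma^2)
    1 - pnorm(100, post_mean, sqrt(post_var))    # Pr(theta > 100 | y), about 0.79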

    • Although the standard deviation of the population is 15, the standard deviation of the school is not necessarily that at all, nor is the probability of the data particularly symmetric around an average for a school for the gifted or for the challenged… gifted schools should tail off to the right, and be truncated on the left… challenged schools opposite.

      so it’s not just the prior that’s problematic here, but also the likelihood.

      • Yeah, I kept the SD fixed as in the original problem which is indeed not realistic, although I’m not sure how important that is here. Good point about the likelihood.

  11. You say “the people who say that Bayesian reasoning is just rational thinking, or that rational thinking is necessarily Bayesian” and contrast this with your view that “Bayesian inference is a tool. It solves some problems but not all”.

    This reminds me of an essay from last year by one such Bayesian, called “Toolbox-thinking and Law-thinking”, which tries to point at the source of such disagreements directly.

    Key quote:

    > I’ve noticed a dichotomy between “thinking in toolboxes” and “thinking in laws”.
    >
    > Msr. Toolbox: “It’s important to know how to use a broad variety of statistical tools and adapt them to context. The many ways of calculating p-values form one broad family of tools; any particular tool in the set has good uses and bad uses, depending on context and what exactly you do. Using likelihood ratios is an interesting statistical technique, and I’m sure it has its good uses in the right contexts. But it would be very surprising if that one weird trick was the best calculation to do in every paper and every circumstance. If you claim it is the universal best way, then I suspect you of blind idealism, insensitivity to context and nuance, ignorance of all the other tools in the toolbox, the sheer folly of callow youth. You only have a hammer and no real-world experience using screwdrivers, so you claim everything is a nail.”
    >
    > Msr. Lawful: “On complex problems we may not be able to compute exact Bayesian updates, but the math still describes the optimal update, in the same way that a Carnot cycle describes a thermodynamically ideal engine even if you can’t build one. You are unlikely to find a superior viewpoint that makes some other update even more optimal than the Bayesian update, not without doing a great deal of fundamental math research and maybe not at all. We didn’t choose that formalism arbitrarily! We have a very broad variety of coherence theorems all spotlighting the same central structure of probability theory, saying variations of ‘If your behavior cannot be viewed as coherent with probability theory in sense X, you must be executing a dominated strategy and shooting off your foot in sense Y’.”
    >
    > I currently suspect that when Msr. Law talks like this, Msr. Toolbox hears “I prescribe to you the following recipe for your behavior, the Bayesian Update, which you ought to execute in every kind of circumstance.”

    Post is by Eliezer Yudkowsky, here: https://www.lesswrong.com/posts/CPP2uLcaywEokFKQG/toolbox-thinking-and-law-thinking

  12. “What Bayes theorem cannot do is actually perform the function that scientists and philosophers who call themselves Bayesian say it can: to be a philosophy of inferring the best explanation. It cannot possibly create new explanations (which is, and should be the focus of science as much as gathering new evidence) and nor can it tell us what we should do. If we have a problem and we have no actual solution to it, Bayes’ theorem cannot possibly help. All it can do is assign probabilities to existing ideas (none of which are regarded as actual solutions). But why would one want to assign probabilities to possible solutions, none of which are known to work? There can be no reason other than if one wanted, to say, wager on which idea is likely to be falsified first, perhaps. But we must know – following Faraday and Popper, and Feynman and Deutsch: we must expect all of them will be falsified eventually. Your theories should be held on the tips of your fingers, so said Faraday, so that the merest breath of fact can blow them away. So no amount of assigning 99% probabilities to the truth of them makes them anymore “certain” or “likely to be true”. We need to have a pragmatic approach: take the best theory seriously as an explanation of reality and use it to solve problems and create solutions and technology – but don’t pretend that the content is “certain” to any degree. Just useful with some truth more than those other theories that have gone before and fallen to the sword of criticism and testing.

    When we have actual solutions in science they go by a generic honorific title. We call them “The scientific theory of…”. So for example we have “The scientific theory of gravity” (it’s given name is General Relativity). We don’t need to assign a probability to it being true. We regard it as provisionally true knowing it is superior to all other rivals (insofar as there are any (and there are not!)) and we use it as if it’s true (this is pragmatic). But actually we expect that one day we will find it false. Just as we did with Newton. But this philosophy that our best theories are likely misconceptions in some way has no practical effect on what we do with them. We take them seriously as conditional truths about the world. As David Deutsch has said: it would have been preferable if long ago we’d all just decided to call scientific theories “scientific misconceptions” instead. It would save much in the way of so many of these debates. We’d all know that our best explanations, though better and closer to true than others that went before, are nonetheless able to be superseded by better ideas eventually.”
    http://www.bretthall.org/bayesian-epistemology.html

  13. > Go around offering 5:1 bets based on pure noise and you’ll go bankrupt real fast

    If, after the first bet, you have lost, wouldn’t you update your estimate of the posterior probability and change the odds accordingly?

    • Andrew (not Gelman):

      Yes, I’d hope that people would do so! But standard Bayesian practice is to use weak priors unless absolutely forced to do otherwise. Again, practitioners avoid the worst excesses by simply not focusing on obviously stupid bets such as the 5/6 probability of theta>0 based on an estimate that’s one standard error from zero—but the concern remains that other bets, just as stupid but not so obviously stupid, are being made all the time.

      It’s like . . . remember that study from a few years ago claiming that North Korea was more democratic than North Carolina? What a joke that was. But what did the researchers do on that one? They just removed North Korea and maybe a couple other countries from their dataset. They didn’t use their failure with that country to face up to the underlying problems with their methods; instead, they brushed aside the obvious failure and moved on.

  14. > remember that study from a few years ago claiming that North Korea was less democratic than North Carolina?

    That’s indeed a ridiculous claim. It’s obvious that North Korea is more democratic, it’s even in the name of the country.

  15. I agree with pretty much all of this posting. Of course Bayesians respond that the prior is bad and we need information to create a better prior. That’s fair enough. However I think they underestimate how difficult this often is in practice, particularly if you don’t want to go for the first impulsive idea you have like “I know it can’t be negative so an exponential(1) prior could be fine” but actually try to put together all information you have in a sensible way and on top of that try to convince yourself that you couldn’t get fundamentally different outcomes with other priors that are compatible with the same information.

    Most “information” doesn’t come in the form of probabilities for some parameters to be true or even indirectly tell you how bets on future events should be affected by it. Some information has to do with what the analyst wants (e.g. how much of a problem it is in practice to have too many variables in a regression model – which is not only about overfitting and prediction errors), some other is very imprecise – chances are missing values are almost always NMAR but very often there is very little information about how they deviate from MAR etc.

    I appreciate a thought-through and well-argued Bayesian model that comes with some sensitivity analysis, but we rarely see that, and surely building one is a very sophisticated task that cannot be done by many people who want to run statistical analyses – who then will do either NHST or bad Bayes…

  16. I have come to think about this entry only two years after it was originally published, but here go my comments:

    1) The first one is that under a flat improper prior, an observed value of 1 (or, for that matter, an observed value in [-100, 100]) is highly unlikely, practically an outlier. I do not know what a typical value under that prior would be; maybe something like 1.82343e233. The “ridiculous inference” would only show up with probability ~ 0.

    I could devise a simple model about coin tossing and make up a thought experiment in which I, say, toss the coin 100 times and get 100 heads. I’d surely get paradoxical inferences and would be forced to rethink my priors, my model, etc. But it is a pseudo-problem: I would never realistically get 100 heads in 100 tosses.

    2) I have always felt that simple, standard Bayesian models (normal with known sigma as here, beta-binomial, etc.) tend to converge too fast, i.e., that the posterior variance decreases faster than most people’s intuition would say. If the maths behind Bayes theorem were such that the posterior in this example happened to be N(1, 10), the “ridiculous inference” would be no more. Much of my effort while building Bayesian models is to try to find models that would inflate the variance of my estimations up to reasonable, expected values. E.g., say that I observe people who buy/do not buy a product; a simple beta-binomial model would soon give me a very tight estimate of the “actual” proportion of buyers; but I know it is nuts as, e.g., the daily variance is much higher than what this simple model would provide.
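
    To illustrate the beta-binomial point with made-up numbers (the counts are hypothetical):

    buyers <- 300; n <- 1000                             # hypothetical data, beta(1,1) prior
    qbeta(c(0.025, 0.975), 1 + buyers, 1 + n - buyers)   # posterior 95% interval, about (0.27, 0.33)

    The interval is already down to a few percentage points, even if the day-to-day proportion really moves around more than that.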

    • Regarding your first example: Yes, we are in agreement. The uniform prior makes no sense. But practitioners use it all the time. It can be useful as long as certain aspects of the posterior distribution are not taken too seriously. That’s my point, that Bayesians (including myself) are not typically fully “Bayesian” in their thinking, in that we work with procedures that are not truly coherent, regularly produce inferences that we don’t believe, etc.

      Regarding your second example: Yes, I think the problem here is that real-world parameters (except in some very well-defined areas of physics, chemistry, etc.) tend to vary over time and across scenarios. If you get data with big N, this will typically involve collecting data from many people over some relatively long time period, so that the “theta” you think you’re estimating is actually unstable. This can be captured in a hierarchical model, but there are always sources of variation that our model does not account for.
