Should we worry about rigged priors? A long discussion.

Posted on October 4, 2017 9:55 AM by Andrew

Today’s discussion starts with Stuart Buck, who came across a post by John Cook linking to my post, “Bayesian statistics: What’s it all about?”. Cook wrote about the benefit of prior distributions in making assumptions explicit.

Buck shared Cook’s post with Jon Baron, who wrote:

My concern is that if researchers are systematically too optimistic (or even self-deluded) about the prior evidence, which I think is usually the case, then using prior distributions as the basis for their new study can lead to too much statistical confidence in the study’s results. And so it could compound the problem.

Stuart Buck asked what I would say to this, and I replied:

My response to Jon is that I think all aspects of a model should be justified. Sometimes I speak of there being a “paper trail” of all modeling and data-analysis decisions. My concern here is not so much about p-hacking etc. but rather that people can get wrong answers because they just use conventional modeling choices. For example, in those papers on beauty and sex ratios, the exciting but wrong claims can be traced to the use of a noninformative uniform prior on the effects, even though there’s a huge literature showing that sex ratios vary by very little. Similarly in that ovulation-and-clothing paper: for the data to have been informative, any real effect would have had to be huge, and this just makes no sense. John Carlin and I discuss this in our 2014 paper.

To address Jon’s concern more directly: Suppose a researcher does an experiment and he says that his prior is that the new treatment will be effective, for example his prior dist on the effect size is normal with mean 0.2 and sd 0.1, even before he has any data. Fine, he can say this, but he needs to justify this choice. Just as, when he supplies a data model, it’s not enough for him just to supply a vector of “data,” he also needs to describe his experiment so we know where his data came from. What’s his empirical reasoning for his prior? Implicitly if he gives a prior such as N(0.2, 0.1), he’s saying that in other studies of this sort, real effects are of this size. That’s a big claim to make, and I see no reason why a journal would accept this or why a policymaker would believe it, if no good evidence is given.

Stuart responded to me:

“Implicitly if he gives a prior such as N(0.2, 0.1), he’s saying that in other studies of this sort, real effects are of this size.”

Aha, I think that’s just the rub – what are “real” effects as opposed to the effects found in prior studies? Due to publication bias, researcher biases, etc., effects found in prior studies may be highly inflated, right? So anyone studying a particular social program (say, an educational intervention, a teen pregnancy program, a drug addiction program, etc.) might be able to point to several prior studies finding huge effects. But does that mean the effects are real? I’d say no. Likely the effects are inflated.

So if the prior effects are inflated, how would that affect a Bayesian analysis of a new study on the same type of program?

I replied: Yes, exactly. Any model has to be justified. For example, in that horrible paper purporting to estimate the effects of air pollution in China (see figure 1 here), the authors should have felt a need to justify that high-degree polynomial; actually, the problem is not so much with a high-degree curve but with the unregularized least-squares fit. It’s not enough just to pick a conventional model and start interpreting coefficients. Picking a prior distribution based on biased point estimates from the published literature, that’s not a good justification. One of the advantages of requiring a paper trail is that then you can see the information that people are using to make their modeling decisions.

Stuart followed up:

Take a simpler question (as my colleague primarily funds RCTs) — a randomized trial of a program intended to raise high school graduation rates. 1,000 kids are randomized to get the program, 1,000 are randomized into the control, and we follow up 3 years later to see which group graduated more often.

The simplest frequentist way to analyze that would be a t-test of the means, right? Or just a simple regression — Y (grad rate) = alpha + Beta * [treatment] + error.

If you analyzed the RCT using Bayesian stats instead, would your ultimate conclusion about the success of the program be affected by your choice of prior, and if so, how much? My colleague has the impression that a researcher who is strongly biased in favor of that program would somehow use Bayesian stats in order to “stack the deck” to show the program really works, but I’m not sure that makes sense.

I replied: The short story is that, yes, the Bayesian analysis depends on assumptions, and so does the classical analysis. I think it’s best for the assumptions to be clear.

Let’s start with the classical analysis. A t-test is a t-test, and a regression is a regression, no assumptions required, these are just data operations. The assumptions come in when you try to interpret the results. For example, you do the t-test and the result is 2.2 standard errors away from 0, and you take that as evidence that the treatment “works.” That conclusion is based on some big assumptions, as John Carlin and I discuss in our paper. In particular, the leap from “statistical significance” to “the treatment works” is only valid when type M and type S errors are low—and any statement about these errors requires assumptions about effect size.

Let’s take an example that I’ve discussed a few times on the blog. Gertler et al. ran a randomized experiment of an early childhood intervention in Jamaica and found that the treatment raised earnings by 42% (the kids in the study were followed up until they were young adults and then their incomes were compared). The result was statistically significant so for simplicity let’s say the 95% confidence interval is [2%, 82%]. Based on the classical analysis, what conclusions are taken from this study? (1) The treatment works and has a positive effect. (2) The estimated treatment effect is 42%. Both these conclusions are iffy: (a) Given the prior literature (see, for example, the Charles Murray quote here), it’s hard to believe the true effect is anything near 42%, which suggests that Type M and Type S errors in this study could be huge, implying that statistical significance doesn’t tell us much; (b) The Gertler et al. paper has forking-path issues so it would not be difficult for them to find a statistically significant comparison even in the absence of any consistent true effect; (c) in any case, the 42% is surely an overestimate: Would the authors or anyone else really be willing to bet that a replication would achieve such a large effect?
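To make the type M and type S point concrete, here is a rough sketch of the calculation in Python. It assumes the estimate is normally distributed around the true effect, backs out a standard error of about 20 percentage points from the illustrative interval [2%, 82%], and plugs in a hypothetical true effect of 5%. That 5% and the function name `type_s_m` are assumptions made up for illustration, not numbers from the Gertler et al. paper.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def type_s_m(true_effect, se, z_crit=1.96):
    """Return (power, type S rate, exaggeration ratio) for a positive true effect.

    Assumes estimate ~ normal(true_effect, se) and that "significant"
    means |estimate / se| > z_crit.
    """
    lam = true_effect / se
    p_hi = 1.0 - norm_cdf(z_crit - lam)  # significant, correct sign
    p_lo = norm_cdf(-z_crit - lam)       # significant, wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power
    # E[|Z|; |Z| > z_crit] for Z ~ normal(lam, 1), in closed form
    mean_abs = (lam * p_hi + norm_pdf(z_crit - lam)
                - lam * p_lo + norm_pdf(-z_crit - lam))
    exaggeration = (mean_abs / power) / lam
    return power, type_s, exaggeration


# Standard error implied by the illustrative interval [2%, 82%]
se = (82.0 - 2.0) / (2 * 1.96)
# Hypothetical true effect of 5% (an assumption, chosen only for illustration)
power, type_s, exaggeration = type_s_m(5.0, se)
print(f"power = {power:.3f}")
print(f"type S rate = {type_s:.3f}")
print(f"exaggeration ratio = {exaggeration:.1f}")
```

Under these assumed inputs the power is only a little above 5%, a statistically significant estimate has the wrong sign roughly a quarter of the time, and significant estimates overstate the true effect several-fold. Different assumed effect sizes give different numbers, which is the point: the interpretation of "statistically significant" depends on them.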

So my point is that the classical inferences—the conclusion that the treatment works and the point estimate of the effect—are strongly based on assumptions which, in conventional reporting, are completely hidden. Indeed I doubt that Gertler et al. themselves are aware of the assumptions underlying their conclusions. They correctly recognize that the mathematical operations they apply to their data—the t-test and the regression—are assumption-free (or, I should say, rely on very few assumptions). But they don’t recognize that the implications they draw from their statistical significance depend very strongly on assumptions which, in their example, are difficult to justify. If they were required to justify their assumptions (to make a paper trail, as I put it), they might see the problem. They might recognize that the strong claims they draw from their study are only justifiable conditional on already believing the treatment has a very large and positive effect.

OK, now on to the Bayesian analysis. You can start with the flat-prior analysis. Under the flat prior, a statistically significant difference gives a probability greater than 97.5% that the true effect is in the observed direction. For example in that Gertler et al. study you’d be 97.5%+ sure that the treatment effect is positive, and you’d be willing to bet at even odds that the true effect is bigger or smaller than 42%. Indeed, you’d say that the effect is as likely to be 82% as 2%. That of course is ridiculous: a 2% or even a 0% effect is quite plausible, whereas an 82% effect, even if it might exist in this population for some unlikely historical reason, is not plausible in any larger context. But that’s fine, this tells us that we have prior information that’s not included in our model. A more plausible prior might have a mean of 0 and a standard deviation of 10%, or maybe some longer-tailed distribution such as a t with low degrees of freedom with center 0 and scale 10%. I’m not sure what’s best here, but one could make some prior based on the literature. The point is that it would have to be justified.

Now suppose some wise guy wants to stack the deck by, for example, giving the effect size a prior that’s normal with mean 20% and sd 10%. Well, the first thing is that he’d have to justify that prior, and I think it would be hard to justify. If it did get accepted by the journal reviewers, that’s fine, but then anyone who reads the paper would see this right there in the methods section: “We assumed a normal prior with mean 20% and sd 10%.” Such a statement would be vulnerable to criticism. People know about priors. Even a credulous NPR reporter or a Gladwell would recognize that the prior is important here! The other funny thing is, in this case, such a prior is in some ways an improvement upon the flat prior in that the estimate would be decreased from the 42% that comes from the flat prior.
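Here is a minimal sketch of how the priors mentioned above would move the estimate, using the standard normal-normal conjugate update. It assumes the 42% estimate is normal with a standard error backed out from the illustrative interval [2%, 82%]; the function name `normal_posterior` is made up for this example.

```python
import math


def normal_posterior(estimate, se, prior_mean, prior_sd):
    """Conjugate normal-normal update: return (posterior mean, posterior sd)."""
    data_prec = 1.0 / se ** 2
    prior_prec = 1.0 / prior_sd ** 2
    post_var = 1.0 / (data_prec + prior_prec)
    post_mean = post_var * (data_prec * estimate + prior_prec * prior_mean)
    return post_mean, math.sqrt(post_var)


# Illustrative numbers from the post: estimate 42%, 95% interval [2%, 82%]
estimate = 42.0
se = (82.0 - 2.0) / (2 * 1.96)  # assumes the interval is estimate +/- 1.96 se

priors = {
    "flat": None,                    # posterior is just the likelihood
    "normal(0, 10)": (0.0, 10.0),    # skeptical prior suggested in the post
    "normal(20, 10)": (20.0, 10.0),  # the wise guy's stacked prior
}
for label, prior in priors.items():
    if prior is None:
        mean, sd = estimate, se
    else:
        mean, sd = normal_posterior(estimate, se, *prior)
    print(f"{label:16s} posterior mean {mean:5.1f}%, sd {sd:4.1f}%")
```

Under these assumptions the stacked normal(20, 10) prior gives a posterior mean of roughly 24%, well below the flat-prior 42%, and the normal(0, 10) prior pulls the estimate down to roughly 8%. A long-tailed t prior would not have this closed form and would need simulation or numerical integration.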

So I think my position here is clear. Sure, people can stack the deck. Any stacking should be done openly, and then readers can judge the evidence for themselves. That would be much preferable to the current situation in which inappropriate inferences are made without recognition of the assumptions that justify them.

At this point Jon Baron jumped back in. First, where I wrote above “Even a credulous NPR reporter or a Gladwell would recognize that the prior is important here!”, Baron wrote:

I’m not sure it wouldn’t fly under the radar just like the other assumptions in Gertler’s study that make its findings unreliable—I think the Heckmans and many other wishful thinkers on early childhood programs would say that the assumption about priors is fully justified.

I replied: Sure, maybe they’d say so, but I’d like to see that claim in black and white in the paper: then I could debate it directly! As it is, the authors can implicitly rely on such a claim and then withdraw it later. That’s the problem I have with these point estimates: the point estimate is used as advertising but then if you question it, the authors retreat to saying it’s just proof of an effect.

That happened with that horrible ovulation-and-clothing paper: my colleague and I asked how anyone could possibly believe that women are 3 times as likely to wear red on certain days of the month, and then the authors and their defenders pretty much completely declined to defend that factor of 3. I have this amazing email exchange with a psych prof who was angry at me for dissing that study: I asked him several times whether he thought that women were actually 3 times more likely to wear red on these days, and he just refused to respond on that point.

So, yeah, I think it would be a big step forward for these sorts of quantitative claims to be out in the open.

Second, Baron followed up my statement that “such a prior [normal with mean 20% and sd 10%] is in some ways an improvement upon the flat prior in that the estimate would be decreased from the 42% that comes from the flat prior,” by asking:

What about the not-unrealistic situation where the wishful thinker says the prior effect size is 30% (based on Perry Preschool and Abecedarian etc.) and his new study comes in with an effect size of, say, 25%. Would the Bayesian approach be more likely to find a statistically significant effect than the classical approach in this situation?

My reply: Changing the prior will change the point estimate and also change the uncertainty interval. In your example, if the wishful thinker says 30% and the new study estimate says 25%, then, yes, the wise guy will feel confirmed. But it’s the role of the research community to point out that an appropriate analysis of Perry, Abecedarian, etc., does not lead to a 30% estimate!

84 thoughts on “Should we worry about rigged priors? A long discussion.”

Dale Lehman on October 4, 2017 at 10:58 am said:

For me (not trained in Bayesian techniques), this was an excellent post. It makes clear how a Bayesian approach differs from the classical approach – albeit without the mechanical details about how they differ. And, I am in complete agreement about the value and relative superiority of making assumptions explicit and open to debate and criticism. But I think this points to the real reason why making this shift is so difficult and part of the reason why entrenched interests manage to hold on so long to their existing approaches.

Explicit statements about priors seem to undermine the very prestige and mystique of science. I’ll anticipate Andrew’s (and others’) response here – and I agree with it – that scientific methodology is exactly about making such explicit assumptions and evaluating them. But something is lost in the translation. To make this concrete – let’s say that prior studies show that gender differences in educational attainment rarely differ by more than 5%. Somebody tries a novel educational approach and finds a difference of 50%. We should be skeptical and the suggested Bayesian approach will cause us to be so. However, the researcher may be convinced that their intervention is surely superior to anything anyone else has tried – hence, the prior is mostly irrelevant. According to Andrew’s reasoning above, that’s fine for them to believe, but what is the evidence? It’s a fair and necessary question. But so is the belief that new approaches should not be handcuffed by evidence on past relatively inferior methods. Past experiments probably had any number of features that undermined their effectiveness. The researcher, with their new approach, may feel fully justified in wanting to distance themselves from past poorly executed methods.

Again, the response will be, and should be, make your assumptions explicit and justify them. I agree, but I think this is much of the resistance to this approach. It makes science look like all other reasoning – namely, human. Science has been on a pedestal (at least to non-scientists) that is somehow above human subjective judgements. Now, it will be just another human fallible way to think about the world. In the skeptical, post-modern world, this tumbling from the pedestal will only be welcomed by the most hardy (and best) scientists.

- Jonathan (another one) on October 4, 2017 at 11:16 am said:
  
  To amplify this slightly, in the expert witness business the expert’s opinions are automatically suspect, and indeed, the other side will have an expert who thinks your reading of the literature is “fatally” flawed. (There seem to be no other types of flaws.) What’s an outside nonexpert to do in adjudicating between these two views? While I agree in principle with the “paper trail” theory, the advantage of the classical approach is summarized as: “Nobody cares what you think. What do the data say without your finger on the scale?” The fallacies of post-t-test inference are of course still fair game for the opposing expert, but priors are often harder to justify than data. (What makes the beauty-sex ratio example so compelling is that it’s one of the times that a robust literature actually supports a specific prior.)
  
  - Daniel Lakeland on October 4, 2017 at 12:11 pm said:
    
    The vast majority of classical regression analyses are equivalent to Bayes with a flat prior. On a computer a flat prior is uniform over a finite floating-point interval. Why is it that we feel that a prior specifying a 99.9% chance an effect size has absolute value greater than 10^305 or so is “not having your thumb on the scale”?
    
    - Jonathan (another one) on October 4, 2017 at 12:22 pm said:
      
      Because this prior is “automatic.” (I agree with you by the way.)
    - Daniel Lakeland on October 4, 2017 at 3:57 pm said:
      
      It may be automatic, but it automatically assumes the value in question is ridiculously enormous. That can’t be a good thing. Yes I know you’ve already agreed with me, but I think it’s good to push this point. A maximum likelihood analysis with a flat-floating-point prior *is* a Bayesian analysis with an obviously stupid prior that is in essence intentionally over-inflating estimate magnitudes.
      
      So, this is a very powerful argument against someone who wants you to remove the prior and do the analysis that way “for objectivity” or something. Objectively, I’ve never *once* estimated a parameter in a real-world statistical analysis that had a value anywhere near 10^308 and it’s objectively stupid to expect that ahead of time it would.
    - Wayne on October 4, 2017 at 4:22 pm said:
      
      My intuition agrees with you. What holds me back is that I’m pretty intuitive and not as mathematically rigorous as many in these here parts, and so I hesitate to say what I want to say: “As a frequentist, you think you don’t have priors, but in fact you have an implicit improper, flat prior that says that any value — ANY value — is equally likely which no one actually believes once it’s made explicit.” I’m not sure that “vast majority of classical regression analysis” is broad enough to make that statement. What about exceptions that are not in the majority? What about non-regression-based analysis? Does “equivalent” really mean the same or that in the end, under certain constraints, the numbers turn out to be similar?
      
      If you can reassure me, I’m all on board. That’s what my intuition believes, and it seems reasonable to me that no one actually believes that 20-foot-tall humans just might show up in our data so we can’t rule them out with mysterious and confusing priors and must let the data speak for itself.
    - Daniel Lakeland on October 4, 2017 at 5:57 pm said:
      
      Mathematically, any maximum likelihood based model including any least squares model (which is the same as maximum likelihood under normal errors) counts. The Bayesian posterior is p(Data | Parameters) p(Parameters | Background)/Z where Z is a constant normalizing factor. If you do a maximum likelihood analysis you pick out Parameters such that p(Data | Parameters) is maximized, ignoring p(Parameters | Background). Well, ignoring it in a floating point computational analysis gives exactly the same result as assuming a uniform(-MaxFloat,MaxFloat) prior and taking the MAP estimate, because that uniform prior is a constant and so when multiplied into the expression it just winds up altering the Z value; it doesn’t affect the inference at all.
      
      What doesn’t so obviously have this flavor is stuff like simulation based p-value stuff: randomization / permutation tests etc etc. In these cases the inference isn’t obviously based on a clear likelihood and so I have to give that caveat. Still, everywhere people are doing least squares, or maximum likelihood on a modern computer, the mapping to a Bayesian model with a uniform(-MaxFloat,MaxFloat) proper prior applies.
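      In symbols, using the same notation as above and treating a single scalar parameter for simplicity (a simplification added here as a sketch), the uniform(-MaxFloat,MaxFloat) prior has constant density 1/(2 MaxFloat), so

      $latex \displaystyle p(\mathrm{Parameters} \mid \mathrm{Data}) = \frac{p(\mathrm{Data} \mid \mathrm{Parameters}) \cdot \frac{1}{2\,\mathrm{MaxFloat}}}{Z} \propto p(\mathrm{Data} \mid \mathrm{Parameters})$

      and therefore

      $latex \displaystyle \arg\max_{\mathrm{Parameters}} \, p(\mathrm{Parameters} \mid \mathrm{Data}) = \arg\max_{\mathrm{Parameters}} \, p(\mathrm{Data} \mid \mathrm{Parameters}),$

      with the maximization restricted to the representable range. The constant rescales the posterior without moving the location of its maximum.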
    - Jonathan (another one) on October 4, 2017 at 8:11 pm said:
      
      But of course “flat over the realistic possibilities” has exactly the same math. The 10^308 is a bit of red herring, no?
    - Andrew on October 4, 2017 at 8:33 pm said:
      
      Wayne:
      
      You write that you want to “let the data speak for itself.” The trouble is that sometimes our data are weak. Consider the example from section 2.1 of this paper, where the data alone are consistent with a wide range of treatment effects including some that are highly implausible.
      
      More generally, when we have a lot of data we ask more questions, so we’re always pushing against data limitations. We can use MRP to estimate state-level opinion without a large sample size in every state.
    - Daniel Lakeland on October 4, 2017 at 9:09 pm said:
      
      Jonathan: No I don’t think it’s a red herring. Sure of course the likelihood will rapidly reduce the range from +- 10^308 down to something way smaller, the point is that if you use likelihood inference you *are* using a prior even if you think you’re not, it’s just one that is totally and unambiguously wrong and supplied by the IEEE committee on floating point. Even a uniform prior on a much smaller range is wrong, for example uniform(-10,10) for a height of a person in feet is unambiguously wrong too. The appropriate prior can’t have ANY mass less than 0 ft. uniform(0,9) is maybe not sooo wrong, but it’s still wrong, because the probability that a person is between 0 and 1 ft tall is clearly not the same as the probability that the person is 5 to 6 ft tall, and again not the same as a person is 8 to 9 ft tall… These uniform priors are just not reasonable priors.
      
      However, if you think that you get away from having a prior by ignoring it and focusing only on the likelihood… you’re wrong. The thing is, the person doing likelihood based models *combined with maximum likelihood estimation*, collapses their estimate to a point, and so the effect of the flat prior outside a neighborhood of this maximum is less noticeable to the user of the method. But point estimation has serious problems with over-fitting and so forth.
    - Phil on October 5, 2017 at 5:27 am said:
      
      Responding to Jonathan (another one), who says ‘But of course “flat over the realistic possibilities” has exactly the same math. The 10^308 is a bit of red herring, no?’
      
      I wouldn’t call it a red herring: it’s a legitimate way to get people to realize that a flat prior over the entire real line does not make sense. Once they concede this — no, I don’t really think the effect could be 10^308, or 10^30.8, or even 10^3.08 perhaps — then you can say “how do you know”, and “what’s the biggest you think it could be and why do you think that”, and so on.
      
      As for flat priors in general, no, you don’t get exactly the same math. For instance, consider the case that the infinite-flat-prior estimate is near the upper border of the range of realistic possible values: it’s plausible, but barely. But there’s some uncertainty, so 20% of the probability is in an implausible range of parameter space. If you impose a flat prior that is bounded to the realistic values, that 20% gets moved down into the plausible range.
      
      But: there are nearly no cases where your prior information would justify a flat prior. Suppose you think such-and-such an effect must be between -30% and +30%. Could this really happen? You think 29.999999 is plausible, but 30.0000000001 is not?
    - Wayne on October 5, 2017 at 8:44 am said:
      
      Andrew,
      
      Actually, I phrased that last sentence poorly. I hear “let the data speak for itself” a lot, and like you I disagree with it, in two ways:
      
      In a Bayesian/Frequentist context I prefer Bayesian which says that we need to make prior knowledge (common wisdom, our assumptions, etc) explicitly part of the model and then let the data push things around, speaking more loudly or more softly depending on how much and how strong it is.
      
      In a general Data Science context, the methods and models we use will find a signal, if that’s possible. But the signal may not be what we hope it is. It could be a “leak from the future” in the data, which is very common. It could be a bot “clicking” on links rather than a potential customer. Heck, almost every engagement I go into doesn’t have a data dictionary and that data doesn’t speak for itself. (In fact, when I make the mistake of thinking I hear it talking based on the name of a field, I’m often deceived because that name doesn’t mean what I think it means.) So the data doesn’t actually speak for itself in this context either.
      
      Only in the narrow sense of “don’t necessarily believe what ‘experts’ say about the data” does “let the data speak for itself” make sense to me.
    - Carlos Ungil on October 5, 2017 at 6:23 am said:
      
      > it automatically assumes the value in question is ridiculously enormous.
      
      A flat prior doesn’t assume that it *is* enormous, it assumes that it *could be* enormous. An informative prior may be better, but an uninformative prior is not obviously stupid. What matters is what may be the effect on the inference when this prior is used in the context of the model once we include the data.
      
      If you say that the flat prior means that you expect the value of interest to be greater than 10^305 you make it look stupid.
      
      If you say that the flat prior means that you will take the mean of the data to estimate the value of interest it looks much less stupid, actually it looks quite reasonable.
      
      Let’s say you measure the height of a sample of people to estimate the average height in the population and you get mean=170cm. Maybe you have reasons to think you should correct it a bit in either direction, but taking the 170cm at face value is not obviously stupid. If you get mean=512km there are issues with your model or experimental setup much worse than the fact that the prior doesn’t rule out that value.
      
      Of course nothing is normal, all models are wrong, etc. Everyone understands that if we say that the height in a population is normally distributed with such and such mean and standard deviation this is just an approximation. The median and the mode might be different from the mean, the shape of the distribution around the mean might be far from normal, and surely there are no negative heights or heights larger than 10^305.
    - a reader on October 5, 2017 at 6:34 am said:
      
      +1
    - Daniel Lakeland on October 5, 2017 at 7:27 am said:
      
      Carlos, this is intimately tied up in the insistence on a point estimate though. The behavior of a point estimate of course is far less affected by the clearly wrong tails of the prior because the location of the point estimate is determined by the location of the optimum which is totally insensitive to the tails.
      
      This is of course by design for the person who distrusts priors, nevertheless as soon as you want to construct a measure of uncertainty or a risk and utility based decision you have a different story.
      
      The risks associated with point estimation when outcomes and their consequences can vary widely are significant. If a posterior distribution is tightly peaked near your point estimate then things are ok; if there is nontrivial width then that flat prior can be deadly for your decisions as you wind up considering possibilities well outside what anyone actually thinks might happen, simply because no one wants to be in charge of justifying a prior choice. Wald’s theorem applies whether the user of statistics likes it or not.
    - Andrew on October 5, 2017 at 7:53 am said:
      
      Carlos,
      
      As we discuss in this paper, the prior can often only be understood in the context of the likelihood. In particular, a sample average or maximum likelihood estimate can be “quite reasonable” in some contexts but not in others. In a setting where measurements are accurate and plentiful and the goal is an estimate of a simple parameter whose value is not near the boundary of parameter space, then, sure, the flat prior can work. In a setting where measurements are noisy, sample size is not huge, and the goal is something more specific, then maximum likelihood or Bayesian inference with a flat prior can give bad answers: estimates with bad frequency properties, with high bias, high variance, high type M errors, high type S errors, the whole deal.
    - Anonymous on October 5, 2017 at 1:04 pm said:
      
      Just breaking the math down a bit here in the Bayesian case may help. Suppose you have a prior on a parameter
      
      $latex \theta \sim \mbox{Uniform}(-10^6, 10^6)$.
      
      Just a simple uniform prior on an interval. That prior says it’s very unlikely that the value of $latex \theta$ is small, because
      
      $latex \displaystyle \mbox{Pr}[\, |\theta| > 10^5 \,] = 1 - \frac{2 \times 10^5}{2 \times 10^6} = 0.9$.
    - Daniel Lakeland on October 5, 2017 at 1:05 pm said:
      
      Also Carlos: from a probabilistic perspective, the flat prior assumes the value *is* enormous. Flat on +- 10^308 has 99.9% probability outside the +- 10^305 region. As soon as you add in an assumption of non-probabilistic estimation (i.e., point estimation) the prior has a different effect, which is to not change the location of the maximum, so you might argue that pure maximization based point estimation has no real probabilistic content (in the Bayesian sense of probability on the parameter space). The question I have is: if the flat prior results in a Bayesian posterior that makes no sense, why would you necessarily think it would be a good idea to take the maximum a posteriori value from this posterior and call it a good estimate? Later we can get into James-Stein estimation and the inadmissibility of this flat-prior point estimator.
    - Carlos Ungil on October 5, 2017 at 1:07 pm said:
      
      Daniel,
      
      if the point estimate obtained from a flat prior is ok but the posterior distribution is too wide maybe the problem is with the likelihood function and not with the flat prior. In any case, I don’t think the problem is that the prior specifies a 99.9% chance an effect size has absolute value greater than 10^305.
      
      Andrew,
      
      I agree and I think I said something similar myself (“What matters is what may be the effect on the inference when this prior is used in the context of the model once we include the data.”). Regarding the paper you link to:
      
      “For a fully informative prior for δ, we might choose normal with mean 0 because we see no prior reason to expect the population difference to be positive or negative and standard deviation 0.001 because we expect any differences in the population to be small, given the general stability of sex ratios and the noisiness of the measure of attractiveness.”
      
      Why would the prior depend on the noisiness of the measure of attractiveness? Say I have a prior for some experimental setting. If I had a similar setting with more noise I think I would still use the same prior for the parameter of interest (but maybe there would be a nuisance parameter related to the noise).
      
      I also find that prior very strong. If the beautiful parents had *only girls*, you would estimate the population difference to be just 0.1%. Maybe that’s your point, that the whole study makes no sense because you know that there is no difference and even in the most extreme outcome you wouldn’t really change your mind?
    - Carlos Ungil on October 5, 2017 at 1:13 pm said:
      
      > If the beautiful parents had *only girls*
      
      For context, I mean if all the 600 kids from beautiful parents in the study were girls.
    - Bob Carpenter on October 5, 2017 at 1:15 pm said:
      
      If we take a uniform prior over the range plus or minus one million, what does it say probabilistically?
      
      1. The probability the parameter is in (-1000, 1000) is only 0.1%
      
      2. The probability the parameter is outside of (-1000, 1000) is 99.9%.
      
      That’s probably not the information you want to provide to your Bayesian model if you don’t expect the parameter to have values outside of (-1000, 1000). I keep meaning to write a case study that shows how this works (along with the truncation you get that Daniel Lakeland describes above if you err on the other side and make the boundaries too tight). Andrew’s already written papers showing how the original diffuse inverse gamma priors suggested in the original BUGS examples led to overinflated variance estimates.
    - Daniel Lakeland on October 5, 2017 2:58 PM at 2:58 pm said:
      
      Bob, a useful construct when you have a region where you really are pretty indifferent, say ±1000, but you want to include some weight on the whole real line, is something like
      
      parameters{
      
      real p0;
      real dp;
      }
      
      transformed parameters {
      real p;
      
      p = p0 + dp; // a convolution of a uniform with a normal
      }
      
      model{
      
      dp ~ normal(0,some_scale);
      
      }
      
      thereby giving you a nice flat plateau in (-1000, 1000), but convolved with some Gaussian to give an infinitely smooth prior over the whole real line.
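      
      What this prior looks like can be checked by drawing from it directly. The sketch below is in Python rather than Stan and rests on two assumptions that are not in the snippet as posted: that p0 is bounded to (-1000, 1000), so it is implicitly uniform there, and that some_scale is 100 (a hypothetical value chosen only for illustration).
      
      ```python
      import random
      
      random.seed(0)
      half_width = 1000.0   # plateau region from the comment: (-1000, 1000)
      some_scale = 100.0    # hypothetical; the comment leaves the scale unspecified
      n_draws = 200_000
      
      draws = []
      for _ in range(n_draws):
          p0 = random.uniform(-half_width, half_width)  # assumed bounded, uniform component
          dp = random.gauss(0.0, some_scale)            # normal smoothing component
          draws.append(p0 + dp)                         # p = p0 + dp, as in the snippet above
      
      # Mass well inside the plateau, and mass that leaks past the bounds.
      frac_inner = sum(abs(p) < 500 for p in draws) / n_draws
      frac_outside = sum(abs(p) > half_width for p in draws) / n_draws
      
      print(f"prior mass in (-500, 500): {frac_inner:.3f}")
      print(f"prior mass outside (-1000, 1000): {frac_outside:.3f}")
      ```
      
      With these settings about half of the mass lands in (-500, 500), as a flat plateau would imply, while a few percent spills smoothly past ±1000 instead of being cut off at a hard boundary.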
    - Daniel Lakeland on October 5, 2017 2:59 PM at 2:59 pm said:
      
      ack, of course the blog ate the angle brackets Stan uses for bounds on p0… sigh.
    - Andrew on October 5, 2017 7:13 PM at 7:13 pm said:
      
      Carlos:
      
      You write:
      
      > Why would the prior depend on the noisiness of the measure of attractiveness? Say I have a prior for some experimental setting. If I had a similar setting with more noise I think I would still use the same prior for the parameter of interest (but maybe there would be a nuisance parameter related to the noise).
      >
      > I also find that prior very strong. If the beautiful parents had *only girls*, you would estimate the population difference to be just 0.1%. Maybe that’s your point, that the whole study makes no sense because you know that there is no difference and even in the most extreme outcome you wouldn’t really change your mind?
      
      In answer to your first point: noise in x will attenuate the correlation between x and y. Suppose, for example, that there’s some precisely measured “beauty” variable x for which the more beautiful parents are 0.1% more likely to have girls. Now suppose you don’t observe x, instead you observe z, a noisy measure of x, and then you compare the proportion of girls among parents who have high and low values of z. This difference will then be less than 0.1%. It’s called attenuation in econometrics and it’s easy to show analytically or by simulation.
      
      In answer to your second point: No, I don’t know there’s no difference. There is a difference, it’s not zero. Older mothers and younger mothers have (small) differences in Pr(girl), white mothers and black mothers have differences in Pr(girl), etc. Take any two groups and you’ll get different probabilities. But, given all the empirical research on sex ratios (and there’s a lot, because N is huge and the data are just out there for free in birth records), we know that these differences are small. Not zero. Small.
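      
      The attenuation in the first point can be illustrated with a small simulation. This is a sketch under assumptions that are not in the comment: true beauty x is standard normal, "more beautiful" means x above average, the measurement noise has the same standard deviation as x, and the baseline Pr(girl) is 0.485. To keep the tiny effect from being swamped by sampling noise, the code averages each parent's Pr(girl) rather than simulating individual births.
      
      ```python
      import random
      
      random.seed(1)
      true_effect = 0.001   # 0.1 percentage-point difference in Pr(girl), from the comment
      base_rate = 0.485     # assumption: baseline Pr(girl)
      noise_sd = 1.0        # assumption: measurement noise as large as the sd of x
      n_parents = 200_000
      
      def pr_girl(x):
          """Pr(girl), higher by true_effect for parents with above-average true beauty x."""
          return base_rate + (true_effect if x > 0 else 0.0)
      
      high_z, low_z = [], []
      for _ in range(n_parents):
          x = random.gauss(0.0, 1.0)            # true beauty (not observed)
          z = x + random.gauss(0.0, noise_sd)   # noisy measurement (observed)
          (high_z if z > 0 else low_z).append(pr_girl(x))
      
      # Compare groups defined by the noisy measure z instead of the true x.
      observed_effect = sum(high_z) / len(high_z) - sum(low_z) / len(low_z)
      ratio = observed_effect / true_effect
      
      print(f"difference when grouping by true x:  {true_effect:.5f}")
      print(f"difference when grouping by noisy z: {observed_effect:.5f}")
      print(f"attenuation factor: {ratio:.2f}")
      ```
      
      With noise this large, grouping by z recovers only about half of the 0.1% difference; more noise shrinks it further, and with no noise the factor returns to 1.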
    - Carlos Ungil on October 5, 2017 7:49 PM at 7:49 pm said:
      
      > In answer to your first point: noise in x will attenuate the correlation between x and y.
      
      Sure, if there is an effect it will be smaller. The attenuation will result in weaker data and the likelihood will move towards zero. Even if you don’t change the prior, the posterior will change as expected. I guess that if you had a prior centered at some value other than zero it would make sense to move the prior accordingly (to reflect the attenuation in the expected effect). I’m not so sure about changing the variance of the prior.
      
      > In answer to your second point: No, I don’t know there’s no difference.
      
      Ok, let me rephrase it. You know that the difference is small (much lower than 1%) and even the most extreme outcome wouldn’t provide enough evidence to suggest otherwise.
    - Carlos Ungil on October 6, 2017 8:30 AM at 8:30 am said:
      
      “For a fully informative prior for δ, we might choose normal with mean 0 because we see no prior reason to expect the population difference to be positive or negative and standard deviation 0.001 because we expect any differences in the population to be small, given the general stability of sex ratios and the noisiness of the measure of attractiveness.”
      
      To be fair, I see that narrowing the prior can be justified from a purely probabilistic point of view. If you have the “correct” prior for the “clean” case, for example the effect of true beauty on sex ratio is effectively sampled from a N(0,0.002) distribution, knowing that there is a certain level of attenuation you can easily derive the effect of measured beauty on sex ratio. At least if the “measured beauty” is only partially correlated to the “true beauty” and is not correlated at all to any other factors that could affect the sex ratio. If it is partially measuring beauty and partially measuring something else, the net effect is not trivial to determine. If the “noise” is completely random, you will have in the extreme case (measured beauty uncorrelated to true beauty) a prior equal to zero.
      
      In summary, it’s not impossible that you chose your prior by assuming first a precise prior for the effect of true beauty and then a precise amount of classification error. I guess I cannot accuse you of over-precision, given that you said that’s a “fully informative prior”.
- Anon on October 4, 2017 12:15 PM at 12:15 pm said:
  
  “However, the researcher may be convinced that their intervention is surely superior to anything anyone else has tried…, that’s fine for them to believe, but what is the evidence?”
  
  Now tie the issues of a contentious prior with open data and open code. Not only can the original authors do sensitivity analyses of their priors, but the skeptics in the greater research community might just try out a few priors of their own. I believe this could be part of what Andrew’s talked about with post-publication peer review.
  
  - Martha (Smith) on October 4, 2017 7:47 PM at 7:47 pm said:
    
    +1
    
- Jorge on October 5, 2017 2:52 AM at 2:52 am said:
  
  Dale:
  
  Then, what would be the problem with showing two or three models with the two or three most widely used findings from the literature as priors? I guess that would use the new and innovative evidence as well as the previous evidence, just to put it out there so others can judge.
  
  - Bob Carpenter on October 5, 2017 1:18 PM at 1:18 pm said:
    
    The usual problem is that you only get summary statistics, not Bayesian posteriors. It’d be great if you could just include data from other studies in one big meta-analysis, but that’s rarely possible.
    
    If you did get some kind of Bayesian posterior downstream, there’s the problem of how to compute with it if it’s not conjugate. That’s one of the reasons working directly with other data is easier.
    
    - Jorge on October 6, 2017 1:30 PM at 1:30 pm said:
      
      Hi Bob,
      
      But shouldn’t it also be informative to run the same model with different priors based on previous findings? That is, using only the effects (summary statistics, not posterior draws) from similar studies, just to see how much the model results change, as a check on robustness and reliability.
Wayne on October 4, 2017 11:17 AM at 11:17 am said:

Very good post! I would also add that explicit priors are open to sensitivity analysis and “what-if”.

So if the overly-optimistic experimenter used N(0.75, 0.1) and ends up with a result of 42%, you can say, “Hmmm… a 42% increase is too large to make sense to me and the prior seems questionable, let me try a prior of N(0, 0.2)… Well, it looks like my answer is 39%, so the answer is not very sensitive to two very different assumptions, which really gives me something to think about.” On the other hand, if you use your own prior and get a result of 2%, you have justification for questioning the original result because: a) the outcome is apparently sensitive to the choice of priors, and b) your outcome is much more reasonable in light of the current state of the field. (So you have at least one example of a prior that results in an outcome that you can get behind and that provides some amount of information to the discussion.)

This goes back to why p < 0.05 is an issue, and your frequentist-analysis example: someone believes that if you plug in required data and turn the crank properly the answer that pops out is Truth that no one can doubt because… statistics. A Bayesian analysis makes certain assumptions explicit and provides a well-defined way to "what-if" to poke and prod the model in relation to these assumptions. Of course, there are even larger issues regarding forking paths that made the data that goes into the model, but we do what we can…
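
The "what-if" exercise described above can be sketched with the simplest possible case, a normal prior combined with a normal estimate. The two priors are the ones mentioned in the comment; the data summaries (a point estimate of 0.40 with a small or a large standard error) are hypothetical numbers chosen for illustration, not results from any study.

```python
def normal_posterior_mean(prior_mean, prior_sd, estimate, std_error):
    """Posterior mean for a normal prior and normal likelihood (precision-weighted average)."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = 1.0 / std_error**2
    weighted_sum = prior_precision * prior_mean + data_precision * estimate
    return weighted_sum / (prior_precision + data_precision)

# Priors from the comment, given as (mean, sd).
priors = {
    "optimistic N(0.75, 0.1)": (0.75, 0.1),
    "skeptical N(0, 0.2)": (0.0, 0.2),
}

# Hypothetical data summaries, given as (point estimate, standard error).
datasets = {
    "precise study": (0.40, 0.03),
    "noisy study": (0.40, 0.30),
}

results = {}
for data_name, (estimate, std_error) in datasets.items():
    for prior_name, (prior_mean, prior_sd) in priors.items():
        post = normal_posterior_mean(prior_mean, prior_sd, estimate, std_error)
        results[(data_name, prior_name)] = post
        print(f"{data_name:14s} {prior_name:24s} posterior mean = {post:.3f}")
```

In this sketch the precise study gives nearly the same answer under both priors, while the noisy study gives very different answers, which is the signal that the conclusion is being driven by the prior rather than the data.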

Ben Goodrich on October 4, 2017 11:54 AM at 11:54 am said:

I think there is a bit too much concern about the case where

– The true causal effect of some intervention is essentially zero
– The prior on the causal effect is Normal(big, small)
– The posterior margin on the causal effect is approximately Normal(small, tiny)
– The decision is then to invest tons of money to do this intervention widely

First, empirically I don’t think you see Normal(big, small) priors on causal effects very often because authors don’t want to justify them, which is fine. The real problem is that there are too many Normal(0, yuuuge) priors, which are difficult to get the posterior for and the resulting posterior may have a lot of the same problems as with unregularized point estimates. If there were more Normal(0, moderate) priors, things would be better and I don’t think people would get overly worked up about exactly what number “moderate” corresponds to.

Andrew and coauthors wrote a paper recently arguing that you can’t choose a prior without thinking about what the likelihood function will be. There is already a bunch of papers saying you can’t choose a prior without considering what the loss function for the decision will be. So, Normal(big, small) priors on causal effects when the study is going to be small and there are few other studies to justify such a prior would set you up for a big loss when the decision is how much to invest in this intervention.

- Andrew on October 4, 2017 11:56 AM at 11:56 am said:
  
  Ben:
  
  I agree. My defense in the above post is that I was specifically being asked about how to think about normal(big,small) priors. In general I agree with your point.
  
- Kyle MacDonald on October 4, 2017 1:49 PM at 1:49 pm said:
  
  Roughly how often would you say that an explicitly specified prior in the published literature is clearly ridiculous? Most of what I’ve read on this topic, on this blog and others, has concerned bad priors that are inferred from the authors’ choice of frequentist inferential framework. Does the act of clearly and openly stating the prior tend to make researchers think harder about that prior? I’m inclined to guess “yes”, but betting against human folly is always risky.
  
- Georgette Asherman on October 10, 2017 8:42 AM at 8:42 am said:
  
  In industrial settings this can be the case. There is a small effect, essentially zero, with a relatively large known mean and low variance. However that ‘essentially zero’ effect can be meaningful in terms of production cost or other concerns. That is why equivalence and non-inferiority testing is commonly used.
  
Anoneuoid on October 4, 2017 12:17 PM at 12:17 pm said:

> Suppose a researcher does an experiment and he says that his prior is that the new treatment will be effective,

See, I think the mistake has already occurred if you design a study looking for “an effect” like this. These studies are made for NHST rather than scientific purposes. Do not design studies to look for “an effect”. Instead design them to distinguish between various explanations for something (test theories), or to describe some phenomena/condition in detail. Looking for “effects” simply is not a viable way to learn accurate information about the world.

- Kyle C on October 4, 2017 4:17 PM at 4:17 pm said:
  
  Well put. “Effective” should be a policy judgment (it may even be debatable) that comes after we gather the best evidence about the world. It is not a datum in itself.
  
  - Anoneuoid on October 8, 2017 12:55 AM at 12:55 am said:
    
    I’m not sure if that is the same point I am making, but don’t disagree with it. I’m saying the NHST paradigm is based on the principle that correlations/effects are rare. Thus it is somehow exceptional to find such, and studies are designed for this purpose. Instead the principle should be everything is correlated with everything else, and studies should be designed based on that principle.
    
David Harville on October 4, 2017 12:48 PM at 12:48 pm said:

There is a practice that can lead to “rigged” priors that I believe to be very common. As the name suggests, the prior distribution is supposed to be formulated prior to observing the data. I suspect that it is often the case that the prior distribution is not formulated until after the data are observed. And even if the prior distribution is formulated in advance, I suspect that on those occasions where it turns out to be inconsistent with the data, it is likely to be revised.

- Chris Wilson on October 4, 2017 2:37 PM at 2:37 pm said:
  
  This is a bit too oversimplified. The name “prior” is kind of unfortunate actually. Better to think of it as a necessary mathematical part of a joint generative model (also, I much prefer to think of it as supplying information “external” to the given model rather than “prior” in some temporal sequence). The posterior is interpreted relative to the prior, so it is perfectly legitimate to try out multiple priors after running a model, provided that you bear in mind the interpretation of what it is you are doing. The key is, at the end of the day, to clearly present all of the modeling steps and the assumptions that go into them, as Andrew is emphasizing at great length above.
  
  - David Harville on October 4, 2017 4:40 PM at 4:40 pm said:
    
    The key point is that if one’s choice of prior distribution is influenced by the data (either consciously or subconsciously), there is an opportunity for self-delusion and for the delusion of others.
    
    - Andrew on October 4, 2017 4:51 PM at 4:51 pm said:
      
      David:
      
      With the prior as with the statistical model more generally, there’s a tradeoff. Pure preregistration avoids some biases but blocks some opportunities for learning and discovery, so often we need to go back and forth. Also, preregistered procedures can still lead to biases, for example least squares or maximum likelihood or flat-prior Bayes can lead to big type M errors as illustrated in section 2.1 of this paper. No surprise from a mathematical perspective, as the flat prior corresponds to an averaging over an unrealistic space, but it’s a mistake we see a lot because people somehow have got the impression that least squares etc. gives unbiased or objective answers.
Keith O'Rourke on October 4, 2017 3:09 PM at 3:09 pm said:

In a letter to the editor I wrote on Jay Kadane’s “Prime Time for Bayes” in 1996 – one of the things we agreed on was “[for] dealing with the arbitrariness of prior specification in RCTs, I agree with the author that it is critical that peer review protocol be developed and disseminated through the RCT community so that benefits can be better obtained when the Bayesian approach is adopted.”

Obviously this has not happened and I think this long discussion reflects that, again having to argue that point in the apparent absence of much support to point to.

For instance, in David Cox’s recent paper “Statistical science: a grammar for research” Eur J Epidemiol (2017), he qualifies priors as “just” someone’s opinion in this statement – “although for an individual research worker a subjectivist view of probability [prior] might be invoked”. Without sufficient justification, subject in principle to critical peer review, it is hard to dismiss David’s view as being overly dismissive. Though data generating models are also often adopted without sufficient justification, two wrongs don’t make a right.

mic on October 4, 2017 5:52 PM at 5:52 pm said:

Very interesting. As an applied empirical person in the social sciences, this raises two questions.

(1) Should a paper be rejectable based on its priors? If reviewers disagree with the priors, if the priors affect substantively the conclusions, then should they be allowed to reject? It’s one thing to say that Bayesian stats forces us to be transparent with respect to our priors. It’s another thing if they become a topic of conflict between researchers and reviewers. Transparency is great, but in the existing peer-reviewed paradigm, it doesn’t solve everything.

(2) I often see variations on: “Use priors based on the existing literature.” (Of course, one could also use common sense, but let me set this aside for now.) The issue here is that in many important areas, the existing literature is (a) thin and (b) full of poorly constructed empirical analyses. In fact, often we do empirical work because we believe that previous studies were badly done. Why should I use priors based on articles I think are wrong?

Again, this is a great conversation — but issues such as those I just raised would concern me.

- Andrew on October 4, 2017 5:58 PM at 5:58 pm said:
  
  Mic:
  
  Hey, I’m an applied empirical person in the social sciences too!
  
  To answer your questions:
  
  1. It’s not really up to me what journal editors do. As an occasional editor, sure, I’d reject a paper if any aspect of its model was not well justified. Or, I guess, I’d give the authors an opportunity to justify their choice of model. It could be that their justification is, Everybody does it. Depending on context, that could be enough of a justification. But then I’d like that justification to be in the paper.
  
  2. Yes, I agree that a crude reliance on the existing literature can give bad priors. In particular, the existing literature is typically full of overestimates (type M errors). So I think it can be a bad idea to just grab estimates like that. It’s a challenge to do something better. And in some cases maybe you don’t need to do anything because the prior doesn’t matter much. But in cases where the prior does matter—such as the sorts of small noisy studies used by advocates to claim success of early childhood intervention programs—then, in such cases, it could be worth the effort to try to construct a good prior.
  
  Constructing a prior is work. It involves data gathering and data analysis, which is work. But sometimes that work (or something like it, not necessarily a Bayesian version) is necessary to get a reasonable answer!
  
  - Keith O'Rourke on October 5, 2017 8:00 AM at 8:00 am said:
    
    > Constructing a prior is work.
    It was the original motivation for the work I did in meta-analysis (to get prior for cost/benefit analysis of funding for clinical trials).
    
    A little bit of thought about this soon suggests you don’t want some weighted average of the (mostly crappy) studies that happened to get published. Or maybe it takes more than a little thought…
    
- Daniel Lakeland on October 4, 2017 6:04 PM at 6:04 pm said:
  
  “(1) Should a paper be rejectable based on its priors? If reviewers disagree with the priors, if the priors affect substantively the conclusions, then should they be allowed to reject?”
  
  No, because it amounts to censorship, not allowing a scientist to publish what they think is true… but it seems totally legitimate to me to request that an additional run of the model with a different prior also be included as a way to broaden the result to a population that might have reasons to believe that other regions of parameter space should be considered.
  
  More fundamentally, peer review is broken, and raw data and statistical analysis code should routinely be published as part of EVERY analysis, so that follow up researchers can run their own version of what they think is going on. It’s inappropriate to have anonymous gatekeepers with veto power.
  
  2) Sure, but it seems reasonable to argue that if you want your analysis to be relevant to the broader field, you should include the “consensus” region within the high probability range of whatever prior you use. If most people think that x is close to 0.12 and you think more like 31 for example, then you’d better be using a prior that includes both 0.12 and 31 in the high probability region, or many people will strongly disagree with your analysis and the conflict won’t be resolved until someone runs the model with a prior that does include 0.12 anyway.
  
  - Dale Lehman on October 4, 2017 7:47 PM at 7:47 pm said:
    
    This “consensus” thing bothers me. In fields where I work (economics and public policy), there is no such thing. Estimates generally range from zero to large effects. All analyses serve some political ends and thus all are suspect. I like the idea of requiring code and data to be public, and I also like the idea of sensitivity analysis. But the idea that we should base priors on consensus regions seems to me to solidify past poor analysis. It also invites the very things often criticized on this blog – reputation based on credentials, status, publications (largely based on p<.05), etc. I just don't see the open dialog about what is known and what can be agreed upon. For example, how large is the multiplier effect of government budget deficits? What should the prior be?
    
    - Daniel Lakeland on October 4, 2017 9:25 PM at 9:25 pm said:
      
      Well, when there exists a consensus in the field, that value should be in the high probability region of at least one prior that you analyze, because that will answer a question that people reading your paper would have. Should that be the only analysis you do? No. Do any analysis you like. If there doesn’t exist a consensus then you’re free to argue for whatever prior you want, and don’t have to worry about many people all wanting to know why you excluded their favorite consensus value. ;-)
      
      On the fiscal multiplier question, I’m afraid that looking things up on the Wiki gave me too ambiguous a definition. Suppose the government spends 1 billion dollars on something abhorrent, and as a result everyone in the country goes on strike and stops spending entirely for a month. Does this count as a “negative multiplier” because overall spending went down? Also, what is the time-frame involved? The math of this seems highly ambiguous to me. If you can explain precisely what the fiscal multiplier is, I will give you some thoughts as to how to generate a prior for it.
    - Dale Lehman on October 5, 2017 7:07 AM at 7:07 am said:
      
      Daniel
      You can start here (http://marginalrevolution.com/?s=multiplier). Of course, that is not an authoritative source and it represents the more right wing side of economics – Krugman would have a somewhat different take. But I have no doubt you can generate a prior – or even two or three. And, I believe doing that would be superior to conducting a new study using some data and declaring a confidence interval for the *true* size of the multiplier from that single study. I am not disagreeing with the post or your comments here – I am providing my view for much of the underlying resistance to change and clinging to these frequentist methods. If our estimates for the size of the multiplier shift depending on which prior you choose – and I believe they would – then it exposes the entire enterprise to be a sort of mathematical trick, a way to couch a subjective belief as “scientific.” And, who wants to do that? (only real scientists perhaps).
    - Daniel Lakeland on October 5, 2017 8:07 AM at 8:07 am said:
      
      I get the basic / intuitive idea behind the “multiplier” effect, my big issue is that I don’t see how it can be defined *precisely* to give a universal way of calculating it. Let me explain
      
      Suppose we take C(t) to be the total consumption by all members of the US at time t, a continuous function of time. Well, of course we know, like in the stock market, that consumption is not continuous. When I buy a sandwich a few dollars is transferred all at once. This is not the same thing as saying that all day long I spent a few pennies each hour…
      
      You might think this is pedantic, but it seems to me the “multiplier” effect is some kind of derivative, how much total consumption changes when some particular amount of consumption by a certain party occurs. d something / d something
      
      But the derivative is an unbounded operator, and it doesn’t even exist for a discrete series of transactions… and so we can really only discuss this in terms of taking the real series of discrete transactions, smoothing them in some way, and then defining our derivative of this smoothed thing… Fine, but then the result we get is dependent on the way in which we do the smoothing… Is there a way to define all of this in such a way that the result is largely independent of our choice of smoothing method for a wide range of smoothing methods? If so, we’re in the same situation as we get when trying to represent a steel bar using continuum mechanics, sure it’s atoms, but if we smooth the atoms by a smoothing kernel of width greater than 100 atomic distances and less than 1mm which is quite a few orders of magnitude… the results are nearly the same.
      
      It’s less obvious to me how this would work for consumption. First off, consumption clearly has a very strong daily oscillation. I buy very little at midnight, and quite a bit more at noon. So any smoothing we do must be over a timescale large with respect to a day. But there are also clearly seasonal effects in consumption: Christmas is big for retail, summer is big for travel… so smoothing seems to need to be large with respect to a year! But over decades technology and policy and things all change a lot. So I don’t think we’re ever in any regime where a smoothing-based view of what’s going on really applies very well.
      
      Now of course we’re interested in a causal effect, spending G government dollars causes some change in something, over some time period relative to what it would have been if the G event hadn’t occurred…. So it’s not a simple derivative in time, it’s a counterfactual about how much consumption would occur in some time period after the G event compared to what would have happened in the absence of G… But defining this in a way that is insensitive to the choice of time period still seems impossible. You could for example do a truncated Laplace transform (i.e., discount all future consumption out to some window according to some discount rate) but then you’ll wind up with a result that’s very sensitive to the discount rate and the truncation window.
      
      So, if you want to do a particular analysis, and you want to choose a particular way of doing the calculation, then I can give some particulars of the appropriate prior. All this is to back up the assertion that Andrew made in a recent paper: The choice of prior is intimately connected to the choice of likelihood / data model.
    - Dale Lehman on October 5, 2017 8:41 AM at 8:41 am said:
      
      You are trying to build a dynamic model – a worthy goal, but not what the multiplier was designed to represent. It is a comparative statics result: if we increase government spending by $1 (without increasing taxes), what is the final increase in GDP after the system equilibrates. There is still a time dimension, as many economists will give different answers if you ask what the change in GDP will be after 6 months or after 1 year, etc. Also, the answer will vary, depending on the initial state of the economy (extent of unemployment, etc.). We can put more detail in and I have no doubt you can provide a prior distribution that will be defensible (as well as open to criticism). But this misses my point. I don’t believe you can provide a prior that can be said to represent “consensus” because there is none. And, while I do think the effort would be worthwhile, I think you will find great resistance to this approach for the very reasons I am trying to convey. I think the resistance to specifying priors is, in large part, a resistance to revealing that the emperor has no clothes. After all, economists pride themselves on being more scientific than the other social sciences.
    - Daniel Lakeland on October 5, 2017 8:51 AM at 8:51 am said:
      
      Aha I see we are secretly pointing out the same thing. The truth is that although the basic idea that spending money can induce growth in the economy “the multiplier” really doesn’t exist as a well defined thing. So it’s not surprising that the numerical value is controversial ;-)
    - Daniel Lakeland on October 5, 2017 8:59 AM at 8:59 am said:
      
      * the basic idea makes sense… Editing on phone, sentence fragments…
ojm on October 4, 2017 7:37 PM at 7:37 pm said:

In general I think of priors in terms of stabilisation/regularisation rather than what you believe the ‘true’ value _should_ be.

So when thinking in these terms you aim to ‘rig’ the model fitting towards ‘null’ or ‘simpler’ or whatever fits. So the main danger is in ‘missing’ effects that are there by being _too conservative_.

- ojm on October 4, 2017 8:22 PM at 8:22 pm said:
  
  Also, I find it helpful to plot both the prior and posterior (parameter and predictive) distributions, or at least some summaries of them. This makes it clearer where you started from, not just where you got to.
  
  - Chris Wilson on October 4, 2017 11:08 PM at 11:08 pm said:
    
    +1.
    
  - Keith O'Rourke on October 5, 2017 8:07 AM at 8:07 am said:
    
    Strange that this is not commonly done – the technical challenges are not that hard http://statmodeling.stat.columbia.edu/wp-content/uploads/2011/05/plot13.pdf
    
    (Actually that was the reason the journal editor gave for rejecting the paper – not enough technical innovation to justify publication in my prestigious journal)
    
Shravan on October 5, 2017 1:35 AM at 1:35 am said:

“If you analyzed the RCT using Bayesian stats instead, would your ultimate conclusion about the success of the program be affected by your choice of prior, and if so, how much?”

How about the person who asked the question generate some plausible fake data, plug it into glm and write down the “answer”, and then redo the model in rstanarm or brms using different priors? Most of the people who question the role of priors usually have no experience with Bayesian statistics, and could not be bothered to acquire it. In a way, one could say that they have very tight priors and simply refuse to collect any data, because they already *know* what will happen. Sort of like the very same situation they are imagining is a problem with Bayes.

- Keith O'Rourke on October 5, 2017 8:10 AM at 8:10 am said:
  
  I think so – and in general that is well argued here – Calestous Juma. From coffee to tractors: Why fear of loss inspires resistance to new technology. And today’s pragmatic Bayesian approaches are new technology – just the theorem is old.
  
  20 minute podcast here http://www.cbc.ca/radio/thecurrent/the-current-for-march-30-2017-1.4045972/march-30-2017-full-episode-transcript-1.4048646#segment2
  
  Reply ↓
  - Shravan on October 5, 2017 11:22 AM at 11:22 am said:
    
    Very interesting podcast, I’m buying the book by him. The idea that Germans banned coffee to protect beer—this I have got to read about.
    
    Talking about resistance, I spent the morning trying to figure out how to convince an action editor that a bunch of low-powered big effects is not as convincing as a small effect from a large-sample study. First I have to demonstrate how Type M error arises… the news has apparently not reached psychology.
    
    Reply ↓
    - Shravan on October 5, 2017 11:24 AM at 11:24 am said:
      
      Maybe psych journals should require authors to run the retrodesign function from Gelman and Carlin’s article, so the reader can assess the Type S/M errors and retrospective power.
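For readers who have not seen it, here is a rough Python sketch of the kind of calculation retrodesign performs. It is not the published R code: it assumes a normal sampling distribution for the estimate and a positive true effect, and it uses a closed-form expression for the exaggeration ratio rather than simulation. The example inputs are hypothetical.

```python
from statistics import NormalDist

def retrodesign(true_effect, se, alpha=0.05):
    """Return (power, Type S error rate, exaggeration ratio).

    Assumes estimate ~ Normal(true_effect, se) with true_effect > 0.
    """
    std = NormalDist()
    z = std.inv_cdf(1 - alpha / 2)
    d = true_effect / se
    p_hi = 1 - std.cdf(z - d)   # significant with the correct (positive) sign
    p_lo = std.cdf(-z - d)      # significant with the wrong (negative) sign
    power = p_hi + p_lo
    type_s = p_lo / power
    # E[|estimate|; significant], from the partial means of a normal distribution
    abs_sum = se * (d * (p_hi - p_lo) + std.pdf(z - d) + std.pdf(-z - d))
    exaggeration = abs_sum / power / true_effect
    return power, type_s, exaggeration

# Hypothetical study: true effect 0.1, standard error 0.2
power, type_s, exaggeration = retrodesign(true_effect=0.1, se=0.2)
print(f"power {power:.2f}, Type S {type_s:.3f}, exaggeration {exaggeration:.1f}")
```

When the true effect is small relative to the standard error, the exaggeration ratio is well above 1, which is the Type M problem with low-powered studies reporting big effects.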
Robert Krause on October 5, 2017 9:17 AM at 9:17 am said:

I am a regular reader of the blog, but I am neither a mathematician, statistician nor econometrician (or any other -ician); maybe I am best described as an (aspiring) methodologist*. Priors are quite important in my work but I am not sure how to apply this discussion. My work is about Multiple Imputation procedures for missing data in network models (exponential random graph models – ERGM, and stochastic actor based models – SAOM) and I started to use Bayesian estimation methods because they allow me to create “proper” imputations in the sense of Rubin. My initial models had very vague priors, Normal(0,100), as it was the default. Obviously these were “bad” priors, because these models are comparable to logistic regression and, given the statistics that are multiplied with the parameters, values outside say +/-10 are absolutely unrealistic. I went with it, because I did not know enough about Bayesian statistics.
However, once you create enough missing data you cannot estimate the models anymore, because some of the statistics are no longer observed often enough to estimate the parameters (e.g. I had posteriors from -100 to +150). Luckily I came across a YouTube video of one of Andrew’s presentations about weakly informative priors, where he discussed a similar issue in which a parameter could not be estimated because there was (nearly?) no data for it. Now, using these priors, Normal(0,4), the models converge nicely with 50% of the data missing (which in networks means that for many statistics you have 75% of the data missing). My point is that my main reason to choose this prior is pragmatic: you cannot run the model with a flat prior. I therefore wonder how much of this discussion applies directly to my choice of prior?
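The effect described above can be reproduced in a toy setting. This sketch is an assumption-laden stand-in, not an ERGM or SAOM: a one-parameter logistic model with only three observations, all successes, so the likelihood alone cannot bound the coefficient. The posterior is evaluated on a grid under Normal(0,100) and Normal(0,4) priors.

```python
import math

def log_sigmoid(b):
    # log(1 / (1 + exp(-b))), guarded so exp() cannot overflow for very negative b
    return -math.log1p(math.exp(-b)) if b > -30 else b

def log_post(beta, successes, failures, prior_sd):
    """Unnormalised log posterior: logistic likelihood times a Normal(0, prior_sd) prior."""
    loglik = successes * log_sigmoid(beta) + failures * log_sigmoid(-beta)
    return loglik - 0.5 * (beta / prior_sd) ** 2

def interval_95(successes, failures, prior_sd, half_width=400.0, step=0.05):
    """Central 95% posterior interval for beta, computed on a regular grid."""
    n = int(2 * half_width / step) + 1
    grid = [-half_width + i * step for i in range(n)]
    logp = [log_post(b, successes, failures, prior_sd) for b in grid]
    top = max(logp)
    weights = [math.exp(lp - top) for lp in logp]
    total = sum(weights)
    lo = hi = None
    acc = 0.0
    for b, w in zip(grid, weights):
        acc += w / total
        if lo is None and acc >= 0.025:
            lo = b
        if hi is None and acc >= 0.975:
            hi = b
    return lo, hi

# Toy data (assumed): 3 successes, 0 failures
vague = interval_95(3, 0, prior_sd=100.0)
weak = interval_95(3, 0, prior_sd=4.0)
print("Normal(0,100) prior:", vague)
print("Normal(0,4) prior:  ", weak)
```

With the vague prior the interval runs out to values far beyond anything plausible on the logit scale; the Normal(0,4) prior keeps it in a realistic range without pretending to know the sign or size of the effect.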

*I am not a native English speaker (as you might have guessed), but is there a difference between -icians and -ists? It seems to me the -icians (statist-, econometr-, psychometr-, mathemat-,…) have a better understanding of what they are doing compared to the -ists (psycholog-, sociolog-, biolog-,…).

Reply ↓
- Bob Carpenter on October 5, 2017 1:37 PM at 1:37 pm said:
  
  How about beauticians, dieticians, and musicians vs. physicists, hypnotists, and barist(a)s?
  
  Reply ↓
  - Martha (Smith) on October 5, 2017 6:00 PM at 6:00 pm said:
    
    +1
    
    Reply ↓
  - Robert Krause on October 6, 2017 4:44 AM at 4:44 am said:
    
    Good point…
    
    Reply ↓
Daniel Lakeland on October 5, 2017 10:58 AM at 10:58 am said:

Dale Lehman:

Following up on some thoughts on priors for economic “multiplier effects”, since we’d run out of reply room above.

Let’s let t be defined in years, and the “one year future total consumption per capita” function be

C1(t) = integrate(C(t+s)ds,s,0,1)

Where C(t) is the sum of all transactions that occur on a given day divided by the population N divided by 1/365 to put C(t) in units of dollars per person per year. C(t) is a piecewise constant function over each day.

Now, I take the 1 year multiplier effect to be

C1(t) if we have the government spend G dollars per capita (call this C1_G(t)), where G dollars is any number between 0.001 times GDP/capita and 0.01 times GDP/capita (we assume an intermediate asymptotic stability of the effect for these moderately small spending levels)

minus

C1(t) if we don’t spend the G dollars per capita

divided by G

M = (C1_G(t) – C1(t))/G
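To make the definition concrete, here is a small numerical sketch. The daily series are invented for illustration (a flat 57000 $/person/year baseline and a hypothetical boost under spending); only the formulas for C1 and M follow the definitions above.

```python
def one_year_consumption(daily_rate, start_day=0):
    """C1(t): integrate a piecewise-constant daily series C (in $/person/year)
    over 365 days from `start_day`; each day contributes its rate times 1/365 year."""
    window = daily_rate[start_day:start_day + 365]
    if len(window) < 365:
        raise ValueError("need a full year of daily values")
    return sum(window) / 365.0

def multiplier(daily_with_spending, daily_without, G, start_day=0):
    """M = (C1_G(t) - C1(t)) / G."""
    return (one_year_consumption(daily_with_spending, start_day)
            - one_year_consumption(daily_without, start_day)) / G

# Hypothetical numbers: G is 0.01 times a 57000 $/person baseline
baseline = [57000.0] * 365
G = 570.0
with_spending = [57000.0 + 1.5 * G] * 365   # assumed response, chosen so that M = 1.5

M = multiplier(with_spending, baseline, G)
print(f"M = {M:.2f}")
```

In practice only one of the two series is ever observed, which is why M has to be inferred through a model and why its prior matters.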

Now clearly, this quantity depends on our choice of 1 year as the time period of interest, but we might expect that we’d get a similar effect for a range of window lengths from say 1/2 year to 2 years and so it’s *not extremely sensitive* to the window length. This is partly due to the fact that we average over 320 million people, and that we integrate our function over a full year or so, thereby smoothing out short term fluctuations quite a bit.

Next we note that logically we can in fact get quite large negative values. As I say, if everyone in the country goes on strike because the Nazi party comes into power and whatnot… then C1_G(t) could go to zero, while C1(t), the counterfactual, would have been something like 57000 $/person. But… it’s extremely unlikely.

In fact, for the most part, we’d expect this number to be something like 1 as the increase in GDP caused by spending G dollars per person would be something like G dollars per person, divided by G we’d get 1. So probably the peak of the prior density should be 1.

Furthermore it also seems like we could easily get 0, where each dollar spent by the govt causes someone to withhold a dollar of spending. This would be the case where we’re pretty much just doing a straight transfer from one group of people to another…. So the prior should be wide enough that 0 has density that is not so much lower than the density at 1. Finally, it’s reasonable that you might activate a lot of activity by your government spending, if it’s targeted properly (maybe you stimulate the economy of a depressed region, where lots of labor is available but little free cash for example). So you should be considering quantities out into the range of 2 or 3.

With all this in mind… it seems like an initial prior of normal(1.0,2.0) would be a good place to start, including values well into the negative range and well above 1.0, but giving 1.0 the peak.
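A quick check that normal(1.0,2.0) does what the reasoning above asks of it. This assumes the second argument is a standard deviation (the Stan convention), not a variance.

```python
from statistics import NormalDist

prior = NormalDist(mu=1.0, sigma=2.0)   # assumption: 2.0 is a standard deviation

density_ratio = prior.pdf(0.0) / prior.pdf(1.0)   # density at 0 relative to the peak at 1
p_negative = prior.cdf(0.0)                       # prior mass on negative multipliers
p_above_2 = 1 - prior.cdf(2.0)
p_above_3 = 1 - prior.cdf(3.0)

print(f"density at 0 relative to peak: {density_ratio:.2f}")
print(f"P(M < 0) = {p_negative:.2f}")
print(f"P(M > 2) = {p_above_2:.2f}, P(M > 3) = {p_above_3:.2f}")
```

The density at 0 is about 88% of the peak, roughly 31% of the mass is negative, and about 16% lies above 3, so the prior keeps 0 nearly as plausible as 1 and leaves real room for multipliers of 2 or 3.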

Reply ↓
- Dale Lehman on October 5, 2017 2:06 PM at 2:06 pm said:
  
  I’ll put it in more traditional economic terms. The textbook Keynesian model says that if the economy is not at full employment, then the multiplier (the effect on GDP of increasing the gov’t budget deficit by $1) equals 1/(1-MPC), where MPC is the marginal propensity to consume (the derivative of total consumption spending with respect to income). Since the MPC is around .8, the multiplier would be around 5. Somewhat more sophisticated models incorporate taxes and imports and these will reduce the size of the multiplier somewhat. This model prevailed until the 1970s. Since then, a portion of the economics discipline would claim that the multiplier is 0: any increase in government spending will squeeze out private investment dollar for dollar. Some other extremists would go so far as to make it negative, claiming nefarious influences on the private sector and worrying about what the government spends money on. And, of course, there are some ideas that the multiplier is not at all stable and it depends on many other things, such as consumers going on strike, etc.
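The arithmetic behind those numbers, as a short sketch. The simple case is the 1/(1-MPC) formula from the comment; the version with a proportional tax rate and a marginal propensity to import is one common textbook extension, and the 0.25 and 0.1 values are made up for illustration.

```python
def keynesian_multiplier(mpc, tax_rate=0.0, import_propensity=0.0):
    """Textbook spending multiplier 1 / (1 - MPC*(1 - t) + m).

    With t = 0 and m = 0 this reduces to 1 / (1 - MPC).
    """
    leakage = 1.0 - mpc * (1.0 - tax_rate) + import_propensity
    if leakage <= 0:
        raise ValueError("parameters imply a non-positive leakage rate")
    return 1.0 / leakage

simple = keynesian_multiplier(0.8)
# Hypothetical tax rate and import propensity
extended = keynesian_multiplier(0.8, tax_rate=0.25, import_propensity=0.1)

print(f"simple model:           {simple:.2f}")
print(f"with taxes and imports: {extended:.2f}")
```

With these assumed leakages the multiplier drops from 5 to 2, which shows how strongly the textbook answer depends on modeling choices even before the competing schools of thought enter.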
  
  But the point is that these differences are deeply rooted in philosophical differences in how people believe the economy works. There is no consensus. We could establish several priors corresponding to different schools of thought and then examine the same evidence in each case. That would be instructive and I would support that. But I don’t think you will see that any time soon – it makes these schools of thought less “scientific” and more “subjective.” If you want to claim these beliefs are wrong, I’m in agreement with you. But I think it is part of the fundamental reason why economists, at least, would resist the advice in Andrew’s post (of course, I could be wrong, since I can’t really speak for most economists).
  
  Reply ↓
  - Daniel Lakeland on October 5, 2017 3:05 PM at 3:05 pm said:
    
    Yet, your description here, that people think the multiplier could be anything from negative to positive, and O(1) or so… suggests a perfectly fine prior, maybe normal(1.0,10.0)
    
    Nevertheless, I agree with you about your skepticism that economics will begin to do this. I just don’t think this is because doing it is hard, or wrong, or anything like that, it’s because of politics etc.
    
    Reply ↓
Huw Llewelyn on October 5, 2017 11:28 AM at 11:28 am said:

I agree with Andrew about taking prior evidence into account in a measured and carefully reasoned way. My understanding is that it is the same as doing a pre-study meta-analysis in a Bayesian way of thinking to arrive at a Bayesian prior probability distribution (and including it in a paper’s introduction). The new data (i.e. its likelihood distribution) is then interpreted against this prior background and used to update the meta-analysis to create the Bayesian posterior probability distribution (for the discussion).

The prior probability can be based on a series of data sets, each assumed to share the same ‘true’ mean but each with its own likelihood distribution. The likelihood densities of the different likelihood distributions can be multiplied together to form a joint likelihood distribution, which is then ‘normalised’ so that all the posterior probabilities sum to 1 (normalisation always assumes that the ‘baseline’ prior is uniform or flat for random sampling, which is correct – see my blog: https://blog.oup.com/2017/06/suspected-fake-results-in-science/). The resulting posterior probability becomes the prior probability distribution for the new study. This is multiplied by the likelihood distribution of the new study data and normalised again to give the latest updated posterior probability distribution (to be discussed in the ‘discussion’ section of the paper).
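The multiply-and-normalise procedure can be sketched on a grid of candidate values for the true mean. The study estimates and standard errors below are hypothetical, and each likelihood is assumed to be normal in shape.

```python
import math

def normal_likelihood(grid, estimate, se):
    """Likelihood of each candidate mean given one study's estimate and standard error."""
    return [math.exp(-0.5 * ((estimate - mu) / se) ** 2) for mu in grid]

def normalise(weights):
    total = sum(weights)
    return [w / total for w in weights]

def update(prior, likelihood):
    """Multiply prior by likelihood pointwise, then normalise to sum to 1."""
    return normalise([p * l for p, l in zip(prior, likelihood)])

grid = [i / 100 for i in range(-200, 301)]     # candidate values of the true mean
flat = normalise([1.0] * len(grid))            # uniform 'baseline' prior

prior_studies = [(0.30, 0.20), (0.10, 0.25)]   # hypothetical (estimate, se) pairs
new_study = (0.20, 0.15)                       # hypothetical new data

prior = flat
for est, se in prior_studies:
    prior = update(prior, normal_likelihood(grid, est, se))
posterior = update(prior, normal_likelihood(grid, *new_study))

prior_mean = sum(mu * p for mu, p in zip(grid, prior))
post_mean = sum(mu * p for mu, p in zip(grid, posterior))
print(f"prior mean {prior_mean:.3f}, posterior mean {post_mean:.3f}")
```

Because multiplication is commutative, the order of the studies does not matter; the split into "prior" and "new" data is about how the paper is organised, not about the arithmetic.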

Reply ↓
- Huw Llewelyn on October 5, 2017 11:47 AM at 11:47 am said:
  
  PS. If there are no real prior data sets, then a pseudo-data set could be ‘imagined’ subjectively based on informal experience or theories, and its subjective likelihood distribution arrived at and normalized to give a non-baseline prior probability distribution.
  
  Reply ↓
  - Daniel Lakeland on October 5, 2017 12:48 PM at 12:48 pm said:
    
    This is a really useful way to get nuanced priors.
    
    Reply ↓
- Keith O'Rourke on October 5, 2017 1:42 PM at 1:42 pm said:
  
  Huw (and Daniel) - if you have not noticed it yet, you might find my comment of interest: http://statmodeling.stat.columbia.edu/2017/10/04/worry-rigged-priors/#comment-578656
  
  Multiplying the different likelihood distributions together gives approximately a weighted average, and if the log-likelihoods are quadratic it is exactly equal to the inverse variance weighted average.
  
  Something more thoughtful is advisable and if such can’t be discerned – flatten the multiplied together likelihood to reflect more uncertainty e.g. raise it to some number less than one (called something like fractional likelihood).
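Both points can be sketched for normal likelihoods, where the log-likelihoods are quadratic. The estimates and standard errors are hypothetical, and the `power` argument is my name for the fractional exponent.

```python
def pool(estimates, ses, power=1.0):
    """Combine normal likelihoods by multiplication, raised to `power`.

    power = 1 gives the inverse-variance weighted average;
    power < 1 flattens the product, widening it without moving its centre.
    """
    weights = [power / se**2 for se in ses]
    total = sum(weights)
    mean = sum(w * e for w, e in zip(weights, estimates)) / total
    sd = (1.0 / total) ** 0.5
    return mean, sd

estimates, ses = [0.30, 0.10], [0.20, 0.25]   # hypothetical prior studies

full = pool(estimates, ses)
flattened = pool(estimates, ses, power=0.5)

print(f"full product:      mean {full[0]:.3f}, sd {full[1]:.3f}")
print(f"raised to the 0.5: mean {flattened[0]:.3f}, sd {flattened[1]:.3f}")
```

Raising the product to a power a multiplies the total precision by a, so the sd grows by a factor of 1/sqrt(a). Choosing a is itself a judgment call that needs the same kind of justification as any other prior choice.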
  
  There also will be a related post later this afternoon.
  
  Reply ↓
  - Daniel Lakeland on October 5, 2017 2:35 PM at 2:35 pm said:
    
    Yes, I think Huw though is imagining “making up” a dataset that you a-priori think might be representative of the range of stuff you expect to see, then do Bayesian inference on this fake dataset, and see which parameter values are consistent with this fake data thereby backing out a prior for a parameter from what you think the data ought to look like… I like this idea a lot as a way to get informative priors, and since it’s not a weighted average of crappy studies it might be more reasonable, certainly doesn’t suffer from file-drawer and poor research practices etc.
    
    Reply ↓
    - Huw Llewelyn on October 5, 2017 3:00 PM at 3:00 pm said:
      
      A fresh unbiased study performed meticulously will continue to converge on the true mean as the number of observations increases. However, unless the prior probability distribution shares the same mean, it will bias the fresh study and delay its convergence on the true mean and thus be counter-productive. It would have a similar biasing effect as ‘P-hacking’. Ideally, the prior data should have been a pilot study for the ‘fresh’ study so that it could be regarded as part of it. In other words, the ‘prior data’ would have to be chosen very carefully. Others reading the study might prefer for the fresh data to be ‘normalized’ on its own to create a ‘fresh’ posterior probability distribution and to use the author’s prior probability as a guide to suggesting his or her own for personal use e.g. deciding to perform another study to replicate or contradict it.
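How much a mis-centred prior delays convergence can be quantified in the conjugate normal case. This sketch assumes a known data sd and takes the expectation over the data; it borrows the N(0.2, 0.1) prior from Andrew's example and pairs it with a hypothetical true effect of zero.

```python
def expected_posterior_bias(prior_mean, prior_sd, true_mean, sigma, n):
    """Expected (posterior mean - true mean) for a normal mean with known data sd.

    The posterior mean is a precision-weighted average of the prior mean and
    the sample mean, and the sample mean is unbiased for `true_mean`.
    """
    prior_prec = 1.0 / prior_sd**2
    data_prec = n / sigma**2
    return prior_prec * (prior_mean - true_mean) / (prior_prec + data_prec)

# Assumed setting: prior N(0.2, 0.1), true effect 0, data sd 1
biases = {n: expected_posterior_bias(0.2, 0.1, 0.0, 1.0, n) for n in (10, 100, 1000, 10000)}
for n, b in biases.items():
    print(f"n = {n:5d}: expected bias {b:.4f}")
```

Here the prior carries as much information as 100 observations, so the bias is still half the prior's offset at n = 100 and only becomes negligible in the thousands.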
- ojm on October 5, 2017 3:14 PM at 3:14 pm said:
  
  See also Edwards’ ‘prior likelihood’ (eg in his book Likelihood).
  
  This will only work for identifiable parameters though, ie those that just need enough data to estimate.
  
  Reply ↓
  - Huw Llewelyn on October 5, 2017 5:39 PM at 5:39 pm said:
    
    I was assuming that each study used in the ‘meta-analysis’ was based on random selection from the same population with a single unknown true mean or proportion. Each study would be performed separately and would have different means or proportions but could be regarded as part of one large study and their data pooled to give a better estimate of the true mean or proportion. If this could not be assumed (at least to be roughly true) then I agree it would not work.
    
    Reply ↓
Oren Cheyette on October 5, 2017 12:27 PM at 12:27 pm said:

Perhaps I’ve missed earlier posts on this, but it seems to me that there is ever more attention in media to dubious health studies afflicted with unstated priors and forking paths. Two recent instances that come to mind are the IARC designation of a common herbicide as a “probable carcinogen” based on a handful of cases of one type of rare cancer in one study of ag workers (forking paths), and the recent NIH/NTP assessment that cell phone radiation may be carcinogenic, based on a rat study with low single-digit counts of two rare cancers and unexplained differences between male & female rates and also signal type (GSM vs. CDMA). (So that’s forking paths plus priors – at least for people with physical science backgrounds, there’s a pretty strong prior against the idea that non-ionizing radiation at levels too low to cause measurable heating could cause any genetic damage.)

Both studies continue to get media attention – out here in the Bay Area, we were just treated to an alarmist story by the SF Chronicle’s health writer on the risk of smart watches, quoting heavily from two go-to figures in the “cell phones will give us all cancer” community and mentioning the NIH/NTP report. Particularly at the local level, a lot of questionable policy gets made based on these sorts of reports – e.g., Berkeley on cell phone warnings and Petaluma on herbicides used by public maintenance staff.

Reply ↓
- Daniel Lakeland on October 5, 2017 12:54 PM at 12:54 pm said:
  
  Oren, re the cell phone stuff, I basically agree with you, but the idea that non-ionizing radiation could cause cancer is I think a little more nuanced. Any enzyme associated with DNA repair or oxidative stress or whatnot that could be activated or inactivated by selective absorption of microwave type radiation could cause cancer over time through this indirect method, basically inhibiting the ability of the cells to cope with naturally occurring processes, or increasing the rate at which those naturally occurring processes occur. If I wanted to study such things I’d be looking at molecular resonances of the proteins to see if their chemical kinetics or protein folding configurations could be affected by absorption of certain wavelengths…
  
  Also, hi from an ex colleague, assuming there aren’t too many Oren Cheyettes in the SF Bay area.
  
  Reply ↓
  - Oren Cheyette on October 5, 2017 2:28 PM at 2:28 pm said:
    
    Hi Daniel – I knew I recognized your name from somewhere.
    
    Regarding the effect of EM on cells: the problem is that, not only is the radiation non-ionizing, it’s not even comparable to thermal energy. So any effect involving some activation barrier being surmounted by the radiation would already be blown past by ambient thermal noise. Robert Adair (Yale physicist) treated this issue at great length in the early ’90s (back when there were scares about power lines), albeit focusing on lower frequencies where the issue is even more clear cut. (My physics chops are a little rusty, and I don’t have a strong intuition about the resonance idea, except that it seems unlikely at those energies. Cell phone frequencies are below the blackbody peak at room temperature and I’m pretty sure there are a gazillion energy levels accessible to pretty much any large molecule in those ranges, particularly in a liquid environment.)
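The energy comparison can be put in numbers. The physical constants are the exact SI values; the 2 GHz carrier frequency and the temperatures are illustrative assumptions.

```python
H = 6.62607015e-34    # Planck constant, J*s (exact SI value)
K_B = 1.380649e-23    # Boltzmann constant, J/K (exact SI value)

def photon_to_thermal_ratio(freq_hz, temp_k):
    """Energy of one photon (h*f) relative to the thermal energy scale k_B*T."""
    return H * freq_hz / (K_B * temp_k)

def blackbody_peak_frequency(temp_k):
    """Wien displacement law in frequency form: h * f_peak is about 2.821 * k_B * T."""
    return 2.821439 * K_B * temp_k / H

cell_freq = 2.0e9     # assumed carrier frequency, Hz
ratio = photon_to_thermal_ratio(cell_freq, temp_k=310.0)   # body temperature
peak = blackbody_peak_frequency(293.0)                     # room temperature

print(f"photon energy / kT at 310 K: {ratio:.1e}")
print(f"blackbody peak at 293 K: {peak:.2e} Hz")
```

A single 2 GHz photon carries roughly 3e-4 of kT at body temperature, and the room-temperature blackbody peak sits near 1.7e13 Hz, several thousand times higher than cell phone frequencies.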
    
    But at any rate, this is just to emphasize my mechanistic prior, which is evidently different from that of the Chronicle’s health writer and the Berkeley city council, who seem ready to use the uninformed (ha!) prior that every modern technology is carcinogenic unless proven otherwise, and also the studies showing otherwise should be ignored (because they disagree with said prior too strongly).
    
    Reply ↓
    - Daniel Lakeland on October 5, 2017 2:45 PM at 2:45 pm said:
      
      Oren: here’s the first PubMed link I found on searching “microwave resonance proteins” https://www.ncbi.nlm.nih.gov/pubmed/18240290
      
      taking their abstract at face value (possibly not a good idea, but a starting point since I don’t have access to full text) they suggest that resonant absorption of microwaves can certainly affect proteins selectively. It’s at least plausible. Yet, I fully agree with you in the basic point you’re making that policy is being made by people with strong but uninformed priors.
      
      When it comes to power lines at 60 Hz I think the results are completely different: such small objects as proteins are likely to see 60 Hz as essentially DC, and resonance absorption should be up in the range of microwave ovens, certainly above 500 MHz, etc.
    - Daniel Lakeland on October 5, 2017 3:42 PM at 3:42 pm said:
      
      Here’s a pretty cool science based animation of how a particular motor protein works (and a bunch of other stuff too):
      
      https://youtu.be/yKW4F0Nu-UY?t=3m40s
      
      You could suppose for illustration purposes that say the “feet” of this protein could absorb microwaves selectively because they “walk” at some 1000 MHz or whatever (or the microwave energy is a 1st, 2nd or 3rd harmonic of whatever they do). If you add microwave energy, perhaps they vibrate back and forth rather than moving forward, hence a certain thing doesn’t get transported to its appropriate place as quickly, and so some chemical reaction does or does not occur fast enough to prevent some naturally occurring damage. This is more of a heuristic than anything else, obviously I have no particular candidate process in mind, just the idea that the intricate mechanical processes that large bio-molecules undergo could be selectively disrupted due to resonance at microwave frequencies. The more I learn about biology the more impressed I am at how complex it is, but also robust.
    - Daniel Lakeland on October 5, 2017 3:47 PM at 3:47 pm said:
      
      Blog post discussing the accuracy of said video:
      
      https://helix.northwestern.edu/blog/2010/11/cell-biology-animation-and-reality
      
      links to actual atomic force microscopy of the Myosin V molecule, which does have little feet that bind and unbind at a regular interval…

Statistical Modeling, Causal Inference, and Social Science
