## Why I keep talking about “generalizing from sample to population”

Someone publishes some claim, some statistical comparison with “p less than .05” attached to it. My response is: OK, you see this pattern in the sample. Do you think it holds in the population?

Why do I ask this? Why don’t I ask the more standard question: Do you really think this result is statistically significant?

Why? Because I think that the question about sample and population leads more directly to discussions of substance, what Rubin would call “the science of the problem.”

Here’s a (stylized) example.

Old way:

Researcher: We discovered X, as demonstrated by this comparison with p less than .05.
Questioner: Do you really think this result is statistically significant?
Researcher: Yeah, sure, we did the calculation right.
Q: But what about multiple comparisons?
R: No, we reported all the analyses we did. Also we did some multiple comparisons corrections.
Q: But what about the garden of forking paths? What about the different analyses and data processing choices you would’ve done, had the data been different?
R: Hey, that’s not fair, now you’re hanging me for something I’ve never done, you’re criticizing my analysis based on your supposition about what I might have done, under different circumstances!
Etc.

New way:

Researcher: We discovered X, as demonstrated by this comparison with p less than .05.
Questioner: OK, you see this pattern in the sample. Do you think it holds in the population?
R: Sure, the result was statistically significant, so we can make that claim, that’s the point.
Q: So if your estimated (multiplicative) effect size was 3, with a 95% confidence interval of [1.1, 8.2], do you really believe an effect of 3 in the population?
R: No, not really. That’s our estimate but the real point is that it’s statistically significantly different from no effect, that is, 1.0 in this case.
Q: So what you’re saying is, you can reject the null hypothesis of no effect because, if there really were no effect, there’s a less than 5% chance that you’d get an estimate as high as 3. Is that right?
R: Yup.
Q: But what about the garden of forking paths etc etc?
R: Hey, that’s not fair etc etc!
Q: Yes it is fair! I’m saying that, even if there were no effect, there’s a more than 5% chance that you’d see a comparison with p less than .05. Your claim that you can generalize to the population relies on a statement from you about what could’ve happened under other circumstances. That is, you are making a claim about the general population that requires me to make assumptions about your hypothetical behavior under other circumstances.
Etc.
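
The questioner's claim that, with no effect at all, there is a more than 5% chance of seeing a comparison with p less than .05 can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the null is exactly true, the outcome has known unit variance, and the data-dependent analysis choice is crudely modeled as reporting the smallest p-value among the overall comparison and two arbitrary subgroups. The subgroup split, sample sizes, and function names are all hypothetical.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def z_stat(treated, control):
    """z statistic for a difference in means, assuming known unit variance."""
    n_t, n_c = len(treated), len(control)
    diff = sum(treated) / n_t - sum(control) / n_c
    return diff / math.sqrt(1 / n_t + 1 / n_c)


def simulate(n_sims=2000, n_per_arm=40, seed=1):
    """Return (preregistered rate, forking-paths rate) of p < .05 under the null."""
    rng = random.Random(seed)
    half = n_per_arm // 2
    prereg_hits = 0
    forking_hits = 0
    for _ in range(n_sims):
        # The null is true: both arms come from the same N(0, 1) population.
        treated = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        control = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        # Preregistered path: one comparison, fixed in advance.
        p_all = two_sided_p(z_stat(treated, control))
        # Forking paths (assumed form): also look at two subgroups,
        # then report whichever comparison looks most striking.
        p_a = two_sided_p(z_stat(treated[:half], control[:half]))
        p_b = two_sided_p(z_stat(treated[half:], control[half:]))
        prereg_hits += p_all < 0.05
        forking_hits += min(p_all, p_a, p_b) < 0.05
    return prereg_hits / n_sims, forking_hits / n_sims


prereg_rate, forking_rate = simulate()
print(f"preregistered: {prereg_rate:.3f}, forking paths: {forking_rate:.3f}")
```

The preregistered rate should land near the nominal 5%, while the forking-paths rate should come out noticeably higher, even though each individual p-value was computed correctly.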

My point is to keep your eyes on the prize, which is the goal of learning about the population. There are various statistical arguments a researcher can use to convince you that a certain pattern holds more generally. The researcher is under no obligation to use p-values at all, but if he or she does, then the concern is not that the p-value in question is “wrong.” It’s not about the p-value, it’s about what can be claimed about the population.

1. Dale Lehman says:

Actually, I think the real question is something else – not whether the statement is true about the population but whether it is true about a different population – usually the future one. Aside from some applications, most analyses are done on a sample (ideally random, but usually less than random) which is then used to say something about the population that will exist in the future. The implicit assumption that today’s population is essentially the same as tomorrow’s is unwarranted in most cases – at least most of the interesting cases. So, the question of whether it is true in the population is really a question about whether this particular sample from one population provides meaningful information about a somewhat different population.

To me (untrained in Bayesian analysis), this sounds like an inherently Bayesian task. Frequentist statistics cannot answer this question without assuming that the population the sample was drawn from is essentially the same as the one the results will ultimately be applied to. Perhaps this is an equal problem for Bayesians. But, to me, this is the prize to keep your eyes on.

• Jameson says:

I get your point, but I think that’s the wrong way to put it. It’s not that “frequentist statistics cannot answer this question without assuming that the population the sample was drawn from is essentially the same as the one the results will ultimately be applied to”; there are definitely frequentist techniques to handle a difference in populations. But I think all such techniques, frequentist or Bayesian, will ultimately rely on some kind of assumptions; that is, on constraints on the possible data-generating models. And the fact is, Bayesian tools for dealing with extra assumptions tend to be more flexible than frequentist ones. If I’m building a Bayesian model in Stan (or Mamba or whatever) and I suddenly think of some new constraint, there’s a good chance I’ll be able to simply plug it in with a few extra lines of code. But for a frequentist, it’s almost certainly going to send them back to the drawing board. If they can solve the problem, it might be a glittering jewel of a solution and there may be a very citable paper in it for them; but meanwhile, the Bayesian has churned on, answering 3 more questions in the time they’ve spent scratching their head.

In other words, if you already know what you want to do and how to do it, go ahead and be a frequentist. And if you’re just digging for potentially-interesting patterns, frequentist-inspired tools (such as LASSO) may be best. But if you want results that are likely to generalize, and your problem doesn’t fit neatly into the assumptions of an existing frequentist technique, build the most realistic Bayesian model you can afford to, then test it on virgin data.

• @Dale,
Is this more a question of whether the process is nonstationary? I’m not sure that is really more or less of a Bayesian task than it is for stationary data. I do agree, though, that this is often ignored – it seems it might be best to treat the problem as a restless or dynamic bandit rather than as a hypothesis-testing problem.

2. Kyle C says:

For the nonspecialist, this relates directly to the Freshman Fallacy. “What makes you think your result generalizes?” is often dismissed as a naive question, akin to, “What makes you think the correlation you found implies causation?”

3. brian says:

Agree with Dale. Unless we’re talking about random sampling from a properly defined population, then isn’t the problem of external validity much larger than the issue of the forking paths? I’m not saying that the latter is unimportant, but my sense is that observational studies tend to focus on internal validity, whilst the more important (and difficult) issue of whether the results will hold in different populations, in different places, and at different times, is treated informally if at all.

• Brian:

In some sense, the garden of forking paths isn’t a problem at all. Or, to put it another way, it’s a problem when a researcher wants to summarize inference using a p-value from a single comparison. If all the data are analyzed, forking paths shouldn’t be an issue.

I agree with you that issues of measurement (reliability and validity) and external validity are more important, and I think that’s one of the advantages of the “generalizing from sample to population” framework: it draws attention to these issues.

Forking paths is just an annoying thing we have to deal with because we’re surrounded by Bem-like researchers who don’t know any better and who think that the right way to draw conclusions is via selected comparisons.

• Rahul says:

That’s my basic issue with the garden of forking paths. In some sense it is one thing. In another sense it is another thing. The “thing” has never been stated precisely enough for anyone to disprove it. In some senses it is not even a hypothesis so the case for disproving it doesn’t arise. It is not clear as to what one would have to do to be able to show that the garden does not exist.

The fact that paths can fork or do fork is just that. A fact. Nothing that a researcher does methodologically changes that. No matter how one chooses to analyse, one must make decisions. Explicitly or implicitly. Can one point to cases where one is sure there were no forking paths that the researcher had to choose between? That makes the garden ubiquitous. So what exactly are we criticizing?

OTOH, something like pre-registration, now that’s a concrete, clear prescription that I can wrap my head around. Either one pre-registers or one doesn’t. Maybe one can exhort people to pre-register. One could even insist that unless a study is pre-registered one will not pay heed to the conclusions. Fair enough. Maybe that’s exactly what we should do.

Or maybe we insist on independent replications. Another concrete suggestion. Or perhaps insisting on researchers using an explicit loss function / effect sizes / economic impact instead of meaningless significance. Fine. Let’s do that.

But I wish there was similar clarity as to what the garden-critique is asking me to do. (What is the garden after all? A hypothesis? A story? A critique? A description? )

• Andrew says:

Rahul:

The point of forking paths is that the classical p-value requires that there be no forking paths, or that all such paths have been correctly modeled. It is a modeling assumption. One reason I’m not such a fan of classical p-values is that they are so sensitive to this assumption about what a researcher would’ve done under other circumstances. Likelihood functions and prior distributions are also sensitive to assumptions, but they’re assumptions about the data model and about the underlying parameters, which to me are more useful to model in that these models connect to the ultimate topic of the research being conducted.

If one is going to do a p-value-based analysis, then preregistration is great because it ensures that the no-forking-paths model is correct.

• Dale Lehman says:

For my field (economics, specifically focused on policy issues), the forking paths problem is severe, but I am less concerned with the use of p values and/or confidence intervals than with the lack of replication. Preregistration might work for some problems, but is probably too limiting. Take one concrete example – do increases in minimum wages lead to increases in unemployment? This issue has been studied many times with conflicting conclusions (and little evidence that the results are converging). Preregistration seems unwieldy (who would be the clearinghouse?) and perhaps counterproductive. I am also not that disturbed that someone, after running the particular regression model they report, finds that the effect is significant (at p=something) or that a confidence interval for the impact runs from x to y.

What really bothers me is that it is so rare to be able to replicate results. Data are rarely made available, and if public data is used, the precise dataset that was used in the research is not reported – only the public source is cited. I think the best protection against the forking paths is for others to try to replicate the results and the opportunity for others to choose other forks and show how the results differ.

In fact, in policy settings, I think policy makers should declare that they will attach no weight to research studies that do not provide their data publicly. I think this would go a long way to solving the issue. The fact that this is never done speaks more about the flawed political process than any lack of imagination. Policy makers are not seeking truth – they are seeking “evidence” to support their position. And many researchers are content to provide that for them.

• Rahul says:

Can you elaborate on why you think pre-registration is so unwieldy or onerous?

I mean, most researchers will spend a lot of time writing grant applications anyways and the work needed to preregister seems far lighter. Also, the methodology needs to be written up anyways prior to publication. Pre-registration just shifts the time-lines a bit.

For preregistration, do we even need a new dedicated clearinghouse? I’d be happy enough with a post of the analysis methodology on something like arxiv. Maybe with some way to avoid deletion but I doubt anyone’s actively trying to be malicious.

Even a post of the proposed plan of action on your own website should work (perhaps with an Internet Wayback Machine archival link for the extra paranoid).

• Anonymous says:

@andrew this absolutely needs to be communicated more clearly because it’s obvious to me that maybe for 80% of your readership, the take-home has been “some sort of multiple comparison adjustment or equivalent pre-registration is needed” when really the point is that “it’s needed if you’re going about the analysis using p-values”.

The distinction may be clear to you, but I don’t think it is well understood by others.

Secondly, as much as I am a proponent of Bayesian analyses / shrinkage / multi-level models I don’t think they can be said to be a silver bullet for researcher degrees of freedom.

There’s not a mathematical theorem that says shrinkage will cancel out selection bias. On the contrary, every “forking path” instance will introduce a different direction/degree of bias _and_ every choice of an “integration model” shrinks estimates in a different way. The mere existence of integrated modeling and shrinkage alone does not automatically negate the implicit selection bias effects of the researcher.

• Andrew says:

Anon:

There are two things going on.

First, in some settings there is a clear population of potential comparisons, and the classical approach is to pick the largest comparison, or the set of comparisons exceeding some threshold, in which case it is clear that some multiple comparisons adjustment is necessary. Lots of examples here, but we can start by thinking of cases where the population is well defined, for example some number of different treatments are considered, or comparisons are being made across 50 states, whatever. In such settings, one can instead fit a hierarchical model looking at all the comparisons at once, in which case no multiple comparisons correction is needed. This is the point of my paper with Jennifer and Masanao.

Second, just about any analysis that is not preregistered has researcher degrees of freedom, and to the extent that choices in data processing and data analysis are made after seeing the data and with a goal of getting some sort of result, this should be accounted for. My preferred way of accounting for these degrees of freedom is to try to identify them and include them in the analysis, to essentially fit them all at once. For example, instead of using a hard threshold for “peak fertility” (as in those notorious papers), use a continuous measure that is more consistent with scientific understanding. Still, there’s always the possibility of selection bias, and that’s an issue in any analysis.

My point, though, is that it’s a good idea to at least try to model everything. For example, when those researchers looked at color of clothing, it’s best to analyze all the colors, don’t just pick out one color and correct for multiple comparisons. And similarly for other choices in these analyses. For example, if there are dozens of possible interactions that might be of interest (age, marital status, parental socioeconomic status, political ideology, weather, etc etc etc), the best thing is to model as many of these as possible. That makes a lot more sense to me than holding these explanations in reserve, to use as necessary. The more selection that is done in the data processing and analysis, the more the garden of forking paths is a concern.
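
To make the “look at all the comparisons at once” idea concrete, here is a minimal sketch of partial pooling. It is a crude empirical-Bayes stand-in for a full hierarchical model, not the method of the paper mentioned above, and it assumes every comparison has the same known standard error. The estimates and the standard error are made-up numbers for illustration only.

```python
from statistics import mean, variance


def partial_pool(estimates, se):
    """Shrink raw estimates toward their common mean.

    Assumed model: estimate_j ~ N(theta_j, se^2), theta_j ~ N(mu, tau^2),
    with mu and tau^2 estimated from the data by the method of moments.
    """
    mu_hat = mean(estimates)
    # Between-comparison variance: observed spread minus sampling noise.
    tau2_hat = max(0.0, variance(estimates) - se**2)
    # Weight on each raw estimate; the rest goes to the common mean.
    weight = tau2_hat / (tau2_hat + se**2)
    return [mu_hat + weight * (y - mu_hat) for y in estimates]


# Hypothetical raw comparisons, one of which looks impressively large.
raw = [0.1, -0.3, 0.8, 0.2, -0.1, 2.4, 0.0, 0.5]
pooled = partial_pool(raw, se=0.6)

for r, p in zip(raw, pooled):
    print(f"raw {r:5.2f} -> pooled {p:5.2f}")
```

Every comparison stays in the analysis, and the largest raw estimate is pulled toward the rest instead of being singled out and then corrected after the fact.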

• Rahul says:

Great points. Basically rather than do a Yes / No judgement on a specific parameter it’s more useful to do a relative quantitative estimation of many parameters.

• Anonymous says:

From different anon than above:

Andrew, you are misdiagnosing the problem and getting the cure wrong. Why is creating a model after observing the data a problem? That’s the key question and everything hinges on the answer.

Frequentists believe that creating the model after seeing the data changes what’s “random”, which in turn changes the “probabilities”. The effect is very similar to the way Frequentists will analyze a binomial experiment differently depending on the stopping rule used.
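
The stopping-rule dependence can be shown with a textbook-style example. The sketch below assumes a hypothetical experiment that produced 9 heads and 3 tails and tests a fair coin against a bias toward heads. The two functions differ only in the assumed stopping rule: a fixed number of tosses, or tossing until the third tail appears.

```python
from math import comb


def p_fixed_n(heads, tails):
    """One-sided p-value if the total number of tosses was fixed in advance."""
    n = heads + tails
    return sum(comb(n, k) for k in range(heads, n + 1)) / 2**n


def p_stop_at_tails(heads, tails):
    """One-sided p-value if tossing continued until the `tails`-th tail.

    Seeing at least `heads` heads before that tail is the same event as the
    first heads + tails - 1 tosses containing at most tails - 1 tails.
    """
    m = heads + tails - 1
    return sum(comb(m, k) for k in range(tails)) / 2**m


p_binomial = p_fixed_n(9, 3)
p_negative_binomial = p_stop_at_tails(9, 3)
print(f"fixed n = 12:       p = {p_binomial:.4f}")
print(f"stop at third tail: p = {p_negative_binomial:.4f}")
```

The data are identical in both cases, yet the two p-values fall on opposite sides of .05, because the p-value depends on what else could have been observed under each rule.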

If you think this is the issue, then there’s two cures. Restrict people from creating models after seeing the data (pre-registration, proscriptions against p-hacking, and so on). Or as you say “account for” decisions made after seeing the data. In a delicious irony for Frequentists, either option introduces an absurd dependence on the subjective psychological state of the researcher.

But that isn’t the problem. The problem is caused by the fact that given any data set (no matter how it was generated or what its “cause” is) you can create a “successful” statistical model to account for the data. As a result, all those model validation efforts proving the model is “good” don’t tell you whether the model generalizes to new data or new populations.

There’s no inherent connection between “cause of data” and “we found a good statistical model for the data”. Frequentism makes people believe there is, but it’s absolutely untrue most of the time. Without such a connection the inference sequence “we have a good model, therefore we understand the causes of the data, therefore we can generalize to new situations” breaks down at that first “therefore”. That’s why most statistics-heavy research isn’t reproducible.

The “cures” listed above will not solve the problem. They might make the problem slightly worse, or slightly better, but it will remain. There is no statistical model check or statistical method or statistical trick that could solve it.

The only solution is to have outside, independent, non-statistical, checks that you’ve “understood the cause of the data”.

The reason pre-data modeling feels right is because if you can create the model before seeing the data, then typically you knew enough about the causes to get reproducible results before you started. That’s also why physicists and engineers are so much more successful with statistical methods than social scientists: they already know an enormous amount about the real causes before they get the data.

It’s not the pre-registration that yields reproducibility, it’s having accurately identified the causes that yields reproducibility. In other words, pre-registration might be correlated with better results, but it isn’t the reason the results are better. If you pre-register, but still get the causes wrong, which is the path reformers are leading us down, you still get crap research.

• Andrew says:

Anon:

I think you’re arguing against someone other than me. I say this because you use, in quotes, many phrases that I have never used.

1. “cause of data”: This is something I never say.

2. “we found a good statistical model for the data”: I wouldn’t typically say this either. I might talk about goodness of fit, but the goodness of the model itself depends on more than the data.

3. “we have a good model, therefore we understand the causes of the data, therefore we can generalize to new situations”: Nope, I’d never say this. See item 1 above.

4. “cures”: No, I don’t say that either.

For Bayesian reasons discussed in chapter 8 of BDA3, it can be important to understand the selection of data used in one’s analysis.

Finally, I agree with you that preregistration will not solve problems of data collection. If a researcher is studying small and variable effects with highly noisy measurements, all that preregistration will do is to likely confirm that the study design is no good. As in the “50 shades of gray” paper, the researcher will just end up with the negative finding that the study did not work; there will be no positive contribution to science.

4. Anonymous says:

There’s another way used by Laplace over two centuries ago:

(1) Take location measurements of a heavenly body (a comet or planet for example).

(2) Know the range of uncertainty of those measurements (not their frequency of occurrence which is unknown, but a range for their typical size which is easily known).

(3) Compare the heavenly body’s position to that calculated from Newton’s laws. If the difference between the calculated and observed is greater than the range of uncertainty, then create the hypothesis that there’s an additional currently unknown heavenly body perturbing it.

(4) Search for and find the new heavenly body.

(5) Use it to make accurate predictions decades (even centuries) in the future of the kind that will make social scientists eternally jealous.

Where in any of this does multiple comparison or garden of forked paths, or anything else that depends on the psychological state of the researcher, enter? Frequentism causes smart people to think stupid things.

• Andrew says:

Anon:

This approach doesn’t quite work in my social science research because I typically don’t have discrete hypotheses of the form “there is a new planet,” but indeed what you describe is the iterative approach we describe on the very first page of Bayesian Data Analysis. In formulating inference this way I was heavily influenced by the writings of Jaynes, and I think my collaborator Rubin was highly influenced by the work of George Box.

As I noted in response to a different commenter, the garden of forking paths doesn’t really come into this at all (which is why Jennifer, Masanao, and I wrote a paper a few years ago on why we don’t usually care about multiple comparisons). The garden of forking paths arises only in an attempt to meet p-value people on their own terms. Like it or not, researchers are often reporting, as strong evidence, p-values based on selected comparisons, and I’ve found it helpful to try to understand how this is happening. The garden of forking paths is part of the story.

• Anonymous says:

My point is this. When Laplace did this sort of thing, which amounts to a kind of Bayesian significance testing, he would have written down something like N(0, sigma) for the measurement errors. To some extent, it makes no difference what he really meant by that. He could have meant:

(A) The frequency histogram of measurement errors looks like N(0,sigma)

(B) The typical size of an error is around ~sigma.

The former is about frequencies, the latter is about uncertainty. The frequency of errors is almost never known in truth or even knowable in principle, while the general size of errors (range of uncertainty) from a measuring device is easily known.

They are two radically different things. The errors can generally have size ~sigma while their histogram looks nothing like N(0,sigma)!

The immediate consequences of sloppily confusing (A) and (B) may not be much. But the past two centuries of statistical experience indicate that if you use the mistaken interpretation (A) you are eventually led to an enormous number of stupid ideas, which statisticians seem eternally unable to fix, and which cause really smart people to lose the ability to think about even really simple problems correctly.

The only solution is to insist adamantly that frequencies are frequencies and should always be labeled as such while probabilities model uncertainty. With that change, people who can’t reason their way out of a “garden of forked paths” type of muddle to save their lives can easily, even trivially, see what’s true.

• Aki Vehtari says:

> When Laplace did this sort thing, which amounts to a kind of Bayesian significance testing,
> he would have written down something like N(0, sigma) for the measurement errors.

He used the double exponential distribution, later also known as the Laplace distribution (Laplace, 1774).

• Anonymous says:

The fact that (A) implies (B) creates a mental trap for Frequentists, because if they assume (A) is true they will get a good result using Laplace’s method.

Then they mistakenly think (A) is not only sufficient, but is required as well. This is absolutely false. It’s not even an approximate requirement. All that’s required is (B), which is a dramatically weaker condition than (A).

Once frequentists fall into this trap however, they are forever stuck thinking of probabilities as frequencies and they can’t make the mental adjustment to thinking of probabilities as modeling uncertainties.

So they can never see the trivial fact that (B) is all that’s required to make Laplace’s method work.

• Anonymous says:

Just to give an idea of the monumental levels of stupidity that the frequentist interpretation (A) leads to, consider this. Most statisticians think that if the measurement errors don’t have a histogram that looks like N(0,sigma) Laplace’s method will fail.

This is absolutely untrue!

Look back at the original logic used by Laplace in my first comment. Laplace hypothesized a new heavenly body when the difference between the observed and calculated positions was greater than the range of uncertainty of the measurement errors.

The only thing this requires is that the errors have sizes around a few ~sigma or less (or alternatively, that the measuring device has precision noticeably smaller than |observed - predicted|). I stress this is a radically different criterion from saying their histogram looks like N(0,sigma). You can easily have errors of size around ~sigma, but whose error histogram doesn’t look even approximately like N(0,sigma).

In essence (A) implies (B), but the converse isn’t even close to being true. You can have, and in reality almost always do have, (B) being true while (A) is false.

Indeed, this almost always happens in practice.
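
As a rough illustration of the (A) versus (B) distinction, the sketch below uses a deliberately extreme, assumed error distribution: every error is exactly +sigma or -sigma, so the typical size is sigma but the histogram looks nothing like N(0,sigma). The detection rule (flag a discrepancy larger than three sigma) and all names are hypothetical. It shows the claim for one bounded error distribution only and does not settle the broader argument.

```python
import random

SIGMA = 1.0
THRESHOLD = 3 * SIGMA  # assumed rule: flag when |observed - predicted| > 3 sigma


def two_point_error(rng):
    """Error of size exactly SIGMA with a random sign (a non-normal histogram)."""
    return SIGMA if rng.random() < 0.5 else -SIGMA


def flags_discrepancy(true_offset, rng):
    """Apply the rule to one measurement with the given real perturbation."""
    observed_minus_predicted = true_offset + two_point_error(rng)
    return abs(observed_minus_predicted) > THRESHOLD


rng = random.Random(0)
n_trials = 10_000
# No perturbing body: the real offset between theory and observation is zero.
false_alarms = sum(flags_discrepancy(0.0, rng) for _ in range(n_trials))
# A perturbing body shifts the position by ten sigma.
detections = sum(flags_discrepancy(10 * SIGMA, rng) for _ in range(n_trials))

print(f"false alarms: {false_alarms} / {n_trials}")
print(f"detections:   {detections} / {n_trials}")
```

Because the errors never exceed sigma in size, the rule never fires without a real perturbation and always fires with one, although the error histogram is as far from normal as it could be.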

• Anonymous says:

“as strong evidence, p-values based on selected comparisons”

Here’s an interesting question Andrew. If Laplace had used the procedure described above, but had made lots of comparisons between theory and observation (essentially a version of p-hacking, cherry picking, or whatever you want to call it), would it have derailed him in any way?

If the difference between Observed and Predicted is much larger than the error sizes the measuring device gives off (i.e. the precision of the ruler), does it in any way make a difference, either theoretically or in practice, if he had made lots of “observed vs predicted” comparisons?

None that I can see. Exactly the opposite in fact. He would be better off making lots of comparisons between theory and practice because that would have increased his chance of finding new heavenly bodies.

5. Keith O'Rourke says:

Given some comments here, it might be better to make the example a randomised comparison of say cultured cell lines (that are hoped not to evolve) treated two ways with no mishaps but a complicated outcome needing an aspect to focus on – so one only needs to worry about the analysis issues.

Then, I think the statement “That is, you are making a claim about this well-defined population that requires me to make assumptions about your hypothetical behavior under other circumstances” seems clear and important.

6. Anonymous says:

Here’s another way to do science.

(1) Suppose you’re interested in theorizing a new effect in some social science. The effect is described by some “lambda” not equal to zero. We can even assume for the sake of argument that lambda will never be exactly equal to zero (i.e. all null models are wrong).

(2) Gather up all the partial information/evidence we have about lambda.

(3) Determine every value of lambda reasonably consistent with all information/evidence. Use this to form a range of possible values for the true lambda.

(4) Three cases:

(I) If the range for lambda looks something like (-.0000001,.0000001) then the true lambda is so close to zero we can ignore it in our theories without getting a large error.

(II) If the range looks something like (1,2) then we need to adjust our theories for a non-zero lambda, since it makes a difference.

(III) If the range looks something like (-.1, 1), we’re not sure if neglecting lambda will make a difference or not. Collect more evidence.

The question once again is, where in any of this does it matter how many effects you looked for, multiple comparisons, “garden of forked paths” or anything else related to the psychological state of the researcher?
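
The three cases in (4) amount to a simple decision rule on the plausible range for lambda. A minimal sketch follows, assuming a hypothetical threshold `negligible` below which lambda is treated as too small to matter. In practice that threshold is a substantive judgment about the theory, not a statistical one.

```python
def classify_range(low, high, negligible=1e-3):
    """Classify a plausible range (low, high) for lambda into cases I to III.

    `negligible` is an assumed, problem-specific cutoff for "too small to matter".
    """
    if low > high:
        raise ValueError("low must not exceed high")
    if max(abs(low), abs(high)) <= negligible:
        return "I: ignore lambda"
    if low > negligible or high < -negligible:
        return "II: adjust theory for non-zero lambda"
    return "III: collect more evidence"


case_one = classify_range(-0.0000001, 0.0000001)
case_two = classify_range(1, 2)
case_three = classify_range(-0.1, 1)
print(case_one)
print(case_two)
print(case_three)
```

Nothing in the rule refers to how many other effects were examined. It depends only on the range and on the assumed cutoff.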

• Dale Lehman says:

I don’t understand your setup. If (ii), you say we need to adjust our theories for a nonzero lambda – because it makes a difference. But how do we know it makes a difference without knowing all the conditions that were or were not tested? Possibly it makes a difference because the researcher failed to incorporate Y into their model and adjusting for Y really renders the effect to be more like case (i). I suppose what you mean by (3) “consistent with all information/evidence” might mean to include this, but then your conditions are unrealistic. It is like saying if there were no issue with forked paths, then there would be no issue with forked paths.

• Anonymous says:

“But how do we know it makes a difference without knowing all the conditions that were or were not tested?”

Because saying “(1,2) contains all values of lambda reasonably consistent with the evidence” is just another way of saying “the true value of lambda is in (1,2)”. So if the difference between lambda=0 and lambda=1 is important, then you’d better not assume lambda=0.

If that “evidence” isn’t true then obviously you may have a problem. Sometimes the “evidence” is really an unchecked assumption, but that’s a problem for you not for Bayesian statistics.

If you use probabilities to model uncertainty rather than frequencies, then the Bayesian Credibility Interval derived from the posterior P(lambda | evidence) will in effect tell you which values of lambda are consistent with the evidence. For a fixed lambda, P(lambda |evidence) is really telling you the strength of that consistency rather than the frequency of occurrence of that lambda. Indeed, there may only be one lambda so no “frequency” exists at all even in principle.

You don’t have to make the kind of judgments in (4) explicitly. It’s better to simply use P(lambda |evidence) in future calculations. For example, if (I) holds and P(lambda|evidence) is sharply peaked about zero, then the posterior is approximately a delta function about zero and blindly using it in future calculations will effectively set lambda equal to zero.

Similarly with the other cases. So in practice, whenever you have an expression dependent on lambda in your theory, just average over P(lambda |evidence) and all those judgments in (4) are automatically taken care of. This is especially true if lambda is a vector of variables (the multi-comparison scenario).
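
Here is a minimal sketch of “just average over P(lambda |evidence)”, assuming for illustration that the posterior is normal and can be approximated on a grid. The function `prediction` is a hypothetical stand-in for whatever expression in the theory depends on lambda.

```python
import math


def normal_grid(center, sd, n_points=2001, width=6.0):
    """Discretize an assumed Normal(center, sd) posterior into grid weights."""
    step = 2 * width * sd / (n_points - 1)
    grid = [center - width * sd + i * step for i in range(n_points)]
    dens = [math.exp(-0.5 * ((x - center) / sd) ** 2) for x in grid]
    total = sum(dens)
    return grid, [d / total for d in dens]


def posterior_average(g, grid, weights):
    """Average g(lambda) over the discretized posterior."""
    return sum(w * g(x) for x, w in zip(grid, weights))


def prediction(lam):
    """Hypothetical theory expression that depends on lambda."""
    return math.exp(lam)


# Case (I): posterior sharply peaked about zero.
avg_case_one = posterior_average(prediction, *normal_grid(0.0, 1e-7))
# Case (II): posterior concentrated well away from zero.
avg_case_two = posterior_average(prediction, *normal_grid(1.5, 0.25))

print(f"case I average:  {avg_case_one:.6f}  (plug-in lambda = 0 gives {prediction(0.0):.6f})")
print(f"case II average: {avg_case_two:.6f}")
```

In the first case the average is practically the same as setting lambda to zero, and in the second it clearly is not, without any explicit case-by-case judgment.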

Note you cannot use Confidence Intervals to do any of this. CI’s are constructed with a different goal in mind. In fact, it’s possible to get CI’s which only contain impossible values of lambda. That’s the exact opposite of what we need to pull this off.

Agree with Dale. We don’t want to know whether the effect exists under the conditions where the data were collected (we have exact measurements for what happens under exactly those conditions, so no need to talk about probability there) – we want to know whether the effect exists under _slightly different_ conditions, where some variables are different but others are kept the same. But, unlike physics, we have no idea which variables matter for defining these conditions of interest, i.e. which ones can vary and which ones should stay the same. If we keep all conceivable variables the same, no data set will contain more than one sample.

So there is typically no guarantee that the conditions under which the available data were collected are representative of the conditions under which we want to draw a conclusion. We’re still stuck with the question of whether the non-zero lambda generalizes to conditions of interest, because we don’t know what the conditions of interest _are_.

Oh, and you left out the most relevant case:
(iv) if the range looks something like (.0000001,.0000002) then lambda is reliably known to be non-zero under the conditions covered by the data, but there are a zillion possible things that could have caused it and lambda need not be non-zero under the actual (unknown) conditions of interest.

• Anonymous says:

Uh, I thought it was obvious that lambda was a fixed unchanging constant like the speed of light. If we learn about it under any circumstances we learn about it.

Whether it is fixed or not is part of the “evidence”. Physicists didn’t just assume it was universal the way Social Scientists are wont to do with their variables/models. They checked that assumption. If you use untrue evidence that’s your problem.

Having said that, it was only for simplicity that I didn’t state things in complete generality. With an appropriate generalization lambda could be a vector or a function of time. Lambda could be a functional of a function of time. It could even be a frequency for that matter. Or a frequency function which changes with time. It could be all kinds of things.

You have to do some thinking on your own. I don’t have a mind ray that can just beam stuff from my brain to yours.

And no I didn’t leave out a special case. As I explained to Dale, if you just use P(lambda | evidence) and carry it forward through future calculations it automatically handles all cases, as well as any in between cases. The cases I listed were to show that it’s intuitively doing the right thing. Since the Bayesian mathematics is just as available to you as it is to me, you can check yourself what happens in intermediate cases.

• I think the issue though is that we have to do the Bayesian calculation on more than one model if we don’t know “THE MODEL”, i.e. lambda is a constant, lambda is a slowly varying function of time, lambda is an observed frequency which is a slowly varying function of time, lambda is a frequency which is a slowly varying function of time but a rapidly varying function of space, race, genetic composition, blablabla… and then we have Bayesian intervals that are incommensurate with each other in some sense because we don’t have any way of deciding between models.

• Anonymous says:

There is no version of statistics that can guarantee good results from false inputs. If there were you’d never have to collect any evidence or data. You could just make up evidence, stick it in, and your statistical methodology would still give you the right final answer.

Therefore it’s up to statisticians to ensure they put true things, and only true things, into the analysis.

That’s what they get played for. Lord knows they ain’t getting paid for their looks, wit, or personality.

• Dale Lehman says:

Well, I’m sorry if you think that I am asking you to beam stuff into my brain. How about some clarity and civility instead? I really don’t understand your point. “True inputs” seems like a hypothetical construct that simply recreates the problem at hand. To be concrete: numerous studies have been conducted about the effects of minimum wages on unemployment – they reach varying conclusions. I was not trained as a Bayesian but I accept that we should have some prior distribution based on all of these studies conducted over the past decades. Then we do a new study and it finds something surprising. Given that evidence, we construct the posterior distribution and it is somewhat different than what we thought before – because we have new evidence that is worth “something.”

Isn’t the question just what is the value of the prior information and the new information? Both came from a number of models (possibly an unknown number of modeling attempts), each of which has the forked path problem. How do we weight these earlier models and how do we weight the new evidence? You seem to suggest that this is not an issue, provided that all of these studies were done based on “true assumptions.” It sounds like that is a meaningless hypothetical. If only people did “true things” then the truth would be straightforward.

• Anonymous says:

I was talking to Konrad. I was civil. He condemned me for not giving answers to questions I wasn’t asked and didn’t think himself about how to modify what I did say for new situations.

“True inputs” is pretty simple.

If your analysis assumes lambda is constant and it’s not then that assumption is not a true input.

If your analysis assumes your ruler can measure distances accurate to around 1mm and it has errors of 20cm then that assumption is not a true input.

If your analysis assumes a prior for the effect of the minimum wage on unemployment which says “an increase in minimum wage of \$X will result in somewhere between 100,000-500,000 jobs lost” and it turns out that it actually cost 10 jobs or 10,000,000 jobs then that prior is not a true input into the analysis.

Anon: that’s not a fair summary of what I said and you know it. I was making a specific point, but “I thought it was obvious that lambda was a fixed unchanging constant like the speed of light” indicates you are missing it. What should be obvious is that there are no such constants in social science – if your suggestions are only useful for finding such constants, they are of no use in this context. Have another look at what I wrote – if you think the problem is one that can be fixed by introducing time variation etc into the model, you are missing the point.

Also, you enumerated a set of special cases based on varying (a) small vs large amount of uncertainty and (b) interval containing vs not containing 0. Clearly there are 4 such cases to enumerate, but you listed only 3. My point was that it is the one you omitted that is actually relevant to the OP.

• Keith O'Rourke says:

Anon: Agree, but

“could just make up priors (and likelihoods), stick your data in, and my statistical methodology would still give you the right final answer” describes how many Bayesians work and explain that work in applications – according to some of my Bayesian colleagues and my general experience.

> That’s what they get played for.
Exactly with the typo, they are used to convert intolerable but real uncertainty into sure answers (in terms of probabilities) – right now Bayesian ones are likely to be taken more seriously than they should be.

For instance, the joint model (prior and likelihood) _tells_ you what is informative or not (does it change the probabilities if conditioned on? – as Dempster pointed out in Andrew’s Wrestler/Boxer paper) but that is just an indirect means of assuming various things don’t matter – how do you check that other than your intuition?

• Anonymous says:

Most “Bayesians” are like Andrew, they (mostly) use the sum and product rule to derive their methods, but they often or always think of probabilities as frequencies. Half Bayesian and half frequentist. Like the word “television” which is half latin, half greek: no good can come from it. See “garden of forked paths” fantasies for example.

I gave the cure for that illness. Call frequencies “frequencies” always, and reserve probabilities for modeling uncertainties. I would also recommend calling cats “cats” and dogs “dogs” and so on. You get the idea.

I’ve personally never had to use my “intuition” to check that I was modeling uncertainty correctly. When I use the rest mass of an electron in a model for example, I don’t use my “intuition” to figure out the uncertainty of it. Physicists report an uncertainty/error-range with the point estimate for the mass. I just use what they report.

And like Laplace I’ve never had the slightest issue figuring out how to check whether a model was giving accurate inferences to within those uncertainties. So I’m not sure what the problem is.

• george says:

Anon: “half Latin/half Greek” doesn’t help your case much, it was said by CP Scott, who had a massive conflict of interest on television.

Also, “semi-parametric” is another hybrid, and is viewed as pretty useful – in Bayesian work, frequentist work, and work that tries to use the best of both approaches.

• Anonymous says:

George,

Thank you for the CP Scott reference!

Any and all claims about what is Bayesian are extremely suspect. If they are made by a Frequentist they should be discounted entirely.

Centuries from now when people look back on 20th century statistics they will condemn Frequentists for believing and indoctrinating ideas no less insane than the “earth is flat” or “you can cure most diseases by bleeding”.

But they are going to condemn Bayesians for being overly fixated on “Bayes Theorem”. In truth, Bayesian statistics is based on the sum and product rules of probability and the supposition that they can be used on any well defined propositions. One consequence of the product rule is the classical Bayes Theorem. But that’s just one tiny part of their implications.

Even if you restrict yourself to “updating” type formulas where you go from a P(x) to some p(x|A) the sum and product rules imply an infinite number of such formulas. Essentially, you get a different one for each context and each set of auxiliary assumptions. For example, if there’s a nuisance parameter present and you integrate it out the way you’re supposed to (using the sum/product rules), you get an “updating” formula which differs from Bayes Theorem.
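One reading of the nuisance-parameter case, written out as a sketch. The symbols D (the data) and η (the nuisance parameter) are introduced here for illustration and are not in the comment above. The sum rule integrates η out, and the product rule factors the joint distribution:

$$
p(x \mid D) \;=\; \int p(x, \eta \mid D)\, d\eta \;=\; \frac{p(x) \int p(D \mid x, \eta)\, p(\eta \mid x)\, d\eta}{p(D)}
$$

The likelihood p(D | x) of the textbook formula is replaced by an integral over η weighted by p(η | x), so the resulting update depends on what is assumed about the nuisance parameter.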

One consequence is that there are a mass of things which people (including ‘Bayesians’ like Andrew) claim aren’t “Bayesian” but are in fact simple and intuitive consequences of the sum and product rules.

• Martha says:

Anon:
Re “Call frequencies “frequencies” always, and reserve probabilities for modeling uncertainties”:

But sometimes previous empirical frequencies provide important information for modeling uncertainties.

Re “When I use the rest mass of an electron in a model for example, I don’t use my “intuition” to figure out the uncertainty of it. Physicists report an uncertainty/error-range with the point estimate for the mass. I just use what they report.”:

This is a fairly simple case, in contrast to many statistical problems, where prior information (especially prior credible information) is scarce.

• Anonymous says:

Martha,

Any data/information, not just frequencies, can potentially be used to create a probability distribution. Similarly, we can use prob distros to predict data/information/frequencies. But it’s insane to confuse or equate frequencies to probabilities.

We can use temperature measurements to build weather models to predict tomorrow’s rain, but everyone understands “temperature”, “weather models”, and “rain” are completely different things.

“This is a fairly simple case, in contrast to many statistical problems, where prior information (especially prior credible information) is scarce.”

Actually, I’ve found, like Jaynes, that no matter how seemingly little prior information there is, there’s always enough to build a fairly informative prior. But if I were stuck with an uninformative prior the world’s not going to end. For example, rumor has it that Gelman is as fat as a blimp. I have no idea whether that’s true or not, so I’m going to use a prior for his weight uniform on [0 lbs, weight of earth].

That’s a highly diffuse prior which says Gelman has a weight intermediate between that of a ghost and the entire earth. I’m confident his true weight is within that range whether the rumors are true or not.

If I collect data and use Bayes theorem, I’ll get a posterior which implies a shorter, more precise range, that is contained entirely within those initial limits. Since his true weight is genuinely inside those initial limits there won’t be any problems. The prior range for his weight will be consistent with both the posterior range and his true weight.

The only difference is that the prior range estimate for his weight is huge, while the posterior range estimate for his weight is small, but that’s how learning works.

• Anonymous says:

Martha,

A little more succinctly: using an outrageously uninformative prior will not give you a ‘bad’ posterior. In fact, it won’t have any negative consequences at all other than you’ll need slightly more data to get the same ‘good’ posterior.

This reflects a conservation of information principle. If you need your posterior to be highly informative, then you can reach that goal using any combination of (information in prior)+(information in data). The posterior doesn’t care how that total is broken down.

• george says:

Anon: you’re welcome. Here’s another quote for you, this one from David Cox:

“I want to object to the practice of labelling people as Bayesian or Frequentist (or any other ‘ist’). I want to be both and can see no reason for not being”

(He said it in discussion of Lindley’s “Philosophy of Statistics” paper.) I appreciate that it’s difficult to avoid ad hominem arguments. But the discussion should really be over methods and analyses, which may have multiple Bayesian and/or Frequentist interpretations – that are not mutually exclusive.

Are these justifications relevant? Are they useful? If not, how could their weaknesses be addressed, while keeping their strengths? Maybe consider that, instead of ranting about how insane you think 20th century statistics is. Thanks.

• Anonymous says:

George,

Cox’s view can fairly be described as the most common one, but it’s wrong. Lord knows it’s not the first time the majority of statisticians were all wrong about something (See Ronald Fisher’s and Gelman’s thesis advisor’s work on smoking/cancer). Here’s the problem with it.

Frequentists believe probabilities are frequencies and thereby reflect physical properties of the universe.

Bayesians believe probabilities model uncertainty about physical facts.

At first it looks like these are merely two different interpretations of the word “probability”, but that’s not quite true. If the “physical facts” the Bayesian is concerned with happen to be “real physical frequencies” then “Frequentism” drops out as a special case of the Bayesian analysis.

So the claim here is not that these are two different interpretations of probabilities. The claim is that the Bayesian interpretation subsumes the frequentist one when it actually makes sense, but is far more general.

And that has huge practical consequences.

But in the spirit of compromise, I’ll agree to stop labeling people as “Frequentists” or “Bayesians” if all statisticians agree to always call frequency distributions “frequency distributions”. Fisher used to do that in his early papers. They should all follow his example.

• Andrew says:

Anon:

You write, “rumor has it that Gelman is as fat as a blimp. I have no idea whether that’s true or not, so I’m going to use a prior for his weight uniform on [0 lbs, weight of earth].” This is an excellent example of how the relevance of any part of the model can often be understood in the context of the rest of the model. Get one or two reasonable measurements of my weight, and it’s irrelevant whether your prior is uniform(0, weight of earth) or uniform(0,1000) or normal(200,200) [using the Stan notation here for the normal distribution]. On the other hand, if you only have 0 observations and you need to decide whether to sell me a plane ticket, then your choice of prior could make a difference. The point is, in this example the prior distribution can best be understood in light of the data.
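A rough numerical check of the claim that the three priors become interchangeable after a couple of measurements. The two scale readings and the 2 lb measurement error are invented for illustration, and normal(200, 200) is read as mean 200, standard deviation 200. The grid stops at 1000 lbs, which truncates the widest prior; that only matters when there is no data.

```python
import numpy as np

EARTH_LBS = 1.3e25  # rough mass of the earth in pounds

def grid_posterior(log_prior, data, scale_sd, grid):
    """Normalised posterior weights on `grid` for a normal measurement model."""
    log_lik = sum(-0.5 * ((y - grid) / scale_sd) ** 2 for y in data)
    log_post = log_prior(grid) + log_lik
    w = np.exp(log_post - log_post.max())  # subtract max for numerical stability
    return w / w.sum()

grid = np.linspace(0.0, 1000.0, 200_001)  # candidate weights in pounds

# Log prior densities, each evaluated on the grid.
priors = {
    "uniform(0, earth)": lambda w: np.full_like(w, -np.log(EARTH_LBS)),
    "uniform(0, 1000)": lambda w: np.full_like(w, -np.log(1000.0)),
    "normal(200, 200)": lambda w: -0.5 * ((w - 200.0) / 200.0) ** 2,
}

data = [171.0, 173.0]   # hypothetical scale readings
scale_sd = 2.0          # hypothetical measurement error of the scale

post_means = {
    name: float(np.sum(grid * grid_posterior(lp, data, scale_sd, grid)))
    for name, lp in priors.items()
}
```

Under these assumptions the three posterior means agree to within a small fraction of a pound. With an empty `data` list the same function returns the normalised prior on the grid, and the priors then give very different answers, which is the plane-ticket case.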

A similar issue arose, in the opposite direction, in our discussion of some of those silly “power = .06” studies from Psychological Science. In those cases, any reasonable prior distribution was far more precise than the information from the available data, thus it was silly to try to look at the data without priors, indeed that makes about as much sense as looking at the prior without data in the “Andrew’s weight” example.

• Anonymous says:

Andrew,

I think there’s a substantial amount of useful, but largely unexplored territory there, which illustrates perfectly why Bayes vs Frequentism matters. What I wrote to Martha was meant to be intuitive, but with a definition for “information” (entropy), and the sum/product rules of probabilities you can derive all kinds of expressions which make it concrete.

With them you get lots of intuitive results. If you have a diffuse (uninformative) prior and a weight scale with large errors then you get a diffuse posterior. If the scale is highly accurate (informative), the posterior will be narrow no matter how diffuse the prior is. And so on.
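For the special case of a normal prior and normal measurement error (an assumption made here for illustration, with precision standing in for “information” rather than entropy), these results can be read off from the fact that precisions add. With prior standard deviation τ, scale error σ, and n readings, the posterior standard deviation τ_n satisfies

$$
\frac{1}{\tau_n^{2}} \;=\; \frac{1}{\tau^{2}} + \frac{n}{\sigma^{2}}
$$

As τ grows, τ_n² approaches σ²/n: a noisy scale (large σ) leaves the posterior diffuse, while an accurate scale (small σ) makes it narrow however diffuse the prior is.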

All that is old news. What’s perhaps not fully exploited is that whether a distribution is informative enough depends critically on what you’re using it for. Or to turn it around, distributions only need to be informative about things you care about. For example, if you only care about a known loss function then your distributions only need to be informative about that loss function in some sense. They can be dramatically ‘wrong’ (really ‘uninformative’) in all other respects.

So in practice, if it’s convenient, you can throw away information (or simply not collect it), so long as the distribution is still informative in the one aspect you care about.

This is where it matters whether you’re a Frequentist or Bayesian. Frequentists think in terms of one correct probability distribution (because they think it’s a measurable frequency distro). If you suggest using a very different distribution which is still accurate/informative in the one aspect you care about, they look at you like you have a third eyeball growing out of your forehead. To them that more convenient distribution is hopelessly incorrect and can’t possibly work.

And yet it will work for the one task you need it for. Moreover, that ‘wrong’ distribution can be so much easier to get and work with that it can make a practically impossible analysis suddenly easy.

7. (I’m soon going to be a Stan t-shirt wearing, Stan-mug wielding Bayesian, but)

One response in that exchange with Andrew could be:

I could still fit linear mixed models using lmer and draw my inferences from these using Andrew’s “secret weapon” (which I think turns up in a footnote in Gelman and Hill): by replicating my results by re-doing the experiment several times and checking that I get the same sign for the coefficient of interest. When I can afford to, that is exactly what I do. That’s more convincing to me than even a fully Bayesian approach. At least in my life as an experimenter, I only need to ensure I have high power and can replicate the result. The rest is just irrelevant in practical terms. For me, the attraction of the Bayesian approach lies in its astonishing flexibility compared to the relevant frequentist tools available.