Ed Bein writes:

I’m hoping you can clarify a Bayesian “metaphysics” question for me. Let me note I have limited experience with Bayesian statistics.

In frequentist statistics, probability has to do with what happens in the long run. For example, a p value is defined in terms of what happens if, from now till eternity, we repeatedly draw random samples from some population of interest, compute the value of a test statistic, and keep a running tabulation of the proportion of values that exceed a certain given value. Let me refer to probability in a frequentist context as F-probability.

In Bayesian statistics, probability has to do with degree of belief. Prior and posterior distributions refer to our degree of confidence (prior to looking at data and after looking at data, respectively) that a parameter falls within certain ranges of values, where 1 represents total certainty and 0 represents total disbelief. Let me refer to probability in a Bayesian context as B-probability.

Both F-probability and B-probability are valid interpretations of probability, in that they satisfy the axioms of probability. But they are distinct interpretations.

My conceptual confusion is that Bayes Theorem combines a term with an F-probability interpretation (the likelihood, which is essentially the density of the sampling distribution) with a term with a B-probability interpretation (density of the prior distribution) to produce an entity with a B-probability interpretation, namely, the density of the posterior distribution. I’m not questioning the validity of the derivation of Bayes Theorem here. Rather, it seems conceptually messy to me that an F-probability term is combined with a B-probability term; both terms have to do with “probability,” but what is meant by “probability” is very different for each of them.

Can you provide some conceptual clarity?

My reply:

See here

and here, also here and here.

At this point, I’ve written about this so many times I just have to point to the relevant links. Kinda like that joke about the jokes with the numbers.

A simple measure like carefully reading journal articles seems far more helpful than new-fangled measurement-tool offerings, given that there appears to be a plethora of miscitations in journal articles. Careful reading has demonstrated that non-standard & standard definitions in specific contexts can confuse researchers. Moreover, as Rex Kline has suggested, academics reveal different degrees & types of biases, and a lack of full understanding of their own field, when discussing it in informal settings. Even a critically thinking non-expert, a naturally curious diagnostician, can discern such biases with no or minimal statistical background.

Replication is a means of revisiting the state of theory & practice in these fields. And we are benefiting already. Some small subset may be able to offer new, more substantive insights in the process.

Lastly, all of us can gain if we can convey our viewpoints well. Convoluted writing pervades journals.

Sameera:

Careful reading helps, but you need to know what to look for. We’ve learned so much in the past decade about what to look for. I’m much more aware of researcher degrees of freedom and forking paths, and just the general point that just cos something’s published in a respected journal, it doesn’t have to be any good. There are lots of papers where now it’s clear that there are problems, but ten years ago I and others would’ve just taken their claims at face value. Just remember: all of Brian Wansink’s published papers were read carefully by at least three reviewers . . . who didn’t notice any of the in-retrospect-obvious problems.

We learned many of the same lessons about research methods & tests from the Evidence-Based Movement during the 90s, which, btw, have resurfaced in the last decade or more. So some of us are not so ill-equipped to think critically about the literature. I do not have the statistical competencies you possess. However, when I talk to statisticians informally, I can discern that they stray from their stated assumptions.

As far as Brian Wansink’s reviewers go, I would guess they weren’t sharp enough. I’m suggesting that if you countenance rigor as integral, then ‘careful reading’ is entailed. I guess what I’m suggesting is that the qualitative side seems to have to take a back seat to the quantitative side. Rex Kline conducted a survey of academics’ knowledge of statistics. So there are some indicators that some % have been winging it in their teaching.

Sorry, some % has…..

Eeeekkk.

Sameera:

Sure, Wansink’s reviewers “weren’t sharp enough.” But . . .

1. Until a few years ago, I “wasn’t sharp enough” either, in the sense that I don’t think I would’ve caught those problems.

2. Nutrition and behavior research is important. It affects policy! If the reviewers in that field are, like me, not so sharp, that’s something we have to deal with. So I think the advice to read more carefully can be part of the solution, but another part has to be giving people some sense of how to read, what to look for, what evaluations make sense, etc.

We have to move beyond the existing approach of check boxes for:

(a) statistical significance,

(b) causal identification,

(c) novelty,

(d) the wow factor.

Think about it: Wansink did well in all four of those dimensions! He always had statistical significance (or so it seemed), he had randomized experiments and thus causal identification for free (or so it seemed), and of course he had novelty and the wow factor. Sure, he got some of this by cheating, but, as we’ve discussed on the blog, had he put a bit more effort into each project, he could have gotten all these results without cheating, just using standard forking paths and storytelling of the sort that is still considered acceptable in scientific publications.

Andrew,

I consider you one of the most intellectually sharp and interesting thinkers on this planet. I meant ‘careful reading’ as in paying full attention to definitions & interpretations. You know researchers come in all varieties and competencies. And even the best of readers can be distracted from drawing appropriate conclusions.

I have followed a considerable amount of research in three areas: nutrition, kinesiology, & hormones. So maybe I had some background to evaluate Cuddy’s thesis for example. But I don’t pretend that I am right in my interpretations either. I am relieved when someone dissuades me from an inaccurate position I hold.

If you notice, I was referring to informal discussions as well, where I think researchers’ biases are more readily discernible.

It may also be that as a non-statistician I am forced to rely on verbal explanations, which someone like Taleb characterizes as inadequate. But he tends to make a lot of verbal evaluations too.

The mistake is to think that the likelihood is an F-probability. The likelihood is a measure of what values are plausible given the model and parameters. All of Bayesian probability is plausibility. If you happen to be studying calibrated cryptographic pseudo-random number generators for which you do not know the entropy source / key, then it will turn out that your plausibility coincides with frequency, precisely because of the nature of the thing you are studying. In all other cases the plausibility remains different from frequency, usually in an unknown way, since the frequency distribution is not known a priori and you usually aren’t going to collect enough data to establish it.

@Andrew, there’s lots of food for thought in the linked texts, but I don’t think they address the question posed by Ed Bein.

He correctly points out that probability theory is a set of properties that may apply to different things, in this case “long-term frequency” and “degree of belief/knowledge/plausibility”, and the fact that these properties apply to both of them doesn’t justify plugging these different things into the same formula (Bayes’ rule).

Daniel Lakeland’s reply addresses this concern. In using Bayes’ rule one does not put a frequency-probability into a belief-probability context, but rather a belief-probability that is induced by a frequency-probability. In the rare case in which we can reasonably know the frequency-probability, this leads to degrees of belief that a realization of the random process has a given outcome. I think it would be great if someone could work this out in detail, especially for cases where “reasonably knowing the frequency-probability” is not so simple.

“In using Bayes’ rule one does not put a frequency-probability into a belief-probability context, but rather a belief-probability that is induced by a frequency-probability.”

This is a good way of putting it. To elaborate (or perhaps rephrase): The prior needs to be based on whatever credible prior knowledge one has; if there is credible frequency-probability available, that is reasonable information to include in the prior belief-probability (in fact, excluding it would be unreasonable).

I think you only need infinite exchangeability to get a strong law of large numbers — that is to say, to have the theorem that expected frequency is numerically equal to plausibility.

What I mean is that nothing about what seems plausible to you can possibly affect how the world actually behaves (mind projection fallacy). So if you set up a model and say that P(Data | Model, parameters) = normal(x, 0, 1), nothing about this fact, or how often you are willing to claim this fact, will affect the physical process that makes it have some totally different frequency distribution, which you don’t know because you’re not collecting tens of thousands of data points over multiple years in all possible weather conditions, etc.

A likelihood is conceptually distinct from an “F probability.” Conceptually, the likelihood is merely (proportional to) the probability of the *observed* data, given each possible parameter value. But usually the likelihood will be derived from a model of how a range of *possible* data are generated. Moreover, if you are going to do posterior predictive checking (or a prior predictive analysis), then you *need* a model of the data generating process, and not just the probability of the observed data given each parameter value. And if the model of the data generating process disagrees too much with the distribution of the actual data, then (by the logic of posterior predictive checking and prior predictive analysis), the model should be rejected. So in practice, I think it’s fair to say that “F probabilities” are very important to Bayesians, even if (strictly speaking) an application of Bayes’s formula merely requires an estimate of how probable each parameter makes the observed data.
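To make the posterior-predictive point above concrete, here is a minimal sketch (all numbers are invented for illustration, not taken from the thread): fit a normal model by grid approximation, draw replicated datasets from the posterior, and compare a test statistic of the observed data with its replicated distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.3, 1.0, size=50)        # "observed" data, simulated here

# Grid posterior for the mean, known sd = 1, flat prior on the grid.
grid = np.linspace(-2, 2, 401)
log_lik = np.array([np.sum(-0.5 * (y - m) ** 2) for m in grid])
post = np.exp(log_lik - log_lik.max())
post /= post.sum()

# Posterior predictive: draw mu, then a replicated dataset of the same size.
mus = rng.choice(grid, size=1000, p=post)
t_rep = np.array([rng.normal(mu, 1.0, size=y.size).max() for mu in mus])

# Compare the observed test statistic (here: the max) with its replicated
# distribution; an extreme tail probability flags model misfit.
p_val = np.mean(t_rep >= y.max())
print(round(p_val, 2))
```

The model of the data-generating process (normal with unit sd) is doing the work here: without it there would be nothing to simulate replications from, which is the sense in which "F-probabilities" matter to Bayesians in practice.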

Quite. No doubt you’ve seen this:

What is the matter with people and probability theory and its interpretation, I wonder. Similarly, many people can’t seem to make – and maintain – the even more important distinction between (epistemic) probability space state and (ontic) configuration or phase space state in (applied) non-classical probability.

Andrew, thank you for the links to your papers. They are excellent. That said, it always bothers me how highly regarded Popper’s arguments are (outside of analytic philosophy), because he is clearly wrong, and Carnap showed him to be wrong a long time ago. Popper regarded himself as answering Hume’s problem of induction by stating that falsification sidestepped the problem. We reason, Popper thought, only deductively, and although Hume is correct that we cannot ever confirm a scientific law, we can falsify it. But this is false. Hume’s and Popper’s arguments depend on scientific laws involving universal quantification (“All x are such that Fx”). We cannot, Hume argued, prove a universal. True enough. Popper thinks we can disprove a universal. Maybe, sometimes. But Carnap pointed out that most scientific laws and facts will have nested quantifiers (“All x are such that there is a y such that Fxy”). You cannot falsify those propositions any more than you can confirm them (in the sense of confirmation and falsification that Hume’s and Popper’s arguments require).

Just think of Newton’s first law. It could be written as “for every x, there is a y such that if x changes motion, then y causes the change in motion.” What happens when I find an instance of something that changes its motion? I will look for the cause of the change in motion. When I fail to find it, I will have a choice: do I reject Newton’s first law, or conclude that I simply did not find the cause of the change in motion? Then I will go looking for more evidence for and against Newton’s first law. We are with falsification in exactly the same situation that Hume was in with verification. We are missing a premise in the argument, namely, “I have enough evidence to make the inference.” I do not think that Popper’s hypothetico-deductive framework is of any use. Of course, part of what science does is deduce hypotheses from its theories and then test them. But falsification doesn’t help.

We have the same problem either way: we have to somehow justify the leap that says “I have enough evidence to verify or falsify the hypothesis.” The problem is thinking that induction has to preserve the truth of the premises to the conclusion in the way that deduction does. Why should we think that?

Check out Imre Lakatos:

Lakatos, Imre. 1968. Criticism and the methodology of scientific research programmes. Proceedings of the Aristotelian Society 69: 149-186. personal.lse.ac.uk/robert49/teaching/ph201/week05_xtra_lakatos.pdf

tl;dr

Theories that make surprising new predictions, i.e. theories with a high p(theory|data) because the other terms in the denominator of Bayes’ rule are so low, advance science. If you are always making ad hoc adjustments to your theory (the theory lags the observations), then it is pseudoscience.
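A toy numeric version of that point (the numbers are invented for illustration): a theory that made a surprising prediction gains a lot of posterior weight when the prediction comes true, because the same data are very improbable under the alternatives.

```python
# Two theories with equal prior weight; the data are a "surprising"
# observation that theory A predicted and theory B did not.
prior = {"A": 0.5, "B": 0.5}
lik = {"A": 0.9, "B": 0.05}   # p(data | theory), invented numbers

evidence = sum(prior[t] * lik[t] for t in prior)   # denominator of Bayes' rule
posterior = {t: prior[t] * lik[t] / evidence for t in prior}
print(posterior)   # theory A now dominates
```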

Anon:

Lakatos is my hero. I allude to his multiple Poppers in footnote 6 of this article of mine which I absolutely love.

Yes, Meehl and Lakatos ftw. Actually I am not sure if I learned about Lakatos from Meehl or this blog. Great paper.

I never really “got” induction. I think what I do is abduction (guessing), then deduction to work out the consequences of the guess. Then if the guess is good, the predictions should be good. I guess maybe induction is where the guesses are coming from?

Peirce thought induction was the inference to the conclusion that the frequency of the trait t in my sample is the same as its frequency in the population. Put another way: I have enough evidence to generalize. I think that is distinct from deductive and abductive reasoning, although it may be that in many areas of the social sciences only abductive inferences are available. However, deductive reasoning itself is never going to be sufficient. There has to be an inductive inference somewhere to justify accepting or rejecting the hypothesis. So the hypothetico-deductive model is not a complete description of scientific inference.

“I never really “got” induction. I think what I do is abduction (guessing), then deduction to work out the consequences of the guess. Then if the guess is good, the predictions should be good.”

To put it simply, induction is simply going in the other direction: if the predictions are good, then the guess is good.

Yes, Popper’s logical point about existential quantifiers is well taken. However, in my view, the best refutation of Popper’s deductivism is via the Quine-Duhem thesis: that is, in practice all refutation is relative to background/auxiliary assumptions (or “conditional on” background assumptions, to put it in Bayesian terms). But if you allow that refutation is conditional, there’s really no good reason not to also allow conditional confirmation.

Sorry, that should be *Carnap*’s point about existential quantifiers.

I think that the Quine-Duhem thesis is closely related to Carnap’s refutation of falsification. I do not know the historical sequence, but Quine and Carnap worked closely with each other, and both were aware of the indeterminacy problem very early on.

Maybe reasoning in terms of “direct probability” and “inverse probability” will make it easier to understand why they are related by the Bayes formula and what the role of the likelihood function is.
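One way to see the two directions in code (a grid-approximation sketch with made-up data): the “direct” probability p(y | theta) is evaluated with theta fixed and y varying, the likelihood reads the very same formula with y fixed and theta varying, and Bayes’ rule turns it into the “inverse” probability p(theta | y).

```python
import math

def binom_pmf(k, n, theta):
    """Direct probability: p(k successes | n trials, theta)."""
    return math.comb(n, k) * theta**k * (1 - theta) ** (n - k)

k, n = 7, 10                       # observed data (made up)
thetas = [i / 100 for i in range(1, 100)]

# Likelihood: the same formula, now read as a function of theta.
lik = [binom_pmf(k, n, th) for th in thetas]

# Inverse probability via Bayes' rule with a uniform prior on the grid.
total = sum(lik)
post = [l / total for l in lik]

mode = thetas[post.index(max(post))]
print(mode)   # posterior mode near k/n = 0.7
```

Nothing mysterious happens at the join: the sampling-model formula supplies numbers, and once the data are fixed those numbers are reinterpreted as relative plausibilities of the parameter values.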