This post is by Keith O’Rourke and, as with all posts and comments on this blog, is just a deliberation on dealing with uncertainties in scientific inquiry and should not be attributed to any entity other than the author. As with any critically-thinking inquirer, the views behind these deliberations are always subject to rethinking and revision at any time.

I keep being drawn to thinking there is a way to explain statistical reasoning to others that will actually do more good than harm. Now, I also keep thinking I should know better – but can’t stop.

My recent attempt starts with a shadow metaphor, then a review of analytical chemistry, and moves to the concept of abstract fake universes (AFUs). AFUs allow you to be the god of a universe, though not a real one ;-). However, these are universes you can conveniently define using probability models, where it is easy to discern what would repeatedly happen – given an exactly set truth.

The shadow metaphor emphasizes that though you see shadows, you are really interested in what is casting the shadows. The analytical chemistry metaphor emphasizes the advantage of making exactly known truths by spiking a set amount of a chemical into test tubes and repeatedly measuring the test tubes’ contents with inherently noisy assays. For many empirical questions such spiking is not possible (e.g. underlying cancer incidence), so we have no choice but to think abstractly. Now, abstractly, a probability model is an abstract shadow-generating machine: with set parameter values it can generate shadows – well, actually samples. Then it seems advantageous to think of probability models as an ideal means to make AFUs with exactly set truths, where it is easy to discern what would repeatedly happen.
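To make the spiking idea concrete, here is a minimal sketch (all numbers purely illustrative) of a fake spiking experiment: we set the true concentration ourselves, then repeatedly measure it with a noisy assay.

```python
import random

random.seed(1)

# Hypothetical spiking experiment: we are the "god" of this tiny universe
# because we set the exactly known truth ourselves.
true_concentration = 5.0   # amount spiked into each test tube (known exactly)
assay_noise_sd = 0.4       # the assay's measurement noise (assumed Gaussian)

# Repeatedly measure the spiked tubes with the inherently noisy assay.
measurements = [random.gauss(true_concentration, assay_noise_sd)
                for _ in range(100)]

mean_est = sum(measurements) / len(measurements)
print(f"true = {true_concentration}, mean of 100 noisy assays = {mean_est:.2f}")
```

Because the truth was set exactly, we can discern what would repeatedly happen: how close the noisy assays come, how often they are off by a given amount, and so on.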

Now, my enthusiasm is buoyed by the realization that one of the best routes for doing that is the prior predictive. The prior predictive generates a large set of hopefully appropriate fake universes, where you know the truth (the prior parameters drawn) and can discern what the posterior would be (given the joint model proposed for the analysis and the fake data generated). That is, in each fake universe from the prior predictive, you have the true parameter value(s) and the fake sample that the data generating model generated from them, and can discern what the posterior would have been. Immediately (given computation time) one obtains a large sample of what would repeatedly happen using the proposed joint model in varied fake universes. Various measures of goodness can then be assessed and various averages calculated.
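As a minimal sketch of this (with a made-up conjugate Beta-binomial fainting model and purely illustrative numbers): draw a truth from the prior, generate a fake sample, compute the posterior in each fake universe, and average a measure of goodness – here, coverage of central 90% posterior intervals – over the whole set.

```python
import random

random.seed(42)

# Illustrative prior predictive simulation. The Beta(2, 8) prior and the
# sample size are assumptions, not from any real fainting study.
prior_a, prior_b = 2.0, 8.0   # Beta prior on the underlying fainting rate
n = 50                        # people observed in each fake universe

def one_fake_universe():
    # 1. Be the god of this universe: draw the exactly known truth from the prior.
    theta_true = random.betavariate(prior_a, prior_b)
    # 2. Generate a fake sample ("the shadow") from the data generating model.
    y = sum(random.random() < theta_true for _ in range(n))
    # 3. By conjugacy the posterior is Beta(a + y, b + n - y), so we can
    #    discern what would have been concluded in this universe.
    post_draws = sorted(random.betavariate(prior_a + y, prior_b + n - y)
                        for _ in range(200))
    # Central 90% posterior interval from the sorted draws.
    lo, hi = post_draws[9], post_draws[189]
    return lo <= theta_true <= hi

# Averaged over many fake universes, the 90% interval should cover the
# truth about 90% of the time -- one "measure of goodness".
coverage = sum(one_fake_universe() for _ in range(1000)) / 1000
print(f"coverage of 90% posterior intervals: {coverage:.3f}")
```

When the joint model used for analysis matches the one that generated the fake universes, this coverage check should come out near its nominal level; discrepancies flag problems with the proposed model or its computation.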

Good for what? And in which appropriate collective of AFUs (aka Bayesian reference set)?

An earlier attempt of mine to do this in a lecture was recorded and Bob Carpenter has kindly uploaded it as Lectures: Principled Bayesian Workflow—Practicing Safe Bayes (YouTube) Keith O’Rourke (2019). If you decide to watch it, I would suggest setting the playback speed to 1.25. For those who don’t like videos, slides and code are here.

The rest of the post below provides some background material for those who may be unfamiliar with prior predictive simulation and two-stage sampling to obtain a sample from the posterior.

But how does this differ from how frequentist statistics approaches things? Isn’t the whole point of frequentism the existence of these AFUs?

Vk:

Bayesian statistics is frequentist.

Ay! I missed that one at the time. For me:

Frequentist means interpreting probability models as mechanisms to (repeatedly) generate data.

Bayesian means analysis in which data are used to arrive at posteriors from priors using Bayes’ rule.

Nothing contradictory between them, although there’s of course non-Bayesian frequentist analysis and also non-frequentist Bayesian interpretation of probability.

Well, I don’t buy any of that. I mean, when I put a prior on the gravitational acceleration at the surface of the earth so I can do my dropping paper balls experiment, it’s not the frequency distribution over different worlds or even different locations on the surface of the earth.

when I have a nonlinear function over spatial locations that describes low spatial frequency variation in cost of living, it’s not a distribution over different versions of the United States where costs vary in a different way…

all of this is encoding information about the actual world we live in, where I don’t know exactly what the acceleration will be but I know it’s 9.79 m/s^2 to within about .01 and I don’t know what the costs of living are but I know that they are consistently higher in the coastal areas and lower in the Midwest, and my function is smooth because all the very high spatial frequency variation is intended to be taken up by PUMA specific errors…. so no I don’t think this is a good general characterization of Bayesian methods

I don’t really understand what exactly you don’t buy. I didn’t say you have to use, or you automatically use, a frequentist interpretation of probability when applying Bayesian methods. It seems you give examples where you don’t do that, which is fine by me (at least regarding the definitions I proposed above).

Maybe you replied to Andrew not me?

Yes, that’s right, it was a reply to Andrew. What I don’t buy is the stuff in the post he linked… that priors are kind of distributions over populations of future applications for the model or something.

What I do buy, is the idea that Frequentist statistics is about modeling the stable frequency properties of repeatable experiments using a frequency interpretation of probability calculus, and Bayesian statistics is about modeling the information state that we have about what might happen, which can come from many sources, and one source of information is when things have relatively stable frequency properties, we can use the frequency as our source of information.

From this perspective, I think Bayes is a much larger superset of Frequency based statistics.

VK, having finally had a chance to watch Keith’s talk, at least about half of it, I would say that there are a number of things that are different from standard Frequentist statistics in Keith’s idea.

The first is that we are putting probability over the parameters of the model, which is nonsensical in standard Frequentist statistics. Andrew’s view appears to be that this can be seen as how a model will be repeatedly used throughout some abstract life… but I don’t buy that, for example if I want to drop a paper ball in a space ship accelerating at around 3 m/s^2 I would analyze that with a completely different prior than the prior I used on the surface of the earth. I wouldn’t say “about 1 out of a thousand of these kinds of experiments will be done in a spaceship, so I’ll make my prior have a left tail that includes about .001 probability in the vicinity of 3m/s^2”. I would just say “hey in this case I know the acceleration is 3 +- 0.1 m/s^2”

So, I think for the most part the prior in Keith’s discussion is still “information” type probability. In his example, Keith uses underlying long term fainting rates (what would be the long term average fraction of people who faint when they are brought in to watch a surgery).

If we are analyzing one hospital and one population of people, we don’t put a prior on fainting rate that says “well, maybe in 10% of other hospitals the underlying fainting rate would be .15 or less, and in 10% of other hospitals the underlying fainting rate would be .9 or more and around 50% of hospitals will experience fainting rates between .2 and .6 …. etc”. We ask instead: in *this* single population we’re looking at, which fainting rates are more reasonable to consider, and which are less reasonable to consider?

If we do an analysis of multiple hospitals, we may use a hierarchical model, at this point you could say something like “as a collection of underlying fainting rates, what regions of rate space are most reasonable to contain the N rates we use for our N hospitals” and you could think in a frequency way, but this kind of thing is *always* a discrete frequency distribution. We never have an approximately continuous distribution, because we don’t have 800 trillion hospitals, or even 8 million. Most models with hierarchical structure like this have from 2 to a couple thousand elements. So the continuous distributions we use for hierarchical priors are still *information type* distributions. In fact, we could think of them in the same way Keith did… if we have 35 hospitals, ideally we’d choose a prior such that the actual underlying rates look un-surprising as a sample of 35 draws from this continuous distribution. Sampling is a way to convert a continuous distribution into a discrete distribution so to speak.
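A rough sketch of that last idea, with a made-up set of 35 rates and a made-up candidate Beta(3, 7) prior: simulate many 35-hospital collections from the continuous prior and ask whether the actual collection would look surprising among them.

```python
import random

random.seed(7)

# Check whether a set of underlying hospital rates would look un-surprising
# as 35 draws from a candidate continuous prior. The "actual" rates and the
# Beta(3, 7) prior are both invented purely for illustration.
actual_rates = [0.20 + 0.003 * i for i in range(35)]   # hypothetical rates

def spread(xs):
    return max(xs) - min(xs)

observed_spread = spread(actual_rates)

# Simulate many 35-hospital collections from the candidate prior.
sim_spreads = [spread([random.betavariate(3, 7) for _ in range(35)])
               for _ in range(500)]
frac_larger = sum(s > observed_spread for s in sim_spreads) / 500
print(f"observed spread {observed_spread:.3f}; "
      f"fraction of prior draws with larger spread: {frac_larger:.2f}")
# A fraction near 1 says the prior generates far more spread-out collections
# than the rates we believe in, i.e. Beta(3, 7) is too diffuse here.
```

Each simulated collection of 35 draws is itself a discrete distribution, which is the sense in which sampling converts the continuous prior into a discrete one.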

On the other hand, Keith takes a frequentist view of the “likelihood”, or the “data model”. Each of the parameter sets he uses indexes a particular “model” of the world in which fainting occurs as if by random number generator… and then he can use an actual RNG to generate a dataset, and see what “might have happened in this abstract world”.

As far as I’m concerned, this is a fine way to conceptualize computation. Because there’s a 1-1 correspondence between a state of information and a frequency distribution from a random number generator, we can always *calculate* our information state in terms of the frequency with which a random number generator would do such and such…

What we should be careful about though, is confusing what we’re calculating with what the model actually means about the real world. There may be situations where we consider the universe as if it were a random number generator… but this is not the only interpretation here.

Suppose I use normal(m,s) as my Bayesian model for the outcome of some physical measurement. For example, say some light intensity in a photograph of some fluorescent stuff… Does this mean that I think about 1% of the time that I measure the intensity it will be more than m+2.33*s brightness? If I find out that in fact this only occurs about .002% of the time, is my model bad? NO. What it means is that if I find a value out in that region I will be unhappy with my prediction, in kind of a similar way that I would be unhappy to find that I predict 0 in a normal(0,1) but wind up getting 2.33. In some sense, it is telling me how to balance whether I got the right parameter and the wrong prediction because of model error, or I got the wrong parameter and I should be predicting with a different parameter which would leave the model error smaller…

If we give a prediction of 0 with normal(0,1) as the error, and we see 33, then we *know* we got the wrong parameter, as the number drops down from 33 to 2.33 we begin to accept the idea that the parameter could be right, and as it drops to 1 or .22 or whatever we find it very reasonable that the parameter is right… It’s a balancing act between “the data rules out this parameter” and “the model with this parameter doesn’t rule out this data”

The probability density over the data p(data | parameters) is a measure of “how un-surprising would observing this data be, if the parameter is correct”. Higher density, more “un-surprising”, lower density, very surprising data. You can work instead with the negative log of the density, and call that “surprisal”, small density, large value for -log(p), very surprising.
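A small illustration of surprisal for a normal(m, s) data model (the numbers are just the ones from the paragraphs above):

```python
import math

# Surprisal = -log density of the observation under the data model.
def normal_logpdf(x, m, s):
    return -0.5 * math.log(2 * math.pi * s * s) - (x - m) ** 2 / (2 * s * s)

def surprisal(x, m, s):
    return -normal_logpdf(x, m, s)

# Predicting 0 with normal(0, 1): observing 33 is vastly more surprising
# than observing 2.33, which in turn is more surprising than 0.22.
for x in [33.0, 2.33, 0.22]:
    print(f"x = {x:5.2f}  surprisal = {surprisal(x, 0.0, 1.0):8.3f}")
```

The enormous surprisal at 33 is the quantitative version of “we *know* we got the wrong parameter”, while the modest values at 2.33 and 0.22 reflect the balancing act described above.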

One thing to remember is that measurements *always* come as discrete values (every measurement is represented by some finite string of symbols like 2.223 or 0.0015 or 255871). The density over data is just a way to interpolate a fine grid of probability values in an efficient way.

I think Keith’s talk emphasizes this connection between Bayesian probability and a computational method. It’s a way to understand the computation. And in models where we have no information other than “it’s as good as a random binomial outcome from an RNG” you can actually make a direct connection between the frequency output from the RNG and the thing your model predicts.

Such a connection is much less obvious when for example your model is an agent based model of fluorescent cells crawling around through an organ to affect healing of the organ… or an agent based model of an information system attacker trying to infiltrate a voting registration system… Or a weather model trying to predict landfall of a hurricane.

Claiming to have an actual mapping between an RNG and the frequency with which the model gives certain sized errors… is a much stronger claim than is justified in these kinds of situations. These situations involve information about what might happen only, not frequency with which the universe would cause a thing to occur.

Daniel raises a number of varied concerns and issues that are not clear to me but I will try to respond to what I see them to be.

“What sort of a conception we ought to have of the universe, how to think of the ensemble of things, is a fundamental problem in the theory of reasoning” (CP: 6.397, 1878).

So the issue here for me is how to best represent the universe (conceive of it) for the purpose of enabling those new to Bayes to get some meaningful grasp of Bayesian analyses. Now, they are representations (the map is not the territory!) spelled out as _abstract fake_ universes that are created to discern which are compatible with our universe (i.e. what was observed). I don’t think anyone needs to worry that later, if they become convinced by Daniel’s “information” type probability arguments, they will not be able to safely kick away this introductory ladder they initially climbed up on.

More generally, some of Daniel’s concerns seem to be like those of the husband of a wife Picasso painted. The husband complained to Picasso that she did not look at all like his painting. Picasso asked what she did look like, and the husband showed him a picture of her he had in his wallet. Picasso then commented “she is awfully tiny!” Representations do not have to be exact replicas to be useful representations. In terms of a portrait, Daniel seems to want it to be life size, three dimensional and made from tissue cultures.

A couple other points.

> of thing is *always* a discrete frequency distribution.

But continuous distributions can be useful approximations.

> nonsensical in standard Frequentist statistics.

Who cares about what others believe standard this or that is, except to clarify what we want to mean.

> If we are analyzing one hospital and one population of people,

Single studies are not a science. Yes, individuals can learn from them, but it is not science until at least another study is done, and then the parameters will be somewhat different (can’t put your foot in the same river twice).

> If we do an analysis of multiple hospitals, we may use a hierarchical model

Again, it is not science until you are in a position to consider using a hierarchical model and then it becomes empirically possible to learn about distributions of parameters. Then we can argue about them being epistemic or aleatory and for what purpose.

Keith, my responses were in direct response to the question VK had about how your lecture’s position differed from “frequentism”… so it should be read in that light, I was trying to highlight those differences for VK.

I liked your lecture a lot, and I think it’s fine to ask how well an abstract fake random number generator model of the universe fits with the real data. I do think it’s important to make that statement you are making about how the map is not the territory. I also think it’s important to let people know that we can interpret probability in ways that give a model validity even if it can clearly be proven to have an incorrect frequency distribution.

The point that “representations do not have to be exact replicas to be useful representations” is very true. I just think it’s important to discuss the *ways* in which a representation is supposed to be similar vs different. The standard statistical textbooks out there make the leap to frequency probability models of data collection in an inappropriate way, in my opinion.

When we are collecting a survey of a finite population using an RNG to select the sample … the RNG view is very justified, it’s an imposed property of the experiment.

When we have say a bunch of fish PIT tag scans at fish ladders on dams, or a time series of crimes committed and reported in a certain county… it’s not so justified. The idea that there is a stable frequency through time is itself a strong assumption that can often be checked and proven incorrect. This invalidates a frequency interpretation of a statistical model, however, it *doesn’t* invalidate a Bayesian interpretation in and of itself, as the Bayesian interpretation is something like “conditional on our knowledge being that the probability isn’t changing in time, what conclusions could we draw”. These conclusions are valid answers to that question, even if they are not valid frequency descriptions of outcomes.

If we don’t encourage people to understand this distinction, and give them some knowledge about tools that have a useful interpretation in both conditions of this distinction, we will wind up with people doing inappropriate things. I think this is the biggest problem in stats education today: standard textbooks tend to immediately move from “you have some data” to something like “it is repeated samples from a normal distribution with unknown mean and variance”. The “survey sampling” view of what statistics *means* is far too prevalent.

I also agree with you about science requiring us to find stable consistent facts, so for example we need to check models against data from multiple hospitals. But plenty of data analyses are done with just the data we have today. We should give people interpretations that hold even if there isn’t repetition.

And that’s the main complaint that I had above. Andrew linked to a blog post in which he interpreted the prior over parameters as a frequency with which we would use the model in different conditions in which the parameter would come from different places. That might apply to some of the problems Andrew works on, for example where we do analyses across counties in the US, or analyses across states, or economic analyses across multiple companies, etc., but it’s absolute nonsense in other problems, like predicting the damage that a hurricane approaching Florida, making landfall some time in the next week, will do… this is a one-time event.

OK, I thought VK was a typo for OK ;-)

> The “survey sampling” view of what statistics *means* is far too prevalent

I agree – Rob Kass makes that point very well.

> Andrew linked to a blog post in which he interpreted the prior over parameters as a frequency with which we would use the model in different conditions in which the parameter would come from different places.

I agree with Andrew and do use it even if I am studying a one-time event. Peirce argued that inference in a single case is meaningless or at least unjustifiable until it is placed in a counterfactual “would it be repeated indefinitely”. That convinced me, and so to me that’s just a representation: similar in important ways, but different in repeating indefinitely.

I don’t expect to convince you here (surely you are not in the minority) and maybe I should do a post on that some time.

Keith, if we want to understand the damage that will be done about a week from now by a storm about to hit Florida, should we inform the prior on certain parameters involving, say, solar power input to the heat-engine process that causes circulating storms by the fact that our Astronomy friend just told us he wants to use the same model to understand the red spot on Jupiter? We’ll run this program 10 times, once for every day in the next 10 days, and he’s going to run the program once, on just the red spot… so about 9% of the prior mass should involve insolation levels around … (earth is 93 million miles from the sun, Jupiter is about 484 million… inverse square law… (93/484)^2 ≈ 0.04) 4 percent of what is likely to occur on earth.

It should be mentioned that confidence intervals and p values in standard Frequentist tests are designed for *exactly* this insensitivity to the purpose of the analysis… The tests give coverage *no matter what problem you use them on* (provided some other assumptions are met, which they’re mostly not). Whereas Bayesian methods using informative priors are designed to give correct answers to the actual problem we face in our analysis: a particular storm near Florida at a certain time of year.

My view is that Peirce is wrong here. He is using a crutch, most likely because Cox’s theorem hadn’t even been formulated, much less proved, at the time he was writing. As I see it, we can justify an inference about a particular storm by showing that we have good reasons to believe that parameters in the high probability region of the prior will predict what happens in Florida, and also good reasons to believe that parameters outside that region will not predict what happens in Florida.

Most of those reasons will come in the form of “other situations have shown us that parameters of interest should come in this region” (prior knowledge) or “parameters in that region imply storms will behave in certain ways that they have historically not behaved” (knowledge about the prior predictive distribution). This is where the “repeatedly” actually comes from. Science is definitely the process of studying *regularities* in the behavior of the world. But regularities are not necessarily regularities in outcome. For example the viscosity you should use for air to compute drag on a golf ball is very consistent even though you may find that 100 different golfers get VERY different drive distances.
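A toy illustration of that last point (the physics is deliberately crude and not a real drag model): hold the drag-related parameter fixed for every golfer and watch the outcomes vary widely anyway.

```python
import math
import random

random.seed(3)

# Toy model only: the *parameter* -- a drag constant standing in for air
# viscosity -- is the same for every golfer, yet the *outcomes* (drive
# distances) vary widely because swing speeds vary.
g = 9.81       # m/s^2
k = 0.005      # crude velocity-proportional drag constant (assumed, fixed)

def drive_distance(v, angle_deg=12.0):
    # Vacuum projectile range, scaled down by a crude drag factor.
    vacuum = v ** 2 * math.sin(2 * math.radians(angle_deg)) / g
    return vacuum * math.exp(-k * v)

speeds = [random.gauss(60.0, 8.0) for _ in range(100)]   # 100 golfers, m/s
distances = [drive_distance(v) for v in speeds]
print(f"distance range: {min(distances):.0f} to {max(distances):.0f} m")
```

The regularity that science can latch onto lives in the fixed constant `k`, not in the scatter of the hundred distances.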

OK our continuing will probably not be too distracting to others.

> But regularities are not necessarily regularities in outcome.

But regularities in distribution (that may evolve, but not too rapidly) are necessary for science. Otherwise nothing would replicate and there would be no reason to do science.

> good reasons to believe that parameters in the high probability region of the prior will predict what happens in Florida

My AFUs are “a” way of representing this sentence of yours to enable a clearer sense of what your sentence means – the very same sentence – that is, to simulate from that prior and then, given the point in parameter space drawn, generate a sample using a data generating model to “see” a fake happening for Florida that your joint model (Bayes needs a joint model) “says” is likely to happen.

I believe I am just discretely recasting what you are saying to some degree of approximation. I have a hard time grasping objections other than that it’s not a good enough approximation.

Now, I do understand you wish to block considerations of Bayesian reference set (places one thinks are “exchangeable” with Florida) to inform the prior. But for priors you do accept, I don’t think you should have objections to AFUs.

Keith, when you say “I do understand you wish to block considerations of Bayesian reference set (places one thinks are “exchangeable” with Florida) to inform the prior.”

that’s not a correct statement of my view. If you think a place is similar to Florida, or a physical process is similar to a hurricane in the Atlantic in a certain way, and you have information about what the parameter values are in those situations… then absolutely you should use the information in your priors for hurricanes in Florida, even if they come from say Indonesia or whatever!

What you shouldn’t do is think that somehow the prior needs to be frequency calibrated to the repeated uses to which it will actually be put in the future.

For example, suppose I will do a survey of the United States every week for the next year… Then I will do a survey of Germany every week for the following year. Evidently half the time I use my model I will be using it for Germany. Suppose I think Germany is a very different place than the US regarding my survey. The “correct” thing to do if you think that priors are frequency distributions over the instances where the model will be used, is to put as your prior 50% whatever you think is going on in the US, and 50% what you think is going on in Germany.

Why would you do that during the first 52 weeks where all your data is from the US? And, after fitting to the US and you move to Germany, why would you continue to include the US information in your prior for Germany if you think Germany is a very different place (like for example you’re doing a survey about the use of long complicated compound words)?

When fitting to the US, use the prior that expresses your information about what is probably going on in the US… when fitting to Germany use information about Germany.

If you think somehow that the US is “similar to” Germany, and you’ve already fit the US… then absolutely you should consider what you found out about the US when you make your prior for Germany. For example the German prior should probably include all of the US posterior HPD region as well as some additional regions to include the fact that you don’t think Germany actually IS the US.

So, a Bayesian reference set is a fine idea in terms of “these situations are similar to and define a region of space that is relevant to my current analysis”. But it’s not fine in terms of “the frequency with which my model gets used to analyze Florida must be equal to the weight I apply to Florida in my prior and the frequency with which my model gets used to analyze Bangladesh must be equal to the weight I apply in my prior to describing Bangladesh”

You might want to make this true when a *single* analysis includes both Florida and Bangladesh with known fraction of data coming from each… But it certainly shouldn’t be true about say repeated uses of the model in future data analysis. Even thinking that you know precisely how often your model will be used to analyze Bangladesh over the next 100 years is a joke.

> So, a Bayesian reference set is a fine idea in terms of “these situations are similar to and define a region of space that is relevant to my current analysis”. But it’s not fine in terms of “the frequency with which my model gets used to analyze Florida must be equal to the weight I apply to Florida in my prior

OH, agree – I’ll try to make it clear I don’t intend that.