An economist wrote in, asking why it would make sense to fit Bayesian hierarchical models instead of frequentist random effects.

My reply:

Short answer is that anything Bayesian can be done non-Bayesianly: just take some summary of the posterior distribution, call it an “estimator,” and there you go. A non-Bayesian method can be regularized, it can use prior information, etc.; there’s no reason a non-Bayesian method has to use p-values. To put it another way, there’s Ms. Bayesian and there’s Ms. Bayesian’s evil twin, who lives in a mirror world and does everything that Ms. Bayesian does, but says it’s non-Bayesian. The evil twin doesn’t trust Bayesian methods, she’s a real skeptic, so she just copies Ms. Bayesian but talks about regularizers instead of priors, and predictive distributions instead of posteriors. It doesn’t really matter, except that the evil twin might have more difficulty justifying her estimation choices because she can’t refer to a generative model.
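To make the mirror-twin point concrete, here is a minimal sketch (all numbers invented, not from any real analysis): for a no-intercept Gaussian regression, the evil twin's ridge estimator and Ms. Bayesian's MAP estimate under a zero-mean normal prior are the same number.

```python
# Toy data for a no-intercept regression y = b*x + noise. Illustrative only.
x = [0.5, 1.0, 1.5, 2.0, 2.5]
y = [0.9, 2.1, 3.2, 3.9, 5.1]
lam = 2.0  # regularization strength, a.k.a. prior precision

# The evil twin: ridge regression, minimizing sum((y - b*x)^2) + lam*b^2.
# Closed-form solution:
b_ridge = sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi**2 for xi in x) + lam)

# Ms. Bayesian: MAP estimate with y ~ Normal(b*x, 1) and prior b ~ Normal(0, 1/lam).
# The negative log posterior is 0.5*sum((y - b*x)^2) + 0.5*lam*b^2 (+ const),
# i.e. the ridge objective up to a factor of 2, so the argmin is identical.
def neg_log_post(b):
    return 0.5 * sum((yi - b * xi)**2 for xi, yi in zip(x, y)) + 0.5 * lam * b**2

# A crude grid search stands in for an optimizer.
grid = [i / 10000 for i in range(0, 50000)]
b_map = min(grid, key=neg_log_post)

print(abs(b_ridge - b_map) < 1e-3)  # True: same estimator, two vocabularies
```

Same arithmetic either way; the only difference is whether you call `lam` a regularizer or a prior precision.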

Now if people want to defend some *particular* “frequentist” procedure, that’s another story. The procedures out there tend to under-regularize; they get noisy estimates of group-level variance parameters (see here and here) and they lead to overestimates of magnitudes of effect sizes (see here).

The usual non-Bayesian procedures are designed to work well asymptotically (in the case of hierarchical models, this is the limit as the number of groups approaches infinity). But as noted Bayesian J. M. Keynes could’ve said, asymptotically we’re all dead.

“But as noted Bayesian J. M. Keynes could’ve said, asymptotically we’re all dead.”

This is one of the only times I’ve seen that quote used to express concerns that are actually analogous to the concerns Keynes had in mind when he wrote that. (“But this long run is a misleading guide to current affairs. In the long run we are all dead. Economists set themselves too easy, too useless a task, if in tempestuous seasons they can only tell us, that when the storm is long past, the ocean is flat again.”)

Maybe he was under frequentist censorship and “ocean” is a metaphor for “priors”…

> Economists set themselves too easy, too useless a task.

One might speculate that Keynes put aside the more challenging topics of probability and inference for easier, less useful (for others) work in economics.

Is

> the limit as the number of groups approaches infinity

the same asymptotic limit as ‘in the long run’?

Only by analogy. Most people who quote “in the long run we are all dead” are either asserting or attributing to others the view that thinking about the long-term is of limited value; they are missing the point that Keynes was talking about people whose economic models were inadequate (by his lights) to address the short run at all.

Ha! This gave me great chuckles.

I found the inspiration for this joke. I. J. Good, 1983, p. 69:

“[Von Mises] stated, like the 19th century frequentists, that in the applications the sequences must be long, but he did not say how long … But the modern statistician often uses small samples … He would like to know how long is a long run. As J.M. Keynes said, ‘In the long run we shall all be dead.'”

It’s not always clear which twin is copying which. E.g., it seems the Laplace prior became more popular with (some) Bayesians only after its non-Bayesian motivation in the context of the LASSO was put forward. But who gets to claim credit for what is, to me, not a terribly interesting question.
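The lasso/Laplace correspondence is easy to verify in one dimension (a sketch with made-up numbers): the L1-penalized estimate is soft-thresholding, which is also the MAP estimate under a Laplace prior.

```python
# For a single observation y ~ Normal(b, 1) with penalty lam*|b|,
# the lasso estimate argmin 0.5*(y - b)^2 + lam*|b| is soft-thresholding.
def soft_threshold(y, lam):
    if y > lam:
        return y - lam
    if y < -lam:
        return y + lam
    return 0.0

# The same objective is the negative log posterior under a Laplace(0, 1/lam)
# prior, so the MAP estimate matches. Check against a grid search:
def neg_log_post(b, y, lam):
    return 0.5 * (y - b)**2 + lam * abs(b)

y_obs, lam = 1.7, 0.5
grid = [i / 10000 - 5 for i in range(0, 100000)]
b_map = min(grid, key=lambda b: neg_log_post(b, y_obs, lam))

print(soft_threshold(y_obs, lam))  # 1.2
print(abs(b_map - 1.2) < 1e-3)     # True
```

So whether the Laplace form came from the prior side or the penalty side first, the two motivations land on the same estimator.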

A point of clarification is that (most) frequentists care only about the performance characteristics of statistical procedures under repeated sampling which hold uniformly (over some scenario space). If you want to call a procedure Bayesian, quasi-Bayesian, or non-Bayesian, OK …but I’m happy to use whatever has the desired performance characteristics in the problem at hand, regardless of what we call it.

It’s also the case that trying to understand a method from a Bayesian perspective by trying to reverse engineer it often misses the point. For example, trying to understand the bootstrap from a Bayesian perspective makes it look peculiar, since the associated prior is odd. And yet, the bootstrap is very popular with frequentists, because it provides asymptotic coverage guarantees under (relatively) weak assumptions. I think such reverse engineering exercises can be insightful, but I think they’re not a substitute for thinking through the frequentist properties of these procedures, since the latter are not always obvious based on the former.

Ram:

I agree that the flow goes both ways. See footnote 1 of this article:

Ram, very interesting summary. I am wondering whether most researchers appreciate these distinctions. I feel like I tag along and pick up the more elaborated points. I enjoy this blog a lot.

> uniformly (over some scenario space)

I used to think that was a good property, but it should be questioned: in some scenario spaces, for instance, it may result in very poor Type M and Type S errors.

> understand the bootstrap from a Bayesian perspective makes it look peculiar, since the associated prior is odd.

This is something that seems to be left unresolved – do you have any recent references or discussions about it?

I get this question a lot. I give a very pragmatic answer similar to the second paragraph in the OP: particular Frequentist procedures are often less reliable. It’s not the philosophy that’s the problem. It’s the algorithm.

My colleagues seem to have a lot of trouble getting non-Bayesian random effects models to provide reliable estimates. Either the variance parameters are zero or the correlations are –1 or +1 and the software doesn’t converge half the time. As a result, there is a culture of doing all manner of covert regularization, like assuming correlations between intercepts and slopes are zero, in order to get convergence.

So the main benefit of going Bayesian is that you can trust the resulting posterior (more). Maybe there are non-Bayesian ways that also address the convergence issues. But installing rstanarm or brms is pretty easy.

Besides, we need regularization. And I don’t see how that’s easy to achieve in the Frequentist packages my colleagues are using. I assume there are other approaches that do allow it.

In the end, my opinion is: Model first, fit second. The problem with Frequentist models is that the models are all totally wrong: “x comes out of an RNG with distribution D”.

If your model is something like: “Y should be approximately proportional to X when X is small enough, and approximately constant when X is large, and Z will shift the Y values up and down from where they would have been if Z=0…”

then you’ll get a model that’s maybe Y[i] = A*(1-exp(-B*x[i])) + C*Z[i] + error[i] and you won’t know what A, B, or C values to put in… so you’ll need a way to figure out what they should be… some kind of *inference* for the A,B,C values… But we know that A has to be positive, B has to be positive, and the magnitude of C should be not too large compared to A …

If only there were some way to logically infer which region of A,B,C space made the most sense…. well there is: Bayesian Inference with informed priors.
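A minimal sketch of that inference (simulated data; the particular prior forms and true values are invented for illustration): encode the constraints that A and B are positive and C is small relative to A as log-priors, and search for the high-posterior region of (A, B, C) space.

```python
import math
import random

# Simulate data from the hypothetical model Y = A*(1 - exp(-B*x)) + C*Z + error,
# with made-up true values A=2.0, B=1.5, C=0.3 and noise sd 0.1.
random.seed(1)
xs = [0.2 * i for i in range(1, 31)]
zs = [random.choice([0.0, 1.0]) for _ in xs]
ys = [2.0 * (1 - math.exp(-1.5 * x)) + 0.3 * z + random.gauss(0, 0.1)
      for x, z in zip(xs, zs)]

def log_prior(A, B, C):
    # Informed priors, illustrative only: A and B must be positive
    # (lognormal-style penalty), and C is shrunk relative to A.
    if A <= 0 or B <= 0:
        return -math.inf
    return -0.5 * math.log(A)**2 - 0.5 * math.log(B)**2 - 0.5 * (C / (0.5 * A))**2

def log_lik(A, B, C):
    return sum(-0.5 * ((y - (A * (1 - math.exp(-B * x)) + C * z)) / 0.1)**2
               for x, z, y in zip(xs, zs, ys))

# Coarse grid search for the region of (A, B, C) space with highest posterior.
best = max(((a / 10, b / 10, c / 10)
            for a in range(1, 40) for b in range(1, 40) for c in range(-10, 11)),
           key=lambda p: log_lik(*p) + log_prior(*p))
print(best)  # should land near the true (2.0, 1.5, 0.3)
```

In practice you would use Stan or brms rather than a grid, but the logic is the same: the prior rules out the nonsensical regions of parameter space before the data even arrive.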

This makes sense to me, yes. The RNG intuitions seem really limiting and dangerous. But I’ve had limited success with colleagues when I begin by talking that way. In contrast, I have lots of luck getting them to see that their frequentist software is not reliable. I worry now that they are using Bayes but thinking Frequentist, and I’ve traded one set of problems for another.

> using Bayes but thinking Frequentist, and I’ve traded one set of problems for another.

Yup, if anything we just get _less_ wrong.

> To put it another way, there’s Ms. Bayesian and there’s Ms. Bayesian’s evil twin, who lives in a mirror world and does everything that Ms. Bayesian does, but says it’s non-Bayesian.

I thought Ms. Bayesian’s evil twin was the one using data-dependent priors and saying it’s Bayesian.

Did this economist find your answer persuasive?

I have started to list “uncertainty of whether your software works as advertised” as something to be concerned about, in addition to uncertainty over which model(s) to use, uncertainty over what the parameters in those models are, and uncertainty about how to analyze the results. So, some people might be persuaded by pointing out that the posterior mean or median from Stan can be made into a frequentist estimator that has a sampling distribution over datasets drawn from the same population that is, say, closer to a multivariate normal distribution than some other frequentist estimator (as implemented in some software).

But I would imagine most people would just respond that better fidelity to a multivariate normal sampling distribution is not worth the difference in cost to learn and use Stan, particularly when the first time you try to draw from the posterior distribution you tend to get warning messages that sound similar to the warnings you get (if any) when using some other frequentist estimator.

I would have denied that the sampling distribution of the estimator is an important consideration when choosing what estimator to use.

The main distinction seems to be that the frequentists almost always stop at their version of the posterior mode. That seems more at risk of overfitting. I think frequentists actually have the vocabulary to do MCMC – projecting effects instead of estimating parameters, sampling from the distribution of projection errors instead of the distribution of parameters – but I haven’t seen them take that step. That would allow them to take the mean from the distribution of projection errors instead of the maximum of the joint likelihood – so not be stuck with the mode.

Gary:

The interesting question is: What’s the “frequentist” principle by which it’s ok to take the mode but it’s considered poor form to average over the distribution?

If “frequentist” just means having standard frequency properties: asymptotic unbiasedness, low mean squared error, etc., then it’s not clear what’s so great about the mode.

So I’m thinking there’s something else going on, and that “something else” is a desire to avoid the use of prior information, I think as part of a principle of objectivity, or, to put it more precisely, a principle that all inferences should come directly from the data at hand. The maximum likelihood estimate does not depend on the prior distribution, but any averaging will depend on the prior.
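The mode-versus-average distinction is easy to see in a conjugate sketch (illustrative numbers): with binomial data and a Beta prior, the maximum likelihood estimate ignores the prior entirely, while the posterior mean averages over it.

```python
# Binomial data: s successes in n trials, with a Beta(a, b) prior on the rate.
s, n = 8, 10
a, b = 2.0, 2.0

mle = s / n                                # maximum likelihood: ignores the prior
post_mean = (s + a) / (n + a + b)          # posterior mean: averages over the prior
post_mode = (s + a - 1) / (n + a + b - 2)  # posterior mode (MAP)

print(mle)        # 0.8
print(post_mean)  # ~0.714, shrunk toward the prior mean 0.5
print(post_mode)  # 0.75
```

The mode happens to be closer to the MLE here, but any summary that averages (mean, median, interval) necessarily carries the prior along with it.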

Discussion of these matters can be frustrating, though, because these principles are not always clearly expressed.

I wonder if at least some people who are reluctant to use prior information might be at least somewhat convinced by demonstrations of the sort where you first do a Bayesian analysis with a sample of size n1, then use the posterior from that calculation as prior and a new sample of size n2, and show that the result is the same as you would get using the original prior and the combined sample of size n1 + n2.

“result” = posterior resulting from the second calculation
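A quick conjugate check of that demonstration (made-up counts, purely illustrative): with a Beta prior and binomial data, updating in two steps gives exactly the same posterior as one combined update.

```python
# Beta(a, b) prior; binomial data arrive in two batches.
a0, b0 = 3.0, 7.0
s1, f1 = 12, 8   # batch 1: successes, failures (n1 = 20)
s2, f2 = 5, 15   # batch 2 (n2 = 20)

# Two-step: the posterior from batch 1 becomes the prior for batch 2.
a1, b1 = a0 + s1, b0 + f1
a2, b2 = a1 + s2, b1 + f2

# One-step: the original prior with the combined sample of size n1 + n2.
a_all, b_all = a0 + s1 + s2, b0 + f1 + f2

print((a2, b2) == (a_all, b_all))  # True: same posterior either way
```

The counts simply add, so the order in which the data arrive cannot matter.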

Larry Wasserman (when he was in a Bayesian phase) tried out that argument in a talk at the University of Toronto, in particular with n1 = 1 and n2 = the rest of the sample.

Seemed to go over like a lead balloon – but really it’s just likelihood mechanics: with independent observations the likelihood can be factorized into any set of multiples. That’s a trick Sander Greenland has used effectively to implement Bayesian analyses in standard software – the software can’t tell what is prior versus likelihood, as they’re _just_ functions.

Anyone who wants the gory details of likelihood mechanics may wish to suffer through this paper http://statmodeling.stat.columbia.edu/wp-content/uploads/2011/05/plot13.pdf

I downloaded your paper Keith. Probably don’t have sufficient background for comprehension. I’m intellectually adventurous however.

This is basically the same as Edwards’ idea of a prior likelihood right?

There is a difference though – a prior likelihood is based on data and is not additive over parameters.

“This is basically the same as Edwards’ idea of a prior likelihood right?”

I don’t know — I’m not familiar with Edwards’ idea. My reason for making the suggestion is that when I was teaching introductory Bayesian analysis to high school teachers, we did Bayesian estimation for proportions using beta priors, and after I did one calculation, a student asked if the above result were true, so we did the calculation — and the result impressed the students. I think this, plus the fact that Bayesian uncertainty intervals have a more intuitive interpretation than frequentist confidence intervals, seemed to convince a lot of the students that Bayesian analysis made more sense than frequentist.

The result follows trivially from likelihoods for independent data:

P(y0, y1; theta) = P(y0; theta) P(y1; theta)

Right? Or am I missing something deeper? If you have past independent data y0 you can use this in a new context for y1 by multiplying likelihoods. No priors over theta required.

As a cautionary tale: I’ve also seen people try to multiply the two posteriors together in this context, leading to the prior appearing twice. So I think emphasising the likelihood is useful here.
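That double-counted prior is easy to exhibit in the conjugate case (illustrative counts): multiplying two Beta posterior densities adds their exponents, so the prior appears twice; the correct combination multiplies the likelihoods against a single prior.

```python
# Beta(a, b) prior; two independent batches of binomial data.
a, b = 2.0, 2.0
s1, f1 = 7, 3
s2, f2 = 4, 6

# Correct: multiply the likelihoods (equivalently, update sequentially).
correct = (a + s1 + s2, b + f1 + f2)

# Wrong: multiply the two posterior *densities* Beta(a+s1, b+f1) and
# Beta(a+s2, b+f2). Density multiplication adds exponents, giving
# Beta(2a + s1 + s2 - 1, 2b + f1 + f2 - 1): the prior counted twice.
wrong = ((a + s1) + (a + s2) - 1, (b + f1) + (b + f2) - 1)

print(correct)  # (13.0, 11.0)
print(wrong)    # (14.0, 12.0), off by (a - 1, b - 1)
```

With a flat Beta(1, 1) prior the two answers coincide, which is perhaps why the mistake goes unnoticed.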

Yes, that’s true as long as the data are considered independent. Frequentist analysis seems to have conceptual problems with dependency: typically you need at least some kind of conditional independence (i.e., independent errors), because otherwise how do you talk about frequency? A thing needs to be repeatable without the past affecting it, or it’s… not meaningful to talk about frequency, right?

Whereas in a Bayesian analysis, the plausibility that data point 2 is something can easily be affected by what you saw data point 1 come up as in a logical way, so you can have dependency p(D2 | D1,K) p(D1|K) and make some sense of it.

I’m not saying you can’t do that math in a frequentist analysis, just that I don’t know what it could possibly mean about the actual world.

Replace P(y1;theta) by P(y1 | y0 ; theta) and you’re done.

Re independence: note that y1 | y0 is independent of y0.

Mathematically yes, but what are you assuming about the world, and how realistic is that? Let’s assume we’ve got some serial trial of different things, so you get data points, say 100 of them. And let’s say that we discretize the outcomes so there are 10 possibilities. Now

p(D100 | D99, D98, …; theta)

is an assertion that, if we repeat an experiment and the outcome is this 1-in-10^99 chance, we know exactly what the repeated frequency of outcomes of D100 will be; and to write it down with any chance of being meaningful, we would also have to know exactly what it would be if any of the other 10^99 outcomes were repeatedly observed a large number of times….

it’s insanity.

I don’t understand – it’s just a probability model for the data right?

When y1 and y0 are dependent, we get P(y1|y0 ; theta)P(y0;theta). When independent we get P(y1; theta)P(y0;theta).
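That factorization can be checked numerically for a simple dependent pair (a sketch; the parameter values are invented): for a stationary AR(1) pair, the joint bivariate-normal density equals the marginal density of y0 times the conditional density of y1 given y0.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu)**2 / var) / math.sqrt(2 * math.pi * var)

# Stationary AR(1): y1 = phi*y0 + eps, eps ~ Normal(0, s2),
# with marginal variance v = s2 / (1 - phi^2).
phi, s2 = 0.6, 1.0
v = s2 / (1 - phi**2)
y0, y1 = 0.8, 1.3

# P(y0, y1; theta) via the factorization P(y0; theta) * P(y1 | y0; theta):
factored = normal_pdf(y0, 0.0, v) * normal_pdf(y1, phi * y0, s2)

# The same joint density written directly as a bivariate normal with
# covariance phi*v between y0 and y1:
cov = phi * v
det = v * v - cov * cov
quad = (v * y0**2 - 2 * cov * y0 * y1 + v * y1**2) / det
joint = math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

print(abs(factored - joint) < 1e-12)  # True: same model, two factorizations
```

The algebra is uncontroversial; the discussion below is about what, if anything, the conditional frequencies mean about the world.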

The conclusion is the same regardless of your philosophy. I was just pointing out that this (what Martha mentioned) is a property of the likelihood part of the model, not the prior. It has nothing to do with Bayesian inference as such.

One can include ‘prior data’ by including a ‘prior likelihood’ rather than a prior probability distribution. Furthermore, these multiply in a sensible way as you accumulate data. Again, nothing to do with Bayes as such.

How to deal with approximate models and the connection to the world etc is, I think, something that none of the standard approaches are very good at imo. To me, Bayes is particularly vulnerable to the ‘true model’ issue, but that’s another topic.

Reaching back to Gary Venter and Andrew’s initial part of the conversation, it was all about “frequentist principles” that justify doing certain things…

It’s not too controversial in my opinion to look at, say, a histogram of past data and say that your new data come from a similar process and for all you know are as if from an RNG with that histogram, each event independent… I mean, I at least understand why that might be OK in some circumstances.

But now with dependencies, you’re claiming that this event, which involves something that you could basically NEVER EVER observe two of in the lifetime of the universe (probability is 10^-99) is supposed to have a known frequency histogram in infinite repetitions… well it’s craziness.

I’m not arguing about the algebra, just that in this context claiming frequency knowledge (ie. making a *frequentist model* out of it) is the ultimate in delusion.

Frequentist probability claims frequencies converge to probabilities, not that frequencies are probabilities. Hence standard errors in estimation.

Are you saying there are no ‘frequentist’ models of dependent data?

Or do you mean that if you treat a long sequence as perfectly unique it has a low probability? I agree – this is an n = 1 case without further assumptions. Typically you assume finite memory/finite correlations etc., right? I.e., dependence doesn’t mean no repetition, it just means less repetition than the perfectly independent case.

So a sequence of n dependent observations has less info than n independent repetitions, but may be approximated by a set of r x m independent observations, say, depending on how long the correlations are.

Regardless, unless there is some repetition in a sequence you basically have an n=1 problem, which I don’t think anyone is very good at handling.

Yep I think we’re on a similar page now, when you have dependencies you don’t have repetition, or you are claiming frequency information about something that is deep in the tail so that observing repetition is somewhere between unlikely and impossible.

It still makes perfect sense to me to say “my knowledge tells me that after D1, D2, D3 happen, then D4 should be somewhere near a particular value, with the likelihood of various values declining rapidly as we move away from some particular value D4*,” or the like. This is just a statement about what you know about some physics/process based on your assumptions; it doesn’t express how often you’d get anything in repetition of a complicated combined set of criteria that might rarely if ever be repeated even approximately.

Frequentist models without independence or at least conditional independence become extremely non-realistic *as models* very rapidly as the number of dependencies increases, simply because the probability of observing anything in the neighborhood of any given thing drops to zero rapidly with increasing dimension, so you’re claiming frequency knowledge about extreme tail probabilities.

I don’t see any special problems for frequentists. I don’t think self-proclaimed frequentists have the difficulties with it that you do.

The assumption that the past is representative of the future in some sense is necessary for all approaches.

Well, I do think a lot of people do statistics by pushing buttons and waiting for food pellets, so they may not have a problem with it, but it doesn’t mean there isn’t a problem. ;-)

“The assumption that the past is representative of the future in some sense is necessary for all approaches”

Sure, but, for example, the assumption that the laws of physics are constant in time (which implies conservation of energy, which is experimentally verifiable) is very different from the assumption that our data are random and the distribution of each result depends on the N previous results; that every time we run the experiment and get the same data as the last N results, we’ll get a final result that is random and has a particular frequency distribution we just happen to know; and this even though, under our assumption, the frequency-probability that we’ll get the same N results if we run the experiment N more times is astronomically small, so that we will never be able to observe even two identical conditions, even if we do the experiment every second for the rest of our lives…

Frequency probability makes sense when there’s a verifiable stability of frequency. But you can’t verify the stability of the frequency of a thing you can’t ever make repeat.

Again, I think your difficulties are not difficulties of the ‘frequentist’ approach as such but of your interpretation of it.

You measure something. If it’s deterministic with no error you only need to measure it once. If it is stochastic you measure it many times until the distribution of measurements is representative of the thing of interest.

The thing you measure can have both deterministic structure and stochastic randomness.

Eg y = m*theta + eps, eps ~ normal

For a stochastic process model like

yt = f(y_{t-1},y_{t-2},…) + eps, eps ~ normal

You’re typically interested in a case where you have, or assume you have multiple realisations, eg

(y0,y1,…ym)_1

(y0,y1,…ym)_2

…

etc. Often you make some kind of ergodic/stationarity etc assumption to do so.

But this is all standard stuff – there are many frequentist models of dependent data!
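A sketch of that multiple-realisations setup (simulated data; all values invented): generate many independent AR(1) runs, estimate phi by pooled least squares across them, and check the estimate lands near the truth.

```python
import random

random.seed(7)
phi_true = 0.7

def ar1_run(n=50):
    # One realisation of y_t = phi*y_{t-1} + eps, eps ~ Normal(0, 1).
    y = [random.gauss(0, 1)]
    for _ in range(n - 1):
        y.append(phi_true * y[-1] + random.gauss(0, 1))
    return y

# Many independent realisations of the same dependent process.
runs = [ar1_run() for _ in range(200)]

# Pooled least-squares estimate of phi: regress y_t on y_{t-1} across all runs.
num = sum(y[t - 1] * y[t] for y in runs for t in range(1, len(y)))
den = sum(y[t - 1] ** 2 for y in runs for t in range(1, len(y)))
phi_hat = num / den

print(round(phi_hat, 2))  # close to the true 0.7
```

The repeated-sampling story here rests entirely on having (or assuming) those independent replications of the whole sequence.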

All of this is beside the point about likelihoods multiplying under independent data, which was presumably the scenario Martha was considering.

>For a stochastic process model like

>yt = f(y_{t-1},y_{t-2},…) + eps, eps ~ normal

Here yt is a deterministic function of the earlier realizations plus an independent error. This isn’t the case I’m concerned about. The case I’m interested in is where the distributional choice for the error depends on the outcomes, or, as in a Gaussian process, where the entire sequence is a single realization.

Why can’t you analyse a Gaussian process using frequentist methods???

I don’t see any special complications: it has two ‘parameters’: a mean function and a covariance function.

Other than the infinite-dimensional nature, how is it different from a multivariate Gaussian with mean vector and covariance matrix?
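The finite-dimensional view can be sketched directly (toy RBF kernel, invented input points): build the covariance for three inputs and compute the usual GP/kriging conditional mean of the third value given the first two, exactly as for a multivariate Gaussian.

```python
import math

def rbf(x1, x2, ell=1.0):
    # Squared-exponential covariance between two inputs.
    return math.exp(-0.5 * (x1 - x2)**2 / ell**2)

# Three input locations; condition the GP value at x[2] on observed values
# at x[0] and x[1]. Zero mean function assumed.
x = [0.0, 1.0, 0.5]
y_obs = [1.0, 2.0]  # observed f(x[0]), f(x[1])

K = [[rbf(p, q) for q in x] for p in x]  # 3x3 covariance matrix

# Invert the 2x2 block for the observed points by hand.
a, b_, c, d = K[0][0], K[0][1], K[1][0], K[1][1]
det = a * d - b_ * c
Kinv = [[d / det, -b_ / det], [-c / det, a / det]]

# Conditional mean: k*^T K^{-1} y, the standard GP prediction formula.
kstar = [K[2][0], K[2][1]]
w = [sum(Kinv[i][j] * y_obs[j] for j in range(2)) for i in range(2)]
mean3 = sum(kstar[i] * w[i] for i in range(2))
print(round(mean3, 3))  # interpolates between the two observed values
</antml>```

Mechanically it is just a multivariate normal conditional; the argument below is about when that machinery describes anything verifiable.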

Of course you can analyze it using frequentist methods, but it’s a very limited set of uses in which you can claim to be actually doing science as opposed to pure math with a pretense.

For example, a Gaussian process model for an electrical circuit where you can repeatedly grab a few seconds of samples… OK. But a Gaussian process for the polling of Trump vs. Clinton in the lead-up to the last election? Or a Gaussian process for global temperatures over the last million years? If you do frequentist analysis on these kinds of one-off events you are not describing verifiable frequency properties of the world. It’s not a falsifiable scientific model. I don’t have a problem with the math, I just have a problem with calling it science.

ojm, you write “to me, Bayes is particularly vulnerable to the ‘true model’ issue”. I know we have kind of discussed this before, but what is your thinking here? My first reaction is that any such vulnerability will be shared if not worsened by maximum likelihood (and hence least squares) estimation of equivalent models, so you must be referring to *generative models* more specifically. Or maybe you mean BMA, and the property that it asymptotically converges to the ‘true model’ in the M-closed scenario, but arguably has undesirable properties in M-open?

DL – but polling is a random sampling situation (give or take some adjustments etc) right?

You want to measure the underlying support for Clinton vs Trump. You have a series of noisy measurements (polls) of this. You also have past experience for how this type of measurement (polls) relate to the quantity of interest (underlying support on the day).

CW – Bayes relies on generative models yes. And is also generally density based, which is incredibly sensitive to fine details of your assumptions.

I’m in favor of Bayesian methods (at least when the prior is justified), but I’m confused – why should this convince a frequentist that it’s okay to use information from the prior?

Austin:

The prior’s just other data. The point is that there’s no reason to privilege one part of the data when making inferences and decisions.

A prior isn’t in general ‘other data’ – it’s a probability distribution over parameters. A prior probability distribution isn’t even the same as a prior likelihood.

It is not usually ‘other data’ of the *same type* as enter in the likelihood. But it is (or should be) reflective of ‘other information’.

Yes of course, I wouldn’t claim otherwise.

You are so correct. These principles are not always clearly expressed. It has puzzled me as well. And I’m nowhere near as insightful as you on these issues.

The nice thing about many frequentist versions of a Bayesian procedure is that they develop the (asymptotic) properties under weaker assumptions and specify their analytic goal. Remembering that an estimator maintains its desired property under certain conditions, and picking based on those, is more in the pattern of our reasoning than trying to back-figure which approximate priors and likelihoods match those same criteria (or are “good” approximations in the same sense). While Bayesian theorists can work these out, it gets a little confusing for practitioners to type in A ~ normal(mu, sigma) when really the necessary assumption for the analytic goal with a reasonable sample size is, e.g., a few finite moments and a correctly specified structure for mu. With practice and more familiarity, the library of model properties fills in.

Ryan: Thanks, it’s nice to get a view from a propertyist – http://statmodeling.stat.columbia.edu/2017/04/19/representists-versus-propertyists-rabbitducks-good/

I always struggle with the justifications of what are _good_ analytic goals and desired properties (from the link above: “good properties” always begs the question “good for what?”).