I sent Deborah Mayo a link to my paper with Cosma Shalizi on the philosophy of statistics, and she sent me the link to this conference which unfortunately already occurred. (It’s too bad, because I’d have liked to have been there.) I summarized my philosophy as follows:

I am highly sympathetic to the approach of Lakatos (or of Popper, if you consider Lakatos’s “Popper_2” to be a reasonable simulation of the true Popperism), in that (a) I view statistical models as being built within theoretical structures, and (b) I see the checking and refutation of models as a key part of scientific progress. A big problem I have with mainstream Bayesianism is its “inductivist” view that science can operate completely smoothly with posterior updates: the idea that new data causes us to increase the posterior probability of good models and decrease the posterior probability of bad models. I don’t buy that: I see models as ever-changing entities that are flexible and can be patched and expanded, and I also feel strongly (based on my own experience and on my understanding of science) that some of our most important learning comes when we refute our models (in relevant ways). To put it another way: unlike many Bayesians, I believe that a model check–a hypothesis test–can be valuable, even when (or especially when) there is no alternative at hand.

I also think that my philosophical approach fits well with modern Bayesian data analysis, which is characterized not just by the calculation of posterior probabilities but by a three-step process: (1) model building, (2) inference conditional on an assumed model, (3) model checking, then returning to step (1) as needed, either to expand the model or to ditch it and start anew.

I think that the association of Popperian falsification with classical statistical methods, and the association of inductive reasoning with Bayesian inference, is unfortunate, and I’d like to (a) convince the Popperians that Bayesian methods allow one to be a most effective Popperian, and (b) convince the Bayesians of the problems with formal inductive reasoning. (See the second column of page 177 here.)

Mayo and I then had an email exchange, which I’ll repeat here. I’m hoping this will lead to clearer communications between philosophers and applied statisticians. (As Cosma and I discuss in our paper, philosophy is important for statisticians: it can influence how we use and interpret our methods.)

Mayo:

You wrote “I’m not quite sure what a “frequentist method” is, but I will assume that the term refers to any statistical method for which a frequency evaluation has been performed.” I have no doubt that, if you’re saying this—and I’ve heard a few others echo this idea—this may be what’s taught nowadays (yes?), but I find it perplexing all the same.

My reply:

Yes, I was always taught that what makes a method “frequentist” was the evaluation of its frequency properties. My understanding of the frequentist approach is that it treats all inferential statements as functions of data and thus as random variables, with the randomness induced by the sampling distribution, that is, the probability model describing which particular data happened to arise. From this perspective (associated with George Box and Donald Rubin, among others), any method is frequentist if it is evaluated in that way. In my experience, this approach to evaluating inferences is limited, so I don’t spend much time on it (although it occasionally does arise in my work), but that’s my best understanding of what “frequentist statistical analysis” is.

Perhaps I can return your implicit question by asking why this view seems perplexing to you?

Mayo:

(a) First Perplexity: A little bit of statistical amnesia? It’s as if “frequentism” as understood by frequentists (on the order of Fisher, Neyman, Pearson, Cox, Lehmann and their various modern counterparts—even granting their differences) had been forgotten and something different put in its place. Granted, calculating frequentist error probabilities of procedures is a necessary part of “frequentism” but it was never intended to be sufficient. Moreover, even the meaning of this necessary part has been open to vastly different interpretations.

Gelman: I’m not quite sure what a “frequentist method” is, but I will assume that the term refers to any statistical method for which a frequency evaluation has been performed (analytically or via simulation) conditional on some family of models that is considered relevant to the frequentist doing the evaluation. ….As far as I know, there is no “frequentist method” for coming up with an estimator. As noted in section 1 above, the frequentist method, as I understand it, is an approach for evaluating inferences …..

So, what would you call standard (frequentist) procedures for arriving at estimators and tests (ones that satisfy frequentist goals, e.g., being close to the truth with high probability)?

(b) Second Perplexity: Where’s the Bayes?

Gelman: A big problem I have with mainstream Bayesianism is its “inductivist” view that science can operate completely smoothly with posterior updates: …… I see models as ever-changing entities that are flexible and can be patched and expanded, and I also feel strongly (based on my own experience and on my understanding of science) that some of our most important learning comes when we refute our models (in relevant ways).

(i) Bayesian inference involves applying Bayes’ theorem someplace along the way, no? Refutations and falsifications are not Bayesian. It’s not clear where Bayes’ theorem enters your series of moves, or whether it enters at all.

(ii) Note too that actual falsifications of statistical hypotheses do not occur, because they would be deductive…and there is (virtually?) never a case in which empirical premises entail statistical hypotheses or conclusions. That would require that presuming the premises true while the conclusion is false leads to a logical contradiction (e.g., a statement logically equivalent to p & ~p). Instead, we set out a “rule” or some such thing for inferring that a statistical hypothesis has been falsified (some call it a decision rule, though I don’t).

For example, when 10 different scales show an increase of at least 1 pound in my weight since returning from Paris (yielding data set x), where the scales, we may imagine, are shown to work reliably and to a specified precision, we take this as evidence for H: I’ve gained at least some weight. Strictly speaking though, (x and not-H) does NOT yield a logical contradiction.

(iii) As an offshoot of (ii): this is what stymied Popper. Not knowing statistics, he scarcely saw how a statistical “rejection rule” might be erected. Worse, he failed to see how failing to reject could, in certain cases, allow a hypothesis to be corroborated in his sense, i.e., well tested (without assigning it probability).

(iv) In this connection, it is most perplexing to interpret “falsification” Bayesianly. For starters, Popper adamantly rejected Bayesian approaches to philosophy of science/inference. Now, granted, Bayesian philosophers have at times interpreted Popper Bayesianly, and he inadvertently opens himself to this false reading by not being clear that the only statistical view in sync with his philosophy is frequentist error statistics.

Notably, Popper’s requirements for H to be highly corroborated or well tested are:

– H entails or fits data x, and

– P(x|not-H) is small

(where he allows not-H to be merely the existing rivals to hypothesis H).

But P(x|not-H) is a likelihood not an error probability, and Popper really wanted the latter. We know this because he talks everywhere about his insistence that probability attach NOT to hypotheses but only to methods or procedures of testing. Everything he says about novelty and avoidance of adhockeries likewise shows he rejected the (strong) likelihood principle.
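To make the likelihood/error-probability distinction concrete, here is a toy numeric sketch. The hypotheses, data, and numbers are all invented for this illustration, not taken from the discussion above: H says a coin is fair, the only rival considered (not-H) says p = 0.8, and the data x are 9 heads in 10 tosses.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Hypothetical setup: H: p = 0.5; the only rival on the table, not-H: p = 0.8.
n, x = 10, 9  # observed data: 9 heads in 10 tosses

# Likelihoods: probability of this particular data under each hypothesis.
lik_H    = binom_pmf(x, n, 0.5)   # P(x | H)
lik_notH = binom_pmf(x, n, 0.8)   # P(x | not-H): a likelihood, as Mayo notes

# An error probability instead attaches to the testing procedure: e.g. the
# probability, computed under H, of data at least as extreme as x (a p-value).
p_value = sum(binom_pmf(k, n, 0.5) for k in range(x, n + 1))

print(f"P(x|H) = {lik_H:.4f}, P(x|not-H) = {lik_notH:.4f}, p-value = {p_value:.4f}")
```

The point of the sketch is only that P(x|not-H) is a statement about one data point under one rival, while the p-value sums over hypothetical data sets, which is what makes it a property of the procedure.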

(v) Finally, Popper denied that high posterior probability in hypotheses is at all desirable as a scientific goal. He always said we wanted IMPROBABLE hypotheses in science, since they are the most informative, bold and testable.

(c) Third Perplexity (connects to (b)): Where’s the significance test?

If, as you seem to suggest, we are to use significance tests to check models, then it would seem you are appealing to frequentist error probabilities, e.g., p-values. But you also write at times as if you don’t make use of error probabilities, so your tests must be some other kind of tests????

Gelman:

1. You ask: “what would you call standard (frequentist) procedures for arriving at estimators and tests (that satisfy frequentist goals e.g., being close to the truth with high probability)?”

My reply: These procedures seem to exist only to be supplanted. Or, to put it another way, Bayesians go from model to model to model. The Bayesian models of 50 years ago seem hopelessly simple (except, of course, for simple problems), and I expect the Bayesian models of today will seem hopelessly simple, 50 years hence. (Just for a simple example: we should probably be routinely using t instead of normal errors just about everywhere, but we don’t yet do so, out of familiarity, habit, and mathematical convenience. These may be good reasons–in science as in politics, conservatism has many good arguments in its favor–but I think that ultimately, as we become comfortable with more complicated models, we’ll move in that direction.)
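As a rough sketch of why t errors matter in practice (the data and the choice of 4 degrees of freedom are invented for this illustration), a single gross outlier drags the estimate of a mean under a normal likelihood, while a Student-t likelihood largely ignores it:

```python
import math

# Invented data: five measurements near 10 and one gross outlier.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]

def normal_loglik(mu, xs, sigma=1.0):
    """Normal log-likelihood in mu, up to an additive constant."""
    return sum(-0.5 * ((x - mu) / sigma) ** 2 for x in xs)

def t_loglik(mu, xs, nu=4.0, sigma=1.0):
    """Student-t (nu d.f.) log-likelihood in mu, up to an additive constant."""
    return sum(-0.5 * (nu + 1) * math.log(1 + ((x - mu) / sigma) ** 2 / nu)
               for x in xs)

def argmax_mu(loglik, xs, lo=5.0, hi=30.0, steps=25000):
    """Crude grid search for the maximizing mu (fine for a 1-D sketch)."""
    grid = [lo + i * (hi - lo) / steps for i in range(steps + 1)]
    return max(grid, key=lambda mu: loglik(mu, xs))

mu_normal = argmax_mu(normal_loglik, data)  # the sample mean, dragged by 25.0
mu_t = argmax_mu(t_loglik, data)            # stays near the bulk of the data

print(mu_normal, mu_t)
```

The normal fit lands at the sample mean (12.5 here), while the t fit sits near 10: the t likelihood’s bounded influence is exactly the robustness Gelman is pointing at.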

Anyway, that’s the Bayesian story: we advance “technologically” through better models. The frequentist story, as I undertstand it, is to advance through better procedures. What are the “standard (frequentist) procedures for arriving at estimators”? There was the method of moments, then there was maximum likelihood, then there were some nonparametric methods based on the empirical distribution, then semiparametric methods such as the proportional hazard model, then generalized estimating equations, multiple comparisons, . . .

I don’t really see there being a general “frequentist method of inference,” anymore than I see there being a general Bayesian method of coming up with models. They come up with methods through some combination of historical examples, mathematical extensions, inspiration, and evaluation, and we (the Bayesians) come up with models through an analogous procedure.

2. You ask: “Where’s the Bayes” [in what I do].

My reply: I do Bayesian inference conditional on a model. See Bayesian Data Analysis for a few zillion examples. And, in contrast to some Bayesians, I _do_ view refutations and falsifications as Bayesian. See chapter 6 of Bayesian Data Analysis. The short version is that Bayesian refutations use the posterior predictive distribution, p(y.rep|y). The conditioning on y is what makes them Bayesian. I don’t compute Pr(H|y). But that doesn’t mean I’m not being Bayesian. It just means that I’m doing a different Bayesian calculation than Jim Berger or Adrian Raftery might want to do.

I defer to you and others on Popper. I am following a Lakatosian strategy of taking the Popperian views that I like and labeling them as Popper. Perhaps I should refer to this construction as Popper_3. Or perhaps Lakatos_2.

3. You ask: “If, as you seem to suggest, we are to use significance tests to check models, then it would seem you are appealing to frequentist error probabilities, e.g., p-values.”

My reply: I am interested in p-values–although, in practice, usually in graphical checks rather than numerical summaries; see my 2003 paper on the unification of Bayesian inference and exploratory data analysis and also my related paper from 2004. But these are Bayesian p-values, not frequentist p-values. The “p-value” concept, like the “estimation” concept and the “standard error” concept, is too valuable to be restricted to frequentism. Yes, there are frequentist p-values, but there are Bayesian p-values too. In many important special cases they coincide, and in many other cases they are close to each other. But that doesn’t mean they’re not Bayesian.
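For concreteness, here is a minimal simulation sketch (with invented data, not an example from the papers cited) of a posterior predictive check of the kind described above: draw theta from the posterior, simulate replicated data y.rep, and compare a test statistic on the replications with its observed value.

```python
import random
random.seed(1)  # fixed seed so the sketch is reproducible

# Invented data: 12 coin flips that alternate suspiciously regularly.
y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

def n_switches(seq):
    """Test statistic: how many consecutive pairs of flips differ."""
    return sum(a != b for a, b in zip(seq, seq[1:]))

heads, n = sum(y), len(y)
T_obs = n_switches(y)

# Under a uniform Beta(1,1) prior, the posterior for theta is
# Beta(1 + heads, 1 + tails); simulate from p(y.rep | y).
sims, count = 10000, 0
for _ in range(sims):
    theta = random.betavariate(1 + heads, 1 + n - heads)  # posterior draw
    y_rep = [1 if random.random() < theta else 0 for _ in range(n)]
    count += n_switches(y_rep) >= T_obs

p_bayes = count / sims  # Bayesian p-value: Pr(T(y.rep) >= T(y) | y)
print(p_bayes)
```

The tiny Bayesian p-value flags the model (independent flips with a single theta) as failing to reproduce the regular alternation in the data, without ever computing Pr(H|y).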

I think Mayo’s take on this is that I’m not really giving frequentist methods a fair shake. Which may be true, it’s hard for me to say. I do think, though, that many people (including the authors of the WIkipedia entry on Bayesian statistics) don’t really have a good sense of Bayesian data analysis as I understand it. The rhetoric of Bayesians can obscure the useful reality.

Well put: “the rhetoric of Bayesians can obscure the useful reality.”

Quickly and vaguely: for Fisher, one needed to avoid recognizable subsets; for Lehmann, logical equivalents of subsequent randomized tests; for Cox, non-conditional evaluations – but these are ways to evaluate methods, not to create them.

K?

Gelman: we advance "technologically" through better models. The frequentist story, as I understand it, is to advance through better procedures.

Vogelstein: I'm pretty sure that not all people that do model selection refer to themselves as Bayesians. cross-validation is arguably a frequentist procedure, and nearly all the people i know that refer to themselves as Bayesians do it to model check. people who refer to themselves as frequentists certainly do.

i'd argue that the main difference between people who refer to themselves as bayesians and the people who refer to themselves as frequentists are the procedures they try first, and whether they call those procedures bayesian.

Vogelstein: my tentative predictor of self-reference is whether they insist on rigorous mathematical determination of operating characteristics (type 1 & 2 error) versus not being that precisely interested in these and being more than satisfied with having them pinned down by simulation.

Anyways, the important thing I think I learned in my thesis…

K?

And I do remember David Cox arguing that better models provide extra assurance for better procedures (better generalization) and then Brian Ripley commenting that "this" took some time and effort for him to get the full gist of.

But models do seem to be hidden, or at least not given the highlighting they deserve, in many frequentist writings…

K?

Vogelstein:

Yes, non-Bayesians improve their models too. What distinguishes Bayesians here is that they (we) advance _only_ through better models, in contrast to non-Bayesians who also advance through new estimators, tests, etc.

Gelman: really? variational bayes seems like a new kind of estimator to me. slice sampling too.

Gelman and O'Rourke: here's my perspective on the issues. a paper that i like that performs statistical inference tends to incorporate the following:

1) define an exploitation task (eg, hypothesis testing, anomaly detection, classification, prediction, etc.)

2) assume a family of models, {P[X,theta] | theta in Theta}, where X is a random variable representing the data.

3) establish some desiderata, like computational efficiency, interpretability, etc.

4) describe an algorithm (or a small set of algorithms) to perform the inference task

5) apply the algorithm to the data to perform inference

6) check model fit

the above applies to all. the difference is below:

a) people who refer to themselves as bayesians seem to tend to infer approximate posterior distributions, whereas people who refer to themselves as frequentists tend to infer approximate point estimators (often MAP estimators).

b) people who refer to themselves as bayesians tend to check model fit using approximate bayes factors, whereas frequentists tend to use cross-validation

c) people who refer to themselves as bayesians tend to use slightly different algorithms to do the inference, like mcmc or variational bayes, whereas people who refer to themselves as frequentists seem to prefer other approximations.
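As a toy illustration of the contrast in (a) above (all numbers invented for the sketch), the same binomial data can be summarized as a single point estimate or as a full posterior distribution:

```python
# Invented data: 7 successes out of 10 trials.
heads, n = 7, 10

# Point-estimate habit: the maximum-likelihood estimate of theta.
theta_mle = heads / n

# Posterior habit: the whole distribution over theta, here approximated
# on a grid under a uniform Beta(1,1) prior (posterior is Beta(8,4)).
grid = [i / 1000 for i in range(1001)]
unnorm = [t**heads * (1 - t)**(n - heads) for t in grid]
total = sum(unnorm)
posterior = [w / total for w in unnorm]

# Any summary (mean, intervals, tail probabilities) can then be read off
# the posterior, rather than reported as a single number.
post_mean = sum(t * p for t, p in zip(grid, posterior))
print(theta_mle, post_mean)
```

The point estimate and the posterior mean differ here (0.7 vs. about 0.67) because the posterior carries the prior and the full shape of the likelihood, not just its peak.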

Vogelstein:

No. Variational Bayes, MCMC, slice sampling are computational algorithms for approximating a posterior distribution. They are not "estimators" in the classical sense. You do make a good point, though, that Bayesian inference does not advance by models alone. Computational progress is important too.

Finally, regarding your point (b), I do not think Bayes factors are a measure of fit in any generally meaningful way. See chapter 6 of BDA for further discussion of this point.

Vogelstein: I believe it may be getting vague, which is why this post is interesting and _should_ be drawing more comments.

Re: using approximate bayes factors, whereas frequentists tend to use cross-validation

the Bayesians I hung out with this summer were very skeptical of bayes factors here and were all _falling back_ on cross-validation (given some real skepticism of prior or posterior predictive checks supplanting the need for cross-validation)

But there does seem to be a tendency to confuse means and ends (thinking that computational algorithms for approximating a posterior distribution are more than just tools) and a more natural or immediate focus on the model rather than procedures in Bayes

K?

Gelman: i see your point, although it is subtle. i guess people who call themselves frequentist come up with M-estimators and other robust things, whereas people who call themselves bayesians have robustness built in to some degree. on the other hand, if one defines an estimator as a function that maps data onto the parameter space, it seems to me like the MCMC estimate of a parameter is a different estimate than the variational bayesian estimate, and the functions that do the mapping are the different estimators.

(i'll check out Ch. 6 regarding bayes factors)

O'Rourke: "the Bayesians I hung out with this summer were very skeptical of bayes factors here and were all _falling back_ on cross-validation"

yes, this is why i write things like "people who call themselves X tend to do Y". it seems the frequentist-bayesian divide is largely what people call themselves. that "bayesians" do xval is evidence along those lines. if the divide exists, imho, the gap is exaggerated.

cheers ;)

I've been noodling over this post off and on for some time now. My own data analysis is essentially in line with Gelman and Shalizi; but my introduction to Bayes through Jaynes's famous/notorious text instilled in me the view that the inductivist approach is the ideal. The ideal would be something like Solomonoff induction (optimal, but unfortunately uncomputable).

It's basically the aphorism of theory and practice: in theory, there's no difference between theory and practice, but in practice, there is. Hypothetico-deductivist practice is the safest/best humans can do because we're not equipped to encode our prior information in probability distributions, nor are we "logically omniscient", that is, able to instantly ascertain all consequences from any given set of axioms.