Someone writes:

After listening to your EconTalk episode a few weeks ago, I have a question about interpreting treatment effect magnitudes, effect sizes, SDs, etc. I studied Econ/Math undergrad and worked at a social science research institution in health policy as a research assistant, so I have a good amount of background.

At the institution where I worked we started adopting the jargon “statistically significant” AND “clinically significant.” The latter describes the importance of the magnitude in the real world. However, my understanding of standard t-testing and p-values is that since the null hypothesis is treatment effect == 0, then if we can reject the null at p < .05, this is only evidence that the treatment effect is ≠ 0. Because the test was against 0, we cannot make any additional claims about the magnitude. If we wanted to make claims about the magnitude, then we would need to test against the null hypothesis of treatment effect == [whatever threshold we assess as clinically significant]. So, what do you think? Were we always over-interpreting the magnitude results or am I missing something here?

My reply:

Section 2.4 of this recent paper with John Carlin explains the problem with talking about “practical” (or “clinical”) significance.

More generally, that’s right, the hypothesis test is, at best, nothing more than the rejection of a null hypothesis that nobody should care about. In real life, treatment effects are not exactly zero. A treatment will help some people and hurt others; it will have some average benefit which will in turn depend on the population being studied and the settings where the treatment is being applied.

But, no, I disagree with your statement that, if we wanted to make claims about the magnitude, then we would need to test other hypotheses. The whole “hypothesis” thing just misses the point. There are no “hypotheses” here in the traditional statistical sense. The hypothesis is that some intervention helps more than it hurts, for some people in some settings. The way to go, I think, is to just model these treatment effects directly. Estimate the treatment effect and its variation, and go from there. Forget the hypotheses and p-values entirely.

We use the term “clinically relevant” to mean that the lower bound of the 95% confidence/credible interval for the difference is at least 10 mmHg (for blood pressure), or 10% for gastric secretion overshoot, or similar.

I see the point that you have to agree on the threshold in advance, and if a sponsor with financial interests is involved, this can lead to extended fights. Tell them that putting the limit into the protocol “is our SOP”: this is the language they understand.

What you say sounds to me (as a consumer) like it is not really getting at what I would call “clinically relevant”. To me, to consider a difference “clinically relevant,” you would need to give a good medical reason why that difference makes a practical difference in health outcomes. Saying “standard operating procedure” is not adequate information.

No, it is not standard information, but sponsors (not researchers) always insist on “objective” numbers (like, they believe, p-values), which allow for a wide range of interpretations and forking paths post hoc. Using SOPs to force pre-assigned thresholds is often accepted as an argument.

Dieter

Andrew, when you say: “The way to go, I think, is to just model these treatment effects directly. Estimate the treatment effect and its variation, and go from there. Forget the hypotheses and p-values entirely.”

Doesn’t this really mean: “Do Bayesian modeling and get a posterior distribution for the effect size conditional on the covariates?”

In Frequentist statistics, there is a 1-1 correspondence between confidence intervals and statistical tests. The confidence interval is the region of parameter space that isn’t rejected by some test. (You can see this is true because “whether the parameter value is in the 95% confidence interval” *is* a test at level p = 0.05 of the hypothesis that the parameter takes on that value.)
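That duality is easy to see numerically. Here is a minimal sketch (my own illustration, not from the thread, using a z-test with known sigma so only the standard library is needed): the 95% interval is precisely the set of null values that the two-sided test fails to reject.

```python
import random
from statistics import NormalDist

random.seed(0)
sigma, n = 2.0, 30
x = [random.gauss(1.0, sigma) for _ in range(n)]
xbar = sum(x) / n
se = sigma / n ** 0.5
z975 = NormalDist().inv_cdf(0.975)            # ~1.96
lo, hi = xbar - z975 * se, xbar + z975 * se   # the 95% interval

def rejects(mu0, alpha=0.05):
    """Two-sided z-test of H0: mean == mu0 (sigma known)."""
    z = abs(xbar - mu0) / se
    p = 2 * (1 - NormalDist().cdf(z))
    return p < alpha

# Null values inside the interval survive the test; values outside don't.
assert not rejects(lo + 1e-4) and not rejects(hi - 1e-4)
assert rejects(lo - 1e-3) and rejects(hi + 1e-3)
```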

I fully agree with you: be Bayesian, but I think people read this advice of yours and are confused because Frequentist statistics really only does one thing… test and if they don’t realize that you mean switch to a Bayesian framework… then they’re left with this mysterious advice: “estimate [and] … forget … p-values entirely”

In particular, consider the posterior distribution for 1 parameter in Bayes…. it’s a curve over the parameter space. The height of this curve tells you something in a Bayesian analysis, it tells you “given your assumptions, when the curve is high, those values are more credible, and when it’s low, those values are less credible”

But a confidence interval has no such interpretation. Instead, for a given alpha level you get a statement not about *this* interval you have in front of you, but about how often intervals constructed in the same way, from different data you might hypothetically get in the future, would cover the true value. The confidence isn’t *in the specific interval you have*, it’s *in the construction process*.
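A small simulation makes that “construction process” reading concrete (my own sketch, assuming a normal model with known sigma): roughly 95% of intervals built this way cover the true mean, but nothing probabilistic is said about any single realized interval.

```python
import random
from statistics import NormalDist

random.seed(1)
mu, sigma, n = 0.0, 1.0, 25
z = NormalDist().inv_cdf(0.975)
half = z * sigma / n ** 0.5       # fixed half-width (sigma known)

reps, covered = 10_000, 0
for _ in range(reps):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    if xbar - half <= mu <= xbar + half:
        covered += 1

rate = covered / reps             # a long-run property of the procedure
assert abs(rate - 0.95) < 0.012   # close to the nominal 95% coverage
```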

So, I think the advice “estimate” and “ignore p values” are both good pieces of advice, but they basically imply “be Bayesian”.

‘Frequentist statistics really only does one thing… test and if they don’t realize that you mean switch to a Bayesian framework… then they’re left with this mysterious advice: “estimate [and] … forget … p-values entirely”’

Huh? Frequentist statistics is just as concerned about estimation as Bayesian statistics.

The publication process is the only thing really concerned with p-values/posterior probabilities of hypotheses…

In practice frequentist statistics only estimates by acceptance and rejection of some test. A confidence interval is just the region that can’t be rejected by some test.

So if you call that “estimation” then fine, but it’s still based on testing.

I think there’s even something stronger to say here. Frequentist statistics is necessarily about the frequency with which something happens in repeated trials (I take this as definitional).

So, for continuous parameters (basically all of them in clinical applications) a point estimate is *not* a frequentist estimate as there is no frequency with which that exact point estimate will be repeated. Only interval estimates have a hope of having frequencies, and the frequency with which the interval construction process gets the right answer in the interval in repeated trials, conditional on the model being true, is the confidence level, which is the p value associated with the test “is the value I’m considering in this interval I just constructed?”.

In practice, rejection of the null is taken as license to use the point estimate, which is not really Frequentist logic. The logic of frequentism really doesn’t offer any reason to favor any particular point inside the confidence interval over another point.

Yes, but (as should always be mentioned in these discussions) in many circumstances the confidence intervals approximate the credible intervals calculated under a uniform prior. So the confidence interval calculation can be used as a computationally efficient heuristic for the corresponding credible interval. As far as I know there is no “official” explanation for why this should be the case, but it is.
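For what it’s worth, in the simplest case this isn’t even an approximation. A sketch (my own illustration, normal mean with known sigma): under a flat prior the posterior is N(x̄, σ²/n), so the central 95% credible interval coincides exactly with the 95% confidence interval, even at small n.

```python
import random
from statistics import NormalDist

random.seed(2)
sigma, n = 3.0, 10                 # small sample on purpose
x = [random.gauss(5.0, sigma) for _ in range(n)]
xbar = sum(x) / n
se = sigma / n ** 0.5
z = NormalDist().inv_cdf(0.975)

ci = (xbar - z * se, xbar + z * se)               # frequentist 95% CI

post = NormalDist(mu=xbar, sigma=se)              # flat-prior posterior for mu
cred = (post.inv_cdf(0.025), post.inv_cdf(0.975)) # central 95% credible interval

# Identical endpoints up to floating-point rounding
assert abs(ci[0] - cred[0]) < 1e-9 and abs(ci[1] - cred[1]) < 1e-9
```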

Maybe I misunderstand, but isn’t the “official answer” the Bernstein-von Mises Theorem? Here’s a nice formal statement of it:

“The Bernstein-von Mises theorem … asserts that the posterior for a smooth, finite-dimensional parameter converges in total variation to a normal distribution centred on an efficient estimate with the inverse Fisher information as its covariance, if the prior has full support.”

Source is B.J.K. Kleijn, “On the frequentist validity of Bayesian limits”, arXiv:1611.08444v2.

And a nice informal statement:

“The Bernstein-von Mises theorem is a formalization of conditions under which Bayesian posterior credible intervals agree approximately with frequentist confidence intervals constructed from likelihood theory.”

Source is Johnstone, “High dimensional Bernstein-von Mises: simple examples”, Inst. Math. Stat. Collect. 2010; 6: 87–98. doi:10.1214/10-IMSCOLL607.

My own informal interpretation: *asymptotically*, for most practical purposes, frequentists and Bayesians will end up in the same place. But a lot of work in that statement is being done by the “A-word”!

Thanks, Mark. Not sure if that explains it but it is honestly over my head. I am unable to track all the assumptions being made, etc. The phenomenon I am talking about works for small sample size and uniform prior only. The Bernstein-von Mises theorem appears to be for large sample sizes and any prior.

Ah, right, sorry. Bernstein-von Mises says frequentist and Bayesian intervals end up in the same place asymptotically because (my intuition, ymmv) the relative contribution of the extra information in the prior goes to zero. You’re asking about a uniform or “uninformative” prior, when there isn’t (supposed to be) information in the prior to begin with. (No doubt someone here can put it better than me – mea culpa.)

I started looking for a good formal statement and didn’t get very far. But Larry Wasserman’s (late lamented) blog has a nice discussion with references that might lead to what you want:

https://normaldeviate.wordpress.com/2013/07/13/lost-causes-in-statistics-ii-noninformative-priors/

> So, for continuous parameters (basically all of them in clinical applications) a point estimate is *not* a frequentist estimate as there is no frequency with which that exact point estimate will be repeated.

The frequency property is typically something like unbiasedness. And, as you rather nicely put it, it refers to “construction process.” If I estimate something a bunch of times, none of the estimates will be right but they’ll be right on average. That’s a frequency property of the construction process.

Also it strikes me as kinda silly to say that point estimation is incompatible with frequentist statistics when Lehmann wrote a book called “The Theory of Point Estimation.” If your definition of frequentist statistics excludes a book by the Ur-Frequentist, it’s probably too restrictive.

“In practice, rejection of the null is taken as license to use the point estimate, which is not really Frequentist logic.”

The sample mean (assuming the usual qualifications) is an unbiased estimate of the “actual” mean; for the right distributions it’s the most probable value. So if you rejected the null, and if you have no other information, *and* if you felt you had to come up with a single number, then the measured sample mean would be the best thing to report.

Conclusion – reporting a single number is not usually the best thing to do.

And don’t go nit-picking those ifs!

A frequentist confidence interval could come from the distribution of bootstrap estimates (the SD or, say, the regions from the 5th to 95th percentiles), in which case you are estimating the variability of the point estimate directly, not as an inversion of a NHST.
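A minimal sketch of that idea (my own illustration, a percentile bootstrap for a mean): the interval comes straight from the spread of resampled estimates, with no test inverted anywhere.

```python
import random

random.seed(3)
data = [random.gauss(10.0, 4.0) for _ in range(50)]

def mean(xs):
    return sum(xs) / len(xs)

# Resample the data with replacement and recompute the estimate each time
boots = []
for _ in range(4000):
    resample = random.choices(data, k=len(data))
    boots.append(mean(resample))
boots.sort()

# 5th-to-95th percentile interval of the bootstrap distribution
lo = boots[int(0.05 * len(boots))]
hi = boots[int(0.95 * len(boots))]
assert lo < mean(data) < hi   # interval straddles the point estimate
```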

“Note: I really am not trying to promote Bayesianism here, what I’m trying to do is interpret what Andrew’s advice could possibly mean other than “Go Bayesian”. “

Daniel – but for real, why does the bootstrap not offer a non-Bayesian means of embracing uncertainty without relation to NHST? I mean, in the cases where the bootstrap distribution converges to the sampling distribution of BetaHat, how is that not a non-Bayesian, non-NHST way of adhering to Andrew’s recommendations?

Bootstrap is the secret weapon, after all.

“In practice frequentist statistics only estimates by acceptance and rejection of some test. A confidence interval is just the region that can’t be rejected by some test.

So if you call that “estimation” then fine, but it’s still based on testing.”

What?!? No, frequentist statistics also provides point estimation. And these are used all the time: for example, estimating return in A/B testing, which is likely to be the most common use of frequentist statistics.

Sure, if you have strong, reliable prior information and weak data, you can have estimation with lower errors with Bayesian estimates than frequentist estimates. But saying “estimation is only a Bayesian property” is a completely false statement.

“In practice, rejection of the null is taken as license to use the point estimate, which is not really Frequentist logic”

No, this statement is also completely wrong. Point estimation is completely built up in the frequentist methodology without any sort of hypothesis testing.

Daniel:

Perhaps an example will help motivate things. The law of large numbers can be proven in the Frequentist setting. In this case, we know that the expected value of x bar is mu. Therefore, we can use x bar as a point estimate for mu. Next, the CLT can be proven in the Frequentist setting. Now we have an estimate of the standard deviation for x bar, so we know the precision of x bar.

NHST has never been used in any of those proofs. In fact, it doesn’t come about until AFTER those proofs.
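A quick simulation of that point (my own sketch, with made-up parameters): x̄ concentrates around μ, and its spread across repeated samples matches σ/√n from the CLT, with no test anywhere in sight.

```python
import random

random.seed(4)
mu, sigma = 7.0, 2.0

def xbar(n):
    """Sample mean of n fresh draws from N(mu, sigma)."""
    return sum(random.gauss(mu, sigma) for _ in range(n)) / n

# LLN: the sample mean concentrates around mu as n grows
assert abs(xbar(100_000) - mu) < 0.05

# CLT: the spread of x-bar over repeated samples of size n
# is close to sigma / sqrt(n)
n, reps = 100, 2000
means = [xbar(n) for _ in range(reps)]
m = sum(means) / reps
sd = (sum((v - m) ** 2 for v in means) / reps) ** 0.5
assert abs(sd - sigma / n ** 0.5) < 0.03
```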

The point estimate of a continuous parameter has zero frequency of occurrence in repetition. By the definitional logic of Frequentism there is zero probability that the point estimate is correct.

So, it’s frequentism to accept as correct something with exactly zero probability but it’s not frequentism to choose to accept as correct something outside an interval of 95% confidence…

Okay then!

“The point estimate of a continuous parameter has zero frequency of occurrence in repetition. By the definitional logic of Frequentism there is zero probability that the point estimate is correct.”

Yes…but exactly the same reasoning can be applied to Bayesian statistics: any point estimate for a continuous parameter has 0 probability of being correct. Does that mean Bayesian statistics can’t do point estimates? No.

But frequentist reasoning can tell us that using the data alone, this will be the estimator with the lowest MSE. Hence frequentist statistics motivates us to use this as a point estimate.

It seems that you believe frequentist = p-values. This is fundamentally wrong.

A point value of a continuous parameter has zero probability. If we are philosophical Bayesians, treating probability as quantified credibility, then any single point value of a continuous parameter has zero credibility. Is it then problematic for a Bayesian to treat a (Bayesian) point estimate as credible, while simultaneously regarding intervals outside of, say, 95% HDIs as not credible?

Noah,

Bayesian point estimates simply don’t make sense except as conveniences for calculating integrals. If the posterior is sharply peaked, you can approximate the expected value of some function as f(x*) for x* at the peak. If it’s not sharply peaked, what are you doing using a Bayesian point estimate? There’s a reason we all want Stan!

The density in the vicinity of a point is basically the weight function for an integral.

E(f(x)) = StandardPart(sum(f(x_i) p(x_i)dx)) for x_i on a nonstandard grid

(this is equivalent to the limit definition of an integral)

or in real world terms:

E(f(x)) ≈ (1/N) Σ_i f(x_i) for x_i a sample from Stan

So, by all means, the Bayesian point estimate is usually not what we care about, that’s why we want things like Stan! The exact point I’m making is that we want to treat different points within the interval differently!
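To make that concrete, here is a sketch (my own illustration, with ordinary Monte Carlo draws standing in for Stan output): averaging f over posterior samples versus plugging a point estimate into f. When the distribution is wide, the two answers differ a lot.

```python
import random

random.seed(5)

def f(v):
    return v ** 2

mu, sd = 1.0, 2.0                               # a deliberately wide "posterior"
samples = [random.gauss(mu, sd) for _ in range(200_000)]

# Weighted-average answer: E[x^2] = mu^2 + sd^2 = 5 for a normal
mc = sum(f(v) for v in samples) / len(samples)

# Plug-in answer at the point estimate: f(mu) = 1
plug_in = f(mu)

assert abs(mc - 5.0) < 0.1    # Monte Carlo recovers the true expectation
assert plug_in == 1.0         # the point estimate misses it badly here
```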

Cliff: From Wikipedia: “Frequentist probability or frequentism is an interpretation of probability; it defines an event’s probability as the limit of its relative frequency in a large number of trials”

From Merriam-Webster, essentially the same thing: https://www.merriam-webster.com/dictionary/frequentist

From the OED (by way of googling and finding someone who quoted it, I don’t have the OED myself): “Oxford English Dictionary. Frequentist: One who believes that the probability of an event should be defined as the limit of its relative frequency in a large number of trials.”

Ok, so **definitionally** Frequentist means “probability = frequency of occurrence in repeated trials”.

Now, p values are one kind of frequentist probability, they’re precisely the probability for some parameter to be in a region outside some interval *in repeated trials of collecting data and constructing an interval*.

Since the probability for the parameter to take on any single point value is zero, the frequentist point estimate makes no sense.

From the Bayesian perspective it makes sense only when the interval is narrow enough that the integral of interest can be approximated by a sum of one term.

So, the Bayesian point estimate has a direct interpretation as a location at which we can do finite sum quadrature under special circumstances.

Outside those circumstances, we want to treat different points inside any interval differently by weighting them according to a density to calculate an integral. So we need samples from Stan etc.

Frequentist only says what “randomness” means; variation in a population, or more precisely, variation from a random sample from a population. When we say P(H) = 0.5, we mean consider the infinite sized population of possible coin flips. What proportion is heads? 0.5. So if you take a completely random draw from that population, 50% of the time it is heads. That’s a frequentist interpretation of probability. It doesn’t say you have to use p-values, hypothesis tests, etc. With continuous variables, it’s a little more tricky (you are now talking about the limit of a probability divided by epsilon), but no different than how tricky it is in the Bayesian standpoint.

So if your randomness comes from variation in random draws from a population, maybe you want an estimate with minimal MSE given those draws. No p-values, nothing. Still frequentist statistics.

Bayesian statistics opens up that randomness can mean personal uncertainty, rather than just measurable variation across a population. So this motivates using priors in your analysis and generating point estimates (i.e., you can get lower MSE by using more than just the data alone, because you’ve changed the definition of MSE). In fact, if you state that your prior is what generated the parameters which then generated your data, there’s actually no difference between an optimal frequentist analysis and an optimal Bayesian analysis; the prior moves directly into the likelihood. Analysis can mean point estimate, hypothesis test, whatever.

But this idea that frequentist statistics starts from hypothesis testing and moves backward is just plain wrong.

You’re right that “choose the value which has minimum future MSE given the model” uses frequentist probability for its definition. But unfortunately that value is unknown to you, so operationally it is meaningless. It says, in essence, “choose the real parameter.”

So we’re stuck with estimators: for example, “find a procedure with good properties over repeated future use (say, an unbiased estimator), then apply that procedure to your actual data and choose that value,” which is not the same thing; and it’s been proven that shrinkage estimators have lower MSE but are biased.

There is no “lowest MSE shrinkage estimator,” as the shrinkage estimator you get from a Bayesian estimate with a prior normal(the_true_value, epsilon) gets better and better as epsilon goes to zero; the problem is you don’t know “the_true_value” to plug in. But if you include this big family of Bayesian models in “Frequentist estimators,” then in this sense you agree with me: Andrew means “Go Bayesian” by choosing real priors and then using shrinkage estimators calculated from Bayesian posteriors.

Note also that in the case where you really like point estimates, you will estimate for my clinical example that x ≈ 0.4 or something like that, and then what do you do with this information? When x < 0 we assumed the consequences could get deadly. But the point estimate procedure will find 0.4, and then what? Recommend treatment? Even though the Bayesian posterior expected utility is highly negative, because you kill say 1% of your patients?
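A sketch of that clinical example (all numbers are mine and purely hypothetical): a posterior centered at 0.4 with a non-trivial left tail, plus an asymmetric utility where negative effects are catastrophic. The point estimate looks fine while the posterior expected utility is strongly negative.

```python
from statistics import NormalDist

# Hypothetical posterior for the treatment effect: centered at 0.4
post = NormalDist(mu=0.4, sigma=0.25)
p_harm = post.cdf(0.0)            # P(effect < 0), about 5% here

def utility(effect):
    # Assumed utilities: small benefit if the effect is positive,
    # catastrophic loss if it is negative
    return 1.0 if effect > 0 else -1000.0

# Posterior expected utility via draws from the posterior
draws = post.samples(100_000, seed=6)
eu = sum(utility(e) for e in draws) / len(draws)

assert p_harm > 0.01              # non-trivial chance of harm
assert eu < 0                     # a bad bet despite the 0.4 point estimate
```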

Cliff AB said: “Frequentist only says what “randomness” means; variation in a population, or more precisely, variation from a random sample from a population. “

This doesn’t make sense to me: you seem to be defining “randomness” as “variation from a random sample from a population.” But this is a circular definition.

FWIW, my attempt to explain the different perspectives on the concepts of probability and random (for a lay audience) is in Sections I – III of http://www.ma.utexas.edu/users/mks/CommonMistakes2016/WorkshopSlidesDay1_2016.pdf

Ah I know, in the clinical case, you could still be frequentist, calculate the future mean squared error of the consequences!!! Yes… that’d surely do it.

If only there were some math that would tell us which of the methods for Frequentist estimates of the future consequences produces good results… Maybe an “Essentially Complete Class of Admissible Decision Functions” or something

https://projecteuclid.org/euclid.aoms/1177730345

So, thanks to Wald’s theorem (linked paper above) basically again we arrive at “Be Bayesian”, use a real world prior, and real world consequences, and choose the parameter value that minimizes/maximizes the expected consequences and this ensures that there isn’t a decision rule that always dominates your decision in Frequentist risk sense.

Note: I really am not trying to promote Bayesianism here, what I’m trying to do is interpret what Andrew’s advice could possibly mean other than “Go Bayesian”.

It’s typical for him to say things like “focus on estimation and embrace variation,” which to me means “be able to consider a range of possibilities, not all of which are equally important,” so it rules out point estimation as an inference procedure.

By saying things like “ignore p values” he is also implicitly saying “ignore confidence-interval construction procedures,” because a choice of p value and a construction procedure, plus data, leads to a confidence interval in the same deterministic way that 1 + 1 = 2… so his advice basically implies don’t use confidence intervals either.

If we want to stick to Frequentist inference procedures, then, how do we choose the procedure? Your suggestion is by reference to some optimality in future expected errors. The clinical example, where a small number of patients are going to be killed but a large number of patients will get a small benefit, makes it clear that we can’t just refer to the error in the parameter without including the spectrum of consequences of making said error.

Wald’s theorem makes it clear that if you are going to choose a Frequentist procedure that considers the consequences of making the error, and you don’t do something equivalent to choosing a prior and calculating expected consequences under the model… then you are outside the class of admissible rules: your Frequentist future risk under future treatment is higher than it needs to be. So once again we arrive at Be Bayesian.

In all cases, “estimate effects, embrace uncertainty, ignore p values” means Be Bayesian (without all the kerfuffle caused by saying the B word among people who are predisposed to be against it).

Martha:

Suppose the true measure on a population is that the standard deviation of some value is 2. There is nothing random about this: it’s just a measure over a population.

Now randomness enters the picture in considering random draws from that population. In the long run, the standard deviation of these random draws is 2.

I’m failing to see what is circular about this definition.

The only thing I think is left is something like “Be A Bayesian Who Explicitly Models Frequencies” which might mean Andrew’s typical advice to:

1) Create a Bayesian model and fit it in Stan

2) Using the posterior distribution of your model, generate fake datasets

3) Look at the fake datasets and see if there’s something noticeably different from your real data in the sense of frequency fit, and use that information to adjust your model.

Bayesians are of course free to make models of frequencies. And when they’re doing that, (3) is BOTH Bayesian AND Frequentist.

In essence your Bayesian background says “with enough data to inform it, my posterior predictive histogram of fake data should get “close” to the histogram of the real data”

posterior predictive checking then looks like Bayesian ABC inference:

1) “Randomly” select a model from within the set of models you’d be willing to consider.

2) fit model using Stan to produce samples of parameters

3) Using samples of parameters generate fake data sets

4) COMPARE HISTOGRAMS of fake data to real data to see if the frequencies come out “similar” (a similarity measure as in ABC)

5) accept model if similarity is good enough, otherwise return to step 1.

At the end of this, you have a Bayesian model for a frequency process. Is this Bayesian or Frequentist? I don’t really care. I just wonder if it maps to what Andrew means when he gives this kind of advice.
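A toy version of steps 2–4 (my own sketch, with a conjugate normal model standing in for the Stan fit): draw parameters from the posterior, generate fake data sets, and check whether a summary of the real data sits inside the spread of the fake-data summaries.

```python
import random

random.seed(7)
sigma = 1.0                                   # known, for simplicity
real = [random.gauss(0.5, sigma) for _ in range(40)]
n = len(real)
xbar = sum(real) / n

fake_means = []
for _ in range(1000):
    # Posterior for mu under a flat prior: N(xbar, sigma^2 / n)
    mu_draw = random.gauss(xbar, sigma / n ** 0.5)
    # Generate a fake data set from the drawn parameter
    fake = [random.gauss(mu_draw, sigma) for _ in range(n)]
    fake_means.append(sum(fake) / n)

# Frequency-style check: the real data's summary should sit comfortably
# inside the spread of the fake-data summaries if the model fits
fake_means.sort()
lo, hi = fake_means[25], fake_means[974]      # central ~95%
assert lo < xbar < hi
```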

My understanding of Frequentist _statistics_, at least in the Neyman sense, is about frequency of inferential errors. One can be frequentist in interpreting probability while not being frequentist in the error sense. Fisher is an example – he strongly criticised the idea of type II errors but generally restricted probability to apply to repeatable, observable phenomena.

At 2:41 pm Cliff AB said, “Frequentist only says what “randomness” means; variation in a population, or more precisely, variation from a random sample from a population.”

I responded that this sounds like a circular definition.

At 3:45 pm he said, “Suppose the true measure on a population is that the standard deviation of some value is 2. There is nothing random about this: it’s just a measure over a population.

Now randomness enters the picture in considering random draws from that population. In the long run, the standard deviation of these random draws is 2.

I’m failing to see what is circular about this definition.”

I’m failing to see how this is a definition — in particular, what it is a definition of.

We seem to have some serious differences in terminology, at the very least. I’m guessing that when you say, “Suppose the true measure on a population is that the standard deviation of some value is 2,” you are trying to say what I would say by, “Suppose the standard deviation of a certain random variable on a certain population is 2.” In other words, we can’t talk about a standard deviation unless we are talking about a random variable — so randomness is inherent in the definition of standard deviation.

Martha:

Suppose you have a finite population. You can talk about the standard deviation in your population. There’s no randomness: it’s just a simple measure over your population. To make it super concrete, the standard deviation of {1,2,3} is about 0.82. There is nothing random about that; it is simply a measure over a population. Similarly, the mean is 2.

Now, if you sample with replacement from {1, 2, 3}, the standard deviation of your sample, as the number of samples approaches infinity, will approach 0.82. Also, the x bar will approach 2 in probability. The random (i.e. not perfectly predictable) aspect of the statistics you draw come not from the population, but rather which ones are randomly selected to enter your sample.
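That example can be checked in a few lines (my own sketch): the population numbers are fixed; the long-run statistics of random draws converge to them.

```python
import random

pop = [1, 2, 3]
pop_mean = sum(pop) / 3                                      # exactly 2.0
pop_sd = (sum((v - pop_mean) ** 2 for v in pop) / 3) ** 0.5  # ~0.816

random.seed(8)
draws = [random.choice(pop) for _ in range(200_000)]  # with replacement
m = sum(draws) / len(draws)
sd = (sum((v - m) ** 2 for v in draws) / len(draws)) ** 0.5

# Long-run statistics of the draws approach the fixed population values
assert abs(m - 2.0) < 0.01
assert abs(sd - pop_sd) < 0.01
```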

Daniel:

“Why use MSE for frequentist analysis? Why not some other loss function?” (paraphrasing as this thread is getting very long)

Well, don’t worry, we have frequentist theory that totally motivates these things!

After reading through the thread, I see where the confusion is: you’ve defined Frequentist in your own way. This isn’t what I was taught frequentist statistics is, and I don’t like this weird methodology that you’ve relabeled as Frequentist statistics either! However, before you say “‘A’ is awful!”, it would help to make sure you are using the canonical definition of “A”, or at least made it clear that you are talking about something very different than everyone else is talking about.

I would define “Frequentist statistics” as statistical methodology that derives directly from examining frequentist properties of an estimator. Bias, MSE, confidence levels, and significance levels are all frequentist properties… posterior probabilities are not. However, if you can show that incorporating prior information from a certain expert reliably reduces the MSE of your estimates, then you’ve justified the Bayesian analysis in a Frequentist setting! Conversely, if you cannot show that incorporating a prior will help some frequentist characteristic, then it’s hard to justify using priors in the frequentist setting.

…but not impossible! For example, you rightly point out that penalized MLEs are exactly equivalent to MAP estimates (the line between a Frequentist method and a Bayesian method is not as distinct as some seem to believe), and if you use cross validation to show that the penalized estimate has lower expected out-of-sample error (definitely a frequentist property), then you’ve totally justified using a prior in a Frequentist setting. This is why I find arguments like “Bayesian statistics would make a better decision 9/10 times than the Frequentist method in problem X” silly… well, you’ve just justified your Bayesian method as a Frequentist method, so the Frequentist will say the Frequentist method is to do whatever the Bayesian would have done, as it has better Frequentist properties than whatever the straw-man frequentist method was!

In short, I think Bayesian methodology is great; if I didn’t, I wouldn’t have spent time writing Bayesian aspects into my code. But I, and most statisticians these days (I think), see the whole “Frequentist vs Bayesian; there can be only one” argument as silly at best. That being said, most people refer to Bayesian statistics as using priors and Frequentist statistics as ignoring priors, but this is for simplicity rather than philosophical rigor.

Cliff, yes sorry if I’m confusing you. I think there are sort of two definitions of “Frequentist Statistician”

one, the common one, goes something like “anyone who does statistics who doesn’t use generative models with proper priors”

But, like the common definition of “significant” this is not the technical definition you’d get when you ask someone who is interested in the philosophical difference between Frequentist and Bayesian conceptions of probability (note I often “embody” this into “Frequentist statistician” and “Bayesian statistician” but it’s not really about people, it’s about what ideas are implied, so the statistician is a caricature of the “fully committed” version of those ideas)

The technical usage/definition of a Frequentist statistician (caricature) is something like one who uses probability for inference, but who treats this probability as a definite objective property of the world that describes the relative frequencies of how often a certain thing will happen when you repeat it. Note, this doesn’t say you can’t use likelihoods + priors! So it would be wrong to say that “Being Frequentist requires you to not use priors” (ie. the common definition is different from the technical one).

It only says you need to justify the use of the prior by some Frequentist property of the method!

So, you’re right we should justify our method with Frequentist properties if we are Frequentist (or wearing a Frequentist hat at the moment). In fact, we’ve got Wald’s theorem, which short of some minor technicalities says that basically if we want good frequency properties, we know how to get them! Be Bayesian!

In fact, we’ve got the James-Stein estimator, which shows us that when we estimate three or more means at once, if we pick *any point at all* to shrink our estimates toward, we can get a better total mean squared error than the Maximum Likelihood estimator. It also shows that if we pick the point to be close to the real one, things get better and better.
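A sketch of the James-Stein effect (my own illustration: ten means, positive-part shrinkage toward zero): across repeated data sets the shrunken estimate has lower total squared error than the plain MLE.

```python
import random

random.seed(9)
p = 10
theta = [random.uniform(-1, 1) for _ in range(p)]  # arbitrary true means

def sq_err(est):
    return sum((e - t) ** 2 for e, t in zip(est, theta))

mle_err = js_err = 0.0
reps = 20_000
for _ in range(reps):
    x = [random.gauss(t, 1.0) for t in theta]      # one obs per coordinate
    s = sum(v * v for v in x)
    shrink = max(0.0, 1 - (p - 2) / s)             # positive-part James-Stein
    js = [shrink * v for v in x]
    mle_err += sq_err(x)                           # MLE = the observation itself
    js_err += sq_err(js)

# James-Stein dominates the MLE in average total squared error
assert js_err / reps < mle_err / reps
```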

That was shocking to people in the 1960’s but it was *already entailed in what Wald had proved in the 1940’s!*
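To make the James-Stein claim concrete, here is a minimal Monte Carlo sketch of my own (the dimension, seed, and true means are all invented for illustration), comparing the total squared error of the MLE with the positive-part James-Stein estimator shrinking toward zero:

```python
import numpy as np

# Estimate p >= 3 normal means from one noisy observation each, and compare
# the average total squared error of the MLE (the raw observations) with the
# positive-part James-Stein estimator shrinking toward 0.
rng = np.random.default_rng(42)
p = 10                         # number of means; James-Stein needs p >= 3
theta = rng.normal(0, 1, p)    # true means (invented for this sketch)

n_sims = 5000
mle_err = js_err = 0.0
for _ in range(n_sims):
    x = theta + rng.normal(0, 1, p)            # one observation per mean
    shrink = max(0.0, 1 - (p - 2) / (x @ x))   # positive-part shrinkage factor
    js = shrink * x
    mle_err += np.sum((x - theta) ** 2)
    js_err += np.sum((js - theta) ** 2)

print(js_err / n_sims, mle_err / n_sims)  # James-Stein risk < MLE risk
```

With ten means, the shrinkage estimator’s average risk comes out well below the MLE’s, which is exactly the inadmissibility point being made above.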

So my biggest beef is not that people want to do Frequentist statistics…. it’s that they DON’T want to do Frequentist statistics!

The quantity being optimized in the “likelihood models + no priors” case is not “get low mean squared error” or even “get low mean real-world badness” or anything like that. The quality people act as though they’re maximizing when they use all those mixed-effects-with-flat-priors models is “above all else, don’t ever make me, the statistician, actually have any responsibility in the matter.” That is, “cover your ass.”

People are fine using priors, as long as the IEEE committee or the guy who programmed the complicated lmer function or whatever are the ones who choose them. Then, they have no responsibility in the matter.

Doing real-world Frequentist statistics involving point estimation basically requires that you do it via Bayes to get good frequency properties, and the same goes for interval estimation. Both of those things are more or less true due to Wald’s theorem, but James-Stein estimators show that they hold even in a kind of non-Bayesian setting as well.

Next we get to things like Mean Squared Error, or trimmed mean error, or mean absolute error, or whatever your choice of criterion is for a point estimate.

If you aren’t using some kind of real-world measure of goodness to pick your point estimate, how is that good? What frequency properties that anyone cares about are you achieving? I don’t care about the mean squared error in your dosing of the drug you’re giving me; I care about whether errors that big might make me sick or kill me.

Sure, some people don’t really work in areas where things really matter. You might just want to know “about what is the mass of an adult frog of the species foo” in which case you just take a sample average, and you put the p value there so the reviewers don’t complain and to make it look like science. But this post was about CLINICAL SIGNIFICANCE…

So, having thought about this through ranting and raving on the internet (and I admit, this topic gets me pretty over-animated, but I think that’s because in the end it’s got important moral implications for what people should do), I conclude that Andrew’s got something when he advocates something along the lines of:

When you have a problem in which it’s a desirable goal to match your predictions to the frequencies of actual occurrence of stuff…

1) Fit a Bayesian Model using a Real World Prior and

2) Check this via posterior predictive checking to see if it has good frequency properties relative to your data.

I’d like to add, though I know Andrew has said it before,

3) Use real-world “utilities”/”consequences” to make decisions.

And really although this means *use the math of Bayes* it’s still going to give you better frequency properties.
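As a toy illustration of steps 1 and 2 (every number here is invented, and a real analysis would use Stan with a real-world prior), here is a conjugate normal model with a proper prior, followed by a posterior predictive check of one frequency property of the data:

```python
import numpy as np

# 1) Bayesian model: y ~ Normal(mu, 1), with an assumed real-world prior
#    mu ~ Normal(0, 2). The conjugate posterior has a closed form.
rng = np.random.default_rng(3)
y = rng.normal(1.2, 1.0, 50)            # "observed" data, simulated here

sigma, prior_mu, prior_sd = 1.0, 0.0, 2.0
post_var = 1 / (1 / prior_sd**2 + len(y) / sigma**2)
post_mean = post_var * (prior_mu / prior_sd**2 + y.sum() / sigma**2)

# 2) Posterior predictive check: simulate replicated datasets from the
#    posterior and compare a test statistic (the sample sd) to the data.
mus = rng.normal(post_mean, np.sqrt(post_var), 4000)
rep_sds = np.array([rng.normal(m, sigma, len(y)).std() for m in mus])
ppp = (rep_sds >= y.std()).mean()       # posterior predictive p-value
print(post_mean, ppp)
```

A posterior predictive p-value near 0 or 1 would flag that the model fails to reproduce this frequency property of the data; graphical checks of many such statistics are the fuller version of step 2.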

The cool idea I had in this is that the 1, 2, 3 method above is basically a Bayesian ABC method of model selection. http://statmodeling.stat.columbia.edu/2017/06/26/problems-jargon-jargon-statistically-significant-clinically-significant/#comment-514552

So, I just don’t think you really get to call yourself a Frequentist statistician just because you use some likelihoods and no priors, because doing so doesn’t minimize any frequency property. For example, the MLE is NOT an admissible estimator, because James-Stein does better. So if you’re going around doing ML inference, you’re in some sense not really a Frequentist!

So, maybe we’re closer together philosophically than you think.

Cliff AB re your June 27, 12:43 reply:

Re your response to me: OK, if you have a finite population, then the standard deviation of the random variable on that particular finite population can be calculated by the standard formula usually used just for (finite) samples. But in applications such as studying a drug for a certain disease, we need to consider the population as “all possible people with the disease”, which is not reasonably considered as a finite population.

Re your comment to Daniel saying, “before you say “‘A’ is awful!”, it would help to make sure you are using the canonical definition of “A”, or at least made it clear that you are talking about something very different than everyone else is talking about.”

Unfortunately, there are usually not “canonical definitions” of the types of things we’re discussing — terms like “frequentist statistics” have been defined different ways by different people. I agree that people ought to try to give their definitions (and assumptions) when possible, but assuming there is a “canonical definition” is misguided (sorry, I can’t come up with a kinder way of saying it).

Cliff AB: Reading over the comment I just posted, I realize it’s got some problems. In particular, I was wrong when I said, “OK, if you have a finite population, then the standard deviation of the random variable on that particular finite population can be calculated by the standard formula usually used just for (finite) samples.” My error was in talking about “the standard formula usually used just for (finite) samples.” Among other reasons that statement is silly: the formula mentioned gives the sample standard deviation as an estimate of the standard deviation of a normal random variable, and if the population is finite, the population distribution can’t really be normal, because a normal random variable is continuous, not discrete, whereas any variable on a finite population has a finite-valued (hence discrete) distribution.

(Also a minor point: I should have inserted a colon after the second quote from you.)

Cliff AB:

Another attempt at trying to clarify what I was trying to say: You can’t calculate the standard deviation of a random variable on a finite population without knowing the distribution of the random variable.

Put another way: Just having a measure doesn’t tell you what the distribution of the values of that measure is; this gets back to why we say “random variable” rather than just “measure”: to emphasize that the values of the measure on that population have a particular distribution — which is also called a probability distribution, which is where the randomness (hence the term “random variable”) comes in.

Martha: you can define the standard deviation of a finite population as sqrt(sum((x-xbar)^2)/N), where xbar is sum(x)/N.

At this point, there’s no randomness, just some functions of definitely known values.

The only way to get Frequentist randomness out of this is to make reference to a random number generator and a sampling procedure.

When you say “You can’t calculate the standard deviation of a random variable on a finite population without knowing the distribution of the random variable. ”

You can map that to “it matters how you sample from the population.” For example, “sample uniformly at random but leaving out elements 1, 7, and 44” induces a different random variable than “sample uniformly at random,” which in turn induces a different random variable than “sample uniformly at random with replacement.”

And, by the way, when it comes to situations other than sampling from a finite population with an RNG in a specified way, you’re really talking physics, chemistry, biology, geography, etc., and there’s often no real reason to think the random variable has a stable-in-time stationary distribution, even as an approximation to the physics.
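A small numerical sketch of the last few points (population values invented for illustration): the finite-population standard deviation is a definite number with no randomness in it, and randomness only appears once an RNG and a sampling design are specified, with different designs inducing different random variables:

```python
import numpy as np

# A finite population has a definite sd, sqrt(sum((x-xbar)^2)/N), with no
# randomness involved; these are just functions of known values.
population = np.array([2.0, 3.0, 5.0, 7.0, 11.0, 13.0])
N = len(population)
xbar = population.sum() / N
pop_sd = np.sqrt(np.sum((population - xbar) ** 2) / N)

# Randomness enters only via an RNG plus a sampling procedure. Two different
# designs over the SAME fixed population induce different random variables:
# the sample mean is less variable when sampling without replacement.
rng = np.random.default_rng(0)
with_repl = [rng.choice(population, 3, replace=True).mean() for _ in range(10_000)]
without_repl = [rng.choice(population, 3, replace=False).mean() for _ in range(10_000)]
print(pop_sd, np.std(with_repl), np.std(without_repl))
```

The point is that `pop_sd` exists before any probability model; the two simulated standard deviations only exist relative to a chosen sampling design.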

Cliff AB: what I think is that in practice Frequentist statistics gives people stuff they don’t care about, such as confidence intervals. So people add things on top of frequentism: point estimates, behavior rules for what to do after you get a confidence interval, conventions for p-values that vary from one field to another, etc. That all gets associated with “frequentism,” but it has no real justification via the mathematics of frequentist calculations; in fact, if there were a coherent view of what to do, there wouldn’t be so many different conventions (5 sigma in physics, p = 0.05 in psych, non-inferiority at the p = 0.01 level plus an ad-hoc analysis of possible side effects or unusual individual patient responses in FDA approval, etc.).

If, as is typical in a clinical setting, the change in some parameter due to a treatment has a 95% confidence interval of 0.2 to 0.97 with mean 0.45, and 0.2 is considered about the size of a change that would be clinically discernible… what do you do with this information?

If your Bayesian HPD interval is 0.2 to 0.97 with posterior average 0.45 and density p(x), what do you do with this information? In this case I can tell you what you should do: figure out what the consequences of different values of x would be, then calculate integrate(p(x) Consequences(x) dx) and see whether you expect a good consequence compared to the same calculation for the alternative treatment.

If consequences of x < 0 are more and more deadly, but consequences of x > 0 are small, then even if the average of x is 0.45 and the lower end of the 95% interval is 0.2, it may be totally irresponsible of you to recommend the treatment (for example, if there’s a long tail to the left).
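Here is a minimal sketch of that integrate(p(x) Consequences(x) dx) calculation. Both the posterior and the Consequences function are invented for illustration, and Monte Carlo draws stand in for the density p(x):

```python
import numpy as np

# Stand-in posterior for the treatment effect x, centered at 0.45 (invented).
rng = np.random.default_rng(1)
draws = rng.normal(0.45, 0.25, 100_000)

def consequences(x):
    # Invented utility: mild benefit for positive effects, steeply
    # worsening harm for negative ones (the "long tail to the left" worry).
    return np.where(x >= 0, 0.1 * x, -10.0 * x**2)

# Monte Carlo estimate of integrate(p(x) * Consequences(x) dx):
expected_utility = consequences(draws).mean()
print(expected_utility)
```

Compare this number against the same calculation for the alternative treatment; the rare-but-severe left tail gets weighted in automatically, which no single interval summary does for you.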

In this sense, the ability to treat different values with different weights is critical. The practice is definitionally illegitimate in Frequentism, as “the probability that the parameter equals x” is undefined. To a Frequentist, the parameter either equals x or it doesn’t; there is no probability associated with the parameter, only with the data.

So, when Andrew recommends people estimate directly and ignore p-values… operationally, what can a budding researcher do with his suggestion other than start building Bayesian models and fitting them in Stan?

Here’s Andrew quoted: “The way to go, I think, is to just model these treatment effects directly. Estimate the treatment effect and its variation, and go from there.”

“model these treatment effects directly” (note model *the effects* which are necessarily unknowns, parameters)

“estimate the treatment effect *and its variation*” (treatment effects are not observable, how do you model the variation in an unobservable other than to put more or less weight on various values? == Bayesian)

So I interpret this advice as “Build Bayesian models”

I mean, you can disagree with Andrew that this is good advice, but please tell me what this advice means other than “Go Bayesian”?

It just feels to me like a way to advocate Bayesian models without breaking the eggs that come from telling people to do something they have some kind of strong bias against.

>“estimate the treatment effect *and its variation*” (treatment effects are not observable, how do you model the variation in an unobservable other than to put more or less weight on various values? == Bayesian)

Mixed effects models (or multi-level modeling, for a more general category) exist in the frequentist toolbox. And they are quite popular!

I think the problem is that you are fundamentally misinterpreting what “frequentist statistics” means. At this point, it seems I cannot make clear to you your misinterpretation. Sorry.

I think it’s you that mistakenly attributes fundamentally Bayesian models to Frequentism.

Yes, I have no doubt that people do multi-level mixed effects models without calling themselves Bayesian. But in every case I’ve ever heard of, there’s a model of a data-generating process resulting in a likelihood, a reluctance to put a prior on anything, a resulting posterior based on an essentially flat prior, and then a choice of method by which a point or interval estimate is extracted from this posterior.

Wald’s theorem assures us that the failure to use a real-world prior puts this choice of point estimate in a class of rules that is dominated, in the Frequentist error sense, by Bayesian rules. (In essence: shrinkage estimators using Bayesian priors have lower Frequentist risk. Is failing to look for your estimate within the group of rules with the best Frequentist risk properties really “Frequentism”?)

Can you think of any case where this multi-level model stuff occurs where the method of inference isn’t a likelihood function and either a “penalizer” (a prior in disguise) or a flat prior?
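On the “penalizer is a prior in disguise” point, here is a small sketch (design matrix and all numbers invented): ridge-penalized least squares with penalty lambda = sigma^2/tau^2 returns exactly the Bayesian MAP estimate under a Normal(0, tau^2) prior on the coefficients:

```python
import numpy as np

# Invented regression problem: y = X beta + noise.
rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))
beta_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 1.0, 2.0                      # noise sd and prior sd (assumed)
y = X @ beta_true + rng.normal(0, sigma, 40)

# Penalized ML: minimize ||y - X b||^2 + lam * ||b||^2 (ridge regression).
lam = sigma**2 / tau**2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Bayesian MAP: posterior mode under b ~ Normal(0, tau^2 I) and the same
# Gaussian likelihood. Algebraically the identical linear system.
map_est = np.linalg.solve(X.T @ X / sigma**2 + np.eye(3) / tau**2,
                          X.T @ y / sigma**2)
print(ridge, map_est)   # the two estimates coincide
```

The two solves are the same linear system up to a factor of sigma^2, so the “penalizer” is literally a Gaussian prior whether or not anyone calls it that.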

Like for example, maybe a multi-level mixed effect permutation or resampling based inference procedure or something? Do you think this is what Andrew means when he recommends what he recommends?

I’m guessing inference using something like bootstrap or permutation testing (these seem fundamentally frequentist as they directly model the larger population as if resampling with replacement were a good model of infinite sampling) in a multi-level mixed effects scenario is probably pretty uncommon, but if you happen to know thousands of people who are doing it, that’d be interesting to know about.

Cliff, it’s actually worse than this. I just noticed that if you do your likelihood-based frequentist multi-level model in floating-point arithmetic, then you are actually running a Bayesian model in which your prior specifies that there is almost a 100% chance that your parameter is greater in magnitude than 10^300 or so.

> people add things on top of frequentism, namely point estimates

That’s a very strange thing to say. Point estimates predate the development of frequentist methods, which among other things improved point estimation (MLE) and proposed interval estimation to provide not just a point estimate but also an indication of the precision.

Frequentist methods are not just tests. I agree with Cliff regarding your lack of understanding of frequentist statistics. Which is fine; you may have decided that there is no point in knowing the subject well, and you might be right. But you should be aware of your limitations.

Carlos: thermodynamics predates statmech too, but statmech puts the framework around it and gives it a theoretical structure. If we say that the steam tables are not part of statmech then I think we are correct. By the same classification idea if we say that “taking the sample average as a point estimate” is not really Frequentist since it doesn’t use the frequency of anything… I think it’s fair. Yes calculating confidence intervals for this estimator… that’s frequentist… but just taking the sample average is “stuff that’s done” in the same way that the steam tables are Thermo Stuff without being statmech.

If you reject the idea of using p values for anything, you need to reject confidence intervals as well because there’s a 1-1 correspondence between a confidence interval and a hypothesis testing procedure. If you reject confidence intervals, you’re just left with taking averages, stuff that’s done, no longer really embedded in a theory.

Perhaps we need a distinguishing term “Classical Statistics” vs “Frequentist Statistics” vs “Bayesian Statistics”

Classical Statistics is a mishmash of a bunch of stuff people do. In it, people typically refuse to use priors in their mathematics but they often set up complex likelihoods, then they run the model on a floating point machine which automatically sets a uniform(-LargestFloat, +LargestFloat) prior for them. They implicitly are Bayesians who are dogmatically sure pre-data that the magnitudes of all their parameters are around 10^307 but they don’t even know it.

Frequentist statistics is the framework that contains all the stuff that is supposed to be principled application of frequency ideas. In it we have things like permutation tests, resampling, simulation based CI construction, analytical CI construction, and lots of stochastic processes, markov chains, etc etc.

Bayesian statistics is the framework that contains all the stuff that is principled application of Cox/Jaynes probability theory.

I think it’s fair to say that I have a good understanding of the Bayesian part because I’ve spent a bunch of time on it, but I’m always delighted when I find out how to carefully deconstruct some portion of it (like the stuff we’ve been discussing about stopping rules).

As for Classical statistics, it’s fair to say that I’m willfully ignorant of the extent of this mishmash, because I think it is a mishmash and I won’t learn much from it. But I can detect a Bayesian model with a flat prior truncated to uniform(-MaxFloat, +MaxFloat) when I see one, and every “mixed effects” model I’ve ever seen in the literature fit with lmer or SAS or whatnot is exactly that, except the ones using penalized maximum likelihood, and those are just Bayesian MAP estimation with someone else’s default prior rather than the one imposed by the IEEE floating-point committee.

Having realized the flat prior on a floating point machine issue elsewhere in this thread, I can now see why that maximum marginal likelihood blabla stuff is stupid except as a way to fit fairly simple models dominated by nice large chunks of data with very little computation. I NEVER think my parameter is around 10^307 but apparently if you do you’re “Frequentist” ?? Nah, makes no sense to me. There’s a reason Mayo has no love for the likelihood principle. https://errorstatistics.com/2014/09/06/statistical-science-the-likelihood-principle-issue-is-out/

As for Frequentist statistics, I do think I actually understand how to distinguish between an idea that is fundamentally Bayesian and one that is fundamentally Frequentist, most of the time (though there are subtleties here again; an example is a “deterministic” stopping rule: Frequentist-deterministic in the sense that repeating it always gives the same result, or Bayesian-deterministic in the sense that seeing the data lets me know exactly whether you will stop?). And I don’t think the “random effects mixture model that is mathematically equivalent to a Bayesian model with a flat prior” is essentially Frequentist. I do think some pretty cool advanced things, like the distance correlation method, are essentially Frequentist: https://cran.r-project.org/web/packages/energy/energy.pdf

What I do think is the case, though, is that my definition of what constitutes Frequentist inference is much more limited than the one in common use… the Classical mishmash-of-ideas statistics doesn’t count for me.

Daniel:

You write, “I NEVER think my parameter is around 10^307 but apparently if you do you’re ‘Frequentist.’” Unfortunately, most of the frequentist theory I’ve ever seen seems designed to work for any values of the parameter, including 10^307. But there is some frequentist work that is done conditional on particular parameter values, or ranges of parameter values. For example, my 2014 paper with Carlin.

Andrew: yes, I was being tongue-in-cheek there. My real point is that by using a likelihood with a flat prior in double-precision floating-point arithmetic, your model *is literally* a Bayesian model with a uniform(-MaxFloat, MaxFloat) prior.

As soon as you realize that’s what you’re doing, then you should probably tell yourself “you know what? I don’t really think my parameter is almost sure to be greater than 10^300” and become Bayesian, because you *are* running a Bayesian model with a particular real prior anyway.

Now, when you don’t use a likelihood to do your Frequentist stuff… then there’s no direct correspondence. So things like goodness of fit tests, or resampling or whatever are clearly frequentist… but “likelihoodist with flat prior” on a floating point machine is just “Bayesian with a stupid prior set by the IEEE-754 floating point committee”
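A back-of-envelope check of the IEEE-754 claim: under a flat prior truncated to the representable range of double-precision floats, essentially all of the prior mass sits above 10^300 in magnitude.

```python
import numpy as np

# Largest representable double, the upper end of the implicit uniform prior.
M = np.finfo(np.float64).max          # about 1.8e308

# Under Uniform(-M, M), the prior mass on "reasonable" values |theta| <= 1e300
# is the ratio of interval lengths: 2e300 / (2M) = 1e300 / M.
p_small = 1e300 / M
print(p_small, 1 - p_small)           # prior is nearly certain |theta| > 1e300
```

So the “no prior” mixed-effects fit is, numerically, a proper-prior Bayesian fit that is dogmatically sure pre-data that the parameter is astronomically large, which is the point being made above.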

Also if you take a look at http://statmodeling.stat.columbia.edu/2017/06/26/problems-jargon-jargon-statistically-significant-clinically-significant/#comment-514552

What do you think of that interpretation of the suggestion to use posterior predictive checking, as an ABC type method with implicit prior over models (whatever generating process causes you to write down Stan models) and an informal “closeness” measure related to your graphical checks of frequency properties?

Andrew: see also Philip Stark on ‘constraints vs priors’

https://www.stat.berkeley.edu/~stark/Preprints/constraintsPriors12.pdf

Daniel – no, that’s not true. For example, ‘likelihoodists’ eliminate nuisance parameters differently from Bayesians.

ojm: likelihoodists refuse to do integrals over posterior samples because this would make them Bayesian… as soon as they’re doing that, they are doing full Bayes.

So, on floating point machines, they’re using Bayesian models with priors set by the IEEE to do “stuff that’s done,” with little in the way of useful theory to guide them. Much of it involves methods of picking point estimates. Much of it involves one-parameter-at-a-time sensitivity analyses in the region of those point estimates. Sometimes it involves setting up some alternative to maximum-likelihood point estimation, such as penalizing the likelihood using some kind of hold-out-error criterion… Other times they take Hessians and use them to define a region of space that is “in the high-likelihood region,” using Normal approximations / Taylor series of the likelihood around the max point…

Like I said, “stuff that’s done” it’s not really principled Frequentist anything, though some of it might have some specific frequency interpretations. Machine Learning has this flavor. “do some stuff with a function of many variables”

The other possibility is that people have legitimate reasons for doing what they do…

ojm, likelihoodists eliminate nuisance parameters by profile likelihood; Royall regards this as an ad hoc device because it’s not a true likelihood and thus can violate the universal bound on the (sampling) probability of misleading evidence (which is what likelihoodists use for sample size determination). I would also add that in small samples, profile-likelihood-based intervals can be pretty bad according to both frequentist and Bayesian criteria. In the large-sample limit, profile likelihood is equivalent to marginalization against a Jeffreys prior to some high order of approximation, n^-1 or something. So you’re right that it’s not equivalent to a flat prior, but Daniel’s point about an equivalence of likelihood-based methods to some Bayesian posterior is broadly correct too.

Ojm:

In the article you point to, Stark does the familiar thing of focusing on difficulties with the prior distribution while taking the data model as known and perfect. This doesn’t make sense in the problems I work on in political science, pharmacology, etc., where data models have a lot of arbitrariness and where a lot of prior information is available, and that information is not in the form of constraints. Stark’s not describing any sort of scientific problem that I’ve ever seen.

Andrew – fair enough. I was just responding to the idea that interval constraints are not possible within frequentist inference.

For what it’s worth Philip Stark started out in geophysical inverse problems and did some nice work there.

Ojm:

Yes, I think different methods can be useful in different fields of application. A method is not just a method, it’s part of a larger workflow. For example, I can only assume that when Fisher and Tukey used p-values and significance testing, this was functional for the problems they studied and in the context of their workflows. Apply the very same methods to noisy data and a disjointed workflow, and you get Satoshi Kanazawa wasting his and everybody else’s time chasing noise.

Corey – too much and too off topic to discuss further. But yes, I’m aware of what you mention and no, I don’t think Daniel is broadly right.

Andrew – yes, I agree. I use Bayes not infrequently, I just think it also has significant limitations. One limitation is, perhaps ironically, that it relies too strongly on the correctness of the model. I see Tukey as a precursor to ‘data science’ and, to some extent, ‘machine learning’, so his general ideas have certainly been extended to complex problems where the model is in question or unknown.

ojm: I don’t think the machine learning people are necessarily wrong, and I don’t think the idea of just taking a sample average, or just maximizing some function is wrong either. I think all that stuff can be useful. I just think that lots of it is not really frequentist, and that the meaning of frequentist is not the same as “not bayesian”. I think there’s a lot of “not bayesian” stuff that’s just “heuristically pragmatically motivated stuff people thought of that kind of works for them” or is easy to calculate with some off the shelf software, or easy to parallelize or whatever.

So, the real point here is not to advocate “everyone become Bayesian”; it’s to understand what Andrew is really advocating, while at the same time advocating a more nuanced understanding of what the different camps do than “Bayesian is probability theory as real-valued logic, and Frequentist is everything that isn’t Bayesian.”

Is he advocating “everyone become a pragmatic machine learning programmer and learn all the ins and outs of TensorFlow?” is he advocating “start doing likelihoodist inference but stick to flat priors on the range of floating point?” is he advocating something else? I think the only thing that fits with my model of what he means is “fit things in Stan using hierarchical Bayesian models and then if you care about the frequency properties check them using posterior predictive checks” but he’s never too explicit about that.

The best way to decide which approaches to use is to understand them properly and use them properly. When you say ‘frequentists do this,’ ‘likelihoodists do that,’ etc., and it doesn’t seem like a correct representation of what they do and why, then I feel like it doesn’t help people decide which approaches to use. On the other hand, saying ‘this is what I do and why I think it might be reasonable’ is more constructive.

Fine, I think that’s a reasonable thing to say. In this case, I think we’ve wound up here because Cliff suggested that frequentists do all kinds of mixed/random effects models, and I disagreed on the principle that just because something is “not Bayesian” doesn’t make it frequentist, and then I had to back that up and it all got blown out of proportion over labels.

For example, “use an unbiased estimator that minimizes the MSE” isn’t frequentist in the following sense: why the MSE? Why unbiased? What principle goes into choosing something to minimize? In my opinion the real history is: we saw some stuff that was done, such as taking sample averages; we explained after the fact what its frequentist properties were; we realized there was a general way to evaluate the frequentist properties of things; we created a theory around what was done that could be generalized into suggesting other things that could be done; and then people took the frequentist properties of the thing that was done as “powerful” justifications for continuing to do what they were doing in the first place.

In fact, though, why the MSE and not the Mean Absolute Error (advocated by Taleb, right?)? Why not the Trimmed Mean Absolute Error? Why not all those things suggested in the Robust Statistics chapters of Venables and Ripley? Why not minimize the Mean Generalized Real-World Loss? Eventually you’ll wind up at Wald’s theorem and realize that any kind of “minimize some frequentist risk of doing bad stuff” has to be a Bayesian decision rule or its equivalent; otherwise it will be dominated by a Bayesian rule in the Frequentist sense.
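The choice of criterion really does change the answer. In this sketch (a skewed stand-in posterior with invented numbers), minimizing expected squared error over a grid of candidate point estimates lands near the posterior mean, while minimizing expected absolute error lands near the posterior median:

```python
import numpy as np

# Skewed stand-in posterior, so mean and median genuinely differ:
# lognormal(0, 1) has mean exp(0.5) ~ 1.65 and median 1.0.
rng = np.random.default_rng(7)
draws = rng.lognormal(0.0, 1.0, 100_000)

# Evaluate two different expected losses over a grid of candidate estimates.
grid = np.linspace(0.1, 4.0, 400)
sq_loss = [np.mean((draws - g) ** 2) for g in grid]       # MSE criterion
abs_loss = [np.mean(np.abs(draws - g)) for g in grid]     # MAE criterion

best_sq = grid[int(np.argmin(sq_loss))]    # minimizer ~ posterior mean
best_abs = grid[int(np.argmin(abs_loss))]  # minimizer ~ posterior median
print(best_sq, best_abs)
```

Swap in a trimmed loss or a real-world consequence function and you get yet another “best” point estimate; the criterion, not the data alone, determines the answer.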

So, I think it’s important to have more than just “Bayesian” and “everything else equals Frequentist” and to avoid thinking that just because you could figure out a frequentist property of some thing that is done that it therefore must be principled frequentist thing to do.

I’m not an ML expert, but it seems like there are lots of ML algorithms that aren’t even motivated by probabilistic considerations; they’re a kind of generalized approximation technique, like splines or whatever. “What’s done,” even if it has some probabilistic interpretation if you look at it right, is done on the basis of “get close to these data points.” It’s “not Bayesian,” but that doesn’t mean it’s Frequentist. It’s basically “pragmatically makes us a bunch of money without costing too much in computing time.”

It can be helpful to realize that the world divides into more than two groups. It can also be helpful to realize that something you thought was one thing (frequentist multilevel mixed effects models) is really, once you put it on a floating-point computer, something else entirely (a Bayesian posterior distribution with a proper prior determined by the IEEE floating-point committee)…

But I’d still like to hear Andrew explain what he means by:

” The way to go, I think, is to just model these treatment effects directly. Estimate the treatment effect and its variation, and go from there. Forget the hypotheses and p-values entirely.”

HOW does a budding but confused “clinical” researcher operationalize that advice?

OK, but you’re doing it again re: the complete class stuff. To get that you have to allow improper priors – sorry, non-generative models… – which opens up its own issues, etc. We could do – and seem to have done – this all day. Bayes doesn’t ‘kill’ or ‘dominate’ or whatever Freq, and vice versa. They are different things with different strengths and weaknesses (and some interesting connections too, of course).

Hmm. The word “dominate” is a technical term, as you know. I agree with you that it is a bad term, like “significant” or “uniformly most powerful” or whatever, because it sounds like “kill” or “destroy” and these things have nontechnical meanings that everyone will assume. There’s a lot of math terminology invented during WWII that sounds like “Killing and destroying the enemy”. So, let me try to go further than accidental punning.

If you want to make decisions, such as choosing a particular point estimate or deciding whether to give a drug, and you want your decision-making rule to have the *frequency-related* (Frequentist) property that under repeated application it will on average have small “badness” (or large “goodness”), then you should look for your procedure within the class of procedures mathematically proven to have unbeatable frequency-of-badness properties. This class is the Bayesian decision rules.

The boundary of the class is the Bayesian decision rules with flat priors, but we know more, we know that the frequency properties of Rule A will be better than Rule B whenever a region around the real parameter is higher in the prior probability distribution under A than under B. Then we are giving more weight to the actual correct value and so our decision is based more on what will turn out to happen.

In other words, if you REALLY DO care about minimizing mean real world badness of your decisions, a Frequency property that it seems Frequentists should care about, then you need to choose real world priors of some kind, and use real world consequences not “Mean Squared Error” which is chosen really on the principle that it makes some math easy.

I had a bunch more to say on this, but it was maybe a little overwrought for this blog. I put it at my blog.

http://models.street-artists.org/2017/06/27/on-morality-of-real-world-decisions-and-frequentist-principles/

Yes we can discuss another time. I first encountered Bayes proper via decision theory, but never really liked decision theory myself. Though weirdly, have come to appreciate the rationale behind maximin/minimax much more.

I love Bayesian Decision Theory, but maybe it’s because I come from a background where decisions get made that really do kill people (Civil Engineering, Bio/Med)

I don’t want to get too involved here, but FWIW, in quantitative ecology, I have never seen an application of max likelihood that could not be usefully mapped to Bayes with a flat prior, with the result, usually, of better understanding the model and the implications. Conversely, I have never seen formal frequentist justification for any given application of the ML approach (i.e. that the large-sample properties have been studied or something). Rather it is mostly used out of either a) convenience, or b) misplaced skepticism or simple lack of knowledge of Bayesian methods, Stan, etc.

Chris: thanks, that’s my experience too. I once sat down at a tutorial session where someone with a 6-figure salary explained to my wife how to do analysis of RNA-seq data via software with a multi-million dollar site license. (And I kept my mouth SHUT as instructed by the boss / my wife.)

It would best be summarized as “push these buttons and select these menu items and then after about a half hour of computing you’ll have your p values for publication”. The whole session took about an hour.

I don’t think there was anything frequentist about it. There were enough buttons and menu items to select that the analyst’s selections would have to enter into any frequentist calculation for it to make any sense… it was just “this is what is done”.

Of course deep inside the production of this software were some frequentist principles guiding the production of the software, but the bigger principle definitely seemed to be “don’t blame the analyst”. It conveniently spit out 3 or 4 alternative analyses for you so you could pick the one that gave you the highest chance of getting a grant.

Max likelihood is neither likelihoodist nor frequentist in general.

FWIW there are simple examples, similar to ecological models eg systems of ODEs, where I would argue eg profile likelihood is less misleading than marginalisation. Furthermore, both Bayes and likelihood are quite susceptible to model misspecification and/or outliers (which would be expected in ecology right?).

And as far as I am aware Stan can’t handle nonidentifiable/multimodal problems which would surely be expected to be the norm in ecological modelling? You can of course ‘fix’ the issue by adding prior info til the model is well-identified but this can be handled in Bayesian, likelihood and/or frequentist ways and really is solving 90% of the problem without updating via Bayes or evaluating coverage or…

Daniel,

OK, I guess I was just finding it hard to imagine having complete data for the finite population to do the calculation — I was thinking in terms of needing to know how often particular values occur, but if one does have complete data, then the simple formula does work.

The not-stable-in-time distribution was also something I meant to mention but forgot — in most practical applications, that is important.

ojm: the non-Bayesian uses of likelihood that I have seen are all max likelihood, so I won’t comment on other approaches. Let’s take for granted that max likelihood inference almost always relies on asymptotics (i.e. large-sample properties). But in ecology we generally are data-limited, and trying to fit models that push the envelope of what we can do. Bayes is especially useful in that we can seamlessly incorporate information across multiple datasets and even include expert opinion in order to constrain our inferences. That said, I agree nothing is perfect. Users of Bayesian methods need to do the hard work of model checking, and carefully think through how their model(s) relate to substantive scientific questions/hypotheses.

In terms of multi-modality, sure, that can be a problem. Relatedly, have you seen this? http://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html

ojm: “Max likelihood is neither likelihoodist nor frequentist in general. “

right, I think that is something not widely appreciated, particularly among people who use “Frequentist = not Bayesian” as the definition.

Max likelihood has asymptotic frequency properties, but so does everything, really; there’s always some property. But is MLE Frequentist in any sense other than “it has frequency properties that can be quantified”? I don’t see it. I think it’s the other way around: MLE is what was done, and then some principles of Frequentism were sort of solidified, and some mathematicians quantified the MLE estimator’s properties.

The James-Stein estimator as far as I can tell, was explicitly constructed to prove a point, which is that you get better frequency properties from shrinkage than from “what is done” such as MLE. This was shocking because “what is done” was based on very powerful intuition that what’s in the data should be enough, just choosing any old extra number shouldn’t help you. Turns out… not so.

I don’t see how a thing can be frequentist just because it has quantifiable frequency properties. I mean, the estimator “always choose the largest floating point number your computer supports” has frequency properties (to high approximation it’s very wrong 100% of the time!)

The principle of Frequentist statistics can’t just be “use something that has quantifiable frequency properties” because that would give the OK to “always pick 0” or “always pick a uniform random number from the range of your floating point arithmetic” or anything like that. So there has to be more to Frequentist statistics, and I think it’s got to be something like “decide on an optimality property involving the frequencies of something, and then pick something that at least approximately satisfies those optimality properties”

The confidence interval approach is basically “try to find the smallest interval that is guaranteed to satisfy coverage requirements regardless of the parameter value”.

The penalized maximum likelihood method is… a bastard version of decision theory. Use a type of model from the optimal decision model set, pass the buck on the choice of prior (optimizing the deniability-of-responsibility constraint), and then optimize the resulting function over the parameters. Again, I don’t think it really is frequentist, because it’s not designed to optimize the frequency properties of anything (except maybe the frequency with which you can deny responsibility for choosing the prior? A little glib, but I do think that’s a nontrivial motivation).

I think MLE and penalized MLE are more along the lines of what I glibly called “do some stuff with a function of many variables”. Of course penalized MLE is also Bayesian MAP estimation, but Bayesian MAP estimation isn’t really very Bayesian because its only interpretation in Bayes is as a usually crappy integral approximation. The Bayesian method is “choose something that minimizes expected real world cost” but the MAP doesn’t help us with the expected real world cost unless the posterior is super sharply peaked, but that sharp peak stuff isn’t really part of PMLE. When the posterior is high dimensional, and not sharply peaked at the MAP then it’s a little like “take a single sample and assume the population is all exactly equal to that sample” as a frequentist estimator of the mean of a population. Yikes!

But, when your downstream real-world concern isn’t very sensitive to the inference precision (such as “as long as X is positive go ahead and do Y”) then MAP and MLE and all those things can be super useful as computationally simple ways to basically make money or whatever else people use statistics for. Usually it’s all embedded in an iterative process anyway: “Think of a thing to do, get some data, find a PMLE estimate, if it’s positive/negative/far enough from zero/blablabla start doing the thing, repeat”

So, the three principles seem to be: Bayesian probability as logic. Frequentist optimize some future frequency of occurrence of something. And, Do Some Stuff With A Function of Many Variables That Makes Us Money / Gets Us Good Enough For Our Purposes Inference / Isn’t Obviously Wrong/… whatever MLE and PMLE and Support Vector Machines, and Kernel Tricks Plus Run Some Big Name Software, and fuzzy-k-means clustering, and penalized spline regression and soforth are.

Sure, in this latter class some of them have interpretations in terms of probabilistic Bayesian models or whatever, but the motivation isn’t usually “think of a Bayesian thing and then turn it into efficient software” usually it’s more “think of some efficient software that makes us money” and then some academic comes along and says “this is kind of a MAP approximation to the following Bayesian model…”

https://en.wikipedia.org/wiki/Maximum_likelihood_estimation#History

Basically makes this explicit. MLE was justified initially on heuristics, and then later Wilks came up with a proof that allowed him to construct a confidence region around the MLE. At this point does MLE become Frequentist? Not really. I can come up with a confidence region around “choose 0” too… For models that are numerically well posed (by which I mean computable on a floating point computer) the finite region [-LargestFloat, +LargestFloat] is a 100% confidence interval for “choose 0”.

So, just the existence of a confidence interval isn’t sufficient to make something a “Frequentist” method I think, because no Frequentist alive would advocate “choose 0” but for a VERY large class of models it’s got a confidence interval.

This is probably getting too long and too off topic by now! But yes I’ve seen the mixture model example – I see it as a cautionary tale (now imagine something more complicated than a mixture model!). And all this before you can run Stan.

RE incorporating constraints and expert opinion – I used to think Bayes had a monopoly on this and was clearly the way to go. Now it’s not so clear to me ( it’s still *a* way, just not the only one and has both strengths and weaknesses).

I think I have figured out how to say what I was really thinking about when I responded to Cliff AB re his comment “Suppose you have a finite population. You can talk about standard deviation in your population.”

The real underlying question is: Why are you calculating the standard deviation of that population? If you know that you are working with a normal population, the standard deviation is a good thing to know, because if you know the standard deviation and the mean, you know the entire normal distribution, and you can use that information to do inference. But if your population is not normal, you need to do inference in a way that takes into account the actual distribution. If you have, for example, a noticeably skewed distribution, the standard deviation is not very informative, because it does not take the skewness into account. (Even real estate agents report medians rather than means for the typically skewed distribution of housing prices!)
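A toy illustration of the skewness point, with made-up lognormal “housing prices” (the parameters are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
prices = rng.lognormal(mean=12.0, sigma=0.5, size=100_000)  # invented skewed prices

mean_price = np.mean(prices)       # pulled upward by the long right tail
median_price = np.median(prices)   # the "typical" house; what agents report
sd_price = np.std(prices)          # large, but silent about the asymmetry
print(mean_price, median_price, sd_price)
```

The mean lands well above the median, and reporting mean plus or minus SD alone would hide the shape entirely.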

Daniel, I can use a frequentist confidence interval perfectly fine without appealing to a test or even thinking about it as testing. For example, if I say “the mean effect was between 2 and 3 (90% CI)”, what typical test does this relate to? I’m just saying that I used a process that’s correct 90% of the time, the mean is in this range, and I can go on to act as if it’s in that range until I get further evidence. I can just focus on those values: whether the range is usefully narrow, or suggests the sample is representative (usually wide enough given N), or whether it captures an anticipated value based on prior work. But I don’t ever have to discuss or care about the obvious fact that 0 isn’t in the range. There’s no reason to ever discuss testing or limit yourself to only discussing it as an inverted test. In fact, once you use it as a traditional frequentist test you limit its usefulness as an estimate, because you’ve implicitly stated that the CI is calculated assuming the null was true, and if it’s not, then the range doesn’t really have a meaning (you can calculate CIs without assuming the null is true).

The testing procedure

    function test(q, Data, p, CI):
        construct the interval I = CI(Data, p) using confidence procedure CI
        if q is in I, return 1; else return 0

is a hypothesis test of parameter = q at level p IFF CI is a “true” confidence procedure

So a confidence interval construction procedure **is** a rule for constructing “Frequentistly legitimate” hypothesis tests.

Everything you do in Frequentist statistics is equivalent to a test of some kind. This is definitional, if it has the correct frequency then it’s a (valid) test, and if it’s a (valid) test then it has the correct frequency.
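The equivalence is easy to check by simulation. A sketch with a 90% z-interval for a normal mean (known sd), verifying that the induced test rejects a true null close to 10% of the time:

```python
import numpy as np

Z90 = 1.6448536269514722   # Phi^{-1}(0.95): half-width multiplier for a 90% interval

def ci(data):
    """90% confidence interval for a normal mean with known sd = 1."""
    half = Z90 / np.sqrt(len(data))
    return np.mean(data) - half, np.mean(data) + half

def hyp_test(q, data):
    """The induced level-0.10 test of mean == q: reject iff q is outside the CI."""
    lo, hi = ci(data)
    return not (lo <= q <= hi)

rng = np.random.default_rng(3)
n_reps, n = 50_000, 25
rejections = sum(hyp_test(0.0, rng.normal(0.0, 1.0, n)) for _ in range(n_reps))
rejection_rate = rejections / n_reps
print(rejection_rate)   # close to 0.10: the inverted CI is a valid level-0.10 test
```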

Here’s Larry Wasserman, a strong proponent of frequentism, saying the same thing on this very blog:

http://statmodeling.stat.columbia.edu/2013/06/24/why-it-doesnt-make-sense-in-general-to-form-confidence-intervals-by-inverting-hypothesis-tests/#comment-147455

A given CI is equivalent to an inverted hyp test, yes. But I would say a single CI is not the be all and end all of Freq inference, even sticking to traditional Freq.

For example there can be multiple intervals with the same coverage – which do you prefer? Classic Freq theory offers some _additional_ guidance (eg shortest intervals for fixed coverage etc).

So Freq theory can’t be identical to just presenting eg a 95% interval.

You could for example stack all possible shortest (or whatever – don’t ask me to provide conditions for existence) CIs at all levels to get a confidence distribution. Or you could go the other way and just bootstrap, which would then imply a family of CIs. Etc.

But neither Bayes nor Freq is perfect (or even that good).

As a side point, here’s Andrew Gelman, a strong proponent of Bayes, acknowledging issues with the foundations of logical Bayes:

http://statmodeling.stat.columbia.edu/2017/06/19/not-everyones-aware-falsificationist-bayes/#comment-513608

ojm: fine, but to be clear I still think that if you’re not constructing an interval, or a family of intervals, or a decision procedure with a known frequency of success, or using the frequency of anything, and you’re just doing something like taking a sample average and publishing it, then you’re not doing frequentist statistics. You’re just taking averages, in the same way that if you look up values in the steam tables you’re not doing statistical mechanics; you’re just using a lookup table that summarizes some data.

Constructing minimum length intervals with given coverage is balancing the two classical frequentist errors, type I and type II. Bootstrap seems to me perhaps closer to Fisher as it seems to rely more directly on frequentist probability than frequentist error rates for basic justification (though this is just a personal interpretation).

The posterior density $latex p(\theta \mid y)$ doesn’t matter, only expectations over that density,

$latex \mathbb{E}_{p(\theta \mid y)}[f(\theta, y)] = \int_{\Theta} f(\theta, y) p(\theta \mid y) \, \mathrm{d}\theta$.

That is, everything we care about (parameter estimates, event probability estimates, and predictions) can be formulated as an integral with an appropriate choice of function. This is important, because we can change the curves while maintaining the integrals by reparameterizing and applying a Jacobian. A good example would be a Beta(0.5, 0.5) reparameterized with log odds; in the Beta, the density is unbounded as it approaches the boundary, whereas in the log odds, the density asymptotes at zero as you go toward plus or minus infinity.
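The reparameterization point can be seen numerically: the same draws give the same expectations whether you work on the Beta scale or the log-odds scale, even though the densities look completely different. A sketch, using f(theta) = theta^2 as an arbitrary example function:

```python
import numpy as np

rng = np.random.default_rng(4)
theta = rng.beta(0.5, 0.5, 1_000_000)   # draws on the original (0, 1) scale
eta = np.log(theta / (1 - theta))       # the same draws on the log-odds scale

# Density shapes differ (unbounded at 0 and 1 on the Beta scale; vanishing in
# the tails on the log-odds scale), but any expectation agrees on both scales:
e_orig = np.mean(theta ** 2)
back = 1.0 / (1.0 + np.exp(-eta))       # inverse logit maps eta back to theta
e_reparam = np.mean(back ** 2)
print(e_orig, e_reparam)                # both near 0.375 = E[theta^2] for Beta(0.5, 0.5)
```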

Right, there’s no Bayesian probability to have a parameter be exactly a given value. So we can take the probability over any small but finite interval dx and get something well approximated by p(x*) dx with x* the midpoint of the small interval. If you want to know something about the probability of stuff that happens when x is near x* then you can calculate it, and when p(x*) is high density, you’ll get higher probability for that stuff than when p(x*) is low density.

By the same token in a frequentist sense instead of fixing alpha and getting a confidence interval, we can instead construct confidence intervals with smaller and smaller confidence (larger and larger p value) until we get one of finite but small size dx.

In problems involving enormous amounts of data, you will generally get to the point that the confidence interval and/or the bayesian high probability interval all wind up smaller than epsilon the resolution that you care about.

In practice, in problems involving less than enormous amounts of data, if the Bayesian (say 95%) interval is bigger than several multiples of epsilon, your resolution of caring, then the p value required to make the confidence interval of length about epsilon will be laughably large, p = 0.33 or p = 0.75 or whatever. By the same token, the Bayesian interval of length epsilon will have posterior probability less than 1, but with the Bayesian probabilities of each interval of length epsilon you can at least do things like calculate expected utilities using the posterior probability and make Bayesian Decision Theory decisions. In Frequentist inference, what can you do with the fact that if you make p = 0.33 your confidence interval is about length epsilon?

You can do that, but the problem is that the probability mass is going to be close to zero in a small interval, so it’s not so useful for making probability statements. For example, with a multivariate normal, the highest density point is the mode (same as the mean for the normal), but there’s not much mass near the mode in high dimensions. You have to expand more than a little around the mode, at which point the approximation of mode times volume is poor and you really need to do the integral.

Naive users of statistical inference often erroneously assume that the parameter values must lie near the mode because it has the highest density.

I know you (Lakeland) know what’s going on here, I just wanted to point this out for the unwary.
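For the unwary reader, this is easy to see by simulation with a standard normal (the dimensions here are chosen arbitrarily): the fraction of mass within a fixed radius of the mode collapses as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(5)
frac_near_mode = {}
for d in (1, 10, 100):
    x = rng.standard_normal((100_000, d))      # standard normal in d dimensions
    r = np.linalg.norm(x, axis=1)              # distance of each draw from the mode
    frac_near_mode[d] = np.mean(r < 1.0)       # mass within radius 1 of the mode

print(frac_near_mode)   # about 0.68 in 1-d, nearly zero in 10-d and 100-d
```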

Yep, thanks Bob, it’s important to think of the silent audience that I know is out there reading this stuff.

At least the point estimate at the peak of the distribution has an interpretation as a very lousy integral estimator… and elsewhere I agree with people who say things like the bayesian MAP point estimate is invalid just like my complaint about the frequentist ML estimate. Yes, that’s true unless the distribution is super-tight. That’s why we need Stan!

If you tell Stan to run, as you add one point after another from Stan you’ll get better and better estimates of integrals. If all you have is one point, you still get an estimate, but it’s a lousy one!

(note we had a discussion a while back in which we agreed that the peak point isn’t in the typical set, but it is in the “high probability volume” and there was just very subtle distinctions that had no practical meaning because the region around the peak has a tiny volume anyway in high dimensions)

@bob

dumb question – what is your preferred general definition of Bayes’ theorem for continuous variables (and conditioning on the observed data)? Is it defined purely in terms of expectations?

Here’s one important thing that we care about that cannot be formulated as one of those integrals: the entropy of the distribution.

Entropy is the expectation of the log-density (or the log of a ratio of densities), right…?

Well, yes, but that means that the density does matter! I tried to make sense of Bob’s comment as saying something like “the density is there just to allow us to calculate the expectation of the things we care about”.

I actually thought of writing my comment from a different angle, questioning whether his statement had any meaning at all! I guess anything you can do with the density, can also be done with integrals. For example, defining a HDR is something quite natural to do with a density but you could also do it a more contrived way from expectations.

Fair point! And while the integral is needed to tell you how much mass is in your HPD region, the density is also needed to know the location of the boundary of the integral.

Densities are really there to be integrated. You can define a high probability region as “the smallest region containing probability 1-epsilon” but to calculate this will require integration of your density over the candidate region to see if it has 1-epsilon probability.

Sometimes the integral can be thought of as a contrived expectation, like the expectation of an indicator function to calculate the total probability in a region. This is the kind of thing mathematicians like, because it lets them put an umbrella over everything: “everything is an expectation”. Yes… OK, technically.

The reason though that “everything is an expectation” is so useful for someone like Bob is that Bob is writing software to do a really really good job of calculating expectations from finite samples. If you can make everything an expectation, then Bob’s software can do everything for you.
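For instance, a region probability computed as the expectation of an indicator over draws, with a made-up standard normal standing in for posterior samples from Stan:

```python
import numpy as np

rng = np.random.default_rng(6)
theta = rng.normal(0.0, 1.0, 1_000_000)   # pretend these are posterior draws from Stan

# "What's the probability theta exceeds 1?" as an expectation of an indicator:
p_region = np.mean(theta > 1.0)
print(p_region)   # near 0.159, i.e. 1 - Phi(1) for this standard normal example
```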

I think one could give a constructive definition of the HDR as finding the boundaries that minimize the expectation of one over the density (gives the length of the interval, if I’m not mistaken) under the probability coverage constraint.

Carlos, I wrote up the loss function that leads to HPD intervals a while ago here.

In Frequentist statistics, there is a 1-1 correspondence between confidence intervals and statistical tests. A statistical test only tells you something about an observed test statistic relative to the distribution of that statistic under the assumption that the null is true. Confidence intervals tell you this, but they also tell you something about the range of non-null hypotheses that you would reject, too.

It is surprising that this idea is still only being adopted in some places/fields. I thought by now everyone had heard of this issue. Certainly it was already well known when I entered grad school, and I consider my statistical training there to be wholly inadequate to the point of fraudulent (apparently not even really the stats teachers’ fault, one of whom said they were scared of getting fired if they dared teaching anything besides NHST). Here it is being reviewed >30 years ago: http://www.sciencedirect.com/science/article/pii/S0005789484800027

NHST was adopted so quickly but looks like it will be so slow to die. Why this asymmetry? If the right person gets in charge of NIH and refuses to fund any NHST-based studies, will it die within a few years?

Why the asymmetry? incentives. NHST lets people get money without doing the work.

I think that in practice, “clinical significance” is often concluded if you get a treatment effect point estimate that is big enough (over your threshold for significance, wherever that comes from). So it isn’t necessarily the result of a test.

Andrew, why not follow Deborah Mayo & Aris Spanos and use severity testing here? Your correspondent can then pretty much keep doing what he or she has been doing but just re-interpret it: e.g. http://www.phil.vt.edu/dmayo/pubs/Mayo_Spanos_2006_Severe_testing_Basic_Concept_NP_indcution.pdf I personally prefer this to assigning probabilities to hypotheses. It’s also pretty close to testing another hypothesis.

AG won’t do this because Mayo’s formalization of severity (her SEV function) doesn’t permit shrinkage to enter into an analysis in the manner to which AG is accustomed (i.e., through the prior). (It is also, as far as I can tell, not applicable in even moderately complicated statistical models because it requires a total order on both the parameter space and the test statistic sample space.)

Hi Corey thanks for your response. Can you perhaps give me a concrete example? I personally don’t use nor have I encountered such complicated models in my studies / research. Would be happy to learn more.

In practice the requirement for a total order on parameter space probably excludes any actual real world models using more than 1 parameter.

You can get a total order on 2-parameter models by defining something like

(a,b) < (c,d) if a < c, or if a = c and b < d (i.e. lexicographic ordering / tie-breaking)

or something like (a,b) < (c,d) if sqrt(a^2+b^2) < sqrt(c^2+d^2), or the like…

but suppose you have a model of say the population density of frogs, and the radiation intensity of the sun, and you’re using those to predict say the survival of some plant

what’s the total order on the parameters Frogs,Sun that makes sense in your model?

and this is just 2 parameters.

Daniel, thank you for your answer. I think I understand the total order bit. But what I don’t understand is how that is a problem for using Mayo & Spanos’s notion of severity. I’ve seen it being used in a multiple regression: e.g. https://pdfs.semanticscholar.org/8622/60b6319801402cab2fb38f146da8f2a7d1ba.pdf section 4.2

It seems pretty straightforward to me how the sampling distribution is used. But perhaps I am missing something?

I don’t know why Corey thinks a total order is required, you’ll have to ask him. I’ve been ignoring Mayo’s Severity stuff because I already have Cox’s theorem and it gives me a proof that Bayes does a certain thing in a fully consistent way, and I like that thing that it does.

Toby, normal linear models can be considered simple for these purposes. Suppose Spanos’s model was non-linear, the data were inadequate for approximate normality of the sampling distribution (you can check this with the bootstrap), and the regions of substantive significance were not nice flat sub-manifolds of parameter space. How then can the SEV function for substantive parameter values be defined? Maybe I’m just suffering from a failure of imagination…

Those all-too-convenient normal linear model assumptions are perhaps the epitome of non-failure of imagination ;-)

The critic is not required to show they can’t do something when the proponent has yet to show that they can.

p.s. Enjoy the Canada day long weekend.

Corey, thanks for the response.

I don’t see why the sampling distribution has to be approximately normal for the severity function to be defined. The way I understood it is that the probability of a test statistic being more extreme (or equal and less extreme) than the one observed can be computed using the sampling distribution under the assumption that a hypothesis is true or false. The shape of the sampling distribution seems irrelevant for that. As long as it exists, then you can compute this probability. The same holds for the regions of substantive significance. It seems like a fairly general approach.