All statistical conclusions require assumptions.

Mark Palko points us to this 2009 article by Itzhak Gilboa, Andrew Postlewaite, and David Schmeidler, which begins:

This note argues that, under some circumstances, it is more rational not to behave in accordance with a Bayesian prior than to do so. The starting point is that in the absence of information, choosing a prior is arbitrary. If the prior is to have meaningful implications, it is more rational to admit that one does not have sufficient information to generate a prior than to pretend that one does. This suggests a view of rationality that requires a compromise between internal coherence and justification, similarly to compromises that appear in moral dilemmas. Finally, it is argued that Savage’s axioms are more compelling when applied to a naturally given state space than to an analytically constructed one; in the latter case, it may be more rational to violate the axioms than to be Bayesian.

The paper expresses various misconceptions, for example the statement that the Bayesian approach requires a “subjective belief.” All statistical conclusions require assumptions, and a Bayesian prior distribution can be as subjective or un-subjective as any other assumption in the model. For example, I don’t recall seeing textbooks on statistical methods referring to the subjective belief underlying logistic regression or the Poisson distribution; I guess if you assume a model but you don’t use the word “Bayes,” then assumptions are just assumptions.

More generally, it seems obvious to me that no statistical method will work best under all circumstances, hence I have no disagreement whatsoever with the opening sentence quoted above. I can’t quite see why they need 12 pages to make this argument, but whatever.

P.S. Also relevant is this discussion from a few years ago: The fallacy of the excluded middle—statistical philosophy edition.

38 thoughts on “All statistical conclusions require assumptions.”

  1. “I guess if you assume a model but you don’t use the word “Bayes,” then assumptions are just assumptions.”

    My opinions on that:

    -I believe a logistic regression is logistic regression for you, for me, for grandma, for anyone. But priors, on the other hand, are not as agreed upon or ‘tried and true’ and can be just about anything for anybody, even improper. For example, with your golf model work: the logistic regression is standard and agreed upon, but your geometric model (which is very sensible IMO) is just one of many possible models, all with their own assumptions and beliefs, a possibly infinite choice of models, each with its own fit to the data.

    -I’m wondering where the term “subjective Bayes” comes from? Bayesians themselves, or frequentists calling it that?

    -There is also something about the number of assumptions and Occam’s Razor-type arguments. If we let E stand for the event, H1 for one hypothesis, and H2 for the other, then Occam’s Razor is: if hypotheses H1(m) and H2(n), with m and n assumptions respectively, explain event E equally well, choose H1 as the best working hypothesis if m < n. Does Bayes in general, with its priors and hyperparameters, use more assumptions than frequentism? I'd say yes, but others may say no.

    -if "likelihood swamps prior" holds, especially for any prior, IMO it could be argued that priors do have less objective status

    Justin
    http://www.statisticool.com

    • >logistic regression is standard and agreed upon

      The use of the logistic curve specifically is pretty much meaningless. You could use *any CDF*; it’s really just a mathematical tool to constrain the output of some other function to the range [0,1] (quick sketch at the end of this comment).

      So, yes, that’s standard, and meaningless. The *real assumptions* are just that we need a function that’s constrained to [0,1], and then… everything else… like that you should use distance from the hole, that it should enter linearly, and be connected to the precision of the angle…

      or maybe not… maybe we should predict success using the height and weight of the golfers, and the temperature, wind speed, and rainfall on the day the shot is taken… or whatever..

      These assumptions about how the world works are *HUGE* and they come *BEFORE* any prior, because they determine what the parameters are… this is why you can’t interpret the prior except in the context of the likelihood, or a better term would be the predictive model.
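
      To make the “any CDF” point concrete, here’s a rough sketch in Python (the distances and success counts are invented, and scipy’s optimizer is just standing in for whatever fitting routine you like): the logistic and normal CDFs give essentially the same fitted curve, so the substantive assumptions are the choice of predictor and how it enters, not the particular sigmoid.

      # Sketch with invented data, not the actual golf model: fit success vs
      # distance using two different CDFs as the link.
      import numpy as np
      from scipy.optimize import minimize
      from scipy.special import expit   # the logistic CDF
      from scipy.stats import norm      # norm.cdf gives the probit link

      distance = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 15.0])  # feet (made up)
      tries    = np.array([100, 100, 100, 100, 100, 100])
      made     = np.array([ 93,  80,  66,  55,  45,  30])     # made-up counts

      def neg_log_lik(params, cdf):
          a, b = params
          p = np.clip(cdf(a + b * distance), 1e-12, 1 - 1e-12)  # any CDF maps R to [0, 1]
          return -np.sum(made * np.log(p) + (tries - made) * np.log(1 - p))

      logit_fit  = minimize(neg_log_lik, x0=[0.0, 0.0], args=(expit,))
      probit_fit = minimize(neg_log_lik, x0=[0.0, 0.0], args=(norm.cdf,))

      p_logit  = expit(logit_fit.x[0] + logit_fit.x[1] * distance)
      p_probit = norm.cdf(probit_fit.x[0] + probit_fit.x[1] * distance)
      print("max difference in fitted probabilities:", np.max(np.abs(p_logit - p_probit)))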

      • I agree, re: being able to use other CDFs. I think I’ve used probit (instead of standard logit) before. I just think that logistic is more a ‘go to’, just like a Bernoulli(.5) is a ‘go to’ when talking about a coin flip.
        I’d be worried about finding a model that has too good of a fit, like overfitting. It seems like that would always be relatively simple to do.

        Justin
        http://www.statisticool.com

        • So, I think one argument that flows naturally from Andrew’s general viewpoint would support using “standard” methods. At the same time, I have a strong intuition that he would reject this argument, and no idea why. The basic idea is that embracing “standard” or “agreed upon” methods is a solution to the garden of forking paths.

          If everyone is forced to run the “standard” technique, then you no longer have forking paths. It’s arguably a second-best solution, but if a certain research area has agreed that when studying some DV you run a logit with particular controls, then that has an important disciplining function.

          On the other hand, if the standard is to run any reasonable/defensible model, or the model you think is best, then there are a lot of forks in the path.

        • Joe:

          It seems to me that most of the forking paths come from decisions about how to code the data, what data to include, how to measure key variables, what information to include in the model, etc. Sometimes this is standardized (for example, there are standard scales used to measure psychological conditions such as depression), but in research settings there are no clear standard choices for these decisions, and any standard choices would create lots of other problems. For example, it seems that some researchers in evolutionary psychology were setting an informal standard by which days 6-14 of the monthly cycle were defined as “peak fertility,” but it turned out this contradicted general public health recommendations, which put peak fertility at days 10-17. To take another example, you could standardize your regressions to adjust for a small set of predictors, but then there will be concern about systematic differences between the treatment and control groups.

          To put it another way, I think forks are unavoidable, and the solution is not to set things up so forks can’t happen, but rather to embrace the multiverse.

  2. The thing about priors is that there are several interpretations around for what a prior actually means. True, it doesn’t have to be a subjective belief, but it could be, and if it isn’t, then without further explanation (and that explanation is often not given) it isn’t clear what the assumption actually means.
    It may not even be something for which “assumption” is the right word; priors may be used in order to construct worst-case (or other) scenarios that are not “assumed to be true” in any sense but may be worth exploring anyway. Or they could have been chosen for their desired effect (e.g., regularisation) rather than for the “information” that they encode.

    The meaning of assumptions in frequentist statistics is not exactly beyond controversy either, but it still feels like a piece of cake compared to a Bayesian prior thrown at you without further elaboration.

    • I think the prior should first and foremost be interpreted as playing a functional role. Its specific meaning is determined by its use (or intended use) in a given context. That is, given a parametric statistical model and given a certain goal, there will be a subset of values of the parameters that best meet the goal, and the functional role of the prior (and the likelihood for that matter) is to help identify those values. For example, if the model contains parameters phi1, phi2, phi3, etc. the goal may be to infer the true value of phi1 (assuming phi1 represents a real physical quantity), in which case the role of the prior is to optimize inferences about phi1. Or the goal may be to infer the true value of phi2, in which case the role of the prior is to optimize inferences about phi2. Alternatively, the goal may not be to infer the true value of any parameter in the model, but rather to optimize predictive accuracy of some sort. Since there are different sorts of predictive accuracy (e.g. accuracy with respect to typical events or rare events), and given that there are different ways of measuring predictive accuracy, the prior will play a somewhat different functional role — and hence have a somewhat different meaning — in each case.

      Background information enters the picture only because the prior can typically only perform its role well if it’s constructed on the basis of background information that’s approximately correct (of course, the same goes for the likelihood). For example, if the model is a linear regression model, the goal of the analysis is predictive accuracy, and the functional role of the prior is therefore to regularize the posterior distributions of the slope parameters, then the prior will only successfully regularize if it’s constructed on the basis of reasonably correct information about how the dependent variable is likely to respond to changes in the independent variables. Hence the prior will only perform well if it’s based on decent information. But that doesn’t mean the primary goal of the prior is to encode that information.

      I think conceiving of the prior as playing a functional role is better than thinking of it as encoding beliefs or background information because the functional interpretation makes clear that: (1) a prior can be better or worse given a particular goal, (2) the prior, like the likelihood, can be tested and can be changed if it doesn’t perform well, and (3) the prior that’s used should ideally be dependent on both the goal of the analysis and relevant background information.

      • +1 to all of this. The prior can’t be understood except in the context of why you’re building the model, why you make the choices you do about the predictive portion of the model, and what you care to achieve with the model.

        That said, I don’t think the prior is purely a knob you can tune, say, via posterior predictive out-of-sample accuracy or something like that; it should represent information you want to use that is reasonable without having seen the data. It shouldn’t be post-data tuned algorithmically (though I think it can be post-data tuned to incorporate information you didn’t realize it failed to have… there’s a difference, but it’s subtle).

        • This comment makes me wonder about all the machine learning models. On one hand, they make no assumptions about functional forms, so they appear to avoid those assumptions that underlie “standard” statistical approaches. They usually don’t incorporate any prior information, so they don’t appear to be Bayesian. But they do have numerous levers (things such as the number of trees in a random forest, minimum split sizes, learning rates, the number of layers in a neural net, etc.), and these levers are usually applied after seeing the data. In fact, that is how validation works – it is used to fine-tune the parameters. It is true that there is often (usually) a test data set not used to build or calibrate the models, but used instead to assess their performance. But it would seem that the forked paths are built into machine learning methods (a feature rather than a bug, some might say). The only difference is that the machine is choosing the path, not the human.

        • Indeed. It’s standard even with lasso to tune the penalty parameter. Given that lasso is equivalent to Bayesian MAP estimation with a double-exponential prior, this is analogous to adjusting the prior after seeing the data. The idea is to use leave-one-out cross-validation or something like that during tuning, and then to assess based on a fully held-out dataset. It creates an interesting blur between the goals of inference and prediction…
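
          As a rough sketch of that tuning loop (synthetic data; sklearn’s LassoCV is one standard implementation of it):

          # Synthetic-data sketch: pick the lasso penalty by cross-validation,
          # then assess on held-out data. Under the MAP reading, this amounts to
          # tuning the scale of the double-exponential (Laplace) prior post data.
          import numpy as np
          from sklearn.linear_model import LassoCV
          from sklearn.model_selection import train_test_split

          rng = np.random.default_rng(0)
          n, p = 200, 20
          X = rng.normal(size=(n, p))
          beta = np.zeros(p)
          beta[:3] = [2.0, -1.0, 0.5]                      # mostly-sparse "truth"
          y = X @ beta + rng.normal(size=n)

          X_fit, X_test, y_fit, y_test = train_test_split(X, y, random_state=0)
          model = LassoCV(cv=5).fit(X_fit, y_fit)          # penalty chosen by 5-fold CV
          print("chosen penalty alpha:", model.alpha_)
          print("held-out R^2:", model.score(X_test, y_test))
          # under sklearn's parameterization, alpha corresponds roughly to a
          # Laplace prior scale of sigma^2 / (n * alpha) in the MAP equivalence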

        • That’s a fantastic paper. But I have one quibble about the terminology you use, which I think could potentially cause confusion for certain people. You say that the prior can often only be understood in conjunction with the “likelihood,” but what it seems you actually mean is that the prior can often only be understood in conjunction with the sampling distribution (or the set of sampling distributions — i.e., the statistical model). For purposes of the frequentism vs bayesianism debate, for example, that’s an important difference, since the “likelihood” is typically understood to be p(x|theta) for fixed x and varying theta. It wouldn’t really make sense to have the probability assigned to theta depend on the likelihood function in that sense of the word. What you’re saying, however, is that the probability assigned to a particular value of theta, theta*, should depend on the function p(x|theta*) for varying values of x, i.e. it should depend on the sampling distribution for theta*. That’s very sensible, but it’s a sharp break with the kind of “orthodox Bayesianism” (often assumed in the frequentism vs Bayesianism debate) that says that any possible data aside from the data actually seen should be irrelevant for parameter inference.

        • p(x|theta*) for varying x is NOT the sampling distribution of theta*

          In order to have a sampling distribution, the quantity in question must be a statistic, meaning a function of the data: for example, the average of the data points, or the RMS error of the data from the sample average, or similar.

          theta* is NOT a statistic of the data that you can observe; it’s an unobserved parameter that changes what you expect the data to look like. The mapping between the parameter value and the probability you assign to the data is often called the likelihood, even though until you plug in fixed data it should probably just be called something like the conditional probability of the prediction, or some such thing.

        • You’re right of course, and I didn’t mean to suggest that theta* is a statistic and certainly not that p(x|theta*) is a sampling distribution *of* theta*. I just meant that p(x|theta*) is the distribution on the sample space that’s associated with theta*. That’s what I meant when I said “the sampling distribution for theta*.” But as you point out “sampling distribution” already means something else, so it’s a bad idea to call p(x|theta*) a sampling distribution. But it’s also confusing to call it a likelihood function, as that also means something else. I guess p(x|theta*) for varying x and fixed theta* doesn’t have a standard name?

        • It’s sometimes called the likelihood informally, which is the usage Andrew seems to have adopted. I guess there isn’t a standard name; once you bring in the prior and average over it, it’s called the prior predictive distribution.
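
          To make that operational, here’s a tiny simulation sketch in a toy normal-normal setup I’m making up (theta ~ N(0, 2), x | theta ~ N(theta, 1)): draw theta from the prior, then x from p(x | theta), and marginally those x draws are the prior predictive distribution.

          # Toy normal-normal sketch of the prior predictive by simulation.
          import numpy as np

          rng = np.random.default_rng(1)
          theta = rng.normal(loc=0.0, scale=2.0, size=100_000)  # draws from the prior
          x = rng.normal(loc=theta, scale=1.0)                  # draws from p(x | theta)

          # marginally, x follows the prior predictive, here N(0, sqrt(2^2 + 1^2))
          print("simulated prior predictive sd:", x.std(), " analytic:", np.sqrt(5.0))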

        • > p(x|theta*) for varying x and fixed theta* doesn’t have a standard name?
          Probability distribution for x, mass if x discrete and density if x continuous???

          (For an exceptionally well done overview see https://betanalpha.github.io/assets/case_studies/probability_theory.html#4_representing_probability_distributions_with_densities )

          The reason for the ??? is I had a back-and-forth email discussion with someone from this blog who kept insisting p(x|theta*) was the likelihood even after I gave them references, including Wikipedia’s…

          So maybe something is afoot here?

          Keith, “The probability distribution for x (associated with/conditional on theta*)” isn’t exactly a name that rolls off the tongue. (Isn’t it more of a description than a name?)

          In any case, I agree that it’s important to distinguish p(x|theta*) from the likelihood. Conceptually they are different, and — more importantly — they play a very different role in Bayesian analysis. The likelihood function is used to FIT the model, whereas p(x|theta*) is used to TEST the model (through its use in prior and posterior predictive testing). Obviously there is a tight mathematical relationship between the likelihood and p(x|theta*), but in a way that’s coincidental. One can imagine a more general framework (that is, more general than Bayesianism) in which the fitting function and the testing function are mathematically more distinct.

        • I think what’s going on is that people are thinking about writing out formal models in computing languages like Stan and BUGS and JAGS and so forth. In these computing languages, you write down a generative model in terms of p(Data | parameters) and p(parameters); the first part is called the likelihood, and as soon as you plug in the data it is a likelihood, but it’s still thought of as the “likelihood” even if you’re doing something like a prior predictive analysis, so generating ‘fake’ data instead of plugging in the values of real data.

          It’s probably better to distinguish fitting vs prediction, but in Bayesian analysis it really is the same mathematical form used in both cases.
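
          For instance, here’s a little sketch of that “same form, two uses” point with an invented Poisson example: evaluate p(x | theta) at the fixed observed data while varying theta to fit, then fix theta and simulate new data to check predictions.

          # Invented Poisson example: the same p(x | theta), used two ways.
          import numpy as np
          from scipy.stats import poisson

          observed = np.array([3, 5, 4, 6, 2])              # made-up counts

          # use 1 (fitting): fix the observed data, vary theta -> the likelihood
          thetas = np.linspace(0.1, 10.0, 500)
          log_lik = np.array([poisson.logpmf(observed, mu=t).sum() for t in thetas])
          theta_hat = thetas[np.argmax(log_lik)]
          print("theta with the highest likelihood:", theta_hat)

          # use 2 (predicting/testing): fix theta, vary the data -> simulate replicates
          rng = np.random.default_rng(2)
          replicates = rng.poisson(lam=theta_hat, size=(1000, len(observed)))
          print("observed max:", observed.max(),
                " typical simulated max:", np.median(replicates.max(axis=1)))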

  3. Right, and also, if I just give a prior, it doesn’t tell you whether it is my subjective belief or a prior from an experiment. I’d argue the latter would be more evidence, but just typing out a prior doesn’t tell you which.

    • As I see it, the point is that the analyst should explain *why* they chose the prior. This is just part of the general desideratum for transparency — giving your reasons; not just reporting “What you did”, but “Why you did it” — which often involves discussing other possible options.

      And, speaking of options, it seems to me like good practice to report (when it fits) something like, “We find plausible arguments for using each of these two priors, so we have done the analysis using each. Here are the results of the analyses… Here is discussion of how the results are similar and how they are different… Here is our final conclusion based on considering the two analyses, each weighed according to its plausibility…”

  4. Agreeing wholeheartedly with Andrew here. The alternative to “good priors” isn’t “no priors”, it’s “bad priors” that you don’t understand the implications of – which (to connect to some of the comments above re the opaqueness of priors) is why Andrew’s and others’ work to understand the implications of different prior choices in the same way as we try to understand the implications of different choices about the likelihood is so important.

  5. Under the Bayesian paradigm, the choice of prior can be thrown together with the sampling model and be called the “statistical model”. However, under any paradigm that allows inferences to be made without a prior being chosen, the choice of prior is indeed an extra assumption on top of the choice of the statistical model (= sampling model).

    No interesting cross-paradigm debate on this matter can be achieved unless this point is conceded. The debate ends before it can begin.

    • Naked:

      There’s no reason that you need a sampling model or a prior distribution to do statistics. In real problems, the data do not arise from a probabilistic sampling process, and parameters are not drawn from a probability distribution. These probability models are mathematical constructs that allow us to perform estimation and assess uncertainty. From there, what you want to call “an extra assumption” is a matter of choice. From one perspective, a model with fewer parameters represents fewer assumptions. From another perspective, a model with fewer parameters represents additional assumptions, as it corresponds to the missing parameters being set to zero. Ultimately, I think statistical methods have to be judged on how they perform, not on misguided and vague notions of subjectivity and objectivity; see here for more on that topic.

      • I couldn’t agree more, and I think this is missed in Mayo’s treatment. These things are mathematical constructs that serve some purpose… to me the purpose of Bayesian models is to operationalize logical assumptions about the relative size of quantities that obtain in the world, and then to compute the logical consequences of what would be expected if the world were as-if the model.

        And the purpose of frequentist models is to operationalize *repeated frequency* assumptions about how often quantities would be a certain size in repeated application of some kind of data collection/experimentation, and to compute the frequency consequences of what the world would act like if the world were as-if the model.

        In the first case, we get true facts about logical consequences flowing from assumptions.

        In the second case we get true facts about how often things would happen in some simplistic world which may or may not be like our real world.

        A) The need in the Bayesian case is to convince me that your relative size assumptions are something I should approximately agree with.

        B) The need in the Frequency case is to convince me that your frequency assumptions are close to correct in the actual world we live in by comparing them to data actually collected.

        The kind of information you need to convey to me is very different in A vs. B. The kind of information needed in B is detailed data about every single distributional assumption you make, comparing it to distributions of actual data collected.

        In A, we need merely that entertaining values in certain regions of space doesn’t have logical consequences that are clearly wrong, such as that the mass density of air pollution exceeds the internal density of neutron stars… etc

        To me, the kind of data collection we need to validate frequencies in B exceeds the information required to simply answer the more limited questions we actually care about by often a couple orders of magnitude. If you want to do severe testing of statistical models you will have to either stick to research questions where you have mathematical attractor distributions, so that only very simple assumptions need to be checked (bounded measurements for example) or multiply the research budget by a factor of 100x over whatever it is currently so you can fit nonparametric distributions to detailed distributional data at every level of slicing and dicing you need to do.

        • Daniel: You make prior choice look so easy and smooth. If it were indeed like that, much of Andrew’s work on priors wouldn’t be needed. What you write here makes sense as long as nobody comes up with a slightly different model that seems to innocently encode approximately the same “relative size assumptions” a priori but somehow produces a quite different posterior after data…

        • Yep, if the posterior is somehow sensitively dependent on the prior, things are problematic, like if

          y ~ mydistribution(a,b,c)

          produces dramatically different inference from

          y ~ mydistribution(a*(1+eps1),b*(1+eps2),c*(1+eps3))

          for some small epsilons… hey that’d be good to know!
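
          Here’s roughly what that kind of perturbation check looks like in a toy conjugate normal model (all numbers invented), where the posterior is available in closed form so the comparison is immediate:

          # Toy check (made-up numbers): normal data with known sigma = 1 and a
          # normal prior on the mean; nudge the prior sd and watch the posterior.
          import numpy as np

          rng = np.random.default_rng(3)
          y = rng.normal(loc=1.5, scale=1.0, size=20)       # invented data

          def posterior(prior_mean, prior_sd, y, sigma=1.0):
              prec = 1.0 / prior_sd**2 + len(y) / sigma**2
              mean = (prior_mean / prior_sd**2 + y.sum() / sigma**2) / prec
              return mean, np.sqrt(1.0 / prec)

          for eps in [0.0, 0.1, -0.1]:
              prior_sd = 2.0 * (1 + eps)
              m, s = posterior(0.0, prior_sd, y)
              print(f"prior sd {prior_sd:.2f} -> posterior mean {m:.3f}, sd {s:.3f}")
          # here the small perturbations barely move the posterior; if they did,
          # that would indeed be good to know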

          To me, though, the choice of prior is rarely the real issue; I spend the vast majority of my time working on the model of the process, not least because until I have such a model, the set of parameters over which a prior is necessary isn’t specified yet.

        • Daniel: I believe there are categorical errors in the distinctions you make between your A and B.

          First, the *repeated frequency* lives in hypothetical repetitions in a fully abstract model (it is all math: even if a given sample is treated as the population, the *true* bootstrap distribution is still an abstract math quantity, as with *random*, *independent* samples).

          Second, in both A and B it is the fidelity between the abstract math representations and the reality beyond our direct access that matters, if we are to avoid being frustrated by that reality when we act. Lack of fidelity in both comes from ignorance of the parts of that reality that mattered – not from whether you, I, or any other particular individual agrees with the arguments.

        • Yes, lack of fidelity is the issue; the arguments are how we try to mitigate it. My problem with frequentist models usually comes from the lack of a good reason to believe that the world acts like an idealized roulette wheel of a certain type; when presented with a good reason to believe that, I have no issue with frequency-based analysis. This is kind of the converse of the usual frequency-based objection to a prior: when presented with evidence that the prior corresponds to previously ascertained frequency distributions, the math of Bayes is usually accepted by dyed-in-the-wool frequentists. To me it’s usually a concern the other way around: show me the evidence or argument for why a frequency analysis holds… usually it’s only asymptotics and large sample sizes that could convince me.

        • Daniel, something I’ve been interested in for a while is compiling a list of comparisons between Bayes and frequentist (well, frequentist max likelihood at any rate) inference ordered by model complexity (and perhaps also data size). My experience is that the asymptotics you refer to often kick in surprisingly quickly *for simple models*. What is needed is intuition for when that goes awry. Then, where it does go off the rails, intuition for when/how full Bayes with intelligently specified priors helps (or sometimes hinders).
          A simple example of hindering is, say, a Beta(2,2) prior on a Binomial probability. It has a nice regularization towards 0.5 – but what if the “true” value is really small or really large? Well, what happens is that the MSE is larger in that case. So we need to be clear that our prior is favoring values around 0.5, and we could justify that for various reasons, but that is another layer of modeling sophistication.
          Anyhow, more abstract discussion on these issues doesn’t seem to help advance the conversation, and so I think a series of relatively clear examples might be better.
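
          Here’s the quick-and-dirty simulation version of that Beta(2,2) example (not a careful study, just to make the MSE comparison concrete):

          # Posterior mean under a Beta(2, 2) prior vs the MLE for a
          # Binomial(n = 20) probability, at a middling and an extreme true p.
          import numpy as np

          rng = np.random.default_rng(4)
          n, reps = 20, 100_000

          for true_p in [0.5, 0.02]:
              x = rng.binomial(n, true_p, size=reps)
              mle = x / n
              bayes = (x + 2) / (n + 4)                     # posterior mean, Beta(2, 2) prior
              print(f"true p = {true_p}:",
                    f"MSE(MLE) = {np.mean((mle - true_p) ** 2):.5f},",
                    f"MSE(posterior mean) = {np.mean((bayes - true_p) ** 2):.5f}")
          # near p = 0.5 the shrinkage toward 0.5 helps; at p = 0.02 it hurts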

        • Chris, I’d be interested to work on something like that; send me an email. I do think asymptotics work well in simple cases, but imagine how you’d do it for the map of political tolerance, for example… it’d be messy and noisy. Most counties have zero data points…

          If you do inference by tail probabilities, the asymptotics work less well; tails don’t fit until you have 50-100 data points, and so forth.

      • Professor Andrew:

        It is true, of course, that you do not need a sampling model to do statistics, but one is usually required to do formal inference. Given that the sampling model is generally just a pure assumption, we can think of the inferential process as having two parts:

        1) Inference under the assumption that the model is true.
        2) Allowing for model uncertainty in some (probably messy) way. What we really should do is perhaps reanalyse the data various times using a “reasonably diverse set” of “plausible” models, but doing that can often only be a dream.

        Can part (1) be skipped? Possibly but that may not be a good idea.

        It is in part (1) where the assumption of a prior needs to be compared with the alternative assumptions made in other statistical paradigms.

        Thanks for the link. I hope the subjective-objective dichotomy or continuum will be discussed again on your blog.

        Thanks also for calling me “Naked”, it is incidentally my preferred name shortening.

  6. Whoa. Whoa. I type “Savage axioms” into search and this is what I get? The foundation for the application of Bayesian decision theory to psychology, economics, and from there, the world, thanks to Thaler and Sunstein?
    But you are a whiz at taking mathematical gibberish and converting it to digestible algebra, calculus, probability theory, etc., so I have tried to understand Savage’s theorem in simple language. I will not tell you the number of years, because then you math whizzes could deduce my minimum age.
    In decision theory, subjective expected utility is the attractiveness of an economic opportunity as perceived by a decision-maker in the presence of risk. Characterizing the behavior of decision-makers as using subjective expected utility was promoted and axiomatized by L. J. Savage in 1954[1][2] following previous work by Ramsey and von Neumann.[3] The theory of subjective expected utility combines two subjective concepts: first, a personal utility function, and second a personal probability distribution (usually based on Bayesian probability theory).
    https://en.wikipedia.org/wiki/Subjective_expected_utility
