I suppose both are similar: consider how often evidence would occur under competing challenge (‘hypothesis’ if you must). Eg how often evidence would occur when found by cop with history of planting and eg how often evidence would occur when crime committed by person with same genetic signature.

End of thinking out loud on other people’s blogs…for now

]]>I’m curious whether/how this sort of approach would work in a context where ‘evidence’ is central eg a legal trial, however.

I suppose the same sort of idea applies – an ‘evidence’ statistic (or statistics) and stability of this evidence statistic to challenges- but I’m not too sure how one would provide guidelines on measuring ‘evidence’ in the first place.

Example: murder weapon found in bedroom. Clearly ‘evidence’. Stability challenge: was found by cop with history of planting evidence.

Less clear: genetic evidence matches to high but not certain probability. Clearly ‘evidence’. But only versus those who lack same genetic signature. None vs someone who also matches. Is this a question of evidence measure or of stability or both or?

]]>This is another reason I’ve moved more towards the idea of

a) compute some summaries of interest eg a straight line summary

b) think about how these vary under sensible ‘challenges’ eg dropping or resampling observations etc
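A minimal sketch of steps a) and b), assuming a leave-one-out challenge and made-up data (the vectors `x` and `y` and the helper `slope_on` are hypothetical, not from the thread):

```r
# Hypothetical data lying close to the line y = 2x.
x <- 1:8
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)

# a) summary of interest: the least-squares slope on a subset of rows
slope_on <- function(idx) unname(coef(lm(y[idx] ~ x[idx]))[2])
full_slope <- slope_on(seq_along(x))

# b) challenge: drop each observation in turn and recompute the summary
loo_slopes <- sapply(seq_along(x), function(i) slope_on(-i))

# how far the summary moves under the challenge
max_shift <- max(abs(loo_slopes - full_slope))
print(round(c(full = full_slope, max_shift = max_shift), 4))
```

Resampling rows with `sample(seq_along(x), replace = TRUE)` in place of `-i` would give a bootstrap version of the same challenge.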

]]>Thanks!

]]>One example is linear regression with normality assumption and leverage outlier(s). Leverage outliers will usually have small residuals so looking at residuals (i.e. y|T, T regression parameter estimators and maybe variance) will not successfully find them.

One can probably argue that in this case the linearity plus normality assumption for residuals elsewhere technically will be violated, so for n to infinity residual analysis will tell you that the model is wrong (though it will not highlight the outlier(s) as source), but for moderate n one can easily jot down datasets in which looking at all information the leverage outlier is crystal clear but the residuals on their own won’t look suspicious at all.
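A rough sketch of the masking mechanism, with made-up numbers (the data below are hypothetical and not the example Christian has in mind). One point sits far out in x, pulls the fitted line towards itself, and so ends up with a raw residual smaller than several of the ordinary points. In this toy data set the remaining residuals do show a trend, so it only illustrates why the outlier's own residual is uninformative.

```r
# Hypothetical data: nine points near y = x plus one point far out in x.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 40)
y <- c(1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2, 8.8, 10)

fit <- lm(y ~ x)
res <- residuals(fit)   # raw residuals
lev <- hatvalues(fit)   # leverage (diagonal of the hat matrix)

# The tenth point dominates the leverage, yet its residual is not the largest.
print(round(cbind(x, residual = res, leverage = lev), 3))
```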

Note that I don’t have time to write anything down in detail right now so you’ve got to live with what I get from my memory on the fly. I made a number of stupid mistakes with integrals in my lecture today so you shouldn’t necessarily trust me. ;-)

]]>I should say, putting my Bayes-ish hat on that the setup should be

a) T is minimal sufficient for the model family indexed by theta

b) y|T is used to judge the model

I’m interested in cases where b) fails while a) holds.

]]>I wouldn’t be surprised, but I would still be interested to see them!

BTW I then wonder – In what sense is the model judged ‘wrong’?

Eg is it that your statistics T clearly don’t capture everything you are ‘interested in’ (model independent judgement) but these don’t ‘show up’ in the conditional distribution? Or?

]]>ojm: If I remember it correctly, it’s not too difficult to construct examples in which the model is clearly wrong but y|T won’t show it.

]]>Christian,

True but I _think_ part of the idea is that if y|T looks ‘suspect’ then the model and hence factorisation is suspect.

But all this is with my Bayesian hat on – more generally I agree that it is probably best to separate this stuff from modelling assumptions even further.

]]>“In frequentist statistical theory, inference about parameters depends on the data only through the minimal sufficient statistic and, what is left over in the data (the residual), is available for model checking. Mixing these up would seem to correspond to an inappropriate statistical analysis.”

Interesting… but it’s part of the model assumptions that the factorisation is possible in this way and one may want to check this, too, for which more than the “leftover” would be needed.

“Common sense” is often very subjective — i.e., can vary from person to person.

]]>> Don’t we want a resolution to whole double use of data in PPC that isn’t so restrictive?

Yes but this is a case where the answer seems clear. We can then look at how to get _approximations_ to this.

Some quotes from the first Mike Evans paper I posted:

> It is our claim that effectively (1) [OJM: basically, factoring the distribution as above] shows us how to proceed to avoid double use of the information and, as such, avoid double use of the data. Of course, as mentioned in the paper, it may be difficult, with complicated models to determine [p(y|T) etc] in meaningful ways. According, it seems reasonable to weaken this requirement in such contexts to having this hold asymptotically in some sense…

also

> In frequentist statistical theory, inference about parameters depends on the data only through the minimal sufficient statistic and, what is left over in the data (the residual), is available for model checking. Mixing these up would seem to correspond to an inappropriate statistical analysis. We believe this is equally applicable in Bayesian formulations….Of course, this restriction could be weakened ….only satisfy (1) in some asymptotic sense. The motivation for this would seem to arise from the complexity of some situations. Still, (1) can be implemented exactly with many models of considerable importance, so it isn’t just of theoretical relevance.

He also discusses checking priors and checking hierarchical models.

]]>ojm, your point about the non-existence of non-trivial minimal sufficient statistics is, I think, where I was going. I’ll check out Evans’ paper. But isn’t this a huge handicap for modeling? Don’t we want a resolution to whole double use of data in PPC that isn’t so restrictive?

]]>Fisher-constant -> Fisher-consistent

veiw -> view

etc -> etc

Check out Mike Evans’ papers below, perhaps?

> ojm, I don’t fully follow your distinction btw ‘sufficient statistics’ and ‘parameters of a model’.

A sufficient statistic is a statistic (function of the data), while a parameter is a…parameter? E.g. for a parameter defined via Fisher-constant functional it is the value of the statistic evaluated on the ‘full’ model rather than just the observed distribution.

For a sufficient statistic T(y) and model p(y;theta) we have

p(y;theta) = p(y|T(y))p(T(y);theta)

by definition of sufficient statistic.

In particular, the term p(y|T(y)) is independent of theta by definition.
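As a small worked instance (my choice of example, not one taken from the thread), take n independent Bernoulli(theta) observations with T(y) the number of successes. The factorisation then reads:

```latex
\begin{align*}
p(y;\theta) &= \theta^{T(y)} (1-\theta)^{n - T(y)}, \\
p(T(y);\theta) &= \binom{n}{T(y)}\, \theta^{T(y)} (1-\theta)^{n - T(y)}, \\
p(y \mid T(y)) &= \frac{p(y;\theta)}{p(T(y);\theta)} = \binom{n}{T(y)}^{-1}.
\end{align*}
```

The last line is uniform over all arrangements with the same number of successes and contains no theta, which is the sense in which it is left over for checking.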

In Bayes the data only enters parameter estimation via the likelihood function which is _only defined up to a data-dependent constant_. This is the important point, and noted by both ‘Frequentists’ (e.g. David Cox) and ‘Bayesians’ (e.g. Mike Evans).

Any purely data-dependent factor such as p(y|T(y)) drops out of estimating theta. So this is ‘unused’ info in this sense.

Equivalently, you can take the set of likelihood ratios or normalised likelihood function as the minimal sufficient statistic. Again the data-dependent constant component factor drops out. E.g.

[p(y|T(y))p(T(y);theta_1)]/[p(y|T(y))p(T(y);theta_2)] = p(T(y);theta_1)/p(T(y);theta_2)

in which p(y|T(y)) drops out of the comparisons. You can also write this all out in terms of a Bayesian posterior – again, something like p(y|T(y)) drops out.
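A quick numerical illustration of the cancellation, assuming a normal model with known sd = 1 so that T(y) = sum(y) is sufficient for the mean. The data sets `y1`, `y2` and the two parameter values are hypothetical.

```r
# Normal model with known sd = 1; T(y) = sum(y) is sufficient for the mean.
loglik <- function(theta, y) sum(dnorm(y, mean = theta, sd = 1, log = TRUE))

# Two hypothetical data sets with the same n and the same T(y) = 6.
y1 <- c(1, 2, 3)
y2 <- c(0, 2, 4)

theta_1 <- 1
theta_2 <- 2.5

# Log likelihood ratios: the factor that depends only on the data cancels.
llr1 <- loglik(theta_1, y1) - loglik(theta_2, y1)
llr2 <- loglik(theta_1, y2) - loglik(theta_2, y2)
print(c(llr1 = llr1, llr2 = llr2))

# The log likelihoods themselves differ between the two data sets.
print(c(loglik(theta_1, y1), loglik(theta_1, y2)))
```

The two data sets give different likelihood values but identical comparisons between parameter values, so anything that distinguishes them plays no role in estimating theta.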

The intuition is that your choice of model determines your choice of minimal sufficient statistics or vice-versa. Knowing one gives you the other.

This tells you which parts of the data you are paying attention to.

However, you can also ask whether you are paying attention to the right things – what does the ‘residual’ y|T(y) look like? If this shows unusual patterns then you doubt you’ve used a good minimal sufficient statistic i.e. you doubt you’ve used a good model.

An issue with _nontrivial_ minimal sufficient statistics is that they don’t generally exist outside of special families.

I’d probably argue that one should proceed the other way – choose what statistics are of interest and are a compact summary (‘minimal sufficient’ from a data analysis point of veiw) and _then_, if you still want to do parametric estimation, choose a model for which these are the minimal sufficient stats.

(This is also probably not far off the same procedure as something like MaxEnt, but minus the mysticism and, possibly, issues with working with entropy).

]]>ojm, I don’t fully follow your distinction btw ‘sufficient statistics’ and ‘parameters of a model’. So, y ~ normal(mu,sigma) i.e. I model P(mu,sigma|y) prop to P(y|mu,sigma)*P(mu,sigma)

I then do a posterior predictive check on yrep|y which is basically an integral over P(yrep|mu,sigma)*P(mu,sigma|y)*dmu*dsigma

Here mu and sigma are parameters of my model. Whether or not there exist sufficient statistics for estimating them in a simple way depends on my likelihood and priors, no? For simple cases that amount to max likelihood, such statistics exist. But PPC pretty clearly needs to extend to complex cases. So I don’t understand how sufficient statistics really apply (big caveat that I am not a statistician). In particular, you say

“The data enters this estimation procedure only via the sufficient statistics (i.e. ‘the likelihood principle as applied within the model’)”

But I don’t see how this is generally true in Bayesian models, and thus for how PPC happens in practice, except in special cases where T(y) are available. What am I missing?

I was originally just trying to make a simple observation about posterior predictive checks and ‘double use of data’ not kick off another Bayes vs Freq argument! (FWIW If anything, I see a bigger division between model first vs procedure first people).

I’m glad that I found Mike Evans’ comments on the relevance of sufficient stats to the double use of data topic, so I don’t think I’m completely wrong. But would be nice to hear any other thoughts on the topic. Even Andrew seemed reluctant to take up the specific point and hence our big detour into ‘foundations’ territory.

Does anyone out there have thoughts on posterior predictive checks, sufficient stats, and double use of data?

]]>Thinking about this a little more, asymptotically a ML estimate becomes point-estimate like in the same sense that a Bayesian posterior becomes delta-function like with enough data.

The ML + curvature based estimate for a confidence interval can be seen as a first-order correction to the zeroth order estimate that the parameter value is just the ML value (a delta function).

A *proper* confidence interval however while I agree it doesn’t integrate over the parameter space, it *is* supposed to guarantee coverage regardless of what the true value is. The typical confidence interval construction procedure in ML fitting relies on asymptotic normality of the estimator, and only gives proper coverage as N goes to infinity. So for example if you do your simulation study for N = 100 instead of N = 4 and for mildly-informative order-of-magnitude priors, what do you observe?

So, for small data sets, the ML + curvature estimate is probably systematically too small to be a proper CI, because a proper CI requires some kind of minimax construction (i.e. the minimum-length interval having 95% coverage for the worst-case parameter value).

It’s not that frequentist theory here says we *should* use the ML + Curvature, it’s that it has no computationally tractable way to get a true coverage guarantee and so it relies on an approximation.

At least, I think that’s what’s going on.
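One way to check this guess is a small coverage simulation. The sketch below assumes normal data with N = 4 and compares the ML + curvature (Wald) interval with the exact t interval. The true mean and sd, the seed, and the number of replications are arbitrary choices of mine.

```r
set.seed(1)
n <- 4
n_sim <- 20000
true_mu <- 0.03   # hypothetical true mean
true_sd <- 0.01   # hypothetical true sd

covers <- replicate(n_sim, {
  y <- rnorm(n, true_mu, true_sd)
  m <- mean(y)
  sd_ml <- sqrt(mean((y - m)^2))  # ML estimate of sd (divides by n)
  # ML + curvature: normal quantile with the ML standard error
  wald <- abs(m - true_mu) <= qnorm(0.975) * sd_ml / sqrt(n)
  # exact small-sample interval: t quantile with the usual sd
  tint <- abs(m - true_mu) <= qt(0.975, n - 1) * sd(y) / sqrt(n)
  c(wald = wald, t = tint)
})

coverage <- rowMeans(covers)
print(round(coverage, 3))
```

Under these assumptions the Wald interval should fall well short of 95% coverage while the t interval attains it, which is consistent with reading the curvature interval as an asymptotic approximation.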

]]>I think we are moving into ‘dancing on pinheads’ territory :) Of course, if the models are truly equivalent, then they will yield the same results. What I’m saying is that *integration over parameter space* differentiates Bayesian and non-Bayesian estimation given the same model specification (as illustrated in code I provided above). My understanding from, e.g. BDA3, is that there is a legitimate frequentist derivation and interpretation of the max likelihood estimates that are often indeed equivalent to MAP + curvature (the Taylor Series approx of posterior mode is illustrated). From a Bayesian perspective, MAP+curvature is an approximation to what we really want (the integral), but maximum likelihood approaches for instance are NOT trying to integrate over parameter space. Therefore, I don’t think it is correct to say that those CIs are somehow invalid…and I am making no judgements here about which one is “better”. Conditional on not knowing the ‘true data generating distribution’, that is always debatable :)

]]>Chris: to be clear, as I said, there may be a subtlety I’m missing here, and so I’m not stating this as dogmatically definitely true, it’s just that I would like to see the mathematical reason why a *correctly calculated* confidence procedure from a model that is equivalent to a Bayesian model with flat prior would ever be shorter than the Bayesian model with a diffuse but proper prior.

Practically speaking, software may spit out a shorter interval, but we need to consider whether the shorter interval arises from using an asymptotic approximation that is inappropriate rather than actually having a correct CI.

]]>Chris: “just because we can interpret a frequentist procedure as a Bayesian model with a flat prior does not mean that the inference will necessarily be either a) the same, or b) a larger CI (because of aforementioned integration or lack thereof).”

No, I don’t think this is correct.

Mathematically, the 95% CI that occurs when a Frequentist procedure based on likelihood theory + a flat prior is used to construct a confidence interval, *is* the 95% posterior interval that occurs when the Bayesian uses the Bayesian procedure and an improper flat prior. In other words, whatever you get from running Stan with a flat prior, is *what you should have gotten with the CI procedure*.

There’s a mathematical isomorphism, the math that you get is the same in the two cases, the intervals you get are the same in the two cases.

Now, *as an approximation* to the *correct CI* often a MAP + curvature calculation is done by software like “lm” etc. The calculation is based on an asymptotic theory, and perhaps this may systematically under-estimate the correct size of the interval when the dataset is small. The fact is, this is a *calculation error* and the resulting interval *is not a correct 95% CI*

Taking the results of this MAP/ML + curvature *calculation error* as evidence that the CI can be smaller and hence “better” in some sense, is just saying “often the simple calculation that is done leads you to a systematic error in which your CI is wrong because it’s too small, and this is ‘good'”

Yes, I agree with you that often the point of a Frequentist interval is really just to pick out a number (the MAP/ML estimate). If Frequentists just don’t care about the size of their CI then fine… But when the model is mathematically equivalent to a Bayes CI, the intervals *need to be the same* that’s what “equivalent” means in this context.

]]>right, but my point is specifically the comparison of the Bayesian estimate (integrating over parameter space) to a hypothetical non-Bayesian approach (e.g. based on max likelihood, least squares, whatever) where there is no integration over parameter space. The closeness of MAP + curvature *as an approximation* to the Bayesian estimate is a Bayesian perspective/concern :) To be super clear, just because we can interpret a frequentist procedure as a Bayesian model with a flat prior does not mean that the inference will necessarily be either a) the same, or b) a larger CI (because of aforementioned integration or lack thereof).

]]>I haven’t thought this through carefully, but my intuition is that this can’t be true. We’re talking about the normalized product of two functions p(Data | Params) p(Params). Let’s let p(Params) = normal(Params, mu, sd) for any mu and sd we like…

The “flat” prior corresponds to the limit as sd goes to infinity. Let [a,b] be a “flat prior based confidence interval”. If you make sd something less than infinity you increase the prior probability that the parameter is in [a,b] because as sd goes to infinity, the prior probability to be in any fixed interval goes to zero.

If we increase the prior probability, and the likelihood would pick out this region under the flat prior, then we should increase the posterior probability as well. This means we can shorten the interval and keep the same posterior probability.
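The simplest case where this can be checked exactly is normal data with a known sigma and a normal prior on the mean. That is a narrower setting than the general argument above and a simplification of mine. The function `post_sd` and the numbers are hypothetical.

```r
# Normal data with known sigma and a normal prior on the mean.
# Standard conjugate result: posterior precision = prior precision + n / sigma^2.
post_sd <- function(prior_sd, n, sigma) 1 / sqrt(1 / prior_sd^2 + n / sigma^2)

n <- 4
sigma <- 0.01                          # hypothetical known noise sd
prior_sds <- c(0.005, 0.05, 0.5, Inf)  # Inf is the flat-prior limit

# Half-width of the central 95% posterior interval for each prior sd
half_width <- qnorm(0.975) * post_sd(prior_sds, n, sigma)
print(data.frame(prior_sd = prior_sds, half_width = half_width))
```

In this conjugate case the interval width does not depend on where the prior is centred, only on its sd. The prior mean moves the location of the interval, not its length.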

Now, your point about MAP + curvature may mostly be an observation that for small data, the *approximation procedure of MAP + curvature* systematically underestimates the appropriate size of the confidence interval. That’s a different story ;-)

]]>I agree with “…and this looks far too general a statement to me”.

Of course, priors should contain information that is not in the data, I took this as a given. In my (limited) experience, this is not very hard, because often weakly informative priors suffice.

I also agree that when the data is too weak to estimate the model, e.g. when parameter estimates depend more on priors than on data, one should rather admit that and collect better data than employ priors.

Lastly, when simple models are estimated based on weak data, priors can make sense because they can protect from falling for unreasonably large effect sizes (In the field I am working in).

In the end, also given my failed attempt ;-), I am wondering if general statements about the usefulness of priors that reflect more than common sense are possible.

]]>FWIW, for this example it looks like even principled, order of magnitude type weakly informative priors, lead to larger CI than the ‘frequentist’ (or MAP plus curvature) CI. The mechanism is integration – if we are not integrating over parameter space, tail area (and hence logical implication) of prior does not matter…

e.g.

stan_model2 <- '
data{
  int N;
  vector[N] dat;
}
parameters{
  real mu;
  real sigma;
}
model{
  // order-of-magnitude, weakly informative priors
  mu ~ normal(0,0.05);
  sigma ~ normal(0,0.05);
  dat ~ normal(mu,sigma);
}
'

aha, sorry Daniel I think I misread your comment. The key provision there is your saying “the *real* prior”. Which is where Christian Hennig’s statement summarizes things nicely…

]]>+ 1

]]>e.g.

library(rstan)

stan_simpleCI <- '
data{
  int N;
  vector[N] dat;
}
parameters{
  real mu;
  real sigma;
}
model{
  // no prior statements, i.e. flat priors on mu and sigma
  dat ~ normal(mu,sigma);
}
'

dat <- c(0.02, 0.025, 0.04, 0.031)
stan_dat <- list(N = 4, dat = dat)

# full Bayes: sampling integrates over parameter space
demo <- stan(model_code = stan_simpleCI, data = stan_dat,
             chains = 3, iter = 500)
# divergents are suppressible with adapt_delta and not important
# to point I'm making

# compare to the mode plus curvature (Hessian)
demo_mod <- stan_model(model_code = stan_simpleCI)
opt_demo <- optimizing(demo_mod, data = stan_dat, hessian = TRUE)
print(opt_demo)
sqrt(diag(solve(-opt_demo$hessian)))

# or
summary(lm(dat ~ 1))

If the prior puts much mass around where the data indicate that the truth is, a Bayesian uncertainty interval will be smaller than a frequentist CI. However, if the prior has the bulk of its mass elsewhere, it may be bigger.

It all boils down to the observation that a prior is helpful if the information encoded in it is good and reliable, otherwise rather not.

Sure it can be. Estimate the mean and sd of this vector of data c(0.02, 0.025, 0.04, 0.031) using stan and flat priors. Compare to estimates from lm() or use Stan optimizers and use the Hessian. The key is integrating over parameter space versus working with MAP and curvature (which is much much closer, if not identical in many cases, to the frequentist estimate and CI). With N small, the differences are noticeable.
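For this particular data set the three interval widths can be written down without running a sampler. The sketch below assumes flat priors on both mu and sigma, with sigma restricted to positive values. Integrating sigma out then leaves a Student t marginal for mu with n - 2 degrees of freedom. With a prior proportional to 1/sigma instead it would be n - 1, matching the classical t interval. The variable names are mine.

```r
dat <- c(0.02, 0.025, 0.04, 0.031)
n <- length(dat)
S <- sum((dat - mean(dat))^2)   # sum of squared deviations

# ML + curvature (Wald): normal quantile, ML variance estimate S / n
half_wald <- qnorm(0.975) * sqrt(S / n^2)

# Classical t interval, as reported via lm(dat ~ 1)
half_t <- qt(0.975, n - 1) * sqrt(S / (n * (n - 1)))

# Flat priors on mu and on sigma > 0: integrating sigma out gives a
# Student t marginal for mu with n - 2 degrees of freedom
half_flat <- qt(0.975, n - 2) * sqrt(S / (n * (n - 2)))

print(signif(c(wald = half_wald, t = half_t, flat_bayes = half_flat), 3))
```

With N = 4 the ordering is Wald < t < flat-prior Bayes, so under these assumptions the integrated interval is indeed wider, and part of the gap comes from the choice of flat prior on sigma rather than from integration alone.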

]]>“I think it is hard to make or understand statements about usefulness of priors (or trade-offs when defining them) without specifying the statistical problem one is dealing with.”

Fair enough, I agree that this depends on the problem.

“However, when the data is weak (small samples and/or unreliable measurements) or when trying to estimate a complex model, I would be unhappy not to use priors.”

…and this looks far too general a statement to me. A prior can help if it represents genuine reliable information that is not in the data. If it doesn’t, I don’t see what you get from it. Surely you only want the prior to do some work here if the work done by it is good and helpful.

Sometimes I think it’s better to be honest and say that the available information is not strong enough to estimate your complex model at the required precision, and use graphs and non-probabilistic reasoning instead (or to use probability in a purely exploratory fashion) – keeping in mind also that if the data are too weak to estimate your model, chances are that they will also be too weak to check your model well.

I don’t think the frequentist CI based on using a flat prior can ever be smaller. The real prior is always more concentrated.

Centered more on the data, perhaps, depends on the situation. Whether that’s a good thing also depends on the situation, for example the noisiness of the data collection process and the amount of background info that informs the prior.

]]>Yea I’m inclined to agree on this one. The correspondence btw frequentist CI and Bayes with flat prior is useful, and the uniform prior on (-Inf, Inf) may well logically imply goofy estimates, as Daniel says. However, in my view the frequentist CIs seem a lot closer to doing MAP with flat priors, rather than real Bayesian inference (i.e. integrating over parameter space). This is why with N = small, the frequentist CI is often smaller, and centered more literally on the data, for instance, than integrating over a Bayesian model with uninformative priors.

]]>“if I don’t have background knowledge that translates smoothly into a prior, I’m quite happy not to use one”

I think it is hard to make or understand statements about usefulness of priors (or trade-offs when defining them) without specifying the statistical problem one is dealing with.

If one has lots of data and/or very good measurements and one is estimating relatively simple models, I could also be happy not to use priors.

However, when the data is weak (small samples and/or unreliable measurements) or when trying to estimate a complex model, I would be unhappy not to use priors.

So maybe disagreements about the trade-offs involved in formulating priors stem in part from different implicit assumptions about the statistical problems to be solved?

]]>My use of the terms here is a bit tongue-in-cheek of course. Somebody here needs to take the Mickey out of all the Bayesians from time to time.

]]>Andrew

OK. I still see these terms being used commonly and I used them because of such common usage. I will avoid doing so in future! I thank you for pointing out the paper by you and Christian; I accept your points and fully appreciate the views in the paper and the discussion.

I am still unsure about what term I should apply to a probability calculated using:

(1) carefully observed, documented data alone with a mathematical model (which of course makes some unverifiable assumptions) and incorporating Bayes rule

(2) hypothetical data (“made up” as Christian says) as well as carefully observed, documented data with the same model (based on the same unverifiable assumptions) and incorporating Bayes rule

In his first reply to your blog, Christian uses the terms ‘frequentist’ (“when using Bayesian analyses with one specific prior”) and “Bayesian” (the latter when “making up” a prior). So in accordance with Christian’s usage, do you think my comment above should have used the term ‘Bayesian’ instead of ‘subjective’ for (2) and ‘frequentist’ instead of ‘objective’ for (1)? I must say that I also feel uncomfortable about using the terms ‘frequentist’ and ‘Bayesian’ too! Perhaps my last sentence should have read “Frequentists should make use of this ‘frequentist Bayesian’ approach too”.

]]>Last comment for now – if this wasn’t clear then by ‘sufficient stat’ I really mean ‘non-trivial sufficient stat’ i.e. minimal sufficient.

]]>“the flat prior logically states within the Bayesian framework that the thing is infinitely large” – but when discussing confidence intervals we’re not in that framework. You just don’t pre-specify any size for it; and the CI will then usually sit nicely around what the data indicates, with no unruly drive to infinity whatsoever.

“But any attempt to use a Bayesian Decision Rule to make a decision can’t focus on just the local density at the peak” – if I remember correctly, you always claim that specifying a prior is not complicated and whatever has high density at the true value is fine… what you say here may depend a lot on what your prior does elsewhere, at least as long as you don’t have large amounts of data.

“You could reject Bayesian Decision Rules” – I don’t (I’m quite undogmatic and have been caught defending Bayes against frequentists). I’m fine with them as long as you need a decision and your prior and loss function are well justified. Except that this in my practice is a minority situation.

]]>As long as you feel that you’ve got to have a prior; if you’re old fashioned like me and think you can do without most of the time, doing sensitivity analysis with various priors feels like trying to address a problem which you don’t need to have in the first place.

]]>Huw:

I recommend abandoning the terms “subjective” and “objective” for reasons discussed in detail here by Christian Hennig, myself, and 53 others.

]]>Adding hypothetical likelihood distributions is a very powerful way of doing sensitivity analyses to anticipate how an existing study might give different results or the result of an attempted replication in a different setting. However, it is wrong to regard the uniform base-rate prior probability distribution as simply one more hypothetical prior. On the contrary, the uniform base rate prior probability is a fundamental part of random sampling mathematical models. Frequentists should make use of this ‘objective Bayesian’ approach too.

]]>Also:

https://projecteuclid.org/download/pdf_1/euclid.ba/1340370946

Mike Evans had made a similar point

]]>If you “don’t have background knowledge that translates smoothly into a prior,” then a sensible thing to do is to try plausible priors and see how much the choice of prior affects results.

]]>http://statmodeling.stat.columbia.edu/2017/04/26/using-prior-knowledge-frequentist-tests/

It’s part of a quest to use Bayesian posteriors with informative priors as a route to what was quoted above as ‘pure’ frequentist aims (emphasising optimality, decisions, coverage). The key is the realization that informative priors in Bayesian statistics carry the information that in ‘pure’ frequentist statistics is carried by an informative loss function.

]]>(See also David Cox’s description of the Fisherian Reduction)

]]>I think it’s consistent with the standard mathematical definition of sufficient statistic.

But as you mentioned sufficient stats rarely exist for more complex problems. The guiding intuition is perhaps the same though.

> while I kind of like the idea that there’s information left on the table, I don’t think it’s information within the statistical model, it’s kind of information about whether the statistical model itself is good for your purposes

That’s precisely the point!

]]>The correspondence of many confidence intervals to Bayesian models with flat priors is informative to me though. I know ahead of time that the object I’m trying to estimate is finite, and the flat prior logically states within the Bayesian framework that the thing is infinitely large. Using a point estimation / optimization procedure ignores this measure theoretic concept and focuses just on the local density. The flat density becomes a way to “avoid altering the location of the maximum” and its behavior epsilon away from the maximum is irrelevant in that framework.

But any attempt to use a Bayesian Decision Rule to make a decision can’t focus on just the local density at the peak. You could reject Bayesian Decision Rules, but then you have to explain how Wald’s theorem plays into your justification for rejection…

In the end, I think my biggest concern is that the frequentist confidence interval and/or point estimation procedure feel like a misfired attempt to make good decisions based on data. It seemed intuitive, but it wasn’t fully thought through.

]]>This notion of “sufficient statistics” is a non-standard one though isn’t it:

From wiki: “In statistics, a statistic is sufficient with respect to a statistical model and its associated unknown parameter if ‘no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter’.”

Sometimes in a Bayesian model things collapse to a small set of sufficient statistics, like the mean and sd of a normal. But other times that’s not true. I think in the Cauchy case there is no sufficient statistic according to the standard definition of sufficient statistics (I’ve never looked too carefully into that but I see it repeated in discussions, so I assume it’s a standard result). So, while I kind of like the idea that there’s information left on the table, I don’t think it’s information within the statistical model, it’s kind of information about whether the statistical model itself is good for your purposes, and that comes down to whether it predicts “well” after fitting. The “well” has something to do with a utility rather than an inference. The fact is as fallible humans we can’t expect that we’ve specified the model exactly in the way we will want it specified until we’ve had a lot of experience with using it…

]]>Though the rest of my contribution is somewhat irrelevant nonsense, reading back I still think my comment on PPC and sufficient statistics was worth emphasising for some folk.

That is, Bayes estimation only uses the sufficient stat, so PPC can be perfectly well justified from a traditional point of view when based on y|T(y), just as in e.g. David Cox’s ‘Fisherian reduction’.

I got the feeling that Andrew didn’t care too much about emphasising this correspondence, despite mentioning on occasion that PPC should be based on aspects of the model that are not ‘fit automatically’. Perhaps because he had (rightly) moved on to more interesting things. But still.

]]>