A couple people asked me what I thought of this article by Miguel Ángel García-Pérez, Bayesian Estimation with Informative Priors is Indistinguishable from Data Falsification, which states:

Bayesian analysis with informative priors is formally equivalent to data falsification because the information carried by the prior can be expressed as the addition of fabricated observations whose statistical characteristics are determined by the parameters of the prior.

I agree with the mathematical point. Once you’ve multiplied the prior with the likelihood, you can’t separate what came from where. The prior is exactly equivalent to a measurement; conversely, any factor of the likelihood is exactly equivalent to prior information, from a mathematical perspective.
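The equivalence is easiest to see in a conjugate toy case (my example, not the article's): in a Beta-Binomial model, a Beta(a, b) prior produces exactly the same posterior as a uniform Beta(1, 1) prior combined with a-1 fabricated successes and b-1 fabricated failures.

```python
# Real data: 7 successes in 20 trials (numbers invented for illustration).
k, n = 7, 20

# Analysis 1: informative Beta(a, b) prior on the success probability.
# Conjugacy gives a Beta(a + k, b + n - k) posterior.
a, b = 5, 15
post_informative = (a + k, b + (n - k))

# Analysis 2: "non-informative" uniform Beta(1, 1) prior, but with
# (a - 1) fabricated successes and (b - 1) fabricated failures
# quietly appended to the data.
fake_succ, fake_fail = a - 1, b - 1
post_fabricated = (1 + k + fake_succ, 1 + (n - k) + fake_fail)

# The two posteriors are identical: the prior "is" the pseudo-data.
assert post_informative == post_fabricated
print(post_informative)  # (12, 28)
```

Once the two analyses are multiplied out, nothing in the posterior records which route you took; that is the mathematical point the article gets right.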

I don’t think it’s so helpful to label this procedure as “data falsification.” The prior is an assumption, just as the likelihood is an assumption. All the assumptions we use in applied statistics are false, so, sure the prior is a falsification, just as every normal distribution you use is a falsification, every logistic regression is a falsification, etc. Whatever. The point is, yes, the prior and the likelihood have equal mathematical status when they come together to form the posterior.

The article continues:

This property of informative priors makes clear that only the use of non-informative, uniform priors in all types of Bayesian analyses is compatible with standards of research integrity.

Huh? Where does “research integrity” come in here? That’s just nuts. I guess now we know what comes after Vixra. In all seriousness, I guess there’s always a market for over-the-top claims that tell (some) people what they want to hear (in this case, in a bit of an old-fashioned way).

To get to the larger issue: I do think there are interesting questions regarding the interactions between ethics and the use of prior information. No easy answers but the issue is worth thinking about. As I summarized in my 2012 article:

I believe we are ethically required to clearly state our assumptions and, to the best of our abilities, explain the rationales for these assumptions and our sources of information.

How do people still get away with this “non-informative” crap? It’s just wordsmithing to entrap the unwary (of which there are many, both in science and the general public).

How is it non-informative to say that x is twice as likely, a priori, to lie in the range [0,10] as in [25,30]? If someone is offered a bet on those two outcomes, the prior implies very specific odds before anyone looks at any measurement.
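The implied odds are trivial to compute. A sketch (the uniform support [0, 30] is my invented choice; the comment doesn't specify one, but the 2:1 ratio holds for any uniform prior covering both ranges):

```python
# A "non-informative" uniform prior on [0, 30].
lo, hi = 0.0, 30.0
width = hi - lo

# The interval [0, 10] gets twice the mass of [25, 30]
# simply because it is twice as wide.
p_a = (10 - 0) / width    # P(0  <= x <= 10) = 1/3
p_b = (30 - 25) / width   # P(25 <= x <= 30) = 1/6

# The "flat" prior still quotes concrete betting odds before any data:
print(p_a / p_b)  # 2.0
```

So even the prior held up as the only ethical choice makes a definite pre-data claim about where the parameter lives.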

+ 1. Exactly. The rest of the abstract is similarly shoddy, and contains patently false assertions. The only good thing is that anti-Bayesianism is now less fashionable in general than it used to be, so this article’s impact is not likely to be large. OTOH, that doesn’t mean that misunderstandings are not still rife. Now, if this author had used the opportunity to classify all other forms of regularization as “data falsification”, I would be a touch more sympathetic. As it is, my reaction is GTFOH.

You can go further back in time and say that your choice of instruments and data collection methods is "Indistinguishable from Data Falsification." Or even that your choice of research area is "Indistinguishable from Data Falsification." Data is like money in this respect: if you don't trust whoever produced it, the whole enterprise comes tumbling down.

At first when I saw “Perez” and Bayesian criticisms I thought it was this guy:

https://medium.com/intuitionmachine/cargo-cult-statistics-versus-deep-learning-alchemy-8d7700134c8e

I stumbled across this paper by chance a few weeks ago and was shocked by quite how bad it is – was wondering how long it would be before it showed up on this blog!

That’s right. I saw this paper 2 days after its publication. I thought it would be on this blog almost immediately! Glad it’s finally here!

Funny that a paper in such an obscure journal would circulate so much. Extreme claims get attention, I guess.

The Spanish Journal of Psychology?

I’m curious how you “stumble across it”. Seems pretty obscure.

>non-informative, uniform priors

Oxymoron of the day.

There is an interesting debate (that might be more philosophical than statistical) hidden somewhere here

but it is not worth it to sift through all the bs for that.

Data falsification: place additional data points that you made up into your csv file without telling anyone.

Bayesian inference: place a description of a probability measure, encoding the information about your problem, into your fitting code, which you release publicly. Place only actual collected data from your measurement instruments in your csv file.

Voilà, distinguishable…. Next!

My thoughts exactly—but expressed better than I would have done!

Bob76

No, no, he says it's formally indistinguishable.

Of course, by this standard, collecting more actual data is also indistinguishable from data falsification.

Formally indistinguishable only if you look exclusively at the posterior samples and not at the computer program that generated them. Of course, if you're just looking at the output, maximum likelihood estimates with standard errors are formally indistinguishable from handing your child a crayon and saying "draw some points with little bars coming out."

Ah. I thought my feeling like a fraud was impostor syndrome; turns out it was the priors.

Isn’t the exclusion of any information (say, from horoscopes, or augury) equivalent to using a delta function as an informative prior for certain coefficients in a model?

The entrails of this goat say….. yes!

On the ethical side – isn’t it unethical to fail to use knowledge that we have to model a result? I would agree that including a scenario that doesn’t account for our knowledge (a base case) is good practice to help users of the result understand the implication and size of the impact of our assumption/knowledge, but that’s the limit of it.

If you know stuff and leave it out then you are doing exactly the same as making stuff up and putting it in. Being dishonest!

Simon:

I discussed that in my article, Ethics and the statistical use of prior information.

Isn’t his criticism actually a compliment? The purpose of identifying prior beliefs is to identify and test them. Ideally, you test these beliefs in specific ways that tell you whether you’re meaningfully right or wrong. I can’t see a sensible argument that not identifying beliefs is a good idea.

I was thinking about witches yesterday. Sink or float; if she floats, she’s a witch but if she drowns, then she was innocent. In what system of belief is that a sensible test? I’m not exactly sure, but I can say it embodies prior beliefs so powerfully there’s no way to pass the test without dying. Witches float. If you try to ‘save’ a drowning woman, she might actually be a witch waiting to float. Very powerful prior becomes a given, and overwhelms the notion that not being a witch equals death.

If you approach this as an informative prior, not as a given, then you can still say witches float but you can adjust how long they must be submerged: rather than assume the tiniest hint of possibility that she might still float (and not merely lie face down dead in the water), assume instead the possibility of innocence, counting back from at least trying to save them before they gave up the ghost. The prior data included is: they always drown, so the test is only good at killing the innocent. This becomes apparent when you view whatever you are doing as occurring within a system that you guess about as best you can. We reach for certainty but any certainty is temporal.

The story is a gross simplification, but the point is that any test you make embodies your prior beliefs, especially when those beliefs have become givens.

Not sure why I thought of this, but that kind of story about witches, true or not, embodies deep metaphors about women. Ophelia drowns. The Flood washes away humanity to make a fresh start. Women not only bleed but they break their water, carrying within them the metaphor of life coming from water. Men only bleed when wounded, but women bleed regularly (or irregularly) as though they’ve been wounded. I assume the witch trial comes from the idea that the waters of creation would reject the wicked, causing them to float, while they would embrace the innocent, causing them to drown. That metaphor completes as: then their souls rise and float in the afterworld. That means this form of trial would represent human sacrifice, twisted around so a story about evil justifies the sacrificial act. Informative priors help dig out relationships.

The point about ignoring prior knowledge is certainly good.

A nitpick, however: in the era when witch trials were a real thing, witchcraft wasn’t seen as exclusive to women. So I think that water connection is a bit strained.

Well, I'm not a statistician, but in following climate science I have learned that the issue of priors is a controversial one. I do believe, however, that most experts I've encountered say that uniform priors are not appropriate in most circumstances. I would be curious, Andrew, what you think of the Jeffreys "noninformative" prior.

I certainly don’t think that using an informative prior is “formally equivalent” to data falsification.

> following climate science I have learned that the issue of priors is a controversial one

I could not agree more:

https://climateaudit.org/2014/04/17/radiocarbon-calibration-and-bayesian-inference/#comment-547957

The title reminds me of all the tabloid headlines Andrew likes to mock. At first I thought I could use this posting to give to my sceptical colleagues and reviewers, because I think the question should be taken seriously. But then comes only polemics; the cited papers also do not take the matter seriously.

As someone using these tools, I know how to make sure I do not see only the prior reflected in the result, but it is not an easy task to explain this seriously to p-value-greedy reviewers.

Underwhelmed.

It takes a somewhat inhuman mindset to say that what one feels going into analyzing some data should have no effect on one's conclusion after having seen the data. It is a standard that no one holds, and yet it is held up as the required mode of thought. Once you add the fact that most data are inconclusive on their own, the counsel is to simply take what you believed before (the null hypothesis, not that you ever really believed it) and keep it until the data speak all by themselves without any prior, if such a thing is even possible in all but the simplest cases.

But it’s even worse than that. After all, if data is all that is important, it ought to convince even people who hold a strong prior going the other direction. If the null represents conventional wisdom (I know… it doesn’t) then why shouldn’t we continue to believe it not just against flat priors, but against strongly biased priors in favor of the conventional wisdom? Isn’t that just what the conventional wisdom is?

Does anyone know of examples where someone has pre-registered the priors they will use in their analysis before collecting data? I don’t know of examples like that off the top of my head.

Here.

Thank you!

Notwithstanding the click-baity article, can we salvage something useful out of this polemic? The answer is yes. Specifically, we can put priors on "pseudo-data points" and deform fits to match our knowledge.

Suppose you have a model of some process, such as a time-series curve fit using a Chebyshev polynomial or a Fourier series, and you have some gaps in the data: either a period in the past where no information was recorded, or the near future where you haven't observed anything yet.

You have prior knowledge about the value of the function in regions of space that the missing points are in.

Since your model is on the coefficients of the polynomial/Fourier series, how could you encode your knowledge about what should happen in the gap? It "needs to be a prior on the coefficients," right?

Well, not exactly.

Invent some parameters that represent (t_i,f_i) for the function values at the chosen time points t_i in the gap. Provide informative priors on the f_i values. Then fit the polynomial to the full set of (t_j,f_j) for j including all real data, and the i values represented by your parameters.

What this does is it induces a prior on your coefficients via the “masking” technique I described here: http://models.street-artists.org/2019/11/04/deformation-factors-vs-jacobians-and-all-that-jazz/

Think of it this way, whatever your prior is on the coefficients p(C), multiply it by a dimensionless function L(f(C,t_i)-f_i) describing how closely the function f should come to the parameter f_i. This new density: p(C) L(f(C,t_i)-f_i)/Z is your new prior. Fortunately for you, in an MCMC type situation, you don’t need to calculate Z.

This technique works extremely well: it regularizes your fit so that it doesn't "misbehave" in regions where no data are available. However, you'd better have a good sampler, because it's going to inherently produce dependent structures between the f_i parameters and the C parameters.
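A minimal sketch of the construction, with all specifics (data, polynomial degree, gap location, noise scales) invented for illustration, and a bare-bones random-walk Metropolis sampler standing in for a serious one like Stan:

```python
import numpy as np

rng = np.random.default_rng(0)

# Real observations on [0, 1] and [2, 3], with a gap on (1, 2).
t_obs = np.concatenate([np.linspace(0, 1, 10), np.linspace(2, 3, 10)])
y_obs = np.sin(t_obs) + rng.normal(0, 0.05, t_obs.size)

deg = 5                        # polynomial degree
t_gap = np.array([1.5])        # chosen time point(s) inside the gap
f_prior_mean = np.sin(t_gap)   # prior knowledge about the function there
f_prior_sd = 0.1

def log_post(theta):
    """Coefficients C plus pseudo-data parameters f_i, as one vector."""
    C, f = theta[:deg + 1], theta[deg + 1:]
    # Likelihood of the real data.
    resid = y_obs - np.polyval(C, t_obs)
    lp = -0.5 * np.sum((resid / 0.05) ** 2)
    # "L" factor: the curve should pass near the pseudo-points f_i.
    lp += -0.5 * np.sum(((np.polyval(C, t_gap) - f) / 0.02) ** 2)
    # Informative prior on the f_i values.
    lp += -0.5 * np.sum(((f - f_prior_mean) / f_prior_sd) ** 2)
    # Broad Gaussian prior on the coefficients keeps everything proper.
    lp += -0.5 * np.sum((C / 100.0) ** 2)
    return lp

# Toy Metropolis loop, just to show the construction runs end to end.
theta = np.zeros(deg + 1 + t_gap.size)
lp = log_post(theta)
samples = []
for _ in range(5000):
    prop = theta + rng.normal(0, 0.01, theta.size)
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    samples.append(theta.copy())
samples = np.array(samples)
print(samples.shape)  # (5000, 7): 6 coefficients + 1 pseudo-point
```

Pooling the samples marginalizes away the f_i, leaving the induced prior-plus-likelihood fit on the coefficients, as described above.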

This is a very interesting idea, but I have a few naive and confused questions that will probably demonstrate that I’m not properly understanding it. You say:

“Invent some parameters that represent (t_i,f_i) for the function values at the chosen time points t_i in the gap. Provide informative priors on the f_i values. Then fit the polynomial to the full set of (t_j,f_j) for j including all real data, and the i values represented by your parameters.”

How are you supposed to fit the polynomial to the f_i given that they are unknown parameters? Do you use the expected value of f_i over your prior? Also, you say:

“Think of it this way, whatever your prior is on the coefficients p(C), multiply it by a dimensionless function L(f(C,t_i)-f_i) describing how closely the function f should come to the parameter f_i. This new density: p(C) L(f(C,t_i)-f_i)/Z is your new prior. Fortunately for you, in an MCMC type situation, you don’t need to calculate Z.”

This idea is similar to some ideas I’ve also had, but it sounds somewhat different from your preceding idea of fitting the function to “pseudo-data points.” Are you saying that the two ideas are equivalent? Also, is it necessary to have an original prior p(C) over the coefficients, or is it in principle possible to simply use L(f(C,t_i)-f_i)/Z as the prior?

>How are you supposed to fit the polynomial to the f_i given that they are unknown parameters?

Specifically by using your “likelihood” L, but instead of plugging in observed data values f_i you plug in the parameter values f_i

In other words, it’s not two different ideas, it’s just one idea.

>Are you saying that the two ideas are equivalent?

Yes. It's really just one idea: invent some fake data points that are actually parameters, and then feed them into the posterior through a "likelihood" just as if they were observed data.

>is it necessary to have an original prior p(C) over the coefficients, or is it in principle possible to simply use L(f(C,t_i)-f_i)/Z as the prior?

If you put extremely broad Gaussian priors on the coefficients C, you guarantee a proper prior, and as those Gaussian priors go toward infinite variance, they have no real effect on the fit.

However, every polynomial can be parameterized as an interpolating polynomial through a set of control points, so whether you have priors on the parameters, or priors on the value of the function at interpolation points, they both should be proper.

>How are you supposed to fit the polynomial to the f_i given that they are unknown parameters?
>
>Specifically by using your "likelihood" L, but instead of plugging in observed data values f_i you plug in the parameter values f_i

That I gathered, but my issue was just how, exactly, you’re supposed to plug f_i into L, given that f_i is not a number but rather a parameter with an associated prior p. You could plug in the expected value of f_i over p; is that what you had in mind?

When sampling for example, simply insert the current sampled value of f_i into the function L to compute the posterior density.

In the end you marginalize away the f_i values by pooling all the samples.

How do you determine how many fake data points to use? It seems like capturing your prior uncertainty would then be a function of both the spread of your informative prior on the f_i values and the number of pseudo-points placed in the gap, so it seems harder to capture a desired degree of vagueness.

Indeed. It helps to be humble and to do posterior predictive checking. You can also add information on things like the mean squared second derivative to keep the function from wiggling like crazy.

It’s obviously only science if the researcher pretends to know nothing about the subject matter at hand. Everyone knows that science advances from lone wolves who avoid prior subject matter knowledge to maximize naivete. Standing on the shoulders of giants is for sheep and hacks…

Well, I would have said goats, not sheep.

(Anyhow, brave of you to have a dog picture rather than a cat or baby alligator picture. ;~) )

“Using informative priors is wrong”

Oh, it feels like 2010. Brings back good memories.

I think a good counter-argument to this sort of nonsense is regularization and regularizing priors. 1) When using regularizing priors, you bias the model in favor of the null, so any extraordinary claims you want to make require MORE extraordinary evidence than MLE, not less. 2) Unregularized frequentist MLE models make the implicit assumption that, for two perfectly collinear predictors, the parameter values -Inf, Inf are just as likely as any other values (e.g. 0, 0). Even if the collinearity isn't perfect but merely strong, you'll still get very wacky estimates. A regularizing normal prior/L2 norm seems like a much more sensible default choice in pretty much every natural application I can think of.
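Point 2 is easy to demonstrate numerically. A sketch with invented data: two nearly identical predictors, unpenalized least squares versus a ridge/L2 solve (equivalently, the posterior mode under a zero-centered normal prior):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two strongly collinear predictors.
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is nearly a copy of x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.5, size=n)     # true coefficients: (1, 0)

# Unregularized least squares: the near-singular X'X lets the two
# coefficients fly apart in opposite directions.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge / L2 penalty: shrink toward zero and split the credit
# roughly evenly between the twin predictors.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print(beta_ols, beta_ridge)
```

The ridge coefficients land near (0.5, 0.5) and sum to roughly the true total effect, while the unregularized pair is at the mercy of the tiny difference between the two columns.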

Adam:

Indeed, one thing I like about lasso and machine learning is that they demonstrate the value of regularization, separate from any formal Bayesian perspective. I’d much rather have the non-Bayesians doing non-Bayesian regularization than have them doing least squares and maximum likelihood.

I wonder if what's a reasonable default choice might depend on how much data there is. Recently, I did several (unsystematic) simulations with logistic regression to see how regularization with L2 penalization compares with MLE when it comes to recovering the correct parameter values, just to satisfy my own curiosity. Maybe others have a different experience, but my (informal) impression was that L2 penalization is definitely better when there are few observations per parameter (which is unsurprising), but that MLE tends to be better when there's more data. So the penalty (or prior) is helpful when there's not much data, but becomes somewhat of a liability as the amount of data increases beyond a certain point. So there might be a place for MLE — or, more generally, uninformative priors — when there's a reasonable number of observations per parameter (obviously, when there's tons of data, the penalty/prior stops mattering altogether).
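The parenthetical at the end, that a fixed penalty stops mattering as data accumulate, can be seen directly in the one-predictor case, where both estimators have closed forms (the simulation specifics here are my own invention, not the commenter's):

```python
import numpy as np

rng = np.random.default_rng(3)

def ridge_vs_ols_gap(n, lam=1.0):
    """Absolute gap between the ridge and OLS slopes for n observations."""
    x = rng.normal(size=n)
    y = 0.8 * x + rng.normal(size=n)
    ols = (x @ y) / (x @ x)
    ridge = (x @ y) / (x @ x + lam)  # posterior mode under a normal prior
    return abs(ols - ridge)

# The pull of a fixed default penalty fades roughly like 1/n.
gaps = [ridge_vs_ols_gap(n) for n in (10, 1_000, 100_000)]
print(gaps)
```

With ten observations the penalty visibly shifts the estimate; with a hundred thousand it is numerically negligible.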

It obviously depends on how well your informative prior captures the true value of the parameter.

Yes, but one can still ask how good it is to use automatic/default penalization, i.e. rescaling the data and then shrinking parameter estimates towards 0, as is done in penalized regression, or using a default normal prior centered on 0 and with an SD of 1. In my simulations, I made the true coefficients relatively close to 0, so you’d expect penalization to help. But I found that it hurts rather than helps if there’s enough data.

I think rescaling data and using N(0,1) priors deserves some thought. What this amounts to saying is, “the range of variability in observed data is about what we would expect in this type of problem”. In effect, you are using both the data and a summary statistic of the data (the SD) in your joint generative model. This actually illustrates the old canard about ‘using the data twice’. It’s not a violation of the Ten Bayesian Commandments; rather, it just implies a different joint model, one where you are strongly centering your inferences in the range of observed data.

That said, better approaches to regularization would include something like the horseshoe where you estimate a group-level distro over the coefficient SDs. Aki Vehtari and colleagues are the authorities on that approach!

> It’s not a violation of the Ten Bayesian Commandments; rather, it just implies a different joint model, one where you are strongly centering your inferences in the range of observed data.

The extra strength is obtained by recycling the data, you’re sinning a bit.

Let he who is without sin cast the first stone! (with their eyes shut, of course)

I used to think that way a little bit myself but no more. I think it’s more useful to consider this another part of model formulation, i.e. assumption. For instance, let’s say that you have standardized response variable to N(0,1) and have a binary indicator predictor variable. What prior to use?

– Uniform: has all the problems noted here and elsewhere, most egregiously that it implies an effect size of between, say, 100 and 110 SDs is a priori just as plausible as one between 0 and 10 SDs. With enough data, however, this is not likely to be an issue, of course.

– OK, we want/need some kind of regularization. We will stick to normal family, Normal(0,tau). But how to pick tau? You might think there is something especially “recycling” about Normal(0,1), which has the interpretation I noted above, but really any choice for tau could be thought of as a multiple of the observed data SD, and interpreted accordingly. In reality, we are just adjusting a regularization knob up and down, when we are in the absence of substantive external information.

I don’t understand your example but it seems to me that either you are reusing data or you are not. Say you have a (data-independent) prior, a model and some data: x1, …, xN. You could also write your data as muhat, sigmahat, z1, …, zN (where the meaning of muhat, sigmahat and the z’s is hopefully obvious). You could use it in two steps: “consuming” muhat and sigmahat you get an “informed” prior/model that you then update with the information remaining in the z-scores (if any!). That’s ok. But if your inference is different from the one you would get from the straightforward analysis then you’re either: a) not using the data efficiently or b) reusing the data.

Carlos: suppose you are trying to fit a curve to some data. You specify a prior on some fitting parameter that describes the wiggliness of the curve. Then you get some data. When you fit to the data, it has a weekly reporting periodicity that strongly forces the curve to wiggle up and down. So after seeing the fit, you crank up the prior on the magnitude of the wiggliness regularizing parameter. Refitting gives you a less wiggly fit, representing what you think the underlying process does without the weekly reporting biases.

In my opinion this is an informal version of some expanded model. But formally it's equivalent to setting the prior based on the results of fitting to the data.

> in my opinion this is an informal version of some expanded model.

I can imagine a model that reuses data by design (a squared likelihood, for example) but I cannot quite imagine an expanded model that gives different likelihoods depending on the data. If you can come up with a mathematical embodiment of that opinion I’d be interested.

Carlos, my point is just that it’s more useful to think of this as adjusting regularization knob. The choice to scale it to 1 SD of observed data is just an arbitrary default choice that does a little bit of regularization but not much. It is virtually always better than a flat prior. But if you have better external info, go for it. Alternatively, if you’re going for pure predictive power, could make like the Lasso/Ridge/Net folks and tune on out of sample predictive scores or something like that.

Carlos: imagine your prior information is “this curve changes concavity at most 1 time in the interval [a,b]”

One way to express that prior is to somehow calculate a functional of the curve that is closely related to the changes in concavity, and use it directly in your prior.

Another way is to maybe encode some simpler parameterization, and then run the fit, and tune what knobs you have until you get out of it something that expresses your knowledge. you don’t know what that is directly, because you weren’t capable of directly encoding your knowledge, but you know what it looks like when you see it.

Often times we’re in the second case, where we can at best have a few knobs that are capable of encoding the prior information, if we were able to express a complicated multidimensional probability distribution over our parameters, but we don’t know how, so the best we can do is to adjust the knobs until the posterior has the feature we knew it should have.

Of course when you do this, you should provide a very explicit argument for why your knowledge comes in the form it does, and how you decided to adjust the knobs.

I am reminded of a story about a Nobel prize winning biochemist of a past era (since the story is probably apocryphal, I omit his name). Roughly it was that he looked over the data, marked certain things as obviously wrong, to be discarded, and did so. And he was generally right. Formally it was equivalent to after-the-fact selective use of data. The key in science is really that priors be well-informed, not just informed.

The sociological problem stems from people applying statistical methodologies to problems they do not understand.

Good point.

I’m not a statistician, but isn’t it true that the null distribution is also fictional, i.e., in a sense it’s data falsification?

I would not call a null distribution fictional. It is a model that describes a conjecture; it is used to check the data "against" that conjecture — that is, to ask the question (roughly put), "How likely would the data we have be if the conjecture were true?" If the probability of getting the data we have were very small (assuming the conjecture were true), that would reasonably cast doubt on the truth of the conjecture. (But then the devil is in the detail of "How small would the probability of getting the data we have need to be in order to rationally reject the conjecture?")
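That logic can be sketched as a simulation (the coin example and all numbers are mine, invented for illustration):

```python
import random

random.seed(0)

# Null conjecture: the coin is fair. Observed: 16 heads in 20 flips.
observed_heads, n_flips = 16, 20

# Simulate the null distribution of the head count and ask how often a
# fair coin does at least this well.
sims = 100_000
extreme = sum(
    sum(random.random() < 0.5 for _ in range(n_flips)) >= observed_heads
    for _ in range(sims)
)
p_value = extreme / sims
print(p_value)  # near the exact one-sided value of about 0.006
```

The simulated distribution is not "falsified data"; it is the conjecture itself, made explicit enough to be checked against what was actually observed.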

I saw this paper a while back and couldn’t stop thinking about it. It really is perfect clickbait. Particularly infuriating because the claims made are technically true.

The best response I could come up with is that choosing any class of models (e.g. linear) is the same as choosing an informative prior over some larger class of models (e.g. quadratic, with a prior that puts zero mass on models with quadratic terms), and that therefore all parametric model-fitting of any kind should also be punished as scientific fraud. I haven’t thought through whether or not nearest neighbors is still acceptable.

Michael:

I don’t think the claim, “only the use of non-informative, uniform priors in all types of Bayesian analyses is compatible with standards of research integrity,” is technically true.

One clue that this claim might be false is that it is preceded by “makes clear.” As all mathematicians know, words such as “obviously,” “clearly,” “trivially,” etc., are typically applied to false statements. If a statement is true, you can just say it. For example:

– “2 + 2 = 4.” No need for “clearly” etc.

– “This property of informative priors makes clear that only the use of non-informative, uniform priors in all types of Bayesian analyses is compatible with standards of research integrity.” The phrase “makes clear” is a giveaway that the author is bullshitting.

Yeah, I meant that the mathematical construction is the part that's technically true, which makes the argumentative sleight-of-hand harder to follow. When my officemate showed this to me I knew it was obviously wrong but couldn't articulate why, which is an extremely irritating feeling. Mostly I was just waving my arms around.

I’m still begrudgingly impressed with the paper for how much it annoys everyone.

+1 for the mini-rant about “obviously,” “clearly,” “trivially,” etc.

Yes, the coefficients of these models are arbitrary. If you explore the “multiverse” of possible arbitrary models (one as good as the other) you will find any particular coefficient will vary wildly: https://statmodeling.stat.columbia.edu/2019/08/01/the-garden-of-forking-paths/

It is valid to use such models for their predictive skill though.

Curious that an informative prior is bad but data augmentation is advantageous. I guess it's different audiences: https://statmodeling.stat.columbia.edu/2019/12/02/a-bayesian-view-of-data-augmentation/