Formally indistinguishable only if you look exclusively at the posterior samples and not at the computer program that generated them. Of course if you’re just looking at the output, maximum likelihood estimates with standard errors are formally indistinguishable from handing your child a crayon and saying “draw some points with little bars coming out”

]]>No, no, he says it’s *formally* indistinguishable. Of course, by this standard, collecting more actual data is also indistinguishable from data falsification.

Carlos: imagine your prior information is “this curve changes concavity at most 1 time in the interval [a,b]”

One way to express that prior is to somehow calculate a functional of the curve that is closely related to the changes in concavity, and use it directly in your prior.

Another way is to maybe encode some simpler parameterization, then run the fit and tune what knobs you have until you get out of it something that expresses your knowledge. You don’t know what that is directly, because you weren’t capable of directly encoding your knowledge, but you know what it looks like when you see it.

Often we’re in the second case, where at best we have a few knobs. We could encode the prior information if we were able to express a complicated multidimensional probability distribution over our parameters, but we don’t know how, so the best we can do is adjust the knobs until the posterior has the feature we knew it should have.

Of course when you do this, you should provide a very explicit argument for why your knowledge comes in the form it does, and how you decided to adjust the knobs.
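To make the “you know it when you see it” check concrete, here is a minimal sketch (the curves and grid are made up for illustration) of a functional you could compute on each posterior draw: the number of concavity changes, estimated from discrete second differences on an evenly spaced grid.

```python
import numpy as np

def concavity_changes(y):
    """Count sign changes of the discrete second difference of a curve
    sampled on an evenly spaced grid (zero entries are ignored)."""
    signs = np.sign(np.diff(y, n=2))
    signs = signs[signs != 0]          # drop exactly-flat segments
    return int(np.sum(signs[1:] != signs[:-1]))

t = np.linspace(-1.0, 1.0, 200)        # grid chosen so t = 0 is not a node
print(concavity_changes(t**3))         # 1: one inflection, at t = 0
print(concavity_changes(np.sin(6 * t)))  # 3: inflections at 0 and ±pi/6
```

In practice you would apply this to each sampled curve and turn the knobs until, say, most posterior draws show at most one change.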

]]>Carlos, my point is just that it’s more useful to think of this as adjusting a regularization knob. The choice to scale it to 1 SD of the observed data is just an arbitrary default that does a little bit of regularization but not much. It is virtually always better than a flat prior. But if you have better external info, go for it. Alternatively, if you’re going for pure predictive power, you could do as the Lasso/Ridge/Elastic Net folks do and tune on out-of-sample predictive scores or something like that.

]]>The best response I could come up with is that choosing any class of models (e.g. linear) is the same as choosing an informative prior over some larger class of models (e.g. quadratic, with a prior that puts zero mass on models with quadratic terms), and that therefore all parametric model-fitting of any kind should also be punished as scientific fraud.

Yes, the coefficients of these models are arbitrary. If you explore the “multiverse” of possible arbitrary models (one as good as the other) you will find any particular coefficient will vary wildly: https://statmodeling.stat.columbia.edu/2019/08/01/the-garden-of-forking-paths/

It is valid to use such models for their predictive skill though.

]]>> in my opinion this is an informal version of some expanded model.

I can imagine a model that reuses data by design (a squared likelihood, for example) but I cannot quite imagine an expanded model that gives different likelihoods depending on the data. If you can come up with a mathematical embodiment of that opinion I’d be interested.

]]>Carlos: suppose you are trying to fit a curve to some data. You specify a prior on some fitting parameter that describes the wiggliness of the curve. Then you get some data. When you fit to the data, it has a weekly reporting periodicity that strongly forces the curve to wiggle up and down. So after seeing the fit, you crank up the prior on the magnitude of the wiggliness-regularizing parameter. Refitting gives you a less wiggly fit, representing what you think the underlying process does without the weekly reporting biases.

in my opinion this is an informal version of some expanded model. but formally it’s equivalent to setting the prior based on the results of fitting to data.

]]>I don’t understand your example but it seems to me that either you are reusing data or you are not. Say you have a (data-independent) prior, a model and some data: x1, …, xN. You could also write your data as muhat, sigmahat, z1, …, zN (where the meaning of muhat, sigmahat and the z’s is hopefully obvious). You could use it in two steps: “consuming” muhat and sigmahat you get an “informed” prior/model that you then update with the information remaining in the z-scores (if any!). That’s ok. But if your inference is different from the one you would get from the straightforward analysis then you’re either: a) not using the data efficiently or b) reusing the data.
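A tiny numerical illustration of that decomposition (taking muhat and sigmahat to be the sample mean and SD): the rewrite is lossless, and the z-scores by construction carry no location or scale information of their own.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=3.0, size=10)   # the data x1, ..., xN

muhat = x.mean()
sigmahat = x.std(ddof=1)
z = (x - muhat) / sigmahat                    # z-scores: what remains

# The two-step rewrite is lossless...
x_back = muhat + sigmahat * z
print(np.allclose(x, x_back))                 # True

# ...and the z-scores hold no location/scale information of their own:
# their mean is 0 and their (ddof=1) SD is 1, by construction.
```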

]]>+1 for the mini-rant about “obviously,” “clearly,” “trivially,” etc.

]]>I would not call a null distribution fictional. It is a model that describes a conjecture; it is used to check the data “against” that conjecture — that is, to ask the question (roughly put), “How likely would the data we have be if the conjecture were true?” If the probability of getting the data we have were very small (assuming the conjecture were true), that would reasonably cast doubt on the truth of the conjecture. (But then the devil is in the detail of “How small would the probability of getting the data we have need to be in order to rationally reject the conjecture?”)

]]>I used to think that way a little bit myself but no more. I think it’s more useful to consider this another part of model formulation, i.e. assumption. For instance, let’s say that you have standardized response variable to N(0,1) and have a binary indicator predictor variable. What prior to use?

– Uniform: has all the problems noted here and elsewhere, most egregiously that it implies an effect size of between, say, 100 and 110 SDs is a priori just as plausible as one between 0 and 10 SDs. With enough data, however, this is not likely to be an issue, of course.

– OK, so we want/need some kind of regularization. We will stick to the normal family, Normal(0, tau). But how to pick tau? You might think there is something especially “recycling” about Normal(0,1), which has the interpretation I noted above, but really any choice of tau can be thought of as a multiple of the observed data SD, and interpreted accordingly. In reality, we are just adjusting a regularization knob up and down when we lack substantive external information.
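For what it’s worth, the knob interpretation can be checked directly: with a Gaussian likelihood, the posterior mode under a Normal(0, tau) prior is exactly the ridge estimate with penalty sigma^2/tau^2. A sketch with simulated data (all numbers arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([0.5, -0.3, 0.0])
sigma = 1.0
y = X @ beta_true + rng.normal(scale=sigma, size=n)

tau = 1.0                    # prior SD: beta_j ~ Normal(0, tau)
lam = sigma**2 / tau**2      # the equivalent ridge penalty

# Ridge: argmin ||y - Xb||^2 + lam * ||b||^2 (closed form)
ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
ols = np.linalg.solve(X.T @ X, X.T @ y)   # the flat-prior answer

# Posterior mode under the Normal(0, tau) prior, found numerically
def neg_log_post(b):
    return (np.sum((y - X @ b) ** 2) / (2 * sigma**2)
            + np.sum(b ** 2) / (2 * tau**2))

map_est = minimize(neg_log_post, np.zeros(p)).x
print(ridge, map_est)        # the two agree; both shrink relative to OLS
```

So picking tau and picking a ridge penalty are the same decision in different units, which is why “arbitrary default” and “regularization knob” are two descriptions of one choice.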

]]>Good point.

]]>yeah, I meant that the mathematical construction is the part that’s technically true, which makes the argumentative sleight-of-hand harder to follow. when my officemate showed this to me I knew it was obviously wrong but couldn’t articulate why, which is an extremely irritating feeling. Mostly I was just waving my arms around.

I’m still begrudgingly impressed with the paper for how much it annoys everyone.

]]>Michael:

I don’t think the claim, “only the use of non-informative, uniform priors in all types of Bayesian analyses is compatible with standards of research integrity,” is technically true.

One clue that this claim might be false is that it is preceded by “makes clear.” As all mathematicians know, words such as “obviously,” “clearly,” “trivially,” etc., are typically applied to false statements. If a statement is true, you can just say it. For example:

– “2 + 2 = 4.” No need for “clearly” etc.

– “This property of informative priors makes clear that only the use of non-informative, uniform priors in all types of Bayesian analyses is compatible with standards of research integrity.” The phrase “makes clear” is a giveaway that the author is bullshitting.

]]>Let he who is without sin cast the first stone! (with their eyes shut, of course)

]]>> It’s not a violation of the Ten Bayesian Commandments; rather, it just implies a different joint model, one where you are strongly centering your inferences in the range of observed data.

The extra strength is obtained by recycling the data, you’re sinning a bit.

]]>The best response I could come up with is that choosing any class of models (e.g. linear) is the same as choosing an informative prior over some larger class of models (e.g. quadratic, with a prior that puts zero mass on models with quadratic terms), and that therefore all parametric model-fitting of any kind should also be punished as scientific fraud. I haven’t thought through whether or not nearest neighbors is still acceptable.

]]>I think rescaling data and using N(0,1) priors deserves some thought. What this amounts to saying is, “the range of variability in observed data is about what we would expect in this type of problem”. In effect, you are using both the data and a summary statistic of the data (the SD) in your joint generative model. This actually illustrates the old canard about ‘using the data twice’. It’s not a violation of the Ten Bayesian Commandments; rather, it just implies a different joint model, one where you are strongly centering your inferences in the range of observed data.

That said, better approaches to regularization would include something like the horseshoe where you estimate a group-level distro over the coefficient SDs. Aki Vehtari and colleagues are the authorities on that approach!

]]>Indeed. It helps to be humble and to do posterior predictive checking. You can also add information on things like the mean squared second derivative to keep the function from wiggling like crazy.
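The mean-squared-second-derivative idea is easy to sketch in its penalized-least-squares limit (Chebyshev basis, made-up noisy data; the penalty weight is an arbitrary choice):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(1)
t = np.linspace(-1, 1, 40)
y = np.sin(2 * np.pi * t) + rng.normal(scale=0.3, size=t.size)

deg = 12
B = C.chebvander(t, deg)                  # design matrix of T_0 .. T_deg

# Second derivative of each basis function, evaluated on a fine grid
grid = np.linspace(-1, 1, 200)
D2 = np.column_stack([
    C.chebval(grid, C.chebder(np.eye(deg + 1)[k], 2))
    for k in range(deg + 1)
])

def penalized_fit(lam):
    """Least squares plus lam * (mean squared second derivative)."""
    A = B.T @ B + (lam / grid.size) * (D2.T @ D2)
    return np.linalg.solve(A, B.T @ y)

def wiggle(c):
    return float(np.mean((D2 @ c) ** 2))  # mean squared second derivative

c_loose = penalized_fit(0.0)
c_tight = penalized_fit(10.0)
print(wiggle(c_loose) > wiggle(c_tight))  # True: the penalty calms the fit
```

In a full Bayesian treatment the same quantity would enter as a prior factor on the curve rather than a fixed penalty, but the knob it turns is the same.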

]]>How do you determine how many fake datapoints to sample? Capturing your prior uncertainty would then seem to be a function of both the spread of your informative prior on the f_i values and the number of samples taken in the fake censored region, so it seems harder to capture a desired degree of vagueness.

]]>The sociological problem stems from people applying statistical methodologies to problems they do not understand.

]]>Yes, but one can still ask how good it is to use automatic/default penalization, i.e. rescaling the data and then shrinking parameter estimates towards 0, as is done in penalized regression, or using a default normal prior centered on 0 and with an SD of 1. In my simulations, I made the true coefficients relatively close to 0, so you’d expect penalization to help. But I found that it hurts rather than helps if there’s enough data.

]]>It obviously depends on how well your informative prior captures the true value of the parameter.

When sampling, for example, simply insert the current sampled value of f_i into the function L to compute the posterior density.

In the end you marginalize away the f_i values by pooling all the samples.

]]>I wonder if what’s a reasonable default choice might depend on how much data there is. Recently, I ran several (unsystematic) simulations with logistic regression to see how regularization with an L2 penalty compares with MLE when it comes to recovering the correct parameter values, just to satisfy my own curiosity. Maybe others have a different experience, but my (informal) impression was that L2 penalization is definitely better when there are few observations per parameter (which is unsurprising), but that MLE tends to be better when there’s more data. So the penalty (or prior) is helpful when there’s not much data, but becomes somewhat of a liability as the amount of data increases beyond a certain point. So there might be a place for MLE — or, more generally, uninformative priors — when there’s a reasonable number of observations per parameter (obviously, when there’s tons of data, the penalty/prior stops mattering altogether).
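For anyone curious, here is a small self-contained version of that kind of simulation (the coefficient values, penalty weight, and sample sizes are my own arbitrary picks, not the commenter’s actual setup):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(2)

def fit_logistic(X, y, lam=0.0):
    """Logistic regression by minimizing the negative log-likelihood
    plus an optional L2 penalty lam * ||b||^2; lam=0 gives the MLE."""
    def nll(b):
        z = X @ b
        return np.sum(np.logaddexp(0.0, z) - y * z) + lam * b @ b
    def grad(b):
        return X.T @ (expit(X @ b) - y) + 2.0 * lam * b
    return minimize(nll, np.zeros(X.shape[1]), jac=grad, method="BFGS").x

def recovery_error(n, lam, beta, reps=30):
    """Average distance between the estimate and the true coefficients."""
    errs = []
    for _ in range(reps):
        X = rng.normal(size=(n, beta.size))
        y = rng.binomial(1, expit(X @ beta)).astype(float)
        errs.append(np.linalg.norm(fit_logistic(X, y, lam) - beta))
    return float(np.mean(errs))

beta = np.array([0.3, -0.2, 0.1, 0.0, -0.1])   # smallish true effects
small_mle = recovery_error(30, 0.0, beta)
small_pen = recovery_error(30, 1.0, beta)
big_mle   = recovery_error(2000, 0.0, beta)
big_pen   = recovery_error(2000, 1.0, beta)
print(small_mle, small_pen)   # penalization helps at n = 30
print(big_mle, big_pen)       # differences mostly wash out at n = 2000
```

With a fixed penalty, the prior’s share of the posterior shrinks as n grows, which is consistent with the “liability beyond a certain amount of data” impression above.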

]]>> How are you supposed to fit the polynomial to the f_i given that they are unknown parameters?

> Specifically by using your “likelihood” L, but instead of plugging in observed data values f_i you plug in the parameter values f_i

That I gathered, but my issue was just how, exactly, you’re supposed to plug f_i into L, given that f_i is not a number but rather a parameter with an associated prior p. You could plug in the expected value of f_i over p; is that what you had in mind?

]]>The point about ignoring prior knowledge is certainly good.

A nitpick, however: in the era when witch trials were a real thing, witchcraft wasn’t seen as exclusive to women. So I think that water connection is a bit strained.

]]>>How are you supposed to fit the polynomial to the f_i given that they are unknown parameters?

Specifically by using your “likelihood” L, but instead of plugging in observed data values f_i you plug in the parameter values f_i

In other words, it’s not two different ideas, it’s just one idea.

>Are you saying that the two ideas are equivalent?

yes. It’s really just one idea: “invent some fake data which are parameters, and then put them into the posterior just as if they were data, through a ‘likelihood.’”

>is it necessary to have an original prior p(C) over the coefficients, or is it in principle possible to simply use L(f(C,t_i)-f_i)/Z as the prior?

If you put extremely broad Gaussian priors on p(C) you guarantee a proper prior, and as those priors go toward infinite variance they have no real effect on the fit.

However, every polynomial can be parameterized as an interpolating polynomial through a set of control points, so whether you have priors on the parameters, or priors on the value of the function at interpolation points, they both should be proper.

]]>This is a very interesting idea, but I have a few naive and confused questions that will probably demonstrate that I’m not properly understanding it. You say:

“Invent some parameters that represent (t_i,f_i) for the function values at the chosen time points t_i in the gap. Provide informative priors on the f_i values. Then fit the polynomial to the full set of (t_j,f_j) for j including all real data, and the i values represented by your parameters.”

How are you supposed to fit the polynomial to the f_i given that they are unknown parameters? Do you use the expected value of f_i over your prior? Also, you say:

“Think of it this way, whatever your prior is on the coefficients p(C), multiply it by a dimensionless function L(f(C,t_i)-f_i) describing how closely the function f should come to the parameter f_i. This new density: p(C) L(f(C,t_i)-f_i)/Z is your new prior. Fortunately for you, in an MCMC type situation, you don’t need to calculate Z.”

This idea is similar to some ideas I’ve also had, but it sounds somewhat different from your preceding idea of fitting the function to “pseudo-data points.” Are you saying that the two ideas are equivalent? Also, is it necessary to have an original prior p(C) over the coefficients, or is it in principle possible to simply use L(f(C,t_i)-f_i)/Z as the prior?

]]>Adam:

Indeed, one thing I like about lasso and machine learning is that they demonstrate the value of regularization, separate from any formal Bayesian perspective. I’d much rather have the non-Bayesians doing non-Bayesian regularization than have them doing least squares and maximum likelihood.

]]>Well, I would have said goats, not sheep.

(Anyhow, brave of you to have a dog picture rather than a cat or baby alligator picture. ;~) )

Oh, it feels like 2010. Brings back good memories.

]]>> following climate science I have learned that the issue of priors is a controversial one

I could not agree more:

Nic: … think of the posterior PDF as a way of generating a CDF and hence credible intervals rather than being useful in itself. I agree that realistic posterior PDFs can be very useful, but if the available information does not enable generation of a believable posterior PDF then why should it be right to invent one?

But with this comment, you seem to have adopted a strange position that may be unique to you. Frequentists usually don’t have much use for a posterior PDF for any purpose. And I think “objective” Bayesians aim to produce a posterior PDF that is sensible. I’m puzzled why you would bother to produce a posterior that you don’t believe is even close to being a proper expression of posterior belief, and then use it as a justification for the credible intervals that can be derived from it. If these credible intervals have any justification, it can’t be that. And in fact, for this example, you can (and do) justify these intervals as being confidence intervals according to standard frequentist arguments (albeit ones that I think are flawed in this context). So what is the point of the whole objective Bayesian argument?

https://climateaudit.org/2014/04/17/radiocarbon-calibration-and-bayesian-inference/#comment-547957

]]>Thank you!

]]>Here.

]]>Suppose you have a model of some process, such as a time-series curve fit using a Chebyshev polynomial or a Fourier series, and you have some gaps in the data: either a period in the past where no information was recorded, or the near future where you haven’t observed anything yet.

You have prior knowledge about the value of the function in regions of space that the missing points are in.

Since your model is on the coefficients of the polynomial/Fourier series, how could you encode your knowledge about what should happen in the gap? It “needs to be a prior on the coefficients,” right?

Well, not exactly.

Invent some parameters that represent (t_i,f_i) for the function values at the chosen time points t_i in the gap. Provide informative priors on the f_i values. Then fit the polynomial to the full set of (t_j,f_j) for j including all real data, and the i values represented by your parameters.

What this does is it induces a prior on your coefficients via the “masking” technique I described here: http://models.street-artists.org/2019/11/04/deformation-factors-vs-jacobians-and-all-that-jazz/

Think of it this way, whatever your prior is on the coefficients p(C), multiply it by a dimensionless function L(f(C,t_i)-f_i) describing how closely the function f should come to the parameter f_i. This new density: p(C) L(f(C,t_i)-f_i)/Z is your new prior. Fortunately for you, in an MCMC type situation, you don’t need to calculate Z.

This technique works extremely well: it regularizes your fit so that it doesn’t “misbehave” in regions where no data are available. However, you’d better have a good sampler, because it inherently produces dependent structure between the f_i parameters and the C parameters.
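To make this concrete, here is a compact sketch of the pseudo-data trick (every number below is invented for illustration; for brevity I optimize the unnormalized log posterior rather than run MCMC, but in a sampler the same density p(C) L(f(C,t_i) − f_i) × priors × likelihood is the target):

```python
import numpy as np
from numpy.polynomial import chebyshev as Cheb
from scipy.optimize import minimize

rng = np.random.default_rng(4)

# Observed data on both sides of a gap in the middle of [-1, 1]
t_obs = np.concatenate([np.linspace(-1, -0.2, 15), np.linspace(0.4, 1, 15)])
y_obs = np.cos(3 * t_obs) + rng.normal(scale=0.05, size=t_obs.size)

deg = 9
def f(c, t):
    return Cheb.chebval(t, c)            # Chebyshev series, coefficients c

# Pseudo-data parameters f_i at chosen gap times, with informative priors:
# "the curve stays near 1 in the gap" (a made-up piece of prior knowledge)
t_gap = np.array([0.0, 0.2])
m_gap, s_gap = 1.0, 0.1                  # prior mean and SD for the f_i
sigma = 0.05                             # data noise scale
s_L = 0.05                               # width of the "masking" factor L

def neg_log_post(theta):
    c, f_gap = theta[:deg + 1], theta[deg + 1:]
    out = np.sum((y_obs - f(c, t_obs)) ** 2) / (2 * sigma**2)  # likelihood
    out += np.sum((f(c, t_gap) - f_gap) ** 2) / (2 * s_L**2)   # L(f(C,t_i)-f_i)
    out += np.sum((f_gap - m_gap) ** 2) / (2 * s_gap**2)       # priors on f_i
    out += np.sum(c ** 2) / (2 * 10.0**2)                      # broad p(C)
    return out

theta0 = np.zeros(deg + 1 + t_gap.size)
c_map = minimize(neg_log_post, theta0, method="BFGS").x[:deg + 1]
print(f(c_map, t_gap))                   # pinned near 1, as the prior asks
```

The fit still tracks the observed data closely, but in the gap the induced prior keeps the curve near the pseudo-data values instead of letting the polynomial do whatever extrapolation it likes.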

]]>But it’s even worse than that. After all, if data is all that is important, it ought to convince even people who hold a strong prior going the other direction. If the null represents conventional wisdom (I know… it doesn’t) then why shouldn’t we continue to believe it not just against flat priors, but against strongly biased priors in favor of the conventional wisdom? Isn’t that just what the conventional wisdom is?

]]>As someone using these tools, I know how to make sure I am not just seeing the prior in the result, but it is not an easy task to explain this seriously to p-value-greedy reviewers.

Underwhelmed.

]]>My thoughts exactly—but expressed better than I would have done!

Bob76

]]>I certainly don’t think that using an informative prior is “formally equivalent” to data falsification.

]]>The Spanish Journal of Psychology?

I’m curious how you “stumble across it”. Seems pretty obscure.

]]>Simon:

I discussed that in my article, Ethics and the statistical use of prior information.

]]>I was thinking about witches yesterday. Sink or float; if she floats, she’s a witch but if she drowns, then she was innocent. In what system of belief is that a sensible test? I’m not exactly sure, but I can say it embodies prior beliefs so powerfully there’s no way to pass the test without dying. Witches float. If you try to ‘save’ a drowning woman, she might actually be a witch waiting to float. Very powerful prior becomes a given, and overwhelms the notion that not being a witch equals death.

If you approach this as an informative prior, not as a given, then you can still say witches float, but you can adjust how long they must be submerged: rather than assuming the tiniest hint of a possibility that she might still float (and not merely lie face down, dead, in the water), assume instead the possibility of innocence, counting back to at least trying to save them before they give up the ghost. The prior data included is: they always drown, so the test is only good at killing the innocent. This becomes apparent when you view whatever you are doing as occurring within a system that you guess about as best you can. We reach for certainty, but any certainty is temporal.

The story is a gross simplification, but the point is that any test you make embodies your prior beliefs, especially when those beliefs have become givens.

Not sure why I thought of this, but that kind of story about witches, true or not, embodies deep metaphors about women. Ophelia drowns. The Flood washes away humanity to make a fresh start. Women not only bleed but they break their water, carrying within them the metaphor of life coming from water. Men only bleed when wounded, but women bleed regularly (or irregularly) as though they’ve been wounded. I assume the witch trial comes from the idea that the waters of creation would reject the wicked, causing them to float, while they would embrace the innocent, causing them to drown. That metaphor completes as: then their souls rise and float in the afterworld. That means this form of trial would represent human sacrifice, twisted around so a story about evil justifies the sacrificial act. Informative priors help dig out relationships.

]]>The entrails of this goat say….. yes!

]]>Funny that a paper in such an obscure journal would circulate so much. Extreme claims get attention, I guess.
