might be a better place to discuss details.

]]>There are no weights. Stan just adds the log densities directly.

]]>Thank you so much for your quick reply.

Does this mean that Stan considers both pieces of prior information to be equally important by directly adding their log density? Doing this, are we assuming these pieces of prior information are independent of each other? What would it be like if they are not independent?

Thank you!

]]>Each “target +=” statement (or the equivalent “~” statement) adds a term to the log density. In this case, you’re adding log(normal(mu|a,b)) + log(normal(mu|c,d)), which happens to be mathematically equivalent to log(normal(mu | (a/b^2 + c/d^2)/(1/b^2 + 1/d^2), sqrt(1/(1/b^2 + 1/d^2))). No weights are implicitly assigned. The log densities are simply added.

]]>target+=normal_lpdf(mu|a,b); \\ prior information 1

target+=normal_lpdf(mu|c,d); \\ prior information 2

What weights does Stan implicitly use for pooling these two pieces of prior information? If we want to change the default weights, how should we deal with this?

Many thanks.

]]>From a formal perspective, every normalizable density on {A,B} is a valid prior. The question is whether it expresses a state of information that approximates your *substantive* knowledge. That’s not a formal question, and it can only be answered by investigating the implications of the prior for things you know substantively, and accepting that they approximate what you actually do know.

If your substantive knowledge is of the form “I know the absolute probability density in the vicinity of all X and X is a function of Y and I want to sample Y” then you need a jacobian.

If your substantive knowledge is “Y is of an approximate given size y and also the more F(y) is similar to 0 the more credible the y value is” then you don’t need a jacobian, because you have only *relative* knowledge in the space of F.

Put another way, we can make a dimensional argument. Suppose y is in units of length, and F(y) is in units of energy. Suppose we say that the density in terms of energy is known… pe(E) is a known function. Now we want to derive a density in terms of per unit length:

pe(F(y)) is in units of probability per unit energy, if we multiply it by length, we’ll get probability * length/energy, dimensions are wrong.

Now suppose instead we start with a density per unit length:

pl(y)

and we multiply it by a function that’s a ratio of probability per unit energy to some given probability per unit energy (a ratio of two densities, like say p(x) / max(p(x)) for example) and this function is pi(F(y)). It’s dimensionless, so we have

pl(y) pi(F(y)) is still probability per unit length.

Whenever information is *relative*, then a Jacobian correction doesn’t make sense. What makes sense instead is a “prior predictive check” to see if the expressed relative information really does imply the kinds of things you expect it to…

]]>Essentially a jacobian “correction” allows you to sample in one space A in such a way that you induce a particular given known density on B when B is a known function of A.

if B = F(A) is a one to one and onto mapping (invertible) and you know what density you want B to have (pb(B)), then there is only one density you can define for A which will cause A to be sampled correctly so that B has the right density. Most people might work out that since B = F(A) the density should be pb(F(A))… but instead…

To figure this out, we can use nonstandard analysis, in which dA and dB are infinitesimal numbers, and we can do algebra on them. We will do algebra on *probability values*. Every density must be multiplied by an infinitesimal increment of the value over which the density is defined in order to keep the “units” of probability (densities are probability per unit something).

We want to define a density pa(A) such that any small set of width dA at any given point A* has total probability pb(F(A*)) abs(dB*)

That is, we have the infinitesimal equation:

pa(A) dA = pb(F(A)) abs(dB)

solve for pa(A) = pb(F(A)) abs(dB/dA)

if we said pa(A) = pb(F(A)) we’d be wrong by a factor involving the derivative dB/dA = dF/dA evaluated at A, which is itself a function of A. The absolute value is to ensure that everything remains positive.

abs(dB/dA) is a jacobian “correction” to the pb(F(A)) we derived naively at first.

—————

So, the applicability is when

1) you know the density in one space

2) you want to sample in a different space

3) There is a straightforward transformation between the two spaces

In Andrew’s example, this isn’t the case. We are trying to decide on a probability density *in the space where we’re sampling* A, and we’re not basing it on a known probability density in another space. Instead we’re basing it on

1) Information we have about the plausibility of values of A based on examination of the value of A itself. p(A)

2) Information about the relative plausibility of a given A value after we calculate some function of A… as I said, this is a kind of “mask”. pi(F(A))

Now we’re defining the full density P(A) in terms of some “base” density little p, p(A) and multiplying it by a masking function pi(F(A)) and then dividing by Z where Z is a normalization factor. So the density on space A is *defined* as P(A) = p(A) pi(F(A)) / Z

Notice how if p(A) is a properly normalized density, then pi(F(A)) K is a perfectly good mask function for all values of K, because the K value is eliminated by the normalization constant Z, which changes with K. In other words, pi tells us only *relatively* how “good” a given A value is in terms of the value of its F(A) value. It need not tell us any “absolute” goodness quantity.

Should “how much we like A” depend on how many different possible A values converge to the region F(A)? I think this seems wrong. If you have familiarity with photoshop, think like this: you want to mask away a “green screen”, should whether you mask a given pixel depend on “how green that pixel is” or “how many total green pixels there are”? The *first* notion is the masking notion I’m talking about, it’s local information about what is going on in vicinity of A, the second notion is the probability notion: how much total spatial locations get mapped to “green” that’s “probability of being green”

For example pi(F(A)) = 1 is a perfectly good mask, it says “all values of F(A) are equally good” (it doesn’t matter what color the pixel is). Clearly pi is not a probability density, since you can’t normalize that. You could also say “A is totally implausible if it results in negative F(A)” (if the pixel is green don’t include it) so then pi(F(A)) = 1 for all F(A) >= 0 and 0 otherwise is a good mask function as well. It’s also not normalizable in general.

If you start including “jacobian corrections” in your mask function, then your mask function isn’t telling you information like “mask this based on how green it is” it’s telling you some mixed up information instead that involves “how rapidly varying the “greenness measurement” is in the vicinity of this level of greenness”.

]]>The suggestion to layer strong priors over weak priors seems like it could be very useful in many contexts. I’d like to understand the issues you raised more completely.

]]>And yes, there will always be bugs/errors you need to fix. This is not something to be ashamed of.

]]>Obligaory but: the difficulty (for me) is in determining whether something works or not; for someone who only studied statistics as a minor subject, I’m prone to making lots of mathematical errors, not noticing obvious things and so on. It’s happened to me a zillion times that I think I’ve discovered something neat and later I realize I’ve made some obvious error.

]]>Implicitly, Andrew seems to be implementing the pooling function that Dietrich and List refer to as “multiplicative pooling.” As Dietrich and List argue, and as you also note, this pooling function is well-motivated if the probability functions to be pooled are based on independent pieces of information. However, if the probability functions are based on different perspectives on the same information or on correlated information, then some other pooling function ought to be used. In the extreme case where the probability functions are based on competing perspectives on exactly the same information, Dietrich and List argue that geometric pooling makes sense (where you take a geometric average of the probability functions).

]]>if prediction is all that matters then constraining the function as andrew has done would work just fine.

Actually, I don’t see this either. Your chains aren’t going to converge unless the constraints limit you to one unique value for every possible beta_ad*s.

]]>calibrated posterior probabilities can be incredibly useful for out-of-sample prediction

You could use the output as an additional feature and it may help. But if you want predictive skill, then optimize for predictive skill. Do not optimize for some other thing that may correlate with the thing you want*…

* unless it makes sense for efficiency/computational reasons because the gradient isn’t smooth or whatever

]]>https://www.stat.washington.edu/raftery/Research/PDF/poole2000.pdf

]]>However, calibrated posterior probabilities can be incredibly useful for out-of-sample prediction if there are real decision theoretic considerations associated with model outputs. ML is not always the way to go in such circumstances; assuming that’s what you were hinting at.

]]>I’m building a model to understand the effects of advertising on sales, controlling for other factors like pricing.

Anyway, if you are solely going for prediction you want to use some method that optimizes for out-of-sample predictive skill, not the posterior probability of the parameters.

]]>`f = s*ad_spend`

. Then that term is `beta_ad*s*ad_spend`

.
Obviously this:

`beta_ad = 1`

`s = 1`

Gives the same outcome as:

`beta_ad = 2`

`s = 0.5`

Etc, there are many different pairs of values that can result in the same product. I just don’t see how andrew’s example is necessarily solving this problem.

]]>sales = alpha + beta_ad * f(ad_spend, s) + beta_price * log(price)

[…]

The challenge is that different sets of parameter values for beta_ad and s yield very similar results in terms of incremental sales.

We aren’t told what function *f* refers to, but I see this and assume that *beta_ad* and *s* are underdetermined. Ie, different combinations of values for those two parameters can yield the same outcome for sales. So the solution is to either collapse them into one parameter or heavily constrain at least one of them with your prior, etc.

Is that what everyone is talking about in here?

]]>Then you could make a copy of this Stan code, add in the data parts, and do inference. The biggest issue is to make sure in your workflow that if you want to adjust your prior, you first do it in the “data free” file, re-run to ensure it still makes sense, and then transfer the equivalent code over to the version with the data.

The only other option would be to have one Stan file, and for every parameter have two versions, a and a_star, b and b_star etc… Use the a,b… parameters in the full inference but leave a_star and b_star out of the likelihood, so then you can sample simultaneously from the prior and the posterior.

]]>That is what my notation was meant to convey but did not.

normal(a|0,10) * normal(b|0,10) * normal(f(a,b) | 4.5, 0.2) being your [a,b] da db

normal(a’|0,10) * normal(b’|0,10) * normal(f(a’,b’) | 4.5, 0.2) being your [q,r] dq dr

Yes, I agree this is a challenge. Not so much an issue in the above example where the initial priors on a and b are so weak, but it could be a concern in more complicated problems.

In terms of workflow, I’m moving toward the use of weak priors as defaults, and then throwing strong priors on top of them as needed. In some ways it’s cleaner to add in new prior information, rather than to swap out priors.

]]>normal(a,0,1) * normal(b,0,1) * b^2*a^2 is a perfectly valid probability density that squashes the density near 0 and upweights stuff away from zero quadratically… since the normal density function declines to zero in the tails rapidly enough, the whole thing is a valid unnormalized density function (it’s normalizable) even though b^2*a^2 is not a valid density function on R^2 (you can’t normalize it on the whole real plane)

if you tried to do some kind of “jacobian correction” here by multiplying by terms related to 2*b*a^2 and 2*a*b^2 or some such thing you’d be completely confusing yourself.

]]>No, normal(f(a’,b’) | 4.5,0.2) in this case is just a weighting function that downweights regions of space by a factor that is calculated based on the locally calculated value of f(a’,b’) it doesn’t express

“the probability of f(a’,b’) to be within epsilon of any given value is the same as the probability of a random variable x with density normal(x,4.5,0.2) to be within epsilon of that given value”

Jacobian corrections are required when your mathematical statement is equivalent to “the probability to be in region [a,b,c] da db dc is the same as the probability of transformed values [q,r,x] = f([a,b,c]) to be in the region dq dr dx

that is, a Jacobian correction is designed to preserve a measure across a transformation into an equivalent expression in an alternative space

it is not required when you are *intentionally distorting a measure* by multiplying the density by a given distortion factor.

IMHO most of the applications for this kind of technique are intentionally “masking out” regions of space that don’t make sense because they fail to result in derived calculations that make sense.

so if p(a,b) is some valid prior over a,b then p_prime(a,b) = p(a,b) * mask_function(a,b) / Z is some new valid prior over a,b provided the new function is still non-negative and integrable. It intentionally expresses *different information* than the original p(a,b), and how valid that information is is a *subject matter question* not a formal question of how to do probability algebra.

]]>target += normal(a |0,10)

as specifying the prior over a… but it doesn’t it’s just one component of a more complicated expression that expresses the full prior. The more complicated expression is the full 3 lines…

Andrew puts the comments

// weak prior on a

// weak prior on b

// informative prior on a+5*b

but these comments are misleading, they should say

// weak prior information about a

// weak prior information about b

// informative prior information about a+5*b

the prior *distribution* is the density calculated by multiplying these surfaces over (a,b) space together (or adding the logarithms in this case)

But, YES the *structure* of the calculation has the same structure as taking a prior distribution and multiplying it by a likelihood function.

The difference is that the likelihood function expresses a conditional probability in data space, whereas the normal(a+b*5|…) doesn’t express a conditional probability in “a+5*b space”

if a,b are known then a+5*b has a delta function probability density over a precise value, so p(a+5*b | a,b) = normal(….) doesn’t make any sense taken by itself.

]]>If you specify a prior in this way (i.e. diffuse independent priors on the parameters, followed by informative priors on functionals of the parameters), how does one do prior predictive checks?

It is easy to sample from the independent prior, but sampling from the complete prior (diffuse independent factors x informative factors) seems to incur all the same difficulties as traditional posterior sampling. This seems like it would make prior predictive checks infeasible or at least much more difficult than when using a “convenience” prior.

]]>Issue 2: This assumes you are trying to derive a distribution over theta by specifying what the probability density function will be over the f space… which in Andrew’s example is *not* what you are doing. Instead you are saying ” the density p(a,b) is normal(a|..) * normal(b|…) * normal(a+5*b|…)” which works out to something proportional to exp(-(312.5*b^2+125*a*b – 562.5*b+12.5*a^2 – 112.5*a + 253.125))

if that is the density you want, then doing some “jacobian corrections” is really not a correction so much as expressing a *different* prior. The question is “which one do you want?” both are valid priors (ie. both surfaces over a,b space express a probability distribution that after normalizing integrates to 1), but they express different information.

In the case andrew mentions the information is local, and relative, of the form “whenever a+5*b is far from 4.5 reduce the density in that region of a,b space by a factor…”

in the case you mention with the jacobian correction the information is global and absolute: “sample theta such that the total probability to be in the region f(theta*) by within an amount df is foo(f…) df”

]]>The priors over a and b fully specify a joint prior density for a and b.

Therefore it would appear that:

The prior over the quantity a+5*b is not part of the original prior density.

It is a likelihood function that is combined with the original prior to generate a posterior density that is now being assumed to be the prior.

Is that clear for all the new users of Bayesian theory who read this blog? :-)

]]>I have a followup question which might be too involved but hopefully has a simple answer. This only works if the pieces of prior information are independent, right? For example, suppose the people who ran the trial in Nick’s example also collected results from the same participants for a related outcome (say, total number of purchases in addition to sales in $) that Nick could also somehow map back onto an implied distribution for his regression coefficients. Then I presume it would not be ok to just slap on another line adding the log of this new distribution to the posterior because it’s not completely new information? But how to define how much of the information is new and add only that?

]]>p(a,b) \propto normal(a|0,10) * normal(b|0,10) * normal(a+5*b|4.5,0.2)

all put together… the normal(a+5*b|…) part is just one factor in the expression of the overall prior.

]]>It appears that you are saying that the prior over the quantity a+5*b is not in fact a prior density but a likelihood function.

In particular, a likelihood function based on the joint probability density of observing some fake data given the parameters a and b.

Therefore why call it a prior density?

]]>Gabry et al. wrote a nice paper working through a real example (billed as visualization).

We usually recommend going back to the drawing board on priors after fitting the joint model if the posterior predictive checks don’t make sense and the cause is inconsistent priors. It may be “peeking”, but it’s the right thing to do if you want to make good predictions.

]]>f(theta) ~ foo(...);

then the Jacobian adjustment on the log scale is this, where f_prime(theta) is the derivative of f evaluated at theta.

target += log(f_prime(theta));

There are several examples in the user’s guide chapter on transforms. If the function is multivariate, you need the determinant of the Jacobian in place of f_prime, i.e.,

target += log(determinant(J(theta)));

where J(theta) is the Jacobian of theta, det is the determinant function, and log is log. It’s better to write log_determinant, as the composed function’s built in.

One also has to be careful if the transform looks linear, but involves a variable, as in f(theta) = alpha * theta where alpha is a parameter. In this case, you need the Jacobian adjustment,

target += log(alpha);

which follows from the derivative d/d.theta alpha * theta = alpha. When alpha is a constant, this drops out, but when alpha is a parameter, we need the Jacobian.

]]>It’s not “naughty” to make sure your prior over parameters actually does encode the information you have… the only thing that would be problematic is to try to tune the priors to match actual data you’re peeking at… because then you’re at least partially conditioning on that data, and when you incorporate that data into your likelihood you’ll be in essence over-conditioning

]]>I’d say multiple pieces of prior information, rather than multiple prior distributions.

]]>Suppose for example voltage on some electronic component is a nonlinear function of temperature.

you define this nonlinear function in terms of say a fourier series with 6 terms… so you have a,b,c,d,e,f as the coefficients

now you can say things like “the derivative at T=0 should be near zero” and “the value of the function at T=1 should be near 1” and “the derivative of the function near 0.5 should be near 1” and “the average value between T=0 and T=1 should be near 1”

and specify those using this same “masking” technique. This requires some algebra/calculus but basically will look like target += normal(f(1,a,b,c,d),1,0.1) or stuff like that

Now, to actually calibrate your choice of prior, you should be generating many prior functions and plotting them, if they fail to conform to your expectations it’s because you aren’t “masking hard enough” or you’re “masking too much” etc so you should adjust your specification until the kinds of functions you plot are the kinds of functions you expect.

]]>Each “target+=” line of the code adds a term to the log-posterior. The way to think of “normal(a | 0, 10);” is *not* that a is normally distributed with mean 0 and sd 0; rather, “normal(a | 0, 10);” is one piece of information added to the model. The code above has three lines after the data model, thus three pieces of prior information. In that example, the first two lines (“normal(a | 0, 10);” and “normal(b | 0, 10);”) represent weak information.

As its fake data you call it a prior but Daniel seems to be questioning if it has all the properties of a prior?

Interesting thought.

]]>