## Yes, you can include prior information on quantities of interest, not just on parameters in your model

Nick Kavanagh writes:

I studied economics in college and never heard more than a passing reference to Bayesian stats. I started to encounter Bayesian concepts in the workplace and decided to teach myself on the side.

I was hoping to get your advice on a problem that I recently encountered. It has to do with the best way to encode prior information into a model in which the prior knowledge pertains to the overall effect of some change (not the values of individual parameters). I haven’t seen this question addressed before and thought it might be a good topic for a blog post.

I’m building a model to understand the effects of advertising on sales, controlling for other factors like pricing. A simplified version of the model is presented below.

Additional units of advertising will, at some point, yield lower incremental sales. This non-linearity is incorporated into the model through a variable transformation — f(ad_spend, s) — where the parameter s determines the rate of diminishing returns.

sales = alpha + beta_ad * f(ad_spend, s) + beta_price * log(price)

Outside the model, I have estimates of the impact of advertising on sales obtained through randomized experiments. These experiments don’t provide estimates of beta_ad and s. They simply tell you that “increasing advertising spend by \$100K generated 400 [300, 500] incremental sales.” The challenge is that different sets of parameter values for beta_ad and s yield very similar results in terms of incremental sales. I’m struggling with the best way to incorporate the experimental results into the model.

In Stan this is super-easy: You can put priors on anything, including combinations of parameters. Consider this code fragment:

```model {
target += normal(y | a + b*x, sigma);  \\ data model
target += normal(a | 0, 10);           \\ weak prior information on a
target += normal(b | 0, 10);           \\ weak prior information on b
target += normal(a + 5*b | 4.5, 0.2);  \\ strong prior information on a + 5*b
```

In this example, you have prior information on the linear combination, a + 5*b, an estimate of 4.5 with standard error 0.2, from some previous experiment.

The key is that prior information is, mathematically, just more data.

You should be able to do the same thing if you have information on a nonlinear function of parameters too, but then you need to fix the Jacobian, or maybe there’s some way to do this in Stan.

P.S. I’ve updated the comments on the above code in response to Lakeland’s suggestion in comments.

1. Andrew (not Gelman) says:

I admit I am confused by this!

Certainly if we specify P(a) and P(b), and z=a+5b, then this uniquely determines P(z).

So this means something other than that, correct?

• Z says:

Yeah, I’m also confused. It seems like there’s double specifying of priors…

• Keith O'Rourke says:

Think of it intuitively a fake data set which gives rise to a likelihood the same shape as normal(a+5*b | 4.5, 0.2).

As its fake data you call it a prior but Daniel seems to be questioning if it has all the properties of a prior?

Interesting thought.

• The naked statistician says:

Keith:

It appears that you are saying that the prior over the quantity a+5*b is not in fact a prior density but a likelihood function.

In particular, a likelihood function based on the joint probability density of observing some fake data given the parameters a and b.

Therefore why call it a prior density?

• Daniel Lakeland says:

It’s not a prior density, the prior density is

p(a,b) \propto normal(a|0,10) * normal(b|0,10) * normal(a+5*b|4.5,0.2)

all put together… the normal(a+5*b|…) part is just one factor in the expression of the overall prior.

• The naked statistician says:

Daniel:

The priors over a and b fully specify a joint prior density for a and b.

Therefore it would appear that:

The prior over the quantity a+5*b is not part of the original prior density.

It is a likelihood function that is combined with the original prior to generate a posterior density that is now being assumed to be the prior.

Is that clear for all the new users of Bayesian theory who read this blog? :-)

• Daniel Lakeland says:

You can kind of think of it this way, but it’s not quite right… the problem is you’re thinking of

target += normal(a |0,10)

as specifying the prior over a… but it doesn’t it’s just one component of a more complicated expression that expresses the full prior. The more complicated expression is the full 3 lines…

// weak prior on a
// weak prior on b
// informative prior on a+5*b

// weak prior information about a
// weak prior information about b
// informative prior information about a+5*b

the prior *distribution* is the density calculated by multiplying these surfaces over (a,b) space together (or adding the logarithms in this case)

But, YES the *structure* of the calculation has the same structure as taking a prior distribution and multiplying it by a likelihood function.

The difference is that the likelihood function expresses a conditional probability in data space, whereas the normal(a+b*5|…) doesn’t express a conditional probability in “a+5*b space”

if a,b are known then a+5*b has a delta function probability density over a precise value, so p(a+5*b | a,b) = normal(….) doesn’t make any sense taken by itself.

• Daniel Lakeland says:

Nope, see my comments below, you are in fact specifying a single prior that has 3 factors to it, two of them are independent and one of them is a “mask” that describes dependency information.

• Andrew says:

Andrew (not Gelman), Z:

Each “target+=” line of the code adds a term to the log-posterior. The way to think of “normal(a | 0, 10);” is not that a is normally distributed with mean 0 and sd 0; rather, “normal(a | 0, 10);” is one piece of information added to the model. The code above has three lines after the data model, thus three pieces of prior information. In that example, the first two lines (“normal(a | 0, 10);” and “normal(b | 0, 10);”) represent weak information.

• Z says:

Thanks, I think I see now. So this is an implementation of logarithmic pooling of multiple prior distributions from independent sources?

• Andrew says:

Z:

I’d say multiple pieces of prior information, rather than multiple prior distributions.

• Z says:

Makes sense, I like that better too.

I have a followup question which might be too involved but hopefully has a simple answer. This only works if the pieces of prior information are independent, right? For example, suppose the people who ran the trial in Nick’s example also collected results from the same participants for a related outcome (say, total number of purchases in addition to sales in \$) that Nick could also somehow map back onto an implied distribution for his regression coefficients. Then I presume it would not be ok to just slap on another line adding the log of this new distribution to the posterior because it’s not completely new information? But how to define how much of the information is new and add only that?

• Andrew says:

Z:

Yes, I agree this is a challenge. Not so much an issue in the above example where the initial priors on a and b are so weak, but it could be a concern in more complicated problems.

In terms of workflow, I’m moving toward the use of weak priors as defaults, and then throwing strong priors on top of them as needed. In some ways it’s cleaner to add in new prior information, rather than to swap out priors.

• Olav says:

This is interesting. It strikes me as a special case of the more general problem of opinion pooling: https://philpapers.org/archive/DIEPOP.pdf

Implicitly, Andrew seems to be implementing the pooling function that Dietrich and List refer to as “multiplicative pooling.” As Dietrich and List argue, and as you also note, this pooling function is well-motivated if the probability functions to be pooled are based on independent pieces of information. However, if the probability functions are based on different perspectives on the same information or on correlated information, then some other pooling function ought to be used. In the extreme case where the probability functions are based on competing perspectives on exactly the same information, Dietrich and List argue that geometric pooling makes sense (where you take a geometric average of the probability functions).

• Keith O'Rourke says:

Its all pooling – all the way down – http://www.stat.columbia.edu/~gelman/research/unpublished/Amalgamating6.pdf

• Z says:

Thanks for the reference!

• Alan Feng says:

I was wondering what weights are assigned to these two pieces of prior information through the logarithmic pooling by Stan?

• Andrew says:

Alan:

There are no weights. Stan just adds the log densities directly.

• Keith O'Rourke says:

And so as Daniel was pointing out, no Jacobian anything is needed for normal( f(a,b) | 4.5, 0.2) but would be for normal(a’|0,10) * normal(b’|0,10) * normal(f(a’,b’) | 4.5, 0.2) if a’ ,b’ are non-linear?

• Daniel Lakeland says:

> [a jacobian calculation] would be [required] for normal(a’|0,10) * normal(b’|0,10) * normal(f(a’,b’) | 4.5, 0.2) if a’ ,b’ are non-linear?

No, normal(f(a’,b’) | 4.5,0.2) in this case is just a weighting function that downweights regions of space by a factor that is calculated based on the locally calculated value of f(a’,b’) it doesn’t express

“the probability of f(a’,b’) to be within epsilon of any given value is the same as the probability of a random variable x with density normal(x,4.5,0.2) to be within epsilon of that given value”

Jacobian corrections are required when your mathematical statement is equivalent to “the probability to be in region [a,b,c] da db dc is the same as the probability of transformed values [q,r,x] = f([a,b,c]) to be in the region dq dr dx

that is, a Jacobian correction is designed to preserve a measure across a transformation into an equivalent expression in an alternative space

it is not required when you are *intentionally distorting a measure* by multiplying the density by a given distortion factor.

IMHO most of the applications for this kind of technique are intentionally “masking out” regions of space that don’t make sense because they fail to result in derived calculations that make sense.

so if p(a,b) is some valid prior over a,b then p_prime(a,b) = p(a,b) * mask_function(a,b) / Z is some new valid prior over a,b provided the new function is still non-negative and integrable. It intentionally expresses *different information* than the original p(a,b), and how valid that information is is a *subject matter question* not a formal question of how to do probability algebra.

• Daniel Lakeland says:

Note for example that the mask_function(a,b) need not be a valid probability density in and of itself. for example

normal(a,0,1) * normal(b,0,1) * b^2*a^2 is a perfectly valid probability density that squashes the density near 0 and upweights stuff away from zero quadratically… since the normal density function declines to zero in the tails rapidly enough, the whole thing is a valid unnormalized density function (it’s normalizable) even though b^2*a^2 is not a valid density function on R^2 (you can’t normalize it on the whole real plane)

if you tried to do some kind of “jacobian correction” here by multiplying by terms related to 2*b*a^2 and 2*a*b^2 or some such thing you’d be completely confusing yourself.

• Keith O'Rourke says:

> Jacobian corrections are required when your mathematical statement is equivalent to “the probability to be in region [a,b,c] da db dc is the same as the probability of transformed values [q,r,x] = f([a,b,c]) to be in the region dq dr dx

That is what my notation was meant to convey but did not.

normal(a|0,10) * normal(b|0,10) * normal(f(a,b) | 4.5, 0.2) being your [a,b] da db
normal(a’|0,10) * normal(b’|0,10) * normal(f(a’,b’) | 4.5, 0.2) being your [q,r] dq dr

2. Daniel Lakeland says:

>You should be able to do the same thing if you have information on a nonlinear function of parameters too, but then you need to fix the Jacobian, or maybe there’s some way to do this in Stan.

You’re constructing a prior on a,b by starting with a density

normal(a|0,10) * normal(b|0,10)

which is an independent normal prior…

and then you’re distorting this independent normal prior by downweighting regions of space where a+5*b isn’t close to 4.5

normal(a|0,10) * normal(b|0,10) * normal(a+5*b | 4.5, 0.2)

It doesn’t matter what you put there a+5*b or nonlinearfunction(a,b) it’s still just a weighting function that downweights regions of space where the outcome of the calculation isn’t within on the order of 0.2 to the value 4.5

No jacobian anything is needed here because this distortion of the density plays the same role as a likelihood.

For those who are familiar with digital imaging ie. photoshop etc you can think of the normal(a+5*b | …) as a “mask” that only lets the region of a,b space where the outcome of the calculation is what you want “peek through” the mask.

• Daniel Lakeland says:

Put another way, instead of thinking of normal(a+5*b | 4.5, 0.2) as “the probability that a+5*b is between [q,r] is a particular area under the normal curve between the values q and r” think of it as “the relative importance of any given region of a,b space should be decreased from its previously calculated amount by a factor proportional to exp(-((a+5*b – 4.5)/0.2)^2/2)”

• Daniel Lakeland says:

The same thing can be done for nonlinear functions of the parameters and data…

normal(g(data,a,b) | 2,1)

is a “conditional downweighting”… a,b are assumed known (they’re provided by the sampler) and data is measured, so there is no probability involved here, it simply says

“when the parameters a and b are known to take on certain values, then we should downweight the prior density in this region by a factor determined from calculating g(data,a,b) in which all the numerical values are known constants, and then calculating the value of the curve normal(g | 2,1) and multiplying the prior density by this factor to get the posterior (unnormalized) density”

or “after seeing data I judge the posterior probability of a,b being in this region to be only a factor proportional to normal(g(data,a,b)|2,1) as big as the prior density”

3. Daniel Lakeland says:

This is a particularly good way to write priors for parameters that describe functions of a covariate. Often you know a LOT more about what the behavior of the function should be than what the values of the parameters should be.

Suppose for example voltage on some electronic component is a nonlinear function of temperature.

you define this nonlinear function in terms of say a fourier series with 6 terms… so you have a,b,c,d,e,f as the coefficients

now you can say things like “the derivative at T=0 should be near zero” and “the value of the function at T=1 should be near 1” and “the derivative of the function near 0.5 should be near 1” and “the average value between T=0 and T=1 should be near 1”

and specify those using this same “masking” technique. This requires some algebra/calculus but basically will look like target += normal(f(1,a,b,c,d),1,0.1) or stuff like that

Now, to actually calibrate your choice of prior, you should be generating many prior functions and plotting them, if they fail to conform to your expectations it’s because you aren’t “masking hard enough” or you’re “masking too much” etc so you should adjust your specification until the kinds of functions you plot are the kinds of functions you expect.

4. Nik Vetr says:

Are there any rules of thumb for when this’ll induce weird, pathological behavior in the joint prior? When I’ve had cause to do this before I’d always run the model under the prior and inspect the joint and marginal distributions of ‘a’ & ‘b’ directly, but have always thought doing so to be naughty (as well as do prior predictive simulation of the data, y, to make sure the priors are sensible to begin with — especially with link functions & diffuse priors I’ve found it really easy to ‘believe’ a priori that a small change in x can plausibly make y denser than a black hole or bigger than the universe or w/e).

• Daniel Lakeland says:

Absolutely the right way to calibrate these kinds of priors is in terms of plots of generated quantities. Your information lives in generated space (either prior predictive distributions for data or prior knowledge of combined quantities) the parameters are just there to generate appropriate generated values.

It’s not “naughty” to make sure your prior over parameters actually does encode the information you have… the only thing that would be problematic is to try to tune the priors to match actual data you’re peeking at… because then you’re at least partially conditioning on that data, and when you incorporate that data into your likelihood you’ll be in essence over-conditioning

• Bob Carpenter says:

I was just going to write the same thing as Daniel Lakeland. We call them prior predictive checks—they’re just like posterior predictive checks, but they use no data, hence the posterior is the prior, and you get prior predictive checks. Usually that’ll show most of the prior mass in undesirable part of predictive space.

Gabry et al. wrote a nice paper working through a real example (billed as visualization).

We usually recommend going back to the drawing board on priors after fitting the joint model if the posterior predictive checks don’t make sense and the cause is inconsistent priors. It may be “peeking”, but it’s the right thing to do if you want to make good predictions.

• Timothy says:

@Daniel Lakeland and @Bob Carpenter,

If you specify a prior in this way (i.e. diffuse independent priors on the parameters, followed by informative priors on functionals of the parameters), how does one do prior predictive checks?

It is easy to sample from the independent prior, but sampling from the complete prior (diffuse independent factors x informative factors) seems to incur all the same difficulties as traditional posterior sampling. This seems like it would make prior predictive checks infeasible or at least much more difficult than when using a “convenience” prior.

• Daniel Lakeland says:

From a Stan perspective, it’s the same as doing inference with data, you just don’t use any data, sample from the prior alone, generate some “generated quantities” and see if those quantities make sense to you.

Then you could make a copy of this Stan code, add in the data parts, and do inference. The biggest issue is to make sure in your workflow that if you want to adjust your prior, you first do it in the “data free” file, re-run to ensure it still makes sense, and then transfer the equivalent code over to the version with the data.

The only other option would be to have one Stan file, and for every parameter have two versions, a and a_star, b and b_star etc… Use the a,b… parameters in the full inference but leave a_star and b_star out of the likelihood, so then you can sample simultaneously from the prior and the posterior.

5. Bob Carpenter says:

If the transform is univariate and non-linear,

```f(theta) ~ foo(...);
```

then the Jacobian adjustment on the log scale is this, where f_prime(theta) is the derivative of f evaluated at theta.

```target += log(f_prime(theta));
```

There are several examples in the user’s guide chapter on transforms. If the function is multivariate, you need the determinant of the Jacobian in place of f_prime, i.e.,

```target += log(determinant(J(theta)));
```

where J(theta) is the Jacobian of theta, det is the determinant function, and log is log. It’s better to write log_determinant, as the composed function’s built in.

One also has to be careful if the transform looks linear, but involves a variable, as in f(theta) = alpha * theta where alpha is a parameter. In this case, you need the Jacobian adjustment,

```target += log(alpha);
```

which follows from the derivative d/d.theta alpha * theta = alpha. When alpha is a constant, this drops out, but when alpha is a parameter, we need the Jacobian.

• Daniel Lakeland says:

Issue 1: “If the transform is univariate and non-linear” should also include “and one-to-one”. If more than one region of theta space results in the same f the whole thing is more complicated. This becomes particularly complicated when the parameters are high dimensional and highly non-invertible. For example when many values of a,b result in the same f(a,b)… and is particularly problematic when f(a,b) is constant in some region of a,b

Issue 2: This assumes you are trying to derive a distribution over theta by specifying what the probability density function will be over the f space… which in Andrew’s example is *not* what you are doing. Instead you are saying ” the density p(a,b) is normal(a|..) * normal(b|…) * normal(a+5*b|…)” which works out to something proportional to exp(-(312.5*b^2+125*a*b – 562.5*b+12.5*a^2 – 112.5*a + 253.125))

if that is the density you want, then doing some “jacobian corrections” is really not a correction so much as expressing a *different* prior. The question is “which one do you want?” both are valid priors (ie. both surfaces over a,b space express a probability distribution that after normalizing integrates to 1), but they express different information.

In the case andrew mentions the information is local, and relative, of the form “whenever a+5*b is far from 4.5 reduce the density in that region of a,b space by a factor…”

in the case you mention with the jacobian correction the information is global and absolute: “sample theta such that the total probability to be in the region f(theta*) by within an amount df is foo(f…) df”

• Joshua Pritikin says:

Can you expand on both of these issues?

The suggestion to layer strong priors over weak priors seems like it could be very useful in many contexts. I’d like to understand the issues you raised more completely.

• Daniel Lakeland says:

Both of these issues are about when to do jacobian “corrections” So the first thing you have to understand is what is a jacobian correction for.

Essentially a jacobian “correction” allows you to sample in one space A in such a way that you induce a particular given known density on B when B is a known function of A.

if B = F(A) is a one to one and onto mapping (invertible) and you know what density you want B to have (pb(B)), then there is only one density you can define for A which will cause A to be sampled correctly so that B has the right density. Most people might work out that since B = F(A) the density should be pb(F(A))… but instead…

To figure this out, we can use nonstandard analysis, in which dA and dB are infinitesimal numbers, and we can do algebra on them. We will do algebra on *probability values*. Every density must be multiplied by an infinitesimal increment of the value over which the density is defined in order to keep the “units” of probability (densities are probability per unit something).

We want to define a density pa(A) such that any small set of width dA at any given point A* has total probability pb(F(A*)) abs(dB*)

That is, we have the infinitesimal equation:

pa(A) dA = pb(F(A)) abs(dB)

solve for pa(A) = pb(F(A)) abs(dB/dA)

if we said pa(A) = pb(F(A)) we’d be wrong by a factor involving the derivative dB/dA = dF/dA evaluated at A, which is itself a function of A. The absolute value is to ensure that everything remains positive.

abs(dB/dA) is a jacobian “correction” to the pb(F(A)) we derived naively at first.

—————

So, the applicability is when

1) you know the density in one space
2) you want to sample in a different space
3) There is a straightforward transformation between the two spaces

In Andrew’s example, this isn’t the case. We are trying to decide on a probability density *in the space where we’re sampling* A, and we’re not basing it on a known probability density in another space. Instead we’re basing it on

1) Information we have about the plausibility of values of A based on examination of the value of A itself. p(A)

2) Information about the relative plausibility of a given A value after we calculate some function of A… as I said, this is a kind of “mask”. pi(F(A))

Now we’re defining the full density P(A) in terms of some “base” density little p, p(A) and multiplying it by a masking function pi(F(A)) and then dividing by Z where Z is a normalization factor. So the density on space A is *defined* as P(A) = p(A) pi(F(A)) / Z

Notice how if p(A) is a properly normalized density, then pi(F(A)) K is a perfectly good mask function for all values of K, because the K value is eliminated by the normalization constant Z, which changes with K. In other words, pi tells us only *relatively* how “good” a given A value is in terms of the value of its F(A) value. It need not tell us any “absolute” goodness quantity.

Should “how much we like A” depend on how many different possible A values converge to the region F(A)? I think this seems wrong. If you have familiarity with photoshop, think like this: you want to mask away a “green screen”, should whether you mask a given pixel depend on “how green that pixel is” or “how many total green pixels there are”? The *first* notion is the masking notion I’m talking about, it’s local information about what is going on in vicinity of A, the second notion is the probability notion: how much total spatial locations get mapped to “green” that’s “probability of being green”

For example pi(F(A)) = 1 is a perfectly good mask, it says “all values of F(A) are equally good” (it doesn’t matter what color the pixel is). Clearly pi is not a probability density, since you can’t normalize that. You could also say “A is totally implausible if it results in negative F(A)” (if the pixel is green don’t include it) so then pi(F(A)) = 1 for all F(A) >= 0 and 0 otherwise is a good mask function as well. It’s also not normalizable in general.

If you start including “jacobian corrections” in your mask function, then your mask function isn’t telling you information like “mask this based on how green it is” it’s telling you some mixed up information instead that involves “how rapidly varying the “greenness measurement” is in the vicinity of this level of greenness”.

• Christopher King says:

Hmm, do you have some references on this I could think more about? It isn’t clear to me how this is different from the cases where you definitely need a jacobian, say that I have prior information on exp(XA %*% BY) and will be sampling in the space of {A, B}. Implicit priors via functionals are appealing except for the tedious and error prone jacobians.

• Daniel Lakeland says:

Christopher: I don’t have references for you. Sorry.

From a formal perspective, every normalizable density on {A,B} is a valid prior. The question is whether it expresses a state of information that approximates your *substantive* knowledge. That’s not a formal question, and it can only be answered by investigating the implications of the prior for things you know substantively, and accepting that they approximate what you actually do know.

If your substantive knowledge is of the form “I know the absolute probability density in the vicinity of all X and X is a function of Y and I want to sample Y” then you need a jacobian.

If your substantive knowledge is “Y is of an approximate given size y and also the more F(y) is similar to 0 the more credible the y value is” then you don’t need a jacobian, because you have only *relative* knowledge in the space of F.

Put another way, we can make a dimensional argument. Suppose y is in units of length, and F(y) is in units of energy. Suppose we say that the density in terms of energy is known… pe(E) is a known function. Now we want to derive a density in terms of per unit length:

pe(F(y)) is in units of probability per unit energy, if we multiply it by length, we’ll get probability * length/energy, dimensions are wrong.

pl(y)

and we multiply it by a function that’s a ratio of probability per unit energy to some given probability per unit energy (a ratio of two densities, like say p(x) / max(p(x)) for example) and this function is pi(F(y)). It’s dimensionless, so we have

pl(y) pi(F(y)) is still probability per unit length.

Whenever information is *relative*, then a Jacobian correction doesn’t make sense. What makes sense instead is a “prior predictive check” to see if the expressed relative information really does imply the kinds of things you expect it to…

• Joshua Pritikin says:

Thanks! I love the green screen analogy.

• Chris Wilson says:

This is a really important discussion! Bob, Andrew, do you have any reactions to what Daniel is saying here? I think there are a lot of corner cases like this where Jacobian warnings come up, but they are really not appropriate to use (or even well-defined objects in many cases). It would be nice to see this elaborated more in Stan manual, or some paper on ArXiv that could be cited or something. Students and reviewers are going to be endlessly confused about this…

6. Anoneuoid says:

I’m somewhat baffled by the discussion here, I wonder if other people not coming from a math/stats background are as well. There seems to be a silent step everyone is using in their reasoning that I am missing. I see this as the problem:

sales = alpha + beta_ad * f(ad_spend, s) + beta_price * log(price)
[…]
The challenge is that different sets of parameter values for beta_ad and s yield very similar results in terms of incremental sales.

We aren’t told what function f refers to, but I see this and assume that beta_ad and s are underdetermined. Ie, different combinations of values for those two parameters can yield the same outcome for sales. So the solution is to either collapse them into one parameter or heavily constrain at least one of them with your prior, etc.

Is that what everyone is talking about in here?

• Anoneuoid says:

The simplest example would be if `f = s*ad_spend`. Then that term is `beta_ad*s*ad_spend`.

Obviously this:
`beta_ad = 1`
`s = 1`

Gives the same outcome as:
`beta_ad = 2`
`s = 0.5`

Etc, there are many different pairs of values that can result in the same product. I just don’t see how andrew’s example is necessarily solving this problem.

• AllanC says:

If the primary goal of the model is for inference I would agree with you. However, if prediction is all that matters then constraining the function as andrew has done would work just fine.

• Anoneuoid says:

It says:

I’m building a model to understand the effects of advertising on sales, controlling for other factors like pricing.

Anyway, if you are solely going for prediction you want to use some method that optimizes for out-of-sample predictive skill, not the posterior probability of the parameters.

• AllanC says:

Sorry, I didn’t read closely enough.

However, calibrated posterior probabilities can be incredibly useful for out-of-sample prediction if there are real decision theoretic considerations associated with model outputs. ML is not always the way to go in such circumstances; assuming that’s what you were hinting at.

• Anoneuoid says:

calibrated posterior probabilities can be incredibly useful for out-of-sample prediction

You could use the output as an additional feature and it may help. But if you want predictive skill, then optimize for predictive skill. Do not optimize for some other thing that may correlate with the thing you want*…

* unless it makes sense for efficiency/computational reasons because the gradient isn’t smooth or whatever

• Anoneuoid says:

if prediction is all that matters then constraining the function as andrew has done would work just fine.

Actually, I don’t see this either. Your chains aren’t going to converge unless the constraints limit you to one unique value for every possible beta_ad*s.

7. Nick Menzies says:

Sounds like what Poole and Raftery termed ‘Bayesian Melding’:

https://www.stat.washington.edu/raftery/Research/PDF/poole2000.pdf

8. I am not John Kruschke says:

Great post! I came up with something like this–well, EXACTLY like this–independently not long ago, but discarded the idea since I was uncertain whether it was legitimate or not. The discussion in this post gives me more clarity and convinces me that this indeed is a legitimate move. Of course, it’s just more prior info. Yes!

• Anoneuoid says:

I don’t think you should have to get the approval of an authority figure to determine if a method is legitimate or not. All that matters is whether it works.

• I am not John Kruschke says:

I agree.

Obligaory but: the difficulty (for me) is in determining whether something works or not; for someone who only studied statistics as a minor subject, I’m prone to making lots of mathematical errors, not noticing obvious things and so on. It’s happened to me a zillion times that I think I’ve discovered something neat and later I realize I’ve made some obvious error.

• Anoneuoid says:

Just simulate a bunch of different datasets and study how well your method captures them. The “experts” may be helpful in pointing out possible gotchas, bugs, or useful tools just because they have been through the process before. But they are not required to prove to yourself it works.

And yes, there will always be bugs/errors you need to fix. This is not something to be ashamed of.

9. Alan Feng says:

I was wondering in Stan what the weights are implicitly assigned to these different pieces of prior information for the same parameter of interest? For example, a bit different from what is given by Andrew here, we have multiple pieces of prior information regarding a model parameter (say, mu), and if I do in Stan like:

target+=normal_lpdf(mu|a,b); \\ prior information 1
target+=normal_lpdf(mu|c,d); \\ prior information 2

What weights does Stan implicitly use for pooling these two pieces of prior information? If we want to change the default weights, how should we deal with this?

Many thanks.

• Andrew says:

Alan:

Each “target +=” statement (or the equivalent “~” statement) adds a term to the log density. In this case, you’re adding log(normal(mu|a,b)) + log(normal(mu|c,d)), which happens to be mathematically equivalent to log(normal(mu | (a/b^2 + c/d^2)/(1/b^2 + 1/d^2), sqrt(1/(1/b^2 + 1/d^2))). No weights are implicitly assigned. The log densities are simply added.

• Alan Feng says:

Hi Andrew: