In Bayesian priors, why do we use soft rather than hard constraints?

Posted on May 29, 2020 11:44 AM by Andrew

Luiz Max Carvalho has a question about the prior distributions for hyperparameters in our paper, Bayesian analysis of tests with unknown specificity and sensitivity:

My reply:

1. We recommend soft rather than hard constraints when we have soft rather than hard knowledge. In this case, we don’t absolutely know that spec and sens are greater than 50%. There could be tests that are worse than that. Conversely, to the extent that we believe spec and sens to be greater than 50% we don’t think they’re 51% either.

2. I typically use normal rather than beta because normal is easier to work with, and it plays well with hierarchical models.

29 thoughts on “In Bayesian priors, why do we use soft rather than hard constraints?”

jd on May 29, 2020 12:10 PM at 12:10 pm said:

“2. I typically use normal rather than beta because normal is easier to work with, and it plays well with hierarchical models.”

Yes! Maybe I am just daft, but I think normal is just much easier to think about and seems to work pretty well in many cases.

Reply ↓
- Daniel Lakeland on May 29, 2020 2:47 PM at 2:47 pm said:
  
  It depends. Beta makes sense for probability distributions over a compact region, like [0,1] where Normal doesn’t. To do normal(a,b) correct when you’re constrained to region [c,d] you need to renormalize the normal(a,b) to be normal(a,b)/integrate(normal_pdf(x,a,b),x,c,d)
  
  that is, you’re truncating the tails, by an amount that depends on a,b. This is no big deal when [c,d] extends way beyond a+-3*b but it’s a big deal if the normal can bump into the ends of the [c,d] range.
  
  Beta has no such problem.
  
  Reply ↓
  - Bob Carpenter on May 29, 2020 3:09 PM at 3:09 pm said:
    If the truncation bound is constant, then the normalizing term is constant and can be dropped from MCMC.
    
    In Stan, a positive-truncated normal would be explicitly written this way:
    
    alpha ~ normal(0, sigma) T[0, ];
    
    That would include the normalizing CDF term in the calculation. But it’s not needed for MCMC or optimization, so you could also write this:
    
    alpha ~ normal(0, sigma);
    
    and get the same density up to a constant proportion. On the other hand, if the bound is itself a parameter, perhaps
    
    alpha ~ normal(0, sigma) T[beta, ];
    
    then the normalization term is necessary. This turns into a lot of computation for the CDF and partials we need for autodiff.
    
    Either way, alpha needs to be declared with the appropriate lower bound constraint to allow Stan to implicitly transform it to work on the unconstrained scale for all parameters.
    Reply ↓
    - Daniel Lakeland on May 29, 2020 3:33 PM at 3:33 pm said:
      
      not only do you need it in the case where the truncation bounds are non-constant… but you also need it in the case where the location or the scale are non-constant.
      
      suppose x is confined to [0,1] which is a constant interval. Then you put
      
      a ~ normal(b,0.1)
      
      now when b is well away from 0, or 1, then the truncation doesn’t matter because it’s an epsilon, but when b is near 0 or near 1 then the truncation *does* matter.
      
      It gets even worse when you do:
      
      a ~ normal(b,c)
      
      I guess in these cases you should do:
      
      a ~ normal(b,c) T[0,1]
      
      but I didn’t remember that there was a truncation syntax
Bob Carpenter on May 29, 2020 1:13 PM at 1:13 pm said:

In terms of being plug-and-play, it’d be just as easy to use a Student-t with, say 4 degrees of freedom, or a logistic. Both would have wider tails than the normal and lead to less sensitivity to extreme values. I’m going to re-run the sensitivity analyses with different forms of priors as well as just different scales on the hyperprior.

I was just reading the penalized complexity prior paper from Dan Simpson and crew, and they recommend exponential priors because they’re consistent with zero and have wider tails (the argument was for scale free inference on complexity, the result of which is wider tails than a normal). The complex model here is no pooling and the simple model is complete pooling, and a PC prior penalizes distance from the simple model to the complex model, which is here controlled by the hierarchical prior.

If data’s consistent with the boundaries, it’ll distort the posterior mean. Consider a beta(9, 1), which has a mean of 0.9 and sd of 0.09. Here are posterior means of truncated distributions with the same parameters:

beta(9, 1) T[0.90, 1] has mean 0.957
beta(9, 1) T[0.85, 1] has mean 0.941
beta(9, 1) T[0.80, 1] has mean 0.928
beta(9, 1) T[0.70, 1] has mean 0.911
beta(9, 1) T[0.60, 1] has mean 0.904
beta(9, 1) T[0.50, 1] has mean 0.901
beta(9, 1) T[0.40, 1] has mean 0.900
beta(9, 1) T[0.00, 1] has mean 0.900

By the time you’re to a lower bound of 0.6 (3 sd below mean), the effect of truncation is only in the third decimal place.
…

Reply ↓
- jd on May 29, 2020 1:28 PM at 1:28 pm said:
  
  >I’m going to re-run the sensitivity analyses with different forms of priors as well as just different scales on the hyperprior.
  
  I was wondering if plotting results from a range of priors might be good. Results seem particularly sensitive to the prior here.
  https://statmodeling.stat.columbia.edu/2020/05/20/ok-heres-a-hierarchical-bayesian-analysis-for-the-santa-clara-study-and-other-prevalence-studies-in-the-presence-of-uncertainty-in-the-specificity-and-sensitivity-of-the-test/#comment-1341544
  
  >The complex model here is no pooling and the simple model is complete pooling, and a PC prior penalizes distance from the simple model to the complex model, which is here controlled by the hierarchical prior.
  
  I see what you’re saying, but it still looks to me like a normal prior behaves in a way that makes it a useful tool for the job it is doing. Expecting small variation and regularizing large variation seems to be what normal is doing, right?
  https://statmodeling.stat.columbia.edu/2020/05/17/stan-pedantic-mode/#comment-1341667
  
  I still have to read the pc prior paper, though:)
  
  Reply ↓
  - Andrew on May 29, 2020 1:37 PM at 1:37 pm said:
    
    Jd:
    
    We did a sensitivity analysis. It’s in section 4 of the linked paper!
    
    Reply ↓
    - jd on May 29, 2020 1:42 PM at 1:42 pm said:
      
      so you did! yikes! I must have somehow skipped that section entirely between reading the paper and doing work!
  - Bob Carpenter on May 29, 2020 2:48 PM at 2:48 pm said:
    
    Yes to the second part of this: that the normal prior regularizes. The question is whether it regularizes too much because it has relatively thin tails compared to, say, a student-t or exponential.
    
    Reply ↓
    - jd on May 29, 2020 3:07 PM at 3:07 pm said:
      
      I guess that might depend on the application and the reason for regularization in the context of the problem, right?
- Pablo Verde on May 29, 2020 1:56 PM at 1:56 pm said:
  
  Hi Bob,
  
  Very nice paper! Actually, using heavy tails on the random-effects for meta-analysis of a diagnostic test is a good idea. I have been doing this stuff for many years for the common situation where the primary studies reported the two by two tables. I implemented this functionality in my R package “bamdit”. Anyway, a meta-analysis of diagnostic is usually hard work…
  
  https://www.jstatsoft.org/article/view/v086i10
  
  One strange thing about the data is that the labs only reported “sensitivity” while others “specificity”. Normally, they should report both in the same test evaluation.
  
  Reply ↓
  - Bob Carpenter on May 29, 2020 2:46 PM at 2:46 pm said:
    
    We really wanted evaluations of both sensitivity and specificity on paired samples. That’d let us estimate covariance. The covariance structure is tricky in that sensitivity and specificity are anti-correlated in that one can often be traded for the other by adjusting thresholds, but they’re positively correlated in that good labs will be better at both. If we modeled each test site i’s departure (on the log odds scale) from the mean in terms of a difference in accuracy alpha[i] and a difference in bias delta[i], we’d reconstruct sensitivity and specificity as
    
    logit sens[i] = mu_sens + alpha[i] + delta[i]
    logit spec[i] = mu_spec + alpha[i] – delta[i]
    
    where mu_sens and mu_spec are the hierarchical mean sensiivity and specificity. mu_spec, mu_sens, alpha, and delta are all on the log odds scale, and are thus unconstrained. sens and spec on the other hand fall in (0, 1). A lab with better than average accuracy has alpha[i] > 0, and a lab with bias equal to the mean bias (determined by mu_sens and mu_spec) has delta[i] = 0. We could give (alpha, delta) a bivariate hierarchical priorwhere now we’d expect a positive correlation only between alpha and delta. An independent prior might be OK, too with this parameterization.
    
    Reply ↓
Nick Adams on May 29, 2020 1:20 PM at 1:20 pm said:

A sensitivity of less than 50% is not really plausible for any rationally designed test. If it did occur you would just flip the sign and call a negative test positive to keep it above 50%. Same for specificity. It’s like having a negative 1-sided p-value.

Reply ↓
- Zhou Fang on May 29, 2020 1:46 PM at 1:46 pm said:
  
  I think that’s often a good argument, but sometimes you can’t actually do that.
  
  A real life situation I had once was that we had a diagnostic test for an illness that was carried out each day, and then this was compared subsequently to a gold standard dataset of whether the individual was actually sick each day. To our surprise, we found the sensitivity and specificity was substantially worse than you would expect from random chance.
  
  Our realisation was that there was a time offset between the process that identified the sickness in the gold standard dataset and the diagnostic test. Because bouts of sickness didn’t follow each other, this means that when the diagnostic test correctly gave a positive result, this would always be just after the time period where the gold standard method registered the individual as being sick and so give a “false” positive.
  
  So yeah, in this case the poor specificity and sensitivity was an artifact of the situation, but it was a genuine result. Flipping the sign would be bad.
  
  Reply ↓
- Pablo Verde on May 29, 2020 2:13 PM at 2:13 pm said:
  
  Well, we hope that a diagnostic test would have sensitivity greater than 50%. But in clinical practice, it is common to have results that are not so great.
  
  For example, just take a look at the table which summarizes the results of this relative recent meta-analysis:
  
  https://pubmed.ncbi.nlm.nih.gov/31215969/
  
  Diagnostic studies based on deep-learning (cool name) algorithms have sensitivity around 40%.
  
  Reply ↓
- Bob Carpenter on May 29, 2020 3:00 PM at 3:00 pm said:
  
  Very low sensitivity can be reasonable in a diagnostic test if it’s coupled with high specificity. Puncture biopsies are an example of a test with low sensitivty but very high specificity.
  
  In general, it’s impossible to flip the sign in a reasonable way because disease status is unknown. If you just flip the sign of all diagnostic outputs, you wind up swapping sensitivity and specificty. So if you start with sens = 0.4, spec = 0.99, you wind up with sens = 0.6, spec = 0.01. Here’s what happens to the outputs in terms of true and false positive and negative results:
  
  TP -> FN
  TN -> FP
  FP -> TN
  FN -> TP
  
  Before the transform, we have
  
  sens = TP / (TP + FN)
  spec = TN / (TN + FP)
  
  After the transform
  
  sens’ = FN / (FN + TP) = 1 – sens
  spec’ = FP / (TN + FP) = 1 – spec
  
  Reply ↓
  - Nick Adams on May 31, 2020 2:25 PM at 2:25 pm said:
    
    Bob,
    I was more thinking of the situation where say the specificity is 0.2 and the specificity is 0.7.
    This gives a likelihood ratio for a positive test that’s less than 1 (2/3) and a likelihood ratio for a negative test that’s greater than 1 (8/3), which is ridiculous. Flipping the definition of a positive/negative test gives a LR+ of 8/7, and a LR- of 2/7, which is not ridiculous.
    And sure, in your example if you want to maximise specificity then stick with the 0.99 but if you want to maximise sensitivity then flip it. In medical diagnosis you sometimes want one and sometimes the other.
    
    Reply ↓
    - Nick Adams on May 31, 2020 2:26 PM at 2:26 pm said:
      
      Errata: last word os first line should be sensitivity.
Daniel Lakeland on May 29, 2020 2:52 PM at 2:52 pm said:

Andrew, here’s the thing I don’t understand, why do a half normal with maximum density on 0 for the standard deviation of anything?

In general when standard deviation = 0 then normal(a,0) is a delta function on a, which is actually not a valid density and numerically totally unstable for values near zero as well, the kind of thing that causes divergences immediately.

So, it makes no sense to ever specify half normal priors with maximum probability density 0 on standard deviations.

So, what’s your rational?

Reply ↓
- Carlos Ungil on May 29, 2020 3:14 PM at 3:14 pm said:
  
  Can’t you just exclude the zero from the domain?
  
  Reply ↓
  - Bob Carpenter on May 29, 2020 3:29 PM at 3:29 pm said:
    Not with floating point. The minimum value you can represent in double precision is 10^{-300} or so (-308 for normal, -324 for subnormal). So as soon as we get t around -300 on the unconstrained scale for a positive constrained value, we’ll underflow on the conversion to a positive constrained value. There are two alternatives to avoid zero. First, set a hard lower bound
    
    real sigma;
    
    This has the effect of adding 1e-6 to the underflowed result to produce a value of 1e-6.
    
    Second, you can use a zero-avoiding prior like a standard lognormal. Its log density drops very quickly as values become negative, so you never get anywhere near values like -300. But as I wrote in response to Daniel Lakeland, that can mask a situation where you really do just want to fully pool everything. But maybe it’d be close enough to just soft-avoid zero a little bit. But with a standard lognormal, you’re only getting to something like exp(-10) = 4e-5 is already 10 sd away from 1 for a standard lognormal(0, 1). So you may need to increase its scale to be compatible with even smaller sd values if appropriate.
    Reply ↓
    - Daniel Lakeland on May 29, 2020 3:41 PM at 3:41 pm said:
      
      of course, the blog ate the angle brackets… I’m assuming you meant (I hope this renders correctly)
      
      real<lower=0> sigma;
      
      The point is that it’s never the case that a sigma close to zero makes sense… Of course “close to zero” has different meanings for different problems. Once you’ve found some prior “order of magnitude” for sigma. it should always be better to use something like a gamma prior:
      
      sd ~ gamma(2.0,1.0/sdmag)
      
      or what’s better because it samples on unit scale
      
      sdpre ~ gamma(2.0,1.0/1.0);
      sd = sdscale*sdpre;
    - Daniel Lakeland on May 29, 2020 3:45 PM at 3:45 pm said:
      
      Incidentally, a gamma distribution is the maximum entropy distribution with a known mean, and a known order of magnitude (mean of the log(x))
      
      It’s not super convenient to parameterize that directly, but I generally think of a gamma in terms of a size and a degree of freedom:
      
      gamma(n,(n-1)/m) has peak density at value m and the bigger n is the more concentrated it is around that peak.
    - Carlos Ungil on May 29, 2020 3:56 PM at 3:56 pm said:
      
      Thanks, my comment didn’t really deserved such an extensive reply. I mostly wanted to know if Daniel found every prior which doesn’t go to zero around zero (what you call “zero-avoiding”) equally problematic. If a hard lower bound to avoid numerical issues is automatically used when using real lower=0 I expected something similar would be done for half-normal, exponential, etc.
- Zhou Fang on May 29, 2020 3:18 PM at 3:18 pm said:
  
  The fact that the half normal prior probability density is maximized at zero isn’t that important. The density is approximately flat near zero, so the prior probability of that scale parameter being less than some small value is proportionate to the size of the value. The prior predictive distribution is spiky but not *that* spiky.
  
  Reply ↓
  - Bob Carpenter on May 29, 2020 3:32 PM at 3:32 pm said:
    
    The fact that it’s maximized there isn’t important, but the fact that it’s consistent with zero is. We could have normal(1, 1), too. But lognormal(0, 1) is what Andrew calls a “zero-avoiding prior” because the posterior can’t get close to zero because of how fast the log density drops. Even that’s probably OK for most problems which aren’t going to have posterior sd anywhere near -5 on the log scale.
    
    Reply ↓
- Bob Carpenter on May 29, 2020 3:21 PM at 3:21 pm said:
  
  I’m a recent convert, so I can’t stop proslytizing for Simpson et al.’s Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors. The idea is that a prior scale of zero (sd = 0) is the simpler model with complete pooling. We then penalize departure from the simple model with a prior that’s consistent with the simpler model.
  
  Yes, it’s always dicey trying to get too close to zero numerically, constrained or unconstrained. Yes, too that in HMC, those problems manifest as divergences of the internal Hamiltonian simulation. Stan has lots of diagnostics around this, so it should be easy to visualize what’s going on. If the data pulls very hard toward zero rather than just being consistent with zero, we’d probably just replace the partial pooling with complete pooling in order to get the same answer without the headache. If we set a zero-avoiding prior like an inverse gamma or lognormal, we might not ever see the problem as the posterior’s gently pushed away from zero.
  
  Reply ↓
Ian Fellows on May 30, 2020 2:08 AM at 2:08 am said:

Andrew:

I’d also recommend informative priors on the beta when you’re doing the logit model for pi. Because flat(ish) priors put most of their weight out toward pi = 0 or pi = 1, if you don’t have a huge amount of data and low prevalence it can push the posterior mode of specificity way down. When I worked with this type of data, I had to fiddle around in order to find priors for beta that would yield a roughly flat pi.

Reply ↓
- Ian Fellows on May 30, 2020 8:05 PM at 8:05 pm said:
  
  I should have said priors for beta and sigma.
  
  Reply ↓

Statistical Modeling, Causal Inference, and Social Science

In Bayesian priors, why do we use soft rather than hard constraints?

29 thoughts on “In Bayesian priors, why do we use soft rather than hard constraints?”

Leave a Reply Cancel reply