The fairy tale of the mysteries of mixtures

The post is by Leonardo Egidi.

This Bayesian fairy tale starts in July 2016 and will reveal some mysteries of the magical world of mixtures.

Opening: the airport idea

Once upon a time a young Italian statistician was dragging his big suitcase through JFK airport towards the security gates. He suddenly started thinking about how to elicit a prior distribution that is flexible but at the same time contains historical information about past similar events: thus, he wondered, “why not use a mixture of a noninformative and an informative prior in some applied regression problems?”

Magic: mixtures like three-headed Greek monsters

The guy fell in love with mixtures many years ago: the weights, the multimodality, the ‘multi-heads’ characteristic…like Fluffy, a gigantic, monstrous male three-headed dog who was once cared for by Rubeus Hagrid in the Harry Potter novel. Or Cerberus, a three-headed character from Greek mythology, one of the monsters guarding the entrance to the underworld over which the god Hades reigned. “Mixtures are so similar to Greek monsters and so full of poetic charm, aren’t they?!”

Of course his idea was not new at all: spike-and-slab priors are very popular in Bayesian variable selection and in clinical trials to avoid prior-data conflicts and get robust inferential conclusions.

He left his thought partially aside for some weeks, focusing on other statistical problems. However, some months later the American statistical wizard Andrew wrote an inspiring blog entry about prior choice recommendations:

What about the choice of prior distribution in a Bayesian model? The traditional approach leads to an awkward choice: either the fully informative prior (wildly unrealistic in most settings) or the noninformative prior which is supposed to give good answers for any possible parameter values (in general, feasible only in settings where data happen to be strongly informative about all parameters in your model).

We need something in between. In a world where Bayesian inference has become easier and easier for more and more complicated models (and where approximate Bayesian inference is useful in large and tangled models such as recently celebrated deep learning applications), we need prior distributions that can convey information, regularize, and suitably restrict parameter spaces (using soft rather than hard constraints, for both statistical and computational reasons).

This blog post gave him a lot of energy by reinforcing his old idea. So, he wondered, “what’s better than a mixture to represent a statistical compromise about a prior belief, combining a fully informative prior with a noninformative prior, weighted somehow in an effective way?”. As the ancient Romans used to say, in medio stat virtus. But he still needed to dig into the land of mixtures to discover some little treasures.

Obstacles and tasks: mixtures’ open issues

Despite their wide use in theoretical and applied frameworks, as far as he could tell from the current literature no statistician had explored the following issues about mixture priors:

  • how to compute a measure of global informativity yielded by the mixture prior (such as a measure of effective sample size, according to this definition);
  • how to specify the mixture weights in a proper and automatic way (and not, say, only by fixing them based on historical experience, or by assigning them a vague hyperprior distribution) in some regression problems, such as clinical trials.

He struggled a lot with his mind during that cold winter. After some months he dug something out of the earth:

  1. The effective sample size (ESS) provided by a mixture prior never exceeds the information of any individual mixture component density of the prior.
  2. Theorem 1 here quantifies the role played by the mixture weights in reducing any prior-data conflict we can expect when using the mixture prior rather than the informative prior. “So, yes, mixture priors are more robust also from a theoretical point of view! Until now we only knew it heuristically.”

Happy ending: a practical regression case

Similarly to the bioassay experiment analyzed by Gelman et al. (2008), he considered a small-sample example to highlight the role played by different prior distributions, including a mixture prior, in terms of posterior analysis. Here is the example.

Consider a dose-response model to assess immune-depressed patients’ survival according to an administered drug x, a factor with levels from 0 (placebo) to 4 (highest dose). The survival y, registered one month after the drug is administered, is coded as 1 if the patient survives, 0 otherwise. The experiment is first performed on a sample of patients y1 at time t1, where less than 50% of the patients survive, and then repeated on a sample y2 at time t2, where all but one of the patients die; y1 and y2 are non-overlapping samples of patients.

The aim of the experimenter is to use the information from the first sample y1 to obtain inferential conclusions about the second sample y2. Perhaps the two samples are quite different from each other in terms of how many patients survived. From a clinical point of view we have two possible naive interpretations for the second sample:

  • the drug is not effective, even if there was a positive effect for y1;
  • regardless of the drug, the first group of patients was in much better health than the second one.

Both appear to be quite extreme clinical conclusions; moreover, our information is scarce, since we do not have any other influential clinical covariates, such as sex, age, presence of comorbidities, etc.

Consider the following data, where the sample size for each experiment is N = 15:

library(rstanarm)
library(rstan)
library(bayesplot)

n <- 15
y_1 <- c(0,0,0,0,0,0,0,0,1,1,0,1,1,1,1) # first sample
y_2 <- c(1, rep(0, n-1))                # second sample

Given pi = Pr(yi = 1), we fit a logistic regression logit(pi) = α + βxi to the first sample, where the parameter β is associated with the administered dose of the drug, x. The five levels of the drug are randomly assigned to groups of three people each.

# dose of drug
x <- c(rep(0,3), rep(1,3), rep(2,3), rep(3,3), rep(4,3))
  
# first fit
fit <- stan_glm(y_1 ~ x, family = binomial)
print(fit)
## stan_glm
##  family:       binomial [logit]
##  formula:      y_1 ~ x
##  observations: 15
##  predictors:   2
## ------
##             Median MAD_SD
## (Intercept) -4.8    2.0  
## x            2.0    0.8  
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

Using weakly informative priors, the drug appears effective at t1: the parameter β is positive, with a posterior median of 2.0 (and MAD_SD of 0.8), meaning that each additional dose level increases the log-odds of survival by about 2.0.
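
For intuition (this back-of-the-envelope conversion is our addition, not part of the original analysis), the log-odds coefficient can be turned into an odds ratio:

exp(2.0)  # ~7.4: each additional dose level multiplies the estimated odds of survival by about 7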

Now we fit the same model to the second sample, using three different priors for β, reflecting three different ways to incorporate the historical information about y1:

  1. weakly informative prior β ~ N(0, 2.5) → scarce historical information about y1;
  2. informative prior β ~ N(2, 0.8) → relevant historical information about y1;
  3. mixture prior β ~ 0.8×N(0, 2.5) + 0.2×N(2, 0.8) → weighted historical information (weight 0.2 on the informative component).

(We skip here the details about the choice of the mixture weights in 3.; see here for further details.)
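
As a quick visual check (a minimal sketch added here for illustration, not part of the original analysis; the object names are ours), the three priors for β can be plotted on a common grid in base R:

# Sketch: compare the three prior densities for beta
beta_grid   <- seq(-6, 6, length.out = 500)
weakly      <- dnorm(beta_grid, mean = 0, sd = 2.5)
informative <- dnorm(beta_grid, mean = 2, sd = 0.8)
mixture     <- 0.8 * weakly + 0.2 * informative   # pointwise mixture of the two densities
matplot(beta_grid, cbind(weakly, informative, mixture), type = "l", lty = 1,
        xlab = "beta", ylab = "prior density")
legend("topright", legend = c("weakly informative", "informative", "mixture"),
       col = 1:3, lty = 1)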

# second fit

## weakly informative
fit2weakly <- stan_glm(y_2 ~ x, family = binomial)

## informative
fit2inf <- stan_glm(y_2 ~ x, family = binomial,
                    prior = normal(fit$coefficients[2], 
                                   fit$stan_summary[2,3]))

## mixture
x_stand <- (x -mean(x))/5*sd(x)  # standardized drug 
p1 <- prior_summary(fit)
p2 <- prior_summary(fit2weakly)
stan_data <- list(N = n, y = y_2, x = x_stand, 
                  mean_noninf = as.double(p2$prior$location),
                  sd_noninf = as.double(p2$prior$adjusted_scale),
                  mean_noninf_int = as.double(p2$prior_intercept$location),
                  sd_noninf_int = as.double(p2$prior_intercept$scale),
                  mean_inf = as.double(fit$coefficients[2]),
                  sd_inf =  as.double(fit$stan_summary[2,3]))
fit2mix <- stan('mixture_model.stan', data = stan_data)

Let’s now look at what the three posterior distributions suggest.

Remember that in the second sample all but one of the patients die, but we do not actually know why this happened. The informative and the weakly informative analyses suggest almost opposite conclusions about the drug’s efficacy, both of them quite unrealistic:

  • the ‘informative posterior’ suggests a non-negligible positive effect of the drug → possible overestimation;
  • the ‘weakly informative posterior’ suggests a strong negative effect → possible underestimation;
  • the ‘mixture posterior’, which captures the prior-data conflict between the prior on β suggested by y1 and the sample y2 and lies in the middle, is more conservative and likely to be more reliable for the second sample in terms of clinical justification (see the sketch below).
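
To visualize the comparison, here is a small sketch we add for illustration (it assumes the three fits above, fit2weakly, fit2inf, and fit2mix, are in the workspace; note that fit2mix was estimated on the transformed covariate, so its slope is on that scale):

# Sketch: put the posterior draws of the drug effect side by side
draws_weakly <- as.matrix(fit2weakly, pars = "x")[, 1]   # slope from the weakly informative fit
draws_inf    <- as.matrix(fit2inf, pars = "x")[, 1]      # slope from the informative fit
draws_mix    <- as.matrix(fit2mix, pars = "beta")[, 1]   # slope from the mixture-prior fit
posterior_beta <- cbind(weakly = draws_weakly,
                        informative = draws_inf,
                        mixture = draws_mix)
bayesplot::mcmc_areas(posterior_beta, prob = 0.9)        # 90% intervals for the three posteriors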

In this application a mixture prior combining the two extremes (the fully informative prior and the weakly informative prior) can realistically average over them and represent a sound compromise (similar examples are illustrated here) for getting robust inferences.

The moral lesson

The fairy tale is over. The statistician is now convinced: after digging in the land of mixture priors he found another relevant case of prior regularization.

And the moral lesson is that we should stay in between (a paraphrase of the ancient in medio stat virtus) when we have small sample sizes and possible conflicts between past historical information and current data.

20 thoughts on “The fairy tale of the mysteries of mixtures”

  1. Creatively written post and the idea makes many headed sense.

    Back around 2010 I asked Peter Thall why he did not just simulate from the joint prior and marginalize to get the prior for the parameter of concern. The paper of theirs you referenced was motivated by a Bayesian analysis of a clinical trial that did not make sense to them. I think his answer was something like “I just didn’t think of it.”

    Not to be too dismissive of doing the math, but if we all are going to do adequate Bayesian Workflow Analysis (aren’t we?) then those simulations and more need to be done.

    I also think doing that would make more convincing case studies as well as raise unanticipated challenges of the idea.

    • Thanks Keith!

      >Not to be too dismissive of doing the math, but if we all are going to do adequate Bayesian Workflow Analysis (aren’t we?) then those simulations and more need to be done.
      I also think doing that would make more convincing case studies as well as raise unanticipated challenges of the idea.

      Yeah, I completely agree on both points.

  2. Interesting stuff.

    Your toy example has some relevance to what Merck is facing with its new covid drug. The interim analysis had hospitalizations of 53/377 placebo vs. 28/385 treatment. The second batch of data had hospitalizations of 15/322 placebo vs. 20/324 treatment.

    https://www.nature.com/articles/d41586-021-03667-0

    My impression is that you have only 2 groups, and aren’t sure whether a pooled or unpooled model is best, so are using a mixture over the two possibilities. That seems reasonable, and the results are necessarily dependent on your prior over the two possibilities, as you show by varying the previous data weighting between 0, 0.2, and 1.

    Doing a multilevel model with a fixed group-level sigma should be pretty equivalent, right? It would also give the drug efficacy estimates within each group. Then again, it’s probably easier to set priors for your mixing weights than it is to pick a good sigma.

    • Thanks kj!

      > Your toy example has some relevance to what Merck is facing with its new covid drug.

      Nice reference, I’ll look at it.

      > Doing a multilevel model with a fixed group-level sigma should be pretty equivalent, right? It would also give the drug efficacy estimates within each group. Then again, it’s probably easier to set priors for your mixing weights than it is to pick a good sigma

      Yes, it is approximately the same; I made some attempts a few years ago. As you argue, here it is pretty easy to specify the mixture weights. Interesting!

  3. > Remember that in the second sample all but one of the patients die, but we do not actually know why this happened:

    But we know that the only patient who survived didn’t take the drug.

    > the informative and the weakly informative analyses suggest almost opposite conclusions about the drug’s efficacy, both of them quite unrealistic

    On the other hand, they wouldn’t go in the opposite direction if the only patient who survived had been one of those who received the maximum dose.

    If you have one experiment with relatively strong evidence for large positive efficacy and another one with not-so-strong evidence for not-so-large negative efficacy, why is it unrealistic to conclude that there is weak evidence for small positive efficacy?

    Is it really desirable that the result depends on the order in which the experiments are considered?

    If I’m not mistaken, when the posterior of the first analysis is properly carried over as prior for the second analysis (taking into account the joint distribution of intercept and slope) the final result is independent of the order. It is also the same that we would get if a single analysis was performed on the pooled data.

    • Thanks Carlos. Here is my reply:

      > On the other hand, they wouldn’t go in the opposite direction if the only patient who survived had been one of those who received the maximum dose.

      I agree; however, the toy example is in fact designed to reveal possible prior-data conflicts (and here there are some) between current data and past historical information.

      > If you have one experiment with relative strong evidence for large positive efficacy and another one with not-so-strong evidence for not-so-large negative efficacy, why is it unrealistic to conclude that there is weak evidence for small positive efficacy?

      Because we have very scarce information about the second sample of patients: maybe they are very young, with no comorbidities…and in that case, using a weakly informative prior, we would completely discard past historical drug information and claim that the drug is not only not effective, but even injurious! (look at the plot). That is usually quite unrealistic from a clinical point of view.

      > Is it really desirable that the result depends on the order in which the experiments are considered?

      Yes, I know, this can be problematic, but this is how clinical trials (and Bayesian inference) usually work: sequential updating from previous experience.

      > It is also the same that we would get if a single analysis was performed on the pooled data.

      Yes, but as mentioned above many clinical trial schemes work in this way: collect historical information about previous samples, perform an informative analysis on the current data, and obtain inferential conclusions about the new sample.

      • > using a weakly informative prior, we would completely discard past historical drug information

        My question was about the other so-called “extreme”. I’m not sure that it’s fair to say that the “weak evidence for small positive efficacy” conclusion that may be obtained when we do not discard – completely or partly – past historical information is wildly unrealistic.

        > Yes, I know, this can be problematic, but is how usually clinical trials (and Bayesian inference) work: sequential updating from previous experience.

        One of the advantages of Bayesian inference is in fact the ability to perform sequential updating. Which means that – in principle at least – one could incorporate the data in any order – or all at once – without affecting the final result.

        In practice, there are approximations and the sequential updating is not perfect. The result may depend on the details. But purposely ignoring “previous experience” kind of goes against the spirit of Bayesian inference. There may be good reasons to ignore upstream data but “this is how Bayesian inference works” is not one.

        By the way, there seems to be something wrong with your code. The first fit is done on a covariate that goes from 0 to 4. The final fit2mix is done on a linearly transformed covariate that goes from -0.6 to 0.6. One of the inputs is the beta calculated in the first fit but its numerical value is just one third of what it should be after the change of variables. Unless I’m missing something you’re muffling the prior data twice: by mixing it with four parts of a zero-effect prior and by dividing the beta estimate by three.

        • > My question was about the other so-called “extreme”. I’m not sure that it’s fair to say that the “weak evidence for small positive efficacy” conclusion that may be obtained when we do not discard – completely or partly – past historical information is wildly unrealistic.

          I agree it is not wildly unrealistic, but anyway it could overestimate the drug effect…remember that in this toy example the drug is supposed to be not effective at all for the second sample. In other words, from a clinical point of view, would you trust a drug that turned out to be effective in the first experiment, not effective in the second experiment, and evaluated on a small sample size? Personally, I would prefer a “no-effect” claim rather than a slightly optimistic one. My concern is more about clinical interpretation than numerical/statistical interpretation (statisticians know how to interpret weak effects, doctors often don’t).

          > But purposely ignoring “previous experience” kind of goes against the spirit of Bayesian inference.

          In my opinion the purpose here is not to ignore previous experience, but rather to check for possible prior-data conflicts. Previous experience matters a lot; however, since in small-sample scenarios ‘data cannot entirely speak on their own’, I guess it is of primary importance to check for inconsistencies between prior information and the current data.

          > By the way, there seems to be something wrong with your code. The first fit is done on a covariate that goes from 0 to 4. The final fit2mix is done on a linearly transformed covariate that goes from -0.6 to 0.6. One of the inputs is the beta calculated in the first fit but its numerical value is just one third of what it should be after the change of variables. Unless I’m missing something you’re muffling the prior data twice: by mixing it with four parts of a zero-effect prior and by dividing the beta estimate by three.

          The reason behind the standardization of the drug covariate in the mixture fit is to maintain the consistency with ‘rstanarm’ (first two fits, weakly inf and informative analysis), that internally standardizes predictors, as suggested here:

          https://projecteuclid.org/journals/annals-of-applied-statistics/volume-2/issue-4/A-weakly-informative-default-prior-distribution-for-logistic-and-other/10.1214/08-AOAS191.full

          Without this covariate transformation, the ‘rstan’ fit is not directly comparable with the ‘rstanarm’ fit, so that’s the reason for the drug transformation above.

        • > would you trust a drug that turned out to be effective in the first experiment, not effective in the second experiment, and evaluated on a small sample size?

          What if we exchange the order? Would you trust a drug that resulted not effective in the first experiment and effective in the second experiment?

          The proposed “mixture prior” will result in a much more optimistic effect claim than standard sequential updating in that case.

          Or maybe you just intend to “water down” priors when they are positive, leaving them alone if negative.

        • > What if we exchange the order? Would you trust a drug that resulted not effective in the first experiment and effective in the second experiment?
          The proposed “mixture prior” will result in a much more optimistic effect claim than standard sequential updating in that case.

          No, in fact I would not trust it either. The mixture prior in this scenario should again be able to capture the conflict between the past historical information (‘no effect’) and the current data (‘effect’) and then give more credit to the weakly informative prior rather than the informative one. I would expect a quite similar result (I can make an example if needed).

        • > The reason behind the standardization of the drug covariate in the mixture fit is to maintain the consistency with ‘rstanarm’ (first two fits, weakly inf and informative analysis), that internally standardizes predictors

          Thanks. But I don’t get it.

          I assume that mixture_model.stan – not given – is fixed and only stan_data changes.

          I assume as well that the formula (x -mean(x))/5*sd(x) in the script is wrong and x_stand is in fact (maybe a multiple of) the usual standardization (x-mean(x))/sd(x)

          In that case, imagine two copies of this example: one with

          # dose of drug
          x <- c(rep(0,3), rep(1,3), rep(2,3), rep(3,3), rep(4,3))

          as in the original and the other with

          # dose of drug
          x <- 1000*c(rep(0,3), rep(1,3), rep(2,3), rep(3,3), rep(4,3))

          Most of the elements of stan_data will be identical. N, y, x (=x_stand) will be the same. mean_noninf_int and sd_noninf_int will be the same.

          But the other elements of stan_data change by a factor of 1000. mean_noninf is 0 anyway, but mean_inf is completely different (as are both sd_inf and sd_noninf).

          It's difficult for me to see how the final result could be comparable when the data used in the final fit is the same and the priors are so different.

        • > It’s difficult for me to see how the final result could be comparable when the data used in the final fit is the same and the priors are so different.

          x_stand is defined in the same way the package rstanarm handles the data, so is supposed to be not wrong here.
          The informative prior used in the informative analysis for the second sample is exactly the posterior of beta associated to x_stand in the first sample.
          Then, the meaning is that whatever your data x are, the prior is always defined in terms of the standardized predictor, and this is what happens here. For this reason, in my opinion, the fits are perfectly comparable.

        • > The mixture prior in this scenario should again be able to capture the conflict between the past historical information (‘no effect’) and the current data (‘effect’) and then give more credit to the weakly informative prior rather than the informative one. I would expect a quite similar result (I can make an example if needed).

          My point is that giving more credit to the weakly informative prior rather than the informative one results in a more optimistic effect claim. I don’t understand what you mean by “similar result”.

          When you start with the analysis of the second experiment (fit2weakly in your example) the posterior beta = -1 (0.8) is more negative than the prior beta = 0 (2.5).

          If you then analyse the first experiment using the “informative prior” beta = -1 (0.8) you will get some result. (More or less the same that you get when you analyse the first experiment and then use the posterior as prior for the analysis of the second experiment.)

          If you use the “mixture prior” which is 20% beta = -1 (0.8) and 80% 0 (2.5) you will get a more positive result.

          In summary, given your experiments #1 and #2 you would get the following ranking of effect estimates, from smaller to larger:

          a) #2 ignoring #1
          b) #1 and then #2 using mixture prior
          c) #1 and then #2 using informative prior ~ #2 and then #1 using informative prior ~ #1 and #2 together
          d) #2 and then #1 using mixture prior
          e) #1 ignoring #2

          You say that the mixture prior works well because it gives (b), smaller than (c). But it also gives (d), larger than (c), when the order is switched.

        • > x_stand is defined in the same way the package rstanarm handles the data, so is supposed to be not wrong here. The informative prior used in the informative analysis for the second sample is exactly the posterior of beta associated to x_stand in the first sample.

          This is not correct. The posterior beta in the first stan_glm fit that you’re using later is not “associated to x_stand”. It’s associated to the untransformed variable.

          > y_1 <- c(0,0,0,0,0,0,0,0,1,1,0,1,1,1,1)
          > x <- c(rep(0,3), rep(1,3), rep(2,3), rep(3,3), rep(4,3))
          > stan_glm(y_1 ~ x, family = binomial)$coefficients[2]
          > # 2

          > y_1 <- c(0,0,0,0,0,0,0,0,1,1,0,1,1,1,1)
          > x <- 1000*c(rep(0,3), rep(1,3), rep(2,3), rep(3,3), rep(4,3))
          > stan_glm(y_1 ~ x, family = binomial)$coefficients[2]
          > # 0.002

          Or maybe I misunderstand what you mean. Do you agree that x_stand should be the same in both cases?

          By the way, as I mentioned before if x_stand should be the same in both cases there is an error in your script. As it is one gets
          > x <- c(rep(0,3), rep(1,3), rep(2,3), rep(3,3), rep(4,3))
          > x_stand <- (x -mean(x))/5*sd(x)
          > x_stand
          > # -0.58554 -0.58554 -0.58554 -0.29277 -0.29277 -0.29277 0.00000 0.00000 0.00000 0.29277 0.29277 0.29277 0.58554 0.58554 0.58554

          and

          > x <- 1000*c(rep(0,3), rep(1,3), rep(2,3), rep(3,3), rep(4,3))
          > x_stand <- (x -mean(x))/5*sd(x)
          > x_stand
          > # [1] -585540 -585540 -585540 -292770 -292770 -292770 0 0 0 292770 292770 292770 585540 585540 585540

  4. A couple of years ago one of my consulting projects involved analyzing data on the effectiveness of a “demand response” program.

    Explanation of Demand Response if needed: Demand Response, or DR, means changing one’s demand for electricity in response to a signal of some kind. The traditional way of balancing demand and supply of electricity — such balance being necessary to avoid brownouts on the one hand or dangerous overheating of equipment on the other — is to modify supply to keep up with demand, but you can also adjust demand to match supply and many utilities have programs to do this.

    The program I was analyzing was extremely heterogeneous: it included big customers and small ones, in mild climates and severe ones, and there had only been about ten DR events — that is, ten times the customers were asked to reduce demand — in the entire year. I was supposed to quantify and characterize the effectiveness of the program for those ten events and also to forecast what would happen if they recruited a bunch more customers in specific geographic locations and then called for DR on some days with standardized weather.

    Even quantifying what happened during the ten events that had occurred can’t be done very accurately: you can only measure what the customers actually did, not what they would have done if they hadn’t participated in DR, but you are interested in the difference between these. So you have to have a ‘baseline model’ to predict what would have happened without DR, and then subtract what actually happened, in order to find the energy savings.

    But what really made the task hard was the need to forecast what future customers might do. Even if we decide, ludicrously, that it’s fair to treat the current customers (the early adopters) as if they’re representative of the future population, they provide very little information about what that future population looks like. For example, some of the customers provide DR by changing air conditioner settings. From other programs we know that participants in mild climates generally provide less impact per degree of outdoor air temperature than customers from hotter climates, so when projecting for a future population that has different geographical spread than the current customers we need to adjust for this somehow, but the sparsity of the current data meant that we had very poor estimates of either the energy savings per degree F or the way this quantity varies by climate.

    By the way, I suppose I need to say that simply telling the client “I’m sorry, I can’t do this with enough accuracy to make it worthwhile” was not an option, because the numbers from this report were going to be used to determine how much (if anything) the client was going to be paid by the utility for providing the DR capability.

    In the end I used Stan to fit a Bayesian hierarchical model so that I could hope to get semi-reasonable central estimates and uncertainties. I used mildly informative prior distributions to pull some of the numbers towards typical numbers in the industry. I got sensible results that I was willing to stand behind (albeit with the ever-present caveat that ‘all models are wrong.’) As with the OP, the answers did depend somewhat on the prior distribution that I used, and even my estimate of the current effect — what had happened in the events that had taken place in the prior year — was different in the posterior than the prior.

    Although I was satisfied with my work, overall it wasn’t worth it. The people in charge of evaluating the report did not reject the prior distributions or the Bayesian model outright, but there was a lot of pushback in the vein of “why are your estimates for what happened this year different from what the data show?”, in every one of the several opportunities for comments and questions. They didn’t reject it but they also didn’t believe it.

    Fortunately, the next year the program had grown a bit and there were enough customers that all of the quantities were better estimated. My Bayesian model and a more conventional approach gave similar answers so I just used the conventional approach and was able to sail through without all the pushback.

    Prior distributions give you an opportunity to put your thumb on the scale. Sometimes that’s good! If you think the scale is inaccurate in a specific direction then a thumb can help. But people are always going to be suspicious about that thumb.

  5. Today’s post about Wikipedia governance and outstanding errors reminds me that there were a couple of open questions regarding the analysis presented here – or at least regarding the code shown:

    x_stand <- (x -mean(x))/5*sd(x)

    doesn’t look correct if it’s supposed to standardize the variable.

    mean_inf = as.double(fit$coefficients[2]), sd_inf = as.double(fit$stan_summary[2,3]))

    do not seem appropriate coefficients when the x_stand variable is used. They are relative to the original x representation – even if internally the stan_glm function may standardize the data.

    (My last comment about those issues never got a reply, I’m not sure if the author missed it. Maybe my concerns are obviously unfounded. It’s not that obvious to me though.)

    • Hey Carlos, sorry for my late reply, I was very busy.
      Before replying, I carefully checked my own code and how ‘rstanarm’ works: actually, ‘rstanarm’ internally centers the predictors; however, it returns posterior estimates for the non-centered (original) predictors, as you suggested.
      So thanks for your comment.

      In my code above, it is then sufficient to replace x_stand with x in stan_data and fit the mixture prior model (the old x_stand is no longer needed).
      You can try it and note, however, that the results do not change much: the mixture posterior is approximately centered at zero.

      I report here, for completeness, the Stan code for the mixture model (not included in the blog post), in case you wish to double-check:

      data{
        int N;
        int y[N];
        vector[N] x;
        real mean_inf;
        real sd_inf;
        real mean_noninf;
        real sd_noninf;
        real mean_noninf_int;
        real sd_noninf_int;
      }
      parameters{
        real alpha;
        real beta;
      }
      model{
        target += bernoulli_logit_lpmf(y | alpha + beta*x);
        target += 0.8*normal_lpdf(beta | mean_noninf, sd_noninf) + 0.2*normal_lpdf(beta | mean_inf, sd_inf);
        target += normal_lpdf(alpha | mean_noninf_int, sd_noninf_int);
      }

      Thanks again for your useful comments.

      • Leonardo, thanks for the follow up.

        I agree that in this case using the transformed or untransformed variable doesn’t have much of an impact. (Neither does the botched standardization formula, for that matter.)
