“The method uses off-site information from the nearby meteorological towers at Goodnoe Hills and Kennewick, identifies atmospheric regimes and fits conditional predictive models for each regime, based on a sliding 45-day training period.”

There is something “multi-level” about this, but I cannot articulate it. At which scale are these models supposed to predict?

]]>I’m also a big believer in multiple diagnostics to get at multiple aspects of a problem as long as they don’t have too many false positives. Michael Betancourt has built some of his recommendations in section 2.3 of the methodology paper into the Stan repo stat_comp_benchmarks. The nice thing about those tests is that they’re pretty powerful (reject a lot of bad stuff) and very fast compared to something like simulation-based calibration.

As far as sensitivity goes, I’d recommend checking out Ryan Giordano’s thesis work (U.C. Berkeley, though he’s now a postdoc at M.I.T.), where he uses autodiff to characterize the sensitivity of variational solutions w.r.t. parameters. This is a standard approach in applied math, where they do sensitivity analysis of solutions to differential equations with respect to their parameters [which is also where we got the technique of differentiating solutions to differential equations in Stan].

]]>By the usual definition of statistics, machine learning methods *are* statistical methods. The difference seems mainly to be that they’re developed algorithmically by computer scientists and focused on prediction rather than estimation or hypothesis testing. I have a lot of respect for a lot of what’s been accomplished in machine learning, not the least of which is refocusing broader attention on predictive methods. Perhaps that’s because I spent fifteen years working on machine learning applications in speech and natural language processing before moving to Columbia to work on and eventually learn Bayesian stats.

The only drawback is that it requires the model to be right to work with real data in the wild. Since we rarely assume our models are actually correct, in practice, we have to fall back on empirical coverage tests like posterior predictive checks.

]]>Let me try again, though it may just be a doomed analogy. Unbiasedness means getting the right estimate in expectation, whereas calibration is about getting the right posterior in terms of expected coverage. In the same way, sharpness seems related to the precision (inverse variance) of an estimator.

The reason I brought this up is that in point estimation you can reduce variance at the cost of an increase in bias (e.g., by adding a prior or hierarchical prior). So I was wondering if you could do something similar w.r.t. calibration and sharpness (e.g., increasing sharpness by sacrificing some calibration).
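To make the trade-off concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data are taken to be Normal(0, 1), the forecast is Normal(0, pred_sd), and `coverage_and_width` is a hypothetical helper name. It reports the empirical coverage of a central 90% interval (a calibration check) and the interval width (a crude stand-in for sharpness).

```python
import random
from statistics import NormalDist

def coverage_and_width(pred_sd, level=0.9, n=20000, seed=0):
    """Return (empirical coverage, interval width) for central intervals
    of a Normal(0, pred_sd) forecast when the data are really Normal(0, 1)."""
    rng = random.Random(seed)
    # Half-width of the central interval implied by the forecast distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * pred_sd
    # Count how often a fresh observation lands inside the interval.
    hits = sum(abs(rng.gauss(0.0, 1.0)) <= half_width for _ in range(n))
    return hits / n, 2.0 * half_width

for sd in (1.0, 0.5):
    cov, width = coverage_and_width(sd)
    print(f"pred_sd={sd}: coverage={cov:.3f}, width={width:.2f}")
```

Shrinking pred_sd below the true scale narrows the interval (sharper) but drops coverage below the nominal 90% (less calibrated), which is the kind of exchange the question is about.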

then calibration of a Bayesian posterior seems like the same kind of condition as unbiasedness of an estimator.

I also fixed the link.

]]>The only thing with which I was previously familiar for assessing calibration is the way we’re now using simulation-based calibration. That is indeed just repeated posterior predictive checks based on simulated parameters and data, using the parameter draws as the statistic of interest. In general, PPCs can check any quantity of interest, not just parameter estimates.
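For readers who haven’t seen it, a rough sketch of the rank-based version of simulation-based calibration follows. It assumes a toy conjugate model (theta ~ Normal(0, 1), one observation y ~ Normal(theta, 1), so the exact posterior is Normal(y/2, sqrt(1/2))) so that exact posterior draws stand in for MCMC. The function names are hypothetical and this is not how any particular Stan tool implements it.

```python
import random

def sbc_ranks(n_sims=2000, n_draws=9, post_sd=0.5 ** 0.5, seed=1):
    """Simulate (theta, y) from the joint model, draw from the posterior,
    and record the rank of the simulated theta among the posterior draws."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_sims):
        theta = rng.gauss(0.0, 1.0)   # parameter drawn from the prior
        y = rng.gauss(theta, 1.0)     # data simulated given that parameter
        # Posterior draws; post_sd = sqrt(1/2) is exact for this toy model.
        draws = [rng.gauss(y / 2.0, post_sd) for _ in range(n_draws)]
        ranks.append(sum(d < theta for d in draws))
    return ranks

def rank_histogram(ranks, n_draws=9):
    """Count how often each rank 0..n_draws occurs."""
    counts = [0] * (n_draws + 1)
    for r in ranks:
        counts[r] += 1
    return counts

print(rank_histogram(sbc_ranks()))             # correct posterior
print(rank_histogram(sbc_ranks(post_sd=0.3)))  # deliberately too narrow
```

With the correct posterior the ranks should look roughly uniform; with the too-narrow posterior the simulated theta falls outside the draws too often, so the extreme ranks pile up.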

P.S. I was about to ask Aki Vehtari and Dan Simpson and Andrew Gelman directly, but figured I’d ask everyone instead.

]]>The ramifications are currently unclear, muddled, confused and likely a long way from clarification, elucidation and general acceptance.

Some current alternatives are Bayesian notions of calibration given in this book by Michael Evans https://www.crcpress.com/Measuring-Statistical-Evidence-Using-Relative-Belief/Evans/p/book/9781482242799 and related papers, Michael Betancourt’s section 2.3 Model Sensitivity in https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html as well as many papers Andrew has done with others.

So I would suggest some pluralism here: implement the methods of the paper with a clear warning that there isn’t any wide consensus or assurance that the methods do more good than harm. Enabling more people to get experience working with these methods might help give a better sense in the future.

]]>Of note, the best-performing method combined statistical and machine learning methods (https://www.sciencedirect.com/science/article/pii/S0169207019301153) ]]>

Regarding posterior evaluation, it has been argued (e.g., Little, 2006, https://doi.org/10.1198/000313006X117837) that frequentist principles are useful for evaluating Bayesian models. See also Krüger et al. (2016, https://arxiv.org/abs/1608.06802) regarding scoring rule computation (such as the CRPS) in an MCMC context.
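As a small illustration of scoring with draws, here is a sketch of the standard sample-based CRPS estimate, mean|x_i - y| - (1/2) mean|x_i - x_j|, which treats the draws as an empirical forecast distribution. The function name `crps_from_draws` is made up for this example, and independent normal draws stand in for MCMC output; this is only one of the approximations one could use, not necessarily the one the paper recommends.

```python
import random

def crps_from_draws(draws, y):
    """Empirical CRPS of the forecast represented by `draws` at outcome y:
    mean |x_i - y| minus half the mean of |x_i - x_j| over all pairs."""
    xs = sorted(draws)
    n = len(xs)
    abs_err = sum(abs(x - y) for x in xs) / n
    # For sorted xs, sum over i < j of (x_j - x_i) = sum_i (2i - n + 1) * x_i.
    pair_sum = sum((2 * i - n + 1) * x for i, x in enumerate(xs))
    # Mean over all n^2 ordered pairs is 2 * pair_sum / n^2; take half of it.
    return abs_err - pair_sum / (n * n)

rng = random.Random(2)
draws = [rng.gauss(0.0, 1.0) for _ in range(20000)]
print(crps_from_draws(draws, 0.0))   # outcome at the forecast center
print(crps_from_draws(draws, 3.0))   # outcome far in the tail scores worse
```

Lower is better, and for a point-mass forecast the score reduces to absolute error, which is one reason the CRPS is convenient for comparing probabilistic and point forecasts on the same scale.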

]]>And BTW, it looks like the link you give for the paper leads back to this page (rather than to the paper at https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2007.00587.x )

]]>I know that proper scoring methods generally and CRPS specifically are advocated by Briggs.

]]>