# Bayesian inference for discrete parameters and Bayesian inference for continuous parameters: Are these two completely different forms of inference?

I recently came across an example of discrete Bayesian inference: a problem with some discrete states of the world, where the goal is to infer the state given some ambiguous information. There are lots of such examples, such as estimating the probability someone has a certain disease given a positive test, estimating the probability that a person is male given the person’s height and weight, or estimating the probability that an image is a picture of a cat, a dog, or something else, given some set of features of the image.

Discrete Bayesian inference is simple: start with the prior probabilities, update with the likelihood, renormalize, and you have your posterior probabilities. So it makes sense that discrete examples are often used to introduce Bayesian ideas, and indeed we have some discrete examples in chapter 1 of Bayesian Data Analysis.
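As a minimal sketch (with hypothetical disease-testing numbers, not taken from any real test), the whole discrete recipe fits in a few lines:

```python
import numpy as np

def discrete_posterior(prior, likelihood):
    """Bayes' rule for a discrete parameter: multiply prior by likelihood,
    then renormalize so the posterior sums to 1."""
    unnorm = np.asarray(prior) * np.asarray(likelihood)
    return unnorm / unnorm.sum()

# Hypothetical disease-testing numbers:
# states: [disease, no disease]
prior = [0.01, 0.99]
# likelihood of a positive test under each state (sensitivity, false-positive rate)
likelihood = [0.95, 0.05]
posterior = discrete_posterior(prior, likelihood)
print(posterior)  # P(disease | positive) is about 0.16
```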

Just to be clear: what’s relevant here is that the parameter space (equivalently, the hypothesis space) is discrete; it’s not necessary that the data space be discrete. Indeed, in the sex-guessing example, you can treat height and weight as continuous observations and that works just fine.

There’s also continuous Bayesian inference, where you’re estimating a parameter defined on a continuous space. This comes up all the time too. You might be using data to estimate a causal effect or a regression coefficient or a probability or some parameter in a physical model. Again, the Bayesian approach is clear: start with a continuous prior distribution on parameter space, multiply by the likelihood, renormalize (doing some integral or performing some simulation), and you have the posterior.
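The same recipe carries over to a continuous parameter via a grid approximation; here is a sketch with an assumed binomial model and Beta(2, 2) prior, where the grid answer can be checked against the conjugate Beta(9, 15) posterior:

```python
import numpy as np

# Grid approximation: discretize the parameter, apply the same
# prior-times-likelihood-then-renormalize recipe, and the result
# approximates the continuous posterior.
# Hypothetical setup: estimate a probability theta from 7 successes
# in 20 trials, with a Beta(2, 2) prior.
theta = np.linspace(0.001, 0.999, 999)
d = theta[1] - theta[0]
prior = theta**(2 - 1) * (1 - theta)**(2 - 1)   # Beta(2,2) kernel
likelihood = theta**7 * (1 - theta)**13          # binomial kernel, 7 of 20
unnorm = prior * likelihood
posterior = unnorm / (unnorm.sum() * d)          # normalized density on the grid

# Conjugacy check: the exact posterior is Beta(9, 15), with mean 9/24 = 0.375.
post_mean = np.sum(theta * posterior) * d
print(round(post_mean, 3))  # 0.375
```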

What I want to argue here is that maybe we should think of discrete and continuous Bayesian inference as two different forms of reasoning. Here I’m trying to continue what Yuling and I said in section 6 of our Holes in Bayesian statistics paper and in our stacking paper.

In general you can define Bayesian inference for a continuous parameter as Bayesian inference for a discrete parameter, taking the limit where the number of discrete possibilities increases. The challenge comes in setting up the joint prior distribution for all the parameters in the model.

The logical next step here is to work this out in a simple example, and I haven’t done that yet, so this post is kinda sketchy. But I wanted to share it here in case any of you could take it further.

# Elicitation from different perspectives

This is Jessica. I recently heard the term elicitation being used in a machine learning context, where the loss function was referred to as “eliciting” information from the algorithm during optimization. This seemed unusual, though not necessarily incorrect, given a dictionary definition like “the process of drawing out some response or information.” It got me thinking about how the meaning of “elicitation” is interpreted in different academic circles and how they vary in strictness of definition.

In fields like visualization and human-computer interaction, elicitation tends to refer to asking people for beliefs, whether or not the beliefs are about some state of the world for which ground truth can be obtained, and regardless of how people are rewarded for providing the beliefs or whether they are tested for coherency in any way. A paper might use the term elicitation to refer to something like asking users of some interface tool to rate their level of trust on a scale from 1 to 10.

As a research topic, though, elicitation has been scoped more specifically to asking people for information about the uncertainty of some state or event. For example, elicitation in judgment and decision making often refers to prior elicitation from domain experts. Classic literature by Brier and others focuses on eliciting a probabilistic forecast in the form of a distribution. Closely related is using elicitation to refer to getting information about properties of random variables rather than entire distributions, like an expectation, median, or variance.

However, even when different camps agree that elicitation is getting information about state uncertainty, there are different views on what is required to “draw out” information from some agent or process (rather than just requesting it). Some see the incentive scheme as a necessary component, implying elicitation is about making someone want to give you certain information. This view is common in work on elicitation in theoretical CS and econ, for instance, which assumes that proper scoring rules are available to evaluate the elicited information against ground truth: rules such that if the person is aware of how the rule works, then they will report honestly because it maximizes their payoff (and in the case of strictly proper scoring rules, the payoff-maximizing report is unique). Consequently, we get definitions of elicitable properties as those for which we can construct a strictly proper scoring rule. In this view, if we aren’t using proper scoring rules, we might be better off using a term like “solicit” to convey that we are attempting to obtain some information but are not able to provide any sort of guarantees. Or we should use terms that have been developed to refer specifically to eliciting non-verifiable information under different conditions (e.g., peer prediction, where scoring is based on comparing a response elicited from one agent to other agents’ responses, or mechanism design, which is about eliciting how much people value something, e.g., utility functions). Some of these approaches are mathematically very similar to what the theorists call elicitation, but are still referred to by different terms.
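To make the scoring-rule idea concrete, here is a small sketch (not from any of the papers discussed) showing why the quadratic (Brier) rule is proper: if you believe the event has probability 0.7, reporting exactly 0.7 maximizes your expected score:

```python
import numpy as np

def expected_brier(report, belief):
    """Expected (negative) Brier score of reporting probability `report`
    for a binary event you believe occurs with probability `belief`."""
    return belief * -(report - 1)**2 + (1 - belief) * -(report - 0)**2

belief = 0.7
reports = np.linspace(0, 1, 101)
scores = [expected_brier(q, belief) for q in reports]
best = reports[int(np.argmax(scores))]
print(best)  # 0.7: honest reporting maximizes the expected score
```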

Much of the work focused on prior elicitation from domain experts for statistical analysis, however, seems to have moved away from relying on proper scoring rules. Instead, the goal seems to be figuring out structured ways to ask for distributions or properties that minimize the addition of noise to the modeling process. As one canonical survey claims, “We see elicitation as simply part of the process of statistical modeling. Indeed, in a hierarchical model, at which point the likelihood ends and the prior begins is ambiguous.” Acknowledging the murkiness of using elicited priors makes a lot of sense to me, having spent a few years working in this space on graphical elicitation of prior distributions. It’s difficult to evaluate whether you got the “right” information from someone, i.e., beliefs they held before you asked, given that asking people for beliefs can trigger a constructive process. It’s also difficult to evaluate elicitation methods against one another by comparing the distributions they imply, since if they ask for different information sets (e.g., different properties of a distribution), then you might be using different methods with different assumptions to map the inputs to distributions. One unfortunate thing I discovered along the way is that the prior elicitation literature is not well integrated across fields, so you have people working on similar problems in different areas without really citing each other, which is frustrating.

Related to the differences in how much weight researchers put on incentive compatibility, there’s a debate that I’ve seen play out a few times at talks and in conversations. When faced with the results of some human user study or experiment that elicited beliefs from people, the theorist asks: How can you even interpret the results from this study if you didn’t incentivize their reports? The counterargument from the behavioral researcher is often along the lines of, Well, we don’t tend to see any big effects on results when we change incentives, so I doubt it would make a difference. To which the economist or theorist might launch into a lecture on the importance of incentive compatibility and proper scoring rules and so on. And the conversation continues with neither side convinced that the other side knows what they are talking about.

One challenge with seeing elicitation as equivalent to the use of proper scoring rules is that a proper scoring rule might not work as intended if people don’t understand it. In an ML context where you don’t have to care about interpretability, you just set up the loss function. But with people, the rule has to be sufficiently interpretable that they recognize that responding truthfully will maximize their payoffs. In game theory, this has led to the definition of “obviously strategy-proof” mechanisms: mechanisms that have an equilibrium not just in dominant strategies (those that are best regardless of what others do) but in obviously dominant strategies, where for any deviating strategy, at the earliest information set where the obviously dominant strategy and the alternative diverge, the best possible outcome you can get from the alternative is no better than the worst outcome you can get from the obviously dominant strategy. It’s like designing for the worst-case scenario where no one is capable of contingent reasoning. Maybe there are other ways of baking interpretability into elicitation; I would be interested in pointers.

It also seems plausible that there may be situations where, through enough experience, people could internalize the scoring rule even if they don’t understand descriptions of it. Some people might need to see it in action, used to determine their payoffs, to realize that it’s in their best interest to respond truthfully, and that learning process may take time. This implies that to be sure your proper scoring rule worked, you need to evaluate the responses you obtained. It becomes more of a process than a definition.

P.S. Dan Goldstein sends a link to this study on the binarized scoring rule, which finds that giving people transparent information on quantitative incentives reduces truth-telling compared to not telling participants anything about the rule. Doh!

P.P.S. See the recent survey by Aki and others for a deeper dive into prior elicitation challenges and practice.

# Programmer position with our research group—at the University of Michigan!

Hey, midwesterners—here’s a chance for you to join the big team! It’s for our research project with Yajuan Si, Len Covello, Mitzi Morris, Jonah Gabry, and others on building models and software for epidemic tracking (see this paper, this paper, or, for the short version, this op-ed):

Summary

The Survey Research Center (SRC) at the University of Michigan’s Institute for Social Research (ISR) invites applications for a Full Stack Programmer with the Survey Methodology Program, in collaboration with the team of Stan developers at Columbia University.

Our multidisciplinary research team is involved in cutting-edge statistical methodology research and the development of computational algorithms, software, and web-based interfaces for application and visualization. We are looking for a Full-Stack Programmer to work on an innovative infrastructure to enable user-friendly implementation and reproducibility of statistical methods for population generalizability. The position provides an opportunity to work in an exciting and rewarding research area that constantly poses new technical and computational problems.

Responsibilities*

Develop, test, maintain, document, and deploy an interface for statistical inferences with sample data and result visualization, specifically the implementation of multilevel regression and poststratification.

Provide timely technical support during research deployments.
Create documentation and tutorials about the developed applications for use by interdisciplinary research teams.

Required Qualifications*

Bachelor’s or Master’s degree in Statistics/Computer Science/Information Science/Informatics or related technical discipline or a combination of education and software-development experience in a research or corporate environment to equal three (3) years.

Skills in R/Python programming.

Experience in data visualization and dashboard construction.

Experience with databases. Direct experience with demographic or geospatial data a plus.

Familiarity with C++ or Java. Knowledge of Stan programming a plus.

Track record of successful application development a plus.

Desired Qualifications*

Dashboard development experience preferred.

Work Locations

This position will be on-site at the University of Michigan offices in Ann Arbor, with flexible scheduling and remote opportunities made available within our overall Center policies. If relocation is required, we will allow for reasonable time and flexibility to make necessary arrangements.

This is an exciting research project and we’re hoping to find someone who can take a lead on the programming. Click through to the link for more details and to apply.

# Preregistered vs. adaptable Bayesian workflow, and who should do the work?

Later in the thread, the Millennium Villages Project (MVP) evaluations came up:

I thought:

1. Our MVP evaluation says:

The design was completed before endline data collection, and a peer-reviewed evaluation protocol was registered with The Lancet.

Could the protocol have been more detailed and constrained more researcher degrees of freedom? Yes.

This may not be a solution though, as Michael Betancourt writes:

Preserving a preregistered model only perpetuates its optimistic assumptions about the realization of an experiment. In practice we almost always have to adapt our model to the peculiarities of the observed data.

The Bayesian workflow authors say:

Our claim is that often a model whose assumptions withstood such severe tests is, despite being the result of data-dependent iterative workflow, more trustworthy than a preregistered model that has not been tested at all.

2. Eliciting domain expertise should probably involve project leadership. But how?

3. I’m flattered to be considered by Johannes to be a trustworthy researcher. I also prefer publication procedures that rely less on authors’ reputation and more on methods clarity.

Andrew thought:

It’s funny that they characterize us as being “independent,” given that both of us had Earth Institute associations at the time we did the project.

# Some project opportunities for Ph.D. students!

Hey! My collaborators and I are working on a bunch of interesting, important projects where we could use some help. If you’d like to work with us, send me an email with your CV and any relevant information and we can see if there’s a way to fit you into the project. Any of these could be part of a Ph.D. thesis. And with remote collaboration, this is open to anyone—you don’t have to be at Columbia University. It would help if you’re a Ph.D. student in statistics or a related field with some background in Bayesian inference.

There are other projects going on, but here are a few where we could use some collaboration right away. Please specify in your email which project you’d like to work on.

1. Survey on social conditions and health:

We’re looking for help adjusting to the methodological hit that a study suffered when COVID shut down data collection in the middle of implementing a sampling design to refresh the cohort, as well as completing interviews for ongoing cohort members. We will need to consider sampling adjustments. Also, they experimented with conducting some interviews via phone or Zoom during the pandemic, as these were shorter than their regular 2-hour in-person interview, so it would be good to examine item missingness and imputation strategies for key variables important for the planned analyses.

2. Measurement-error models:

This is something that I’m interested in for causal inference in general, also a recent example came up in climate science, where an existing Bayesian approach has problems, and I think we could do something good here by thinking more carefully about priors. In addition to the technical challenges, climate change is a very important problem to be working on.

3. Bayesian curve fitting for drug development:

This is a project with a pharmaceutical company to use hierarchical Bayesian methods to fit concentration-response curves in drug discovery. The cool thing here is that we have a pipeline with thousands of experiments, so we want an automated approach. This relates to our work on scalable computing, diagnostics, and model understanding, as well as specific issues of nonlinear hierarchical models.

4. Causal inference with latent data:

There are a few examples here of survey data where we’d like to adjust for pre-treatment variables which are either unmeasured or are measured with error. This is of interest for Bayesian modeling and causal inference, in particular the idea that we can improve upon the existing literature by using stronger priors, and also for the particular public health applications.

5. Inference for models identifying spatial locations:

This is for a political science project where we will be conducting a survey and asking people questions about nationalities and ethnic groups and using this to estimate latent positions of groups. Beyond the political science interest (for example, comparing mental maps of Democrats and Republicans), this relates to some research in computational neuroscience. It would be helpful to have a statistics student on the project because there are some challenging modeling and computational issues and it would be good for the political science student to be able to focus on the political science aspects of the project.

# Bets as forecasts, bets as probability assessment, difficulty of using bets in this way

John Williams writes:

Bets as forecasts come up on your blog from time to time, so I thought you might be interested in this post from RealClimate, which is the place to go for informed commentary on climate science.

The post, by Gavin Schmidt, is entitled, “Don’t climate bet against the house,” and tells the story of various public bets in the past few decades regarding climate outcomes.

The examples are interesting in their own right and also as a reminder that betting is complicated. In theory, betting has close links to uncertainty, and you should be able to go back and forth between them:

1. From one direction, if you think the consensus is wrong, you can bet against it and make money (in expectation). You should be able to transform your probability statements into bets.

2. From the other direction, if bets are out there, you can use these to assess people’s uncertainties, and from there you can make probabilistic predictions.

In real life, though, both the above steps can have problems, for several reasons. First is the vig (in a betting market) or the uncertainty that you’ll be paid off (in an unregulated setting). Second is that you need to find someone to make that bet with you. Third, and relatedly, that “someone” who will bet with you might have extra information you don’t have, indeed even their willingness to bet at given odds provides some information, in a Newtonian action-and-reaction sort of way. Fourth, we hear about some of the bets and we don’t hear about others. Fifth, people can be in it to make a point or for laffs or thrills or whatever, not just for the money, enough so that, when combined with the earlier items on this list, there won’t be enough “smart money” to take up the slack.
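A toy illustration of the first problem, with made-up numbers: even a bettor with a real edge loses part of it to the vig at a standard -110 line:

```python
# Hypothetical numbers: you believe P(event) = 0.60, and a bookmaker offers
# an even-money proposition but charges a vig: you stake 110 to win 100
# (the standard -110 line). Expected value per bet:
p = 0.60
stake, win = 110, 100
ev = p * win - (1 - p) * stake
print(ev)  # 16.0: still positive, but a vig-free even-money bet would return 20
```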

This is not to say that betting is a useless approach to information aggregation; I’m just saying that betting, like other social institutions, works under certain conditions and not in absolute generality.

And this reminds me of another story.

Economist Bryan Caplan reports that his track record on bets is 23 for 23. That’s amazing! How is it possible? Here’s Caplan’s list, which starts in 2007 and continues through 2021, with some of the bets still unresolved.

Caplan’s bets are an interesting mix. The first one is a bet where he offered 1-to-100 odds so it’s no big surprise that he won, but most of them are at even odds. A couple of them he got lucky on (for example, he bet in 2008 that no large country would leave the European Union before January 1, 2020, so he just survived by one month on that one), but, hey, it’s ok to be lucky, and in any case even if he only had won 21 out of 23 bets, that would still be impressive.

It seems to me that Caplan’s trick here is to show good judgment on what pitches to swing at. People come at him with some strong, unrealistic opinions, and he’s been good at crystallizing these into bets. In poker terms, he waits till he has the nuts, or nearly so. 23 out of 23 . . . that’s a great record.

# Weak separation in mixture models and implications for principal stratification

Avi Feller, Evan Greif, Nhat Ho, Luke Miratrix, and Natesh Pillai write:

Principal stratification is a widely used framework for addressing post-randomization complications. After using principal stratification to define causal effects of interest, researchers are increasingly turning to finite mixture models to estimate these quantities. Unfortunately, standard estimators of mixture parameters, like the MLE, are known to exhibit pathological behavior. We study this behavior in a simple but fundamental example, a two-component Gaussian mixture model in which only the component means and variances are unknown, and focus on the setting in which the components are weakly separated. . . . We provide diagnostics for all of these pathologies and apply these ideas to re-analyzing two randomized evaluations of job training programs, JOBS II and Job Corps.

The paper’s all about maximum likelihood estimates and I don’t care about that at all, but the general principles are relevant to understanding causal inference with intermediate outcomes and fitting such models in Stan or whatever.
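To see what weak separation looks like numerically, here is a sketch (my own toy setup, not the paper’s): with component means only 0.2 apart, the data barely prefer the true two-component mixture over a single Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_loglik(x, mu1, mu2, sigma=1.0, w=0.5):
    """Log likelihood of a two-component Gaussian mixture with equal weights."""
    def norm_pdf(x, mu, s):
        return np.exp(-0.5 * ((x - mu) / s)**2) / (s * np.sqrt(2 * np.pi))
    return np.log(w * norm_pdf(x, mu1, sigma) + (1 - w) * norm_pdf(x, mu2, sigma)).sum()

# Weakly separated components: means only 0.2 apart
x = np.concatenate([rng.normal(-0.1, 1, 500), rng.normal(0.1, 1, 500)])
ll_true = mix_loglik(x, -0.1, 0.1)    # the true, weakly separated mixture
ll_single = mix_loglik(x, 0.0, 0.0)   # a single Gaussian in disguise
print(ll_true - ll_single)  # nearly zero: 1000 observations barely distinguish the models
```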

# Stan downtown intern posters: scikit-stan & constraining transforms

It’s been a happening summer here at Stan’s downtown branch at the Flatiron Institute. Brian Ward and I advised a couple of great interns. Two weeks or so before the end of the internship, our interns present posters. Here are the ones from Brian’s intern Alexey and my intern Meenal.

Alexey Izmailov: scikit-stan

Alexey built a version of the scikit-learn API backed by Stan’s sampling, optimization, and variational inference. It’s plug-and-play with scikit-learn.

Meenal Jhajharia: unconstraining transforms

Meenal spent the summer exploring constraining transforms and how to evaluate them, with the goal of refining Stan’s transform performance and adding new data structures. This involved figuring out how to evaluate them against target distributions (w.r.t. convexity, conditioning if convex, and sampling behavior in the tail, body, and near the mode of target densities). Results are turning out to be more interesting than we suspected, in that different transforms seem to work better under different conditions. We’re also working with Seth Axen (Tübingen) and Stan devs Adam Haber and Sean Pinkney.

They don’t make undergrads like they used to

Did I mention they were undergrads? Meenal’s heading back to University of Delhi to finish her senior year and Alexey heads back to Brown to start his junior year! The other interns at the Center for Computational Mathematics, many of whom were undergraduates, have also done some impressive work in everything from using normalizing flows to improve sampler proposals for molecular dynamics to building 2D surface PDE solvers at scale to HPC for large N-body problems. In this case, not making undergrads like they used to is a good thing!

Hiring for next summer

If you’re interested in working on statistical computing as an intern next summer, drop me a line at [email protected]. I’ll announce when applications are open here on the blog.

# If the outcome is that rare, then nothing much can be learned from pure statistics.

Alain Fourmigue writes:

You may have heard of this recent controversial study on the efficacy of colchicine to reduce the number of hospitalisations/deaths due to covid.

It seems to be the opposite of the pattern usually reported on your blog.

Here, we have a researcher making a bold claim despite the lack of statistical significance,
and the scientific community expressing skepticism after the manuscript is released.

This study raises an interesting issue: how to analyse very rare outcomes (prevalence < 1%)? The sample is big (n>4400), but the outcome (death) is rare (y=14).
The SE of the log OR is ~ sqrt(1/5+1/9+1/2230+1/2244).
Because of the small number of deaths, there will inevitably be a lot of uncertainty.
Very frustrating…

Is there nothing we could do?
Is there nothing better than logistic regression / odds ratios for this situation?
I’m not sure the researcher could have afforded a (credible) informative prior.

I replied that, yes, if the outcome is that rare then nothing much can be learned from pure statistics. You’d need a model that connects more directly to the mechanism of the treatment.
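As a quick check of the arithmetic in Fourmigue’s email (cell counts approximate):

```python
import math

# With 5 vs. 9 deaths and roughly 2230 vs. 2244 survivors per arm, the
# Woolf standard error of the log odds ratio is
se_log_or = math.sqrt(1/5 + 1/9 + 1/2230 + 1/2244)
print(round(se_log_or, 3))  # about 0.559

# A 95% interval on the odds-ratio scale then spans a multiplicative factor of
spread = math.exp(2 * 1.96 * se_log_or)
print(round(spread, 1))  # roughly 9: the data alone cannot pin down the effect
```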

# Discussion on identifiability and Bayesian inference

A bunch of us were in an email thread during the work on the article, “Bayesian workflow for disease transmission modeling in Stan,” by Léo Grinsztajn, Liza Semenova, Charles Margossian, and Julien Riou, and the topic came up of when to say that a parameter in a statistical model is “identified” by the data.

We did not come to any conclusions, but I thought some of you might find our discussion helpful in organizing your own thoughts.

It started with Liza, who said:

Definitions are important when it comes to identifiability. There are a few methods in the literature addressing (or, at least, discussing) the issue, but it is not always specified which definition is being used.

The most important pair of definitions is probably structural identifiability and practical identifiability.

Practical non-identifiability is linked to the amount and quality of data. It answers the question of whether parameters can be estimated given available data.

Structural non-identifiability arises due to model structure and cannot be resolved by collecting more data, or better-quality data. Example y = a * b: we can measure y as much as we want, but we will not be able to distinguish between a and b no matter the number of measurements. One might hope that the issue could be resolved either via re-parametrisation or via providing sensible priors (i.e. by adding the Bayesian sauce). In this specific example, the issue is caused by the scaling symmetry of the equation, i.e. the equation does not change if we substitute a with k*a and b with 1/k*b. I wonder whether knowledge of the type of symmetry can help us understand which priors would be “sensible,” as these priors would need to break the detected symmetries. (Everything above applies to any kind of model, but my focus at work at the moment is on ODEs. For ODE systems there is a lot of literature on analytical methods for finding symmetries, and the simplest of them can be applied to mid-size systems by hand.)
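The y = a * b example is easy to see numerically; this sketch evaluates a Gaussian log likelihood (assumed noise scale 0.1) at several points along the symmetry curve a*b = const:

```python
# The product y = a * b is structurally non-identifiable: every (k*a, b/k)
# gives the same y, so the likelihood has a flat ridge along that curve.
y_obs, sigma = 6.0, 0.1

def loglik(a, b):
    return -0.5 * ((a * b - y_obs) / sigma)**2

# Points along the symmetry curve a*b = 6 all achieve the maximum:
vals = [loglik(2.0 * k, 3.0 / k) for k in [0.5, 1.0, 2.0, 4.0]]
print(vals)  # 0.0 every time: the likelihood cannot tell these apart
```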

An intuition concerning Stan (the software) would be that it would produce warnings in every case of non-identifiability. But I am starting to get the feeling that this is not the case. I am preparing an example of an ODE dx/dt = – a1*a2*x, x(0) = x_0, with uniform priors U(0,10) on both a1 and a2, and max_treedepth = 15. No warnings. Convergence is good.

A further list of definitions includes structural global identifiability, structural local identifiability, identifiability of the posterior, weak identifiability. Linked terms are observability and estimability.

My aim is to distil existing methods to test for identifiability and demonstrate them on a few examples, as well as to answer the questions emerging along the way, e.g., “Do Bayesian priors help?”, “Can the knowledge about the type of invariance of the system help us choose the priors?”, “How do we test a system for identifiability, both analytically and numerically?”, “Can analytical methods help us find good model reparametrisations to avoid numerical issues?”

I responded:

One insight that might be helpful is that identifiability depends on data as well as model. For example, suppose we have a regression with continuous outcome y and binary predictors x1,…,x10. If your sample size is low enough, you’ll have collinearity. Even with a large sample size you can have collinearity. So I’m not so happy with definitions based on asymptotic arguments. A relevant point is that near-nonidentifiability can be as bad as full nonidentifiability. To put it another way, suppose you have an informative prior on your parameters. Then nonidentifiability corresponds to data that are weak enough that the prior is dominant.
Also relevant is this paper, “The prior can often only be understood in the context of the likelihood,” also this, “The experiment is just as important as the likelihood in understanding the prior.”
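The small-sample version of this point can be made concrete; in this sketch, with 10 binary predictors and only 8 observations, the design matrix is guaranteed to be rank deficient:

```python
import numpy as np

rng = np.random.default_rng(1)

# With 10 binary predictors but only 8 observations, the design matrix
# cannot have full column rank, so the regression coefficients are not
# identified by the data no matter what the true model is.
n, p = 8, 10
X = rng.integers(0, 2, size=(n, p)).astype(float)
rank = np.linalg.matrix_rank(X)
print(rank)  # at most 8, always below p = 10
```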

Non-identifiability is a complicated thing that is often the result of a failure.

Like if a car explodes, we can sit around and explain what happens once it has exploded (fire, smoke, etc.), but the fundamental problem of making a car that doesn’t explode is simpler (and more useful if we wanted to drive a car).

The last time I hit identifiability problems on the forums we worked it out with simulated data.

My reaction more and more to a model that doesn’t behave as expected is that first we should find a model that does kinda behave well and go from there. The reason I think this is that finding a model that does work is a much more methodical, reliable process than trying to figure out the geometry or whatever is causing the problem. I tell people to look at posterior correlations and stuff to find non-identifiabilities, but I don’t know how far it’s ever gotten anyone.

I threw in:

Just to shift things slightly: I think that one of the problems with the statistical literature (including my own writings!) is how vague it all is. For example, we talk about weakly informative priors or weak identification, and these terms are not clearly defined. I think the appropriate way to go forward here is to recognize that “identification,” however defined, is a useful concept, but I don’t think that any strict definition of “identification” will be useful to us here, because in the Bayesian world, everything’s a little bit identified (as long as we have proper priors) and nothing is completely identified (because we’re not asymptotic). So maybe, rather than talking about identified or weakly identified or partially identified etc., we should be developing quantitative tools to assess the role of the prior and how it affects posterior inferences. Or something like that.

To which Liza replied:

Yes to quantitative tools. Some attempts in that direction have been made, but it seems that none of them has been broadly adopted:

1). Percentage overlap between prior and posterior (Garrett and Zeger, 2000). In applications, they have used a threshold of 35%: if the overlap is greater than the threshold, they considered the parameter in question weakly identifiable. The overlap is calculated using posterior samples and kernel smoothing. This approach has only been used when priors were uniform, and the threshold would be hard to calibrate for other priors. Moreover, there are examples where the posterior of a non-identifiable parameter differs from its prior (Koop et al, 2013). Also, the posterior and prior of an identifiable parameter can have high overlap – for instance, when the prior is informative and the data only confirm prior knowledge.
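As a rough sketch of the overlap idea (a numerical-grid stand-in for the kernel-smoothing computation Garrett and Zeger describe, using a toy binomial model with a uniform prior):

```python
import numpy as np

# Overlap = integral of min(prior, posterior) over the parameter space.
# Toy example: Uniform(0,1) prior on a probability, posterior after
# 70 successes in 200 trials (so the data dominate the prior).
theta = np.linspace(0.0005, 0.9995, 1000)
d = theta[1] - theta[0]
prior = np.ones_like(theta)                  # Uniform(0,1) density
post = theta**70 * (1 - theta)**130          # Beta(71,131) kernel
post = post / (post.sum() * d)               # normalize numerically
overlap = np.sum(np.minimum(prior, post)) * d
print(round(overlap, 2))  # well below the 35% threshold: the parameter is identified
```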

2). Xie and Carlin (2005) suggest two measures of Bayesian learning (computed as KL divergence): how much we can potentially learn about a parameter, and how much there is left to learn given data y. The computational algorithm involves MCMC.

3). Data cloning (Lele et al, 2010): a technique for obtaining maximum likelihood estimates in complex models using Bayesian machinery. Supposedly, it is also useful for studying identifiability. The idea is that the dataset is cloned K times (i.e., each observation is repeated K times) while the model remains unchanged. For estimable parameters the scaled variance (the ratio of the posterior variance at K>1 to the posterior variance at K=1) should, in theory, behave as 1/K. If a parameter is not estimable, the scaled variance is larger than 1/K. The method was proposed in a setting with uniform priors; the question of what happens with non-uniform priors remains open. I have applied data cloning to an example of two ODEs. The first ODE contains only the parameter k: x'(t) = -k*x; the second is overparametrized and contains two parameters lambda1 and lambda2: x'(t) = -lambda1*lambda2*x; the priors on the parameters are always the same, U(0,10). The plot of scaled variances is attached. Note that with the Stan setup I chose (max tree depth = 15) there were no warnings.
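The 1/K behavior is easy to see in a conjugate model where the posterior variance is exact. A minimal sketch, with illustrative numbers (not the ODE example above):

```python
# Data-cloning diagnostic (Lele et al.) in a conjugate normal model with
# known observation variance, so posterior variances are exact.
sigma2 = 4.0      # known observation variance
tau2 = 100.0      # prior variance on the mean (a weak prior)
n = 25            # observations in the original data set

def posterior_var(K):
    # Cloning the data K times multiplies the effective sample size by K.
    return 1.0 / (1.0 / tau2 + K * n / sigma2)

for K in (2, 4, 8, 16):
    scaled = posterior_var(K) / posterior_var(1)
    print(f"K={K:2d}  scaled variance = {scaled:.4f}  (1/K = {1 / K:.4f})")
# For an estimable parameter the scaled variance tracks 1/K; for a
# non-identifiable one (e.g., lambda1 and lambda2 separately) it levels
# off above 1/K, because the prior never washes out in that direction.
```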

Another question is the connection of identifiability and model selection: if an information criterion takes the number of parameters into account, shouldn’t it be the number of identifiable parameters?

An overall suggestion for the “identifiability workflow”: it should be a combination of analytical and numerical methods and include all or some of the steps:
1. apply analytical methods by hand, if applicable (e.g. testing scaling invariances is possible for mid-size ODE systems)
2. numerical methods:
(a) Hessian method (standardized eigenvalues which are too small correspond to redundant parameters, but picking the threshold is hard),
(b) simulation-based method (start with some true values, generate data, fit the model, then calculate the coefficient of variation; it is small for identifiable parameters and large for non-identifiable ones)
(c) profile likelihood method: a flat profile indicates parameter redundancy
3. symbolic methods
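To illustrate the Hessian method in step 2(a), here is a sketch on the classic non-identifiable model y ~ N(a*b, 1), where only the product a*b enters the likelihood; the values are illustrative:

```python
import numpy as np

# In y ~ Normal(a*b, 1), only the product a*b is identified, so the
# Fisher information matrix is rank-deficient. The mean is mu(a, b) = a*b,
# its gradient is (b, a), and the information is n * outer(grad, grad),
# a rank-1 matrix.
n = 100
a, b = 2.0, 3.0                       # illustrative "true" values

grad = np.array([b, a])
fim = n * np.outer(grad, grad)

eigvals = np.linalg.eigvalsh(fim)     # ascending order
standardized = eigvals / eigvals.max()
print("standardized eigenvalues:", standardized)
# One standardized eigenvalue is numerically zero: the redundant
# direction along the ridge a*b = constant. In practice the hard part,
# as noted above, is picking the "too small" threshold.
```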

P.S. There is also no consensus on the term either – both non-identifiability and unidentifiability are used.

Charles then said:

Thanks for pooling all these ideas. As Andrew mentions, it can be good to distinguish finite data cases and asymptotics. The intuitive way I think about non-identifiability is we can’t solve a * b = x for a and b. There is an infinite number of solutions, which in a Bayesian context suggests an infinite variance. So maybe thinking in terms of variance might be helpful? But we could survey problems scientists face and see how these motivate different notions of identifiability.

If I understand correctly, (1) measures how much we learn from the data, i.e., how much the posterior is driven by the prior. The data could be not very informative (relative to the prior), or (equivalently) the prior could be very informative (because it is well constructed or otherwise). I think this type of criterion goes back to the questionable idea that the prior shouldn’t influence the posterior. This becomes a question of robustness, because we can think of the prior as additional data with a corresponding likelihood (so using a bad prior is like adding bad data).

(2) sounds interesting too. Working out “how much there is to learn from more data” reminds me of a prediction problem for my qualifying exams. For simple enough models, the learning rate plateaus quickly and more data wouldn’t improve predictions. Maybe the idea here is that the posterior variance of the parameter doesn’t go to 0 with more data.

(3) also sounds very neat and relates to learning rates. I’m not sure what “estimable” and “non-estimable” mean here. But in a uniform-normal case, I agree the posterior variance goes down with n. We could then ask what the expected rate is for multimodal distributions. A minor note: rather than cloning the data, can we simply lower the likelihood variance?

# PPL Workshop this Friday (29 July 2022)

The Computational Abstractions for Probabilistic and Differentiable Programming (CAPP) Workshop is happening this Friday, 29 July 2022. It’s organized out of Cambridge and is running on UK time.

The schedule of speakers is online and the free Zoom link is available under the Location tab of their web site:

Despite the title, the talks sound very practical. The speaker lineup includes developers from BUGS, JAGS, Stan, JAX, and Turing.jl.

My talk

I (Bob) will be talking about our plan to introduce Python-like comprehensions into the Stan language to deal with constant matrix constructions like Gaussian Process covariance matrices. This is motivated by our shift in underlying matrix data structure for autodiff in Stan—what the Stan developers have been calling “mat-var” in meetings and pull requests because it’s an autodiff variable with a matrix value and matrix adjoint rather than our original approach using a matrix of scalar autodiff variables.

P.S. This is a big reduction in memory pressure and time for larger-scale matrix ops for Stan. The really clever part is how the C++ devs (Tadej Ciglarič, Steve Bronder, Rok Češnovar, et al.) have managed to abstract the mat-var construct in a way that’s backward compatible with our existing autodiff with simpler code. If you like C++ and functional template meta programming in C++, I urge you to follow along. Steve’s about to drop all the dev-facing doc for all of this—it’s currently in a massive PR undergoing final revisions.

# J. Robert Lennon on working with your closest collaborator

From Broken River:

The problem she was grappling with now, though, was this: the Irina who started writing this novel was the old one. The Irina who was trying to finish it was the new. Was such a collaboration even possible? Did new Irina even want to share a .doc with old Irina? The situation felt hopeless.

This reminds me of the principle that your closest collaborator is you, six months ago—and she doesn’t answer emails. Interesting to see this idea in a fictional context. Then again, fiction is just a form of simulated-data experimentation.

P.S. Lennon reviews Sedaris, which reminds me of my idea from many years ago that Sedaris, having run out of good material from his own life, should start interviewing other people and telling their stories. Sedaris is a great storyteller, and I don’t see why he needs to restrict himself to basing them on things that happened to him personally. Seems like a real missed opportunity.

# Solving inverse problems using Bayesian inference with FFT in Stan, from Python . . . this example has it all!

Brian Ward, Bob Carpenter, and David Barmherzig write:

The HoloML technique is an approach to solving a specific kind of inverse problem inherent to imaging nanoscale specimens using X-ray diffraction.

To solve this problem in Stan, we first write down the forward scientific model given by Barmherzig and Sun, including the Poisson photon distribution and censored data inherent to the physical problem, and then find a solution via penalized maximum likelihood. . . .

The idealized version of HCDI is formulated as:

Given a reference R and data Y = |F(X + R)|^2, recover the source image X, where F is an oversampled Fourier transform operator.

However, the real-world setup of these experiments introduces two additional difficulties. First, data is measured from a limited number of photons: the number of photons received by each detector is modeled as Poisson distributed with expectation given by Y_ij (referred to in the paper as Poisson shot noise). The expected number of photons each detector receives is denoted N_p, and we typically have N_p < 10 due to the damage that radiation causes to the biomolecule under observation. Second, to prevent damage to the detectors, the lowest frequencies are removed by a beamstop, which censors low-frequency observations.
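A minimal numpy sketch of this forward model; the image size, reference pattern, beamstop radius, and the -1 censoring marker are all illustrative assumptions here, not the case study's actual choices:

```python
import numpy as np

# Forward model: Y = |F(X + R)|^2 with Poisson shot noise and a beamstop.
rng = np.random.default_rng(1)
n = 32
X = rng.random((n, n))                 # hypothetical source image
R = rng.binomial(1, 0.5, (n, n))       # hypothetical known binary reference

# Oversampled Fourier transform: zero-pad to twice the size.
Y = np.abs(np.fft.fft2(X + R, s=(2 * n, 2 * n))) ** 2

# Poisson shot noise with an average expected photon count N_p per detector.
N_p = 5                                # N_p < 10, per the text
Y_tilde = rng.poisson(N_p * Y / Y.mean())

# Beamstop: censor the lowest frequencies around the DC component.
freqs = np.fft.fftfreq(2 * n)
fy, fx = np.meshgrid(freqs, freqs, indexing="ij")
beamstop = np.hypot(fx, fy) < 0.05     # illustrative radius
observed = np.where(beamstop, -1, Y_tilde)   # -1 marks censored detectors
print("censored detectors:", int(beamstop.sum()))
```

The inference task is then to recover X from `observed`, which is what the Stan program does.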

In this Python notebook, Ward et al. do it all. They simulate fake data from the model, they define helper functions for their Fourier transforms in Stan (making use of Stan’s fast Fourier transform function), they write the Bayesian model (including an informative prior distribution) directly as a Stan program, they run it from Python using cmdstanpy, then they do some experimentation to see what happens when they vary the sample size of the data. The version in the case study uses optimization, but the authors tell me that sampling also works, and they’re now working on the next step, which is comparison with other methods.

I wish this computational infrastructure had existed back when I was working on my Ph.D. thesis. It would’ve made my life so much simpler—I wouldn’t’ve had to write hundreds of lines of Fortran code!

In all seriousness, this is wonderful, from beginning to end, showing how Bayesian inference and Stan can take on and solve a hard problem—imaging with less than 10 photons per detector, just imagine that!

Also impressive to me is how mundane and brute-force this all is. It’s just step by step: write down the model, program it up, simulate fake data, check that it all makes sense, all in a single Python script. Not all these steps are easy—for this particular problem you need to know a bit of Fourier analysis—but they’re all direct. I love this sort of example where the core message is not that you need to be clever, but that you just need to keep your eye on the ball, and with no embarrassment about the use of a prior distribution. This is where I want Bayesian inference to be. So thanks, Brian, Bob, and David for sharing this with all of us.

P.S. Also this, which team Google might be wise to emulate:

Reproducibility

This notebook’s source and related materials are available at https://github.com/WardBrian/holoml-in-stan.

```
%load_ext watermark
%watermark -n -u -v -iv -w
print("CmdStan:", cmdstanpy.utils.cmdstan_version())
```

```
Last updated: Fri Jul 01 2022

Python implementation: CPython
Python version       : 3.10.4
IPython version      : 8.4.0

scipy     : 1.8.0
cmdstanpy : 1.0.1
numpy     : 1.22.3
IPython   : 8.4.0
matplotlib: 3.5.2

Watermark: 2.3.1

CmdStan: (2, 30)
```
The rendered HTML output is produced with:

```
jupyter nbconvert --to html "HoloML in Stan.ipynb" --template classic --TagRemovePreprocessor.remove_input_tags=hide-code --CSSHTMLHeaderPreprocessor.style=tango --execute
```

# Moving cross-validation from a research idea to a routine step in Bayesian data analysis

This post is by Aki.

Andrew has a Twitter bot @StatRetro tweeting old blog posts. A few weeks ago, the bot tweeted a link to a 2004 blog post, “Cross-validation for Bayesian multilevel modeling.” Here are some quick thoughts now.

Andrew started with the question “What can be done to move cross-validation from a research idea to a routine step in Bayesian data analysis?” and mentioned importance sampling as a possibility, but then continued: “However, this isn’t a great practical solution since the weights, 1/p(y_i|theta), are unbounded, so the importance-weighted estimate can be unstable.” We now have Pareto smoothed importance sampling leave-one-out (PSIS-LOO) cross-validation (Vehtari, Gelman, and Gabry, 2017) implemented, e.g., in the loo R package and the ArviZ Python/Julia package, which have been downloaded millions of times and seem to be routinely used in Bayesian workflows! The benefits of the approach are that in many cases the user needs to do nothing extra, or only needs to add a few lines to their Stan code; the computation after sampling is really fast; and the method has a diagnostic that tells when some other, computationally more intensive approach is needed.
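For intuition, here is a sketch of the plain importance-sampling LOO that the 2004 post worried about, in a toy conjugate model. Everything here is illustrative; PSIS-LOO additionally stabilizes the unbounded weights by fitting a Pareto distribution to their tail:

```python
import numpy as np

# IS-LOO in a toy model: y_i ~ Normal(mu, 1), mu ~ Normal(0, 10), so the
# posterior for mu is available in closed form and we can draw from it.
rng = np.random.default_rng(7)
y = rng.normal(0.5, 1.0, size=50)

post_var = 1.0 / (1.0 / 100.0 + len(y))          # conjugate update
post_mean = post_var * y.sum()
mu = rng.normal(post_mean, np.sqrt(post_var), size=4000)

# Pointwise log-likelihood matrix, shape (n, S).
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[:, None] - mu[None, :]) ** 2

# Self-normalized IS with weights 1/p(y_i | mu_s) reduces to the
# harmonic-mean identity: p(y_i | y_-i) ~= 1 / mean_s[1 / p(y_i | mu_s)].
loo_dens = 1.0 / np.exp(-loglik).mean(axis=1)
elpd_loo = np.log(loo_dens).sum()
print(f"IS-LOO elpd estimate: {elpd_loo:.1f}")
```

In this well-behaved example the raw weights cause no trouble; the instability Andrew described shows up when the posterior is far from each leave-one-out posterior, which is exactly what the PSIS diagnostic flags.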

Andrew also discussed multilevel models: “When data have a multilevel (hierarchical) structure, it would make sense to cross-validate by leaving out data individually or in clusters, for example, leaving out a student within a school or leaving out an entire school. The two cross-validations test different things.” PSIS-LOO is great for leave-one-student-out, but leaving out an entire school often changes the posterior so much that even PSIS can’t handle it. In such cases the easiest approach is still K-fold-CV (i.e., do the brute-force computation K times, with K possibly smaller than the number of schools). It is possible to use PSIS, but then additional quadrature integration over the parameters for the left-out school is needed to get useful results (e.g., Merkle, Furr, and Rabe-Hesketh, 2019). We’re still thinking about how to make cross-validation for multilevel models easier and faster.

Andrew didn’t discuss time series or non-factorized models, but we can use PSIS to compute leave-future-out cross-validation for time series models (Bürkner, Gabry, and Vehtari, 2020a), and for multivariate normal and Student-t models we can do one part analytically and the rest with PSIS (Bürkner, Gabry, and Vehtari, 2020b).

Andrew mentioned DIC; we have since analyzed the properties of DIC, WAIC, and leave-one-out cross-validation (Gelman, Hwang, and Vehtari, 2014), and eventually PSIS-LOO has proved to be the most reliable and to have the best self-diagnostic (Vehtari, Gelman, and Gabry, 2017).

Andrew also mentioned my 2002 paper on cross-validation, so I knew that he was aware of my work, but it still took several years before I had the courage to contact him and propose a research visit. That research visit was great, and I think we can say we (including all co-authors and people writing software) have been able to make some concrete steps to make cross-validation a more routine step.

Although we are advocating routine use of cross-validation, I want to remind readers that we are not advocating cross-validation for model selection as a form of hypothesis testing (see, e.g., this talk, and Gelman et al. 2020). Ideally the modeller includes all the uncertainties in the model, integrates over them, and checks that the model makes sense. There is then no need to select any model, as the model that best expresses the information available to the modeller, and the related uncertainties, is all that is needed. However, cross-validation is useful for assessing how good a single model is, for model checking (diagnosing misspecification), for understanding differences between models, and for speeding up the model-building workflow (we can quickly discard really bad models and focus on more useful ones; see, e.g., this talk on Bayesian workflow).

You can find more papers and discussion of cross-validation in CV-FAQ, and stay tuned for more!

# Stan goes mountain climbing, also a suggestion of 3*Alex

Jarrett Phillips writes:

I came across this recent preprint by Drummond and Popinga (2021) on applying Bayesian modeling to assess climbing ability. Drummond is foremost a computational evolutionary biologist (as am I) who is also an experienced climber. The work looks quite interesting. I was previously unaware of such an application and thought others may also appreciate it.

I’m not a climber at all, but it’s always fun to see new applications of Stan. In their revision, I hope the authors can collaborate with someone who’s experienced in Bayesian data visualization and can help them make some better graphs. I don’t mean they should just take their existing plots and make them prettier; I mean that I’m pretty sure there are some more interesting and informative ways to display their data and fitted models. Maybe Jonah or Yair could help—they live in Colorado so they might be climbers, right? Or Aleks would be perfect: he’s from Slovenia where everyone climbs, and he makes pretty graphs, and then the paper would have 3 authors named some form of Alex. So that’s my recommendation.

# Academic jobs in Bayesian workflow and decision making

This job post (with two research topics) is by Aki. (I promise that next time I post about something else.)

I’m looking for postdocs and doctoral students to work with me on Bayesian workflow at Aalto University, Finland. You can apply through a joint call (with many other related topics); there are separate application forms for postdocs and for doctoral students.

We’re also looking for postdocs and doctoral students to work on probabilistic modeling for assisting human decision making, with Finnish Center for Artificial Intelligence funding. You can apply through a joint call application form (with many more probabilistic modeling topics).

To get some idea of how we might approach these topics, you can check out what I’ve been talking about and working on recently.

For five years straight, starting in 2018, the World Happiness Report has singled out Finland as the happiest country on the planet.

# Bayes factors measure prior predictive performance

I was having a discussion with a colleague after a talk that was focused on computing the evidence, and I mentioned that I don’t like Bayes factors because they measure prior predictive performance rather than posterior predictive performance. But even after filling up a board, I couldn’t convince my colleagues that Bayes factors really measure prior predictive performance. So let me try in blog form, and maybe the discussion can help clarify what’s going on.

Prior predictive densities (aka, evidence)

If we have data $y$, parameters $\theta$, sampling density $p(y \mid \theta)$ and prior density $p(\theta)$, the prior predictive density is defined as

$p(y) = \int p(y \mid \theta) \, p(\theta) \, \textrm{d}\theta.$

The integral computes an average of the sampling density $p(y \mid \theta)$ weighted by the prior $p(\theta)$. That’s why we call it “prior predictive”.

Bayes factors compare prior predictive densities

Let’s write $p_{\mathcal{M}}(y)$ to indicate that the prior predictive density depends on the model $\mathcal{M}$. Then if we have two models, $\mathcal{M}_1, \mathcal{M}_2$, the Bayes factor for data $y$ is defined to be

$\textrm{BF}(y) = \frac{p_{\mathcal{M}_1}(y)}{p_{\mathcal{M}_2}(y)}$.

What are Bayes factors measuring? Ratios of prior predictive densities. Usually this isn’t so interesting, because the difference between a weakly informative prior and one an order of magnitude wider usually doesn’t make much of a difference for posterior predictive inference. There’s more discussion of this with examples in Gelman et al.’s Bayesian Data Analysis.

Jeffreys set thresholds for Bayes factors of “barely worth mentioning” (below $\sqrt{10}$) to “decisive” (above 100). But we don’t need to worry about that.
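To make the “prior predictive” point concrete, here is a worked example in the beta-binomial model, where the evidence has a closed form; the priors and data are illustrative:

```python
from math import comb, exp, lgamma, log

# Bayes factor as a ratio of prior predictive densities (evidences)
# in a binomial model with a Beta(a, b) prior.
def log_beta_fn(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_evidence(k, n, a, b):
    # log p(y) = log C(n,k) + log B(k+a, n-k+b) - log B(a, b)
    return log(comb(n, k)) + log_beta_fn(k + a, n - k + b) - log_beta_fn(a, b)

k, n = 7, 10  # 7 successes in 10 trials
# Model 1: uniform Beta(1,1) prior; Model 2: Beta(20,20), concentrated near 0.5.
log_bf = log_evidence(k, n, 1, 1) - log_evidence(k, n, 20, 20)
print(f"BF(M1 vs M2) = {exp(log_bf):.2f}")
```

Note that the data never update either prior here: the comparison is entirely between the two priors’ averaged predictions of y.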

Posterior predictive distribution

Suppose we’ve already observed some data $y^{\textrm{obs}}$. The posterior predictive distribution is

$p(y \mid y^{\textrm{obs}}) = \int p(y \mid \theta) \, p(\theta \mid y^{\textrm{obs}}) \, \textrm{d}\theta.$

The key difference from the prior predictive distribution is that we average our sampling density $p(y \mid \theta)$ over the posterior $p(\theta \mid y^{\textrm{obs}})$ rather than the prior $p(\theta)$.

Cross-validation

In the Bayesian workflow paper, we recommend using cross-validation to compare posterior predictive distributions and we don’t even mention Bayes factors. Stan provides an R package, loo, for efficiently computing approximate leave-one-out cross-validation.

The path from prior predictive to posterior predictive

Introductions to Bayesian inference often start with a very simple beta-binomial model which can be solved analytically online. That is, we can update the posterior by simple counting after each observation. Each posterior is also a beta distribution. We can do this in general and consider our data $y = y_1, \ldots, y_N$ arriving sequentially and updating the posterior each time.

$p(y_1, \ldots, y_N) = p(y_1) \, p(y_2 \mid y_1) \, \cdots \, p(y_N \mid y_1, \ldots, y_{N-1}).$

In this factorization, we predict $y_1$ based only on the prior, then $y_2$ based on $y_1$ and the prior, and so on, until the last point is modeled just as in leave-one-out cross-validation, as $p(y_N \mid y_1, \ldots, y_{N-1})$. We can do this in any order and the result will be the same. As $N$ increases, the prior predictive density converges to the posterior predictive density on an average (per observation $y_n$) basis. But for finite amounts of data $N \ll \infty$, the two measures can be very different.
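The factorization above can be checked numerically in the beta-binomial model, where each one-step-ahead predictive is available by counting; the prior and data here are illustrative:

```python
from math import exp, lgamma, log

# Check: the product of one-step-ahead predictive probabilities equals
# the prior predictive probability of the whole (ordered) sequence.
a, b = 2.0, 2.0                  # Beta(2,2) prior on the success probability
y = [1, 0, 1, 1, 0, 1, 1, 1]     # Bernoulli observations, in arrival order

log_prod, a_n, b_n = 0.0, a, b
for y_n in y:
    p_one = a_n / (a_n + b_n)    # P(y_n = 1 | y_1, ..., y_{n-1})
    log_prod += log(p_one if y_n else 1.0 - p_one)
    a_n += y_n                   # conjugate (counting) posterior update
    b_n += 1 - y_n

def log_beta_fn(p, q):
    return lgamma(p) + lgamma(q) - lgamma(p + q)

# Closed-form prior predictive of this exact ordered sequence.
k, n = sum(y), len(y)
log_marg = log_beta_fn(a + k, b + n - k) - log_beta_fn(a, b)
print(f"sequential: {exp(log_prod):.6g}, closed form: {exp(log_marg):.6g}")
```

The two numbers agree to floating-point precision, and reordering `y` leaves both unchanged, as the text says.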

# Imperfectly Bayesian-like intuitions reifying naive and dangerous views of human nature

Allan Cousins writes:

After reading your post entitled “People are complicated” and the subsequent discussion that ensued, I [Cousins] find it interesting that you and others didn’t relate the phenomenon to the human propensity to bound probabilities into 3 buckets (0%, coin toss, 100%), and how that interacts with anchoring bias. It seems natural that if we (meaning people at large) do that across most domains, we would apply the same in our assessment of others. Since we are likely to have more experiences with certain individuals on one side of the spectrum or the other (given we tend to only see people in particular rather than varied circumstances), it’s no wonder we tend to fall into the dichotomous trap of treating people as if they are only good or bad; obviously the same applies if we don’t have personal experiences but only see or hear things from afar. Similarly, even if we come to know other circumstances that would oppose our selection (e.g. someone we’ve classified as a “bad person” performs some heroic act), we are apt to have become anchored on our previous selection (good or bad), and that reduces the reliance we might place on the additional information in our character assessment. Naturally our human tendencies lead us to “forget” about that evidence if ever called upon to make a similar assessment in the future. In a way it’s not dissimilar to why we implement reverse counting in numerical analysis. When we perform these social assessments it is as if we are always adding small numbers (additional circumstances) to large numbers (our previous determination / anchor), and the small numbers, when compared to the large number, are truncated and rounded away; of course leading to the possibility that our determination is hopelessly incorrect!

This reminds me of the question that comes up from time to time, of what happens if we use rational or “Bayesian” inference without fully accounting for the biases involved in what information we see.

The simplest example is if someone rolls a die a bunch of times and tells us the results, which we use to estimate the probability that the die will come up 6. If that someone gives us a misleading stream of information (for example, telling us about all the 6’s but only a subset of the 1’s, 2’s, 3’s, 4’s, and 5’s) and we don’t know this, then we’ll be in trouble.
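A quick simulation of this die example; the reporting rule and numbers are illustrative:

```python
import numpy as np

# The die is fair, but the reporter passes along every 6 and only half
# of the other outcomes. A naive observer who updates on the reported
# rolls as if they were a random sample overestimates P(6).
rng = np.random.default_rng(3)
rolls = rng.integers(1, 7, size=10_000)

is_six = rolls == 6
reported = is_six | (rng.random(rolls.size) < 0.5)   # all 6s, half the rest
sixes = int((is_six & reported).sum())
others = int((~is_six & reported).sum())

# Naive posterior mean for P(6) under a uniform Beta(1,1) prior:
naive_estimate = (sixes + 1) / (sixes + others + 2)
print(f"naive estimate of P(6): {naive_estimate:.3f} (truth: {1/6:.3f})")
```

With this reporting rule the naive estimate lands near 2/7 rather than 1/6: the inference is fine as Bayesian mechanics, but the model of how the data reached us is wrong.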

The linked discussion involves the idea that it’s easy for us to think of people as all good or all bad, and my story about a former colleague who had some clear episodes of good and clear episodes of bad is a good reminder that (a) people are complicated, and (b) we don’t always see this complication given the partial information available to us. From a Bayesian perspective, I’d say that Cousins is making the point that the partial information available to us can, if we’re not careful, be interpreted as supporting a naive bimodal view of human nature, thus leading to a vicious cycle or unfortunate feedback mechanism where we become more and more set in this erroneous model of human nature.

# Webinar: On using expert information in Bayesian statistics

This post is by Eric.

On Thursday, 23 June, Duco Veen will stop by to discuss his work on prior elicitation. You can register here.

Abstract

Duco will discuss how expert knowledge can be captured and formulated as prior information for Bayesian analyses. This process is also called expert elicitation. He will highlight some ways to improve the quality of the expert elicitation and provide case examples of work he did with his colleagues. Additionally, Duco will discuss how the captured expert knowledge can be contrasted with evidence provided by traditional data collection methods or other prior information. This can be done, for instance, for quality control or training purposes where experts reflect on their provided information.

Duco Veen is an Assistant Professor at the Department of Global Health situated at the Julius Center for Health Sciences and Primary Care of the University Medical Center Utrecht. In that capacity, he is involved in the COVID-RED, AI for Health, and [email protected] projects. In addition, he is appointed as Extraordinary Professor at the Optentia Research Programme of North-West University, South Africa. Duco works on the development of ShinyStan and has been elected as a member of the Stan Governing Body.

# Postdoc opportunity in San Francisco: methods research in public health

Angela Aidala points us to this ad:

Evidence for Action (E4A) is hiring a postdoc to work in our Methods Lab to help develop and share approaches to overcoming methodological challenges in research to advance health and racial equity. The postdoc should have training and interest in quantitative research methods and their application toward dismantling root causes of health inequities.

The individual will work on emerging topics that may focus on areas such as critical perspectives on quantitative methods in disparities research, data pooling to address small sample sizes, and development of measures relevant to advancing racial equity.

For full consideration, applications are due June 27, 2022. For more information visit https://e4action.org/PostDoc

They say:

The methods issues that arise in research on social conditions and health can be particularly difficult: pure randomization is rarely feasible, measurement is challenging, and causal pathways underlying effects are complex and cyclical.

Yup. They don’t explicitly mention Bayes/Stan, but it couldn’t hurt.