A bunch of us were in an email thread during the work on the article, “Bayesian workflow for disease transmission modeling in Stan,” by Léo Grinsztajn, Liza Semenova, Charles Margossian, and Julien Riou, and the topic came up of when to say that a parameter in a statistical model is “identified” by the data.

We did not come to any conclusions, but I thought some of you might find our discussion helpful in organizing your own thoughts.

It started with Liza, who said:

Definitions are important when it comes to identifiability. There are a few methods in the literature addressing (or, at least, discussing) the issue, but it is not always specified which definition is being used.

The most important pair of definitions is probably structural identifiability and practical identifiability.

*Practical non-identifiability* is linked to the amount and quality of data. It answers the question of whether parameters can be estimated given available data.

*Structural non-identifiability* arises from model structure and cannot be resolved by collecting more data, or better quality data. For example, y = a * b: we can measure y as much as we want, but we will not be able to distinguish between a and b no matter the number of measurements. One would hope that the issue could be resolved either via re-parametrisation or via providing sensible priors (i.e. by adding the Bayesian sauce). In this specific example, the issue is caused by the scaling symmetry of the equation: the equation does not change if we substitute a with k*a and b with b/k. I wonder whether knowledge of the type of symmetry can help us understand which priors would be “sensible,” as these priors would need to break the detected symmetries. (Everything above applies to any kind of model, but my focus at work at the moment is on ODEs. For ODE systems there is a lot of literature on analytical methods for finding symmetries, and the simplest of them can be applied to mid-size systems by hand.)
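The scaling symmetry is easy to verify numerically. A minimal sketch, assuming Gaussian measurement noise on y (the true values and noise scale here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated measurements of y, centered at a*b (a = 2, b = 3, so mean 6).
a_true, b_true, sigma = 2.0, 3.0, 0.5
y = rng.normal(a_true * b_true, sigma, size=100)

def log_lik(a, b):
    """Gaussian log-likelihood of y with mean a * b."""
    return -0.5 * np.sum((y - a * b) ** 2) / sigma**2

# Scaling symmetry: (a, b) -> (k*a, b/k) leaves the likelihood unchanged,
# so no amount of data can separate a from b.
for k in [0.5, 2.0, 10.0]:
    assert np.isclose(log_lik(2.0, 3.0), log_lik(k * 2.0, 3.0 / k))
```

Any prior that hopes to resolve this must break the symmetry along the hyperbola a*b = const, which is exactly the question about “sensible” priors above.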

One might expect that Stan (the software) would produce warnings in every case of non-identifiability. But I am starting to develop a feeling that this might not be the case. I am preparing an example of an ODE dx/dt = –a1*a2*x, x(0) = x_0, with uniform priors U(0,10) on both a1 and a2, and max_treedepth = 15. No warnings. Convergence is good.
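One way to see why no warnings need appear: the ODE has a closed-form solution in which only the product a1*a2 enters, so any pair with the same product produces identical data. A Python sketch (not the Stan model itself):

```python
import numpy as np

# Closed-form solution of dx/dt = -a1*a2*x, x(0) = x0:
#   x(t) = x0 * exp(-a1*a2*t).
# Only the product a1*a2 enters the trajectory.
def trajectory(a1, a2, x0=1.0, t=np.linspace(0, 5, 50)):
    return x0 * np.exp(-a1 * a2 * t)

# Any (a1, a2) with the same product gives identical data, which is why
# the sampler can converge cleanly along the ridge a1*a2 = const while
# telling us nothing about a1 and a2 individually.
assert np.allclose(trajectory(1.0, 2.0), trajectory(4.0, 0.5))
```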

A further list of definitions includes structural global identifiability, structural local identifiability, identifiability of the posterior, weak identifiability. Linked terms are observability and estimability.

My aim to distil existing methods to test for identifiability and demonstrate them on a few examples. As well as to answer the questions emerging along the way, e.g. “Do Bayesian priors help?”, “Can the knowledge about the type of invariance of the system help us chose the priors?”, “How to test a system for identifiability – both analytically and numerically?”, “Can analytical methods help us find good model reparametrisations to avoid numerical issues?”.

I responded:

One insight that might be helpful is that identifiability depends on data as well as model. For example, suppose we have a regression with continuous outcome y and binary predictors x1,…,x10. If your sample size is low enough, you’ll have collinearity. Even with a large sample size you can have collinearity. So I’m not so happy with definitions based on asymptotic arguments. A relevant point is that near-nonidentifiability can be as bad as full nonidentifiability. To put it another way, suppose you have an informative prior on your parameters. Then nonidentifiability corresponds to data that are weak enough that the prior is dominant.
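A quick check of the small-sample point: with n = 8 observations and 10 binary predictors, the design matrix cannot have full column rank, so some linear combination of the coefficients is not identified no matter what the outcome looks like. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# n = 8 observations, p = 10 binary predictors: rank(X) <= min(n, p) = 8 < 10,
# so the regression coefficients cannot all be identified from these data.
n, p = 8, 10
X = rng.integers(0, 2, size=(n, p)).astype(float)
assert np.linalg.matrix_rank(X) < p
```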

Also relevant is this paper, “The prior can often only be understood in the context of the likelihood,” and this one, “The experiment is just as important as the likelihood in understanding the prior.”

Ben Bales added:

Identifiability is a complicated thing that is often the result of a failure.

Like if a car explodes, we can sit around and explain what happened once it exploded (fire, smoke, etc.), but the fundamental problem of making a car that doesn’t explode is simpler (and more useful if we want to drive the car).

The last time I hit identifiability problems on the forums we worked it out with simulated data.

My reaction more and more to a model that doesn’t behave as expected is that first we should find a model that does kinda behave well and go from there. The reason I think this is that finding a model that does work is a much more methodical, reliable thing than trying to figure out the geometry or whatever is causing the problem. I tell people to look at posterior correlations and stuff to find non-identifiabilities, but I don’t know how far it’s ever gotten anyone.

I threw in:

Just to shift things slightly: I think that one of the problems with the statistical literature (including my own writings!) is how vague it all is. For example, we talk about weakly informative priors or weak identification, and these terms are not clearly defined. I think the appropriate way to go forward here is to recognize that “identification,” however defined, is a useful concept, but I don’t think that any strict definition of “identification” will be useful to us here, because in the Bayesian world, everything’s a little bit identified (as long as we have proper priors) and nothing is completely identified (because we’re not asymptotic). So maybe, rather than talking about identified or weakly identified or partially identified etc., we should be developing quantitative tools to assess the role of the prior and how it affects posterior inferences. Or something like that.

To which Liza replied:

Yes to quantitative tools. Some attempts in that direction have been made, but it seems that none of them has been broadly adopted:

1). Percentage overlap between prior and posterior (Garrett and Zeger, 2000). In applications, they used a threshold of 35%: if the overlap is greater than the threshold, they considered the parameter in question weakly identifiable. The overlap is calculated using posterior samples and kernel smoothing. This approach has only been used when priors were uniform, and the threshold would be hard to calibrate for other priors. Moreover, there are examples where the posterior of a non-identifiable parameter differs from its prior (Koop et al, 2013). Also, the posterior and prior of an identifiable parameter can have high overlap – for instance, when the prior is informative and the data only confirm prior knowledge.
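A sketch of the overlap computation (illustrative numbers, not Garrett and Zeger’s application): draw samples from a U(0,10) prior and from a hypothetical, well-identified posterior, estimate both densities by kernel smoothing, and integrate the pointwise minimum:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)

# Samples from a uniform prior and from a hypothetical narrow posterior.
prior = rng.uniform(0, 10, size=5000)
posterior = rng.normal(5.0, 0.5, size=5000)

# Overlap = integral of min(prior density, posterior density),
# approximated on a grid using kernel density estimates.
grid = np.linspace(0, 10, 1000)
p_prior = gaussian_kde(prior)(grid)
p_post = gaussian_kde(posterior)(grid)
overlap = np.minimum(p_prior, p_post).sum() * (grid[1] - grid[0])

# A narrow posterior overlaps little with a flat prior,
# so this parameter would not be flagged as weakly identifiable.
assert overlap < 0.35
```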

2). Xie and Carlin (2005) suggest two measures of Bayesian learning (computed as KL divergence): how much we can potentially learn about a parameter, and how much there is left to learn given data y. The computational algorithm involves MCMC.
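A toy version of this kind of measure in a conjugate normal model (not Xie and Carlin’s exact criterion; the closed-form KL divergence between two normals stands in for their MCMC computation, and all numbers are made up):

```python
import numpy as np

def kl_normal(mu_q, sd_q, mu_p, sd_p):
    """KL(q || p) between two univariate normals, in nats."""
    return (np.log(sd_p / sd_q)
            + (sd_q**2 + (mu_q - mu_p)**2) / (2 * sd_p**2) - 0.5)

# Prior N(0, 10^2); posterior after a conjugate update with n observations
# of a N(theta, 1) likelihood.  KL(posterior || prior) measures how much
# has been learned from the data; it grows with n for an identified theta.
prior_mu, prior_sd = 0.0, 10.0
ybar, n = 2.0, 50
post_var = 1.0 / (1.0 / prior_sd**2 + n)
post_mu = post_var * (prior_mu / prior_sd**2 + n * ybar)
learned = kl_normal(post_mu, np.sqrt(post_var), prior_mu, prior_sd)
assert learned > 0.0
```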

3). Data cloning (Lele et al, 2010): a technique for obtaining MLE estimates in complex models using Bayesian inference. Supposedly, it is also useful for studying identifiability. The idea is that the dataset gets cloned K times (i.e. each observation is repeated K times), while the model remains unchanged. For estimable parameters the scaled variance (the ratio of the posterior variance at K>1 to the posterior variance at K=1) should, in theory, behave as 1/K. If a parameter is not estimable, the scaled variance is larger than 1/K. The method was proposed in a setting with uniform priors; the question of what happens if priors are not uniform remains open. I have applied data cloning to an example of two ODEs. The first ODE contains only the parameter k: x'(t) = -k*x; the second ODE is overparametrized and contains two parameters lambda1 and lambda2: x'(t) = -lambda1*lambda2*x. Priors on the parameters are always the same – U(0,10). The plot of scaled variances is attached. Note that with the Stan setup I have chosen (max_treedepth = 15) there have been no warnings.
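The 1/K behavior can be checked without MCMC in a conjugate normal model (a stand-in for the ODE examples; note the prior here is normal rather than uniform):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(2.0, 1.0, size=30)

# Conjugate model: y ~ N(theta, 1), prior theta ~ N(0, 100).
# Cloning the data K times multiplies the effective sample size by K,
# so for this estimable parameter the scaled variance behaves as 1/K.
def post_var(K, prior_var=100.0):
    n_eff = K * len(y)
    return 1.0 / (1.0 / prior_var + n_eff)

K = 8
scaled = post_var(K) / post_var(1)
assert abs(scaled - 1.0 / K) < 0.01
```

For a non-estimable parameter (e.g. lambda1 alone in the overparametrized ODE), the posterior variance would stop shrinking along the ridge, and the scaled variance would stay above 1/K.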

Another question is the connection between identifiability and model selection: if an information criterion takes the number of parameters into account, shouldn’t it be the number of identifiable parameters?

An overall suggestion for the “identifiability workflow”: it should be a combination of analytical and numerical methods and include some or all of the following steps:

1. apply analytical methods by hand, if applicable (e.g. testing scaling invariances is possible for mid-size ODE systems)

2. numerical methods:

(a) Hessian method (standardized eigenvalues which are too small correspond to redundant parameters, but picking the threshold is hard),

(b) simulation-based method (start with some true values, generate data, and fit the model; then calculate the coefficient of variation, which is small for identifiable parameters and large for non-identifiable ones)

(c) profile likelihood method: a flat profile indicates parameter redundancy

3. symbolic methods
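As a minimal illustration of step 2(c), a profile likelihood for the y = a*b example (Gaussian likelihood assumed): profiling b out analytically via b* = mean(y)/a leaves a profile that is exactly flat in a, signaling redundancy.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 0.5
y = rng.normal(6.0, sigma, size=200)   # data generated with a*b = 6

def log_lik(a, b):
    return -0.5 * np.sum((y - a * b) ** 2) / sigma**2

# Profile out b: for fixed a, the Gaussian log-likelihood is maximized
# at b* = mean(y)/a, so the fitted product a*b* = mean(y) regardless of a.
def profile(a):
    return log_lik(a, np.mean(y) / a)

# The profile is flat in a -- the signature of parameter redundancy.
profiles = [profile(a) for a in [0.5, 1.0, 2.0, 4.0]]
assert max(profiles) - min(profiles) < 1e-6
```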

P.S. There is also no consensus on the term either – both non-identifiability and unidentifiability are used.

Charles then said:

Thanks for pooling all these ideas. As Andrew mentions, it can be good to distinguish finite-data cases from asymptotics. The intuitive way I think about non-identifiability is that we can’t solve a * b = x for a and b: there is an infinite number of solutions, which in a Bayesian context suggests an infinite variance. So maybe thinking in terms of variance might be helpful? But we could also survey the problems scientists face and see how these motivate different notions of identifiability.

If I understand correctly, (1) measures how much we learn from the data, i.e. how much the posterior is driven by the prior. The data could be not very informative (relative to the prior), or (equivalently) the prior could be very informative (because it is well constructed or otherwise). I think this type of criterion goes back to the questionable idea that the prior shouldn’t influence the posterior. This becomes a question of robustness, because we can think of the prior as additional data with a corresponding likelihood (so using a bad prior is like adding bad data).

(2) sounds interesting too. Working out “how much there is left to learn from more data” reminds me of a prediction problem from my qualifying exams. For simple enough models, the learning rate plateaus quickly and more data wouldn’t improve predictions. Maybe the idea here is that the posterior variance of the parameter doesn’t go to 0 with more data.

(3) also sounds very neat and relates to learning rates. I’m not sure what “estimable” and “non-estimable” mean here, but in a unif-normal case, I agree the posterior variance goes down with n. We could then ask what the expected rate is for multimodal distributions. A minor note: rather than cloning the data, can we simply lower the likelihood variance?
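For a normal likelihood at least, Charles’s suggestion does work: cloning the data K times and dividing the likelihood variance by K give the same posterior. A conjugate sketch (hypothetical numbers):

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(2.0, 1.0, size=20)

# Conjugate model: y ~ N(theta, sigma2), prior theta ~ N(0, tau2).
def posterior(y, sigma2, tau2=100.0):
    var = 1.0 / (1.0 / tau2 + len(y) / sigma2)
    mu = var * (np.sum(y) / sigma2)
    return mu, var

K = 10
mu_clone, var_clone = posterior(np.tile(y, K), 1.0)  # clone the data K times
mu_scale, var_scale = posterior(y, 1.0 / K)          # or shrink sigma2 by K
assert np.isclose(mu_clone, mu_scale) and np.isclose(var_clone, var_scale)
```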