Discussion on identifiability and Bayesian inference

A bunch of us were in an email thread during the work on the article, “Bayesian workflow for disease transmission modeling in Stan,” by Léo Grinsztajn, Liza Semenova, Charles Margossian, and Julien Riou, and the topic came up of when to say that a parameter in a statistical model is “identified” by the data.

We did not come to any conclusions, but I thought some of you might find our discussion helpful in organizing your own thoughts.

It started with Liza, who said:

Definitions are important when it comes to identifiability. There are a few methods in the literature addressing (or, at least, discussing) the issue, but it is not always specified which definition is being used.

The most important pair of definitions is probably structural identifiability and practical identifiability.

Practical non-identifiability is linked to the amount and quality of data. It answers the question of whether parameters can be estimated given available data.

Structural non-identifiability arises due to model structure and cannot be resolved by collecting more data, or better quality data. Example y = a * b: we can measure y as much as we want, but will not be able to distinguish between a and b no matter the number of measurements. One would believe that the issue could get resolved either via re-parametrisation or via providing sensible priors (i.e. by adding the Bayesian sauce). In this specific example, the issue is caused by the scaling symmetry of the equation, i.e. the equation does not change if we substitute a with k*a, and substitute b with b/k. I wonder whether the knowledge of the type of symmetry can help us understand which priors would be “sensible” as these priors would need to break the detected symmetries. (Everything above applies to any kind of model, but my focus at work at the moment is on ODEs. For ODE systems there is a lot of literature on analytical methods of finding symmetries, and the simplest of them can be applied to mid-size systems by hand.)
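A minimal sketch of the scaling symmetry in the y = a * b example. The Gaussian noise model, the noise scale, and the simulated data are assumptions made for illustration; they are not part of the email.

```python
import numpy as np

def log_likelihood(a, b, y, sigma=1.0):
    """Gaussian log-likelihood for y_i ~ Normal(a * b, sigma) (assumed noise model)."""
    resid = y - a * b
    return -0.5 * np.sum(resid**2) / sigma**2 - y.size * np.log(sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
y = 6.0 + rng.normal(size=50)          # simulated data with a * b = 6

a, b = 2.0, 3.0
base = log_likelihood(a, b, y)

# Move along the symmetry: a -> k * a, b -> b / k. The product is unchanged,
# so the likelihood cannot tell these parameter pairs apart.
rescaled = [log_likelihood(k * a, b / k, y) for k in (0.1, 0.5, 4.0, 50.0)]
print(base, rescaled)
```

Every rescaled pair gives the same log-likelihood as the original pair, which is why only a prior (or a reparametrisation in terms of the product) can separate a from b.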

An intuition concerning Stan (the software) would be that it would produce warnings in every case of non-identifiability. But I am starting to get the feeling that this might not be the case. I am preparing an example of an ODE dx/dt = -a1*a2*x, x(0) = x_0, with uniform priors U(0,10) on both a1 and a2, and max_treedepth = 15. No warnings. Convergence is good.

A further list of definitions includes structural global identifiability, structural local identifiability, identifiability of the posterior, weak identifiability. Linked terms are observability and estimability.

My aim is to distil existing methods to test for identifiability and demonstrate them on a few examples, as well as to answer the questions emerging along the way, e.g. “Do Bayesian priors help?”, “Can knowledge about the type of invariance of the system help us choose the priors?”, “How to test a system for identifiability – both analytically and numerically?”, “Can analytical methods help us find good model reparametrisations to avoid numerical issues?”.

I responded:

One insight that might be helpful is that identifiability depends on data as well as model. For example, suppose we have a regression with continuous outcome y and binary predictors x1,…,x10. If your sample size is low enough, you’ll have collinearity. Even with a large sample size you can have collinearity. So I’m not so happy with definitions based on asymptotic arguments. A relevant point is that near-nonidentifiability can be as bad as full nonidentifiability. To put it another way, suppose you have an informative prior on your parameters. Then nonidentifiability corresponds to data that are weak enough that the prior is dominant.
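A small sketch of the collinearity point, checking the rank of a design matrix with an intercept and ten binary predictors. The sample sizes (8 and 1000), the random predictors, and the duplicated column are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def rank_with_intercept(x):
    """Return (rank, number of columns) of the design matrix [1, x1, ..., x10]."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.matrix_rank(X), X.shape[1]

x_small = rng.integers(0, 2, size=(8, 10))      # fewer rows than coefficients
x_large = rng.integers(0, 2, size=(1000, 10))   # plenty of rows
x_dup = x_large.copy()
x_dup[:, 9] = x_dup[:, 0]                       # x10 duplicates x1 in every row

rank_small, p = rank_with_intercept(x_small)
rank_large, _ = rank_with_intercept(x_large)
rank_dup, _ = rank_with_intercept(x_dup)
print(p, rank_small, rank_large, rank_dup)
```

The small sample cannot reach full rank, and the large sample with a duplicated predictor cannot either, so the same model is identified or not depending on the data in hand.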
Also relevant is this paper, “The prior can often only be understood in the context of the likelihood,” also this, “The experiment is just as important as the likelihood in understanding the prior.”

Ben Bales added:

Identifiability is like a complicated thing that is often the result of a failure.

Like if a car explodes, we can sit around and explain what happens once it has exploded (fire, smoke, etc.), but the fundamental problem of making a car that doesn’t explode is simpler (and more useful if we wanted to drive a car).

The last time I hit identifiability problems on the forums we worked it out with simulated data.

My reaction more and more to a model that doesn’t behave as expected is first we should find a model that does kinda behave well and go from there. The reason I think this is because finding a model that does work is a much more methodical, reliable thing than trying to figure out geometry or whatever is causing it. I tell people to look at posterior correlations and stuff to find non-identifiabilities, but I don’t know how far it’s ever gotten anyone.

I threw in:

Just to shift things slightly: I think that one of the problems with the statistical literature (including my own writings!) is how vague it all is. For example, we talk about weakly informative priors or weak identification, and these terms are not clearly defined. I think the appropriate way to go forward here is to recognize that “identification,” however defined, is a useful concept, but I don’t think that any strict definition of “identification” will be useful to us here, because in the Bayesian world, everything’s a little bit identified (as long as we have proper priors) and nothing is completely identified (because we’re not asymptotic). So maybe, rather than talking about identified or weakly identified or partially identified etc., we should be developing quantitative tools to assess the role of the prior and how it affects posterior inferences. Or something like that.

To which Liza replied:

Yes to quantitative tools. Some attempts in that direction have been made, but it seems that none of them has been adopted broadly:

1). Percentage overlap between prior and posterior (Garrett and Zeger, 2000). In applications, they have used the threshold of 35%: if the overlap is greater than the threshold, they considered the parameter in question weakly identifiable. The overlap is calculated using posterior samples and kernel smoothing. This approach has only been used when priors were uniform, and the threshold would be hard to calibrate for other priors. Moreover, there are examples where the posterior of a non-identifiable parameter differs from its prior (Koop et al, 2013). Also, the posterior and prior of an identifiable parameter can have high overlap – for instance, when the prior is informative and the data only confirm prior knowledge.

2). Xie and Carlin (2005) suggest two measures of Bayesian learning (computed as KL divergence): how much we can potentially learn about a parameter, and how much there is left to learn given data y. The computational algorithm involves MCMC.

3). Data cloning (Lele et al, 2010): it is a technique used to obtain maximum likelihood estimates in complex models using Bayesian inference. Supposedly, it is useful to study identifiability. The idea is that a dataset gets cloned K times (i.e. each observation is repeated K times), while the model remains unchanged. For estimable parameters the scaled variance (ratio of the posterior variance at K>1 to the posterior variance at K=1) should, in theory, behave as 1/K. If a parameter is not estimable, the scaled variance is larger than 1/K. The method has been proposed in a setting with uniform priors. The question of what happens if priors are not uniform stands open. I have applied data cloning to an example of two ODEs. The first ODE only contains parameter k: x'(t) = -k*x; the second ODE is overparametrized and contains two parameters lambda1 and lambda2: x'(t) = -lambda1*lambda2*x; priors on the parameters are always the same – U(0,10). The plot of scaling variances is attached. Note that with the Stan setup which I have chosen (max_treedepth = 15) there have been no warnings.
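A rough sketch of the data cloning diagnostic in (3). To keep everything in closed form it uses a linear-Gaussian stand-in with N(0, 1) priors rather than the ODE example and uniform priors described in the email, so the model, the priors, and the helper names are assumptions. Cloning the data K times multiplies x'x by K, which gives the posterior variances directly.

```python
import numpy as np

def posterior_var_identified(s, K):
    """Posterior variance of b in y = b*x + e, b ~ N(0, 1), data cloned K times."""
    return 1.0 / (K * s + 1.0)

def posterior_var_redundant(s, K):
    """Posterior variance of b1 in y = (b1 + b2)*x + e, b1, b2 ~ N(0, 1), cloned K times."""
    return (K * s + 1.0) / (2.0 * K * s + 1.0)

rng = np.random.default_rng(3)
x = rng.normal(size=200)
s = float(x @ x)                       # x'x for the original (K = 1) dataset

for K in (1, 2, 4, 8, 16):
    # Scaled variance: posterior variance with K clones over the K = 1 variance.
    r_id = posterior_var_identified(s, K) / posterior_var_identified(s, 1)
    r_red = posterior_var_redundant(s, K) / posterior_var_redundant(s, 1)
    print(f"K={K:2d}  1/K={1/K:.3f}  identified={r_id:.3f}  redundant={r_red:.3f}")
```

In this stand-in the scaled variance of the identified coefficient tracks 1/K closely, while the scaled variance of the redundant coefficient stays near 1 because the prior, not the data, determines how b1 + b2 is split.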

Another question is the connection of identifiability and model selection: if an information criterion takes the number of parameters into account, shouldn’t it be the number of identifiable parameters?

An overall suggestion for the “identifiability workflow”: it should be a combination of analytical and numerical methods and include all or some of the steps:
1. apply analytical methods by hand, if applicable (e.g. testing scaling invariances is possible for mid-size ODE systems)
2. numerical methods:
(a) Hessian method (standardized eigenvalues which are too small correspond to redundant parameters, but picking the threshold is hard),
(b) simulation-based method (start with some true values, generate data, and fit the model, then calculate the coefficient of variation; it is small for identifiable parameters and large for nonidentifiable)
(c) profile likelihood method: a flat profile indicates parameter redundancy
3. symbolic methods
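One way step 2(a) might look on the y = a * b example, using the Fisher information (the expected Hessian of the negative log-likelihood) in place of a numerically estimated Hessian. The Gaussian noise model, the parameter values, and the function name are assumptions for this sketch.

```python
import numpy as np

def fisher_information(a, b, n, sigma=1.0):
    """Fisher information for n draws of y ~ Normal(a * b, sigma), w.r.t. (a, b)."""
    grad_mean = np.array([b, a])       # d(a*b)/da = b, d(a*b)/db = a
    return n / sigma**2 * np.outer(grad_mean, grad_mean)

a, b = 2.0, 3.0
info = fisher_information(a, b, n=100)

eigvals, eigvecs = np.linalg.eigh(info)     # eigenvalues in ascending order
standardized = eigvals / eigvals.max()      # largest eigenvalue scaled to 1
redundant_direction = eigvecs[:, 0]         # direction the data cannot resolve
print(standardized, redundant_direction)
```

The smallest standardized eigenvalue is zero up to rounding, and its eigenvector is proportional to (a, -b), the direction along the curve a * b = constant. In less clean problems the eigenvalue is merely small, which is where the threshold question in 2(a) comes in.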

P.S. There is no consensus on the term either: both non-identifiability and unidentifiability are used.

Charles then said:

Thanks for pooling all these ideas. As Andrew mentions, it can be good to distinguish finite data cases and asymptotics. The intuitive way I think about non-identifiability is we can’t solve a * b = x for a and b. There is an infinite number of solutions, which in a Bayesian context suggests an infinite variance. So maybe thinking in terms of variance might be helpful? But we could survey problems scientists face and see how these motivate different notions of identifiability.

If I understand correctly, (1) measures how much we learn from the data / how much the posterior is driven by the prior. The data could be not very informative (relative to the prior) or (equivalently) the prior could be very informative (because it is well constructed or otherwise). I think this type of criterion goes back to the questionable idea that the prior shouldn’t influence the posterior. This becomes a question of robustness, because we can think of the prior as additional data with a corresponding likelihood (so putting a bad prior is like adding bad data).
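For concreteness, the overlap measure in (1) could be computed along these lines. This sketch uses histograms as a simple stand-in for the kernel smoothing mentioned in the thread, and the prior and "posterior" draws are hypothetical samples generated for illustration, not output from any fitted model.

```python
import numpy as np

def overlap(prior_draws, posterior_draws, bins=50):
    """Histogram estimate of the area shared by two densities (between 0 and 1)."""
    lo = min(prior_draws.min(), posterior_draws.min())
    hi = max(prior_draws.max(), posterior_draws.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(prior_draws, bins=edges, density=True)
    q, _ = np.histogram(posterior_draws, bins=edges, density=True)
    return float(np.sum(np.minimum(p, q) * np.diff(edges)))

rng = np.random.default_rng(2)
prior = rng.uniform(0, 10, size=20_000)            # U(0, 10) prior draws
post_flat = rng.uniform(0, 10, size=20_000)        # hypothetical: data taught us nothing
post_sharp = rng.normal(3.0, 0.2, size=20_000)     # hypothetical: data pinned it down

for name, draws in [("flat", post_flat), ("sharp", post_sharp)]:
    value = overlap(prior, draws)
    print(name, round(value, 3), "weakly identified" if value > 0.35 else "identified")
```

The 35% cutoff is the one quoted from Garrett and Zeger. As noted above, a high overlap can also arise from an informative prior that the data merely confirm, so the number describes the prior's influence more than it settles identifiability.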

(2) sounds interesting too. Working out “how much there is to learn from more data” reminds me of a prediction problem for my qualifying exams. For simple enough models, the learning rate plateaus quickly and more data wouldn’t improve predictions. Maybe the idea here is that the posterior variance of the parameter doesn’t go to 0 with more data.

(3) also sounds very neat and relates to learning rates. I’m not sure what “estimable” and “non-estimable” mean here. But in a unif-normal case, I agree the posterior variance goes down with n. We could then ask what the expected rate is for multimodal distributions. A minor note: rather than cloning the data, can we simply lower the likelihood variance?
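On the minor note about cloning versus lowering the likelihood variance: for a Gaussian likelihood with a known, fixed sigma, cloning the data K times and dividing the variance by K give log-likelihoods that differ only by a constant. The sketch below checks this numerically; the fixed sigma is the key assumption, and the equivalence need not hold when sigma is itself estimated or the likelihood is not Gaussian.

```python
import numpy as np

def gaussian_loglik(mu, y, sigma):
    """Log-likelihood of y_i ~ Normal(mu, sigma) with known sigma."""
    return np.sum(-0.5 * ((y - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)))

rng = np.random.default_rng(4)
y = rng.normal(1.0, 2.0, size=30)
K, sigma = 8, 2.0

def cloned(mu):
    """Each observation repeated K times, original sigma."""
    return gaussian_loglik(mu, np.tile(y, K), sigma)

def sharpened(mu):
    """Original data, variance divided by K (sigma divided by sqrt(K))."""
    return gaussian_loglik(mu, y, sigma / np.sqrt(K))

# If the two agree up to a constant, these offsets are identical for every mu.
offsets = [cloned(mu) - sharpened(mu) for mu in (-1.0, 0.0, 0.5, 3.0)]
print(offsets)
```

Since the offset does not depend on mu, both versions lead to the same posterior for mu under any prior.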

12 thoughts on “Discussion on identifiability and Bayesian inference”

  1. Is the situation people are wanting here, ultimately, that for some finite dataset, every parameter in the model has a unique posterior mode? Is that what we mean by identifiable functionally, putting aside asymptotic approaches to the “true value” of the parameter?

    • I learned that a model is “(point) identified” if the objective function (e.g. likelihood, moments, MSE, …) has a unique global minimum.

      In Bayesian terms, that seems like it might correspond to a unique posterior mode.

  2. Instead of using the asymptotic definition of identifiability, I like considering how informative a specific dataset is (which is related to Andrew’s comment on collinearity and Andrew’s separability examples in ROS).

    WAIC’s estimate of the effective number of parameters takes into account how much information is obtained from the data (a corresponding estimate can be obtained as a side-product of cross-validation).

  3. For what it’s worth, the last time I looked into this it seemed sensible to define identifiable as having no flat spots on the likelihood surface (a flat spot being more than one parameter point giving exactly the same likelihood value; a sketch of a _proof_ based on observations always being discrete is below). And I remember the term for just not having enough data was aliased.

    Under any data model specification there is only a finite set of possible observations (since, without loss of generality, continuous observations are excluded) and their probabilities are fully determined by the data model for each parameter point (value). If for every parameter point in the data model, these probabilities are different, then in principle there is a one-to-one function from these probabilities of possible observations to the parameter points. If not, the data model is not identifiable and no amount of data would ever be able to discern which of the various parameter points gave rise to the observed data.

  4. Take a model like P = MV/Q. Ie, price of an item is proportional to money supply chasing it times velocity of that money, and inversely to the quantity of that item available.

    It seems obvious to me that if you only measure price, you cannot distinguish between the parameters on the RHS. The only way is to collect some other kind of data to estimate M, V, and Q.

    Is the discussion above suggesting there may be some kind of mathematical manipulation that solves this problem? I just don’t see how so much got written about it.

  5. The discussion on identifiability is so interesting! I would like to provide some thoughts on this topic based on sampling efficiency. Non-identifiability is often considered a problem. But I don’t think non-identifiability is a problem if no one cares about the inference of the non-identified parameters and there is a way to generate the posterior samples efficiently. However, MCMC algorithms like HMC which use the gradient of the log-density can be very sensitive to the existence of non-identified parameters. Take an overparameterized linear regression, for example. Consider:

    Model: y_i = (b1 + b2) x_i + e_i, e_i ~ N(0, 1) for i = 1, … , n
    Prior: b1 ~ N(0, 1), b2 ~ N(0, 1)

    Here the sum of b1 and b2 models the regression coefficient of x_i. Despite the overparameterization, we can still obtain a well-defined posterior with the assigned prior, which is a Gaussian distribution centered at [x’y / (2x’x +1), x’y / (2x’x +1)]’ with covariance
    [(x’x + 1) / (2x’x +1), (-x’x) / (2x’x +1) ]
    [(-x’x) / (2x’x +1) , (x’x + 1) / (2x’x +1)]
    Here x = (x_1, …., x_n)’, y = (y_1, …, y_n). Assume the actual value of the regression coefficient is 2 and assume x_i ~ N(0, 1), then the correlation between b1 and b2 goes to -1 as n goes to inf and the posterior samples will gather around line b1 + b2 = 2. For this example, the posterior is in closed form so one can easily generate posterior samples.
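A quick numerical check of the closed-form posterior above, built from the prior precision plus the data precision. The simulated x and y follow the comment's setup (true coefficient 2, x_i ~ N(0, 1)); the sample sizes and seed are arbitrary choices for this sketch.

```python
import numpy as np

def posterior(x, y):
    """Posterior mean and covariance of (b1, b2) for y = (b1 + b2) x + e, N(0, 1) priors."""
    s = x @ x
    precision = np.eye(2) + s * np.ones((2, 2))   # prior precision I plus x'x * 11'
    cov = np.linalg.inv(precision)
    mean = cov @ np.array([x @ y, x @ y])
    return mean, cov

rng = np.random.default_rng(5)
for n in (10, 100, 10_000):
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)              # true b1 + b2 = 2
    mean, cov = posterior(x, y)
    corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
    print(n, mean.sum().round(3), corr.round(4))
```

The numerically inverted covariance matches the matrix given above, the posterior mean of b1 + b2 settles near 2, and the correlation equals -x'x / (x'x + 1), which approaches -1 as n grows.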
    Now let’s take a look at the gradient of the log-density:
    \partial lp / \partial b1 = x’ (y - (b1 + b2) x) - b1;
    \partial lp / \partial b2 = x’ (y - (b1 + b2) x) - b2;
    When n is large, there is a high probability that the gradient of the log-density on the line b1 + b2 = 2 is almost orthogonal to that line, which is probably the worst direction we want to explore when proposing the next iteration in an MCMC algorithm. In this case, using the gradient of the log-density in HMC doesn’t improve the sampling efficiency. I even suspect that the sampling efficiency of HMC might be lower than that of random-walk Metropolis for this example.

    There is another example: Consider y_i ~ N(0, a*b), i = 1, … , n. Assume the actual value of the variance of y_i (which is a*b in the model) is 1. As the sample size increases, the posterior samples of a and b will get very close to the curve a*b = 1. But the gradient of the log-likelihood along the curve a*b = 1 is perpendicular to the curve a*b = 1. For this example, I don’t think any MCMC algorithm will be efficient when n is large and a simple reparameterization can dramatically improve the sampling efficiency. But if we just want to know the posterior inference of a*b, both parameterizations should give comparable results when there is a good sampling algorithm.
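The perpendicularity claim in this second example can be checked directly. The sketch below reads N(0, a*b) as a normal with variance a*b, as the comment states, and uses simulated data with true variance 1; the sample size and the points chosen on the curve are assumptions.

```python
import numpy as np

def grad_loglik(a, b, y):
    """Gradient of the log-likelihood of y_i ~ Normal(0, variance a*b) w.r.t. (a, b)."""
    v = a * b
    dl_dv = -0.5 * y.size / v + 0.5 * np.sum(y**2) / v**2
    return dl_dv * np.array([b, a])        # chain rule: dv/da = b, dv/db = a

rng = np.random.default_rng(6)
y = rng.normal(0.0, 1.0, size=1000)

cosines = []
for a in (0.2, 1.0, 5.0):
    b = 1.0 / a                            # a point on the curve a*b = 1
    g = grad_loglik(a, b, y)
    tangent = np.array([a, -b])            # direction along the curve at (a, b)
    cosines.append(g @ tangent / (np.linalg.norm(g) * np.linalg.norm(tangent)))
print(cosines)
```

The cosine between the gradient and the tangent is zero up to rounding at every point on the curve, because the likelihood depends on (a, b) only through the product. Sampling the single parameter v = a*b avoids the ridge entirely, which is the reparameterization the comment has in mind.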

    From a technical standpoint, I am more interested in a definition of the practical non-identifiability “problem” than in a definition of identifiability. If we have a tool to detect the non-identifiability “problem”, then it might help researchers build better models.

  6. Thank you for an interesting discussion. If a statistical model contains a hierarchical structure or a latent variable with redundancy, it is non-identifiable and has a degenerate Fisher information matrix. Normal mixtures, neural networks, matrix factorizations, latent Dirichlet allocations and so on are such examples. If a statistical model is non-identifiable, then Bayesian inference makes the generalization error much smaller than the maximum likelihood method, even asymptotically. The effect of non-identifiability can be quantitatively measured by algebro-geometric studies. In fact, the dimension of the parameter in BIC of a regular case is replaced by the real log canonical threshold, which is an important birational invariant of algebraic geometry. Statistically, this fact was applied to sBIC by Drton and Plummer.

  7. This issue comes up regularly but the discussion never progresses much. I believe the reason for that is that we are trying to capture notions of identifiability from economics, causal inference, etc. that were put forth long before Bayesian inference was feasible by people who wouldn’t be caught dead making Bayesian inferences even if they were feasible at the time.

    In my opinion, a Bayesian definition of nonidentifiability that is the most analogous to those other ones is this: If two (or more) generative processes have the same prior predictive distribution, then you are not going to be able to distinguish between those two generative processes with any outcome data collected from the wild. The generative processes where Y ~ Poisson(mu) with log(mu) = alpha + beta and some priors on alpha and beta fall into this category because you get the same prior predictive distribution, for example, whether alpha ~ normal(1, 1) and beta ~ normal(-1, 2) or alternatively alpha ~ normal(-1, 2) and beta ~ normal(1, 1).
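A short sketch of why the two prior specifications above share a prior predictive distribution: Y depends on the parameters only through log(mu) = alpha + beta, and the sum of independent normals has the same distribution either way. It assumes normal(m, s) means mean m and standard deviation s, as in Stan, and that alpha and beta are independent a priori; the helper name is made up for this example.

```python
import numpy as np

def log_mu_prior(alpha_prior, beta_prior):
    """Mean and sd of log(mu) = alpha + beta for independent normal (mean, sd) priors."""
    (m_a, s_a), (m_b, s_b) = alpha_prior, beta_prior
    return m_a + m_b, float(np.hypot(s_a, s_b))   # variances add, so sds combine as a hypotenuse

spec_1 = log_mu_prior(alpha_prior=(1, 1), beta_prior=(-1, 2))
spec_2 = log_mu_prior(alpha_prior=(-1, 2), beta_prior=(1, 1))
print(spec_1, spec_2)
```

Both specifications imply log(mu) ~ normal(0, sqrt(5)), so any Poisson outcomes drawn from them have the same distribution, and no amount of outcome data can favor one specification over the other.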

    That Bayesian notion of nonidentifiability is fairly analogous to non-Bayesian ones and easy to understand, but it is not the notion that people like Andrew want because they are not that interested in Bayesian model selection anyway. Rather, estimation-based Bayesians are saying something like: Suppose you commit to priors like alpha ~ normal(1, 1) and beta ~ normal(-1, 2); can you successfully draw from the posterior distribution of alpha and beta given N realizations of Y? The difficulty here is that whether you can successfully draw from the posterior distribution of the parameters given the data is data-dependent, in addition to being algorithm-dependent and prior-dependent (and hinging on what constitutes success). If alpha ~ normal(1, 1) and beta ~ normal(-1, 2) and you use the MCMC algorithm in Stan, then there is a good chance (but no guarantee) you will get draws without warnings in a reasonable amount of time, but that might not be the case if your priors were alpha ~ normal(1, 5) and beta ~ normal(-1, 5).

    What I don’t like about this estimation-based notion of nonidentifiability is that it does not lend itself to a precise definition. You can easily find yourself saying nonsense like the parameters are not “identified” in Stan with the default adapt_delta = 0.8 but there is a better than even chance that they are if adapt_delta = 0.99 and you are willing to wait more than T units of time and you choose non-default starting values. I don’t think you are going to get to a precise definition of this latter notion, but I don’t know that we need one. Just say drawing from the prior predictive distribution is easy and drawing from the posterior distribution (with Stan) is hard in this case because reasons.

  8. This book is a good summary of the types of identifiability discussed here:

    https://www.routledge.com/Parameter-Redundancy-and-Identifiability/Cole/p/book/9780367493219

    In the chapter on Bayesian identifiability, two distinct definitions of Bayesian identifiability are considered: one that amounts to ‘posterior identifiability’ and one that amounts to considering identifiability to be a property of the likelihood (standard parametric version).

    Shameless plug of some of my own thoughts on identifiability, closest to the ‘property of the likelihood’ version that can be applied whether Bayesian or non-Bayesian:

    https://arxiv.org/abs/1904.02826
