We had some questions on the Stan list regarding identification. The topic arose because people were fitting models with improper posterior distributions, the kind of model where there’s a ridge in the likelihood and the parameters are not otherwise constrained.

I tried to help by writing something on Bayesian identifiability for the Stan list. Then Ben Goodrich came along and cleaned up what I wrote. I think this might be of interest to many of you so I’ll repeat the discussion here.

Here’s what I wrote:

Identification is actually a tricky concept and is not so clearly defined. In the broadest sense, a Bayesian model is identified if the posterior distribution is proper. Then one can do Bayesian inference and that’s that. No need to require a finite variance or even a finite mean, all that’s needed is a finite integral of the probability distribution.

That said, there are some reasons why a stronger definition can be useful:

1. Weak identification. Suppose that, with reasonable data, you’d have a posterior with a sd of 1 (or that order of magnitude). But you have sparse data or collinearity or whatever, and so you have some dimension in your posterior that’s really flat, some “ridge” with a sd of 1000. Then it makes sense to say that this parameter or linear combination of parameters is only weakly identified. Or one can say that it’s identified from the prior but not the likelihood.

If we wanted to make this concept of “weak identification” more formal, we could stipulate that the model is expressed in terms of some hyperparameter A which is set to a large value, and that weak identifiability corresponds to nonidentifiability when A -> infinity.
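To make that A -> infinity idea concrete, here's a toy conjugate-normal sketch (my illustration, not anything from the Stan list): y ~ N(theta1 + theta2, 1), so the likelihood has a ridge along theta1 - theta2, and independent N(0, A^2) priors play the role of the hyperparameter A.

```python
import numpy as np

def posterior_cov(A, n=100, sigma=1.0):
    """Posterior covariance of (theta1, theta2) when y_i ~ N(theta1 + theta2, sigma^2),
    with independent N(0, A^2) priors on each parameter (conjugate normal algebra)."""
    prior_prec = np.eye(2) / A**2
    X = np.ones((n, 2))                  # every observation sees only the sum
    post_prec = prior_prec + X.T @ X / sigma**2
    return np.linalg.inv(post_prec)

for A in [1.0, 10.0, 1000.0]:
    S = posterior_cov(A)
    sd_sum = np.sqrt(np.array([1.0, 1.0]) @ S @ np.array([1.0, 1.0]))    # identified
    sd_diff = np.sqrt(np.array([1.0, -1.0]) @ S @ np.array([1.0, -1.0]))  # the ridge
    print(f"A = {A:6.0f}   sd(theta1+theta2) = {sd_sum:.3f}   sd(theta1-theta2) = {sd_diff:.1f}")
```

The identified combination settles at sd roughly 1/sqrt(n) no matter what A is, while the ridge direction has posterior sd A*sqrt(2): finite for any finite A (weakly identified), blowing up (nonidentified) as A -> infinity.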

Even there, though, some tricky cases arise. For example, suppose your model includes a parameter p that is defined on [0,1] and is given a Beta(2,2) prior, and suppose the data don’t tell us anything about p, so that our posterior is also Beta(2,2). That sounds nonidentified to me, but it does have a finite integral.
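A quick numerical check of this example, with the flat likelihood put in by hand:

```python
import numpy as np

p = np.linspace(0.0, 1.0, 100001)
dp = p[1] - p[0]
prior = 6.0 * p * (1.0 - p)          # the Beta(2,2) density
likelihood = np.ones_like(p)         # the data say nothing at all about p
unnorm = prior * likelihood
Z = unnorm.sum() * dp                # finite normalizer => the posterior is proper
posterior = unnorm / Z
print("normalizer:", round(Z, 6))
print("max |posterior - prior|:", np.abs(posterior - prior).max())
```

The posterior integrates to 1 yet is identical to the prior, which is exactly the uncomfortable case: formally proper, but the data have done no identifying whatsoever.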

2. Aliasing. Consider an item response model or ideal point model or mixture model where the direction or labeling is unspecified. Then you can have 2 or 4 or K! different reflections of the posterior. Even if all priors are proper, so the full posterior is proper, it contains all these copies so this labeling is not identified in any real sense.
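The aliasing is easy to see numerically. In a two-component normal mixture (a toy sketch added here; the means, data, and equal weights are all made up), swapping the component labels leaves the likelihood untouched, so the posterior necessarily contains a mirror-image copy of itself:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])

def mixture_loglik(mu1, mu2):
    """Log-likelihood of an equal-weight two-component normal mixture (unit sds)."""
    return np.log(0.5 * norm.pdf(y, mu1, 1) + 0.5 * norm.pdf(y, mu2, 1)).sum()

# Relabeling the components changes nothing:
print(mixture_loglik(-2.0, 2.0))
print(mixture_loglik(2.0, -2.0))
```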

Here, and in general, identification depends not just on the model but also on the data. So, strictly speaking, one should not talk about an “identifiable model” but rather an “identifiable fitted model” or “identifiable parameters” within a fitted model.

Ben supplied some more perspective. First, in reaction to my definition that a Bayesian model is identified if the posterior distribution is proper, Ben said he agreed, but in that case “what good is the word ‘identified’? If the posterior distribution is improper, then there is no Bayesian inference.”

I agree with Ben, indeed the concept of identification is less important in the Bayesian world than elsewhere. For a Bayesian, it’s generally not a black-and-white issue (“identified” or “not identified”) but rather shades of gray: considering some parameter or quantity of inference (qoi), how much information is supplied by the data. This suggests some sort of continuous measure of identification: for any qoi, corresponding to how far the posterior, p(qoi|y), is from the prior, p(qoi).
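One concrete candidate for such a continuous measure (my sketch, not something settled in the discussion) is the KL divergence from prior to posterior, computed here on a grid for a conjugate beta-binomial model:

```python
import numpy as np
from scipy.stats import beta

grid = np.linspace(0.001, 0.999, 19961)   # stay off the endpoints to avoid log(0)
dg = grid[1] - grid[0]

def kl_post_prior(k, n, a=2, b=2):
    """KL(posterior || prior) for a Beta(a, b) prior and k successes in n trials."""
    post = beta(a + k, b + n - k).pdf(grid)
    pri = beta(a, b).pdf(grid)
    return (post * np.log(post / pri)).sum() * dg

print(kl_post_prior(0, 0))     # no data: posterior = prior, zero divergence
print(kl_post_prior(6, 10))    # some data: a modest shift away from the prior
print(kl_post_prior(60, 100))  # more data: a larger shift
```

Zero means the qoi is not identified at all by this data set; larger values mean the data have moved us further from where we started.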

Ben continues:

I agree that a lot of people use the word identification without defining what they mean, but there is no shortage of definitions out there. However, I’m not sure that identification is that helpful a concept for the practical problems we are trying to solve here when providing recommendations on how users should write .stan files.

I think many if not most people that think about identification rigorously have in mind a concept that is pre-statistical. So, for them it is going to sound weird to associate “identification” with problems that arise with a particular sample or a particular computational approach. In economics, the idea of identification of a parameter goes back at least to the Cowles Commission guys, such as in the first couple of papers here.

In causal inference, the idea of identification of an average causal effect is a property of a DAG in Pearl’s stuff.

I’d like to hold fast to the idea that identification, to the extent it means anything, must be defined as a function of model + data, not just of the model. Sure, with a probability model, you can say that asymptotically you’ll get identification, but asymptotically we’re all dead, and in the meantime we have sparseness and separation and all sorts of other issues.

Ben also had some things to say about my casual use of the term “weak identification” to refer to cases where the model is so weak as to provide very little information about a qoi. Here’s Ben:

Here again we are running into the problem of other people associating the phrase “weak identification” with a different thing (usually instrumental variable models where the instruments are weak predictors of the variable they are instrumenting for). This paper basically is interested in situations where some parameter is not identified iff another parameter is zero. And then they drift the population toward that zero.

Ben thought my above “A -> infinity” definition was kinda OK but he recommended I not use the term “weak identifiability” which has already been taken. Maybe better for us to go with some measure of the information provided in the shift from prior to posterior. I actually had some of this in my Ph.D. thesis . . .

Regarding my example where the data provide no information on the parameter p defined on (0,1), Ben writes:

Do you mean that a particular sample doesn’t tell us anything about p or that data are incapable of telling us anything about p? In addition, I think it is helpful to distinguish between situations where

(a) There is a unique maximum likelihood estimator (perhaps with probability 1)

(b) There is not a unique maximum likelihood estimator but the likelihood is not flat everywhere with respect to a parameter proposal

(c) The likelihood is flat everywhere with respect to a parameter proposal

What bothers me about some notion of “computational identifiability” is that a Stan user may be in situation (a) but, through some combination of weird priors, bad starting values, too few iterations, finite-precision arithmetic, a particular choice of metric, maladaptation, and / or bad luck, can’t get one or more chains to converge to the stationary distribution of the parameters. That’s a practical problem that Stan users face, but I don’t think many people would consider it to be an identification problem. Maybe something that is somewhat unique to Stan is the idea of identified in the constrained parameter space but not identified in the unconstrained parameter space, like we have with uniform sampling on the unit sphere.

Regarding Ben’s remarks above, I don’t really care if there’s a unique maximum likelihood estimator or anything like that. I mean, sure, point estimates do come up in some settings of approximate inference, but I wouldn’t want them to be central to any of our definitions.

Regarding the question of whether identification is defined conditional on the data as well as the model, Ben writes:

Certainly, whether you have computational problems depends on the data, among other things. But to say that identification depends on the data goes against the conventional usage where identification is pre-statistical so we need to think about whether it would be more effective to try to redefine identification or to use other phrases to describe the problems we are trying to overcome.

Hmm, maybe so. Again, this might motivate the quantitative measure of information. For Bayesians, “information” sounds better than “identification” anyway.

Finally, recall that the discussion all started because people were having problems running Stan with improper posteriors or with models with nearly flat priors and where certain parameters were not identified by the data alone. Here’s Ben’s summary of the situation, to best help users:

We should start with the practical Stan advice and avoid the word identifiability. The basic question we are trying to address is “What are the situations where the posterior is proper, but Stan nevertheless has trouble sampling from that posterior?” There is not much to say about improper posteriors, except that you basically can’t do Bayesian inference. Although Stan can optimize a log-likelihood function, everybody doing so should know that you can’t do maximum likelihood inference without a unique maximum. Then, there are a few things that are problematic such as long ridges, multiple modes (even if they are not exactly the same height), label switches and reflections, densities that approach infinity at some point(s), densities that are not differentiable, discontinuities, integerizing a continuous variable, good in the constrained space vs. bad in the unconstrained space, etc. And then we can suggest what to do about each of these specific things without trying to squeeze them under the umbrella of identifiability.

And that seems like as good as any place to end it. Now I hope someone can get the economists to chill out about identifiability as well. . . .

I like this paper

Information About Hyperparameters in Hierarchical Models

Prem K. Goel and Morris H. DeGroot

Journal of the American Statistical Association, Vol. 76, No. 373 (Mar. 1981), pp. 140–147

where they write “It is clear, however, from the discussion in this article that for many of the widely used information measures, the information about the hyperparameters decreases as one moves to higher levels away from the data. In practice, an experimenter with a small amount of data may as well specify the prior in only a few stages, because he cannot expect to learn about the higher level parameters to any significant degree.”

Better conceptualizations are likely a good idea.

> suggests some sort of continuous measure of identification: for any qoi, corresponding to how far the posterior, p(qoi|y), is from the prior, p(qoi).

Mike Evans has thought a lot about p(qoi|y) / p(qoi) – something better than a ratio?

(Though maybe a better term than relative belief is required http://www.utstat.utoronto.ca/mikevans/research.html )

Rather than asymptotics – that’s only directly relevant for zombies ;-)

perhaps raise the likelihood to various powers, cleverly recopy the data multiple times or even _data cloning_

What would be a suitable power, and how is this procedure different from data cloning? Just curious.
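For what it's worth, in a conjugate normal toy model (a sketch added here for illustration), raising the likelihood to the power K is exactly equivalent to cloning the data K times, and the behavior of the posterior sd as K grows is the diagnostic: an identified parameter's posterior sd shrinks like 1/sqrt(K), while a parameter the data never touch just sits at its prior sd.

```python
import numpy as np

def cloned_posterior_sd(K, n=10, prior_sd=10.0, informative=True):
    """Posterior sd for mu in y_i ~ N(mu, 1) with a N(0, prior_sd^2) prior, after
    cloning the data K times (equivalently, raising the likelihood to the power K).
    With informative=False the likelihood is flat in the parameter."""
    like_prec = K * n if informative else 0.0
    return 1.0 / np.sqrt(1.0 / prior_sd**2 + like_prec)

for K in [1, 10, 100]:
    print(f"K = {K:3d}   identified: {cloned_posterior_sd(K):.4f}   "
          f"unidentified: {cloned_posterior_sd(K, informative=False):.4f}")
```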

Some more hair-splitting, if you don’t mind –

I strongly agree with this one: “I’d like to hold fast to the idea that identification, to the extent it means anything, must be defined as a function of model + data, not just of the model. Sure, with a probability model, you can say that asymptotically you’ll get identification, but asymptotically we’re all dead, and in the meantime we have sparseness and separation and all sorts of other issues.”

To me it seems that the concept of weak identifiability consequently can’t be sufficient, because you can impose finite variance on a parameter by using a constrained prior, and in that case weak identifiability does not depend on the data at all.

I would even go as far as to say that identifiability means a finite-variance posterior given an improper prior. But I don’t know if this concept is too harsh. Maybe many of the commonly used, seemingly ‘identifiable’ models fall outside this category?

What’s wrong with the definition that a model is non-identifiable if and only if there are two distinct values for its parameters that produce the same distribution for the data?

I don’t see how the prior is relevant. And I don’t see how the data is relevant either, except for data that isn’t part of the probability model, like predictors in a regression model. (There, if for a regression problem, you happen to get the same value for a predictor variable in every observed case, then you can learn nothing about the regression coefficient for that predictor, though you might for some other dataset. It’s sort of optional whether you use “identifiable” to refer to the specific set of predictors observed, or consider a distribution over predictors.)

We already have the term “improper posterior”. And we have a simple sentence, “the posterior is the same as the prior”, or “almost the same”. I don’t think we need to redefine “non-identifiable” to mean one of these things, especially if the new definition is completely vague, and hence useless.

Just a clarification and a nitpick. First the clarification: here, “distribution for the data” means the sampling distribution. That is, unlike most Bayesian settings where we consider the data set to be fixed, for a model to be unidentifiable we need to have a pair of parameters giving the same likelihood for every possible data set and not just for the data set we actually have. Otherwise the definition implies silliness such as the mean of a unit-variance normal distribution being unidentifiable. But perhaps its dependence on data that weren’t actually observed is what makes some Bayesians unhappy with this definition.

Next, the nitpick: this only works if the set of observations in the data set is fixed (even though their values are not). So the data are relevant in the sense that an identifiable model may be rendered unidentifiable by removing data points (in the extreme case, removing all of the data will render any model unidentifiable).

That’s what I understood Radford to be saying in his first paragraph. That a model p(y|theta) is unidentified if and only if there are two distinct parameter values theta1 and theta2 that produce the same sampling distribution, so that p(y|theta1) = p(y|theta2) for all data sets y. Specifically, the data set y is not fixed in the definition.

This differs from what Andrew was saying. I don’t think Andrew’s unhappy with the standard definition, just that he doesn’t find it so useful to talk about identifiability for models in the limit of more data or in the face of alternative data sets he doesn’t have. The problem is that the standard definition is not enough, because it may fall down in the face of the data set we actually observe (from giving students a test, sending a satellite into space to measure light of stars, running a clinical trial, etc.). So we run into problems in practice if we’re given a data set y0, and there are two parameter values theta1 and theta2 such that p(y0,theta1) = p(y0,theta2) [I’ve switched to joint probabilities now to be compatible with the Bayesian approach].
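Radford's definition is easy to state in code. Take the simplest possible ridge model (this example is mine, not from the thread): y ~ N(a + b, 1). The parameter pairs (1, 2) and (0, 3) are distinct, yet they induce the same sampling distribution for every possible data set, so the model is non-identified in the classical sense:

```python
import numpy as np
from scipy.stats import norm

def sampling_density(y, a, b):
    """y ~ Normal(a + b, 1): only the sum a + b enters the sampling distribution."""
    return norm.pdf(y, loc=a + b, scale=1.0)

y_grid = np.linspace(-6.0, 12.0, 1001)
diff = np.abs(sampling_density(y_grid, 1.0, 2.0) - sampling_density(y_grid, 0.0, 3.0))
print(diff.max())   # identical for every y: (1, 2) and (0, 3) are indistinguishable
```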

Radford:

I think you’re right that as Bayesians we might be best off just abandoning the term “identifiability,” which, as Bob has pointed out on the Stan list, is already such an overloaded term. As with “bias” and “power,” what we have is a linguistically evocative word that doesn’t have all the meanings in statistics that are implied by its meaning in English.

On your more specific point, I think the prior is always relevant, if for no other reason than, in Bayesian inference, there’s no strict division between “prior” and “likelihood.” This point is most obvious in hierarchical models but is really the case in general. So any Bayesian definition that requires a distinction between prior and likelihood is itself dependent on that choice of partition, which is not part of the model itself. Now, for some purposes (notably, predictive checking and computation of out-of-sample prediction error), I think it’s appropriate for the result to depend on the partition. But for identification, I’m not so sure.

In any case, though, we’re agreed that trying to redefine the classical concept of “identifiability” in a Bayesian context is a fool’s errand or a mug’s game or whatever.

I would think that most people mean by non-identifiability simply that the marginal posterior for a parameter is not substantially more narrow than the prior, or at least pretty wide.

This indeed misses the point that the joint posterior might still occupy considerably less space than the joint prior distribution, so I agree that it’s a continuum rather than a yes / no.

On the other hand, the concern with non-identifiability is a logical response to results being easiest to plot, interpret, and sell when we can get a fixed, narrow marginal estimate for each parameter. Plus, correlations between parameters have a bad reputation in a frequentist world because they reduce significance for regression coefficients – if you have collinearity between two variables, for example, you would rather throw one out than find that their relative contribution is not identifiable. I often think that quantifying the correlation in posterior space is a much more elegant solution to this problem, although it unfortunately runs into problems for higher-order trade-offs.

I originally brought this up because of my confusion reconciling the notion of identifiability Andrew uses in his books (roughly that a model’s posterior is unimodal for a given set of data) with the proper definition pointed out to me by Ben and Michael Betancourt (and in Radford’s comment above).

The problematic posteriors Ben brings up are all about lack of unimodal posteriors and they jibe with Andrew’s definition of identifiability. But I agree calling this “identifiability” is just going to confuse people, especially those with good classical educations.

The “weak” vs. “strong” concept is really just about how sharp the peak is, which is obviously scale dependent and subject to change under reparameterizations. In fact, Andrew pointed this out when I brought up the Beta(0.5, 0.5) distribution, which has no modes because its density goes to infinity as the variable approaches 0 or 1. If you do a logit transform (with appropriate Jacobian), you get a nice mode at 0 (corresponding to 0.5 under the inverse logit transform). And in fact, that’s what Stan does behind the scenes when it samples from a parameter declared to be constrained to (0,1).
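Here's a check of that transform. The algebra: the Beta(0.5, 0.5) density times the Jacobian p(1-p) of the inverse-logit works out to sqrt(p(1-p))/pi, which peaks at p = 0.5, i.e. at x = 0:

```python
import numpy as np

def logit_scale_density(x):
    """Density of x = logit(p) when p ~ Beta(0.5, 0.5): the Beta pdf times the
    Jacobian dp/dx = p(1 - p) of the inverse-logit transform."""
    p = 1.0 / (1.0 + np.exp(-x))
    beta_pdf = 1.0 / (np.pi * np.sqrt(p * (1.0 - p)))
    return beta_pdf * p * (1.0 - p)

x = np.linspace(-8.0, 8.0, 100001)
d = logit_scale_density(x)
print("mode at x =", x[np.argmax(d)])   # the poles at 0 and 1 become a clean mode at 0
```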

Which brings us to why Michael Betancourt’s been giving us grief for even considering doing more work on (penalized) maximum likelihood estimates, which find modes. The MLE is a moving target that depends on parameterization. Far better to look at posterior means, which are much better behaved, as demonstrated in the Beta example above.

Two small things about the last paragraph:

– Firstly, you don’t look at posterior modes (or penalised MLEs) because you want to; it’s usually because means (medians, chicken gutting, etc.) are no longer feasible! Optimisation is infinitely easier than integration.

– Secondly, posterior means are only well behaved in (to nick the term from the topic) identifiable models. When you have, say, two reasonable modes, the mean is going to be in a really bad place (e.g. one with almost zero posterior mass). But then you get the awkward question of what exactly should your qoi be, and that’s probably a strongly application dependent thing, and so you have to compute the full posterior so the user can make their own choice and then we’re back to the first problem (of posterior intractability).
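The "mean in a really bad place" point is quick to verify with a made-up bimodal posterior (nothing from the thread, just an illustration):

```python
import numpy as np
from scipy.stats import norm

def post_pdf(x):
    """A bimodal 'posterior': equal-weight mixture of N(-4, 1) and N(4, 1)."""
    return 0.5 * norm.pdf(x, -4.0, 1.0) + 0.5 * norm.pdf(x, 4.0, 1.0)

x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]
mean = (x * post_pdf(x)).sum() * dx
print("posterior mean      :", round(mean, 6))   # 0, by symmetry
print("density at the mean :", post_pdf(0.0))    # essentially no mass here
print("density at one mode :", post_pdf(-4.0))
```

The mean lands exactly between the modes, in a region with almost zero posterior mass.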

I think this comment illustrates why “identifiable” should not be redefined to be meaningless. Bimodality of the posterior distribution is not the same as the model being “non-identifiable” by any useful definition of “identifiable” (certainly it’s not the same for the standard definition). Actual non-identifiable models will often (not always) result in bimodal posterior distributions, but the reverse is not true. If you want to say that your posterior is bimodal, why not use the word “bimodal”? It’s even shorter!

On another topic, the posterior mean being in a place with low posterior density is not necessarily a reason not to use it. If, for instance, you seriously have squared estimation error as your loss function, the mean is the right thing regardless. Of course, squared error is seldom your real loss function…

Right. I think the confusion comes from conflating “identifiable” with the existence of an MLE.

My feeling is that Andrew just wants to chuck the term “identifiable” for his uses altogether. Not try to redefine it to mean something different than the classical notion of having parameters that can’t be teased apart by any data set.

Radford:

Regarding your suggestions on language usage, I don’t disagree with you, indeed I’ve been persuaded by you, Bob, Ben and others that we should just let the word “identifiability” lie in the dustbin with “power,” “bias,” and all the other words that have misleading (I believe) technical meanings.

But . . . try flipping it around. Instead of (implicitly) asking, “Why is Gelman so foolish as to want to redefine a word that already has a clear meaning?”, you could ask, “Why is someone so non-foolish as Gelman interested in doing this redefinition? Perhaps Gelman has a good reason…”

My reason in this case is that people ask us about identifiability, and also that the general concept of “parameters being identified from the data” is more general than any specific thing about likelihood functions. It’s the same way that people ask us about “bias” even if they’re not thinking about the term in the silly technical context in which an unbiased estimate of a variance parameter has to be sometimes negative, etc.

Again, I’m convinced by all of you to give up on ever using the term “identifiability” in a Bayesian context. But I hope you can see my motivation. If I go your route (which I suppose I will), this just means that the next time someone asks me about the topic, I’ll have to say something like: Yeah, there is this statistical concept called “identification” but it has a particular technical meaning that’s not so appropriate with Bayesian inference . . .

P.S. I agree with you 100% regarding bimodality. I think that in general it can be misleading when people focus on specific features of a distribution, features such as ridges and multimodality. For example, a distribution can be really spread out and almost bimodal; that will still create issues in practice if inference is summarized by a single mode. Similarly, a ridge that’s nearly but not completely flat can still create problems both in computation and in inference.

Wait a minute – what does “parameters being identified from the data” mean? I’m familiar with the concept of parameters being “inferred” or “estimated” from the data – is “identifying” them the same thing?

My first thought was that rather than “identifiable”, the word you are looking for is “estimable” – but that word has also already been taken in the frequentist literature (e.g. a function is said to be (linearly) estimable iff there exists a linear unbiased estimator for it). Still, if you want to redefine a word I’d go with estimability rather than identifiability – seems more descriptive of the concept you have in mind.

“And that seems like as good as any place to end it. Now I hope someone can get the economists to chill out about identifiability as well. . . .”

I hope not, unless perhaps they’ve decided to adopt a new definition. My impression of the economists’ concept of “unidentifiability” is that it tends to be a function of unmeasured confounding, selection/sampling bias, etc. A rough translation might be something like having a parameter in a probability model that’s identified in the likelihood sense but corresponding to the wrong population (or superpopulation, as the case may be) due to some nonignorable aspect of sampling. The economists wouldn’t tend to tie identification to a likelihood function or probability model, of course, but I think the analogy is close.

A statistician might call this model misspecification instead of unidentifiability, but whatever you call it we ought to worry about it.

I’ll second Jared on this one, but point out that I think economists mean two different things when we say a model is or isn’t “identified”.

The first is mathematical/statistical: is there variation in X that allows us to estimate the correlation between X and Y. So, supposing we believe the true DGP is something like Yit = B*Xit + Zi + Eit, B is identified if there are multiple observations per person and we estimate a within-person (i) differences model. If there aren’t multiple observations per person, we can’t, and the model is unidentified, due to confounding/selection/whatever (as Jared points out – and also Ben in relation to Pearl’s DAGs).

The second has to do more with the metaphysical question (and hence the conceptualization of the DGP above): does the coefficient estimate for B converge to the “real” B? This question is more about “did you conceive of the DGP right?” or “what variation in the world is the model using to estimate Bhat, and is that variation likely to produce an estimate of the treatment effect of interest?”

The two problems go hand in hand: first, is your thought-experiment good? (my second point), and then, supposing it is, do you have the data to actually estimate that effect (my first point). So in that sense, I think Andrew is right that you need both the model and the data. But, to me, the real thing of interest isn’t the relation between the model and the data, but between the model and the real world, with the added complexity that you only have some small, sparse window on the world (the data itself) with which to work.

And I think this leads us to a thought I’ve recently been having about when and where a hierarchical Bayesian model makes sense, and where a reduced-form econometric model does – it depends on what you are trying to do (I know… deep, right?). For Andrew, I suspect that understanding a large, complex system is the task at hand most often. For me, I’m more likely interested in “What effect did this program/policy have on this outcome?” In my case, modeling the whole system is unnecessary, and I just want to get that one relation right – I can mis-specify the hell out of my model (in relation to the world), so long as it latches onto only the variation I really want. For Andrew, and many other applied statisticians, I’m guessing that the whole system is the object of interest, in which case you sacrifice clarity of coefficient interpretation (or at least causal clarity) for a broader understanding of the process at work as a whole. At least that’s what I’m thinking these days.

Just want to say thanks for this post and all the comments for confirming that I’m not the only one who’s confused about what we talk about when we talk about identifiability. I’m sympathetic to Radford Neal’s comment, but also find it extremely useful to see all these concepts gathered together in one place for a moment so that they can be labeled before being released back into the wild.

Seconded.

I’ve been trying to reconcile the mathematical definition of identifiability – that no two distinct parameter values give the same distribution for the data – with the Bayesian “overloaded” naming conventions of identifiability.

Before making my way here, I did find the article (Gelfand and Sahu 1999) about Bayesian identifiability vs. likelihood identifiability helpful.

Fast forward to this blog post. The clearer enumeration of how the term identifiability has been used to talk about different things has been the most helpful.

“Again, this might motivate the quantitative measure of information. For Bayesians, “information” sounds better than “identification” anyway.”

@Andy, In what ways is the idea in your head different from the entropy of the full-conditional distribution of a parameter, perhaps contrasted with the entropy of its marginal prior distribution or something?

Just to throw this into the mix, (using the definition of non-identifiability given by Radford) careless inference when some parameters are non-identifiable is likely to lead to stupid answers, whether your approach is Bayesian or otherwise. In the paper below we just put Dirichlet priors over potential outcomes (which are not directly observable). There’s no question of the posterior being improper, so formally the inference is correct: if you believed the prior before you saw the data, then you should believe the posterior afterwards.

But the parameter of interest is only semi-identified (i.e. one can infer inequality constraints), and values which satisfy the inequality constraints are indistinguishable in the likelihood. Which makes the posterior totally sensitive to the prior in those regions (see Fig 3). Garbage in garbage out.

http://www.stats.ox.ac.uk/~evans/RichardsonEvansRobins.pdf

Because statistics tends to be so splintered (let alone applied statistics where we have to incorporate other sciences as well), vocabulary is a nightmare.

In this case we were dealing with a very specific issue that has been coming up more and more frequently on the Stan Users’ List: when do you need (nontrivial) priors and when don’t you need (nontrivial) priors? As people build more and more complex models the posteriors become weirder and weirder, with some combinations of parameters well-constrained by the data and others essentially unconstrained (the Gaussian example in the post is a canonical example, but in reality we tend to see much nastier, non-linear behaviors). In these cases prior information is critical to ensuring a well-behaved posterior. A big part of modeling has then become identifying in what ways the likelihood does not constrain the parameters well so that the user can focus on modeling and then introducing realistic prior information to compensate (and perhaps reparameterizing the model to make this step easier). It’s a nice way to help the user solve their own problem by asking the right question.

The only issue, then, is how do we discuss these issues without having to describe them from scratch every time? Identifiability has the right idea, but it’s an overloaded term (seriously, though, what _isn’t_ an overloaded term these days?). Information isn’t right because we’re identifying only certain local properties of the posterior (so coordinate-invariant objects like KL divergence aren’t helpful).

I still find the issue of identifiability (in the classical sense) useful and something I struggle with all the time. Consider a basic item response theory model. I have a latent variable theta[i] representing the ability of Student i, and a item parameter b[j] representing the difficulty of Item j. I then have the probability of a correct response by Student i to Item j as

Pr(X[i,j]==1|theta,b) = invlogit(theta[i]-b[j]).

This model is not identified according to Radford’s definition (which I like a lot). In particular, if I add a constant c to theta[i] and b[j] for all i and j, then the model has the same likelihood.

The way I think about this model is that it is underconstrained. I would typically constrain it by adding a constraint such as mean(theta)=0.

I could constrain it by adding proper priors on theta or b, but in many ways that disguises the problem and the solution. Maybe “underconstrained” is a good approximation of what we mean by non-identified.
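A numerical illustration of both the invariance and the centering fix, with made-up abilities and difficulties:

```python
import numpy as np

def invlogit(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
theta = rng.normal(size=5)          # student abilities
b = rng.normal(size=3)              # item difficulties
P = invlogit(theta[:, None] - b[None, :])   # Pr(correct) for each student-item pair

# Adding the same constant to every ability AND every difficulty changes nothing:
c = 7.3
P_shifted = invlogit((theta + c)[:, None] - (b + c)[None, :])
print("max change after shift:", np.abs(P - P_shifted).max())

# The mean(theta) = 0 constraint pins down the location without moving any probability:
shift = theta.mean()
theta_c, b_c = theta - shift, b - shift
print("centered abilities sum to:", round(theta_c.sum(), 12))
print("probabilities unchanged  :", np.allclose(P, invlogit(theta_c[:, None] - b_c[None, :])))
```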

Russell:

Yes, and many open problems remain regarding how best to set these constraints. Similar issues arise in hierarchical models where there is nonidentifiability between regression coefficients and group-level errors. Jennifer and I discuss some of this in our book but I think a lot more needs to be figured out in this area.

Please, I’d like to understand the meaning of ‘weakly identifiable’ in the Classical approach. Can someone help me?

In “classic” linear regression with collinearity amongst predictor variables, t(X)*X isn’t invertible, which causes an inflation of the parameter variances. In a Bayesian approach that samples the posterior via MCMC, however, why should we still expect inflated variances of the posterior distributions of our parameters?
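The short answer is yes, we should, because the near-flat ridge lives in the likelihood itself and the sampler inherits it: with weak priors, the posterior precision is approximately t(X)*X/sigma^2, and near-collinearity makes that matrix nearly singular in one direction. A conjugate sketch (simulated predictors and vague N(0, 100^2) priors, all choices mine, added for illustration):

```python
import numpy as np

def posterior_sds(rho, n=200, sigma=1.0, prior_sd=100.0):
    """Posterior sds of two regression coefficients under vague independent normal
    priors, with the two predictors correlated at (roughly) rho; conjugate algebra,
    so this is the exact posterior an MCMC sampler would be exploring."""
    rng = np.random.default_rng(0)
    z = rng.normal(size=n)
    x2 = rho * z + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    X = np.column_stack([z, x2])
    post_prec = X.T @ X / sigma**2 + np.eye(2) / prior_sd**2
    return np.sqrt(np.diag(np.linalg.inv(post_prec)))

for rho in [0.0, 0.9, 0.999]:
    print(f"rho = {rho:5.3f}   posterior sds = {posterior_sds(rho)}")
```

As rho approaches 1 the marginal posterior sds blow up even though a linear combination of the coefficients stays tightly constrained — the same ridge geometry that makes sampling hard.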

I just wanted to stop by and say I did enjoy reading this post on my quest to understand identifiability. So thank you Andrew. I personally agree with Radford’s take on the issue, but for future readers, I wanna let them know I came across a book on identifiablity and redundancy in general, and beyond just statistical models. The book also has a chapter dedicated to Bayesian identifiability:

Parameter Redundancy and Identifiability

Book by Diana Cole

Chapter 6 is on Bayesian identifiability.