Hierarchical models for phylogeny: Here’s what everyone’s talking about

Posted on February 14, 2016 9:34 AM by Andrew

The other day on the Stan users list, we had a long discussion on hierarchical models in phylogeny that I thought might be of general interest, so I’m reconstructing it here.

It started with this question from Ben Lambert:

I am hoping that you can help me settle a debate.

My collaborators and I have data for experiments structured by the following categories (from top to bottom): genus -> species -> individual time series.

I believe that the best way to approach this is to use a hierarchical model which has 3 levels; one for each of the categories. However, my collaborators (entomologists) argue that the different species within a particular genus are so incredibly different (they use the analogy that the species within a particular genus are more different than say, lions and elephants, at a genetic level), that it does not make any sense to group them in any way. Furthermore, they argue that any ‘genus-level’ parameters that are estimated would be meaningless biologically, since they are averages across a range of very heterogeneous entities.

I definitely do see their point, but can’t help thinking that species are categorised within a particular genus for a reason; some sort of similarity. I agree with them that the overall parameters probably won’t mean much. However, I always default to using hierarchies whenever I can, due to all the benefits (reduced variance, less overfit, more parsimony etc).

I suspect it won’t make all that much difference to the results (some preliminary analysis results hint at this), but wanted to see what others thought of this. Do you think that this model should be a three-level hierarchy, or independent analyses consisting of a two-level hierarchy within each genus?

I replied that if you use a grouping factor that’s irrelevant, then it shouldn’t matter much in the analysis. That is, you could include a variance component for genus, and it shouldn’t really hurt you if genus doesn’t really matter for your purposes. So you could include it, or you could do the analysis both ways and it probably won’t matter. But the one thing you wouldn’t want to do, I think, is a separate two-level model for each genus. There’s information to be shared between genuses. Especially if genus doesn’t matter—then you’d really want to be combining across them.

But let’s step back for a moment. Why do we do hierarchical models? Why include grouping in the analysis? Because if we don’t have enough local data (i.e., if we have noisy time series, in this example), we want to do partial pooling to get better, more reasonable estimates. So . . . if your colleagues think that partial pooling across species within genus is a waste of time, then that’s fine. But then maybe there’s another model that could be fit, using some other characteristics of the species. To the extent that you can group the species a priori into reasonable categories, or to the extent that you can construct good species-level predictors, your partial pooling will be more effective.
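To make the partial-pooling idea concrete, here is a minimal sketch. It assumes each species has a noisy estimate with a known standard error, and it treats the between-species standard deviation `tau` as known. The function name `partial_pool` and the numbers are illustrative and do not come from the thread.

```python
import numpy as np

def partial_pool(group_means, group_ses, tau):
    """Shrink noisy group means toward a precision-weighted grand mean.

    group_means: raw estimate for each group (e.g., each species)
    group_ses:   standard error of each raw estimate
    tau:         between-group standard deviation (assumed known here)
    """
    y = np.asarray(group_means, dtype=float)
    se2 = np.asarray(group_ses, dtype=float) ** 2
    tau2 = float(tau) ** 2
    # Grand mean, weighting each group by its total precision.
    w = 1.0 / (se2 + tau2)
    mu = np.sum(w * y) / np.sum(w)
    # Weight on the group's own data: near 1 when its data are precise,
    # near 0 when its data are noisy relative to tau.
    weight = tau2 / (tau2 + se2)
    return mu + weight * (y - mu), weight

# Hypothetical estimates for four species.
means = [2.0, 0.5, 3.5, 1.0]
ses = [0.2, 1.5, 2.0, 0.3]
pooled, weight = partial_pool(means, ses, tau=0.5)
```

Species with noisy data are pulled strongly toward the grand mean, and species with precise data stay close to their own estimates. Setting `tau` to zero gives complete pooling, and a very large `tau` gives no pooling.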

And three biologists responded along the same lines, but with more specifics.

Josh Rosenau:

If this is a well-understood taxonomic group, then the genus ought to consist of species that are all more closely related to one another than to any other species, as the members of each species are more closely related to one another than they are to any other species. That would tend to argue for treating it hierarchically. OTOH, many groups have not been revised thoroughly and the taxonomic structure may not reflect the phylogeny as well as one might like, which could argue against it in some situations. On the third hand, not all biologists have really embraced cladistic taxonomy (naming groups based on relatedness) and the importance of controlling for the effects of shared ancestry in such analyses. In that case, you either fight with the biologists or defer to their specialized knowledge.

If it’s a particularly speciose group with lots of phylogenetic structure, and there are good published estimates of phylogenetic distance, it might make sense to incorporate that into your model. There are a few common approaches to phylogenetically-informed analysis which should be adaptable to your model.

In terms of interpreting the parameter estimates at the genus level, in an analysis fully incorporating the effects of phylogeny, the genus-level estimate ought to be something like an estimate of the common ancestor’s character state. Whether such a thing is meaningful in this instance I couldn’t say, but in general it seems like it should be. In an analysis that doesn’t incorporate phylogenetic distance (i.e. just the taxonomic levels), the interpretation would indeed be unclear.

Lizzie Wolkovich:

In the past two weeks I have had two such similar debates. I have two conflicting views—on the first side I find it interesting that your colleagues are fully behind the species concept but seem to think genus is completely irrelevant. Though I can understand that view a little I suspect the systematists who came up with the arthropods’ genera may be offended by this (or for this particular group it could well be that the genus level classifications are a mess and poorly done currently). I wonder if there is some other taxonomic level they would be comfortable with such as family. Lions and elephants are actually pretty similar once you start comparing them with komodo dragons and earthworms, for example, and even I can tell a beetle from an insect. [Are beetles not a type of insect?? I had no idea. — ed.]

Then to echo Andrew and go in the other direction, if your colleagues really think genus is irrelevant, or worse, poorly defined for these groups just now such that species A is in genus 1 but really belongs in genus 2 then you would be pooling in a way that could give you less biologically accurate estimates. In line with Andrew I would wonder if there are other characteristics you could group by.

In my own experiences we sometimes (1) skip all taxonomy above species and actually don’t partially pool species together even because we don’t want one species influencing the other too much (and, in last week’s example we realized we had missing data for some species that varied by treatment which would make the pooling effect especially strong for some species with the most unique treatment effects), (2) use distances from a phylogenetic tree instead of categorical levels (but then I echo that phylogenies are their own balls of wax to start batting at and evaluating) and (3) worry about how species can often be confounded with other effects, like site or who did the study. It sounds like from the below you might be in the realm of (3) perhaps.

Simon Blomberg:

The modern way to compare species is to incorporate the evolutionary relationships as a prior covariance matrix in a mixed-effects model, either on random effects associated with species (G level) or residuals (R level). Species are real entities (distinct, non-interbreeding populations) but genera are not: they are subjective artificial classifications built by taxonomists that may or may not have any basis in biology. The solution is to work with the evolutionary relationships (the phylogeny) directly, and forget about the biological classification.
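As a rough illustration of what a phylogenetic prior covariance matrix looks like, the sketch below builds one from a toy tree under a Brownian motion assumption, where the covariance between two species is proportional to the branch length they share from the root. The tree, the node names, and the helper functions are hypothetical and are not taken from the thread.

```python
import numpy as np

# Hypothetical toy tree: node -> (parent, branch length to parent).
# The root has no parent.
TREE = {
    "root": (None, 0.0),
    "anc1": ("root", 1.0),
    "A": ("anc1", 1.0),
    "B": ("anc1", 1.0),
    "C": ("root", 2.0),
}

def path_to_root(tree, node):
    """Return (node, branch_length) pairs from a node up to the root."""
    path = []
    while node is not None:
        parent, length = tree[node]
        path.append((node, length))
        node = parent
    return path

def bm_covariance(tree, tips, sigma2=1.0):
    """Brownian motion covariance: sigma2 times the shared root-to-tip branch length."""
    paths = {t: dict(path_to_root(tree, t)) for t in tips}
    n = len(tips)
    cov = np.zeros((n, n))
    for i, a in enumerate(tips):
        for j, b in enumerate(tips):
            shared = set(paths[a]) & set(paths[b])
            cov[i, j] = sigma2 * sum(paths[a][node] for node in shared)
    return cov

tips = ["A", "B", "C"]
V = bm_covariance(TREE, tips)
```

Here species A and B share one unit of history and so have covariance 1, while C split off at the root and is uncorrelated with both. A matrix like `V` could then serve as the covariance of species-level effects in a multivariate normal prior.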

Lambert had some concerns:

My fear from pooling information across genuses is that you are really comparing apples with pears. Even if the data look similar (imagine that we are doing mark-release-recapture studies for elephants and insects, say), does it make sense to share information in this context?

I agree it probably won’t hurt to group by genus ultimately, but I fear I may run into problems trying to justify it biologically/ecologically.

In my case the measurements were taken over very different geographies, and it is likely that the organisms (here mosquitoes) across species/genuses, probably do not respond that similarly to environmental conditions. Hence, I suspect that it may not make sense to pool information?

Bob Carpenter interpolated to ask if anyone’s been using Gaussian processes in this area, and he got two responses.

Maxwell Joseph:

I have toyed around with Gaussian processes with phylogenetic distance as an input in order to allow for correlation among species as it relates to evolutionary history. For prediction, these GP models are outperforming models with hierarchical structure based only on taxonomy (genus, family, order, etc.). Conveniently, previous work (e.g., Hansen and Martins 1996) points to correlation functions with mechanistic support. I haven’t been accounting for uncertainty in the phylogeny, however, though such approaches do exist (e.g., de Villemereuil et al. 2012).
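A minimal sketch of this kind of model, assuming a zero-mean Gaussian process with an exponential correlation function of phylogenetic distance and fixed hyperparameters. The distance matrix, trait values, and function names are hypothetical. A real analysis would estimate the hyperparameters, for example in Stan.

```python
import numpy as np

def exp_kernel(dist, sigma2=1.0, length_scale=1.0):
    """Exponential (Ornstein-Uhlenbeck-type) covariance as a function of distance."""
    return sigma2 * np.exp(-np.asarray(dist, dtype=float) / length_scale)

def gp_predict(dist, y, observed, target, noise_var=0.1, **kernel_args):
    """Posterior mean and variance of a zero-mean GP at one unobserved species.

    dist:     matrix of pairwise phylogenetic distances (assumed given)
    y:        trait values for the observed species, in the order of `observed`
    observed: indices of species with data
    target:   index of the species to predict
    """
    K = exp_kernel(dist, **kernel_args)
    K_oo = K[np.ix_(observed, observed)] + noise_var * np.eye(len(observed))
    k_to = K[target, observed]
    mean = k_to @ np.linalg.solve(K_oo, np.asarray(y, dtype=float))
    var = K[target, target] - k_to @ np.linalg.solve(K_oo, k_to)
    return mean, var

# Hypothetical distances among four species; species 3 has no data.
D = np.array([
    [0.0, 0.4, 3.0, 0.5],
    [0.4, 0.0, 3.0, 0.5],
    [3.0, 3.0, 0.0, 3.0],
    [0.5, 0.5, 3.0, 0.0],
])
mean, var = gp_predict(D, y=[1.0, 1.2, -2.0], observed=[0, 1, 2], target=3)
```

The prediction for species 3 is driven mostly by its close relatives, species 0 and 1, and hardly at all by the distant species 2. A taxonomy-only hierarchy cannot express that graded dependence on distance.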

Simon Blomberg:

Gaussian processes are currently the standard approach to analysing cross-species data. Essentially, you need a model of evolution for your traits. The most common models are Brownian motion and the Ornstein-Uhlenbeck process (both Gaussian). In both cases there is a simple relationship between the branch lengths on the phylogeny and the covariance structure for the data. I don’t know about mixture models in this context.
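The "simple relationship" between branch lengths and covariance can be written out. This is a sketch in standard notation that does not appear in the thread: t_ij is the branch length shared by species i and j between the root and their most recent common ancestor, d_ij is the total path length between them along the tree, sigma^2 is the rate of evolution, and alpha is the strength of the pull toward the optimum. The Ornstein-Uhlenbeck form assumes the root is drawn from the stationary distribution.

```latex
\begin{aligned}
\text{Brownian motion:} \quad & \operatorname{Cov}(y_i, y_j) = \sigma^2 \, t_{ij} \\
\text{Ornstein-Uhlenbeck:} \quad & \operatorname{Cov}(y_i, y_j) = \frac{\sigma^2}{2\alpha} \, e^{-\alpha \, d_{ij}}
\end{aligned}
```

Under Brownian motion the covariance grows with shared history. Under the stationary Ornstein-Uhlenbeck process it decays exponentially with the distance separating the two species.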

And Michael Betancourt wrote in to emphasize the value of hierarchical modeling from a purely predictive perspective:

Hierarchical models in absolutely no way imply a causal structure, so a phylogenetic relation is absolutely not necessary. It may be helpful, but in many cases it may not. For example, consider some model that depends on vision—if you assume that the eye evolved independently then species in very different branches of phylogenetic trees may have similar responses to vision-based observables which motivates an entirely different hierarchical structure.

The hierarchy represents similarities in the observations and consequently can only be justified as such. Even if genera are artificial classifications they can imply good hierarchical structure if the classification strategies consider observable behaviors. And phylogenetic clustering requires knowing the correct phylogenetic tree which is itself a huge uncertain problem.

And then Blomberg looped back in to connect these ideas back to the underlying biological models:

A phylogenetic model explicitly accounts for non-independence in the data due to common ancestry. Organisms can be similar due to common ancestry OR convergence. If convergence is your hypothesis, there are methods for testing for that, although how to do it correctly is a topic of current research.

If the genera are monophyletic, then you (Betancourt) may not be far wrong. But it completely ignores the intra-generic lack of independence, and it also treats each genus equally. Some genera will be more closely related to each other than others. Let’s be clear. You cannot avoid making a phylogenetic assumption in these kinds of models. The taxonomic approach implies a certain correlation structure for the data: All genera are monophyletic and assumed to arise at the same time, and independently of each other, and all species within a genus are assumed to arise at the same time and independently of each other. We know that this assumption is always wrong. Why build models which we know to be wrong on such an important aspect of the data? The only objective way to correctly model the phylogenetic independence is to use the phylogeny. Now, all models are wrong etc. But why be deliberately wrong by using a taxonomic hierarchical structure?

Bayesian methods are ideal for incorporating phylogenetic uncertainty. We have methods for that.
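The correlation structure that a taxonomic hierarchy implies can be written down directly. The sketch below assumes a simple genus -> species model with independent normal effects at each level. The labels and variance values are hypothetical.

```python
import numpy as np

def taxonomic_covariance(genus_of_species, var_genus=1.0, var_species=1.0):
    """Covariance among species effects implied by a genus -> species hierarchy.

    Species in the same genus share var_genus; species in different
    genera are independent.
    """
    genus = np.asarray(genus_of_species)
    same_genus = (genus[:, None] == genus[None, :]).astype(float)
    return var_genus * same_genus + var_species * np.eye(len(genus))

# Hypothetical labels: three species in genus "g1", two in genus "g2".
labels = ["g1", "g1", "g1", "g2", "g2"]
V_tax = taxonomic_covariance(labels)
```

Every pair of species within a genus gets the same covariance, and every pair across genera gets zero. That matches a tree in which all genera split from the root at once and all species within a genus split at once, which is the assumption Blomberg is objecting to.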

Now here’s Betancourt again:

And even if there are no similarities between the groups the hierarchical model will learn that and shouldn’t penalize you that much.
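One way to see this adaptivity is a small numerical sketch. It uses maximum marginal likelihood on a grid (an empirical-Bayes shortcut, not the full Bayesian fit one would run in Stan) and assumes known, equal standard errors. The data and function names are hypothetical.

```python
import numpy as np

def estimate_tau(y, se, grid=None):
    """Grid-search estimate of the between-group sd under y_j ~ N(mu, tau^2 + se_j^2)."""
    y = np.asarray(y, dtype=float)
    se = np.asarray(se, dtype=float)
    if grid is None:
        grid = np.linspace(0.0, 20.0, 2001)
    best_tau, best_ll = 0.0, -np.inf
    for tau in grid:
        var = tau ** 2 + se ** 2
        mu = np.sum(y / var) / np.sum(1.0 / var)  # profile out the grand mean
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (y - mu) ** 2 / var)
        if ll > best_ll:
            best_tau, best_ll = float(tau), ll
    return best_tau

def own_data_weight(tau, se):
    """Fraction of each partially pooled estimate that comes from the group's own data."""
    return tau ** 2 / (tau ** 2 + se ** 2)

se = np.full(6, 0.5)
similar = np.array([1.0, 1.1, 0.9, 1.05, 0.95, 1.0])     # groups look alike
different = np.array([-8.0, 5.0, 0.0, 12.0, -3.0, 7.0])  # groups look unrelated

tau_similar = estimate_tau(similar, se)
tau_different = estimate_tau(different, se)
w_similar = own_data_weight(tau_similar, 0.5)
w_different = own_data_weight(tau_different, 0.5)
```

When the groups look alike, the estimated between-group spread collapses and the estimates are pooled heavily. When the groups look unrelated, the estimated spread is large, the weight on each group's own data is close to 1, and the fit is nearly the same as analyzing each group separately.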

Blomberg:

It may be that the phylogeny might not add much to the analysis. But that is always an empirical question and not an argument for not using a phylogenetic model when you suspect that lack of independence due to phylogenetic relationships may be a problem. You can’t really know that in advance.

I’m not arguing against a hierarchical structure to the model. I’m just emphasising that using the taxonomy is the wrong way to do it. . . .

You could, for example classify species into ecological guilds (a guild is a group of possibly unrelated species that inhabit similar niches). But that doesn’t get around the problem of phylogenetic lack of independence in the data.

Not using any phylogenetic information at all, treating species as IID is perhaps the worst thing you could do. There is a strong analogy between time series and spatial data here. We would not treat time series as IID (no temporal autocorrelation), or spatial data without considering spatial autocorrelation. We should have the same respect for cross-species data. We should incorporate phylogenetic information on cross-species correlations, or at least entertain the idea that phylogenetic covariance could be a problem. There are some situations in which phylogenetic information could conceivably be ignored. For the time series analogy, we might pretend that our data are IID if the time series is short. Or for spatial data where the data are very far apart in space, making the IID assumption more plausible. So for small data sets or data sets where the phylogenetic covariance is thought to be extremely small, e.g. comparing phyla, it is possible that phylogenetic effects could be neglected. But that is an empirical question. . . .

At this point, Betancourt shot back:

You are missing the point entirely. As I said before, a phylogenetic model is a causal model—but the observational correlations between species need not be causal. Hierarchical structure is statistical so it doesn’t need nor really care about whether any correlations are causal, which is why they are so amazingly powerful and widely applicable.

And, once again, there is no real cost of adding hierarchical structure that’s not there other than increased computation (the hierarchical model will converge to an IID model if necessitated by the data). Another reason why they are so awesome. . . .

We might not care a lick about the phylogenetic structure! Taxonomic structure can capture useful correlations—if you think otherwise then criticize the particular choice of taxonomy relative to a given observational model, but blanket criticizing taxonomies is equivalent to blanket criticizing hierarchical models in general, which are in general not causal relationships.

Can phylogenetics provide useful motivation for building statistical models? Absolutely. Are they necessary and sufficient for all statistical analyses? No way.

Again, we’re not talking about phylogenetics. The point is that even if you don’t like a chosen taxonomic structure then you don’t have to worry about the fit because the hierarchical model will learn the independence of the groups.

To connect back to the models, Betancourt wrote:

Yes, Bayesian methods are ideal. Or they will be once we have any idea how to effectively explore and sample from tree spaces with corresponding guarantees/validation methods/diagnostics that we can represent the true posterior uncertainty with any fidelity. Topological real talk.

Blomberg replied:

I disagree that phylogenetic models are necessarily causal models. It still makes sense to use the phylogeny as a hypothesis about covariance among species even when there is no notion of a variable having evolved along a tree. It’s the lack of independence in the data that is what is being modelled here. A hierarchical modelling approach is completely appropriate. It’s the structure of the hierarchy that is the issue. Taxonomy doesn’t cut it.

I agree that hierarchical models are good, awesome, whatever. And the model will converge to an IID model if there is really no phylogenetic “signal” in the data. But again it’s an empirical question. When dealing with cross-species data, your baseline assumption (your prior) should be that there is lack of independence in the data due to phylogenetic effects.

But you should care about the phylogenetic structure! You should care for the same reason as that you would care about temporal autocorrelation in time series, spatial autocorrelation in spatial data, pedigree information in genetic models etc. The phylogeny is the only way to incorporate that information in cross-species data. I am criticising taxonomy in general because they a) don’t represent anything like phylogenetic hierarchical structure. To the extent that taxonomies are useful is only because of some (perhaps accidental) similarities to the underlying phylogeny and b) they are not objective. I am not criticising hierarchical models in general. I use them all the time, and I think Bayes is the best way to implement these models. It’s just the structure of the hierarchy that I am arguing should be based on the phylogeny.

Can phylogenetics provide useful motivation for building statistical models? Absolutely. Are they necessary and sufficient for all statistical analyses? No way.

In other parts of the thread I have alluded to situations in which the phylogeny may not be useful. But I still maintain that your prior on the covariance structure of the data should be based on the phylogeny! Any other approach is a) wrong a priori and b) subjective. Are they necessary? No. for the reasons I have mentioned elsewhere in the thread. Are they sufficient? No because there are other substantive questions that we are interested in about our data. The model should “learn” whether that phylogenetic prior has any relevance to the posterior parameter estimates. That’s great. But the model should be given the chance to learn that!

Back to Betancourt:

a) Taxonomical structure may be based on previous observations that may be compatible with new measurements, hence a quite good motivation for a hierarchical prior.

b) Even if the hierarchical structure is chosen poorly the model will adapt and inferences will largely remain valid.

c) Even if there is some “objective” phylogenetic structure, it need not manifest in the observables and hence need not be relevant to a hierarchical model.

d) Known phylogenetic structures depend on data models, models which are built out of assumptions, assumptions which are in no way “objective”.

e) On top of that, even state-of-the-art phylogenetic MCMC methods are extremely limited. So even if there was an “objective” model we wouldn’t be able to use it to construct the necessary inferences to pick out the corresponding “objective” phylogenetic trees compatible with the data.

So the statements that “phylogenetic trees are objective and known a priori” and “always the correct hierarchical structure” are both incorrect. In a real problem using neither taxonomical structure nor phylogenetic structure will lead to poor inferences. One may certainly be better, but which is better depends very strongly on the details of a given model and hence no approach can be determined “correct” for all models.

Again the original question was not “should I use a taxonomical hierarchy or phylogenetic hierarchy” but rather “will using a taxonomical hierarchy lead to poor inferences.” The answer to the latter is absolutely not. End of answer.

Last word on this from Blomberg:

Modern classifications are based on data. And most often they are now built to reflect some aspect of phylogenies. But the Linnean hierarchy cannot accurately reflect phylogeny. Life just doesn’t evolve according to the Linnean hierarchy. The Linnean hierarchy is a pre-Darwinian human construct that can only be an imperfect representation of evolutionary history. Estimates of phylogenies are also an imperfect representation of evolutionary history (a tentative model). But if they are based on good data and reasonable assumptions made explicit at every step of the analysis, and appropriate diagnostics are used and sensitivity to assumptions is examined, then it makes sense to use this information as a better tentative description of reality and use it as information informing further analyses.

Taxonomies work less well than known phylogenies in simulations (by frequentist criteria). Inferences using a “mostly correct” phylogeny have better properties than using a bad classification. Inferences based on a prior set of highly probable trees also have good frequentist properties.

There may be no “phylogenetic signal” in the data, this is true. But that is an empirical question dependent on the observables and the phylogeny. The phylogeny may not be relevant to the particular model, but then a taxonomic model will not either. There will be times when it does matter. For that you want the best estimates of among-species covariances that you can get. From the phylogeny.

Yes, phylogenies are estimates based on models. And they are almost always wrong in some regard. This is partly why modern phylogenetic comparative methods try to account for phylogenetic uncertainty. Systematists routinely publish their data sets and try different ways of analysing them to try to get the most robust estimate of the phylogeny. Methods sections in papers are usually very detailed and explicit about models and assumptions. This is becoming even more so with the new push for Open Science. And more data are always welcome. The genomics revolution has meant that we are getting better and better at estimating phylogenies. Science is built on these incremental advances. To say that the models and assumptions are not objective is not to criticise phylogenetic models. It is to criticise all of scientific practice. There is subjective choice in models and assumptions. But that is nowhere near the degree of subjectivism and assumption-ladenness inherent in making up a Linnean hierarchy for a given set of taxa and then using it to model the hierarchical structure and consequent lack of independence in the data.

Phylogenetics is hard. It is NP-hard. ALL our methods are based on heuristics (assumptions) that seem to work in practice but we generally have no idea whether the best tree we have is the true course of evolution that really happened. Maybe another heuristic could find an even better tree. I don’t see this as a problem, just part of the scientific process. All statements about reality are tentative, until something better comes along.

We know from simulations how consistent the estimating methods are as the amount of data increases, under different conditions. This research has been going on since the “likelihood” versus “parsimony” wars of the 1970s. Now real data are more messy than simulated data, no question. But we are working on that! And again, there is One True Objective Phylogeny which we are trying to estimate: the real process of evolution that actually happened. That is the benchmark we are trying to achieve.

We are generally not trying to pick the corresponding “objective” tree that is compatible with the observed data. Trees just represent phylogenetic covariance. Some trees will necessarily fit the data better than others, and that may not be the true tree. But our best guess at the covariances (the prior) should come from our best guess at the true phylogeny.

Phylogenetic trees are not known a priori. They are estimates (statements, hypotheses, models) of the true pattern of evolution. Well-supported phylogenies aren’t necessarily the true pattern of evolution. But they are our current best guess. It’s not “always the correct hierarchical structure”. But I argue that it is our best least wrong guess, given our current knowledge of the study species. Further data may change the phylogeny. That’s OK. That’s science. It may mean that the analysis will have to be re-visited.

But if your starting point is to use the Linnean hierarchy, my view is that you are quite possibly shooting yourself in the foot. A good phylogeny is “least wrong” as I said above. I’m going to stick my neck out here and say that other researchers in my field agree. There is about 30 years of research, thinking about and analysing multi-species data sets to find ways to best account for the hierarchical nature of multi-species data. I really think we have progressed, and one of those milestones was to ditch the use of the Linnean hierarchy.

If I was to bet money (and as a statistician, I never do), I would bet on a good phylogeny over a Linnean hierarchy any time.

The funny thing is, from the tone of this discussion, it looks like Blomberg and Betancourt are having a big argument. But after reading more carefully, I think they’re basically in agreement:
– Hierarchical modeling is a good way to account for partial information in this sort of predictive setting.
– Hierarchical models are most effective when used in the context of substantive information.

The interaction between statistical modeling and substantive concerns in this discussion is fascinating. This one’s not as important as football and elections, but I’ve heard that some people care about it nonetheless.

15 thoughts on “Hierarchical models for phylogeny: Here’s what everyone’s talking about”

Rahul on February 14, 2016 at 10:44 am said:

I love this sort of back and forth among experts. What looks like a big heated argument is actually far more productive than consensus at some shallow level.

The best threads from expert forums often look like fights. But that’s often experts who really care about the details fighting through the thicket for clarity to emerge.

Sometimes the substantive technical issues get sorted out much better in blunt, often rude-seeming arguments than in milder, vaguer, softer discussions where no one takes a stand or speaks their mind bluntly. So long as the parties involved don’t take things personally and are not vindictive.

In my opinion, the reason something like the Linux kernel mailing list works so well is that people are blunt, terse and a tad rude to each other and never afraid to speak their minds.

Z on February 14, 2016 at 10:46 am said:

My takeaway from the back and forth between Betancourt and Blomberg:
Modeling variation taxonomically instead of phylogenetically is like modeling a time series based on what time somebody sitting in a dark room felt like each observation was recorded instead of the actual time it was recorded. Betancourt says that the guy in the dark room’s time estimates could in theory be more informative for some reason, and even if they’re not informative at all then incorporating them in a hierarchical model won’t do much harm compared to no pooling. But Blomberg still recommends using the actual time, which seems to make sense.

Alex D on February 14, 2016 at 12:48 pm said:

(This comment involves my own mental models of Betancourt and Blomberg, so apologies to the real people if I am misrepresenting their positions)

I think a key point that Betancourt is making is that it’s possible to think about hierarchical models as regularization devices rather than data-generating stories, and there are often gains to be made by using a hierarchical model whose prior structure is simple, even if that structure isn’t supported by a causal story. This, of course, depends on the scientific question.

It is not unheard of for dyed-in-the-wool Bayesians to only accept prior structures and models that entail a full data-generating story, and as a result, to either declare certain complex problems as “too difficult for the time being” or to accept an extremely complex model whose unverifiable assumptions affect the output of the estimation procedure in subtle ways that are difficult to understand. In these cases, depending on the question that needs to be answered and the range over which that answer needs to remain valid, specifying a simple, well-understood model can be useful, even if that model relies on qualitative observations of 18th century Europeans, especially if the subject of the investigation involves explaining data that may have influenced the 18th century Europeans in their classification.

It seems that Blomberg is arguing that while this may be true in general principle, in this particular case, phylogenetic models are currently well-enough understood that the risk of including the more complex phylogenetic prior structure (the accompanying computational effort and sensitivity to assumptions) is worth it for better predictive performance for almost all biological questions. I think Betancourt could be convinced of this idea, but the key unanswered contention from Betancourt is that the reasoning for using phylogenetic tree priors should be stated in these terms, rather than on the basis of the “objectivity” or “correctness” of the phylogenetic prior. In particular, I think Betancourt would like Blomberg to admit that depending on the question, the available computational resources, and the level of uncertainty in the relevant part of the phylogenetic tree, a reasonable data analyst could make a different choice, regardless of the “objectivity” or the “correctness” of phylogenetic assumptions.

- Michael Betancourt on February 14, 2016 at 7:36 pm said:
  
  People building mental models of me is a scary thought…
  
  I think you’ve largely characterized the discussion correctly. The only point I’ll emphasize is that non-phylogenetic groupings could be generative, for example if based on phenotypes or taxonomies that capture phenotypes. But ultimately the answer to the original question, whether non-phylogenetic models are dangerous, is best answered with the perspective that hierarchical models can always be considered as non-generative, well-behaved regularizers.
  
Martha (Smith) on February 14, 2016 at 5:06 pm said:

Background: For several years, I attended a phylogenetics seminar; I have served on the dissertation committees of three phylogenetics students.

My overall opinion:

The biggest thing to emphasize is that just what is the best approach (in particular, what groupings might be relevant) depends on the question you are interested in, the data you have, and the limitations of methods used to obtain inputs into an analysis.

Some specifics:

1. One important point lost in the focus on the interchange between Betancourt and Blomberg is Lizzie Wolkovich’s comment,

“In my own experiences we … (3) worry about how species can often be confounded with other effects, like site or who did the study. It sounds like from the below you might be in the realm of (3) perhaps.”

Indeed, hierarchical analysis based on site or who did the study might be revealing, just as it might in meta-analyses of medical or educational studies.

2. Species are not entirely well-defined. In fact, some phylogenetic methods focus on “populations” rather than species. This (at least partially) takes into account the continued evolution of phylogenetic groups. For example, seeds that somehow get across a mountain may not disperse back to the original side if the birds that carry them die out or change migration patterns, so there is a question as to whether or how long the plants on one side of the mountain are in the same species as those on the other side.

3. Older definitions on species may be based on physical similarities, but newer definitions based on evidence of genetic relationship. Thus, older definitions may group organisms that look similar but are not closely related phylogenetically.

4. Classifications into genera are usually even more iffy than older species classifications.

5. Convergent evolution covers various possibilities: It might produce similar-looking or similar-functioning structures in unrelated species (e.g, wings in birds and bats), so classification by visually similarities has its problems. But sometimes it might be of interest to study such similarities (e.g., Betancourt’s comment about optical structures), in which case phylogenetic information might be irrelevant (as Betancourt points out).

- Martha (Smith) on February 14, 2016 at 5:20 pm said:
  
  Typos:
  3. “definitions of species”
  5. “visual similarities”
  
  Also:
  
  6. Phylogenies obtained by the various methods have their limitations, which need to be taken into account in using them in further analyses. One problem is “introgression”, referring to hybridization and back-crossing. Thus the “true” phylogenetic “tree” will not be a tree, but a more complex structure. However, many phylogeny-generating methods only spit out a tree.
  
  7. Most programs for generating phylogenies output some “measure” of probability for each node in the tree. The Bayesian measure has a typical Bayesian posterior interpretation; however, the so-called “bootstrap probabilities” have a pretty fuzzy interpretation. Thus using a phylogeny as input into a further analysis ideally should take such uncertainties into account, but how to do this may be difficult to figure out.
  
  8. Bottom line seems to be: Try all plausible methods; if they agree, great. If they don’t, … well, science is a work in progress.
  
  - Michael Betancourt on February 14, 2016 at 7:13 pm said:
    
    +1
    
Michael Betancourt on February 14, 2016 at 7:31 pm said:

One thing to keep in mind here is that the discussion was not about whether genotype or phenotype/other non-genetic characteristics are the best ways to group individuals when building up a model. The original question asked whether the latter would inevitably lead to bad inferences.

The answer is an emphatic “no”. And the reason is purely statistical and has nothing to do with the science — even if the grouping chosen is not consistent with the true data generating process then the hierarchical model will adapt to have little to no effect on the final inferences. In particular, this means that even if the taxonomies are wrong then there’s no reason not to explore their use in building hierarchical models. If phylogenetic models are better then that will manifest immediately in any statistical comparison.

Really most of my contributions after first making this point were examples of why phylogenetics could fail to be the most important feature and hence taxonomic/phenotypic groups could be useful. But that ended up straying pretty far from the original point!
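
The statistical point above, that a hierarchical model adapts when the chosen grouping is irrelevant, can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a model from the thread: it uses a balanced one-way layout and a simple method-of-moments (empirical Bayes) estimate of the between-group variance instead of a full Bayesian fit, which would estimate the group-level scale with uncertainty. The function names `partial_pooling` and `simulate` and all parameter values are hypothetical.

```python
import random
import statistics


def partial_pooling(groups):
    """Empirical-Bayes shrinkage for a balanced one-way layout.

    groups: list of equal-length lists of observations.
    Returns (weight, shrunken_means). The weight is the share of each
    group's own mean that is kept: 0 means complete pooling toward the
    grand mean, 1 means no pooling at all.
    """
    n = len(groups[0])
    means = [statistics.fmean(g) for g in groups]
    grand = statistics.fmean(means)
    # Within-group variance, averaged across groups (balanced design).
    sigma2 = statistics.fmean([statistics.variance(g) for g in groups])
    # Method-of-moments between-group variance, truncated at zero.
    tau2 = max(0.0, statistics.variance(means) - sigma2 / n)
    denom = tau2 + sigma2 / n
    weight = tau2 / denom if denom > 0 else 0.0
    shrunk = [grand + weight * (m - grand) for m in means]
    return weight, shrunk


def simulate(group_sd, n_groups=30, n_per_group=20, noise_sd=1.0, seed=1):
    """Draw data whose group centers have standard deviation group_sd."""
    rng = random.Random(seed)
    centers = [rng.gauss(0.0, group_sd) for _ in range(n_groups)]
    return [[rng.gauss(c, noise_sd) for _ in range(n_per_group)]
            for c in centers]


# Grouping that carries no information: all groups share one center.
w_irrelevant, _ = partial_pooling(simulate(group_sd=0.0))
# Grouping that matters: group centers differ a lot relative to the noise.
w_relevant, _ = partial_pooling(simulate(group_sd=3.0))

print(f"weight on group means, irrelevant grouping: {w_irrelevant:.2f}")
print(f"weight on group means, relevant grouping:   {w_relevant:.2f}")
```

With these settings the weight for the irrelevant grouping should come out well below the weight for the relevant grouping, which should be close to 1. The exact values depend on the seed and on the number of groups, since the between-group variance is estimated from the data.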

Steve Sailer on February 14, 2016 at 7:36 pm said:

“Species are real entities (distinct, non-interbreeding populations)”

Except when they are not.

I’m being snarky, but:

A. There are a few dozen different definitions of species floating around. Ernst Mayr’s non-interbreeding definition is easiest to understand, but it has numerous problems, both theoretical and real world. (But that doesn’t mean it’s completely wrong, either.)

B. The species question isn’t just philosophical: it comes up all the time in the big-money context of enforcing the Endangered Species Act, which frequently has multibillion-dollar consequences for things like real estate development.

For example, are grey wolves distinct and non-interbreeding enough with respect to dogs and coyotes to merit protection? Well, they seem pretty distinct to most people, but they are not non-interbreeding. What about red wolves, which appear to be an ongoing hybrid of wolves and coyotes?

Is the dime-sized weed called the San Fernando Spineflower distinct enough from the dime-sized weed called the San Gabriel Spineflower to shut down the huge Ahmanson Ranch housing development?

Is the rare California Gnatcatcher distinct and non-interbreeding enough with the common Baja Gnatcatcher to be protected under the Endangered Species Act? I had lunch once with a golf course owner who eventually went broke and had to sell his golf course to Donald Trump in part because he’d set aside land that could have been sold as building lots to protect the endangered California Gnatcatcher. But then the biologist who’d declared them a separate species changed his mind and said they were just a local race of the unendangered Baja Gnatcatchers.

Is species just a social construct?

Well, no, but not completely no, either.

Steve Sailer on February 14, 2016 at 7:49 pm said:

One surprising aspect is that 18th century Linnaean hierarchies aren’t all that bad, even though they were pre-evolutionary in conception. They aren’t like Aristotelian mechanics. They didn’t have to be junked wholesale when Darwin came along or when genome analysis came along. The old categories often could be tinkered with rather than thrown out. The Linnaean glass is definitely part full (as well as, of course, part empty).

I think one reason for this surprising usefulness is that Linnaeus paid a lot of attention to the look of genitalia, which tends to correlate with who can interbreed with whom.

Steve Sailer on February 14, 2016 at 8:03 pm said:

One thing to keep in mind is that academics sometimes have very strong opinions on lumper-splitter questions that can be misleading to naive outsiders who don’t understand the internal professional politics that encourage insiders to take a vociferous stance on lumping v. splitting.

For example, being able to tell yourself “I’m the world’s leading expert on X” can be more satisfying if you believe X isn’t just an obvious close cousin of Y and Z, but instead that X is a remarkably distinct entity almost unto itself. Or, perhaps your expertise in X can make you the go-to guy for soundbites about X, Y, and Z if you make the case for lumping them.

Steve Sailer on February 16, 2016 at 4:55 am said:

My hunch is that the philosophical problem of categorizing living creatures in our Darwinian age is one of the most interesting intellectual challenges of this century.

- Rahul on February 16, 2016 at 7:01 am said:
  
  But isn’t taxonomy a bit of a redundant, not-very-useful pastime? A relic of an age where every entity had to be placed under a unique bin. Which led to quirky systems like the Dewey decimal system.
  
  Pedantically categorizing living creatures & splitting hairs over whether a weird bacterium must be put into binA or binB sounds like a futile job.
  
  - Steve Sailer on February 22, 2016 at 11:21 pm said:
    
    It’s necessary under the Endangered Species Act, which has a huge impact on who can build where.
    
    I’m not saying your objection isn’t valid, just that these pedantic-sounding questions have billion dollar consequences in the real world of real estate development (and much else).
    
Luiz Carvalho on July 22, 2016 at 11:39 am said:

BTW, beetles ARE a type of insect (https://en.wikipedia.org/wiki/Beetle).
Regarding the actual debate, I side with Betancourt in his argument that adding an extra grouping level, if set up correctly, should have almost no effect on the final estimates if said grouping is irrelevant. Loads of research into shrinkage was done to ensure this was the case.

As Andrew pointed out, Blomberg and Betancourt are not on opposite sides of the discussion. I too noticed more similarities than disagreements in their arguments.

My take on this is that, if possible, you should try and set up your model with a hierarchy that reflects phylogeny. Why? Well, by Betancourt’s own argument, if phylogeny doesn’t matter, the model will adapt and the final estimates/predictions won’t be too far off.


Statistical Modeling, Causal Inference, and Social Science
