Nick Patterson writes:

I am a scientist/data analyst, still working, who has been using Bayesian methods since 1972 (getting on for 50 years). I was initially trained at the British code-breaking establishment GCHQ, by intellectual heirs of Alan Turing.

I’ve been accused of being a Bayesian fanatic, but in fact a good deal of my work is frequentist—either because this is easier and “good enough” or because specifying a decent model seems too hard. Currently I work in genetics, and here is an example of the latter problem. We want to know if there is strong evidence against a null that 2 populations have split off with basically no gene-flow after the split. Already this is a bit hard to handle with Bayesian methods, as we haven’t really got any kind of prior on how much flow might occur. But it is even harder than that. It turns out that the distribution of relevant statistics depends on the extent of “LD” in the genome, and the detailed structure of LD is uncertain. [LD is jargon for non-independence of close genomic regions.] I use a “block jackknife” which (largely) deals with this issue but is a frequentist technique. This is a case where frequentist methods are simple and mostly work well, and the Bayesian analogs look unpleasant, requiring inference on lots of nuisance parameters that frequentists can bypass.

I’d love to have a dialog here. In general I feel a lot of practical statistical problems/issues are messier than the textbook or academic descriptions imply.

My reply:

I don’t have experience in this genetics problem, but speaking in general terms I think the Bayesian version with lots of so-called nuisance parameters can work fine in Stan. The point is that by modeling these parameters you can do better than methods that don’t model them. Or, to put it another way, how does the non-Bayesian method finesse those “nuisance parameters”? If this is done by integrating them out, then you already have a probability model for these parameters, and you’re already doing some form of Bayesian inference, so then it just comes down to computation. If you’re bypassing the nuisance parameters *without* modeling them, then I suspect you’re throwing away some information.

This is not to say that the existing methods have problems. If your existing methods work well, fine. Your statement, though, was that the existing methods “mostly” work well. So maybe the starting point is to focus on the problems where the existing methods don’t work well, as these are the problems where an investment in Bayesian modeling could pay off.

And recall footnote 1 of this paper.

How is “working well” determined? Better predictive skill of the model?

I assume that what you want to look for are alleles that by chance show substantial differences in frequency in the two subpopulations immediately after the split. Could you not set up a model in which the frequencies of these alleles are subject only to drift in the two subpopulations, Monte Carlo for many generations, then get a distribution on the frequencies of those alleles? It’s true that you still have to make assumptions on subpopulation sizes, allele frequencies, extent of gene flow between the subpopulations, etc. It might work better to model haplotypes rather than single loci; then as you note, you would have to model the LD between those loci. That is a lot of parameters, but in principle, it seems doable.
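The drift-only null described above can be simulated directly. Here is a minimal sketch of that idea under a neutral Wright-Fisher model, with binomial resampling of allele frequencies each generation; the function name, parameter values, and population sizes are illustrative assumptions, not anything from the actual analysis:

```python
import numpy as np

def wright_fisher_drift(p0, pop_size, generations, n_reps, rng=None):
    """Simulate allele-frequency drift under a neutral Wright-Fisher model.

    p0          -- ancestral allele frequency at the split
    pop_size    -- diploid population size (2N chromosomes per generation)
    generations -- number of generations since the split
    n_reps      -- number of independent replicate loci/populations
    """
    if rng is None:
        rng = np.random.default_rng()
    p = np.full(n_reps, float(p0))
    for _ in range(generations):
        # Each generation, 2N chromosomes are binomially resampled,
        # so frequencies随 drift toward fixation or loss.
        p = rng.binomial(2 * pop_size, p) / (2 * pop_size)
    return p

# Drift two subpopulations independently from the same ancestral frequency;
# under the null of no gene flow, the frequency difference has mean 0,
# with a spread that grows with the amount of drift.
rng = np.random.default_rng(1)
pa = wright_fisher_drift(0.3, pop_size=1000, generations=200, n_reps=10000, rng=rng)
pb = wright_fisher_drift(0.3, pop_size=1000, generations=200, n_reps=10000, rng=rng)
diff = pa - pb
```

Comparing the observed frequency differences against this simulated null distribution is the single-locus version of the proposal; modeling haplotypes would mean simulating linked loci jointly, which is where the LD parameters come in.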

> That is a lot of parameters, but in principle, it seems doable.

That was precisely my point. My frequentist technique is basically to analyze each “SNP” (variable locus) as though independent. There can easily be 1M loci. That gives a statistic that under the null has mean 0. To get the standard error we delete large blocks (about 5M genome bases) in turn and apply the jackknife. This can easily be coded up in a day or two. A Bayesian analysis, if practical at all, would in my best guess be months of work, and perhaps sensitive to obscure modeling assumptions about LD.

And I have run my test on perhaps 1M population pairs…
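The delete-one-block jackknife described here can be sketched in a few lines. This is a simplified illustration, not the actual implementation: it assumes equal block weights (the published method weights blocks by the number of SNPs they contain), and the function and variable names are made up for the example. The key point is that SNPs are deleted a whole genomic block at a time, so LD-linked loci leave the estimate together:

```python
import numpy as np

def block_jackknife(per_snp_stats, block_ids):
    """Block-jackknife mean and standard error of a per-SNP statistic.

    per_snp_stats -- one value per SNP (a per-locus contribution to the
                     test statistic; mean 0 under the null)
    block_ids     -- which genomic block each SNP falls in (e.g. 5 Mb
                     chunks), so correlated SNPs are deleted together
    """
    per_snp_stats = np.asarray(per_snp_stats, dtype=float)
    block_ids = np.asarray(block_ids)
    blocks = np.unique(block_ids)
    g = len(blocks)
    # Recompute the statistic with each block deleted in turn.
    estimates = np.array(
        [per_snp_stats[block_ids != b].mean() for b in blocks]
    )
    mean = per_snp_stats.mean()
    # Delete-one jackknife variance (equal block weights for clarity).
    se = np.sqrt((g - 1) / g * ((estimates - estimates.mean()) ** 2).sum())
    return mean, se
```

With iid per-SNP values this recovers roughly the usual standard error of the mean; when nearby SNPs are correlated, deleting whole blocks inflates the standard error appropriately, which is the point of the technique.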

This makes sense, and I think it also agrees with what Andrew is saying. The 5Mb linkage blocks are probably bigger than needed, so you’re being conservative. But it makes sense to do so, because you’re saving yourself a lot of time and when you find a signal of gene flow, it’s robust to the details of the underlying population dynamics. And being conservative isn’t entirely bad – I think in this case there’s often a big enough separation between “effect size needed to be detectable in genomic data” and “effect size big enough that it’s worth complicating our population history cartoon” that there’s no harm in bumping up the former. I guess my main source of hesitation would just be the generic problem with null hypothesis testing – that once you do show the null is wrong you still need to figure out what actually happened.

I think one can abstract away the population genetics details: it’s easy to calculate a test statistic for the data, and to estimate the error by resampling. We understand the underlying data-generating process well enough to have some idea about how much to trust the resampling; in fact, we understand it well enough that we can even generate fake data, but this is a lot of work, even for a single parameter set. In this case, I think it sometimes makes sense to start by just looking at whether the test statistic deviates from the null.

(PS: Nick, we met at the Simons Institute in Berkeley several years ago.)

Nick Patterson is seeking to have a dialogue? Wow!!!!

I do not know of another living statistician who has done such impressive work across academia (Broad), government (GCHQ/NSA), and business (Renaissance). It would be useful to have a simple example (with smaller N) and working code, in order to make the discussion more concrete . . .

That’s flattering. I didn’t expect anyone on this blog to have heard of me.

Please go to the Reich lab web page: there are tabs for publications, software and datasets.

In my note I was referring to the “f4 test” described in “Ancient Admixture in Human History” (2012). The test is implemented in a program qpDstat, part of a larger package ADMIXTOOLS. Much suitable data is also available on the site.
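For readers who want the discussion made concrete: the core of the f4 statistic, as defined in the 2012 paper, is an average over SNPs of a product of allele-frequency differences. A minimal sketch (illustrative names; the real qpDstat handles missing data, sample-size corrections, and the block jackknife on top of this):

```python
import numpy as np

def f4_statistic(freq_a, freq_b, freq_c, freq_d):
    """Per-SNP f4 terms (pA - pB) * (pC - pD), averaged over loci.

    Under a null in which the pairs (A, B) and (C, D) descend from splits
    with no subsequent gene flow between the pairs, drift in (pA - pB) is
    independent of drift in (pC - pD), so each term has expectation 0.
    """
    terms = (np.asarray(freq_a) - np.asarray(freq_b)) * \
            (np.asarray(freq_c) - np.asarray(freq_d))
    return terms.mean(), terms
```

The per-SNP terms are what would feed into the block jackknife above to get a standard error, and hence a Z-score for the test of no gene flow.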

The world can be awfully small sometimes. I don’t think we’ve ever met, but I used to work with some of your former colleagues who knew you…

” I didn’t expect anyone on this blog to have heard of me.”

You and I have met — more than once. I recall twice in Cheltenham and once or twice in Princeton. If you don’t remember, I might try giving you a hint. :~)