This makes sense, and I think it also agrees with what Andrew is saying. The 5Mb linkage blocks are probably bigger than needed, so you’re being conservative. But it makes sense to do so, because you’re saving yourself a lot of time and when you find a signal of gene flow, it’s robust to the details of the underlying population dynamics. And being conservative isn’t entirely bad – I think in this case there’s often a big enough separation between “effect size needed to be detectable in genomic data” and “effect size big enough that it’s worth complicating our population history cartoon” that there’s no harm in bumping up the former. I guess my main source of hesitation would just be the generic problem with null hypothesis testing – that once you do show the null is wrong you still need to figure out what actually happened.

I think one can abstract away the population genetics details: it’s easy to calculate a test statistic for the data, and to estimate the error by resampling. We understand the underlying data-generating process well enough to have some idea about how much to trust the resampling; in fact, we understand it well enough that we can even generate fake data, but this is a lot of work, even for a single parameter set. In this case, I think it sometimes makes sense to start by just looking at whether the test statistic deviates from the null.

(PS: Nick, we met at the Simons Institute in Berkeley several years ago.)

“I didn’t expect anyone on this blog to have heard of me.”

You and I have met — more than once. I recall twice in Cheltenham and once or twice in Princeton. If you don’t remember, I might try giving you a hint. :~)

The world can be awfully small sometimes. I don’t think we’ve ever met, but I used to work with some of your former colleagues who knew you…

That’s flattering. I didn’t expect anyone on this blog to have heard of me.

Please go to the Reich lab web page:

There are tabs for publications, software and datasets.

In my note I was referring to the “f4 test” described in “Ancient Admixture in Human History” (2012). The test is implemented in a program qpDstat, part of a larger package ADMIXTOOLS. Much suitable data is also available on the site.
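For readers who have not seen it, the statistic behind the f4 test is usually written along the following lines. The notation is an assumption made here for illustration: $a_i, b_i, c_i, d_i$ denote the allele frequencies at SNP $i$ in populations $A, B, C, D$, and $n$ is the number of SNPs.

$$
f_4(A, B; C, D) \;=\; \frac{1}{n} \sum_{i=1}^{n} (a_i - b_i)(c_i - d_i)
$$

If $(A, B)$ and $(C, D)$ form two clades in an unrooted tree with no gene flow between them, the expected value is 0, so a value many standard errors away from 0 is evidence against that tree.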

**I do not know of another living statistician who has done such impressive work across academia (Broad), government (GCHQ/NSA) and business (Renaissance).**

It would be useful to have a simple example (with smaller N) and working code, in order to make the discussion more concrete . . .

“That is a lot of parameters, but in principle, it seems doable.”

That was precisely my point. My frequentist technique is basically to analyze each “SNP” (variable locus) as though independent. There can easily be 1M loci. That gives a statistic that under the null has mean 0. To get the standard error, we delete large blocks (about 5M genome bases) in turn and apply the jackknife. This can easily be coded up in a day or two. A Bayesian analysis, if practical at all, would in my best guess take months of work, and might be sensitive to obscure modeling assumptions about LD.

And I have run my test on perhaps 1M population pairs…
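To make the procedure described above concrete, here is a minimal Python sketch of a per-SNP statistic with a delete-one-block jackknife standard error. It is an illustration under stated assumptions, not the qpDstat implementation: the function names `f4_per_snp` and `block_jackknife` are hypothetical, all positions are assumed to lie on a single chromosome, and the plain (unweighted) jackknife formula used here is only appropriate when blocks hold roughly similar numbers of SNPs.

```python
import numpy as np


def f4_per_snp(pa, pb, pc, pd):
    """Per-SNP contribution (a - b) * (c - d) from allele frequencies."""
    return (np.asarray(pa) - np.asarray(pb)) * (np.asarray(pc) - np.asarray(pd))


def block_jackknife(values, positions, block_size=5_000_000):
    """Mean of per-SNP values and its delete-one-block jackknife standard error.

    Assumptions (illustrative only): positions are base-pair coordinates on a
    single chromosome, and blocks are fixed windows of `block_size` bases.
    """
    values = np.asarray(values, dtype=float)
    block_ids = np.asarray(positions) // block_size

    # Relabel the non-empty blocks as 0..g-1.
    _, inverse = np.unique(block_ids, return_inverse=True)
    n_blocks = int(inverse.max()) + 1
    if n_blocks < 2:
        raise ValueError("need at least two non-empty blocks")

    block_sums = np.bincount(inverse, weights=values, minlength=n_blocks)
    block_counts = np.bincount(inverse, minlength=n_blocks)

    total_sum = values.sum()
    total_n = values.size
    estimate = total_sum / total_n

    # Recompute the mean with each block deleted in turn.
    leave_one_out = (total_sum - block_sums) / (total_n - block_counts)

    # Standard delete-one jackknife variance over g blocks.
    deviations = leave_one_out - leave_one_out.mean()
    se = np.sqrt((n_blocks - 1) / n_blocks * np.sum(deviations ** 2))
    return estimate, se


if __name__ == "__main__":
    # Simulated frequencies with no real structure, so the statistic
    # should sit near 0 relative to its standard error.
    rng = np.random.default_rng(0)
    n = 10_000
    positions = np.sort(rng.integers(0, 100_000_000, size=n))
    pa, pb, pc, pd = rng.uniform(size=(4, n))
    est, se = block_jackknife(f4_per_snp(pa, pb, pc, pd), positions)
    print(f"estimate={est:.5f}  se={se:.5f}  z={est / se:.2f}")
```

Deleting whole blocks rather than single SNPs is what lets the standard error absorb correlation between nearby loci (LD) without modeling it explicitly. With very unequal block sizes, a weighted jackknife would be the more careful choice.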