Review: Geographic population structure analysis of worldwide human populations infers their biogeographical origins

This post is by Joe Pickrell, and is part of an experiment where I will be posting summaries and critiques of the main points of papers I review for journals. Apologies in advance for any misunderstandings and errors on my end; please correct these in the comments.

TL;DR: I have a conceptual disagreement with a paper on learning about the geographic origin of individuals from genetic information.

I recently reviewed a manuscript titled “Geographic population structure analysis of worldwide human populations infers their biogeographical origins”, which has now been published. Overall I found the paper difficult to review because the authors and I have fundamentally different views about what genetic information can tell us about geography. I hope to explain this a bit in this post.

(Side note: some of the authors have started a company called Prosapia Genetics to sell a product based on this paper, but in the paper write “The authors declare no competing financial interests”. This seems to run counter to the spirit of these types of disclosures.)

(Side note 2: Pseudonymous blogger Dienekes Pontikos notes that the method in this paper is extremely similar to one he developed a few years ago. Regardless of the intentions of the authors, I personally apologize to Dienekes for not noticing his previous work).


The goal of the paper and the associated method

Imagine you had my genome sequence. The goal of this paper is to develop an algorithm to place me on a map; that is, to find the latitude and longitude of my “biogeographical origins”, a concept that I think can be vaguely defined as the geographic location of my ancestors sometime in the recent past (for a European-American, maybe sometime in the last few hundred years prior to the major European migrations to the US).

One way to do this is to imagine the world as a grid (either in 2D or 3D space), and build some model for how the frequencies of genetic variants vary across space. If you had my genotypes at a number of variants, you could then find the best spot for me on this grid. This is the basic idea underlying previous work on this topic, for example in Spatial Ancestry analysis.
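To make the grid idea concrete, here is a minimal sketch in Python. It is a toy, not the Spatial Ancestry analysis method itself: it assumes each variant's allele frequency follows a logistic surface over (lon, lat) with made-up coefficients, simulates one individual at a known spot, and then scans a coarse grid for the location with the highest binomial log-likelihood. The function names, the coefficient distributions, and the abstract integer grid are all assumptions for illustration.

```python
import math
import random


def allele_freq(lon, lat, a, b, c):
    """Toy model: allele frequency as a logistic function of position."""
    return 1.0 / (1.0 + math.exp(-(a * lon + b * lat + c)))


def log_likelihood(genotypes, params, lon, lat):
    """Binomial log-likelihood of diploid genotypes (0, 1 or 2 copies) at one grid point.

    The binomial coefficient is dropped because it does not depend on location.
    """
    ll = 0.0
    for g, (a, b, c) in zip(genotypes, params):
        p = min(max(allele_freq(lon, lat, a, b, c), 1e-9), 1 - 1e-9)
        ll += g * math.log(p) + (2 - g) * math.log(1 - p)
    return ll


def best_grid_point(genotypes, params, lons, lats):
    """Return the (lon, lat) grid point that maximises the log-likelihood."""
    return max(
        ((lon, lat) for lon in lons for lat in lats),
        key=lambda pt: log_likelihood(genotypes, params, *pt),
    )


# Hypothetical per-variant surface coefficients (a, b, c); not estimated from real data.
rng = random.Random(0)
params = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3), rng.gauss(0, 1)) for _ in range(2000)]

# Simulate one individual whose true location is known.
true_location = (2, -1)
genotypes = []
for a, b, c in params:
    p = allele_freq(*true_location, a, b, c)
    genotypes.append(sum(rng.random() < p for _ in range(2)))

grid = range(-5, 6)
estimate = best_grid_point(genotypes, params, grid, grid)
print("true:", true_location, "estimated:", estimate)
```

The point of the sketch is that geography enters the model directly: the frequency surfaces are functions of location, so any point on the grid can be scored, including points where no reference population was sampled.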

The authors of this paper take a different approach. Instead of explicitly modeling geographic variation in the frequencies of alleles, they first perform a clustering analysis on a reference set of individuals with known geographic locations. They then (more or less) find the clusters I fall closest to, and copy over the geographic information from those clusters. That is, if genetically I seem most similar to a reference group of French and German individuals, then they say that my “biogeographic origin” is between France and Germany.
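As a rough illustration of this "match to clusters, then copy the coordinates" logic, here is a minimal Python sketch. It is not the authors' algorithm: the reference profiles, the coordinates, the choice of Euclidean distance, and the inverse-distance weighting over the k nearest clusters are all assumptions made for the example, and averaging latitude and longitude directly is only reasonable over short distances.

```python
import math

# Hypothetical reference clusters: a genetic summary (for example, ancestry
# proportions) paired with an approximate (lat, lon). Values are invented.
REFERENCE = {
    "French": ((0.70, 0.20, 0.10), (46.0, 2.0)),
    "German": ((0.60, 0.30, 0.10), (51.0, 10.0)),
    "Italian": ((0.40, 0.20, 0.40), (42.0, 12.0)),
}


def place_individual(profile, reference, k=2):
    """Average the coordinates of the k genetically closest reference clusters.

    Weights are inverse genetic distances, so closer clusters count for more.
    """
    ranked = sorted(
        (math.dist(profile, genetic), coords)
        for genetic, coords in reference.values()
    )
    nearest = ranked[:k]
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    total = sum(weights)
    lat = sum(w * coords[0] for w, (_, coords) in zip(weights, nearest)) / total
    lon = sum(w * coords[1] for w, (_, coords) in zip(weights, nearest)) / total
    return lat, lon


# A test individual who sits genetically halfway between the French and German clusters.
lat, lon = place_individual((0.65, 0.25, 0.10), REFERENCE)
print(round(lat, 2), round(lon, 2))
```

Note that the coordinates are only looked up at the very end. Everything before that step is a comparison of genetic profiles.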

Conceptual issue: This is a paper about genetic clustering, not about geography

Basically, geographic information plays no role in this algorithm except in a post hoc manner. Instead, this is a standard genetic clustering algorithm. This means it has the same limitations as any such algorithm. For example, in the Figure above, imagine a set of reference individuals colored according to their inferred “cluster”. Now imagine matching test individual 1 to those clusters. In this case, it’s simple: individual 1 matches cluster 1, and so copying over the geographic information from cluster 1 to individual 1 seems reasonable. But what about individual 2? This individual doesn’t match any of the reference clusters, so the algorithm can’t do anything with it. If the algorithm were truly learning about geography, this wouldn’t be the case.
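The individual 1 versus individual 2 contrast can be sketched in a few lines of Python. This is a hypothetical matching rule, not the one in the paper: the cluster centroids, the coordinates, and the max_distance cutoff are invented, and the two-dimensional "genetic space" simply stands in for whatever summary the clustering produces.

```python
import math

# Hypothetical clusters: centroid in a 2D genetic space plus a (lat, lon) label.
CLUSTERS = {
    "cluster 1": ((0.0, 0.0), (48.0, 5.0)),
    "cluster 2": ((5.0, 0.0), (35.0, 105.0)),
}


def match_cluster(point, clusters, max_distance=1.0):
    """Return (name, coords) of the nearest cluster, or None if nothing is close."""
    name, (centroid, coords) = min(
        clusters.items(), key=lambda item: math.dist(point, item[1][0])
    )
    if math.dist(point, centroid) > max_distance:
        return None  # no reference cluster resembles this individual
    return name, coords


individual_1 = (0.2, -0.1)  # sits inside cluster 1
individual_2 = (2.5, 4.0)   # far from every reference cluster

result_1 = match_cluster(individual_1, CLUSTERS)
result_2 = match_cluster(individual_2, CLUSTERS)
print(result_1, result_2)
```

Dropping the cutoff does not help: the rule would then assign individual 2 to whichever cluster happens to be nearest, and the copied coordinates would say more about the reference panel than about the individual.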

As the authors note, a whole host of other limitations come along with this. For example, the authors assume the reference populations can’t have changed geographic locations in the time frame of interest (implying the method is limited to populations with historical records attesting to their residency in a geographic location). That is, imagine that 200 years ago, everyone from one reference village moved 200 km to the west for some reason–this algorithm would place the descendants of that population in the present-day location, rather than the historical location. This is all fine if the goal is genetic clustering, but the authors interpret their algorithm strongly in terms of geography. This leads to something of a tautology: this algorithm can use genetic information to infer geography only if you assume genetic clusters are geographically meaningful*. The utility of the method thus depends on whether this is the case in any particular application.

*The sentence has been edited for clarity.



  1. Good post, Joe. I think, even with the “spatial ancestry analysis”, there are a lot of issues regarding how representative the geographic locations of contemporary populations are of the geographic locations of ancestral populations.

    Also, the “depth” of clustering should be better constructed. Clustering between two African populations is different from clustering between Asian and indigenous American populations with regard to the “time frames” and “causes” of these clusterings.

    1. Thanks, these are great points.

      As you’re alluding to, there is a lot of work on how to interpret genetic clusters in terms of phylogeography and history, and these interpretations are absolutely not straightforward.

  2. Awesome. The effort to conflate genetic signatures derived from modern populations to infer historical geography has bugged me for some time, but I haven’t been able to say exactly why. I need to memorise this post and repeat it like a mantra. “This algorithm can use genetic information to infer geography only if you assume genetic clusters are geographically meaningful.” How robust are inferences of geography based upon genetic similarity? For many of my friends (and not just them) whose parents were born thousands of miles apart, I suspect the answer is not very much. Huge thanks for putting into words what I could only express with a grunt and a furrowed brow.

    1. Thanks. To be fair to the authors of this paper, there are certainly situations in which geography and genetics are correlated. For example, we know that many European-Americans descend from migrants to the US in the last couple of hundred years. To estimate the ancestry of European-Americans, it thus seems reasonable to build a reference set of European populations and assume that these references haven’t moved around extensively in that rather short time period. This is along the lines of what ancestry-testing companies like 23andMe do.

      That said, of course there has been migration in Europe over these time periods (for example, migrations from Italy to France), so even “straightforward” cases very quickly become complex to interpret.

  3. Genocide would exclude whole populations, e.g., European Jewry, and famine-driven immigration might bias results for, say, the Irish.

  4. Their website is thin on information, but with each generation, there is a doubling of the number of ancestors and all could be from a different location. It appears that their analysis provides a single ancestral location. A single location would have almost no meaning when in just four generations, you could have ancestors from 16 different locations.

    1. I have data for one same sex ancestor, my maternal grandmother. That limits the number of ancestors to one per generation, and, therefore, one location. Someone would, I assume, have to do this for each genetic line for which they have DNA.

  5. Very well stated. If you infer the geographic origin of a Canadian or American based on a cluster of like types in the North of Ireland, without taking into account the Ulster plantation migration of many Scots – it’s in error from the start. Not to mention that the Scots migrated to their home villages in Scotland from someplace else… Thanks for your insight.
