This post is by Joe Pickrell, and is part of an experiment where I will be posting summaries and critiques of the main points of papers I review for journals. Apologies in advance for any misunderstandings and errors on my end; please correct these in the comments.
TL;DR: I have a conceptual disagreement with a paper on learning about the geographic origin of individuals from genetic information.
I recently reviewed a manuscript titled “Geographic population structure analysis of worldwide human populations infers their biogeographical origins”, which has now been published. Overall I found the paper difficult to review because the authors and I have fundamentally different views about what genetic information can tell us about geography. I hope to explain this a bit in this post.
(Side note: some of the authors have started a company called Prosapia Genetics to sell a product based on this paper, but in the paper write “The authors declare no competing financial interests”. This seems to run counter to the spirit of these types of disclosures.)
(Side note 2: Pseudonymous blogger Dienekes Pontikos notes that the method in this paper is extremely similar to one he developed a few years ago. Regardless of the intentions of the authors, I personally apologize to Dienekes for not noticing his previous work).
The goal of the paper and the associated method
Imagine you had my genome sequence. The goal of this paper is to develop an algorithm to place me on a map–that is, to find the latitude and longitude of my “biogeographical origins”, a concept that I think can be loosely defined as the geographic location of my ancestors sometime in the recent past (for a European-American, maybe sometime in the last few hundred years prior to the major European migrations to the US).
One way to do this is to imagine the world as a grid (either in 2D or 3D space), and build some model for how the frequencies of genetic variants vary across space. If you had my genotypes at a number of variants, you could then find the best spot for me on this grid. This is the basic idea underlying previous work on this topic, for example in Spatial Ancestry analysis.
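To make the grid idea concrete, here is a minimal sketch in Python. It is a toy model of my own, not the implementation of any published method. I assume each variant's allele frequency follows a logistic cline across the map (the coefficients in `clines` are made up), that genotypes are counts of 0, 1, or 2 copies of an allele, and that coordinates are in arbitrary grid units. The function names are hypothetical.

```python
import math

def allele_freq(lat, lon, a, b, c):
    """Assumed logistic cline: allele frequency at a given location."""
    return 1.0 / (1.0 + math.exp(-(a * lat + b * lon + c)))

def log_likelihood(genotypes, clines, lat, lon):
    """Binomial log-likelihood of diploid genotypes at one location.

    The binomial coefficient is dropped because it does not depend on
    location and so does not change which grid cell is best.
    """
    total = 0.0
    for g, (a, b, c) in zip(genotypes, clines):
        p = allele_freq(lat, lon, a, b, c)
        p = min(max(p, 1e-9), 1.0 - 1e-9)  # guard against log(0)
        total += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return total

def best_grid_cell(genotypes, clines, lats, lons):
    """Exhaustively search the grid for the highest-likelihood cell."""
    cells = [(lat, lon) for lat in lats for lon in lons]
    return max(cells, key=lambda cell: log_likelihood(genotypes, clines, cell[0], cell[1]))

# Made-up example: variant 1 gets more common to the north,
# variant 2 gets more common to the east.
example_clines = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
example_grid = [-2, -1, 0, 1, 2]
print(best_grid_cell([2, 0], example_clines, example_grid, example_grid))
```

The point of the sketch is that geography enters the model directly: the likelihood is a function of latitude and longitude, so any point on the grid can be scored, including points where no reference individuals were sampled.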
The authors of this paper take a different approach. Instead of explicitly modeling geographic variation in the frequencies of alleles, they first perform a clustering analysis on a reference set of individuals with known geographic locations. They then (more or less) find the clusters I fall closest to, and copy over the geographic information from those clusters. That is, if genetically I seem most similar to a reference group of French and German individuals, then they say that my “biogeographic origin” is between France and Germany.
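The "find the closest clusters and copy over their coordinates" step can be sketched roughly as follows. This is my own simplified illustration, not the authors' algorithm: the reference clusters, their genetic centroids, the coordinates, the inverse-distance weighting, and the choice of `k=2` are all assumptions, and averaging latitude and longitude directly ignores the curvature of the Earth.

```python
import math

# Hypothetical reference clusters: a genetic centroid (for example, mean
# ancestry proportions) plus the known sampling location of the cluster.
REFERENCE = {
    "French": {"genetic": (0.70, 0.20, 0.10), "lat": 46.0, "lon": 2.0},
    "German": {"genetic": (0.60, 0.30, 0.10), "lat": 51.0, "lon": 10.0},
    "Italian": {"genetic": (0.30, 0.20, 0.50), "lat": 42.0, "lon": 12.0},
}

def place_by_clusters(test_genetic, reference, k=2):
    """Place an individual at the weighted mean location of the k
    genetically closest reference clusters (closer clusters weigh more)."""
    ranked = sorted(
        reference.values(),
        key=lambda cluster: math.dist(test_genetic, cluster["genetic"]),
    )
    nearest = ranked[:k]
    # Inverse-distance weights; the small constant avoids division by zero.
    weights = [1.0 / (math.dist(test_genetic, c["genetic"]) + 1e-9) for c in nearest]
    total = sum(weights)
    lat = sum(w * c["lat"] for w, c in zip(weights, nearest)) / total
    lon = sum(w * c["lon"] for w, c in zip(weights, nearest)) / total
    return lat, lon

# An individual genetically halfway between the French and German centroids
# lands roughly halfway between the two sampling locations.
print(place_by_clusters((0.65, 0.25, 0.10), REFERENCE))
```

Note that latitude and longitude are only looked up after the genetic matching is done; nothing in the matching step knows anything about the map.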
Conceptual issue: This is a paper about genetic clustering, not about geography
Basically, geographic information plays no role in this algorithm except in a post hoc manner. Instead, this is a standard genetic clustering algorithm. This means it has the same limitations as any such algorithm. For example, in the Figure above, imagine a set of reference individuals colored according to their inferred “cluster”. Now imagine matching test individual 1 to those clusters. In this case, it’s simple: individual 1 matches cluster 1, and so copying over the geographic information from cluster 1 to individual 1 seems reasonable. But what about individual 2? This individual doesn’t match any of the reference clusters, so the algorithm can’t do anything with it. If the algorithm were truly learning about geography, this wouldn’t be the case.
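The contrast between the two test individuals can be put into a toy sketch. Everything here is hypothetical: the two-dimensional "genetic" coordinates, the cluster locations, and the `max_distance` cutoff are made up for illustration and are not taken from the paper.

```python
import math

# Hypothetical reference clusters in a 2D genetic space, each with the
# (latitude, longitude) where its members were sampled.
CLUSTERS = {
    "cluster 1": {"genetic": (0.0, 0.0), "location": (48.0, 2.0)},
    "cluster 2": {"genetic": (5.0, 0.0), "location": (52.0, 13.0)},
    "cluster 3": {"genetic": (0.0, 5.0), "location": (41.0, 12.0)},
}

def match_cluster(individual, clusters, max_distance=1.0):
    """Return (name, location) of the nearest cluster, or None when even
    the nearest cluster is farther away than max_distance."""
    name, cluster = min(
        clusters.items(),
        key=lambda item: math.dist(individual, item[1]["genetic"]),
    )
    if math.dist(individual, cluster["genetic"]) > max_distance:
        return None  # no reference cluster resembles this individual
    return name, cluster["location"]

individual_1 = (0.2, 0.1)    # sits inside cluster 1
individual_2 = (20.0, 20.0)  # far from every reference cluster

print(match_cluster(individual_1, CLUSTERS))
print(match_cluster(individual_2, CLUSTERS))
```

For individual 2 there is simply no location to copy over. Dropping the cutoff would not help: the function would then hand back the location of whichever cluster happens to be least far away, which says nothing about where individual 2 is actually from.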
As the authors note, a whole host of other limitations come along with this. For example, the authors assume the reference populations can’t have changed geographic locations in the time frame of interest (implying the method is limited to populations with historical records attesting to their residency in a geographic location). That is, imagine that 200 years ago, everyone from one reference village moved 200 km to the west for some reason–this algorithm would place the descendants of that population in the present-day location, rather than the historical location. This is all fine if the goal is genetic clustering, but the authors interpret their algorithm strongly in terms of geography. This leads to something of a tautology: this algorithm can use genetic information to infer geography only if you assume genetic clusters are geographically meaningful*. The utility of the method thus depends on whether this is the case in any particular application.
*This sentence has been edited for clarity.