Review: Geographic population structure analysis of worldwide human populations infers their biogeographical origins

This post is by Joe Pickrell, and is part of an experiment where I will be posting summaries and critiques of the main points of papers I review for journals. Apologies in advance for any misunderstandings and errors on my end; please correct these in the comments.

TL;DR: I have a conceptual disagreement with a paper on learning about the geographic origin of individuals from genetic information.

I recently reviewed a manuscript titled “Geographic population structure analysis of worldwide human populations infers their biogeographical origins”, which has now been published. Overall I found the paper difficult to review because the authors and I have fundamentally different views about what genetic information can tell us about geography. I hope to explain this a bit in this post.

(Side note: some of the authors have started a company called Prosapia Genetics to sell a product based on this paper, but in the paper they write “The authors declare no competing financial interests”. This seems to run counter to the spirit of these types of disclosures.)

(Side note 2: Pseudonymous blogger Dienekes Pontikos notes that the method in this paper is extremely similar to one he developed a few years ago. Regardless of the intentions of the authors, I personally apologize to Dienekes for not noticing his previous work).


The goal of the paper and the associated method

Imagine you had my genome sequence. The goal of this paper is to develop an algorithm to place me on a map–that is, to find the latitude and longitude of my “biogeographical origins”, a concept that I think can be vaguely defined as the geographic location of my ancestors sometime in the recent past (for a European-American, maybe sometime in the last few hundred years prior to the major European migrations to the US).

One way to do this is to imagine the world as a grid (either in 2D or 3D space), and build some model for how the frequencies of genetic variants vary across space. If you had my genotypes at a number of variants, you could then find the best spot for me on this grid. This is the basic idea underlying previous work on this topic, for example in Spatial Ancestry analysis.

The authors of this paper take a different approach. Instead of explicitly modeling geographic variation in the frequencies of alleles, they first perform a clustering analysis on a reference set of individuals with known geographic locations. They then (more or less) find the clusters I fall closest to, and copy over the geographic information from those clusters. That is, if genetically I seem most similar to a reference group of French and German individuals, then they say that my “biogeographic origin” is between France and Germany.
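To make the "match, then copy" logic concrete, here is a minimal sketch of the general idea. It is not the authors' implementation: the reference clusters, their coordinates in "genetic space", the choice of the two nearest clusters, and the inverse-distance weighting are all assumptions made up for illustration.

```python
import math

# Hypothetical reference clusters: a centroid in some genetic summary space
# (e.g. two ancestry components) plus a known (latitude, longitude).
# All values are invented for illustration.
REFERENCE = {
    "French": {"genetic": (0.80, 0.20), "latlon": (46.0, 2.0)},
    "German": {"genetic": (0.60, 0.40), "latlon": (51.0, 10.0)},
    "Sardinian": {"genetic": (0.10, 0.90), "latlon": (40.0, 9.0)},
}


def place_on_map(genetic, reference=REFERENCE, k=2):
    """Copy geography from the k genetically closest reference clusters."""
    # Step 1 (purely genetic): rank clusters by distance to the test individual.
    ranked = sorted(
        reference.values(),
        key=lambda ref: math.dist(genetic, ref["genetic"]),
    )[:k]
    # Step 2 (post hoc): average the coordinates of those clusters,
    # weighting each one by its inverse genetic distance.
    weights = [1.0 / (math.dist(genetic, ref["genetic"]) + 1e-9) for ref in ranked]
    total = sum(weights)
    lat = sum(w * ref["latlon"][0] for w, ref in zip(weights, ranked)) / total
    lon = sum(w * ref["latlon"][1] for w, ref in zip(weights, ranked)) / total
    return lat, lon


# An individual genetically halfway between the French and German clusters
# lands roughly halfway between their coordinates.
print(place_on_map((0.70, 0.30)))
```

Note that latitude and longitude only enter in the final averaging step; everything before that is distance in genetic space.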

Conceptual issue: This is a paper about genetic clustering, not about geography

Basically, geographic information plays no role in this algorithm except in a post hoc manner. Instead, this is a standard genetic clustering algorithm. This means it has the same limitations as any such algorithm. For example, in the Figure above, imagine a set of reference individuals colored according to their inferred “cluster”. Now imagine matching test individual 1 to those clusters. In this case, it’s simple: individual 1 matches cluster 1, and so copying over the geographic information from cluster 1 to individual 1 seems reasonable. But what about individual 2? This individual doesn’t match any of the reference clusters, so the algorithm can’t do anything with it. If the algorithm were truly learning about geography, this wouldn’t be the case.

As the authors note, a whole host of other limitations come along with this. For example, the authors assume the reference populations can’t have changed geographic locations in the time frame of interest (implying the method is limited to populations with historical records attesting to their residency in a geographic location). That is, imagine that 200 years ago, everyone from one reference village moved 200 km to the west for some reason–this algorithm would place the descendants of that population in the present-day location, rather than the historical location. This is all fine if the goal is genetic clustering, but the authors interpret their algorithm strongly in terms of geography. This leads to something of a tautology: this algorithm can use genetic information to infer geography only if you assume genetic clusters are geographically meaningful*. The utility of the method thus depends on whether this is the case in any particular application.

*The sentence has been edited for clarity.


Review: High Resolution Genomic Analysis of Human Mitochondrial RNA Sequence Variation

This post is by Joe Pickrell, and is part of an experiment where I will be posting summaries and critiques of the main points of papers I review for journals. Apologies in advance for any misunderstandings and errors on my end; please correct these in the comments.

TL;DR: A clever analysis of RNA sequencing data identifies natural genetic variation influencing mitochondrial tRNA processing in humans.

I recently reviewed a manuscript titled “High Resolution Genomic Analysis of Human Mitochondrial RNA Sequence Variation”, which has now been published. Overall I thought the paper was creative and surprising; I’d be interested in hearing other folks’ thoughts.

The experiment

The initial goal of this study seems to have been to use RNA-seq to quantify variation in mitochondrial RNA and DNA sequences. The authors sequenced cDNA libraries prepared from mRNA from whole blood in ~700 individuals, and focused specifically on sequencing reads that mapped to the mitochondrial genome. Since each individual in principle inherited a single mitochondrial genome from their mother, there should be essentially no sequence-level variation within individuals (modulo sequencing and mapping artifacts, more on this later).

The authors then did a simple analysis: they looked for positions in the mitochondrial transcriptome where they observed more than a single base in an individual. They identified ~600 such sites (some observed in multiple individuals), which they call “heteroplasmies”. Putting aside potential technical explanations for these sites, heteroplasmies could be due to either 1) variation at the DNA level (e.g. mutations that have occurred in mitochondria of the individual’s blood during their lifetime) or 2) variation at the RNA level (post-transcriptional modifications of RNA through mechanisms like RNA editing).
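As a rough illustration of what "more than a single base in an individual" might look like in practice, here is a minimal sketch. The input format, the depth cutoff, and the minor-allele-fraction cutoff are all hypothetical; the paper's actual filters are not reproduced here.

```python
def call_heteroplasmies(base_counts, min_depth=100, min_minor_fraction=0.05):
    """Return positions where one individual shows more than one base.

    base_counts maps position -> {"A": n, "C": n, "G": n, "T": n} read counts.
    Both thresholds are hypothetical, chosen only for illustration.
    """
    sites = []
    for pos, counts in sorted(base_counts.items()):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything about this position
        ordered = sorted(counts.values(), reverse=True)
        # Fraction of reads carrying the second most common base.
        minor_fraction = ordered[1] / depth if len(ordered) > 1 else 0.0
        if minor_fraction >= min_minor_fraction:
            sites.append(pos)
    return sites


example = {
    100: {"A": 500, "C": 0, "G": 0, "T": 0},   # a single base: not called
    200: {"A": 400, "C": 0, "G": 100, "T": 0}, # two bases: called
    300: {"A": 10, "C": 0, "G": 10, "T": 0},   # too shallow: skipped
}
print(call_heteroplasmies(example))
```

A call like this says nothing about whether the second base comes from DNA-level variation, an RNA-level modification, or a technical artifact; that distinction has to come from elsewhere.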

Main result: A genetic variant in MRPP3 influences processing of mitochondrial tRNAs

At 13 of the heteroplasmic sites, the authors noticed that their data contained more than two alleles (rather than the two you might expect from a new mutation or a simple RNA editing event). They also made an odd observation: 11 of these 13 sites fell in the ninth position of tRNA genes. By reference to what is known about tRNA biology, they argue that the particular patterns of mismatches they observe at these sites are caused by the presence of RNA methylation (which causes the observed mismatches via reverse transcriptase errors).

Under this model, the proportion of non-reference alleles at a site is a quantitative measure of the fraction of mitochondria in an individual that is methylated at the site. The authors reasoned that as a quantitative phenotype, genetic variants influencing methylation levels might be mapped by standard human genetics methods. Shown at the top of the post is a “Manhattan plot” showing the authors’ results from a genome-wide association study of (putative) tRNA methylation in the mitochondria. The result is essentially every human geneticist’s dream: there’s a single strong peak centered on a nonsynonymous SNP in a biologically plausible gene (in this case, MRPP3, a gene involved in processing of mitochondrial tRNAs).
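The "quantitative phenotype" logic can be sketched in a few lines. This is only a toy version under stated assumptions: the read counts and genotypes are invented, and a real association scan would repeat a test like this at every SNP with covariates and proper significance testing, none of which is shown.

```python
def nonref_fraction(counts, ref_base):
    """Fraction of reads at a site that do not carry the reference base."""
    depth = sum(counts.values())
    return (depth - counts.get(ref_base, 0)) / depth


def regression_slope(genotypes, phenotypes):
    """Ordinary least-squares slope of phenotype on genotype dosage (0/1/2)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotypes))
    var = sum((g - mean_g) ** 2 for g in genotypes)
    return cov / var


# Hypothetical data: one tRNA site in three individuals, with reference base A,
# and each individual's dosage (0, 1, or 2 copies) at a candidate nuclear SNP.
reads = [
    {"A": 90, "G": 6, "T": 4},
    {"A": 80, "G": 12, "T": 8},
    {"A": 70, "G": 20, "T": 10},
]
dosage = [0, 1, 2]
phenotype = [nonref_fraction(r, "A") for r in reads]
print(regression_slope(dosage, phenotype))
```

In this made-up example each extra copy of the allele raises the non-reference fraction by about ten percentage points.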

Putting all of this together, it seems that there is variation in mitochondrial tRNA methylation (or some other modification that could cause similar reverse-transcriptase errors) among individuals in a population, and that this variation is partially due to a trans-acting genetic variant of relatively large effect. I found this quite impressive.

A note of caution regarding estimates of the total number of heteroplasmies

At various points in the paper, the authors include other results that are often interesting but not as important to the main conclusion. One of these that is worth thinking about is the overall number of heteroplasmic sites.

The authors estimate that in their samples, there are around 600 mitochondrial sites that have multiple alleles (note that this is a sum of DNA-level heteroplasmies and RNA-level heteroplasmies). I have a nagging suspicion that this is an overestimate.

The reason for this suspicion is that I’m worried about mapping errors from “nuclear mitochondrial DNA” (AKA Numt) sequences causing false inference of heteroplasmies. Examination of some of the reported sites suggests that the alleles of the “heteroplasmies” are indeed consistent with mismapped reads from autosomal sequences rather than true mitochondrial variation.

For example, below is a screenshot of the UCSC genome browser surrounding two “heteroplasmic” sites from Supplementary Table 1. I’m showing the sequence of the reference mtDNA (at the top), as well as the sequences of all relevant Numts (using the NumtS Sequence track). As you can see, at the two sites called by the authors, the alternative “allele” at the site matches the sequence of the Numt. My guess is that there is no mitochondrial sequence variation at these two sites, just mis-mapped sequencing reads that originated from the Numts.


It’s unclear how many of the sites identified by the authors are potentially affected by mapping errors (though note none of the 13 used in the mapping experiment described above have any indication of such problems to my eye). For people interested in quantifying the overall extent of the phenomenon observed by the authors, this seems like a potentially important source of error to take into account.
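One crude way to start quantifying this would be to flag every called site whose alternative allele matches the base found at the aligned position of some Numt. The sketch below assumes such a lookup already exists; the function name, input formats, and example positions are hypothetical and are not taken from the paper or its supplementary tables.

```python
def flag_possible_numt_sites(sites, numt_bases):
    """Split called sites by whether the alternative allele matches a Numt.

    sites: list of (position, ref_base, alt_base) on the mtDNA reference.
    numt_bases: position -> set of bases seen at the aligned position in
    nuclear Numt copies (a hypothetical lookup built from an alignment).
    """
    suspicious, unexplained = [], []
    for pos, ref, alt in sites:
        if alt in numt_bases.get(pos, set()):
            suspicious.append(pos)   # alt allele could be mismapped Numt reads
        else:
            unexplained.append(pos)  # no Numt carries this alt allele
    return suspicious, unexplained


# Invented example: the alt allele at 1000 matches a Numt, the one at 2000 does not.
called = [(1000, "A", "G"), (2000, "C", "T")]
numts = {1000: {"G"}, 2000: {"C"}}
print(flag_possible_numt_sites(called, numts))
```

A match does not prove that a site is an artifact, but the fraction of flagged sites would give a rough upper bound on how much of the total could be explained this way.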

Y-chromosome “Adam” was not necessarily human

This post is by Joe Pickrell

Metaphors in science play an important role in communicating results from one field to scientists in other fields and to the general public. In some cases, however, metaphors are so successful and so appealing that they actually obscure rather than enlighten.

In human population genetics, it is a simple fact that all of the Y chromosomes present in the world today can be traced back to a single common ancestor–if you follow my paternal line (my father’s father’s father’s father, and so on) and your paternal line back far enough, eventually they will overlap. At some point, a population geneticist had the clever idea of calling this common ancestor “Adam”. This is a biblical allusion, of course, and it probably was good for a bit of amusement a couple of decades ago. But it’s time to retire this metaphor–not only because it confuses the public (see a nice series of posts by Melissa Wilson Sayres on this topic here) or scientists in other fields–but because it confuses even practicing human population geneticists!

I was reminded of this when reading over a paper by Eran Elhaik, Dan Graur, and colleagues critiquing work on the human Y chromosome phylogeny by Mendez et al. The basic question being debated is: when did the most recent common ancestor (MRCA) of all Y chromosomes exist? Mendez et al. claimed that this Y chromosome was present around 300,000 years ago, and Elhaik et al. claim they arrived at this number incorrectly.

The details of these papers are not relevant for this post. The key thing I want to point out is an underlying assumption, perhaps most clearly expressed by Elhaik et al., who write:

[Mendez et al.] estimated the time to the most recent common ancestor (TMRCA) for the Y tree to be 338,000 years ago (95% CI=237,000–581,000). Such an extraordinarily early estimate contradicts all previous estimates in the literature and is over a 100,000 years older than the earliest fossils of anatomically modern humans. This estimate raises two astonishing possibilities

The implicit assumption here (the reason Elhaik et al. find the numbers “extraordinarily early” and “astonishing”) is that the individual carrying the most recent common ancestor of all human Y chromosomes (AKA “Adam”) should be an anatomically modern human. Amusingly, Elhaik et al. argue that to claim otherwise is analogous to claiming you have a unicorn in your backyard. But there is simply no reason that “Adam” must be a human. At the top of this post I’ve put a figure showing a hypothetical Y-chromosome genealogy superimposed on a hypothetical human phylogeny. In this (of course hypothetical) example, “Adam” existed well before the diversification of modern humans; this type of scenario is perfectly compatible with basic population genetic theory. From the point of view of population genetics, there is absolutely no reason that the common ancestor of all human Y chromosomes must have existed in an individual that we would identify as “human”.
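To spell out the population genetic point, here is the textbook expectation under the standard neutral coalescent. It is a sketch under strong assumptions (a constant-size, randomly mating population with an effective number of males N_m, and a sample of n Y chromosomes); these symbols and assumptions are introduced for illustration and do not come from either paper.

```latex
\mathbb{E}\left[T_{\mathrm{MRCA}}\right]
  = \sum_{k=2}^{n} \frac{N_m}{\binom{k}{2}}
  = 2 N_m \left(1 - \frac{1}{n}\right) \ \text{generations}
```

Nothing in this expression refers to species boundaries or to when anatomically modern humans appear in the fossil record; in this idealized model the expected depth of the tree depends only on population size.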

So why would anyone make this assumption? Note that Elhaik et al. made a YouTube video describing their results; this video leads with a bit of religious iconography. It seems plausible that by calling the most recent common ancestor of all Y chromosomes “Adam”, population geneticists have confused themselves into thinking that “he” must have been human.