Simons Genome Diversity Project

Human Genome Variation: 25 Genomes from 13 Diverse Populations

This dataset of high-quality genome sequences from diverse human populations around the world was released with the high-coverage Neandertal genome paper:

Prüfer, K. et al. The complete genome sequence of a Neandertal from the Altai Mountains. Nature advance online publication December 18, 2013.

In this study, an international consortium led by Svante Pääbo from the Max Planck Institute for Evolutionary Anthropology in Germany generated a high-quality genome of a Neandertal individual from the Altai Mountains of Siberia. The researchers then compared it to the dataset of present-day humans made publicly available here.

This dataset also serves as a pilot project for the Simons Genome Diversity Project, which will generate 250 genomes from 125 diverse populations. The goal of this project is to elucidate the history of human populations and natural selection in humans and to identify important parameters in the search for disease-causing genes.

The diverse nature of the genomes studied will likely provide a helpful guide to others studying human genetic variation for many years to come.

Please contact us for instructions on how to access the sequence data described below. Additional data releases are expected in the near future.



A publicly downloadable dataset of 25 deep genome sequences of which 13 are experimentally phased

Jacob Kitzman, Heng Li, Swapan Mallick, Arti Tandon, Nick Patterson, Kay Prüfer, Svante Pääbo, Janet Kelso, Jay Shendure and David Reich

Sample preparation and shotgun sequencing
“Panel A” (11 individuals): We previously made publicly available deep whole-genome sequences for 11 individuals from diverse populations (1): 10 from the CEPH-Human Genome Diversity Panel (2) and 1 Dinka individual from Sudan (3). We prepared four barcoded libraries for each of the 11 individuals using the method of ref. 4, and then combined all 44 libraries into a single pool in approximately equimolar amounts. The libraries for all Panel A samples were sequenced together, thereby minimizing differences between samples caused by differences in sequencing machines or reagents.

This tree, color coded by geographic region, summarizes the degree of relationships between populations, as inferred from genetic data.

“Panel B” (14 individuals): We generated genome sequences for 14 additional individuals as part of the high-coverage Neandertal genome study (5). Eleven are from the same populations as the individuals in Panel A and have the same provenance. In addition, there is 1 Mixe Native American for whom HLA, microsatellite, and SNP genotypes have previously been reported (6,7,8). There are also 2 indigenous Australians from a diversity panel maintained at the European Collection of Cell Cultures (ECACC), for whom genome-wide SNP data have previously been reported (9,10). At our request, the ECACC carried out an independent re-review of the Australian samples to determine whether the consent for these samples was consistent with whole-genome sequencing and dissemination of data, and approved their use for this purpose. Illumina Inc. (San Diego, USA) prepared TruSeq libraries from DNA we provided and sequenced them on HiSeq2000 instruments for 2×100 cycles.

Mapping and generation of BAM and VCF files
Mapping of the Panel A individuals was carried out as described in ref. 1. Using BWA (11) version 0.5.9, we mapped reads from the Panel B individuals to the human reference genome (hg19/GRCh37, extended by adding the Epstein-Barr virus). Low-quality read ends were removed using “bwa aln -q15”. We marked potential PCR duplicates with Picard. Genotype calling was carried out as described in ref. 5.
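To illustrate what the “-q15” trimming option does, the sketch below follows the rule stated in the BWA manual: a read of length l is trimmed down to the position x that maximizes the sum of (INT - q_i) over the bases i after x, with INT = 15 here. This is a plain Python illustration, not BWA's own code. The function name and the `min_len` floor are assumptions made for this example.

```python
def trim_low_quality_end(quals, threshold=15, min_len=35):
    """Return how many bases to keep after trimming the 3' end.

    quals     -- list of Phred base qualities, 5' to 3'
    threshold -- the INT value passed to "bwa aln -q" (15 in the text)
    min_len   -- assumed floor on the trimmed read length (illustrative)
    """
    running = 0          # running sum of (threshold - quality) from the 3' end
    best = 0             # largest running sum seen so far
    keep = len(quals)    # default: keep the whole read
    for pos in range(len(quals) - 1, min_len - 1, -1):
        running += threshold - quals[pos]
        if running < 0:
            break        # the tail is no longer low quality on balance
        if running > best:
            best = running
            keep = pos   # cut just before this base
    return keep


# A 45-base read whose last five bases have quality 2.
example = [30] * 40 + [2] * 5
kept = trim_low_quality_end(example)
```

With this example read, the five low-quality bases at the 3' end are removed and the remaining 40 bases are kept. A read with uniformly high qualities is left untrimmed.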
Alignments to hg19/GRCh37 in BAM format and genotype calls in Variant Call Format (VCF) are available from this website. The generation of these VCF files is described in ref. 5.
Table 1 reports summary statistics for the sequencing of the Panel A and B individuals.
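For readers who want a first look at genotype calls in VCF, the following sketch counts genotype classes for one sample using only the Python standard library. It assumes the standard tab-separated VCF layout (eight fixed columns, a FORMAT column, then one column per sample). The function name and the example records are hypothetical and are not taken from this dataset.

```python
def count_genotypes(vcf_lines, sample_index=0):
    """Count genotype classes for one sample in an iterable of VCF lines."""
    counts = {"hom_ref": 0, "het": 0, "hom_alt": 0, "missing": 0}
    for line in vcf_lines:
        if line.startswith("#"):
            continue                      # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 + sample_index:
            continue                      # no genotype column for this sample
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        sample = fields[9 + sample_index].split(":")
        genotype = sample[format_keys.index("GT")]
        # "/" separates unphased alleles and "|" separates phased alleles.
        alleles = genotype.replace("|", "/").split("/")
        if "." in alleles:
            counts["missing"] += 1
        elif len(set(alleles)) > 1:
            counts["het"] += 1
        elif alleles[0] == "0":
            counts["hom_ref"] += 1
        else:
            counts["hom_alt"] += 1
    return counts


# Made-up records for illustration only.
demo = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "1\t100\t.\tA\tG\t50\t.\t.\tGT:DP\t0/1:30",
    "1\t200\t.\tC\tT\t60\t.\t.\tGT:DP\t1/1:28",
    "1\t300\t.\tG\t.\t40\t.\t.\tGT:DP\t0/0:31",
    "1\t400\t.\tT\tC\t10\t.\t.\tGT:DP\t./.:2",
]
summary = count_genotypes(demo)
```

To apply the same function to a downloaded file, pass an open file handle (for example one returned by `gzip.open(path, "rt")` for a compressed VCF) in place of the list.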

Experimental phasing of 13 individuals
We generated experimentally phased genomes for 13 individuals (10 from Panel A and 3 from Panel B) based on pooled fosmid sequencing. Table 1 indicates with an asterisk (*) which samples from Panel A and Panel B were phased.
Shotgun sequencing of the fosmid pools was performed using 75 bp paired-end reads for the 10 Panel A individuals (at the Beijing Genomics Institute, Shenzhen, China) or 50 bp paired-end reads for the 3 Panel B individuals (at the Harvard Medical School Biopolymers Facility, Boston, USA).
We computationally phased the data using an adaptation of the algorithm presented in ref. 12.
The output of the phasing pipeline is a master file showing the phased regions and the haplotypes composed of the alleles at candidate heterozygous sites. We perform haploid SNP calling later on these BAMs to derive the haploid consensus sequence.
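As a toy illustration of haplotype assembly, the problem addressed in ref. 12, the sketch below uses one common formulation, minimum error correction (MEC): choose the pair of complementary haplotypes that requires the fewest allele changes in the observed fragments. It is a brute-force search over a handful of sites, written only to make the objective concrete. It is not the adapted algorithm used in the phasing pipeline, and the function names and the fragment encoding (a dict mapping heterozygous-site index to allele 0 or 1) are assumptions made for this example.

```python
from itertools import product


def mismatches(fragment, haplotype):
    """Number of sites where a fragment disagrees with a haplotype."""
    return sum(1 for site, allele in fragment.items() if haplotype[site] != allele)


def mec_phase(fragments, n_sites):
    """Exhaustively find the haplotype with the minimum error correction cost.

    Each fragment is charged against whichever of the two complementary
    haplotypes it matches better. The first allele is fixed to 0 because a
    haplotype and its complement describe the same phasing.
    """
    best_hap, best_cost = None, None
    for rest in product((0, 1), repeat=n_sites - 1):
        hap = (0,) + rest
        comp = tuple(1 - a for a in hap)
        cost = sum(min(mismatches(f, hap), mismatches(f, comp)) for f in fragments)
        if best_cost is None or cost < best_cost:
            best_hap, best_cost = hap, cost
    return best_hap, best_cost


# Four overlapping fragments (for example, clones) covering four heterozygous sites.
toy_fragments = [
    {0: 0, 1: 1},
    {1: 1, 2: 0},
    {0: 1, 1: 0, 2: 1},
    {2: 0, 3: 0},
]
haplotype, cost = mec_phase(toy_fragments, 4)
```

The search is exponential in the number of sites, so it is usable only on toy inputs; practical methods rely on more efficient algorithms.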


Table 1. Summary statistics on shotgun sequencing of 25 present-day human samples

* The 10 Panel A and 3 Panel B individuals that were experimentally phased.



[1] Meyer M, Kircher M, Gansauge MT, Li H, Racimo F, Mallick S, Schraiber JG, Jay F, Prüfer K, de Filippo C, Sudmant PH, Alkan C, Fu Q, Do R, Rohland N, Tandon A, Siebauer M, Green RE, Bryc K, Briggs AW, Stenzel U, Dabney J, Shendure J, Kitzman J, Hammer MF, Shunkov MV, Derevianko AP, Patterson N, Andrés AM, Eichler EE, Slatkin M, Reich D, Kelso J, Pääbo S (2012) A high-coverage genome sequence from an archaic Denisovan individual. Science 338, 222-6.

[2] Cann HM, de Toma C, Cazes L, Legrand MF, Morel V, Piouffre L, Bodmer J, et al. (2002) A human genome diversity cell line panel. Science 296, 261-2.

[3] Cox MP, Mendez FL, Karafet TM, Pilkington MM, Kingan SB, Destro-Bisol G, Strassmann BI, Hammer MF (2008) Testing for archaic hominin admixture on the X chromosome: model likelihoods for the modern human RRM2P4 region from summaries of genealogical topology under the structured coalescent. Genetics 178, 427-437.

[4] Rohland N, Reich D (2012) Cost-effective, high-throughput DNA sequencing libraries for multiplexed target capture. Genome Res. 22, 939-46.

[5] Prüfer K, et al. (2013) The complete genome sequence of a Neandertal from the Altai Mountains. Nature, advance online publication, December 18.

[6] Hollenbach JA, Thomson G, Cao K, Fernandez-Vina M, Erlich HA, Bugawan TL, Winkler C, Winter M, Klitz W (2001) HLA diversity, differentiation, and haplotype evolution in Mesoamerican Natives. Hum Immunol. 62, 378-90.

[7] Wang S, Lewis CM, Jakobsson M, Ramachandran S, Ray N, Bedoya G, Rojas W, Parra MV, Molina JA, Gallo C, Mazzotti G, Poletti G, Hill K, Hurtado AM, Labuda D, Klitz W, Barrantes R, Bortolini MC, Salzano FM, Petzl-Erler ML, Tsuneto LT, Llop E, Rothhammer F, Excoffier L, Feldman MW, Rosenberg NA, Ruiz-Linares A (2007) Genetic variation and population structure in Native Americans. PLoS Genet. 3, e185.

[8] Reich D, Patterson N, Campbell D, Tandon A, Mazieres S, Ray N, Parra MV, Rojas W, Duque C, Mesa N, García LF, Triana O, Blair S, Maestre A, Dib JC, Bravi CM, Bailliet G, Corach D, Hünemeier T, Bortolini MC, Salzano FM, Petzl-Erler ML, Acuña-Alonzo V, Aguilar-Salinas C, Canizales-Quinteros S, Tusié-Luna T, Riba L, Rodríguez-Cruz M, Lopez-Alarcón M, Coral-Vazquez R, Canto-Cetina T, Silva-Zolezzi I, Fernandez-Lopez JC, Contreras AV, Jimenez-Sanchez G, Gómez-Vázquez MJ, Molina J, Carracedo A, Salas A, Gallo C, Poletti G, Witonsky DB, Alkorta-Aranburu G, Sukernik RI, Osipova L, Fedorova SA, Vasquez R, Villena M, Moreau C, Barrantes R, Pauls D, Excoffier L, Bedoya G, Rothhammer F, Dugoujon JM, Larrouy G, Klitz W, Labuda D, Kidd J, Kidd K, Di Rienzo A, Freimer NB, Price AL, Ruiz-Linares A (2012) Reconstructing Native American population history. Nature 488, 370-4.

[9] Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A (2011) Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 7, e1001375.

[10] Reich D, Patterson N, Kircher M, Delfin F, Nandineni MR, Pugach I, Ko AM, Ko YC, Jinam TA, Phipps ME, Saitou N, Wollstein A, Kayser M, Pääbo S, Stoneking M (2011) Denisova admixture and the first modern human dispersals into Southeast Asia and Oceania. Am J Hum Genet. 89, 516-28.

[11] Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-60.

[12] He D, Choi A, Pipatsrisawat K, Darwiche A, Eskin E (2010) Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics 26, 183-90.