NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel fo
Yirong Shi, Honghong Zhou, Tingrui Song, Quan Kang, The Han100K Initiative, Tao Xu, Shunmin He
Summary
The lack of haplotype reference panels and whole-genome sequencing resources specific to the Chinese population has greatly hindered genetic studies in the world’s largest population. Here, we present the NyuWa genome resource, based on deep (26.2×) sequencing of 2,999 Chinese individuals, and construct a NyuWa reference panel of 5,804 haplotypes and 19.3 million variants, which is a high-quality publicly available Chinese population-specific reference panel with thousands of samples. Compared with other panels, the NyuWa reference panel reduces the Han Chinese imputation error rate by a margin ranging from 30% to 51%. Population structure and imputation simulation tests support the applicability of one integrated reference panel for northern and southern Chinese. In addition, a total of 22,504 loss-of-function variants in coding and noncoding genes are identified, including 11,493 novel variants. These results highlight the value of the NyuWa genome resource in facilitating genetic research in Chinese and Asian populations.
Keywords
whole genome sequencing
Chinese population
haplotype reference panel
genome resource
variants
Introduction
Comprehensive catalogs of genetic variation are fundamental building blocks in research of population and demographic history, medical genetics, and genotype-phenotype associations. Since the first assembly of the human genome was released in 2003 (International Human Genome Sequencing Consortium, 2004), many large-scale whole-genome sequencing (WGS) projects have been launched in Western countries and, more recently, in Asia, creating large and diverse population genetic variation resources. Constructing a haplotype reference panel from large cohort WGS resources is a meaningful and cost-effective way to facilitate genome-wide association studies (GWASs), mainly by imputation of unobserved genotypes into samples that have been assayed using relatively sparse microarrays or low-coverage sequencing (Asimit and Zeggini, 2012; McCarthy et al., 2016). However, there is no specific reference panel for the Chinese population, which is the largest ethnic group in the world.
A remarkable milestone among population genome projects is the 1000 Genomes Project, which released an important resource of ∼7.4× WGS data from 2,504 individuals in 26 populations and constructed a reference panel (1KGP3) of 5,008 haplotypes and over 88 million variants (Auton et al., 2015). This resource provides a benchmark for surveys of human genetic variation and has facilitated numerous GWASs through imputation of variants that are not directly genotyped, enabling a deeper understanding of the genetic architecture of complex diseases (Timpson et al., 2018). Nevertheless, rare and low-frequency variants tend to be specific to a population or sample (Auton et al., 2015), and many disease-related variants are very rare and population specific (Bomba et al., 2017; Maher et al., 2012; Saint Pierre and Génin, 2014). GWASs have missed a proportion of potential trait-associated variants that were poorly imputed with current reference panels (Asimit and Zeggini, 2012; Bomba et al., 2017; Hoffmann and Witte, 2015). Therefore, a number of projects have focused on specific populations, attempting to capture population-specific genetic variability and build specific reference panels. For example, the Genome of the Netherlands (GoNL) Project sequenced the whole genomes of 250 Dutch parent-offspring families, found a large number of novel rare variants, and constructed a reference panel with 998 haplotypes (Francioli et al., 2014). Based on the GoNL panel, researchers discovered that a rare variant, rs77542162, was associated with blood lipid levels in the Dutch population (van Leeuwen et al., 2015). Later, there were other such projects, including UK10K in the United Kingdom population (Walter et al., 2015), SISu in the Finnish population (Chheda et al., 2017), and GenomeDenmark (Maretty et al., 2017). However, these resources are biased toward European populations.
Recently, some genomic resources and panels have also been created for Asian populations, including the Japanese population in the work of Nagasaki et al. (2015), 219 population groups across Asia in the GenomeAsia 100K project (GAsP) (Wall et al., 2019), and three Singaporean populations in the SG10K project (Wu et al., 2019). Some studies have also focused on the Chinese population, but these studies had limited sample sizes (Du et al., 2019; Lan et al., 2017) or geographical coverage (Lin et al., 2018) or relied mainly on low-coverage WGS (1.7× or 0.1×) (Gao et al., 2020; Liu et al., 2018a). In the most recent study, the China Metabolic Analytics Project (ChinaMAP) presented a deep WGS (40.8×) dataset of 10,588 Chinese individuals, focusing on metabolic disease (Cao et al., 2020). However, no reference panel has yet been constructed from that study. The Han Chinese population is the largest ethnic group in East Asia and even worldwide, comprising approximately 1.23 billion people. Han Chinese people account for ∼20% of the global human population and ∼92% of the mainland Chinese population (Xu et al., 2009). Constructing an integrated, large-cohort, high-quality genetic variation database and reference panel for the Han Chinese population is imperative; such a resource would help clarify the population structure and population history and facilitate genetic studies in the world’s largest population.
Here we present the genome resource NyuWa, based on deep (median, 26.2×) WGS of 2,999 Chinese individuals from 23 of 34 administrative divisions in China. NyuWa, or Nüwa, is the mother goddess who was the creator of the human population in Chinese mythology. The NyuWa genome resource includes a total of 71.1 million single-nucleotide polymorphisms (SNPs) and 8.2 million small insertions or deletions (indels), of which 25.0 million are novel. More importantly, we constructed the NyuWa reference panel of 5,804 haplotypes and 19.3 million variants; this resource is a high-quality publicly available Chinese population-specific reference panel with thousands of samples and currently has the best performance for imputation in the Han Chinese population. We also found 1,140 pathogenic variants, 18,711 loss-of-function protein-truncating variants (PTVs), and 3,793 long noncoding RNA (lncRNA) splicing variants, of which 11,493 were novel compared with existing genome resources. The NyuWa genome resource can provide useful and reliable support for genetic and disease studies. The NyuWa variant database and imputation server are available at http://bigdata.ibp.ac.cn/NyuWa/.
Results
Large Chinese population cohort of deep WGS data
The NyuWa genome resource included high-coverage (median depth, 26.2×) WGS of 2,999 different Chinese samples, including diabetes and control samples collected from hospitals and physical examination centers. The samples were from 23 administrative divisions in China, including 17 provinces, 2 autonomous regions, and 4 municipalities directly under the central government (termed “provinces” for simplicity; Figure 1A), which can be summarized into several geographical divisions of China (Table S1). The origins of the samples were referenced to the native places or the provinces where samples were collected. The majority of samples were collected from Shanghai, Guangdong, and Beijing (Figure 1A), which all have numerous residents from external provinces. The ethnicities associated with the samples were not available at the time of the study. Because national minorities are usually clustered geographically in China and are not numerous in our sampling areas, we estimated that the Han Chinese ethnicity made up the overwhelming majority of our samples.
Figure 1. Overview of the NyuWa dataset
(A) Distribution of samples in the NyuWa resource. Samples were assigned to provinces based on the native places or hospitals where the samples were collected.
(B) The distribution of WGS mean genomic coverage after genome alignment and removal of duplicates.
(C) The sex of each sample inferred by sex chromosome coverage and ploidy of the chrX non-pseudo-autosomal region (PAR) estimated by the BCFtools plugin guess-ploidy. The results were consistent for all samples except one with no chrY coverage and a haploid chrX. This special sample was a putative XO type and was classified as female.
See also Figure S1 and Table S1.
Most of the samples were sequenced at a depth of more than 30× (median 38.9; Figure S1A). After genome alignment and removal of duplicates, the median of actual genomic coverage was 26.2× (Figure 1B; Figure S1B). Samples with contamination levels of alpha ≥ 0.05 were removed (Figure S1C). Based on the genomic coverage of sex chromosomes, the sex of each subject could be clearly identified except for one potential XO type (Figure 1C). The ploidy of chromosome X (chrX) for the sample also supported the XO type, which was classified as female. In total, there were 1,335 females and 1,664 males. After identification of close relatives within the third degree (Figure S1D), we found that the NyuWa dataset contained a maximum of 2,902 independent samples.
Discovery of 25.0 million novel variants in the NyuWa resource
Variants were called and filtered using the NyuWa cohort variant calling pipeline (STAR Methods). SNPs and indels were genotyped jointly using GATK (Poplin et al., 2017) with human reference genome version GRCh38/hg38. After site quality filtering, a total of 76.4 million variant sites were identified, including 2.5 million multiallelic sites (Figure S2A). After splitting of multiallelic sites, the final dataset contained 71.1 million SNPs and 8.2 million indels (Figure S2B), including 2.5 million SNPs and 0.3 million indels from sex chromosomes (Table S2). The transition-to-transversion ratio (Ts/Tv) is 2.107 for all biallelic SNPs, which is consistent with previous whole-genome studies such as 1KGP3 (2.09) (Auton et al., 2015) and UK10K (2.15) (Walter et al., 2015).
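The Ts/Tv ratio quoted above can be illustrated with a minimal Python sketch. It assumes biallelic SNPs given as (ref, alt) pairs; the function names and the toy input are hypothetical and are not part of the NyuWa pipeline.

```python
# A transition stays within one chemical class (purine or pyrimidine);
# every other single-base substitution is a transversion.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Return True for A<->G or C<->T substitutions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps):
    """Compute Ts/Tv for an iterable of (ref, alt) biallelic SNPs."""
    ts = tv = 0
    for ref, alt in snps:
        if ref == alt:
            raise ValueError("ref and alt must differ")
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; ratio is undefined")
    return ts / tv


# Hypothetical toy input (not NyuWa data): 4 transitions, 2 transversions.
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
toy_ratio = ts_tv_ratio(toy_snps)
```

On real call sets, a genome-wide value far from roughly 2.0 to 2.1 is commonly treated as a warning sign of false-positive SNP calls, which is why the ratio is reported as a quality check.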
Compared with other public variant repositories, including ExAC (Lek et al., 2016), gnomAD (v2 and v3) (Lek et al., 2016), 1KGP3, ESP (NHLBI GO Exome Sequencing Project), dbSNP (v150) (Sherry et al., 2001), GAsP, 90 Han (Lan et al., 2017), and TOPMed (Taliun et al., 2019), the NyuWa dataset contained 25.0 million novel variants, including 23.1 million SNPs (32.5%) and 1.9 million indels (23.3%) (Figure 2A). The ChinaMAP resource (Cao et al., 2020) merely provided a website for variant search and did not make a full variant list available. To estimate the ratio of novel variants compared with ChinaMAP, we used two variant sets for manual comparison. The first set was 230 novel singletons selected randomly from the NyuWa dataset (10 per chromosome); only 21.3% of variants also existed in the ChinaMAP dataset. Another set consisted of novel variants in 906 cancer-related genes collected from the ClinGen database and literature (Huang et al., 2018; Mirabello et al., 2020; Rehm et al., 2015). There were a total of ∼959,000 novel variants in these genes, and only 27.3% of these variants overlapped with ChinaMAP. We estimated that approximately 73% of novel variants would remain (∼18 million) after removal of variants in ChinaMAP. As expected, most novel variants were extremely rare, with singletons, doubletons, and tripletons accounting for 86.8%, 10.1%, and 1.9% of novel variants, respectively (Figure 2A). This is not surprising because rare variants are usually specific to a sample or population (Francioli et al., 2014). The absolute number of novel variants with a minor allele frequency (MAF) greater than 0.1% was still large (∼77,200). These variants are frequent enough to be subject to large-scale genetic association studies and may lead to new biological discoveries (Piton et al., 2013; Walter et al., 2015). The large overall number of novel variants indicates severe underrepresentation of variants from the Chinese population in recent genetic studies.
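The estimate of roughly 18 million variants remaining after removal of ChinaMAP overlap follows from simple arithmetic on the figures quoted above. The Python sketch below reproduces it, assuming that the overlap fraction measured on a subset can be extrapolated to all 25.0 million novel variants; the variable and function names are illustrative only.

```python
# Figures taken from the text above.
novel_total = 25.0e6          # novel variants in the NyuWa dataset
overlap_cancer_genes = 0.273  # overlap with ChinaMAP in the 906 cancer-related genes
overlap_singletons = 0.213    # overlap with ChinaMAP among the 230 sampled singletons


def remaining_novel(total, overlap_fraction):
    """Variants expected to stay novel if overlap_fraction applies to the whole set."""
    return total * (1.0 - overlap_fraction)


est_from_cancer_genes = remaining_novel(novel_total, overlap_cancer_genes)  # ~18.2 million
est_from_singletons = remaining_novel(novel_total, overlap_singletons)      # ~19.7 million
```

The more conservative of the two figures, based on the larger cancer-gene set, corresponds to the "approximately 73%" and "∼18 million" quoted in the text.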
Figure 2. Statistics on variants in the NyuWa resource
(A) Numbers of variants detected in different bins of allele counts or frequencies. Variants were classified as known or novel based on public resources, including ExAC, gnomAD v2 and v3, 1KGP3, ESP, dbSNP, TOPMed, 90 Han, and GAsP. INS, small insertion; DEL, small deletion.
(B) Numbers (top) and novel rates (bottom) of variants in different RefSeq annotation regions.
(C) Numbers (top) and novel rates (bottom) of variants in different NONCODE annotation regions.
(D) Numbers of nonsynonymous SNPs predicted as deleterious by different numbers of the 10 selected prediction algorithms (SIFT, PolyPhen2 HDIV & HVAR, LRT, MutationTaster, FATHMM, PROVEAN, MetaSVM, MetaLR, and M-CAP) provided by dbNSFP. The novel variants are based on results in (A).
See also Figure S2 and Tables S2–S5.
A typical NyuWa sample carries a median number of 3.51 million SNPs and ∼523,000 indels in autosomes. These numbers are close to those for East Asian samples in 1KGP3 (3.55 million SNPs, ∼546,000 indels). The number of detected SNPs and indels with a MAF greater than 0.1% per sample had slightly positive correlations with genomic coverage (R² = 0.075 and 0.11, respectively) (Figures S2C and S2D), indicating that WGS quality can still be improved by increasing the sequencing depth beyond 30×, especially for indels. This could be explained by the fact that, although there is sufficient coverage for the whole genome, there are still regions that lack coverage randomly or are difficult to amplify, which will be improved when the sequencing depth increases. The median numbers of SNPs and indels with a MAF less than 0.1% in a genome were 26,400 (0.75%) and 2,570 (0.49%), respectively. The very rare SNPs and indels showed no positive correlation with sequencing depth (Figures S2E and S2F), probably because the number of rare variants varies more widely (approximately ±10%) in different samples than the number of variants with a MAF greater than 0.1% (approximately ±1%), and the positive correlation is obscured by the large fluctuation.
To evaluate the effect of increasing sample size on variant discovery, we randomly downsampled the NyuWa dataset to different sizes and estimated the total number and variant increase at different sample sizes (Figures S2G–S2J). We found that the numbers of SNPs and indels continued to increase with increasing sample size (Figures S2G and S2H), but the growth rate decreased, from an initial average increase of 39,400 and 5,700 per sample to a final average ∼13,000 and ∼1,000 for SNPs and indels, respectively (Figures S2I and S2J).
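The downsampling experiment described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' procedure: each sample is represented as a set of variant identifiers, a single random ordering is used instead of repeated draws, and all names and the toy cohort are hypothetical.

```python
import random


def discovery_curve(sample_variants, sizes, seed=0):
    """Return {n: number of distinct variants seen in a random subset of n samples}.

    One shuffled order is reused for every n, so the subsets are nested and
    the curve never decreases.
    """
    rng = random.Random(seed)
    order = list(range(len(sample_variants)))
    rng.shuffle(order)
    curve = {}
    for n in sizes:
        seen = set()
        for idx in order[:n]:
            seen |= sample_variants[idx]
        curve[n] = len(seen)
    return curve


def average_increase(curve):
    """Average number of new variants per added sample between consecutive sizes."""
    sizes = sorted(curve)
    return {(a, b): (curve[b] - curve[a]) / (b - a) for a, b in zip(sizes, sizes[1:])}


# Hypothetical toy cohort: four samples, five distinct variants overall.
toy_cohort = [
    {"v1", "v2", "v3"},
    {"v1", "v2", "v4"},
    {"v1", "v3", "v5"},
    {"v1", "v2", "v3"},
]
toy_curve = discovery_curve(toy_cohort, sizes=[1, 2, 4])
toy_rates = average_increase(toy_curve)
```

A per-sample rate that shrinks as the subset grows corresponds to the slowing growth reported in Figures S2I and S2J.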
There were a total of 31.9 million variants in protein coding genes, including ∼857,000 coding sequence (CDS), 1.10 million untranslated region (UTR), ∼8,600 splicing, and 30 million intron variants (Figure 2B; Figure S2K; Table S3). For lncRNAs, variants were also annotated with NONCODE v5 (Fang et al., 2018), which has the largest collection of lncRNAs. There were a total of 4.78 million variants in lncRNA exon regions (Figure 2C; Table S4). Focusing on variants in protein-coding exons, ∼315,000 of ∼501,000 nonsynonymous SNPs were annotated as deleterious by at least two of ten selected prediction algorithms provided by dbNSFP (Liu et al., 2016; Figure 2D). The numbers of novel nonsynonymous and deleterious SNPs were ∼149,000 and ∼101,000, respectively (Table 1). Other functional protein-coding variants included ∼311,000 synonymous SNPs, ∼15,300 frameshift indels, ∼12,700 non-frameshift indels, ∼11,900 stop gains, and 613 stop losses (Table S5). There are more in-frame indels than adjacent frameshift indels in the coding region (Figure S2L), consistent with a previous report (Lek et al., 2016).
Table 1. Numbers of variants in the NyuWa resource and reference panel

| Type | All variants (a): Total | All variants (a): Novel (c) | Reference panel (b): Total | Reference panel (b): Specific (d) |
| --- | --- | --- | --- | --- |
| All | 79,226,351 | 25,014,646 | 19,256,267 | 3,246,071 |
| Nonsynonymous | 500,966 | 149,343 | 73,260 | 7,048 |
| Nonsynonymous deleterious | 315,016 | 101,407 | 33,526 | 3,323 |
| PTV | 18,711 | 9,994 | 1,381 | 334 |
| lncRNA splicing | 3,793 | 1,499 | 743 | 80 |

(a) Variants in the NyuWa resource.
(b) Variants in the NyuWa reference panel.
(c) The novelty of variants was determined by comparison with dbSNP, 1KGP3, gnomAD v2.1, ExAC, ESP, gnomAD v3, TOPMed, 90 Han, and GAsP.
(d) Variants included in the other 4 publicly available haplotype reference panels (1KGP3, HRC.r1.1, GAsP, and TOPMed) were excluded.
We designed a companion database (http://bigdata.ibp.ac.cn/NyuWa_variants/) to archive SNPs and indels in the NyuWa resource and to catalog their allele frequencies in our Chinese dataset and in external datasets, including 1KGP3 and gnomAD v3. In addition, variant quality metrics, genome region annotations, nonsynonymous variant impact predictions, loss-of-function predictions, clinical annotations, and pharmacogenomic annotations were collected and presented.
The NyuWa reference panel outperformed other publicly available panels for Chinese populations
Genome-wide genotype imputation is a statistical technique to infer missing genotypes from known haplotype information; this technique is more cost-effective for GWAS with SNP arrays than whole-exome sequencing (WES) or WGS. The NyuWa haplotype reference panel (http://bigdata.ibp.ac.cn/refpanel/) was constructed using NyuWa phasing and the reference panel construction pipeline (STAR Methods). The NyuWa panel used 19.3 million SNPs and indels with minor allele count of 5 or greater (MAC5; approximately equivalent to MAF > 0.1%) in 2,902 independent samples, including 73,300 nonsynonymous and 33,500 deleterious SNPs (Table 1). Compared with 4 other publicly available reference panels, including 1KGP3, Haplotype Reference Consortium release 1.1 (HRC.r1.1) (McCarthy et al., 2016), GAsP, and TOPMed r2, the NyuWa reference panel had 3.25 million specific variants not included in other panels, including 7,050 nonsynonymous and 3,320 deleterious SNPs (Table 1). These NyuWa-specific variants may bring new discoveries in future association studies. To evaluate imputation performance, array genotyping data and high-coverage WGS data for 54 worldwide populations from the Human Genome Diversity Project (HGDP) (Bergström et al., 2020; Li et al., 2008) were used as a test dataset. We focused on 16 Chinese populations and 11 other Asian populations in the HGDP. NyuWa outperformed 1KGP3, HRC.r1.1, and TOPMed r2 in all Chinese populations except the Uygurs (Figure 3A; Figures S3A and S3B). This can be explained by the fact that the Uygurs mainly inhabit Central Asia and were seldom included in our sampled areas. For the Han Chinese population, imputation with NyuWa reduced the error rates by 38.1%, 50.8%, and 30.4% compared with 1KGP3, HRC.r1.1, and TOPMed r2, respectively. NyuWa also achieved superior performance in most other East Asian and Northeast Asian populations (Figure 3A; Figures S3A–S3D).
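Two quantities in the paragraph above lend themselves to a short worked example in Python: the MAF implied by the MAC5 cutoff in 2,902 diploid samples, and a relative reduction in error rate. The sketch assumes that "reduced the error rates by X%" denotes a relative reduction; the function names and the error rates passed to `relative_error_reduction` are hypothetical placeholders, not values from the study.

```python
def mac_to_maf(mac, n_samples, ploidy=2):
    """Convert a minor allele count to a minor allele frequency."""
    return mac / (n_samples * ploidy)


def relative_error_reduction(err_other, err_new):
    """Fraction by which err_new is lower than err_other (assumed definition)."""
    if err_other <= 0:
        raise ValueError("err_other must be positive")
    return (err_other - err_new) / err_other


n_haplotypes = 2902 * 2             # 5,804 haplotypes in the panel
maf_at_mac5 = mac_to_maf(5, 2902)   # about 0.00086, i.e. close to 0.1%

# Hypothetical error rates, for illustration only.
example_reduction = relative_error_reduction(err_other=0.02, err_new=0.01)
```

Under this reading, a panel that halves the error rate of a competing panel would be reported as a 50% reduction, comparable in size to the 50.8% quoted for HRC.r1.1.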
Not surprisingly, NyuWa did not perform as well as 1KGP3 in Central/South Asian populations in HGDP, which are mainly from Pakistan and historically received substantial gene flow from Central Asia and western Eurasia (Majumder, 2010; Qamar et al., 2002). Compared with GAsP, a newly released reference panel for Asian populations, NyuWa also has advantages in several Chinese populations, including the Han, She, Tujia, Miaozu, Yizu, Tu, and Naxi (Figure 3B; Figure S3C). For the Han Chinese population, imputation with NyuWa reduced the error rate by 33.2% compared with GAsP. Nevertheless, NyuWa performed worse in some Chinese minorities and Pakistani Central/South Asian populations, possibly because the Han population makes up an overwhelming majority of subjects in NyuWa. These results indicate that additional minority samples are needed to improve the imputation performance for certain Chinese minorities. Imputation error rates for all other non-Chinese populations are shown in Figures S3D–S3F. We further compared the aggregate R² between imputed dosages and true genotypes among panels at different allele frequencies. NyuWa had an absolute advantage over the other panels for the Chinese Han population in all allele frequency bins, with great improvement for low-frequency (allele frequency [AF] < 5%, R² > 0.91) and rare (AF < 0.5%; R² > 0.81) variants (Figure 3C). NyuWa also achieved the highest aggregate R² in some other Chinese populations including She, Miaozu, Tujia, Yizu, and Naxi (Figure S3G). These results indicated the good overall imputation quality of the NyuWa panel.
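The aggregate R² used above can be illustrated with a minimal Python sketch. It assumes the common definition in which the imputed dosages and true genotypes of all variants in an allele frequency bin are pooled and the squared Pearson correlation is computed once over the pooled values; the exact definition used in the study is given in its methods, and the function name and toy data here are hypothetical.

```python
def aggregate_r2(dosages, genotypes):
    """Squared Pearson correlation between pooled imputed dosages and true genotypes."""
    if len(dosages) != len(genotypes) or len(dosages) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    n = len(dosages)
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    if var_d == 0 or var_g == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov * cov / (var_d * var_g)


# Hypothetical toy data: dosages in [0, 2], true genotypes coded 0/1/2.
perfect = aggregate_r2([0.0, 1.0, 2.0], [0, 1, 2])
imperfect = aggregate_r2([0.0, 1.0, 2.0, 1.0], [0, 1, 2, 2])
```

Because rare variants contribute few non-reference genotypes, a single miscalled carrier lowers R² in a rare bin far more than in a common bin, which is why results are stratified by allele frequency.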
Figure 3. Performance of the NyuWa haplotype reference panel
(A) Fold change (FC) in the imputation error rate in different Asian populations in the HGDP array SNPs between the 1KGP3 panel and the NyuWa (left) or NyuWa+1KGP3 (right) panel. Lower FC values represent better performance with the NyuWa or NyuWa+1KGP3 panel. EAS, East Asian; NEA, Northeast Asian; CA, Central Asian; CSA, Central South Asian. The “Han, China” samples do not include “Han (N. China)” samples in HGDP. The significance of error rate differences was calculated by chi-square test. ∗p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001.
(B) FC in imputation error rate for array SNPs between the GAsP panel and the NyuWa (left) or NyuWa+1KGP3 (right) panel. Colors representing regions in (A) and (B) are consistent.
(C) The aggregate R² value between imputed dosage and known genotypes in stratified nonreference AF bins in high-coverage WGS data. Colors represent different reference panels. The “Han, China” and “Han (N. China)” samples are the same as in (A).
See also Figure S3.
To optimize imputation performance, we also combined the NyuWa reference panel with the 1KGP3 panel using the reciprocal imputation strategy (Huang et al., 2015). The combined panel (NyuWa + 1KGP3) included 5,406 samples and 40.2 million variants, which improved imputation in all other tested Asian populations (Figure 3A; Figure S3). The imputation accuracy was improved markedly, by approximately 10%, for the Mongolian, Dai, Daur, Xibo, Tu, Oroqen, and Uygur, and the combined panel outperformed GAsP in more Chinese minority populations (Figure 3B). For the Han, She, Miaozu, Tujia, Yizu, and Naxi populations, NyuWa combined with 1KGP3 had almost the same aggregate R² as NyuWa or a slightly lower R² than NyuWa in the rare and low-frequency bins (Figure 3C; Figure S3G). For some other Chinese populations, such as the Mongolian, Dai, Daur, Xibo, Tu, Oroqen, and Uygur, NyuWa+1KGP3 had an improved R² compared with NyuWa, which was consistent with the error rate results. In brief, NyuWa+1KGP3 is an excellent alternative to NyuWa.
Applicability of one integrated reference panel for northern and southern Chinese
In light of genetic differences between northern and southern Han Chinese people (Chiang et al., 2018; Xu et al., 2009), we wanted to determine whether it is adequate to use one integrated reference panel for the northern and southern Han populations. To do this, we analyzed the NyuWa dataset from the perspective of population structure and imputation simulation tests.
To verify the ethnic authenticity of NyuWa samples, principal-component analysis (PCA) was performed on 200 randomly selected NyuWa samples together with 1KGP3 samples; the results showed that NyuWa samples were clustered together with 1KGP3 Han Chinese samples (Figures S4A and S4B), indicating that NyuWa samples are truly Han Chinese samples and do not show a large batch effect. Y chromosome analysis of male samples in the NyuWa population showed that the O group, which is the dominant group in the Han Chinese population, accounted for the majority (77.5%) of Y chromosome haplogroups. The next most common groups were C (9.0%) and N (7.5%). The Y haplogroup distribution was consistent with a previous analysis of Chinese populations (Yan et al., 2014; Figure S5A). The distribution of Y haplogroups in different provinces is shown in Figure S5B.
We then analyzed ancestral components of NyuWa samples. Cross-validation of ADMIXTURE analysis for NyuWa with 1KGP3 East Asia samples showed that K = 3 best matched the structure of East Asian populations (Figure 4A; Figure S6). As in CHB (Han Chinese in Beijing, China) and CHS (southern Han Chinese) samples in 1KGP3, the most predominant component in NyuWa samples was ancestral component 1 (red). Regarding the sample origins, a clear difference between people in northern and southern provinces was that southern people had a higher proportion of ancestral component 3 (blue; Figure 4B), which was also the case between CHB and CHS samples in 1KGP3. Component 3 was also the major component for Dai (Chinese Dai in Xishuangbanna, China [CDX]) and Vietnamese (Kinh in Ho Chi Minh City, Vietnam [KHV]) people (Figures 4A and 4B). Component 2 (green) was the major component for Japanese (Japanese in Tokyo, Japan [JPT]) people and was uncommon in Chinese samples (Figures 4A and 4B).
Figure 4. Chinese population structure based on the NyuWa dataset
(A) ADMIXTURE analysis of NyuWa samples with East Asia samples in 1KGP3. An assumption of K = 3 ancestries best fits the model. Different colors represent different ancestry components. CHB, Han Chinese in Beijing, China; CHS, Southern Han Chinese; CDX, Chinese Dai in Xishuangbanna, China; JPT, Japanese in Tokyo, Japan; KHV, Kinh in Ho Chi Minh City, Vietnam.
(B) Proportions of ancestry components in different provinces. The ancestry components and colors are consistent with (A). 1KGP3 East Asia populations (CHB, CHS, CDX, JPT, and KHV) are also shown.
(C) Top 2 primary components (PC1 and PC2) of NyuWa samples. Each point represents a sample. Samples are marked with the provinces and areas of China. PC1 represents the difference between northern and southern Chinese.
(D) Imputation error rates of two test datasets representing northern (Han N. China in HGDP, top) and southern (CHS in 1KGP3, bottom) Han Chinese. Each point represents a reference panel constructed with a certain sample subset of the NyuWa reference panel. Red represents north (N)-specific panels from samples in the left part of PC1 shown in (C), and blue represents south (S)-specific panels in the right part of PC1. The gray triangles represent reference panels with randomly (R) selected samples. 1k and 1.5k, respectively, represent 1/3 and 1/2 of the 2,902 total samples in the NyuWa panel. Dotted lines represent addition of more samples.
See also Figures S4–S9 and Table S1.
The above ADMIXTURE results indicated that northern and southern Chinese share two major ancestral components and differ in the proportions, which is consistent with the history of migration and partial mixing within the past two to three millennia (Chen et al., 2009; Wen et al., 2004). Using PCA, we found that primary component 1 (PC1) of NyuWa samples represented the trend of north-south differentiation (Figure 4C), which is consistent with previous studies of the Han ethnicity and Chinese minorities (Cao et al., 2020; Chiang et al., 2018; Liu et al., 2018a). Other PCs did not show differentiation between the north and south (Figure S7A). Variants with high absolute weights in PC1 also showed high AF differences between ancestral components 1 and 3 (Figure S7B). Fst, another measure of genetic differentiation, was computed between northern and southern NyuWa samples as defined by the classic geographical demarcation of the Qinling Mountains-Huaihe River; it showed that north-south differential variants also differed in ancestral components 1 and 3 (Figure S7C). These results are consistent with the partial mixing of ancestral components. Because northern and southern Chinese people share the same major ancestral components, we reason that one integrated reference panel is applicable to northern and southern Han Chinese.
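As an illustration of how a per-variant Fst between two groups can be computed from allele frequencies, the Python sketch below uses Hudson's estimator. This choice is an assumption made for the example; the text does not state which Fst estimator was used, and the function name and inputs are hypothetical.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson's Fst estimator for one biallelic variant.

    p1, p2: allele frequencies in the two groups.
    n1, n2: number of sampled chromosomes (haplotypes) in each group.
    """
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least 2 chromosomes")
    numerator = (
        (p1 - p2) ** 2
        - p1 * (1.0 - p1) / (n1 - 1)
        - p2 * (1.0 - p2) / (n2 - 1)
    )
    denominator = p1 * (1.0 - p2) + p2 * (1.0 - p1)
    if denominator == 0:
        return float("nan")  # variant is fixed for the same allele in both groups
    return numerator / denominator


# Hypothetical inputs: a fully differentiated variant and an undifferentiated one.
fst_fixed_difference = hudson_fst(1.0, 0.0, 100, 100)
fst_no_difference = hudson_fst(0.5, 0.5, 101, 101)
```

Values near 0 (or slightly negative, because of the sampling correction) indicate no differentiation, and values near 1 indicate nearly fixed differences between the groups.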
To test this speculation, we divided samples from the NyuWa reference panel into northern and southern subsets based on sample positions on PC1, which represents differentiation between the north and the south (Figure 4C). Specific panels for northern and southern Han Chinese were then constructed using these sample subsets, and imputation error rates were compared on independent public datasets, including northern Han Chinese (Han North China in HGDP) and CHS (Chinese Han South in 1KGP3). As expected, given the same sample sizes, the regionally matched panels had lower imputation error rates than unmatched panels (Figure 4D). Panels with randomly selected samples had intermediate error rates. Increasing panel sizes always reduced error rates, regardless of whether the added samples were matched (Figure 4D; Figure S8A). The integrated panel always had the lowest error rates. The imputation results for CHB samples in 1KGP3 also showed lower error rates for panels with larger sizes (Figure S8B), whereas the differences between the northern and southern panels were not obvious, probably because there are also many southern samples in CHB (Figure S4B). Another classification method using the Qinling Mountains-Huaihe River geographical demarcation showed similar results (Figures S8C and S8D). These results confirmed the applicability of one integrated panel for northern and southern Chinese subjects.
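The PC1-based split described above can be sketched as follows in Python. The sketch assumes that per-sample PC1 scores are already available as a mapping from sample ID to score; which end of PC1 corresponds to the north depends on the arbitrary sign of the component, so the two subsets are simply called "left" and "right" as in Figure 4D. The function name and toy scores are hypothetical.

```python
def split_by_pc1(pc1_scores, n_per_side):
    """Return (left, right): the n_per_side samples with the lowest and highest PC1."""
    if n_per_side < 1:
        raise ValueError("n_per_side must be at least 1")
    if 2 * n_per_side > len(pc1_scores):
        raise ValueError("subsets would overlap")
    ranked = sorted(pc1_scores, key=pc1_scores.get)
    return ranked[:n_per_side], ranked[-n_per_side:]


# Hypothetical PC1 scores for five samples.
toy_pc1 = {"s1": -0.03, "s2": 0.02, "s3": -0.01, "s4": 0.04, "s5": 0.0}
left_panel, right_panel = split_by_pc1(toy_pc1, n_per_side=2)
```

Equal-sized subsets keep the comparison fair: any difference in imputation error between the matched and unmatched panels then reflects ancestry composition rather than panel size.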
We also explored whether there was a difference in the introgression level of Denisovan and Neanderthal ancestries between the northern and southern NyuWa populations (Figure S9). No obvious north-south difference was found, suggesting that the introgression of Denisovan and Neanderthal ancestries occurred before the split of northern and southern ancestral populations, which was far before the current mixing of the population. Additionally, we found no samples with high Denisovan ancestry (>3%) as observed in Melanesians and Aeta (Wall et al., 2019). The top 10 samples with the highest Denisovan ancestry were from Shanghai (5), Beijing (2), Guangdong (1), Shaanxi (1), and Xinjiang (1), with percentages ranging from 0.42% to 0.45%.
Clinical annotations for variants
To demonstrate the value of the NyuWa resource in improving human health, we further evaluated the utility of NyuWa in genetic disease studies and medical applications. We annotated all variants with ClinVar (Landrum et al., 2018) and found 1,140 pathogenic variants (Figures S10A and S10B). As expected, most of the pathogenic variants were singletons or rare variants in the NyuWa and public datasets (Figure 5A). Each sample had a median of 4 homozygous pathogenic variants and 7 heterozygous pathogenic variants (Figure S10C). We noticed that there were 32 pathogenic variants with an AF greater than 1% (Figure 5A; Data S1). Pathogenic variants are usually rare, and pathogenic variants with high AFs may relate to common diseases; otherwise, their pathogenicity should be subjected to further examination. We also found some variants annotated with conflicting interpretations of pathogenicity by ClinVar that showed higher AFs specifically in the NyuWa resource (Figure 5B; Data S1). For example, with an AF of 1% as the threshold, two variants, rs182677317 and rs369849556, were annotated as conflicting for a rare disease, ciliary dyskinesia, whereas the high AFs (>1%) in the NyuWa dataset suggested that these variants may not be pathogenic (Figure 5C). These results showed that variant AFs in the NyuWa dataset can provide an additional reference to assist in the study of disease-related variants.
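The AF-based screening described above amounts to a simple filter, sketched below in Python. The record layout, the label strings, the variant IDs, and the frequencies are hypothetical placeholders chosen for illustration; they are not taken from ClinVar or from the NyuWa database.

```python
# Labels treated as worth re-examining when the variant is common (assumed strings).
REVIEW_LABELS = {"pathogenic", "conflicting"}


def flag_common_variants(records, af_threshold=0.01):
    """Return IDs of pathogenic or conflicting variants whose AF exceeds the threshold."""
    return [
        rec["id"]
        for rec in records
        if rec["clinvar"] in REVIEW_LABELS and rec["af"] > af_threshold
    ]


# Hypothetical records, for illustration only.
toy_records = [
    {"id": "varA", "clinvar": "pathogenic", "af": 0.0002},
    {"id": "varB", "clinvar": "pathogenic", "af": 0.0150},
    {"id": "varC", "clinvar": "conflicting", "af": 0.0230},
    {"id": "varD", "clinvar": "benign", "af": 0.3000},
]
flagged = flag_common_variants(toy_records)
```

A flagged variant is not thereby shown to be benign; a high population frequency only indicates that its annotation merits a closer look, as the text notes.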
Figure 5. Annotation of variants
(A) Allele count and frequency distribution for ClinVar pathogenic variants.
(B) Allele count and frequency distribution for ClinVar variants annotated as conflicting interpretations of pathogenicity.
(C) Allele frequencies of two variants in different repositories. The two variants were annotated by ClinVar as having conflicting interpretations of pathogenicity for ciliary dyskinesia. TotalFreq, the AF of all samples in the corresponding dataset; EAS, East Asian; AMR, American; AFR, African; EUR, European; SAS, South Asian; NFE, non-Finnish European; FIN, Finnish; ASJ, Ashkenazi Jewish; AMI, Amish; Oth, Other.
(D) Allele frequencies of known pharmacogenomic loci (row) that vary in different populations or regions (column). For the NyuWa dataset, only provinces with sample sizes of 20 or greater are shown.
(E) Allele frequencies of known cancer risk loci (rows) that vary in different populations or regions (columns). For the NyuWa dataset, only provinces with sample sizes of 20 or greater are shown. The AF color bar is consistent with (D).
See also Figure S10 and Data S1 and S2.
We also assessed the allele frequencies of known pharmacogenomic loci from ADME core genes (http://pharmaadme.org/) that may affect the efficacy and safety of drugs in different Chinese provinces and global regions (Data S2). We found some variants with obvious AF differences in different regions of China as well as in different populations worldwide (Figure 5D). For instance, isoniazid, a drug recommended by the World Health Organization (WHO) for treatment of tuberculosis (TB), is metabolized primarily by the enzyme NAT2 (N-acetyltransferase 2). NAT2∗12 refers to rs1208, and the reference allele (A) dampens enzyme activity (Vatsis et al., 1991). The homozygous reference genotype will cause drug accumulation and toxicity, whereas heterozygous and homozygous alternative genotypes have reduced side effects (Toure et al., 2016). We detected consistently high AFs (near 100%) of NAT2∗12 in different Chinese provinces and East Asians and lower frequencies in other populations (Figure 5D). This suggested that testing the NAT2∗12 genotype before using isoniazid is not as necessary for the Chinese population as for other populations. For other examples, the AFs were not close to 0% or 100% and varied among different Chinese provinces (Figure 5D); hence, genetic tests are recommended before certain drugs are used for individualized treatment.
We also examined cancer risk loci (Sud et al., 2017) in different regions (Data S2). It is generally recognized that there are racial differences in cancer susceptibility and survival, and genetic factors are very important determinants of cancer risk (Özdemir and Dotto, 2017). We also detected obvious AF differences between Chinese and other populations in many cancer susceptibility loci (Figure 5E).
Loss-of-function variants of protein-coding genes and lncRNA genes
Human loss-of-function variants have profound effects on gene function and are informative for clinical genome interpretation. In this study, we screened high-confidence loss-of-function PTVs, especially novel variants. We found 18,711 PTVs in 7,696 genes, of which most PTVs were singletons (Figures 6A and 6B), in line with PTV data from ExAC (67% singletons) (Lek et al., 2016). There were 9,994 novel PTVs found in the NyuWa dataset, and 1,381 PTVs could be imputed by the NyuWa reference panel (Table 1). The number of homozygous PTVs was 21 (Figure 6B; Figure S10D). There was a median of 24 homozygous PTVs and 58 heterozygous PTVs per sample (Figure S10E). We detected 1,138 PTVs in 385 of 906 cancer-related genes; 636 of these PTVs were novel. Focusing on 9 well-studied cancer-associated genes (BRCA1, BRCA2, TP53, MEN1, MLH1, MSH2, MSH6, PMS1, and PMS2A) (Wall et al., 2019), we identified 5 novel PTVs and 48 known PTVs in BRCA2, BRCA1, PMS1, TP53, and MSH6 (Figure 6C). BRCA1 and BRCA2 are involved in maintenance of genome stability. Inherited mutations in BRCA1 and BRCA2 confer an increased lifetime risk of developing breast or ovarian cancer. There were 10 known PTVs in BRCA1 and BRCA2, of which 9 have been annotated as pathogenic and related to breast and ovarian cancer in ClinVar (Landrum et al., 2018).
Figure 6. Predicted loss-of-function variants in the NyuWa dataset
(A) Allele count and frequency distribution of protein-truncating variants (PTVs).
(B) Numbers of PTVs classified as novel, known, heterozygous, and homozygous.
(C) Known and novel PTVs identified in selected cancer-associated genes in the NyuWa dataset.
(D) Numbers of lncRNA splicing variants classified as novel, known, heterozygous, and homozygous.
(E) Allele count and frequency distribution for lncRNA splicing variants.
(F) Allele count and frequency distribution for lncRNA splicing variants in 230 lncRNA genes reported to be essential for cell growth.
See also Figure S10.
Because lncRNAs do not contain consensus CDS regions, splicing variants are the most important class of possible lncRNA loss-of-function variants. Splicing variants may cause intron retention or exon skipping and greatly change the lncRNA sequence and structure (Ulitsky et al., 2011). A total of 230 lncRNA genes have been reported to affect cell growth after CRISPR editing at lncRNA splicing sites (Liu et al., 2018b), suggesting the importance of lncRNA splicing variants for lncRNA functions. A total of 3,793 splicing variants in 3,544 lncRNA genes were found in the NyuWa dataset (Figure 6D), including 1,454 splicing variants in 1,287 Ensembl lncRNA genes and another 2,339 splicing variants in 2,257 NONCODE lncRNA genes (Figures S10F and S10G). Each sample had a median of 61 homozygous and 91 heterozygous lncRNA splicing variants (Figure S10H). Among the 230 lncRNA genes reported to be essential for cell growth (Liu et al., 2018b), we found 22 splicing variants in 20 lncRNA genes. The proportion of lncRNA splicing variants with AF > 0.1% was smaller in the 20 essential lncRNA genes than among all lncRNA splicing variants (Figures 6E and 6F), suggesting that splicing variants can truly affect the function of these lncRNAs. In general, the loss-of-function variants for protein-coding and noncoding genes identified in the NyuWa dataset may be associated with disease etiology or trait tendency, which will provide novel insights into disease and genetic studies.
Discussion
The Chinese population, which accounts for approximately 20% of the global human population, contains 56 ethnic groups and highly diverse disease types. Constructing a comprehensive genome resource platform of the Chinese population empowers medical genetics discoveries in the world’s largest population and contributes to the diversity of worldwide human genetic resources. Here we present the NyuWa resource, consisting of large-cohort deep WGS data for the Chinese population. We also constructed a companion database to comprehensively catalog the variants. The 25 million novel variants identified in the NyuWa resource will greatly benefit studies of human diseases, especially in Chinese people. Although ChinaMAP has also published a resource for the Chinese population, variant data files were not available to download. By comparing manually selected variants, we estimated that ∼18 million variants would remain novel after the exclusion of variants in ChinaMAP.
Another important contribution of this work is that the NyuWa resource can fill in the blanks of the WGS-based haplotype reference panel in the Chinese population. Previously, the most commonly used imputation panels were constructed by 1KGP3 and HRC. The recently released TOPMed reference panel included the largest number of haplotypes (Taliun et al., 2019) so far. However, the imputation performance of these panels for Chinese and East Asian populations is limited because East Asian samples are underrepresented. In addition, a large number of genome variants are specific to a population or sample, especially for rare variants, whose imputation can be challenging (Carmi et al., 2014). Our NyuWa reference panel contains 19.3 million variants (approximately MAF > 0.1%) with 3.25 million specific variants not included in other panels, and contains a large proportion of low-frequency alleles. The imputation performance of NyuWa exceeded that of 1KGP3, HRC, and TOPMed for the Chinese population (Figure 3A; Figures S3A and S3B). Furthermore, the combined reference panel of NyuWa and 1KGP3 outperformed 1KGP3, HRC, and TOPMed for nearly all Asians (Figure 3A; Figure S3). Compared with GAsP, a newly public Asian reference panel, NyuWa also has an advantage in Chinese populations, including the Han, She, Tujia, Miaozu, Yizu, Tu, and Naxi, and possesses higher accuracy across all AF bins.
We also found that the genetic differences between northern and southern Chinese are mainly the proportions of two major ADMIXTURE components, suggesting that the north-south differences result mainly from partial population mixing in recent history. In the ADMIXTURE results, the main difference was the proportion of the northern Han-like component (ancestry 1, red) and southern Dai- or Vietnamese-like component (ancestry 3, blue) (Figures 4A and 4B). The northern samples have a very large proportion of component 1 and a small proportion of component 3, whereas component 3 is present in approximately half of the south samples. This population structure implies a partial mix of two ancestral components in the north and south, which is also consistent with the history of China. The earliest center of Chinese civilization was located in central to northern China, ranging from Henan to Shaanxi. Starting from the Eastern Zhou Dynasty, the Chinese territory expanded greatly, especially to the south. Then the foundation of a unitary multiethnic country beginning in the Qin and Han Dynasties facilitated mixing of the early Chinese population with southern ancestral populations. At present, the mix has still not achieved equilibrium.
An ideal reference panel for a population needs to cover all major ADMIXTURE components in the population. Each major component is required to have a sufficient and balanced sample size to cover most haplotypes in the component. As described above, northern and southern Han Chinese have the same two major components, although the proportions of these components are different. Therefore, a single reference panel that covers these major components can be used to impute northern and southern Han populations. Imputation tests using northern or southern subset panels confirmed this speculation.
The current knowledge and guidelines on medical genomics are mainly from Eurocentric genetic and genomic resources and may be missing information about people of non-European ancestry. Our study provides a large and high-quality WGS resource for Chinese populations, which will be useful in examining the effect of known genetic variants on disease susceptibility and drug responses, and benefit clinical investigations in the future. The identification of loss-of-function variants for protein-coding and lncRNA genes in this study expands the catalog of loss-of-function variants in nature. When combined with phenotype information, this resource will provide important biological insights into gene functions.
Limitations of the study
Because of the lack of samples from certain Chinese minority populations, the performance of the NyuWa reference panel can still be improved by including more minority samples. Currently, ethnic information in the NyuWa resource is not available. Han Chinese are supposed to be the majority in NyuWa samples. The results of better performance using one integrated panel for both northern and southern Chinese are based on the current panel size. When a larger sample size has been accumulated, the specific situation will determine which panel works better.
STAR★Methods
Key resources table
REAGENT or RESOURCE | SOURCE | IDENTIFIER
Deposited data
Human reference genome (hg38) | GATK resource bundle (Poplin et al., 2017) | https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0
1000 Genomes Project Phase 3 | Auton et al., 2015 | https://www.internationalgenome.org/
HGDP array | Li et al., 2008 | http://hagsc.org/hgdp/
HGDP WGS | Bergström et al., 2020 | ftp://ngs.sanger.ac.uk/production/hgdp
Neanderthal genome | Prüfer et al., 2014 | http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/VCF/
Denisovan genome | Meyer et al., 2012 | http://cdna.eva.mpg.de/neandertal/altai/Denisovan/
Human ancestral genome | Ensembl | ftp://ftp.ensembl.org/pub/release-99/fasta/ancestral_alleles/
Genome In A Bottle (GIAB) HG001 v.3.3.2 | Zook et al., 2014 | ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
dbSNP v150 | NCBI | https://ftp.ncbi.nih.gov/snp/
gnomAD v2 & v3 | Lek et al., 2016 | https://gnomad.broadinstitute.org/
TopMed r2 | Taliun et al., 2019 | https://imputation.biodatacatalyst.nhlbi.nih.gov/#!pages/home
HRC r1.1 | McCarthy et al., 2016 | https://imputation.sanger.ac.uk/
GAsP | Wall et al., 2019 | https://browser.genomeasia100k.org/
GTEx v8 | Ardlie et al., 2015 | https://gtexportal.org/home/
NONCODE v5 | Fang et al., 2018 | http://noncode.org/
NyuWa imputation server | This manuscript | http://bigdata.ibp.ac.cn/refpanel/
NyuWa variant database | This manuscript | http://bigdata.ibp.ac.cn/NyuWa_variants/
NyuWa WGS | This manuscript | NODE: OEP002803
Software and algorithms
GATK v3.7 | Poplin et al., 2017 | https://gatk.broadinstitute.org/hc
FastQC v0.11.3 | Babraham Institute | https://www.bioinformatics.babraham.ac.uk/projects/fastqc
Trimmomatic v0.36 | Bolger et al., 2014 | http://www.usadellab.org/cms/index.php?page=trimmomatic
BWA-MEM v0.7.15 | Li and Durbin, 2010 | https://github.com/lh3/bwa
qualimap v2.1.2 | Okonechnikov et al., 2016 | http://qualimap.conesalab.org/
Picard v2.9.2 | Broad Institute | http://broadinstitute.github.io/picard/
verifyBamID2 v1.0.6 | Zhang et al., 2020 | https://github.com/Griffan/VerifyBamID
ANNOVAR v2018-04-16 | Wang et al., 2010 | https://annovar.openbioinformatics.org/en/latest/
LOFTEE v1.0.3 | Karczewski et al., 2020 | https://github.com/konradjk/loftee
HAPCUT2 v1.0 | Edge et al., 2017 | https://github.com/vibansal/HapCUT2
SHAPEIT4 v4.1.2 | Delaneau et al., 2019 | https://odelaneau.github.io/shapeit4/
Minimac3 & 4 | Das et al., 2016 | https://genome.sph.umich.edu/wiki/Minimac4
Eagle2 v2.4.1 | Loh et al., 2016 | https://alkesgroup.broadinstitute.org/Eagle/
Plink v2.00 | Chang et al., 2015 | https://www.cog-genomics.org/plink/2.0/
Bcftools v1.10.2 | Danecek et al., 2021 | https://samtools.github.io/bcftools/bcftools.html
ADMIXTURE v1.3.0 | Alexander et al., 2009 | https://dalexander.github.io/admixture/
VCFtools v0.1.15 | Danecek et al., 2011 | https://vcftools.github.io/
CrossMap v0.5.3 | Zhao et al., 2014 | http://crossmap.sourceforge.net/
yHaplo v1.0.21 | Poznik, 2016 | https://github.com/23andMe/yhaplo
FigTree v1.4.4 | GitHub | https://github.com/rambaut/figtree
Resource availability
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Shunmin He (heshunmin@ibp.ac.cn).
Materials availability
This study did not generate new unique reagents.
Experimental model and subject details
Whole-blood DNA samples were collected from 3,064 Chinese individuals, including diabetes patients and controls. This study was approved by the Medical Research Ethics Committee of the Institute of Biophysics, Chinese Academy of Sciences. All participants provided written informed consent. The informed consent covers the collection of samples for genome studies conducted by the Chinese Academy of Sciences. The consent requires participants to be patients or healthy people aged 30-70 years with full capacity. Participants voluntarily donated blood samples, provided clinical treatment information, and signed informed consent. All personal information is kept confidential. Participants could choose not to participate in sample donation or to withdraw at any time.
Method details
DNA extraction and library preparation
Genomic DNA was extracted and sequenced by WuXi Apptec Co., Ltd. according to standard Illumina protocols on the HiSeq X10 or NovaSeq 6000 platform. The sequencing reads were paired-end 150 nt, and the target depth was 30×. Sequencing quality was checked with FastQC v0.11.3 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc). Adaptor sequences and low-quality bases were removed with Trimmomatic v0.36 (Bolger et al., 2014).
Quantification and statistical analysis
NyuWa cohort variant calling pipeline
Variant calling followed the GATK (Poplin et al., 2017) Best Practices workflow for germline short variant discovery (SNPs + indels) in joint-genotyping cohort mode. In brief, the raw sequencing reads were mapped to human reference genome assembly 38 with BWA-MEM v0.7.15 (Li and Durbin, 2010). Picard (http://broadinstitute.github.io/picard/) was used to sort BAM files and mark duplicates. Mapping quality was checked with qualimap v2.1.2 (Okonechnikov et al., 2016). Indels were realigned and bases were recalibrated with GATK v3.7. Variants were called for each sample using GATK HaplotypeCaller in GVCF mode. GATK GenotypeGVCFs was then used to identify variants for all samples in the cohort. GATK VQSR was then applied to SNPs and indels with truth sensitivity filter levels of 99.7 and 99.0, respectively. Variants were then annotated with ANNOVAR v2018-04-16 (Wang et al., 2010).
Duplicate sequencing data for the same person were removed. verifyBamID2 v1.0.6 (Zhang et al., 2020) was used to check for contamination. Samples with contamination levels of alpha ≥ 0.05 were removed. The sex of each sample was inferred in two ways. First, based on the whole-genome and per-chromosome coverage reported by qualimap, the coverage of the X and Y chromosomes was divided by the whole-genome coverage. The relative coverage (X, Y) is expected to be (0.5, 0.5) for males and (1, 0) for females. Second, the ploidy of the non-PAR region of the X chromosome was estimated with the guess-ploidy module of BCFtools v1.5 (https://samtools.github.io/bcftools/bcftools.html; Danecek et al., 2021). Males are haploid, whereas females are diploid.
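The coverage-based sex check can be illustrated with a minimal Python sketch. The function name and the tolerance `tol` are assumptions made for illustration; the study does not state the cutoffs used around the expected values.

```python
def infer_sex_from_coverage(x_cov, y_cov, genome_cov, tol=0.2):
    """Classify a sample from X/Y coverage relative to whole-genome coverage.

    Expected relative coverage (X, Y): male ~ (0.5, 0.5), female ~ (1, 0).
    `tol` is a hypothetical tolerance, not a value reported in the study.
    """
    if genome_cov <= 0:
        raise ValueError("genome_cov must be positive")
    rel_x = x_cov / genome_cov
    rel_y = y_cov / genome_cov
    if abs(rel_x - 0.5) <= tol and abs(rel_y - 0.5) <= tol:
        return "male"
    if abs(rel_x - 1.0) <= tol and abs(rel_y) <= tol:
        return "female"
    # Anything else would need manual review (e.g., against the ploidy estimate).
    return "ambiguous"
```

A sample labeled "ambiguous" here would be a candidate for cross-checking against the ploidy-based inference.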
To filter low-quality sites, variants that did not pass VQSR were removed. Additional filters were applied to further exclude low-quality variants. Sites with genotype quality (GQ) < 10 in > 50% of samples were removed. For the Y chromosome, sites were removed if GQ < 10 in > 50% of male samples or GQ ≥ 10 in > 10% of female samples. Sites with no ALT allele in samples with GQ ≥ 10 were also removed. Variants were further filtered if they had a Hardy-Weinberg equilibrium (HWE) p value < 10^-6 in the direction of excess heterozygosity or ExcessHet > 54.69 in the INFO column calculated by GATK. Multiallelic sites were split using the BCFtools norm module.
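The two autosomal GQ-based site filters can be sketched as follows. This is an illustrative reading of the rules, not the pipeline's code; the function name and the choice to count a missing GQ as low quality are assumptions.

```python
def keep_site(gqs, carries_alt, min_gq=10, max_low_frac=0.5):
    """Apply the two GQ-based site filters to one autosomal site.

    gqs:         per-sample genotype quality (None if missing; treated as
                 low quality here, which is an assumption).
    carries_alt: per-sample bool, True if the genotype has an ALT allele.
    """
    n_low = sum(1 for gq in gqs if gq is None or gq < min_gq)
    # Filter 1: remove the site if GQ < min_gq in more than half of samples.
    if n_low / len(gqs) > max_low_frac:
        return False
    # Filter 2: remove the site if no sample with GQ >= min_gq carries ALT.
    return any(
        alt
        for gq, alt in zip(gqs, carries_alt)
        if gq is not None and gq >= min_gq
    )
```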
Some analyses required removal of close relatives. Relationships of the 3rd degree or closer were identified by combining the kinship coefficient (Φ) and the probability of zero identity-by-descent (IBD) sharing (π0) (Manichaikul et al., 2010), calculated by plink (Chang et al., 2015). The k-degree relationship was defined as 2^(-k-1.5) < Φ < 2^(-k-0.5). For 1st degree relationships, a pair was defined as parent-offspring if π0 < 0.1 and as full siblings if π0 > 0.1. Φ > 2^(-1.5) represents monozygotic twins or sample replicates. Relationships more distant than the 3rd degree were treated as unrelated. To determine the list of independent samples, subjects with more relatives were excluded with priority, and a maximum of 2,902 unrelated samples were kept.
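The relationship thresholds above can be written out as a short Python sketch. The function name and labels are hypothetical, and the handling of values that fall exactly on a boundary (assigned to the closer degree, and to full siblings at π0 = 0.1) is an assumption, because the text gives only strict inequalities.

```python
def classify_pair(phi, pi0):
    """Classify a sample pair from kinship (phi) and zero-IBD sharing (pi0).

    Degree k is assigned when 2**(-k - 1.5) < phi <= 2**(-k - 0.5).
    """
    if phi > 2 ** -1.5:
        return "duplicate_or_MZ_twin"
    for k in (1, 2, 3):
        if 2 ** (-k - 1.5) < phi <= 2 ** (-k - 0.5):
            if k == 1:
                # First-degree pairs are split by the probability of sharing
                # zero alleles IBD.
                return "parent_offspring" if pi0 < 0.1 else "full_sibling"
            return f"degree_{k}"
    # More distant than 3rd degree: treated as unrelated.
    return "unrelated"
```

For orientation, the first-degree window is roughly 0.177 < Φ < 0.354, so a typical parent-offspring pair (Φ near 0.25, π0 near 0) falls in the first branch of the loop.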
Phasing and reference panel construction pipeline
Read-based haplotype phasing for each sample was carried out with HAPCUT2 (Edge et al., 2017). The local phased sets were then incorporated into population-based phasing of 2,999 samples using SHAPEIT4 v4.1.2 (Delaneau et al., 2019) with the parameter '--use-PS 0.0001'. Information from family trios or duos was converted to phasing scaffold data and used by SHAPEIT4 with the '--scaffold' option. Sites with missing call rates greater than 10% were removed. Sites with minor allele count < 2 (MAC2) were also removed. There were no samples with a missing call rate greater than 10%. No additional reference panel was used. Only chromosomes 1-22 and X were phased, and each chromosome was phased separately. For the X chromosome, the pseudo-autosomal regions (PARs) and the non-PAR region were divided and phased separately. For samples with a haploid X chromosome in non-PAR regions (males), heterozygous genotypes were converted to missing before phasing with SHAPEIT4.
The 2,902 independent samples were extracted from the above phasing results. Sites with minor allele count < 5 (MAC5) in the independent sample set were also removed. The final list included 2,902 samples and 19,256,267 variants. Phased genotypes were then converted to m3vcf format as the imputation reference file using Minimac3 (Das et al., 2016) v2.0.1. The hg38 version of the 1KGP3 reference panel was generated similarly with MAC5 sites.
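As a small worked example of the MAC5 rule, the sketch below computes the minor allele count from an ALT allele count and applies the threshold. The function names are hypothetical; 5,804 is the number of autosomal haplotypes carried by 2,902 diploid samples.

```python
def minor_allele_count(alt_count, n_haplotypes):
    """Count of the less common allele at a biallelic site."""
    return min(alt_count, n_haplotypes - alt_count)


def keep_for_panel(alt_count, n_haplotypes, min_mac=5):
    """Keep a site only if its minor allele count is at least min_mac."""
    return minor_allele_count(alt_count, n_haplotypes) >= min_mac
```

With 5,804 haplotypes, a minor allele count of 5 corresponds to a minor allele frequency of about 0.086%, which is consistent with the panel being described as covering approximately MAF > 0.1%.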
To further improve imputation performance, a combined panel of NyuWa with 1KGP3 panel was generated using the reciprocal imputation strategy (Huang et al., 2015). The missed variants in each panel were imputed with the other with Minimac4 (Das et al., 2016), and the results were combined to form a new panel with all samples and union of variants in NyuWa and 1KGP3 panel. The combined panel had 5,406 samples and 40,196,029 variants in total.
Imputation performance
Chromosome 2 of the HGDP genotyping array data was used to test imputation error rates for the NyuWa, 1KGP3, GAsP, HRC.r1.1, TOPMed, and NyuWa+1KGP3 reference panels. Biallelic SNPs present in all panels were selected. One out of every 10 of the selected SNPs was then masked to evaluate imputation accuracy. Phasing and imputation with the GAsP, HRC.r1.1, and TOPMed panels were run on their respective web servers. Phasing and imputation with the NyuWa, 1KGP3, and NyuWa+1KGP3 panels were run locally with Eagle2 (Loh et al., 2016) and Minimac4, respectively. Imputation error rates were computed for each population as the genotype discordance rates of the masked SNPs.
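The masking and error-rate calculation can be sketched in Python as below. This is an illustration under stated assumptions: the text does not say whether the masked SNPs were chosen at regular intervals or at random, so regular spacing is assumed, and genotypes are encoded as ALT allele counts (0, 1, 2) with `None` for missing calls.

```python
def mask_every_tenth(snp_ids, offset=0):
    """Return the SNP IDs to hide: one out of every 10, in input order."""
    return [snp for i, snp in enumerate(snp_ids) if i % 10 == offset]


def discordance_rate(true_gt, imputed_gt):
    """Fraction of masked genotypes whose imputed call differs from the truth.

    Pairs with a missing value on either side are skipped.
    """
    pairs = [
        (t, g) for t, g in zip(true_gt, imputed_gt)
        if t is not None and g is not None
    ]
    if not pairs:
        raise ValueError("no comparable genotypes")
    return sum(t != g for t, g in pairs) / len(pairs)
```

Computing `discordance_rate` separately over the samples of each population gives per-population error rates of the kind compared across panels.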
In addition, for Chinese samples in the HGDP dataset, we compared Pearson's R^2 between the genotypes from high-coverage WGS of HGDP samples (Bergström et al., 2020) and the imputed dosages from the reference panels described above. Sites overlapping with all the compared panels were used, and variants with a missing rate > 10% in HGDP WGS were excluded. The imputation accuracy was stratified by the non-reference allele frequencies (AFs) in the NyuWa reference panel, and R^2 was calculated over all variants in each bin.
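A minimal sketch of the AF-stratified accuracy measure follows. It pools genotype and dosage pairs from all variants in a bin before computing one squared Pearson correlation per bin, which is one reading of "calculated for all variants in each bin". The function names, the bin edges in the usage note, and the half-open bin convention are assumptions.

```python
import bisect
from collections import defaultdict


def pearson_r2(x, y):
    """Squared Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either side has no variance
    return sxy * sxy / (sxx * syy)


def r2_by_af_bin(records, bin_edges):
    """Pool genotypes and dosages per AF bin, then compute R^2 per bin.

    records:   iterable of (panel_af, true_genotypes, imputed_dosages),
               one entry per variant.
    bin_edges: sorted edges; bin i covers [bin_edges[i], bin_edges[i + 1]).
    """
    pooled = defaultdict(lambda: ([], []))
    for af, truth, dosage in records:
        b = bisect.bisect_right(bin_edges, af) - 1
        if 0 <= b < len(bin_edges) - 1:
            pooled[b][0].extend(truth)
            pooled[b][1].extend(dosage)
    return {b: pearson_r2(t, d) for b, (t, d) in pooled.items()}
```

For example, calling `r2_by_af_bin(records, [0.0, 0.01, 0.05, 1.0])` returns a dictionary keyed by bin index, with bin 0 holding the variants whose panel AF is below 1%.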
The imputation error rates of reference panels constructed with sample subsets of the NyuWa reference panel were evaluated the same way as NyuWa panel. The 1KGP3 CHS and CHB test samples were already phased, and every 1 out of 10 of the selected SNPs were masked to evaluate the imputation error rates. The samples in the North or South specific panels were divided based on ranks of sample positions on PC1 from PCA or geographical demarcation of Qinling Mountains-Huaihe River (Table S1).
Population structure analysis
The 2,902 independent NyuWa samples and the 1KGP3 data were merged by extracting overlapping biallelic autosomal SNPs. SNPs with a missing rate of more than 10% or MAF less than 0.05 were excluded. Linkage disequilibrium (LD) was reduced by thinning the SNPs to no closer than 2 kb using plink. Furthermore, 27 known long-range LD regions were removed according to previous studies (Price et al., 2008; Tang et al., 2008; Wu et al., 2019). The resulting dataset included 901,455 SNPs. The merged data were then used in principal component analysis (PCA) and ADMIXTURE by extracting the samples of interest in each analysis. PCA was carried out using plink. ADMIXTURE analysis was carried out using ADMIXTURE v1.3.0 (Alexander et al., 2009). For each K, the analysis was repeated 4 to 8 times with different seeds, and the run with the highest likelihood was chosen. To display ADMIXTURE results when K > 2, dimensions were reduced to one dimension by tSNE, and samples were ordered by tSNE values.
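The distance-based thinning step can be pictured with the short Python sketch below. It is a greedy illustration of the idea (keep a SNP only if it lies at least 2 kb from the last kept SNP), not plink's implementation, and the function name is hypothetical.

```python
def thin_by_distance(positions, min_gap=2000):
    """Greedily keep SNPs so that kept positions are at least min_gap bp apart.

    positions: base-pair positions on a single chromosome, sorted ascending.
    """
    kept = []
    for pos in positions:
        if not kept or pos - kept[-1] >= min_gap:
            kept.append(pos)
    return kept
```

The function would be applied per chromosome, since distances between positions on different chromosomes are not meaningful.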
Fst between south and north of China
The SNP-level fixation index (Fst) between the north and south of China was calculated using Weir and Cockerham's estimator (Weir and Cockerham, 1984) as implemented in VCFtools (Danecek et al., 2011). The north and south of China were divided according to the classic demarcation of the Qinling Mountains-Huaihe River (Table S1). Henan, Jiangsu, and Anhui were excluded because the Huaihe River flows through these provinces. Shanghai was also excluded because it may contain many individuals from other provinces.
Denisovan and Neanderthal ancestry
Estimation of Denisovan and Neanderthal ancestry followed the methods in GAsP (Wall et al., 2019). In brief, Neanderthal and Denisovan genomes were downloaded from http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/VCF/ (Prüfer et al., 2014) and http://cdna.eva.mpg.de/neandertal/altai/Denisovan/ (Meyer et al., 2012). Human ancestral sequences were downloaded from ftp://ftp.ensembl.org/pub/release-99/fasta/ancestral_alleles/. Potential Neanderthal/Denisovan SNPs were selected by the following criteria: (1) the REF allele matched the ancestral allele; (2) the Neanderthal/Denisovan genotype was homozygous for the ALT allele; (3) the genotype of the other archaic genome (Denisovan/Neanderthal, respectively) was homozygous for the REF allele; and (4) the ALT allele was not found in YRI, GWD, MSL, or ESN samples in 1KGP3. Then, for each NyuWa sample, the number of Neanderthal/Denisovan SNP alleles was counted. To correct for background, linear models were fit for both Neanderthal and Denisovan SNPs based on allele counts and ancestry percentages in the GAsP results. Assuming that SNPs called in NyuWa and GAsP were independent for Neanderthal/Denisovan SNPs, allele counts were scaled to make the median of NyuWa samples equal to the average of GAsP HAN samples. The ancestry proportion for each sample was then determined by the linear model using the scaled allele count.
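The four selection criteria for archaic-informative SNPs can be expressed as a single predicate. The sketch below is illustrative only: the function name, the argument layout, and the representation of genotypes as allele tuples are assumptions, and alleles are assumed to be upper-case strings.

```python
AFRICAN_POPS = ("YRI", "GWD", "MSL", "ESN")


def is_archaic_informative(ref, alt, ancestral, target_gt, other_gt,
                           african_alt_counts):
    """Check the four criteria for one candidate SNP.

    target_gt: genotype of the archaic genome of interest, e.g. ("G", "G").
    other_gt:  genotype of the other archaic genome.
    african_alt_counts: ALT allele count per 1KGP3 population code; must
                        contain every code in AFRICAN_POPS.
    """
    return (
        ref == ancestral                  # (1) REF is the ancestral allele
        and target_gt == (alt, alt)       # (2) target archaic is hom-ALT
        and other_gt == (ref, ref)        # (3) other archaic is hom-REF
        and all(african_alt_counts[p] == 0 for p in AFRICAN_POPS)  # (4)
    )
```

Calling the function once with the Neanderthal genotype as `target_gt` and once with the Denisovan genotype as `target_gt` yields the two SNP sets whose alleles are then counted per sample.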
Y chromosome analysis
Genotypes of male chrY SNPs in the NyuWa dataset were lifted over to hg19 using CrossMap (Zhao et al., 2014). Y chromosomal haplogroups were inferred using yHaplo (https://github.com/23andMe/yhaplo; Poznik, 2016). The primary tree structure file (y.tree.primary.2016.01.04.nwk), the preferred SNP names file (preferred.snpNames.txt), and the phylogenetically informative SNPs file (isogg.2016.01.04.txt) were used.
MEGA X (Kumar et al., 2018) was used to construct a phylogenetic tree with the neighbor-joining (NJ) method and 50 bootstrap replicates. FigTree v1.4.4 (https://github.com/rambaut/figtree/releases) was used to color the tree and label the main branches manually.
PTVs and lncRNA loss-of-function variants
PTV analysis followed the methods in GAsP (Wall et al., 2019). In brief, stop-gain, frameshift, and splicing variants were selected according to the ensGene annotation by ANNOVAR (Wang et al., 2010). Splicing variants are variants within 2 bp of an exon/intron boundary that disrupt the GT-AG boundary pattern. Multiple filters were then applied. Variants outside Genome In A Bottle (GIAB) (Zook et al., 2014) high-confidence regions were excluded. Stop-gain or frameshift variants in the last exon or in the last 50 nt of the second-to-last exon were excluded. Variants in exons with non-classic splice sites were also removed. Splicing variants located in introns shorter than 15 nt or in UTRs were excluded. Stop-gain and splicing variants with a phyloP100way vertebrate rankscore < 0.01 were excluded. Additional filters were applied to retain high-quality PTVs. Only variants with GQ ≥ 20, DP > 7, and ALT DP > 0.2 × DP were kept. Only variants affecting transcripts that were within the top 50% of gene expression in GTEx (Ardlie et al., 2015) were kept. A total of 9,526 PTVs in 4,666 genes were obtained.
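Two of the PTV filters lend themselves to a compact Python sketch: the genotype-level quality thresholds and the exclusion of truncating variants near the 3' end of a transcript. The function names and the coordinate convention (`nt_from_exon_end` counts from 1 at the last base of the exon) are assumptions made for illustration.

```python
def passes_genotype_filter(gq, dp, alt_dp):
    """Genotype-level thresholds: GQ >= 20, DP > 7, ALT DP > 0.2 * DP."""
    return gq >= 20 and dp > 7 and alt_dp > 0.2 * dp


def in_excluded_terminal_region(exon_number, n_exons, nt_from_exon_end):
    """True for stop-gain/frameshift variants that the filter excludes.

    exon_number:      1-based index of the exon containing the variant.
    n_exons:          number of exons in the transcript.
    nt_from_exon_end: 1 for the last base of the exon, 2 for the one before.
    """
    if exon_number == n_exons:
        return True  # last exon
    # Last 50 nt of the second-to-last exon.
    return exon_number == n_exons - 1 and nt_from_exon_end <= 50
```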
Loss-of-function variants were also predicted using LOFTEE v1.0.3 (https://github.com/konradjk/loftee) (Karczewski et al., 2020). A total of 16,910 high-confidence loss-of-function variants in canonical transcripts were identified. These variants covered most (7,725) of the previously identified PTVs. The results were then combined to obtain the union set of PTVs.
For lncRNA splicing variants, the Ensembl annotation was used first. Splicing variants were filtered in the same way as PTVs, except that the phyloP100way conservation filter was not applied. The remaining splicing variants in the NONCODE annotation were filtered similarly, with GTEx expression replaced by expression data downloaded from the NONCODE database.
Acknowledgments
We thank Weiwei Zhai for thoughtful discussions and valuable comments regarding the population structure analysis. This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB38040300 and XDA12030100), the National Natural Science Foundation of China (91940306, 31871294, 31970647, and 81902519), the National Key R&D Program of China (2017YFC0907503, 2016YFC0901002, and 2016YFC0901702), the 13th Five-year Informatization Plan of Chinese Academy of Sciences (XXH13505-05), and the National Genomics Data Center, China.
Author contributions
T.X. and S.H. conceptualized and supervised the project. P.Z., H.L., Y.L., J.W., Y.N., Q.K., Y.S., and H.Z. conducted analyses. Y.W. and T.X. contributed to sample collection and data generation. P.Z., H.L., Y.Z., Q.K., and T.S. made the web server and database. P.Z., H.L., Y.N., and S.H. drafted the manuscript, and all primary authors reviewed, edited, and approved the manuscript.
Declaration of interests
The authors declare no competing interests.
Supplemental information
Document S1. Figures S1–S10 and Tables S1–S5.
Data S1. ClinVar pathogenic and conflicting variants with AF > 1%, related to Figure 5.
Data S2. AFs in pharmacogenomic and cancer risk loci, related to Figure 5.
Document S2. Article plus supplemental information.
Data and code availability
The raw sequencing data derived from human samples have been deposited at NODE (http://www.biosino.org/node) with accession number: OEP002803. The access and use of the data shall comply with the regulations of the People’s Republic of China on the administration of human genetic resources. To request access, contact Shunmin He (heshunmin@ibp.ac.cn). In addition, processed variants derived from these data have been deposited at http://bigdata.ibp.ac.cn/NyuWa_variants/ and are publicly available as of the date of publication.
This paper does not report original code.
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
References
Alexander et al., 2009
D.H. Alexander, J. Novembre, K. Lange
Fast model-based estimation of ancestry in unrelated individuals
Genome Res., 19 (2009), pp. 1655-1664
Ardlie et al., 2015
K.G. Ardlie, D.S. DeLuca, A.V. Segre, T.J. Sullivan, T.R. Young, E.T. Gelfand, C.A. Trowbridge, J.B. Maller, T. Tukiainen, M. Lek, et al., GTEx Consortium
Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans
Science, 348 (2015), pp. 648-660
Asimit and Zeggini, 2012
J.L. Asimit, E. Zeggini
Imputation of rare variants in next-generation association studies
Hum. Hered., 74 (2012), pp. 196-204
Auton et al., 2015
A. Auton, L.D. Brooks, R.M. Durbin, E.P. Garrison, H.M. Kang, J.O. Korbel, J.L. Marchini, S. McCarthy, G.A. McVean, G.R. Abecasis, 1000 Genomes Project Consortium
A global reference for human genetic variation
Nature, 526 (2015), pp. 68-74
Bergström et al., 2020
A. Bergström, S.A. McCarthy, R. Hui, M.A. Almarri, Q. Ayub, P. Danecek, Y. Chen, S. Felkel, P. Hallast, J. Kamm, et al.
Insights into human genetic variation and population history from 929 diverse genomes
Science, 367 (2020), p. 1339
Bolger et al., 2014
A.M. Bolger, M. Lohse, B. Usadel
Trimmomatic: a flexible trimmer for Illumina sequence data
Bioinformatics, 30 (2014), pp. 2114-2120
Bomba et al., 2017
L. Bomba, K. Walter, N. Soranzo
The impact of rare and low-frequency genetic variants in common disease
Genome Biol., 18 (2017), p. 77
Cao et al., 2020
Y. Cao, L. Li, M. Xu, Z. Feng, X. Sun, J. Lu, Y. Xu, P. Du, T. Wang, R. Hu, et al., ChinaMAP Consortium
The ChinaMAP analytics of deep whole genome sequences in 10,588 individuals
Cell Res., 30 (2020), pp. 717-731
Carmi et al., 2014
S. Carmi, K.Y. Hui, E. Kochav, X. Liu, J. Xue, F. Grady, S. Guha, K. Upadhyay, D. Ben-Avraham, S. Mukherjee, et al.
Sequencing an Ashkenazi reference panel supports population-targeted personal genomics and illuminates Jewish and European origins
Nat. Commun., 5 (2014), p. 4835
Chang et al., 2015
C.C. Chang, C.C. Chow, L.C.A.M. Tellier, S. Vattikuti, S.M. Purcell, J.J. Lee
Second-generation PLINK: rising to the challenge of larger and richer datasets
Gigascience, 4 (2015), p. 7
Chen et al., 2009
J. Chen, H. Zheng, J.X. Bei, L. Sun, W.H. Jia, T. Li, F. Zhang, M. Seielstad, Y.X. Zeng, X. Zhang, J. Liu
Genetic structure of the Han Chinese population revealed by genome-wide SNP variation
Am. J. Hum. Genet., 85 (2009), pp. 775-785
Chheda et al., 2017
H. Chheda, P. Palta, M. Pirinen, S. McCarthy, K. Walter, S. Koskinen, V. Salomaa, M. Daly, R. Durbin, A. Palotie, et al.
Whole-genome view of the consequences of a population bottleneck using 2926 genome sequences from Finland and United Kingdom
Eur. J. Hum. Genet., 25 (2017), pp. 477-484
Chiang et al., 2018
C.W.K. Chiang, S. Mangul, C. Robles, S. Sankararaman
A Comprehensive Map of Genetic Variation in the World’s Largest Ethnic Group-Han Chinese
Mol. Biol. Evol., 35 (2018), pp. 2736-2750
Danecek et al., 2011
P. Danecek, A. Auton, G. Abecasis, C.A. Albers, E. Banks, M.A. DePristo, R.E. Handsaker, G. Lunter, G.T. Marth, S.T. Sherry, et al., 1000 Genomes Project Analysis Group
The variant call format and VCFtools
Bioinformatics, 27 (2011), pp. 2156-2158
Danecek et al., 2021
P. Danecek, J.K. Bonfield, J. Liddle, J. Marshall, V. Ohan, M.O. Pollard, A. Whitwham, T. Keane, S.A. McCarthy, R.M. Davies, H. Li
Twelve years of SAMtools and BCFtools
Gigascience, 10 (2021), p. giab008
Das et al., 2016
S. Das, L. Forer, S. Schönherr, C. Sidore, A.E. Locke, A. Kwong, S.I. Vrieze, E.Y. Chew, S. Levy, M. McGue, et al.
Next-generation genotype imputation service and methods
Nat. Genet., 48 (2016), pp. 1284-1287
Delaneau et al., 2019
O. Delaneau, J.F. Zagury, M.R. Robinson, J.L. Marchini, E.T. Dermitzakis
Accurate, scalable and integrative haplotype estimation
Nat. Commun., 10 (2019), p. 5436
Du et al., 2019
Z. Du, L. Ma, H. Qu, W. Chen, B. Zhang, X. Lu, W. Zhai, X. Sheng, Y. Sun, W. Li, et al.
Whole Genome Analyses of Chinese Population and De Novo Assembly of A Northern Han Genome
Genomics Proteomics Bioinformatics, 17 (2019), pp. 229-247
Edge et al., 2017
P. Edge, V. Bafna, V. Bansal
HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies
Genome Res., 27 (2017), pp. 801-812
Fang et al., 2018
S. Fang, L. Zhang, J. Guo, Y. Niu, Y. Wu, H. Li, L. Zhao, X. Li, X. Teng, X. Sun, et al.
NONCODEV5: a comprehensive annotation database for long non-coding RNAs
Nucleic Acids Res., 46 (D1) (2018), pp. D308-D314
Francioli et al., 2014
L.C. Francioli, A. Menelaou, S.L. Pulit, F. Van Dijk, P.F. Palamara, C.C. Elbers, P.B.T. Neerincx, K. Ye, V. Guryev, W.P. Kloosterman, et al., Genome of the Netherlands Consortium
Whole-genome sequence variation, population structure and demographic history of the Dutch population
Nat. Genet., 46 (2014), pp. 818-825
Gao et al., 2020
Y. Gao, C. Zhang, L. Yuan, Y. Ling, X. Wang, C. Liu, Y. Pan, X. Zhang, X. Ma, Y. Wang, et al., Han100K Initiative
PGG.Han: the Han Chinese genome database and analysis platform
Nucleic Acids Res., 48 (D1) (2020), pp. D971-D976
Hoffmann and Witte, 2015
T.J. Hoffmann, J.S. Witte
Strategies for Imputing and Analyzing Rare Variants in Association Studies
Trends Genet., 31 (2015), pp. 556-563
Huang et al., 2015
J. Huang, B. Howie, S. McCarthy, Y. Memari, K. Walter, J.L. Min, P. Danecek, G. Malerba, E. Trabetti, H.F. Zheng, et al., UK10K Consortium
Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel
Nat. Commun., 6 (2015), p. 8111
Huang et al., 2018
K.L. Huang, R.J. Mashl, Y. Wu, D.I. Ritter, J. Wang, C. Oh, M. Paczkowska, S. Reynolds, M.A. Wyczalkowski, N. Oak, et al., Cancer Genome Atlas Research Network
Pathogenic Germline Variants in 10,389 Adult Cancers
Cell, 173 (2018), pp. 355-370.e14
International Human Genome Sequencing Consortium, 2004
International Human Genome Sequencing Consortium
Finishing the euchromatic sequence of the human genome
Nature, 431 (2004), pp. 931-945
Karczewski et al., 2020
K.J. Karczewski, L.C. Francioli, G. Tiao, B.B. Cummings, J. Alföldi, Q. Wang, R.L. Collins, K.M. Laricchia, A. Ganna, D.P. Birnbaum, et al., Genome Aggregation Database Consortium
The mutational constraint spectrum quantified from variation in 141,456 humans
Nature, 581 (2020), pp. 434-443
Kumar et al., 2018
S. Kumar, G. Stecher, M. Li, C. Knyaz, K. Tamura
MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms
Mol. Biol. Evol., 35 (2018), pp. 1547-1549
Lan et al., 2017
T. Lan, H. Lin, W. Zhu, T.C.A.M. Laurent, M. Yang, X. Liu, J. Wang, J. Wang, H. Yang, X. Xu, X. Guo
Deep whole-genome sequencing of 90 Han Chinese genomes
Gigascience, 6 (2017), pp. 1-7
Landrum et al., 2018
M.J. Landrum, J.M. Lee, M. Benson, G.R. Brown, C. Chao, S. Chitipiralla, B. Gu, J. Hart, D. Hoffman, W. Jang, et al.
ClinVar: improving access to variant interpretations and supporting evidence
Nucleic Acids Res., 46 (D1) (2018), pp. D1062-D1067
Lek et al., 2016
M. Lek, K.J. Karczewski, E.V. Minikel, K.E. Samocha, E. Banks, T. Fennell, A.H. O’Donnell-Luria, J.S. Ware, A.J. Hill, B.B. Cummings, et al., Exome Aggregation Consortium
Analysis of protein-coding genetic variation in 60,706 humans
Nature, 536 (2016), pp. 285-291
Li and Durbin, 2010
H. Li, R. Durbin
Fast and accurate long-read alignment with Burrows-Wheeler transform
Bioinformatics, 26 (2010), pp. 589-595
Li et al., 2008
J.Z. Li, D.M. Absher, H. Tang, A.M. Southwick, A.M. Casto, S. Ramachandran, H.M. Cann, G.S. Barsh, M. Feldman, L.L. Cavalli-Sforza, R.M. Myers
Worldwide human relationships inferred from genome-wide patterns of variation
Science, 319 (2008), pp. 1100-1104
Lin et al., 2018
J.C. Lin, C.T. Fan, C.C. Liao, Y.S. Chen
Taiwan Biobank: making cross-database convergence possible in the Big Data era
Gigascience, 7 (2018), pp. 1-4
Liu et al., 2016
X. Liu, C. Wu, C. Li, E. Boerwinkle
dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs
Hum. Mutat., 37 (2016), pp. 235-241
Liu et al., 2018a
S. Liu, S. Huang, F. Chen, L. Zhao, Y. Yuan, S.S. Francis, L. Fang, Z. Li, L. Lin, R. Liu, et al.
Genomic Analyses from Non-invasive Prenatal Testing Reveal Genetic Associations, Patterns of Viral Infections, and Chinese Population History
Cell, 175 (2018), pp. 347-359.e14
Liu et al., 2018b
Y. Liu, Z. Cao, Y. Wang, Y. Guo, P. Xu, P. Yuan, Z. Liu, Y. He, W. Wei
Genome-wide screening for functional long noncoding RNAs in human cells by Cas9 targeting of splice sites
Nat. Biotechnol. (2018), 10.1038/nbt.4283. Published online November 5, 2018
Loh et al., 2016
P.R. Loh, P. Danecek, P.F. Palamara, C. Fuchsberger, Y.A. Reshef, H.K. Finucane, S. Schoenherr, L. Forer, S. McCarthy, G.R. Abecasis, et al.
Reference-based phasing using the Haplotype Reference Consortium panel
Nat. Genet., 48 (2016), pp. 1443-1448
Maher et al., 2012
M.C. Maher, L.H. Uricchio, D.G. Torgerson, R.D. Hernandez
Population genetics of rare variants and complex diseases
Hum. Hered., 74 (2012), pp. 118-128
Majumder, 2010
P.P. Majumder
The human genetic history of South Asia
Curr. Biol., 20 (2010), pp. R184-R187
Manichaikul et al., 2010
A. Manichaikul, J.C. Mychaleckyj, S.S. Rich, K. Daly, M. Sale, W.M. Chen
Robust relationship inference in genome-wide association studies
Bioinformatics, 26 (2010), pp. 2867-2873
Maretty et al., 2017
L. Maretty, J.M. Jensen, B. Petersen, J.A. Sibbesen, S. Liu, P. Villesen, L. Skov, K. Belling, C. Theil Have, J.M.G. Izarzugaza, et al.
Sequencing and de novo assembly of 150 genomes from Denmark as a population reference
Nature, 548 (2017), pp. 87-91
McCarthy et al., 2016
S. McCarthy, S. Das, W. Kretzschmar, O. Delaneau, A.R. Wood, A. Teumer, H.M. Kang, C. Fuchsberger, P. Danecek, K. Sharp, et al., Haplotype Reference Consortium
A reference panel of 64,976 haplotypes for genotype imputation
Nat. Genet., 48 (2016), pp. 1279-1283
Meyer et al., 2012
M. Meyer, M. Kircher, M.T. Gansauge, H. Li, F. Racimo, S. Mallick, J.G. Schraiber, F. Jay, K. Prüfer, C. de Filippo, et al.
A high-coverage genome sequence from an archaic Denisovan individual
Science, 338 (2012), pp. 222-226
Mirabello et al., 2020
L. Mirabello, B. Zhu, R. Koster, E. Karlins, M. Dean, M. Yeager, M. Gianferante, L.G. Spector, L.M. Morton, D. Karyadi, et al.
Frequency of Pathogenic Germline Variants in Cancer-Susceptibility Genes in Patients With Osteosarcoma
JAMA Oncol., 6 (2020), pp. 724-734
Nagasaki et al., 2015
M. Nagasaki, J. Yasuda, F. Katsuoka, N. Nariai, K. Kojima, Y. Kawai, Y. Yamaguchi-Kabata, J. Yokozawa, I. Danjoh, S. Saito, et al., ToMMo Japanese Reference Panel Project
Rare variant discovery by deep whole-genome sequencing of 1,070 Japanese individuals
Nat. Commun., 6 (2015), p. 8018
Okonechnikov et al., 2016
K. Okonechnikov, A. Conesa, F. García-Alcalde
Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data
Bioinformatics, 32 (2016), pp. 292-294
Özdemir and Dotto, 2017
B.C. Özdemir, G.P. Dotto
Racial Differences in Cancer Susceptibility and Survival: More Than the Color of the Skin?
Trends Cancer, 3 (2017), pp. 181-197
Piton et al., 2013
A. Piton, C. Redin, J.L. Mandel
XLID-Causing Mutations and Associated Genes Challenged in Light of Data From Large-Scale Human Exome Sequencing (vol 93, pg 368, 2013)
Am. J. Hum. Genet., 93 (2013), p. 406
Poplin et al., 2017
R. Poplin, V. Ruano-Rubio, M.A. DePristo, T.J. Fennell, M.O. Carneiro, G.A. Van der Auwera, D.E. Kling, L.D. Gauthier, A. Levy-Moonshine, D. Roazen, et al.
Scaling accurate genetic variant discovery to tens of thousands of samples
bioRxiv (2017), 10.1101/201178
Poznik, 2016
G.D. Poznik
Identifying Y-chromosome haplogroups in arbitrarily large samples of sequenced or genotyped men
bioRxiv (2016), 10.1101/088716
Price et al., 2008
A.L. Price, M.E. Weale, N. Patterson, S.R. Myers, A.C. Need, K.V. Shianna, D. Ge, J.I. Rotter, E. Torres, K.D. Taylor, et al.
Long-range LD can confound genome scans in admixed populations
Am. J. Hum. Genet., 83 (2008), pp. 132-135
author reply 135–139
Prüfer et al., 2014
K. Prüfer, F. Racimo, N. Patterson, F. Jay, S. Sankararaman, S. Sawyer, A. Heinze, G. Renaud, P.H. Sudmant, C. de Filippo, et al.
The complete genome sequence of a Neanderthal from the Altai Mountains
Nature, 505 (2014), pp. 43-49
Qamar et al., 2002
R. Qamar, Q. Ayub, A. Mohyuddin, A. Helgason, K. Mazhar, A. Mansoor, T. Zerjal, C. Tyler-Smith, S.Q. Mehdi
Y-chromosomal DNA variation in Pakistan
Am. J. Hum. Genet., 70 (2002), pp. 1107-1124
Rehm et al., 2015
H.L. Rehm, J.S. Berg, L.D. Brooks, C.D. Bustamante, J.P. Evans, M.J. Landrum, D.H. Ledbetter, D.R. Maglott, C.L. Martin, R.L. Nussbaum, et al., ClinGen
ClinGen--the Clinical Genome Resource
N. Engl. J. Med., 372 (2015), pp. 2235-2242
Saint Pierre and Génin, 2014
A. Saint Pierre, E. Génin
How important are rare variants in common disease?
Brief. Funct. Genomics, 13 (2014), pp. 353-361
Sherry et al., 2001
S.T. Sherry, M.H. Ward, M. Kholodov, J. Baker, L. Phan, E.M. Smigielski, K. Sirotkin
dbSNP: the NCBI database of genetic variation
Nucleic Acids Res., 29 (2001), pp. 308-311
Sud et al., 2017
A. Sud, B. Kinnersley, R.S. Houlston
Genome-wide association studies of cancer: current insights and future perspectives
Nat. Rev. Cancer, 17 (2017), pp. 692-704
Taliun et al., 2019
D. Taliun, D.N. Harris, M.D. Kessler, J. Carlson, Z.A. Szpiech, R. Torres, S.A.G. Taliun, A. Corvelo, S.M. Gogarten, H.M. Kang, et al.
Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program
bioRxiv (2019), 10.1101/563866
Tang et al., 2008
H. Tang, S. Choudhry, R. Mei, M. Morgan, W. Rodriguez-Cintron, E.G. Burchard, N.J. Risch
Long-range LD can confound genome scans in admixed populations - Response to Price et al
Am. J. Hum. Genet., 83 (2008), pp. 135-139
Timpson et al., 2018
N.J. Timpson, C.M.T. Greenwood, N. Soranzo, D.J. Lawson, J.B. Richards
Genetic architecture: the shape of the genetic contribution to human traits and disease
Nat. Rev. Genet., 19 (2018), pp. 110-124
Toure et al., 2016
A. Toure, M. Cabral, A. Niang, C. Diop, A. Garat, L. Humbert, M. Fall, A. Diouf, F. Broly, M. Lhermitte, D. Allorge
Prevention of isoniazid toxicity by NAT2 genotyping in Senegalese tuberculosis patients
Toxicol. Rep., 3 (2016), pp. 826-831
Ulitsky et al., 2011
I. Ulitsky, A. Shkumatava, C.H. Jan, H. Sive, D.P. Bartel
Conserved function of lincRNAs in vertebrate embryonic development despite rapid sequence evolution
Cell, 147 (2011), pp. 1537-1550
van Leeuwen et al., 2015
E.M. van Leeuwen, L.C. Karssen, J. Deelen, A. Isaacs, C. Medina-Gomez, H. Mbarek, A. Kanterakis, S. Trompet, I. Postmus, N. Verweij, et al., Genome of The Netherlands Consortium
Genome of The Netherlands population-specific imputations identify an ABCA6 variant associated with cholesterol levels
Nat. Commun., 6 (2015), p. 6065
Vatsis et al., 1991
K.P. Vatsis, K.J. Martell, W.W. Weber
Diverse point mutations in the human gene for polymorphic N-acetyltransferase
Proc. Natl. Acad. Sci. USA, 88 (1991), pp. 6333-6337
Wall et al., 2019
J.D. Wall, E.W. Stawiski, A. Ratan, H.L. Kim, C. Kim, R. Gupta, K. Suryamohan, E.S. Gusareva, R.W. Purbojati, T. Bhangale, et al., GenomeAsia100K Consortium
The GenomeAsia 100K Project enables genetic discoveries across Asia
Nature, 576 (2019), pp. 106-111
Walter et al., 2015
K. Walter, J.L. Min, J. Huang, L. Crooks, Y. Memari, S. McCarthy, J.R.B. Perry, C. Xu, M. Futema, D. Lawson, et al., UK10K Consortium
The UK10K project identifies rare variants in health and disease
Nature, 526 (2015), pp. 82-90
Wang et al., 2010
K. Wang, M. Li, H. Hakonarson
ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data
Nucleic Acids Res., 38 (2010), p. e164
Weir and Cockerham, 1984
B.S. Weir, C.C. Cockerham
Estimating F-Statistics for the Analysis of Population Structure
Evolution, 38 (1984), pp. 1358-1370
Wen et al., 2004
B. Wen, H. Li, D. Lu, X. Song, F. Zhang, Y. He, F. Li, Y. Gao, X. Mao, L. Zhang, et al.
Genetic evidence supports demic diffusion of Han culture
Nature, 431 (2004), pp. 302-305
Wu et al., 2019
D. Wu, J. Dou, X. Chai, C. Bellis, A. Wilm, C.C. Shih, W.W.J. Soon, N. Bertin, C.B. Lin, C.C. Khor, et al., SG10K Consortium
Large-Scale Whole-Genome Sequencing of Three Diverse Asian Populations in Singapore
Cell, 179 (2019), pp. 736-749.e15
Xu et al., 2009
S. Xu, X. Yin, S. Li, W. Jin, H. Lou, L. Yang, X. Gong, H. Wang, Y. Shen, X. Pan, et al.
Genomic dissection of population substructure of Han Chinese and its implication in association studies
Am. J. Hum. Genet., 85 (2009), pp. 762-774
Yan et al., 2014
S. Yan, C.C. Wang, H.X. Zheng, W. Wang, Z.D. Qin, L.H. Wei, Y. Wang, X.D. Pan, W.Q. Fu, Y.G. He, et al.
Y chromosomes of 40% Chinese descend from three Neolithic super-grandfathers
PLoS ONE, 9 (2014), p. e105691
Zhang et al., 2020
F. Zhang, M. Flickinger, S.A.G. Taliun, G.R. Abecasis, L.J. Scott, S.A. McCarroll, C.N. Pato, M. Boehnke, H.M. Kang, InPSYght Psychiatric Genetics Consortium
Ancestry-agnostic estimation of DNA sample contamination from sequence reads
Genome Res., 30 (2020), pp. 185-194
Zhao et al., 2014
H. Zhao, Z. Sun, J. Wang, H. Huang, J.P. Kocher, L. Wang
CrossMap: a versatile tool for coordinate conversion between genome assemblies
Bioinformatics, 30 (2014), pp. 1006-1007
Zook et al., 2014
J.M. Zook, B. Chapman, J. Wang, D. Mittelman, O. Hofmann, W. Hide, M. Salit
Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls
Nat. Biotechnol., 32 (2014), pp. 246-251
These authors contributed equally