Christopher J Adams, Mitchell Conery, Benjamin J Auerbach, Shane T Jensen, Iain Mathieson, Benjamin F Voight
PLoS Genetics 19(7): e1010807. Published 2023-07-07. DOI: 10.1371/journal.pgen.1010807 (https://doi.org/10.1371/journal.pgen.1010807). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10355397/pdf/
Citation count: 0
Abstract
Regularized sequence-context mutational trees capture variation in mutation rates across the human genome.
Germline mutation is the mechanism by which genetic variation in a population is created. Inferences derived from mutation rate models are fundamental to many population genetics methods. Previous models have demonstrated that nucleotides flanking polymorphic sites (the local sequence context) explain variation in the probability that a site is polymorphic. However, these models face limitations as the local sequence context window expands: a lack of robustness to data sparsity at typical sample sizes, a lack of regularization to generate parsimonious models, and a lack of quantified uncertainty in estimated rates to facilitate comparison between models. To address these limitations, we developed Baymer, a regularized Bayesian hierarchical tree model that captures the heterogeneous effect of sequence contexts on polymorphism probabilities. Baymer implements an adaptive Metropolis-within-Gibbs Markov chain Monte Carlo sampling scheme to estimate the posterior distributions of sequence-context-based probabilities that a site is polymorphic. We show that Baymer accurately infers polymorphism probabilities and well-calibrated posterior distributions, robustly handles data sparsity, appropriately regularizes to return parsimonious models, and scales computationally at least up to 9-mer context windows. We demonstrate application of Baymer in three ways: first, by identifying differences in polymorphism probabilities between continental populations in the 1000 Genomes Phase 3 dataset; second, in a sparse data setting, by examining the use of polymorphism models as a proxy for de novo mutation probabilities as a function of variant age, sequence context window size, and demographic history; and third, by comparing model concordance between different great ape species. We find a shared context-dependent mutation rate architecture underlying our models, enabling a transfer-learning-inspired strategy for modeling germline mutations. In summary, Baymer is an accurate polymorphism probability estimation algorithm that automatically adapts to data sparsity at different sequence context levels, thereby making efficient use of the available data.
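To make the idea of adapting to sparsity across context levels concrete, the following is a minimal sketch and not Baymer's actual model or code. It assumes a toy sequence, a hypothetical set of polymorphic positions, and a simple Beta prior centered on the parent (1-mer) rate in place of the paper's hierarchical tree and MCMC sampler. The function names and the `strength` parameter are illustrative assumptions.

```python
from collections import Counter


def context_counts(seq, polymorphic, k):
    """Count all sites and polymorphic sites per k-mer context.

    k must be odd; each context is centered on the site of interest.
    """
    flank = k // 2
    sites, poly = Counter(), Counter()
    for i in range(flank, len(seq) - flank):
        ctx = seq[i - flank:i + flank + 1]
        sites[ctx] += 1
        if i in polymorphic:
            poly[ctx] += 1
    return sites, poly


def shrunk_estimate(n_poly, n_sites, parent_rate, strength=10.0):
    """Posterior mean under a Beta prior centered on the parent rate.

    The prior is Beta(parent_rate * strength, (1 - parent_rate) * strength),
    an assumption made for this sketch. With few observations the estimate
    stays close to parent_rate; with many it approaches n_poly / n_sites.
    """
    return (n_poly + parent_rate * strength) / (n_sites + strength)


def three_mer_estimate(seq, polymorphic, ctx, strength=10.0):
    """Estimate the polymorphism probability of a 3-mer context,
    shrinking toward the rate of its central nucleotide (its 1-mer parent)."""
    sites1, poly1 = context_counts(seq, polymorphic, 1)
    sites3, poly3 = context_counts(seq, polymorphic, 3)
    center = ctx[1]
    parent_rate = poly1[center] / sites1[center]
    return shrunk_estimate(poly3[ctx], sites3[ctx], parent_rate, strength)


# Toy data (hypothetical): position 2, a G, is polymorphic.
seq = "ACGTACGTAC"
polymorphic = {2}
observed = three_mer_estimate(seq, polymorphic, "CGT")
unseen = three_mer_estimate(seq, polymorphic, "AGA")
```

In this toy setup, a 3-mer that never occurs in the data (`"AGA"`) receives exactly the rate of its central nucleotide, while an observed 3-mer moves away from that rate only as far as its own counts justify. This is the qualitative behavior the abstract attributes to regularization under sparsity.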
About the journal:
PLOS Genetics is run by an international Editorial Board, headed by the Editors-in-Chief, Greg Barsh (HudsonAlpha Institute of Biotechnology, and Stanford University School of Medicine) and Greg Copenhaver (The University of North Carolina at Chapel Hill).
Articles published in PLOS Genetics are archived in PubMed Central and cited in PubMed.