Regularized sequence-context mutational trees capture variation in mutation rates across the human genome.

PLoS Genetics 19(7): e1010807 | IF 4.5 | Zone 2 (Biology) | Q1 (Agricultural and Biological Sciences) | Pub Date: 2023-07-07 | eCollection Date: 2023-07-01 | DOI: 10.1371/journal.pgen.1010807
Christopher J Adams, Mitchell Conery, Benjamin J Auerbach, Shane T Jensen, Iain Mathieson, Benjamin F Voight
Citations: 0

Abstract

Germline mutation is the mechanism by which genetic variation in a population is created. Inferences derived from mutation rate models are fundamental to many population genetics methods. Previous models have demonstrated that nucleotides flanking polymorphic sites (the local sequence context) explain variation in the probability that a site is polymorphic. However, these models face limitations as the size of the local sequence context window expands. These include a lack of robustness to data sparsity at typical sample sizes, a lack of regularization to generate parsimonious models, and a lack of quantified uncertainty in estimated rates to facilitate comparison between models. To address these limitations, we developed Baymer, a regularized Bayesian hierarchical tree model that captures the heterogeneous effect of sequence contexts on polymorphism probabilities. Baymer implements an adaptive Metropolis-within-Gibbs Markov Chain Monte Carlo sampling scheme to estimate the posterior distributions of sequence-context-based probabilities that a site is polymorphic. We show that Baymer accurately infers polymorphism probabilities and well-calibrated posterior distributions, robustly handles data sparsity, appropriately regularizes to return parsimonious models, and scales computationally at least up to 9-mer context windows. We demonstrate application of Baymer in three ways: first, by identifying differences in polymorphism probabilities between continental populations in the 1000 Genomes Phase 3 dataset; second, in a sparse data setting, by examining the use of polymorphism models as a proxy for de novo mutation probabilities as a function of variant age, sequence context window size, and demographic history; and third, by comparing model concordance between different great ape species. We find a shared context-dependent mutation rate architecture underlying our models, enabling a transfer-learning-inspired strategy for modeling germline mutations. In summary, Baymer is an accurate polymorphism probability estimation algorithm that automatically adapts to data sparsity at different sequence context levels, thereby making efficient use of the available data.
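
To make the idea of nested sequence contexts concrete, the following is a minimal sketch in Python. It is not Baymer's model or code: the abstract describes a Bayesian hierarchical tree fitted by adaptive Metropolis-within-Gibbs MCMC, whereas this sketch swaps in a simple pseudo-count shrinkage, in which each k-mer estimate is pulled toward the estimate of its parent (k-2)-mer. The function names `context_counts` and `shrunken_probs` and the `strength` parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter


def context_counts(seq, poly_sites, k):
    """Count occurrences of each k-mer context (k odd, centered on a site)
    and how many of those occurrences have a polymorphic central site."""
    flank = k // 2
    total, poly = Counter(), Counter()
    for i in range(flank, len(seq) - flank):
        ctx = seq[i - flank:i + flank + 1]
        total[ctx] += 1
        if i in poly_sites:
            poly[ctx] += 1
    return total, poly


def shrunken_probs(seq, poly_sites, max_k=5, strength=10.0):
    """Estimate P(site is polymorphic | context) for 1-mers up to max_k-mers.

    Illustrative assumption: each k-mer estimate is shrunk toward its parent
    (k-2)-mer estimate using `strength` pseudo-observations, so sparsely
    observed contexts fall back on the coarser level of the context tree.
    """
    poly_sites = set(poly_sites)
    probs = {}
    for k in range(1, max_k + 1, 2):
        total, poly = context_counts(seq, poly_sites, k)
        for ctx, n in total.items():
            if k == 1:
                # Root level: raw polymorphism frequency for the central base.
                probs[ctx] = poly[ctx] / n
            else:
                # Parent context: drop one flanking base from each side.
                parent = probs[ctx[1:-1]]
                probs[ctx] = (poly[ctx] + strength * parent) / (n + strength)
    return probs


if __name__ == "__main__":
    # Toy sequence with a single polymorphic site at index 2 (a G).
    estimates = shrunken_probs("ACGACGACG", [2], max_k=3)
    for context in sorted(estimates, key=lambda c: (len(c), c)):
        print(context, round(estimates[context], 4))
```

In this toy input the 3-mer `CGA` is seen twice and is polymorphic once, so its raw frequency would be 1/2; with `strength=10.0` the estimate is pulled toward the parent `G` frequency of 1/3 and becomes 13/36. A full Bayesian treatment, as the abstract describes, would additionally yield posterior uncertainty for each context rather than a single point estimate.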

Source journal: PLoS Genetics
Category: Biology, Genetics
CiteScore: 8.10
Self-citation rate: 2.20%
Articles published: 438
Review time: 1 month
About the journal: PLOS Genetics is run by an international Editorial Board, headed by the Editors-in-Chief, Greg Barsh (HudsonAlpha Institute of Biotechnology, and Stanford University School of Medicine) and Greg Copenhaver (The University of North Carolina at Chapel Hill). Articles published in PLOS Genetics are archived in PubMed Central and cited in PubMed.