核苷酸替换模型选择对具有良好先验的系统发育贝叶斯推断并非必要。

IF 6.1 1区生物学 Q1 EVOLUTIONARY BIOLOGY Systematic Biology Pub Date : 2023-12-30 DOI:10.1093/sysbio/syad041

Luiza Guimarães Fabreti, Sebastian Höhna

{"title":"核苷酸替换模型选择对具有良好先验的系统发育贝叶斯推断并非必要。","authors":"Luiza Guimarães Fabreti, Sebastian Höhna","doi":"10.1093/sysbio/syad041","DOIUrl":null,"url":null,"abstract":"Model selection aims to choose the most adequate model for the statistical analysis at hand. The model must be complex enough to capture the complexity of the data but should be simple enough not to overfit. In phylogenetics, the most common model selection scenario concerns selecting an adequate substitution and partition model for sequence evolution to infer a phylogenetic tree. Previously, several studies showed that substitution model under-parameterization can bias phylogenetic studies. Here, we explored the impact of substitution model over-parameterization in a Bayesian statistical framework. We performed simulations under the simplest substitution model, the Jukes-Cantor model, and compare posterior estimates of phylogenetic tree topologies and tree length under the true model to the most complex model, the $\\text{GTR}+\\Gamma+\\text{I}$ substitution model, including over-splitting the data into additional subsets (i.e., applying partitioned models). We explored 4 choices of prior distributions: the default substitution model priors of MrBayes, BEAST2, and RevBayes and a newly devised prior choice (Tame). Our results show that Bayesian inference of phylogeny is robust to substitution model over-parameterization and over-partitioning but only under our new prior settings. All 3 current default priors introduced biases for the estimated tree length. We conclude that substitution and partition model selection are superfluous steps in Bayesian phylogenetic inference pipelines if well-behaved prior distributions are applied and more effort should focus on more complex and biologically realistic substitution models.","PeriodicalId":22120,"journal":{"name":"Systematic Biology","volume":" ","pages":"1418-1432"},"PeriodicalIF":6.1000,"publicationDate":"2023-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Nucleotide Substitution Model Selection Is Not Necessary for Bayesian Inference of Phylogeny With Well-Behaved Priors.\",\"authors\":\"Luiza Guimarães Fabreti, Sebastian Höhna\",\"doi\":\"10.1093/sysbio/syad041\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Model selection aims to choose the most adequate model for the statistical analysis at hand. The model must be complex enough to capture the complexity of the data but should be simple enough not to overfit. In phylogenetics, the most common model selection scenario concerns selecting an adequate substitution and partition model for sequence evolution to infer a phylogenetic tree. Previously, several studies showed that substitution model under-parameterization can bias phylogenetic studies. Here, we explored the impact of substitution model over-parameterization in a Bayesian statistical framework. We performed simulations under the simplest substitution model, the Jukes-Cantor model, and compare posterior estimates of phylogenetic tree topologies and tree length under the true model to the most complex model, the $\\\\text{GTR}+\\\\Gamma+\\\\text{I}$ substitution model, including over-splitting the data into additional subsets (i.e., applying partitioned models). We explored 4 choices of prior distributions: the default substitution model priors of MrBayes, BEAST2, and RevBayes and a newly devised prior choice (Tame). Our results show that Bayesian inference of phylogeny is robust to substitution model over-parameterization and over-partitioning but only under our new prior settings. All 3 current default priors introduced biases for the estimated tree length. We conclude that substitution and partition model selection are superfluous steps in Bayesian phylogenetic inference pipelines if well-behaved prior distributions are applied and more effort should focus on more complex and biologically realistic substitution models.\",\"PeriodicalId\":22120,\"journal\":{\"name\":\"Systematic Biology\",\"volume\":\" \",\"pages\":\"1418-1432\"},\"PeriodicalIF\":6.1000,\"publicationDate\":\"2023-12-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Systematic Biology\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1093/sysbio/syad041\",\"RegionNum\":1,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"EVOLUTIONARY BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Systematic Biology","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/sysbio/syad041","RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EVOLUTIONARY BIOLOGY","Score":null,"Total":0}

引用次数: 0

摘要

模型选择的目的是为当前的统计分析选择最合适的模型。模型必须足够复杂，以捕捉数据的复杂性，但也应足够简单，不至于过度拟合。在系统发育学中，最常见的模型选择情况是为序列进化选择一个合适的替换和分割模型，以推断系统发生树。之前的一些研究表明，替换模型参数化不足会使系统发育研究出现偏差。在此，我们在贝叶斯统计框架下探讨了替换模型参数过高的影响。我们在最简单的替换模型--Jukes-Cantor模型下进行了模拟，并比较了真实模型与最复杂模型--$\text{GTR}+\Gamma+\text{I}$替换模型下的系统发生树拓扑和树长的后验估计值，包括将数据过度分割成额外的子集（即应用分区模型）。我们探索了 4 种先验分布选择：MrBayes、BEAST2 和 RevBayes 的默认替换模型先验，以及一种新设计的先验选择（Tame）。我们的研究结果表明，贝叶斯系统发育推断对替代模型过度参数化和过度分区具有鲁棒性，但只有在我们新的先验设置下才具有这种鲁棒性。目前所有 3 个默认先验都会对估计的树长产生偏差。我们的结论是，如果应用良好的先验分布，替代和分区模型选择是贝叶斯系统发育推断流水线中多余的步骤，更多的精力应集中在更复杂和更符合生物现实的替代模型上。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Nucleotide Substitution Model Selection Is Not Necessary for Bayesian Inference of Phylogeny With Well-Behaved Priors.

Model selection aims to choose the most adequate model for the statistical analysis at hand. The model must be complex enough to capture the complexity of the data but should be simple enough not to overfit. In phylogenetics, the most common model selection scenario concerns selecting an adequate substitution and partition model for sequence evolution to infer a phylogenetic tree. Previously, several studies showed that substitution model under-parameterization can bias phylogenetic studies. Here, we explored the impact of substitution model over-parameterization in a Bayesian statistical framework. We performed simulations under the simplest substitution model, the Jukes-Cantor model, and compare posterior estimates of phylogenetic tree topologies and tree length under the true model to the most complex model, the $\text{GTR}+\Gamma+\text{I}$ substitution model, including over-splitting the data into additional subsets (i.e., applying partitioned models). We explored 4 choices of prior distributions: the default substitution model priors of MrBayes, BEAST2, and RevBayes and a newly devised prior choice (Tame). Our results show that Bayesian inference of phylogeny is robust to substitution model over-parameterization and over-partitioning but only under our new prior settings. All 3 current default priors introduced biases for the estimated tree length. We conclude that substitution and partition model selection are superfluous steps in Bayesian phylogenetic inference pipelines if well-behaved prior distributions are applied and more effort should focus on more complex and biologically realistic substitution models.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Systematic Biology 生物-进化生物学

CiteScore

13.00

自引率

7.70%

发文量

审稿时长

6-12 weeks

期刊介绍： Systematic Biology is the bimonthly journal of the Society of Systematic Biologists. Papers for the journal are original contributions to the theory, principles, and methods of systematics as well as phylogeny, evolution, morphology, biogeography, paleontology, genetics, and the classification of all living things. A Points of View section offers a forum for discussion, while book reviews and announcements of general interest are also featured.