An independent base composition of each rate class for improved likelihood-based phylogeny estimation: The 5rf model

bioRxiv - Evolutionary Biology, Pub Date: 2024-09-08, DOI: 10.1101/2024.09.03.610719

Peter J Waddell, Remco R Bouckaert



Abstract

The combination of a time reversible Markov process with a 'hidden' mixture of gamma distributed relative site rates plus invariant sites has become the most favoured option for likelihood and other probabilistic models of nucleotide evolution (e.g., tr4gi, which approximates a gamma with four rate classes). However, these models assume a homogeneous and stationary distribution of nucleotide (character or base) frequencies. Here, we explore the potential benefits and pitfalls of allowing each rate category (rate class) of a 4gi mixture model to have its own base frequencies. This is achieved by starting each of the five rate classes, at the tree's root, with its own free choice of nucleotide frequencies to create a 4gi5rf model, or a 5rf model in shorthand. We assess the practical identifiability of this approach with a BEAST 2 implementation, aiming to determine whether it can accurately estimate credible intervals and expected values for a wide range of plausible parameter values. Practical identifiability, as distinguished from mathematical identifiability, gauges the model's ability to identify parameters in real-world scenarios, as opposed to theoretically with infinite data. One of the most common types of phylogenetic data is mitochondrial DNA (mtDNA) protein coding sequence. It is often assumed that current models robustly analyse such data and that models with higher likelihood/posterior probability do better. However, this study shows that vertebrate mtDNA remains a very difficult type of data to fully model, and that dramatically higher likelihoods do not mean a model is measurably more accurate with respect to recovering key parameters of biological interest (e.g., monophyletic groups, their support and their ages). The 4gi5rf model considerably improves marginal likelihoods and seems to reverse some apparent errors exacerbated by the 4gi model, while introducing others. Problems appear to be linked to non-stationary DNA repair processes that alter the mutation/substitution spectra across lineages and time. We also show that such problems are not unique to mtDNA and are encountered in analysing nuclear sequences. Non-stationarity of the mutation/substitution spectra shaped by DNA repair processes thus poses an active challenge to obtaining reliable inferences of relationships and divergence times near the root of placental mammals, for example. An open source implementation is available under the LGPL 3.0 license in the beastbooster package for BEAST 2, available from https://github.com/rbouckaert/beastbooster.
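The mixture structure described in the abstract can be illustrated with a minimal sketch. It is not the beastbooster implementation. It assumes a two-taxon tree and a JC69 substitution process (a stand-in for the general time reversible process), and every rate, weight and root frequency below is a hypothetical value chosen for illustration. Each of the five classes (one invariant, four variable) starts at the root with its own base frequencies, and the site likelihood is the weighted sum over classes.

```python
import math

BASES = "ACGT"


def jc69_prob(x, y, distance):
    """JC69 probability of observing base y after starting from base x.

    `distance` is the expected number of substitutions per site.
    JC69 is an assumption of this sketch, not the model used in the paper.
    """
    decay = math.exp(-4.0 * distance / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def class_site_likelihood(root_freqs, rate, left, right, t_left, t_right):
    """Likelihood of one site pattern (left, right) for a single rate class.

    The class starts at the root with its own base frequencies `root_freqs`
    and evolves along both branches at relative rate `rate`.
    """
    return sum(
        root_freqs[x]
        * jc69_prob(x, left, rate * t_left)
        * jc69_prob(x, right, rate * t_right)
        for x in BASES
    )


def mixture_site_likelihood(classes, left, right, t_left, t_right):
    """Weighted sum of per-class likelihoods over all rate classes."""
    return sum(
        weight * class_site_likelihood(freqs, rate, left, right, t_left, t_right)
        for weight, rate, freqs in classes
    )


# Hypothetical classes: (weight, relative rate, root base frequencies).
# The first class is invariant (rate 0); the other four stand in for the
# discretised gamma classes. None of these numbers come from the paper.
EXAMPLE_CLASSES = [
    (0.2, 0.0, {"A": 0.40, "C": 0.30, "G": 0.10, "T": 0.20}),
    (0.2, 0.1, {"A": 0.35, "C": 0.25, "G": 0.15, "T": 0.25}),
    (0.2, 0.5, {"A": 0.30, "C": 0.30, "G": 0.10, "T": 0.30}),
    (0.2, 1.0, {"A": 0.25, "C": 0.35, "G": 0.05, "T": 0.35}),
    (0.2, 2.4, {"A": 0.20, "C": 0.40, "G": 0.05, "T": 0.35}),
]

if __name__ == "__main__":
    # Likelihood of observing A in one taxon and C in the other.
    print(mixture_site_likelihood(EXAMPLE_CLASSES, "A", "C", 0.1, 0.2))
```

In this sketch the invariant class contributes only to site patterns in which both taxa share the same base, with a weight given by that class's own root frequency. A standard 4gi model would correspond to giving all five classes identical frequencies.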


