An independent base composition of each rate class for improved likelihood-based phylogeny estimation: The 5rf model
Peter J Waddell, Remco R Bouckaert
bioRxiv - Evolutionary Biology, 2024-09-08
DOI: https://doi.org/10.1101/2024.09.03.610719
Citation count: 0
Abstract
The combination of a time-reversible Markov process with a 'hidden' mixture of gamma-distributed relative site rates plus invariant sites has become the most favoured option for likelihood and other probabilistic models of nucleotide evolution (e.g., tr4gi, which approximates the gamma distribution with four rate classes). However, these models assume a homogeneous and stationary distribution of nucleotide (character or base) frequencies. Here, we explore the potential benefits and pitfalls of allowing each rate category (rate class) of a 4gi mixture model to have its own base frequencies. This is achieved by starting each of the five rate classes (four gamma classes plus the invariant class), at the tree's root, with its own free choice of nucleotide frequencies, creating a 4gi5rf model, or 5rf model for short. We assess the practical identifiability of this approach with a BEAST 2 implementation, aiming to determine whether it can accurately estimate credibility intervals and expected values for a wide range of plausible parameter values. Practical identifiability, as distinguished from mathematical identifiability, gauges the model's ability to identify parameters in real-world scenarios with finite data, rather than in theory with infinite data. One of the most common types of phylogenetic data is mitochondrial DNA (mtDNA) protein-coding sequence. It is often assumed that current models analyse such data robustly and that models with higher likelihood or posterior probability do better. However, this study shows that vertebrate mtDNA remains a very difficult type of data to model fully, and that dramatically higher likelihoods do not mean a model is measurably more accurate at recovering key parameters of biological interest (e.g., monophyletic groups, their support and their ages). The 4gi5rf model considerably improves marginal likelihoods and seems to reverse some apparent errors exacerbated by the 4gi model, while introducing others. Problems appear to be linked to non-stationary DNA repair processes that alter the mutation/substitution spectra across lineages and time. We also show that such problems are not unique to mtDNA and are encountered when analysing nuclear sequences. Non-stationarity of DNA repair processes, and hence of mutation/substitution spectra, thus poses an active challenge to obtaining reliable inferences of relationships and divergence times near the root of placental mammals, for example. An open-source implementation is available under the LGPL 3.0 license in the beastbooster package for BEAST 2, available from https://github.com/rbouckaert/beastbooster.
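As a rough illustration of the idea of giving each rate class its own root base frequencies, the site likelihood of a mixture model can be sketched for a rooted two-taxon tree. This is a minimal sketch based only on the description in the abstract, not the beastbooster implementation: it substitutes the Jukes-Cantor (JC69) process for a general time-reversible one so that transition probabilities have a closed form, and the rates, class weights, root frequencies and branch lengths are all made-up illustrative values.

```python
import math

BASES = "ACGT"


def jc69_prob(a, b, distance):
    """JC69 probability that base a is observed as base b after
    `distance` expected substitutions per site (assumed stand-in for GTR)."""
    decay = math.exp(-4.0 * distance / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def site_likelihood(x, y, t1, t2, rates, weights, root_freqs):
    """Likelihood of bases x and y at the two tips of a rooted two-taxon tree.

    Each rate class k has its own relative rate, mixture weight and
    root base frequencies; the site likelihood is the weighted sum
    over classes of the sum over possible root bases.
    """
    total = 0.0
    for rate, weight, freqs in zip(rates, weights, root_freqs):
        class_lik = sum(
            freqs[a] * jc69_prob(a, x, rate * t1) * jc69_prob(a, y, rate * t2)
            for a in BASES
        )
        total += weight * class_lik
    return total


# Hypothetical parameter values, chosen only for illustration.
p_inv = 0.2                                  # weight of the invariant class
rates = [0.0, 0.1, 0.5, 1.0, 2.4]            # invariant class + four "gamma" classes (mean 1)
weights = [p_inv] + [(1.0 - p_inv) / 4] * 4  # equal weight for each non-invariant class
root_freqs = [                               # one frequency vector per rate class
    {"A": 0.40, "C": 0.30, "G": 0.10, "T": 0.20},
    {"A": 0.35, "C": 0.30, "G": 0.10, "T": 0.25},
    {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},
    {"A": 0.25, "C": 0.25, "G": 0.20, "T": 0.30},
    {"A": 0.20, "C": 0.30, "G": 0.25, "T": 0.25},
]

if __name__ == "__main__":
    lik = site_likelihood("A", "G", 0.1, 0.2, rates, weights, root_freqs)
    print(f"L(A, G) = {lik:.6f}")
```

Because the invariant class has rate zero, it contributes only to constant site patterns, with probability equal to its root frequency for the observed base. Summing `site_likelihood` over all 16 tip patterns gives 1, which is a quick sanity check that the mixture is a proper probability distribution.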