
Systematic Biology: Latest Articles

Estimating waiting distances between genealogy changes under a Multi-Species Extension of the Sequentially Markov Coalescent.
IF 5.7 Zone 1 (Biology) Q1 EVOLUTIONARY BIOLOGY Pub Date: 2025-09-09 DOI: 10.1093/sysbio/syaf059
Patrick F McKenzie, Deren A R Eaton

Genomes are composed of a mosaic of segments inherited from different ancestors, each separated by past recombination events. Consequently, genealogical relationships among multiple genomes vary spatially across different genomic regions. Genealogical variation among unlinked (uncorrelated) genomic regions is well described for either a single population (coalescent) or multiple structured populations (multispecies coalescent). However, the expected similarity among genealogies at linked regions of a genome is less well characterized. Recently, an analytical solution was derived for the distribution of the waiting distance for a change in the genealogical tree spatially across a genome for a single population with constant effective population size. Here we describe a generalization of this result, in terms of the distribution of waiting distances between changes in genealogical trees and topologies for multiple structured populations with branch-specific effective population sizes (i.e., under the multispecies coalescent). We implemented our model in the Python package ipcoal and validated its accuracy against stochastic coalescent simulations. Using a novel likelihood framework we show that tree and topology-change waiting distances in an ARG can be used to fit species tree model parameters, demonstrating an application of our model for developing new methods for phylogenetic inference. The Multi-Species Sequentially Markov Coalescent (MS-SMC) model presented here represents a major advance for linking local ancestry inference to hierarchical demographic models.
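
As a rough illustration of what a waiting distance is, the sketch below treats recombination along the genome as a Poisson process, so that the distance to the next event that changes the genealogy is exponentially distributed. It is not the MS-SMC solution and does not use the ipcoal API. The function names, the `p_change` argument (the probability that a recombination event alters the tree, which the paper derives analytically and which is simply supplied by hand here), and all numeric values are assumptions made for the example.

```python
import random


def change_rate(total_branch_length, recomb_rate, p_change=1.0):
    """Per-site rate of recombination events that change the genealogy.

    total_branch_length: summed branch lengths of the current tree (generations)
    recomb_rate: recombination rate per site per generation
    p_change: ASSUMED probability that an event changes the tree
              (hypothetical input; not computed from a model here)
    """
    if not 0.0 < p_change <= 1.0:
        raise ValueError("p_change must be in (0, 1]")
    return recomb_rate * total_branch_length * p_change


def expected_waiting_distance(total_branch_length, recomb_rate, p_change=1.0):
    """Mean distance (in sites) to the next genealogy change."""
    return 1.0 / change_rate(total_branch_length, recomb_rate, p_change)


def sample_waiting_distances(total_branch_length, recomb_rate,
                             p_change=1.0, n=1000, seed=0):
    """Draw n exponential waiting distances with a fixed seed."""
    rng = random.Random(seed)
    rate = change_rate(total_branch_length, recomb_rate, p_change)
    return [rng.expovariate(rate) for _ in range(n)]


if __name__ == "__main__":
    # Hypothetical tree: 4e5 generations of total branch length, r = 1e-8.
    draws = sample_waiting_distances(4e5, 1e-8, p_change=0.5)
    print(expected_waiting_distance(4e5, 1e-8, p_change=0.5))
    print(sum(draws) / len(draws))
```

Lowering `p_change` lengthens the expected distance, which matches the intuition that topology changes (a subset of tree changes) are spaced further apart than tree changes in general.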

Citations: 0
The comparative analysis of lineage-pair traits.
IF 5.7 Zone 1 (Biology) Q1 EVOLUTIONARY BIOLOGY Pub Date: 2025-09-05 DOI: 10.1093/sysbio/syaf061
Sean A S Anderson, Sachin Kaushik, Daniel R Matute

For many questions in ecology and evolution, the most relevant data to consider are attributes of lineage pairs. Comparative tests for causal relationships among traits like 'diet niche overlap', 'divergence time', and 'strength of reproductive isolation (RI)' - measured for pairwise combinations of related species or populations - have led to several groundbreaking insights, but the correct statistical approach for these analyses has never been clear. Lineage-pair traits are non-independent, but unlike the expected covariance among species' traits, which is captured by a phylogenetic covariance matrix arising from a given model, the expected covariance among lineage-pair traits has not been explicitly formulated. Analyses of pairwise-defined data have thus employed untested workarounds for non-independence rather than direct models of lineage-pair covariance, with consequences that are unexplored. Here, we consider how evolutionary relatedness among taxa translates into non-independence among taxonomic pairs. We develop models by which phylogenetic signal in an underlying character generates covariance among pairs in a lineage-pair trait. We incorporate the resulting lineage-pair covariance matrices into modified versions of phylogenetic generalized least squares and a new phylogenetic beta regression for bounded response variables. Both outperform previous approaches in simulation tests. We find that a common heuristic method, node averaging, imparts a greater cost to model performance than does the non-independence it was designed to correct. We re-analyze two empirical datasets to find dramatic improvements in model fit and, in the case of avian hybridization data, an even stronger relationship between pair age and RI than is revealed from uncorrected analysis. 
We finally present a new tool, the R package phylopairs, that allows empiricists to test relationships among pairwise-defined variables in a way that is statistically robust and more straightforward to implement.
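
To make the notion of a phylogenetic covariance matrix concrete: under a Brownian-motion model, the expected covariance between two species' trait values is proportional to the branch length they share between the root and their most recent common ancestor. The sketch below builds that species-level matrix for a toy tree and lists the lineage pairs that share a taxon, one obvious source of non-independence among pairs. It does not reproduce the paper's lineage-pair covariance models or the phylopairs API. The tree, taxon names, and function names are hypothetical.

```python
from itertools import combinations


def bm_covariance(tips, tree_height, mrca_age, sigma2=1.0):
    """Brownian-motion covariance among species on an ultrametric tree.

    tips: list of taxon names
    tree_height: root-to-tip distance
    mrca_age: dict mapping frozenset({a, b}) to the age of their MRCA
    sigma2: Brownian rate parameter
    """
    cov = {}
    for a in tips:
        for b in tips:
            if a == b:
                shared = tree_height  # variance: full root-to-tip path
            else:
                shared = tree_height - mrca_age[frozenset((a, b))]
            cov[(a, b)] = sigma2 * shared
    return cov


def overlapping_pairs(tips):
    """Pairs of lineage pairs that have at least one taxon in common."""
    pairs = list(combinations(tips, 2))
    return [(p, q) for p, q in combinations(pairs, 2) if set(p) & set(q)]


# Hypothetical tree ((A, B), C) with height 3; A and B split 1 time unit ago.
TIPS = ["A", "B", "C"]
MRCA_AGE = {
    frozenset(("A", "B")): 1.0,
    frozenset(("A", "C")): 3.0,
    frozenset(("B", "C")): 3.0,
}
COV = bm_covariance(TIPS, 3.0, MRCA_AGE)
```

Even in this three-taxon case every lineage pair shares a taxon with every other pair, so treating pairs as independent data points is hard to justify.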

Citations: 0
Parameter Estimation from Phylogenetic Trees Using Neural Networks and Ensemble Learning.
IF 6.5 Zone 1 (Biology) Q1 EVOLUTIONARY BIOLOGY Pub Date: 2025-09-03 DOI: 10.1093/sysbio/syaf060
Tianjian Qin, Koen J van Benthem, Luis Valente, Rampal S Etienne
Species diversification is characterized by speciation and extinction, the rates of which can, under some assumptions, be estimated from time-calibrated phylogenies. However, maximum likelihood estimation methods (MLE) for inferring rates are limited to simpler models and can show bias, particularly in small phylogenies. Likelihood-free methods to estimate parameters of diversification models using deep learning have started to emerge, but how robust neural network methods are at handling the intricate nature of phylogenetic data remains an open question. Here we present a new ensemble neural network approach to estimate diversification parameters from phylogenetic trees that leverages different classes of neural networks (dense neural network, graph neural network, and long short-term memory recurrent network) and simultaneously learns from graph representations of phylogenies, their branching times and their summary statistics. Our best-performing ensemble neural network (which adjusts the graph neural network result using a recurrent neural network) delivers estimates faster than MLE and shows less sensitivity to tree size for constant-rate and diversity-dependent speciation scenarios. It performs well compared to an existing convolutional network approach. However, like MLE, our approach still fails to recover parameters precisely under a protracted birth-death process. Our analysis suggests that the primary limitation to accurate parameter estimation is the amount of information contained within a phylogeny, as indicated by its size and the strength of effects shaping it. In cases where MLE is unavailable, our neural network method provides a promising alternative for estimating phylogenetic tree parameters. If detectable phylogenetic signals are present, our approach delivers results that are comparable to MLE but without inherent biases.
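
For intuition about likelihood-based rate estimation and its small-tree bias, consider the simplest diversification model, a pure-birth (Yule) process with no extinction. With k lineages alive, the waiting time to the next speciation is exponential with rate kλ, so the maximum-likelihood estimate of λ is the number of speciation events divided by the summed lineage-time. The sketch below covers only this textbook case. It is not the authors' neural-network ensemble or any of the models they evaluate, and the function names and parameter values are assumptions.

```python
import random


def simulate_yule_intervals(lam, n_tips, rng):
    """Simulate a pure-birth process from 2 lineages up to n_tips lineages.

    Returns a list of (k, t) tuples: k lineages were alive for duration t
    before the next speciation event.
    """
    return [(k, rng.expovariate(k * lam)) for k in range(2, n_tips)]


def yule_mle(intervals):
    """MLE of the speciation rate: events / total lineage-time."""
    events = len(intervals)
    lineage_time = sum(k * t for k, t in intervals)
    return events / lineage_time


def mean_estimate(lam, n_tips, replicates, seed=0):
    """Average MLE over many simulated trees of the same size."""
    rng = random.Random(seed)
    estimates = [yule_mle(simulate_yule_intervals(lam, n_tips, rng))
                 for _ in range(replicates)]
    return sum(estimates) / replicates


if __name__ == "__main__":
    for n_tips in (5, 20, 100):
        print(n_tips, round(mean_estimate(1.0, n_tips, 20000), 3))
```

In this setup the estimator's expected value is λ·m/(m−1) for m speciation events (m ≥ 2), so the upward bias is large for very small trees and fades as the tree grows. More complex models generally have no such closed form.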
Citations: 0
New perspectives in phylogenetic support assessment: using the new Relative Contradiction Index to investigate the phylogenetic controversies in Crocodylia
IF 6.5 Zone 1 (Biology) Q1 EVOLUTIONARY BIOLOGY Pub Date: 2025-08-27 DOI: 10.1093/sysbio/syaf058
Paul Aubier, Valentin Rineau, Jorge Cubo, Stéphane Jouve
Numerous tools have been developed since the advent of phylogenetic methods to assess tree robustness. Identifying the degree of contradiction in a phylogenetic matrix, as well as the specific contribution of each taxon and character, is essential for estimating its reliability. In parsimony-based phylogenetic inferences, classically used by paleontologists, a phylogeny results from the interaction of all the characters used in the analysis. Consequently, the support initially provided by the characters in the matrix may differ from that after optimization in the final tree, severing the link between the phylogenetic content of the matrix and that of the final tree. Thus, all methods aimed at measuring support only do so indirectly, and the impact of individual characters or taxa can only be assessed after the analysis. Three-taxon analysis (3ta) is a phylogenetic method that can circumvent these issues by precisely measuring the support of targeted characters and/or taxa directly from the phylogenetic matrix. In 3ta, characters are coded as trees and decomposed into three-taxon statements (3ts). The analysis searches for the largest set of non-contradicting 3ts to compute the optimal phylogeny. Because the analysis is a compatibility procedure, not an optimization procedure, character supports on the tree are independent from one another. This enables direct assessment of support from the matrix, providing meaningful insights into the topology of the optimal trees. Moreover, the decomposition of characters into 3ts allows for precise quantification of the impact of the characters/taxa in the results. In this study, focusing on Crocodylia (a subject of ongoing debate over recent decades), we use 3ta to measure the support of specific characters and/or taxa in the recently published matrix of Rio and Mannion (2021). This conflict revolves around two competing hypotheses – Longirostres and Brevirostres – each supporting a different placement of the Gavialoidea clade. 
We also introduce here the Relative Contradiction Index (RCI) to evaluate node support, a metric that reflects the degree of contradiction in a matrix between competing cladistic hypotheses, ranging from 0.5 (maximum contradiction) to 1 (no contradiction). We show that although the Longirostres hypothesis is the best-supported, it is strongly challenged by the Brevirostres hypothesis (RCI = 0.62). Furthermore, we find that Tomistominae provides 61% of the supporting evidence for the Longirostres hypothesis, such that, when removed, the matrix supports the Brevirostres hypothesis. Individual tomistomines’ contributions vary only from 2% to 7% of the total support to the Longirostres hypothesis. Finally, we show that characters correlated to longirostry only provide a fraction (22%) of the total support to the Longirostres hypothesis. Thus, our method can quantify the impact of specific characters or taxa on a phylogenetic result. This should prove very useful to phylogeneticists, especi
Citations: 0
Phylogenetic Resolution and Conflict in the Species-Rich Flowering Plant Family Leguminosae.
IF 5.7 Zone 1 (Biology) Q1 EVOLUTIONARY BIOLOGY Pub Date: 2025-08-19 DOI: 10.1093/sysbio/syaf057
Rong Zhang, Gregory W Stull, Jian-Jun Jin, Yin-Huan Wang, Ying Guo, Zhi-Yun Yang, Hong-Tao Li, Kai-Lun An, Joseph L M Charboneau, Ryan A Folk, Domingos Cardoso, Luciano P de Queiroz, Anne Bruneau, Pamela S Soltis, Douglas E Soltis, Stephen A Smith, De-Zhu Li, Ting-Shuang Yi

The Tree of Life is central to evolutionary biology, yet resolving deep, recalcitrant phylogenetic relationships remains challenging due to complex processes such as incomplete lineage sorting (ILS), hybridization, and polyploidization. Although previous phylogenetic studies have advanced our understanding of Leguminosae (Fabaceae), a species-rich and ecologically diverse family, many deep relationships at the tribal and higher levels remain unresolved. Incorporating newly generated genome skimming data for 231 species with previously issued plastid genomic, mitochondrial genomic and transcriptomic data, we reconstructed a phylogeny of the family using whole plastomes, 39 mitochondrial genes, and 1559 low-copy nuclear genes, achieving dense taxonomic sampling across almost all recognized tribes and major unplaced lineages. Our results supported the monophyly of the six subfamilies and 49 recognized tribes, identified ten clades worthy of recognition as new tribes in subfamily Papilionoideae, and clarified many contentious relationships. However, nuclear-nuclear and cytonuclear conflicts persist at multiple nodes among trees inferred from different datasets and analytical methods. We proposed the most probable resolution for 22 contentious nodes by applying nuclear gene-tree quartet analysis with corroboration from support of nuclear Maximum Likelihood (ML) and ASTRAL trees. Our results indicate ILS significantly contributes to observed phylogenetic conflicts, while gene flow represents an additional and previously underappreciated factor that mainly contributes to cytonuclear conflicts, particularly along the branches of the Angylocalyceae + Dipterygeae + Amburaneae (ADA) clade and Wisterieae. These processes likely underlie recalcitrant phylogenetic relationships, such as those within the 50-kb inversion clade of Papilionoideae. 
Our study uses multiple data partitions and analytical methods to resolve contentious phylogenetic relationships in Leguminosae, resulting in a robust phylogenomic framework to guide further investigations in this economically important and exceptionally diverse family.

Citations: 0
Social environment and the evolution of delayed reproduction in birds.
IF 5.7 Zone 1 (Biology) Q1 EVOLUTIONARY BIOLOGY Pub Date: 2025-08-12 DOI: 10.1093/sysbio/syaf056
Liam U Taylor, Josef C Uyeda, Richard O Prum

One puzzling feature of avian life histories is that individuals in many different lineages delay reproduction for several years after they finish growing. Intraspecific field studies suggest that various complex social environments (such as cooperative breeding groups, nesting colonies, and display leks) result in delayed reproduction because they require forms of sociosexual development that extend beyond physical maturation. Here, we formally propose this hypothesis and use a full suite of phylogenetic comparative methods to test it, analyzing the evolution of age at first reproduction (AFR) in females and males across 963 species of birds. Phylogenetic regressions support increased AFR in colonial females and males, cooperatively breeding males, and lekking males. Continuous Ornstein-Uhlenbeck models support distinct evolutionary regimes with increased AFR for cooperative, colonial, and lekking lineages alike. Discrete hidden state Markov models suggest a net increase in delayed reproduction for social lineages, even when accounting for hidden state heterogeneity and the potential reverse influence of AFR on sociality. Our results support the hypothesis that the evolution of sociality reshapes the dynamics of life history evolution in birds. Comparative analyses of even the most broadly generalizable characters, such as AFR, must reckon with unique, heterogeneous, historical events in the evolution of individual lineages.

Citations: 0
ConvexML: Fast and accurate branch length estimation under irreversible mutation models, illustrated through applications to CRISPR/Cas9-based lineage tracing
IF 6.5 Zone 1 (Biology) Q1 EVOLUTIONARY BIOLOGY Pub Date: 2025-08-12 DOI: 10.1093/sysbio/syaf054
Sebastian Prillo, Akshay Ravoor, Nir Yosef, Yun S Song
Branch length estimation is a fundamental problem in statistical phylogenetics and a core component of tree reconstruction algorithms. Traditionally, general time-reversible mutation models are employed, and many software tools exist for this scenario. With the advent of CRISPR/Cas9 lineage tracing technologies, there has been significant interest in the study of branch length estimation under irreversible mutation models. Under the CRISPR/Cas9 mutation model, irreversible mutations – in the form of DNA insertions or deletions – are accrued during the experiment, which are then read out at the single-cell level to reconstruct the cell lineage tree. However, most of the analyses of CRISPR/Cas9 lineage tracing data have so far been limited to the reconstruction of single-cell tree topologies, which depict lineage relationships between cells, but not the amount of time that has passed between ancestral cell states and the present. Time-resolved trees, known as chronograms, would allow one to study the evolutionary dynamics of cell populations at an unprecedented level of resolution. Indeed, time-resolved trees would reveal the timing of events on the tree, the relative fitness of subclones, and the dynamics underlying phenotypic changes in the cell population – among other important applications. In this work, we introduce the first scalable and accurate method to refine any given single-cell tree topology into a single-cell chronogram by estimating its branch lengths. To do this, we perform regularized maximum likelihood estimation under a general irreversible mutation model, paired with a conservative version of maximum parsimony that reconstructs only the ancestral states that we are confident about. To deal with the particularities of CRISPR/Cas9 lineage tracing data – such as double-resection events which affect runs of consecutive sites – we avoid making our model more complex and instead opt for using a simple but effective data encoding scheme. Similarly, we avoid explicitly modeling the missing data mechanisms – such as heritable missing data – by instead assuming that they are missing completely at random. We stabilize estimates in low-information regimes by using a simple penalized version of maximum likelihood estimation (MLE) using a minimum branch length constraint and pseudocounts. All this leads to a convex MLE problem that can be readily solved in seconds with off-the-shelf convex optimization solvers. We benchmark our method using both simulations and real lineage tracing data, and show that it performs well on several tasks, matching or outperforming competing methods such as TiDeTree and LAML in terms of accuracy, while being 10 to 100× faster. Notably, our statistical model is simpler and more general, as we do not explicitly model the intricacies of CRISPR/Cas9 lineage tracing data. In this sense, our contribution is twofold: (1) a fast and robust method for branch length estimation under a general irreversible mutation model, and (2) a data encoding scheme specific to CRISPR/Cas9 lineage tracing data that makes it amenable to the general model. Our branch length estimation method, which we call 'ConvexML', should be broadly applicable to any evolutionary model with irreversible mutations (ideally, with high diversity) and an approximately ignorable missing data mechanism. 'ConvexML' is available through the ConvexML open source Python package.
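To make the idea of a penalized, convex branch length MLE concrete, here is a minimal single-branch sketch. It is not the ConvexML package or its API: the function name, the way pseudocounts enter the likelihood, the default bounds, and the golden-section solver are all assumptions chosen for illustration. Each site is assumed to mutate irreversibly at a constant rate, so the negative log-likelihood is convex in the branch length and a simple bracketing search finds its minimum.

```python
import math


def estimate_branch_length(n_mutated, n_unmutated, rate=1.0,
                           pseudocount=0.5, t_min=0.01, t_max=50.0, tol=1e-9):
    """Toy penalized MLE of one branch length under irreversible mutation.

    Hypothetical helper, not part of any published package. A site stays
    unmutated over time t with probability exp(-rate * t). Pseudocounts are
    added to both counts, and t is constrained to [t_min, t_max].
    """
    k = n_mutated + pseudocount    # regularized count of mutated sites
    u = n_unmutated + pseudocount  # regularized count of unmutated sites

    def neg_log_lik(t):
        # -[u * log(exp(-rate*t)) + k * log(1 - exp(-rate*t))], convex in t
        return u * rate * t - k * math.log1p(-math.exp(-rate * t))

    # Golden-section search: valid because the objective is convex (unimodal).
    inv_phi = (math.sqrt(5) - 1) / 2
    lo, hi = t_min, t_max
    while hi - lo > tol:
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if neg_log_lik(a) < neg_log_lik(b):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2


if __name__ == "__main__":
    # 3 of 10 sites mutated; the estimate should sit near log(11 / 7.5).
    print(estimate_branch_length(3, 7))
```

In this toy setting the optimum also has a closed form, `log((k + u) / u) / rate`, which gives a quick check on the numerical search. With no mutations and no pseudocounts the estimate falls back to the minimum branch length, which is the role the constraint plays in low-information regimes.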
Citations: 0
Selecting a Window Size for Phylogenomic Analyses of Whole Genome Alignments using AIC
IF 6.5 Zone 1 (Biology) Q1 EVOLUTIONARY BIOLOGY Pub Date: 2025-08-12 DOI: 10.1093/sysbio/syaf053
Jeremias Ivan, Paul Frandsen, Robert Lanfear
Gene tree discordance along a set of aligned genomes presents a challenge for phylogenomic methods to identify the non-recombining regions and reconstruct the phylogenetic tree for each region. To address this problem, many studies have used the non-overlapping window approach, often with an arbitrary selection of fixed window sizes that potentially include intra-window recombination events. In this study, we propose an information theoretic approach to select a window size that best reflects the underlying histories of the alignment. First, we simulated chromosome alignments that reflected the key characteristics of an empirical dataset and found that the AIC is a good predictor of window size accuracy in correctly recovering the tree topologies of the alignment. To address the issue of missing data in empirical datasets, we designed a stepwise non-overlapping window approach that compares the AIC of two window sizes at a time, retaining only genomic regions that can be analysed using both window sizes. We then applied this method to the genomes of Heliconius butterflies and great apes. We found that the best window sizes for the butterflies' chromosomes ranged from <125 bp to 250 bp, which are much shorter than those used in a previous study, even though this difference in window size did not significantly change the most common topologies across the genome. On the other hand, the best window sizes for great apes' chromosomes ranged from 500 bp to 1 kb, with the proportion of the major topology (grouping human and chimpanzee) falling between 60% and 87%, consistent with previous findings. Additionally, we observed a notable impact of gene tree estimation error and concatenation when using small and large windows, respectively. For instance, the proportion of the major topology for great apes was 50% when using 250 bp windows, but reached almost 100% for 64 kb windows. In conclusion, our study highlights the challenges associated with selecting a fixed window size in non-overlapping window analyses and proposes the AIC as a less arbitrary way to select the optimal window size when running non-overlapping window analyses on whole genome alignments.
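The comparison this abstract describes can be sketched in a few lines. The abstract does not give the exact procedure, so the snippet below is an assumption-laden illustration rather than the authors' implementation: it assumes each window's tree has already been fitted, that the total AIC for a window size is the sum of per-window AIC values over the same genomic region, and that the `WindowFit` type and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowFit:
    """Hypothetical summary of one window's fitted tree."""
    log_likelihood: float  # maximized log-likelihood for this window
    n_params: int          # free parameters (branch lengths, model parameters)


def aic(log_likelihood: float, n_params: int) -> float:
    # Standard definition: AIC = 2k - 2 ln L
    return 2.0 * n_params - 2.0 * log_likelihood


def total_aic(fits) -> float:
    # Assumption: windows are treated as independent partitions, so AIC adds up.
    return sum(aic(f.log_likelihood, f.n_params) for f in fits)


def best_window_size(fits_by_size) -> int:
    # Lower total AIC is preferred.
    return min(fits_by_size, key=lambda size: total_aic(fits_by_size[size]))


# Made-up fits covering the same 1000 bp region with two window sizes.
fits_by_size = {
    500: [WindowFit(-1200.0, 12), WindowFit(-1180.0, 12)],
    1000: [WindowFit(-2405.0, 12)],
}

if __name__ == "__main__":
    print(best_window_size(fits_by_size))
```

Smaller windows pay for more parameters in total but can fit each local history better; the AIC makes that trade-off explicit, which is why it can serve as a less arbitrary selection criterion.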
Citations: 0
But the Clock, Tick-Tock: An Empirical Case Study Highlights the Preeminence of Relaxed Clock Models in Total-Evidence Dating
IF 6.5 Zone 1 (Biology) Q1 EVOLUTIONARY BIOLOGY Pub Date: 2025-08-08 DOI: 10.1093/sysbio/syaf055
Nicolás Mongiardino Koch, Jeffrey R Thompson, Rich Mooi, Greg W Rouse
Phylogenetic clock models translate inferred amounts of evolutionary change (calculated from either genotypes or phenotypes) into estimates of elapsed time, providing a mechanism for time scaling phylogenetic trees. Relaxed-clock models, which accommodate variation in evolutionary rates across branches, are one of the main components of Bayesian dating, yet their consequences for total-evidence phylogenetics have not been thoroughly explored. Here, we combine morphological, molecular (both transcriptomic and Sanger-sequenced), and stratigraphic datasets for all major lineages of echinoids (sea urchins, heart urchins, sand dollars). We then perform total-evidence dated inference under the fossilized birth-death prior, varying two analytical conditions: the choice between autocorrelated and uncorrelated relaxed clocks, which enforce (or not) evolutionary rate inheritance; and the ability to recover fossil terminals as direct ancestors. Our results highlight a previously unnoticed interaction between tree and clock models, with analyses implementing an autocorrelated clock failing to recover any direct ancestors. Nonetheless, even under conditions conducive to the placement of fossil terminals as ancestors, we find this type of relationship to be accommodated without any impact on either topology or node ages. On the other hand, tree topology, fossil placement, divergence times, and downstream macroevolutionary inferences (e.g., ancestral state reconstructions) were all strongly affected by the type of relaxed clock implemented. In regions of the tree where molecular rate variation is pervasive and morphological signal relatively uninformative, fossil tips seem to play little to no role in informing divergence times, and instead passively move in and out of clades depending on the ages imposed upon surrounding nodes by molecular data. 
Our results show the extent to which the phylogenetic and macroevolutionary conclusions of total-evidence dated analyses are contingent on the choice of relaxed-clock model, underscoring the need for either careful methodological validation or a thorough assessment of sensitivity. Our efforts continue to illuminate the echinoid tree of life, supporting the erection of the order Apatopygoida to include three living species last sharing a common ancestor with other extant lineages around the time of the Jurassic-Cretaceous boundary. Furthermore, they also illustrate how the phylogenetic placement of extinct clades hinges upon the modelling of molecular data, showing the extent to which the fossil record remains subservient to phylogenomics.
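The difference between the two clock families contrasted in this abstract can be shown with a small simulation. The parameterizations below are common textbook forms (independent lognormal rates with mean 1, and lognormal drift from the parent's rate) and are assumptions for illustration only; the abstract does not state which specific models or software settings the study used, and both function names are hypothetical.

```python
import math
import random


def uncorrelated_rates(n_branches, sigma, rng):
    """Each branch draws its rate independently; no inheritance from the parent."""
    mu = -0.5 * sigma ** 2  # keeps the expected rate equal to 1
    return [rng.lognormvariate(mu, sigma) for _ in range(n_branches)]


def autocorrelated_rates(branch_durations, sigma, rng, root_rate=1.0):
    """Rates along one root-to-tip path; each branch inherits its parent's
    rate and drifts from it by an amount that grows with branch duration."""
    rates = []
    log_rate = math.log(root_rate)
    for duration in branch_durations:
        sd = sigma * math.sqrt(duration)
        # Mean correction keeps the expected child rate equal to the parent rate.
        log_rate = rng.gauss(log_rate - 0.5 * sd ** 2, sd)
        rates.append(math.exp(log_rate))
    return rates


if __name__ == "__main__":
    rng = random.Random(42)
    print(uncorrelated_rates(5, 0.5, rng))
    print(autocorrelated_rates([1.0] * 5, 0.5, rng))
```

Setting `sigma` to zero collapses both samplers to a strict clock, which is a convenient sanity check. With a positive `sigma`, neighbouring branches tend to have similar rates only under the autocorrelated sampler.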
Citations: 0