
Systematic Biology: Latest Articles

On the utility of Deep Learning for model classification and parameter estimation on complex diversification scenarios.
IF 6.5, Zone 1 (Biology), Q1 Evolutionary Biology. Pub Date: 2026-03-24. DOI: 10.1093/sysbio/syag030
Pablo Gutiérrez de la Peña,Guillermo Iglesias,Edgar Talavera,Andrea Sánchez Meseguer,Isabel Sanmartín
Birth-Death models applied to dated phylogenies are useful tools to study past diversification dynamics. Parameters in these stochastic models are typically inferred using likelihood-based methods; however, these approaches can become computationally intractable for models of moderate to high complexity. One approach to increase model complexity while remaining computationally tractable is Deep Learning. These techniques have recently been explored in the context of serially-sampled phylogenies (phylodynamics) and trait-dependent birth-death models (macroevolution). Here, we explore the power of Convolutional Neural Networks (CNNs) to solve classification (model selection) and regression (parameter estimation) tasks for extant-only phylogenies under six constant-rate and time-varying, lineage-homogeneous diversification scenarios: Constant Birth-Death, High Extinction, Mass Extinction, Diversity-Dependent Diversification, and the piecewise-constant scenarios Stasis-and-Radiate and Waxing-and-Waning. We simulated 10,000 phylogenetic trees under each diversification scenario, which were encoded using the CDV vectorization procedure to capture branch length information. The encoded trees were used to train a set of CNN models designed to match three empirical phylogenies of eucalypts, conifers, and cetaceans, which have previously been used for benchmarking diversification models and differ in the number of extant tips. Additionally, we compared CNN performance with Maximum Likelihood Estimation (MLE) for the same set of scenarios. We found that CNNs exhibited classification accuracy levels of 80-93%, whereas MLE achieved levels of 70-74%. The most difficult scenarios for CNN classification were High Extinction and Mass Extinction. For regression tasks, mean average errors were slightly higher for MLE compared with DL. Both approaches had difficulty estimating ratio parameters such as mass-extinction survival and relative extinction.
Finally, we applied our CNN models for parameter estimation on the three empirical phylogenies under the best-fit diversification scenario. This allows us to discuss shortcomings and future avenues for improvement, such as the inclusion of rate-variable, lineage-heterogeneous models.
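As a rough illustration of the stochastic process underlying the simplest of these scenarios, the sketch below simulates lineage counts through time under a constant-rate birth-death process using a Gillespie-style loop. It is not the authors' simulator: it tracks only the number of lineages, builds no tree, and performs no CDV encoding. The function name `simulate_lineage_counts` and its parameters are hypothetical.

```python
import random


def simulate_lineage_counts(birth, death, t_max, n0=1, seed=None):
    """Simulate (time, lineage count) pairs under a constant-rate birth-death process.

    birth, death: per-lineage speciation and extinction rates (assumed names).
    The run stops at t_max or when the clade goes extinct.
    """
    if birth + death <= 0:
        raise ValueError("birth + death must be positive")
    rng = random.Random(seed)
    t, n = 0.0, n0
    history = [(t, n)]
    while n > 0:
        # Waiting time to the next event is exponential with total rate n * (birth + death).
        t += rng.expovariate(n * (birth + death))
        if t >= t_max:
            break
        # The event is a speciation with probability birth / (birth + death).
        if rng.random() < birth / (birth + death):
            n += 1
        else:
            n -= 1
        history.append((t, n))
    return history


if __name__ == "__main__":
    # Relative extinction (death / birth) of 0.5, one of the ratio-type parameters.
    run = simulate_lineage_counts(birth=1.0, death=0.5, t_max=4.0, seed=42)
    print("events:", len(run) - 1, "final lineages:", run[-1][1])
```

A classifier such as the CNNs described above would be trained on encoded trees from many such simulations per scenario; this sketch only shows where the randomness in those simulations comes from.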
Citations: 0
Genomic Signatures of Speciation in Butterflies
IF 6.5, Zone 1 (Biology), Q1 Evolutionary Biology. Pub Date: 2026-03-20. DOI: 10.1093/sysbio/syag029
Qian Cong, Jing Zhang, Nick V Grishin
The classification of organisms into species is fundamental to the study of life. Contrary to popular belief, simple and quantitative standards for species delineation are often lacking, and debates about species boundaries create obstacles for conservation biology, agriculture, legislation, and education. We chose butterflies as a model system to address this key biological question. We sequenced and analyzed transcriptomes of 186 butterfly specimens representing 25 pairs of species, covering close but clearly distinct species, conspecific populations, and taxa with debated relationships. We found that species are robustly separated from conspecific populations by the combination of two measures computed on Z-linked genes: a fixation index, which detects genetic gaps between species, and the extent of gene flow, which quantifies reproductive isolation. Applying these criteria suggests that all nine butterfly pairs with debated relationships are distinct species, not populations or subspecies. Furthermore, we found that elevated divergence and positive selection in proteins involved in DNA interaction, circadian clock, pheromone sensing, development, and immune response recurrently correlate with speciation. A significant fraction of these divergent proteins are encoded by the Z chromosome, which appears to be more resistant to introgression than autosomes. Taken together, our findings point to potential common speciation mechanisms in butterflies, provide additional support for the important role of the Z chromosome in speciation, and suggest quantitative criteria for species delimitation, which is vital for the exploration of biodiversity.
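The abstract does not say which fixation-index estimator was used. As a minimal sketch of the general idea, the code below computes the textbook Wright-style F_ST for a single biallelic locus in two equally weighted populations, F_ST = (H_T - H_S) / H_T. The function name `fst_biallelic` is hypothetical, and a real analysis would aggregate over many Z-linked loci and correct for sample size.

```python
def fst_biallelic(p1, p2):
    """Wright-style F_ST for one biallelic locus and two equally weighted populations.

    p1, p2: frequency of the same allele in population 1 and population 2.
    """
    p_bar = (p1 + p2) / 2
    # Expected heterozygosity if the two populations were pooled.
    h_total = 2 * p_bar * (1 - p_bar)
    # Mean expected heterozygosity within each population.
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    if h_total == 0:
        # Locus is monomorphic overall, so there is no variation to partition.
        return 0.0
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    print(fst_biallelic(0.0, 1.0))  # fixed difference between the two populations
    print(fst_biallelic(0.3, 0.3))  # identical allele frequencies
    print(fst_biallelic(0.2, 0.8))  # partial differentiation
```

Values near 1 indicate a genetic gap of the kind the abstract associates with distinct species, and values near 0 indicate populations that are not differentiated at that locus.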
Citations: 0
Bayesian test of gene flow between sister lineages using genomic data.
IF 5.7, Zone 1 (Biology), Q1 Evolutionary Biology. Pub Date: 2026-03-14. DOI: 10.1093/sysbio/syag028
Ziheng Yang, Xiyun Jiao, Sirui Cheng, Tianqi Zhu

Inference of interspecific gene flow using genomic data is important to reliable reconstruction of species phylogenies and to our understanding of the speciation process. Gene flow is harder to detect if it involves sister lineages than nonsisters; for example, most heuristic methods based on data summaries are unable to infer gene flow between sisters. Likelihood-based methods can identify introgression between sisters but the test exhibits several nonstandard features, including boundary problems, indeterminate parameters, and multiple routes from the alternative to the null hypotheses. In the Bayesian test, those irregularities pose challenges to the use of the Savage-Dickey (S-D) density ratio to calculate the Bayes factor. Here we develop a theory for applying the S-D approach under nonstandard conditions. We show that the Bayesian test of introgression between sister lineages has low false-positive rates and high power. We discuss issues surrounding the estimation of the rate of gene flow between sister lineages, especially at very low or very high rates, and suggest that evidence for gene flow between sisters be assessed via a Bayesian test. We find that the species split time has a major impact on the information content in the data, with more information at deeper divergence. We use a genomic dataset from Sceloporus lizards to illustrate the test of gene flow between sister lineages.
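For readers unfamiliar with the Savage-Dickey density ratio, the sketch below shows it in a deliberately regular setting: a binomial model with a Beta prior, testing a point null at an interior value of the parameter. Here the Bayes factor in favour of the null equals the posterior density at the null value divided by the prior density at the same point. This toy case has none of the boundary or indeterminacy problems the abstract describes, and it is unrelated to the authors' implementation; all function names are hypothetical.

```python
import math


def log_beta_fn(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for 0 < x < 1."""
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_beta_fn(a, b))


def savage_dickey_bf01(k, n, theta0, a=1.0, b=1.0):
    """Bayes factor for H0: theta = theta0 against H1: theta ~ Beta(a, b).

    Computed as posterior density at theta0 over prior density at theta0.
    """
    posterior_at_null = beta_pdf(theta0, a + k, b + n - k)
    prior_at_null = beta_pdf(theta0, a, b)
    return posterior_at_null / prior_at_null


def direct_bf01(k, n, theta0, a=1.0, b=1.0):
    """Same Bayes factor from the two marginal likelihoods (binomial coefficient cancels)."""
    log_m0 = k * math.log(theta0) + (n - k) * math.log(1 - theta0)
    log_m1 = log_beta_fn(a + k, b + n - k) - log_beta_fn(a, b)
    return math.exp(log_m0 - log_m1)


if __name__ == "__main__":
    # 5 successes in 10 trials, null value 0.5, uniform prior under the alternative.
    print(savage_dickey_bf01(5, 10, 0.5))
    print(direct_bf01(5, 10, 0.5))
```

The two functions agree in this regular case, which is the property that makes the ratio convenient: only the posterior under the alternative model needs to be estimated.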

Citations: 0
A Phylogenetic Model of Established and Enabled Biome Shifts.
IF 5.7, Zone 1 (Biology), Q1 Evolutionary Biology. Pub Date: 2026-03-14. DOI: 10.1093/sysbio/syag026
Sean W McHugh, Michael J Donoghue, Michael J Landis

Where each species actually lives is distinct from where it could potentially survive and persist. This suggests it is important to distinguish established biome affinities (where species live) from enabled affinities (where species could live) when considering how ancestral species moved and evolved among major habitat types, yet typical phylogenetic approaches modeling biome shifts only consider established affinities while disregarding enabled ones. We introduce a new phylogenetic method, called RFBS (Realized & Fundamental Biome Shifts), to model how anagenetic and cladogenetic events cause established and enabled biome affinities (or, more generally, other discrete realized versus fundamental niche states) to shift over evolutionary timescales. We provide practical guidelines for how to assign established and enabled biome affinity states to extant taxa, using the flowering plant clade Viburnum as a case study. Through a battery of simulation experiments, we show that RFBS performs well, even when we have realistically imperfect knowledge of enabled biome affinities for most analyzed species. We also show that RFBS reliably discerns established from enabled affinities, with similar accuracy to Dispersal-Extinction-Cladogenesis models that ignore the existence of enabled biome affinities. Lastly, we apply RFBS to Viburnum to infer ancestral biomes throughout the tree and to highlight instances where repeated shifts between established affinities for warm and cold temperate forest biomes were enabled by a stable and slowly-evolving enabled affinity for both temperate biomes.
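One way to picture a joint state of established and enabled affinities is as a pair of biome sets. The sketch below enumerates such pairs under the assumption that a species' established biomes form a non-empty subset of its enabled biomes. The abstract does not state this constraint or how RFBS actually defines its state space, so both the constraint and the function names are assumptions for illustration only.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a frozenset."""
    items = tuple(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def enumerate_states(biomes):
    """List (established, enabled) pairs with established a non-empty subset of enabled.

    This subset constraint is an assumption, not a documented property of RFBS.
    """
    states = []
    for enabled in nonempty_subsets(biomes):
        for established in nonempty_subsets(sorted(enabled)):
            states.append((established, enabled))
    return states


if __name__ == "__main__":
    for established, enabled in enumerate_states(["warm", "cold"]):
        print("established:", sorted(established), "enabled:", sorted(enabled))
```

Under this assumption the number of joint states for n biomes is 3^n - 2^n, so the state space grows quickly as biomes are added.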

Citations: 0
Integrating Secondary Structure Information Enhances Phylogenetic Signal in Mitochondrial Protein Coding Genes
IF 6.5, Zone 1 (Biology), Q1 Evolutionary Biology. Pub Date: 2026-03-09. DOI: 10.1093/sysbio/syag027
Claudio Cucini, Francesco Nardi, Joan Pons
Accurate phylogenetic inference requires models that account for heterogeneity in molecular evolution. Mitochondrial protein-coding genes, which encode membrane-bound proteins composed of multiple transmembrane α-helices, exhibit considerable compositional and functional variation across structural regions, variation that is often overlooked in standard partitioning strategies. Here, we introduce TRAMPO (TRAnsMembrane Protein Order), a novel pipeline that incorporates predicted secondary structural features (i.e. matrix-facing, transmembrane, and intermembrane-facing domains) into phylogenetic partitioning schemes. We applied TRAMPO to seven mitochondrial datasets spanning crustaceans, hexapods, and vertebrates, and evaluated eight partitioning strategies based on combinations of codon position, strand, and secondary structure. Transmembrane helices, especially at second codon positions, showed pronounced thymine enrichment and hydrophobic amino-acid composition, reflecting domain-specific evolutionary constraints. To assess whether these structural patterns influence phylogenetic reconstruction, we performed maximum likelihood analyses under standard and Lie Markov models, General Heterogeneous evolution On a Single Topology, and profile mixture models. We also evaluated different models of among-site rate variation (including the proportion of invariant sites, gamma distributions, and FreeRates, which approximates rate heterogeneity using flexible discrete rate categories) to examine their interaction with partitioning strategies and overall model performance. Incorporating structural information into partitioning schemes consistently improved model fit and reduced apparent heterogeneity, as reflected in lower AIC values and more compositionally homogeneous partitions. These improvements translated into more consistent and topologically congruent phylogenetic trees across most datasets, while also reducing computational time. 
Notably, second codon positions within transmembrane helices were consistently retained as distinct partitions during model optimization, even in Mammals and Vertebrates, where secondary structure contributed little to overall model performance, underscoring their strong and conserved evolutionary signal. Surveys of tree space using quartet distances further supported these findings, with structurally informed models yielding more tightly clustered and internally consistent tree topologies. The benefits of structural partitioning were most pronounced in lineages of intermediate evolutionary depth and declined in ancient vertebrate and mammalian clades, where substitutional saturation accumulates with evolutionary time and strand asymmetry tends to emerge more frequently. In some cases, models with the lowest AIC did not yield the most congruent topologies, underscoring the limitations of information criteria when comparing models of different complexity. Overall, our findings demonstrate that secondary structural features,
particularly the repetitive architecture of transmembrane helices, contain meaningful phylogenetic signal. Incorporating this information into partitioning schemes improves tree reconstruction and mitigates underlying heterogeneity. TRAMPO provides a scalable, open-source tool for implementing this approach in mitochondrial phylogenetics.
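Model fit in this abstract is compared through AIC, which is defined as 2k - 2 ln L for a model with k free parameters and maximized log-likelihood ln L. The sketch below ranks a few partitioning schemes by AIC. The scheme names, log-likelihoods, and parameter counts are invented purely for illustration and are not results from the paper.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def rank_by_aic(schemes):
    """Return (name, AIC) pairs sorted from best (lowest AIC) to worst.

    schemes: dict mapping a scheme name to (log_likelihood, n_params).
    """
    scored = [(name, aic(lnl, k)) for name, (lnl, k) in schemes.items()]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the trade-off between fit and complexity.
    example_schemes = {
        "unpartitioned": (-52000.0, 10),
        "codon_position": (-51200.0, 30),
        "codon_position_x_structure": (-50900.0, 90),
    }
    for name, score in rank_by_aic(example_schemes):
        print(f"{name}: AIC = {score:.1f}")
```

As the abstract notes, the scheme with the lowest AIC does not always yield the most congruent topology, so a ranking like this is one input to model choice rather than a final answer.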
Citations: 0
Improving the Robustness of Phylogenetic Independent Contrasts: Addressing Abrupt Evolutionary Shifts with Outlier- and Distribution-Guided Correlation
IF 6.5, Zone 1 (Biology), Q1 Evolutionary Biology. Pub Date: 2026-03-09. DOI: 10.1093/sysbio/syag024
Zheng-Lin Chen, Rui Huang, Hong-Ji Guo, Deng-Ke Niu
Traditional phylogenetically aware correlation methods perform well under gradual evolutionary processes. However, abrupt evolutionary shifts—or macroevolutionary jumps, characteristic of punctuated evolution—can produce extreme phylogenetically independent contrasts (PIC), leading to inflated false positives or increased false negatives in trait correlation analyses. We introduce O(D)GC (Outlier- and Distribution-Guided Correlation), a flexible workflow that identifies outliers in PICs using a distribution-free boxplot criterion and applies Spearman correlation whenever influential outliers are detected. If no outliers are detected, Pearson correlation is used—automatically for large datasets (n ≥ 30), or guided by normality testing in smaller samples. We systematically compared PIC-O(D)GC with five widely applied phylogenetic correlation methods—PIC-Pearson, PIC-MM, PGLS (phylogenetic generalized least squares), MR-PMM (multi-response phylogenetic mixed model), and Corphylo—on 322,000 simulated datasets spanning five evolutionary scenarios (two shift settings: single-trait shifts and dual-trait co-directional jumps; and three no-shift gradual evolution settings), including both fixed-depth and randomly located shifts, tested across 11 shift or noise gradients, three tree sizes (16, 128, 256 tips), and both balanced and random topologies. Overall, PIC-O(D)GC achieved error rates comparable to—or noticeably higher than—those of PIC-MM, while yielding substantially lower error rates than most alternative methods. Under no-shift conditions, it retained power similar to other methods. Analyses of three empirical datasets likewise showed that PIC-O(D)GC and PIC-MM corrected shift-induced distortions that misled conventional methods. Moreover, PIC-O(D)GC offers a conceptually simple framework and incurs markedly lower computational cost. By design, its correlation-only output provides less mechanistic detail than regression-based approaches like PGLS. 
However, when paired with PIC diagnostics, this outlier-guided strategy highlights evolutionary jumps, distinguishes coupled from decoupled shifts, and—via clade partitioning or tip pruning—recovers background correlations, offering biologically informative insights into how punctuated events interact with gradual trends in trait evolution.
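The decision rule described in the abstract can be sketched as a small selection function. This is a simplified reading, not the authors' code: it assumes the boxplot criterion is the conventional 1.5 x IQR fence, treats any flagged contrast as an influential outlier, takes the normality test as a caller-supplied function, and falls back to Spearman when a small sample fails that test. All names are hypothetical.

```python
import statistics


def has_boxplot_outlier(values, k=1.5):
    """True if any value lies beyond k * IQR from the quartiles (assumed fence width)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return any(v < lower or v > upper for v in values)


def choose_correlation(x_contrasts, y_contrasts, looks_normal=None, large_n=30):
    """Pick 'spearman' or 'pearson' for two sets of phylogenetically independent contrasts.

    looks_normal: optional callable returning True if a sample passes a normality
    test (for example a wrapper around a Shapiro-Wilk test); an assumption here.
    """
    # Outliers in either trait's contrasts trigger the rank-based method.
    if has_boxplot_outlier(x_contrasts) or has_boxplot_outlier(y_contrasts):
        return "spearman"
    # Large samples without outliers use Pearson directly.
    if len(x_contrasts) >= large_n:
        return "pearson"
    # Small samples: let the normality check decide; without one, stay conservative.
    if looks_normal is None:
        return "spearman"
    if looks_normal(x_contrasts) and looks_normal(y_contrasts):
        return "pearson"
    return "spearman"


if __name__ == "__main__":
    gradual = [float(i) for i in range(30)]
    jumped = gradual[:-1] + [500.0]  # one extreme contrast, as after an abrupt shift
    print(choose_correlation(gradual, gradual))
    print(choose_correlation(jumped, gradual))
```

The returned label would then be used to run the corresponding correlation on the contrasts, which is why the workflow stays computationally cheap compared with model-based alternatives.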
Citations: 0
Branch Length Transforms using Optimal Tree Metric Matching
IF 6.5 Zone 1 Biology Q1 EVOLUTIONARY BIOLOGY Pub Date: 2026-03-09 DOI: 10.1093/sysbio/syag025
Shayesteh Arasti, Puoya Tabaghi, Yasamin Tabatabaee, Alan K Mayer, Siavash Mirarab
The abundant discordance between evolutionary relationships across the genome has rekindled interest in methods for comparing and averaging trees on a shared leaf set. However, compared to tree topology, where much progress has been made, handling branch lengths has been more challenging. Species tree branch lengths can be measured in various units, often different from gene trees. Moreover, rates of evolution change across the genome, the species tree, and specific branches of gene trees. These factors compound the stochasticity of coalescence times and estimation noise, making branch lengths highly heterogeneous across the genome. For many downstream applications in phylogenomic analyses, branch lengths are as important as the topology, and yet, existing tools to compare and combine weighted trees are limited. In this paper, we address the question of matching one tree to another, accounting for their branch lengths. We define a series of computational problems called Topology-Constrained Metric Matching (TCMM) that seek to transform the branch lengths of a query tree based on a reference tree. We show that TCMM problems can be solved efficiently using a linear algebraic formulation coupled with dynamic programming preprocessing. While many applications can be imagined for this framework, we explore two applications in this paper: embedding leaves of gene trees in Euclidean space to find outliers potentially indicative of estimation errors, and summarizing gene tree branch lengths onto the species tree. In these applications, our method, when paired with existing methods, increases their accuracy at limited computational expense.
Citations: 0
Tensor cores unlock efficient and lower-energy massive parallelization on phylogenetic trees.
IF 6.5 Zone 1 Biology Q1 EVOLUTIONARY BIOLOGY Pub Date: 2026-03-03 DOI: 10.1093/sysbio/syag017
Karthik Gangavarapu, Xiang Ji, Yucai Shao, Andrew Rambaut, Philippe Lemey, Guy Baele, Marc A Suchard
Massively parallel algorithms leveraging graphics processing units (GPUs) have significantly accelerated inference in statistical phylogenetics, with applications in understanding pathogen evolution, population dynamics, natural selection, and evolutionary timescales using ancient genomes. Continued advancements in GPU hardware necessitate innovative algorithms to fully exploit their potential. Here, we introduce three novel algorithms that accelerate matrix multiplication operations using tensor cores on NVIDIA GPUs to calculate the observed sequence data likelihood and the gradient of the log-likelihood with respect to branch-length-specific parameters under continuous-time Markov chain models of evolution. The algorithms presented in this paper deliver 2 to 3-fold gains in performance for amino acid and codon models compared to existing GPU-based massively parallel algorithms. Notably, these performance gains are accompanied by a ~2-fold reduction in energy usage, demonstrating the potential of these algorithms to lower the carbon footprint of evolutionary computing. We make our new algorithms available to the broader phylogenetics community through the high-performance, open source library BEAGLE v4.0.0.
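The likelihood computation mentioned in the abstract reduces, at each internal node of the tree, to multiplying a transition-probability matrix by a matrix of partial likelihoods (Felsenstein's pruning algorithm). The following plain-Python sketch shows that structure only. It is not the BEAGLE API and does not use tensor cores; for brevity it assumes the 4-state Jukes-Cantor model rather than the amino acid and codon models the paper targets, and all function names are illustrative.

```python
import math


def jc69_matrix(t):
    """Jukes-Cantor transition probabilities for branch length t
    (expected substitutions per site). Assumed model, chosen for brevity."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def matmul(a, b):
    """Plain dense matrix product; this is the operation that
    matrix-multiplication hardware would accelerate."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def tip_partials(sequence, alphabet="ACGT"):
    """States x sites matrix with 1.0 at the observed state of each site."""
    return [[1.0 if base == state else 0.0 for base in sequence]
            for state in alphabet]


def node_partials(left, t_left, right, t_right):
    """One pruning step: (P(t_left) @ left) * (P(t_right) @ right),
    where * is the elementwise product."""
    a = matmul(jc69_matrix(t_left), left)
    b = matmul(jc69_matrix(t_right), right)
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def site_likelihoods(root, freqs=(0.25, 0.25, 0.25, 0.25)):
    """Weight root partials by stationary frequencies, one value per site."""
    n_sites = len(root[0])
    return [sum(f * root[s][j] for s, f in enumerate(freqs))
            for j in range(n_sites)]


# Two-taxon example: sequences "AC" and "AG" joined at the root.
root = node_partials(tip_partials("AC"), 0.1, tip_partials("AG"), 0.1)
likelihoods = site_likelihoods(root)
```

With 20 amino acid states or 61 sense codons the transition matrices grow to 20 × 20 or 61 × 61, which is presumably why the reported gains concern those models.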
Citations: 0
Correction to: Evolutionary Rate Incongruences in Squamates Reveal Contrasting Patterns of Evolutionary Novelties and Innovation.
IF 6.5 Zone 1 Biology Q1 EVOLUTIONARY BIOLOGY Pub Date: 2026-03-03 DOI: 10.1093/sysbio/syag021
Citations: 0
The impact of incomplete taxon sampling on inference of gene flow by Bayesian and summary methods using genomic sequence data.
IF 6.5 Zone 1 Biology Q1 EVOLUTIONARY BIOLOGY Pub Date: 2026-03-03 DOI: 10.1093/sysbio/syag023
Sirui Cheng, Thomas Flouri, Tianqi Zhu, Ziheng Yang
Interspecific gene flow is commonly inferred using genomic data under the multispecies coalescent (MSC) model. Incomplete taxon sampling can impact inference of gene flow in multiple ways. First, unsampled ghost lineages that are sources of introgression may mislead inference of gene flow in analysis of genomic data from sampled species. Second, incomplete taxon sampling causes merges of branches on the species phylogeny and complicates the definition and estimation of the rate or magnitude of gene flow, measured by the expected proportion of immigrants in the recipient population (i.e., the introgression probability). We use mathematical analysis and computer simulation to examine the impact of incomplete taxon sampling on inference of gene flow and estimation of its rate using genomic data. We introduce a Bayesian testing approach to select models of gene flow for a species triplet (such as ghost introgression, inflow, and outflow), using the Savage-Dickey density ratio to calculate Bayes factors. We show that the approach has excellent sensitivity and specificity, whereas heuristic methods based on data summaries typically cannot distinguish among those scenarios. We find that genomic data allow reliable estimation of the proportion of immigrants (rather than the number of immigrants), even when the assumed demographic model is incorrect due to incomplete taxon sampling. When population size differs among species, assuming the same size may lead to seriously biased estimates of the rate of gene flow. The f-branch approach is effective in reducing the number of gene-flow events suggested by triplet analyses but often fails to identify the correct model of gene flow and tends to underestimate the rate of gene flow. Our results highlight the need for improving summary methods to accommodate different population sizes and to infer gene flow between sister lineages.
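For reference, the Savage-Dickey density ratio expresses the Bayes factor for a point null nested in a larger model as a ratio of posterior to prior density at the null value. In the notation below, which is assumed here rather than taken from the paper, $\varphi$ is the introgression probability, $M_1$ is the model with gene flow, and $M_0$ is the nested model with $\varphi = 0$. The identity holds when the prior on the remaining parameters under $M_0$ matches the conditional prior under $M_1$ at $\varphi = 0$.

```latex
B_{01} \;=\; \frac{p(D \mid M_0)}{p(D \mid M_1)}
       \;=\; \frac{p(\varphi = 0 \mid D, M_1)}{p(\varphi = 0 \mid M_1)}
```

A posterior density at $\varphi = 0$ that is lower than the prior density gives $B_{01} < 1$, that is, evidence in favour of gene flow, and only a single run under $M_1$ is needed to evaluate it.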
Citations: 0