Pablo Gutiérrez de la Peña,Guillermo Iglesias,Edgar Talavera,Andrea Sánchez Meseguer,Isabel Sanmartín
Birth-Death models applied to dated phylogenies are useful tools to study past diversification dynamics. Parameters in these stochastic models are typically inferred using likelihood-based methods; however, these approaches can become computationally intractable for models of moderate to high complexity. Deep Learning (DL) is one approach that allows model complexity to increase while remaining computationally tractable. These techniques have recently been explored in the context of serially sampled phylogenies (phylodynamics) and trait-dependent birth-death models (macroevolution). Here, we explore the power of Convolutional Neural Networks (CNNs) to solve classification (model selection) and regression (parameter estimation) tasks for extant-only phylogenies under six constant-rate and time-varying, lineage-homogeneous diversification scenarios: Constant Birth-Death, High Extinction, Mass Extinction, Diversity-Dependent Diversification, and the piecewise-constant scenarios Stasis-and-Radiate and Waxing-and-Waning. We simulated 10,000 phylogenetic trees under each diversification scenario and encoded them using the CDV vectorization procedure to capture branch length information. The encoded trees were used to train a set of CNN models designed to match three empirical phylogenies of eucalypts, conifers, and cetaceans, which have previously been used for benchmarking diversification models and differ in the number of extant tips. Additionally, we compared CNN performance with Maximum Likelihood Estimation (MLE) for the same set of scenarios. We found that CNNs reached classification accuracies of 80-93%, whereas MLE reached 70-74%. The most difficult scenarios for CNN classification were High Extinction and Mass Extinction. For regression tasks, mean absolute errors were slightly higher for MLE than for DL. Both approaches had difficulty estimating ratio parameters such as mass-extinction survival and relative extinction. Finally, we applied our CNN models for parameter estimation on the three empirical phylogenies under the best-fit diversification scenario. This allows us to discuss shortcomings and future avenues for improvement, such as the inclusion of rate-variable, lineage-heterogeneous models.
{"title":"On the utility of Deep Learning for model classification and parameter estimation on complex diversification scenarios.","authors":"Pablo Gutiérrez de la Peña,Guillermo Iglesias,Edgar Talavera,Andrea Sánchez Meseguer,Isabel Sanmartín","doi":"10.1093/sysbio/syag030","DOIUrl":"https://doi.org/10.1093/sysbio/syag030","journal":{"name":"Systematic Biology"},"publicationDate":"2026-03-24","publicationTypes":"Journal Article"}
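As a rough illustration of the Constant Birth-Death scenario named in the abstract above, the following Python sketch simulates the number of lineages through time with a Gillespie-style algorithm. The function name `simulate_birth_death`, its parameters, and any rate values passed to it are assumptions made for illustration. This is not the authors' simulation pipeline: it tracks only lineage counts, not a full tree, and it does not perform the CDV encoding.

```python
import random


def simulate_birth_death(birth, death, t_max, n0=1, seed=None):
    """Simulate lineage counts under a constant-rate birth-death process.

    Returns a list of (time, n_lineages) events starting at (0.0, n0).
    The simulation stops at t_max or when every lineage has gone extinct.
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    history = [(t, n)]
    while n > 0:
        total_rate = n * (birth + death)
        if total_rate <= 0:
            break  # no events can occur
        # Waiting time to the next event is exponential in the total rate.
        t += rng.expovariate(total_rate)
        if t > t_max:
            break
        # The event is a speciation with probability birth / (birth + death).
        if rng.random() < birth / (birth + death):
            n += 1
        else:
            n -= 1
        history.append((t, n))
    return history
```

Calling `simulate_birth_death(1.0, 0.9, 5.0, seed=42)` would mimic a high relative extinction setting; the values are arbitrary and not taken from the study.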
The classification of organisms into species is fundamental to the study of life. Contrary to popular belief, simple and quantitative standards for species delineation are often lacking, and debates about species boundaries create obstacles for conservation biology, agriculture, legislation, and education. We chose butterflies as a model system to address this key biological question. We sequenced and analyzed transcriptomes of 186 butterfly specimens representing 25 pairs of species, including close but clearly distinct species, conspecific populations, and taxa with debated relationships. We found that species are robustly separated from conspecific populations by the combination of two measures computed on Z-linked genes: a fixation index, which detects genetic gaps between species, and the extent of gene flow, which quantifies reproductive isolation. Applying these criteria suggests that all nine butterfly pairs with debated relationships are distinct species, not populations or subspecies. Furthermore, we found that elevated divergence and positive selection in proteins involved in DNA interaction, circadian clock, pheromone sensing, development, and immune response recurrently correlate with speciation. A significant fraction of these divergent proteins are encoded by the Z chromosome, which appears to be more resistant to introgression than autosomes. Taken together, our findings point to potential common speciation mechanisms in butterflies, provide additional support for the important role of the Z chromosome in speciation, and suggest quantitative criteria for species delimitation, which is vital for the exploration of biodiversity.
{"title":"Genomic Signatures of Speciation in Butterflies","authors":"Qian Cong, Jing Zhang, Nick V Grishin","doi":"10.1093/sysbio/syag029","DOIUrl":"https://doi.org/10.1093/sysbio/syag029","journal":{"name":"Systematic Biology"},"publicationDate":"2026-03-20","publicationTypes":"Journal Article"}
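The butterfly abstract above relies on a fixation index to detect genetic gaps between taxa. A minimal sketch of the textbook form, Wright's F_ST = (H_T - H_S) / H_T for a single biallelic site and two equally weighted populations, is given below. The function name `fst_biallelic` is an assumption, and the abstract does not say which estimator the authors used, so this should be read as a generic illustration only.

```python
def fst_biallelic(p1, p2):
    """Wright's F_ST for one biallelic site and two equally weighted populations.

    p1, p2: frequency of the same allele in population 1 and population 2.
    """
    p_bar = (p1 + p2) / 2.0
    # Expected heterozygosity if both populations were pooled.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Mean expected heterozygosity within each population.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0.0:
        return 0.0  # site is monomorphic overall, so there is nothing to partition
    return (h_total - h_within) / h_total
```

Identical allele frequencies give 0, and a fixed difference (frequencies 1 and 0) gives 1, which corresponds to the kind of genetic gap the abstract describes.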
Inference of interspecific gene flow using genomic data is important for the reliable reconstruction of species phylogenies and for our understanding of the speciation process. Gene flow is harder to detect if it involves sister lineages than if it involves nonsister lineages; for example, most heuristic methods based on data summaries are unable to infer gene flow between sisters. Likelihood-based methods can identify introgression between sisters, but the test exhibits several nonstandard features, including boundary problems, indeterminate parameters, and multiple routes from the alternative to the null hypothesis. In the Bayesian test, those irregularities pose challenges to the use of the Savage-Dickey (S-D) density ratio to calculate the Bayes factor. Here we develop a theory for applying the S-D approach under nonstandard conditions. We show that the Bayesian test of introgression between sister lineages has low false-positive rates and high power. We discuss issues surrounding the estimation of the rate of gene flow between sister lineages, especially at very low or very high rates, and suggest that evidence for gene flow between sisters be assessed via a Bayesian test. We find that the species split time has a major impact on the information content in the data, with more information at deeper divergence. We use a genomic dataset from Sceloporus lizards to illustrate the test of gene flow between sister lineages.
{"title":"Bayesian test of gene flow between sister lineages using genomic data.","authors":"Ziheng Yang, Xiyun Jiao, Sirui Cheng, Tianqi Zhu","doi":"10.1093/sysbio/syag028","DOIUrl":"https://doi.org/10.1093/sysbio/syag028","journal":{"name":"Systematic Biology"},"publicationDate":"2026-03-14","publicationTypes":"Journal Article"}
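For readers unfamiliar with the Savage-Dickey density ratio mentioned in the abstract above, the regular (standard-condition) form is B01 = p(theta0 | data, H1) / p(theta0 | H1), the posterior density at the null value divided by the prior density at the same point. The sketch below applies it to a toy Beta-Binomial model. The function names and the toy model are assumptions chosen because the posterior is available in closed form; it does not model introgression and does not implement the nonstandard-condition theory the authors develop.

```python
import math


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def savage_dickey_bf01(k, n, theta0=0.5, a=1.0, b=1.0):
    """Bayes factor B01 for H0: theta = theta0 against H1: theta ~ Beta(a, b),
    given k successes in n binomial trials."""
    prior = beta_pdf(theta0, a, b)
    # Conjugate update: the posterior under H1 is Beta(a + k, b + n - k).
    posterior = beta_pdf(theta0, a + k, b + n - k)
    return posterior / prior
```

Values of B01 above 1 favor the null value and values below 1 favor the alternative; with no data (n = 0) the ratio is exactly 1.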
Sean W McHugh, Michael J Donoghue, Michael J Landis
Where each species actually lives is distinct from where it could potentially survive and persist. This suggests it is important to distinguish established biome affinities (where species live) from enabled affinities (where species could live) when considering how ancestral species moved and evolved among major habitat types, yet typical phylogenetic approaches modeling biome shifts only consider established affinities while disregarding enabled ones. We introduce a new phylogenetic method, called RFBS (Realized & Fundamental Biome Shifts), to model how anagenetic and cladogenetic events cause established and enabled biome affinities (or, more generally, other discrete realized versus fundamental niche states) to shift over evolutionary timescales. We provide practical guidelines for how to assign established and enabled biome affinity states to extant taxa, using the flowering plant clade Viburnum as a case study. Through a battery of simulation experiments, we show that RFBS performs well, even when we have realistically imperfect knowledge of enabled biome affinities for most analyzed species. We also show that RFBS reliably discerns established from enabled affinities, with similar accuracy to Dispersal-Extinction-Cladogenesis models that ignore the existence of enabled biome affinities. Lastly, we apply RFBS to Viburnum to infer ancestral biomes throughout the tree and to highlight instances where repeated shifts between established affinities for warm and cold temperate forest biomes were enabled by a stable and slowly-evolving enabled affinity for both temperate biomes.
{"title":"A Phylogenetic Model of Established and Enabled Biome Shifts.","authors":"Sean W McHugh, Michael J Donoghue, Michael J Landis","doi":"10.1093/sysbio/syag026","DOIUrl":"https://doi.org/10.1093/sysbio/syag026","journal":{"name":"Systematic Biology"},"publicationDate":"2026-03-14","publicationTypes":"Journal Article"}
Accurate phylogenetic inference requires models that account for heterogeneity in molecular evolution. Mitochondrial protein-coding genes, which encode membrane-bound proteins composed of multiple transmembrane α-helices, exhibit considerable compositional and functional variation across structural regions, variation that is often overlooked in standard partitioning strategies. Here, we introduce TRAMPO (TRAnsMembrane Protein Order), a novel pipeline that incorporates predicted secondary structural features (i.e. matrix-facing, transmembrane, and intermembrane-facing domains) into phylogenetic partitioning schemes. We applied TRAMPO to seven mitochondrial datasets spanning crustaceans, hexapods, and vertebrates, and evaluated eight partitioning strategies based on combinations of codon position, strand, and secondary structure. Transmembrane helices, especially at second codon positions, showed pronounced thymine enrichment and hydrophobic amino-acid composition, reflecting domain-specific evolutionary constraints. To assess whether these structural patterns influence phylogenetic reconstruction, we performed maximum likelihood analyses under standard and Lie Markov models, General Heterogeneous evolution On a Single Topology, and profile mixture models. We also evaluated different models of among-site rate variation (including the proportion of invariant sites, gamma distributions, and FreeRates, which approximates rate heterogeneity using flexible discrete rate categories) to examine their interaction with partitioning strategies and overall model performance. Incorporating structural information into partitioning schemes consistently improved model fit and reduced apparent heterogeneity, as reflected in lower AIC values and more compositionally homogeneous partitions. These improvements translated into more consistent and topologically congruent phylogenetic trees across most datasets, while also reducing computational time. 
Notably, second codon positions within transmembrane helices were consistently retained as distinct partitions during model optimization, even in Mammals and Vertebrates, where secondary structure contributed little to overall model performance, underscoring their strong and conserved evolutionary signal. Surveys of tree space using quartet distances further supported these findings, with structurally informed models yielding more tightly clustered and internally consistent tree topologies. The benefits of structural partitioning were most pronounced in lineages of intermediate evolutionary depth and declined in ancient vertebrate and mammalian clades, where substitutional saturation accumulates with evolutionary time and strand asymmetry tends to emerge more frequently. In some cases, models with the lowest AIC did not yield the most congruent topologies, underscoring the limitations of information criteria when comparing models of different complexity. Overall, our findings demonstrate that secondary structural features,
{"title":"Integrating Secondary Structure Information Enhances Phylogenetic Signal in Mitochondrial Protein Coding Genes","authors":"Claudio Cucini, Francesco Nardi, Joan Pons","doi":"10.1093/sysbio/syag027","DOIUrl":"https://doi.org/10.1093/sysbio/syag027","journal":{"name":"Systematic Biology"},"publicationDate":"2026-03-09","publicationTypes":"Journal Article"}
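The TRAMPO abstract above compares partitioning schemes by AIC. As a small reminder of how that comparison works, the sketch below computes AIC = 2k - 2 ln L and ranks candidate schemes by it. The scheme names, log-likelihoods, and parameter counts are hypothetical values invented for illustration and are not results from the study.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def rank_by_aic(models):
    """models maps a scheme name to (log_likelihood, n_params).

    Returns (name, aic, delta_aic) tuples sorted from best to worst.
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
    best = min(scores.values())
    return sorted(((name, s, s - best) for name, s in scores.items()),
                  key=lambda item: item[1])


# Hypothetical log-likelihoods and parameter counts, for illustration only.
example = {
    "codon_position": (-10250.0, 30),
    "codon_position+structure": (-10180.0, 54),
}
ranking = rank_by_aic(example)
```

As the abstract itself cautions, the scheme with the lowest AIC is not guaranteed to yield the most congruent topology, so a ranking like this is only one line of evidence.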
Traditional phylogenetically aware correlation methods perform well under gradual evolutionary processes. However, abrupt evolutionary shifts—or macroevolutionary jumps, characteristic of punctuated evolution—can produce extreme phylogenetically independent contrasts (PIC), leading to inflated false positives or increased false negatives in trait correlation analyses. We introduce O(D)GC (Outlier- and Distribution-Guided Correlation), a flexible workflow that identifies outliers in PICs using a distribution-free boxplot criterion and applies Spearman correlation whenever influential outliers are detected. If no outliers are detected, Pearson correlation is used—automatically for large datasets (n ≥ 30), or guided by normality testing in smaller samples. We systematically compared PIC-O(D)GC with five widely applied phylogenetic correlation methods—PIC-Pearson, PIC-MM, PGLS (phylogenetic generalized least squares), MR-PMM (multi-response phylogenetic mixed model), and Corphylo—on 322,000 simulated datasets spanning five evolutionary scenarios (two shift settings: single-trait shifts and dual-trait co-directional jumps; and three no-shift gradual evolution settings), including both fixed-depth and randomly located shifts, tested across 11 shift or noise gradients, three tree sizes (16, 128, 256 tips), and both balanced and random topologies. Overall, PIC-O(D)GC achieved error rates comparable to—or noticeably higher than—those of PIC-MM, while yielding substantially lower error rates than most alternative methods. Under no-shift conditions, it retained power similar to other methods. Analyses of three empirical datasets likewise showed that PIC-O(D)GC and PIC-MM corrected shift-induced distortions that misled conventional methods. Moreover, PIC-O(D)GC offers a conceptually simple framework and incurs markedly lower computational cost. By design, its correlation-only output provides less mechanistic detail than regression-based approaches like PGLS. 
However, when paired with PIC diagnostics, this outlier-guided strategy highlights evolutionary jumps, distinguishes coupled from decoupled shifts, and—via clade partitioning or tip pruning—recovers background correlations, offering biologically informative insights into how punctuated events interact with gradual trends in trait evolution.
{"title":"Improving the Robustness of Phylogenetic Independent Contrasts: Addressing Abrupt Evolutionary Shifts with Outlier- and Distribution-Guided Correlation","authors":"Zheng-Lin Chen, Rui Huang, Hong-Ji Guo, Deng-Ke Niu","doi":"10.1093/sysbio/syag024","DOIUrl":"https://doi.org/10.1093/sysbio/syag024","journal":{"name":"Systematic Biology"},"publicationDate":"2026-03-09","publicationTypes":"Journal Article"}
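The decision rule described in the O(D)GC abstract above (Spearman when boxplot outliers are present, otherwise Pearson for n ≥ 30 or when small samples look normal) can be sketched as follows. Several details are assumptions rather than the authors' specification: the 1.5 × IQR Tukey fences, the treatment of any outlier as influential, the fall-back to Spearman when a small sample fails the normality check, and all function names. The `is_normal` argument is a placeholder that always returns True; in practice it would be replaced by a real test such as Shapiro-Wilk. The inputs `x` and `y` stand for contrasts that have already been computed.

```python
import math
import statistics


def has_boxplot_outliers(values, k=1.5):
    """True if any value lies outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return any(v < q1 - k * iqr or v > q3 + k * iqr for v in values)


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def odgc_correlation(x, y, is_normal=lambda values: True, large_n=30):
    """Return (method_name, coefficient) following the outlier-guided rule.

    is_normal is a placeholder assumption; supply a real normality test.
    """
    if has_boxplot_outliers(x) or has_boxplot_outliers(y):
        return "spearman", spearman(x, y)
    if len(x) >= large_n or (is_normal(x) and is_normal(y)):
        return "pearson", pearson(x, y)
    return "spearman", spearman(x, y)
```

A single extreme contrast is enough to switch the rule to the rank-based coefficient, which is the behavior the abstract credits with limiting the influence of abrupt shifts.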
Shayesteh Arasti, Puoya Tabaghi, Yasamin Tabatabaee, Alan K Mayer, Siavash Mirarab
The abundant discordance between evolutionary relationships across the genome has rekindled interest in methods for comparing and averaging trees on a shared leaf set. However, compared to tree topology, where much progress has been made, handling branch lengths has been more challenging. Species tree branch lengths can be measured in various units, often different from gene trees. Moreover, rates of evolution change across the genome, the species tree, and specific branches of gene trees. These factors compound the stochasticity of coalescence times and estimation noise, making branch lengths highly heterogeneous across the genome. For many downstream applications in phylogenomic analyses, branch lengths are as important as the topology, and yet, existing tools to compare and combine weighted trees are limited. In this paper, we address the question of matching one tree to another, accounting for their branch lengths. We define a series of computational problems called Topology-Constrained Metric Matching (TCMM) that seek to transform the branch lengths of a query tree based on a reference tree. We show that TCMM problems can be solved efficiently using a linear algebraic formulation coupled with dynamic programming preprocessing. While many applications can be imagined for this framework, we explore two applications in this paper: embedding leaves of gene trees in Euclidean space to find outliers potentially indicative of estimation errors, and summarizing gene tree branch lengths onto the species tree. In these applications, our method, when paired with existing methods, increases their accuracy at limited computational expense.
{"title":"Branch Length Transforms using Optimal Tree Metric Matching","authors":"Shayesteh Arasti, Puoya Tabaghi, Yasamin Tabatabaee, Alan K Mayer, Siavash Mirarab","doi":"10.1093/sysbio/syag025","DOIUrl":"https://doi.org/10.1093/sysbio/syag025","journal":{"name":"Systematic Biology"},"publicationDate":"2026-03-09","publicationTypes":"Journal Article"}
Karthik Gangavarapu,Xiang Ji,Yucai Shao,Andrew Rambaut,Philippe Lemey,Guy Baele,Marc A Suchard
Massively parallel algorithms leveraging graphics processing units (GPUs) have significantly accelerated inference in statistical phylogenetics, with applications in understanding pathogen evolution, population dynamics, natural selection, and evolutionary timescales using ancient genomes. Continued advancements in GPU hardware necessitate innovative algorithms to fully exploit their potential. Here, we introduce three novel algorithms that accelerate matrix multiplication operations using tensor cores on NVIDIA GPUs to calculate the observed sequence data likelihood and the gradient of the log-likelihood with respect to branch-length-specific parameters under continuous-time Markov chain models of evolution. The algorithms presented in this paper deliver 2 to 3-fold gains in performance for amino acid and codon models compared to existing GPU-based massively parallel algorithms. Notably, these performance gains are accompanied by a ~2-fold reduction in energy usage, demonstrating the potential of these algorithms to lower the carbon footprint of evolutionary computing. We make our new algorithms available to the broader phylogenetics community through the high-performance, open source library BEAGLE v4.0.0.
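To make concrete why matrix multiplication dominates this likelihood calculation, here is a minimal CPU sketch of one pruning step on a two-tip tree. It is not the BEAGLE implementation and not one of the three tensor-core algorithms. The Jukes-Cantor model, the equal base frequencies, the toy sequences, and all function names are assumptions chosen for illustration.

```python
import math

def jc69_matrix(t):
    """Jukes-Cantor (JC69) transition probabilities for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Plain matrix product: (n x k) times (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def tip_partials(seq, alphabet="ACGT"):
    """4 x n_sites matrix with a 1 in the row of each observed state."""
    return [[1.0 if base == s else 0.0 for base in seq] for s in alphabet]

def parent_partials(p_left, l_left, p_right, l_right):
    """Pruning step: elementwise product of two matrix products."""
    a, b = matmul(p_left, l_left), matmul(p_right, l_right)
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def site_likelihoods(partials, freqs=(0.25, 0.25, 0.25, 0.25)):
    """Weight the root partials by the stationary frequencies, per site."""
    n_sites = len(partials[0])
    return [sum(freqs[s] * partials[s][j] for s in range(4))
            for j in range(n_sites)]

# Two tips joined at the root, with hypothetical branch lengths and sequences.
t1, t2 = 0.1, 0.3
root = parent_partials(jc69_matrix(t1), tip_partials("ACGT"),
                       jc69_matrix(t2), tip_partials("ACCA"))
liks = site_likelihoods(root)
log_lik = sum(math.log(x) for x in liks)
```

Each call to `matmul` multiplies a transition matrix (states by states) by a partial-likelihood matrix (states by sites). With 20 amino acid states or 61 codon states these products become large and dense, which is consistent with the abstract reporting its gains for amino acid and codon models.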
{"title":"Tensor cores unlock efficient and lower-energy massive parallelization on phylogenetic trees.","authors":"Karthik Gangavarapu,Xiang Ji,Yucai Shao,Andrew Rambaut,Philippe Lemey,Guy Baele,Marc A Suchard","doi":"10.1093/sysbio/syag017","DOIUrl":"https://doi.org/10.1093/sysbio/syag017","journal":{"name":"Systematic Biology"},"publicationDate":"2026-03-03","publicationTypes":"Journal Article"}
{"title":"Correction to: Evolutionary Rate Incongruences in Squamates Reveal Contrasting Patterns of Evolutionary Novelties and Innovation.","authors":"","doi":"10.1093/sysbio/syag021","DOIUrl":"https://doi.org/10.1093/sysbio/syag021","journal":{"name":"Systematic Biology"},"publicationDate":"2026-03-03","publicationTypes":"Journal Article"}
Interspecific gene flow is commonly inferred using genomic data under the multispecies coalescent (MSC) model. Incomplete taxon sampling can impact inference of gene flow in multiple ways. First, unsampled ghost lineages that are sources of introgression may mislead inference of gene flow in analyses of genomic data from sampled species. Second, incomplete taxon sampling causes branches on the species phylogeny to merge, which complicates the definition and estimation of the rate or magnitude of gene flow, measured by the expected proportion of immigrants in the recipient population (i.e., the introgression probability). We use mathematical analysis and computer simulation to examine the impact of incomplete taxon sampling on inference of gene flow and estimation of its rate using genomic data. We introduce a Bayesian testing approach to select among models of gene flow for a species triplet (such as ghost introgression, inflow, and outflow), using the Savage-Dickey density ratio to calculate Bayes factors. We show that the approach has excellent sensitivity and specificity, whereas heuristic methods based on data summaries typically cannot distinguish among those scenarios. We find that genomic data allow reliable estimation of the proportion of immigrants (rather than the number of immigrants), even when the assumed demographic model is incorrect due to incomplete taxon sampling. When population sizes differ among species, assuming a common size may lead to seriously biased estimates of the rate of gene flow. The f-branch approach is effective in reducing the number of gene-flow events suggested by triplet analyses but often fails to identify the correct model of gene flow and tends to underestimate the rate of gene flow. Our results highlight the need to improve summary methods so that they accommodate different population sizes and can infer gene flow between sister lineages.
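The Savage-Dickey density ratio named in the abstract can be written out for a single nested comparison. The notation is an assumption for illustration and is not taken from the paper: $\varphi$ is the introgression probability, $M_1$ is the model with gene flow, $M_0$ is the nested model obtained by fixing $\varphi = \varphi_0$, and $D$ is the sequence data.

```latex
B_{01} \;=\; \frac{p(D \mid M_0)}{p(D \mid M_1)}
       \;=\; \frac{p(\varphi = \varphi_0 \mid D, M_1)}{p(\varphi = \varphi_0 \mid M_1)}
```

The identity holds when the prior on the remaining parameters under $M_0$ equals their conditional prior under $M_1$ given $\varphi = \varphi_0$. Its practical appeal is that the Bayes factor comes from a single posterior sample under $M_1$, by comparing the posterior and prior densities at the null value, with no separate marginal-likelihood calculation. How the authors treat a null value on the boundary of the parameter space (for example $\varphi_0 = 0$) is not stated in the abstract.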
{"title":"The impact of incomplete taxon sampling on inference of gene flow by Bayesian and summary methods using genomic sequence data.","authors":"Sirui Cheng,Thomas Flouri,Tianqi Zhu,Ziheng Yang","doi":"10.1093/sysbio/syag023","DOIUrl":"https://doi.org/10.1093/sysbio/syag023","journal":{"name":"Systematic Biology"},"publicationDate":"2026-03-03","publicationTypes":"Journal Article"}