Sebastian Prillo, Akshay Ravoor, Nir Yosef, Yun S Song
Branch length estimation is a fundamental problem in Statistical Phylogenetics and a core component of tree reconstruction algorithms. Traditionally, general time-reversible mutation models are employed, and many software tools exist for this scenario. With the advent of CRISPR/Cas9 lineage tracing technologies, there has been significant interest in the study of branch length estimation under irreversible mutation models. Under the CRISPR/Cas9 mutation model, irreversible mutations – in the form of DNA insertions or deletions – are accrued during the experiment, which are then read out at the single-cell level to reconstruct the cell lineage tree. However, most of the analyses of CRISPR/Cas9 lineage tracing data have so far been limited to the reconstruction of single-cell tree topologies, which depict lineage relationships between cells, but not the amount of time that has passed between ancestral cell states and the present. Time-resolved trees, known as chronograms, would allow one to study the evolutionary dynamics of cell populations at an unprecedented level of resolution. Indeed, time-resolved trees would reveal the timing of events on the tree, the relative fitness of subclones, and the dynamics underlying phenotypic changes in the cell population – among other important applications. In this work, we introduce the first scalable and accurate method to refine any given single-cell tree topology into a single-cell chronogram by estimating its branch lengths. To do this, we perform regularized maximum likelihood estimation under a general irreversible mutation model, paired with a conservative version of maximum parsimony that reconstructs only the ancestral states that we are confident about. To deal with the particularities of CRISPR/Cas9 lineage tracing data – such as double-resection events which affect runs of consecutive sites – we avoid making our model more complex and instead opt for using a simple but effective data encoding scheme. 
Similarly, we avoid explicitly modeling the missing data mechanisms – such as heritable missing data – by instead assuming that data are missing completely at random. We stabilize estimates in low-information regimes with a simple penalized version of maximum likelihood estimation (MLE) that uses a minimum branch length constraint and pseudocounts. All this leads to a convex MLE problem that can be readily solved in seconds with off-the-shelf convex optimization solvers. We benchmark our method using both simulations and real lineage tracing data, and show that it performs well on several tasks, matching or outperforming competing methods such as TiDeTree and LAML in terms of accuracy, while being 10–100× faster. Notably, our statistical model is simpler and more general, as we do not explicitly model the intricacies of CRISPR/Cas9 lineage tracing data. In this sense, our contribution is twofold: (1) a fast and robust method for branch length estimation under a general irreversible mutation model,
ConvexML: Fast and accurate branch length estimation under irreversible mutation models, illustrated through applications to CRISPR/Cas9-based lineage tracing. Sebastian Prillo, Akshay Ravoor, Nir Yosef, Yun S Song. Systematic Biology, published 2025-08-12. https://doi.org/10.1093/sysbio/syaf054
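The ConvexML abstract above states that the estimation problem is convex. A simplified worked example may help show why an irreversible model can have this property. The setup below is an assumption made for illustration (one branch, a single shared mutation rate, and the symbols $a$, $b$, $r$, $t$); it is not the paper's exact likelihood.

```latex
% Simplified illustration (assumed, not the paper's exact likelihood):
% one branch of length t > 0, a single mutation rate r > 0,
% a sites that stay unmutated and b sites that mutate along the branch.
\begin{align*}
P(\text{unmutated}) &= e^{-rt}, \qquad P(\text{mutated}) = 1 - e^{-rt},\\
\ell(t) &= -a\,r\,t + b \log\!\left(1 - e^{-rt}\right),\\
\frac{d^2 \ell}{dt^2} &= -\,\frac{b\,r^2 e^{rt}}{\left(e^{rt} - 1\right)^2} \le 0 .
\end{align*}
```

Because $\ell$ is concave in $t$, maximizing it is a convex optimization problem. Under the same assumptions, concavity is preserved when $t$ is replaced by a sum of branch lengths along a path (a concave function of an affine expression stays concave), and a minimum branch length constraint of the form $t \ge t_{\min}$ is linear, so it keeps the feasible set convex.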
Gene tree discordance along a set of aligned genomes presents a challenge for phylogenomic methods to identify the non-recombining regions and reconstruct the phylogenetic tree for each region. To address this problem, many studies used the non-overlapping window approach, often with an arbitrary selection of fixed window sizes that potentially include intra-window recombination events. In this study, we propose an information theoretic approach to select a window size that best reflects the underlying histories of the alignment. First, we simulated chromosome alignments that reflected the key characteristics of an empirical dataset and found that the AIC is a good predictor of window size accuracy in correctly recovering the tree topologies of the alignment. To address the issue of missing data in empirical datasets, we designed a stepwise non-overlapping window approach that compares the AIC of two window sizes at a time, retaining only genomic regions that can be analysed using both window sizes. We then applied this method to the genomes of Heliconius butterflies and great apes. We found that the best window sizes for the butterflies’ chromosomes ranged from <125bp to 250bp, which are much shorter than those used in a previous study even though this difference in window size did not significantly change the most common topologies across the genome. On the other hand, the best window sizes for great apes’ chromosomes ranged from 500bp to 1kb with the proportion of the major topology (grouping human and chimpanzee) falling between 60% and 87%, consistent with previous findings. Additionally, we observed a notable impact of gene tree estimation error and concatenation when using small and large windows, respectively. For instance, the proportion of the major topology for great apes was 50% when using 250bp windows, but reached almost 100% for 64kb windows. 
In conclusion, our study highlights the challenges associated with selecting a fixed window size in non-overlapping window analyses and proposes the AIC as a less arbitrary way to select the optimal window size when running non-overlapping window methods on whole-genome alignments.
Selecting a Window Size for Phylogenomic Analyses of Whole Genome Alignments using AIC. Jeremias Ivan, Paul Frandsen, Robert Lanfear. Systematic Biology, published 2025-08-12. https://doi.org/10.1093/sysbio/syaf053
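The window-size abstract above describes comparing the AIC of two window sizes over the same genomic regions. A minimal sketch of that comparison follows. It assumes that the AIC of a window size is the sum of per-window AIC values (AIC = 2k − 2 ln L) over a shared region; the function names and all log-likelihoods and parameter counts are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Each window fit is (log_likelihood, number_of_free_parameters).
WindowFit = Tuple[float, int]


def aic(log_likelihood: float, num_params: int) -> float:
    """Akaike information criterion, 2k - 2 ln L; lower is better."""
    return 2 * num_params - 2 * log_likelihood


def total_aic(windows: List[WindowFit]) -> float:
    """Sum per-window AIC values (assumed aggregation over one shared region)."""
    return sum(aic(lnl, k) for lnl, k in windows)


def best_window_size(fits: Dict[int, List[WindowFit]]) -> int:
    """Return the window size (in bp) whose fits give the lowest total AIC."""
    return min(fits, key=lambda size: total_aic(fits[size]))


# Hypothetical fits for the same 1 kb region, split two different ways.
fits = {
    250: [(-1200.0, 10), (-1180.0, 10), (-1210.0, 10), (-1190.0, 10)],
    500: [(-2395.0, 10), (-2410.0, 10)],
}

print(best_window_size(fits))
```

In this made-up example the four 250 bp windows use twice as many parameters as the two 500 bp windows, but their higher combined log-likelihood outweighs that penalty, so the smaller window size is selected.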
Nicolás Mongiardino Koch, Jeffrey R Thompson, Rich Mooi, Greg W Rouse
Phylogenetic clock models translate inferred amounts of evolutionary change (calculated from either genotypes or phenotypes) into estimates of elapsed time, providing a mechanism for time scaling phylogenetic trees. Relaxed-clock models, which accommodate variation in evolutionary rates across branches, are one of the main components of Bayesian dating, yet their consequences for total-evidence phylogenetics have not been thoroughly explored. Here, we combine morphological, molecular (both transcriptomic and Sanger-sequenced), and stratigraphic datasets for all major lineages of echinoids (sea urchins, heart urchins, sand dollars). We then perform total-evidence dated inference under the fossilized birth-death prior, varying two analytical conditions: the choice between autocorrelated and uncorrelated relaxed clocks, which enforce (or not) evolutionary rate inheritance; and the ability to recover fossil terminals as direct ancestors. Our results highlight a previously unnoticed interaction between tree and clock models, with analyses implementing an autocorrelated clock failing to recover any direct ancestors. Nonetheless, even under conditions conducive to the placement of fossil terminals as ancestors, we find this type of relationship to be accommodated without any impact on either topology or node ages. On the other hand, tree topology, fossil placement, divergence times, and downstream macroevolutionary inferences (e.g., ancestral state reconstructions) were all strongly affected by the type of relaxed clock implemented. In regions of the tree where molecular rate variation is pervasive and morphological signal relatively uninformative, fossil tips seem to play little to no role in informing divergence times, and instead passively move in and out of clades depending on the ages imposed upon surrounding nodes by molecular data. 
Our results highlight the extent to which the phylogenetic and macroevolutionary conclusions of total-evidence dated analyses are contingent on the choice of relaxed-clock model, underscoring the need for either careful methodological validation or a thorough assessment of sensitivity. Our efforts continue to illuminate the echinoid tree of life, supporting the erection of the order Apatopygoida to include three living species last sharing a common ancestor with other extant lineages around the time of the Jurassic-Cretaceous boundary. Furthermore, they also illustrate how the phylogenetic placement of extinct clades hinges upon the modelling of molecular data, evidencing the extent to which the fossil record remains subservient to phylogenomics.
But the Clock, Tick-Tock: An Empirical Case Study Highlights the Preeminence of Relaxed Clock Models in Total-Evidence Dating. Nicolás Mongiardino Koch, Jeffrey R Thompson, Rich Mooi, Greg W Rouse. Systematic Biology, published 2025-08-08. https://doi.org/10.1093/sysbio/syaf055
The Mountains of Southwest China, a global biodiversity hotspot, have a unique "sky island" landscape with high diversity of both ancient and recently formed species. While their distribution patterns offer significant insights into diversification processes, the complex geological and climatic history, combined with dynamic histories of gene flow in endemic taxa, makes unravelling this history challenging. This study focuses on Asian shrew moles (genus Uropsilus), an ancient group endemic to this region with an unresolved taxonomic system. By combining phylogenomic, introgression and demographic history analyses, we investigated the historical patterns of species diversification in this genus. We detected phylogenetic discordances among rapidly diverged lineages, driven by incomplete lineage sorting, both recent and ancient gene flow, and ghost introgression. The gene flow patterns revealed strong genetic isolation in the Hengduan Mountains region, contrasted by more extensive dispersal or connectivity in areas to its east, while suggesting potential ring-like diversification around the Sichuan Basin. Demographic history indicated that rapidly diverged lineages south of the Yangtze River exhibited significantly different responses to climatic fluctuations compared to other lineages, with the East Asian monsoon likely driving their radiative differentiation and dispersal. Our study demonstrates the impacts of mountain uplift, climatic changes, and the connectivity of sky island refugia in shaping the diverse patterns of species differentiation and their distribution. [phylogenomics; introgression; Asian shrew moles; demographic history].
Species Diversification in the Sky Islands of Southwestern China Revealed by Genomic, Introgression and Demographic Analyses of Asian Shrew Moles. Yi-Xian Li, Zhong-Zheng Chen, Quan Li, Tao Zhang, Feng Cheng, Wen-Yu Song, Xue-You Li, Shui-Wang He, Hong-Jiao Wang, Kenneth Otieno Onditi, Xue-Long Jiang. Systematic Biology, published 2025-07-31. https://doi.org/10.1093/sysbio/syaf052
Emmanuel F A Toussaint, Fabien L Condamine, Ana Paula dos Santos De Carvalho, David M Plotkin, Emily A Ellis, Kelly M Dexter, Chandra Earl, Kwaku Aduse-Poku, Michael F Braby, Hideyuki Chiba, Riley J Gott, Kiyoshi Maruyama, Ana BB Morais, Chris J Müller, Djunijanti Peggie, Szabolcs Sáfián, Roger Vila, Andrew D Warren, Masaya Yago, Jesse W Breinholt, Marianne Espeland, Naomi E Pierce, David J Lohman, Akito Y Kawahara
Characterizing drivers governing the diversification of species-rich lineages is challenging. Although butterflies are one of the most well-studied groups of insects, there are few comprehensive studies investigating their diversification dynamics. Here, we reconstruct a phylogenomic tree for ca. 1,500 species in the family Hesperiidae, the skippers, to test whether historical global climate change, geographical range evolution, and host-plant association are drivers of diversification. Our findings suggest skippers originated in Laurasia before the Cretaceous-Paleogene mass extinction, in a northern region centered on Beringia before colonizing southern regions coinciding with global climate cooling. Climate cooling also fostered the diversification of skippers throughout the Cenozoic possibly by fueling biome transitions from closed to open ecosystems such as grasslands. An early shift from dicot-feeding to monocot-feeding reduced extinction rates and increased speciation rates, explaining the large diversity of grass-feeding adapted skippers. A dynamic geographic range evolution and host-plant shifts linked with long-term climate change explain skipper butterfly diversification.
Global climate cooling spurred skipper butterfly diversification. Emmanuel F A Toussaint, Fabien L Condamine, Ana Paula dos Santos De Carvalho, David M Plotkin, Emily A Ellis, Kelly M Dexter, Chandra Earl, Kwaku Aduse-Poku, Michael F Braby, Hideyuki Chiba, Riley J Gott, Kiyoshi Maruyama, Ana BB Morais, Chris J Müller, Djunijanti Peggie, Szabolcs Sáfián, Roger Vila, Andrew D Warren, Masaya Yago, Jesse W Breinholt, Marianne Espeland, Naomi E Pierce, David J Lohman, Akito Y Kawahara. Systematic Biology, published 2025-07-28. https://doi.org/10.1093/sysbio/syaf029
The dependencies between characters used in phylogenetic analysis (e.g., inapplicabilities, functional dependencies) can be taken into account by using combinations of character states as possible ancestral morphotypes, and using appropriate rates of transformation between such morphotypes. As every morphotype represents a permissible combination of the original character states, this allows easily ruling out specific combinations of character states, and taking into account changes that are either less or more likely to co-occur, or to occur in certain contexts. For inapplicable characters, Goloboff et al. (2021) used morphotypes but proposed obtaining transition probabilities between morphotypes from products of transition probabilities of the original characters and factors to incorporate dependencies. The product of transition probabilities is shown here to be flawed (failing the time-continuity requirement of phylogenetic Markov models, essential for statistical consistency under the model). Tarasov (2023) used the same delimitation of morphotypes but proposed obtaining transition probabilities from rate matrices, synthesized in a stepwise fashion from the hierarchy of dependencies. This paper shows that the rate matrices can easily be created, instead of with a stepwise synthesis, from direct comparisons between legitimate morphotypes (as done by Goloboff and De Laet 2023 for parsimony). Based on a few simple rules, the resulting rate matrices are (for inapplicable characters) identical to those obtained by Tarasov (2023). Additionally, in the computer program TNT, biological dependencies beyond mere inapplicability can be specified by the user with a simple syntax for (combinations of) states in “parent” characters restricting the states that “child” characters can take, using AND and OR conjunctions for elaborate interactions. 
These researcher-defined rules are used to internally convert the original characters into morphotypes, discarding morphotypes made impossible by the rules. In the case of biological dependencies (where, depending on the parent characters, there can be restrictions in the states that dependent characters can take, instead of the character being inapplicable), the rates of transition between morphotypes cannot be calculated solely from comparisons of states differing in both morphotypes – consideration of the conditions of dependency is needed as well.
Phylogenetic Analysis of Characters with Dependencies under Maximum Likelihood. Pablo A Goloboff. Systematic Biology, published 2025-07-26. https://doi.org/10.1093/sysbio/syaf051
Correction to: How Important Is Budding Speciation for Comparative Studies? Systematic Biology. Published 2025-07-23. https://doi.org/10.1093/sysbio/syaf042
Correction to: Global Patterns of Taxonomic Uncertainty and its Impacts on Biodiversity Research. Systematic Biology. Published 2025-07-15. https://doi.org/10.1093/sysbio/syaf045
Joaquín Villamil, Mariana Morando, Luciano J Avila, Flávia M Lanna, Emanuel M Fonseca, Jack W Sites, Arley Camargo
Departures from the assumptions of the Multispecies Coalescent (MSC) can produce artefactual topologies and node height estimates; trees inferred without testing MSC model fit may therefore misrepresent the evolutionary history of a group. The current implementation of MSC model testing for non-genomic-level molecular markers cannot process trees estimated with BEAST 2, limiting its application to large datasets of sequence-based markers. Here we recode functions of the R package P2C2M to assess model fit to the MSC and apply this new implementation, which we named P2C2M2, to test the MSC model in a 16-locus dataset of 42 lizard species focused on the Liolaemus wiegmannii group. We found strong evidence of model departures in several loci, possibly due to historical gene flow, which could also explain the unexpected position of the L. wiegmannii group within the L. montanus section of Eulaemus when hybridization is not accounted for. When gene flow is incorporated via a Multispecies Network Coalescent model, the L. anomalus group is inferred as the closest to the L. wiegmannii group, and a previously unreported reticulation is inferred, suggesting historical gene flow between the L. wiegmannii and L. montanus groups. We argue that there are at least three sources of discrepancy between the literature and the node ages estimated in our study: the use of strict molecular clocks without statistical justification, misplaced fossil calibrations, and the estimation of coalescent times instead of species divergence times. We encourage systematists to routinely test the fit of the MSC model when estimating species trees from sequence-based markers, and to follow a phylogenetic network approach when this test is significant and historical gene flow is a plausible source of the departure from the MSC model.
Revisiting the Multispecies Coalescent Model fit with an example from a complete molecular phylogeny of the Liolaemus wiegmannii species group (Squamata: Liolaemidae). Systematic Biology. Published 2025-07-10. https://doi.org/10.1093/sysbio/syaf048
Lara M Kösters, Kevin Karbstein, Martin Hofmann, Ladislav Hodač, Patrick Mäder, Jana Wäldchen
DNA analyses have revolutionized species identification and taxonomic work. Yet, persistent challenges arise from little differentiation among and considerable variation within species, particularly among closely related groups. While images are commonly used as an alternative modality for automated identification tasks, their usability is limited by the same concerns. An integrative strategy, fusing molecular and image data through machine learning, holds significant promise for fine-grained species identification. However, a systematic overview and rigorous statistical testing concerning molecular and image preprocessing and fusion techniques, including practical advice for biologists, are missing so far. We introduce a machine learning scheme that integrates both molecular and image data for species identification. Initially, we systematically assess and compare three different DNA arrangements (aligned, unaligned, SNP-reduced) and two encoding methods (fractional, ordinal). Additionally, artificial neural networks are used to extract visual and molecular features, and we propose strategies for fusing this information. Specifically, we investigate three strategies: I) fusing directly after feature extraction, II) fusing features that passed through a fully connected layer after feature extraction, and III) fusing the output scores of both unimodal models. We systematically and statistically evaluate these strategies for four eukaryotic datasets, including two plant (Asteraceae, Poaceae) and two animal families (Lycaenidae, Coccinellidae) using Leave-One-Out Cross-Validation (LOOCV). In addition, we developed an approach to understand molecular- and image-specific identification failure. Aligned sequences with nucleotides encoded as decimal number vectors achieved the highest identification accuracy among DNA data preprocessing techniques in all four datasets. 
Fusing molecular and visual features directly after feature extraction yielded the best results for three out of four datasets (52-99%). Overall, combining DNA with image data significantly increased accuracy in three out of four datasets, with plant datasets showing the most substantial improvement (Asteraceae: +19%, Poaceae: +13.6%). Even for Lycaenidae with high identification accuracy based on molecular data (>96%), a statistically significant improvement (+2.1%) was observed. Detailed analysis of confusion rates between and within genera shows that DNA alone tends to identify the genus correctly, but often fails to recognize the species. The failure to resolve species is alleviated by including image data in the training. This increase in resolution hints at a hierarchical role of modalities in which molecular data coarsely groups the specimens to then be guided towards a more fine-grained identification by the connected image. We systematically showed and explained, for the first time, that optimizing the preprocessing and integration of molecular and image data offers signific
Data Fusion for Integrative Species Identification Using Deep Learning. Systematic Biology. Published 2025-06-13. https://doi.org/10.1093/sysbio/syaf026
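Two of the fusion strategies named in the abstract, fusing directly after feature extraction (I) and fusing the output scores of both unimodal models (III), can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature vectors and logits are made-up numbers, the function names are hypothetical, and averaging softmax probabilities is only one possible way to combine scores.

```python
import math

def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_features(dna_features, image_features):
    """Strategy I (sketch): concatenate both feature vectors right after
    extraction; a joint classifier would then be trained on the result."""
    return list(dna_features) + list(image_features)

def fuse_scores(dna_logits, image_logits):
    """Strategy III (sketch): combine per-class scores of the two unimodal
    models. Taking the mean of the softmax outputs is an assumption."""
    p_dna, p_img = softmax(dna_logits), softmax(image_logits)
    return [(a + b) / 2 for a, b in zip(p_dna, p_img)]

# Made-up example with three candidate species. The DNA model cannot separate
# species 0 and 1 (same genus); the image model prefers species 1.
dna_logits = [2.0, 2.0, -1.0]
image_logits = [0.5, 1.5, 0.0]

fused = fuse_scores(dna_logits, image_logits)
predicted = max(range(len(fused)), key=fused.__getitem__)
joint_features = fuse_features([0.1, 0.9], [0.3, 0.2, 0.7])
print(predicted, len(joint_features))
```

In this toy case the molecular scores narrow the choice to two species of one genus and the image scores break the tie, which mirrors the hierarchical role of the modalities described in the abstract.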