Leo A Featherstone, Andrew Rambaut, Sebastian Duchene, Wytamma Wirth
Molecular sequence data from rapidly evolving organisms are often sampled at different points in time. Sampling times can then be used for molecular clock calibration. The root-to-tip (RTT) regression is an essential tool to assess the degree to which the data behave in a clock-like fashion. Here, we introduce Clockor2, a client-side web application for conducting RTT regression. Clockor2 allows users to quickly fit local and global molecular clocks, thus handling the increasing complexity of genomic datasets that sample beyond the assumption of homogeneous host populations. Clockor2 is efficient, handling trees of up to the order of 10^4 tips, with significant speed increases compared with other RTT regression applications. Although Clockor2 is written as a web application, all data processing happens on the client side, meaning that data never leave the user's computer. Clockor2 is freely available at https://clockor2.github.io/.
Clockor2: Inferring Global and Local Strict Molecular Clocks Using Root-to-Tip Regression. Systematic Biology, pp. 623-628. Published 2024-09-05. doi:10.1093/sysbio/syae003. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11377183/pdf/
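As an illustration of the RTT regression described in the abstract above, here is a minimal Python sketch. It is not Clockor2's own code (Clockor2 is a web application), and it assumes that root-to-tip distances have already been extracted from a rooted tree. The function name `rtt_regression` and the example dates and distances are hypothetical. Under a strict clock, the slope estimates the substitution rate, the x-intercept estimates the date of the root, and R² summarises how clock-like the data are.

```python
from statistics import mean


def rtt_regression(dates, distances):
    """Ordinary least-squares fit of root-to-tip distance on sampling date."""
    if len(dates) != len(distances) or len(dates) < 2:
        raise ValueError("need at least two (date, distance) pairs")
    x_bar, y_bar = mean(dates), mean(distances)
    sxx = sum((x - x_bar) ** 2 for x in dates)
    syy = sum((y - y_bar) ** 2 for y in distances)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dates, distances))
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    slope = sxy / sxx                      # substitutions per site per time unit
    intercept = y_bar - slope * x_bar      # distance extrapolated to date 0
    r2 = (sxy ** 2) / (sxx * syy) if syy > 0 else 0.0
    # The x-intercept is the date at which the fitted distance is zero.
    root_date = -intercept / slope if slope != 0 else float("nan")
    return {"rate": slope, "intercept": intercept, "r2": r2, "root_date": root_date}


# Hypothetical, perfectly clock-like data: five tips sampled five years apart.
dates = [2000.0, 2005.0, 2010.0, 2015.0, 2020.0]
distances = [0.010, 0.015, 0.020, 0.025, 0.030]
fit = rtt_regression(dates, distances)
print(fit)
```

A local-clock analysis could, under the same assumptions, apply this fit separately to the tips of each user-defined group.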
Interspecific interactions, including host-symbiont associations, can profoundly affect the evolution of the interacting species. Given the phylogenies of host and symbiont clades and knowledge of which host species interact with which symbiont, two questions are often asked: "Do closely related hosts interact with closely related symbionts?" and "Do host and symbiont phylogenies mirror one another?" These questions are intertwined and can even collapse into one in specific situations, such that they are often confused with each other. However, in most situations, a positive answer to the first question, hereafter referred to as "cophylogenetic signal," does not imply a close match between the host and symbiont phylogenies. It suggests only that past evolutionary history has contributed to shaping present-day interactions, which can arise, for example, through present-day trait matching, or from a single ancient vicariance event that increases the probability that closely related species overlap geographically. A positive answer to the second, referred to as "phylogenetic congruence," is more restrictive as it suggests a close match between the two phylogenies, which may happen, for example, if symbiont diversification tracks host diversification or if the diversifications of the two clades were subject to the same succession of vicariance events. Here we apply a set of methods (ParaFit, PACo, and eMPRess), whose significance is often interpreted as evidence for phylogenetic congruence, to simulations under 3 biologically realistic scenarios of trait matching, a single ancient vicariance event, and phylogenetic tracking with frequent cospeciation events. The last is the only scenario that generates phylogenetic congruence, whereas the first 2 generate a cophylogenetic signal in the absence of phylogenetic congruence. 
We find that tests of global-fit methods (ParaFit and PACo) are significant under the 3 scenarios, whereas tests of event-based methods (eMPRess) are only significant under the scenario of phylogenetic tracking. Therefore, significant results from global-fit methods should be interpreted in terms of cophylogenetic signal and not phylogenetic congruence; such significant results can arise under scenarios in which hosts and symbionts had independent evolutionary histories. Conversely, significant results from event-based methods suggest a strong form of dependency between host and symbiont evolutionary histories. Clarifying the patterns detected by different cophylogenetic methods is key to understanding how interspecific interactions shape and are shaped by evolution.
Distinguishing Cophylogenetic Signal from Phylogenetic Congruence Clarifies the Interplay Between Evolutionary History and Species Interactions. Benoît Perez-Lamarque, Hélène Morlon. Systematic Biology, pp. 613-622. Published 2024-09-05. doi:10.1093/sysbio/syae013
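To make concrete what a global-fit test measures, here is a minimal Python sketch of a Mantel-style permutation test. It is not ParaFit, PACo, or eMPRess; it is a simpler stand-in in the same global-fit spirit. It assumes one-to-one host-symbiont associations, with row i of each patristic distance matrix belonging to the same interacting pair. All names and the example matrix are hypothetical. A small p-value here says only that closely related hosts tend to carry closely related symbionts, which is cophylogenetic signal in the sense of the abstract above, not proof of congruent phylogenies.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("distances have zero variance")
    return sxy / (sxx * syy) ** 0.5


def mantel_test(host_dist, symb_dist, n_perm=999, seed=0):
    """Correlate two distance matrices; permute symbiont labels for a p-value."""
    n = len(host_dist)
    pairs = list(combinations(range(n), 2))
    host_vec = [host_dist[i][j] for i, j in pairs]
    observed = pearson(host_vec, [symb_dist[i][j] for i, j in pairs])
    rng = random.Random(seed)
    order = list(range(n))
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(n_perm):
        rng.shuffle(order)
        permuted = [symb_dist[order[i]][order[j]] for i, j in pairs]
        if pearson(host_vec, permuted) >= observed:
            hits += 1
    return observed, hits / (n_perm + 1)


# Hypothetical patristic distances for the tree ((A,B),(C,(D,E))),
# used for both hosts and symbionts (an extreme, perfectly matched case).
dist = [
    [0, 2, 6, 6, 6],
    [2, 0, 6, 6, 6],
    [6, 6, 0, 4, 4],
    [6, 6, 4, 0, 2],
    [6, 6, 4, 2, 0],
]
r_obs, p_value = mantel_test(dist, dist)
print(r_obs, p_value)
```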
Daniel Kornai, Xiyun Jiao, Jiayi Ji, Tomáš Flouri, Ziheng Yang
The multispecies coalescent (MSC) model accommodates genealogical fluctuations across the genome and provides a natural framework for comparative analysis of genomic sequence data from closely related species to infer the history of species divergence and gene flow. Given a set of populations, hypotheses of species delimitation (and species phylogeny) may be formulated as instances of MSC models (e.g., MSC for one species versus MSC for two species) and compared using Bayesian model selection. This approach, implemented in the program bpp, has been found to be prone to over-splitting. Alternatively, heuristic criteria based on population parameters (such as population split times, population sizes, and migration rates) estimated from genomic data may be used to delimit species. Here we develop hierarchical merge and split algorithms for heuristic species delimitation based on the genealogical divergence index (𝑔𝑑𝑖) and implement them in a Python pipeline called hhsd. We characterize the behavior of the 𝑔𝑑𝑖 under a few simple scenarios of gene flow. We apply the new approaches to a dataset simulated under a model of isolation by distance as well as three empirical datasets. Our tests suggest that the new approaches produced sensible results and were less prone to over-splitting. We discuss possible strategies for accommodating paraphyletic species in the hierarchical algorithm, as well as the challenges of species delimitation based on heuristic criteria.
Hierarchical Heuristic Species Delimitation under the Multispecies Coalescent Model with Migration. Systematic Biology. Published 2024-08-20. doi:10.1093/sysbio/syae050
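The abstract above does not define the genealogical divergence index. The Python sketch below assumes the commonly used no-migration form, gdi = 1 - exp(-2τ/θ), where τ is the population split time and θ the population size parameter, both in expected mutations per site. It also assumes the widely quoted rule-of-thumb cutoffs of 0.2 and 0.7. This is not the hhsd pipeline: the function names are hypothetical, and with migration (the setting of the paper) the index has to be computed by other means.

```python
import math


def gdi(tau, theta):
    """Genealogical divergence index without migration (assumed formula)."""
    if tau < 0 or theta <= 0:
        raise ValueError("tau must be >= 0 and theta must be > 0")
    # 2*tau/theta is the split time measured in coalescent units.
    return 1.0 - math.exp(-2.0 * tau / theta)


def classify(index, low=0.2, high=0.7):
    """Rule-of-thumb reading of a gdi value (cutoffs are assumptions)."""
    if index < low:
        return "single species"
    if index > high:
        return "distinct species"
    return "ambiguous"


# Hypothetical parameter values for three population pairs.
for tau, theta in [(0.001, 0.01), (0.003, 0.01), (0.010, 0.01)]:
    value = gdi(tau, theta)
    print(f"tau={tau} theta={theta} gdi={value:.3f} -> {classify(value)}")
```

A hierarchical merge step could, under these assumptions, collapse a pair of sister populations whenever their index falls in the "single species" range and then re-evaluate the reduced tree.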
Thomas K F Wong, Caitlin Cherryh, Allen G Rodrigo, Matthew W Hahn, Bui Quang Minh, Robert Lanfear
Hundreds or thousands of loci are now routinely used in modern phylogenomic studies. Concatenation approaches to tree inference assume that there is a single topology for the entire dataset, but different loci may have different evolutionary histories due to incomplete lineage sorting (ILS), introgression, and/or horizontal gene transfer; even single loci may not be treelike due to recombination. To overcome this shortcoming, we introduce an implementation of a multi-tree mixture model that we call mixtures across sites and trees (MAST). This model extends a prior implementation by Boussau et al. (2009) by allowing users to estimate the weight of each of a set of pre-specified bifurcating trees in a single alignment. The MAST model allows each tree to have its own weight, topology, branch lengths, substitution model, nucleotide or amino acid frequencies, and model of rate heterogeneity across sites. We implemented the MAST model in a maximum-likelihood framework in the popular phylogenetic software, IQ-TREE. Simulations show that we can accurately recover the true model parameters, including branch lengths and tree weights for a given set of tree topologies, under a wide range of biologically realistic scenarios. We also show that we can use standard statistical inference approaches to reject a single-tree model when data are simulated under multiple trees (and vice versa). We applied the MAST model to multiple primate datasets and found that it can recover the signal of ILS in the Great Apes, as well as the asymmetry in minor trees caused by introgression among several macaque species. When applied to a dataset of 4 Platyrrhine species for which standard concatenated maximum likelihood (ML) and gene tree approaches disagree, we observe that MAST gives the highest weight (i.e., the largest proportion of sites) to the tree also supported by gene tree approaches. 
These results suggest that the MAST model is able to analyze a concatenated alignment using ML while avoiding some of the biases that come with assuming there is only a single tree. We discuss how the MAST model can be extended in the future.
MAST: Phylogenetic Inference with Mixtures Across Sites and Trees. Systematic Biology, pp. 375-391. Published 2024-07-27. doi:10.1093/sysbio/syae008. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11282360/pdf/
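The tree weights described in the abstract above can be read as mixture proportions: the likelihood of each site is a weighted sum of its likelihoods under each tree. The Python sketch below shows that arithmetic together with one expectation-maximisation (EM) update of the weights. It is generic mixture-model code, not IQ-TREE's implementation. It assumes the per-site, per-tree likelihoods have already been computed elsewhere, and the function names and example numbers are hypothetical.

```python
import math


def mixture_log_likelihood(site_likes, weights):
    """Sum over sites of log( sum_j weights[j] * site_likes[i][j] )."""
    return sum(
        math.log(sum(w * like for w, like in zip(weights, row)))
        for row in site_likes
    )


def em_update_weights(site_likes, weights):
    """One EM step: new weight of tree j = mean posterior of tree j over sites."""
    totals = [0.0] * len(weights)
    for row in site_likes:
        joint = [w * like for w, like in zip(weights, row)]
        site_total = sum(joint)
        for j, value in enumerate(joint):
            totals[j] += value / site_total  # posterior that site i follows tree j
    return [t / len(site_likes) for t in totals]


# Hypothetical likelihoods: 3 sites (rows) under 2 candidate trees (columns).
site_likes = [[0.20, 0.10], [0.20, 0.10], [0.05, 0.10]]
weights = [0.5, 0.5]
old_ll = mixture_log_likelihood(site_likes, weights)
new_weights = em_update_weights(site_likes, weights)
new_ll = mixture_log_likelihood(site_likes, new_weights)
print(new_weights, old_ll, new_ll)
```

Here two of the three sites favour the first tree, so its weight rises above one half after the update.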
Qiuyi Li, Yao-Ban Chan, Nicolas Galtier, Celine Scornavacca
The evolution of gene families is complex, involving gene-level evolutionary events such as gene duplication, horizontal gene transfer, and gene loss, and other processes such as incomplete lineage sorting (ILS). Because of this, topological differences often exist between gene trees and species trees. A number of models have been recently developed to explain these discrepancies, the most realistic of which attempt to consider both gene-level events and ILS. When unified in a single model, the interaction between ILS and gene-level events can cause polymorphism in gene copy number, which we refer to as copy number hemiplasy (CNH). In this paper, we extend the Wright-Fisher process to include duplications and losses over several species, and show that the probability of CNH for this process can be significant. We study how well two unified models approximate the Wright-Fisher process with duplication and loss: the multilocus multispecies coalescent (MLMSC), which models CNH, and duplication, loss, and coalescence (DLCoal), which does not. We then study the effect of CNH on gene family evolution by comparing MLMSC and DLCoal. We generate comparable gene trees under both models, showing significant differences in various summary statistics; most importantly, CNH reduces the number of gene copies greatly. If this is not taken into account, the traditional method of estimating duplication rates (by counting the number of gene copies) becomes inaccurate. The simulated gene trees are also used for species tree inference with the summary methods ASTRAL and ASTRAL-Pro, demonstrating that their accuracy, based on CNH-unaware simulations calibrated on real data, may have been overestimated.
The Effect of Copy Number Hemiplasy on Gene Family Evolution. Systematic Biology, pp. 355-374. Published 2024-07-27. doi:10.1093/sysbio/syae007
Alex Dornburg, Katerina L Zapfe, Rachel Williams, Michael E Alfaro, Richard Morris, Haruka Adachi, Joseph Flores, Francesco Santini, Thomas J Near, Bruno Frédérich
Across the Tree of Life, most studies of phenotypic disparity and diversification have been restricted to adult organisms. However, many lineages have distinct ontogenetic phases that differ from their adult forms in morphology and ecology. Focusing disproportionately on the evolution of adult forms unnecessarily hinders our understanding of the pressures shaping evolution over time. Non-adult disparity patterns are particularly important to consider for coastal ray-finned fishes, which can have juvenile phases with distinct phenotypes. These juvenile forms are often associated with sheltered nursery environments, with phenotypic shifts between adult and juvenile stages that are readily apparent in locomotor morphology. Whether this ontogenetic variation in locomotor morphology reflects a decoupling of diversification dynamics between life stages remains unknown. Here we investigate the evolutionary dynamics of locomotor morphology between adult and juvenile triggerfishes. We integrate a time-calibrated phylogenetic framework with geometric morphometric approaches and measurement data of fin aspect ratio and incidence, and reveal a mismatch between morphospace occupancy, the evolution of morphological disparity, and the tempo of trait evolution between life stages. Collectively, our results illuminate how the heterogeneity of morpho-functional adaptations can decouple the mode and tempo of morphological diversification between ontogenetic stages.
Considering Decoupled Phenotypic Diversification Between Ontogenetic Phases in Macroevolution: An Example Using Triggerfishes (Balistidae). Systematic Biology, pp. 434-454. Published 2024-07-27. doi:10.1093/sysbio/syae014
Susanne S Renner, Mark D Scherz, Conrad L Schoch, Marc Gottschling, Miguel Vences
Scientific names permit humans and search engines to access knowledge about the biodiversity that surrounds us, and names linked to DNA sequences are playing an ever-greater role in search-and-match identification procedures. Here, we analyze how users and curators of the National Center for Biotechnology Information (NCBI) are flagging and curating sequences derived from nomenclatural type material, which is the only way to improve the quality of DNA-based identification in the long run. For prokaryotes, 18,281 genome assemblies from type strains have been curated by NCBI staff and improve the quality of prokaryote naming. For Fungi, type-derived sequences representing over 21,000 species are now essential for fungus naming and identification. For the remaining eukaryotes, however, the numbers of sequences identifiable as type-derived are minuscule, representing only 739 species of arthropods, 1542 vertebrates, and 125 embryophytes. An increase in the production and curation of such sequences will come from (i) sequencing of types or topotypic specimens in museum collections, (ii) the March 2023 rule changes at the International Nucleotide Sequence Database Collaboration requiring more metadata for specimens, and (iii) efforts by data submitters to facilitate curation, including informing NCBI curators about a specimen's type status. We illustrate different type-data submission journeys and provide best-practice examples from a range of organisms. Expanding the number of type-derived sequences in DNA databases, especially of eukaryotes, is crucial for capturing, documenting, and protecting biodiversity.
Improving the gold standard in NCBI GenBank and related databases: DNA sequences from type specimens and type strains. Systematic Biology, pp. 486-494. Published 2024-07-27. doi:10.1093/sysbio/syad068. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11502950/pdf/
Killian Smith, Daniel Ayres, René Neumaier, Gert Wörheide, Sebastian Höhna
Phylogenies are central to many research areas in biology and commonly estimated using likelihood-based methods. Unfortunately, any likelihood-based method, including Bayesian inference, can be restrictively slow for large datasets (with many taxa and/or many sites in the sequence alignment) or complex substitution models. The primary limiting factor when using large datasets and/or complex models in probabilistic phylogenetic analyses is the likelihood calculation, which dominates the total computation time. To address this bottleneck, we incorporated the high-performance phylogenetic library BEAGLE into RevBayes, which enables multi-threading on multi-core CPUs and GPUs, as well as hardware-specific vectorized instructions for faster likelihood calculations. Our new implementation of RevBayes+BEAGLE retains the flexibility and dynamic nature that users expect from vanilla RevBayes. In addition, we implemented native parallelization within RevBayes without an external library using the message passing interface (MPI); RevBayes+MPI. We evaluated our new implementation of RevBayes+BEAGLE using multi-threading on CPUs and 2 different powerful GPUs (NVIDIA Titan V and NVIDIA A100) against our native implementation of RevBayes+MPI. We found good improvements in speedup when multiple cores were used, with up to 20-fold speedup when using multiple CPU cores and over 90-fold speedup when using multiple GPU cores. The improvement depended on the data type used, DNA or amino acids, and the size of the alignment, but less on the size of the tree. We additionally investigated the cost of rescaling partial likelihoods to avoid numerical underflow and showed that unnecessarily frequent and inefficient rescaling can increase runtimes up to 4-fold. Finally, we presented and compared a new approach to store partial likelihoods on branches instead of nodes that can speed up computations up to 1.7 times but comes at twice the memory requirements.
{"title":"Bayesian Phylogenetic Analysis on Multi-Core Compute Architectures: Implementation and Evaluation of BEAGLE in RevBayes With MPI.","authors":"Killian Smith, Daniel Ayres, René Neumaier, Gert Wörheide, Sebastian Höhna","doi":"10.1093/sysbio/syae005","DOIUrl":"10.1093/sysbio/syae005","url":null,"PeriodicalId":22120,"journal":{"name":"Systematic Biology","volume":" ","pages":"455-469"},"PeriodicalIF":6.1,"publicationDate":"2024-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139571417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anna Wróbel, Ewelina Klichowska, Arkadiusz Nowak, Marcin Nobis
Diversification and demographic responses are key processes shaping species' evolutionary history. Yet we still lack a full understanding of the ecological mechanisms that shape genetic diversity at different spatial scales under rapid environmental change. In this study, we examined genetic differentiation in an extremophilic grass, Puccinellia pamirica, and the factors affecting its population dynamics among the hypersaline alpine wetlands it occupies on the arid Pamir Plateau in Central Asia. Using genomic data, we found evidence of fine-scale population structure and gene flow among the localities established across the high-elevation plateau, as well as fingerprints of historical demographic expansion. We showed that an increase in the effective population size could coincide with the Last Glacial Period, which was followed by the species' demographic decline during the Holocene. Geographic distance plays a vital role in shaping the spatial genetic structure of P. pamirica, alongside isolation-by-environment and habitat fragmentation. Our results highlight a complex history of divergence and gene flow in this species-poor alpine region during the Late Quaternary. We demonstrate that regional climate specificity and a shortage of nonclimate data largely impede predictions of future range changes of the alpine extremophile using ecological niche modeling. This study emphasizes the importance of fine-scale environmental heterogeneity for population dynamics and species distribution shifts.
{"title":"Alpine Extremophytes in Evolutionary Turmoil: Complex Diversification Patterns and Demographic Responses of a Halophilic Grass in a Central Asian Biodiversity Hotspot.","authors":"Anna Wróbel, Ewelina Klichowska, Arkadiusz Nowak, Marcin Nobis","doi":"10.1093/sysbio/syad073","DOIUrl":"10.1093/sysbio/syad073","url":null,"PeriodicalId":22120,"journal":{"name":"Systematic Biology","volume":" ","pages":"263-278"},"PeriodicalIF":6.1,"publicationDate":"2024-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11282368/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139032576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Laura A Frost, Ana M Bedoya, Laura P Lagomarsino
The Andes mountains of western South America are a globally important biodiversity hotspot, yet there is a paucity of resolved phylogenies for plant clades from this region. Filling an important gap in our understanding of the world's richest flora, we present the first phylogeny of Freziera (Pentaphylacaceae), an Andean-centered, cloud forest radiation. Our dataset was obtained via hybrid-enriched target sequence capture of Angiosperms353 universal loci for 50 of the ca. 75 spp., obtained almost entirely from herbarium specimens. We identify high phylogenomic complexity in Freziera, including the presence of data artifacts. Via by-eye observation of gene trees, detailed examination of warnings from recently improved assembly pipelines, and gene tree filtering, we identified that artifactual orthologs (i.e., the presence of only one copy of a multicopy gene due to differential assembly) were an important source of gene tree heterogeneity that had a negative impact on phylogenetic inference and support. These artifactual orthologs may be common in plant phylogenomic datasets, where multiple instances of genome duplication are common. After accounting for artifactual orthologs as a source of gene tree error, we identified a significant but nonspecific signal of introgression using Patterson's D and f4 statistics. Despite phylogenomic complexity, we were able to resolve Freziera into 9 well-supported subclades whose evolution has been shaped by multiple evolutionary processes, including incomplete lineage sorting, historical gene flow, and gene duplication. Our results highlight the complexities of plant phylogenomics, which are heightened in Andean radiations, and show the impact of filtering data processing artifacts and standard filtering approaches on phylogenetic inference.
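For readers unfamiliar with Patterson's D, the statistic compares counts of two discordant site patterns in a four-taxon tree (((P1, P2), P3), O): D = (ABBA - BABA) / (ABBA + BABA), where A is the outgroup allele and B the derived allele. The sketch below is a generic illustration of that formula, not the pipeline used in the study; the function name and the plain-string sequence input are assumptions, and it omits the block-jackknife step usually used to assess significance.

```python
def patterson_d(p1, p2, p3, outgroup):
    """Patterson's D from four aligned sequences, topology (((P1, P2), P3), O).

    Hypothetical helper: sequences are equal-length strings over A, C, G, T;
    sites with any other character (gaps, N) are skipped.
    """
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        site = {a, b, c, o}
        if len(site) != 2 or not site <= set("ACGT"):
            continue  # keep only clean biallelic sites
        if a == o and b == c and b != o:
            abba += 1  # P2 and P3 share the derived allele
        elif b == o and a == c and a != o:
            baba += 1  # P1 and P3 share the derived allele
    if abba + baba == 0:
        raise ValueError("no informative ABBA/BABA sites")
    return (abba - baba) / (abba + baba)
```

Under incomplete lineage sorting alone, the two patterns are expected to be equally frequent, so D is close to zero; a consistent excess of one pattern suggests gene flow between P3 and either P1 or P2.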
{"title":"Artifactual Orthologs and the Need for Diligent Data Exploration in Complex Phylogenomic Datasets: A Museomic Case Study from the Andean Flora.","authors":"Laura A Frost, Ana M Bedoya, Laura P Lagomarsino","doi":"10.1093/sysbio/syad076","DOIUrl":"10.1093/sysbio/syad076","url":null,"PeriodicalId":22120,"journal":{"name":"Systematic Biology","volume":" ","pages":"308-322"},"PeriodicalIF":6.1,"publicationDate":"2024-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139088586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}