Jakub Truszkowski, Allison Perrigo, David Broman, Fredrik Ronquist, Alexandre Antonelli
Bayesian phylogenetics is now facing a critical point. Over the last 20 years, Bayesian methods have reshaped phylogenetic inference and gained widespread popularity due to their high accuracy, the ability to quantify the uncertainty of inferences and the possibility of accommodating multiple aspects of evolutionary processes in the models that are used. Unfortunately, Bayesian methods are computationally expensive, and typical applications involve at most a few hundred sequences. This is problematic in the age of rapidly expanding genomic data and increasing scope of evolutionary analyses, forcing researchers to resort to less accurate but faster methods, such as maximum parsimony and maximum likelihood. Does this spell doom for Bayesian methods? Not necessarily. Here, we discuss some recently proposed approaches that could help scale up Bayesian analyses of evolutionary problems considerably. We focus on two particular aspects: online phylogenetics, where new data sequences are added to existing analyses, and alternatives to Markov chain Monte Carlo (MCMC) for scalable Bayesian inference. We identify 5 specific challenges and discuss how they might be overcome. We believe that online phylogenetic approaches and Sequential Monte Carlo hold great promise and could potentially speed up tree inference by orders of magnitude. We call for collaborative efforts to speed up the development of methods for real-time tree expansion through online phylogenetics.
Title: Online tree expansion could help solve the problem of scalability in Bayesian phylogenetics. Authors: Jakub Truszkowski, Allison Perrigo, David Broman, Fredrik Ronquist, Alexandre Antonelli. Journal: Systematic Biology. Published: 2023-11-01. DOI: 10.1093/sysbio/syad045. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10627553/pdf/
Alexey Markin, Sanket Wagle, Siddhant Grover, Amy L Vincent Baker, Oliver Eulenstein, Tavis K Anderson
The use of next-generation sequencing technology has enabled phylogenetic studies with hundreds of thousands of taxa. Such large-scale phylogenies have become a critical component in genomic epidemiology in pathogens such as SARS-CoV-2 and influenza A virus. However, detailed phenotypic characterization of pathogens or generating a computationally tractable dataset for detailed phylogenetic analyses requires objective subsampling of taxa. To address this need, we propose parnas, an objective and flexible algorithm to sample and select taxa that best represent observed diversity by solving a generalized k-medoids problem on a phylogenetic tree. parnas solves this problem efficiently and exactly by novel optimizations and adapting algorithms from operations research. For more nuanced selections, taxa can be weighted with metadata or genetic sequence parameters, and the pool of potential representatives can be user-constrained. Motivated by influenza A virus genomic surveillance and vaccine design, parnas can be applied to identify representative taxa that optimally cover the diversity in a phylogeny within a specified distance radius. We demonstrated that parnas is more efficient and flexible than existing approaches. To demonstrate its utility, we applied parnas to 1) quantify SARS-CoV-2 genetic diversity over time, 2) select representative influenza A virus in swine genes derived from over 5 years of genomic surveillance data, and 3) identify gaps in H3N2 human influenza A virus vaccine coverage. We suggest that our method, through the objective selection of representatives in a phylogeny, provides criteria for quantifying genetic diversity that have application in the rational design of multivalent vaccines and genomic epidemiology. PARNAS is available at https://github.com/flu-crew/parnas.
Title: PARNAS: Objectively Selecting the Most Representative Taxa on a Phylogeny. Authors: Alexey Markin, Sanket Wagle, Siddhant Grover, Amy L Vincent Baker, Oliver Eulenstein, Tavis K Anderson. Journal: Systematic Biology. Published: 2023-11-01. DOI: 10.1093/sysbio/syad028. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10627562/pdf/
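The k-medoids objective that the PARNAS abstract refers to can be illustrated with a small brute-force sketch. This is not the parnas algorithm, which the abstract describes as exact and efficient; it simply enumerates every set of k leaves and keeps the one minimizing the weighted sum of tree distances from each leaf to its nearest representative. The function names, the edge-list tree encoding, and the toy tree are all assumptions made for illustration, and the enumeration is only practical for very small trees.

```python
from itertools import combinations


def tree_distances(edges, leaves):
    """Return path lengths between all pairs of leaves in a weighted tree.

    `edges` is a list of (node, node, branch_length) tuples (hypothetical encoding).
    """
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {}
    for src in leaves:
        seen = {src: 0.0}
        stack = [src]
        while stack:  # depth-first walk; in a tree every path is unique
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + w
                    stack.append(nxt)
        dist[src] = {leaf: seen[leaf] for leaf in leaves}
    return dist


def best_medoids(dist, leaves, k, weights=None):
    """Brute-force k-medoids: try every k-subset of leaves as representatives."""
    weights = weights or {leaf: 1.0 for leaf in leaves}
    best_cost, best_set = float("inf"), None
    for cand in combinations(leaves, k):
        # Each leaf is charged its weighted distance to the closest representative.
        cost = sum(weights[x] * min(dist[x][m] for m in cand) for x in leaves)
        if cost < best_cost:
            best_cost, best_set = cost, cand
    return best_set, best_cost


# Toy tree ((A:1,B:1)X:3,(C:1,D:1)Y:3)R with two well-separated cherries.
EDGES = [("A", "X", 1.0), ("B", "X", 1.0), ("X", "R", 3.0),
         ("C", "Y", 1.0), ("D", "Y", 1.0), ("Y", "R", 3.0)]
LEAVES = ["A", "B", "C", "D"]
DIST = tree_distances(EDGES, LEAVES)
print(best_medoids(DIST, LEAVES, k=2))
```

On this toy tree the chosen pair contains one leaf from each cherry, which is the behaviour one would expect from representatives meant to cover the observed diversity. Passing a `weights` dictionary mimics, in a simplified way, the idea of weighting taxa by metadata.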
Evolutionary dynamics operating across deep time leave footprints in the shapes of phylogenetic trees. For the last several decades, researchers have used increasingly large and robust phylogenies to study the evolutionary history of individual clades and to investigate the causes of the glaring disparities in diversity among groups. Although they are typically not the focal point of individual clade-level studies, recurrent patterns have been remarked on by many researchers across many different groups and at many different time scales. Although previous studies have documented various such regularities in topology and branch length distributions, they have typically focused on a single pattern and used a disparate collection of trees (oftentimes of quite variable reliability) to assess it. Here we take advantage of modern megaphylogenies and unify previous disparate observations about the shapes embedded in the Tree of Life to create a catalog of the "major features of macroevolution." By characterizing such a large swath of subtrees in a consistent way, we hope to provide a set of phenomena that process-based macroevolutionary models of diversification ought to seek to explain.
Title: The Major Features of Macroevolution. Authors: L Francisco Henao-Diaz, Matt Pennell. Journal: Systematic Biology. Published: 2023-11-01. DOI: 10.1093/sysbio/syad032.
Julia Van Etten, Timothy G Stephens, Debashish Bhattacharya
In the age of genome sequencing, whole-genome data is readily and frequently generated, leading to a wealth of new information that can be used to advance various fields of research. New approaches, such as alignment-free phylogenetic methods that utilize k-mer-based distance scoring, are becoming increasingly popular given their ability to rapidly generate phylogenetic information from whole-genome data. However, these methods have not yet been tested using environmental data, which often tends to be highly fragmented and incomplete. Here, we compare the results of one alignment-free approach (which utilizes the D2 statistic) to traditional multi-gene maximum likelihood trees in 3 algal groups that have high-quality genome data available. In addition, we simulate lower-quality, fragmented genome data using these algae to test method robustness to genome quality and completeness. Finally, we apply the alignment-free approach to environmental metagenome assembled genome data of unclassified Saccharibacteria and Trebouxiophyte algae, and single-cell amplified data from uncultured marine stramenopiles to demonstrate its utility with real datasets. We find that in all instances, the alignment-free method produces phylogenies that are comparable to, and often more informative than, those created using the traditional multi-gene approach. The k-mer-based method performs well even when there are significant missing data that include marker genes traditionally used for tree reconstruction. Our results demonstrate the value of alignment-free approaches for classifying novel, often cryptic or rare, species that may not be culturable or are difficult to access using single-cell methods, but fill important gaps in the tree of life.
Title: A k-mer-Based Approach for Phylogenetic Classification of Taxa in Environmental Genomic Data. Authors: Julia Van Etten, Timothy G Stephens, Debashish Bhattacharya. Journal: Systematic Biology. Published: 2023-11-01. DOI: 10.1093/sysbio/syad037.
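As a rough illustration of the kind of k-mer scoring the abstract above mentions, the plain D2 statistic between two sequences is the sum, over all k-mers, of the product of their counts in each sequence. The sketch below computes that quantity in Python. The function names are hypothetical, and the conversion to a distance (one minus a cosine-style normalization) is an assumption made here for illustration; the study may use a different normalized variant and a different distance transform.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count overlapping k-mers in a sequence (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_x, seq_y, k):
    """Plain D2 statistic: inner product of the two k-mer count vectors."""
    cx, cy = kmer_counts(seq_x, k), kmer_counts(seq_y, k)
    # A Counter returns 0 for k-mers it has not seen, so absent words add nothing.
    return sum(n * cy[word] for word, n in cx.items())


def d2_distance(seq_x, seq_y, k):
    """Illustrative distance: 1 - D2(x, y) / sqrt(D2(x, x) * D2(y, y)).

    This normalization is an assumption for the sketch, not the paper's method.
    """
    norm = sqrt(d2(seq_x, seq_x, k) * d2(seq_y, seq_y, k))
    if norm == 0:
        raise ValueError("both sequences must contain at least one k-mer")
    return 1.0 - d2(seq_x, seq_y, k) / norm


print(d2("ACGTACGT", "ACGTTTTT", 3), d2_distance("ACGTACGT", "ACGTTTTT", 3))
```

Because only k-mer counts are compared, no alignment is needed, which is why such scores can still be computed on fragmented assemblies where marker genes are missing.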
We consider the evolution of phylogenetic gene trees along phylogenetic species networks, according to the network multispecies coalescent process, and introduce a new network coalescent model with correlated inheritance of gene flow. This model generalizes two traditional versions of the network coalescent: with independent or common inheritance. At each reticulation, multiple lineages of a given locus are inherited from parental populations chosen at random, either independently across lineages or with positive correlation according to a Dirichlet process. This process may account for locus-specific probabilities of inheritance, for example. We implemented the simulation of gene trees under these network coalescent models in the Julia package PhyloCoalSimulations, which depends on PhyloNetworks and its powerful network manipulation tools. Input species phylogenies can be read in extended Newick format, either in numbers of generations or in coalescent units. Simulated gene trees can be written in Newick format, and in a way that preserves information about their embedding within the species network. This embedding can be used for downstream purposes, such as to simulate species-specific processes like rate variation across species, or for other scenarios as illustrated in this note. This package should be useful for simulation studies and simulation-based inference methods. The software is available open source with documentation and a tutorial at https://github.com/cecileane/PhyloCoalSimulations.jl.
Title: PhyloCoalSimulations: A Simulator for Network Multispecies Coalescent Models, Including a New Extension for the Inheritance of Gene Flow. Authors: John Fogg, Elizabeth S Allman, Cécile Ané. Journal: Systematic Biology. Published: 2023-11-01. DOI: 10.1093/sysbio/syad030.
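The three inheritance modes named in the PhyloCoalSimulations abstract (independent, common, and positively correlated via a Dirichlet process) can be sketched with a Pólya urn, which is the standard sequential way to sample from a Dirichlet process. The Python code below is an illustration only and does not use or reproduce the Julia package's API; the function name, the two-parent labels, and the parameters `gamma` (probability of the "major" parent) and `alpha` (concentration) are assumptions, and the package's actual parameterization may differ.

```python
import math
import random


def assign_parents(n_lineages, gamma, alpha, rng):
    """Assign each lineage at a reticulation to the 'major' or 'minor' parent.

    gamma: marginal probability that a lineage comes from the major parent.
    alpha: concentration of the urn. alpha = math.inf gives independent
           inheritance, alpha = 0 gives common inheritance (all lineages follow
           the first one), and values in between give positive correlation.
    """
    choices = []
    n_major = 0
    for i in range(n_lineages):
        if i == 0 or math.isinf(alpha):
            p_major = gamma  # first lineage, or fully independent draws
        else:
            # Polya urn predictive rule: base measure mixed with earlier picks.
            p_major = (alpha * gamma + n_major) / (alpha + i)
        pick = "major" if rng.random() < p_major else "minor"
        if pick == "major":
            n_major += 1
        choices.append(pick)
    return choices


rng = random.Random(1)
print(assign_parents(5, gamma=0.7, alpha=math.inf, rng=rng))  # independent
print(assign_parents(5, gamma=0.7, alpha=1.0, rng=rng))       # correlated
print(assign_parents(5, gamma=0.7, alpha=0.0, rng=rng))       # common
```

Under this scheme every lineage still has marginal probability `gamma` of tracing to the major parent; only the dependence among lineages of the same locus changes with `alpha`.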
Bayesian phylogenetic inference requires a tree prior, which models the underlying diversification process that gives rise to the phylogeny. Existing birth-death diversification models include a wide range of features, for instance, lineage-specific variations in speciation and extinction (SSE) rates. While across-lineage variation in SSE rates is widespread in empirical datasets, few heterogeneous rate models have been implemented as tree priors for Bayesian phylogenetic inference. As a consequence, rate heterogeneity is typically ignored when reconstructing phylogenies, and rate heterogeneity is usually investigated on fixed trees. In this paper, we present a new BEAST2 package implementing the cladogenetic diversification rate shift (ClaDS) model as a tree prior. ClaDS is a birth-death diversification model designed to capture small progressive variations in birth and death rates along a phylogeny. Unlike previous implementations of ClaDS, which were designed to be used with fixed, user-chosen phylogenies, our package is implemented in the BEAST2 framework and thus allows full phylogenetic inference, where the phylogeny and model parameters are co-estimated from a molecular alignment. Our package provides all necessary components of the inference, including a new tree object and operators that propose moves for the Markov chain Monte Carlo (MCMC) sampler. It also includes a graphical interface through BEAUti. We validate our implementation of the package by comparing the produced distributions to simulated data and show an empirical example of the full inference, using a dataset of cetaceans.
Title: The ClaDS rate-heterogeneous birth-death prior for full phylogenetic inference in BEAST2. Authors: Joëlle Barido-Sottani, Hélène Morlon. Journal: Systematic Biology. Published: 2023-11-01. DOI: 10.1093/sysbio/syad027. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10627560/pdf/
Yubo Zhang, Qingjie Zhu, Yi Shao, Yanchen Jiang, Yidan Ouyang, Li Zhang, Wei Zhang
Resolving phylogenetic relationships among taxa remains a challenge in the era of big data due to the presence of genetic admixture in a wide range of organisms. Rapidly developing sequencing technologies and statistical tests enable evolutionary relationships to be disentangled at a genome-wide level, yet many of these tests are computationally intensive and rely on phased genotypes, large sample sizes, restricted phylogenetic topologies, or hypothesis testing. To overcome these difficulties, we developed a deep learning-based approach, named ERICA, for inferring genome-wide evolutionary relationships and local introgressed regions from sequence data. ERICA accepts sequence alignments of both population genomic data and multiple genome assemblies, and efficiently identifies discordant genealogy patterns and exchanged regions across genomes when compared with other methods. We further tested ERICA using real population genomic data from Heliconius butterflies that have undergone adaptive radiation and frequent hybridization. Finally, we applied ERICA to characterize hybridization and introgression in wild and cultivated rice, revealing the important role of introgression in rice domestication and adaptation. Taken together, our findings demonstrate that ERICA provides an effective method for teasing apart evolutionary relationships using whole genome data, which can ultimately facilitate evolutionary studies on hybridization and introgression.
Title: Inferring Historical Introgression with Deep Learning. Authors: Yubo Zhang, Qingjie Zhu, Yi Shao, Yanchen Jiang, Yidan Ouyang, Li Zhang, Wei Zhang. Journal: Systematic Biology. Published: 2023-11-01. DOI: 10.1093/sysbio/syad033.
Jorge L Ramirez, Carolina B Machado, Paulo Roberto Antunes de Mello Affonso, Pedro M Galetti
Past sea level changes and geological instability along watershed boundaries have largely influenced fish distribution across coastal basins, either by dispersal via palaeodrainages now submerged or by headwater captures, respectively. Accordingly, the South American Atlantic coast encompasses several small and isolated drainages that share a similar species composition, representing a suitable model to infer historical processes. Leporinus bahiensis is a freshwater fish species widespread along adjacent coastal basins over a narrow continental shelf with no evidence of palaeodrainage connections during low sea level periods. Therefore, this study aimed to reconstruct its evolutionary history to infer the role of headwater captures in the dispersal process. To accomplish this, we employed molecular-level phylogenetic and population structure analyses based on Sanger sequences (5 genes) and genome-wide SNP data. Phylogenetic trees based on Sanger data were inconclusive, but SNP data did support the monophyletic status of L. bahiensis. Both COI and SNP data revealed structured populations according to each hydrographic basin. Species delimitation analyses revealed from 3 (COI) to 5 (multilocus approach) MOTUs, corresponding to the sampled basins. An intricate biogeographic scenario was inferred and supported by Approximate Bayesian Computation (ABC) analysis. Specifically, a staggered pattern was revealed and characterized by sequential headwater captures from basins adjacent to upland drainages into small coastal basins at different periods. These headwater captures resulted in dispersal throughout contiguous coastal basins, followed by deep genetic divergence among lineages. To decipher such recent divergences, as herein represented by L. bahiensis populations, we used genome-wide SNP data. Indeed, the combined use of genome-wide SNP data and the ABC method allowed us to reconstruct the evolutionary history and speciation of L. bahiensis. This framework might be useful in disentangling the diversification process in other neotropical fishes subject to a reticulate geological history.
Title: Speciation in Coastal Basins Driven by Staggered Headwater Captures: Dispersal of a Species Complex, Leporinus bahiensis, as Revealed by Genome-wide SNP Data. Authors: Jorge L Ramirez, Carolina B Machado, Paulo Roberto Antunes de Mello Affonso, Pedro M Galetti. Journal: Systematic Biology. Published: 2023-11-01. DOI: 10.1093/sysbio/syad034. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10627554/pdf/
Alexander M Kramer, Bryan Thornlow, Cheng Ye, Nicola De Maio, Jakob McBroome, Angie S Hinrichs, Robert Lanfear, Yatish Turakhia, Russell Corbett-Detig
Phylogenetics has been foundational to SARS-CoV-2 research and public health policy, assisting in genomic surveillance, contact tracing, and assessing emergence and spread of new variants. However, phylogenetic analyses of SARS-CoV-2 have often relied on tools designed for de novo phylogenetic inference, in which all data are collected before any analysis is performed and the phylogeny is inferred once from scratch. SARS-CoV-2 data sets do not fit this mold. There are currently over 14 million sequenced SARS-CoV-2 genomes in online databases, with tens of thousands of new genomes added every day. Continuous data collection, combined with the public health relevance of SARS-CoV-2, invites an "online" approach to phylogenetics, in which new samples are added to existing phylogenetic trees every day. The extremely dense sampling of SARS-CoV-2 genomes also invites a comparison between likelihood and parsimony approaches to phylogenetic inference. Maximum likelihood (ML) and pseudo-ML methods may be more accurate when there are multiple changes at a single site on a single branch, but this accuracy comes at a large computational cost, and the dense sampling of SARS-CoV-2 genomes means that these instances will be extremely rare because each internal branch is expected to be extremely short. Therefore, it may be that approaches based on maximum parsimony (MP) are sufficiently accurate for reconstructing phylogenies of SARS-CoV-2, and their simplicity means that they can be applied to much larger data sets. Here, we evaluate the performance of de novo and online phylogenetic approaches, as well as ML, pseudo-ML, and MP frameworks for inferring large and dense SARS-CoV-2 phylogenies. Overall, we find that online phylogenetics produces similar phylogenetic trees to de novo analyses for SARS-CoV-2, and that MP optimization with UShER and matOptimize produces equivalent SARS-CoV-2 phylogenies to some of the most popular ML and pseudo-ML inference tools. 
MP optimization with UShER and matOptimize is thousands of times faster than presently available implementations of ML, and online phylogenetics is faster than de novo inference. Our results therefore suggest that parsimony-based methods like UShER and matOptimize represent an accurate and more practical alternative to established ML implementations for large SARS-CoV-2 phylogenies and could be successfully applied to other similar data sets with particularly dense sampling and short branch lengths.
{"title":"Online Phylogenetics with matOptimize Produces Equivalent Trees and is Dramatically More Efficient for Large SARS-CoV-2 Phylogenies than de novo and Maximum-Likelihood Implementations.","authors":"Alexander M Kramer, Bryan Thornlow, Cheng Ye, Nicola De Maio, Jakob McBroome, Angie S Hinrichs, Robert Lanfear, Yatish Turakhia, Russell Corbett-Detig","doi":"10.1093/sysbio/syad031","DOIUrl":"10.1093/sysbio/syad031","url":null,"PeriodicalId":22120,"journal":{"name":"Systematic Biology","volume":null,"pages":null},"PeriodicalIF":6.5,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10627557/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9614158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lihua Yang, A J Harris, Fang Wen, Zheng Li, Chao Feng, Hanghui Kong, Ming Kang
Allopolyploid plants have long been regarded as possessing genetic advantages under certain circumstances due to the combined effects of their hybrid origins and duplicated genomes. However, the evolutionary consequences of allopolyploidy in lineage diversification remain to be fully understood. Here, we investigate the evolutionary consequences of allopolyploidy using 138 transcriptomic sequences of Gesneriaceae, including 124 newly sequenced, focusing particularly on the largest subtribe Didymocarpinae. We estimated the phylogeny of Gesneriaceae using concatenated and coalescent-based methods based on five different nuclear matrices and 27 plastid genes, focusing on relationships among major clades. To better understand the evolutionary affinities in this family, we applied a range of approaches to characterize the extent and cause of phylogenetic incongruence. We found that extensive conflicts between nuclear and chloroplast genomes and among nuclear genes were caused by both incomplete lineage sorting (ILS) and reticulation, and we found evidence of widespread ancient hybridization and introgression. Using the most highly supported phylogenomic framework, we revealed multiple bursts of gene duplication throughout the evolutionary history of Gesneriaceae. By incorporating molecular dating and analyses of diversification dynamics, our study shows that an ancient allopolyploidization event occurred around the Oligocene-Miocene boundary, which may have driven the rapid radiation of core Didymocarpinae.
{"title":"Phylogenomic Analyses Reveal an Allopolyploid Origin of Core Didymocarpinae (Gesneriaceae) Followed by Rapid Radiation.","authors":"Lihua Yang, A J Harris, Fang Wen, Zheng Li, Chao Feng, Hanghui Kong, Ming Kang","doi":"10.1093/sysbio/syad029","DOIUrl":"10.1093/sysbio/syad029","url":null,"PeriodicalId":22120,"journal":{"name":"Systematic Biology","volume":null,"pages":null},"PeriodicalIF":6.5,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10627561/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9434513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}