Determining the link between genomic and phenotypic change is a fundamental goal in evolutionary biology. Insights into this link can be gained by using a phylogenetic approach to test for correlations between rates of molecular and morphological evolution. However, there has been persistent uncertainty about the relationship between these rates, partly because conflicting results have been obtained using various methods that have not been examined in detail. We carried out a simulation study to evaluate the performance of 5 statistical methods for detecting correlated rates of evolution. Our simulations explored the evolution of molecular sequences and morphological characters under a range of conditions. Of the methods tested, Bayesian relaxed-clock estimation of branch rates was able to detect correlated rates of evolution correctly in the largest number of cases. This was followed by correlations of root-to-tip distances, Bayesian model selection, independent sister-pairs contrasts, and likelihood-based model selection. As expected, the power to detect correlated rates increased with the amount of data, both in terms of tree size and number of morphological characters. Likewise, greater among-lineage rate variation in the data led to improved performance of all 5 methods, particularly for Bayesian relaxed-clock analysis when the rate model was mismatched. We then applied these methods to a data set from flowering plants and did not find evidence of a correlation in evolutionary rates between genomic data and morphological characters. The results of our study have practical implications for phylogenetic analyses of combined molecular and morphological data sets, and highlight the conditions under which the links between genomic and phenotypic rates of evolution can be evaluated quantitatively.
The evolutionary implications and frequency of hybridization and introgression are increasingly being recognized across the tree of life. To detect hybridization from multi-locus and genome-wide sequence data, a popular class of methods is based on summary statistics from subsets of 3 or 4 taxa. However, these methods often carry the assumption of a constant substitution rate across lineages and genes, which is commonly violated in many groups. In this work, we quantify the effects of rate variation on the D test (also known as the ABBA-BABA test), the D3 test, and HyDe. All 3 tests are used widely across a range of taxonomic groups, in part because they are very fast to compute. We consider rate variation across species lineages, across genes, their lineage-by-gene interaction, and rate variation across gene-tree edges. We simulated species networks according to a birth-death-hybridization process, so as to capture a range of realistic species phylogenies. For all 3 methods tested, we found a marked increase in the false discovery of reticulation (type-1 error rate) when there is rate variation across species lineages. The D3 test was the most sensitive, with around 80% type-1 error, such that D3 appears to be more sensitive to a departure from the clock than to the presence of reticulation. For all 3 tests, the power to detect hybridization events decreased as the number of hybridization events increased, indicating that multiple hybridization events can obscure one another if they occur within a small subset of taxa. Our study highlights the need to consider rate variation when using site-based summary statistics, and points to the advantages of methods that do not require assumptions on evolutionary rates across lineages or across genes.
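As a rough illustration of the kind of site-based summary statistic discussed above, the following Python sketch computes the D statistic in its commonly used form, D = (nABBA - nBABA) / (nABBA + nBABA), from four aligned sequences ordered as P1, P2, P3, and outgroup. The function name `d_statistic` and the input layout are assumptions made for this example; they are not taken from the study, and the sketch omits the significance test (e.g., block jackknife) that a real analysis would need.

```python
from collections import Counter

def d_statistic(p1, p2, p3, outgroup):
    """Hypothetical helper: D statistic from four aligned sequences.

    Sequences are ordered (P1, P2, P3, O); the outgroup allele is
    treated as ancestral ("A") and the other allele as derived ("B").
    Returns (D, n_ABBA, n_BABA).
    """
    counts = Counter()
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        # Keep only biallelic sites with unambiguous nucleotides.
        if any(x not in "ACGT" for x in (a, b, c, o)):
            continue
        if len({a, b, c, o}) != 2:
            continue
        if a == o and b == c and b != o:
            counts["ABBA"] += 1  # P2 and P3 share the derived allele
        elif b == o and a == c and a != o:
            counts["BABA"] += 1  # P1 and P3 share the derived allele
    abba, baba = counts["ABBA"], counts["BABA"]
    total = abba + baba
    d = (abba - baba) / total if total else float("nan")
    return d, abba, baba

# Toy alignment: two ABBA sites, one BABA site, two uninformative sites.
d, n_abba, n_baba = d_statistic("AATAC", "CGAAC", "CGTAA", "AAAAA")
print(d, n_abba, n_baba)
```

Because the statistic depends only on the imbalance between the two pattern counts, anything that shifts those counts unequally can move D away from zero; the abstract above identifies rate variation across species lineages as one such factor.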
Mangrove pit vipers of the Trimeresurus purpureomaculatus-erythrurus complex are the only species of viper known to naturally inhabit mangroves. Despite serving integral ecological functions in mangrove ecosystems, the evolutionary history, distribution, and species boundaries of mangrove pit vipers remain poorly understood, partly due to overlapping distributions, confusing phenotypic variations, and the lack of focused studies. Here, we present the first genomic study on mangrove pit vipers and introduce a robust hypothesis-driven species delimitation framework that considers gene flow and phylogenetic uncertainty in conjunction with a novel application of a new class of speciation-based delimitation model implemented through the program Delineate. Our results showed that gene flow produced phylogenetic conflict in our focal species and substantiated the artefactual branch effect, where highly admixed populations appear as divergent nonmonophyletic lineages arranged in a stepwise manner at the basal position of clades. Despite the confounding effects of gene flow, we were able to obtain unequivocal support for the recognition of a new species based on the intersection and congruence of multiple lines of evidence. This study demonstrates that an integrative hypothesis-driven approach predicated on the consideration of multiple plausible evolutionary histories, population structure/differentiation, gene flow, and the implementation of a speciation-based delimitation model can effectively delimit species in the presence of gene flow and phylogenetic conflict.
Phylogenetic tree reconciliation is extensively employed for the examination of coevolution between host and symbiont species. An important concern is the requirement for dependable cost values when selecting event-based parsimonious reconciliation. Although certain approaches deduce event probabilities unique to each pair of host and symbiont trees, which can subsequently be converted into cost values, a significant limitation lies in their inability to model the invasion of diverse host species by the same symbiont species (termed as a spread event), which is believed to occur in symbiotic relationships. Invasions lead to the observation of multiple associations between symbionts and their hosts (indicating that a symbiont is no longer exclusive to a single host), which are incompatible with the existing methods of coevolution. Here, we present a method called AmoCoala (an enhanced version of the tool Coala) that provides a more realistic estimation of cophylogeny event probabilities for a given pair of host and symbiont trees, even in the presence of spread events. We expand the classical 4-event coevolutionary model to include 2 additional outcomes, vertical and horizontal spreads, that lead to multiple associations. In the initial step, we estimate the probabilities of spread events using heuristic frequencies. Subsequently, in the second step, we employ an approximate Bayesian computation approach to infer the probabilities of the remaining 4 classical events (cospeciation, duplication, host switch, and loss) based on these values. By incorporating spread events, our reconciliation model enables a more accurate consideration of multiple associations. This improvement enhances the precision of estimated cost sets, paving the way to a more reliable reconciliation of host and symbiont trees. To validate our method, we conducted experiments on synthetic datasets and demonstrated its efficacy using real-world examples. 
Our results show that AmoCoala produces biologically plausible reconciliation scenarios, further supporting its effectiveness.
Model selection aims to choose the most adequate model for the statistical analysis at hand. The model must be complex enough to capture the complexity of the data but should be simple enough not to overfit. In phylogenetics, the most common model selection scenario concerns selecting an adequate substitution and partition model for sequence evolution to infer a phylogenetic tree. Previously, several studies showed that substitution model under-parameterization can bias phylogenetic studies. Here, we explored the impact of substitution model over-parameterization in a Bayesian statistical framework. We performed simulations under the simplest substitution model, the Jukes-Cantor model, and compared posterior estimates of phylogenetic tree topologies and tree length under the true model to the most complex model, the $\text{GTR}+\Gamma+\text{I}$ substitution model, including over-splitting the data into additional subsets (i.e., applying partitioned models). We explored 4 choices of prior distributions: the default substitution model priors of MrBayes, BEAST2, and RevBayes and a newly devised prior choice (Tame). Our results show that Bayesian inference of phylogeny is robust to substitution model over-parameterization and over-partitioning but only under our new prior settings. All 3 current default priors introduced biases for the estimated tree length. We conclude that substitution and partition model selection are superfluous steps in Bayesian phylogenetic inference pipelines if well-behaved prior distributions are applied and more effort should focus on more complex and biologically realistic substitution models.
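To make concrete how simple the Jukes-Cantor model mentioned above is, the following Python sketch evaluates its standard transition probabilities and the corresponding distance correction. Under this model all substitutions occur at the same rate, so a single quantity, the expected number of substitutions per site d, determines everything. The function names are hypothetical and chosen for this example only; the sketch is not drawn from the study's simulation code.

```python
import math

def jc69_prob(d, same):
    """Jukes-Cantor probability over a branch of length d
    (expected substitutions per site).

    same=True  -> probability that the nucleotide is unchanged
    same=False -> probability of one specific different nucleotide
    """
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay

def jc69_distance(p):
    """Jukes-Cantor corrected distance from the observed proportion
    p of differing sites; defined only for 0 <= p < 0.75."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

d = 0.3
p_same = jc69_prob(d, same=True)
p_diff = jc69_prob(d, same=False)
# Each row of the transition matrix sums to one.
print(p_same + 3 * p_diff)
# The distance correction inverts the expected proportion of differences.
print(jc69_distance(3 * p_diff))
```

By contrast, the GTR+Γ+I model adds exchangeability rates, base frequencies, a gamma shape parameter, and a proportion of invariable sites, which is the over-parameterization examined in the abstract above.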
Multi-type birth-death (MTBD) models are phylodynamic analogues of compartmental models in classical epidemiology. They serve to infer such epidemiological parameters as the average number of secondary infections Re and the infectious time from a phylogenetic tree (a genealogy of pathogen sequences). The representatives of this model family focus on various aspects of pathogen epidemics. For instance, the birth-death exposed-infectious (BDEI) model describes the transmission of pathogens featuring an incubation period (when there is a delay between the moment of infection and becoming infectious, as for Ebola and SARS-CoV-2), and permits its estimation along with other parameters. With constantly growing sequencing data, MTBD models should be extremely useful for unravelling information on pathogen epidemics. However, existing implementations of these models in a phylodynamic framework have not yet caught up with the sequencing speed. Computing time and numerical instability issues limit their applicability to medium data sets (≤ 500 samples), while the accuracy of estimations should increase with more data. We propose a new highly parallelizable formulation of ordinary differential equations for MTBD models. We also extend them to forests to represent situations when a (sub-)epidemic started from several cases (e.g., multiple introductions to a country). We implemented it for the BDEI model in a maximum likelihood framework using a combination of numerical analysis methods for efficient equation resolution. Our implementation estimates epidemiological parameter values and their confidence intervals in two minutes on a phylogenetic tree of 10,000 samples. Comparison to the existing implementations on simulated data shows that it is not only much faster but also more accurate. An application of our tool to the 2014 Ebola epidemic in Sierra Leone is also convincing, with very fast calculation and precise estimates.
As MTBD models are closely related to Cladogenetic State Speciation and Extinction (ClaSSE)-like models, our findings could also be easily transferred to the macroevolution domain.
The genomic era has opened up vast opportunities in molecular systematics, one of which is deciphering evolutionary history in fine detail. Under this mass of data, analyzing the point mutations of standard markers is often too crude and slow for fine-scale phylogenetics. Nevertheless, genome dynamics (GD) events provide alternative, often richer information. The synteny index (SI) between a pair of genomes combines gene order and gene content information, allowing the comparison of genomes of unequal gene content, together with order considerations of their common genes. Recently, genome dynamics has been modeled as a continuous-time Markov process, and gene distance in the genome as a birth-death-immigration process. Nevertheless, due to complexities arising in this setting, no precise and provably consistent estimators could be derived, resulting in heuristic solutions. Here, we extend this modeling approach by using techniques from birth-death theory to derive explicit expressions of the system's probabilistic dynamics in the form of rational functions of the model parameters. This, in turn, allows us to infer analytically accurate distances between organisms based on their SI. Subsequently, we establish additivity of this estimated evolutionary distance (a desirable property yielding phylogenetic consistency). Applying the new measure in simulation studies shows that it provides accurate results in realistic settings and even under model extensions such as gene gain/loss or over a tree structure. In the real-data realm, we applied the new formulation to a unique data structure that we constructed, the ordered orthology DB, based on a new version of the EggNOG database, to construct a tree with more than 4.5K taxa. To the best of our knowledge, this is the largest gene-order-based tree constructed and it overcomes shortcomings found in previous approaches.
Constructing a GD-based tree allows us to confirm and contrast findings based on other phylogenetic approaches, as we show.
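The abstract above does not spell out how the synteny index is computed, so the following Python sketch rests on an assumed definition: for each gene shared by two genomes, take the 2k genes within distance k of it in each genome, compute the fraction of those neighbors that are common to both, and average over all shared genes. The circular-genome treatment, the requirement that each gene appears once per genome, and the names `k_neighborhood` and `synteny_index` are all assumptions of this example; the authors' exact formulation may differ.

```python
def k_neighborhood(genome, idx, k):
    """Genes within distance k of position idx on a circular genome
    (the gene at idx itself is excluded). Assumes len(genome) > 2 * k."""
    n = len(genome)
    return {genome[(idx + off) % n] for off in range(-k, k + 1) if off != 0}

def synteny_index(genome1, genome2, k):
    """Assumed synteny index: mean fraction of shared k-neighbors,
    taken over the genes present in both genomes.

    Each genome is a list of unique gene identifiers in genomic order.
    """
    pos1 = {gene: i for i, gene in enumerate(genome1)}
    pos2 = {gene: i for i, gene in enumerate(genome2)}
    common = pos1.keys() & pos2.keys()
    if not common:
        return 0.0
    total = 0.0
    for gene in common:
        n1 = k_neighborhood(genome1, pos1[gene], k)
        n2 = k_neighborhood(genome2, pos2[gene], k)
        total += len(n1 & n2) / (2 * k)
    return total / len(common)

reference = list("abcdefgh")
shuffled = list("acegbdfh")
print(synteny_index(reference, reference, k=2))  # identical gene order
print(synteny_index(reference, shuffled, k=1))   # heavily rearranged order
```

Because the average runs only over shared genes while the neighborhoods are drawn from each full genome, a measure of this form reflects both gene content and gene order, which matches the description of the SI given above.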
Burmese amber is a significant source of fossils that documents the mid-Cretaceous biota. This deposit was formed around 99 Ma on the Burma Terrane, which broke away from Gondwana and later collided with Asia, although the timing is disputed. Palpimanoidea is a dispersal-limited group that was a dominant element of the Mesozoic spider fauna, and has an extensive fossil record, particularly from Burmese amber. Using morphological and molecular data, evolutionary relationships of living and fossil Palpimanoidea are examined. Divergence dating with fossils as terminal tips shows that the timing of diversification is contemporaneous with continental breakup. Ancestral range estimations show widespread ancestral ranges that divide into lineages that inherit different Pangean fragments, consistent with vicariance. Our results suggest that the Burmese amber fauna has ties to Gondwana due to a historical connection in the Early Cretaceous, and that the Burma Terrane facilitated biotic exchange by transporting lineages from Gondwana into the Holarctic in the Cretaceous.
Extinction is a dominant force shaping patterns of biodiversity through time; however, its role as a catalyst of speciation through its interaction with intraspecific variation has been overlooked. Here, we synthesize ideas alluded to by Darwin and others into the model of "speciation-by-extinction", in which speciation results from the extinction of intermediate populations within a single geographically variable species. We explore the properties and distinguishing features of speciation-by-extinction with respect to other established speciation models. We demonstrate its plausibility by showing that the experimental extinction of populations within variable species can result in speciation. The prerequisites for speciation-by-extinction, geographically structured intraspecific variation and local extinction, are ubiquitous in nature. We propose that speciation-by-extinction may be a prevalent, but underappreciated, speciation mechanism.