Biology has become a highly mathematical discipline in which probabilistic models play a central role. As a result, research in the biological sciences is now dependent on computational tools capable of carrying out complex analyses. These tools must be validated before they can be used, but what is understood as validation varies widely among methodological contributions. This may be a consequence of the still embryonic stage of the literature on statistical software validation for computational biology. Our manuscript aims to advance this literature. Here, we describe, illustrate and introduce new good practices for assessing the correctness of a model implementation, with an emphasis on Bayesian methods. We also introduce a suite of functionalities for automating validation protocols. It is our hope that the guidelines presented here help sharpen the focus of discussions on (as well as elevate) expected standards of statistical software for biology.
Eyes within the marine gastropod superfamily Stromboidea range widely in size, from 0.2 to 2.3 mm - the largest eyes known in any gastropod. Despite this interesting variation, the underlying evolutionary pressures remain unknown. Here, we use the wealth of material available in museum collections to explore the evolution of stromboid eye size and structure. Our results suggest that depth is a key light-limiting factor in stromboid eye evolution; here, increasing water depth is correlated with increasing aperture width relative to lens diameter, and therefore an increasing investment in sensitivity in dim light environments. In the major clade containing all large-eyed stromboid families, species observed active during the day and the night had wider eye apertures relative to lens sizes than species observed active during the day only, thereby prioritising sensitivity over resolution. Species with no consistent diel activity pattern also had smaller body sizes than exclusively day-active species, which may suggest that smaller animals are more vulnerable to shell-crushing predators, and avoid the higher predation pressure experienced by animals active during the day. Within the same major clade, ancestral state reconstruction suggests that absolute eye size increased above 1 mm twice. The unresolved position of Varicospira, however, weakens this hypothesis and further work with additional markers is needed to confirm this result.
Evolutionary changes in geographic distribution and larval host plants may promote the rapid diversification of montane insects, but this scenario has been rarely investigated. We studied rapid radiation of the butterfly genus Colias, which has diversified in mountain ecosystems in Eurasia, Africa, and the Americas. Based on a dataset of 150 nuclear protein-coding genetic loci and mitochondrial genomes, we constructed a time-calibrated phylogenetic tree of Colias species with broad taxon sampling. We then inferred their ancestral geographic ranges, historical diversification rates, and the evolution of host use. We found that the most recent common ancestor of Colias was likely geographically widespread and originated ~3.5 Ma. The group subsequently diversified in different regions across the world, often in tandem with geographic expansion events. No aspect of elevation was found to have a direct effect on diversification. The genus underwent a burst of diversification soon after the divergence of the Neotropical lineage, followed by an exponential decline in diversification rate toward the present. The ancestral host repertoire included the legume genera Astragalus and Trifolium but later expanded to include a wide range of Fabaceae genera and plants in more distantly related families, punctuated with periods of host range expansion and contraction. We suggest that the widespread distribution of the ancestor of all extant Colias lineages set the stage for diversification by isolation of populations that locally adapted to the various different environments they encountered, including different host plants. In this scenario, elevation is not the main driver but might have accelerated diversification by isolating populations.
Recent genomic analyses have highlighted the prevalence of speciation with gene flow in many taxa and have underscored the importance of accounting for these reticulate evolutionary processes when constructing species trees and generating parameter estimates. This is especially important for deepening our understanding of speciation in the sea where fast-moving ocean currents, expanses of deep water, and periodic episodes of sea level rise and fall act as soft and temporary allopatric barriers that facilitate both divergence and secondary contact. Under these conditions, gene flow is not expected to cease completely while contemporary distributions are expected to differ from historical ones. Here, we conduct range-wide sampling for Pederson's cleaner shrimp (Ancylomenes pedersoni), a species complex from the Greater Caribbean that contains three clearly delimited mitochondrial lineages with both allopatric and sympatric distributions. Using mtDNA barcodes and a genomic ddRADseq approach, we combine classic phylogenetic analyses with extensive topology testing and demographic modeling (10 site frequency replicates × 45 evolutionary models × 50 model simulations/replicate = 22,500 simulations) to test species boundaries and reconstruct the evolutionary history of what was expected to be a simple case study. Instead, our results indicate a history of allopatric divergence, secondary contact, introgression, and endemic hybrid speciation that we hypothesize was driven by the final closure of the Isthmus of Panama and the strengthening of the Gulf Stream Current ~3.5 Ma. The history of this species complex recovered by model-based methods that allow reticulation differs from that recovered by standard phylogenetic analyses and is unexpected given contemporary distributions. The geologically and biologically meaningful insights gained by our model selection analyses illuminate what is likely a novel pathway of species formation not previously documented that resulted from one of the most biogeographically significant events in Earth's history.
Phylogenetic trees establish a historical context for the study of organismal form and function. Most phylogenetic trees are estimated using a model of evolution. For molecular data, modeling evolution is often based on biochemical observations about changes between character states. For example, there are 4 nucleotides, and we can make assumptions about the probability of transitions between them. By contrast, for morphological characters, we may not know a priori how many characters states there are per character, as both extant sampling and the fossil record may be highly incomplete, which leads to an observer bias. For a given character, the state space may be larger than what has been observed in the sample of taxa collected by the researcher. In this case, how many evolutionary rates are needed to even describe transitions between morphological character states may not be clear, potentially leading to model misspecification. To explore the impact of this model misspecification, we simulated character data with varying numbers of character states per character. We then used the data to estimate phylogenetic trees using models of evolution with the correct number of character states and an incorrect number of character states. The results of this study indicate that this observer bias may lead to phylogenetic error, particularly in the branch lengths of trees. If the state space is wrongly assumed to be too large, then we underestimate the branch lengths, and the opposite occurs when the state space is wrongly assumed to be too small.
Rapidly evolving taxa are excellent models for understanding the mechanisms that give rise to biodiversity. However, developing an accurate historical framework for comparative analysis of such lineages remains a challenge due to ubiquitous incomplete lineage sorting (ILS) and introgression. Here, we use a whole-genome alignment, multiple locus-sampling strategies, and summary-tree and single nucleotide polymorphism-based species-tree methods to infer a species tree for eastern North American Neodiprion species, a clade of pine-feeding sawflies (Order: Hymenopteran; Family: Diprionidae). We recovered a well-supported species tree that-except for three uncertain relationships-was robust to different strategies for analyzing whole-genome data. Nevertheless, underlying gene-tree discordance was high. To understand this genealogical variation, we used multiple linear regression to model site concordance factors estimated in 50-kb windows as a function of several genomic predictor variables. We found that site concordance factors tended to be higher in regions of the genome with more parsimony-informative sites, fewer singletons, less missing data, lower GC content, more genes, lower recombination rates, and lower D-statistics (less introgression). Together, these results suggest that ILS, introgression, and genotyping error all shape the genomic landscape of gene-tree discordance in Neodiprion. More generally, our findings demonstrate how combining phylogenomic analysis with knowledge of local genomic features can reveal mechanisms that produce topological heterogeneity across genomes.
Models have always been central to inferring molecular evolution and to reconstructing phylogenetic trees. Their use typically involves the development of a mechanistic framework reflecting our understanding of the underlying biological processes, such as nucleotide substitutions, and the estimation of model parameters by maximum likelihood or Bayesian inference. However, deriving and optimizing the likelihood of the data is not always possible under complex evolutionary scenarios or even tractable for large datasets, often leading to unrealistic simplifying assumptions in the fitted models. To overcome this issue, we coupled stochastic simulations of genome evolution with a new supervised deep-learning model to infer key parameters of molecular evolution. Our model is designed to directly analyze multiple sequence alignments and estimate per-site evolutionary rates and divergence without requiring a known phylogenetic tree. The accuracy of our predictions matched that of likelihood-based phylogenetic inference when rate heterogeneity followed a simple gamma distribution, but it strongly exceeded it under more complex patterns of rate variation, such as codon models. Our approach is highly scalable and can be efficiently applied to genomic data, as we showed on a dataset of 26 million nucleotides from the clownfish clade. Our simulations also showed that the integration of per-site rates obtained by deep learning within a Bayesian framework led to significantly more accurate phylogenetic inference, particularly with respect to the estimated branch lengths. We thus propose that future advancements in phylogenetic analysis will benefit from a semi-supervised learning approach that combines deep-learning estimation of substitution rates, which allows for more flexible models of rate variation, and probabilistic inference of the phylogenetic tree, which guarantees interpretability and a rigorous assessment of statistical support.
Dating phylogenetic trees to obtain branch lengths in time units is essential for many downstream applications but has remained challenging. Dating requires inferring substitution rates that can change across the tree. While we can assume to have information about a small subset of nodes from the fossil record or sampling times (for fast-evolving organisms), inferring the ages of the other nodes essentially requires extrapolation and interpolation. Assuming a distribution of branch rates, we can formulate dating as a constrained maximum likelihood (ML) estimation problem. While ML dating methods exist, their accuracy degrades in the face of model misspecification, where the assumed parametric statistical distribution of branch rates vastly differs from the true distribution. Notably, most existing methods assume rigid, often unimodal, branch rate distributions. A second challenge is that the likelihood function involves an integral over the continuous domain of the rates, often leading to difficult non-convex optimization problems. To tackle both challenges, we propose a new method called Molecular Dating using Categorical-models (MD-Cat). MD-Cat uses a categorical model of rates inspired by non-parametric statistics and can approximate a large family of models by discretizing the rate distribution into k categories. Under this model, we can use the Expectation-Maximization algorithm to co-estimate rate categories and branch lengths in time units. Our model has fewer assumptions about the true distribution of branch rates than parametric models such as Gamma or LogNormal distribution. Our results on two simulated and real datasets of Angiosperms and HIV and a wide selection of rate distributions show that MD-Cat is often more accurate than the alternatives, especially on datasets with exponential or multimodal rate distributions.
The role of interspecific hybridization has recently seen increasing attention, especially in the context of diversification dynamics. Genomic research has now made it abundantly clear that both hybridization and introgression-the exchange of genetic material through hybridization and backcrossing-are far more common than previously thought. Besides cases of ongoing or recent genetic exchange between taxa, an increasing number of studies report "ancient introgression"- referring to results of hybridization that took place in the distant past. However, it is not clear whether commonly used methods for the detection of introgression are applicable to such old systems, given that most of these methods were originally developed for analyses at the level of populations and recently diverged species, affected by recent or ongoing genetic exchange. In particular, the assumption of constant evolutionary rates, which is implicit in many commonly used approaches, is more likely to be violated as evolutionary divergence increases. To test the limitations of introgression detection methods when being applied to old systems, we simulated thousands of genomic datasets under a wide range of settings, with varying degrees of among-species rate variation and introgression. Using these simulated datasets, we showed that some commonly applied statistical methods, including the D-statistic and certain tests based on sets of local phylogenetic trees, can produce false-positive signals of introgression between divergent taxa that have different rates of evolution. These misleading signals are caused by the presence of homoplasies occurring at different rates in different lineages. To distinguish between the patterns caused by rate variation and genuine introgression, we developed a new test that is based on the expected clustering of introgressed sites along the genome and implemented this test in the program Dsuite.