Phylogenomics has the power to uncover complex phylogenetic scenarios across the genome. In most cases, no single topology is reflected across the entire genome as the phylogenetic signal differs among genomic regions due to processes such as introgression and incomplete lineage sorting. Baleen whales are among the largest vertebrates on Earth with a high dispersal potential in a relatively unrestricted habitat, the oceans. The fin whale (Balaenoptera physalus) is one of the most enigmatic baleen whale species, currently divided into four subspecies. It has been a matter of debate whether phylogeographic patterns explain taxonomic variation in fin whales. Here we present a chromosome-level whole genome analysis of the phylogenetic relationships among fin whales from multiple ocean basins. First, we estimated concatenated and consensus phylogenies for both the mitochondrial and nuclear genomes. The consensus phylogenies based upon the autosomal genome uncovered monophyletic clades associated with each ocean basin, aligning with the current understanding of subspecies division. Nevertheless, discordances were detected in the phylogenies based on the Y chromosome, mitochondrial genome, autosomal genome and X chromosome. Furthermore, we detected signs of introgression and pervasive phylogenetic discordance across the autosomal genome. This complex phylogenetic scenario could be explained by a puzzle of introgressive events, not yet documented in fin whales. Similarly, incomplete lineage sorting and low phylogenetic signal could lead to such phylogenetic discordances. Our study reinforces the pitfalls of relying on concatenated or single locus phylogenies to determine taxonomic relationships below the species level by illustrating the underlying nuances which some phylogenetic approaches may fail to capture. We emphasize the significance of accurate taxonomic delineation in fin whales by exploring crucial information revealed through genome-wide assessments.
African antelope diversity is a globally unique vestige of a much richer world-wide Pleistocene megafauna. Despite this, the evolutionary processes leading to the prolific radiation of African antelopes are not well understood. Here, we sequenced 145 whole genomes from both subspecies of the waterbuck (Kobus ellipsiprymnus), an African antelope believed to be in the process of speciation. We investigated genetic structure and population divergence and found evidence of a mid-Pleistocene separation on either side of the eastern Great Rift Valley, consistent with vicariance caused by a rain shadow along the so-called 'Kingdon's Line'. However, we also found pervasive evidence of both recent and widespread historical gene flow across the Rift Valley barrier. By inferring the genome-wide landscape of variation among subspecies, we found 14 genomic regions of elevated differentiation, including a locus that may be related to each subspecies' distinctive coat pigmentation pattern. We investigated these regions as candidate speciation islands. However, we observed no significant reduction in gene flow in these regions, nor any indications of selection against hybrids. Altogether, these results suggest a pattern whereby climatically driven vicariance is the most important process driving the African antelope radiation, and suggest that reproductive isolation may not set in until very late in the divergence process. This has a significant impact on taxonomic inference, as many taxa will be in a gray area of ambiguous systematic status, possibly explaining why it has been hard to achieve consensus regarding the species status of many African antelopes. Our analyses demonstrate how population genetics based on low-depth whole genome sequencing can provide new insights that can help resolve how far lineages have gone along the path to speciation.
We introduce PhyloJunction, a computational framework designed to facilitate the prototyping, testing, and characterization of evolutionary models. PhyloJunction is distributed as an open-source Python library that can be used to implement a variety of models, thanks to its flexible graphical modeling architecture and dedicated model specification language. Model design and use are exposed to users via command-line and graphical interfaces, which integrate the steps of simulating, summarizing, and visualizing data. This paper describes the features of PhyloJunction, which include, but are not limited to, a general implementation of a popular family of phylogenetic diversification models. It also describes how, moving forward, PhyloJunction may be expanded not only to include new models but also to become a platform for conducting and teaching statistical learning.
Adaptive radiation involves diversification along multiple trait axes, producing phenotypically diverse, species-rich lineages. Theory generally predicts that multi-trait evolution occurs via a 'stages' model, with some traits saturating early in a lineage's history, and others diversifying later. Despite its multidimensional nature, however, we know surprisingly little about how different suites of traits evolve during adaptive radiation. Here, we investigated the rate, pattern, and timing of morphological and physiological evolution in the anole lizard adaptive radiation from the Caribbean island of Hispaniola. Rates and patterns of morphological and physiological diversity are largely unaligned, corresponding to independent selective pressures associated with structural and thermal niches. Cold tolerance evolution reflects parapatric divergence across elevation, rather than niche partitioning within communities. Heat tolerance evolution and the preferred temperature evolve more slowly than cold tolerance, reflecting behavioral buffering, particularly in edge-habitat species (a pattern associated with the Bogert effect). In contrast to the nearby island of Puerto Rico, closely related anoles on Hispaniola do not sympatrically partition thermal niche space. Instead, allopatric and parapatric separation across biogeographic and environmental boundaries serves to keep morphologically similar close relatives apart. The phenotypic diversity of this island's adaptive radiation accumulated largely as a by-product of time, with surprisingly few exceptional pulses of trait evolution. A better understanding of the processes that guide multidimensional trait evolution (and nuance therein) will prove key in determining whether the stages model should be considered a common theme of adaptive radiation.
Ancient DNA (aDNA) is increasingly being used to investigate questions such as the phylogenetic relationships and divergence times of extant and extinct species. If aDNA samples are sufficiently old, expected branch lengths (in units of nucleotide substitutions) are reduced relative to contemporary samples. This can be accounted for by incorporating sample ages into phylogenetic analyses. Existing methods that use tip (sample) dates infer gene trees rather than species trees, which can lead to incorrect or biased inferences of the species tree. Methods using a multispecies coalescent (MSC) model overcome these issues. We developed an MSC model with tip dates and implemented it in the program bpp. The method performed well for a range of biologically realistic scenarios, estimating calibrated divergence times and mutation rates precisely. Simulations suggest that estimation precision can be best improved by prioritizing sampling of many loci and more ancient samples. Incorrectly treating ancient samples as contemporary in analyzing simulated data, mimicking a common practice of empirical analyses, led to large systematic biases in model parameters, including divergence times. Two genomic datasets of mammoths and elephants were analyzed, demonstrating the method's empirical utility.
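The effect of sample age on branch lengths can be illustrated with a small worked example. The sketch below is not the bpp implementation and does not model the coalescent: it assumes a strict molecular clock, a single lineage pair, and no ancestral polymorphism. The function names and the numerical values are hypothetical and chosen only for illustration.

```python
def expected_branch_length(divergence_time, sample_age, rate):
    """Expected substitutions per site on the lineage leading to one sample.

    Assumes a strict clock: the lineage accumulates substitutions only from
    the divergence event until the time the sample died.
    """
    if sample_age > divergence_time:
        raise ValueError("a sample cannot be older than its own divergence")
    return rate * (divergence_time - sample_age)


def naive_divergence_time(divergence_time, sample_age, rate):
    """Divergence time implied when an ancient sample is treated as contemporary.

    The pairwise distance between a modern and an ancient sample is computed
    with the true ages, then converted to time as if both tips had age zero.
    """
    distance = (
        expected_branch_length(divergence_time, 0.0, rate)
        + expected_branch_length(divergence_time, sample_age, rate)
    )
    return distance / (2.0 * rate)


if __name__ == "__main__":
    # Hypothetical values: divergence 1,000,000 years ago, a sample that is
    # 200,000 years old, and 1e-9 substitutions per site per year.
    true_time = 1_000_000.0
    implied = naive_divergence_time(true_time, 200_000.0, 1e-9)
    print(f"true divergence time: {true_time:.0f}")
    print(f"implied when the tip date is ignored: {implied:.0f}")
```

In this toy setting, ignoring the tip date shortens the implied divergence time by half the sample age. A full MSC analysis also has to account for coalescent variation among loci, which this sketch leaves out.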
The evolution of gene families is complex, involving gene-level evolutionary events such as gene duplication, horizontal gene transfer, and gene loss, and other processes such as incomplete lineage sorting (ILS). Because of this, topological differences often exist between gene trees and species trees. A number of models have been recently developed to explain these discrepancies, the most realistic of which attempts to consider both gene-level events and ILS. When unified in a single model, the interaction between ILS and gene-level events can cause polymorphism in gene copy number, which we refer to as copy number hemiplasy (CNH). In this paper, we extend the Wright-Fisher process to include duplications and losses over several species, and show that the probability of CNH for this process can be significant. We study how well two unified models approximate the Wright-Fisher process with duplication and loss: the multilocus multispecies coalescent (MLMSC), which models CNH, and duplication, loss, and coalescence (DLCoal), which does not. We then study the effect of CNH on gene family evolution by comparing MLMSC and DLCoal. We generate comparable gene trees under both models, showing significant differences in various summary statistics; most importantly, CNH reduces the number of gene copies greatly. If this is not taken into account, the traditional method of estimating duplication rates (by counting the number of gene copies) becomes inaccurate. The simulated gene trees are also used for species tree inference with the summary methods ASTRAL and ASTRAL-Pro, demonstrating that their accuracy, based on CNH-unaware simulations calibrated on real data, may have been overestimated.
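The idea that a duplication can still be polymorphic in a population when a speciation occurs can be explored with a toy simulation. The sketch below is a deliberately simplified, hypothetical setup and not the extended Wright-Fisher process of the paper: it follows one neutral duplication in a haploid population of constant size, with no further duplications or losses, and asks how often the duplicate is lost, fixed, or still segregating after a given number of generations.

```python
import random


def simulate_duplicate_fate(pop_size, generations, replicates, seed=0):
    """Estimate the fate of a single new duplication under neutral drift.

    Each replicate starts with the duplication in one individual and applies
    Wright-Fisher resampling for up to `generations` generations. Returns the
    fraction of replicates in which the duplicate is lost, fixed, or still
    polymorphic at the end.
    """
    rng = random.Random(seed)
    outcomes = {"lost": 0, "fixed": 0, "polymorphic": 0}
    for _ in range(replicates):
        count = 1  # the duplication arises in a single individual
        for _ in range(generations):
            if count == 0 or count == pop_size:
                break  # absorbed: no more change without new mutations
            freq = count / pop_size
            # Binomial resampling: each offspring inherits the duplicate
            # with probability equal to its current frequency.
            count = sum(rng.random() < freq for _ in range(pop_size))
        if count == 0:
            outcomes["lost"] += 1
        elif count == pop_size:
            outcomes["fixed"] += 1
        else:
            outcomes["polymorphic"] += 1
    return {fate: n / replicates for fate, n in outcomes.items()}


if __name__ == "__main__":
    # Hypothetical parameters, chosen only so the example runs quickly.
    for gens in (1, 5, 20):
        print(gens, simulate_duplicate_fate(pop_size=20, generations=gens,
                                            replicates=1000))
```

The "polymorphic" fraction is the quantity of interest here: it is the chance, under these assumptions, that descendant lineages splitting at that time could inherit different copy numbers.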
Hundreds or thousands of loci are now routinely used in modern phylogenomic studies. Concatenation approaches to tree inference assume that there is a single topology for the entire dataset, but different loci may have different evolutionary histories due to incomplete lineage sorting (ILS), introgression, and/or horizontal gene transfer; even single loci may not be treelike due to recombination. To overcome this shortcoming, we introduce an implementation of a multi-tree mixture model that we call mixtures across sites and trees (MAST). This model extends a prior implementation by Boussau et al. (2009) by allowing users to estimate the weight of each of a set of pre-specified bifurcating trees in a single alignment. The MAST model allows each tree to have its own weight, topology, branch lengths, substitution model, nucleotide or amino acid frequencies, and model of rate heterogeneity across sites. We implemented the MAST model in a maximum-likelihood framework in the popular phylogenetic software, IQ-TREE. Simulations show that we can accurately recover the true model parameters, including branch lengths and tree weights for a given set of tree topologies, under a wide range of biologically realistic scenarios. We also show that we can use standard statistical inference approaches to reject a single-tree model when data are simulated under multiple trees (and vice versa). We applied the MAST model to multiple primate datasets and found that it can recover the signal of ILS in the Great Apes, as well as the asymmetry in minor trees caused by introgression among several macaque species. When applied to a dataset of 4 Platyrrhine species for which standard concatenated maximum likelihood (ML) and gene tree approaches disagree, we observe that MAST gives the highest weight (i.e., the largest proportion of sites) to the tree also supported by gene tree approaches. 
These results suggest that the MAST model is able to analyze a concatenated alignment using ML while avoiding some of the biases that come with assuming there is only a single tree. We discuss how the MAST model can be extended in the future.
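The core of a mixture across trees can be sketched in a few lines: the likelihood of each site is a weighted sum of its likelihoods under each candidate tree, and the weights sum to one. The example below is a minimal sketch and not IQ-TREE's implementation. It assumes the per-site likelihoods under each tree have already been computed and are held fixed, and it uses a standard expectation-maximization update for the weights only, which is a choice made here for illustration. The function names and input values are hypothetical.

```python
import math


def mixture_log_likelihood(site_likelihoods, weights):
    """Log-likelihood of an alignment under a weighted mixture of trees.

    site_likelihoods[i][j] is the likelihood of site i under tree j.
    """
    return sum(
        math.log(sum(w * lik for w, lik in zip(weights, row)))
        for row in site_likelihoods
    )


def em_tree_weights(site_likelihoods, n_iter=200):
    """Estimate tree weights by EM, keeping per-site likelihoods fixed."""
    n_sites = len(site_likelihoods)
    n_trees = len(site_likelihoods[0])
    weights = [1.0 / n_trees] * n_trees  # start from equal weights
    for _ in range(n_iter):
        totals = [0.0] * n_trees
        for row in site_likelihoods:
            denom = sum(w * lik for w, lik in zip(weights, row))
            # E-step: posterior probability that this site belongs to tree j.
            for j, (w, lik) in enumerate(zip(weights, row)):
                totals[j] += w * lik / denom
        # M-step: new weight is the average posterior across sites.
        weights = [t / n_sites for t in totals]
    return weights


if __name__ == "__main__":
    # Hypothetical per-site likelihoods for two trees: three sites favour
    # the first tree and one site favours the second.
    liks = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]
    w = em_tree_weights(liks)
    print("weights:", w)
    print("log-likelihood:", mixture_log_likelihood(liks, w))
```

In the full model described above, each tree also has its own branch lengths and substitution model, so the per-site likelihoods change as those parameters are optimized. The sketch isolates only the weight-estimation step.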
Across the Tree of Life, most studies of phenotypic disparity and diversification have been restricted to adult organisms. However, many lineages have distinct ontogenetic phases that differ from their adult forms in morphology and ecology. Focusing disproportionately on the evolution of adult forms unnecessarily hinders our understanding of the pressures shaping evolution over time. Non-adult disparity patterns are particularly important to consider for coastal ray-finned fishes, which can have juvenile phases with distinct phenotypes. These juvenile forms are often associated with sheltered nursery environments, with phenotypic shifts between adults and juvenile stages that are readily apparent in locomotor morphology. Whether this ontogenetic variation in locomotor morphology reflects a decoupling of diversification dynamics between life stages remains unknown. Here we investigate the evolutionary dynamics of locomotor morphology between adult and juvenile triggerfishes. We integrate a time-calibrated phylogenetic framework with geometric morphometric approaches and measurement data of fin aspect ratio and incidence, and reveal a mismatch between morphospace occupancy, the evolution of morphological disparity, and the tempo of trait evolution between life stages. Collectively, our results illuminate how the heterogeneity of morpho-functional adaptations can decouple the mode and tempo of morphological diversification between ontogenetic stages.
Scientific names permit humans and search engines to access knowledge about the biodiversity that surrounds us, and names linked to DNA sequences are playing an ever-greater role in search-and-match identification procedures. Here, we analyze how users and curators of the National Center for Biotechnology Information (NCBI) are flagging and curating sequences derived from nomenclatural type material, which is the only way to improve the quality of DNA-based identification in the long run. For prokaryotes, 18,281 genome assemblies from type strains have been curated by NCBI staff and improve the quality of prokaryote naming. For Fungi, type-derived sequences representing over 21,000 species are now essential for fungus naming and identification. For the remaining eukaryotes, however, the numbers of sequences identifiable as type-derived are minuscule, representing only 739 species of arthropods, 1,542 vertebrates, and 125 embryophytes. An increase in the production and curation of such sequences will come from (i) sequencing of types or topotypic specimens in museum collections, (ii) the March 2023 rule changes at the International Nucleotide Sequence Database Collaboration requiring more metadata for specimens, and (iii) efforts by data submitters to facilitate curation, including informing NCBI curators about a specimen's type status. We illustrate different type-data submission journeys and provide best-practice examples from a range of organisms. Expanding the number of type-derived sequences in DNA databases, especially of eukaryotes, is crucial for capturing, documenting, and protecting biodiversity.