Feathers are a key novelty underpinning the evolutionary success of birds, yet the origin of feathers remains poorly understood. Debates about feather evolution hinge upon whether filamentous integument evolved once or multiple times independently on the lineage leading to modern birds. Contradictory results on this question stem from methodological differences in statistical ancestral state estimation (ASE). Here we conduct a comprehensive comparison of ASE methodologies applied to stem-group birds, testing the role of outgroup inclusion, tree time scaling method, model choice and character coding strategy. Models are compared based on their Akaike Information Criterion (AIC) scores, mutual information, and the uncertainty of marginal ancestral state estimates. Our results demonstrate that ancestral state estimates of stem-bird integument are strongly influenced by tree time scaling method, outgroup selection and model choice, while character coding strategy has less effect on the ancestral estimates produced. We identify the best fitting and most generalizable models using AIC scores and leave-one-out cross-validation (LOOCV), respectively. Our analyses broadly support the independent origin of filamentous integument in dinosaurs and pterosaurs and support a younger evolutionary origin of feathers than has been suggested previously. In terms of model selection, we observe little correlation between AIC/AICc and LOOCV error, suggesting that, for our dataset, model fit does not reliably predict generalizability. However, both approaches favor models that infer a similar pattern of feather evolution. More broadly, our study highlights that special care must be taken in selecting the outgroup, tree and model when conducting ASE analyses.
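The model comparison described in the abstract above relies on AIC and AICc scores. As a minimal sketch of how such scores are computed from a fitted model's log-likelihood, the following Python snippet uses the standard definitions, AIC = 2k - 2 ln L and AICc = AIC + 2k(k + 1)/(n - k - 1), together with Akaike weights. The function names, the model labels and all numeric values are hypothetical illustrations and are not taken from the study.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def aicc(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Small-sample corrected AIC; requires n_obs > n_params + 1."""
    denom = n_obs - n_params - 1
    if denom <= 0:
        raise ValueError("AICc is undefined when n_obs <= n_params + 1")
    correction = (2.0 * n_params * (n_params + 1)) / denom
    return aic(log_likelihood, n_params) + correction


def akaike_weights(scores: list[float]) -> list[float]:
    """Relative support for each model given its AIC (or AICc) score."""
    best = min(scores)
    rel = [math.exp(-0.5 * (s - best)) for s in scores]
    total = sum(rel)
    return [r / total for r in rel]


if __name__ == "__main__":
    # Hypothetical fits: (label, log-likelihood, number of free parameters).
    fits = [("equal-rates", -42.0, 1), ("all-rates-different", -40.5, 2)]
    n_tips = 80  # hypothetical number of taxa, used as the sample size
    scores = [aicc(lnl, k, n_tips) for _, lnl, k in fits]
    for (label, _, _), score, weight in zip(fits, scores, akaike_weights(scores)):
        print(f"{label}: AICc={score:.3f}, weight={weight:.3f}")
```

Treating the number of tips as the sample size for AICc is itself an assumption of this sketch; the appropriate effective sample size for phylogenetic data is a modelling choice.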
Genomes are composed of a mosaic of segments inherited from different ancestors, each separated by past recombination events. Consequently, genealogical relationships among multiple genomes vary spatially across different genomic regions. Genealogical variation among unlinked (uncorrelated) genomic regions is well described for either a single population (coalescent) or multiple structured populations (multispecies coalescent). However, the expected similarity among genealogies at linked regions of a genome is less well characterized. Recently, an analytical solution was derived for the distribution of the waiting distance to a change in the genealogical tree along a genome for a single population with constant effective population size. Here we describe a generalization of this result, in terms of the distribution of waiting distances between changes in genealogical trees and topologies for multiple structured populations with branch-specific effective population sizes (i.e., under the multispecies coalescent). We implemented our model in the Python package ipcoal and validated its accuracy against stochastic coalescent simulations. Using a novel likelihood framework, we show that tree-change and topology-change waiting distances in an ancestral recombination graph (ARG) can be used to fit species tree model parameters, demonstrating an application of our model for developing new methods for phylogenetic inference. The Multi-Species Sequentially Markov Coalescent (MS-SMC) model presented here represents a major advance for linking local ancestry inference to hierarchical demographic models.
For many questions in ecology and evolution, the most relevant data to consider are attributes of lineage pairs. Comparative tests for causal relationships among traits like 'diet niche overlap', 'divergence time', and 'strength of reproductive isolation (RI)' - measured for pairwise combinations of related species or populations - have led to several groundbreaking insights, but the correct statistical approach for these analyses has never been clear. Lineage-pair traits are non-independent, but unlike the expected covariance among species' traits, which is captured by a phylogenetic covariance matrix arising from a given model, the expected covariance among lineage-pair traits has not been explicitly formulated. Analyses of pairwise-defined data have thus employed untested workarounds for non-independence rather than direct models of lineage-pair covariance, with consequences that are unexplored. Here, we consider how evolutionary relatedness among taxa translates into non-independence among taxonomic pairs. We develop models by which phylogenetic signal in an underlying character generates covariance among pairs in a lineage-pair trait. We incorporate the resulting lineage-pair covariance matrices into modified versions of phylogenetic generalized least squares and a new phylogenetic beta regression for bounded response variables. Both outperform previous approaches in simulation tests. We find that a common heuristic method, node averaging, imparts a greater cost to model performance than does the non-independence it was designed to correct. We re-analyze two empirical datasets to find dramatic improvements in model fit and, in the case of avian hybridization data, an even stronger relationship between pair age and RI than is revealed from uncorrected analysis. 
Finally, we present a new tool, the R package phylopairs, that allows empiricists to test relationships among pairwise-defined variables in a way that is statistically robust and more straightforward to implement.
Species diversification is characterized by speciation and extinction, the rates of which can, under some assumptions, be estimated from time-calibrated phylogenies. However, maximum likelihood estimation (MLE) methods for inferring rates are limited to simpler models and can show bias, particularly in small phylogenies. Likelihood-free methods to estimate parameters of diversification models using deep learning have started to emerge, but how robustly neural network methods handle the intricate nature of phylogenetic data remains an open question. Here we present a new ensemble neural network approach to estimate diversification parameters from phylogenetic trees that leverages different classes of neural networks (dense neural network, graph neural network, and long short-term memory recurrent network) and simultaneously learns from graph representations of phylogenies, their branching times and their summary statistics. Our best-performing ensemble neural network (which adjusts the graph neural network result using a recurrent neural network) delivers estimates faster than MLE and shows less sensitivity to tree size for constant-rate and diversity-dependent speciation scenarios. It performs well compared to an existing convolutional network approach. However, like MLE, our approach still fails to recover parameters precisely under a protracted birth-death process. Our analysis suggests that the primary limitation to accurate parameter estimation is the amount of information contained within a phylogeny, as indicated by its size and the strength of effects shaping it. In cases where MLE is unavailable, our neural network method provides a promising alternative for estimating phylogenetic tree parameters. If detectable phylogenetic signals are present, our approach delivers results that are comparable to MLE but without inherent biases.
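To make the idea of estimating a diversification rate by maximum likelihood concrete, here is a minimal sketch for the simplest possible case: a pure-birth (Yule) process with no extinction, which is far simpler than the models discussed in the abstract above. It is not the authors' method. For a process observed from two crown lineages over a fixed time, the standard maximum likelihood estimate of the speciation rate is the number of speciation events divided by the total branch length. The function names and parameter values are hypothetical.

```python
import random


def simulate_yule(rate: float, crown_age: float, rng: random.Random) -> tuple[int, float]:
    """Simulate a pure-birth process from 2 lineages for `crown_age` time units.

    Returns (number of speciation events, total branch length).
    """
    n_lineages = 2
    elapsed = 0.0
    events = 0
    total_length = 0.0
    while True:
        # Waiting time to the next speciation among n_lineages lineages.
        wait = rng.expovariate(rate * n_lineages)
        if elapsed + wait >= crown_age:
            # No further event before the present; add the remaining length.
            total_length += n_lineages * (crown_age - elapsed)
            return events, total_length
        total_length += n_lineages * wait
        elapsed += wait
        n_lineages += 1
        events += 1


def yule_mle(events: int, total_length: float) -> float:
    """MLE of the speciation rate: events per unit of branch length."""
    if total_length <= 0:
        raise ValueError("total branch length must be positive")
    return events / total_length


if __name__ == "__main__":
    rng = random.Random(1)
    true_rate = 1.0  # hypothetical speciation rate
    for crown_age in (1.0, 3.0, 5.0):  # older crowns tend to give larger trees
        events, length = simulate_yule(true_rate, crown_age, rng)
        print(f"crown age {crown_age}: {events + 2} tips, "
              f"estimated rate {yule_mle(events, length):.3f}")
```

Running the demo with different seeds shows how noisy single-tree estimates can be for small trees, which is in line with the abstract's point that the information contained in a phylogeny limits estimation accuracy.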

