
Systematic Biology: Latest Articles

ConvexML: Fast and accurate branch length estimation under irreversible mutation models, illustrated through applications to CRISPR/Cas9-based lineage tracing
IF 6.5 · Zone 1 Biology · Q1 Evolutionary Biology · Pub Date: 2025-08-12 · DOI: 10.1093/sysbio/syaf054
Sebastian Prillo, Akshay Ravoor, Nir Yosef, Yun S Song
Branch length estimation is a fundamental problem in Statistical Phylogenetics and a core component of tree reconstruction algorithms. Traditionally, general time-reversible mutation models are employed, and many software tools exist for this scenario. With the advent of CRISPR/Cas9 lineage tracing technologies, there has been significant interest in the study of branch length estimation under irreversible mutation models. Under the CRISPR/Cas9 mutation model, irreversible mutations – in the form of DNA insertions or deletions – are accrued during the experiment, which are then read out at the single-cell level to reconstruct the cell lineage tree. However, most of the analyses of CRISPR/Cas9 lineage tracing data have so far been limited to the reconstruction of single-cell tree topologies, which depict lineage relationships between cells, but not the amount of time that has passed between ancestral cell states and the present. Time-resolved trees, known as chronograms, would allow one to study the evolutionary dynamics of cell populations at an unprecedented level of resolution. Indeed, time-resolved trees would reveal the timing of events on the tree, the relative fitness of subclones, and the dynamics underlying phenotypic changes in the cell population – among other important applications. In this work, we introduce the first scalable and accurate method to refine any given single-cell tree topology into a single-cell chronogram by estimating its branch lengths. To do this, we perform regularized maximum likelihood estimation under a general irreversible mutation model, paired with a conservative version of maximum parsimony that reconstructs only the ancestral states that we are confident about. To deal with the particularities of CRISPR/Cas9 lineage tracing data – such as double-resection events which affect runs of consecutive sites – we avoid making our model more complex and instead opt for using a simple but effective data encoding scheme. 
Similarly, we avoid explicitly modeling the missing data mechanisms – such as heritable missing data – by instead assuming that they are missing completely at random. We stabilize estimates in low-information regimes by using a simple penalized version of maximum likelihood estimation (MLE) using a minimum branch length constraint and pseudocounts. All this leads to a convex MLE problem that can be readily solved in seconds with off-the-shelf convex optimization solvers. We benchmark our method using both simulations and real lineage tracing data, and show that it performs well on several tasks, matching or outperforming competing methods such as TiDeTree and LAML in terms of accuracy, while being 10–100× faster. Notably, our statistical model is simpler and more general, as we do not explicitly model the intricacies of CRISPR/Cas9 lineage tracing data. In this sense, our contribution is twofold: (1) a fast and robust method for branch length estimation under a general irreversible mutation model,
and (2) a data encoding scheme specific to CRISPR/Cas9 lineage tracing data that makes it amenable to the general model. Our branch length estimation method, which we call 'ConvexML', should be broadly applicable to any evolutionary model with irreversible mutations (ideally, with high diversity) and an approximately ignorable missing data mechanism. 'ConvexML' is available through the ConvexML open-source Python package.
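To make the idea of a penalized, convex branch length MLE concrete, here is a minimal single-branch sketch. It is not the ConvexML implementation or its API: it assumes a toy model in which every site mutates independently and irreversibly at rate 1, and the function names, the pseudocount value, and the minimum length are all hypothetical choices for illustration.

```python
import math


def neg_log_lik(t, unmutated, mutated):
    """Negative log-likelihood of branch length t under the toy model.

    Toy assumption (not from the abstract): a site stays unmutated over a
    branch of length t with probability exp(-t), else it mutates irreversibly.
    This function is convex in t for t > 0.
    """
    return unmutated * t - mutated * math.log(1.0 - math.exp(-t))


def branch_length_mle(n_sites, n_mutated, pseudocount=0.5, min_length=1e-3):
    """Penalized MLE of one branch length (hypothetical helper).

    Pseudocounts add `pseudocount` fictitious mutated and unmutated sites,
    which keeps the estimate finite when all or no sites are mutated.
    """
    mutated = n_mutated + pseudocount
    unmutated = (n_sites - n_mutated) + pseudocount
    # Setting the derivative of the log-likelihood to zero gives
    # exp(-t) = unmutated / (mutated + unmutated).
    t_hat = -math.log(unmutated / (mutated + unmutated))
    # In one dimension, the minimum-length constraint reduces to clipping,
    # because the objective is convex.
    return max(t_hat, min_length)


if __name__ == "__main__":
    print(branch_length_mle(n_sites=10, n_mutated=5))
    print(branch_length_mle(n_sites=10, n_mutated=0))
```

On a full tree the branch lengths are coupled through the tree structure and must be solved jointly, which is where a general-purpose convex solver would come in; the closed form above only holds for this one-branch toy case.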
Citations: 0
Selecting a Window Size for Phylogenomic Analyses of Whole Genome Alignments using AIC
IF 6.5 · Zone 1 Biology · Q1 Evolutionary Biology · Pub Date: 2025-08-12 · DOI: 10.1093/sysbio/syaf053
Jeremias Ivan, Paul Frandsen, Robert Lanfear
Gene tree discordance along a set of aligned genomes presents a challenge for phylogenomic methods to identify the non-recombining regions and reconstruct the phylogenetic tree for each region. To address this problem, many studies used the non-overlapping window approach, often with an arbitrary selection of fixed window sizes that potentially include intra-window recombination events. In this study, we propose an information theoretic approach to select a window size that best reflects the underlying histories of the alignment. First, we simulated chromosome alignments that reflected the key characteristics of an empirical dataset and found that the AIC is a good predictor of window size accuracy in correctly recovering the tree topologies of the alignment. To address the issue of missing data in empirical datasets, we designed a stepwise non-overlapping window approach that compares the AIC of two window sizes at a time, retaining only genomic regions that can be analysed using both window sizes. We then applied this method to the genomes of Heliconius butterflies and great apes. We found that the best window sizes for the butterflies’ chromosomes ranged from <125bp to 250bp, which are much shorter than those used in a previous study even though this difference in window size did not significantly change the most common topologies across the genome. On the other hand, the best window sizes for great apes’ chromosomes ranged from 500bp to 1kb with the proportion of the major topology (grouping human and chimpanzee) falling between 60% and 87%, consistent with previous findings. Additionally, we observed a notable impact of gene tree estimation error and concatenation when using small and large windows, respectively. For instance, the proportion of the major topology for great apes was 50% when using 250bp windows, but reached almost 100% for 64kb windows. 
In conclusion, our study highlights the challenges associated with selecting a fixed window size in non-overlapping window analyses and proposes the AIC as a less arbitrary way to select the optimal window size when running non-overlapping window methods on whole genome alignments.
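The pairwise AIC comparison of window sizes can be illustrated with a short sketch. This is not the authors' pipeline: the log-likelihoods and parameter counts below are made-up numbers, the function names are hypothetical, and the sketch simply assumes that both window sizes were fitted to the same retained genomic region so that their summed AIC values are comparable.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion, 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def total_aic(window_fits):
    """Sum AIC over windows; each fit is a (log_likelihood, n_params) pair."""
    return sum(aic(lnl, k) for lnl, k in window_fits)


def better_window_size(fits_by_size):
    """Return the window size (in bp) whose windows give the lowest total AIC.

    `fits_by_size` maps a window size to the per-window fits obtained on the
    same region, e.g. two 250-bp windows versus one 500-bp window.
    """
    return min(fits_by_size, key=lambda size: total_aic(fits_by_size[size]))


if __name__ == "__main__":
    # Hypothetical fits for one 500-bp region (values invented for illustration).
    fits = {
        250: [(-520.0, 9), (-515.0, 9)],  # one tree per 250-bp window
        500: [(-1050.0, 9)],              # a single tree for the whole region
    }
    print({size: total_aic(f) for size, f in fits.items()})
    print(better_window_size(fits))
```

Smaller windows pay for extra trees through the `2k` penalty, so they only win when the gain in likelihood outweighs the added parameters.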
Citations: 0
But the Clock, Tick-Tock: An Empirical Case Study Highlights the Preeminence of Relaxed Clock Models in Total-Evidence Dating
IF 6.5 · Zone 1 Biology · Q1 Evolutionary Biology · Pub Date: 2025-08-08 · DOI: 10.1093/sysbio/syaf055
Nicolás Mongiardino Koch, Jeffrey R Thompson, Rich Mooi, Greg W Rouse
Phylogenetic clock models translate inferred amounts of evolutionary change (calculated from either genotypes or phenotypes) into estimates of elapsed time, providing a mechanism for time scaling phylogenetic trees. Relaxed-clock models, which accommodate variation in evolutionary rates across branches, are one of the main components of Bayesian dating, yet their consequences for total-evidence phylogenetics have not been thoroughly explored. Here, we combine morphological, molecular (both transcriptomic and Sanger-sequenced), and stratigraphic datasets for all major lineages of echinoids (sea urchins, heart urchins, sand dollars). We then perform total-evidence dated inference under the fossilized birth-death prior, varying two analytical conditions: the choice between autocorrelated and uncorrelated relaxed clocks, which enforce (or not) evolutionary rate inheritance; and the ability to recover fossil terminals as direct ancestors. Our results highlight a previously unnoticed interaction between tree and clock models, with analyses implementing an autocorrelated clock failing to recover any direct ancestors. Nonetheless, even under conditions conducive to the placement of fossil terminals as ancestors, we find this type of relationship to be accommodated without any impact on either topology or node ages. On the other hand, tree topology, fossil placement, divergence times, and downstream macroevolutionary inferences (e.g., ancestral state reconstructions) were all strongly affected by the type of relaxed clock implemented. In regions of the tree where molecular rate variation is pervasive and morphological signal relatively uninformative, fossil tips seem to play little to no role in informing divergence times, and instead passively move in and out of clades depending on the ages imposed upon surrounding nodes by molecular data. 
Our results highlight the extent to which the phylogenetic and macroevolutionary conclusions of total-evidence dated analyses are contingent on the choice of relaxed-clock model, highlighting the need for either careful methodological validation or a thorough assessment of sensitivity. Our efforts continue to illuminate the echinoid tree of life, supporting the erection of the order Apatopygoida to include three living species last sharing a common ancestor with other extant lineages around the time of the Jurassic-Cretaceous boundary. Furthermore, they also illustrate how the phylogenetic placement of extinct clades hinges upon the modelling of molecular data, evidencing the extent to which the fossil record remains subservient to phylogenomics.
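The distinction between uncorrelated and autocorrelated relaxed clocks can be sketched as two ways of drawing branch rates. This is an illustrative toy, not the models or software used in the study: the lognormal parametrization, the parent-index tree encoding, and the function names are all assumptions made here.

```python
import math
import random


def uncorrelated_rates(n_branches, mean_log=0.0, sigma=0.5, rng=random):
    """Uncorrelated clock: every branch rate is an independent lognormal draw."""
    return [rng.lognormvariate(mean_log, sigma) for _ in range(n_branches)]


def autocorrelated_rates(parent, lengths, root_rate=1.0, sigma=0.5, rng=random):
    """Autocorrelated clock: each branch inherits its parent's rate, plus noise.

    `parent[i]` is the index of the branch above branch i, or -1 if branch i
    hangs off the root; parents must appear before their children.
    The log rate of a branch is drawn from a normal distribution centred on
    the parent's log rate, with variance sigma**2 times the branch length.
    """
    rates = []
    for i, p in enumerate(parent):
        parent_rate = root_rate if p == -1 else rates[p]
        sd = sigma * math.sqrt(lengths[i])
        rates.append(math.exp(rng.gauss(math.log(parent_rate), sd)))
    return rates


if __name__ == "__main__":
    rng = random.Random(0)
    parent = [-1, -1, 0, 0]          # hypothetical four-branch tree
    lengths = [1.0, 2.0, 0.5, 0.5]
    print(uncorrelated_rates(len(parent), rng=rng))
    print(autocorrelated_rates(parent, lengths, rng=rng))
```

With `sigma=0` the autocorrelated sampler collapses to a strict clock in which every branch keeps the root rate, which makes the inheritance explicit.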
Citations: 0
Species Diversification in the Sky Islands of Southwestern China Revealed by Genomic, Introgression and Demographic Analyses of Asian Shrew Moles
IF 6.5 · Zone 1 Biology · Q1 Evolutionary Biology · Pub Date: 2025-07-31 · DOI: 10.1093/sysbio/syaf052
Yi-Xian Li, Zhong-Zheng Chen, Quan Li, Tao Zhang, Feng Cheng, Wen-Yu Song, Xue-You Li, Shui-Wang He, Hong-Jiao Wang, Kenneth Otieno Onditi, Xue-Long Jiang
The Mountains of Southwest China, a global biodiversity hotspot, have a unique "sky island" landscape with high diversity of both ancient and recently formed species. While their distribution patterns offer significant insights into diversification processes, the complex geological and climatic history, combined with dynamic histories of gene flow in endemic taxa, make unravelling this history challenging. This study focuses on Asian shrew moles (genus Uropsilus), an ancient group endemic to this region with an unresolved taxonomic system. By combining phylogenomic, introgression and demographic history analyses, we investigated the historical patterns of species diversification in this genus. We detected phylogenetic discordances among rapidly diverged lineages, driven by incomplete lineage sorting, both recent and ancient gene flow, and ghost introgression. The gene flow patterns revealed strong genetic isolation in the Hengduan Mountains region, contrasted by more extensive dispersal or connectivity in areas to its east, while suggesting potential ring-like diversification around the Sichuan Basin. Demographic history indicated that rapidly diverged lineages south of the Yangtze River exhibited significantly different responses to climatic fluctuations compared to other lineages, with the East Asian monsoon likely driving their radiative differentiation and dispersal. Our study demonstrates the impacts of mountain uplift, climatic changes, and the connectivity of sky island refugia in shaping the diverse patterns of species differentiation and their distribution. [phylogenomics; introgression; Asian shrew moles; demographic history].
Citations: 0
Global climate cooling spurred skipper butterfly diversification
IF 6.5 · Zone 1 Biology · Q1 Evolutionary Biology · Pub Date: 2025-07-28 · DOI: 10.1093/sysbio/syaf029
Emmanuel F A Toussaint, Fabien L Condamine, Ana Paula dos Santos De Carvalho, David M Plotkin, Emily A Ellis, Kelly M Dexter, Chandra Earl, Kwaku Aduse-Poku, Michael F Braby, Hideyuki Chiba, Riley J Gott, Kiyoshi Maruyama, Ana BB Morais, Chris J Müller, Djunijanti Peggie, Szabolcs Sáfián, Roger Vila, Andrew D Warren, Masaya Yago, Jesse W Breinholt, Marianne Espeland, Naomi E Pierce, David J Lohman, Akito Y Kawahara
Characterizing drivers governing the diversification of species-rich lineages is challenging. Although butterflies are one of the most well-studied groups of insects, there are few comprehensive studies investigating their diversification dynamics. Here, we reconstruct a phylogenomic tree for ca. 1,500 species in the family Hesperiidae, the skippers, to test whether historical global climate change, geographical range evolution, and host-plant association are drivers of diversification. Our findings suggest skippers originated in Laurasia before the Cretaceous-Paleogene mass extinction, in a northern region centered on Beringia before colonizing southern regions coinciding with global climate cooling. Climate cooling also fostered the diversification of skippers throughout the Cenozoic possibly by fueling biome transitions from closed to open ecosystems such as grasslands. An early shift from dicot-feeding to monocot-feeding reduced extinction rates and increased speciation rates, explaining the large diversity of grass-feeding adapted skippers. A dynamic geographic range evolution and host-plant shifts linked with long-term climate change explain skipper butterfly diversification.
Citations: 0
Phylogenetic Analysis of Characters with Dependencies under Maximum Likelihood
IF 6.5 · Zone 1 Biology · Q1 Evolutionary Biology · Pub Date: 2025-07-26 · DOI: 10.1093/sysbio/syaf051
Pablo A Goloboff
The dependencies between characters used in phylogenetic analysis (e.g., inapplicabilities, functional dependencies) can be taken into account by using combinations of character states as possible ancestral morphotypes, and using appropriate rates of transformation between such morphotypes. As every morphotype represents a permissible combination of the original character states, this allows easily ruling out specific combinations of character states, and taking into account changes that are either less or more likely to co-occur, or to occur in certain contexts. For inapplicable characters, Goloboff et al. (2021) used morphotypes but proposed obtaining transition probabilities between morphotypes from products of transition probabilities of the original characters and factors to incorporate dependencies. The product of transition probabilities is shown here to be flawed (failing the time-continuity requirement of phylogenetic Markov models, essential for statistical consistency under the model). Tarasov (2023) used the same delimitation of morphotypes but proposed obtaining transition probabilities from rate matrices, synthesized in a stepwise fashion from the hierarchy of dependencies. This paper shows that the rate matrices can easily be created, instead of with a stepwise synthesis, from direct comparisons between legitimate morphotypes (as done by Goloboff and De Laet 2023 for parsimony). Based on a few simple rules, the resulting rate matrices are (for inapplicable characters) identical to those obtained by Tarasov (2023). Additionally, in the computer program TNT, biological dependencies beyond mere inapplicability can be specified by the user with a simple syntax for (combinations of) states in “parent” characters restricting the states that “child” characters can take, using AND and OR conjunctions for elaborate interactions. 
These researcher-defined rules are used to internally convert the original characters into morphotypes, discarding morphotypes made impossible by the rules. In the case of biological dependencies (where, depending on the parent characters, there can be restrictions in the states that dependent characters can take, instead of the character being inapplicable), the rates of transition between morphotypes cannot be calculated solely from comparisons of states differing in both morphotypes; the conditions of dependency must be considered as well.
Citations: 0
Correction to: How Important Is Budding Speciation for Comparative Studies?
IF 6.5 Zone 1 (Biology) Q1 EVOLUTIONARY BIOLOGY Pub Date : 2025-07-23 DOI: 10.1093/sysbio/syaf042
Citations: 0
Correction to: Global Patterns of Taxonomic Uncertainty and its Impacts on Biodiversity Research.
IF 6.5 Zone 1 (Biology) Q1 EVOLUTIONARY BIOLOGY Pub Date : 2025-07-15 DOI: 10.1093/sysbio/syaf045
Citations: 0
Revisiting the Multispecies Coalescent Model fit with an example from a complete molecular phylogeny of the Liolaemus wiegmannii species group (Squamata: Liolaemidae).
IF 6.5 Zone 1 (Biology) Q1 EVOLUTIONARY BIOLOGY Pub Date : 2025-07-10 DOI: 10.1093/sysbio/syaf048
Joaquín Villamil, Mariana Morando, Luciano J Avila, Flávia M Lanna, Emanuel M Fonseca, Jack W Sites, Arley Camargo
Departures from the Multispecies Coalescent (MSC) assumptions could cause artefactual topologies and node height estimates; trees inferred without testing MSC model fit could therefore misrepresent the evolutionary history of a group. The current implementation of MSC model testing for non-genomic level molecular markers cannot process trees estimated from BEAST 2, limiting its application for large datasets of sequence-based markers. Here we recode functions of the R package P2C2M to assess model fit to the MSC and apply this new implementation, which we named P2C2M2, to test the MSC model in a 16-locus dataset of 42 lizard species focused on the Liolaemus wiegmannii group. We found strong evidence of model departures in several loci, possibly due to historical gene flow, which could also be causing an unexpected position of the L. wiegmannii group within the L. montanus section of Eulaemus when hybridization is not accounted for. When gene flow is incorporated via a Multispecies Network Coalescent model, the L. anomalus group is inferred as the closest to the L. wiegmannii group, and a reticulation suggesting historical gene flow between the L. wiegmannii and L. montanus groups, which has not been previously reported, is inferred. We argue that there are at least three sources of discrepancy between the literature and the node ages estimated in our study: the use of strict molecular clocks without statistical justification, misplaced fossil calibrations, and the estimation of coalescent times instead of species divergence times. We encourage systematists to routinely test the fit of the MSC model when estimating species trees using sequence-based markers, and to follow a phylogenetic network approach when this test is significant and historical gene flow is considered a plausible source of the departure from the MSC model.
Citations: 0
Data Fusion for Integrative Species Identification Using Deep Learning
IF 6.5 Zone 1 (Biology) Q1 EVOLUTIONARY BIOLOGY Pub Date : 2025-06-13 DOI: 10.1093/sysbio/syaf026
Lara M Kösters, Kevin Karbstein, Martin Hofmann, Ladislav Hodač, Patrick Mäder, Jana Wäldchen
DNA analyses have revolutionized species identification and taxonomic work. Yet, persistent challenges arise from little differentiation among and considerable variation within species, particularly among closely related groups. While images are commonly used as an alternative modality for automated identification tasks, their usability is limited by the same concerns. An integrative strategy, fusing molecular and image data through machine learning, holds significant promise for fine-grained species identification. However, a systematic overview and rigorous statistical testing concerning molecular and image preprocessing and fusion techniques, including practical advice for biologists, are missing so far. We introduce a machine learning scheme that integrates both molecular and image data for species identification. Initially, we systematically assess and compare three different DNA arrangements (aligned, unaligned, SNP-reduced) and two encoding methods (fractional, ordinal). Additionally, artificial neural networks are used to extract visual and molecular features, and we propose strategies for fusing this information. Specifically, we investigate three strategies: I) fusing directly after feature extraction, II) fusing features that passed through a fully connected layer after feature extraction, and III) fusing the output scores of both unimodal models. We systematically and statistically evaluate these strategies for four eukaryotic datasets, including two plant (Asteraceae, Poaceae) and two animal families (Lycaenidae, Coccinellidae) using Leave-One-Out Cross-Validation (LOOCV). In addition, we developed an approach to understand molecular- and image-specific identification failure. Aligned sequences with nucleotides encoded as decimal number vectors achieved the highest identification accuracy among DNA data preprocessing techniques in all four datasets. Fusing molecular and visual features directly after feature extraction yielded the best results for three out of four datasets (52-99%). Overall, combining DNA with image data significantly increased accuracy in three out of four datasets, with plant datasets showing the most substantial improvement (Asteraceae: +19%, Poaceae: +13.6%). Even for Lycaenidae with high identification accuracy based on molecular data (>96%), a statistically significant improvement (+2.1%) was observed. Detailed analysis of confusion rates between and within genera shows that DNA alone tends to identify the genus correctly, but often fails to recognize the species. The failure to resolve species is alleviated by including image data in the training. This increase in resolution hints at a hierarchical role of modalities in which molecular data coarsely groups the specimens to then be guided towards a more fine-grained identification by the connected image. We systematically showed and explained, for the first time, that optimizing the preprocessing and integration of molecular and image data offers significant benefits, particularly for genetically similar and morphologically indistinguishable species, enhancing species identification by reducing modality-specific failure rates and information gaps. Our results can inform integration efforts across different organismal groups, thereby improving the automated identification of eukaryotic species.
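The three fusion strategies named in the abstract above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the use of a separate ReLU dense layer per modality in strategy II, and the averaging of softmax probabilities in strategy III are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, bias):
    """Fully connected layer with ReLU; `weights` holds one row per output unit."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def fuse_features(dna_feat, img_feat):
    """Strategy I: concatenate feature vectors directly after extraction."""
    return dna_feat + img_feat

def fuse_after_dense(dna_feat, img_feat, dna_layer, img_layer):
    """Strategy II: pass each modality through a dense layer, then concatenate.
    Each layer is a hypothetical (weights, bias) pair."""
    return dense(dna_feat, *dna_layer) + dense(img_feat, *img_layer)

def fuse_scores(dna_scores, img_scores):
    """Strategy III: combine the outputs of the two unimodal models.
    Averaging class probabilities is an assumed combination rule."""
    p_dna, p_img = softmax(dna_scores), softmax(img_scores)
    return [(a + b) / 2 for a, b in zip(p_dna, p_img)]

# Toy usage with made-up feature vectors and scores for two classes.
dna_feat, img_feat = [0.2, 0.9, 0.1], [0.7, 0.3]
early = fuse_features(dna_feat, img_feat)
late = fuse_scores([2.0, 0.0], [0.0, 1.0])
```

In strategies I and II the fused vector would feed a shared classifier, whereas strategy III keeps the two models independent until their predictions are combined.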
Citations: 0