Phylogenetic tree statistics: A systematic overview using the new R package ‘treestats’

Molecular Phylogenetics and Evolution. Pub Date: 2024-08-06. DOI: 10.1016/j.ympev.2024.108168. IF 3.6; Zone 1, Biology; JCR Q2, Biochemistry & Molecular Biology
Thijs Janzen, Rampal S. Etienne
Molecular Phylogenetics and Evolution, Volume 200, Article 108168.

Abstract

Phylogenetic trees are believed to contain a wealth of information on diversification processes. However, comparing phylogenetic trees is not straightforward due to their high dimensionality. Researchers have therefore defined a wide range of low-dimensional summary statistics. It remains unexplored to what extent these summary statistics cover the same underlying information, and which summary statistics best explain observed variation across phylogenies. Furthermore, a large subset of available summary statistics focuses on measuring the topological features of a phylogenetic tree, but these statistics are often explored only at the extreme edge cases of the fully balanced or fully imbalanced tree and not for trees of intermediate balance.

Here, we introduce a new R package called ‘treestats’ that provides speed-optimized code to compute 70 summary statistics. We study correlations between summary statistics on empirical trees and on trees simulated using several diversification models. Furthermore, we introduce an algorithm to create intermediately balanced trees in a well-defined manner, in order to explore variation in summary statistics across a balance gradient.
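
To make the notion of a topological summary statistic concrete, here is a minimal Python sketch of two classic balance statistics, the Colless and Sackin indices, evaluated at the two extreme tree shapes mentioned above. It is an illustration only: the nested-tuple tree encoding and the function names are assumptions made for this example and are not the ‘treestats’ API, which is an R package.

```python
# Illustrative sketch: a tree is a nested 2-tuple, a tip is any non-tuple value.
# These helper names are hypothetical and are not part of 'treestats'.

def n_tips(tree):
    """Number of tips in the subtree rooted at this node."""
    if not isinstance(tree, tuple):
        return 1
    left, right = tree
    return n_tips(left) + n_tips(right)

def colless(tree):
    """Colless index: sum over internal nodes of |tips(left) - tips(right)|."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    return abs(n_tips(left) - n_tips(right)) + colless(left) + colless(right)

def sackin(tree, depth=0):
    """Sackin index: sum over tips of the number of edges between tip and root."""
    if not isinstance(tree, tuple):
        return depth
    left, right = tree
    return sackin(left, depth + 1) + sackin(right, depth + 1)

# The two extreme shapes for four tips.
balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")

print(colless(balanced), sackin(balanced))        # 0 8
print(colless(caterpillar), sackin(caterpillar))  # 3 9
```

Both indices depend only on the branching pattern, not on branch lengths, which is what distinguishes topological statistics from statistics based on branching times.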

We find that almost all summary statistics are correlated with tree size, and that it is difficult, if not impossible, to correct for tree size unless the tree-generating model is known. Furthermore, we find that across empirical and simulated trees, at least three large clusters of correlated summary statistics can be found, where statistics group together based on the information used (topology or branching times). However, the finer-grained correlation structure appears to depend strongly on either the taxonomic group studied (in empirical studies) or the tree-generating model (in simulation studies).
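
The dependence on tree size can be seen in a small simulation. The Python sketch below is not the paper's analysis: it uses a pure-birth (Yule) topology as a stand-in for the diversification models, the Colless index as a single example statistic, and division by the maximum possible value, (n - 1)(n - 2) / 2, as one commonly used normalization. All function names, the seed, and the replicate count are assumptions for illustration.

```python
import random

# Hypothetical helpers for illustration; none of these names come from 'treestats'.

def yule_topology(n, rng):
    """Grow a tree shape by splitting a uniformly chosen tip until there are n tips.

    A tip is an empty list; an internal node is a list [left, right].
    """
    root = []
    tips = [root]
    while len(tips) < n:
        node = tips.pop(rng.randrange(len(tips)))
        left, right = [], []
        node.extend([left, right])   # the chosen tip becomes an internal node
        tips.extend([left, right])
    return root

def colless_and_size(node):
    """Return (Colless index, number of tips) in a single post-order pass."""
    if not node:  # an empty list is a tip
        return 0, 1
    c_left, n_left = colless_and_size(node[0])
    c_right, n_right = colless_and_size(node[1])
    return c_left + c_right + abs(n_left - n_right), n_left + n_right

def mean_colless(n, replicates=200, seed=1):
    """Average Colless index over simulated tree shapes with n tips."""
    rng = random.Random(seed)
    total = 0
    for _ in range(replicates):
        total += colless_and_size(yule_topology(n, rng))[0]
    return total / replicates

for n in (8, 32, 128):
    raw = mean_colless(n)
    max_colless = (n - 1) * (n - 2) / 2  # value attained by the caterpillar tree
    print(n, round(raw, 1), round(raw / max_colless, 3))
```

In this toy setting the raw index rises with the number of tips while the max-normalized value falls, so dividing by the maximum changes the size dependence rather than removing it.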

Amongst statistics describing the (im)balance of a tree, we find that almost all statistics vary non-linearly, and sometimes even non-monotonically, with our generated balance gradient. This indicates that balance is perhaps a more complex property of a tree than previously thought. Furthermore, using our new imbalancing algorithm, we devise a numerical test to identify balance statistics, and identify several statistics as balance statistics that were not previously considered as such. Lastly, our results lead to several recommendations on which statistics to select when analyzing and comparing phylogenetic trees.
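
The idea of tracking statistics along a balance gradient can be sketched as follows. This is explicitly not the imbalancing algorithm from the paper, which the abstract does not describe. As a hypothetical stand-in, the Python code builds trees in which every internal node sends roughly a fraction p of its tips to the left, so that p = 0 gives the caterpillar tree and p = 0.5 gives a maximally balanced split at every node. The function names and the choice of statistics (Colless index and number of cherries) are assumptions for illustration.

```python
# Stand-in gradient for illustration; NOT the algorithm introduced in the paper.

def split_tree(n, p):
    """Tree shape with n tips where each node sends about a fraction p of its tips left."""
    if n == 1:
        return "tip"
    k = min(n - 1, max(1, round(p * n)))  # keep at least one tip on each side
    return (split_tree(k, p), split_tree(n - k, p))

def shape_stats(tree):
    """Return (tips, Colless index, number of cherries) in one post-order pass."""
    if not isinstance(tree, tuple):
        return 1, 0, 0
    n_left, c_left, ch_left = shape_stats(tree[0])
    n_right, c_right, ch_right = shape_stats(tree[1])
    cherry = 1 if n_left == 1 and n_right == 1 else 0  # node joining two tips
    return (n_left + n_right,
            c_left + c_right + abs(n_left - n_right),
            ch_left + ch_right + cherry)

# Walk from fully imbalanced (p = 0.0) to fully balanced (p = 0.5).
for p in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    tips, colless_index, cherries = shape_stats(split_tree(32, p))
    print(p, tips, colless_index, cherries)
```

Printing several statistics side by side along such a gradient is one simple way to check whether each of them changes steadily or irregularly between the two extremes.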
