Authors: Thijs Janzen, Rampal S. Etienne
Journal: Molecular Phylogenetics and Evolution, volume 200, Article 108168
Publication date: 2024-08-06
DOI: 10.1016/j.ympev.2024.108168
Citations: 0
Phylogenetic tree statistics: A systematic overview using the new R package ‘treestats’
Phylogenetic trees are believed to contain a wealth of information on diversification processes. However, comparing phylogenetic trees is not straightforward due to their high dimensionality. Researchers have therefore defined a wide range of low-dimensional summary statistics. It remains unexplored to what extent these summary statistics capture the same underlying information, and which of them best explain the observed variation across phylogenies. Furthermore, a large subset of the available summary statistics focuses on measuring the topological features of a phylogenetic tree, but these statistics are often explored only at the extreme cases of the fully balanced or fully imbalanced tree, and not for trees of intermediate balance.
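To make the two extreme cases concrete, the following is a minimal Python sketch of two widely used topological statistics, the Colless and Sackin indices, evaluated on a fully imbalanced (caterpillar) tree and a fully balanced tree. The nested-tuple tree representation and all function names are assumptions made for this illustration; they are not the 'treestats' API, and the code is not speed-optimized.

```python
# Illustrative sketch only: trees are nested tuples, where a tip is the
# string "tip" and an internal node is a pair (left, right).

def caterpillar(n):
    """Fully imbalanced (ladder) tree with n tips."""
    tree = "tip"
    for _ in range(n - 1):
        tree = (tree, "tip")
    return tree

def balanced(depth):
    """Fully balanced tree with 2**depth tips."""
    if depth == 0:
        return "tip"
    sub = balanced(depth - 1)
    return (sub, sub)

def n_tips(tree):
    """Number of tips below (and including) this node."""
    if isinstance(tree, str):
        return 1
    return n_tips(tree[0]) + n_tips(tree[1])

def colless(tree):
    """Colless index: sum over internal nodes of |tips left - tips right|."""
    if isinstance(tree, str):
        return 0
    left, right = tree
    return abs(n_tips(left) - n_tips(right)) + colless(left) + colless(right)

def sackin(tree, depth=0):
    """Sackin index: sum over tips of the number of edges from root to tip."""
    if isinstance(tree, str):
        return depth
    return sackin(tree[0], depth + 1) + sackin(tree[1], depth + 1)

# Both trees have 8 tips; only the topology differs.
print(colless(caterpillar(8)), sackin(caterpillar(8)))  # 21 35
print(colless(balanced(3)), sackin(balanced(3)))        # 0 24
```

For 8 tips, both statistics separate the two extremes clearly, but nothing in this comparison says how they behave for the many topologies that lie in between.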
Here, we introduce a new R package called ‘treestats’ that provides speed-optimized code to compute 70 summary statistics. We study correlations between summary statistics on empirical trees and on trees simulated using several diversification models. Furthermore, we introduce an algorithm to create intermediately balanced trees in a well-defined manner, in order to explore variation in summary statistics across a balance gradient.
We find that almost all summary statistics are correlated with tree size, and that it is difficult, if not impossible, to correct for tree size unless the tree-generating model is known. Furthermore, we find that across empirical and simulated trees, at least three large clusters of correlated summary statistics can be found, where statistics group together based on the information used (topology or branching times). However, the finer-grained correlation structure appears to depend strongly on either the taxonomic group studied (in empirical studies) or the tree-generating model (in simulation studies).
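The dependence on tree size can be illustrated with a small simulation. The sketch below is an assumption-laden example, not the analysis from the paper: it draws random topologies under the Yule (pure-birth) model, computes the Colless index, and checks its Pearson correlation with the number of tips before and after dividing by the maximum Colless value for that size, (n-1)(n-2)/2. The choice of model, the size range, the normalization, and all function names are hypothetical.

```python
import random

def yule_colless(n, rng):
    """Colless index of a random n-tip topology drawn under the Yule model.

    Under the Yule model the root splits n tips into (k, n - k) with k
    uniform on 1..n-1, and the same rule applies within each subtree.
    """
    if n == 1:
        return 0
    k = rng.randint(1, n - 1)
    return abs(n - 2 * k) + yule_colless(k, rng) + yule_colless(n - k, rng)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(42)                      # fixed seed for reproducibility
sizes = [rng.randint(10, 200) for _ in range(300)]
raw = [yule_colless(n, rng) for n in sizes]

# Naive size correction: divide by the caterpillar-tree maximum for n tips.
normalized = [c / ((n - 1) * (n - 2) / 2) for n, c in zip(sizes, raw)]

r_raw = pearson(sizes, raw)
r_norm = pearson(sizes, normalized)
print(f"raw Colless vs size:        r = {r_raw:.2f}")
print(f"normalized Colless vs size: r = {r_norm:.2f}")
```

In this setup the raw index increases with size, while the normalized index decreases with size: dividing by the maximum overcorrects, because the expected Colless value under the Yule model grows more slowly than the caterpillar maximum. A correction that removes the trend under one model therefore need not remove it under another.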
Amongst statistics describing the (im)balance of a tree, we find that almost all statistics vary non-linearly, and sometimes even non-monotonically, with our generated balance gradient. This indicates that balance is perhaps a more complex property of a tree than previously thought. Furthermore, using our new imbalancing algorithm, we devise a numerical test to identify balance statistics, and identify several statistics as balance statistics that were not previously considered as such. Lastly, our results lead to several recommendations on which statistics to select when analyzing and comparing phylogenetic trees.
About the journal:
Molecular Phylogenetics and Evolution is dedicated to bringing Darwin's dream within grasp - to "have fairly true genealogical trees of each great kingdom of Nature." The journal provides a forum for molecular studies that advance our understanding of phylogeny and evolution, further the development of phylogenetically more accurate taxonomic classifications, and ultimately bring a unified classification for all the ramifying lines of life. Phylogeographic studies will be considered for publication if they offer EXCEPTIONAL theoretical or empirical advances.