A Randomized Algorithm for Comparing Sets of Phylogenetic Trees

Proceedings of the ... Asia-Pacific bioinformatics conference. Pub Date: 2007-01-01. DOI: 10.1142/9781860947995_0015

Seung-Jin Sul, T. Williams

Pages: 121-130

Citations: 23

Abstract

Phylogenetic analyses often produce a large number of candidate evolutionary trees, each a hypothesis of the "true" tree. Post-processing techniques such as strict consensus trees are widely used to summarize the evolutionary relationships into a single tree. However, valuable information is lost during the summarization process. A more elementary step is to produce estimates of the topological differences that exist among all pairs of trees. We design a new randomized algorithm, called Hash-RF, that computes the all-to-all Robinson-Foulds (RF) distance, the most common distance metric for comparing two phylogenetic trees. Our approach uses a hash table to organize the bipartitions of a tree, and a universal hashing function makes our algorithm randomized. We compare the performance of our Hash-RF algorithm to PAUP*'s implementation of computing the all-to-all RF distance matrix. Our experiments focus on the algorithmic performance of comparing sets of biological trees, where the size of each tree ranged from 500 to 2,000 taxa and the collections ranged from 200 to 1,000 trees. Our experimental results clearly show that our Hash-RF algorithm is up to 500 times faster than PAUP*'s approach. Thus, Hash-RF provides an efficient alternative to a single-tree summary of a collection of trees and potentially gives researchers the ability to explore their data in new and interesting ways.
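The idea of organizing bipartitions in a hash table can be illustrated with a minimal Python sketch. It is not the authors' implementation: the nested-tuple tree encoding, the bitmask representation of a bipartition, the function names `bipartitions` and `hash_rf`, the prime modulus, and the single random linear hash are all assumptions made for illustration. The RF distance here is taken as the number of non-trivial bipartitions found in exactly one of the two trees (some definitions halve this value). Because two different bipartitions can, with small probability, receive the same hash key, the result is randomized in the sense the abstract describes.

```python
import random
from collections import defaultdict


def bipartitions(tree, taxa):
    """Return the non-trivial bipartitions of a nested-tuple tree as bitmasks."""
    index = {name: i for i, name in enumerate(taxa)}
    full = (1 << len(taxa)) - 1
    splits = set()

    def visit(node):
        if isinstance(node, str):          # leaf: a single taxon
            return 1 << index[node]
        mask = 0
        for child in node:                 # internal node: union of child clades
            mask |= visit(child)
        # Normalize: always store the side that excludes the first taxon.
        norm = full ^ mask if mask & 1 else mask
        # Keep only splits with at least two taxa on each side.
        if bin(norm).count("1") >= 2 and bin(full ^ norm).count("1") >= 2:
            splits.add(norm)
        return mask

    visit(tree)
    return splits


def hash_rf(trees, taxa, seed=0, prime=2_147_483_647):
    """All-to-all RF matrix via a hash table keyed by a random linear hash."""
    rng = random.Random(seed)
    # One random coefficient per taxon (hypothetical universal-style hash).
    coeff = [rng.randrange(1, prime) for _ in taxa]
    table = defaultdict(list)              # hash key -> trees containing the split
    sizes = []
    for t, tree in enumerate(trees):
        splits = bipartitions(tree, taxa)
        sizes.append(len(splits))
        for mask in splits:
            key = sum(c for i, c in enumerate(coeff) if mask >> i & 1) % prime
            table[key].append(t)

    n = len(trees)
    shared = [[0] * n for _ in range(n)]
    for members in table.values():         # every pair in a bucket shares a split
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                i, j = members[a], members[b]
                shared[i][j] += 1
                shared[j][i] += 1

    return [[0 if i == j else sizes[i] + sizes[j] - 2 * shared[i][j]
             for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    taxa = ["A", "B", "C", "D", "E"]
    trees = [
        (("A", "B"), ("C", ("D", "E"))),
        (("A", "C"), ("B", ("D", "E"))),
        (("D", "E"), ("C", ("A", "B"))),   # same topology as the first tree
    ]
    for row in hash_rf(trees, taxa):
        print(row)
```

In this sketch each bipartition is inserted into the table once per tree, and shared bipartitions are counted per bucket rather than by comparing every pair of trees split by split, which is the intuition behind avoiding repeated pairwise comparisons.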


