An assessment of a metric space database index to support sequence homology

Third IEEE Symposium on Bioinformatics and Bioengineering, 2003. Proceedings. Pub Date: 2003-03-10. DOI: 10.1109/BIBE.2003.1188976

Rui Mao, Weijia Xu, Neha Singh, Daniel P. Miranker


Citations: 30

Abstract

Hierarchical metric-space clustering methods have been commonly used to organize proteomes into taxonomies. Consequently, it is often anticipated that hierarchical clustering can be leveraged as a basis for scalable database index structures capable of managing the hyper-exponential growth of sequence data. M-tree is one such data structure specialized for the management of large data sets on disk. We explore the application of M-trees to the storage and retrieval of peptide sequence data. Exploiting a technique first suggested by Myers (1994), we organize the database as records of fixed-length substrings. Empirical results are promising. However, metric-space indexes are subject to "the curse of dimensionality", and the ultimate performance of an index is sensitive to the quality of its initial construction. We introduce a new hierarchical bulk-load algorithm that alternates between top-down and bottom-up clustering to initialize the index. Using the Yeast Proteomes, the bi-directional bulk load produces a more effective index than the existing M-tree initialization algorithms.
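The two ideas the abstract combines, storing a sequence as records of fixed-length substrings and answering similarity queries with a metric-space index, can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes Hamming distance as a stand-in metric (the abstract does not name the distance function), and it replaces the M-tree with a single-pivot filter, which uses the same triangle-inequality pruning that a metric tree applies at each node. All function names here are hypothetical.

```python
from typing import List, Tuple

Record = Tuple[int, str]  # (offset in the source sequence, substring)


def fixed_length_substrings(seq: str, q: int) -> List[Record]:
    """Return every overlapping length-q substring of seq with its offset."""
    return [(i, seq[i:i + q]) for i in range(len(seq) - q + 1)]


def hamming(a: str, b: str) -> int:
    """Hamming distance, a metric on equal-length strings (assumed stand-in)."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def range_query(
    records: List[Record],
    pivot: str,
    pivot_dists: List[int],
    query: str,
    radius: int,
) -> Tuple[List[Record], int]:
    """Find all records within `radius` of `query`.

    pivot_dists[i] holds the precomputed distance from records[i] to the pivot.
    By the triangle inequality, |d(query, pivot) - d(record, pivot)| is a lower
    bound on d(query, record), so a record whose bound exceeds the radius is
    skipped without computing its distance to the query.
    Returns the hits and the number of distances actually computed.
    """
    d_qp = hamming(query, pivot)
    hits: List[Record] = []
    computed = 0
    for record, d_sp in zip(records, pivot_dists):
        if abs(d_qp - d_sp) > radius:
            continue  # pruned by the triangle inequality
        computed += 1
        if hamming(query, record[1]) <= radius:
            hits.append(record)
    return hits, computed


# Toy peptide; the sequence and q = 4 are arbitrary illustrative choices.
records = fixed_length_substrings("MKTAYIAKQR", 4)
pivot = records[0][1]
pivot_dists = [hamming(s, pivot) for _, s in records]
hits, computed = range_query(records, pivot, pivot_dists, "MKTA", 0)
```

The filter never discards a true match, because the pruning bound is a lower bound on the real distance. How many distance computations it saves depends on how well the pivot spreads the data, which is one way to read the abstract's point that index quality depends on initial construction.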


