An efficient parallel approach for identifying protein families in large-scale metagenomic data sets

2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
Pub Date: 2008-11-15
DOI: 10.1145/1413370.1413406

Changjun Wu, A. Kalyanaraman


Citations: 19

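Abstract

Metagenomics is the study of environmental microbial communities using state-of-the-art genomic tools. Recent advancements in high-throughput technologies have enabled the accumulation of large volumes of metagenomic data, on a scale that until a couple of years ago was deemed impractical to generate. A primary bottleneck, however, is the lack of scalable algorithms and open source software for large-scale data processing. In this paper, we present the design and implementation of a novel parallel approach to identify protein families from large-scale metagenomic data. Given a set of peptide sequences, we reduce the problem to one of detecting arbitrarily sized dense subgraphs in bipartite graphs. Our approach efficiently parallelizes this task on a distributed memory machine through a combination of divide-and-conquer and combinatorial pattern matching heuristic techniques. We present performance and quality results from extensive testing of our implementation on 160 K randomly sampled sequences from the CAMERA environmental sequence database, using 512 nodes of a BlueGene/L supercomputer.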
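The reduction described in the abstract (peptide sequences, then a similarity graph, then dense subgraphs of a bipartite graph) can be illustrated with a small serial sketch. This is not the authors' implementation: the abstract does not specify the heuristics, so exact k-mer sharing stands in for the pattern-matching filter, union-find connected components stand in for the divide step, and Jaccard overlap of bipartite neighborhoods stands in for dense-subgraph detection. All function names and the parameters `k` and `min_jaccard` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity_edges(seqs, k=4):
    """Link two sequences when they share at least one k-mer.

    Assumption: exact k-mer sharing is a stand-in for the paper's
    pattern-matching filter, which the abstract does not detail.
    """
    index = defaultdict(set)
    for i, s in enumerate(seqs):
        for km in kmers(s, k):
            index[km].add(i)
    edges = set()
    for members in index.values():
        edges.update(combinations(sorted(members), 2))
    return edges


def connected_components(n, edges):
    """Divide step (assumed): split the similarity graph with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return list(groups.values())


def dense_groups(component, edges, min_jaccard=0.5):
    """Greedily group vertices whose bipartite neighborhoods overlap.

    Each sequence is treated as appearing on both sides of a bipartite
    graph: the left copy of u points to the right copies of u itself and
    of its neighbors. Vertices with similar right-hand neighbor sets are
    placed in the same candidate family.
    """
    nbrs = {u: {u} for u in component}
    for u, v in edges:
        if u in nbrs and v in nbrs:
            nbrs[u].add(v)
            nbrs[v].add(u)
    families = []
    unassigned = set(component)
    # Visit high-degree vertices first so they act as family seeds.
    for seed in sorted(component, key=lambda u: -len(nbrs[u])):
        if seed not in unassigned:
            continue
        fam = [u for u in sorted(unassigned)
               if len(nbrs[u] & nbrs[seed]) / len(nbrs[u] | nbrs[seed])
               >= min_jaccard]
        unassigned -= set(fam)
        families.append(fam)
    return families


if __name__ == "__main__":
    # Toy input: two groups of overlapping peptides (made-up sequences).
    seqs = ["MKVLAAGIVG", "MKVLAAGLVG", "KVLAAGIVGA",
            "PQRSTWYHHE", "QRSTWYHHEC"]
    edges = similarity_edges(seqs, k=4)
    for comp in connected_components(len(seqs), edges):
        print(dense_groups(comp, edges))
```

On the toy input, the first three sequences form one candidate family and the last two form another. Because each connected component is processed independently, components are a natural unit to distribute across nodes, although this sketch runs serially and makes no claim about how the paper partitions work.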
