GRASP2: Fast and memory-efficient gene-centric assembly and homolog search
Authors: Cuncong Zhong, Youngik Yang, Shibu Yooseph
Venue: IEEE International Conference on Computational Advances in Bio and Medical Sciences
Published: 2017-10-01
DOI: 10.1109/ICCABS.2017.8114296 (https://doi.org/10.1109/ICCABS.2017.8114296)
Citations: 1
Abstract
A crucial task in metagenomic analysis is to annotate the function and taxonomy of the sequencing reads generated from a microbiome sample. In general, the reads can either be assembled into contigs and searched against reference databases, or searched individually without assembly. The first approach may suffer from the fragmentary and incomplete nature of nucleotide sequence assembly, while the second is hampered by the limited functional signal that a short read can carry. To tackle these issues, we previously developed GRASP (Guided Reference-based Assembly of Short Peptides), which accepts a reference protein sequence as input and aims to assemble its homologs from a database of fragmentary protein sequences. Besides being a gene-centric assembly tool, GRASP also serves as a homolog search tool when the assembled protein sequences are used as templates to recruit reads. GRASP has significantly higher sensitivity than other homolog search tools such as BLAST (60–80% vs. 30–40%). However, GRASP is time- and space-consuming compared to these tools and does not scale to large datasets. Subsequently, we developed GRASPx, which is 30-fold faster than GRASP. Here, we present a completely redesigned algorithm, GRASP2, for this computational problem. GRASP2 uses the Burrows-Wheeler Transform (BWT) to assist with assembly graph generation, and shrinks the search space with a fast ungapped alignment strategy that avoids unnecessary traversal of non-homologous paths in the assembly graph. GRASP2 is 8-fold faster than GRASPx (and 250-fold faster than GRASP) and uses 8-fold less memory while maintaining the original high sensitivity of GRASP, which makes GRASP2 a useful tool for metagenomics data analysis. GRASP2 is implemented in C++ and is freely available from http://www.sourceforge.net/projects/grasp2.
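The two building blocks named in the abstract, BWT-based indexing and ungapped alignment, can be illustrated with a small Python sketch. This is not GRASP2's code (which is written in C++) and does not reproduce its algorithm; the function names `bwt_index`, `backward_search`, and `best_ungapped_score`, the toy +1/-1 scoring, and the example peptide are all assumptions made for illustration. A real protein search would typically use a substitution matrix rather than match/mismatch scores.

```python
from collections import Counter


def bwt_index(text):
    """Build a toy BWT index: suffix array, C table, and Occ table."""
    text += "$"  # sentinel, assumed absent from the input sequence
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    counts = Counter(text)
    # c_table[c] = number of characters in text strictly smaller than c
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, c_table, occ


def backward_search(pattern, sa, c_table, occ):
    """Return the sorted start positions of pattern in the indexed text."""
    lo, hi = 0, len(sa)  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


def best_ungapped_score(query, target, match=1, mismatch=-1):
    """Best local ungapped alignment score over all diagonals (toy scoring)."""
    best = 0
    for offset in range(-(len(query) - 1), len(target)):
        running = 0
        for qi, qc in enumerate(query):
            ti = qi + offset
            if 0 <= ti < len(target):
                # Kadane-style running score, reset when it drops below zero
                running = max(0, running + (match if qc == target[ti] else mismatch))
                best = max(best, running)
    return best


if __name__ == "__main__":
    peptide = "MKVLAAGIVGLLLAQ"  # hypothetical sequence, for illustration only
    sa, c_table, occ = bwt_index(peptide)
    print(backward_search("GIVG", sa, c_table, occ))
    print(best_ungapped_score("GIVGLL", peptide))
```

In this sketch the index answers exact-match queries in time proportional to the pattern length, and a cheap ungapped score could be compared against a threshold to discard candidates before any more expensive gapped alignment. The naive suffix sort and dense Occ table are chosen for readability; they would not scale to metagenomic datasets.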