SparkSW: Scalable Distributed Computing System for Large-Scale Biological Sequence Alignment

Guoguang Zhao, Cheng Ling, Donghong Sun
{"title":"SparkSW: Scalable Distributed Computing System for Large-Scale Biological Sequence Alignment","authors":"Guoguang Zhao, Cheng Ling, Donghong Sun","doi":"10.1109/CCGrid.2015.55","DOIUrl":null,"url":null,"abstract":"The Smith-Waterman (SW) algorithm is universally used for a database search owing to its high sensitively. The widespread impact of the algorithm is reflected in over 8000 citations that the algorithm has received in the past decades. However, the algorithm is prohibitively high in terms of time and space complexity, and so poses significant computational challenges. Apache Spark is an increasingly popular fast big data analytics engine, which has been highly successful in implementing large-scale data-intensive applications on commercial hardware. This paper presents the first ever reported system that implements the SW algorithm on Apache Spark based distributed computing framework, with a couple of off-the-shelf workstations, which is named as SparkSW. The scalability and load-balancing efficiency of the system are investigated by realistic ultra-large database from the state-of-the-art UniRef100. The experimental results indicate that 1) SparkSW is load-balancing for parallel adaptive on workloads and scales extremely well with the increases of computing resource, 2) SparkSW provides a fast and universal option high sensitively biological sequence alignments. The success of SparkSW also reveals that Apache Spark framework provides an efficient solution to facilitate coping with ever increasing sizes of biological sequence databases, especially generated by second-generation sequencing technologies.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"22 1","pages":"845-852"},"PeriodicalIF":0.0000,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"36","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGrid.2015.55","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 36

Abstract

The Smith-Waterman (SW) algorithm is universally used for a database search owing to its high sensitively. The widespread impact of the algorithm is reflected in over 8000 citations that the algorithm has received in the past decades. However, the algorithm is prohibitively high in terms of time and space complexity, and so poses significant computational challenges. Apache Spark is an increasingly popular fast big data analytics engine, which has been highly successful in implementing large-scale data-intensive applications on commercial hardware. This paper presents the first ever reported system that implements the SW algorithm on Apache Spark based distributed computing framework, with a couple of off-the-shelf workstations, which is named as SparkSW. The scalability and load-balancing efficiency of the system are investigated by realistic ultra-large database from the state-of-the-art UniRef100. The experimental results indicate that 1) SparkSW is load-balancing for parallel adaptive on workloads and scales extremely well with the increases of computing resource, 2) SparkSW provides a fast and universal option high sensitively biological sequence alignments. The success of SparkSW also reveals that Apache Spark framework provides an efficient solution to facilitate coping with ever increasing sizes of biological sequence databases, especially generated by second-generation sequencing technologies.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
SparkSW:大规模生物序列比对可扩展分布式计算系统
Smith-Waterman (SW)算法由于其高灵敏度而被广泛用于数据库搜索。该算法的广泛影响反映在该算法在过去几十年中收到的8000多次引用中。然而,该算法在时间和空间复杂度方面过高,因此提出了重大的计算挑战。Apache Spark是一个日益流行的快速大数据分析引擎,它在商业硬件上实现大规模数据密集型应用方面非常成功。本文提出了第一个基于Apache Spark的分布式计算框架实现软件算法的系统,该系统使用了几个现成的工作站,命名为SparkSW。采用最先进的UniRef100超大数据库对系统的可扩展性和负载均衡效率进行了研究。实验结果表明:1)SparkSW具有负载均衡的并行自适应能力,并且随着计算资源的增加具有良好的可扩展性;2)SparkSW提供了一种快速、通用的高灵敏度生物序列比对选择。SparkSW的成功也表明,Apache Spark框架提供了一个有效的解决方案,以方便应对不断增长的生物序列数据库,特别是由第二代测序技术生成的数据库。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Self Protecting Data Sharing Using Generic Policies Partition-Aware Routing to Improve Network Isolation in Infiniband Based Multi-tenant Clusters MIC-Tandem: Parallel X!Tandem Using MIC on Tandem Mass Spectrometry Based Proteomics Data Study of the KVM CPU Performance of Open-Source Cloud Management Platforms Visualizing City Events on Search Engine: Tword the Search Infrustration for Smart City
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1