Optimization of data-intensive next generation sequencing in high performance computing

N. Kathiresan, Rashid J. Al-Ali, P. Jithesh, Tariq AbuZaid, Ramzi Temanni, A. Ptitsyn
Published in: 2015 IEEE 15th International Conference on Bioinformatics and Bioengineering (BIBE)
Publication date: 2015-11-02
DOI: 10.1109/BIBE.2015.7367654 (https://doi.org/10.1109/BIBE.2015.7367654)
Citations: 4

Abstract

Advances in Next Generation Sequencing (NGS) technology are associated with an ever-increasing volume of genomic data every year. These genomic data are processed efficiently through empirical parallelism on High Performance Computing (HPC) systems. The processed data can be used for genome-wide association studies, genetics, personalized medicine and many other areas. Different kinds of algorithms and implementations are used in the different phases of genome processing. In this paper, we used BWAKIT- and GATK-based software to process large volumes of genomic data; we refer to this pipeline as the "NGS workflow at SIDRA". In this workflow, BWAKIT performs genome alignment and GATK performs variant discovery, which are compute-intensive and memory-intensive, respectively. We observed that CPU utilization does not exceed 45% during variant discovery; hence, it is necessary to understand the optimal selection of resources (in terms of the number of threads or cores) when automating the NGS workflow. We analyzed the performance bottlenecks and application optimization in terms of "scalability" (using the maximum available CPUs and memory) and "multiple instances of the NGS workflow with different genome data within a node" (processing a larger volume of genome data concurrently with a limited set of CPUs and memory). We observed performance improvements of 40%, 65%, 71% and 76% when processing 2, 4, 8 and 16 samples concurrently, respectively, using our own scheduling heuristics. As a result, our proposed NGS workflow automation improves performance by up to 76% compared to workflows based on application scalability.
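The "multiple instances within a node" strategy can be illustrated with a minimal Python sketch. It is not the authors' scheduling heuristic, which the abstract does not describe. It assumes a simple even split of a node's cores across concurrent samples, and the names `threads_per_instance`, `run_concurrently` and `run_workflow` are hypothetical. In practice `run_workflow` would launch the alignment and variant-discovery steps for one sample with the given thread count; here it is left as a caller-supplied function.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def threads_per_instance(total_cores: int, n_instances: int) -> int:
    """Split a node's cores evenly across concurrent workflow instances.

    Assumed policy (not from the paper): integer division, with at least
    one thread per instance even when instances outnumber cores.
    """
    if total_cores < 1 or n_instances < 1:
        raise ValueError("total_cores and n_instances must be positive")
    return max(1, total_cores // n_instances)


def run_concurrently(
    samples: List[str],
    total_cores: int,
    run_workflow: Callable[[str, int], T],
) -> Dict[str, T]:
    """Run one workflow instance per sample at the same time on one node.

    run_workflow(sample, threads) is a hypothetical callable that processes
    a single sample using the given number of threads.
    """
    if not samples:
        return {}
    threads = threads_per_instance(total_cores, len(samples))
    # One worker per sample so that all instances are in flight together.
    with ThreadPoolExecutor(max_workers=len(samples)) as pool:
        futures = {s: pool.submit(run_workflow, s, threads) for s in samples}
        return {s: f.result() for s, f in futures.items()}


if __name__ == "__main__":
    # Stand-in workflow that only reports the thread budget it was given.
    def fake_workflow(sample: str, threads: int) -> str:
        return f"{sample}: {threads} threads"

    for line in run_concurrently(["s1", "s2", "s3", "s4"], 16, fake_workflow).values():
        print(line)
```

The contrast with the "scalability" strategy is that the latter would give a single sample all available cores, which leaves capacity idle when a stage such as variant discovery cannot use them fully.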