Optimization of data-intensive next generation sequencing in high performance computing

2015 IEEE 15th International Conference on Bioinformatics and Bioengineering (BIBE) Pub Date : 2015-11-02 DOI:10.1109/BIBE.2015.7367654

N. Kathiresan, Rashid J. Al-Ali, P. Jithesh, Tariq AbuZaid, Ramzi Temanni, A. Ptitsyn


Citations: 4

Abstract

Advances in Next Generation Sequencing (NGS) technology are associated with an ever-increasing volume of genomic data every year. These genomic data are processed efficiently through empirical parallelism on High Performance Computing (HPC) systems. The processed data can be used for genome-wide association studies, genetics, personalized medicine and many other areas. Different kinds of algorithms and implementations are used in the different phases of genome processing. In this paper, we use BWAKIT- and GATK-based software to process large volumes of genomic data; we refer to this pipeline as the "NGS workflow at SIDRA". The workflow uses BWAKIT for genome alignment and GATK for variant discovery, which are compute-intensive and memory-intensive, respectively. We observed that CPU utilization does not exceed 45% during variant discovery; it is therefore necessary to understand the optimal selection of resources (in terms of the number of threads or cores) when automating the NGS workflow. We analyzed the performance bottlenecks and application optimization in terms of "scalability" (using the maximum available CPUs and memory) and "multiple instances of the NGS workflow with different genome data within a node" (processing a larger volume of genome data concurrently with a limited set of CPUs and memory). We observed performance improvements of 40%, 65%, 71% and 76% when processing 2, 4, 8 and 16 samples concurrently, respectively, using our own scheduling heuristics. As a result, our proposed NGS workflow automation improves performance by up to 76% compared with workflows based on application scalability.
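The "multiple instances within a node" idea can be illustrated with a minimal sketch. It is not the authors' scheduling heuristic, which the abstract does not describe; it assumes a plain even split of one node's cores and memory across the samples processed concurrently. The node size (16 cores, 128 GB), the sample names, and the `plan_instances` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstancePlan:
    """Resources given to one workflow instance (one sample) on a node."""
    sample: str
    threads: int
    mem_gb: int


def plan_instances(samples, node_cores, node_mem_gb):
    """Split a node's cores and memory evenly across concurrent instances.

    Assumption: an even split. Leftover cores go to the first instances,
    so every core on the node is assigned to exactly one instance.
    """
    if not samples:
        return []
    n = len(samples)
    if n > node_cores:
        raise ValueError("more concurrent samples than cores on the node")
    base, extra = divmod(node_cores, n)
    mem_each = node_mem_gb // n
    return [
        InstancePlan(sample, base + (1 if i < extra else 0), mem_each)
        for i, sample in enumerate(samples)
    ]


# Hypothetical node: 16 cores, 128 GB of memory, four samples at once.
plans = plan_instances(
    ["sample_A", "sample_B", "sample_C", "sample_D"],
    node_cores=16,
    node_mem_gb=128,
)
for plan in plans:
    print(plan)
```

Each plan would then set the thread count and memory limit passed to one workflow instance. Whether fewer threads per instance actually raises overall throughput depends on the measured utilization of each stage, which is the question the paper investigates.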


