Optimization of Hadoop cluster for analyzing large-scale sequence data in bioinformatics

Annales Mathematicae et Informaticae (IF 0.3, Q4, Mathematics). Pub Date: 2019-01-01. DOI: 10.33039/AMI.2019.01.002

Ádám Tóth, Ramin Karimi

Citations: 1

Abstract

The unexpected growth of high-throughput sequencing platforms in recent years has affected virtually all areas of modern biology. However, the ability to produce data continues to outpace the ability to analyze it. Continuous effort is therefore needed to improve bioinformatics applications so that these research opportunities can be put to better use. Because of the complexity and diversity of its data, metagenomics has been one of the most challenging fields of bioinformatics. Sequence-based identification methods, such as those using DNA signatures (unique k-mers), are among the most recent popular methods for real-time analysis of raw sequencing data. DNA signature discovery is compute-intensive and time-consuming. Hadoop, an application of parallel and distributed computing, is one of the popular tools for analyzing large-scale data in bioinformatics. The main goals of this paper are to optimize running time and computational resource usage, such as CPU consumption and memory usage, along with the management of the Hadoop cluster nodes.

