Distributed Memory Partitioning of High-Throughput Sequencing Datasets for Enabling Parallel Genomics Analyses

Nagakishore Jammula, Sriram P. Chockalingam, S. Aluru
DOI: 10.1145/3107411.3107491
Published in: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
Publication date: 2017-08-20
Citations: 2

Abstract

State-of-the-art high-throughput sequencing instruments decipher more than a billion short genomic fragments per run. The output sequences are referred to as 'reads'. Read datasets support a wide variety of analyses in areas such as genomics, metagenomics, and transcriptomics. Because the datasets are so large, these analyses are often compute- and memory-intensive. In this paper, we present a parallel algorithm for partitioning large-scale read datasets to facilitate distributed-memory parallel analyses. While partitioning the reads, we construct and partition the associated de Bruijn graph in parallel. Applications that use a variant of the de Bruijn graph, such as de novo assembly, can therefore work directly on the generated graph partitions. In addition, we propose a mechanism for evaluating the quality of the generated read partitions and demonstrate that our algorithm produces high-quality partitions. Our implementation is available at github.com/ParBLiSS/read_partitioning.
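
The abstract does not spell out the partitioning algorithm, so the following is only a minimal serial sketch of the general idea it names: build a de Bruijn graph from the k-mers of the reads and group reads by the part of the graph they touch. It swaps in plain connected components (via a union-find structure) for the paper's parallel partitioner, ignores reverse complements, and skips reads shorter than k. The names `kmers`, `DisjointSet`, and `partition_reads` are hypothetical and are not taken from the ParBLiSS code.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every length-k substring of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


class DisjointSet:
    """Union-find over k-mers; each set is one connected component."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def partition_reads(reads, k):
    """Group read indices by de Bruijn graph connected component.

    Assumption: consecutive k-mers within a read form an edge, and a
    read belongs to the component of its first k-mer. Reads shorter
    than k contain no k-mer and are left out of the result.
    """
    ds = DisjointSet()
    for read in reads:
        prev = None
        for kmer in kmers(read, k):
            ds.find(kmer)  # register the node even if it has no edge
            if prev is not None:
                ds.union(prev, kmer)
            prev = kmer

    groups = defaultdict(list)
    for idx, read in enumerate(reads):
        if len(read) >= k:
            groups[ds.find(read[:k])].append(idx)
    return sorted(groups.values())


if __name__ == "__main__":
    example = ["ACGTAC", "GTACGG", "TTTTTT", "TTTTTA"]
    print(partition_reads(example, k=3))
```

In this toy input the first two reads share the k-mers `GTA` and `TAC`, and the last two share `TTT`, so they fall into two separate groups. A real dataset typically has one giant component, which is why a dedicated graph partitioner, rather than connected components alone, would be needed in practice.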