Mining sequential patterns from uncertain big DNA in the spark framework

Fan Jiang, C. Leung, O. Sarumi, Christine Y. Zhang
{"title":"Mining sequential patterns from uncertain big DNA in the spark framework","authors":"Fan Jiang, C. Leung, O. Sarumi, Christine Y. Zhang","doi":"10.1109/BIBM.2016.7822641","DOIUrl":null,"url":null,"abstract":"Big data has become ubiquitous as high volumes of wide varieties of valuable data of different veracities (e.g., precise, imprecise or uncertain data) are made available at a high velocity through fast throughput machines and techniques for data gathering and curation in many real life applications in various domains and application areas such as bioinformatics, biomedicine, finance, social networking, and weather forecasting. In bioinformatics, terabytes of deoxyribonucleic acid (DNA) sequences can now be generated within a few hours with the use of next generation sequencing (NGS) technologies such as Illumina HiSeq X and Illumina Genome Analyzer. Due to the nature of these NGS technologies, generated data are usually inherent with some noise or other forms of error. These uncertain data are embedded with a wealth of information in the form of frequent patterns. Mining frequently occurring patterns (e.g., motifs) from these big uncertain DNA sequences is a challenge in bioinformatics and biomedicine. Many existing algorithms are serial and mine DNA sequence motifs using precise data mining methods. Mining of motifs from big DNA sequences is a computationally intensive task because of the high volume and the associated uncertainty of these DNA sequences. In this paper, we propose a scalable algorithm for high performance computing on bioinformatics. Specifically, our parallel algorithm uses a fault-tolerant collection of resilient distributed datasets (RDDs) in Apache Spark computing framework to mine sequence motifs from uncertain big DNA data. Experimental results show that our algorithm extracts accurate motifs within a short time frame.","PeriodicalId":345384,"journal":{"name":"2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/BIBM.2016.7822641","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 20

Abstract

Big data has become ubiquitous as high volumes of wide varieties of valuable data of different veracities (e.g., precise, imprecise or uncertain data) are made available at a high velocity through fast throughput machines and techniques for data gathering and curation in many real life applications in various domains and application areas such as bioinformatics, biomedicine, finance, social networking, and weather forecasting. In bioinformatics, terabytes of deoxyribonucleic acid (DNA) sequences can now be generated within a few hours with the use of next generation sequencing (NGS) technologies such as Illumina HiSeq X and Illumina Genome Analyzer. Due to the nature of these NGS technologies, generated data are usually inherent with some noise or other forms of error. These uncertain data are embedded with a wealth of information in the form of frequent patterns. Mining frequently occurring patterns (e.g., motifs) from these big uncertain DNA sequences is a challenge in bioinformatics and biomedicine. Many existing algorithms are serial and mine DNA sequence motifs using precise data mining methods. Mining of motifs from big DNA sequences is a computationally intensive task because of the high volume and the associated uncertainty of these DNA sequences. In this paper, we propose a scalable algorithm for high performance computing on bioinformatics. Specifically, our parallel algorithm uses a fault-tolerant collection of resilient distributed datasets (RDDs) in Apache Spark computing framework to mine sequence motifs from uncertain big DNA data. Experimental results show that our algorithm extracts accurate motifs within a short time frame.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
在spark框架中从不确定的大DNA中挖掘序列模式
在生物信息学、生物医学、金融、社交网络和天气预报等不同领域和应用领域中,通过快速吞吐量的机器和数据收集和管理技术,大量各种不同真实性的有价值数据(例如精确、不精确或不确定数据)以高速提供,大数据已经变得无处不在。在生物信息学领域,使用下一代测序(NGS)技术,如Illumina HiSeq X和Illumina Genome Analyzer,现在可以在几个小时内生成tb级的脱氧核糖核酸(DNA)序列。由于这些NGS技术的性质,生成的数据通常带有一些噪声或其他形式的误差。这些不确定的数据以频繁模式的形式嵌入了丰富的信息。从这些大的不确定DNA序列中挖掘频繁出现的模式(例如,基序)是生物信息学和生物医学的一个挑战。现有的许多算法都是串行的,使用精确的数据挖掘方法来挖掘DNA序列基序。由于这些DNA序列的高容量和相关的不确定性,从大DNA序列中挖掘基序是一项计算密集型的任务。在本文中,我们提出了一种可扩展的生物信息学高性能计算算法。具体来说,我们的并行算法使用Apache Spark计算框架中的弹性分布式数据集(rdd)的容错集合,从不确定的大DNA数据中挖掘序列基序。实验结果表明,该算法能在较短的时间内提取出准确的图案。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
The role of high performance, grid and cloud computing in high-throughput sequencing A novel algorithm for identifying essential proteins by integrating subcellular localization CNNsite: Prediction of DNA-binding residues in proteins using Convolutional Neural Network with sequence features Inferring Social Influence of anti-Tobacco mass media campaigns Emotion recognition from multi-channel EEG data through Convolutional Recurrent Neural Network
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1