Secondary Structure Predictions for Long RNA Sequences Based on Inversion Excursions and MapReduce.

Daniel T Yehdego, Boyu Zhang, Vikram K R Kodimala, Kyle L Johnson, Michela Taufer, Ming-Ying Leung
DOI: 10.1109/IPDPSW.2013.109 (https://doi.org/10.1109/IPDPSW.2013.109)
Published in: IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum: [proceedings], vol. 2013, pp. 520-529
Publication date: 2013-05-01
Citations: 8

Abstract

Secondary structures of ribonucleic acid (RNA) molecules play important roles in many biological processes, including gene expression and regulation. Experimental observations and computing limitations suggest that we can approach the secondary structure prediction problem for long RNA sequences by segmenting them into shorter chunks, predicting the secondary structure of each chunk individually using existing prediction programs, and then assembling the results to give the structure of the original sequence. The selection of cutting points is a crucial component of the segmenting step. Noting that stem-loops and pseudoknots always contain an inversion, i.e., a stretch of nucleotides followed closely by its inverse complementary sequence, we developed two cutting methods for segmenting long RNA sequences based on inversion excursions: the centered method and the optimized method. The inversion search, chunking, and prediction steps can each be performed in parallel. In this paper we use the Hadoop MapReduce framework to extensively explore meaningful inversion stem lengths and gap sizes for the segmentation and to identify correlations between chunking methods and prediction accuracy. We show that for a set of long RNA sequences in the RFAM database whose secondary structures are known to contain pseudoknots, our approach predicts secondary structures more accurately than methods that do not segment the sequence, when the latter predictions are computationally possible. We also show that, as sequences exceed certain lengths, some programs cannot computationally predict pseudoknots while our chunking methods can. Overall, our predicted structures still retain the accuracy level of the original prediction programs when compared with known experimental secondary structures.
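
To make the notion of an inversion concrete, the following is a minimal Python sketch of a brute-force search for a stretch of nucleotides followed, within a bounded gap, by its inverse complementary sequence. It is not the authors' implementation: the names `find_inversions`, `stem_len`, and `max_gap` are hypothetical, only Watson-Crick pairs (A-U, G-C) are considered, and the sketch does not compute inversion excursions or choose cutting points.

```python
# Illustrative sketch only; names and parameters are assumptions, not from the paper.
# Watson-Crick pairs only; G-U wobble pairs are ignored here.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the inverse complementary sequence of an RNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_inversions(seq: str, stem_len: int, max_gap: int) -> list[tuple[int, int]]:
    """Find inversions with an exact stem of length stem_len.

    Returns (i, j) pairs such that seq[j:j + stem_len] is the reverse
    complement of seq[i:i + stem_len] and the gap between the two
    stretches, j - (i + stem_len), is at most max_gap.
    """
    hits = []
    n = len(seq)
    for i in range(n - 2 * stem_len + 1):
        target = reverse_complement(seq[i:i + stem_len])
        for gap in range(max_gap + 1):
            j = i + stem_len + gap
            if j + stem_len > n:
                break  # second stretch would run past the end of the sequence
            if seq[j:j + stem_len] == target:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # A toy hairpin: GGG, a short loop, then CCC.
    for i, j in find_inversions("GGGAAAACCC", stem_len=3, max_gap=4):
        print(f"stem at {i} pairs with inverse complement at {j}")
```

Because each call depends only on the sequence and one (`stem_len`, `max_gap`) combination, a sweep over stem lengths and gap sizes consists of independent tasks, which is the kind of workload that maps naturally onto a MapReduce framework.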
