Secondary Structure Predictions for Long RNA Sequences Based on Inversion Excursions and MapReduce.

IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum : [proceedings], vol. 2013, pp. 520-529. Pub Date: 2013-05-01. DOI: 10.1109/IPDPSW.2013.109

Daniel T Yehdego, Boyu Zhang, Vikram K R Kodimala, Kyle L Johnson, Michela Taufer, Ming-Ying Leung


Citations: 8

Abstract

Secondary structures of ribonucleic acid (RNA) molecules play important roles in many biological processes, including gene expression and regulation. Experimental observations and computing limitations suggest that we can approach the secondary structure prediction problem for long RNA sequences by segmenting them into shorter chunks, predicting the secondary structure of each chunk individually using existing prediction programs, and then assembling the results to give the structure of the original sequence. The selection of cutting points is a crucial component of the segmenting step. Noting that stem-loops and pseudoknots always contain an inversion, i.e., a stretch of nucleotides followed closely by its inverse complementary sequence, we developed two cutting methods for segmenting long RNA sequences based on inversion excursions: the centered method and the optimized method. The steps of searching for inversions, chunking, and prediction can each be performed in parallel. In this paper we use a MapReduce framework, i.e., Hadoop, to extensively explore meaningful inversion stem lengths and gap sizes for the segmentation and to identify correlations between chunking methods and prediction accuracy. We show that for a set of long RNA sequences in the RFAM database whose secondary structures are known to contain pseudoknots, our approach predicts secondary structures more accurately than methods that do not segment the sequence, when the latter predictions are computationally possible. We also show that, as sequences exceed certain lengths, some programs cannot computationally predict pseudoknots while our chunking methods can. Overall, our predicted structures still retain the accuracy level of the original prediction programs when compared with known experimental secondary structures.
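The abstract defines an inversion as a stretch of nucleotides followed closely by its inverse complementary sequence, parameterized by a stem length and a gap size. A minimal Python sketch of that definition might look as follows. The function names (`reverse_complement`, `find_inversions`, `split_at`) and the exact-match, Watson-Crick-only pairing rule are assumptions made for illustration. This is not the authors' implementation, and it does not reproduce the inversion-excursion scoring or the centered and optimized cutting methods, which the abstract does not specify.

```python
# Watson-Crick complements for RNA. Assumption: G-U wobble pairs are ignored
# and the input contains only the letters A, C, G and U.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(segment):
    """Return the inverse complementary sequence of an RNA segment."""
    return "".join(COMPLEMENT[base] for base in reversed(segment))


def find_inversions(sequence, stem_length, max_gap):
    """Yield (start, gap) pairs for exact inversions in `sequence`.

    An inversion is reported when sequence[start:start + stem_length] is
    followed, after `gap` bases (0 <= gap <= max_gap), by its reverse
    complement.
    """
    n = len(sequence)
    for start in range(n - 2 * stem_length + 1):
        target = reverse_complement(sequence[start:start + stem_length])
        for gap in range(max_gap + 1):
            right = start + stem_length + gap
            if right + stem_length > n:
                break  # the right arm would run past the end of the sequence
            if sequence[right:right + stem_length] == target:
                yield start, gap


def split_at(sequence, cut_points):
    """Split `sequence` into chunks at the given cut positions.

    How the cut points are chosen is left open here; this helper only
    performs the segmentation itself.
    """
    bounds = [0, *sorted(cut_points), len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    rna = "GGGAAACCC"
    print(list(find_inversions(rna, stem_length=3, max_gap=3)))
    print(split_at(rna, [3, 6]))
```

Because each starting position is examined independently, a search of this kind could be distributed across map tasks, one per (stem length, gap size) setting or per sequence window, which is in the spirit of the parallel exploration the abstract describes.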


