Enhancement of accuracy and efficiency for RNA secondary structure prediction by sequence segmentation and MapReduce

BMC Structural Biology (Q3; Biochemistry, Genetics and Molecular Biology) | Pub Date: 2013-11-08 | DOI: 10.1186/1472-6807-13-S1-S3

Boyu Zhang, Daniel T Yehdego, Kyle L Johnson, Ming-Ying Leung, Michela Taufer


Citations: 21

Abstract

Ribonucleic acid (RNA) molecules play important roles in many biological processes, including gene expression and regulation. Their secondary structures are crucial to RNA function, and secondary structure prediction is widely studied. Our previous research shows that cutting long sequences into shorter chunks, predicting the secondary structure of each chunk independently using thermodynamic methods, and reconstructing the entire secondary structure from the predicted chunk structures can yield better accuracy than predicting the secondary structure from the whole RNA sequence. The chunking, prediction, and reconstruction steps can each use different methods and parameters, some of which produce more accurate predictions than others. In this paper, we study the prediction accuracy and efficiency of three chunking methods combined with seven popular secondary structure prediction programs. We apply them to two datasets of RNA with known secondary structures, covering both pseudoknotted and non-pseudoknotted sequences, and to a family of viral genome RNAs whose structures have not been predicted before. Our modularized MapReduce framework, based on Hadoop, allows us to study the problem in a parallel and robust environment.
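The chunk, predict, and reconstruct steps described above can be sketched as follows. This is a minimal illustration only, under stated assumptions: the fixed-length cutting in `chunk_fixed` is a stand-in and is not one of the three chunking methods studied in the paper, `toy_predict` is a hypothetical placeholder for a real thermodynamic prediction program, and reconstruction is assumed to be plain concatenation of the per-chunk dot-bracket strings.

```python
from typing import Callable, List


def chunk_fixed(seq: str, max_len: int) -> List[str]:
    """Cut seq into consecutive chunks of at most max_len bases.

    Assumption: fixed-length cutting, used here only for illustration.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]


def toy_predict(chunk: str) -> str:
    """Hypothetical placeholder predictor, not a real folding algorithm.

    Pairs complementary bases from both ends inward, forming at most one
    hairpin, and returns the result in dot-bracket notation.
    """
    complementary = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    structure = ["."] * len(chunk)
    i, j = 0, len(chunk) - 1
    # Stop while at least 3 unpaired bases remain between i and j.
    while j - i > 3 and (chunk[i], chunk[j]) in complementary:
        structure[i], structure[j] = "(", ")"
        i += 1
        j -= 1
    return "".join(structure)


def predict_by_chunks(
    seq: str, max_len: int, predict: Callable[[str], str] = toy_predict
) -> str:
    """Predict each chunk independently, then concatenate the structures."""
    return "".join(predict(chunk) for chunk in chunk_fixed(seq, max_len))


sequence = "GGGAAAACCCGGGAAAACCC"
print(predict_by_chunks(sequence, 10))  # chunked: two local hairpins
print(toy_predict(sequence))            # whole sequence: one long-range stem
```

With this toy predictor, the chunked run recovers two local hairpins, while the whole-sequence run pairs only the outermost bases. Any real predictor could be passed in through the `predict` parameter.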

On average, the maximum accuracy retention values are larger than one for our chunking methods and the seven prediction programs over 50 non-pseudoknotted sequences, meaning that the secondary structure predicted with chunking is more similar to the real structure than the one predicted from the whole sequence. We observe similar results for the 23 pseudoknotted sequences, except for the NUPACK program with the centered chunking method. The performance analysis of 14 long RNA sequences from the Nodaviridae virus family shows that the coarse-grained mapping of chunking and predictions in the MapReduce framework gives shorter turnaround times for short RNA sequences. However, as the RNA sequences grow longer, the fine-grained mapping can outperform the coarse-grained mapping.
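To make the "larger than one" reading concrete, the sketch below computes an accuracy retention value as a ratio of two accuracies. The abstract does not define the metric, so two assumptions are made here: accuracy is taken to be the F-measure over base pairs, and accuracy retention is taken to be the chunked accuracy divided by the whole-sequence accuracy. The function names are hypothetical, and dot-bracket input restricts the sketch to non-pseudoknotted structures.

```python
from typing import Set, Tuple


def base_pairs(dot_bracket: str) -> Set[Tuple[int, int]]:
    """Return the set of (i, j) paired positions in a dot-bracket string."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' in structure")
            pairs.add((stack.pop(), pos))
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return pairs


def f_measure(predicted: str, known: str) -> float:
    """Assumed accuracy metric: harmonic mean of precision and recall."""
    pred, ref = base_pairs(predicted), base_pairs(known)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def accuracy_retention(chunked: str, whole: str, known: str) -> float:
    """Assumed definition: accuracy with chunking / accuracy without."""
    whole_accuracy = f_measure(whole, known)
    if whole_accuracy == 0:
        raise ValueError("whole-sequence accuracy is zero; ratio undefined")
    return f_measure(chunked, known) / whole_accuracy


known = "((....))((....))"
chunked = "((....))(......)"  # recovers 3 of the 4 known pairs
whole = "((....))........"    # recovers 2 of the 4 known pairs
print(accuracy_retention(chunked, whole, known))  # > 1: chunking is closer
```

Under these assumptions, a value above one means the chunk-based prediction shares more base pairs with the known structure than the whole-sequence prediction does.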

Using our MapReduce framework together with statistical analysis of the accuracy retention results, we observe that the inversion-based chunking methods can outperform predictions using the whole sequence. Our chunk-based approach also enables us to predict secondary structures for very long RNA sequences, which is not feasible with traditional methods alone.
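The difference between coarse-grained and fine-grained mapping can be illustrated with a small pure-Python simulation. The abstract does not define the two granularities, so this sketch assumes that a coarse-grained map task chunks one sequence and predicts all of its chunks, while a fine-grained map task predicts a single chunk. No Hadoop API is used; `predict_stub`, `coarse_map`, `fine_map`, and `reduce_structures` are hypothetical names, and the stub predictor is not a folding program.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

KeyValue = Tuple[str, Tuple[int, str]]  # (sequence id, (chunk index, structure))


def chunk_fixed(seq: str, max_len: int) -> List[str]:
    """Assumed fixed-length chunking, for illustration only."""
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]


def predict_stub(chunk: str) -> str:
    """Stub predictor: pairs the first and last base of chunks of 5+ bases."""
    n = len(chunk)
    return "(" + "." * (n - 2) + ")" if n >= 5 else "." * n


def coarse_map(seq_id: str, seq: str, max_len: int) -> Iterator[KeyValue]:
    """Coarse-grained (assumed): one map task chunks and predicts a whole sequence."""
    for index, chunk in enumerate(chunk_fixed(seq, max_len)):
        yield seq_id, (index, predict_stub(chunk))


def fine_map(seq_id: str, index: int, chunk: str) -> Iterator[KeyValue]:
    """Fine-grained (assumed): one map task predicts a single chunk."""
    yield seq_id, (index, predict_stub(chunk))


def reduce_structures(pairs: Iterable[KeyValue]) -> Dict[str, str]:
    """Group by sequence id, order chunks by index, and concatenate."""
    grouped = defaultdict(list)
    for seq_id, item in pairs:
        grouped[seq_id].append(item)
    return {
        seq_id: "".join(structure for _, structure in sorted(items))
        for seq_id, items in grouped.items()
    }


sequences = {"s1": "GGGAAAACCCGGGA", "s2": "AUGC"}

# Coarse-grained: one task per sequence (2 tasks here).
coarse_output = [kv for sid, seq in sequences.items() for kv in coarse_map(sid, seq, 10)]

# Fine-grained: one task per chunk (3 tasks here), run in arbitrary order.
fine_tasks = [
    (sid, index, chunk)
    for sid, seq in sequences.items()
    for index, chunk in enumerate(chunk_fixed(seq, 10))
]
fine_output = [kv for task in reversed(fine_tasks) for kv in fine_map(*task)]

print(reduce_structures(coarse_output))
print(reduce_structures(fine_output))
```

Both granularities reduce to the same structures; they differ only in how many independent tasks are available to schedule. That is one plausible reason why finer tasks could pay off as sequences grow longer, although the abstract itself does not state the cause.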



Source journal: BMC Structural Biology BIOPHYSICS-

CiteScore: 3.60

Self-citation rate: 0.00%

Review time: >12 weeks

Journal description: BMC Structural Biology is an open access, peer-reviewed journal that considers articles on investigations into the structure of biological macromolecules, including solving structures, structural and functional analyses, and computational modeling.