Recount: expectation maximization based error correction tool for next generation sequencing data.

Genome informatics. International Conference on Genome Informatics Pub Date : 2009-10-01

Edward Wijaya, Martin C Frith, Yutaka Suzuki, Paul Horton

Volume 23, issue 1, pages 189-201. Publication type: Journal Article.


Abstract

Next generation sequencing technologies enable rapid, large-scale production of sequence data sets. Unfortunately, these technologies also have a non-negligible sequencing error rate, which biases their outputs by introducing false reads and reducing the quantity of the real reads. Although methods developed for SAGE data can reduce these false counts to a considerable degree, until now they have not been implemented in a scalable way. Recently, a program named FREC has been developed to address this problem for next generation sequencing data. In this paper, we introduce RECOUNT, our implementation of an Expectation Maximization algorithm for tag count correction, and compare it to FREC. Using both the reference genome and simulated data, we find that RECOUNT performs as well as or better than FREC, while using much less memory (e.g. 5GB vs. 75GB). Furthermore, we report the first analysis of tag count correction with real data in the context of gene expression analysis. Our results show that tag count correction not only increases the number of mappable tags, but can make a real difference in the biological interpretation of next generation sequencing data. RECOUNT is an open-source C++ program available at http://seq.cbrc.jp/recount.
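
To illustrate the general idea of Expectation Maximization for tag count correction, the following is a minimal Python sketch. It is not RECOUNT's implementation. The error model (independent substitutions at a uniform per-base rate `eps`, with each wrong base equally likely), the restriction of candidate true tags to observed tags within Hamming distance `max_dist`, and the names `error_prob` and `em_correct` are all assumptions made for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length tags."""
    return sum(x != y for x, y in zip(a, b))


def error_prob(true_tag, observed_tag, eps):
    """P(observed | true) under an assumed uniform substitution model."""
    d = hamming(true_tag, observed_tag)
    length = len(true_tag)
    return (1.0 - eps) ** (length - d) * (eps / 3.0) ** d


def em_correct(observed, eps=0.01, max_dist=1, iters=50):
    """Estimate corrected tag counts from observed counts (illustrative only).

    observed: dict mapping tag string -> observed read count.
    Returns a dict mapping tag string -> expected true count.
    """
    tags = list(observed)
    # Candidate true tags for each observed tag: itself plus close neighbours.
    candidates = {
        j: [i for i in tags if hamming(i, j) <= max_dist] for j in tags
    }
    expected = {t: float(c) for t, c in observed.items()}
    for _ in range(iters):
        updated = dict.fromkeys(tags, 0.0)
        for j, n_j in observed.items():
            # E-step: posterior weight that reads observed as j came from i.
            weights = {
                i: expected[i] * error_prob(i, j, eps) for i in candidates[j]
            }
            z = sum(weights.values())
            # M-step: reassign the n_j reads to their likely true tags.
            for i, w in weights.items():
                updated[i] += n_j * w / z
        expected = updated
    return expected


# Hypothetical counts: "ACGA" is a rare one-mismatch neighbour of "ACGT".
counts = {"ACGT": 1000, "ACGA": 1, "TTTT": 50}
corrected = em_correct(counts)
```

In this toy input, the single `ACGA` read is mostly reassigned to its abundant neighbour `ACGT`, while `TTTT` has no close neighbour and keeps its count. The total number of reads is preserved, since each observed read is only redistributed among candidate tags.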
