Recount: expectation maximization based error correction tool for next generation sequencing data.

Edward Wijaya, Martin C Frith, Yutaka Suzuki, Paul Horton
Genome Informatics. International Conference on Genome Informatics, 23(1): 189-201. Journal article, published 2009-10-01.

Abstract

Next generation sequencing technologies enable rapid, large-scale production of sequence data sets. Unfortunately, these technologies also have a non-negligible sequencing error rate, which biases their output by introducing false reads and reducing the number of real reads. Although methods developed for SAGE data can reduce these false counts to a considerable degree, until now they have not been implemented in a scalable way. Recently, a program named FREC was developed to address this problem for next generation sequencing data. In this paper, we introduce RECOUNT, our implementation of an Expectation Maximization algorithm for tag count correction, and compare it to FREC. Using both the reference genome and simulated data, we find that RECOUNT performs as well as or better than FREC while using much less memory (e.g. 5 GB vs. 75 GB). Furthermore, we report the first analysis of tag count correction with real data in the context of gene expression analysis. Our results show that tag count correction not only increases the number of mappable tags, but can also make a real difference in the biological interpretation of next generation sequencing data. RECOUNT is an open-source C++ program available at http://seq.cbrc.jp/recount.
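
To illustrate the general idea of Expectation Maximization for tag count correction, the following is a minimal Python sketch. It is not RECOUNT's implementation. The error model (a uniform per-base substitution rate, at most one error per tag, candidate true tags restricted to the observed tags) and the names `hamming_neighbors`, `em_correct_counts`, `error_rate`, and `iterations` are assumptions made for this illustration only.

```python
from collections import defaultdict


def hamming_neighbors(tag, alphabet="ACGT"):
    """Yield every tag that differs from `tag` at exactly one position."""
    for i, base in enumerate(tag):
        for alt in alphabet:
            if alt != base:
                yield tag[:i] + alt + tag[i + 1:]


def em_correct_counts(observed, error_rate=0.01, iterations=50):
    """Estimate true tag counts from observed counts (illustrative only).

    Assumed model, not taken from the paper: uniform per-base substitution
    error rate, at most one error per tag, all tags of equal length, and
    candidate true tags limited to tags that were actually observed.
    """
    tags = list(observed)
    length = len(tags[0])
    # Probability of reading a tag without error, or with one specific error.
    p_same = (1 - error_rate) ** length
    p_one = (error_rate / 3) * (1 - error_rate) ** (length - 1)

    # sources[t]: candidate true tags s with P(read t | true tag s).
    sources = {}
    for t in tags:
        candidates = [(t, p_same)]
        candidates += [(s, p_one) for s in hamming_neighbors(t) if s in observed]
        sources[t] = candidates

    # Start from the observed counts as the initial abundance estimate.
    estimate = {t: float(c) for t, c in observed.items()}
    for _ in range(iterations):
        updated = defaultdict(float)
        for t, count in observed.items():
            # E-step: weight each candidate source by abundance * likelihood.
            weights = [(s, estimate[s] * p) for s, p in sources[t]]
            total = sum(w for _, w in weights)
            if total <= 0:
                updated[t] += count  # degenerate case: leave the reads in place
                continue
            # M-step contribution: reassign the reads of t to their sources.
            for s, w in weights:
                updated[s] += count * w / total
        estimate = {t: updated[t] for t in tags}
    return estimate


if __name__ == "__main__":
    toy_counts = {"ACGT": 1000, "ACGA": 2, "TTTT": 50}
    for tag, value in em_correct_counts(toy_counts).items():
        print(tag, round(value, 3))
```

In this toy input, the rare tag `ACGA` is one substitution away from the abundant tag `ACGT`, so under the assumed model most of its reads are reassigned to `ACGT`, while `TTTT` has no observed neighbor and keeps its count. The total number of reads is preserved by construction.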
