Factorial analysis of error correction performance using simulated next-generation sequencing data

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Pub Date: 2016-12-01. DOI: 10.1109/BIBM.2016.7822685

Isaac Akogwu, Nan Wang, Chaoyang Zhang, Hwanseok Choi, H. Hong, P. Gong



Abstract

Error correction is a critical initial step in next-generation sequencing (NGS) data analysis. Although more than 60 tools have been developed, there has been no systematic, evidence-based comparison of their strengths and weaknesses, especially in terms of correction accuracy. Here we report a full factorial simulation study that examines how NGS dataset characteristics (genome size, coverage depth and read length in particular) affect error correction performance (precision and F-score), and that compares how sensitive or resistant six k-mer spectrum-based methods are to variations in those characteristics. Multi-way ANOVA tests indicate that both the choice of correction method and the dataset characteristics had significant effects on the performance metrics. Overall, BFC, Bless, Bloocoo and Musket performed better than Lighter and Trowel on 27 synthetic datasets. For each method, read length and coverage depth had a more pronounced impact on performance than genome size. This study sheds light on how error correction methods behave in response to common variables encountered in real-world NGS datasets. It also warrants further studies on wet lab-generated experimental NGS data to validate the findings of this simulation study.
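To make the design concrete, the following is a minimal sketch of how a full factorial grid and the two reported metrics might be computed. It rests on several assumptions that the abstract does not confirm: that the 27 datasets arise from three levels of each of the three factors, that the level values shown are purely hypothetical placeholders, that a true positive is an erroneous base correctly fixed, a false positive is a correct base wrongly altered, a false negative is an erroneous base left unfixed, and that the F-score is the harmonic mean of precision and recall. The function names are illustrative only.

```python
from itertools import product

# Hypothetical factor levels: the abstract names the three factors
# but does not state the values used in the study.
GENOME_SIZES = ["small", "medium", "large"]
COVERAGE_DEPTHS = [10, 30, 60]   # assumed fold-coverage values
READ_LENGTHS = [50, 100, 150]    # assumed read lengths in bases


def full_factorial(*factors):
    """Return every combination of the given factor levels."""
    return list(product(*factors))


def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and F-score from correction counts.

    Assumed definitions (not taken from the abstract):
      tp: erroneous bases that were correctly fixed
      fp: correct bases that were wrongly changed
      fn: erroneous bases that were left unfixed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


design = full_factorial(GENOME_SIZES, COVERAGE_DEPTHS, READ_LENGTHS)
print(len(design))  # 3 x 3 x 3 combinations

# Illustrative counts only, not results from the study.
p, r, f = precision_recall_f(tp=80, fp=20, fn=20)
print(round(p, 3), round(r, 3), round(f, 3))
```

Each tuple in `design` would correspond to one simulated dataset on which every correction method is run; the resulting per-dataset metrics could then serve as the response variable in a multi-way ANOVA with method and dataset characteristics as factors.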


