Benchmarking software tools for trimming adapters and merging next-generation sequencing data for ancient DNA
Annette Lien, Leonardo Pestana Legori, Louis Kraft, Peter Wad Sackett, Gabriel Renaud
Frontiers in Bioinformatics 3:1260486 (published 2023-12-07). DOI: 10.3389/fbinf.2023.1260486
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10733496/pdf/
Citations: 0
Abstract
Ancient DNA is highly degraded, resulting in very short sequences. Reads generated by modern high-throughput sequencing machines are generally longer than ancient DNA molecules, so the reads often contain part of the sequencing adapters. Removing those adapters is crucial, as they can interfere with downstream analysis. Furthermore, when DNA has been read in both directions (paired-end), the overlapping portions of the two reads can be merged to correct sequencing errors and improve read quality. Several tools have been developed for adapter trimming and read merging; however, their accuracy and their potential impact on downstream analyses had not previously been evaluated. Using simulated sequencing data, seven commonly used tools were assessed on their ability to reconstruct ancient DNA sequences through read merging. The analyzed tools exhibit notable differences in their ability to correct sequencing errors and to identify the correct read overlap, but the most substantial difference is observed in how they calculate quality scores for merged bases. Selecting the most appropriate tool for a given project depends on several factors: some tools, such as fastp, have shortcomings, whereas others, like leeHom, outperform the rest in most respects. While the choice of tool did not produce a measurable difference in a population-genetics analysis using principal component analysis, downstream analyses that are sensitive to wrongly merged reads or that rely on quality scores can be significantly affected by the choice of tool.
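The merging step the abstract describes (reverse-complementing the reverse read, locating the overlap, and combining quality scores for the overlapping bases) can be sketched in a few lines. This is a simplified, illustrative sketch only; the function names, the overlap heuristic, and the quality-combination rule (summed Phred scores for agreeing bases, difference for disagreeing ones) are assumptions for exposition, not the algorithm of fastp, leeHom, or any other tool benchmarked in the paper.

```python
def revcomp(seq):
    """Reverse-complement a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def best_overlap(r1, r2rc, min_olap=5):
    """Return the overlap length (>= min_olap) that maximizes the
    fraction of matching bases between the 3' end of r1 and the
    5' end of the reverse-complemented reverse read; 0 if none fits."""
    best_len, best_score = 0, -1.0
    for olap in range(min_olap, min(len(r1), len(r2rc)) + 1):
        matches = sum(a == b for a, b in zip(r1[-olap:], r2rc[:olap]))
        score = matches / olap
        if score > best_score:
            best_len, best_score = olap, score
    return best_len

def merge(r1, q1, r2, q2, min_olap=5):
    """Merge a read pair. In the overlap, agreeing bases get a summed
    (capped) Phred score; for disagreeing bases the higher-quality base
    is kept with the difference of the two scores (illustrative rule)."""
    r2rc, q2r = revcomp(r2), q2[::-1]
    olap = best_overlap(r1, r2rc, min_olap)
    cut = len(r1) - olap  # safe even when olap == 0
    merged_seq, merged_q = list(r1[:cut]), list(q1[:cut])
    for i in range(olap):
        b1, b2 = r1[cut + i], r2rc[i]
        s1, s2 = q1[cut + i], q2r[i]
        if b1 == b2:
            merged_seq.append(b1)
            merged_q.append(min(s1 + s2, 41))  # cap combined Phred score
        elif s1 >= s2:
            merged_seq.append(b1)
            merged_q.append(s1 - s2)
        else:
            merged_seq.append(b2)
            merged_q.append(s2 - s1)
    merged_seq += list(r2rc[olap:])
    merged_q += list(q2r[olap:])
    return "".join(merged_seq), merged_q
```

For a 10 bp fragment ACGTACGTAC read as 7 bp paired-end reads, the forward read is ACGTACG and the reverse read is GTACGTA (the reverse complement of the fragment's last 7 bases); merging them with a 4 bp overlap reconstructs the full fragment, with elevated quality scores in the overlap. Real tools differ precisely in how they score candidate overlaps and recalculate these merged qualities, which is the main axis of difference the benchmark reports.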