PRONAME: a user-friendly pipeline to process long-read nanopore metabarcoding data by generating high-quality consensus sequences
Benjamin Dubois, Mathieu Delitte, Salomé Lengrand, Claude Bragard, Anne Legrève, Frédéric Debode
Frontiers in bioinformatics, volume 4, article 1483255. Published 2024-12-20. DOI: https://doi.org/10.3389/fbinf.2024.1483255
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11695402/pdf/
Abstract
Background: The study of sample taxonomic composition has evolved from direct observations and labor-intensive morphological studies to a range of DNA sequencing methodologies. Most of these studies leverage the metabarcoding approach, which involves the amplification of a small, taxonomically informative portion of the genome and its subsequent high-throughput sequencing. Recent advances in sequencing technology brought by Oxford Nanopore Technologies have revolutionized the field by offering portability, affordable cost and long-read sequencing, thereby significantly increasing taxonomic resolution. However, Nanopore sequencing data exhibit a particular profile, with a higher error rate than Illumina sequencing, and existing bioinformatics pipelines for the analysis of such data are scarce and often insufficient, so specialized tools are required to process long-read sequences accurately.
Results: We present PRONAME (PROcessing NAnopore MEtabarcoding data), an open-source, user-friendly pipeline optimized for processing raw Nanopore sequencing data. PRONAME includes precompiled databases for complete 16S sequences (Silva138 and Greengenes2) and a newly developed and curated database dedicated to bacterial 16S-ITS-23S operon sequences. The user can also provide a custom database if desired, which enables the analysis of metabarcoding data for any domain of life. The pipeline significantly improves sequence accuracy by implementing innovative error-correction strategies and by taking advantage of the new sequencing chemistry to produce high-quality duplex reads. Evaluations using a mock community have shown that PRONAME delivers consensus sequences with at least 99.5% accuracy under standard settings (and up to 99.7%), making it a robust tool for genomic analysis of complex multi-species communities.
Conclusion: PRONAME meets the challenges of long-read Nanopore data processing, offering greater accuracy and versatility than existing pipelines. By integrating Nanopore-specific quality filtering, clustering and error correction, PRONAME produces high-precision consensus sequences. This brings the accuracy of Nanopore sequencing close to that of Illumina sequencing, while taking advantage of the benefits of long-read technologies.
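As a rough illustration of why building a consensus from a cluster of noisy reads can raise accuracy, the sketch below takes a per-column majority vote over reads that are assumed to be already aligned and of equal length, then measures percent identity against a known reference. This is a toy example only: the function names, the reads and the reference are hypothetical, and it does not reproduce PRONAME's actual clustering or error-correction steps, which the abstract does not detail.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Return the per-column majority base of equal-length, pre-aligned reads.

    Assumption: reads are already aligned column by column. Ties are resolved
    in favour of the base seen first in that column.
    """
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be pre-aligned to the same length")
    # zip(*reads) iterates over alignment columns; keep the most frequent base.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


def percent_identity(sequence, reference):
    """Percentage of positions at which two equal-length sequences match."""
    if len(sequence) != len(reference):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(sequence, reference))
    return 100.0 * matches / len(reference)


# Hypothetical 20-base reference and five reads from one cluster.
reference = "ACGTACGTACGTACGTACGT"
reads = [
    "ACGTACGTACCTACGTACGT",  # substitution at position 10
    "ACGAACGTACGTACGTACGT",  # substitution at position 3
    "ACGTACGTACGTACGTAGGT",  # substitution at position 17
    "ACGTACGTACGTACGTACGT",  # error-free read
    "ACGTTCGTACGTACGTACGT",  # substitution at position 4
]

consensus = majority_consensus(reads)
print("single read identity:", percent_identity(reads[0], reference))
print("consensus identity:  ", percent_identity(consensus, reference))
```

Because the errors in this toy cluster fall at different positions, each one is outvoted by the other reads in its column. Real Nanopore data also contain insertions and deletions, so a practical pipeline needs an alignment or polishing step before any such vote.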