Run-length compressed metagenomic read classification with SMEM-finding and tagging.

bioRxiv: the preprint server for biology. Pub Date: 2025-03-24. DOI: 10.1101/2025.02.25.640119

Lore Depuydt, Omar Y Ahmed, Jan Fostier, Ben Langmead, Travis Gagie

Abstract

Metagenomic read classification is a fundamental task in computational biology, yet it remains challenging due to the scale, diversity, and complexity of sequencing datasets. We propose a novel, run-length compressed index based on the move structure that enables efficient multi-class metagenomic classification in $O(r)$ space, where $r$ is the number of character runs in the BWT of the reference text. Our method identifies all super-maximal exact matches (SMEMs) of length at least $L$ between a read and the reference dataset and associates each SMEM with one class identifier using a sampled tag array. A consensus algorithm then compacts these SMEMs with their class identifier into a single classification per read. We are the first to perform run-length compressed read classification based on full SMEMs instead of semi-SMEMs. We evaluate our approach on both long and short reads in two conceptually distinct datasets: a large bacterial pan-genome with few metagenomic classes and a smaller 16S rRNA gene database spanning thousands of genera or classes. Our method consistently outperforms SPUMONI 2 in accuracy and runtime while maintaining the same asymptotic memory complexity of $O(r)$. Compared to Cliffy, we demonstrate better memory efficiency while achieving superior accuracy on the simpler dataset and comparable performance on the more complex one. Overall, our implementation carefully balances accuracy, runtime, and memory usage, offering a versatile solution for metagenomic classification across diverse datasets. The open-source C++11 implementation is available at https://github.com/biointec/tagger under the AGPL-3.0 license.
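The pipeline described in the abstract (find SMEMs of length at least $L$, tag each SMEM with one class, then reduce the tags to one label per read) can be illustrated with a small brute-force sketch in Python. This is not the authors' implementation: it swaps the run-length compressed move-structure index for plain substring search, so it does not run in $O(r)$ space. The function names, the one-sequence-per-class `REFERENCES` layout, the tie-breaking rule in `tag_smem`, and the length-weighted vote in `classify_read` are all assumptions made for illustration; the abstract does not specify how the tag array or the consensus step resolve these cases.

```python
from collections import defaultdict

# Hypothetical toy data: one reference sequence per class identifier.
REFERENCES = {"A": "ACGTACGTGG", "B": "TTGACCATTG"}


def longest_match_at(read, i, references):
    """Length of the longest prefix of read[i:] occurring in any reference."""
    best = 0
    for seq in references.values():
        length = 0
        # Extend one character at a time while the substring still occurs.
        while i + length < len(read) and read[i:i + length + 1] in seq:
            length += 1
        best = max(best, length)
    return best


def find_smems(read, references, min_len):
    """Return half-open intervals (start, end) of all SMEMs with length >= min_len.

    The longest match starting at i is an SMEM exactly when it ends strictly
    after every longest match that starts earlier; otherwise it is contained
    in one of them. Assumes min_len >= 1.
    """
    smems = []
    prev_end = 0
    for i in range(len(read)):
        end = i + longest_match_at(read, i, references)
        if end > prev_end and end - i >= min_len:
            smems.append((i, end))
        prev_end = max(prev_end, end)
    return smems


def tag_smem(read, smem, references):
    """Assign one class to an SMEM.

    Assumption: when several classes contain the match, the smallest class
    identifier wins. The real index uses a sampled tag array instead.
    """
    start, end = smem
    match = read[start:end]
    for class_id in sorted(references):
        if match in references[class_id]:
            return class_id
    return None


def classify_read(read, references, min_len):
    """Reduce tagged SMEMs to one class via a length-weighted vote (assumed rule)."""
    votes = defaultdict(int)
    for smem in find_smems(read, references, min_len):
        votes[tag_smem(read, smem, references)] += smem[1] - smem[0]
    if not votes:
        return None  # no SMEM of sufficient length: read stays unclassified
    # Iterating over sorted keys makes ties resolve to the smallest class id.
    return max(sorted(votes), key=votes.get)
```

With these toy references and `min_len=4`, the read `"CGTACGTTGACC"` yields two overlapping SMEMs, `(0, 7)` matching class `A` and `(6, 12)` matching class `B`, and the length-weighted vote assigns the read to `A`. A read with no sufficiently long match, such as `"NNNN"`, returns `None`.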
