Run-length compressed metagenomic read classification with SMEM-finding and tagging.

Lore Depuydt, Omar Y Ahmed, Jan Fostier, Ben Langmead, Travis Gagie
Journal: bioRxiv : the preprint server for biology
Publication date: 2025-03-24
Publication type: Journal Article
DOI: 10.1101/2025.02.25.640119 (https://doi.org/10.1101/2025.02.25.640119)
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11888359/pdf/

Abstract

Metagenomic read classification is a fundamental task in computational biology, yet it remains challenging due to the scale, diversity, and complexity of sequencing datasets. We propose a novel, run-length compressed index based on the move structure that enables efficient multi-class metagenomic classification in O(r) space, where r is the number of character runs in the BWT of the reference text. Our method identifies all super-maximal exact matches (SMEMs) of length at least L between a read and the reference dataset and associates each SMEM with one class identifier using a sampled tag array. A consensus algorithm then compacts these SMEMs with their class identifier into a single classification per read. We are the first to perform run-length compressed read classification based on full SMEMs instead of semi-SMEMs. We evaluate our approach on both long and short reads in two conceptually distinct datasets: a large bacterial pan-genome with few metagenomic classes and a smaller 16S rRNA gene database spanning thousands of genera or classes. Our method consistently outperforms SPUMONI 2 in accuracy and runtime while maintaining the same asymptotic memory complexity of O(r). Compared to Cliffy, we demonstrate better memory efficiency while achieving superior accuracy on the simpler dataset and comparable performance on the more complex one. Overall, our implementation carefully balances accuracy, runtime, and memory usage, offering a versatile solution for metagenomic classification across diverse datasets. The open-source C++11 implementation is available at https://github.com/biointec/tagger under the AGPL-3.0 license.
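
To make the pipeline described in the abstract concrete, the following is a minimal brute-force sketch of its three steps: finding SMEMs of length at least L, attaching one class identifier to each SMEM, and reducing them to a single label per read. It is not the authors' implementation and does not use the move structure, the BWT, or O(r) space. The function names `find_smems` and `classify`, the rule "take the class of the first reference that contains the SMEM" (a stand-in for the sampled tag array), and the length-weighted vote (a stand-in for the consensus algorithm) are all assumptions made for illustration.

```python
from collections import defaultdict

def find_smems(read, refs, min_len):
    """Return (start, end, class_id) for every SMEM of length >= min_len.

    refs is a list of (sequence, class_id) pairs. Brute-force substring
    search; an illustrative stand-in, not a run-length compressed index.
    """
    def first_class(sub):
        # Assumption: tag a match with the class of the first reference
        # containing it. The paper's sampled tag array may behave differently.
        for seq, cls in refs:
            if sub in seq:
                return cls
        return None

    smems = []
    prev_end = 0  # end of the longest match starting one position earlier
    for start in range(len(read)):
        # The longest match from `start` reaches at least as far as the
        # longest match from `start - 1`, so resume from there.
        end = max(prev_end, start)
        while end < len(read) and first_class(read[start:end + 1]) is not None:
            end += 1
        # Right-maximal by construction; left-maximal only if it extends
        # past the match that started at the previous position.
        if end > prev_end and end - start >= min_len:
            smems.append((start, end, first_class(read[start:end])))
        prev_end = end
    return smems

def classify(read, refs, min_len):
    """Assumed consensus rule: the class covering the most SMEM bases wins."""
    votes = defaultdict(int)
    for start, end, cls in find_smems(read, refs, min_len):
        votes[cls] += end - start
    if not votes:
        return None  # no SMEM of sufficient length: read stays unclassified
    return max(sorted(votes), key=votes.get)  # ties go to the smallest label

# Toy reference set with two hypothetical classes "A" and "B".
REFS = [("ACGTACGTTT", "A"), ("GGGCCCAAAT", "B")]

print(find_smems("CGTACGTTGGGC", REFS, 4))  # [(0, 8, 'A'), (8, 12, 'B')]
print(classify("CGTACGTTGGGC", REFS, 4))    # A
```

In this toy run the read contains an 8-base SMEM from class A and a 4-base SMEM from class B, so the length-weighted vote assigns it to A. Raising `min_len` to 5 would discard the shorter SMEM entirely, which is the role the threshold L plays in the abstract.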

