MISSH: Fast Hashing of Multiple Spaced Seeds

IEEE/ACM Transactions on Computational Biology and Bioinformatics; IF 3.6; Zone 3, Biology; Q2, Biochemical Research Methods; Pub Date: 2024-09-25; DOI: 10.1109/TCBB.2024.3467368

Eleonora Mian;Enrico Petrucci;Cinzia Pizzi;Matteo Comin

Volume 21, issue 6, pages 2330-2339. Source: https://ieeexplore.ieee.org/document/10693556/

Citation count: 0

Abstract

Alignment-free analysis of sequences has revolutionized the high-throughput processing of sequencing data within numerous bioinformatics pipelines. Hashing $k$-mers represents a common function across various alignment-free applications, serving as a crucial tool for indexing, querying, and rapid similarity searching. More recently, spaced seeds, a specialized pattern that accommodates errors or mutations, have become a standard choice over traditional $k$-mers. Spaced seeds offer enhanced sensitivity in many applications when compared to $k$-mers. However, it's important to note that hashing spaced seeds significantly increases computational time. Furthermore, if multiple spaced seeds are employed, accuracy can be further improved, albeit at the expense of longer processing times. This paper addresses the challenge of efficiently hashing multiple spaced seeds. The proposed algorithms leverage the similarity of adjacent spaced seed hash values within an input sequence, allowing for the swift computation of subsequent hashes. Our experimental results, conducted across various tests, demonstrate a remarkable performance improvement over previously suggested algorithms, with potential speedups of up to 20 times. Additionally, we apply these efficient spaced seed hashing algorithms to a metagenomic application, specifically the classification of reads using Clark-S (Ounit and Lonardi, 2016). Our findings reveal a substantial speedup, effectively mitigating the slowdown caused by the utilization of multiple spaced seeds.
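To make the terms in the abstract concrete, the following is a minimal Python sketch of what hashing under spaced seeds involves. It is not the MISSH algorithm, whose details the abstract does not give. The 2-bit nucleotide encoding, the seed notation ('1' for a position that counts, '0' for a don't-care position), and all function names are assumptions made for illustration. The sketch shows the naive baseline, where every window is hashed from scratch for every seed, next to the familiar rolling update available for contiguous $k$-mers, which is the kind of reuse between adjacent positions that spaced seeds do not get for free.

```python
# Hypothetical 2-bit encoding of nucleotides (an assumption of this sketch).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def spaced_seed_hashes(seq, seed):
    """Naively hash every window of `seq` under one spaced seed.

    `seed` is a string over {'1', '0'}: '1' marks a position that
    contributes to the hash, '0' marks a don't-care position.
    Each window is hashed from scratch, costing one step per '1'.
    """
    care = [i for i, c in enumerate(seed) if c == "1"]
    hashes = []
    for start in range(len(seq) - len(seed) + 1):
        h = 0
        for j, offset in enumerate(care):
            # Pack the j-th selected nucleotide into bits 2j and 2j+1.
            h |= ENCODE[seq[start + offset]] << (2 * j)
        hashes.append(h)
    return hashes


def multi_seed_hashes(seq, seeds):
    """Hash `seq` under several seeds; the cost grows with the number of seeds."""
    return {seed: spaced_seed_hashes(seq, seed) for seed in seeds}


def rolling_kmer_hashes(seq, k):
    """Contiguous k-mers only: derive each hash from the previous one.

    Dropping the oldest symbol is a 2-bit right shift, and the new
    symbol is placed in the top position, so each step is O(1).
    """
    if len(seq) < k:
        return []
    h = 0
    for j in range(k):
        h |= ENCODE[seq[j]] << (2 * j)
    hashes = [h]
    for start in range(1, len(seq) - k + 1):
        h = (h >> 2) | (ENCODE[seq[start + k - 1]] << (2 * (k - 1)))
        hashes.append(h)
    return hashes
```

With an all-ones seed the two approaches agree, for example `rolling_kmer_hashes(s, 3) == spaced_seed_hashes(s, "111")`. Once the seed contains don't-care positions, a single shift no longer turns one hash into the next, which is why hashing spaced seeds, and especially several of them, is slower than hashing $k$-mers.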


Source journal

IEEE/ACM Transactions on Computational Biology and Bioinformatics (category: Engineering and Technology, Computer Science: Interdisciplinary Applications)

CiteScore: 7.50
Self-citation rate: 6.70%
Articles published: 479
Review time: 3 months

About the journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics emphasizes the algorithmic, mathematical, statistical and computational methods that are central in bioinformatics and computational biology; the development and testing of effective computer programs in bioinformatics; the development of biological databases; important biological results that are obtained from the use of these methods, programs and databases; and the emerging field of Systems Biology, where many forms of data are used to create a computer-based model of a complex biological system.