MISSH: Fast Hashing of Multiple Spaced Seeds
Authors: Eleonora Mian, Enrico Petrucci, Cinzia Pizzi, Matteo Comin
Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics (impact factor 3.6; JCR Q2, Biochemical Research Methods)
Published: 2024-09-25 (Journal Article)
DOI: 10.1109/TCBB.2024.3467368 (https://doi.org/10.1109/TCBB.2024.3467368)
Citations: 0
Abstract
Alignment-free sequence analysis has transformed the high-throughput processing of sequencing data in many bioinformatics pipelines. Hashing k-mers is a common operation across alignment-free applications, where it supports indexing, querying, and rapid similarity searching. More recently, spaced seeds, patterns that accommodate errors or mutations, have become a standard alternative to traditional k-mers, offering higher sensitivity in many applications. However, hashing spaced seeds significantly increases computation time. Using multiple spaced seeds can improve accuracy further, but at the cost of even longer processing times. This paper addresses the challenge of efficiently hashing multiple spaced seeds. The proposed algorithms exploit the similarity between the hash values of adjacent spaced seeds in an input sequence, so that subsequent hashes can be computed quickly. Our experiments, conducted across various tests, show a substantial performance improvement over previously proposed algorithms, with potential speedups of up to 20 times. We also apply these efficient spaced seed hashing algorithms to a metagenomic application, namely the classification of reads using Clark-S [Ounit and Lonardi, 2016]. The results show a substantial speedup that effectively mitigates the slowdown caused by using multiple spaced seeds.
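To make the idea of reusing adjacent hash values concrete, the following is a minimal Python sketch. It is not the MISSH algorithm itself. It assumes a 2-bit nucleotide encoding (A=0, C=1, G=2, T=3), a seed written as a 0/1 mask string, and two hypothetical helper functions, `naive_hashes` and `reuse_hashes`. The second function copies a symbol's code from the previous window's hash whenever that symbol was already covered by a '1' position one step to the right in the mask, and reads the sequence only for the remaining positions.

```python
# Assumed 2-bit encoding of nucleotides (illustrative choice, not from the paper).
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}


def seed_positions(mask):
    """Return the offsets of the '1' (match) positions of a spaced seed mask."""
    return [k for k, c in enumerate(mask) if c == "1"]


def naive_hashes(seq, mask):
    """Hash every window from scratch: pack the codes at the '1' positions."""
    pos = seed_positions(mask)
    out = []
    for i in range(len(seq) - len(mask) + 1):
        h = 0
        for m, q in enumerate(pos):
            h |= ENC[seq[i + q]] << (2 * m)
        out.append(h)
    return out


def reuse_hashes(seq, mask):
    """Same hashes, but reuse codes already stored in the previous hash.

    The symbol at seed offset q of window i is the symbol at offset q + 1 of
    window i - 1. If q + 1 is also a '1' position, its code is already packed
    in the previous hash and can be copied instead of re-encoded.
    """
    pos = seed_positions(mask)
    slot_of = {q: m for m, q in enumerate(pos)}
    # source[m]: slot in the previous hash holding the symbol needed for slot m,
    # or None when that symbol must be read from the sequence.
    source = [slot_of.get(q + 1) for q in pos]
    n_windows = len(seq) - len(mask) + 1
    if n_windows <= 0:
        return []
    out = naive_hashes(seq[:len(mask)], mask)  # first window has no predecessor
    for i in range(1, n_windows):
        prev = out[-1]
        h = 0
        for m, q in enumerate(pos):
            n = source[m]
            if n is None:
                code = ENC[seq[i + q]]          # not recoverable: read the sequence
            else:
                code = (prev >> (2 * n)) & 3    # recovered from the previous hash
            h |= code << (2 * m)
        out.append(h)
    return out


if __name__ == "__main__":
    seeds = ["1101", "1011"]  # hypothetical seeds, for illustration only
    read = "ACGTACGTTGCA"
    for seed in seeds:
        assert reuse_hashes(read, seed) == naive_hashes(read, seed)
        print(seed, reuse_hashes(read, seed))
```

In this sketch the saving comes from skipping the sequence lookup and encoding for recoverable positions. With several seeds, the same loop is simply repeated per seed. Sharing work across seeds, as the paper's title suggests, would need additional bookkeeping that is not shown here.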
About the journal:
IEEE/ACM Transactions on Computational Biology and Bioinformatics emphasizes the algorithmic, mathematical, statistical and computational methods that are central to bioinformatics and computational biology; the development and testing of effective computer programs in bioinformatics; the development of biological databases; important biological results obtained from the use of these methods, programs and databases; and the emerging field of Systems Biology, where many forms of data are used to create a computer-based model of a complex biological system.