GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs.

Impact factor: 16.6 | Zone 2 (Biology) | Q1, Biochemistry & Molecular Biology | Nucleic Acids Research | Pub Date: 2024-09-09 | DOI: 10.1093/nar/gkae609
Jianshu Zhao, Jean Pierre Both, Luis M Rodriguez-R, Konstantinos T Konstantinidis
Citations: 0

Abstract

Genome search and/or classification typically involves finding the best-match database (reference) genomes and has become increasingly challenging due to the growing number of available database genomes and the fact that traditional methods do not scale well with large databases. By combining k-mer hashing-based probabilistic data structures (i.e. ProbMinHash, SuperMinHash, Densified MinHash and SetSketch) to estimate genomic distance with a graph-based nearest neighbor search algorithm (Hierarchical Navigable Small World Graphs, or HNSW), we created a new data structure and developed an associated computer program, GSearch, that is orders of magnitude faster than alternative tools while maintaining high accuracy and low memory usage. For example, GSearch can search 8000 query genomes against all available microbial or viral genomes for their best matches (n = ∼318 000 or ∼3 000 000, respectively) within a few minutes on a personal laptop, using ∼6 GB of memory (2.5 GB via SetSketch). Notably, GSearch has an O(log(N)) time complexity and will scale well with billions of genomes based on a database splitting strategy. Further, GSearch implements a three-step search strategy depending on the degree of novelty of the query genomes to maximize specificity and sensitivity. Therefore, GSearch solves a major bottleneck of microbiome studies that require genome search and/or classification.
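
To illustrate the general idea of estimating genomic similarity from k-mer hashes, here is a minimal sketch of classic MinHash in Python. It is not GSearch's implementation: the abstract names ProbMinHash, SuperMinHash, Densified MinHash and SetSketch, which are more refined variants, and the graph-based HNSW search step is not shown. The function names, the choice of BLAKE2b as hash function, and the parameter defaults (`k=8`, `num_hashes=128`) are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer, seed):
    """Hash a k-mer to a 64-bit integer; the seed selects one hash function."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=8, num_hashes=128):
    """Classic MinHash: keep the minimum hash value per seeded hash function."""
    ks = kmers(seq, k)
    return [min(hash64(km, seed) for km in ks) for seed in range(num_hashes)]


def jaccard_estimate(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; estimates k-mer Jaccard similarity."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(seq_a, seq_b, k=8):
    """Exact k-mer Jaccard similarity, for comparison with the estimate."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Toy sequences, hypothetical data for illustration only.
    genome_a = "ACGTACGTTGCAAGCTTGCATGCCGATCGATTACGGCTAGCTAGGATCC"
    genome_b = "ACGTACGTTGCAAGCTTGCATGCCGATCGATTACGGCTAGCTAGGTTAA"
    sk_a = minhash_sketch(genome_a)
    sk_b = minhash_sketch(genome_b)
    print("estimated Jaccard:", jaccard_estimate(sk_a, sk_b))
    print("exact Jaccard:    ", exact_jaccard(genome_a, genome_b))
```

Each sketch has a fixed size regardless of genome length, which is what makes sketch-to-sketch comparison cheap enough to serve as the distance function inside a nearest neighbor index such as HNSW.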

Source journal
Nucleic Acids Research (Biology: Biochemistry & Molecular Biology)
CiteScore: 27.10
Self-citation rate: 4.70%
Articles published: 1057
Review time: 2 months
About the journal: Nucleic Acids Research (NAR) is a scientific journal that publishes research on various aspects of nucleic acids and proteins involved in nucleic acid metabolism and interactions. It covers areas such as chemistry and synthetic biology, computational biology, gene regulation, chromatin and epigenetics, genome integrity, repair and replication, genomics, molecular biology, nucleic acid enzymes, RNA, and structural biology. The journal also includes a Survey and Summary section for brief reviews. Additionally, each year, the first issue is dedicated to biological databases, and an issue in July focuses on web-based software resources for the biological community. Nucleic Acids Research is indexed by several services including Abstracts on Hygiene and Communicable Diseases, Animal Breeding Abstracts, Agricultural Engineering Abstracts, Agbiotech News and Information, BIOSIS Previews, CAB Abstracts, and EMBASE.