Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?
Jimmy Lin
arXiv - CS - Information Retrieval, 2024-09-10
DOI: arxiv-2409.06464 (https://doi.org/arxiv-2409.06464)
Citation count: 0
Abstract
Practitioners working on dense retrieval today face a bewildering number of
choices. Beyond selecting the embedding model, another consequential choice is
the actual implementation of nearest-neighbor vector search. While best
practices recommend HNSW indexes, flat vector indexes with brute-force search
represent another viable option, particularly for smaller corpora and for rapid
prototyping. In this paper, we provide experimental results on the BEIR dataset
using the open-source Lucene search library that explicate the tradeoffs
between HNSW and flat indexes (including quantized variants) from the
perspectives of indexing time, query evaluation performance, and retrieval
quality. With additional comparisons between dense and sparse retrievers, our
results provide guidance for today's search practitioner in understanding the
design space of dense and sparse retrievers. To our knowledge, we are the first
to provide operational advice supported by empirical experiments in this
regard.
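
To make the "flat index with brute-force search" option concrete, the following is a minimal sketch in plain Python, not the Lucene implementation evaluated in the paper. It assumes inner-product scoring and a small in-memory corpus; the function name `flat_search` and the toy vectors are hypothetical and chosen only for illustration. Every document vector is scored against the query, so the result is exact but the cost grows linearly with corpus size, which is the tradeoff against an approximate HNSW graph.

```python
import heapq


def flat_search(query, corpus, k=3):
    """Brute-force search: score every vector by inner product, keep the top k.

    `corpus` maps a document id to its embedding (a list of floats).
    Returns a list of (doc_id, score) pairs, best first.
    """
    scored = (
        (sum(q * d for q, d in zip(query, vec)), doc_id)
        for doc_id, vec in corpus.items()
    )
    # nlargest scans all scores once; no index structure is built or consulted.
    top = heapq.nlargest(k, scored)
    return [(doc_id, score) for score, doc_id in top]


# Toy two-dimensional embeddings (illustrative values only).
corpus = {
    "doc0": [1.0, 0.0],
    "doc1": [0.0, 1.0],
    "doc2": [0.6, 0.8],
}

results = flat_search([1.0, 0.0], corpus, k=2)
print(results)
```

Because nothing is built ahead of time, indexing amounts to storing the vectors, which is why this approach suits small corpora and rapid prototyping; the per-query scan is the part that an HNSW index trades away for approximate results.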