Set similarity search beyond MinHash

Tobias Christiani, R. Pagh
{"title":"Set similarity search beyond MinHash","authors":"Tobias Christiani, R. Pagh","doi":"10.1145/3055399.3055443","DOIUrl":null,"url":null,"abstract":"We consider the problem of approximate set similarity search under Braun-Blanquet similarity B(x, y) = |x ∩ y| / max(|x|, |y|). The (b1, b2)-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets P such that, given a query set q, if there exists x Ε P with B(q, x) ≥ b1, then we can efficiently return x′ Ε P with B(q, x′) > b2. We present a simple data structure that solves this problem with space usage O(n1+ρlogn + ∑x ε P|x|) and query time O(|q|nρ logn) where n = |P| and ρ = log(1/b1)/log(1/b2). Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014) we show that this value of ρ is tight across the parameter space, i.e., for every choice of constants 0 < b2 < b1 < 1. In the case where all sets have the same size our solution strictly improves upon the value of ρ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998) such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).","PeriodicalId":20615,"journal":{"name":"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing","volume":"18 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2016-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"41","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3055399.3055443","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 41

Abstract

We consider the problem of approximate set similarity search under Braun-Blanquet similarity B(x, y) = |x ∩ y| / max(|x|, |y|). The (b1, b2)-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets P such that, given a query set q, if there exists x Ε P with B(q, x) ≥ b1, then we can efficiently return x′ Ε P with B(q, x′) > b2. We present a simple data structure that solves this problem with space usage O(n1+ρlogn + ∑x ε P|x|) and query time O(|q|nρ logn) where n = |P| and ρ = log(1/b1)/log(1/b2). Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014) we show that this value of ρ is tight across the parameter space, i.e., for every choice of constants 0 < b2 < b1 < 1. In the case where all sets have the same size our solution strictly improves upon the value of ρ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998) such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
设置超越MinHash的相似性搜索
考虑Braun-Blanquet相似度B(x, y) = |x∩y| / max(|x|, |y|)下的近似集相似度搜索问题。(b1, b2)近似布朗-布兰凯相似搜索问题是对集合P进行预处理,使得给定一个查询集q,如果存在x Ε P且B(q, x)≥b1,则我们可以有效地返回x ' Ε P且B(q, x ') > b2。我们提出了一个简单的数据结构来解决这个问题,它的空间使用为O(n1+ρlogn +∑x ε P|x|),查询时间为O(|q|nρ logn),其中n = |P|, ρ = log(1/b1)/log(1/b2)。利用O'Donnell等人(TOCT 2014)对位置敏感哈希的现有下界,我们表明ρ的这个值在参数空间上是紧的,即对于常数0 < b2 < b1 < 1的每一个选择。在所有集合具有相同大小的情况下,我们的解决方案严格改进了ρ值,ρ值可以通过使用最先进的数据独立技术在Indyk-Motwani位置敏感散列框架(STOC 1998)中获得,例如Broder的MinHash (CCS 1997)用于Jaccard相似性和Andoni等人的交叉多面体LSH (NIPS 2015)用于余弦相似性。令人惊讶的是,尽管我们的解决方案是数据独立的,但在很大一部分参数空间中,我们的性能优于目前最好的由Andoni和Razenshteyn (STOC 2015)提出的数据依赖方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Online service with delay A simpler and faster strongly polynomial algorithm for generalized flow maximization Low rank approximation with entrywise l1-norm error Fast convergence of learning in games (invited talk) Surviving in directed graphs: a quasi-polynomial-time polylogarithmic approximation for two-connected directed Steiner tree
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1