{"title":"L2AP: Fast cosine similarity search with prefix L-2 norm bounds","authors":"D. Anastasiu, G. Karypis","doi":"10.1109/ICDE.2014.6816700","DOIUrl":null,"url":null,"abstract":"The All-Pairs similarity search, or self-similarity join problem, finds all pairs of vectors in a high dimensional sparse dataset with a similarity value higher than a given threshold. The problem has been classically solved using a dynamically built inverted index. The search time is reduced by early pruning of candidates using size and value-based bounds on the similarity. In the context of cosine similarity and weighted vectors, leveraging the Cauchy-Schwarz inequality, we propose new ℓ2-norm bounds for reducing the inverted index size, candidate pool size, and the number of full dot-product computations. We tighten previous candidate generation and verification bounds and introduce several new ones to further improve our algorithm's performance. Our new pruning strategies enable significant speedups over baseline approaches, most times outperforming even approximate solutions. We perform an extensive evaluation of our algorithm, L2AP, and compare against state-of-the-art exact and approximate methods, AllPairs, MMJoin, and BayesLSH, across a variety of real-world datasets and similarity thresholds.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"48","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 30th International Conference on Data Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2014.6816700","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 48
Abstract
The All-Pairs similarity search, or self-similarity join problem, finds all pairs of vectors in a high dimensional sparse dataset with a similarity value higher than a given threshold. The problem has been classically solved using a dynamically built inverted index. The search time is reduced by early pruning of candidates using size and value-based bounds on the similarity. In the context of cosine similarity and weighted vectors, leveraging the Cauchy-Schwarz inequality, we propose new ℓ2-norm bounds for reducing the inverted index size, candidate pool size, and the number of full dot-product computations. We tighten previous candidate generation and verification bounds and introduce several new ones to further improve our algorithm's performance. Our new pruning strategies enable significant speedups over baseline approaches, most times outperforming even approximate solutions. We perform an extensive evaluation of our algorithm, L2AP, and compare against state-of-the-art exact and approximate methods, AllPairs, MMJoin, and BayesLSH, across a variety of real-world datasets and similarity thresholds.
all - pair相似性搜索,或自相似性连接问题,在一个高维稀疏数据集中寻找相似性值高于给定阈值的所有向量对。这个问题已经用动态建立的倒排索引经典地解决了。通过在相似性上使用大小和基于值的界限对候选对象进行早期修剪,减少了搜索时间。在余弦相似度和加权向量的背景下,利用Cauchy-Schwarz不等式,我们提出了新的2-范数界限,以减少倒索引大小,候选池大小和完整点积计算的数量。我们收紧了之前的候选生成和验证边界,并引入了几个新的边界来进一步提高算法的性能。我们的新修剪策略可以显著提高基线方法的速度,大多数情况下甚至优于近似解决方案。我们对L2AP算法进行了广泛的评估,并在各种真实数据集和相似阈值上与最先进的精确和近似方法AllPairs、MMJoin和BayesLSH进行了比较。