L2AP: Fast cosine similarity search with prefix L-2 norm bounds

2014 IEEE 30th International Conference on Data Engineering. Pub Date: 2014-05-19. DOI: 10.1109/ICDE.2014.6816700

D. Anastasiu, G. Karypis


Citations: 48

Abstract

The All-Pairs similarity search, or self-similarity join problem, finds all pairs of vectors in a high-dimensional sparse dataset whose similarity exceeds a given threshold. The problem has classically been solved using a dynamically built inverted index. The search time is reduced by early pruning of candidates using size- and value-based bounds on the similarity. In the context of cosine similarity and weighted vectors, leveraging the Cauchy-Schwarz inequality, we propose new ℓ2-norm bounds for reducing the inverted index size, candidate pool size, and the number of full dot-product computations. We tighten previous candidate generation and verification bounds and introduce several new ones to further improve our algorithm's performance. Our new pruning strategies enable significant speedups over baseline approaches, in most cases outperforming even approximate solutions. We perform an extensive evaluation of our algorithm, L2AP, and compare against state-of-the-art exact and approximate methods, AllPairs, MMJoin, and BayesLSH, across a variety of real-world datasets and similarity thresholds.
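The index-reduction idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' L2AP implementation: it uses only one Cauchy-Schwarz bound and omits the other candidate generation and verification bounds the abstract refers to. It assumes vectors are sparse dicts normalized to unit ℓ2 norm, so that cosine similarity is a dot product. If x' is a prefix of x, then dot(x', y) ≤ ‖x'‖·‖y‖ = ‖x'‖, so a prefix whose norm stays below the threshold t cannot produce a match by itself and need not be indexed. The function names and the alphabetical feature order are illustrative choices, not taken from the paper.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict: feature -> weight) to unit l2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()}


def dot(x, y):
    """Sparse dot product over the features the two vectors share."""
    return sum(w * y[f] for f, w in x.items() if f in y)


def all_pairs_l2_prefix(vectors, t):
    """Return {(i, j): sim} for all i < j with cosine similarity >= t.

    Simplified sketch: a dynamically built inverted index in which each
    vector's leading features are left unindexed while the l2 norm of
    that prefix is still below t.
    """
    unit = [normalize(v) for v in vectors]
    index = defaultdict(list)  # feature -> ids of vectors indexed under it
    results = {}
    for j, y in enumerate(unit):
        # Candidate generation: earlier vectors sharing an indexed feature.
        candidates = set()
        for f in y:
            candidates.update(index.get(f, ()))
        # Verification: full dot product for each surviving candidate.
        for i in candidates:
            s = dot(unit[i], y)
            if s >= t:
                results[(i, j)] = s
        # Indexing: skip features while the prefix norm is below t.
        sq = 0.0
        for f in sorted(y):  # fixed global feature order (assumed)
            sq += y[f] ** 2
            if math.sqrt(sq) >= t:
                index[f].append(j)
    return results


if __name__ == "__main__":
    vecs = [
        {"a": 1.0, "b": 1.0},
        {"a": 1.0, "b": 1.0, "c": 0.1},
        {"c": 1.0, "d": 1.0},
        {"d": 1.0},
    ]
    print(sorted(all_pairs_l2_prefix(vecs, 0.7)))  # [(0, 1), (2, 3)]
```

A pair that overlaps only in an unindexed prefix is never generated as a candidate, and by the bound above its similarity is below t, so no qualifying pair is lost. Comparisons that fall exactly on the threshold may be affected by floating-point rounding in this sketch.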
