Query filtering using two-dimensional local embeddings

Information Systems, vol. 101, Article 101808 (IF 3.4; CAS Region 2, Computer Science; JCR Q2, Computer Science, Information Systems). Pub Date: 2021-11-01. DOI: 10.1016/j.is.2021.101808

Lucia Vadicamo, Richard Connor, Edgar Chávez



Abstract

In high-dimensional data sets, exact indexes are ineffective for proximity queries, and a sequential scan over the entire data set is unavoidable. Accepting this, here we present a new approach employing two-dimensional embeddings. Each database element is mapped to the $XY$ plane using the four-point property. The caveat is that the mapping is local: in other words, each object is mapped using a different mapping.

The idea is that each element of the data is associated with a pair of reference objects that is well-suited to filter that particular object, in cases where it is not relevant to a query. This maximises the probability of excluding that object from a search. At query time, a query is compared with a pool of reference objects which allow its mapping to all the planes used by data objects. Then, for each query/object pair, a lower bound of the actual distance is obtained. The technique can be applied to any metric space that possesses the four-point property, therefore including Euclidean, Cosine, Triangular, Jensen–Shannon, and Quadratic Form distances.
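The filtering step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance, places the first reference object at (0, 0) and the second at (delta, 0), and picks each object's reference pair at random, which is only a placeholder for the paper's selection of a well-suited pair. All function names (`project`, `build_index`, `search`) are hypothetical. For a space with the four-point property, the planar distance between two points projected onto the same half-plane should not exceed their true distance, so an object whose planar distance to the query exceeds the search threshold can be skipped.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance, one of the metrics with the four-point property."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def project(d1, d2, delta):
    """Map a point to the plane where pivot 1 is (0, 0) and pivot 2 is (delta, 0).

    d1, d2: distances from the point to the two pivots.
    delta:  distance between the two pivots (must be > 0).
    """
    x = (d1 * d1 - d2 * d2 + delta * delta) / (2.0 * delta)
    y = math.sqrt(max(d1 * d1 - x * x, 0.0))  # clamp guards against rounding
    return x, y

def build_index(data, pivots, rng, dist=euclidean):
    """Per object, keep only a pivot pair and the object's 2D coordinates."""
    index = []
    for obj in data:
        # Assumption: random pair. The paper selects a pair suited to the object.
        i, j = rng.sample(range(len(pivots)), 2)
        delta = dist(pivots[i], pivots[j])
        xy = project(dist(obj, pivots[i]), dist(obj, pivots[j]), delta)
        index.append((i, j, delta, xy))
    return index

def search(query, threshold, data, pivots, index, dist=euclidean):
    """Range search: filter with the planar lower bound, then verify survivors."""
    q_to_pivot = [dist(query, p) for p in pivots]  # query vs. the whole pool
    results, verified = [], 0
    for obj, (i, j, delta, (ox, oy)) in zip(data, index):
        qx, qy = project(q_to_pivot[i], q_to_pivot[j], delta)
        lower_bound = math.hypot(qx - ox, qy - oy)
        if lower_bound > threshold:
            continue  # excluded without computing the true distance
        verified += 1
        if dist(query, obj) <= threshold:
            results.append(obj)
    return results, verified
```

Note that the scan is still linear in the data size, as the abstract states; the saving in this sketch comes from replacing most full distance computations with a two-dimensional one.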

Our experiments show that for all the data sets tested, of varying dimensionality, our approach can filter more objects than a standard metric indexing approach. For low-dimensional data this does not make a good search mechanism in its own right, as it does not scale with the size of the data: that is, its cost is linear with respect to the data size. However, we also show that it can be added as a post-filter to other mechanisms, increasing efficiency with little extra cost in space or time. For high-dimensional data, we show related approximate techniques which, we believe, give the best-known compromise for speeding up the essential sequential scan. The potential uses of our filtering technique include pure GPU searching, taking advantage of the tiny memory footprint of the mapping.


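Source journal: Information Systems (Engineering and Technology: Computer Science, Information Systems)

CiteScore: 9.40

Self-citation rate: 2.70%

Articles published: 112

Review time: 53 days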

Journal description: Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems. Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems such as new data models, performance enhancements, and show how those innovations contribute to the goals of the application.