Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. There are several variants of the similarity search problem, and one of the most relevant is the r-near neighbor (r-NN) problem: given a radius r>0 and a set of points S, construct a data structure that, for any given query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of fairness. We consider fairness in the sense of equal opportunity: all points that are within distance r from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. To address this, we propose efficient data structures for r-NN where all points in S that are near q have the same probability to be selected and returned by the query. Specifically, we first propose a black-box approach that, given any LSH scheme, constructs a data structure for uniformly sampling points in the neighborhood of a query. Then, we develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights (un)fairness in a recommendation setting on real-world datasets and discusses the inherent unfairness introduced by solving other variants of the problem.
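The fairness notion in this abstract can be illustrated with a tiny linear-scan baseline. The sketch below is not the paper's LSH-based or filter-based data structure; it only shows what "every point within distance r has the same probability of being returned" means, next to an unfair strategy that always returns the first near neighbor it finds. The function names, the Euclidean metric, and the toy point set are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def near_neighbors(points, q, r):
    """Return all points within Euclidean distance r of q (linear scan)."""
    return [p for p in points if math.dist(p, q) <= r]

def fair_rnn(points, q, r, rng):
    """Return a uniformly random r-near neighbor of q, or None if there is none."""
    candidates = near_neighbors(points, q, r)
    return rng.choice(candidates) if candidates else None

def first_found_rnn(points, q, r):
    """Unfair baseline: always return the first r-near neighbor in storage order."""
    for p in points:
        if math.dist(p, q) <= r:
            return p
    return None

# Hypothetical toy data: three points lie within distance 1 of q, one does not.
S = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.9), (3.0, 3.0)]
q = (0.0, 0.0)
rng = random.Random(0)

# Empirical return frequencies of the fair sampler over repeated queries.
counts = Counter(fair_rnn(S, q, 1.0, rng) for _ in range(3000))
print(counts)
print(first_found_rnn(S, q, 1.0))
```

The linear scan costs time proportional to |S| per query; the point of the paper's constructions is to obtain the same uniform output distribution with sublinear query time in high dimensions.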
{"title":"Fair Near Neighbor Search: Independent Range Sampling in High Dimensions","authors":"Martin Aumüller, R. Pagh, Francesco Silvestri","doi":"10.1145/3375395.3387648","DOIUrl":"https://doi.org/10.1145/3375395.3387648","url":null,"abstract":"Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. There are several variants of the similarity search problem, and one of the most relevant is the r-near neighbor (r-NN) problem: given a radius r>0 and a set of points S, construct a data structure that, for any given query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of fairness. We consider fairness in the sense of equal opportunity: all points that are within distance r from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. To address this, we propose efficient data structures for r-NN where all points in S that are near q have the same probability to be selected and returned by the query. Specifically, we first propose a black-box approach that, given any LSH scheme, constructs a data structure for uniformly sampling points in the neighborhood of a query. Then, we develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. 
The paper concludes with an experimental evaluation that highlights (un)fairness in a recommendation setting on real-world datasets and discusses the inherent unfairness introduced by solving other variants of the problem.","PeriodicalId":412441,"journal":{"name":"Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117079924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Quantiles, such as the median or percentiles, provide concise and useful information about the distribution of a collection of items, drawn from a totally ordered universe. We study data structures, called quantile summaries, which keep track of all quantiles of a stream of items, up to an error of at most ε. That is, an ε-approximate quantile summary first processes a stream and then, given any quantile query 0 ≤ φ ≤ 1, returns an item from the stream, which is a φ'-quantile for some φ' = φ ± ε. We focus on comparison-based quantile summaries that can only compare two items and are otherwise completely oblivious of the universe. The best such deterministic quantile summary to date, due to Greenwald and Khanna [6], stores at most O((1/ε) ⋅ log(εN)) items, where N is the number of items in the stream. We prove that this space bound is optimal by showing a matching lower bound. Our result thus rules out the possibility of constructing a deterministic comparison-based quantile summary in space f(ε) ⋅ o(log N), for any function f that does not depend on N. As a corollary, we improve the lower bound for biased quantiles, which provide a stronger, relative-error guarantee of (1 ± ε) ⋅ φ, and for other related computational tasks.
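The ε-approximate guarantee can be made concrete with a small offline sketch. This is not the Greenwald-Khanna summary and it is not a streaming algorithm: it sorts the whole input first and keeps items whose ranks are about 2εN apart, so that every target rank has a stored item within roughly εN ranks. The function names, the choice of ⌈φN⌉ as target rank, and the test stream are assumptions made for illustration.

```python
import math

def build_summary(stream, eps):
    """Offline sketch: keep (rank, item) pairs spaced about 2*eps*N ranks apart."""
    items = sorted(stream)
    n = len(items)
    step = max(1, math.floor(2 * eps * n))
    ranks = list(range(1, n + 1, step))
    if ranks[-1] != n:
        ranks.append(n)  # always keep the maximum so the top ranks are covered
    return [(r, items[r - 1]) for r in ranks], n

def query(summary, n, phi):
    """Return the stored item whose rank is closest to the target rank ceil(phi*N)."""
    target = min(n, max(1, math.ceil(phi * n)))
    _, item = min(summary, key=lambda pair: abs(pair[0] - target))
    return item

# Hypothetical stream: a permutation of 0..999, so the item with rank k is k - 1.
stream = [(7 * i) % 1000 for i in range(1000)]
eps = 0.05
summary, n = build_summary(stream, eps)

print(len(summary))              # number of stored items
print(query(summary, n, 0.5))    # an approximate median
```

The sketch stores about 1/(2ε) items only because it sees the sorted input in advance; the lower bound in the abstract concerns deterministic comparison-based summaries that must process the stream in one pass.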
{"title":"A Tight Lower Bound for Comparison-Based Quantile Summaries","authors":"Graham Cormode, P. Veselý","doi":"10.1145/3375395.3387650","DOIUrl":"https://doi.org/10.1145/3375395.3387650","url":null,"abstract":"Quantiles, such as the median or percentiles, provide concise and useful information about the distribution of a collection of items, drawn from a totally ordered universe. We study data structures, called quantile summaries, which keep track of all quantiles of a stream of items, up to an error of at most ε. That is, an ε-approximate quantile summary first processes a stream and then, given any quantile query 0 ≤ φ ≤ 1, returns an item from the stream, which is a φ'-quantile for some φ' = φ ± ε. We focus on comparison-based quantile summaries that can only compare two items and are otherwise completely oblivious of the universe. The best such deterministic quantile summary to date, due to Greenwald and Khanna [6], stores at most O((1/ε) ⋅ log(εN)) items, where N is the number of items in the stream. We prove that this space bound is optimal by showing a matching lower bound. Our result thus rules out the possibility of constructing a deterministic comparison-based quantile summary in space f(ε) ⋅ o(log N), for any function f that does not depend on N. 
As a corollary, we improve the lower bound for biased quantiles, which provide a stronger, relative-error guarantee of (1 ± ε) ⋅ φ, and for other related computational tasks.","PeriodicalId":412441,"journal":{"name":"Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121618270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The chase procedure is a fundamental algorithmic tool in database theory with a variety of applications. A key problem concerning the chase procedure is all-instances termination: for a given set of tuple-generating dependencies (TGDs), is it the case that the chase terminates for every input database? In view of the fact that this problem is undecidable, it is natural to ask whether known well-behaved classes of TGDs ensure decidability. We consider here the main paradigms that led to robust TGD-based formalisms, that is, guardedness and stickiness. Although all-instances termination is well-understood for the oblivious chase, the more subtle case of the restricted (a.k.a. the standard) chase is rather unexplored. We show that all-instances restricted chase termination for guarded/sticky single-head TGDs is decidable in elementary time.
{"title":"All-Instances Restricted Chase Termination","authors":"Tomasz Gogacz, J. Marcinkowski, Andreas Pieris","doi":"10.1145/3375395.3387644","DOIUrl":"https://doi.org/10.1145/3375395.3387644","url":null,"abstract":"The chase procedure is a fundamental algorithmic tool in database theory with a variety of applications. A key problem concerning the chase procedure is all-instances termination: for a given set of tuple-generating dependencies (TGDs), is it the case that the chase terminates for every input database? In view of the fact that this problem is undecidable, it is natural to ask whether known well-behaved classes of TGDs ensure decidability. We consider here the main paradigms that led to robust TGD-based formalisms, that is, guardedness and stickiness. Although all-instances termination is well-understood for the oblivious chase, the more subtle case of the restricted (a.k.a. the standard) chase is rather unexplored. We show that all-instances restricted chase termination for guarded/sticky single-head TGDs is decidable in elementary time.","PeriodicalId":412441,"journal":{"name":"Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125476003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
It is our great pleasure to welcome you to the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2015), held in Melbourne, Victoria, Australia, on May 31 to June 4, 2015, in conjunction with the 2015 ACM SIGMOD International Conference on Management of Data. Since the first edition of the symposium in 1982, PODS papers have been distinguished by a rigorous approach to widely diverse problems in data management, often bringing to bear techniques from a variety of areas, including computational logic, finite model theory, computational complexity, algorithm design and analysis, programming languages, and artificial intelligence. The PODS symposia study data management challenges in a variety of application contexts, including, more recently, probabilistic data, streaming data, graph data, information retrieval, ontologies and the semantic web, and data-driven processes and systems. PODS has a tradition of being the premier international conference on the theoretical and foundational aspects of data management, and the interested reader is referred to the PODS web pages at http://www.sigmod.org/thepods- pages/ for information on the history of this conference series.
This year's symposium continues this tradition, but in addition the PODS Executive Committee decided to broaden the scope of PODS and to explicitly invite the submission of papers providing original, substantial contributions in one or more of the following categories: a) deep theoretical exploration of topical areas central to data management; b) new formal frameworks that aim at providing the basis for deeper theoretical investigation of important emerging issues in data management; and c) validation of theoretical approaches through the lens of practical applicability in data management. This volume contains the proceedings of PODS 2015, which include an abstract for the keynote address by Michael I. Johnson (University of California, Berkeley), papers based on two invited tutorials by Todd J. Green (LogicBlox, USA) and Graham Cormode (University of Warwick, UK), and 25 contributions that were selected by the Program Committee for presentation at the symposium.
This year, PODS experimented for the first time with two submission cycles, where the first cycle also allowed papers to be revised and resubmitted. For the first cycle, 29 papers were submitted, 4 of which were directly selected for inclusion in the proceedings, and 7 were invited for resubmission after revision. The quality of most of the revised papers increased substantially with respect to the first submission, and 6 of those were in the end selected for the proceedings. For the second cycle, 51 papers were submitted, 15 of which were selected, resulting in 25 papers selected overall from a total of 80 submissions. Most of the 25 accepted papers are extended abstracts.
While all submissions have been reviewed by at least four Program Committee members, they have not been formally refereed. It is expected that much of the research described in these papers will be published in a more polished and detailed form in scientific journals.
With respect to the three categories above, out of the 80 submissions (resp., 25 accepted papers), 47 (resp., 19) were classified by the authors in category (a), 28 in category (b), and only 6 (resp., 3) in category (c). The categories were non-exclusive and the classification was not mandatory; indeed, several papers were classified in more than one category, and some submissions did not specify any category.
An important task for the Program Committee was the selection of the PODS 2015 Best Paper Award. The committee selected the paper "Parallel-Correctness and Transferability for Conjunctive Queries" by Tom J. Ameloot, Gaetano Geck, Bas Ketsman, Frank Neven, and Thomas Schwentick. On behalf of the committee, we extend our sincere congratulations to the authors.
Since 2008, PODS has presented the ACM PODS Alberto O. Mendelzon Test-of-Time Award to a paper, or a small number of papers, published in the PODS proceedings ten years earlier that had the most impact over the intervening decade. This year's committee, consisting of Dan Suciu (chair), Foto Afrati, and Frank Neven, selected the following two papers: "XPath Satisfiability in the Presence of DTDs" by Michael Benedikt, Wenfei Fan, and Floris Geerts, and "Views and Queries: Determinacy and Rewriting" by Luc Segoufin and Victor Vianu. Our warmest congratulations to their authors!
{"title":"Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","authors":"T. Milo, Diego Calvanese","doi":"10.1145/3403468","DOIUrl":"https://doi.org/10.1145/3403468","url":null,"abstract":"It is our great pleasure to welcome you to the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2015), held in Melbourne, Victoria, Australia, on May 31 -- June 4, 2015, in conjunction with the 2015 ACM SIGMOD International Conference on Management of Data. Since the first edition of the symposium in 1982, the PODS papers are distinguished by a rigorous approach to widely diverse problems in data management, often bringing to bear techniques from a variety of different areas, including computational logic, finite model theory, computational complexity, algorithm design and analysis, programming languages, and artificial intelligence. The PODS Symposia study data management challenges in a variety of application contexts, including more recently probabilistic data, streaming data, graph data, information retrieval, ontology and semantic web, and data-driven processes and systems. PODS has a tradition of being the premier international conference on the theoretical and foundational aspects of data management, and the interested reader is referred to the PODS web pages at http://www.sigmod.org/thepods- pages/ for information on the history of this conference series. 
This year's symposium continues this tradition, but in addition the PODS Executive Committee decided to broaden the scope of PODS, and to explicitly invite for submission papers providing original, substantial contributions in one or more of the following categories: a) deep theoretical exploration of topical areas central to data management; b) new formal frameworks that aim at providing the basis for deeper theoretical investigation of important emerging issues in data management; and c) validation of theoretical approaches from the lens of practical applicability in data management. This volume contains the proceedings of PODS 2015, which include an abstract for the keynote address by Michael I. Johnson (University of California, Berkeley), papers based on two invited tutorials by Todd J. Green (LogicBlox, USA) and Graham Cormode (University of Warwick, UK), and 25 contributions that were selected by the Program Committee for presentation at the symposium. This year, PODS experimented for the first time with two submission cycles, where the first cycle allowed also for papers to be revised and resubmitted. For the first cycle, 29 papers were submitted, 4 of which were directly selected for inclusion in the proceedings, and 7 were invited for a resubmission after a revision. The quality of most of the revised papers increased substantially with respect to the first submission, and 6 of those in the end were selected for the proceedings. For the second cycle, 51 papers were submitted, 15 of which were selected, resulting in 25 papers selected overall from a total number of 80 submissions. Most of the 25 accepted papers are extended abstracts. 
While all submissions have been reviewed by at least four Program Committee members, they have not been forma","PeriodicalId":412441,"journal":{"name":"Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132353170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}