Privacy
Kobbi Nissim. DOI: 10.1145/3452021.3458816
There are significant gaps between legal and technical thinking around data privacy. Technical standards are described using mathematical language, whereas legal standards are not rigorous from a mathematical point of view and often resort to concepts which they only partially define. As a result, arguments about the adequacy of technical privacy measures for satisfying legal privacy often lack rigor, and their conclusions are uncertain. The uncertainty is exacerbated by a litany of successful privacy attacks on privacy measures thought to meet legal expectations but then shown to fall short of doing so. As computer systems manipulating individual privacy-sensitive data become integrated in almost every aspect of society, and as such systems increasingly make decisions of legal significance, the need to bridge the diverging, and sometimes conflicting, legal and technical approaches becomes urgent. We formulate and prove formal claims -- "legal theorems" -- addressing legal questions such as whether the use of technological measures satisfies the requirements of a legal privacy standard. In particular, we analyze the notion of singling out from the GDPR and whether technologies such as k-anonymity and differential privacy prevent singling out. Our long-term goal is to develop concepts that are, on the one hand, technical, so that they can be integrated into the design of computer systems, and, on the other hand, usable in legal reasoning and policymaking.
Data-Independent Space Partitionings for Summaries
Graham Cormode, M. Garofalakis, Michael Shekelyan. DOI: 10.1145/3452021.3458316
Histograms are a standard tool in data management for describing multidimensional data. It is often convenient or even necessary to define data-independent histograms, which partition space in advance without observing the data itself. Specific motivations arise in managing data when it is not suitable to frequently change the boundaries between histogram cells: for example, when the data is subject to many insertions and deletions; when data is distributed across multiple systems; or when producing a privacy-preserving representation of the data. The baseline approach is to consider an equiwidth histogram, i.e., a regular grid over the space. However, this is not optimal for the objective of splitting the multidimensional space into (possibly overlapping) bins, such that each box can be rebuilt using a set of non-overlapping bins with minimal excess (or deficit) of volume. Thus, we investigate how to split the space into bins and identify novel solutions that offer a good balance of desirable properties. As many data processing tools require a dataset as input, we propose efficient methods for obtaining synthetic point sets that match the histograms over the overlapping bins.
Model-theoretic Characterizations of Rule-based Ontologies
Marco Console, Phokion G. Kolaitis, Andreas Pieris. DOI: 10.1145/3452021.3458310
An ontology specifies an abstract model of a domain of interest via a formal language that is typically based on logic. Although description logics are popular formalisms for modeling ontologies, tuple-generating dependencies (tgds), originally introduced as a unifying framework for database integrity constraints, and later on used in data exchange and integration, are also well suited for modeling ontologies that are intended for data-intensive tasks. The reason is that, unlike description logics, tgds can easily handle higher-arity relations that naturally occur in relational databases. In recent years, there has been an extensive study of tgd-ontologies and of their applications to several different data-intensive tasks. However, the fundamental question of whether the expressive power of tgd-ontologies can be characterized in terms of model-theoretic properties remains largely unexplored. We establish several characterizations of tgd-ontologies, including characterizations of ontologies specified by such central classes of tgds as full, linear, guarded, and frontier-guarded tgds. Our characterizations use the well-known notions of critical instance and direct product, as well as a novel locality property for tgd-ontologies. We further use this locality property to decide whether an ontology expressed by frontier-guarded (respectively, guarded) tgds can be expressed by tgds in the weaker class of guarded (respectively, linear) tgds, and effectively construct such an equivalent ontology if one exists.
Frequent Elements with Witnesses in Data Streams
C. Konrad. DOI: 10.1145/3452021.3458330
Detecting frequent elements is among the oldest and most-studied problems in the area of data streams. Given a stream of $m$ data items in $\{1, 2, \dots, n\}$, the objective is to output items that appear at least $d$ times, for some threshold parameter $d$, and provably optimal algorithms are known today. However, in many applications, knowing only the frequent elements themselves is not enough: for example, an Internet router may not only need to know the most frequent destination IP addresses of forwarded packets, but also the timestamps of when these packets appeared, or any other meta-data that "arrived" with the packets, e.g., their source IP addresses. In this paper, we introduce the witness version of the frequent elements problem: given a desired approximation guarantee $\alpha \ge 1$ and a desired frequency $d \le \Delta$, where $\Delta$ is the frequency of the most frequent item, the objective is to report an item together with at least $d / \alpha$ timestamps of when the item appeared in the stream (or any other meta-data that arrived with the items). We give provably optimal algorithms for both the insertion-only and insertion-deletion stream settings: in insertion-only streams, we show that space $\tilde{O}(n + d \cdot n^{1/\alpha})$ is necessary and sufficient for every integral $1 \le \alpha \le \log n$. In insertion-deletion streams, we show that space $\tilde{O}(\frac{n \cdot d}{\alpha^2})$ is necessary and sufficient for every $\alpha \le \sqrt{n}$.
Estimating the Size of Union of Sets in Streaming Models
Kuldeep S. Meel, N. V. Vinodchandran, Sourav Chakraborty. DOI: 10.1145/3452021.3458333
In this paper we study the problem of estimating the size of the union of sets $S_1, \dots, S_M$, where each set $S_i \subseteq \Omega$ (for some discrete universe $\Omega$) is implicitly presented and comes in a streaming fashion. We define the notion of Delphic sets to capture a class of streaming problems in which membership, sampling, and counting calls to the sets are efficient. In particular, we show that our notion of Delphic sets captures three well-known problems: Klee's measure problem (discrete version), test coverage estimation, and model counting of DNF formulas. Klee's measure problem corresponds to computing the volume of a union of multi-dimensional axis-aligned rectangles, i.e., every $d$-dimensional axis-aligned rectangle can be defined as $[a_1,b_1] \times [a_2,b_2] \times \dots \times [a_d, b_d]$. The problem of test coverage estimation focuses on the computation of a coverage measure for a given testing array in the context of combinatorial testing, which is a fundamental technique in hardware and software testing. Finally, given a DNF formula $\varphi = T_1 \vee T_2 \vee \dots \vee T_M$, the problem of model counting seeks to compute the number of satisfying assignments of $\varphi$. The primary contribution of our work is a simple and efficient sampling-based algorithm, called hybrid, for estimating the size of the union of sets in the streaming setting. Our algorithm has space complexity $O(R \log |\Omega|)$ and update time $O(R \log R \cdot \log(M/\delta) \cdot \log|\Omega|)$, where $R = O\left(\log(M/\delta) \cdot \varepsilon^{-2}\right)$. Consequently, our algorithm provides the first algorithm with linear dependence on $d$ for Klee's measure problem in the streaming setting for $d>1$, thereby settling the open problem of Tirthapura and Woodruff (PODS 2012). Furthermore, a straightforward application of our algorithm yields an efficient algorithm for the coverage estimation problem in the streaming setting. We then investigate whether the space complexity for coverage estimation can be further improved, and in this context we present another streaming algorithm that uses near-optimal $O(t \log n / \varepsilon^2)$ space but relies on an update algorithm in $\mathrm{P}^{\mathrm{NP}}$, thereby showcasing an interesting time vs. space trade-off in the streaming setting. Finally, we demonstrate the generality of our Delphic sets by obtaining a streaming algorithm for model counting of DNF formulas. It is worth remarking that we view a key strength of our work to be the simplicity of both the algorithm and its theoretical analysis, which makes it amenable to practical implementation and easy adoption.
2021 ACM PODS Alberto O. Mendelzon Test-of-Time Award
A. Bonifati, R. Pagh, T. Schwentick. DOI: 10.1145/3452021.3452909
The ACM PODS Alberto O. Mendelzon Test-of-Time Award is awarded every year to a paper or a small number of papers published in the PODS proceedings ten years prior that had the most impact in terms of research, methodology, or transfer to practice over the intervening decade. The PODS Executive Committee has appointed us to serve as the Award Committee for 2021. After careful consideration and having solicited external nominations and advice, we have selected the following paper as the award winner for 2021: "Tight bounds for L_p samplers, finding duplicates in streams, and related problems" by Hossein Jowhari, Mert Sağlam, and Gábor Tardos. Citation: This paper addresses a question posed by Cormode et al. in VLDB 2005, namely whether a uniform (or nearly uniform) sample can be maintained in a dynamically changing database, where data items may be inserted and deleted, while using space much smaller than the size of the database. More generally, it considers maintaining an L_p sample, where an element must be sampled with probability proportional to w^p (possibly up to some small relative error), where w is a weight that may change dynamically. In SODA 2010, Monemizadeh and Woodruff showed that it is possible to perform L_p sampling in a stream using polylogarithmic space. The PODS 2011 paper by Jowhari, Sağlam, and Tardos essentially closes the problem by presenting algorithms with improved space usage, as well as a matching lower bound showing that it is not possible to asymptotically improve the upper bounds. The paper has had a considerable impact on the design of algorithms in streaming and distributed models of computation, where L_p sampling has become an essential part of the toolbox. The survey "L_p Samplers and Their Applications" in ACM Computing Surveys (2019) presents a number of surprising applications, for example in graph algorithms and in randomized numerical linear algebra.
Modern Lower Bound Techniques in Database Theory and Constraint Satisfaction
D. Marx. DOI: 10.1145/3452021.3458814
Conditional lower bounds based on $P \neq NP$, the Exponential-Time Hypothesis (ETH), or similar complexity assumptions can provide very useful information about what type of algorithms are likely to be possible. Ideally, such lower bounds would be able to demonstrate that the best known algorithms are essentially optimal and cannot be improved further. In this tutorial, we overview different types of lower bounds, and see how they can be applied to problems in database theory and constraint satisfaction.
Minimum Coresets for Maxima Representation of Multidimensional Data
Yanhao Wang, M. Mathioudakis, Yuchen Li, K. Tan. DOI: 10.1145/3452021.3458322
Coresets are succinct summaries of large datasets such that, for a given problem, the solution obtained from a coreset is provably competitive with the solution obtained from the full dataset. As such, coreset-based data summarization techniques have been successfully applied to various problems, e.g., geometric optimization, clustering, and approximate query processing, for scaling them up to massive data. In this paper, we study coresets for the maxima representation of multidimensional data: given a set $P$ of points in $\mathbb{R}^d$, where $d$ is a small constant, and an error parameter $\varepsilon \in (0,1)$, a subset $Q \subseteq P$ is an $\varepsilon$-coreset for the maxima representation of $P$ iff the maximum of $Q$ is an $\varepsilon$-approximation of the maximum of $P$ for any vector $u \in \mathbb{R}^d$, where the maximum is taken over the inner products between the set of points ($P$ or $Q$) and $u$. We define a novel minimum $\varepsilon$-coreset problem that asks for an $\varepsilon$-coreset of the smallest size for the maxima representation of a point set. For the two-dimensional case, we develop an optimal polynomial-time algorithm for the minimum $\varepsilon$-coreset problem by transforming it into the shortest-cycle problem in a directed graph. Then, we prove that this problem is NP-hard in three or higher dimensions and present polynomial-time approximation algorithms in an arbitrary fixed dimension. Finally, we provide extensive experimental results on both real and synthetic datasets to demonstrate the superior performance of our proposed algorithms.
Improved Differentially Private Euclidean Distance Approximation
N. Stausholm. DOI: 10.1145/3452021.3458328
This work shows how to privately and more accurately estimate the Euclidean distance between pairs of vectors. Input vectors $x$ and $y$ are mapped to differentially private sketches $x'$ and $y'$, from which one can estimate the distance between $x$ and $y$. Our estimator relies on the Sparser Johnson-Lindenstrauss constructions by Kane & Nelson (Journal of the ACM 2014), which for any $0 < \alpha, \beta < 1/2$ have optimal output dimension $k = \Theta(\alpha^{-2} \log(1/\beta))$ and sparsity $s = O(\alpha^{-1} \log(1/\beta))$. We combine the constructions of Kane & Nelson with either the Laplace or the Gaussian mechanism from the differential privacy literature, depending on the privacy parameters $\varepsilon$ and $\delta$. We also suggest a differentially private version of the Fast Johnson-Lindenstrauss Transform (FJLT) by Ailon & Chazelle (SIAM Journal of Computing 2009), which offers a tradeoff of speed for variance for certain parameters. We answer an open question by Kenthapadi et al. (Journal of Privacy and Confidentiality 2013) by analyzing the privacy and utility guarantees of an estimator for Euclidean distance that relies on Laplacian rather than Gaussian noise. We prove that the Laplace mechanism yields lower variance than the Gaussian mechanism whenever $\delta < \beta^{O(1/\alpha)}$. Thus, our work improves on the work of Kenthapadi et al. by giving a more efficient estimator with lower variance for sufficiently small $\delta$. Our sketch also achieves pure differential privacy as a neat side effect of the Laplace mechanism, rather than the approximate differential privacy guarantee of the Gaussian mechanism, which may not be sufficiently strong for some settings. Our main result is a special case of more general, technical results proving that one can generally construct unbiased estimators for Euclidean distance with a high level of utility even under the constraint of differential privacy. The bulk of our analysis consists of proving that the variance of the estimator does not suffer too much in the presence of differential privacy.
Consistent Query Answering for Primary Keys on Path Queries
Paraschos Koutris, Xiating Ouyang, J. Wijsen. DOI: 10.1145/3452021.3458334
We study the data complexity of consistent query answering (CQA) on databases that may violate the primary key constraints. A repair is a maximal consistent subset of the database. For a Boolean query q, the problem CERTAINTY(q) takes a database as input, and asks whether or not each repair satisfies the query q. It is known that for any self-join-free Boolean conjunctive query q, CERTAINTY(q) is in FO, L-complete, or coNP-complete. In particular, CERTAINTY(q) is in FO for any self-join-free Boolean path query q. In this paper, we show that if self-joins are allowed, then the complexity of CERTAINTY(q) for Boolean path queries q exhibits a tetrachotomy between FO, NL-complete, PTIME-complete, and coNP-complete. Moreover, it is decidable, in polynomial time in the size of the query q, which of the four cases applies.