Privacy
Kobbi Nissim. DOI: 10.1145/3452021.3458816
There are significant gaps between legal and technical thinking around data privacy. Technical standards are described using mathematical language, whereas legal standards are not rigorous from a mathematical point of view and often resort to concepts which they only partially define. As a result, arguments about the adequacy of technical privacy measures for satisfying legal privacy often lack rigor, and their conclusions are uncertain. The uncertainty is exacerbated by a litany of successful privacy attacks on privacy measures thought to meet legal expectations but then shown to fall short of doing so. As computer systems manipulating individual privacy-sensitive data become integrated in almost every aspect of society, and as such systems increasingly make decisions of legal significance, the need to bridge the diverging, and sometimes conflicting, legal and technical approaches becomes urgent. We formulate and prove formal claims -- "legal theorems" -- addressing legal questions such as whether the use of technological measures satisfies the requirements of a legal privacy standard. In particular, we analyze the notion of singling out from the GDPR and whether technologies such as k-anonymity and differential privacy prevent singling out. Our long-term goal is to develop concepts that are, on the one hand, technical, so that they can be integrated into the design of computer systems, and, on the other hand, usable in legal reasoning and policymaking.
Data-Independent Space Partitionings for Summaries
Graham Cormode, M. Garofalakis, Michael Shekelyan. DOI: 10.1145/3452021.3458316
Histograms are a standard tool in data management for describing multidimensional data. It is often convenient or even necessary to define data-independent histograms, which partition space in advance without observing the data itself. Specific motivations arise in managing data when it is not suitable to frequently change the boundaries between histogram cells: for example, when the data is subject to many insertions and deletions; when data is distributed across multiple systems; or when producing a privacy-preserving representation of the data. The baseline approach is to consider an equiwidth histogram, i.e., a regular grid over the space. However, this is not optimal for the objective of splitting the multidimensional space into (possibly overlapping) bins, such that each box can be rebuilt using a set of non-overlapping bins with minimal excess (or deficit) of volume. Thus, we investigate how to split the space into bins and identify novel solutions that offer a good balance of desirable properties. As many data processing tools require a dataset as input, we propose efficient methods for obtaining synthetic point sets that match the histograms over the overlapping bins.
Model-theoretic Characterizations of Rule-based Ontologies
Marco Console, Phokion G. Kolaitis, Andreas Pieris. DOI: 10.1145/3452021.3458310
An ontology specifies an abstract model of a domain of interest via a formal language that is typically based on logic. Although description logics are popular formalisms for modeling ontologies, tuple-generating dependencies (tgds), originally introduced as a unifying framework for database integrity constraints, and later on used in data exchange and integration, are also well suited for modeling ontologies that are intended for data-intensive tasks. The reason is that, unlike description logics, tgds can easily handle higher-arity relations that naturally occur in relational databases. In recent years, there has been an extensive study of tgd-ontologies and of their applications to several different data-intensive tasks. However, the fundamental question of whether the expressive power of tgd-ontologies can be characterized in terms of model-theoretic properties remains largely unexplored. We establish several characterizations of tgd-ontologies, including characterizations of ontologies specified by such central classes of tgds as full, linear, guarded, and frontier-guarded tgds. Our characterizations use the well-known notions of critical instance and direct product, as well as a novel locality property for tgd-ontologies. We further use this locality property to decide whether an ontology expressed by frontier-guarded (respectively, guarded) tgds can be expressed by tgds in the weaker class of guarded (respectively, linear) tgds, and effectively construct such an equivalent ontology if one exists.
Frequent Elements with Witnesses in Data Streams
C. Konrad. DOI: 10.1145/3452021.3458330
Detecting frequent elements is among the oldest and most-studied problems in the area of data streams. Given a stream of $m$ data items in $\{1, 2, \dots, n\}$, the objective is to output items that appear at least $d$ times, for some threshold parameter $d$, and provably optimal algorithms are known today. However, in many applications, knowing only the frequent elements themselves is not enough: for example, an Internet router may not only need to know the most frequent destination IP addresses of forwarded packets, but also the timestamps of when these packets appeared, or any other meta-data that "arrived" with the packets, e.g., their source IP addresses. In this paper, we introduce the witness version of the frequent elements problem: given a desired approximation guarantee $\alpha \ge 1$ and a desired frequency $d \le \Delta$, where $\Delta$ is the frequency of the most frequent item, the objective is to report an item together with at least $d / \alpha$ timestamps of when the item appeared in the stream (or any other meta-data that arrived with the items). We give provably optimal algorithms for both the insertion-only and insertion-deletion stream settings: in insertion-only streams, we show that space $\tilde{O}(n + d \cdot n^{1/\alpha})$ is necessary and sufficient for every integral $1 \le \alpha \le \log n$. In insertion-deletion streams, we show that space $\tilde{O}(\frac{n \cdot d}{\alpha^2})$ is necessary and sufficient for every $\alpha \le \sqrt{n}$.
Estimating the Size of Union of Sets in Streaming Models
Kuldeep S. Meel, N. V. Vinodchandran, Sourav Chakraborty. DOI: 10.1145/3452021.3458333
In this paper we study the problem of estimating the size of the union of sets $S_1, \dots, S_M$, where each set $S_i \subseteq \Omega$ (for some discrete universe $\Omega$) is implicitly presented and comes in a streaming fashion. We define the notion of Delphic sets to capture a class of streaming problems in which membership, sampling, and counting calls to the sets are efficient. In particular, we show that our notion of Delphic sets captures three well-known problems: Klee's measure problem (discrete version), test coverage estimation, and model counting of DNF formulas. Klee's measure problem corresponds to computing the volume of a union of multi-dimensional axis-aligned rectangles, i.e., every $d$-dimensional axis-aligned rectangle can be defined as $[a_1,b_1] \times [a_2,b_2] \times \dots \times [a_d, b_d]$. The problem of test coverage estimation focuses on the computation of a coverage measure for a given testing array in the context of combinatorial testing, which is a fundamental technique in hardware and software testing. Finally, given a DNF formula $\varphi = T_1 \vee T_2 \vee \dots \vee T_M$, the problem of model counting seeks to compute the number of satisfying assignments of $\varphi$. The primary contribution of our work is a simple and efficient sampling-based algorithm, called hybrid, for estimating the size of the union of sets in the streaming setting. Our algorithm has space complexity $O(R \log |\Omega|)$ and update time $O(R \log R \cdot \log(M/\delta) \cdot \log|\Omega|)$, where $R = O\left(\log(M/\delta) \cdot \varepsilon^{-2}\right)$. Consequently, our algorithm provides the first algorithm with linear dependence on $d$ for Klee's measure problem in the streaming setting for $d>1$, thereby settling the open problem of Tirthapura and Woodruff (PODS 2012). Furthermore, a straightforward application of our algorithm yields an efficient algorithm for the coverage estimation problem in the streaming setting. We then investigate whether the space complexity for coverage estimation can be further improved, and in this context we present another streaming algorithm that uses near-optimal $O(t \log n / \varepsilon^2)$ space but relies on an update algorithm in $\mathrm{P}^{\mathrm{NP}}$, thereby showcasing an interesting time vs. space trade-off in the streaming setting. Finally, we demonstrate the generality of our Delphic sets by obtaining a streaming algorithm for model counting of DNF formulas. It is worth remarking that we view a key strength of our work to be the simplicity of both the algorithm and its theoretical analysis, which makes it amenable to practical implementation and easy adoption.
2021 ACM PODS Alberto O. Mendelzon Test-of-Time Award
A. Bonifati, R. Pagh, T. Schwentick. DOI: 10.1145/3452021.3452909
The ACM PODS Alberto O. Mendelzon Test-of-Time Award is awarded every year to a paper or a small number of papers published in the PODS proceedings ten years prior that had the most impact in terms of research, methodology, or transfer to practice over the intervening decade. The PODS Executive Committee has appointed us to serve as the Award Committee for 2021. After careful consideration and having solicited external nominations and advice, we have selected the following paper as the award winner for 2021: "Tight bounds for L_p samplers, finding duplicates in streams, and related problems" by Hossein Jowhari, Mert Sağlam, and Gábor Tardos. Citation: This paper addresses a question posed by Cormode et al. in VLDB 2005, namely whether a uniform (or nearly uniform) sample can be maintained in a dynamically changing database, where data items may be inserted and deleted, while using space much smaller than the size of the database. More generally, it considers maintaining an L_p sample, where an element must be sampled with probability proportional to w^p (possibly up to some small relative error), where w is a weight that may change dynamically. In SODA 2010, Monemizadeh and Woodruff showed that it is possible to perform L_p sampling in a stream using polylogarithmic space. The PODS 2011 paper by Jowhari, Sağlam, and Tardos essentially closes the problem by presenting algorithms with improved space usage, as well as a matching lower bound showing that it is not possible to asymptotically improve the upper bounds. The paper has had a considerable impact on the design of algorithms in streaming and distributed models of computation, where L_p sampling has become an essential part of the toolbox. The survey "L_p Samplers and Their Applications" in ACM Computing Surveys (2019) presents a number of surprising applications, for example in graph algorithms and in randomized numerical linear algebra.
Modern Lower Bound Techniques in Database Theory and Constraint Satisfaction
D. Marx. DOI: 10.1145/3452021.3458814
Conditional lower bounds based on $P \neq NP$, the Exponential-Time Hypothesis (ETH), or similar complexity assumptions can provide very useful information about what type of algorithms are likely to be possible. Ideally, such lower bounds would be able to demonstrate that the best known algorithms are essentially optimal and cannot be improved further. In this tutorial, we overview different types of lower bounds, and see how they can be applied to problems in database theory and constraint satisfaction.
Minimum Coresets for Maxima Representation of Multidimensional Data
Yanhao Wang, M. Mathioudakis, Yuchen Li, K. Tan. DOI: 10.1145/3452021.3458322
Coresets are succinct summaries of large datasets such that, for a given problem, the solution obtained from a coreset is provably competitive with the solution obtained from the full dataset. As such, coreset-based data summarization techniques have been successfully applied to various problems, e.g., geometric optimization, clustering, and approximate query processing, for scaling them up to massive data. In this paper, we study coresets for the maxima representation of multidimensional data: given a set $P$ of points in $\mathbb{R}^d$, where $d$ is a small constant, and an error parameter $\varepsilon \in (0,1)$, a subset $Q \subseteq P$ is an $\varepsilon$-coreset for the maxima representation of $P$ iff the maximum of $Q$ is an $\varepsilon$-approximation of the maximum of $P$ for any vector $u \in \mathbb{R}^d$, where the maximum is taken over the inner products between the set of points ($P$ or $Q$) and $u$. We define a novel minimum $\varepsilon$-coreset problem that asks for an $\varepsilon$-coreset of the smallest size for the maxima representation of a point set. For the two-dimensional case, we develop an optimal polynomial-time algorithm for the minimum $\varepsilon$-coreset problem by transforming it into the shortest-cycle problem in a directed graph. Then, we prove that this problem is NP-hard in three or higher dimensions and present polynomial-time approximation algorithms in an arbitrary fixed dimension. Finally, we provide extensive experimental results on both real and synthetic datasets to demonstrate the superior performance of our proposed algorithms.
Improved Differentially Private Euclidean Distance Approximation
N. Stausholm. DOI: 10.1145/3452021.3458328
This work shows how to privately and more accurately estimate the Euclidean distance between pairs of vectors. Input vectors $x$ and $y$ are mapped to differentially private sketches $x'$ and $y'$, from which one can estimate the distance between $x$ and $y$. Our estimator relies on the Sparser Johnson-Lindenstrauss constructions by Kane & Nelson (Journal of the ACM 2014), which for any $0 < \alpha, \beta < 1/2$ have optimal output dimension $k = \Theta(\alpha^{-2} \log(1/\beta))$ and sparsity $s = O(\alpha^{-1} \log(1/\beta))$. We combine the constructions of Kane & Nelson with either the Laplace or the Gaussian mechanism from the differential privacy literature, depending on the privacy parameters $\varepsilon$ and $\delta$. We also suggest a differentially private version of the Fast Johnson-Lindenstrauss Transform (FJLT) by Ailon & Chazelle (SIAM Journal of Computing 2009), which offers a tradeoff of speed for variance for certain parameters. We answer an open question by Kenthapadi et al. (Journal of Privacy and Confidentiality 2013) by analyzing the privacy and utility guarantees of an estimator for Euclidean distance that relies on Laplacian rather than Gaussian noise. We prove that the Laplace mechanism yields lower variance than the Gaussian mechanism whenever $\delta < \beta^{O(1/\alpha)}$. Thus, our work improves on the work of Kenthapadi et al. by giving a more efficient estimator with lower variance for sufficiently small $\delta$. Our sketch also achieves pure differential privacy as a neat side effect of the Laplace mechanism, rather than the approximate differential privacy guarantee of the Gaussian mechanism, which may not be sufficiently strong for some settings. Our main result is a special case of more general, technical results proving that one can generally construct unbiased estimators for Euclidean distance with a high level of utility even under the constraint of differential privacy. The bulk of our analysis consists of proving that the variance of the estimator does not suffer too much in the presence of differential privacy.
Consistent Query Answering for Primary Keys on Path Queries
Paraschos Koutris, Xiating Ouyang, J. Wijsen. DOI: 10.1145/3452021.3458334
We study the data complexity of consistent query answering (CQA) on databases that may violate the primary key constraints. A repair is a maximal consistent subset of the database. For a Boolean query q, the problem CERTAINTY(q) takes a database as input, and asks whether or not each repair satisfies the query q. It is known that for any self-join-free Boolean conjunctive query q, CERTAINTY(q) is in FO, L-complete, or coNP-complete. In particular, CERTAINTY(q) is in FO for any self-join-free Boolean path query q. In this paper, we show that if self-joins are allowed, then the complexity of CERTAINTY(q) for Boolean path queries q exhibits a tetrachotomy between FO, NL-complete, PTIME-complete, and coNP-complete. Moreover, it is decidable, in polynomial time in the size of the query q, which of the four cases applies.