
Latest Publications: ACM Transactions on Database Systems (TODS)

Designing a Query Language for RDF
Pub Date : 2017-10-27 DOI: 10.1145/3129247
M. Arenas, M. Ugarte
When querying a Resource Description Framework (RDF) graph, a prominent feature is the possibility of extending the answer to a query with optional information. However, the definition of this feature in SPARQL—the standard RDF query language—has raised some important issues. Most notably, the use of this feature increases the complexity of the evaluation problem, and its closed-world semantics is in conflict with the underlying open-world semantics of RDF. Many approaches for fixing such problems have been proposed, the most prominent being the introduction of the semantic notion of weakly monotone SPARQL query. Weakly monotone SPARQL queries have shaped the class of queries that conform to the open-world semantics of RDF. Unfortunately, finding an effective way of restricting SPARQL to the fragment of weakly monotone queries has proven to be an elusive problem. In practice, the most widely adopted fragment for writing SPARQL queries is based on the syntactic notion of well-designedness. This notion has proven to be a good approach for writing SPARQL queries, but its expressive power has yet to be fully understood. The starting point of this article is to understand the relation between well-designed queries and the semantic notion of weak monotonicity. It is known that every well-designed SPARQL query is weakly monotone; as our first contribution we prove that the converse does not hold, even if an extension of this notion based on the use of disjunction is considered. Given this negative result, we embark on the task of defining syntactic fragments that are weakly monotone and have higher expressive power than the fragment of well-designed queries. To this end, we move to a more general scenario where infinite RDF graphs are also allowed, so interpolation techniques studied for first-order logic can be applied. With the use of these techniques, we are able to define a new operator for SPARQL that gives rise to a query language with the desired properties (over finite and infinite RDF graphs). It should be noticed that every query in this fragment is weakly monotone if we restrict the semantics to finite RDF graphs. Moreover, we use this result to provide a simple characterization of the class of monotone CONSTRUCT queries, that is, the class of SPARQL queries that produce RDF graphs as output. Finally, we pinpoint the complexity of the evaluation problem for the query languages identified in the article.
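As a hedged illustration of the optional-information feature the article starts from (the data, URIs, and use of rdflib below are invented for this sketch and do not come from the article), a query with OPTIONAL first returns a partial answer; once a matching triple is added, that partial answer is extended rather than merely kept, which is the behaviour the notion of weak monotonicity captures.

```python
# Toy sketch, assuming rdflib and an invented example.org vocabulary.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.alice, EX.name, Literal("Alice")))

q = """
PREFIX ex: <http://example.org/>
SELECT ?who ?mail WHERE {
  ?who ex:name ?name .
  OPTIONAL { ?who ex:mail ?mail }
}
"""

print(list(g.query(q)))  # one answer, with ?mail left unbound (None)

g.add((EX.alice, EX.mail, Literal("alice@example.org")))
print(list(g.query(q)))  # the old partial answer is now extended to a full one:
                         # weakly monotone, but not monotone in the strict sense,
                         # since the mapping with ?mail unbound is gone
```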
Citations: 12
PrivBayes
Pub Date : 2017-10-27 DOI: 10.1145/3134428
Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, D. Srivastava, Xiaokui Xiao
Privacy-preserving data publishing is an important problem that has been the focus of extensive study. The state-of-the-art solution for this problem is differential privacy, which offers a strong degree of privacy protection without making restrictive assumptions about the adversary. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. In particular, when the input dataset contains a large number of attributes, existing methods require injecting a prohibitive amount of noise compared to the signal in the data, which renders the published data next to useless. To address the deficiency of the existing methods, this paper presents PrivBayes, a differentially private method for releasing high-dimensional data. Given a dataset D, PrivBayes first constructs a Bayesian network N, which (i) provides a succinct model of the correlations among the attributes in D and (ii) allows us to approximate the distribution of data in D using a set P of low-dimensional marginals of D. After that, PrivBayes injects noise into each marginal in P to ensure differential privacy and then uses the noisy marginals and the Bayesian network to construct an approximation of the data distribution in D. Finally, PrivBayes samples tuples from the approximate distribution to construct a synthetic dataset, and then releases the synthetic data. Intuitively, PrivBayes circumvents the curse of dimensionality, as it injects noise into the low-dimensional marginals in P instead of the high-dimensional dataset D. Private construction of Bayesian networks turns out to be significantly challenging, and we introduce a novel approach that uses a surrogate function for mutual information to build the model more accurately. We experimentally evaluate PrivBayes on real data and demonstrate that it significantly outperforms existing solutions in terms of accuracy.
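The following is a highly simplified sketch of the marginal-noising and sampling steps described above, written in Python/NumPy purely for illustration. It assumes a fixed two-attribute network A -> B and omits the differentially private structure learning (including the mutual-information surrogate), so it should not be read as the authors' algorithm.

```python
# Hedged sketch: noisy low-dimensional marginals plus sampling, for a fixed toy network A -> B.
import numpy as np

def noisy_distribution(counts, epsilon):
    # Add Laplace noise to raw counts, clip negatives, renormalise to a distribution.
    noisy = counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0.0, None)
    total = noisy.sum()
    return noisy / total if total > 0 else np.full(counts.shape, 1.0 / counts.size)

epsilon = 1.0
counts_ab = np.array([[40.0, 10.0],   # counts for (A=0, B=0), (A=0, B=1)
                      [ 5.0, 45.0]])  # counts for (A=1, B=0), (A=1, B=1)

p_ab = noisy_distribution(counts_ab, epsilon)           # noisy 2-way marginal of (A, B)
p_a = p_ab.sum(axis=1)                                  # P(A)
p_b_given_a = p_ab / p_ab.sum(axis=1, keepdims=True)    # P(B | A)

# Sample synthetic tuples along the network order: first A, then B given A.
rng = np.random.default_rng(0)
synthetic = []
for _ in range(100):
    a = rng.choice(2, p=p_a / p_a.sum())
    b = rng.choice(2, p=p_b_given_a[a])
    synthetic.append((int(a), int(b)))
```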
Citations: 103
On the Expressive Power of Query Languages for Matrices
Pub Date : 2017-09-25 DOI: 10.1145/3331445
R. Brijder, Floris Geerts, J. V. D. Bussche, Timmy Weerwag
We investigate the expressive power of MATLANG, a formal language for matrix manipulation based on common matrix operations and linear algebra. The language can be extended with the operation inv for inverting a matrix. In MATLANG + inv, we can compute the transitive closure of directed graphs, whereas we show that this is not possible without inversion. Indeed, we show that the basic language can be simulated in the relational algebra with arithmetic operations, grouping, and summation. We also consider an operation eigen for diagonalizing a matrix. It is defined such that for each eigenvalue a set of mutually orthogonal eigenvectors is returned that span the eigenspace of that eigenvalue. We show that inv can be expressed in MATLANG + eigen. We put forward the open question whether there are Boolean queries about matrices, or generic queries about graphs, expressible in MATLANG + eigen but not in MATLANG + inv. Finally, the evaluation problem for MATLANG + eigen is shown to be complete for the complexity class ∃ R.
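As a concrete numeric illustration of the role of inversion mentioned above (a NumPy sketch on an invented 3-node graph, not MATLANG itself), the Neumann series (I - A/(n+1))^{-1} = I + A/(n+1) + (A/(n+1))^2 + ... converges for a 0/1 adjacency matrix A, and an off-diagonal entry of the inverse is positive exactly when the corresponding vertex is reachable. This is one standard way to see that transitive closure becomes expressible once inv is available.

```python
# Numeric sketch with NumPy; the adjacency matrix is invented for illustration.
import numpy as np

A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)   # edges 0 -> 1 -> 2
n = A.shape[0]

# Row sums of A/(n+1) are < 1, so this inverse equals the sum of all powers of A/(n+1).
M = np.linalg.inv(np.eye(n) - A / (n + 1))
reachable = (M > 1e-12) & ~np.eye(n, dtype=bool)   # positive off-diagonal entries
print(reachable)
# [[False  True  True]
#  [False False  True]
#  [False False False]]
```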
Citations: 36
BonXai
Pub Date : 2017-08-24 DOI: 10.1145/3105960
W. Martens, F. Neven, Matthias Niewerth, Thomas Schwentick
While the migration from DTD to XML Schema was driven by a need for increased expressivity and flexibility, the latter was also significantly more complex to use and understand. Whereas DTDs are characterized by their simplicity, XML Schema Documents are notoriously difficult. In this article, we introduce the XML specification language BonXai, which incorporates many features of XML Schema but is arguably almost as easy to use as DTDs. In brief, the latter is achieved by sacrificing the explicit use of types in favor of simple patterns expressing contexts for elements. The goal of BonXai is not to replace XML Schema but rather to provide a simpler alternative for users who want to go beyond the expressiveness and features of DTD but do not need the explicit use of types. Furthermore, XML Schema processing tools can be used as a back-end for BonXai, since BonXai can be automatically converted into XML Schema. A particularly strong point of BonXai is its solid foundation rooted in a decade of theoretical work around pattern-based schemas. We present a formal model for a core fragment of BonXai and the translation algorithms to and from a core fragment of XML Schema. We prove that BonXai and XML Schema can be converted back-and-forth on the level of tree languages and we formally study the size trade-offs between the two languages.
Citations: 4
Query Nesting, Assignment, and Aggregation in SPARQL 1.1
Pub Date : 2017-08-12 DOI: 10.1145/3083898
M. Kaminski, Egor V. Kostylev, B. C. Grau
Answering aggregate queries is a key requirement of emerging applications of Semantic Technologies, such as data warehousing, business intelligence, and sensor networks. To fulfil the requirements of such applications, the standardization of SPARQL 1.1 led to the introduction of a wide range of constructs that enable value computation, aggregation, and query nesting. In this article, we provide an in-depth formal analysis of the semantics and expressive power of these new constructs as defined in the SPARQL 1.1 specification, and hence lay the necessary foundations for the development of robust, scalable, and extensible query engines supporting complex numerical and analytics tasks.
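For readers unfamiliar with the constructs being analysed, here is a small invented example (run with rdflib only for concreteness; the vocabulary and data are made up) that combines a nested subquery, the COUNT aggregate with GROUP BY, and value assignment via BIND.

```python
# Toy data and query; the schema and URIs are invented for this sketch.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
for sensor, room in [(EX.s1, EX.r1), (EX.s2, EX.r1), (EX.s3, EX.r2)]:
    g.add((sensor, EX.locatedIn, room))

q = """
PREFIX ex: <http://example.org/>
SELECT ?room ?label ?cnt WHERE {
  { SELECT ?room (COUNT(?sensor) AS ?cnt)      # aggregation inside a nested query
    WHERE { ?sensor ex:locatedIn ?room . }
    GROUP BY ?room }
  BIND (STR(?room) AS ?label)                  # value assignment
}
"""
for row in g.query(q):
    print(row.room, row.label, row.cnt)        # e.g. ex:r1 ... 2 and ex:r2 ... 1
```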
Citations: 22
On the Hardness and Approximation of Euclidean DBSCAN
Pub Date : 2017-07-31 DOI: 10.1145/3083897
Junhao Gan, Yufei Tao
DBSCAN is a method proposed in 1996 for clustering multi-dimensional points, and has received extensive applications. Its computational hardness is still unsolved to this date. The original KDD'96 paper claimed an algorithm of O(n log n) “average runtime complexity” (where n is the number of data points) without a rigorous proof. In 2013, a genuine O(n log n)-time algorithm was found in 2D space under Euclidean distance. The hardness of dimensionality d ≥ 3 has remained open ever since. This article considers the problem of computing DBSCAN clusters from scratch (assuming no existing indexes) under Euclidean distance. We prove that, for d ≥ 3, the problem requires Ω(n^{4/3}) time to solve, unless very significant breakthroughs—ones widely believed to be impossible—could be made in theoretical computer science. Motivated by this, we propose a relaxed version of the problem called ρ-approximate DBSCAN, which returns the same clusters as DBSCAN, unless the clusters are “unstable” (i.e., they change once the input parameters are slightly perturbed). The ρ-approximate problem can be settled in O(n) expected time regardless of the constant dimensionality d. The article also enhances the previous result on the exact DBSCAN problem in 2D space. We show that, if the n data points have been pre-sorted on each dimension (i.e., one sorted list per dimension), the problem can be settled in O(n) worst-case time. As a corollary, when all the coordinates are integers, the 2D DBSCAN problem can be solved in O(n log log n) time deterministically, improving the existing O(n log n) bound.
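The sketch below (pure NumPy, invented data) illustrates the general grid technique behind fast and approximate DBSCAN variants rather than the article's exact ρ-approximate algorithm: with cells of side ε/√d, all points in one cell are within ε of each other, and neighbour counting only needs to inspect a constant number of surrounding cells.

```python
# Grid-based core-point detection; a didactic sketch, not the paper's algorithm.
import numpy as np
from collections import defaultdict

def grid_core_points(points, eps, min_pts):
    n, d = points.shape
    side = eps / np.sqrt(d)                      # any two points in one cell are within eps
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(np.floor(p / side).astype(int))].append(idx)

    r = int(np.ceil(np.sqrt(d)))                 # points within eps lie at most r cells away per axis
    offsets = [np.array(o) - r for o in np.ndindex(*([2 * r + 1] * d))]

    core = []
    for cell, members in cells.items():
        for i in members:
            count = 0
            for off in offsets:
                for j in cells.get(tuple(np.array(cell) + off), []):
                    if np.linalg.norm(points[i] - points[j]) <= eps:
                        count += 1
            if count >= min_pts:                 # the point itself is included in the count
                core.append(i)
    return core

pts = np.random.default_rng(1).uniform(size=(200, 3))
print(len(grid_core_points(pts, eps=0.2, min_pts=10)))
```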
Citations: 58
Detecting Inclusion Dependencies on Very Many Tables
Pub Date : 2017-07-31 DOI: 10.1145/3105959
Fabian Tschirschnitz, Thorsten Papenbrock, Felix Naumann
Detecting inclusion dependencies, the prerequisite of foreign keys, in relational data is a challenging task. Detecting them among the hundreds of thousands or even millions of tables on the web is daunting. Still, such inclusion dependencies can help connect disparate pieces of information on the Web and reveal unknown relationships among tables. With the algorithm Many, we present a novel inclusion dependency detection algorithm, specialized for the very many—but typically small—tables found on the Web. We make use of Bloom filters and indexed bit-vectors to show the feasibility of our approach. Our evaluation on two corpora of Web tables shows a superior runtime over known approaches and its usefulness to reveal hidden structures on the Web.
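To give a flavour of the Bloom-filter pruning described above, here is a small self-contained sketch (the filter size, hash scheme, and data are invented; this is not the authors' implementation): each column is summarised as a bit vector, and a candidate inclusion dependency A ⊆ B is only verified exactly if A's bits are a subset of B's.

```python
# Toy Bloom-filter pruning for inclusion-dependency candidates.
import hashlib

M = 256  # filter size in bits (toy value)

def bloom(values, k=3):
    bits = 0
    for v in values:
        for i in range(k):
            h = int(hashlib.sha1(f"{i}:{v}".encode()).hexdigest(), 16) % M
            bits |= 1 << h
    return bits

def may_be_included(bits_a, bits_b):
    # Necessary (but not sufficient) condition for column A to be included in column B.
    return (bits_a & ~bits_b) == 0

col_a, col_b, col_c = ["x", "y"], ["x", "y", "z"], ["p", "q"]
fa, fb, fc = bloom(col_a), bloom(col_b), bloom(col_c)
print(may_be_included(fa, fb))  # True: candidate survives, verify with exact values
print(may_be_included(fa, fc))  # almost certainly False: candidate pruned
```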
Citations: 23
DBSCAN Revisited, Revisited
Pub Date : 2017-07-31 DOI: 10.1145/3068335
Erich Schubert, J. Sander, M. Ester, H. Kriegel, Xiaowei Xu
At SIGMOD 2015, an article was presented with the title “DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation” that won the conference’s best paper award. In this technical correspondence, we want to point out some inaccuracies in the way DBSCAN was represented, and why the criticism should have been directed at the assumption about the performance of spatial index structures such as R-trees and not at an algorithm that can use such indexes. We will also discuss the relationship of DBSCAN performance and the indexability of the dataset, and discuss some heuristics for choosing appropriate DBSCAN parameters. Some indicators of bad parameters will be proposed to help guide future users of this algorithm in choosing parameters such as to obtain both meaningful results and good performance. In new experiments, we show that the new SIGMOD 2015 methods do not appear to offer practical benefits if the DBSCAN parameters are well chosen and thus they are primarily of theoretical interest. In conclusion, the original DBSCAN algorithm with effective indexes and reasonably chosen parameter values performs competitively compared to the method proposed by Gan and Tao.
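As an example of the kind of parameter heuristic alluded to above (a sketch only: scikit-learn, random data, and a crude percentile standing in for visually picking the knee of the sorted k-distance plot), one fixes minPts, looks at each point's distance to its minPts-th neighbour, and chooses ε near the bend of that sorted curve before running index-backed DBSCAN.

```python
# Sketch of the sorted k-distance heuristic; data and cutoff are illustrative only.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(500, 2))
min_pts = 4

knn = NearestNeighbors(n_neighbors=min_pts).fit(X)
kth_dist = np.sort(knn.kneighbors(X)[0][:, -1])     # distance to the minPts-th neighbour, sorted
eps = kth_dist[int(0.95 * len(kth_dist))]           # crude stand-in for eyeballing the knee

labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
print("clusters:", len(set(labels) - {-1}), "noise:", int((labels == -1).sum()))
```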
Citations: 966
Efficient SimRank-Based Similarity Join
Pub Date : 2017-07-31 DOI: 10.1145/3083899
Weiguo Zheng, Lei Zou, Lei Chen, Dongyan Zhao
Graphs have been widely used to model complex data in many real-world applications. Answering vertex join queries over large graphs is meaningful and interesting, which can benefit friend recommendation in social networks and link prediction, and so on. In this article, we adopt “SimRank” [13] to evaluate the similarity between two vertices in a large graph because of its generality. Note that “SimRank” is purely structure-dependent, and it does not rely on domain knowledge. Specifically, we define a SimRank-based join (SRJ) query to find all vertex pairs satisfying the threshold from two sets of vertices U and V. To reduce the search space, we propose a shortest-path-distance-based upper bound for SimRank scores to prune unpromising vertex pairs. In the verification, we propose a novel index, called h-go cover+, to efficiently compute the SimRank score of any single vertex pair. Given a graph G, we only materialize the SimRank scores of a small proportion of vertex pairs (i.e., the h-go cover+ vertex pairs), based on which the SimRank score of any vertex pair can be computed easily. To find the h-go cover+ vertex pairs, we propose an efficient method without building the vertex-pair graph. Hence, large graphs can be dealt with easily. Extensive experiments over both real and synthetic datasets confirm the efficiency of our solution.
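For readers who have not seen SimRank before, the sketch below computes it by the textbook fixed-point iteration on a toy graph (pure Python, invented data; it does not use the h-go cover+ index proposed in the article).

```python
# Naive iterative SimRank on a tiny graph; for illustration only.
def simrank(in_neighbors, c=0.6, iters=10):
    nodes = list(in_neighbors)
    s = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif in_neighbors[a] and in_neighbors[b]:
                    total = sum(s[(x, y)] for x in in_neighbors[a] for y in in_neighbors[b])
                    new[(a, b)] = c * total / (len(in_neighbors[a]) * len(in_neighbors[b]))
                else:
                    new[(a, b)] = 0.0
        s = new
    return s

# Toy graph: both u and v are pointed to only by w, so they are maximally similar.
scores = simrank({"u": ["w"], "v": ["w"], "w": []})
print(round(scores[("u", "v")], 3))  # 0.6 (= c), since u and v share identical in-neighbourhoods
```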
Citations: 14
Consistent Query Answering for Self-Join-Free Conjunctive Queries Under Primary Key Constraints
Pub Date : 2017-06-01 DOI: 10.1145/3068334
Paraschos Koutris, J. Wijsen
A relational database is said to be uncertain if primary key constraints can possibly be violated. A repair (or possible world) of an uncertain database is obtained by selecting a maximal number of tuples without ever selecting two distinct tuples with the same primary key value. For any Boolean query q, CERTAINTY(q) is the problem that takes an uncertain database db as input and asks whether q is true in every repair of db. The complexity of this problem has been particularly studied for q ranging over the class of self-join-free Boolean conjunctive queries. A research challenge is to determine, given q, whether CERTAINTY(q) belongs to complexity classes FO, P, or coNP-complete. In this article, we combine existing techniques for studying this complexity classification task. We show that, for any self-join-free Boolean conjunctive query q, it can be decided whether or not CERTAINTY(q) is in FO. We additionally show how to construct a single SQL query for solving CERTAINTY(q) if it is in FO. Further, for any self-join-free Boolean conjunctive query q, CERTAINTY(q) is either in P or coNP-complete and the complexity dichotomy is effective. This settles a research question that has been open for 10 years.
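To make the notion of a single SQL query for CERTAINTY(q) concrete, here is a hedged toy example (sqlite3, an invented one-relation schema, and one of the simplest queries admitting a first-order rewriting; it is not taken from the article): for the Boolean query "is some tuple R(k, 'x') present?" under primary key k, the rewriting asks for a key group in which every tuple has a = 'x', so the fact survives in every repair.

```python
# Toy consistent-query-answering rewriting, executed with sqlite3 for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R (k INTEGER, a TEXT)")     # the key k may be violated in the stored data
con.executemany("INSERT INTO R VALUES (?, ?)",
                [(1, "x"), (1, "y"),                   # key group 1: uncertain about 'x'
                 (2, "x"), (2, "x")])                  # key group 2: 'x' in every repair

certain = con.execute("""
    SELECT EXISTS (
        SELECT 1 FROM R AS r1
        WHERE r1.a = 'x'
          AND NOT EXISTS (
              SELECT 1 FROM R AS r2 WHERE r2.k = r1.k AND r2.a <> 'x'
          )
    )
""").fetchone()[0]
print(bool(certain))  # True, because of key group 2
```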
Citations: 35