This article studies the MaxRS problem in spatial databases. Given a set O of weighted points and a rectangle r of a given size, the goal of the MaxRS problem is to find a location of r such that the sum of the weights of all the points covered by r is maximized. This problem is useful in many location-based services such as finding the best place for a new franchise store with a limited delivery range and finding the hotspot with the largest number of nearby attractions for a tourist with a limited reachable range. However, the problem has been studied mainly from a theoretical perspective, particularly in computational geometry. The existing algorithms from the computational geometry community are in-memory algorithms that do not guarantee scalability. In this article, we propose a scalable external-memory algorithm (ExactMaxRS) for the MaxRS problem that is optimal in terms of the I/O complexity. In addition, we propose an approximation algorithm (ApproxMaxCRS) for the MaxCRS problem, the circle version of the MaxRS problem. We prove the correctness and optimality of the ExactMaxRS algorithm along with the approximation bound of the ApproxMaxCRS algorithm. Furthermore, motivated by the fact that all the existing solutions simply assume that there is no tied area for the best location, we extend the MaxRS problem to a more fundamental problem, namely AllMaxRS, so that all the locations with the same best score can be retrieved. We first prove that the AllMaxRS problem cannot be trivially solved by applying the techniques for the MaxRS problem. Then we propose an output-sensitive external-memory algorithm (TwoPhaseMaxRS) that gives the exact solution for the AllMaxRS problem through two phases. Also, we prove both the soundness and completeness of the result returned from TwoPhaseMaxRS. From extensive experimental results, we show that ExactMaxRS and ApproxMaxCRS are several orders of magnitude faster than methods adapted from existing algorithms, the approximation bound in practice is much better than the theoretical bound of ApproxMaxCRS, and TwoPhaseMaxRS is not only much faster but also more robust than the straightforward extension of ExactMaxRS.
{"title":"Maximizing Range Sum in External Memory","authors":"Dong-Wan Choi, C. Chung, Yufei Tao","doi":"10.1145/2629477","DOIUrl":"https://doi.org/10.1145/2629477","url":null,"abstract":"This article studies the MaxRS problem in spatial databases. Given a set O of weighted points and a rectangle r of a given size, the goal of the MaxRS problem is to find a location of r such that the sum of the weights of all the points covered by r is maximized. This problem is useful in many location-based services such as finding the best place for a new franchise store with a limited delivery range and finding the hotspot with the largest number of nearby attractions for a tourist with a limited reachable range. However, the problem has been studied mainly in the theoretical perspective, particularly in computational geometry. The existing algorithms from the computational geometry community are in-memory algorithms that do not guarantee the scalability. In this article, we propose a scalable external-memory algorithm (ExactMaxRS) for the MaxRS problem that is optimal in terms of the I/O complexity. In addition, we propose an approximation algorithm (ApproxMaxCRS) for the MaxCRS problem that is a circle version of the MaxRS problem. We prove the correctness and optimality of the ExactMaxRS algorithm along with the approximation bound of the ApproxMaxCRS algorithm.\u0000 Furthermore, motivated by the fact that all the existing solutions simply assume that there is no tied area for the best location, we extend the MaxRS problem to a more fundamental problem, namely AllMaxRS, so that all the locations with the same best score can be retrieved. We first prove that the AllMaxRS problem cannot be trivially solved by applying the techniques for the MaxRS problem. Then we propose an output-sensitive external-memory algorithm (TwoPhaseMaxRS) that gives the exact solution for the AllMaxRS problem through two phases. Also, we prove both the soundness and completeness of the result returned from TwoPhaseMaxRS.\u0000 From extensive experimental results, we show that ExactMaxRS and ApproxMaxCRS are several orders of magnitude faster than methods adapted from existing algorithms, the approximation bound in practice is much better than the theoretical bound of ApproxMaxCRS, and TwoPhaseMaxRS is not only much faster but also more robust than the straightforward extension of ExactMaxRS.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"68 1","pages":"21:1-21:44"},"PeriodicalIF":1.8,"publicationDate":"2014-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78249478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conventional spatial indexes, represented by the R-tree, employ multidimensional tree structures that are complicated and require enormous efforts to implement in a full-fledged database management system (DBMS). An alternative approach for supporting spatial queries is mapping-based indexing, which maps both data and queries into a one-dimensional space such that data can be indexed and queries can be processed through a one-dimensional indexing structure such as the B+-tree. Mapping-based indexing requires implementing only a few mapping functions, incurring much less implementation effort than conventional spatial index structures. Yet, a major concern about using mapping-based indexes is their lower efficiency compared to conventional tree structures. In this article, we propose a mapping-based spatial indexing scheme called Size Separation Indexing (SSI). SSI is equipped with a suite of techniques including size separation, data distribution transformation, and more efficient mapping algorithms. These techniques overcome the drawbacks of existing mapping-based indexes and significantly improve the efficiency of query processing. We show through extensive experiments that, for window queries on spatial objects with nonzero extents, SSI performs two orders of magnitude better than existing mapping-based indexes and is competitive with the R-tree as a standalone implementation. We have also implemented SSI on top of two off-the-shelf DBMSs, PostgreSQL and a commercial platform, both of which provide R-tree implementations. In this case, SSI is up to two orders of magnitude faster than their provided spatial indexes. Thus, we achieve a spatial index that, in a DBMS implementation, is more efficient than the R-tree and at the same time easy to implement. This result challenges the long-standing perception in this area that the R-tree is the best choice for indexing spatial objects.
{"title":"Towards a Painless Index for Spatial Objects","authors":"Rui Zhang, Jianzhong Qi, Martin Stradling, Jin Huang","doi":"10.1145/2629333","DOIUrl":"https://doi.org/10.1145/2629333","url":null,"abstract":"Conventional spatial indexes, represented by the R-tree, employ multidimensional tree structures that are complicated and require enormous efforts to implement in a full-fledged database management system (DBMS). An alternative approach for supporting spatial queries is mapping-based indexing, which maps both data and queries into a one-dimensional space such that data can be indexed and queries can be processed through a one-dimensional indexing structure such as the B+. Mapping-based indexing requires implementing only a few mapping functions, incurring much less effort in implementation compared to conventional spatial index structures. Yet, a major concern about using mapping-based indexes is their lower efficiency than conventional tree structures.\u0000 In this article, we propose a mapping-based spatial indexing scheme called Size Separation Indexing (SSI). SSI is equipped with a suite of techniques including size separation, data distribution transformation, and more efficient mapping algorithms. These techniques overcome the drawbacks of existing mapping-based indexes and significantly improve the efficiency of query processing. We show through extensive experiments that, for window queries on spatial objects with nonzero extents, SSI has two orders of magnitude better performance than existing mapping-based indexes and competitive performance to the R-tree as a standalone implementation. We have also implemented SSI on top of two off-the-shelf DBMSs, PostgreSQL and a commercial platform, both having R-tree implementation. In this case, SSI is up to two orders of magnitude faster than their provided spatial indexes. Therefore, we achieve a spatial index more efficient than the R-tree in a DBMS implementation that is at the same time easy to implement. This result may upset a common perception that has existed for a long time in this area that the R-tree is the best choice for indexing spatial objects.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"26 1","pages":"19:1-19:42"},"PeriodicalIF":1.8,"publicationDate":"2014-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90260827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In IT outsourcing, a user may delegate the data storage and query processing functions to a third-party server that is not completely trusted. This gives rise to the need to safeguard the privacy of the database as well as the user queries over it. In this article, we address the problem of running ad hoc equi-join queries directly on encrypted data in such a setting. Our contribution is the first solution that achieves constant complexity per pair of records that are evaluated for the join. After formalizing the privacy requirements pertaining to the database and user queries, we introduce a cryptographic construct for securely joining records across relations. The construct protects the database with a strong encryption scheme. Moreover, information disclosure after executing an equi-join is kept to the minimum—that two input records combine to form an output record if and only if they share common join attribute values. There is no disclosure on records that are not part of the join result. Building on this construct, we then present join algorithms that optimize the join execution by eliminating the need to match every record pair from the input relations. We provide a detailed analysis of the cost of the algorithms and confirm the analysis through extensive experiments with both synthetic and benchmark workloads. Through this evaluation, we tease out useful insights on how to configure the join algorithms to deliver acceptable execution time in practice.
{"title":"Privacy-Preserving Ad-Hoc Equi-Join on Outsourced Data","authors":"HweeHwa Pang, Xuhua Ding","doi":"10.1145/2629501","DOIUrl":"https://doi.org/10.1145/2629501","url":null,"abstract":"In IT outsourcing, a user may delegate the data storage and query processing functions to a third-party server that is not completely trusted. This gives rise to the need to safeguard the privacy of the database as well as the user queries over it. In this article, we address the problem of running ad hoc equi-join queries directly on encrypted data in such a setting. Our contribution is the first solution that achieves constant complexity per pair of records that are evaluated for the join. After formalizing the privacy requirements pertaining to the database and user queries, we introduce a cryptographic construct for securely joining records across relations. The construct protects the database with a strong encryption scheme. Moreover, information disclosure after executing an equi-join is kept to the minimum—that two input records combine to form an output record if and only if they share common join attribute values. There is no disclosure on records that are not part of the join result.\u0000 Building on this construct, we then present join algorithms that optimize the join execution by eliminating the need to match every record pair from the input relations. We provide a detailed analysis of the cost of the algorithms and confirm the analysis through extensive experiments with both synthetic and benchmark workloads. Through this evaluation, we tease out useful insights on how to configure the join algorithms to deliver acceptable execution time in practice.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"25 1","pages":"23:1-23:40"},"PeriodicalIF":1.8,"publicationDate":"2014-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75978099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In order to answer a “joint” query from multiple data cubes, Pourabbas and Shoshani [2007] distinguish the data cube on the measure of interest (called the “primary” data cube) from the other data cubes (called “proxy” data cubes) that are used to involve the dimensions (in the query) not in the primary data cube. They demonstrate through case studies that, if the measures of the primary and proxy data cubes are correlated, then the answer to a joint query is an accurate estimate of its true value. Needless to say, with two or more proxy data cubes, the result depends on the way the primary and proxy data cubes are combined; however, for certain combination schemes Pourabbas and Shoshani provide a sufficient condition, which they call proxy noncommonality, for the invariance of the result. In this article, we introduce: (1) a merge operator combining the contents of a primary data cube with the contents of a proxy data cube, (2) merge expressions for general combination schemes, and (3) an equivalence relation between merge expressions having the same pattern. Then, we prove that proxy noncommonality characterizes patterns for which every two merge expressions are equivalent. Moreover, we provide an efficient procedure for answering joint queries in the special case of perfect merge expressions. Finally, we show that our results apply to data cubes in which measures are obtained from unaggregated data using the aggregate functions SUM, COUNT, MAX, and MIN, and many others.
{"title":"A Join-Like Operator to Combine Data Cubes and Answer Queries from Multiple Data Cubes","authors":"F. M. Malvestuto","doi":"10.1145/2638545","DOIUrl":"https://doi.org/10.1145/2638545","url":null,"abstract":"In order to answer a “joint” query from multiple data cubes, Pourabass and Shoshani [2007] distinguish the data cube on the measure of interest (called the “primary” data cube) from the other data cubes (called “proxy” data cubes) that are used to involve the dimensions (in the query) not in the primary data cube. They demonstrate in study cases that, if the measures of the primary and proxy data cubes are correlated, then the answer to a joint query is an accurate estimate of its true value. Needless to say, for two or more proxy data cubes, the result depends upon the way the primary and proxy data cubes are combined together; however, for certain combination schemes Pourabass and Shoshani provide a sufficient condition, that they call proxy noncommonality, for the invariance of the result.\u0000 In this article, we introduce: (1) a merge operator combining the contents of a primary data cube with the contents of a proxy data cube, (2) merge expressions for general combination schemes, and (3) an equivalence relation between merge expressions having the same pattern. Then, we prove that proxy noncommonality characterizes patterns for which every two merge expressions are equivalent. Moreover, we provide an efficient procedure for answering joint queries in the special case of perfect merge expressions. Finally, we show that our results apply to data cubes in which measures are obtained from unaggregated data using the aggregate functions SUM, COUNT, MAX, and MIN, and a lot more.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"26 1","pages":"24:1-24:31"},"PeriodicalIF":1.8,"publicationDate":"2014-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72760622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
There is an arms race in the data management industry to support statistical analytics. Feature selection, the process of selecting a feature set that will be used to build a statistical model, is widely regarded as the most critical step of statistical analytics. Thus, we argue that managing the feature selection process is a pressing data management challenge. We study this challenge by describing a feature selection language and a supporting prototype system that builds on top of current industrial R-integration layers. From our interactions with analysts, we learned that feature selection is an interactive human-in-the-loop process, which means that feature selection workloads are rife with reuse opportunities. Thus, we study how to materialize portions of this computation using not only classical database materialization optimizations but also methods that have not previously been used in database optimization, including structural decomposition methods (like QR factorization) and warmstart. These new methods have no analogue in traditional SQL systems, but they may be interesting for array and scientific database applications. On a diverse set of datasets and programs, we find that traditional database-style approaches that ignore these new opportunities are more than two orders of magnitude slower than an optimal plan in this new trade-off space across multiple R backends. Furthermore, we show that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.
{"title":"Materialization Optimizations for Feature Selection Workloads","authors":"Ce Zhang, Arun Kumar, C. Ré","doi":"10.1145/2877204","DOIUrl":"https://doi.org/10.1145/2877204","url":null,"abstract":"There is an arms race in the data management industry to support statistical analytics. Feature selection, the process of selecting a feature set that will be used to build a statistical model, is widely regarded as the most critical step of statistical analytics. Thus, we argue that managing the feature selection process is a pressing data management challenge. We study this challenge by describing a feature selection language and a supporting prototype system that builds on top of current industrial R-integration layers. From our interactions with analysts, we learned that feature selection is an interactive human-in-the-loop process, which means that feature selection workloads are rife with reuse opportunities. Thus, we study how to materialize portions of this computation using not only classical database materialization optimizations but also methods that have not previously been used in database optimization, including structural decomposition methods (like QR factorization) and warmstart. These new methods have no analogue in traditional SQL systems, but they may be interesting for array and scientific database applications. On a diverse set of datasets and programs, we find that traditional database-style approaches that ignore these new opportunities are more than two orders of magnitude slower than an optimal plan in this new trade-off space across multiple R backends. Furthermore, we show that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"1 1","pages":"2:1-2:32"},"PeriodicalIF":1.8,"publicationDate":"2014-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79755488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ontological queries are evaluated against a knowledge base consisting of an extensional database and an ontology (i.e., a set of logical assertions and constraints that derive new intensional knowledge from the extensional database), rather than directly on the extensional database. The evaluation and optimization of such queries is an intriguing new problem for database research. In this article, we discuss two important aspects of this problem: query rewriting and query optimization. Query rewriting consists of the compilation of an ontological query into an equivalent first-order query against the underlying extensional database. We present a novel query rewriting algorithm for rather general types of ontological constraints that is well suited for practical implementations. In particular, we show how a conjunctive query against a knowledge base, expressed using linear and sticky existential rules, that is, members of the recently introduced Datalog± family of ontology languages, can be compiled into a union of conjunctive queries (UCQ) against the underlying database. Ontological query optimization, in this context, attempts to improve this rewriting process so as to produce small and cost-effective UCQ rewritings for an input query.
{"title":"Query Rewriting and Optimization for Ontological Databases","authors":"G. Gottlob, G. Orsi, Andreas Pieris","doi":"10.1145/2638546","DOIUrl":"https://doi.org/10.1145/2638546","url":null,"abstract":"Ontological queries are evaluated against a knowledge base consisting of an extensional database and an ontology (i.e., a set of logical assertions and constraints that derive new intensional knowledge from the extensional database), rather than directly on the extensional database. The evaluation and optimization of such queries is an intriguing new problem for database research. In this article, we discuss two important aspects of this problem: query rewriting and query optimization. Query rewriting consists of the compilation of an ontological query into an equivalent first-order query against the underlying extensional database. We present a novel query rewriting algorithm for rather general types of ontological constraints that is well suited for practical implementations. In particular, we show how a conjunctive query against a knowledge base, expressed using linear and sticky existential rules, that is, members of the recently introduced Datalog± family of ontology languages, can be compiled into a union of conjunctive queries (UCQ) against the underlying database. Ontological query optimization, in this context, attempts to improve this rewriting process soas to produce possibly small and cost-effective UCQ rewritings for an input query.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"37 1","pages":"25:1-25:46"},"PeriodicalIF":1.8,"publicationDate":"2014-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81232380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
While offering unique performance and energy-saving advantages, the use of Field-Programmable Gate Arrays (FPGAs) for database acceleration has demanded major concessions from system designers. Either the programmable chips have been used for very basic application tasks (such as implementing a rigid class of selection predicates) or their circuit definition had to be completely recompiled at runtime—a very CPU-intensive and time-consuming effort. This work eliminates the need for such concessions. As part of our XLynx implementation—an FPGA-based XML filter—we present skeleton automata, a design principle for data-intensive hardware circuits that offers high expressiveness and quick reconfiguration at the same time. Skeleton automata provide a generic implementation for a class of finite-state automata. They can be parameterized to any particular automaton instance in a matter of microseconds or less (as opposed to minutes or hours for complete recompilation). We showcase skeleton automata based on XML projection [Marian and Siméon 2003], a filtering technique that illustrates the feasibility of our strategy for a real-world and challenging task. By performing XML projection in hardware and filtering data in the network, we report performance improvements of several factors while remaining nonintrusive to the back-end XML processor (we evaluate XLynx using the Saxon engine).
{"title":"XLynx—An FPGA-based XML filter for hybrid XQuery processing","authors":"J. Teubner, L. Woods, Chongling Nie","doi":"10.1145/2536800","DOIUrl":"https://doi.org/10.1145/2536800","url":null,"abstract":"While offering unique performance and energy-saving advantages, the use of Field-Programmable Gate Arrays (FPGAs) for database acceleration has demanded major concessions from system designers. Either the programmable chips have been used for very basic application tasks (such as implementing a rigid class of selection predicates) or their circuit definition had to be completely recompiled at runtime—a very CPU-intensive and time-consuming effort.\u0000 This work eliminates the need for such concessions. As part of our XLynx implementation—an FPGA-based XML filter—we present skeleton automata, which is a design principle for data-intensive hardware circuits that offers high expressiveness and quick reconfiguration at the same time. Skeleton automata provide a generic implementation for a class of finite-state automata. They can be parameterized to any particular automaton instance in a matter of microseconds or less (as opposed to minutes or hours for complete recompilation).\u0000 We showcase skeleton automata based on XML projection [Marian and Siméon 2003], a filtering technique that illustrates the feasibility of our strategy for a real-world and challenging task. By performing XML projection in hardware and filtering data in the network, we report on performance improvements of several factors while remaining nonintrusive to the back-end XML processor (we evaluate XLynx using the Saxon engine).","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"37 1","pages":"23"},"PeriodicalIF":1.8,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73934502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A schema mapping is a high-level specification of the relationship between a source schema and a target schema. Recently, a line of research has emerged that aims at deriving schema mappings automatically or semi-automatically with the help of data examples, that is, pairs consisting of a source instance and a target instance that depict, in some precise sense, the intended behavior of the schema mapping. Several different uses of data examples for deriving, refining, or illustrating a schema mapping have already been proposed and studied. In this article, we use the lens of computational learning theory to systematically investigate the problem of obtaining algorithmically a schema mapping from data examples. Our aim is to leverage the rich body of work on learning theory in order to develop a framework for exploring the power and the limitations of the various algorithmic methods for obtaining schema mappings from data examples. We focus on GAV schema mappings, that is, schema mappings specified by GAV (Global-As-View) constraints. GAV constraints are the most basic and the most widely supported language for specifying schema mappings. We present an efficient algorithm for learning GAV schema mappings using Angluin's model of exact learning with membership and equivalence queries. This is optimal, since we show that neither membership queries nor equivalence queries suffice, unless the source schema consists of unary relations only. We also obtain results concerning the learnability of schema mappings in the context of Valiant's well-known PAC (Probably-Approximately-Correct) learning model, and concerning the learnability of restricted classes of GAV schema mappings. Finally, as a byproduct of our work, we show that there is no efficient algorithm for approximating the shortest GAV schema mapping fitting a given set of examples, unless the source schema consists of unary relations only.
{"title":"Learning schema mappings","authors":"B. T. Cate, V. Dalmau, Phokion G. Kolaitis","doi":"10.1145/2539032.2539035","DOIUrl":"https://doi.org/10.1145/2539032.2539035","url":null,"abstract":"A schema mapping is a high-level specification of the relationship between a source schema and a target schema. Recently, a line of research has emerged that aims at deriving schema mappings automatically or semi-automatically with the help of data examples, that is, pairs consisting of a source instance and a target instance that depict, in some precise sense, the intended behavior of the schema mapping. Several different uses of data examples for deriving, refining, or illustrating a schema mapping have already been proposed and studied.\u0000 In this article, we use the lens of computational learning theory to systematically investigate the problem of obtaining algorithmically a schema mapping from data examples. Our aim is to leverage the rich body of work on learning theory in order to develop a framework for exploring the power and the limitations of the various algorithmic methods for obtaining schema mappings from data examples. We focus on GAV schema mappings, that is, schema mappings specified by GAV (Global-As-View) constraints. GAV constraints are the most basic and the most widely supported language for specifying schema mappings. We present an efficient algorithm for learning GAV schema mappings using Angluin's model of exact learning with membership and equivalence queries. This is optimal, since we show that neither membership queries nor equivalence queries suffice, unless the source schema consists of unary relations only. We also obtain results concerning the learnability of schema mappings in the context of Valiant's well-known PAC (Probably-Approximately-Correct) learning model, and concerning the learnability of restricted classes of GAV schema mappings. Finally, as a byproduct of our work, we show that there is no efficient algorithm for approximating the shortest GAV schema mapping fitting a given set of examples, unless the source schema consists of unary relations only.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"44 1","pages":"28"},"PeriodicalIF":1.8,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79315030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
These papers were recommended by the program chairs and program committees of the respective conferences as the best papers to be invited for submission, and they were selected for publication after a full peer-review process according to TODS policies. The first three invited articles in this issue are selected from SIGMOD 2012. Zaniolo introduces a new language (XSeq) for processing complex event queries over XML streams, discusses the language's implementation over visibly pushdown automata, and presents the formal semantics of the language, a complexity analysis, and a performance evaluation over multiple settings. The second article studies the problem of thinning, which arises when visualizing map datasets at multiple scales (i.e., given a region at a particular zoom level, the goal is to return a small number of records to represent the region); its authors describe several novel algorithms and present analyses and experiments demonstrating the trade-offs among them. The third article, "XLynx—An FPGA-Based XML Filter for Hybrid XQuery Processing" by Jens Teubner, Louis Woods, and Chongling Nie, addresses the problem of speeding up XQuery processing by focusing on the task of XML projection. They propose a novel method using field-programmable gate arrays (FPGAs), called skeleton automata, which is both highly expressive and easily reconfigurable. The next three articles in this issue are selected from PODS 2012. The article "The Complexity of Regular Expressions and Property Paths in SPARQL" by Katja Losemann and Wim Martens formalizes the W3C semantics of property paths and studies the complexity of two basic problems related to query evaluation on graphs. The second article, "Static Analysis and Optimization of Semantic Web Queries", studies the optimization of SPARQL queries with a special focus on the optionality feature in SPARQL and presents extensive results, including a characterization of query answers and various complexity results for query evaluation, subsumption, and equivalence. A summary of a dataset is a compression that allows one to estimate various statistical quantities about the dataset; the third article considers mergeable summaries, where a summary of the combined dataset can be obtained from the summaries of two datasets, and presents several novel results and well-designed experiments supporting the theoretical results on the mergeability of different statistical quantities.
{"title":"Foreword to invited papers issue","authors":"Z. Ozsoyoglu","doi":"10.1145/2539032.2539033","DOIUrl":"https://doi.org/10.1145/2539032.2539033","url":null,"abstract":"Germany). These papers were recommended by the program chairs and program committees of the respective conferences as the best papers to be invited for submission, and they were selected for publication after a full peer-review process according to TODS policies. The first three invited articles in this issue are selected from SIGMOD 2012. Zaniolo introduces a new language (XSeq) for processing complex event queries over XML streams, discusses the language implementation over visibly pushdown automata, and presents formal semantics of the language, complexity analysis, and performance evaluation over multiple settings. studies the problem of thinning related to visualizing map datasets at multiple scales (i.e., given a region at a particular zoom level, the goal is to return a small number of records to represent the region). They describe several novel algorithms and present analyses and experimentation demonstrating the trade-offs among the algorithms. The third article, \" XLynx—An FPGA-Based XML Filter for Hybrid XQuery Processing \" by Jens Teubner, Louis Woods, and Chongling Nie, addresses the problem of speeding up Xquery processing by focusing on the task of XML projection. They propose a novel method using field-programmable gate-arrays (FPGAs), called the skeleton automata, which is both highly expressive and easily reconfigurable. The next three articles in this issue are selected from PODS 2012. The article \" The Complexity of Regular Expressions and Property Paths in SPARQL \" by Katja Losemann and Wim Martens formalizes the W3C semantics of property paths and studies the complexity of two basic problems related to query evaluation on graphs. The second article, \" Static Analysis and Optimization of Semantic Web Queries \" by studies the optimization of SPARQL queries with a special focus on the Optionality feature in SPARQL and presents extensive results including a characterization of query answers and various complexity results for query evaluation, subsumption, and equivalence. summary of a dataset is a compression that allows one to estimate various statistical quantities about the dataset. This article considers mergeable summaries of datasets where a summary of the combined dataset can be obtained from the summaries of two datasets and presents several novel results and well-designed experiments supporting the theoretical results on the mergeability of different statistical quantities.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"38 1","pages":"20"},"PeriodicalIF":1.8,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/2539032.2539033","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64153211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The World Wide Web Consortium (W3C) recently introduced property paths in SPARQL 1.1, a query language for RDF data. Property paths allow SPARQL queries to evaluate regular expressions over graph-structured data. However, they differ from standard regular expressions in several notable aspects. For example, they have a limited form of negation, they have numerical occurrence indicators as syntactic sugar, and their semantics on graphs is defined in a nonstandard manner. We formalize the W3C semantics of property paths and investigate various query evaluation problems on graphs. More specifically, let x and y be two nodes in an edge-labeled graph and r be an expression. We study the complexities of: (1) deciding whether there exists a path from x to y that matches r and (2) counting how many paths from x to y match r. Our main results show that, compared to an alternative semantics of regular expressions on graphs, the complexity of (1) and (2) under W3C semantics is significantly higher. Whereas the alternative semantics remains in polynomial time for large fragments of expressions, the W3C semantics makes problems (1) and (2) intractable almost immediately. As a side result, we prove that the membership problem for regular expressions with numerical occurrence indicators and negation is in polynomial time.
{"title":"The complexity of regular expressions and property paths in SPARQL","authors":"Katja Losemann, W. Martens","doi":"10.1145/2494529","DOIUrl":"https://doi.org/10.1145/2494529","url":null,"abstract":"The World Wide Web Consortium (W3C) recently introduced property paths in SPARQL 1.1, a query language for RDF data. Property paths allow SPARQL queries to evaluate regular expressions over graph-structured data. However, they differ from standard regular expressions in several notable aspects. For example, they have a limited form of negation, they have numerical occurrence indicators as syntactic sugar, and their semantics on graphs is defined in a nonstandard manner.\u0000 We formalize the W3C semantics of property paths and investigate various query evaluation problems on graphs. More specifically, let x and y be two nodes in an edge-labeled graph and r be an expression. We study the complexities of: (1) deciding whether there exists a path from x to y that matches r and (2) counting how many paths from x to y match r. Our main results show that, compared to an alternative semantics of regular expressions on graphs, the complexity of (1) and (2) under W3C semantics is significantly higher. Whereas the alternative semantics remains in polynomial time for large fragments of expressions, the W3C semantics makes problems (1) and (2) intractable almost immediately.\u0000 As a side-result, we prove that the membership problem for regular expressions with numerical occurrence indicators and negation is in polynomial time.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"214 1","pages":"24"},"PeriodicalIF":1.8,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74161554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}