This article studies the MaxRS problem in spatial databases. Given a set O of weighted points and a rectangle r of a given size, the goal of the MaxRS problem is to find a location of r such that the sum of the weights of all the points covered by r is maximized. This problem is useful in many location-based services such as finding the best place for a new franchise store with a limited delivery range and finding the hotspot with the largest number of nearby attractions for a tourist with a limited reachable range. However, the problem has been studied mainly from a theoretical perspective, particularly in computational geometry. The existing algorithms from the computational geometry community are in-memory algorithms that do not guarantee scalability. In this article, we propose a scalable external-memory algorithm (ExactMaxRS) for the MaxRS problem that is optimal in terms of the I/O complexity. In addition, we propose an approximation algorithm (ApproxMaxCRS) for the MaxCRS problem, the circle version of the MaxRS problem. We prove the correctness and optimality of the ExactMaxRS algorithm along with the approximation bound of the ApproxMaxCRS algorithm. Furthermore, motivated by the fact that all the existing solutions simply assume that there is no tied area for the best location, we extend the MaxRS problem to a more fundamental problem, namely AllMaxRS, so that all the locations with the same best score can be retrieved. We first prove that the AllMaxRS problem cannot be trivially solved by applying the techniques for the MaxRS problem. Then we propose an output-sensitive external-memory algorithm (TwoPhaseMaxRS) that gives the exact solution for the AllMaxRS problem through two phases. Also, we prove both the soundness and completeness of the result returned from TwoPhaseMaxRS. From extensive experimental results, we show that ExactMaxRS and ApproxMaxCRS are several orders of magnitude faster than methods adapted from existing algorithms, the approximation bound in practice is much better than the theoretical bound of ApproxMaxCRS, and TwoPhaseMaxRS is not only much faster but also more robust than the straightforward extension of ExactMaxRS.
{"title":"Maximizing Range Sum in External Memory","authors":"Dong-Wan Choi, C. Chung, Yufei Tao","doi":"10.1145/2629477","DOIUrl":"https://doi.org/10.1145/2629477","url":null,"abstract":"This article studies the MaxRS problem in spatial databases. Given a set O of weighted points and a rectangle r of a given size, the goal of the MaxRS problem is to find a location of r such that the sum of the weights of all the points covered by r is maximized. This problem is useful in many location-based services such as finding the best place for a new franchise store with a limited delivery range and finding the hotspot with the largest number of nearby attractions for a tourist with a limited reachable range. However, the problem has been studied mainly in the theoretical perspective, particularly in computational geometry. The existing algorithms from the computational geometry community are in-memory algorithms that do not guarantee the scalability. In this article, we propose a scalable external-memory algorithm (ExactMaxRS) for the MaxRS problem that is optimal in terms of the I/O complexity. In addition, we propose an approximation algorithm (ApproxMaxCRS) for the MaxCRS problem that is a circle version of the MaxRS problem. We prove the correctness and optimality of the ExactMaxRS algorithm along with the approximation bound of the ApproxMaxCRS algorithm.\u0000 Furthermore, motivated by the fact that all the existing solutions simply assume that there is no tied area for the best location, we extend the MaxRS problem to a more fundamental problem, namely AllMaxRS, so that all the locations with the same best score can be retrieved. We first prove that the AllMaxRS problem cannot be trivially solved by applying the techniques for the MaxRS problem. Then we propose an output-sensitive external-memory algorithm (TwoPhaseMaxRS) that gives the exact solution for the AllMaxRS problem through two phases. Also, we prove both the soundness and completeness of the result returned from TwoPhaseMaxRS.\u0000 From extensive experimental results, we show that ExactMaxRS and ApproxMaxCRS are several orders of magnitude faster than methods adapted from existing algorithms, the approximation bound in practice is much better than the theoretical bound of ApproxMaxCRS, and TwoPhaseMaxRS is not only much faster but also more robust than the straightforward extension of ExactMaxRS.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"68 1","pages":"21:1-21:44"},"PeriodicalIF":1.8,"publicationDate":"2014-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78249478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conventional spatial indexes, represented by the R-tree, employ multidimensional tree structures that are complicated and require enormous efforts to implement in a full-fledged database management system (DBMS). An alternative approach for supporting spatial queries is mapping-based indexing, which maps both data and queries into a one-dimensional space such that data can be indexed and queries can be processed through a one-dimensional indexing structure such as the B+-tree. Mapping-based indexing requires implementing only a few mapping functions, incurring much less implementation effort than conventional spatial index structures. Yet, a major concern about using mapping-based indexes is their lower efficiency compared to conventional tree structures. In this article, we propose a mapping-based spatial indexing scheme called Size Separation Indexing (SSI). SSI is equipped with a suite of techniques including size separation, data distribution transformation, and more efficient mapping algorithms. These techniques overcome the drawbacks of existing mapping-based indexes and significantly improve the efficiency of query processing. We show through extensive experiments that, for window queries on spatial objects with nonzero extents, SSI performs two orders of magnitude better than existing mapping-based indexes and is competitive with the R-tree as a standalone implementation. We have also implemented SSI on top of two off-the-shelf DBMSs, PostgreSQL and a commercial platform, both of which provide R-tree implementations. In this case, SSI is up to two orders of magnitude faster than their provided spatial indexes. Thus, we achieve a spatial index that, in a DBMS implementation, is more efficient than the R-tree and at the same time easy to implement. This result challenges the long-standing perception in this area that the R-tree is the best choice for indexing spatial objects.
{"title":"Towards a Painless Index for Spatial Objects","authors":"Rui Zhang, Jianzhong Qi, Martin Stradling, Jin Huang","doi":"10.1145/2629333","DOIUrl":"https://doi.org/10.1145/2629333","url":null,"abstract":"Conventional spatial indexes, represented by the R-tree, employ multidimensional tree structures that are complicated and require enormous efforts to implement in a full-fledged database management system (DBMS). An alternative approach for supporting spatial queries is mapping-based indexing, which maps both data and queries into a one-dimensional space such that data can be indexed and queries can be processed through a one-dimensional indexing structure such as the B+. Mapping-based indexing requires implementing only a few mapping functions, incurring much less effort in implementation compared to conventional spatial index structures. Yet, a major concern about using mapping-based indexes is their lower efficiency than conventional tree structures.\u0000 In this article, we propose a mapping-based spatial indexing scheme called Size Separation Indexing (SSI). SSI is equipped with a suite of techniques including size separation, data distribution transformation, and more efficient mapping algorithms. These techniques overcome the drawbacks of existing mapping-based indexes and significantly improve the efficiency of query processing. We show through extensive experiments that, for window queries on spatial objects with nonzero extents, SSI has two orders of magnitude better performance than existing mapping-based indexes and competitive performance to the R-tree as a standalone implementation. We have also implemented SSI on top of two off-the-shelf DBMSs, PostgreSQL and a commercial platform, both having R-tree implementation. In this case, SSI is up to two orders of magnitude faster than their provided spatial indexes. Therefore, we achieve a spatial index more efficient than the R-tree in a DBMS implementation that is at the same time easy to implement. This result may upset a common perception that has existed for a long time in this area that the R-tree is the best choice for indexing spatial objects.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"26 1","pages":"19:1-19:42"},"PeriodicalIF":1.8,"publicationDate":"2014-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90260827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In IT outsourcing, a user may delegate the data storage and query processing functions to a third-party server that is not completely trusted. This gives rise to the need to safeguard the privacy of the database as well as the user queries over it. In this article, we address the problem of running ad hoc equi-join queries directly on encrypted data in such a setting. Our contribution is the first solution that achieves constant complexity per pair of records that are evaluated for the join. After formalizing the privacy requirements pertaining to the database and user queries, we introduce a cryptographic construct for securely joining records across relations. The construct protects the database with a strong encryption scheme. Moreover, information disclosure after executing an equi-join is kept to the minimum—that two input records combine to form an output record if and only if they share common join attribute values. There is no disclosure on records that are not part of the join result. Building on this construct, we then present join algorithms that optimize the join execution by eliminating the need to match every record pair from the input relations. We provide a detailed analysis of the cost of the algorithms and confirm the analysis through extensive experiments with both synthetic and benchmark workloads. Through this evaluation, we tease out useful insights on how to configure the join algorithms to deliver acceptable execution time in practice.
{"title":"Privacy-Preserving Ad-Hoc Equi-Join on Outsourced Data","authors":"HweeHwa Pang, Xuhua Ding","doi":"10.1145/2629501","DOIUrl":"https://doi.org/10.1145/2629501","url":null,"abstract":"In IT outsourcing, a user may delegate the data storage and query processing functions to a third-party server that is not completely trusted. This gives rise to the need to safeguard the privacy of the database as well as the user queries over it. In this article, we address the problem of running ad hoc equi-join queries directly on encrypted data in such a setting. Our contribution is the first solution that achieves constant complexity per pair of records that are evaluated for the join. After formalizing the privacy requirements pertaining to the database and user queries, we introduce a cryptographic construct for securely joining records across relations. The construct protects the database with a strong encryption scheme. Moreover, information disclosure after executing an equi-join is kept to the minimum—that two input records combine to form an output record if and only if they share common join attribute values. There is no disclosure on records that are not part of the join result.\u0000 Building on this construct, we then present join algorithms that optimize the join execution by eliminating the need to match every record pair from the input relations. We provide a detailed analysis of the cost of the algorithms and confirm the analysis through extensive experiments with both synthetic and benchmark workloads. Through this evaluation, we tease out useful insights on how to configure the join algorithms to deliver acceptable execution time in practice.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"25 1","pages":"23:1-23:40"},"PeriodicalIF":1.8,"publicationDate":"2014-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75978099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In order to answer a “joint” query from multiple data cubes, Pourabbas and Shoshani [2007] distinguish the data cube on the measure of interest (called the “primary” data cube) from the other data cubes (called “proxy” data cubes) that are used to involve the dimensions (in the query) not in the primary data cube. They demonstrate through case studies that, if the measures of the primary and proxy data cubes are correlated, then the answer to a joint query is an accurate estimate of its true value. Needless to say, with two or more proxy data cubes, the result depends on the way the primary and proxy data cubes are combined; however, for certain combination schemes Pourabbas and Shoshani provide a sufficient condition, which they call proxy noncommonality, for the invariance of the result. In this article, we introduce: (1) a merge operator combining the contents of a primary data cube with the contents of a proxy data cube, (2) merge expressions for general combination schemes, and (3) an equivalence relation between merge expressions having the same pattern. Then, we prove that proxy noncommonality characterizes patterns for which every two merge expressions are equivalent. Moreover, we provide an efficient procedure for answering joint queries in the special case of perfect merge expressions. Finally, we show that our results apply to data cubes in which measures are obtained from unaggregated data using the aggregate functions SUM, COUNT, MAX, and MIN, and many others.
{"title":"A Join-Like Operator to Combine Data Cubes and Answer Queries from Multiple Data Cubes","authors":"F. M. Malvestuto","doi":"10.1145/2638545","DOIUrl":"https://doi.org/10.1145/2638545","url":null,"abstract":"In order to answer a “joint” query from multiple data cubes, Pourabass and Shoshani [2007] distinguish the data cube on the measure of interest (called the “primary” data cube) from the other data cubes (called “proxy” data cubes) that are used to involve the dimensions (in the query) not in the primary data cube. They demonstrate in study cases that, if the measures of the primary and proxy data cubes are correlated, then the answer to a joint query is an accurate estimate of its true value. Needless to say, for two or more proxy data cubes, the result depends upon the way the primary and proxy data cubes are combined together; however, for certain combination schemes Pourabass and Shoshani provide a sufficient condition, that they call proxy noncommonality, for the invariance of the result.\u0000 In this article, we introduce: (1) a merge operator combining the contents of a primary data cube with the contents of a proxy data cube, (2) merge expressions for general combination schemes, and (3) an equivalence relation between merge expressions having the same pattern. Then, we prove that proxy noncommonality characterizes patterns for which every two merge expressions are equivalent. Moreover, we provide an efficient procedure for answering joint queries in the special case of perfect merge expressions. Finally, we show that our results apply to data cubes in which measures are obtained from unaggregated data using the aggregate functions SUM, COUNT, MAX, and MIN, and a lot more.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"26 1","pages":"24:1-24:31"},"PeriodicalIF":1.8,"publicationDate":"2014-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72760622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
There is an arms race in the data management industry to support statistical analytics. Feature selection, the process of selecting a feature set that will be used to build a statistical model, is widely regarded as the most critical step of statistical analytics. Thus, we argue that managing the feature selection process is a pressing data management challenge. We study this challenge by describing a feature selection language and a supporting prototype system that builds on top of current industrial R-integration layers. From our interactions with analysts, we learned that feature selection is an interactive human-in-the-loop process, which means that feature selection workloads are rife with reuse opportunities. Thus, we study how to materialize portions of this computation using not only classical database materialization optimizations but also methods that have not previously been used in database optimization, including structural decomposition methods (like QR factorization) and warmstart. These new methods have no analogue in traditional SQL systems, but they may be interesting for array and scientific database applications. On a diverse set of datasets and programs, we find that traditional database-style approaches that ignore these new opportunities are more than two orders of magnitude slower than an optimal plan in this new trade-off space across multiple R backends. Furthermore, we show that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.
{"title":"Materialization Optimizations for Feature Selection Workloads","authors":"Ce Zhang, Arun Kumar, C. Ré","doi":"10.1145/2877204","DOIUrl":"https://doi.org/10.1145/2877204","url":null,"abstract":"There is an arms race in the data management industry to support statistical analytics. Feature selection, the process of selecting a feature set that will be used to build a statistical model, is widely regarded as the most critical step of statistical analytics. Thus, we argue that managing the feature selection process is a pressing data management challenge. We study this challenge by describing a feature selection language and a supporting prototype system that builds on top of current industrial R-integration layers. From our interactions with analysts, we learned that feature selection is an interactive human-in-the-loop process, which means that feature selection workloads are rife with reuse opportunities. Thus, we study how to materialize portions of this computation using not only classical database materialization optimizations but also methods that have not previously been used in database optimization, including structural decomposition methods (like QR factorization) and warmstart. These new methods have no analogue in traditional SQL systems, but they may be interesting for array and scientific database applications. On a diverse set of datasets and programs, we find that traditional database-style approaches that ignore these new opportunities are more than two orders of magnitude slower than an optimal plan in this new trade-off space across multiple R backends. Furthermore, we show that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"1 1","pages":"2:1-2:32"},"PeriodicalIF":1.8,"publicationDate":"2014-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79755488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ontological queries are evaluated against a knowledge base consisting of an extensional database and an ontology (i.e., a set of logical assertions and constraints that derive new intensional knowledge from the extensional database), rather than directly on the extensional database. The evaluation and optimization of such queries is an intriguing new problem for database research. In this article, we discuss two important aspects of this problem: query rewriting and query optimization. Query rewriting consists of the compilation of an ontological query into an equivalent first-order query against the underlying extensional database. We present a novel query rewriting algorithm for rather general types of ontological constraints that is well suited for practical implementations. In particular, we show how a conjunctive query against a knowledge base, expressed using linear and sticky existential rules, that is, members of the recently introduced Datalog± family of ontology languages, can be compiled into a union of conjunctive queries (UCQ) against the underlying database. Ontological query optimization, in this context, attempts to improve this rewriting process so as to produce small and cost-effective UCQ rewritings for an input query.
{"title":"Query Rewriting and Optimization for Ontological Databases","authors":"G. Gottlob, G. Orsi, Andreas Pieris","doi":"10.1145/2638546","DOIUrl":"https://doi.org/10.1145/2638546","url":null,"abstract":"Ontological queries are evaluated against a knowledge base consisting of an extensional database and an ontology (i.e., a set of logical assertions and constraints that derive new intensional knowledge from the extensional database), rather than directly on the extensional database. The evaluation and optimization of such queries is an intriguing new problem for database research. In this article, we discuss two important aspects of this problem: query rewriting and query optimization. Query rewriting consists of the compilation of an ontological query into an equivalent first-order query against the underlying extensional database. We present a novel query rewriting algorithm for rather general types of ontological constraints that is well suited for practical implementations. In particular, we show how a conjunctive query against a knowledge base, expressed using linear and sticky existential rules, that is, members of the recently introduced Datalog± family of ontology languages, can be compiled into a union of conjunctive queries (UCQ) against the underlying database. Ontological query optimization, in this context, attempts to improve this rewriting process soas to produce possibly small and cost-effective UCQ rewritings for an input query.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"37 1","pages":"25:1-25:46"},"PeriodicalIF":1.8,"publicationDate":"2014-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81232380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
While offering unique performance and energy-saving advantages, the use of Field-Programmable Gate Arrays (FPGAs) for database acceleration has demanded major concessions from system designers. Either the programmable chips have been used for very basic application tasks (such as implementing a rigid class of selection predicates) or their circuit definition had to be completely recompiled at runtime—a very CPU-intensive and time-consuming effort. This work eliminates the need for such concessions. As part of our XLynx implementation—an FPGA-based XML filter—we present skeleton automata, a design principle for data-intensive hardware circuits that offers high expressiveness and quick reconfiguration at the same time. Skeleton automata provide a generic implementation for a class of finite-state automata. They can be parameterized to any particular automaton instance in a matter of microseconds or less (as opposed to minutes or hours for complete recompilation). We showcase skeleton automata based on XML projection [Marian and Siméon 2003], a filtering technique that illustrates the feasibility of our strategy for a real-world and challenging task. By performing XML projection in hardware and filtering data in the network, we report performance improvements of several factors while remaining nonintrusive to the back-end XML processor (we evaluate XLynx using the Saxon engine).
{"title":"XLynx—An FPGA-based XML filter for hybrid XQuery processing","authors":"J. Teubner, L. Woods, Chongling Nie","doi":"10.1145/2536800","DOIUrl":"https://doi.org/10.1145/2536800","url":null,"abstract":"While offering unique performance and energy-saving advantages, the use of Field-Programmable Gate Arrays (FPGAs) for database acceleration has demanded major concessions from system designers. Either the programmable chips have been used for very basic application tasks (such as implementing a rigid class of selection predicates) or their circuit definition had to be completely recompiled at runtime—a very CPU-intensive and time-consuming effort.\u0000 This work eliminates the need for such concessions. As part of our XLynx implementation—an FPGA-based XML filter—we present skeleton automata, which is a design principle for data-intensive hardware circuits that offers high expressiveness and quick reconfiguration at the same time. Skeleton automata provide a generic implementation for a class of finite-state automata. They can be parameterized to any particular automaton instance in a matter of microseconds or less (as opposed to minutes or hours for complete recompilation).\u0000 We showcase skeleton automata based on XML projection [Marian and Siméon 2003], a filtering technique that illustrates the feasibility of our strategy for a real-world and challenging task. By performing XML projection in hardware and filtering data in the network, we report on performance improvements of several factors while remaining nonintrusive to the back-end XML processor (we evaluate XLynx using the Saxon engine).","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"37 1","pages":"23"},"PeriodicalIF":1.8,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73934502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A schema mapping is a high-level specification of the relationship between a source schema and a target schema. Recently, a line of research has emerged that aims at deriving schema mappings automatically or semi-automatically with the help of data examples, that is, pairs consisting of a source instance and a target instance that depict, in some precise sense, the intended behavior of the schema mapping. Several different uses of data examples for deriving, refining, or illustrating a schema mapping have already been proposed and studied. In this article, we use the lens of computational learning theory to systematically investigate the problem of obtaining algorithmically a schema mapping from data examples. Our aim is to leverage the rich body of work on learning theory in order to develop a framework for exploring the power and the limitations of the various algorithmic methods for obtaining schema mappings from data examples. We focus on GAV schema mappings, that is, schema mappings specified by GAV (Global-As-View) constraints. GAV constraints are the most basic and the most widely supported language for specifying schema mappings. We present an efficient algorithm for learning GAV schema mappings using Angluin's model of exact learning with membership and equivalence queries. This is optimal, since we show that neither membership queries nor equivalence queries suffice, unless the source schema consists of unary relations only. We also obtain results concerning the learnability of schema mappings in the context of Valiant's well-known PAC (Probably-Approximately-Correct) learning model, and concerning the learnability of restricted classes of GAV schema mappings. Finally, as a byproduct of our work, we show that there is no efficient algorithm for approximating the shortest GAV schema mapping fitting a given set of examples, unless the source schema consists of unary relations only.
{"title":"Learning schema mappings","authors":"B. T. Cate, V. Dalmau, Phokion G. Kolaitis","doi":"10.1145/2539032.2539035","DOIUrl":"https://doi.org/10.1145/2539032.2539035","url":null,"abstract":"A schema mapping is a high-level specification of the relationship between a source schema and a target schema. Recently, a line of research has emerged that aims at deriving schema mappings automatically or semi-automatically with the help of data examples, that is, pairs consisting of a source instance and a target instance that depict, in some precise sense, the intended behavior of the schema mapping. Several different uses of data examples for deriving, refining, or illustrating a schema mapping have already been proposed and studied.\u0000 In this article, we use the lens of computational learning theory to systematically investigate the problem of obtaining algorithmically a schema mapping from data examples. Our aim is to leverage the rich body of work on learning theory in order to develop a framework for exploring the power and the limitations of the various algorithmic methods for obtaining schema mappings from data examples. We focus on GAV schema mappings, that is, schema mappings specified by GAV (Global-As-View) constraints. GAV constraints are the most basic and the most widely supported language for specifying schema mappings. We present an efficient algorithm for learning GAV schema mappings using Angluin's model of exact learning with membership and equivalence queries. This is optimal, since we show that neither membership queries nor equivalence queries suffice, unless the source schema consists of unary relations only. We also obtain results concerning the learnability of schema mappings in the context of Valiant's well-known PAC (Probably-Approximately-Correct) learning model, and concerning the learnability of restricted classes of GAV schema mappings. Finally, as a byproduct of our work, we show that there is no efficient algorithm for approximating the shortest GAV schema mapping fitting a given set of examples, unless the source schema consists of unary relations only.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"44 1","pages":"28"},"PeriodicalIF":1.8,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79315030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
These papers were recommended by the program chairs and program committees of the respective conferences as the best papers to be invited for submission, and they were selected for publication after a full peer-review process according to TODS policies. The first three invited articles in this issue are selected from SIGMOD 2012. Zaniolo introduces a new language (XSeq) for processing complex event queries over XML streams, discusses the language's implementation over visibly pushdown automata, and presents the formal semantics of the language, a complexity analysis, and a performance evaluation over multiple settings. The second article studies the problem of thinning, which arises when visualizing map datasets at multiple scales (i.e., given a region at a particular zoom level, the goal is to return a small number of records to represent the region); its authors describe several novel algorithms and present analyses and experiments demonstrating the trade-offs among them. The third article, "XLynx—An FPGA-Based XML Filter for Hybrid XQuery Processing" by Jens Teubner, Louis Woods, and Chongling Nie, addresses the problem of speeding up XQuery processing by focusing on the task of XML projection. They propose a novel method using field-programmable gate arrays (FPGAs), called skeleton automata, which is both highly expressive and easily reconfigurable. The next three articles in this issue are selected from PODS 2012. The article "The Complexity of Regular Expressions and Property Paths in SPARQL" by Katja Losemann and Wim Martens formalizes the W3C semantics of property paths and studies the complexity of two basic problems related to query evaluation on graphs. The second article, "Static Analysis and Optimization of Semantic Web Queries", studies the optimization of SPARQL queries with a special focus on the optionality feature in SPARQL and presents extensive results, including a characterization of query answers and various complexity results for query evaluation, subsumption, and equivalence. A summary of a dataset is a compression that allows one to estimate various statistical quantities about the dataset; the third article considers mergeable summaries, where a summary of the combined dataset can be obtained from the summaries of two datasets, and presents several novel results and well-designed experiments supporting the theoretical results on the mergeability of different statistical quantities.
{"title":"Foreword to invited papers issue","authors":"Z. Ozsoyoglu","doi":"10.1145/2539032.2539033","DOIUrl":"https://doi.org/10.1145/2539032.2539033","url":null,"abstract":"Germany). These papers were recommended by the program chairs and program committees of the respective conferences as the best papers to be invited for submission, and they were selected for publication after a full peer-review process according to TODS policies. The first three invited articles in this issue are selected from SIGMOD 2012. Zaniolo introduces a new language (XSeq) for processing complex event queries over XML streams, discusses the language implementation over visibly pushdown automata, and presents formal semantics of the language, complexity analysis, and performance evaluation over multiple settings. studies the problem of thinning related to visualizing map datasets at multiple scales (i.e., given a region at a particular zoom level, the goal is to return a small number of records to represent the region). They describe several novel algorithms and present analyses and experimentation demonstrating the trade-offs among the algorithms. The third article, \" XLynx—An FPGA-Based XML Filter for Hybrid XQuery Processing \" by Jens Teubner, Louis Woods, and Chongling Nie, addresses the problem of speeding up Xquery processing by focusing on the task of XML projection. They propose a novel method using field-programmable gate-arrays (FPGAs), called the skeleton automata, which is both highly expressive and easily reconfigurable. The next three articles in this issue are selected from PODS 2012. The article \" The Complexity of Regular Expressions and Property Paths in SPARQL \" by Katja Losemann and Wim Martens formalizes the W3C semantics of property paths and studies the complexity of two basic problems related to query evaluation on graphs. The second article, \" Static Analysis and Optimization of Semantic Web Queries \" by studies the optimization of SPARQL queries with a special focus on the Optionality feature in SPARQL and presents extensive results including a characterization of query answers and various complexity results for query evaluation, subsumption, and equivalence. summary of a dataset is a compression that allows one to estimate various statistical quantities about the dataset. This article considers mergeable summaries of datasets where a summary of the combined dataset can be obtained from the summaries of two datasets and presents several novel results and well-designed experiments supporting the theoretical results on the mergeability of different statistical quantities.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"38 1","pages":"20"},"PeriodicalIF":1.8,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/2539032.2539033","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64153211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The World Wide Web Consortium (W3C) recently introduced property paths in SPARQL 1.1, a query language for RDF data. Property paths allow SPARQL queries to evaluate regular expressions over graph-structured data. However, they differ from standard regular expressions in several notable aspects. For example, they have a limited form of negation, they have numerical occurrence indicators as syntactic sugar, and their semantics on graphs is defined in a nonstandard manner. We formalize the W3C semantics of property paths and investigate various query evaluation problems on graphs. More specifically, let x and y be two nodes in an edge-labeled graph and r be an expression. We study the complexities of: (1) deciding whether there exists a path from x to y that matches r and (2) counting how many paths from x to y match r. Our main results show that, compared to an alternative semantics of regular expressions on graphs, the complexity of (1) and (2) under W3C semantics is significantly higher. Whereas the alternative semantics remains in polynomial time for large fragments of expressions, the W3C semantics makes problems (1) and (2) intractable almost immediately. As a side result, we prove that the membership problem for regular expressions with numerical occurrence indicators and negation is in polynomial time.
{"title":"The complexity of regular expressions and property paths in SPARQL","authors":"Katja Losemann, W. Martens","doi":"10.1145/2494529","DOIUrl":"https://doi.org/10.1145/2494529","url":null,"abstract":"The World Wide Web Consortium (W3C) recently introduced property paths in SPARQL 1.1, a query language for RDF data. Property paths allow SPARQL queries to evaluate regular expressions over graph-structured data. However, they differ from standard regular expressions in several notable aspects. For example, they have a limited form of negation, they have numerical occurrence indicators as syntactic sugar, and their semantics on graphs is defined in a nonstandard manner.\u0000 We formalize the W3C semantics of property paths and investigate various query evaluation problems on graphs. More specifically, let x and y be two nodes in an edge-labeled graph and r be an expression. We study the complexities of: (1) deciding whether there exists a path from x to y that matches r and (2) counting how many paths from x to y match r. Our main results show that, compared to an alternative semantics of regular expressions on graphs, the complexity of (1) and (2) under W3C semantics is significantly higher. Whereas the alternative semantics remains in polynomial time for large fragments of expressions, the W3C semantics makes problems (1) and (2) intractable almost immediately.\u0000 As a side-result, we prove that the membership problem for regular expressions with numerical occurrence indicators and negation is in polynomial time.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"214 1","pages":"24"},"PeriodicalIF":1.8,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74161554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}