We study exact and approximate methods for maximum inner product search, a fundamental problem in a number of data mining and information retrieval tasks. We propose the LEMP framework, which supports both exact and approximate search with quality guarantees. At its heart, LEMP transforms a maximum inner product search problem over a large database of vectors into a number of smaller cosine similarity search problems. This transformation allows LEMP to prune large parts of the search space immediately and to select suitable search algorithms for each of the remaining problems individually. LEMP is able to leverage existing methods for cosine similarity search, but we also provide a number of novel search algorithms tailored to our setting. We conducted an extensive experimental study that provides insight into the performance of many state-of-the-art techniques—including LEMP—on multiple real-world datasets. We found that LEMP often was significantly faster or more accurate than alternative methods.
{"title":"Exact and Approximate Maximum Inner Product Search with LEMP","authors":"Christina Teflioudi, Rainer Gemulla","doi":"10.1145/2996452","DOIUrl":"https://doi.org/10.1145/2996452","url":null,"abstract":"We study exact and approximate methods for maximum inner product search, a fundamental problem in a number of data mining and information retrieval tasks. We propose the LEMP framework, which supports both exact and approximate search with quality guarantees. At its heart, LEMP transforms a maximum inner product search problem over a large database of vectors into a number of smaller cosine similarity search problems. This transformation allows LEMP to prune large parts of the search space immediately and to select suitable search algorithms for each of the remaining problems individually. LEMP is able to leverage existing methods for cosine similarity search, but we also provide a number of novel search algorithms tailored to our setting. We conducted an extensive experimental study that provides insight into the performance of many state-of-the-art techniques—including LEMP—on multiple real-world datasets. We found that LEMP often was significantly faster or more accurate than alternative methods.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"1 1","pages":"1 - 49"},"PeriodicalIF":0.0,"publicationDate":"2016-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82356636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Instead of constructing complex declarative queries, many users prefer to write their programs using procedural code embedded with simple queries. Since many users are not expert programmers or the programs are written in a rush, these programs usually exhibit poor performance in practice and it is a challenge to automatically and efficiently optimize these programs. In this article, we present UniAD, which stands for Unified execution for Ad hoc Data processing, a system designed to simplify the programming of data processing tasks and provide efficient execution for user programs. We provide the background of program semantics and propose a novel intermediate representation, called Unified Intermediate Representation (UniIR), which utilizes a simple and expressive mechanism HOQ to describe the operations performed in programs. By combining both procedural and declarative logics with the proposed intermediate representation, we can perform various optimizations across the boundary between procedural and declarative code. We propose a transformation-based optimizer to automatically optimize programs and implement the UniAD system. The extensive experimental results on various benchmarks demonstrate that our techniques can significantly improve the performance of a wide range of data processing programs.
{"title":"UniAD","authors":"Xiaogang Shi, B. Cui, G. Dobbie, Beng Chin Ooi","doi":"10.1145/3009957","DOIUrl":"https://doi.org/10.1145/3009957","url":null,"abstract":"Instead of constructing complex declarative queries, many users prefer to write their programs using procedural code embedded with simple queries. Since many users are not expert programmers or the programs are written in a rush, these programs usually exhibit poor performance in practice and it is a challenge to automatically and efficiently optimize these programs. In this article, we present UniAD, which stands for Unified execution for Ad hoc Data processing, a system designed to simplify the programming of data processing tasks and provide efficient execution for user programs. We provide the background of program semantics and propose a novel intermediate representation, called Unified Intermediate Representation (UniIR), which utilizes a simple and expressive mechanism HOQ to describe the operations performed in programs. By combining both procedural and declarative logics with the proposed intermediate representation, we can perform various optimizations across the boundary between procedural and declarative code. We propose a transformation-based optimizer to automatically optimize programs and implement the UniAD system. The extensive experimental results on various benchmarks demonstrate that our techniques can significantly improve the performance of a wide range of data processing programs.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"22 1","pages":"1 - 42"},"PeriodicalIF":0.0,"publicationDate":"2016-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77549439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiufeng Liu, Lukasz Golab, W. Golab, I. Ilyas, Shichao Jin
Smart electricity meters have been replacing conventional meters worldwide, enabling automated collection of fine-grained (e.g., every 15 minutes or hourly) consumption data. A variety of smart meter analytics algorithms and applications have been proposed, mainly in the smart grid literature. However, the focus has been on what can be done with the data rather than how to do it efficiently. In this article, we examine smart meter analytics from a software performance perspective. First, we design a performance benchmark that includes common smart meter analytics tasks. These include offline feature extraction and model building as well as a framework for online anomaly detection that we propose. Second, since obtaining real smart meter data is difficult due to privacy issues, we present an algorithm for generating large realistic datasets from a small seed of real data. Third, we implement the proposed benchmark using five representative platforms: a traditional numeric computing platform (Matlab), a relational DBMS with a built-in machine learning toolkit (PostgreSQL/MADlib), a main-memory column store (“System C”), and two distributed data processing platforms (Hive and Spark/Spark Streaming). We compare the five platforms in terms of application development effort and performance on a multicore machine as well as a cluster of 16 commodity servers.
{"title":"Smart Meter Data Analytics","authors":"Xiufeng Liu, Lukasz Golab, W. Golab, I. Ilyas, Shichao Jin","doi":"10.1145/3004295","DOIUrl":"https://doi.org/10.1145/3004295","url":null,"abstract":"Smart electricity meters have been replacing conventional meters worldwide, enabling automated collection of fine-grained (e.g., every 15 minutes or hourly) consumption data. A variety of smart meter analytics algorithms and applications have been proposed, mainly in the smart grid literature. However, the focus has been on what can be done with the data rather than how to do it efficiently. In this article, we examine smart meter analytics from a software performance perspective. First, we design a performance benchmark that includes common smart meter analytics tasks. These include offline feature extraction and model building as well as a framework for online anomaly detection that we propose. Second, since obtaining real smart meter data is difficult due to privacy issues, we present an algorithm for generating large realistic datasets from a small seed of real data. Third, we implement the proposed benchmark using five representative platforms: a traditional numeric computing platform (Matlab), a relational DBMS with a built-in machine learning toolkit (PostgreSQL/MADlib), a main-memory column store (“System C”), and two distributed data processing platforms (Hive and Spark/Spark Streaming). We compare the five platforms in terms of application development effort and performance on a multicore machine as well as a cluster of 16 commodity servers.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"1 1","pages":"1 - 39"},"PeriodicalIF":0.0,"publicationDate":"2016-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78691424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mahmoud Abo Khamis, H. Ngo, Christopher Ré, A. Rudra
We present a simple geometric framework for the relational join. Using this framework, we design an algorithm that achieves the fractional hypertree-width bound, which generalizes classical and recent worst-case algorithmic results on computing joins. In addition, we use our framework and the same algorithm to show a series of what are colloquially known as beyond worst-case results. The framework allows us to prove results for data stored in BTrees, multidimensional data structures, and even multiple indices per table. A key idea in our framework is formalizing the inference one does with an index as a type of geometric resolution, transforming the algorithmic problem of computing joins to a geometric problem. Our notion of geometric resolution can be viewed as a geometric analog of logical resolution. In addition to the geometry and logic connections, our algorithm can also be thought of as backtracking search with memoization.
{"title":"Joins via Geometric Resolutions","authors":"Mahmoud Abo Khamis, H. Ngo, Christopher Ré, A. Rudra","doi":"10.1145/2967101","DOIUrl":"https://doi.org/10.1145/2967101","url":null,"abstract":"We present a simple geometric framework for the relational join. Using this framework, we design an algorithm that achieves the fractional hypertree-width bound, which generalizes classical and recent worst-case algorithmic results on computing joins. In addition, we use our framework and the same algorithm to show a series of what are colloquially known as beyond worst-case results. The framework allows us to prove results for data stored in BTrees, multidimensional data structures, and even multiple indices per table. A key idea in our framework is formalizing the inference one does with an index as a type of geometric resolution, transforming the algorithmic problem of computing joins to a geometric problem. Our notion of geometric resolution can be viewed as a geometric analog of logical resolution. In addition to the geometry and logic connections, our algorithm can also be thought of as backtracking search with memoization.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"45 1","pages":"1 - 45"},"PeriodicalIF":0.0,"publicationDate":"2016-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77528447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dimitra Papadimitriou, G. Koutrika, J. Mylopoulos, Yannis Velegrakis
Human activity is almost always intentional, be it in a physical context or as part of an interaction with a computer system. By understanding why user-generated events are happening and what purposes they serve, a system can offer a significantly improved and more engaging experience. However, goals cannot be easily captured. Analyzing user actions such as clicks and purchases can reveal patterns and behaviors, but understanding the goals behind these actions is a different and challenging issue. Our work presents a unified, multidisciplinary viewpoint for goal management that covers many different cases where goals can be used and techniques with which they can be exploited. Our purpose is to provide a common reference point to the concepts and challenging tasks that need to be formally defined when someone wants to approach a data analysis problem from a goal-oriented point of view. This work also serves as a springboard to discuss several open challenges and opportunities for goal-oriented approaches in data management, analysis, and sharing systems and applications.
{"title":"The Goal Behind the Action","authors":"Dimitra Papadimitriou, G. Koutrika, J. Mylopoulos, Yannis Velegrakis","doi":"10.1145/2934666","DOIUrl":"https://doi.org/10.1145/2934666","url":null,"abstract":"Human activity is almost always intentional, be it in a physical context or as part of an interaction with a computer system. By understanding why user-generated events are happening and what purposes they serve, a system can offer a significantly improved and more engaging experience. However, goals cannot be easily captured. Analyzing user actions such as clicks and purchases can reveal patterns and behaviors, but understanding the goals behind these actions is a different and challenging issue. Our work presents a unified, multidisciplinary viewpoint for goal management that covers many different cases where goals can be used and techniques with which they can be exploited. Our purpose is to provide a common reference point to the concepts and challenging tasks that need to be formally defined when someone wants to approach a data analysis problem from a goal-oriented point of view. This work also serves as a springboard to discuss several open challenges and opportunities for goal-oriented approaches in data management, analysis, and sharing systems and applications.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"1 1","pages":"1 - 43"},"PeriodicalIF":0.0,"publicationDate":"2016-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89768905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Maabout, C. Ordonez, Patrick Kamnang Wanko, N. Hanusse
Given a table T(Id, D1, …, Dd), the skycube of T is the set of skylines with respect to to all nonempty subsets (subspaces) of the set of all dimensions {D1, …, Dd}. To optimize the evaluation of any skyline query, the solutions proposed so far in the literature either (i) precompute all of the skylines or (ii) use compression techniques so that the derivation of any skyline can be done with little effort. Even though solutions (i) are appealing because skyline queries have optimal execution time, they suffer from time and space scalability because the number of skylines to be materialized is exponential with respect to d. On the other hand, solutions (ii) are attractive in terms of memory consumption, but as we show, they also have a high time complexity. In this article, we make contributions to both kinds of solutions. We first observe that skyline patterns are monotonic. This property leads to a simple yet efficient solution for full and partial skycube materialization when the skyline with respect to all dimensions, the topmost skyline, is small. On the other hand, when the topmost skyline is large relative to the size of the input table, it turns out that functional dependencies, a fundamental concept in databases, uncover a monotonic property between skylines. Equipped with this information, we show that closed attributes sets are fundamental for partial and full skycube materialization. Extensive experiments with real and synthetic datasets show that our solutions generally outperform state-of-the-art algorithms.
{"title":"Skycube Materialization Using the Topmost Skyline or Functional Dependencies","authors":"S. Maabout, C. Ordonez, Patrick Kamnang Wanko, N. Hanusse","doi":"10.1145/2955092","DOIUrl":"https://doi.org/10.1145/2955092","url":null,"abstract":"Given a table T(Id, D1, …, Dd), the skycube of T is the set of skylines with respect to to all nonempty subsets (subspaces) of the set of all dimensions {D1, …, Dd}. To optimize the evaluation of any skyline query, the solutions proposed so far in the literature either (i) precompute all of the skylines or (ii) use compression techniques so that the derivation of any skyline can be done with little effort. Even though solutions (i) are appealing because skyline queries have optimal execution time, they suffer from time and space scalability because the number of skylines to be materialized is exponential with respect to d. On the other hand, solutions (ii) are attractive in terms of memory consumption, but as we show, they also have a high time complexity. In this article, we make contributions to both kinds of solutions. We first observe that skyline patterns are monotonic. This property leads to a simple yet efficient solution for full and partial skycube materialization when the skyline with respect to all dimensions, the topmost skyline, is small. On the other hand, when the topmost skyline is large relative to the size of the input table, it turns out that functional dependencies, a fundamental concept in databases, uncover a monotonic property between skylines. Equipped with this information, we show that closed attributes sets are fundamental for partial and full skycube materialization. Extensive experiments with real and synthetic datasets show that our solutions generally outperform state-of-the-art algorithms.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"18 1","pages":"1 - 40"},"PeriodicalIF":0.0,"publicationDate":"2016-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84435336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuanyuan Tian, Fatma Özcan, Tao Zou, R. Goncalves, H. Pirahesh
The Hadoop Distributed File System (HDFS) has become an important data repository in the enterprise as the center for all business analytics, from SQL queries and machine learning to reporting. At the same time, enterprise data warehouses (EDWs) continue to support critical business analytics. This has created the need for a new generation of a special federation between Hadoop-like big data platforms and EDWs, which we call the hybrid warehouse. There are many applications that require correlating data stored in HDFS with EDW data, such as the analysis that associates click logs stored in HDFS with the sales data stored in the database. All existing solutions reach out to HDFS and read the data into the EDW to perform the joins, assuming that the Hadoop side does not have efficient SQL support. In this article, we show that it is actually better to do most data processing on the HDFS side, provided that we can leverage a sophisticated execution engine for joins on the Hadoop side. We identify the best hybrid warehouse architecture by studying various algorithms to join database and HDFS tables. We utilize Bloom filters to minimize the data movement and exploit the massive parallelism in both systems to the fullest extent possible. We describe a new zigzag join algorithm and show that it is a robust join algorithm for hybrid warehouses that performs well in almost all cases. We further develop a sophisticated cost model for the various join algorithms and show that it can facilitate query optimization in the hybrid warehouse to correctly choose the right algorithm under different predicate and join selectivities.
{"title":"Building a Hybrid Warehouse","authors":"Yuanyuan Tian, Fatma Özcan, Tao Zou, R. Goncalves, H. Pirahesh","doi":"10.1145/2972950","DOIUrl":"https://doi.org/10.1145/2972950","url":null,"abstract":"The Hadoop Distributed File System (HDFS) has become an important data repository in the enterprise as the center for all business analytics, from SQL queries and machine learning to reporting. At the same time, enterprise data warehouses (EDWs) continue to support critical business analytics. This has created the need for a new generation of a special federation between Hadoop-like big data platforms and EDWs, which we call the hybrid warehouse. There are many applications that require correlating data stored in HDFS with EDW data, such as the analysis that associates click logs stored in HDFS with the sales data stored in the database. All existing solutions reach out to HDFS and read the data into the EDW to perform the joins, assuming that the Hadoop side does not have efficient SQL support. In this article, we show that it is actually better to do most data processing on the HDFS side, provided that we can leverage a sophisticated execution engine for joins on the Hadoop side. We identify the best hybrid warehouse architecture by studying various algorithms to join database and HDFS tables. We utilize Bloom filters to minimize the data movement and exploit the massive parallelism in both systems to the fullest extent possible. We describe a new zigzag join algorithm and show that it is a robust join algorithm for hybrid warehouses that performs well in almost all cases. We further develop a sophisticated cost model for the various join algorithms and show that it can facilitate query optimization in the hybrid warehouse to correctly choose the right algorithm under different predicate and join selectivities.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"146 1","pages":"1 - 38"},"PeriodicalIF":0.0,"publicationDate":"2016-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88095629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bettina Fazzinga, S. Flesca, F. Furfaro, F. Parisi
A probabilistic framework for cleaning the data collected by Radio-Frequency IDentification (RFID) tracking systems is introduced. What has to be cleaned is the set of trajectories that are the possible interpretations of the readings: a trajectory in this set is a sequence whose generic element is a location covered by the reader(s) that made the detection at the corresponding time point. The cleaning is guided by integrity constraints and consists of discarding the inconsistent trajectories and assigning to the others a suitable probability of being the actual one. The probabilities are evaluated by adopting probabilistic conditioning that logically consists of the following steps. First, the trajectories are assigned a priori probabilities that rely on the independence assumption between the time points. Then, these probabilities are revised according to the spatio-temporal correlations encoded by the constraints. This is done by conditioning the a priori probability of each trajectory to the event that the constraints are satisfied: this means taking the ratio of this a priori probability to the sum of the a priori probabilities of all the consistent trajectories. Instead of performing these steps by materializing all the trajectories and their a priori probabilities (which is infeasible, owing to the typically huge number of trajectories), our approach exploits a data structure called conditioned trajectory graph (ct-graph) that compactly represents the trajectories and their conditioned probabilities, and an algorithm for efficiently constructing the ct-graph, which progressively builds it while avoiding the construction of components encoding inconsistent trajectories.
{"title":"Exploiting Integrity Constraints for Cleaning Trajectories of RFID-Monitored Objects","authors":"Bettina Fazzinga, S. Flesca, F. Furfaro, F. Parisi","doi":"10.1145/2939368","DOIUrl":"https://doi.org/10.1145/2939368","url":null,"abstract":"A probabilistic framework for cleaning the data collected by Radio-Frequency IDentification (RFID) tracking systems is introduced. What has to be cleaned is the set of trajectories that are the possible interpretations of the readings: a trajectory in this set is a sequence whose generic element is a location covered by the reader(s) that made the detection at the corresponding time point. The cleaning is guided by integrity constraints and consists of discarding the inconsistent trajectories and assigning to the others a suitable probability of being the actual one. The probabilities are evaluated by adopting probabilistic conditioning that logically consists of the following steps. First, the trajectories are assigned a priori probabilities that rely on the independence assumption between the time points. Then, these probabilities are revised according to the spatio-temporal correlations encoded by the constraints. This is done by conditioning the a priori probability of each trajectory to the event that the constraints are satisfied: this means taking the ratio of this a priori probability to the sum of the a priori probabilities of all the consistent trajectories. Instead of performing these steps by materializing all the trajectories and their a priori probabilities (which is infeasible, owing to the typically huge number of trajectories), our approach exploits a data structure called conditioned trajectory graph (ct-graph) that compactly represents the trajectories and their conditioned probabilities, and an algorithm for efficiently constructing the ct-graph, which progressively builds it while avoiding the construction of components encoding inconsistent trajectories.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"15 1","pages":"1 - 52"},"PeriodicalIF":0.0,"publicationDate":"2016-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87491699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Bourhis, M. Manna, Michael Morak, Andreas Pieris
We perform an in-depth complexity analysis of query answering under guarded-based classes of disjunctive tuple-generating dependencies (DTGDs), focusing on (unions of) conjunctive queries ((U)CQs). We show that the problem under investigation is very hard, namely 2ExpTime-complete, even for fixed sets of dependencies of a very restricted form. This is a surprising lower bound that demonstrates the enormous impact of disjunction on query answering under guarded-based tuple-generating dependencies, and also reveals the source of complexity for expressive logics such as the guarded fragment of first-order logic. We then proceed to investigate whether prominent subclasses of (U)CQs (i.e., queries of bounded treewidth and hypertree-width, and acyclic queries) have a positive impact on the complexity of the problem under consideration. We show that queries of bounded treewidth and bounded hypertree-width do not reduce the complexity of our problem, even if we focus on predicates of bounded arity or on fixed sets of DTGDs. Regarding acyclic queries, although the problem remains 2ExpTime-complete in general, in some relevant settings the complexity reduces to ExpTime-complete. Finally, with the aim of identifying tractable cases, we focus our attention on atomic queries. We show that atomic queries do not make the query answering problem easier under classes of guarded-based DTGDs that allow more than one atom to occur in the body of the dependencies. However, the complexity significantly decreases in the case of dependencies that can have only one atom in the body. In particular, we obtain a Ptime-completeness if we focus on predicates of bounded arity, and AC0-membership when the set of dependencies and the query are fixed. Interestingly, our results can be used as a generic tool for establishing complexity results for query answering under various description logics.
{"title":"Guarded-Based Disjunctive Tuple-Generating Dependencies","authors":"P. Bourhis, M. Manna, Michael Morak, Andreas Pieris","doi":"10.1145/2976736","DOIUrl":"https://doi.org/10.1145/2976736","url":null,"abstract":"We perform an in-depth complexity analysis of query answering under guarded-based classes of disjunctive tuple-generating dependencies (DTGDs), focusing on (unions of) conjunctive queries ((U)CQs). We show that the problem under investigation is very hard, namely 2ExpTime-complete, even for fixed sets of dependencies of a very restricted form. This is a surprising lower bound that demonstrates the enormous impact of disjunction on query answering under guarded-based tuple-generating dependencies, and also reveals the source of complexity for expressive logics such as the guarded fragment of first-order logic. We then proceed to investigate whether prominent subclasses of (U)CQs (i.e., queries of bounded treewidth and hypertree-width, and acyclic queries) have a positive impact on the complexity of the problem under consideration. We show that queries of bounded treewidth and bounded hypertree-width do not reduce the complexity of our problem, even if we focus on predicates of bounded arity or on fixed sets of DTGDs. Regarding acyclic queries, although the problem remains 2ExpTime-complete in general, in some relevant settings the complexity reduces to ExpTime-complete. Finally, with the aim of identifying tractable cases, we focus our attention on atomic queries. We show that atomic queries do not make the query answering problem easier under classes of guarded-based DTGDs that allow more than one atom to occur in the body of the dependencies. However, the complexity significantly decreases in the case of dependencies that can have only one atom in the body. In particular, we obtain a Ptime-completeness if we focus on predicates of bounded arity, and AC0-membership when the set of dependencies and the query are fixed. Interestingly, our results can be used as a generic tool for establishing complexity results for query answering under various description logics.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"29 1","pages":"1 - 45"},"PeriodicalIF":0.0,"publicationDate":"2016-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79370569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anton Dignös, Michael H. Böhlen, J. Gamper, Christian S. Jensen
Many databases contain temporal, or time-referenced, data and use intervals to capture the temporal aspect. While SQL-based database management systems (DBMSs) are capable of supporting the management of interval data, the support they offer can be improved considerably. A range of proposed temporal data models and query languages offer ample evidence to this effect. Natural queries that are very difficult to formulate in SQL are easy to formulate in these temporal query languages. The increased focus on analytics over historical data where queries are generally more complex exacerbates the difficulties and thus the potential benefits of a temporal query language. Commercial DBMSs have recently started to offer limited temporal functionality in a step-by-step manner, focusing on the representation of intervals and neglecting the implementation of the query evaluation engine. This article demonstrates how it is possible to extend the relational database engine to achieve a full-fledged, industrial-strength implementation of sequenced temporal queries, which intuitively are queries that are evaluated at each time point. Our approach reduces temporal queries to nontemporal queries over data with adjusted intervals, and it leaves the processing of nontemporal queries unaffected. Specifically, the approach hinges on three concepts: interval adjustment, timestamp propagation, and attribute scaling. Interval adjustment is enabled by introducing two new relational operators, a temporal normalizer and a temporal aligner, and the latter two concepts are enabled by the replication of timestamp attributes and the use of so-called scaling functions. By providing a set of reduction rules, we can transform any temporal query, expressed in terms of temporal relational operators, to a query expressed in terms of relational operators and the two new operators. We prove that the size of a transformed query is linear in the number of temporal operators in the original query. An integration of the new operators and the transformation rules, along with query optimization rules, into the kernel of PostgreSQL is reported. Empirical studies with the resulting temporal DBMS are covered that offer insights into pertinent design properties of the article's proposal. The new system is available as open-source software.
{"title":"Extending the Kernel of a Relational DBMS with Comprehensive Support for Sequenced Temporal Queries","authors":"Anton Dignös, Michael H. Böhlen, J. Gamper, Christian S. Jensen","doi":"10.1145/2967608","DOIUrl":"https://doi.org/10.1145/2967608","url":null,"abstract":"Many databases contain temporal, or time-referenced, data and use intervals to capture the temporal aspect. While SQL-based database management systems (DBMSs) are capable of supporting the management of interval data, the support they offer can be improved considerably. A range of proposed temporal data models and query languages offer ample evidence to this effect. Natural queries that are very difficult to formulate in SQL are easy to formulate in these temporal query languages. The increased focus on analytics over historical data where queries are generally more complex exacerbates the difficulties and thus the potential benefits of a temporal query language. Commercial DBMSs have recently started to offer limited temporal functionality in a step-by-step manner, focusing on the representation of intervals and neglecting the implementation of the query evaluation engine. This article demonstrates how it is possible to extend the relational database engine to achieve a full-fledged, industrial-strength implementation of sequenced temporal queries, which intuitively are queries that are evaluated at each time point. Our approach reduces temporal queries to nontemporal queries over data with adjusted intervals, and it leaves the processing of nontemporal queries unaffected. Specifically, the approach hinges on three concepts: interval adjustment, timestamp propagation, and attribute scaling. Interval adjustment is enabled by introducing two new relational operators, a temporal normalizer and a temporal aligner, and the latter two concepts are enabled by the replication of timestamp attributes and the use of so-called scaling functions. By providing a set of reduction rules, we can transform any temporal query, expressed in terms of temporal relational operators, to a query expressed in terms of relational operators and the two new operators. We prove that the size of a transformed query is linear in the number of temporal operators in the original query. An integration of the new operators and the transformation rules, along with query optimization rules, into the kernel of PostgreSQL is reported. Empirical studies with the resulting temporal DBMS are covered that offer insights into pertinent design properties of the article's proposal. The new system is available as open-source software.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"6 1","pages":"1 - 46"},"PeriodicalIF":0.0,"publicationDate":"2016-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87124117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}