In the last decade, improvements in the single-core performance of CPUs have stagnated. Consequently, methods for the development and optimization of software for these platforms have to be reconsidered. Software must be optimized such that the available single-core performance is exploited more effectively. This can be achieved by reducing the number of instructions that need to be executed. In this article, we show that layered database applications execute many redundant, nonessential instructions that can be eliminated without affecting the course of execution and the output of the application. This elimination is performed using a vertical integration process that breaks down the different layers of layered database applications. By doing so, applications are reduced to their essence, and as a consequence, transformations that were not possible before can be carried out, affecting both the application code and the data access code. We show that this vertical integration process can be fully automated and, as such, be integrated in an operational workflow. Experimental evaluation of this approach shows that up to 95% of the instructions can be eliminated. The reduction of instructions leads to a more efficient use of the available hardware resources. This results in greatly improved performance of the application and a significant reduction in energy consumption.
{"title":"Reducing Layered Database Applications to their Essence through Vertical Integration","authors":"K. Rietveld, H. Wijshoff","doi":"10.1145/2818180","DOIUrl":"https://doi.org/10.1145/2818180","url":null,"abstract":"In the last decade, improvements on single-core performance of CPUs has stagnated. Consequently, methods for the development and optimization of software for these platforms have to be reconsidered. Software must be optimized such that the available single-core performance is exploited more effectively. This can be achieved by reducing the number of instructions that need to be executed. In this article, we show that layered database applications execute many redundant, nonessential, instructions that can be eliminated without affecting the course of execution and the output of the application. This elimination is performed using a vertical integration process which breaks down the different layers of layered database applications. By doing so, applications are being reduced to their essence, and as a consequence, transformations can be carried out that affect both the application code and the data access code which were not possible before. We show that this vertical integration process can be fully automated and, as such, be integrated in an operational workflow. Experimental evaluation of this approach shows that up to 95% of the instructions can be eliminated. The reduction of instructions leads to a more efficient use of the available hardware resources. This results in greatly improved performance of the application and a significant reduction in energy consumption.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"15 1","pages":"18:1-18:39"},"PeriodicalIF":1.8,"publicationDate":"2015-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84116670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zitong Chen, Yubao Liu, R. C. Wong, Jiamin Xiong, Ganglin Mai, Cheng Long
In this article, we study an optimal location query based on a road network. Specifically, given a road network containing clients and servers, an optimal location query finds a location on the road network such that when a new server is set up at this location, a certain cost function computed based on the clients and servers (including the new server) is optimized. Two types of cost functions, namely, MinMax and MaxSum, have been used for this query. The optimal location query problem with MinMax as the cost function is called the MinMax query, which finds a location for setting up a new server such that the maximum cost of a client being served by his/her closest server is minimized. The optimal location query problem with MaxSum as the cost function is called the MaxSum query, which finds a location for setting up a new server such that the sum of the weights of clients attracted by the new server is maximized. The MinMax query and the MaxSum query correspond to two types of optimal location query with the objectives defined from the clients' perspective and from the new server's perspective, respectively. Unfortunately, the existing solutions for the optimal query problem are not efficient. In this article, we propose an efficient algorithm, namely, MinMax-Alg (MaxSum-Alg), for the MinMax (MaxSum) query, which is based on a novel idea of nearest location component. We also discuss two extensions of the optimal location query, namely, the optimal multiple-location query and the optimal location query on a 3D road network. Extensive experiments were conducted, showing that our algorithms are faster than the state of the art by at least an order of magnitude on large real benchmark datasets. For example, in our largest real datasets, the state of the art ran for more than 10 (12) hours while our algorithm ran within 3 (2) minutes only for the MinMax (MaxSum) query, that is, our algorithm ran at least 200 (600) times faster than the state of the art.
{"title":"Optimal Location Queries in Road Networks","authors":"Zitong Chen, Yubao Liu, R. C. Wong, Jiamin Xiong, Ganglin Mai, Cheng Long","doi":"10.1145/2818179","DOIUrl":"https://doi.org/10.1145/2818179","url":null,"abstract":"In this article, we study an optimal location query based on a road network. Specifically, given a road network containing clients and servers, an optimal location query finds a location on the road network such that when a new server is set up at this location, a certain cost function computed based on the clients and servers (including the new server) is optimized. Two types of cost functions, namely, MinMax and MaxSum, have been used for this query. The optimal location query problem with MinMax as the cost function is called the MinMax query, which finds a location for setting up a new server such that the maximum cost of a client being served by his/her closest server is minimized. The optimal location query problem with MaxSum as the cost function is called the MaxSum query, which finds a location for setting up a new server such that the sum of the weights of clients attracted by the new server is maximized. The MinMax query and the MaxSum query correspond to two types of optimal location query with the objectives defined from the clients' perspective and from the new server's perspective, respectively. Unfortunately, the existing solutions for the optimal query problem are not efficient. In this article, we propose an efficient algorithm, namely, MinMax-Alg (MaxSum-Alg), for the MinMax (MaxSum) query, which is based on a novel idea of nearest location component. We also discuss two extensions of the optimal location query, namely, the optimal multiple-location query and the optimal location query on a 3D road network. Extensive experiments were conducted, showing that our algorithms are faster than the state of the art by at least an order of magnitude on large real benchmark datasets. For example, in our largest real datasets, the state of the art ran for more than 10 (12) hours while our algorithm ran within 3 (2) minutes only for the MinMax (MaxSum) query, that is, our algorithm ran at least 200 (600) times faster than the state of the art.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"5 1","pages":"17:1-17:41"},"PeriodicalIF":1.8,"publicationDate":"2015-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81579240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Panos Parchas, Francesco Gullo, D. Papadias, F. Bonchi
Data in several applications can be represented as an uncertain graph whose edges are labeled with a probability of existence. Exact query processing on uncertain graphs is prohibitive for most applications, as it involves evaluation over an exponential number of instantiations. Thus, typical approaches employ Monte-Carlo sampling, which (i) draws a number of possible graphs (samples), (ii) evaluates the query on each of them, and (iii) aggregates the individual answers to generate the final result. However, this approach can also be extremely time consuming for large uncertain graphs commonly found in practice. To facilitate efficiency, we study the problem of extracting a single representative instance from an uncertain graph. Conventional processing techniques can then be applied on this representative to closely approximate the result on the original graph. In order to maintain data utility, the representative instance should preserve structural characteristics of the uncertain graph. We start with representatives that capture the expected vertex degrees, as this is a fundamental property of the graph topology. We then generalize the notion of vertex degree to the concept of n-clique cardinality, that is, the number of cliques of size n that contain a vertex. For the first problem, we propose two methods: Average Degree Rewiring (ADR), which is based on random edge rewiring, and Approximate B-Matching (ABM), which applies graph matching techniques. For the second problem, we develop a greedy approach and a game-theoretic framework. We experimentally demonstrate, with real uncertain graphs, that indeed the representative instances can be used to answer, efficiently and accurately, queries based on several metrics such as shortest path distance, clustering coefficient, and betweenness centrality.
{"title":"Uncertain Graph Processing through Representative Instances","authors":"Panos Parchas, Francesco Gullo, D. Papadias, F. Bonchi","doi":"10.1145/2818182","DOIUrl":"https://doi.org/10.1145/2818182","url":null,"abstract":"Data in several applications can be represented as an uncertain graph whose edges are labeled with a probability of existence. Exact query processing on uncertain graphs is prohibitive for most applications, as it involves evaluation over an exponential number of instantiations. Thus, typical approaches employ Monte-Carlo sampling, which (i) draws a number of possible graphs (samples), (ii) evaluates the query on each of them, and (iii) aggregates the individual answers to generate the final result. However, this approach can also be extremely time consuming for large uncertain graphs commonly found in practice. To facilitate efficiency, we study the problem of extracting a single representative instance from an uncertain graph. Conventional processing techniques can then be applied on this representative to closely approximate the result on the original graph.\u0000 In order to maintain data utility, the representative instance should preserve structural characteristics of the uncertain graph. We start with representatives that capture the expected vertex degrees, as this is a fundamental property of the graph topology. We then generalize the notion of vertex degree to the concept of n-clique cardinality, that is, the number of cliques of size n that contain a vertex. For the first problem, we propose two methods: Average Degree Rewiring (ADR), which is based on random edge rewiring, and Approximate B-Matching (ABM), which applies graph matching techniques. For the second problem, we develop a greedy approach and a game-theoretic framework. We experimentally demonstrate, with real uncertain graphs, that indeed the representative instances can be used to answer, efficiently and accurately, queries based on several metrics such as shortest path distance, clustering coefficient, and betweenness centrality.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"54 1","pages":"20:1-20:39"},"PeriodicalIF":1.8,"publicationDate":"2015-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89336246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Florin Rusu, Zixuan Zhuang, Mingxi Wu, C. Jermaine
Antijoin cardinality estimation is among a handful of problems that have eluded accurate, efficient solutions amenable to implementation in relational query optimizers. Given the widespread use of antijoin and subset-based queries in analytical workloads and the extensive research targeted at join cardinality estimation—a seemingly related problem—the lack of adequate solutions for antijoin cardinality estimation is intriguing. In this article, we introduce a novel sampling-based estimator for antijoin cardinality that (unlike existing estimators) provides sufficient accuracy and efficiency to be implemented in a query optimizer. The proposed estimator incorporates three novel ideas. First, we use prior workload information when learning a mixture superpopulation model of the data offline. Second, we design a Bayesian statistics framework that updates the superpopulation model according to the live queries, thus allowing the estimator to adapt dynamically to the online workload. Third, we develop an efficient algorithm for sampling from a hypergeometric distribution in order to generate Monte Carlo trials, without explicitly instantiating either the population or the sample. When put together, these ideas form the basis of an efficient antijoin cardinality estimator satisfying the strict requirements of a query optimizer, as shown by the extensive experimental results over synthetically generated as well as massive TPC-H data.
{"title":"Workload-Driven Antijoin Cardinality Estimation","authors":"Florin Rusu, Zixuan Zhuang, Mingxi Wu, C. Jermaine","doi":"10.1145/2818178","DOIUrl":"https://doi.org/10.1145/2818178","url":null,"abstract":"Antijoin cardinality estimation is among a handful of problems that has eluded accurate efficient solutions amenable to implementation in relational query optimizers. Given the widespread use of antijoin and subset-based queries in analytical workloads and the extensive research targeted at join cardinality estimation—a seemingly related problem—the lack of adequate solutions for antijoin cardinality estimation is intriguing. In this article, we introduce a novel sampling-based estimator for antijoin cardinality that (unlike existent estimators) provides sufficient accuracy and efficiency to be implemented in a query optimizer. The proposed estimator incorporates three novel ideas. First, we use prior workload information when learning a mixture superpopulation model of the data offline. Second, we design a Bayesian statistics framework that updates the superpopulation model according to the live queries, thus allowing the estimator to adapt dynamically to the online workload. Third, we develop an efficient algorithm for sampling from a hypergeometric distribution in order to generate Monte Carlo trials, without explicitly instantiating either the population or the sample. When put together, these ideas form the basis of an efficient antijoin cardinality estimator satisfying the strict requirements of a query optimizer, as shown by the extensive experimental results over synthetically-generated as well as massive TPC-H data.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"10 1","pages":"16:1-16:41"},"PeriodicalIF":1.8,"publicationDate":"2015-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87121341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A string-similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings “Sam” and “Samuel” can be considered to be similar. Most existing work that computes the similarity of two strings only considers syntactic similarities, for example, number of common words or q-grams. While this is indeed an indicator of similarity, there are many important cases where syntactically-different strings can represent the same real-world object. For example, “Bill” is a short form of “William,” and “Database Management Systems” can be abbreviated as “DBMS.” Given a collection of predefined synonyms, the purpose of this article is to explore such existing knowledge to effectively evaluate the similarity between two strings and efficiently perform similarity searches and joins, thereby boosting the quality of approximate string matching. In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. We then study efficient algorithms for similarity searches and joins by proposing two novel indexes, called SI-trees and QP-trees, which combine signature-filtering and length-filtering strategies. In order to improve the efficiency of our algorithms, we develop an estimator to estimate the size of candidates to enable an online selection of signature filters. This estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs, thus making our method attractive both in theory and in practice. Finally, the experimental results from a comprehensive study of the algorithms with three real datasets verify the effectiveness and efficiency of our approaches.
{"title":"Boosting the Quality of Approximate String Matching by Synonyms","authors":"Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Xiaokui Xiao","doi":"10.1145/2818177","DOIUrl":"https://doi.org/10.1145/2818177","url":null,"abstract":"A string-similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings “Sam” and “Samuel” can be considered to be similar. Most existing work that computes the similarity of two strings only considers syntactic similarities, for example, number of common words or q-grams. While this is indeed an indicator of similarity, there are many important cases where syntactically-different strings can represent the same real-world object. For example, “Bill” is a short form of “William,” and “Database Management Systems” can be abbreviated as “DBMS.” Given a collection of predefined synonyms, the purpose of this article is to explore such existing knowledge to effectively evaluate the similarity between two strings and efficiently perform similarity searches and joins, thereby boosting the quality of approximate string matching.\u0000 In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. We then study efficient algorithms for similarity searches and joins by proposing two novel indexes, called SI-trees and QP-trees, which combine signature-filtering and length-filtering strategies. In order to improve the efficiency of our algorithms, we develop an estimator to estimate the size of candidates to enable an online selection of signature filters. This estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs, thus making our method attractive both in theory and in practice. Finally, the experimental results from a comprehensive study of the algorithms with three real datasets verify the effectiveness and efficiency of our approaches.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"60 1","pages":"15:1-15:42"},"PeriodicalIF":1.8,"publicationDate":"2015-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73288325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traditional intuitive methods of decision-making are no longer adequate to deal with the complex problems faced by the modern policymaker. Thus systems must be developed to provide the information and analysis necessary for the decisions which must be made. These systems are called decision support systems. Although database systems provide a key ingredient to decision support systems, the problems now facing the policymaker are different from those problems to which database systems have been applied in the past. The problems are usually not known in advance, they are constantly changing, and answers are needed quickly. Hence additional technologies, methodologies, and approaches must expand the traditional areas of database and operating systems research (as well as other software and hardware research) in order for them to become truly effective in supporting policymakers. This paper describes recent work in this area and indicates where future work is needed. Specifically the paper discusses: (1) why there exists a vital need for decision support systems; (2) examples from work in the field of energy which make explicit the characteristics which distinguish these decision support systems from traditional operational and managerial systems; (3) how an awareness of decision support systems has evolved, including a brief review of work done by others and a statement of the computational needs of decision support systems which are consistent with contemporary technology; (4) an approach which has been made to meet many of these computational needs through the development and implementation of a computational facility, the Generalized Management Information System (GMIS); and (5) the application of this computational facility to a complex and important energy problem facing New England in a typical study within the New England Energy Management Information System (NEEMIS) Project.
{"title":"Databsse system approach the management decision support","authors":"J. Donovan","doi":"10.1145/320493.320500","DOIUrl":"https://doi.org/10.1145/320493.320500","url":null,"abstract":"Traditional intuitive methods of decision-making are no longer adequate to deal with the complex problems faced by the modern policymaker. Thus systems must be developed to provide the information and analysis necessary for the decisions which must be made. These systems are called decision support systems. Although database systems provide a key ingredient to decision support systems, the problems now facing the policymaker are different from those problems to which database systems have been applied in the past. The problems are usually not known in advance, they are constantly changing, and answers are needed quickly. Hence additional technologies, methodologies, and approaches must expand the traditional areas of database and operating systems research (as well as other software and hardware research) in order for them to become truly effective in supporting policymakers.\u0000This paper describes recent work in this area and indicates where future work is needed. Specifically the paper discusses: (1) why there exists a vital need for decision support systems; (2) examples from work in the field of energy which make explicit the characteristics which distinguish these decision support systems from traditional operational and managerial systems; (3) how an awareness of decision support systems has evolved, including a brief review of work done by others and a statement of the computational needs of decision support systems which are consistent with contemporary technology; (4) an approach which has been made to meet many of these computational needs through the development and implementation of a computational facility, the Generalized Management Information System (GMIS); and (5) the application of this computational facility to a complex and important energy problem facing New England in a typical study within the New England Energy Management Information System (NEEMIS) Project.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"40 1","pages":"344-369"},"PeriodicalIF":1.8,"publicationDate":"2015-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88866176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kaustubh Beedkar, K. Berberich, Rainer Gemulla, Iris Miliaraki
Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this article, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. Both positional and temporal gap constraints, as well as appropriate maximality and closedness constraints, are supported. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of ω-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the contexts of text mining and session analysis suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.
{"title":"Closing the Gap: Sequence Mining at Scale","authors":"Kaustubh Beedkar, K. Berberich, Rainer Gemulla, Iris Miliaraki","doi":"10.1145/2757217","DOIUrl":"https://doi.org/10.1145/2757217","url":null,"abstract":"Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this article, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. Both positional and temporal gap constraints, as well as appropriate maximality and closedness constraints, are supported. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of ω-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the contexts of text mining and session analysis suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"10 1","pages":"8:1-8:44"},"PeriodicalIF":1.8,"publicationDate":"2015-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73189902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient processing of skyline queries has been an area of growing interest. Many of the earlier skyline techniques assumed that the skyline query is applied to a single data table. Naturally, these algorithms were not suitable for many applications in which the skyline query may involve attributes belonging to multiple data sources. In other words, if the data used in the skyline query are stored in multiple tables, then join operations would be required before the skyline can be searched. The task of computing skylines on multiple data sources has been coined as the skyline-join problem and various skyline-join algorithms have been proposed. However, the current proposals suffer from several drawbacks: they often need to scan the input tables exhaustively in order to obtain the set of skyline-join results; moreover, the pruning techniques employed to eliminate the tuples are largely based on expensive pairwise tuple-to-tuple comparisons. In this article, we aim to address these shortcomings by proposing two novel skyline-join algorithms, namely skyline-sensitive join (S2J) and symmetric skyline-sensitive join (S3J), to process skyline queries over two data sources. Our approaches compute the results using a novel layer/region pruning technique (LR-pruning) that prunes the join space in blocks as opposed to individual data points, thereby avoiding excessive pairwise point-to-point dominance checks. Furthermore, the S3J algorithm utilizes an early stopping condition in order to successfully compute the skyline results by accessing only a subset of the input tables. In addition to S2J and S3J, we also propose the S2J-M and S3J-M algorithms. These algorithms extend S2J's and S3J's two-way skyline-join ability to efficiently process skyline-join queries over more than two data sources. S2J-M and S3J-M leverage the extended concept of LR-pruning, called M-way LR-pruning, to compute multi-way skyline-joins in which more than two data sources are integrated during skyline processing. We report extensive experimental results that confirm the advantages of the proposed algorithms over state-of-the-art skyline-join techniques.
{"title":"Efficient Processing of Skyline-Join Queries over Multiple Data Sources","authors":"M. Nagendra, K. Candan","doi":"10.1145/2699483","DOIUrl":"https://doi.org/10.1145/2699483","url":null,"abstract":"Efficient processing of skyline queries has been an area of growing interest. Many of the earlier skyline techniques assumed that the skyline query is applied to a single data table. Naturally, these algorithms were not suitable for many applications in which the skyline query may involve attributes belonging to multiple data sources. In other words, if the data used in the skyline query are stored in multiple tables, then join operations would be required before the skyline can be searched. The task of computing skylines on multiple data sources has been coined as the skyline-join problem and various skyline-join algorithms have been proposed. However, the current proposals suffer several drawbacks: they often need to scan the input tables exhaustively in order to obtain the set of skyline-join results; moreover, the pruning techniques employed to eliminate the tuples are largely based on expensive pairwise tuple-to-tuple comparisons. In this article, we aim to address these shortcomings by proposing two novel skyline-join algorithms, namely skyline-sensitive join (S2J) and symmetric skyline-sensitive join (S3J), to process skyline queries over two data sources. Our approaches compute the results using a novel layer/region pruning technique (LR-pruning) that prunes the join space in blocks as opposed to individual data points, thereby avoiding excessive pairwise point-to-point dominance checks. Furthermore, the S3J algorithm utilizes an early stopping condition in order to successfully compute the skyline results by accessing only a subset of the input tables. In addition to S2J and S3J, we also propose the S2 J-M and S3 J-M algorithms. These algorithms extend S2J's and S3J's two-way skyline-join ability to efficiently process skyline-join queries over more than two data sources. S2 J-M and S3 J-M leverage the extended concept of LR-pruning, called M-way LR-pruning, to compute multi-way skyline-joins in which more than two data sources are integrated during skyline processing. We report extensive experimental results that confirm the advantages of the proposed algorithms over state-of-the-art skyline-join techniques.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"42 1","pages":"10:1-10:46"},"PeriodicalIF":1.8,"publicationDate":"2015-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82475921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To address the frequently occurring situation where data is inexact or imprecise, a number of extensions to the classical notion of a functional dependency (FD) integrity constraint have been proposed in recent years. One of these extensions is the notion of a differential dependency (DD), introduced in the recent article “Differential Dependencies: Reasoning and Discovery” by Song and Chen in the March 2011 edition of this journal. A DD generalises the notion of an FD by requiring only that the values of the attribute from the RHS of the DD satisfy a distance constraint whenever the values of attributes from the LHS of the DD satisfy a distance constraint. In contrast, an FD requires that the values from the attributes in the RHS of an FD be equal whenever the values of the attributes from the LHS of the FD are equal. The article “Differential Dependencies: Reasoning and Discovery” investigated a number of aspects of DDs, the most important of which, since they form the basis for the other topics investigated, were the consistency problem (determining whether there exists a relation instance that satisfies a set of DDs) and the implication problem (determining whether a set of DDs logically implies another DD). Concerning these problems, a number of results were claimed in “Differential Dependencies: Reasoning and Discovery”. In this article we conduct a detailed analysis of the correctness of these results. The outcomes of our analysis are that, for almost every claimed result, we show there are either fundamental errors in the proof or the result is false. For some of the claimed results we are able to provide corrected proofs, but for other results their correctness remains open.
{"title":"Technical Correspondence: “Differential Dependencies: Reasoning and Discovery” Revisited","authors":"M. Vincent, Jixue Liu, Hong-Cheu Liu, S. Link","doi":"10.1145/2757214","DOIUrl":"https://doi.org/10.1145/2757214","url":null,"abstract":"To address the frequently occurring situation where data is inexact or imprecise, a number of extensions to the classical notion of a functional dependency (FD) integrity constraint have been proposed in recent years. One of these extensions is the notion of a differential dependency (DD), introduced in the recent article &ldquoDifferential Dependencies: Reasoning and Discovery&rdquo by Song and Chen in the March 2011 edition of this journal. A DD generalises the notion of an FD by requiring only that the values of the attribute from the RHS of the DD satisfy a distance constraint whenever the values of attributes from the LHS of the DD satisfy a distance constraint. In contrast, an FD requires that the values from the attributes in the RHS of an FD be equal whenever the values of the attributes from the LHS of the FD are equal.\u0000 The article &ldquoDifferential Dependencies: Reasoning and Discovery&rdquo investigated a number of aspects of DDs, the most important of which, since they form the basis for the other topics investigated, were the consistency problem (determining whether there exists a relation instance that satisfies a set of DDs) and the implication problem (determining whether a set of DDs logically implies another DD). Concerning these problems, a number of results were claimed in &ldquoDifferential Dependencies: Reasoning and Discovery&rdquo. In this article we conduct a detailed analysis of the correctness of these results. The outcomes of our analysis are that, for almost every claimed result, we show there are either fundamental errors in the proof or the result is false. For some of the claimed results we are able to provide corrected proofs, but for other results their correctness remains open.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"16 1","pages":"14:1-14:18"},"PeriodicalIF":1.8,"publicationDate":"2015-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81868185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xin Cao, G. Cong, Tao Guo, Christian S. Jensen, B. Ooi
With the proliferation of geo-positioning and geo-tagging techniques, spatio-textual objects that possess both a geographical location and a textual description are gaining in prevalence, and spatial keyword queries that exploit both location and textual description are gaining in prominence. However, the queries studied so far generally focus on finding individual objects that each satisfy a query rather than finding groups of objects where the objects in a group together satisfy a query. We define the problem of retrieving a group of spatio-textual objects such that the group's keywords cover the query's keywords and such that the objects are nearest to the query location and have the smallest inter-object distances. Specifically, we study three instantiations of this problem, all of which are NP-hard. We devise exact solutions as well as approximate solutions with provable approximation bounds to the problems. In addition, we solve the problems of retrieving top-k groups of three instantiations, and study a weighted version of the problem that incorporates object weights. We present empirical studies that offer insight into the efficiency of the solutions, as well as the accuracy of the approximate solutions.
{"title":"Efficient Processing of Spatial Group Keyword Queries","authors":"Xin Cao, G. Cong, Tao Guo, Christian S. Jensen, B. Ooi","doi":"10.1145/2772600","DOIUrl":"https://doi.org/10.1145/2772600","url":null,"abstract":"With the proliferation of geo-positioning and geo-tagging techniques, spatio-textual objects that possess both a geographical location and a textual description are gaining in prevalence, and spatial keyword queries that exploit both location and textual description are gaining in prominence. However, the queries studied so far generally focus on finding individual objects that each satisfy a query rather than finding groups of objects where the objects in a group together satisfy a query.\u0000 We define the problem of retrieving a group of spatio-textual objects such that the group's keywords cover the query's keywords and such that the objects are nearest to the query location and have the smallest inter-object distances. Specifically, we study three instantiations of this problem, all of which are NP-hard. We devise exact solutions as well as approximate solutions with provable approximation bounds to the problems. In addition, we solve the problems of retrieving top-k groups of three instantiations, and study a weighted version of the problem that incorporates object weights. We present empirical studies that offer insight into the efficiency of the solutions, as well as the accuracy of the approximate solutions.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"10 1","pages":"13:1-13:48"},"PeriodicalIF":1.8,"publicationDate":"2015-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82212687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}