Proportionality on Spatial Data with Context
Pub Date: 2023-05-13. DOI: 10.1145/3588434
Georgios J. Fakas, Georgios Kalamatianos
More often than not, spatial objects are associated with some context, in the form of text, descriptive tags (e.g., points of interest, Flickr photos), or linked entities in semantic graphs (e.g., Yago2, DBpedia). Hence, location-based retrieval should be extended to consider not only the locations but also the context of the objects, especially when the retrieved objects are too many and the query result is overwhelming. In this article, we study the problem of selecting the most representative subset of the query result. We argue that objects with similar context and nearby locations should be proportionally represented in the selection. Proportionality dictates the pairwise comparison of all retrieved objects and hence bears a high cost. We propose novel algorithms that greatly reduce the cost of proportional object selection in practice. In addition, we propose pre-processing, pruning, and approximate computation techniques whose combination reduces the computational cost of the algorithms even further. We theoretically analyze the approximation quality of our approaches. Extensive empirical studies on real datasets show that our algorithms are effective and efficient. A user evaluation verifies that proportional selection is preferable to random selection and to selection based on object diversification.
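To make the proportionality notion concrete, here is a minimal sketch, emphatically not the paper's algorithm: assuming the retrieved objects have already been grouped by location and context similarity (the `cluster_of` input is hypothetical), selection slots are apportioned to groups in proportion to their sizes via the largest-remainder method.

```python
# Illustrative sketch of proportional selection (not the paper's algorithm):
# give each cluster of similar objects a number of representatives
# proportional to its share of the full query result.
from collections import defaultdict

def proportional_select(objects, cluster_of, k):
    """objects: list of ids; cluster_of: id -> cluster label; k: selection size."""
    clusters = defaultdict(list)
    for o in objects:
        clusters[cluster_of(o)].append(o)
    n = len(objects)
    # Largest-remainder apportionment of k slots across clusters.
    quotas = {c: k * len(m) / n for c, m in clusters.items()}
    alloc = {c: int(q) for c, q in quotas.items()}
    leftover = k - sum(alloc.values())
    for c in sorted(quotas, key=lambda c: quotas[c] - alloc[c], reverse=True)[:leftover]:
        alloc[c] += 1
    # Take the first alloc[c] members of each cluster; a real system would
    # pick the most central or representative members instead.
    return [o for c, m in clusters.items() for o in m[:alloc[c]]]
```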
{"title":"Proportionality on Spatial Data with Context","authors":"Georgios J. Fakas, Georgios Kalamatianos","doi":"https://dl.acm.org/doi/10.1145/3588434","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3588434","url":null,"abstract":"<p>More often than not, spatial objects are associated with some context, in the form of text, descriptive tags (e.g., points of interest, flickr photos), or linked entities in semantic graphs (e.g., Yago2, DBpedia). Hence, location-based retrieval should be extended to consider not only the locations but also the context of the objects, especially when the retrieved objects are too many and the query result is overwhelming. In this article, we study the problem of selecting a subset of the query result, which is the most representative. We argue that objects with similar context and nearby locations should proportionally be represented in the selection. Proportionality dictates the pairwise comparison of all retrieved objects and hence bears a high cost. We propose novel algorithms which greatly reduce the cost of proportional object selection in practice. In addition, we propose pre-processing, pruning, and approximate computation techniques that their combination reduces the computational cost of the algorithms even further. We theoretically analyze the approximation quality of our approaches. Extensive empirical studies on real datasets show that our algorithms are effective and efficient. A user evaluation verifies that proportional selection is more preferable than random selection and selection based on object diversification.</p>","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"1 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138530937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Bi-objective SQL Optimization for Enclaved Cloud Databases with Differentially Private Padding
Pub Date: 2023-05-11. DOI: 10.1145/3597021
Yaxing Chen, Qinghua Zheng, Zheng Yan
Hardware-enabled enclaves have been applied to efficiently enforce data security and privacy protection in cloud database services. Such enclaved systems, however, are reported to suffer from I/O-size (also referred to as communication-volume) based side-channel attacks. Although differentially private padding has been exploited as a principled method to defend against these attacks, it introduces a challenging bi-objective parametric query optimization (BPQO) problem, and current solutions are still not satisfactory. Concretely, the goal in BPQO is to find a Pareto-optimal plan that trades off query performance against privacy loss; existing solutions suffer from poor computational efficiency and high cloud resource waste. In this article, we propose a two-phase optimization algorithm called TPOA to solve the BPQO problem. TPOA incorporates two novel ideas: divide-and-conquer, which handles parameters separately according to their types so as to reduce the dimensionality of the optimization; and on-demand optimization, which progressively builds a set of necessary Pareto-optimal plans instead of seeking a complete set, thereby saving resources. In addition, we introduce an acceleration mechanism in TPOA that prunes non-optimal candidate plans in advance to improve efficiency. We theoretically prove the correctness of TPOA, numerically analyze its complexity, and formally give an end-to-end privacy analysis. Through a comprehensive evaluation against baseline algorithms over synthetic and test-bed benchmarks, we conclude that TPOA outperforms all benchmarked methods with an overall efficiency improvement of roughly two orders of magnitude; moreover, the acceleration mechanism speeds up TPOA by 10-200×.
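As a hedged illustration of the bi-objective setting (not TPOA itself), the following sketch computes the Pareto-optimal set from hypothetical per-plan scores of query cost and privacy loss: a plan survives only if no other plan is at least as good on both objectives and strictly better on one.

```python
# Minimal sketch of the Pareto frontier over candidate plans, each scored by
# (query cost, privacy loss). The plan names and scores are made up; this
# illustrates the bi-objective notion TPOA optimizes, not TPOA's algorithm.
def pareto_frontier(plans):
    """plans: list of (name, cost, privacy_loss) tuples."""
    frontier = []
    for name, c, p in sorted(plans, key=lambda t: (t[1], t[2])):
        # After sorting by cost, a plan is dominated exactly when some
        # already-kept (cheaper or equal-cost) plan leaks no more privacy.
        if not frontier or p < frontier[-1][2]:
            frontier.append((name, c, p))
    return frontier

print(pareto_frontier([("A", 10, 0.9), ("B", 12, 0.3), ("C", 15, 0.5), ("D", 11, 0.2)]))
# [('A', 10, 0.9), ('D', 11, 0.2)]: B and C are dominated by D
```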
{"title":"Efficient Bi-objective SQL Optimization for Enclaved Cloud Databases with Differentially Private Padding","authors":"Yaxing Chen, Qinghua Zheng, Zheng Yan","doi":"10.1145/3597021","DOIUrl":"https://doi.org/10.1145/3597021","url":null,"abstract":"Hardware-enabled enclaves have been applied to efficiently enforce data security and privacy protection in cloud database services. Such enclaved systems, however, are reported to suffer from I/O-size (also referred to as communication-volume)-based side-channel attacks. Albeit differentially private padding has been exploited to defend against these attacks as a principle method, it introduces a challenging bi-objective parametric query optimization (BPQO) problem and current solutions are still not satisfactory. Concretely, the goal in BPQO is to find a Pareto-optimal plan that makes a tradeoff between query performance and privacy loss; existing solutions are subjected to poor computational efficiency and high cloud resource waste. In this article, we propose a two-phase optimization algorithm called TPOA to solve the BPQO problem. TPOA incorporates two novel ideas: divide-and-conquer to separately handle parameters according to their types in optimization for dimensionality reduction; on-demand-optimization to progressively build a set of necessary Pareto-optimal plans instead of seeking a complete set for saving resources. Besides, we introduce an acceleration mechanism in TPOA to improve its efficiency, which prunes the non-optimal candidate plans in advance. We theoretically prove the correctness of TPOA, numerically analyze its complexity, and formally give an end-to-end privacy analysis. Through a comprehensive evaluation on its efficiency by running baseline algorithms over synthetic and test-bed benchmarks, we can conclude that TPOA outperforms all benchmarked methods with an overall efficiency improvement of roughly two orders of magnitude; moreover, the acceleration mechanism speeds up TPOA by 10-200×.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"48 1","pages":"1 - 40"},"PeriodicalIF":1.8,"publicationDate":"2023-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46250116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reversible Database Watermarking Based on Order-preserving Encryption for Data Sharing
Pub Date: 2023-04-17. DOI: 10.1145/3589761
Donghui Hu, Qing Wang, Song Yan, Xiaojun Liu, Meng Li, Shuli Zheng
In the era of big data, data sharing not only boosts the world economy but also raises problems of privacy disclosure and copyright infringement. Collected data may contain users' sensitive information; thus, privacy protection should be applied to the data before they are shared. Moreover, shared data may be re-shared to third parties without the consent or awareness of the original data providers, so there is an urgent need for copyright tracking. Few works satisfy the requirements of both privacy protection and copyright tracking. The main challenge is how to protect the shared data and realize copyright tracking without undermining the utility of the data. In this article, we propose a novel reversible database watermarking scheme based on order-preserving encryption, called OPEW. First, we encrypt the data using order-preserving encryption and adjust an encryption parameter within an appropriate interval to generate ciphertext with redundant space. Then, we leverage the redundant space to embed a robust reversible watermark. We adopt grouping and K-means clustering to improve the embedding capacity and the robustness of the watermark. Formal theoretical analysis proves that the proposed scheme guarantees correctness and security. Results of extensive experiments show that OPEW preserves 100% data utility, and that its robustness and efficiency are better than those of existing works.
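The following toy sketch illustrates our reading of the core mechanism, not the actual OPEW construction: an order-preserving mapping v → v·K + r leaves redundant low-order space per value, into which a watermark bit can be embedded and later removed without disturbing ciphertext order. The expansion factor K and the residue source are illustrative assumptions.

```python
# Toy sketch (not the OPEW construction): order-preserving encryption with
# redundant space, plus reversible embedding of one watermark bit per value.
K = 1024  # assumed expansion factor: any residue < K keeps ciphertext order

def encrypt(v, r):            # r: keyed pseudorandom residue, 0 <= r < K // 2
    return v * K + 2 * r      # even residue means no watermark bit yet

def embed_bit(c, bit):        # flip residue parity to carry one watermark bit
    return c + bit            # order preserved, since the residue stays < K

def extract_bit(c):
    return (c % K) % 2

def restore(c):               # reversible: drop the bit, recover the ciphertext
    return c - extract_bit(c)

def decrypt(c):
    return c // K

c = embed_bit(encrypt(42, 7), 1)
assert decrypt(c) == 42 and extract_bit(c) == 1 and restore(c) == encrypt(42, 7)
```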
{"title":"Reversible Database Watermarking Based on Order-preserving Encryption for Data Sharing","authors":"Donghui Hu, Qing Wang, Song Yan, Xiaojun Liu, Meng Li, Shuli Zheng","doi":"10.1145/3589761","DOIUrl":"https://doi.org/10.1145/3589761","url":null,"abstract":"In the era of big data, data sharing not only boosts the economy of the world but also brings about problems of privacy disclosure and copyright infringement. The collected data may contain users’ sensitive information; thus, privacy protection should be applied to the data prior to them being shared. Moreover, the shared data may be re-shared to third parties without the consent or awareness of the original data providers. Therefore, there is an urgent need for copyright tracking. There are few works satisfying the requirements of both privacy protection and copyright tracking. The main challenge is how to protect the shared data and realize copyright tracking while not undermining the utility of the data. In this article, we propose a novel solution of a reversible database watermarking scheme based on order-preserving encryption. First, we encrypt the data using order-preserving encryption and adjust an encryption parameter within an appropriate interval to generate a ciphertext with redundant space. Then, we leverage the redundant space to embed robust reversible watermarking. We adopt grouping and K-means to improve the embedding capacity and the robustness of the watermark. Formal theoretical analysis proves that the proposed scheme guarantees correctness and security. Results of extensive experiments show that OPEW has 100% data utility, and the robustness and efficiency of OPEW are better than existing works.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"48 1","pages":"1 - 25"},"PeriodicalIF":1.8,"publicationDate":"2023-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42105460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficiently Cleaning Structured Event Logs: A Graph Repair Approach
Pub Date: 2023-03-13. DOI: 10.1145/3571281
Ruihong Huang, Jianmin Wang, Shaoxu Song, Xuemin Lin, Xiaochen Zhu, Jian Pei
Event data are often dirty owing to various recording conventions or simply system errors. These errors may cause serious damage to real applications, such as inaccurate provenance answers, poor profiling results, or the concealment of interesting patterns in the event data. Cleaning dirty event data is therefore in strong demand. While existing event data cleaning techniques view event logs as sequences, structural information does exist among events, such as the task-passing relationships between staff in a workflow or the invocation relationships among different micro-services in application performance monitoring. We argue that such structural information enhances not only the accuracy of repairing inconsistent events but also the computational efficiency. Notably, both the structure and the names (labeling) of events can be inconsistent. In real applications, while an unsound structure is not repaired automatically (handling a structure error requires manual effort from business actors), it is highly desirable to repair inconsistent event names introduced by recording mistakes. In this article, we first prove that the inconsistent label repairing problem is NP-complete. Then, we propose a graph repair approach for (1) detecting unsound structures and (2) repairing inconsistent event names. Efficient pruning techniques together with two heuristic solutions are also presented. Extensive experiments over real and synthetic datasets demonstrate both the effectiveness and efficiency of our proposal.
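As a hypothetical illustration of one kind of unsound-structure detection (the article's actual repair model is richer), the sketch below flags a trace whose passing/invocation relationships, which should form a directed acyclic graph, contain a cycle; per the abstract, such structural errors are reported for manual handling rather than repaired automatically.

```python
# Hypothetical example of detecting one unsound structure: a cycle in an
# event graph that is supposed to be acyclic. Uses standard three-color DFS.
def has_unsound_structure(events, edges):
    """events: iterable of event ids; edges: dict id -> list of successor ids."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {e: WHITE for e in events}

    def dfs(u):
        color[u] = GRAY
        for v in edges.get(u, []):
            if color[v] == GRAY:          # back edge: the structure is unsound
                return True
            if color[v] == WHITE and dfs(v):
                return True
        color[u] = BLACK
        return False

    return any(color[e] == WHITE and dfs(e) for e in events)

assert has_unsound_structure("abc", {"a": ["b"], "b": ["c"], "c": ["a"]})
assert not has_unsound_structure("abc", {"a": ["b", "c"], "b": ["c"]})
```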
{"title":"Efficiently Cleaning Structured Event Logs: A Graph Repair Approach","authors":"Ruihong Huang, Jianmin Wang, Shaoxu Song, Xuemin Lin, Xiaochen Zhu, Jian Pei","doi":"https://dl.acm.org/doi/10.1145/3571281","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3571281","url":null,"abstract":"<p>Event data are often dirty owing to various recording conventions or simply system errors. These errors may cause serious damage to real applications, such as inaccurate provenance answers, poor profiling results, or concealing interesting patterns from event data. Cleaning dirty event data is strongly demanded. While existing event data cleaning techniques view event logs as sequences, structural information does exist among events, such as the task passing relationships between staffs in workflow or the invocation relationships among different micro-services in monitoring application performance. We argue that such structural information enhances not only the accuracy of repairing inconsistent events but also the computation efficiency. It is notable that both the structure and the names (labeling) of events could be inconsistent. In real applications, while an unsound structure is not repaired automatically (which requires manual effort from business actors to handle the structure error), it is highly desirable to repair the inconsistent event names introduced by recording mistakes. In this article, we first prove that the inconsistent label repairing problem is NP-complete. Then, we propose a graph repair approach for (1) detecting unsound structures, and (2) repairing inconsistent event names. Efficient pruning techniques together with two heuristic solutions are also presented. Extensive experiments over real and synthetic datasets demonstrate both the effectiveness and efficiency of our proposal.</p>","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"26 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138530968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries
Pub Date: 2023-03-13. DOI: 10.1145/3578517
Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, Mirek Riedewald
We study the question of when we can provide direct access to the k-th answer to a Conjunctive Query (CQ) according to a specified order over the answers, in time logarithmic in the size of the database, following a preprocessing step that constructs a data structure in time quasilinear in the database size. Specifically, we embark on the challenge of identifying the tractable answer orderings, that is, those orders that allow for such complexity guarantees. To better understand the computational challenge at hand, we also investigate the more modest task of providing access to only a single answer (i.e., finding the answer at a given position), a task that we refer to as the selection problem, and ask when it can be performed in quasilinear time. We also explore the question of when selection is indeed easier than ranked direct access. We begin with lexicographic orders. For each of the two problems, we give a decidable characterization (under conventional complexity assumptions) of the class of tractable lexicographic orders for every CQ without self-joins. We then continue to the more general orders by the sum of attribute weights and establish the corresponding decidable characterizations, for each of the two problems, of the tractable CQs without self-joins. Finally, we explore the question of when the satisfaction of Functional Dependencies (FDs) can be utilized for tractability and establish the corresponding generalizations of our characterizations for every set of unary FDs.
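To convey the flavor of ranked direct access on the simplest imaginable case, a cross-product query with no joins (well below the generality the paper characterizes), the sketch below returns the k-th answer in lexicographic order via mixed-radix arithmetic, in time linear in the number of atoms and independent of the number of answers.

```python
# Flavor of ranked direct access on a cross-product query: after sorting each
# column list, the k-th answer in lexicographic order is a mixed-radix
# decomposition of k. No answer is ever materialized besides the one returned.
def kth_answer(lists, k):
    """lists: list of sorted lists; k: 0-based rank in lexicographic order."""
    sizes = [len(l) for l in lists]
    answer = []
    for i, l in enumerate(lists):
        block = 1
        for s in sizes[i + 1:]:
            block *= s                # answers sharing a prefix form blocks
        answer.append(l[k // block])
        k %= block
    return tuple(answer)

# The 6 answers of [1,2] x [10,20,30] in order; rank 4 is (2, 20).
assert kth_answer([[1, 2], [10, 20, 30]], 4) == (2, 20)
```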
{"title":"Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries","authors":"Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, Mirek Riedewald","doi":"10.1145/3578517","DOIUrl":"https://doi.org/10.1145/3578517","url":null,"abstract":"We study the question of when we can provide direct access to the k-th answer to a Conjunctive Query (CQ) according to a specified order over the answers in time logarithmic in the size of the database, following a preprocessing step that constructs a data structure in time quasilinear in database size. Specifically, we embark on the challenge of identifying the tractable answer orderings , that is, those orders that allow for such complexity guarantees. To better understand the computational challenge at hand, we also investigate the more modest task of providing access to only a single answer (i.e., finding the answer at a given position), a task that we refer to as the selection problem , and ask when it can be performed in quasilinear time. We also explore the question of when selection is indeed easier than ranked direct access. We begin with lexicographic orders . For each of the two problems, we give a decidable characterization (under conventional complexity assumptions) of the class of tractable lexicographic orders for every CQ without self-joins. We then continue to the more general orders by the sum of attribute weights and establish the corresponding decidable characterizations, for each of the two problems, of the tractable CQs without self-joins. Finally, we explore the question of when the satisfaction of Functional Dependencies (FDs) can be utilized for tractability and establish the corresponding generalizations of our characterizations for every set of unary FDs.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"241 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135905698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust and Efficient Sorting with Offset-value Coding
Pub Date: 2023-03-13. DOI: 10.1145/3570956
Thanh Do, Goetz Graefe
Sorting and searching are large parts of database query processing, e.g., in the forms of index creation, index maintenance, and index lookup, and comparing pairs of keys is a substantial part of the effort in sorting and searching. We have worked on simple, efficient implementations of decades-old, neglected, effective techniques for fast comparisons and fast sorting, in particular offset-value coding. In the process, we happened upon its mutually beneficial relationship with prefix truncation in run files as well as the duality of compression techniques in row- and column-format storage structures, namely prefix truncation and run-length encoding of leading key columns. We also found a beneficial relationship with consumers of sorted streams, e.g., merging parallel streams, in-stream aggregation, and merge join. We report on our implementation in the context of Google’s Napa and F1 Query systems as well as an experimental evaluation of performance and scalability.
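The sketch below shows the encoding step of offset-value coding as commonly described in the literature (our reading, not Google's implementation): each key in a sorted run is coded by the offset of its first difference from its predecessor and the column value at that offset, packed so that, for two keys coded against the same base key, a smaller code implies a smaller key; equal codes fall back to comparing the remaining columns.

```python
# Minimal sketch of offset-value code computation for one sorted run.
# D is an assumed bound on column values; keys are equal-length integer tuples.
D = 1 << 16

def offset_value_codes(run):
    """run: list of equal-length integer tuples, sorted ascending."""
    n = len(run[0])
    codes, prev = [], None
    for key in run:
        o = 0
        if prev is not None:
            while o < n and key[o] == prev[o]:
                o += 1
        v = key[o] if o < n else 0        # o == n only for duplicate keys
        # Larger shared prefix (larger o) packs to a smaller code, so in a
        # merge, comparing codes often replaces a full column-by-column compare.
        codes.append((n - o) * D + v)
        prev = key
    return codes

print(offset_value_codes([(1, 5, 3), (1, 5, 9), (2, 0, 0)]))
# offsets 0, 2, 0 against each predecessor: [196609, 65545, 196610]
```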
{"title":"Robust and Efficient Sorting with Offset-value Coding","authors":"Thanh Do, Goetz Graefe","doi":"https://dl.acm.org/doi/10.1145/3570956","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3570956","url":null,"abstract":"<p>Sorting and searching are large parts of database query processing, e.g., in the forms of index creation, index maintenance, and index lookup, and comparing pairs of keys is a substantial part of the effort in sorting and searching. We have worked on simple, efficient implementations of decades-old, neglected, effective techniques for fast comparisons and fast sorting, in particular offset-value coding. In the process, we happened upon its mutually beneficial relationship with prefix truncation in run files as well as the duality of compression techniques in row- and column-format storage structures, namely prefix truncation and run-length encoding of leading key columns. We also found a beneficial relationship with consumers of sorted streams, e.g., merging parallel streams, in-stream aggregation, and merge join. We report on our implementation in the context of Google’s Napa and F1 Query systems as well as an experimental evaluation of performance and scalability.</p>","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"19 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138530887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Sorting, Duplicate Removal, Grouping, and Aggregation
Pub Date: 2023-01-06. DOI: 10.1145/3568027
Thanh Do, Goetz Graefe, Jeffrey Naughton
Database query processing requires algorithms for duplicate removal, grouping, and aggregation. Three algorithms exist: in-stream aggregation is most efficient by far but requires sorted input; sort-based aggregation relies on external merge sort; and hash aggregation relies on an in-memory hash table plus hash partitioning to temporary storage. Cost-based query optimization chooses which algorithm to use based on several factors, including the sort order of the input, input and output sizes, and the need for sorted output. For example, hash-based aggregation is ideal for output smaller than the available memory (e.g., Query 1 of TPC-H), whereas sorting the entire input and aggregating after sorting are preferable when both aggregation input and output are large and the output needs to be sorted for a subsequent operation such as a merge join.
Unfortunately, the size information required for a sound choice is often inaccurate or unavailable during query optimization, leading to sub-optimal algorithm choices. In response, this article introduces a new algorithm for sort-based duplicate removal, grouping, and aggregation. The new algorithm always performs at least as well as both traditional hash-based and traditional sort-based algorithms. It can serve as a system’s only aggregation algorithm for unsorted inputs, thus preventing erroneous algorithm choices. Furthermore, the new algorithm produces sorted output that can speed up subsequent operations. Google’s F1 Query uses the new algorithm in production workloads that aggregate petabytes of data every day.
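The in-stream aggregation baseline that the abstract calls most efficient by far is simple to sketch: with input sorted on the grouping key, a group is complete the moment the key changes, so constant state suffices and the output comes out sorted. A minimal sketch (SUM chosen as an arbitrary example aggregate):

```python
# In-stream aggregation over input sorted on the grouping key: O(1) state,
# one pass, and the output is emitted in sorted order for downstream operators.
def in_stream_sum(sorted_rows):
    """sorted_rows: iterable of (group_key, value), sorted by group_key."""
    current, total = None, 0
    for key, value in sorted_rows:
        if key != current:
            if current is not None:
                yield current, total      # previous group is complete
            current, total = key, 0
        total += value
    if current is not None:
        yield current, total

print(list(in_stream_sum([("a", 1), ("a", 2), ("b", 5)])))  # [('a', 3), ('b', 5)]
```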
{"title":"Efficient Sorting, Duplicate Removal, Grouping, and Aggregation","authors":"Thanh Do, Goetz Graefe, Jeffrey Naughton","doi":"https://dl.acm.org/doi/10.1145/3568027","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3568027","url":null,"abstract":"<p>Database query processing requires algorithms for duplicate removal, grouping, and aggregation. Three algorithms exist: in-stream aggregation is most efficient by far but requires sorted input; sort-based aggregation relies on external merge sort; and hash aggregation relies on an in-memory hash table plus hash partitioning to temporary storage. Cost-based query optimization chooses which algorithm to use based on several factors, including the sort order of the input, input and output sizes, and the need for sorted output. For example, hash-based aggregation is ideal for output smaller than the available memory (e.g., Query 1 of TPC-H), whereas sorting the entire input and aggregating after sorting are preferable when both aggregation input and output are large and the output needs to be sorted for a subsequent operation such as a merge join.</p><p>Unfortunately, the size information required for a sound choice is often inaccurate or unavailable during query optimization, leading to sub-optimal algorithm choices. In response, this article introduces a new algorithm for sort-based duplicate removal, grouping, and aggregation. The new algorithm always performs at least as well as both traditional hash-based and traditional sort-based algorithms. It can serve as a system’s only aggregation algorithm for unsorted inputs, thus preventing erroneous algorithm choices. Furthermore, the new algorithm produces sorted output that can speed up subsequent operations. Google’s F1 Query uses the new algorithm in production workloads that aggregate petabytes of data every day.</p>","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"32 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138530909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Proximity Queries on Terrain Surface
Pub Date: 2022-12-16. DOI: 10.1145/3563773
Victor Junqiu Wei, Raymond Chi-Wing Wong, Cheng Long, David Mount, Hanan Samet
Owing to advances in geo-spatial positioning and computer graphics technology, digital terrain data have become increasingly popular. Query processing on terrain data has attracted considerable attention from both the academic and industry communities.
Proximity queries such as the shortest path/distance query, the k nearest/farthest neighbor query, and the top-k closest/farthest pairs query are fundamental and important queries in the context of terrain surfaces, with many applications in Geographical Information Systems, 3D object feature vector construction, and 3D object data mining. In this article, we first study the most fundamental type of query, namely, the shortest distance and path query, which finds the shortest distance and path between two points of interest on the surface of the terrain. As observed by existing studies, computing the exact shortest distance/path is very expensive. Some existing studies proposed ϵ-approximate distance and path oracles, where ϵ is a non-negative real-valued error parameter. However, the best-known algorithm has a large oracle construction time, a large oracle size, and a large query time. Motivated by this, we propose a novel ϵ-approximate distance and path oracle called the Space-Efficient distance and path oracle (SE), which has a small oracle construction time, a small oracle size, and a small distance and path query time, thanks to its compact storage of concise information about the pairwise distances between any two points of interest. Then, we propose several algorithms for the k nearest/farthest neighbor and top-k closest/farthest pairs queries with the assistance of our distance and path oracle SE.
Our experimental results show that the oracle construction time, the oracle size, and the distance and path query time of SE are up to two, three, and five orders of magnitude smaller, respectively, than those of the best-known algorithm. Moreover, our algorithms for the other proximity queries, including k nearest/farthest neighbor queries and top-k closest/farthest pairs queries, outperform the state-of-the-art algorithms by up to two orders of magnitude.
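As a hypothetical usage sketch (the oracle interface here is assumed, not SE's actual API), a distance oracle with near-constant-time dist(a, b) reduces a k-nearest-neighbor query over the points of interest to a simple heap scan:

```python
# Hypothetical kNN on top of a precomputed distance oracle. DictOracle is a
# stand-in for a real oracle; only the dist(a, b) interface matters here.
import heapq

class DictOracle:
    """Stand-in oracle backed by a table of approximate surface distances."""
    def __init__(self, table):
        self.table = table                  # (a, b) -> approximate distance
    def dist(self, a, b):
        return self.table[(a, b)] if (a, b) in self.table else self.table[(b, a)]

def knn(oracle, pois, q, k):
    """k nearest neighbors of q among the points of interest, by oracle distance."""
    return heapq.nsmallest(k, pois, key=lambda p: oracle.dist(q, p))

oracle = DictOracle({("q", "a"): 3.0, ("q", "b"): 1.5, ("q", "c"): 2.2})
print(knn(oracle, ["a", "b", "c"], "q", 2))  # ['b', 'c']
```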