Proportionality on Spatial Data with Context
Pub Date: 2023-05-13. DOI: 10.1145/3588434
Georgios J. Fakas, Georgios Kalamatianos
More often than not, spatial objects are associated with some context, in the form of text, descriptive tags (e.g., points of interest, Flickr photos), or linked entities in semantic graphs (e.g., Yago2, DBpedia). Hence, location-based retrieval should be extended to consider not only the locations but also the context of the objects, especially when the retrieved objects are too many and the query result is overwhelming. In this article, we study the problem of selecting the most representative subset of the query result. We argue that objects with similar context and nearby locations should be proportionally represented in the selection. Proportionality dictates the pairwise comparison of all retrieved objects and hence bears a high cost. We propose novel algorithms that greatly reduce the cost of proportional object selection in practice. In addition, we propose pre-processing, pruning, and approximate computation techniques whose combination reduces the computational cost of the algorithms even further. We theoretically analyze the approximation quality of our approaches. Extensive empirical studies on real datasets show that our algorithms are effective and efficient. A user evaluation verifies that proportional selection is preferable to random selection and to selection based on object diversification.
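To make the proportionality notion concrete, here is a minimal sketch, emphatically not the paper's algorithm: assuming the retrieved objects have already been grouped by location and context similarity (the `cluster_of` input is hypothetical), selection slots are apportioned to groups in proportion to their sizes via the largest-remainder method.

```python
# Illustrative sketch of proportional selection (not the paper's algorithm):
# give each cluster of similar objects a number of representatives
# proportional to its share of the full query result.
from collections import defaultdict

def proportional_select(objects, cluster_of, k):
    """objects: list of ids; cluster_of: id -> cluster label; k: selection size."""
    clusters = defaultdict(list)
    for o in objects:
        clusters[cluster_of(o)].append(o)
    n = len(objects)
    # Largest-remainder apportionment of k slots across clusters.
    quotas = {c: k * len(m) / n for c, m in clusters.items()}
    alloc = {c: int(q) for c, q in quotas.items()}
    leftover = k - sum(alloc.values())
    for c in sorted(quotas, key=lambda c: quotas[c] - alloc[c], reverse=True)[:leftover]:
        alloc[c] += 1
    # Take the first alloc[c] members of each cluster; a real system would
    # pick the most central or representative members instead.
    return [o for c, m in clusters.items() for o in m[:alloc[c]]]
```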
{"title":"Proportionality on Spatial Data with Context","authors":"Georgios J. Fakas, Georgios Kalamatianos","doi":"https://dl.acm.org/doi/10.1145/3588434","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3588434","url":null,"abstract":"<p>More often than not, spatial objects are associated with some context, in the form of text, descriptive tags (e.g., points of interest, flickr photos), or linked entities in semantic graphs (e.g., Yago2, DBpedia). Hence, location-based retrieval should be extended to consider not only the locations but also the context of the objects, especially when the retrieved objects are too many and the query result is overwhelming. In this article, we study the problem of selecting a subset of the query result, which is the most representative. We argue that objects with similar context and nearby locations should proportionally be represented in the selection. Proportionality dictates the pairwise comparison of all retrieved objects and hence bears a high cost. We propose novel algorithms which greatly reduce the cost of proportional object selection in practice. In addition, we propose pre-processing, pruning, and approximate computation techniques that their combination reduces the computational cost of the algorithms even further. We theoretically analyze the approximation quality of our approaches. Extensive empirical studies on real datasets show that our algorithms are effective and efficient. A user evaluation verifies that proportional selection is more preferable than random selection and selection based on object diversification.</p>","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"1 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138530937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Bi-objective SQL Optimization for Enclaved Cloud Databases with Differentially Private Padding
Pub Date: 2023-05-11. DOI: 10.1145/3597021
Yaxing Chen, Qinghua Zheng, Zheng Yan
Hardware-enabled enclaves have been applied to efficiently enforce data security and privacy protection in cloud database services. Such enclaved systems, however, are reported to suffer from I/O-size (also referred to as communication-volume) based side-channel attacks. Although differentially private padding has been exploited as a principled method to defend against these attacks, it introduces a challenging bi-objective parametric query optimization (BPQO) problem, and current solutions are still not satisfactory. Concretely, the goal in BPQO is to find a Pareto-optimal plan that trades off query performance against privacy loss; existing solutions suffer from poor computational efficiency and high cloud resource waste. In this article, we propose a two-phase optimization algorithm called TPOA to solve the BPQO problem. TPOA incorporates two novel ideas: divide-and-conquer, which handles parameters separately according to their types so as to reduce the dimensionality of the optimization; and on-demand optimization, which progressively builds a set of necessary Pareto-optimal plans instead of seeking a complete set, thereby saving resources. In addition, we introduce an acceleration mechanism in TPOA that prunes non-optimal candidate plans in advance to improve efficiency. We theoretically prove the correctness of TPOA, numerically analyze its complexity, and formally give an end-to-end privacy analysis. Through a comprehensive evaluation against baseline algorithms over synthetic and test-bed benchmarks, we conclude that TPOA outperforms all benchmarked methods with an overall efficiency improvement of roughly two orders of magnitude; moreover, the acceleration mechanism speeds up TPOA by 10-200×.
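As a hedged illustration of the bi-objective setting (not TPOA itself), the following sketch computes the Pareto-optimal set from hypothetical per-plan scores of query cost and privacy loss: a plan survives only if no other plan is at least as good on both objectives and strictly better on one.

```python
# Minimal sketch of the Pareto frontier over candidate plans, each scored by
# (query cost, privacy loss). The plan names and scores are made up; this
# illustrates the bi-objective notion TPOA optimizes, not TPOA's algorithm.
def pareto_frontier(plans):
    """plans: list of (name, cost, privacy_loss) tuples."""
    frontier = []
    for name, c, p in sorted(plans, key=lambda t: (t[1], t[2])):
        # After sorting by cost, a plan is dominated exactly when some
        # already-kept (cheaper or equal-cost) plan leaks no more privacy.
        if not frontier or p < frontier[-1][2]:
            frontier.append((name, c, p))
    return frontier

print(pareto_frontier([("A", 10, 0.9), ("B", 12, 0.3), ("C", 15, 0.5), ("D", 11, 0.2)]))
# [('A', 10, 0.9), ('D', 11, 0.2)]: B and C are dominated by D
```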
{"title":"Efficient Bi-objective SQL Optimization for Enclaved Cloud Databases with Differentially Private Padding","authors":"Yaxing Chen, Qinghua Zheng, Zheng Yan","doi":"10.1145/3597021","DOIUrl":"https://doi.org/10.1145/3597021","url":null,"abstract":"Hardware-enabled enclaves have been applied to efficiently enforce data security and privacy protection in cloud database services. Such enclaved systems, however, are reported to suffer from I/O-size (also referred to as communication-volume)-based side-channel attacks. Albeit differentially private padding has been exploited to defend against these attacks as a principle method, it introduces a challenging bi-objective parametric query optimization (BPQO) problem and current solutions are still not satisfactory. Concretely, the goal in BPQO is to find a Pareto-optimal plan that makes a tradeoff between query performance and privacy loss; existing solutions are subjected to poor computational efficiency and high cloud resource waste. In this article, we propose a two-phase optimization algorithm called TPOA to solve the BPQO problem. TPOA incorporates two novel ideas: divide-and-conquer to separately handle parameters according to their types in optimization for dimensionality reduction; on-demand-optimization to progressively build a set of necessary Pareto-optimal plans instead of seeking a complete set for saving resources. Besides, we introduce an acceleration mechanism in TPOA to improve its efficiency, which prunes the non-optimal candidate plans in advance. We theoretically prove the correctness of TPOA, numerically analyze its complexity, and formally give an end-to-end privacy analysis. Through a comprehensive evaluation on its efficiency by running baseline algorithms over synthetic and test-bed benchmarks, we can conclude that TPOA outperforms all benchmarked methods with an overall efficiency improvement of roughly two orders of magnitude; moreover, the acceleration mechanism speeds up TPOA by 10-200×.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"48 1","pages":"1 - 40"},"PeriodicalIF":1.8,"publicationDate":"2023-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46250116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reversible Database Watermarking Based on Order-preserving Encryption for Data Sharing
Pub Date: 2023-04-17. DOI: 10.1145/3589761
Donghui Hu, Qing Wang, Song Yan, Xiaojun Liu, Meng Li, Shuli Zheng
In the era of big data, data sharing not only boosts the world economy but also raises problems of privacy disclosure and copyright infringement. Collected data may contain users' sensitive information; thus, privacy protection should be applied to the data before they are shared. Moreover, shared data may be re-shared to third parties without the consent or awareness of the original data providers, so there is an urgent need for copyright tracking. Few works satisfy the requirements of both privacy protection and copyright tracking. The main challenge is how to protect the shared data and realize copyright tracking without undermining the utility of the data. In this article, we propose a novel reversible database watermarking scheme based on order-preserving encryption, called OPEW. First, we encrypt the data using order-preserving encryption and adjust an encryption parameter within an appropriate interval to generate ciphertext with redundant space. Then, we leverage the redundant space to embed a robust reversible watermark. We adopt grouping and K-means clustering to improve the embedding capacity and the robustness of the watermark. Formal theoretical analysis proves that the proposed scheme guarantees correctness and security. Results of extensive experiments show that OPEW preserves 100% data utility, and that its robustness and efficiency are better than those of existing works.
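The following toy sketch illustrates our reading of the core mechanism, not the actual OPEW construction: an order-preserving mapping v → v·K + r leaves redundant low-order space per value, into which a watermark bit can be embedded and later removed without disturbing ciphertext order. The expansion factor K and the residue source are illustrative assumptions.

```python
# Toy sketch (not the OPEW construction): order-preserving encryption with
# redundant space, plus reversible embedding of one watermark bit per value.
K = 1024  # assumed expansion factor: any residue < K keeps ciphertext order

def encrypt(v, r):            # r: keyed pseudorandom residue, 0 <= r < K // 2
    return v * K + 2 * r      # even residue means no watermark bit yet

def embed_bit(c, bit):        # flip residue parity to carry one watermark bit
    return c + bit            # order preserved, since the residue stays < K

def extract_bit(c):
    return (c % K) % 2

def restore(c):               # reversible: drop the bit, recover the ciphertext
    return c - extract_bit(c)

def decrypt(c):
    return c // K

c = embed_bit(encrypt(42, 7), 1)
assert decrypt(c) == 42 and extract_bit(c) == 1 and restore(c) == encrypt(42, 7)
```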
{"title":"Reversible Database Watermarking Based on Order-preserving Encryption for Data Sharing","authors":"Donghui Hu, Qing Wang, Song Yan, Xiaojun Liu, Meng Li, Shuli Zheng","doi":"10.1145/3589761","DOIUrl":"https://doi.org/10.1145/3589761","url":null,"abstract":"In the era of big data, data sharing not only boosts the economy of the world but also brings about problems of privacy disclosure and copyright infringement. The collected data may contain users’ sensitive information; thus, privacy protection should be applied to the data prior to them being shared. Moreover, the shared data may be re-shared to third parties without the consent or awareness of the original data providers. Therefore, there is an urgent need for copyright tracking. There are few works satisfying the requirements of both privacy protection and copyright tracking. The main challenge is how to protect the shared data and realize copyright tracking while not undermining the utility of the data. In this article, we propose a novel solution of a reversible database watermarking scheme based on order-preserving encryption. First, we encrypt the data using order-preserving encryption and adjust an encryption parameter within an appropriate interval to generate a ciphertext with redundant space. Then, we leverage the redundant space to embed robust reversible watermarking. We adopt grouping and K-means to improve the embedding capacity and the robustness of the watermark. Formal theoretical analysis proves that the proposed scheme guarantees correctness and security. Results of extensive experiments show that OPEW has 100% data utility, and the robustness and efficiency of OPEW are better than existing works.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"48 1","pages":"1 - 25"},"PeriodicalIF":1.8,"publicationDate":"2023-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42105460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficiently Cleaning Structured Event Logs: A Graph Repair Approach
Pub Date: 2023-03-13. DOI: 10.1145/3571281
Ruihong Huang, Jianmin Wang, Shaoxu Song, Xuemin Lin, Xiaochen Zhu, Jian Pei
Event data are often dirty owing to various recording conventions or simply system errors. These errors may cause serious damage to real applications, such as inaccurate provenance answers, poor profiling results, or the concealment of interesting patterns in the event data. Cleaning dirty event data is therefore in strong demand. While existing event data cleaning techniques view event logs as sequences, structural information does exist among events, such as the task-passing relationships between staff in a workflow or the invocation relationships among different micro-services in application performance monitoring. We argue that such structural information enhances not only the accuracy of repairing inconsistent events but also the computational efficiency. Notably, both the structure and the names (labeling) of events can be inconsistent. In real applications, while an unsound structure is not repaired automatically (handling a structure error requires manual effort from business actors), it is highly desirable to repair inconsistent event names introduced by recording mistakes. In this article, we first prove that the inconsistent label repairing problem is NP-complete. Then, we propose a graph repair approach for (1) detecting unsound structures and (2) repairing inconsistent event names. Efficient pruning techniques together with two heuristic solutions are also presented. Extensive experiments over real and synthetic datasets demonstrate both the effectiveness and efficiency of our proposal.
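As a hypothetical illustration of one kind of unsound-structure detection (the article's actual repair model is richer), the sketch below flags a trace whose passing/invocation relationships, which should form a directed acyclic graph, contain a cycle; per the abstract, such structural errors are reported for manual handling rather than repaired automatically.

```python
# Hypothetical example of detecting one unsound structure: a cycle in an
# event graph that is supposed to be acyclic. Uses standard three-color DFS.
def has_unsound_structure(events, edges):
    """events: iterable of event ids; edges: dict id -> list of successor ids."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {e: WHITE for e in events}

    def dfs(u):
        color[u] = GRAY
        for v in edges.get(u, []):
            if color[v] == GRAY:          # back edge: the structure is unsound
                return True
            if color[v] == WHITE and dfs(v):
                return True
        color[u] = BLACK
        return False

    return any(color[e] == WHITE and dfs(e) for e in events)

assert has_unsound_structure("abc", {"a": ["b"], "b": ["c"], "c": ["a"]})
assert not has_unsound_structure("abc", {"a": ["b", "c"], "b": ["c"]})
```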
{"title":"Efficiently Cleaning Structured Event Logs: A Graph Repair Approach","authors":"Ruihong Huang, Jianmin Wang, Shaoxu Song, Xuemin Lin, Xiaochen Zhu, Jian Pei","doi":"https://dl.acm.org/doi/10.1145/3571281","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3571281","url":null,"abstract":"<p>Event data are often dirty owing to various recording conventions or simply system errors. These errors may cause serious damage to real applications, such as inaccurate provenance answers, poor profiling results, or concealing interesting patterns from event data. Cleaning dirty event data is strongly demanded. While existing event data cleaning techniques view event logs as sequences, structural information does exist among events, such as the task passing relationships between staffs in workflow or the invocation relationships among different micro-services in monitoring application performance. We argue that such structural information enhances not only the accuracy of repairing inconsistent events but also the computation efficiency. It is notable that both the structure and the names (labeling) of events could be inconsistent. In real applications, while an unsound structure is not repaired automatically (which requires manual effort from business actors to handle the structure error), it is highly desirable to repair the inconsistent event names introduced by recording mistakes. In this article, we first prove that the inconsistent label repairing problem is NP-complete. Then, we propose a graph repair approach for (1) detecting unsound structures, and (2) repairing inconsistent event names. Efficient pruning techniques together with two heuristic solutions are also presented. Extensive experiments over real and synthetic datasets demonstrate both the effectiveness and efficiency of our proposal.</p>","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"26 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138530968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries
Pub Date: 2023-03-13. DOI: 10.1145/3578517
Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, Mirek Riedewald
We study the question of when we can provide direct access to the k-th answer to a Conjunctive Query (CQ) according to a specified order over the answers, in time logarithmic in the size of the database, following a preprocessing step that constructs a data structure in time quasilinear in the database size. Specifically, we embark on the challenge of identifying the tractable answer orderings, that is, those orders that allow for such complexity guarantees. To better understand the computational challenge at hand, we also investigate the more modest task of providing access to only a single answer (i.e., finding the answer at a given position), a task that we refer to as the selection problem, and ask when it can be performed in quasilinear time. We also explore the question of when selection is indeed easier than ranked direct access. We begin with lexicographic orders. For each of the two problems, we give a decidable characterization (under conventional complexity assumptions) of the class of tractable lexicographic orders for every CQ without self-joins. We then continue to the more general orders by the sum of attribute weights and establish the corresponding decidable characterizations, for each of the two problems, of the tractable CQs without self-joins. Finally, we explore the question of when the satisfaction of Functional Dependencies (FDs) can be utilized for tractability and establish the corresponding generalizations of our characterizations for every set of unary FDs.
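To convey the flavor of ranked direct access on the simplest imaginable case, a cross-product query with no joins (well below the generality the paper characterizes), the sketch below returns the k-th answer in lexicographic order via mixed-radix arithmetic, in time linear in the number of atoms and independent of the number of answers.

```python
# Flavor of ranked direct access on a cross-product query: after sorting each
# column list, the k-th answer in lexicographic order is a mixed-radix
# decomposition of k. No answer is ever materialized besides the one returned.
def kth_answer(lists, k):
    """lists: list of sorted lists; k: 0-based rank in lexicographic order."""
    sizes = [len(l) for l in lists]
    answer = []
    for i, l in enumerate(lists):
        block = 1
        for s in sizes[i + 1:]:
            block *= s                # answers sharing a prefix form blocks
        answer.append(l[k // block])
        k %= block
    return tuple(answer)

# The 6 answers of [1,2] x [10,20,30] in order; rank 4 is (2, 20).
assert kth_answer([[1, 2], [10, 20, 30]], 4) == (2, 20)
```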
{"title":"Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries","authors":"Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, Mirek Riedewald","doi":"10.1145/3578517","DOIUrl":"https://doi.org/10.1145/3578517","url":null,"abstract":"We study the question of when we can provide direct access to the k-th answer to a Conjunctive Query (CQ) according to a specified order over the answers in time logarithmic in the size of the database, following a preprocessing step that constructs a data structure in time quasilinear in database size. Specifically, we embark on the challenge of identifying the tractable answer orderings , that is, those orders that allow for such complexity guarantees. To better understand the computational challenge at hand, we also investigate the more modest task of providing access to only a single answer (i.e., finding the answer at a given position), a task that we refer to as the selection problem , and ask when it can be performed in quasilinear time. We also explore the question of when selection is indeed easier than ranked direct access. We begin with lexicographic orders . For each of the two problems, we give a decidable characterization (under conventional complexity assumptions) of the class of tractable lexicographic orders for every CQ without self-joins. We then continue to the more general orders by the sum of attribute weights and establish the corresponding decidable characterizations, for each of the two problems, of the tractable CQs without self-joins. Finally, we explore the question of when the satisfaction of Functional Dependencies (FDs) can be utilized for tractability and establish the corresponding generalizations of our characterizations for every set of unary FDs.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"241 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135905698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust and Efficient Sorting with Offset-value Coding
Pub Date: 2023-03-13. DOI: 10.1145/3570956
Thanh Do, Goetz Graefe
Sorting and searching are large parts of database query processing, e.g., in the forms of index creation, index maintenance, and index lookup, and comparing pairs of keys is a substantial part of the effort in sorting and searching. We have worked on simple, efficient implementations of decades-old, neglected, effective techniques for fast comparisons and fast sorting, in particular offset-value coding. In the process, we happened upon its mutually beneficial relationship with prefix truncation in run files as well as the duality of compression techniques in row- and column-format storage structures, namely prefix truncation and run-length encoding of leading key columns. We also found a beneficial relationship with consumers of sorted streams, e.g., merging parallel streams, in-stream aggregation, and merge join. We report on our implementation in the context of Google’s Napa and F1 Query systems as well as an experimental evaluation of performance and scalability.
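The sketch below shows the encoding step of offset-value coding as commonly described in the literature (our reading, not Google's implementation): each key in a sorted run is coded by the offset of its first difference from its predecessor and the column value at that offset, packed so that, for two keys coded against the same base key, a smaller code implies a smaller key; equal codes fall back to comparing the remaining columns.

```python
# Minimal sketch of offset-value code computation for one sorted run.
# D is an assumed bound on column values; keys are equal-length integer tuples.
D = 1 << 16

def offset_value_codes(run):
    """run: list of equal-length integer tuples, sorted ascending."""
    n = len(run[0])
    codes, prev = [], None
    for key in run:
        o = 0
        if prev is not None:
            while o < n and key[o] == prev[o]:
                o += 1
        v = key[o] if o < n else 0        # o == n only for duplicate keys
        # Larger shared prefix (larger o) packs to a smaller code, so in a
        # merge, comparing codes often replaces a full column-by-column compare.
        codes.append((n - o) * D + v)
        prev = key
    return codes

print(offset_value_codes([(1, 5, 3), (1, 5, 9), (2, 0, 0)]))
# offsets 0, 2, 0 against each predecessor: [196609, 65545, 196610]
```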
{"title":"Robust and Efficient Sorting with Offset-value Coding","authors":"Thanh Do, Goetz Graefe","doi":"https://dl.acm.org/doi/10.1145/3570956","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3570956","url":null,"abstract":"<p>Sorting and searching are large parts of database query processing, e.g., in the forms of index creation, index maintenance, and index lookup, and comparing pairs of keys is a substantial part of the effort in sorting and searching. We have worked on simple, efficient implementations of decades-old, neglected, effective techniques for fast comparisons and fast sorting, in particular offset-value coding. In the process, we happened upon its mutually beneficial relationship with prefix truncation in run files as well as the duality of compression techniques in row- and column-format storage structures, namely prefix truncation and run-length encoding of leading key columns. We also found a beneficial relationship with consumers of sorted streams, e.g., merging parallel streams, in-stream aggregation, and merge join. We report on our implementation in the context of Google’s Napa and F1 Query systems as well as an experimental evaluation of performance and scalability.</p>","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"19 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138530887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Sorting, Duplicate Removal, Grouping, and Aggregation
Pub Date: 2023-01-06. DOI: 10.1145/3568027
Thanh Do, Goetz Graefe, Jeffrey Naughton
Database query processing requires algorithms for duplicate removal, grouping, and aggregation. Three algorithms exist: in-stream aggregation is most efficient by far but requires sorted input; sort-based aggregation relies on external merge sort; and hash aggregation relies on an in-memory hash table plus hash partitioning to temporary storage. Cost-based query optimization chooses which algorithm to use based on several factors, including the sort order of the input, input and output sizes, and the need for sorted output. For example, hash-based aggregation is ideal for output smaller than the available memory (e.g., Query 1 of TPC-H), whereas sorting the entire input and aggregating after sorting are preferable when both aggregation input and output are large and the output needs to be sorted for a subsequent operation such as a merge join.
Unfortunately, the size information required for a sound choice is often inaccurate or unavailable during query optimization, leading to sub-optimal algorithm choices. In response, this article introduces a new algorithm for sort-based duplicate removal, grouping, and aggregation. The new algorithm always performs at least as well as both traditional hash-based and traditional sort-based algorithms. It can serve as a system’s only aggregation algorithm for unsorted inputs, thus preventing erroneous algorithm choices. Furthermore, the new algorithm produces sorted output that can speed up subsequent operations. Google’s F1 Query uses the new algorithm in production workloads that aggregate petabytes of data every day.
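The in-stream aggregation baseline that the abstract calls most efficient by far is simple to sketch: with input sorted on the grouping key, a group is complete the moment the key changes, so constant state suffices and the output comes out sorted. A minimal sketch (SUM chosen as an arbitrary example aggregate):

```python
# In-stream aggregation over input sorted on the grouping key: O(1) state,
# one pass, and the output is emitted in sorted order for downstream operators.
def in_stream_sum(sorted_rows):
    """sorted_rows: iterable of (group_key, value), sorted by group_key."""
    current, total = None, 0
    for key, value in sorted_rows:
        if key != current:
            if current is not None:
                yield current, total      # previous group is complete
            current, total = key, 0
        total += value
    if current is not None:
        yield current, total

print(list(in_stream_sum([("a", 1), ("a", 2), ("b", 5)])))  # [('a', 3), ('b', 5)]
```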
{"title":"Efficient Sorting, Duplicate Removal, Grouping, and Aggregation","authors":"Thanh Do, Goetz Graefe, Jeffrey Naughton","doi":"https://dl.acm.org/doi/10.1145/3568027","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3568027","url":null,"abstract":"<p>Database query processing requires algorithms for duplicate removal, grouping, and aggregation. Three algorithms exist: in-stream aggregation is most efficient by far but requires sorted input; sort-based aggregation relies on external merge sort; and hash aggregation relies on an in-memory hash table plus hash partitioning to temporary storage. Cost-based query optimization chooses which algorithm to use based on several factors, including the sort order of the input, input and output sizes, and the need for sorted output. For example, hash-based aggregation is ideal for output smaller than the available memory (e.g., Query 1 of TPC-H), whereas sorting the entire input and aggregating after sorting are preferable when both aggregation input and output are large and the output needs to be sorted for a subsequent operation such as a merge join.</p><p>Unfortunately, the size information required for a sound choice is often inaccurate or unavailable during query optimization, leading to sub-optimal algorithm choices. In response, this article introduces a new algorithm for sort-based duplicate removal, grouping, and aggregation. The new algorithm always performs at least as well as both traditional hash-based and traditional sort-based algorithms. It can serve as a system’s only aggregation algorithm for unsorted inputs, thus preventing erroneous algorithm choices. Furthermore, the new algorithm produces sorted output that can speed up subsequent operations. Google’s F1 Query uses the new algorithm in production workloads that aggregate petabytes of data every day.</p>","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"32 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138530909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Proximity Queries on Terrain Surface
Pub Date: 2022-12-16. DOI: 10.1145/3563773
Victor Junqiu Wei, Raymond Chi-Wing Wong, Cheng Long, David Mount, Hanan Samet
Owing to advances in geo-spatial positioning and computer graphics technology, digital terrain data have become increasingly popular. Query processing on terrain data has attracted considerable attention from both the academic and industry communities.
Proximity queries such as the shortest path/distance query, the k nearest/farthest neighbor query, and the top-k closest/farthest pairs query are fundamental and important queries in the context of terrain surfaces, with many applications in Geographical Information Systems, 3D object feature vector construction, and 3D object data mining. In this article, we first study the most fundamental type of query, namely, the shortest distance and path query, which finds the shortest distance and path between two points of interest on the surface of the terrain. As observed by existing studies, computing the exact shortest distance/path is very expensive. Some existing studies proposed ϵ-approximate distance and path oracles, where ϵ is a non-negative real-valued error parameter. However, the best-known algorithm has a large oracle construction time, a large oracle size, and a large query time. Motivated by this, we propose a novel ϵ-approximate distance and path oracle called the Space-Efficient distance and path oracle (SE), which has a small oracle construction time, a small oracle size, and a small distance and path query time, thanks to its compact storage of concise information about the pairwise distances between any two points of interest. Then, we propose several algorithms for the k nearest/farthest neighbor and top-k closest/farthest pairs queries with the assistance of our distance and path oracle SE.
Our experimental results show that the oracle construction time, the oracle size, and the distance and path query time of SE are up to two, three, and five orders of magnitude smaller, respectively, than those of the best-known algorithm. Moreover, our algorithms for the other proximity queries, including k nearest/farthest neighbor queries and top-k closest/farthest pairs queries, outperform the state-of-the-art algorithms by up to two orders of magnitude.
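As a hypothetical usage sketch (the oracle interface here is assumed, not SE's actual API), a distance oracle with near-constant-time dist(a, b) reduces a k-nearest-neighbor query over the points of interest to a simple heap scan:

```python
# Hypothetical kNN on top of a precomputed distance oracle. DictOracle is a
# stand-in for a real oracle; only the dist(a, b) interface matters here.
import heapq

class DictOracle:
    """Stand-in oracle backed by a table of approximate surface distances."""
    def __init__(self, table):
        self.table = table                  # (a, b) -> approximate distance
    def dist(self, a, b):
        return self.table[(a, b)] if (a, b) in self.table else self.table[(b, a)]

def knn(oracle, pois, q, k):
    """k nearest neighbors of q among the points of interest, by oracle distance."""
    return heapq.nsmallest(k, pois, key=lambda p: oracle.dist(q, p))

oracle = DictOracle({("q", "a"): 3.0, ("q", "b"): 1.5, ("q", "c"): 2.2})
print(knn(oracle, ["a", "b", "c"], "q", 2))  # ['b', 'c']
```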